Awesome Visual Representation Learning with Transformers

Awesome Transformers (self-attention) in Computer Vision

About transformers

Attention Is All You Need, NeurIPS 2017
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
- [paper] [official code] [pytorch implementation]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL 2019
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
- [paper] [official code] [huggingface/transformers]
Efficient Transformers: A Survey, arXiv 2020
- Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler
- [paper]
A Survey on Visual Transformer, arXiv 2020
- Kai Han, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu, Zhaohui Yang, Yiman Zhang, Dacheng Tao
- [paper]
Transformers in Vision: A Survey, arXiv 2021
- Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, Mubarak Shah
- [paper]

Combining CNN with self-attention

Attention Augmented Convolutional Networks, ICCV 2019, image classification
- Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, Quoc V. Le
- [paper] [pytorch implementation]
Self-Attention Generative Adversarial Networks, ICML 2019, generative model (GANs)
- Han Zhang, Ian Goodfellow, Dimitris Metaxas, Augustus Odena
- [paper] [official code]
VideoBERT: A Joint Model for Video and Language Representation Learning, ICCV 2019, video processing
- Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, Cordelia Schmid
- [paper]
Visual Transformers: Token-based Image Representation and Processing for Computer Vision, arXiv 2020, image classification
- Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Masayoshi Tomizuka, Kurt Keutzer, Peter Vajda
- [paper]
Feature Pyramid Transformer, ECCV 2020, detection and segmentation
- Dong Zhang, Hanwang Zhang, Jinhui Tang, Meng Wang, Xiansheng Hua, Qianru Sun
- [paper] [official code]
Revisiting Stereo Depth Estimation From a Sequence-to-Sequence Perspective with Transformers, arXiv 2020, depth estimation
- Zhaoshuo Li, Xingtong Liu, Francis X. Creighton, Russell H. Taylor, Mathias Unberath
- [paper] [official code]
End-to-end Lane Shape Prediction with Transformers, arXiv 2020, lane detection
- Ruijin Liu, Zejian Yuan, Tie Liu, Zhiliang Xiong
- [paper] [official code]
Taming Transformers for High-Resolution Image Synthesis, arXiv 2020, image synthesis
- Patrick Esser, Robin Rombach, Bjorn Ommer
- [paper] [official code]
TransPose: Towards Explainable Human Pose Estimation by Transformer, arXiv 2020, pose estimation
- Sen Yang, Zhibin Quan, Mu Nie, Wankou Yang
- [paper]
End-to-End Video Instance Segmentation with Transformers, arXiv 2020, video instance segmentation
- Yuqing Wang, Zhaoliang Xu, Xinlong Wang, Chunhua Shen, Baoshan Cheng, Hao Shen, Huaxia Xia
- [paper]
TransTrack: Multiple-Object Tracking with Transformer, arXiv 2020, MOT
- Peize Sun, Yi Jiang, Rufeng Zhang, Enze Xie, Jinkun Cao, Xinting Hu, Tao Kong, Zehuan Yuan, Changhu Wang, Ping Luo
- [paper] [official code]
TrackFormer: Multi-Object Tracking with Transformers, arXiv 2021, MOT
- Tim Meinhardt, Alexander Kirillov, Laura Leal-Taixe, Christoph Feichtenhofer
- [paper]
Line Segment Detection Using Transformers without Edges, arXiv 2021, line segment detection
- Yifan Xu, Weijian Xu, David Cheung, Zhuowen Tu
- [paper]
Segmenting Transparent Object in the Wild with Transformer, arXiv 2021, transparent object segmentation
- Enze Xie, Wenjia Wang, Wenhai Wang, Peize Sun, Hang Xu, Ding Liang, Ping Luo
- [paper] [official code]
Bottleneck Transformers for Visual Recognition, arXiv 2021, backbone design
- Aravind Srinivas, Tsung-Yi Lin, Niki Parmar, Jonathon Shlens, Pieter Abbeel, Ashish Vaswani
- [paper]

DETR Family

End-to-End Object Detection with Transformers, ECCV 2020, object detection
- Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko
- [paper] [official code] [detectron2 implementation]
Deformable DETR: Deformable Transformers for End-to-End Object Detection, ICLR 2021, object detection
- Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, Jifeng Dai
- [paper] [official code]
End-to-End Object Detection with Adaptive Clustering Transformer, arXiv 2020, object detection
- Minghang Zheng, Peng Gao, Xiaogang Wang, Hongsheng Li, Hao Dong
- [paper]
UP-DETR: Unsupervised Pre-training for Object Detection with Transformers, arXiv 2020, object detection
- Zhigang Dai, Bolun Cai, Yugeng Lin, Junying Chen
- [paper]
DETR for Pedestrian Detection, arXiv 2020, pedestrian detection
- Matthieu Lin, Chuming Li, Xingyuan Bu, Ming Sun, Chen Lin, Junjie Yan, Wanli Ouyang, Zhidong Deng
- [paper]

Stand-alone transformers for Computer Vision

Self-attention only in local neighborhood

Image Transformer, ICML 2018
- Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Łukasz Kaiser, Noam Shazeer, Alexander Ku, Dustin Tran
- [paper] [official code]
Stand-Alone Self-Attention in Vision Models, NeurIPS 2019
- Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, Jonathon Shlens
- [paper] [official code (under construction)]
On the Relationship between Self-Attention and Convolutional Layers, ICLR 2020
- Jean-Baptiste Cordonnier, Andreas Loukas, Martin Jaggi
- [paper] [official code]
Exploring Self-Attention for Image Recognition, CVPR 2020
- Hengshuang Zhao, Jiaya Jia, Vladlen Koltun
- [paper] [official code]

Scalable approximations to global self-attention

Generating Long Sequences with Sparse Transformers, arXiv 2019
- Rewon Child, Scott Gray, Alec Radford, Ilya Sutskever
- [paper] [official code]
Scaling Autoregressive Video Models, ICLR 2019
- Dirk Weissenborn, Oscar Täckström, Jakob Uszkoreit
- [paper]
Axial Attention in Multidimensional Transformers, arXiv 2019
- Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, Tim Salimans
- [paper] [pytorch implementation]
Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation, ECCV 2020
- Huiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh Chen
- [paper] [pytorch implementation]
MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers, arXiv 2020
- Huiyu Wang, Yukun Zhu, Hartwig Adam, Alan Yuille, Liang-Chieh Chen
- [paper]

Global self-attention with image preprocessing

Generative Pretraining from Pixels, ICML 2020, iGPT
- Mark Chen, Alec Radford, Rewon Child, Jeff Wu, Heewoo Jun, Prafulla Dhariwal, David Luan, Ilya Sutskever
- [paper] [official code]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021, ViT
- Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby
- [paper] [pytorch implementation]
Pre-Trained Image Processing Transformer, arXiv 2020, IPT
- Hanting Chen, Yunhe Wang, Tianyu Guo, Chang Xu, Yiping Deng, Zhenhua Liu, Siwei Ma, Chunjing Xu, Chao Xu, Wen Gao
- [paper]
Training data-efficient image transformers & distillation through attention, arXiv 2020, DeiT
- Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Herve Jegou
- [paper] [official code]
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers, arXiv 2020, SETR
- Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip H.S. Torr, Li Zhang
- [paper] [official code]
Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet, arXiv 2021, T2T-ViT
- Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Francis EH Tay, Jiashi Feng, Shuicheng Yan
- [paper] [official code]
TransReID: Transformer-based Object Re-Identification, arXiv 2021
- Shuting He, Hao Luo, Pichao Wang, Fan Wang, Hao Li, Wei Jiang
- [paper]
Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions, arXiv 2021, PVT
- Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao
- [paper] [official code]

Global self-attention on 3D point clouds

Point Transformer, arXiv 2020, points classification + part/semantic segmentation
- Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip Torr, Vladlen Koltun
- [paper]

Unified text-vision tasks

Focused on VQA

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, NeurIPS 2019
- Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee
- [paper] [official code]
LXMERT: Learning Cross-Modality Encoder Representations from Transformers, EMNLP 2019
- Hao Tan, Mohit Bansal
- [paper] [official code]
VisualBERT: A Simple and Performant Baseline for Vision and Language, arXiv 2019
- Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang
- [paper] [official code]
VL-BERT: Pre-training of Generic Visual-Linguistic Representations, ICLR 2020
- Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, Jifeng Dai
- [paper] [official code]
UNITER: UNiversal Image-TExt Representation Learning, ECCV 2020
- Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, Jingjing Liu
- [paper] [official code]
Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers, arXiv 2020
- Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, Jianlong Fu
- [paper]
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision, arXiv 2021
- Wonjae Kim, Bokyung Son, Ildoo Kim
- [paper]

Focused on Image Retrieval

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training, AAAI 2020
- Gen Li, Nan Duan, Yuejian Fang, Ming Gong, Daxin Jiang, Ming Zhou
- [paper] [official code]
ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data, arXiv 2020
- Di Qi, Lin Su, Jia Song, Edward Cui, Taroon Bharti, Arun Sacheti
- [paper]
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, ECCV 2020
- Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, Jianfeng Gao
- [paper] [official code]
Training Vision Transformers for Image Retrieval, arXiv 2021
- Alaaeldin El-Nouby, Natalia Neverova, Ivan Laptev, Herve Jegou
- [paper]

Focused on OCR

LayoutLM: Pre-training of Text and Layout for Document Image Understanding
- Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou
- [paper] [official code]

Focused on Image Captioning

CPTR: Full Transformer Network for Image Captioning, arXiv 2021
- Wei Liu, Sihan Chen, Longteng Guo, Xinxin Zhu, Jing Liu
- [paper]

Multi-Task

12-in-1: Multi-Task Vision and Language Representation Learning
- Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, Stefan Lee
- [paper] [official code]
