Awesome Visual Representation Learning with Transformers

Awesome Transformers (self-attention) in Computer Vision

About transformers

Attention Is All You Need, NeurIPS 2017
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
- [paper] [official code] [pytorch implementation]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL 2019
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
- [paper] [official code] [huggingface/transformers]
Efficient Transformers: A Survey, arXiv 2020
- Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler
- [paper]

Combining CNN with self-attention

Attention augmented convolutional networks, ICCV 2019, image classification
- Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, Quoc V. Le
- [paper] [pytorch implementation]
Self-Attention Generative Adversarial Networks, ICML 2019, generative model (GANs)
- Han Zhang, Ian Goodfellow, Dimitris Metaxas, Augustus Odena
- [paper] [official code]
VideoBERT: A Joint Model for Video and Language Representation Learning, ICCV 2019, video processing
- Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, Cordelia Schmid
- [paper]
Visual Transformers: Token-based Image Representation and Processing for Computer Vision, arXiv 2020, image classification
- Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Masayoshi Tomizuka, Kurt Keutzer, Peter Vajda
- [paper]
Feature Pyramid Transformer, ECCV 2020, detection and segmentation
- Dong Zhang, Hanwang Zhang, Jinhui Tang, Meng Wang, Xiansheng Hua, Qianru Sun
- [paper] [official code]
Revisiting Stereo Depth Estimation From a Sequence-to-Sequence Perspective with Transformers, arXiv 2020, depth estimation
- Zhaoshuo Li, Xingtong Liu, Francis X. Creighton, Russell H. Taylor, Mathias Unberath
- [paper] [official code]
End-to-end Lane Shape Prediction with Transformers, arXiv 2020, lane detection
- Ruijin Liu, Zejian Yuan, Tie Liu, Zhiliang Xiong
- [paper] [official code]

DETR Family

End-to-end object detection with transformers, ECCV 2020, object detection
- Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko
- [paper] [official code] [detectron2 implementation]
Deformable DETR: Deformable Transformers for End-to-End Object Detection, arXiv 2020, object detection
- Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, Jifeng Dai
- [paper] [official code]
End-to-End Object Detection with Adaptive Clustering Transformer, arXiv 2020, object detection
- Minghang Zheng, Peng Gao, Xiaogang Wang, Hongsheng Li, Hao Dong
- [paper]
UP-DETR: Unsupervised Pre-training for Object Detection with Transformers, arXiv 2020, object detection
- Zhigang Dai, Bolun Cai, Yugeng Lin, Junying Chen
- [paper]
DETR for Pedestrian Detection, arXiv 2020, pedestrian detection
- Matthieu Lin, Chuming Li, Xingyuan Bu, Ming Sun, Chen Lin, Junjie Yan, Wanli Ouyang, Zhidong Deng
- [paper]

Stand-alone transformers for Computer Vision

Self-attention only in local neighborhood

Image Transformer, ICML 2018
- Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Łukasz Kaiser, Noam Shazeer, Alexander Ku, Dustin Tran
- [paper] [official code]
Stand-alone self-attention in vision models, NeurIPS 2019
- Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, Jonathon Shlens
- [paper] [official code (under construction)]
On the relationship between self-attention and convolutional layers, ICLR 2020
- Jean-Baptiste Cordonnier, Andreas Loukas, Martin Jaggi
- [paper] [official code]
Exploring self-attention for image recognition, CVPR 2020
- Hengshuang Zhao, Jiaya Jia, Vladlen Koltun
- [paper] [official code]

Scalable approximations to global self-attention

Generating long sequences with sparse transformers, arXiv 2019
- Rewon Child, Scott Gray, Alec Radford, Ilya Sutskever
- [paper] [official code]
Scaling autoregressive video models, ICLR 2020
- Dirk Weissenborn, Oscar Täckström, Jakob Uszkoreit
- [paper]
Axial attention in multidimensional transformers, arXiv 2019
- Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, Tim Salimans
- [paper] [pytorch implementation]
Axial-DeepLab: Stand-alone axial-attention for panoptic segmentation, ECCV 2020
- Huiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh Chen
- [paper] [pytorch implementation]

Global self-attention with image preprocessing

Generative pretraining from pixels, ICML 2020, iGPT
- Mark Chen, Alec Radford, Rewon Child, Jeff Wu, Heewoo Jun, Prafulla Dhariwal, David Luan, Ilya Sutskever
- [paper] [official code]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, arXiv 2020, ViT
- Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby
- [paper] [pytorch implementation]
Pre-Trained Image Processing Transformer, arXiv 2020, IPT
- Hanting Chen, Yunhe Wang, Tianyu Guo, Chang Xu, Yiping Deng, Zhenhua Liu, Siwei Ma, Chunjing Xu, Chao Xu, Wen Gao
- [paper]

Global self-attention on 3D point clouds

Point Transformer, arXiv 2020, point cloud classification + part/semantic segmentation
- Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip Torr, Vladlen Koltun
- [paper]

Unified text-vision tasks

Focused on VQA

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, NeurIPS 2019
- Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee
- [paper] [official code]
LXMERT: Learning Cross-Modality Encoder Representations from Transformers, EMNLP 2019
- Hao Tan, Mohit Bansal
- [paper] [official code]
VisualBERT: A Simple and Performant Baseline for Vision and Language, arXiv 2019
- Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang
- [paper] [official code]
VL-BERT: Pre-training of Generic Visual-Linguistic Representations, ICLR 2020
- Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, Jifeng Dai
- [paper] [official code]
UNITER: UNiversal Image-TExt Representation Learning, ECCV 2020
- Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, Jingjing Liu
- [paper] [official code]

Focused on Image Retrieval

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training, AAAI 2020
- Gen Li, Nan Duan, Yuejian Fang, Ming Gong, Daxin Jiang, Ming Zhou
- [paper] [official code]
ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data, arXiv 2020
- Di Qi, Lin Su, Jia Song, Edward Cui, Taroon Bharti, Arun Sacheti
- [paper]
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, ECCV 2020
- Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, Jianfeng Gao
- [paper] [official code]

Focused on OCR

LayoutLM: Pre-training of Text and Layout for Document Image Understanding
- Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou
- [paper] [official code]

Multi-Task

12-in-1: Multi-Task Vision and Language Representation Learning
- Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, Stefan Lee
- [paper] [official code]
