Vision Model Implementations

This repository contains a growing collection of computer vision models implemented in PyTorch. I am mostly focusing on recent-ish models, especially those based on attention. My goal is to write my own implementation of each model from the paper, while also learning tricks and checking my work by referencing existing implementations.

Models

  • MLP Mixer: An all-MLP architecture from Google Research, which shows that you don't need convolutions or attention to get good performance on computer vision tasks (see the Mixer block sketch after this list).
  • Vision Transformer: Attention-based architecture, adapted from the paper "An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale". The authors split images into patches and feed the resulting sequence into a transformer encoder (see the patch-embedding sketch after this list).
  • Masked Autoencoder Vision Transformer: Using the Vision Transformer architecture above, the authors of "Masked Autoencoders Are Scalable Vision Learners" adopt a self-supervised pretraining objective similar to masked language modeling (Devlin et al., 2018): a large fraction of image patches is masked before the sequence is fed into a transformer encoder, and a lightweight decoder must reconstruct them (the random-masking step is also sketched after this list).
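
As a concrete illustration of the Mixer idea, here is a minimal sketch of a single Mixer block in PyTorch. The class name, dimensions, and hyperparameters are illustrative assumptions, not necessarily what this repository uses.

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """One MLP-Mixer block: token-mixing MLP followed by channel-mixing MLP."""
    def __init__(self, num_patches, dim, token_mlp_dim, channel_mlp_dim):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Token-mixing MLP: mixes information across patches (the sequence dimension).
        self.token_mlp = nn.Sequential(
            nn.Linear(num_patches, token_mlp_dim),
            nn.GELU(),
            nn.Linear(token_mlp_dim, num_patches),
        )
        self.norm2 = nn.LayerNorm(dim)
        # Channel-mixing MLP: mixes information across channels, independently per patch.
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, channel_mlp_dim),
            nn.GELU(),
            nn.Linear(channel_mlp_dim, dim),
        )

    def forward(self, x):  # x: (batch, num_patches, dim)
        # Token mixing: move the patch dimension last, mix, move it back; residual add.
        y = self.norm1(x).transpose(1, 2)               # (batch, dim, num_patches)
        x = x + self.token_mlp(y).transpose(1, 2)       # (batch, num_patches, dim)
        # Channel mixing with its own residual connection.
        x = x + self.channel_mlp(self.norm2(x))
        return x

# Example usage with hypothetical sizes (e.g. 8x8 grid of patches):
# block = MixerBlock(num_patches=64, dim=128, token_mlp_dim=256, channel_mlp_dim=512)
# out = block(torch.randn(2, 64, 128))                 # (2, 64, 128)
```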
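
Similarly, the core of the Vision Transformer and MAE pipelines is patch embedding followed by (for MAE) random masking of most patches. The sketch below is a simplified, assumed version of those two steps; the 75% mask ratio and all names and shapes are illustrative, not the exact code in this repository.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and linearly embed each one."""
    def __init__(self, img_size=32, patch_size=4, in_chans=3, dim=128):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch and
        # applying a shared linear projection.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, imgs):                      # (batch, 3, H, W)
        x = self.proj(imgs)                       # (batch, dim, H/p, W/p)
        return x.flatten(2).transpose(1, 2)       # (batch, num_patches, dim)

def random_masking(tokens, mask_ratio=0.75):
    """Keep a random subset of patch tokens; the rest are left for the decoder to reconstruct."""
    batch, num_patches, dim = tokens.shape
    num_keep = int(num_patches * (1 - mask_ratio))
    # Shuffle patch indices independently per image and keep the first num_keep.
    noise = torch.rand(batch, num_patches, device=tokens.device)
    ids_shuffle = noise.argsort(dim=1)
    ids_keep = ids_shuffle[:, :num_keep]
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, dim))
    return visible, ids_shuffle                   # visible tokens + permutation for un-shuffling

# Example: embed a batch of CIFAR-sized images, then mask 75% of the patches
# before they would be fed to the transformer encoder.
# patches = PatchEmbed()(torch.randn(2, 3, 32, 32))    # (2, 64, 128)
# visible, ids = random_masking(patches)               # (2, 16, 128)
```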

Data & Training

My initial tests for these models, which involve training them on CIFAR-100, are still in progress. I will release the code and results from these experiments soon.
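
For reference, CIFAR-100 can be loaded directly through torchvision; below is a minimal sketch of the kind of input pipeline such tests might use. The normalization statistics and batch size here are illustrative assumptions, not necessarily what my experiments use.

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Tensor conversion plus per-channel normalization (values are illustrative).
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.5071, 0.4865, 0.4409), std=(0.2673, 0.2564, 0.2762)),
])

train_set = datasets.CIFAR100(root="./data", train=True, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=2)

images, labels = next(iter(train_loader))   # images: (128, 3, 32, 32), labels: (128,)
```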

References

Papers Implemented

  • MLP-Mixer: An All-MLP Architecture for Vision
  • An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale
  • Masked Autoencoders Are Scalable Vision Learners

Code References

  • Andrej Karpathy's mingpt: Referenced for some tricks related to the implementation of multi-head attention.
  • Einops documentation: Referenced for more multi-head attention tricks, namely Einstein-notation-style tensor rearrangement.
  • Phil Wang's ViT repository: Referenced for the Vision Transformer and masked autoencoder, as well as modifications/improvements proposed later in the literature, such as replacing the CLS token with average pooling after the final transformer block. I borrowed the elegant approach of wrapping the attention and FFN blocks in a "PreNorm" layer that handles normalization, tweaking it slightly to also include the residual connection; this results in a much cleaner transformer block implementation (a sketch of this pattern appears after this list).
  • Google Research MLP Mixer & ViT implementations: Referenced for my MLP Mixer and Vision Transformer implementations.
  • Facebook Research MAE implementation: Referenced for my Masked Autoencoder implementation.
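
To make the PreNorm idea and the einops-style attention reshaping mentioned above concrete, here is a minimal sketch. The class and parameter names are my own illustrative choices rather than the exact code in this repository.

```python
import torch
import torch.nn as nn
from einops import rearrange

class PreNormResidual(nn.Module):
    """Wrap a sub-layer so it sees a normalized input and its output is added
    back to the un-normalized input (pre-norm + residual in one place)."""
    def __init__(self, dim, fn):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fn = fn

    def forward(self, x, **kwargs):
        return x + self.fn(self.norm(x), **kwargs)

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        assert dim % heads == 0
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x):                                 # x: (batch, seq, dim)
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        # einops makes the head split explicit:
        # (batch, seq, heads * dim_head) -> (batch, heads, seq, dim_head)
        q, k, v = (rearrange(t, "b n (h d) -> b h n d", h=self.heads) for t in (q, k, v))
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        out = rearrange(attn @ v, "b h n d -> b n (h d)")
        return self.to_out(out)

# A transformer block then becomes two one-liners:
# attn_block = PreNormResidual(dim=128, fn=MultiHeadSelfAttention(128))
# ffn_block = PreNormResidual(dim=128, fn=nn.Sequential(
#     nn.Linear(128, 512), nn.GELU(), nn.Linear(512, 128)))
```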
