
Vision Transformers

Steps:

  • Split the image into fixed-size patches.
  • Embed each patch into a lower-dimensional vector using a linear projection layer.
  • Add positional encodings so the model retains information about where each patch came from.
  • Prepend an additional learnable classification token to the sequence; its output will be used to make predictions.
  • Initialize the classification token with random values and train it along with the rest of the model.
  • Pass the embeddings (including the classification token) through a series of transformer encoder blocks.
  • The output of the final encoder block is a contextualized embedding for every token; the classification token's output serves as the pooled, global representation of the image.
  • Pass the classification token's embedding through a Multi-Layer Perceptron (MLP) head to make the final prediction.

  • ViT is a simple vision transformer architecture: it replaces the convolutional backbone of popular CNN classifiers with a pure transformer encoder applied to the sequence of image patches (see the sketch below).
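
The pipeline described above can be sketched in a few dozen lines of PyTorch. This is a minimal illustration only: the class names (PatchEmbedding, SimpleViT) and the hyperparameters (16x16 patches, 768-dim embeddings, 12 encoder blocks, matching the common ViT-Base defaults) are assumptions for this sketch, not values taken from this repository's code.

```python
# Minimal ViT sketch following the steps listed above.
# Names and hyperparameters are illustrative, not from this repo.
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    """Split the image into patches and project each patch to embed_dim."""

    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to a linear projection of flattened patches.
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.proj(x)                       # (B, embed_dim, H/ps, W/ps)
        return x.flatten(2).transpose(1, 2)    # (B, num_patches, embed_dim)


class SimpleViT(nn.Module):
    def __init__(self, img_size=224, patch_size=16, embed_dim=768,
                 depth=12, num_heads=12, num_classes=1000):
        super().__init__()
        self.patch_embed = PatchEmbedding(img_size, patch_size, 3, embed_dim)
        # Learnable classification token, initialized randomly and trained with the model.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Learnable positional encodings: one per patch plus one for the CLS token.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.patch_embed.num_patches + 1, embed_dim))
        nn.init.trunc_normal_(self.cls_token, std=0.02)
        nn.init.trunc_normal_(self.pos_embed, std=0.02)
        # Stack of transformer encoder blocks.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=4 * embed_dim,
            activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        # MLP head applied to the CLS token's output embedding.
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        B = x.shape[0]
        x = self.patch_embed(x)                          # (B, N, D)
        cls = self.cls_token.expand(B, -1, -1)           # (B, 1, D)
        x = torch.cat([cls, x], dim=1) + self.pos_embed  # prepend CLS, add positions
        x = self.encoder(x)                              # (B, N + 1, D)
        return self.head(x[:, 0])                        # predict from the CLS embedding
```

Using a strided Conv2d for the patch projection is just a compact way to express the per-patch linear layer: it gives the same result as flattening each 16x16 patch and passing it through nn.Linear.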

Abstract: (ViT-1 figure)

Proposed Model: (ViT-2 figure)

Pre-train then fine-tune ViT: (ViT-3 figure)

Architecture: (ViT-4 figure)


Sample Prediction:
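
Below is a hypothetical way to obtain a prediction from the SimpleViT sketch above. The random tensor stands in for a preprocessed 224x224 RGB image, and no trained weights from this repository are loaded, so it only illustrates the shape of the prediction step.

```python
# Hypothetical inference with the SimpleViT sketch above; input and class count are placeholders.
import torch

model = SimpleViT(num_classes=10)
model.eval()

dummy_image = torch.randn(1, 3, 224, 224)   # stands in for a preprocessed RGB image
with torch.no_grad():
    logits = model(dummy_image)             # (1, num_classes)

predicted_class = logits.argmax(dim=-1).item()
print(f"Predicted class index: {predicted_class}")
```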
