
Convnext_image_classifier

Open In Colab

Dataset

The idea behind this project is to develop an image classifier using the HuggingFace framework, chosen for the ease with which it lets you implement transformers and other novel models that are state of the art (SOTA) for many different image-related tasks.

For the model, I have selected ConvNeXt. ConvNeXt was proposed in the paper "A ConvNet for the 2020s" and is a pure convolutional model (ConvNet), inspired by the design of Vision Transformers, that claims to outperform them.
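
Loading such a model through HuggingFace takes only a few lines. As a minimal sketch (not the repository's exact code), assuming the public facebook/convnext-tiny-224 checkpoint and a local example.jpg image:

```python
# Minimal inference sketch with HuggingFace Transformers.
# "facebook/convnext-tiny-224" is a public ImageNet checkpoint; any other
# ConvNeXt checkpoint (or your own fine-tuned one) can be swapped in.
from transformers import AutoImageProcessor, AutoModelForImageClassification
from PIL import Image
import torch

checkpoint = "facebook/convnext-tiny-224"
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = AutoModelForImageClassification.from_pretrained(checkpoint)

image = Image.open("example.jpg")                  # any RGB image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits                # shape: (1, num_labels)

pred = logits.argmax(-1).item()
print(model.config.id2label[pred])                 # human-readable label
```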

(Figure: ImageNet SOTA comparison)

Introduction

In recent years, Transformers have had great success in NLP and are now being applied to images as well. One special type of transformer, the Vision Transformer (ViT), was developed to outperform convolutional networks on computer vision tasks. While a CNN operates directly on pixel arrays, a ViT splits the image into patches and treats them as visual tokens.
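
To make the tokenization idea concrete, here is an illustrative sketch (not from the repository) using the standard ViT settings of a 224x224 input and 16x16 patches, where each flattened patch acts as one visual token:

```python
# Illustrative ViT-style tokenization: split a 224x224 RGB image into
# 16x16 patches and flatten each patch into a "visual token".
import torch

image = torch.randn(1, 3, 224, 224)      # dummy batch with a single image
patch = 16
tokens = image.unfold(2, patch, patch).unfold(3, patch, patch)  # (1, 3, 14, 14, 16, 16)
tokens = tokens.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch * patch)
print(tokens.shape)                       # torch.Size([1, 196, 768]): 196 tokens of dim 768
```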

(Figure: ViT architecture)

ConvNeXt came out in January 2022. It is a pure convolutional network that was inspired by some of the recent advances in ViTs and borrowed a few concepts that helped it make a leap in CNN-based accuracy.

Description of the ConvNeXt improvements

The authors started from the well-known ResNet architecture and iteratively improved it following new best practices and discoveries made over the last decade. They focused on the Swin Transformer and closely followed its design choices. They divided their roadmap into two parts: macro design and micro design. Macro design covers changes made from a high-level perspective, e.g. the number of stages, while micro design concerns smaller details, e.g. which activation function to use. Some of the main improvements (illustrated in the block sketch after this list) are:

  • Using AdamW as the optimizer
  • Adjusting the number of blocks in each stage, and using a large kernel size with a stride chosen so that the sliding window does not overlap
  • Designing an inverted bottleneck with an expansion ratio of 4 (as in Swin Transformers)
  • Increasing the kernel size to match the power of vision transformers, which have a global receptive field: a vision transformer tends to view the whole image at once through self-attention spanning the entire image
  • Replacing the ReLU activation function with the Gaussian Error Linear Unit (GELU)
  • Using fewer normalization layers, because transformers use them less often
  • Replacing Batch Normalization (BN) with Layer Normalization (LN)
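
As a hedged sketch of how several of these changes combine in a single block (simplified from the paper's design; layer scale and stochastic depth are omitted), with a 7x7 depthwise convolution, LayerNorm, a 4x inverted bottleneck, and GELU:

```python
# Simplified ConvNeXt block: 7x7 depthwise conv (large kernel),
# LayerNorm instead of BatchNorm, inverted bottleneck with 4x expansion,
# and GELU instead of ReLU. Layer scale / drop path are omitted.
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)  # depthwise
        self.norm = nn.LayerNorm(dim)            # applied in channels-last layout
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # expand (inverted bottleneck, ratio 4)
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)   # project back down

    def forward(self, x):                        # x: (N, C, H, W)
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                # to channels-last for LN / Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)                # back to channels-first
        return residual + x                      # residual connection, as in ResNet

block = ConvNeXtBlock(96)
print(block(torch.randn(1, 96, 56, 56)).shape)   # torch.Size([1, 96, 56, 56])
```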

(Figure: step-by-step improvements from ResNet to ConvNeXt)

References

  • Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., & Xie, S. (2022). A ConvNet for the 2020s. arXiv:2201.03545.

Contact
