## Image Recognition with Vision Transformer (ViT)

![Vision Transformer architecture overview](https://github.com/google-research/vision_transformer/blob/main/vit_figure.png?raw=true)

In [1]:
# Install the required libraries
!pip install transformers
!pip install torch
!pip install torchvision
!pip install pillow


In [None]:
# Import necessary libraries
import torch
from transformers import ViTImageProcessor, ViTForImageClassification
from PIL import Image
import requests

In [None]:
# Load a pre-trained Vision Transformer model and its image processor
# (ViTImageProcessor replaces the deprecated ViTFeatureExtractor)
model_name = 'google/vit-base-patch16-224'
feature_extractor = ViTImageProcessor.from_pretrained(model_name)
model = ViTForImageClassification.from_pretrained(model_name)
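
The checkpoint name encodes the input geometry: `patch16` means 16x16-pixel patches and `224` means 224x224-pixel inputs. As a minimal sketch, the token count the encoder sees can be worked out as below. The helper name `vit_sequence_length` is hypothetical (it is not part of `transformers`), and it assumes square images, non-overlapping patches, and a single prepended class token, as described in the original ViT paper.

```python
def vit_sequence_length(image_size: int, patch_size: int) -> int:
    """Return the number of tokens the encoder sees for one square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    return num_patches + 1  # +1 for the prepended class token


# 224 / 16 = 14 patches per side, so 14 * 14 patches plus one class token
print(vit_sequence_length(224, 16))
```

Because self-attention cost grows quadratically with this sequence length, a smaller patch size (for example 14 instead of 16) makes the model noticeably more expensive at the same resolution.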

In [None]:

# Function to preprocess the image
def preprocess_image(image_path):
    # Convert to RGB so grayscale or RGBA files match the model's 3-channel input
    image = Image.open(image_path).convert("RGB")
    inputs = feature_extractor(images=image, return_tensors="pt")
    return inputs

# Function to perform inference
def infer_image(image_path):
    inputs = preprocess_image(image_path)
    # Gradients are not needed for inference
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits
    predicted_class_idx = logits.argmax(-1).item()
    return model.config.id2label[predicted_class_idx]



In [None]:
# Test the inference with an example image
image_url = 'https://huggingface.co/datasets/huggingface/transformers-sample-images/resolve/main/african_elephant.jpg'
image_path = 'african_elephant.jpg'
response = requests.get(image_url, timeout=30)
response.raise_for_status()  # Fail early instead of saving an error page as the image
with open(image_path, 'wb') as f:
    f.write(response.content)




In [None]:
# Perform inference and print the result
predicted_label = infer_image(image_path)
print(f'The predicted class is: {predicted_label}')
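
The cell above reports only the single most likely class. It is often more informative to see the top few candidates with their probabilities. The following is a minimal sketch using only the standard library; the helper name `top_k_predictions` and the toy logits and labels are illustrative assumptions, not real model output.

```python
import math


def top_k_predictions(logits, id2label, k=5):
    """Return the k most likely (label, probability) pairs for one image."""
    # Numerically stable softmax: subtract the max before exponentiating.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort class indices by descending probability and keep the first k.
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    return [(id2label[i], probs[i]) for i in ranked]


# Toy example with made-up logits and labels (not real model output).
toy_logits = [2.0, 0.5, 4.0, 1.0]
toy_labels = {0: "cat", 1: "dog", 2: "elephant", 3: "zebra"}
for label, prob in top_k_predictions(toy_logits, toy_labels, k=2):
    print(f"{label}: {prob:.3f}")
```

With the notebook's model, the same helper could presumably be fed `logits[0].tolist()` from inside `infer_image` together with `model.config.id2label`.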

## References
- [Vision transformer - Wikipedia](https://en.wikipedia.org/wiki/Vision_transformer)
- [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929)
- [Transformers in Vision: A Survey - 2022](https://arxiv.org/abs/2101.01169)
- [Vision Transformers (ViT) in Image Recognition – 2024 Guide: viso.ai](https://viso.ai/deep-learning/vision-transformer-vit/)

**Image Classification:**
- [Image classification with Vision Transformer](https://keras.io/examples/vision/image_classification_with_vision_transformer/)

**Object Detection (2D/3D):**
- [Can We Solve 3D Vision Tasks Starting from A 2D Vision Transformer?](https://arxiv.org/abs/2209.07026)

**Semantic Segmentation:**
- [Semantic Segmentation using Vision Transformers: A survey](https://arxiv.org/abs/2305.03273)

**Instance Segmentation:**

- [A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation](https://arxiv.org/abs/2112.09747)


**Action Recognition:**
- [Vision Transformers for Action Recognition: A Survey](https://arxiv.org/abs/2209.05700)

**Code:**

- [Google Research Github Repo](https://github.com/google-research/vision_transformer?tab=readme-ov-file#available-vit-models)

## Datasets

Datasets commonly used to train and evaluate Vision Transformers, grouped by task:

**Image classification**
- ImageNet
- CIFAR-10
- CIFAR-100
- SVHN
- Tiny ImageNet

**Object detection**
- COCO
- Pascal VOC
- ImageNet Detection
- Open Images Dataset

**Semantic segmentation**
- ADE20K
- Cityscapes
- Pascal Context
- Mapillary Vistas

**Video classification**
- Kinetics
- UCF101
- HMDB51
- Something-Something V2

**Video object segmentation**
- YouTube-VOS
- DAVIS
- SegTrack V2

## ViT Family Models

The largest model in the original ViT paper is **ViT-H/14** (ViT-Huge):
- About 632 million parameters.
- 88.55% top-1 accuracy on ImageNet-1k when pre-trained on JFT-300M and then fine-tuned.

Other ViT models from the same paper include:

- **ViT-B/16**: about 86 million parameters, roughly 84% top-1 accuracy on ImageNet-1k when pre-trained on ImageNet-21k.
- **ViT-L/16**: about 307 million parameters, 87.76% top-1 accuracy on ImageNet-1k when pre-trained on JFT-300M.

ref: https://github.com/google-research/vision_transformer?tab=readme-ov-file#available-vit-models
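
Model sizes can be sanity-checked from the architecture alone. The sketch below counts parameters from the hidden size, depth, and MLP size listed in the ViT paper. The `ViTConfig` class and `count_parameters` function are hypothetical helpers written for this note, and the count assumes a 224x224 RGB input, a 1000-class linear head, and biases on every linear layer, so it may differ slightly from any particular checkpoint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ViTConfig:
    patch_size: int
    hidden_size: int
    num_layers: int
    mlp_size: int
    image_size: int = 224
    num_channels: int = 3
    num_classes: int = 1000


def count_parameters(cfg: ViTConfig) -> int:
    d = cfg.hidden_size
    num_patches = (cfg.image_size // cfg.patch_size) ** 2
    # Patch embedding: a linear projection of each flattened patch, plus bias.
    patch_embed = cfg.patch_size ** 2 * cfg.num_channels * d + d
    # One learnable class token and one position embedding per token.
    tokens = d + (num_patches + 1) * d
    # Per encoder layer: Q, K, V and output projections, a two-layer MLP,
    # and two LayerNorms (scale and shift each).
    attention = 4 * (d * d + d)
    mlp = (d * cfg.mlp_size + cfg.mlp_size) + (cfg.mlp_size * d + d)
    layer_norms = 2 * 2 * d
    encoder = cfg.num_layers * (attention + mlp + layer_norms)
    # Final LayerNorm and linear classification head.
    head = 2 * d + (d * cfg.num_classes + cfg.num_classes)
    return patch_embed + tokens + encoder + head


CONFIGS = {
    "ViT-B/16": ViTConfig(16, 768, 12, 3072),
    "ViT-L/16": ViTConfig(16, 1024, 24, 4096),
    "ViT-H/14": ViTConfig(14, 1280, 32, 5120),
}

for name, cfg in CONFIGS.items():
    print(f"{name}: {count_parameters(cfg) / 1e6:.1f}M parameters")
```

The number of attention heads does not appear in the count because splitting the hidden dimension across heads leaves the projection matrices the same size. The results should land close to the 86M, 307M, and 632M figures reported in the paper.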

Google Cloud Storage: https://console.cloud.google.com/storage/browser/vit_models/imagenet21k
