<a href="https://colab.research.google.com/github/afondiel/computer-science-notes/blob/master/computer-vision-notes/lab/notebooks/Image_Recognition_benchmark_ViT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vision Transformer (ViT) Benchmark for Image Recognition Tasks

![](https://github.com/google-research/vision_transformer/blob/main/vit_figure.png?raw=true)

In [2]:
# Install the required libraries
!pip install transformers datasets torch torchvision pillow



In [3]:
# Import necessary libraries
import torch
from transformers import ViTImageProcessor, ViTForImageClassification
from PIL import Image
import requests

## Image Classification

In [4]:
# Load the pre-trained ViT model and image processor
# (ViTImageProcessor replaces the deprecated ViTFeatureExtractor)
model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')
image_processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')

# Load an example image
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# Prepare the image (resize, normalize, convert to a PyTorch tensor)
inputs = image_processor(images=image, return_tensors="pt")

# Perform inference (no gradients are needed)
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
predicted_class_idx = logits.argmax(-1).item()
print(f'Predicted class: {model.config.id2label[predicted_class_idx]}')


Predicted class: Egyptian cat
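
The cell above reports only the single best class. A minimal sketch of how the same `outputs.logits` could be turned into a ranked list of probabilities follows. The helper name `top_k_predictions` and the toy label map are hypothetical and stand in for `model.config.id2label`.

```python
import torch


def top_k_predictions(logits: torch.Tensor, id2label: dict, k: int = 5):
    """Return the k most probable (label, probability) pairs for the first image."""
    probs = torch.softmax(logits, dim=-1)        # turn logits into probabilities
    top_probs, top_idx = probs.topk(k, dim=-1)   # k largest entries, highest first
    return [
        (id2label[i], p)
        for i, p in zip(top_idx[0].tolist(), top_probs[0].tolist())
    ]


# Toy logits standing in for `outputs.logits` (batch of 1, 4 classes).
toy_logits = torch.tensor([[0.1, 2.0, -1.0, 0.5]])
# Hypothetical label map standing in for `model.config.id2label`.
toy_labels = {0: "tabby", 1: "Egyptian cat", 2: "remote control", 3: "lynx"}

for label, prob in top_k_predictions(toy_logits, toy_labels, k=3):
    print(f"{label}: {prob:.3f}")
```

With the real model, `top_k_predictions(outputs.logits, model.config.id2label)` would list the five most likely ImageNet classes.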


## Object detection

In [None]:
!pip install transformers
!pip install timm

In [None]:
import torch
import requests
from PIL import Image, ImageDraw
from transformers import DetrImageProcessor, DetrForObjectDetection

# Load pre-trained model and processor
model = DetrForObjectDetection.from_pretrained('facebook/detr-resnet-50')
processor = DetrImageProcessor.from_pretrained('facebook/detr-resnet-50')

# Load an example image
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# Prepare the image
inputs = processor(images=image, return_tensors="pt")

# Perform inference (no gradients are needed)
with torch.no_grad():
    outputs = model(**inputs)

# Process the outputs: rescale boxes to pixel coordinates; target size is (height, width)
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(outputs, target_sizes=target_sizes)[0]

# Draw boxes on the image, keeping only confident detections
draw = ImageDraw.Draw(image)
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    if score > 0.9:
        box = [round(i, 2) for i in box.tolist()]
        draw.rectangle(box, outline="red", width=3)
        draw.text((box[0], box[1]), f'{model.config.id2label[label.item()]}: {round(score.item(), 2)}', fill="red")

# image.show() opens an external viewer, which does not work in Colab; display inline instead
display(image)
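
DETR predicts boxes as normalized `(center_x, center_y, width, height)` values, and the post-processing step rescales them to `(x_min, y_min, x_max, y_max)` in pixels. The sketch below illustrates that conversion on a toy box. The function name `cxcywh_to_xyxy` is hypothetical, and this is an illustration of the idea rather than the library's own implementation.

```python
import torch


def cxcywh_to_xyxy(boxes: torch.Tensor, height: int, width: int) -> torch.Tensor:
    """Convert normalized (cx, cy, w, h) boxes to pixel (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = boxes.unbind(-1)
    corners = torch.stack(
        [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=-1
    )
    # Scale x coordinates by the image width and y coordinates by the height.
    scale = torch.tensor(
        [width, height, width, height], dtype=boxes.dtype, device=boxes.device
    )
    return corners * scale


# One toy box centered in a 640x480 image, covering 20% of the width and 40% of the height.
toy_boxes = torch.tensor([[0.5, 0.5, 0.2, 0.4]])
print(cxcywh_to_xyxy(toy_boxes, height=480, width=640))
```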


## Semantic segmentation

In [None]:
!pip install transformers


In [None]:
import torch
from transformers import SegformerForSemanticSegmentation, SegformerImageProcessor
from PIL import Image
import requests

# Load pre-trained model and image processor
# (SegformerImageProcessor replaces the deprecated SegformerFeatureExtractor)
model = SegformerForSemanticSegmentation.from_pretrained('nvidia/segformer-b0-finetuned-ade-512-512')
image_processor = SegformerImageProcessor.from_pretrained('nvidia/segformer-b0-finetuned-ade-512-512')

# Load an example image
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# Prepare the image
inputs = image_processor(images=image, return_tensors="pt")

# Perform inference (no gradients are needed)
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits  # shape (batch_size, num_classes, height/4, width/4)

# Get the predicted class for each pixel of the low-resolution map
seg = torch.argmax(logits, dim=1)[0]  # shape (height/4, width/4)

# Convert to a grayscale PIL image; pixel values are class ids, so the map looks dark
segmentation = Image.fromarray(seg.byte().cpu().numpy())

# Display the segmentation map inline (segmentation.show() needs an external viewer)
display(segmentation)
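
SegFormer returns logits at a quarter of the input resolution, and a map of raw class ids is hard to read by eye. The sketch below shows one way to upsample the logits and color each class. The function name `logits_to_color_map` and the random palette are assumptions for illustration (the palette is not the official ADE20K one), and random logits stand in for `outputs.logits`.

```python
import numpy as np
import torch
import torch.nn.functional as F


def logits_to_color_map(logits: torch.Tensor, size: tuple, num_classes: int, seed: int = 0):
    """Upsample (1, C, h, w) logits to `size` = (H, W) and color each predicted class."""
    upsampled = F.interpolate(logits, size=size, mode="bilinear", align_corners=False)
    class_map = upsampled.argmax(dim=1)[0].cpu().numpy()   # (H, W) array of class ids
    rng = np.random.default_rng(seed)
    # Hypothetical palette: one random RGB color per class.
    palette = rng.integers(0, 256, size=(num_classes, 3), dtype=np.uint8)
    return class_map, palette[class_map]                   # (H, W) and (H, W, 3)


# Random logits standing in for `outputs.logits` (150 ADE20K classes).
toy_logits = torch.randn(1, 150, 32, 32)
class_map, color_map = logits_to_color_map(toy_logits, size=(128, 128), num_classes=150)
print(class_map.shape, color_map.shape, color_map.dtype)
```

With the real model, `size` would be `image.size[::-1]`, and `Image.fromarray(color_map)` would give an image that can be displayed next to the input.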


## Instance Segmentation

In [None]:
!pip install pyyaml==5.1
# The prebuilt cu113/torch1.10 wheels do not match current Colab runtimes, so build from source
!pip install 'git+https://github.com/facebookresearch/detectron2.git'

In [None]:

import torch
import cv2
import numpy as np
import requests
import matplotlib.pyplot as plt
from detectron2 import model_zoo
from detectron2.engine import DefaultPredictor
from detectron2.config import get_cfg
from detectron2.utils.visualizer import Visualizer
from detectron2.data import MetadataCatalog

# Load the config and weights
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5  # keep detections with a score above 0.5
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.DEVICE = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU without a GPU
predictor = DefaultPredictor(cfg)

# Load an image: cv2.imread cannot read URLs, so download the bytes and decode them (BGR order)
img_url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
img_bytes = np.frombuffer(requests.get(img_url).content, dtype=np.uint8)
im = cv2.imdecode(img_bytes, cv2.IMREAD_COLOR)

# Perform inference
outputs = predictor(im)

# Visualize the results (Visualizer expects RGB, so flip the BGR channels)
v = Visualizer(im[:, :, ::-1], MetadataCatalog.get(cfg.DATASETS.TRAIN[0]), scale=1.2)
out = v.draw_instance_predictions(outputs["instances"].to("cpu"))

plt.figure(figsize=(14, 10))
plt.imshow(out.get_image())  # already RGB, which is what matplotlib expects
plt.show()


## Action Recognition

In [None]:
!pip install pytorchvideo

In [None]:

import torch
from pytorchvideo.models.hub import slowfast_r50

# Load pre-trained SlowFast model
model = slowfast_r50(pretrained=True)

# SlowFast takes a list of two tensors: a slow pathway with few frames and a fast pathway with all frames.
# Each has shape (B, C, T, H, W): B is batch size, C is number of channels, T is time frames, H and W are height and width
# Example: Dummy tensors for demonstration
fast = torch.randn(1, 3, 32, 224, 224)  # Adjust dimensions according to your video
slow = fast[:, :, ::4]  # every 4th frame, giving 8 frames for the slow pathway

# Perform inference
model = model.eval()
with torch.no_grad():
    outputs = model([slow, fast])
    predicted_class_idx = torch.argmax(outputs, dim=1)
    print(f'Predicted action class: {predicted_class_idx.item()}')
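
The dummy tensor above skips the step of turning decoded frames into a model-ready clip. The sketch below shows one way to do it, assuming frames arrive as a `uint8` tensor of shape `(T, H, W, C)`. The function name `frames_to_clip` is hypothetical, and the mean of 0.45 and standard deviation of 0.225 are assumed normalization constants that should be checked against the model's documentation.

```python
import torch


def frames_to_clip(frames: torch.Tensor, mean: float = 0.45, std: float = 0.225) -> torch.Tensor:
    """Convert uint8 frames (T, H, W, C) into a normalized float clip (1, C, T, H, W)."""
    clip = frames.float() / 255.0      # scale pixel values to [0, 1]
    clip = (clip - mean) / std         # normalize with the assumed constants
    clip = clip.permute(3, 0, 1, 2)    # (T, H, W, C) -> (C, T, H, W)
    return clip.unsqueeze(0)           # add the batch dimension


# Random frames standing in for a decoded 32-frame video.
frames = torch.randint(0, 256, (32, 224, 224, 3), dtype=torch.uint8)
clip = frames_to_clip(frames)
print(clip.shape, clip.dtype)
```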


## Video classification

In [None]:
!pip install transformers

In [None]:

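from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification
import numpy as np
import torch

# Load pre-trained model and image processor
# (VideoMAEImageProcessor replaces the deprecated VideoMAEFeatureExtractor).
# The Kinetics fine-tuned checkpoint is used because 'MCG-NJU/videomae-base' is the
# self-supervised pre-training checkpoint and has no trained classification head.
model = VideoMAEForVideoClassification.from_pretrained('MCG-NJU/videomae-base-finetuned-kinetics')
image_processor = VideoMAEImageProcessor.from_pretrained('MCG-NJU/videomae-base-finetuned-kinetics')

# Load a video
url = 'https://path_to_your_video.mp4'  # replace with your video URL
# Placeholder clip: 16 random RGB frames of shape (height, width, channels).
# Replace this with 16 frames decoded from your own video.
video = list(np.random.randint(0, 256, (16, 224, 224, 3), dtype=np.uint8))

# Prepare the video
inputs = image_processor(video, return_tensors="pt")

# Perform inference (no gradients are needed)
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
predicted_class_idx = logits.argmax(-1).item()
print(f'Predicted class: {model.config.id2label[predicted_class_idx]}')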
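
VideoMAE base checkpoints work on clips of 16 frames, so a longer video has to be subsampled before it reaches the processor. The sketch below picks evenly spaced frame indices. The function name `sample_frame_indices` is hypothetical, and uniform spacing is only one possible sampling strategy.

```python
import numpy as np


def sample_frame_indices(num_video_frames: int, clip_len: int = 16) -> np.ndarray:
    """Pick `clip_len` evenly spaced frame indices from a video of `num_video_frames` frames."""
    if num_video_frames <= 0:
        raise ValueError("the video must contain at least one frame")
    # Videos shorter than `clip_len` simply repeat some frames.
    positions = np.linspace(0, num_video_frames - 1, num=clip_len)
    return positions.round().astype(np.int64)


# Example: a 300-frame video reduced to 16 frame indices.
indices = sample_frame_indices(300)
print(indices)
```

The resulting indices can then select frames from a decoded video, for example `video = [all_frames[i] for i in indices]`, where `all_frames` is a hypothetical list of decoded frames.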


## Video object detection

In [None]:
# Video object detection depends heavily on the dataset and the specific pre-trained model.
# As a stand-in, this demonstration reuses the SlowFast model, which classifies the action
# in a clip rather than localizing objects.

!pip install pytorchvideo

In [None]:

import torch
from pytorchvideo.models.hub import slowfast_r50

# Load pre-trained SlowFast model
model = slowfast_r50(pretrained=True)

# SlowFast takes a list of two tensors: a slow pathway with few frames and a fast pathway with all frames.
# Each has shape (B, C, T, H, W): B is batch size, C is number of channels, T is time frames, H and W are height and width
# Example: Dummy tensors for demonstration
fast = torch.randn(1, 3, 32, 224, 224)  # Adjust dimensions according to your video
slow = fast[:, :, ::4]  # every 4th frame, giving 8 frames for the slow pathway

# Perform inference
model = model.eval()
with torch.no_grad():
    outputs = model([slow, fast])
    predicted_class_idx = torch.argmax(outputs, dim=1)
    print(f'Predicted action class: {predicted_class_idx.item()}')


## References

Docs:
- [Vision transformer - Wikipedia](https://en.wikipedia.org/wiki/Vision_transformer)
- [ViT Docs - huggingface.co](https://huggingface.co/docs/transformers/main/en/model_doc/vit)

Papers:
- [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929)
- [Transformers in Vision: A Survey - 2022](https://arxiv.org/abs/2101.01169)

Blog:
- [Vision Transformers (ViT) in Image Recognition – 2024 Guide: viso.ai](https://viso.ai/deep-learning/vision-transformer-vit/)

**Image Classification**:
- [Image classification with Vision Transformer](https://keras.io/examples/vision/image_classification_with_vision_transformer/)

**Object Detection (2D/3D?):**
- [Can We Solve 3D Vision Tasks Starting from A 2D Vision Transformer?](https://arxiv.org/abs/2209.07026)

**Semantic Segmentation:**
- [Semantic Segmentation using Vision Transformers: A survey](https://arxiv.org/abs/2305.03273)

**Instance Segmentation:**

- [A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation](https://arxiv.org/abs/2112.09747)


**Action Recognition:**
- [Vision Transformers for Action Recognition: A Survey](https://arxiv.org/abs/2209.05700)

**Code:**

- [Google Research - GitHub Repo](https://github.com/google-research/vision_transformer?tab=readme-ov-file#available-vit-models)
- [Google - Hugging Face](https://huggingface.co/google/vit-base-patch16-224)

## Datasets

List of datasets for Vision transformers:

**Image classification**
- ImageNet
- CIFAR-10
- CIFAR-100
- SVHN
- Tiny ImageNet

**Object detection**
- COCO
- Pascal VOC
- ImageNet Detection
- Open Images Dataset

**Semantic segmentation**
- ADE20K
- Cityscapes
- Pascal Context
- Mapillary Vistas

**Video classification**
- Kinetics
- UCF101
- HMDB51
- Something-Something V2

**Video object segmentation**
- YouTube-VOS
- DAVIS
- SegTrack V2

## ViT Models

The largest model in the original ViT paper is **ViT-Huge (ViT-H/14)**:
- About 632 million parameters.
- 88.55% top-1 accuracy on ImageNet after pre-training on JFT-300M.

Other ViT models from the paper:

- **ViT-B/16**: about 86 million parameters, roughly 84% top-1 accuracy on ImageNet.
- **ViT-L/16**: about 307 million parameters, 87.76% top-1 accuracy on ImageNet after pre-training on JFT-300M (85.30% with ImageNet-21k pre-training).

ref: https://github.com/google-research/vision_transformer?tab=readme-ov-file#available-vit-models
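
Parameter counts for ViT variants can be sanity-checked from the architecture hyperparameters alone. The sketch below assumes the standard layout (patch embedding, class token, learned position embeddings, encoder blocks with biased linear layers, a final layer norm, and a linear head) and a 1000-class head at 224x224 resolution. The function name `vit_param_count` is hypothetical. Published figures can differ by a few million depending on the head and on what is counted.

```python
def vit_param_count(hidden: int, layers: int, mlp_dim: int, patch: int = 16,
                    image: int = 224, channels: int = 3, num_classes: int = 1000) -> int:
    """Count the learnable parameters of a standard ViT classifier."""
    tokens = (image // patch) ** 2 + 1                         # patches plus the class token
    patch_embed = patch * patch * channels * hidden + hidden   # linear projection with bias
    embeddings = patch_embed + hidden + tokens * hidden        # + class token + position embeddings
    attention = 4 * (hidden * hidden + hidden)                 # query, key, value, output projections
    mlp = hidden * mlp_dim + mlp_dim + mlp_dim * hidden + hidden
    layer_norms = 2 * 2 * hidden                               # two norms per block, weight and bias
    encoder = layers * (attention + mlp + layer_norms)
    final_norm = 2 * hidden
    head = hidden * num_classes + num_classes
    return embeddings + encoder + final_norm + head


# Hidden size, depth, and MLP size as listed for Base, Large, and Huge in the ViT paper.
variants = {
    "ViT-B/16": dict(hidden=768, layers=12, mlp_dim=3072, patch=16),
    "ViT-L/16": dict(hidden=1024, layers=24, mlp_dim=4096, patch=16),
    "ViT-H/14": dict(hidden=1280, layers=32, mlp_dim=5120, patch=14),
}
for name, cfg in variants.items():
    print(f"{name}: {vit_param_count(**cfg) / 1e6:.1f}M parameters")
```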

Google Cloud Storage: https://console.cloud.google.com/storage/browser/vit_models/imagenet21k
