<center>
<table>
  <tr>
    <td><img src="https://portal.nccs.nasa.gov/datashare/astg/training/python/logos/nasa-logo.svg" width="100"/> </td>
     <td><img src="https://portal.nccs.nasa.gov/datashare/astg/training/python/logos/ASTG_logo.png?raw=true" width="80"/> </td>
     <td> <img src="https://www.nccs.nasa.gov/sites/default/files/NCCS_Logo_0.png" width="130"/> </td>
    </tr>
</table>
</center>

        
<center>
<h1><font color= "blue" size="+3">ASTG Python Course Series</font></h1>
</center>

---

<center>
    <h1><font color="red">Introduction to DINOv2 with PyTorch</font></h1>
</center>

# <font color="blue"> References</font>

- [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/pdf/2304.07193) by Maxime Oquab et al.
- [DINOv2 by Meta: A Self-Supervised foundational vision model](https://learnopencv.com/dinov2-self-supervised-vision-transformer/) by Bhomik Sharma, April 2025.
- [01.Meta-DinoV2-Getting Started](https://www.kaggle.com/code/shravankumar147/01-meta-dinov2-getting-started)
- [DINOv2](https://huggingface.co/docs/transformers/en/model_doc/dinov2) from huggingface.co
- [Building the DINO model from Scratch with PyTorch: Self-Supervised Vision Transformer](https://medium.com/thedeephub/self-supervised-vision-transformer-implementing-the-dino-model-from-scratch-with-pytorch-62203911bcc9) by Shubh Mishra
- [Deploying DINOv2 to A Rest API Endpoint for Image Classification | Modelbit](https://colab.research.google.com/github/write-with-neurl/modelbit-09/blob/main/notebook/Deploying_DINOv2_for_Image_Classification_with_Modelbit.ipynb#scrollTo=q06RxQlCzQnG)
- [DINOv2: Self-supervised Learning Model Explained](https://encord.com/blog/dinov2-self-supervised-learning-explained/), Encord Blog, November 2024.
- [How to Classify Images with DINOv2](https://blog.roboflow.com/how-to-classify-images-with-dinov2/) by James Gallagher, May 2023.


- Motivation: computer vision needs foundation models whose visual features work out of the box on any task, both at the image level (e.g., image classification) and at the pixel level (e.g., segmentation).

# <font color="red">What is DINOv2? </font>

#### Self-supervised model
- DINOv2 (self-__DIstillation of knowledge with NO labels v2__) is a family of self-supervised vision transformer foundation models. They produce universal features suitable for image-level visual tasks (image classification, instance retrieval, video understanding) as well as pixel-level visual tasks (depth estimation, semantic segmentation).
- It is trained with an advanced self-supervised learning technique, and its features help computer vision systems identify individual objects within images and video frames.

#### Self-distillation framework

- DINOv2 combines self-supervised learning (SSL) with knowledge (or model) distillation.
   - SSL trains a model on unlabeled data through auxiliary (pretext) tasks, so that it learns general-purpose representations. The pretrained model can then be adapted to a specific task using a much smaller amount of labeled data.
   - Knowledge distillation trains a smaller model (often called the “student”) to mimic a larger model (often called the “teacher”), transferring the teacher's knowledge into a model that is more efficient in size and computational requirements.
      - __Step 1__: Train the large teacher model. In classical distillation the teacher is trained on labeled data; in DINOv2 it is trained without labels. The teacher's outputs for each input then serve as targets for the student.
      - __Step 2__: Train the student models on a large unlabeled dataset to reproduce the teacher's outputs, aiming to perform as well as or better than the teacher. In short: train one large model, then distill a set of smaller models from it. This saves considerable computing cost, and the smaller DINOv2 models are built this way.
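
The teacher-student idea above can be sketched in a few lines. This is a minimal, framework-free NumPy illustration, not the DINOv2 training code: the temperatures, the momentum value, and the helper names (`softmax`, `distillation_loss`, `ema_update`) are assumptions chosen for the example. The student is trained to match the teacher's sharpened output distribution, and in DINO-style self-distillation the teacher's weights are an exponential moving average (EMA) of the student's weights.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits,
                      t_student=0.1, t_teacher=0.04):
    """Cross-entropy with the teacher distribution as the target."""
    p_teacher = softmax(teacher_logits, t_teacher)
    log_p_student = np.log(softmax(student_logits, t_student) + 1e-12)
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())

def ema_update(teacher_w, student_w, momentum=0.996):
    """Move the teacher weights a small step toward the student weights."""
    return momentum * teacher_w + (1.0 - momentum) * student_w

rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(4, 8))   # 4 samples, 8 output dimensions
student_logits = rng.normal(size=(4, 8))   # an untrained, unrelated student

# The loss is lower when the student reproduces the teacher's logits.
loss_random = distillation_loss(student_logits, teacher_logits)
loss_matched = distillation_loss(teacher_logits, teacher_logits)
print(loss_random > loss_matched)

# One EMA step with toy weight vectors.
teacher_w = ema_update(np.zeros(3), np.ones(3))
print(teacher_w)
```

In a real training loop only the student receives gradients; the teacher is updated solely through the EMA step.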

#### Power of data
- DINOv2 is trained on a colossal dataset comprising over 142 million images.
- The dataset encompasses a wide variety of scenes, objects, and viewpoints, crucial for learning representations applicable across different tasks.
- This massive scale training enables the model to learn richer, more generalizable visual representations that capture the intricate nuances of the visual world.
- Training with massive batches allows the model to learn from a more diverse set of examples simultaneously, leading to better generalization and faster convergence.

In [None]:
import requests

In [None]:
import matplotlib.pyplot as plt

In [None]:
import numpy as np

In [None]:
from PIL import Image

In [None]:
import torch
from torchvision import transforms

# <font color="red">Sample workflow</font>

### <font color="blue">Choose your device</font>

Use CUDA if available; otherwise fall back to the CPU.

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

### <font color="blue">Load the DINOv2 model</font>

In [None]:
dinov2_vits14 = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
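
DINOv2 is released as a family of backbones. As a quick reference, the sketch below lists the four patch-14 variants available through `torch.hub` together with their embedding sizes. The dictionary `DINOV2_EMBED_DIM` and the helper `embedding_dim` are hypothetical names introduced only for this illustration.

```python
# Hub entry-point name -> size of the image-level embedding it returns.
DINOV2_EMBED_DIM = {
    "dinov2_vits14": 384,   # ViT-S/14 (used in this notebook)
    "dinov2_vitb14": 768,   # ViT-B/14
    "dinov2_vitl14": 1024,  # ViT-L/14
    "dinov2_vitg14": 1536,  # ViT-g/14
}

def embedding_dim(model_name):
    """Return the expected feature size, or raise for an unknown name."""
    try:
        return DINOV2_EMBED_DIM[model_name]
    except KeyError:
        raise ValueError(f"Unknown DINOv2 variant: {model_name!r}") from None

print(embedding_dim("dinov2_vits14"))
```

Larger variants generally give richer features at a higher memory and compute cost, so the small model is a reasonable starting point on a CPU.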

### <font color="blue">Move the model to the device</font>

In [None]:
dinov2_vits14.to(device)

### <font color="blue">Set the model to evaluation mode</font>

In [None]:
dinov2_vits14.eval()

### <font color="blue">Get the image of interest</font>

In [None]:
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
response = requests.get(url, stream=True, timeout=30)
response.raise_for_status()  # Stop early if the download failed
image = Image.open(response.raw)

#### Access image properties

In [None]:
print(image.format)

In [None]:
print(image.size)

In [None]:
print(image.mode)

#### Display the image

In [None]:
plt.imshow(image);

### <font color="blue">Create an image preprocessor and apply it to the input image</font>

- We preprocess the image to make it ready for the model.
- We use the `torchvision.transforms` module to perform a series of manipulations on the image:
   - `Resize()`: Resize the input to the given size.
   - `ToTensor()`: Convert the image to a tensor with values scaled to the range [0, 1].
      - A tensor image has shape (`C`, `H`, `W`), where `C` is the number of channels and `H` and `W` are the image height and width.
   - `Normalize()`: Normalize the tensor image channel by channel, using a mean and a standard deviation for each of the three channels (here, the ImageNet statistics).
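
To make these steps concrete, here is a NumPy-only sketch of what `ToTensor()` and `Normalize()` compute for an 8-bit RGB image. It is an illustration under simplifying assumptions, not the torchvision implementation: the function name `to_tensor_and_normalize` is hypothetical, and a random array stands in for a real image.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def to_tensor_and_normalize(image_hwc_uint8):
    """Convert an (H, W, C) uint8 image to a normalized (C, H, W) float32 array."""
    # ToTensor(): scale [0, 255] -> [0.0, 1.0] and move channels first.
    chw = image_hwc_uint8.astype(np.float32).transpose(2, 0, 1) / 255.0
    # Normalize(): subtract the per-channel mean, divide by the per-channel std.
    return (chw - IMAGENET_MEAN[:, None, None]) / IMAGENET_STD[:, None, None]

rng = np.random.default_rng(0)
# Stand-in for a resized 224 x 224 RGB image.
fake_image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
normalized = to_tensor_and_normalize(fake_image)
print(normalized.shape, normalized.dtype)
```

After normalization, a pixel equal to the channel mean maps to 0, and values are expressed in units of the channel standard deviation.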

In [None]:
transform = transforms.Compose([
    transforms.Resize((224, 224)),       
    transforms.ToTensor(),              
    transforms.Normalize(                
        mean=[0.485, 0.456, 0.406], 
        std=[0.229, 0.224, 0.225]
    )
])

In [None]:
input_image = transform(image).unsqueeze(0).to(device)

In [None]:
type(input_image)

In [None]:
input_image.shape
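
The input tensor has the layout (batch, channels, height, width). The `14` in `dinov2_vits14` is the patch size, so the height and width are expected to be multiples of 14. The following sketch works out how many patch tokens the vision transformer sees for a given input size; `patch_grid` is a hypothetical helper written for this example.

```python
def patch_grid(height, width, patch_size=14):
    """Return (rows, cols, num_patches) for a ViT with square patches."""
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be multiples of the patch size")
    rows, cols = height // patch_size, width // patch_size
    return rows, cols, rows * cols

rows, cols, num_patches = patch_grid(224, 224)
print(rows, cols, num_patches)  # 16 16 256
```

A 224 x 224 image is therefore split into a 16 x 16 grid of 256 patches.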

### <font color="blue"> Feed the image to the model to extract features</font>

In [None]:
with torch.no_grad():
    features = dinov2_vits14(input_image)

In [None]:
print(features.shape)
# Expected output: torch.Size([1, 384]) for dinov2_vits14 model
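
A common next step is to compare embeddings, for example for image retrieval. The sketch below computes cosine similarity with NumPy. Random 384-dimensional vectors stand in for real DINOv2 features, and `cosine_similarity` is a helper defined here, not part of DINOv2.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D feature vectors."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
query = rng.normal(size=384)                         # stand-in for one embedding
near_duplicate = query + 0.1 * rng.normal(size=384)  # slightly perturbed copy
unrelated = rng.normal(size=384)                     # independent vector

print(cosine_similarity(query, near_duplicate))  # close to 1
print(cosine_similarity(query, unrelated))       # close to 0
```

With real features, pass `features[0].cpu().numpy()` as one of the arguments.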

In [None]:
#a = np.array(features[0].cpu().numpy()).reshape(1, -1)
#a.tolist()