# Tutorial 10-4: Vision Transformers – "Image Patches as Tokens"

**Course:** CSEN 342: Deep Learning  
**Topic:** Vision Transformers (ViTs), Patching, and Transfer Learning

## Objective
In the lecture, we learned that Vision Transformers (ViTs) divide an image into small, fixed-size patches (e.g., 16x16 pixels) and process the resulting sequence with a standard transformer. Because the patches take the place of convolutional feature extraction, ViTs need no convolutional layers, which makes them simpler and potentially more versatile than CNN-based models.

In this tutorial, we will:

1.  **Understand Patching:** See how these patches are then treated as tokens, similar to words in an NLP transformer.
2.  **Fine-Tune ViT-Base:** Adapt a pre-trained ViT-Base model (12 layers, 768 hidden size, 86M params) to classify images.
3.  **Perform Inference:** Build a pipeline to classify a new image.
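
The 86M figure quoted for ViT-Base can be reproduced with a little arithmetic. The sketch below counts encoder parameters from the layer sizes. The MLP width of 3072 (4x the hidden size) is an assumption taken from the standard ViT-Base configuration rather than from this tutorial, and `vit_param_count` is a made-up name for illustration. The classification head is left out.

```python
def vit_param_count(image_size=224, patch_size=16, channels=3,
                    hidden=768, layers=12, mlp_dim=3072):
    """Count trainable parameters of a ViT encoder (no classification head)."""
    num_patches = (image_size // patch_size) ** 2
    # Patch projection: one weight per pixel value in a patch, per hidden unit, plus bias
    patch_embed = patch_size * patch_size * channels * hidden + hidden
    cls_token = hidden
    pos_embed = (num_patches + 1) * hidden        # +1 for the [class] token
    attention = 4 * (hidden * hidden + hidden)    # Q, K, V and output projections
    mlp = (hidden * mlp_dim + mlp_dim) + (mlp_dim * hidden + hidden)
    layer_norms = 2 * (2 * hidden)                # two LayerNorms per block
    block = attention + mlp + layer_norms
    final_norm = 2 * hidden
    return patch_embed + cls_token + pos_embed + layers * block + final_norm

total = vit_param_count()
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

With the defaults this comes to about 85.8M parameters, which is usually rounded to 86M. A 10-class head adds only 768 * 10 + 10 = 7,690 more.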

**NOTE**: Run this notebook under the `Transformers Bundle` kernel.

---


## Part 0: Robust Setup (The Offline Cache)

Compute nodes often block direct downloads inside the notebook. We will manually download the pre-trained ViT model components to our local folder, similar to our NLP tutorials.


In [7]:
import os
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
import torchvision
from tqdm import tqdm

# Define paths
data_root = '../data'
model_root = '../data/vit_local'
os.makedirs(data_root, exist_ok=True)
os.makedirs(model_root, exist_ok=True)

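def download_file(url, save_path):
    # Skip files that are already present and non-empty
    if os.path.exists(save_path) and os.path.getsize(save_path) > 0:
        return
    print(f"Downloading {os.path.basename(save_path)}...")
    # wget -O creates the output file even when the download fails,
    # so check the exit status and remove any partial file
    status = os.system(f'wget -q -O "{save_path}" "{url}"')
    if status != 0:
        if os.path.exists(save_path):
            os.remove(save_path)
        raise RuntimeError(f"Download failed: {url}")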

# Download ViT-Base Model Files (Hugging Face Hub)
base_hf_url = "https://huggingface.co/google/vit-base-patch16-224-in21k/resolve/main/"
files_to_fetch = [
    "config.json",
    "pytorch_model.bin",
    "preprocessor_config.json"
]

for filename in files_to_fetch:
    download_file(base_hf_url + filename, os.path.join(model_root, filename))

print("All files ready.")


Downloading pytorch_model.bin...
All files ready.
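
Before moving on, it can be worth confirming that the three files actually arrived, since a blocked download may leave an empty file behind. This is a minimal sketch that assumes the same file names and folder as above. `missing_model_files` is a hypothetical helper, not part of any library.

```python
import os

def missing_model_files(model_dir,
                        required=("config.json", "pytorch_model.bin", "preprocessor_config.json")):
    """Return the required files that are absent or empty in model_dir."""
    missing = []
    for name in required:
        path = os.path.join(model_dir, name)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(name)
    return missing

# An empty list means every file is present and non-empty
print(missing_model_files("../data/vit_local"))
```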


---


## Part 1: The Image Processor (Patching)

Each patch is flattened and linearly projected to a fixed-size vector (768 dimensions for ViT-Base), which makes it compatible with the transformer's layers. Because self-attention has no built-in notion of spatial arrangement, a positional embedding is added to each patch embedding to provide spatial context.

CNNs use local receptive fields and build up a view of the image layer by layer, whereas a transformer with global attention can relate any patch to any other patch within a single layer.
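
The patch-to-token step described above can be sketched in a few lines of PyTorch. This is an illustrative re-implementation under simplifying assumptions, not the code inside the Hugging Face model: the names `patchify` and `PatchEmbedding` are made up, the positional embedding is a plain learnable tensor, and the [class] token is left out.

```python
import torch
import torch.nn as nn

def patchify(x, patch_size):
    """Split [B, C, H, W] images into flattened patches [B, N, C * p * p]."""
    b, c, h, w = x.shape
    p = patch_size
    x = x.unfold(2, p, p).unfold(3, p, p)    # [B, C, H/p, W/p, p, p]
    x = x.permute(0, 2, 3, 1, 4, 5)          # [B, H/p, W/p, C, p, p]
    return x.reshape(b, -1, c * p * p)       # patches in row-major order

class PatchEmbedding(nn.Module):
    def __init__(self, image_size=224, patch_size=16, channels=3, hidden=768):
        super().__init__()
        self.patch_size = patch_size
        num_patches = (image_size // patch_size) ** 2
        # Linear projection of each flattened patch to the hidden size
        self.proj = nn.Linear(patch_size * patch_size * channels, hidden)
        # One learnable position vector per patch (zero-initialized in this sketch)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, hidden))

    def forward(self, x):
        return self.proj(patchify(x, self.patch_size)) + self.pos_embed

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```

Libraries often implement the same projection as a convolution whose kernel size and stride both equal the patch size, which is mathematically equivalent to the flatten-then-project form shown here.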




In [8]:
from transformers import ViTImageProcessor
from PIL import Image
import urllib.request

# Load from local path
processor = ViTImageProcessor.from_pretrained(model_root)

# Download a sample image to see the processor in action
urllib.request.urlretrieve("http://images.cocodataset.org/val2017/000000039769.jpg", "sample.jpg")
image = Image.open("sample.jpg")

# The processor resizes the image to 224x224 and normalizes it
inputs = processor(images=image, return_tensors="pt")

print(f"Original Image Size: {image.size}")
print(f"Processed Tensor Shape: {inputs['pixel_values'].shape}")

# Note on the shape: [1, 3, 224, 224] means 1 image, 3 color channels, 224x224 pixels.
# ViT-Base splits this into patches of 16x16 pixels.
# (224 / 16) * (224 / 16) = 14 * 14 = 196 patches total!


Original Image Size: (640, 480)
Processed Tensor Shape: torch.Size([1, 3, 224, 224])
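
The resize-and-normalize step performed by the processor can be approximated by hand. This is a rough sketch that assumes bilinear resizing and a per-channel mean and standard deviation of 0.5. The values actually used live in `preprocessor_config.json` and should be checked there. `preprocess` is an illustrative name, not the library's implementation.

```python
import numpy as np
import torch
from PIL import Image

def preprocess(image, size=224, mean=0.5, std=0.5):
    """Resize a PIL image and return a normalized [1, 3, size, size] tensor."""
    image = image.convert("RGB").resize((size, size), resample=Image.BILINEAR)
    pixels = np.asarray(image, dtype=np.float32) / 255.0   # [H, W, C], values in [0, 1]
    pixels = (pixels - mean) / std                         # assumed mean/std of 0.5
    tensor = torch.from_numpy(pixels).permute(2, 0, 1)     # [C, H, W]
    return tensor.unsqueeze(0).contiguous()                # add the batch dimension

# A synthetic 640x480 image stands in for sample.jpg
demo = preprocess(Image.new("RGB", (640, 480), (255, 0, 0)))
print(demo.shape)
```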


---


## Part 2: The Dataset Class

Unlike CNNs, vision transformers have few built-in assumptions about images (such as locality), so they generally need large amounts of training data to perform well. They are also computationally intensive, because self-attention scales quadratically with the number of patches.

Because of this, ViTs benefit from transfer learning: they are pretrained on large datasets and then fine-tuned for specific tasks. To keep this tutorial fast, we will use a small subset of the CIFAR-10 dataset (400 training images and 100 validation images).
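
To make the quadratic scaling of self-attention concrete, the sketch below counts attention-score entries per layer for a few input resolutions. The 12 attention heads are an assumption based on the standard ViT-Base configuration, and `attention_cost` is a hypothetical helper.

```python
def attention_cost(image_size, patch_size=16, heads=12):
    """Return (tokens, attention-score entries per layer) for one image."""
    tokens = (image_size // patch_size) ** 2 + 1   # patches plus the [class] token
    return tokens, heads * tokens * tokens

for size in (224, 384, 448):
    tokens, scores = attention_cost(size)
    print(f"{size}x{size}: {tokens} tokens, {scores:,} attention scores per layer")
```

Doubling the image side from 224 to 448 roughly quadruples the number of tokens and therefore multiplies the attention scores by about 16.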


In [9]:
class CIFAR10Subset(Dataset):
    def __init__(self, processor, train=True, samples=400):
        # Download CIFAR10
        self.dataset = torchvision.datasets.CIFAR10(root=data_root, train=train, download=True)
        self.processor = processor
        self.samples = min(samples, len(self.dataset))  # Use only the first `samples` images
        
    def __len__(self):
        return self.samples
    
    def __getitem__(self, idx):
        image, label = self.dataset[idx]
        # Convert image to the format expected by ViT
        encoding = self.processor(images=image, return_tensors="pt")
        
        return {
            'pixel_values': encoding['pixel_values'].squeeze(0), # Remove batch dimension
            'label': torch.tensor(label, dtype=torch.long)
        }

# Create Loaders
train_ds = CIFAR10Subset(processor, train=True, samples=400)
val_ds = CIFAR10Subset(processor, train=False, samples=100)

train_loader = DataLoader(train_ds, batch_size=8, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=8)
cifar10_classes = ["airplane", "automobile", "bird", "cat", "deer", "dog", "frog", "horse", "ship", "truck"]
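
The dataset returns one dictionary per image, and the `DataLoader` stacks those dictionaries into a batch. The sketch below shows that behavior with a stand-in dataset of zero tensors, so it runs without downloading CIFAR-10. `FakeImages` is a made-up class used only for illustration.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class FakeImages(Dataset):
    """Stand-in dataset that mimics the dictionary format used above."""
    def __init__(self, n=20):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        return {
            "pixel_values": torch.zeros(3, 224, 224),          # no batch dimension
            "label": torch.tensor(idx % 10, dtype=torch.long),
        }

loader = DataLoader(FakeImages(), batch_size=8)
batch = next(iter(loader))
# The default collate function stacks each dictionary entry along a new batch axis
print(batch["pixel_values"].shape, batch["label"].shape)
```

This is why `__getitem__` removes the batch dimension with `squeeze(0)`: without it, each batch would carry an extra axis of size 1.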


---


## Part 3: The Model & Fine-Tuning

On their own, ViTs are similar to encoder-only transformers such as BERT. The multi-headed self-attention mechanism lets the model assess relationships between patches across the entire image.
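
How attention relates every patch to every other patch can be sketched directly. The code below is a simplified version of multi-head self-attention with random weights. It omits the final output projection, biases, and dropout, and the function name is an assumption for illustration rather than a library call.

```python
import math
import torch

def multi_head_self_attention(x, w_q, w_k, w_v, num_heads):
    """x: [B, N, D] token embeddings; w_*: [D, D] projection matrices."""
    b, n, d = x.shape
    head_dim = d // num_heads

    def split_heads(t):
        return t.view(b, n, num_heads, head_dim).transpose(1, 2)  # [B, heads, N, head_dim]

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)        # [B, heads, N, N]
    weights = scores.softmax(dim=-1)                              # each row sums to 1
    out = (weights @ v).transpose(1, 2).reshape(b, n, d)          # merge the heads again
    return out, weights

x = torch.randn(1, 197, 768)                          # 196 patches plus one [class] token
w_q, w_k, w_v = (torch.randn(768, 768) * 0.02 for _ in range(3))
out, weights = multi_head_self_attention(x, w_q, w_k, w_v, num_heads=12)
print(out.shape, weights.shape)
```

Each head produces a 197 x 197 weight matrix, so every token receives information from every other token in one step.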

For classification, an extra learnable [class] embedding is prepended to the patch sequence, and a classification head reads the encoder's output at that position to produce the class scores. The head is trained on the downstream task.
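
The role of the [class] embedding can be illustrated with a toy model. This is a minimal sketch built from PyTorch's generic encoder layers with small, arbitrary sizes. `TinyViTClassifier` is a hypothetical name, positional embeddings are assumed to be already added to the input, and it is not the architecture loaded from the checkpoint.

```python
import torch
import torch.nn as nn

class TinyViTClassifier(nn.Module):
    def __init__(self, hidden=64, heads=4, layers=2, num_labels=10):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, hidden))   # learnable [class] embedding
        block = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, dim_feedforward=4 * hidden, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(block, num_layers=layers)
        self.head = nn.Linear(hidden, num_labels)                  # classification head

    def forward(self, patch_tokens):                               # [B, N, hidden]
        cls = self.cls_token.expand(patch_tokens.size(0), -1, -1)
        x = torch.cat([cls, patch_tokens], dim=1)                  # [B, N + 1, hidden]
        x = self.encoder(x)
        return self.head(x[:, 0])                                  # read only the [class] position

logits = TinyViTClassifier()(torch.randn(2, 196, 64))
print(logits.shape)  # torch.Size([2, 10])
```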




In [10]:
from transformers import ViTForImageClassification

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load Model from local path
model = ViTForImageClassification.from_pretrained(
    model_root, 
    num_labels=10,  # 10 classes in CIFAR-10
    ignore_mismatched_sizes=True  # Allow a freshly initialized 10-class head even if the checkpoint's head has a different size
)
model = model.to(device)

optimizer = optim.AdamW(model.parameters(), lr=2e-5)

def train(epochs=1):
    print("Starting Fine-Tuning (this might take a few mins)...")
    for epoch in range(epochs):
        model.train()
        total_loss = 0
        
        for batch in tqdm(train_loader):
            pixel_values = batch['pixel_values'].to(device)
            labels = batch['label'].to(device)
            
            optimizer.zero_grad()
            
            outputs = model(pixel_values, labels=labels)
            loss = outputs.loss
            
            loss.backward()
            optimizer.step()
            
            total_loss += loss.item()
            
        avg_loss = total_loss / len(train_loader)
        print(f"Epoch {epoch+1}: Loss {avg_loss:.4f}")

train(epochs=1)


Some weights of ViTForImageClassification were not initialized from the model checkpoint at ../data/vit_local and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Starting Fine-Tuning (this might take a few mins)...


100%|██████████| 50/50 [00:12<00:00,  3.90it/s]

Epoch 1: Loss 2.2217
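
A loss of 2.22 is only slightly below ln(10) ≈ 2.30, the loss of a uniform guess over 10 classes, so it is worth measuring accuracy on held-out data before trusting the model. The `val_loader` built earlier is not used by the training loop. The sketch below shows one way to use it. `evaluate` is a hypothetical helper, and the stand-in model and loader exist only so the example runs on its own.

```python
from types import SimpleNamespace

import torch
import torch.nn as nn

@torch.no_grad()
def evaluate(model, loader, device):
    """Return classification accuracy over all batches in loader."""
    model.eval()
    correct = total = 0
    for batch in loader:
        pixel_values = batch["pixel_values"].to(device)
        labels = batch["label"].to(device)
        logits = model(pixel_values).logits
        correct += (logits.argmax(dim=1) == labels).sum().item()
        total += labels.numel()
    return correct / max(total, 1)

class AlwaysClassZero(nn.Module):
    """Stand-in model that mimics the `.logits` output and always predicts class 0."""
    def forward(self, pixel_values):
        logits = torch.zeros(pixel_values.size(0), 10)
        logits[:, 0] = 1.0
        return SimpleNamespace(logits=logits)

fake_loader = [{"pixel_values": torch.zeros(4, 3, 224, 224),
                "label": torch.tensor([0, 1, 0, 2])}]
print(evaluate(AlwaysClassZero(), fake_loader, torch.device("cpu")))  # 0.5
```

In the notebook, the equivalent call would be `evaluate(model, val_loader, device)`.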





---


## Part 4: Inference

Let's test the fine-tuned model on an image from the CIFAR-10 test split, which it did not see during training.


In [11]:
def predict(image):
    model.eval()
    inputs = processor(images=image, return_tensors="pt")
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        prediction = torch.argmax(logits, dim=1).item()
        
    return cifar10_classes[prediction]

# Grab a test image
test_dataset = torchvision.datasets.CIFAR10(root=data_root, train=False, download=True)
test_image, true_label = test_dataset[0]

print("--- Predictions ---")
print(f"True Label:      {cifar10_classes[true_label]}")
print(f"Predicted Label: {predict(test_image)}")


--- Predictions ---
True Label:      cat
Predicted Label: cat
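
A single correct prediction says little about how confident the model is. Converting the logits to probabilities and listing the top few classes gives a fuller picture. This is a minimal sketch: the logits below are made-up numbers rather than output from the fine-tuned model, and `top_k_predictions` is a hypothetical helper.

```python
import torch

cifar10_classes = ["airplane", "automobile", "bird", "cat", "deer",
                   "dog", "frog", "horse", "ship", "truck"]

def top_k_predictions(logits, class_names, k=3):
    """Return the k most likely (class name, probability) pairs for 1-D logits."""
    probs = torch.softmax(logits, dim=-1)
    values, indices = torch.topk(probs, k)
    return [(class_names[i], p) for p, i in zip(values.tolist(), indices.tolist())]

# Made-up logits for illustration only
example_logits = torch.tensor([0.1, 0.2, 0.0, 3.0, 0.5, 1.0, 0.0, 0.0, 0.0, 0.0])
for name, prob in top_k_predictions(example_logits, cifar10_classes):
    print(f"{name:<10} {prob:.3f}")
```

Inside `predict`, the same function could be applied to `outputs.logits[0]`.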


### Conclusion
You have successfully fine-tuned a Vision Transformer!

**Important Note:**
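The ViT used here is an encoder: it turns an image into patch embeddings for tasks such as classification and does not generate images. Image generation is typically done with diffusion models, some of which use transformer backbones that operate on patches.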
