## Vision Transformer-based classifier on the CIFAR-10 dataset

In [None]:
!pip install torch torchvision timm



In [None]:
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
import timm
import matplotlib.pyplot as plt

###CIFAR-10 Dataset Loading

In [None]:
transform_train = transforms.Compose([
    transforms.Resize(224),   # upscale 32x32 CIFAR-10 images to the 224x224 input this ViT expects
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(224, padding=4),
    transforms.ToTensor(),
    transforms.Normalize((0.5,0.5,0.5),(0.5,0.5,0.5))
])

transform_test = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize((0.5,0.5,0.5),(0.5,0.5,0.5))
])


Load dataset:

In [None]:
trainset = torchvision.datasets.CIFAR10(
    root='./data', train=True, download=True, transform=transform_train)

testset = torchvision.datasets.CIFAR10(
    root='./data', train=False, download=True, transform=transform_test)

trainloader = torch.utils.data.DataLoader(trainset, batch_size=32, shuffle=True)

testloader = torch.utils.data.DataLoader(testset, batch_size=32, shuffle=False)




###Vision Transformer Model

In [None]:
model = timm.create_model('vit_base_patch16_224', pretrained=True)


In [None]:
model.head = nn.Linear(model.head.in_features, 10)

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

VisionTransformer(
  (patch_embed): PatchEmbed(
    (proj): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
    (norm): Identity()
  )
  (pos_drop): Dropout(p=0.0, inplace=False)
  (patch_drop): Identity()
  (norm_pre): Identity()
  (blocks): Sequential(
    (0): Block(
      (norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
      (attn): Attention(
        (qkv): Linear(in_features=768, out_features=2304, bias=True)
        (q_norm): Identity()
        (k_norm): Identity()
        (attn_drop): Dropout(p=0.0, inplace=False)
        (norm): Identity()
        (proj): Linear(in_features=768, out_features=768, bias=True)
        (proj_drop): Dropout(p=0.0, inplace=False)
      )
      (ls1): Identity()
      (drop_path1): Identity()
      (norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
      (mlp): Mlp(
        (fc1): Linear(in_features=768, out_features=3072, bias=True)
        (act): GELU(approximate='none')
        (drop1): Dropout(p=0.0, inplace=False

###Training Setup

In [None]:
criterion = nn.CrossEntropyLoss()

optimizer = torch.optim.AdamW(model.parameters(), lr=0.0001, weight_decay=0.01)


###Training Loop

In [None]:
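for epoch in range(5):

    model.train()  # keep training-mode layers active, even if eval() was called earlier
    running_loss = 0.0

    for images, labels in trainloader:

        images, labels = images.to(device), labels.to(device)

        optimizer.zero_grad()

        outputs = model(images)

        loss = criterion(outputs, labels)

        loss.backward()

        optimizer.step()

        running_loss += loss.item()

    # report the mean loss per batch rather than the raw sum over batches
    print(f"Epoch {epoch+1}, Loss: {running_loss / len(trainloader):.4f}")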


###Testing Accuracy

In [None]:
correct = 0
total = 0

model.eval()

with torch.no_grad():

    for images, labels in testloader:

        images, labels = images.to(device), labels.to(device)

        outputs = model(images)

        predicted = outputs.argmax(dim=1)  # index of the highest logit per image

        total += labels.size(0)

        correct += (predicted == labels).sum().item()

print("Accuracy:", 100 * correct / total)
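A single overall accuracy can hide classes that the model confuses. The following is a minimal sketch of a per-class breakdown. The helper name `per_class_accuracy` and the toy tensors are assumptions made for illustration and are not part of the original notebook; in practice the logits and labels would be collected from `testloader`.

```python
import torch
import torch.nn.functional as F

def per_class_accuracy(logits, labels, num_classes=10):
    """Hypothetical helper: accuracy in percent for each class index."""
    preds = logits.argmax(dim=1)
    correct = torch.zeros(num_classes)
    total = torch.zeros(num_classes)
    for c in range(num_classes):
        mask = labels == c                      # samples whose true class is c
        total[c] = mask.sum()
        correct[c] = (preds[mask] == c).sum()   # of those, how many were predicted as c
    # clamp avoids division by zero for classes that never appear
    return 100 * correct / total.clamp(min=1)

# Toy data with 3 classes: true labels and one-hot "logits" for the predictions.
labels = torch.tensor([0, 0, 1, 1, 2])
logits = F.one_hot(torch.tensor([0, 1, 1, 1, 0]), num_classes=3).float()

acc = per_class_accuracy(logits, labels, num_classes=3)
print(acc.tolist())  # [50.0, 100.0, 0.0]
```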


##REPORT WRITING

###1) Regularization & Preprocessing Used

**Data Preprocessing**

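Image resizing to 224×224 for Vision Transformer compatibility
- CIFAR-10 images are 32×32, while `vit_base_patch16_224` expects 224×224 input
- Ensures input size matches model requirements

Normalization
- Uses mean 0.5 and standard deviation 0.5 per channel, mapping pixel values from [0, 1] to [-1, 1]
- Improves training stability
- Helps faster convergence

Random horizontal flip
- Improves generalization
- Increases data diversity

Random crop
- Takes a 224×224 crop after 4-pixel padding, so each image is shifted by a few pixels at random
- Reduces overfitting
- Helps model learn robust features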
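The preprocessing steps above can be checked on a single image. This is a small sketch that reuses the notebook's training transform; the random 32×32 image is an assumption standing in for a CIFAR-10 sample so that nothing has to be downloaded.

```python
import numpy as np
import torchvision.transforms as transforms
from PIL import Image

# Same training pipeline as in the notebook.
transform_train = transforms.Compose([
    transforms.Resize(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(224, padding=4),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

# Assumption: a random 32x32 RGB image stands in for a CIFAR-10 sample.
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
fake_image = Image.fromarray(pixels)

x = transform_train(fake_image)
print(x.shape)                          # channels, height, width
print(x.min().item(), x.max().item())   # values stay inside [-1, 1]
```

The output tensor has shape `(3, 224, 224)`, and the padded border introduced by `RandomCrop` shows up as the value -1 after normalization.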


**Regularization Techniques**

Weight decay (decoupled, via AdamW with `weight_decay=0.01`)
- Shrinks weights slightly toward zero at every step, penalizing large weights
- Helps prevent overfitting
- Improves generalization

Data augmentation
- Simulates larger dataset
- Reduces memorization

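Dropout (layers present in the ViT architecture)
- Randomly drops activations during training to avoid co-adaptation of neurons
- The printed model shows `Dropout(p=0.0)`, so dropout is inactive in this configuration unless a nonzero rate is set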
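To illustrate the weight decay item above, the sketch below takes one optimizer step on a single weight whose loss gradient is zero, so any change comes from weight decay alone. The helper `one_step` is hypothetical and written only for this comparison; the learning rate and decay match the notebook's settings.

```python
import torch

def one_step(optimizer_cls):
    """Hypothetical helper: one optimizer step on a weight with zero loss gradient."""
    w = torch.nn.Parameter(torch.tensor([1.0]))
    opt = optimizer_cls([w], lr=1e-4, weight_decay=0.01)
    w.grad = torch.zeros_like(w)  # pretend the loss gradient is exactly zero
    opt.step()
    return w.item()

# AdamW: decay is applied directly to the weight, w * (1 - lr * weight_decay).
w_adamw = one_step(torch.optim.AdamW)

# Adam: decay is added to the gradient as an L2 term, then rescaled by the
# adaptive normalization, so the first step has size close to lr.
w_adam = one_step(torch.optim.Adam)

print(f"AdamW: {w_adamw:.7f}")
print(f"Adam:  {w_adam:.7f}")
```

With these settings the AdamW weight moves by roughly 1e-6 and the Adam weight by roughly 1e-4, which shows that the two forms of weight decay are not interchangeable under an adaptive optimizer.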


###2) Hyperparameter Tuning

**Hyperparameter Tuning**

Learning rate
- Tested values: 0.001, 0.0005, 0.0001
- Lower learning rates produced more stable convergence and higher accuracy; the notebook uses 0.0001

Batch size
- Tested values: 16, 32, 64
- Larger batches give smoother gradient estimates but need more GPU memory; the notebook uses 32

Optimizer
- Compared Adam and AdamW
- AdamW applies weight decay separately from the adaptive gradient step, which tends to improve generalization

Number of epochs
- Adjusted to balance training time and model performance; the training loop runs 5 epochs
- More epochs allow better learning but may increase overfitting
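A sweep over the learning rates and batch sizes listed above could be organized as in the sketch below. This is not the procedure used for the report: `train_and_evaluate` is a hypothetical helper, and a tiny linear model on synthetic data stands in for the ViT so the example runs in seconds. For the real experiment the helper body would be replaced by the notebook's training and test loops.

```python
import itertools
import torch
import torch.nn as nn

def train_and_evaluate(lr, batch_size, epochs=3):
    """Hypothetical helper: train a stand-in model, return validation accuracy in [0, 1]."""
    torch.manual_seed(0)  # same data and initialization for every configuration
    X = torch.randn(512, 20)
    y = (X[:, 0] + X[:, 1] > 0).long()
    X_train, y_train, X_val, y_val = X[:384], y[:384], X[384:], y[384:]

    model = nn.Linear(20, 2)  # assumption: stand-in for the ViT
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.01)
    criterion = nn.CrossEntropyLoss()

    for _ in range(epochs):
        perm = torch.randperm(X_train.size(0))
        for i in range(0, X_train.size(0), batch_size):
            idx = perm[i:i + batch_size]
            optimizer.zero_grad()
            loss = criterion(model(X_train[idx]), y_train[idx])
            loss.backward()
            optimizer.step()

    with torch.no_grad():
        preds = model(X_val).argmax(dim=1)
    return (preds == y_val).float().mean().item()

learning_rates = [0.001, 0.0005, 0.0001]
batch_sizes = [16, 32, 64]

# Evaluate every combination and keep the score for each one.
results = {}
for lr, bs in itertools.product(learning_rates, batch_sizes):
    results[(lr, bs)] = train_and_evaluate(lr, bs)

best = max(results, key=results.get)
print("best (lr, batch_size):", best, "accuracy:", results[best])
```

Selecting on a validation split rather than the test set keeps the reported test accuracy unbiased.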

###3) Mechanism of Vision Transformer

**Vision Transformer Architecture**

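Vision Transformer (ViT) applies the transformer encoder architecture from NLP to image data by treating an image as a sequence of patch tokens.

Steps

1. Image is split into fixed-size patches (16×16 pixels for `vit_base_patch16_224`, so a 224×224 image yields 14×14 = 196 patches)

2. Each patch is flattened and linearly projected to an embedding vector (768 dimensions in this model)

3. A learnable classification token is prepended to the patch sequence

4. Positional embeddings are added to preserve spatial information

5. Transformer encoder blocks process the tokens using:
   - Multi-head self-attention
   - Feed-forward networks
   - Layer normalization

6. The final representation of the classification token is passed to the classification head to generate the prediction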
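The steps above can be sketched as a small PyTorch module. This is an illustrative toy, not the timm implementation: the class name `TinyViT`, the reduced embedding size, and the use of `nn.TransformerEncoder` are assumptions chosen to keep the example short.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Illustrative toy ViT; sizes are much smaller than vit_base_patch16_224."""

    def __init__(self, img_size=224, patch_size=16, dim=64, depth=2, heads=4, num_classes=10):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # 14 * 14 = 196
        # A strided convolution splits the image into patches and projects each one.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            activation="gelu", batch_first=True, norm_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        x = self.patch_embed(x)                 # (B, dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)        # (B, 196, dim): one token per patch
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)          # (B, 197, dim): class token first
        x = x + self.pos_embed                  # add positional information
        x = self.encoder(x)                     # self-attention + feed-forward blocks
        x = self.norm(x)
        return self.head(x[:, 0])               # classify from the class token

model = TinyViT()
logits = model(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 10])
```

The strided convolution mirrors the `Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))` layer in the printed model: with kernel size equal to stride, it is equivalent to flattening each patch and applying a shared linear projection.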
