## Goals
The goals of this assignment are to:

1. Build a Vision Transformer (ViT) from scratch.
2. Build a semantic segmentation model using a ViT encoder.

By the end of this assignment, you will have gained experience with:

- Working with PyTorch and the MiniPlaces dataset for image classification.
- Implementing and training different types of neural networks using PyTorch.
- Debugging and troubleshooting issues that may arise during the development process.


Good luck and happy coding! Remember, the most important thing is to have fun and learn something new.




## Setup Code


To begin, you will need to download the MiniPlaces dataset using the provided link.

-----


Recall the introduction to Colab's storage system from Assignment 1. For efficient development of our models, we will again use the temporary storage space to hold our data. This means that every time you open this notebook, you will need to re-download and process the dataset. Don't worry though - this shouldn't take long, usually just a minute or less. Okay, let's get started!

In [None]:
!pip install einops
# Downloading the archive usually takes well under a minute.
# Download the tar.gz file with wget (gdown is only needed for Google Drive links).
!pip3 install --upgrade gdown --quiet
!wget https://web.cs.ucla.edu/~smo3/data.tar.gz


Collecting einops
  Downloading einops-0.7.0-py3-none-any.whl (44 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44.6/44.6 kB 1.8 MB/s eta 0:00:00
Installing collected packages: einops
Successfully installed einops-0.7.0
--2024-03-02 20:30:30--  https://web.cs.ucla.edu/~smo3/data.tar.gz
Resolving web.cs.ucla.edu (web.cs.ucla.edu)... 131.179.128.29
Connecting to web.cs.ucla.edu (web.cs.ucla.edu)|131.179.128.29|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 460347416 (439M) [application/x-gzip]
Saving to: ‘data.tar.gz’


2024-03-02 20:30:57 (17.0 MB/s) - ‘data.tar.gz’ saved [460347416/460347416]



In [None]:
import os
import tarfile
from tqdm import tqdm
import urllib.request

def setup(file_link_dict=None,
          folder_name='Assignment3'):
  # Avoid a mutable default argument
  file_link_dict = file_link_dict or {}
  # Let's make our assignment directory
  CS188_path = './'
  # Now, let's specify the assignment path we will be working with as the root.
  root_dir = os.path.join(CS188_path, folder_name)
  data_dir = os.path.join(root_dir, 'data')
  os.makedirs(data_dir, exist_ok=True)
  # Open the tar.gz file and extract it into the "./<folder_name>/data" folder
  with tarfile.open("data.tar.gz", "r:gz") as tar:
      members = tar.getmembers()
      total_size = sum(m.size for m in members)
      with tqdm(total=total_size, unit="B", unit_scale=True, desc="Extracting tar.gz file") as pbar:
          for member in members:
              tar.extract(member, data_dir)
              pbar.update(member.size)
  # Next, we download the train/val/test txt files:
  for file_name, file_link in file_link_dict.items():
      print(f'Downloading {file_name}.txt from {file_link}')
      urllib.request.urlretrieve(file_link, f'{root_dir}/data/{file_name}.txt')
  return root_dir

In [None]:

val_url = 'https://raw.githubusercontent.com/CSAILVision/miniplaces/master/data/val.txt'
train_url = 'https://raw.githubusercontent.com/CSAILVision/miniplaces/master/data/train.txt'
root_dir = setup(
    file_link_dict={'train':train_url, 'val':val_url},
    folder_name='Assignment3')

Extracting tar.gz file: 100%|██████████| 566M/566M [00:33<00:00, 16.9MB/s]


Downloading train.txt from https://raw.githubusercontent.com/CSAILVision/miniplaces/master/data/train.txt
Downloading val.txt from https://raw.githubusercontent.com/CSAILVision/miniplaces/master/data/val.txt


### Define the data transform


In [None]:
from torchvision import transforms
import torch

image_net_mean = torch.Tensor([0.485, 0.456, 0.406])
image_net_std = torch.Tensor([0.229, 0.224, 0.225])

# Notice that we resize images to 128x128 instead of 64x64.
data_transform = transforms.Compose([
    transforms.Resize([128, 128]),
    transforms.ToTensor(),
    transforms.Normalize(image_net_mean, image_net_std),
])


### Define the dataset and dataloader

In [None]:
# You can copy your dataset from Assignment2.
from einops import rearrange, repeat
from einops.layers.torch import Rearrange
import os
import torch.nn as nn
from torch.utils.data import Dataset
from PIL import Image
import numpy as np

class MiniPlaces(Dataset):
    def __init__(self, root_dir, split, transform=None, label_dict=None):
        """
        Initialize the MiniPlaces dataset with the root directory for the images,
        the split (train/val/test), an optional data transformation,
        and an optional label dictionary.

        Args:
            root_dir (str): Root directory for the MiniPlaces images.
            split (str): Split to use ('train', 'val', or 'test').
            transform (callable, optional): Optional data transformation to apply to the images.
            label_dict (dict, optional): Optional dictionary mapping integer labels to class names.
        """
        assert split in ['train', 'val', 'test']
        self.root_dir = root_dir
        self.split = split
        self.transform = transform
        self.filenames = []
        self.labels = []

        self.label_dict = label_dict if label_dict is not None else {}

        with open(f'{root_dir}/{split}.txt', 'r') as f:
            lines = f.readlines()

        for line in lines:
            image_path, label = line.split(' ')
            label = int(label)
            self.filenames.append(image_path)
            self.labels.append(label)
            if split == 'train':
                self.label_dict[label] = image_path.split('/')[-2]

    def __len__(self):
        """
        Return the number of images in the dataset.

        Returns:
            int: Number of images in the dataset.
        """
        return len(self.filenames)

    def __getitem__(self, idx):
        """
        Return a single image and its corresponding label when given an index.

        Args:
            idx (int): Index of the image to retrieve.

        Returns:
            tuple: Tuple containing the image and its label.
        """
        label = self.labels[idx]
        image_path = self.filenames[idx]
        image = Image.open(os.path.join(self.root_dir, f'images/{image_path}'))
        # Only apply the transform if one was provided (it is optional).
        if self.transform is not None:
            image = self.transform(image)

        return image, label

### Define the train method

In [None]:
def train(model, train_loader, val_loader, optimizer, criterion, device, num_epochs):
    """
    Train the model on the training set and evaluate it on the validation set every epoch.

    Args:
        model (nn.Module): Model to train (e.g., the ViT built in this assignment).
        train_loader (torch.utils.data.DataLoader): Data loader for the training set.
        val_loader (torch.utils.data.DataLoader): Data loader for the validation set.
        optimizer (torch.optim.Optimizer): Optimizer to use for training.
        criterion (callable): Loss function to use for training.
        device (torch.device): Device to use for training.
        num_epochs (int): Number of epochs to train the model.
    """
    # Place model on device
    model = model.to(device)

    for epoch in range(num_epochs):
        model.train()  # Set model to training mode

        # Use tqdm to display a progress bar during training
        with tqdm(total=len(train_loader), desc=f'Epoch {epoch + 1}/{num_epochs}') as pbar:
            for inputs, labels in train_loader:
                # Move inputs and labels to device
                inputs = inputs.to(device)
                labels = labels.to(device)

                # Zero out gradients
                optimizer.zero_grad()

                # Compute the logits and loss
                logits = model(inputs)
                loss = criterion(logits, labels)

                # Backpropagate the loss
                loss.backward()

                # Update the weights
                optimizer.step()

                # Update the progress bar
                pbar.update(1)
                pbar.set_postfix(loss=loss.item())

        # Evaluate the model on the validation set
        avg_loss, accuracy = evaluate(model, val_loader, criterion, device)
        print(f'Validation set: Average loss = {avg_loss:.4f}, Accuracy = {accuracy:.4f}')

def evaluate(model, test_loader, criterion, device):
    """
    Evaluate the model on the given data loader.

    Args:
        model (nn.Module): Model to evaluate.
        test_loader (torch.utils.data.DataLoader): Data loader for the test set.
        criterion (callable): Loss function to use for evaluation.
        device (torch.device): Device to use for evaluation.

    Returns:
        float: Average loss on the test set.
        float: Accuracy on the test set.
    """
    model.eval()  # Set model to evaluation mode

    with torch.no_grad():
        total_loss = 0.0
        num_correct = 0
        num_samples = 0

        for inputs, labels in test_loader:
            # Move inputs and labels to device
            inputs = inputs.to(device)
            labels = labels.to(device)

            # Compute the logits and loss
            logits = model(inputs)
            loss = criterion(logits, labels)
            total_loss += loss.item()

            # Compute the accuracy
            _, predictions = torch.max(logits, dim=1)
            num_correct += (predictions == labels).sum().item()
            num_samples += len(inputs)

    # Compute the average loss and accuracy
    avg_loss = total_loss / len(test_loader)
    accuracy = num_correct / num_samples

    return avg_loss, accuracy

In [None]:
def compute_distances_no_loops(x_train, x_test):
  # Pairwise Euclidean distances, computed without loops via
  # ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2.
  num_train = x_train.shape[0]
  num_test = x_test.shape[0]

  A = x_train.reshape(num_train,-1)
  B = x_test.reshape(num_test,-1)
  AB2 = A.mm(B.T)*2
  sq_dists = (A**2).sum(dim = 1).reshape(-1,1) - AB2 + (B**2).sum(dim = 1).reshape(1,-1)
  # Clamp tiny negative values caused by floating-point error before the square root.
  dists = sq_dists.clamp(min=0)**(1/2)
  return dists

def predict_labels(dists, y_train, k=1):
  num_train, num_test = dists.shape
  y_pred = torch.zeros(num_test, dtype=torch.int64)

  values, indices = torch.topk(dists, k, dim=0, largest=False)
  for i in range(indices.shape[1]):
    _, idx = torch.max(y_train[indices[:,i]].bincount(), dim = 0)
    y_pred[i] = idx
  return indices, y_pred

class KnnClassifier:
  def __init__(self, x_train, y_train):
    self.x_train = x_train
    self.y_train = y_train

  def predict(self, x_test, k=1):
    y_test_pred = None

    dists = compute_distances_no_loops(self.x_train, x_test)
    _, y_test_pred =  predict_labels(dists, self.y_train, k)

    return y_test_pred

  def check_accuracy(self, x_test, y_test, k=1, quiet=False):
    y_test_pred = self.predict(x_test, k=k)
    num_samples = x_test.shape[0]
    num_correct = (y_test == y_test_pred).sum().item()
    accuracy = 100.0 * num_correct / num_samples
    msg = (f'Got {num_correct} / {num_samples} correct; '
           f'accuracy is {accuracy:.2f}%')
    if not quiet:
      print(msg)
    return accuracy

In [None]:
# Also, seed everything for reproducibility
# code from https://gist.github.com/ihoromi4/b681a9088f348942b01711f251e5f964#file-seed_everything-py
def seed_everything(seed: int):
    import random, os
    import numpy as np
    import torch
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    # Benchmark mode may select different algorithms between runs, so disable it for reproducibility
    torch.backends.cudnn.benchmark = False

In [None]:
# Define the device to use for training
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
if device == torch.device('cuda'):
    print(f'Using device: {device}. Good to go!')
else:
    print('Please set GPU via Edit -> Notebook Settings.')

Using device: cuda. Good to go!


In [None]:
! nvidia-smi

Sat Mar  2 20:31:45 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   53C    P8              10W /  70W |      3MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

## Q1: Steps to build a ViT from scratch
Vision Transformer (ViT) is a state-of-the-art neural network architecture for image classification tasks. Unlike convolutional neural networks (CNNs), which have been the standard in computer vision for many years, ViT relies on a self-attention mechanism to extract features from images. This approach has been shown to achieve competitive results on various benchmark datasets, while also offering the flexibility to handle tasks that require attention over long-range dependencies in images. ViT has quickly gained popularity in the computer vision community and has spurred further research into the use of self-attention mechanisms in other areas of deep learning.

You will implement the ViT model on the MiniPlaces dataset.

To implement a ViT model for image classification, you will need to follow these steps:
1. Patch embedding: Extract feature vectors from the input images using a trainable linear projection layer, which converts the flattened 2D image patches into 1D feature vectors.
2. Positional encoding: Add a learnable positional encoding to each feature vector, which provides spatial information to the model.
3. Transformer encoder: Stack multiple Transformer encoder layers to process the encoded features, which allows the model to learn both local and global interactions between the image patches.
4. Classification head: Add a classification head on top of the final encoded feature vector, which maps the learned representations to the corresponding class labels.
5. Training and evaluation: Train the ViT model using an appropriate optimization algorithm and loss function, and evaluate its performance on the validation and testing sets.

If you are not familiar with the ViT model, you can read our textbook chapter [Transformers for Vision](https://d2l.ai/chapter_attention-mechanisms-and-transformers/vision-transformer.html#fig-vit), or review our [discussion slides](https://drive.google.com/file/d/1RKSnE9MOAGBu9T-_2TaBEm4ASF189Fms/view).

### Q1.1: Tokenization:
At this step, we need to divide each image into a set of non-overlapping patches, and treat each patch as a token. This is the key step that distinguishes ViT from other computer vision models.
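
To get a feel for the shapes involved, here is a minimal sketch that counts the tokens produced for one image. The image size (128x128) and patch size (16) are example values chosen for illustration, and `torch.Tensor.unfold` is used here only as one possible way to cut the patches; it is not the method the exercise below asks for.

```python
import torch

# Hypothetical example values: a 128x128 RGB image and 16x16 patches.
C, H, W = 3, 128, 128
patch_size = 16

img = torch.rand(C, H, W)

# unfold(dim, size, step) slides a window along one dimension.
# After two unfolds the shape is (C, H/p, W/p, p, p).
windows = img.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)

# Reorder to (H/p, W/p, p, p, C) so each patch is flattened in (row, col, channel) order.
tokens = windows.permute(1, 2, 3, 4, 0).reshape(-1, patch_size * patch_size * C)

num_patches = (H // patch_size) * (W // patch_size)
print(num_patches, tokens.shape)  # 64 tokens, each of length 16 * 16 * 3 = 768
```

Each row of `tokens` is one patch, so the sequence length grows quadratically as the patch size shrinks.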

#### Q1.1.1 Tokenize_image Method

In [None]:
def tokenize_image(img, patch_size=16, stride=16):
    """
    Tokenize an image into non-overlapping image patches.
    Args:
        img (torch.Tensor): The input image with shape (C, H, W).
        patch_size (int): The size of each patch.
        stride (int): The stride of the sliding window.
    Returns:
        patches (torch.Tensor): The tokenized patches with shape (N, patch_size*patch_size*C).
    """

    C, H, W = img.shape
    patches = []
    # Each patch is flattened into a 1-dimensional vector and stacked into a
    # tensor with shape (N, patch_size(H) * patch_size(W) * C), where N is the number of patches.
    # We only consider the case where the image size is divisible by patch_size.

    # Additionally, before flattening, remember to permute the patch so that
    # it has shape (patch_size(H), patch_size(W), C)
    for i in range(0,H,stride):
        for j in range(0,W,stride):
            flat = img[:,i:i+patch_size,j:j+patch_size]
            flat = flat.permute(1, 2, 0)
            flat = torch.flatten(flat)
            patches.append(flat)
    patches = torch.stack(patches)
    return patches

In [None]:

# test your implementation of tokenize_image
random_img = torch.rand(3,64,64)
patched_img = tokenize_image(random_img,8,8)

for i in [32,16,8,4,2]:
    out = tokenize_image(random_img,i,i)

    fast_patch = Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = i, p2 = i)

    answer = fast_patch(random_img.unsqueeze(0))
    equal = torch.allclose(out,answer.squeeze(0))
    #print('Difference: ', equal)
    if equal:
      print('Good! For patch_size: %d, the output match' %(i))
    else:
      print('Uh-oh! For patch_size: %d, the output are different' %(i))
      break

Good! For patch_size: 32, the output match
Good! For patch_size: 16, the output match
Good! For patch_size: 8, the output match
Good! For patch_size: 4, the output match
Good! For patch_size: 2, the output match


#### Q1.1.2 Linear Projection Layer

At this step, you will implement the linear projection layer combined with the tokenize operation.

This layer converts an image into a sequence of patch embeddings.

In [None]:
class Tokenization_layer(nn.Module):
  def __init__(self, dim, patch_dim,patch_height, patch_width):
    super().__init__()
    """
        Args:
          dim (int): output (embedding) dimension.
          patch_dim (int): flattened vector dimension of one image patch.
          patch_height (int): height of one image patch.
          patch_width (int): width of one image patch.

        You can use PyTorch's built-in functions and the Rearrange layer above.
        Input and output shapes of each layer:
        1) Rearrange the image: (batch_size, channels, H,W) -> (batch_size,N,patch_dim)
        2) Norm Layer1 (LayerNorm): (batch_size,N,patch_dim) -> (batch_size,N,patch_dim)
        3) Linear Projection layer: (batch_size,N,patch_dim) -> (batch_size,N,dim)
        4) Norm Layer2 (LayerNorm): (batch_size,N,dim) -> (batch_size,N,dim)
    """

    self.to_patch = Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = patch_height, p2 = patch_width)
    self.norm1 = None
    self.fc1 = None
    self.norm2 = None

    ################# Your Implementations #################################
    # Hints: You can use the Rearrange method above to achieve faster patch operation
    # Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = patch_height, p2 = patch_width)


    ################# End of your Implementations ##########################

  def forward(self, x):
    """
    Args:
      x (torch.Tensor): input tensor in the shape of (batch_size,C,H,W)
    Return:
      out (torch.Tensor): output patch embedding tensor in the shape of (batch_size,N,dim)

     The input tensor 'x' should pass through the following layers:
     1) self.to_patch: Rearrange image
     2) self.norm1: LayerNorm
     3) self.fc1: Fully-Connected layer
     4) self.norm2: LayerNorm

    """

    out = None
    ################# Your Implementations #################################


    ################# End of your Implementations ##########################
    return out


### Q1.2 Attention:
You will need to follow these steps to implement multi-head attention in this question.
1. **Obtain Q,K,V vectors**: Conceptually, the Q, K, and V vectors are obtained by passing the input vectors through three distinct linear layers. In our implementation, we use a single linear layer with 3xD output channels and then divide the output into three chunks. We treat the first chunk as the Q vectors, the second chunk as the K vectors, and the last chunk as the V vectors.

2. **Calculate similarity**: Compute the similarity scores between query vectors and a set of key vectors using a dot product.

3. **Apply softmax**: Apply a softmax function to normalize the similarity scores across the key vectors. This creates a probability distribution that represents the relative importance of each key vector with respect to the query vector.

4. **Compute weighted sum**: Compute a weighted sum of the **value vectors**, where the weights are the probability distribution obtained in step 3. This produces a context vector that summarizes the most relevant information from the value vectors with respect to the query vector.

5. **Concatenate output**: The outputs of each head are then concatenated and passed through another linear projection to produce the final output.

For more details, you can read our [textbook](https://d2l.ai/chapter_attention-mechanisms-and-transformers/index.html).
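
Steps 1 to 4 above can be sketched for a single head on toy tensors. This is only an illustration under stated assumptions: the sizes are made up, there is one head and no dropout, and the scores are scaled by the square root of the head dimension, as the forward pass of the exercise below also requires. It is not the `Attention` class you are asked to write.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch_size, N, dim = 2, 4, 8   # hypothetical toy sizes
x = torch.rand(batch_size, N, dim)

# Step 1: one linear layer with 3 * dim outputs, split into Q, K, V.
to_qkv = nn.Linear(dim, 3 * dim, bias=False)
q, k, v = to_qkv(x).chunk(3, dim=-1)            # each (batch_size, N, dim)

# Step 2: dot-product similarity, scaled by the square root of the head dimension.
scores = torch.matmul(q, k.transpose(-1, -2)) / dim ** 0.5   # (batch_size, N, N)

# Step 3: softmax over the key axis, so each query's weights sum to 1.
weights = scores.softmax(dim=-1)

# Step 4: weighted sum of the value vectors.
context = torch.matmul(weights, v)              # (batch_size, N, dim)

print(weights.sum(dim=-1))   # every entry is (approximately) 1
print(context.shape)
```

With several heads, the same computation runs in parallel on each head's slice of Q, K, and V, and step 5 concatenates the per-head results.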

In [None]:
import torch.nn.functional as F
class Attention(nn.Module):
    def __init__(self, dim, heads = 8, dim_head = 64, dropout = 0.):
        super().__init__()
        """
        Args:
          dim (int): input and output dimension.
          heads (int): number of attention heads.
          dim_head (int): input dimension of each attention head.
          dropout (float): dropout rate for attention and final_linear layer.

        Initialize an attention block.
        You can use PyTorch's built-in functions.
        Input and output shapes of each layer:
        1) Define the inner dimension as number of heads* dimension of each head
        2) to_qkv: (batch_size, dim) -> (batch_size,3*inner_dimension)
        3) final_linear: (batch_size, inner_dim) -> (batch_size, dim)
        """

        self.heads = heads
        self.dim_head = dim_head

        self.inner_dim = dim_head *  heads


        self.attend = None
        self.dropout = None
        self.final_linear = None


        # Here, you should define
        # 1) self.to_qkv: (batch_size, dim) -> (batch_size,3*inner_dimension)
        # 2) self.dropout: Dropout layer with ratio defined by dropout variable
        # 3) self.final_linear: (batch_size, inner_dim) -> (batch_size, dim)
        ################# Your Implementations #################################



        ################# End of your Implementations ##########################

    def forward(self, x):
        '''
        Forward pass of the attention block.
        Args:
            x (torch.Tensor): input tensor in the shape of (batch_size,N,dim).
        Returns:
            out (torch.Tensor): output tensor in the shape of (batch_size,N,dim).

        The input tensor 'x' should pass through the following layers:
        1) to_qkv: (batch_size,N,dim) -> (batch_size,N,3*inner_dimension)
        2) Divide the output of to_qkv into q, k, v, and then split each of them into heads
            (batch_size,N,inner_dim) -> (batch_size,num_head,N,head_dim)
        3) Use torch.matmul to get the product of q and the transpose of k
        4) Divide the above tensor by the square root of the head dimension
        5) Apply softmax and then dropout on the above tensor
        6) Multiply the above tensor with v to get the attention output
        7) Concatenate the attention outputs from all heads
            (batch_size,num_head,N,head_dim) -> (batch_size,N,inner_dim)
        8) Pass the output from the last step to a fully connected layer
        9) Apply dropout to the output of the last step
        '''
        out = None

        qkv = self.to_qkv(x).chunk(3, dim = -1)
        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = self.heads), qkv)
        ################# Your Implementations #################################
        # Hint: you can use
        #    out = rearrange(out, 'b h n d -> b n (h d)')
        # to concatenate the output from all attention heads.
        # This operation changes the tensor shape from (batch_size,num_head,N,head_dim)
        # to (batch_size,N,inner_dim)

        ################# End of your Implementations ##########################
        return out

In [None]:
# You can use this cell to check the output shape of the attention layer
for dim in [512,768,1096]:
  test_tensor = torch.rand(2,196,dim)
  att_layer = Attention(dim,8,64,0.4)
  output_tensor = att_layer(test_tensor)
  equal =  test_tensor.shape == output_tensor.shape
  if equal:
    print('Good! For input dim: %d, the output shape is correct' %(dim))
  else:
    print('Uh-oh! For input dim: %d, the output shape is wrong' %(dim))
    break

Good! For input dim: 512, the output shape is correct
Good! For input dim: 768, the output shape is correct
Good! For input dim: 1096, the output shape is correct


The norm layer in Vision Transformer (ViT) performs layer normalization on its input. In this assignment we use the pre-norm arrangement: the norm layer is applied before the Multi-Head Attention (MHA) and MLP layers, and the residual connection is added afterwards. Normalizing the activations of each token helps stabilize training and helps the model learn better representations.
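
As a quick illustration of what layer normalization does, the sketch below normalizes a batch of token vectors and checks the per-token statistics. The tensor sizes and the shift and scale applied to the input are arbitrary example values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim = 768
x = torch.rand(2, 196, dim) * 5 + 3   # tokens with non-zero mean and non-unit variance

norm = nn.LayerNorm(dim)               # normalizes over the last (feature) dimension
y = norm(x)

# At initialization (weight = 1, bias = 0) every token has roughly zero mean
# and unit variance across its features.
token_mean = y.mean(dim=-1)
token_var = y.var(dim=-1, unbiased=False)
print(token_mean.abs().max(), token_var.min(), token_var.max())
```

Note that the statistics are computed per token, independently of the other tokens and of the other images in the batch.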

In [None]:
### PreNorm function
class PreNorm(nn.Module):
    def __init__(self, dim, fn):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fn = fn
    def forward(self, x, **kwargs):
        # keep the residual connection here
        return self.fn(self.norm(x), **kwargs)+x

In [None]:
# You can use PreNorm to combine a layer norm (plus residual connection) with any other layer:
a = PreNorm(768, Attention(768, heads = 8, dim_head = 64, dropout = 0.2))
test_tensor = torch.rand(2,196,768)
# you can use the following line to do the forward pass
output_tensor = a(test_tensor)

### Q1.3 PositionwiseFeedForward
You will need to implement the PositionwiseFeedForward layer in Vision Transformer.

The FFN layer is called "position-wise" because it applies the same feedforward network to each position in the sequence independently. It consists of two linear transformations with a non-linear activation function in between, typically GELU. The first linear transformation maps the input feature vector from its original dimension to a higher-dimensional space, and the second linear transformation maps it back to the original dimension. In the Transformer block, the output of the FFN layer is added element-wise to its input through a residual connection (handled here by `PreNorm`).
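
The "position-wise" property can be checked directly: running a small MLP on the whole sequence gives the same result as running it on each position separately. The sketch below uses made-up toy sizes and leaves out dropout so that the comparison is deterministic; it is an illustration, not the class you are asked to implement.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim, mlp_dim, N = 8, 32, 5     # hypothetical toy sizes

ffn = nn.Sequential(
    nn.Linear(dim, mlp_dim),   # expand to the hidden dimension
    nn.GELU(),
    nn.Linear(mlp_dim, dim),   # project back to the input dimension
)

x = torch.rand(1, N, dim)

# Apply the network to the whole sequence at once ...
out_all = ffn(x)
# ... and to each position separately, then stack the results.
out_each = torch.stack([ffn(x[:, i]) for i in range(N)], dim=1)

same = torch.allclose(out_all, out_each, atol=1e-5)
print(same)
```

Because no information flows between positions here, all mixing across patches has to come from the attention layer.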

In [None]:
class PositionwiseFeedForward(nn.Module):
    "Implements FFN equation."
    def __init__(self, dim, mlp_dim, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        """
         Args:
          dim (int): input and output dimension.
          mlp_dim (int): the output dimension of the first layer.
          dropout (float): dropout rate for both linear layers.

        Initialize an MLP.
        You can use Pytorch's built-in nn.Linear function.
        Input and output sizes of each layer:
          1) fc1: dim, mlp_dim
          2) fc2: mlp_dim, dim
        """

        self.fc1 = None
        self.fc2 = None
        self.dropout = None
        self.activation = nn.GELU()
        ################# Your Implementations #################################


        ################# End of your Implementations ##########################

    def forward(self, x):
        '''
        Args:
            x (torch.Tensor): input tensor in the shape of (batch_size,N,dim).
        Returns:
            out (torch.Tensor): output tensor in the shape of (batch_size,N,dim).

        The input tensor 'x' should pass through the following layers:
        1) fc1: (batch_size,N,dim) ->  (batch_size,N,mlp_dim)
        2) Apply activation function
        3) Apply dropout
        4) fc2: (batch_size,N,mlp_dim) -> (batch_size,N,dim)
        5) Apply dropout
        '''

        out = None
        ################# Your Implementations #################################

        ################# End of your Implementations ##########################
        return out

In [None]:
# You can use this cell to check the output shape of PositionwiseFeedForward
for dim in [512,768,1096]:
  test_tensor = torch.rand(2,196,dim)
  ffn = PositionwiseFeedForward(dim,dim*4,0.1)
  output_tensor = ffn(test_tensor)
  equal =  test_tensor.shape == output_tensor.shape
  if equal:
    print('Good! For input dim: %d, the output shape is correct' %(dim))
  else:
    print('Uh-oh! For input dim: %d, the output shape is wrong' %(dim))
    break

Good! For input dim: 512, the output shape is correct
Good! For input dim: 768, the output shape is correct
Good! For input dim: 1096, the output shape is correct


### Q1.4 TransformerBlock
Now you can follow the steps below and use the classes above to implement the standard Transformer block, as demonstrated in the following image.

 <img src="https://web.cs.ucla.edu/~smo3/cs188/assignment3/transformer_block.png"  width="20%" height="40%">

1. Apply layer norm to the input tensor.
2. Apply the Multi-Head Attention (MHA) layer to the output tensor from step 1. The MHA layer takes in the normalized tensor and returns the attention output tensor.
3. Add the residual connection to the output of the MHA layer.
4. Apply layer norm to the output of the previous step.
5. Apply the Position-wise Feedforward Network (FFN) layer to the output of the previous step. The FFN layer takes in the output tensor, and returns the transformed output tensor.
6. Add the residual connection to the output of the FFN layer.
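
The six steps above can be sketched as follows. To keep the example self-contained, PyTorch's built-in `nn.MultiheadAttention` stands in for the custom `Attention` class, and the class name `PreNormBlockSketch` and all sizes are hypothetical. Treat it as an illustration of the data flow rather than the expected solution, which should be built from `PreNorm`, `Attention`, and `PositionwiseFeedForward`.

```python
import torch
import torch.nn as nn

class PreNormBlockSketch(nn.Module):
    """Hypothetical illustration; not the assignment's Transformer class."""
    def __init__(self, dim, heads, mlp_dim):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # PyTorch's built-in attention stands in for the custom Attention class.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim)
        )

    def forward(self, x):
        h = self.norm1(x)                                      # step 1
        attn_out, _ = self.attn(h, h, h, need_weights=False)   # step 2
        x = x + attn_out                                       # step 3
        x = x + self.ffn(self.norm2(x))                        # steps 4 to 6
        return x

block = PreNormBlockSketch(dim=64, heads=4, mlp_dim=256)
out = block(torch.rand(2, 16, 64))
print(out.shape)
```

The input and output shapes are identical, which is what allows several blocks to be stacked.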

In [None]:
class Transformer(nn.Module):
    def __init__(self, dim, heads, dim_head, mlp_dim, dropout = 0.):
        "Implements Transformer block."
        super().__init__()
        '''
        Args:
          dim (int): input and output dimension.
          heads (int): number of attention heads.
          dim_head (int): input dimension of each attention head.
          mlp_dim (int): hidden (inner) dimension of the FFN layer.
          dropout (float): dropout rate for attention and FFN layers.

        '''
        # Use the PreNorm, Attention and PositionwiseFeedForward classes to build your
        # Transformer block
        self.attn = None
        self.ff = None

        ################# Your Implementations #################################

        ################# End of your Implementations ##########################

    def forward(self, x):
        """
        Args:
            x (torch.Tensor): input tensor in the shape of (batch_size,N,dim).
        Returns:
            out (torch.Tensor): output tensor in the shape of (batch_size,N,dim).
        """
        ################# Your Implementations #################################

        ################# End of your Implementations ##########################
        return x

In [None]:
# You can use this cell to check the output shape of Transformer
for dim in [512,768,1096]:
  test_tensor = torch.rand(2,196,dim)
  transformer_block = Transformer(dim,8,64,dim*4,0.1)
  output_tensor = transformer_block(test_tensor)
  equal =  test_tensor.shape == output_tensor.shape
  if equal:
    print('Good! For input dim: %d, the output shape is correct' %(dim))
  else:
    print('Uh-oh! For input dim: %d, the output shape is wrong' %(dim))
    break

Good! For input dim: 512, the output shape is correct
Good! For input dim: 768, the output shape is correct
Good! For input dim: 1096, the output shape is correct


### Q1.5 ViTModel
Now you can use the classes above to build your Vision Transformer. Recall the ViT architecture.

 <img src="https://web.cs.ucla.edu/~smo3/cs188/assignment3/vit.png"  width="40%" height="40%">

 Recall the pipeline for the Vision Transformer model:

1. Load the input images and preprocess them into a set of image patches. The patches should be non-overlapping and should cover the entire input image. Each patch should be flattened into a vector and projected into a lower-dimensional/equal-dimensional space using a linear layer.

2. Add cls token and learnable positional embeddings to the projected patch vectors. The positional embedding should encode the spatial location of each patch in the input image.

3. Stack several Transformer blocks to process the patch vectors. Each Transformer block should consist of a Multi-Head Attention (MHA) layer and a Position-wise Feedforward Network (FFN) layer, with residual connections and layer normalization applied after each layer.

4. Apply a mean pooling operation over the output of the last Transformer block, or take the output vector corresponding to the cls token, to obtain a fixed-size feature vector.

5. Feed the feature vector into a fully-connected classification head to predict the class label of the input image.

6. Train the model using a supervised learning objective, such as cross-entropy loss, and backpropagation to update the model weights.
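
The patch arithmetic in step 1 can be sanity-checked with plain tensor operations. The sketch below is only an illustration, assuming a 128x128 RGB image, 16x16 patches, and an embedding dimension of 192 (the values used in the training cells of this notebook). The helper name `patchify` is hypothetical and is not part of the assignment code; your `Tokenization_layer` may be organized differently.

```python
import torch
import torch.nn as nn

def patchify(img, patch_size):
    """Split (B, C, H, W) images into flattened, non-overlapping patches.

    Returns a tensor of shape (B, N, patch_dim) with
    N = (H // p) * (W // p) and patch_dim = C * p * p.
    """
    b, c, h, w = img.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "image size must be divisible by patch size"
    # (B, C, H/p, p, W/p, p): split each spatial axis into (grid index, offset in patch)
    x = img.reshape(b, c, h // p, p, w // p, p)
    # (B, H/p, W/p, p, p, C): move the two grid axes to the front
    x = x.permute(0, 2, 4, 3, 5, 1)
    # (B, N, p*p*C): flatten the grid into N and each patch into one vector
    return x.reshape(b, (h // p) * (w // p), p * p * c)

img = torch.rand(2, 3, 128, 128)
patches = patchify(img, 16)
print(patches.shape)  # torch.Size([2, 64, 768])

# Linear projection of each flattened patch to the embedding dimension (assumed 192).
proj = nn.Linear(patches.shape[-1], 192)
tokens = proj(patches)
print(tokens.shape)  # torch.Size([2, 64, 192])
```

With these sizes, each image yields (128 / 16)^2 = 64 patches of dimension 3 * 16 * 16 = 768.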

In [None]:
# helper method
def pair(t):
    return t if isinstance(t, tuple) else (t, t)

class ViT(nn.Module):
    "Implements Vision Transformer"
    def __init__(self, *,
                 image_size,
                 patch_size,
                 num_classes,
                 dim,
                 depth,
                 heads,
                 mlp_dim,
                 pool = 'cls',
                 channels = 3,
                 dim_head = 64,
                 dropout = 0.,
                 emb_dropout = 0.,
                ):
        super().__init__()
        """
        Args:
          image_size (int): the height/width of the input image.
          patch_size (int): image patch size. In the ViT paper, this value is 16.
          num_classes (int): number of image classes for the MLP prediction head.
          dim (int): patch and position embedding dimension.
          depth (int): number of stacked transformer blocks.
          heads (int): number of attention heads.
          mlp_dim (int): inner dimension for MLP in transformer blocks.
          pool (str): choice between "cls", "mean", "none".
                      For cls, you will need to use the cls token for prediction
                      For mean, you will need to take the mean of last transformer output
                      For none, you can just return the last transformer output.
                                This will mainly be used for dense prediction tasks.
          channels (int): Input image channels. Set to 3 for RGB image.
          dropout (float): dropout rate for transformer blocks.
          emb_dropout (float): dropout rate for patch embedding.
        """
        image_height, image_width = pair(image_size)
        patch_height, patch_width = pair(patch_size)

        assert image_height % patch_height == 0 and image_width % patch_width == 0, 'Image dimensions must be divisible by the patch size.'

        num_patches = 0
        patch_dim = 0

        ################# Your Implementations #################################
        # TODO: Compute the num_patches and patch_dim

        ################# End of your Implementations ##########################

        assert pool in {'cls', 'mean', 'none'}, 'pool type must be either cls (cls token), mean (mean pooling), or none (no pooling)'
        self.pool = pool

        self.to_patch_embedding = None

        self.pos_embedding = None
        self.cls_token = None
        self.dropout = None
        self.transformers = nn.ModuleList([])
        self.mlp_head = None
        ################# Your Implementations #################################
        # TODO:
        # 1) Define self.to_patch_embedding using the Tokenization_layer class
        # 2) Define learnable 1-D pos_embedding using torch.randn, the number of
        #    embeddings should be num_patches+1
        # 3) Define learnable 1-D cls_token with dimension = dim. You can use
        #    nn.Parameter and torch.randn to initialize this
        # 4) Define dropout with emb_dropout
        # 5) Define a list of d Transformer modules, where d=depth
        # 6) Use nn.Sequential to create the MLP head with two layers:
        #    The first layer in the MLP head is a LayerNorm layer.
        #    The second layer in the MLP head is a linear layer that changes the dimension to num_classes.
        #    Note that this MLP head should have 'dim' input dimensions, not 'mlp_dim' which is
        #    used for the MLP in the transformer block instead.



        ################# End of your Implementations ##########################


    def forward(self, img):
        '''
        Args:
            img (torch.Tensor): input tensor in the shape of (batch_size,C,H,W).
        Returns:
            out (torch.Tensor): output tensor in the shape of (batch_size,num_class).

        The input tensor 'img' should pass through the following layers:
        1) self.to_patch_embedding: (batch_size,C,H,W) -> (batch_size,N,dim)
        2) Use torch.Tensor.repeat to repeat the cls token along the batch dimension.
           Then, concatenate it with the patch embeddings: (batch_size,N,dim) -> (batch_size,N+1,dim)
        3) Take sum of patch embedding and position embedding, then apply dropout.
        4) Pass through all the transformer blocks: (batch_size,N+1,dim) -> (batch_size,N+1,dim)
        5) If pool is none, simply return the output of (4). Else, proceed to (6).
        6) Use the cls token or mean pooling to get the latent code of the batched images:
            (batch_size,N+1,dim) -> (batch_size,dim)
        7) Apply MLP head to the output of the last step: (batch_size,dim) -> (batch_size,num_class)

        '''
        out = None
        ################# Your Implementations #################################



        ################# End of your Implementations ##########################
        return out
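
The shape bookkeeping for the cls token and the positional embedding described in the `forward` docstring can be checked in isolation. The sketch below is an illustration under assumed sizes (batch size 2, N = 64, dim = 192) and uses stand-alone tensors rather than the attributes of the `ViT` class, so treat it as a shape check rather than as the reference solution.

```python
import torch
import torch.nn as nn

batch_size, num_patches, dim = 2, 64, 192  # assumed sizes, for illustration only

# Learnable parameters; the leading axis of size 1 broadcasts over the batch.
cls_token = nn.Parameter(torch.randn(1, 1, dim))
pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
dropout = nn.Dropout(0.1)

tokens = torch.rand(batch_size, num_patches, dim)  # stand-in for patch embeddings

# Repeat the cls token along the batch dimension: (1, 1, dim) -> (B, 1, dim)
cls = cls_token.repeat(batch_size, 1, 1)
# Prepend it to the patch tokens: (B, N, dim) -> (B, N+1, dim)
x = torch.cat([cls, tokens], dim=1)
# Add the positional embedding (broadcast over the batch), then apply dropout
x = dropout(x + pos_embedding)
print(x.shape)  # torch.Size([2, 65, 192])
```

Because the cls token is concatenated first, it sits at index 0 along the token axis, which is where the cls pooling policy reads it back.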

Then let's train your ViT model with the cls token as the pooling policy.

In [None]:
seed_everything(0)

#Define the model, optimizer, and criterion (loss_fn)
model = ViT(image_size = 128,
    patch_size = 16,
    num_classes = 100,
    dim = 192,
    depth = 8,
    heads = 4,
    dim_head = 48,
    mlp_dim = 768,
    dropout = 0.1,
    emb_dropout = 0.1
           )

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,)

criterion = nn.CrossEntropyLoss()


# Define the dataset and data transform with flatten functions appended
data_root = os.path.join(root_dir, 'data')
train_dataset = MiniPlaces(
    root_dir=data_root, split='train',
    transform=data_transform)

val_dataset = MiniPlaces(
    root_dir=data_root, split='val',
    transform=data_transform,
    label_dict=train_dataset.label_dict)

# Define the batch size and number of workers
batch_size = 64
num_workers = 2

# Define the data loaders
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=batch_size, num_workers=num_workers, shuffle=True)
val_loader = torch.utils.data.DataLoader(
    val_dataset, batch_size=batch_size, num_workers=num_workers, shuffle=False)

# Train the model
train(model, train_loader, val_loader, optimizer, criterion, device, num_epochs=2)

Epoch 1/2: 100%|██████████| 1563/1563 [02:02<00:00, 12.78it/s, loss=3.93]


Validation set: Average loss = 3.7836, Accuracy = 0.1214


Epoch 2/2: 100%|██████████| 1563/1563 [02:02<00:00, 12.79it/s, loss=3.54]


Validation set: Average loss = 3.5456, Accuracy = 0.1539


I got an accuracy of 14.43% using my own implementation. How about you?
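
For reference, the three `pool` options differ only in how the (batch_size, N+1, dim) transformer output is reduced. Here is a minimal sketch on a dummy tensor, assuming the cls token sits at index 0 and that mean pooling averages over all N+1 tokens. The function name `pool_tokens` is hypothetical and not part of the assignment code.

```python
import torch

def pool_tokens(x, pool="cls"):
    """Reduce (B, N+1, dim) token features according to the pooling policy."""
    if pool == "cls":
        return x[:, 0]        # keep only the cls token -> (B, dim)
    if pool == "mean":
        return x.mean(dim=1)  # average over all tokens -> (B, dim)
    if pool == "none":
        return x              # unchanged, for dense prediction -> (B, N+1, dim)
    raise ValueError(f"unknown pool type: {pool}")

x = torch.rand(2, 65, 192)
for mode in ("cls", "mean", "none"):
    print(mode, tuple(pool_tokens(x, mode).shape))
```

Both `cls` and `mean` produce a (batch_size, dim) tensor, so the same MLP head can be used with either policy.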

Then let's train your ViT model with average pooling as the pooling policy.

In [None]:
seed_everything(0)

#Define the model, optimizer, and criterion (loss_fn)
model = ViT(image_size = 128,
    patch_size = 16,
    num_classes = 100,
    dim = 192,
    depth = 8,
    heads = 4,
    pool = 'mean',
    dim_head = 48,
    mlp_dim = 768,
    dropout = 0.1,
    emb_dropout = 0.1
           )

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,)

criterion = nn.CrossEntropyLoss()


# Define the dataset and data transform with flatten functions appended
data_root = os.path.join(root_dir, 'data')
train_dataset = MiniPlaces(
    root_dir=data_root, split='train',
    transform=data_transform)

val_dataset = MiniPlaces(
    root_dir=data_root, split='val',
    transform=data_transform,
    label_dict=train_dataset.label_dict)

# Define the batch size and number of workers
batch_size = 64
num_workers = 2

# Define the data loaders
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=batch_size, num_workers=num_workers, shuffle=True)
val_loader = torch.utils.data.DataLoader(
    val_dataset, batch_size=batch_size, num_workers=num_workers, shuffle=False)

# Train the model
train(model, train_loader, val_loader, optimizer, criterion, device, num_epochs=2)

Epoch 1/2: 100%|██████████| 1563/1563 [02:14<00:00, 11.60it/s, loss=3.89]


Validation set: Average loss = 3.8176, Accuracy = 0.1161


Epoch 2/2: 100%|██████████| 1563/1563 [02:03<00:00, 12.65it/s, loss=3.6]


Validation set: Average loss = 3.5833, Accuracy = 0.1525


I got an accuracy of 14.11% using my own implementation. How about you?