# Welcome to CS 5242 **Assignment 6**

ASSIGNMENT DEADLINE ⏰ : **18 April 2024**

In this assignment, we learn how to use a parameter-efficient fine-tuning (PEFT) method to fine-tune a large model. PEFT is a very popular technique for large model training: most pre-trained weights stay frozen and only a small number of parameters are trained.

The assignment has three parts:
1. Load the pre-trained parameters correctly (2 points).
2. Implement an adapter module and insert it into your model (6 points).
3. Fine-tune the model for 1 epoch (2 points).

Colab is a hosted Jupyter notebook service that requires no setup and provides free access to computing resources, including GPUs. This semester, we will use Colab to run our experiments.
1. Log in to Google Colab: https://colab.research.google.com/
2. In this assignment, we **need a GPU** to train the vision transformer. You may need to **choose GPU in Runtime -> Change runtime type -> Hardware accelerator**.



### **Grades Policy**

This homework is worth 10 points. Late submissions lose 15% per day, and you receive 0 points if you submit 7 days after the deadline.

### **Cautions**

**DO NOT** copy the code from the internet, e.g. GitHub.
---

**DO NOT** use any LLMs to write the code, e.g. ChatGPT.
---

### **Contact**

Please feel free to contact us if you have any questions about this homework or need any further information.

Slack: Wangbo Zhao


> If you have not joined the Slack group, you can click [here](https://join.slack.com/t/cs5242-2024spring/shared_invite/zt-2cw3jgqab-wFhoaIVa4RIX4fCZ_k~vjQ)

## Setup

Start by running the cell below to set up all required software.

In [1]:
!pip install numpy matplotlib 
!pip install torch torchvision
!pip install timm==0.9.16


Import the necessary libraries and fix the random seed for Python, NumPy, and PyTorch.

In [2]:
import math
import random

import numpy as np
import torch
import torchvision
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision.transforms as transforms
import matplotlib.pyplot as plt

random.seed(0)
np.random.seed(0)
torch.manual_seed(0)

<torch._C.Generator at 0x12c3e4330>
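
The cell above seeds Python's `random`, NumPy, and PyTorch separately. A minimal sketch of bundling the three calls into one helper, so the same seed can be re-applied before each experiment; the name `set_seed` is our own and not part of any library:

```python
import random

import numpy as np
import torch


def set_seed(seed: int = 0) -> None:
    """Seed the Python, NumPy, and PyTorch random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)


# Re-seeding with the same value reproduces the same random draws.
set_seed(0)
a = torch.rand(3)
set_seed(0)
b = torch.rand(3)
print(torch.equal(a, b))
```

Seeding makes random draws repeatable, but some GPU kernels are still non-deterministic, so small run-to-run differences on a GPU are possible.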

Now let's set up the GPU environment. Colab provides a free GPU. Do as follows:

- Runtime -> Change Runtime Type -> select `GPU` in Hardware accelerator
- Click `connect` on the top-right

After connecting to a GPU, you can check its status with the `nvidia-smi` command.

In [3]:
!nvidia-smi

torch.cuda.is_available()

zsh:1: command not found: nvidia-smi


False
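
The output above appears to come from a machine without an NVIDIA GPU (`nvidia-smi` is not found and `torch.cuda.is_available()` is `False`); on a Colab GPU runtime the check should print `True`. A minimal sketch of writing code that runs on either kind of machine, assuming a variable named `device` that the rest of this notebook does not define:

```python
import torch

# Pick the GPU when one is visible to PyTorch, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tensors (and models) are moved with .to(device) instead of calling .cuda() directly.
x = torch.zeros(2, 3).to(device)
print(device, x.device)
```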

Everything is ready, you can move on and ***Good Luck !*** 😃

## Load parameters from pre-trained model.

In this section, you need to load a checkpoint of the vision transformer.



The code for ViT-Base is implemented below. Do not worry, I have done it for you.

In [4]:
from functools import partial
from collections import OrderedDict
from typing import Any, Callable, Dict, List, Optional, Sequence, Set, Tuple, Type, Union

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.checkpoint
from torch.jit import Final
from timm.layers import PatchEmbed, Mlp, DropPath, PatchDropout, trunc_normal_, use_fused_attn

    

class Attention(nn.Module):
    fused_attn: Final[bool]

    def __init__(
            self,
            dim,
            num_heads=8,
            qkv_bias=False,
            qk_norm=False,
            attn_drop=0.,
            proj_drop=0.,
            norm_layer=nn.LayerNorm,
    ):
        super().__init__()
        assert dim % num_heads == 0, 'dim should be divisible by num_heads'
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.fused_attn = use_fused_attn()

        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.q_norm = norm_layer(self.head_dim) if qk_norm else nn.Identity()
        self.k_norm = norm_layer(self.head_dim) if qk_norm else nn.Identity()
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim).permute(2, 0, 3, 1, 4)
        q, k, v = qkv.unbind(0)
        q, k = self.q_norm(q), self.k_norm(k)

        if self.fused_attn:
            x = F.scaled_dot_product_attention(
                q, k, v,
                dropout_p=self.attn_drop.p if self.training else 0.,  # no attention dropout at eval time
            )
        else:
            q = q * self.scale
            attn = q @ k.transpose(-2, -1)
            attn = attn.softmax(dim=-1)
            attn = self.attn_drop(attn)
            x = attn @ v

        x = x.transpose(1, 2).reshape(B, N, C)
        x = self.proj(x)
        x = self.proj_drop(x)
        return x
    
    
class LayerScale(nn.Module):
    def __init__(self, dim, init_values=1e-5, inplace=False):
        super().__init__()
        self.inplace = inplace
        self.gamma = nn.Parameter(init_values * torch.ones(dim))

    def forward(self, x):
        return x.mul_(self.gamma) if self.inplace else x * self.gamma


class Block(nn.Module):

    def __init__(
            self,
            dim,
            num_heads,
            mlp_ratio=4.,
            qkv_bias=False,
            qk_norm=False,
            proj_drop=0.,
            attn_drop=0.,
            init_values=None,
            drop_path=0.,
            act_layer=nn.GELU,
            norm_layer=nn.LayerNorm,
            mlp_layer=Mlp
    ):
        super().__init__()
        self.norm1 = norm_layer(dim)
        self.attn = Attention(
            dim,
            num_heads=num_heads,
            qkv_bias=qkv_bias,
            qk_norm=qk_norm,
            attn_drop=attn_drop,
            proj_drop=proj_drop,
            norm_layer=norm_layer,
        )
        self.ls1 = LayerScale(dim, init_values=init_values) if init_values else nn.Identity()
        self.drop_path1 = DropPath(drop_path) if drop_path > 0. else nn.Identity()

        self.norm2 = norm_layer(dim)
        self.mlp = mlp_layer(
            in_features=dim,
            hidden_features=int(dim * mlp_ratio),
            act_layer=act_layer,
            drop=proj_drop,
        )
        self.ls2 = LayerScale(dim, init_values=init_values) if init_values else nn.Identity()
        self.drop_path2 = DropPath(drop_path) if drop_path > 0. else nn.Identity()
        
        
        
    def forward(self, x):
        x = x + self.drop_path1(self.ls1(self.attn(self.norm1(x))))
        x = x + self.drop_path2(self.ls2(self.mlp(self.norm2(x))))
        return x
    
        
        




class VisionTransformer(nn.Module):
    """ Vision Transformer

    A PyTorch impl of : `An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale`
        - https://arxiv.org/abs/2010.11929
    """

    def __init__(
            self,
            img_size: Union[int, Tuple[int, int]] = 224,
            patch_size: Union[int, Tuple[int, int]] = 16,
            in_chans: int = 3,
            num_classes: int = 1000,
            global_pool: str = 'token',
            embed_dim: int = 768,
            depth: int = 12,
            num_heads: int = 12,
            mlp_ratio: float = 4.,
            qkv_bias: bool = True,
            qk_norm: bool = False,
            init_values: Optional[float] = None,
            class_token: bool = True,
            no_embed_class: bool = False,
            pre_norm: bool = False,
            fc_norm: Optional[bool] = None,
            drop_rate: float = 0.,
            pos_drop_rate: float = 0.,
            patch_drop_rate: float = 0.,
            proj_drop_rate: float = 0.,
            attn_drop_rate: float = 0.,
            drop_path_rate: float = 0.,
            weight_init: str = '',
            embed_layer: Callable = PatchEmbed,
            norm_layer: Optional[Callable] = None,
            act_layer: Optional[Callable] = None,
            block_fn: Callable = Block,
            mlp_layer: Callable = Mlp
    ):
        """
        Args:
            img_size: Input image size.
            patch_size: Patch size.
            in_chans: Number of image input channels.
            num_classes: Number of classes for classification head.
            global_pool: Type of global pooling for final sequence (default: 'token').
            embed_dim: Transformer embedding dimension.
            depth: Depth of transformer.
            num_heads: Number of attention heads.
            mlp_ratio: Ratio of mlp hidden dim to embedding dim.
            qkv_bias: Enable bias for qkv projections if True.
            init_values: Layer-scale init values (layer-scale enabled if not None).
            class_token: Use class token.
            fc_norm: Pre head norm after pool (instead of before), if None, enabled when global_pool == 'avg'.
            drop_rate: Head dropout rate.
            pos_drop_rate: Position embedding dropout rate.
            attn_drop_rate: Attention dropout rate.
            drop_path_rate: Stochastic depth rate.
            weight_init: Weight initialization scheme.
            embed_layer: Patch embedding layer.
            norm_layer: Normalization layer.
            act_layer: MLP activation layer.
            block_fn: Transformer block layer.
        """
        super().__init__()
        assert global_pool in ('', 'avg', 'token')
        assert class_token or global_pool != 'token'
        use_fc_norm = global_pool == 'avg' if fc_norm is None else fc_norm
        norm_layer = norm_layer or partial(nn.LayerNorm, eps=1e-6)
        act_layer = act_layer or nn.GELU

        self.num_classes = num_classes
        self.global_pool = global_pool
        self.num_features = self.embed_dim = embed_dim  # num_features for consistency with other models
        self.num_prefix_tokens = 1 if class_token else 0
        self.no_embed_class = no_embed_class
        self.grad_checkpointing = False

        self.patch_embed = embed_layer(
            img_size=img_size,
            patch_size=patch_size,
            in_chans=in_chans,
            embed_dim=embed_dim,
            bias=not pre_norm,  # disable bias if pre-norm is used (e.g. CLIP)
        )
        num_patches = self.patch_embed.num_patches

        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim)) if class_token else None
        embed_len = num_patches if no_embed_class else num_patches + self.num_prefix_tokens
        self.pos_embed = nn.Parameter(torch.randn(1, embed_len, embed_dim) * .02)
        self.pos_drop = nn.Dropout(p=pos_drop_rate)
        if patch_drop_rate > 0:
            self.patch_drop = PatchDropout(
                patch_drop_rate,
                num_prefix_tokens=self.num_prefix_tokens,
            )
        else:
            self.patch_drop = nn.Identity()
        self.norm_pre = norm_layer(embed_dim) if pre_norm else nn.Identity()

        dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)]  # stochastic depth decay rule
        self.blocks = nn.Sequential(*[
            block_fn(
                dim=embed_dim,
                num_heads=num_heads,
                mlp_ratio=mlp_ratio,
                qkv_bias=qkv_bias,
                qk_norm=qk_norm,
                init_values=init_values,
                proj_drop=proj_drop_rate,
                attn_drop=attn_drop_rate,
                drop_path=dpr[i],
                norm_layer=norm_layer,
                act_layer=act_layer,
                mlp_layer=mlp_layer
            )
            for i in range(depth)])
        self.norm = norm_layer(embed_dim) if not use_fc_norm else nn.Identity()

        # Classifier Head
        self.fc_norm = norm_layer(embed_dim) if use_fc_norm else nn.Identity()
        self.head_drop = nn.Dropout(drop_rate)
        self.head = nn.Linear(self.embed_dim, num_classes) if num_classes > 0 else nn.Identity()

        if self.cls_token is not None:
            nn.init.normal_(self.cls_token, std=1e-6)
        self.apply(self.init_weights)


    def init_weights(self, m):
        if isinstance(m, nn.Linear):
            trunc_normal_(m.weight, std=.02)
            if m.bias is not None:
                nn.init.zeros_(m.bias)
        elif hasattr(m, '_init_weights'):
            m._init_weights()



        
    @torch.jit.ignore
    def no_weight_decay(self):
        return {'pos_embed', 'cls_token', 'dist_token'}



    def forward_features(self, x):
        x = self.patch_embed(x)
        
        if self.cls_token is not None:
            x = torch.cat((self.cls_token.expand(x.shape[0], -1, -1), x), dim=1)
        x = x + self.pos_embed
        x = self.pos_drop(x)
        
        x = self.patch_drop(x)
        x = self.norm_pre(x)


        for i, blk in enumerate(self.blocks):
            x = blk(x)
        
        x = self.norm(x)
        return x

    def forward_head(self, x, pre_logits: bool = False):
        if self.global_pool:
            x = x[:, self.num_prefix_tokens:].mean(dim=1) if self.global_pool == 'avg' else x[:, 0]
        x = self.fc_norm(x)
        x = self.head_drop(x)
        return x if pre_logits else self.head(x)

    def forward(self, x):
        x = self.forward_features(x)
        x = self.forward_head(x)
        return x


def convert_list_to_tensor(list_convert):
    if len(list_convert):
        result = torch.stack(list_convert, dim=1)
    else :
        result = None
    return result 



    
def vit_base_patch16_224_in21k(**kwargs):
    """ ViT-Base model (ViT-B/16) from original paper (https://arxiv.org/abs/2010.11929).
    ImageNet-21k weights @ 224x224, source https://github.com/google-research/vision_transformer.
    NOTE: this model has valid 21k classifier head and no representation (pre-logits) layer
    """
    model_kwargs = dict(patch_size=16, embed_dim=768, depth=12, num_heads=12, mlp_ratio=4.0, qkv_bias=True, **kwargs)
    model = VisionTransformer(**model_kwargs)
    return model





Download the pre-trained model from "https://github.com/huggingface/pytorch-image-models/releases/download/v0.1-vitjx/jx_vit_base_patch16_224_in21k-e5005f0a.pth".
Then load the parameters correctly.

In [5]:
model = vit_base_patch16_224_in21k(num_classes=100)
# checkpoint_model = torch.load("/path/jx_vit_base_patch16_224_in21k-e5005f0a.pth") 
########## write code to load the checkpoint_model without error ##########
checkpoint_model = torch.load("jx_vit_base_patch16_224_in21k-e5005f0a.pth", map_location="cpu")
# The checkpoint's classifier head was trained for a different label set, so drop it
# and let the new 100-class head keep its fresh initialization.
del checkpoint_model['head.weight']
del checkpoint_model['head.bias']
########## write code to load the checkpoint_model without error ##########
msg = model.load_state_dict(checkpoint_model, strict=False)
print(msg)

_IncompatibleKeys(missing_keys=['head.weight', 'head.bias'], unexpected_keys=['pre_logits.fc.bias', 'pre_logits.fc.weight'])
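
The message above lists the keys that did not line up between the checkpoint and the model. A minimal sketch of a more general way to prepare a checkpoint, keeping only entries whose name and shape match the target model; `filter_checkpoint` and the `Tiny` toy model are our own illustrative names, not part of the assignment code:

```python
import torch
import torch.nn as nn


def filter_checkpoint(model: nn.Module, state_dict: dict) -> dict:
    """Keep only the entries whose key exists in `model` with the same shape."""
    model_state = model.state_dict()
    return {
        k: v for k, v in state_dict.items()
        if k in model_state and v.shape == model_state[k].shape
    }


class Tiny(nn.Module):
    """Toy stand-in for a backbone plus classification head."""
    def __init__(self, num_classes: int):
        super().__init__()
        self.body = nn.Linear(4, 8)
        self.head = nn.Linear(8, num_classes)


source = Tiny(num_classes=10)   # plays the role of the pre-trained model
target = Tiny(num_classes=3)    # same body, different head size

filtered = filter_checkpoint(target, source.state_dict())
msg = target.load_state_dict(filtered, strict=False)
print(msg.missing_keys)         # the head is left at its fresh initialization
```

With `strict=False`, `load_state_dict` reports mismatched names instead of raising, but it still raises on a shape mismatch for a shared name, which is why the head entries have to be removed first.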


# Implement an adapter module
[AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition](https://proceedings.neurips.cc/paper_files/paper/2022/file/69e2f49ab0837b71b0e0cb7c555990f8-Paper-Conference.pdf)


![Alt text](image.png)

The scale s is set to 0.1.
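
For reference, the adapter branch can be written out as follows. This is our reading of the AdaptFormer formulation, so please check it against the paper linked above. Here $x'_\ell$ is the input to the MLP sub-block of layer $\ell$, $W_{\text{down}}$ and $W_{\text{up}}$ are the bottleneck projections, and $s$ is the scale:

$$
\tilde{x}_\ell = \mathrm{ReLU}\big(\mathrm{LN}(x'_\ell)\, W_{\text{down}}\big)\, W_{\text{up}}
$$

$$
x_\ell = \mathrm{MLP}\big(\mathrm{LN}(x'_\ell)\big) + s \cdot \tilde{x}_\ell + x'_\ell
$$

The adapter output is added in parallel to the frozen MLP output, and with $s = 0.1$ it contributes a small correction on top of the pre-trained features.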

In [6]:
import math
import torch
import torch.nn as nn


class Adapter(nn.Module):
    def __init__(self,
                 n_embd=768,
                 down_size=8,
                 dropout=0.0,
                 adapter_scalar="0.1"):
        super().__init__()
    ### implement the adapter module ####
        self.down_proj = nn.Linear(n_embd, n_embd // down_size)  # Linear layer for downsizing
        self.up_proj = nn.Linear(n_embd // down_size, n_embd)  # Linear layer for upsizing
        self.dropout = nn.Dropout(dropout)  # Dropout layer
        self.activation = nn.ReLU()  # Activation function
        self.adapter_scalar = float(adapter_scalar)  # Adapter scaling factor

    ### implement the adapter module ####
    
    def forward(self, x):
    ### implement the adapter module ####
        # Bottleneck branch: down-project, activate, apply dropout, up-project
        down_projected = self.down_proj(x)
        activated = self.activation(down_projected)
        dropped = self.dropout(activated)
        up_projected = self.up_proj(dropped)
        # Scale the adapter branch and add it back to the input (residual connection)
        output = x + up_projected * self.adapter_scalar


    ### implement the adapter module ####
        return output
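
The `Adapter` above adds its own residual, so it is meant to be applied to a block's output in sequence. A hedged sketch of the alternative placement, where the adapter runs in parallel with the MLP branch; `ParallelAdapterBlock`, its `bottleneck` size, and the zero initialization of the up-projection are our assumptions for illustration, not the required solution:

```python
import torch
import torch.nn as nn


class ParallelAdapterBlock(nn.Module):
    """Toy pre-norm MLP sub-block with an adapter branch parallel to the MLP."""

    def __init__(self, dim: int = 768, bottleneck: int = 64, scale: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.down_proj = nn.Linear(dim, bottleneck)
        self.up_proj = nn.Linear(bottleneck, dim)
        self.act = nn.ReLU()
        self.scale = scale
        # Assumption: start the adapter branch at zero so the block initially
        # behaves exactly like the block without an adapter.
        nn.init.zeros_(self.up_proj.weight)
        nn.init.zeros_(self.up_proj.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        adapt = self.up_proj(self.act(self.down_proj(h)))
        # residual + MLP branch + scaled adapter branch
        return x + self.mlp(h) + self.scale * adapt


block = ParallelAdapterBlock(dim=32, bottleneck=4)
x = torch.randn(2, 5, 32)
baseline = x + block.mlp(block.norm(x))
print(torch.allclose(block(x), baseline))
```

Starting from a zero adapter branch means fine-tuning begins from the pre-trained model's behavior; whether the adapter reads the normalized or the unnormalized input is a design choice in this sketch.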

Insert the adapter module into your vision transformer.



In [7]:
import timm


######## implement your code here ###########
def set_adapter(model):
    for name, module in model.named_modules():
        if isinstance(module, Block):
            # Attempt to deduce the embedding dimension from the attention module's query/key/value linear layer
            if hasattr(module.attn, 'qkv'):
                n_embd = module.attn.qkv.in_features
            else:
                # Alternatively, try to get the embedding dimension from the first linear layer inside the block
                linear_layers = [m for m in module.modules() if isinstance(m, nn.Linear)]
                if linear_layers:
                    n_embd = linear_layers[0].in_features
                else:
                    raise AttributeError("Unable to deduce the embedding dimension for the adapter module.")

            # Add the adapter module to the block
            module.adapter = Adapter(n_embd=n_embd)

            # Override the forward method to include the adapter
            original_forward = module.forward

            def forward_with_adapter(self, x, original_forward=original_forward):
                x = original_forward(x)
                x = self.adapter(x)
                return x

            module.forward = forward_with_adapter.__get__(module)

######## implement your code here ###########



set_adapter(model=model)

# load the pre-trained parameters again
msg = model.load_state_dict(checkpoint_model, strict=False)
print(msg)


# freeze everything except the newly added adapters (the missing keys) and the head
for name, p in model.named_parameters():
    if name in msg.missing_keys:
        p.requires_grad = True
    else:
        p.requires_grad = False
for _, p in model.head.named_parameters():
    p.requires_grad = True
    
n_parameters = 0
for n, p in model.named_parameters():
    if p.requires_grad:
        n_parameters = n_parameters + p.numel()


print('number of tunable params (M): %.2f' % (n_parameters / 1.e6))

_IncompatibleKeys(missing_keys=['blocks.0.adapter.down_proj.weight', 'blocks.0.adapter.down_proj.bias', 'blocks.0.adapter.up_proj.weight', 'blocks.0.adapter.up_proj.bias', 'blocks.1.adapter.down_proj.weight', 'blocks.1.adapter.down_proj.bias', 'blocks.1.adapter.up_proj.weight', 'blocks.1.adapter.up_proj.bias', 'blocks.2.adapter.down_proj.weight', 'blocks.2.adapter.down_proj.bias', 'blocks.2.adapter.up_proj.weight', 'blocks.2.adapter.up_proj.bias', 'blocks.3.adapter.down_proj.weight', 'blocks.3.adapter.down_proj.bias', 'blocks.3.adapter.up_proj.weight', 'blocks.3.adapter.up_proj.bias', 'blocks.4.adapter.down_proj.weight', 'blocks.4.adapter.down_proj.bias', 'blocks.4.adapter.up_proj.weight', 'blocks.4.adapter.up_proj.bias', 'blocks.5.adapter.down_proj.weight', 'blocks.5.adapter.down_proj.bias', 'blocks.5.adapter.up_proj.weight', 'blocks.5.adapter.up_proj.bias', 'blocks.6.adapter.down_proj.weight', 'blocks.6.adapter.down_proj.bias', 'blocks.6.adapter.up_proj.weight', 'blocks.6.adapter.up_
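
The trainable-parameter count printed by the cell above can be checked by hand. A small sketch of the arithmetic, assuming the settings used in this notebook (`n_embd=768`, `down_size=8`, 12 blocks, a 100-class head) and that only the adapters and the head are trainable; `linear_params` is our own helper name:

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Number of parameters in a linear layer with bias."""
    return in_features * out_features + out_features


n_embd, down_size, depth, num_classes = 768, 8, 12, 100
bottleneck = n_embd // down_size  # 96

# One adapter = down-projection + up-projection.
per_adapter = linear_params(n_embd, bottleneck) + linear_params(bottleneck, n_embd)
head = linear_params(n_embd, num_classes)
total = depth * per_adapter + head

print(per_adapter, head, total)   # 148320 76900 1856740
print('%.2f' % (total / 1e6))     # 1.86
```

Under these assumptions, about 1.86 M parameters are tuned, a small fraction of the full ViT-Base.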

## Parameter-efficient fine-tuning for 1 epoch

In [8]:
import torch 
import torchvision
import torchvision.transforms as transforms
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from timm.data.constants import IMAGENET_INCEPTION_MEAN, IMAGENET_INCEPTION_STD
import PIL.Image
import math

_mean = IMAGENET_INCEPTION_MEAN
_std = IMAGENET_INCEPTION_STD

class RandomResizedCrop(transforms.RandomResizedCrop):
    """
    RandomResizedCrop for matching TF/TPU implementation: no for-loop is used.
    This may lead to results different from torchvision's version.
    Following BYOL's TF code:
    https://github.com/deepmind/deepmind-research/blob/master/byol/utils/dataset.py#L206
    """
    @staticmethod
    def get_params(img, scale, ratio):
        assert isinstance(img, PIL.Image.Image)
        # width, height = F._get_image_size(img)
        width, height = img.width, img.height
        area = height * width

        target_area = area * torch.empty(1).uniform_(scale[0], scale[1]).item()
        log_ratio = torch.log(torch.tensor(ratio))
        aspect_ratio = torch.exp(
            torch.empty(1).uniform_(log_ratio[0], log_ratio[1])
        ).item()

        w = int(round(math.sqrt(target_area * aspect_ratio)))
        h = int(round(math.sqrt(target_area / aspect_ratio)))

        w = min(w, width)
        h = min(h, height)

        i = torch.randint(0, height - h + 1, size=(1,)).item()
        j = torch.randint(0, width - w + 1, size=(1,)).item()

        return i, j, h, w
    
        
train_set = torchvision.datasets.CIFAR100(
    root = 'Cifar100',
    train = True,
    download = True,
    transform = transforms.Compose([
        RandomResizedCrop(224, interpolation=3),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(mean=_mean, std=_std)])
)

test_set = torchvision.datasets.CIFAR100(
    root = 'Cifar100',
    train = False,
    download = True,
    transform = transforms.Compose([
        transforms.Resize(256, interpolation=3),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=_mean, std=_std)])
)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=100)
print(len(train_loader))
test_loader = torch.utils.data.DataLoader(test_set, batch_size=100)

Files already downloaded and verified
Files already downloaded and verified
500


In [9]:
network = model

if torch.cuda.is_available():
    network = network.cuda()
    
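# you may adjust the learning rate or weight decay here.

######### complete the optimizer. ##################
# Only pass the parameters that are still trainable (adapters and head).
trainable_params = [p for p in network.parameters() if p.requires_grad]
optimizer = optim.Adam(trainable_params, weight_decay = 1e-5, lr=1e-3)
######### complete the optimizer. ##################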



Build the training loop. You should finish training the first epoch!

In [None]:
import tqdm
for epoch in range(1):
    total_loss = 0
    total_correct = 0
    for batch in tqdm.tqdm(train_loader):  
        images, labels = batch
        if torch.cuda.is_available():
            images = images.cuda()
            labels = labels.cuda()

        optimizer.zero_grad()  
        preds = network(images)
        loss = F.cross_entropy(preds, labels)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        _,prelabels=torch.max(preds,dim=1)
        total_correct += (prelabels==labels).sum().item()

    accuracy = total_correct/len(train_set)
    print("Epoch:%d  ,  Loss:%f  , Train Accuracy:%f "%(epoch, total_loss, accuracy * 100))
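
The `test_loader` built earlier is not used by the training loop. A minimal sketch of measuring accuracy on held-out data; `evaluate` is our own helper name, and the toy model and loader below only stand in for `network` and `test_loader` so that the block runs by itself:

```python
import torch
import torch.nn as nn


@torch.no_grad()
def evaluate(model: nn.Module, loader, device: torch.device) -> float:
    """Return the fraction of correctly classified samples in `loader`."""
    model.eval()  # switch off dropout and similar training-only behavior
    correct, total = 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    return correct / total


# Toy stand-ins: a linear classifier and two random batches of 8 samples each.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
toy_model = nn.Linear(4, 3).to(device)
toy_loader = [(torch.randn(8, 4), torch.randint(0, 3, (8,))) for _ in range(2)]

acc = evaluate(toy_model, toy_loader, device)
print("accuracy: %.2f" % acc)
```

In this notebook the call would look like `evaluate(network, test_loader, device)`. The helper leaves the model in eval mode, so call `network.train()` before any further training.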
