# Homework 2

## Student : Jad El Karchi - AI

# Task 1 [2 pts]
Consider a fully-connected layer with a weight matrix of shape $(N_{in}, N_{out})$.

We want to reduce the number of MACs required for propagation through the layer.
We do that by replacing the weight matrix $W$ with its low-rank matrix approximation $\hat{W}$,
$$\hat{W} = W_1W_2,$$
where $W_1$ and $W_2$ have shapes $(N_{in}, R)$ and $(R, N_{out})$, respectively.
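
As a rough sanity check on the idea, the sketch below counts multiply-accumulates for a dense layer and for its rank-$R$ factorization, assuming one MAC per weight and no bias. The helper names `dense_macs` and `low_rank_macs` are illustrative and are not part of PyTorch or FlopCo.

```python
def dense_macs(n_in: int, n_out: int) -> int:
    """MACs of y = W^T x with W of shape (n_in, n_out), bias ignored."""
    return n_in * n_out


def low_rank_macs(n_in: int, n_out: int, rank: int) -> int:
    """MACs of y = W_2^T (W_1^T x): first n_in -> rank, then rank -> n_out."""
    return n_in * rank + rank * n_out


n_in, n_out = 48, 16
for rank in (1, 3, 12):
    ratio = dense_macs(n_in, n_out) / low_rank_macs(n_in, n_out, rank)
    print(f"rank={rank:2d}  MAC reduction = {ratio:.2f}x")
```

The factorization only saves work while $R < \frac{N_{in} N_{out}}{N_{in} + N_{out}}$, which is 12 for these sizes; at that rank both variants cost the same.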

a) Derive an explicit expression for $R$ that corresponds to a 4 times MAC reduction when we replace the layer $Y = W^TX$ by the decomposed layer $Y=\hat{W}^TX$. The shape of $X$ is $(N_{in}, 1)$.

b) Implement both initial and decomposed layers using PyTorch and measure their MAC using FlopCO. For this task assume you have $N_{in} = 48, N_{out}=16$.

*Hint: Implement fully-connected layer using nn.Linear()*

a) Decomposition : $Y = W^{T}X \approx \hat{W}^{T}X = (W_{1}W_{2})^{T}X = W_{2}^{T}(W_{1}^{T}X)$

$n_M$ : number of MACs before decomposition.

$\hat{n}_M$ : number of MACs after decomposition.

A 4 times MAC reduction means $\hat{n}_M = \frac{n_M}{4}$.

$n_M = N_{in} \cdot N_{out}$

$\hat{n}_M = R \cdot N_{in} + R \cdot N_{out}$

Substituting both counts into the constraint :
$R \cdot (N_{in} + N_{out}) = \frac{N_{in} \cdot N_{out}}{4}$

So : $R = \frac{N_{in} \cdot N_{out}}{4 \cdot (N_{in}+N_{out})}$

For $N_{in} = 48$, $N_{out} = 16$ this gives $R = \frac{768}{256} = 3$.

b) 

In [9]:
# ANSWER
import torch
import torch.nn as nn
from flopco import FlopCo

def calculate_R(n_in, n_out):
    # rank giving a 4x MAC reduction: R = n_in * n_out / (4 * (n_in + n_out))
    return int((n_in * n_out) / (4 * (n_in + n_out)))

def create_initial_layer(n_in, n_out):
    return nn.Sequential(
        nn.Linear(n_in, n_out)
    )

def create_decomposed_layer(n_in, R, n_out):
    return nn.Sequential(
        nn.Linear(n_in, R),  # W_1
        nn.Linear(R, n_out)   # W_2
    )

n_in, n_out = 48, 16

# initiate variables
R_val = calculate_R(n_in, n_out)
initial_layer = create_initial_layer(n_in, n_out)
decomposed_layer = create_decomposed_layer(n_in, R_val, n_out)

initial_stats = FlopCo(initial_layer, img_size=(1, 1, 1, n_in))
decomposed_stats = FlopCo(decomposed_layer, img_size=(1, 1, 1, n_in))

print("Initial FC Layer:", initial_stats.total_macs)
print("Decomposed FC Layer:", decomposed_stats.total_macs)

# checking the constraint: the initial layer needs 4 times more MACs
assert round(initial_stats.total_macs / decomposed_stats.total_macs) == 4

Initial FC Layer: 768
Decomposed FC Layer: 192


# Task 2 [2 pts]

Consider a convolutional layer with a weight of shape $(C_{out}, C_{in}, k_{h}, k_{w})$.

We want to reduce the number of MACs required for propagation through the layer. We do that by replacing the weight tensor $W$ with its low-rank tensor approximation $\hat{W}$.

a) $\hat{W}$ is a rank-$R$ CP decomposition of reshaped weight tensor (we get 3-dimensional tensor of shape $(C_{out}, C_{in}, k_{h} \times k_{w})$ by merging spatial dimensions of 4-d tensor). Find R such that MACs will be reduced 4 times after compression.

- *Hint: see slides from the lecture (Layer compression via weight approximation)*

b) Implement both initial and decomposed layers using PyTorch and measure their MAC using FlopCO. For this task assume you have $C_{in} = 16, C_{out}=32, k_{h} = 3, k_{w} = 3$. Both horizontal and vertical paddings are equal to 1. Stride = 1

- The goal of this task is to build a compressed layer with correct shapes. So, for simplicity, you can initialize weights in decomposed layers with random weights.

- If calculated R is not integer, round it down to the nearest integer value.

- *Hint: Implement convolutional layer using nn.Conv2d()*
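
For orientation, a rank-$R$ CP layer is commonly realised as a 1x1 convolution ($C_{in} \to R$), a depthwise $k_h \times k_w$ convolution on $R$ channels, and a 1x1 convolution ($R \to C_{out}$). The sketch below counts MACs per output position under that assumption (no bias, and every stage producing the same number of output positions). The function names are illustrative only.

```python
def conv_macs(c_in, c_out, k_h, k_w):
    """MACs per output position of a full k_h x k_w convolution."""
    return c_out * c_in * k_h * k_w


def cp3_macs(c_in, c_out, k_h, k_w, rank):
    """MACs per output position of the three-stage CP layer."""
    pointwise_in = c_in * rank       # 1x1 conv, C_in -> R
    depthwise = rank * k_h * k_w     # depthwise k_h x k_w conv on R channels
    pointwise_out = rank * c_out     # 1x1 conv, R -> C_out
    return pointwise_in + depthwise + pointwise_out


def max_rank_for_reduction(c_in, c_out, k_h, k_w, reduction):
    """Largest integer rank whose MAC count is at most conv_macs / reduction."""
    per_rank = cp3_macs(c_in, c_out, k_h, k_w, 1)
    return conv_macs(c_in, c_out, k_h, k_w) // (reduction * per_rank)


rank = max_rank_for_reduction(16, 32, 3, 3, reduction=4)
print(rank, conv_macs(16, 32, 3, 3), cp3_macs(16, 32, 3, 3, rank))
```

Rounding the rank down means the achieved reduction is slightly above the requested factor rather than below it.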




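a) Let $n_m$ and $\hat{n}_m$ be the number of MACs per output position before and after decomposition :

$n_m = C_{out} \cdot C_{in} \cdot k_{h} \cdot k_{w}$

$\hat{n}_m = R \cdot (C_{in} + k_{h} \cdot k_{w} + C_{out})$

A 4 times MAC reduction means $\hat{n}_m = \frac{n_m}{4}$ :
$R \cdot (C_{in} + k_{h} \cdot k_{w} + C_{out}) = \frac{C_{out} \cdot C_{in} \cdot k_{h} \cdot k_{w}}{4}$

So : $R = \frac{C_{out} \cdot C_{in} \cdot k_{h} \cdot k_{w}}{4 \cdot (C_{in} + k_{h} \cdot k_{w} + C_{out})}$

For $C_{in} = 16$, $C_{out} = 32$, $k_{h} = k_{w} = 3$ this gives $R = \frac{4608}{228} \approx 20.2$, rounded down to $R = 20$.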



b)

In [18]:
# ANSWER
def calculate_R(C_in, C_out, k_h, k_w):
    # rank giving a 4x MAC reduction, rounded down
    return int((C_in * C_out * k_h * k_w) / (4 * (C_out + C_in + k_h * k_w)))

def create_initial_convolution(C_in, C_out, k_h, k_w, padding, stride):
    return nn.Conv2d(in_channels=C_in, out_channels=C_out, kernel_size=(k_h, k_w), padding=padding, stride=stride)

def create_decomposed_convolution(C_in, R, C_out, k_h, k_w, padding, stride):
    # padding and stride belong to the spatial (depthwise) convolution only;
    # the 1x1 convolutions must not change the spatial size
    return nn.Sequential(
        nn.Conv2d(in_channels=C_in, out_channels=R, kernel_size=(1, 1)),
        # Depthwise convolution
        nn.Conv2d(in_channels=R, out_channels=R, kernel_size=(k_h, k_w), stride=stride, padding=padding, groups=R),
        nn.Conv2d(in_channels=R, out_channels=C_out, kernel_size=(1, 1)),
    )

C_in, C_out = 16, 32
k_h, k_w = 3, 3
padding = 1
stride = 1
H, W = 1, 1  # height and width

# initiate variables
R = calculate_R(C_in, C_out, k_h, k_w)
initial_convolution = create_initial_convolution(C_in, C_out, k_h, k_w, padding, stride)
decomposed_convolution = create_decomposed_convolution(C_in, R, C_out, k_h, k_w, padding, stride)

initial_stats = FlopCo(initial_convolution, img_size=(1, C_in, H, W))
decomposed_stats = FlopCo(decomposed_convolution, img_size=(1, C_in, H, W))

print("Initial convolutional layer macs:", initial_stats.total_macs)
print("Decomposed convolutional layer macs:", decomposed_stats.total_macs)

# checking the constraint: the initial layer needs about 4 times more MACs
print(f"{initial_stats.total_macs / decomposed_stats.total_macs:.3f}")
assert round(initial_stats.total_macs / decomposed_stats.total_macs) == 4

Initial convolutional layer macs: 4608
Decomposed convolutional layer macs: 1140
4.042

# Helper functions

## Load ResNet for Cifar100

In [21]:
#Import Libraries

import numpy as np

import torch
import torch.nn as nn
import torch.optim as optim

import tensorly as tl
from flopco import FlopCo

from resnet_8x import ResNet18_8x
from utils import batchnorm_callibration, get_validation_scores, fix_random_seed, get_cifar100_dataloader

import copy
import matplotlib.pyplot as plt

%matplotlib inline

tl.set_backend('pytorch')

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device

device(type='cuda', index=0)

In [22]:
model = ResNet18_8x(num_classes=100)
model.load_state_dict(torch.load("cifar100-resnet18_8x.pt"))

<All keys matched successfully>

In [23]:
dataset_path = './'
batch_size = 256
num_workers = 0

train_loader, val_loader = get_cifar100_dataloader(dataset_path, batch_size, num_workers, download=True)
calibrate_batches = 200

Files already downloaded and verified
Files already downloaded and verified


In [24]:
model.to(device)
top1_acc_orig, top5_acc_orig = get_validation_scores(model, val_loader, device=device)
print(f'Original model. Top 1 acc: {top1_acc_orig:.3f}, Top 5 acc: {top5_acc_orig:.3f}')

Original model. Top 1 acc: 0.771, Top 5 acc: 0.936


# Task 3 [4 pts]

  a) Implement Weight-SVD decomposition of conv layer [1 pts]

  b) Wrap Weight-SVD in a class as we have done in the seminar [1 pts]

  c) Compress multiple conv layers by Weight-SVD, such that compression FLOPs ratio of the network will be close to 8 [1 pts]

  d) Fine-tune the compressed model for 10 epochs using SGD with learning rate = 0.001 and momentum = 0.9 [1 pts]

  



In [25]:
def calculate_layer_cr(model_stats, lnames_to_compress, cr=2):
  '''
  When we compress the whole model with compression ratio `cr`,
  we need to calculate layer compression ratio for each layer
  from `lnames_to_compress`. We apply the same compression rate
  to all layers.

  Returns: float
           layer compression ratio
  '''

  flops_to_compress = 0
  for lname in lnames_to_compress:
    flops_to_compress += model_stats.flops[lname][0]
  uncompressed_flops = model_stats.total_flops - flops_to_compress
  layer_cr = flops_to_compress * cr / (flops_to_compress +
                                       uncompressed_flops * (1- cr))
  return layer_cr


def cr_to_svd_rank(layer, decomposition='spatial-svd', cr=2.):
  '''
  Returns layer decomposition rank given layer compression ratio `cr`.
  Decomposition can be `spatial-svd` or `weight-svd`

  Parameters:
    layer:          nn.Module
    decomposition:  str
    cr:             float

  Returns: int
           layer decomposition rank

  '''

  weight_shape = layer.weight.shape
  cout, cin, kh, kw = weight_shape

  initial_count = cout * cin * kh * kw

  if decomposition == 'spatial-svd':
    rank = initial_count // (cr * (cin * kh + kw * cout))
  elif decomposition == 'weight-svd':
    rank = initial_count // (cr * (cin * kh * kw + cout))
  else:
    raise ValueError('Wrong decomposition name. Should be spatial-svd or weight-svd')

  return int(rank)
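
The expression in `calculate_layer_cr` can be read as solving $\frac{F_c + F_u}{cr} = \frac{F_c}{cr_{layer}} + F_u$ for $cr_{layer}$, where $F_c$ is the FLOP count of the layers to compress and $F_u$ the FLOP count of everything else. Below is a small numeric check with made-up FLOP counts; the numbers are hypothetical and were not measured on ResNet18, and the function name is illustrative.

```python
def layer_cr_from_model_cr(flops_to_compress, uncompressed_flops, cr):
    """Per-layer ratio needed so that the whole model shrinks by `cr`."""
    denom = flops_to_compress + uncompressed_flops * (1 - cr)
    if denom <= 0:
        # the untouched layers alone already exceed the FLOP budget
        raise ValueError("target model compression ratio is unreachable")
    return flops_to_compress * cr / denom


f_c, f_u, cr = 900.0, 100.0, 4.0           # hypothetical FLOP counts
layer_cr = layer_cr_from_model_cr(f_c, f_u, cr)
new_total = f_c / layer_cr + f_u           # FLOPs after compressing the chosen layers
print(layer_cr, (f_c + f_u) / new_total)   # per-layer ratio, achieved model ratio
```

The per-layer ratio is always larger than the model ratio, because the uncompressed layers keep their full cost.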

## 2.1. Implement Weight-SVD decomposition of conv layer [3 pts]


**Weight-SVD**
![**Weight-SVD**](https://github.com/k-sobolev/m5-forecasting-accuracy/blob/main/Weight-SVD.PNG?raw=true)
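
The factorization in the figure can be tried on a random kernel before touching the real model. The sketch below uses NumPy only; the sizes and the helper names `weight_svd_factors` and `reconstruct` are assumptions made for illustration. It shows the kernel shapes expected by the two convolutions and checks that the full-rank factors reproduce the original kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
c_out, c_in, h, w = 8, 4, 3, 3
weight = rng.standard_normal((c_out, c_in, h, w)).astype(np.float32)


def weight_svd_factors(weight, rank):
    """Truncated SVD of the (C_out, C_in*h*w) matrix, sqrt(S) fused into both factors."""
    c_out, c_in, h, w = weight.shape
    matrix = weight.reshape(c_out, c_in * h * w)
    U, S, Vt = np.linalg.svd(matrix, full_matrices=False)
    sqrt_s = np.sqrt(S[:rank])
    w0 = (sqrt_s[:, None] * Vt[:rank]).reshape(rank, c_in, h, w)      # h x w conv kernel
    w1 = (U[:, :rank] * sqrt_s[None, :]).reshape(c_out, rank, 1, 1)   # 1x1 conv kernel
    return w0, w1


def reconstruct(w0, w1):
    """Kernel of the single convolution equivalent to w0 followed by w1."""
    rank = w0.shape[0]
    merged = w1.reshape(-1, rank) @ w0.reshape(rank, -1)
    return merged.reshape(w1.shape[0], *w0.shape[1:])


for rank in (2, 8):
    w0, w1 = weight_svd_factors(weight, rank)
    err = np.linalg.norm(weight - reconstruct(w0, w1)) / np.linalg.norm(weight)
    print(rank, w0.shape, w1.shape, f"relative error = {err:.2e}")
```

Both kernels follow PyTorch's `(out_channels, in_channels, h, w)` layout, so the first one starts with `rank` and the second one with `c_out`.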




In [64]:
layer = model.layer2[0].conv1
weight = layer.weight
bias = layer.bias

is_bias = layer.bias is not None

In [82]:
c_out, c_in, h, w = weight.shape
c_out = layer.out_channels
padding = layer.padding
stride = layer.stride
kernel_size = layer.kernel_size

# reshape conv. kernel to matrix: (C_out, C_in, h, w) -> (C_out, C_in * h * w)
# ANSWER
weight_reshaped = weight.detach().cpu().reshape(c_out, -1).numpy()

# perform decomposition
U, S, Vt = np.linalg.svd(weight_reshaped, full_matrices=False)

rank = 10

# perform truncation and fuse sqrt(S) into both factors
w0 = np.dot(np.diag(np.sqrt(S[0:rank])),Vt[0:rank, :])  # (rank, C_in * h * w)
w1 = np.dot(U[:, 0:rank], np.diag(np.sqrt(S[0:rank])))  # (C_out, rank)


# create conv1: 3x3 conv with C_in input channels and :rank: output channels
# do not forget about stride and padding

# ANSWER
conv1 = nn.Conv2d(in_channels=c_in, out_channels=rank, kernel_size=kernel_size, padding=padding, stride=stride, bias=False)
# insert a weight; a PyTorch conv kernel has shape (out_channels, in_channels, h, w)
conv1.weight = nn.Parameter(torch.FloatTensor(w0).view(rank, c_in, h, w))

# create conv2: 1x1 conv with :rank: input channels and C_out output channels
# stride and padding were already applied by conv1, so conv2 uses the defaults
conv2 = nn.Conv2d(in_channels=rank, out_channels=c_out, kernel_size=(1,1), bias=is_bias)
# insert a weight, do not forget to reshape weight
conv2.weight = nn.Parameter(torch.FloatTensor(w1).view(c_out, rank, 1, 1))
if is_bias:
    conv2.bias = nn.Parameter(bias.detach().cpu().clone())

factorized_layer = nn.Sequential(conv1, conv2)

In [83]:
compressed_model = copy.deepcopy(model)
print("replaced layer :", compressed_model.layer2[0].conv1)
compressed_model.layer2[0].conv1 = factorized_layer

replaced layer : Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)


In [84]:
factorized_layer

Sequential(
  (0): Conv2d(64, 10, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
  (1): Conv2d(10, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
)

In [86]:
# You should have around 0.5 top-1 acc here

compressed_model.to(device)
top1_acc, top5_acc = get_validation_scores(compressed_model, val_loader, device=device)
print(f'Compressed model. Top 1 acc: {top1_acc:.3f}, Top 5 acc: {top5_acc:.3f}')

## 2.2. Wrap Weight-SVD decomposition of conv layer in a class as we have done in the seminar [1 pts]

In [46]:
class SVD_Weight_conv_layer(torch.nn.Module):
    def __init__(self, layer, rank=None, rank_selection='manual'):
        super(SVD_Weight_conv_layer, self).__init__()

        self.c_in = layer.in_channels
        self.c_out = layer.out_channels
        self.padding = layer.padding
        self.stride = layer.stride
        self.kernel_size = layer.kernel_size
        self.h = layer.kernel_size[0]
        self.w = layer.kernel_size[1]
        self.is_bias = layer.bias is not None

        if not isinstance(rank, int):
            raise AttributeError('Rank should be an integer number')
        self.rank = rank

        self.svd_decomposition = self.__replace__(layer)

    def __replace__(self, layer):
        """ Gets a conv layer and a target rank,
            returns a nn.Sequential object with
            each layer representing a decomposed factor"""

        # ANSWER
        # (C_out, C_in, h, w) -> (C_out, C_in * h * w)
        weight_reshaped = layer.weight.detach().cpu().reshape(self.c_out, -1).numpy()

        U, S, Vt = np.linalg.svd(weight_reshaped, full_matrices=False)

        w0 = np.dot(np.diag(np.sqrt(S[0:self.rank])),Vt[0:self.rank, :])  # (rank, C_in * h * w)
        w1 = np.dot(U[:, 0:self.rank], np.diag(np.sqrt(S[0:self.rank])))  # (C_out, rank)

        # ANSWER
        # stride and padding apply to the spatial conv only; the 1x1 conv carries the bias
        new_layers = [
            nn.Conv2d(self.c_in, self.rank, kernel_size=self.kernel_size, padding=self.padding, stride=self.stride, bias=False),
            nn.Conv2d(self.rank, self.c_out, kernel_size=(1,1), bias=self.is_bias)
        ]

        # ANSWER
        # PyTorch conv kernels have shape (out_channels, in_channels, h, w)
        new_kernels = [
            torch.FloatTensor(w0).reshape(self.rank, self.c_in, self.h, self.w),
            torch.FloatTensor(w1).reshape(self.c_out, self.rank, 1, 1)
        ]

        with torch.no_grad():
            for i in range(len(new_kernels)):
                new_layers[i].weight = nn.Parameter(new_kernels[i])
                if i == len(new_kernels)-1 and self.is_bias:
                    new_layers[i].bias = nn.Parameter(layer.bias.detach().cpu().clone())

        return nn.Sequential(*new_layers)

    def forward(self, x):
        out = self.svd_decomposition(x)
        return out

## 2.3. Compress many layers [1 pts]

In [47]:
model.to(device)
model_stats = FlopCo(model, img_size = (1, 3, 32, 32), device = device)
lnames_to_compress = [lname for lname, _ in model.named_modules() if 'conv' in lname]
lnames_to_compress = lnames_to_compress[1:]

model_compression_ratio = 8
layer_cr = calculate_layer_cr(model_stats, lnames_to_compress, cr=model_compression_ratio)

In [49]:
from utils import get_layer_by_name, train, replace_layer_by_name

compressed_model = copy.deepcopy(model)
for lname in lnames_to_compress:
  layer = get_layer_by_name(compressed_model, lname)
  # ANSWER
  # get weight-svd svd rank for given layer_cr here
  r = cr_to_svd_rank(layer, decomposition='weight-svd', cr=layer_cr)
  # ANSWER
  # get compressed layer here using SVD_Weight_conv_layer
  compressed_layer = SVD_Weight_conv_layer(layer, rank=r)
  # replace layer by compressed layer
  replace_layer_by_name(compressed_model, lname, compressed_layer)
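
To see what rank the loop requests, the `weight-svd` branch of `cr_to_svd_rank` can be replayed by hand for a single layer. The sketch assumes the 64 -> 128, 3x3 layer printed earlier and a hypothetical per-layer ratio of 8; the real `layer_cr` comes from `calculate_layer_cr` and will differ. Function names are illustrative.

```python
def weight_svd_rank(c_out, c_in, k_h, k_w, cr):
    """Rank for a requested layer compression ratio, rounded down."""
    initial = c_out * c_in * k_h * k_w
    per_rank = c_in * k_h * k_w + c_out   # weights added per unit of rank
    return int(initial // (cr * per_rank))


def achieved_cr(c_out, c_in, k_h, k_w, rank):
    """Compression ratio actually obtained with an integer rank."""
    initial = c_out * c_in * k_h * k_w
    return initial / (rank * (c_in * k_h * k_w + c_out))


rank = weight_svd_rank(128, 64, 3, 3, cr=8.0)
print(rank, round(achieved_cr(128, 64, 3, 3, rank), 3))
```

Because the rank is rounded down, the achieved ratio is at least the requested one. For very small layers or very large ratios the rank can reach 0, in which case the layer cannot be built and is better left uncompressed.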

In [51]:
compressed_model.to(device)
top1_acc, top5_acc = get_validation_scores(compressed_model, val_loader, device=device)
print(f'Compressed model. Top 1 acc: {top1_acc:.3f}, Top 5 acc: {top5_acc:.3f}')

In [55]:
compressed_model_stats = FlopCo(compressed_model, img_size = (1, 3, 32, 32), device = device)

print(f"FLOPs compression ratio : {model_stats.total_flops / compressed_model_stats.total_flops}")

## 2.4. Fine-tune the model

As we can see, the accuracy of our model has dropped significantly. Let's fine-tune it and see how well we can recover the accuracy.

In [88]:
# ANSWER
optimizer = optim.SGD(compressed_model.parameters(), lr=0.001, momentum=0.9)
compressed_model.to(device)

# ANSWER
for epoch in range(10):
    train(compressed_model, device, train_loader, optimizer, epoch, log_interval=100, verbose=True)
    top1_acc, top5_acc = get_validation_scores(compressed_model, val_loader, device=device)
    print(f'Epoch: {epoch}, top-1 acc.:{top1_acc}')

# Task 4 [2 pts]

In this task you will perform whole-model compression using MUSCO package.

a) Compress all 3x3 convolutional layers:
 -  with 'cp3' with param reduction rates: 2, 4, 8
 -  with 'tucker2' with param reduction rates: 2, 4, 8

b) Compare accuracy - FLOPs reduction trade off for  'cp3' and 'tucker2' - based model compressions.

In this subtask, for each type of decomposition you should plot the dependence of top-1 accuracy on the FLOPs reduction rate.


- Note: You do need to fine-tune compressed models.

In [94]:
from musco.pytorch import CompressorVBMF, CompressorPR, CompressorManual
from utils import get_layer_by_name

AttributeError: module 'numpy' has no attribute 'float'.
`np.float` was a deprecated alias for the builtin `float`. To avoid this error in existing code, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
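
The import fails because the installed NumPy no longer provides `np.float`, which older packages may still reference. One possible workaround, assuming the dependency only uses the alias as a synonym for the builtin `float`, is to restore it before the import; pinning NumPy below 1.24 is the other common option. This is a stopgap sketch, not a fix endorsed by the MUSCO authors.

```python
import numpy as np

# Restore the removed alias only if it is missing, so older NumPy versions are untouched.
if not hasattr(np, "float"):
    np.float = float

# The failing import can then be retried, e.g.:
# from musco.pytorch import CompressorVBMF, CompressorPR, CompressorManual
```

If other removed aliases (such as `np.int`) are hit afterwards, the same pattern applies, but downgrading NumPy is likely the more robust route.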


In [None]:
def get_compressed_model(model, conv2d_nn_decomposition, model_compression_ratio=2):
  device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
  model.to(device)
  model_stats = FlopCo(model, img_size = (1, 3, 32, 32), device = device)
  lnames = list(model_stats.flops.keys())
  lnames_to_compress = [lname for lname, _ in model.named_modules() if 'conv' in lname]
  lnames_to_compress = lnames_to_compress[1:]

  layer_cr = calculate_layer_cr(model_stats, lnames_to_compress, cr=model_compression_ratio)

  param_reduction_rates = {lname: layer_cr for lname in lnames_to_compress}

  ### Implement CompressorPR class here
  # ANSWER
  compressor = CompressorPR(model,
                            model_stats,
                            rank_selection = 'param_reduction',
                            conv2d_nn_decomposition = conv2d_nn_decomposition,
                            ranks = [cr_to_svd_rank(get_layer_by_name(model, lname), decomposition='weight-svd', cr=layer_cr) for lname in lnames_to_compress],
                            param_reduction_rates = param_reduction_rates)

  ###
  compressor.lnames = lnames_to_compress

  compressor.compression_step()

  return compressor.compressed_model

In [None]:
# Example of calculating model accuracy

compressed_model = get_compressed_model(model, 'tucker2', 8)

compressed_model.to(device)
top1_acc, top5_acc = get_validation_scores(compressed_model, val_loader, device=device)
print(f'Compressed model. Top 1 acc: {top1_acc:.3f}, Top 5 acc: {top5_acc:.3f}')

In [None]:
# For each model cr and each decomposition calculate:
# FLOPs compression ratio and top-1 accuracy

model_compression_ratio_list = [2, 4, 8]

flops_ratio_tucker2_list = []
top1_acc_tucker2_list = []

flops_ratio_cp3_list = []
top1_acc_cp3_list = []

for ratio in model_compression_ratio_list:

  compressed_model = get_compressed_model(model, 'tucker2', ratio)
  compressed_model_stats = FlopCo(compressed_model, img_size=(1, 3, 32, 32), device=device)

  # FLOPs reduction rate = original FLOPs / compressed FLOPs
  flops_ratio_tucker2_list.append(model_stats.total_flops / compressed_model_stats.total_flops)
  top1_acc_tucker2_list.append(get_validation_scores(compressed_model, val_loader, device=device)[0])

  compressed_model = get_compressed_model(model, 'cp3', ratio)
  compressed_model_stats = FlopCo(compressed_model, img_size=(1, 3, 32, 32), device=device)

  flops_ratio_cp3_list.append(model_stats.total_flops / compressed_model_stats.total_flops)
  top1_acc_cp3_list.append(get_validation_scores(compressed_model, val_loader, device=device)[0])

In [None]:
# Plot and analyze results

plt.plot(flops_ratio_tucker2_list, top1_acc_tucker2_list, marker='o', label='tucker2')
plt.plot(flops_ratio_cp3_list, top1_acc_cp3_list, marker='o', label='cp3')

plt.xlabel('FLOPs reduction rate')
plt.ylabel('Top-1 accuracy')
plt.legend()

plt.show()

# Task 5: Bonus task [1 pts]

In Task 2, initialize weights of decomposed layer by using factors from tensor decomposition instead of random initialization.

- Hint: you can see how to perform correct factor reshapes needed for weights initialization in the seminar
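
As a shape-level illustration of the reshapes the hint refers to, the sketch below takes hypothetical CP factors (random matrices standing in for the output of a real decomposition) and arranges them as the three convolution kernels. It uses NumPy only and assumes the factor order (output channels, input channels, merged spatial dimension); all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
c_out, c_in, k_h, k_w, rank = 6, 4, 3, 3, 5

# Hypothetical CP factors of the (C_out, C_in, k_h * k_w) tensor.
f_out = rng.standard_normal((c_out, rank))
f_in = rng.standard_normal((c_in, rank))
f_hw = rng.standard_normal((k_h * k_w, rank))

# Kernels in PyTorch's (out_channels, in_channels / groups, k_h, k_w) layout.
w_pointwise_in = f_in.T.reshape(rank, c_in, 1, 1)    # 1x1 conv, C_in -> R
w_depthwise = f_hw.T.reshape(rank, 1, k_h, k_w)      # depthwise conv, groups = R
w_pointwise_out = f_out.reshape(c_out, rank, 1, 1)   # 1x1 conv, R -> C_out

# Composing the three kernels gives back the CP reconstruction of the 4-d kernel.
composed = np.einsum(
    "or,ri,rhw->oihw",
    w_pointwise_out[:, :, 0, 0],
    w_pointwise_in[:, :, 0, 0],
    w_depthwise[:, 0],
)
cp_tensor = np.einsum("or,ir,sr->ois", f_out, f_in, f_hw).reshape(c_out, c_in, k_h, k_w)
print(np.allclose(composed, cp_tensor))
```

The input and spatial factors are transposed because their rank dimension becomes the output-channel dimension of the corresponding kernel.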

In [None]:
# ANSWER
from tensorly.decomposition import parafac

def calculate_R(C_in, C_out, k_h, k_w):
    return int((C_in * C_out * k_h * k_w) / (4 * (C_in + k_h * k_w + C_out)))

def create_initial_convolution(C_in, C_out, k_h, k_w, padding, stride):
    return nn.Conv2d(in_channels=C_in, out_channels=C_out, kernel_size=(k_h, k_w), padding=padding, stride=stride)

def create_decomposed_convolution(C_in, R, C_out, k_h, k_w, padding, stride):
    # stride and padding belong to the depthwise conv; the 1x1 convs keep the spatial size
    return nn.Sequential(
        nn.Conv2d(in_channels=C_in, out_channels=R, kernel_size=(1, 1), bias=False),
        # Depthwise convolution
        nn.Conv2d(in_channels=R, out_channels=R, kernel_size=(k_h, k_w), stride=stride, padding=padding, groups=R, bias=False),
        nn.Conv2d(in_channels=R, out_channels=C_out, kernel_size=(1, 1)),
    )

# ANSWER
def initialize_decomposed_convolution_weights(layer, factors, bias=None):
    # factors: CP factors of the (C_out, C_in, k_h * k_w) tensor,
    # with shapes (C_out, R), (C_in, R) and (k_h * k_w, R)
    f_out, f_in, f_hw = factors
    rank = f_in.shape[1]
    k_h, k_w = layer[1].kernel_size
    with torch.no_grad():
        layer[0].weight.copy_(f_in.t().reshape(rank, -1, 1, 1))
        layer[1].weight.copy_(f_hw.t().reshape(rank, 1, k_h, k_w))
        layer[2].weight.copy_(f_out.reshape(-1, rank, 1, 1))
        if bias is not None:
            layer[2].bias.copy_(bias)

# initiate variables
R = calculate_R(C_in, C_out, k_h, k_w)
initial_convolution = create_initial_convolution(C_in, C_out, k_h, k_w, padding, stride)
decomposed_convolution = create_decomposed_convolution(C_in, R, C_out, k_h, k_w, padding, stride)

# Rank-R CP decomposition of the kernel with merged spatial dimensions
# (tensorly uses the PyTorch backend set in the helper section)
weight_3d = initial_convolution.weight.detach().reshape(C_out, C_in, k_h * k_w)
_, factors = parafac(weight_3d, rank=R, init='random')

# Initialize the decomposed convolutional layer weights
initialize_decomposed_convolution_weights(decomposed_convolution, factors, initial_convolution.bias.detach())

# Calculate and print the MACs for both layers
initial_stats = FlopCo(initial_convolution, img_size=(1, C_in, H, W))
decomposed_stats = FlopCo(decomposed_convolution, img_size=(1, C_in, H, W))

print("Initial convolutional layer macs:", initial_stats.total_macs)
print("Decomposed convolutional layer macs:", decomposed_stats.total_macs)

# Checking the constraint
assert int(initial_stats.total_macs / decomposed_stats.total_macs) == 4