<a href="https://colab.research.google.com/github/ML-HW-SYS/a2-WDaugherty/blob/main/2_size_estimator_and_profiler.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **2. Model Size Estimation**

It is no surprise that with such a tiny package, your Arduino Nano 33 BLE Sense comes with limited memory and processing power. Therefore, you must be aware of the size and components of your model in order to have it run efficiently on your MCU.

This notebook explores how various neural network layers affect the number of parameters, the amount of memory, the number of floating point operations, and the CPU runtime of your model.

## 2.0 Setup GDrive and Git

In [2]:
# Mount google drive
from google.colab import drive 
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [3]:
# Make sure your token is stored in a txt file at the location below.
# This way there is no risk that you will push it to your repo.
# Never share your token with anyone; it is basically your GitHub password!
with open('/content/gdrive/MyDrive/ece5545/token.txt') as f:
    token = f.readline().strip()
# Use another file to store your github username    
with open('/content/gdrive/MyDrive/ece5545/git_username.txt') as f:
    handle = f.readline().strip()

In [4]:
# Clone your github repo
YOUR_TOKEN = token
YOUR_HANDLE = handle
BRANCH = "main"

%mkdir /content/gdrive/MyDrive/ece5545
%cd /content/gdrive/MyDrive/ece5545
!git clone https://{YOUR_TOKEN}@github.com/ML-HW-SYS/a2-{YOUR_HANDLE}.git
%cd /content/gdrive/MyDrive/ece5545/a2-{YOUR_HANDLE}
!git checkout {BRANCH}
!git pull
%cd /content/gdrive/MyDrive/ece5545

PROJECT_ROOT = f"/content/gdrive/MyDrive/ece5545/a2-{YOUR_HANDLE}"

mkdir: cannot create directory ‘/content/gdrive/MyDrive/ece5545’: File exists
/content/gdrive/MyDrive/ece5545
fatal: destination path 'a2-WDaugherty' already exists and is not an empty directory.
/content/gdrive/MyDrive/ece5545/a2-WDaugherty
M	2_size_estimator_and_profiler.ipynb
Already on 'main'
Your branch is up to date with 'origin/main'.
Already up to date.
/content/gdrive/MyDrive/ece5545


In [5]:
# This extension reloads all imports before running each cell
%load_ext autoreload
%autoreload 2

### Import code dependencies

In [6]:
import sys
print(PROJECT_ROOT)
!ls {PROJECT_ROOT}/src
print(sys.path)

/content/gdrive/MyDrive/ece5545/a2-WDaugherty
bonus_constants.py  data_proc.py  __pycache__	       size_estimate.py
bonus_quant.py	    loaders.py	  quant_conversion.py  train_val_test_utils.py
constants.py	    networks.py   quant.py
['/content', '/env/python', '/usr/lib/python39.zip', '/usr/lib/python3.9', '/usr/lib/python3.9/lib-dynload', '', '/usr/local/lib/python3.9/dist-packages', '/usr/lib/python3/dist-packages', '/usr/local/lib/python3.9/dist-packages/IPython/extensions', '/root/.ipython']


In [7]:
import sys,os

# Adding assignment 2 to the system path
# Make sure this matches your git directory
sys.path.insert(0, PROJECT_ROOT)

import torch
import torch.nn as nnt
import src.data_proc as data_proc
from src.constants import *
import numpy as np

print("Imported code dependencies")

Model folders are created, 
PyTorch models will be saved in /content/gdrive/MyDrive/ece5545/models/torch_models, 
ONNX models will be saved in /content/gdrive/MyDrive/ece5545/models/onnx_models, 
TensorFlow Saved Models will be saved in /content/gdrive/MyDrive/ece5545/models/tf_models, 
TensorFlow Lite models will be saved in /content/gdrive/MyDrive/ece5545/models/tflite_models, 
TensorFlow Lite Micro models will be saved in /content/gdrive/MyDrive/ece5545/models/micro_models.
Imported code dependencies


## 2.2 Define the Model 

### Create the model
Our TinyConv model currently consists of 7 layers:


1. [Reshape](https://pytorch.org/docs/stable/generated/torch.reshape.html)
2. [Conv2D](https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html#torch.nn.Conv2d)
3. [ReLU](https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html#torch.nn.ReLU)
4. [Dropout](https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html#torch.nn.Dropout) 
5. Reshape
6. [Fully Connected (Linear)](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html#torch.nn.Linear)
7. [Softmax](https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html#torch.nn.Softmax)


Please refer to `<github_dir>/src/networks.py` for more detail.

In [8]:
# Define device
from src.networks import TinyConv

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f'Using {device} to run the training script.')

Using cpu to run the training script.


### Create data_proc.AudioProcessor() object for data preprocessing
When an AudioProcessor instance is created, it:

1. Downloads the Speech Commands dataset from DATA_URL (defined in constants.py) to data_dir (default: '/content/gdrive/MyDrive/ece5545/data'). The default dataset URL is 'https://storage.googleapis.com/download.tensorflow.org/data/speech_commands_v0.02.tar.gz'.

2. Determines the classes and their numerical indices for training and testing based on WANTED_WORDS (defined in constants.py). For example, if WANTED_WORDS is ['yes', 'no'], the model will be trained to identify "yes" and "no" as yes and no, all other words as unknown, and background noise as silence.

3. Determines and saves the settings for the data processing feature generator based on the relevant constants in constants.py.

4. Determines which audio files in the dataset are used for testing, training, or validation using a hashing method.

5. Prepares and saves background noise data using the background noise recordings inside the dataset.
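
Step 4 can be illustrated with a minimal sketch of a hash-based split. It is modeled on the approach used in the TensorFlow speech_commands example; whether `src/data_proc.py` does exactly this is an assumption, and the constants and the `which_set` name below are illustrative rather than taken from this repo.

```python
import hashlib
import re

# Assumed constants, borrowed from the TensorFlow speech_commands example.
# The values used in src/data_proc.py may differ.
MAX_NUM_WAVS_PER_CLASS = 2**27 - 1
VALIDATION_PERCENTAGE = 10
TESTING_PERCENTAGE = 10


def which_set(filename,
              validation_percentage=VALIDATION_PERCENTAGE,
              testing_percentage=TESTING_PERCENTAGE):
    """Assign a wav file to 'training', 'validation', or 'testing'."""
    base_name = filename.split("/")[-1]
    # Ignore everything from '_nohash_' onward so that all recordings
    # from the same speaker hash to the same value.
    hash_name = re.sub(r"_nohash_.*$", "", base_name)
    hash_hex = hashlib.sha1(hash_name.encode("utf-8")).hexdigest()
    # Map the hash onto the range [0, 100].
    percentage_hash = ((int(hash_hex, 16) % (MAX_NUM_WAVS_PER_CLASS + 1))
                       * (100.0 / MAX_NUM_WAVS_PER_CLASS))
    if percentage_hash < validation_percentage:
        return "validation"
    if percentage_hash < validation_percentage + testing_percentage:
        return "testing"
    return "training"


# Two takes from the same (hypothetical) speaker land in the same split.
print(which_set("yes/0a7c2a8d_nohash_0.wav"))
print(which_set("yes/0a7c2a8d_nohash_1.wav"))
```

Because the assignment depends only on the file name, a file keeps its split even when more data is added later, which keeps evaluation sets stable across runs.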

In [9]:
# Create audio processor (this takes some time the first time)
# And continues to run for a bit after reaching 100% while it's extracting files
audio_processor = data_proc.AudioProcessor()

In [10]:
# Create model
model_fp32 = TinyConv(audio_processor.model_settings)
model_fp32

TinyConv(
  (conv_reshape): Reshape(output_shape=(-1, 1, 49, 40))
  (conv): Conv2d(1, 8, kernel_size=(10, 8), stride=(2, 2), padding=(5, 3))
  (relu): ReLU()
  (dropout): Dropout(p=0.5, inplace=False)
  (fc_reshape): Reshape(output_shape=(-1, 4000))
  (fc): Linear(in_features=4000, out_features=4, bias=True)
  (softmax): Softmax(dim=1)
)
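
The `4000` in `fc_reshape` and `fc` follows from the convolution's output shape. As a quick check, the sketch below applies the output-size formula from the PyTorch `Conv2d` documentation to the kernel, stride, and padding printed above; the helper name `conv_out_size` is made up for this example.

```python
def conv_out_size(size, kernel, stride, padding, dilation=1):
    """Output length along one dimension, per the Conv2d documentation."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


# Input is reshaped to (1, 49, 40): 1 channel, 49 time steps, 40 features.
h_out = conv_out_size(49, kernel=10, stride=2, padding=5)  # 25
w_out = conv_out_size(40, kernel=8, stride=2, padding=3)   # 20

out_channels = 8
flattened = out_channels * h_out * w_out
print(h_out, w_out, flattened)  # 25 20 4000
```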

## 2.3 Model Estimates
Run the next few cells to see how each layer impacts the memory and runtime of the TinyConv model defined above. Then experiment with reshaping it to see how adding or removing layers alters the metrics.

### Memory Utilization

There are two important forms of memory that we care about for MCUs: **flash memory** and **random access memory (RAM)**. Flash is **non-volatile** (persistent) memory: its contents are retained when the device is powered off. This is where your model's weights and code live, so together they must fit within the capacity of your MCU's flash memory (1MB). RAM, on the other hand, is **volatile** (non-persistent) memory, so it is used for temporary storage such as input buffers and intermediate tensors. Together, these cannot exceed the size of your RAM (256KB).
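
A budget check like this can be written in a few lines. The sketch below is only illustrative: it assumes binary units (1MB = 1024 × 1024 bytes), 4 bytes per fp32 parameter, and counts weights only, ignoring the program code that also lives in flash. The helper names are made up for this example, and 16652 is the parameter count reported later in this notebook.

```python
FLASH_BYTES = 1 * 1024 * 1024  # 1MB of flash (binary units assumed)
RAM_BYTES = 256 * 1024         # 256KB of RAM (binary units assumed)


def weight_bytes(num_params, bytes_per_param=4):
    """Storage needed for the weights alone (fp32 by default)."""
    return num_params * bytes_per_param


def fits(used_bytes, capacity_bytes):
    """True if the requested number of bytes fits in the given capacity."""
    return used_bytes <= capacity_bytes


example = weight_bytes(16652)  # fp32 weights only, no code
print(example, "bytes;", f"{100 * example / FLASH_BYTES:.1f}% of flash")
print("fits in flash:", fits(example, FLASH_BYTES))
```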

### TODO 1: Implement the `count_trainable_parameters` function in `src/size_estimate.py` to compute model size and get an estimate of the flash usage of this model


In [11]:
# Pull the newest version of the repo
%cd /content/gdrive/MyDrive/ece5545/a2-WDaugherty/src
!git pull 

/content/gdrive/MyDrive/ece5545/a2-WDaugherty/src
Already up to date.


In [12]:
# Move the model weights to the GPU if one is available
if torch.cuda.is_available():
    model_fp32.cuda()

from src.size_estimate import count_trainable_parameters
num_params = count_trainable_parameters(model_fp32)
print("Total number of trainable parameters: ", num_params / float(1e6), "M") # Should be about 0.016652 M

Total number of trainable parameters:  0.016652 M
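
The printed total can be checked by hand from the layer shapes shown in the model summary. The helper names below are made up for this check and are not the assignment's `count_trainable_parameters`; only the convolution and the fully connected layer carry trainable parameters.

```python
def conv2d_params(in_ch, out_ch, k_h, k_w, bias=True):
    """Weights (out_ch * in_ch * k_h * k_w) plus one bias per output channel."""
    return out_ch * in_ch * k_h * k_w + (out_ch if bias else 0)


def linear_params(in_features, out_features, bias=True):
    """Weights (in * out) plus one bias per output feature."""
    return in_features * out_features + (out_features if bias else 0)


conv = conv2d_params(1, 8, 10, 8)  # 640 weights + 8 biases = 648
fc = linear_params(4000, 4)        # 16000 weights + 4 biases = 16004
total = conv + fc
print(total, total / 1e6, "M")     # 16652 0.016652 M
```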


### TODO 2: Implement the `compute_forward_memory` function in `src/size_estimate.py` to compute the memory needed for a forward pass. This is how much RAM you will be using.

In [13]:
# Move the model weights to the GPU if one is available
if torch.cuda.is_available():
    model_fp32.cuda()

from src.size_estimate import compute_forward_memory
frd_memory = compute_forward_memory(
    model_fp32,
    (1, model_fp32.model_settings['fingerprint_width'], model_fp32.model_settings['spectrogram_length']),
    device
)
print("Forward memory: ", frd_memory / float(1e6), "M") # Should be about 0.03462 M

Forward memory:  0.007856 M
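
For reference, the figure quoted in the cell comment (0.03462 M) is reproduced by one simple accounting: the number of trainable parameters plus the number of elements in every layer's output for a batch of one. This is an assumption about what the comment intends, not the reference implementation, and it counts elements rather than bytes (multiply by 4 for fp32).

```python
# Output elements of each layer for a single input (batch size 1),
# taken from the shapes in the model summary above.
layer_output_elements = {
    "conv_reshape": 1 * 49 * 40,  # 1960
    "conv": 8 * 25 * 20,          # 4000
    "relu": 8 * 25 * 20,          # 4000
    "dropout": 8 * 25 * 20,       # 4000
    "fc_reshape": 4000,
    "fc": 4,
    "softmax": 4,
}

num_params = 16652  # from the parameter count above
activation_elements = sum(layer_output_elements.values())
total_elements = num_params + activation_elements

print(activation_elements)               # 17968
print(total_elements / float(1e6), "M")  # 0.03462 M
```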


As you can see above, the number of parameters in a neural network can add up fast, which is a concern when memory is this limited. With the TinyConv neural network consuming only 0.21MB out of 1MB, our model will easily fit within flash memory.

### Number of Operations

### TODO 3: Implement the `flop` function in `src/size_estimate.py` to count the total number of floating point operations (FLOPs) in a forward pass with batch size = 1

In [14]:
from pprint import pprint
from src.size_estimate import flop

if torch.cuda.is_available():
    model_fp32.cuda()

# The total number of floating point operations 
flop_by_layers = flop(
    model=model_fp32, 
    input_shape=(
        1, 
        model_fp32.model_settings['fingerprint_width'], 
        model_fp32.model_settings['spectrogram_length']
    ), 
    device=device)
total_param_flops = sum([sum(val.values()) for val in flop_by_layers.values()])


print(f'total number of floating operations: {total_param_flops}')  # total number of floating operations: 340004
print('Number of FLOPs by layer and parameters:') 
print("Conv: ", flop_by_layers['conv'])  # {'bias': 4000, 'weight': 320000} divide by 2
print("FC:   ", flop_by_layers['fc'])  # {'bias': 4, 'weight': 16000} divide by 4

total number of floating operations: 1316004
Number of FLOPs by layer and parameters:
Conv:  {'bias': 4000, 'weight': 1280000}
FC:    {'bias': 4, 'weight': 32000}
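
The expected values in the cell comments can be reproduced by hand if each multiply-accumulate (MAC) is counted as one operation and each bias addition as one operation. That counting convention is an assumption inferred from the comments; counting a MAC as two FLOPs instead gives 640,000 for the convolution, which matches the 640 KFLOPs reported for `aten::conv2d` in the profiler table further down.

```python
# Convolution: 8 output channels on a 25 x 20 output map,
# each output computed from a 1 x 10 x 8 window.
conv_outputs = 8 * 25 * 20                  # 4000
macs_per_output = 1 * 10 * 8                # 80
conv_weight_macs = conv_outputs * macs_per_output
conv_bias_adds = conv_outputs               # one add per output element

# Fully connected: 4000 inputs to 4 outputs.
fc_weight_macs = 4000 * 4
fc_bias_adds = 4

total = conv_weight_macs + conv_bias_adds + fc_weight_macs + fc_bias_adds
print(conv_weight_macs, conv_bias_adds)  # 320000 4000
print(fc_weight_macs, fc_bias_adds)      # 16000 4
print(total)                             # 340004
print(2 * conv_weight_macs)              # 640000 (two FLOPs per MAC)
```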


### CPU runtime

### TODO 4: Measure the server/desktop CPU runtime to compare to the MCU runtime later in this assignment
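
A single profiled run can be dominated by one-time costs, so it can help to cross-check the profiler with a repeated wall-clock measurement. The sketch below is a generic timing helper using only the standard library; `time_callable` and `dummy_inference` are hypothetical names, and in this notebook the callable would be something like `lambda: model_fp32(inputs)`.

```python
import statistics
import time


def time_callable(fn, warmup=3, repeats=20):
    """Return (median, best) wall-clock seconds over `repeats` calls."""
    # Warm-up calls absorb one-time costs such as lazy initialization.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples), min(samples)


def dummy_inference():
    """Stand-in workload so the sketch runs without a model."""
    return sum(i * i for i in range(10_000))


median_s, best_s = time_callable(dummy_inference)
print(f"median: {median_s * 1e3:.3f} ms, best: {best_s * 1e3:.3f} ms")
```

Reporting the median rather than the mean makes the result less sensitive to occasional slow runs caused by other processes on the machine.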

In [25]:
model_fp32.cpu()
model_fp32.eval()
inputs = torch.rand([1,1960]).cpu()

# Run a profiler to see the cpu time for inference 
from torch.profiler import profile, record_function, ProfilerActivity
with profile(activities=[ProfilerActivity.CPU], record_shapes=True, with_flops=True, with_stack=True) as prof:
    with record_function("model_inference"):
        model_fp32(inputs)
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))

----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                        Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  Total KFLOPs  
----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
             model_inference         0.41%     644.000us        98.68%     154.655ms     154.655ms             1            --  
                aten::conv2d         0.00%       7.000us        81.62%     127.929ms     127.929ms             1       640.000  
           aten::convolution         0.02%      33.000us        81.62%     127.922ms     127.922ms             1            --  
          aten::_convolution         0.01%      19.000us        81.60%     127.889ms     127.889ms             1            --  
    aten::mkldnn_convolution        81.55%     127.815ms        81.59%     127.870ms     127.870m

In [27]:
# Modified profiler to show CUDA and CPU times
model_fp32.cuda()
model_fp32.eval()
inputs = torch.rand([1,1960]).cuda()

# Run a profiler to see the gpu time for inference 
from torch.profiler import profile, record_function, ProfilerActivity
with profile(activities=[ProfilerActivity.CUDA], record_shapes=True, with_flops=True, with_stack=True) as prof:
    with record_function("model_inference"):
        model_fp32(inputs)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
void implicit_convolve_sgemm<float, float, 128, 5, 5...         0.00%       0.000us         0.00%       0.000us       0.000us      18.000us        41.86%      18.000us      18.000us             1  
void dot_kernel<float, 128, 0, cublasDotParams<cubla...         0.00%       0.000us         0.00%       0.000us       0.000us       6.000us        13.95%       6.000us       6.000us             1  
void at::