# **Part 1** is run on **Kaggle** with a **P100**
# **Part 2** is run on **Colab** with a **T4**

# **Part 1: Run MobileNet on GPU**

In this tutorial, we will explore how to train a neural network with PyTorch.

### Setup (5%)

We will first install a few packages used in this tutorial and register the path of the CUDA libraries:

In [None]:
print('Begin.')
!pip install torchprofile 1>/dev/null
!ldconfig /usr/lib64-nvidia 2>/dev/null
!pip install onnx 1>/dev/null
!pip install onnxruntime 1>/dev/null

Begin.


We will then import a few libraries:

In [None]:
import random

import numpy as np
import torch
import torchvision
from torch import nn
from torch.optim import *
from torch.optim.lr_scheduler import *
from torch.utils.data import DataLoader
from torchprofile import profile_macs
from torchvision.datasets import *
from torchvision.transforms import *
from tqdm.notebook import tqdm

In [None]:
print(torch.__version__)
print(torchvision.__version__)

2.5.1+cu121
0.20.1+cu121


To ensure reproducibility, we fix the seeds of the random number generators:

In [None]:
random.seed(0)
np.random.seed(0)
torch.manual_seed(0)

<torch._C.Generator at 0x790227322c50>

We must decide on the hyperparameters before training the model:

In [None]:
NUM_CLASSES = 10

# TODO:
# Decide your own hyper-parameters
BATCH_SIZE = 128
LEARNING_RATE = 4e-3
NUM_EPOCH = 100

### Data  (5%)

In this lab, we will use CIFAR-10 as our target dataset. This dataset contains images from 10 classes, where each image is of
size 3x32x32, i.e. 3-channel color images of 32x32 pixels in size.

Before feeding the data to the model, we pre-process it with a transform pipeline:

In [None]:
# TODO:
# Resize images to 224x224, i.e., the input image size of MobileNet,
# Convert images to PyTorch tensors, and
# Normalize the images with mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
transform = torchvision.transforms.Compose([
  torchvision.transforms.Resize((224, 224)),
  torchvision.transforms.ToTensor(),
  torchvision.transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

from torchvision.datasets import CIFAR10
dataset = {}
for split in ["train", "test"]:
  dataset[split] = CIFAR10(
    root="data/cifar10",
    train=(split == "train"),
    download=True,
    transform=transform,
  )

Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to data/cifar10/cifar-10-python.tar.gz


100%|██████████| 170M/170M [00:02<00:00, 76.8MB/s] 


Extracting data/cifar10/cifar-10-python.tar.gz to data/cifar10
Files already downloaded and verified
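
As a rough illustration of what the `Normalize` step does, the sketch below applies `(x - mean) / std` per channel to a single RGB pixel in plain Python. The helper name `normalize_pixel` is hypothetical and used only here; the real transform works on whole tensors after `ToTensor()` has scaled pixel values to [0, 1].

```python
# Channel statistics passed to Normalize in the transform above.
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb):
    """Apply (x - mean) / std per channel to one RGB pixel in [0, 1]."""
    return [(x - m) / s for x, m, s in zip(rgb, MEAN, STD)]


# The darkest and brightest possible pixels bound the normalized range.
black = normalize_pixel([0.0, 0.0, 0.0])
white = normalize_pixel([1.0, 1.0, 1.0])
print("black:", [round(v, 3) for v in black])
print("white:", [round(v, 3) for v in white])
```

After normalization the inputs are roughly centered on zero, with values between about -2.1 and 2.6.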


To train a neural network, we will need to feed data in batches.

We create data loaders with the batch size chosen in the Setup section:

In [None]:
dataflow = {}
for split in ['train', 'test']:
  dataflow[split] = DataLoader(
    dataset[split],
    batch_size=BATCH_SIZE,
    shuffle=(split == 'train'),
    num_workers=3,
    pin_memory=True,
    drop_last=True
  )

We can print the data type and shape from the training data loader:

In [None]:
for inputs, targets in dataflow["train"]:
  print(f"[inputs] dtype: {inputs.dtype}, shape: {inputs.shape}")
  print(f"[targets] dtype: {targets.dtype}, shape: {targets.shape}")
  break

[inputs] dtype: torch.float32, shape: torch.Size([128, 3, 224, 224])
[targets] dtype: torch.int64, shape: torch.Size([128])
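
Because the loaders use `drop_last=True`, the last incomplete batch of each split is discarded. The sketch below works out how many batches each split yields, assuming the standard CIFAR-10 split of 50,000 training and 10,000 test images; `batches_per_epoch` is a hypothetical helper, not part of PyTorch.

```python
def batches_per_epoch(num_samples, batch_size, drop_last):
    """Number of batches a loader yields per pass over the data."""
    full, remainder = divmod(num_samples, batch_size)
    if drop_last or remainder == 0:
        return full
    return full + 1


BATCH_SIZE = 128
train_batches = batches_per_epoch(50_000, BATCH_SIZE, drop_last=True)
test_batches = batches_per_epoch(10_000, BATCH_SIZE, drop_last=True)
# With drop_last=True, only full batches are evaluated.
test_samples_used = test_batches * BATCH_SIZE

print(train_batches, test_batches, test_samples_used)
```

One consequence is that accuracy on the test split is measured on 9,984 images rather than all 10,000. Setting `drop_last=False` for the test loader would cover the full split.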


### Model (10%)

In this tutorial, we will import MobileNet provided by torchvision, and use the pre-trained weight:

In [None]:
# TODO:
# Load pre-trained MobileNetV2
from torchvision.models import mobilenet_v2, MobileNet_V2_Weights

model = mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT)

# print(model)



Downloading: "https://download.pytorch.org/models/mobilenet_v2-7ebf99e0.pth" to /root/.cache/torch/hub/checkpoints/mobilenet_v2-7ebf99e0.pth
100%|██████████| 13.6M/13.6M [00:00<00:00, 104MB/s] 


You should observe that the output dimension of the classifier does not match the number of classes in CIFAR-10.

Now change the output dimension of the classifier to the number of classes:

In [None]:
# TODO:
# Change the output dimension of the classifier to the number of classes
in_features = model.classifier[1].in_features
model.classifier[1] = nn.Linear(in_features, NUM_CLASSES)

# Send the model from cpu to gpu
if not torch.cuda.is_available():
    raise Exception("Cuda is not available.")
model = model.cuda()

Now the output dimension of the classifier matches.

As this course focuses on efficiency, we will then inspect its model size and (theoretical) computation cost.


* The model size can be estimated by the number of trainable parameters:

In [None]:
num_params = 0
for param in model.parameters():
  if param.requires_grad:
    num_params += param.numel()
print("#Params:", num_params)

#Params: 2236682


* The computation cost can be estimated by the number of [multiply–accumulate operations (MACs)](https://en.wikipedia.org/wiki/Multiply–accumulate_operation) using [TorchProfile](https://github.com/zhijian-liu/torchprofile). We will use this profiling tool again in future labs.

In [None]:
num_macs = profile_macs(model, torch.zeros(1, 3, 224, 224).cuda())
print("#MACs:", num_macs)

#MACs: 306186464


This model has 2.2M parameters and requires 306M MACs for inference. We will work together in the next few labs to improve its efficiency.
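
To put the parameter count in perspective, the sketch below splits it into the new classifier head and the rest of the network, then estimates the weight storage. It assumes the classifier's `in_features` is 1280 (the value I believe torchvision's MobileNetV2 uses; check `model.classifier[1].in_features`) and that weights are stored as 4-byte float32 values.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


NUM_PARAMS = 2_236_682   # "#Params" reported above
IN_FEATURES = 1280       # assumption: MobileNetV2 classifier input width
NUM_CLASSES = 10

head_params = linear_params(IN_FEATURES, NUM_CLASSES)
backbone_params = NUM_PARAMS - head_params
size_mb = NUM_PARAMS * 4 / 1e6  # assumption: float32, 4 bytes per parameter

print(f"head: {head_params}, backbone: {backbone_params}, size: {size_mb:.2f} MB")
```

Under these assumptions the 10-class head accounts for well under 1% of the parameters, so nearly all of the model size sits in the convolutional backbone.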

### Optimization (10%)

As we are working on a classification problem, we will apply [cross entropy](https://en.wikipedia.org/wiki/Cross_entropy) as our loss function to optimize the model:

In [None]:
# TODO:
# Apply cross entropy as our loss function
criterion = torch.nn.CrossEntropyLoss()
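
For intuition, cross entropy for one sample is the negative log-softmax of the target class. The plain-Python sketch below (with a hypothetical helper `cross_entropy`, not the PyTorch implementation) shows the two extremes: uniform logits and a confident correct prediction.

```python
import math


def cross_entropy(logits, target):
    """Negative log-softmax of the target class for a single sample."""
    # Subtract the max logit before exponentiating for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


# Uniform logits over 10 classes: the model is guessing.
uniform_loss = cross_entropy([0.0] * 10, target=3)
# A large logit on the correct class: the loss approaches zero.
confident_loss = cross_entropy([10.0] + [0.0] * 9, target=0)

print(f"uniform: {uniform_loss:.4f}, confident: {confident_loss:.6f}")
```

A model that guesses uniformly over 10 classes has a loss of ln(10) ≈ 2.3026, which is a useful reference point when reading the loss in the first epochs of training.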

We then choose an optimizer for the model:

In [None]:
# TODO:
# Choose an optimizer.

optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)

(Optional) We can apply a learning rate scheduler during the training:

In [None]:
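# TODO(optional):
# Cosine annealing. Note: train_one_batch calls scheduler.step() once per batch,
# so T_max here is counted in batches, not epochs.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=NUM_EPOCH)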
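
The shape of a cosine annealing schedule can be sketched with its closed-form expression, eta_min + 0.5 * (base_lr - eta_min) * (1 + cos(pi * step / T_max)). The function `cosine_lr` below is a hypothetical stand-alone helper for illustration, not the PyTorch scheduler itself, and assumes `eta_min = 0`.

```python
import math


def cosine_lr(step, base_lr, t_max, eta_min=0.0):
    """Closed-form cosine annealing learning rate at a given step."""
    cosine = math.cos(math.pi * step / t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + cosine)


BASE_LR = 4e-3   # LEARNING_RATE from the setup section
T_MAX = 100      # T_max passed to the scheduler

lr_start = cosine_lr(0, BASE_LR, T_MAX)
lr_mid = cosine_lr(T_MAX // 2, BASE_LR, T_MAX)
lr_end = cosine_lr(T_MAX, BASE_LR, T_MAX)
print(lr_start, lr_mid, lr_end)
```

The unit of `step` is whatever `scheduler.step()` counts. If it is called once per batch, the learning rate reaches its minimum after `T_max` batches rather than `T_max` epochs.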

### Training (25%)

We first define the function that optimizes the model for one batch:

In [None]:
import torch

def train_one_batch(
  model: nn.Module,
  criterion: nn.Module,
  optimizer,
  inputs: torch.Tensor,
  targets: torch.Tensor,
  scheduler,
  scaler
) -> float:

    # Step 1: Reset the gradients
    optimizer.zero_grad()

    # Step 2: Forward inference under mixed precision
    with torch.amp.autocast('cuda'):
        output = model(inputs)
        # Step 3: Calculate the loss
        loss = criterion(output, targets)

    # Step 4: Backward propagation on the scaled loss
    scaler.scale(loss).backward()

    # Step 5: Update the weights, then adjust the loss scale
    scaler.step(optimizer)
    scaler.update()

    # (Optional Step 6: scheduler, stepped once per batch)
    scheduler.step()

    # Return the batch loss as a Python float
    return loss.item()
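
The reason for scaling the loss before `backward()` is that some intermediate gradients are computed in float16 under autocast, and very small values underflow to zero in that format. The sketch below imitates this with the standard library's half-precision packing; `to_fp16` is a hypothetical helper, and 65536 is the default initial scale of `GradScaler`.

```python
import struct


def to_fp16(x):
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("e", struct.pack("e", x))[0]


SCALE = 65536.0   # 2**16, the default initial scale of GradScaler
tiny_grad = 1e-8  # illustrative gradient value, below the float16 range

unscaled = to_fp16(tiny_grad)        # underflows to zero in float16
scaled = to_fp16(tiny_grad * SCALE)  # representable once scaled up
recovered = scaled / SCALE           # unscale in higher precision

print(unscaled, scaled, recovered)
```

Scaling the loss multiplies every gradient by the same factor, so small gradients survive the float16 backward pass and are divided by the scale again before the optimizer step.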


We then define the training function:

In [None]:
def train(
    model: nn.Module,
    dataflow: DataLoader,
    criterion: nn.Module,
    optimizer,
    scheduler
):

    model.train()
    n_data = 0
    total_loss = 0.0
    scaler = torch.amp.GradScaler('cuda')

    for inputs, targets in tqdm(dataflow, desc='train', leave=False):
        # Move the data from CPU to GPU
        inputs = inputs.cuda()
        targets = targets.cuda()

        # Call train_one_batch function with the scaler
        loss_batch = train_one_batch(model, criterion, optimizer, inputs, targets, scheduler, scaler)
        total_loss += loss_batch
        n_data += 1

    return total_loss / n_data


Last, we define the evaluation function:

In [None]:
def evaluate(
  model: nn.Module,
  dataflow: DataLoader
) -> float:

    model.eval()
    num_samples = 0
    num_correct = 0

    with torch.no_grad():
        for inputs, targets in tqdm(dataflow, desc="eval", leave=False):
            # TODO:
            # Step 1: Move the data from CPU to GPU
            inputs, targets = inputs.cuda(), targets.cuda()
            # Step 2: Forward inference
            output = model(inputs)
            # Step 3: Convert logits to class indices (predicted class)
            predicts = output.argmax(dim=1)
            # Update metrics
            num_samples += targets.size(0)
            num_correct += (predicts == targets).sum()

    return (num_correct / num_samples * 100).item()
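
The metric computed by `evaluate` can be checked by hand on a tiny batch. The plain-Python sketch below mirrors the argmax-and-compare logic with made-up logits; `accuracy_percent` is a hypothetical helper used only for this illustration.

```python
def accuracy_percent(logits_batch, targets):
    """Top-1 accuracy in percent for a batch of per-class logits."""
    # argmax over the class dimension for every sample
    predicts = [max(range(len(row)), key=row.__getitem__) for row in logits_batch]
    num_correct = sum(int(p == t) for p, t in zip(predicts, targets))
    return num_correct / len(targets) * 100


# Four samples, three classes (illustrative values only).
example_logits = [
    [0.1, 2.0, -1.0],  # predicts class 1
    [1.5, 0.2, 0.3],   # predicts class 0
    [0.0, 0.1, 0.9],   # predicts class 2
    [0.4, 0.3, 0.2],   # predicts class 0, but the target is 1
]
example_targets = [1, 0, 2, 1]
acc_example = accuracy_percent(example_logits, example_targets)
print(acc_example)
```

Note that the result is a percentage, so it is compared against values such as 92.5 rather than 0.925.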

With training and evaluation functions, we can finally start training the model!

If the training is done properly, the accuracy should exceed 92.5%:

***Please screenshot the output model accuracy, hand in as YourID_acc_1.png***

In [None]:
bar = tqdm(range(1, NUM_EPOCH + 1))
for epoch_num in bar:
  # torch.cuda.empty_cache()
  loss_epoch = train(model, dataflow["train"], criterion, optimizer, scheduler)
  acc = evaluate(model, dataflow["test"])
  bar.set_postfix_str(f"loss: {loss_epoch:.6f}, acc: {acc:.4f}")
  print(f"epoch: {epoch_num}, loss: {loss_epoch:.6f}, acc: {acc:.4f}")
  if acc >= 92.5:
    print(f"Early stop at epoch {epoch_num}.")
    break

print(f">>> Final accuracy: {acc:.4f}")

epoch: 1, loss: 2.111910, acc: 56.5004
epoch: 2, loss: 1.551154, acc: 70.0721
epoch: 3, loss: 1.052189, acc: 75.1603
epoch: 4, loss: 0.796599, acc: 78.7260
epoch: 5, loss: 0.659948, acc: 81.8009
epoch: 6, loss: 0.568719, acc: 83.6639
epoch: 7, loss: 0.502712, acc: 85.0861
epoch: 8, loss: 0.453088, acc: 86.1078
epoch: 9, loss: 0.415023, acc: 86.9691
epoch: 10, loss: 0.384532, acc: 87.8305
epoch: 11, loss: 0.357514, acc: 88.2913
epoch: 12, loss: 0.335523, acc: 88.9824
epoch: 13, loss: 0.316275, acc: 89.1526
epoch: 14, loss: 0.303763, acc: 89.5433
epoch: 15, loss: 0.288680, acc: 89.5332
epoch: 16, loss: 0.272876, acc: 89.8938
epoch: 17, loss: 0.259297, acc: 90.1442
epoch: 18, loss: 0.248396, acc: 90.4948
epoch: 19, loss: 0.239228, acc: 90.7752
epoch: 20, loss: 0.228160, acc: 90.4748
epoch: 21, loss: 0.218777, acc: 90.8954
epoch: 22, loss: 0.212095, acc: 90.9455
epoch: 23, loss: 0.202758, acc: 90.8454
epoch: 24, loss: 0.194558, acc: 91.5966
epoch: 25, loss: 0.187069, acc: 91.6266
epoch: 26, loss: 0.182133, acc: 91.7468
epoch: 27, loss: 0.175187, acc: 91.7268
epoch: 28, loss: 0.171883, acc: 92.0072
epoch: 29, loss: 0.164022, acc: 92.0072
epoch: 30, loss: 0.157361, acc: 92.1875
epoch: 31, loss: 0.152952, acc: 92.1474
epoch: 32, loss: 0.146274, acc: 92.1575
epoch: 33, loss: 0.140128, acc: 92.3878
epoch: 34, loss: 0.135608, acc: 92.2276
epoch: 35, loss: 0.131803, acc: 92.3878
epoch: 36, loss: 0.128689, acc: 92.3678
epoch: 37, loss: 0.123695, acc: 92.3077
epoch: 38, loss: 0.118296, acc: 92.5881
Early stop at epoch 38.
>>> Final accuracy: 92.5881


Save the weights of the model as "model.pt":

In [None]:
# TODO:
# Save the model weight
torch.save(model.state_dict(), "model.pt")
print('save model')

save model


You will find "model.pt" in the current folder.

### Export Model (5%)

We can also save the model weight in [ONNX Format](https://pytorch.org/docs/stable/onnx_torchscript.html):

In [None]:
import torch.onnx

# TODO:
# Specify the input shape

onnx_path = 'model.onnx'

# TODO:
# Export the model to ONNX format
dummy_input = torch.randn(1, 3, 224, 224).cuda()
input_names = ["input"]
output_names = ["output"]
torch.onnx.export(model, dummy_input, onnx_path, verbose=True, input_names=input_names, output_names=output_names)

print(f"Model exported to {onnx_path}")

Model exported to model.onnx


In ONNX format, we can inspect the model structure using [Netron](https://netron.app/).

***Please download the model structure, hand in as YourID_onnx.png.***

### Inference (10%)

Load the saved model weight:



In [None]:
# TODO:
# Step 1: Get the model structure (mobilenet_v2 and the classifier)
loaded_model = mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT)
in_features = loaded_model.classifier[1].in_features
loaded_model.classifier[1] = nn.Linear(in_features, NUM_CLASSES)

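# Step 2: Load the model weights from "model.pt".
# weights_only=True restricts unpickling to tensors and avoids the torch.load FutureWarning.
loaded_model.load_state_dict(torch.load("model.pt", weights_only=True))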

# Step 3: Send the model from cpu to gpu
loaded_model = loaded_model.cuda()



Run inference with the loaded model weights and check the accuracy:

***Please screenshot the output model accuracy, hand in as YourID_acc_2.png***

In [None]:
acc = evaluate(loaded_model, dataflow["test"])
print(f"accuracy: {acc}")

eval:   0%|          | 0/78 [00:00<?, ?it/s]

accuracy: 92.58814239501953


If the accuracy is the same as the accuracy before saving, you have completed Part 1.

Congratulations!

# **Part 2: LLM with torch.compile**

In Part 2, we will compare the inference speed of an LLM with and without ```torch.compile```.

```torch.compile``` was introduced in PyTorch 2.0.

The following tutorial will help you get to know the usage.

[Introduction to torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html)

We will choose ```Llama-3.2-1B-Instruct``` as our LLM model.

Make sure you have been granted access to the Llama model before starting Part 2:

https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct

### Loading LLM (20%)

We will first install the Hugging Face Hub CLI and log in with your token:

In [None]:
!pip install -U "huggingface_hub[cli]"
# !huggingface-cli login
# Do not submit the token to E3
!huggingface-cli login --token 

Collecting huggingface_hub[cli]
  Downloading huggingface_hub-0.29.1-py3-none-any.whl.metadata (13 kB)
Collecting InquirerPy==0.3.4 (from huggingface_hub[cli])
  Downloading InquirerPy-0.3.4-py3-none-any.whl.metadata (8.1 kB)
Collecting pfzy<0.4.0,>=0.3.1 (from InquirerPy==0.3.4->huggingface_hub[cli])
  Downloading pfzy-0.3.4-py3-none-any.whl.metadata (4.9 kB)
Downloading InquirerPy-0.3.4-py3-none-any.whl (67 kB)
Downloading huggingface_hub-0.29.1-py3-none-any.whl (468 kB)
Downloading pfzy-0.3.4-py3-none-any.whl (8.5 kB)
Installing collected packages: pfzy, InquirerPy, huggingface_hub
  Attempting uninstall: huggingface_hub
    Found existing installation: huggingface-hub 0.28.1
    Uninstalling huggingface-hub-0.28.1:
      Successfully u

We choose Llama 3.2 1B Instruct as our LLM and load the pretrained model.

Model ID: **"meta-llama/Llama-3.2-1B-Instruct"**


In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# TODO:
# Load the LLaMA 3.2 1B Instruct model
model_id = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/54.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/296 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/877 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/189 [00:00<?, ?B/s]

First, we need to decide on the prompt to feed into the LLM, as well as the maximum token length.

You can also change the number of timing iterations used in the following tests.

In [3]:
# TODO:
# Input prompt
# You can change the prompt to whatever you want, e.g. "How to learn a new language?", "What is Edge AI?"

prompt = "What is Edge AI? As detailed and content-rich as possible."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
max_token_length = 512
iter_times = 10

### Inference with torch.compile (10%)


Let's define a timer function to measure the speedup from ```torch.compile```:

In [4]:
def timed(fn):
  start = torch.cuda.Event(enable_timing=True)
  end = torch.cuda.Event(enable_timing=True)
  start.record()
  result = fn()
  end.record()
  torch.cuda.synchronize()
  return result, start.elapsed_time(end) / 1000

After everything is set up, let's start!

We first run the inference without ```torch.compile```:


In [5]:
original_times = []

# Timing without torch.compile
for i in range(iter_times):
  with torch.no_grad():
    original_output, original_time = timed(lambda: model.generate(**inputs, max_length=max_token_length, pad_token_id=tokenizer.eos_token_id))
  original_times.append(original_time)
  print(f"Time taken without torch.compile: {original_time} seconds")

# Decode the output
output_text = tokenizer.decode(original_output[0], skip_special_tokens=True)
print(f"Output without torch.compile: {output_text}")

Time taken without torch.compile: 12.41825 seconds
Time taken without torch.compile: 10.8038701171875 seconds
Time taken without torch.compile: 10.866525390625 seconds
Time taken without torch.compile: 10.9125634765625 seconds
Time taken without torch.compile: 10.8023046875 seconds
Time taken without torch.compile: 10.9470576171875 seconds
Time taken without torch.compile: 10.645412109375 seconds
Time taken without torch.compile: 10.6027451171875 seconds
Time taken without torch.compile: 11.56204296875 seconds
Time taken without torch.compile: 10.807638671875 seconds
Output without torch.compile: What is Edge AI? As detailed and content-rich as possible. Edge AI refers to artificial intelligence (AI) that is integrated into the edge of the network, meaning it is located at the edge of the network, rather than in the cloud. The term "edge AI" was first introduced in 2016 by Google, and it has since gained significant attention and investment from various technology companies.

**What is

Before using ```torch.compile```, we need to access the model's ```generation_config``` attribute and set the ```cache_implementation``` to "static".

To use ```torch.compile```, we need to call ```torch.compile``` on the model to compile the forward pass with the static kv-cache.

Reference: https://huggingface.co/docs/transformers/llm_optims?static-kv=basic+usage%3A+generation_config
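
A static KV cache is pre-allocated at its maximum size, so tensor shapes stay fixed across decoding steps, which is what lets the compiled forward pass be reused. The sketch below estimates that allocation. The layer count, number of KV heads, and head dimension are assumptions about the Llama-3.2-1B configuration (check `model.config` for the actual values), and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   max_tokens, bytes_per_value, batch_size=1):
    """Size of a pre-allocated key/value cache in bytes."""
    # The factor 2 accounts for storing both keys and values.
    return (2 * batch_size * num_layers * num_kv_heads
            * head_dim * max_tokens * bytes_per_value)


cache_bytes = kv_cache_bytes(
    num_layers=16,      # assumption: number of decoder layers
    num_kv_heads=8,     # assumption: grouped-query attention KV heads
    head_dim=64,        # assumption: per-head dimension
    max_tokens=512,     # max_token_length used in this notebook
    bytes_per_value=2,  # float16
)
print(cache_bytes / 2**20, "MiB")
```

Under these assumptions the cache is small compared with the model weights, so fixing its size costs little memory at this sequence length.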

In [6]:
compile_times = []

# Reminder: whenever you use torch.compile, call torch._dynamo.reset() first to clear all compilation caches and restore the system to its initial state.
import torch._dynamo
torch._dynamo.reset()

# TODO:
# Compile the model
model.generation_config.cache_implementation = "static"
compiled_model = torch.compile(model)

# Timing with torch.compile
for i in range(iter_times):
  with torch.no_grad():
    compile_output, compile_time = timed(lambda: compiled_model.generate(**inputs, max_length=max_token_length, pad_token_id=tokenizer.eos_token_id))
  compile_times.append(compile_time)
  print(f"Time taken with torch.compile: {compile_time} seconds")

# Decode output
output_text = tokenizer.decode(compile_output[0], skip_special_tokens=True)
print(f"\nOutput with torch.compile: {output_text}")

The 'batch_size' attribute of StaticCache is deprecated and will be removed in v4.49. Use the more precisely named 'self.max_batch_size' attribute instead.
Using `torch.compile`.


Time taken with torch.compile: 35.2224921875 seconds
Time taken with torch.compile: 6.9120751953125 seconds
Time taken with torch.compile: 7.46201123046875 seconds
Time taken with torch.compile: 7.4125146484375 seconds
Time taken with torch.compile: 7.0374970703125 seconds
Time taken with torch.compile: 7.0274775390625 seconds
Time taken with torch.compile: 6.9582412109375 seconds
Time taken with torch.compile: 6.97588818359375 seconds
Time taken with torch.compile: 6.89907275390625 seconds
Time taken with torch.compile: 6.9553994140625 seconds

Output with torch.compile: What is Edge AI? As detailed and content-rich as possible. Edge AI refers to the integration of artificial intelligence (AI) and machine learning (ML) capabilities directly into the edge devices, such as smart speakers, smartphones, and other IoT devices. The goal is to enhance the performance of these devices in real-time, while minimizing latency and reducing the amount of data being transmitted to the cloud.

**Wha

After the first call, which includes the compilation overhead (about 35 s here), each run takes about 7 s instead of about 10.8 s.

The code below computes the speedup that ```torch.compile``` provides.

***Please screenshot the inference time and speedup below, hand in as YourID_speedup.png***

In [7]:
import numpy as np
original_med = np.median(original_times)
compile_med = np.median(compile_times)
speedup = original_med / compile_med
print(f"Original median: {original_med},\nCompile median: {compile_med},\nSpeedup: {speedup}x")

Original median: 10.83708203125,
Compile median: 7.001682861328125,
Speedup: 1.547782475425394x
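
The median is used here because the first compiled run includes the one-off compilation cost. As a quick check, the sketch below recomputes the speedup from the compiled timings printed above, once with the median and once with the mean, using only the standard library.

```python
from statistics import mean, median

# Timings with torch.compile, copied from the output above (seconds).
compile_times = [
    35.2224921875, 6.9120751953125, 7.46201123046875, 7.4125146484375,
    7.0374970703125, 7.0274775390625, 6.9582412109375, 6.97588818359375,
    6.89907275390625, 6.9553994140625,
]
original_med = 10.83708203125  # "Original median" reported above

compile_median = median(compile_times)
compile_mean = mean(compile_times)
speedup_median = original_med / compile_median
speedup_mean = original_med / compile_mean

print(f"median: {compile_median:.3f} s -> speedup {speedup_median:.2f}x")
print(f"mean:   {compile_mean:.3f} s -> speedup {speedup_mean:.2f}x")
```

The mean is pulled up by the single slow first run and understates the steady-state speedup. An alternative would be to discard warm-up iterations before averaging.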


You've finished part 2.

Congratulations!