<details><summary style="display:list-item; font-size:16px; color:blue;">Jupyter Help</summary>
    
Having trouble testing your work? Double-check that you have followed the steps below to write, run, save, and test your code!
    
[Click here for a walkthrough GIF of the steps below](https://static-assets.codecademy.com/Courses/ds-python/jupyter-help.gif)

Run all initial cells to import libraries and datasets. Then follow these steps for each question:
    
1. Add your solution to the cell with `## YOUR SOLUTION HERE ##`.
2. Run the cell by selecting the `Run` button or the `Shift`+`Enter` keys.
3. Save your work by selecting the `Save` button, the `command`+`s` keys (Mac), or `control`+`s` keys (Windows).
4. Select the `Test Work` button at the bottom left to test your work.

![Screenshot of the buttons at the top of a Jupyter Notebook. The Run and Save buttons are highlighted](https://static-assets.codecademy.com/Paths/ds-python/jupyter-buttons.png)

</details>

**Setup**

Run the following cell to import the libraries, define the `SimpleMLP` model, and print a summary of it with the `custom_summary` helper function.

In [1]:
import time
import torch
import torch.nn as nn
torch.manual_seed(0)

class SimpleMLP(nn.Module):
    def __init__(self, input_size=128, hidden_size=516, output_size=1):
        super().__init__()
        self.fc1   = nn.Linear(input_size, hidden_size)
        self.relu1 = nn.ReLU()
        self.fc2   = nn.Linear(hidden_size, hidden_size)
        self.relu2 = nn.ReLU()
        self.fc3   = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = self.relu1(self.fc1(x))
        x = self.relu2(self.fc2(x))
        x = self.fc3(x)
        return x

# Move model to the GPU if one is available; otherwise keep it on the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SimpleMLP().to(device)

from custom_torchinfo import custom_summary
custom_summary(model, input_size=(64, 128))

        Layer (type)              Output Shape         Param #
            Linear-1                 [64, 516]          66,564
              ReLU-2                 [64, 516]               0
            Linear-3                 [64, 516]         266,772
              ReLU-4                 [64, 516]               0
            Linear-5                   [64, 1]             517
         SimpleMLP-6                   [64, 1]               0
Total params: 333,853
Trainable params: 333,853
Non-trainable params: 0
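
The parameter counts in the summary can be checked by hand. A `Linear` layer holds one weight per input-output pair plus one bias per output. The sketch below reproduces the table's numbers in plain Python; `linear_params` is a hypothetical helper written for this illustration, not part of PyTorch or `custom_torchinfo`.

```python
def linear_params(in_features, out_features):
    """Weights (in * out) plus one bias per output unit."""
    return in_features * out_features + out_features

# (in_features, out_features) for fc1, fc2, fc3 with the default sizes
layers = [(128, 516), (516, 516), (516, 1)]

counts = [linear_params(i, o) for i, o in layers]
total = sum(counts)

print(counts)  # [66564, 266772, 517]
print(total)   # 333853
```

The `ReLU` layers contribute no parameters, which is why their rows show `0`.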


#### Checkpoint 1/3

Create the `model_size_bytes` function that returns a model's size in bytes (parameters + buffers). The function should have a single input for a PyTorch model.
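
`SimpleMLP` has no buffers, so the distinction between parameters and buffers is easy to miss. As an illustration only, the sketch below inspects a small `BatchNorm1d` layer, which stores running statistics as buffers. It assumes PyTorch's default `float32` dtype; the variable names are made up for this example.

```python
import torch.nn as nn

bn = nn.BatchNorm1d(4)

# Parameters are trainable tensors: weight and bias
param_bytes = sum(p.numel() * p.element_size() for p in bn.parameters())

# Buffers are saved with the model but not trained:
# running_mean, running_var, num_batches_tracked
buffer_bytes = sum(b.numel() * b.element_size() for b in bn.buffers())

for name, b in bn.named_buffers():
    print(name, tuple(b.shape), b.dtype)

print(param_bytes, buffer_bytes)
```

Counting parameters alone would miss the buffer bytes in a layer like this one.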

Use the function to calculate the byte size of the model created using the `SimpleMLP` class in the previous cell. Be sure to run the setup cell above to instantiate the model in the variable `model`. 

Print out the model size converted into megabytes (MB) by dividing the number of bytes by `1024**2`.

Don't forget to run the cell and save the notebook before selecting `Test Work`! Open the `Jupyter Help` toggle at the top of the notebook for more details.

In [10]:
def model_size_bytes(model):
    """Return model size in bytes (params + buffers)."""
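    # Buffers are tensors stored with the model that are not trained
    tensors = list(model.parameters()) + list(model.buffers())
    size = 0
    for t in tensors:
        # Bytes per tensor = number of elements * bytes per element
        size += t.numel() * t.element_size()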
    return size
    
size_bytes = model_size_bytes(model)

# Show output
print(f"[Model Size] {size_bytes:,} bytes ~ {size_bytes / (1024**2):.3f} MB")

[Model Size] 1,335,412 bytes ~ 1.274 MB
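
The reported size follows from the parameter count: 1,335,412 bytes divided by 333,853 parameters is exactly 4 bytes each, the width of a 32-bit float. The sketch below redoes that arithmetic and, as a hypothetical comparison, shows what the same number of elements would occupy at narrower widths. `BYTES_PER_ELEMENT` and `size_mb` are names invented for this illustration.

```python
NUM_PARAMS = 333_853  # "Total params" from the model summary

# Bytes per element for a few common element widths
BYTES_PER_ELEMENT = {"float32": 4, "float16": 2, "int8": 1}

def size_mb(num_elements, bytes_per_element):
    """Size in MB, using the same 1024**2 divisor as the checkpoint."""
    return num_elements * bytes_per_element / 1024**2

for dtype, width in BYTES_PER_ELEMENT.items():
    total_bytes = NUM_PARAMS * width
    print(f"{dtype}: {total_bytes:,} bytes ~ {size_mb(NUM_PARAMS, width):.3f} MB")
```

The `float32` line matches the output above; the other lines are arithmetic only and say nothing about how the model would behave at those precisions.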


#### Checkpoint 2/3

Create the `measure_latency` function that returns a model's average latency in milliseconds (ms) per forward pass on the given input data. The function should have three inputs:
- `model`: PyTorch model to test.
- `x`: Input data.
- `iters`: Number of forward passes to average the latency over.
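
The averaging controlled by `iters` can be sketched without PyTorch. The hypothetical `time_callable` helper below times any zero-argument function and makes a few untimed warm-up calls first, a common benchmarking habit that this checkpoint does not require. For a model on a CUDA device, the timed function would also need to call `torch.cuda.synchronize()` after the forward pass, because GPU kernels run asynchronously.

```python
import statistics
import time

def time_callable(fn, iters=50, warmup=10):
    """Return (mean, median) latency in ms over `iters` calls of fn()."""
    # Warm-up calls are not timed; they absorb one-off startup costs
    for _ in range(warmup):
        fn()

    samples_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1e3)

    return statistics.mean(samples_ms), statistics.median(samples_ms)

# Stand-in workload; in the notebook this would be a forward pass
mean_ms, median_ms = time_callable(lambda: sum(range(10_000)))
print(f"mean {mean_ms:.3f} ms | median {median_ms:.3f} ms")
```

Recording one sample per call also makes the median available, which is less sensitive to an occasional slow iteration than the mean.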

Use the function to calculate the latency of the model from before on synthetic input data with different batch sizes:
- The first batch should have a batch size of `64` with an input size of `128` dimensions. Apply the function to the first batch to calculate the latency over `50` iterations and save the average latency to the variable `latency_x1`.
- The second batch should have a smaller batch size of `8` with an input size of `128` dimensions. Apply the function to the second batch to calculate the latency over `50` iterations and save the average latency to the variable `latency_x2`.

Print and compare the latencies of the model processing data with different batch sizes. Which do you expect to be faster?

Don't forget to run the cell and save the notebook before selecting `Test Work`! Open the `Jupyter Help` toggle at the top of the notebook for more details.

In [16]:
# Create test batches -- DO NOT MODIFY
x_base = torch.randn(1, 128, device=device)
batch_size1 = 64
x1 = x_base.expand(batch_size1, -1).contiguous()
batch_size2 = 8
x2 = x_base.expand(batch_size2, -1).contiguous()

## YOUR SOLUTION HERE ##
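def measure_latency(model, x, iters=50):
    """Return average latency (ms) per forward pass."""
    model.eval()
    on_cuda = x.device.type == "cuda"
    if on_cuda:
        # Finish any pending GPU work so it is not counted in the timing
        torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.inference_mode():
        for _ in range(iters):
            _ = model(x)
            if on_cuda:
                # CUDA kernels launch asynchronously; wait for this pass to finish
                torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    # Average seconds per pass, converted to milliseconds
    latency = elapsed / iters * 1e3
    return latency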

latency_x1 = measure_latency(model, x1, iters=50)
latency_x2 = measure_latency(model, x2, iters=50)

# Show output
print(f"[Latency] {latency_x1:.3f} ms per forward pass (batch size={batch_size1})")
print(f"[Latency] {latency_x2:.3f} ms per forward pass (batch size={batch_size2})")

[Latency] 0.136 ms per forward pass (batch size=64)
[Latency] 0.122 ms per forward pass (batch size=8)
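
One way to read these numbers: the two per-pass latencies are close, but the larger batch processes eight times as many samples in each pass. The sketch below converts the latencies printed above into throughput. `throughput` is a hypothetical helper, and the figures are specific to this run and hardware.

```python
def throughput(batch_size, latency_ms):
    """Samples processed per second, given the latency of one forward pass."""
    return batch_size / (latency_ms / 1e3)

# batch size -> latency in ms, copied from the output above
runs = {64: 0.136, 8: 0.122}

for batch_size, latency_ms in runs.items():
    rate = throughput(batch_size, latency_ms)
    print(f"batch size={batch_size}: {rate:,.0f} samples/s")
```

In this run the batch of 64 is slightly slower per pass but handles roughly seven times as many samples per second, so which batch size is "faster" depends on whether latency or throughput matters more.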


#### Checkpoint 3/3

Create the `measure_gpu_memory` function that runs one forward pass and returns both the GPU memory currently allocated **and** the peak GPU memory allocated during that pass, in megabytes (MB). The function should have two inputs:
- `model`: PyTorch model to test.
- `x`: Input data.

The function should return two outputs:
- `current`: The GPU memory allocated once the forward pass has finished. This counts every tensor on the device, including the model's weights.
- `peak`: The peak GPU memory allocated while the model passes the input data through its forward pass.

Use the function to measure memory usage on the same inputs as before, one for each batch size. Save the result for the first batch to the variable `mem1` and the result for the second batch to the variable `mem2`.

Print and compare the memory allocations.

Don't forget to run the cell and save the notebook before selecting `Test Work`! Open the `Jupyter Help` toggle at the top of the notebook for more details.

In [15]:
# Create test batches -- DO NOT MODIFY
x_base = torch.randn(1, 128, device=device)
batch_size1 = 64
x1 = x_base.expand(batch_size1, -1).contiguous()
batch_size2 = 8
x2 = x_base.expand(batch_size2, -1).contiguous()

## YOUR SOLUTION HERE ##
def measure_gpu_memory(model, x):
    """Return current and peak GPU memory in MB after one forward."""
    # Free unused cached blocks, then reset the peak counter
    # so that it reflects only this forward pass
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

    model.eval()
    with torch.inference_mode():
        y = model(x)

    # Wait for the forward pass to finish before reading the counters
    torch.cuda.synchronize()
    current = torch.cuda.memory_allocated() / (1024**2)
    peak    = torch.cuda.max_memory_allocated() / (1024**2)
    return current, peak

mem1 = measure_gpu_memory(model, x1)
mem2 = measure_gpu_memory(model, x2)

# Show output
print(f"[GPU Memory] Current allocated: {mem1[0]:.2f} MB | Peak during forward: {mem1[1]:.2f} MB (batch size={batch_size1})")
print(f"[GPU Memory] Current allocated: {mem2[0]:.2f} MB | Peak during forward: {mem2[1]:.2f} MB (batch size={batch_size2})")

[GPU Memory] Current allocated: 9.44 MB | Peak during forward: 10.69 MB (batch size=64)
[GPU Memory] Current allocated: 9.44 MB | Peak during forward: 10.47 MB (batch size=8)
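
A rough way to sanity-check the gap between the two peaks is to estimate how much activation memory is alive at once. The sketch below assumes `float32` activations and that at most two hidden-size tensors (a layer's input and its output) exist at the same time in inference mode. That is a simplification: the allocator rounds block sizes and libraries may reserve workspace, so only the difference between batch sizes is compared. `live_activation_bytes` is a hypothetical helper.

```python
HIDDEN_SIZE = 516        # hidden_size used by SimpleMLP
BYTES_PER_FLOAT32 = 4

def live_activation_bytes(batch_size, hidden_size=HIDDEN_SIZE, live_tensors=2):
    """Estimated bytes for `live_tensors` activations of shape [batch, hidden]."""
    return live_tensors * batch_size * hidden_size * BYTES_PER_FLOAT32

est_64 = live_activation_bytes(64)
est_8 = live_activation_bytes(8)
diff_mb = (est_64 - est_8) / 1024**2

print(f"batch 64: {est_64:,} bytes | batch 8: {est_8:,} bytes")
print(f"estimated difference: {diff_mb:.2f} MB")
```

Under these assumptions the estimated difference is about 0.22 MB, in line with the gap between the peaks reported above (10.69 MB and 10.47 MB). The `current` values are identical because the temporary activations have already been freed by the time the function reads the counter.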


#### Clean up session

In [None]:
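import gc
import torch

# Drop references to the model and test batches so their memory can be reclaimed
del model, x1, x2
gc.collect()
if torch.cuda.is_available():
    # Release cached GPU memory and reset the peak-memory counter
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()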