---
# **Lab10: PyTorch**
---

# ▶️ GPU tools...

In [1]:
!nvidia-smi

Wed Feb  4 16:08:38 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   45C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

# ✅ Batch Normalization for RGB Images

Define a tensor of shape $[B, 3, H, W]$ filled with random values to simulate a batch of RGB images.

In [2]:
import torch

# Define batch parameters
B, C, H, W = 8, 3, 64, 64
images = torch.rand(B, C, H, W)

print("Batch shape:", images.shape)

Batch shape: torch.Size([8, 3, 64, 64])


## ↘️ TODO...

**Steps to perform channel-wise batch normalization**:

1. Compute the **mean per color channel** by averaging over the batch and spatial dimensions
2. Compute the **standard deviation per color channel** over the same dimensions
3. Inspect the computed statistics to ensure they correspond to the RGB channels
    - `print("Mean per channel:", mean)`
    - `print("Std per channel:", std)`


4. Reshape the mean and standard deviation tensors to make them compatible with broadcasting
    - Verify the reshaped dimensions:
        - `print("Reshaped mean:", mean.shape)`
        - `print("Reshaped std:", std.shape)`


5. Apply channel-wise centering and normalization using broadcasting

6. Verify the result by recomputing the mean and standard deviation of the normalized batch
7. Check that the normalized images have approximately zero mean and unit variance per channel

    - `print("New mean per channel:", new_mean)`
    - `print("New std per channel:", new_std)`
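
The reshape in step 4 can be skipped by asking the reduction to keep the reduced dimensions. The sketch below is one possible way to run the steps above, assuming `keepdim=True` and PyTorch's default (unbiased) standard deviation; the names `demo` and `normalized` and the fixed seed are illustrative and not part of the lab.

```python
import torch

torch.manual_seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical stand-in for the lab's `images` tensor: (B, C, H, W)
demo = torch.rand(8, 3, 64, 64)

# keepdim=True leaves the reduced axes as size 1, giving shape (1, C, 1, 1)
mean = demo.mean(dim=(0, 2, 3), keepdim=True)
std = demo.std(dim=(0, 2, 3), keepdim=True)

# (B, C, H, W) combined with (1, C, 1, 1) broadcasts over batch and spatial dims
normalized = (demo - mean) / std

print("Stat shape:", mean.shape)
print("New mean per channel:", normalized.mean(dim=(0, 2, 3)))
print("New std per channel:", normalized.std(dim=(0, 2, 3)))
```

Because the statistics already have shape `(1, C, 1, 1)`, no explicit `view` is needed before the subtraction and division.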



In [15]:
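# Per-channel statistics: reduce over the batch (0) and spatial (2, 3) dimensions
mean = images.mean(dim=[0, 2, 3])
std = images.std(dim=[0, 2, 3])

print("Mean per channel: ", mean)
print("Std per channel: ", std)

# Reshape to (1, C, 1, 1) so the statistics broadcast against (B, C, H, W)
mean = mean.view(1, -1, 1, 1)
std = std.view(1, -1, 1, 1)
print("Reshaped mean: ", mean)
print("Reshaped std: ", std)

# Center and scale each channel
normalized_images = (images - mean) / std

# Recompute the statistics of the normalized batch
new_mean = normalized_images.mean(dim=[0, 2, 3])
new_std = normalized_images.std(dim=[0, 2, 3])
print("New mean per channel: ", new_mean)
print("New std per channel: ", new_std)

# Checks: approximately zero mean and unit standard deviation per channel
assert torch.allclose(new_mean, torch.zeros_like(new_mean), atol=1e-5)
assert torch.allclose(new_std, torch.ones_like(new_std), atol=1e-5)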

Mean per channel:  tensor([0.9453, 1.8125, 0.7812])
Std per channel:  tensor([1.8403e+08, 3.4465e+08, 1.6559e+08])
Reshaped mean:  tensor([[[[0.9453]],

         [[1.8125]],

         [[0.7812]]]])
Reshaped std:  tensor([[[[1.8403e+08]],

         [[3.4465e+08]],

         [[1.6559e+08]]]])
New mean per channel:  tensor([4.9331e-09, 4.3656e-09, 4.1910e-09])
New std per channel:  tensor([1., 1., 1.])
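
As a cross-check, the manual normalization can be compared with PyTorch's built-in layer. The sketch below assumes `torch.nn.BatchNorm2d` in training mode with `affine=False` and `track_running_stats=False`. That layer normalizes with the biased variance (divide by N) plus a small `eps`, so the manual version here does the same instead of using the unbiased `std` from the cell above. The tensor `batch` is a hypothetical stand-in for `images`.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch = torch.rand(8, 3, 64, 64)  # hypothetical stand-in for `images`

# Manual version: biased variance (divide by N) plus eps, as BatchNorm does
eps = 1e-5
mean = batch.mean(dim=(0, 2, 3), keepdim=True)
var = ((batch - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
manual = (batch - mean) / torch.sqrt(var + eps)

# Built-in layer without learnable scale/shift or running statistics
bn = nn.BatchNorm2d(num_features=3, eps=eps, affine=False, track_running_stats=False)
builtin = bn(batch)

print("Max abs difference:", (manual - builtin).abs().max().item())
```

With 8 × 64 × 64 values per channel, the gap between the biased and unbiased estimates is tiny, so both variants give nearly the same normalized images.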


# ✅ Mini multi-head attention block

## ↘️ TODO...

- Implement a **mini multi-head attention block** *from scratch* using only:
    - tensor reshaping (`view`, `reshape`, `transpose`, `permute`)
    - batched matrix multiplication (`matmul`)
    - broadcasting


<br> üîπ **Setup**

- batch size $B = 4$
- sequence length $T = 16$
- embedding dimension $D = 64$
- number of heads $H = 8$ (so head dimension $d = D/H = 8$)

- Create:
    - input tensor `x` with shape `(B, T, D)`  
    - learnable projection matrices `Wq, Wk, Wv, Wo` with shapes `(D, D)`


<br> üîπ **Tasks**


- **1) Create the input**
  - Generate `x = torch.randn(B, T, D, requires_grad=True)`
<br>


- **2) Compute Q, K, V**
  - Compute:
    - `Q = x @ Wq`
    - `K = x @ Wk`
    - `V = x @ Wv`
  - Ensure each has shape `(B, T, D)`
<br>

- **3) Split into heads**
  - Reshape and permute so that:
    - `Qh, Kh, Vh` have shape `(B, H, T, d)`
  - Hint: use `.view(B, T, H, d)` then `.transpose(1, 2)`
<br>


- **4) Compute attention scores**
  - Compute scaled dot-product attention:
    $$
    S = \frac{Q_h K_h^T}{\sqrt{d}}
    $$
  - `S` must have shape `(B, H, T, T)`
  - Use `torch.matmul(Qh, Kh.transpose(-2, -1))`
<br>

- **5) Softmax + weighted sum**
  - Apply the softmax over the last dimension (the key positions), then take the weighted sum of the values:
    $$
    A = \mathrm{softmax}(S)
    $$
    $$
    O_h = A V_h
    $$
  - `Oh` must have shape `(B, H, T, d)`
<br>

- **6) Merge heads**
  - Convert `(B, H, T, d)` back to `(B, T, D)` using transpose + reshape.
<br>

- **7) Output projection**
  - Compute final output:
    - `y = out @ Wo`
  - Shape must be `(B, T, D)`
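
Steps 3 and 6 are inverses of each other, which can be checked in isolation. The sketch below uses small illustrative sizes (not the lab's) and hypothetical helper names (`split_heads`, `merge_heads`). It shows that head `h` holds columns `h*d:(h+1)*d` of the original tensor and that merging recovers the input.

```python
import torch

def split_heads(t: torch.Tensor, num_heads: int) -> torch.Tensor:
    """(B, T, D) -> (B, H, T, d) with d = D // H."""
    B, T, D = t.shape
    return t.view(B, T, num_heads, D // num_heads).transpose(1, 2)

def merge_heads(t: torch.Tensor) -> torch.Tensor:
    """(B, H, T, d) -> (B, T, H * d)."""
    B, H, T, d = t.shape
    # transpose leaves the tensor non-contiguous, so use reshape rather than view
    return t.transpose(1, 2).reshape(B, T, H * d)

q = torch.arange(2 * 3 * 8, dtype=torch.float32).view(2, 3, 8)  # B=2, T=3, D=8
qh = split_heads(q, num_heads=4)                                 # d = 2

print(qh.shape)
print(torch.equal(qh[:, 1], q[:, :, 2:4]))  # head 1 is columns 2:4 of q
print(torch.equal(merge_heads(qh), q))      # round trip recovers q
```

Calling `.view(B, T, D)` directly on the transposed tensor would fail because its memory layout is no longer contiguous; `reshape` copies when it has to.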

<br> üîπ What to Submit

- A short printout of shapes at each step:
  - `Qh.shape`, `S.shape`, `A.shape`, `Oh.shape`, `y.shape`
- Confirmation that gradients are computed (for example, `x.grad` is not `None` after calling `y.sum().backward()`)


<br> üîπ **Expected Key Shapes**

- `x`: `(B, T, D)`
- `Qh, Kh, Vh`: `(B, H, T, d)`
- `S`: `(B, H, T, T)`
- `A`: `(B, H, T, T)`
- `Oh`: `(B, H, T, d)`
- `y`: `(B, T, D)`
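
The shapes of `S`, `A` and `Oh` can be checked on their own, together with the property that every row of `A` sums to 1. The sketch below uses randomly generated tensors as hypothetical stand-ins for `Qh`, `Kh`, `Vh`. The final comparison assumes `torch.nn.functional.scaled_dot_product_attention`, which is only available in PyTorch 2.0 and later, so it is guarded.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, H, T, d = 4, 8, 16, 8

# Hypothetical per-head tensors standing in for Qh, Kh, Vh
Qh, Kh, Vh = (torch.randn(B, H, T, d) for _ in range(3))

S = Qh @ Kh.transpose(-2, -1) / math.sqrt(d)  # (B, H, T, T) scaled scores
A = torch.softmax(S, dim=-1)                  # normalize over key positions
Oh = A @ Vh                                   # (B, H, T, d) weighted sum of values

rows_ok = torch.allclose(A.sum(dim=-1), torch.ones(B, H, T), atol=1e-6)
print("S:", tuple(S.shape), "A:", tuple(A.shape), "Oh:", tuple(Oh.shape))
print("Attention rows sum to 1:", rows_ok)

# Optional cross-check; this function only exists in PyTorch 2.0 and later
if hasattr(F, "scaled_dot_product_attention"):
    ref = F.scaled_dot_product_attention(Qh, Kh, Vh)
    print("Matches built-in:", torch.allclose(Oh, ref, atol=1e-5))
```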

In [29]:
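import torch

# Setup: batch size, sequence length, embedding dimension, number of heads
B, T, D, H = 4, 16, 64, 8
d = D // H  # head dimension

# Learnable projection matrices, each of shape (D, D)
Wq, Wk, Wv, Wo = [torch.randn(D, D, requires_grad=True) for _ in range(4)]

# step 1: input of shape (B, T, D)
x = torch.randn(B, T, D, requires_grad=True)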

In [30]:
# step 2
Q = x @ Wq
K = x @ Wk
V = x @ Wv

for v in [Q, K, V]:
    assert v.shape == (B, T, D)

In [31]:
# step 3
print(d)
Qh = Q.view(B, T, H, d).transpose(1, 2)
Kh = K.view(B, T, H, d).transpose(1, 2)
Vh = V.view(B, T, H, d).transpose(1, 2)


8


In [39]:
# step 4 
import math
S = (Qh @ Kh.transpose(-2, -1)) / math.sqrt(d)

assert S.shape == (B, H, T, T)

In [43]:
# step 5
A = torch.softmax(S, dim=-1)
Oh = A @ Vh
assert Oh.shape == (B, H, T, d)

In [None]:
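# step 6: merge heads, (B, H, T, d) -> (B, T, H, d) -> (B, T, D)
out = Oh.transpose(1, 2).reshape(B, T, D)

# step 7: output projection
y = out @ Wo
assert y.shape == (B, T, D)

print(f'Qh shape: {Qh.shape}')
print(f'S shape: {S.shape}')
print(f'A shape: {A.shape}')
print(f'Oh shape: {Oh.shape}')
print(f'y shape: {y.shape}')

# confirm that gradients flow back to the input
y.sum().backward()
print(f'x.grad shape: {x.grad.shape}')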


RuntimeError: Expected size for first two dimensions of batch2 tensor to be: [4, 16] but got: [4, 64].

In [54]:
# tests
assert x.shape == (B, T, D)
for t in [Qh, Kh, Vh]:
    assert t.shape == (B, H, T, d)
assert S.shape == (B, H, T, T)
assert A.shape == (B, H, T, T)
assert Oh.shape == (B, H, T, d)
assert y.shape == (B, T, D)

AssertionError: 
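
For reference, the seven steps can be collected into a single function and run in a fresh session, so that no variable depends on state left over from earlier cells. This is a minimal sketch under the lab's setup; the function name `mini_mha`, the fixed seed, and the `D ** -0.5` weight scaling are assumptions added for illustration.

```python
import math

import torch

def mini_mha(x, Wq, Wk, Wv, Wo, num_heads):
    """Multi-head self-attention using only reshapes, matmul and broadcasting."""
    B, T, D = x.shape
    d = D // num_heads

    def split(t):  # (B, T, D) -> (B, H, T, d)
        return t.view(B, T, num_heads, d).transpose(1, 2)

    Qh, Kh, Vh = split(x @ Wq), split(x @ Wk), split(x @ Wv)
    S = Qh @ Kh.transpose(-2, -1) / math.sqrt(d)  # (B, H, T, T)
    A = torch.softmax(S, dim=-1)                  # (B, H, T, T)
    Oh = A @ Vh                                   # (B, H, T, d)
    out = Oh.transpose(1, 2).reshape(B, T, D)     # merge heads
    return out @ Wo                               # (B, T, D)

torch.manual_seed(0)
B, T, D, H = 4, 16, 64, 8
x = torch.randn(B, T, D, requires_grad=True)
# Scaled random weights (an assumption) keep the scores in a moderate range
Wq, Wk, Wv, Wo = [(torch.randn(D, D) * D ** -0.5).requires_grad_() for _ in range(4)]

y = mini_mha(x, Wq, Wk, Wv, Wo, num_heads=H)
y.sum().backward()  # populates .grad on x and on every projection matrix

print("y:", tuple(y.shape))
print("x.grad:", tuple(x.grad.shape))
print("all weights have grads:", all(W.grad is not None for W in (Wq, Wk, Wv, Wo)))
```

Running the whole pipeline from one function call is one way to rule out shape mismatches caused by cells executed out of order.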