### Problem (benchmarking_script)

#### (b) Time the forward and backward passes for the model sizes described in §1.1.2. Use 5 warmup steps and compute the average and standard deviation of timings over 10 measurement steps. How long does a forward pass take? How about a backward pass? Do you see high variability across measurements, or is the standard deviation small?

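    model  forward_mean  forward_std  backward_mean  backward_std
0   small      0.162622     0.000616       0.325898      0.000811
1  medium      0.520552     0.002540       1.003874      0.004929

For both sizes the backward pass takes roughly twice as long as the forward pass. Variability is low: every standard deviation is below 0.5% of its mean.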
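The measurement procedure in the problem statement (5 warmup steps, then mean and standard deviation over 10 timed steps) can be sketched as below. This is a minimal illustration, not the benchmarking script used above: `time_step`, `step`, and `sync` are hypothetical names, and on a GPU one would presumably pass `torch.cuda.synchronize` as `sync` so that asynchronous kernels finish before the clock is read.

```python
import statistics
import time
from typing import Callable, Optional, Tuple


def time_step(
    step: Callable[[], None],
    warmup_steps: int = 5,
    measure_steps: int = 10,
    sync: Optional[Callable[[], None]] = None,
) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of the wall-clock time of `step`."""
    # Warmup: run the step without timing it so one-time costs are excluded.
    for _ in range(warmup_steps):
        step()
    if sync is not None:
        sync()

    timings = []
    for _ in range(measure_steps):
        start = time.perf_counter()
        step()
        if sync is not None:
            sync()  # wait for outstanding device work before stopping the clock
        timings.append(time.perf_counter() - start)

    return statistics.mean(timings), statistics.stdev(timings)


if __name__ == "__main__":
    # Stand-in workload; a real run would call the model's forward or backward pass.
    mean, std = time_step(lambda: sum(i * i for i in range(10_000)))
    print(f"mean={mean:.6f}s std={std:.6f}s")
```

`statistics.stdev` is the sample standard deviation (dividing by n - 1); whether the table above used that or the population variant is not stated.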

In [None]:
!python3 cs336-basics/benchmark/benchmarking.py

!nsys profile -o result python cs336-basics/benchmark/benchmarking.py
!nsys profile -o result_annotated python cs336-basics/benchmark/benchmarking.py --use_annotated

!python3 cs336-basics/benchmark/benchmarking.py --profile_memory

### Problem (mixed_precision_accumulation)
Run the following code and comment on the (accuracy of the) results.

Accumulating in float16 loses significant precision: the sum comes out as 9.9531 instead of 10, because once the running sum is large, each 0.01 increment is rounded to float16's coarse grid. Accumulating float32 values in float32 is accurate to about 1e-4 (10.0001). When float16 values are added to a float32 accumulator, with or without an explicit cast, the result is 10.0021. The remaining 0.0021 error comes from rounding 0.01 to float16 before the addition, not from the accumulation itself. This is why mixed-precision training keeps accumulators in float32.
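The two error sources can be separated without a GPU by emulating float16 rounding with the standard library's half-precision `struct` format. This is a rough sketch under stated assumptions: `to_fp16` is a hypothetical helper, and Python's double-precision float stands in for the float32 accumulator, so the last digits may differ slightly from the PyTorch output.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


# 0.01 is not exactly representable in float16, so the increment itself is rounded.
step = to_fp16(0.01)

# float16 accumulator: the running sum is rounded to float16 after every addition.
s16 = 0.0
for _ in range(1000):
    s16 = to_fp16(s16 + step)

# Wide accumulator: float16 increments summed at higher precision
# (assumption: a double stands in for the float32 accumulator).
s_wide = 0.0
for _ in range(1000):
    s_wide += step

print(f"float16 value of 0.01: {step:.10f}")
print(f"float16 accumulator:   {s16:.4f}")
print(f"wide accumulator:      {s_wide:.4f}")
```

The wide accumulator carries only the representation error of the increment, while the float16 accumulator also rounds each partial sum.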

In [1]:
import torch

s = torch.tensor(0, dtype=torch.float32)
for i in range(1000):
    s += torch.tensor(0.01, dtype=torch.float32)
print(s)

s = torch.tensor(0, dtype=torch.float16)
for i in range(1000):
    s += torch.tensor(0.01, dtype=torch.float16)
print(s)

s = torch.tensor(0, dtype=torch.float32)
for i in range(1000):
    s += torch.tensor(0.01, dtype=torch.float16)
print(s)

s = torch.tensor(0, dtype=torch.float32)
for i in range(1000):
    x = torch.tensor(0.01, dtype=torch.float16)
    s += x.type(torch.float32)
print(s)

tensor(10.0001)
tensor(9.9531, dtype=torch.float16)
tensor(10.0021)
tensor(10.0021)


### Problem (benchmarking_mixed_precision)

#### (a) Consider the following model:
```python
class ToyModel(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.fc1 = nn.Linear(in_features, 10, bias=False)
        self.ln = nn.LayerNorm(10)
        self.fc2 = nn.Linear(10, out_features, bias=False)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.ln(x)
        x = self.fc2(x)
        return x
```
Suppose we are training the model on a GPU and that the model parameters are originally in
FP32. We’d like to use autocasting mixed precision with FP16. What are the data types of:
• the model parameters within the autocast context,
• the output of the first feed-forward layer (ToyModel.fc1),
• the output of layer norm (ToyModel.ln),
• the model’s predicted logits,
• the loss,
• and the model’s gradients?

#### (b) You should have seen that FP16 mixed precision autocasting treats the layer normalization layer differently than the feed-forward layers. What parts of layer normalization are sensitive to mixed precision? If we use BF16 instead of FP16, do we still need to treat layer normalization differently? Why or why not?

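Layer normalization is sensitive to mixed precision in its reductions. The mean and especially the variance sum many (squared) activations, which can overflow FP16's narrow range (maximum about 65504) or lose small contributions to rounding, and the division by the square root of the variance plus epsilon can amplify those errors. Autocast therefore runs layer normalization in FP32 even though the surrounding linear layers run in FP16. BF16 has the same exponent range as FP32, so the overflow concern largely disappears, but it has fewer mantissa bits than FP16 (8 bits of precision versus 11), so its reductions are less precise. In practice layer normalization is usually still kept in FP32 under BF16; PyTorch's CUDA autocast lists `layer_norm` among the ops it runs in float32.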
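The range versus precision trade-off between FP16 and BF16 can be illustrated with a small emulation in pure Python. This is only a sketch: `to_fp16` and `to_bf16` are hypothetical helpers, the BF16 rounding is emulated by rounding a float32 bit pattern to its upper 16 bits, and NaN inputs are not handled.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round to IEEE half precision; values beyond the FP16 range become inf."""
    try:
        return struct.unpack("e", struct.pack("e", x))[0]
    except OverflowError:
        # struct raises instead of saturating; mimic IEEE overflow to infinity.
        return math.copysign(math.inf, x)


def to_bf16(x: float) -> float:
    """Emulate BF16 by keeping the upper 16 bits of the float32 pattern."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # Round to nearest, ties to even, on the 16 bits that are dropped.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]


activation = 300.0

# A squared term, as it appears in a variance computation.
print("FP16 square:", to_fp16(activation * activation))  # exceeds the FP16 range
print("BF16 square:", to_bf16(activation * activation))  # finite, but coarsely rounded

# Adding a small value to a larger running sum in BF16.
print("BF16 256 + 1:", to_bf16(256.0 + 1.0))
```

The squared activation overflows in FP16 but stays finite in BF16, while the BF16 sum shows how few significant bits are available for accumulation.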

#### (c) Modify your benchmarking script to optionally run the model using mixed precision with BF16. Time the forward and backward passes with and without mixed-precision for each language model size described in §1.1.2. Compare the results of using full vs. mixed precision, and comment on any trends as model size changes. You may find the nullcontext no-op context manager to be useful.
Deliverable: A 2-3 sentence response with your timings and commentary.
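One way the `nullcontext` hint could be used is to pick the context manager once and wrap the timed step in it either way. The sketch below is an assumption about how such a switch might look, not the benchmarking script's actual code; `precision_context` and `use_bf16` are hypothetical names.

```python
from contextlib import nullcontext


def precision_context(use_bf16: bool, device_type: str = "cuda"):
    """Return a BF16 autocast context when requested, otherwise a no-op context."""
    if not use_bf16:
        return nullcontext()
    # Imported lazily so the full-precision path does not touch torch here.
    import torch

    return torch.autocast(device_type=device_type, dtype=torch.bfloat16)


# The same `with` statement covers both the full- and mixed-precision runs.
with precision_context(use_bf16=False):
    pass  # forward pass and loss computation would go here
```

Keeping a single `with` statement avoids duplicating the timed code for the two precision settings.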


In [3]:
import torch
import torch.nn as nn


class ToyModel(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.fc1 = nn.Linear(in_features, 10, bias=False)
        self.ln = nn.LayerNorm(10)
        self.fc2 = nn.Linear(10, out_features, bias=False)
        self.relu = nn.ReLU()

    def forward(self, x):
        print("fc1.weight dtype (model param):", self.fc1.weight.dtype)
        x1 = self.fc1(x)
        print("fc1 output dtype:", x1.dtype)
        x2 = self.relu(x1)
        x3 = self.ln(x2)
        print("layernorm output dtype:", x3.dtype)
        x4 = self.fc2(x3)
        print("logits dtype:", x4.dtype)
        return x4


model = ToyModel(5, 2).cuda()
x = torch.randn(3, 5, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    print("model param dtype (in autocast):", model.fc1.weight.dtype)
    logits = model(x)
    loss = logits.sum()
    print("loss dtype:", loss.dtype)

# Run backward outside the autocast context, as the PyTorch AMP docs recommend.
loss.backward()
print("model param grad dtype:", model.fc1.weight.grad.dtype)

model param dtype (in autocast): torch.float32
fc1.weight dtype (model param): torch.float32
fc1 output dtype: torch.float16
layernorm output dtype: torch.float32
logits dtype: torch.float16
loss dtype: torch.float32
model param grad dtype: torch.float32
