# 🔵 Lesson 6: The U-Net Predictor

The U-Net is the **engine** of Stable Diffusion. It is the artist.

### Goal
Its only job is to answer one question:
> "Given this noisy image, this timestep, and this text, **which noise should I remove?**"

### Architecture
It is called a U-Net because its shape looks like a 'U':
1.  **Downsample**: Compress the image to extract features (Edges and textures -> Shapes -> Concepts).
2.  **Bottleneck**: Process deep concepts.
3.  **Upsample**: Rebuild the image with the new information.

It uses **Cross-Attention** to inject the text prompt at most stages of the network, on the way down, in the bottleneck, and on the way up.
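
To make the 'U' shape above concrete, here is a small sketch that traces how the feature map shrinks and then grows again. It is plain Python, not the real model. The helper `trace_unet_shapes` is hypothetical, and the channel widths `(320, 640, 1280, 1280)` are an assumption about the v1.5 configuration; once the model is loaded you can compare them with `unet.config.block_out_channels`.

```python
# Hypothetical helper: trace (channels, resolution) through a U-shaped network.
# The default channel widths are an assumption about the SD v1.5 U-Net config.
def trace_unet_shapes(latent_size=64, block_channels=(320, 640, 1280, 1280)):
    down = []
    size = latent_size
    for i, ch in enumerate(block_channels):
        down.append((ch, size))
        # Every down block except the last one halves the resolution.
        if i < len(block_channels) - 1:
            size //= 2
    bottleneck = (block_channels[-1], size)
    # The up path mirrors the down path in reverse order.
    up = list(reversed(down))
    return down, bottleneck, up


down, bottleneck, up = trace_unet_shapes()

for ch, size in down:
    print(f"down:       {ch:>4} channels at {size}x{size}")
print(f"bottleneck: {bottleneck[0]:>4} channels at {bottleneck[1]}x{bottleneck[1]}")
for ch, size in up:
    print(f"up:         {ch:>4} channels at {size}x{size}")
```

The resolution falls as the channel count rises, which is why the deepest layers can afford to reason about whole-image concepts.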

In [None]:
# 1. Setup
import notebook_utils
project_root, device, dtype = notebook_utils.setup_notebook()

from diffusers import UNet2DConditionModel
import torch

## 1. Load the Beast
The U-Net is massive. It has ~860 million parameters.
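
A quick back-of-the-envelope calculation shows what "massive" means for memory. This sketch only counts the weights, using the approximate parameter count quoted above; activations and framework overhead come on top. The helper `weight_memory_gb` is a hypothetical name used for illustration.

```python
# Approximate parameter count quoted in the text above.
approx_params = 860_000_000

# Bytes needed to store one parameter in each precision.
bytes_per_param = {"float32": 4, "float16": 2}


def weight_memory_gb(n_params, dtype_name):
    """Return the size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param[dtype_name] / 1e9


for dtype_name in bytes_per_param:
    print(f"{dtype_name}: {weight_memory_gb(approx_params, dtype_name):.2f} GB")
```

Halving the precision halves the weight memory, which is one reason half precision is a common choice on GPUs.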

In [None]:
model_id = "runwayml/stable-diffusion-v1-5"
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet", torch_dtype=dtype).to(device)

params = sum(p.numel() for p in unet.parameters())
print(f"U-Net Loaded! Parameters: {params:,}")

## 2. Manual Prediction

Let's manually ask the U-Net to predict noise for a random tensor.

**Inputs needed:**
1.  `sample`: The noisy latent image (shape `[batch, 4, 64, 64]` for a 512x512 output image).
2.  `timestep`: How noisy is the input? (t near 1000 is almost pure noise, t = 0 is clean).
3.  `encoder_hidden_states`: The text embeddings (from Lesson 5).
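
The three input shapes listed above can be derived from the target image size. The sketch below is plain Python and assumes the usual v1.5 setup: a VAE that shrinks each side by a factor of 8, 4 latent channels, and 77 text tokens of 768 dimensions. The helper `unet_input_shapes` is hypothetical and not part of `diffusers`.

```python
# Hypothetical helper: compute the tensor shapes the U-Net expects.
# vae_scale=8 and latent_channels=4 are assumptions about the SD v1.5 VAE.
def unet_input_shapes(image_height=512, image_width=512, batch_size=1,
                      vae_scale=8, latent_channels=4,
                      max_tokens=77, text_dim=768):
    return {
        "sample": (batch_size, latent_channels,
                   image_height // vae_scale, image_width // vae_scale),
        "timestep": (batch_size,),
        "encoder_hidden_states": (batch_size, max_tokens, text_dim),
    }


shapes = unet_input_shapes()
for name, shape in shapes.items():
    print(f"{name}: {shape}")
```

For a 512x512 image this gives a 64x64 latent, which is where the 64 in the dummy inputs comes from.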

In [None]:
# Create dummy inputs
batch_size = 1
channels = 4
height = 64
width = 64

# 1. Random noise input
sample = torch.randn((batch_size, channels, height, width), device=device, dtype=dtype)

# 2. Timestep (Say we are halfway through generation)
timestep = torch.tensor([500], device=device)

# 3. Fake Text Embeddings (77 tokens, 768 dim)
encoder_hidden_states = torch.randn((batch_size, 77, 768), device=device, dtype=dtype)

print("Running U-Net pass... (This is the heavy math)")

with torch.no_grad():
    noise_pred = unet(
        sample=sample,
        timestep=timestep,
        encoder_hidden_states=encoder_hidden_states
    ).sample

print(f"Output Shape: {noise_pred.shape}")
print("Success! The U-Net output a tensor of the SAME SHAPE as the input.")

## 3. What did it output?

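Stable Diffusion v1.5's U-Net was trained to output **Epsilon ($\epsilon$)**, the estimated noise in the input. Other models may predict a different target, so the scheduler's `prediction_type` setting has to match what the U-Net was trained on.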

The scheduler then removes a scaled portion of this noise. Conceptually:  
$\text{New Image} \approx \text{Old Image} - \text{scale}_t \cdot \text{Predicted Noise}$
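
The subtraction above is a simplification, because both the image and the noise are scaled during the noising process. The toy example below uses plain Python lists instead of tensors and assumes the standard forward formula $x_t = \sqrt{\bar\alpha}\,x_0 + \sqrt{1-\bar\alpha}\,\epsilon$. The value of `alpha_bar` and the helper `estimate_clean` are made up for illustration; real schedulers do more than this single step.

```python
import math
import random

random.seed(0)

alpha_bar = 0.5              # hypothetical signal level at some timestep
x0 = [0.8, -0.3, 0.1, 0.5]   # a tiny "clean image"
eps = [random.gauss(0.0, 1.0) for _ in x0]  # the noise that was added

# Forward noising: mix the clean signal with noise.
x_t = [math.sqrt(alpha_bar) * c + math.sqrt(1 - alpha_bar) * n
       for c, n in zip(x0, eps)]


def estimate_clean(x_t, eps_pred, alpha_bar):
    """Invert the forward formula, given a noise prediction."""
    return [(x - math.sqrt(1 - alpha_bar) * e) / math.sqrt(alpha_bar)
            for x, e in zip(x_t, eps_pred)]


# With a perfect noise prediction, the scaled inversion recovers x0 exactly.
recovered = estimate_clean(x_t, eps, alpha_bar)
# A plain, unscaled subtraction does not.
naive = [x - e for x, e in zip(x_t, eps)]

print("clean:    ", [round(v, 4) for v in x0])
print("recovered:", [round(v, 4) for v in recovered])
print("naive:    ", [round(v, 4) for v in naive])
```

In practice the U-Net's prediction is imperfect, which is one reason generation takes many small steps rather than a single jump.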

We will explore Schedulers in the next lesson.