<a href="https://colab.research.google.com/github/alex-jk/painting-lora-finetune/blob/main/neural_style_transfer_photos_to_paintings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Neural style transfer — brief**
**Goal**<br>
Turn your photo into a painting-like image **without moving objects**: keep the photo’s layout, borrow the painting’s colors/texture.

**How it works (high level)**<br>
We create an output image `X` and **optimize its pixels** so that:<br>
• its **deep features** (from a CNN) match the photo → preserves structure/content;<br>
• its **feature statistics** (Gram matrices) match the painting → transfers style/brushwork.
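
To make "feature statistics" concrete, here is a minimal sketch of a Gram matrix computed from a feature map. The function name `gram_matrix`, the random stand-in activation, and the division by `c * h * w` are assumptions for illustration; the normalization used elsewhere in this notebook may differ.

```python
import torch

def gram_matrix(feat):
    """Channel-by-channel correlations of a (N, C, H, W) feature map."""
    n, c, h, w = feat.shape
    flat = feat.reshape(n, c, h * w)          # flatten spatial positions
    gram = flat @ flat.transpose(1, 2)        # (N, C, C)
    return gram / (c * h * w)                 # one common normalization (assumed here)

feat = torch.randn(1, 64, 32, 32)             # random stand-in for a VGG activation
g = gram_matrix(feat)
print(g.shape)                                # torch.Size([1, 64, 64])

# Shuffling spatial positions leaves the Gram matrix (nearly) unchanged:
# it records which channels fire together, not where they fire.
perm = torch.randperm(32 * 32)
shuffled = feat.reshape(1, 64, -1)[:, :, perm].reshape(1, 64, 32, 32)
same = torch.allclose(gram_matrix(shuffled), g, atol=1e-5)
print(same)
```

Because spatial positions are summed out, matching Gram matrices transfers texture and color statistics without copying the painting's layout.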

**VGG-19**<br>
**VGG-19** is a classic **convolutional neural network (CNN)** trained on ImageNet. It stacks many conv layers that detect edges, textures, parts, and objects.<br>
• **Shallow layers** respond to color/texture; **deep layers** to object/layout.<br>
• We **freeze** VGG (no training) and use it only to extract features that act as our “perceptual rulers.”<br>
• Matching **deep** VGG features → keeps object placements. Matching **Gram stats** across layers → injects painting style.

**Pipeline**<br>
Load & normalize images → run through frozen VGG → compute **content loss** (deep layer), **style loss** (Gram matrices across several layers), plus a small **TV** smoothing term → backpropagate gradients **to the image pixels** with L-BFGS or Adam until the loss plateaus or the result looks right.
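
The key idea in this pipeline is that the optimizer updates the image, not the network. The sketch below shows that idea in isolation, under loud assumptions: a tiny frozen conv stack stands in for VGG-19, the images are random tensors, and only a content-style feature-matching term is used (no style or TV terms). The learning rate and step count are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in feature extractor (an assumption): a tiny frozen conv stack, not VGG-19.
extractor = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU()).eval()
for p in extractor.parameters():
    p.requires_grad_(False)

photo = torch.rand(1, 3, 32, 32)                   # hypothetical content image in [0, 1]
target = extractor(photo)                          # fixed feature target

x = torch.rand(1, 3, 32, 32, requires_grad=True)   # output image: the only thing optimized
opt = torch.optim.Adam([x], lr=0.02)

losses = []
for step in range(200):
    opt.zero_grad()
    loss = F.mse_loss(extractor(x), target)        # feature-matching loss (content term only)
    loss.backward()                                # gradients reach x, not the extractor
    opt.step()
    with torch.no_grad():
        x.clamp_(0, 1)                             # keep pixels in a valid range
    losses.append(loss.item())

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

In a full run, the loss would be a weighted sum of content, style, and TV terms, but the loop structure stays the same.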

**Total Variation (TV) smoothing** - A regularizer that penalizes rapid pixel-to-pixel changes. It prefers images that are locally smooth (piecewise-smooth) without simply blurring everything.
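
As a rough illustration of the TV regularizer, the sketch below uses the anisotropic (L1) form: the mean absolute difference between neighboring pixels. The function name `tv_loss` and the test images are assumptions; other TV variants (for example, squared differences) are also common.

```python
import torch

def tv_loss(img):
    """Anisotropic total variation of a (N, C, H, W) image."""
    dh = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean()   # vertical neighbors
    dw = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean()   # horizontal neighbors
    return dh + dw

flat = torch.full((1, 3, 16, 16), 0.5)        # constant image
edge = torch.zeros(1, 3, 16, 16)
edge[..., 8:] = 1.0                           # one sharp vertical edge
noisy = torch.rand(1, 3, 16, 16)              # independent pixel noise

print(tv_loss(flat).item())                   # 0.0
print(tv_loss(edge).item() < tv_loss(noisy).item())
```

A single clean edge costs far less than pixel noise, which is why TV favors piecewise-smooth images rather than uniformly blurred ones.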

**Install dependencies**

In [1]:
# Colab usually has recent torch/torchvision, but this is safe.
!pip -q install --upgrade torch torchvision pillow
import torch, torchvision, PIL
print("Torch:", torch.__version__, "| Vision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())

Torch: 2.8.0+cu126 | Vision: 0.23.0+cu126
CUDA available: True


In [2]:
import math
from pathlib import Path
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image

#### **Set device and ImageNet normalization (for VGG-19 features)**

- **Device selection:** Use GPU (`cuda`) if available; otherwise fall back to CPU. All tensors/ops must be on the **same device**.
- **ImageNet normalization:** VGG-19 expects RGB inputs scaled to **[0,1]** and normalized per channel with:
  - mean = `[0.485, 0.456, 0.406]`
  - std  = `[0.229, 0.224, 0.225]`
<br>We apply it as: `(x - mean) / std` (broadcast per channel over H×W).
- **Why:** Feeding the exact normalization used in training keeps VGG feature distributions correct; skipping it can cause unstable optimization or odd colors.
- **`.to(device)` on mean/std:** Puts these constants on the same device as your images to avoid device-mismatch errors.

In [3]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).to(device)
IMAGENET_STD  = torch.tensor([0.229, 0.224, 0.225]).to(device)

### **NST Code**

#### Image I/O helpers

**load_image(path, target_long_side=None)**<br>
Convert to **RGB** so every image is 3 channels as VGG expects.<br>
Resize by the **longer side** to control compute/memory while keeping aspect ratio; **LANCZOS** gives high-quality downscaling (less aliasing → cleaner features).<br>
`ToTensor()` makes a float tensor scaled to **[0,1]**, which is required for the ImageNet normalization we apply later.<br>
`unsqueeze(0)` adds a **batch dimension** → models expect shape `(N, C, H, W)` even for a single image.<br>

**save_image(tensor, path)**<br>
`detach()` drops the autograd graph since we’re done optimizing pixels and just want the data.<br>
`clamp(0,1)` enforces valid image range after optimization so saved colors are not out of bounds.<br>
`cpu()` moves data to CPU because PIL saves from CPU tensors/arrays.<br>
`squeeze(0)` removes the batch dimension; `ToPILImage()` expects a `(C, H, W)` tensor (or an `(H, W, C)` NumPy array).<br>
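
A small worked example of the two helpers' core steps, without touching any files. The helper name `resized_dims` is hypothetical and simply isolates the longer-side arithmetic; the random tensor stands in for an optimized image.

```python
import torch

def resized_dims(w, h, target_long_side):
    """Scale (w, h) so the longer side equals target_long_side, keeping aspect ratio."""
    scale = target_long_side / max(w, h)
    return round(w * scale), round(h * scale)

print(resized_dims(4000, 3000, 512))   # (512, 384): width is the long side
print(resized_dims(3000, 4000, 512))   # (384, 512): height is the long side

# Post-processing chain on a fake optimized image with out-of-range values.
x = torch.randn(1, 3, 8, 6, requires_grad=True)
y = x.detach().clamp(0, 1).cpu().squeeze(0)
print(tuple(y.shape), y.requires_grad)  # (3, 8, 6) False
```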

In [4]:
def load_image(path, target_long_side=None):
    img = Image.open(path).convert("RGB")
    if target_long_side is not None:
        w, h = img.size
        scale = target_long_side / max(w, h)
        img = img.resize((round(w*scale), round(h*scale)), Image.LANCZOS)
    x = transforms.ToTensor()(img).unsqueeze(0).to(device)  # (1,3,H,W) in [0,1]
    return x

def save_image(tensor, path):
    x = tensor.detach().clamp(0,1).cpu().squeeze(0)
    transforms.ToPILImage()(x).save(Path(path))

#### Normalization module

**What it does**<br>
Applies per-channel ImageNet normalization to an input tensor: `(x - mean) / std`.<br>
Stores `mean` and `std` reshaped to `(1, 3, 1, 1)` so they broadcast over H×W.

**Why we need it**<br>
VGG-19 was trained on ImageNet-normalized RGB; matching that distribution makes its features meaningful and stable for NST.<br>
`register_buffer(...)` stores `mean`/`std` as buffers: they move with the module on `.to(device)`, are included in `state_dict`, and are **not** trainable parameters.
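
The difference between a buffer and a parameter can be checked directly. The module below, `BufferDemo`, is hypothetical and exists only for this illustration; the `gain` parameter is added purely for contrast.

```python
import torch
import torch.nn as nn

class BufferDemo(nn.Module):      # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        # Buffer: saved and moved with the module, never returned by parameters()
        self.register_buffer("mean", torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1))
        # Parameter: trainable, shown only for contrast
        self.gain = nn.Parameter(torch.ones(1))

demo = BufferDemo()
param_names = [n for n, _ in demo.named_parameters()]
buffer_names = [n for n, _ in demo.named_buffers()]
print(param_names)                        # ['gain']
print(buffer_names)                       # ['mean']
print(sorted(demo.state_dict().keys()))   # ['gain', 'mean']

demo.double()                             # module-level casts and moves include buffers
print(demo.mean.dtype)                    # torch.float64
```

An optimizer built from `demo.parameters()` would therefore never touch `mean`, which is the behavior wanted for fixed normalization constants.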

In [5]:
class Normalization(nn.Module):
    def __init__(self, mean, std):
        super().__init__()
        self.register_buffer("mean", mean.view(1,3,1,1))
        self.register_buffer("std",  std.view(1,3,1,1))
    def forward(self, x):
        return (x - self.mean) / self.std

### VGG layer “taps”

A *tap* is a read-only checkpoint inside VGG where we capture that layer’s activation (feature map) during the forward pass. VGG stays frozen; we only **read** these tensors.

**`VGG_LAYER_NAMES`**<br>
Maps raw indices in `vgg19.features` to human-readable names (e.g., index 1 → `relu1_1`). This tells the code exactly which intermediate outputs to capture.

**`CONTENT_LAYERS = ["relu4_2"]`**<br>
Deep feature(s) used for the **content loss** so the output keeps the photo’s object/layout structure.

**`STYLE_LAYERS = ["relu1_1","relu2_1","relu3_1","relu4_1","relu5_1"]`**<br>
Shallow→deep features used for the **style loss**. We form Gram matrices at each to capture multi-scale texture/brush/color statistics.

**Why these choices**<br>
Shallow layers encode edges/colors; deeper layers encode parts/layout. Using one deep content layer plus several style layers yields strong layout preservation with rich style transfer.
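
The index-to-name mapping can be derived rather than memorized. The sketch below rebuilds the layer order from the standard VGG-19 configuration (integers are conv output channels, `"M"` is max-pool), assuming each conv is immediately followed by a ReLU as in the non-batch-norm model. The helper `name_vgg_layers` is hypothetical.

```python
# VGG-19 configuration: integers are conv output channels, "M" is max-pool.
VGG19_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
             512, 512, 512, 512, "M", 512, 512, 512, 512, "M"]

def name_vgg_layers(cfg):
    """Return {index: name} for a stack laid out as conv, relu, ..., pool."""
    names, idx, block, conv = {}, 0, 1, 0
    for item in cfg:
        if item == "M":
            names[idx] = f"pool{block}"
            idx += 1
            block += 1
            conv = 0
        else:
            conv += 1
            names[idx] = f"conv{block}_{conv}"
            names[idx + 1] = f"relu{block}_{conv}"
            idx += 2
    return names

layer_names = name_vgg_layers(VGG19_CFG)
wanted = {"relu1_1", "relu2_1", "relu3_1", "relu4_1", "relu4_2", "relu5_1"}
taps = {i: n for i, n in layer_names.items() if n in wanted}
print(taps)
# {1: 'relu1_1', 6: 'relu2_1', 11: 'relu3_1', 20: 'relu4_1', 22: 'relu4_2', 29: 'relu5_1'}
```

The same helper makes it easy to look up the index of any other layer if you want to experiment with different taps.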


In [6]:
# VGG layer taps
VGG_LAYER_NAMES = {1:"relu1_1", 6:"relu2_1", 11:"relu3_1", 20:"relu4_1", 22:"relu4_2", 29:"relu5_1"}
CONTENT_LAYERS = ["relu4_2"]
STYLE_LAYERS   = ["relu1_1","relu2_1","relu3_1","relu4_1","relu5_1"]

### `VGGFeatures` — frozen VGG wrapper that returns tapped activations

**Purpose**<br>
Wrap a pretrained VGG-19 so we can **normalize inputs** and **capture specific layer outputs** (our taps) in one forward pass.

**Init (`__init__`)**<br>
- Loads VGG-19 features (with a fallback for older torchvision).<br>
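- `eval()` puts VGG in inference mode. The `features` stack of plain VGG-19 contains only conv, ReLU, and max-pool layers, so this changes nothing here, but it is a safe habit (it matters for models with dropout or batch norm).<br>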
- Freezes weights: `requires_grad_(False)` so we only optimize pixels, not VGG.<br>
- Creates `Normalization(mean,std)` to apply ImageNet normalization.

**Forward (`forward`)**<br>
- Normalizes input `x` to match VGG’s training distribution.<br>
- Iterates through VGG layers; after each layer, if its index is in `VGG_LAYER_NAMES`, **stores the activation** in a dict under its human-readable name (e.g., `relu4_2`).<br>
- Returns a dict `{layer_name: activation}` used to compute **content** and **style** losses.
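
The tap-and-freeze pattern can be tried on a toy network before loading real weights. In the sketch below, the two-block `stack`, the `TAPS` mapping, and `tapped_forward` are all hypothetical stand-ins for `vgg19.features`, `VGG_LAYER_NAMES`, and the wrapper's `forward`.

```python
import torch
import torch.nn as nn

# Toy stand-in for vgg19.features (an assumption): two conv/ReLU pairs.
stack = nn.Sequential(
    nn.Conv2d(3, 4, 3, padding=1), nn.ReLU(),
    nn.Conv2d(4, 8, 3, padding=1), nn.ReLU(),
).eval()
for p in stack.parameters():
    p.requires_grad_(False)                  # frozen: weights receive no gradients

TAPS = {1: "relu1", 3: "relu2"}              # hypothetical index -> name mapping

def tapped_forward(x):
    out = {}
    for i, layer in enumerate(stack):
        x = layer(x)
        if i in TAPS:
            out[TAPS[i]] = x                 # record the activation at this tap
    return out

img = torch.rand(1, 3, 16, 16, requires_grad=True)
feats = tapped_forward(img)
print({k: tuple(v.shape) for k, v in feats.items()})
# {'relu1': (1, 4, 16, 16), 'relu2': (1, 8, 16, 16)}

feats["relu2"].mean().backward()
print(img.grad is not None, stack[0].weight.grad is None)   # True True
```

Gradients reach the input image while the network weights stay untouched, which is the property the optimization relies on.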

In [7]:
class VGGFeatures(nn.Module):
    def __init__(self):
        super().__init__()
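        # Newer torchvision selects weights via an enum; older releases only accept pretrained=True.
        # VGG19_Weights exposes IMAGENET1K_V1 (an IMAGENET1K_FEATURES entry exists only for VGG-16).
        try:
            feats = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features
        except AttributeError:
            feats = models.vgg19(pretrained=True).features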
        self.vgg = feats.eval().to(device)
        for p in self.vgg.parameters():
            p.requires_grad_(False)
        self.norm = Normalization(IMAGENET_MEAN, IMAGENET_STD)
    def forward(self, x):
        out = {}
        x = self.norm(x)
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in VGG_LAYER_NAMES:
                out[VGG_LAYER_NAMES[i]] = x
        return out