# Module 5: Transfer Learning - Fine-Tuning Pretrained Models

---
## Don't Start from Scratch

### 🧠 Brain Analogy
A radiologist spends 6 years in medical school learning general medicine, then 2 years specialising in reading X-rays. They don't re-learn anatomy; they **transfer** general knowledge to the new speciality.

Transfer learning works the same way:
- **ImageNet pretraining** = medical school (1.2M images, 1000 classes, months of GPU time)
- **Your fine-tuning** = specialisation (hundreds of your images, minutes to hours)
- **Pretrained weights** = accumulated expertise: edges → shapes → textures → object parts

The model already knows how to see. You just teach it YOUR specific patterns.

### ⚙️ Engineer Analogy
ResNet-18 trained on ImageNet = a visual feature hierarchy producing rich 512-d embeddings for any image. Transfer learning: remove old head (1000 classes) → add new head (your N classes). Two strategies: Feature Extraction (freeze backbone, fast) or Fine-Tuning (partial unfreeze, more accurate).

**Level:** Intermediate  
**Duration:** ~3 hours  
**Dataset:** Dogs vs Cats ([Kaggle](https://www.kaggle.com/c/dogs-vs-cats)) + built-in torchvision subsets  
**Real-World Use Case:** Medical imaging, defect detection, any domain with limited labeled data

## What You'll Learn
- Loading pretrained ImageNet models from `torchvision.models`
- Feature extraction (freeze backbone, train only head)
- Fine-tuning (unfreeze layers progressively)
- Differential learning rates (different LR for each layer group)
- Grad-CAM: visualising what the model focuses on

## Why Transfer Learning?
```
Training ResNet-50 from scratch on ImageNet:
  → 1.2M images, ~90 GPU hours

Transfer Learning to your 1000-image dataset:
  → ~10 minutes, ~90% accuracy  ✓

Pretrained weights = the feature learning from that long training run, already done for you!
```

## Two Strategies
```
1. Feature Extraction:  freeze ALL backbone layers → only train new head
   Best when: your data is small & similar to ImageNet

2. Fine-Tuning:         unfreeze last N layers → train with small LR
   Best when: your data is larger or quite different from ImageNet
```

In [None]:
# 🧠 Loading tools: torchvision has pretrained models ready to download
# ⚙️ torchvision.models provides ResNet/EfficientNet/ViT with pretrained ImageNet weights
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, random_split

import torchvision
import torchvision.transforms as transforms
import torchvision.models as models
from torchvision.datasets import ImageFolder

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from pathlib import Path
import time

torch.manual_seed(42)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device: {device}")
print(f"torchvision version: {torchvision.__version__}")

## 5.1  Dataset Options

### 🧠 Brain Analogy
Pretrained models were trained with specific colour settings. Using different settings is like giving a doctor inverted X-rays: their pattern matching breaks down. ALWAYS use the same normalisation (mean/std) the model was trained with.

### ⚙️ Engineer Analogy
torchvision's ImageNet-pretrained models expect ImageNet normalisation: `mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]`. Wrong normalisation silently degrades the pretrained features: nothing crashes, but predictions get much worse. Val: `CenterCrop(224)` (deterministic). Train: `RandomCrop(224)` (free augmentation).
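
To make the normalisation step concrete, here is a minimal NumPy sketch of the per-channel arithmetic `(x - mean) / std` and its inverse. The helper names `normalize` and `unnormalize` and the mid-grey test image are illustrative assumptions; in the module itself this job is done by `transforms.Normalize`.

```python
import numpy as np

# ImageNet channel statistics quoted above (RGB order)
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def normalize(img_chw):
    """Apply (x - mean) / std per channel to a (3, H, W) array in [0, 1]."""
    return (img_chw - IMAGENET_MEAN[:, None, None]) / IMAGENET_STD[:, None, None]

def unnormalize(img_chw):
    """Invert normalize(), e.g. before displaying an image."""
    return img_chw * IMAGENET_STD[:, None, None] + IMAGENET_MEAN[:, None, None]

# Hypothetical mid-grey 2x2 image: every pixel is 0.5 in all three channels
grey = np.full((3, 2, 2), 0.5, dtype=np.float32)
normed = normalize(grey)
restored = unnormalize(normed)

print(normed[:, 0, 0])   # one normalised value per channel
```

Each channel is shifted and scaled separately, which is why feeding BGR images or skipping the step changes what the pretrained filters see.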

### Option A: Kaggle Dogs vs Cats (recommended for full experience)
```bash
# Install Kaggle CLI and download:
pip install kaggle
kaggle competitions download -c dogs-vs-cats
unzip dogs-vs-cats.zip
# Then organize into data/dogs_cats/train/dogs/ and data/dogs_cats/train/cats/
```

### Option B: Use STL-10 (built-in, no account needed)
We use STL-10 below (96×96 images; the demo keeps two of its ten classes, airplane vs bird).

In [None]:
# 🧠 Same "colour calibration" as original training: wrong calibration breaks the learned patterns
# ⚙️ Resize(256)→CenterCrop(224): the usual val preprocessing for 224×224 ImageNet-pretrained models
# ── ImageNet normalization stats (required for pretrained models) ──
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD  = [0.229, 0.224, 0.225]

# ── Transforms: resize to 224×224 (ImageNet standard) ──
train_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD)
])

val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD)
])

# ── Using STL-10 (10 classes, 96×96, 5000 train + 8000 test) ──
stl_train = torchvision.datasets.STL10('./data', split='train',
                                        transform=train_transform, download=True)
stl_test  = torchvision.datasets.STL10('./data', split='test',
                                        transform=val_transform, download=True)

# Use only 2 classes for binary classification demo (airplane=0, bird=1)
def filter_classes(dataset, classes):
    # Read the stored labels directly so no image is decoded or transformed while filtering
    indices = [i for i, lbl in enumerate(dataset.labels) if lbl in classes]
    return torch.utils.data.Subset(dataset, indices)

# airplane=0, bird=1 in STL-10
train_ds = filter_classes(stl_train, [0, 1])
test_ds  = filter_classes(stl_test,  [0, 1])

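# Split train into train+val
# Note: val_ds is carved out of stl_train, so it inherits train_transform (random augmentation included)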
n_train = int(0.8 * len(train_ds))
n_val   = len(train_ds) - n_train
train_ds, val_ds = random_split(train_ds, [n_train, n_val])

train_loader = DataLoader(train_ds, batch_size=32, shuffle=True,  num_workers=2)
val_loader   = DataLoader(val_ds,   batch_size=64, shuffle=False, num_workers=2)
test_loader  = DataLoader(test_ds,  batch_size=64, shuffle=False, num_workers=2)

print(f"Train: {len(train_ds)}, Val: {len(val_ds)}, Test: {len(test_ds)}")
CLASS_NAMES = ['airplane', 'bird']

## 5.2  Load Pretrained ResNet-18

### 🧠 Brain Analogy
Loading an expert who has already seen 1.2 million images. Their brain can classify 1000 objects. We just replace their "final decision layer", which currently maps 512 features to 1000 ImageNet classes, with one for our 2 classes. Keep all learned visual knowledge, swap the final decision.

### ⚙️ Engineer Analogy
ResNet-18: `layer4 → avgpool → fc(512→1000)`. Replace `fc(512→1000)` with a custom head. `in_features = model.fc.in_features = 512`: this is the backbone's output width, so it stays the same whatever head is attached.



In [None]:
# 🧠 Load the expert: 1.2M images studied, 1000 objects known, ready to specialise
# ⚙️ weights=DEFAULT downloads pretrained checkpoint (~45MB from PyTorch Hub)
# ── Available pretrained models ──
# models.resnet18, resnet50, resnet101
# models.vgg16, vgg19
# models.efficientnet_b0, efficientnet_b4
# models.mobilenet_v3_small, mobilenet_v3_large
# models.vit_b_16  (Vision Transformer)

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Inspect the architecture
print(backbone)
print(f"\nOriginal FC layer: {backbone.fc}")
print(f"Original output classes: {backbone.fc.out_features}")

## 5.3  Strategy 1: Feature Extraction (Frozen Backbone)

### 🧠 Brain Analogy
Strategy 1: Lock the expert's general knowledge, teach only the new speciality decision. Freeze all backbone parameters, since they hold valuable learned visual features. Only train the new head. Works because ImageNet features are general enough to describe airplanes and birds too.

### ⚙️ Engineer Analogy
`requires_grad=False` → autograd skips frozen params. `filter(lambda p: p.requires_grad, model.parameters())` → only trainable params go to the optimizer. Only ~1% of parameters train → very fast, with low overfitting risk on small datasets.
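
The trainable share can be checked by hand before running anything. The sketch below assumes torchvision's published ResNet-18 size (11,689,512 parameters including the 1000-class head) and the two-layer head used in this module (512 → 256 → 2); `linear_params` is a hypothetical helper, not a library function.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# Published parameter count of torchvision's ResNet-18 with its 1000-class head
RESNET18_TOTAL = 11_689_512
old_head = linear_params(512, 1000)      # fc(512 -> 1000)
backbone = RESNET18_TOTAL - old_head     # everything that gets frozen

# Assumed new head: Linear(512, 256) -> ReLU -> Dropout -> Linear(256, 2)
new_head = linear_params(512, 256) + linear_params(256, 2)

trainable_fraction = new_head / (backbone + new_head)
print(f"old head:  {old_head:,}")
print(f"backbone:  {backbone:,}")
print(f"new head:  {new_head:,}")
print(f"trainable: {100 * trainable_fraction:.1f}%")
```

ReLU and Dropout contribute no parameters, so only the two `Linear` layers count towards the head.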



In [None]:
# 🧠 Lock the expert's knowledge, only teach the new 2-class decision layer
# ⚙️ requires_grad=False prevents gradient computation; filter ensures optimizer only tracks trainable params
def build_feature_extractor(num_classes=2):
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

    # ① Freeze ALL backbone parameters
    for param in model.parameters():
        param.requires_grad = False

    # ② Replace final fully connected layer (this is the only trainable part)
    in_features = model.fc.in_features   # 512 for ResNet-18
    model.fc = nn.Sequential(
        nn.Linear(in_features, 256),
        nn.ReLU(),
        nn.Dropout(0.4),
        nn.Linear(256, num_classes)
    )
    # The new fc layer has requires_grad=True by default

    return model


fe_model = build_feature_extractor(num_classes=2).to(device)

# Only new head parameters are trained
trainable = sum(p.numel() for p in fe_model.parameters() if p.requires_grad)
total     = sum(p.numel() for p in fe_model.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({100*trainable/total:.1f}%)")

In [None]:
# 🧠 Training only the speciality decision: backbone just extracts features unchanged
# ⚙️ filter(requires_grad) in optimizer: saves memory and state vs passing all parameters
# ── Train feature extractor ──
def train_model(model, train_loader, val_loader, epochs, lr, model_name):
    criterion = nn.CrossEntropyLoss()
    # Only pass trainable parameters to optimizer
    optimizer = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=lr)
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)

    history = {'tr_acc': [], 'vl_acc': []}
    best_acc = 0.0

    for epoch in range(1, epochs + 1):
        # Train
        model.train()
        correct, total = 0, 0
        for X, y in train_loader:
            X, y = X.to(device), y.to(device)
            optimizer.zero_grad()
            logits = model(X)                 # single forward pass, reused below
            loss = criterion(logits, y)
            loss.backward()
            optimizer.step()
            preds   = logits.argmax(1)        # accuracy from the same pass (no second forward)
            correct += (preds == y).sum().item()
            total   += len(y)
        tr_acc = correct / total

        # Validate
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for X, y in val_loader:
                X, y = X.to(device), y.to(device)
                preds   = model(X).argmax(1)
                correct += (preds == y).sum().item()
                total   += len(y)
        vl_acc = correct / total

        scheduler.step()
        history['tr_acc'].append(tr_acc)
        history['vl_acc'].append(vl_acc)

        if vl_acc > best_acc:
            best_acc = vl_acc
            torch.save(model.state_dict(), f'{model_name}_best.pth')

        if epoch % 3 == 0:
            print(f"{model_name} | Epoch {epoch:2d}/{epochs} | "
                  f"train {tr_acc:.3f} | val {vl_acc:.3f}")

    print(f"\nBest val acc ({model_name}): {best_acc:.1%}\n")
    return history, best_acc


fe_history, fe_best = train_model(fe_model, train_loader, val_loader,
                                   epochs=15, lr=1e-3, model_name='fe_model')
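
The `StepLR(step_size=5, gamma=0.5)` schedule used in `train_model` follows a simple closed form. The sketch below reproduces that arithmetic in plain Python; `step_lr` is a hypothetical helper that assumes `scheduler.step()` is called once at the end of every epoch, as in the loop above.

```python
def step_lr(initial_lr, epoch, step_size=5, gamma=0.5):
    """Learning rate in effect during `epoch` (1-based) when the scheduler
    is stepped once at the end of every epoch."""
    completed = epoch - 1                     # scheduler steps taken so far
    return initial_lr * gamma ** (completed // step_size)

for epoch in (1, 5, 6, 10, 11, 15):
    print(f"epoch {epoch:2d}: lr = {step_lr(1e-3, epoch):.6f}")
```

With 15 epochs the rate is halved twice: once before epoch 6 and once before epoch 11.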

## 5.4  Strategy 2: Fine-Tuning (Unfreeze Last Layers)

### 🧠 Brain Analogy
Strategy 2: Allow the expert to also refine their high-level perception for the new task. Unfreeze `layer4`: it detects high-level features (wings/fuselage for planes, feathers/beak for birds) that ARE task-specific. Keep early layers frozen: they detect universal features (edges, textures) shared across all visual tasks.

### ⚙️ Engineer Analogy
`if 'layer4' not in name and 'fc' not in name: requires_grad = False`. Layer1-3: universal visual primitives (freeze). Layer4: high-level, task-specific (unfreeze). FC: new head (always trainable). Trainable params increase from ~1% to ~75%, because `layer4` alone holds about 8.4M of ResNet-18's ~11.2M backbone parameters.
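
Name-based freezing can be tried on a toy network first. This is a minimal sketch, assuming a hypothetical `TinyNet` that only borrows ResNet's attribute names (`layer1`, `layer4`, `fc`); it is not the real architecture and needs no download.

```python
import torch.nn as nn

class TinyNet(nn.Module):
    """Hypothetical stand-in that borrows ResNet's attribute names."""
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(8, 8)
        self.layer4 = nn.Linear(8, 8)
        self.fc = nn.Linear(8, 2)

    def forward(self, x):
        return self.fc(self.layer4(self.layer1(x)))

model = TinyNet()

# Freeze every parameter whose name mentions neither 'layer4' nor 'fc'
for name, param in model.named_parameters():
    if 'layer4' not in name and 'fc' not in name:
        param.requires_grad = False

for name, param in model.named_parameters():
    print(f"{name:15s} trainable={param.requires_grad}")

trainable_names = [n for n, p in model.named_parameters() if p.requires_grad]
```

Substring matching is convenient but loose: any parameter whose name happens to contain `fc` or `layer4` stays trainable, so printing the names as above is a cheap sanity check.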



In [None]:
# 🧠 Unlock high-level vision (layer4) while keeping basic vision (edges, textures) frozen
# ⚙️ name-based selection: "layer4" in name targets only the final convolutional block
def build_finetuned_model(num_classes=2):
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

    # Freeze early layers (they detect generic features like edges)
    for name, param in model.named_parameters():
        if 'layer4' not in name and 'fc' not in name:
            param.requires_grad = False    # freeze layer1-3

    # Replace head
    model.fc = nn.Sequential(
        nn.Linear(model.fc.in_features, 256),
        nn.ReLU(),
        nn.Dropout(0.4),
        nn.Linear(256, num_classes)
    )
    return model


ft_model = build_finetuned_model().to(device)
trainable = sum(p.numel() for p in ft_model.parameters() if p.requires_grad)
total     = sum(p.numel() for p in ft_model.parameters())
print(f"Fine-tune trainable: {trainable:,} / {total:,} ({100*trainable/total:.1f}%)")

In [None]:
# ── Differential learning rates: critical for fine-tuning ──
#
# 🧠 Brain Analogy
# Differential learning rates = "fast lane for new speciality, slow lane for existing knowledge."
# Updating the backbone with a 10× smaller LR makes small adjustments that gently tune rather than
# overwrite. With one large LR for everything, fine-tuning can wipe out the pretrained ImageNet features.
#
# ⚙️ Engineer Analogy
# Two param groups with different `lr`: `backbone_params: lr=1e-4`, `head_params: lr=1e-3`.
# This is common practice when fine-tuning; a single large LR often hurts performance on small datasets.

# Head gets 10× higher learning rate than backbone

backbone_params = [p for n, p in ft_model.named_parameters()
                   if 'fc' not in n and p.requires_grad]
head_params     = [p for n, p in ft_model.named_parameters() if 'fc' in n]

optimizer = optim.Adam([
    {'params': backbone_params, 'lr': 1e-4},   # lower LR for pretrained layers
    {'params': head_params,     'lr': 1e-3},   # higher LR for new head
])

print("Optimizer with differential learning rates:")
for i, pg in enumerate(optimizer.param_groups):
    n = sum(p.numel() for p in pg['params'])
    print(f"  Group {i}: lr={pg['lr']}, params={n:,}")

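# NOTE: train_model() as defined in 5.3 builds its own single-LR Adam from `lr`,
# so the two-group optimizer above is not the one stepped during this call.
# To apply differential LRs, the training loop has to step that optimizer instead.
ft_history, ft_best = train_model(ft_model, train_loader, val_loader,
                                   epochs=15, lr=1e-3, model_name='ft_model')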
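
Param groups only matter if the training loop steps the optimizer that holds them. The following is a minimal sketch under stated assumptions: a toy `nn.ModuleDict` stands in for the backbone and head, random tensors stand in for data, and `train_steps` is a hypothetical loop that receives its optimizer and scheduler as arguments rather than creating them.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Hypothetical stand-in: 'backbone' plays the pretrained part, 'fc' the new head
model = nn.ModuleDict({'backbone': nn.Linear(4, 4), 'fc': nn.Linear(4, 2)})

backbone_params = [p for n, p in model.named_parameters() if not n.startswith('fc')]
head_params = [p for n, p in model.named_parameters() if n.startswith('fc')]

optimizer = optim.Adam([
    {'params': backbone_params, 'lr': 1e-4},   # gentle updates for "pretrained" weights
    {'params': head_params, 'lr': 1e-3},       # faster updates for the new head
])
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)

def train_steps(model, optimizer, scheduler, epochs):
    """Toy loop that steps the optimizer it is given instead of building one."""
    criterion = nn.CrossEntropyLoss()
    X, y = torch.randn(16, 4), torch.randint(0, 2, (16,))   # random stand-in batch
    for _ in range(epochs):
        optimizer.zero_grad()
        logits = model['fc'](model['backbone'](X))
        criterion(logits, y).backward()
        optimizer.step()
        scheduler.step()

train_steps(model, optimizer, scheduler, epochs=5)
lrs = [group['lr'] for group in optimizer.param_groups]
print(lrs)
```

The scheduler scales every group by the same factor, so the 10× ratio between head and backbone is preserved throughout training.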

## 5.5  Compare Feature Extraction vs Fine-Tuning

### 🧠 Brain Analogy
Side-by-side: "Did refreshing high-level vision help vs just teaching the new decision?" Fine-tuning tends to help most when the target domain differs enough from ImageNet to benefit from backbone adaptation; on ImageNet-like data the gain may be small.

### ⚙️ Engineer Analogy
FE = Feature Extraction (frozen backbone, fast, low overfitting risk). FT = Fine-Tuning (layer4+head trainable, higher accuracy, needs more data to avoid overfitting).



In [None]:
# 🧠 Which worked better: locked expert vs. refreshed expert? Side-by-side comparison.
# ⚙️ FE = Feature Extraction; FT = Fine-Tuning. Bar chart shows best val accuracy per strategy.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.plot(fe_history['tr_acc'], '-o', label='FE Train',  color='blue',   alpha=0.7)
ax1.plot(fe_history['vl_acc'], '-s', label='FE Val',    color='blue',   alpha=0.4)
ax1.plot(ft_history['tr_acc'], '-o', label='FT Train',  color='orange', alpha=0.7)
ax1.plot(ft_history['vl_acc'], '-s', label='FT Val',    color='orange', alpha=0.4)
ax1.set_title('Accuracy: Feature Extraction vs Fine-Tuning')
ax1.set_xlabel('Epoch'); ax1.set_ylabel('Accuracy')
ax1.legend()

bars = ax2.bar(['Feature\nExtraction', 'Fine-Tuning'],
               [fe_best, ft_best],
               color=['steelblue', 'darkorange'], width=0.4)
for bar, val in zip(bars, [fe_best, ft_best]):
    ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.005,
             f'{val:.1%}', ha='center', va='bottom', fontweight='bold')
ax2.set_ylim(0, 1.05)
ax2.set_title('Best Validation Accuracy')
ax2.set_ylabel('Accuracy')

plt.suptitle('Transfer Learning Strategies Comparison')
plt.tight_layout()
plt.show()

## 5.6  Grad-CAM: What Does the Model See?

### 🧠 Brain Analogy
If the model says "airplane", WHERE in the image was it looking? Grad-CAM highlights regions that influenced the decision. A good model focuses on the OBJECT (wings, fuselage), not the background (sky). Highlighting background points to a "spurious correlation": the model will likely fail on airplanes parked on runways.

### ⚙️ Engineer Analogy
Grad-CAM: gradient of class score w.r.t. target conv layer → spatial importance weights → weighted feature map sum → ReLU → resize. Works post-hoc on a trained CNN without changing its architecture or retraining it. A widely used tool for model interpretability.

Gradient-weighted Class Activation Mapping shows which image regions influenced the prediction.
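
The arithmetic behind Grad-CAM is small enough to work through by hand. Below is a minimal NumPy sketch for a single image, using a made-up 2-channel, 2×2 feature map; `grad_cam_map` is a hypothetical helper, and in practice the activations and gradients would come from hooks on the target conv layer.

```python
import numpy as np

def grad_cam_map(activations, gradients):
    """Grad-CAM arithmetic for one image.

    activations, gradients: arrays of shape (C, H, W) from the target conv
    layer (feature maps and d(score)/d(feature maps)).
    """
    weights = gradients.mean(axis=(1, 2))               # (C,) channel importance
    cam = np.tensordot(weights, activations, axes=1)    # (H, W) weighted sum
    cam = np.maximum(cam, 0.0)                          # ReLU: keep positive evidence
    return cam / (cam.max() + 1e-8)                     # scale to [0, 1]

# Hypothetical 2-channel, 2x2 example
acts = np.array([[[1.0, 0.0], [0.0, 2.0]],
                 [[0.0, 3.0], [1.0, 0.0]]])
grads = np.array([[[1.0, 1.0], [1.0, 1.0]],         # channel 0 supports the class
                  [[-1.0, -1.0], [-1.0, -1.0]]])    # channel 1 argues against it
cam = grad_cam_map(acts, grads)
print(cam)
```

Locations where only the negatively weighted channel fires are zeroed by the ReLU, which is why the map shows evidence for the class rather than against it.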

In [None]:
# 🧠 "Where was the model looking when it decided airplane vs bird?" (model attention map)
# ⚙️ Grad-CAM: backward hook captures dScore/dActivation; weight feature maps by importance
class GradCAM:
    """Simplified Grad-CAM for any model with a target conv layer."""

    def __init__(self, model, target_layer):
        self.model = model
        self.gradients   = None
        self.activations = None

        target_layer.register_forward_hook(
            lambda m, i, o: setattr(self, 'activations', o)
        )
        target_layer.register_full_backward_hook(
            lambda m, gi, go: setattr(self, 'gradients', go[0])
        )

    def __call__(self, x, class_idx=None):
        self.model.eval()
        logits = self.model(x)
        if class_idx is None:
            class_idx = logits.argmax(1).item()

        self.model.zero_grad()
        logits[0, class_idx].backward()

        # Pool gradients over spatial dims → importance per channel
        weights = self.gradients.mean(dim=(2, 3), keepdim=True)   # (1,C,1,1)
        cam = (weights * self.activations).sum(dim=1, keepdim=True)  # (1,1,H,W)
        cam = torch.relu(cam)                                         # only positives
        cam = cam - cam.min()
        cam = cam / (cam.max() + 1e-8)
        # detach(): the activations are still attached to the autograd graph
        return cam.squeeze().detach().cpu().numpy(), class_idx


# Attach Grad-CAM to layer4 of ResNet (last conv block)
ft_model.load_state_dict(torch.load('ft_model_best.pth', map_location=device))
grad_cam = GradCAM(ft_model, ft_model.layer4[1].conv2)

# Run on a few validation images
def unnormalize(t):
    t = t.clone()
    for c, (m, s) in enumerate(zip(IMAGENET_MEAN, IMAGENET_STD)):
        t[c] = t[c] * s + m
    return t.permute(1, 2, 0).clamp(0, 1).numpy()

images, labels = next(iter(DataLoader(val_ds, batch_size=6, shuffle=True)))

fig, axes = plt.subplots(2, 6, figsize=(16, 6))
for i in range(6):
    img_t = images[i:i+1].to(device).requires_grad_(True)
    cam, pred = grad_cam(img_t)

    original = unnormalize(images[i])

    axes[0][i].imshow(original)
    axes[0][i].set_title(f"True:{CLASS_NAMES[labels[i].item()]}\nPred:{CLASS_NAMES[pred]}",
                          fontsize=8)
    axes[0][i].axis('off')

    # Overlay (resize CAM to image size)
    import PIL.Image
    cam_resized = np.array(PIL.Image.fromarray((cam * 255).astype(np.uint8)).resize(
        (original.shape[1], original.shape[0]), PIL.Image.BILINEAR
    )) / 255.0
    overlay = 0.5 * original + 0.5 * plt.get_cmap('jet')(cam_resized)[:, :, :3]

    axes[1][i].imshow(np.clip(overlay, 0, 1))
    axes[1][i].set_title("Grad-CAM", fontsize=8)
    axes[1][i].axis('off')

plt.suptitle('Grad-CAM: Regions Used for Prediction', fontsize=12)
plt.tight_layout()
plt.show()

## 5.7  Quick Reference: Pretrained Model APIs

### 🧠 Brain Analogy
Quick guide to different pretrained "expert brain" models. Each is built differently: ResNet uses skip connections, EfficientNet scales depth, width and resolution together, ViT uses attention instead of convolution.

### ⚙️ Engineer Analogy
The final `nn.Linear` of each model's head exposes `in_features`, the head input dimension. Never hardcode 512, as it varies by architecture. ONNX export enables framework-independent deployment for production systems.

```python
# Available models (weights API: torchvision 0.13+)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.DEFAULT)
model = models.vit_b_16(weights=models.ViT_B_16_Weights.DEFAULT)   # Vision Transformer
model = models.mobilenet_v3_small(weights=models.MobileNet_V3_Small_Weights.DEFAULT)

# Get head input size programmatically (use the line that matches your architecture)
in_features = model.fc.in_features           # ResNet, Inception v3
in_features = model.classifier[-1].in_features  # EfficientNet, MobileNet
in_features = model.heads.head.in_features   # ViT

# Replace head (ResNet shown; assign to the matching attribute for other models)
model.fc = nn.Linear(in_features, num_classes)
```
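
The three lookups above can be folded into one function. This is a sketch only: `head_in_features` is a hypothetical helper, it covers just the attribute layouts listed in this section, and the stand-in classes mimic attribute names rather than real torchvision models.

```python
import torch.nn as nn

def head_in_features(model):
    """Hypothetical helper: input width of the classifier's last Linear layer.

    Covers the three attribute layouts listed above; other architectures
    would need their own branch.
    """
    if hasattr(model, 'fc'):                 # ResNet-style
        return model.fc.in_features
    if hasattr(model, 'classifier'):         # EfficientNet / MobileNet-style
        return model.classifier[-1].in_features
    if hasattr(model, 'heads'):              # ViT-style
        return model.heads.head.in_features
    raise ValueError(f"Unrecognised head layout: {type(model).__name__}")

# Stand-in modules that only mimic the attribute names (no download needed)
class ResNetLike(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(512, 1000)

class EfficientNetLike(nn.Module):
    def __init__(self):
        super().__init__()
        self.classifier = nn.Sequential(nn.Dropout(0.2), nn.Linear(1280, 1000))

class ViTLike(nn.Module):
    def __init__(self):
        super().__init__()
        self.heads = nn.Sequential()
        self.heads.add_module('head', nn.Linear(768, 1000))

sizes = [head_in_features(m) for m in (ResNetLike(), EfficientNetLike(), ViTLike())]
print(sizes)
```

Raising on an unknown layout is deliberate: silently guessing a width would only surface later as a shape mismatch.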

## Exercises

1. Download the [Dogs vs Cats dataset from Kaggle](https://www.kaggle.com/c/dogs-vs-cats) and use `ImageFolder`.
2. Swap ResNet-18 for EfficientNet-B0. Compare accuracy and inference time.
3. Implement **progressive unfreezing**: start with only the head, then unfreeze layer4, then layer3, etc. every 5 epochs.
4. Export the model to **ONNX** for deployment: `torch.onnx.export(model, dummy, 'model.onnx')`.
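
As a starting point for exercise 3, the unfreezing timetable can be written down before touching the model. The sketch below is one possible schedule, not a reference solution; `unfrozen_groups`, its default group order and the 5-epoch interval are assumptions taken from the exercise wording.

```python
def unfrozen_groups(epoch, order=('fc', 'layer4', 'layer3', 'layer2', 'layer1'), every=5):
    """Layer groups that should be trainable during `epoch` (1-based).

    The head trains from the start; one more group is unlocked every
    `every` epochs, moving from the output towards the input.
    """
    n_unlocked = min(1 + (epoch - 1) // every, len(order))
    return list(order[:n_unlocked])

for epoch in (1, 5, 6, 11, 30):
    print(epoch, unfrozen_groups(epoch))
```

In a training loop, the returned names would decide which parameters get `requires_grad = True`; newly unlocked parameters also have to reach the optimizer, for example through `optimizer.add_param_group`.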

---
**Next →** [Module 06: RNNs & LSTMs for Sequential Data](./Module_06_RNN_LSTM_Sentiment.ipynb)