# 🔁 Week 09-10 · Notebook 04 · Forward & Backward Pass Diagnostics

Trace gradients through manufacturing classifiers, detect instability, and enforce quality gates before deploying fine-tuned models.

## 🎯 Learning Objectives
- **Build Intuition:** Manually compute gradients for a simple network to understand the mechanics of backpropagation.
- **Inspect Gradients:** Use PyTorch's autograd hooks to inspect layer-wise gradient norms during training.
- **Diagnose Instability:** Identify and mitigate exploding or vanishing gradients, which can surface when training on imbalanced manufacturing data (e.g., rare equipment failures).
- **Implement Governance:** Create a monitoring and alerting strategy for training stability, ensuring model quality and compliance before deployment.

## 🧩 Scenario
You fine-tuned a classifier to flag safety incidents. After deploying it to four plants, you observe erratic confidence scores. The root cause: gradients exploded during fine-tuning because of the class imbalance between critical and routine events.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import pandas as pd

torch.manual_seed(7)

## 🧮 Manual Gradient Walkthrough
For a simple logistic unit, compute the gradient explicitly and compare with autograd.

In [None]:
x = torch.tensor([[0.5, 1.0]], requires_grad=True)
w = torch.tensor([[1.2], [-0.7]], requires_grad=True)
y_true = torch.tensor([[1.0]])

logits = x @ w
y_pred = torch.sigmoid(logits)
loss = F.binary_cross_entropy(y_pred, y_true)
loss.backward()
x.grad, w.grad

Compare the output with the analytical derivative to build confidence in your gradient checks.
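
One way to make that comparison concrete, as a minimal sketch: for a sigmoid unit trained with binary cross-entropy, the derivative of the loss with respect to the logit simplifies to $\sigma(z) - y$, divided by the batch size under the default mean reduction. The chain rule then gives $\partial L / \partial w = x^\top \delta$ and $\partial L / \partial x = \delta\, w^\top$. The names `delta`, `manual_w_grad`, and `manual_x_grad` are introduced here for illustration only.

```python
import torch
import torch.nn.functional as F

# Same values as the walkthrough cell above.
x = torch.tensor([[0.5, 1.0]], requires_grad=True)
w = torch.tensor([[1.2], [-0.7]], requires_grad=True)
y_true = torch.tensor([[1.0]])

y_pred = torch.sigmoid(x @ w)
loss = F.binary_cross_entropy(y_pred, y_true)
loss.backward()

# Analytical gradients: dL/dz = (sigmoid(z) - y) / batch_size
with torch.no_grad():
    delta = (y_pred - y_true) / x.shape[0]
    manual_w_grad = x.T @ delta   # shape (2, 1), matches w
    manual_x_grad = delta @ w.T   # shape (1, 2), matches x

w_matches = torch.allclose(manual_w_grad, w.grad, atol=1e-6)
x_matches = torch.allclose(manual_x_grad, x.grad, atol=1e-6)
print("w gradient matches autograd:", w_matches)
print("x gradient matches autograd:", x_matches)
```

If either check fails on your own network, the usual suspects are a mismatched reduction (sum versus mean) or a transposed weight matrix.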

## 🧱 Safety Incident Classifier
A miniature MLP flags whether a maintenance note contains a safety-critical issue.

In [None]:
class IncidentNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(16, 32)
        self.fc2 = nn.Linear(32, 16)
        self.fc3 = nn.Linear(16, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return torch.sigmoid(self.fc3(x))

model = IncidentNet()
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

## 📊 Gradient Norm Logging
Attach hooks to capture gradient statistics after each backward pass and store them for dashboards.

In [None]:
gradient_log = []
gradient_norms = {}

def capture_gradient_hook(module, grad_input, grad_output):
    """
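    A full backward hook: it fires during the backward pass each time the
    gradients with respect to this module's outputs are computed.
    It records the norm of the loss gradient with respect to the layer's output
    (an activation gradient, not the gradient of the layer's weights).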
    """
    # Find the name of the module that this hook is attached to.
    layer_name = [name for name, mod in model.named_modules() if mod is module][0]
    
    # grad_output is a tuple containing gradients with respect to the module's output.
    if grad_output[0] is not None:
        grad_norm = grad_output[0].norm().item()
        # Store the latest norm for quick access
        gradient_norms[layer_name] = grad_norm
        # Append to a log for historical analysis
        gradient_log.append({'layer': layer_name, 'grad_norm': grad_norm})

def register_hooks(model):
    """
    Registers a backward hook on all nn.Linear layers of the model.
    This allows us to inspect gradients without altering the training loop.
    """
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            # register_full_backward_hook is the modern and recommended way to register hooks.
            module.register_full_backward_hook(capture_gradient_hook)

register_hooks(model)

In [None]:
def training_step(batch_size=32):
    model.train()
    feature_scale = torch.tensor([1.0] * 8 + [5.0] * 8)
    inputs = torch.randn(batch_size, 16) * feature_scale
    targets = torch.bernoulli(torch.full((batch_size, 1), 0.1))
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()
    return loss.item()

for step in range(5):  # each iteration is one mini-batch step, not a full epoch
    _ = training_step()

gradient_df = pd.DataFrame(gradient_log)
gradient_df.head()

## 📈 Gradient Trends
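Use the recorded norms to trigger alerts when their magnitude or variance exceeds a threshold. The check below covers magnitude only: it flags every logged record whose norm is above a fixed limit.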

In [None]:
threshold = 5.0
alerts = gradient_df[gradient_df['grad_norm'] > threshold]
alerts
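
The magnitude filter can be complemented with a per-layer variance check. The sketch below is one possible approach, not a prescribed one: `summarize_gradient_norms` and the `max_std` limit of 2.0 are hypothetical, and `demo_df` holds placeholder values rather than real training output. In the notebook you would pass `gradient_df` instead.

```python
import pandas as pd

def summarize_gradient_norms(gradient_df, max_norm=5.0, max_std=2.0):
    """Per-layer mean/std/max of logged norms plus simple alert flags."""
    summary = (
        gradient_df.groupby("layer")["grad_norm"]
        .agg(["mean", "std", "max"])
        .fillna(0.0)  # std is NaN when a layer has only one record
    )
    summary["magnitude_alert"] = summary["max"] > max_norm
    summary["variance_alert"] = summary["std"] > max_std
    return summary

# Placeholder records for illustration only (not real training output).
demo_df = pd.DataFrame([
    {"layer": "fc1", "grad_norm": 0.4},
    {"layer": "fc1", "grad_norm": 0.6},
    {"layer": "fc3", "grad_norm": 1.0},
    {"layer": "fc3", "grad_norm": 9.0},
])

summary = summarize_gradient_norms(demo_df)
print(summary)
```

Aggregating per layer keeps the alert table small and makes it easier to see which layer is unstable, rather than scanning individual records.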

### 🚨 Governance Guidance
- If any layer's gradient norm consistently exceeds 5.0 (the `threshold` used above), pause training and notify QA.
- Log gradient alerts in the safety incident tracker and escalate to the MLOps on-call engineer.
- Keep gradient logs for 30 days to support ISO 45001 audits.
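
The guidance does not define "consistently", so any automated gate has to pick a definition. The sketch below assumes one: a layer trips the gate when its norm exceeds the threshold for `patience` consecutive steps. Both the helper `exceeds_consistently` and the default `patience=3` are hypothetical and should be agreed with QA.

```python
def exceeds_consistently(norms, threshold=5.0, patience=3):
    """Return True if `patience` consecutive norms exceed `threshold`."""
    streak = 0
    for norm in norms:
        # Extend the streak on an exceedance, otherwise reset it.
        streak = streak + 1 if norm > threshold else 0
        if streak >= patience:
            return True
    return False

# Three exceedances in a row trip the gate.
sustained = exceeds_consistently([1.0, 6.0, 7.0, 8.0])
# Isolated spikes that recover in between do not.
spiky = exceeds_consistently([6.0, 1.0, 6.0, 1.0, 6.0])
print(sustained, spiky)
```

Applied to the logged data, the helper would run once per layer on that layer's norms in step order.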

In [None]:
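def apply_gradient_clipping(max_norm=1.0):
    # Rescales parameter gradients in place so their total L2 norm is at most
    # max_norm, and returns the total norm measured before clipping.
    return torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

gradient_log.clear()
pre_clip_norms = []
for step in range(5):
    model.train()
    inputs = torch.randn(32, 16)
    targets = torch.bernoulli(torch.full((32, 1), 0.1))
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    pre_clip_norms.append(apply_gradient_clipping(1.0).item())
    optimizer.step()

# The hooks fire during backward(), before clipping runs, so gradient_log still
# holds unclipped activation-gradient norms. pre_clip_norms tracks the
# parameter-gradient norm that clipping actually acts on.
pd.DataFrame(gradient_log).head()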
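
Because the backward hooks record gradients with respect to layer outputs while `clip_grad_norm_` rescales parameter gradients, the hook log alone cannot confirm that clipping took effect. A minimal sketch of a direct check follows. It uses a hypothetical helper `total_param_grad_norm` and a small stand-in `demo_model` so that it runs on its own, and `max_norm` is set deliberately low so that clipping is certain to trigger.

```python
import torch
import torch.nn as nn

def total_param_grad_norm(model):
    """L2 norm over all parameter gradients, the quantity clip_grad_norm_ bounds."""
    grads = [p.grad.detach().flatten() for p in model.parameters() if p.grad is not None]
    return torch.cat(grads).norm().item()

torch.manual_seed(0)
# Stand-in model for illustration; substitute IncidentNet in the notebook.
demo_model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 1))
inputs = torch.randn(32, 16) * 5.0
targets = torch.bernoulli(torch.full((32, 1), 0.1))

loss = nn.functional.binary_cross_entropy_with_logits(demo_model(inputs), targets)
loss.backward()

max_norm = 0.01  # deliberately tiny so clipping is guaranteed to act
before = total_param_grad_norm(demo_model)
reported = torch.nn.utils.clip_grad_norm_(demo_model.parameters(), max_norm).item()
after = total_param_grad_norm(demo_model)

print(f"before={before:.4f} reported={reported:.4f} after={after:.4f}")
```

The value returned by `clip_grad_norm_` is the norm before clipping, so it can be logged alongside the hook data without a second pass over the parameters.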

## 🧪 Lab Assignment
1. Load your plant's labeled incident dataset and reproduce gradient logs.
2. Implement class-weighted loss and observe gradient stability improvements.
3. Build a Streamlit dashboard displaying gradient histograms per epoch.
4. Draft an SOP section documenting gradient alert handling procedures.
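
As a possible starting point for item 2, the sketch below weights the positive class by the negative-to-positive ratio using the `pos_weight` argument of `nn.BCEWithLogitsLoss`. It assumes a change from the notebook's setup: `IncidentNet.forward` would need to return the raw `fc3` output rather than its sigmoid, since this loss applies the sigmoid internally. The synthetic `targets` and zero `logits` are placeholders for your plant's data.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Placeholder labels with roughly 10% positives, mirroring the notebook's batches.
targets = torch.bernoulli(torch.full((256, 1), 0.1))
logits = torch.zeros(256, 1)  # an uninformed model: sigmoid(0) = 0.5 everywhere

num_pos = targets.sum()
num_neg = targets.numel() - num_pos
# Ratio of negatives to positives; clamp guards against a batch with no positives.
pos_weight = (num_neg / num_pos.clamp(min=1.0)).reshape(1)

weighted = nn.BCEWithLogitsLoss(pos_weight=pos_weight, reduction="none")
per_sample = weighted(logits, targets)

# With this weighting, both classes contribute equally to the total loss.
pos_total = per_sample[targets == 1].sum()
neg_total = per_sample[targets == 0].sum()
print(f"pos_weight={pos_weight.item():.2f} pos_total={pos_total.item():.3f} neg_total={neg_total.item():.3f}")
```

For training, drop `reduction="none"` to get the default mean. Computing `pos_weight` from the full training set instead of a single batch keeps the weighting stable across steps.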

## ✅ Checklist
- [ ] Manual gradient calculations validated
- [ ] Autograd hooks configured with logging
- [ ] Alerts and mitigations documented
- [ ] Governance dashboard shared with safety committee

## 📚 References
- PyTorch Autograd Mechanics
- *Human Factors in Safety-Critical AI* (2024)
- ISO 45001: Occupational Health and Safety Management