# 🧠 Notebook 01: Neural Network Fundamentals with PyTorch

**Week 3-4: Deep Learning & NLP Foundations**  
**Gen AI Masters Program**

---

## 📋 Objectives

Welcome to the first notebook of the Deep Learning & NLP phase! This is where we transition from classical machine learning models to the powerful world of **neural networks**. This notebook lays the essential groundwork for understanding all subsequent deep learning architectures, including the Transformers that power modern Generative AI.

By the end of this notebook, you will have a solid, hands-on understanding of:
1.  **Core Neural Network Components**: You will grasp the fundamental architecture of a neural network, including the roles of **neurons**, **layers**, **weights**, and **biases**.
2.  **The Training Mechanism**: You will master the key processes that allow a network to learn: **forward propagation** (making a prediction), **backpropagation** (calculating the error gradient), and **gradient descent** (updating the weights).
3.  **The Role of Activation Functions**: You will understand why non-linearity is crucial for learning complex patterns and implement key activation functions like **Sigmoid**, **Tanh**, and the ubiquitous **ReLU**.
4.  **Building Networks with PyTorch**: You will learn to construct a neural network from scratch using PyTorch's elegant and powerful `nn.Module` class.
5.  **End-to-End Training Workflow**: You will go through the complete cycle of defining a model, preparing data, training the model, calculating its loss, and evaluating its performance on both simple and complex datasets.

**Estimated Time:** 3-4 hours

---

## 📚 What are Neural Networks?

Neural networks are a class of machine learning models inspired by the structure and function of the human brain. They form the backbone of modern **deep learning** and are the engine behind the recent breakthroughs in Generative AI. Their power lies in their ability to automatically learn complex, non-linear patterns and representations directly from data.

### Key Components:
-   **🔵 Neurons (or Nodes)**: The most fundamental computational units. Each neuron receives inputs, processes them, and passes the result to the next layer.
-   **🔗 Layers**: Neurons are organized into layers. A typical network has an **Input Layer** (which receives the raw data), one or more **Hidden Layers** (where the majority of the computation happens), and an **Output Layer** (which produces the final prediction).
-   **⚡ Weights & Biases**: These are the **learnable parameters** of the network. The entire training process is about finding the optimal values for these weights and biases to minimize the prediction error.
-   **🎯 Activation Functions**: Mathematical functions (like ReLU) applied to the output of each neuron. They introduce essential **non-linearity**, allowing the network to learn far more complex relationships than a simple linear model.
-   **📉 Loss Function**: A function that quantifies how wrong the model's predictions are compared to the actual ground truth. The goal of training is to minimize this function.
-   **🔄 Optimizer**: An algorithm (e.g., Adam, SGD) that updates the network's weights and biases based on the gradients computed during backpropagation. Its job is to navigate the "loss landscape" to find the point of minimum loss.
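
To make the interplay between the loss function and the optimizer concrete, here is a minimal sketch of plain gradient descent on a single parameter. The loss `(w - 3)^2`, the learning rate, and the step count are illustrative assumptions and are not values used elsewhere in this notebook.

```python
# Illustrative one-parameter "model": the loss is smallest when w equals 3.
def loss(w: float) -> float:
    return (w - 3.0) ** 2

def grad(w: float) -> float:
    # Analytical derivative of the loss with respect to w.
    return 2.0 * (w - 3.0)

w = 0.0    # arbitrary starting value
lr = 0.1   # learning rate (assumed for this sketch)
history = [loss(w)]

for _ in range(50):
    w -= lr * grad(w)          # step against the gradient
    history.append(loss(w))

print(f"w after 50 steps: {w:.4f}, loss: {history[-1]:.6f}")
```

Optimizers such as SGD or Adam apply the same idea to every weight and bias at once, using gradients supplied by backpropagation instead of a hand-written derivative.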

Let's dive in and build one from the ground up! 🚀

In [None]:
# --- Core Deep Learning Libraries ---
import torch
import torch.nn as nn  # `nn` is PyTorch's module for building neural networks
import torch.optim as optim  # `optim` contains optimization algorithms like Adam or SGD

# --- Data Handling & Visualization ---
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# --- Notebook Setup ---
# Set a consistent and professional style for all plots.
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)  # Default figure size
plt.rcParams['figure.dpi'] = 100  # Figure resolution in dots per inch (matplotlib's default)

# --- Reproducibility ---
# Setting random seeds makes runs repeatable: operations like weight initialization
# start from the same random state each time. Results can still differ slightly across
# hardware, library versions, or non-deterministic GPU kernels.
SEED = 42
torch.manual_seed(SEED)
np.random.seed(SEED)

# --- Device Configuration ---
# This is a critical step in any deep learning script. We check if a CUDA-enabled GPU is
# available and set the device accordingly. Training on a GPU can be orders of magnitude
# faster than on a CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"✅ Using device: {device}")
print(f"PyTorch Version: {torch.__version__}")

## 1️⃣ The Perceptron: The Foundational Building Block

The **perceptron** is the simplest possible form of a neural network, representing a single artificial neuron. It was one of the earliest supervised learning algorithms and is the conceptual ancestor of the complex deep learning models we use today. Understanding the perceptron is key to understanding how more complex networks operate.

A perceptron takes multiple inputs, computes a weighted sum of them, adds a bias, and passes the result through a **step function** to produce a single binary output. Modern neurons, including the one implemented below, replace the step function with a smooth activation such as the sigmoid.

### Mathematical Formulation
For a single neuron with inputs `X = [x₁, x₂, ..., xₙ]` and corresponding weights `W = [w₁, w₂, ..., wₙ]`, the process is as follows:

1.  **Compute the Linear Combination**: The neuron first computes a weighted sum of its inputs and adds a bias term `b`. This is often called the *logit* or *pre-activation*.
    `z = (w₁*x₁ + w₂*x₂ + ... + wₙ*xₙ) + b`

2.  **Apply the Activation Function**: The result `z` is then passed through an activation function to produce the final output.
    `output = activation(z)`

This can be expressed more concisely using vector notation, which is how it's implemented in libraries like NumPy and PyTorch for efficiency:

`output = activation( W · X + b )`
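
As a quick check of the two formulations above, the sketch below computes `z` once with an explicit sum and once with a dot product, then applies a step function. The input values, weights, and bias are made-up numbers chosen only for illustration.

```python
import numpy as np

# Made-up example values (assumptions for this sketch).
x = np.array([0.7, 0.2, -1.0])   # inputs
w = np.array([0.5, -0.3, 0.8])   # weights
b = 0.1                          # bias

# Element-by-element weighted sum, as in step 1 of the formulation.
z_loop = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b

# The same quantity written as a dot product.
z_dot = np.dot(w, x) + b

# Step activation: output 1 if z is non-negative, otherwise 0.
step_output = 1 if z_dot >= 0 else 0

print(f"z (loop) = {z_loop:.2f}, z (dot) = {z_dot:.2f}, step output = {step_output}")
```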

In [None]:
# --- A Simple Perceptron Implemented from Scratch using NumPy ---
# Building a perceptron from scratch is a great way to understand the core mechanics
# before moving to a high-level framework like PyTorch.
class Perceptron:
    """A simple implementation of a single neuron (perceptron) using NumPy."""
    
    def __init__(self, input_size: int):
        """
        Initializes the perceptron's parameters: weights and bias.
        
        Args:
            input_size (int): The number of input features the perceptron will accept.
        """
        # The weights and bias are the learnable parameters. We initialize them randomly.
        # The shape of the weights must match the number of input features.
        self.weights = np.random.randn(input_size)
        # The bias is a single scalar value.
        self.bias = np.random.randn()
    
    def _sigmoid(self, x: float) -> float:
        """
        The Sigmoid activation function. It squashes any input value into a range between 0 and 1.
        This is useful for interpreting the output as a probability.
        """
        return 1 / (1 + np.exp(-x))
    
    def forward(self, X: np.ndarray) -> float:
        """
        Performs the forward pass: computes the weighted sum and applies the activation function.
        
        Args:
            X (np.ndarray): A NumPy array representing the input features.
            
        Returns:
            float: The output of the perceptron, a value between 0 and 1.
        """
        # 1. Compute the linear combination (weighted sum + bias).
        # `np.dot` calculates the dot product between the input features and the weights.
        linear_combination = np.dot(X, self.weights) + self.bias
        
        # 2. Apply the activation function to the linear combination.
        return self._sigmoid(linear_combination)
    
    def predict(self, X: np.ndarray) -> int:
        """Makes a binary prediction (0 or 1) based on the forward pass output."""
        # The standard threshold for a sigmoid output is 0.5.
        # If the output probability is 0.5 or greater, we predict class 1; otherwise, class 0.
        return 1 if self.forward(X) >= 0.5 else 0

# --- Test the Perceptron ---
# Create a perceptron that accepts 2 input features.
perceptron = Perceptron(input_size=2)
test_input = np.array([0.7, 0.2])

# Perform a forward pass to get the raw output probability.
output_value = perceptron.forward(test_input)
# Make a final binary prediction based on the output.
prediction = perceptron.predict(test_input)

print("🔵 Single Perceptron Test (from Scratch)")
print("=" * 50)
print(f"Input Features: {test_input}")
print(f"Initialized Weights: {perceptron.weights}")
print(f"Initialized Bias: {perceptron.bias:.4f}")
print("-" * 50)
print(f"Output (Probability): {output_value:.4f}")
print(f"Final Prediction (Threshold at 0.5): {prediction}")

## 2️⃣ Activation Functions: Introducing Non-Linearity

Activation functions are a critical component of any neural network. Their primary purpose is to introduce **non-linearity** into the model.

**Why is non-linearity so important?**
Without a non-linear activation function, a neural network, no matter how many layers it has, would behave just like a single-layer linear model. Each layer would just be performing a linear transformation (matrix multiplication). A series of linear transformations can be collapsed into a single, equivalent linear transformation. This means the network would only be able to learn linear relationships in the data, which is far too restrictive for complex, real-world problems like image recognition or natural language understanding.

By introducing non-linearity, activation functions allow the network to learn much more complex, curved decision boundaries and model intricate patterns in the data.
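
The collapse argument above can be verified numerically. The sketch below uses small, hand-picked matrices (assumptions for illustration) to show that two stacked linear layers equal one linear layer, and that inserting a ReLU between them breaks that equivalence.

```python
import numpy as np

# Hand-picked weights and input (illustrative assumptions).
W1 = np.array([[1.0, -1.0],
               [0.5,  2.0]])
b1 = np.array([0.0, -1.0])
W2 = np.array([[1.0, 1.0]])
b2 = np.array([0.5])
x = np.array([1.0, 2.0])

# Two linear layers applied one after the other, with no activation in between.
two_layers = W2 @ (W1 @ x + b1) + b2

# The single equivalent layer: W_eq = W2 W1 and b_eq = W2 b1 + b2.
W_eq = W2 @ W1
b_eq = W2 @ b1 + b2
one_layer = W_eq @ x + b_eq

# The same two layers with a ReLU in between.
with_relu = W2 @ np.maximum(0, W1 @ x + b1) + b2

print("Two linear layers:", two_layers)
print("One collapsed layer:", one_layer)
print("With ReLU in between:", with_relu)
```

For this input the first hidden unit has a negative pre-activation, so the ReLU zeroes it out and the result no longer matches the collapsed linear layer.
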

In [None]:
# --- Visualizing Common Activation Functions ---
# Generate a range of input values to see how each function behaves.
x = np.linspace(-6, 6, 100)

# --- Calculate the output of each activation function ---
# Sigmoid: Squashes values to a (0, 1) range.
sigmoid = 1 / (1 + np.exp(-x))
# Tanh (Hyperbolic Tangent): Squashes values to a (-1, 1) range.
tanh = np.tanh(x)
# ReLU (Rectified Linear Unit): The most common activation for hidden layers. It's simply max(0, x).
relu = np.maximum(0, x)
# Leaky ReLU: An improvement on ReLU that allows a small, non-zero gradient for negative inputs.
leaky_relu = np.where(x > 0, x, 0.1 * x) # alpha = 0.1 (slope for negative inputs), exaggerated so it shows on the plot; PyTorch's `nn.LeakyReLU` defaults to 0.01

# --- Plotting ---
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
fig.suptitle('Common Activation Functions', fontsize=16, fontweight='bold')

# Sigmoid Plot
axes[0, 0].plot(x, sigmoid, color='blue', lw=2)
axes[0, 0].set_title('Sigmoid', fontweight='bold')
axes[0, 0].text(-5.5, 0.8, 'Range: (0, 1)\nUse: Output layer for binary classification', bbox=dict(facecolor='blue', alpha=0.1))
axes[0, 0].grid(True, linestyle='--', alpha=0.6)

# Tanh Plot
axes[0, 1].plot(x, tanh, color='green', lw=2)
axes[0, 1].set_title('Tanh', fontweight='bold')
axes[0, 1].text(-5.5, 0.7, 'Range: (-1, 1)\nUse: Hidden layers (often better than sigmoid)', bbox=dict(facecolor='green', alpha=0.1))
axes[0, 1].grid(True, linestyle='--', alpha=0.6)

# ReLU Plot
axes[1, 0].plot(x, relu, color='red', lw=2)
axes[1, 0].set_title('ReLU', fontweight='bold')
axes[1, 0].text(-5.5, 5, 'Range: [0, ∞)\nUse: Most common for hidden layers', bbox=dict(facecolor='red', alpha=0.1))
axes[1, 0].grid(True, linestyle='--', alpha=0.6)

# Leaky ReLU Plot
axes[1, 1].plot(x, leaky_relu, color='purple', lw=2)
axes[1, 1].set_title('Leaky ReLU', fontweight='bold')
axes[1, 1].text(-5.5, 4, 'Range: (-∞, ∞)\nUse: Fixes the "dying ReLU" problem', bbox=dict(facecolor='purple', alpha=0.1))
axes[1, 1].grid(True, linestyle='--', alpha=0.6)

# Common formatting for all subplots
for ax in axes.flat:
    ax.axhline(0, color='black', linewidth=0.5)
    ax.axvline(0, color='black', linewidth=0.5)
    ax.set_xlabel('Input (z)')
    ax.set_ylabel('Output')

plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()

print("\n--- Key Properties & Use Cases ---")
print("🔹 Sigmoid: Historically popular, but now mostly used in the output layer of a binary classifier to produce a probability.")
print("🔹 Tanh: Often preferred over sigmoid for hidden layers because it is zero-centered, which can help with optimization during training.")
print("🔹 ReLU: The default choice for hidden layers. It's computationally very efficient and helps mitigate the 'vanishing gradient' problem. Its main drawback is the 'dying ReLU' problem where neurons can become inactive.")
print("🔹 Leaky ReLU: An improvement over ReLU that allows a small, non-zero gradient when the unit is not active, preventing neurons from 'dying'.")
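
The vanishing-gradient remark in the summary above can be made concrete by comparing derivatives. This is a rough sketch: it looks only at the activation derivatives and ignores the weights, and the depth of 10 layers is an arbitrary assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-6, 6, 121)

# Derivative of the sigmoid: s(z) * (1 - s(z)). It never exceeds 0.25.
sigmoid_grad = sigmoid(z) * (1.0 - sigmoid(z))

# Derivative of ReLU: 1 for positive inputs, 0 otherwise.
relu_grad = (z > 0).astype(float)

max_sigmoid_grad = sigmoid_grad.max()

# Backpropagation multiplies one activation derivative per layer. With sigmoids,
# that factor shrinks at least this fast with depth (weights ignored).
depth = 10
upper_bound = max_sigmoid_grad ** depth

print(f"Largest sigmoid derivative: {max_sigmoid_grad:.4f}")
print(f"Upper bound on the product over {depth} layers: {upper_bound:.2e}")
print(f"ReLU derivative values: {np.unique(relu_grad)}")
```

A ReLU unit that is active passes the gradient through unchanged, which is why deep ReLU networks tend to suffer less from this shrinkage. A unit whose input stays negative receives a zero gradient, which is the "dying ReLU" problem.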

## 3️⃣ Building a Neural Network with PyTorch

While building a perceptron from scratch with NumPy is insightful, it's not practical for building the large, complex networks used in modern AI. High-level frameworks like **PyTorch** and **TensorFlow** provide optimized, pre-built components that make it much easier to build, train, and deploy deep learning models.

### The `nn.Module` Class: The Heart of PyTorch Models

In PyTorch, every neural network you build should be a subclass of `nn.Module`. This base class provides a tremendous amount of functionality:
-   It tracks all the layers and sub-modules of your network.
-   It manages the model's parameters (`weights` and `biases`), making it easy to access them and pass them to an optimizer.
-   It provides helper methods for moving the model to different devices (like a GPU), saving and loading the model, and switching between training and evaluation modes.
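
The following sketch illustrates these three points on a throwaway model. `TinyNet` and its layer sizes are hypothetical names invented for this example; only standard `nn.Module` methods are used.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Hypothetical two-layer model used only to illustrate nn.Module features."""
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(3, 4)
        self.output = nn.Linear(4, 1)

    def forward(self, x):
        return self.output(torch.relu(self.hidden(x)))

net = TinyNet()

# 1. Layers assigned as attributes are tracked automatically, along with their parameters.
param_shapes = {name: tuple(p.shape) for name, p in net.named_parameters()}
print(param_shapes)

# 2. Switching between evaluation and training mode flips the `training` flag.
net.eval()
is_training_after_eval = net.training
net.train()

# 3. `state_dict()` exposes all parameters for saving; `load_state_dict()` restores them.
state = net.state_dict()
restored = TinyNet()
restored.load_state_dict(state)

# 4. `.to(...)` moves every parameter to the requested device.
net = net.to("cpu")
```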

### A Simple Multi-Layer Perceptron (MLP)

Let's build a simple network with one hidden layer to classify data. This is often called a **Multi-Layer Perceptron (MLP)**. We will define the layers in the `__init__` method and specify how data flows through them in the `forward` method.

In [None]:
# --- Defining a Neural Network using PyTorch's nn.Module ---
class SimpleMLP(nn.Module):
    """A simple Multi-Layer Perceptron with one hidden layer, built using PyTorch."""
    
    def __init__(self, input_size: int, hidden_size: int, output_size: int):
        """
        Defines the architecture of the network by specifying its layers.
        
        Args:
            input_size (int): The number of features in the input data (e.g., 2 for the XOR problem).
            hidden_size (int): The number of neurons in the hidden layer. This is a hyperparameter you can tune.
            output_size (int): The number of output neurons (e.g., 1 for binary classification).
        """
        # The parent `nn.Module` must be initialized first so that PyTorch can register the layers below.
        super().__init__()
        
        # `nn.Sequential` is a container that holds a sequence of layers.
        # Data passed to it will flow through each layer in the defined order.
        self.layers = nn.Sequential(
            # 1. First Linear Layer: Maps from the input layer to the hidden layer.
            # `nn.Linear(in_features, out_features)` creates a fully connected layer.
            nn.Linear(input_size, hidden_size),
            
            # 2. Activation Function: Introduce non-linearity after the first linear layer.
            nn.ReLU(),
            
            # 3. Second Linear Layer: Maps from the hidden layer to the output layer.
            nn.Linear(hidden_size, output_size),
            
            # 4. Output Activation: Sigmoid squashes the output into the (0, 1) range so it can be read as a probability.
            nn.Sigmoid()
        )
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Defines the forward pass of the network. This method is automatically called
        when you pass data to an instance of the model (e.g., `model(input_data)`).
        
        Args:
            x (torch.Tensor): The input tensor.
            
        Returns:
            torch.Tensor: The output tensor (predictions).
        """
        return self.layers(x)

# --- Instantiate and Inspect the Model ---
# Create an instance of our network.
# It will take 2 input features, has a hidden layer with 8 neurons, and produces 1 output.
model = SimpleMLP(input_size=2, hidden_size=8, output_size=1)

print("🧠 Neural Network Architecture (SimpleMLP)")
print("=" * 50)
# Printing the model object gives a nice summary of its architecture.
print(model)

# --- Count the Learnable Parameters ---
# This is a good practice to understand the complexity (and memory footprint) of your model.
# `p.numel()` returns the number of elements in a tensor.
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"\nTotal learnable parameters: {total_params}")
# Calculation:
# Layer 1 weights: 2 * 8 = 16
# Layer 1 bias: 8
# Layer 2 weights: 8 * 1 = 8
# Layer 2 bias: 1
# Total: 16 + 8 + 8 + 1 = 33

# --- Test the Forward Pass with a Dummy Input ---
# Create a dummy input tensor of shape (batch_size, num_features)
test_input = torch.tensor([[0.5, -1.2]], dtype=torch.float32)
# Pass the input through the model to get an output.
output = model(test_input)
print(f"\nTest input: {test_input.numpy()}")
print(f"Model output (probability): {output.item():.4f}")
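
The parameter count worked out in the comments above follows a simple rule: each fully connected layer contributes `in_features * out_features` weights plus `out_features` biases. The helper below is a hypothetical convenience function, not part of PyTorch, that applies this rule to a list of layer sizes.

```python
def count_mlp_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    `layer_sizes` lists the width of every layer, input first and output last,
    e.g. [2, 8, 1] for 2 inputs, 8 hidden neurons, and 1 output.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out  # weight matrix
        total += n_out         # bias vector
    return total

print(count_mlp_parameters([2, 8, 1]))       # 33, matching SimpleMLP above
print(count_mlp_parameters([2, 16, 16, 1]))  # 337
```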

## 4️⃣ Training a Neural Network: The XOR Problem

The **XOR (exclusive OR)** problem is a classic "Hello, World!" for neural networks. It's a binary classification task that holds historical significance because it demonstrated the limitations of single-layer perceptrons and the necessity of multi-layer networks.

The core issue is that the XOR data is **not linearly separable**. You cannot draw a single straight line to separate the two classes (the 0s and the 1s). This makes it impossible for a single perceptron to solve. However, a multi-layer network with a non-linear activation function can learn the complex, non-linear decision boundary required to solve it perfectly.

### The XOR Logic Table:
- `0 XOR 0 = 0`
- `0 XOR 1 = 1`
- `1 XOR 0 = 1`
- `1 XOR 1 = 0`
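
Before training anything, it helps to see that a tiny two-layer network with a ReLU can represent XOR exactly. The weights below are chosen by hand for illustration; a trained network will generally find different values.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Hidden layer with two units, both fed by x1 + x2:
#   h1 = relu(x1 + x2)       counts how many inputs are on
#   h2 = relu(x1 + x2 - 1)   is non-zero only when both inputs are on
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])   # shape (inputs, hidden units)
b1 = np.array([0.0, -1.0])

# Output layer: h1 - 2 * h2 cancels the (1, 1) case back down to 0.
w2 = np.array([1.0, -2.0])

hidden = relu(X @ W1 + b1)
output = hidden @ w2

for inputs, value in zip(X, output):
    print(f"{inputs} -> {value:.0f}")
```

Without the ReLU, `h2` would be a linear function of the inputs and the whole expression would reduce to a single linear rule, which cannot produce the XOR pattern.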

In [None]:
# --- XOR Dataset Definition ---
# Input features for the XOR problem, represented as a PyTorch tensor.
X_xor = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=torch.float32)
# Corresponding target labels for the XOR problem.
y_xor = torch.tensor([[0], [1], [1], [0]], dtype=torch.float32)

print("⚡ The XOR Problem Dataset")
print("=" * 30)
print(" Input (X)  | Output (y)")
print("------------|------------")
for i in range(len(X_xor)):
    print(f" {X_xor[i].numpy()} |     {int(y_xor[i].item())}")
print("=" * 30)

# --- Visualize the XOR Data ---
plt.figure(figsize=(6, 5))
# Plot points belonging to Class 0 (y=0) in red.
plt.scatter(X_xor[y_xor.squeeze() == 0, 0], X_xor[y_xor.squeeze() == 0, 1], c='red', s=200, label='Class 0', edgecolors='k', alpha=0.8)
# Plot points belonging to Class 1 (y=1) in blue.
plt.scatter(X_xor[y_xor.squeeze() == 1, 0], X_xor[y_xor.squeeze() == 1, 1], c='blue', s=200, label='Class 1', edgecolors='k', alpha=0.8)

plt.title('The XOR Problem (Not Linearly Separable)', fontweight='bold')
plt.xlabel('Input Feature 1')
plt.ylabel('Input Feature 2')
plt.xticks([0, 1])
plt.yticks([0, 1])
plt.legend()
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()

In [None]:
# --- Model, Loss, and Optimizer Setup ---
# 1. Instantiate the model with a hidden layer of 8 neurons.
model_xor = SimpleMLP(input_size=2, hidden_size=8, output_size=1)

# 2. Define the Loss Function. Binary Cross-Entropy (BCE) is the standard choice for
#    binary classification problems. It measures the difference between the predicted
#    probability and the actual label.
criterion = nn.BCELoss()

# 3. Define the Optimizer. The optimizer's job is to update the model's parameters
#    based on the gradients computed during backpropagation. Adam is a popular,
#    robust, and effective choice that often works well with default settings.
#    We pass the model's parameters (`model_xor.parameters()`) to the optimizer
#    so it knows which tensors to update. `lr` is the learning rate.
optimizer = optim.Adam(model_xor.parameters(), lr=0.1)

# --- The Training Loop ---
# This is the core of the training process. We will iterate over the data multiple times (epochs).
epochs = 200
losses = []

print("\n🔄 Training the network to solve the XOR problem...")
print("=" * 50)

for epoch in range(epochs):
    # --- The 5 Core Steps of a Training Iteration ---
    
    # 1. Forward Pass: Pass the input data through the model to get predictions.
    y_pred = model_xor(X_xor)
    
    # 2. Compute Loss: Compare the model's predictions (y_pred) with the true labels (y_xor)
    #    using the defined loss function.
    loss = criterion(y_pred, y_xor)
    losses.append(loss.item())
    
    # 3. Zero Gradients: Before the backward pass, we must clear the gradients from the
    #    previous iteration. If we don't, gradients will accumulate.
    optimizer.zero_grad()
    
    # 4. Backward Pass (Backpropagation): This is where PyTorch automatically computes the
    #    gradient of the loss with respect to each of the model's parameters.
    loss.backward()
    
    # 5. Update Weights: The optimizer uses the computed gradients to update the model's
    #    weights and biases in the direction that minimizes the loss.
    optimizer.step()
    
    # Print progress periodically to monitor training.
    if (epoch + 1) % 20 == 0:
        print(f"Epoch [{epoch+1: >3}/{epochs}], Loss: {loss.item():.4f}")

print("\n✅ Training Complete!")

# --- Evaluate the Trained Model ---
# After training, we evaluate the model's performance on the same data.
# `torch.no_grad()` is a context manager that disables gradient calculation.
# This is important for inference as it reduces memory consumption and speeds up computation.
with torch.no_grad():
    predictions = model_xor(X_xor)
    # Convert the output probabilities to binary class predictions (0 or 1).
    predicted_classes = (predictions >= 0.5).float()

print("\n🎯 Final Results:")
print("=" * 50)
print(" Input    | True | Predicted | Probability")
print("----------|------|-----------|-------------")
for i in range(len(X_xor)):
    print(f" {X_xor[i].numpy()} |  {int(y_xor[i].item())}   |     {int(predicted_classes[i].item())}     |   {predictions[i].item():.4f}")

# Calculate the final accuracy of the model.
accuracy = (predicted_classes == y_xor).float().mean()
print(f"\nFinal Accuracy: {accuracy.item() * 100:.2f}%")
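
The binary cross-entropy used as `criterion` above can be reproduced by hand. The sketch below is a plain-Python version that averages over samples, as `nn.BCELoss` does by default; the probability values are made up for illustration.

```python
import math

def binary_cross_entropy(probs, targets):
    """Mean of -[y * log(p) + (1 - y) * log(1 - p)] over all samples."""
    total = 0.0
    for p, y in zip(probs, targets):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)

targets = [0, 1, 1, 0]                    # the XOR labels
uncertain = [0.5, 0.5, 0.5, 0.5]          # a model that has learned nothing
confident = [0.05, 0.95, 0.95, 0.05]      # a model that is confidently right

loss_uncertain = binary_cross_entropy(uncertain, targets)
loss_confident = binary_cross_entropy(confident, targets)

print(f"Loss when every prediction is 0.5: {loss_uncertain:.4f}")
print(f"Loss when predictions are confident and correct: {loss_confident:.4f}")
```

A loss near `ln(2) ≈ 0.693` therefore means the model is doing no better than guessing 0.5 for every input, which is a useful reference point when reading the training log.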

In [None]:
# --- Plotting the Training Loss ---
# Visualizing the loss is a crucial step in diagnosing the training process.
# A decreasing loss curve indicates that the model is learning.
plt.figure(figsize=(10, 5))
plt.plot(losses, color='purple', lw=2)
plt.title('Training Loss Over Epochs', fontweight='bold')
plt.xlabel('Epoch')
plt.ylabel('BCE Loss')
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()

## 5️⃣ Real-World Example: Manufacturing Defect Classification

Now, let's apply these concepts to a more realistic problem. Imagine we are building a **Manufacturing Copilot**, an AI assistant designed to monitor a production line in real-time. A key feature of this copilot is its ability to predict whether a manufactured part is defective based on sensor readings, allowing for immediate quality control.

### The Dataset
We'll use a synthetic dataset that mimics this scenario. Each data point represents a manufactured part, and the features are two sensor readings:
-   `temperature_deviation`: How much the part's temperature varied from the optimal setting during production.
-   `pressure_deviation`: How much the pressure varied from the optimal setting.

The data is intentionally designed to be **non-linearly separable**, forming two concentric circles.
-   The **inner circle** (Class 0) represents non-defective parts.
-   The **outer ring** (Class 1) represents defective parts.

This is a classic "toy" problem that is more complex than XOR and effectively mimics real-world scenarios where simple linear models would fail completely, but a neural network can excel.
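
To see why a straight line cannot separate concentric circles, the sketch below builds a small noise-free version of the data with NumPy and compares two rules. The radii, the sample count, and both thresholds are assumptions chosen for illustration.

```python
import numpy as np

# Noise-free toy version: 100 points on an inner circle (class 0, radius 0.5)
# and 100 points on an outer circle (class 1, radius 1.0).
angles = np.linspace(0, 2 * np.pi, 100, endpoint=False)
circle = np.column_stack([np.cos(angles), np.sin(angles)])
X_demo = np.vstack([0.5 * circle, 1.0 * circle])
y_demo = np.array([0] * 100 + [1] * 100)

# A linear rule: predict "defective" whenever the first feature is positive.
linear_pred = (X_demo[:, 0] > 0).astype(int)
linear_acc = (linear_pred == y_demo).mean()

# A non-linear rule: threshold the squared distance from the origin (r = 0.75).
radius_sq = (X_demo ** 2).sum(axis=1)
radial_pred = (radius_sq > 0.75 ** 2).astype(int)
radial_acc = (radial_pred == y_demo).mean()

print(f"Linear rule accuracy: {linear_acc:.2f}")
print(f"Radial rule accuracy: {radial_acc:.2f}")
```

The class depends on the distance from the origin, which is a non-linear function of the two features. A neural network has to learn a comparable transformation in its hidden layers.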

In [None]:
# --- Generate Synthetic Manufacturing Data ---
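# `make_circles` from scikit-learn generates two concentric circles, a dataset
# whose classes are not linearly separable.
X, y = make_circles(n_samples=1000, noise=0.1, factor=0.5, random_state=SEED)
# `make_circles` labels the outer circle 0 and the inner circle 1. Flip the labels so that
# the outer ring is the defective class (1), as described above.
y = 1 - y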

# Using a Pandas DataFrame provides better context and makes visualization easier.
df = pd.DataFrame(X, columns=['temperature_deviation', 'pressure_deviation'])
df['is_defective'] = y

print("🏭 Synthetic Manufacturing Dataset")
print("=" * 50)
print(f"Total samples: {len(df)}")
print(f"Defective parts (Class 1): {df['is_defective'].sum()}")
print(f"Non-defective parts (Class 0): {len(df) - df['is_defective'].sum()}")
print(f"Defect rate: {df['is_defective'].mean():.1%}")
print("-" * 50)
print("Sample Data:")
print(df.head())

# --- Visualize the Data ---
plt.figure(figsize=(8, 7))
sns.scatterplot(
    data=df,
    x='temperature_deviation',
    y='pressure_deviation',
    hue='is_defective',
    palette={0: 'green', 1: 'red'},
    style='is_defective',
    markers={0: 'o', 1: 'X'},
    s=100,
    alpha=0.8
)
plt.title('Manufacturing Data: Temperature vs. Pressure Deviations', fontweight='bold')
plt.xlabel('Temperature Deviation (Normalized)')
plt.ylabel('Pressure Deviation (Normalized)')
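# Relabel seaborn's legend entries in place so each label stays paired with its marker.
legend = plt.gca().get_legend()
legend.set_title('Status')
for text, label in zip(legend.get_texts(), ['Non-defective', 'Defective']):
    text.set_text(label)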
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()

In [None]:
# --- Step 1: Train-Test Split ---
# We split the data into a training set (for model training) and a testing set (for unbiased evaluation).
# `test_size=0.2` means 20% of the data will be reserved for testing.
# `stratify=y` is crucial for classification problems, especially with imbalanced datasets. It ensures
# that the proportion of each class is the same in both the training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED, stratify=y)

# --- Step 2: Feature Scaling ---
# Neural networks are sensitive to the scale of input features. Features with larger ranges can
# dominate the learning process. `StandardScaler` standardizes features by removing the mean
# and scaling to unit variance.
scaler = StandardScaler()
# We `fit_transform` on the training data to learn the scaling parameters (mean and std).
X_train_scaled = scaler.fit_transform(X_train)
# We only `transform` the test data using the *same* scaler fitted on the training data.
# This prevents data leakage from the test set into the training process.
X_test_scaled = scaler.transform(X_test)

# --- Step 3: Convert to PyTorch Tensors ---
# PyTorch models operate on tensors. We convert our NumPy arrays to PyTorch tensors.
# We also move the data to the configured device (CPU or GPU) to enable hardware acceleration.
X_train_tensor = torch.tensor(X_train_scaled, dtype=torch.float32, device=device)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32, device=device).unsqueeze(1) # `unsqueeze(1)` gives shape (N, 1) to match the model output
X_test_tensor = torch.tensor(X_test_scaled, dtype=torch.float32, device=device)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32, device=device).unsqueeze(1)

print("✅ Data Preprocessing Complete")
print(f"Training samples: {len(X_train_tensor)}")
print(f"Testing samples: {len(X_test_tensor)}")
print(f"Data tensors are on device: '{X_train_tensor.device}'")
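
What `StandardScaler` does in Step 2 can be written out with NumPy. The sketch below uses made-up numbers and shows the key point: the mean and standard deviation come from the training data only and are then reused for the test data.

```python
import numpy as np

# Made-up training and test data with two features on very different scales.
train = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

# "fit": learn the statistics from the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)   # population standard deviation (ddof=0), as StandardScaler uses

# "transform": apply the same statistics to both sets.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print("Scaled training data:\n", train_scaled)
print("Scaled test data:\n", test_scaled)
```

The scaled training features have zero mean and unit variance. The test row does not, and is not expected to, because it is measured against the training statistics.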

In [None]:
# --- A Deeper Network for a More Complex Problem ---
# For this non-linear problem, a single hidden layer might not be enough.
# We'll build a slightly deeper network with two hidden layers to give it more
# capacity to learn the complex circular decision boundary.
class ManufacturingClassifier(nn.Module):
    """A neural network designed to classify manufacturing defects based on sensor data."""
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            # Layer 1: Input (2 features) to Hidden Layer 1 (16 neurons)
            nn.Linear(2, 16),
            nn.ReLU(),
            
            # Layer 2: Hidden Layer 1 (16 neurons) to Hidden Layer 2 (16 neurons)
            nn.Linear(16, 16),
            nn.ReLU(),
            
            # Layer 3: Hidden Layer 2 (16 neurons) to Output Layer (1 neuron)
            nn.Linear(16, 1),
            nn.Sigmoid() # Sigmoid for binary classification output
        )
    
    def forward(self, x):
        return self.layers(x)

# --- Instantiate the Model and Move to the Configured Device ---
model_manufacturing = ManufacturingClassifier().to(device)

print("🧠 Manufacturing Defect Classifier Architecture")
print("=" * 50)
print(model_manufacturing)

# Count the learnable parameters to understand model complexity.
total_params = sum(p.numel() for p in model_manufacturing.parameters() if p.requires_grad)
print(f"\nTotal learnable parameters: {total_params}")

In [None]:
# --- Training Setup ---
# Use the same loss function and optimizer type as before, with a smaller learning rate (0.01).
criterion = nn.BCELoss()
optimizer = optim.Adam(model_manufacturing.parameters(), lr=0.01)

# --- Full Training and Evaluation Loop ---
epochs = 500
train_losses, test_losses = [], []
train_accuracies, test_accuracies = [], []

print("\n🔄 Training Manufacturing Defect Classifier...")
print("=" * 70)

for epoch in range(epochs):
    # --- Training Phase ---
    model_manufacturing.train() # Set training mode. This only affects layers such as dropout or batch norm (none here), but it is a good habit.
    
    # Forward pass: get predictions on the training data.
    y_train_pred = model_manufacturing(X_train_tensor)
    
    # Calculate loss.
    train_loss = criterion(y_train_pred, y_train_tensor)
    
    # Backward pass and optimization.
    optimizer.zero_grad()
    train_loss.backward()
    optimizer.step()
    
    # --- Evaluation Phase ---
    # It's good practice to evaluate the model on both training and test data at each epoch
    # to monitor for overfitting.
    model_manufacturing.eval() # Evaluation mode: disables layers like dropout (this model has none, but it is a good habit).
    with torch.no_grad(): # Disable gradient computation for evaluation to save memory and speed up.
        # Training accuracy, computed from the predictions made *before* this epoch's weight update.
        train_preds_class = (y_train_pred >= 0.5).float()
        train_acc = (train_preds_class == y_train_tensor).float().mean().item()
        
        # Make predictions on the test data
        y_test_pred = model_manufacturing(X_test_tensor)
        # Calculate test loss
        test_loss = criterion(y_test_pred, y_test_tensor)
        # Calculate test accuracy
        test_preds_class = (y_test_pred >= 0.5).float()
        test_acc = (test_preds_class == y_test_tensor).float().mean().item()
    
    # Record metrics for this epoch to plot later.
    train_losses.append(train_loss.item())
    test_losses.append(test_loss.item())
    train_accuracies.append(train_acc)
    test_accuracies.append(test_acc)
    
    # Print progress periodically.
    if (epoch + 1) % 50 == 0:
        print(f"Epoch [{epoch+1: >3}/{epochs}] | "
              f"Train Loss: {train_loss.item():.4f}, Train Acc: {train_acc:.2%} | "
              f"Test Loss: {test_loss.item():.4f}, Test Acc: {test_acc:.2%}")

print("\n✅ Training Complete!")
print(f"Final Test Accuracy: {test_accuracies[-1]:.2%}")
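
For reference, the binary cross-entropy that `nn.BCELoss` reports (with its default mean reduction) is the average of `-(y * log(p) + (1 - y) * log(1 - p))` over the batch. The sketch below reproduces that formula in plain Python on made-up values. The probabilities and labels are illustrative only, not taken from the manufacturing dataset, and the helper assumes every probability lies strictly between 0 and 1.

```python
import math

def bce_loss(probs, targets):
    """Mean binary cross-entropy over a batch of predicted probabilities."""
    total = 0.0
    for p, y in zip(probs, targets):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)

# Two samples with true labels [1, 0].
confident_right = bce_loss([0.9, 0.1], [1, 0])  # predictions agree with the labels
confident_wrong = bce_loss([0.1, 0.9], [1, 0])  # predictions contradict the labels

print(f"{confident_right:.4f}")  # 0.1054
print(f"{confident_wrong:.4f}")  # 2.3026
```

Confident wrong predictions are penalized far more heavily than confident correct ones are rewarded, which is what drives the loss curve down as the classifier improves.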

In [None]:
# --- Plotting Training and Evaluation Metrics ---
# Visualizing metrics is essential for understanding the training process.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: Loss vs. Epochs
# We expect both training and test loss to decrease over time.
# A widening gap, with test loss flat or rising while training loss keeps falling, is a typical sign of overfitting.
ax1.plot(train_losses, label='Training Loss', color='blue', lw=2)
ax1.plot(test_losses, label='Test Loss', color='orange', linestyle='--', lw=2)
ax1.set_title('Loss Over Epochs', fontweight='bold')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('BCE Loss')
ax1.legend()
ax1.grid(True, linestyle='--', alpha=0.6)

# Plot 2: Accuracy vs. Epochs
# We expect both accuracies to increase over time.
ax2.plot(train_accuracies, label='Training Accuracy', color='blue', lw=2)
ax2.plot(test_accuracies, label='Test Accuracy', color='orange', linestyle='--', lw=2)
ax2.set_title('Accuracy Over Epochs', fontweight='bold')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Accuracy')
ax2.legend()
ax2.grid(True, linestyle='--', alpha=0.6)

plt.tight_layout()
plt.show()
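
Beyond eyeballing the curves, one simple check is to find the epoch at which test loss was lowest and compare the final train and test losses. The sketch below is a hypothetical helper (`best_epoch_and_gap` is our name) run on made-up loss values; with the real run you would pass in `train_losses` and `test_losses` instead.

```python
def best_epoch_and_gap(train_losses, test_losses):
    """Return the index of the lowest test loss and the final test-minus-train gap."""
    best_epoch = min(range(len(test_losses)), key=test_losses.__getitem__)
    final_gap = test_losses[-1] - train_losses[-1]
    return best_epoch, final_gap

# Made-up curves: training loss keeps falling, test loss turns upward after index 3.
toy_train = [0.70, 0.50, 0.35, 0.25, 0.18, 0.12]
toy_test = [0.72, 0.55, 0.45, 0.42, 0.46, 0.53]

best_epoch, gap = best_epoch_and_gap(toy_train, toy_test)
print(best_epoch)     # 3
print(round(gap, 2))  # 0.41
```

A best epoch well before the end of training, combined with a large final gap, suggests the model kept fitting noise in the training set after it stopped improving on unseen data.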

## 6️⃣ Visualizing the Decision Boundary

A **decision boundary** is a hypersurface that partitions the underlying vector space into two or more sets, one for each class. In simpler terms, it's the line or surface that the model "learns" to separate the different classes.

Visualizing the decision boundary is an excellent way to understand *how* the model is making its decisions and to intuitively assess its performance. For our non-linear, circular dataset, we hope to see a circular or near-circular boundary that correctly separates the green "non-defective" points from the red "defective" points. A linear model would only be able to draw a straight line, which would fail miserably at this task.
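
To make the idea concrete, the sketch below uses a hand-written stand-in for a trained model: a hypothetical `toy_probability` function (not the network above) whose predicted defect probability rises with distance from the origin. Treating far-away points as defective and using a radius of 1.0 are assumptions made purely for illustration. Thresholding the probability at 0.5 yields a circular decision boundary, the kind of shape we hope the real network approximates.

```python
import math

RADIUS = 1.0  # assumed radius of the toy boundary

def toy_probability(x1, x2, sharpness=5.0):
    """Sigmoid of (distance from origin - RADIUS); equals 0.5 exactly on the circle."""
    distance = math.hypot(x1, x2)
    return 1.0 / (1.0 + math.exp(-sharpness * (distance - RADIUS)))

def predict_class(x1, x2, threshold=0.5):
    """Class 1 when the predicted probability reaches the threshold, else class 0."""
    return int(toy_probability(x1, x2) >= threshold)

print(predict_class(0.2, 0.1))     # 0 (inside the circle)
print(predict_class(1.5, -1.0))    # 1 (outside the circle)
print(toy_probability(1.0, 0.0))   # 0.5 (on the boundary)
```

The decision boundary is simply the set of points where the predicted probability equals the threshold, which is what the contour plot below visualizes for the trained network.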
</VSCode.Cell>
<VSCode.Cell id="#VSC-f813091b" language="python">
def plot_decision_boundary(model, X, y, scaler):
    """
    Plots the decision boundary of a trained PyTorch model on a 2D dataset.
    
    Args:
        model (nn.Module): The trained PyTorch model.
        X (torch.Tensor): The input features (unscaled).
        y (torch.Tensor): The true labels.
        scaler (StandardScaler): The scaler used to transform the training data.
    """
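    # Move the model to the CPU and convert the data to NumPy arrays, which Matplotlib and Seaborn expect.
    # Note: the model stays on the CPU after this function returns.
    model.to('cpu')
    X, y = X.detach().cpu().numpy(), y.detach().cpu().numpy()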
    
    # 1. Create a mesh grid of points to cover the entire feature space.
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                         np.arange(y_min, y_max, 0.02))
    
    # 2. Predict the class for each point in the mesh grid.
    # `np.c_` concatenates the flattened xx and yy arrays to create a list of points.
    mesh_data = np.c_[xx.ravel(), yy.ravel()]
    # Important: We must scale this new mesh data using the *same scaler* that was fitted on the training data.
    mesh_data_scaled = scaler.transform(mesh_data)
    mesh_tensor = torch.FloatTensor(mesh_data_scaled)
    
    model.eval()
    with torch.no_grad():
        # Get the model's predictions for every point on the grid.
        Z = model(mesh_tensor).numpy()
    Z = Z.reshape(xx.shape)
    
    # 3. Plot the contour and the actual data points.
    plt.figure(figsize=(10, 8))
    # `contourf` creates a filled contour plot. This will color the background
    # based on the model's predicted probability for each point.
    # The reversed colormap ('RdYlGn_r') makes a high defect probability red,
    # matching the red markers used for defective points.
    contour = plt.contourf(xx, yy, Z, cmap='RdYlGn_r', alpha=0.7)
    plt.colorbar(contour, label='Probability of being Defective (Class 1)')
    
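    # Overlay the actual data points from the test set to see how well the model did.
    # Map the numeric labels to class names so the legend entries always match the plotted classes.
    class_names = np.where(np.asarray(y).squeeze() == 1, 'Defective', 'Non-defective')
    sns.scatterplot(
        x=X[:, 0], 
        y=X[:, 1], 
        hue=class_names,
        hue_order=['Non-defective', 'Defective'],
        palette={'Non-defective': 'green', 'Defective': 'red'},
        style=class_names,
        style_order=['Non-defective', 'Defective'],
        markers={'Non-defective': 'o', 'Defective': 'X'},
        s=100,
        edgecolor='k'
    )
    
    plt.title('Neural Network Decision Boundary on Test Data', fontweight='bold')
    plt.xlabel('Temperature Deviation (Normalized)')
    plt.ylabel('Pressure Deviation (Normalized)')
    plt.legend(title='Actual Class')
    plt.show()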

# --- Visualize the boundary on the test set ---
# We pass the unscaled test data for plotting, as the scaling is handled inside the function.
plot_decision_boundary(model_manufacturing, torch.FloatTensor(X_test), y_test_tensor.cpu(), scaler)
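
Why must the mesh grid go through the *same* scaler? Standardization subtracts the training mean and divides by the training standard deviation, so new points only land in the coordinate system the model was trained on if those same statistics are reused. Here is a minimal plain-Python sketch of that idea for one made-up feature column (the values and the `scale` helper are illustrative; `StandardScaler` does this per feature using the population standard deviation):

```python
import statistics

# Made-up training values for a single feature.
train_values = [10.0, 12.0, 14.0, 16.0]

# "Fit": statistics are computed from the training data only.
mean = statistics.fmean(train_values)
std = statistics.pstdev(train_values)  # population standard deviation

def scale(value):
    """Transform a new value with the statistics learned from the training data."""
    return (value - mean) / std

# "Transform": new grid points reuse the training statistics, they are never refitted.
grid_values = [9.0, 13.0, 17.0]
scaled = [scale(v) for v in grid_values]
print([round(v, 3) for v in scaled])  # [-1.789, 0.0, 1.789]
```

Refitting a scaler on the grid itself would shift and stretch the points differently, and the plotted boundary would no longer correspond to what the model actually learned.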

## 🎉 Summary & Key Takeaways

Congratulations! You have successfully built, trained, and evaluated your first neural networks using PyTorch. You've progressed from a theoretical understanding to practical implementation, tackling both a classic computer science problem (XOR) and a more realistic, industry-inspired example (manufacturing defect detection).

### Core Concepts You've Mastered:
-   **Neural Network Architecture**: You now understand how layers, weights, biases, and activation functions are combined to create a complete network. You appreciate the critical role of non-linear activations.
-   **The Training Trinity**: You have seen firsthand how the **loss function** (measuring error), the **optimizer** (updating weights), and **backpropagation** (calculating gradients) work together in a cycle to enable a model to learn from data.
-   **PyTorch Fundamentals**: You are now equipped to define a custom neural network using `nn.Module`, set up a complete training loop, and use your model to make predictions on new data.
-   **Non-Linearity is Key**: You've witnessed how multi-layer networks with non-linear activations (like ReLU) can learn complex, curved decision boundaries to solve problems that are impossible for simpler linear models.
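
As a compact recap of the training cycle, here is a plain-Python sketch of a single sigmoid neuron trained with gradient descent on a tiny made-up dataset. It is a simplified stand-in for what `criterion`, `loss.backward()`, and `optimizer.step()` did for us: the gradients are written out by hand, and it uses plain gradient descent rather than Adam. The data, learning rate, and step count are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy data: the label is 1 when x is positive.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

def mean_bce(w, b):
    """Loss function: mean binary cross-entropy of the neuron on the toy data."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(xs)

w, b, lr = 0.0, 0.0, 0.5
initial_loss = mean_bce(w, b)

for _ in range(100):
    # Forward pass: predictions with the current weights.
    preds = [sigmoid(w * x + b) for x in xs]
    # Backward pass: for sigmoid + BCE, dLoss/dz = p - y, so the gradients are:
    grad_w = sum((p - y) * x for p, x, y in zip(preds, xs, ys)) / len(xs)
    grad_b = sum(p - y for p, y in zip(preds, ys)) / len(xs)
    # Optimizer step: move the parameters against the gradient.
    w -= lr * grad_w
    b -= lr * grad_b

final_loss = mean_bce(w, b)
predicted = [int(sigmoid(w * x + b) >= 0.5) for x in xs]
print(f"loss: {initial_loss:.4f} -> {final_loss:.4f}")
print(predicted)  # [0, 0, 1, 1]
```

The loss measures the error, the gradients say which direction reduces it, and the update applies that direction; PyTorch automates the middle step for networks of any depth.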

### What You Built:
1.  **🔵 A Perceptron from Scratch**: To deeply understand the fundamental mechanics of a single neuron.
2.  **🧠 A Multi-Layer Perceptron in PyTorch**: To solve the classic, non-linearly separable XOR problem and prove the power of multi-layer architectures.
3.  **🏭 A Manufacturing Defect Classifier**: A deeper, more practical network designed to solve a complex classification task, complete with data preprocessing, training/evaluation loops, and visualization of its learned decision boundary.

---

### 📚 Next Steps

With this solid grasp of fundamental neural network concepts, you are now fully prepared to explore the more specialized and powerful architectures that are used for specific data types like images and text.

<div align="center">
    <h3>Neural networks are no longer a black box! You're ready for the next challenge.</h3>
    <p>Proceed to <strong>Notebook 02: Convolutional Neural Networks (CNNs)</strong> to learn how to build models that can "see" and process images, a critical step towards multi-modal Generative AI.</p>
</div>