# Neural Network Components From Scratch

## Deep Learning Coding Interview Prep

Based on 2025 interview trends, companies are asking candidates to **implement neural network components from scratch** without relying on high-level libraries like Keras or PyTorch `nn.Module`.

### What You'll Implement:
1. **Linear (Dense) Layers** - Forward and backward pass
2. **Activation Functions** - ReLU, Sigmoid, Tanh, Softmax
3. **Loss Functions** - MSE, Binary Cross-Entropy, Cross-Entropy
4. **Optimizers** - SGD, Momentum, Adam
5. **Complete Neural Network** - End-to-end implementation

### Interview Companies Asking These:
- Google, Meta, Amazon
- Cisco (for ML Engineer roles)
- Startups with ML focus

**Key Rule:** Only NumPy allowed - no PyTorch `nn.Module` or TensorFlow `keras.layers`!

---

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from typing import Tuple, Optional

# For testing/comparison only
from sklearn.datasets import make_classification, make_moons
from sklearn.model_selection import train_test_split

## 1. Linear (Dense/Fully-Connected) Layer

**Time:** 20-25 minutes  
**Difficulty:** Medium  
**Commonly Asked:** Yes

The fundamental building block of neural networks.

### Forward Pass
$y = Wx + b$

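where:
- $W$: weight matrix $(n_{out}, n_{in})$
- $x$: input column vector $(n_{in},)$; the code below also accepts a batch $X$ of shape $(batch, n_{in})$ and computes $Y = WX^T + b$, which has shape $(n_{out}, batch)$
- $b$: bias vector $(n_{out},)$, broadcast across the batch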

### Backward Pass

Given $\frac{\partial L}{\partial y}$ (gradient from next layer), compute:

1. $\frac{\partial L}{\partial W} = \frac{\partial L}{\partial y} \cdot x^T$
2. $\frac{\partial L}{\partial b} = \frac{\partial L}{\partial y}$ (summed over the batch dimension when the input is a batch)
3. $\frac{\partial L}{\partial x} = W^T \cdot \frac{\partial L}{\partial y}$ (pass to previous layer)
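
A quick way to convince yourself (and an interviewer) that these three formulas are right is to compare them against finite differences. The sketch below is only an illustration: it assumes a batch stored as rows of `X`, and a toy loss `sum(G * Y)` with a fixed random `G` so that $\frac{\partial L}{\partial Y} = G$ exactly. The names `G`, `loss`, and `numerical_grad` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, batch = 4, 3, 5

W = rng.normal(size=(n_out, n_in))
b = rng.normal(size=(n_out, 1))
X = rng.normal(size=(batch, n_in))    # one sample per row
G = rng.normal(size=(n_out, batch))   # stands in for dL/dY

def loss(W, b, X):
    # Toy scalar loss chosen so that dL/dY equals G exactly
    Y = W @ X.T + b
    return np.sum(G * Y)

# Analytical gradients from the three formulas, in batched form
dW = G @ X                            # sum over the batch of (dL/dy) x^T
db = G.sum(axis=1, keepdims=True)     # sum over the batch of dL/dy
dX = (W.T @ G).T                      # W^T (dL/dy), one row per sample

def numerical_grad(f, A, eps=1e-6):
    # Central differences, perturbing one entry of A at a time (in place)
    grad = np.zeros_like(A)
    for idx in np.ndindex(A.shape):
        old = A[idx]
        A[idx] = old + eps
        f_plus = f()
        A[idx] = old - eps
        f_minus = f()
        A[idx] = old
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

f = lambda: loss(W, b, X)
err_W = np.max(np.abs(dW - numerical_grad(f, W)))
err_b = np.max(np.abs(db - numerical_grad(f, b)))
err_X = np.max(np.abs(dX - numerical_grad(f, X)))
print(f"max abs error: dW={err_W:.2e}, db={err_b:.2e}, dX={err_X:.2e}")
```

All three errors should be tiny (limited only by floating-point rounding), because the toy loss is linear in each argument.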

---

In [None]:
class Linear:
    """
    Fully connected (dense) layer
    
    y = Wx + b
    """
    
    def __init__(self, in_features: int, out_features: int):
        """
        Args:
            in_features: Number of input features
            out_features: Number of output features
        """
        self.in_features = in_features
        self.out_features = out_features
        
        # Xavier/Glorot initialization
        limit = np.sqrt(6 / (in_features + out_features))
        self.W = np.random.uniform(-limit, limit, (out_features, in_features))
        self.b = np.zeros((out_features, 1))
        
        # For backward pass
        self.x = None
        self.dW = None
        self.db = None
    
    def forward(self, x: np.ndarray) -> np.ndarray:
        """
        Forward pass
        
        Args:
            x: Input (batch_size, in_features) or a single sample (in_features,)
        
        Returns:
            y: Output (out_features, batch_size)
        """
        # Treat a 1-D input as a batch containing one sample
        if x.ndim == 1:
            x = x.reshape(1, -1)
        
        self.x = x  # Cache (batch_size, in_features) for backward pass
        
        # y = W x^T + b, with b broadcast across the batch dimension
        return self.W @ x.T + self.b
    
    def backward(self, grad_output: np.ndarray) -> np.ndarray:
        """
        Backward pass
        
        Args:
            grad_output: Gradient from next layer (out_features, batch_size)
        
        Returns:
            grad_input: Gradient w.r.t. input (in_features, batch_size)
        """
        # Cached input as (batch_size, in_features)
        if self.x.ndim == 1:
            x_reshaped = self.x.reshape(1, -1)
        else:
            x_reshaped = self.x
        
        # Gradient w.r.t. weights: dL/dW = dL/dy @ x, summed over the batch.
        # No division by batch_size here: the loss functions already average
        # over the batch, so dividing again would shrink the gradient twice.
        self.dW = grad_output @ x_reshaped
        
        # Gradient w.r.t. bias: dL/db = dL/dy summed over the batch
        self.db = np.sum(grad_output, axis=1, keepdims=True)
        
        # Gradient w.r.t. input: dL/dx = W^T @ dL/dy
        grad_input = self.W.T @ grad_output
        
        return grad_input
    
    def update_params(self, learning_rate: float):
        """
        Update parameters using gradients
        """
        self.W -= learning_rate * self.dW
        self.b -= learning_rate * self.db
    
    def __repr__(self):
        return f"Linear(in_features={self.in_features}, out_features={self.out_features})"

# Test
layer = Linear(in_features=5, out_features=3)
x = np.random.randn(2, 5)  # Batch of 2, 5 features each

# Forward pass
y = layer.forward(x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {y.shape}")
print(f"Weights shape: {layer.W.shape}")
print(f"Bias shape: {layer.b.shape}")

# Backward pass
grad_output = np.random.randn(*y.shape)
grad_input = layer.backward(grad_output)
print(f"\nGradient w.r.t. input shape: {grad_input.shape}")
print(f"Gradient w.r.t. weights shape: {layer.dW.shape}")
print(f"Gradient w.r.t. bias shape: {layer.db.shape}")

---

## 2. Activation Functions

**Time:** 15-20 minutes per activation  
**Difficulty:** Easy-Medium

Common activation functions and their derivatives.

---

### 2.1 ReLU (Rectified Linear Unit)

In [None]:
class ReLU:
    """
    ReLU activation: f(x) = max(0, x)
    
    Derivative: f'(x) = 1 if x > 0, else 0
    """
    
    def __init__(self):
        self.x = None
    
    def forward(self, x: np.ndarray) -> np.ndarray:
        self.x = x
        return np.maximum(0, x)
    
    def backward(self, grad_output: np.ndarray) -> np.ndarray:
        """
        Gradient: dL/dx = dL/dy * 1 if x > 0, else 0
        """
        return grad_output * (self.x > 0).astype(float)
    
    def __repr__(self):
        return "ReLU()"

# Test
relu = ReLU()
x = np.array([-2, -1, 0, 1, 2])
y = relu.forward(x)
print(f"ReLU({x}) = {y}")

grad_output = np.ones_like(y)
grad_input = relu.backward(grad_output)
print(f"Gradient: {grad_input}")

### 2.2 Sigmoid

In [None]:
class Sigmoid:
    """
    Sigmoid activation: f(x) = 1 / (1 + e^(-x))
    
    Derivative: f'(x) = f(x) * (1 - f(x))
    """
    
    def __init__(self):
        self.y = None
    
    def forward(self, x: np.ndarray) -> np.ndarray:
        # Numerical stability: clip x to prevent overflow
        x_clipped = np.clip(x, -500, 500)
        self.y = 1 / (1 + np.exp(-x_clipped))
        return self.y
    
    def backward(self, grad_output: np.ndarray) -> np.ndarray:
        """
        Gradient: dL/dx = dL/dy * y * (1 - y)
        """
        return grad_output * self.y * (1 - self.y)
    
    def __repr__(self):
        return "Sigmoid()"

# Test
sigmoid = Sigmoid()
x = np.array([-2, -1, 0, 1, 2])
y = sigmoid.forward(x)
print(f"Sigmoid({x}) = {y}")

grad_output = np.ones_like(y)
grad_input = sigmoid.backward(grad_output)
print(f"Gradient: {grad_input}")

### 2.3 Tanh

In [None]:
class Tanh:
    """
    Tanh activation: f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
    
    Derivative: f'(x) = 1 - f(x)^2
    """
    
    def __init__(self):
        self.y = None
    
    def forward(self, x: np.ndarray) -> np.ndarray:
        self.y = np.tanh(x)
        return self.y
    
    def backward(self, grad_output: np.ndarray) -> np.ndarray:
        """
        Gradient: dL/dx = dL/dy * (1 - y^2)
        """
        return grad_output * (1 - self.y ** 2)
    
    def __repr__(self):
        return "Tanh()"

# Test
tanh = Tanh()
x = np.array([-2, -1, 0, 1, 2])
y = tanh.forward(x)
print(f"Tanh({x}) = {y}")

grad_output = np.ones_like(y)
grad_input = tanh.backward(grad_output)
print(f"Gradient: {grad_input}")

### 2.4 Softmax

In [None]:
class Softmax:
    """
    Softmax activation: f(x_i) = e^(x_i) / Σ(e^(x_j))
    
    Used for multi-class classification
    """
    
    def __init__(self):
        self.y = None
    
    def forward(self, x: np.ndarray) -> np.ndarray:
        """
        Numerically stable softmax
        """
        # Subtract max for numerical stability
        x_shifted = x - np.max(x, axis=0, keepdims=True)
        exp_x = np.exp(x_shifted)
        self.y = exp_x / np.sum(exp_x, axis=0, keepdims=True)
        return self.y
    
    def backward(self, grad_output: np.ndarray) -> np.ndarray:
        """
        Pass-through gradient for use with cross-entropy
        
        Assumes grad_output is already dL/dz = (y - t), the simplified
        softmax + cross-entropy gradient, where t is the true label (one-hot).
        It does NOT apply the general softmax Jacobian.
        """
        # The gradient is returned unchanged because the combined
        # softmax + cross-entropy loss has already differentiated through softmax
        return grad_output
    
    def __repr__(self):
        return "Softmax()"

# Test
softmax = Softmax()
x = np.array([[2.0, 1.0, 0.1]]).T  # 3 classes
y = softmax.forward(x)
print(f"Softmax input: {x.T}")
print(f"Softmax output: {y.T}")
print(f"Sum of probabilities: {y.sum()}")
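
The `backward` above only passes the gradient through, which is valid when softmax is fused with cross-entropy. If an interviewer asks for the general softmax gradient, one common approach is a Jacobian-vector product that avoids building the full Jacobian. The sketch below assumes the same one-column-per-sample layout; `softmax_forward` and `softmax_backward` are hypothetical helper names, not part of the class above.

```python
import numpy as np

def softmax_forward(x):
    # Numerically stable softmax over axis 0 (one column per sample)
    shifted = x - np.max(x, axis=0, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=0, keepdims=True)

def softmax_backward(y, grad_output):
    # Jacobian-vector product without forming the Jacobian:
    # dL/dx_i = y_i * (g_i - sum_j g_j * y_j), applied column by column
    dot = np.sum(grad_output * y, axis=0, keepdims=True)
    return y * (grad_output - dot)

x = np.array([[2.0], [1.0], [0.1]])
g = np.array([[1.0], [0.0], [0.0]])   # upstream gradient for L = y[0]
y = softmax_forward(x)
analytic = softmax_backward(y, g)

# Finite-difference check of dL/dx for L = sum(g * softmax(x))
eps = 1e-6
numeric = np.zeros_like(x)
for i in range(x.shape[0]):
    step = np.zeros_like(x)
    step[i] = eps
    numeric[i] = (np.sum(g * softmax_forward(x + step))
                  - np.sum(g * softmax_forward(x - step))) / (2 * eps)

max_err = np.max(np.abs(analytic - numeric))
print(f"max abs difference vs finite differences: {max_err:.2e}")
```

Because softmax outputs always sum to 1, the entries of this gradient sum to zero within each column, which is a handy sanity check.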

### Visualize Activation Functions

In [None]:
# Visualize activations
x_range = np.linspace(-5, 5, 100)

relu_func = ReLU()
sigmoid_func = Sigmoid()
tanh_func = Tanh()

plt.figure(figsize=(15, 4))

# Plot activations
plt.subplot(1, 3, 1)
plt.plot(x_range, relu_func.forward(x_range), label='ReLU', linewidth=2)
plt.plot(x_range, sigmoid_func.forward(x_range), label='Sigmoid', linewidth=2)
plt.plot(x_range, tanh_func.forward(x_range), label='Tanh', linewidth=2)
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title('Activation Functions')
plt.legend()
plt.grid(True)

# Plot derivatives
plt.subplot(1, 3, 2)
grad = np.ones_like(x_range)
plt.plot(x_range, relu_func.backward(grad), label="ReLU'", linewidth=2)
plt.plot(x_range, sigmoid_func.backward(grad), label="Sigmoid'", linewidth=2)
plt.plot(x_range, tanh_func.backward(grad), label="Tanh'", linewidth=2)
plt.xlabel('x')
plt.ylabel("f'(x)")
plt.title('Activation Function Derivatives')
plt.legend()
plt.grid(True)

# Comparison table
plt.subplot(1, 3, 3)
plt.axis('off')
table_data = [
    ['Function', 'Range', 'Use Case'],
    ['ReLU', '[0, ∞)', 'Hidden layers (most common)'],
    ['Sigmoid', '(0, 1)', 'Binary classification output'],
    ['Tanh', '(-1, 1)', 'Hidden layers (RNNs)'],
    ['Softmax', '(0, 1) sum=1', 'Multi-class output']
]
table = plt.table(cellText=table_data, cellLoc='left', loc='center',
                 colWidths=[0.3, 0.3, 0.4])
table.auto_set_font_size(False)
table.set_fontsize(9)
table.scale(1, 2)
plt.title('Activation Function Comparison', pad=20)

plt.tight_layout()
plt.show()

---

## 3. Loss Functions

**Time:** 15-20 minutes per loss  
**Difficulty:** Medium

---

### 3.1 Mean Squared Error (MSE)

In [None]:
class MSELoss:
    """
    Mean Squared Error Loss
    
    L = (1/N) * Σ(y_pred - y_true)^2
    
    Gradient: dL/dy_pred = (2/N) * (y_pred - y_true)
    """
    
    def __init__(self):
        self.y_pred = None
        self.y_true = None
    
    def forward(self, y_pred: np.ndarray, y_true: np.ndarray) -> float:
        """
        Compute MSE loss
        
        Args:
            y_pred: Predictions
            y_true: Ground truth
        
        Returns:
            loss: Scalar loss value
        """
        self.y_pred = y_pred
        self.y_true = y_true
        
        return np.mean((y_pred - y_true) ** 2)
    
    def backward(self) -> np.ndarray:
        """
        Compute gradient w.r.t. predictions
        
        Returns:
            grad: Gradient of loss w.r.t. y_pred
        """
        N = self.y_pred.size
        return (2 / N) * (self.y_pred - self.y_true)
    
    def __repr__(self):
        return "MSELoss()"

# Test
mse_loss = MSELoss()
y_pred = np.array([[2.5], [0.0], [2.1]])
y_true = np.array([[3.0], [0.5], [2.0]])

loss = mse_loss.forward(y_pred, y_true)
grad = mse_loss.backward()

print(f"Predictions: {y_pred.T}")
print(f"True values: {y_true.T}")
print(f"MSE Loss: {loss:.4f}")
print(f"Gradient: {grad.T}")

### 3.2 Binary Cross-Entropy Loss

In [None]:
class BCELoss:
    """
    Binary Cross-Entropy Loss
    
    L = -(1/N) * Σ[y_true * log(y_pred) + (1 - y_true) * log(1 - y_pred)]
    
    Gradient: dL/dy_pred = (y_pred - y_true) / (y_pred * (1 - y_pred))
    """
    
    def __init__(self):
        self.y_pred = None
        self.y_true = None
        self.epsilon = 1e-10  # For numerical stability
    
    def forward(self, y_pred: np.ndarray, y_true: np.ndarray) -> float:
        # Clip predictions to prevent log(0)
        y_pred_clipped = np.clip(y_pred, self.epsilon, 1 - self.epsilon)
        self.y_pred = y_pred_clipped
        self.y_true = y_true
        
        return -np.mean(
            y_true * np.log(y_pred_clipped) + 
            (1 - y_true) * np.log(1 - y_pred_clipped)
        )
    
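    def backward(self) -> np.ndarray:
        """
        Gradient w.r.t. predictions, averaged over N elements:
        dL/dy_pred = (y_pred - y_true) / (y_pred * (1 - y_pred)) / N
        
        When y_pred = sigmoid(z), the chain rule simplifies this to
        dL/dz = (y_pred - y_true) / N
        """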
        N = self.y_pred.size
        return -(self.y_true / self.y_pred - 
                (1 - self.y_true) / (1 - self.y_pred)) / N
    
    def __repr__(self):
        return "BCELoss()"

# Test
bce_loss = BCELoss()
y_pred = np.array([[0.9], [0.2], [0.8]])
y_true = np.array([[1.0], [0.0], [1.0]])

loss = bce_loss.forward(y_pred, y_true)
grad = bce_loss.backward()

print(f"Predictions: {y_pred.T}")
print(f"True labels: {y_true.T}")
print(f"BCE Loss: {loss:.4f}")
print(f"Gradient: {grad.T}")
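
The simplification for sigmoid followed by BCE is easy to verify numerically. This is a minimal sketch with made-up logits `z` and targets `t`; it multiplies the BCE gradient by the sigmoid derivative and compares the result with the fused form $(p - t)/N$.

```python
import numpy as np

z = np.array([[2.0], [-1.5], [0.3]])   # pre-sigmoid logits (illustrative values)
t = np.array([[1.0], [0.0], [1.0]])    # binary targets
N = z.size

p = 1 / (1 + np.exp(-z))               # sigmoid

# Step by step: dL/dp from BCE, then multiply by the sigmoid derivative
dL_dp = (p - t) / (p * (1 - p)) / N
dL_dz_chain = dL_dp * p * (1 - p)

# Fused form: the p * (1 - p) factors cancel
dL_dz_fused = (p - t) / N

print(np.allclose(dL_dz_chain, dL_dz_fused))
```

The fused form is also safer in practice, since it never divides by `p * (1 - p)`, which approaches zero for confident predictions.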

### 3.3 Cross-Entropy Loss (Multi-Class)

In [None]:
class CrossEntropyLoss:
    """
    Cross-Entropy Loss for multi-class classification
    
    Combines Softmax + Negative Log-Likelihood
    
    L = -(1/N) * Σ Σ y_true[i,c] * log(y_pred[i,c])
    
    When combined with softmax, gradient simplifies to:
    dL/dz = y_pred - y_true (where z is logit before softmax)
    """
    
    def __init__(self):
        self.y_pred = None
        self.y_true = None
        self.epsilon = 1e-10
    
    def forward(self, logits: np.ndarray, y_true: np.ndarray) -> float:
        """
        Args:
            logits: Raw scores before softmax (num_classes, batch_size)
            y_true: One-hot encoded labels (num_classes, batch_size)
        """
        # Apply softmax
        softmax = Softmax()
        self.y_pred = softmax.forward(logits)
        self.y_true = y_true
        
        # Clip for numerical stability
        y_pred_clipped = np.clip(self.y_pred, self.epsilon, 1 - self.epsilon)
        
        # Cross-entropy
        return -np.mean(np.sum(y_true * np.log(y_pred_clipped), axis=0))
    
    def backward(self) -> np.ndarray:
        """
        Gradient w.r.t. logits (before softmax)
        
        Simplified: dL/dz = (y_pred - y_true) / batch_size
        """
        batch_size = self.y_pred.shape[1]
        return (self.y_pred - self.y_true) / batch_size
    
    def __repr__(self):
        return "CrossEntropyLoss()"

# Test
ce_loss = CrossEntropyLoss()

# Logits for 3 classes, batch of 2
logits = np.array([[2.0, 1.0],   # class 0
                   [1.0, 3.0],   # class 1
                   [0.1, 0.5]])  # class 2

# One-hot encoded labels (true class: [0, 1])
y_true = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [0.0, 0.0]])

loss = ce_loss.forward(logits, y_true)
grad = ce_loss.backward()

print(f"Logits:\n{logits}")
print(f"\nPredicted probabilities (after softmax):\n{ce_loss.y_pred}")
print(f"\nTrue labels (one-hot):\n{y_true}")
print(f"\nCross-Entropy Loss: {loss:.4f}")
print(f"\nGradient:\n{grad}")

---

## 4. Optimizers

**Time:** 20-25 minutes  
**Difficulty:** Medium

Different optimization algorithms for parameter updates.

---

### 4.1 Stochastic Gradient Descent (SGD)

In [None]:
class SGD:
    """
    Stochastic Gradient Descent
    
    θ = θ - α * ∇L(θ)
    """
    
    def __init__(self, learning_rate: float = 0.01):
        self.learning_rate = learning_rate
    
    def update(self, params: dict, grads: dict):
        """
        Update parameters using gradients
        
        Args:
            params: Dictionary of parameters
            grads: Dictionary of gradients
        """
        for key in params:
            params[key] -= self.learning_rate * grads[key]
    
    def __repr__(self):
        return f"SGD(lr={self.learning_rate})"

### 4.2 SGD with Momentum

In [None]:
class SGDMomentum:
    """
    SGD with Momentum
    
    v = β * v + ∇L(θ)
    θ = θ - α * v
    
    Helps accelerate in relevant direction and dampen oscillations
    """
    
    def __init__(self, learning_rate: float = 0.01, momentum: float = 0.9):
        self.learning_rate = learning_rate
        self.momentum = momentum
        self.velocities = {}
    
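    def update(self, params: dict, grads: dict):
        # Lazily create one velocity buffer per parameter key. Keys must be
        # unique across the whole model (e.g. 'W0', 'W1'), not reused per layer.
        for key in params:
            if key not in self.velocities:
                self.velocities[key] = np.zeros_like(params[key])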
        
        # Update with momentum
        for key in params:
            self.velocities[key] = (
                self.momentum * self.velocities[key] + grads[key]
            )
            params[key] -= self.learning_rate * self.velocities[key]
    
    def __repr__(self):
        return f"SGDMomentum(lr={self.learning_rate}, momentum={self.momentum})"

### 4.3 Adam Optimizer

In [None]:
class Adam:
    """
    Adam Optimizer (Adaptive Moment Estimation)
    
    Combines momentum (first moment) and RMSProp (second moment)
    
    m = β1 * m + (1 - β1) * ∇L(θ)         # First moment (momentum)
    v = β2 * v + (1 - β2) * (∇L(θ))^2     # Second moment (variance)
    m_hat = m / (1 - β1^t)                 # Bias correction
    v_hat = v / (1 - β2^t)                 # Bias correction
    θ = θ - α * m_hat / (sqrt(v_hat) + ε)
    """
    
    def __init__(self, learning_rate: float = 0.001, beta1: float = 0.9, 
                 beta2: float = 0.999, epsilon: float = 1e-8):
        self.learning_rate = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        
        self.m = {}  # First moment
        self.v = {}  # Second moment
        self.t = 0   # Time step
    
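    def update(self, params: dict, grads: dict):
        # Lazily create moment buffers per parameter key. Keys must be unique
        # across the whole model, and update() should be called once per
        # training step so that t counts steps rather than layers.
        for key in params:
            if key not in self.m:
                self.m[key] = np.zeros_like(params[key])
                self.v[key] = np.zeros_like(params[key])
        
        self.t += 1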
        
        for key in params:
            # Update biased first moment estimate
            self.m[key] = self.beta1 * self.m[key] + (1 - self.beta1) * grads[key]
            
            # Update biased second moment estimate
            self.v[key] = self.beta2 * self.v[key] + (1 - self.beta2) * (grads[key] ** 2)
            
            # Bias correction
            m_hat = self.m[key] / (1 - self.beta1 ** self.t)
            v_hat = self.v[key] / (1 - self.beta2 ** self.t)
            
            # Update parameters
            params[key] -= self.learning_rate * m_hat / (np.sqrt(v_hat) + self.epsilon)
    
    def __repr__(self):
        return f"Adam(lr={self.learning_rate}, β1={self.beta1}, β2={self.beta2})"

# Demo comparison
print("Optimizers created:")
print(f"1. {SGD(learning_rate=0.01)}")
print(f"2. {SGDMomentum(learning_rate=0.01, momentum=0.9)}")
print(f"3. {Adam(learning_rate=0.001)}")
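
To see the update rules in action, the sketch below runs all three on a toy ill-conditioned quadratic. It re-implements the rules as small standalone functions so the block runs on its own; the curvature values, learning rates, and step count are arbitrary choices for illustration, and results on a toy problem like this say little about real networks.

```python
import numpy as np

curvature = np.array([1.0, 25.0])   # f(theta) = 0.5 * sum(curvature * theta^2)

def loss(theta):
    return 0.5 * np.sum(curvature * theta ** 2)

def grad(theta):
    return curvature * theta

def run_sgd(lr=0.01, steps=100):
    theta = np.array([1.0, 1.0])
    for _ in range(steps):
        theta -= lr * grad(theta)
    return loss(theta)

def run_momentum(lr=0.01, beta=0.9, steps=100):
    theta, v = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(steps):
        v = beta * v + grad(theta)          # accumulate velocity
        theta -= lr * v
    return loss(theta)

def run_adam(lr=0.05, b1=0.9, b2=0.999, eps=1e-8, steps=100):
    theta, m, v = np.array([1.0, 1.0]), np.zeros(2), np.zeros(2)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = b1 * m + (1 - b1) * g           # first moment
        v = b2 * v + (1 - b2) * g ** 2      # second moment
        m_hat = m / (1 - b1 ** t)           # bias correction
        v_hat = v / (1 - b2 ** t)
        theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return loss(theta)

initial_loss = loss(np.array([1.0, 1.0]))
loss_sgd, loss_momentum, loss_adam = run_sgd(), run_momentum(), run_adam()
print(f"initial:  {initial_loss:.4f}")
print(f"SGD:      {loss_sgd:.6f}")
print(f"Momentum: {loss_momentum:.6f}")
print(f"Adam:     {loss_adam:.6f}")
```

With the same learning rate, momentum makes much faster progress than plain SGD along the flat direction (curvature 1), which is the "accelerate in the relevant direction" effect described above.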

---

## 5. Complete Neural Network

**Time:** 40-50 minutes  
**Difficulty:** Hard

Putting it all together: A complete multi-layer neural network from scratch.

---

In [None]:
class NeuralNetwork:
    """
    Complete feedforward neural network from scratch
    
    Architecture: Input → Linear → ReLU → Linear → ReLU → Linear → Softmax
    """
    
    def __init__(self, input_size: int, hidden_sizes: list, output_size: int):
        """
        Args:
            input_size: Number of input features
            hidden_sizes: List of hidden layer sizes
            output_size: Number of output classes
        """
        self.layers = []
        self.activations = []
        
        # Build network architecture
        layer_sizes = [input_size] + hidden_sizes + [output_size]
        
        for i in range(len(layer_sizes) - 1):
            # Add linear layer
            self.layers.append(Linear(layer_sizes[i], layer_sizes[i+1]))
            
            # Add activation (ReLU for hidden, Softmax for output)
            if i < len(layer_sizes) - 2:
                self.activations.append(ReLU())
            else:
                self.activations.append(Softmax())
        
        self.loss_fn = CrossEntropyLoss()
    
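    def forward(self, X: np.ndarray, return_logits: bool = False) -> np.ndarray:
        """
        Forward pass through network
        
        Args:
            X: Input (batch_size, input_size)
            return_logits: If True, skip the final Softmax and return raw scores
        
        Returns:
            Output probabilities (or logits) of shape (output_size, batch_size)
        """
        out = X
        last = len(self.layers) - 1
        
        for i, (layer, activation) in enumerate(zip(self.layers, self.activations)):
            # Linear takes (batch, features) but returns (features, batch),
            # so transpose every intermediate output before the next layer
            out = layer.forward(out if i == 0 else out.T)
            
            # CrossEntropyLoss applies softmax itself, so training needs logits
            if i == last and return_logits:
                break
            out = activation.forward(out)
        
        return out
    
    def backward(self, grad_output: np.ndarray):
        """
        Backward pass through network
        
        Args:
            grad_output: Gradient of the loss w.r.t. the logits (output_size, batch_size)
        """
        grad = grad_output
        last = len(self.layers) - 1
        
        # Backward through layers in reverse
        for i in reversed(range(len(self.layers))):
            # The loss gradient is already w.r.t. the logits, so the
            # output Softmax is skipped on the way back
            if i != last:
                grad = self.activations[i].backward(grad)
            grad = self.layers[i].backward(grad)
    
    def update_params(self, optimizer):
        """
        Update parameters using optimizer
        """
        # One dict with a unique key per layer, so stateful optimizers
        # (Momentum, Adam) keep separate buffers for each parameter array.
        # The optimizers update the arrays in place, so layer.W and layer.b change.
        params, grads = {}, {}
        for i, layer in enumerate(self.layers):
            params[f'W{i}'], grads[f'W{i}'] = layer.W, layer.dW
            params[f'b{i}'], grads[f'b{i}'] = layer.b, layer.db
        optimizer.update(params, grads)
    
    def train_step(self, X: np.ndarray, y: np.ndarray, optimizer) -> float:
        """
        Single training step
        
        Args:
            X: Input batch (batch_size, input_size)
            y: One-hot labels (output_size, batch_size)
            optimizer: Optimizer instance
        
        Returns:
            loss: Scalar loss value
        """
        # Forward pass (raw logits; the loss applies softmax internally)
        logits = self.forward(X, return_logits=True)
        
        # Compute loss
        loss = self.loss_fn.forward(logits, y)
        
        # Backward pass
        grad = self.loss_fn.backward()
        self.backward(grad)
        
        # Update parameters
        self.update_params(optimizer)
        
        return loss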
    
    def predict(self, X: np.ndarray) -> np.ndarray:
        """
        Predict class labels
        
        Args:
            X: Input (batch_size, input_size)
        
        Returns:
            Predicted class indices (batch_size,)
        """
        probs = self.forward(X)
        return np.argmax(probs, axis=0)
    
    def accuracy(self, X: np.ndarray, y_labels: np.ndarray) -> float:
        """
        Compute accuracy
        
        Args:
            X: Input (batch_size, input_size)
            y_labels: True class indices (batch_size,)
        
        Returns:
            Accuracy score
        """
        preds = self.predict(X)
        return np.mean(preds == y_labels)
    
    def __repr__(self):
        arch = []
        for layer in self.layers:
            arch.append(f"Linear({layer.in_features}→{layer.out_features})")
        return "NeuralNetwork(\n  " + "\n  ".join(arch) + "\n)"

# Test the complete network
print("Building neural network...\n")
model = NeuralNetwork(input_size=4, hidden_sizes=[8, 8], output_size=3)
print(model)

### Train on Real Dataset

In [None]:
# Load dataset
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Standardize features
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert labels to one-hot
def to_one_hot(y, num_classes):
    one_hot = np.zeros((num_classes, len(y)))
    one_hot[y, np.arange(len(y))] = 1
    return one_hot

y_train_onehot = to_one_hot(y_train, 3)
y_test_onehot = to_one_hot(y_test, 3)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
print(f"Number of classes: 3 (Iris species)\n")

# Create model and optimizer
model = NeuralNetwork(input_size=4, hidden_sizes=[16, 8], output_size=3)
optimizer = Adam(learning_rate=0.01)

# Training loop
epochs = 200
batch_size = 32
train_losses = []
train_accs = []
test_accs = []

print("Training neural network...\n")

for epoch in range(epochs):
    # Mini-batch training
    indices = np.random.permutation(len(X_train))
    epoch_loss = 0
    num_batches = 0
    
    for i in range(0, len(X_train), batch_size):
        batch_indices = indices[i:i+batch_size]
        X_batch = X_train[batch_indices]
        y_batch = y_train_onehot[:, batch_indices]
        
        loss = model.train_step(X_batch, y_batch, optimizer)
        epoch_loss += loss
        num_batches += 1
    
    epoch_loss /= num_batches
    train_losses.append(epoch_loss)
    
    # Compute accuracies
    train_acc = model.accuracy(X_train, y_train)
    test_acc = model.accuracy(X_test, y_test)
    train_accs.append(train_acc)
    test_accs.append(test_acc)
    
    if (epoch + 1) % 50 == 0:
        print(f"Epoch {epoch+1}/{epochs}: Loss = {epoch_loss:.4f}, "
              f"Train Acc = {train_acc:.4f}, Test Acc = {test_acc:.4f}")

print(f"\nFinal Test Accuracy: {test_accs[-1]:.4f}")

### Visualize Training

In [None]:
# Plot training curves
plt.figure(figsize=(15, 4))

# Loss curve
plt.subplot(1, 3, 1)
plt.plot(train_losses, label='Training Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training Loss Over Time')
plt.legend()
plt.grid(True)

# Accuracy curves
plt.subplot(1, 3, 2)
plt.plot(train_accs, label='Train Accuracy')
plt.plot(test_accs, label='Test Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title('Accuracy Over Time')
plt.legend()
plt.grid(True)

# Confusion matrix
plt.subplot(1, 3, 3)
y_pred = model.predict(X_test)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
plt.imshow(cm, cmap='Blues')
plt.colorbar()
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
for i in range(3):
    for j in range(3):
        plt.text(j, i, str(cm[i, j]), ha='center', va='center')

plt.tight_layout()
plt.show()

print("\nClassification Report:")
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, target_names=iris.target_names))

---

## Interview Discussion Points

### Q1: Why use Xavier/He initialization instead of zeros or random?

**Answer:**
- **Zeros:** All neurons learn the same features (symmetry problem)
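- **Random (unscaled):** Activations and gradients explode if the weights are too large and vanish if they are too small
- **Xavier:** Keeps the variance of activations roughly equal from layer to layer (derived for sigmoid/tanh); weight variance is $\frac{2}{n_{in} + n_{out}}$
- **He:** Weight variance $\frac{2}{n_{in}}$, twice the fan-in form of Xavier ($\frac{1}{n_{in}}$), to compensate for ReLU zeroing roughly half of the pre-activations

**Xavier:** $W \sim U(-\sqrt{\frac{6}{n_{in} + n_{out}}}, \sqrt{\frac{6}{n_{in} + n_{out}}})$

**He:** $W \sim \mathcal{N}(0, \sigma^2)$ with $\sigma = \sqrt{\frac{2}{n_{in}}}$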
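
The effect of the two schemes on a ReLU stack can be measured directly. The sketch below pushes a random batch through 20 square ReLU layers and reports how much of the input signal power survives; the width, depth, batch size, and the helper name `final_signal_ratio` are assumptions made for this illustration, and biases are omitted.

```python
import numpy as np

def final_signal_ratio(init, width=256, depth=20, batch=256, seed=0):
    """Mean squared activation after `depth` ReLU layers, relative to the input."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(width, batch))
    input_power = np.mean(a ** 2)
    for _ in range(depth):
        if init == "xavier":
            # Uniform Xavier/Glorot: variance 2 / (n_in + n_out)
            limit = np.sqrt(6 / (width + width))
            W = rng.uniform(-limit, limit, size=(width, width))
        else:
            # He: normal with standard deviation sqrt(2 / n_in)
            W = rng.normal(0.0, np.sqrt(2 / width), size=(width, width))
        a = np.maximum(0, W @ a)
    return np.mean(a ** 2) / input_power

ratio_xavier = final_signal_ratio("xavier")
ratio_he = final_signal_ratio("he")
print(f"Xavier + ReLU: signal power ratio after 20 layers = {ratio_xavier:.2e}")
print(f"He + ReLU:     signal power ratio after 20 layers = {ratio_he:.2e}")
```

With Xavier, each ReLU layer roughly halves the signal power, so after 20 layers it is on the order of $2^{-20}$ of the input; with He it stays on the order of 1.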

---

### Q2: Why do we need the backward pass?

**Answer:**
The backward pass computes gradients using **backpropagation** (chain rule):

1. **Chain Rule:** $\frac{\partial L}{\partial W} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial W}$

2. **Flow:** Loss → Output Layer → Hidden Layers → Input

3. **Purpose:** Tell each parameter how to change to reduce loss

Without backprop, we'd have to estimate gradients numerically with finite differences, which costs at least one extra forward pass per parameter (extremely slow!).
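
The cost difference shows up even on a tiny network. The sketch below assumes a two-layer model `y = w2 · tanh(W1 x)` with a squared-error loss (an illustrative setup, not the network built earlier), computes gradients once with the chain rule, and then again with central differences while counting loss evaluations.

```python
import numpy as np

rng = np.random.default_rng(0)
x, t = rng.normal(size=4), 0.5
W1 = rng.normal(size=(3, 4))
w2 = rng.normal(size=3)

calls = {"n": 0}   # counts loss evaluations made by the numerical method

def loss_fn(W1, w2):
    calls["n"] += 1
    h = np.tanh(W1 @ x)
    return 0.5 * (w2 @ h - t) ** 2

# Backpropagation: one forward pass, then the chain rule layer by layer
h = np.tanh(W1 @ x)
dy = (w2 @ h) - t                        # dL/dy
dw2 = dy * h                             # dL/dw2 = dL/dy * dy/dw2
dh = dy * w2                             # dL/dh
dW1 = np.outer(dh * (1 - h ** 2), x)     # through tanh, then the first layer

def numerical_grad(param, eps=1e-6):
    # Central differences: two loss evaluations per parameter entry
    grad = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        old = param[idx]
        param[idx] = old + eps
        up = loss_fn(W1, w2)
        param[idx] = old - eps
        down = loss_fn(W1, w2)
        param[idx] = old
        grad[idx] = (up - down) / (2 * eps)
    return grad

num_dW1, num_dw2 = numerical_grad(W1), numerical_grad(w2)
n_params = W1.size + w2.size
print(f"parameters: {n_params}, loss evaluations for numerical gradient: {calls['n']}")
print(f"backprop matches: {np.allclose(dW1, num_dW1, atol=1e-6) and np.allclose(dw2, num_dw2, atol=1e-6)}")
```

Numerical differentiation needs two loss evaluations per parameter, so its cost grows linearly with model size, while backprop gets every gradient from one forward and one backward pass. Finite differences remain useful as a correctness check on small inputs.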

---

### Q3: ReLU vs Sigmoid - which is better and why?

**ReLU Advantages:**
- Mitigates vanishing gradients (derivative is exactly 1 for positive inputs, versus at most 0.25 for sigmoid)
- Faster computation (no expensive exp)
- Sparse activations (some neurons output 0)
- Works better in practice for deep networks

**ReLU Disadvantages:**
- Dying ReLU problem (a neuron whose pre-activation is negative for every input gets zero gradient and stops learning)
- Not zero-centered

**When to use Sigmoid:**
- Binary classification output layer
- When you need outputs in (0, 1)
- Gates in LSTMs/GRUs

**When to use ReLU:**
- Hidden layers in deep networks (default choice)
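
The vanishing-gradient argument can be made concrete with a few lines of arithmetic. This sketch looks only at the activation derivatives and ignores the weight matrices, so it is a best-case bound rather than a simulation of a real network; the depth of 10 is an arbitrary choice.

```python
import numpy as np

x = np.linspace(-5, 5, 1001)
s = 1 / (1 + np.exp(-x))
sigmoid_grad = s * (1 - s)             # peaks at x = 0
relu_grad = (x > 0).astype(float)      # 1 for positive inputs, 0 otherwise

depth = 10
max_sigmoid_grad = sigmoid_grad.max()

# Best case: every layer sits at the point of maximum derivative
best_case_sigmoid = max_sigmoid_grad ** depth
best_case_relu = relu_grad.max() ** depth

print(f"max sigmoid derivative: {max_sigmoid_grad:.4f}")
print(f"best-case product over {depth} sigmoid layers: {best_case_sigmoid:.2e}")
print(f"best-case product over {depth} ReLU layers:    {best_case_relu:.1f}")
```

Even in the best case, ten stacked sigmoids scale the gradient by less than one millionth, while active ReLU units pass it through unchanged.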

---

### Q4: Why combine Softmax with Cross-Entropy?

**Answer:**
1. **Numerical Stability:** Computing them separately can cause overflow/underflow
2. **Simplified Gradient:** Combined gradient is just `y_pred - y_true`
3. **Efficiency:** One backward pass instead of two

**Mathematical reason:**
```
Softmax: p_i = e^(z_i) / Σ e^(z_j)
Cross-Entropy: L = -Σ y_i * log(p_i)

Combined gradient: ∂L/∂z_i = p_i - y_i
```

This is **much simpler** than computing each gradient separately!
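
Both claims, the simplified gradient and the stability benefit, can be checked in a few lines. The sketch below assumes a single sample with a one-hot label and uses the log-sum-exp trick; `cross_entropy_from_logits` is a hypothetical helper written for this example.

```python
import numpy as np

def cross_entropy_from_logits(z, y):
    # log-softmax via log-sum-exp: shift by the max so exp never overflows
    z_shift = z - np.max(z)
    log_p = z_shift - np.log(np.sum(np.exp(z_shift)))
    return -np.sum(y * log_p)

z = np.array([2.0, 1.0, 0.1])
y = np.array([1.0, 0.0, 0.0])

# Fused gradient: softmax(z) - y
p = np.exp(z - z.max())
p /= p.sum()
fused_grad = p - y

# Finite-difference gradient of the loss w.r.t. each logit
eps = 1e-6
numeric_grad = np.array([
    (cross_entropy_from_logits(z + eps * e, y)
     - cross_entropy_from_logits(z - eps * e, y)) / (2 * eps)
    for e in np.eye(3)
])

# Logits this large would overflow a naive exp(z) / sum(exp(z))
big_logits = np.array([1000.0, 999.0, 0.0])
stable_loss = cross_entropy_from_logits(big_logits, y)

print(f"fused gradient:   {fused_grad}")
print(f"numeric gradient: {numeric_grad}")
print(f"loss for large logits: {stable_loss:.4f}")
```

Working in log space also removes the need to clip probabilities before taking the log.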

---

### Q5: Adam vs SGD - when to use each?

**Adam:**
- **Pros:** Adaptive learning rates, fast convergence, less tuning
- **Cons:** Can generalize worse than well-tuned SGD on some tasks; higher memory usage (two extra moment buffers per parameter)
- **Use when:** Training deep networks, need fast iteration

**SGD (+Momentum):**
- **Pros:** Better generalization, more stable
- **Cons:** Requires learning rate tuning, slower convergence
- **Use when:** Final production model, have time to tune

**Common Practice:**
- Start with Adam for quick experimentation
- Switch to SGD+Momentum for final model

---

## Summary

You've implemented from scratch:

✅ **Linear Layer** - Forward and backward pass  
✅ **Activations** - ReLU, Sigmoid, Tanh, Softmax  
✅ **Loss Functions** - MSE, BCE, Cross-Entropy  
✅ **Optimizers** - SGD, Momentum, Adam  
✅ **Complete Neural Network** - End-to-end training

**You're now ready for deep learning coding interviews!** 🚀

---