<h1 style="color:white;">
    Iris Multilayer Perceptron (MLP)
</h1>
<h3 style="color:white;">
    Fully Connected Neural Network<br>
    Dense Neural Network
</h3>

<h3 style="color:white;">
    Load and split data into training and testing sets
</h3>

In [14]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

iris = load_iris()
X, y = iris.data, iris.target

# Stratified split keeps the class distribution the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training set size: {X_train.shape}, Test set size: {X_test.shape}")

Training set size: (120, 4), Test set size: (30, 4)


<h3>Step 1 — Initialize Network Parameters</h3>
<p>
We create the weight matrices and bias vectors for a 1-hidden-layer classifier on Iris.<br>
Shapes follow the layer connectivity: <code>X (n_samples, 4)</code> → <code>Hidden (n_samples, 12)</code> → <code>Pred (n_samples, 3)</code>.
</p>
<ul>
  <li><code>W1</code> ∈ ℝ<sup>4×12</sup>: input→hidden weights</li>
  <li><code>b1</code> ∈ ℝ<sup>1×12</sup>: hidden biases (broadcast across samples)</li>
  <li><code>W2</code> ∈ ℝ<sup>12×3</sup>: hidden→output weights</li>
  <li><code>b2</code> ∈ ℝ<sup>1×3</sup>: output biases (broadcast across samples)</li>
</ul>
<p>
We set a fixed random seed for reproducibility. Biases start at zero; weights are drawn from a standard normal distribution with no scaling (you can switch to Xavier/He initialization later).
</p>


In [2]:
import numpy as np

np.random.seed(42)

input_size = 4       # features
hidden_size = 12     # you can tweak
output_size = 3      # classes

# Initialize weights
W1 = np.random.randn(input_size, hidden_size)   # [4, 12] weights from input to hidden
b1 = np.zeros((1, hidden_size))                 # [1, 12] bias for hidden layer
W2 = np.random.randn(hidden_size, output_size)  # [12, 3] weights from hidden to output
b2 = np.zeros((1, output_size))                 # [1, 3] bias for output layer
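
<p>
The note above mentions switching to Xavier/He initialization. A minimal sketch of what that could look like is shown here; it is not part of the notebook's training run. The names <code>W1_he</code> and <code>W2_xavier</code> are hypothetical, and the use of <code>np.random.default_rng</code> instead of <code>np.random.seed</code> is an assumption made to keep the example self-contained.
</p>

```python
import numpy as np

# Assumption: a local Generator, so this sketch does not disturb the global seed
rng = np.random.default_rng(42)

input_size, hidden_size, output_size = 4, 12, 3

# He initialization: std = sqrt(2 / fan_in), commonly paired with ReLU layers
W1_he = rng.standard_normal((input_size, hidden_size)) * np.sqrt(2.0 / input_size)

# Xavier/Glorot initialization (normal variant): std = sqrt(2 / (fan_in + fan_out))
W2_xavier = rng.standard_normal((hidden_size, output_size)) * np.sqrt(
    2.0 / (hidden_size + output_size)
)

print("He target std:", np.sqrt(2.0 / input_size), "empirical:", W1_he.std())
print("Xavier target std:", np.sqrt(2.0 / (hidden_size + output_size)),
      "empirical:", W2_xavier.std())
```

<p>
Scaling the draws this way keeps the initial logits smaller than an unscaled normal draw does, which typically lowers the starting loss.
</p>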

<h3>Step 2 — One-Hot Encode the Labels</h3>
<p>
The cross-entropy loss compares the targets with the network output element by element, so the targets must have the same shape as the output layer.
For the Iris dataset, we have 3 possible classes. Instead of integer labels such as <code>0</code>, <code>1</code>, <code>2</code>,
we convert them into one-hot vectors:
</p>
<ul>
  <li><code>[0] → [1, 0, 0]</code></li>
  <li><code>[1] → [0, 1, 0]</code></li>
  <li><code>[2] → [0, 0, 1]</code></li>
</ul>
<p>
We use <code>OneHotEncoder</code> to transform <code>y_train</code> and <code>y_test</code> into binary matrices:
</p>
<ul>
  <li><code>y_train_oh</code> → shape (<code>n_train_samples × 3</code>)</li>
  <li><code>y_test_oh</code> → shape (<code>n_test_samples × 3</code>)</li>
</ul>
<p>
This ensures the labels align with the network’s output probabilities from the softmax layer.
</p>

In [4]:
encoder = OneHotEncoder(sparse_output=False)
y_train_oh = encoder.fit_transform(y_train.reshape(-1, 1))
y_test_oh = encoder.transform(y_test.reshape(-1, 1))
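
<p>
To make the mapping above concrete, here is a NumPy-only sketch of the same encoding. The helper <code>one_hot</code> and the array <code>labels_demo</code> are hypothetical names used for illustration; the notebook itself uses <code>OneHotEncoder</code>. The sketch assumes the labels are integers in <code>0 .. n_classes - 1</code>.
</p>

```python
import numpy as np

def one_hot(labels, n_classes):
    """Return an (n_samples, n_classes) matrix with a single 1 per row."""
    # Row k of the identity matrix is the one-hot vector for class k
    return np.eye(n_classes)[labels]

labels_demo = np.array([0, 2, 1, 2])
encoded = one_hot(labels_demo, 3)

# argmax along the class axis recovers the original integer labels
decoded = np.argmax(encoded, axis=1)

print(encoded)
print(decoded)
```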

<h3>Step 3 — Define Activations and Forward Pass</h3>
<p>
We implement the non-linearities and run a forward pass through the network (before training) to inspect the initial outputs:
</p>
<ul>
  <li><code>ReLU(x) = max(0, x)</code> is applied to the hidden layer to introduce non-linearity.</li>
  <li><code>Softmax(x)</code> converts the output logits into class probabilities. It is stabilized by subtracting each row's maximum logit before exponentiating, which avoids overflow without changing the result.</li>
</ul>

In [3]:
def relu(x):
    return np.maximum(0, x)

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))  # stability trick
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

# Forward pass (before training) to check initial loss
hidden = relu(np.dot(X_train, W1) + b1)
output = softmax(np.dot(hidden, W2) + b2)
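
<p>
The stability trick can be illustrated with deliberately large logits. This is a small sketch, not part of the training code: the <code>logits</code> values are made up, and <code>naive</code> is a hypothetical name for the unshifted computation.
</p>

```python
import numpy as np

def softmax(x):
    # Subtract the row-wise max so the largest exponent is exp(0) = 1
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

logits = np.array([[1000.0, 1001.0, 1002.0]])  # illustrative values only

# Naive version: exp(1000) overflows to inf, and inf / inf gives nan
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(logits) / np.sum(np.exp(logits), axis=1, keepdims=True)

stable = softmax(logits)

print("naive :", naive)
print("stable:", stable)

# Shifting all logits by a constant leaves softmax unchanged
same = softmax(np.array([[0.0, 1.0, 2.0]]))
print("shift-invariant:", np.allclose(stable, same))
```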

<h3>Step 4 — Define Loss Function (Cross-Entropy)</h3>
<p>
We use <strong>cross-entropy loss</strong>, the standard choice for classification with softmax outputs.
It measures how far the predicted probability distribution is from the true one-hot labels:
</p>
<p style="text-align:center;">
  <code>L = -(1/N) Σ<sub>i</sub> Σ<sub>k</sub> y_true[i, k] · log(y_pred[i, k])</code>
</p>
<ul>
  <li><code>y_true</code>: one-hot encoded labels (shape = N×3)</li>
  <li><code>y_pred</code>: predicted probabilities from softmax (shape = N×3)</li>
  <li>We apply <code>np.clip(y_pred, 1e-15, 1 - 1e-15)</code> to avoid numerical issues:
    <ul>
      <li>Prevents <code>log(0)</code> → <code>-∞</code>.</li>
      <li>Keeps probabilities in a safe range: [1e-15, 1 − 1e-15].</li>
      <li>Keeps the reported loss finite. The training gradient <code>output - y_true</code> does not pass through the clip, so clipping affects only the loss value.</li>
    </ul>
  </li>
  <li>The function returns the mean loss across all samples.</li>
</ul>
<p>
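At this stage, we compute the <em>initial loss</em> before training. A high value is expected: the unscaled random weights produce large logits, so softmax makes confident but mostly wrong predictions. For reference, a uniform guess over 3 classes gives a loss of ln 3 ≈ 1.10.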
</p>

In [5]:
def cross_entropy(y_true, y_pred):
    # clip to avoid log(0)
    y_pred_clipped = np.clip(y_pred, 1e-15, 1 - 1e-15)
    log_probs = -np.sum(y_true * np.log(y_pred_clipped), axis=1)
    return np.mean(log_probs)

loss = cross_entropy(y_train_oh, output)
print("Initial loss:", loss)

Initial loss: 8.495902981763878
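
<p>
To put the initial loss in context, the sketch below evaluates the same <code>cross_entropy</code> function on two hand-made cases: a uniform prediction and a prediction that assigns zero probability to the true class. The arrays <code>y_demo</code>, <code>uniform</code>, and <code>all_wrong</code> are hypothetical inputs, not notebook data.
</p>

```python
import numpy as np

def cross_entropy(y_true, y_pred):
    # clip to avoid log(0)
    y_pred_clipped = np.clip(y_pred, 1e-15, 1 - 1e-15)
    log_probs = -np.sum(y_true * np.log(y_pred_clipped), axis=1)
    return np.mean(log_probs)

y_demo = np.eye(3)                      # one sample per class, one-hot
uniform = np.full((3, 3), 1.0 / 3.0)    # "no information" prediction

baseline = cross_entropy(y_demo, uniform)
print("Uniform-guess loss:", baseline, "ln(3):", np.log(3))

# Probability 0 on the true class: the clip caps the loss at -log(1e-15)
all_wrong = np.roll(np.eye(3), 1, axis=1)
worst = cross_entropy(y_demo, all_wrong)
print("Clipped worst-case loss:", worst)
```

<p>
An initial loss well above ln 3 therefore means the untrained network is doing worse than a uniform guess, which is consistent with confident wrong predictions from large random logits.
</p>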


<h3>Step 5 — Training Loop (Forward, Backward, Update)</h3>
<p>
We train the network using full-batch gradient descent for 10,000 epochs with a learning rate of 0.01. 
Each epoch performs three main steps:
</p>
<ol>
  <li>
    <strong>Forward pass</strong>:
    <ul>
      <li>Compute hidden activations: <code>hidden = ReLU(X·W1 + b1)</code></li>
      <li>Compute output probabilities: <code>output = softmax(hidden·W2 + b2)</code></li>
      <li>Evaluate cross-entropy loss against true labels.</li>
    </ul>
  </li>
  <li>
    <strong>Backward pass (backpropagation)</strong>:
    <ul>
      <li>Error at output: <code>dZ2 = output - y_true</code></li>
      <li>Gradients for output weights/biases: <code>dW2, db2</code></li>
      <li>Propagate error back to hidden layer: <code>dH = dZ2·W2<sup>T</sup></code></li>
      <li>Apply ReLU derivative: zero out gradients where <code>hidden ≤ 0</code></li>
      <li>Gradients for input→hidden weights/biases: <code>dW1, db1</code></li>
    </ul>
  </li>
  <li>
    <strong>Parameter update</strong>:
    <ul>
      <li>Weights and biases are updated with gradient descent:</li>
      <li><code>W ← W - η·dW</code>, <code>b ← b - η·db</code></li>
      <li>(η = learning rate = 0.01)</li>
    </ul>
  </li>
</ol>
<p>
We print the loss every 100 epochs to monitor convergence. Over time, 
the loss should decrease as the network learns to classify the Iris dataset.
</p>
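
<p>
The backward-pass formulas listed above can be sanity-checked against finite differences. The following is a sketch under stated assumptions: it uses small synthetic data rather than Iris, scaled random weights, and the hypothetical helpers <code>loss_fn</code> and <code>numerical_grad</code>. It only checks <code>dW1</code> and <code>dW2</code>.
</p>

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

def cross_entropy(y_true, y_pred):
    y_pred_clipped = np.clip(y_pred, 1e-15, 1 - 1e-15)
    return np.mean(-np.sum(y_true * np.log(y_pred_clipped), axis=1))

# Synthetic stand-in data (assumption: 8 samples, same layer sizes as above)
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 4))
Y = np.eye(3)[rng.integers(0, 3, size=8)]
W1 = rng.standard_normal((4, 12)) * 0.5
b1 = np.zeros((1, 12))
W2 = rng.standard_normal((12, 3)) * 0.5
b2 = np.zeros((1, 3))
n = X.shape[0]

def loss_fn():
    hidden = relu(X @ W1 + b1)
    return cross_entropy(Y, softmax(hidden @ W2 + b2))

# Analytic gradients, same formulas as the training loop
hidden = relu(X @ W1 + b1)
output = softmax(hidden @ W2 + b2)
dZ2 = output - Y
dW2 = hidden.T @ dZ2 / n
dH = dZ2 @ W2.T
dH[hidden <= 0] = 0
dW1 = X.T @ dH / n

def numerical_grad(param, eps=1e-6):
    """Central-difference gradient of loss_fn with respect to param (edited in place)."""
    grad = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        old = param[idx]
        param[idx] = old + eps
        plus = loss_fn()
        param[idx] = old - eps
        minus = loss_fn()
        param[idx] = old  # restore the original value
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

max_err_W1 = np.max(np.abs(numerical_grad(W1) - dW1))
max_err_W2 = np.max(np.abs(numerical_grad(W2) - dW2))
print("max |numeric - analytic| for W1:", max_err_W1)
print("max |numeric - analytic| for W2:", max_err_W2)
```

<p>
Differences close to floating-point noise suggest the analytic gradients match the loss. A large mismatch would point to an error in one of the backward-pass steps.
</p>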

In [6]:
n_samples = X_train.shape[0]
learning_rate = 0.01

for epoch in range(10000):
    # forward
    hidden = relu(np.dot(X_train, W1) + b1)
    output = softmax(np.dot(hidden, W2) + b2)

    # loss
    loss = cross_entropy(y_train_oh, output)

    # backward
    dZ2 = output - y_train_oh
    dW2 = np.dot(hidden.T, dZ2) / n_samples
    db2 = np.sum(dZ2, axis=0, keepdims=True) / n_samples

    dH = np.dot(dZ2, W2.T)
    dH[hidden <= 0] = 0
    dW1 = np.dot(X_train.T, dH) / n_samples
    db1 = np.sum(dH, axis=0, keepdims=True) / n_samples

    # update
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2

    if epoch % 100 == 0:
        print(f"Epoch {epoch}, Loss: {loss:.4f}")


Epoch 0, Loss: 8.4959
Epoch 100, Loss: 0.2326
Epoch 200, Loss: 0.2037
Epoch 300, Loss: 0.1850
Epoch 400, Loss: 0.1711
Epoch 500, Loss: 0.1601
Epoch 600, Loss: 0.1511
Epoch 700, Loss: 0.1437
Epoch 800, Loss: 0.1375
Epoch 900, Loss: 0.1321
Epoch 1000, Loss: 0.1275
Epoch 1100, Loss: 0.1234
Epoch 1200, Loss: 0.1198
Epoch 1300, Loss: 0.1165
Epoch 1400, Loss: 0.1136
Epoch 1500, Loss: 0.1110
Epoch 1600, Loss: 0.1086
Epoch 1700, Loss: 0.1064
Epoch 1800, Loss: 0.1044
Epoch 1900, Loss: 0.1026
Epoch 2000, Loss: 0.1009
Epoch 2100, Loss: 0.0993
Epoch 2200, Loss: 0.0978
Epoch 2300, Loss: 0.0965
Epoch 2400, Loss: 0.0952
Epoch 2500, Loss: 0.0940
Epoch 2600, Loss: 0.0928
Epoch 2700, Loss: 0.0918
Epoch 2800, Loss: 0.0908
Epoch 2900, Loss: 0.0898
Epoch 3000, Loss: 0.0889
Epoch 3100, Loss: 0.0881
Epoch 3200, Loss: 0.0873
Epoch 3300, Loss: 0.0865
Epoch 3400, Loss: 0.0858
Epoch 3500, Loss: 0.0851
Epoch 3600, Loss: 0.0844
Epoch 3700, Loss: 0.0838
Epoch 3800, Loss: 0.0832
Epoch 3900, Loss: 0.0826
Epoch 4000, 

<h3>Step 6 — Model Evaluation</h3>
<p>
After training, we evaluate the network on the test set to measure how well it generalizes. 
The steps are:
</p>
<ol>
  <li>
    <strong>Forward pass on test data</strong>:
    <ul>
      <li>Compute hidden activations with ReLU.</li>
      <li>Compute output probabilities with softmax.</li>
    </ul>
  </li>
  <li>
    <strong>Predictions</strong>:
    <ul>
      <li>Choose the class with the highest probability for each sample: <code>np.argmax(output, axis=1)</code>.</li>
    </ul>
  </li>
  <li>
    <strong>Accuracy</strong>:
    <ul>
      <li>Compare predicted classes with true labels.</li>
      <li>Compute the fraction of correct predictions, <code>accuracy = correct / total</code>, and report it as a percentage.</li>
    </ul>
  </li>
</ol>
<p>
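This gives us a quantitative measure (accuracy score) of how well our trained model performs 
on unseen Iris samples. With only 30 test samples, each prediction is worth about 3.3 percentage points, so treat the score as a coarse estimate.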
</p>

In [7]:
# Forward pass on test data
y_test_hidden = relu(np.dot(X_test, W1) + b1)
y_test_pred = softmax(np.dot(y_test_hidden, W2) + b2)

# Predictions: pick class with highest probability
y_pred = np.argmax(y_test_pred, axis=1)
y_true = y_test.flatten()

# Loss
loss = cross_entropy(y_test_oh, y_test_pred)
print(f"Test Loss: {loss:.4f}")

# Accuracy
accuracy = np.mean(y_true == y_pred)
print(f"Test Accuracy: {accuracy * 100:.2f}%")

Test Loss: 0.0616
Test Accuracy: 100.00%
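
<p>
Accuracy alone does not show which classes get confused with each other. As a possible extension of the evaluation step, the sketch below builds a confusion matrix with NumPy. The helper <code>confusion_matrix_np</code> and the demo arrays are hypothetical; in the notebook you would pass <code>y_true</code> and <code>y_pred</code> from the cell above instead.
</p>

```python
import numpy as np

def confusion_matrix_np(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    # np.add.at accumulates correctly even when index pairs repeat
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

# Made-up labels for illustration only (not notebook results)
y_true_demo = np.array([0, 0, 1, 1, 2, 2])
y_pred_demo = np.array([0, 0, 1, 2, 2, 2])

cm = confusion_matrix_np(y_true_demo, y_pred_demo, 3)
accuracy_demo = np.trace(cm) / cm.sum()          # correct / total
recall_per_class = np.diag(cm) / cm.sum(axis=1)  # correct / true count per class

print(cm)
print("accuracy:", accuracy_demo)
print("per-class recall:", recall_per_class)
```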
