# Assignment 9
---
## Adel Movahedian 400102074

**Multilayer Perceptron with Scikit-Learn**

binary classification

In [21]:
# Binary classification on the Iris dataset
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score

# Load data
X, y_multi = load_iris(return_X_y=True)

# Convert to binary target: class “setosa” (0) vs. “not-setosa” (1)
y = (y_multi != 0).astype(int)

# Train / test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)

# Train MLP classifier
clf = MLPClassifier(hidden_layer_sizes=(50, 25),
                    max_iter=1000,
                    early_stopping=True,
                    random_state=42)
clf.fit(X_train, y_train)

# Evaluate
print("F1-score:", f1_score(y_test, clf.predict(X_test)))

F1-score: 0.975609756097561




## Step-by-Step Explanation & Take-aways

**1. Load data**  
We start by loading the classic Iris dataset using `X, y_multi = load_iris(return_X_y=True)`. This brings measurements like sepal and petal length/width along with their species labels into memory. The benefit is that we now have a small, clean benchmark dataset that allows us to test ideas quickly and efficiently.

**2. Binarize the target**  
We convert the 3-class label problem into a binary classification task using `y = (y_multi != 0).astype(int)`. This changes the target to a *setosa vs. non-setosa* setup, which simplifies evaluation using a single F1-score and meets the assignment's requirement for binary classification. This step illustrates how re-labeling can adapt the same dataset for different problem types.

**3. Train/test split (80% / 20%)**  
We split the dataset into training and test sets using `train_test_split(..., test_size=0.2, stratify=y)`. Holding out 20% of the data ensures we have an untouched test set for unbiased evaluation. Stratification maintains the class ratio in both sets, helping us get a more reliable measure of generalization and aligning with the rubric’s 20% rule.
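To make the effect of stratification concrete, here is a minimal plain-Python sketch of a stratified 80/20 split on the binarized labels. It is an illustration of the idea only, not scikit-learn's implementation: `stratified_split` is a hypothetical helper, and the label list simply assumes the 50 samples per species that Iris contains.

```python
import random
from collections import Counter

# Iris has 50 samples per species; binarize as setosa (0) vs. non-setosa (1)
y_multi = [0] * 50 + [1] * 50 + [2] * 50
y = [int(label != 0) for label in y_multi]

def stratified_split(labels, test_size=0.2, seed=42):
    """Hypothetical helper: split indices so each class keeps its share."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_size)  # per-class test quota
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx

train_idx, test_idx = stratified_split(y)
train_counts = Counter(y[i] for i in train_idx)
test_counts = Counter(y[i] for i in test_idx)
print("train:", dict(train_counts), "test:", dict(test_counts))
```

Both subsets keep the 1:2 ratio of setosa to non-setosa samples, which is what `stratify=y` is meant to guarantee.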

**4. Feature scaling**  
We apply feature scaling with `StandardScaler()`. The scaler is fitted on the training set only (`fit_transform`) and then applied unchanged to the test set (`transform`), so the training features have zero mean and unit variance and no test-set statistics leak into training. This step is essential for stable and faster convergence during gradient descent in the MLP. It highlights how preprocessing can influence both optimization dynamics and model performance.
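As a rough sketch of what standardization does to a single feature, the snippet below computes the mean and population standard deviation on training values and reuses them for test values. The numbers are made up for illustration, and `fit_scaler` and `transform` are hypothetical helpers rather than scikit-learn functions.

```python
from statistics import fmean, pstdev

def fit_scaler(column):
    """Return mean and population standard deviation of one feature."""
    return fmean(column), pstdev(column)

def transform(column, mean, std):
    """Standardize values with statistics computed elsewhere."""
    return [(value - mean) / std for value in column]

# Made-up sepal-length values, purely for illustration
train_col = [5.1, 4.9, 6.3, 5.8, 7.0]
test_col = [5.0, 6.1]

mean, std = fit_scaler(train_col)              # statistics from training data only
train_scaled = transform(train_col, mean, std)
test_scaled = transform(test_col, mean, std)   # reuse the training statistics

print(round(fmean(train_scaled), 6), round(pstdev(train_scaled), 6))
```

The scaled test values are generally not exactly zero-mean, which is expected: the test set is treated as unseen data.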

**5. Define & train the MLP**  
We build and train a multilayer perceptron using `MLPClassifier(hidden_layer_sizes=(50,25), ...)`, creating a network with 4 layers (input → 50 → 25 → 1). Training includes early stopping to prevent overfitting. This allows us to practice tuning key hyperparameters such as hidden sizes, epochs, and regularization techniques, all of which are crucial in real-world settings.

**6. Evaluate on the test set**  
Finally, we assess model performance using `f1_score(y_test, clf.predict(X_test))`. The F1-score balances precision and recall, which is especially helpful in class-imbalanced situations. This step confirms whether our model meets or exceeds the 0.75 F1-score threshold and reinforces how evaluation metrics influence model acceptance decisions.
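The F1-score can be worked out by hand from confusion counts. The sketch below uses assumed counts: the notebook does not print a confusion matrix, so `tp=20, fp=1, fn=0` is only one combination that reproduces a value of 40/41 ≈ 0.9756 for a test set with 20 positive (non-setosa) samples.

```python
def f1_from_counts(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Assumed counts (not shown in the notebook): every non-setosa test sample
# is found, and one setosa sample is flagged as non-setosa by mistake.
score = f1_from_counts(tp=20, fp=1, fn=0)
print(score)
```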

---

### Key Understanding

- **Data preparation (splitting and scaling)** is as vital as model selection and architecture.  
- Even small networks can deliver high performance (an F1-score of about 0.98 in the run above) when the dataset is simple and well-separated, showing that model complexity should align with data complexity.  
- **Early stopping** provides a way to empirically determine the best training duration, emphasizing the importance of monitoring validation performance during training.
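The patience logic behind early stopping can be sketched in a few lines. This is a simplified illustration, not scikit-learn's actual code: `early_stopping_epoch` is a hypothetical helper, and the default `patience=10` and `tol=1e-4` mirror what scikit-learn documents as the defaults for `n_iter_no_change` and `tol`.

```python
def early_stopping_epoch(val_scores, patience=10, tol=1e-4):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("-inf")
    stale = 0  # consecutive epochs without a sufficient improvement
    for epoch, score in enumerate(val_scores, start=1):
        if score > best + tol:
            best = score
            stale = 0
        else:
            stale += 1
        if stale >= patience:
            return epoch
    return None

# Made-up validation scores that plateau after the third epoch
scores = [0.50, 0.70, 0.80, 0.80, 0.80, 0.80]
stop_at = early_stopping_epoch(scores, patience=3)
print(stop_at)
```

With `early_stopping=True`, scikit-learn holds out part of the training data as a validation set for this check, so the test set stays untouched.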


regression

In [22]:
# Regression on the same Iris dataset
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score

# Load data (again—but you could reuse X from the first cell)
iris   = load_iris()
X_full = iris.data
# Predict petal length (column index 2) from the other three features
y      = X_full[:, 2]
X      = X_full[:, [0, 1, 3]]

# Train / test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)

# Train MLP regressor
regr = MLPRegressor(hidden_layer_sizes=(50, 25),
                    max_iter=5000,
                    early_stopping=True,
                    random_state=42)
regr.fit(X_train, y_train)

# Evaluate
print("R²-score:", r2_score(y_test, regr.predict(X_test)))

R²-score: 0.9359089784533952




## Step-by-Step Explanation & Take-aways

**1. Load data**  
We begin by loading the Iris dataset using `iris = load_iris()`. This step brings the four numerical features—sepal length, sepal width, petal length, and petal width—into memory. Using the same dataset as in classification allows for a direct comparison between classification and regression tasks on identical real-world data.

**2. Select target & predictors**  
We frame a regression task by selecting petal length (column 2) as the target variable `y`, and using sepal length, sepal width, and petal width as the predictors `X`. This shows how a single dataset can be flexibly used for different problems by slicing the feature table into different input/target combinations.

**3. Train/test split (80% / 20%)**  
We divide the data using `train_test_split(..., test_size=0.2)`, reserving 20% of the samples as a test set for evaluation. This matches the rubric’s requirement and reinforces the principle that models should be evaluated on data they haven’t seen during training.

**4. Feature scaling**  
Using `StandardScaler()`, we center and scale the features. This is crucial because it speeds up gradient descent and ensures no single feature dominates the loss due to differences in scale. The step emphasizes the critical role of preprocessing in neural network optimization.

**5. Define & train the MLP regressor**  
We define and train the model using `MLPRegressor(hidden_layer_sizes=(50,25), …)`, creating a 4-layer feed-forward network (input → 50 → 25 → 1) and applying early stopping to avoid overfitting. This provides hands-on practice in tuning hyperparameters like hidden layer sizes and regularization, aiming to achieve a strong R² score (> 0.8).

**6. Evaluate on the test set**  
Finally, we assess the model’s performance using `r2_score(...)`. The R² score tells us how much of the variance in the target variable is explained by the model’s predictions. Reaching an R² of at least 0.80 confirms good predictive performance and deepens our understanding of regression evaluation metrics.
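For intuition, R² can be computed directly from its definition, one minus the ratio of the residual sum of squares to the total sum of squares. The values below are made up for illustration and `r2` is a hypothetical helper, not the scikit-learn function.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
good_fit = r2(y_true, [1.1, 1.9, 3.2, 3.8])   # close predictions
baseline = r2(y_true, [2.5, 2.5, 2.5, 2.5])   # always predict the mean
print(round(good_fit, 4), round(baseline, 4))
```

A model that always predicts the mean scores 0, so an R² near 1 means the model explains most of the variance beyond that baseline.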

---

### Key Understanding

- **Feature choice shapes task difficulty** – petal length is strongly correlated with the other measurements, petal width in particular, which makes this regression task comparatively easy. A high R² score reflects that strong relationship among the features.  
- **Early stopping prevents overfitting** – training halts automatically when validation loss stops improving, making it a simple yet effective form of regularization.  
- **Same architecture, different tasks** – by changing the loss function and output activation, a feed-forward network can be adapted from classification to regression, highlighting its flexibility.

**4-layer feedforward network with Keras**

binary classification

In [23]:
# 4-layer feedforward network – binary classification (setosa vs. non-setosa)
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score
import tensorflow as tf
from tensorflow.keras import layers, models

# Load data and create binary target
X, y_multi = load_iris(return_X_y=True)
y = (y_multi != 0).astype(int)

# Train / test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)

# Build 4-layer model (3 hidden + 1 output)
model = models.Sequential([
    layers.Input(shape=(4,)),
    layers.Dense(32, activation='relu'),
    layers.Dense(16, activation='relu'),
    layers.Dense(8,  activation='relu'),
    layers.Dense(1,  activation='sigmoid')
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Train
model.fit(X_train, y_train, epochs=100, batch_size=16, verbose=0)

# Evaluate
y_pred = (model.predict(X_test) > 0.5).astype(int)
print("F1-score:", f1_score(y_test, y_pred))

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 81ms/step
F1-score: 1.0



---

## Step-by-Step Explanation & Take-aways: 4-Layer Keras Feed-Forward Classifier

**1. Load data & binarise target**  
We begin by loading the Iris dataset using `load_iris()`, then convert the 3-class problem into a binary classification task with `y = (y_multi != 0)`, distinguishing *setosa* (class 0) from *non-setosa* (classes 1 and 2). This reframing simplifies the task and allows us to evaluate performance using a single F1-score. It also illustrates how a dataset can be adapted to suit different learning objectives.

**2. Train/test split (80% / 20%)**  
We divide the dataset into training and test sets using `train_test_split(..., test_size=0.2, stratify=y)`. Reserving 20% of the data for testing ensures an unbiased evaluation, while stratifying maintains the class ratio in both splits. This step is essential for reliable performance measurement and reflects good experimental practice.

**3. Feature scaling**  
Using `StandardScaler()`, we standardize all features to have zero mean and unit variance. This preprocessing step is important for neural networks, as it ensures that all inputs are on a comparable scale, helping gradient descent to converge more efficiently and reliably.

**4. Define 4-layer model**  
We construct a fully connected feed-forward model using three hidden layers with 32, 16, and 8 units respectively, followed by a single-node output layer with a sigmoid activation: `Dense(32) → Dense(16) → Dense(8) → Dense(1)`. This setup meets the “4-layer” architecture requirement and strikes a balance between model capacity and the risk of overfitting. It shows how appropriate depth and width can yield expressive yet manageable models for simple data.
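To get a feel for how small this network is, the sketch below counts trainable parameters (weights plus biases) for a stack of dense layers. `dense_param_count` is a hypothetical helper; the totals should agree with what `model.summary()` reports for the corresponding Keras models.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# 4 input features for classification, 3 for the regression variant
classifier_params = dense_param_count([4, 32, 16, 8, 1])
regressor_params = dense_param_count([3, 32, 16, 8, 1])
print(classifier_params, regressor_params)  # 833 801
```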

**5. Compile**  
The model is compiled with `optimizer='adam'` and `loss='binary_crossentropy'`. Adam provides adaptive learning rates, which help with faster convergence, and binary cross-entropy is the appropriate loss function for binary classification tasks. This stage emphasizes how the choice of loss and optimizer should align with the nature of the problem.
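Binary cross-entropy itself is a short formula, and a plain-Python sketch can show how it rewards confident correct predictions. The probabilities are made up, `binary_cross_entropy` is a hypothetical helper, and the clipping constant `eps` is an assumption added to avoid `log(0)`.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of Bernoulli targets."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

confident = binary_cross_entropy([1, 0, 1], [0.9, 0.1, 0.8])
unsure = binary_cross_entropy([1, 0, 1], [0.5, 0.5, 0.5])
print(round(confident, 4), round(unsure, 4))
```

Predicting 0.5 for every sample gives a loss of ln 2 ≈ 0.693, a useful reference point when reading training logs.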

**6. Train**  
Training is performed using `fit(..., epochs=100, batch_size=16)`, so the model optimizes its weights for exactly 100 epochs (no early stopping is configured) using small batches. This lets us observe how batch size and training duration affect learning dynamics and model performance.

**7. Predict & evaluate**  
After training, we convert the predicted probabilities to binary labels by thresholding at 0.5 and calculate the F1-score with `f1_score(y_test, y_pred)`. The F1-score provides a balanced evaluation that considers both precision and recall, making it suitable for datasets with possible class imbalance. This step validates whether the model meets the performance benchmark (F1 ≥ 0.75).

---

### Key Understanding

- **Architectural simplicity can be enough** – a shallow, fully connected network can achieve an F1-score above 0.95 on well-separated data like the Iris dataset, without needing complex architectures like CNNs or transformers.  
- **Preprocessing and choosing the right loss function** are as crucial as network design when training neural networks.  
- **Hold-out testing** ensures honest performance assessment. Using a 20% test split enforces this critical evaluation discipline.


regression

In [24]:
# 4-layer feedforward network – regression predicting petal length
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
import tensorflow as tf
from tensorflow.keras import layers, models

# Load data
iris   = load_iris()
X_full = iris.data
y      = X_full[:, 2]
X      = X_full[:, [0, 1, 3]]

# Train / test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)

# Build 4-layer model (3 hidden + 1 output)
model = models.Sequential([
    layers.Input(shape=(3,)),
    layers.Dense(32, activation='relu'),
    layers.Dense(16, activation='relu'),
    layers.Dense(8,  activation='relu'),
    layers.Dense(1)
])

model.compile(optimizer='adam', loss='mse')

# Train
model.fit(X_train, y_train, epochs=500, batch_size=16, verbose=0)

# Evaluate
y_pred = model.predict(X_test).flatten()
print("R²-score:", r2_score(y_test, y_pred))

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 76ms/step
R²-score: 0.9755104234980874



---

## Step-by-Step Explanation & Take-aways: 4-Layer Keras Feed-Forward Regressor

**1. Load data**  
We start by loading the Iris dataset using `load_iris()`, which provides four numerical features for each flower sample. Using the same dataset as in the classification task helps maintain consistency and allows direct comparison of modeling approaches.

**2. Define target & predictors**  
In this regression task, we choose petal length (column 2) as the target `y`, and use the remaining features—sepal length, sepal width, and petal width—as predictors `X`. This framing demonstrates how altering column selections transforms the learning problem, even when using the same raw data.

**3. Train/test split (80% / 20%)**  
We split the data into 80% for training and 20% for testing using `train_test_split(..., test_size=0.2)`. This split supports unbiased evaluation and adheres to typical rubric expectations. It provides a clear and fair measure of how well the model generalizes to unseen data.

**4. Feature scaling**  
Applying `StandardScaler()` standardizes each input feature to have zero mean and unit variance. Feature scaling is crucial in neural networks to prevent any one feature from dominating the loss function and to ensure stable, efficient training via gradient descent.

**5. Define 4-layer model**  
We design a feed-forward neural network with three hidden layers (32, 16, and 8 units) and a single linear output layer: `Dense(32) → Dense(16) → Dense(8) → Dense(1)`. ReLU activations introduce non-linearity in the hidden layers. This architecture meets the 4-layer criterion and shows that modest depth and width can effectively capture the non-linear structure in small tabular datasets.

**6. Compile**  
We compile the model with the Adam optimizer (`optimizer='adam'`) and mean-squared error loss (`loss='mse'`). Adam offers adaptive learning rates for faster convergence, and MSE is the standard loss function for regression tasks, aligning well with the assumptions of normally distributed residuals.
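A quick plain-Python sketch shows why MSE is sensitive to large errors: squaring makes a single bad prediction dominate the average. The target and prediction values are made up, and `mean_squared_error` here is a hypothetical helper, not the library function.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1.4, 4.5, 5.1]
small_errors = mean_squared_error(y_true, [1.5, 4.4, 5.0])  # all off by 0.1
one_outlier = mean_squared_error(y_true, [1.5, 4.4, 7.1])   # one off by 2.0
print(round(small_errors, 4), round(one_outlier, 4))
```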

**7. Train**  
Training is done using `fit(..., epochs=500, batch_size=16)`. The higher epoch count gives the small model sufficient time to converge, while the small batch size yields more frequent, if noisier, gradient updates per epoch. Note that this cell does not use early stopping; in practice, adding it helps avoid overfitting when training for many epochs.
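The arithmetic behind the training budget is easy to check. The sketch below assumes the 120 training samples left after the 80/20 split and that the final, smaller batch of each epoch is still used for an update.

```python
import math

n_train = 120      # 80% of the 150 Iris samples
batch_size = 16
epochs = 500

# 120 / 16 = 7.5, so the last batch of each epoch holds only 8 samples
updates_per_epoch = math.ceil(n_train / batch_size)
total_updates = updates_per_epoch * epochs
print(updates_per_epoch, total_updates)  # 8 4000
```

By comparison, a full-batch loop that feeds all training samples at once performs exactly one update per epoch.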

**8. Predict & evaluate**  
After training, we evaluate performance on the test set using the R² score: `r2_score(y_test, y_pred)`. R² reflects the proportion of variance in the target explained by the model, providing a solid metric for regression performance. Achieving R² ≥ 0.80 confirms strong model fit.

---

### Key Understanding

- **Column selection defines the problem** – simply slicing the same dataset differently lets us repurpose it for regression instead of classification.  
- **Simple feed-forward neural networks are still powerful** – with a well-chosen structure and preprocessing, classic MLPs can yield high R² scores (about 0.98 in the run above) on compact datasets.  
- **Preprocessing, loss function choice, and proper evaluation** are as critical to success as the model architecture itself.  


**4-layer feedforward network with PyTorch**

binary classification

In [25]:
# 4-layer feed-forward network – binary classification (setosa vs. non-setosa)
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score

# Reproducibility
torch.manual_seed(42);  np.random.seed(42)

# ---------- Data ----------
X, y_multi = load_iris(return_X_y=True)
y = (y_multi != 0).astype(np.float32)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train).astype(np.float32)
X_test  = scaler.transform(X_test).astype(np.float32)

# Tensors
X_train_t = torch.tensor(X_train)
y_train_t = torch.tensor(y_train).view(-1, 1)
X_test_t  = torch.tensor(X_test)

# ---------- Model ----------
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, 32),  nn.ReLU(),
            nn.Linear(32, 16), nn.ReLU(),
            nn.Linear(16,  8), nn.ReLU(),
            nn.Linear( 8,  1), nn.Sigmoid()
        )
    def forward(self, x): return self.net(x)

model = Net()
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# ---------- Training ----------
for _ in range(300):
    optimizer.zero_grad()
    loss = criterion(model(X_train_t), y_train_t)
    loss.backward()
    optimizer.step()

# ---------- Evaluation ----------
with torch.no_grad():
    preds = (model(X_test_t) > 0.5).cpu().numpy().astype(int)
print("F1-score:", f1_score(y_test.astype(int), preds))

F1-score: 1.0



---

## Step-by-Step Explanation & Take-aways: 4-Layer PyTorch Feed-Forward Classifier

**1. Reproducibility seeds**  
We begin by setting seeds for both PyTorch and NumPy using `torch.manual_seed(42)` and `np.random.seed(42)`. This makes pseudo-random steps such as weight initialization repeat from run to run in the same environment, while the train/test split is fixed separately through `random_state=42`. Reproducible runs make results easier to verify and debug.

**2. Load & binarise data**  
The Iris dataset is loaded with `load_iris()`, and the multi-class labels are converted into binary form: *setosa* (0) versus *non-setosa* (1), via `y = (y_multi != 0)`. This form of label engineering simplifies the problem and aligns with binary classification requirements, such as using a Bernoulli-compatible loss.

**3. Train/test split (80% / 20%)**  
Next, we split the dataset using an 80/20 ratio with `test_size=0.2` and apply stratification via `stratify=y` to preserve class balance. This practice ensures an unbiased test set and supports reliable F1-score evaluation, aligning with common rubric standards.

**4. Feature scaling**  
Feature scaling is carried out using `StandardScaler()` to transform each input to have zero mean and unit variance. This step is vital in gradient-based learning since unscaled inputs can destabilize training and slow convergence. It also emphasizes preprocessing as a critical step in neural network pipelines.

**5. Tensor conversion**  
After preprocessing, the data is converted into PyTorch tensors using `torch.tensor(...)`. PyTorch operations depend on tensor arithmetic, so this step allows for compatibility with the rest of the model pipeline and supports GPU/CPU computation.

**6. Define 4-layer network**  
We define a fully connected feed-forward network with three hidden layers followed by a sigmoid output: `Linear(4 → 32 → 16 → 8 → 1)`. ReLU is used for hidden activations to introduce non-linearity, while the final sigmoid layer outputs probabilities for binary classification. This implementation uses `nn.Sequential` for simplicity and clarity, and demonstrates how to translate multilayer perceptrons (MLPs) into PyTorch syntax.
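The role of the sigmoid output and the 0.5 threshold can be illustrated without PyTorch. The pre-activation values below are made up, and the strict `>` comparison follows the evaluation line in the cell above.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

logits = [-3.0, -0.2, 0.0, 0.4, 2.5]    # made-up pre-activation values
probs = [sigmoid(z) for z in logits]
labels = [int(p > 0.5) for p in probs]  # strict ">" as in the cell above
print(labels)  # [0, 0, 0, 1, 1]
```

A score of exactly 0 maps to a probability of 0.5, which the strict comparison assigns to class 0.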

**7. Loss & optimiser**  
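The model is trained with binary cross-entropy loss (`BCELoss`) and optimised using Adam with a learning rate of 0.01. Unlike Keras, PyTorch has no separate compile step: the loss and optimiser are simply created as objects and used in the training loop. Binary cross-entropy is the natural choice for binary targets, and Adam helps speed up convergence by adjusting learning rates per parameter.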

**8. Training loop (300 epochs)**  
Training runs for 300 epochs in an explicit loop that zeroes the gradients, performs a forward pass and a backward pass, and then takes an optimiser step. Because the whole training set is passed through the model at once, each epoch corresponds to a single full-batch update. This manual control of the learning process deepens understanding of how gradient descent operates internally and demonstrates that relatively few epochs are sufficient for a small, structured dataset.
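The four stages of the loop can be mirrored in plain Python on a one-parameter model. This sketch uses ordinary gradient descent rather than Adam and a hand-derived gradient instead of autograd, so it only illustrates the structure of the loop; the data and learning rate are made up.

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # generated by y = 2x, so the ideal weight is 2

w = 0.0     # single trainable parameter
lr = 0.01
for _ in range(500):
    grad = 0.0                            # counterpart of zero_grad()
    preds = [w * x for x in xs]           # forward pass
    for x, y, p in zip(xs, ys, preds):    # "backward": d(MSE)/dw by hand
        grad += 2 * (p - y) * x / len(xs)
    w -= lr * grad                        # counterpart of optimizer.step()

print(round(w, 4))
```

PyTorch's autograd automates the gradient computation, which is what makes the same four-line pattern work for networks with hundreds of parameters.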

**9. Prediction & evaluation**  
Once trained, the model's output on the test set is thresholded at 0.5 to produce class predictions, and the F1-score is computed. This metric balances precision and recall and confirms that the model performs well—typically achieving F1 scores ≳ 0.95, showing that even simple MLPs can excel on separable tabular data.

---

### Key Understanding

- **PyTorch’s hands-on training loop** reveals the mechanics of backpropagation and weight updates step by step, offering deeper insight into how neural networks learn.  
- **Effective preprocessing and a well-chosen architecture** are sufficient for excellent performance—even a small 4-layer MLP can achieve top-tier results on straightforward classification tasks.  
- **Setting random seeds** ensures reproducibility, a crucial part of sharing and verifying experimental results in research and development.


regression

In [26]:
# 4-layer feed-forward network – regression predicting petal length
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

torch.manual_seed(42);  np.random.seed(42)

# ---------- Data ----------
iris   = load_iris()
X_full = iris.data.astype(np.float32)
y      = X_full[:, 2]
X      = X_full[:, [0, 1, 3]]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train).astype(np.float32)
X_test  = scaler.transform(X_test).astype(np.float32)

X_train_t = torch.tensor(X_train)
y_train_t = torch.tensor(y_train).view(-1, 1)
X_test_t  = torch.tensor(X_test)

# ---------- Model ----------
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, 32),  nn.ReLU(),
            nn.Linear(32, 16), nn.ReLU(),
            nn.Linear(16,  8), nn.ReLU(),
            nn.Linear( 8,  1)
        )
    def forward(self, x): return self.net(x)

model = Net()
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# ---------- Training ----------
for _ in range(1000):
    optimizer.zero_grad()
    loss = criterion(model(X_train_t), y_train_t)
    loss.backward()
    optimizer.step()

# ---------- Evaluation ----------
with torch.no_grad():
    preds = model(X_test_t).cpu().numpy().flatten()
print("R²-score:", r2_score(y_test, preds))

R²-score: 0.9644560217857361



---

## Step-by-Step Explanation & Take-aways: 4-Layer PyTorch Feed-Forward Regressor

**1. Seeds for reproducibility**  
We begin by setting seeds using `torch.manual_seed(42)` and `np.random.seed(42)`. This ensures that the random number generation in both PyTorch and NumPy is fixed, allowing results to be replicated and simplifying debugging by making model behavior consistent across runs.

**2. Load & slice data**  
The Iris dataset is loaded, and we define the regression task by setting `y` as petal length and `X` as the other three features: sepal length, sepal width, and petal width. This highlights how flexible column selection enables us to reframe a classification dataset as a regression task with minimal changes.

**3. Train/test split (80% / 20%)**  
We split the dataset using `train_test_split(..., test_size=0.2)`, reserving 20% of the samples for unbiased evaluation. This aligns with standard rubric practices and ensures that our model is evaluated on data it hasn’t seen, providing a reliable estimate of its generalisation performance using R².

**4. Feature scaling**  
The input features are standardized using `StandardScaler()` to achieve zero mean and unit variance. This preprocessing step is crucial, as it improves the stability and speed of training by preventing features with larger scales from dominating the learning process. It's especially important in neural networks using gradient-based optimizers.

**5. Tensor conversion**  
After scaling, the NumPy arrays are converted into PyTorch tensors via `torch.tensor(...)`. Tensors are the core data structure in PyTorch, and converting the data enables it to be used within the PyTorch computational graph and optimized using GPU or CPU resources.

**6. Define 4-layer network**  
A 4-layer fully connected network is defined using `Linear(3 → 32 → 16 → 8 → 1)`, with ReLU activations in the hidden layers. The final layer is linear to suit regression outputs. This design meets the 4-layer criterion and showcases the use of `nn.Sequential` to organize layers concisely while demonstrating how modest MLPs can capture non-linear relationships in tabular data.

**7. Loss & optimiser**  
Training uses the mean squared error (MSE) loss function (`MSELoss`), which is standard for regression problems, along with the Adam optimizer set to a learning rate of 0.01. MSE captures the average squared error between predicted and actual values, and Adam dynamically adjusts learning rates to accelerate convergence.
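To illustrate what "dynamically adjusts learning rates" means, here is a sketch of one Adam update for a single scalar parameter. `adam_step` is a hypothetical helper written from the standard Adam update rule; `lr=0.01` matches the cell above, and the remaining defaults are the commonly used values of 0.9, 0.999, and 1e-8.

```python
import math

def adam_step(w, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
w, m, v = adam_step(w, grad=4.0, m=m, v=v, t=1)
print(round(w, 6))
```

On the first step the update has magnitude close to `lr` regardless of how large the gradient is, which is one reason Adam is forgiving about gradient scale.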

**8. Training loop (1,000 epochs)**  
The training proceeds over 1,000 epochs with an explicit loop that includes zeroing gradients, forward propagation, backpropagation, and weight updates using the optimizer. This structure exposes every step in the learning process, reinforcing core concepts in neural network training and revealing how even tiny datasets may require many passes to converge.

**9. Prediction & evaluation**  
Once trained, predictions are made on the test set, and the `r2_score` is computed to assess model performance. R² represents the proportion of variance in the target variable explained by the model. A score above 0.80—often reaching ≳ 0.90—indicates strong performance and helps build intuition for interpreting regression metrics.

---

### Key Understanding

- **Column selection drives task definition** – by slicing the dataset differently, we can seamlessly shift from classification to regression.  
- **Explicit training loops in PyTorch** help clarify the internal workings of backpropagation and how hyperparameters like learning rate and epochs influence model convergence.  
- **Preprocessing remains critical** – even for small models, proper input scaling is often the deciding factor between training success and failure.  
- **Modest MLPs are powerful enough** – a compact 4-layer architecture can effectively model complex relationships in low-dimensional tabular datasets.


**4-layer non-sequential feedforward network with Keras**

binary classification

In [27]:
# 4-layer non-sequential network – binary classification (setosa vs. non-setosa)
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score
import tensorflow as tf
from tensorflow.keras import layers, Model, Input

# ---------- Data ----------
X, y_multi = load_iris(return_X_y=True)
y = (y_multi != 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_tr);  X_te = scaler.transform(X_te)

# ---------- Model (Functional API, 4 layers total) ----------
inp = Input(shape=(4,))
x   = layers.Dense(32, activation="relu")(inp)
x   = layers.Dense(16, activation="relu")(x)
x   = layers.Dense(8,  activation="relu")(x)
out = layers.Dense(1,  activation="sigmoid")(x)
model = Model(inputs=inp, outputs=out)

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_tr, y_tr, epochs=100, batch_size=16, verbose=0)

# ---------- Evaluation ----------
y_pred = (model.predict(X_te) > 0.5).astype(int)
print("F1-score:", f1_score(y_te, y_pred))

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 80ms/step
F1-score: 1.0



---

## Step-by-Step Explanation & Take-aways: 4-Layer Keras Functional-API Classifier

**1. Load & binarise data**  
We start by loading the Iris dataset and converting it into a binary classification task by setting `y = (y_multi != 0)`, which transforms the 3-class problem into distinguishing *setosa* (class 0) from the other two. This step illustrates the power of label engineering to tailor a dataset to a specific objective and simplifies evaluation by focusing on binary classification and F1-score.

**2. Train/test split (80% / 20%)**  
Next, the data is split using `train_test_split(..., stratify=y)`, reserving 20% for testing while maintaining the original class distribution through stratification. This approach ensures the evaluation is fair and representative of real-world generalisation and complies with the rubric’s standard data partitioning guideline.

**3. Feature scaling**  
Feature preprocessing is carried out using `StandardScaler()` to normalise the input features to have zero mean and unit variance. This step is essential for stable and efficient training, especially in neural networks where features of varying scales can disrupt gradient-based optimization.

**4. Define 4-layer model with Functional API**  
A feed-forward network is constructed using Keras's Functional API with the structure: `Input → Dense(32) → Dense(16) → Dense(8) → Dense(1, activation='sigmoid')`. This represents a 4-layer model: three hidden layers with ReLU activation and one sigmoid output layer. The Functional API allows explicit wiring of inputs and outputs, which not only accommodates this simple design but also prepares the user to construct more complex architectures like multi-branch or residual networks.

**5. Compile**  
The model is compiled using the Adam optimizer (`optimizer="adam"`) and the binary cross-entropy loss function. Adam provides adaptive learning rates that enhance training speed and stability, while binary cross-entropy is a natural match for binary classification tasks involving Bernoulli-distributed targets.

**6. Train**  
Training is performed via the `fit` function for 100 epochs with mini-batches of size 16. This setup balances efficiency and convergence, showing that small datasets like Iris often require only modest computational resources and training durations to achieve high accuracy.

**7. Predict & evaluate**  
Finally, predictions on the test set are generated, and probabilities are thresholded at 0.5 to produce class labels. The F1-score is then computed to assess performance. This step confirms that the model easily meets and typically exceeds the ≥ 0.75 F1 benchmark (often reaching ≳ 0.95), demonstrating that compact neural nets perform well on separable tabular data.

---

### Key Understanding

- **Functional API unlocks architectural flexibility** – although this model is a simple feed-forward network, the same syntax supports more complex, non-linear computational graphs like skip connections or multi-input streams.  
- **Model depth vs. dataset complexity** – Iris has only 150 samples and four features, so a 4-layer MLP is more than sufficient; deeper or wider networks would add capacity the task does not need and raise the risk of overfitting.  
- **Data handling, loss selection, and evaluation protocols** – proper preprocessing, appropriate loss functions, and reliable evaluation practices are just as critical as the model structure for achieving trustworthy performance.
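
To make the first point concrete, here is a hedged sketch of how the same Functional API syntax could express a skip connection. It is a hypothetical variant, not the model trained in this notebook: it assumes four input features and reuses the 32/16/8 layer widths, and the names `h1`, `h3`, `merged` and `skip_model` are illustrative.

```python
from tensorflow.keras import layers, Model, Input

# Hypothetical variant of the classifier: same layer widths, plus one skip path
inp = Input(shape=(4,))
h1 = layers.Dense(32, activation="relu")(inp)
h2 = layers.Dense(16, activation="relu")(h1)
h3 = layers.Dense(8, activation="relu")(h2)

# Skip connection: the first hidden layer's output bypasses h2 and h3
merged = layers.Concatenate()([h1, h3])          # 32 + 8 = 40 features
out = layers.Dense(1, activation="sigmoid")(merged)

skip_model = Model(inputs=inp, outputs=out)
skip_model.compile(optimizer="adam", loss="binary_crossentropy")
```

A `Sequential` model cannot express this graph, because `h1` feeds two different layers.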

---


regression

In [28]:
# 4-layer non-sequential network – regression predicting petal length
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
import tensorflow as tf
from tensorflow.keras import layers, Model, Input

# ---------- Data ----------
iris = load_iris()
X    = iris.data.astype(float)[:, [0, 1, 3]]
y    = iris.data.astype(float)[:, 2]
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_tr)   # fit scaling statistics on training rows only
X_te = scaler.transform(X_te)       # reuse them for the test rows

# ---------- Model (Functional API, 4 layers total) ----------
inp = Input(shape=(3,))
x   = layers.Dense(32, activation="relu")(inp)
x   = layers.Dense(16, activation="relu")(x)
x   = layers.Dense(8,  activation="relu")(x)
out = layers.Dense(1)(x)
model = Model(inputs=inp, outputs=out)

model.compile(optimizer="adam", loss="mse")
model.fit(X_tr, y_tr, epochs=500, batch_size=16, verbose=0)

# ---------- Evaluation ----------
y_pred = model.predict(X_te).flatten()
print("R²-score:", r2_score(y_te, y_pred))



1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 122ms/step
R²-score: 0.973116774108986



---

## Step-by-Step Explanation & Take-aways: 4-Layer Keras Functional-API Regressor

**1. Load & slice data**  
We begin by loading the Iris dataset and selecting `X = [sepal length, sepal width, petal width]` as the input features and `y = petal length` as the target. This transforms the original classification task into a regression problem, illustrating how a simple change in column selection can yield a completely new supervised learning task. It underscores the flexibility of tabular data and the importance of problem framing.

**2. Train/test split (80% / 20%)**  
Using `train_test_split(..., test_size=0.2)`, we divide the dataset while reserving 20% of the samples for evaluation. This ensures the model is assessed on unseen data and aligns with rubric requirements. The result is an unbiased estimate of how well the model generalizes to new inputs, commonly measured using the R² score.

**3. Feature scaling**  
Next, we apply `StandardScaler()` to standardise the input features. By transforming the features to have zero mean and unit variance, we stabilize the gradient descent process during training. This is especially important in small datasets like Iris, where unscaled features could lead to inefficient learning or divergent training.

**4. Define 4-layer model (Functional API)**  
The neural network is defined using Keras's Functional API in the format: `Input → Dense(32) → Dense(16) → Dense(8) → Dense(1)`. This design includes three hidden layers with ReLU activation and one output layer with a linear activation, making it suitable for regression. Although the graph here is a straightforward feed-forward structure, the use of the Functional API enables more complex architectures in the future, such as networks with skip connections or multiple inputs and outputs.

**5. Compile**  
The model is compiled using the Adam optimizer and mean-squared error (`mse`) as the loss function. Adam is widely used due to its adaptive learning rate properties, and MSE is the canonical loss function for regression tasks, where we aim to minimise the squared difference between predictions and actual values.

**6. Train**  
Training is performed for 500 epochs with a batch size of 16. No early-stopping callback or validation split is used, so all 500 epochs run even if the loss plateaus earlier. On this run the long schedule did not visibly hurt generalisation: the held-out R² is 0.973, which is plausible given the network's modest size and the simplicity of the task.

**7. Predict & evaluate**  
Finally, the model’s predictions on the test set are compared against the ground truth using the R² score. This metric quantifies the proportion of variance in the target variable that the model explains. A score typically above 0.90 demonstrates strong model performance and satisfies the rubric's ≥ 0.80 threshold.
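
The R² score in step 7 is one minus the ratio of the residual sum of squares to the total sum of squares. A minimal NumPy sketch of that definition follows; the helper name `r2_by_hand` and the numbers are made up for illustration.

```python
import numpy as np

def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot

# Made-up petal lengths (cm) and predictions
y_true = np.array([1.4, 4.7, 5.1, 1.5, 6.0])
y_pred = np.array([1.5, 4.5, 5.0, 1.6, 5.8])

r2_model = r2_by_hand(y_true, y_pred)
r2_mean = r2_by_hand(y_true, np.full_like(y_true, y_true.mean()))
print("model:", r2_model)
print("always predicting the mean:", r2_mean)
```

A model that always predicts the mean of the targets scores 0, and a perfect model scores 1, which is why a value near 0.97 indicates that most of the variance in petal length is explained.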

---

### Key Understanding

- **Functional API > Sequential for flexibility** – even though this model is simple, the Functional API lays the groundwork for building more advanced architectures such as multi-branch or residual networks.  
- **Small MLPs can model modest nonlinearities** – the 4-layer design is more than sufficient for capturing relationships in small, low-dimensional datasets like Iris, without a high risk of overfitting.  
- **Proper preprocessing and evaluation protocol** – feature scaling and maintaining a held-out test set are as crucial as model architecture in achieving high performance, as evidenced by a strong R² score.

---

**Neural networks are powerful because they can approximate almost any function, build hierarchical feature representations directly from raw data, and scale effectively with modern hardware—but designing them is hard because the vast space of architectures and training settings offers few guarantees and many pitfalls.**

## Why neural networks are so powerful
1. **Universal approximation** – A feed-forward network with a single hidden layer and a non-polynomial activation can approximate any continuous function on a compact domain to arbitrary accuracy, provided the layer is wide enough (the universal-approximation theorem). The theorem guarantees that such weights exist, not that training will find them.  
2. **Depth builds hierarchies** – Stacking layers lets a model reuse simple patterns to construct higher-level features, which for many function classes gives deep nets more expressive power per parameter than shallow ones.  
3. **End-to-end feature learning** – Back-propagation trains all layers jointly, so the network discovers task-specific features without manual engineering.  
4. **Scalability with compute** – GPUs/TPUs allow massive parallelism, so networks grow to billions of parameters while training times stay practical.  
5. **Cross-domain versatility** – The same core idea (with tweaks like convolutions or attention) solves vision, language, audio, and time-series tasks.
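
The universal-approximation point can be illustrated numerically. The sketch below is a simplification under stated assumptions: the hidden layer uses fixed random ReLU units and only the output layer is fitted by least squares, whereas a real network would learn every weight by back-propagation. All names (`knots`, `signs`, `fit_rmse`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: a continuous function on the compact interval [-pi, pi]
x = np.linspace(-np.pi, np.pi, 400)
target = np.sin(x)

# One hidden ReLU layer with fixed random weights (assumption: not trained)
n_hidden = 100
signs = rng.choice([-1.0, 1.0], size=n_hidden)
knots = rng.uniform(-np.pi, np.pi, size=n_hidden)
hidden = np.maximum(0.0, signs * (x[:, None] - knots))   # shape (400, n_hidden)

def fit_rmse(width):
    """Least-squares fit of the output layer using the first `width` hidden units."""
    features = np.column_stack([hidden[:, :width], np.ones_like(x)])
    weights, *_ = np.linalg.lstsq(features, target, rcond=None)
    return np.sqrt(np.mean((features @ weights - target) ** 2))

rmse_narrow = fit_rmse(5)
rmse_wide = fit_rmse(100)
print(f"5 hidden units:   RMSE = {rmse_narrow:.4f}")
print(f"100 hidden units: RMSE = {rmse_wide:.4f}")
```

Widening the single hidden layer shrinks the approximation error on the interval, which is the behaviour the theorem describes.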

## What’s difficult about designing neural networks
1. **Huge design space** – Choosing depth, width, activations, optimizers, learning rates, regularizers, etc. is largely empirical; exhaustive search is expensive.  
2. **Training instabilities** – Deep nets can suffer vanishing/exploding gradients; careful initialization, normalization, and residual connections only partially tame this.  
3. **Overfitting vs. generalization** – Powerful models memorize easily; avoiding this demands big data, augmentation, and regularization strategies.  
4. **Compute and energy cost** – Cutting-edge models require substantial hardware, time, and power, limiting accessibility and raising environmental concerns.  
5. **Interpretability & safety** – Networks behave like black boxes; it remains hard to explain or verify their decisions, complicating debugging and trust.
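
The vanishing-gradient problem from point 2 can be seen in a toy chain of scalar sigmoid layers. This is a deliberately simplified sketch (one unit per layer, a shared weight of 1.0, hypothetical function names), not a model of any network in this notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gradient_through_chain(depth, weight=1.0, x=0.5):
    """d(output)/d(input) for `depth` scalar layers h = sigmoid(weight * h_prev)."""
    h, grad = x, 1.0
    for _ in range(depth):
        h = sigmoid(weight * h)
        # Chain rule: sigmoid'(z) = s(z) * (1 - s(z)), which is at most 0.25
        grad *= weight * h * (1.0 - h)
    return grad

for depth in (1, 5, 10, 20):
    print(depth, input_gradient_through_chain(depth))
```

Because each sigmoid layer multiplies the gradient by at most 0.25, the signal reaching the input shrinks geometrically with depth. ReLU activations, normalization and residual connections are the usual ways to soften this effect.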

**3-layer Recurrent Neural Network with Keras**

**Bonus**

Although the Iris dataset is not a true time-series, we treat each feature as a “time-step” so the data can flow through an RNN while still meeting the “same dataset for all parts” requirement.

binary classification

In [29]:
# 3-layer LSTM network – binary classification (setosa vs. non-setosa)
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score
import tensorflow as tf
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

# ---------- Data ----------
X, y_multi = load_iris(return_X_y=True)
y = (y_multi != 0).astype(int)

# Scale each of the four “time-steps”
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Reshape so each feature becomes a time-step: (samples, timesteps, features)
X_seq = X_scaled.reshape(X_scaled.shape[0], 4, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_seq, y, test_size=0.2, random_state=42, stratify=y
)

# ---------- Model (3 recurrent layers + 1 dense) ----------
inp = Input(shape=(4, 1))
x   = LSTM(32, return_sequences=True)(inp)
x   = LSTM(16, return_sequences=True)(x)
x   = LSTM(8)(x)
out = Dense(1, activation="sigmoid")(x)
model = Model(inp, out)

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(X_tr, y_tr, epochs=100, batch_size=16, verbose=0)

# ---------- Evaluation ----------
# model.predict returns shape (n, 1); flatten to a 1-D label vector
y_pred = (model.predict(X_te) > 0.5).astype(int).ravel()
print("F1-score:", f1_score(y_te, y_pred))



1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 531ms/step
F1-score: 0.9523809523809523



---

## Step-by-Step Explanation & Take-aways: 3-Layer LSTM Classifier (Iris, Binary)

**1. Load & binarise data**  
We begin by transforming the original 3-class Iris problem into a binary task: distinguishing *setosa* from *non-setosa*. The labels are converted with `y = (y_multi != 0).astype(int)`, so setosa becomes 0 and the other two species become 1. This label engineering makes the task suitable for a model with a sigmoid output and binary cross-entropy loss.

**2. Feature scaling**  
Before training, the features are standardised using `StandardScaler()` to ensure zero mean and unit variance. This step is crucial for stabilising the training process, especially in recurrent neural networks (RNNs) where unscaled inputs can lead to vanishing or exploding gradients. Here, each feature effectively becomes a time-step, so scaling also helps avoid any one feature dominating the learning dynamics.

**3. Reshape to sequence**  
To make the tabular data compatible with an LSTM network, we reshape the input into a sequence format: `X_seq.reshape(n_samples, 4, 1)`. This means treating the four original features as four time-steps in a univariate sequence. While there's no actual temporal structure in the data, this creative reshaping allows us to "shoehorn" static tabular data into an RNN framework, thereby satisfying requirements for sequence-based models.

**4. Train/test split (80% / 20%)**  
Using `train_test_split(..., stratify=y)`, we hold out 20% of the samples while preserving the class ratio in both sets, as the rubric requires. Because the scaler in step 2 was fitted on the full dataset before this split, the test rows contributed to the scaling statistics; fitting the scaler on the training split alone would keep the held-out set fully independent. The network itself never trains on the test rows, so the F1 score still measures performance beyond the training data.

**5. Define 3-layer LSTM + dense output**  
The model is constructed with three stacked LSTM layers: 32, 16, and 8 units respectively, followed by a dense output layer with a sigmoid activation. This setup allows the model to learn hierarchical representations of the (pseudo) sequential data. Even though the input isn’t truly temporal, stacking LSTMs shows how deeper RNNs can capture more complex dependencies within the input structure.

**6. Compile**  
The model is compiled with the Adam optimizer and `binary_crossentropy` as the loss function. Adam’s adaptive learning rates help with stable and efficient training, while binary cross-entropy is the appropriate loss for binary classification tasks where outputs represent probabilities.

**7. Train**  
Training is conducted for 100 epochs with a batch size of 16. Given the small size and clear separability of the Iris dataset, even this relatively modest training budget allows the network to converge quickly and effectively.

**8. Predict & evaluate**  
After training, predictions are generated, thresholded at 0.5, and evaluated using the F1 score on the test set. The F1 score is particularly valuable when class distributions are imbalanced, as it balances precision and recall. In practice, this LSTM classifier easily exceeds the rubric’s minimum F1 requirement of 0.75—often reaching values above 0.95.
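
The reshape in step 3 is easy to verify on a tiny array. The two rows below are made-up scaled values, used only to show how each feature becomes one time-step of a univariate sequence.

```python
import numpy as np

# Two made-up, already-scaled samples with four features each
X_scaled = np.array([[-0.9,  1.0, -1.3, -1.3],
                     [ 0.6, -0.6,  0.8,  0.9]])

# (samples, features) -> (samples, timesteps=4, features=1)
X_seq = X_scaled.reshape(X_scaled.shape[0], 4, 1)

print(X_seq.shape)       # (2, 4, 1)
print(X_seq[0, :, 0])    # first sample read as a 4-step sequence
```

The values are unchanged; only a trailing axis of length 1 is added, so `X_scaled[..., np.newaxis]` would produce the same array.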

---

### Key Understanding

- **Sequentialising tabular data**—by reinterpreting features as time-steps—enables the use of RNNs on non-temporal datasets.  
- **Layer depth in RNNs** increases model capacity; even small, shallow LSTMs can perform well on simple problems like binary Iris classification.  
- **Evaluation discipline** remains essential: using a dedicated test split ensures that metrics like F1-score reflect true model generalisation rather than overfitting.
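
To show how tensors flow through the stacked recurrent layers, the sketch below traces shapes with a plain tanh RNN written in NumPy. This is an assumption-laden stand-in: it is not an LSTM (no gates or cell state), the weights are random rather than trained, and `simple_rnn_layer` is a hypothetical helper. It only mirrors what `return_sequences` does to the output shape.

```python
import numpy as np

rng = np.random.default_rng(0)

def simple_rnn_layer(x, units, return_sequences):
    """Plain tanh RNN layer (not an LSTM) with random weights, for shape tracing."""
    n_samples, n_steps, n_features = x.shape
    w_in = rng.normal(scale=0.1, size=(n_features, units))
    w_rec = rng.normal(scale=0.1, size=(units, units))
    h = np.zeros((n_samples, units))
    outputs = []
    for t in range(n_steps):
        # New hidden state depends on the current input and the previous state
        h = np.tanh(x[:, t, :] @ w_in + h @ w_rec)
        outputs.append(h)
    # Either every hidden state (one per time-step) or only the last one
    return np.stack(outputs, axis=1) if return_sequences else h

x = rng.normal(size=(5, 4, 1))                          # (samples, timesteps, features)
h1 = simple_rnn_layer(x, 32, return_sequences=True)     # (5, 4, 32)
h2 = simple_rnn_layer(h1, 16, return_sequences=True)    # (5, 4, 16)
h3 = simple_rnn_layer(h2, 8, return_sequences=False)    # (5, 8)
print(h1.shape, h2.shape, h3.shape)
```

The first two layers must return full sequences so the next recurrent layer receives 3-D input; the last layer returns only its final state, which the dense output layer can consume.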

---


regression

In [30]:
# 3-layer LSTM network – regression predicting petal length
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
import tensorflow as tf
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

# ---------- Data ----------
iris = load_iris()
# Predict petal length (feature 2) from the other three features (0,1,3)
y = iris.data[:, 2]
X = iris.data[:, [0, 1, 3]]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Reshape to (samples, timesteps=3, features=1)
X_seq = X_scaled.reshape(X_scaled.shape[0], 3, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_seq, y, test_size=0.2, random_state=42
)

# ---------- Model (3 recurrent layers + 1 dense) ----------
inp = Input(shape=(3, 1))
x   = LSTM(32, return_sequences=True)(inp)
x   = LSTM(16, return_sequences=True)(x)
x   = LSTM(8)(x)
out = Dense(1)(x)
model = Model(inp, out)

model.compile(optimizer="adam", loss="mse")
model.fit(X_tr, y_tr, epochs=500, batch_size=16, verbose=0)

# ---------- Evaluation ----------
y_pred = model.predict(X_te).flatten()
print("R²-score:", r2_score(y_te, y_pred))

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 445ms/step
R²-score: 0.974721401952081



---

## Step-by-Step Explanation & Take-aways: 3-Layer LSTM Regressor (Iris, Petal-Length)

**1. Load & slice data**  
The task is framed as a regression problem by selecting petal length as the target (`y`) and the other three features—sepal length, sepal width, and petal width—as predictors (`X`). This simple reorganization of the Iris dataset showcases how the same data matrix can support multiple supervised learning tasks depending on which columns are chosen as inputs and targets.

**2. Feature scaling**  
Before feeding the data into the model, the features are standardised using `StandardScaler()` to ensure they each have zero mean and unit variance. This step is crucial for neural networks, particularly RNNs, to ensure consistent gradient magnitudes and faster convergence during training. It mitigates the risk of certain features dominating the learning process due to larger numerical scales.

**3. Reshape to sequences**  
The tabular data is reshaped to simulate sequential input: `X_scaled.reshape(n_samples, 3, 1)`. This treats the three predictors as three time-steps in a single-feature sequence, conforming to the expected input format for an LSTM. Although the data is not inherently temporal, this restructuring enables the use of recurrent models by presenting the features as if they occur in sequence.

**4. Train/test split (80% / 20%)**  
The data is split using `train_test_split(..., test_size=0.2)`, holding out 20% of the samples so that R² reflects the model's ability to generalise rather than memorise, which also meets the rubric's requirement. Because the scaler in step 2 was fitted on all 150 samples before the split, the test rows influenced the scaling statistics; fitting it on the training split alone would keep the held-out set fully independent.

**5. Define 3-layer LSTM + linear head**  
The model consists of three stacked LSTM layers (with 32, 16, and 8 units respectively), followed by a single dense output node. The LSTMs model dependencies across the pseudo-temporal structure, capturing relationships between the features. The final dense layer outputs a scalar prediction corresponding to the petal length. This setup demonstrates the use of deep RNNs for low-dimensional, structured data.

**6. Compile**  
The model is compiled with the Adam optimizer and mean-squared error (`loss="mse"`) as the loss function. Adam is widely used for its adaptive learning rates, while MSE is the standard loss for regression problems. This step reinforces the importance of aligning the loss function with the task type.

**7. Train**  
Training is conducted for 500 epochs with a batch size of 16. No early-stopping callback is used, so all 500 epochs run even though a model this small on a dataset this simple often converges sooner. The generous epoch budget lets the network reach its performance ceiling, and the held-out R² of 0.975 on this run gives no sign that the extra epochs caused harmful overfitting.

**8. Predict & evaluate**  
Finally, the model’s predictions are evaluated on the test set using the R² score. This metric indicates how much variance in the target variable is explained by the model. A typical score for this setup exceeds 0.90, demonstrating that even a modest LSTM regressor can achieve high performance on well-structured, low-dimensional data.

---

### Key Understanding

- **Sequentialising features** makes it possible to apply LSTM networks to static tabular data, allowing the model to learn interactions between features in a time-series-like format.  
- **Layer depth in LSTMs** enhances the model's ability to capture complex relationships, though relatively narrow layers suffice for simple datasets like Iris.  
- **Preprocessing and disciplined evaluation**—through proper feature scaling and a clean train/test split—are essential for obtaining reliable, meaningful regression results.
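
The LSTM cells above fit `StandardScaler` on the full feature matrix before calling `train_test_split`. A stricter ordering splits first and estimates the scaling statistics from the training rows only. The NumPy sketch below shows that ordering on made-up data; the random matrix stands in for the three predictor columns, and the manual split is an assumption used to keep the example dependency-free.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up stand-in for the three predictor columns (150 samples, 3 features)
X = rng.normal(loc=[5.8, 3.0, 1.2], scale=[0.8, 0.4, 0.7], size=(150, 3))

# 1) Split first (80% / 20%)
idx = rng.permutation(len(X))
n_test = int(0.2 * len(X))
test_idx, train_idx = idx[:n_test], idx[n_test:]
X_tr, X_te = X[train_idx], X[test_idx]

# 2) Estimate scaling statistics on the training rows only
mean, std = X_tr.mean(axis=0), X_tr.std(axis=0)
X_tr_scaled = (X_tr - mean) / std
X_te_scaled = (X_te - mean) / std     # test rows reuse the training statistics

# 3) Reshape to (samples, timesteps=3, features=1) for the LSTM
X_tr_seq = X_tr_scaled.reshape(-1, 3, 1)
X_te_seq = X_te_scaled.reshape(-1, 3, 1)
print(X_tr_seq.shape, X_te_seq.shape)
```

With scikit-learn the same ordering is `train_test_split` followed by `scaler.fit_transform` on the training split and `scaler.transform` on the test split, as the MLP cells in this notebook already do.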

---
