# D7041E Miniproject

Group ID: MINI-PROJECT 14

Ahmad Allahham
[ahmall-0@student.ltu.se] | 940120-0556 |

Arian Asghari
[ariasg-0@student.ltu.se] | 010721-7051 |

Hannes Furhoff
[hanfur-0@student.ltu.se] | 010929-4710 |

## Grade requirements
### G3: Run and understand a publicly available model on one selected dataset.
We chose a multilayer perceptron (MLP) regression model, applied to a housing dataset.

### G3: Choose a dataset.
Our dataset is a collection of synthetic housing data, with several features (e.g. rooms, year built) and a price for each house.

### G3: Implement tutorial.
We implemented a tutorial from machinelearningmastery; the link is in README.md.

### G3: Test performance for different configurations of the perceptron.
We tested different configurations of the perceptron, varying the hidden layers, number of epochs, batch size, and learning rate.

### G3: Document the performance.
Performance is reported as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE). See the [Performance Table](#performance-table) for detailed results.
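
For reference, the four metrics can be sketched in plain Python as below. This is a minimal illustration, not the notebook's code (which relies on `nn.MSELoss` and scikit-learn); the helper name `regression_metrics` and the sample values are assumptions made up for the example.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (mse, rmse, mae, mape) for two equal-length sequences.

    Hypothetical helper for illustration; MAPE assumes no true value is zero.
    """
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n                              # mean squared error
    rmse = math.sqrt(mse)                                             # root of the MSE
    mae = sum(abs(e) for e in errors) / n                             # mean absolute error
    mape = sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n   # relative to true values
    return mse, rmse, mae, mape

# Made-up values, not taken from the housing dataset
mse, rmse, mae, mape = regression_metrics([0.5, 0.25, 1.0], [0.4, 0.35, 0.8])
```

Because MAPE divides by the true values, swapping the two arguments generally changes the result, whereas MSE, RMSE, and MAE are symmetric. The "Score" printed by the notebook is `(1 - MAPE) * 100`.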

### G4: You should use data pre-processing
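Preprocessing consisted of the following steps:
- One-hot encoding the categorical "Neighborhood" column and collapsing each one-hot vector into a single number (e.g. [0 1 0] becomes 10.0).
- Converting all values to float type.
- Min-max normalizing every column, including the price, to the range [0, 1].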
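
The normalization step is min-max scaling, which can be sketched with NumPy as follows. The function name `min_max_normalize`, the handling of constant columns, and the sample array are assumptions for this illustration; the values are not taken from the dataset.

```python
import numpy as np

def min_max_normalize(data):
    """Scale each column of a 2-D array to the range [0, 1]."""
    col_min = data.min(axis=0)
    col_range = data.max(axis=0) - col_min
    # Assumption for this sketch: constant columns map to 0 instead of dividing by zero
    col_range[col_range == 0] = 1.0
    return (data - col_min) / col_range

# Made-up rows: three features per house
sample = np.array([[1000.0, 2.0, 1990.0],
                   [2000.0, 4.0, 2010.0],
                   [1500.0, 3.0, 2000.0]])
scaled = min_max_normalize(sample)
```

After scaling, the smallest value in each column becomes 0 and the largest becomes 1, so features with very different units contribute on a comparable scale.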

### G4: Systematically choose the hyper-parameters of the model
We chose the hyperparameters of the MLP with an exhaustive grid search over the number of hidden layers, neurons per layer, epochs, batch size, and learning rate. The impact of each configuration on model performance is detailed in the [Performance Table](#performance-table).
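
The idea of enumerating every configuration can be sketched with `itertools.product`. The dictionary below is a reduced, hypothetical search space chosen to keep the example short; the key names are assumptions and not the notebook's variable names.

```python
import itertools

# Hypothetical, reduced search space for illustration
search_space = {
    "n_hidden_layers": [2, 3],
    "neurons_per_layer": [10, 50],
    "learning_rate": [0.01, 0.05],
}

# One dict per combination; the last key varies fastest
configs = [dict(zip(search_space, values))
           for values in itertools.product(*search_space.values())]

for config in configs:
    print(config)
```

The grid size is the product of the list lengths, so it grows quickly. The full grid in the Run cell of this notebook has 2 x 4 x 4 x 2 x 2 x 1 = 128 combinations.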

### G4: Use cross-validation for training 
To estimate how well the model generalizes, we train and evaluate with k-fold cross-validation (k = 5) and average the metrics over the folds. The results are shown in the [Performance Table](#performance-table).
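
The notebook uses scikit-learn's `KFold` for the split. To show what such a split does, here is a minimal pure-Python sketch; the function `k_fold_indices` and its `seed` parameter are hypothetical and only illustrate the idea.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_ids, test_ids) pairs for k shuffled folds."""
    ids = list(range(n_samples))
    random.Random(seed).shuffle(ids)
    # Spread the remainder so fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_ids = ids[start:start + size]
        train_ids = ids[:start] + ids[start + size:]
        yield train_ids, test_ids
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
```

Every sample lands in exactly one test fold, so each data point is used for evaluation once and for training k - 1 times.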

### G4: Use different seeds and record performance statistics with various performance metrics
We have utilized different random seeds during training and recorded comprehensive performance statistics, including MSE, RMSE, MAE, and MAPE. The [Performance Table](#performance-table) provides a detailed breakdown of these metrics for various model configurations. 
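
A general pattern for seed-based statistics is to repeat the same experiment with several seeds and report the mean and spread. The sketch below uses a toy experiment (a mean predictor on a tiny made-up list) as a stand-in for a training run; `run_experiment` is a hypothetical placeholder, not code from this notebook.

```python
import random
import statistics

def run_experiment(seed):
    """Toy run: shuffle with the seed, hold out 2 points, predict the training mean, return MAE."""
    values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]  # made-up targets
    rng = random.Random(seed)
    rng.shuffle(values)
    train, test = values[:6], values[6:]
    prediction = statistics.mean(train)
    return statistics.mean([abs(v - prediction) for v in test])

seeds = [0, 1, 2, 3, 4]
scores = [run_experiment(s) for s in seeds]
mean_score = statistics.mean(scores)   # average over seeds
std_score = statistics.stdev(scores)   # sample standard deviation over seeds
```

In a PyTorch and scikit-learn setting, the seed would typically be applied with `torch.manual_seed(seed)` and `KFold(..., shuffle=True, random_state=seed)`; whether and how this notebook fixes its seeds is not shown in the cells below.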

## Code

In [1]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import itertools
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_percentage_error, mean_absolute_error
import tqdm
import time
from IPython.display import clear_output

In [2]:
# Misc

# Returns from run
class ModelMetrics:
    def __init__(self, _mse, _rmse, _mae, _mape, _time, _settings):
        self.mse = round(_mse, 5)
        self.rmse = round(_rmse, 5)
        self.mae = round(_mae, 5)
        self.mape = round(_mape, 5)
        self.training_time = round(_time, 2)
        self.run_settings = _settings # The settings the model was run on
    def __str__(self):
        return (
            f"MSE={self.mse}, "
            f"RMSE={self.rmse}, "
            f"MAE={self.mae}, "
            f"MAPE={self.mape}, "
            f"time={self.training_time}s, "
            f"settings={self.run_settings}"
        )

# Settings given to the training function
class ModelSettings:
    def __init__(self, _hidden_layers, _epochs, _batch, _learning_rate, _k_folds):
        self.hidden_layers = _hidden_layers
        self.epochs = _epochs
        self.batch_size = _batch
        self.learning_rate = _learning_rate
        self.k_folds = _k_folds
    def __str__(self):
        return (
            f"hidden_layers={self.hidden_layers}, "
            f"epochs={self.epochs}, "
            f"batch_size={self.batch_size}, "
            f"learning_rate={self.learning_rate}, "
            f"k_folds={self.k_folds}"
        )

In [3]:
# Preprocessing 

# Min-max normalize each column to the range [0, 1]
def normalize(data):
    normalized_data = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))
    return normalized_data
    
# Read data
raw_data = np.loadtxt("housing_data.csv", delimiter=",", dtype=str)
column_names = raw_data[0]
raw_data = raw_data[1:]

# For "Neighborhood" column:
# Str -> OH-encoding -> str repr -> float repr 
# Eg. "Urban" -> [0 1 0] -> "010" -> (0)10.0
d = raw_data[:,3].reshape(-1, 1)
oh_enc = OneHotEncoder(dtype=int)
oh_enc.fit(d)
oh_d = oh_enc.transform(d).toarray().astype(str)
for r in range(len(oh_d)):
    tf = int(''.join(oh_d[r]))
    raw_data[r,3] = tf

# Type convert data
raw_data = raw_data.astype(float)
housing_data, housing_prices = normalize(raw_data[:,0:5]), normalize(raw_data[:,5:6])
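
The encoding above collapses each one-hot vector into a single number (for example 10.0 or 100.0), which gives the categories an implied order and magnitude. A common alternative, shown here only as a sketch and not what this notebook does, keeps one 0/1 column per category. The helper `one_hot_columns` is hypothetical, and the category names other than "Urban" are assumed for illustration.

```python
import numpy as np

def one_hot_columns(labels):
    """Return (categories, matrix) with one 0/1 column per distinct label."""
    categories = sorted(set(labels))
    index = {c: i for i, c in enumerate(categories)}
    matrix = np.zeros((len(labels), len(categories)), dtype=float)
    for row, label in enumerate(labels):
        matrix[row, index[label]] = 1.0
    return categories, matrix

# Assumed category names for illustration
categories, encoded = one_hot_columns(["Urban", "Rural", "Suburb", "Urban"])
```

With this layout the model input would have one feature per category instead of a single "Neighborhood" feature, so the input size passed to the network would change accordingly.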

In [4]:
# Multilayer Perceptron

class MLPRegressor:
    def __init__(self, _input_size: int, _output_size: int, _settings: ModelSettings):
        self.model = self.build_model(_input_size, _output_size, _settings.hidden_layers)
        self.settings = _settings
        
    def build_model(self, input_size: int, output_size: int, hidden_layers) -> nn.Sequential:
        layers = []
        layers.append(nn.Linear(input_size, hidden_layers[0])) 
        layers.append(nn.ReLU())
        for i in range(len(hidden_layers)-1):
            layers.append(nn.Linear(hidden_layers[i], hidden_layers[i+1])) 
            layers.append(nn.ReLU())
        layers.append(nn.Linear(hidden_layers[-1], output_size))
        return nn.Sequential(*layers)

    def convert_to_tensor(self, X_train, y_train, X_test, y_test):
        X_train = torch.from_numpy(X_train.astype(np.float32))
        y_train = torch.from_numpy(y_train.astype(np.float32))
        X_test = torch.from_numpy(X_test.astype(np.float32))
        y_test = torch.from_numpy(y_test.astype(np.float32))
        return X_train, y_train, X_test, y_test
        
    def evaluate_model(self, X_train, y_train, X_test, y_test):
        X_train, y_train, X_test, y_test = self.convert_to_tensor(X_train, y_train, X_test, y_test)
        loss_fn = nn.MSELoss()  # Mean square error
        optimizer = optim.Adam(self.model.parameters(), lr = self.settings.learning_rate)

        n_epochs = self.settings.epochs   # Number of epochs to run
        batch_size = self.settings.batch_size # Size of each batch
        batch_start = torch.arange(0, len(X_train), batch_size)

        error = np.inf
        for epoch in range(n_epochs):
            
            self.model.train()
            with tqdm.tqdm(batch_start, unit="batch", mininterval=0, disable=True) as bar:
                bar.set_description(f"Epoch {epoch}")
                for start in bar:
                    # Take a batch
                    X_batch = X_train[start:start+batch_size]
                    y_batch = y_train[start:start+batch_size]

                    # Forward pass
                    y_pred = self.model(X_batch)
                    loss = loss_fn(y_pred, y_batch)

                    # Backward pass
                    optimizer.zero_grad()
                    loss.backward()

                    # Update weights
                    optimizer.step()

                    # Print progress
                    bar.set_postfix(mse=float(loss))

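        # Evaluate on the held-out data once, after all epochs have run
        self.model.eval()
        with torch.no_grad():
            y_pred = self.model(X_test)
            # NOTE: scikit-learn's signature is (y_true, y_pred). MAE is symmetric,
            # but MAPE as called here divides by the prediction, not the target.
            mae = mean_absolute_error(y_pred, y_test)
            mape = mean_absolute_percentage_error(y_pred, y_test)
            mse = loss_fn(y_pred, y_test)
            rmse = torch.sqrt(mse)
        return mse, rmse, mae, mape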

    # K-Fold Cross-validation
    def k_fold_validation(self, housing_data, housing_prices):
        
        stime = time.time()
        kfold = KFold(n_splits=self.settings.k_folds, shuffle=True)
        history = {
            'mse': np.array([]),
            'rmse': np.array([]),
            'mae': np.array([]),
            'mape': np.array([])
        }
        for fold, (train_ids, test_ids) in enumerate(kfold.split(housing_data)): 
            X_train = housing_data[train_ids]
            X_test = housing_data[test_ids]
            y_train = housing_prices[train_ids]
            y_test = housing_prices[test_ids]
            mse, rmse, mae, mape = self.evaluate_model(X_train, y_train, X_test, y_test)
            history['mse'] = np.append(history['mse'], mse)
            history['rmse'] = np.append(history['rmse'], rmse)
            history['mae'] = np.append(history['mae'], mae)
            history['mape'] = np.append(history['mape'], mape)
        etime = time.time()
        return ModelMetrics(
            np.mean(history['mse']),
            np.mean(history['rmse']),
            np.mean(history['mae']),
            np.mean(history['mape']),
            etime - stime,
            self.settings
            )
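
In the class above, `k_fold_validation` calls `evaluate_model` on the same `self.model` for every fold, so weights learned in earlier folds carry over to later ones. One way to keep the folds independent is to build a new model for each fold. The sketch below illustrates that pattern with a toy model; `MeanModel`, `cross_validate`, and the sample data are hypothetical and stand in for the MLP and the housing data.

```python
class MeanModel:
    """Toy model for illustration: predicts the running mean of all targets it has seen."""
    def __init__(self):
        self.mean = 0.0
        self.n_seen = 0

    def fit(self, y_train):
        # State accumulates across calls, so reusing one instance would leak earlier folds
        for y in y_train:
            self.n_seen += 1
            self.mean += (y - self.mean) / self.n_seen

    def predict(self):
        return self.mean

def cross_validate(targets, folds, make_model):
    """Train a fresh model per fold and return the per-fold MAE."""
    fold_errors = []
    for train_ids, test_ids in folds:
        model = make_model()  # new, untrained model for every fold
        model.fit([targets[i] for i in train_ids])
        pred = model.predict()
        fold_errors.append(sum(abs(targets[i] - pred) for i in test_ids) / len(test_ids))
    return fold_errors

targets = [1.0, 2.0, 3.0, 4.0]                 # made-up targets
folds = [([0, 1], [2, 3]), ([2, 3], [0, 1])]   # hand-written (train_ids, test_ids) pairs
errors = cross_validate(targets, folds, MeanModel)
```

Passing a factory (`make_model`) rather than a model instance makes it explicit that each fold starts from scratch, which keeps the test fold unseen by the model being scored.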

In [5]:
# Run

# Running params
# The idea is to cycle through every combination of the values below
k_folds = [5]                             # Number of folds for k-fold cross-validation
number_of_hidden_layers = [2, 3]          # 1 'layer' = 1 linear layer + ReLU activation
learning_rates = [0.01, 0.05]             # Learning rate of the Adam optimizer
batch_sizes = [500, 1000]                 # Mini-batch size used during training
epochs = [3, 10, 20, 60]                  # Epochs for training the model
neurons_per_layer = [10, 50, 100, 200]    # Neurons in each hidden layer

# Setup
max_saved_runs = 10
best_runs = []

# Train and test model for all argument combinations
# Save the {max_saved_runs} best runs
combos = itertools.product(number_of_hidden_layers, neurons_per_layer, epochs, batch_sizes, learning_rates, k_folds)


for n_layers, n_neurons, n_epochs, batch_size, learning_rate, n_folds in combos:
    hidden_layers = [n_neurons] * n_layers
    settings = ModelSettings(hidden_layers, n_epochs, batch_size, learning_rate, n_folds)
    mlp = MLPRegressor(housing_data.shape[1], housing_prices.shape[1], settings)
    metrics = mlp.k_fold_validation(housing_data, housing_prices)
    best_runs.append(metrics)
    best_runs = sorted(best_runs, key=lambda x: x.mape)[:max_saved_runs]

    # Update output with the current best runs
    clear_output(wait=True)
    for rank, run in enumerate(best_runs, start=1):
        print(f"Rank: {rank}: Score = {(1-run.mape)*100:.1f}%, Metrics={run}")

Rank: 1: Score = 84.2%, Metrics=MSE=0.00909, RMSE=0.09532, MAE=0.0762, MAPE=0.1585, time=17.71s, settings=hidden_layers=[10, 10, 10], epochs=60, batch_size=1000, learning_rate=0.05, k_folds=5
Rank: 2: Score = 84.1%, Metrics=MSE=0.00906, RMSE=0.0952, MAE=0.07611, MAPE=0.15853, time=12.68s, settings=hidden_layers=[200, 200], epochs=20, batch_size=1000, learning_rate=0.05, k_folds=5
Rank: 3: Score = 84.1%, Metrics=MSE=0.00905, RMSE=0.09511, MAE=0.07603, MAPE=0.15859, time=36.57s, settings=hidden_layers=[100, 100, 100], epochs=60, batch_size=1000, learning_rate=0.05, k_folds=5
Rank: 4: Score = 84.1%, Metrics=MSE=0.00913, RMSE=0.09555, MAE=0.07641, MAPE=0.15861, time=7.08s, settings=hidden_layers=[50, 50], epochs=20, batch_size=1000, learning_rate=0.05, k_folds=5
Rank: 5: Score = 84.1%, Metrics=MSE=0.00906, RMSE=0.0952, MAE=0.07609, MAPE=0.15875, time=14.75s, settings=hidden_layers=[10, 10], epochs=60, batch_size=1000, learning_rate=0.05, k_folds=5
Rank: 6: Score = 84.1%, Metrics=MSE=0.0090