
# AI Lab 09

Data Transformation and MLPs.


## Loading the dataset

The first cell below loads the ``wine dataset``.

https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data

The second cell prints out the description of this dataset. In short, it is for **classification of wines** based on features like alcohol, magnesium and colour.

In [None]:
from sklearn.datasets import load_wine

# Loading the dataset
dataset = load_wine()

# Get the X (feature matrix) and y (class label vector) from the data
X, y = dataset.data, dataset.target

In [None]:
# Print out the dataset description
print(dataset.DESCR)

.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

    :Number of Instances: 178
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- Alcohol
 		- Malic acid
 		- Ash
		- Alcalinity of ash  
 		- Magnesium
		- Total phenols
 		- Flavanoids
 		- Nonflavanoid phenols
 		- Proanthocyanins
		- Color intensity
 		- Hue
 		- OD280/OD315 of diluted wines
 		- Proline

    - class:
            - class_0
            - class_1
            - class_2
		
    :Summary Statistics:
    
                                   Min   Max   Mean     SD
    Alcohol:                      11.0  14.8    13.0   0.8
    Malic Acid:                   0.74  5.80    2.34  1.12
    Ash:                          1.36  3.23    2.36  0.27
    Alcalinity of Ash:            10.6  30.0    19.5   3.3
    Magnesium:                    70.0 162.0    99.7  14.3
    Total Phenols:                0.98  3.88    2.29  0.63
    Fl
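A quick way to sanity-check the loaded arrays against the description above is to print their shape and the number of samples per class. This is only a minimal sketch; the names `n_samples`, `n_features` and `class_counts` are illustrative and not part of the lab.

```python
import numpy as np
from sklearn.datasets import load_wine

# Load the dataset the same way as the first cell
dataset = load_wine()
X, y = dataset.data, dataset.target

# Rows are wines (samples), columns are the numeric attributes
n_samples, n_features = X.shape

# Count how many samples belong to each class label (0, 1, 2)
class_counts = np.bincount(y)

print("Samples: ", n_samples)
print("Features:", n_features)
for name, count in zip(dataset.target_names, class_counts):
    print("%s: %d samples" % (name, count))
```

The classes are not perfectly balanced, which is worth keeping in mind when reading the accuracy figures further down.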

## Scaling datasets

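Setting up two transformed feature matrices here, using the ``Normalizer`` (which rescales each sample, i.e. each row, to unit norm) and the ``StandardScaler`` (which standardises each feature, i.e. each column, to zero mean and unit variance), to experiment below on whether this helps the MLP perform better.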

In [None]:
# Normalising feature matrix values
from sklearn.preprocessing import Normalizer

normalizer = Normalizer()
normalizer.fit(X)
X_normalised = normalizer.transform(X)

In [None]:
# Scaling feature matrix values
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
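The two transformers work along different axes, which is easy to miss. The sketch below checks this numerically: with its default settings the ``Normalizer`` should give every row a Euclidean length of 1, while the ``StandardScaler`` should give every column a mean of about 0 and a standard deviation of about 1. The names `row_norms`, `col_means` and `col_stds` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import Normalizer, StandardScaler

X = load_wine().data

# fit_transform is shorthand for fit(X) followed by transform(X)
X_normalised = Normalizer().fit_transform(X)
X_scaled = StandardScaler().fit_transform(X)

# Normalizer: each ROW (sample) is rescaled to unit L2 norm
row_norms = np.linalg.norm(X_normalised, axis=1)

# StandardScaler: each COLUMN (feature) is shifted and rescaled
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)

print("Row norms after Normalizer (first 5):      ", row_norms[:5])
print("Column means after StandardScaler (first 5):", col_means[:5].round(6))
print("Column stds after StandardScaler (first 5): ", col_stds[:5].round(6))
```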

## MLP validation

The cells below do the following:

 * Define a helper, ``print_results``, that prints the mean accuracy and the training and testing times (each with +/- 2 standard deviations)
 * Create a 3-layer ``MLPClassifier``. The starting values were 10 neurons in the hidden layer and 100 epochs (iterations); the cell as shown here uses 1000 neurons and 5000 epochs
 * Train and test the MLP with 5-fold cross-validation, using the original feature values ``X``
 * Print out the accuracy and timings

In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_validate
from sklearn import metrics

def print_results(scores):
    print("Accuracy:          %0.2f (+/- %0.2f)" % (scores['test_score'].mean(), scores['test_score'].std() * 2))
    print("Training time (s): %0.2f (+/- %0.2f)" % (scores['fit_time'].mean(), scores['fit_time'].std() * 2))
    print("Testing time (s):  %0.2f (+/- %0.2f)" % (scores['score_time'].mean(), scores['score_time'].std() * 2))

In [None]:
# Instantiating MLP
model = MLPClassifier(hidden_layer_sizes=(1000,), max_iter=5000)

# Validating MLP model
scores = cross_validate(model, X, y, cv=5)

# Printing performance results
print_results(scores)

Accuracy:          0.79 (+/- 0.24)
Training time (s): 0.50 (+/- 1.09)
Testing time (s):  0.00 (+/- 0.00)


You should have seen warnings about the MLP not having converged above (with the starting values), and a rather sub-optimal performance!

**QUESTION**: after increasing ``max_iter`` to the point where the convergence warning disappears, what's the performance like? Is it good enough?


**SUGGESTED ANSWER**: you should note that even with 1000 neurons in the hidden layer and training for 5000 epochs, the performance is still quite poor (79% accuracy in the run cached in this notebook).
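Cross-validated accuracy is a single summary number. To see which classes get mixed up, a confusion matrix on a held-out test set can help. The sketch below is one way to do that and is not part of the original lab: the 70/30 stratified split, the ``random_state=0`` seeds and the network size are all assumptions chosen for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_wine(return_X_y=True)

# Assumed split: 70% training, 30% testing, keeping class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# A small MLP on the ORIGINAL (unscaled) feature values;
# it may emit a ConvergenceWarning, as discussed above
model = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)

print(cm)
print("Accuracy: %0.2f" % acc)
```

The diagonal of the matrix holds the correctly classified samples, so the accuracy equals the diagonal sum divided by the total.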

## Normalised vs Scaled feature values

If you tweak the number of neurons in the hidden layer and the maximum number of iterations (and other hyper-parameters), you will probably find that the performance remains quite poor. **INDEED, AS NOTED ABOVE.**

So, let us now move on to comparing the performance when using the normalised and scaled feature matrices instead.

### Normalised feature matrix

In [None]:
# Instantiating MLP
model = MLPClassifier(hidden_layer_sizes=(100,), max_iter=5000)

# Validating MLP model
scores = cross_validate(model, X_normalised, y, cv=5)

# Printing performance results
print_results(scores)

Accuracy:          0.91 (+/- 0.12)
Training time (s): 0.40 (+/- 0.07)
Testing time (s):  0.00 (+/- 0.00)


### Scaled feature matrix

In [None]:
# Instantiating MLP
model = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000)

# Validating MLP model
scores = cross_validate(model, X_scaled, y, cv=5)

# Printing performance results
print_results(scores)


Accuracy:          0.96 (+/- 0.04)
Training time (s): 0.07 (+/- 0.03)
Testing time (s):  0.00 (+/- 0.00)
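One caveat with the cells above: the scaler is fitted on all of ``X`` before cross-validation, so the samples in each test fold have already influenced the scaling statistics. On a dataset this small the effect is probably minor, but a common way to avoid it is to wrap the scaler and the MLP in a pipeline so the scaler is re-fitted on the training part of every fold. The sketch below assumes the same network size as the scaled example; ``random_state=0`` is an added assumption for repeatability.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_validate
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# The pipeline fits the scaler on the training folds only,
# then applies the same transformation to the held-out fold
pipeline = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, random_state=0),
)

# Pass the ORIGINAL X; scaling now happens inside each fold
scores = cross_validate(pipeline, X, y, cv=5)

print("Accuracy: %0.2f (+/- %0.2f)" % (scores['test_score'].mean(),
                                       scores['test_score'].std() * 2))
```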


## Discussion and Conclusions

Above, you should be able to make a few key observations regarding:

* The performance of MLPs in general on this dataset
* How the performance is affected by
  - Hyper-parameters like the number of neurons and number of epochs
  - Data processing: original feature values vs normalised vs scaled

What seems to be best?

What seems to be worst?

What seems to make the biggest difference to the performance?

**SUGGESTED ANSWER / OBSERVATIONS**

* Using the StandardScaler seems to be the best; 96% accuracy vs 91% vs 79%.
* The Normalizer seems better than using the raw/original feature values.
* HOWEVER, note how the performance with the normalised feature values only reaches > 90% accuracy with a significantly larger network: 100 neurons in the hidden layer vs 10 when using the StandardScaler. Also, the maximum number of training iterations had to be increased significantly (5000 vs 1000).
* Training time in the cached results is 0.07 seconds when using feature values from the StandardScaler vs 0.40 seconds when using normalised feature values.
* It seems the biggest impact on performance is the data processing through scaling the feature values.
* Performance may still improve with different hyper-parameters, which the bonus/extra materials below will explore.
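One plausible reason the original feature values do so poorly is that the features live on very different scales, so a few of them dominate the weighted sums inside the network. The sketch below lists the standard deviation of each raw feature, largest first; the names `feature_stds` and `widest_feature` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine

dataset = load_wine()
X = dataset.data

# Spread of each feature (column) in the original, unscaled data
feature_stds = X.std(axis=0)

# Indices of the features sorted from largest to smallest spread
order = np.argsort(feature_stds)[::-1]

for idx in order:
    print("%-30s %10.2f" % (dataset.feature_names[idx], feature_stds[idx]))

widest_feature = dataset.feature_names[order[0]]
print("\nFeature with the largest spread:", widest_feature)
```

After standardisation every feature has the same spread, which is consistent with the much smaller network reaching higher accuracy above.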



**PS**: Feel free to play around with other hyper-parameters as well, which you can see in the API reference documentation: https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html

## Bonus / Extras

### 4-Layer MLP with more hyper-parameters

Showing the use of more hyper-parameters and multiple hidden layers.

**NOTE**: In Scikit-Learn, you can only set the activation function for neurons in the hidden layers (default ``relu``, i.e. ReLU); the activation of the output layer is chosen automatically.

In [None]:
from sklearn.neural_network import MLPClassifier

# Scaling feature matrix
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

# Instantiating MLP
model = MLPClassifier(hidden_layer_sizes=(5,5),
                      activation='tanh',
                      learning_rate='adaptive',
                      max_iter=1000)

# Validating MLP model
scores = cross_validate(model, X_scaled, y, cv=5)

# Printing performance results
print_results(scores)

Accuracy:          0.98 (+/- 0.04)
Training time (s): 0.10 (+/- 0.01)
Testing time (s):  0.00 (+/- 0.00)
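To make the note about activation functions concrete, a fitted ``MLPClassifier`` exposes the layer count, the weight matrices and the output activation it picked. The sketch below fits the same 4-layer architecture once on the scaled data and inspects those attributes; ``random_state=0`` is an added assumption.

```python
from sklearn.datasets import load_wine
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Two hidden layers of 5 neurons each, tanh in the hidden layers
model = MLPClassifier(hidden_layer_sizes=(5, 5),
                      activation='tanh',
                      max_iter=1000,
                      random_state=0)
model.fit(X_scaled, y)

# One weight matrix per pair of consecutive layers
weight_shapes = [w.shape for w in model.coefs_]

print("Layers (input + hidden + output):", model.n_layers_)
print("Weight matrix shapes:", weight_shapes)
print("Output activation:", model.out_activation_)
```

The weight shapes follow the layer sizes: 13 input features, two hidden layers of 5 neurons, and 3 output classes.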


### Hyper-parameter optimisation

Here's an example of using Random Search for automatically finding "optimal" hyper-parameters for an MLP.

API Reference: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html

Another popular hyper-parameter optimisation algorithm is Grid Search, which does a brute-force search, trying every single combination of hyper-parameter values. This can be very time-consuming. Since the code would essentially be the same, only an example with Random Search is shown here.

This example has been modified from: https://scikit-learn.org/stable/auto_examples/model_selection/plot_randomized_search.html

In [None]:
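# Note: sklearn.utils._testing is a private module and may change between releases
from sklearn.utils._testing import ignore_warnings
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import RandomizedSearchCV
import numpy as np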

# Utility function to report best scores
def report(results, rank_metric='score', n_top=3):
    """
    Utility function to report best scores.
    :param results: the cv_results_ data structure from the optimisation algorithm
    :param rank_metric: name of the metric to report results for
    :param n_top: the number of top results to report
    """
    print("\nModels ranked according to", rank_metric)
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results["rank_test_" + rank_metric] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.2f} (+/- {1:.2f})".format(
                  results["mean_test_" + rank_metric][candidate],
                  results["std_test_" + rank_metric][candidate]*2))
            print("Params: {0}".format(results['params'][candidate]))
            print("")

@ignore_warnings(category=ConvergenceWarning)
def run_random_search(model, X, y, param_dict, num_itr=100):
    """
    Note: you don't need to put random search code into such a function, but it is one way to suppress
    the "convergence warning spam" you will get otherwise with models like the MLP.
    """

    random_search = RandomizedSearchCV(model, # the MLP model
                                       param_distributions=param_dict, # the dictionary of hyper-parameters and value space
                                       n_iter=num_itr, # number of random combinations of the above values to try
                                       cv=5) # 5-fold cross-validation

    random_search.fit(X, y)

    return random_search.cv_results_

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from time import time

# instantiating the model (not setting any hyper-parameters here)
model = MLPClassifier()

# specify parameters and potential values to sample from
param_dict = {
    "hidden_layer_sizes": [5, 10, 20, 30, 50, 100, (5,5), (10,10), (20,20)],
    "max_iter": [100, 300, 500, 1000, 2000],
    "activation": ["tanh", "relu", "logistic", "identity"],
    "learning_rate": ["constant", "invscaling", "adaptive"]
}

# set the number of random combinations of hyper-parameter values that will be sampled
# e.g., if 100 - then 100 MLPs will be trained and tested, using a random set of hyper-parameter
# values each time
num_itr = 100

# run random search
print("> STARTING RANDOM SEARCH ...")
start_time = time()

# running random search using THE ORIGINAL feature values (X)
results = run_random_search(model, X, y, param_dict, num_itr)

end_time = time()
print("> RANDOM SEARCH COMPLETE")

print("\nRandomizedSearchCV took %.2f seconds for %d candidates"
          " parameter settings." % ((end_time - start_time), num_itr))
report(results)

> STARTING RANDOM SEARCH ...
> RANDOM SEARCH COMPLETE

RandomizedSearchCV took 14.80 seconds for 100 candidates parameter settings.

Models ranked according to score
Model with rank: 1
Mean validation score: 0.96 (+/- 0.08)
Params: {'max_iter': 2000, 'learning_rate': 'constant', 'hidden_layer_sizes': 100, 'activation': 'logistic'}

Model with rank: 2
Mean validation score: 0.96 (+/- 0.09)
Params: {'max_iter': 1000, 'learning_rate': 'constant', 'hidden_layer_sizes': 50, 'activation': 'logistic'}

Model with rank: 3
Mean validation score: 0.96 (+/- 0.08)
Params: {'max_iter': 2000, 'learning_rate': 'constant', 'hidden_layer_sizes': (20, 20), 'activation': 'tanh'}



In [None]:
from sklearn.model_selection import RandomizedSearchCV
from time import time

# instantiating the model (not setting any hyper-parameters here)
model = MLPClassifier()

# specify parameters and potential values to sample from
param_dict = {
    "hidden_layer_sizes": [5, 10, 20, 30, 50, 100, (5,5), (10,10), (20,20)],
    "max_iter": [100, 300, 500, 1000, 2000],
    "activation": ["tanh", "relu", "logistic", "identity"],
    "learning_rate": ["constant", "invscaling", "adaptive"]
}

# set the number of random combinations of hyper-parameter values that will be sampled
# e.g., if 100 - then 100 MLPs will be trained and tested, using a random set of hyper-parameter
# values each time
num_itr = 100

# run random search
print("> STARTING RANDOM SEARCH ...")
start_time = time()

# running random search using THE SCALED feature values (X_scaled)
results = run_random_search(model, X_scaled, y, param_dict, num_itr)

end_time = time()
print("> RANDOM SEARCH COMPLETE")

print("\nRandomizedSearchCV took %.2f seconds for %d candidates"
          " parameter settings." % ((end_time - start_time), num_itr))
report(results)

> STARTING RANDOM SEARCH ...
> RANDOM SEARCH COMPLETE

RandomizedSearchCV took 26.56 seconds for 100 candidates parameter settings.

Models ranked according to score
Model with rank: 1
Mean validation score: 0.99 (+/- 0.02)
Params: {'max_iter': 1000, 'learning_rate': 'constant', 'hidden_layer_sizes': 50, 'activation': 'relu'}

Model with rank: 2
Mean validation score: 0.99 (+/- 0.03)
Params: {'max_iter': 2000, 'learning_rate': 'adaptive', 'hidden_layer_sizes': 20, 'activation': 'logistic'}

Model with rank: 3
Mean validation score: 0.98 (+/- 0.03)
Params: {'max_iter': 500, 'learning_rate': 'adaptive', 'hidden_layer_sizes': 50, 'activation': 'relu'}

Model with rank: 3
Mean validation score: 0.98 (+/- 0.04)
Params: {'max_iter': 500, 'learning_rate': 'invscaling', 'hidden_layer_sizes': 10, 'activation': 'tanh'}

Model with rank: 3
Mean validation score: 0.98 (+/- 0.03)
Params: {'max_iter': 2000, 'learning_rate': 'constant', 'hidden_layer_sizes': 20, 'activation': 'tanh'}

Model with rank

### Final remarks

You should see several key things above:

* The maximum performance on the original dataset / feature values was capped at 96% accuracy, which is much higher than the 79% reached with the manual tweaking done above.
* However, using the scaled feature values allowed 99% accuracy, which is really high!
* Different combinations of hyper-parameters lead to the same accuracy. This is why a hyper-parameter optimisation function like Random Search is desirable: it would take a long time to manually try out the same number of combinations as were tried in roughly 15-30 seconds here.
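For comparison, here is a rough sketch of the Grid Search alternative mentioned earlier. It first counts how many combinations the full search space above contains, then runs ``GridSearchCV`` on a deliberately tiny grid so that it finishes quickly. The reduced grid (`small_grid`), the fixed ``max_iter=1000`` and ``random_state=0`` are assumptions made for this illustration, not values from the lab.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# The search space used for Random Search above
full_space = {
    "hidden_layer_sizes": [5, 10, 20, 30, 50, 100, (5, 5), (10, 10), (20, 20)],
    "max_iter": [100, 300, 500, 1000, 2000],
    "activation": ["tanh", "relu", "logistic", "identity"],
    "learning_rate": ["constant", "invscaling", "adaptive"],
}

# Grid Search would have to try every one of these combinations
n_full = len(ParameterGrid(full_space))
print("Combinations in the full space:", n_full)

# Hypothetical reduced grid: 2 x 2 = 4 combinations
small_grid = {
    "hidden_layer_sizes": [(10,), (5, 5)],
    "activation": ["tanh", "relu"],
}

grid_search = GridSearchCV(MLPClassifier(max_iter=1000, random_state=0),
                           param_grid=small_grid,
                           cv=5)  # 5-fold cross-validation, as above
grid_search.fit(X_scaled, y)

print("Best params:", grid_search.best_params_)
print("Best mean validation score: %0.2f" % grid_search.best_score_)
```

With 5-fold cross-validation, an exhaustive search over the full space would train several times more models than the 100 candidates sampled by Random Search, which is the trade-off described above.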

