# DISCLAIMER
The `random_state` is set to 42 because it is a common convention in tutorials and assignments (a playful reference to *The Hitchhiker's Guide to the Galaxy*). Any integer works, but different integers produce different random sequences, so results will not be identical across seeds. Using a fixed `random_state` makes the results reproducible.
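
As a minimal sketch of what a fixed seed provides (the variable names below are illustrative and not part of the assignment), the following snippet splits the wine data twice with the same `random_state` and once with a different one. The seed passed to `train_test_split` only fixes the split; `MLPClassifier` has its own `random_state` parameter, which would have to be set separately to make the weight initialization repeatable as well.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, Y = datasets.load_wine(return_X_y=True)

# Same seed twice, then a different seed (7 is an arbitrary choice for illustration)
_, X_test_a, _, Y_test_a = train_test_split(X, Y, test_size=0.2, random_state=42)
_, X_test_b, _, Y_test_b = train_test_split(X, Y, test_size=0.2, random_state=42)
_, X_test_c, _, Y_test_c = train_test_split(X, Y, test_size=0.2, random_state=7)

same_seed_identical = np.array_equal(X_test_a, X_test_b)
different_seed_identical = np.array_equal(X_test_a, X_test_c)

print("Same seed gives the same test set:", same_seed_identical)
print("Different seed gives the same test set:", different_seed_identical)
```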

### Imports

In [81]:
import pandas as pd
import numpy as np

from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

### Load wine dataset

In [68]:
#Define variables X, Y and df
wine = datasets.load_wine()
X = wine.data
Y = wine.target
df = pd.DataFrame(X, columns=wine.feature_names)

#Rename the troublesome column (its name contains a slash)
i = wine.feature_names.index('od280/od315_of_diluted_wines')
wine.feature_names[i] = 'ratio_of_diluted_wines'
df = df.rename(columns={'od280/od315_of_diluted_wines': 'ratio_of_diluted_wines'})
df.head()

In [154]:
#Split the data into training (80%) and test (20%) sets for the MLP (Multi-Layer Perceptron) models
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

**Baseline model: MLP classifier with default setup**

In [155]:
default_model = MLPClassifier(max_iter = 1400)
default_model.fit(X_train, Y_train) #finding weights

layers = default_model.n_layers_
print("Layers: ", layers)

Layers:  3


In [161]:
predictions_default = default_model.predict(X_test)
accuracy_1 = accuracy_score(Y_test, predictions_default)
scores_1 = cross_val_score(default_model, X, Y, cv=10, scoring="accuracy")

print("Cross validation values:\n", scores_1)
print("Cross validation mean:", scores_1.mean())
print("Cross validation standard deviation: ", scores_1.std())
print("Test accuracy: ", accuracy_1, "\n")

Cross validation values:
 [0.88888889 0.61111111 0.88888889 0.94444444 0.83333333 0.61111111
 0.16666667 0.61111111 1.         1.        ]
Cross validation mean: 0.7555555555555555
Cross validation standard deviation:  0.24620577562400375
Test accuracy:  0.3888888888888889 



**QUESTION 1.1**

Above is the MLP classifier built using the default setup from scikit-learn, along with the evaluation of its performance.

To ensure that the MLP has enough time (number of iterations) to learn, we increased `max_iter` beyond its default value. By experimenting with different values, starting at 1000 and increasing to 2000, we gradually narrowed down an appropriate iteration limit. In the end we chose 1400 iterations, as this prevented convergence warnings and ensured that the optimizer completed training before reaching the maximum iteration limit.
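
One way to check whether a given `max_iter` is sufficient is to compare the fitted model's `n_iter_` attribute with the limit and to watch for a `ConvergenceWarning`. The sketch below does this for three limits; the helper `fit_and_check` and the dictionary `iteration_checks` are hypothetical names introduced here, and because the model is not seeded the numbers will differ from run to run.

```python
import warnings

from sklearn import datasets
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, Y = datasets.load_wine(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

def fit_and_check(max_iter):
    """Fit a default MLP and return (iterations used, whether the limit was hit)."""
    model = MLPClassifier(max_iter=max_iter)
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        model.fit(X_train, Y_train)
    hit_limit = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return model.n_iter_, hit_limit

# Limits taken from the range explored in the text (1000 to 2000)
iteration_checks = {m: fit_and_check(m) for m in (1000, 1400, 2000)}
for max_iter, (n_iter, hit_limit) in iteration_checks.items():
    print(f"max_iter={max_iter}: stopped after {n_iter} iterations, limit reached: {hit_limit}")
```

If `n_iter_` stays below `max_iter` over several runs, the optimizer is stopping on its tolerance criterion rather than being cut off.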

**Performance evaluation**

To evaluate the baseline model's generalization, we repeated 10-fold CV five times manually. This could have been done using a for-loop, but we chose this simpler approach. The results for each repeat were:
1. CV mean: ~0.505, CV standard deviation: ~0.226
2. CV mean: ~0.827, CV standard deviation: ~0.176
3. CV mean: ~0.528, CV standard deviation: ~0.195
4. CV mean: ~0.703, CV standard deviation: ~0.241
5. CV mean: ~0.755, CV standard deviation: ~0.246

Overall CV mean: (0.505 + 0.827 + 0.528 + 0.703 + 0.755) / 5 = 0.6636

The default model achieves an overall cross-validation mean of 0.6636, which indicates that it is not consistently strong across different subsets of the data and that its performance varies with how the data is split. The relatively high standard deviations in each repeat (0.176-0.246) indicate that the model is unstable and fluctuates considerably across folds. The test accuracy on the single held-out split is only 0.389, far below the cross-validation mean, which shows how much a single-split estimate can depend on the split. All these factors suggest that the default model's performance is not very robust and that its ability to generalize to unseen data is limited without tuning or improvements.
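
The manual repetition described above could be written as a loop. The following is a minimal sketch, not the procedure actually used for the numbers reported here; `repeated_cv` is a hypothetical helper, and since the model is unseeded the printed values will not match the ones listed.

```python
import warnings

import numpy as np
from sklearn import datasets
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, Y = datasets.load_wine(return_X_y=True)

def repeated_cv(model, X, Y, n_repeats=5, cv=10):
    """Call cross_val_score n_repeats times; return per-repeat means and stds."""
    means, stds = [], []
    for _ in range(n_repeats):
        scores = cross_val_score(model, X, Y, cv=cv, scoring="accuracy")
        means.append(scores.mean())
        stds.append(scores.std())
    return np.array(means), np.array(stds)

with warnings.catch_warnings():
    # Silence convergence warnings so the summary stays readable
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    repeat_means, repeat_stds = repeated_cv(MLPClassifier(max_iter=1400), X, Y)

for i, (m, s) in enumerate(zip(repeat_means, repeat_stds), start=1):
    print(f"{i}. CV mean: {m:.3f}, CV standard deviation: {s:.3f}")
print("Overall CV mean:", repeat_means.mean())
```

With an integer `cv` and a classifier, `cross_val_score` uses stratified folds without shuffling, so the folds are the same in every repeat; the differences between repeats then come from the random weight initialization of the unseeded `MLPClassifier`.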

**Applying the standard rules of thumb**

In [170]:
rule_of_thumb_model = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000)
rule_of_thumb_model.fit(X_train, Y_train)

predictions_rule_of_thumb = rule_of_thumb_model.predict(X_test)
accuracy_2 = accuracy_score(Y_test, predictions_rule_of_thumb)
scores_2 = cross_val_score(rule_of_thumb_model, X, Y, cv=10, scoring="accuracy")

print("Cross validation values:\n", scores_2)
print("Cross validation mean:", scores_2.mean())
print("Cross validation standard deviation: ", scores_2.std())
print("Test accuracy: ", accuracy_2)

Cross validation values:
 [0.38888889 0.38888889 0.33333333 0.33333333 0.38888889 0.33333333
 0.33333333 0.27777778 0.58823529 0.41176471]
Cross validation mean: 0.37777777777777777
Cross validation standard deviation:  0.07982423341580144
Test accuracy:  0.3888888888888889


**QUESTION 1.2 and 1.2.1**

Above is the MLP classifier built in accordance with the rules of thumb.

We increased `max_iter` to 2000 to prevent convergence warnings and to ensure that the optimizer completed training before reaching the maximum iteration limit.

Based on the complexity of the wine dataset (13 input features and 3 output classes), we applied the standard rules of thumb for determining an appropriate hidden layer size for an MLP classifier. The rules are applied as follows:

**Rule 1: Number of neurons somewhere in-between the size of the input and output layer**

The input layer has 13 neurons and the output layer has 3 neurons --> 3-13 neurons

**Rule 2: Mean input and output**

(13 + 3) / 2 = 8 neurons

**Rule 3: Less than twice the input layer**

Twice the input size is 26, so the hidden layer should have fewer than 26 neurons --> 1-25 neurons. This rule is quite broad and therefore less informative than the others.

**Rule 4: Two-thirds rule**

(2/3 * 13) + 3 ≈ 11.67 --> 11-12 neurons

**Conclusion**

Combining the rules, it is reasonable to evaluate hidden layer sizes in the range 3-12 neurons. Among these, 8 neurons is a particularly strong candidate, as it is supported by three of the rules (rules 1-3).
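
The arithmetic behind the four rules can be reproduced in a few lines. This is only a sketch of the calculation for the wine dataset; the variable names are illustrative.

```python
n_inputs, n_outputs = 13, 3  # wine dataset: 13 features, 3 classes

rule_1 = range(n_outputs, n_inputs + 1)   # between output and input size: 3..13
rule_2 = (n_inputs + n_outputs) / 2       # mean of input and output size
rule_3 = range(1, 2 * n_inputs)           # fewer than twice the input size: 1..25
rule_4 = 2 / 3 * n_inputs + n_outputs     # two-thirds of the input size plus the output size

print("Rule 1:", rule_1.start, "-", rule_1.stop - 1)
print("Rule 2:", rule_2)
print("Rule 3:", rule_3.start, "-", rule_3.stop - 1)
print("Rule 4:", round(rule_4, 2))

# The rule 2 value also satisfies rules 1 and 3
candidate = int(rule_2)
print("Candidate", candidate, "within rules 1 and 3:", candidate in rule_1 and candidate in rule_3)
```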

**QUESTION 1.2.2**

**Performance evaluation**

To evaluate the rules-of-thumb model's generalization, we repeated 10-fold CV five times manually. This could have been done using a for-loop, but we chose this simpler approach. The results for each repeat were:
1. CV mean: ~0.377, CV standard deviation: ~0.079
2. CV mean: ~0.342, CV standard deviation: ~0.122
3. CV mean: ~0.355, CV standard deviation: ~0.117
4. CV mean: ~0.336, CV standard deviation: ~0.124
5. CV mean: ~0.537, CV standard deviation: ~0.309

Overall CV mean: (0.377 + 0.342 + 0.355 + 0.336 + 0.537) / 5 = 0.3894

The rules-of-thumb model achieves an overall cross-validation mean of 0.3894, which indicates that it is not consistently strong across different subsets of the data and that its performance varies with how the data is split. The standard deviations in each repeat (0.079-0.309) indicate that the model is unstable and can fluctuate considerably across folds. The model reaches a test accuracy of 0.389 on the single held-out split, which is close to its cross-validation mean. All these factors suggest that the rules-of-thumb model's performance is not very robust and that its ability to generalize to unseen data is limited without tuning or improvements.

**QUESTION 1.3**

Of the two models built so far (the default model and the rules-of-thumb model), the default model is the better baseline based on performance. It has the higher overall cross-validation mean (0.6636 versus 0.3894 for the rules-of-thumb model), and its standard deviations are more consistent across repeats (0.176-0.246 versus 0.079-0.309). The default model is also simpler to set up, whereas the rules-of-thumb model requires the number of neurons to be computed beforehand, which makes its setup more complex and time-consuming.

# BELOW IS CODE COPIED FROM LECTURE SLIDES SO THAT WE DON'T NEED TO WRITE ALL THE CODE MANUALLY - TO BE REMOVED WHEN FINISHED

**Using grid search with cross-validation**

In [35]:
search_model_1 = MLPClassifier(max_iter = 10000)
H_param = {
    'hidden_layer_sizes' : [(1,), (2,), (3,), (4,), (5,), (6,), (7,),]
}
optimal_param = GridSearchCV(search_model_1, H_param)
optimal_param.fit(X, Y)
print(optimal_param.best_params_)

{'hidden_layer_sizes': (2,)}


**Evaluating the suggested best classifier**

In [36]:
best_model = MLPClassifier(hidden_layer_sizes=(2,), max_iter=10000)
scores_best = cross_val_score(best_model, X, Y, cv=10, scoring="accuracy")
print("Cross validation values:\n", scores_best)
print("Cross validation mean:", scores_best.mean())
print("Cross validation standard deviation: ", scores_best.std())

Cross validation values:
 [0.66666667 0.38888889 0.38888889 0.33333333 0.38888889 0.38888889
 0.38888889 0.38888889 0.23529412 0.47058824]
Cross validation mean: 0.403921568627451
Cross validation standard deviation:  0.10404481531287219


In [51]:
scores_best = cross_val_score(best_model, X, Y, cv=10, scoring="accuracy")
print("Cross validation values:\n", scores_best)
print("Cross validation mean:", scores_best.mean())
print("Cross validation standard deviation: ", scores_best.std())

Cross validation values:
 [0.33333333 0.27777778 0.38888889 0.27777778 0.66666667 0.38888889
 0.55555556 0.33333333 0.41176471 0.29411765]
Cross validation mean: 0.39281045751633986
Cross validation standard deviation:  0.12047969680072056


**Including: hidden layer size and activation function in the grid search**

In [42]:
search_model_2 = MLPClassifier(max_iter = 10000)
H_param = {
    'hidden_layer_sizes' : [(1,), (2,), (3,), (4,), (5,), (6,), (7,)],
    'activation' : ['identity', 'logistic', 'tanh', 'relu']
}
optimal_param = GridSearchCV(search_model_2, H_param)
optimal_param.fit(X, Y)
print(optimal_param.best_params_)

{'activation': 'logistic', 'hidden_layer_sizes': (6,)}


In [43]:
model_2 = MLPClassifier(activation='logistic', hidden_layer_sizes=(6,), max_iter=10000)
scores_model_2 = cross_val_score(model_2, X, Y, cv=10, scoring="accuracy")
print("Cross validation values:\n", scores_model_2)
print("Cross validation mean:", scores_model_2.mean())
print("Cross validation standard deviation: ", scores_model_2.std())

Cross validation values:
 [0.83333333 0.94444444 0.38888889 0.94444444 0.88888889 1.
 1.         1.         1.         0.47058824]
Cross validation mean: 0.8470588235294118
Cross validation standard deviation:  0.21589809365223062


In [45]:
model_3 = MLPClassifier(hidden_layer_sizes=(2,2,2), max_iter=10000)
scores_model_3 = cross_val_score(model_3, X, Y, cv=10, scoring="accuracy")
print("Cross validation values:\n", scores_model_3)
print("Cross validation mean:", scores_model_3.mean())
print("Cross validation standard deviation: ", scores_model_3.std())

Cross validation values:
 [0.27777778 0.33333333 0.38888889 0.38888889 0.         0.33333333
 0.38888889 0.38888889 0.76470588 0.47058824]
Cross validation mean: 0.3735294117647059
Cross validation standard deviation:  0.17756890122537536


In [46]:
model_4 = MLPClassifier(max_iter = 10000)
H_param = {
    'hidden_layer_sizes' : [(1,), (2,), (3,), (4,), (5,), (6,), (7,)],
    'solver' : ['sgd', 'adam', 'lbfgs'],
    'learning_rate' : ['constant', 'invscaling', 'adaptive'],
    'learning_rate_init' : [0.1, 0.05, 0.01, 0.005, 0.001]
}
optimal_param = GridSearchCV(model_4, H_param, n_jobs=-1)
optimal_param.fit(X, Y)
print(optimal_param.best_params_)

{'hidden_layer_sizes': (6,), 'learning_rate': 'adaptive', 'learning_rate_init': 0.05, 'solver': 'lbfgs'}


ABNORMAL: .

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
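
The warning above suggests scaling the data. `StandardScaler` is imported at the top of the notebook but not used so far. One way to apply it, sketched here under the assumption that unscaled features are what troubles the optimizer, is to combine the scaler and the classifier in a pipeline so that the scaler is fitted only on the training folds during cross-validation. The use of `make_pipeline`, the `random_state=42` on the classifier, and the names `scaled_model` and `scaled_scores` are choices made for this illustration.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, Y = datasets.load_wine(return_X_y=True)

# Standardize each feature, then train the MLP on the scaled values
scaled_model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(6,), solver='lbfgs', max_iter=10000, random_state=42),
)

scaled_scores = cross_val_score(scaled_model, X, Y, cv=10, scoring="accuracy")
print("Cross validation values:\n", scaled_scores)
print("Cross validation mean:", scaled_scores.mean())
print("Cross validation standard deviation: ", scaled_scores.std())
```

The same pipeline object can be passed to `GridSearchCV`; the parameter names in the grid then need the step prefix, for example `mlpclassifier__hidden_layer_sizes`.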


In [47]:
#Note: learning_rate and learning_rate_init have no effect when solver='lbfgs'
model_5 = MLPClassifier(hidden_layer_sizes=(6,), learning_rate='adaptive', learning_rate_init=0.05, solver='lbfgs', max_iter=10000)
scores_model_5 = cross_val_score(model_5, X, Y, cv=10, scoring="accuracy")
print("Cross validation values:\n", scores_model_5)
print("Cross validation mean:", scores_model_5.mean())
print("Cross validation standard deviation: ", scores_model_5.std())

ABNORMAL: .

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


Cross validation values:
 [0.33333333 0.38888889 0.38888889 0.38888889 0.38888889 0.38888889
 0.38888889 0.38888889 0.41176471 1.        ]
Cross validation mean: 0.4467320261437909
Cross validation standard deviation:  0.18536672506271654


In [48]:
model_5.fit(X_train, Y_train)
predictions_5 = model_5.predict(X_test)
print(classification_report(Y_test, predictions_5))

              precision    recall  f1-score   support

           0       0.86      0.86      0.86         7
           1       1.00      0.89      0.94        18
           2       0.77      0.91      0.83        11

    accuracy                           0.89        36
   macro avg       0.88      0.89      0.88        36
weighted avg       0.90      0.89      0.89        36



In [49]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
model_5.fit(X_train, Y_train)
predictions_5 = model_5.predict(X_test)
print(classification_report(Y_test, predictions_5))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        13
           1       0.39      1.00      0.56        14
           2       0.00      0.00      0.00         9

    accuracy                           0.39        36
   macro avg       0.13      0.33      0.19        36
weighted avg       0.15      0.39      0.22        36



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
