# DISCLAIMER
The `random_state` is set to 42 because it is a common convention in tutorials and assignments (a playful reference to *The Hitchhiker's Guide to the Galaxy*). Any integer works, but different integers produce different sequences of random numbers, so results will not be identical across seeds. Using a fixed `random_state` ensures that results are reproducible.
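
As a minimal sketch of what a fixed seed provides, the snippet below splits a toy array twice with the same `random_state` and once with a different one. The toy array and the variable names are illustrative assumptions and are not part of the assignment code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for a real dataset (illustrative only).
data = np.arange(10)

# Same seed twice: the two splits are identical.
train_a, test_a = train_test_split(data, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(data, test_size=0.2, random_state=42)

# Different seed: the split is generally different.
train_c, test_c = train_test_split(data, test_size=0.2, random_state=0)

print("seed 42, run 1:", test_a)
print("seed 42, run 2:", test_b)
print("seed 0:        ", test_c)
```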

### Imports

In [16]:
import pandas as pd
import numpy as np

from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate, GridSearchCV, StratifiedKFold
from sklearn.metrics import accuracy_score, classification_report, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline


### Load wine dataset

In [17]:
#Define variables X, Y and df
wine = datasets.load_wine()
X=wine.data
Y=wine.target
df=pd.DataFrame(X,columns=wine.feature_names)
df.head()

#changing name of troublesome column
i = wine.feature_names.index('od280/od315_of_diluted_wines')
wine.feature_names[i] = 'ratio_of_diluted_wines'
df = df.rename(columns={'od280/od315_of_diluted_wines': 'ratio_of_diluted_wines'})

In [18]:
#Split the data into training (80%) and test (20%) sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)



**Baseline model: MLP classifier with default setup**

In [19]:
default_model = MLPClassifier(max_iter = 1400)
default_model.fit(X_train, Y_train) #finding weights

layers = default_model.n_layers_
print("Layers: ", layers)

Layers:  3


In [20]:
predictions_default = default_model.predict(X_test)
accuracy_1 = accuracy_score(Y_test, predictions_default)
scores_1 = cross_val_score(default_model, X, Y, cv=10, scoring="accuracy")
#baseline_scores = cross_val_score(baseline_mlp,X, Y,cv=10,scoring='f1_weighted')

print("Cross validation values:\n", scores_1)
#print("Cross validation f1 value:", np.mean(baseline_scores))
print("Cross validation mean:", scores_1.mean())
print("Cross validation standard deviation: ", scores_1.std())
print("Test accuracy: ", accuracy_1, "\n")

Cross validation values:
 [0.16666667 0.27777778 0.94444444 0.22222222 0.83333333 0.22222222
 0.33333333 0.55555556 0.76470588 0.35294118]
Cross validation mean: 0.46732026143790845
Cross validation standard deviation:  0.27117094908185835
Test accuracy:  0.9722222222222222 



**QUESTION 1.1**

Above is the MLP classifier built using the default setup from scikit-learn, along with an evaluation of its performance.

To ensure that the MLP has enough iterations to learn, we increased `max_iter` beyond its default value of 200. By experimenting with different values between 1000 and 2000, we gradually narrowed down an appropriate iteration limit. In the end we chose 1400 iterations, as this prevented convergence warnings and ensured that the optimizer completed training before reaching the maximum iteration limit.
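
One way to check an iteration limit without reading the warnings by eye is to compare `n_iter_` with `max_iter` after fitting. The sketch below assumes a fixed `random_state` and two example limits (50 and 1400); the helper name `fit_and_check` is hypothetical, and the iteration counts will differ from run to run if the seed is removed.

```python
import warnings

from sklearn.datasets import load_wine
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

def fit_and_check(max_iter, seed=42):
    """Fit an MLP and report whether it stopped before hitting max_iter."""
    model = MLPClassifier(max_iter=max_iter, random_state=seed)
    with warnings.catch_warnings(record=True) as caught:
        # Record every ConvergenceWarning instead of printing it.
        warnings.simplefilter("always", category=ConvergenceWarning)
        model.fit(X_train, y_train)
    warned = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return {"n_iter": model.n_iter_, "warned": warned}

results = {m: fit_and_check(m) for m in (50, 1400)}
for max_iter, info in results.items():
    print(f"max_iter={max_iter}: ran {info['n_iter']} iterations, "
          f"convergence warning: {info['warned']}")
```

A limit is large enough for a given seed when no warning is recorded and `n_iter` stays below `max_iter`.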

**Performance evaluation**

To evaluate the baseline model's generalization, we repeated 10-fold CV five times manually instead of using a for-loop, as this was the simpler approach. The results for each repeat were:
1. CV mean: ~0.505, CV standard deviation: ~0.226
2. CV mean: ~0.827, CV standard deviation: ~0.176
3. CV mean: ~0.528, CV standard deviation: ~0.195
4. CV mean: ~0.703, CV standard deviation: ~0.241
5. CV mean: ~0.755, CV standard deviation: ~0.246

Overall CV mean: (0.505 + 0.827 + 0.528 + 0.703 + 0.755) / 5 = 0.6636

The default model achieves an overall cross-validation mean of 0.6636, which indicates that it is not consistently strong across different subsets of the data and that its performance varies with how the data is split. The relatively high standard deviations within each repeat (0.176-0.246) indicate that the model is unstable and fluctuates considerably across folds. In the run shown above, the model reaches a test accuracy of about 0.972, well above the cross-validation mean, which suggests that performance on a single test split depends heavily on the split. Taken together, these factors suggest that the default model's performance is not very reliable and that its ability to generalize to unseen data is limited without tuning or improvements.
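
The manual repeats described above could also be automated with a short loop. The sketch below is one possible way to do it, assuming that a different `random_state` per repeat is the only source of variation. The helper name `repeated_cv` and the reduced demo settings (2 repeats, 5 folds) are illustrative choices to keep the run short, not the settings behind the reported numbers.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = load_wine(return_X_y=True)

def repeated_cv(n_repeats, n_folds, max_iter=1400):
    """Run k-fold CV several times, re-seeding the MLP on each repeat."""
    summaries = []
    for seed in range(n_repeats):
        model = MLPClassifier(max_iter=max_iter, random_state=seed)
        scores = cross_val_score(model, X, y, cv=n_folds, scoring="accuracy")
        summaries.append((scores.mean(), scores.std()))
    return summaries

summaries = repeated_cv(n_repeats=2, n_folds=5)
for i, (mean, std) in enumerate(summaries, start=1):
    print(f"{i}. CV mean: {mean:.3f}, CV standard deviation: {std:.3f}")

overall_mean = np.mean([mean for mean, _ in summaries])
print(f"Overall CV mean: {overall_mean:.4f}")
```

Calling `repeated_cv(n_repeats=5, n_folds=10)` would mirror the procedure described in the text, at the cost of a longer run.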

**Applying the standard rules of thumb**

In [21]:
rule_of_thumb_model = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000)
rule_of_thumb_model.fit(X_train, Y_train)

predictions_rule_of_thumb = rule_of_thumb_model.predict(X_test)
accuracy_2 = accuracy_score(Y_test, predictions_rule_of_thumb)
scores_2 = cross_val_score(rule_of_thumb_model, X, Y, cv=10, scoring="accuracy")

print("Cross validation values:\n", scores_2)
print("Cross validation mean:", scores_2.mean())
print("Cross validation standard deviation: ", scores_2.std())
print("Test accuracy: ", accuracy_2)

Cross validation values:
 [0.44444444 0.88888889 0.88888889 0.55555556 0.27777778 0.33333333
 0.         0.27777778 0.76470588 0.29411765]
Cross validation mean: 0.4725490196078431
Cross validation standard deviation:  0.28108830807152463
Test accuracy:  0.3888888888888889


**QUESTION 1.2 and 1.2.1**

Above is the MLP classifier built in accordance with the rules of thumb.

We increased the value of `max_iter` to 2000 iterations to prevent convergence warnings and ensure that the optimizer completed training before reaching the maximum iteration limit.

Based on the complexity of the wine dataset (13 input features and 3 output classes), we applied the standard rules of thumb for determining an appropriate hidden layer size for an MLP classifier. The rules are applied as follows:

**Rule 1: Number of neurons somewhere in-between the size of the input and output layer**

The input layer has 13 neurons and the output layer has 3 neurons --> 3-13 neurons

**Rule 2: Mean input and output**

(13 + 3) / 2 = 8 neurons

**Rule 3: Less than twice the input layer**

Twice the input size is 26, so this rule allows 1-25 neurons. The rule is quite broad and therefore less informative than the others.

**Rule 4: Two-thirds rule**

(2/3 * 13) + 3 ≈ 11.67 --> 11-12 neurons

**Conclusion**

Combining the rules, it is reasonable to evaluate hidden layer sizes in the range of 3-12 neurons. Among these, 8 neurons is a particularly strong candidate because it satisfies three of the rules (rules 1-3).
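
The arithmetic behind the four rules can be reproduced in a few lines. This is only a sketch of the calculations listed above; the variable names are illustrative.

```python
n_inputs = 13   # features in the wine dataset
n_outputs = 3   # wine classes

# Rule 1: between the output and input layer sizes.
rule_1 = (n_outputs, n_inputs)

# Rule 2: mean of the input and output layer sizes.
rule_2 = (n_inputs + n_outputs) / 2

# Rule 3: fewer than twice the input layer size.
rule_3_upper = 2 * n_inputs - 1

# Rule 4: two-thirds of the input size plus the output size.
rule_4 = (2 / 3) * n_inputs + n_outputs

print(f"Rule 1: {rule_1[0]}-{rule_1[1]} neurons")
print(f"Rule 2: {rule_2:.0f} neurons")
print(f"Rule 3: at most {rule_3_upper} neurons")
print(f"Rule 4: about {rule_4:.2f} neurons")
```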

**QUESTION 1.2.2**

**Performance evaluation**

To evaluate the rules-of-thumb model's generalization, we repeated 10-fold CV five times manually instead of using a for-loop, as this was the simpler approach. The results for each repeat were:
1. CV mean: ~0.337, CV standard deviation: ~0.079
2. CV mean: ~0.342, CV standard deviation: ~0.122
3. CV mean: ~0.355, CV standard deviation: ~0.117
4. CV mean: ~0.336, CV standard deviation: ~0.124
5. CV mean: ~0.537, CV standard deviation: ~0.309

Overall CV mean: (0.337 + 0.342 + 0.355 + 0.336 + 0.537) / 5 = 0.3814

The rules-of-thumb model achieves an overall cross-validation mean of 0.3814, which indicates that it is not consistently strong across different subsets of the data and that its performance varies with how the data is split. The standard deviations within each repeat (0.079-0.309) indicate that the model is unstable and can fluctuate considerably across folds. The model reaches a test accuracy of about 0.389 on the held-out split, and a single split like this gives only a rough picture because performance depends heavily on the split. Taken together, these factors suggest that the rules-of-thumb model's performance is not very robust and that its ability to generalize to unseen data is limited without tuning or improvements.

**QUESTION 1.3**

Of the two models built so far (the default model and the rules-of-thumb model), the default model is the better baseline based on performance. It has a higher overall cross-validation mean of 0.6636 and more consistent standard deviations across repeats (0.176-0.246). In comparison, the rules-of-thumb model has a lower overall cross-validation mean of 0.3814 and standard deviations that vary more widely across repeats (0.079-0.309). The default model is also simpler to set up, whereas the rules-of-thumb model requires computing the number of neurons beforehand, which makes its setup more complex and time-consuming.

In [22]:
#Data standardization
scaler = StandardScaler()
scaler.fit(X_train)

# Transform both train and test sets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

#Applying balancing agent
sm = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = sm.fit_resample(X_train_scaled, Y_train)
X_test_final = X_test_scaled
y_test_final = Y_test

#Testing improvements through cross validation and accuracy
pipeline = ImbPipeline([
    ('scaler', StandardScaler()),
    ('smote', SMOTE(random_state=42)),
    ('mlp', MLPClassifier(random_state=42, max_iter=1400))
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Evaluate with multiple metrics
scores = cross_validate(
    pipeline, X, Y,
    cv=cv,
    scoring=['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']
)
print("DATA PREPARATION RESULTS:")
print(scores)

#Model improvements
baseline_mlp = MLPClassifier(random_state=42, max_iter=1400)
baseline_scores = cross_val_score(baseline_mlp,
                                  X_train_resampled, 
                                  y_train_resampled,
                                  cv=cv,
                                  scoring='f1_weighted'
                                 )
print("\n UPDATED BASELINE MLP RESULTS:")
print("Baseline MLP F1-score (weighted):", np.mean(baseline_scores))

#Applying PCA technique
pca = PCA(n_components=0.95, random_state=42)
X_train_pca = pca.fit_transform(X_train_resampled)
X_test_pca = pca.transform(X_test_final)

print("    ")
print("\n PCA TECHNIQUE RESULTS:")
print("Original number of features:", X_train.shape[1])
print("Number of PCA components:", X_train_pca.shape[1])

#Conducting MLP with PCA
mlp_pca = MLPClassifier(hidden_layer_sizes=(50,50), max_iter=1000, random_state=42)

scores = cross_val_score(mlp_pca, 
                         X_train_pca, 
                         y_train_resampled,
                         cv=cv, 
                         scoring='f1_weighted')

print("\n MLP WITH PCA TECHNIQUE RESULTS:")
print("Cross-validated F1-score with PCA:", np.mean(scores))

#Evaluation of test set
mlp_pca.fit(X_train_pca, y_train_resampled)
y_pred = mlp_pca.predict(X_test_pca)

print("    ")
print("EVALUATION OF TEST SET RESULTS:")
print("Accuracy:", accuracy_score(Y_test, y_pred))
print("Precision (weighted):", precision_score(Y_test, y_pred, average='weighted'))
print("Recall (weighted):", recall_score(Y_test, y_pred, average='weighted'))
print("F1-score (weighted):", f1_score(Y_test, y_pred, average='weighted'))
print("Confusion matrix:\n", confusion_matrix(Y_test, y_pred))


DATA PREPARATION RESULTS:
{'fit_time': array([0.59356499, 0.49616241, 0.49497247, 0.60805702, 0.58730221]), 'score_time': array([0.01849222, 0.02719641, 0.01249576, 0.02023625, 0.01994038]), 'test_accuracy': array([0.97222222, 0.97222222, 0.97222222, 1.        , 0.97142857]), 'test_precision_macro': array([0.96969697, 0.96969697, 0.97777778, 1.        , 0.96666667]), 'test_recall_macro': array([0.97619048, 0.97619048, 0.96666667, 1.        , 0.97777778]), 'test_f1_macro': array([0.97178131, 0.97178131, 0.97096189, 1.        , 0.97096189])}

 UPDATED BASELINE MLP RESULTS:
Baseline MLP F1-score (weighted): 0.9766047079701448
    

 PCA TECHNIQUE RESULTS:
Original number of features: 13
Number of PCA components: 10

 MLP WITH PCA TECHNIQUE RESULTS:
Cross-validated F1-score with PCA: 0.9644972910903492
    
EVALUATION OF TEST SET RESULTS:
Accuracy: 1.0
Precision (weighted): 1.0
Recall (weighted): 1.0
F1-score (weighted): 1.0
Confusion matrix:
 [[14  0  0]
 [ 0 14  0]
 [ 0  0  8]]


**QUESTION 2.1**

**Data Preparation**

Since both models depend heavily on how the data is split, one option is to prepare the data beforehand with techniques such as normalization or standardization, which reduce the variance and the differences in scale among the features. In previous assignments we also learned that the classes in the dataset are slightly imbalanced, so we introduce a balancing component (SMOTE) in an attempt to reduce bias and improve the F1 score. Finally, to ensure that each fold has the same class distribution as the full dataset, we use stratification.
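
The effect of these preparation steps can be inspected directly. The following sketch, which uses only scikit-learn and NumPy, prints the class counts, the per-fold class counts under stratification, and the feature statistics after standardization. Fitting the scaler on the full dataset here is purely for illustration; inside a pipeline the scaler is fitted on the training portion of each fold only.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Class counts in the full dataset show the mild imbalance.
class_counts = np.bincount(y)
print("Samples per class:", class_counts)

# Stratified folds keep roughly the same class proportions in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_counts = [
    np.bincount(y[test_idx], minlength=3) for _, test_idx in cv.split(X, y)
]
for i, counts in enumerate(fold_counts, start=1):
    print(f"Fold {i} test counts:", counts)

# Standardization rescales each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)
print("Max |mean| after scaling:", np.abs(X_scaled.mean(axis=0)).max())
print("Std after scaling (first 3 features):", X_scaled.std(axis=0)[:3])
```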

**Model Improvement**

To improve the model, we first apply the MLP to the prepared data in order to set a new, possibly improved, baseline. Next, we apply PCA to reduce the dimensionality of the dataset and remove possible redundancies. We then train the MLP on the data obtained from PCA to improve training speed and reduce overfitting. Finally, we evaluate the test set with all the changes applied in order to verify their effect on the model.
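
To see how a variance threshold translates into a number of components, the sketch below fits PCA with `n_components=0.95` on the standardized wine features and prints the cumulative explained variance. It runs on the full, non-resampled dataset for illustration, so the component count is not guaranteed to match the one obtained on the resampled training set.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# A float n_components asks PCA to keep enough components to explain
# at least that fraction of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

cumulative = np.cumsum(pca.explained_variance_ratio_)
print("Components kept:", pca.n_components_)
print("Cumulative explained variance:", np.round(cumulative, 3))
```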


As the results show, data preparation produced a more stable model across folds: every score in the data preparation output is close to 1. This is further supported by the weighted F1 of the new baseline model, which is higher than that of the original one. When applying PCA, the 13 original features are reduced to 10 principal components while preserving 95% of the variance. The resulting model obtains a perfect confusion matrix on the test set, which is strong evidence that the improvements to the data and the model have worked: not by outperforming the cross-validation score, but in test performance.


In [23]:
#Grid search 
param_grid = {
    'mlp__hidden_layer_sizes': [
        (50,), (100,), (150,),        # one hidden layer
        (50,50), (100,50), (100,100)  # two hidden layers
    ],    
    'mlp__activation': ['relu', 'tanh'],    
    'mlp__alpha': [0.0001, 0.001, 0.01],    
    'mlp__learning_rate_init': [0.001, 0.01]
}

grid_search = GridSearchCV(
    pipeline,
    param_grid,
    scoring='f1_weighted',
    cv=cv,
    n_jobs=-1
)
grid_search.fit(X, Y)

print("Best parameters:")
print(grid_search.best_params_)

print("\nBest cross-validated F1-score:")
print(grid_search.best_score_)

Best parameters:
{'mlp__activation': 'relu', 'mlp__alpha': 0.0001, 'mlp__hidden_layer_sizes': (100, 50), 'mlp__learning_rate_init': 0.001}

Best cross-validated F1-score:
0.9888678699729315


**QUESTION 2.2**

Above is the first implementation of a grid search. Since the dataset has only 178 samples, we decided to try two hidden layers in addition to the single layer used in the first examples given in class. These parameters will change as the remaining questions of the assignment are answered.

**QUESTION 2.2.1, 2.2.2, 2.2.3**

Above, we use grid search to find the best-scoring combination of hyperparameters: `activation = relu`, `alpha = 0.0001`, `hidden_layer_sizes = (100, 50)` and `learning_rate_init = 0.001`. In this section, we also explain our choice of parameters and justify the ranges of values we tested.

We selected the following hyperparameters for our grid search: 
- `hidden_layer_sizes` - The wine dataset is relatively small (178 samples), which makes the model's capacity particularly important. A network with too few neurons may not be flexible enough to capture the underlying patterns and can underfit, while a network with too many neurons may memorize the training data and overfit. For this reason, we included both one-layer and two-layer configurations to find an appropriate balance between model complexity and generalization.

- `activation` - The activation function gives the neural network the ability to learn complex patterns, not just linear relationships. We therefore included two of the most commonly used activations in MLPs, namely ReLU and tanh. ReLU is cheap to compute and less prone to vanishing gradients, while tanh is smoother and can perform better when input features are scaled.

- `alpha` - This parameter determines how strongly the model keeps the weights small to prevent overfitting (L2 regularization). Because our dataset is small, regularization is important to help the model generalize well. By testing several values of `alpha`, we can find an appropriate balance between fitting the data and avoiding overfitting. The selected values cover a reasonable range of regularization strengths:
    - 0.0001 --> weak regularization
    - 0.001 --> moderate regularization
    - 0.01 --> strong regularization


- `learning_rate_init` - The learning rate controls the size of the weight updates and therefore how quickly and how smoothly the model converges during training. If the learning rate is too high, the model becomes unstable and cannot settle on a good solution. If it is too low, training becomes slow and the model may get stuck before finding the best result. We test two commonly used values (0.001 and 0.01) to identify the most stable and efficient learning rate for the wine dataset.
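
The size of the search follows directly from the value lists above. As a rough sketch, the grid can be enumerated with scikit-learn's `ParameterGrid` to count the candidate combinations and the resulting number of model fits. The grid is repeated here without the `mlp__` pipeline prefix, and the fold count of 5 is taken from the `StratifiedKFold` setup used earlier.

```python
from sklearn.model_selection import ParameterGrid

# Same values as in the grid search, without the pipeline prefix.
param_grid = {
    "hidden_layer_sizes": [(50,), (100,), (150,), (50, 50), (100, 50), (100, 100)],
    "activation": ["relu", "tanh"],
    "alpha": [0.0001, 0.001, 0.01],
    "learning_rate_init": [0.001, 0.01],
}

n_candidates = len(ParameterGrid(param_grid))
n_folds = 5  # matches StratifiedKFold(n_splits=5)

print("Candidate combinations:", n_candidates)
print("Model fits during the search:", n_candidates * n_folds)
```

Each extra value added to any list multiplies the number of fits, which is a practical reason to keep the ranges short on a first pass.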

In [27]:
best_param_model = MLPClassifier(activation = 'relu', alpha = 0.0001, hidden_layer_sizes=(100, 50), learning_rate_init=0.001, max_iter=10000)
scores_best_param_model = cross_val_score(best_param_model, X, Y, cv=10, scoring="accuracy")

print("Cross validation values:\n", scores_best_param_model)
print("Cross validation mean:", scores_best_param_model.mean())
print("Cross validation standard deviation: ", scores_best_param_model.std())

best_param_model.fit(X_train, Y_train)
predictions_best_param_model = best_param_model.predict(X_test)
accuracy_best_param_model = accuracy_score(Y_test, predictions_best_param_model)
print(classification_report(Y_test, predictions_best_param_model))
print(pd.crosstab(Y_test, predictions_best_param_model))

print("Test accuracy: ", accuracy_best_param_model)

Cross validation values:
 [0.5        0.94444444 0.27777778 0.94444444 0.83333333 1.
 0.5        1.         1.         0.47058824]
Cross validation mean: 0.7470588235294118
Cross validation standard deviation:  0.2638256971224182
              precision    recall  f1-score   support

           0       1.00      0.71      0.83        14
           1       0.72      0.93      0.81        14
           2       0.50      0.50      0.50         8

    accuracy                           0.75        36
   macro avg       0.74      0.71      0.72        36
weighted avg       0.78      0.75      0.75        36

col_0   0   1  2
row_0           
0      10   1  3
1       0  13  1
2       0   4  4
Test accuracy:  0.75


**QUESTION 2.2.4**
