<a href="https://colab.research.google.com/github/aminedahire/AutoML/blob/main/Copie_de_TP4_automl.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TP: Machine Learning (SIA_3611)

## TP4: AutoML (4h)

by Guillaume Renton

In the previous TPs, you learned to use machine learning for different kinds of tasks, from regression to classification to clustering. In this TP, you will apply that knowledge to new datasets for regression and classification.

You will use two new datasets in this TP. The first is California housing, whose target variable is the median house value of a California district, expressed in hundreds of thousands of dollars. For each district, a set of 8 features is available. There are 20 640 samples in total.

The second is MNIST, a very popular dataset for handwritten digit recognition and image classification. The original dataset is made of 60 000 training images of handwritten digits from 0 to 9, each of shape 28x28, and 10 000 test images. To limit computation time, you will work on a given random subset of MNIST made of 6000 training images and 1000 test images.

**Objectives :**
- Apply your knowledge on new datasets
- Tune model hyperparameters and explore metrics
- Apply principal component analysis and understand its effects on both datasets
- Understand and use Cross-Validation
- Use AutoML to find interesting models

### STEP 1 : Getting started with new datasets

#### Substep 1 : Regression

In the first part of step 1, you will work on the regression problem with the California housing dataset.

**To do 1.1**

Execute the following cell to load the California housing dataset and normalize it.

In [1]:
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import normalize

X, y = fetch_california_housing(return_X_y = True)
X = normalize(X)
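
Note that `normalize` rescales each sample (row) to unit norm; it does not put the features (columns) on a common scale. The sketch below uses a small made-up array (the values are illustrative, not taken from the dataset) to contrast it with `StandardScaler`, which standardizes each column instead.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# Toy data (hypothetical values): 3 samples, 2 features on very different scales
X_toy = np.array([[3.0, 4000.0],
                  [1.0, 2000.0],
                  [2.0, 6000.0]])

# normalize() works row by row: every sample ends up with L2 norm 1
X_rows = normalize(X_toy)
row_norms = np.linalg.norm(X_rows, axis=1)

# StandardScaler works column by column: every feature gets mean 0 and std 1
X_cols = StandardScaler().fit_transform(X_toy)
col_means = X_cols.mean(axis=0)
col_stds = X_cols.std(axis=0)

print(row_norms)
print(col_means, col_stds)
```

Which of the two is appropriate depends on the model; scale-sensitive methods such as SGD and SVR are usually paired with per-feature scaling.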

**To code 1.2**

Apply the [Stochastic Gradient Descent](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html#sklearn.linear_model.SGDRegressor) and [SVR](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html) methods and cross-validate your results using 5 folds. For this, you can either use the function [cross_val_score](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score) (or any other cross-validation method in sklearn) or implement the cross-validation yourself. Optimize both methods according to a relevant metric. For SGD, optimize the value of alpha for both the L2 and the L1 penalty. For SVR, optimize the kernel. Be careful with the metric if you use cross_val_score: error metrics are returned negated (for example `neg_mean_squared_error`), so the values are often negative.
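
To make the sign convention concrete, here is a minimal sketch on synthetic data (an assumption: `make_regression` stands in for the real dataset, and the variable names are illustrative). It computes the 5-fold MSE once with `cross_val_score` and once with a hand-written `KFold` loop.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem (assumption: stands in for the real dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
model = SGDRegressor(random_state=0)

# Option 1: cross_val_score returns the NEGATED MSE of each fold
neg_mse = cross_val_score(model, X_demo, y_demo, cv=5, scoring="neg_mean_squared_error")
mse_per_fold = -neg_mse

# Option 2: the same 5-fold loop written by hand
manual_mse = []
for train_idx, test_idx in KFold(n_splits=5).split(X_demo):
    fold_model = clone(model).fit(X_demo[train_idx], y_demo[train_idx])
    y_hat = fold_model.predict(X_demo[test_idx])
    manual_mse.append(mean_squared_error(y_demo[test_idx], y_hat))
manual_mse = np.array(manual_mse)

print("MSE per fold:", mse_per_fold)
print("Mean MSE:", mse_per_fold.mean())
```

Flipping the sign of the returned scores gives the usual "lower is better" MSE, which is what the hand-written loop computes directly.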

In [2]:
from sklearn.linear_model import SGDRegressor
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score, GridSearchCV
import numpy as np

In [3]:
# Define the range of alpha values for SGD
alphas = np.logspace(-6, -1, 6)

# Define the models and parameters for SGD
sgd_models_params = [
    {'penalty': ['l2'], 'alpha': alphas},
    {'penalty': ['l1'], 'alpha': alphas}
]

# Perform GridSearchCV for SGD
for model_params in sgd_models_params:
    sgd = SGDRegressor(max_iter=1000, tol=1e-3, random_state=42)
    grid_sgd = GridSearchCV(sgd, model_params, cv=5, scoring='neg_mean_squared_error')
    grid_sgd.fit(X, y)
    print(f"Best parameters for SGD with penalty {model_params['penalty'][0]}: {grid_sgd.best_params_}")
    print(f"Best cross-validation score: {-grid_sgd.best_score_}\n")

# Define the kernels for SVR
kernels = ['linear', 'rbf', 'sigmoid', 'poly']

# Perform GridSearchCV for SVR
svr = SVR()
grid_svr = GridSearchCV(svr, {'kernel': kernels}, cv=5, scoring='neg_mean_squared_error')
grid_svr.fit(X, y)
print(f"Best parameters for SVR: {grid_svr.best_params_}")
print(f"Best cross-validation score: {-grid_svr.best_score_}")


Best parameters for SGD with penalty l2: {'alpha': 0.1, 'penalty': 'l2'}
Best cross-validation score: 1.3742208955790975

Best parameters for SGD with penalty l1: {'alpha': 0.1, 'penalty': 'l1'}
Best cross-validation score: 1.3714799514468143

Best parameters for SVR: {'kernel': 'poly'}
Best cross-validation score: 1.198046329410784


**Question 1**

## According to your metric, which method obtains the best result?

According to the provided cross-validation scores (which are based on the Mean Squared Error), the Support Vector Regression (SVR) with a polynomial kernel achieves the best performance with the lowest score of approximately 1.198. It's important to note that lower values of Mean Squared Error (MSE) indicate better model performance for regression tasks. Therefore, SVR with a polynomial kernel is the best-performing model among the ones evaluated.
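
Because MSE is expressed in squared target units, its square root (RMSE) is easier to interpret. The short sketch below converts the scores printed above, assuming the target unit of $100,000 stated in the introduction; the dictionary keys are just illustrative labels.

```python
import math

# Cross-validated MSE values copied from the output above
mse_scores = {
    "SGD (L2)": 1.3742208955790975,
    "SGD (L1)": 1.3714799514468143,
    "SVR (poly)": 1.198046329410784,
}

# RMSE is in the same unit as the target (hundreds of thousands of dollars)
rmse_scores = {name: math.sqrt(mse) for name, mse in mse_scores.items()}
best_model = min(mse_scores, key=mse_scores.get)

for name, rmse in rmse_scores.items():
    print(f"{name}: RMSE = {rmse:.3f} (about ${rmse * 100_000:,.0f})")
print("Lowest MSE:", best_model)
```

Read this way, the best score corresponds to a typical error of roughly $109,000.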

**Question 2**

## What is the benefit of using cross-validation in general? Is it relevant in this particular case?

Cross-validation is essential for providing a reliable and unbiased performance estimate of different models, aiding in hyperparameter tuning, and ensuring a fair comparison across models, which is particularly important in scenarios like this with a moderately sized dataset and multiple models to evaluate.

**To code 1.3**

Transform your data using [principal component analysis](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html), and optimize the number of components for both models according to the same metric as before.

In [4]:
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

In [5]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.utils import shuffle



n_components = np.arange(1, X.shape[1] + 1)
# Shuffle the data and select a subset for faster computation
X_shuffled, y_shuffled = shuffle(X, y, random_state=42)
X_subset, y_subset = X_shuffled[:2000], y_shuffled[:2000]  # Adjust the size as needed

# Define the pipelines and randomized search for each model because the gridsearch was taking too long
def randomized_search(model, params):
    search = RandomizedSearchCV(model, params, n_iter=10, cv=5, scoring='neg_mean_squared_error', random_state=42, n_jobs=-1)
    search.fit(X_subset, y_subset)
    print(f"Best parameters: {search.best_params_}")
    print(f"Best cross-validation score: {-search.best_score_}\n")

# SGD with L2 penalty (alpha taken from the previous search)
sgd_l2_pipe = Pipeline([
    ('pca', PCA(random_state=42)),
    ('sgd', SGDRegressor(max_iter=1000, tol=1e-3, random_state=42, alpha=0.1, penalty='l2'))
])

# SGD with L1 penalty
sgd_l1_pipe = Pipeline([
    ('pca', PCA(random_state=42)),
    ('sgd', SGDRegressor(max_iter=1000, tol=1e-3, random_state=42, alpha=0.1, penalty='l1'))
])

# SVR with polynomial kernel (best kernel from the previous search)
svr_pipe = Pipeline([
    ('pca', PCA(random_state=42)),
    ('svr', SVR(kernel='poly'))
])

# Search over the number of PCA components for each pipeline
print("SGD with L2 penalty:")
randomized_search(sgd_l2_pipe, {'pca__n_components': n_components})

print("SGD with L1 penalty:")
randomized_search(sgd_l1_pipe, {'pca__n_components': n_components})

print("SVR with polynomial kernel:")
randomized_search(svr_pipe, {'pca__n_components': n_components})

**Question 3**

## What is the benefit of Principal Component Analysis in general? Is it relevant here?

**Interest of Principal Component Analysis (PCA):**

PCA aids in reducing dimensionality, filtering out noise, and potentially improving model performance, which is beneficial for both computational efficiency and model optimization.

**Relevance in This Scenario:**

In this case, PCA proved useful in speeding up hyperparameter tuning without significantly degrading model performance. It also provided an additional parameter to optimize, contributing to the fine-tuning of the models.
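
One common way to judge how many components are worth keeping is the cumulative explained variance. The sketch below illustrates the idea on synthetic data (an assumption: 8 features generated from 3 latent factors, not the housing data itself; all variable names are illustrative).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data (assumption): 8 observed features driven by only 3 latent factors
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 8))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(500, 8))

pca = PCA().fit(X_demo)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components that keeps at least 95% of the variance
n_keep = int(np.argmax(cumulative >= 0.95)) + 1

print(np.round(cumulative, 4))
print("Components needed for 95% of the variance:", n_keep)
```

When the curve saturates after a few components, the remaining ones carry mostly noise and can be dropped at little cost; when it rises slowly, PCA removes real information.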

#### Substep 2 : Classification

**To do 1.4**

Execute the following cells to load a subset of the MNIST dataset. Since the dataset is already divided into training and test sets, we won't use cross-validation this time.

In [None]:
from tensorflow.keras.datasets import mnist

(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.reshape((X_train.shape[0], -1))  # Flatten the images
X_test = X_test.reshape((X_test.shape[0], -1))  # Flatten the images


In [None]:
import pickle
with open("data/mnist.pkl", "rb") as f:
    ((X_train, y_train), (X_test, y_test)) = pickle.load(f)

**To code 1.5**

Classify those images using a [KNN classifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier) and an [AdaBoost classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html). For each classifier, optimize the parameters according to a relevant metric. For the KNN classifier, optimize the number of neighbors. For the AdaBoost classifier, optimize the base estimator along with the number of estimators (for the base estimator, limit yourself to decision tree classifiers of different depths).

Also, for each model, compute the confusion matrix.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, accuracy_score
import numpy as np

# Flatten the images and select a subset for faster computation
X_train_flat = X_train.reshape((X_train.shape[0], -1))
X_test_flat = X_test.reshape((X_test.shape[0], -1))

# Subset of the data for faster computation
X_train_subset = X_train_flat[:6000]
y_train_subset = y_train[:6000]
X_test_subset = X_test_flat[:1000]
y_test_subset = y_test[:1000]

# KNN Classifier
knn = KNeighborsClassifier()
param_grid_knn = {'n_neighbors': np.arange(1, 5)}
grid_knn = GridSearchCV(knn, param_grid_knn, cv=5, scoring='accuracy', n_jobs=-1)
grid_knn.fit(X_train_subset, y_train_subset)

# AdaBoost Classifier
ada = AdaBoostClassifier()
param_grid_ada = {
    'base_estimator': [DecisionTreeClassifier(max_depth=depth) for depth in [1, 2]],
    'n_estimators': [50, 100]
}
grid_ada = GridSearchCV(ada, param_grid_ada, cv=5, scoring='accuracy', n_jobs=-1)
grid_ada.fit(X_train_subset, y_train_subset)

# Best parameters and accuracy for KNN
print("Best parameters for KNN:", grid_knn.best_params_)
print("Best accuracy for KNN:", grid_knn.best_score_)

# Best parameters and accuracy for AdaBoost
print("Best parameters for AdaBoost:", grid_ada.best_params_)
print("Best accuracy for AdaBoost:", grid_ada.best_score_)

# Confusion Matrix for KNN
y_pred_knn = grid_knn.predict(X_test_subset)
cm_knn = confusion_matrix(y_test_subset, y_pred_knn)
print("Confusion Matrix for KNN:\n", cm_knn)

# Confusion Matrix for AdaBoost
y_pred_ada = grid_ada.predict(X_test_subset)
cm_ada = confusion_matrix(y_test_subset, y_pred_ada)
print("Confusion Matrix for AdaBoost:\n", cm_ada)


**Question 4**

## According to your metric, which method obtains the best results?

According to the accuracy metric, the KNN classifier obtains the best results with an accuracy of approximately 93.4%, while the AdaBoost classifier has a significantly lower accuracy of 61.6%.

**Question 5**

## According to the confusion matrix, which class is the easiest to classify? Which ones are the most difficult? Which ones are most often confused with each other?

**Easiest to Classify:**

- Class 1 seems to be the easiest to classify with the KNN classifier, as it has 126 true positives and no false positives or false negatives.
- For the AdaBoost classifier, Class 1 also has relatively good performance, but not as perfect as in KNN.

**Most Difficult to Classify:**

- For KNN, Classes 2 and 9 seem more challenging. Class 2 has some misclassifications across several other classes, and Class 9 has several instances misclassified as Class 4.
- For AdaBoost, many classes show poor performance, but Classes 2, 5, and 9 stand out. Class 2 has widespread misclassifications, Class 5 has a lot of instances misclassified as Class 9, and Class 9 has many instances misclassified as Class 4.

**Most Confused With Each Other:**

- For KNN, Class 9 is often confused with Class 4.
- For AdaBoost, Class 5 is frequently confused with Class 9, and Class 9 is often confused with Class 4 and Class 7.
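
These readings can also be extracted programmatically. The sketch below uses a small hypothetical 3-class confusion matrix (the numbers are made up for illustration, not MNIST results) with rows as true classes and columns as predicted classes, which is the layout `confusion_matrix` returns.

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows = true class, columns = predicted class)
cm = np.array([[50,  0,  0],
               [ 2, 40,  8],
               [ 1, 12, 37]])

# Recall per class: correctly classified samples divided by the row total
recall = cm.diagonal() / cm.sum(axis=1)
easiest = int(np.argmax(recall))
hardest = int(np.argmin(recall))

# Most frequent confusion: largest off-diagonal cell
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
true_cls, pred_cls = np.unravel_index(np.argmax(off_diag), off_diag.shape)

print("Recall per class:", recall)
print(f"Easiest: {easiest}, hardest: {hardest}")
print(f"Most frequent confusion: true {true_cls} predicted as {pred_cls}")
```

The same three steps apply unchanged to a 10x10 MNIST matrix such as `cm_knn` or `cm_ada`.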

**Bonus**

For the AdaBoost classifier, explore other classifiers as base estimators. What are the limitations of those estimators?

In [None]:
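# Imports needed by the SVC and logistic regression base estimators below
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# Decision Tree Classifier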
dt_params = {
    'base_estimator__max_depth': [1, 2, 3],
    'n_estimators': [50, 100, 150]
}
dt = GridSearchCV(AdaBoostClassifier(base_estimator=DecisionTreeClassifier()), dt_params)
dt.fit(X_train, y_train)
print(f"Best parameters for AdaBoost with Decision Tree: {dt.best_params_}")
print(f"Best accuracy: {accuracy_score(y_test, dt.predict(X_test))}")

# Support Vector Classifier
svc_params = {
    'base_estimator__kernel': ['linear', 'rbf'],
    'n_estimators': [50, 100, 150]
}
svc = GridSearchCV(AdaBoostClassifier(base_estimator=SVC(probability=True)), svc_params)
svc.fit(X_train, y_train)
print(f"Best parameters for AdaBoost with SVC: {svc.best_params_}")
print(f"Best accuracy: {accuracy_score(y_test, svc.predict(X_test))}")

# Logistic Regression
lr_params = {
    'base_estimator__C': [0.01, 0.1, 1, 10],
    'n_estimators': [50, 100, 150]
}
lr = GridSearchCV(AdaBoostClassifier(base_estimator=LogisticRegression()), lr_params)
lr.fit(X_train, y_train)
print(f"Best parameters for AdaBoost with Logistic Regression: {lr.best_params_}")
print(f"Best accuracy: {accuracy_score(y_test, lr.predict(X_test))}")
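
One limitation can be checked directly: AdaBoost reweights the training samples at each boosting round, so the base estimator's `fit` must accept `sample_weight`. The sketch below tests a few candidates with scikit-learn's `has_fit_parameter` helper; the choice of candidates is only illustrative.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.validation import has_fit_parameter

candidates = {
    "DecisionTreeClassifier": DecisionTreeClassifier(max_depth=1),
    "SVC": SVC(probability=True),
    "LogisticRegression": LogisticRegression(),
    "KNeighborsClassifier": KNeighborsClassifier(),
}

# AdaBoost reweights the training samples at every round, so the base
# estimator's fit() has to accept a sample_weight argument
supports_weights = {
    name: has_fit_parameter(est, "sample_weight") for name, est in candidates.items()
}

for name, ok in supports_weights.items():
    print(f"{name}: sample_weight supported = {ok}")
```

Estimators that pass this check can still be impractical: each boosting round refits the base estimator, so slow learners such as an SVC with probability estimates quickly become expensive.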

**To code 1.6**

Transform your data using principal component analysis, and optimize the number of components for each classifier according to the same metric as before.

Once again, compute the confusion matrix for each model.

In [None]:
knn_pipeline = Pipeline([
    ('pca', PCA()),
    ('knn', KNeighborsClassifier())
])

# Define parameter grid for KNN
knn_param_grid = {
    'pca__n_components': np.arange(1, X_train.shape[1] + 1, 10),
    'knn__n_neighbors': [3, 5, 7]
}

# Perform randomized search for KNN (samples 10 combinations from the grid)
knn_random_search = RandomizedSearchCV(knn_pipeline, knn_param_grid, n_iter=10, cv=5, n_jobs=-1)
knn_random_search.fit(X_train, y_train)

# Print results for KNN
print(f'Best parameters for KNN: {knn_random_search.best_params_}')
print(f'Best accuracy for KNN: {knn_random_search.best_score_}')
y_pred_knn = knn_random_search.predict(X_test)
print('Confusion Matrix for KNN:')
print(confusion_matrix(y_test, y_pred_knn))

# Define a pipeline for AdaBoost
adaboost_pipeline = Pipeline([
    ('pca', PCA()),
    ('adaboost', AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1)))
])

# Define parameter grid for AdaBoost
adaboost_param_grid = {
    'pca__n_components': np.arange(1, X_train.shape[1] + 1, 10),
    'adaboost__n_estimators': [50, 100, 150]
}

# Perform randomized search for AdaBoost (samples 10 combinations from the grid)
adaboost_random_search = RandomizedSearchCV(adaboost_pipeline, adaboost_param_grid, n_iter=10, cv=5, n_jobs=-1)
adaboost_random_search.fit(X_train, y_train)

# Print results for AdaBoost
print(f'Best parameters for AdaBoost: {adaboost_random_search.best_params_}')
print(f'Best accuracy for AdaBoost: {adaboost_random_search.best_score_}')
y_pred_adaboost = adaboost_random_search.predict(X_test)
print('Confusion Matrix for AdaBoost:')
print(confusion_matrix(y_test, y_pred_adaboost))


**Question 6**

Is the use of PCA relevant here ?

**Question 7**

Did your answers to Question 5 change with PCA?

### Step 2 : AutoML

In this second section, we discuss the use of AutoML tools, such as auto-sklearn.
If you are using Colab or don't have auto-sklearn installed, you may first need to run the following cells to install it. This will require you to restart the runtime (a prompt will invite you to do so).

Restarting the runtime will clear all your variables and imported libraries, so you will need to import them again.

In [None]:
!sudo apt-get install -y python3.9

In [None]:
!pip install --force-reinstall scipy==1.6
!pip install --force-reinstall auto-sklearn==0.15

**To do 2.1**

Execute the following cells.  

In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
X, y = fetch_california_housing(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
import autosklearn.regression
import sklearn.model_selection
import sklearn.datasets
import os, shutil
from sklearn.metrics import mean_squared_error, mean_absolute_error

automl = autosklearn.regression.AutoSklearnRegressor(
    include = {'regressor': ["libsvm_svr", "sgd"]},
    time_left_for_this_task=120,
    per_run_time_limit=30,
    tmp_folder='/tmp/california_housing_tmp',
)
automl.fit(X_train, y_train, dataset_name='California_Housing')

print(automl.leaderboard())

# predict() only takes the features; the targets are used for scoring below
y_pred = automl.predict(X_test)
print("MSE = ", mean_squared_error(y_test, y_pred))
print("MAE = ", mean_absolute_error(y_test, y_pred))
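
The two metrics printed above weight errors differently: `mean_squared_error` squares each error, so large mistakes dominate, while `mean_absolute_error` averages the absolute errors. A tiny worked example with made-up values (not model output) shows the arithmetic.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical targets and predictions, chosen so the arithmetic is easy to follow
y_true = np.array([3.0, 5.0, 2.0])
y_hat = np.array([2.0, 5.0, 4.0])

# The errors are -1, 0 and 2
mse = mean_squared_error(y_true, y_hat)   # (1 + 0 + 4) / 3
mae = mean_absolute_error(y_true, y_hat)  # (1 + 0 + 2) / 3

print("MSE =", mse)
print("MAE =", mae)
```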

In [None]:
from pprint import pprint
pprint(automl.show_models(), indent=4)

**Question 8**

Which models does AutoML evaluate?
Which model obtains the best performance?
What are the parameters of the best model?

**To code 2.2**

With the help of the previous code, use AutoML for the classification task on MNIST, limiting the exploration to KNN and AdaBoost.

**Question 9**

Which models does AutoML evaluate?
Which model obtains the best performance?
What are the parameters of the best model?

### Bonus step

As a bonus step, have fun and remove as many constraints as you can from your AutoML model. Which model obtains the best performance? Describe the parameters of this model. You can do this for regression, for classification, or for both.