[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/alexwolson/postdocbootcamp2023/blob/main/lab_3_1_auto_sklearn.ipynb)

# UofT DSI-CARTE Postdoc Bootcamp
#### Tuesday July 20, 2023
#### Intro to Auto-sklearn for AutoML - Lab 1, Day 3
#### Teaching team: Alex Olson, Nakul Upadhya, Shehnaz Islam
##### Lab author: Shehnaz Islam, shehnaz.islam@mail.utoronto.ca, edited by Alex Olson

### Introduction
**Automated machine learning (AutoML)** is a fairly young field whose goal is to build an automated workflow that takes raw data as input and produces predictions. Such a workflow should handle preprocessing, model selection, hyperparameter tuning, and the other stages of the ML process without manual intervention.

There are different types of AutoML frameworks, each with its own features and each automating some steps of a full machine learning workflow, from preprocessing to model development. In this lab we discuss one such framework, **Auto-sklearn**.

### Auto-sklearn
**Auto-sklearn** is a state-of-the-art, open-source AutoML framework built on top of scikit-learn. Its task is to find the best ML model and hyperparameters for a dataset within a vast search space that includes many classifiers and many hyperparameters. Auto-sklearn thus frees the user from algorithm selection and hyperparameter tuning by leveraging recent advances in Bayesian optimization, meta-learning, and ensemble construction.
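
To make "algorithm selection and hyperparameter tuning" concrete, here is a minimal sketch of doing that search by hand with plain scikit-learn. It is not how Auto-sklearn works internally: it uses an exhaustive grid instead of Bayesian optimization, and the two model families and their grids are arbitrary choices made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# A two-step pipeline; the "model" step is swapped out by the search below.
pipeline = Pipeline([("scale", StandardScaler()), ("model", LogisticRegression())])

# Each dict is one algorithm together with its own hyperparameter grid
# (illustrative values, not a recommendation).
search_space = [
    {"model": [LogisticRegression(max_iter=5000)], "model__C": [0.1, 1.0, 10.0]},
    {"model": [RandomForestClassifier(random_state=0)], "model__n_estimators": [50, 100]},
]

# Try every candidate with 3-fold cross-validation and keep the best one.
search = GridSearchCV(pipeline, search_space, cv=3, scoring="accuracy")
search.fit(X, y)

best_model_name = type(search.best_estimator_.named_steps["model"]).__name__
print(best_model_name, round(search.best_score_, 3))
```

Even this tiny search needs a hand-written search space with five candidates. Auto-sklearn's contribution is to define a much larger space for you and explore it within a time budget.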

### Why AutoML?
AutoML is useful because it can improve the quality of a data scientist's work rather than remove data scientists from the loop.
Experts can use AutoML to work more efficiently by focusing on the best-performing pipelines, and non-experts can use AutoML systems without a broad ML education.

## Auto-sklearn for Classification

Now let's see how to use Auto-sklearn on a binary classification task: the breast cancer dataset, with 30 features and 569 samples, each belonging to either the 'malignant' or the 'benign' class.

In [None]:
# Install auto-sklearn
!pip3 install auto-sklearn

In [None]:
# Import autosklearn and check version
import autosklearn
print(autosklearn.__version__)
# 0.15.0

In [None]:
# Make necessary imports
import autosklearn.classification

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score, classification_report

### Data Loading

In [None]:
# Load breast cancer dataset
import sklearn.datasets
import sklearn.metrics

X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)

# Split dataset (train_test_split defaults to 75% train, 25% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
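
The split above relies on the `train_test_split` defaults. The sketch below shows the resulting split sizes for this dataset. The `stratify=y` variant is an optional addition that the lab does not use; it is included only to illustrate keeping the class ratio the same in both splits, and the `Xs_*`/`ys_*` names are made up for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Same call as in the lab: test_size defaults to 0.25.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print("train:", X_train.shape, "test:", X_test.shape)

# Optional variant (not part of the lab): stratify on the labels so that
# the share of each class is preserved in the train and test splits.
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X, y, random_state=1, stratify=y
)
print("share of class 1 overall / train / test:",
      round(y.mean(), 3), round(ys_train.mean(), 3), round(ys_test.mean(), 3))
```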

In [None]:
print(X.shape)
print(y.shape)

In [None]:
# View dataset as dataframe
X_df = pd.DataFrame(X)
X_df.head()

In [None]:
y_df = pd.DataFrame(y)
y_df.head()
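
The DataFrames above have integer column names and integer labels, which makes them hard to read. As a small sketch, scikit-learn can return the same data with feature names attached via `as_frame=True`. The names `X_named`, `y_named`, and `class_counts` are introduced here for illustration and are not used elsewhere in the lab.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True returns pandas objects with the original feature names.
data = load_breast_cancer(as_frame=True)
X_named = data.data

# Map the integer labels (0, 1) to their class names for readability.
label_names = {i: str(name) for i, name in enumerate(data.target_names)}
y_named = data.target.map(label_names)

class_counts = y_named.value_counts()
print(X_named.shape)
print(class_counts)
```

Checking the class counts is a quick way to see how balanced the dataset is before interpreting an accuracy score.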

### Define and fit auto-sklearn classifier

**Auto-sklearn parameters**: Although Auto-sklearn may find a well-performing pipeline without any parameters being set, a few parameters give you control over the search. For the full list, see the [official API page](https://automl.github.io/auto-sklearn/master/api.html).


> **time_left_for_this_task: int, optional (default=3600):** Time limit in seconds for the search of appropriate models. By increasing this value, auto-sklearn has a higher chance of finding better models.

> **per_run_time_limit: int, optional (default=1/10 of time_left_for_this_task):**
Time limit for a single call to the machine learning model. Model fitting will be terminated if the machine learning algorithm runs over the time limit. Set this value high enough so that typical machine learning algorithms can be fit on the training data.

> **tmp_folder: string, optional (None)**:
Folder to store configuration output and log files. If None, a temporary folder is chosen automatically.
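
To make the two time limits concrete, the arithmetic below works through the values passed to the classifier in this lab (120 s in total, 30 s per run). It is a rough sketch that ignores overhead such as data loading and ensemble building, which is a simplifying assumption; the names `worst_case_full_runs` and `default_per_run_limit` are illustrative only.

```python
# Values taken from the classifier defined in this lab.
time_left_for_this_task = 120  # total search budget, in seconds
per_run_time_limit = 30        # cap for a single model fit, in seconds

# If every candidate used its whole cap, only this many fits would complete.
worst_case_full_runs = time_left_for_this_task // per_run_time_limit

# Cap that would apply if per_run_time_limit were left at its documented
# default of 1/10 of the total budget.
default_per_run_limit = time_left_for_this_task // 10

print(worst_case_full_runs, default_per_run_limit)
```

Candidates that fit faster than the cap leave room for more evaluations, so the worst-case count is a lower bound on how many models the search can try.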


In [None]:
# Define the model (the search below runs for about 2 minutes)
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,  # Total search time budget, in seconds
    per_run_time_limit=30,  # Time limit for fitting a single model, in seconds
    tmp_folder="/tmp/autosklearn_classification_example_tmp",
)

# Train the model
automl.fit(X_train, y_train, dataset_name="breast_cancer")

## View Training Statistics
To get information about the training statistics, use **automl.sprint_statistics()**.

In [None]:
sprint_statistics_str = automl.sprint_statistics()
print(sprint_statistics_str)  # print so that the line breaks render properly

## View the models found by auto-sklearn

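The function **automl.leaderboard()** gives an overview of the models found during the search, along with various statistics about their training. By default it lists only the models included in the final ensemble; pass `ensemble_only=False` to list every model that was trained.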

In [None]:
print(automl.leaderboard())

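In the original run, the "extra-trees" classifier was ranked highest, with a cost of 0.014. The cost is the validation loss (here `1 - accuracy`), so this corresponds to a validation accuracy of about 98.6%. Your results may differ from run to run.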

Print the final ensemble constructed by auto-sklearn using the **automl.show_models()** function.

In [None]:
# Show all models in the final ensemble
show_models_output = automl.show_models()
show_models_output

## Get the score of the final ensemble

In [None]:
predictions = automl.predict(X_test)  # Predict on the test set

# Accuracy
print("Accuracy score:", accuracy_score(y_test, predictions))

# AUROC score
y_pred_proba = automl.predict_proba(X_test)  # Class probabilities, one column per class
roc_score = roc_auc_score(y_test, y_pred_proba[:, 1])  # Column 1 holds the probability of class 1
print("AUROC score:", roc_score)
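
The imports at the top of the lab include `classification_report`, which gives per-class precision and recall alongside the two summary scores. The sketch below wraps all three metrics in a hypothetical helper named `evaluate`. So that it runs without auto-sklearn installed, it is demonstrated on a plain `ExtraTreesClassifier` as a stand-in; that model is an assumption made for this example and is not the ensemble auto-sklearn builds.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate(model, X_test, y_test, target_names):
    """Score a fitted binary classifier that has predict and predict_proba."""
    predictions = model.predict(X_test)
    positive_proba = model.predict_proba(X_test)[:, 1]  # probability of class 1
    return {
        "accuracy": accuracy_score(y_test, predictions),
        "auroc": roc_auc_score(y_test, positive_proba),
        "report": classification_report(y_test, predictions, target_names=target_names),
    }


data = load_breast_cancer()
class_names = [str(name) for name in data.target_names]
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=1
)

# Stand-in model (assumption): used only so this sketch runs on its own.
baseline = ExtraTreesClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)
baseline_scores = evaluate(baseline, X_test, y_test, class_names)
print(baseline_scores["report"])
```

Because the fitted `automl` object also exposes `predict` and `predict_proba`, calling `evaluate(automl, X_test, y_test, class_names)` should produce the same kind of summary for the ensemble, and the baseline gives a reference point for judging it.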

## Visualize the best-performing pipelines
After fitting the auto-sklearn model, you can inspect the best-performing pipelines with PipelineProfiler by running the following code:

In [None]:
!pip install pipelineprofiler

In [None]:
import PipelineProfiler
# automl is the fitted AutoSklearnClassifier object created earlier.
profiler_data = PipelineProfiler.import_autosklearn(automl)
PipelineProfiler.plot_pipeline_matrix(profiler_data)

## Final thought
Overall, auto-sklearn is still a new technology. Because auto-sklearn is built on top of scikit-learn, many ML practitioners can quickly try it and see how it works.

Its most important advantage is that it saves experts a lot of time. Its main weakness is that it acts as a black box and does not explain how it arrives at its decisions.

All in all, it’s a pretty interesting tool, so it’s worth giving auto-sklearn a look.