# 3.10 AutoML

An automated workflow for hyperparameter tuning and finding the best model

In this tutorial, we will try a technique that is widely used to make AI/ML less tedious and to make your ML workflow more efficient.

If you have worked through [3.6](3.6_randomForest_regression.ipynb), you might be amazed, but also annoyed, by all the parameter tuning and the many back-and-forth iterations needed to figure out which configuration is optimal for your case. This is widely regarded as a major reason for low productivity in the AI/ML world. Since most of the work in tuning and iterating is simple and repetitive, people asked: can we automate it? The answer is yes, and that is the technique we introduce here: AutoML.

There are many AutoML solutions on the market, e.g., AutoKeras, auto-sklearn, H2O, and Auto-WEKA. Here we will focus on [PyCaret](https://pycaret.gitbook.io/docs/get-started/installation), which is popular in both academia and industry and very easy to use.


In this tutorial, we will use the PyCaret Docker image. In a terminal, use ``docker`` to pull the PyCaret image and start a Jupyter notebook:

```
docker pull pycaret/full
docker run -it -p 8888:8888 -e GRANT_SUDO=yes pycaret/full
```

Installation on M1 Macs can be tricky, especially with the lightgbm library. If you run into problems, try installing both pycaret and lightgbm explicitly.

You will then be able to edit a notebook with the following cells:


In [1]:
!pip install pycaret


## First we get data ready

As usual, data collection is the first step. To better demonstrate the point of AutoML, we will use the same data as [3.6 Random Forest](3.6_randomForest_regression.ipynb).

In [2]:
!pip install wget


In [3]:
import wget
wget.download("https://docs.google.com/uc?export=download&id=1pko9oRmCllAxipZoa3aoztGZfPAD2iwj")

'temps (6).csv'
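
The output above shows that `wget` saved the file as `'temps (6).csv'` because earlier copies already existed, while the next cell reads `temps.csv`. One way to avoid the mismatch is to download to a fixed name only when the file is missing. The following is a minimal sketch using only the standard library; `download_once` is a hypothetical helper name, not part of `wget` or PyCaret.

```python
from pathlib import Path
from urllib.request import urlretrieve

DATA_URL = "https://docs.google.com/uc?export=download&id=1pko9oRmCllAxipZoa3aoztGZfPAD2iwj"


def download_once(url: str, filename: str) -> Path:
    """Download url to filename unless that file already exists (hypothetical helper)."""
    path = Path(filename)
    if not path.exists():
        # urlretrieve writes the response body to the given local path
        urlretrieve(url, str(path))
    return path


# Usage in the notebook (requires network access):
# csv_path = download_once(DATA_URL, "temps.csv")
```

With this approach, re-running the cell leaves the existing `temps.csv` untouched instead of creating numbered copies.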

## Display the data columns

Show the columns, then settle on the target variable and the input variables.

In [4]:
# Pandas is used for data manipulation
import pandas as pd
# Read in the data and list its columns
# (if wget saved the file under another name, e.g. 'temps (6).csv', use that name here)
features = pd.read_csv('temps.csv')
features.columns

Index(['year', 'month', 'day', 'week', 'temp_2', 'temp_1', 'average', 'actual',
       'forecast_noaa', 'forecast_acc', 'forecast_under', 'friend'],
      dtype='object')

- `temp_2`: Maximum temperature two days before today.

- `temp_1`: Maximum temperature yesterday.

- `average`: Historical average temperature.

- `actual`: Actual measured temperature today.

- `forecast_noaa`: Temperature forecast by NOAA.

- `friend`: Forecast by a friend (a randomly selected number within plus or minus 20 of the average temperature).

We will use `actual` as the label and all the other variables as features.

## Check the data shape


In [5]:
features.shape

(348, 12)

In [6]:
# One-hot encode the data using pandas get_dummies
features = pd.get_dummies(features)
# Display the first 5 rows of all columns from index 5 onward (the last 13 columns)
features.iloc[:,5:].head(5)

| | average | actual | forecast_noaa | forecast_acc | forecast_under | friend | week_Fri | week_Mon | week_Sat | week_Sun | week_Thurs | week_Tues | week_Wed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45.6 | 45 | 43 | 50 | 44 | 29 | True | False | False | False | False | False | False |
| 1 | 45.7 | 44 | 41 | 50 | 44 | 61 | False | False | True | False | False | False | False |
| 2 | 45.8 | 41 | 43 | 46 | 47 | 56 | False | False | False | True | False | False | False |
| 3 | 45.9 | 40 | 44 | 48 | 46 | 53 | False | True | False | False | False | False | False |
| 4 | 46.0 | 44 | 46 | 46 | 46 | 41 | False | False | False | False | False | True | False |
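
To see what `get_dummies` did to the `week` column, here is a small sketch on a made-up three-row frame (the values are illustrative, not taken from `temps.csv`). Numeric columns pass through unchanged, while each distinct text value becomes its own indicator column.

```python
import pandas as pd

# Toy frame: one numeric column and one text column
toy = pd.DataFrame({
    "temp_1": [45, 44, 41],
    "week": ["Fri", "Sat", "Fri"],
})

# Text columns are expanded into one indicator column per distinct value
encoded = pd.get_dummies(toy)
print(list(encoded.columns))  # ['temp_1', 'week_Fri', 'week_Sat']
```

Depending on the pandas version, the indicator columns hold booleans (as in the output above) or 0/1 integers.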


## Split training and testing

As we already did all the quality checks in [3.6](3.6_randomForest_regression.ipynb), we will not repeat them here and will go directly to the AutoML experiment. First, split the data into training and testing subsets.

In [7]:
train_df = features[:300]
test_df = features[300:]
print('Data for Modeling: ' + str(train_df.shape))
print('Unseen Data For Predictions: ' + str(test_df.shape))

Data for Modeling: (300, 18)
Unseen Data For Predictions: (48, 18)


In [8]:
train_df

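| | year | month | day | temp_2 | temp_1 | average | actual | forecast_noaa | forecast_acc | forecast_under | friend | week_Fri | week_Mon | week_Sat | week_Sun | week_Thurs | week_Tues | week_Wed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016 | 1 | 1 | 45 | 45 | 45.6 | 45 | 43 | 50 | 44 | 29 | True | False | False | False | False | False | False |
| 1 | 2016 | 1 | 2 | 44 | 45 | 45.7 | 44 | 41 | 50 | 44 | 61 | False | False | True | False | False | False | False |
| 2 | 2016 | 1 | 3 | 45 | 44 | 45.8 | 41 | 43 | 46 | 47 | 56 | False | False | False | True | False | False | False |
| 3 | 2016 | 1 | 4 | 44 | 41 | 45.9 | 40 | 44 | 48 | 46 | 53 | False | True | False | False | False | False | False |
| 4 | 2016 | 1 | 5 | 41 | 40 | 46.0 | 44 | 46 | 46 | 46 | 41 | False | False | False | False | False | True | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 295 | 2016 | 11 | 9 | 63 | 71 | 52.4 | 65 | 48 | 56 | 52 | 42 | False | False | False | False | False | False | True |
| 296 | 2016 | 11 | 10 | 71 | 65 | 52.2 | 64 | 52 | 54 | 51 | 38 | False | False | False | False | True | False | False |
| 297 | 2016 | 11 | 11 | 65 | 64 | 51.9 | 63 | 50 | 53 | 52 | 55 | True | False | False | False | False | False | False |
| 298 | 2016 | 11 | 12 | 64 | 63 | 51.7 | 59 | 50 | 52 | 52 | 63 | False | False | True | False | False | False | False |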


## Run PyCaret (no hassle)

Let's get straight to the point. PyCaret should automatically recognize the columns and assign appropriate data types to them, and it will tell you if something goes wrong.

As a first step, PyCaret needs you to confirm that the data columns were parsed correctly and that their data types match their values. If a confirmation prompt appears (older PyCaret releases show one), press Enter in the popup text field to continue.


In [9]:
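from pycaret.regression import *

# Initialize the experiment; PyCaret infers column types and sets up preprocessing
exp_reg101 = setup(data = train_df,
                   target = 'actual',   # column to predict
                   # imputation_type='iterative',
                   fold_shuffle=True,   # shuffle rows when building cross-validation folds
                   session_id=123)      # random seed for reproducibility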

| | Description | Value |
|---|---|---|
| 0 | Session id | 123 |
| 1 | Target | actual |
| 2 | Target type | Regression |
| 3 | Original data shape | (300, 18) |
| 4 | Transformed data shape | (300, 18) |
| 5 | Transformed train set shape | (210, 18) |
| 6 | Transformed test set shape | (90, 18) |
| 7 | Numeric features | 10 |
| 8 | Preprocess | True |
| 9 | Imputation type | simple |
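
The 210 and 90 rows in the summary come from `setup` holding out part of the data you pass in: by default it keeps 70% for training (`train_size=0.7`) and sets the rest aside as an internal test set. The sketch below reproduces only the sizes with scikit-learn's `train_test_split`. It is an illustration, not PyCaret's own code, and the exact rows PyCaret selects may differ.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the 300 modeling rows passed to setup()
rows = pd.DataFrame({"row": range(300)})

# A 70/30 split, matching the default train_size=0.7 assumed above
inner_train, inner_test = train_test_split(rows, train_size=0.7, random_state=123)
print(inner_train.shape, inner_test.shape)  # (210, 1) (90, 1)
```

Note that the 48 rows in `test_df` are separate from both parts, since they were never passed to `setup`.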


## Compare Models
Once you have confirmed that the data types are correct, run the comparison with a single line of code:

In [10]:
best = compare_models(exclude = ['ransac'])

| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
| rf | Random Forest Regressor | 3.8383 | 24.2928 | 4.7893 | 0.7416 | 0.0713 | 0.059 | 0.018 |
| ada | AdaBoost Regressor | 4.0621 | 26.1749 | 5.0051 | 0.7203 | 0.0747 | 0.0628 | 0.009 |
| et | Extra Trees Regressor | 4.0285 | 27.5968 | 5.0967 | 0.7128 | 0.0759 | 0.062 | 0.014 |
| lasso | Lasso Regression | 3.8163 | 27.731 | 4.9796 | 0.7117 | 0.0743 | 0.059 | 0.003 |
| llar | Lasso Least Angle Regression | 3.8166 | 27.7349 | 4.9799 | 0.7117 | 0.0743 | 0.059 | 0.004 |
| en | Elastic Net | 3.8289 | 28.0368 | 5.0033 | 0.7079 | 0.0747 | 0.0593 | 0.004 |
| huber | Huber Regressor | 3.8708 | 28.414 | 5.0481 | 0.7048 | 0.0754 | 0.0595 | 0.005 |
| br | Bayesian Ridge | 3.8655 | 28.4961 | 5.0447 | 0.7037 | 0.0754 | 0.0599 | 0.01 |
| lightgbm | Light Gradient Boosting Machine | 4.0354 | 27.4807 | 5.1129 | 0.7018 | 0.0767 | 0.0624 | 0.088 |
| gbr | Gradient Boosting Regressor | 4.0385 | 28.0064 | 5.0722 | 0.6964 | 0.0756 | 0.0623 | 0.009 |
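
For reference, the metric columns in the table can be written out in a few lines. The sketch below uses the standard textbook definitions with only the standard library. It assumes these match what PyCaret reports, and `regression_metrics` is a hypothetical helper name.

```python
import math


def regression_metrics(y_true, y_pred):
    """Textbook definitions of the metric columns (assumed to match PyCaret's)."""
    n = len(y_true)
    pairs = list(zip(y_true, y_pred))
    errors = [t - p for t, p in pairs]

    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)

    # R2 compares the model's squared error with that of always predicting the mean
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - sum(e * e for e in errors) / ss_tot

    # RMSLE works on log(1 + value), so it requires non-negative inputs
    rmsle = math.sqrt(sum((math.log1p(t) - math.log1p(p)) ** 2 for t, p in pairs) / n)

    # MAPE as a fraction (0.059 means 5.9 %), which is how the table above reports it
    mape = sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n

    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2, "RMSLE": rmsle, "MAPE": mape}


# Made-up values, only to exercise the function
scores = regression_metrics([45, 44, 41, 40], [43, 44, 43, 42])
print({name: round(value, 4) for name, value in scores.items()})
```

Lower is better for every column except R2, where values closer to 1 are better.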


## Get Best Model

It looks great! PyCaret did all the work under the hood and gave us the best model. Look at the RMSE and R2 columns in the comparison table: Random Forest achieves both the lowest RMSE and the highest R2, which is much clearer and faster than comparing the models by hand. The scores come from cross-validation on the training portion of the data, so they are averages over several folds rather than the result of a single lucky or unlucky split, which makes the comparison reasonably reliable.
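
To make the ranking less of a black box, here is a rough sketch of the kind of loop an AutoML tool automates: cross-validate several candidate models and sort them by score. It uses scikit-learn directly on synthetic data. The candidate list, the fold count, and the data are all assumptions made for illustration, and this is not PyCaret's implementation.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the temperature table
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=123)

# A handful of candidate models, each with its default-like settings
candidates = {
    "lr": LinearRegression(),
    "lasso": Lasso(alpha=1.0),
    "rf": RandomForestRegressor(n_estimators=50, random_state=123),
}

rows = []
for name, model in candidates.items():
    # Mean R2 over 5 cross-validation folds
    r2_scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    rows.append({"Model": name, "R2": r2_scores.mean()})

# Best model first, like the comparison table
leaderboard = pd.DataFrame(rows).sort_values("R2", ascending=False).reset_index(drop=True)
print(leaderboard)
```

On real data the ranking depends on the dataset, so the order seen here says nothing about which model wins on `temps.csv`.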

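The next step is to extract the best model's hyperparameter configuration. Note that `compare_models` trains every candidate with its default hyperparameters, so `best` is a solid baseline that you can go ahead and train with; if you want to tune it further, PyCaret also provides `tune_model`.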

In [11]:
best

If you suspect the top-ranked model is not the most cost-effective one and want to check more candidates, you can keep several models with `top3 = compare_models(exclude = ['ransac'], n_select = 3)`; `top3` will then be a list containing the three best models.

## Model Interpretation

You can get more details about why the best model performs the way it does. PyCaret provides a function called `interpret_model`, which produces a figure showing the influence of each input variable on the predictions. Under the hood it relies on the SHAP library, which PyCaret integrates. SHAP is an optional dependency, so install it first (`pip install shap`) if the call fails as in the output below.

In [12]:
interpret_model(best)

ModuleNotFoundError: 
'shap' is a soft dependency and not included in the pycaret installation. Please run: `pip install shap` to install.
Alternately, you can install this by running `pip install pycaret[analysis]`
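
The traceback above says that `shap` is a soft dependency. A small check before calling `interpret_model` can turn that error into a clearer message. The sketch below uses only the standard library, and `has_module` is a hypothetical helper name.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be imported (hypothetical helper)."""
    return importlib.util.find_spec(name) is not None


if has_module("shap"):
    print("shap is available; interpret_model(best) should be able to run")
else:
    print("shap is missing; run `pip install shap` first")
```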

## Evaluate More Metrics

PyCaret provides widgets and plots that give you an easy way to visualize and check many other useful metrics from training.

In [None]:
evaluate_model(best)

## Troubleshooting

1. First-time users on M1 Macs might run into [LightGBM issue #1369](https://github.com/microsoft/LightGBM/issues/1369). Please reinstall pycaret and lightgbm and see if the problem goes away. If not, please open a new issue on the GitHub repository's issue page.