# 3.5 AutoML

An automated workflow for hyperparameter tuning and model selection.

In this tutorial, we will try a technique that is widely used to make AI/ML less tedious and to make your ML workflow more efficient.

If you have worked through [3.6](3.6_randomForest_regression.ipynb), you may have been impressed, but also annoyed, by how much parameter tuning and back-and-forth iteration it takes to find the best configuration for your case. This manual effort is widely regarded as a major reason for low productivity in the AI/ML world. Since most of that tuning and iteration is simple, mechanical work, a natural question is whether it can be automated. The answer is yes, and that is the technique we introduce here: AutoML.

There are many AutoML solutions available, e.g., AutoKeras, auto-sklearn, H2O, and Auto-WEKA. Here we will focus on [PyCaret](https://pycaret.gitbook.io/docs/get-started/installation), which is popular in both academia and industry and is very easy to use.

## First we get data ready

As usual, data collection is the first step. To better demonstrate the point of AutoML, we will use the same data as [3.6 Random Forest](3.6_randomForest_regression.ipynb).

In [1]:
import wget
wget.download("https://docs.google.com/uc?export=download&id=1pko9oRmCllAxipZoa3aoztGZfPAD2iwj")

100% [..............................................................................] 14436 / 14436

'temps (3).csv'
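
Note that the download above was saved as `temps (3).csv`, apparently because earlier copies already existed in the folder, while the next cell reads `temps.csv`. One way to avoid that mismatch is to download to a fixed filename and skip the download when the file is already there. The sketch below uses only the standard library; the helper name `download_once` is hypothetical and not part of the original notebook.

```python
import os
import urllib.request

DATA_URL = "https://docs.google.com/uc?export=download&id=1pko9oRmCllAxipZoa3aoztGZfPAD2iwj"

def download_once(url, dest):
    """Download `url` to `dest` unless `dest` already exists; return the path used."""
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)
    return dest

# Usage (requires network access on the first run):
# csv_path = download_once(DATA_URL, "temps.csv")
```

With a fixed destination, `pd.read_csv("temps.csv")` finds the same file no matter how many times the cell is rerun.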

## Display the data columns

Show the columns and decide which one is the target variable and which are the input variables.

In [2]:
# Pandas is used for data manipulation
import pandas as pd
# Read in data and display the column names
features = pd.read_csv('temps.csv')
features.columns

Index(['year', 'month', 'day', 'week', 'temp_2', 'temp_1', 'average', 'actual',
       'forecast_noaa', 'forecast_acc', 'forecast_under', 'friend'],
      dtype='object')

- `temp_2`: Maximum temperature two days before today.

- `temp_1`: Maximum temperature yesterday.

- `average`: Historical average temperature.

- `actual`: Actual measured temperature today.

- `forecast_noaa`: Temperature forecast by NOAA.

- `friend`: Forecast by a friend (a randomly selected number within plus or minus 20 of the average temperature).

We will use `actual` as the label and all the other variables as features.
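
As a small illustration of that choice, the label column can be separated from the feature columns with `DataFrame.drop`. The two-row frame below is made up for the example and only reuses a few of the column names; `toy`, `labels`, and `inputs` are hypothetical names. PyCaret itself only needs the name of the target column, so a split like this is mainly useful for inspection.

```python
import pandas as pd

# Made-up stand-in for the real dataset
toy = pd.DataFrame({
    "temp_2": [45, 44],
    "temp_1": [45, 45],
    "average": [45.6, 45.7],
    "actual": [45, 44],
})

labels = toy["actual"]                   # the value we want to predict
inputs = toy.drop(columns=["actual"])    # everything else is a feature

print(list(inputs.columns))
```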

## Check the data shape


In [3]:
features.shape

(348, 12)

In [4]:
# One-hot encode the data using pandas get_dummies
features = pd.get_dummies(features)
# Display the first 5 rows of the columns from index 5 onward (13 columns)
features.iloc[:,5:].head(5)

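| | average | actual | forecast_noaa | forecast_acc | forecast_under | friend | week_Fri | week_Mon | week_Sat | week_Sun | week_Thurs | week_Tues | week_Wed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45.6 | 45 | 43 | 50 | 44 | 29 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 45.7 | 44 | 41 | 50 | 44 | 61 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 45.8 | 41 | 43 | 46 | 47 | 56 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 45.9 | 40 | 44 | 48 | 46 | 53 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 46.0 | 44 | 46 | 46 | 46 | 41 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |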
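
To see what `get_dummies` does in isolation, here is a minimal sketch on a made-up `week` column (the frame `toy` is hypothetical). Non-numeric columns are expanded into one indicator column per category, while numeric columns pass through unchanged. Recent pandas versions return boolean indicators by default; passing `dtype=int` is one way to get the 0/1 values shown above.

```python
import pandas as pd

# Made-up frame with one numeric and one categorical column
toy = pd.DataFrame({
    "temp_1": [45, 45, 44],
    "week": ["Fri", "Sat", "Sun"],
})

# Expand `week` into week_Fri, week_Sat, week_Sun with integer 0/1 values
encoded = pd.get_dummies(toy, dtype=int)
print(encoded)
```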


## Split into training and testing sets

As we already did all the quality checks in [3.6](3.6_randomForest_regression.ipynb), we will not repeat them here and will go directly to the AutoML experiment. First, split the data into training and testing subsets.

In [5]:
train_df = features[:300]
test_df = features[300:]
print('Data for Modeling: ' + str(train_df.shape))
print('Unseen Data For Predictions: ' + str(test_df.shape))

Data for Modeling: (300, 18)
Unseen Data For Predictions: (48, 18)
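
The slicing above is a positional split: the first 300 rows go to training and the remaining 48 are held out. Since the rows appear to be in date order, the held-out rows are the latest dates in the dataset. A minimal sketch of the same idea on a stand-in frame (the helper `split_by_position` and the frame `toy` are hypothetical), with a check that the two parts do not overlap:

```python
import pandas as pd

def split_by_position(df, n_train):
    """Return (first n_train rows, remaining rows) without shuffling."""
    return df.iloc[:n_train], df.iloc[n_train:]

# Stand-in frame with the same number of rows as the temperature data
toy = pd.DataFrame({"actual": range(348)})
train_part, test_part = split_by_position(toy, 300)

# The two parts cover every row exactly once
assert len(train_part) + len(test_part) == len(toy)
assert train_part.index.intersection(test_part.index).empty

print(train_part.shape, test_part.shape)
```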


In [6]:
train_df

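| | year | month | day | temp_2 | temp_1 | average | actual | forecast_noaa | forecast_acc | forecast_under | friend | week_Fri | week_Mon | week_Sat | week_Sun | week_Thurs | week_Tues | week_Wed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016 | 1 | 1 | 45 | 45 | 45.6 | 45 | 43 | 50 | 44 | 29 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2016 | 1 | 2 | 44 | 45 | 45.7 | 44 | 41 | 50 | 44 | 61 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 2016 | 1 | 3 | 45 | 44 | 45.8 | 41 | 43 | 46 | 47 | 56 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 2016 | 1 | 4 | 44 | 41 | 45.9 | 40 | 44 | 48 | 46 | 53 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2016 | 1 | 5 | 41 | 40 | 46.0 | 44 | 46 | 46 | 46 | 41 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 295 | 2016 | 11 | 9 | 63 | 71 | 52.4 | 65 | 48 | 56 | 52 | 42 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 296 | 2016 | 11 | 10 | 71 | 65 | 52.2 | 64 | 52 | 54 | 51 | 38 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 297 | 2016 | 11 | 11 | 65 | 64 | 51.9 | 63 | 50 | 53 | 52 | 55 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 298 | 2016 | 11 | 12 | 64 | 63 | 51.7 | 59 | 50 | 52 | 52 | 63 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |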


## Run PyCaret (no hassle)

Let's get straight to the point. PyCaret should automatically recognize the columns and assign appropriate data types to them, and it will tell you if something goes wrong.


NOTE: First-time users on Apple M1 machines might run into a LightGBM issue (https://github.com/microsoft/LightGBM/issues/1369). If you do, reinstall pycaret and lightgbm and check whether the problem goes away. If it does not, please open a new issue on the GitHub repository's issue page.
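
When reinstalling or filing an issue, it helps to know which versions are installed. A minimal sketch using the standard library's `importlib.metadata` (Python 3.8 or later); the helper name `installed_versions` is hypothetical, and the package list is an assumption based on the libraries mentioned in this chapter:

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

# Print one line per package, suitable for pasting into an issue report
for name, ver in installed_versions(["pycaret", "lightgbm", "scikit-learn", "pandas"]).items():
    print(f"{name}: {ver or 'not installed'}")
```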

In [7]:
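from pycaret.regression import *

# Initialize the experiment: `actual` is the target column, folds are shuffled,
# and session_id fixes the random seed so that results are reproducible
exp_reg101 = setup(data=train_df,
                   target='actual',
                   # imputation_type='iterative',
                   fold_shuffle=True,
                   session_id=123)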

| | Description | Value |
|---|---|---|
| 0 | session_id | 123 |
| 1 | Target | actual |
| 2 | Original Data | (300, 18) |
| 3 | Missing Values | False |
| 4 | Numeric Features | 15 |
| 5 | Categorical Features | 2 |
| 6 | Ordinal Features | False |
| 7 | High Cardinality Features | False |
| 8 | High Cardinality Method | |
| 9 | Transformed Train Set | (209, 27) |


AttributeError: 'Simple_Imputer' object has no attribute 'fill_value_categorical'
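
Once `setup` completes without an error, a typical PyCaret regression session continues by comparing candidate models and scoring the held-out rows. The sketch below is not part of the original notebook: the wrapper `run_automl` is a hypothetical name, and it assumes a working PyCaret installation plus the `train_df` and `test_df` frames defined earlier. It uses `setup`, `compare_models`, and `predict_model` from `pycaret.regression`; defaults and output columns may differ between PyCaret versions.

```python
def run_automl(train_df, test_df, target="actual", seed=123):
    """Hypothetical wrapper: set up an experiment, pick the best model by
    cross-validation, and predict on unseen rows."""
    # Imported lazily so this function can be defined without PyCaret installed
    from pycaret.regression import compare_models, predict_model, setup

    # Same settings as the setup cell above
    setup(data=train_df, target=target, fold_shuffle=True, session_id=seed)

    # Train the available regressors with cross-validation and keep the top one
    best_model = compare_models()

    # Apply the chosen model to the rows that were held out from training
    predictions = predict_model(best_model, data=test_df)
    return best_model, predictions

# Usage (assumes train_df and test_df from the cells above):
# best_model, predictions = run_automl(train_df, test_df)
```

Keeping the three calls in one function makes it easy to rerun the whole experiment with a different seed or target column.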

