## AutoML with AutoGluon 

In this notebook we take the same dataset we used with SageMaker Canvas and use AutoGluon's TabularPredictor, which handles both categorical and numerical features, to train a model that predicts wine prices. AutoGluon trains several regression models plus an ensemble of them and recommends the one with the best RMSE. Here we let AutoGluon run a full training pass with its default settings, whereas with SageMaker Canvas we used only the Quick Build option, which trades accuracy for speed. For production requirements, you should consider the Standard Build option in Canvas.

We are using the raw dataset because we expect AutoML to take care of data processing and feature engineering prior to model training.

### IMPORTANT - Restart Kernel after executing cell below 

You can restart the kernel by opening the Kernel menu in the top menu bar and selecting Restart Kernel and Clear All Outputs.

In [None]:
!pip install -U setuptools wheel
!pip install "torch>=1.0,<1.12+cpu" -f https://download.pytorch.org/whl/cpu/torch_stable.html
!pip3 install autogluon

### Have you restarted the Kernel?

Please make sure you have restarted the Kernel before executing the cell below. For instructions refer to the previous cell.
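
One quick way to confirm that the restarted kernel can see the new install is to check whether the package can be located. The helper below is a small sketch and not part of the original notebook; `is_importable` is a hypothetical name, and it relies only on the standard library's `importlib.util.find_spec`.

```python
import importlib.util


def is_importable(module_name):
    """Return True if `module_name` can be located by the import system."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ImportError:
        # For dotted names, a missing parent package raises ModuleNotFoundError,
        # which is a subclass of ImportError.
        return False


# After restarting the kernel, this should print True if the install succeeded.
print("autogluon.tabular importable:", is_importable("autogluon.tabular"))
```

If this prints `False` after the restart, re-run the install cell and restart the kernel again before continuing.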

In [1]:
# Import the libraries we need
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import io
import os
import sys
import time
import json
import warnings
from IPython.display import display
from time import strftime, gmtime

warnings.filterwarnings('ignore')

In [2]:
# Let's first load the data into a Pandas dataframe so it is easy for us to work with it
wine_canvas_raw_df = pd.read_csv('./wine_canvas_ds.csv', sep=',', header=0)
wine_canvas_raw_df.head()

Unnamed: 0,country,designation,points,price,province,region_1,region_2,variety,winery
0,US,Martha's Vineyard,96.0,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,Spain,Carodorum Selección Especial Reserva,96.0,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,US,Special Selected Late Harvest,96.0,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,US,Reserve,96.0,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,France,La Brûlade,95.0,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude


In [3]:
# Let us reserve the last 10 rows to test our model; the rest will be our training dataset.
# Note: iloc[:1989] selects rows 0-1988, so row 1989 lands in neither split.
wine_train_df = wine_canvas_raw_df.iloc[:1989]
wine_test_df = wine_canvas_raw_df.iloc[1990:]
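
Positional slices exclude their end index, so it is worth checking that the two splits together cover every row. The sketch below uses a plain list of row positions in place of the dataframe (an assumption made to keep it dependency-free; `.iloc` follows the same slicing rules) and assumes the dataset has 2,000 rows, as the index values 0 through 1999 suggest.

```python
# Stand-in for the dataframe's row positions; 2000 rows is an assumption
# based on the index values shown in this notebook (0 through 1999).
row_positions = list(range(2000))

# The split used above: one row ends up in neither set.
train_rows = row_positions[:1989]
test_rows = row_positions[1990:]
unused_rows = sorted(set(row_positions) - set(train_rows) - set(test_rows))
print(len(train_rows), len(test_rows), unused_rows)

# A gap-free alternative: count the hold-out rows from the end.
n_test = 10
train_rows_full = row_positions[:-n_test]
test_rows_full = row_positions[-n_test:]
print(len(train_rows_full), len(test_rows_full))
```

With the dataframe, the gap-free equivalent would be `wine_canvas_raw_df.iloc[:-10]` and `wine_canvas_raw_df.iloc[-10:]`. That would train on 1,990 rows instead of 1,989, so the logged numbers in the rest of this notebook would differ slightly.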

In [16]:
wine_test_df

Unnamed: 0,country,designation,points,price,province,region_1,region_2,variety,winery
1990,US,Ancestor Estate Reserve Estate Bottled,90.0,55.0,California,Adelaida District,Central Coast,Bordeaux-style Red Blend,Halter Ranch
1991,Germany,Johannisberger Klaus Spätlese,90.0,40.0,Rheingau,,,Riesling,Johannishof
1992,US,Classic Collection,90.0,24.0,California,Napa Valley,Napa,Petit Verdot,Napa Cellars
1993,US,Proprietary,90.0,48.0,California,Napa Valley,Napa,Red Blend,Paraduxx
1994,US,Schatz Family Reserve,90.0,60.0,California,Lodi,Central Valley,Cabernet Sauvignon,Peltier
1995,Germany,Cuvée Noir,90.0,21.0,Pfalz,,,Red Blend,Pflüger
1996,Germany,Hattenheimer Schützenhaus Kabinett Trocken,90.0,25.0,Rheingau,,,Riesling,Weingut Hans Bausch
1997,US,,90.0,18.0,California,El Dorado,Sierra Foothills,Barbera,Boeger
1998,US,Testa Vineyard,90.0,28.0,California,Mendocino,Mendocino/Lake Counties,Carignane,Donkey & Goat
1999,Australia,Ned & Henry's,90.0,25.0,South Australia,Barossa Valley,,Shiraz,Hewitson


From our raw dataset, we want to train an ML model that can predict the price of a bottle of wine from the other columns. Notice that our tabular dataset contains a mix of text and numbers, so we have both categorical and quantitative features. AutoGluon Tabular is designed to work with such datasets, and we will use the TabularPredictor for our training needs.

### Run Training

In [6]:
# Data processing, feature engineering, setting up and running training is just 3 lines of code
from autogluon.tabular import TabularPredictor
predictor = TabularPredictor(label='price', path='winning_wine_predictor')
predictor.fit(wine_train_df)

Beginning AutoGluon training ...
AutoGluon will save models to "winning_wine_predictor/"
AutoGluon Version:  0.5.0
Python Version:     3.7.10
Operating System:   Linux
Train Data Rows:    1989
Train Data Columns: 8
Label Column: price
Preprocessing data ...
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == float and many unique label-values observed).
	Label info (max, min, mean, stddev): (500.0, 4.0, 38.18793, 29.00031)
	If 'regression' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    1435.84 MB
	Train Data (Original)  Memory Usage: 0.87 MB (0.1% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special 

<autogluon.tabular.predictor.predictor.TabularPredictor at 0x7f64b9d4b610>

As the leaderboard at the end of this notebook shows, our winning model is WeightedEnsemble_L2 with a validation RMSE of about 19.90 (AutoGluon prints scores as higher-is-better, so multiply the printed value by -1). In comparison, the Canvas model trained with Quick Build had an RMSE of 23.75. A Standard Build in Canvas would likely have produced a lower RMSE and a better model.
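
The training call above leaves the problem type and evaluation metric for AutoGluon to infer. A sketch of the same step with those choices spelled out might look like the following. The helper name `train_price_predictor` and its `time_limit` and `presets` arguments are illustrative assumptions, not settings used in this notebook. AutoGluon is imported inside the function, so it must be installed by the time the function is called.

```python
def train_price_predictor(train_df, label="price", path="winning_wine_predictor",
                          time_limit=None, presets=None):
    """Fit a TabularPredictor with the regression setup made explicit.

    `time_limit` (seconds) and `presets` are passed through to `fit`;
    leaving them as None keeps AutoGluon's defaults, as in the cell above.
    """
    # Imported lazily: AutoGluon must be installed when this function runs.
    from autogluon.tabular import TabularPredictor

    predictor = TabularPredictor(
        label=label,
        path=path,
        problem_type="regression",              # what AutoGluon inferred above
        eval_metric="root_mean_squared_error",  # the metric reported in this notebook
    )
    predictor.fit(train_df, time_limit=time_limit, presets=presets)
    return predictor
```

Calling `train_price_predictor(wine_train_df)` should behave like the three-line cell above, while making it harder for a change in the label column's dtype to silently switch the problem type.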

### Run Predictions

Now that we have our winning model, let us use it to run predictions.

In [7]:
# Let us drop the price column from our test dataset so we can get the model to predict that
wine_test_priceless = wine_test_df.drop(['price'], axis=1)
wine_test_priceless.head()

Unnamed: 0,country,designation,points,province,region_1,region_2,variety,winery
1990,US,Ancestor Estate Reserve Estate Bottled,90.0,California,Adelaida District,Central Coast,Bordeaux-style Red Blend,Halter Ranch
1991,Germany,Johannisberger Klaus Spätlese,90.0,Rheingau,,,Riesling,Johannishof
1992,US,Classic Collection,90.0,California,Napa Valley,Napa,Petit Verdot,Napa Cellars
1993,US,Proprietary,90.0,California,Napa Valley,Napa,Red Blend,Paraduxx
1994,US,Schatz Family Reserve,90.0,California,Lodi,Central Valley,Cabernet Sauvignon,Peltier


In [11]:
winning_predictions = predictor.predict(wine_test_priceless)
print("Winning Model Predictions for test data:  "+str(winning_predictions))

Winning Model Predictions for test data:  1990    45.047230
1991    28.318157
1992    29.502701
1993    76.576393
1994    35.715672
1995    36.933662
1996    28.318157
1997    23.321636
1998    34.547947
1999    47.772491
Name: price, dtype: float32


### Predictor Performance on test data

Now let us compare actual versus predicted prices on our test data to see how good our predictor is. The lower the root_mean_squared_error (RMSE), the better the model. Because AutoGluon prints scores as higher-is-better, multiply the printed value by -1 to get the RMSE.

In [12]:
perf = predictor.evaluate_predictions(y_true=wine_test_df['price'], y_pred=winning_predictions, auxiliary_metrics=True)

Evaluation: root_mean_squared_error on test data: -15.895959262529042
	Note: Scores are always higher_is_better. This metric score can be multiplied by -1 to get the metric value.
Evaluations on test data:
{
    "root_mean_squared_error": -15.895959262529042,
    "mean_squared_error": -252.6815208759828,
    "mean_absolute_error": -13.389192962646485,
    "r2": -0.22044784039790777,
    "pearsonr": 0.43964546812450195,
    "median_absolute_error": -10.817306518554688
}
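
To see where these numbers come from, the metrics can be recomputed by hand from the ten actual prices and the ten predictions printed earlier. This is a dependency-free sketch for illustration, not how AutoGluon computes them internally; the values are copied from the notebook output above.

```python
import math

# Actual prices (rows 1990-1999) and the model's predictions, copied from above.
actual = [55.0, 40.0, 24.0, 48.0, 60.0, 21.0, 25.0, 18.0, 28.0, 25.0]
predicted = [45.047230, 28.318157, 29.502701, 76.576393, 35.715672,
             36.933662, 28.318157, 23.321636, 34.547947, 47.772491]

errors = [p - a for p, a in zip(predicted, actual)]
n = len(errors)

mse = sum(e * e for e in errors) / n
rmse = math.sqrt(mse)
mae = sum(abs(e) for e in errors) / n

# R^2 compares the model against always predicting the mean actual price.
mean_actual = sum(actual) / n
ss_total = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - (mse * n) / ss_total

print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  R2={r2:.3f}")
```

The RMSE and MAE agree with the reported values once the sign is flipped. R² is not sign-flipped; its negative value means that, on these ten rows, the model did worse than simply predicting the mean test price. With only ten test rows, these figures are a rough indication rather than a reliable estimate.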


### Model Leaderboard

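Finally, AutoGluon also gives us a handy leaderboard of all the models it trained along with their evaluation results. Passing the test data adds a `score_test` column next to the validation score (`score_val`). Note that WeightedEnsemble_L2 has the best validation score but ranks eighth on our ten-row test set, which is too small to rank models reliably.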

In [13]:
predictor.leaderboard(wine_test_df, silent=True)

Unnamed: 0,model,score_test,score_val,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,LightGBMLarge,-12.885584,-22.04392,0.020869,0.009866,0.530794,0.020869,0.009866,0.530794,1,True,11
1,CatBoost,-14.010923,-20.731509,0.025151,0.009975,2.725942,0.025151,0.009975,2.725942,1,True,6
2,LightGBMXT,-14.19553,-22.039222,0.022491,0.014447,1.218927,0.022491,0.014447,1.218927,1,True,3
3,LightGBM,-14.249132,-22.556779,0.017054,0.009536,0.366809,0.017054,0.009536,0.366809,1,True,4
4,KNeighborsDist,-14.423592,-25.697862,0.107832,0.108938,0.007975,0.107832,0.108938,0.007975,1,True,2
5,KNeighborsUnif,-14.423592,-25.693216,0.107898,0.105727,0.007785,0.107898,0.105727,0.007785,1,True,1
6,XGBoost,-14.930833,-21.513244,0.019682,0.010366,0.634773,0.019682,0.010366,0.634773,1,True,9
7,WeightedEnsemble_L2,-15.895959,-19.898685,0.188187,0.152718,16.07449,0.007455,0.00122,0.448209,2,True,12
8,ExtraTreesMSE,-18.040932,-22.209243,0.152636,0.106923,0.837662,0.152636,0.106923,0.837662,1,True,7
9,NeuralNetFastAI,-19.185879,-21.993628,0.026735,0.021132,3.368754,0.026735,0.021132,3.368754,1,True,8
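
Because AutoGluon reports every score as higher-is-better, the RMSE columns in the leaderboard are negated. The sketch below copies `score_test` and `score_val` from the table above into a plain dictionary (to stay dependency-free; in the notebook you would work with the dataframe that `leaderboard` returns) and converts them back to RMSE values. The helper `to_rmse` is a hypothetical name.

```python
# (score_test, score_val) per model, copied from the leaderboard above.
leaderboard_scores = {
    "LightGBMLarge": (-12.885584, -22.043920),
    "CatBoost": (-14.010923, -20.731509),
    "LightGBMXT": (-14.195530, -22.039222),
    "LightGBM": (-14.249132, -22.556779),
    "KNeighborsDist": (-14.423592, -25.697862),
    "KNeighborsUnif": (-14.423592, -25.693216),
    "XGBoost": (-14.930833, -21.513244),
    "WeightedEnsemble_L2": (-15.895959, -19.898685),
    "ExtraTreesMSE": (-18.040932, -22.209243),
    "NeuralNetFastAI": (-19.185879, -21.993628),
}


def to_rmse(score):
    """Undo the higher-is-better sign flip to recover the RMSE."""
    return -score


# Higher score is better, so the best model has the maximum score.
best_on_val = max(leaderboard_scores, key=lambda m: leaderboard_scores[m][1])
best_on_test = max(leaderboard_scores, key=lambda m: leaderboard_scores[m][0])

print("Best on validation:", best_on_val, to_rmse(leaderboard_scores[best_on_val][1]))
print("Best on test:", best_on_test, to_rmse(leaderboard_scores[best_on_test][0]))
```

The model chosen on validation data and the model that happens to score best on the ten test rows differ here, which is expected with such a small hold-out set.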


### END OF NOTEBOOK, PLEASE GO BACK TO CHAPTER 7 IN THE BOOK FOR NEXT STEPS