![MLU Logo](https://drive.corp.amazon.com/view/bwernes@/MLU_Logo.png?download=true)

# <a name="0">Machine Learning Accelerator - Tabular Data - Lecture 3</a>

## AutoGluon

In this notebook, we use [AutoGluon Tabular Prediction](https://auto.gluon.ai/stable/tutorials/tabular_prediction/index.html) to predict the __target_label__ field (plug or no plug) of the Amazon electric plug dataset. This notebook follows the [quick start tutorial](https://auto.gluon.ai/stable/tutorials/tabular_prediction/tabular-quickstart.html). For more advanced use, check out the [in-depth tutorial](https://auto.gluon.ai/stable/tutorials/tabular_prediction/tabular-indepth.html).

AutoGluon implements many of the best practices that we have discussed in this class, and more! In particular, it sets itself apart from other AutoML solutions with automated feature engineering that can handle text data and missing values without any hand-coded solutions (see the [paper](https://arxiv.org/abs/2003.06505) for details).

1. <a href="#1">Set up AutoGluon</a>
2. <a href="#2">Read the datasets</a>
3. <a href="#3">Train a Classifier with AutoGluon</a>
4. <a href="#4">Prediction and Evaluation</a>
5. <a href="#5">Clean up model artifacts</a>

__Dataset schema:__
- __ASIN:__ Product ASIN.
- __target_label:__ Binary field with values in {0,1}. A value of 1 shows that the ASIN has a plug, otherwise 0.
- __ASIN_STATIC_ITEM_NAME:__ Title of the ASIN.
- __ASIN_STATIC_PRODUCT_DESCRIPTION:__ Description of the ASIN.
- __ASIN_STATIC_GL_PRODUCT_GROUP_TYPE:__ GL information for the ASIN.
- __ASIN_STATIC_ITEM_PACKAGE_WEIGHT:__ Weight of the ASIN.
- __ASIN_STATIC_LIST_PRICE:__ Price information for the ASIN.
- __ASIN_STATIC_BATTERIES_INCLUDED:__ Whether batteries are included with the product.
- __ASIN_STATIC_BATTERIES_REQUIRED:__ Whether batteries are required to use the product.
- __ASIN_STATIC_ITEM_CLASSIFICATION:__ Item classification, such as standalone item or bundle parent item.



## 1. <a name="1">Set up AutoGluon</a>
(<a href="#0">Go to top</a>)

Let's install AutoGluon. This may take some time, as pip also installs all the libraries that AutoGluon requires.

In [1]:
! pip install -q pip==22.3.1
! pip install -q setuptools==54.1.1
! pip install -q wheel==0.36.2
! pip install -q "autogluon>=0.6.1"
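
After the install finishes, it can be useful to confirm which version ended up in the environment. The snippet below is a minimal sketch that uses only the Python standard library; the helper name `installed_version` is an assumption made for illustration and is not part of AutoGluon.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints None if the install step above has not been run in this environment.
print("autogluon:", installed_version("autogluon"))
```

If the printed value is `None`, the package is not visible to the current kernel, and restarting the kernel after installation may help.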

## 2. <a name="2">Read the datasets</a>
(<a href="#0">Go to top</a>)

Let's read the training and test datasets into dataframes using pandas.

In [2]:
import pandas as pd

import warnings
warnings.filterwarnings("ignore")
  
train_data = pd.read_csv('../../data/review/asin_electrical_plug_training_data.csv')
test_data = pd.read_csv('../../data/review/asin_electrical_plug_test_data.csv')

print('The shape of the training dataset is:', train_data.shape)
print('The shape of the test dataset is:', test_data.shape)

The shape of the training dataset is: (55109, 10)
The shape of the test dataset is: (6124, 10)


Because we pass no separate validation data to `fit()`, __AutoGluon__ holds out part of the training data as __validation data__ automatically.
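
Before training a binary classifier, it is often worth checking how balanced the label is, because a heavily skewed label makes accuracy hard to interpret. The sketch below uses only the standard library; the helper `label_shares` and the toy labels are assumptions made for illustration and do not reflect the real dataset.

```python
from collections import Counter


def label_shares(labels):
    """Return the fraction of examples that fall into each label value."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Made-up labels, used only to show the shape of the result.
toy_labels = [0] * 96 + [1] * 4
print(label_shares(toy_labels))
```

With the dataframes loaded above, `train_data['target_label'].value_counts(normalize=True)` gives the same information directly in pandas.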

## 3. <a name="3">Train a Classifier with AutoGluon</a>
(<a href="#0">Go to top</a>)

We can run AutoGluon with a short snippet. To fit models, we just call the __.fit()__ function. In this exercise, we use data frame objects, but AutoGluon also accepts CSV files as input. To work directly with CSV files, you can follow the code snippet below.

```python
from autogluon.tabular import TabularDataset, TabularPredictor

# TabularDataset takes the file path (or a DataFrame) as its first argument
train_data = TabularDataset('path_to_dataset/train.csv')
test_data = TabularDataset('path_to_dataset/test.csv')

predictor = TabularPredictor(label='label_column').fit(train_data)
test_predictions = predictor.predict(test_data)
```

We already have separate __data frames__ for the training and test data, so we work with them below. We take the first 10,000 rows for a quick demo. You can also pass the full dataset.

In [3]:
from autogluon.tabular import TabularDataset, TabularPredictor

k = 10000 # grab less data for a quick demo
#k = train_data.shape[0] # grab the whole dataset

predictor = TabularPredictor(label='target_label').fit(train_data.head(k))

No path specified. Models will be saved in: "AutogluonModels/ag-20230609_175815/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20230609_175815/"
AutoGluon Version:  0.7.0
Python Version:     3.10.10
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Mon Apr 24 23:34:06 UTC 2023
Train Data Rows:    10000
Train Data Columns: 9
Label Column: target_label
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [0, 1]
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 = 1, class 0 = 0
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    5857.48 MB
	Train Data (Orig
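
The call above relies on AutoGluon's defaults. As a hedged sketch, the same fit can be given an explicit evaluation metric and a time budget through the `eval_metric` and `time_limit` parameters. The wrapper name `fit_with_options` and the values `'f1'` and `120` seconds are assumptions chosen for illustration, not settings used in this notebook.

```python
def fit_with_options(train_df, label="target_label", metric="f1", seconds=120):
    """Fit a TabularPredictor with an explicit metric and a time budget (illustrative)."""
    # Imported lazily so this function can be defined before AutoGluon is installed.
    from autogluon.tabular import TabularPredictor

    # eval_metric sets the metric used to score models on the validation data;
    # time_limit caps the total training time in seconds.
    predictor = TabularPredictor(label=label, eval_metric=metric)
    return predictor.fit(train_df, time_limit=seconds)
```

For example, `fit_with_options(train_data.head(k))` would train on the same subset as above, with models ranked by F1 instead of the default metric.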

We can also summarize what happened during fit.

In [4]:
predictor.fit_summary()

*** Summary of fit() ***
Estimated performance of each model:
                  model  score_val  pred_time_val   fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0              LightGBM      0.964       0.043321   5.734516                0.043321           5.734516            1       True          4
1   WeightedEnsemble_L2      0.964       0.045562   6.810692                0.002240           1.076176            2       True         14
2               XGBoost      0.963       0.053662  68.633538                0.053662          68.633538            1       True         11
3        KNeighborsDist      0.963       1.041546   0.772372                1.041546           0.772372            1       True          2
4        KNeighborsUnif      0.963       1.069594   2.023881                1.069594           2.023881            1       True          1
5       NeuralNetFastAI      0.962       0.031359  12.975146                0.031359          12.975146 

{'model_types': {'KNeighborsUnif': 'KNNModel',
  'KNeighborsDist': 'KNNModel',
  'LightGBMXT': 'LGBModel',
  'LightGBM': 'LGBModel',
  'RandomForestGini': 'RFModel',
  'RandomForestEntr': 'RFModel',
  'CatBoost': 'CatBoostModel',
  'ExtraTreesGini': 'XTModel',
  'ExtraTreesEntr': 'XTModel',
  'NeuralNetFastAI': 'NNFastAiTabularModel',
  'XGBoost': 'XGBoostModel',
  'NeuralNetTorch': 'TabularNeuralNetTorchModel',
  'LightGBMLarge': 'LGBModel',
  'WeightedEnsemble_L2': 'WeightedEnsembleModel'},
 'model_performance': {'KNeighborsUnif': 0.963,
  'KNeighborsDist': 0.963,
  'LightGBMXT': 0.962,
  'LightGBM': 0.964,
  'RandomForestGini': 0.961,
  'RandomForestEntr': 0.961,
  'CatBoost': 0.962,
  'ExtraTreesGini': 0.961,
  'ExtraTreesEntr': 0.961,
  'NeuralNetFastAI': 0.962,
  'XGBoost': 0.963,
  'NeuralNetTorch': 0.962,
  'LightGBMLarge': 0.962,
  'WeightedEnsemble_L2': 0.964},
 'model_best': 'WeightedEnsemble_L2',
 'model_paths': {'KNeighborsUnif': 'AutogluonModels/ag-20230609_175815/models/

## 4. <a name="4">Prediction and Evaluation</a>
(<a href="#0">Go to top</a>)

Next, we use the separate test data to demonstrate how to make predictions on new examples at inference time.

In [5]:
# First predictions
y_pred = predictor.predict(test_data.head(k))

# Then, evaluations
predictor.evaluate_predictions(y_true=test_data['target_label'].head(k),
                               y_pred=y_pred,
                               auxiliary_metrics=True)

Evaluation: accuracy on test data: 0.9624428478118876
Evaluations on test data:
{
    "accuracy": 0.9624428478118876,
    "balanced_accuracy": 0.5081713615234803,
    "mcc": 0.06967046815978162,
    "f1": 0.033613445378151266,
    "precision": 0.3333333333333333,
    "recall": 0.017699115044247787
}


{'accuracy': 0.9624428478118876,
 'balanced_accuracy': 0.5081713615234803,
 'mcc': 0.06967046815978162,
 'f1': 0.033613445378151266,
 'precision': 0.3333333333333333,
 'recall': 0.017699115044247787}
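
The gap between accuracy (about 0.962) and balanced accuracy (about 0.508) is worth unpacking. The sketch below recomputes the metrics from a confusion matrix. The four counts are an assumption: they are inferred as one set of counts over the 6,124 test rows that reproduces the reported numbers, and they are not printed by the notebook. The helper `binary_metrics` is likewise illustrative.

```python
from math import sqrt

# Assumed counts, inferred from the reported metrics (not notebook output).
tp, fp, fn, tn = 4, 8, 222, 5890


def binary_metrics(tp, fp, fn, tn):
    """Compute common binary classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    recall = tp / (tp + fn)          # share of true positives that were found
    specificity = tn / (tn + fp)     # share of true negatives that were found
    precision = tp / (tp + fp)
    mcc_denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "balanced_accuracy": (recall + specificity) / 2,
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
        "f1": 2 * tp / (2 * tp + fp + fn),
        "precision": precision,
        "recall": recall,
        # Accuracy of always predicting the more frequent class.
        "majority_baseline": max(tp + fn, tn + fp) / total,
    }


metrics = binary_metrics(tp, fp, fn, tn)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under these assumed counts, only a small fraction of the test rows are positive, so a classifier that always predicts 0 would already score slightly above the reported accuracy. For this dataset, recall, F1, and balanced accuracy are therefore likely to say more about plug detection than accuracy does.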

We can see the performance of each individual trained model on the test data:

In [6]:
predictor.leaderboard(test_data.head(k), silent=True)

| | model | score_test | score_val | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | RandomForestEntr | 0.963259 | 0.961 | 0.667967 | 0.14133 | 18.955637 | 0.667967 | 0.14133 | 18.955637 | 1 | True | 6 |
| 1 | NeuralNetTorch | 0.963096 | 0.962 | 0.085135 | 0.041718 | 8.885574 | 0.085135 | 0.041718 | 8.885574 | 1 | True | 12 |
| 2 | CatBoost | 0.963096 | 0.962 | 0.146532 | 0.241259 | 20.416557 | 0.146532 | 0.241259 | 20.416557 | 1 | True | 7 |
| 3 | LightGBMLarge | 0.963096 | 0.962 | 0.206708 | 0.044144 | 11.629147 | 0.206708 | 0.044144 | 11.629147 | 1 | True | 13 |
| 4 | LightGBMXT | 0.963096 | 0.962 | 0.211115 | 0.03679 | 4.540165 | 0.211115 | 0.03679 | 4.540165 | 1 | True | 3 |
| 5 | XGBoost | 0.963096 | 0.963 | 0.45549 | 0.053662 | 68.633538 | 0.45549 | 0.053662 | 68.633538 | 1 | True | 11 |
| 6 | RandomForestGini | 0.962606 | 0.961 | 0.798879 | 0.156485 | 28.414665 | 0.798879 | 0.156485 | 28.414665 | 1 | True | 5 |
| 7 | ExtraTreesEntr | 0.962606 | 0.961 | 0.877503 | 0.163085 | 24.001079 | 0.877503 | 0.163085 | 24.001079 | 1 | True | 9 |
| 8 | LightGBM | 0.962443 | 0.964 | 0.257697 | 0.043321 | 5.734516 | 0.257697 | 0.043321 | 5.734516 | 1 | True | 4 |
| 9 | WeightedEnsemble_L2 | 0.962443 | 0.964 | 0.260525 | 0.045562 | 6.810692 | 0.002828 | 0.00224 | 1.076176 | 2 | True | 14 |


## 5. <a name="5">Clean up model artifacts</a>
(<a href="#0">Go to top</a>)

In [7]:
!rm -rf AutogluonModels
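
The shell command above assumes a Unix-like environment. As a portable alternative, the same cleanup can be sketched with the Python standard library. The helper name `remove_artifacts` is an assumption for illustration; the default directory name matches the save location shown in the training log.

```python
import shutil
from pathlib import Path


def remove_artifacts(directory="AutogluonModels"):
    """Delete a model directory tree; return True if it existed beforehand."""
    path = Path(directory)
    existed = path.exists()
    # ignore_errors=True keeps this from raising if the directory is already gone.
    shutil.rmtree(path, ignore_errors=True)
    return existed


print("Removed existing artifacts:", remove_artifacts())
```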