![MLU Logo](../data/MLU_Logo.png)

# <a name="0">Machine Learning Accelerator - Tabular Data - Lecture 3</a>


## AutoGluon

In this notebook, we use __AutoGluon__ to predict the __Outcome Type__ field of our review dataset.


[AutoGluon](https://auto.gluon.ai/stable/index.html) implements many of the best practices that we have discussed in this class, and more. In particular, it sets itself apart from other AutoML solutions with automated feature engineering that handles text data and missing values without any hand-coded solutions (see their [paper](https://arxiv.org/abs/2003.06505) for details). It is too new to be included in an existing SageMaker kernel, so we install it first.

1. <a href="#1">Set up AutoGluon</a>
2. <a href="#2">Read the dataset</a>
3. <a href="#3">Train a classifier with AutoGluon</a>
4. <a href="#4">Model evaluation</a>
5. <a href="#5">Clean up model artifacts</a>

__Austin Animal Center Dataset__:

In this exercise, we are working with pet adoption data from __Austin Animal Center__. We have two datasets that cover intake and outcome of animals. Intake data is available from [here](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Intakes/wter-evkm) and outcome is from [here](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Outcomes/9t4d-g238). 

To work with a single table, we joined the intake and outcome tables on the "Animal ID" column and saved the result as a single __review_dataset.csv__ file. To keep the dataset simple, we excluded animals with multiple entries to the facility. If you want to see the original datasets and the merged data that includes multiple entries, they are available under the data/review folder: Austin_Animal_Center_Intakes.csv, Austin_Animal_Center_Outcomes.csv and Austin_Animal_Center_Intakes_Outcomes.csv.
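
The join described above can be sketched with pandas. This is a minimal illustration on made-up toy rows, not the script that produced the course dataset: only the "Animal ID" key comes from the text, and the variable names and values are assumptions.

```python
import pandas as pd

# Toy stand-ins for the intake and outcome tables (all values are invented).
intakes = pd.DataFrame({
    "Animal ID": ["A1", "A2", "A2", "A3"],
    "Intake Type": ["Stray", "Stray", "Owner Surrender", "Public Assist"],
})
outcomes = pd.DataFrame({
    "Animal ID": ["A1", "A2", "A2", "A3"],
    "Outcome Type": [1, 0, 1, 1],
})

# Keep only animals that appear exactly once in each table,
# i.e. drop every animal with multiple entries to the facility.
single_intake = intakes[~intakes["Animal ID"].duplicated(keep=False)]
single_outcome = outcomes[~outcomes["Animal ID"].duplicated(keep=False)]

# Join the two tables on the shared key.
merged = single_intake.merge(single_outcome, on="Animal ID", how="inner")
print(merged)
```

Here animal `A2` has two intakes and two outcomes, so it is dropped before the join and only `A1` and `A3` remain.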

__Dataset schema:__ 
- __Pet ID__ - Unique ID of pet
- __Outcome Type__ - State of pet at the time of recording the outcome (0 = not placed, 1 = placed). This is the field to predict.
- __Sex upon Outcome__ - Sex of pet at outcome
- __Name__ - Name of pet 
- __Found Location__ - Location where the pet was found before it entered the center
- __Intake Type__ - Circumstances bringing the pet to the center
- __Intake Condition__ - Health condition of the pet when it entered the center
- __Pet Type__ - Type of pet
- __Sex upon Intake__ - Sex of the pet when it entered the center
- __Breed__ - Breed of pet
- __Color__ - Color of pet
- __Age upon Intake Days__ - Age of the pet when it entered the center (days)
- __Age upon Outcome Days__ - Age of the pet at outcome (days)

## 1. <a name="1">Set up AutoGluon</a>
(<a href="#0">Go to top</a>)

In [1]:
%%capture
%pip install -q -r ../requirements.txt

## 2. <a name="2">Read the dataset</a>
(<a href="#0">Go to top</a>)

Let's read the dataset into a dataframe, using Pandas, and split the dataset into train and test sets (AutoGluon will handle the validation itself).

In [2]:
import pandas as pd

df = pd.read_csv('../data/review/review_dataset.csv')

In [3]:
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(df, test_size=0.1, shuffle=True, random_state=23)
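
As an optional variation, the split can be stratified on the label so that the train and test sets keep the same class proportions. The sketch below uses a made-up toy frame (the 60/40 class balance is an assumption, not a property of the review dataset) and scikit-learn's `stratify` argument.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with an assumed 60/40 class balance (values are invented).
toy = pd.DataFrame({
    "Outcome Type": [1] * 60 + [0] * 40,
    "Age upon Intake Days": range(100),
})

# stratify= keeps the label proportions equal in both splits.
toy_train, toy_test = train_test_split(
    toy,
    test_size=0.1,
    shuffle=True,
    random_state=23,
    stratify=toy["Outcome Type"],
)

print(toy_train["Outcome Type"].mean(), toy_test["Outcome Type"].mean())
```

Whether stratification matters in practice depends on how imbalanced the label is; with a large, roughly balanced dataset a plain shuffled split is usually close enough.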

## 3. <a name="3">Train a classifier with AutoGluon</a>
(<a href="#0">Go to top</a>)

We can run AutoGluon with just a few lines of code: fitting only requires a call to __.fit()__. In this exercise we pass DataFrame objects, but AutoGluon also accepts raw CSV files as input. To work directly with CSV files, you can follow the code snippet below.

```python
from autogluon.tabular import TabularDataset, TabularPredictor

# TabularDataset takes the file path (or a DataFrame) as its first argument
train_data = TabularDataset('path_to_dataset/train.csv')
test_data = TabularDataset('path_to_dataset/test.csv')

predictor = TabularPredictor(label='label_column').fit(train_data)
test_predictions = predictor.predict(test_data)
```

We already have separate __data frames__ for the training and test data, so we use them below. For a quick demo we take only the first 10,000 rows; you can also pass the full dataset.

In [4]:
from autogluon.tabular import TabularDataset, TabularPredictor

k = 10000 # grab less data for a quick demo
#k = train_data.shape[0] # grab the whole dataset

predictor = TabularPredictor(label='Outcome Type').fit(train_data.head(k))

No path specified. Models will be saved in: "AutogluonModels/ag-20240913_203220"
Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.1.1
Python Version:     3.10.14
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Wed Aug 7 16:53:32 UTC 2024
CPU Count:          8
Memory Avail:       25.55 GB / 30.99 GB (82.4%)
Disk Space Avail:   69.53 GB / 98.25 GB (70.8%)
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets.
	Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
	presets='best_quality'   : Maximize accuracy. Default time_limit=3600.
	presets='high_quality'   : Strong accuracy with fast inference speed. Default time_limit=3600.
	presets='good_quality'   : Good accuracy with very fast inference speed. Default time_limit=3600.
	presets='medium_quality' : Fast training time, ideal for initial prototyping.
Beginning AutoGluon 
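
The log above notes that no presets were specified. A minimal sketch of passing one is shown below; the preset name is taken from the log, while the `time_limit` value and the helper name `fit_with_presets` are illustrative assumptions, not settings used in this notebook.

```python
# Sketch only: the preset and time budget below are illustrative choices.
fit_kwargs = {
    "presets": "medium_quality",  # one of the presets listed in the log
    "time_limit": 120,            # seconds; an assumed budget for a quick demo
}


def fit_with_presets(train_df, label="Outcome Type", **kwargs):
    """Fit a TabularPredictor with the given fit() options (needs autogluon)."""
    # Imported lazily so that defining this helper does not require autogluon.
    from autogluon.tabular import TabularPredictor

    return TabularPredictor(label=label).fit(train_df, **kwargs)


# Usage in this notebook (not executed here):
# predictor = fit_with_presets(train_data.head(k), **fit_kwargs)
```

Higher-quality presets generally trade longer training time for accuracy, so a time limit is a convenient way to keep experiments bounded.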

We can also summarize what happened during fit.

In [5]:
predictor.fit_summary()

*** Summary of fit() ***
Estimated performance of each model:
                  model  score_val eval_metric  pred_time_val   fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0   WeightedEnsemble_L2      0.872    accuracy       2.410174  19.341192                0.001181           0.193021            2       True         14
1               XGBoost      0.854    accuracy       0.021203   4.380946                0.021203           4.380946            1       True         11
2        NeuralNetTorch      0.854    accuracy       0.024079  28.960747                0.024079          28.960747            1       True         12
3              CatBoost      0.854    accuracy       0.030123   6.214871                0.030123           6.214871            1       True          7
4              LightGBM      0.853    accuracy       0.010500   1.349329                0.010500           1.349329            1       True          4
5      RandomForestGini      0.8

{'model_types': {'KNeighborsUnif': 'KNNModel',
  'KNeighborsDist': 'KNNModel',
  'LightGBMXT': 'LGBModel',
  'LightGBM': 'LGBModel',
  'RandomForestGini': 'RFModel',
  'RandomForestEntr': 'RFModel',
  'CatBoost': 'CatBoostModel',
  'ExtraTreesGini': 'XTModel',
  'ExtraTreesEntr': 'XTModel',
  'NeuralNetFastAI': 'NNFastAiTabularModel',
  'XGBoost': 'XGBoostModel',
  'NeuralNetTorch': 'TabularNeuralNetTorchModel',
  'LightGBMLarge': 'LGBModel',
  'WeightedEnsemble_L2': 'WeightedEnsembleModel'},
 'model_performance': {'KNeighborsUnif': 0.651,
  'KNeighborsDist': 0.66,
  'LightGBMXT': 0.848,
  'LightGBM': 0.853,
  'RandomForestGini': 0.853,
  'RandomForestEntr': 0.85,
  'CatBoost': 0.854,
  'ExtraTreesGini': 0.836,
  'ExtraTreesEntr': 0.844,
  'NeuralNetFastAI': 0.817,
  'XGBoost': 0.854,
  'NeuralNetTorch': 0.854,
  'LightGBMLarge': 0.847,
  'WeightedEnsemble_L2': 0.872},
 'model_best': 'WeightedEnsemble_L2',
 'model_paths': {'KNeighborsUnif': ['KNeighborsUnif'],
  'KNeighborsDist': ['KNe

## 4. <a name="4">Model evaluation</a>
(<a href="#0">Go to top</a>)

Next, we use the held-out test set to demonstrate how to make predictions on new examples at inference time.

In [6]:
# First, predict on the whole test set so that y_pred lines up with y_true below
y_pred = predictor.predict(test_data)

# Then, evaluations
predictor.evaluate_predictions(y_true=test_data['Outcome Type'],
                               y_pred=y_pred,
                               auxiliary_metrics=True)

{'accuracy': 0.8597758927636402,
 'balanced_accuracy': 0.8476454319618011,
 'mcc': 0.7163511597035244,
 'f1': 0.8835550917471084,
 'precision': 0.8352515619861888,
 'recall': 0.9377884437880746}
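
To make the reported metrics concrete, the sketch below computes the same six quantities with scikit-learn on a tiny made-up label vector. It illustrates what each number measures; it is not a reproduction of AutoGluon's internal implementation, and the toy labels are invented.

```python
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    f1_score,
    matthews_corrcoef,
    precision_score,
    recall_score,
)

# Toy labels: 4 true positives, 1 false negative, 1 false positive, 2 true negatives.
y_true_toy = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred_toy = [1, 1, 0, 0, 1, 1, 0, 1]

metrics = {
    "accuracy": accuracy_score(y_true_toy, y_pred_toy),
    "balanced_accuracy": balanced_accuracy_score(y_true_toy, y_pred_toy),
    "mcc": matthews_corrcoef(y_true_toy, y_pred_toy),
    "f1": f1_score(y_true_toy, y_pred_toy),
    "precision": precision_score(y_true_toy, y_pred_toy),
    "recall": recall_score(y_true_toy, y_pred_toy),
}

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Balanced accuracy averages the recall of each class, so it is lower than plain accuracy here because the negative class is predicted less reliably than the positive one.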

We can see the performance of each individual trained model on the test data:

In [7]:
predictor.leaderboard(test_data, silent=True)

| | model | score_test | score_val | eval_metric | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | WeightedEnsemble_L2 | 0.859776 | 0.872 | accuracy | 1.356899 | 2.410174 | 19.341192 | 0.004159 | 0.001181 | 0.193021 | 2 | True | 14 |
| 1 | RandomForestEntr | 0.855482 | 0.85 | accuracy | 0.26559 | 0.137594 | 1.72465 | 0.26559 | 0.137594 | 1.72465 | 1 | True | 6 |
| 2 | CatBoost | 0.854016 | 0.854 | accuracy | 0.042136 | 0.030123 | 6.214871 | 0.042136 | 0.030123 | 6.214871 | 1 | True | 7 |
| 3 | RandomForestGini | 0.854016 | 0.853 | accuracy | 0.226414 | 0.093923 | 1.52907 | 0.226414 | 0.093923 | 1.52907 | 1 | True | 5 |
| 4 | XGBoost | 0.853493 | 0.854 | accuracy | 0.131057 | 0.021203 | 4.380946 | 0.131057 | 0.021203 | 4.380946 | 1 | True | 11 |
| 5 | LightGBM | 0.850141 | 0.853 | accuracy | 0.059189 | 0.0105 | 1.349329 | 0.059189 | 0.0105 | 1.349329 | 1 | True | 4 |
| 6 | LightGBMLarge | 0.846686 | 0.847 | accuracy | 0.040833 | 0.009413 | 2.920159 | 0.040833 | 0.009413 | 2.920159 | 1 | True | 13 |
| 7 | LightGBMXT | 0.846686 | 0.848 | accuracy | 0.094671 | 0.014008 | 4.366523 | 0.094671 | 0.014008 | 4.366523 | 1 | True | 3 |
| 8 | NeuralNetTorch | 0.846581 | 0.854 | accuracy | 0.08902 | 0.024079 | 28.960747 | 0.08902 | 0.024079 | 28.960747 | 1 | True | 12 |
| 9 | ExtraTreesGini | 0.842811 | 0.836 | accuracy | 0.275699 | 0.130907 | 1.496954 | 0.275699 | 0.130907 | 1.496954 | 1 | True | 8 |
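
Because the leaderboard is returned as a pandas DataFrame, it can be post-processed like any other table. As a small sketch, the block below copies three rows from the output above into a DataFrame and computes the gap between validation and test accuracy; the `val_test_gap` column name is an assumption introduced for illustration.

```python
import pandas as pd

# Three rows copied from the leaderboard output above.
lb = pd.DataFrame({
    "model": ["WeightedEnsemble_L2", "RandomForestEntr", "NeuralNetTorch"],
    "score_test": [0.859776, 0.855482, 0.846581],
    "score_val": [0.872, 0.85, 0.854],
})

# Positive gap: the model scored higher on validation than on the test set.
lb["val_test_gap"] = (lb["score_val"] - lb["score_test"]).round(6)

print(lb.sort_values("val_test_gap", ascending=False))
```

In the notebook itself, the same column could be added to the result of `predictor.leaderboard(test_data, silent=True)` instead of the hand-copied rows.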


## 5. <a name="5">Clean up model artifacts</a>
(<a href="#0">Go to top</a>)

In [8]:
!rm -r AutogluonModels
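
If a shell is not available (for example on Windows), the same cleanup can be sketched in Python. The helper name `remove_artifacts` is hypothetical, and the demonstration below runs against a temporary directory rather than the real model folder.

```python
import shutil
import tempfile
from pathlib import Path


def remove_artifacts(path="AutogluonModels"):
    """Delete the model directory if it exists; return True if it was removed."""
    target = Path(path)
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False


# Demonstration on a throwaway directory so nothing real is deleted.
demo_root = Path(tempfile.mkdtemp())
demo_models = demo_root / "AutogluonModels"
demo_models.mkdir()

removed = remove_artifacts(demo_models)
print(removed, demo_models.exists())
```

In the notebook, calling `remove_artifacts()` with no arguments would target the `AutogluonModels` folder in the current working directory.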