<a href = "https://www.pieriantraining.com"><img src="../PT Centered Purple.png"> </a>

<em style="text-align:center">Copyrighted by Pierian Training</em>

# Advanced Settings with AutoGluon

In this notebook we look at some advanced AutoGluon settings: presets, smaller hyperparameter sets, deployment, and feature engineering.

We'll use the same dataset as in the previous notebook. To save time during the demonstrations, we'll shrink it: the cells below keep a random third of the rows and then train on just 10% of that sample.

In [14]:
from autogluon.tabular import TabularDataset, TabularPredictor

In [5]:
data = TabularDataset("data/uber/uber.csv")
# To save time, keep only a random third of the rows
data = data.sample(len(data) // 3, random_state=42)

In [6]:
len(data)

66666

In [7]:
# Train on 10% of the sampled rows and hold out the rest for testing
train_size = len(data) // 10
seed = 42
train_data = data.sample(train_size, random_state=seed)
test_data = data.drop(train_data.index)

### Presets

AutoGluon comes with many presets that you can pass to the training routine.
The training process is adjusted depending on the preset you choose.

You can find the documentation for all presets here:
https://auto.gluon.ai/stable/api/autogluon.tabular.TabularPredictor.fit.html#:~:text=num_bag_sets%20is%20specified.-,presets

Currently, the following presets are available:
['best_quality', 'high_quality', 'good_quality', 'medium_quality', 'optimize_for_deployment', 'interpretable', 'ignore_text']


To get the best possible model, you can use the *best_quality* preset. This will, however, drastically increase the training time.

To reduce the training time, we can exclude both neural network model types (NN_TORCH and FASTAI) via <br />
**predictor.fit(train_data, presets=presets, excluded_model_types=["NN_TORCH", "FASTAI"])**



In [None]:
save_path = 'uber_predictors_preset'
presets = ["best_quality"]
predictor = TabularPredictor(label="fare_amount", path=save_path)
predictor.fit(train_data, presets=presets, excluded_model_types=["NN_TORCH", "FASTAI"])

Let's check the leaderboard to get a ranking of the individual models by validation score.

In [9]:
predictor.leaderboard()

                    model   score_val  pred_time_val  fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0     WeightedEnsemble_L2   -5.773691       0.629315  3.692127                0.000000           0.172537            2       True          5
1  RandomForestMSE_BAG_L1   -5.782839       0.316154  2.685818                0.316154           2.685818            1       True          3
2     WeightedEnsemble_L3   -5.885338       1.292543  7.775210                0.000998           0.130651            3       True          8
3    ExtraTreesMSE_BAG_L2   -5.896083       0.943477  4.541854                0.284240           0.983370            2       True          7
4  RandomForestMSE_BAG_L2   -6.016210       1.007305  6.661188                0.348069           3.102704            2       True          6
5    ExtraTreesMSE_BAG_L1   -6.800786       0.313161  0.833771                0.313161           0.833771            1       True          4
6   KNeighborsUnif_BAG_L1  -10.714851       0.016956  0.017951                0.016956           0.017951            1       True          1
7   KNeighborsDist_BAG_L1  -11.292901       0.012966  0.020944                0.012966           0.020944            1       True          2
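
The leaderboard also shows how much prediction time each model costs. As a minimal sketch, the snippet below hard-codes the rows from the leaderboard output above and picks the fastest model whose validation score is within a tolerance of the best one. The helper name `fastest_within` and the tolerance of 0.05 are assumptions chosen for illustration only.

```python
# Rows copied from the leaderboard output above: (model, score_val, pred_time_val)
LEADERBOARD = [
    ("WeightedEnsemble_L2", -5.773691, 0.629315),
    ("RandomForestMSE_BAG_L1", -5.782839, 0.316154),
    ("WeightedEnsemble_L3", -5.885338, 1.292543),
    ("ExtraTreesMSE_BAG_L2", -5.896083, 0.943477),
    ("RandomForestMSE_BAG_L2", -6.016210, 1.007305),
    ("ExtraTreesMSE_BAG_L1", -6.800786, 0.313161),
    ("KNeighborsUnif_BAG_L1", -10.714851, 0.016956),
    ("KNeighborsDist_BAG_L1", -11.292901, 0.012966),
]


def fastest_within(rows, tolerance):
    """Return the row with the lowest pred_time_val among all rows whose
    score_val is at most `tolerance` below the best score (hypothetical helper)."""
    best_score = max(score for _, score, _ in rows)  # scores are higher-is-better
    candidates = [row for row in rows if best_score - row[1] <= tolerance]
    return min(candidates, key=lambda row: row[2])


name, score, seconds = fastest_within(LEADERBOARD, tolerance=0.05)
print(name, score, seconds)  # RandomForestMSE_BAG_L1 -5.782839 0.316154
```

If a single model is good enough, its name can be passed to prediction, for example `predictor.predict(test_data, model=name)`.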


In [10]:
y_test = test_data["fare_amount"]
test_data = test_data.drop(columns=["fare_amount"])

y_pred = predictor.predict(test_data)
metrics = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)


Evaluation: root_mean_squared_error on test data: -5.948512879596984
	Note: Scores are always higher_is_better. This metric score can be multiplied by -1 to get the metric value.
Evaluations on test data:
{
    "root_mean_squared_error": -5.948512879596984,
    "mean_squared_error": -35.384805478731195,
    "mean_absolute_error": -2.5376794884308183,
    "r2": 0.6577656018723376,
    "pearsonr": 0.8113561769828805,
    "median_absolute_error": -1.5141555786132814
}


We can see that the result is slightly better than in the previous notebook (where the MAE was > 1.9).
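
The note in the evaluation output says that scores are always reported as higher-is-better, so error metrics appear with a negative sign. The sketch below copies the printed scores and flips the sign of the error metrics to recover the usual metric values. Which metrics count as error metrics is an assumption inferred from the negative values in the output, and `to_metric_values` is a hypothetical helper name.

```python
# Scores copied from the evaluation output above
scores = {
    "root_mean_squared_error": -5.948512879596984,
    "mean_squared_error": -35.384805478731195,
    "mean_absolute_error": -2.5376794884308183,
    "r2": 0.6577656018723376,
    "pearsonr": 0.8113561769828805,
    "median_absolute_error": -1.5141555786132814,
}

# Assumption: these are the metrics reported with a flipped sign
ERROR_METRICS = {
    "root_mean_squared_error",
    "mean_squared_error",
    "mean_absolute_error",
    "median_absolute_error",
}


def to_metric_values(scores):
    """Multiply error-metric scores by -1 to get the conventional metric value."""
    return {
        name: -value if name in ERROR_METRICS else value
        for name, value in scores.items()
    }


values = to_metric_values(scores)
for name, value in values.items():
    print(f"{name}: {value:.4f}")
```

After the conversion, the RMSE squared matches the MSE again, which is a quick sanity check that the signs were handled correctly.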

Alternatively, we can pass a **time_limit** (in seconds) for the whole training run.
Let's use a time limit of 1800 seconds (30 minutes).

In [None]:
save_path = 'uber_predictors_preset2'
presets = ["best_quality"]
predictor = TabularPredictor(label="fare_amount", path=save_path)
predictor.fit(train_data, presets=presets, time_limit=1800)
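
The two speed-ups shown so far, excluding the neural network models and setting a time limit, can also be combined in one call. The following is only a sketch: `build_fit_kwargs` and `fit_predictor` are hypothetical helpers, not part of AutoGluon, and they only assemble the `presets`, `time_limit`, and `excluded_model_types` arguments already used in this notebook.

```python
def build_fit_kwargs(preset="best_quality", minutes=None, skip_neural_nets=False):
    """Assemble keyword arguments for TabularPredictor.fit (hypothetical helper)."""
    kwargs = {"presets": [preset]}
    if minutes is not None:
        kwargs["time_limit"] = int(minutes * 60)  # time_limit is given in seconds
    if skip_neural_nets:
        kwargs["excluded_model_types"] = ["NN_TORCH", "FASTAI"]
    return kwargs


def fit_predictor(train_data, label, path, **options):
    """Create a predictor and fit it with the assembled arguments (hypothetical helper)."""
    # Imported lazily so build_fit_kwargs can be inspected without AutoGluon installed
    from autogluon.tabular import TabularPredictor

    predictor = TabularPredictor(label=label, path=path)
    return predictor.fit(train_data, **build_fit_kwargs(**options))


print(build_fit_kwargs(minutes=30, skip_neural_nets=True))
```

With the data from this notebook, a call could look like `fit_predictor(train_data, "fare_amount", "uber_predictors_preset2", minutes=30, skip_neural_nets=True)`.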

In [12]:

y_pred = predictor.predict(test_data)
metrics = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)


Evaluation: root_mean_squared_error on test data: -5.948512879596984
	Note: Scores are always higher_is_better. This metric score can be multiplied by -1 to get the metric value.
Evaluations on test data:
{
    "root_mean_squared_error": -5.948512879596984,
    "mean_squared_error": -35.384805478731195,
    "mean_absolute_error": -2.5376794884308183,
    "r2": 0.6577656018723376,
    "pearsonr": 0.8113561769828805,
    "median_absolute_error": -1.5141555786132814
}


### Hyperparameter presets

To increase the training speed, we can choose between different hyperparameter presets:

https://auto.gluon.ai/stable/api/autogluon.tabular.TabularPredictor.fit.html#:~:text=was%20also%20specified.-,hyperparameters

Currently, the following sets are available: ['default', 'light', 'very_light', 'toy', 'multimodal']

In [None]:
save_path = 'uber_predictors_hyperparams_very_light'

predictor = TabularPredictor(label="fare_amount", path=save_path)
predictor.fit(train_data, presets=presets, time_limit=1800, hyperparameters="very_light")

In [None]:

y_pred = predictor.predict(test_data)
metrics = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)


## Deployment
Once we have identified the best model, we might want to deploy it.

To do so, let us first create a clone of the original predictor, which we will then postprocess further.

In [None]:
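save_path = 'uber_predictors_preset'  # This was the best model from above
save_path_clone = save_path + '_clone'

predictor = TabularPredictor.load(save_path)
# clone() copies the predictor to a new directory and returns that path
path_clone = predictor.clone(path=save_path_clone)
# Continue with the clone so that the original predictor stays untouched
predictor = TabularPredictor.load(path_clone)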


Optimize inference speed.
Next, we can call the **refit_full()** function (https://auto.gluon.ai/dev/api/autogluon.tabular.TabularPredictor.refit_full.html). If bagging was used, it retrains each model once on all of the data, replacing the bagged ensemble of fold models with a single model and thereby reducing inference time.

In [None]:
predictor.refit_full()

Last but not least, we can use **clone_for_deployment()**, which keeps only the artifacts required for inference and thus requires less storage.

In [None]:
predictor.clone_for_deployment("uber_predictor_deployment")
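
To see how much storage the deployment clone actually saves, the predictor directories on disk can be compared. This is a minimal sketch using only the standard library. `dir_size_mb` is a hypothetical helper, and the two directory names are the ones used in the cells above, so the loop only reports sizes once those cells have been run.

```python
import os


def dir_size_mb(path):
    """Sum the sizes of all files below `path` and return the total in megabytes."""
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total_bytes += os.path.getsize(os.path.join(root, name))
    return total_bytes / 1e6


# Directory names taken from the cells above
for path in ("uber_predictors_preset", "uber_predictor_deployment"):
    if os.path.isdir(path):
        print(f"{path}: {dir_size_mb(path):.1f} MB")
    else:
        print(f"{path}: not found (run the cells above first)")
```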

## Feature Engineering
AutoGluon comes with a heavily optimized automated feature engineering pipeline, the **AutoMLPipelineFeatureGenerator**: https://auto.gluon.ai/stable/api/autogluon.features.html

A detailed overview of the implemented routines can be found here: https://auto.gluon.ai/stable/tutorials/tabular/tabular-feature-engineering.html

In [None]:
from autogluon.features import AutoMLPipelineFeatureGenerator

We have already seen in the previous lecture that AutoGluon successfully recognized the date in the timestamp string and converted it into multiple columns. All of this functionality is implemented in the generator described above.
It can be used as follows (if you are familiar with sklearn, you already know this routine):

In [None]:
train_data.head()

In [None]:
feature_pipeline = AutoMLPipelineFeatureGenerator()
transformed_features = feature_pipeline.fit_transform(train_data)

In [None]:
transformed_features.head()

### Custom feature pipeline
If you want to define your own feature preprocessing pipeline, you can use the **PipelineFeatureGenerator()**.

In [None]:
from autogluon.features import PipelineFeatureGenerator, DatetimeFeatureGenerator

In [None]:
feature_gen_custom = PipelineFeatureGenerator(generators=[
    DatetimeFeatureGenerator(features=["year", "month", "day", "hour", "minute"]),
])

In [None]:
transformed_custom = feature_gen_custom.fit_transform(train_data)

In [None]:
transformed_custom.head()
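
To make the idea behind the datetime features more concrete, the sketch below splits a single timestamp string into the same five parts that were requested above. It uses only the standard library and is not AutoGluon's implementation. The sample timestamp and its format string are assumptions about what such a column might contain, and `datetime_parts` is a hypothetical helper.

```python
from datetime import datetime


def datetime_parts(timestamp, fmt="%Y-%m-%d %H:%M:%S UTC"):
    """Split a timestamp string into year, month, day, hour, and minute.

    The default format is an assumption; adjust it to match your data.
    """
    parsed = datetime.strptime(timestamp, fmt)
    return {
        "year": parsed.year,
        "month": parsed.month,
        "day": parsed.day,
        "hour": parsed.hour,
        "minute": parsed.minute,
    }


print(datetime_parts("2015-05-07 19:52:06 UTC"))
# {'year': 2015, 'month': 5, 'day': 7, 'hour': 19, 'minute': 52}
```

Each part becomes a separate numeric column, which lets tree-based models pick up patterns such as rush hours or seasonal effects.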

# Great Job!

In [None]:
print('done')