# Custom Models

In this tutorial, we show how to add a custom model to AutoGluon. A custom model can be a model that AutoGluon does not provide yet, or one that you know fits your data well. Although you can train your model directly, integrating it into AutoGluon lets you leverage AutoGluon's feature engineering, model selection, and model combination (ensembling).

AutoGluon's base class for an ML model is {class}`autogluon.core.models.AbstractModel`. Define your model as a subclass of it and implement two methods:

- `_preprocess`: transforms the input data into the internal representation used by the model.
- `_fit`: fits your model on the input data.

In the following example, we wrap scikit-learn's random forest into an AutoGluon model. 

In [21]:
import numpy as np
import pandas as pd

from autogluon.core.models import AbstractModel
from autogluon.features.generators import LabelEncoderFeatureGenerator

class CustomRandomForestModel(AbstractModel):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._feature_generator = None

    # `_preprocess` is called by `preprocess` 
    def _preprocess(self, X: pd.DataFrame, is_train=False, **kwargs) -> np.ndarray:
        X = super()._preprocess(X, **kwargs)
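        # `_feature_generator` is a label encoder that is fitted on the training
        # data and reused at inference time, so each category always maps to
        # the same integer.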
        if is_train:  # called during model fit
            self._feature_generator = LabelEncoderFeatureGenerator(verbosity=0)
            self._feature_generator.fit(X=X)
        if self._feature_generator.features_in:
            # This converts categorical features to numeric via stateful label encoding.
            X = X.copy()
            X[self._feature_generator.features_in] = self._feature_generator.transform(X=X)
        # Handle missing values
        return X.fillna(0).to_numpy(dtype=np.float32)

    # Note that we ignored some common arguments such as X_val, y_val, time_limit
    def _fit(self, X: pd.DataFrame, y: pd.Series, **kwargs):
        # Import the model's dependencies here rather than at module level.
        # If the library is not installed, only this model fails and
        # AutoGluon can still train the other models.
        from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

        # problem_type can be binary, multiclass, regression, quantile, softclass
        if self.problem_type in ['regression', 'softclass']:
            model_cls = RandomForestRegressor
        else:
            model_cls = RandomForestClassifier

        # Call the public `preprocess` method, which calls `_preprocess` internally.
        X = self.preprocess(X, is_train=True)
        # Fetches the user-specified (and default) hyperparameters.
        params = self._get_model_params()
        self.model = model_cls(**params)
        self.model.fit(X, y)
    
    # Defines the default hyperparameters. We can override them later.
    def _set_default_params(self):
        default_params = {
            'n_estimators': 300,
            'n_jobs': -1,
            'random_state': 0,
        }
        for param, val in default_params.items():
            self._set_default_param_value(param, val)

    # Defines model-agnostic hyperparameters. 
    # In most cases, you only need to specify the valid/invalid dtypes.
    def _get_default_auxiliary_params(self) -> dict:
        default_auxiliary_params = super()._get_default_auxiliary_params()
        # all raw dtypes are: ['int', 'float', 'category', 'object', 'datetime']
        # objects can be raw texts and image paths.
        default_auxiliary_params.update(
            {'valid_raw_types': ['int', 'float', 'category']})
        return default_auxiliary_params

```{seealso}
You can check the source code for {class}`autogluon.tabular.models.RFModel` for the official AutoGluon random forest implementation.
```
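To make the `_preprocess` step more concrete, here is a minimal sketch of the same idea using only pandas and NumPy. It uses pandas category codes as a stand-in for `LabelEncoderFeatureGenerator`, which is an assumption for illustration: the real generator may encode values differently. The helper names `fit_encoder` and `transform` are hypothetical and not part of AutoGluon.

```python
import numpy as np
import pandas as pd


def fit_encoder(X: pd.DataFrame) -> dict:
    """Remember the categories of every categorical column seen at fit time."""
    categorical_cols = X.select_dtypes(include="category").columns
    return {col: X[col].cat.categories for col in categorical_cols}


def transform(X: pd.DataFrame, encoder: dict) -> np.ndarray:
    """Encode categorical columns with the fitted categories, then fill NaNs."""
    X = X.copy()
    for col, categories in encoder.items():
        # Reusing the fitted categories keeps the integer codes consistent
        # between training and inference. Unseen values get the code -1.
        X[col] = pd.Categorical(X[col], categories=categories).codes
    # Replace missing numeric values with 0 and return a float32 array.
    return X.fillna(0).to_numpy(dtype=np.float32)


# Toy data, made up for this illustration.
toy_train = pd.DataFrame({
    "color": pd.Series(["red", "blue", "red"], dtype="category"),
    "size": [1.0, np.nan, 3.0],
})
toy_test = pd.DataFrame({
    "color": pd.Series(["blue", "green"], dtype="category"),  # "green" is unseen
    "size": [2.0, 5.0],
})

encoder = fit_encoder(toy_train)          # state learned during fit
train_array = transform(toy_train, encoder)
test_array = transform(toy_test, encoder)  # same mapping reused at inference
print(train_array)
print(test_array)
```

The important design point is that the encoder is fitted once, on the training data, and then only applied afterwards. This mirrors the `is_train` branch in `_preprocess` above.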

Now let's train the custom model on an example dataset.


In [2]:
# Load the knot theory data
from autogluon.tabular import TabularDataset, TabularPredictor

url = 'https://raw.githubusercontent.com/mli/ag-docs/main/knot_theory/'
train_data = TabularDataset(url+'train.csv')
test_data = TabularDataset(url+'test.csv')
label = 'signature'

In [24]:
from autogluon.core.space import Categorical, Int, Real

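# These hyperparameters are not among the defaults in `_set_default_params`.
# They reach the scikit-learn model through `_get_model_params()`, which
# returns the defaults together with the user-specified values.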
custom_hyperparameters_hpo = {CustomRandomForestModel: {
    'max_depth': Int(lower=5, upper=30),
    'max_features': Real(lower=0.1, upper=1.0),
    'criterion': Categorical('gini', 'entropy'),
}}

predictor = TabularPredictor(label=label).fit(
    train_data, hyperparameters=custom_hyperparameters_hpo,
    hyperparameter_tune_kwargs='auto', 
    time_limit=60)


No path specified. Models will be saved in: "AutogluonModels/ag-20220713_205924/"
Beginning AutoGluon training ... Time limit = 60s
AutoGluon will save models to "AutogluonModels/ag-20220713_205924/"
AutoGluon Version:  0.5.0
Python Version:     3.9.12
Operating System:   Linux
Train Data Rows:    10000
Train Data Columns: 18
Label Column: signature
Preprocessing data ...
AutoGluon infers your prediction problem is: 'multiclass' (because dtype of label-column == int, but few unique label-values observed).
	First 10 (of 13) unique label values:  [-2, 0, 2, -8, 4, -4, -6, 8, 6, 10]
	If 'multiclass' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Fraction of data from classes with at least 10 examples that will be kept for training models: 0.9984
Train Data Class Count: 9
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeature

Fitted model: CustomRandomForestModel/T32 ...
	0.9514	 = Validation score   (accuracy)
	1.12s	 = Training   runtime
	0.12s	 = Validation runtime
Fitted model: CustomRandomForestModel/T33 ...
	0.9494	 = Validation score   (accuracy)
	1.68s	 = Training   runtime
	0.12s	 = Validation runtime
Fitted model: CustomRandomForestModel/T34 ...
	0.8668	 = Validation score   (accuracy)
	0.95s	 = Training   runtime
	0.12s	 = Validation runtime
Fitted model: CustomRandomForestModel/T35 ...
	0.9529	 = Validation score   (accuracy)
	1.43s	 = Training   runtime
	0.12s	 = Validation runtime
Fitted model: CustomRandomForestModel/T36 ...
	0.9124	 = Validation score   (accuracy)
	0.95s	 = Training   runtime
	0.12s	 = Validation runtime
Fitted model: CustomRandomForestModel/T37 ...
	0.9504	 = Validation score   (accuracy)
	1.37s	 = Training   runtime
	0.12s	 = Validation runtime
Fitted model: CustomRandomForestModel/T38 ...
	0.8518	 = Validation score   (accuracy)
	0.94s	 = Training   runtime
	0.12s	 = Vali
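Each trial in the log above (`T32`, `T33`, and so on) trains the model with one configuration drawn from the search space. The sketch below imitates this with plain random sampling, using only the standard library. The `SEARCH_SPACE` encoding and the `sample_config` helper are hypothetical stand-ins: they are not AutoGluon's `Int`, `Real`, and `Categorical` classes, and AutoGluon's actual search strategy may differ.

```python
import random

# Hypothetical plain-Python encoding of the search space used above.
SEARCH_SPACE = {
    "max_depth": ("int", 5, 30),
    "max_features": ("real", 0.1, 1.0),
    "criterion": ("categorical", ["gini", "entropy"]),
}


def sample_config(space: dict, rng: random.Random) -> dict:
    """Draw one hyperparameter configuration uniformly at random."""
    config = {}
    for name, spec in space.items():
        kind = spec[0]
        if kind == "int":
            config[name] = rng.randint(spec[1], spec[2])  # both bounds inclusive
        elif kind == "real":
            config[name] = rng.uniform(spec[1], spec[2])
        elif kind == "categorical":
            config[name] = rng.choice(spec[1])
        else:
            raise ValueError(f"Unknown search space kind: {kind}")
    return config


rng = random.Random(0)  # fixed seed so the sketch is reproducible
trial_configs = [sample_config(SEARCH_SPACE, rng) for _ in range(3)]
for trial_id, config in enumerate(trial_configs):
    print(trial_id, config)
```

Each sampled dictionary plays the role of the hyperparameters that one trial would pass to the model.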

In [32]:
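# Row 0 of the leaderboard is expected to be the weighted ensemble here,
# so row 1 is the best individual model.
best_model = predictor.leaderboard(silent=True).iloc[1]['model']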
best_config = predictor.info()['model_info'][best_model]['hyperparameters']
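Selecting the model with `iloc[1]` relies on the row order of the leaderboard. As a hedged alternative, the sketch below picks the best non-ensemble model by name on a toy leaderboard. The DataFrame is made up for illustration (model names and scores are not real results), and filtering on the `WeightedEnsemble` prefix is an assumption about how ensemble models are named.

```python
import pandas as pd

# Toy leaderboard with made-up scores, sorted best-first.
toy_leaderboard = pd.DataFrame({
    "model": [
        "WeightedEnsemble_L2",
        "CustomRandomForestModel/T2",
        "CustomRandomForestModel/T1",
    ],
    "score_val": [0.96, 0.95, 0.94],
    "stack_level": [2, 1, 1],
})

# Positional selection: only correct if the ensemble is in row 0.
by_position = toy_leaderboard.iloc[1]["model"]

# Name-based selection: drop ensemble rows, then take the best remaining score.
is_ensemble = toy_leaderboard["model"].str.startswith("WeightedEnsemble")
base_models = toy_leaderboard[~is_ensemble]
by_name = base_models.sort_values("score_val", ascending=False).iloc[0]["model"]

print(by_position, by_name)
```

Both approaches agree on this toy data. The name-based version keeps working if an individual model happens to rank above the ensemble.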

In [34]:
from autogluon.tabular.configs.hyperparameter_configs import get_hyperparameter_config

custom_hyperparameters = get_hyperparameter_config('default')
custom_hyperparameters[CustomRandomForestModel] = best_config
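The `hyperparameters` argument is a dictionary that maps a model key (a string name for built-in models, or a class for custom ones) to that model's hyperparameters. The sketch below shows the shape of such a dictionary in plain Python. `MyModel`, `toy_presets`, and `toy_best_config` are hypothetical stand-ins: the real dictionary returned by `get_hyperparameter_config('default')` has more entries and different values.

```python
class MyModel:
    """Hypothetical stand-in for CustomRandomForestModel."""


# Stand-in for a preset dictionary. A key can map to a single
# hyperparameter dict or to a list of dicts (one model per dict).
toy_presets = {
    "GBM": {},
    "CAT": {},
    "RF": [{"criterion": "gini"}, {"criterion": "entropy"}],
}

# Made-up configuration standing in for the best configuration found by HPO.
toy_best_config = {"max_depth": 20, "criterion": "gini"}

# Copy the presets so they stay unchanged, then register the custom model.
toy_hyperparameters = dict(toy_presets)
toy_hyperparameters[MyModel] = toy_best_config

for key, value in toy_hyperparameters.items():
    name = key if isinstance(key, str) else key.__name__
    print(name, value)
```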


In [36]:
predictor = TabularPredictor(label=label).fit(
    train_data, hyperparameters=custom_hyperparameters)

No path specified. Models will be saved in: "AutogluonModels/ag-20220713_210651/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20220713_210651/"
AutoGluon Version:  0.5.0
Python Version:     3.9.12
Operating System:   Linux
Train Data Rows:    10000
Train Data Columns: 18
Label Column: signature
Preprocessing data ...
AutoGluon infers your prediction problem is: 'multiclass' (because dtype of label-column == int, but few unique label-values observed).
	First 10 (of 13) unique label values:  [-2, 0, 2, -8, 4, -4, -6, 8, 6, 10]
	If 'multiclass' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Fraction of data from classes with at least 10 examples that will be kept for training models: 0.9984
Train Data Class Count: 9
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Ava

In [40]:
predictor.leaderboard(test_data, silent=True).head(n=8)

| | model | score_test | score_val | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | WeightedEnsemble_L2 | 0.9498 | 0.965966 | 0.453382 | 0.165812 | 17.925526 | 0.006438 | 0.000571 | 0.67052 | 2 | True | 15 |
| 1 | LightGBM | 0.9456 | 0.955956 | 0.06857 | 0.027067 | 3.902843 | 0.06857 | 0.027067 | 3.902843 | 1 | True | 5 |
| 2 | XGBoost | 0.9448 | 0.956957 | 0.062688 | 0.019377 | 6.111501 | 0.062688 | 0.019377 | 6.111501 | 1 | True | 11 |
| 3 | LightGBMLarge | 0.9444 | 0.94995 | 0.155716 | 0.035261 | 9.678791 | 0.155716 | 0.035261 | 9.678791 | 1 | True | 13 |
| 4 | CatBoost | 0.9432 | 0.955956 | 0.01961 | 0.008976 | 18.071039 | 0.01961 | 0.008976 | 18.071039 | 1 | True | 8 |
| 5 | CustomRandomForestModel | 0.9424 | 0.945946 | 0.165466 | 0.119914 | 1.596552 | 0.165466 | 0.119914 | 1.596552 | 1 | True | 14 |
| 6 | RandomForestEntr | 0.9388 | 0.94995 | 0.175203 | 0.119092 | 1.09261 | 0.175203 | 0.119092 | 1.09261 | 1 | True | 7 |
| 7 | NeuralNetFastAI | 0.938 | 0.943944 | 0.21879 | 0.02595 | 9.546953 | 0.21879 | 0.02595 | 9.546953 | 1 | True | 3 |
