# Model Validations

We have seen how to use the `evaluate` method to evaluate a model on a separate dataset using multiple metrics. During `fit`, AutoGluon also evaluates models to select hyperparameters and model combination weights. In this tutorial, we show how to change AutoGluon's internal evaluation to better guide its auto-tuning.

By default, AutoGluon randomly samples a fraction of the training examples, i.e. the `train_data` passed to the `fit` method, as a validation set, and picks an evaluation metric based on the inferred task type. [Accuracy](https://scikit-learn.org/stable/modules/model_evaluation.html#accuracy-score) is used for classification, [root mean squared error](https://en.wikipedia.org/wiki/Root-mean-square_deviation) for regression, and [pinball loss](https://scikit-learn.org/stable/modules/model_evaluation.html#pinball-loss) for quantile regression.
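To make the default behavior concrete, here is a minimal sketch in plain Python of a random holdout split and a task-to-metric lookup. It is not AutoGluon's implementation: the helper `random_holdout_split`, the 20% fraction, and the dictionary keys are assumptions chosen for illustration only.

```python
import random

# Default metric per task type, as described in the text above.
# The keys are illustrative labels, not necessarily AutoGluon's identifiers.
DEFAULT_METRIC = {
    "classification": "accuracy",
    "regression": "root_mean_squared_error",
    "quantile": "pinball_loss",
}

def random_holdout_split(n_rows, holdout_frac=0.2, seed=0):
    """Return (train_idx, val_idx), holding out a random fraction of the rows.

    `holdout_frac=0.2` is an arbitrary value for this sketch.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_val = max(1, int(n_rows * holdout_frac))
    return sorted(indices[n_val:]), sorted(indices[:n_val])

train_idx, val_idx = random_holdout_split(10, holdout_frac=0.2)
print(len(train_idx), len(val_idx))  # 8 2
print(DEFAULT_METRIC["regression"])  # root_mean_squared_error
```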

There are two typical situations in which you want to change the default setting. The first is that you have a separate validation set, whose distribution may differ from that of the training set, and the model performance on the validation set matters more. For example, when predicting whether a user will click an online ad, we want to use the user click histories from the last few days to tune models, as recent user behavior will be closer to what our models see after deployment. To do so, you can specify the `tuning_data` argument in the `fit` method.
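The click-prediction scenario amounts to a split by recency rather than a random split. The sketch below shows the idea on a made-up click log in plain Python; the helper `split_by_recency`, the field names `day` and `clicked`, and the three-day window are all hypothetical.

```python
from datetime import date, timedelta

def split_by_recency(rows, date_key, tuning_days):
    """Split rows into (older, recent); `recent` covers the last `tuning_days` days."""
    latest = max(row[date_key] for row in rows)
    cutoff = latest - timedelta(days=tuning_days)
    older = [row for row in rows if row[date_key] <= cutoff]
    recent = [row for row in rows if row[date_key] > cutoff]
    return older, recent

# Hypothetical click log: one record per day for ten consecutive days.
clicks = [
    {"day": date(2022, 7, 1) + timedelta(days=i), "clicked": i % 2}
    for i in range(10)
]

train_rows, tuning_rows = split_by_recency(clicks, "day", tuning_days=3)
print(len(train_rows), len(tuning_rows))  # 7 3
```

With tabular data loaded as DataFrames, the older part would be passed as `train_data` and the recent part as `tuning_data`.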

:::{tip}
If your validation set follows the same distribution as the training set, we recommend combining both into a single `train_data` instead of specifying `tuning_data`. The reason is that AutoGluon has efficient partition mechanisms, such as stratified sampling, that maximize example utilization while still avoiding severe overfitting.
:::
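For readers unfamiliar with the stratified sampling mentioned in the tip, the following sketch shows the general idea: each class contributes the same share of its examples to the validation part. This is a generic illustration, not AutoGluon's partition code; `stratified_split` and `val_frac` are hypothetical names.

```python
import random
from collections import defaultdict

def stratified_split(labels, val_frac=0.2, seed=0):
    """Return (train_idx, val_idx) with roughly the same class mix in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Hold out at least one example per class so no class is missing.
        n_val = max(1, round(len(members) * val_frac))
        val_idx.extend(members[:n_val])
        train_idx.extend(members[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Toy imbalanced labels: 90 examples of class 0 and 10 of class 2.
labels = [0] * 90 + [2] * 10
train_idx, val_idx = stratified_split(labels)
print(len(val_idx), sum(labels[i] == 2 for i in val_idx))  # 20 2
```

A purely random split of the same toy labels could leave the rare class with zero or one validation example, which makes scores on that class unreliable.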

The second is that you evaluate model performance with a different metric, so you want AutoGluon to use this metric to score models during `fit`. AutoGluon provides the following metrics:

Classification
:  `accuracy`, `balanced_accuracy`, `f1`, `f1_macro`, `f1_micro`, `f1_weighted`, `roc_auc`, `roc_auc_ovo_macro`, `average_precision`, `precision`, `precision_macro`, `precision_micro`, `precision_weighted`, `recall`, `recall_macro`, `recall_micro`, `recall_weighted`, `log_loss`, `pac_score`

Regression
: `root_mean_squared_error`, `mean_squared_error`, `mean_absolute_error`, `median_absolute_error`, `mean_absolute_percentage_error`, `r2`

You can find their definitions in [scikit-learn](https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics). Note that the above metric names differ slightly from scikit-learn's, as the `_score` and `_loss` suffixes are often dropped for simplicity.
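As a rough illustration of the naming difference, the hypothetical helper below strips a trailing `_score` from a few scikit-learn function names. It is only a sketch: names such as `log_loss` in the list above keep their suffix, so no single rule covers every metric, and the list itself remains the reference.

```python
def strip_sklearn_suffix(sklearn_name):
    """Drop a trailing `_score` from a scikit-learn metric function name."""
    suffix = "_score"
    if sklearn_name.endswith(suffix):
        return sklearn_name[: -len(suffix)]
    return sklearn_name

sklearn_names = [
    "balanced_accuracy_score",
    "f1_score",
    "roc_auc_score",
    "r2_score",
    "log_loss",  # unchanged: this name appears with its suffix in the list above
]
for name in sklearn_names:
    print(name, "->", strip_sklearn_suffix(name))
```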

(TODO, rename `pac_score` to `pac` to be more consistent). 

:::{seealso}
If your metric isn't included above, you can define your customized metrics [TODO link].
:::


In the following example, we use the last 2000 training examples as the tuning set, and [`balanced_accuracy`](https://scikit-learn.org/stable/modules/model_evaluation.html#balanced-accuracy-score) as the metric to score models during fit. 

In [None]:
#@title Install autogluon
!pip install autogluon==0.5.0

In [13]:
#@title Load Knot Theory data
from autogluon.tabular import TabularDataset, TabularPredictor

url = 'https://raw.githubusercontent.com/mli/ag-docs/main/knot_theory/'
train_data = TabularDataset(url+'train.csv')
test_data = TabularDataset(url+'test.csv')
label = 'signature'

Loaded data from: https://raw.githubusercontent.com/mli/ag-docs/main/knot_theory/train.csv | Columns = 19 / 19 | Rows = 10000 -> 10000
Loaded data from: https://raw.githubusercontent.com/mli/ag-docs/main/knot_theory/test.csv | Columns = 19 / 19 | Rows = 5000 -> 5000


In [9]:
predictor = TabularPredictor(label=label, eval_metric='balanced_accuracy').fit(
    train_data[:-2000], tuning_data=train_data[-2000:])

No path specified. Models will be saved in: "AutogluonModels/ag-20220710_164022/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20220710_164022/"
AutoGluon Version:  0.5.0
Python Version:     3.9.12
Operating System:   Linux
Train Data Rows:    8000
Train Data Columns: 18
Tuning Data Rows:    2000
Tuning Data Columns: 18
Label Column: signature
Preprocessing data ...
AutoGluon infers your prediction problem is: 'multiclass' (because dtype of label-column == int, but few unique label-values observed).
	First 10 (of 12) unique label values:  [-2, 0, 2, -8, 4, -4, -6, 8, 6, 10]
	If 'multiclass' is not the correct problem_type, please manually specify the problem_type parameter during predic



	0.9285	 = Validation score   (mcc)
	5.94s	 = Training   runtime
	0.03s	 = Validation runtime
Fitting model: NeuralNetTorch ...
	0.9219	 = Validation score   (mcc)
	46.91s	 = Training   runtime
	0.02s	 = Validation runtime
Fitting model: LightGBMLarge ...
	0.9316	 = Validation score   (mcc)
	11.04s	 = Training   runtime
	0.06s	 = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
	0.9427	 = Validation score   (mcc)
	2.67s	 = Training   runtime
	0.0s	 = Validation runtime
AutoGluon training complete, total runtime = 144.1s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20220710_164022/")


Finally, let's evaluate the model performance. 

In [11]:
predictor.evaluate(test_data, silent=True)

{'mcc': 0.9299397919476182,
 'accuracy': 0.9428,
 'balanced_accuracy': 0.7547076012362441}
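A gap like the one between `accuracy` and `balanced_accuracy` above can arise when classes are imbalanced, because balanced accuracy averages the recall of each class and so weights rare classes as heavily as common ones. The sketch below reproduces the effect on made-up labels in plain Python. The toy data is invented for illustration, and whether class imbalance explains the gap on this particular dataset is an assumption, not something verified here.

```python
from collections import defaultdict

def accuracy(y_true, y_pred):
    """Fraction of examples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Average of per-class recall, so every class counts equally."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Toy labels: 18 examples of class 0 and 2 of class 2.
# The predictions get every class-0 example right but miss one class-2 example.
y_true = [0] * 18 + [2, 2]
y_pred = [0] * 18 + [0, 2]

print(accuracy(y_true, y_pred))           # 0.95
print(balanced_accuracy(y_true, y_pred))  # 0.75
```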

TODO: choosing a different `eval_metric` doesn't guarantee a better score on that metric, even on the validation dataset.