# Tabular Prediction

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mli/ag-docs/blob/main/get_started/tabular_quick_start.ipynb)


In a tabular prediction task, we predict the values in one column based on the values in the remaining columns. This tutorial demonstrates how to use AutoGluon for this task.

To start, import the {class}`autogluon.tabular.TabularDataset` and 
{class}`autogluon.tabular.TabularPredictor` classes. We will use the former to load data and the latter to train models and predict. 



In [2]:
#@title Install autogluon
!pip install autogluon==0.5.0









In [4]:
from autogluon.tabular import TabularDataset, TabularPredictor

The dataset we will use is from the cover story of [Nature issue 7887](https://www.nature.com/nature/volumes/600/issues/7887): [AI-guided intuition for math theorems](https://www.nature.com/articles/s41586-021-04086-x.pdf). The task is to predict a knot's signature based on its properties. We sampled 10K training and 5K test examples from the [original data](https://github.com/deepmind/mathematics_conjectures/blob/main/knot_theory.ipynb). The sampled dataset makes this tutorial run fast, but you can also try the full dataset.

We load this dataset directly from a URL. Because the `TabularDataset` class is a subclass of [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), any pandas method can be applied to it.

In [20]:
url = 'https://raw.githubusercontent.com/mli/ag-docs/main/knot_theory/'
train_data = TabularDataset(url+'train.csv')
train_data.head()

Loaded data from: https://raw.githubusercontent.com/mli/ag-docs/main/knot_theory/train.csv | Columns = 19 / 19 | Rows = 10000 -> 10000


Unnamed: 0.1,Unnamed: 0,chern_simons,cusp_volume,hyperbolic_adjoint_torsion_degree,hyperbolic_torsion_degree,injectivity_radius,longitudinal_translation,meridinal_translation_imag,meridinal_translation_real,short_geodesic_imag_part,short_geodesic_real_part,Symmetry_0,Symmetry_D3,Symmetry_D4,Symmetry_D6,Symmetry_D8,Symmetry_Z/2 + Z/2,volume,signature
0,70746,0.09053,12.226322,0,10,0.507756,10.685555,1.144192,-0.519157,-2.760601,1.015512,0.0,0.0,0.0,0.0,0.0,1.0,11.393225,-2
1,240827,0.232453,13.800773,0,14,0.413645,10.453156,1.320249,-0.158522,-3.013258,0.827289,0.0,0.0,0.0,0.0,0.0,1.0,12.742782,0
2,155659,-0.144099,14.76103,0,14,0.436928,13.405199,1.101142,0.768894,2.233106,0.873856,0.0,0.0,0.0,0.0,0.0,0.0,15.236505,2
3,239963,-0.171668,13.738019,0,22,0.249481,27.819496,0.493827,-1.188718,-2.042771,0.498961,0.0,0.0,0.0,0.0,0.0,0.0,17.27989,-8
4,90504,0.235188,15.896359,0,10,0.389329,15.330971,1.036879,0.722828,-3.056138,0.778658,0.0,0.0,0.0,0.0,0.0,0.0,16.749298,4


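Our targets are stored in the `signature` column, which takes only a small number of distinct integer values (13 in this sample, according to the training log). Note that `describe()` reports its summary statistics as `float64` even for an integer column; AutoGluon will still recognize the labels as discrete classes.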


In [21]:
label = 'signature'
train_data[label].describe()

count    10000.000000
mean        -0.022000
std          3.025166
min        -12.000000
25%         -2.000000
50%          0.000000
75%          2.000000
max         12.000000
Name: signature, dtype: float64
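
Since `describe()` summarizes the column as floats, it can help to count the distinct label values directly. The sketch below uses ordinary pandas methods on a small stand-in `DataFrame`; the rows are made up for illustration and are not taken from `train.csv`. The same calls should work on `train_data`, given that `TabularDataset` subclasses `DataFrame`.

```python
import pandas as pd

label = "signature"

# Made-up stand-in rows; the real columns come from train.csv.
toy = pd.DataFrame({
    label: [-2, 0, 2, 0, -2, 0],
    "volume": [11.4, 12.7, 15.2, 17.3, 16.7, 12.1],
})

n_unique = toy[label].nunique()      # number of distinct label values
counts = toy[label].value_counts()   # examples per label value, most frequent first
label_dtype = toy[label].dtype       # dtype of the column itself, not of describe()

print(n_unique, label_dtype)
print(counts)
```

On the real data, replacing `toy` with `train_data` would show how many examples each signature value has, which is useful for spotting rare classes.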

Now construct a `TabularPredictor` instance by specifying the label column name, then train on the dataset with the {func}`autogluon.tabular.TabularPredictor.fit` method. We don't need to specify any other hyperparameters. This method will recognize the problem as a multi-class classification task, perform automatic feature engineering, train multiple models, and then ensemble them to form the final predictions.



In [22]:
predictor = TabularPredictor(label=label).fit(train_data)

No path specified. Models will be saved in: "AutogluonModels/ag-20220709_051040/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20220709_051040/"
AutoGluon Version:  0.5.0
Python Version:     3.9.12
Operating System:   Linux
Train Data Rows:    10000
Train Data Columns: 18
Label Column: signature
Preprocessing data ...
AutoGluon infers your prediction problem is: 'multiclass' (because dtype of label-column == int, but few unique label-values observed).
	First 10 (of 13) unique label values:  [-2, 0, 2, -8, 4, -4, -6, 8, 6, 10]
	If 'multiclass' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Fraction of data from classes with at least 10 examples that will be kept for training models: 0.9984
Train Data Class Count: 9
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Ava
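
The log above reports that only classes with at least 10 examples are kept for training, which is why 13 unique label values shrink to 9 classes. A minimal sketch of that kind of filtering follows. It is an illustration in plain Python, not AutoGluon's implementation; the helper name `keep_frequent_classes` and the toy labels are hypothetical, and only the threshold of 10 comes from the log.

```python
from collections import Counter

MIN_EXAMPLES_PER_CLASS = 10  # threshold quoted in the training log


def keep_frequent_classes(labels, min_count=MIN_EXAMPLES_PER_CLASS):
    """Drop examples whose class has fewer than `min_count` occurrences.

    Returns the kept labels, the sorted list of kept classes,
    and the fraction of examples that survive the filter.
    """
    counts = Counter(labels)
    kept_classes = sorted(c for c, n in counts.items() if n >= min_count)
    kept = [y for y in labels if counts[y] >= min_count]
    return kept, kept_classes, len(kept) / len(labels)


# Made-up labels: three frequent classes and one rare class (12).
toy_labels = [0] * 50 + [2] * 30 + [-2] * 18 + [12] * 2
kept, kept_classes, fraction = keep_frequent_classes(toy_labels)
print(kept_classes, fraction)
```

In this toy case the rare class `12` is dropped and 98 of 100 examples remain, mirroring the "fraction of data ... kept" line in the log.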

Training takes a few minutes, depending on your CPU speed. AutoGluon explores several models, including decision trees and neural networks. Training is often fast because AutoGluon does not use deep neural networks in the default setting.

```{hint}
You can stop `fit` earlier to make training faster by specifying the `time_limit` argument of the `fit` method. For example, `fit(..., time_limit=60, ...)` trains for at most 1 minute. Note that too small a value will hurt model quality.
```

Once training is done, load separate test data to predict.

In [26]:
test_data = TabularDataset(url+'test.csv')
# Optional: drop the label column as a safety check, so predictions cannot depend on it.
y_pred = predictor.predict(test_data.drop(columns=[label]))
y_pred.head()

Loaded data from: https://raw.githubusercontent.com/mli/ag-docs/main/knot_theory/test.csv | Columns = 19 / 19 | Rows = 5000 -> 5000


0   -4
1   -2
2    0
3    4
4    2
Name: signature, dtype: int64

If you just want to evaluate the model performance, you can call the {func}`autogluon.tabular.TabularPredictor.evaluate` method.

In [27]:
predictor.evaluate(test_data, silent=True)

{'accuracy': 0.95,
 'balanced_accuracy': 0.7619277504882699,
 'mcc': 0.9387411901257484}
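
The gap between `accuracy` (0.95) and `balanced_accuracy` (about 0.76) suggests that some rare classes are predicted less reliably than the common ones. The sketch below illustrates the difference on made-up labels, assuming the usual definitions: accuracy is the fraction of correct predictions, and balanced accuracy is the mean of the per-class recalls. The function names are hypothetical helpers, not AutoGluon APIs.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of examples whose prediction matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


# Made-up example: class 0 is common, class 2 is rare and half misclassified.
y_true = [0] * 8 + [2] * 2
y_pred = [0] * 8 + [0, 2]

acc = accuracy(y_true, y_pred)            # 9 of 10 correct
bal = balanced_accuracy(y_true, y_pred)   # mean of recalls 1.0 and 0.5
print(acc, bal)
```

A single mistake on the rare class barely moves accuracy but halves that class's recall, which is why balanced accuracy drops much further.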

We have now done a quick walkthrough of using AutoGluon for tabular prediction. We used two classes: {class}`autogluon.tabular.TabularDataset` (essentially a pandas DataFrame) to load data, and {class}`autogluon.tabular.TabularPredictor` to train (via the `fit` method) and predict (via the `predict` method). You will see similar APIs for other tasks, namely a `Dataset` class to load data and a `Predictor` class to train and predict.


In addition, AutoGluon simplifies model training: you do not need to engineer features or specify model hyperparameters, because AutoGluon performs these jobs automatically when running `fit`. You may worry that this results in longer training time, but AutoGluon balances computational cost against model quality. You can benchmark AutoGluon's performance on the full dataset against your favorite machine learning model. To be fair, though, you also need to count the time you spend preprocessing data and tuning your models.

```{seealso}
To learn more about AutoGluon, you can next:

- read the cheatsheet for a quick overview of the APIs
- follow the tutorials to customize training and inference
- learn how AutoGluon performs feature engineering and model ensembling
```