# Kaggle Competition 

This tutorial walks you through using AutoGluon to participate in a [Kaggle competition](https://www.kaggle.com/). Let's start with the most basic workflow: download the data, fit a model, and submit the results. We will use the [IEEE-CIS Fraud Detection](https://www.kaggle.com/c/ieee-fraud-detection/) competition as the example.

First, join the competition so that you can download its dataset. You can either click the download button on the competition website or use the Kaggle [API](https://www.kaggle.com/docs/api). For the latter, once you have installed `kaggle` and configured your credentials, you can download the data with:

In [11]:
!kaggle competitions download -c ieee-fraud-detection 
!unzip -o ieee-fraud-detection.zip

ieee-fraud-detection.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  ieee-fraud-detection.zip
  inflating: sample_submission.csv   
  inflating: test_identity.csv       
  inflating: test_transaction.csv    
  inflating: train_identity.csv      
  inflating: train_transaction.csv   
sample_submission.csv  test_transaction.csv  train_transaction.csv
test_identity.csv      train_identity.csv


Then let's load the data. Since the training data for this competition is split across multiple CSV files, we need to join them into a single large table.

In [13]:
import pandas as pd
import numpy as np
from autogluon.tabular import TabularPredictor

label = 'isFraud'  # name of target variable to predict.
eval_metric = 'roc_auc'  # Optional: the competition evaluation metric is AUC

train_identity = pd.read_csv('train_identity.csv')
train_transaction = pd.read_csv('train_transaction.csv')
train_data = pd.merge(train_transaction, train_identity, on='TransactionID', how='left')
train_data.shape

(590540, 434)

Note that a left join on the `TransactionID` key is appropriate for this dataset. Other datasets that involve multiple tables will likely need a different join strategy, and working it out can be time-consuming. Unfortunately, AutoGluon cannot do this for you automatically yet.
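
As a minimal sketch of how a join like this can be checked, the toy example below uses made-up values and assumes that each `TransactionID` appears at most once in each table. The `validate` and `indicator` arguments of `pd.merge` make that assumption explicit and show how many rows found no match.

```python
import pandas as pd

# Toy stand-ins for train_transaction.csv and train_identity.csv (values are made up).
transactions = pd.DataFrame({
    'TransactionID': [1, 2, 3],
    'TransactionAmt': [50.0, 20.0, 75.5],
})
identity = pd.DataFrame({
    'TransactionID': [1, 3],
    'DeviceType': ['mobile', 'desktop'],
})

# validate='one_to_one' raises MergeError if TransactionID is duplicated on either side;
# indicator=True adds a '_merge' column showing where each row came from.
merged = pd.merge(transactions, identity, on='TransactionID', how='left',
                  validate='one_to_one', indicator=True)

# A left join keeps every transaction, so the row count must not change.
assert len(merged) == len(transactions)

# Fraction of transactions that have no identity record.
missing_identity = (merged['_merge'] == 'left_only').mean()
print(merged)
print(f'{missing_identity:.0%} of rows have no identity record')
```

If the key turns out not to be unique, `validate` fails immediately instead of silently duplicating rows, which is usually the first sign that a different join strategy is needed.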

Now we train our model with the `best_quality` preset. For demonstration purposes, we limit the training time to 5 minutes (`time_limit=300`). You should increase it, e.g. to 1 hour, to get high-quality predictions.

In [16]:
predictor = TabularPredictor(label=label, eval_metric=eval_metric).fit(
    train_data, presets='best_quality', time_limit=300)

No path specified. Models will be saved in: "AutogluonModels/ag-20220714_043553/"
Presets specified: ['best_quality']
Beginning AutoGluon training ... Time limit = 300s
AutoGluon will save models to "AutogluonModels/ag-20220714_043553/"
AutoGluon Version:  0.5.0
Python Version:     3.9.12
Operating System:   Linux
Train Data Rows:    590540
Train Data Columns: 433
Label Column: isFraud
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [0, 1]
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 = 1, class 0 = 0
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    230786.61 MB
	Train Data (Original)  Memory Usage: 2715.97 M

We load and join the test data in the same way as the training data. However, the columns whose names start with `id_` in the training data (e.g. `id_01`) start with `id-` in the test data (e.g. `id-01`). We rename them to match the training data.

In [None]:
test_identity = pd.read_csv('test_identity.csv')
test_transaction = pd.read_csv('test_transaction.csv')
test_data = pd.merge(test_transaction, test_identity, on='TransactionID', how='left')
# Map test-set names such as 'id-01' to the training-set form 'id_01'.
rename = {c: c.replace('-', '_') for c in test_data.columns if c.startswith('id')}
test_data.rename(columns=rename, inplace=True)
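
After renaming, it can be worth confirming that the test columns really match the training features. The sketch below does this on toy frames with made-up values; `train_toy` and `test_toy` are illustrative names, not part of the tutorial's data.

```python
import pandas as pd

# Toy frames with made-up values; only the column names matter here.
train_toy = pd.DataFrame({'TransactionID': [1], 'id_01': [0.0], 'id_02': [1.0], 'isFraud': [0]})
test_toy = pd.DataFrame({'TransactionID': [2], 'id-01': [0.0], 'id-02': [1.0]})

# Same renaming rule as above: replace '-' with '_' in columns that start with 'id'.
rename = {c: c.replace('-', '_') for c in test_toy.columns if c.startswith('id')}
test_toy = test_toy.rename(columns=rename)

# Every training feature (all columns except the label) should now exist in the test data.
expected = set(train_toy.columns) - {'isFraud'}
missing = expected - set(test_toy.columns)
assert not missing, f'columns missing from test data: {sorted(missing)}'
```

The same check could be run on the real `train_data` and `test_data` before calling the predictor.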

Now let's predict. Since this competition requires us to submit the predicted probabilities for the positive class, we use `predict_proba` to obtain them.

In [45]:
y_pred = predictor.predict_proba(test_data, as_multiclass=False)
y_pred.head(5)

0    0.002866
1    0.002976
2    0.005579
3    0.002728
4    0.003901
Name: isFraud, dtype: float32

Note that we set `as_multiclass=False` so that, for this binary classification task, only the positive-class probabilities are returned. By default, `predict_proba` returns a table in which each column stores the probabilities for one class label.
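
To illustrate the relationship between the two output forms, the sketch below builds a stand-in for the default table by hand with made-up probabilities. It assumes the table's columns are the class labels `0` and `1`; it does not call AutoGluon.

```python
import pandas as pd

# Made-up probabilities standing in for the default, one-column-per-class output.
proba_table = pd.DataFrame({0: [0.997, 0.40], 1: [0.003, 0.60]})

# Each row holds the probabilities of both classes, so it should sum to 1.
assert (proba_table.sum(axis=1) - 1).abs().max() < 1e-9

# Selecting the positive-class column (isFraud == 1) gives a single series,
# which is the shape of result the submission needs.
positive_class = 1
y_pred_toy = proba_table[positive_class]
print(y_pred_toy)
```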

Finally, we prepare the submission.

In [48]:
submission = pd.read_csv('sample_submission.csv')
submission['isFraud'] = y_pred
submission.to_csv('my_submission.csv', index=False)
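
Before uploading, a few sanity checks can catch a malformed file early. The following is a sketch on toy data with made-up values; `check_submission` is a hypothetical helper, not an AutoGluon or Kaggle function.

```python
import pandas as pd

# Toy stand-ins (made-up values) for sample_submission.csv and the predicted probabilities.
submission_toy = pd.DataFrame({'TransactionID': [101, 102, 103], 'isFraud': [0.5, 0.5, 0.5]})
y_pred_toy = pd.Series([0.01, 0.20, 0.93], name='isFraud')


def check_submission(submission, y_pred):
    """Hypothetical helper: raise AssertionError if predictions do not fit the template."""
    assert len(y_pred) == len(submission), 'row count differs from the template'
    assert y_pred.notna().all(), 'predictions contain missing values'
    assert y_pred.between(0, 1).all(), 'probabilities must lie in [0, 1]'


check_submission(submission_toy, y_pred_toy)

# Assign by position so that a mismatched index cannot introduce NaN values.
submission_toy['isFraud'] = y_pred_toy.to_numpy()
print(submission_toy)
```

Assigning a pandas Series to a column aligns on the index, so converting to a NumPy array first is a defensive choice for cases where the test data and the template do not share the same index.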

You can submit through the competition page or with the following command:

In [None]:
!kaggle competitions submit -c ieee-fraud-detection -f my_submission.csv -m "my first submission"

We have now gone through how to use AutoGluon to participate in a Kaggle competition. The `best_quality` preset will often give you reasonable results, but it is unlikely to win the competition on its own. To improve your results, you can:

1. tune hyperparameters, refer to {doc}`./model_hyperparameters`
1. do more feature engineering (TODO, ref)
1. add custom models, refer to {doc}`./custom_model`

```{seealso}
Check Kaggle kernels that use AutoGluon (TODO, links). 

Please add your AutoGluon-based Kaggle kernels here.
```