

1. <a href="#1">Set up AutoGluon</a>
2. <a href="#2">Read the datasets</a>
3. <a href="#3">Train a Classifier with AutoGluon</a>
4. <a href="#4">Prediction and Evaluation</a>
5. <a href="#5">Write predictions</a>





## 1. <a name="1">Set up AutoGluon</a>
(<a href="#0">Go to top</a>)

Let's install AutoGluon. This may take some time, as pip also installs all of the libraries that AutoGluon depends on.

In [None]:
! pip install pip==21.3.1
! pip install setuptools==54.1.1
! pip install wheel==0.36.2
! pip install mxnet==1.7.0.post2
! pip install autogluon==0.1.0

## 2. <a name="2">Read the datasets</a>
(<a href="#0">Go to top</a>)

Let's read the training and test datasets into pandas DataFrames.

In [None]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")
  
training_data = pd.read_csv('../../data/titanic/train.csv')
test_data = pd.read_csv('../../data/titanic/test.csv')

print('The shape of the training dataset is:', training_data.shape)
print('The shape of the test dataset is:', test_data.shape)
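
Beyond the shapes, it can help to glance at missing values and the label balance before training. The following is a minimal sketch, not part of the original notebook: the helper name `summarize` is hypothetical, and the toy frame stands in for `training_data` with invented values.

```python
import pandas as pd


def summarize(df: pd.DataFrame, label: str) -> dict:
    """Collect a few basic facts about a labeled data frame (hypothetical helper)."""
    missing = df.isna().sum()
    return {
        "rows": len(df),
        # Keep only the columns that actually contain missing values
        "missing_per_column": missing[missing > 0].to_dict(),
        # Share of each label value, useful for spotting class imbalance
        "label_share": df[label].value_counts(normalize=True).to_dict(),
    }


# Toy stand-in for training_data; the values are invented for illustration only
toy = pd.DataFrame({
    "Age": [22.0, None, 35.0, 58.0],
    "Sex": ["male", "female", "female", "male"],
    "Survived": [0, 1, 1, 0],
})
summary = summarize(toy, label="Survived")
print(summary)
```

With the real data, the call would be `summarize(training_data, label='Survived')`.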

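__AutoGluon__ will handle the __validation data__ itself: when no separate validation set is passed, it holds out part of the training data automatically, so we do not need to create a validation split ourselves.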
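
To make the idea of a held-out validation set concrete, here is a minimal sketch of a random holdout split in pandas. It illustrates the concept only and is not AutoGluon's internal splitting logic; the helper name `holdout_split`, the 20% fraction, and the toy frame are assumptions made for this example.

```python
import pandas as pd


def holdout_split(df: pd.DataFrame, holdout_frac: float = 0.2, seed: int = 0):
    """Randomly set aside a fraction of rows for validation (hypothetical helper)."""
    val = df.sample(frac=holdout_frac, random_state=seed)
    # Everything that was not sampled stays in the training part
    train = df.drop(index=val.index)
    return train, val


# Toy data for illustration only
toy = pd.DataFrame({"x": range(10), "Survived": [0, 1] * 5})
train_part, val_part = holdout_split(toy)
print(len(train_part), len(val_part))
```

If you prefer to control the split yourself, `fit` also accepts a `tuning_data` argument for a validation set of your own.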

## 3. <a name="3">Train a Classifier with AutoGluon</a>
(<a href="#0">Go to top</a>)

We can run AutoGluon with a short snippet. To fit, we just call the __.fit()__ method. In this exercise we use DataFrame objects, but AutoGluon also accepts raw CSV files as input. To work directly with CSV files, you can follow the code snippet below.

```python
from autogluon.tabular import TabularDataset, TabularPredictor

# Pass the CSV path as the first argument; TabularDataset loads the file for us
train_data = TabularDataset('path_to_dataset/train.csv')
test_data = TabularDataset('path_to_dataset/test.csv')

predictor = TabularPredictor(label='label_column').fit(train_data)
test_predictions = predictor.predict(test_data)
```

We already have separate __data frames__ for the training and test data, so we work with them below. Setting `k` to the number of rows selects the whole training set; choose a smaller `k` to train on a subset for a quicker run.

In [None]:
from autogluon.tabular import TabularPredictor

# Number of training rows to use; the full row count selects the whole dataset
k = training_data.shape[0]

# Fit AutoGluon to predict the 'Survived' column
predictor = TabularPredictor(label='Survived').fit(training_data.head(k))

We can also summarize what happened during fit.

In [None]:
predictor.fit_summary()
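
A full fit can take a while on larger datasets. As a hedged sketch, the helper below caps training time with the `time_limit` argument of `fit` (in seconds) and sets the evaluation metric explicitly. The helper name `fit_with_budget` and the 60-second default are illustrative assumptions, not part of the original notebook. AutoGluon is imported inside the function so that defining the helper has no side effects.

```python
def fit_with_budget(train_df, label, time_limit=60, eval_metric="accuracy"):
    """Train a TabularPredictor under a time budget (hypothetical helper)."""
    # Imported lazily so that defining the helper does not require AutoGluon
    from autogluon.tabular import TabularPredictor

    predictor = TabularPredictor(label=label, eval_metric=eval_metric)
    # time_limit is given in seconds and bounds how long training may run
    return predictor.fit(train_df, time_limit=time_limit)
```

With the data frames from this notebook, the call would look like `fit_with_budget(training_data, label='Survived')`.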

## 4. <a name="4">Prediction and Evaluation</a>
(<a href="#0">Go to top</a>)

Next, we use the separate test data loaded earlier to demonstrate how to make predictions on new examples at inference time.

In [None]:
# Run predictions for the test dataset
test_predictions = predictor.predict(test_data)


We can see the performance of each individual trained model on the test data:
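
One way to get a per-model view is `predictor.leaderboard()`. The sketch below is a hedged example rather than the notebook's own cell. Without a data argument the leaderboard reports each model's validation score; passing a data frame adds scores on that data, which requires the label column to be present. A Kaggle-style `test.csv` may not include `Survived`, in which case the validation scores are the fallback. The helper name `model_scores` is hypothetical.

```python
def model_scores(predictor, labeled_data=None):
    """Return AutoGluon's per-model leaderboard (hypothetical helper).

    labeled_data, if given, must contain the label column so that
    each model can be scored on it.
    """
    if labeled_data is None:
        # No labeled data available: report validation scores only
        return predictor.leaderboard(silent=True)
    # Labeled data available: the leaderboard also scores each model on it
    return predictor.leaderboard(labeled_data, silent=True)
```

With the predictor trained above, `model_scores(predictor)` returns the leaderboard as a data frame.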

## 5. <a name="5">Write predictions</a>
(<a href="#0">Go to top</a>)

In [None]:
import pandas as pd

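# Build the submission frame: one row per test passenger with its predicted label
result_df = pd.DataFrame(columns=["PassengerId", "Survived"])
result_df["PassengerId"] = test_data["PassengerId"].tolist()
result_df["Survived"] = test_predictions

# Write the predictions to disk without the index column
result_df.to_csv("../../data/titanic/titanic_survived.csv", index=False)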

In [None]:
print('Double-check submission file against the gender_submission.csv')
sample_submission_df = pd.read_csv('../../data/titanic/gender_submission.csv')
id_mismatches = (sample_submission_df['PassengerId'] != result_df['PassengerId']).sum()
print('Differences between test data PassengerId and sample submission IDs:', id_mismatches)
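
The comparison above only counts mismatched IDs. As a sketch of a slightly broader sanity check, the helper below also compares column names, row counts, and the range of predicted values. The function name `check_submission` and the toy frames are assumptions made for illustration, and the check assumes binary 0/1 labels as in the Titanic task.

```python
import pandas as pd


def check_submission(result, sample, id_col="PassengerId", label_col="Survived"):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if list(result.columns) != list(sample.columns):
        problems.append("column names or order differ")
        return problems  # the remaining checks assume matching columns
    if len(result) != len(sample):
        problems.append("row counts differ")
    elif result[id_col].tolist() != sample[id_col].tolist():
        problems.append("ids differ")
    if result[label_col].isna().any():
        problems.append("missing predictions")
    if not result[label_col].dropna().isin([0, 1]).all():
        problems.append("predictions outside {0, 1}")
    return problems


# Toy frames for illustration only; the values are invented
sample = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})
good = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 0, 1]})
bad = pd.DataFrame({"PassengerId": [892, 894, 893], "Survived": [0, 2, 1]})

print(check_submission(good, sample))
print(check_submission(bad, sample))
```

With the frames from this notebook, the call would be `check_submission(result_df, sample_submission_df)`.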