## Titanic

`Titanic` is a famous playground competition hosted by Kaggle ([here](https://www.kaggle.com/c/titanic)), so I'll simply copy-paste its brief description here:

> This is the legendary Titanic ML competition – the best, first challenge for you to dive into ML competitions and familiarize yourself with how the Kaggle platform works.
> 
> The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.

Here are the first few rows of the `train.csv` of `Titanic`:

```csv
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
```

And the first few rows of the `test.csv`:

```csv
PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S
894,2,"Myles, Mr. Thomas Francis",male,62,0,0,240276,9.6875,,Q
```

What we need to do is to predict the `Survived` column in `test.csv`.
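To make the difference between the two files concrete, here is a minimal sketch using only Python's standard `csv` module. The two header strings are copied from the samples above; `read_header` is a hypothetical helper introduced just for this illustration.

```python
import csv
import io

# Header lines copied from the train.csv / test.csv samples above.
TRAIN_HEADER = "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked"
TEST_HEADER = "PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked"


def read_header(text):
    """Parse the first CSV row of `text` into a list of column names."""
    return next(csv.reader(io.StringIO(text)))


train_columns = read_header(TRAIN_HEADER)
test_columns = read_header(TEST_HEADER)

# Columns that appear in train.csv but are absent from test.csv.
missing_in_test = [c for c in train_columns if c not in test_columns]
print(missing_in_test)
```

The only column missing from `test.csv` is the one we are asked to predict.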

In [1]:
# preparations

import torch
import cflearn
import numpy as np
from cflearn.toolkit import seed_everything

# for reproduction
seed_everything(123)

123
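For readers curious what a seeding helper generally does, here is a rough sketch of the common pattern: seed every random number generator in use and return the seed. This is an assumption about the general idea, not `carefree-learn`'s actual implementation; `seed_all` is a hypothetical name, and the `numpy` / `torch` calls are skipped if those packages are not installed.

```python
import random


def seed_all(seed):
    """Seed the common RNGs and return the seed (hypothetical helper)."""
    random.seed(seed)
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass  # numpy is optional in this sketch
    try:
        import torch

        torch.manual_seed(seed)
    except ImportError:
        pass  # torch is optional in this sketch
    return seed


# Seeding twice with the same value reproduces the same random draws.
seed_all(123)
first = [random.random() for _ in range(3)]
seed_all(123)
second = [random.random() for _ in range(3)]
print(first == second)  # True
```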

### Pre-Process Data

Since the target column is not the last column (which is the default setting of `carefree-learn`), we need to manually configure it:

In [2]:
processor_config = cflearn.MLBundledProcessorConfig(label_names=["Survived"])

And you're all set! Notice that only `label_names` needs to be provided, and `carefree-learn` will find the corresponding target column for you😉
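Conceptually, locating a target column by name amounts to looking it up in the header and separating it from the remaining columns. The sketch below illustrates that idea with the standard `csv` module on the three sample rows shown earlier; `split_label` is a hypothetical helper, and how `carefree-learn` does this internally is not shown here.

```python
import csv
import io

# The first rows of train.csv, copied from the sample above.
SAMPLE = "\n".join(
    [
        "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked",
        '1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S',
        '2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C',
        '3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S',
    ]
)


def split_label(text, label_name):
    """Separate the column called `label_name` from all other columns."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    idx = header.index(label_name)  # raises ValueError if the name is absent
    feature_names = header[:idx] + header[idx + 1 :]
    features = [row[:idx] + row[idx + 1 :] for row in body]
    labels = [row[idx] for row in body]
    return feature_names, features, labels


feature_names, features, labels = split_label(SAMPLE, "Survived")
print(labels)
print(feature_names)
```

Using `csv.reader` rather than a plain `split(",")` matters here because the `Name` column contains commas inside quotes.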

> - Notice that we can directly pass in a file and `carefree-learn` will handle everything for you (*file-in*).
>
> - `carefree-learn` also holds out part of the data for validation automatically; in the run below, the log reports `100` validation samples.
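A random hold-out split of this kind can be sketched in a few lines. The sizes below come from the training log further down, which reports 791 training and 100 validation samples; `holdout_split` is a hypothetical helper, and the exact splitting strategy used by `carefree-learn` may differ.

```python
import random


def holdout_split(num_samples, num_valid, seed=123):
    """Randomly pick `num_valid` indices for validation; the rest are for training."""
    indices = list(range(num_samples))
    rng = random.Random(seed)  # local RNG, so the global state is untouched
    rng.shuffle(indices)
    valid = sorted(indices[:num_valid])
    train = sorted(indices[num_valid:])
    return train, valid


train_idx, valid_idx = holdout_split(num_samples=891, num_valid=100)
print(len(train_idx), len(valid_idx))  # 791 100
```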

In [3]:
data = cflearn.MLData.init(processor_config=processor_config).fit("train.csv")



As you can see, `carefree-learn` can do some auto data preprocessing: it detects three columns that might be redundant!

### Build Your Model

For instance, we'll use the famous `Wide & Deep` model. First, we need to define the `config`:

In [4]:
config = cflearn.MLConfig(
    module_name="wnd",
    module_config=dict(input_dim=data.num_features, output_dim=1),
    loss_name="bce",
    metric_names=["acc", "auc"],
    # use nesterov SGD optimizer
    lr=0.1,
    optimizer_name="sgd",
    optimizer_config=dict(nesterov=True, momentum=0.9),
    # set embedding dim to 8
    global_encoder_settings=cflearn.MLGlobalEncoderSettings(embedding_dim=8),
)

Notice that we used `data.num_features`, which will tell the model what the (original) number of features is.
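To give a feel for what `nesterov=True` with `momentum=0.9` and `lr=0.1` means, here is a pure-Python sketch of the Nesterov momentum update in the form documented for `torch.optim.SGD`. Whether `optimizer_name="sgd"` maps exactly onto that optimizer is an assumption, and the quadratic toy objective is purely illustrative.

```python
def nesterov_sgd_step(param, grad, buf, lr=0.1, momentum=0.9):
    """One Nesterov-momentum SGD step on a scalar parameter.

    Mirrors the update documented for torch.optim.SGD (no dampening,
    no weight decay):
        buf   <- momentum * buf + grad
        step  <- grad + momentum * buf
        param <- param - lr * step
    """
    buf = momentum * buf + grad
    step = grad + momentum * buf
    return param - lr * step, buf


# Toy objective: f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, buf = 0.0, 0.0
for _ in range(200):
    grad = 2.0 * (w - 3.0)
    w, buf = nesterov_sgd_step(w, grad, buf)

print(round(w, 6))  # 3.0
```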

With this `config`, building the model takes just one line of code:

In [5]:
m = cflearn.api.fit_ml(data, config=config)

                                    Internal Default Configurations Used by `carefree-learn`                                    
--------------------------------------------------------------------------------------------------------------------------------
                                                   train_samples   |   791
                                                   valid_samples   |   100
                                               max_snapshot_file   |   25
                                          encoder_settings.1.dim   |   3
                                      encoder_settings.1.methods   |   embedding
                               encoder_settings.1.method_configs   |   None
                                          encoder_settings.3.dim   |   2
                                      encoder_settings.3.methods   |   embedding
                               encoder_settings.3.method_configs   |   None
                                          encoder_settings

| epoch  51  [1 / 7] [1.583s] | acc : 0.820000 | auc : 0.855475 | score : 0.837737 |
>  [ info ] rolling back to the best checkpoint
>  [ info ] restoring from _logs/2023-12-03_12-07-25-161446/checkpoints/model_351.pt
| epoch  -1  [-1 / 7] [0.391s] | acc : 0.820000 | auc : 0.855475 | score : 0.837737 |


### Evaluate Your Model

After building the model, we can directly build a `loader` from a `file` to evaluate our model (*file-out*):

In [6]:
loader = m.data.build_loader("train.csv")
m.evaluate(loader)

MetricsOutputs(final_score=0.8588882304010288, metric_values={'acc': 0.8383838383838383, 'auc': 0.8793926224182194}, is_positive={'acc': True, 'auc': True})

Our model achieved an accuracy of `0.83838`, not bad!

> Note that this is not exactly the *training* performance: `carefree-learn` automatically holds out a validation set, so `train.csv` contains both the samples used for training and the held-out ones.
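For reference, the two metrics can be sketched in plain Python. The toy labels and scores below are made up for illustration, and `accuracy` / `auc` are hypothetical helpers rather than the library's implementations. The last lines check one observation about the output above: in this run, `final_score` equals the mean of the two reported metric values (whether that holds for every configuration is an assumption).

```python
def accuracy(labels, predicted_classes):
    """Fraction of samples whose predicted class equals the label."""
    correct = sum(int(y == p) for y, p in zip(labels, predicted_classes))
    return correct / len(labels)


def auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the number of samples,
    which is fine for a small illustration.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up toy data, only to exercise the two functions.
toy_labels = [0, 1, 1, 0, 1]
toy_scores = [0.1, 0.8, 0.4, 0.5, 0.9]
toy_classes = [int(s >= 0.5) for s in toy_scores]
print(accuracy(toy_labels, toy_classes))  # 0.6
print(auc(toy_labels, toy_scores))

# Values copied from the MetricsOutputs shown above.
reported_acc = 0.8383838383838383
reported_auc = 0.8793926224182194
mean_score = (reported_acc + reported_auc) / 2
print(mean_score)
```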

### Making Predictions

Again, we can directly build a `loader` from a `file` to make predictions:

In [7]:
loader = m.data.build_loader("test.csv")
predictions = m.predict(loader)[cflearn.PREDICTIONS_KEY]

>  [ info ] labels are not detected and `for_inference` is set to True, so `contain_labels` will be set to False


Notice that we detected that the `test.csv` does not contain labels, and handled it correctly!

Apart from making raw predictions, we can also ask `carefree-learn` to return probabilities or classes:

In [8]:
probabilities = m.predict(loader, return_probabilities=True)[cflearn.PREDICTIONS_KEY]
classes = m.predict(loader, return_classes=True)[cflearn.PREDICTIONS_KEY]
print(probabilities[:3])
print(classes[:3])

[[0.9015375  0.09846253]
 [0.7232104  0.2767896 ]
 [0.9253068  0.07469322]]
[[0]
 [0]
 [0]]
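The relationship between the two outputs can be reproduced by hand: taking the index of the largest probability in each row gives the class. The sketch below assumes that this argmax is what `return_classes=True` amounts to; the probabilities are copied from the output above.

```python
# Probabilities copied from the printed output above.
probabilities = [
    [0.9015375, 0.09846253],
    [0.7232104, 0.2767896],
    [0.9253068, 0.07469322],
]


def argmax(row):
    """Index of the largest value in `row`."""
    return max(range(len(row)), key=row.__getitem__)


# Keep the (n, 1) shape of the printed `classes` array.
classes_by_hand = [[argmax(row)] for row in probabilities]
print(classes_by_hand)  # [[0], [0], [0]]
```

With two classes, this is the same as checking whether the second column (the probability of `Survived = 1`) exceeds `0.5`.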


### Submit Your Results

If you've made it this far, the `Titanic` task is essentially done! All that's left is to convert the predicted `classes` into a submission file:

In [9]:
with open("test.csv", "r") as f:
    f.readline()
    id_list = [line.strip().split(",")[0] for line in f]
with open("submission.csv", "w") as f:
    f.write("PassengerId,Survived\n")
    for test_id, c in zip(id_list, classes.ravel()):
        f.write(f"{test_id},{c.item()}\n")

After running this code, a `submission.csv` file will be generated, and you can submit it to Kaggle directly. In my personal experience, it could achieve a score of 0.77751.
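The cell above splits each line on commas, which works because `PassengerId` is the first column. As an alternative sketch, the standard `csv` module can read the ids by column name and write the submission, which stays correct even if the column order changes. `build_submission` is a hypothetical helper, and the sample uses the three `test.csv` rows shown at the top with their predicted classes from the previous section.

```python
import csv
import io

# The first rows of test.csv, copied from the sample near the top.
TEST_SAMPLE = "\n".join(
    [
        "PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked",
        '892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q',
        '893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S',
        '894,2,"Myles, Mr. Thomas Francis",male,62,0,0,240276,9.6875,,Q',
    ]
)


def build_submission(test_csv_text, predicted_classes):
    """Return the submission file content as a string."""
    reader = csv.DictReader(io.StringIO(test_csv_text))
    ids = [row["PassengerId"] for row in reader]
    if len(ids) != len(predicted_classes):
        raise ValueError("number of predictions does not match number of rows")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["PassengerId", "Survived"])
    for passenger_id, c in zip(ids, predicted_classes):
        writer.writerow([passenger_id, int(c)])
    return out.getvalue()


submission = build_submission(TEST_SAMPLE, [0, 0, 0])
print(submission)
```

The length check guards against silently truncated submissions, which `zip` alone would not report.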

### Conclusions

Since `Titanic` is just a small toy dataset, using a neural network to solve it might actually be overkill (or might simply overfit), and that's why we decided to conclude here instead of introducing fancier techniques (e.g. ensembles, AutoML, etc.). We hope that this small example helps you quickly walk through some basic concepts in `carefree-learn`, and helps you leverage `carefree-learn` in your own tasks!