# Kaggle Competition with Titanic Dataset using ML.NET AutoML

This notebook shows how to participate in the [well-known Titanic competition](https://www.kaggle.com/c/titanic) on Kaggle using ML.NET. In this notebook, you will learn how to:
- Create a pipeline for the Titanic dataset
- Use the AutoML API to run hyper-parameter optimization on that pipeline and get the best model
- Predict using the best model and save the prediction results to a CSV file for submission

## Install NuGet packages for training ML.NET models and plotting

In [None]:
// using nightly-build
#i "nuget:https://pkgs.dev.azure.com/dnceng/public/_packaging/MachineLearning/nuget/v3/index.json"
#r "nuget: Plotly.NET.Interactive, 3.0.2"
#r "nuget: Plotly.NET.CSharp, 0.0.1"
#r "nuget: Microsoft.ML.AutoML, 0.20.0-preview.22356.1"
#r "nuget: Microsoft.Data.Analysis, 0.20.0-preview.22356.1"

## Import packages

In [None]:
// Import usings.
using Microsoft.Data.Analysis;
using System;
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;

## Load Dataset
The dataset comes from [Kaggle Titanic](https://www.kaggle.com/competitions/titanic/data) and is split into __train.csv__ and __test.csv__. __train.csv__ includes feature columns such as `Sex` and `Pclass`, along with the ground-truth label used to train a model, while __test.csv__ contains only feature columns.

In the following section, we load __train.csv__ with the `DataFrame` API, split it into training and validation sets (holding out 10% of the rows for validation), and preview its first 10 rows.

In [None]:
var context = new MLContext();

// Load the training data
var trainDataPath = Path.Combine(Directory.GetCurrentDirectory(), "data", "titanic", "train.csv");
var df = DataFrame.LoadCsv(trainDataPath);

// Hold out 10% of the rows for validation
var trainTestSplit = context.Data.TrainTestSplit(df, 0.1);
df.Head(10)

index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,<null>,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14,1,0,237736,30.0708,,C
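
As a rough illustration of what a 10% hold-out split means, the sketch below shuffles row indices and reserves a tenth of them for validation. It is not how ML.NET implements `TrainTestSplit` (which samples rows, so its split sizes are only approximately 90/10). The seed and variable names are assumptions made for this example, and 891 is the row count of the Kaggle __train.csv__.

```csharp
using System;
using System.Linq;

// Stand-in for the rows of train.csv (the Kaggle file has 891 rows).
int rowCount = 891;
double testFraction = 0.1;

// Shuffle row indices with a fixed (arbitrary) seed so the split is repeatable.
var rng = new Random(42);
int[] shuffled = Enumerable.Range(0, rowCount).OrderBy(_ => rng.Next()).ToArray();

// Reserve the first 10% of the shuffled indices for validation, the rest for training.
int testCount = (int)(rowCount * testFraction);
int[] validationRows = shuffled.Take(testCount).ToArray();
int[] trainRows = shuffled.Skip(testCount).ToArray();

Console.WriteLine($"train: {trainRows.Length}, validation: {validationRows.Length}");
```

With these numbers the sketch keeps 802 rows for training and 89 for validation.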


### Construct training pipeline
The following code shows how to construct a sweepable pipeline with the default binary classifiers and a search space for hyper-parameter optimization. The sweepable pipeline starts with a featurizer, which transforms the input columns into a single feature column, and then feeds that column into the binary classifiers.

In [None]:
var pipeline = context.Auto().Featurizer(df, excludeColumns: new[] { "Survived" })
    .Append(context.Transforms.Conversion.ConvertType("Survived", "Survived", DataKind.Boolean))
    .Append(context.Auto().BinaryClassification(labelColumnName: "Survived", useFastForest: false, useSdca: false, useLbfgs: false));
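
To give an intuition for "transforming columns into a single feature", here is a conceptual sketch in plain C#: numeric columns pass through, a categorical column is one-hot encoded, and everything is concatenated into one vector. This is only an illustration. The transforms that `Featurizer` actually selects are not shown in this notebook, and the helper names and the category list are hypothetical.

```csharp
using System;
using System.Linq;

// Hypothetical category list for the Sex column.
string[] sexCategories = { "male", "female" };

// One-hot encode a categorical value: 1 at the matching category, 0 elsewhere.
float[] OneHot(string value, string[] categories) =>
    categories.Select(c => c == value ? 1f : 0f).ToArray();

// Concatenate numeric columns and the encoded categorical column into one vector.
float[] Featurize(float pclass, string sex, float fare) =>
    new[] { pclass, fare }.Concat(OneHot(sex, sexCategories)).ToArray();

// First row of the train.csv preview: Pclass = 3, Sex = male, Fare = 7.25.
float[] features = Featurize(3f, "male", 7.25f);
Console.WriteLine($"feature vector length: {features.Length}");
```

The resulting vector has four slots: `Pclass`, `Fare`, and one indicator per `Sex` category.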

### Run Hyper-parameter optimization using AutoMLExperiment
The following code shows how to use `AutoMLExperiment` to sweep over the sweepable pipeline created above and search for the best parameters. Within the 60-second training time budget, it trains models on `trainTestSplit.TrainSet` and evaluates them on `trainTestSplit.TestSet` using the `Accuracy` metric. While training runs, `NotebookMonitor` organizes the experiment output and plots it in the output panel.

In [None]:
// Configure AutoML
var monitor = new NotebookMonitor();

var experiment = context.Auto().CreateExperiment()
                    .SetPipeline(pipeline)
                    .SetTrainingTimeInSeconds(60)
                    .SetDataset(trainTestSplit.TrainSet, trainTestSplit.TestSet)
                    .SetEvaluateMetric(BinaryClassificationMetric.Accuracy, "Survived", "PredictedLabel")
                    .SetMonitor(monitor);

// Configure Visualizer
monitor.SetUpdate(monitor.Display());

// Start Experiment
var res = await experiment.RunAsync();

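
Accuracy, the metric the experiment optimizes, is the fraction of rows whose predicted label matches the ground truth. A minimal sketch of that calculation, using made-up labels and predictions rather than output from the experiment:

```csharp
using System;
using System.Linq;

// Hypothetical ground-truth labels and predictions for five validation rows.
bool[] actual    = { false, true, true, true, false };
bool[] predicted = { false, true, false, true, false };

// Accuracy = number of correct predictions / total number of predictions.
int correct = actual.Zip(predicted, (a, p) => a == p).Count(isMatch => isMatch);
double accuracy = (double)correct / actual.Length;

Console.WriteLine($"correct: {correct} of {actual.Length}");
```

Here four of the five predictions match, so the accuracy is 0.8.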


## Predict results from __test.csv__ and write them to CSV for submission
The following code shows how to consume the best model from the previous `AutoMLExperiment`, generate predictions, and save them to a CSV file. The submission score should be around 78%, which beats over 90% of all submissions.

In [None]:
var bestModel = res.Model;
var testDataPath = Path.Combine(Directory.GetCurrentDirectory(), "data", "titanic", "test.csv");
var submissionCsvPath = Path.Combine(Directory.GetCurrentDirectory(), "data", "titanic", "submission.csv");

var testDf = DataFrame.LoadCsv(testDataPath, guessRows: 200);
var predictionResult = bestModel.Transform(testDf);
var survived = predictionResult.GetColumn<bool>("PredictedLabel");
var passengerId = predictionResult.GetColumn<float>("PassengerId");

var submissionDf = new DataFrame();
submissionDf["PassengerId"] = DataFrameColumn.Create("PassengerId", passengerId);
submissionDf["Survived"] = DataFrameColumn.Create("Survived", survived.Select(x => x ? 1 : 0));
DataFrame.WriteCsv(submissionDf, submissionCsvPath);

submissionDf.Head(10)

index,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,1
5,897,0
6,898,0
7,899,0
8,900,1
9,901,0
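
The submission file is a two-column CSV with a `PassengerId,Survived` header, where the boolean prediction is written as 1 or 0. The sketch below builds that layout by hand for the first three rows of the output above. It is a plain C# illustration of the format, not a replacement for `DataFrame.WriteCsv`, and the variable names are hypothetical.

```csharp
using System;
using System.Text;

// First three predictions from the submission preview.
var predictions = new (int PassengerId, bool Survived)[]
{
    (892, false),
    (893, true),
    (894, false),
};

// Build the CSV: a header line, then one "PassengerId,Survived" line per passenger,
// with the boolean prediction written as 1 or 0.
var csv = new StringBuilder();
csv.Append("PassengerId,Survived\n");
foreach (var (id, didSurvive) in predictions)
{
    csv.Append(id).Append(',').Append(didSurvive ? 1 : 0).Append('\n');
}

Console.Write(csv.ToString());
```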


## How can I improve the result?
If you are looking to improve the submission result, feature engineering should always be the first thing you try. The featurizer pipeline this notebook uses is automatically generated by [ML.NET Model Builder]() based on hard-coded rules, so there is plenty of room for improvement. Bring your domain knowledge and imagination when featurizing columns, and dig into the information behind them. For example, departure and arrival information is available from `Ticket`. `Name` also carries a lot of useful information: the last name might indicate social status, which proves to be strongly related to survival rate. There is no limit to feature engineering, so use your imagination and explore!
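
As one possible starting point, the sketch below derives three candidate features from `Name` and `Ticket` using plain string handling: the last name, the title, and the ticket prefix. The helper names are hypothetical, and whether these features actually improve the score is untested here. The sample rows come from the __train.csv__ preview.

```csharp
using System;
using System.Text.RegularExpressions;

// Title: the text between the comma and the first period, e.g. "Mr" in "Braund, Mr. Owen Harris".
string ExtractTitle(string name)
{
    var match = Regex.Match(name, @",\s*([^.]+)\.");
    return match.Success ? match.Groups[1].Value.Trim() : "Unknown";
}

// Ticket prefix: everything before the last space, or "None" when the ticket has no space.
string ExtractTicketPrefix(string ticket)
{
    int lastSpace = ticket.LastIndexOf(' ');
    return lastSpace < 0 ? "None" : ticket.Substring(0, lastSpace);
}

// Last name: the part of Name before the comma.
string ExtractLastName(string name) => name.Split(',')[0].Trim();

// Rows taken from the train.csv preview.
var samples = new (string Name, string Ticket)[]
{
    ("Braund, Mr. Owen Harris", "A/5 21171"),
    ("Cumings, Mrs. John Bradley (Florence Briggs Thayer)", "PC 17599"),
    ("Palsson, Master. Gosta Leonard", "349909"),
};

foreach (var (name, ticket) in samples)
{
    // Prints, for example: Braund | Mr | A/5
    Console.WriteLine($"{ExtractLastName(name)} | {ExtractTitle(name)} | {ExtractTicketPrefix(ticket)}");
}
```

Columns like these could be added to the `DataFrame` before calling the featurizer, so that the sweepable pipeline treats them as additional categorical inputs.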