# Titanic Using Daimensions

This notebook uses data from the Titanic competition on Kaggle (https://www.kaggle.com/c/titanic/overview).

Kaggle's description of the competition:
"The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered 'unsinkable' RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others. In this challenge, we ask you to build a predictive model that answers the question: 'what sorts of people were more likely to survive?' using passenger data (ie name, age, gender, socio-economic class, etc)."

Goal: Make a predictor of survival from Titanic training data. We'll do this by using Daimensions to measure, build, and validate a predictor.

## 1. Get Measurements

We want to measure our data before building a predictor so we know what kind of model will work best. Daimensions tells us about learnability, the generalization ratio, noise resilience, and all the standard accuracy and confusion figures.
For more information, you can read the Daimensions How-to Guide and Glossary.
Below are the first few rows of the training data:

In [16]:
! head train.csv

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,51.8625,E46,S
8,0,3,"Palsson, Master. Gosta Leonard",male,2,3,1,349909,21.075,,S
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27,0,2,347742,11.1333,,S


As the sample shows, the target column (Survived) isn't the last column. Because of this, we need to pass -target Survived so that Daimensions uses the correct column when measuring and building a predictor.

In [17]:
! ./btc_linux -measureonly train.csv -target Survived -server beta.brainome.ai

Brainome Daimensions(tm) 0.97 Copyright (c) 2019, 2020 by Brainome, Inc. All Rights Reserved.
Licensed to: Ariana Park
Expiration date: 2020-11-30 (138 days left)
Number of threads: 1
Maximum file size: 4GB
Connected to: https://beta.brainome.ai:8080

Data:
Number of instances: 891
Number of attributes: 11
Number of classes: 2
Class balance: 61.5% 38.38%

Learnability:
Best guess accuracy: 61.50%
Capacity progression (# of decision points): [7, 8, 9, 9, 10, 10]
Decision Tree: 422 parameters
Estimated Memory Equivalent Capacity for Neural Networks: 118 parameters

Risk that model needs to overfit for high accuracies...
using Decision Tree: 94.73%
using Neural Networks: 100.00%

Expected Generalization...
using Decision Tree: 2.11 bits/bit
using a Neural Network: 7.55 bits/bit

Recommendations:
Note: Maybe enough data to generalize. [yellow]

Time estimate for a Neural Network:
Estimated time to architect: 0d 0h 0m 1s
Estimated time to prime (subject to change after model architecting): 
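
The "Class balance" and "Best guess accuracy" lines in the output above can be sanity-checked by hand. The sketch below assumes that best guess accuracy is simply the share of the most common class (this matches the 61.5% reported for both figures). It uses only the nine sample rows shown earlier, so its numbers differ from the full-file figures. The function names are illustrative and not part of Daimensions.

```python
import csv
import io
from collections import Counter

# The nine sample rows shown earlier, reduced to the columns needed here.
SAMPLE = """PassengerId,Survived,Pclass
1,0,3
2,1,1
3,1,3
4,1,1
5,0,3
6,0,3
7,0,1
8,0,3
9,1,3
"""


def class_balance(csv_file, target):
    """Return {class_label: fraction_of_rows} for the target column."""
    reader = csv.DictReader(csv_file)
    counts = Counter(row[target] for row in reader)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def best_guess_accuracy(balance):
    """Accuracy of always predicting the most common class (an assumption
    about how the 'Best guess accuracy' figure is defined)."""
    return max(balance.values())


balance = class_balance(io.StringIO(SAMPLE), target="Survived")
print(balance)
print(f"Best guess accuracy on the sample: {best_guess_accuracy(balance):.2%}")
```

To run the same check on the full file, pass an open handle instead of the sample, for example `class_balance(open("train.csv", newline=""), "Survived")`.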

## 2. Build the predictor

Because the learnability of the data is yellow, the how-to guide recommends choosing the predictor with higher generalization and increasing the effort for best results. The neural network has the higher expected generalization (7.55 bits/bit versus 2.11 bits/bit), so a neural network with increased effort should work best. Here, I'm using '-f NN' to make the predictor a neural network. I'm also using '-o predict.py' to output the predictor as a Python file. To increase the effort, I'm using '-e 10' for 10 times the effort. Again, we have to use '-target Survived' because the target column isn't the last one.

In [20]:
! ./btc_linux -v -v -f NN train.csv -o predict.py -target Survived -e 10 --yes

Brainome Daimensions(tm) 0.96 Copyright (c) 2019, 2020 by Brainome, Inc. All Rights Reserved.
Licensed to: Ariana Park
Expiration date: 2020-11-30 (138 days left)
Number of threads: 1
Maximum file size: 4GB
Connected to Brainome cloud.

Running btc will overwrite existing predict.py. OK? [y/N] yes
Input: train.csv
Sampling...done.
Preprocessing...done.
Cleaning...done.
Splitting into training and validation...done.
Pre-training measurements...done.
Data:
Number of instances: 891
Number of attributes: 11
Number of classes: 2
Class balance: 61.5% 38.38%

Learnability:
Best guess accuracy: 61.50%
Capacity progression (# of decision points): [7, 8, 9, 9, 10, 10]
Quick Clustering: 422 parameters
Estimated Memory Equivalent Capacity for Neural Networks: 118 parameters

Risk that model needs to overfit for high accuracies...
using Quick clustering: 94.73%
using Neural Networks: 100.00%

Expected Generalization...
using Quick clustering: 2.11 bits/bit
using a Neural Network: 7.55 bits/bit

Rec

## 3. Validate and Make Predictions

We've built our first predictor! Now it's time to put it to use. If you have validation data (data that has the target column but wasn't used for training), you can use it to validate the accuracy of your predictor. In the case of Titanic, we have test data instead, which differs from the training data in that it doesn't include the 'Survived' column. We can use the model we built to make predictions for the test data and submit them to Kaggle. If you want to validate the predictor, you could split the training data into two files, using one to train and one to validate.
In the following code, I'll save the model's prediction in 'prediction.csv'.

In [21]:
# to validate
# ! python3 predict.py -validate validation_data.csv
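
To produce a file like the `validation_data.csv` named in the commented-out command above, the training data can be split in two. The following is a minimal sketch using only the standard library. The 80/20 ratio, the fixed seed, and the output name `train_split.csv` are assumptions chosen for illustration.

```python
import csv
import random


def split_rows(rows, validation_fraction=0.2, seed=0):
    """Shuffle rows reproducibly and return (training_rows, validation_rows)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


def split_csv(source, train_path, validation_path,
              validation_fraction=0.2, seed=0):
    """Write two CSV files, each repeating the header row of `source`."""
    with open(source, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)
    train_rows, validation_rows = split_rows(rows, validation_fraction, seed)
    for path, subset in ((train_path, train_rows),
                         (validation_path, validation_rows)):
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(subset)


# Example call (train_split.csv is a hypothetical file name):
# split_csv("train.csv", "train_split.csv", "validation_data.csv")
```

The predictor would then need to be rebuilt on the training portion only, so that the validation rows are genuinely unseen.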

In [22]:
! python3 predict.py test.csv > prediction.csv
! head prediction.csv

PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Prediction
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,0
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S,0
894,2,"Myles, Mr. Thomas Francis",male,62,0,0,240276,9.6875,,Q,0
895,3,"Wirz, Mr. Albert",male,27,0,0,315154,8.6625,,S,0
896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22,1,1,3101298,12.2875,,S,0
897,3,"Svensson, Mr. Johan Cervin",male,14,0,0,7538,9.225,,S,0
898,3,"Connolly, Miss. Kate",female,30,0,0,330972,7.6292,,Q,1
899,2,"Caldwell, Mr. Albert Francis",male,26,1,1,248738,29,,S,0
900,3,"Abrahim, Mrs. Joseph (Sophie Halaut Easu)",female,18,0,0,2657,7.2292,,C,1


As you can see, the prediction file contains the test data with the model's prediction of survival appended as the last column.
Submitted to Kaggle, these predictions score 74.641% accuracy.
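
Kaggle's Titanic competition expects a submission with exactly two columns, PassengerId and Survived, so the prediction file has to be trimmed before uploading. The sketch below is one way to do that, taking the first and last field of each row by position rather than by header name. The function names and the `submission.csv` file name are hypothetical.

```python
import csv
import io

# Three rows from the predictor output shown above (header omitted on purpose).
SAMPLE_PREDICTIONS = '''892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,0
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S,0
898,3,"Connolly, Miss. Kate",female,30,0,0,330972,7.6292,,Q,1
'''


def to_submission(prediction_file, has_header=True):
    """Return (PassengerId, prediction) pairs: first and last field per row."""
    reader = csv.reader(prediction_file)
    if has_header:
        next(reader, None)  # skip the header row, whatever it contains
    return [(row[0], row[-1]) for row in reader if row]


def write_submission(pairs, out_file):
    """Write the pairs under the two column names Kaggle expects."""
    writer = csv.writer(out_file)
    writer.writerow(["PassengerId", "Survived"])
    writer.writerows(pairs)


pairs = to_submission(io.StringIO(SAMPLE_PREDICTIONS), has_header=False)
out = io.StringIO()
write_submission(pairs, out)
print(out.getvalue())

# With real files (submission.csv is a hypothetical name):
# with open("prediction.csv", newline="") as src, \
#         open("submission.csv", "w", newline="") as dst:
#     write_submission(to_submission(src), dst)
```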

## 4. Improving Our Model

To improve on this result, try building the predictor with the -rank or -ignorecolumns options.
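
Before deciding which columns to leave out, it can help to look at how many distinct and missing values each column has. The sketch below profiles the first five training rows shown earlier. The idea that columns with a distinct value in almost every row (such as PassengerId) or with many missing values are candidates for -ignorecolumns is an assumption of this sketch, not guidance taken from the Daimensions documentation.

```python
import csv
import io

# First five rows of train.csv as shown earlier in this notebook.
SAMPLE = '''PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
'''


def profile_columns(csv_file):
    """Return {column: {"distinct": int, "missing": int, "rows": int}}."""
    reader = csv.DictReader(csv_file)
    rows = list(reader)
    profile = {}
    for name in reader.fieldnames:
        values = [row[name] for row in rows]
        profile[name] = {
            "distinct": len(set(values)),
            "missing": sum(1 for value in values if value == ""),
            "rows": len(values),
        }
    return profile


for name, stats in profile_columns(io.StringIO(SAMPLE)).items():
    print(f"{name:12} distinct={stats['distinct']} missing={stats['missing']}")
```

On a sample this small, continuous columns such as Fare also look unique in every row, so the profile is only meaningful when run on the full train.csv.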