# Titanic Using Daimensions

This notebook uses data from the Titanic competition on Kaggle (https://www.kaggle.com/c/titanic/overview).

Kaggle's description of the competition:
"The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered 'unsinkable' RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others. In this challenge, we ask you to build a predictive model that answers the question: 'what sorts of people were more likely to survive?' using passenger data (ie name, age, gender, socio-economic class, etc)."

Goal: Make a predictor of survival from Titanic training data. We'll do this by using Daimensions to measure, build, and validate a predictor.

## 0. Getting Started

Because this is the very first tutorial, we'll go over how to install btc and get started. You can also see how to set up btc in the Daimensions Quickstart guide.

First, use the following link to download the installation script: https://download.brainome.net/btc-cli/btc-setup.sh. From the download directory, run the following bash command.

In [1]:
! sh btc-setup.sh

sh: btc-setup.sh: No such file or directory


The script will check that your operating system is supported, download the latest btc client to your machine and install it in /usr/local/bin. You will be prompted to enter the administrator password to install the software. 
*NOTE: After installation, make sure that "/usr/local/bin" is in your search path.*

Next, run the following command to wipe all cloud files. You will need your user credentials to log in to Daimensions™. The first time you log in, your license key will be downloaded automatically. Please use the default password that was provided to you.

In [None]:
! btc WIPE

To change your password, use the following bash command.

In [None]:
! btc CHPASSWD

## 1. Get Measurements

Measuring our data before building a predictor helps us avoid mistakes and optimize our model. If we don't measure our data, we have no way of knowing whether the predictor we build will actually do what we want when it sees new data that it wasn't trained on. We will probably build a model that is much larger than it needs to be, so our training and run times will be longer than necessary. We could also end up not knowing whether we have the right amount or the right type of training data, even after extensive training and testing. For these reasons, it's best to measure our data beforehand. Daimensions will also tell us about learnability, the generalization ratio, noise resilience, and all the standard accuracy and confusion figures.
For more information, you can read the Daimensions How-to Guide and Glossary.

In [2]:
# Below is a clip of the training data:
! head titanic_train.csv
# For Windows command prompt:
# type titanic_train.csv | more













As you can see above, the target column (Survived) isn't the last column on the right. We therefore need to pass '-target Survived' so that Daimensions uses the correct target column when measuring the data and building a predictor.
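A quick way to confirm where the target column sits is to read only the header row. The sketch below uses Python's csv module on an inline copy of the header. The column order is an assumption (the standard Kaggle train.csv layout), and `target_position` is a hypothetical helper name, not part of Daimensions.

```python
import csv
import io

# Assumed header: the standard Kaggle Titanic train.csv column order.
# To check a real file, pass open("titanic_train.csv", newline="") instead.
SAMPLE = (
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
)

def target_position(csv_file, target):
    """Return (index, is_last) for the target column in the CSV header row."""
    header = next(csv.reader(csv_file))
    index = header.index(target)  # raises ValueError if the column is missing
    return index, index == len(header) - 1

index, is_last = target_position(io.StringIO(SAMPLE), "Survived")
print(f"Survived is column {index} (0-based); last column: {is_last}")
```

If `is_last` is false, the target has to be named explicitly with '-target'.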

In [3]:
# Measuring the training data:
! brainome -measureonly titanic_train.csv -target Survived


Brainome Table Compiler v1.005-7-prod
Copyright (c) 2019-2021 Brainome, Inc. All Rights Reserved.
Licensed to:                 Alexander Makhratchev  (Evaluation)
Expiration Date:             2021-08-31   29 days left
Maximum File Size:           30 GB
Maximum Instances:           unlimited
Maximum Attributes:          unlimited
Maximum Classes:             unlimited
Connected to:                daimensions.brainome.ai  (local execution)

Command:
    btc -measureonly titanic_train.csv -target Survived

Start Time:                 08/02/2021, 13:15 PDT

Cleaning...done. 
Splitting into training and validation...done. 
Pre-training measurements...done.


Pre-training Measurements
Data:
    Input:                      titanic_train.csv
    Target Column:              Survived
    Number of instances:        891
    Number of attributes:        11 out of 11
    Number of classes:            2

Class Balance:                
                              

## 2. Build the Predictor

Because the learnability of the data (based on capacity progression and risk) is yellow, the how-to guide recommends choosing a predictor with higher generalization and increasing effort for best results. This suggests that a neural network with increased effort should work best. Here, I'm using '-f NN' to make the predictor a neural network and '-o titanic_predict.py' to output the predictor as a Python file. Effort can be raised with '-e 10' for 10 times the effort; the command below omits it and runs at the default effort, while the builds in section 4 use it. Again, we have to use '-target Survived' because the target column isn't the last one.
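When trying several flag combinations, it can help to assemble the command line programmatically. The sketch below is only an illustration: `build_brainome_command` is a hypothetical helper, and it uses only flags that appear in this notebook ('-f', '-o', '-target', '-ignorecolumns', '-rank', '-e', '--yes'). It builds the argument list without running anything.

```python
import shlex

def build_brainome_command(input_csv, output_py, target, model="NN",
                           effort=None, ignore_columns=None, rank=False):
    """Assemble a brainome argument list from flags used in this notebook."""
    cmd = ["brainome", "-v", "-v", "-f", model, input_csv,
           "-o", output_py, "-target", target]
    if ignore_columns:
        # Column names are passed as one comma-separated argument.
        cmd += ["-ignorecolumns", ",".join(ignore_columns)]
    if rank:
        cmd.append("-rank")
    if effort is not None:
        cmd += ["-e", str(effort)]
    cmd.append("--yes")
    return cmd

cmd = build_brainome_command("titanic_train.csv", "titanic_predict.py", "Survived")
print(shlex.join(cmd))
```

The resulting list could be handed to `subprocess.run(cmd, check=True)` on a machine where brainome is installed.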

In [4]:
# Building the predictor and outputting it to 'titanic_predict.py':
! brainome -v -v -f NN titanic_train.csv -o titanic_predict.py -target Survived --yes


Brainome Table Compiler v1.005-7-prod
Copyright (c) 2019-2021 Brainome, Inc. All Rights Reserved.
Licensed to:                 Alexander Makhratchev  (Evaluation)
Expiration Date:             2021-08-31   29 days left
Maximum File Size:           30 GB
Maximum Instances:           unlimited
Maximum Attributes:          unlimited
Maximum Classes:             unlimited
Connected to:                daimensions.brainome.ai  (local execution)

Command:
    btc -v -v -f NN titanic_train.csv -o titanic_predict.py -target Survived --yes

Start Time:                 08/02/2021, 13:15 PDT

Cleaning...done. < 1s
Splitting into training and validation...done. < 1s
Pre-training measurements...done. 6s


Pre-training Measurements
Data:
    Input:                      titanic_train.csv
    Target Column:              Survived
    Number of instances:        891
    Number of attributes:        11 out of 11
    Number of classes:            2

Class Balance:          

## 3. Validate and Make Predictions

We've built our first predictor! Now it's time to put it to use. In the case of Titanic, Kaggle provides test data that differs from the training data and doesn't include the 'Survived' column. We can use the model we built to make predictions for the test data and submit them to Kaggle for its competition. In the following code, I'll save the model's predictions in 'titanic_prediction.csv'. You will see that the predictor appends its prediction of survival as the last column.

In [5]:
# Using predictor on test data and saving it to 'titanic_prediction.csv':
! python3 titanic_predict.py titanic_test.csv > titanic_prediction.csv
! head titanic_prediction.csv

PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Prediction
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,0
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S,0
894,2,"Myles, Mr. Thomas Francis",male,62,0,0,240276,9.6875,,Q,0
895,3,"Wirz, Mr. Albert",male,27,0,0,315154,8.6625,,S,0
896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22,1,1,3101298,12.2875,,S,1
897,3,"Svensson, Mr. Johan Cervin",male,14,0,0,7538,9.225,,S,0
898,3,"Connolly, Miss. Kate",female,30,0,0,330972,7.6292,,Q,1
899,2,"Caldwell, Mr. Albert Francis",male,26,1,1,248738,29,,S,0
900,3,"Abrahim, Mrs. Joseph (Sophie Halaut Easu)",female,18,0,0,2657,7.2292,,C,1
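
The Kaggle Titanic competition expects a submission file with just two columns, PassengerId and Survived, so the predictor output needs a small conversion before upload. The sketch below shows one way to do that with Python's csv module, using a few rows of the output above as inline sample data. `to_submission` is a hypothetical helper name, and the output file name in the usage note is an assumption.

```python
import csv
import io

# A few rows of the predictor output shown above, used as inline sample data.
SAMPLE = '''PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Prediction
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,0
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S,0
896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22,1,1,3101298,12.2875,,S,1
'''

def to_submission(prediction_file, submission_file):
    """Copy PassengerId and Prediction into a PassengerId,Survived CSV."""
    reader = csv.DictReader(prediction_file)
    writer = csv.writer(submission_file, lineterminator="\n")
    writer.writerow(["PassengerId", "Survived"])
    for row in reader:
        writer.writerow([row["PassengerId"], row["Prediction"]])

out = io.StringIO()
to_submission(io.StringIO(SAMPLE), out)
submission = out.getvalue()
print(submission)
```

For the real files, the same function can be called with `open("titanic_prediction.csv", newline="")` as input and an output file such as `open("submission.csv", "w", newline="")`.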


If you have validation data, or data that has the target column but wasn't used for training, you can use it to validate the accuracy of your predictor, as we will do. For this particular instance, I found an annotated version of the Titanic test data, 'titanic_validation.csv', and used it to validate our model.

In [6]:
# To validate:
! python3 titanic_predict.py -validate titanic_validation.csv

Classifier Type:                    Neural Network
System Type:                        2-way classifier

Accuracy:
    Best-guess accuracy:            62.20%
    Model accuracy:                 72.96% (305/418 correct)
    Improvement over best guess:    10.76% (of possible 37.8%)

Model capacity (MEC):               27 bits
Generalization ratio:               10.80 bits/bit

Confusion Matrix:

      Actual |Predicted
    ------------------
           0 |198  62
           1 | 51 107

Accuracy by Class:

      target |  TP FP  TN FN     TPR     TNR     PPV     NPV      F1      TS
    -------- | --- -- --- -- ------- ------- ------- ------- ------- -------
           0 | 198 51 107 62  76.15%  67.72%  79.52%  63.31%  77.80%  63.67%
           1 | 107 62 198 51  67.72%  76.15%  63.31%  79.52%  65.44%  48.64%


From validating the predictor, we can see that it has 72.96% accuracy, 10.76% better than the best-guess accuracy of 62.20% (which classifies all data points as the majority class).
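The headline figures in the validation report can be recomputed from its confusion matrix. The sketch below does this in plain Python, assuming (as the report's layout indicates) that rows are actual classes and columns are predicted classes. The variable names are illustrative.

```python
# Confusion matrix from the validation output above:
# rows are actual classes, columns are predicted classes.
confusion = [
    [198, 62],   # actual 0: predicted 0, predicted 1
    [51, 107],   # actual 1: predicted 0, predicted 1
]

total = sum(sum(row) for row in confusion)
correct = confusion[0][0] + confusion[1][1]

# Best-guess accuracy: always predict the most common actual class.
best_guess = max(sum(row) for row in confusion) / total
model_accuracy = correct / total
improvement = model_accuracy - best_guess

print(f"correct: {correct}/{total}")
print(f"best guess: {best_guess:.4f}, model: {model_accuracy:.4f}, "
      f"improvement: {improvement:.4f}")
```

The counts match the "305/418 correct" line in the report; small differences in the last printed digit can come from rounding versus truncation.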

## 4. Improving Our Model

Our model did pretty well, but let's see if we can improve it. A column that contains a unique value in each row (for example a database key) will never contribute to generalization, so we shouldn't include database keys or other unique ID columns. We can remove these columns by using '-ignorecolumns'. We'll try ignoring PassengerId, Name, and Ticket, because they are unique or nearly unique per passenger, and we'll also drop Cabin (mostly empty) and Embarked for this experiment. We could also use '-rank' to rank columns by significance and only process contributing attributes.
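One rough way to spot ID-like columns before choosing what to pass to '-ignorecolumns' is to look for columns in which every row holds a distinct value. The sketch below is an illustration only: `unique_value_columns` is a hypothetical helper, and the three-row table is made up rather than taken from the Titanic data.

```python
import csv
import io

# Tiny made-up table for illustration only; not taken from the Titanic data.
SAMPLE = """Id,Group,Score
1,a,10
2,b,10
3,a,12
"""

def unique_value_columns(csv_file):
    """Return names of columns in which every row holds a distinct value."""
    reader = csv.DictReader(csv_file)
    rows = list(reader)
    if not rows:
        return []
    return [
        name
        for name in (reader.fieldnames or [])
        if len({row[name] for row in rows}) == len(rows)
    ]

id_like = unique_value_columns(io.StringIO(SAMPLE))
print(id_like)
```

A check like this only catches strictly unique columns; nearly unique ones (such as ticket numbers shared by a family) still need a judgment call.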

### Ignorecolumns vs Rank:
There may be situations where domain knowledge suggests a better choice of features than -rank. If we know the process that generated the data, we can do better with -ignorecolumns than with -rank. Rank also optimizes for quick clustering and decision trees. For neural networks, we may still wish to reduce the input features, which can be done with PCA, but at the cost of interpretability. Some applications require that the original features be used, in which case PCA isn't viable. Ignorecolumns can reduce features while maintaining interpretability and may work better for neural networks than -rank, but the burden of choosing the right columns to keep is now on us.

### Using -ignorecolumns:

In [7]:
# Using -ignorecolumns to make a better predictor:
! brainome -v -v -f NN titanic_train.csv -o titanic_predict_igcol.py -target Survived -ignorecolumns PassengerId,Name,Ticket,Cabin,Embarked -e 10 --yes


Brainome Table Compiler v1.005-7-prod
Copyright (c) 2019-2021 Brainome, Inc. All Rights Reserved.
Licensed to:                 Alexander Makhratchev  (Evaluation)
Expiration Date:             2021-08-31   29 days left
Maximum File Size:           30 GB
Maximum Instances:           unlimited
Maximum Attributes:          unlimited
Maximum Classes:             unlimited
Connected to:                daimensions.brainome.ai  (local execution)

Command:
    btc -v -v -f NN titanic_train.csv -o titanic_predict_igcol.py -target Survived -ignorecolumns PassengerId,Name,Ticket,Cabin,Embarked -e 10 --yes

Start Time:                 08/02/2021, 13:18 PDT

Cleaning...done. < 1s
Splitting into training and validation...done. < 1s
Pre-training measurements...done. 4s


Pre-training Measurements
Data:
    Input:                      titanic_train.csv
    Target Column:              Survived
    Number of instances:        891
    Number of attributes:         6 out o

In [18]:
# Using the ignorecolumns predictor on test data and saving it to 'titanic_prediction_igcol.csv':
! python3 titanic_predict_igcol.py titanic_test.csv > titanic_prediction_igcol.csv
! head titanic_prediction_igcol.csv

PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Prediction
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,0
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S,1
894,2,"Myles, Mr. Thomas Francis",male,62,0,0,240276,9.6875,,Q,0
895,3,"Wirz, Mr. Albert",male,27,0,0,315154,8.6625,,S,0
896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22,1,1,3101298,12.2875,,S,1
897,3,"Svensson, Mr. Johan Cervin",male,14,0,0,7538,9.225,,S,0
898,3,"Connolly, Miss. Kate",female,30,0,0,330972,7.6292,,Q,1
899,2,"Caldwell, Mr. Albert Francis",male,26,1,1,248738,29,,S,0
900,3,"Abrahim, Mrs. Joseph (Sophie Halaut Easu)",female,18,0,0,2657,7.2292,,C,1


As we wanted, the new predictor no longer uses the PassengerId, Name, Ticket, Cabin, and Embarked attributes as inputs, although they still appear in the output file. Next, we can use -validate to check the accuracy of our new predictor.

In [19]:
# Validating the -ignorecolumns predictor
! python3 titanic_predict_igcol.py -validate titanic_validation.csv

Classifier Type:                    Neural Network
System Type:                        Binary classifier
Best-guess accuracy:                62.20%
Model accuracy:                     77.27% (323/418 correct)
Improvement over best guess:        15.07% (of possible 37.8%)
Model capacity (MEC):               1 bits
Generalization ratio:               308.99 bits/bit
Model efficiency:                   15.06%/parameter
System behavior
True Negatives:                     52.87% (221/418)
True Positives:                     24.40% (102/418)
False Negatives:                    13.40% (56/418)
False Positives:                    9.33% (39/418)
True Pos. Rate/Sensitivity/Recall:  0.65
True Neg. Rate/Specificity:         0.85
Precision:                          0.72
F-1 Measure:                        0.68
False Negative Rate/Miss Rate:      0.35
Critical Success Index:             0.52
Confusion Matrix:
 [52.87% 9.33%]
 [13.40% 24.40%]


Using -ignorecolumns has improved our accuracy to 77.27% from 72.96% originally.

### Using -rank:

In [8]:
# Using -rank to make a better predictor:
! brainome -v -v -f NN titanic_train.csv -o titanic_predict_rank.py -target Survived -rank --yes -e 10


Brainome Table Compiler v1.005-7-prod
Copyright (c) 2019-2021 Brainome, Inc. All Rights Reserved.
Licensed to:                 Alexander Makhratchev  (Evaluation)
Expiration Date:             2021-08-31   29 days left
Maximum File Size:           30 GB
Maximum Instances:           unlimited
Maximum Attributes:          unlimited
Maximum Classes:             unlimited
Connected to:                daimensions.brainome.ai  (local execution)

Command:
    btc -v -v -f NN titanic_train.csv -o titanic_predict_rank.py -target Survived -rank --yes -e 10

Start Time:                 08/02/2021, 13:25 PDT

Cleaning...done. < 1s
Ranking attributes...done. 1s

Attribute Ranking:
    Columns selected:           Sex, SibSp, Parch, Pclass
    Risk of coincidental column correlation:    0.0%
    Ignoring columns:           PassengerId, Name, Age, Ticket, Fare, Cabin, Embarked
    Test Accuracy Progression:
                                          Sex :   78.56%
     

In [9]:
# Using the rank predictor on test data and saving it to 'titanic_prediction_rank.csv':
! python3 titanic_predict_rank.py titanic_test.csv > titanic_prediction_rank.csv
! head titanic_prediction_rank.csv

PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Prediction
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,0
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S,1
894,2,"Myles, Mr. Thomas Francis",male,62,0,0,240276,9.6875,,Q,0
895,3,"Wirz, Mr. Albert",male,27,0,0,315154,8.6625,,S,0
896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22,1,1,3101298,12.2875,,S,1
897,3,"Svensson, Mr. Johan Cervin",male,14,0,0,7538,9.225,,S,0
898,3,"Connolly, Miss. Kate",female,30,0,0,330972,7.6292,,Q,1
899,2,"Caldwell, Mr. Albert Francis",male,26,1,1,248738,29,,S,0
900,3,"Abrahim, Mrs. Joseph (Sophie Halaut Easu)",female,18,0,0,2657,7.2292,,C,1


You can see that -rank decided to look only at the columns 'Sex', 'SibSp' (siblings/spouses), 'Parch' (parents/children), and 'Pclass' (passenger class). It makes sense that the determining factors for survival on the Titanic were sex, how many siblings, spouses, parents, or children a passenger had on board, and their passenger class. Seeing which attributes -rank chooses gives us powerful insight into our data and its correlations.

In [10]:
# Validating the -rank predictor
! python3 titanic_predict_rank.py -validate titanic_validation.csv

Classifier Type:                    Neural Network
System Type:                        2-way classifier

Accuracy:
    Best-guess accuracy:            62.20%
    Model accuracy:                 77.51% (324/418 correct)
    Improvement over best guess:    15.31% (of possible 37.8%)

Model capacity (MEC):               31 bits
Generalization ratio:               10.00 bits/bit

Confusion Matrix:

      Actual |Predicted
    ------------------
           0 |220  40
           1 | 54 104

Accuracy by Class:

      target |  TP FP  TN FN     TPR     TNR     PPV     NPV      F1      TS
    -------- | --- -- --- -- ------- ------- ------- ------- ------- -------
           0 | 220 54 104 40  84.62%  65.82%  80.29%  72.22%  82.40%  70.06%
           1 | 104 40 220 54  65.82%  84.62%  72.22%  80.29%  68.87%  52.53%


With -rank, our accuracy is 77.51%, again an improvement over our original 72.96%.

## 5. Next Steps

Success! We've built our first predictor and used it to make predictions on the Titanic test data. From here, we can use our model on any new Titanic data or use other control options to try to improve our results even more.
To check out some of the other control options, use '-h' to see the full list. You can also check out Brainome's How-to Guide and Glossary for more information.

In [11]:
! brainome -h

usage: brainome [-h] [-version] [-headerless] [-target TARGET]
                [-ignorecolumns IGNORECOLUMNS] [-rank [ATTRIBUTERANK]]
                [-measureonly] [-f FORCEMODEL] [-nosplit] [-split FORCESPLIT]
                [-nsamples NSAMPLES] [-ignoreclasses IGNORELABELS]
                [-usecolumns IMPORTANTCOLUMNS] [-o OUTPUT] [-v] [-q] [-y]
                [-e EFFORT] [-biasmeter] [-novalidation] [-balance]
                [-O OPTIMIZE] [-nofun] [-modelonly]
                input [input ...]

Brainome Table Compiler (tm)  v1.005-7-prod

Required arguments:
  input                 Table as CSV files and/or URLs or Command above

Optional arguments:
  -h                    show this help message and exit
  -version              show program's version number and exit

Basic options:
  -headerless           Headerless CSV input file.
  -target TARGET        Specify target column by name or number. Default: last column of table.
  -igno