## <a name="abstract">Titanic: Machine Learning from Disaster</a>

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 of the 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One reason the shipwreck caused such loss of life was that there were not enough lifeboats for the passengers and crew. Although surviving the sinking involved some element of luck, some groups of people were more likely to survive than others, such as women, children, and the upper class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

## Import External Notebook: Julia Assistant Tools

In [51]:
using NBInclude
using DataFrames    # provides readtable
using DecisionTree
using GLM           # provides Binomial and LogitLink
using XGBoost
# nbinclude("../06_assistent_tools/Julia\ Assistent\ Tools\ for\ Machine\ Learning.ipynb")

## Load Data

In [3]:
train = readtable("data/train_v2.tsv")
test = readtable("data/test_v2.tsv");

## Define Features X and Label Y

In [55]:
features = setdiff(names(test), [:Name, :PassengerId])
feature_space = gen_feature_space(features)
label = setdiff(names(train), names(test))[1];
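
The label is taken to be the one column that exists in `train` but not in `test`. A minimal sketch of that set logic, using made-up column names (the real columns of `train_v2.tsv` are not listed in this notebook, so `train_cols` and `test_cols` are assumptions):

```julia
# Hypothetical column names, for illustration only.
train_cols = [:PassengerId, :Survived, :Pclass, :Name, :Sex, :Age]
test_cols  = [:PassengerId, :Pclass, :Name, :Sex, :Age]

# Features: every test column except the two the notebook excludes.
demo_features = setdiff(test_cols, [:Name, :PassengerId])

# Label: the first (here, only) column present in train but absent from test.
demo_label = setdiff(train_cols, test_cols)[1]

println(demo_features)
println(demo_label)
```

If `train` ever carried more than one extra column, taking `[1]` would silently pick the first of them, so checking the length of the difference beforehand may be worthwhile.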

## Prepare Training, Validation and Test Instances

### In Memory

In [44]:
X_train, X_val = split_train_val(train; train_size=.85, random_state=1)
train_x, train_y, val_x, val_y = gen_train_val(train, features, label)
dtrain, dval = gen_dtrain(train, features, label);
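
`split_train_val` comes from the helper notebook and its implementation is not shown here. A seeded 85/15 split can be sketched as follows; `demo_split_indices` is a hypothetical stand-in, not the helper itself:

```julia
using Random

# Hypothetical stand-in for split_train_val: shuffle the row indices with a
# fixed seed, then cut the permutation at the requested fraction.
function demo_split_indices(n::Integer; train_size::Real=0.85, random_state::Integer=1)
    rng = MersenneTwister(random_state)
    idx = shuffle(rng, 1:n)                  # random permutation of 1:n
    n_train = round(Int, train_size * n)     # number of training rows
    return idx[1:n_train], idx[n_train+1:end]
end

train_idx, val_idx = demo_split_indices(100)
println(length(train_idx), " training rows, ", length(val_idx), " validation rows")
```

Fixing the seed keeps the split reproducible across runs, which makes validation scores of different models comparable.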

### In External Memory

#### Vowpal Wabbit

In [59]:
train_vw      = generate_vw_file(X_train[:, vcat(features, label)], feature_space, label)
val_vw        = generate_vw_file(X_val[:, vcat(features, label)], feature_space, label)
test_vw       = generate_vw_file(test[:, features], feature_space, label)
train_vw_file = "data/train_v2.vw"
val_vw_file   = "data/val_v2.vw"
test_vw_file  = "data/test_v2.vw"
export_gz(train_vw, train_vw_file)
export_gz(val_vw, val_vw_file)
export_gz(test_vw, test_vw_file);
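
Vowpal Wabbit reads plain-text examples of the form `label |namespace name:value ...`. The sketch below shows one way a row could be turned into such a line. `demo_vw_line`, the namespace `f`, and the feature names are assumptions for illustration; the actual output of `generate_vw_file` and its label encoding are not shown in this notebook.

```julia
# Hypothetical helper (not the notebook's generate_vw_file): format one example
# as a Vowpal Wabbit text line with a single namespace `f`.
function demo_vw_line(label::Integer,
                      numeric::Dict{String,Float64},
                      categorical::Dict{String,String})
    parts = String[]
    for name in sort(collect(keys(numeric)))
        push!(parts, string(name, ":", numeric[name]))       # numeric feature as name:value
    end
    for name in sort(collect(keys(categorical)))
        push!(parts, string(name, "=", categorical[name]))   # one feature name per category level
    end
    # The label is written as given; choosing the encoding is left to the caller.
    return string(label, " |f ", join(parts, " "))
end

line = demo_vw_line(1, Dict("Age" => 29.0, "Fare" => 7.25), Dict("Sex" => "female"))
println(line)
```

Feature names must not contain spaces, `:` or `|`, since those characters are part of the format.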

## Train Models

### XGBoost

In [None]:
num_rounds = 10
watchlist  = [(dtrain, "train"), (dval, "eval")]
metrics    = ["error"]
params     = Dict("objective"        => "binary:logistic",
                  "booster"          => "gbtree",
                  "eta"              => 0.1,
                  "alpha"            => 0.5,
                  "gamma"            => 0.0,
                  "max_depth"        => 5,
                  "colsample_bytree" => 0.5,
                  "min_child_weight" => 10,
                  "subsample"        => 0.5,
                  "seed"             => 1)

println("Training Base Model...")
tic()
model_xgb = XGBoost.xgboost(dtrain, num_rounds, param=params, metrics=metrics, watchlist=watchlist)
toc()
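
The `"error"` metric reported on the watchlist is the binary classification error: predictions above 0.5 count as positive, and the metric is the fraction of misclassified examples. A small sketch of that computation on made-up values (`demo_error_rate` and the sample data are illustrative, not part of XGBoost):

```julia
# Binary classification error at a 0.5 threshold: the share of examples whose
# thresholded prediction differs from the 0/1 label.
function demo_error_rate(probs::AbstractVector{<:Real}, labels::AbstractVector{<:Integer})
    length(probs) == length(labels) || throw(ArgumentError("length mismatch"))
    predicted = [p > 0.5 ? 1 : 0 for p in probs]
    return sum(predicted .!= labels) / length(labels)
end

demo_probs  = [0.9, 0.2, 0.6, 0.4]   # made-up predicted probabilities
demo_labels = [1, 0, 0, 1]           # made-up ground truth
println(demo_error_rate(demo_probs, demo_labels))
```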

### GLMs

In [None]:
formulas = gen_formulas(features, feature_space, label, 10)
model_glm = gen_glm(train, formulas, Binomial(), LogitLink());
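
`Binomial()` with `LogitLink()` specifies a logistic regression: the linear predictor models the log-odds of survival, and the inverse link maps it back to a probability. A short sketch of the two mappings (the `demo_` function names are illustrative):

```julia
# Logit link: probability -> log-odds.
demo_logit(p) = log(p / (1 - p))

# Inverse logit: linear predictor (log-odds) -> probability in (0, 1).
demo_inv_logit(eta) = 1 / (1 + exp(-eta))

println(demo_inv_logit(0.0))               # log-odds of zero correspond to even odds
println(demo_inv_logit(demo_logit(0.8)))   # round trip recovers the probability up to rounding
```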

### Decision Trees, Random Forests and Adaboost Trees

In [17]:
model_decision_tree = train_decision_tree(train_x, train_y)
model_random_forest = train_random_forest(train_x, train_y, random_features=5, num_trees=10, portion_samples=.5)
model_adaboost_trees, adaboost_trees_coefs = train_adaptive_boosted_trees(train_x, train_y, num_iteration=5);

### Scikit-Models

In [30]:
models_scikit = train_scikit_models(scikit_all_classifier_models, train_x, train_y)

Training Model ExtremyTree (1/15)...
elapsed time: 0.023986754 seconds
Training Model RandomForest (2/15)...
elapsed time: 0.02914997 seconds
Training Model PassiveAggressiveClassifier (3/15)...
elapsed time: 0.039122004 seconds
Training Model LogisticRegression (4/15)...
elapsed time: 0.018073097 seconds
Training Model Bagging_ExtremyTree (5/15)...
elapsed time: 0.514863291 seconds
Training Model NaiveBayes (6/15)...
elapsed time: 0.027991561 seconds
Training Model DecisionTree (7/15)...
elapsed time: 0.090264383 seconds
Training Model Bagging_DecisionTree (8/15)...
elapsed time: 0.28740818 seconds
Training Model kNN (9/15)...
elapsed time: 0.063536531 seconds
Training Model Boosting_ExtremyTree (10/15)...
elapsed time: 0.091424457 seconds
Training Model Boosting_DecisionTree (11/15)...
elapsed time: 0.760145922 seconds
Training Model SGDClassifier (12/15)...
elapsed time: 0.045841504 seconds
Training Model GradientBoostingTrees (13/15)...
elapsed time: 0.361286513 seconds
Training Mo

### Vowpal Wabbit

In [67]:
train_vw_binary_classifier("$train_vw_file.gz", "", l1=0.1);

using l1 regularization = 0.1
enabling BFGS based optimization **without** curvature calculation
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
m = 15
Allocated 34M for weights and mem
## avg. loss 	der. mag. 	d. m. cond.	 wolfe1    	wolfe2    	mix fraction	curvature 	dir. magnitude	step size 
using cache_file = data/train_v2.vw.gz.cache
ignoring text input in favor of cache input
num sources = 1
 1 0.69315   	0.20299   	2.47557   	          	          	          	27.54136  	1264.73645	0.08989   
Maximum number of passes reached. If you want to optimize further, increase the number of passes

finished run
number of examples = 1364
weighted example sum = 1364
weighted label sum = 514
average loss = 0.795431 h
best constant = -0.503013
best constant's loss = 0.662492
total feature number = 18302
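
The summary lines of the log can be checked by hand. Assuming 0/1 labels and logistic loss (the first-pass loss of 0.69315, which equals ln 2, is consistent with that), the best constant is the log-odds of the positive rate and its loss is the entropy of that rate. This is a sketch under those assumptions, not a description of Vowpal Wabbit's internals:

```julia
n_examples = 1364   # "number of examples" from the log
label_sum  = 514    # "weighted label sum" from the log

p = label_sum / n_examples                                   # share of positive labels
best_constant      = log(p / (1 - p))                        # log-odds of the positive rate
best_constant_loss = -(p * log(p) + (1 - p) * log(1 - p))    # logistic loss of that constant

println(round(best_constant, digits=6))
println(round(best_constant_loss, digits=6))
```

Both values should land close to the `-0.503013` and `0.662492` reported above. The reported average loss of 0.795431 is higher than the best constant's loss, so after a single pass this model does worse than always predicting the base rate.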


## Evaluate Models