# Building a more complex model

We are going to use an external library to build a more complex model, which will be learned automatically from the data.

In [None]:
import pandas as pd
import numpy as np

# Load the data as a dataframe
df = pd.read_csv('./../data/titanic_train.csv')

In [None]:
Y_truth = list(1.0*df['Survived'])

In [None]:
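def compute_accuracy(Y, Y_pred):
    """Return the fraction of predictions that match the truth once rounded to 0 or 1."""
    assert type(Y) is list
    assert type(Y_pred) is list
    assert len(Y) == len(Y_pred)
    # 1 for each correct prediction, 0 otherwise
    n_success = [(1 if round(y_pred) == y else 0) for (y, y_pred) in zip(Y, Y_pred)]
    return 1.0*sum(n_success)/len(Y)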

Just to see everything is working...

In [None]:
Y_pred = [0.0]*len(df)
compute_accuracy(Y_truth, Y_pred)
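
The helper can also be checked on a case small enough to verify by hand. The sketch below repeats the function so that it runs on its own; the four values in `Y_small` and `Y_pred_small` are made up for illustration. Three of the four rounded predictions match the truth.

```python
def compute_accuracy(Y, Y_pred):
    assert type(Y) is list
    assert type(Y_pred) is list
    assert len(Y) == len(Y_pred)
    n_success = [(1 if round(y_pred) == y else 0) for (y, y_pred) in zip(Y, Y_pred)]
    return 1.0*sum(n_success)/len(Y)

# Made-up truth and predictions (predictions may be scores between 0 and 1)
Y_small = [1.0, 0.0, 1.0, 0.0]
Y_pred_small = [0.9, 0.2, 0.4, 0.0]

# 0.9 -> 1 (right), 0.2 -> 0 (right), 0.4 -> 0 (wrong), 0.0 -> 0 (right)
accuracy_small = compute_accuracy(Y_small, Y_pred_small)
print(accuracy_small)  # 0.75
```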

## Feature Engineering

In practice, we often need to adjust the variables, because real-world data is not perfect.
For example:

  * some data are useless for prediction (names, cabin number...)
  * some data are textual, and mathematical models prefer numbers (Male/Female)
  * some data are missing (Age)
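
These three issues can be spotted with a few pandas calls. The sketch below uses a tiny made-up table (`toy`) as a stand-in for the Titanic data, so the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data (values are made up for illustration)
toy = pd.DataFrame({
    "Name": ["A. Smith", "B. Jones", "C. Brown", "D. White"],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, np.nan, 35.0, np.nan],
})

# Missing values per column
missing_per_column = toy.isna().sum()

# Non-numeric columns need converting before they can feed a model
text_columns = list(toy.select_dtypes(exclude="number").columns)

# Distinct values per column: a column that differs on every row
# (such as a name) carries no pattern a model can reuse
distinct_per_column = toy.nunique()

print(missing_per_column)
print(text_columns)
print(distinct_per_column)
```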

**The automatic model will use all the available data to try to build a predictor. It's time to try to be clever.**

Let's check again what is available.

In [None]:
df.head(15)

### Turning to numeric

The *Sex* field is either *"female"* or *"male"*. We shall set it to *1* or *0* respectively.

In [None]:
df['IsWoman'] = 1.0*(df['Sex'] == "female")

### Removing useless prediction data
Some fields are too unique, and they would only confuse an automatic model.

In [None]:
del df['Name']
del df['Sex']
del df['Ticket']
del df['PassengerId']
del df['Cabin']

Let's see what our data looks like now.

In [None]:
df.head(15)

### The port of embarkation is a letter. Let's convert it into a number

In [None]:
df['Embarked'].value_counts(dropna = False)

Let's turn the *NaN* (undefined) values into *"S"* (the most common value).

In [None]:
# Assign the result back: in-place fillna on a selected column is deprecated in recent pandas
df['Embarked'] = df['Embarked'].fillna("S")

Let's turn those three letters into numbers.
First, build a *dictionary_letters* mapping that associates each letter with a number.

In [None]:
embark_letters=list(df['Embarked'].unique())
number_of_letters = len(embark_letters)

dictionary_letters = {embark_letters[i]: i for i in range(0,number_of_letters)}
dictionary_letters


Turn each letter into its associated code.

In [None]:
df['Embarked_code'] = [dictionary_letters[x] for x in df['Embarked']]

And now remove the original *Embarked* field, which is no longer needed.

In [None]:
del df['Embarked']
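
Integer codes give the ports an arbitrary order (one port ends up "smaller" than another). A common alternative, shown here only as a sketch on made-up data, is one-hot encoding with `pd.get_dummies`, which creates one 0/1 column per port. The `toy` table and the `Embarked_` prefix are assumptions for the example, not part of the notebook's pipeline.

```python
import pandas as pd

# Made-up sample of embarkation letters
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One 0/1 column per port instead of a single integer code
one_hot = pd.get_dummies(toy["Embarked"], prefix="Embarked", dtype=float)

# Swap the text column for the new numeric columns
toy = pd.concat([toy.drop(columns=["Embarked"]), one_hot], axis=1)
print(toy)
```

A decision tree can work with either encoding; one-hot columns mostly matter for models that interpret the numeric distance between codes.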

### Replace the missing ages by the average value

Well, it's better than nothing.

In [None]:
# The age is not always given. We replace empty values with the mean.
# Assign the result back: in-place fillna on a selected column is deprecated in recent pandas
df['Age'] = df['Age'].fillna(df['Age'].mean())
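
The mean is only one possible filler. As a hedged variation, the sketch below fills with the median, which is less affected by a few extreme ages, and keeps a flag recording which rows were originally empty. The `toy` data and the `AgeMissing` column name are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up ages with two gaps
toy = pd.DataFrame({"Age": [20.0, np.nan, 30.0, 70.0, np.nan]})

# Remember which rows had no age before filling them in
toy["AgeMissing"] = toy["Age"].isna().astype(float)

# Median of the known ages (20, 30, 70) is 30
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
print(toy)
```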

### Check data (a last time)

In [None]:
df.head()

## Train the model

We can now launch the model.
It is provided by the renowned *sklearn* package.


In [None]:
from sklearn import tree

In [None]:
clf = tree.DecisionTreeClassifier(max_depth=2, min_samples_leaf=1)
clf_trained = clf.fit(df, Y_truth)


#### What is happening here? Perfection???

In [None]:
compute_accuracy(Y_truth, list(clf_trained.predict(df)))

Don't dream: we forgot to remove the *Survived* column, so the model cheats and takes it as an input.

Let's remove it

In [None]:
del df['Survived']
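
One way to make this kind of leak harder to repeat is to build the feature table explicitly from the start, rather than deleting the target afterwards. This is a minimal sketch on made-up rows; the names `toy`, `X`, `y` and `target_column` are introduced here for illustration.

```python
import pandas as pd

# Made-up rows with the same kind of columns as the prepared data
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "IsWoman": [0.0, 1.0, 1.0, 0.0],
    "Age": [22.0, 38.0, 26.0, 35.0],
})

target_column = "Survived"

# Target as a list of floats, features as everything else
y = toy[target_column].astype(float).tolist()
X = toy.drop(columns=[target_column])

# Fail loudly if the answer is still among the inputs
assert target_column not in X.columns
print(list(X.columns))
```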

### The accuracy is 80%!
Way better than any of our handmade models.

In [None]:
clf_trained = clf.fit(df, Y_truth)
compute_accuracy(Y_truth, list(clf_trained.predict(df)))

### What are the rules behind the model??

In [None]:
!pip install graphviz
import graphviz
dot_data = tree.export_graphviz(clf_trained, out_file=None, 
                         feature_names=list(df),  
                         filled=True, rounded=True, class_names=["Died", "Survived"], 
                         special_characters=True, impurity=False)  
graph = graphviz.Source(dot_data)  
graph 
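
If graphviz is not available, sklearn can also print the rules as plain text with `tree.export_text`. The sketch below fits a small tree on made-up rows (two columns standing in for `IsWoman` and `Age`), so the rules it prints are not those of the Titanic model.

```python
from sklearn import tree

# Made-up rows: columns are [IsWoman, Age]
X_toy = [[0.0, 22.0], [1.0, 38.0], [1.0, 26.0],
         [0.0, 35.0], [0.0, 54.0], [1.0, 4.0]]
y_toy = [0.0, 1.0, 1.0, 0.0, 0.0, 1.0]

clf_toy = tree.DecisionTreeClassifier(max_depth=2, random_state=0)
clf_toy.fit(X_toy, y_toy)

# One line per split or leaf, indented by depth
rules = tree.export_text(clf_toy, feature_names=["IsWoman", "Age"])
print(rules)
```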

### Can we make a more efficient model with a tree?

Let's make a deeper tree

In [None]:
clf = tree.DecisionTreeClassifier(max_depth=10, min_samples_leaf=1)

A miracle happens: 92%!

In [None]:
clf_trained = clf.fit(df, Y_truth)
compute_accuracy(Y_truth, list(clf_trained.predict(df)))
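
A score computed on the very rows the tree was fitted on can be flattering. One way to check it, sketched here on synthetic noisy data rather than the Titanic file, is to hold some rows back with `train_test_split` and score the tree on those as well. The data generation, the split size and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data: only the first column matters, and the label is noisy
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + rng.normal(scale=1.0, size=400) > 0).astype(float)

# Keep a quarter of the rows away from training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

deep = tree.DecisionTreeClassifier(max_depth=10, min_samples_leaf=1, random_state=0)
deep.fit(X_train, y_train)

# Accuracy on rows seen during fitting vs rows never seen
train_accuracy = accuracy_score(y_train, deep.predict(X_train))
test_accuracy = accuracy_score(y_test, deep.predict(X_test))
print(train_accuracy, test_accuracy)
```

Comparing the two numbers shows how much of the score comes from memorising the training rows.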

In [None]:
dot_data = tree.export_graphviz(clf_trained, out_file=None, 
                         feature_names=list(df),  
                         filled=True, rounded=True, class_names=["Died", "Survived"], 
                         special_characters=True, impurity=False)  
graph = graphviz.Source(dot_data)  
graph 

## How to overcome that problem??