# Implementing machine learning for the Titanic case

*For data analysis, see the Titanic notebook.*
##### First trials
* Load Input Data
  * Clean & complete data
  * Split Training Data 
* Train one ML model
* Validate using Input Data

##### Implementing several ML models
The first three steps above are kept, then:
* Train several ML models
* Validate and compare
* Select Best
* Predict
* Submit
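
The "train several, validate and compare, select best" steps can be sketched as below. This is only an illustration under stated assumptions: the data is synthetic (not the Titanic set), and the three candidate models and the names `candidates`, `scores` and `bestName` are hypothetical choices, not the ones used later in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption): the label depends on the first two features
rng = np.random.RandomState(0)
X = rng.rand(300, 3)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
XTrain, yTrain = X[:200], y[:200]
XValidate, yValidate = X[200:], y[200:]

# Hypothetical candidates; any classifier with fit/score would do
candidates = {
    'tree': DecisionTreeClassifier(min_samples_leaf=10, random_state=0),
    'logistic': LogisticRegression(),
    'knn': KNeighborsClassifier(n_neighbors=5),
}

# Train each model, then score it on the held-out validation part
scores = {}
for name, model in candidates.items():
    model.fit(XTrain, yTrain)
    scores[name] = model.score(XValidate, yValidate)

# Select the model with the highest validation accuracy
bestName = max(scores, key=scores.get)
print(scores, bestName)
```

The selected model would then be the one used for the final prediction and submission.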

In [17]:
import pandas as pd
import numpy as np
import math
import datetime
import pydot_ng
import shutil
from io import StringIO  # in-memory text buffer used for the Graphviz export
from sklearn import tree
from IPython.display import Image

# Load Input Data

In [18]:
def LoadData(path):
    return pd.read_csv(path)
wholeInputData = LoadData('train.csv')
wholeInputData.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


## Clean & complete data

### Number of family members
We add the number of family members aboard. It is derived from two other variables (`Parch` and `SibSp`), but wouldn't it be complicated for the ML model to find this feature by itself?

In [19]:
def AddFamillyNbr(data):
    data['FamillyNbr']=data.Parch+data.SibSp
AddFamillyNbr(wholeInputData)

### Missing Age values
We fill the missing Age values with the median age.

In [20]:
def FillWithMedian(data, columnName):
    median = data[columnName].median()
    data[columnName]=data[columnName].fillna(median)
    
FillWithMedian(wholeInputData,'Age')
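
To see what the median fill does, here is a minimal sketch on a toy frame. The `toy` values are made up for illustration and are not taken from `train.csv`; only `FillWithMedian` mirrors the function above.

```python
import numpy as np
import pandas as pd

def FillWithMedian(data, columnName):
    # median() skips NaN by default, so it is computed on known values only
    median = data[columnName].median()
    data[columnName] = data[columnName].fillna(median)

# Toy data (hypothetical values)
toy = pd.DataFrame({'Age': [22.0, np.nan, 38.0, 26.0, np.nan]})
FillWithMedian(toy, 'Age')
print(toy['Age'].tolist())
```

The known ages are 22, 26 and 38, so both missing entries receive 26.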

### Are there other empty values?

In [21]:
def CheckForNull(ser):
    return ser.isnull().any()
print('Does Fare have NA : %s' % CheckForNull(wholeInputData.Fare))
print('Does Sex have NA : %s' % CheckForNull(wholeInputData.Sex))
print('Does Pclass have NA : %s' % CheckForNull(wholeInputData.Pclass))
print('Does FamillyNbr have NA : %s' % CheckForNull(wholeInputData.FamillyNbr))
print('Does Ticket have NA : %s' % CheckForNull(wholeInputData.Ticket))
print('Does Cabin have NA : %s' % CheckForNull(wholeInputData.Cabin))

Does Fare have NA : False
Does Sex have NA : False
Does Pclass have NA : False
Does FamillyNbr have NA : False
Does Ticket have NA : False
Does Cabin have NA : True
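
The same check can be run on every column at once instead of listing them by hand. The sketch below assumes a toy frame standing in for `wholeInputData`; its values and the name `missingCounts` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for wholeInputData (hypothetical values)
toy = pd.DataFrame({
    'Fare': [7.25, 71.2833, 7.925],
    'Cabin': [np.nan, 'C85', np.nan],
    'Embarked': ['S', 'C', 'S'],
})

# Number of missing values per column
missingCounts = toy.isnull().sum()

# Keep only the columns that actually have missing values
print(missingCounts[missingCounts > 0])
```

Counting instead of a plain yes/no also shows how much of a column is missing, which matters when deciding whether to fill it or drop it.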


#### Cabin
Cabin is not useful as is: it is only partially filled and its values are nearly all unique.
Let's look at the data to see whether the first letter could be derived from other columns such as `Pclass`.

In [22]:
wholeInputData[['Pclass','Cabin']].head(30)

Unnamed: 0,Pclass,Cabin
0,3,
1,1,C85
2,3,
3,1,C123
4,3,
5,3,
6,1,E46
7,3,
8,3,
9,2,


This data won't be usable: too many Cabin values are missing.
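
For reference, a sketch of how the first letter of the cabin could be extracted and how the share of missing cabins per class could be measured. The toy rows are shaped like the output above but are only a hypothetical sample; `CabinLetter` and `missingShare` are names introduced here.

```python
import numpy as np
import pandas as pd

# Toy rows shaped like the Pclass/Cabin output (hypothetical sample)
toy = pd.DataFrame({
    'Pclass': [3, 1, 3, 1, 3, 1, 2],
    'Cabin': [np.nan, 'C85', np.nan, 'C123', np.nan, 'E46', np.nan],
})

# First letter of the cabin; rows without a cabin stay NaN
toy['CabinLetter'] = toy['Cabin'].str[0]

# Share of missing cabins for each passenger class
missingShare = toy['Cabin'].isnull().groupby(toy['Pclass']).mean()

print(toy['CabinLetter'].tolist())
print(missingShare)
```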

### Split Training Data

In [23]:
def SplitData(data, sizeValidate):
    # Fix random_state so that the split is reproducible
    validateSample = data.sample(n=sizeValidate, random_state=10)
    # Everything that was not sampled becomes the training set
    rest = data.drop(validateSample.index)
    return (validateSample, rest)
    
validateData, trainData=SplitData(wholeInputData,50)
len(validateData.index)

50

In [24]:
len(trainData.index)

841
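
A quick sanity check of the split, sketched on a toy frame: the two parts should not share any row and should together cover the whole input. The toy data and the names `validateToy`, `trainToy` and `overlap` are assumptions for illustration. Note that dropping by index relies on the index labels being unique.

```python
import pandas as pd

def SplitData(data, sizeValidate):
    # Same logic as the notebook's split: sample, then drop the sampled rows
    validateSample = data.sample(n=sizeValidate, random_state=10)
    rest = data.drop(validateSample.index)
    return (validateSample, rest)

# Toy data with a default unique index (hypothetical)
toy = pd.DataFrame({'PassengerId': range(1, 21)})
validateToy, trainToy = SplitData(toy, 5)

# Rows present in both parts (should be none)
overlap = validateToy.index.intersection(trainToy.index)
print(len(validateToy), len(trainToy), len(overlap))
```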

#### Save them, just in case

In [25]:
validateData.to_csv('split_validate.csv')
trainData.to_csv('split_train.csv')

## Train a decision tree
The decision tree is the easiest model to understand. That's why I chose it first.

*It requires a dataset with no missing values. No other preparation is needed.*
#### Tidy Dataset

In [26]:
trainData.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,FamillyNbr
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1


In [29]:
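def TidyData(data):
    # Drop the target and the columns that are not used as features
    result =  data.drop(['Survived','Cabin','Name','Ticket','Embarked','Sex','PassengerId'],axis=1)
    # Encode Sex as a number: 1 for male, 0 for female
    result['Genre'] = (data['Sex'] == 'male').astype(int)
    return result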
tidyValidateData = TidyData(validateData)
tidyTrainData= TidyData(trainData)
survivedTrain = trainData['Survived']
survivedValidate = validateData['Survived']

In [30]:
tidyTrainData.head(3)

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare,FamillyNbr,Genre
0,3,22.0,1,0,7.25,1,1
2,3,26.0,0,0,7.925,0,0
3,1,35.0,1,0,53.1,1,0


In [39]:
survivedTrain.head(3)

0    0
2    1
3    1
Name: Survived, dtype: int64

In [31]:
classifierTree = tree.DecisionTreeClassifier()
classifierTree = classifierTree.fit(tidyTrainData,survivedTrain)

### Display the tree

In [55]:
def ExportGraphAsPng(treeToExport,featuresName,targetName,fileName):
    # Write the tree in Graphviz DOT format into an in-memory buffer
    output = StringIO()
    tree.export_graphviz(treeToExport, out_file=output,
                         feature_names=featuresName,
                         class_names=targetName,
                         filled=True, rounded=True,
                         special_characters=True)
    # Render the DOT source to a PNG file
    graph = pydot_ng.graph_from_dot_data(output.getvalue())
    graph.write_png(fileName)
targetLabels = ['Dead','Survived']
ExportGraphAsPng(classifierTree, tidyTrainData.columns,targetLabels,'FirstTree.png')


<img src="FirstTree.png" alt="First Tree"/>

No need to be a genius to see that the tree overfits.
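
Overfitting can also be measured rather than read off the picture, by comparing the accuracy on the training data with the accuracy on held-out data. The sketch below uses synthetic data (an assumption, not the Titanic set) with noisy labels; `trainScore` and `validateScore` are names introduced here.

```python
import numpy as np
from sklearn import tree

# Synthetic data (assumption): the label depends on the first feature only,
# and about 20% of the labels are flipped to act as noise
rng = np.random.RandomState(0)
X = rng.rand(400, 3)
y = (X[:, 0] > 0.5).astype(int)
flip = rng.rand(400) < 0.2
y[flip] = 1 - y[flip]

XTrain, yTrain = X[:300], y[:300]
XValidate, yValidate = X[300:], y[300:]

# A tree without any limit keeps splitting until it fits the noise too
unlimited = tree.DecisionTreeClassifier(random_state=0).fit(XTrain, yTrain)
trainScore = unlimited.score(XTrain, yTrain)
validateScore = unlimited.score(XValidate, yValidate)
print('train accuracy: %.2f' % trainScore)
print('validation accuracy: %.2f' % validateScore)
```

A large gap between the two scores is the numeric counterpart of the very deep tree in the picture.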

Let's require a minimum number of samples per leaf (several values were tried: 5, 10).

In [59]:
limitedTree = tree.DecisionTreeClassifier(min_samples_leaf =10)
limitedTree = limitedTree.fit(tidyTrainData,survivedTrain)
ExportGraphAsPng(limitedTree, tidyTrainData.columns,targetLabels,'SecondTree.png')

<img src="SecondTree.png" alt="Second Tree"/>
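
The effect of `min_samples_leaf` on the size of the tree can be sketched by trying several values in a loop. The data below is synthetic (an assumption, not the Titanic set) and `results` is a name introduced for the illustration; `get_n_leaves` needs a reasonably recent scikit-learn.

```python
import numpy as np
from sklearn import tree

# Synthetic noisy data (assumption), split into train and validation parts
rng = np.random.RandomState(0)
X = rng.rand(400, 3)
y = (X[:, 0] > 0.5).astype(int)
flip = rng.rand(400) < 0.2
y[flip] = 1 - y[flip]
XTrain, yTrain = X[:300], y[:300]
XValidate, yValidate = X[300:], y[300:]

# For each limit, record the number of leaves and the validation accuracy
results = {}
for minLeaf in (1, 5, 10):
    model = tree.DecisionTreeClassifier(min_samples_leaf=minLeaf, random_state=0)
    model.fit(XTrain, yTrain)
    results[minLeaf] = (model.get_n_leaves(), model.score(XValidate, yValidate))
    print(minLeaf, results[minLeaf])
```

With at least 10 samples per leaf, a tree trained on 300 rows cannot have more than 30 leaves, which keeps it far smaller than the unlimited one.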