### Introduction - Explore the Titanic survivors dataset using sklearn

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.

Let's see if we can predict which types of passengers were most likely to survive, using decision-tree-based classifiers.

### Import Data

If you are using Linux or macOS, you can download the data using the cell below. On Windows, it's recommended to download the data manually, save it in the same folder as this .ipynb file, and skip directly to cell 2.

In [1]:
!wget https://raw.githubusercontent.com/egrochos/DevNetCreate2019/master/titanic.csv

'wget' is not recognized as an internal or external command,
operable program or batch file.
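
For a download that does not depend on `wget` being installed (for example on Windows), Python's standard library can fetch the file instead. This is a minimal sketch: the helper name `download_if_missing` is hypothetical, and the URL in the usage comment is the one from the cell above.

```python
import urllib.request
from pathlib import Path


def download_if_missing(url, dest):
    """Download `url` to `dest` unless `dest` already exists; return the path."""
    dest = Path(dest)
    if not dest.exists():
        # urlretrieve copies the resource at `url` into a local file
        urllib.request.urlretrieve(url, str(dest))
    return dest


# Usage in the notebook (uncomment to run; needs network access):
# download_if_missing(
#     'https://raw.githubusercontent.com/egrochos/DevNetCreate2019/master/titanic.csv',
#     'titanic.csv')
```

Because the helper skips the download when the file is already present, re-running the cell is harmless.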


In [2]:
import pandas as pd
data = pd.read_csv('titanic.csv')

Let's take a look at our data:

In [3]:
print(data.shape)
data.head()

(891, 12)


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


**Explanation of the columns:**

**Survived** - 1 if survived, 0 if not

**Pclass** - Ticket class, a proxy for socio-economic status (SES)

1 = Upper (1st class)

2 = Middle (2nd class)

3 = Lower (3rd class)

**SibSp** - Number of siblings/spouses aboard

**Parch** - Number of parents/children aboard

**Embarked** - The port at which the passenger embarked: C = Cherbourg, Q = Queenstown, S = Southampton
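
Since Survived is coded as 0/1, averaging it within a group gives that group's survival rate, which is a quick way to compare groups such as Sex or Pclass. The sketch below uses a small made-up sample (the values are hypothetical and are not taken from titanic.csv); only the column names match the dataset.

```python
import pandas as pd

# Hypothetical mini-sample; values are invented for illustration only
sample = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'female', 'male'],
    'Pclass': [3, 1, 3, 1, 2, 3],
    'Survived': [0, 1, 1, 1, 1, 0],
})

# Survived is 0/1, so the group mean is the fraction of survivors in that group
rate_by_sex = sample.groupby('Sex')['Survived'].mean()
rate_by_class = sample.groupby('Pclass')['Survived'].mean()

print(rate_by_sex)
print(rate_by_class)
```

The same two `groupby` lines should work on the full DataFrame loaded from the CSV, assuming it is bound to `data` as in the cells above.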

### Feature Selection

Based on the data above, we can do an initial feature selection. Let's use the following columns: Pclass, Sex, Age, SibSp, Parch and Fare (plus Survived, which is the value we will predict).

In [4]:
cols_to_use = ['Pclass','Sex','Age','SibSp','Parch','Fare','Survived']
data = data[cols_to_use]

In [5]:
data.head()

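| | Pclass | Sex | Age | SibSp | Parch | Fare | Survived |
|---|---|---|---|---|---|---|---|
| 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | 0 |
| 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | 1 |
| 2 | 3 | female | 26.0 | 0 | 0 | 7.925 | 1 |
| 3 | 1 | female | 35.0 | 1 | 0 | 53.1 | 1 |
| 4 | 3 | male | 35.0 | 0 | 0 | 8.05 | 0 |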


Let's check if there are any null values in the data:

In [6]:
data.isnull().values.any()

True

The dataset contains some null values. We won't be able to train the models below on data that contains null values (at least with the scikit-learn version used here), so let's drop the affected rows to avoid errors.

In [7]:
data = data.dropna()
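
Before dropping rows, it can help to see which columns hold the nulls and how many rows `dropna()` removes. A minimal sketch on a made-up frame (the `toy` DataFrame and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing Age and one missing Fare
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Fare': [7.25, 71.2833, np.nan, 53.1],
    'Survived': [0, 1, 1, 1],
})

# Number of null values per column
null_counts = toy.isnull().sum()

# dropna() removes every row that has at least one null value
cleaned = toy.dropna()

print(null_counts)
print(f"rows before: {len(toy)}, rows after: {len(cleaned)}")
```

Note that `dropna()` discards the whole row even if only one column is missing, so the dataset can shrink noticeably when a single column has many gaps.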

### Train a simple model

Now that our data is ready, let's first train a simple model using only Sex, Age and Fare. The idea is to start with a simple model to see what the decision tree will look like.

In [8]:
features = data[['Sex','Age','Fare']]
labels = data['Survived']

Before proceeding, we need to convert Sex into an integer value of 0 or 1

In [9]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
features = features.assign(Sex=features['Sex'].map({'male': 0, 'female': 1}))


Before training, it's recommended to split the data into two subsets: training data (used to train the model) and test data (used to evaluate the model's performance).

In [10]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, labels, random_state=42)
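
When no size is given, `train_test_split` holds out 25% of the rows for testing. The sketch below shows this on synthetic data and also shows the optional `stratify` argument, which keeps the class proportions the same in both subsets. The arrays here are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 60 of class 0 and 40 of class 1
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 60 + [1] * 40)

# Default split: 75% training rows, 25% test rows
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)
print(len(X_tr), len(X_te))

# Stratified split: the test set keeps the 60/40 class ratio of y
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X, y, random_state=42, stratify=y)
print(y_te_s.mean())
```

Fixing `random_state` makes the split reproducible, so the accuracy reported later does not change between runs.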

The cell below trains the model using the training data:

In [11]:
from sklearn.tree import DecisionTreeClassifier

DecisionTreeModel = DecisionTreeClassifier(max_depth=3)
DecisionTreeModel.fit(X_train,y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

And here we evaluate the model's accuracy using the test data:

In [12]:
from sklearn.metrics import accuracy_score

y_predict = DecisionTreeModel.predict(X_test)
accuracy_score(y_test, y_predict)

0.7318435754189944
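
Accuracy is a single number, so it does not show which kind of mistake the model makes. A confusion matrix splits the predictions into correct and incorrect counts per class. A minimal sketch with made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical, not the notebook's results):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions (0 = not survived, 1 = survived)
y_true_demo = [0, 0, 0, 1, 1, 1, 0, 1]
y_pred_demo = [0, 0, 1, 1, 0, 1, 0, 1]

# Fraction of predictions that match the true labels
acc = accuracy_score(y_true_demo, y_pred_demo)

# Rows are true classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
cm = confusion_matrix(y_true_demo, y_pred_demo)

print(acc)
print(cm)
```

In the notebook, the same call would be `confusion_matrix(y_test, y_predict)`.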

Finally, let's export the decision tree to a PNG file (this requires the Graphviz `dot` command to be installed):

In [13]:
from subprocess import call
from sklearn.tree import export_graphviz

export_graphviz(DecisionTreeModel, out_file='simple_tree.dot',
                feature_names=features.columns, impurity=False,
                class_names=['Not survived', 'Survived'], filled=True)
# Convert the DOT file to PNG; a return value of 0 means success
call(['dot', '-T', 'png', 'simple_tree.dot', '-o', 'simple_tree.png'])

0
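
If Graphviz is not available, the tree can also be inspected as plain text. The sketch below uses `sklearn.tree.export_text`, which as far as I know requires scikit-learn 0.21 or newer, so it may not exist in the version that produced the outputs in this notebook. The toy data is made up so that Sex alone separates the classes.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: Sex perfectly predicts the label, Age does not
toy_X = pd.DataFrame({
    'Sex': [0, 0, 0, 1, 1, 1],
    'Age': [22.0, 35.0, 40.0, 38.0, 26.0, 35.0],
})
toy_y = [0, 0, 0, 1, 1, 1]

toy_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
toy_tree.fit(toy_X, toy_y)

# Render the learned splits as indented text rules
rules = export_text(toy_tree, feature_names=list(toy_X.columns))
print(rules)
```

For the notebook's model, the equivalent call would pass `DecisionTreeModel` and `list(features.columns)`.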

### Train a complete model

Now that we have trained a simple model using just Sex, Age and Fare, let's train a new model using more features/columns and a random forest, which is an ensemble of decision trees.

In [14]:
features = data[['Pclass','Sex','Age','SibSp','Parch','Fare']]
labels = data['Survived']

In [15]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
features = features.assign(Sex=features['Sex'].map({'male': 0, 'female': 1}))


In [16]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, labels, random_state=42)

And here we train our model using the training data:

In [17]:
from sklearn.ensemble import RandomForestClassifier

RandomForestModel = RandomForestClassifier(min_samples_leaf=3, min_samples_split=20,
                                           n_estimators=500, max_depth=None, random_state=10)
RandomForestModel.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=3, min_samples_split=20,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=None,
            oob_score=False, random_state=10, verbose=0, warm_start=False)

In [18]:
y_predict = RandomForestModel.predict(X_test)
accuracy_score(y_test, y_predict)

0.8100558659217877

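Accuracy improved from about 73% to 81%. Note that both the feature set and the model type changed (a single decision tree versus a random forest), so the gain cannot be attributed to either change alone.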
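
A single train/test split yields only one accuracy estimate, which can shift with the choice of `random_state`. K-fold cross-validation averages over several splits and gives a sense of the spread. A minimal sketch on synthetic data from `make_classification` (not the Titanic data), with a smaller forest to keep it fast:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification problem with 6 features
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=10)

# 5-fold cross-validation: one accuracy score per held-out fold
scores = cross_val_score(model, X_demo, y_demo, cv=5)

print(scores)
print(scores.mean(), scores.std())
```

In the notebook, `cross_val_score(RandomForestModel, features, labels, cv=5)` would be the corresponding call.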

In [19]:
from subprocess import call
from sklearn.tree import export_graphviz

# Export only the first of the trees in the forest
estimator = RandomForestModel.estimators_[0]
export_graphviz(estimator, out_file='complete_tree.dot',
                feature_names=features.columns, impurity=False,
                class_names=['Not survived', 'Survived'], filled=True)
call(['dot', '-T', 'png', 'complete_tree.dot', '-o', 'complete_tree.png'])

0

### Also, let's check the importance of each feature

In [20]:
# Pair each feature name with its importance, then sort from most to least important
importances = pd.DataFrame({'feature': X_train.columns,
                            'importance': RandomForestModel.feature_importances_})
importances = importances.sort_values('importance', ascending=False).set_index('feature')
importances.head(15)

| feature | importance |
|---|---|
| Sex | 0.430488 |
| Fare | 0.191178 |
| Age | 0.183515 |
| Pclass | 0.130419 |
| SibSp | 0.035417 |
| Parch | 0.028982 |
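
The values in `feature_importances_` are impurity-based and normalized to sum to 1, so each value can be read as a feature's share of the total. The sketch below checks this on synthetic data from `make_classification` (not the Titanic data).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification problem with 6 features
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=10)
forest.fit(X_demo, y_demo)

# One importance value per feature; together they add up to 1
total = forest.feature_importances_.sum()
print(forest.feature_importances_)
print(total)
```

Impurity-based importances are computed from the training data only, so they are best treated as a rough ranking rather than a precise measure of each feature's effect.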


In [21]:
importances.plot.bar()

<matplotlib.axes._subplots.AxesSubplot at 0x257373c9c18>