One Step Further
=========

Preamble
---------

In [99]:
import pandas as pd
import numpy as np
import graphviz
from sklearn.tree import DecisionTreeClassifier, export_graphviz

Read the data:

In [55]:
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")

Now, let's repair our data by filling in the missing values (NaNs).

In [56]:
# Fill NaN ages with the median of the ages.
train_data.loc[train_data['Age'].isnull(), 'Age'] = train_data['Age'].median()

# Fill missing embarkations with 'S'.
train_data['Embarked'] = train_data['Embarked'].fillna('S')

Now, encode the 'male' and 'female' values as numbers.

In [57]:
train_data.loc[train_data['Sex'] == 'female', 'Sex'] = 1
train_data.loc[train_data['Sex'] == 'male', 'Sex'] = 0

And encode the ports of embarkation the same way...

In [58]:
train_data.loc[train_data['Embarked'] == 'S', 'Embarked'] = 0
train_data.loc[train_data['Embarked'] == 'C', 'Embarked'] = 1
train_data.loc[train_data['Embarked'] == 'Q', 'Embarked'] = 2
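
The same encoding can also be written with `Series.map`. The sketch below is only an illustration on a tiny made-up frame: the `toy` data and the names `SEX_CODES` and `EMBARKED_CODES` are assumptions, not part of the notebook. Because `map` turns any label missing from the dictionary into NaN, the sketch also checks that nothing was left unmapped.

```python
import pandas as pd

# Hypothetical toy frame standing in for train_data.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

# Same codes as in the cells above (dictionary names are illustrative).
SEX_CODES = {"male": 0, "female": 1}
EMBARKED_CODES = {"S": 0, "C": 1, "Q": 2}

encoded = toy.copy()
encoded["Sex"] = encoded["Sex"].map(SEX_CODES)
encoded["Embarked"] = encoded["Embarked"].map(EMBARKED_CODES)

# map() yields NaN for labels that are not in the dictionary, so verify coverage.
assert not encoded[["Sex", "Embarked"]].isnull().any().any()

print(encoded.dtypes)
```

With every label covered by the dictionary, the mapped columns come out with an integer dtype.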

Now let's print the Sex and Embarked columns.

In [59]:
print(train_data[['Sex', 'Embarked']])

    Sex Embarked
0     0        0
1     1        1
2     1        0
3     1        0
4     0        0
5     0        2
6     0        0
7     0        0
8     1        0
9     1        1
10    1        0
11    1        0
12    0        0
13    0        0
14    1        0
15    1        0
16    0        2
17    0        0
18    1        0
19    1        1
20    0        0
21    0        0
22    1        2
23    0        0
24    1        0
25    1        0
26    0        1
27    0        0
28    1        2
29    0        0
..   ..      ...
861   0        0
862   1        0
863   1        0
864   0        0
865   1        0
866   1        1
867   0        0
868   0        0
869   0        0
870   0        0
871   1        0
872   0        0
873   0        0
874   1        1
875   1        1
876   0        0
877   0        0
878   0        0
879   1        1
880   1        0
881   0        0
882   1        0
883   0        0
884   0        0
885   1        2
886   0        0
887   1       

Good, both columns are encoded. Let's train a decision tree on four of the features.

In [60]:
target = train_data['Survived'].values
features = train_data[['Pclass', 'Sex', 'Age', 'Fare']].values
decision_tree = DecisionTreeClassifier()
decision_tree = decision_tree.fit(features, target)

print(decision_tree.feature_importances_)
print(decision_tree.score(features, target))

[ 0.13031677  0.31274009  0.24381015  0.31313298]
0.977553310887
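
Note that `score(features, target)` is evaluated on the same rows the tree was fitted on. To estimate accuracy on unseen rows, part of the labelled data can be held out. The sketch below uses synthetic data from `make_classification` as a stand-in for the Titanic arrays; the 25% split and the `random_state` values are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for (features, target); this is NOT Titanic data.
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1,
    flip_y=0.2, random_state=0,
)

# Keep 25% of the rows aside; the tree never sees them during fitting.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_score = tree.score(X_train, y_train)
valid_score = tree.score(X_valid, y_valid)
print(f"train accuracy:      {train_score:.3f}")
print(f"validation accuracy: {valid_score:.3f}")
```

A large gap between the two numbers is a sign that the tree has memorised the training rows.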


Now let's look for the gaps in our test data:

In [61]:
for col in ['Pclass', 'Sex', 'Age', 'Fare']:
    print('===================')
    print(col)
    print(test_data.loc[test_data[col].isnull(), col])
    print('===================')
    print()

Pclass
Series([], Name: Pclass, dtype: int64)

Sex
Series([], Name: Sex, dtype: object)

Age
10    NaN
22    NaN
29    NaN
33    NaN
36    NaN
39    NaN
41    NaN
47    NaN
54    NaN
58    NaN
65    NaN
76    NaN
83    NaN
84    NaN
85    NaN
88    NaN
91    NaN
93    NaN
102   NaN
107   NaN
108   NaN
111   NaN
116   NaN
121   NaN
124   NaN
127   NaN
132   NaN
133   NaN
146   NaN
148   NaN
       ..
268   NaN
271   NaN
273   NaN
274   NaN
282   NaN
286   NaN
288   NaN
289   NaN
290   NaN
292   NaN
297   NaN
301   NaN
304   NaN
312   NaN
332   NaN
339   NaN
342   NaN
344   NaN
357   NaN
358   NaN
365   NaN
366   NaN
380   NaN
382   NaN
384   NaN
408   NaN
410   NaN
413   NaN
416   NaN
417   NaN
Name: Age, Length: 86, dtype: float64

Fare
152   NaN
Name: Fare, dtype: float64
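
When only the number of gaps per column matters, `isnull().sum()` gives a shorter summary than listing every missing row. A minimal sketch on a made-up frame (the `toy` values are illustrative, not taken from test.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for test_data.
toy = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Sex": ["male", "female", "male"],
    "Age": [34.5, np.nan, np.nan],
    "Fare": [7.83, np.nan, 9.69],
})

# isnull() marks missing cells; sum() counts them column by column.
missing_counts = toy[["Pclass", "Sex", "Age", "Fare"]].isnull().sum()
print(missing_counts)
```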



Let's fill the missing numeric values with medians and encode the textual Sex column as numbers.

In [62]:
test_data.loc[test_data.Sex == 'male', 'Sex'] = 0
test_data.loc[test_data.Sex == 'female', 'Sex'] = 1
test_data.loc[test_data.Age.isnull(), 'Age'] = test_data.Age.median()
test_data.loc[test_data.Fare.isnull(), 'Fare'] = test_data.Fare.median()
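
The cell above takes the medians from the test set itself. A common alternative is to compute the fill values once on the training set and reuse them for the test set, so both frames are repaired the same way. The sketch below shows that variant on made-up frames; `train_toy`, `test_toy` and their values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for train_data and test_data.
train_toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})
test_toy = pd.DataFrame({
    "Age": [np.nan, 47.0, 62.0],
    "Fare": [7.83, np.nan, 9.69],
})

# Medians are computed once, on the training frame only.
medians = train_toy[["Age", "Fare"]].median()

# fillna with a Series fills each column with the value at the matching label.
train_filled = train_toy.fillna(medians)
test_filled = test_toy.fillna(medians)

print(medians)
print(test_filled)
```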

Now, let's submit another prediction.
------------------------------------------

In [98]:
test_features = test_data[['Pclass', 'Sex', 'Age', 'Fare']].values
prediction = decision_tree.predict(test_features)
pids = test_data['PassengerId'].values
solution = pd.DataFrame({'PassengerId': pids, 'Survived': prediction})
print(solution.shape)
solution.to_csv("dt_submission.csv", index=False)

(418, 2)
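
Beyond the shape, a few more checks can catch a malformed file before it is uploaded. The helper below is a hypothetical sketch (`check_submission` is not part of the notebook or of any library); it only verifies properties of the frame built above: the two column names, the row count, unique passenger IDs and binary predictions.

```python
import pandas as pd


def check_submission(solution: pd.DataFrame, expected_rows: int) -> None:
    """Raise AssertionError if the frame does not look like the one built above."""
    assert list(solution.columns) == ["PassengerId", "Survived"]
    assert len(solution) == expected_rows
    assert solution["PassengerId"].is_unique
    assert set(solution["Survived"].unique()) <= {0, 1}


# Hypothetical toy frame standing in for `solution`.
toy_solution = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})
check_submission(toy_solution, expected_rows=3)
print("submission looks consistent")
```

In the notebook this would be called as `check_submission(solution, expected_rows=len(test_data))`.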


Hmm... Not good enough. Let's train our tree again, this time limiting its depth and the minimum number of samples needed to split a node.

In [146]:
decision_tree_improved = DecisionTreeClassifier(
    max_depth=10,
    min_samples_split=10,
    random_state=1
)
decision_tree_improved = decision_tree_improved.fit(features, target)
print(decision_tree_improved.feature_importances_)
print(decision_tree_improved.score(features, target))

dot_data = export_graphviz(
    decision_tree_improved,
    out_file=None,
    feature_names=['Pclass', 'Sex', 'Age', 'Fare'],
    class_names=['0', '1'],
    special_characters=True
)
graph = graphviz.Source(dot_data)
# graph.view()  # Uncomment to render the tree and open it in a viewer.

[ 0.16905181  0.44303111  0.15308105  0.23483602]
0.896745230079
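
The values `max_depth=10` and `min_samples_split=10` were picked by hand. One way to choose them more systematically is a cross-validated grid search. The sketch below runs `GridSearchCV` on synthetic data; the data, the candidate grid and the 5-fold setting are all assumptions, not values tuned for the Titanic set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for (features, target); this is NOT Titanic data.
X, y = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=1,
    random_state=1,
)

# Candidate values are illustrative only.
param_grid = {
    "max_depth": [3, 5, 10],
    "min_samples_split": [2, 10, 20],
}

# Each combination is scored by 5-fold cross-validated accuracy.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds a tree refitted on all rows with the best combination found.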


Let's predict on the test set with the tuned tree.

In [147]:
prediction = decision_tree_improved.predict(test_features)
pids = test_data['PassengerId'].values
solution = pd.DataFrame({'PassengerId': pids, 'Survived': prediction})
print(solution)
solution.to_csv("dt_submission2.csv", index=False)

     PassengerId  Survived
0            892         0
1            893         0
2            894         0
3            895         0
4            896         1
5            897         0
6            898         0
7            899         0
8            900         1
9            901         0
10           902         0
11           903         0
12           904         1
13           905         0
14           906         1
15           907         1
16           908         0
17           909         0
18           910         0
19           911         0
20           912         0
21           913         1
22           914         1
23           915         0
24           916         1
25           917         0
26           918         1
27           919         0
28           920         0
29           921         0
..           ...       ...
388         1280         0
389         1281         0
390         1282         0
391         1283         1
392         1284         0
3