We can plot the learning curves for a Random Forest model trained on all the features we extracted.

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
# sklearn.cross_validation and sklearn.learning_curve were merged into
# sklearn.model_selection in scikit-learn 0.18 and removed in 0.20
from sklearn.model_selection import ShuffleSplit, learning_curve
import numpy as np
import matplotlib.pyplot as plt
from helpers import helpers


We first load the training data, which has already been preprocessed.

In [2]:
# load data
df = pd.read_csv('../../preprocessing/output/train.csv', sep=',')

# split X and y
y = df['Survived']
X = df.drop(columns='Survived')

# store the feature list
features_list = X.columns.values

Next, we initialise a simple Random Forest model and use it to look at the most important features.

In [3]:
# init random forest
model_random_forest = RandomForestClassifier(n_estimators=1000)
# fit model
model_random_forest.fit(X, y)
# check feature importance.
# http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html
importances = model_random_forest.feature_importances_
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

# show the 20 highest-ranked features
for rank, idx in enumerate(indices[:20], start=1):
    print("%d. feature %s (%f)" % (rank, features_list[idx], importances[idx]))


Feature ranking:
1. feature age (0.139451)
2. feature fare (0.105445)
3. feature title_Mr (0.090302)
4. feature sex (0.083041)
5. feature family_size (0.040963)
6. feature title_Miss (0.040464)
7. feature bracket (0.033953)
8. feature class_3 (0.031704)
9. feature title_Mrs (0.029439)
10. feature cabin_count (0.020533)
11. feature first_ticket_digit_1 (0.020510)
12. feature quotes (0.016983)
13. feature class_1 (0.015461)
14. feature class_2 (0.013766)
15. feature first_ticket_digit_3 (0.013607)
16. feature port_S (0.012602)
17. feature first_ticket_digit_2 (0.010824)
18. feature port_C (0.009712)
19. feature title_Sir (0.008717)
20. feature cabin_deck_E (0.006928)


The most important features appear to be age, fare, the title Mr and sex, which is consistent with what we observed in the data. It is also good to see that several of the features we engineered rank in the top 20, namely the bracket indicator (7th), the first ticket digits and the quotes feature (12th).
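A single ranking does not show how stable each importance is. As a rough sketch, the spread of each feature's importance across the individual trees can be reported next to the mean, as the scikit-learn example linked in the cell above does. The data here is synthetic and the names `X_demo`, `y_demo`, `demo_features` and `forest` are placeholders, not the notebook's own variables.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the preprocessed training data (assumption:
# the real X and y come from train.csv in the notebook).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0,
    random_state=0,
)
demo_features = np.array(["f%d" % i for i in range(X_demo.shape[1])])

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_demo, y_demo)

# Forest-level importance, plus the spread across the individual trees.
importances = forest.feature_importances_
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
spread = per_tree.std(axis=0)

# Sort from most to least important.
order = np.argsort(importances)[::-1]

print("Feature ranking:")
for rank, idx in enumerate(order, start=1):
    print("%d. feature %s (%f +/- %f)"
          % (rank, demo_features[idx], importances[idx], spread[idx]))
```

A feature whose spread is as large as its mean importance should be read with caution, since its position in the ranking may change from one fit to the next.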

We can draw the learning curves of this model to see how it performs.

In [4]:
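# initialise cross-validation: 10 random 80/20 train/test splits
# (ShuffleSplit lives in sklearn.model_selection since scikit-learn 0.18)
from sklearn.model_selection import ShuffleSplit
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)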
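To make the cross-validation scheme concrete, the following sketch shows what a shuffle-split iterator with these settings yields. It uses a dummy array of 100 rows rather than the Titanic data, and the names `X_dummy`, `cv_demo` and `splits` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Dummy data: 100 rows, 3 columns (assumption, not the real training set).
X_dummy = np.arange(300).reshape(100, 3)

# Same settings as the notebook: 10 random splits, 20% held out each time.
cv_demo = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

# Each split is a pair of index arrays (train rows, test rows).
splits = list(cv_demo.split(X_dummy))
for i, (train_idx, test_idx) in enumerate(splits, start=1):
    print("split %d: %d train rows, %d test rows"
          % (i, len(train_idx), len(test_idx)))
```

Unlike k-fold, the test sets of different splits are drawn independently, so a given row may be held out in several splits or in none.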


In [None]:
helpers.plot_learning_curve(model_random_forest, "Random Forest Learning Curves", X, y, n_jobs=-1, cv=cv)
plt.show()
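The `helpers.plot_learning_curve` function is not shown in this notebook. A minimal sketch of what such a helper might do, assuming it wraps `sklearn.model_selection.learning_curve`, is given below. The function name `plot_learning_curve_sketch`, its signature and the synthetic data are hypothetical and are not the actual helper.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit, learning_curve


def plot_learning_curve_sketch(estimator, title, X, y, cv, train_sizes=None):
    """Hypothetical stand-in for helpers.plot_learning_curve."""
    if train_sizes is None:
        train_sizes = np.linspace(0.1, 1.0, 5)

    # Refit the estimator on growing subsets of each training split and
    # score it on both the subset and the held-out split.
    sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, train_sizes=train_sizes
    )
    train_mean, train_std = train_scores.mean(axis=1), train_scores.std(axis=1)
    test_mean, test_std = test_scores.mean(axis=1), test_scores.std(axis=1)

    fig, ax = plt.subplots()
    ax.set_title(title)
    ax.set_xlabel("Training examples")
    ax.set_ylabel("Score")
    # Shaded bands show one standard deviation across the CV splits.
    ax.fill_between(sizes, train_mean - train_std, train_mean + train_std, alpha=0.1)
    ax.fill_between(sizes, test_mean - test_std, test_mean + test_std, alpha=0.1)
    ax.plot(sizes, train_mean, "o-", label="Training score")
    ax.plot(sizes, test_mean, "o-", label="Cross-validation score")
    ax.legend(loc="best")
    return sizes, train_mean, test_mean


# Synthetic data (assumption) so the sketch runs without train.csv.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
cv_demo = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
sizes, train_mean, test_mean = plot_learning_curve_sketch(
    RandomForestClassifier(n_estimators=50, random_state=0),
    "Random Forest Learning Curves (synthetic data)",
    X_demo, y_demo, cv_demo,
)
# In a notebook, call plt.show() to render the figure.
```

A large, persistent gap between the training and cross-validation curves usually points to overfitting, while two curves that converge at a low score suggest the model is underfitting.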