# Titanic - Random Forest Classifier
Since I've done the Titanic exercise about a thousand times, I decided to go straight to the data manipulation and model building, since I already have a good understanding of how the features correlate.

There is still a lot of room for improvement (obviously) regarding feature engineering, but I'm pretty happy with what I've achieved so far, especially considering this is my first kernel on Kaggle.

In [44]:
# Imports

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('darkgrid')

In [45]:
# Importing the data and checking the first 5 lines of it.

titanic_df = pd.read_csv('../input/train.csv')
test_df = pd.read_csv('../input/test.csv')
titanic_df.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [46]:
# Now let's check the info on both titanic_df and test_df

print(titanic_df.info())
print('---------------------------------------')
print(test_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None
---------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-

In [47]:
# Checking for missing values on both data sets

print(titanic_df.isnull().sum())
print('------------------')
print(test_df.isnull().sum())

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
------------------
PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64
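
Raw counts are easier to compare as fractions (for example, 687 missing `Cabin` values out of 891 rows is roughly 77%). A minimal sketch of that calculation, assuming a small made-up frame (`toy_df` is hypothetical, not the competition data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for titanic_df
toy_df = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Fare': [7.25, 71.28, 7.92, 53.1],
})

# isnull() gives a boolean frame; the column mean is the fraction of missing values
missing_share = toy_df.isnull().mean()
print(missing_share)
```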


In [48]:
# Let's check the most common value in 'Embarked', so we can fill the 2 NaNs in titanic_df

titanic_df['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [49]:
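# Assign the result back instead of calling fillna(inplace=True) on a selected column,
# which newer pandas versions may not propagate to the parent DataFrame
titanic_df['Embarked'] = titanic_df['Embarked'].fillna('S')
test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].mean())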
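
The fill value `'S'` is hard-coded from the counts printed above. As a sketch of one alternative, the most frequent value can be computed directly with `Series.mode()`; the `embarked` Series below is a hypothetical stand-in, not the real column:

```python
import pandas as pd

# Hypothetical stand-in for the 'Embarked' column
embarked = pd.Series(['S', 'C', None, 'S', 'Q', None])

# mode() ignores missing values and returns the most frequent value(s); take the first
most_common = embarked.mode().iloc[0]
embarked_filled = embarked.fillna(most_common)
print(embarked_filled.tolist())
```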

In [50]:
# Here I'm trying to find what's the average age by class, so we can more accurately fill the
# missing values in the 'Age' column

first_class_age = int(titanic_df[titanic_df['Pclass'] == 1]['Age'].mean())
second_class_age = int(titanic_df[titanic_df['Pclass'] == 2]['Age'].mean())
third_class_age = int(titanic_df[titanic_df['Pclass'] == 3]['Age'].mean())

In [51]:
print('First Class Average Age: {}'.format(first_class_age))
print('Second Class Average Age: {}'.format(second_class_age))
print('Third Class Average Age: {}'.format(third_class_age))

First Class Average Age: 38
Second Class Average Age: 29
Third Class Average Age: 25


In [52]:
# Now that we have the average age by class, let's fill those NaN's

def impute_age(cols):
    # Access by label: integer position lookups on a labelled row are deprecated in pandas
    age = cols['Age']
    pclass = cols['Pclass']

    if pd.isnull(age):
        if pclass == 1:
            return first_class_age
        elif pclass == 2:
            return second_class_age
        else:
            return third_class_age
    else:
        return age

In [53]:
titanic_df['Age'] = titanic_df[['Age', 'Pclass']].apply(impute_age, axis=1)
test_df['Age'] = test_df[['Age', 'Pclass']].apply(impute_age, axis=1)
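
The same per-class fill can also be written without a row-wise `apply`. This is only a sketch on a made-up frame (`toy` and its values are hypothetical); unlike the cells above it keeps the class means as floats rather than truncating them to integers:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two passengers per class, some ages missing
toy = pd.DataFrame({
    'Pclass': [1, 1, 2, 2, 3, 3],
    'Age': [40.0, np.nan, 30.0, np.nan, 20.0, 24.0],
})

# Mean age per class, indexed by Pclass (NaN ages are skipped by mean())
class_mean_age = toy.groupby('Pclass')['Age'].mean()

# Look up each row's class mean and use it only where Age is missing
toy['Age'] = toy['Age'].fillna(toy['Pclass'].map(class_mean_age))
print(toy)
```

Because `class_mean_age` is an ordinary Series, the same lookup could be reused on a second frame so that both data sets are filled with means learned from the training data.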

In [54]:
# We can drop the Cabin column since it's missing so many values, and I don't think we can
# engineer a useful feature from it.

titanic_df.drop('Cabin', axis=1, inplace=True)
test_df.drop('Cabin', axis=1, inplace=True)
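
If dropping the column feels too lossy, one option sometimes tried is to keep only whether a cabin was recorded, or the leading letter of the cabin code. A rough sketch using the five `Cabin` values shown in the `head()` output above; the names `has_cabin` and `deck` and the `'U'` placeholder are assumptions, and whether these features help is untested here:

```python
import numpy as np
import pandas as pd

# Cabin values of the first five training rows shown earlier
cabin = pd.Series([np.nan, 'C85', np.nan, 'C123', np.nan])

# 1 if a cabin is recorded, 0 otherwise
has_cabin = cabin.notnull().astype(int)

# First character of the cabin code, with 'U' (unknown) where the cabin is missing
deck = cabin.str[0].fillna('U')
print(has_cabin.tolist(), deck.tolist())
```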

In [55]:
# Let's get the length of names into a new column, since that has a relationship with survival
# rate

titanic_df['Name_Len'] = titanic_df['Name'].str.len()
test_df['Name_Len'] = test_df['Name'].str.len()

In [56]:
# We can also take the title out of the 'Name' feature and put it in a new column

titanic_df['Name_Title'] = titanic_df['Name'].apply(lambda x: x.split(',')[1]).apply(lambda x: x.split()[0])
test_df['Name_Title'] = test_df['Name'].apply(lambda x: x.split(',')[1]).apply(lambda x: x.split()[0])

In [57]:
# Let's check the titles

print(titanic_df['Name_Title'].value_counts())
print('--------------------')
print(test_df['Name_Title'].value_counts())

Mr.          517
Miss.        182
Mrs.         125
Master.       40
Dr.            7
Rev.           6
Mlle.          2
Major.         2
Col.           2
Don.           1
Sir.           1
Capt.          1
the            1
Jonkheer.      1
Mme.           1
Lady.          1
Ms.            1
Name: Name_Title, dtype: int64
--------------------
Mr.        240
Miss.       78
Mrs.        72
Master.     21
Rev.         2
Col.         2
Dr.          1
Ms.          1
Dona.        1
Name: Name_Title, dtype: int64
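
The `the` entry in the counts above comes from taking only the first whitespace-separated token after the comma, which cuts a multi-word title short. A sketch of a regex-based extraction that keeps everything between the comma and the first period; the third name below is an illustrative example of such a title, and note that this variant returns titles without the trailing dot:

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',  # illustrative multi-word title
])

# Capture the text between the comma and the first period
titles_extracted = names.str.extract(r',\s*([^.]+)\.', expand=False).str.strip()
print(titles_extracted.tolist())
```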


In [58]:
# We can create a dictionary that maps each title to an integer code to be used as a feature
# in our model

titles = {'Mr.': 0, 'Miss.': 1, 'Mrs.': 2, 'Master.': 3, 'Dr.': 4, 'Rev.': 5, 'Major.': 6,
          'Col.': 7, 'Mlle.': 8, 'Jonkheer.': 9, 'Sir.': 10, 'Mme.': 11, 'the': 12, 'Don.': 13,
          'Capt.' : 14, 'Lady.': 15, 'Ms.': 16, 'Dona.': 17
         }

In [59]:
# Replacing the titles with their corresponding integer codes
# (every title seen in either data set has a key in the dictionary)

titanic_df['Name_Title'] = titanic_df['Name_Title'].map(titles)
test_df['Name_Title'] = test_df['Name_Title'].map(titles)
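
The dictionary only works because every title in both files was listed by hand. As a sketch of a more defensive encoding, `Series.map` can be combined with a fallback code for titles that are not in the mapping; the shortened `title_codes` dictionary and the `UNKNOWN_CODE` sentinel are hypothetical:

```python
import pandas as pd

# Shortened, hypothetical mapping; 'Dona.' is deliberately left out
title_codes = {'Mr.': 0, 'Miss.': 1, 'Mrs.': 2}
titles_seen = pd.Series(['Mr.', 'Mrs.', 'Dona.', 'Miss.'])

# map() yields NaN for values missing from the dict; replace those with a sentinel code
UNKNOWN_CODE = -1
encoded = titles_seen.map(title_codes).fillna(UNKNOWN_CODE).astype(int)
print(encoded.tolist())
```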

In [60]:
# Getting dummy variables

sex_dummy_titanic = pd.get_dummies(titanic_df['Sex'], drop_first=True)
sex_dummy_test = pd.get_dummies(test_df['Sex'], drop_first=True)

embark_dummy_titanic = pd.get_dummies(titanic_df['Embarked'], drop_first=True)
embark_dummy_test = pd.get_dummies(test_df['Embarked'], drop_first=True)
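
Note that the dummy frames built here are not joined back onto `titanic_df` or `test_df` anywhere in this notebook, so the models below never see them. A minimal sketch of how such a join could look, on a hypothetical three-row frame (adding these columns to the real data would change the scores reported later):

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one numeric column
train = pd.DataFrame({'Sex': ['male', 'female', 'female'], 'Fare': [7.25, 71.28, 7.92]})

# drop_first=True keeps a single 'male' indicator column
sex_dummy = pd.get_dummies(train['Sex'], drop_first=True)

# Replace the original text column with its indicator column
train_encoded = pd.concat([train.drop('Sex', axis=1), sex_dummy], axis=1)
print(train_encoded)
```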

In [61]:
# Let's drop the rest of the columns that we are not going to use
# Note that we are not dropping the PassengerId in the test_df because we need it for our
# submission file for the competition

titanic_df.drop(['PassengerId', 'Name', 'Sex', 'Ticket', 'Embarked'], axis=1, inplace=True)
test_df.drop(['Name', 'Sex', 'Ticket', 'Embarked'], axis=1, inplace=True)

In [62]:
# It's time to start modeling, first let's separate training and testing sets

from sklearn.model_selection import train_test_split

In [63]:
# We will use only the data from titanic_df here because we need the labels to compute
# accuracy_score. After selecting the best model, we'll use it to predict on test_df.

X = titanic_df.drop('Survived', axis=1)
y = titanic_df['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=101)
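
With a small, imbalanced target it can help to keep the class ratio the same in both splits. A sketch using the `stratify` argument of `train_test_split` on synthetic arrays (`X_toy` and `y_toy` are made up; the notebook's own split does not stratify):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 40% positive class
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 60 + [1] * 40)

# stratify=y_toy keeps roughly 40% positives in both the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=101, stratify=y_toy)
print(len(X_tr), len(X_te), y_te.mean())
```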

In [64]:
# Importing evaluation modules

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
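# GridSearchCV lives in sklearn.model_selection; the old sklearn.grid_search module was removed
from sklearn.model_selection import GridSearchCV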

In [65]:
# Decision Tree 

from sklearn.tree import DecisionTreeClassifier

In [66]:
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
dt_pred = dt.predict(X_test)
dt_acc = accuracy_score(y_test, dt_pred)
print('Accuracy: {}'.format(dt_acc))

Accuracy: 0.7593220338983051
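
`confusion_matrix` and `classification_report` are imported above, but only `accuracy_score` is used in the cells that follow. A small sketch of what the other two return, on made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical):

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical true and predicted labels
y_true_toy = [0, 0, 1, 1, 1]
y_pred_toy = [0, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy)
acc = accuracy_score(y_true_toy, y_pred_toy)

# Per-class precision, recall and F1 as a formatted string
report = classification_report(y_true_toy, y_pred_toy)
print(cm)
print(report)
```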


In [67]:
# Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier

In [68]:
score = []

for i in range(1, 101):
    clf = RandomForestClassifier(n_estimators=i, criterion='entropy', random_state=101)
    clf.fit(X_train, y_train)
    clf_pred = clf.predict(X_test)
    score.append(accuracy_score(y_test, clf_pred))
    
print(max(score))

0.840677966102


In [69]:
score.index(max(score))

18
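
The list index is zero-based while the loop starts at `n_estimators=1`, so index 18 corresponds to 19 trees, which is the value used in the final random forest below. A tiny sketch of that offset with made-up scores:

```python
# Hypothetical candidate values and scores, one score per candidate
candidates = list(range(1, 6))
toy_scores = [0.70, 0.78, 0.81, 0.79, 0.80]

# Position of the best score in the list, then the candidate at that position
best_index = toy_scores.index(max(toy_scores))
best_n = candidates[best_index]
print(best_index, best_n)
```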

In [70]:
params = {
    'n_estimators': list(range(1, 101)),
    'criterion': ['entropy'],
    'random_state': [101]
}

grid = GridSearchCV(RandomForestClassifier(), param_grid=params)
grid.fit(X_train, y_train)
print('Best estimator: {}'.format(grid.best_estimator_))
print('Best parameters: {}'.format(grid.best_params_))
print('Best score: {}'.format(grid.best_score_))

Best estimator: RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=84, n_jobs=1,
            oob_score=False, random_state=101, verbose=0, warm_start=False)
Best parameters: {'criterion': 'entropy', 'n_estimators': 84, 'random_state': 101}
Best score: 0.7869127516778524


In [71]:
# Funny how the GridSearch gives me a different n_estimators than my for loop. And my for
# loop result is much better than the GridSearch. Maybe I'm missing something here?
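
One likely reason for the gap: `GridSearchCV.best_score_` is the mean cross-validated accuracy over folds of `X_train`, while the for loop scores each forest directly on `X_test`, so the two numbers measure different things. Choosing `n_estimators` by its `X_test` accuracy also tends to make that accuracy optimistic. A sketch on synthetic data (from `make_classification`, not the Titanic set) that reports both quantities side by side:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=101)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.33, random_state=101)

grid_demo = GridSearchCV(
    RandomForestClassifier(criterion='entropy', random_state=101),
    param_grid={'n_estimators': [5, 20, 50]},
    cv=5,
    scoring='accuracy',
)
grid_demo.fit(X_tr, y_tr)

# Mean accuracy over the 5 validation folds carved out of X_tr
cv_score = grid_demo.best_score_
# Accuracy of the refitted best model on the untouched hold-out set
holdout_score = grid_demo.score(X_te, y_te)
print(grid_demo.best_params_, cv_score, holdout_score)
```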

In [72]:
random_forest = RandomForestClassifier(n_estimators=19, criterion='entropy', random_state=101)
random_forest.fit(X_train, y_train)
rfc_pred = random_forest.predict(X_test)
rfc_acc = accuracy_score(y_test, rfc_pred)
print('Accuracy: {}'.format(rfc_acc))

Accuracy: 0.8406779661016949


In [73]:
# Logistic Regression

from sklearn.linear_model import LogisticRegression

In [74]:
params_logreg = {
    'C': [0.1, 1, 10, 100]
}

grid_logistic = GridSearchCV(LogisticRegression(), param_grid=params_logreg)
grid_logistic.fit(X_train, y_train)
print('Best estimator: {}'.format(grid_logistic.best_estimator_))
print('Best parameters: {}'.format(grid_logistic.best_params_))
print('Best score: {}'.format(grid_logistic.best_score_))

Best estimator: LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
Best parameters: {'C': 0.1}
Best score: 0.7802013422818792


In [75]:
logreg = LogisticRegression(C=0.1)
logreg.fit(X_train, y_train)
logreg_pred = logreg.predict(X_test)
logreg_acc = accuracy_score(y_test, logreg_pred)
print('Accuracy: {}'.format(logreg_acc))

Accuracy: 0.7322033898305085


In [76]:
# Gaussian Naive Bayes

from sklearn.naive_bayes import GaussianNB

In [77]:
gnb = GaussianNB()
gnb.fit(X_train, y_train)
gnb_pred = gnb.predict(X_test)
gnb_acc = accuracy_score(y_test, gnb_pred)
print('Accuracy: {}'.format(gnb_acc))

Accuracy: 0.6813559322033899


In [78]:
# K-Nearest Neighbors

from sklearn.neighbors import KNeighborsClassifier

In [79]:
params_knn = {
    'n_neighbors': list(range(1, 41))
}

grid_knn = GridSearchCV(KNeighborsClassifier(), param_grid=params_knn)
grid_knn.fit(X_train, y_train)
print('Best estimator: {}'.format(grid_knn.best_estimator_))
print('Best parameters: {}'.format(grid_knn.best_params_))
print('Best score: {}'.format(grid_knn.best_score_))

Best estimator: KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=6, p=2,
           weights='uniform')
Best parameters: {'n_neighbors': 6}
Best score: 0.7332214765100671


In [80]:
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X_train, y_train)
knn_pred = knn.predict(X_test)
knn_acc = accuracy_score(y_test, knn_pred)
print('Accuracy: {}'.format(knn_acc))

Accuracy: 0.7050847457627119
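
KNN is distance-based, so columns on large scales (such as `Fare` or `Name_Len`) can dominate the neighbour search when features are left unscaled. A sketch of tuning `n_neighbors` inside a scaling pipeline, on synthetic data rather than the Titanic features; whether scaling improves the score here is not something this notebook tests:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
X_syn, y_syn = make_classification(n_samples=200, n_features=6, random_state=101)

# The scaler is fitted inside each CV fold, so validation folds do not leak into it
knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# make_pipeline names each step after its lowercased class name
knn_grid = GridSearchCV(
    knn_pipe,
    param_grid={'kneighborsclassifier__n_neighbors': [3, 5, 7]},
    cv=5,
)
knn_grid.fit(X_syn, y_syn)
best_k = knn_grid.best_params_['kneighborsclassifier__n_neighbors']
print(best_k, knn_grid.best_score_)
```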


In [81]:
# SGDClassifier

from sklearn.linear_model import SGDClassifier

In [82]:
params_sgdc = {
    'alpha': [0.1, 0.01, 0.001, 0.0001]
}

grid_sgdc = GridSearchCV(SGDClassifier(), param_grid=params_sgdc)
grid_sgdc.fit(X_train, y_train)
print('Best estimator: {}'.format(grid_sgdc.best_estimator_))
print('Best parameters: {}'.format(grid_sgdc.best_params_))
print('Best score: {}'.format(grid_sgdc.best_score_))

Best estimator: SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=5, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=None,
       shuffle=True, tol=None, verbose=0, warm_start=False)
Best parameters: {'alpha': 0.0001}
Best score: 0.6728187919463087




In [83]:
# Note: the grid search above picked alpha=0.0001, but this cell fits with alpha=0.001
sgdc = SGDClassifier(alpha=0.001)
sgdc.fit(X_train, y_train)
sgdc_pred = sgdc.predict(X_test)
sgdc_acc = accuracy_score(y_test, sgdc_pred)
print('Accuracy: {}'.format(sgdc_acc))

Accuracy: 0.7084745762711865
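
The estimator printed above has `shuffle=True` and `random_state=None`, so rerunning the SGD cells can give a different accuracy each time. A sketch, on synthetic data, showing that fixing `random_state` makes two fits agree (the seed 101 is simply borrowed from the other cells):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in data
X_syn, y_syn = make_classification(n_samples=200, n_features=6, random_state=101)

# Two independent fits with the same seed shuffle the data identically
pred_a = SGDClassifier(alpha=0.001, random_state=101).fit(X_syn, y_syn).predict(X_syn)
pred_b = SGDClassifier(alpha=0.001, random_state=101).fit(X_syn, y_syn).predict(X_syn)

same = bool((pred_a == pred_b).all())
print(same)
```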




In [84]:
# Comparing accuracy from all the models

list_scores = [dt_acc, rfc_acc, logreg_acc, gnb_acc, knn_acc, sgdc_acc]
classifiers = ['Decision Tree', 'Random Forest', 'Logistic Regression', 'Gaussian Naive Bayes', 'KNN', 'SGDC']

scores_df = pd.DataFrame(list_scores, columns=['Scores'], index=classifiers)
scores_df

,Scores
Decision Tree,0.759322
Random Forest,0.840678
Logistic Regression,0.732203
Gaussian Naive Bayes,0.681356
KNN,0.705085
SGDC,0.708475
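
Sorting the table makes the ranking easier to read. A short sketch that rebuilds the frame from the scores printed above (`scores_demo` is a stand-in name) and orders it from best to worst:

```python
import pandas as pd

# Scores copied from the table above
scores_demo = pd.DataFrame(
    {'Scores': [0.759322, 0.840678, 0.732203, 0.681356, 0.705085, 0.708475]},
    index=['Decision Tree', 'Random Forest', 'Logistic Regression',
           'Gaussian Naive Bayes', 'KNN', 'SGDC'],
)

# Highest accuracy first
ranked = scores_demo.sort_values('Scores', ascending=False)
print(ranked)
```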


In [85]:
# Now let's predict with random forest classifier on the test_df set

predictions = random_forest.predict(test_df.drop('PassengerId', axis=1))

In [86]:
# Create a submission file to be sent to kaggle

submission = pd.DataFrame(
            {'PassengerId': test_df['PassengerId'],
             'Survived': predictions})
submission.to_csv('random_forest_classifier_3.csv', index=False)
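
Before uploading, it can be worth checking that the file has exactly the two expected columns and no index column. A sketch that writes a hypothetical three-row frame to an in-memory buffer instead of disk (the ids and predictions are made up):

```python
import io

import pandas as pd

# Hypothetical ids and predictions
submission_demo = pd.DataFrame({'PassengerId': [892, 893, 894], 'Survived': [0, 1, 0]})

# Write to an in-memory text buffer so nothing touches the filesystem
buffer = io.StringIO()
submission_demo.to_csv(buffer, index=False)

lines = buffer.getvalue().splitlines()
header_line = lines[0]
print(header_line, len(lines) - 1)
```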

I'm still trying to figure out how to properly tune the models with GridSearch. If you have any constructive criticism, please go ahead! I'm looking forward to hearing your advice.