# Titanic - Random Forest Classifier
I've done the Titanic exercise many times, so I decided to go straight to data manipulation and model building; I already have a good understanding of how the features correlate.

There is still a lot of room for improvement (obviously) regarding feature engineering, but I'm pretty happy with what I've achieved so far, especially considering this is my first kernel on Kaggle.

In [1]:
# Imports

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('darkgrid')

In [2]:
# Importing the data and checking its first 5 rows.

titanic_df = pd.read_csv('../input/train.csv')
test_df = pd.read_csv('../input/test.csv')
titanic_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [3]:
# Now let's check the info on both titanic_df and test_df

print(titanic_df.info())
print('---------------------------------------')
print(test_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None
---------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-

In [4]:
# Checking for missing values on both data sets

print(titanic_df.isnull().sum())
print('------------------')
print(test_df.isnull().sum())

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
------------------
PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64


In [5]:
# Let's check the most common value in 'Embarked', so we can fill the 2 NaNs in titanic_df

titanic_df['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [6]:
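# Fill the missing 'Embarked' values with the most common port ('S') and the single missing
# 'Fare' in test_df with the mean fare. Assigning the result back avoids calling fillna with
# inplace=True on a column selection, which newer pandas versions warn about.

titanic_df['Embarked'] = titanic_df['Embarked'].fillna('S')
test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].mean())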
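Instead of reading the most common port off the `value_counts()` output and typing it in by hand, the fill value can be looked up in code. This is only a small sketch on a made-up Series (the values below are toy data, not the real column), and `most_common` is a name I picked for illustration.

```python
import pandas as pd

# Toy stand-in for the 'Embarked' column (made-up values, one missing entry).
embarked = pd.Series(['S', 'C', None, 'S', 'Q'])

# mode() returns the most frequent value(s); take the first one.
most_common = embarked.mode()[0]

# Fill the missing entries with that value.
embarked = embarked.fillna(most_common)
print(embarked.tolist())
```

The advantage is that the cell keeps working if the data changes and a different port becomes the most frequent one.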

In [7]:
# Compute the average age for each passenger class, so we can fill the missing values in the
# 'Age' column more accurately (int() truncates each mean to a whole number of years)

first_class_age = int(titanic_df[titanic_df['Pclass'] == 1]['Age'].mean())
second_class_age = int(titanic_df[titanic_df['Pclass'] == 2]['Age'].mean())
third_class_age = int(titanic_df[titanic_df['Pclass'] == 3]['Age'].mean())

In [8]:
print('First Class Average Age: {}'.format(first_class_age))
print('Second Class Average Age: {}'.format(second_class_age))
print('Third Class Average Age: {}'.format(third_class_age))

First Class Average Age: 38
Second Class Average Age: 29
Third Class Average Age: 25


In [9]:
# Now that we have the average age by class, let's fill those NaNs

def impute_age(cols):
    # cols is one row holding the 'Age' and 'Pclass' values
    age = cols['Age']
    pclass = cols['Pclass']

    if pd.isnull(age):
        if pclass == 1:
            return first_class_age
        elif pclass == 2:
            return second_class_age
        else:
            return third_class_age
    else:
        return age

In [10]:
titanic_df['Age'] = titanic_df[['Age', 'Pclass']].apply(impute_age, axis=1)
test_df['Age'] = test_df[['Age', 'Pclass']].apply(impute_age, axis=1)
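The same fill can probably be done without a row-by-row `apply`. The sketch below is an alternative, not what the notebook actually runs: it computes the per-class mean on a toy training frame and maps it onto the missing ages of a toy test frame. The frames `train_toy` and `test_toy` and their values are made up for illustration, and unlike the cells above it does not truncate the means with `int()`.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for titanic_df and test_df (made-up values).
train_toy = pd.DataFrame({
    'Pclass': [1, 1, 2, 2, 3, 3],
    'Age': [40.0, 36.0, 30.0, 28.0, 26.0, 24.0],
})
test_toy = pd.DataFrame({
    'Pclass': [1, 2, 3],
    'Age': [np.nan, 50.0, np.nan],
})

# Mean age per class, computed on the training data only.
class_mean_age = train_toy.groupby('Pclass')['Age'].mean()

# Look up each row's class mean and use it only where 'Age' is missing.
test_toy['Age'] = test_toy['Age'].fillna(test_toy['Pclass'].map(class_mean_age))
print(test_toy)
```

Using the training means for both data sets mirrors what the notebook does with `first_class_age`, `second_class_age` and `third_class_age`.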

In [11]:
# We can drop the Cabin column since it's missing so many values, and I don't think we can
# engineer a useful feature from it.

titanic_df.drop('Cabin', axis=1, inplace=True)
test_df.drop('Cabin', axis=1, inplace=True)

In [12]:
# Let's store the length of each name in a new column, since it has a relationship with the
# survival rate

titanic_df['Name_Len'] = titanic_df['Name'].str.len()
test_df['Name_Len'] = test_df['Name'].str.len()

In [13]:
# We can also take the title out of the 'Name' feature and put it in a new column

titanic_df['Name_Title'] = titanic_df['Name'].apply(lambda x: x.split(',')[1]).apply(lambda x: x.split()[0])
test_df['Name_Title'] = test_df['Name'].apply(lambda x: x.split(',')[1]).apply(lambda x: x.split()[0])
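Splitting on whitespace keeps only the first word after the comma, so a multi-word title ends up as just its first word (that is presumably where a title like `the` comes from). A regular expression that captures everything between the comma and the first period is one possible alternative. This is a sketch on a few sample names, not the notebook's method; the third name is only an assumed example of a multi-word title, and note that the captured titles come out without the trailing period.

```python
import pandas as pd

# Sample names; the last one is an assumed example of a multi-word title.
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
])

# Capture the text between the comma and the first period, then trim spaces.
extracted = names.str.extract(r',\s*([^.]+)\.', expand=False).str.strip()
print(extracted.tolist())
```

If this were used in the notebook, the keys of the title dictionary would have to drop their trailing periods as well.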

In [14]:
# Let's check the titles

print(titanic_df['Name_Title'].value_counts())
print('--------------------')
print(test_df['Name_Title'].value_counts())

Mr.          517
Miss.        182
Mrs.         125
Master.       40
Dr.            7
Rev.           6
Col.           2
Mlle.          2
Major.         2
the            1
Mme.           1
Sir.           1
Ms.            1
Don.           1
Jonkheer.      1
Capt.          1
Lady.          1
Name: Name_Title, dtype: int64
--------------------
Mr.        240
Miss.       78
Mrs.        72
Master.     21
Rev.         2
Col.         2
Dona.        1
Ms.          1
Dr.          1
Name: Name_Title, dtype: int64


In [15]:
# We can create a dictionary that maps each title to an integer code, to be used as a
# feature in our model (these are label codes, not one-hot dummy variables)

titles = {'Mr.': 0, 'Miss.': 1, 'Mrs.': 2, 'Master.': 3, 'Dr.': 4, 'Rev.': 5, 'Major.': 6,
          'Col.': 7, 'Mlle.': 8, 'Jonkheer.': 9, 'Sir.': 10, 'Mme.': 11, 'the': 12, 'Don.': 13,
          'Capt.' : 14, 'Lady.': 15, 'Ms.': 16, 'Dona.': 17
         }

In [16]:
# Replacing the titles with their corresponding integer codes

titanic_df['Name_Title'] = titanic_df['Name_Title'].map(titles)
test_df['Name_Title'] = test_df['Name_Title'].map(titles)
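One thing worth checking at this step is whether every title actually has an entry in the dictionary, since the test set can contain titles that never appear in the training set. Below is a small sketch of such a check using `Series.map`, which turns any title missing from the dictionary into NaN. The dictionary `title_codes` and the Series `toy_titles` are shortened, made-up stand-ins for illustration.

```python
import pandas as pd

# Shortened, hypothetical version of the title dictionary.
title_codes = {'Mr.': 0, 'Miss.': 1, 'Mrs.': 2}

# Toy titles; 'Dona.' is deliberately missing from the dictionary above.
toy_titles = pd.Series(['Mr.', 'Mrs.', 'Dona.'])

# map() returns NaN for keys that are not in the dictionary.
coded = toy_titles.map(title_codes)

# List the titles that were left without a code.
unmapped = toy_titles[coded.isnull()].unique().tolist()
print(unmapped)
```

An empty list means every title was covered; anything else would need a new dictionary entry before fitting the model.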

In [17]:
# Getting dummy variables

sex_dummy_titanic = pd.get_dummies(titanic_df['Sex'], drop_first=True)
sex_dummy_test = pd.get_dummies(test_df['Sex'], drop_first=True)

embark_dummy_titanic = pd.get_dummies(titanic_df['Embarked'], drop_first=True)
embark_dummy_test = pd.get_dummies(test_df['Embarked'], drop_first=True)
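Calling `get_dummies` separately on the two data sets works as long as both contain the same categories. If one of them were missing a port, the columns (and which one `drop_first` removes) would no longer line up. A possible safeguard, sketched here on toy data, is to declare the categories once and convert both columns to that categorical dtype first. The names `embarked_dtype`, `train_embarked` and `test_embarked` are mine, and the values are made up.

```python
import pandas as pd

# One shared list of categories for both data sets.
embarked_dtype = pd.CategoricalDtype(categories=['C', 'Q', 'S'])

# Toy columns; the test column deliberately contains no 'C'.
train_embarked = pd.Series(['S', 'C', 'Q', 'S']).astype(embarked_dtype)
test_embarked = pd.Series(['S', 'S', 'Q']).astype(embarked_dtype)

# Because the categories are fixed, both frames get the same dummy columns
# and drop_first removes the same category ('C') in each.
train_dummies = pd.get_dummies(train_embarked, drop_first=True)
test_dummies = pd.get_dummies(test_embarked, drop_first=True)

print(list(train_dummies.columns), list(test_dummies.columns))
```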

In [18]:
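# Let's attach the dummy columns and drop the rest of the columns that we are not going to use
# (without the concat, the 'Sex' and 'Embarked' information would never reach the model)
# Note that we are not dropping the PassengerId in the test_df because we need it for our
# submission file for the competition

titanic_df = pd.concat([titanic_df, sex_dummy_titanic, embark_dummy_titanic], axis=1)
test_df = pd.concat([test_df, sex_dummy_test, embark_dummy_test], axis=1)

titanic_df.drop(['PassengerId', 'Name', 'Sex', 'Ticket', 'Embarked'], axis=1, inplace=True)
test_df.drop(['Name', 'Sex', 'Ticket', 'Embarked'], axis=1, inplace=True)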

In [19]:
# Let's define our features and target columns and split them 

X_train = titanic_df.drop('Survived', axis=1)
y_train = titanic_df['Survived']

X_test = test_df.drop('PassengerId', axis=1).copy()

In [20]:
# Now we can start creating our model
# Let's import the necessary modules, create an instance of RandomForestClassifier and fit the
# model

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [21]:
# Let's assign our predictions to a variable

predictions = model.predict(X_test)

In [22]:
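# And check the model's accuracy on the training data (an optimistic figure, since the model
# has already seen these rows; it is not a held-out score)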

model.score(X_train, y_train)

0.97643097643097643
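Since the score above is measured on the same rows the model was fitted on, it says little about how the model does on unseen passengers. Cross-validation is one common way to get a more honest estimate. The sketch below uses scikit-learn's `cross_val_score` on randomly generated toy data, not on the Titanic features; the arrays `X_toy` and `y_toy`, the 5 folds, and the `n_estimators=100` / `random_state=0` settings are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy data: 200 rows, 4 numeric features, and a label derived from two of them.
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(200, 4))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

# Fit and score the model on 5 different train/validation splits.
model_cv = RandomForestClassifier(n_estimators=100, random_state=0)
cv_scores = cross_val_score(model_cv, X_toy, y_toy, cv=5, scoring='accuracy')

# Each entry is the accuracy on a fold the model did not see during fitting.
print(cv_scores)
print(cv_scores.mean())
```

In the notebook, the same call would take `X_train` and `y_train` in place of the toy arrays.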

In [24]:
# Create a submission file to be sent to Kaggle

submission = pd.DataFrame(
            {'PassengerId': test_df['PassengerId'],
             'Survived': predictions})
submission.to_csv('titanic_logreg.csv', index=False)

Like I said at the top of the notebook, there is obviously a ton of room for improvement. I'm only a beginner and a student of data science and machine learning, so if you happen to check this notebook and have some constructive criticism, please go ahead. 90% of the reason for posting this notebook and being on Kaggle is to learn new things, so I'm really looking forward to connecting with you.