# Abstract

This is a simple template notebook that reaches about 80% accuracy in the Kaggle Titanic competition without paying particular attention to modelling. The purpose is to show how a good EDA can improve the score dramatically.

## Import libraries

We import the libraries we need.

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score

## Surname and Title split

We start by splitting the Name column into Surname and Title columns, so that we can take into account the effect of these two fields on the passengers' survival.

In [2]:
titanic_dataset = pd.read_csv('train.csv')
titanic_test_set = pd.read_csv('test.csv')
titanic_dataset['Surname']=titanic_dataset.Name.str.split(',',expand=True)[0]
titanic_test_set['Surname']=titanic_test_set.Name.str.split(',',expand=True)[0]
# .str.strip() removes the space left after the comma (' Mr' -> 'Mr')
titanic_dataset['Title']=titanic_dataset.Name.str.split(',',expand=True)[1].str.split('.',expand=True)[0].str.strip()
titanic_test_set['Title']=titanic_test_set.Name.str.split(',',expand=True)[1].str.split('.',expand=True)[0].str.strip()

In [3]:
titanic_dataset.head(5)

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Surname,Title
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Braund,Mr
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Cumings,Mrs
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Heikkinen,Miss
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Futrelle,Mrs
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Allen,Mr
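
To make the split concrete, here is a minimal sketch on two names taken from the table above. It uses `str.extract` with a regular expression rather than chained `str.split` calls. The pattern and the helper name `split_name` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def split_name(names: pd.Series) -> pd.DataFrame:
    """Split 'Surname, Title. Given names' into Surname and Title columns.

    Hypothetical helper: the pattern assumes every name contains a comma
    followed by a title that ends with a period.
    """
    pattern = r'^\s*(?P<Surname>[^,]+),\s*(?P<Title>[^.]+)\.'
    return names.str.extract(pattern)

sample = pd.Series(['Braund, Mr. Owen Harris',
                    'Heikkinen, Miss. Laina'])
parts = split_name(sample)
print(parts)
```

The `\s*` after the comma keeps the extracted title free of leading whitespace, which matters later when the title is used as a grouping key.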


We define the FamilyNumber column as the total number of passengers who share the same surname, homonyms included. We count over both the training and the test set.

In [4]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
all_passengers = pd.concat([titanic_test_set, titanic_dataset], sort=False)
lbl_sur = all_passengers.groupby('Surname').size().to_dict()
titanic_dataset['FamilyNumber']=titanic_dataset.Surname.map(lbl_sur)
titanic_test_set['FamilyNumber']=titanic_test_set.Surname.map(lbl_sur)

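We then define the isAlone column, which flags the people who are not travelling with any family member (SibSp + Parch equal to 0). The HasHomonym column is then quite straightforward to calculate: it flags passengers who travel alone but share their surname with at least one other passenger.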

In [5]:
# A passenger is alone when no sibling, spouse, parent or child is aboard
titanic_dataset['isAlone'] = (titanic_dataset.SibSp + titanic_dataset.Parch) == 0
titanic_test_set['isAlone'] = (titanic_test_set.SibSp + titanic_test_set.Parch) == 0

# Alone but sharing a surname with someone else: the other person is a homonym
titanic_dataset['HasHomonym'] = titanic_dataset.isAlone & (titanic_dataset.FamilyNumber > 1)
titanic_test_set['HasHomonym'] = titanic_test_set.isAlone & (titanic_test_set.FamilyNumber > 1)
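
As a worked illustration of the three family features, the sketch below builds them on a tiny passenger table. The data are invented for the example; only the column names follow the notebook.

```python
import pandas as pd

# Toy data (invented): two Smiths travel together, a third Smith travels alone.
toy = pd.DataFrame({
    'Surname': ['Smith', 'Smith', 'Smith', 'Jones'],
    'SibSp':   [1, 1, 0, 0],
    'Parch':   [0, 0, 0, 0],
})

# Number of passengers sharing each surname, homonyms included.
toy['FamilyNumber'] = toy['Surname'].map(toy['Surname'].value_counts())

# A passenger is alone when no sibling, spouse, parent or child is aboard.
toy['isAlone'] = (toy['SibSp'] + toy['Parch']) == 0

# Alone, yet the surname occurs more than once: the match must be a homonym.
toy['HasHomonym'] = toy['isAlone'] & (toy['FamilyNumber'] > 1)

print(toy)
```

Only the third Smith is flagged: the surname appears three times, but this passenger has no relatives aboard.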

## Labelling and Mean Encoding

We now use both the training and the test set to build integer labels for the Cabin and Surname columns and to compute the passengers' mean age. Finally, we fill the NaN values in the Age column with the mean age, and any NaN values left in the other columns, such as Surname and Cabin, with -1.

In [6]:
### Labelling and filling NaN ###

# Stack test and training rows (pd.concat replaces the removed DataFrame.append)
all_passengers = pd.concat([titanic_test_set, titanic_dataset], sort=False)
mean_age = all_passengers.Age.mean()
lbl_cabin = {k: v for v, k in enumerate(all_passengers['Cabin'].unique())}
lbl_Surname = {k: v for v, k in enumerate(all_passengers['Surname'].unique())}

titanic_dataset.Age = titanic_dataset.Age.fillna(mean_age)
titanic_dataset.Cabin = titanic_dataset.Cabin.map(lbl_cabin)
titanic_dataset.Surname = titanic_dataset.Surname.map(lbl_Surname)
titanic_dataset=titanic_dataset.fillna(-1)

titanic_test_set.Age = titanic_test_set.Age.fillna(mean_age)
titanic_test_set.Cabin = titanic_test_set.Cabin.map(lbl_cabin)
titanic_test_set.Surname = titanic_test_set.Surname.map(lbl_Surname)
titanic_test_set=titanic_test_set.fillna(-1)
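
The label dictionaries assign each distinct value an integer in order of first appearance. The sketch below shows the idea on invented cabin values. It is a variant, not the notebook's exact code: it skips missing values when building the dictionary, and it compares the result with `pd.factorize`, which codes missing values as -1 by default.

```python
import numpy as np
import pandas as pd

# Invented cabin values; NaN marks a passenger without a recorded cabin.
cabins = pd.Series(['C85', np.nan, 'C123', 'C85', np.nan])

# Dictionary-based labelling: one integer per distinct non-missing value,
# in order of first appearance. Missing values are set to -1 afterwards.
labels = {cabin: code for code, cabin in enumerate(cabins.dropna().unique())}
encoded = cabins.map(labels).fillna(-1).astype(int)

# pd.factorize produces the same codes in one call.
codes, uniques = pd.factorize(cabins)

print(encoded.tolist())
print(codes.tolist())
```

These integers are arbitrary identifiers with no meaningful order, which is one reason a tree-based model is a reasonable fit for such columns.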

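This is the most important part. We encode four fields (Sex, Embarked, Pclass and Title) using the mean encoding technique, which replaces each category with the mean of the target (Survived) for that category. To limit overfitting we use a K-fold strategy: inside each fold, the validation rows are encoded with means computed on the training rows only, and the per-category values are then averaged across folds.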

In [7]:
###### MEAN ENCODING ######

skf = StratifiedKFold(5,shuffle=True,random_state=123).split(titanic_dataset.values,
                                                             titanic_dataset['Survived'].values)

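folds = []

for tr_ind, val_ind in skf:
    # .copy() avoids pandas' SettingWithCopyWarning when adding columns to the fold
    X_tr, X_val = titanic_dataset.iloc[tr_ind], titanic_dataset.iloc[val_ind].copy()
    for col in ['Sex','Embarked','Pclass','Title']:
        # Encode validation rows with target means computed on the training rows only
        X_val[col+'_mean_target'] = X_val[col].map(X_tr.groupby(col)['Survived'].mean())
    folds.append(X_val)

# pd.concat replaces the repeated DataFrame.append calls (removed in pandas 2.0)
new_df = pd.concat(folds)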

emb_dict = new_df.groupby('Embarked')['Embarked_mean_target'].mean().to_dict()
sex_dict = new_df.groupby('Sex')['Sex_mean_target'].mean().to_dict()
pclass_dict = new_df.groupby('Pclass')['Pclass_mean_target'].mean().to_dict()
title_dict = new_df.groupby('Title')['Title_mean_target'].mean().to_dict()

titanic_dataset['Embarked_mean_target'] = titanic_dataset.Embarked.map(emb_dict)
titanic_dataset['Sex_mean_target'] = titanic_dataset.Sex.map(sex_dict)
titanic_dataset['Pclass_mean_target'] = titanic_dataset.Pclass.map(pclass_dict)
titanic_dataset['Title_mean_target'] = titanic_dataset.Title.map(title_dict)
titanic_test_set['Embarked_mean_target'] = titanic_test_set.Embarked.map(emb_dict)
titanic_test_set['Sex_mean_target'] = titanic_test_set.Sex.map(sex_dict)
titanic_test_set['Pclass_mean_target'] = titanic_test_set.Pclass.map(pclass_dict)
titanic_test_set['Title_mean_target'] = titanic_test_set.Title.map(title_dict)

titanic_dataset=titanic_dataset.fillna(-1)
titanic_test_set=titanic_test_set.fillna(-1)
titanic_dataset.isAlone = titanic_dataset.isAlone.astype(float)
titanic_dataset.HasHomonym = titanic_dataset.HasHomonym.astype(float)
titanic_test_set.isAlone = titanic_test_set.isAlone.astype(float)
titanic_test_set.HasHomonym = titanic_test_set.HasHomonym.astype(float)

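
The out-of-fold step can be wrapped in a small function. The sketch below is an illustration under stated assumptions rather than the notebook's exact procedure: the helper name `oof_mean_encode`, the toy data, and the choice to fall back to the global mean for categories unseen in a training fold are all hypothetical (the notebook fills such gaps with -1 instead).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

def oof_mean_encode(df, col, target, n_splits=5, seed=123):
    """Return out-of-fold target means for `col` (hypothetical helper).

    Each row is encoded with the target mean computed on the other folds,
    so a row's own label never contributes to its encoding.
    """
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr_ind, val_ind in skf.split(df, df[target]):
        fold_means = df.iloc[tr_ind].groupby(col)[target].mean()
        encoded.iloc[val_ind] = df[col].iloc[val_ind].map(fold_means).to_numpy()
    # Assumption: categories unseen in a training fold get the global mean.
    return encoded.fillna(df[target].mean())

# Toy data (invented): 'f' survives more often than 'm'.
toy = pd.DataFrame({
    'Sex':      ['f'] * 10 + ['m'] * 10,
    'Survived': [1, 1, 1, 1, 1, 1, 1, 0, 0, 0] + [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
})
toy['Sex_mean_target'] = oof_mean_encode(toy, 'Sex', 'Survived')
print(toy.groupby('Sex')['Sex_mean_target'].mean())
```

Because the encoding of a row never sees that row's own label, the feature carries less target leakage than a mean computed on the full training set.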


## Modelling

Now that the dataset is ready, we can build a simple model. First, we scale the features to the [0, 1] range, fitting the scaler on the training and test sets together.

In [8]:
fcol=['Age','SibSp','Parch','Cabin','Fare','Surname','FamilyNumber','isAlone','HasHomonym',
      'Pclass_mean_target','Sex_mean_target','Embarked_mean_target','Title_mean_target']

y_col = ['Survived']

X_train = titanic_dataset[fcol].values
X_test = titanic_test_set[fcol].values
scaler = MinMaxScaler()

scaler.fit(np.concatenate([X_train,X_test]))
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
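
`MinMaxScaler` rescales each feature as x' = (x - min) / (max - min), where min and max come from the data passed to `fit`. The sketch below, on invented numbers, checks that formula by hand and shows the common alternative of fitting on the training rows only, in which case test values can fall outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Invented single-feature data.
X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[0.0], [40.0]])

# Fit on train and test together, so min=0 and max=40.
both = MinMaxScaler().fit(np.concatenate([X_train, X_test]))
train_both = both.transform(X_train)

# Same result computed by hand: (x - min) / (max - min).
manual = (X_train - 0.0) / (40.0 - 0.0)

# Alternative: fit on the training rows only, so min=10 and max=30.
train_only = MinMaxScaler().fit(X_train)
test_scaled = train_only.transform(X_test)

print(train_both.ravel())   # stays inside [0, 1]
print(test_scaled.ravel())  # can fall outside [0, 1]
```

Fitting on both sets is convenient in a competition where the test features are available up front. Fitting on the training rows only is the stricter choice when the test data must remain unseen.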

Then, using a stratified K-fold strategy for cross-validation, we train a Random Forest model, for which different hyperparameters can be tried. As the output below shows, the model is a little overfitted: the mean training accuracy is about 0.95, while the mean validation accuracy is about 0.83. I invite you to improve the model in order to achieve better generalization on the test set.

In [9]:
skf = StratifiedKFold(5,shuffle=True,random_state=123).split(X_train,
                                                             titanic_dataset['Survived'].values)
y_tr = titanic_dataset[y_col].values
y_tr = y_tr.reshape(len(y_tr),)

clf = RandomForestClassifier(n_estimators=200,max_depth=40,max_features=10,
                             min_samples_leaf=2,random_state=123,n_jobs=-1)
#clf = LogisticRegression(C=9)

accuracy_tr = []
accuracy_val = []

for tr_ind, val_ind in skf:
    X_tr, X_val = X_train_scaled[tr_ind], X_train_scaled[val_ind]
    clf.fit(X_tr,y_tr[tr_ind])
    accuracy_tr.append(clf.score(X_tr,y_tr[tr_ind]))
    accuracy_val.append(clf.score(X_val,y_tr[val_ind]))

print('The mean tr accuracy is ', np.mean(accuracy_tr))
print('The mean val accuracy is ', np.mean(accuracy_val))

The mean tr accuracy is  0.9520202259639061
The mean val accuracy is  0.8305296818518562
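
One possible way to narrow the gap between training and validation accuracy is a small grid search over the hyperparameters that control tree complexity. The sketch below uses synthetic data from `make_classification` so that it runs on its own. The grid values are arbitrary placeholders, not tuned recommendations, and in the notebook you would pass `X_train_scaled` and `y_tr` instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the Titanic features (assumption for the example).
X, y = make_classification(n_samples=200, n_features=13, random_state=123)

# Placeholder grid: shallower trees and larger leaves tend to reduce overfitting.
param_grid = {
    'max_depth': [3, 6],
    'min_samples_leaf': [2, 5],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=123),
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=123),
    scoring='accuracy',
    return_train_score=True,
)
search.fit(X, y)

best = search.best_index_
print('Best parameters:', search.best_params_)
print('Mean train accuracy:', search.cv_results_['mean_train_score'][best])
print('Mean val accuracy:  ', search.best_score_)
```

Comparing the mean training and validation scores for each candidate shows whether a setting actually reduces overfitting or only lowers both numbers.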


Once the validation is done, we can fit the model on the entire training set.

In [10]:
clf = RandomForestClassifier(n_estimators=200,max_depth=40,max_features=10,
                             min_samples_leaf=2,random_state=123,n_jobs=-1)
clf.fit(X_train_scaled,titanic_dataset['Survived'].values)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=40, max_features=10, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=2, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=-1,
            oob_score=False, random_state=123, verbose=0, warm_start=False)

Finally, we prepare the submission.csv file to submit to the competition. You should achieve a score of roughly 0.80.

In [11]:
y_test = clf.predict(X_test_scaled)

# Build the submission table and write it without the index column
submission = pd.DataFrame({'PassengerId': titanic_test_set['PassengerId'],
                           'Survived': y_test})
submission.to_csv('submission.csv', index=False)

### Thanks!