# First Project - Titanic Survival

This is my first Kaggle project: using the Titanic passenger data to predict which passengers survived (https://www.kaggle.com/c/titanic). It is a first pass, as I just want to see the score; I will edit it for clarity/efficiency later.

## Table of Contents
***
* [1. Data Set](#dataset)
    * [1.1 Reading the data in](#reading)
    * [1.2 Exploring the data](#explore)
* [2. Data Cleaning](#cleaning)
    * [2.1 Outliers/NaN Values](#outliers)
    * [2.2 Fixing missing values](#fix)
    * [2.3 Feature Engineering](#feature)
    * [2.4 One-Hot Encoding/Dummy Variables](#dummy)
* [3. Submission](#sub)    
***

# <a id='dataset'> 1. Data Set </a>

In [73]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from matplotlib import pyplot as plt
from pdpbox import pdp, get_dataset, info_plots
import eli5
from eli5.sklearn import PermutationImportance
import shap

### <a id='reading'> 1.1 Reading the data in </a>

***
After copying the Titanic dataset over from the competition page, we read it in with `pandas` and take a look.
***

In [74]:
# Read in csv files
titanic_original = pd.read_csv('datasets/titanic/train.csv') # kaggle path '../input/train.csv'
titanic_validation  = pd.read_csv('datasets/titanic/test.csv') # kaggle path '../input/test.csv'

# Work on deep copies so the original DataFrames stay untouched
titanic = titanic_original.copy(deep = True) 
titanic_val = titanic_validation.copy(deep = True)

# Keep both sets in one list so each cleaning step can loop over them
data_sets = [titanic,titanic_val] 

titanic.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


***
Taking an initial look at the columns, there are several types of qualitative and quantitative data. Our variables are:
1. *PassengerId* - unique identifier for each passenger (a running row number, not a predictor)
2. *Survived* - our target variable, 1 means survived, 0 means didn't survive
3. *Pclass* - non-continuous (ordinal) variable that is a proxy for socioeconomic status, with upper class being 1, middle class 2, and lower class 3
4. *Name* - nominal identifier which contains passenger name and title
5. *Sex* - nominal identifier indicating the gender of the passenger
6. *Age* - continuous quantitative variable for a passenger's age
7. *SibSp* - discrete quantitative variable for the number of siblings/spouses on board
8. *Parch* - discrete quantitative variable for the number of parents/children on board
9. *Ticket* - ticket number; passengers travelling together can share one, so it is not unique per passenger
10. *Fare* - continuous quantitative variable for the price of the ticket
11. *Cabin* - nominal identifier indicating where on the ship the passenger stayed
12. *Embarked* - nominal identifier for the port where the passenger boarded (C = Cherbourg, Q = Queenstown, S = Southampton)
***

### <a id='explore'> 1.2 Exploring the data </a>

***
Now that we have read in the data, we can take a basic look to determine which variables we think are good to use as predictors.
***

In [75]:
titanic.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


# <a id='cleaning'> 2. Data Cleaning </a>

***
Now that we have an idea of what is in our data we need to set about cleaning it up. This will happen in three phases. 
1. We need to identify outliers and missing/NaN values. Once we have identified them we can determine how to deal with them. 
2. Next is feature engineering, which is creating new features to see if we can get extra information with their inclusion. 
3. Lastly, we will convert our categorical data into dummy variables (either One-Hot or creating ordinals like *Pclass*).
***

### <a id='outliers'> 2.1 Outliers/NaN values </a>

***
Looking back at the `describe()` table, the columns that are likely to have outliers are the *Age* and *Fare* features. However, the maximum age is 80 which is reasonable and the maximum fare is 512. This seems like it may be a little large but the high fares are associated with a *Pclass* of 1 so we can ignore this for now. *Age* has some missing values but `describe()` doesn't mention the categorical columns so we need to see if anything is missing there too.
***

In [76]:
for dataset in data_sets:
    for col in dataset.columns:
        # Count the missing entries once and report only columns that have any
        n_missing = dataset[col].isnull().sum()
        if n_missing > 0:
            print(col + ' has {} empty values'.format(n_missing))

Age has 177 empty values
Cabin has 687 empty values
Embarked has 2 empty values
Age has 86 empty values
Fare has 1 empty values
Cabin has 327 empty values


***
Not only does *Age* (in the training set) have 177 missing values, but *Cabin* is missing 77% of its values (687 of 891)! While *Cabin* would tell us where people were on the ship, we would also need a schematic to figure out front/back and deck positioning. With more than 75% of its values missing, we will drop it.

In our validation set, *Age* and *Cabin* have missing values just as in the training set, and a single value is also missing in the *Fare* feature, which we will need to take care of.
***
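The share of missing values per column can be computed directly instead of by hand. The following is a minimal sketch on a made-up toy frame (the values and the `drop_threshold` cutoff are assumptions for illustration, not taken from the competition data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training data (values are made up)
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.92, 53.10],
})

# isnull().mean() gives the fraction of missing entries per column
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Columns above a chosen (hypothetical) threshold are candidates for dropping
drop_threshold = 0.5
to_drop = missing_share[missing_share > drop_threshold].index.tolist()
print(to_drop)
```

Applied to the real training set, the same calculation for *Cabin* would be 687 / 891, roughly 0.77.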

### <a id='fix'> 2.2 Fixing missing values </a>

***
There are a few different ways to correct missing/NaN values. The easiest is to replace them with the mean, median, or mode of the feature; this correction is called *imputing*. Imputing with the mean can be a little dangerous when the data are spread out or skewed, because a few extreme values pull the mean away from a typical value. The median and mode are far less sensitive to such values. For *Age* and *Fare* we will impute with the median, and for *Embarked* we will use the mode, as there are only three choices to begin with.
***
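To see why the mean is the riskier choice, here is a small sketch with made-up fares, four ordinary values and one extreme one. The numbers are illustrative only:

```python
import pandas as pd

# Made-up fares: four typical values and one extreme one
fares = pd.Series([7.25, 7.92, 8.05, 13.0, 512.33])

print(fares.mean())    # pulled upward by the single large fare
print(fares.median())  # stays at the middle value

# Imputing a missing entry with the median of the observed values
with_gap = pd.Series([7.25, None, 8.05, 13.0, 512.33])
filled = with_gap.fillna(with_gap.median())
print(filled)
```

The single large fare moves the mean far above what most passengers paid, while the median barely notices it.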

In [77]:
# Fill in missing values with median/mode
for dataset in data_sets:
    # Assign back instead of inplace=True on a column, which newer pandas versions warn about
    dataset['Age'] = dataset['Age'].fillna(dataset['Age'].median())
    dataset['Fare'] = dataset['Fare'].fillna(dataset['Fare'].median())
    dataset['Embarked'] = dataset['Embarked'].fillna(dataset['Embarked'].mode()[0])

# Delete columns we aren't interested in
for dataset in data_sets:
    del_cols = ['PassengerId','Ticket','Cabin']
    dataset.drop(del_cols, axis=1, inplace = True)

# Check to see if all missing values have been imputed
for dataset in data_sets:
    print(dataset.isnull().sum())


Survived    0
Pclass      0
Name        0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64
Pclass      0
Name        0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64


***
All of the missing values have been imputed so we can move on to engineering new features out of existing ones that aren't going to be very helpful in their current form.
***

### <a id='feature'> 2.3 Feature Engineering </a>

***
To create new features that could provide additional insight into our problem, we need some background knowledge about where the data set comes from. In this case we are looking at whether or not passengers survived the sinking of the Titanic. If we had seen the movie Titanic (you probably have or should) we'd know that when the crew was filling up the lifeboats they put women and children on first. This is more or less true; they also originally filled the boats to only half capacity, believing that the ship would take a very long time to fully sink, but we can ignore these inconsistencies.

Something we can glean from this is that women and children were more likely to be the survivors and that it is also likely that having siblings/parents would correlate to a higher survival rate. To explore this we can combine *Parch* and *SibSp* together for a family size feature and a related feature for whether or not being alone on the ship would indicate a higher or lower survival rate.

Another interesting concept to explore would be how one's social class influenced survival rates. *Pclass* has some rough estimation of social standing but a passenger's title tells us more than that. So, let's create some new features.
***
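As an alternative to chained `split` calls, the title can be pulled out of *Name* with a single regular expression. This is a sketch using three names from the training preview; the pattern is my own assumption about the `Surname, Title. Given names` layout and is not checked against every name in the data:

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
])

# Capture the text between the comma and the first period
titles = names.str.extract(r',\s*([^.]+)\.', expand=False)
print(titles.tolist())
```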

In [78]:
for dataset in data_sets:
    # Family Size + 1 to include the passenger
    dataset['Family_Size'] = dataset['SibSp'] + dataset['Parch'] + 1
    
    # Alone
    dataset['Alone'] = 0 # Initially set them to not be alone
    dataset.loc[dataset['Family_Size'] == 1, 'Alone'] = 1 # Alone if Family_Size is 1 (only the passenger)

    # Titles
    dataset['Title'] = dataset['Name'].str.split(', ', expand = True)[1].str.split('.', expand = True)[0]

    print(dataset.Title.value_counts())

Mr              517
Miss            182
Mrs             125
Master           40
Dr                7
Rev               6
Mlle              2
Major             2
Col               2
Don               1
the Countess      1
Ms                1
Lady              1
Jonkheer          1
Mme               1
Sir               1
Capt              1
Name: Title, dtype: int64
Mr        240
Miss       78
Mrs        72
Master     21
Rev         2
Col         2
Dona        1
Dr          1
Ms          1
Name: Title, dtype: int64


***
We've successfully extracted the titles of everyone on board, but that is a good many titles, some of which occur only once or twice. We can group the rarer titles into one title, *Rare*.
***

In [79]:
for dataset in data_sets:
    # True for titles that appear fewer than 10 times in this data set
    rare_titles = (dataset['Title'].value_counts() < 10)
    dataset['Title'] = dataset['Title'].apply(lambda x: 'Rare' 
                                              if rare_titles.loc[x] else x)
    # You can also use .replace() for dataframes
    print(dataset.Title.value_counts())

Mr        517
Miss      182
Mrs       125
Master     40
Rare       27
Name: Title, dtype: int64
Mr        240
Miss       78
Mrs        72
Master     21
Rare        7
Name: Title, dtype: int64


***
Now our titles are simpler and we don't have to worry about whether being a Jonkheer determines our survival or not.

Another possible issue is that some of our data is discrete, like *Pclass*, while some is continuous. Our target feature is a binary survived-or-not, so this is a classification problem. Classifiers do not strictly require discrete inputs, but binning *Age* and *Fare* puts them on the same small integer scale as the other features, so we convert our continuous data into discrete groups.
***
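Binning at fixed edges can also be written with `pd.cut`. The sketch below uses the first five *Fare* and *Age* values from the training preview shown earlier and the quartile values from `describe()` as bin edges; the `FareBin` and `AgeBin` column names are only illustrative:

```python
import numpy as np
import pandas as pd

# First five Fare and Age values from the training preview
sample = pd.DataFrame({
    'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05],
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0],
})

# Bin edges taken from the describe() quartiles; intervals are (low, high]
fare_edges = [-np.inf, 7.91, 14.454, 31, np.inf]
age_edges = [-np.inf, 20.125, 28, 38, np.inf]

# labels=False returns the integer bin index instead of an Interval
sample['FareBin'] = pd.cut(sample['Fare'], bins=fare_edges, labels=False)
sample['AgeBin'] = pd.cut(sample['Age'], bins=age_edges, labels=False)
print(sample)
```

Because `pd.cut` closes each interval on the right by default, a value equal to an edge (such as an age of 38) falls into the lower bin, the same as the `<=` comparisons would.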

In [80]:
for dataset in data_sets:
    
#     # Cutting the Fare values into 4 groups
#     dataset['FareBin'] = pd.qcut(dataset['Fare'], 4)

#     # Cutting Age into 5 bins while also converting them from float to int
#     dataset['AgeBin'] = pd.cut(dataset['Age'].astype(int), 5)
    
    # Or replacing them with chosen values using quartiles from above as ranges
    dataset.loc[dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare']   = 2
    dataset.loc[dataset['Fare'] > 31, 'Fare'] = 3
    
    # We also convert ages to ranges
    dataset.loc[dataset['Age'] <= 20.125, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 20.125) & (dataset['Age'] <= 28), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 28) & (dataset['Age'] <= 38), 'Age'] = 2
    dataset.loc[dataset['Age'] > 38, 'Age'] = 3
    
    dataset['Fare'] = dataset['Fare'].astype(int)
    dataset['Age'] = dataset['Age'].astype(int)
#     dataset.Fare.astype(int)
#     dataset.Age.astype(int)


In [81]:
titanic.head()

| | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Fare | Embarked | Family_Size | Alone | Title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Braund, Mr. Owen Harris | male | 1 | 1 | 0 | 0 | S | 2 | 0 | Mr |
| 1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 2 | 1 | 0 | 3 | C | 2 | 0 | Mrs |
| 2 | 1 | 3 | Heikkinen, Miss. Laina | female | 1 | 0 | 0 | 1 | S | 1 | 1 | Miss |
| 3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 2 | 1 | 0 | 3 | S | 2 | 0 | Mrs |
| 4 | 0 | 3 | Allen, Mr. William Henry | male | 2 | 0 | 0 | 1 | S | 1 | 1 | Mr |


***
One thing we can't put off is encoding the categorical features. Humans understand that 1st, 2nd, and 3rd class are labels, but a model treats *Pclass* as a plain number, so 3 is simply larger than 1. One-Hot (dummy) encoding expands such a feature into one column per category, holding 1 if the passenger belongs to that category and 0 if they don't. Similarly, string-valued features must be converted to numbers, because machine learning algorithms work with numbers but not so well with strings. In this first pass the string columns are converted to integer codes with `LabelEncoder`, and *Pclass* is left as an ordinal.
***

### <a id='dummy'> 2.4 One-Hot Encoding/Dummy Variables </a>

In [82]:
label = LabelEncoder()

# Change categorical variables into integer codes (label encoding, not one-hot)
for dataset in data_sets:
    dataset['Sex_Var'] = label.fit_transform(dataset['Sex'])
    dataset['Embarked_Code'] = label.fit_transform(dataset['Embarked'])
    dataset['Title_Code'] = label.fit_transform(dataset['Title'])

print(titanic.columns.tolist())
titanic.head()

['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked', 'Family_Size', 'Alone', 'Title', 'Sex_Var', 'Embarked_Code', 'Title_Code']


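| | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Fare | Embarked | Family_Size | Alone | Title | Sex_Var | Embarked_Code | Title_Code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Braund, Mr. Owen Harris | male | 1 | 1 | 0 | 0 | S | 2 | 0 | Mr | 1 | 2 | 2 |
| 1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 2 | 1 | 0 | 3 | C | 2 | 0 | Mrs | 0 | 0 | 3 |
| 2 | 1 | 3 | Heikkinen, Miss. Laina | female | 1 | 0 | 0 | 1 | S | 1 | 1 | Miss | 0 | 2 | 1 |
| 3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 2 | 1 | 0 | 3 | S | 2 | 0 | Mrs | 0 | 2 | 3 |
| 4 | 0 | 3 | Allen, Mr. William Henry | male | 2 | 0 | 0 | 1 | S | 1 | 1 | Mr | 1 | 2 | 2 |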
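The cell above assigns one integer code per category. For comparison, true One-Hot encoding could be sketched with `pd.get_dummies` as follows. The toy frame is made up, and this is not what the notebook's model was trained on:

```python
import pandas as pd

# Made-up rows with one ordinal and one string-valued feature
toy = pd.DataFrame({
    'Pclass': [3, 1, 3],
    'Embarked': ['S', 'C', 'S'],
})

# One 0/1 column per category; dtype=int avoids boolean columns
encoded = pd.get_dummies(toy, columns=['Pclass', 'Embarked'], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

Only categories that actually occur get a column, so when encoding training and validation sets separately it is worth checking that both end up with the same columns.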


We also need to define our target variable and separate our data into training/test sets.

In [83]:
X = titanic.drop(['Survived','Name','Sex','Embarked','Title'],axis=1)
y = titanic['Survived']
X.head()

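| | Pclass | Age | SibSp | Parch | Fare | Family_Size | Alone | Sex_Var | Embarked_Code | Title_Code |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 1 | 1 | 0 | 0 | 2 | 0 | 1 | 2 | 2 |
| 1 | 1 | 2 | 1 | 0 | 3 | 2 | 0 | 0 | 0 | 3 |
| 2 | 3 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 2 | 1 |
| 3 | 1 | 2 | 1 | 0 | 3 | 2 | 0 | 0 | 2 | 3 |
| 4 | 3 | 2 | 0 | 0 | 1 | 1 | 1 | 1 | 2 | 2 |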


In [84]:
# This is for one data set to split, we already have them split up
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)
# X_train = X
# y_train = y
# X_test = titanic_val.drop(['Survived','Name','Sex','Embarked','Title'],axis=1)

In [85]:
X_train.count()

Pclass           668
Age              668
SibSp            668
Parch            668
Fare             668
Family_Size      668
Alone            668
Sex_Var          668
Embarked_Code    668
Title_Code       668
dtype: int64
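The 668 rows come from the default split size: when no `test_size` is given, `train_test_split` holds out 25% of the rows. A small sketch with 891 placeholder rows (the data and the alternating labels are made up) shows the resulting sizes, plus the optional `stratify` argument, which the notebook does not use:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 891 placeholder rows, matching the size of the training file
X_demo = np.arange(891).reshape(-1, 1)
y_demo = np.arange(891) % 2  # made-up alternating labels

# With no test_size given, scikit-learn holds out 25% of the rows
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)
print(len(X_tr), len(X_te))

# stratify keeps the class balance the same in both parts
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42)
print(y_tr_s.mean(), y_te_s.mean())
```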

In [86]:
# Logreg
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
Y_pred = logreg.predict(X_test)
# Note: this scores the training split; logreg.score(X_test, y_test) would give held-out accuracy
acc_log = round(logreg.score(X_train, y_train) * 100, 2)
acc_log

80.39
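The figure above is measured on the same rows the model was fitted to. A hedged sketch of reporting training and held-out accuracy side by side follows; it uses synthetic data from `make_classification` rather than the Titanic features, so the printed numbers say nothing about this competition:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; not the Titanic features
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_tr, y_tr)

train_acc = model.score(X_tr, y_tr)  # accuracy on rows the model has seen
test_acc = model.score(X_te, y_te)   # accuracy on held-out rows
print(round(train_acc * 100, 2), round(test_acc * 100, 2))
```

The held-out number is usually the better guide to how a submission will score.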

In [87]:
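# Caution: X_train has no 'Survived' column, so .delete(0) drops 'Pclass' and every
# label is shifted by one (the value listed for 'Age' belongs to 'Pclass', and so on;
# the 'Title_Code' coefficient is cut off). These are coefficients, not correlations.
coeff_df = pd.DataFrame(X_train.columns.delete(0))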
coeff_df.columns = ['Feature']
coeff_df["Correlation"] = pd.Series(logreg.coef_[0])
coeff_df.sort_values(by='Correlation', ascending=False)

| | Feature | Correlation |
|---|---|---|
| 5 | Alone | 0.602562 |
| 4 | Family_Size | 0.344726 |
| 1 | SibSp | -0.127954 |
| 8 | Title_Code | -0.190725 |
| 6 | Sex_Var | -0.412071 |
| 0 | Age | -0.536295 |
| 3 | Fare | -0.820885 |
| 2 | Parch | -1.029403 |
| 7 | Embarked_Code | -2.407119 |
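One way to keep feature names and coefficients aligned is to index the coefficients by the feature columns directly, so nothing has to be deleted or re-joined. This is a minimal sketch on synthetic data; the column names `f0` to `f3` are hypothetical:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data with hypothetical column names
X_arr, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=['f0', 'f1', 'f2', 'f3'])

model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# coef_[0] has one entry per column of X, in column order
coefs = pd.Series(model.coef_[0], index=X_demo.columns, name='Coefficient')
print(coefs.sort_values(ascending=False))
```

With this layout every feature keeps its own coefficient, and the length of the Series always matches the number of columns.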


In [88]:
random_forest = RandomForestClassifier(n_estimators=100)
first_model = random_forest.fit(X_train, y_train)
Y_pred = random_forest.predict(X_test)
# Note: this is accuracy on the training split, which a random forest tends to fit very closely
acc_random_forest = round(random_forest.score(X_train, y_train) * 100, 2)
acc_random_forest

89.82
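Since training accuracy tends to flatter a random forest, cross-validation is one common way to get a steadier estimate. The sketch below runs on synthetic data, not the Titanic features; the fold count of 5 and the fixed `random_state` are my own choices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; not the Titanic features
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# Fixing random_state makes the forest reproducible between runs
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Each of the 5 folds is held out once while the model trains on the rest
scores = cross_val_score(forest, X_demo, y_demo, cv=5)
print(scores.mean())
```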

In [89]:
titanic_validation['Survived'] = random_forest.predict(titanic_val[X_train.columns.tolist()])
titanic_validation.head()

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Survived |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | | Q | 0 |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | | S | 0 |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | | Q | 0 |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | | S | 0 |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | | S | 0 |


# <a id='sub'> 3. Submission </a>

In [90]:
#submit file
# submit = titanic_validation[['PassengerId','Survived']]
# submit.to_csv("submit.csv", index=False)
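For reference, the shape of the submission file can be checked without writing anything to disk. This sketch uses the first three *PassengerId* values from the validation preview with placeholder predictions, and writes to an in-memory buffer instead of `submit.csv`:

```python
import io
import pandas as pd

# IDs from the validation preview; the predictions here are placeholders
submit = pd.DataFrame({
    'PassengerId': [892, 893, 894],
    'Survived': [0, 0, 0],
})

# Write to an in-memory buffer so the sketch touches no files
buffer = io.StringIO()
submit.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Passing `index=False` keeps the DataFrame index out of the file, leaving only the two columns selected in the commented-out cell.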