# Titanic Dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

## Data Description

The columns/variables in this dataset are:  
> **PassengerId** = An index number for each passenger  
**Survived** = Whether or not the passenger survived  
**Pclass** = The passenger class (1 = 1st, 2 = 2nd, 3 = 3rd)  
**Name** = The name of the passenger  
**Sex** = The gender of the passenger  
**Age** = The age of the passenger  
**SibSp** = The number of siblings/spouses the passenger has aboard  
**Parch** = The number of parents/children the passenger has aboard  
**Ticket** = The ticket number  
**Fare** = The cost of the ticket  
**Cabin** = The passenger's cabin number  
**Embarked** = The port the passenger embarked from (C = Cherbourg, Q = Queenstown, S = Southampton)  

In [None]:
train_df = pd.read_csv('./Data/train.csv')
test_df = pd.read_csv('./Data/test.csv')

In [None]:
train_df.head()

In [None]:
test_df.head()

A quick look at the training and test data sets shows that the columns are the same except for the Survived column, which is missing from the test data because it is the value we will be predicting. Let's start by cleaning up the training data.

In [None]:
train_df.info()

In [None]:
train_df.isnull().sum()

Looks like there are some null values that need to be taken care of for the Age, Cabin, and Embarked columns. We'll start by looking at the Age column.

In [None]:
print(f'Mean Age: {train_df.Age.mean()}')
print(f'Median Age: {train_df.Age.median()}')

We could fill the null values with either the average age or the median age of all passengers, both values being very similar. However, we can better predict age by grouping similar passengers together. We'll group passengers by Pclass and Sex, assuming that the people in each Pclass are more similar to each other than the people in other Pclasses.
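
To make the difference between the two options concrete, here is a minimal sketch on a toy frame. The rows are invented for illustration and are not taken from train.csv; only the column names match the dataset.

```python
import numpy as np
import pandas as pd

# Toy rows invented for illustration; these are not values from train.csv.
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Sex': ['female', 'female', 'female', 'male', 'male', 'male'],
    'Age': [30.0, 40.0, np.nan, 20.0, 26.0, np.nan],
})

# Option 1: one median for everyone.
global_fill = toy['Age'].fillna(toy['Age'].median())

# Option 2: one median per (Pclass, Sex) group. transform('median') returns a
# value for every row, aligned to the original index, so fillna can use it directly.
group_median = toy.groupby(['Pclass', 'Sex'])['Age'].transform('median')
group_fill = toy['Age'].fillna(group_median)

print(pd.DataFrame({'global': global_fill, 'by_group': group_fill}))
```

With the global median both missing ages get the same value, while the grouped version gives each missing passenger the median of their own class and sex.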

In [None]:
s_class = train_df.groupby(['Pclass', 'Sex'])

In [None]:
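## numeric_only=True skips text columns such as Name and Ticket, which newer pandas versions refuse to aggregate
s_class.median(numeric_only=True)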

In [None]:
s_class.mean(numeric_only=True)

Just as we thought, grouping similar passengers together gives a much clearer picture of the age differences across the Pclass/Sex groups and lets us fill the null ages with more accurate estimates. Let's go ahead and fill the null age values with the median age of each group.

Grouping passengers this way also shows the discrepancy in fare price between Pclass/Sex.

In [None]:
train_df.loc[train_df['Age'].isnull()]

In [None]:
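## replace null values with median values of each group
## assigning the result back avoids the chained inplace fillna, which newer pandas versions warn about

train_df['Age'] = train_df['Age'].fillna(s_class['Age'].transform('median'))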

In [None]:
train_df['Age'].isnull().sum()

In [None]:
train_df['Cabin'].isnull().sum() / len(train_df)

Because there are so many null values for the cabin column (77% of the cabin column is a null value), we'll remove it from the data.
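
The same check can be run for every column at once. The sketch below uses a toy frame with made-up values, and the 0.5 cutoff is a hypothetical threshold chosen for the example, not a rule used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy rows invented for illustration; these are not values from train.csv.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0, 54.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 8.05, 51.86],
})

# isnull() gives True/False per cell; the column mean of that is the fraction missing.
missing_frac = toy.isnull().mean()

# Hypothetical cutoff: flag columns where more than half of the values are missing.
threshold = 0.5
too_sparse = missing_frac[missing_frac > threshold].index.tolist()

print(missing_frac)
print('Columns above threshold:', too_sparse)
```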

In [None]:
train_df.drop('Cabin', axis=1, inplace=True)
train_df.head()

In [None]:
train_df.loc[train_df['Embarked'].isnull()]

In [None]:
train_df['Embarked'].unique()

There are only 2 null values for the Embarked column and 3 unique values for the column. We'll try to find the most similar passengers to the rows with null values and use the embarked values that we find.

In [None]:
## creating a new df so it's easier to navigate through the values

em_null = train_df.loc[(train_df['Pclass'] == 1) & (train_df['Sex'] == 'female') & (train_df['Survived'] == 1) & 
                       (train_df['SibSp'] == 0) & (train_df['Parch'] == 0)]
em_null.head()

In [None]:
## checking values for embarked for similar passengers

em_null['Embarked'].value_counts()

In [None]:
## checking if filtering by age will affect the values

em_null.loc[em_null['Age'] < 45]['Embarked'].value_counts()

In [None]:
em_null.loc[em_null['Age'] > 45]['Embarked'].value_counts()

While the Embarked values are distributed similarly among comparable passengers, there are somewhat more 'C' values for younger passengers and more 'S' values for older ones. The difference isn't large enough to be certain of these values, but we'll go ahead and use 'C' for the younger passenger and 'S' for the older one.
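
The two filtered value counts can also be seen side by side in a single table. This is only a sketch on invented rows: the ages and ports below are made up, and the `AgeBand` labels are names introduced for the example.

```python
import numpy as np
import pandas as pd

# Toy rows invented for illustration; these are not values from train.csv.
toy = pd.DataFrame({
    'Age': [24, 30, 38, 41, 50, 58, 62, 33],
    'Embarked': ['C', 'C', 'S', 'C', 'S', 'S', 'C', 'S'],
})

# Label each passenger by age band (AgeBand is a name made up for this sketch).
band = pd.Series(np.where(toy['Age'] > 45, 'over 45', '45 or under'), name='AgeBand')

# Rows are age bands, columns are ports, cells are passenger counts.
counts = pd.crosstab(band, toy['Embarked'])

# Most common port within each age band.
most_common = counts.idxmax(axis=1)

print(counts)
print(most_common)
```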

In [None]:
## replacing null values
## selecting the column by name is safer than relying on Embarked being the last column

train_df.loc[61, 'Embarked'] = 'C'
train_df.loc[829, 'Embarked'] = 'S'

In [None]:
train_df.info()

In [None]:
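## correlation is only defined for numeric columns, so select those first
train_df.select_dtypes(include='number').corr()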

Looking at the correlation values, passenger id has almost 0 correlation with whether or not a person survived, and very low correlation with every other column. After looking at the data, the passenger id appears to be just an index number assigned to each passenger, so we'll remove the column.
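
When only the relationship with Survived matters, the full matrix can be reduced to a single ranked column. The sketch below runs on a toy frame with invented values, so the numbers it prints say nothing about the real data.

```python
import pandas as pd

# Toy rows invented for illustration; these are not values from train.csv.
toy = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4, 5, 6],
    'Survived': [0, 1, 1, 0, 1, 0],
    'Pclass': [3, 1, 1, 3, 2, 3],
    'Name': ['a', 'b', 'c', 'd', 'e', 'f'],
})

# Keep numeric columns only, then take the Survived column of the correlation matrix.
corr = toy.select_dtypes(include='number').corr()['Survived'].drop('Survived')

# Order the features by the strength of the correlation, ignoring its sign.
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)

print(ranked)
```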

In [None]:
train_df.drop('PassengerId', axis=1, inplace=True)
train_df.head()

In [None]:
train_df['Pclass'].value_counts()

In [None]:
train_df['SibSp'].value_counts()

In [None]:
train_df['Parch'].value_counts()

In [None]:
train_df['Fare'].describe()

Looking at fare, there are a few fares of 0. Let's take a look and see whether they should be treated as null values.

In [None]:
train_df.loc[train_df['Fare'] == 0]

It looks like all the passengers with a fare of 0 are male, do not have any family on board, and all departed from the same port. Since they are in different classes, we will replace the fare with the mean fare for each class/gender.

In [None]:
train_df.groupby(['Pclass', 'Sex'])['Fare'].mean()

In [None]:
## changing the 0's to null values so that we can use the same method we used to change null values for Age
## we regroup by pclass and sex so the group means are computed after the 0's have been replaced

train_df['Fare'] = train_df['Fare'].replace(0, np.nan)
train_df['Fare'] = train_df['Fare'].fillna(train_df.groupby(['Pclass', 'Sex'])['Fare'].transform('mean'))

In [None]:
## checking to see if the data was inputted correctly

train_df.iloc[[179, 263, 277]]

Now that we've taken care of the values in the training set, let's do the same for the missing values in the test dataset.
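
Since the same two steps (treat 0 as missing where needed, then fill by group) are applied to both data sets, they could be wrapped in one function. `fill_by_group` and its parameters are hypothetical names introduced for this sketch; the notebook itself repeats the steps inline.

```python
import numpy as np
import pandas as pd

def fill_by_group(df, column, keys=('Pclass', 'Sex'), stat='median', zero_is_missing=False):
    """Return df[column] with gaps filled by a per-group statistic.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    values = df[column].astype(float)
    if zero_is_missing:
        # Treat 0 as a missing value before computing the group statistic.
        values = values.replace(0, np.nan)
    group_stat = values.groupby([df[k] for k in keys]).transform(stat)
    return values.fillna(group_stat)

# Toy rows invented for illustration; these are not values from the real files.
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Sex': ['male'] * 6,
    'Age': [40.0, np.nan, 50.0, 20.0, 30.0, np.nan],
    'Fare': [0.0, 60.0, 80.0, 8.0, 0.0, 10.0],
})

toy['Age'] = fill_by_group(toy, 'Age', stat='median')
toy['Fare'] = fill_by_group(toy, 'Fare', stat='mean', zero_is_missing=True)
print(toy)
```

The function returns a new Series instead of modifying the frame in place, so the caller decides where to assign the result.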

In [None]:
## Looks like we have to take care of the missing values in Age, Fare, and Cabin

test_df.info()

In [None]:
## we'll use the same method as we did for the training data to fill null values for Age

test_df['Age'] = test_df['Age'].fillna(test_df.groupby(['Pclass', 'Sex'])['Age'].transform('median'))

In [None]:
test_df['Age'].isnull().sum()

In [None]:
## dropping the cabin column and passenger id column

test_df.drop(['PassengerId', 'Cabin'], axis=1, inplace=True)
test_df.head()

In [None]:
## take care of null values in fare using the same methods for the training set

test_df.loc[test_df['Fare'].isnull()]

In [None]:
## looks like there are also a couple 0 fare values we need to take care of

test_df.loc[test_df['Fare'] == 0]

In [None]:
test_df.groupby(['Pclass', 'Sex'])['Fare'].mean()

In [None]:
## np.nan is the supported spelling; the np.NaN alias was removed in NumPy 2.0
test_df['Fare'] = test_df['Fare'].replace(0, np.nan)
test_df['Fare'] = test_df['Fare'].fillna(test_df.groupby(['Pclass', 'Sex'])['Fare'].transform('mean'))

In [None]:
test_df.iloc[[152, 266, 372]]

In [None]:
test_df.isnull().sum()

Now that we've taken care of our data, let's move on to some visualizations.

## EDA and Visualizations

In [None]:
plt.figure(figsize=(10, 8))
sns.violinplot(x='Pclass', y='Age', hue='Survived', data=train_df, split=True)
plt.title('Survival Rate Amongst Different Classes')
plt.show()

In [None]:
## find the percentage of passengers that survived in each pclass

train_df.groupby('Pclass')['Survived'].mean()

Looking at this plot, there are sharp peaks around ages 20-35 for pclass 2 and 3, falling off quickly thereafter. For pclass 1, however, there is no sharp peak, and most passengers survive until around age 45, after which there are more deaths than survivors. Passengers in pclass 2 and 3 also appear to be younger than those in pclass 1.

In [None]:
## survival rate (%) per pclass, split at age 45 (the "Under" group includes passengers aged exactly 45)
## Survived is 0/1, so its mean is the share of survivors

for label, age_mask in [('Under', train_df['Age'] <= 45), ('Over', train_df['Age'] > 45)]:
    for pclass in (1, 2, 3):
        group = train_df.loc[(train_df['Pclass'] == pclass) & age_mask, 'Survived']
        print(f'Pclass {pclass} {label} 45 Survival Rate: ', group.mean() * 100, '%')
    print('')

Looking at the survival rates of passengers over/under 45 for all 3 classes, it is very clear that passengers in pclass 1 had a much higher chance of survival than those in the other pclasses. The survival rate for younger passengers is much higher in all 3 classes, with passengers under 45 having about a 20% higher chance of survival than passengers over 45. We can only speculate as to why there is such a gap in survival rate between the two age groups. Perhaps the younger, able-bodied passengers were able to make it to the lifeboats more quickly or survive the frigid temperatures. The older passengers may also have given up their spots to the younger passengers out of altruism.
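
The same breakdown can be laid out as a single table by binning age and pivoting. This is a sketch on invented rows, and `AgeBand` is a column name made up for the example.

```python
import pandas as pd

# Toy rows invented for illustration; these are not values from train.csv.
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 1, 3, 3, 3, 3],
    'Age':      [30, 40, 50, 60, 22, 28, 47, 55],
    'Survived': [1, 1, 1, 0, 1, 0, 0, 0],
})

# pd.cut bins are right-inclusive by default: (0, 45] and (45, 120].
toy['AgeBand'] = pd.cut(
    toy['Age'], bins=[0, 45, 120], labels=['45 or under', 'over 45']
).astype(str)

# Survived is 0/1, so the mean per cell is the survival rate; scale it to percent.
rates = toy.pivot_table(
    index='Pclass', columns='AgeBand', values='Survived', aggfunc='mean'
) * 100

print(rates)
```

Each row is a class and each column an age band, which makes the class and age effects easier to compare than six separate printed lines.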

Let's take a look to see if there are any differences in survival rate amongst genders in the pclasses.

In [None]:
## create a new column that groups pclass and sex

train_df['PclassSex'] = train_df['Pclass'].astype(str) + train_df['Sex']
train_df.head()

In [None]:
plt.figure(figsize=(10, 8))
sns.violinplot(x='PclassSex', y='Age', hue='Survived', data=train_df, split=True, order=['1male', '1female', '2male', '2female', '3male', '3female'])
plt.title('Survival Rate Amongst Different Classes')
plt.show()

In [None]:
## survival rate (%) for each pclass/sex group
## Survived is 0/1, so its mean is the share of survivors

for sex in ('male', 'female'):
    for pclass in (1, 2, 3):
        group = train_df.loc[train_df['PclassSex'] == f'{pclass}{sex}', 'Survived']
        print(f'Pclass {pclass} {sex.capitalize()} Survival Rate: ', group.mean() * 100, '%')
    print('')

After splitting up the classes into genders, we can see that there is a huge discrepancy in survival rates between genders. The survival rate of males in class 1, at 37%, is a little more than double the survival rate of males in classes 2-3. Females in classes 1 and 2 had a 90+% chance of survival, while females in class 3 had a 50% chance of survival. It's clear from this data that most of the survivors were female, and we should make sure to take this into account when making our models.
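
When comparing rates like these it helps to see the group sizes next to them, since a rate based on a handful of passengers is less reliable. A minimal sketch on invented rows, using pandas named aggregation:

```python
import pandas as pd

# Toy rows invented for illustration; these are not values from train.csv.
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 1, 3, 3, 3, 3],
    'Sex':      ['female', 'female', 'male', 'male', 'female', 'female', 'male', 'male'],
    'Survived': [1, 1, 1, 0, 1, 0, 0, 0],
})

# One row per (Pclass, Sex) group: survival rate and number of passengers.
summary = toy.groupby(['Pclass', 'Sex'])['Survived'].agg(rate='mean', n='count')
summary['rate'] = summary['rate'] * 100

print(summary)
```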

In [None]:
plt.figure(figsize=(15, 5))
sns.countplot(x='SibSp', hue='Survived', data=train_df)
plt.title('# of Sibling/Spouse Survival Rate')
plt.show()

In [None]:
plt.figure(figsize=(15, 5))
sns.countplot(x='Parch', hue='Survived', data=train_df)
plt.title('# of Parent/Child Survival Rate')
plt.show()

Taking a quick look at the survival rates of passengers with siblings/spouses and parents/children, it looks like most of the passengers do not have any siblings/spouses or parents/children. Let's take a look at the graphs excluding those passengers to get a better look at the survival rates of passengers with siblings/spouses or parents/children.

In [None]:
plt.figure(figsize=(15, 5))
sns.countplot(x='SibSp', hue='Survived', data=train_df.loc[train_df['SibSp'] > 0])
plt.title('# of Sibling/Spouse Survival Rate')
plt.show()

In [None]:
plt.figure(figsize=(15, 5))
sns.countplot(x='Parch', hue='Survived', data=train_df.loc[train_df['Parch'] > 0])
plt.title('# of Parent/Child Survival Rate')
plt.show()

Taking a look at these graphs, it looks like the solo passengers without any siblings/spouses or parents/children are more likely to not survive. For passengers with company, having 1 sibling/spouse or parent/child increases the chance of survival, while having more than that does not seem to increase it further. Since these graphs don't tell us whether a passenger has both siblings and parents on board, let's make a new column called 'Family' that accounts for both of these values.

In [None]:
## we'll add sibsp and parch to see the number of people in the family
## we add a 1 at the end to account for the passenger as well

train_df['Family'] = train_df['SibSp'] + train_df['Parch'] + 1
train_df.head()

In [None]:
plt.figure(figsize=(15, 5))
sns.countplot(x='Family', hue='Survived', data=train_df)
plt.title('Survival Rate Based on Number of Family Members')
plt.show()

We can see that solo passengers are still at the highest risk. Passengers that have 2-4 members are more likely to survive while the survival rate starts to dip considerably for passengers with families of 5 or more. There are many reasons why this may be the case; solo passengers may not have known the ship was sinking until it was too late. People with family members could have been alerted earlier with more members being aware of what's happening around the ship. People with large families could have had a hard time gathering all the members of their family. These reasons are all just speculation on my part, but we can see that family size does have an impact on the survival rate of a passenger.
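
The count plot shows how many passengers survived in each family size, but not the rate. The sketch below computes the rate per size bucket on invented rows. The bucket boundaries follow the groups described above (solo, 2-4, 5 or more), and `FamilyGroup` is a column name made up for the example.

```python
import pandas as pd

# Toy rows invented for illustration; these are not values from train.csv.
toy = pd.DataFrame({
    'SibSp':    [0, 0, 1, 1, 0, 4, 3],
    'Parch':    [0, 0, 0, 1, 2, 2, 2],
    'Survived': [0, 1, 1, 1, 1, 0, 0],
})

# Same definition as the Family column: relatives on board plus the passenger.
toy['Family'] = toy['SibSp'] + toy['Parch'] + 1

# Bucket family sizes: 1, 2-4, and 5 or more (bins are right-inclusive).
toy['FamilyGroup'] = pd.cut(
    toy['Family'], bins=[0, 1, 4, 20], labels=['solo', 'small (2-4)', 'large (5+)']
).astype(str)

# Survived is 0/1, so the group mean is the survival rate.
family_rates = toy.groupby('FamilyGroup')['Survived'].mean() * 100

print(family_rates)
```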

## Feature Engineering

In [None]:
train_df.head()

In [None]:
# drop columns that cannot be used in model and separate independent and dependent variables

X = train_df.drop(labels=['Survived', 'Name', 'Ticket'], axis=1)
y = train_df['Survived']

In [None]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
X['Sex'] = encoder.fit_transform(X['Sex'])

In [None]:
pclasssex_dummies = pd.get_dummies(X['PclassSex'], drop_first=True)
embarked_dummies = pd.get_dummies(X['Embarked'], drop_first=True)

X.drop(['PclassSex', 'Embarked'], axis=1, inplace=True)
X = pd.concat([X, pclasssex_dummies, embarked_dummies], axis=1)
X.head()

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=23)

In [None]:
## normalize our data
from sklearn.preprocessing import MinMaxScaler

norm = MinMaxScaler().fit(X_train)

X_train_norm = norm.transform(X_train)
X_test_norm = norm.transform(X_test)

In [None]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(fit_intercept=False, C=1e12, solver='liblinear')
baseline = logreg.fit(X_train, y_train)
baseline

In [None]:
y_hat_test = logreg.predict(X_test)
y_hat_train = logreg.predict(X_train)

In [None]:
residuals = np.abs(y_train - y_hat_train)
print(pd.Series(residuals).value_counts())
print(pd.Series(residuals).value_counts(normalize=True))

In [None]:
residuals = np.abs(y_test - y_hat_test)
print(pd.Series(residuals).value_counts())
print(pd.Series(residuals).value_counts(normalize=True))

In [None]:
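## refit the same estimator on the scaled features
## fit() returns the estimator itself, so this overwrites the baseline fit and `baseline` and `normalized` refer to the same object
normalized = logreg.fit(X_train_norm, y_train)
normalized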

In [None]:
y_hat_test = logreg.predict(X_test_norm)
y_hat_train = logreg.predict(X_train_norm)

In [None]:
residuals = np.abs(y_train - y_hat_train)
print(pd.Series(residuals).value_counts())
print(pd.Series(residuals).value_counts(normalize=True))

In [None]:
residuals = np.abs(y_test - y_hat_test)
print(pd.Series(residuals).value_counts())
print(pd.Series(residuals).value_counts(normalize=True))
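
The residual counts above can be read as accuracy: a residual of 0 is a correct prediction and a residual of 1 is a wrong one, so the normalized share of zeros is the fraction of correct predictions. A minimal sketch with made-up labels and predictions:

```python
import numpy as np
import pandas as pd

# Made-up labels and predictions for illustration; not output from the model above.
y_true = pd.Series([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

# Same calculation as in the cells above.
residuals = np.abs(y_true - y_pred)
share = residuals.value_counts(normalize=True)

# Accuracy computed directly as the fraction of matching predictions.
accuracy = (y_true == y_pred).mean()

print(share)
print('Accuracy:', accuracy)
```

scikit-learn's `accuracy_score` in `sklearn.metrics` reports the same quantity.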