# <u>Titanic: Machine Learning from Disaster</u>
## Exploratory Data Analysis with Pandas Profiling

In [1]:
import sys
from google.colab import drive
drive.mount('/gdrive')
drive_path = '/gdrive/My Drive/Open Source Spotlight/Pandas Profiling/'
sys.path.append(drive_path)

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import numpy as np

# disable warnings
import warnings
warnings.filterwarnings('ignore')

Drive already mounted at /gdrive; to attempt to forcibly remount, call drive.mount("/gdrive", force_remount=True).


In [2]:
df = pd.read_csv(drive_path+'titanic.csv')
df.set_index(['PassengerId'], inplace=True)
df.head(10)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Name        891 non-null object
Sex         891 non-null object
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Ticket      891 non-null object
Fare        891 non-null float64
Cabin       204 non-null object
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB


`int64`, `float64` and `object` are the data types of our features. The same method also reveals missing values: any column with fewer than 891 non-null entries, the total number of rows reported above, has gaps.

In [4]:
df.isnull().sum()

Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64
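Raw counts are easier to judge as proportions. The following is a minimal sketch on a small made-up frame (not the Titanic data) of how `isna().mean()` gives the share of missing values per column; the name `toy` and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isna() yields booleans; their column-wise mean is the fraction missing
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to the counts above, `Cabin` is missing in 687 of 891 rows (about 77%) and `Age` in 177 rows (about 20%).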

The `describe` method shows basic statistical characteristics of each numerical feature (`int64` and `float64` types): number of non-missing values, mean, standard deviation, minimum and maximum, median, and the 0.25 and 0.75 quantiles.

In [5]:
df.describe()

,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292


In order to see statistics on non-numerical features, one has to explicitly indicate data types of interest in the `include` parameter.

In [6]:
df.describe(include=['object', 'bool'])

,Name,Sex,Ticket,Cabin,Embarked
count,891,891,891,204,889
unique,891,2,681,147,3
top,"Becker, Miss. Marion Louise",male,347082,G6,S
freq,1,577,7,4,644


For categorical (type `object`) and boolean (type `bool`) features we can use the `value_counts` method. It also works for integer-coded categories such as `Survived`, so let’s have a look at its distribution:

In [7]:
df['Survived'].value_counts()

0    549
1    342
Name: Survived, dtype: int64

Only 342 of the 891 passengers survived; their `Survived` value is `1`. To calculate fractions, pass `normalize=True` to the `value_counts` method.

In [8]:
df['Survived'].value_counts(normalize=True)

0    0.616162
1    0.383838
Name: Survived, dtype: float64
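Because `Survived` is coded as 0/1, the same fraction can also be read off as the column mean. A short sketch on illustrative values (the series below is made up, not the real column):

```python
import pandas as pd

# Illustrative 0/1 outcomes, not the real column
survived = pd.Series([0, 1, 1, 0, 0], name="Survived")

# For a 0/1 column the mean equals the share of ones
rate_from_mean = survived.mean()
rate_from_counts = survived.value_counts(normalize=True).loc[1]
print(rate_from_mean, rate_from_counts)
```

This matches the `mean` of 0.383838 that `describe` reported for `Survived` earlier.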

## Profiling

GitHub repository link: [Pandas Profiling](https://github.com/pandas-profiling/pandas-profiling)
![Pandas Profiling](https://camo.githubusercontent.com/5915a3ee29e2e8be434e69115b247b9dc04d8b09/687474703a2f2f70616e6461732d70726f66696c696e672e6769746875622e696f2f70616e6461732d70726f66696c696e672f646f63732f6173736574732f6c6f676f5f6865616465722e706e67)



In [0]:
# !pip install pandas-profiling

In [0]:
import pandas_profiling

In [0]:
profile = pandas_profiling.ProfileReport(df)
profile.to_file(drive_path + "output.html")  # drive_path already ends with '/'

# Data Cleaning

## Correcting
There are no corrupt entries in this data that require correcting.


## Handling Missing Data (Completing)

Now we’re ready to explore the missing data and rectify it through imputation. There are a number of ways to go about this. Given the small size of the dataset, we probably should not delete entire observations (rows) or variables (columns) that contain missing values. That leaves two options: replacing missing values with a sensible value given the distribution of the data (e.g., the mean, median or mode), or predicting them from the other variables. We’ll use both, relying on some data visualization to guide our decisions.
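As a rough illustration of the simpler option, the sketch below fills a numeric column with its median and a categorical column with its mode. The frame `toy` and its values are made up for the example.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Embarked": ["S", "C", None, "S", "Q"],
})

# Numeric column: fill with the median, which is robust to outliers
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Categorical column: fill with the most frequent value (the mode)
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode().iloc[0])
print(toy)
```

The median is often preferred over the mean for skewed columns such as `Fare`, where a few very high values pull the mean upward.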


### Manual Analysis

In [12]:
df[df['Embarked'].isna()]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,
830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,


We will try to infer where these two passengers embarked from their `Fare` and `Pclass`.

In [13]:
# split by class so the first-class median fare per port is visible
sns.boxplot(x='Embarked', y='Fare', hue='Pclass', data=df).axhline(80, ls='--', color='r')

<matplotlib.lines.Line2D at 0x7fe0b48d8940>

The median fare for a first-class passenger departing from Cherbourg (‘C’) coincides nicely with the $80 paid by our two passengers with no recorded port. We can reasonably replace the NA values with ‘C’.

In [0]:
# fillna handles every missing-value representation, unlike an identity check against np.nan
df['Embarked'] = df['Embarked'].fillna('C')
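To check the median numerically rather than reading it off the box plot, fares can be grouped by port and class. This is a sketch on made-up fares; in the notebook the same `groupby` would run on `df`.

```python
import pandas as pd

# Made-up fares for illustration; the real notebook would use df
toy = pd.DataFrame({
    "Embarked": ["C", "C", "C", "S", "S", "Q"],
    "Pclass":   [1, 1, 3, 1, 3, 3],
    "Fare":     [76.0, 84.0, 7.2, 52.0, 8.0, 7.7],
})

# Median fare for every port/class combination, ports as rows
median_fare = toy.groupby(["Embarked", "Pclass"])["Fare"].median().unstack()
print(median_fare)
```

Combinations with no passengers show up as `NaN` in the resulting table.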

### Predictive Analysis
There are quite a few missing Age values in our data. We are going to get a bit more fancy in imputing missing age values. Why? Because we can. We will create a model predicting ages based on other variables.



In [15]:
df['Age'].isna().sum()

177

In [16]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(max_iter=1000, min_value=0, max_value=1000, random_state=0, verbose=1)

# keep an original version for comparison
orig_age = df['Age'].copy()

# for simplicity's sake, we will use only numeric columns for the prediction
cols = df.select_dtypes([np.number]).columns
imputer.fit(df[cols])

[IterativeImputer] Completing matrix with shape (891, 6)
[IterativeImputer] Change: 29.69911764705882, scaled tolerance: 0.5123292 
[IterativeImputer] Change: 0.0, scaled tolerance: 0.5123292 
[IterativeImputer] Early stopping criterion reached.


IterativeImputer(add_indicator=False, estimator=None,
                 imputation_order='ascending', initial_strategy='mean',
                 max_iter=1000, max_value=1000, min_value=0, missing_values=nan,
                 n_nearest_features=None, random_state=0,
                 sample_posterior=False, skip_complete=False, tol=0.001,
                 verbose=1)

In [17]:
df[cols] = imputer.transform(df[cols])

[IterativeImputer] Completing matrix with shape (891, 6)


We will assess the imputation by comparing the `Age` histograms before and after the transformation.

In [18]:
df['Age'].isna().sum()

0

In [19]:
fig, axes = plt.subplots(1, 2)
axes[0].set_title('Before Imputation')
axes[1].set_title('After Imputation')
orig_age.hist(bins=35, density=True, ax=axes[0])
df['Age'].hist(bins=35, density=True, ax=axes[1])

<matplotlib.axes._subplots.AxesSubplot at 0x7fe0aeaf8668>

Most of the imputed ages appear to fall in the 30s, which is plausible given a mean age of about 30.
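The histograms mix observed and imputed values, so it can help to look at the imputed values on their own. A minimal sketch, assuming two hypothetical series `orig` and `filled` that play the roles of `orig_age` and `df['Age']`:

```python
import numpy as np
import pandas as pd

# Toy stand-ins: ages before and after some imputation step
orig = pd.Series([22.0, np.nan, 35.0, np.nan, 54.0])
filled = pd.Series([22.0, 29.5, 35.0, 31.0, 54.0])

# Rows that were missing originally are exactly the ones that got imputed
was_missing = orig.isna()
imputed_only = filled[was_missing]

print(imputed_only.describe())
# Observed values must be left untouched by the imputer
assert filled[~was_missing].equals(orig[~was_missing])
```

A very narrow spread among the imputed values would suggest that the model is doing little more than filling in a constant.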

### Feature Dropping
Columns with high cardinality and/or many missing values are candidates for elimination. There's also room for applying common sense and reasonable causality (e.g. the ticket number is unlikely to affect the probability of survival, even if by some chance it correlates with it in the data).

In [0]:
df.drop(['Ticket', 'Cabin'], axis=1, inplace=True)

`Name` also has high cardinality (every value is unique), but with this specific column we can be a bit more creative.

## Feature Engineering (Creating)
### Can a name predict survival?

In [21]:
df['Name'].head()

PassengerId
1                              Braund, Mr. Owen Harris
2    Cumings, Mrs. John Bradley (Florence Briggs Th...
3                               Heikkinen, Miss. Laina
4         Futrelle, Mrs. Jacques Heath (Lily May Peel)
5                             Allen, Mr. William Henry
Name: Name, dtype: object

The `Name` feature contains each passenger's title.

Since passengers with a distinguished title may have been given preference during the evacuation, the title is an interesting feature to add to the model.


In [22]:
df["Title"] = [name.split(",")[1].split(".")[0].strip() for name in df['Name']]
df["Title"].unique()

array(['Mr', 'Mrs', 'Miss', 'Master', 'Don', 'Rev', 'Dr', 'Mme', 'Ms',
       'Major', 'Lady', 'Sir', 'Mlle', 'Col', 'Capt', 'the Countess',
       'Jonkheer'], dtype=object)
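The same extraction can be written with a vectorized regular expression. The sketch below is an alternative to the list comprehension, not the notebook's method; the sample names follow the dataset's "Surname, Title. Given names" format.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture everything between the first comma and the following period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```

One design difference: a name that does not match the pattern yields `NaN` here, whereas the `split`-based version raises an `IndexError`.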

In [0]:
cp = sns.countplot(x="Title",data=df)
cp = plt.setp(cp.get_xticklabels(), rotation=45) 

There are 17 titles in the dataset. Most of them are very rare, so we can group them into 4 categories.



In [0]:
df["Title"] = df["Title"].replace(['Lady', 'the Countess','Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Nobility')
df["Title"] = df["Title"].replace([np.nan], 'Mr') # anomaly in parsing
df["Title"] = df["Title"].replace(["Miss", "Ms", "Mme", "Mlle", "Mrs"], 'Miss-Mrs')
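An alternative way to express the grouping is a lookup table with a fallback, so that any unexpected title lands in the catch-all group. This is a sketch; `group_of` is a hypothetical name and the series is a small sample.

```python
import pandas as pd

titles = pd.Series(["Mr", "Mrs", "Mlle", "Master", "Dr", "Capt"])

# Hypothetical lookup table; the grouping mirrors the replace calls above
group_of = {
    "Mr": "Mr", "Master": "Master",
    "Miss": "Miss-Mrs", "Ms": "Miss-Mrs", "Mme": "Miss-Mrs",
    "Mlle": "Miss-Mrs", "Mrs": "Miss-Mrs",
}

# Any title missing from the table falls back to the catch-all group
grouped = titles.map(group_of).fillna("Nobility")
print(grouped.tolist())
```

With this approach a title that appears only in unseen data is still assigned a group instead of creating a new category.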

In [0]:
cp = sns.countplot(x="Title",data=df)
cp = plt.setp(cp.get_xticklabels(), rotation=45) 

We will check whether we can visually observe any relationship between these title groups and survival.

In [26]:
# factorplot was renamed to catplot in seaborn 0.9; kind='point' keeps the old default
sns.catplot(x="Title", y='Survived', data=df, kind='point')

<seaborn.axisgrid.FacetGrid at 0x7fe0ae9ca518>

If we don't see any visual difference, we should turn to statistics to make sure we're not dropping anything significant.

### Nominal Variable Correlation

In [27]:
!pip install researchpy
import researchpy as rp



In [28]:
table, results = rp.crosstab(df['Title'], df['Survived'], prop='col', test='chi-square')
table

Title,Survived=0 (%),Survived=1 (%),All (%)
Master,3.1,6.73,4.49
Miss-Mrs,14.75,67.25,34.9
Mr,79.42,23.68,58.02
Nobility,2.73,2.34,2.58
All,100.0,100.0,100.0


In [29]:
results

,Chi-square test,results
0,Pearson Chi-square ( 3.0) =,285.4969
1,p-value =,0.0
2,Cramer's V =,0.5661


The p-value is effectively zero and Cramér's V is about 0.57, which indicates a strong association between `Title` and `Survived`.
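To make the two statistics less of a black box, here is a sketch that computes them by hand with NumPy. The counts are an assumption: they are reconstructed from the column percentages in the table above together with the 549/342 split of `Survived`.

```python
import numpy as np

# Counts reconstructed from the column percentages in the table above
# (549 non-survivors, 342 survivors); treat them as an assumption.
# Rows: Master, Miss-Mrs, Mr, Nobility. Columns: Survived = 0, 1.
observed = np.array([
    [17, 23],
    [81, 230],
    [436, 81],
    [15, 8],
], dtype=float)

n = observed.sum()
# Expected counts under independence: row total * column total / n
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
chi2 = ((observed - expected) ** 2 / expected).sum()

# Cramér's V rescales chi-square to the range [0, 1]
cramers_v = np.sqrt(chi2 / (n * (min(observed.shape) - 1)))
print(round(chi2, 2), round(cramers_v, 4))
```

With these counts the calculation reproduces the reported chi-square of about 285.5 and Cramér's V of about 0.566.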



In [0]:
df.drop('Name', axis=1, inplace=True)

### Does having family help survive or the opposite?
Create a feature describing the size of a passenger's family (including the passenger).

In [0]:
df['FamilySize'] = (df['SibSp'] + df['Parch']).add(1).astype(int)

In [32]:
# factorplot was renamed to catplot in seaborn 0.9; kind='point' keeps the old default
sns.catplot(x='FamilySize', y='Survived', data=df, kind='point')

<seaborn.axisgrid.FacetGrid at 0x7fe0ae4615f8>

There seems to be a positive trend in survival up to FamilySize = 4, followed by a downtrend for larger families.
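A point plot does not show how many passengers stand behind each point, and a rate based on a handful of rows is noisy. The sketch below, on illustrative rows only, tabulates the survival rate next to the group size.

```python
import pandas as pd

# Illustrative rows only; column names follow the notebook
toy = pd.DataFrame({
    "FamilySize": [1, 1, 1, 2, 2, 4, 4, 7],
    "Survived":   [0, 0, 1, 1, 0, 1, 1, 0],
})

# Mean of a 0/1 column is the survival rate; count shows group support
summary = toy.groupby("FamilySize")["Survived"].agg(["mean", "count"])
print(summary)
```

Running the same aggregation on `df` would show how much weight each family size carries in the trend.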

In [33]:
corr = df[['FamilySize', 'Survived']].corr(method='spearman')
sns.heatmap(corr, vmin=-1, vmax=1, cmap='coolwarm', annot=True)

<matplotlib.axes._subplots.AxesSubplot at 0x7fe0ae4612e8>

Because the trend rises and then falls, a single correlation coefficient summarizes it poorly, and the correlation plot suggests that a linear estimator won't do a good job of capturing this feature.
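The effect is easy to reproduce in isolation. In this sketch the rates are hypothetical and perfectly symmetric around size 4, and the Spearman coefficient is computed from its definition as the Pearson correlation of the ranks.

```python
import pandas as pd

# Hypothetical rates that rise up to size 4 and then fall symmetrically
size = pd.Series([1, 2, 3, 4, 5, 6, 7])
rate = pd.Series([0.1, 0.3, 0.5, 0.7, 0.5, 0.3, 0.1])

# Spearman correlation is the Pearson correlation of the ranks
rho = size.rank().corr(rate.rank())
print(round(rho, 3))
```

The relationship is strong but not monotonic, so the coefficient comes out essentially zero. A tree-based model or a binned version of the feature can capture such a shape where a single linear term cannot.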

In [0]:
df.drop(['SibSp', 'Parch'], axis=1, inplace=True)

## Encoding (Converting)

In [35]:
df.head()

PassengerId,Survived,Pclass,Sex,Age,Fare,Embarked,Title,FamilySize
1,0.0,3.0,male,22.0,7.25,S,Mr,2
2,1.0,1.0,female,38.0,71.2833,C,Miss-Mrs,2
3,1.0,3.0,female,26.0,7.925,S,Miss-Mrs,1
4,1.0,1.0,female,35.0,53.1,S,Miss-Mrs,2
5,0.0,3.0,male,35.0,8.05,S,Mr,1


In [0]:
# convert Sex into categorical value 0 for male and 1 for female
df["Sex"] = df["Sex"].map({"male": 0, "female":1})

# Return the Survived column to binary integers
df["Survived"] = df["Survived"].astype('int')

In [37]:
df = pd.get_dummies(df)
df.head()

PassengerId,Survived,Pclass,Sex,Age,Fare,FamilySize,Embarked_C,Embarked_Q,Embarked_S,Title_Master,Title_Miss-Mrs,Title_Mr,Title_Nobility
1,0,3.0,0,22.0,7.25,2,0,0,1,0,0,1,0
2,1,1.0,1,38.0,71.2833,2,1,0,0,0,1,0,0
3,1,3.0,1,26.0,7.925,1,0,0,1,0,1,0,0
4,1,1.0,1,35.0,53.1,2,0,0,1,0,1,0,0
5,0,3.0,0,35.0,8.05,1,0,0,1,0,0,1,0
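Two optional arguments of `get_dummies` are worth knowing. Depending on the pandas version the dummy columns may come back as booleans rather than the 0/1 integers shown above, and `dtype=int` makes the output explicit. `drop_first=True` removes one level per feature, which some linear models prefer. A sketch on a made-up frame:

```python
import pandas as pd

toy = pd.DataFrame({
    "Fare": [7.25, 71.28, 8.05],
    "Embarked": ["S", "C", "Q"],
})

# dtype=int forces 0/1 columns; drop_first removes one redundant level
encoded = pd.get_dummies(toy, columns=["Embarked"], dtype=int, drop_first=True)
print(encoded)
```

Here `Embarked_C` is dropped, so a row with zeros in both remaining columns stands for Cherbourg.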


# Verify

In [0]:
profile = pandas_profiling.ProfileReport(df)
profile.to_file(drive_path + "output_result.html")  # drive_path already ends with '/'