In [180]:
import numpy as np
import pandas as pd 
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib import ticker
import seaborn as sns
pd.plotting.register_matplotlib_converters()
%matplotlib inline
import os

In [119]:
train = pd.read_csv("train.csv")

In [13]:
train.head()

In [14]:
train.tail()

The data contains the following information:

* Pclass - a proxy for socio-economic status (SES), where 1st = Upper, 2nd = Middle and 3rd = Lower.
* Sex - male or female.
* Age - fractional if less than 1; estimated ages are given in the form xx.5.
* SibSp - number of siblings / spouses aboard the Synthanic; siblings are brother, sister, stepbrother and stepsister, and spouses are husband and wife (mistresses and fiancés were ignored).
* Parch - number of parents / children aboard the Synthanic; parents are mother and father; children are daughter, son, stepdaughter and stepson. Some children travelled only with a nanny, so Parch is 0 for them.
* Fare - the passenger fare.
* Cabin - the cabin number.
* Embarked - port of embarkation, where C is Cherbourg, Q is Queenstown and S is Southampton.
* Ticket - ticket number.
* Name - the passenger's name.
* Survived - target variable, where 0 is not survived and 1 is survived.
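
The two Age conventions in the list above (fractional ages below 1, and xx.5 for estimated ages) can be turned into explicit flags. The following is a minimal sketch on made-up ages; the names `is_infant` and `is_estimated` are hypothetical and do not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, chosen only to illustrate the two conventions.
age = pd.Series([0.08, 0.75, 28.5, 30.0, np.nan], name="Age")

# Fractional ages below one year.
is_infant = age < 1
# Ages of one year or more ending in .5 are treated as estimates.
is_estimated = (age >= 1) & (age % 1 == 0.5)

flags = pd.DataFrame({"Age": age, "is_infant": is_infant, "is_estimated": is_estimated})
print(flags)
```

Missing ages compare as False in both conditions, so they are flagged as neither infant nor estimated.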

Total number of missing values in the dataset:

In [3]:
train.isnull().sum().sum()


Missing values per column. The Age and Cabin columns, among others, have missing values that will need to be treated later:

In [5]:
train.isnull().sum()

* The train dataset has 100,000 rows.
* There are 76,165 missing values in the train dataset.
* The features with missing values are Age, Ticket, Fare, Cabin and Embarked.
* Most of the missing values come from the Cabin feature, which accounts for almost 90% of the missing values in the train dataset.
* Row-wise, the Cabin feature is missing in around 70% of the rows, so missing values need to be treated carefully.
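
The two percentages in the bullets above use different denominators: a column's share of all missing cells, and the fraction of rows in which that column is missing. A minimal sketch of both, on a made-up frame standing in for `train` (the values and the name `toy` are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train.csv (hypothetical values).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 4.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 8.05, np.nan],
})

missing = toy.isna().sum()                  # missing cells per column
share_of_missing = missing / missing.sum()  # column's share of all missing cells
share_of_rows = toy.isna().mean()           # fraction of rows missing in each column

summary = pd.DataFrame({
    "missing": missing,
    "share_of_missing_pct": (share_of_missing * 100).round(1),
    "share_of_rows_pct": (share_of_rows * 100).round(1),
}).sort_values("missing", ascending=False)
print(summary)
```

Running the same three lines on `train` instead of `toy` should give the per-column figures quoted above.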

In [18]:
train.info()

Statistical information about the data:

In [7]:
train.describe()

Age:
Mean age = 38, min age = 0.08, max age = 87

Pclass:
3 different classes are available

SibSp: some passengers travelled with siblings or spouses

Parch: some passengers travelled with parents or children

In [10]:
train.duplicated().sum()

There are no duplicated rows.

Correlation of different variables:

In [12]:
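# Restrict to numeric columns: recent pandas versions raise on string columns in corr()
sns.heatmap(train.select_dtypes('number').corr(), cmap="YlGnBu")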

In [21]:
# Pass the column as x=; newer seaborn versions no longer accept it positionally
sns.countplot(x=train['Survived'], hue=train['Survived'])
train['Survived'].value_counts()

The majority of passengers did not survive the accident.



In [34]:
plt.figure(figsize=(7,5))
sns.countplot(x=train['Pclass'], hue=train['Pclass'])
train['Pclass'].value_counts()

The majority of passengers travelled in Pclass 3, followed by Pclass 1 and then Pclass 2.

In [38]:
sns.countplot(x=train['Parch'], hue=train['Parch'])
train['Parch'].value_counts()

In [130]:
sns.barplot(x=train['Survived'], y=train['Age'])
train['Survived'].value_counts()

In [102]:
sns.countplot(x=train['Sex'], hue=train['Sex'])
train['Sex'].value_counts()

In [71]:
nan_data = (train.isna().sum().sort_values(ascending=False) / len(train) * 100)[:6]
fig, ax = plt.subplots(1,1,figsize=(7, 5))

ax.bar(nan_data.index, 100, color='#c6ccd8', width=0.6)

bar = ax.bar(
    nan_data.index, 
    nan_data, 
    color='#496595', 
    width=0.6
)
ax.bar_label(bar, fmt='%.01f %%')
ax.spines.left.set_visible(False)
ax.set_yticks([])
ax.set_title('Null Data Ratio', fontweight='bold')

plt.show()

In [82]:
feature_cols = train.drop(['Survived', 'PassengerId'], axis=1).columns
target_column = 'Survived'

# Age and Fare are treated as continuous; every other feature is treated as categorical.
numerical_columns = ['Age', 'Fare']
categorical_columns = train[feature_cols].drop(columns=numerical_columns).columns


In [87]:
num_rows, num_cols = 2,1
f, axes = plt.subplots(nrows=num_rows, ncols=num_cols, figsize=(16, 16))
f.suptitle('Distribution of Features', fontsize=20, fontweight='bold', fontfamily='serif', x=0.13)


for index, column in enumerate(train[numerical_columns].columns):
    i,j = (index // num_cols, index % num_cols)
    # fill= replaces the deprecated shade= argument (seaborn 0.11+)
    sns.kdeplot(train.loc[train[target_column] == 0, column], color='#c6ccd8', fill=True, ax=axes[i])
    sns.kdeplot(train.loc[train[target_column] == 1, column], color='#496595', fill=True, ax=axes[i])

# f.delaxes(axes[-1, -1])
plt.tight_layout()
plt.show()

Correlation

We look at the correlation from two points of view:

* The correlation between the continuous variables
* The correlation between these continuous features and the target

As the heatmap shows, the variables are not highly correlated with each other or with the target, so we are not going to drop any of them.
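
The second point of view, the correlation between the continuous features and the target, can be computed directly with `DataFrame.corrwith`. This is a minimal sketch on made-up values, not the notebook's data; with a binary target, the Pearson coefficient is the point-biserial correlation.

```python
import pandas as pd

# Hypothetical rows standing in for train[['Age', 'Fare', 'Survived']].
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.07],
    "Survived": [0, 1, 1, 1, 0, 0],
})

continuous = ["Age", "Fare"]

# Point of view 1: correlation between the continuous features themselves.
feature_corr = toy[continuous].corr()

# Point of view 2: correlation of each continuous feature with the target.
target_corr = toy[continuous].corrwith(toy["Survived"])

print(feature_corr)
print(target_corr)
```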

In [89]:
# `pure_num_cols` is not defined anywhere in this notebook; assuming it was meant
# to be the continuous features listed in `numerical_columns`.
corr = train[numerical_columns].corr().abs()
mask = np.triu(np.ones_like(corr, dtype=bool))  # np.bool was removed in NumPy 1.24

fig, ax = plt.subplots(figsize=(12, 12))
ax.text(-1.1, 0.16, 'Correlation between the Continuous Features', fontsize=20, fontweight='bold', fontfamily='serif')
ax.text(-1.1, 0.3, 'No pair of features has a correlation above 0.4', fontsize=13, fontweight='light', fontfamily='serif')


# plot heatmap
sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", cmap='coolwarm',
            cbar_kws={"shrink": .8}, vmin=0, vmax=1)
# yticks
plt.yticks(rotation=0)
plt.show()

In [93]:
sns.countplot(x=train['Sex'], hue=train['Survived'])

In [94]:
sns.countplot(x=train['Pclass'], hue=train['Survived'])

In [111]:
train['Fare'].hist(bins=40,figsize=(10,4))

In [112]:
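# Restrict to numeric columns: recent pandas versions raise on string columns in corr()
corrMatrix = train.select_dtypes('number').corr()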
sns.heatmap(corrMatrix, annot=True)

In [28]:
fig, ax = plt.subplots(1, 3, figsize=(17 , 5))

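# Requires `train_df` (with the engineered 'Family' column) from the
# feature-engineering cell (In [121]) to have been run first.
feature_lst = ['Pclass', 'Age', 'SibSp','Parch','Fare', 'Family']

corr = train_df[feature_lst].corr()

# Mask the upper triangle, including the diagonal
mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed in NumPy 1.24
mask[np.triu_indices_from(mask)] = True

for idx, method in enumerate(['pearson', 'kendall', 'spearman']):
    # 'Family' exists only in train_df, not in train
    sns.heatmap(train_df[feature_lst].corr(method=method), ax=ax[idx],
            square=True, annot=True, fmt='.2f', center=0, linewidth=2,
            cbar=False, cmap=sns.diverging_palette(240, 10, as_cmap=True),
            mask=mask
           ) 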
    ax[idx].set_title(f'{method.capitalize()} Correlation', loc='left', fontweight='bold')     

plt.show()

In [118]:
train

In [121]:
train_df = train.copy()
train_df['Cabin_Code'] = train.Cabin.str[0]
new = train.Ticket.str.split(" ", n = 1, expand = True)
train_df['Ticket_Code'] = new[~new[0].astype(str).str.isnumeric()][0]
train_df['Family'] = train['SibSp'] + train['Parch']
train_df

In [145]:
sns.countplot(x=train_df['Cabin_Code'], hue=train_df['Survived'])
train_df.groupby('Cabin_Code')['Survived'].value_counts()

In [174]:
family_ratio = train_df.groupby('Family')['Survived'].mean() * 100
mean = train['Survived'].mean() *100 

fig, ax = plt.subplots(1, 1, figsize=(12, 7))

color_map = ['#d4dddd' for _ in range(len(family_ratio))]
color_map[np.argmax(family_ratio)] = 'orange'

bars = ax.bar(family_ratio.index, family_ratio, 
       color=color_map, width=0.55, 
       edgecolor='black', 
       linewidth=0.7)

ax.spines[["top","right","left"]].set_visible(False)
ax.bar_label(bars, fmt='%.2f%%')

# mean line + annotation
ax.axhline(mean ,color='black', linewidth=0.4, linestyle='dashdot')
ax.annotate(f"mean : {mean :.4}%", 
            xy=(15, mean + 2),
            va = 'center', ha='center',
            color='#4a4a4a',
            bbox=dict(boxstyle='round', pad=0.4, facecolor='#efe8d1', linewidth=0))
    

# Title & Subtitle    
fig.text(0.06, 1, '# of Family & Survived', fontsize=15, fontweight='bold', fontfamily='serif')
fig.text(0.06, 0.96, 'The more family members there are, the lower the survival rate tends to be.', fontsize=12, fontweight='light', fontfamily='serif')

ax.set_yticks([])
ax.set_xticks(np.arange(0, max(family_ratio.index)+1))
ax.grid(axis='y', linestyle='-', alpha=0.4)
ax.set_ylim(0, 65)

fig.tight_layout()
plt.show()

In [176]:
sibsp = train.groupby('SibSp')['Survived'].mean().sort_index()*100
parch = train.groupby('Parch')['Survived'].mean().sort_index()*100

fig, axes = plt.subplots(2, 2, figsize=(17, 12))

# Ratio 1
axes[0][0].bar(height=100, x=sibsp.index, color='#dedede')
hbar1 = axes[0][0].bar(height=sibsp, x=sibsp.index, color='orange')
axes[0][0].bar_label(hbar1, fmt='%.01f%%', padding=2)

# Bar1
sibsp_cnt = train['SibSp'].value_counts().sort_index()
bar1 = axes[1][0].bar(height=sibsp_cnt, x=sibsp_cnt.index)
axes[1][0].bar_label(bar1, fmt='%d', padding=2)

# Ratio 2
axes[0][1].bar(height=100, x=parch.index, color='#dedede')
hbar2 = axes[0][1].bar(height=parch, x=parch.index, color='orange')
axes[0][1].bar_label(hbar2, fmt='%.01f%%', padding=2)

# Bar2
parch_cnt = train['Parch'].value_counts().sort_index()
bar2 = axes[1][1].bar(height=parch_cnt, x=parch_cnt.index)
axes[1][1].bar_label(bar2, fmt='%d', padding=2)

for ax in axes.flatten():
    ax.set_yticks([])
    ax.set_xticks(range(0, max(parch.index)+1))
    ax.spines[['bottom', 'left']].set_visible(False)

axes[0][0].axhline(mean ,color='black', linewidth=0.4, linestyle='dashdot')
axes[0][1].axhline(mean ,color='black', linewidth=0.4, linestyle='dashdot')

for idx, ax in enumerate(axes[0]):
    ax.annotate(f"mean : {mean :.4}%", 
            xy=(6.5+idx, mean + 4),
            va = 'center', ha='center',
            color='#4a4a4a', fontsize=10,
            bbox=dict(boxstyle='round', pad=0.4, facecolor='#efe8d1', linewidth=0))
    

axes[0][0].set_title('Siblings/Spouses Survived Ratio', fontsize=14, fontweight='bold')
axes[0][1].set_title('Parent/Children Survived Ratio', fontsize=14, fontweight='bold')



plt.show()

In [123]:
pd.pivot_table(train_df, values='Fare', index=['Family'],
               columns=['Survived'], aggfunc=[np.mean, np.std]).style.bar(subset=['mean'], color='#205ff2').background_gradient(subset=['std'], cmap='Reds')

The mean fare tends to decrease as the number of family members increases, and there appears to be a fare difference between passengers who survived and those who did not.

In [116]:
pd.pivot_table(train_df, values='Fare', index=['Family'],
               columns=['Pclass'], aggfunc=[np.mean, np.std]).style.bar(subset=['mean'], color='#205ff2').background_gradient(subset=['std'], cmap='Reds')

In [183]:
plt.rcParams['figure.dpi'] = 300
fig = plt.figure(figsize=(5, 5), facecolor='#f6f5f5')
gs = fig.add_gridspec(4, 3)
gs.update(wspace=0.4, hspace=0.8)

background_color = "#f6f5f5"

column = 'Survived'
color_map = ['#eeb977', 'lightgray']
sns.set_palette(sns.color_palette(color_map))
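# Name the columns explicitly ('index' and column) so the result does not depend on the pandas version
temp_train = train_df[column].value_counts().rename_axis('index').reset_index(name=column)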
ax0 = fig.add_subplot(gs[0, 0])
for s in ["right", "top"]:
    ax0.spines[s].set_visible(False)
ax0.set_facecolor(background_color)
ax0.tick_params(axis = "y", which = "both", left = False)
ax0.text(-1, 83, 'Survival Rate', color='black', fontsize=7, ha='left', va='bottom', weight='bold')
ax0.text(-1, 82, 'Survival rate on each individual feature', color='#292929', fontsize=5, ha='left', va='top')
ax0.text(1.18, 73.3, 'for age and fare', color='#292929', fontsize=4, ha='left', va='top')
ax0_sns = sns.barplot(ax=ax0, x=temp_train['index'], y=temp_train[column]/1000, zorder=2)
ax0_sns.set_xlabel("Survived",fontsize=5, weight='bold')
ax0_sns.set_ylabel('')
ax0.yaxis.set_major_formatter(ticker.PercentFormatter())
ax0_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE')
ax0_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE')
ax0_sns.tick_params(labelsize=5)
ax0_sns.legend(['Survived', 'Not Survived'], ncol=2, facecolor=background_color, edgecolor=background_color, fontsize=4, bbox_to_anchor=(-0.26, 1.3), loc='upper left')
leg = ax0_sns.get_legend()
leg.legendHandles[0].set_color('#eeb977')
leg.legendHandles[1].set_color('lightgray')

column = 'Pclass'
color_map = ['#eeb977', 'lightgray', 'lightgray']
sns.set_palette(sns.color_palette(color_map))
temp_train = train_df.groupby(column)['Survived'].sum()
ax1 = fig.add_subplot(gs[0, 1])
for s in ["right", "top"]:
    ax1.spines[s].set_visible(False)
ax1.set_facecolor(background_color)
ax1.tick_params(axis = "y", which = "both", left = False)
ax1_sns = sns.barplot(ax=ax1, x=temp_train.index, y=temp_train/1000, zorder=2)
ax1_sns.set_xlabel("Ticket Class",fontsize=5, weight='bold')
ax1_sns.set_ylabel('')
ax1.yaxis.set_major_formatter(ticker.PercentFormatter())
ax1_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE')
ax1_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE')
ax1_sns.tick_params(labelsize=5)

column = 'Sex'
color_map = ['#eeb977', 'lightgray']
sns.set_palette(sns.color_palette(color_map))
temp_train = train_df.groupby(column)['Survived'].sum()
ax2 = fig.add_subplot(gs[0, 2])
for s in ["right", "top"]:
    ax2.spines[s].set_visible(False)
ax2.set_facecolor(background_color)
ax2.tick_params(axis = "y", which = "both", left = False)
ax2_sns = sns.barplot(ax=ax2, x=temp_train.index, y=temp_train/1000, zorder=2)
ax2_sns.set_xlabel("Sex",fontsize=5, weight='bold')
ax2_sns.set_ylabel('')
ax2.yaxis.set_major_formatter(ticker.PercentFormatter())
ax2_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE')
ax2_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE')
ax2_sns.tick_params(labelsize=5)

column = 'Age'
color_map = ['#eeb977', 'lightgray']
sns.set_palette(sns.color_palette(color_map))
temp_train = train_df.groupby(column)['Survived'].sum()
ax3 = fig.add_subplot(gs[1, 0])
for s in ["right", "top"]:
    ax3.spines[s].set_visible(False)
ax3.set_facecolor(background_color)
ax3.tick_params(axis = "y", which = "both", left = False)
# fill= replaces the deprecated shade= argument (seaborn 0.11+)
ax3_sns = sns.kdeplot(ax=ax3, x=train_df[train_df['Survived']==1]['Age'], zorder=2, fill=True)
ax3_sns = sns.kdeplot(ax=ax3, x=train_df[train_df['Survived']==0]['Age'], zorder=2, fill=True)
ax3_sns.set_xlabel("Age",fontsize=5, weight='bold')
ax3_sns.set_ylabel('')
ax3_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE')
ax3_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE')
ax3_sns.tick_params(labelsize=5)

column = 'SibSp'
color_map = ['lightgray' for _ in range(7)]
color_map[0] = '#eeb977'
sns.set_palette(sns.color_palette(color_map))
temp_train = train_df.groupby(column)['Survived'].sum()
ax4 = fig.add_subplot(gs[1, 1])
for s in ["right", "top"]:
    ax4.spines[s].set_visible(False)
ax4.set_facecolor(background_color)
ax4.tick_params(axis = "y", which = "both", left = False)
ax4_sns = sns.barplot(ax=ax4, x=temp_train.index, y=temp_train/1000, zorder=2)
ax4_sns.set_xlabel("Siblings / spouses",fontsize=5, weight='bold')
ax4_sns.set_ylabel('')
ax4.yaxis.set_major_formatter(ticker.PercentFormatter())
ax4_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE')
ax4_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE')
ax4_sns.tick_params(labelsize=5)

column = 'Parch'
color_map = ['lightgray' for _ in range(8)]
color_map[0] = '#eeb977'
sns.set_palette(sns.color_palette(color_map))
temp_train = train_df.groupby(column)['Survived'].sum()
ax5 = fig.add_subplot(gs[1, 2])
for s in ["right", "top"]:
    ax5.spines[s].set_visible(False)
ax5.set_facecolor(background_color)
ax5.tick_params(axis = "y", which = "both", left = False)
ax5_sns = sns.barplot(ax=ax5, x=temp_train.index, y=temp_train/1000, zorder=2)
ax5_sns.set_xlabel("Parents / children",fontsize=5, weight='bold')
ax5_sns.set_ylabel('')
ax5.yaxis.set_major_formatter(ticker.PercentFormatter())
ax5_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE')
ax5_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE')
ax5_sns.tick_params(labelsize=5)

column = 'Fare'
color_map = ['#eeb977', 'lightgray']
sns.set_palette(sns.color_palette(color_map))
temp_train = train_df.groupby(column)['Survived'].sum()
ax6 = fig.add_subplot(gs[2, 0])
for s in ["right", "top"]:
    ax6.spines[s].set_visible(False)
ax6.set_facecolor(background_color)
ax6.tick_params(axis = "y", which = "both", left = False)
# fill= replaces the deprecated shade= argument (seaborn 0.11+)
ax6_sns = sns.kdeplot(ax=ax6, x=train_df[train_df['Survived']==1]['Fare'], zorder=2, fill=True)
ax6_sns = sns.kdeplot(ax=ax6, x=train_df[train_df['Survived']==0]['Fare'], zorder=2, fill=True)
ax6_sns.set_xlabel("Fare",fontsize=5, weight='bold')
ax6_sns.set_ylabel('')
ax6_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE')
ax6_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE')
ax6_sns.tick_params(labelsize=5)

column = 'Cabin_Code'
color_map = ['lightgray' for _ in range(9)]
color_map[7] = '#eeb977'
sns.set_palette(sns.color_palette(color_map))
temp_train = train_df.groupby(column)['Survived'].sum()
ax7 = fig.add_subplot(gs[2, 1])
for s in ["right", "top"]:
    ax7.spines[s].set_visible(False)
ax7.set_facecolor(background_color)
ax7.tick_params(axis = "y", which = "both", left = False)
ax7_sns = sns.barplot(ax=ax7, x=temp_train.index, y=temp_train/1000, zorder=2)
ax7_sns.set_xlabel("Cabin",fontsize=5, weight='bold')
ax7_sns.set_ylabel('')
ax7.yaxis.set_major_formatter(ticker.PercentFormatter())
ax7_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE')
ax7_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE')
ax7_sns.tick_params(labelsize=5)

column = 'Embarked'
color_map = ['lightgray' for _ in range(4)]
color_map[3] = '#eeb977'
sns.set_palette(sns.color_palette(color_map))
temp_train = train_df.groupby(column)['Survived'].sum()
ax8 = fig.add_subplot(gs[2, 2])
for s in ["right", "top"]:
    ax8.spines[s].set_visible(False)
ax8.set_facecolor(background_color)
ax8.tick_params(axis = "y", which = "both", left = False)
ax8_sns = sns.barplot(ax=ax8, x=temp_train.index, y=temp_train/1000, zorder=2)
ax8_sns.set_xlabel("Port",fontsize=5, weight='bold')
ax8_sns.set_ylabel('')
ax8.yaxis.set_major_formatter(ticker.PercentFormatter())
ax8_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE')
ax8_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE')
ax8_sns.tick_params(labelsize=5)

Survived
* Of the Synthanic passengers, 57,226 did not survive and 42,774 survived the accident, which corresponds to 57.2% not surviving and 42.8% surviving.

Pclass
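* Ticket class 1 contributes the most survivors: its survivors make up 17.6% of all passengers, followed by class 2 with 15% and class 3 with 10.1%. The bars show survivor counts divided by the 100,000 passengers, not the survival rate within each class.
* Class 3 is the largest class yet has the fewest survivors, so a higher ticket class does appear to carry a higher chance of surviving; this may be a result of lifeboat priority based on ticket class.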
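
The percentages quoted for Pclass come from dividing the number of survivors in each class by the total number of passengers. That is different from the survival rate within a class, which divides by the size of that class. A minimal sketch of the difference on made-up data (the frame `toy` and its values are illustrative assumptions):

```python
import pandas as pd

# Hypothetical passengers: class 3 is the largest group.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3, 3, 3],
    "Survived": [1, 0, 1, 0, 1, 1, 0, 0, 0, 0],
})

# What the bar charts plot: survivors in each class as a share of ALL passengers.
share_of_all = toy.groupby("Pclass")["Survived"].sum() / len(toy) * 100

# Within-class survival rate: survivors in a class divided by that class's size.
rate_within = toy.groupby("Pclass")["Survived"].mean() * 100

print(pd.DataFrame({"share_of_all_pct": share_of_all, "rate_within_pct": rate_within}))
```

In this toy data class 3 has the largest share of all passengers among survivors but the lowest within-class rate, which is why the two measures should not be read interchangeably. On the notebook's data, `train_df.groupby('Pclass')['Survived'].mean()` would give the within-class rates.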

Sex
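* Female survivors make up 31.2% of all passengers, which suggests that females had a higher chance of surviving than males; this may also be the result of lifeboat priority for women over men.
* Male survivors make up 11.5% of all passengers, far below the female figure. Both figures are shares of all 100,000 passengers, so the within-group survival rates also depend on how many passengers of each sex were aboard.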

Age
* Passengers aged 15-40 have a lower chance of surviving, while older passengers aged 40 and above have a higher probability of surviving; this may also be due to lifeboat priority for older people.
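
The age-band reading above is taken from the density plot by eye. One way to check it numerically is to bin Age and compute the survival rate per band. This is a minimal sketch on made-up data; the bin edges follow the 15 and 40 thresholds mentioned above, and the column name `AgeBand` is a hypothetical addition.

```python
import pandas as pd

# Hypothetical passengers, for illustration only.
toy = pd.DataFrame({
    "Age":      [5, 12, 20, 25, 33, 38, 45, 52, 60, 70],
    "Survived": [1, 0, 0, 0, 1, 0, 1, 1, 0, 1],
})

bins = [0, 15, 40, 100]
labels = ["0-15", "15-40", "40+"]
# Right-closed intervals: (0, 15], (15, 40], (40, 100]
toy["AgeBand"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# Within-band survival rate in percent.
rate_by_band = toy.groupby("AgeBand", observed=True)["Survived"].mean() * 100
print(rate_by_band)
```

Rows with a missing Age fall into no band and are left out of the groupby, so the rates describe passengers with a known age only.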

SibSp
* Most of the passengers on the Synthanic travelled alone, which is why passengers without siblings / spouses account for more survivors than passengers with siblings / spouses.
* Survivors without siblings / spouses make up more than 30% of all passengers.

Parch
* As stated earlier, most of the passengers on the Synthanic travelled alone, which is also why passengers travelling without parents / children account for more survivors.
* Survivors travelling without parents / children make up almost 30% of all passengers, almost the same as the figure for passengers travelling without siblings / spouses.

Fare
* Consistent with ticket class, passengers who paid a lower fare have a lower chance of surviving.
* Passengers who paid a low fare are expected to be in a lower ticket class, but further analysis is needed to confirm this.
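
One possible form for the further analysis mentioned above is a cross-tabulation of fare bands against ticket class. This is a minimal sketch on made-up fares; the two-band split and the `FareBand` column are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical fares, for illustration only.
toy = pd.DataFrame({
    "Pclass": [3, 3, 3, 3, 2, 2, 1, 1],
    "Fare":   [5.0, 7.0, 8.0, 9.0, 15.0, 20.0, 60.0, 90.0],
})

# Split fares at the median into two equally sized bands.
toy["FareBand"] = pd.qcut(toy["Fare"], q=2, labels=["low", "high"])

# Row-normalised table: for each fare band, the share of passengers per class.
table = pd.crosstab(toy["FareBand"], toy["Pclass"], normalize="index")
print(table)
```

If low fares really do go with lower ticket classes, the "low" row should be concentrated in class 3, as it is in this toy data.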

Cabin
* The cabin number has many missing values, which makes it hard to analyse survival by cabin.
* Passengers with an unknown cabin (N) account for the most survivors, above 20% of all passengers. The N label assumes that missing Cabin values were filled with 'N'; Cabin_Code as built above leaves them as NaN, and groupby drops NaN keys.
* Passengers in cabin C account for the second most survivors, above 5% of all passengers.
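
The unknown-cabin group (N) only appears in a grouped result if missing Cabin values are given an explicit label before the first letter is taken. This is a minimal sketch of the difference on made-up data; filling with 'N' is an assumption inferred from the N category mentioned above, not a step shown in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical passengers, for illustration only.
toy = pd.DataFrame({
    "Cabin":    ["C123", np.nan, "A10", np.nan, np.nan],
    "Survived": [1, 0, 1, 1, 0],
})

# First letter only: missing cabins stay NaN.
as_built = toy["Cabin"].str[0]
# Label missing cabins 'N' first, then take the first letter.
with_unknown = toy["Cabin"].fillna("N").str[0]

# groupby drops NaN keys by default, so the unknown group vanishes here...
counts_as_built = toy.groupby(as_built)["Survived"].sum()
# ...and is kept here under the label 'N'.
counts_with_unknown = toy.groupby(with_unknown)["Survived"].sum()

print(counts_as_built)
print(counts_with_unknown)
```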

Embarked
* Passengers who embarked at Southampton account for the most survivors, above 20% of all passengers.
* Passengers who embarked at Cherbourg account for the second most survivors, at 15% of all passengers.