https://www.kaggle.com/yassineghouzam/titanic-top-4-with-ensemble-modeling

https://www.kaggle.com/ldfreeman3/a-data-science-framework-to-achieve-99-accuracy

https://www.kaggle.com/dejavu23/titanic-survival-seaborn-and-ensembles

* **1 Introduction**
* **2 Import Data and Cleansing**
* **3 Exploratory Data Analysis**
* **4 Feature Engineering**
* **5 Modeling**


# **1 Introduction**

I'm starting my Kaggle Competition Challenge with Python. The Titanic dataset is my first stop.

# **2 Import Data and Cleansing**

In [None]:
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set()

#ignore warnings
import warnings
warnings.filterwarnings('ignore')

import os
print(os.listdir("../input"))

In [None]:
#Importing data
train=pd.read_csv('../input/train.csv')
test=pd.read_csv('../input/test.csv')
#Combine train and test in order to do data cleansing together.
full=pd.concat([train,test],axis=0,sort=False,ignore_index=True)
full.head()

In [None]:
full.dtypes

In [None]:
#Make sure missing entries are represented as NaN
full=full.fillna(np.nan)
#Check missing values
full.isnull().sum()

In [None]:
train.isnull().sum()

In [None]:
#Convert Survived, Sex, Pclass and Embarked to Categorical type
full['Survived']=full['Survived'].astype('category')
full['Sex']=full['Sex'].astype('category')
full['Pclass']=full['Pclass'].astype('category')
full['Embarked']=full['Embarked'].astype('category')

In [None]:
full.describe()

**Some takeaways:**
* PassengerId is an index column that is not useful for modeling; Survived is the dependent variable, with values 1 and 0; Pclass, Sex and Embarked are categorical.
* Name, Cabin and Ticket are text columns that may hold extra information to extract during feature engineering.
* Age, Cabin and Embarked have missing values in the training data; the combined data is also missing one Fare value.


# **3 Exploratory Data Analysis**

In [None]:
#Encode the categorical columns as integer codes so they can go into a correlation matrix
cat_col=full.select_dtypes(['category']).columns
full_cor=full.copy()
full_cor[cat_col]=full_cor[cat_col].apply(lambda x: x.cat.codes)

In [None]:
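#Correlation matrix to explore feature relations
#Use the training rows only: test rows have no Survived label, and cat.codes encodes the missing label as -1
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(full_cor.iloc[:len(train)][['Survived','Sex','Pclass','SibSp','Parch','Age','Fare','Embarked']].corr(),annot=True, fmt = ".2f", cmap = "coolwarm",ax=ax)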

In the correlation matrix, Sex has a relatively strong correlation with Survived, while SibSp has the weakest.
Because there are only a few features, we explore Survived against each of them in turn.
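
To read a heatmap like this more easily, we can pull out just the Survived column of the correlation matrix and sort it by absolute value. The snippet below is a minimal sketch on a small made-up table (the `toy` frame and its values are assumptions for illustration, not rows from the Titanic files).

```python
import pandas as pd

# Toy data for illustration only; these are not rows from the Titanic files.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 1, 0],
    "Sex":      [1, 0, 0, 1, 0, 1, 0, 1],  # category codes: 0=female, 1=male
    "SibSp":    [1, 1, 0, 0, 0, 0, 1, 3],
    "Fare":     [7.25, 71.28, 7.92, 8.05, 53.10, 8.46, 11.13, 21.08],
})

# Correlation of every feature with the target
corr_with_target = toy.corr()["Survived"].drop("Survived")

# Order the features by the strength (absolute value) of that correlation
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```

The same two steps should work on the training rows of `full_cor`, assuming the categorical columns have already been converted to codes.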

In [None]:
#Use seaborn graphics for multi-variable comparison: https://seaborn.pydata.org/api.html
fig = plt.figure(figsize=(21, 12))
grid = plt.GridSpec(3, 3)
ax1 = fig.add_subplot(grid[0, :2])
ax2 = fig.add_subplot(grid[0, 2])
ax3 = fig.add_subplot(grid[1, :2])
ax4 = fig.add_subplot(grid[1, 2])
ax5 = fig.add_subplot(grid[2, 0])
ax6 = fig.add_subplot(grid[2, 1])
ax7 = fig.add_subplot(grid[2, 2])
#Labeled (training) rows only, split by outcome
labeled=full.iloc[:len(train)]
survived=labeled[labeled['Survived']==1]
not_survived=labeled[labeled['Survived']==0]
#Survived VS Sex
sns.countplot(x='Sex',hue='Survived',data=labeled,ax=ax2)
#Survived VS Pclass
sns.countplot(x='Pclass',hue='Survived',data=labeled,ax=ax4)
#Survived VS SibSp
sns.countplot(x='SibSp',hue='Survived',data=labeled,ax=ax5)
#Survived VS Parch
sns.countplot(x='Parch',hue='Survived',data=labeled,ax=ax6)
#Survived VS Age
sns.kdeplot(survived['Age'].dropna(),color='Red',ax=ax1).set(xticks=list(range(0,int(labeled['Age'].max()),4)))
sns.kdeplot(not_survived['Age'].dropna(),color='Blue',ax=ax1).legend(["Survived","Not Survived"])
#Survived VS Fare
sns.kdeplot(survived['Fare'].dropna(),color='Red',ax=ax3).set(xticks=list(range(0,int(labeled['Fare'].max()),20)))
sns.kdeplot(not_survived['Fare'].dropna(),color='Blue',ax=ax3).legend(["Survived","Not Survived"])
#Survived VS Embarked
sns.countplot(x='Embarked',hue='Survived',data=labeled,ax=ax7)

What we learn from the visualization:
* Children (Age<=12) have a higher survival rate.
* Most of the survivors are female.
* Passengers who bought an expensive ticket (Fare>=18) are more likely to survive.
* Passengers with no family members (SibSp=0 & Parch=0) have a higher survival rate.
* Passengers in 3rd class or who embarked from S have a lower survival rate.
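
Count plots show group sizes, so it can help to check readings like these against actual survival rates. Since Survived is 0/1, the mean within each group is that group's survival rate. The snippet below is a minimal sketch on a small made-up table (the `toy` frame and its values are assumptions for illustration, not rows from the Titanic files).

```python
import pandas as pd

# Toy data for illustration only; these are not rows from the Titanic files.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1],
    "Sex":      ["male", "female", "female", "male", "female", "male", "male", "male"],
    "Pclass":   [3, 1, 3, 3, 1, 3, 2, 1],
})

# The mean of a 0/1 target within each group is that group's survival rate
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
rate_by_class = toy.groupby("Pclass")["Survived"].mean()
print(rate_by_sex)
print(rate_by_class)
```

On the notebook's data this could be run against `train`, where Survived is still numeric rather than categorical.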

## Missing value imputation
We have missing values in Age (263), Fare (1), Cabin (1014) and Embarked (2).

We learn from the correlation chart that Age is related to Pclass and SibSp.

For more information on handling missing values, see the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/missing_data.html).

In [None]:
#Distribution of Age, Age VS Parch and Age VS SibSp
sns.kdeplot(full['Age'])
#factorplot was renamed to catplot in seaborn 0.9 and later removed
sns.catplot(y="Age",x="Parch", data=full,kind="box")
sns.catplot(y="Age",x="SibSp", data=full,kind="box")

The box plots suggest that the higher a passenger's Parch, the older the passenger tends to be, and the higher the SibSp, the younger.

In [None]:
#Fill missing Age with the median Age of rows that share the same Pclass and SibSp
age_med=full['Age'].median() #overall median, used as a fallback
list_nan=list(full[full['Age'].isnull()].index)
for i in list_nan:
    age_med2=full['Age'][(full['Pclass']==full.loc[i,'Pclass'])&(full['SibSp']==full.loc[i,'SibSp'])].median()
    #Assign with full.loc[i,'Age']: chained forms such as full.iloc[i]['Age']=age_med write to a copy.
    #Please see the link here for more info: https://stackoverflow.com/questions/54211190/whats-the-difference-between-x-iloc1x-and-xx-iloc1
    full.loc[i,'Age']=age_med if np.isnan(age_med2) else age_med2
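
The same idea can also be written without a loop, using `groupby(...).transform('median')`. The snippet below is a minimal sketch on a small made-up table (the `toy` frame and its values are assumptions for illustration). Note one difference from the loop: here the group medians come from the observed ages only, while the loop's medians shift slightly as rows get filled.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; these are not rows from the Titanic files.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3, 2],
    "SibSp":  [0, 0, 0, 1, 1, 1, 5],
    "Age":    [40.0, 50.0, np.nan, 20.0, np.nan, 30.0, np.nan],
})

# Overall median, used when a whole group has no known Age
overall_median = toy["Age"].median()

# Median Age within each (Pclass, SibSp) group, aligned to the original rows
group_median = toy.groupby(["Pclass", "SibSp"])["Age"].transform("median")

# Prefer the group median; fall back to the overall median
toy["Age"] = toy["Age"].fillna(group_median).fillna(overall_median)
print(toy)
```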

In [None]:
#Check missing values
print(full.isnull().sum())
sns.kdeplot(full['Age'])

The imputation didn't change the distribution of Age. Now let's impute Fare, Cabin and Embarked.
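
One simple way to do this, shown here only as a sketch on a small made-up table, is to fill Fare with its median, Embarked with its most frequent value, and Cabin with a placeholder. The `toy` frame and the `"U"` (unknown) placeholder are assumptions for illustration, not choices taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; these are not rows from the Titanic files.
toy = pd.DataFrame({
    "Fare":     [7.25, 71.28, np.nan, 8.05],
    "Embarked": ["S", "C", "S", np.nan],
    "Cabin":    [np.nan, "C85", np.nan, np.nan],
})

# Numeric column: fill with the median, which is robust to extreme fares
toy["Fare"] = toy["Fare"].fillna(toy["Fare"].median())

# Categorical column: fill with the most frequent value (the mode)
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# Mostly-missing column: mark missing entries with a hypothetical "U" label
toy["Cabin"] = toy["Cabin"].fillna("U")
print(toy)
```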

In [None]:
#Outliers
#We noticed that there are some extreme values in the Fare and it's very skewed.
#We will transform it with log function to reduce the skewness.
#
full
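
The log transform mentioned in the comments above could look like the sketch below. It uses `np.log1p` (log of 1 + x) so that a fare of 0 stays defined, and runs on a few made-up fare values (the `fare` series is an assumption for illustration, not data from the Titanic files).

```python
import numpy as np
import pandas as pd

# Made-up fares for illustration only, including one extreme value
fare = pd.Series([0.0, 7.25, 8.05, 26.0, 512.33], name="Fare")

# log1p computes log(1 + x), so zero fares map to 0 instead of -inf
fare_log = np.log1p(fare)

# Compare skewness before and after the transform
print("skew before:", fare.skew())
print("skew after: ", fare_log.skew())
```

On the notebook's data the equivalent step would be applied to `full['Fare']`, after its missing value has been filled.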

# **4 Feature Engineering**