# Titanic Top 4% with ensemble modeling 
<br>

## Yassine Ghouzam, PhD
<br>
<br>

13/07/2017
- **1 Introduction**
- **2 Load and check data**
    - 2.1 load data 
    - 2.2 Outlier detection 
    - 2.3 joining train and test set
    - 2.4 check for null and missing values 
- **3 Feature analysis**
    - 3.1 Numerical values 
    - 3.2 Categorical values
- **4 Filling missing Values**
    - 4.1 Age
- **5 Feature engineering** 
    - 5.1 Name/Title
    - 5.2 Family Size
    - 5.3 Cabin 
    - 5.4 Ticket
- **6 Modeling** 
    - 6.1 Simple modeling 
        - 6.1.1 Cross validate models
        - 6.1.2 Hyperparameter tuning for best models
        - 6.1.3 Plot learning curves
        - 6.1.4 Feature importance of the tree based classifiers 
    - 6.2 Ensemble modeling 
        - 6.2.1 Combining models
    - 6.3 Prediction 
        - 6.3.1 Predict and Submit results
<br>
<br>
<br>

## 1. Introduction 
<br>
This is my first kernel at Kaggle. I chose the Titanic competition, which is a good way to introduce feature engineering and ensemble modeling. First, I will display some feature analyses, then I'll focus on the feature engineering. The last part concerns modeling and predicting survival on the Titanic using a voting procedure. 
<br>

This script follows three main parts:
<br>

- Feature analysis
- Feature engineering 
- Modeling 
<br>
<br>
<br>

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from collections import Counter

from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier, VotingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold, learning_curve

sns.set(style='white', context='notebook', palette='deep')

## 2. Load and check data
<br>

### 2.1 Load data


In [None]:
# Load data 
##### Load train and Test

train = pd.read_csv('/kaggle/input/titanic/train.csv')
test = pd.read_csv('/kaggle/input/titanic/test.csv')
IDtest = test['PassengerId']

### 2.2 Outlier detection

In [None]:
# Outlier detection 

def detect_outliers(df, n, features):
    '''
    Takes a dataframe df of features and returns a list of the indices
    corresponding to the observations containing more than n outliers according to the Tukey method.
    '''
    outlier_indices = []
    
    # iterate over features (columns)
    for col in features:
        # 1st quartile (25%)
        Q1 = np.percentile(df[col], 25)
        # 3rd quartile (75%)
        Q3 = np.percentile(df[col], 75)
        # Interquartile range (IQR)
        IQR = Q3 - Q1
        
        # outlier step
        outlier_step = 1.5 * IQR
        
        # Determine a list of indices of outliers for feature col
        outlier_list_col = df[(df[col] < Q1 - outlier_step) | (df[col] > Q3 + outlier_step)].index
        
        # append the found outlier indices for col to the list of outlier indices
        outlier_indices.extend(outlier_list_col)
    
    # select observations containing more than n outliers
    outlier_indices = Counter(outlier_indices)
    multiple_outliers = list(k for k, v in outlier_indices.items() if v > n)
    
    return multiple_outliers

# detect outliers from Age, SibSp, Parch and Fare
Outliers_to_drop = detect_outliers(train, 2, ['Age', 'SibSp', 'Parch', 'Fare'])
    

Since outliers can have a dramatic effect on the prediction (especially for regression problems), I chose to manage them. 
<br>

I used the Tukey method (Tukey JW, 1977) to detect outliers. It is based on the interquartile range (IQR), the distance between the 1st quartile (Q1) and the 3rd quartile (Q3) of the distribution. An outlier is a row that has a feature value below Q1 minus an outlier step or above Q3 plus an outlier step, where the step is 1.5 times the IQR.
<br>

I decided to detect outliers from the numerical features (Age, SibSp, Parch and Fare). Then, I considered as outliers the rows that have more than two outlying numerical values (the call to `detect_outliers` uses n = 2).
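
To make the rule concrete, here is a minimal sketch of the Tukey fences on a small made-up array. The values and the names `values`, `lower_fence`, `upper_fence` and `tukey_outliers` are illustrative assumptions, not taken from the Titanic data.

```python
import numpy as np

# Hypothetical toy values, chosen only to illustrate the fences
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 30])

q1 = np.percentile(values, 25)   # 1st quartile
q3 = np.percentile(values, 75)   # 3rd quartile
iqr = q3 - q1                    # interquartile range
step = 1.5 * iqr                 # outlier step

lower_fence = q1 - step
upper_fence = q3 + step

# A value is flagged when it falls outside [lower_fence, upper_fence]
is_outlier = (values < lower_fence) | (values > upper_fence)
tukey_outliers = values[is_outlier]
print(lower_fence, upper_fence, tukey_outliers)
```

Only the value 30 lies outside the fences here. Note that `np.percentile` returns NaN when the column contains missing values, so a column with NaNs yields no flagged rows with this comparison.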

In [None]:
train.loc[Outliers_to_drop] # show the outliers row

We detect 10 outliers. Passengers 28, 89 and 342 have a high ticket fare. The 7 others have very high values of SibSp.

In [None]:
# Drop outliers
train = train.drop(Outliers_to_drop).reset_index(drop=True)

### 2.3 joining train and test set

In [None]:
## Join train and test datasets in order to obtain the same number of features during categorical conversion
train_len = len(train)
dataset = pd.concat(objs=[train, test], axis=0).reset_index(drop=True)

### 2.4 check for null and missing values

In [None]:
# Fill empty and NaNs values with NaN
dataset = dataset.fillna(np.nan)

# Check for null values
dataset.isnull().sum()

The Age and Cabin features have a large share of missing values.
<br>

**Survived missing values correspond to the joined test dataset (the Survived column doesn't exist in the test set and has been replaced by NaN values when concatenating the train and test sets).**
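
The effect of the join can be seen on a miniature example. This is only a sketch: the two frames below are hypothetical stand-ins for the train and test sets, and every name prefixed with `toy_` is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical miniature frames standing in for train and test
toy_train = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1], "Sex": ["male", "female"]})
toy_test = pd.DataFrame({"PassengerId": [3], "Sex": ["female"]})

toy_train_len = len(toy_train)
toy_dataset = pd.concat(objs=[toy_train, toy_test], axis=0).reset_index(drop=True)

# Rows coming from the test frame have no label, so Survived is NaN there
n_missing_labels = toy_dataset["Survived"].isnull().sum()

# The original split can be recovered from the stored length
toy_train_again = toy_dataset[:toy_train_len]
toy_test_again = toy_dataset[toy_train_len:].drop(columns=["Survived"])
print(n_missing_labels)
```

Storing the length of the training set before concatenating is what makes it possible to separate the two parts again after the feature transformations.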

In [None]:
# Infos
train.info()
train.isnull().sum()

In [None]:
train.head()

In [None]:
train.dtypes

In [None]:
## Summarize data
# Summary and statistics
train.describe()

## 3. Feature analysis
<br>

### 3.1 Numerical values


In [None]:
# Correlation matrix between numerical values (SibSp Parch Age and Fare values) and Survived
g = sns.heatmap(train[["Survived", "SibSp", "Parch", "Age", "Fare"]].corr(), annot=True, fmt=".2f", cmap="coolwarm")

Only the Fare feature seems to have a significant correlation with the survival probability.
<br>

It doesn't mean that the other features are not useful. Subpopulations in these features can be correlated with survival. To determine this, we need to explore these features in detail.
<br>
<br>
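
A small sketch shows why a weak correlation does not rule a feature out. The data below is made up for illustration (the frame `toy` and its columns `group` and `survived` are hypothetical): the middle group survives much more often than the other two, yet the linear correlation is essentially zero.

```python
import pandas as pd

# Hypothetical toy data: the middle group survives far more often
toy = pd.DataFrame({
    "group": [0] * 10 + [1] * 10 + [2] * 10,
    "survived": [1, 1] + [0] * 8 + [1] * 8 + [0, 0] + [1, 1] + [0] * 8,
})

# The linear (Pearson) correlation is essentially zero...
linear_corr = toy["group"].corr(toy["survived"])

# ...yet the survival rate clearly differs between groups
rate_by_group = toy.groupby("group")["survived"].mean()
print(round(linear_corr, 3))
print(rate_by_group)
```

The correlation coefficient only captures a linear trend, so a per-category view such as the bar plots that follow is needed to see this kind of pattern.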

**SibSp**

In [None]:
# Explore SibSp feature vs Survived
g = sns.factorplot(x="SibSp", y="Survived", data=train, kind="bar", size=6, palette='muted')
g.despine(left=True)
g = g.set_ylabels("Survival probability")

It seems that passengers with a lot of siblings/spouses have less chance to survive.
<br>

Passengers with no siblings/spouses aboard (SibSp 0) or with one or two (SibSp 1 or 2) have more chance to survive.
<br>

This observation is quite interesting: we can consider a new feature describing these categories (see feature engineering).
<br>
<br>

**Parch**

In [None]:
# Explore Parch feature vs Survived
g = sns.factorplot(x='Parch', y='Survived', data=train, kind='bar', size=6, palette='muted')
g.despine(left=True)
g = g.set_ylabels('Survival probability')

Small families have more chance to survive than single passengers (Parch 0), medium families (Parch 3, 4) and large families (Parch 5, 6).
<br>

Be careful: there is a large standard deviation in the survival of passengers with 3 parents/children.
<br>
<br>

**Age**

In [None]:
# Explore Age vs Survived
g = sns.FacetGrid(train, col='Survived')
g = g.map(sns.distplot, 'Age')

The age distribution seems to be a tailed distribution, maybe a Gaussian distribution.
<br>

We notice that the age distributions are not the same in the survived and not survived subpopulations. Indeed, there is a peak corresponding to young passengers who survived. We also see that passengers between 60 and 80 survived less often.
<br>

So, even if "Age" is not correlated with "Survived", we can see that there are age categories of passengers that have more or less chance to survive.
<br>

It seems that very young passengers have more chance to survive.
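
One way to check this kind of observation numerically is to bin the ages and compare the survival rate per bin. The sketch below uses made-up rows, and the bin edges and labels are illustrative assumptions rather than choices made in this kernel.

```python
import pandas as pd

# Hypothetical toy rows; the ages and outcomes are made up for illustration
toy_ages = pd.DataFrame({
    "Age": [2, 4, 8, 22, 25, 30, 35, 41, 63, 70],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 1, 0, 0],
})

# Illustrative bin edges (an assumption, not the author's choice)
bins = [0, 12, 60, 80]
labels = ["child", "adult", "senior"]
toy_ages["AgeBand"] = pd.cut(toy_ages["Age"], bins=bins, labels=labels)

# Survival rate and group size per age band
band_summary = toy_ages.groupby("AgeBand", observed=True)["Survived"].agg(["mean", "count"])
print(band_summary)
```

Reporting the group size next to the rate helps to judge how much weight a given band deserves, since a band with very few passengers gives a noisy estimate.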

In [None]:
# Explore Age distribution 
g = sns.kdeplot(train["Age"][(train["Survived"]==0) & (train["Age"].notnull())], color="Red", shade=True)
g = sns.kdeplot(train['Age'][(train['Survived']==1) & (train["Age"].notnull())], ax=g, color="Blue", shade=True)
g.set_xlabel("Age")
g.set_ylabel("Frequency")
g = g.legend(["Not Survived", "Survived"])