[Titanic Top 4% with ensemble modeling](https://www.kaggle.com/code/yassineghouzam/titanic-top-4-with-ensemble-modeling)

# 1. Introduction

This is my first kernel on Kaggle. I chose the Titanic competition, which is a good way to introduce feature engineering and ensemble modeling. First, I present some feature analyses, then focus on feature engineering. The last part covers modeling and predicting survival on the Titanic using a voting procedure.

This script follows three main parts:
- **Feature analysis**
- **Feature engineering**
- **Modeling**

In [2]:
from collections import Counter

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import (
    AdaBoostClassifier,
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    GridSearchCV,
    StratifiedKFold,
    cross_val_score,
    learning_curve,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

sns.set(style="white", context="notebook", palette="deep")

# 2. Load and check data

## 2.1 Load data

In [4]:
train = pd.read_csv("../data/train.csv")
test = pd.read_csv("../data/test.csv")
IDtest = test["PassengerId"]

## 2.2 Outlier detection

In [5]:
def detect_outliers(df, n, features):
    """
    Takes a dataframe df of features and returns a list of the indices
    corresponding to the observations containing more than n outliers according to the Tukey method.
    """
    outlier_indices = []

    # iterate over features(columns)
    for col in features:
        # 1st quartile (25%)
        Q1 = np.percentile(df[col], 25)
        # 3rd quartile (75%)
        Q3 = np.percentile(df[col], 75)
        # Interquartile range (IQR)
        IQR = Q3 - Q1

        # outlier step
        outlier_step = 1.5 * IQR

        # Determine a list of indices of outliers for feature col
        outlier_list_col = df[
            (df[col] < Q1 - outlier_step) | (df[col] > Q3 + outlier_step)
        ].index

        # append the found outlier indices for col to the list of outlier indices
        outlier_indices.extend(outlier_list_col)

    # select observations containing more than n outliers
    outlier_indices = Counter(outlier_indices)
    multiple_outliers = list(k for k, v in outlier_indices.items() if v > n)

    return multiple_outliers


# detect outliers from Age, SibSp, Parch and Fare
Outliers_to_drop = detect_outliers(train, 2, ["Age", "SibSp", "Parch", "Fare"])
Outliers_to_drop

[27, 88, 159, 180, 201, 324, 341, 792, 846, 863]

Since outliers can have a dramatic effect on the prediction (especially for regression problems), I chose to handle them.

I used the Tukey method (Tukey JW., 1977) to detect outliers. It is based on the interquartile range (IQR), the distance between the 1st and 3rd quartiles of a feature's distribution. An outlier is a row with a feature value below Q1 minus an outlier step or above Q3 plus an outlier step, where the step is 1.5 times the IQR.

I decided to detect outliers in the numerical features (Age, SibSp, Parch and Fare). I then considered as outliers the rows with more than two outlying numerical values, which is what `n = 2` selects in the call above.
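The counting rule in `detect_outliers` (keep a row when its outlier count `v` satisfies `v > n`) can also be written without a loop. The following is a minimal sketch, not the code used in this kernel: the helper name `tukey_outlier_mask` and the toy data are made up for illustration, and it relies on `DataFrame.quantile`, which skips missing values, whereas the function above uses `np.percentile`.

```python
import pandas as pd


def tukey_outlier_mask(df, features, k=1.5):
    """Hypothetical helper: True where a value lies outside the Tukey fences."""
    values = df[features]
    q1 = values.quantile(0.25)  # per-column 1st quartile (NaN skipped)
    q3 = values.quantile(0.75)  # per-column 3rd quartile (NaN skipped)
    step = k * (q3 - q1)        # outlier step per column
    # Comparing a DataFrame with a Series aligns the Series on the columns
    return (values < q1 - step) | (values > q3 + step)


# Toy data: the last row is extreme in columns "a" and "b" only
toy = pd.DataFrame(
    {
        "a": [1, 2, 3, 4, 100],
        "b": [1, 2, 3, 4, 200],
        "c": [1, 2, 3, 4, 5],
    }
)

mask = tukey_outlier_mask(toy, ["a", "b", "c"])
counts = mask.sum(axis=1)  # number of outlying features per row

flagged_n1 = list(toy.index[counts > 1])  # rows with more than 1 outlying value
flagged_n2 = list(toy.index[counts > 2])  # rows with more than 2 outlying values
print(flagged_n1, flagged_n2)
```

With `n = 1` the last row is selected because two of its values are outlying; with `n = 2` nothing is selected, since no row has three.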

#### Tukey Method

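The **Tukey method** is an outlier detection technique proposed by John Tukey. It is a simple approach based on **quartiles**.

##### Key concepts
- **Q1 (first quartile)**: the value below which the lowest 25% of the data fall
- **Q3 (third quartile)**: the value above which the highest 25% of the data fall
- **IQR (interquartile range)**: the range of the middle 50%, \( IQR = Q3 - Q1 \)

##### Outlier criteria
- **Ordinary outliers**:
  - Lower bound: \( Q1 - 1.5 \times IQR \)
  - Upper bound: \( Q3 + 1.5 \times IQR \)
- **Extreme outliers**:
  - Lower bound: \( Q1 - 3 \times IQR \)
  - Upper bound: \( Q3 + 3 \times IQR \)

##### Procedure
1. Sort the data
2. Compute **Q1** and **Q3**, then derive the **IQR**
3. Compute the bounds and treat any value outside them as an outlier

##### Advantages
- **Simple and intuitive**: easy to compute and easy to understand visually
- **No distributional assumptions**: it makes no assumption about the data distribution, so it applies to a wide range of datasets

##### Disadvantages
- **Weak on extreme distributions**: limited when the data are skewed or contain many extreme values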
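As a small worked example of the 1.5 × IQR and 3 × IQR bounds, here is a sketch on made-up numbers. The helper `tukey_fences` and the `sample` array are hypothetical and not part of the kernel. The sketch uses `np.nanpercentile` so that a missing value does not affect the quartiles. By contrast, `np.percentile` returns NaN when the input contains NaN, so in `detect_outliers` above a column with missing values (such as Age) ends up with NaN bounds and none of its values are flagged.

```python
import numpy as np


def tukey_fences(values, k=1.5):
    """Hypothetical helper: (lower, upper) Tukey bounds, ignoring NaN."""
    q1, q3 = np.nanpercentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up sample with one missing value
sample = np.array([10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0, 40.0, 70.0, np.nan])

ordinary = tukey_fences(sample, k=1.5)  # Q1 - 1.5*IQR, Q3 + 1.5*IQR
extreme = tukey_fences(sample, k=3.0)   # Q1 - 3*IQR,   Q3 + 3*IQR

# Drop NaN before comparing against the bounds
clean = sample[~np.isnan(sample)]
beyond_ordinary = clean[(clean < ordinary[0]) | (clean > ordinary[1])]
beyond_extreme = clean[(clean < extreme[0]) | (clean > extreme[1])]

# With a NaN in the input, the plain percentile is NaN
plain_q1_is_nan = bool(np.isnan(np.percentile(sample, 25)))

print(ordinary, extreme, beyond_ordinary, beyond_extreme, plain_q1_is_nan)
```

Here Q1 = 14.5 and Q3 = 23.5, so the IQR is 9. The value 40 falls outside the 1.5 × IQR bounds only, while 70 also falls outside the 3 × IQR bounds.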


In [6]:
train.loc[Outliers_to_drop]  # show the outliers rows

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 27 | 28 | 0 | 1 | Fortune, Mr. Charles Alexander | male | 19.0 | 3 | 2 | 19950 | 263.0 | C23 C25 C27 | S |
| 88 | 89 | 1 | 1 | Fortune, Miss. Mabel Helen | female | 23.0 | 3 | 2 | 19950 | 263.0 | C23 C25 C27 | S |
| 159 | 160 | 0 | 3 | Sage, Master. Thomas Henry | male | NaN | 8 | 2 | CA. 2343 | 69.55 | NaN | S |
| 180 | 181 | 0 | 3 | Sage, Miss. Constance Gladys | female | NaN | 8 | 2 | CA. 2343 | 69.55 | NaN | S |
| 201 | 202 | 0 | 3 | Sage, Mr. Frederick | male | NaN | 8 | 2 | CA. 2343 | 69.55 | NaN | S |
| 324 | 325 | 0 | 3 | Sage, Mr. George John Jr | male | NaN | 8 | 2 | CA. 2343 | 69.55 | NaN | S |
| 341 | 342 | 1 | 1 | Fortune, Miss. Alice Elizabeth | female | 24.0 | 3 | 2 | 19950 | 263.0 | C23 C25 C27 | S |
| 792 | 793 | 0 | 3 | Sage, Miss. Stella Anna | female | NaN | 8 | 2 | CA. 2343 | 69.55 | NaN | S |
| 846 | 847 | 0 | 3 | Sage, Mr. Douglas Bullen | male | NaN | 8 | 2 | CA. 2343 | 69.55 | NaN | S |
| 863 | 864 | 0 | 3 | Sage, Miss. Dorothy Edith "Dolly" | female | NaN | 8 | 2 | CA. 2343 | 69.55 | NaN | S |


We detect 10 outliers. Passengers 28, 89 and 342 have a high ticket fare.
The 7 others have very high values of SibSp.

In [7]:
# Drop outliers
train = train.drop(Outliers_to_drop, axis=0).reset_index(drop=True)

## 2.3 Joining train and test set

In [8]:
# Join train and test datasets in order to obtain the same number of features during categorical conversion
train_len = len(train)
dataset = pd.concat(objs=[train, test], axis=0).reset_index(drop=True)

I join the train and test datasets to obtain the same number of features during categorical conversion (see feature engineering).
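To illustrate why joining helps, here is a minimal sketch on toy data. It assumes the categorical conversion is a one-hot encoding with `pd.get_dummies`; the frames `toy_train` and `toy_test` and their values are made up for the example.

```python
import pandas as pd

# Toy frames: "Q" appears only in the test part, "C" only in the train part
toy_train = pd.DataFrame({"Embarked": ["S", "C", "S"]})
toy_test = pd.DataFrame({"Embarked": ["S", "Q"]})

# Encoding each part separately gives different columns
sep_train_cols = list(pd.get_dummies(toy_train, columns=["Embarked"]).columns)
sep_test_cols = list(pd.get_dummies(toy_test, columns=["Embarked"]).columns)

# Encoding the joined frame gives one shared set of columns
toy_len = len(toy_train)
combined = pd.concat([toy_train, toy_test], axis=0).reset_index(drop=True)
encoded = pd.get_dummies(combined, columns=["Embarked"])

# Split back into the two parts by position
enc_train = encoded[:toy_len]
enc_test = encoded[toy_len:]

print(sep_train_cols, sep_test_cols, list(encoded.columns))
```

After the joint encoding, both parts carry a column for every category seen in either set, so a model fitted on the train part can be applied to the test part without realigning columns.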

## 2.4 Check for null and missing values

In [9]:
# Fill empty and NaN values with NaN
dataset = dataset.fillna(np.nan)

# Check for Null values
dataset.isnull().sum()

PassengerId       0
Survived        418
Pclass            0
Name              0
Sex               0
Age             256
SibSp             0
Parch             0
Ticket            0
Fare              1
Cabin          1007
Embarked          2
dtype: int64

The Age and Cabin features have a large share of missing values: 256 and 1007 of the 1299 joined rows, about 20% and 78% respectively.

**Survived missing values correspond to the joined test dataset (the Survived column doesn't exist in the test set and has been replaced by NaN values when concatenating the train and test sets).**
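The effect of the concatenation on the Survived column can be reproduced on a toy example. This is only a sketch with made-up frames (`toy_train`, `toy_test`). It also shows one way to recover the two parts afterwards using the stored train length, in the same spirit as `train_len` above.

```python
import pandas as pd

# Toy frames: the test part has no "Survived" column
toy_train = pd.DataFrame(
    {"PassengerId": [1, 2], "Survived": [0, 1], "Fare": [7.25, 71.28]}
)
toy_test = pd.DataFrame({"PassengerId": [3], "Fare": [8.05]})

toy_len = len(toy_train)
combined = pd.concat([toy_train, toy_test], axis=0).reset_index(drop=True)

# The rows coming from the test part get NaN in "Survived"
n_missing = int(combined["Survived"].isnull().sum())

# Recover the two parts by position; drop the all-NaN target from the test part
train_part = combined[:toy_len]
test_part = combined[toy_len:].drop(columns=["Survived"])

print(n_missing, combined["Survived"].dtype)
```

Note that the Survived column becomes a float column after the join, because NaN cannot be stored in an integer column of this kind.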

In [10]:
# Info
train.info()
train.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 881 entries, 0 to 880
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  881 non-null    int64  
 1   Survived     881 non-null    int64  
 2   Pclass       881 non-null    int64  
 3   Name         881 non-null    object 
 4   Sex          881 non-null    object 
 5   Age          711 non-null    float64
 6   SibSp        881 non-null    int64  
 7   Parch        881 non-null    int64  
 8   Ticket       881 non-null    object 
 9   Fare         881 non-null    float64
 10  Cabin        201 non-null    object 
 11  Embarked     879 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 82.7+ KB


PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            170
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          680
Embarked         2
dtype: int64

In [11]:
train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [12]:
train.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [13]:
### Summarize data
# Summarize and statistics
train.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 881.0 | 881.0 | 881.0 | 711.0 | 881.0 | 881.0 | 881.0 |
| mean | 446.713961 | 0.385925 | 2.307605 | 29.731603 | 0.455165 | 0.363224 | 31.121566 |
| std | 256.617021 | 0.48709 | 0.835055 | 14.547835 | 0.871571 | 0.791839 | 47.996249 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 226.0 | 0.0 | 2.0 | 20.25 | 0.0 | 0.0 | 7.8958 |
| 50% | 448.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.0 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 30.5 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 5.0 | 6.0 | 512.3292 |
