## Loading packages

In [1]:
import pandas as pd
import numpy as np

**Loading the training dataset**

In [2]:
data_train = pd.read_csv("../input/titanic/train.csv", index_col = "PassengerId")

In [3]:
data_train.shape

(891, 11)

**Viewing the first 5 rows of the dataset**

In [4]:
data_train.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


**Percentage of missing values per column**

In [5]:
(data_train.isnull().sum()/data_train.shape[0])*100

Survived     0.000000
Pclass       0.000000
Name         0.000000
Sex          0.000000
Age         19.865320
SibSp        0.000000
Parch        0.000000
Ticket       0.000000
Fare         0.000000
Cabin       77.104377
Embarked     0.224467
dtype: float64

__Interpretation:__ The *Cabin* variable is a problem: about 77% of its values are missing. With so little information available, imputing it would mostly inject invented values and bias the results further, so we do not impute it.
*Age* is also missing for about 20% of passengers, and *Embarked* for about 0.2%.

I therefore drop the variables that are, in my opinion, not useful: *Cabin*, *Ticket* and *Name*.

In [6]:
data_train = data_train.drop(["Cabin", "Ticket", "Name"], axis = 1)
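The column removal above is done by hand. As a rough sketch, the same decision could be driven by a missing-value threshold. The toy frame, its values and the 50% cut-off (`MAX_MISSING_PCT`) are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for data_train (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Share of missing values per column, as a percentage.
missing_pct = toy.isnull().mean() * 100

# Hypothetical cut-off: drop any column with more than 50% missing values.
MAX_MISSING_PCT = 50
to_drop = missing_pct[missing_pct > MAX_MISSING_PCT].index.tolist()
toy_reduced = toy.drop(columns=to_drop)

print(to_drop)
print(toy_reduced.columns.tolist())
```

A threshold only covers columns such as *Cabin*. Dropping *Ticket* and *Name* remains a modelling judgment, since neither has missing values.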

In [7]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, KFold

## **Encoding and imputation of missing values**

In [8]:
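# Backward-fill missing values, then fall back to 0 for any that remain.
# (fillna(method="bfill") is deprecated in recent pandas; bfill() is equivalent.)
data_train = data_train.bfill().fillna(0)

def encodageSex(row):
    # Encode Sex for one row: male -> 0, anything else -> 1.
    row["Sex"] = 0 if row["Sex"] == 'male' else 1
    return row

def encodageEmbarked(row):
    # Encode Embarked for one row: Q -> 0, C -> 1, anything else -> 2.
    if row["Embarked"] == 'Q':
        row["Embarked"] = 0
    elif row["Embarked"] == 'C':
        row["Embarked"] = 1
    else:
        row["Embarked"] = 2
    return row

# Apply both encoders row by row.
data_train = data_train.apply(encodageEmbarked, axis="columns")
data_train = data_train.apply(encodageSex, axis="columns")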
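The row-wise `apply` calls above work, but they can be slow on larger frames. Below is a minimal vectorized sketch that should produce the same codes. It runs on a small made-up frame rather than the real data, and the name `toy` is purely illustrative.

```python
import numpy as np
import pandas as pd

# Small made-up frame with the same kind of columns as data_train.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", "Q", "S"],
})

# Same fill strategy as the notebook: backward fill, then 0 for leftovers.
toy = toy.bfill().fillna(0)

# male -> 0, anything else -> 1 (mirrors encodageSex).
toy["Sex"] = np.where(toy["Sex"] == "male", 0, 1)

# Q -> 0, C -> 1, anything else -> 2 (mirrors encodageEmbarked).
toy["Embarked"] = toy["Embarked"].map({"Q": 0, "C": 1}).fillna(2).astype(int)

print(toy)
```

Note that backward filling *Age* copies the next passenger's age into the gap. That is what the notebook does, but it is only one of several possible imputation choices.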

In [9]:
y = data_train['Survived']
X = data_train.drop("Survived", axis = 1 )

In [10]:
X_train, X_test , y_train, y_test = train_test_split(X, y, test_size = 0.3)
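The split above uses no `random_state`, so the reported score will vary from run to run. As a sketch, the split could be made reproducible and class-balanced as follows. The synthetic arrays and the seed value are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y: 100 rows, 60 negatives and 40 positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0] * 60 + [1] * 40)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

print(X_tr.shape, X_te.shape, y_te.mean())
```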

In [11]:
model = GradientBoostingClassifier(n_estimators=400)

In [12]:
model.fit(X_train, y_train)

GradientBoostingClassifier(n_estimators=400)

In [13]:
model.score(X_test, y_test)

0.8097014925373134
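A single hold-out score depends on which rows land in the test set. `KFold` is imported earlier but never used. One plausible use, sketched here on synthetic data from `make_classification`, is to average the score over several folds. The data, the fold count and the smaller `n_estimators` are illustrative assumptions, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the Titanic features (7 columns, binary target).
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=0)

# 5 shuffled folds; each row is used for testing exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
clf = GradientBoostingClassifier(n_estimators=50, random_state=0)

# One accuracy value per fold.
scores = cross_val_score(clf, X_demo, y_demo, cv=cv)
print(scores.mean(), scores.std())
```

The mean gives a steadier estimate than one split, and the standard deviation indicates how much the score moves between folds.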

### **Let's visualize the variables that contribute the most to the predictions**

In [14]:
import eli5
from eli5.sklearn import PermutationImportance

In [15]:
perm = PermutationImportance( model, random_state = 123 ).fit(X_test, y_test)
eli5.show_weights(perm, feature_names = X_test.columns.tolist())

Weight,Feature
0.1784  ± 0.0393,Sex
0.0754  ± 0.0099,Fare
0.0694  ± 0.0301,Pclass
0.0172  ± 0.0121,SibSp
0.0119  ± 0.0173,Age
0.0075  ± 0.0142,Embarked
-0.0112  ± 0.0125,Parch


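__Interpretation__: The variable that contributes most to the predictions is by far **Sex**, followed by **Fare** and **Pclass**.
**SibSp** has a small effect. The weights of **Age** and **Embarked** are smaller than their reported spread, and **Parch** has a slightly negative weight, so these three contribute little on this test split.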
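For readers without `eli5`, a comparable ranking can be obtained with scikit-learn's own `sklearn.inspection.permutation_importance`. This is a different implementation from the one used above, so the numbers and their reported spread need not match. The sketch below uses synthetic data, and the feature names `f0`..`f4` are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with placeholder feature names (not the Titanic columns).
X_arr, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
cols = [f"f{i}" for i in range(5)]
X_demo = pd.DataFrame(X_arr, columns=cols)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)
clf = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each column 10 times on the test set and record the drop in accuracy.
result = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=123)

ranking = pd.DataFrame({
    "feature": cols,
    "mean": result.importances_mean,
    "std": result.importances_std,
}).sort_values("mean", ascending=False)
print(ranking)
```

A feature whose mean importance is no larger than its spread, or is negative, is hard to distinguish from noise on that test set.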

### **Standardizing the data**

We will standardize each feature, **(Xi - mean(Xi)) / std(Xi)**, to see whether this improves the score.

In [16]:
from sklearn.preprocessing import StandardScaler

transformer = StandardScaler()

X_train = transformer.fit_transform(X_train)
X_test = transformer.transform(X_test)
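The transformation applied above can be checked by hand. The sketch below compares `StandardScaler` with the explicit formula on a tiny made-up array. `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny made-up matrix: 3 rows, 2 features on very different scales.
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 60.0]])

# Library result.
scaled = StandardScaler().fit_transform(X_demo)

# Manual result: subtract the column mean, divide by the column std (ddof=0).
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(scaled, manual))
```

Fitting the scaler on the training part only and reusing it on the test part, as the cell above does, keeps test-set statistics out of the preprocessing.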

In [17]:
model_1 = GradientBoostingClassifier(n_estimators=400)
model_1.fit(X_train, y_train)

GradientBoostingClassifier(n_estimators=400)

In [18]:
model_1.score(X_test, y_test)

0.8097014925373134

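__Conclusion__: The score is unchanged after standardization (0.8097 in both cases), so scaling brings no improvement here. This is expected for tree-based models such as gradient boosting: they split on feature thresholds and are therefore largely insensitive to monotonic rescaling of the inputs.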
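Both runs above report the same score, 0.8097014925373134. The sketch below, on synthetic data with fixed seeds, illustrates why this is plausible for tree-based models: the same gradient boosting model is fitted on raw and on standardized features, and the predictions are compared. The data and hyperparameters are illustrative assumptions, and in rare cases tiny floating-point differences could still change an individual prediction.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (not the Titanic data).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Model fitted on the raw features.
raw_model = GradientBoostingClassifier(n_estimators=50, random_state=0)
raw_model.fit(X_tr, y_tr)
raw_score = raw_model.score(X_te, y_te)

# Same model and seed, fitted on standardized features.
scaler = StandardScaler().fit(X_tr)
scaled_model = GradientBoostingClassifier(n_estimators=50, random_state=0)
scaled_model.fit(scaler.transform(X_tr), y_tr)
scaled_score = scaled_model.score(scaler.transform(X_te), y_te)

# Fraction of test rows on which both models predict the same class.
agreement = (
    raw_model.predict(X_te) == scaled_model.predict(scaler.transform(X_te))
).mean()
print(raw_score, scaled_score, agreement)
```

Standardization tends to matter for distance-based or gradient-based models, such as k-nearest neighbours or logistic regression, rather than for tree ensembles.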