In [None]:
### Classification Lab: Predicting Titanic Passenger Survival

##### 1) The data
Data file: titanic_data.csv

Descriptive variables: passenger characteristics

Target variable: whether the passenger survived or died

- Explore and understand the data
- Clean the data: missing values and outliers
- Choose the right variables for training the model: distribution and correlation
- Transform the data: scaling and encoding
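
The exploration steps listed above could be sketched roughly as follows. This is only an illustration on a small made-up frame (the values in `df` are invented, not taken from titanic_data.csv), and the 1.5 × IQR rule is one common convention for flagging outliers, not necessarily the one expected here.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the Titanic data (values are made up).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 3, 1, 2, 3],
    "Age":      [22.0, 38.0, np.nan, 35.0, np.nan, 54.0],
    "Fare":     [7.25, 71.28, 7.92, 53.10, 8.05, 512.33],
})

# 1) Count missing values per column.
missing_counts = df.isna().sum()

# 2) Flag Fare outliers with the 1.5 * IQR rule.
q1, q3 = df["Fare"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["Fare"] < q1 - 1.5 * iqr) | (df["Fare"] > q3 + 1.5 * iqr)
fare_outliers = df[is_outlier]

# 3) Correlation of each numeric column with the target.
target_corr = df.select_dtypes("number").corr()["Survived"].drop("Survived")

print(missing_counts)
print(fare_outliers)
print(target_corr)
```

On the real data the same three checks would be run on `titanic` before deciding which columns to keep.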

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [53]:
titanic = pd.read_csv('data/titanic_data.csv')
validation = pd.read_csv('data/validation_data.csv')
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [13]:
titanic.shape

(891, 12)

In [14]:
titanic.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [15]:
titanic.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [16]:
titanic.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [65]:
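def clean(data):
    # Drop the 'Ticket', 'Cabin', 'Name' and 'PassengerId' columns.
    # 'Cabin' might have been worth processing instead of dropping.
    # Note: re-running this cell on an already-cleaned frame raises the
    # KeyError shown below, because these columns are no longer present.
    data = data.drop(["Ticket", "Cabin", "Name", "PassengerId"], axis=1)

    # Fill missing numeric values with the column mean
    cols = ["SibSp", "Parch", "Fare", "Age"]
    for col in cols:
        data[col] = data[col].fillna(data[col].mean())

    # Replace missing Embarked values with "U" for Unknown
    data["Embarked"] = data["Embarked"].fillna("U")
    return data

titanic = clean(titanic)
test = clean(validation)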

KeyError: "['Ticket', 'Cabin', 'Name', 'PassengerId'] not found in axis"

In [66]:
titanic.head(10)

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,1,22.0,1,0,7.25,2
1,1,1,0,38.0,1,0,71.2833,0
2,1,3,0,26.0,0,0,7.925,2
3,1,1,0,35.0,1,0,53.1,2
4,0,3,1,35.0,0,0,8.05,2
5,0,3,1,29.699118,0,0,8.4583,1
6,0,1,1,54.0,0,0,51.8625,2
7,0,3,1,2.0,3,1,21.075,2
8,1,3,0,27.0,0,2,11.1333,2
9,1,2,0,14.0,1,0,30.0708,0


##### 2) Train a classification model

- Separate the features and the labels into two variables, X and y
- Split the data into two subsets, 70%-30%, for training and testing
- Import the LogisticRegression model from sklearn and fit it on the training set
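
A minimal sketch of these three steps, assuming synthetic data in place of the Titanic features (`X_demo`, `y_demo` and the other names are hypothetical). It uses `test_size=0.3` to match the 70%-30% split requested above, and `stratify` to keep the class balance similar in both subsets, which is an optional choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic features and binary labels (made up for illustration).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# 70% of the rows for training, 30% held out for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# Fit the logistic regression on the training subset only.
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

print(X_tr.shape, X_te.shape)
print(model.score(X_te, y_te))  # accuracy on the held-out 30%
```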

In [61]:
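from sklearn import preprocessing
labelE = preprocessing.LabelEncoder()

# Categorical columns to convert to integer codes
cols = ["Sex", "Embarked"]

for col in cols:
    # fit_transform refits the encoder on every column, so after the loop
    # labelE only remembers the classes of the last column ("Embarked")
    titanic[col] = labelE.fit_transform(titanic[col])
    print(labelE.classes_)

titanic.head(5)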

['female' 'male']
['C' 'Q' 'S' 'U']


Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,1,22.0,1,0,7.25,2
1,1,1,0,38.0,1,0,71.2833,0
2,1,3,0,26.0,0,0,7.925,2
3,1,1,0,35.0,1,0,53.1,2
4,0,3,1,35.0,0,0,8.05,2


In [62]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

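# Labels (y) and features (X)
y = titanic["Survived"]
X = titanic.drop("Survived", axis=1)

# Note: this holds out 20% of the rows (test_size=0.2),
# not the 30% requested in the exercise statement
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=18)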

In [63]:
clf = LogisticRegression(random_state=0, max_iter=1000).fit(X_train, y_train)

In [64]:
predictions = clf.predict(X_val)
from sklearn.metrics import accuracy_score
accuracy_score(y_val, predictions)

0.8603351955307262

In [55]:
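# This call fails: `validation` is still the raw frame. The cleaned copy is
# stored in `test`, and its Sex/Embarked columns have not been encoded yet.
submission_preds = clf.predict(validation)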

Feature names unseen at fit time:
- Cabin
- Name
- PassengerId
- Ticket



ValueError: could not convert string to float: 'Kelly, Mr. James'
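
The `ValueError` above arises because the model receives raw columns it was never trained on. A possible way around it, sketched on made-up rows, is to run the validation frame through the same preparation as the training frame and to reorder its columns to match the ones used at fit time. The helper `prepare` and all frame names here are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up rows standing in for the training and validation files.
train_raw = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Survived": [0, 1, 1, 1, 0, 0],
    "Name": ["A", "B", "C", "D", "E", "F"],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, None],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 8.46],
})
valid_raw = pd.DataFrame({
    "PassengerId": [892, 893],
    "Name": ["Kelly, Mr. James", "G"],
    "Sex": ["male", "female"],
    "Age": [34.5, None],
    "Fare": [7.83, 7.00],
})

def prepare(frame, age_fill):
    """Hypothetical helper: drop identifiers, impute Age, encode Sex."""
    out = frame.drop(columns=["PassengerId", "Name"])
    out["Age"] = out["Age"].fillna(age_fill)
    out["Sex"] = out["Sex"].map({"female": 0, "male": 1})
    return out

# Statistics used for imputation come from the training data only.
age_mean = train_raw["Age"].mean()

train_ready = prepare(train_raw, age_mean)
X_fit = train_ready.drop(columns="Survived")
y_fit = train_ready["Survived"]
demo_clf = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)

# Same preparation, then same column order as at fit time.
valid_ready = prepare(valid_raw, age_mean)[X_fit.columns]
demo_preds = demo_clf.predict(valid_ready)
print(demo_preds)
```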

##### 3) Evaluate the model

- Make predictions on the test set
- Compute precision and recall
- Display the confusion matrix
- Display the ROC curve and the area under the curve
- Explain the results in a few sentences

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
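
The evaluation steps listed in this section could look roughly like the sketch below. It trains on synthetic data so that it runs on its own; the names `eval_clf`, `X_b`, `y_b` and the data itself are assumptions, and on the real notebook they would be replaced by `clf`, `X_val` and `y_val`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (made up for illustration).
rng = np.random.default_rng(1)
X_syn = rng.normal(size=(300, 4))
y_syn = (X_syn[:, 0] - X_syn[:, 1] + rng.normal(size=300) > 0).astype(int)
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.3, random_state=1)
eval_clf = LogisticRegression(max_iter=1000).fit(X_a, y_a)

# Hard predictions for precision/recall, probabilities for the ROC curve.
y_pred = eval_clf.predict(X_b)
y_score = eval_clf.predict_proba(X_b)[:, 1]

# Precision, recall and F1 per class.
print(classification_report(y_b, y_pred))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_b, y_pred)
print(cm)

# ROC curve points and area under the curve.
fpr, tpr, thresholds = roc_curve(y_b, y_score)
auc = roc_auc_score(y_b, y_score)
print(f"AUC = {auc:.3f}")
```

To display the curve, `fpr` and `tpr` can be passed to `plt.plot(fpr, tpr)` with matplotlib, which the notebook already imports.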

##### 4) Model optimization

- If you have time, try to improve your model (more feature engineering, trying an algorithm other than logistic regression, ...)
- If you do not have time, describe in writing a few optimization ideas you would have liked to try.
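
As one possible direction, the sketch below adds a hypothetical `FamilySize` feature (`SibSp + Parch + 1`) and compares logistic regression with a random forest using 5-fold cross-validation. The data and the labeling rule are invented for illustration, so the scores say nothing about the real Titanic data. Only the workflow is meant to carry over.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic features with Titanic-like column names (values are made up).
rng = np.random.default_rng(2)
n = 200
feat = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "Sex": rng.integers(0, 2, size=n),
    "SibSp": rng.integers(0, 4, size=n),
    "Parch": rng.integers(0, 3, size=n),
})
# Invented labeling rule, used only so the example has something to learn.
label = ((feat["Sex"] == 0) | (feat["Pclass"] == 1)).astype(int)

# Feature engineering: family size aboard, counting the passenger.
feat["FamilySize"] = feat["SibSp"] + feat["Parch"] + 1

# Compare two algorithms with the same 5-fold cross-validation.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
cv_scores = {
    name: cross_val_score(est, feat, label, cv=5).mean()
    for name, est in candidates.items()
}
print(cv_scores)
```

Cross-validation gives a more stable comparison than a single train/test split, at the cost of fitting each model several times.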