# Problem statement

You took part in a Kaggle competition on the Titanic dataset (it really exists, and the curious can find it on Kaggle!). For this you have [a list of 891 passengers](https://www.kaggle.com/c/titanic) with the following features:


- PassengerId: passenger identifier
- Survived: survival indicator (1 if the passenger survived, 0 if they died)
- Pclass: passenger class (1 = 1st class, 2 = 2nd class, 3 = 3rd class)
- Name: name and title of the passenger
- Sex: sex of the passenger
- Age: age of the passenger (fractional if under 1, estimated if of the form xx.5)
- SibSp: number of spouses and siblings aboard
- Parch: number of parents or children aboard
- Ticket: ticket number
- Fare: ticket fare (the price is given in £ for a single purchase, which may cover several tickets)
- Cabin: cabin number
- Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
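
The `Age` convention described in the list (fractional under 1, estimated when of the form xx.5) can be turned into two boolean flags. The sketch below uses invented ages rather than the real file, and the flag names `is_infant` and `is_estimated` are illustrative assumptions.

```python
import pandas as pd

# Toy ages, not taken from the real dataset.
ages = pd.Series([0.42, 22.0, 34.5, 28.5, 0.5, 62.0], name="Age")

# Ages under 1 are fractional by convention, so they are not counted as estimates.
is_infant = ages < 1
# For ages of 1 or more, a fractional part of exactly .5 marks an estimated age.
is_estimated = (~is_infant) & ((ages % 1) == 0.5)

print(pd.DataFrame({"Age": ages, "infant": is_infant, "estimated": is_estimated}))
```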
	 	

# Exercise

The competition was an opportunity to revisit this very well-known dataset, and several tasks were expected:
- identify the factors that favor one passenger's survival over another's, by drawing up a typology of the survivors
- build an algorithm that could predict an individual's survival from these features.
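
A first look at which factors favor survival can come from survival rates per group. The following is a minimal sketch on an invented toy sample that reuses the dataset's column names; it is not the team's analysis.

```python
import pandas as pd

# Toy sample using the dataset's column names; the values are invented.
df = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 0, 1, 1, 0],
    "Sex":      ["female", "male", "female", "male", "male", "female", "male", "female"],
    "Pclass":   [1, 3, 2, 3, 1, 3, 1, 3],
})

# The mean of a 0/1 column is the survival rate within each group.
rate_by_sex = df.groupby("Sex")["Survived"].mean()
rate_by_class = df.groupby("Pclass")["Survived"].mean()
print(rate_by_sex)
print(rate_by_class)
```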

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
import statsmodels.formula.api as smf
import itertools
from random import randint
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier  # Decision Tree algorithm
try:
    from sklearn.utils._testing import ignore_warnings
except ImportError:
    from sklearn.utils.testing import ignore_warnings
import warnings

from sklearn.exceptions import ConvergenceWarning
from datetime import datetime
from os import getcwd
from function import *
from mpl_toolkits.mplot3d import Axes3D
from joblib import dump, load

## 1. Load your data into a Pandas DataFrame

In [2]:
# ---------------------------------------------------------------------------------------------
#                               MAIN
# ---------------------------------------------------------------------------------------------
verbose = False
verboseMain = False

print("Loading data...")
# Get the current working directory (Windows-style separator)
file_path = getcwd() + "\\"

Loading data...
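
The cell above builds the path with a hard-coded backslash, which only works on Windows. A portable alternative, sketched here with `pathlib` and a hypothetical file name, lets the library pick the separator.

```python
from pathlib import Path

# Hypothetical file name used only for illustration.
file_name = "train.csv"

# Path.cwd() is the current working directory; "/" joins with the right separator.
csv_path = Path.cwd() / file_name
print(csv_path)
```

A `Path` object can be passed directly to `pd.read_csv`.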


In [3]:
file_name_test = 'titanic_dataset_kaggle_test_process_2022-01-15-18_44_17.csv'
file_name_train = 'titanic_dataset_kaggle_train_process_2022-01-15-18_44_17.csv'
file_name_test_y = 'titanic_dataset_kaggle_test_y.csv'

file_separator = ','

df_origin_train = pd.read_csv(file_path+file_name_train, sep=file_separator, index_col="PassengerId")
df_origin_test = pd.read_csv(file_path+file_name_test, sep=file_separator, index_col="PassengerId")
df_origin_test_y = pd.read_csv(file_path+file_name_test_y, sep=file_separator, index_col="PassengerId")

print("Data loaded, train:", df_origin_train.shape, ", test:", df_origin_test.shape, ".................................. END")

Data loaded, train: (891, 18) , test: (418, 17) .................................. END


In [4]:
df_origin_train.head()

PassengerId,Survived,Pclass,Name,Age,Sex,SibSp,Parch,group,Fare,Embarked,deck,Titre,Last_name,First_name,Sex_cod,Titre_cod,Embarked_cod,deck_cod
1,0,3,"Braund, Mr. Owen Harris",22.0,male,1,0,1,7.25,S,G,Mr,Braund,Owen Harris,1,7,2,6
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,female,1,0,1,71.2833,C,C,Mrs,Cumings,John Bradley (Florence Briggs Thayer),0,8,0,2
3,1,3,"Heikkinen, Miss. Laina",26.0,female,0,0,0,7.925,S,G,Miss,Heikkinen,Laina,0,6,2,6
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,female,1,0,1,53.1,S,C,Mrs,Futrelle,Jacques Heath (Lily May Peel),0,8,2,2
5,0,3,"Allen, Mr. William Henry",35.0,male,0,0,0,8.05,S,B,Mr,Allen,William Henry,1,7,2,1


In [5]:
df_origin_test.head()

PassengerId,Pclass,Name,Age,Sex,SibSp,Parch,group,Fare,Embarked,deck,Titre,Last_name,First_name,Sex_cod,Titre_cod,Embarked_cod,deck_cod
892,3,"Kelly, Mr. James",34.5,male,0,0,0,7.8292,Q,F,Mr,Kelly,James,1,5,1,5
893,3,"Wilkes, Mrs. James (Ellen Needs)",47.0,female,1,0,1,7.0,S,F,Mrs,Wilkes,James (Ellen Needs),0,6,2,5
894,2,"Myles, Mr. Thomas Francis",62.0,male,0,0,0,9.6875,Q,B,Mr,Myles,Thomas Francis,1,5,1,1
895,3,"Wirz, Mr. Albert",27.0,male,0,0,0,8.6625,S,A,Mr,Wirz,Albert,1,5,2,0
896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",22.0,female,1,1,2,12.2875,S,E,Mrs,Hirvonen,Alexander (Helga E Lindqvist),0,6,2,4


In [6]:
df_origin_test_y.head()

PassengerId,Survived
892,0
893,1
894,0
895,0
896,1


## 2. Data types and organization

In [7]:
df_origin_train.dtypes

Survived          int64
Pclass            int64
Name             object
Age             float64
Sex              object
SibSp             int64
Parch             int64
group             int64
Fare            float64
Embarked         object
deck             object
Titre            object
Last_name        object
First_name       object
Sex_cod           int64
Titre_cod         int64
Embarked_cod      int64
deck_cod          int64
dtype: object

In [8]:
df_origin_test.dtypes

Pclass            int64
Name             object
Age             float64
Sex              object
SibSp             int64
Parch             int64
group             int64
Fare            float64
Embarked         object
deck             object
Titre            object
Last_name        object
First_name       object
Sex_cod           int64
Titre_cod         int64
Embarked_cod      int64
deck_cod          int64
dtype: object

In [9]:
df_origin_test_y.dtypes

Survived    int64
dtype: object

In [10]:
df_origin_train.isna().sum()

Survived        0
Pclass          0
Name            0
Age             0
Sex             0
SibSp           0
Parch           0
group           0
Fare            0
Embarked        2
deck            0
Titre           0
Last_name       0
First_name      0
Sex_cod         0
Titre_cod       0
Embarked_cod    0
deck_cod        0
dtype: int64

In [11]:
df_origin_test['Fare'] = df_origin_test['Fare'].fillna(0)
df_origin_test.isna().sum()

Pclass          0
Name            0
Age             0
Sex             0
SibSp           0
Parch           0
group           0
Fare            0
Embarked        0
deck            0
Titre           0
Last_name       0
First_name      0
Sex_cod         0
Titre_cod       0
Embarked_cod    0
deck_cod        0
dtype: int64

In [12]:
df_origin_test_y.isna().sum()

Survived    0
dtype: int64
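
The cells above count missing values with `isna().sum()` and fill the missing test `Fare` with 0. As an alternative sketch (an assumption, not what the notebook does), the fill value can be learned from the training set, for example its median. The frames below are invented toy data.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for the train and test sets; values are invented.
train = pd.DataFrame({"Fare": [7.25, 71.28, 8.05, 53.10]})
test = pd.DataFrame({"Fare": [7.83, np.nan, 9.69]})

# Count missing values per column, as df.isna().sum() does above.
print(test.isna().sum())

# Learn the fill value from the training set only, then apply it to the test set.
fare_median = train["Fare"].median()
test["Fare"] = test["Fare"].fillna(fare_median)
print(test)
```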

# Machine Learning

You have been added to a team whose work started a few weeks ago: at this stage the mission is in fact finished (cf. this notebook). Your teammates have worked hard, so your first task is to get familiar with their work.

In [13]:
random_state = 0

In [14]:
df_origin_train.columns

Index(['Survived', 'Pclass', 'Name', 'Age', 'Sex', 'SibSp', 'Parch', 'group',
       'Fare', 'Embarked', 'deck', 'Titre', 'Last_name', 'First_name',
       'Sex_cod', 'Titre_cod', 'Embarked_cod', 'deck_cod'],
      dtype='object')

In [15]:
to_predict_columns = 'Survived'
columns = ['Pclass', 'Sex_cod', 'Titre_cod', 'Age', 'group', 'Fare', 'Embarked_cod', 'deck_cod'] 

X_train = df_origin_train[columns]
y_train = df_origin_train[to_predict_columns]

X_test = df_origin_test[columns]
y_test = df_origin_test_y[to_predict_columns]

<mark>Manually, I had obtained the following results:</mark>
```python
1.0 for test ( 0.79 for train) => Logistic Regression, with: ['Pclass', 'Sex_cod', 'Titre_cod']
1.0 for test ( 0.79 for train) => K Nearest Neighbor, with: ['Pclass', 'Sex_cod']
1.0 for test ( 0.79 for train) => Decision Tree, with: ['Sex_cod']
1.0 for test ( 0.79 for train) => Random Forest, with: ['Pclass', 'Sex_cod']
1.0 for test ( 0.79 for train) => Support Vector Machine (Linear)
0.65 for test ( 0.69 for train) => Support Vector Machine (RBF)
0.81 for test ( 0.79 for train) => Gaussian Naive Bayes
```
The corresponding settings:
```python
LogisticReg : 
* fit_intercept= True or False
* penalty=none or l2 or l1
* solver=liblinear or newton-cg or lbfgs or sag or saga
* columns = ['Pclass', 'Sex_cod', 'Titre_cod']

KNN : 
* KNN = 2 :['Pclass', 'Sex_cod']
* KNN = 4 :['Embarked_cod', 'Pclass', 'Sex_cod']
* KNN = 1 :['Sex_cod']
* KNN = 1 :['Sex_cod', 'Titre_cod']
* KNN = 1 :['Embarked_cod', 'Sex_cod']
* KNN = 7 :['Sex_cod', 'deck_cod']

Decision Tree
* criterion = gini or entropy
* splitter=best or random
* columns = ['Sex_cod'] (0.98 on test <=> 0.81 on train, with ['Pclass', 'Sex_cod', 'group'])

RandomForest :
* n_estimators = from 3 to 99
* criterion=gini or entropy
* columns = ['Pclass', 'Sex_cod']
```
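
Several of the models listed above reach a test score of 1.0 using `Sex_cod` alone, which would mean the test labels are fully determined by that one column. A cross-tabulation makes this kind of dependency visible. The sketch below runs on invented toy data, not on the notebook's files, and only illustrates the check.

```python
import pandas as pd

# Toy data: the label is deliberately a pure function of the feature.
feature = pd.Series([0, 1, 0, 0, 1, 1], name="Sex_cod")
label = pd.Series([1, 0, 1, 1, 0, 0], name="Survived")

# Rows: feature values, columns: label values, cells: counts.
table = pd.crosstab(feature, label)
print(table)

# If every feature value maps to a single label value, each row has one non-zero cell.
fully_determined = bool(((table > 0).sum(axis=1) == 1).all())
print(fully_determined)
```

If the same check returned `True` on the real test labels, a score of 1.0 would say more about how those labels were produced than about the quality of the model.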

## 5.0 Transform

In [16]:
# Transformer
scaler = StandardScaler()
X_train_transform = scaler.fit_transform(X_train[['Embarked_cod', 'Pclass', 'Sex_cod']])
X_train_transform

array([[ 0.58111394,  0.82737724,  0.73769513],
       [-1.93846038, -1.56610693, -1.35557354],
       [ 0.58111394,  0.82737724, -1.35557354],
       ...,
       [ 0.58111394,  0.82737724, -1.35557354],
       [-1.93846038, -1.56610693,  0.73769513],
       [-0.67867322,  0.82737724,  0.73769513]])
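
To make the array above easier to read: `StandardScaler` subtracts each column's mean and divides by its standard deviation. A minimal sketch on a toy matrix, comparing the scaler with the manual formula:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix with two columns on different scales.
X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])

# StandardScaler centers each column and divides by its population std (ddof=0).
scaled = StandardScaler().fit_transform(X)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(scaled)
print(np.allclose(scaled, manual))
```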

In [17]:
# Draft of candidate parameters; keys follow make_pipeline's lowercase step names
grid_ex_params = {
            'polynomialfeatures__degree' : [2,3,4],
            # StandardScaler has no penalty parameter; assumed to be meant for SGDClassifier
            'sgdclassifier__penalty' : [None, 'l2', 'l1', 'elasticnet'],
            'kneighborsclassifier__n_neighbors': list(range(1,10)),
            'kneighborsclassifier__p': list(range(1,10)), # TODO: complete the values
            'kneighborsclassifier__metric' : ['minkowski'], # TODO: complete the values
            'kneighborsclassifier__weights' : ['uniform', 'distance'],
            'kneighborsclassifier__algorithm' : ['auto', 'ball_tree', 'kd_tree', 'brute'],
            'decisiontreeclassifier__criterion' : ["gini", "entropy"],
            'decisiontreeclassifier__splitter' : ["best", "random"],
            'randomforestclassifier__criterion' : ["gini", "entropy"],
            'randomforestclassifier__n_estimators' : list(range(1,100,10)),
            'logisticregression__solver' : ["newton-cg", "lbfgs", "liblinear", "sag", "saga"],
            'logisticregression__penalty' : [None, 'l2', 'l1', 'elasticnet'],
            'logisticregression__fit_intercept' :[True, False],
            'svc__kernel': ['linear', 'rbf'] # TODO: complete the values
}

In [18]:
grid_ex_params = {
            'polynomialfeatures__degree' : [2,3,4],
            'standardscaler__with_mean' : [True, False],
            'standardscaler__with_std' : [True, False]
}

In [19]:
grid_ex_pipeline = make_pipeline(PolynomialFeatures(), 
                           StandardScaler(), 
                           SGDClassifier(random_state=random_state))
grid_ex_pipeline

Pipeline(steps=[('polynomialfeatures', PolynomialFeatures()),
                ('standardscaler', StandardScaler()),
                ('sgdclassifier', SGDClassifier(random_state=0))])
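
The keys of a parameter grid must match the step names shown in the pipeline above. A short sketch of how those names are derived and how to list the valid keys:

```python
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

pipeline = make_pipeline(PolynomialFeatures(), StandardScaler(), SGDClassifier(random_state=0))

# make_pipeline names each step after its class, in lowercase.
step_names = [name for name, _ in pipeline.steps]
print(step_names)

# Tunable keys have the form "<step name>__<parameter>".
param_keys = sorted(k for k in pipeline.get_params() if "__" in k)
print(param_keys[:5])
```

Any key in a `param_grid` that is absent from `pipeline.get_params()` makes `GridSearchCV.fit` raise an error.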

In [20]:
grid_ex = GridSearchCV(grid_ex_pipeline,param_grid=grid_ex_params, cv=4)


In [21]:
grid_ex.fit(X_train, y_train)

GridSearchCV(cv=4,
             estimator=Pipeline(steps=[('polynomialfeatures',
                                        PolynomialFeatures()),
                                       ('standardscaler', StandardScaler()),
                                       ('sgdclassifier',
                                        SGDClassifier(random_state=0))]),
             param_grid={'polynomialfeatures__degree': [2, 3, 4],
                         'standardscaler__with_mean': [True, False],
                         'standardscaler__with_std': [True, False]})

In [22]:
grid_ex.best_params_

{'polynomialfeatures__degree': 3,
 'standardscaler__with_mean': False,
 'standardscaler__with_std': True}

In [23]:
grid_ex.score(X_test, y_test)

0.715311004784689
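
The value above is the accuracy of the refitted best pipeline on the test set. It can differ from `best_score_`, which is the mean cross-validated score on the training folds. The sketch below shows both on synthetic data; the `alpha` grid is an assumption chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the Titanic features.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

pipe = make_pipeline(StandardScaler(), SGDClassifier(random_state=0))
grid = GridSearchCV(pipe, param_grid={"sgdclassifier__alpha": [1e-4, 1e-3, 1e-2]}, cv=4)
grid.fit(X_tr, y_tr)

# best_score_ is the mean cross-validated accuracy of the best candidate on the training folds.
print("CV score:", round(grid.best_score_, 3))
# score() refits nothing: it evaluates the already refitted best pipeline on new data.
print("Test score:", round(grid.score(X_te, y_te), 3))
```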

In [24]:
# Estimator
#model = KNeighborsClassifier(n_neighbors=4, weights="uniform", algorithm="auto", metric = 'minkowski', p = 2)
model = KNeighborsClassifier(n_neighbors=4, metric = 'minkowski', p = 2)
model.fit(X_train_transform, y_train)
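# Note: fit_transform refits the scaler on the test set; scaler.transform(...) would reuse the training statistics
X_test_transform = scaler.fit_transform(X_test[['Embarked_cod', 'Pclass', 'Sex_cod']])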
score_test = model.score(X_test_transform,y_test)
score_test

0.84688995215311
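
Wrapping the scaler and the classifier in one pipeline keeps the scaling statistics tied to the training data. This is a sketch on synthetic data, not a rerun of the cell above, so its score is not comparable to the recorded one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the three encoded columns.
X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# fit() learns the scaling statistics on the training data only.
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=4, p=2))
knn_pipeline.fit(X_tr, y_tr)

# score() applies transform (not fit_transform) to the test data before predicting.
print(knn_pipeline.score(X_te, y_te))
```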

## 5.1. GridSearchCV

In [25]:
random_state=0

In [32]:
def get_models_grid(X_train, y_train, verbose=False):
    grid_dic = {}
    if verbose: print("randomforestclassifier", end="")
    grid_rf_params = { 'randomforestclassifier__criterion' : ["gini", "entropy"],
                   'randomforestclassifier__n_estimators' : list(range(1,100,10))}
    grid_rf_pipeline = make_pipeline( RandomForestClassifier(random_state=random_state))
    grid_rf = GridSearchCV(grid_rf_pipeline,param_grid=grid_rf_params, cv=4)
    grid_rf.fit(X_train, y_train)
    grid_dic['randomforestclassifier'] = grid_rf
    if verbose: print(", kneighborsclassifier", end="")
    grid_knn_params = { 'kneighborsclassifier__n_neighbors': list(range(1,10,1)),
                    'kneighborsclassifier__p': list(range(1,10,1)),
                    'kneighborsclassifier__metric' : ['minkowski']}
    grid_knn_pipeline = make_pipeline( KNeighborsClassifier())
    grid_knn = GridSearchCV(grid_knn_pipeline,param_grid=grid_knn_params, cv=4)
    grid_knn.fit(X_train, y_train)
    grid_dic['kneighborsclassifier'] = grid_knn
    if verbose: print(", decisiontreeclassifier", end="")
    grid_dtc_params = { 'decisiontreeclassifier__criterion' : ["gini", "entropy"],
                    'decisiontreeclassifier__splitter' : ["best", "random"]}
    grid_dtc_pipeline = make_pipeline( DecisionTreeClassifier(random_state=random_state))
    grid_dtc = GridSearchCV(grid_dtc_pipeline,param_grid=grid_dtc_params, cv=4)
    grid_dtc.fit(X_train, y_train)
    grid_dic['decisiontreeclassifier'] = grid_dtc
    if verbose: print(", logisticregression", end="")
    grid_lr_params = { 'logisticregression__solver' : ["newton-cg", "lbfgs", "liblinear", "sag", "saga"],
                    'logisticregression__penalty' : [None, 'l2', 'l1', 'elasticnet'],
                    'logisticregression__fit_intercept' : [True, False]}
    # penalty='l2', *, dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None
    grid_lr_pipeline = make_pipeline( LogisticRegression(random_state=random_state))
    grid_lr = GridSearchCV(grid_lr_pipeline,param_grid=grid_lr_params, cv=4)
    grid_lr.fit(X_train, y_train)
    grid_dic['logisticregression'] = grid_lr
    if verbose: print("                 DONE")
    return grid_dic
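
The logistic regression grid in `get_models_grid` crosses every solver with every penalty, although some pairs are not supported (for example `lbfgs` with `l1`), so those fits fail. Passing a list of dicts restricts the search to supported pairs. The sketch below runs on synthetic data; the added `StandardScaler` step and `max_iter=1000` are assumptions aimed at the convergence warnings, not part of the original function.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the Titanic features.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# A list of dicts gives one sub-grid per solver family, so only supported pairs are tried.
lr_param_grid = [
    {"logisticregression__solver": ["lbfgs", "newton-cg"],
     "logisticregression__penalty": ["l2"]},
    {"logisticregression__solver": ["liblinear"],
     "logisticregression__penalty": ["l1", "l2"]},
]

# Scaling first and raising max_iter are assumptions aimed at the convergence warnings.
lr_pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000, random_state=0))
lr_grid = GridSearchCV(lr_pipeline, param_grid=lr_param_grid, cv=4)
lr_grid.fit(X, y)

print(lr_grid.best_params_)
print(np.isnan(lr_grid.cv_results_["mean_test_score"]).any())
```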

In [33]:
grid_dic = get_models_grid(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

In [34]:
warnings.filterwarnings("ignore")
@ignore_warnings(category=ConvergenceWarning)
def found_better_config_by_model(X_train, X_test, y_train, y_test, verbose=False):

    # Start with as many columns as possible
    columns_started = list(X_train.columns)
    better_grid_score_dic = {}
    better_grid_equals = {}
    ever_test = []
    
    # Reorder the columns to find further relevant configurations
    # The subset size of 6 was chosen after the first test runs and their results
    for subset in itertools.permutations(columns_started, 6):
        columns = list(subset)
            
        # At each iteration, look at the best score
        while len(columns)>0:
            str_col = str(sorted(columns))
            if str_col not in ever_test:
                grid_dic = get_models_grid(X_train[columns], y_train)
                for model_name,grid in grid_dic.items():
                    score = grid.score(X_test[columns], y_test)

                    model_better_score = better_grid_score_dic.get(model_name, 0)
                    model_grig_res = (grid, score, str_col)
                    if score > model_better_score:
                        model_better_score = score
                        better_grid_equals[model_name] = [model_grig_res]
                        if verbose:
                            print(f"{model_name} New Best :{round(score,2)} on test, {str_col}, {grid.best_params_}")
                    elif score == model_better_score:
                        # setdefault avoids a KeyError when the very first score equals 0
                        better_grid_equals.setdefault(model_name, []).append(model_grig_res)
                        if verbose:
                            print(f"{model_name} Same Best :{round(score,2)} on test, {str_col}, {grid.best_params_}")

                    better_grid_score_dic[model_name] = model_better_score
                ever_test.append(str_col)
                if verbose>1: print(str_col, "         DONE")
            # Remove one column
            columns.pop()
    
    return better_grid_score_dic, better_grid_equals

In [35]:
better_grid_score_dic, better_grid_equals = found_better_config_by_model(X_train, X_test, y_train, y_test, verbose=True)

randomforestclassifier New Best :0.5 on test, ['Age', 'Fare', 'Pclass', 'Sex_cod', 'Titre_cod', 'group'], {'randomforestclassifier__criterion': 'entropy', 'randomforestclassifier__n_estimators': 81}
kneighborsclassifier New Best :0.67 on test, ['Age', 'Fare', 'Pclass', 'Sex_cod', 'Titre_cod', 'group'], {'kneighborsclassifier__metric': 'minkowski', 'kneighborsclassifier__n_neighbors': 7, 'kneighborsclassifier__p': 1}
decisiontreeclassifier New Best :0.82 on test, ['Age', 'Fare', 'Pclass', 'Sex_cod', 'Titre_cod', 'group'], {'decisiontreeclassifier__criterion': 'gini', 'decisiontreeclassifier__splitter': 'best'}
logisticregression New Best :0.94 on test, ['Age', 'Fare', 'Pclass', 'Sex_cod', 'Titre_cod', 'group'], {'logisticregression__fit_intercept': True, 'logisticregression__penalty': 'l2', 'logisticregression__solver': 'liblinear'}
logisticregression New Best :0.94 on test, ['Age', 'Pclass', 'Sex_cod', 'Titre_cod', 'group'], {'logisticregression__fit_intercept': True, 'logisticregressi

In [36]:
better_grid_score_dic

{'randomforestclassifier': 1.0,
 'kneighborsclassifier': 1.0,
 'decisiontreeclassifier': 1.0,
 'logisticregression': 1.0}
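
The search function walks permutations of six columns and records already tested column sets, because column order does not change which set is passed to the grid. `itertools.combinations` enumerates each unordered subset exactly once, which avoids the bookkeeping. A sketch with a hypothetical short column list:

```python
from itertools import combinations

# Hypothetical short column list; the notebook uses eight columns.
columns = ["Pclass", "Sex_cod", "Age", "Fare"]

# Every unordered subset of size 1 to 3 appears exactly once, already de-duplicated.
subsets = [list(c) for size in range(1, 4) for c in combinations(columns, size)]

print(len(subsets))
print(subsets[:5])
```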

In [46]:
# Display the best configurations and columns
for k in better_grid_equals.keys():
    v = better_grid_equals[k]
    print(k)
    for val in v:
        print(val[-1], end="")
        if isinstance(val[0], GridSearchCV):
            print(val[0].best_params_, end="")
        print("")

randomforestclassifier
['Sex_cod']{'randomforestclassifier__criterion': 'gini', 'randomforestclassifier__n_estimators': 1}
['Sex_cod', 'Titre_cod']{'randomforestclassifier__criterion': 'gini', 'randomforestclassifier__n_estimators': 1}
['Embarked_cod', 'Sex_cod']{'randomforestclassifier__criterion': 'gini', 'randomforestclassifier__n_estimators': 11}
['Sex_cod', 'deck_cod']{'randomforestclassifier__criterion': 'gini', 'randomforestclassifier__n_estimators': 11}
kneighborsclassifier
['Pclass', 'Sex_cod']{'kneighborsclassifier__metric': 'minkowski', 'kneighborsclassifier__n_neighbors': 4, 'kneighborsclassifier__p': 1}
['Sex_cod']{'kneighborsclassifier__metric': 'minkowski', 'kneighborsclassifier__n_neighbors': 9, 'kneighborsclassifier__p': 1}
['Embarked_cod', 'Sex_cod']{'kneighborsclassifier__metric': 'minkowski', 'kneighborsclassifier__n_neighbors': 5, 'kneighborsclassifier__p': 1}
decisiontreeclassifier
['Sex_cod']{'decisiontreeclassifier__criterion': 'gini', 'decisiontreeclassifier__s

<mark>With GridSearchCV I get the following results:</mark>

```Python
randomforestclassifier = 1.0
kneighborsclassifier = 1.0
decisiontreeclassifier = 1.0
logisticregression = 1.0
```
With the parameters:

```Python
# logisticregression
['Pclass', 'Sex_cod', 'Titre_cod'],              {'fit_intercept': True, 'penalty': 'l2', 'solver': 'liblinear'}
['Pclass', 'Sex_cod'],                           {'fit_intercept': True, 'penalty': 'l2', 'solver': 'newton-cg'}
['Pclass', 'Sex_cod', 'deck_cod'],               {'fit_intercept': False, 'penalty': 'l2', 'solver': 'newton-cg'}
['Sex_cod'],                                     {'fit_intercept': True, 'penalty': 'l2', 'solver': 'newton-cg'}
['Sex_cod', 'Titre_cod'],                        {'fit_intercept': True, 'penalty': 'l2', 'solver': 'newton-cg'}
['Age', 'Sex_cod', 'Titre_cod'],                 {'fit_intercept': True, 'penalty': 'l2', 'solver': 'liblinear'}
['Age', 'Embarked_cod', 'Sex_cod', 'Titre_cod'], {'fit_intercept': True, 'penalty': 'l2', 'solver': 'liblinear'}
['Age', 'Sex_cod', 'Titre_cod', 'deck_cod'],     {'fit_intercept': True, 'penalty': 'l2', 'solver': 'liblinear'}
['Embarked_cod', 'Sex_cod', 'Titre_cod'],        {'fit_intercept': True, 'penalty': 'l2', 'solver': 'liblinear'}
['Sex_cod', 'Titre_cod', 'deck_cod'],            {'fit_intercept': True, 'penalty': 'l2', 'solver': 'liblinear'}
['Age', 'Sex_cod'],                              {'fit_intercept': True, 'penalty': 'l2', 'solver': 'newton-cg'}
['Age', 'Sex_cod', 'deck_cod'],                  {'fit_intercept': True, 'penalty': 'l2', 'solver': 'sag'}
['Embarked_cod', 'Sex_cod'],                     {'fit_intercept': True, 'penalty': 'l2', 'solver': 'newton-cg'}
['Embarked_cod', 'Sex_cod', 'deck_cod'],         {'fit_intercept': True, 'penalty': 'l2', 'solver': 'liblinear'}
['Sex_cod', 'deck_cod'],                         {'fit_intercept': True, 'penalty': 'l2', 'solver': 'newton-cg'}

# randomforestclassifier :
['Sex_cod'],                 {'criterion': 'gini', 'n_estimators': 1}
['Sex_cod', 'Titre_cod'],    {'criterion': 'gini', 'n_estimators': 1}
['Embarked_cod', 'Sex_cod'], {'criterion': 'gini', 'n_estimators': 11}
['Sex_cod', 'deck_cod'],     {'criterion': 'gini', 'n_estimators': 11}

# kneighborsclassifier:
['Pclass', 'Sex_cod'],       {'metric': 'minkowski', 'n_neighbors': 4, 'p': 1}
['Sex_cod'],                 {'metric': 'minkowski', 'n_neighbors': 9, 'p': 1}
['Embarked_cod', 'Sex_cod'], {'metric': 'minkowski', 'n_neighbors': 5, 'p': 1}

# decisiontreeclassifier:
['Sex_cod'],                 {'criterion': 'gini', 'splitter': 'best'}
['Embarked_cod', 'Sex_cod'], {'criterion': 'gini', 'splitter': 'best'}
['Sex_cod', 'deck_cod'],     {'criterion': 'gini', 'splitter': 'best'}
```


In [33]:
grid_rf_params = { 'randomforestclassifier__criterion' : ["gini", "entropy"],
                   'randomforestclassifier__n_estimators' : list(range(1,100,10))}
grid_rf_pipeline = make_pipeline( RandomForestClassifier(random_state=random_state))
print(grid_rf_pipeline)
grid_rf = GridSearchCV(grid_rf_pipeline,param_grid=grid_rf_params, cv=4)
print(grid_rf.fit(X_train, y_train))
grid_dic['randomforestclassifier'] = grid_rf

Pipeline(steps=[('randomforestclassifier',
                 RandomForestClassifier(random_state=0))])
GridSearchCV(cv=4,
             estimator=Pipeline(steps=[('randomforestclassifier',
                                        RandomForestClassifier(random_state=0))]),
             param_grid={'randomforestclassifier__criterion': ['gini',
                                                               'entropy'],
                         'randomforestclassifier__n_estimators': [1, 11, 21, 31,
                                                                  41, 51, 61,
                                                                  71, 81, 91]})


In [34]:
grid_knn_params = { 'kneighborsclassifier__n_neighbors': list(range(1,10,1)),
                    'kneighborsclassifier__p': list(range(1,10,1)),
                    'kneighborsclassifier__metric' : ['minkowski']}
grid_knn_pipeline = make_pipeline( KNeighborsClassifier())
print(grid_knn_pipeline)
grid_knn = GridSearchCV(grid_knn_pipeline,param_grid=grid_knn_params, cv=4)
print(grid_knn.fit(X_train, y_train))
grid_dic['kneighborsclassifier'] = grid_knn

Pipeline(steps=[('kneighborsclassifier', KNeighborsClassifier())])
GridSearchCV(cv=4,
             estimator=Pipeline(steps=[('kneighborsclassifier',
                                        KNeighborsClassifier())]),
             param_grid={'kneighborsclassifier__metric': ['minkowski'],
                         'kneighborsclassifier__n_neighbors': [1, 2, 3, 4, 5, 6,
                                                               7, 8, 9],
                         'kneighborsclassifier__p': [1, 2, 3, 4, 5, 6, 7, 8,
                                                     9]})


In [35]:
grid_dtc_params = { 'decisiontreeclassifier__criterion' : ["gini", "entropy"],
                    'decisiontreeclassifier__splitter' : ["best", "random"]}
grid_dtc_pipeline = make_pipeline( DecisionTreeClassifier(random_state=random_state))
print(grid_dtc_pipeline)
grid_dtc = GridSearchCV(grid_dtc_pipeline,param_grid=grid_dtc_params, cv=4)
print(grid_dtc.fit(X_train, y_train))
grid_dic['decisiontreeclassifier'] = grid_dtc

Pipeline(steps=[('decisiontreeclassifier',
                 DecisionTreeClassifier(random_state=0))])
GridSearchCV(cv=4,
             estimator=Pipeline(steps=[('decisiontreeclassifier',
                                        DecisionTreeClassifier(random_state=0))]),
             param_grid={'decisiontreeclassifier__criterion': ['gini',
                                                               'entropy'],
                         'decisiontreeclassifier__splitter': ['best',
                                                              'random']})


In [36]:
grid_lr_params = { 'logisticregression__solver' : ["newton-cg", "lbfgs", "liblinear", "sag", "saga"],
                    'logisticregression__penalty' : [None, 'l2', 'l1', 'elasticnet'],
                    'logisticregression__fit_intercept' : [True, False]}
# penalty='l2', *, dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None
grid_lr_pipeline = make_pipeline( LogisticRegression(random_state=random_state))
print(grid_lr_pipeline)
grid_lr = GridSearchCV(grid_lr_pipeline,param_grid=grid_lr_params, cv=4)
print(grid_lr.fit(X_train, y_train))
grid_dic['logisticregression'] = grid_lr

Pipeline(steps=[('logisticregression', LogisticRegression(random_state=0))])


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

GridSearchCV(cv=4,
             estimator=Pipeline(steps=[('logisticregression',
                                        LogisticRegression(random_state=0))]),
             param_grid={'logisticregression__fit_intercept': [True, False],
                         'logisticregression__penalty': [None, 'l2', 'l1',
                                                         'elasticnet'],
                         'logisticregression__solver': ['newton-cg', 'lbfgs',
                                                        'liblinear', 'sag',
                                                        'saga']})


In [39]:
dic_model_score = {}
for model_name,grid in grid_dic.items():
    score = grid.score(X_test, y_test)
    dic_model_score[model_name] = score
    print(round(score,3), model_name, "Best params:",grid.best_params_)


0.39 randomforestclassifier Best params: {'randomforestclassifier__criterion': 'entropy', 'randomforestclassifier__n_estimators': 11}
0.656 kneighborsclassifier Best params: {'kneighborsclassifier__metric': 'minkowski', 'kneighborsclassifier__n_neighbors': 5, 'kneighborsclassifier__p': 1}
0.818 decisiontreeclassifier Best params: {'decisiontreeclassifier__criterion': 'entropy', 'decisiontreeclassifier__splitter': 'best'}
0.943 logisticregression Best params: {'logisticregression__fit_intercept': True, 'logisticregression__penalty': 'l2', 'logisticregression__solver': 'liblinear'}


### 5.1.1 KNN

Best configurations by test score (each run uses a new train/test split):
* Run 1: test 0.89 vs. train 0.85, k = 4, columns: ['Pclass', 'sex_cod', 'title_cod', 'family_on_board', 'embarked_cod', 'deck_cod']
* Run 2: test 0.87 vs. train 0.82, k = 5, columns: ['Pclass', 'sex_cod', 'title_cod', 'embarked_cod']
* Run 3:
   * test 0.82 vs. train 0.86, k = 2, columns: ['Pclass', 'sex_cod', 'family_on_board', 'embarked_cod', 'deck_cod']
   * test 0.82 vs. train 0.85, k = 4, columns: ['Pclass', 'sex_cod', 'title_cod', 'embarked_cod', 'deck_cod']
   * test 0.82 vs. train 0.82, k = 3, columns: ['Pclass', 'sex_cod', 'family_on_board', 'embarked_cod']
   * test 0.82 vs. train 0.86, k = 4, columns: ['Pclass', 'title_cod', 'embarked_cod', 'deck_cod']
* Run 4:
   * test 0.9 vs. train 0.83, k = 7, columns: ['Pclass', 'sex_cod', 'title_cod', 'embarked_cod', 'deck_cod']
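Because the runs above rank configurations by their test score, the best figure varies with each split. A possible alternative, sketched here under assumptions, is to rank column subsets and values of k by cross-validation on the training set only, then score the winner once on the test set. The data is synthetic (`make_classification`), column indices stand in for the notebook's column names, and this is not the notebook's `knn_found_better_config` helper.

```python
import itertools

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded Titanic columns (assumption).
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

best = (-1.0, None, None)  # (mean CV accuracy, column indices, k)
for r in range(1, X.shape[1] + 1):
    for cols in itertools.combinations(range(X.shape[1]), r):
        for k in range(1, 11):
            model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
            # Only the training set is used to compare configurations.
            cv_score = cross_val_score(model, X_train[:, list(cols)], y_train, cv=4).mean()
            if cv_score > best[0]:
                best = (cv_score, cols, k)

cv_score, cols, k = best
final = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
final.fit(X_train[:, list(cols)], y_train)
# The test set is touched once, after the configuration is fixed.
test_score = final.score(X_test[:, list(cols)], y_test)
print(f"k={k}, columns={cols}, CV={cv_score:.2f}, test={test_score:.2f}")
```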


The model is saved at this point, since a score above 0.9 will be hard to beat.
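As a minimal sketch of the save step, the model can be written with `joblib.dump` and read back with `joblib.load`. Storing the column list next to the model is a suggestion made here, not something the notebook does; the data is synthetic and the file goes to a temporary directory.

```python
import tempfile
from datetime import datetime
from pathlib import Path

from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data with five features, standing in for the five encoded columns (assumption).
columns = ['Pclass', 'sex_cod', 'title_cod', 'embarked_cod', 'deck_cod']
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = KNeighborsClassifier(n_neighbors=7).fit(X, y)
score = model.score(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    stamp = datetime.now().strftime("%Y-%m-%d-%H_%M_%S")
    path = Path(tmp_dir) / f"Titanic_model_saved_score_{round(score, 2)}_{stamp}.joblib"
    # Saving the columns with the model keeps the two in sync when reloading.
    dump({"model": model, "columns": columns}, path)
    saved = load(path)

restored = saved["model"]
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(saved["columns"], same_predictions)
```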

In [18]:
X_train.columns

Index(['Pclass', 'Sex_cod', 'Titre_cod', 'Age', 'group', 'Fare',
       'Embarked_cod', 'deck_cod'],
      dtype='object')

In [19]:
# Logistic Regression 
log, log_better_columns, log_test_res, log_better_score, log_better_score_train = logisticRegression_found_better_full(X_train, X_test, y_train, y_test, random_state=0, plot=False, verbose=1)
print('[0]Logistic Regression Training Accuracy:', log_better_score, "( train:", log_better_score_train, ") with:", log_better_columns)
#print(log_test_res[log_better_score])

--------------------------------------------------------------------------------------------------------
New Best :0.92 de test <=> 0.8 de train, param = penalty=none, fit_intercept=True, solver=newton-cg,random_state=0 :['Age', 'Fare', 'Pclass', 'Sex_cod', 'Titre_cod', 'group']
Same Best :0.92 de test <=> 0.8 de train, param = penalty=none, fit_intercept=True, solver=lbfgs,random_state=0 :['Age', 'Fare', 'Pclass', 'Sex_cod', 'Titre_cod', 'group']
Same Best :0.92 de test <=> 0.8 de train, param = penalty=l2, fit_intercept=True, solver=newton-cg,random_state=0 :['Age', 'Fare', 'Pclass', 'Sex_cod', 'Titre_cod', 'group']
Same Best :0.92 de test <=> 0.8 de train, param = penalty=l2, fit_intercept=True, solver=lbfgs,random_state=0 :['Age', 'Fare', 'Pclass', 'Sex_cod', 'Titre_cod', 'group']
--------------------------------------------------------------------------------------------------------
New Best :0.94 de test <=> 0.81 de train, param = penalty=l2, fit_intercept=True, solver=liblinear,

In [20]:
# K-nearest neighbors (KNeighborsClassifier)
knn, knn_better_columns, knn_test_res, knn_better_score, knn_better_score_train = knn_found_better_config(X_train,X_test, y_train, y_test, verbose=1)
print('[1]K Nearest Neighbor Training Accuracy:', knn_better_score, "( train:", knn_better_score_train, ") with:", knn_better_columns)
#print(knn_test_res[knn_better_score])

--------------------------------------------------------------------------------------------------------
New Best :0.63 de test <=> 0.77 de train, KNN = 8 :['Pclass', 'Sex_cod', 'Titre_cod', 'Age', 'group', 'Fare']
--------------------------------------------------------------------------------------------------------
New Best :1.0 de test <=> 0.79 de train, KNN = 2 :['Pclass', 'Sex_cod']
Same Best :1.0 de test <=> 0.79 de train, KNN = 4 :['Embarked_cod', 'Pclass', 'Sex_cod']
Same Best :1.0 de test <=> 0.79 de train, KNN = 1 :['Sex_cod']
Same Best :1.0 de test <=> 0.79 de train, KNN = 1 :['Sex_cod', 'Titre_cod']
Same Best :1.0 de test <=> 0.79 de train, KNN = 1 :['Embarked_cod', 'Sex_cod']
Same Best :1.0 de test <=> 0.79 de train, KNN = 7 :['Sex_cod', 'deck_cod']
--------------------------------------------------------------------------------------------------------
KNN Score 1.0 de test <=> 0.79 de train, KNN= 2 avec les colonnes : ['Pclass', 'Sex_cod']
[1]K Nearest Neighbor Training 

In [21]:
# Decision tree
tree, tree_better_columns, tree_test_res, tree_better_score, tree_better_score_train = decisionTree_found_best(X_train, X_test, y_train, y_test, random_state=0, plot=False, verbose=True)
print('[2]Decision Tree Classifier Training Accuracy:', tree_better_score, "( train:", tree_better_score_train, ") with:", tree_better_columns)
#print(tree_test_res[tree_better_score])

--------------------------------------------------------------------------------------------------------
New Best :0.82 de test <=> 0.99 de train, param = criterion=gini, splitter=best,random_state=0 :['Age', 'Fare', 'Pclass', 'Sex_cod', 'Titre_cod', 'group']
--------------------------------------------------------------------------------------------------------
New Best :0.83 de test <=> 0.79 de train, param = criterion=gini, splitter=best,random_state=0 :['Pclass', 'Sex_cod']
Same Best :0.83 de test <=> 0.79 de train, param = criterion=gini, splitter=random,random_state=0 :['Pclass', 'Sex_cod']
Same Best :0.83 de test <=> 0.79 de train, param = criterion=entropy, splitter=best,random_state=0 :['Pclass', 'Sex_cod']
Same Best :0.83 de test <=> 0.79 de train, param = criterion=entropy, splitter=random,random_state=0 :['Pclass', 'Sex_cod']
--------------------------------------------------------------------------------------------------------
New Best :0.83 de test <=> 0.92 de train, par

In [22]:
# Random forest
forest, forest_better_columns, forest_test_res, forest_better_score, forest_better_score_train = randomForest_found_best(X_train, X_test, y_train, y_test, random_state=0, plot=False, verbose=True)
print('[3]Random Forest Classifier Training Accuracy:', forest_better_score, "( train:", forest_better_score_train, ") with:", forest_better_columns)
#print(forest_test_res[forest_better_score])

--------------------------------------------------------------------------------------------------------
New Best :0.44 de test <=> 0.92 de train, param = n_estimators=1, criterion=gini,random_state=0 :['Age', 'Fare', 'Pclass', 'Sex_cod', 'Titre_cod', 'group']
--------------------------------------------------------------------------------------------------------
New Best :0.45 de test <=> 0.92 de train, param = n_estimators=1, criterion=entropy,random_state=0 :['Age', 'Fare', 'Pclass', 'Sex_cod', 'Titre_cod', 'group']
--------------------------------------------------------------------------------------------------------
New Best :0.47 de test <=> 0.98 de train, param = n_estimators=11, criterion=entropy,random_state=0 :['Age', 'Fare', 'Pclass', 'Sex_cod', 'Titre_cod', 'group']
--------------------------------------------------------------------------------------------------------
New Best :0.48 de test <=> 0.99 de train, param = n_estimators=51, criterion=gini,random_state=0 :['Age',

In [26]:
print(log_better_score, "for test (", round(log_better_score_train, 2), "for train) => Logistic Regression, with:", log_better_columns)
print(knn_better_score, "for test (",  round(knn_better_score_train, 2), "for train) => K Nearest Neighbor, with:", knn_better_columns)
print(tree_better_score, "for test (",  round(tree_better_score_train, 2), "for train) => Decision Tree, with:", tree_better_columns)
print(forest_better_score, "for test (",  round(forest_better_score_train, 2), "for train) => Random Forest, with:", forest_better_columns)

1.0 for test ( 0.79 for train) => Logistic Regression, with: ['Pclass', 'Sex_cod', 'Titre_cod']
1.0 for test ( 0.79 for train) => K Nearest Neighbor, with: ['Pclass', 'Sex_cod']
1.0 for test ( 0.79 for train) => Decision Tree, with: ['Sex_cod']
1.0 for test ( 0.79 for train) => Random Forest, with: ['Pclass', 'Sex_cod']


In [24]:
svc_lin, svc_rbf, gauss = best_other_model(X_train,y_train, X_test, y_test, verbose=False)
svc_lin, svc_rbf, gauss

[X]Support Vector Machine (Linear Classifier) Training Accuracy: 1.0 ( train: 0.7867564534231201 )
[X]Support Vector Machine (RBF Classifier) Training Accuracy: 0.645933014354067 ( train: 0.6902356902356902 )
[X]Gaussian Naive Bayes Training Accuracy: 0.8086124401913876 ( train: 0.7867564534231201 )


(SVC(kernel='linear', random_state=0), SVC(random_state=0), GaussianNB())

In [22]:
from datetime import datetime
from joblib import dump, load

# better_knn_model = None
# if model_knn_save_exist:
#     better_knn_model = load(file_path+ model_knn_save_path)
#     score_test = better_knn_model.score(X_test[['Pclass', 'Sex_cod', 'Titre_cod', 'Embarked_cod', 'deck_cod']],y_test)
#     print(better_knn_model, round(score_test, 2))
# else:
better_knn_model, better_columns, test_res, better_score = knn_found_better_config(X_train, X_test, y_train, y_test, knn_min=1, knn_max=10, plot=False)
# Save the best model
now = datetime.now() # current date and time
date_time = now.strftime("%Y-%m-%d-%H_%M_%S")
# Note: if the model changes, update the matching column list in the first `if` above
dump(better_knn_model, file_path+'Titanic_model_saved_score_'+str(round(better_score,2))+'_' + date_time + '.joblib')

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

In [169]:
titres_dic = {"Mr" : 7, "Miss" : 6, "Mrs" : 8, "Master" : 5, "Dr" : 2, "Rev" : 6}
col = ['Pclass', 'sex_cod', 'title_cod', 'embarked_cod', 'deck_cod']

In [170]:
print("Miss")
print(f"{col}, [proba]")
knn_predire_survie(better_knn_model, pclass=3, sex=1, age=randint(0, 80), title=titres_dic["Miss"])
knn_predire_survie(better_knn_model, pclass=2, sex=1, age=randint(0, 80), title=titres_dic["Miss"])
knn_predire_survie(better_knn_model, pclass=1, sex=1, age=randint(0, 80), title=titres_dic["Miss"])
print("Mrs")
print(f"{col}, [proba]")
knn_predire_survie(better_knn_model, pclass=3, sex=1, age=randint(15, 80), title=titres_dic["Mrs"])
knn_predire_survie(better_knn_model, pclass=2, sex=1, age=randint(15, 80), title=titres_dic["Mrs"])
knn_predire_survie(better_knn_model, pclass=1, sex=1, age=randint(15, 80), title=titres_dic["Mrs"])
print("Mr")
print(f"{col}, [proba]")
knn_predire_survie(better_knn_model, pclass=3, sex=0, age=randint(0, 80), title=titres_dic["Mr"])
knn_predire_survie(better_knn_model, pclass=2, sex=0, age=randint(0, 80), title=titres_dic["Mr"])
knn_predire_survie(better_knn_model, pclass=1, sex=0, age=randint(0, 80), title=titres_dic["Mr"])
print("Master")
print(f"{col}, [proba]")
knn_predire_survie(better_knn_model, pclass=3, sex=0, age=randint(20, 80), title=titres_dic["Master"])
knn_predire_survie(better_knn_model, pclass=2, sex=0, age=randint(20, 80), title=titres_dic["Master"])
knn_predire_survie(better_knn_model, pclass=1, sex=0, age=randint(20, 80), title=titres_dic["Master"])

Miss
['Pclass', 'sex_cod', 'title_cod', 'embarked_cod', 'deck_cod'], [proba]
[3, 1, 6, 1, 3] = [0] [[0.71428571 0.28571429]]
[2, 1, 6, 0, 6] = [0] [[0.85714286 0.14285714]]
[1, 1, 6, 0, 1] = [1] [[0.14285714 0.85714286]]
Mrs
['Pclass', 'sex_cod', 'title_cod', 'embarked_cod', 'deck_cod'], [proba]
[3, 1, 8, 1, 6] = [0] [[1. 0.]]
[2, 1, 8, 0, 5] = [0] [[0.57142857 0.42857143]]
[1, 1, 8, -1, 3] = [1] [[0.14285714 0.85714286]]
Mr
['Pclass', 'sex_cod', 'title_cod', 'embarked_cod', 'deck_cod'], [proba]
[3, 0, 7, 2, 1] = [0] [[1. 0.]]
[2, 0, 7, 1, 2] = [0] [[0.85714286 0.14285714]]
[1, 0, 7, -1, 1] = [1] [[0. 1.]]
Master
['Pclass', 'sex_cod', 'title_cod', 'embarked_cod', 'deck_cod'], [proba]
[3, 0, 5, 2, 3] = [1] [[0.42857143 0.57142857]]
[2, 0, 5, 1, 1] = [1] [[0. 1.]]
[1, 0, 5, 1, 2] = [1] [[0.28571429 0.71428571]]


(array([1], dtype=int64), array([[0.28571429, 0.71428571]]))

Best configurations and best score for this model (logistic regression):
* test 0.84 vs. train 0.79, param = penalty=none, fit_intercept=True, solver=saga,random_state=0, columns: ['Age', 'Pclass', 'embarked_cod', 'sex_cod']
* test 0.84 vs. train 0.79, param = penalty=l2, fit_intercept=True, solver=saga,random_state=0, columns: ['Age', 'Pclass', 'embarked_cod', 'sex_cod']
* test 0.84 vs. train 0.79, param = penalty=l1, fit_intercept=True, solver=saga,random_state=0, columns: ['Age', 'Pclass', 'embarked_cod', 'sex_cod']

## 5.3. RandomForestClassifier

* RandomForestClassifier
I tried the random forest on your file and got a score of 0.85; adding a title column (e.g. 'Countess', 'Major', ...) brings it to 0.86.
Yes, for categorical variables such as the deck, you can use `from sklearn.preprocessing import OrdinalEncoder` so that the machine learning algorithm can use them.
* KNeighborsClassifier
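The two suggestions above (a title column and `OrdinalEncoder` for the deck) could look like the sketch below. The names follow the format of the Kaggle `Name` column, but the deck letters are invented for illustration, and the regular expression and the deck order A to G are assumptions rather than the notebook's actual preprocessing.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Illustrative rows: the deck letters are made up for this example.
df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Peuchen, Major. Arthur Godfrey",
        "Palsson, Master. Gosta Leonard",
    ],
    "Deck": ["G", "C", "G", "B"],
})

# The title sits between the comma and the first period of the name.
df["Titre"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# OrdinalEncoder works on 2D feature data and accepts an explicit category order.
deck_encoder = OrdinalEncoder(categories=[list("ABCDEFG")])
df["deck_cod"] = deck_encoder.fit_transform(df[["Deck"]]).ravel().astype(int)

print(df[["Titre", "Deck", "deck_cod"]])
```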

In [177]:
from sklearn.ensemble import RandomForestClassifier 

columns_logistic = ['Pclass', 'sex_cod', 'title_cod', 'embarked_cod', 'deck_cod']

model_forest = RandomForestClassifier(n_estimators=100,random_state=random_state)
model_forest.fit(X_train[columns_logistic],y_train)
pd.Series(model_forest.feature_importances_,index=columns_logistic).sort_values(ascending=False)

sex_cod         0.317152
deck_cod        0.226847
title_cod       0.206376
Pclass          0.159743
embarked_cod    0.089883
dtype: float64

In [178]:
model_forest.score(X_test[columns_logistic], y_test)

0.8770949720670391

## 5.4. Decision tree

In [179]:
from sklearn.tree import DecisionTreeClassifier

columns_decision_tree_started = ['Pclass', 'sex_cod', 'title_cod', 'Age', 'family_on_board', 'Fare', 'embarked_cod', 'deck_cod']

model_decision_tree = DecisionTreeClassifier()
model_decision_tree.fit(X_train[columns_decision_tree_started], y_train)
print(model_decision_tree.score(X_test[columns_decision_tree_started], y_test))
prediction = model_decision_tree.predict(X_test[columns_decision_tree_started])
# print('The accuracy of the Decision Tree is', accuracy_score(y_test, prediction))

0.7988826815642458


A question to think about: why OrdinalEncoder() rather than LabelEncoder()?
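One possible answer, sketched below on invented deck letters: `LabelEncoder` is designed for a 1D target, orders classes alphabetically and rejects unseen values, whereas `OrdinalEncoder` takes 2D feature columns, accepts an explicit category order and can map unseen values to a chosen code. The order `G, C, A` and the code `-1` are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

decks = np.array(["C", "G", "A", "G"])  # invented values

# LabelEncoder: 1D input, classes sorted alphabetically (A=0, C=1, G=2).
label_enc = LabelEncoder()
label_codes = label_enc.fit_transform(decks)

# It raises on a value it has not seen during fit.
try:
    label_enc.transform(["T"])
    label_rejects_unseen = False
except ValueError:
    label_rejects_unseen = True

# OrdinalEncoder: 2D input, explicit order, and a fallback code for unseen values.
ordinal_enc = OrdinalEncoder(
    categories=[["G", "C", "A"]],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
ordinal_codes = ordinal_enc.fit_transform(decks.reshape(-1, 1))
unseen_code = ordinal_enc.transform(np.array([["T"]]))

print(label_codes, ordinal_codes.ravel(), unseen_code.ravel(), label_rejects_unseen)
```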

In [180]:
juste to fail

SyntaxError: invalid syntax (Temp/ipykernel_25944/3596157890.py, line 1)

In [None]:
# Commented out so that this only runs when the best model has not been saved yet.
# better_model, better_columns, test_res = knn_found_better_config(X_train, X_test, y_train, y_test, knn_min=1, knn_max=10, plot=False)

--------------------------------------------------------------------------------------------------------
New Best :0.75 de test <=> 0.83 de train, KNN = 3 :['Pclass', 'sex_cod', 'title_cod', 'Age', 'family_on_board', 'Fare']
--------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------
New Best :0.83 de test <=> 0.86 de train, KNN = 3 :['Pclass', 'sex_cod', 'title_cod', 'Age', 'family_on_board']
--------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------
New Best :0.84 de test <=> 0.81 de train, KNN = 7 :['Pclass', 'sex_cod', 'title_cod', 'Age']
--------------------------------------------------------------------------------------------------------
---------------------------------------------

In [None]:
df_clean.columns

Index(['Survived', 'Pclass', 'sex_cod', 'Titre', 'title_cod', 'Age',
       'family_on_board', 'Fare', 'embarked_cod', 'Deck', 'deck_cod',
       'Last_name', 'First_name'],
      dtype='object')

In [None]:
pd.crosstab(df_clean['Deck'], df_clean['Pclass'])

Pclass   1   2    3
Deck
A       19  18   25
B       54   3   74
C       75   5   15
D       29  16   11
E       27  20   33
F        0  41   36
G       12  81  297
