Florent Giauna (AMSD) and Zewei Lin (MLSD)

Supervised learning for data with imbalanced classes

Session 2 - Churn prediction, Part I

In [46]:
# Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Preprocessing
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import RobustScaler

# Model selection
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import GridSearchCV

# Models
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.svm import SVC

# Metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import average_precision_score

# Dataset: Credit fraud

In [47]:
# Load the data
df = pd.read_csv('data/creditcard_v2.csv')

# Split the dataset into training and test sets
X = df.drop('Class', axis=1)
y = df['Class']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=7, shuffle=True, stratify=y)

In [48]:
# List of variables
var_list = list(X_train)

# Rescale the quantitative variables (the scaler is fitted on the training set only)
scaler = RobustScaler()

for var in var_list:
    X_train[[var]] = scaler.fit_transform(X_train[[var]])
    X_test[[var]] = scaler.transform(X_test[[var]])

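Because the variables contain many outliers, RobustScaler is more appropriate than StandardScaler or MinMaxScaler: it centers each variable on its median and scales it by its interquartile range, two statistics that extreme values barely affect.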
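To make the effect of outliers concrete, the sketch below rescales a small made-up sample with a median/IQR rule (the principle behind RobustScaler) and with a mean/standard-deviation rule (the principle behind StandardScaler). The sample values and the helper names `robust_scale` and `standard_scale` are illustrative assumptions, not part of the notebook, and the quartile convention may differ slightly from the one scikit-learn uses.

```python
import statistics

def robust_scale(values):
    """Center on the median and divide by the interquartile range (IQR)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]

def standard_scale(values):
    """Center on the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Made-up amounts: five ordinary values and one extreme outlier
amounts = [10.0, 12.0, 11.0, 13.0, 12.0, 5000.0]

robust = robust_scale(amounts)
standard = standard_scale(amounts)

# Spread of the five ordinary values after each rescaling
robust_spread = max(robust[:5]) - min(robust[:5])
standard_spread = max(standard[:5]) - min(standard[:5])
print(f"robust spread:   {robust_spread:.3f}")
print(f"standard spread: {standard_spread:.3f}")
```

With the robust rule the ordinary values keep a usable spread, whereas the mean and standard deviation are dominated by the outlier and compress the ordinary values into a very narrow range.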

## (a) For each approach, with the default hyperparameters, evaluate churn prediction based on the AUC (Area Under the Curve)

The ROC (receiver operating characteristic) curve measures the performance of a classifier by plotting, for every decision threshold, the true positive rate (recall, or sensitivity: the proportion of actual positives that are correctly predicted) against the false positive rate (1 - specificity, where specificity is the proportion of actual negatives that are correctly predicted). The best algorithm maximizes the area under the ROC curve: the AUC.

However, using the ROC curve and its AUC on an imbalanced dataset is problematic (cf. Saito, T., & Rehmsmeier, M. (2015). The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PloS one, 10(3), e0118432. https://doi.org/10.1371/journal.pone.0118432). The precision-recall (PR) curve and the PR AUC are better suited. The PR AUC is estimated here by the average precision: the weighted mean of the precisions obtained at each threshold, where each weight is the increase in recall from the previous threshold.

In our case, the PR AUC (average precision) is therefore the better metric for now (before trying upsampling and downsampling techniques).
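As an illustration of what the average precision measures, here is a minimal from-scratch sketch of the sum AP = Σ (R_n - R_{n-1}) · P_n over the ranked predictions. The function name `average_precision` and the toy labels and scores are assumptions made for the example, and the sketch assumes there are no tied scores; in the notebook itself the metric is computed by scikit-learn through `scoring='average_precision'`.

```python
def average_precision(y_true, scores):
    """Average precision: sum of precision at each rank, weighted by the recall gained."""
    n_pos = sum(y_true)
    # Rank the examples from the highest score to the lowest
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    tp = 0
    ap = 0.0
    prev_recall = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        tp += label
        precision = tp / rank
        recall = tp / n_pos
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap

# Toy example: two positives among four examples
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
ap = average_precision(y_true, scores)
print(round(ap, 4))
```

Only the ranks at which a positive is retrieved contribute to the sum, which is why the metric focuses on the minority (positive) class and ignores the large number of true negatives.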

In [5]:
# Algorithms to compare
clfs = {'DecisionTree': DecisionTreeClassifier(random_state=7),
        'LogisticRegression': LogisticRegression(random_state=7),
        'LinearSVC' : LinearSVC(random_state=7)}

# Evaluate each algorithm with the average precision (PR AUC), using 5-fold cross-validation
for key, clf in clfs.items():
    clf_results = cross_validate(clf, X_train, y_train, scoring='average_precision', cv=5)
    print(key,
          "\nAverage precision (PR AUC):", clf_results['test_score'].mean(),
          "\n")

DecisionTree
Average precision (PR AUC): 0.5602304654507083

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

LogisticRegression
Average precision (PR AUC): 0.751853214438474

LinearSVC
Average precision (PR AUC): 0.7425441922405724

Because the dataset is large, LinearSVC is the only practical SVM implementation here. The scikit-learn documentation notes that for sklearn.svm.SVC() "The fit time scales at least quadratically with the number of samples and may be impractical beyond tens of thousands of samples". https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

The results obtained with the default settings by logistic regression and the SVM are encouraging. Tuning the hyperparameters may improve them.

## (b) For each approach, define a well-performing model by searching for good hyperparameters with a grid search

Setting the 'class_weight' hyperparameter to 'balanced' weights each class inversely to its frequency, which mitigates the imbalance by giving more weight to correctly classifying the minority class.
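The scikit-learn documentation describes the 'balanced' mode as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula in plain Python. The helper name `balanced_class_weights` is an assumption, and the label counts are the supports reported for the credit fraud test set in this notebook, used here only as an illustration (in the grid search the weights are derived from the training labels).

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights following n_samples / (n_classes * count(class))."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Illustrative label vector: 56,864 negatives and 98 positives
labels = [0] * 56864 + [1] * 98
weights = balanced_class_weights(labels)
print(weights)
```

With such an imbalance, an error on a positive example costs several hundred times more than an error on a negative one, which pushes the model toward high recall on the minority class.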

In [51]:
# Instantiate the decision tree model
clf_tree = DecisionTreeClassifier()

# Build the grid of parameters to test
param_grid = {'criterion': ['gini', 'entropy'],
              'max_depth': [1,5,7,9,15,25,30],
              'class_weight': ['balanced'],
              'random_state': [7]}

grid = GridSearchCV(estimator=clf_tree,
                    param_grid=param_grid,
                    scoring='average_precision',
                    return_train_score=True,
                    cv=5)

grid.fit(X_train, y_train)

print("Best configuration:", grid.best_params_,
      "\nAverage precision (PR AUC):", grid.best_score_)

Best configuration: {'class_weight': 'balanced', 'criterion': 'entropy', 'max_depth': 7, 'random_state': 7}
Average precision (PR AUC): 0.7183737945447246


In [54]:
# Instantiate the logistic regression model
clf_logreg = LogisticRegression()

# Build the grid of parameters to test
# Note: not every (penalty, solver) pair is supported (for example, 'lbfgs' and 'sag'
# only handle the 'l2' penalty); unsupported combinations fail to fit and are scored as NaN
param_grid = {'penalty':['l2', 'l1', 'elasticnet'],
              'C':[0.01, 0.1, 1],
              'solver': ['lbfgs', 'sag', 'saga'],
              'class_weight': ['balanced'],
              'random_state': [7]}

grid = GridSearchCV(estimator=clf_logreg,
                    param_grid=param_grid,
                    scoring='average_precision',
                    return_train_score=True,
                    cv=5)

grid.fit(X_train, y_train)

print("Best configuration:", grid.best_params_,
      "\nAverage precision (PR AUC):", grid.best_score_)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Best configuration: {'C': 0.1, 'class_weight': 'balanced', 'penalty': 'l2', 'random_state': 7, 'solver': 'lbfgs'}
Average precision (PR AUC): 0.7339314988792072
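One possible way to avoid failed fits in a grid like the one above is to pass GridSearchCV a list of dictionaries, each restricted to solver/penalty pairs that are compatible. The sketch below only builds and counts such a grid. The `max_iter` value and the `l1_ratio` value are assumptions introduced for the example (they are not in the notebook), and the parameter names follow the scikit-learn API the notebook appears to use.

```python
from sklearn.model_selection import ParameterGrid

# Settings shared by every sub-grid
common = {'C': [0.01, 0.1, 1],
          'class_weight': ['balanced'],
          'random_state': [7],
          'max_iter': [1000]}  # assumption: larger iteration budget than the default of 100

param_grid = [
    # 'l2' is supported by the three solvers used in the notebook
    {**common, 'penalty': ['l2'], 'solver': ['lbfgs', 'sag', 'saga']},
    # among these solvers, only 'saga' supports 'l1'
    {**common, 'penalty': ['l1'], 'solver': ['saga']},
    # 'elasticnet' needs 'saga' and an explicit l1_ratio
    {**common, 'penalty': ['elasticnet'], 'solver': ['saga'],
     'l1_ratio': [0.5]},  # assumption: illustrative mixing ratio
]

n_candidates = len(ParameterGrid(param_grid))
print(n_candidates)
```

The same list can be passed as `param_grid` to GridSearchCV. Raising `max_iter` is also the remedy suggested by the convergence message printed above.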


In [56]:
# Instantiate the linear SVM (LinearSVC, no kernel)
clf_svc = LinearSVC()

# Build the grid of parameters to test
# Note: several penalty/loss combinations in this grid are not supported by LinearSVC;
# their fits fail and are scored as NaN (see the warning below)
param_grid = {'penalty': ['l1', 'l2', 'elasticnet'],
              'loss': ['hinge', 'squared_hinge'],
              'C': [0.1, 1, 10],
              'class_weight': ['balanced'],
              'random_state': [7]}

grid = GridSearchCV(estimator=clf_svc,
                    param_grid=param_grid,
                    scoring='average_precision',
                    return_train_score=True,
                    cv=5)

grid.fit(X_train, y_train)

print("Best configuration:", grid.best_params_,
      "\nAverage precision (PR AUC):", grid.best_score_)

60 fits failed out of a total of 90.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
15 fits failed with the following error:
Traceback (most recent call last):
  File "/home/utilisateur/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/utilisateur/anaconda3/lib/python3.9/site-packages/sklearn/svm/_classes.py", line 274, in fit
    self.coef_, self.intercept_, n_iter_ = _fit_liblinear(
  File "/home/utilisateur/anaconda3/lib/python3.9/site-packages/sklearn/svm/_base.py", line 1223, in _fit_liblinear
    solver_type = _get_liblinear_solver_type(multi_class, penalty, loss, dual)
  File "/home/utilisateur/anaconda3/lib/py

Best configuration: {'C': 0.1, 'class_weight': 'balanced', 'loss': 'squared_hinge', 'penalty': 'l2', 'random_state': 7}
Average precision (PR AUC): 0.7157001813949233
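The "60 fits failed out of a total of 90" message can be accounted for by enumerating the grid. The sketch below assumes that LinearSVC was run with `dual=True` (its default in older scikit-learn releases) and encodes the penalty/loss/dual combinations accepted by its liblinear backend. The helper `is_supported` is hypothetical and written only for this count.

```python
from itertools import product

penalties = ['l1', 'l2', 'elasticnet']
losses = ['hinge', 'squared_hinge']
Cs = [0.1, 1, 10]
n_folds = 5

def is_supported(penalty, loss, dual=True):
    """Penalty/loss/dual combinations accepted by the liblinear backend of LinearSVC."""
    supported = {
        ('l2', 'hinge', True),
        ('l2', 'squared_hinge', True),
        ('l2', 'squared_hinge', False),
        ('l1', 'squared_hinge', False),
    }
    return (penalty, loss, dual) in supported

candidates = list(product(penalties, losses, Cs))
failing = [c for c in candidates if not is_supported(c[0], c[1])]

print(len(candidates) * n_folds)  # total number of fits: 90
print(len(failing) * n_folds)     # number of failed fits: 60
```

Under that assumption only the six 'l2' candidates are actually evaluated, so the search effectively compares two losses and three values of C.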




In [58]:
# Instantiate the best model
clf = LogisticRegression(C=0.1, class_weight='balanced', penalty='l2', solver='lbfgs', random_state=7)

clf.fit(X_train, y_train)

# Classify the test set
# Note: cross_val_predict refits clones of clf on folds of the test set,
# so the fit on X_train above is not what produces these predictions
pred = cross_val_predict(clf, X_test, y_test, cv=5)

# Results (average precision cross-validated on the test set, same caveat)
clf_results = cross_validate(clf, X_test, y_test, scoring='average_precision', cv=5)
print("\nAverage precision (PR AUC):", clf_results['test_score'].mean())

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Average precision (PR AUC): 0.7314798136819697
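The cell above cross-validates on the test set, which means the scores come from models refitted on test folds rather than from the model trained on X_train. A more conventional hold-out evaluation, sketched below under stated assumptions, fits once on the training set and scores the untouched test set. The synthetic data from `make_classification` and the `max_iter=1000` setting are assumptions standing in for the real dataset and configuration; the numbers it prints are not comparable to the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the real dataset (assumption)
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.95, 0.05],
                           random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=7, stratify=y)

clf = LogisticRegression(C=0.1, class_weight='balanced', max_iter=1000, random_state=7)
clf.fit(X_train, y_train)  # fitted once, on the training set only

# Hard predictions and continuous scores from the model trained on X_train
pred = clf.predict(X_test)
scores = clf.decision_function(X_test)

ap = average_precision_score(y_test, scores)
cm = confusion_matrix(y_test, pred)
print("Average precision (PR AUC):", ap)
print(cm)
```

Average precision needs continuous scores (here `decision_function`), not the hard 0/1 predictions, because it is computed over all decision thresholds.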


In [59]:
# Confusion matrix
print(confusion_matrix(y_test, pred))

[[55373  1491]
 [   11    87]]


In [61]:
# Classification report
print(classification_report(y_test, pred, target_names=["0", "1"]))

              precision    recall  f1-score   support

           0       1.00      0.97      0.99     56864
           1       0.06      0.89      0.10        98

    accuracy                           0.97     56962
   macro avg       0.53      0.93      0.55     56962
weighted avg       1.00      0.97      0.99     56962



The F1 score is the harmonic mean of precision (the proportion of true positives among all predicted positives) and recall (the proportion of true positives among all actual positives). The macro average should be considered to get an idea of the classifier's performance on both classes.

The chosen hyperparameters pushed the model to find examples of the minority class (recall of 0.89), but at the cost of a very large number of false positives: 1,491 negatives are predicted positive against only 87 true positives, so precision on the minority class is only 6%.
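The figures of the classification report can be recomputed by hand from the confusion matrix printed above (rows are actual classes, columns are predicted classes). The short sketch below does so; the variable names are chosen for the example.

```python
# Confusion matrix printed above: rows = actual class, columns = predicted class
tn, fp = 55373, 1491
fn, tp = 11, 87

# Minority class (class 1)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Same quantities for the majority class, treating class 0 as the "positive" class
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

# Macro average: unweighted mean of the per-class F1 scores
macro_f1 = (f1 + f1_0) / 2
print(f"{precision:.2f}", f"{recall:.2f}", f"{f1:.2f}", f"{macro_f1:.2f}")
```

The low precision comes entirely from the 1,491 false positives in the denominator `tp + fp`; the 11 false negatives only affect recall.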

# Dataset: Bank marketing

In [62]:
# Load the data
df = pd.read_csv('data/bank-additional-full_v2.csv')

In [63]:
# Split the dataset into training and test sets
X = df.drop('y', axis=1)
y = df['y']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=7, shuffle=True, stratify=y)

In [64]:
# List of nominal categorical variables
var_nom = list(X_train.select_dtypes(['object']).columns)
var_nom += ['age', 'duration', 'pdays']

# List of quantitative variables
var_quant = ['campaign', 'cons.conf.idx', 'cons.price.idx', 'emp.var.rate', 
             'euribor3m', 'nr.employed', 'previous']

In [65]:
# Encode the nominal variables (one-hot encoding)
ohe = OneHotEncoder(sparse_output=False).set_output(transform="pandas")

for var in var_nom:
    ohe_train = ohe.fit_transform(X_train[[var]])
    X_train = pd.concat([X_train, ohe_train],axis=1).drop(columns=[var])
    ohe_test = ohe.transform(X_test[[var]])
    X_test = pd.concat([X_test, ohe_test],axis=1).drop(columns=[var])

In [66]:
# Rescale the quantitative variables
scaler = RobustScaler()

for var in var_quant:
    X_train[[var]] = scaler.fit_transform(X_train[[var]])
    X_test[[var]] = scaler.transform(X_test[[var]])

Because the variables are not normally distributed and the dataset contains many outliers, RobustScaler is more appropriate than StandardScaler or MinMaxScaler.
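The per-column encoding and scaling loops above can also be expressed as a single preprocessing object. The sketch below uses scikit-learn's ColumnTransformer on toy data. The column `job` and all values are assumptions made for the example, and `handle_unknown='ignore'` is a design choice that the notebook does not make: it lets the encoder handle categories that appear only in the test set.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Toy frames with hypothetical values; 'campaign' and 'previous' are notebook columns
X_train = pd.DataFrame({'job': ['admin.', 'technician', 'admin.', 'services'],
                        'campaign': [1, 2, 1, 40],
                        'previous': [0, 1, 0, 3]})
X_test = pd.DataFrame({'job': ['student', 'admin.'],
                       'campaign': [3, 1],
                       'previous': [0, 2]})

preprocess = ColumnTransformer([
    # handle_unknown='ignore' encodes categories unseen during fit as all zeros
    ('nominal', OneHotEncoder(handle_unknown='ignore'), ['job']),
    ('quant', RobustScaler(), ['campaign', 'previous']),
])

Xt_train = preprocess.fit_transform(X_train)  # statistics learned on the training set
Xt_test = preprocess.transform(X_test)        # reused unchanged on the test set
print(Xt_train.shape, Xt_test.shape)
```

Such a transformer can be placed in a Pipeline with the classifier, so that cross-validation refits the preprocessing inside each fold.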

## (a) For each approach, with the default hyperparameters, evaluate churn prediction based on the AUC (Area Under the Curve)

As with the credit fraud dataset, the classes are imbalanced, so the PR AUC (average precision) is preferred to the ROC AUC for now (before trying upsampling and downsampling techniques).

In [67]:
# Algorithms to compare
clfs = {'DecisionTree': DecisionTreeClassifier(random_state=7),
        'LogisticRegression': LogisticRegression(random_state=7),
        'LinearSVC' : LinearSVC(random_state=7)}

# Evaluate each algorithm with the average precision (PR AUC), using 5-fold cross-validation
for key, clf in clfs.items():
    clf_results = cross_validate(clf, X_train, y_train, scoring='average_precision', cv=5)
    print(key,
          "\nAverage precision (PR AUC):", clf_results['test_score'].mean(),
          "\n")

DecisionTree
Average precision (PR AUC): 0.2501548881235589

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

LogisticRegression
Average precision (PR AUC): 0.5609490485292269

LinearSVC
Average precision (PR AUC): 0.5689709935233397

Because the dataset is large, LinearSVC is the only practical SVM implementation here. The scikit-learn documentation notes that for sklearn.svm.SVC() "The fit time scales at least quadratically with the number of samples and may be impractical beyond tens of thousands of samples". https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

The results obtained with the default settings by logistic regression and the SVM are encouraging. Tuning the hyperparameters may improve them.

## (b) For each approach, define a well-performing model by searching for good hyperparameters with a grid search

Setting the 'class_weight' hyperparameter to 'balanced' weights each class inversely to its frequency, which mitigates the imbalance by giving more weight to correctly classifying the minority class.

In [71]:
# Instantiate the decision tree model
clf_tree = DecisionTreeClassifier()

# Build the grid of parameters to test
param_grid = {'criterion': ['gini', 'entropy'],
              'max_depth': [1,5,7,9,15,25,50,70],
              'class_weight': ['balanced'],
              'random_state': [7]}

grid = GridSearchCV(estimator=clf_tree,
                    param_grid=param_grid,
                    scoring='average_precision',
                    return_train_score=True,
                    cv=5)

grid.fit(X_train, y_train)

print("Best configuration:", grid.best_params_,
      "\nAverage precision (PR AUC):", grid.best_score_)

Best configuration: {'class_weight': 'balanced', 'criterion': 'entropy', 'max_depth': 7, 'random_state': 7}
Average precision (PR AUC): 0.5486338066206959


In [79]:
# Instantiate the logistic regression model
clf_logreg = LogisticRegression()

# Build the grid of parameters to test
# Note: not every (penalty, solver) pair is supported (for example, 'lbfgs' and 'sag'
# only handle the 'l2' penalty); unsupported combinations fail to fit and are scored as NaN
param_grid = {'penalty':['l2', 'l1', 'elasticnet'],
              'C':[1e5, 1e6, 1e7],
              'solver': ['lbfgs', 'sag', 'saga'],
              'class_weight': ['balanced'],
              'random_state': [7]}

grid = GridSearchCV(estimator=clf_logreg,
                    param_grid=param_grid,
                    scoring='average_precision',
                    return_train_score=True,
                    cv=5)

grid.fit(X_train, y_train)

print("Best configuration:", grid.best_params_,
      "\nAverage precision (PR AUC):", grid.best_score_)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Best configuration: {'C': 1000000.0, 'class_weight': 'balanced', 'penalty': 'l2', 'random_state': 7, 'solver': 'lbfgs'}
Average precision (PR AUC): 0.5509924963917635


In [78]:
# Instantiate the linear SVM (LinearSVC, no kernel)
clf_svc = LinearSVC()

# Build the grid of parameters to test
# Note: several penalty/loss combinations in this grid are not supported by LinearSVC;
# their fits fail and are scored as NaN (see the warning below)
param_grid = {'penalty': ['l1', 'l2', 'elasticnet'],
              'loss': ['hinge', 'squared_hinge'],
              'C': [0.1, 1, 10],
              'class_weight': ['balanced'],
              'random_state': [7]}

grid = GridSearchCV(estimator=clf_svc,
                    param_grid=param_grid,
                    scoring='average_precision',
                    return_train_score=True,
                    cv=5)

grid.fit(X_train, y_train)

print("Best configuration:", grid.best_params_,
      "\nAverage precision (PR AUC):", grid.best_score_)

60 fits failed out of a total of 90.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
15 fits failed with the following error:
Traceback (most recent call last):
  File "/home/utilisateur/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/utilisateur/anaconda3/lib/python3.9/site-packages/sklearn/svm/_classes.py", line 274, in fit
    self.coef_, self.intercept_, n_iter_ = _fit_liblinear(
  File "/home/utilisateur/anaconda3/lib/python3.9/site-packages/sklearn/svm/_base.py", line 1223, in _fit_liblinear
    solver_type = _get_liblinear_solver_type(multi_class, penalty, loss, dual)
  File "/home/utilisateur/anaconda3/lib/py

Best configuration: {'C': 1, 'class_weight': 'balanced', 'loss': 'squared_hinge', 'penalty': 'l2', 'random_state': 7}
Average precision (PR AUC): 0.545748530638497




In [81]:
# Instantiate the best model
clf = LogisticRegression(C=1e6, class_weight='balanced', penalty='l2', solver='lbfgs', random_state=7)
clf.fit(X_train, y_train)

# Classify the test set
# Note: cross_val_predict refits clones of clf on folds of the test set,
# so the fit on X_train above is not what produces these predictions
pred = cross_val_predict(clf, X_test, y_test, cv=5)

# Results (average precision cross-validated on the test set, same caveat)
clf_results = cross_validate(clf, X_test, y_test, scoring='average_precision', cv=5)
print("\nAverage precision (PR AUC):", clf_results['test_score'].mean())

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Average precision (PR AUC): 0.5517747972278378

In [82]:
# Confusion matrix
print(confusion_matrix(y_test, pred))

[[5745 1565]
 [ 116  812]]


In [83]:
# Classification report
print(classification_report(y_test, pred, target_names=["0", "1"]))

              precision    recall  f1-score   support

           0       0.98      0.79      0.87      7310
           1       0.34      0.88      0.49       928

    accuracy                           0.80      8238
   macro avg       0.66      0.83      0.68      8238
weighted avg       0.91      0.80      0.83      8238



The F1 score is the harmonic mean of precision (the proportion of true positives among all predicted positives) and recall (the proportion of true positives among all actual positives). The macro average should be considered to get an idea of the classifier's performance on both classes.

The chosen hyperparameters pushed the model to find examples of the minority class (recall of 0.88), but at the cost of a very large number of false positives: 1,565 negatives are predicted positive against 812 true positives, so precision on the minority class is only 34%.
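When comparing average precision across datasets, it helps to keep in mind that a classifier ranking examples at random has an expected average precision close to the proportion of positives, so the baseline differs from one dataset to another. The sketch below computes that proportion from the class supports shown in the two classification reports above; the dictionary layout is an assumption made for the example.

```python
# Class supports reported in the two classification reports above
test_sets = {
    'Credit fraud': {'negatives': 56864, 'positives': 98},
    'Bank marketing': {'negatives': 7310, 'positives': 928},
}

baselines = {}
for name, counts in test_sets.items():
    total = counts['negatives'] + counts['positives']
    # Proportion of positives: approximate PR AUC of a random ranking
    baselines[name] = counts['positives'] / total
    print(f"{name}: positive rate = {baselines[name]:.4f}")
```

Read this way, an average precision of about 0.73 on the credit fraud data is a much larger gain over its baseline than about 0.55 is on the bank marketing data.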

# Dataset: Employee attrition

In [84]:
# Load the data
df = pd.read_csv('data/whole_data_v2.csv')

In [85]:
# List of quantitative variables
var_quant = ['Age', 'DistanceFromHome', 'MonthlyIncome', 'NumCompaniesWorked', 
            'PercentSalaryHike', 'TotalWorkingYears', 'TrainingTimesLastYear', 
            'YearsAtCompany', 'YearsSinceLastPromotion', 'YearsWithCurrManager']

# List of categorical variables
var_cat = ['JobInvolvement', 'PerformanceRating', 'EnvironmentSatisfaction', 'JobSatisfaction',
           'WorkLifeBalance', 'BusinessTravel', 'Department', 'Education', 
           'EducationField', 'Gender', 'MaritalStatus', 'JobLevel', 'JobRole', 'StockOptionLevel']

# List of ordinal categorical variables
var_ord = ['JobInvolvement', 'PerformanceRating', 'EnvironmentSatisfaction', 'JobSatisfaction',
           'WorkLifeBalance', 'BusinessTravel', 'JobLevel', 'StockOptionLevel']

# List of nominal categorical variables
var_nom = ['Department', 'Education', 'EducationField', 'Gender', 'JobRole', 'MaritalStatus']

In [86]:
# Split the dataset into training and test sets
X = df.drop('Attrition', axis=1)
y = df['Attrition']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=7, shuffle=True, stratify=y)

In [87]:
# Encode the nominal variables (one-hot encoding)
ohe = OneHotEncoder(sparse_output=False).set_output(transform="pandas")

for var in var_nom:
    ohe_train = ohe.fit_transform(X_train[[var]])
    X_train = pd.concat([X_train, ohe_train],axis=1).drop(columns=[var])
    ohe_test = ohe.transform(X_test[[var]])
    X_test = pd.concat([X_test, ohe_test],axis=1).drop(columns=[var])

# Encode the ordinal variables (values outside the listed categories are mapped to 99)
encoder = OrdinalEncoder(categories=[[0,1,2,3,4,5]], 
                         handle_unknown='use_encoded_value',
                         unknown_value=99)

for var in var_ord:
    X_train[[var]] = encoder.fit_transform(X_train[[var]])
    X_test[[var]] = encoder.transform(X_test[[var]])

In [88]:
# Rescale the quantitative variables
scaler = RobustScaler()

for var in var_quant:
    X_train[[var]] = scaler.fit_transform(X_train[[var]])
    X_test[[var]] = scaler.transform(X_test[[var]])

Because the variables are not normally distributed (except Age) and the dataset contains many outliers, RobustScaler is more appropriate than StandardScaler or MinMaxScaler.

## (a) For each approach, with the default hyperparameters, evaluate churn prediction based on the AUC (Area Under the Curve)

As with the previous datasets, the classes are imbalanced, so the PR AUC (average precision) is preferred to the ROC AUC for now (before trying upsampling and downsampling techniques).

In [89]:
#Algorithms to evaluate
clfs = {'DecisionTree': DecisionTreeClassifier(random_state=7),
        'LogisticRegression': LogisticRegression(random_state=7),
        'SVC (linear)': SVC(kernel='linear', random_state=7),
        'SVC (rbf)': SVC(kernel='rbf', random_state=7),
        'SVC (poly)': SVC(kernel='poly', random_state=7),
        'SVC (sigmoid)': SVC(kernel='sigmoid', random_state=7)}

#Evaluate each algorithm with the 5-fold cross-validated average precision (PR AUC)
for key, clf in clfs.items():
    clf_results = cross_validate(clf, X_train, y_train, scoring='average_precision', cv=5)
    print(key, 
          "\nAverage precision (PR AUC):", clf_results['test_score'].mean(),
          "\n")

DecisionTree 
Average precision (PR AUC): 0.8477093505328502 

LogisticRegression 
Average precision (PR AUC): 0.39658224485841764 



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

SVC (linear) 
Average precision (PR AUC): 0.3101622808961998 

SVC (rbf) 
Average precision (PR AUC): 0.5648236925228558 

SVC (poly) 
Average precision (PR AUC): 0.5275688025481712 

SVC (sigmoid) 
Average precision (PR AUC): 0.2705333872439934 



DecisionTree performs well with its default settings, but it is probably overfitting because its depth is unbounded (the max_depth parameter defaults to None). The other models perform poorly, although the SVC with the rbf kernel looks the most promising.
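One way to check for this kind of overfitting is to compare training and validation scores inside the cross-validation. The sketch below does so on synthetic data (generated with `make_classification`, an assumption standing in for the churn dataset) for an unbounded tree and for a shallow one.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data with some label noise (illustrative stand-in)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     weights=[0.85, 0.15], flip_y=0.05,
                                     random_state=7)

summary = {}
for depth in (None, 3):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=7)
    res = cross_validate(tree, X_demo, y_demo, scoring='average_precision',
                         cv=5, return_train_score=True)
    summary[depth] = {'train': res['train_score'].mean(),
                      'validation': res['test_score'].mean()}
    print("max_depth =", depth,
          "| train AP:", summary[depth]['train'],
          "| validation AP:", summary[depth]['validation'])
```

A large gap between the two scores for `max_depth=None` is the usual symptom: the tree memorizes the training folds and scores much lower on the held-out fold.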

Tuning the hyper-parameters may improve these models.

## (b) For each approach, define a well-performing model by searching for good hyper-parameters via a grid search

Setting the 'class_weight' hyper-parameter to 'balanced' mitigates the imbalance by giving more weight to the correct classification of the minority class.
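For reference, scikit-learn computes the 'balanced' weights as n_samples / (n_classes * count of each class). The sketch below reproduces this with `compute_class_weight`, using the class counts 736 and 141 that appear as supports in the classification reports of this notebook (chosen here only as an illustration).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative label vector: 736 majority samples and 141 minority samples
y_demo = np.array([0] * 736 + [1] * 141)

weights = compute_class_weight(class_weight='balanced',
                               classes=np.array([0, 1]), y=y_demo)

# Same computation by hand: n_samples / (n_classes * class counts)
manual = len(y_demo) / (2 * np.bincount(y_demo))

print("Balanced weights:", weights)
print("Manual formula:  ", manual)
```

The minority class receives a weight roughly five times larger than the majority class, so each of its misclassified examples costs more during training.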

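In the grid searches below, scoring='average_precision' makes the average precision (PR AUC) the score being optimized: the configuration with the best mean score over the 5 folds is selected and refitted on the whole training set (refit=True by default). When several metrics are passed to scoring, a parameter such as refit='f1' means instead that the F1 score is the one optimized, while the other metrics are simply recorded.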
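To make the role of refit concrete, here is a minimal sketch of a multi-metric grid search on synthetic data (the dataset, the grid and the variable names are assumptions for illustration). The F1 score drives the selection and the average precision is only stored in `cv_results_`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (illustrative stand-in for the churn dataset)
X_demo, y_demo = make_classification(n_samples=500, weights=[0.85, 0.15],
                                     random_state=7)

grid_demo = GridSearchCV(
    estimator=DecisionTreeClassifier(class_weight='balanced', random_state=7),
    param_grid={'max_depth': [3, 5, 10]},
    scoring={'f1': 'f1', 'average_precision': 'average_precision'},
    refit='f1',   # the best configuration is chosen on the mean F1 score
    cv=5)
grid_demo.fit(X_demo, y_demo)

best = grid_demo.best_index_
print("Best configuration:", grid_demo.best_params_)
print("Mean F1 (optimized):", grid_demo.cv_results_['mean_test_f1'][best])
print("Mean average precision (recorded only):",
      grid_demo.cv_results_['mean_test_average_precision'][best])
```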

In [90]:
#Instantiate the decision tree model
clf_tree = DecisionTreeClassifier()

#Build the parameter grid to test
param_grid = {'criterion': ['gini', 'entropy'], 
              'max_depth': [5,10,15,20,25,30,35,40,46],
              'class_weight': ['balanced'],
              'random_state': [7]}

grid = GridSearchCV(estimator=clf_tree, 
                    param_grid=param_grid,
                    scoring='average_precision',
                    return_train_score=True,
                    cv=5)

grid.fit(X_train, y_train)

print("Best configuration:", grid.best_params_, 
      "\nAverage precision (PR AUC):", grid.best_score_)

Best configuration: {'class_weight': 'balanced', 'criterion': 'gini', 'max_depth': 20, 'random_state': 7} 
Average precision (PR AUC): 0.8478533070322941


In [92]:
#Instantiate the logistic regression model
clf_logreg = LogisticRegression()

#Build the parameter grid to test
#Note: this grid also contains solver/penalty pairs that scikit-learn rejects (see the failed fits reported below)
param_grid = {'penalty':['l2', 'l1', 'elasticnet'],
              'C':[0.0001,0.001,0.1,10,100],
              'solver': ['lbfgs', 'sag', 'saga'],
              'class_weight': ['balanced'],
              'random_state': [7]}

grid = GridSearchCV(estimator=clf_logreg, 
                    param_grid=param_grid,
                    scoring='average_precision',
                    return_train_score=True,
                    cv=5)

grid.fit(X_train, y_train)

print("Best configuration:", grid.best_params_, 
      "\nAverage precision (PR AUC):", grid.best_score_)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Best configuration: {'C': 0.0001, 'class_weight': 'balanced', 'penalty': 'l2', 'random_state': 7, 'solver': 'sag'} 
Average precision (PR AUC): 0.4014044360989607


125 fits failed out of a total of 225.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
25 fits failed with the following error:
Traceback (most recent call last):
  File "/home/utilisateur/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/utilisateur/anaconda3/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 1162, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/home/utilisateur/anaconda3/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 54, in _check_solver
    raise ValueError(
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l
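The 125 failed fits reported above come from solver/penalty pairs that this scikit-learn version rejects: lbfgs and sag only accept the 'l2' penalty (or none), and 'elasticnet' requires the saga solver together with an l1_ratio. One possible way to avoid them is to pass a list of grids restricted to valid pairs. The sketch below assumes a scikit-learn version where LogisticRegression still takes a penalty argument, as in the traceback; the l1_ratio value, the max_iter value and the synthetic data are illustrative assumptions, not the authors' settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Settings shared by every sub-grid (max_iter is an added assumption
# meant to reduce the convergence warnings)
common = {'C': [0.0001, 0.001, 0.1, 10, 100],
          'class_weight': ['balanced'],
          'random_state': [7],
          'max_iter': [1000]}

# One sub-grid per penalty, listing only the solvers that support it
param_grid_valid = [
    {**common, 'penalty': ['l2'], 'solver': ['lbfgs', 'sag', 'saga']},
    {**common, 'penalty': ['l1'], 'solver': ['saga']},
    {**common, 'penalty': ['elasticnet'], 'solver': ['saga'],
     'l1_ratio': [0.5]},   # hypothetical mixing value
]
n_candidates = len(ParameterGrid(param_grid_valid))
print("Number of candidate configurations:", n_candidates)

# Synthetic imbalanced data (illustrative stand-in for X_train, y_train)
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     weights=[0.85, 0.15], random_state=7)

grid_demo = GridSearchCV(estimator=LogisticRegression(),
                         param_grid=param_grid_valid,
                         scoring='average_precision',
                         cv=3,
                         error_score='raise')   # fail loudly on invalid pairs
grid_demo.fit(X_demo, y_demo)

print("Best configuration:", grid_demo.best_params_)
print("Average precision (PR AUC):", grid_demo.best_score_)
```

With `error_score='raise'`, any invalid combination stops the search immediately instead of being silently scored as nan.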

In [36]:
#Instantiate the kernel SVC model
clf_svc = SVC()

#Build the parameter grid to test
param_grid = {'C': [1e-6, 1e-5, 1e-4], 
              'kernel': ['linear', 'rbf', 'poly', 'sigmoid'],
              'gamma': ['scale', 'auto', 1, 10, 100],
              'class_weight': ['balanced'],
              'random_state': [7]}

grid = GridSearchCV(estimator=clf_svc, 
                    param_grid=param_grid,
                    scoring='average_precision',
                    return_train_score=True,
                    cv=5)

grid.fit(X_train, y_train)

print("Best configuration:", grid.best_params_, 
      "\nAverage precision (PR AUC):", grid.best_score_)

Best configuration: {'C': 1e-05, 'class_weight': 'balanced', 'gamma': 1, 'kernel': 'rbf', 'random_state': 7} 
Average precision (PR AUC): 0.9666830134060511


In [38]:
#Instantiate the best model
clf = SVC(C=1e-05, gamma=1, kernel='rbf', class_weight='balanced', random_state=7)
clf.fit(X_train, y_train)

#Classify the test set
#Note: cross_val_predict and cross_validate refit clones of clf on folds of the test set,
#so the model fitted on X_train above is not the one being scored here
pred = cross_val_predict(clf, X_test, y_test, cv=5)

#Results
clf_results = cross_validate(clf, X_test, y_test, scoring='average_precision', cv=5)
print("\nAverage precision (PR AUC):", clf_results['test_score'].mean())


Average precision (PR AUC): 0.5796955880876223


In [39]:
#Confusion matrix
print(confusion_matrix(y_test, pred))

[[148 588]
 [ 28 113]]


In [40]:
#Classification report
print(classification_report(y_test, pred, target_names=["0", "1"]))

              precision    recall  f1-score   support

           0       0.84      0.20      0.32       736
           1       0.16      0.80      0.27       141

    accuracy                           0.30       877
   macro avg       0.50      0.50      0.30       877
weighted avg       0.73      0.30      0.32       877



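The SVC was probably overfitting: its PR AUC drops from 0.97 in cross-validation on the training set to 0.58 on the test set. Since C=1e-05 corresponds to very strong regularization (a small C favors a simpler model), the complexity more likely comes from the rbf kernel with gamma=1 than from C. Keep in mind as well that the test scores come from a 5-fold cross-validation run on the test set itself, so they describe models retrained on about 700 observations. Let us try the second-best model.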

In [43]:
#Instantiate the second-best model
clf = DecisionTreeClassifier(class_weight='balanced', criterion='gini', max_depth=20, random_state=7)
clf.fit(X_train, y_train)

#Classify the test set
#Note: cross_val_predict and cross_validate refit clones of clf on folds of the test set,
#so the model fitted on X_train above is not the one being scored here
pred = cross_val_predict(clf, X_test, y_test, cv=5)

#Results
clf_results = cross_validate(clf, X_test, y_test, scoring='average_precision', cv=5)
print("\nAverage precision (PR AUC):", clf_results['test_score'].mean())


Average precision (PR AUC): 0.3457831545503959


In [44]:
#Confusion matrix
print(confusion_matrix(y_test, pred))

[[663  73]
 [ 66  75]]


In [45]:
#Classification report
print(classification_report(y_test, pred, target_names=["0", "1"]))

              precision    recall  f1-score   support

           0       0.91      0.90      0.91       736
           1       0.51      0.53      0.52       141

    accuracy                           0.84       877
   macro avg       0.71      0.72      0.71       877
weighted avg       0.84      0.84      0.84       877
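The minority-class figures in both classification reports can be recomputed directly from the confusion matrices printed above (rows are true classes, columns are predicted classes). The helper name `minority_metrics` below is introduced only for this illustration.

```python
import numpy as np

# Confusion matrices printed above for the SVC and the decision tree
cm_svc = np.array([[148, 588], [28, 113]])
cm_tree = np.array([[663, 73], [66, 75]])

def minority_metrics(cm):
    """Return precision, recall and F1 for class 1 from a 2x2 confusion matrix."""
    tn, fp, fn, tp = cm.ravel()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

svc_scores = minority_metrics(cm_svc)
tree_scores = minority_metrics(cm_tree)

for name, cm, scores in (("SVC", cm_svc, svc_scores),
                         ("DecisionTree", cm_tree, tree_scores)):
    print(name,
          "| false positives:", cm[0, 1],
          "| false negatives:", cm[1, 0],
          "| precision, recall, F1 (class 1):", np.round(scores, 2))
```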



It was overfitting too, as shown by the PR AUC gap between the training set (0.85 in cross-validation) and the test set (0.35), but the classification report is much better, with a macro F1 score of 0.71. The hyper-parameters pushed the model to find examples of the minority class while producing far fewer false positives than the SVC (73 versus 588), at the cost of more false negatives (66 versus 28). Precision on the minority class is nevertheless only 51%.
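Both evaluations above call cross_val_predict and cross_validate on the test set, which retrains the model on folds of that set. An alternative, sketched below under stated assumptions, is to score the model fitted on the training set directly on the held-out test set. The data here is synthetic and the names `X_demo`, `clf_demo`, `scores_demo` are hypothetical, so the numbers are not comparable to the results reported in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (average_precision_score, classification_report,
                             confusion_matrix)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (illustrative stand-in for the churn dataset)
X_demo, y_demo = make_classification(n_samples=2000, weights=[0.85, 0.15],
                                     random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20,
                                          random_state=7, stratify=y_demo)

# Fit once on the training set only
clf_demo = DecisionTreeClassifier(class_weight='balanced', criterion='gini',
                                  max_depth=20, random_state=7)
clf_demo.fit(X_tr, y_tr)

# Score the fitted model on data it has never seen
pred_demo = clf_demo.predict(X_te)
scores_demo = clf_demo.predict_proba(X_te)[:, 1]   # probability of class 1
ap_demo = average_precision_score(y_te, scores_demo)
cm_demo = confusion_matrix(y_te, pred_demo)

print("Average precision (PR AUC):", ap_demo)
print(cm_demo)
print(classification_report(y_te, pred_demo, target_names=["0", "1"]))
```

For an SVC, `decision_function(X_te)` would play the role of `predict_proba` when computing the average precision.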