# Prêt à dépenser: Building a Scoring Model


## Context

"Prêt à dépenser" (Home Credit) is a financial company that offers consumer loans to people with little or no credit history.
To grant a consumer loan, the company estimates the probability that a client will repay it. It therefore wants to develop a scoring algorithm to help decide whether a loan can be granted to a client.

Customer relationship managers will be the users of the scoring model. Because they deal directly with clients, they need the model to be easily interpretable. They also want a measure of the importance of the variables that led the model to assign a given probability to a client.


## Loading the project modules

To keep the Notebook simple, the project's business logic is placed in the [src/](../src/) directory.


In [32]:
# Import project modules from source directory

# system modules
import os
import sys

# Append source directory to system path
src_path = os.path.abspath(os.path.join("../src"))
if src_path not in sys.path:
    sys.path.append(src_path)

# helper functions
import data.helpers as data_helpers
import features.helpers as feat_helpers
import visualization.helpers as vis_helpers


We will use the [Python](https://www.python.org/about/gettingstarted/) language, and present the code, results and analysis here as a [JupyterLab Notebook](https://jupyterlab.readthedocs.io/en/stable/getting_started/overview.html).

We will also use the usual data exploration and analysis libraries to keep our code simple and efficient:
  * [NumPy](https://numpy.org/doc/stable/user/quickstart.html) and [Pandas](https://pandas.pydata.org/docs/user_guide/index.html): perform scientific computations (statistics, algebra, ...) and manipulate large and complex data series and tables
  * [scikit-learn](https://scikit-learn.org/stable/getting_started.html) and [XGBoost](https://xgboost.readthedocs.io/en/latest/get_started.html): perform predictive analyses
  * [Matplotlib](https://matplotlib.org/stable/tutorials/introductory/usage.html), [Pyplot](https://matplotlib.org/stable/tutorials/introductory/pyplot.html), [Seaborn](https://seaborn.pydata.org/tutorial/function_overview.html) and [Plotly](https://plotly.com/python/getting-started/): generate readable, interactive and relevant charts


In [33]:
# numpy and pandas for data manipulation
import numpy as np
import pandas as pd

# sklearn for splitting the data, xgboost for gradient-boosted models
from sklearn.model_selection import train_test_split
import xgboost as xgb

# matplotlib and seaborn for plotting
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

# Accelerate the development cycle
SAMPLE_FRAC: float = 1

# Prevent excessive memory usage used by plotly
DRAW_PLOTS: bool = False


## Loading the data

The data comes from [Home Credit](https://www.homecredit.net/), more precisely from the Kaggle competition [Home Credit Default Risk - Can you predict how capable each applicant is of repaying a loan?](https://www.kaggle.com/c/home-credit-default-risk)

The data is provided as several CSV files that can be linked together as follows:

![Home Credit data relations](https://storage.googleapis.com/kaggle-media/competitions/home-credit/home_credit.png)

_Source: [Introduction: Home Credit Default Risk Competition](https://www.kaggle.com/willkoehrsen/start-here-a-gentle-introduction) by [Will Koehrsen](https://www.kaggle.com/willkoehrsen)_


In [34]:
# Download and extract the raw data
data_helpers.download_extract_zip(
    zip_file_url="https://s3-eu-west-1.amazonaws.com/static.oc-static.com/prod/courses/files/Parcours_data_scientist/Projet+-+Impl%C3%A9menter+un+mod%C3%A8le+de+scoring/Projet+Mise+en+prod+-+home-credit-default-risk.zip",
    files_names=(
        "application_test.csv",
        "application_train.csv",
        "bureau_balance.csv",
        "bureau.csv",
        "credit_card_balance.csv",
        "installments_payments.csv",
        "POS_CASH_balance.csv",
        "previous_application.csv",
    ),
    target_path="../data/raw/",
)


We will load all the data from the `application_{train|test}.csv` files into a single DataFrame, so that the training set (where `TARGET` is `0`: the client did not default, or `1`: the client defaulted) and the test set (where `TARGET` is undefined, marked as `-1` in the next cell) go through the same processing. We will separate them again when training and evaluating our models.

The files contain a large number of boolean and categorical variables, which we can already type as such.
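
The name-based typing rule used in the next cell can be tried in isolation. The following is a minimal sketch on a hypothetical handful of column names; `sample_columns` and `infer_column_types` are illustrative names, not part of the project code, and the explicit list of extra categorical columns is left out for brevity.

```python
# Illustrative only: `sample_columns` and `infer_column_types` are hypothetical names.
CATEGORICAL_PREFIXES = ("NAME_",)
CATEGORICAL_SUFFIXES = ("_TYPE",)
BOOLEAN_PREFIXES = ("FLAG_", "REG_", "LIVE_")


def infer_column_types(columns):
    """Map column names to a dtype label based on naming conventions."""
    types = {}
    for col in columns:
        if col.startswith(CATEGORICAL_PREFIXES) or col.endswith(CATEGORICAL_SUFFIXES):
            types[col] = "category"
        # Boolean rules are applied last, so they win if both rules match.
        if col.startswith(BOOLEAN_PREFIXES):
            types[col] = "bool"
    return types


sample_columns = [
    "SK_ID_CURR",
    "NAME_CONTRACT_TYPE",
    "OCCUPATION_TYPE",
    "FLAG_OWN_CAR",
    "AMT_CREDIT",
]
sample_types = infer_column_types(sample_columns)
print(sample_types)
```

Columns that match no rule are absent from the mapping, so their type would be left for the CSV reader to infer.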


In [35]:
# Read column names
application_train_column_names = pd.read_csv(
    "../data/raw/application_train.csv", nrows=0
).columns.values
application_test_column_names = pd.read_csv(
    "../data/raw/application_test.csv", nrows=0
).columns.values

# TARGET variable must be present in the Train dataset
if "TARGET" not in application_train_column_names:
    raise ValueError(
        "TARGET column not found in application_train.csv. Please check that the file is not corrupted."
    )

# SK_ID_CURR variable must be present in the Train dataset
if "SK_ID_CURR" not in application_train_column_names:
    raise ValueError(
        "SK_ID_CURR column not found in application_train.csv. Please check that the file is not corrupted."
    )

# Train and Test datasets must have the same variables, except for the TARGET variable
if list(
    application_train_column_names[application_train_column_names != "TARGET"]
) != list(application_test_column_names):
    raise ValueError(
        "Column names in application_train.csv and application_test.csv do not match. Please check that the files are not corrupted."
    )

# Set column types according to fields description (../data/raw/HomeCredit_columns_description.csv)
# Categorical variables
column_types = {
    col: "category"
    for col in application_train_column_names
    if col.startswith(("NAME_",))
    or col.endswith(("_TYPE",))
    or col
    in [
        "CODE_GENDER",
        "WEEKDAY_APPR_PROCESS_START",
        "FONDKAPREMONT_MODE",
        "HOUSETYPE_MODE",
        "WALLSMATERIAL_MODE",
        "EMERGENCYSTATE_MODE",
    ]
}
# Boolean variables
column_types |= {
    col: bool
    for col in application_train_column_names
    if col.startswith(("FLAG_", "REG_", "LIVE_"))
}

# Load application data
app_train_df = pd.read_csv(
    "../data/raw/application_train.csv",
    dtype=column_types,
    true_values=["Y", "Yes", "1"],
    false_values=["N", "No", "0"],
    na_values=["XNA"],
)
app_test_df = pd.read_csv(
    "../data/raw/application_test.csv",
    dtype=column_types,
    true_values=["Y", "Yes", "1"],
    false_values=["N", "No", "0"],
    na_values=["XNA"], # bad values
)
app_test_df['TARGET'] = -1 # identify test data

# Merge Train and Test datasets (DataFrame.append was removed in pandas 2.0)
app_df = pd.concat([app_train_df, app_test_df])

# Sample to speed up development
if 0 < SAMPLE_FRAC < 1:
    app_df = app_df.sample(frac=SAMPLE_FRAC)

# Let's display basic statistical info about the data
app_df.describe(include="all")


Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
count,356255.0,356255.0,356255,356251,356255,356255,356255.0,356255.0,356255.0,356219.0,...,356255,356255,356255,356255,308687.0,308687.0,308687.0,308687.0,308687.0,308687.0
unique,,,2,2,2,2,,,,,...,2,2,2,2,,,,,,
top,,,Cash loans,F,False,True,,,,,...,False,False,False,False,,,,,,
freq,,,326537,235126,235235,246970,,,,,...,353679,356072,356099,356152,,,,,,
mean,278128.0,-0.06714,,,,,0.414316,170116.1,587767.4,27425.560657,...,,,,,0.005808,0.006281,0.029995,0.231697,0.304399,1.911564
std,102842.104413,0.449443,,,,,0.720378,223506.8,398623.7,14732.80819,...,,,,,0.079736,0.10425,0.191374,0.855949,0.786915,1.865338
min,100001.0,-1.0,,,,,0.0,25650.0,45000.0,1615.5,...,,,,,0.0,0.0,0.0,0.0,0.0,0.0
25%,189064.5,0.0,,,,,0.0,112500.0,270000.0,16731.0,...,,,,,0.0,0.0,0.0,0.0,0.0,0.0
50%,278128.0,0.0,,,,,0.0,153000.0,500211.0,25078.5,...,,,,,0.0,0.0,0.0,0.0,0.0,1.0
75%,367191.5,0.0,,,,,1.0,202500.0,797557.5,34960.5,...,,,,,0.0,0.0,0.0,0.0,0.0,3.0


In [36]:
app_df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 356255 entries, 59381 to 269900
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: bool(34), category(12), float64(65), int64(9), object(2)
memory usage: 224.9+ MB


The dataset contains 122 variables, including the target variable we need to estimate: `TARGET`. Among these variables, we have:
- 34 boolean variables
- 14 categorical variables (12 typed as `category` and 2 left as `object`)
- 74 numeric variables

## Exploratory analysis

We will look at the distribution of a few variables.

### Target variable

Let us look specifically at the distribution of the `TARGET` variable, which is the one we will have to estimate later.
The value `-1` marks the test set, for which `TARGET` is undefined.
We can see that we are dealing with an __imbalanced binary classification__ problem (there are two possible values, but they are not equally represented).
This will influence how we build and train our model.
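
To make the imbalance concrete, the sketch below counts the classes in a small hypothetical label list and derives the negative-to-positive ratio. XGBoost's documentation suggests this ratio as a typical starting value for its `scale_pos_weight` parameter; whether it is the right setting for this project is an assumption to validate, and the labels are made up for illustration.

```python
from collections import Counter

# Hypothetical labels: 0 = repaid, 1 = payment difficulties (made-up proportions).
labels = [0] * 92 + [1] * 8

counts = Counter(labels)
positive_rate = counts[1] / len(labels)
# Ratio of negative to positive examples, a common starting point for class weighting.
neg_pos_ratio = counts[0] / counts[1]

print(f"positive rate: {positive_rate:.2%}")
print(f"negative/positive ratio: {neg_pos_ratio:.1f}")
```

With such proportions, a model that always predicts `0` would already be right most of the time, which is why plain accuracy is a poor guide on this kind of data.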


In [37]:
# Let's plot the distribution of the TARGET variable
if DRAW_PLOTS:
    px.bar(
        # TARGET is numeric, so the keys must be integers, not strings
        app_df["TARGET"].replace({
            0: "0 : payments OK",
            1: "1 : payment difficulties",
            -1: "-1 : undefined (test dataset)",
        }).value_counts(),
        title="Distribution of TARGET variable",
        width=800,
        height=400,
    ).update_xaxes(
        title="TARGET",
    ).update_yaxes(
        title="Count",
    ).show()  # a figure built inside an `if` block is not displayed automatically


### Missing values

We see that all variables have less than 30% missing values, and nearly half have less than 1%. The dataset is therefore relatively complete, which should not cause problems later on.
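
The helper `vis_helpers.plot_empty_values` is not shown in this notebook. As a rough idea of the quantity it presumably plots, here is a minimal sketch that computes the share of missing values per column on a tiny hypothetical table (rows as dictionaries, `None` for a missing value). The function name `missing_ratio` and the sample rows are assumptions.

```python
# Hypothetical rows; None marks a missing value.
rows = [
    {"AMT_ANNUITY": 24700.5, "OWN_CAR_AGE": None, "EXT_SOURCE_1": 0.08},
    {"AMT_ANNUITY": 35698.5, "OWN_CAR_AGE": None, "EXT_SOURCE_1": None},
    {"AMT_ANNUITY": 6750.0, "OWN_CAR_AGE": 26.0, "EXT_SOURCE_1": None},
    {"AMT_ANNUITY": None, "OWN_CAR_AGE": None, "EXT_SOURCE_1": 0.5},
]


def missing_ratio(rows):
    """Return {column: fraction of rows where the value is None}."""
    columns = rows[0].keys()
    return {
        col: sum(row[col] is None for row in rows) / len(rows)
        for col in columns
    }


ratios = missing_ratio(rows)
# Show the most incomplete columns first.
for col, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{col}: {ratio:.0%}")
```

On a pandas DataFrame, `df.isna().mean()` gives the same per-column ratio.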


In [38]:
# Let's display variables with missing values ratio
if DRAW_PLOTS:
    vis_helpers.plot_empty_values(app_df)


### Impossible values

Some values in the data look impossible. We will remove these outliers.


In [39]:
# Define data constraints
data_constraints = {
    "DAYS_EMPLOYED": {"min": -35000, "max": 0,}, # max 100 years, only negative values
}

if DRAW_PLOTS:
    # Let's display box plots for variables with outliers
    vis_helpers.plot_boxes(app_df, plot_columns=data_constraints.keys(), categorical_column="TARGET")

# Remove values that are outside possible range
app_df = feat_helpers.drop_impossible_values(
    app_df, constraints=data_constraints,
)
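
The body of `feat_helpers.drop_impossible_values` lives in `src/` and is not reproduced here. The sketch below is one plausible reading of it, under the assumption that rows with an out-of-range value are dropped and rows with a missing value are kept. The function `drop_out_of_range` and the sample rows are hypothetical.

```python
def drop_out_of_range(rows, constraints):
    """Keep rows whose constrained values are missing or within [min, max]."""
    kept = []
    for row in rows:
        in_range = True
        for col, bounds in constraints.items():
            value = row.get(col)
            if value is None:
                continue  # assumption: missing values are left for a later imputation step
            if not (bounds["min"] <= value <= bounds["max"]):
                in_range = False
                break
        if in_range:
            kept.append(row)
    return kept


constraints = {"DAYS_EMPLOYED": {"min": -35000, "max": 0}}
rows = [
    {"SK_ID_CURR": 1, "DAYS_EMPLOYED": -637},
    {"SK_ID_CURR": 2, "DAYS_EMPLOYED": 365243},  # positive, about 1000 years: impossible
    {"SK_ID_CURR": 3, "DAYS_EMPLOYED": None},
]
clean_rows = drop_out_of_range(rows, constraints)
print([row["SK_ID_CURR"] for row in clean_rows])
```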


### Quantitative variables

We will simply display the distribution of a few numeric variables. We can already see that, depending on the value of `TARGET`, the distribution (in particular the mean) of a variable can differ noticeably.


In [40]:
# Draw the BoxPlots of some numeric columns, split per Target
if DRAW_PLOTS:
    vis_helpers.plot_boxes(app_df,
        plot_columns=[
            "AMT_INCOME_TOTAL",
            "AMT_CREDIT",
            "AMT_ANNUITY",
            "AMT_GOODS_PRICE",
            "DAYS_BIRTH",
            "DAYS_EMPLOYED",
            "OWN_CAR_AGE",
            "REGION_RATING_CLIENT",
            "REGION_RATING_CLIENT_W_CITY",
            "EXT_SOURCE_1",
            "EXT_SOURCE_2",
            "EXT_SOURCE_3",
            "DAYS_LAST_PHONE_CHANGE",
            "AMT_REQ_CREDIT_BUREAU_YEAR",
        ],
        categorical_column="TARGET",
    )


### Qualitative variables

In the same way, we will simply display the distribution of a few categorical variables. We can already see that, depending on the value of `TARGET`, the distribution (split between classes) can differ noticeably (`TARGET=0` for 77.6% of `NAME_CONTRACT_TYPE="Cash loans"`, versus `TARGET=0` for 93.1% of `NAME_CONTRACT_TYPE="Revolving loans"`).
Some variables are very unevenly distributed between classes (`FLAG_MOBIL` is always `True` and never `False`).


In [41]:
# Draw the Bar charts of some categorical columns, split per Target
if DRAW_PLOTS:
    vis_helpers.plot_categories_bars(app_df,
        plot_columns=[
            "NAME_CONTRACT_TYPE",
            "CODE_GENDER",
            "FLAG_OWN_CAR",
            "FLAG_OWN_REALTY",
            "NAME_INCOME_TYPE",
            "NAME_EDUCATION_TYPE",
            "NAME_FAMILY_STATUS",
            "NAME_HOUSING_TYPE",
            "OCCUPATION_TYPE",
            "FLAG_MOBIL",
        ],
        categorical_column="TARGET",
    )


In [42]:
X = app_df.loc[app_df["TARGET"] >= 0].drop(["TARGET"], axis=1)
y = app_df.loc[app_df["TARGET"] >= 0, "TARGET"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


## Feature Engineering

We will try to enrich our data by adding variables that are non-linear combinations of the existing ones.


### Business data

To make the data we feed to our models more meaningful, we can ask domain experts which pieces of information are known to matter when predicting whether a client is likely to have repayment difficulties.

The relevant business features are:
- Amount borrowed / Price of the goods purchased: `AMT_CREDIT / AMT_GOODS_PRICE`
- Annuity amount / Amount borrowed: `AMT_ANNUITY / AMT_CREDIT`
- Annuity amount / Annual income: `AMT_ANNUITY / AMT_INCOME_TOTAL`
- Length of employment / Age: `DAYS_EMPLOYED / DAYS_BIRTH`
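
As a worked example, the four ratios can be computed by hand for a single hypothetical applicant (the amounts below are invented for illustration). In this dataset `DAYS_EMPLOYED` and `DAYS_BIRTH` are counted backwards from the application date, so both are negative and their ratio is positive.

```python
# Hypothetical applicant; all amounts are invented for illustration.
applicant = {
    "AMT_CREDIT": 400_000.0,
    "AMT_GOODS_PRICE": 350_000.0,
    "AMT_ANNUITY": 20_000.0,
    "AMT_INCOME_TOTAL": 160_000.0,
    "DAYS_EMPLOYED": -1_825,  # about 5 years in the current job
    "DAYS_BIRTH": -14_600,    # about 40 years old
}

ratios = {
    "CREDIT_PRICE_RATIO": applicant["AMT_CREDIT"] / applicant["AMT_GOODS_PRICE"],
    "ANNUITY_CREDIT_RATIO": applicant["AMT_ANNUITY"] / applicant["AMT_CREDIT"],
    "ANNUITY_INCOME_RATIO": applicant["AMT_ANNUITY"] / applicant["AMT_INCOME_TOTAL"],
    "EMPLOYED_BIRTH_RATIO": applicant["DAYS_EMPLOYED"] / applicant["DAYS_BIRTH"],
}

for name, value in ratios.items():
    print(f"{name}: {value:.3f}")
```

For this applicant, the loan exceeds the price of the goods by about 14%, the annuity amounts to 5% of the borrowed amount and 12.5% of the annual income, and one eighth of the applicant's life has been spent in the current job.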


In [43]:
# Create the new features
feat_app_df = app_df.copy()
feat_app_df["CREDIT_PRICE_RATIO"] = feat_app_df["AMT_CREDIT"] / feat_app_df["AMT_GOODS_PRICE"]
feat_app_df["ANNUITY_CREDIT_RATIO"] = feat_app_df["AMT_ANNUITY"] / feat_app_df["AMT_CREDIT"]
feat_app_df["ANNUITY_INCOME_RATIO"] = feat_app_df["AMT_ANNUITY"] / feat_app_df["AMT_INCOME_TOTAL"]
feat_app_df["EMPLOYED_BIRTH_RATIO"] = feat_app_df["DAYS_EMPLOYED"] / feat_app_df["DAYS_BIRTH"]

# Draw the BoxPlots for these features
if DRAW_PLOTS:
    vis_helpers.plot_boxes(feat_app_df,
        plot_columns=[
            "CREDIT_PRICE_RATIO",
            "ANNUITY_CREDIT_RATIO",
            "ANNUITY_INCOME_RATIO",
            "EMPLOYED_BIRTH_RATIO",
        ],
        categorical_column="TARGET",
    )


### Polynomial combinations of existing variables

The `EXT_SOURCE_{1-3}` variables have no obvious concrete meaning. `TARGET` may well not depend linearly on them, so we will generate polynomial combinations of these variables.
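
To see what a degree-2 polynomial expansion produces, the following sketch rebuilds it with the standard library for three inputs. It is not scikit-learn's implementation: the function name `poly2_features` is hypothetical, and the term names only imitate the `x0`, `x0^2`, `x0 x1` convention that appears later in this notebook.

```python
from itertools import combinations_with_replacement


def poly2_features(values):
    """Return (names, terms) for the constant, linear and degree-2 terms of `values`."""
    n = len(values)
    names, terms = ["1"], [1.0]
    # Degree 1: the inputs themselves.
    for i in range(n):
        names.append(f"x{i}")
        terms.append(values[i])
    # Degree 2: squares and pairwise products.
    for i, j in combinations_with_replacement(range(n), 2):
        names.append(f"x{i}^2" if i == j else f"x{i} x{j}")
        terms.append(values[i] * values[j])
    return names, terms


# Hypothetical EXT_SOURCE_1, EXT_SOURCE_2 and EXT_SOURCE_3 values for one client.
names, terms = poly2_features([0.5, 0.2, 0.4])
for name, term in zip(names, terms):
    print(f"{name:>6} = {term:.2f}")
```

Three inputs yield 10 columns: 1 constant, 3 linear terms and 6 degree-2 terms. Only the last 6 carry information that the original variables do not already provide.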


In [44]:
from sklearn.preprocessing import PolynomialFeatures


# Let's keep only non null data
ext_source = feat_app_df[["SK_ID_CURR", "TARGET", "EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3"]].dropna()

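# Let's create the new features
poly = PolynomialFeatures()
poly_feat = pd.DataFrame(
    # Pass a NumPy array so that the generated names stay generic ("1", "x0", "x0^2", "x0 x1", ...)
    poly.fit_transform(
        ext_source[["EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3"]].to_numpy()
    ),
    # get_feature_names() was removed in scikit-learn 1.2
    columns=poly.get_feature_names_out(),
)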

poly_feat.insert(0, "SK_ID_CURR", ext_source["SK_ID_CURR"].values)

# Merge the new features with the original dataset
feat_app_df = feat_app_df.merge(
    poly_feat,
    on="SK_ID_CURR",
    how="left",
)

# Draw the BoxPlots for these features
if DRAW_PLOTS:
    vis_helpers.plot_boxes(feat_app_df,
        plot_columns=poly_feat.columns[5:],
        categorical_column="TARGET",
    )


In [45]:
X_feat = feat_app_df.loc[feat_app_df["TARGET"] >= 0].drop(["TARGET"], axis=1)
y_feat = feat_app_df.loc[feat_app_df["TARGET"] >= 0, "TARGET"]
X_feat_train, X_feat_test, y_feat_train, y_feat_test = train_test_split(X_feat, y_feat, test_size=0.2, random_state=42)


## Data preparation

We will transform the data so that our models can make the best use of it.


### Encoding categories

When qualitative data is not ordinal (its values cannot be ranked in a meaningful order), "One Hot Encoding" is generally preferable to "Label Encoding", which would impose an arbitrary order on the categories.
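
The difference between the two encodings can be shown on a handful of values. The sketch below uses only the standard library; the helper `one_hot` and the sample values are hypothetical, and the column naming merely imitates the `VARIABLE_category` pattern visible in the outputs of this notebook.

```python
def one_hot(values, prefix):
    """Return one boolean column per distinct category, in sorted order."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{cat}": [value == cat for value in values]
        for cat in categories
    }


housing = ["House / apartment", "With parents", "Rented apartment", "With parents"]

# Label encoding: one integer per category, which implies an order that has no meaning here.
label_codes = {cat: code for code, cat in enumerate(sorted(set(housing)))}
label_encoded = [label_codes[value] for value in housing]

# One-hot encoding: one independent boolean column per category.
encoded = one_hot(housing, prefix="NAME_HOUSING_TYPE")

print(label_encoded)
for column, flags in encoded.items():
    print(column, flags)
```

With label encoding, a model could read "With parents" (code 2) as being twice "Rented apartment" (code 1); the one-hot columns avoid that at the cost of a wider table.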


In [46]:
# Encode categorical variables with One Hot Encoding
encoded_app_df = pd.get_dummies(feat_app_df, dtype=bool)

encoded_app_df.describe(include="all")


Unnamed: 0,SK_ID_CURR,TARGET,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,REGION_POPULATION_RELATIVE,...,HOUSETYPE_MODE_terraced house,WALLSMATERIAL_MODE_Block,WALLSMATERIAL_MODE_Mixed,WALLSMATERIAL_MODE_Monolithic,WALLSMATERIAL_MODE_Others,WALLSMATERIAL_MODE_Panel,"WALLSMATERIAL_MODE_Stone, brick",WALLSMATERIAL_MODE_Wooden,EMERGENCYSTATE_MODE_No,EMERGENCYSTATE_MODE_Yes
count,291607.0,291607.0,291607,291607,291607.0,291607.0,291607.0,291574.0,291351.0,291607.0,...,291607,291607,291607,291607,291607,291607,291607,291607,291607,291607
unique,,,2,2,,,,,,,...,2,2,2,2,2,2,2,2,2,2
top,,,False,True,,,,,,,...,False,False,False,False,False,False,False,False,True,False
freq,,,182447,197676,,,,,,,...,290438,282949,289407,289841,290014,228110,229715,286517,152584,289371
mean,278077.369076,-0.060475,,,0.495911,177280.5,600553.8,28174.994612,539528.8,0.020939,...,,,,,,,,,,
std,102876.577175,0.454505,,,0.761873,243796.0,403404.4,14916.227732,370611.3,0.013948,...,,,,,,,,,,
min,100001.0,-1.0,,,0.0,25650.0,45000.0,1980.0,40500.0,0.000253,...,,,,,,,,,,
25%,188930.5,0.0,,,0.0,112500.0,274500.0,17266.5,238500.0,0.010006,...,,,,,,,,,,
50%,278034.0,0.0,,,0.0,157500.0,509400.0,26050.5,450000.0,0.01885,...,,,,,,,,,,
75%,367229.5,0.0,,,1.0,216000.0,808650.0,35937.0,679500.0,0.028663,...,,,,,,,,,,


In [47]:
X_encoded = encoded_app_df.loc[encoded_app_df["TARGET"] >= 0].drop(["TARGET"], axis=1)
y_encoded = encoded_app_df.loc[encoded_app_df["TARGET"] >= 0, "TARGET"]
X_encoded_train, X_encoded_test, y_encoded_train, y_encoded_test = train_test_split(X_encoded, y_encoded, test_size=0.2, random_state=42)


Here we created 133 new boolean variables corresponding to the categories of each of the 14 original categorical variables, which were encoded and removed.


### Data normalization

To prevent some models from giving undue weight to certain variables because of their order of magnitude, we will standardize each variable to a mean of zero and a variance of 1.
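
The arithmetic behind this standardization is short enough to write out. The sketch below uses hypothetical income values and the standard library only; it estimates the mean and the population standard deviation on the training values, then applies them unchanged to unseen values, which is the same fit-on-train principle followed in the next cell.

```python
from statistics import fmean, pstdev

# Hypothetical incomes; the split between train and test is made up for illustration.
train = [112_500.0, 157_500.0, 202_500.0, 135_000.0]
test = [180_000.0]

# Statistics are estimated on the training values only.
mean = fmean(train)
std = pstdev(train)  # population standard deviation (no Bessel correction)


def standardize(values):
    """Apply z = (x - mean) / std with the statistics learned on the training set."""
    return [(value - mean) / std for value in values]


train_scaled = standardize(train)
test_scaled = standardize(test)

print(f"train mean after scaling: {fmean(train_scaled):.6f}")
print(f"train std after scaling: {pstdev(train_scaled):.6f}")
print(f"scaled test value: {test_scaled[0]:.3f}")
```

Only the training values end up with exactly zero mean and unit variance; values outside the training set are close to that but not guaranteed to match, which explains the small deviations visible in the summary table of the scaled data.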


In [48]:
# Scale each variable of the DataFrame
from sklearn.preprocessing import StandardScaler


# define scaler
scaler = StandardScaler()
# fit scaler on train data only, to avoid data leak
scaler.fit(X_encoded_train)
# transform the dataset
transform_app_df = encoded_app_df.drop(["TARGET"], axis=1)
scaled_app_df = pd.DataFrame(
    scaler.transform(transform_app_df), 
    columns=transform_app_df.columns, 
    index=transform_app_df.index
)
scaled_app_df.insert(0, "TARGET", encoded_app_df["TARGET"].values)

scaled_app_df.describe(include="all")


Unnamed: 0,TARGET,SK_ID_CURR,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,REGION_POPULATION_RELATIVE,...,HOUSETYPE_MODE_terraced house,WALLSMATERIAL_MODE_Block,WALLSMATERIAL_MODE_Mixed,WALLSMATERIAL_MODE_Monolithic,WALLSMATERIAL_MODE_Others,WALLSMATERIAL_MODE_Panel,"WALLSMATERIAL_MODE_Stone, brick",WALLSMATERIAL_MODE_Wooden,EMERGENCYSTATE_MODE_No,EMERGENCYSTATE_MODE_Yes
count,291607.0,291607.0,291607.0,291607.0,291607.0,291607.0,291607.0,291574.0,291351.0,291607.0,...,291607.0,291607.0,291607.0,291607.0,291607.0,291607.0,291607.0,291607.0,291607.0,291607.0
mean,-0.060475,0.000673,-0.001611,-0.000653,-0.005114,0.003862,-0.027631,0.024393,-0.027321,0.003606,...,0.0006,-0.002123,0.000274,0.003233,0.001077,0.003961,-0.000756,-0.002519,0.002329,-0.002041
std,0.454505,1.001008,0.999585,1.00025,0.996918,0.85553,0.991907,1.018197,0.992467,1.00531,...,1.004725,0.994137,1.001562,1.020804,1.007257,1.002722,0.99947,0.990765,0.999896,0.988551
min,-1.0,-1.732042,-0.774794,-1.451696,-0.654018,-0.52824,-1.39365,-1.763704,-1.36368,-1.487345,...,-0.063142,-0.176024,-0.08705,-0.076448,-0.073574,-0.525074,-0.519546,-0.134574,-1.045198,-0.088938
25%,0.0,-0.866742,-0.774794,-1.451696,-0.654018,-0.223466,-0.829346,-0.720232,-0.833452,-0.784394,...,-0.063142,-0.176024,-0.08705,-0.076448,-0.073574,-0.525074,-0.519546,-0.134574,-1.045198,-0.088938
50%,0.0,0.000251,-0.774794,0.688849,-0.654018,-0.065552,-0.251764,-0.120627,-0.267072,-0.146959,...,-0.063142,-0.176024,-0.08705,-0.076448,-0.073574,-0.525074,-0.519546,-0.134574,0.956757,-0.088938
75%,0.0,0.868139,1.290665,0.688849,0.654491,0.139737,0.484044,0.554235,0.347511,0.560317,...,-0.063142,-0.176024,-0.08705,-0.076448,-0.073574,-0.525074,-0.519546,-0.134574,0.956757,-0.088938
max,1.0,1.734373,1.290665,0.688849,25.516158,409.958529,8.454009,15.714223,9.373437,3.720463,...,15.837344,5.68105,11.487633,13.080774,13.591752,1.904494,1.924758,7.430859,0.956757,11.243753


In [49]:
X_scaled = scaled_app_df.loc[scaled_app_df["TARGET"] >= 0].drop(["TARGET"], axis=1)
y_scaled = scaled_app_df.loc[scaled_app_df["TARGET"] >= 0, "TARGET"]
X_scaled_train, X_scaled_test, y_scaled_train, y_scaled_test = train_test_split(X_scaled, y_scaled, test_size=0.2, random_state=42)


Here we brought every variable to a comparable order of magnitude, so that our future models are not influenced by differences in scale.


### Imputing missing values

Because some models cannot handle missing values, we will replace every missing value with its best possible estimate.
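
Model-based imputation estimates a missing value from the other features of the same row. The sketch below shows only the basic building block, under strong simplifying assumptions: one incomplete feature, one predictor, a single pass, and an ordinary least-squares line fitted on the complete rows. It is not the full round-robin scheme of an iterative imputer, and all names and values are hypothetical.

```python
from statistics import fmean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_mean, y_mean = fmean(xs), fmean(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


# Hypothetical standardized values; None marks a missing annuity.
credit = [-1.0, 0.0, 1.0, 2.0, 0.5]
annuity = [-2.0, 0.0, 2.0, None, None]

# Fit on the rows where the incomplete feature is known.
known = [(x, y) for x, y in zip(credit, annuity) if y is not None]
slope, intercept = fit_line([x for x, _ in known], [y for _, y in known])

# Predict the missing entries, keep the observed ones untouched.
imputed_annuity = [
    y if y is not None else slope * x + intercept
    for x, y in zip(credit, annuity)
]
print(imputed_annuity)
```

An iterative imputer repeats this idea for every incomplete feature, using several predictors at once, and cycles until the estimates stabilize.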


In [50]:
# Impute missing values by modeling each feature with missing values as a function of other features in a round-robin fashion
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer


# If imputed data already exist, load from CSV
# Else, do the imputation and save the data to CSV
if os.path.exists("../data/processed/imputed_application.csv"):
    imputed_app_df = pd.read_csv(
        "../data/processed/imputed_application.csv", index_col=0
    )
else:
    # define imputer
    imputer = IterativeImputer(n_nearest_features=10)
    # fit imputer on train data only, to avoid data leak
    imputer.fit(X_scaled_train)
    # transform the dataset
    transform_app_df = scaled_app_df.drop(["TARGET"], axis=1)
    imputed_app_df = pd.DataFrame(
        imputer.transform(transform_app_df), 
        columns=transform_app_df.columns, 
        index=transform_app_df.index,
    )
    imputed_app_df.insert(0, "TARGET", scaled_app_df["TARGET"].values)

    imputed_app_df.to_csv("../data/processed/imputed_application.csv")


if DRAW_PLOTS:
    vis_helpers.plot_empty_values(imputed_app_df)


In [51]:
X_imputed = imputed_app_df.loc[imputed_app_df["TARGET"] >= 0].drop(["TARGET"], axis=1)
y_imputed = imputed_app_df.loc[imputed_app_df["TARGET"] >= 0, "TARGET"]
X_imputed_train, X_imputed_test, y_imputed_train, y_imputed_test = train_test_split(
    X_imputed, y_imputed, test_size=0.2, random_state=42
)


### Feature selection

The goal here is to remove a number of variables in order to speed up the training and prediction of our models. We want to remove the variables whose removal will hurt model performance the least.

We already know that the columns `SK_ID_CURR` (a plain identifier with no business meaning), `1` and `x{0-2}` (polynomial terms of degree 0 and 1, which duplicate a constant and the original `EXT_SOURCE` variables), and `FLAG_MOBIL` (always 1) carry no information. We will therefore remove them.

In [52]:
# Let's drop the features that are not useful for the prediction
simple_app_df = imputed_app_df.drop(
    columns=["SK_ID_CURR", "1", "x0", "x1", "x2", "FLAG_MOBIL"]
)


Here we will look at the correlation:
- of each variable with the target variable `TARGET`: the variables least correlated with `TARGET` are likely to be the least useful for predicting its value.
- between variables, pair by pair: if two variables are highly correlated, they carry redundant information and we can remove one of them.


In [53]:
# Let's compute the correlation matrix
app_correlations = simple_app_df.loc[imputed_app_df["TARGET"] >= 0].corr().abs().sort_values(
    "TARGET", ascending=False, axis=0
).sort_values(
    "TARGET", ascending=False, axis=1
)

if DRAW_PLOTS:
    fig = px.imshow(app_correlations,
        title="Correlations between features",
        width=1200,
        height=1200,
    )
    fig.show()


We see that some variables are barely correlated with `TARGET` (e.g. `abs(corr("TARGET", "LANDAREA_MEDI")) = 0.00005`), while others are highly correlated with each other (e.g. `abs(corr("LIVINGAPARTMENTS_AVG", "LIVINGAPARTMENTS_MEDI")) = 0.994367`).
We will simplify our dataset by removing the former and, for each highly correlated pair, the variable that is less correlated with `TARGET`.
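
The two filtering rules can be tried on a toy dataset before reading the cells that apply them to the real data. The sketch below is a simplified, standard-library version with hypothetical feature names and values; it reuses the 0.01 and 0.9 thresholds of this notebook but applies the two rules one after the other rather than on a precomputed correlation matrix.

```python
from math import sqrt
from statistics import fmean


def abs_corr(xs, ys):
    """Absolute Pearson correlation between two equally long sequences."""
    x_mean, y_mean = fmean(xs), fmean(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return abs(sxy) / sqrt(sxx * syy)


target = [0, 0, 1, 0, 1, 1, 0, 1]
features = {
    "A": [1, 2, 6, 2, 7, 8, 1, 9],                        # related to the target
    "A_COPY": [1.1, 2.0, 6.2, 2.1, 7.0, 8.1, 1.0, 9.3],   # near-duplicate of A
    "NOISE": [1, 4, 3, 2, 1, 4, 3, 2],                    # unrelated to the target
}
TARGET_MIN, PAIR_MAX = 0.01, 0.9

target_corr = {name: abs_corr(values, target) for name, values in features.items()}

# Rule 1: drop features that are almost uncorrelated with the target.
kept = [name for name in features if target_corr[name] >= TARGET_MIN]

# Rule 2: within each highly correlated pair, drop the one less correlated with the target.
dropped = set()
for i, first in enumerate(kept):
    for second in kept[i + 1:]:
        if abs_corr(features[first], features[second]) > PAIR_MAX:
            weaker = second if target_corr[first] > target_corr[second] else first
            dropped.add(weaker)

selected = [name for name in kept if name not in dropped]
print(selected)
```

`NOISE` fails the first rule, and only one of `A` and `A_COPY` survives the second, so a single feature remains.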


In [54]:
# Let's find variables that are highly de-correlated from TARGET
corr_target_min_threshold = 0.01
highly_decorrelated_from_target = pd.Series(dtype=float)
for col in app_correlations.columns:
    if col != "TARGET" and (
        pd.isnull(app_correlations[col]["TARGET"])
        or abs(app_correlations[col]["TARGET"]) < corr_target_min_threshold
    ):
        highly_decorrelated_from_target[col] = app_correlations[col][
            "TARGET"
        ]

highly_decorrelated_from_target.sort_values()






ORGANIZATION_TYPE_Industry: type 11      0.000023
FLAG_EMP_PHONE                           0.000080
NAME_EDUCATION_TYPE_Incomplete higher    0.000223
FLAG_DOCUMENT_20                         0.000241
AMT_REQ_CREDIT_BUREAU_WEEK               0.000318
                                           ...   
ORGANIZATION_TYPE_Other                  0.009630
ORGANIZATION_TYPE_Kindergarten           0.009678
ORGANIZATION_TYPE_University             0.009729
OCCUPATION_TYPE_Cooking staff            0.009858
NONLIVINGAREA_MODE                       0.009967
Length: 124, dtype: float64

In [55]:
# Let's find variables that have a highly correlated pair
corr_pair_max_threshold = 0.9
highly_correlated = pd.DataFrame(columns=["pair", "correlation"])
for i in range(len(app_correlations.columns)):
    for j in range(i + 1, len(app_correlations.columns)):
        if app_correlations.iloc[i, j] > corr_pair_max_threshold:
            # variables are highly correlated
            if app_correlations.iloc[0, i] > app_correlations.iloc[0, j]:
                # first variable is more correlated with target => we want to keep it
                keep_index = i
                drop_index = j
            else:
                keep_index = j
                drop_index = i

            highly_correlated.loc[app_correlations.columns[drop_index]] = [
                app_correlations.columns[keep_index],
                app_correlations.iloc[i, j],
            ]

highly_correlated.sort_values(by="correlation", ascending=False)


|  | pair | correlation |
|---|---|---|
| NAME_CONTRACT_TYPE_Cash loans | NAME_CONTRACT_TYPE_Revolving loans | 1.0 |
| CODE_GENDER_F | CODE_GENDER_M | 0.999966 |
| OBS_60_CNT_SOCIAL_CIRCLE | OBS_30_CNT_SOCIAL_CIRCLE | 0.998473 |
| FLOORSMIN_MEDI | FLOORSMIN_AVG | 0.99786 |
| COMMONAREA_MEDI | COMMONAREA_AVG | 0.997088 |
| FLOORSMAX_MEDI | FLOORSMAX_AVG | 0.996905 |
| ENTRANCES_MEDI | ENTRANCES_AVG | 0.996889 |
| BASEMENTAREA_MEDI | BASEMENTAREA_AVG | 0.995606 |
| ELEVATORS_MEDI | ELEVATORS_AVG | 0.994684 |
| YEARS_BEGINEXPLUATATION_AVG | YEARS_BEGINEXPLUATATION_MEDI | 0.993357 |


In [56]:
# Drop irrelevant columns
simple_app_df.drop(
    columns=highly_decorrelated_from_target.index,
    inplace=True,
    errors="ignore",
)
simple_app_df.drop(
    columns=highly_correlated.index,
    inplace=True,
    errors="ignore",
)

simple_app_df.describe(include="all")


Unnamed: 0,TARGET,FLAG_OWN_CAR,AMT_ANNUITY,AMT_GOODS_PRICE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,OWN_CAR_AGE,...,ORGANIZATION_TYPE_Security Ministries,ORGANIZATION_TYPE_Self-employed,ORGANIZATION_TYPE_Transport: type 3,FONDKAPREMONT_MODE_org spec account,FONDKAPREMONT_MODE_reg oper account,FONDKAPREMONT_MODE_reg oper spec account,WALLSMATERIAL_MODE_Monolithic,WALLSMATERIAL_MODE_Panel,"WALLSMATERIAL_MODE_Stone, brick",EMERGENCYSTATE_MODE_No
count,291607.0,291607.0,291607.0,291607.0,291607.0,291607.0,291607.0,291607.0,291607.0,291607.0,...,291607.0,291607.0,291607.0,291607.0,291607.0,291607.0,291607.0,291607.0,291607.0,291607.0
mean,-0.060475,7.8e-05,0.024355,-0.027241,0.004438,0.00182,-0.006302,0.000663,-0.002677,-0.001204,...,0.002856,-0.00095,3.1e-05,-0.000999,0.0031,0.000863,0.000549,0.007472,-0.000731,0.004482
std,0.454505,1.000022,1.01878,0.992466,1.008019,0.996529,0.999733,0.999977,1.005276,0.620734,...,1.015967,0.999082,1.000226,0.996431,1.001875,1.002043,1.003506,1.005153,0.999487,0.999803
min,-1.0,-0.773443,-1.764798,-1.363165,-1.490532,-2.848137,-6.650674,-5.623359,-2.901039,-0.997523,...,-0.088028,-0.423978,-0.068462,-0.137466,-0.563065,-0.202467,-0.077782,-0.522845,-0.51953,-1.042947
25%,0.0,-0.773443,-0.7207,-0.832872,-0.785686,-0.757838,-0.350313,-0.711168,-0.919015,-0.173365,...,-0.088028,-0.423978,-0.068462,-0.137466,-0.563065,-0.202467,-0.077782,-0.522845,-0.51953,-1.042947
50%,0.0,-0.773443,-0.120735,-0.266422,-0.146533,0.053979,0.307902,0.113832,-0.053076,-0.01917,...,-0.088028,-0.423978,-0.068462,-0.137466,-0.563065,-0.202467,-0.077782,-0.522845,-0.51953,0.958821
75%,0.0,1.29292,0.554533,0.348236,0.562649,0.815826,0.686044,0.86168,0.870944,0.092404,...,-0.088028,-0.423978,-0.068462,-0.137466,-0.563065,-0.202467,-0.077782,-0.522845,-0.51953,0.958821
max,1.0,1.29292,15.723622,9.375275,3.731312,2.029318,1.020076,1.42418,1.849086,6.609954,...,11.360059,2.358613,14.606711,7.274508,1.775993,4.939085,12.856491,1.912612,1.924815,0.958821


After simplification, we have drastically reduced the number of variables, keeping only those that passed both correlation filters and are therefore the most likely to help predict `TARGET`.


In [57]:
X_simple = simple_app_df.loc[simple_app_df["TARGET"] >= 0].drop(["TARGET"], axis=1)
y_simple = simple_app_df.loc[simple_app_df["TARGET"] >= 0, "TARGET"]
X_simple_train, X_simple_test, y_simple_train, y_simple_test = train_test_split(
    X_simple, y_simple, test_size=0.2, random_state=42
)



---

# Appendix

The Kaggle Notebooks [Introduction: Home Credit Default Risk Competition](https://www.kaggle.com/willkoehrsen/start-here-a-gentle-introduction) (and the following ones) by [Will Koehrsen](https://www.kaggle.com/willkoehrsen) were a great help in exploring the data.
