******************************************************************************************************************************* 
PROJECT NOTE: Michaël, your manager, encourages you to select one or more Kaggle kernels to ease the exploratory analysis, data preparation and feature engineering needed to build the scoring model. If you do so, you must analyse the kernel(s) and adapt them to make sure they meet the needs of your mission. This is **optional**, but we encourage you to do it so that you can focus on building, optimising and understanding the model.

Since this was not mandatory, I chose not to start from a Kaggle kernel, for the sake of learning. As I am interested in this field and my mentor is an expert in it, I had the opportunity to carry out this project while immersed in a bank branch.

Although it was not required, I carried out an EDA, which in my view is essential to understand the data at our disposal.

This notebook covers the processing and feature engineering of the 'bureau' and 'bureau_balance' tables. Because of notebook execution time, I had to split the loan-history work across several notebooks. The next notebook will be dedicated to the 'previous_application.csv' table.
*******************************************************************************************************************************

# IMPLEMENT A SCORING MODEL

# Building the scoring model

### Context

You are a Data Scientist at a financial company named "Prêt à dépenser", which offers consumer credit to people with little or no loan history.

The company wants to implement a "credit scoring" tool that computes the probability that a client will repay their loan, then classifies the application as granted or refused. It therefore wants to develop a classification algorithm based on varied data sources (behavioural data, data from other financial institutions, etc.).

In addition, customer relationship managers have reported that clients increasingly ask for transparency about credit-granting decisions. This demand for transparency is fully in line with the values the company wants to embody.

Prêt à dépenser has therefore decided to develop an interactive dashboard so that customer relationship managers can explain credit-granting decisions as transparently as possible, and so that clients can access and easily explore their personal information.

### Missions
- **Mission 1: Build a scoring model that automatically predicts a client's probability of default.**
- **Mission 2: Build an interactive dashboard for customer relationship managers that helps interpret the model's predictions and improves their knowledge of the client.**
- **Mission 3: Deploy the scoring model to production behind an API, together with the interactive dashboard that calls the API for predictions.**

In [1]:
# Standard libraries for the EDA

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import mode

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# PREPARING THE DATASETS TO JOIN TO THE TRAINING AND TEST SETS

# 1. The 'bureau' and 'bureau_balance' datasets

## 1.1. Main information on the 2 files

### 1. Loading and copying the 2 datasets

In [2]:
# Load the 'bureau_balance' file
file_1 = pd.read_csv("bureau_balance.csv", sep=",")
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
file_1.head()

Unnamed: 0,SK_ID_BUREAU,MONTHS_BALANCE,STATUS
0,5715448,0,C
1,5715448,-1,C
2,5715448,-2,C
3,5715448,-3,C
4,5715448,-4,C


In [3]:
# Working copy of the file
balance = file_1.copy()

In [4]:
# Information on the dataset
balance.info(verbose = True, show_counts = True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27299925 entries, 0 to 27299924
Data columns (total 3 columns):
 #   Column          Non-Null Count     Dtype 
---  ------          --------------     ----- 
 0   SK_ID_BUREAU    27299925 non-null  int64 
 1   MONTHS_BALANCE  27299925 non-null  int64 
 2   STATUS          27299925 non-null  object
dtypes: int64(2), object(1)
memory usage: 624.8+ MB


In [5]:
# Check that there are no duplicates
print(f"This first dataset contains {balance.duplicated().sum()} duplicate(s).")

This first dataset contains 0 duplicate(s).


In [6]:
# Content of the 'STATUS' variable, which looks interesting
print("Content of the 'STATUS' variable")
print(balance["STATUS"].unique().tolist())

Content of the 'STATUS' variable
['C', '0', 'X', '1', '2', '3', '5', '4']


**NOTE: Given its content, and after reading its description in the 'description.csv' file provided to us, this variable will not be used. This choice is all the more relevant since the 'bureau' dataset contains a much more precise 'CREDIT_ACTIVE' variable.**

In [7]:
# Load the 'bureau' file
file_2 = pd.read_csv("bureau.csv", sep=",")
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
file_2.head()

Unnamed: 0,SK_ID_CURR,SK_ID_BUREAU,CREDIT_ACTIVE,CREDIT_CURRENCY,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CREDIT_TYPE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
0,215354,5714462,Closed,currency 1,-497,0,-153.0,-153.0,,0,91323.0,0.0,,0.0,Consumer credit,-131,
1,215354,5714463,Active,currency 1,-208,0,1075.0,,,0,225000.0,171342.0,,0.0,Credit card,-20,
2,215354,5714464,Active,currency 1,-203,0,528.0,,,0,464323.5,,,0.0,Consumer credit,-16,
3,215354,5714465,Active,currency 1,-203,0,,,,0,90000.0,,,0.0,Credit card,-16,
4,215354,5714466,Active,currency 1,-629,0,1197.0,,77674.5,0,2700000.0,,,0.0,Consumer credit,-21,


In [8]:
# Working copy of the file
bureau = file_2.copy()

In [9]:
# Information on the dataset
bureau.info(verbose = True, show_counts = True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1716428 entries, 0 to 1716427
Data columns (total 17 columns):
 #   Column                  Non-Null Count    Dtype  
---  ------                  --------------    -----  
 0   SK_ID_CURR              1716428 non-null  int64  
 1   SK_ID_BUREAU            1716428 non-null  int64  
 2   CREDIT_ACTIVE           1716428 non-null  object 
 3   CREDIT_CURRENCY         1716428 non-null  object 
 4   DAYS_CREDIT             1716428 non-null  int64  
 5   CREDIT_DAY_OVERDUE      1716428 non-null  int64  
 6   DAYS_CREDIT_ENDDATE     1610875 non-null  float64
 7   DAYS_ENDDATE_FACT       1082775 non-null  float64
 8   AMT_CREDIT_MAX_OVERDUE  591940 non-null   float64
 9   CNT_CREDIT_PROLONG      1716428 non-null  int64  
 10  AMT_CREDIT_SUM          1716415 non-null  float64
 11  AMT_CREDIT_SUM_DEBT     1458759 non-null  float64
 12  AMT_CREDIT_SUM_LIMIT    1124648 non-null  float64
 13  AMT_CREDIT_SUM_OVERDUE  1716428 non-null  float64
 14  CR

**This dataset contains missing values.**
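To quantify the observation above, the share of missing values per column can be computed directly. The following is a minimal sketch on a toy frame; `toy_bureau` and its values are hypothetical stand-ins for the real 'bureau' table, not data from the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'bureau' table (hypothetical values)
toy_bureau = pd.DataFrame({
    "SK_ID_BUREAU": [1, 2, 3, 4],
    "DAYS_CREDIT_ENDDATE": [-153.0, 1075.0, np.nan, 528.0],
    "AMT_CREDIT_MAX_OVERDUE": [np.nan, np.nan, np.nan, 77674.5],
})

# Share of missing values per column, highest first
missing_share = toy_bureau.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to the real dataframe, the same expression would rank the columns by how much imputation they need.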

In [10]:
# Check that there are no duplicates
print(f"This second dataset contains {bureau.duplicated().sum()} duplicate(s).")

This second dataset contains 0 duplicate(s).


## 1.2. Analysis of the categorical variables and feature engineering

### 1. The 'CREDIT_ACTIVE' variable

**1. CONTENT OF THE VARIABLE**

In [11]:
print("Content of the 'CREDIT_ACTIVE' variable")
print(bureau["CREDIT_ACTIVE"].unique().tolist())

Content of the 'CREDIT_ACTIVE' variable
['Closed', 'Active', 'Sold', 'Bad debt']


**2. MANUAL ENCODING OF THE CREDIT STATUS**

In [12]:
# One-hot count of the status of each credit: one row per (client, credit), one column per status
credit_statut = bureau.groupby(["SK_ID_CURR", "SK_ID_BUREAU", "CREDIT_ACTIVE"]).agg({"CREDIT_ACTIVE": "count"}).unstack() # unstack = pivot: statuses become columns
credit_statut.columns = credit_statut.columns.droplevel(0) # Drop the outer 'CREDIT_ACTIVE' column level
credit_statut.fillna(0, inplace=True)
credit_statut["CREDIT_STATUT"] = credit_statut.sum(axis=1)

# Share of each status in the row total (1.0 for the credit's own status, 0.0 elsewhere)
for col in credit_statut.columns:
    if (col != "CREDIT_STATUT"):
        credit_statut[col] = (credit_statut[col]/credit_statut["CREDIT_STATUT"])
        
# Drop the helper column 'CREDIT_STATUT', which is no longer needed
credit_statut = credit_statut.drop("CREDIT_STATUT", axis=1)

credit_statut.reset_index(inplace=True) 
credit_statut.head()

CREDIT_ACTIVE,SK_ID_CURR,SK_ID_BUREAU,Active,Bad debt,Closed,Sold
0,100001,5896630,0.0,0.0,1.0,0.0
1,100001,5896631,0.0,0.0,1.0,0.0
2,100001,5896632,0.0,0.0,1.0,0.0
3,100001,5896633,0.0,0.0,1.0,0.0
4,100001,5896634,1.0,0.0,0.0,0.0


In [13]:
# Number of credits recorded
print(f"{credit_statut.shape[0]} credits are recorded in the Credit Bureau.")

1716428 credits are recorded in the Credit Bureau.
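The cell above yields one row per credit, which is later summed per client. As an alternative sketch (not the approach used in this notebook), `pd.crosstab` with `normalize="index"` gives the share of each status per client in one step. The toy frame and the name `status_share` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the 'bureau' table (hypothetical values)
toy_bureau = pd.DataFrame({
    "SK_ID_CURR": [100001, 100001, 100001, 100001, 100002, 100002],
    "SK_ID_BUREAU": [1, 2, 3, 4, 5, 6],
    "CREDIT_ACTIVE": ["Closed", "Closed", "Active", "Closed", "Active", "Sold"],
})

# One row per client, one column per status; each row sums to 1
status_share = pd.crosstab(
    toy_bureau["SK_ID_CURR"],
    toy_bureau["CREDIT_ACTIVE"],
    normalize="index",
)
print(status_share)
```

Shares are comparable across clients with different numbers of credits, whereas raw counts are not; which of the two suits the model is a design choice.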


### 2. The 'CREDIT_TYPE' variable

**1. CONTENT OF THE VARIABLE**

In [14]:
print("Content of the 'CREDIT_TYPE' variable")
print(bureau["CREDIT_TYPE"].unique().tolist())

Content of the 'CREDIT_TYPE' variable
['Consumer credit', 'Credit card', 'Mortgage', 'Car loan', 'Microloan', 'Loan for working capital replenishment', 'Loan for business development', 'Real estate loan', 'Unknown type of loan', 'Another type of loan', 'Cash loan (non-earmarked)', 'Loan for the purchase of equipment', 'Mobile operator loan', 'Interbank credit', 'Loan for purchase of shares (margin lending)']


In [15]:
# Number of rows per category
bureau["CREDIT_TYPE"].value_counts()

CREDIT_TYPE
Consumer credit                                 1251615
Credit card                                      402195
Car loan                                          27690
Mortgage                                          18391
Microloan                                         12413
Loan for business development                      1975
Another type of loan                               1017
Unknown type of loan                                555
Loan for working capital replenishment              469
Cash loan (non-earmarked)                            56
Real estate loan                                     27
Loan for the purchase of equipment                   19
Loan for purchase of shares (margin lending)          4
Mobile operator loan                                  1
Interbank credit                                      1
Name: count, dtype: int64

**To try to obtain more balanced categories, the loans will be grouped into 4 broad categories:**
- **Personal loans**: Credit card, Consumer credit, Car loan, Microloan, Cash loan (non-earmarked) and Mobile operator loan
- **Real estate loans**: Mortgage and Real estate loan
- **Business/investment loans**: Loan for business development, Loan for working capital replenishment, Loan for the purchase of equipment, Loan for purchase of shares (margin lending) and Interbank credit
- **Other loans**: Another type of loan and Unknown type of loan

**2. GROUPING THE LOANS**

In [16]:
'''# Disabled cell (kept as a string literal): grouping of the loan types
# Personal loans
personnel = bureau["CREDIT_TYPE"].isin(["Consumer credit", "Credit card", "Car loan", "Microloan",
                                        "Cash loan (non-earmarked)", "Mobile operator loan"])

# Real estate loans
immobilier = bureau["CREDIT_TYPE"].isin(["Mortgage", "Real estate loan"])

# Business loans
affaire = bureau["CREDIT_TYPE"].isin(["Loan for purchase of shares (margin lending)", "Loan for business development",
                                      "Loan for working capital replenishment", "Loan for the purchase of equipment",
                                      "Interbank credit"])

# Other loans
autres_prets = bureau["CREDIT_TYPE"].isin(["Another type of loan", "Unknown type of loan"])

# All masks are computed before any label is overwritten
bureau.loc[personnel, "CREDIT_TYPE"] = "Pret personnel"
bureau.loc[immobilier, "CREDIT_TYPE"] = "Pret immobilier"
bureau.loc[affaire, "CREDIT_TYPE"] = "Pret business"
bureau.loc[autres_prets, "CREDIT_TYPE"] = "Autre pret"'''

In [17]:
'''# Number of rows per loan type
bureau["CREDIT_TYPE"].value_counts()'''

**NOTE**: About 98.7% of the loans are personal loans (1693970 of 1716428, from the counts above). The variable can therefore be treated as near-constant, and the loan types will be dropped from the dataset.
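The grouping and the resulting share can be sketched with a plain mapping dictionary and `Series.map`. This is an illustrative sketch only: `credit_type_groups`, `toy_types` and the toy values are hypothetical, while the group labels are those used in the disabled cell above.

```python
import pandas as pd

# Mapping from raw loan type to broad group (labels taken from the notebook)
credit_type_groups = {
    "Consumer credit": "Pret personnel",
    "Credit card": "Pret personnel",
    "Car loan": "Pret personnel",
    "Microloan": "Pret personnel",
    "Cash loan (non-earmarked)": "Pret personnel",
    "Mobile operator loan": "Pret personnel",
    "Mortgage": "Pret immobilier",
    "Real estate loan": "Pret immobilier",
    "Loan for business development": "Pret business",
    "Loan for working capital replenishment": "Pret business",
    "Loan for the purchase of equipment": "Pret business",
    "Loan for purchase of shares (margin lending)": "Pret business",
    "Interbank credit": "Pret business",
    "Another type of loan": "Autre pret",
    "Unknown type of loan": "Autre pret",
}

# Toy stand-in for bureau["CREDIT_TYPE"] (hypothetical values)
toy_types = pd.Series(
    ["Consumer credit", "Credit card", "Mortgage", "Car loan",
     "Unknown type of loan", "Consumer credit", "Interbank credit", "Consumer credit"],
    name="CREDIT_TYPE",
)

grouped = toy_types.map(credit_type_groups)          # unmapped types would become NaN
group_share = grouped.value_counts(normalize=True)   # share of each broad group
print(group_share)
```

Checking `grouped.isna().sum()` after the mapping is a cheap way to catch a loan type that was left out of the dictionary.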

In [18]:
# Drop the 'CREDIT_TYPE' variable
bureau = bureau.drop("CREDIT_TYPE", axis=1)

### 3. The 'CREDIT_CURRENCY' variable

*This variable indicates that the original currency of the credit has been recoded or categorised in some way for analysis or clarity. It will therefore be encoded with pd.get_dummies if it turns out to be useful.*

**1. ANALYSIS OF THE UNIQUE VALUES**

In [19]:
# Unique values
print(bureau["CREDIT_CURRENCY"].unique().tolist())

['currency 1', 'currency 2', 'currency 4', 'currency 3']


In [20]:
# Share of rows in each category
bureau["CREDIT_CURRENCY"].value_counts(normalize=True)

CREDIT_CURRENCY
currency 1    0.999180
currency 2    0.000713
currency 3    0.000101
currency 4    0.000006
Name: proportion, dtype: float64

**Since 99.92% of the rows fall in 'currency 1', the variable can be treated as near-constant and dropped from the dataset.**
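The same "near-constant" reasoning was applied to 'CREDIT_TYPE' and can be expressed as a small reusable check. This is a sketch under assumptions: the helper `dominant_share`, the `threshold` value and the toy frame are all hypothetical and not part of the notebook.

```python
import pandas as pd

def dominant_share(series: pd.Series) -> float:
    """Return the share of rows taken by the most frequent value."""
    return series.value_counts(normalize=True, dropna=False).iloc[0]

# Toy stand-in for two categorical columns (hypothetical values)
toy = pd.DataFrame({
    "CREDIT_CURRENCY": ["currency 1"] * 99 + ["currency 2"],
    "CREDIT_ACTIVE": ["Closed"] * 60 + ["Active"] * 40,
})

threshold = 0.98  # hypothetical cut-off above which a column is treated as near-constant
near_constant = [col for col in toy.columns if dominant_share(toy[col]) >= threshold]
print(near_constant)
```

The cut-off is a judgment call; a rare category can still carry signal when it is strongly associated with the target.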

**2. DROPPING THE VARIABLE FROM THE DATASET**

In [21]:
bureau = bureau.drop('CREDIT_CURRENCY', axis= 1)

### 4. Aggregation of the categorical data

In [22]:
aggregated_bureau_categorielle = credit_statut.groupby("SK_ID_CURR").agg({"SK_ID_BUREAU":"count", "Active": "sum", "Closed":"sum",
                                                                            "Bad debt": "sum", "Sold": "sum"}).reset_index()

# Rename the columns for clarity
aggregated_bureau_categorielle = aggregated_bureau_categorielle.rename(columns={"SK_ID_BUREAU":"home_total_loans", "Active": "home_active_sum",
                                                                           "Closed":"home_closed_sum", "Bad debt": "home_bad_debt_sum",
                                                                           "Sold": "home_sold_sum"})

aggregated_bureau_categorielle.head()

CREDIT_ACTIVE,SK_ID_CURR,home_total_loans,home_active_sum,home_closed_sum,home_bad_debt_sum,home_sold_sum
0,100001,7,3.0,4.0,0.0,0.0
1,100002,8,2.0,6.0,0.0,0.0
2,100003,4,1.0,3.0,0.0,0.0
3,100004,2,0.0,2.0,0.0,0.0
4,100005,3,2.0,1.0,0.0,0.0


In [23]:
# Check the number of clients (expected: 305811)
print(f'{aggregated_bureau_categorielle.shape[0]} clients are present in the Credit Bureau.')

305811 clients are present in the Credit Bureau.


## 1.3. Analysis of the numerical variables and feature engineering

### 1. The 'DAYS_CREDIT' variable

*In a credit context, this is the time elapsed between a client's current credit application and a previous credit application that was recorded by the Credit Bureau (or any equivalent credit-monitoring body in other countries).*

*This information can help lenders assess a borrower's financial behaviour. For example, frequent credit applications may indicate an unstable financial situation or risky behaviour, which could make lenders more cautious when deciding whether to grant a new credit.*

**1. DESCRIPTION OF THE VARIABLE**

In [24]:
bureau["DAYS_CREDIT"].describe()

count    1.716428e+06
mean    -1.142108e+03
std      7.951649e+02
min     -2.922000e+03
25%     -1.666000e+03
50%     -9.870000e+02
75%     -4.740000e+02
max      0.000000e+00
Name: DAYS_CREDIT, dtype: float64

**Since the number of days is expressed as a negative value, the variable will be multiplied by -1 to obtain positive values.**

**2. FEATURE ENGINEERING**

In [25]:
# Flip the sign of the variable
bureau["DAYS_CREDIT"] = bureau["DAYS_CREDIT"]*-1
bureau.head()

Unnamed: 0,SK_ID_CURR,SK_ID_BUREAU,CREDIT_ACTIVE,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
0,215354,5714462,Closed,497,0,-153.0,-153.0,,0,91323.0,0.0,,0.0,-131,
1,215354,5714463,Active,208,0,1075.0,,,0,225000.0,171342.0,,0.0,-20,
2,215354,5714464,Active,203,0,528.0,,,0,464323.5,,,0.0,-16,
3,215354,5714465,Active,203,0,,,,0,90000.0,,,0.0,-16,
4,215354,5714466,Active,629,0,1197.0,,77674.5,0,2700000.0,,,0.0,-21,


In [26]:
# Description of the modified variable
bureau["DAYS_CREDIT"].describe()

count    1.716428e+06
mean     1.142108e+03
std      7.951649e+02
min      0.000000e+00
25%      4.740000e+02
50%      9.870000e+02
75%      1.666000e+03
max      2.922000e+03
Name: DAYS_CREDIT, dtype: float64

*This variable shows no anomaly. When joining to the training and test sets, it will be aggregated per client with the **MINIMUM** (min).*
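Once the sign is flipped, the per-client minimum is the number of days since the most recent bureau credit. A minimal sketch of that aggregation, on a hypothetical toy frame (`toy_bureau` and `most_recent` are illustrative names):

```python
import pandas as pd

# Toy stand-in for the 'bureau' table (hypothetical values, original negative encoding)
toy_bureau = pd.DataFrame({
    "SK_ID_CURR": [215354, 215354, 215354, 162297],
    "DAYS_CREDIT": [-497, -208, -203, -1896],
})

# Sign flip, as in the notebook: days become positive
toy_bureau["DAYS_CREDIT"] = -toy_bureau["DAYS_CREDIT"]

# Per-client minimum = days since the most recent bureau credit
most_recent = toy_bureau.groupby("SK_ID_CURR")["DAYS_CREDIT"].min()
print(most_recent)
```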

### 2. The 'CREDIT_DAY_OVERDUE' variable

**This variable has no missing values.**

**Meaning of the variable:** the number of days the applicant was overdue on another credit, according to Credit Bureau data.

In [27]:
# Description of the variable
bureau["CREDIT_DAY_OVERDUE"].describe()

count    1.716428e+06
mean     8.181666e-01
std      3.654443e+01
min      0.000000e+00
25%      0.000000e+00
50%      0.000000e+00
75%      0.000000e+00
max      2.792000e+03
Name: CREDIT_DAY_OVERDUE, dtype: float64

*Since the number of days goes up to 2792 and more than 75% of the values are 0 days, I think it is preferable to keep this variable in days. The aggregation will use the **MEAN** (mean).*
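The "more than 75% of zeros" reading of the quartiles can be checked directly, alongside the planned per-client mean. This is a sketch on hypothetical toy values; `zero_share` and `mean_overdue` are illustrative names.

```python
import pandas as pd

# Toy stand-in for the 'bureau' table (hypothetical values)
toy_bureau = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 1, 2, 2],
    "CREDIT_DAY_OVERDUE": [0, 0, 30, 0, 0],
})

# Share of credits with no overdue days at all
zero_share = (toy_bureau["CREDIT_DAY_OVERDUE"] == 0).mean()

# Planned aggregation: mean number of overdue days per client
mean_overdue = toy_bureau.groupby("SK_ID_CURR")["CREDIT_DAY_OVERDUE"].mean()
print(zero_share)
print(mean_overdue)
```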

### 3. The 'DAYS_CREDIT_ENDDATE' and 'DAYS_ENDDATE_FACT' variables

**WARNING: These 2 variables have missing values.**

*These 2 variables are more or less correlated. 'DAYS_CREDIT_ENDDATE' is the number of days remaining on a credit, and 'DAYS_ENDDATE_FACT' only concerns closed credits.*

**1. DESCRIPTION OF THE 2 VARIABLES**

In [28]:
# The 'DAYS_CREDIT_ENDDATE' and 'DAYS_ENDDATE_FACT' variables
print("Description of the DAYS_CREDIT_ENDDATE variable")
print(f'{bureau["DAYS_CREDIT_ENDDATE"].describe()}')
print("---------------------------------------------------------------------")
print("Description of the DAYS_ENDDATE_FACT variable")
print(f'{bureau["DAYS_ENDDATE_FACT"].describe()}')

Description of the DAYS_CREDIT_ENDDATE variable
count    1.610875e+06
mean     5.105174e+02
std      4.994220e+03
min     -4.206000e+04
25%     -1.138000e+03
50%     -3.300000e+02
75%      4.740000e+02
max      3.119900e+04
Name: DAYS_CREDIT_ENDDATE, dtype: float64
---------------------------------------------------------------------
Description of the DAYS_ENDDATE_FACT variable
count    1.082775e+06
mean    -1.017437e+03
std      7.140106e+02
min     -4.202300e+04
25%     -1.489000e+03
50%     -8.970000e+02
75%     -4.250000e+02
max      0.000000e+00
Name: DAYS_ENDDATE_FACT, dtype: float64


**REMARKS ON THE 'DAYS_CREDIT_ENDDATE' VARIABLE**
- **Abnormal negative values are present**
- **Missing values will be imputed with -2**

**REMARKS ON THE 'DAYS_ENDDATE_FACT' VARIABLE**
- **No visible anomaly**
- **No imputation of the missing values**

**2. THE 'DAYS_CREDIT_ENDDATE' VARIABLE**

In [29]:
# Dataframe of the credits with a negative 'DAYS_CREDIT_ENDDATE'
negative_enddate = bureau[bureau["DAYS_CREDIT_ENDDATE"] < 0]
negative_enddate.head()

# Group by the value of 'DAYS_CREDIT_ENDDATE': the row count is the number of distinct negative values
nb_negative_enddate = negative_enddate.groupby("DAYS_CREDIT_ENDDATE").agg({"SK_ID_BUREAU":"nunique"})
nb_negative_enddate.shape[0]

2924

*The grouping is keyed on the value of 'DAYS_CREDIT_ENDDATE', so 2924 is the number of distinct negative values, not the number of credits affected. The number of credits affected is far larger: the cells below count 927109 closed credits with a negative value.*

*On the first row of the dataset, the negative value appears on a closed credit, with exactly the same value as in 'DAYS_ENDDATE_FACT'. In that case, it seems entirely relevant to replace the value with 0. A dedicated dataframe will therefore be built for the closed credits.*

In [30]:
# Dataframe of the closed credits with a negative 'DAYS_CREDIT_ENDDATE'
closed_home = bureau[(bureau["DAYS_CREDIT_ENDDATE"] < 0) & (bureau["CREDIT_ACTIVE"] == "Closed")]
closed_home.head()

Unnamed: 0,SK_ID_CURR,SK_ID_BUREAU,CREDIT_ACTIVE,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
0,215354,5714462,Closed,497,0,-153.0,-153.0,,0,91323.0,0.0,,0.0,-131,
7,162297,5714469,Closed,1896,0,-1684.0,-1710.0,14985.0,0,76878.45,0.0,0.0,0.0,-1710,
8,162297,5714470,Closed,1146,0,-811.0,-840.0,0.0,0,103007.7,0.0,0.0,0.0,-840,
11,162297,5714473,Closed,2456,0,-629.0,-825.0,,0,675000.0,0.0,0.0,0.0,-706,
14,238881,5714482,Closed,318,0,-187.0,-187.0,,0,0.0,0.0,0.0,0.0,-185,


In [31]:
# Number of credits concerned
closed_home.shape[0]

927109

*As a precaution, the 'DAYS_CREDIT_ENDDATE' value of all these closed credits will be replaced by 0.*

In [32]:
# Replace the value with 0 (vectorised assignment; a row-by-row loop over 927109 indices is very slow)
bureau.loc[closed_home.index, "DAYS_CREDIT_ENDDATE"] = 0

*The dataset also contains sold credits, which can be considered closed. For these credits, the values will also be set to zero.*

In [33]:
# Dataframe of the sold credits with a negative 'DAYS_CREDIT_ENDDATE'
credit_sold = bureau[(bureau["CREDIT_ACTIVE"] == "Sold") & (bureau["DAYS_CREDIT_ENDDATE"] < 0)]
credit_sold.head()                                                                             

Unnamed: 0,SK_ID_CURR,SK_ID_BUREAU,CREDIT_ACTIVE,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
541,161678,5715165,Sold,1319,0,-1134.0,-75.0,133308.81,0,202500.0,0.0,0.0,0.0,-71,
746,147546,5715420,Sold,1337,0,-239.0,,,0,477000.0,0.0,,0.0,-280,
1533,163163,5716347,Sold,2047,0,-1499.0,,,0,153405.0,,,0.0,-1957,
2493,351664,5717545,Sold,1911,0,-815.0,-134.0,42988.5,0,222750.0,,,0.0,-22,
3832,249801,5719129,Sold,1057,0,-692.0,,,0,405000.0,,,0.0,-1051,40504.5


In [34]:
# Replace the value with 0 (vectorised assignment)
bureau.loc[credit_sold.index, "DAYS_CREDIT_ENDDATE"] = 0

*REMINDER: this variable has missing values. For closed or sold credits, they will be imputed with 0.*

In [35]:
# Dataframe of the closed credits whose 'DAYS_CREDIT_ENDDATE' is missing
nan_closed = bureau[(bureau["DAYS_CREDIT_ENDDATE"].isna()) & (bureau["CREDIT_ACTIVE"] == "Closed")]
nan_closed.head()

Unnamed: 0,SK_ID_CURR,SK_ID_BUREAU,CREDIT_ACTIVE,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
32,136226,5714503,Closed,559,0,,-355.0,0.0,0,110250.0,0.0,0.0,0.0,-351,
179,373324,5714689,Closed,1805,0,,-320.0,0.0,0,13500.0,0.0,0.0,0.0,-229,
206,312983,5714725,Closed,1278,0,,-244.0,0.0,0,66780.0,0.0,0.0,0.0,-211,
383,273814,5714968,Closed,186,0,,-118.0,,0,1935000.0,,,0.0,-111,
409,414912,5715003,Closed,1036,0,,-510.0,,0,157500.0,0.0,,0.0,-510,


In [36]:
# Replace the missing values with 0 (vectorised assignment)
bureau.loc[nan_closed.index, "DAYS_CREDIT_ENDDATE"] = 0

In [37]:
# Dataframe of the sold credits whose 'DAYS_CREDIT_ENDDATE' is missing
nan_sold = bureau[(bureau["DAYS_CREDIT_ENDDATE"].isna()) & (bureau["CREDIT_ACTIVE"] == "Sold")]
nan_sold.head()

Unnamed: 0,SK_ID_CURR,SK_ID_BUREAU,CREDIT_ACTIVE,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
6708,289078,5722540,Sold,792,0,,-206.0,0.0,0,256500.0,0.0,0.0,0.0,-206,
7168,125741,5723103,Sold,1725,0,,,,0,2700000.0,,,0.0,-1657,
19667,296367,5193890,Sold,447,0,,,,0,225000.0,236713.5,,0.0,-436,
20667,445702,5195111,Sold,2905,0,,-71.0,15664.5,0,292500.0,0.0,0.0,0.0,-71,
22483,246683,5197283,Sold,2726,0,,-328.0,2515.5,0,45000.0,0.0,0.0,0.0,-328,


In [38]:
# Replace the missing values with 0 (vectorised assignment)
bureau.loc[nan_sold.index, "DAYS_CREDIT_ENDDATE"] = 0
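The four replacement steps above (closed or sold credits, with a negative or a missing value) share the same rule, so they can be condensed into a single boolean mask. The following is a sketch on a hypothetical toy frame; `ended`, `end_date` and `to_zero` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'bureau' table (hypothetical values)
toy_bureau = pd.DataFrame({
    "CREDIT_ACTIVE": ["Closed", "Sold", "Closed", "Sold", "Active", "Active", "Closed"],
    "DAYS_CREDIT_ENDDATE": [-153.0, -1134.0, np.nan, np.nan, -484.0, 1075.0, 200.0],
})

ended = toy_bureau["CREDIT_ACTIVE"].isin(["Closed", "Sold"])
end_date = toy_bureau["DAYS_CREDIT_ENDDATE"]

# Closed or sold credits whose end date is negative or missing
to_zero = ended & (end_date.isna() | (end_date < 0))
toy_bureau.loc[to_zero, "DAYS_CREDIT_ENDDATE"] = 0
print(toy_bureau)
```

Active credits and closed credits with a positive end date are left untouched, which matches the behaviour of the separate cells.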

In [39]:
# Dataframe of the remaining negative values
reste_negative_enddate = bureau[bureau["DAYS_CREDIT_ENDDATE"] < -0.00001]
reste_negative_enddate.head()

Unnamed: 0,SK_ID_CURR,SK_ID_BUREAU,CREDIT_ACTIVE,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
9,162297,5714471,Active,1146,0,-484.0,,0.0,0,4500.0,0.0,0.0,0.0,-690,
10,162297,5714472,Active,1146,0,-180.0,,0.0,0,337500.0,0.0,0.0,0.0,-690,
34,400486,5714506,Active,941,0,-17.0,,,0,40500.0,0.0,0.0,0.0,-15,
50,452585,5714525,Active,2538,0,-1427.0,,0.0,0,45000.0,0.0,0.0,0.0,-682,
51,452585,5714527,Active,42,0,-26.0,,,0,45000.0,54000.0,0.0,0.0,-31,


In [40]:
# Number of credits concerned
print(f'This anomaly concerns {reste_negative_enddate.shape[0]} credits.')

This anomaly concerns 76378 credits.


*IMPORTANT: For this variable, the aggregation will use the **MEAN** (mean). From a business standpoint, it is out of the question to impute with a value that could legitimately occur in the data. All variables that should only hold positive values will therefore be imputed with the dummy value -2.*

In [41]:
# Impute the remaining negative values with -2 (vectorised assignment)
bureau.loc[reste_negative_enddate.index, "DAYS_CREDIT_ENDDATE"] = -2

In [42]:
# Impute the missing values with -2
bureau["DAYS_CREDIT_ENDDATE"] = bureau["DAYS_CREDIT_ENDDATE"].fillna(value=-2)

In [43]:
# Description of the variable
bureau["DAYS_CREDIT_ENDDATE"].describe()

count    1.716428e+06
mean     1.092133e+03
std      4.623290e+03
min     -2.000000e+00
25%      0.000000e+00
50%      0.000000e+00
75%      3.890000e+02
max      3.119900e+04
Name: DAYS_CREDIT_ENDDATE, dtype: float64

### 4. The 'DAYS_ENDDATE_FACT' variable

*This variable gives the closing date of a credit. The aggregation will use the **MINIMUM** (min).*

*NOTE: A deeper analysis might reveal anomalies that would call for changing the credit status. However, as I find the credit status variable more reliable, I prefer to rely on it.*
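The deeper analysis mentioned above could start with a consistency check between the credit status and the actual end date. This is only a sketch of one possible check, not something done in this notebook; the toy frame and all variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'bureau' table (hypothetical values)
toy_bureau = pd.DataFrame({
    "SK_ID_BUREAU": [1, 2, 3, 4],
    "CREDIT_ACTIVE": ["Closed", "Active", "Active", "Closed"],
    "DAYS_ENDDATE_FACT": [-153.0, np.nan, -75.0, np.nan],
})

has_fact = toy_bureau["DAYS_ENDDATE_FACT"].notna()
is_active = toy_bureau["CREDIT_ACTIVE"] == "Active"
is_closed = toy_bureau["CREDIT_ACTIVE"] == "Closed"

# Active credits that nevertheless have an actual end date
active_with_end = toy_bureau[is_active & has_fact]
# Closed credits with no actual end date recorded
closed_without_end = toy_bureau[is_closed & ~has_fact]
print(len(active_with_end), len(closed_without_end))
```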

### 5. The 'AMT_CREDIT_MAX_OVERDUE' variable

*This variable has missing values. A simple describe() will be run, and the per-client aggregation will use the **MEAN** (mean).*

In [44]:
# Description of the variable
bureau["AMT_CREDIT_MAX_OVERDUE"].describe()

count    5.919400e+05
mean     3.825418e+03
std      2.060316e+05
min      0.000000e+00
25%      0.000000e+00
50%      0.000000e+00
75%      0.000000e+00
max      1.159872e+08
Name: AMT_CREDIT_MAX_OVERDUE, dtype: float64

*This variable does not seem to show any anomaly.*

In [45]:
# Impute the missing values with -2
bureau["AMT_CREDIT_MAX_OVERDUE"] = bureau["AMT_CREDIT_MAX_OVERDUE"].fillna(value=-2)
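The notebook states earlier that the imputation value must not be one that can occur in the data. That rule can be enforced with a small guard before filling. This is a sketch under assumptions: the helper `impute_with_sentinel` and the toy series are hypothetical, not part of the notebook.

```python
import numpy as np
import pandas as pd

def impute_with_sentinel(series: pd.Series, sentinel: float = -2) -> pd.Series:
    """Fill missing values with a sentinel, refusing a sentinel that already occurs."""
    if (series == sentinel).any():
        raise ValueError(f"sentinel {sentinel} already present in '{series.name}'")
    return series.fillna(sentinel)

# Toy stand-in for bureau["AMT_CREDIT_MAX_OVERDUE"] (hypothetical values)
toy_overdue = pd.Series([0.0, np.nan, 14985.0, np.nan], name="AMT_CREDIT_MAX_OVERDUE")
imputed = impute_with_sentinel(toy_overdue)
print(imputed)
```

Note that a sentinel still enters any later mean or sum, so the aggregated value mixes real amounts with the marker.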

### 6. The 'CNT_CREDIT_PROLONG' variable

*This variable is the number of times a credit has been prolonged. It has no missing values, and a simple describe() will be run. The aggregation will use the **SUM** (sum).*

In [46]:
# Description of the variable
bureau["CNT_CREDIT_PROLONG"].describe()

count    1.716428e+06
mean     6.410406e-03
std      9.622391e-02
min      0.000000e+00
25%      0.000000e+00
50%      0.000000e+00
75%      0.000000e+00
max      9.000000e+00
Name: CNT_CREDIT_PROLONG, dtype: float64

*This variable does not seem to show any anomaly.*

### 7. The 'AMT_CREDIT_SUM' variable

*This variable has missing values, which will be imputed with -2 as before, prior to aggregation. The aggregation will use both the **SUM** (sum) and the **MEAN** (mean).*

In [47]:
# Impute the missing values with -2
bureau["AMT_CREDIT_SUM"] = bureau["AMT_CREDIT_SUM"].fillna(value=-2)

In [48]:
# Description of the variable
bureau["AMT_CREDIT_SUM"].describe()

count    1.716428e+06
mean     3.549919e+05
std      1.149807e+06
min     -2.000000e+00
25%      5.130000e+04
50%      1.255185e+05
75%      3.150000e+05
max      5.850000e+08
Name: AMT_CREDIT_SUM, dtype: float64

*The imputation was applied correctly.*
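The aggregation choices announced so far (sum and mean for 'AMT_CREDIT_SUM', sum for 'CNT_CREDIT_PROLONG') can be sketched with pandas named aggregation. The output column names and the toy values below are hypothetical; the notebook's actual aggregation cell may differ.

```python
import pandas as pd

# Toy stand-in for the 'bureau' table (hypothetical values)
toy_bureau = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "AMT_CREDIT_SUM": [91323.0, 225000.0, 90000.0],
    "CNT_CREDIT_PROLONG": [0, 1, 0],
})

# One row per client, one named output column per (input column, function) pair
per_client = toy_bureau.groupby("SK_ID_CURR").agg(
    credit_sum_total=("AMT_CREDIT_SUM", "sum"),
    credit_sum_mean=("AMT_CREDIT_SUM", "mean"),
    prolong_total=("CNT_CREDIT_PROLONG", "sum"),
).reset_index()
print(per_client)
```

Named aggregation keeps the output columns flat, which avoids a separate renaming step before the join on 'SK_ID_CURR'.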

### 8. The 'AMT_CREDIT_SUM_DEBT' variable

*This variable has missing values. The aggregation will use the **SUM** (sum).*

In [49]:
bureau["AMT_CREDIT_SUM_DEBT"].describe()

count    1.458759e+06
mean     1.370851e+05
std      6.774011e+05
min     -4.705600e+06
25%      0.000000e+00
50%      0.000000e+00
75%      4.015350e+04
max      1.701000e+08
Name: AMT_CREDIT_SUM_DEBT, dtype: float64

**WARNING**: This is an anomaly. A debt cannot be negative, let alone by such an amount. The negative values will therefore be replaced by -2.

In [50]:
# Dataframe of the negative debts
dette_negative = bureau[bureau["AMT_CREDIT_SUM_DEBT"] < -0.00001]
dette_negative.head(10)

Unnamed: 0,SK_ID_CURR,SK_ID_BUREAU,CREDIT_ACTIVE,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
21,222183,5714491,Active,784,0,1008.0,,0.0,0,0.0,-411.615,411.615,0.0,-694,
87,119939,5714568,Closed,1447,0,0.0,-1272.0,0.0,0,99000.0,-2692.17,137692.17,0.0,-1272,
88,119939,5714569,Closed,1390,0,0.0,-1263.0,0.0,0,135000.0,-149.04,135149.04,0.0,-1263,
89,119939,5714570,Active,1390,0,-2.0,,0.0,0,4500.0,-2.565,2.565,0.0,-691,
125,293201,5714621,Closed,2389,0,0.0,-1780.0,11250.0,0,225000.0,-701.28,225701.28,0.0,-1780,
166,373324,5714674,Active,837,0,-2.0,,0.0,0,0.0,-45.36,45.36,0.0,-685,
225,435368,5714751,Active,2915,0,-2.0,,35.055,0,135000.0,-638.1,135638.1,0.0,-694,
236,228777,5714770,Closed,2264,0,0.0,-1479.0,21161.115,0,337500.0,-15.255,337515.255,0.0,-1202,
297,333498,5714844,Closed,2805,0,0.0,-1224.0,0.0,0,180000.0,-455.805,180455.805,0.0,-1224,
300,333498,5714847,Active,1290,0,-2.0,,0.0,0,135000.0,-45.0,135045.0,0.0,-674,


In [51]:
# Replace negative values with -2 (single vectorized assignment instead of a row-by-row loop)
bureau.loc[dette_negative.index, "AMT_CREDIT_SUM_DEBT"] = -2

In [52]:
# Impute missing values with -2
bureau["AMT_CREDIT_SUM_DEBT"] = bureau["AMT_CREDIT_SUM_DEBT"].fillna(value=-2)

In [53]:
# Describe the variable to check that no negative values remain (other than -2)
bureau["AMT_CREDIT_SUM_DEBT"].describe()

count    1.716428e+06
mean     1.165440e+05
std      6.263618e+05
min     -2.000000e+00
25%      0.000000e+00
50%      0.000000e+00
75%      1.975500e+03
max      1.701000e+08
Name: AMT_CREDIT_SUM_DEBT, dtype: float64
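
The two-step cleaning applied to 'AMT_CREDIT_SUM_DEBT' (negative values, then missing values, both set to -2) could also be written as one small vectorized helper. The following is only a sketch on a toy Series: the helper name `impute_sentinel` and the toy values are assumptions, and it tests `< 0` rather than the `< -0.00001` threshold used in the notebook.

```python
import numpy as np
import pandas as pd

SENTINEL = -2  # same placeholder value as used throughout this notebook


def impute_sentinel(series: pd.Series, sentinel: float = SENTINEL) -> pd.Series:
    """Replace negative and missing values with the sentinel (hypothetical helper)."""
    # mask() replaces values where the condition is True; NaN < 0 is False,
    # so missing values are handled separately by fillna().
    return series.mask(series < 0, sentinel).fillna(sentinel)


# Toy data (not taken from the bureau table)
toy_debt = pd.Series([-411.615, 0.0, np.nan, 40153.5])
cleaned_debt = impute_sentinel(toy_debt)
print(cleaned_debt.tolist())
```

The same helper could in principle be reused for 'AMT_CREDIT_SUM_LIMIT', which receives the same treatment.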

### 9. The 'AMT_CREDIT_SUM_LIMIT' variable

*This indicator reflects how much confidence a bank has in a client's financial capacity. A low limit may indicate higher risk, whereas a higher limit may indicate better creditworthiness. The aggregation will therefore use the **MEAN** (mean).*

In [54]:
# Describe the variable
bureau["AMT_CREDIT_SUM_LIMIT"].describe()

count    1.124648e+06
mean     6.229515e+03
std      4.503203e+04
min     -5.864061e+05
25%      0.000000e+00
50%      0.000000e+00
75%      0.000000e+00
max      4.705600e+06
Name: AMT_CREDIT_SUM_LIMIT, dtype: float64

**WARNING**: Negative values are present again, which is not possible. These negative values will therefore be replaced with -2.

In [55]:
# Build the limite_negative dataframe (rows with a negative limit)
limite_negative = bureau[bureau["AMT_CREDIT_SUM_LIMIT"] < -0.00001]
limite_negative.head()

Unnamed: 0,SK_ID_CURR,SK_ID_BUREAU,CREDIT_ACTIVE,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
1966,319729,5716893,Active,2782,0,1272.0,,24243.48,0,180000.0,680779.44,-5779.44,0.0,-558,
7774,436084,5723858,Sold,2835,0,0.0,-2436.0,822038.895,0,225000.0,785293.605,-110293.605,0.0,-2332,
17054,131849,5190606,Active,366,0,730.0,,0.0,0,45000.0,56413.26,-11413.26,0.0,-158,
17952,442762,5191738,Active,744,0,1078.0,,0.0,0,450000.0,653852.385,-1352.385,0.0,-495,
20661,445702,5195104,Active,685,0,1151.0,,0.0,0,450000.0,463834.305,-13834.305,0.0,-434,


In [56]:
# Replace negative values in the dataset with -2 (single vectorized assignment)
bureau.loc[limite_negative.index, "AMT_CREDIT_SUM_LIMIT"] = -2

In [57]:
# Impute missing values with -2
bureau["AMT_CREDIT_SUM_LIMIT"] = bureau["AMT_CREDIT_SUM_LIMIT"].fillna(value=-2)

In [58]:
# Describe the variable to check that no negative values remain (other than -2)
bureau["AMT_CREDIT_SUM_LIMIT"].describe()

count    1.716428e+06
mean     4.085908e+03
std      3.655866e+04
min     -2.000000e+00
25%     -2.000000e+00
50%      0.000000e+00
75%      0.000000e+00
max      4.705600e+06
Name: AMT_CREDIT_SUM_LIMIT, dtype: float64

### 10. The 'AMT_CREDIT_SUM_OVERDUE' variable

*The aggregation will use the **MEAN** (mean), since a client can hold several credits.*

In [59]:
# Describe the AMT_CREDIT_SUM_OVERDUE variable
bureau["AMT_CREDIT_SUM_OVERDUE"].describe()

count    1.716428e+06
mean     3.791276e+01
std      5.937650e+03
min      0.000000e+00
25%      0.000000e+00
50%      0.000000e+00
75%      0.000000e+00
max      3.756681e+06
Name: AMT_CREDIT_SUM_OVERDUE, dtype: float64

This variable does not contain any anomalies.

### 11. The 'DAYS_CREDIT_UPDATE' variable

*This measure can help lenders judge how recent the Credit Bureau information is and, therefore, how relevant it is when assessing a loan application. Since many far more relevant variables are already available and this one does not play a key role in the client's risk profile, it will not be used.*

### 12. The 'AMT_ANNUITY' variable

*This variable is the periodic amount the borrower must pay for a credit reported to or recorded by the Credit Bureau. The aggregation will therefore use the **MEAN** (mean).*

In [60]:
# Describe the variable
bureau["AMT_ANNUITY"].describe()

count    4.896370e+05
mean     1.571276e+04
std      3.258269e+05
min      0.000000e+00
25%      0.000000e+00
50%      0.000000e+00
75%      1.350000e+04
max      1.184534e+08
Name: AMT_ANNUITY, dtype: float64

*Zeros are not abnormal here: once a credit is closed or sold, there is no annuity left to pay. To make sure this holds everywhere, AMT_ANNUITY will nevertheless be set to 0 for closed and sold credits.*

In [61]:
# Build a dataset containing all closed credits
closed_loan = bureau[bureau["CREDIT_ACTIVE"] == "Closed"]
closed_loan.head()

Unnamed: 0,SK_ID_CURR,SK_ID_BUREAU,CREDIT_ACTIVE,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
0,215354,5714462,Closed,497,0,0.0,-153.0,-2.0,0,91323.0,0.0,-2.0,0.0,-131,
7,162297,5714469,Closed,1896,0,0.0,-1710.0,14985.0,0,76878.45,0.0,0.0,0.0,-1710,
8,162297,5714470,Closed,1146,0,0.0,-840.0,0.0,0,103007.7,0.0,0.0,0.0,-840,
11,162297,5714473,Closed,2456,0,0.0,-825.0,-2.0,0,675000.0,0.0,0.0,0.0,-706,
14,238881,5714482,Closed,318,0,0.0,-187.0,-2.0,0,0.0,0.0,0.0,0.0,-185,


*This already appears to hold, BUT the values will still be set to zero as a precaution.*

In [62]:
# Set the annuity of closed credits to 0 (single vectorized assignment)
bureau.loc[closed_loan.index, "AMT_ANNUITY"] = 0

In [63]:
# Build a dataset containing all sold credits
sold_loan = bureau[bureau["CREDIT_ACTIVE"] == "Sold"]
sold_loan.head()

Unnamed: 0,SK_ID_CURR,SK_ID_BUREAU,CREDIT_ACTIVE,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
541,161678,5715165,Sold,1319,0,0.0,-75.0,133308.81,0,202500.0,0.0,0.0,0.0,-71,
746,147546,5715420,Sold,1337,0,0.0,,-2.0,0,477000.0,0.0,-2.0,0.0,-280,
1077,113548,5715804,Sold,151,0,764.0,,-2.0,0,435973.5,-2.0,-2.0,0.0,-24,
1533,163163,5716347,Sold,2047,0,0.0,,-2.0,0,153405.0,-2.0,-2.0,0.0,-1957,
2493,351664,5717545,Sold,1911,0,0.0,-134.0,42988.5,0,222750.0,-2.0,-2.0,0.0,-22,


In [64]:
# Set the annuity of sold credits to 0 (single vectorized assignment)
bureau.loc[sold_loan.index, "AMT_ANNUITY"] = 0

In [65]:
# Impute missing values with -2
bureau["AMT_ANNUITY"] = bureau["AMT_ANNUITY"].fillna(value=-2)
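
The annuity handling described above (zero for closed and sold credits, -2 for the remaining missing values) can be sketched in one pass with a boolean mask. This is a minimal illustration on a toy frame; the name `toy_bureau` and its values are assumptions, not data from the bureau table.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the bureau table (hypothetical values)
toy_bureau = pd.DataFrame({
    "CREDIT_ACTIVE": ["Closed", "Active", "Sold", "Active"],
    "AMT_ANNUITY": [np.nan, 13500.0, 250.0, np.nan],
})

# Closed or sold credits no longer carry an annuity
is_finished = toy_bureau["CREDIT_ACTIVE"].isin(["Closed", "Sold"])
toy_bureau.loc[is_finished, "AMT_ANNUITY"] = 0

# Remaining missing values (here, active credits) get the -2 placeholder
toy_bureau["AMT_ANNUITY"] = toy_bureau["AMT_ANNUITY"].fillna(-2)
print(toy_bureau)
```

The order matters: zeroing finished credits first means their missing annuities become 0 rather than -2, which matches the sequence of cells above.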

### 13. The 'DAYS_ENDDATE_FACT' variable

*To avoid negative values in the dataset, the 'DAYS_ENDDATE_FACT' variable will be multiplied by -1, and missing values will then be imputed with -2.*

In [66]:
# Multiply by -1
bureau["DAYS_ENDDATE_FACT"] = bureau["DAYS_ENDDATE_FACT"]*-1

In [67]:
# Impute the remaining missing values with -2
bureau["DAYS_ENDDATE_FACT"] = bureau["DAYS_ENDDATE_FACT"].fillna(value=-2)

### 14. Aggregating the numerical information

In [68]:
aggregated_bureau_numerique = bureau.groupby("SK_ID_CURR").agg({"DAYS_CREDIT":"min", "CREDIT_DAY_OVERDUE":"mean", 
                                                            "DAYS_CREDIT_ENDDATE":"mean","DAYS_ENDDATE_FACT":"min","CNT_CREDIT_PROLONG":"sum", 
                                                            "AMT_CREDIT_SUM":["sum","mean"], "AMT_CREDIT_SUM_DEBT":"sum", "AMT_CREDIT_SUM_LIMIT":"mean",
                                                            "AMT_CREDIT_SUM_OVERDUE":"mean", "AMT_CREDIT_MAX_OVERDUE":"mean", "AMT_ANNUITY":"mean"}).reset_index()

aggregated_bureau_numerique.head()

Unnamed: 0_level_0,SK_ID_CURR,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,AMT_CREDIT_MAX_OVERDUE,AMT_ANNUITY
Unnamed: 0_level_1,Unnamed: 1_level_1,min,mean,mean,min,sum,sum,mean,sum,mean,mean,mean,mean
0,100001,49,0.0,441.571429,-2.0,0,1453365.0,207623.571429,596686.5,-0.285714,0.0,-2.0,3545.357143
1,100002,103,0.0,115.625,-2.0,0,865055.565,108131.945625,245775.0,3997.570625,0.0,1049.893125,0.0
2,100003,606,0.0,304.0,-2.0,0,1017400.5,254350.125,0.0,202500.0,0.0,0.0,-0.5
3,100004,408,0.0,0.0,382.0,0,189037.8,94518.9,0.0,0.0,0.0,-1.0,0.0
4,100005,62,0.0,482.0,-2.0,0,657126.0,219042.0,568408.5,0.0,0.0,-1.333333,1420.5


In [69]:
# Flatten the MultiIndex columns
aggregated_bureau_numerique.columns = ['_'.join(col).strip() for col in aggregated_bureau_numerique.columns.values]

aggregated_bureau_numerique.head()

Unnamed: 0,SK_ID_CURR_,DAYS_CREDIT_min,CREDIT_DAY_OVERDUE_mean,DAYS_CREDIT_ENDDATE_mean,DAYS_ENDDATE_FACT_min,CNT_CREDIT_PROLONG_sum,AMT_CREDIT_SUM_sum,AMT_CREDIT_SUM_mean,AMT_CREDIT_SUM_DEBT_sum,AMT_CREDIT_SUM_LIMIT_mean,AMT_CREDIT_SUM_OVERDUE_mean,AMT_CREDIT_MAX_OVERDUE_mean,AMT_ANNUITY_mean
0,100001,49,0.0,441.571429,-2.0,0,1453365.0,207623.571429,596686.5,-0.285714,0.0,-2.0,3545.357143
1,100002,103,0.0,115.625,-2.0,0,865055.565,108131.945625,245775.0,3997.570625,0.0,1049.893125,0.0
2,100003,606,0.0,304.0,-2.0,0,1017400.5,254350.125,0.0,202500.0,0.0,0.0,-0.5
3,100004,408,0.0,0.0,382.0,0,189037.8,94518.9,0.0,0.0,0.0,-1.0,0.0
4,100005,62,0.0,482.0,-2.0,0,657126.0,219042.0,568408.5,0.0,0.0,-1.333333,1420.5


In [70]:
# Rename the columns for clarity
aggregated_bureau_numerique = aggregated_bureau_numerique.rename(columns= {"DAYS_CREDIT_min":"home_DAYS_CREDIT_min",
                                                                          "CREDIT_DAY_OVERDUE_mean":"home_CREDIT_DAY_OVERDUE_mean", "DAYS_CREDIT_ENDDATE_mean":"home_DAYS_CREDIT_ENDDATE_mean",
                                                                          "DAYS_ENDDATE_FACT_min":"home_DAYS_ENDDATE_FACT_min", "CNT_CREDIT_PROLONG_sum":"home_CNT_CREDIT_PROLONG_sum",
                                                                          "AMT_CREDIT_SUM_sum":"home_AMT_CREDIT_SUM_sum", "AMT_CREDIT_SUM_mean":"home_AMT_CREDIT_SUM_mean", 
                                                                           "AMT_CREDIT_SUM_DEBT_sum":"home_AMT_CREDIT_SUM_DEBT_sum", "AMT_CREDIT_SUM_LIMIT_mean":"home_AMT_CREDIT_SUM_LIMIT_mean",
                                                                           "AMT_CREDIT_SUM_OVERDUE_mean":"home_AMT_CREDIT_SUM_OVERDUE_mean", "AMT_CREDIT_MAX_OVERDUE_mean":"home_AMT_CREDIT_MAX_OVERDUE_mean",
                                                                           "AMT_ANNUITY_mean":"home_AMT_ANNUITY_mean"})


# Rename 'SK_ID_CURR_' so that the join can be performed
aggregated_bureau_numerique.rename(columns={"SK_ID_CURR_":"SK_ID_CURR"}, inplace=True)
aggregated_bureau_numerique.head()

Unnamed: 0,SK_ID_CURR,home_DAYS_CREDIT_min,home_CREDIT_DAY_OVERDUE_mean,home_DAYS_CREDIT_ENDDATE_mean,home_DAYS_ENDDATE_FACT_min,home_CNT_CREDIT_PROLONG_sum,home_AMT_CREDIT_SUM_sum,home_AMT_CREDIT_SUM_mean,home_AMT_CREDIT_SUM_DEBT_sum,home_AMT_CREDIT_SUM_LIMIT_mean,home_AMT_CREDIT_SUM_OVERDUE_mean,home_AMT_CREDIT_MAX_OVERDUE_mean,home_AMT_ANNUITY_mean
0,100001,49,0.0,441.571429,-2.0,0,1453365.0,207623.571429,596686.5,-0.285714,0.0,-2.0,3545.357143
1,100002,103,0.0,115.625,-2.0,0,865055.565,108131.945625,245775.0,3997.570625,0.0,1049.893125,0.0
2,100003,606,0.0,304.0,-2.0,0,1017400.5,254350.125,0.0,202500.0,0.0,0.0,-0.5
3,100004,408,0.0,0.0,382.0,0,189037.8,94518.9,0.0,0.0,0.0,-1.0,0.0
4,100005,62,0.0,482.0,-2.0,0,657126.0,219042.0,568408.5,0.0,0.0,-1.333333,1420.5
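
The aggregation, MultiIndex flattening and renaming steps above could be condensed with pandas named aggregation, which produces flat, already-prefixed column names. The following is a sketch on a toy frame with only two of the source columns; the frame `toy_bureau` and its values are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the bureau table (hypothetical values)
toy_bureau = pd.DataFrame({
    "SK_ID_CURR": [100001, 100001, 100002],
    "DAYS_CREDIT": [49, 857, 103],
    "AMT_CREDIT_SUM": [1000.0, 3000.0, 500.0],
})

# Named aggregation: output_column=(input_column, aggregation_function)
toy_agg = (
    toy_bureau.groupby("SK_ID_CURR")
    .agg(
        home_DAYS_CREDIT_min=("DAYS_CREDIT", "min"),
        home_AMT_CREDIT_SUM_sum=("AMT_CREDIT_SUM", "sum"),
        home_AMT_CREDIT_SUM_mean=("AMT_CREDIT_SUM", "mean"),
    )
    .reset_index()
)
print(toy_agg)
```

With this form, 'SK_ID_CURR' keeps its name after `reset_index()`, so no trailing underscore has to be removed before the join.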


## 1.4. Merging the 2 aggregated dataframes

### 1. The join itself

In [71]:
home_credit = pd.merge(aggregated_bureau_categorielle, aggregated_bureau_numerique, on="SK_ID_CURR", how="left")
home_credit.head()

Unnamed: 0,SK_ID_CURR,home_total_loans,home_active_sum,home_closed_sum,home_bad_debt_sum,home_sold_sum,home_DAYS_CREDIT_min,home_CREDIT_DAY_OVERDUE_mean,home_DAYS_CREDIT_ENDDATE_mean,home_DAYS_ENDDATE_FACT_min,home_CNT_CREDIT_PROLONG_sum,home_AMT_CREDIT_SUM_sum,home_AMT_CREDIT_SUM_mean,home_AMT_CREDIT_SUM_DEBT_sum,home_AMT_CREDIT_SUM_LIMIT_mean,home_AMT_CREDIT_SUM_OVERDUE_mean,home_AMT_CREDIT_MAX_OVERDUE_mean,home_AMT_ANNUITY_mean
0,100001,7,3.0,4.0,0.0,0.0,49,0.0,441.571429,-2.0,0,1453365.0,207623.571429,596686.5,-0.285714,0.0,-2.0,3545.357143
1,100002,8,2.0,6.0,0.0,0.0,103,0.0,115.625,-2.0,0,865055.565,108131.945625,245775.0,3997.570625,0.0,1049.893125,0.0
2,100003,4,1.0,3.0,0.0,0.0,606,0.0,304.0,-2.0,0,1017400.5,254350.125,0.0,202500.0,0.0,0.0,-0.5
3,100004,2,0.0,2.0,0.0,0.0,408,0.0,0.0,382.0,0,189037.8,94518.9,0.0,0.0,0.0,-1.0,0.0
4,100005,3,2.0,1.0,0.0,0.0,62,0.0,482.0,-2.0,0,657126.0,219042.0,568408.5,0.0,0.0,-1.333333,1420.5


In [72]:
# Check the number of clients (expected: 305811)
home_credit.shape[0]

305811
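
Beyond comparing row counts, the join can be checked directly with the `validate` and `indicator` parameters of `pd.merge`. This is a sketch on two toy frames; `left`, `right` and their values are assumptions standing in for the categorical and numerical aggregates.

```python
import pandas as pd

# Toy stand-ins for the two aggregated dataframes (hypothetical values)
left = pd.DataFrame({"SK_ID_CURR": [100001, 100002, 100003],
                     "home_total_loans": [7, 8, 4]})
right = pd.DataFrame({"SK_ID_CURR": [100001, 100002, 100003],
                      "home_DAYS_CREDIT_min": [49, 103, 606]})

# validate raises MergeError if a client appears more than once on either side;
# indicator adds a '_merge' column telling where each row was found.
merged = pd.merge(left, right, on="SK_ID_CURR", how="left",
                  validate="one_to_one", indicator=True)

# Rows present on the left only would mean a client without numerical aggregates
n_unmatched = int((merged["_merge"] != "both").sum())
merged = merged.drop(columns="_merge")
print(n_unmatched, merged.shape)
```

A count of zero unmatched rows, together with an unchanged number of rows, would confirm that no client was lost or duplicated by the merge.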

In [73]:
# Dataset information
home_credit.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 305811 entries, 0 to 305810
Data columns (total 18 columns):
 #   Column                            Non-Null Count   Dtype  
---  ------                            --------------   -----  
 0   SK_ID_CURR                        305811 non-null  int64  
 1   home_total_loans                  305811 non-null  int64  
 2   home_active_sum                   305811 non-null  float64
 3   home_closed_sum                   305811 non-null  float64
 4   home_bad_debt_sum                 305811 non-null  float64
 5   home_sold_sum                     305811 non-null  float64
 6   home_DAYS_CREDIT_min              305811 non-null  int64  
 7   home_CREDIT_DAY_OVERDUE_mean      305811 non-null  float64
 8   home_DAYS_CREDIT_ENDDATE_mean     305811 non-null  float64
 9   home_DAYS_ENDDATE_FACT_min        305811 non-null  float64
 10  home_CNT_CREDIT_PROLONG_sum       305811 non-null  int64  
 11  home_AMT_CREDIT_SUM_sum           305811 non-null  f

**All good: no client was lost along the way, and the dataset is complete and entirely numeric.**

In [74]:
# Describe the dataset
home_credit.describe()

Unnamed: 0,SK_ID_CURR,home_total_loans,home_active_sum,home_closed_sum,home_bad_debt_sum,home_sold_sum,home_DAYS_CREDIT_min,home_CREDIT_DAY_OVERDUE_mean,home_DAYS_CREDIT_ENDDATE_mean,home_DAYS_ENDDATE_FACT_min,home_CNT_CREDIT_PROLONG_sum,home_AMT_CREDIT_SUM_sum,home_AMT_CREDIT_SUM_mean,home_AMT_CREDIT_SUM_DEBT_sum,home_AMT_CREDIT_SUM_LIMIT_mean,home_AMT_CREDIT_SUM_OVERDUE_mean,home_AMT_CREDIT_MAX_OVERDUE_mean,home_AMT_ANNUITY_mean
count,305811.0,305811.0,305811.0,305811.0,305811.0,305811.0,305811.0,305811.0,305811.0,305811.0,305811.0,305811.0,305811.0,305811.0,305811.0,305811.0,305811.0,305811.0
mean,278047.300091,5.612709,2.062081,3.529216,6.9e-05,0.021343,490.942608,0.965926,1167.359619,123.521613,0.03598,1992466.0,380731.2,654127.5,3864.032,45.95072,1836.538,2288.027
std,102849.568343,4.430354,1.791724,3.430504,0.008286,0.158325,533.529324,24.957209,2886.230908,380.350867,0.232951,4165820.0,879277.7,1640573.0,20346.7,4956.64,123623.1,23459.76
min,100001.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,-2.0,-2.0,0.0,-2.0,-2.0,-38.0,-2.0,0.0,-2.0,-2.0
25%,188878.5,2.0,1.0,1.0,0.0,0.0,149.0,0.0,27.666667,-2.0,0.0,346967.6,103960.1,0.0,-1.0,0.0,-2.0,-1.0
50%,277895.0,4.0,2.0,3.0,0.0,0.0,305.0,0.0,252.166667,-2.0,0.0,978820.7,197292.9,173443.5,-0.4,0.0,-1.333333,-0.3333333
75%,367184.5,8.0,3.0,5.0,0.0,0.0,623.0,0.0,759.75,-2.0,0.0,2345121.0,397859.3,676761.9,0.0,0.0,29.89208,0.0
max,456255.0,116.0,32.0,108.0,1.0,9.0,2922.0,2776.0,31198.0,2887.0,9.0,1017958000.0,198072300.0,334498300.0,1755000.0,1617404.0,47406120.0,8120712.0


**WARNING:** Because the -2 placeholders were aggregated together with real values, the variable 'home_AMT_CREDIT_SUM_DEBT_sum' has a minimum value below -2 (summing several -2 placeholders for one client yields values such as -6), and several means fall between -2 and 0. As a precaution, every variable with negative values will be re-imputed with -2.
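
To illustrate where such values come from, the sketch below aggregates a toy client whose loans contain -2 placeholders. The client identifier and the amounts are invented for the example.

```python
import pandas as pd

# One hypothetical client with four loans
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 1, 1],
    "AMT_CREDIT_SUM_DEBT": [-2.0, -2.0, -2.0, 0.0],   # three imputed rows, one real zero
    "AMT_CREDIT_SUM_LIMIT": [-2.0, 0.0, 0.0, 0.0],    # one imputed row
})

agg = toy.groupby("SK_ID_CURR").agg(
    {"AMT_CREDIT_SUM_DEBT": "sum", "AMT_CREDIT_SUM_LIMIT": "mean"}
)

debt_sum = agg.loc[1, "AMT_CREDIT_SUM_DEBT"]      # placeholders add up
limit_mean = agg.loc[1, "AMT_CREDIT_SUM_LIMIT"]   # placeholder is diluted by real values
print(debt_sum, limit_mean)
```

The sum accumulates the placeholders while the mean dilutes them, which is why the aggregated columns contain negative values other than -2.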

### 2. Imputing -2 for negative values of the numerical variables

**1. The 'home_DAYS_CREDIT_ENDDATE_mean' variable**

In [75]:
corrected_enddate = home_credit[(home_credit["home_DAYS_CREDIT_ENDDATE_mean"] <-0.00001)]
corrected_enddate.head()

Unnamed: 0,SK_ID_CURR,home_total_loans,home_active_sum,home_closed_sum,home_bad_debt_sum,home_sold_sum,home_DAYS_CREDIT_min,home_CREDIT_DAY_OVERDUE_mean,home_DAYS_CREDIT_ENDDATE_mean,home_DAYS_ENDDATE_FACT_min,home_CNT_CREDIT_PROLONG_sum,home_AMT_CREDIT_SUM_sum,home_AMT_CREDIT_SUM_mean,home_AMT_CREDIT_SUM_DEBT_sum,home_AMT_CREDIT_SUM_LIMIT_mean,home_AMT_CREDIT_SUM_OVERDUE_mean,home_AMT_CREDIT_MAX_OVERDUE_mean,home_AMT_ANNUITY_mean
26,100032,4,1.0,3.0,0.0,0.0,556,0.0,-0.5,-2.0,0,1271160.0,317790.0,0.0,0.0,0.0,-1.0,4359.375
29,100036,1,1.0,0.0,0.0,0.0,889,0.0,-2.0,-2.0,0,94959.0,94959.0,8339.355,-2.0,0.0,-2.0,-2.0
30,100037,8,1.0,7.0,0.0,0.0,1247,0.0,-0.25,-2.0,0,592132.5,74016.5625,0.0,-0.5,0.0,-1.5,-0.25
36,100045,3,1.0,2.0,0.0,0.0,337,0.0,-0.666667,-2.0,0,438054.39,146018.13,89351.46,-0.666667,0.0,-0.666667,-0.666667
42,100053,8,1.0,7.0,0.0,0.0,1764,0.0,-0.25,-2.0,0,675508.5,84438.5625,-6.0,-1.5,0.0,-2.0,-0.25


In [76]:
# Replace negative values with -2 (single vectorized assignment)
home_credit.loc[corrected_enddate.index, "home_DAYS_CREDIT_ENDDATE_mean"] = -2
    
home_credit.head()

Unnamed: 0,SK_ID_CURR,home_total_loans,home_active_sum,home_closed_sum,home_bad_debt_sum,home_sold_sum,home_DAYS_CREDIT_min,home_CREDIT_DAY_OVERDUE_mean,home_DAYS_CREDIT_ENDDATE_mean,home_DAYS_ENDDATE_FACT_min,home_CNT_CREDIT_PROLONG_sum,home_AMT_CREDIT_SUM_sum,home_AMT_CREDIT_SUM_mean,home_AMT_CREDIT_SUM_DEBT_sum,home_AMT_CREDIT_SUM_LIMIT_mean,home_AMT_CREDIT_SUM_OVERDUE_mean,home_AMT_CREDIT_MAX_OVERDUE_mean,home_AMT_ANNUITY_mean
0,100001,7,3.0,4.0,0.0,0.0,49,0.0,441.571429,-2.0,0,1453365.0,207623.571429,596686.5,-0.285714,0.0,-2.0,3545.357143
1,100002,8,2.0,6.0,0.0,0.0,103,0.0,115.625,-2.0,0,865055.565,108131.945625,245775.0,3997.570625,0.0,1049.893125,0.0
2,100003,4,1.0,3.0,0.0,0.0,606,0.0,304.0,-2.0,0,1017400.5,254350.125,0.0,202500.0,0.0,0.0,-0.5
3,100004,2,0.0,2.0,0.0,0.0,408,0.0,0.0,382.0,0,189037.8,94518.9,0.0,0.0,0.0,-1.0,0.0
4,100005,3,2.0,1.0,0.0,0.0,62,0.0,482.0,-2.0,0,657126.0,219042.0,568408.5,0.0,0.0,-1.333333,1420.5


**2. The 'home_DAYS_ENDDATE_FACT_min' variable**

In [77]:
corrected_enddate_fact = home_credit[(home_credit["home_DAYS_ENDDATE_FACT_min"] <-0.00001)]
corrected_enddate_fact.head()

Unnamed: 0,SK_ID_CURR,home_total_loans,home_active_sum,home_closed_sum,home_bad_debt_sum,home_sold_sum,home_DAYS_CREDIT_min,home_CREDIT_DAY_OVERDUE_mean,home_DAYS_CREDIT_ENDDATE_mean,home_DAYS_ENDDATE_FACT_min,home_CNT_CREDIT_PROLONG_sum,home_AMT_CREDIT_SUM_sum,home_AMT_CREDIT_SUM_mean,home_AMT_CREDIT_SUM_DEBT_sum,home_AMT_CREDIT_SUM_LIMIT_mean,home_AMT_CREDIT_SUM_OVERDUE_mean,home_AMT_CREDIT_MAX_OVERDUE_mean,home_AMT_ANNUITY_mean
0,100001,7,3.0,4.0,0.0,0.0,49,0.0,441.571429,-2.0,0,1453365.0,207623.571429,596686.5,-0.285714,0.0,-2.0,3545.357143
1,100002,8,2.0,6.0,0.0,0.0,103,0.0,115.625,-2.0,0,865055.565,108131.945625,245775.0,3997.570625,0.0,1049.893125,0.0
2,100003,4,1.0,3.0,0.0,0.0,606,0.0,304.0,-2.0,0,1017400.5,254350.125,0.0,202500.0,0.0,0.0,-0.5
4,100005,3,2.0,1.0,0.0,0.0,62,0.0,482.0,-2.0,0,657126.0,219042.0,568408.5,0.0,0.0,-1.333333,1420.5
6,100008,3,1.0,2.0,0.0,0.0,78,0.0,157.0,-2.0,0,468445.5,156148.5,240057.0,0.0,0.0,-1.333333,-0.666667


In [78]:
# Replace negative values with -2 (single vectorized assignment)
home_credit.loc[corrected_enddate_fact.index, "home_DAYS_ENDDATE_FACT_min"] = -2

**3. The 'home_AMT_CREDIT_SUM_sum' variable**

In [79]:
corrected_credit_sum_sum =  home_credit[(home_credit["home_AMT_CREDIT_SUM_sum"] <-0.00001)]
corrected_credit_sum_sum.head()

Unnamed: 0,SK_ID_CURR,home_total_loans,home_active_sum,home_closed_sum,home_bad_debt_sum,home_sold_sum,home_DAYS_CREDIT_min,home_CREDIT_DAY_OVERDUE_mean,home_DAYS_CREDIT_ENDDATE_mean,home_DAYS_ENDDATE_FACT_min,home_CNT_CREDIT_PROLONG_sum,home_AMT_CREDIT_SUM_sum,home_AMT_CREDIT_SUM_mean,home_AMT_CREDIT_SUM_DEBT_sum,home_AMT_CREDIT_SUM_LIMIT_mean,home_AMT_CREDIT_SUM_OVERDUE_mean,home_AMT_CREDIT_MAX_OVERDUE_mean,home_AMT_ANNUITY_mean
44248,151540,1,1.0,0.0,0.0,0.0,1067,0.0,777.0,-2.0,0,-2.0,-2.0,0.0,-2.0,0.0,-2.0,31234.5
273336,418331,1,1.0,0.0,0.0,0.0,0,0.0,-2.0,-2.0,0,-2.0,-2.0,2250000.0,-2.0,0.0,-2.0,-2.0


In [80]:
# Replace negative values with -2 (single vectorized assignment)
home_credit.loc[corrected_credit_sum_sum.index, "home_AMT_CREDIT_SUM_sum"] = -2

**4. The 'home_AMT_CREDIT_SUM_mean' variable**

In [81]:
corrected_credit_sum_mean = home_credit[(home_credit["home_AMT_CREDIT_SUM_mean"] <-0.00001)]
corrected_credit_sum_mean.head()

Unnamed: 0,SK_ID_CURR,home_total_loans,home_active_sum,home_closed_sum,home_bad_debt_sum,home_sold_sum,home_DAYS_CREDIT_min,home_CREDIT_DAY_OVERDUE_mean,home_DAYS_CREDIT_ENDDATE_mean,home_DAYS_ENDDATE_FACT_min,home_CNT_CREDIT_PROLONG_sum,home_AMT_CREDIT_SUM_sum,home_AMT_CREDIT_SUM_mean,home_AMT_CREDIT_SUM_DEBT_sum,home_AMT_CREDIT_SUM_LIMIT_mean,home_AMT_CREDIT_SUM_OVERDUE_mean,home_AMT_CREDIT_MAX_OVERDUE_mean,home_AMT_ANNUITY_mean
44248,151540,1,1.0,0.0,0.0,0.0,1067,0.0,777.0,-2.0,0,-2.0,-2.0,0.0,-2.0,0.0,-2.0,31234.5
273336,418331,1,1.0,0.0,0.0,0.0,0,0.0,-2.0,-2.0,0,-2.0,-2.0,2250000.0,-2.0,0.0,-2.0,-2.0


In [82]:
# Replace negative values with -2 (single vectorized assignment)
home_credit.loc[corrected_credit_sum_mean.index, "home_AMT_CREDIT_SUM_mean"] = -2

**5. The 'home_AMT_CREDIT_SUM_DEBT_sum' variable**

In [83]:
corrected_credit_sum_debt = home_credit[(home_credit["home_AMT_CREDIT_SUM_DEBT_sum"] <-0.00001)]
corrected_credit_sum_debt.head()

Unnamed: 0,SK_ID_CURR,home_total_loans,home_active_sum,home_closed_sum,home_bad_debt_sum,home_sold_sum,home_DAYS_CREDIT_min,home_CREDIT_DAY_OVERDUE_mean,home_DAYS_CREDIT_ENDDATE_mean,home_DAYS_ENDDATE_FACT_min,home_CNT_CREDIT_PROLONG_sum,home_AMT_CREDIT_SUM_sum,home_AMT_CREDIT_SUM_mean,home_AMT_CREDIT_SUM_DEBT_sum,home_AMT_CREDIT_SUM_LIMIT_mean,home_AMT_CREDIT_SUM_OVERDUE_mean,home_AMT_CREDIT_MAX_OVERDUE_mean,home_AMT_ANNUITY_mean
9,100011,4,0.0,4.0,0.0,0.0,1309,0.0,0.0,968.0,0,435228.3,108807.075,-2.0,-0.5,0.0,2535.8075,0.0
10,100013,4,0.0,4.0,0.0,0.0,1210,0.0,0.0,549.0,0,2072280.06,518070.015,-6.0,-2.0,0.0,4824.75,0.0
12,100015,4,0.0,4.0,0.0,0.0,319,0.0,0.0,8.0,0,409495.5,102373.875,-6.0,-1.5,0.0,-2.0,0.0
14,100017,6,0.0,6.0,0.0,0.0,909,0.0,32.833333,738.0,0,859770.0,143295.0,-6.0,-1.0,0.0,-1.0,0.0
42,100053,8,1.0,7.0,0.0,0.0,1764,0.0,-2.0,-2.0,0,675508.5,84438.5625,-6.0,-1.5,0.0,-2.0,-0.25


In [84]:
# Replace negative values with -2 (single vectorized assignment)
home_credit.loc[corrected_credit_sum_debt.index, "home_AMT_CREDIT_SUM_DEBT_sum"] = -2

**6. The 'home_AMT_CREDIT_SUM_LIMIT_mean' variable**

In [85]:
corrected_credit_sum_limit = home_credit[(home_credit["home_AMT_CREDIT_SUM_LIMIT_mean"] <-0.00001)]
corrected_credit_sum_limit.head()

Unnamed: 0,SK_ID_CURR,home_total_loans,home_active_sum,home_closed_sum,home_bad_debt_sum,home_sold_sum,home_DAYS_CREDIT_min,home_CREDIT_DAY_OVERDUE_mean,home_DAYS_CREDIT_ENDDATE_mean,home_DAYS_ENDDATE_FACT_min,home_CNT_CREDIT_PROLONG_sum,home_AMT_CREDIT_SUM_sum,home_AMT_CREDIT_SUM_mean,home_AMT_CREDIT_SUM_DEBT_sum,home_AMT_CREDIT_SUM_LIMIT_mean,home_AMT_CREDIT_SUM_OVERDUE_mean,home_AMT_CREDIT_MAX_OVERDUE_mean,home_AMT_ANNUITY_mean
0,100001,7,3.0,4.0,0.0,0.0,49,0.0,441.571429,-2.0,0,1453365.0,207623.571429,596686.5,-0.285714,0.0,-2.0,3545.357143
7,100009,18,4.0,14.0,0.0,0.0,239,0.0,143.277778,-2.0,0,4800811.5,266711.75,1077341.5,-0.777778,0.0,-1.555556,-0.444444
8,100010,2,1.0,1.0,0.0,0.0,1138,0.0,344.5,-2.0,0,990000.0,495000.0,348007.5,-1.0,0.0,-2.0,-1.0
9,100011,4,0.0,4.0,0.0,0.0,1309,0.0,0.0,968.0,0,435228.3,108807.075,-2.0,-0.5,0.0,2535.8075,0.0
10,100013,4,0.0,4.0,0.0,0.0,1210,0.0,0.0,549.0,0,2072280.06,518070.015,-2.0,-2.0,0.0,4824.75,0.0


In [86]:
# Replace negative values with -2 (single vectorized assignment)
home_credit.loc[corrected_credit_sum_limit.index, "home_AMT_CREDIT_SUM_LIMIT_mean"] = -2

**7. The 'home_AMT_CREDIT_MAX_OVERDUE_mean' variable**

In [87]:
corrected_credit_max_overdue =  home_credit[(home_credit["home_AMT_CREDIT_MAX_OVERDUE_mean"] <-0.00001)]
corrected_credit_max_overdue.head()

Unnamed: 0,SK_ID_CURR,home_total_loans,home_active_sum,home_closed_sum,home_bad_debt_sum,home_sold_sum,home_DAYS_CREDIT_min,home_CREDIT_DAY_OVERDUE_mean,home_DAYS_CREDIT_ENDDATE_mean,home_DAYS_ENDDATE_FACT_min,home_CNT_CREDIT_PROLONG_sum,home_AMT_CREDIT_SUM_sum,home_AMT_CREDIT_SUM_mean,home_AMT_CREDIT_SUM_DEBT_sum,home_AMT_CREDIT_SUM_LIMIT_mean,home_AMT_CREDIT_SUM_OVERDUE_mean,home_AMT_CREDIT_MAX_OVERDUE_mean,home_AMT_ANNUITY_mean
0,100001,7,3.0,4.0,0.0,0.0,49,0.0,441.571429,-2.0,0,1453365.0,207623.571429,596686.5,-2.0,0.0,-2.0,3545.357143
3,100004,2,0.0,2.0,0.0,0.0,408,0.0,0.0,382.0,0,189037.8,94518.9,0.0,0.0,0.0,-1.0,0.0
4,100005,3,2.0,1.0,0.0,0.0,62,0.0,482.0,-2.0,0,657126.0,219042.0,568408.5,0.0,0.0,-1.333333,1420.5
6,100008,3,1.0,2.0,0.0,0.0,78,0.0,157.0,-2.0,0,468445.5,156148.5,240057.0,0.0,0.0,-1.333333,-0.666667
7,100009,18,4.0,14.0,0.0,0.0,239,0.0,143.277778,-2.0,0,4800811.5,266711.75,1077341.5,-2.0,0.0,-1.555556,-0.444444


In [88]:
# Replace negative values with -2 (single vectorized assignment)
home_credit.loc[corrected_credit_max_overdue.index, "home_AMT_CREDIT_MAX_OVERDUE_mean"] = -2

**8. The 'home_AMT_ANNUITY_mean' variable**

In [89]:
corrected_amt_annuity =  home_credit[(home_credit["home_AMT_ANNUITY_mean"] <-0.00001)]
corrected_amt_annuity.head()

Unnamed: 0,SK_ID_CURR,home_total_loans,home_active_sum,home_closed_sum,home_bad_debt_sum,home_sold_sum,home_DAYS_CREDIT_min,home_CREDIT_DAY_OVERDUE_mean,home_DAYS_CREDIT_ENDDATE_mean,home_DAYS_ENDDATE_FACT_min,home_CNT_CREDIT_PROLONG_sum,home_AMT_CREDIT_SUM_sum,home_AMT_CREDIT_SUM_mean,home_AMT_CREDIT_SUM_DEBT_sum,home_AMT_CREDIT_SUM_LIMIT_mean,home_AMT_CREDIT_SUM_OVERDUE_mean,home_AMT_CREDIT_MAX_OVERDUE_mean,home_AMT_ANNUITY_mean
2,100003,4,1.0,3.0,0.0,0.0,606,0.0,304.0,-2.0,0,1017400.5,254350.125,0.0,202500.0,0.0,0.0,-0.5
6,100008,3,1.0,2.0,0.0,0.0,78,0.0,157.0,-2.0,0,468445.5,156148.5,240057.0,0.0,0.0,-2.0,-0.666667
7,100009,18,4.0,14.0,0.0,0.0,239,0.0,143.277778,-2.0,0,4800811.5,266711.75,1077341.5,-2.0,0.0,-2.0,-0.444444
8,100010,2,1.0,1.0,0.0,0.0,1138,0.0,344.5,-2.0,0,990000.0,495000.0,348007.5,-2.0,0.0,-2.0,-1.0
11,100014,8,2.0,6.0,0.0,0.0,376,0.0,184.0,-2.0,0,2729932.425,341241.553125,758208.0,-2.0,0.0,2794.264375,-0.5


In [90]:
# Replace negative values with -2 (single vectorized assignment)
home_credit.loc[corrected_amt_annuity.index, "home_AMT_ANNUITY_mean"] = -2
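
The eight corrections in this section all follow the same pattern, so they could be expressed as a single operation over a list of columns. The sketch below assumes a toy frame with three of the aggregated columns; `toy_home_credit`, `cols_to_fix` and the values are illustrative, and the test is `< 0` rather than the `< -0.00001` threshold used above.

```python
import pandas as pd

# Toy stand-in for the merged dataset (hypothetical values)
toy_home_credit = pd.DataFrame({
    "SK_ID_CURR": [100001, 100003, 100013],
    "home_AMT_CREDIT_SUM_DEBT_sum": [596686.5, 0.0, -6.0],
    "home_AMT_CREDIT_SUM_LIMIT_mean": [-0.285714, 202500.0, -2.0],
    "home_AMT_ANNUITY_mean": [3545.357143, -0.5, 0.0],
})

# Columns that may contain aggregated placeholders
cols_to_fix = [
    "home_AMT_CREDIT_SUM_DEBT_sum",
    "home_AMT_CREDIT_SUM_LIMIT_mean",
    "home_AMT_ANNUITY_mean",
]

# DataFrame.mask replaces every cell where the condition holds
subset = toy_home_credit[cols_to_fix]
toy_home_credit[cols_to_fix] = subset.mask(subset < 0, -2)
print(toy_home_credit)
```

Listing the columns explicitly keeps the identifier column and the variables that are legitimately non-negative out of the operation.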

### 3. Checking and saving the dataset

In [91]:
# Describe the dataset
home_credit.describe()

Unnamed: 0,SK_ID_CURR,home_total_loans,home_active_sum,home_closed_sum,home_bad_debt_sum,home_sold_sum,home_DAYS_CREDIT_min,home_CREDIT_DAY_OVERDUE_mean,home_DAYS_CREDIT_ENDDATE_mean,home_DAYS_ENDDATE_FACT_min,home_CNT_CREDIT_PROLONG_sum,home_AMT_CREDIT_SUM_sum,home_AMT_CREDIT_SUM_mean,home_AMT_CREDIT_SUM_DEBT_sum,home_AMT_CREDIT_SUM_LIMIT_mean,home_AMT_CREDIT_SUM_OVERDUE_mean,home_AMT_CREDIT_MAX_OVERDUE_mean,home_AMT_ANNUITY_mean
count,305811.0,305811.0,305811.0,305811.0,305811.0,305811.0,305811.0,305811.0,305811.0,305811.0,305811.0,305811.0,305811.0,305811.0,305811.0,305811.0,305811.0,305811.0
mean,278047.300091,5.612709,2.062081,3.529216,6.9e-05,0.021343,490.942608,0.965926,1167.288773,123.521613,0.03598,1992466.0,380731.2,654127.7,3863.497,45.95072,1836.258,2287.47
std,102849.568343,4.430354,1.791724,3.430504,0.008286,0.158325,533.529324,24.957209,2886.259593,380.350867,0.232951,4165820.0,879277.7,1640573.0,20346.8,4956.64,123623.1,23459.82
min,100001.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,-2.0,-2.0,0.0,-2.0,-2.0,-2.0,-2.0,0.0,-2.0,-2.0
25%,188878.5,2.0,1.0,1.0,0.0,0.0,149.0,0.0,27.666667,-2.0,0.0,346967.6,103960.1,0.0,-2.0,0.0,-2.0,-2.0
50%,277895.0,4.0,2.0,3.0,0.0,0.0,305.0,0.0,252.166667,-2.0,0.0,978820.7,197292.9,173443.5,-2.0,0.0,-2.0,-2.0
75%,367184.5,8.0,3.0,5.0,0.0,0.0,623.0,0.0,759.75,-2.0,0.0,2345121.0,397859.3,676761.9,0.0,0.0,29.89208,0.0
max,456255.0,116.0,32.0,108.0,1.0,9.0,2922.0,2776.0,31198.0,2887.0,9.0,1017958000.0,198072300.0,334498300.0,1755000.0,1617404.0,47406120.0,8120712.0


In [92]:
# Dataset information
home_credit.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 305811 entries, 0 to 305810
Data columns (total 18 columns):
 #   Column                            Non-Null Count   Dtype  
---  ------                            --------------   -----  
 0   SK_ID_CURR                        305811 non-null  int64  
 1   home_total_loans                  305811 non-null  int64  
 2   home_active_sum                   305811 non-null  float64
 3   home_closed_sum                   305811 non-null  float64
 4   home_bad_debt_sum                 305811 non-null  float64
 5   home_sold_sum                     305811 non-null  float64
 6   home_DAYS_CREDIT_min              305811 non-null  int64  
 7   home_CREDIT_DAY_OVERDUE_mean      305811 non-null  float64
 8   home_DAYS_CREDIT_ENDDATE_mean     305811 non-null  float64
 9   home_DAYS_ENDDATE_FACT_min        305811 non-null  float64
 10  home_CNT_CREDIT_PROLONG_sum       305811 non-null  int64  
 11  home_AMT_CREDIT_SUM_sum           305811 non-null  f

**The home_credit dataset is complete and entirely numeric, and can therefore be used in the rest of the project.**

## CONCLUSION

### The bureau and bureau_balance datasets
**These datasets contain information from other institutions, and only for loans that were granted. The previous_application dataset will therefore play an essential role, since it covers not only credits that are not reported to the Credit Bureau but also refused loans.**

### The previous_application dataset and its associated datasets
- **Analysis and feature engineering of the relevant categorical and numerical variables of the 'previous_application' dataset, in the next notebook.**
- **Analysis and feature engineering of the relevant variables of the last 3 tables, in a separate notebook (the notebook for the 'previous_application' dataset takes more than 6 hours to run).**

### Saving the notebook before analysing the other datasets
**Since this notebook also takes a long time to run, it will be saved as 'Ple_Coline_1_notebook_home_credit_092023' in the deliverables and as 'home_credit' on GitHub.**

In [93]:
# Save the dataset (to_csv returns None when given a path, so the result is not assigned back)
home_credit.to_csv("home_credit.csv", index=False)
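
Since the saved file feeds the following notebooks, a quick round trip can confirm that it reloads identically. This sketch uses a toy frame and a temporary directory; the names `toy`, `path` and `reloaded` are assumptions. It also shows that `DataFrame.to_csv` returns `None` when a path is given, so its result should not be bound to the dataframe's name.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for home_credit (hypothetical values)
toy = pd.DataFrame({"SK_ID_CURR": [100001, 100002], "home_total_loans": [7, 8]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "home_credit.csv")
    result = toy.to_csv(path, index=False)  # writes the file, returns None
    reloaded = pd.read_csv(path)

print(result is None, reloaded.equals(toy))
```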