# 6. Decision Tree and Ensemble Learning
Exercise taken from ML Zoomcamp, module 6: [https://github.com/DataTalksClub/machine-learning-zoomcamp/tree/master/06-trees](https://github.com/DataTalksClub/machine-learning-zoomcamp/tree/master/06-trees)
This follows the course tutorial, with some modifications and comments.

## 6.1 Credit risk scoring project
* Dataset: https://github.com/gastonstat/CreditScoring

In [37]:
import pandas as pd
import numpy as np

import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

## 6.2 Data cleaning and preparation

In [38]:
data_url = 'https://raw.githubusercontent.com/gastonstat/CreditScoring/master/CreditScoring.csv'

In [39]:
import os
filename = 'CreditScoring.csv'
if not os.path.exists(filename):
    os.system(f'wget {data_url} -O {filename}')
    print(f"File {filename} downloaded successfully.")
else:
    # The file is already present, so skip the download
    print(f"File {filename} already exists.")

File CreditScoring.csv already exists.
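
The `wget` call above relies on an external program that may not be installed (for example on Windows). A minimal portable sketch using only the standard library could look like this; `download_if_missing` is a hypothetical helper name, not something from the original notebook.

```python
import os
import urllib.request


def download_if_missing(url: str, filename: str) -> bool:
    """Download `url` to `filename` unless the file already exists.

    Returns True if a download happened, False if the file was already there.
    """
    if os.path.exists(filename):
        return False
    urllib.request.urlretrieve(url, filename)
    return True
```

With the variables defined above, the call would be `download_if_missing(data_url, filename)`.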


In [40]:
!head CreditScoring.csv

"Status","Seniority","Home","Time","Age","Marital","Records","Job","Expenses","Income","Assets","Debt","Amount","Price"
1,9,1,60,30,2,1,3,73,129,0,0,800,846
1,17,1,60,58,3,1,1,48,131,0,0,1000,1658
2,10,2,36,46,2,2,3,90,200,3000,0,2000,2985
1,0,1,60,24,1,1,1,63,182,2500,0,900,1325
1,0,1,36,26,1,1,1,46,107,0,0,310,910
1,1,2,60,36,2,1,1,75,214,3500,0,650,1645
1,29,2,60,44,2,1,1,75,125,10000,0,1600,1800
1,9,5,12,27,1,1,1,35,80,0,0,200,1093
1,0,2,60,32,2,1,3,90,107,15000,0,1200,1957


In [49]:
df = pd.read_csv(filename)
df.head()

,Status,Seniority,Home,Time,Age,Marital,Records,Job,Expenses,Income,Assets,Debt,Amount,Price
0,1,9,1,60,30,2,1,3,73,129,0,0,800,846
1,1,17,1,60,58,3,1,1,48,131,0,0,1000,1658
2,2,10,2,36,46,2,2,3,90,200,3000,0,2000,2985
3,1,0,1,60,24,1,1,1,63,182,2500,0,900,1325
4,1,0,1,36,26,1,1,1,46,107,0,0,310,910


Note that the column names in the dataset contain uppercase letters, so we normalize them to lowercase.

In [50]:
def clean_headers(df:pd.DataFrame)->pd.DataFrame:
    df.columns = df.columns.str.lower().str.replace(' ','_')
    return df

def clean_rows(df:pd.DataFrame)->pd.DataFrame:
    for column in df.columns:
        if pd.api.types.is_string_dtype(df[column]):
            df[column] = df[column].str.lower().str.replace(' ', '_')
    return df

def clean_headers_rows(df:pd.DataFrame)->pd.DataFrame:
    df = clean_headers(df)
    df = clean_rows(df)
    return df
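
To make the effect of these helpers concrete, here is a small sketch that applies the same two normalization steps to a made-up frame (the column names and values are invented for illustration). Note that at this point in the notebook every column of the real dataset is numeric, so only the header step actually changes anything there.

```python
import pandas as pd

# Hypothetical toy frame with an uppercase header and string values.
toy = pd.DataFrame({"Job Type": ["Part Time", "Fixed"], "Income": [100, 200]})

# Step 1: lowercase the headers and replace spaces with underscores.
toy.columns = toy.columns.str.lower().str.replace(" ", "_")

# Step 2: apply the same normalization to string columns only.
for column in toy.columns:
    if pd.api.types.is_string_dtype(toy[column]):
        toy[column] = toy[column].str.lower().str.replace(" ", "_")

print(toy.columns.tolist())
print(toy["job_type"].tolist())
```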

In [51]:
df = clean_headers_rows(df)
df.head()

,status,seniority,home,time,age,marital,records,job,expenses,income,assets,debt,amount,price
0,1,9,1,60,30,2,1,3,73,129,0,0,800,846
1,1,17,1,60,58,3,1,1,48,131,0,0,1000,1658
2,2,10,2,36,46,2,2,3,90,200,3000,0,2000,2985
3,1,0,1,60,24,1,1,1,63,182,2500,0,900,1325
4,1,0,1,36,26,1,1,1,46,107,0,0,310,910


We still lack the information needed to understand the dataset. It is available in the GitHub repository the CSV file comes from. Here it is in dictionary form:
status = {0:"unknow", 1:"good", 2:"bad"}
home = {0:"unknow", 1:"rent", 2:"owner", 3:"priv", 4:"ignore", 5:"parents", 6:"other"}
marital = {0:"unknow", 1:"single", 2:"married", 3:"widow", 4:"separated", 5:"divorced"}
records = {0:"unknow", 1:"no_rec", 2:"yes_rec"}
job = {0:"unknow", 1:"fixed", 2:"partime", 3:"freelance", 4:"other"}

We will build a human-readable dataframe to understand the data better:

In [52]:
status = {0:"unknow", 1:"good", 2:"bad"}
home = {0:"unknow", 1:"rent", 2:"owner", 3:"priv", 4:"ignore", 5:"parents", 6:"other"}
marital = {0:"unknow", 1:"single", 2:"married", 3:"widow", 4:"separated", 5:"divorced"}
records = {0:"unknow", 1:"no_rec", 2:"yes_rec"}
job = {0:"unknow", 1:"fixed", 2:"partime", 3:"freelance", 4:"other"}

df['status'] = df.status.map(status)
df['home'] = df.home.map(home)
df['marital'] = df.marital.map(marital)
df['records'] = df.records.map(records)
df['job'] = df.job.map(job)
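
One thing to keep in mind with `Series.map` and a dictionary: a code that is absent from the dictionary becomes `NaN` rather than raising an error. The sketch below illustrates this on a made-up series (the code `7` is invented and does not appear in the dataset documentation).

```python
import pandas as pd

status = {0: "unknow", 1: "good", 2: "bad"}
codes = pd.Series([1, 2, 0, 7])  # 7 is a made-up code absent from the mapping

mapped = codes.map(status)
print(mapped.tolist())

# Counting NaN values after mapping is a cheap way to spot unexpected codes.
print(mapped.isna().sum())
```

Running the same `isna().sum()` check on the mapped columns of `df` would confirm that every code was covered.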


In [53]:
df.head()

,status,seniority,home,time,age,marital,records,job,expenses,income,assets,debt,amount,price
0,good,9,rent,60,30,married,no_rec,freelance,73,129,0,0,800,846
1,good,17,rent,60,58,widow,no_rec,fixed,48,131,0,0,1000,1658
2,bad,10,owner,36,46,married,yes_rec,freelance,90,200,3000,0,2000,2985
3,good,0,rent,60,24,single,no_rec,fixed,63,182,2500,0,900,1325
4,good,0,rent,36,26,single,no_rec,fixed,46,107,0,0,310,910


According to the dataset's documentation, missing values are encoded as the number 99999999. We therefore need to replace them with np.nan so that they do not distort the results.

In [54]:
df.describe().round()

,seniority,time,age,expenses,income,assets,debt,amount,price
count,4455.0,4455.0,4455.0,4455.0,4455.0,4455.0,4455.0,4455.0,4455.0
mean,8.0,46.0,37.0,56.0,763317.0,1060341.0,404382.0,1039.0,1463.0
std,8.0,15.0,11.0,20.0,8703625.0,10217569.0,6344253.0,475.0,628.0
min,0.0,6.0,18.0,35.0,0.0,0.0,0.0,100.0,105.0
25%,2.0,36.0,28.0,35.0,80.0,0.0,0.0,700.0,1118.0
50%,5.0,48.0,36.0,51.0,120.0,3500.0,0.0,1000.0,1400.0
75%,12.0,60.0,45.0,72.0,166.0,6000.0,0.0,1300.0,1692.0
max,48.0,72.0,68.0,180.0,99999999.0,99999999.0,99999999.0,5000.0,11140.0


The output of ```df.describe().round()``` shows that the **max** of income, assets and debt is 99999999, which is the missing-value code.

In [55]:
for c in ['income', 'assets', 'debt']:
    df[c] = df[c].replace(to_replace=99999999, value=np.nan)
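
To see why the sentinel matters, here is a small sketch on made-up income values (not taken from the dataset). A single 99999999 dominates the mean, whereas after the replacement pandas skips the `NaN` when aggregating.

```python
import numpy as np
import pandas as pd

# Made-up values; the last entry plays the role of the missing-value code.
income = pd.Series([80, 120, 166, 99999999])

print(income.mean())  # dominated by the sentinel

cleaned = income.replace(to_replace=99999999, value=np.nan)
print(cleaned.mean())   # NaN is ignored by default
print(cleaned.count())  # count only includes non-missing values
```

The same effect explains why `count` drops below 4455 for these three columns once the replacement is done.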

In [56]:
df.describe().round()

,seniority,time,age,expenses,income,assets,debt,amount,price
count,4455.0,4455.0,4455.0,4455.0,4421.0,4408.0,4437.0,4455.0,4455.0
mean,8.0,46.0,37.0,56.0,131.0,5403.0,343.0,1039.0,1463.0
std,8.0,15.0,11.0,20.0,86.0,11573.0,1246.0,475.0,628.0
min,0.0,6.0,18.0,35.0,0.0,0.0,0.0,100.0,105.0
25%,2.0,36.0,28.0,35.0,80.0,0.0,0.0,700.0,1118.0
50%,5.0,48.0,36.0,51.0,120.0,3000.0,0.0,1000.0,1400.0
75%,12.0,60.0,45.0,72.0,165.0,6000.0,0.0,1300.0,1692.0
max,48.0,72.0,68.0,180.0,959.0,300000.0,30000.0,5000.0,11140.0


In [57]:
df.status.value_counts()

status
good      3200
bad       1254
unknow       1
Name: count, dtype: int64

In [58]:
df = df[df.status != 'unknow'].reset_index(drop=True)

In [59]:
from sklearn.model_selection import train_test_split
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=11)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=11)
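
The second split uses `test_size=0.25` because it is applied to the remaining 80% of the data: 0.25 × 0.8 = 0.2, which gives roughly a 60/20/20 train/validation/test split. A quick sketch to check the proportions, using a plain list of row indices as a stand-in for the dataframe:

```python
from sklearn.model_selection import train_test_split

# Stand-in for the dataframe: 3200 "good" + 1254 "bad" rows = 4454 rows.
rows = list(range(4454))

full_train, test = train_test_split(rows, test_size=0.2, random_state=11)
train, val = train_test_split(full_train, test_size=0.25, random_state=11)

for name, part in [("train", train), ("val", val), ("test", test)]:
    print(name, len(part), round(len(part) / len(rows), 3))
```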

In [60]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [61]:
y_train = (df_train.status == 'bad').astype('int').values
y_val = (df_val.status == 'bad').astype('int').values
y_test = (df_test.status == 'bad').astype('int').values

In [62]:
df_train.drop('status', axis=1, inplace=True)
df_val.drop('status', axis=1, inplace=True)
df_test.drop('status', axis=1, inplace=True)

In [63]:
df_train

,seniority,home,time,age,marital,records,job,expenses,income,assets,debt,amount,price
0,10,owner,36,36,married,no_rec,freelance,75,0.0,10000.0,0.0,1000,1400
1,6,parents,48,32,single,yes_rec,fixed,35,85.0,0.0,0.0,1100,1330
2,1,parents,48,40,married,no_rec,fixed,75,121.0,0.0,0.0,1320,1600
3,1,parents,48,23,single,no_rec,partime,35,72.0,0.0,0.0,1078,1079
4,5,owner,36,46,married,no_rec,freelance,60,100.0,4000.0,0.0,1100,1897
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2667,18,priv,36,45,married,no_rec,fixed,45,220.0,20000.0,0.0,800,1600
2668,7,priv,60,29,married,no_rec,fixed,60,51.0,3500.0,500.0,1000,1290
2669,1,parents,24,19,single,no_rec,fixed,35,28.0,0.0,0.0,400,600
2670,15,owner,48,43,married,no_rec,freelance,60,100.0,18000.0,0.0,2500,2976


## 6.3 Decision trees
* What a decision tree looks like
* Training a decision tree
* Overfitting
* Controlling the size of a tree

In [64]:
# The output is either 'default' or 'ok', i.e. a binary decision.
# The labels must match the mapped values used above ('yes_rec', 'partime').
def assess_risk(client):
    if client['records'] == 'yes_rec':
        if client['job'] == 'partime':
            return 'default'
        else:
            return 'ok'
    else:
        if client['assets'] > 6000:
            return 'ok'
        else:
            return 'default'

In [72]:
xi = df_train.iloc[0].to_dict()

In [73]:
assess_risk(xi)

'ok'

In [86]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import roc_auc_score

In [87]:
train_dicts = df_train.fillna(0).to_dict(orient='records')

In [88]:
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_dicts)

In [89]:
dv.get_feature_names_out()

array(['age', 'amount', 'assets', 'debt', 'expenses', 'home=ignore',
       'home=other', 'home=owner', 'home=parents', 'home=priv',
       'home=rent', 'home=unknow', 'income', 'job=fixed', 'job=freelance',
       'job=other', 'job=partime', 'job=unknow', 'marital=divorced',
       'marital=married', 'marital=separated', 'marital=single',
       'marital=unknow', 'marital=widow', 'price', 'records=no_rec',
       'records=yes_rec', 'seniority', 'time'], dtype=object)

In [92]:
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

In [93]:
val_dicts = df_val.fillna(0).to_dict(orient='records')
X_val = dv.transform(val_dicts) # only transform!

In [96]:
y_pred = dt.predict_proba(X_val)[:, 1]

In [97]:
roc_auc_score(y_true=y_val, y_score=y_pred)

0.6613396381778112
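
As a reminder of what this number means: the AUC can be read as the probability that a randomly chosen positive example ("bad" client) gets a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by brute force over all pairs; `pairwise_auc` is a hypothetical helper written for illustration, not how `roc_auc_score` is implemented.

```python
from itertools import product


def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Made-up labels and scores: 3 of the 4 (positive, negative) pairs are ordered correctly.
y_true_toy = [0, 0, 1, 1]
y_score_toy = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true_toy, y_score_toy))
```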

In [99]:
y_pred = dt.predict_proba(X_train)[:,1]
roc_auc_score(y_true=y_train, y_score=y_pred)

1.0

This is overfitting: the score is perfect on the training set (1.0) but poor on the validation set (about 0.66).

In [100]:
dt = DecisionTreeClassifier(max_depth=3)
dt.fit(X_train, y_train)

In [101]:
y_pred = dt.predict_proba(X_train)[:,1]
auc = roc_auc_score(y_true=y_train, y_score=y_pred)
print('train:', auc)

y_pred = dt.predict_proba(X_val)[:, 1]
auc = roc_auc_score(y_true=y_val, y_score=y_pred)
print('val:', auc)

train: 0.7761016984958594
val: 0.7389079944782155
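
Limiting the depth narrowed the gap between train and validation scores. A natural next step is to compare several depths side by side. The sketch below does this on synthetic data from `make_classification`, used here only as a stand-in so the example runs on its own; the depth values are arbitrary choices, and the numbers it prints say nothing about the credit scoring dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, NOT the credit scoring dataset.
X, y = make_classification(n_samples=1000, n_features=10, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=1)

scores = {}
for depth in [1, 2, 3, 5, 10, None]:  # None means no depth limit
    tree = DecisionTreeClassifier(max_depth=depth, random_state=1)
    tree.fit(X_tr, y_tr)
    auc_tr = roc_auc_score(y_tr, tree.predict_proba(X_tr)[:, 1])
    auc_va = roc_auc_score(y_va, tree.predict_proba(X_va)[:, 1])
    scores[depth] = (auc_tr, auc_va)
    print(f"max_depth={depth}: train={auc_tr:.3f} val={auc_va:.3f}")
```

On the real data, the same loop would use `X_train`, `y_train`, `X_val` and `y_val` from the cells above.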


In [102]:
from sklearn.tree import export_text

In [103]:
print(export_text(dt, feature_names=dv.get_feature_names_out()))

|--- records=yes_rec <= 0.50
|   |--- job=partime <= 0.50
|   |   |--- income <= 74.50
|   |   |   |--- class: 0
|   |   |--- income >  74.50
|   |   |   |--- class: 0
|   |--- job=partime >  0.50
|   |   |--- assets <= 8750.00
|   |   |   |--- class: 1
|   |   |--- assets >  8750.00
|   |   |   |--- class: 0
|--- records=yes_rec >  0.50
|   |--- seniority <= 6.50
|   |   |--- amount <= 862.50
|   |   |   |--- class: 0
|   |   |--- amount >  862.50
|   |   |   |--- class: 1
|   |--- seniority >  6.50
|   |   |--- income <= 103.50
|   |   |   |--- class: 1
|   |   |--- income >  103.50
|   |   |   |--- class: 0


Course video, at 15:02: https://www.youtube.com/watch?v=YGiQvFbSIg8&list=PL3MmuxUbc_hIhxl5Ji8t4O6lPAOpHaCLR&index=59