# Decision Trees and Ensemble Learning

## Credit risk scoring project

**Dataset**

In [1]:
!wget https://github.com/gastonstat/CreditScoring/raw/master/CreditScoring.csv

--2022-10-11 12:04:44--  https://github.com/gastonstat/CreditScoring/raw/master/CreditScoring.csv
Resolving github.com (github.com)... 140.82.113.3
Connecting to github.com (github.com)|140.82.113.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/gastonstat/CreditScoring/master/CreditScoring.csv [following]
--2022-10-11 12:04:44--  https://raw.githubusercontent.com/gastonstat/CreditScoring/master/CreditScoring.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 182489 (178K) [text/plain]
Saving to: ‘CreditScoring.csv.8’


2022-10-11 12:04:44 (7.42 MB/s) - ‘CreditScoring.csv.8’ saved [182489/182489]



### Libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

plt.style.use('ggplot')

**Data cleaning**

In [3]:
df = pd.read_csv('CreditScoring.csv')
df.head()

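| | Status | Seniority | Home | Time | Age | Marital | Records | Job | Expenses | Income | Assets | Debt | Amount | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 9 | 1 | 60 | 30 | 2 | 1 | 3 | 73 | 129 | 0 | 0 | 800 | 846 |
| 1 | 1 | 17 | 1 | 60 | 58 | 3 | 1 | 1 | 48 | 131 | 0 | 0 | 1000 | 1658 |
| 2 | 2 | 10 | 2 | 36 | 46 | 2 | 2 | 3 | 90 | 200 | 3000 | 0 | 2000 | 2985 |
| 3 | 1 | 0 | 1 | 60 | 24 | 1 | 1 | 1 | 63 | 182 | 2500 | 0 | 900 | 1325 |
| 4 | 1 | 0 | 1 | 36 | 26 | 1 | 1 | 1 | 46 | 107 | 0 | 0 | 310 | 910 |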


In [4]:
# convert all column names to lowercase
df.columns = df.columns.str.lower()

In [5]:
df.shape

(4455, 14)

**Mapping the target**

In [6]:
df['status'].value_counts(dropna=False)

1    3200
2    1254
0       1
Name: status, dtype: int64

In [7]:
df['status'] = df['status'].map({
    1:'ok',
    2:'default',
    0:'unk'
})

**Mapping categorical features**

In [8]:
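def mapping_categorical(df, cat, cat_lst):
  # Codes that actually occur in the column, in ascending order (0, 1, 2, ...)
  to_lst = df[cat].value_counts().sort_index().index.to_list()

  # Pair each code with its label by position. This assumes cat_lst is ordered
  # to match the sorted codes; extra labels at the end are ignored by zip.
  df[cat] = df[cat].map(dict(zip(to_lst, cat_lst)))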

In [9]:
cols = ['home', 'marital', 'records', 'job']

home_lst = ['unk', 'rent', 'owner', 'private', 'ignore', 'parents', 'other']
marital_lst = ['unk', 'single', 'married', 'widow', 'separated', 'divorced']
records_lst = ['no', 'yes', 'unk']
job_lst = ['unk', 'fixed', 'partime', 'freelance', 'others']
cat_lst = [home_lst, marital_lst, records_lst, job_lst]

for col, cat in zip(cols, cat_lst):
  mapping_categorical(df, col, cat)


In [10]:
df.head()

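| | status | seniority | home | time | age | marital | records | job | expenses | income | assets | debt | amount | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ok | 9 | rent | 60 | 30 | married | no | freelance | 73 | 129 | 0 | 0 | 800 | 846 |
| 1 | ok | 17 | rent | 60 | 58 | widow | no | fixed | 48 | 131 | 0 | 0 | 1000 | 1658 |
| 2 | default | 10 | owner | 36 | 46 | married | yes | freelance | 90 | 200 | 3000 | 0 | 2000 | 2985 |
| 3 | ok | 0 | rent | 60 | 24 | single | no | fixed | 63 | 182 | 2500 | 0 | 900 | 1325 |
| 4 | ok | 0 | rent | 36 | 26 | single | no | fixed | 46 | 107 | 0 | 0 | 310 | 910 |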


**Check numerical values**

In [11]:
df.describe().T.round()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| seniority | 4455.0 | 8.0 | 8.0 | 0.0 | 2.0 | 5.0 | 12.0 | 48.0 |
| time | 4455.0 | 46.0 | 15.0 | 6.0 | 36.0 | 48.0 | 60.0 | 72.0 |
| age | 4455.0 | 37.0 | 11.0 | 18.0 | 28.0 | 36.0 | 45.0 | 68.0 |
| expenses | 4455.0 | 56.0 | 20.0 | 35.0 | 35.0 | 51.0 | 72.0 | 180.0 |
| income | 4455.0 | 763317.0 | 8703625.0 | 0.0 | 80.0 | 120.0 | 166.0 | 99999999.0 |
| assets | 4455.0 | 1060341.0 | 10217569.0 | 0.0 | 0.0 | 3500.0 | 6000.0 | 99999999.0 |
| debt | 4455.0 | 404382.0 | 6344253.0 | 0.0 | 0.0 | 0.0 | 0.0 | 99999999.0 |
| amount | 4455.0 | 1039.0 | 475.0 | 100.0 | 700.0 | 1000.0 | 1300.0 | 5000.0 |
| price | 4455.0 | 1463.0 | 628.0 | 105.0 | 1118.0 | 1400.0 | 1692.0 | 11140.0 |


One thing we can notice immediately is that the max value is 99999999 for income, assets, and debt. This is quite suspicious. As it turns out, it’s an artificial value: this is how missing values are encoded in this dataset.
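
Before replacing the placeholder, it can help to count how often it occurs in each column. The sketch below does this on a small made-up frame; the `toy` values and the `SENTINEL` name are illustrative assumptions, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values that mimics the placeholder encoding
toy = pd.DataFrame({
    'income': [80, 120, 99999999, 166],
    'assets': [0, 99999999, 3500, 99999999],
    'amount': [800, 1000, 2000, 900],
})

SENTINEL = 99999999  # placeholder value described in the text

# Number of placeholder entries per column
sentinel_counts = (toy == SENTINEL).sum()
print(sentinel_counts)

# After replacing the placeholder with NaN, the same counts show up as missing values
cleaned = toy.replace(SENTINEL, np.nan)
print(cleaned.isna().sum())
```

Columns with a non-zero count are the ones that need the replacement.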

In [12]:
def fix_missing_values(df, val_to_rep, rep, *f_lst):
  # Replace the placeholder val_to_rep with rep in each of the given columns
  for f in f_lst:
    df[f] = df[f].replace(val_to_rep, rep)

In [13]:
# pass column names as separate arguments, matching the *f_lst signature
fix_missing_values(df, 99999999.0, np.nan, 'income', 'assets', 'debt')

In [14]:
df.describe().T.round()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| seniority | 4455.0 | 8.0 | 8.0 | 0.0 | 2.0 | 5.0 | 12.0 | 48.0 |
| time | 4455.0 | 46.0 | 15.0 | 6.0 | 36.0 | 48.0 | 60.0 | 72.0 |
| age | 4455.0 | 37.0 | 11.0 | 18.0 | 28.0 | 36.0 | 45.0 | 68.0 |
| expenses | 4455.0 | 56.0 | 20.0 | 35.0 | 35.0 | 51.0 | 72.0 | 180.0 |
| income | 4421.0 | 131.0 | 86.0 | 0.0 | 80.0 | 120.0 | 165.0 | 959.0 |
| assets | 4408.0 | 5403.0 | 11573.0 | 0.0 | 0.0 | 3000.0 | 6000.0 | 300000.0 |
| debt | 4437.0 | 343.0 | 1246.0 | 0.0 | 0.0 | 0.0 | 0.0 | 30000.0 |
| amount | 4455.0 | 1039.0 | 475.0 | 100.0 | 700.0 | 1000.0 | 1300.0 | 5000.0 |
| price | 4455.0 | 1463.0 | 628.0 | 105.0 | 1118.0 | 1400.0 | 1692.0 | 11140.0 |


In [15]:
df.isna().sum()

status        0
seniority     0
home          0
time          0
age           0
marital       0
records       0
job           0
expenses      0
income       34
assets       47
debt         18
amount        0
price         0
dtype: int64

The target also contains one row with “unknown” status: we don’t know whether this client managed to pay back the loan. This row is not useful for our project, so we drop it.

In [16]:
df = df[df.status != 'unk']

# Dataset preparation

Separate the target from the features. Since our objective is to determine whether somebody fails to pay back their credit, the positive class is “default”: y is 1 if the client defaulted and 0 otherwise.

In [22]:
data, target = df.drop(columns=['status']), df['status'].map({'ok':0, 'default':1})

In [23]:
target.head()

0    0
1    0
2    1
3    0
4    0
Name: status, dtype: int64

**Splitting data**

In [24]:
from sklearn import model_selection

X_full_train, X_test, y_full_train, y_test = model_selection.train_test_split(
    data,
    target,
    test_size=.2,
    random_state=11,
)
X_train, X_dev, y_train, y_dev = model_selection.train_test_split(
    X_full_train,
    y_full_train,
    test_size=.25,
    random_state=11,
)

X_train.shape, y_train.shape, X_dev.shape, y_dev.shape, X_test.shape, y_test.shape

((2672, 13), (2672,), (891, 13), (891,), (891, 13), (891,))

**Impute missing values and make pipelines**

In [28]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import make_column_selector as selector, ColumnTransformer
from sklearn.pipeline import Pipeline, make_pipeline


numerical = selector(dtype_include=np.number)(data)
categorical = selector(dtype_include=object)(data)

# np.nan rather than np.NaN: the np.NaN alias was removed in NumPy 2.0
num_imputer = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0)
cat_imputer = SimpleImputer(strategy='constant', fill_value='unk')
cat_encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)

cat_preprocessor = Pipeline([
    ('category_imputer', cat_imputer),
    ('category_encoder', cat_encoder)
])

processor = ColumnTransformer([
    ('numeric', num_imputer, numerical),
    ('category', cat_preprocessor, categorical)
])
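
To see what this preprocessing produces, here is a minimal sketch on a made-up two-column frame. The `toy_train` and `toy_new` frames and the `toy_processor` name are illustrative assumptions; the transformer settings mirror the ones defined above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Made-up data: one numeric column with a missing value, one categorical column
toy_train = pd.DataFrame({
    'income': [80.0, np.nan, 120.0],
    'home': ['rent', 'owner', 'rent'],
})
# A new row with a missing number and a category not seen during fitting
toy_new = pd.DataFrame({
    'income': [np.nan],
    'home': ['parents'],
})

toy_processor = ColumnTransformer([
    # missing numbers become 0
    ('numeric', SimpleImputer(strategy='constant', fill_value=0), ['income']),
    # missing categories become 'unk', then every category gets an integer code
    ('category', Pipeline([
        ('category_imputer', SimpleImputer(strategy='constant', fill_value='unk')),
        ('category_encoder', OrdinalEncoder(handle_unknown='use_encoded_value',
                                            unknown_value=-1)),
    ]), ['home']),
])

train_out = toy_processor.fit_transform(toy_train)
new_out = toy_processor.transform(toy_new)
print(train_out)  # columns: imputed income, encoded home
print(new_out)    # the unseen category is encoded as -1
```

The encoder assigns codes in sorted category order, so the integers carry no meaning beyond identifying the category. A tree can still split on them, which is presumably why ordinal encoding is used here instead of one-hot encoding.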

# Models

## Decision Trees

In [29]:
from sklearn.tree import DecisionTreeClassifier

dt = make_pipeline(processor, DecisionTreeClassifier())
dt.fit(X_train, y_train)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('numeric',
                                                  SimpleImputer(fill_value=0,
                                                                strategy='constant'),
                                                  ['seniority', 'time', 'age',
                                                   'expenses', 'income',
                                                   'assets', 'debt', 'amount',
                                                   'price']),
                                                 ('category',
                                                  Pipeline(steps=[('category_imputer',
                                                                   SimpleImputer(fill_value='unk',
                                                                                 strategy='constant')),
                                                                  ('category_encoder',
     

In [30]:
from sklearn import metrics

y_pred = dt.predict_proba(X_train)[:, 1]
metrics.roc_auc_score(y_train, y_pred)

1.0

In [31]:
y_pred = dt.predict_proba(X_dev)[:, 1]
metrics.roc_auc_score(y_dev, y_pred)

0.6585938824441161

In [32]:
# recall is computed from predicted class labels, not probabilities
metrics.recall_score(y_dev, dt.predict(X_dev))

0.5019011406844106

We just observed a case of overfitting: the AUC is 1.0 on the training set but only about 0.66 on the validation set. The tree learned the training data so well that it simply memorized the outcome for each customer. The rules it extracted turned out to be too specific to the training set, so it performed poorly on customers it didn’t see during training. In such cases, we say that the model cannot generalize.

Overfitting happens when a complex model has enough capacity to remember all the training data. If we force the model to be simpler, we make it less powerful and can improve its ability to generalize. There are multiple ways of controlling the complexity of a tree. One option is restricting its size with the max_depth parameter, which controls the maximum number of levels. The more levels a tree has, the more complex the rules it can learn.
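
The effect of max_depth can be sketched by training trees of different depths and comparing training and validation AUC. The example below uses synthetic data from `make_classification` as a stand-in (an assumption made so the snippet runs on its own); the depth values and the names `X_tr`, `X_val`, and `results` are illustrative as well.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (not the credit scoring dataset)
X, y = make_classification(
    n_samples=2000, n_features=13, n_informative=5,
    flip_y=0.2, random_state=11,
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, random_state=11,
)

results = {}
# None means no depth limit: the tree grows until its leaves are pure
for depth in [1, 2, 3, 4, 5, 6, 10, 15, 20, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=11)
    tree.fit(X_tr, y_tr)

    auc_train = roc_auc_score(y_tr, tree.predict_proba(X_tr)[:, 1])
    auc_val = roc_auc_score(y_val, tree.predict_proba(X_val)[:, 1])
    results[depth] = (auc_train, auc_val)

    print(f'{str(depth):>4}  train={auc_train:.3f}  val={auc_val:.3f}')
```

Typically the training AUC keeps rising with depth, while the validation AUC peaks at a moderate depth and then declines. With the pipeline from this notebook, the same loop could be run by passing `DecisionTreeClassifier(max_depth=depth)` to `make_pipeline(processor, ...)` and scoring on `X_dev`.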