# Decision Trees and Ensemble Learning

## Credit Risk Scoring

* Build a model that the bank can use to decide whether to grant a customer credit
* The model outputs the risk that a customer won't pay back the credit ("risk of defaulting")
![model](Screenshot_1.png)

## Setup

In [3]:
import pandas as pd
import numpy as np

import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

## Data Cleaning and Preparation
* Download the data
* Re-encode the categorical variables
* Do the train / validation / test split

In [5]:
# !wget -P ../data/ https://github.com/gastonstat/CreditScoring/raw/master/CreditScoring.csv

In [6]:
!head ../data/CreditScoring.csv

"Status","Seniority","Home","Time","Age","Marital","Records","Job","Expenses","Income","Assets","Debt","Amount","Price"
1,9,1,60,30,2,1,3,73,129,0,0,800,846
1,17,1,60,58,3,1,1,48,131,0,0,1000,1658
2,10,2,36,46,2,2,3,90,200,3000,0,2000,2985
1,0,1,60,24,1,1,1,63,182,2500,0,900,1325
1,0,1,36,26,1,1,1,46,107,0,0,310,910
1,1,2,60,36,2,1,1,75,214,3500,0,650,1645
1,29,2,60,44,2,1,1,75,125,10000,0,1600,1800
1,9,5,12,27,1,1,1,35,80,0,0,200,1093
1,0,2,60,32,2,1,3,90,107,15000,0,1200,1957


In [30]:
# read the data
df = pd.read_csv("../data/CreditScoring.csv")
df.head()

Unnamed: 0,Status,Seniority,Home,Time,Age,Marital,Records,Job,Expenses,Income,Assets,Debt,Amount,Price
0,1,9,1,60,30,2,1,3,73,129,0,0,800,846
1,1,17,1,60,58,3,1,1,48,131,0,0,1000,1658
2,2,10,2,36,46,2,2,3,90,200,3000,0,2000,2985
3,1,0,1,60,24,1,1,1,63,182,2500,0,900,1325
4,1,0,1,36,26,1,1,1,46,107,0,0,310,910


* Lowercase the column names
* Replace numbers with strings in the categorical columns
* Replace missing values

In [31]:
# lowercase the column names
df.columns = df.columns.str.lower()
df.head()

Unnamed: 0,status,seniority,home,time,age,marital,records,job,expenses,income,assets,debt,amount,price
0,1,9,1,60,30,2,1,3,73,129,0,0,800,846
1,1,17,1,60,58,3,1,1,48,131,0,0,1000,1658
2,2,10,2,36,46,2,2,3,90,200,3000,0,2000,2985
3,1,0,1,60,24,1,1,1,63,182,2500,0,900,1325
4,1,0,1,36,26,1,1,1,46,107,0,0,310,910


In [32]:
df.status.value_counts()

1    3200
2    1254
0       1
Name: status, dtype: int64

In [33]:
# replace numbers with strings in categorical values
# define the map dictionaries
status_values = {
    1: "ok", 
    2: "default", 
    0: "unk"
}

home_values = {
    1: 'rent',
    2: 'owner',
    3: 'private',
    4: 'ignore',
    5: 'parents',
    6: 'other',
    0: 'unk'
}

marital_values = {
    1: 'single',
    2: 'married',
    3: 'widow',
    4: 'separated',
    5: 'divorced',
    0: 'unk'
}

records_values = {
    1: 'no',
    2: 'yes',
    0: 'unk'
}

job_values = {
    1: 'fixed',
    2: 'partime',
    3: 'freelance',
    4: 'others',
    0: 'unk'
}


# map the values of the dictionaries
df.status = df.status.map(status_values)
df.home = df.home.map(home_values)
df.marital = df.marital.map(marital_values)
df.records = df.records.map(records_values)
df.job = df.job.map(job_values)

In [34]:
df.head()

Unnamed: 0,status,seniority,home,time,age,marital,records,job,expenses,income,assets,debt,amount,price
0,ok,9,rent,60,30,married,no,freelance,73,129,0,0,800,846
1,ok,17,rent,60,58,widow,no,fixed,48,131,0,0,1000,1658
2,default,10,owner,36,46,married,yes,freelance,90,200,3000,0,2000,2985
3,ok,0,rent,60,24,single,no,fixed,63,182,2500,0,900,1325
4,ok,0,rent,36,26,single,no,fixed,46,107,0,0,310,910


In [35]:
# replace missing values
# look at summary statistics of the numerical columns
df.describe().round()

Unnamed: 0,seniority,time,age,expenses,income,assets,debt,amount,price
count,4455.0,4455.0,4455.0,4455.0,4455.0,4455.0,4455.0,4455.0,4455.0
mean,8.0,46.0,37.0,56.0,763317.0,1060341.0,404382.0,1039.0,1463.0
std,8.0,15.0,11.0,20.0,8703625.0,10217569.0,6344253.0,475.0,628.0
min,0.0,6.0,18.0,35.0,0.0,0.0,0.0,100.0,105.0
25%,2.0,36.0,28.0,35.0,80.0,0.0,0.0,700.0,1118.0
50%,5.0,48.0,36.0,51.0,120.0,3500.0,0.0,1000.0,1400.0
75%,12.0,60.0,45.0,72.0,166.0,6000.0,0.0,1300.0,1692.0
max,48.0,72.0,68.0,180.0,99999999.0,99999999.0,99999999.0,5000.0,11140.0


* We see that ```income```, ```assets``` and ```debt``` have a max value of ```99999999.0```, which is a placeholder for missing data
* Replace this number with ```nan```

In [36]:
for c in ["income", "assets", "debt"]:
    df[c] = df[c].replace(to_replace=99999999, value=np.nan)

In [37]:
df.describe().round()

Unnamed: 0,seniority,time,age,expenses,income,assets,debt,amount,price
count,4455.0,4455.0,4455.0,4455.0,4421.0,4408.0,4437.0,4455.0,4455.0
mean,8.0,46.0,37.0,56.0,131.0,5403.0,343.0,1039.0,1463.0
std,8.0,15.0,11.0,20.0,86.0,11573.0,1246.0,475.0,628.0
min,0.0,6.0,18.0,35.0,0.0,0.0,0.0,100.0,105.0
25%,2.0,36.0,28.0,35.0,80.0,0.0,0.0,700.0,1118.0
50%,5.0,48.0,36.0,51.0,120.0,3000.0,0.0,1000.0,1400.0
75%,12.0,60.0,45.0,72.0,165.0,6000.0,0.0,1300.0,1692.0
max,48.0,72.0,68.0,180.0,959.0,300000.0,30000.0,5000.0,11140.0


In [38]:
# look at the status variable again
df.status.value_counts()

ok         3200
default    1254
unk           1
Name: status, dtype: int64

* The row with status = 0 (unknown, now mapped to "unk") is useless for us
* We will remove it

In [41]:
df = df[df.status != "unk"].reset_index(drop=True)

In [43]:
df.status.value_counts()

ok         3200
default    1254
Name: status, dtype: int64

In [45]:
# Do train / validation / test split
from sklearn.model_selection import train_test_split

df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=11)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=11)

In [48]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)
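
The two calls to `train_test_split` above are meant to produce a 60% / 20% / 20% split: the second call takes 25% of the remaining 80%, which is 20% of the full dataset. A quick arithmetic check, assuming the 4,454 rows left after dropping the unknown status and assuming the held-out part is rounded up (to our knowledge this is how `train_test_split` rounds, but treat it as an assumption):

```python
import math

n_rows = 4454  # 3200 "ok" + 1254 "default" rows

# First split: 20% goes to the test set (assumed to be rounded up).
n_test = math.ceil(n_rows * 0.2)
n_full_train = n_rows - n_test

# Second split: 25% of the remaining rows go to the validation set.
n_val = math.ceil(n_full_train * 0.25)
n_train = n_full_train - n_val

print(n_train, n_val, n_test)
for name, n in [("train", n_train), ("val", n_val), ("test", n_test)]:
    print(f"{name}: {n / n_rows:.1%}")
```

In the notebook itself, `len(df_train)`, `len(df_val)` and `len(df_test)` give the actual sizes.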

* Define y from the status column
    * Set "default" to 1 and "ok" to 0

In [53]:
y_train = (df_train.status == "default").astype(int)
y_val = (df_val.status == "default").astype(int)
y_test = (df_test.status == "default").astype(int)

In [55]:
# remove status column from dataframe
del df_train["status"]
del df_val["status"]
del df_test["status"]

In [56]:
df_train.head()

Unnamed: 0,seniority,home,time,age,marital,records,job,expenses,income,assets,debt,amount,price
0,10,owner,36,36,married,no,freelance,75,0.0,10000.0,0.0,1000,1400
1,6,parents,48,32,single,yes,fixed,35,85.0,0.0,0.0,1100,1330
2,1,parents,48,40,married,no,fixed,75,121.0,0.0,0.0,1320,1600
3,1,parents,48,23,single,no,partime,35,72.0,0.0,0.0,1078,1079
4,5,owner,36,46,married,no,freelance,60,100.0,4000.0,0.0,1100,1897


## Decision Trees

* What a decision tree looks like
* Training a decision tree
* Overfitting
* Controlling the size of a tree

A decision tree is a data structure that looks like this:
![decision tree](Screenshot_2.png)

* A decision tree is a sequence of if-then-else rules
* A simple example:
![example](Screenshot_3.png)

In [62]:
def asses_risk(client):
    if client["records"] == "yes":
        if client["job"] == "partime":  # the dataset spells it "partime" (see job_values)
            return "default"
        else:
            return "ok"
    else:
        if client["assets"] > 6000:
            return "ok"
        else:
            return "default"

In [63]:
# take first entry as example
xi = df_train.iloc[0].to_dict()
xi

{'seniority': 10,
 'home': 'owner',
 'time': 36,
 'age': 36,
 'marital': 'married',
 'records': 'no',
 'job': 'freelance',
 'expenses': 75,
 'income': 0.0,
 'assets': 10000.0,
 'debt': 0.0,
 'amount': 1000,
 'price': 1400}

In [64]:
asses_risk(xi)

'ok'

* These if-then-else rules can be learned from the data

In [76]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import roc_auc_score

In [72]:
train_dicts = df_train.fillna(0).to_dict(orient="records")
val_dicts = df_val.fillna(0).to_dict(orient="records")
train_dicts[:2]

[{'seniority': 10,
  'home': 'owner',
  'time': 36,
  'age': 36,
  'marital': 'married',
  'records': 'no',
  'job': 'freelance',
  'expenses': 75,
  'income': 0.0,
  'assets': 10000.0,
  'debt': 0.0,
  'amount': 1000,
  'price': 1400},
 {'seniority': 6,
  'home': 'parents',
  'time': 48,
  'age': 32,
  'marital': 'single',
  'records': 'yes',
  'job': 'fixed',
  'expenses': 35,
  'income': 85.0,
  'assets': 0.0,
  'debt': 0.0,
  'amount': 1100,
  'price': 1330}]

In [73]:
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_dicts)

In [74]:
# train the decision tree
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

DecisionTreeClassifier()

In [78]:
# test the decision tree
X_val = dv.transform(val_dicts)
y_pred = dt.predict_proba(X_val)[:,1]

In [79]:
roc_auc_score(y_val, y_pred)

0.6557663897701679

* The score is not very high; look at the ROC AUC on the training set

In [80]:
y_pred_train = dt.predict_proba(X_train)[:,1]

In [81]:
roc_auc_score(y_train, y_pred_train)

1.0

* Our model is overfitting
* It memorizes the training data but does not generalize to unseen data
* This happens because we did not restrict the tree, so it can keep growing until it has learned very specific conditions
* One way to avoid this is to restrict the depth
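
To make the two scores above more concrete: ROC AUC can be read as the probability that a randomly chosen defaulting client gets a higher score than a randomly chosen "ok" client. The sketch below is a plain-Python illustration of that reading (not how `roc_auc_score` is implemented), using made-up toy labels and scores:

```python
def pairwise_auc(y_true, y_score):
    """Share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only.
y = [1, 0, 1, 0, 0]
memorized = [1.0, 0.0, 1.0, 0.0, 0.0]  # scores identical to the labels
imperfect = [0.9, 0.8, 0.3, 0.2, 0.1]  # one "ok" client outranks a defaulter

print(pairwise_auc(y, memorized))  # 1.0: every pair is ordered correctly
print(pairwise_auc(y, imperfect))  # 5 of 6 pairs are ordered correctly
```

A training AUC of exactly 1.0 therefore means the unrestricted tree separates every training pair perfectly, which is what memorization looks like.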

In [83]:
# retrain the model with max depth restricted to 3
dt = DecisionTreeClassifier(max_depth=3)
dt.fit(X_train, y_train)
y_pred = dt.predict_proba(X_val)[:,1]
ras = roc_auc_score(y_val, y_pred)
print(f"roc_auc_score on validation data: {ras}")

y_pred_train = dt.predict_proba(X_train)[:,1]
ras_train = roc_auc_score(y_train, y_pred_train)
print(f"roc_auc_score on training data: {ras_train}")

roc_auc_score on validation data: 0.7389079944782155
roc_auc_score on training data: 0.7761016984958594


* The performance of our model on validation data is much better (about 0.74 instead of 0.66), and the gap to the training score is now small
* Visualize the tree

In [85]:
from sklearn.tree import export_text

print(export_text(dt))

|--- feature_26 <= 0.50
|   |--- feature_16 <= 0.50
|   |   |--- feature_12 <= 74.50
|   |   |   |--- class: 0
|   |   |--- feature_12 >  74.50
|   |   |   |--- class: 0
|   |--- feature_16 >  0.50
|   |   |--- feature_2 <= 8750.00
|   |   |   |--- class: 1
|   |   |--- feature_2 >  8750.00
|   |   |   |--- class: 0
|--- feature_26 >  0.50
|   |--- feature_27 <= 6.50
|   |   |--- feature_1 <= 862.50
|   |   |   |--- class: 0
|   |   |--- feature_1 >  862.50
|   |   |   |--- class: 1
|   |--- feature_27 >  6.50
|   |   |--- feature_12 <= 103.50
|   |   |   |--- class: 1
|   |   |--- feature_12 >  103.50
|   |   |   |--- class: 0



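* To see which feature each index refers to, use ```dv.get_feature_names()``` (removed in scikit-learn 1.2; newer versions provide ```dv.get_feature_names_out()``` instead)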

In [89]:
dv.get_feature_names()

['age',
 'amount',
 'assets',
 'debt',
 'expenses',
 'home=ignore',
 'home=other',
 'home=owner',
 'home=parents',
 'home=private',
 'home=rent',
 'home=unk',
 'income',
 'job=fixed',
 'job=freelance',
 'job=others',
 'job=partime',
 'job=unk',
 'marital=divorced',
 'marital=married',
 'marital=separated',
 'marital=single',
 'marital=unk',
 'marital=widow',
 'price',
 'records=no',
 'records=yes',
 'seniority',
 'time']

In [93]:
print(export_text(dt, feature_names=dv.get_feature_names()))

|--- records=yes <= 0.50
|   |--- job=partime <= 0.50
|   |   |--- income <= 74.50
|   |   |   |--- class: 0
|   |   |--- income >  74.50
|   |   |   |--- class: 0
|   |--- job=partime >  0.50
|   |   |--- assets <= 8750.00
|   |   |   |--- class: 1
|   |   |--- assets >  8750.00
|   |   |   |--- class: 0
|--- records=yes >  0.50
|   |--- seniority <= 6.50
|   |   |--- amount <= 862.50
|   |   |   |--- class: 0
|   |   |--- amount >  862.50
|   |   |   |--- class: 1
|   |--- seniority >  6.50
|   |   |--- income <= 103.50
|   |   |   |--- class: 1
|   |   |--- income >  103.50
|   |   |   |--- class: 0
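
Conditions such as `records=yes <= 0.50` come from the one-hot encoding: each `key=value` column is either 0 or 1, so `<= 0.50` means "the client does not have this value". The sketch below uses a simplified plain-Python stand-in for `DictVectorizer` (the helper `one_hot` and the two clients are hypothetical) to show how such columns arise:

```python
def one_hot(records):
    """Simplified stand-in for DictVectorizer: strings become key=value columns."""
    names = sorted({
        f"{key}={value}" if isinstance(value, str) else key
        for record in records
        for key, value in record.items()
    })
    rows = []
    for record in records:
        row = []
        for name in names:
            if "=" in name:
                key, value = name.split("=", 1)
                row.append(1.0 if record.get(key) == value else 0.0)
            else:
                row.append(float(record[name]))
        rows.append(row)
    return names, rows


# Two made-up clients, for illustration only.
clients = [
    {"records": "no", "job": "partime", "assets": 3000},
    {"records": "yes", "job": "fixed", "assets": 0},
]
names, rows = one_hot(clients)
print(names)

col = names.index("records=yes")
for client, row in zip(clients, rows):
    branch = "records=yes <= 0.50" if row[col] <= 0.5 else "records=yes >  0.50"
    print(client["records"], "->", branch)
```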



## Decision Tree Algorithm Learning
* Finding the best split for one column
* Finding the best split for the entire dataset
* Stopping criteria
* Decision tree learning algorithm
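
A minimal sketch of the first two steps, assuming misclassification rate as the impurity measure and observed values as candidate thresholds (scikit-learn uses Gini impurity by default, so this is an illustration and not its implementation). The function names and the toy data are hypothetical:

```python
def misclassification(labels):
    """Impurity of a group: share of labels that differ from the majority."""
    if not labels:
        return 0.0
    ones = sum(labels)
    return min(ones, len(labels) - ones) / len(labels)


def best_split_for_column(values, labels):
    """Try each observed value as threshold; return (impurity, threshold)."""
    best = None
    for t in sorted(set(values))[:-1]:  # skip the max so the right side is never empty
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        # Weighted average impurity of the two groups.
        score = (len(left) * misclassification(left)
                 + len(right) * misclassification(right)) / len(labels)
        if best is None or score < best[0]:
            best = (score, t)
    return best


def best_split(rows, labels):
    """Repeat the search for every column and keep the least impure split."""
    candidates = []
    for column in rows[0]:
        found = best_split_for_column([r[column] for r in rows], labels)
        if found is not None:
            candidates.append((found[0], column, found[1]))
    return min(candidates)


# Toy data, invented for illustration (1 = default, 0 = ok).
rows = [
    {"assets": 0, "debt": 1000},
    {"assets": 2000, "debt": 1000},
    {"assets": 3000, "debt": 500},
    {"assets": 4000, "debt": 1000},
    {"assets": 5000, "debt": 500},
    {"assets": 8000, "debt": 2000},
]
labels = [1, 1, 1, 0, 0, 0]

print(best_split(rows, labels))
```

A full learner would apply `best_split` recursively to the left and right groups until a stopping criterion is met, for example when a group is already pure, the maximum depth is reached, or a group becomes too small.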



## Decision Tree Parameter Tuning
* Selecting max depth
* Selecting ```min_samples_leaf```
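
One way to select both parameters is to try a small grid and keep the combination with the best validation AUC. The sketch below assumes synthetic data from `make_classification` as a stand-in for the credit data, and the candidate values are arbitrary choices for illustration:

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; in the notebook, use X_train / y_train / X_val / y_val.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=1)

scores = []
for depth, leaf in product([2, 4, 6, None], [1, 5, 20, 100]):
    dt = DecisionTreeClassifier(max_depth=depth, min_samples_leaf=leaf,
                                random_state=1)
    dt.fit(X_tr, y_tr)
    auc = roc_auc_score(y_va, dt.predict_proba(X_va)[:, 1])
    scores.append((auc, depth, leaf))

# Compare on AUC only, since max_depth may be None.
best_auc, best_depth, best_leaf = max(scores, key=lambda s: s[0])
print(f"best: max_depth={best_depth}, min_samples_leaf={best_leaf}, "
      f"auc={best_auc:.3f}")
```

For larger grids it can be cheaper to narrow down `max_depth` first and then tune `min_samples_leaf` only for the most promising depths.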

## Ensembles and Random Forests
* Board of experts
* Ensembling models
* Random forest - ensembling decision trees
* Tuning a random forest

* Other useful parameters:
    * ```max_features```
    * ```bootstrap```
    * Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
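
The "board of experts" idea can be illustrated with a small calculation: if each expert is right with probability 0.7 and the experts err independently, a majority vote is right more often than any single expert. Independence is an idealizing assumption here; trees in a random forest are only partly decorrelated, which is what parameters like ```max_features``` and ```bootstrap``` influence.

```python
from math import comb


def majority_accuracy(n_experts, p_correct):
    """Probability that more than half of n independent experts are right."""
    needed = n_experts // 2 + 1
    return sum(
        comb(n_experts, k) * p_correct**k * (1 - p_correct) ** (n_experts - k)
        for k in range(needed, n_experts + 1)
    )


# Odd board sizes avoid ties; 0.7 is an arbitrary illustrative accuracy.
for n in [1, 3, 11, 51]:
    print(n, round(majority_accuracy(n, 0.7), 3))
```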

## Gradient Boosting and XGBoost
* Gradient boosting vs. random forest
* Installing XGBoost
* Training the first model
* Performance monitoring
* Parsing XGBoost's monitoring output
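
The main difference from a random forest is that boosted trees are trained one after another, each correcting the errors of the models before it. The sketch below shows this for squared error with one-split trees ("stumps") on made-up 1-D data. It is a simplified illustration, not XGBoost's algorithm (which uses a regularized objective and deeper trees); the name `eta` is borrowed from XGBoost's learning-rate parameter.

```python
def fit_stump(xs, targets):
    """Least-squares stump: returns (threshold, left_mean, right_mean)."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, t, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_rounds=10, eta=0.3):
    """Each round fits a stump to the residuals and adds eta times its output."""
    preds = [sum(ys) / len(ys)] * len(ys)  # start from the mean prediction
    history = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, left_mean, right_mean = fit_stump(xs, residuals)
        preds = [p + eta * (left_mean if x <= t else right_mean)
                 for x, p in zip(xs, preds)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return preds, history


# Toy data, invented for illustration only.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.8, 5.2, 5.0]
preds, history = boost(xs, ys)
print([round(m, 3) for m in history])  # training MSE after each round
```

The training error keeps shrinking with every round, which is why the validation score has to be monitored to decide when to stop.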

## XGBoost Parameter Tuning
* Tuning the following parameters
    * ```eta```
    * ```max_depth```
    * ```min_child_weight```

* Documentation: https://xgboost.readthedocs.io/en/stable/
* Other useful parameters:
    * ```subsample``` and ```colsample_bytree```
    * ```lambda``` and ```alpha```
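
As a starting point for tuning, the parameters above can be collected in a dictionary. The sketch below lists what we understand to be XGBoost's documented defaults for the tuned parameters; the objective, metric, seed and the candidate values for ```eta``` are illustrative choices, not taken from this notebook. XGBoost itself is not imported or called here.

```python
# Assumed defaults for the tuned parameters; check the documentation for your version.
xgb_params = {
    "eta": 0.3,               # learning rate
    "max_depth": 6,           # maximum depth of each tree
    "min_child_weight": 1,    # minimum sum of instance weights in a child
    "subsample": 1.0,         # share of rows sampled per tree
    "colsample_bytree": 1.0,  # share of columns sampled per tree
    "lambda": 1.0,            # L2 regularization
    "alpha": 0.0,             # L1 regularization
    "objective": "binary:logistic",  # chosen for the default / ok target
    "eval_metric": "auc",
    "seed": 1,
}

# Tune one parameter at a time: vary eta, keep everything else fixed.
eta_candidates = [0.01, 0.05, 0.1, 0.3, 1.0]
param_sets = [{**xgb_params, "eta": eta} for eta in eta_candidates]

for params in param_sets:
    # Each dict would be passed to xgb.train(params, dtrain, ...) in the notebook.
    print(params["eta"], params["max_depth"], params["min_child_weight"])
```

After fixing the best ```eta```, the same pattern can be repeated for ```max_depth``` and then ```min_child_weight```.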

## Selecting the Final Model
* Choosing between XGBoost, Random Forest and Decision Tree
* Training the final model
* Saving the model
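
One common option for saving the model is Python's `pickle` module. The sketch below uses placeholder dictionaries instead of the fitted objects and a hypothetical file name; in the notebook you would pickle the fitted `dv` together with the chosen model, because both are needed to score new clients.

```python
import os
import pickle
import tempfile

# Placeholders standing in for the fitted DictVectorizer and the final model.
dv_stub = {"feature_names": ["age", "amount", "assets"]}
model_stub = {"kind": "placeholder", "max_depth": 3}

path = os.path.join(tempfile.mkdtemp(), "model.bin")  # hypothetical file name

# Save vectorizer and model together in one file.
with open(path, "wb") as f_out:
    pickle.dump((dv_stub, model_stub), f_out)

# Load them back, e.g. in a separate prediction script.
with open(path, "rb") as f_in:
    dv_loaded, model_loaded = pickle.load(f_in)

print(dv_loaded == dv_stub, model_loaded == model_stub)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.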


## Summary
* Decision trees learn if-then-else rules from the data
* Finding the best split: select the least impure split. This algorithm can overfit, which is why we control it by limiting the max depth and the minimum size of a group (```min_samples_leaf```)
* Random forests are a way of combining multiple decision trees