## Credit Risk Scoring Project
This project involves credit risk scoring. Imagine you want to buy a mobile phone, so you visit your bank to apply for a loan. You fill out an application form that requests various details, such as your income, the price of the phone, and the loan amount you need. The bank evaluates your application and assigns a score, ultimately deciding whether to approve or decline your request with a ‘yes’ or ‘no’ response.

In this chapter, our goal is to build a model that the bank can utilize to make informed decisions about lending money to customers. The bank can provide the model with customer information, and in return, the model will generate a risk score, indicating the likelihood of a customer defaulting on the loan. This risk score enables the bank to make well-informed lending decisions.

Our approach involves analyzing historical data from various customers and their loan applications. For each case, we have information about the requested loan amount and whether the customer successfully repaid the loan or defaulted.

For instance:

 * Customer A –> OK
 * Customer B –> OK
 * Customer C –> DEFAULT
 * Customer D –> DEFAULT
 * Customer E –> OK

   
This problem can be framed as binary classification, where ‘y’ represents the target variable, and it can take on two values: 0 (OK) or 1 (DEFAULT). Our objective is to train a model to predict, for each new customer, the probability that they will default:

g(xi) –> PROBABILITY OF DEFAULT


We have ‘X,’ which encompasses all the customer information, and the target variable ‘y,’ which indicates the default probability.

**Dataset**: You can find the dataset at this [link](https://github.com/gastonstat/CreditScoring). The ‘Status’ variable in the dataset denotes whether the customer defaulted or not.



1. Preparation Steps – Part 1/2
   1. Imports for this project
   2. Downloading the dataset
   3. Previewing the CSV File
   4. Adapting Column Format
      
In part 1 of this chapter, “Decision Trees and Ensemble Learning,” we introduced the project, which is a binary classification problem aimed at predicting the probability of a client defaulting on a loan. Part 2 of this chapter is divided into two main sections.

**Preparation Steps**

In the first part, we focus on necessary preparation steps. This includes importing essential libraries, downloading the dataset, previewing the the CSV File, and performing an initial column format adaptation to ensure uniformity in our data.

**Data Transformation and Splitting**

The second part is dedicated to re-encoding categorical variables and performing the train/validation/test split, a crucial step in preparing our data for modeling and evaluation.

## Preparation Steps – Part 1/2
### Imports for this project
For this project, we’ll need to import several essential libraries that we’re already familiar with. These libraries provide the foundation for our data analysis and machine learning tasks. The necessary libraries include:

**NumPy**: NumPy is a fundamental library for numerical and array operations in Python.<br>
**Pandas**: Pandas is used for data manipulation and analysis, allowing us to work with structured data efficiently.<br>
**Scikit-Learn**: Scikit-Learn is a powerful machine learning library that provides a wide range of tools and algorithms for our classification task. This library we’ll import at a later point.<br>
**Matplotlib**: Matplotlib is essential for data visualization, enabling us to create informative plots and charts.<br>
**Seaborn**: Seaborn complements Matplotlib and simplifies the creation of aesthetically pleasing statistical visualizations.<br>


By ensuring that we have these libraries at our disposal, we’ll be well-equipped to tackle the various tasks involved in our credit risk scoring project.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

In [2]:
#download data by using wget command 
data = 'https://github.com/gastonstat/CreditScoring/blob/master/CreditScoring.csv'
!wget $data

--2025-12-10 07:47:41--  https://github.com/gastonstat/CreditScoring/blob/master/CreditScoring.csv
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘CreditScoring.csv.2’

CreditScoring.csv.2     [        <=>         ] 488.41K  96.5KB/s    in 5.8s    

2025-12-10 07:48:07 (84.9 KB/s) - ‘CreditScoring.csv.2’ saved [500128]



In [3]:
!head CreditScoring.csv
 
#df = pd.read_csv(data)
df = pd.read_csv('CreditScoring.csv')
df.head()

"Status","Seniority","Home","Time","Age","Marital","Records","Job","Expenses","Income","Assets","Debt","Amount","Price"
1,9,1,60,30,2,1,3,73,129,0,0,800,846
1,17,1,60,58,3,1,1,48,131,0,0,1000,1658
2,10,2,36,46,2,2,3,90,200,3000,0,2000,2985
1,0,1,60,24,1,1,1,63,182,2500,0,900,1325
1,0,1,36,26,1,1,1,46,107,0,0,310,910
1,1,2,60,36,2,1,1,75,214,3500,0,650,1645
1,29,2,60,44,2,1,1,75,125,10000,0,1600,1800
1,9,5,12,27,1,1,1,35,80,0,0,200,1093
1,0,2,60,32,2,1,3,90,107,15000,0,1200,1957


Unnamed: 0,Status,Seniority,Home,Time,Age,Marital,Records,Job,Expenses,Income,Assets,Debt,Amount,Price
0,1,9,1,60,30,2,1,3,73,129,0,0,800,846
1,1,17,1,60,58,3,1,1,48,131,0,0,1000,1658
2,2,10,2,36,46,2,2,3,90,200,3000,0,2000,2985
3,1,0,1,60,24,1,1,1,63,182,2500,0,900,1325
4,1,0,1,36,26,1,1,1,46,107,0,0,310,910


In [4]:
df.columns = df.columns.str.lower()
df.head()

Unnamed: 0,status,seniority,home,time,age,marital,records,job,expenses,income,assets,debt,amount,price
0,1,9,1,60,30,2,1,3,73,129,0,0,800,846
1,1,17,1,60,58,3,1,1,48,131,0,0,1000,1658
2,2,10,2,36,46,2,2,3,90,200,3000,0,2000,2985
3,1,0,1,60,24,1,1,1,63,182,2500,0,900,1325
4,1,0,1,36,26,1,1,1,46,107,0,0,310,910


Building upon the necessary preparation steps outlined in the previous article, this section is dedicated to two critical processes: re-encoding categorical variables and performing the train/validation/test split, a crucial step in preparing our data for modeling and evaluation.

## Data cleaning and preparation – Data Transformation and Splitting – Part 2/2
Re-encoding the categorical variables
To handle the categorical variables, we need to address a few important considerations. The R file in the repository provides insights into preprocessing this data.

Here are the key points:

 **Missing Values**: The missing values are encoded as a series of nines (99999999). We’ll need to address how to handle these missing values.<br>
 **Categorical Variable Information**: The R file also offers information about the categorical variables. For example:<br>
 * Status: is encoded as ‘good’ (1) and ‘bad’ (2).
 * Home: includes categories like ‘rent,’ ‘owner,’ ‘priv,’ ‘ignore,’ ‘parents,’ and ‘other.’
 * Marital: encompasses ‘single,’ ‘married,’ ‘widow,’ ‘separated,’ and ‘divorced.’
 * Records: has ‘yes’ and ‘no.’
 * Job: includes ‘fixed,’ ‘partime,’ ‘freelance,’ and ‘other.’<br>
To proceed, we’ll need to translate these numerical values back into their respective categorical strings. This ensures our data is more interpretable and ready for analysis.

To address the ‘status’ variable, we examine the possible values: 1, 2, and 0. As previously mentioned, 1 corresponds to ‘good,’ and 2 corresponds to ‘bad.’ However, we also have one record with the value 0, which we’ll designate as ‘unknown.’

To map these values accordingly, we can use the ‘map’ method. This method takes a dictionary that maps each original dataframe value to a new value.


In [5]:
df.status.value_counts()

status
1    3200
2    1254
0       1
Name: count, dtype: int64

In [6]:
status_values = {
    1: 'ok',
    2: 'default',
    0: 'unk'
}

In [7]:
df.status = df.status.map(status_values)
df.head()

Unnamed: 0,status,seniority,home,time,age,marital,records,job,expenses,income,assets,debt,amount,price
0,ok,9,1,60,30,2,1,3,73,129,0,0,800,846
1,ok,17,1,60,58,3,1,1,48,131,0,0,1000,1658
2,default,10,2,36,46,2,2,3,90,200,3000,0,2000,2985
3,ok,0,1,60,24,1,1,1,63,182,2500,0,900,1325
4,ok,0,1,36,26,1,1,1,46,107,0,0,310,910


The same re-encoding process applied to the ‘status’ column should also be carried out for the remaining categorical columns: ‘home_values,’ ‘marital_values,’ ‘records_values,’ and ‘job_values’.

In [8]:
home_values = {
    1: 'rent',
    2: 'owner',
    3: 'private',
    4: 'ignore',
    5: 'parents',
    6: 'other',
    0: 'unk'
}
df.home = df.home.map(home_values)
 
marital_values = {
    1: 'single', 
    2: 'married', 
    3: 'widow', 
    4: 'separated',
    5: 'divorced',
    0: 'unk'
}
df.marital = df.marital.map(marital_values)
 
records_values = {
    1: 'no',
    2: 'yes',
    0: 'unk'
}
df.records = df.records.map(records_values)
 
job_values = {
    1: 'fixed', 
    2: 'partime', 
    3: 'freelance', 
    4: 'others',
    0: 'unk'
}
df.job = df.job.map(job_values)
 
df.head()

Unnamed: 0,status,seniority,home,time,age,marital,records,job,expenses,income,assets,debt,amount,price
0,ok,9,rent,60,30,married,no,freelance,73,129,0,0,800,846
1,ok,17,rent,60,58,widow,no,fixed,48,131,0,0,1000,1658
2,default,10,owner,36,46,married,yes,freelance,90,200,3000,0,2000,2985
3,ok,0,rent,60,24,single,no,fixed,63,182,2500,0,900,1325
4,ok,0,rent,36,26,single,no,fixed,46,107,0,0,310,910


In [9]:
df.marital.value_counts()

marital
married      3241
single        978
separated     130
widow          67
divorced       38
unk             1
Name: count, dtype: int64

## Missing values
With all categorical variables decoded back to strings, the next step is to address the missing values.

In [10]:
df.describe().round()

Unnamed: 0,seniority,time,age,expenses,income,assets,debt,amount,price
count,4455.0,4455.0,4455.0,4455.0,4455.0,4455.0,4455.0,4455.0,4455.0
mean,8.0,46.0,37.0,56.0,763317.0,1060341.0,404382.0,1039.0,1463.0
std,8.0,15.0,11.0,20.0,8703625.0,10217569.0,6344253.0,475.0,628.0
min,0.0,6.0,18.0,35.0,0.0,0.0,0.0,100.0,105.0
25%,2.0,36.0,28.0,35.0,80.0,0.0,0.0,700.0,1118.0
50%,5.0,48.0,36.0,51.0,120.0,3500.0,0.0,1000.0,1400.0
75%,12.0,60.0,45.0,72.0,166.0,6000.0,0.0,1300.0,1692.0
max,48.0,72.0,68.0,180.0,99999999.0,99999999.0,99999999.0,5000.0,11140.0


It’s apparent that ‘income,’ ‘assets,’ and ‘debt’ columns contain extremely large values (e.g., 99999999.0) as maximum values. To address this issue, we need to replace these outlier values. Let’s explore the replacement process.

In [11]:
df.income.max()


99999999

In [12]:
df.income.replace(to_replace=99999999, value=np.nan)
 
df.income.replace(to_replace=99999999, value=np.nan).max()

959.0

In [13]:
for c in ['income', 'assets', 'debt']:
    df[c] = df[c].replace(to_replace=99999999, value=np.nan)
 
df.describe().round()

Unnamed: 0,seniority,time,age,expenses,income,assets,debt,amount,price
count,4455.0,4455.0,4455.0,4455.0,4421.0,4408.0,4437.0,4455.0,4455.0
mean,8.0,46.0,37.0,56.0,131.0,5403.0,343.0,1039.0,1463.0
std,8.0,15.0,11.0,20.0,86.0,11573.0,1246.0,475.0,628.0
min,0.0,6.0,18.0,35.0,0.0,0.0,0.0,100.0,105.0
25%,2.0,36.0,28.0,35.0,80.0,0.0,0.0,700.0,1118.0
50%,5.0,48.0,36.0,51.0,120.0,3000.0,0.0,1000.0,1400.0
75%,12.0,60.0,45.0,72.0,165.0,6000.0,0.0,1300.0,1692.0
max,48.0,72.0,68.0,180.0,959.0,300000.0,30000.0,5000.0,11140.0


While analyzing the ‘status’ column, we discovered a single record with the value ‘unk,’ representing a missing or unknown status. Since our focus is solely on ‘ok’ and ‘default’ values, we can safely remove this record from the dataframe.

In [14]:
df.status.value_counts()

status
ok         3200
default    1254
unk           1
Name: count, dtype: int64

In [15]:
df = df[df.status != 'unk'].reset_index(drop=True)
df

Unnamed: 0,status,seniority,home,time,age,marital,records,job,expenses,income,assets,debt,amount,price
0,ok,9,rent,60,30,married,no,freelance,73,129.0,0.0,0.0,800,846
1,ok,17,rent,60,58,widow,no,fixed,48,131.0,0.0,0.0,1000,1658
2,default,10,owner,36,46,married,yes,freelance,90,200.0,3000.0,0.0,2000,2985
3,ok,0,rent,60,24,single,no,fixed,63,182.0,2500.0,0.0,900,1325
4,ok,0,rent,36,26,single,no,fixed,46,107.0,0.0,0.0,310,910
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4449,default,1,rent,60,39,married,no,fixed,69,92.0,0.0,0.0,900,1020
4450,ok,22,owner,60,46,married,no,fixed,60,75.0,3000.0,600.0,950,1263
4451,default,0,owner,24,37,married,no,partime,60,90.0,3500.0,0.0,500,963
4452,ok,0,rent,48,23,single,no,freelance,49,140.0,0.0,0.0,550,550


The final step in our data preparation is to split the dataset into training, validation, and test sets. We achieve this using the ‘train_test_split’ function from scikit-learn.

Here’s the code for the split:

In [16]:
from sklearn.model_selection import train_test_split

df_full_train, df_test = train_test_split(df, test_size=0.2, random_state = 11)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state = 11)

df_train = df_train.reset_index(drop= True)
df_test = df_test.reset_index(drop= True)
df_val = df_val.reset_index(drop= True)

df_train.status

0       default
1       default
2            ok
3       default
4            ok
         ...   
2667         ok
2668         ok
2669         ok
2670         ok
2671         ok
Name: status, Length: 2672, dtype: object

To predict a probability, we need to convert our target variable ‘status’ into a numerical format.


In [17]:
(df_train.status == 'default').astype('int')

0       1
1       1
2       0
3       1
4       0
       ..
2667    0
2668    0
2669    0
2670    0
2671    0
Name: status, Length: 2672, dtype: int64

To complete our data preparation, we need to assign target variables for the training, validation, and test sets. Additionally, to prevent accidental use of the target variable during training, we should remove it from ‘df_train,’ ‘df_val,’ and ‘df_test.’



In [18]:
y_train = (df_train.status == 'default').astype('int').values
y_test  = (df_test.status == 'default').astype('int').values
y_val  =  (df_val.status  == 'default').astype('int').values

del df_train['status']
del df_test['status']
del df_val['status']

df_train

Unnamed: 0,seniority,home,time,age,marital,records,job,expenses,income,assets,debt,amount,price
0,10,owner,36,36,married,no,freelance,75,0.0,10000.0,0.0,1000,1400
1,6,parents,48,32,single,yes,fixed,35,85.0,0.0,0.0,1100,1330
2,1,parents,48,40,married,no,fixed,75,121.0,0.0,0.0,1320,1600
3,1,parents,48,23,single,no,partime,35,72.0,0.0,0.0,1078,1079
4,5,owner,36,46,married,no,freelance,60,100.0,4000.0,0.0,1100,1897
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2667,18,private,36,45,married,no,fixed,45,220.0,20000.0,0.0,800,1600
2668,7,private,60,29,married,no,fixed,60,51.0,3500.0,500.0,1000,1290
2669,1,parents,24,19,single,no,fixed,35,28.0,0.0,0.0,400,600
2670,15,owner,48,43,married,no,freelance,60,100.0,18000.0,0.0,2500,2976


## Decision Trees – Part 1/2
 * Introduction to Decision Trees
 * How a decision tree looks like
 * Training a decision tree<br>
The next part is also divided into two parts. First I give a brief introduction to decision trees, how a decision tree look like. The last section is about how to train a decision tree.
The second part will be about overfitting a decision tree and how to control the size of a tree.

Decision Trees – Part 1/2
This time we want to use the ready-to-use data set from the last article to predict if customers are going to default or not. We want to use decision trees for that.

Introduction to Decision Trees
Decision trees are powerful tools in the field of machine learning and data analysis. They are a versatile and interpretable way to make decisions and predictions based on a set of input features. Imagine a tree-like structure where each internal node represents a feature or attribute, each branch signifies a decision or outcome, and each leaf node provides a final prediction or classification.

Decision trees are widely used for tasks such as classification and regression. They are known for their simplicity and ease of interpretation, making them a valuable resource for understanding and solving complex problems. In this blog post, we’ll delve into the world of decision trees, exploring how they work, and how to build them.

How a decision tree looks like
A decision tree is a data structure where we have a node which is the condition. And from this node there is one arrow to the left (condition = false) and one to the right (condition=true). Then there is the next condition which can be true or false. … until there is the final decision ‘OK’ or ‘DEFAULT’

In [19]:
def assess_risk(client):
    if client['records'] == 'yes':
        if client['job'] == 'parttime':
            return 'default'
        else:
            return 'ok'
    else:
        if client['assets'] > 6000:
            return 'ok'
        else: 
            return 'default'


# just to take one record and test
xi = df_train.iloc[0].to_dict()
xi
    

{'seniority': 10,
 'home': 'owner',
 'time': 36,
 'age': 36,
 'marital': 'married',
 'records': 'no',
 'job': 'freelance',
 'expenses': 75,
 'income': 0.0,
 'assets': 10000.0,
 'debt': 0.0,
 'amount': 1000,
 'price': 1400}

When we look at the decision tree, what would be the result for this client? First condition is “RECORDS = YES.” Our client has no records, so we go to the left. The second condition is “ASSETS > 6000.” The assets of our client are 10,000, so we go to the right. Now we reach the decision node, which in this case is “OK.” 

In [20]:
assess_risk(xi)

'ok'

## Training a decision tree
Before we can train a decision tree, we first need to import necessary packages. From Scikit-Learn, we import DecisionTreeClassifier. Because we have categorical variables, we also need to import DictVectorizer as seen before.

In [21]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import roc_auc_score

Now we need to turn our training dataframe into a list of dictionaries then turn this list of dictionaries into the feature matrix. After that we train the model.

In [22]:
train_dicts = df_train.fillna(0).to_dict(orient='records')
train_dicts[:5]

[{'seniority': 10,
  'home': 'owner',
  'time': 36,
  'age': 36,
  'marital': 'married',
  'records': 'no',
  'job': 'freelance',
  'expenses': 75,
  'income': 0.0,
  'assets': 10000.0,
  'debt': 0.0,
  'amount': 1000,
  'price': 1400},
 {'seniority': 6,
  'home': 'parents',
  'time': 48,
  'age': 32,
  'marital': 'single',
  'records': 'yes',
  'job': 'fixed',
  'expenses': 35,
  'income': 85.0,
  'assets': 0.0,
  'debt': 0.0,
  'amount': 1100,
  'price': 1330},
 {'seniority': 1,
  'home': 'parents',
  'time': 48,
  'age': 40,
  'marital': 'married',
  'records': 'no',
  'job': 'fixed',
  'expenses': 75,
  'income': 121.0,
  'assets': 0.0,
  'debt': 0.0,
  'amount': 1320,
  'price': 1600},
 {'seniority': 1,
  'home': 'parents',
  'time': 48,
  'age': 23,
  'marital': 'single',
  'records': 'no',
  'job': 'partime',
  'expenses': 35,
  'income': 72.0,
  'assets': 0.0,
  'debt': 0.0,
  'amount': 1078,
  'price': 1079},
 {'seniority': 5,
  'home': 'owner',
  'time': 36,
  'age': 46,
  'm

In [23]:
#All the numerical features remain unchanged, but we have encoding for categorical features.

dv = DictVectorizer(sparse= False)
X_train = dv.fit_transform(train_dicts)
X_train

array([[3.60e+01, 1.00e+03, 1.00e+04, ..., 0.00e+00, 1.00e+01, 3.60e+01],
       [3.20e+01, 1.10e+03, 0.00e+00, ..., 1.00e+00, 6.00e+00, 4.80e+01],
       [4.00e+01, 1.32e+03, 0.00e+00, ..., 0.00e+00, 1.00e+00, 4.80e+01],
       ...,
       [1.90e+01, 4.00e+02, 0.00e+00, ..., 0.00e+00, 1.00e+00, 2.40e+01],
       [4.30e+01, 2.50e+03, 1.80e+04, ..., 0.00e+00, 1.50e+01, 4.80e+01],
       [2.70e+01, 4.50e+02, 5.00e+03, ..., 1.00e+00, 1.20e+01, 4.80e+01]])

In [24]:
dv.get_feature_names_out()

array(['age', 'amount', 'assets', 'debt', 'expenses', 'home=ignore',
       'home=other', 'home=owner', 'home=parents', 'home=private',
       'home=rent', 'home=unk', 'income', 'job=fixed', 'job=freelance',
       'job=others', 'job=partime', 'job=unk', 'marital=divorced',
       'marital=married', 'marital=separated', 'marital=single',
       'marital=unk', 'marital=widow', 'price', 'records=no',
       'records=yes', 'seniority', 'time'], dtype=object)

The is part 2 of Decision Trees. While part 1 introduces the concept of a Decision Tree briefly, this section is about overfitting a decision tree and how to control the size of a tree.

Decision Trees – Part 2/2
Let’s look back at the performance of our trained Decision Tree from part 1.

In [25]:
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

val_dicts = df_val.fillna(0).to_dict(orient='records')
X_val = dv.transform(val_dicts)
y_pred = dt.predict_proba(X_val)[:, 1]
roc_auc_score(y_val, y_pred)

0.6710239519507883

### Overfitting
0.65 is not really a gre
at value, let’s look at training data and calulate auc score.



In [26]:
y_pred = dt.predict_proba(X_train)[:, 1]
roc_auc_score(y_train, y_pred)

1.0

This is called **overfitting**. Overfitting is when our model simply memorizes the data, but it memorizes in such a way that when it sees a new example it doesn’t know what to do with this example. So it memorizes the training data but it fails to generalize. The reason why this happens to decision trees is that the model creates a specific rule for each example. That works fine for training data, but it doesn’t work for any unseen example. The reason why this can happens is, that we let the tree grow too deep. If we restrict the tree to only grow up to three levels deep, the tree will learn rules that are less specific.

In [28]:
dt = DecisionTreeClassifier(max_depth =3)
dt.fit(X_train, y_train)

y_pred = dt.predict_proba(X_train)[:, 1]
auc = roc_auc_score(y_train, y_pred)
print('train', auc)

train 0.7761016984958594


In [29]:
y_pred = dt.predict_proba(X_val)[:, 1]
auc = roc_auc_score(y_val, y_pred)
print('val', auc)

val 0.7389079944782155


### Decision Stump
If we restrict the depth to 3, the model performance on validation is significantly better. It’s now 74% compared to 65%. By the way a decistion tree with a depth of 1 is called Decision Stump. It’s not really a tree, because this is only one condition.

In [30]:
dt = DecisionTreeClassifier(max_depth=1)
dt.fit(X_train, y_train)
 
y_pred = dt.predict_proba(X_train)[:, 1]
auc = roc_auc_score(y_train, y_pred)
print('train', auc)

train 0.6282660131823559


In [31]:
y_pred = dt.predict_proba(X_val)[:, 1]
auc = roc_auc_score(y_val, y_pred)
print('val', auc)

val 0.6058644740984719


The auc score of this decision stump is only a bit worse than the overfitted one.



### Visualizing Decision Stump
Let’s examine this tree to understand the rules it has learned. To do that, we can use a specialized function in Scikit-Learn for visualizing trees.

In [32]:
from sklearn.tree import export_text
print(export_text(dt))

|--- feature_26 <= 0.50
|   |--- class: 0
|--- feature_26 >  0.50
|   |--- class: 1



To understand the meaning of ‘feature_25,’ we need to consult the DictVectorizer feature names dictionary.



In [33]:
names = dv.get_feature_names_out().tolist()
print(export_text(dt, feature_names=names))

|--- records=yes <= 0.50
|   |--- class: 0
|--- records=yes >  0.50
|   |--- class: 1



### Decision tree with depth of 2


In [34]:
dt = DecisionTreeClassifier(max_depth=2)
dt.fit(X_train, y_train)
 
y_pred = dt.predict_proba(X_train)[:, 1]
auc = roc_auc_score(y_train, y_pred)
print('train', auc)

train 0.7054989859726213


In [35]:
y_pred = dt.predict_proba(X_val)[:, 1]
auc = roc_auc_score(y_val, y_pred)
print('val', auc)

val 0.6685264343319367


Even two levels in a decision tree already performs better than the overfitted one.

### Visualizing Decision tree

In [36]:
print(export_text(dt))

|--- feature_26 <= 0.50
|   |--- feature_16 <= 0.50
|   |   |--- class: 0
|   |--- feature_16 >  0.50
|   |   |--- class: 1
|--- feature_26 >  0.50
|   |--- feature_27 <= 6.50
|   |   |--- class: 1
|   |--- feature_27 >  6.50
|   |   |--- class: 0



In [37]:
names = dv.get_feature_names_out().tolist()
print(export_text(dt, feature_names=names))

|--- records=yes <= 0.50
|   |--- job=partime <= 0.50
|   |   |--- class: 0
|   |--- job=partime >  0.50
|   |   |--- class: 1
|--- records=yes >  0.50
|   |--- seniority <= 6.50
|   |   |--- class: 1
|   |--- seniority >  6.50
|   |   |--- class: 0

