# Hands On - Predicting Credit Default - Classification RECAP SESSION

# Import & Prepare Data

In [1]:
import pandas as pd
default = pd.read_csv("https://raw.githubusercontent.com/casbdai/datasets/main/prosper.data.csv")



*   Investigate the structure of the data. 
*   Delete missing values if required 
*   Transform objects to dummies 
*   Separate features from the label (loan.default) into X and y
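
Before deleting rows, it can help to see which columns actually contain missing values. The following is a minimal sketch on a hypothetical toy frame (the values are invented; only the column names are borrowed from the loan data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the loan data
toy = pd.DataFrame({
    "loan.amount": [1000, 2500, 4000, 1500],
    "debt.to.income.ratio": [0.12, np.nan, 0.30, np.nan],
})

# Count missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop every row that has at least one missing value
toy_clean = toy.dropna()
print(len(toy), "rows before,", len(toy_clean), "rows after")
```

Applied to the real data, `default.isna().sum()` should give the same kind of per-column overview.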


In [2]:
default.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 15 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   loan.default                 5000 non-null   object 
 1   employment.status            5000 non-null   object 
 2   borrower.rate                5000 non-null   float64
 3   loan.amount                  5000 non-null   int64  
 4   term                         5000 non-null   int64  
 5   monthly.income               5000 non-null   float64
 6   home.ownership               5000 non-null   bool   
 7   public.records.last.10years  5000 non-null   int64  
 8   inquiries.last.6months       5000 non-null   int64  
 9   current.delinquencies        5000 non-null   int64  
 10  open.credit.lines            5000 non-null   int64  
 11  debt.to.income.ratio         4583 non-null   float64
 12  monthly.loan.payment         5000 non-null   float64
 13  investors         

Delete missing values

In [3]:
default = default.dropna()
default.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4583 entries, 0 to 4999
Data columns (total 15 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   loan.default                 4583 non-null   object 
 1   employment.status            4583 non-null   object 
 2   borrower.rate                4583 non-null   float64
 3   loan.amount                  4583 non-null   int64  
 4   term                         4583 non-null   int64  
 5   monthly.income               4583 non-null   float64
 6   home.ownership               4583 non-null   bool   
 7   public.records.last.10years  4583 non-null   int64  
 8   inquiries.last.6months       4583 non-null   int64  
 9   current.delinquencies        4583 non-null   int64  
 10  open.credit.lines            4583 non-null   int64  
 11  debt.to.income.ratio         4583 non-null   float64
 12  monthly.loan.payment         4583 non-null   float64
 13  investors         

Create dummies

In [4]:
default = pd.get_dummies(default, drop_first=True)
default.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4583 entries, 0 to 4999
Data columns (total 20 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   borrower.rate                    4583 non-null   float64
 1   loan.amount                      4583 non-null   int64  
 2   term                             4583 non-null   int64  
 3   monthly.income                   4583 non-null   float64
 4   home.ownership                   4583 non-null   bool   
 5   public.records.last.10years      4583 non-null   int64  
 6   inquiries.last.6months           4583 non-null   int64  
 7   current.delinquencies            4583 non-null   int64  
 8   open.credit.lines                4583 non-null   int64  
 9   debt.to.income.ratio             4583 non-null   float64
 10  monthly.loan.payment             4583 non-null   float64
 11  investors                        4583 non-null   int64  
 12  investment.friends.a

Separate Features and Labels

In [5]:
X = default.drop("loan.default_defaulted", axis=1)
y = default["loan.default_defaulted"]
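
To see where a column name like `loan.default_defaulted` comes from, here is a small sketch of `pd.get_dummies` with `drop_first=True` on hypothetical toy data. The category values other than "defaulted" are invented for illustration and may differ from the real data set.

```python
import pandas as pd

# Hypothetical toy data; category names other than "defaulted" are invented
toy = pd.DataFrame({
    "loan.default": ["completed", "defaulted", "completed", "defaulted"],
    "employment.status": ["employed", "self-employed", "employed", "retired"],
    "loan.amount": [1000, 2500, 4000, 1500],
})

# Each object column becomes one 0/1 column per category;
# drop_first=True drops the alphabetically first category of each column
toy_dummies = pd.get_dummies(toy, drop_first=True)
print(list(toy_dummies.columns))

# Same separation as above: everything except the label goes into X
X_toy = toy_dummies.drop("loan.default_defaulted", axis=1)
y_toy = toy_dummies["loan.default_defaulted"]
```

Because the first category is dropped, a two-level label ends up as a single indicator column, which is why one column is enough for `y`.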

# Train a Decision Tree


## 1) Import Model Function

In [6]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

## 2) Instantiate Model

In [7]:
tree = DecisionTreeClassifier(criterion="entropy", random_state=1)

## 3) Create Test & Training Data


In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
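
As a quick sanity check, the split sizes can be worked out by hand. This sketch assumes that `train_test_split` rounds the test share up to a whole number of rows, which is my understanding of scikit-learn's behavior and is consistent with the support of 1375 shown in the reports further down.

```python
import math

n_samples = 4583   # rows left after dropna(), see default.info()
test_size = 0.3    # fraction passed to train_test_split

# Assumption: the test share is rounded up, the rest is used for training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```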

## 4) Fit Model to Training Data


In [9]:
tree.fit(X_train, y_train)

## 5) Make Predictions on Testing Data


In [10]:
y_pred = tree.predict(X_test)

## 6) Score Accuracy



*   Produce a Classification Report
*   Plot a ROC Curve



In [11]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.74      0.75      0.75       974
           1       0.38      0.37      0.37       401

    accuracy                           0.64      1375
   macro avg       0.56      0.56      0.56      1375
weighted avg       0.64      0.64      0.64      1375
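
The task list above also asks for a ROC curve. A possible sketch is shown below on synthetic stand-in data, not the loan data. The names `X_demo`, `demo_tree`, and `plot_roc` are made up for this example, and `max_depth=3` is chosen only so that the tree returns graded probabilities (a fully grown tree mostly predicts probabilities of 0 or 1, which leaves very few points on the curve).

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with roughly 30% positives (not the loan data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.7, 0.3], random_state=1
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

# Shallow tree so that predict_proba yields more than two distinct scores
demo_tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=1)
demo_tree.fit(Xd_train, yd_train)

# ROC curves need scores, not hard labels: use the probability of class 1
scores = demo_tree.predict_proba(Xd_test)[:, 1]
fpr, tpr, thresholds = roc_curve(yd_test, scores)
auc_value = roc_auc_score(yd_test, scores)
print(f"AUC: {auc_value:.2f}")


def plot_roc(fpr, tpr, auc_value):
    """Plot the ROC curve; matplotlib is imported here so the rest runs without it."""
    import matplotlib.pyplot as plt

    plt.plot(fpr, tpr, label=f"tree (AUC = {auc_value:.2f})")
    plt.plot([0, 1], [0, 1], linestyle="--", label="chance")
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend()
    plt.show()
```

Calling `plot_roc(fpr, tpr, auc_value)` draws the curve. In the notebook, the same steps should work with `tree`, `X_test`, and `y_test` in place of the demo objects.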



Is it a good model? Why or why not?

**Answer:** Not a very good model. Overall accuracy is 64%, but recall for the default class is only 37%: just over a third of the defaulting loans are identified. Precision is 38%: only about 4 out of 10 loans flagged as defaulting actually default.
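
One way to put the 64% accuracy in context is to compare it with a naive baseline that always predicts "no default". The sketch below computes that baseline from the support column of the report; the baseline idea itself is an addition for illustration, not part of the original exercise.

```python
# Supports taken from the classification report above
n_non_default = 974
n_default = 401

# Accuracy of a model that always predicts class 0 (no default)
baseline_accuracy = n_non_default / (n_non_default + n_default)
print(f"{baseline_accuracy:.3f}")
```

If the result is read as roughly 71%, the decision tree is less accurate than simply predicting that nobody defaults, which supports the verdict above.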

# Build a better model - Apply a Random Forest

## 1) Import Model Function

In [12]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

## 2) Instantiate Model

Build a random forest with 500 trees.

In [13]:
forest = RandomForestClassifier(n_estimators=500, random_state = 1)

## 3) Create Test & Training Data

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

## 4) Fit Model to Training Data

In [15]:
forest.fit(X_train, y_train)

## 5) Make Predictions on Testing Data

In [16]:
y_pred = forest.predict(X_test)

## 6) Evaluate Performance

In [17]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.75      0.92      0.83       974
           1       0.56      0.24      0.33       401

    accuracy                           0.72      1375
   macro avg       0.65      0.58      0.58      1375
weighted avg       0.69      0.72      0.68      1375



Compare the random forest to the decision tree. Which model is better, and why?

**Answer:** Overall accuracy has increased from 64% to 72%, which is a substantial gain. Precision for the default class rose to 56%, so loans flagged as defaulting are more often correct. Recall, however, dropped to 24%: only about a quarter of the defaulting loans are identified.
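
The trade-off between the two models becomes more tangible when the percentages are turned back into approximate loan counts. This is a rough sketch: the inputs are the rounded values from the two reports, so the counts are estimates only.

```python
n_default = 401  # support of class 1 in both reports

# Rounded class-1 metrics as printed in the two classification reports
reported = {
    "decision tree": {"precision": 0.38, "recall": 0.37},
    "random forest": {"precision": 0.56, "recall": 0.24},
}

approx = {}
for name, m in reported.items():
    caught = m["recall"] * n_default      # approx. defaults correctly flagged
    flagged = caught / m["precision"]     # approx. loans flagged in total
    approx[name] = (round(caught), round(flagged))
    print(f"{name}: ~{round(caught)} of {n_default} defaults caught, ~{round(flagged)} loans flagged")
```

Read this way, the forest raises far fewer alarms and is right more often when it does, but it misses more of the actual defaults than the tree.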

# Try To Further Improve the RandomForest with Cross Validation

Apply GridSearchCV to the random forest and vary:

*   max_depth: 20, 40, and 60
*   n_estimators (number of trees): 500 and 1000

This may take a couple of minutes.




## 1) Import GridSearchCV and Define a Parameter Grid

In [18]:
from sklearn.model_selection import GridSearchCV

parameters = {'max_depth':[20,40,60], 
              'n_estimators':[500,1000]}
forest_CV = GridSearchCV(RandomForestClassifier(random_state=1), parameters, n_jobs=5)

How many Cross Validation Folds does the model training apply?

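**Answer:** 5. No `cv` argument is passed, so GridSearchCV falls back to its default of 5-fold cross validation (scikit-learn 0.22 and later). The `n_jobs=5` argument only sets the number of parallel jobs, not the number of folds.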
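
The number of folds also explains why the search takes a while. The sketch below counts the model fits for the grid defined above, assuming the default of 5 folds; `ParameterGrid` is used here only to enumerate the combinations.

```python
from sklearn.model_selection import ParameterGrid

parameters = {'max_depth': [20, 40, 60],
              'n_estimators': [500, 1000]}
n_folds = 5  # GridSearchCV default when cv is not set

# Every parameter combination is trained once per fold
n_candidates = len(ParameterGrid(parameters))
n_fits = n_candidates * n_folds
print(n_candidates, "candidates,", n_fits, "cross-validation fits")
```

With the default `refit=True`, one more fit follows at the end: the best combination is retrained on the full training set.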

## 2) Fit forest_CV object

In [19]:
forest_CV.fit(X_train, y_train)

## 3) Get best parameters of RandomForest

In [20]:
forest_CV.best_params_

{'max_depth': 20, 'n_estimators': 500}
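
Besides `best_params_`, the fitted search object exposes the cross-validated score of every candidate. The following sketch runs on synthetic data with a deliberately tiny grid so that it finishes in seconds; `X_demo`, `small_grid`, and `search` are names made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the loan data)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=1)

# Tiny grid, chosen only to keep the example fast
small_grid = {"max_depth": [2, 4], "n_estimators": [10, 20]}
search = GridSearchCV(RandomForestClassifier(random_state=1), small_grid, cv=5)
search.fit(X_demo, y_demo)

# Best combination and its mean cross-validated accuracy
print(search.best_params_)
print(round(search.best_score_, 3))

# Mean cross-validated accuracy of every candidate
for params, score in zip(search.cv_results_["params"],
                         search.cv_results_["mean_test_score"]):
    print(params, round(score, 3))
```

In the notebook, `forest_CV.best_score_` and `forest_CV.cv_results_` should give the same information for the real grid, which shows how close the candidates are to each other.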

## 4) Make Predictions on Testing Data

In [21]:
y_pred = forest_CV.predict(X_test)

## 5) Evaluate Performance

In [22]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.75      0.92      0.83       974
           1       0.57      0.25      0.35       401

    accuracy                           0.73      1375
   macro avg       0.66      0.59      0.59      1375
weighted avg       0.70      0.73      0.69      1375



Has the model improved?

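**Answer:** Yes, but only marginally: accuracy rose from 72% to 73%, and precision and recall for the default class each gained one percentage point (to 57% and 25%).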