## Credit Risk Classification 

## Synopsis 

For this Data analysis, we wanto to classify customers as low credit risk (class 0) or as high credit risk (class 1).
We will proceed in steps:
    - Step 1: We will start with data exploration and cleansing:
            In this steps we will be handling missing values
    - Step 2: We will build features and target variables
        We will build the train and test set and propose a method to deal with imbalanced class
    - Step 3: We will explore several models and evaulate their performance
        We will look at 3 models:
            1. Decision Tree Classifier
            2. Random Forest Classifier 
            3. Gradient Boosting Tree Classifier
        We will try to find the best set of parameters with a f1-driven study, using GridSearchCV
    - Step 4: We will compare the model and propose several directions to progress with this analysis 
 
###  Note on performance:
One main driver in our model evalutation will be to find the right balance between recall and precision. 
Indeed, we want to control the recall to ensure our model does not classify as high risk, customers who are low risk and we also want to make sure to identify all customers with high credit risk profile. In other words, we want to find the right balance between customer satisfaction and risk for the bank. 

In [3]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from imblearn.over_sampling import SMOTE

## Step 1: Data exploration 

In [4]:
#loading the data 
parse_dates = ['update_date', 'report_date']
df_cust = pd.read_csv('customer_data_ratio20.csv')
df_payment = pd.read_csv('payment_data_ratio20.csv', parse_dates=parse_dates)
#We map each payment instance to the customer demographic by merging on 'id'
df = pd.merge(df_cust, df_payment, how = 'right', left_on='id', right_on='id')

df.sample(5)

Unnamed: 0,label,id,fea_1,fea_2,fea_3,fea_4,fea_5,fea_6,fea_7,fea_8,...,OVD_t2,OVD_t3,OVD_sum,pay_normal,prod_code,prod_limit,update_date,new_balance,highest_balance,report_date
1175,0,58992868,7,1316.0,1,117000.0,2,11,-1,110,...,0,0,0,1,6,,2015-02-14,0.0,30500.0,2015-10-19
7356,0,58989774,7,1281.5,1,80000.0,2,12,-1,111,...,0,0,0,36,1,,2009-02-08,0.0,250500.0,2013-01-21
887,0,58986215,5,1250.0,1,167000.0,2,15,-1,90,...,0,0,0,12,6,,2014-11-05,0.0,14100.0,2015-07-24
2021,0,58983850,5,1221.5,3,97000.0,2,15,9,109,...,0,0,0,1,10,72600.0,2015-10-02,1518.0,1764.0,NaT
251,0,58998166,4,1355.0,3,177000.0,2,8,5,87,...,0,0,0,1,1,,2007-06-11,0.0,200500.0,2010-05-24


In [5]:
df.shape

(8250, 24)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8250 entries, 0 to 8249
Data columns (total 24 columns):
label              8250 non-null int64
id                 8250 non-null int64
fea_1              8250 non-null int64
fea_2              7222 non-null float64
fea_3              8250 non-null int64
fea_4              8250 non-null float64
fea_5              8250 non-null int64
fea_6              8250 non-null int64
fea_7              8250 non-null int64
fea_8              8250 non-null int64
fea_9              8250 non-null int64
fea_10             8250 non-null int64
fea_11             8250 non-null float64
OVD_t1             8250 non-null int64
OVD_t2             8250 non-null int64
OVD_t3             8250 non-null int64
OVD_sum            8250 non-null int64
pay_normal         8250 non-null int64
prod_code          8250 non-null int64
prod_limit         2132 non-null float64
update_date        8224 non-null datetime64[ns]
new_balance        8250 non-null float64
highest_balance 

In [7]:
df.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
label,8250.0,0.1681212,0.3739966,0.0,0.0,0.0,0.0,1.0
id,8250.0,57821730.0,1822724.0,54982353.0,54990497.0,58989050.0,58996550.0,59006240.0
fea_1,8250.0,5.522667,1.388358,1.0,4.0,5.0,7.0,7.0
fea_2,7222.0,1286.157,52.00243,1116.5,1248.5,1283.0,1317.5,1481.0
fea_3,8250.0,2.319636,0.8874141,1.0,1.0,3.0,3.0,3.0
fea_4,8250.0,138671.2,108156.5,15000.0,77000.0,111000.0,151000.0,1200000.0
fea_5,8250.0,1.940848,0.2359224,1.0,2.0,2.0,2.0,2.0
fea_6,8250.0,11.01394,2.694611,3.0,8.0,11.0,12.0,16.0
fea_7,8250.0,4.881091,3.031902,-1.0,5.0,5.0,5.0,10.0
fea_8,8250.0,100.0263,12.54008,64.0,90.0,105.0,110.0,115.0


### Missing values - Data cleaning 
1. Some dates are missing for 'report_date' and 'update_date'
     We will fill the missing values from each column using the other column
2. hightest_balance also has a number of missing values that we will fill with 'new_balance' 
3. fea_2 missing values will be replaced by the mean of the column 
4. missing values for prod_limit will be replaced by 0 (this could mean that 0 credit limit is set) 

In [9]:
df['report_date']= df['report_date'].fillna(df['update_date'])
df['update_date']= df['report_date'].fillna(df['update_date'])
df['highest_balance']= df['highest_balance'].fillna(df['new_balance'])
df['fea_2']= df['fea_2'].fillna(df['fea_2'].mean())
df['prod_limit']= df['prod_limit'].fillna(0)

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8250 entries, 0 to 8249
Data columns (total 24 columns):
label              8250 non-null int64
id                 8250 non-null int64
fea_1              8250 non-null int64
fea_2              8250 non-null float64
fea_3              8250 non-null int64
fea_4              8250 non-null float64
fea_5              8250 non-null int64
fea_6              8250 non-null int64
fea_7              8250 non-null int64
fea_8              8250 non-null int64
fea_9              8250 non-null int64
fea_10             8250 non-null int64
fea_11             8250 non-null float64
OVD_t1             8250 non-null int64
OVD_t2             8250 non-null int64
OVD_t3             8250 non-null int64
OVD_sum            8250 non-null int64
pay_normal         8250 non-null int64
prod_code          8250 non-null int64
prod_limit         8250 non-null float64
update_date        8226 non-null datetime64[ns]
new_balance        8250 non-null float64
highest_balance 

We still have about 24 rows with missing values for report_date, we will remove these rows 

In [11]:
df = df.dropna(axis=0)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8226 entries, 0 to 8249
Data columns (total 24 columns):
label              8226 non-null int64
id                 8226 non-null int64
fea_1              8226 non-null int64
fea_2              8226 non-null float64
fea_3              8226 non-null int64
fea_4              8226 non-null float64
fea_5              8226 non-null int64
fea_6              8226 non-null int64
fea_7              8226 non-null int64
fea_8              8226 non-null int64
fea_9              8226 non-null int64
fea_10             8226 non-null int64
fea_11             8226 non-null float64
OVD_t1             8226 non-null int64
OVD_t2             8226 non-null int64
OVD_t3             8226 non-null int64
OVD_sum            8226 non-null int64
pay_normal         8226 non-null int64
prod_code          8226 non-null int64
prod_limit         8226 non-null float64
update_date        8226 non-null datetime64[ns]
new_balance        8226 non-null float64
highest_balance 

In [12]:
df.shape

(8226, 24)

We will work with 8226 instances

## Step 2: Building features 

The data will be separated into:
    - our target, the column 'label' that has 2 classes: 0 for low risk and 1 for high risk 
    - features, the other columns of df

In [13]:
y = df['label']

class_ratio = np.bincount(y)/len(y)
class_ratio

array([0.83187454, 0.16812546])

We can see the class are imbalanced with 17% of high risk and 83% of low risk.
 
We will be using SMOTE to 'rebalance' the training set 

In [15]:
X = df.drop(axis=1, columns = 'label')

#We will use 'update_date' and 'report_date' as categorical  variables in the model

encoder = LabelEncoder()
X['update_date'] =  encoder.fit_transform(X['update_date'])
X['report_date'] =  encoder.fit_transform(X['report_date'])

In [16]:
X_train, X_test, y_train, y_test =  train_test_split(X,y)

We use SMOTE to build a balanced training set

In [20]:
sm = SMOTE(random_state = 0, ratio = 1)
X_SMOTE, y_SMOTE = sm.fit_sample(X_train, y_train)
ratio_y_SMOTE = np.bincount(y_SMOTE) /len(y_SMOTE)
ratio_y_SMOTE

array([0.5, 0.5])

The training set is well balanced now, and we will use X_SMOTE and y_SMOTE to fit the models.

##  Step 3: Building model 

We will be exploring 3 models:
    1. Decisition Tree Classifier
    2. Random Forest Classier
    3. Gradient Boosting Classifier
   
For each of these models, we will use GridSearchCV to search for the best paramaters to outperform the F1 score.
This metric will enable to control both recall and precision.
We want our model:
    1. to obtain good recall to minimise the risk of classifing
        clients as high risk when they are not (customer satisfaction) 
    2. to obtain a good precision to minimise the risk of classifing 
        clients as low risk when they are high risk (bank protection)

### Decision Tree Classifier 

#### Decision Tree Classififier with max_depth = 10

In [22]:
dt = DecisionTreeClassifier(max_depth=10, random_state=0).fit(X_SMOTE, y_SMOTE)
accuracy_score(y_test, dt.predict(X_test))

0.893048128342246

#### Decision Tree Classififier with max_depth = 20

In [24]:
dt = DecisionTreeClassifier(max_depth=20, random_state=0).fit(X_SMOTE, y_SMOTE)
accuracy_score(y_test, dt.predict(X_test))

0.9426349052017501

#### Search optimal parameters for f1 performance

In [25]:
dt_params = {'max_depth': np.arange(3,20),
             'criterion' : ['gini','entropy'],
             'max_leaf_nodes': [5,10,20,100],
             'min_samples_split': [2, 5, 10, 20]}
grid_dt = GridSearchCV(DecisionTreeClassifier(), param_grid = dt_params, cv = 5, scoring= 'f1')

grid_dt.fit(X_SMOTE, y_SMOTE)
best_params_dt = grid_dt.best_params_
dt_f1 = DecisionTreeClassifier(random_state = 0)
dt_f1.set_params(**best_params_dt)
dt_f1.fit(X_SMOTE, y_SMOTE)
dt_predicted = dt_f1.predict(X_test)

#### Performance for Decision Tree Classifier

In [26]:
print(best_params_dt)
print('Decision tree fitted on balanced sample - Accuracy: {:.2f}'.format(accuracy_score(y_test, dt_predicted)))
print('Decision tree fitted on balanced sample - Precision: {:.2f}'.format(precision_score(y_test, dt_predicted)))
print('Decision tree fitted on balanced sample - Recall: {:.2f}'.format(recall_score(y_test, dt_predicted)))
print('Decision tree fitted on balanced sample - F1: {:.2f}'.format(f1_score(y_test, dt_predicted)))

{'criterion': 'entropy', 'max_depth': 12, 'max_leaf_nodes': 100, 'min_samples_split': 2}
Decision tree fitted on balanced sample - Accuracy: 0.87
Decision tree fitted on balanced sample - Precision: 0.59
Decision tree fitted on balanced sample - Recall: 0.64
Decision tree fitted on balanced sample - F1: 0.62


We can observe that the recall is quite low and this model is not satisfying at this stage

### Random Forest Classifier

In [None]:
#### Search optimal parameters for f1 performance

In [27]:
rdf_params = {'n_estimators' : [10,15,20],
              'max_depth': np.arange(3,20),
              'criterion' : ['gini','entropy'],
              'max_features': [1,5,10,18]}
grid_rdf = GridSearchCV(RandomForestClassifier(), param_grid = rdf_params, cv = 5, scoring= 'f1')


In [29]:
grid_rdf.fit(X_SMOTE, y_SMOTE)

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'n_estimators': [10, 15, 20], 'max_depth': array([ 3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]), 'criterion': ['gini', 'entropy'], 'max_features': [1, 5, 10, 18]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='f1', verbose=0)

In [34]:
best_params_rdf = grid_rdf.best_params_
rdf_f1 = RandomForestClassifier(random_state = 0)
rdf_f1.set_params(**best_params_rdf)
rdf_f1.fit(X_SMOTE, y_SMOTE)
rdf_predicted = rdf_f1.predict(X_test)

#### Performance for Random Forest Classifier

In [35]:
print(best_params_rdf)
print('Random Forest fitted on balanced sample - Accuracy: {:.2f}'.format(accuracy_score(y_test, rdf_predicted)))
print('Random Forest fitted on balanced sample - Precision: {:.2f}'.format(precision_score(y_test, rdf_predicted)))
print('Random Forest fitted on balanced sample - Recall: {:.2f}'.format(recall_score(y_test, rdf_predicted)))
print('Random Forest fitted on balanced sample - F1: {:.2f}'.format(f1_score(y_test, rdf_predicted)))

{'criterion': 'entropy', 'max_depth': 17, 'max_features': 18, 'n_estimators': 15}
Random Forest fitted on balanced sample - Accuracy: 0.98
Random Forest fitted on balanced sample - Precision: 0.95
Random Forest fitted on balanced sample - Recall: 0.90
Random Forest fitted on balanced sample - F1: 0.93


We can observe much better performance, especially on the recall, as compared to Decision Tree Classification.
Let's see if the Gradient Boosting Decision Tree will give us even better result

### Gradient Boosting Decision Tree Classifier 

For the Gradient Boosting Tree Classifier, we will use the same parameters as for the Random Forest model, and we will set the learning_rate to 0.1

In [40]:
gbdt_f1 = GradientBoostingClassifier(random_state = 0, n_estimators = 15, max_depth = 17,
                                     max_features = 18, learning_rate = 0.1)

gbdt_f1.fit(X_SMOTE,y_SMOTE)
gbdt_predicted = gbdt_f1.predict(X_test)

#### Performance for Gradient Boosting Tree Classifier

In [41]:
print('Gradient Boosting Tree fitted on balanced sample - Accuracy: {:.2f}'.format(accuracy_score(y_test, gbdt_predicted)))
print('Gradient Boosting Tree fitted on balanced sample - Precision: {:.2f}'.format(precision_score(y_test, gbdt_predicted)))
print('Gradient Boosting Tree fitted  on balanced sample - Recall: {:.2f}'.format(recall_score(y_test, gbdt_predicted)))
print('Gradient Boosting Tree fitted  on balanced sample - F1: {:.2f}'.format(f1_score(y_test, gbdt_predicted)))

Gradient Boosting Tree fitted on balanced sample - Accuracy: 0.97
Gradient Boosting Tree fitted on balanced sample - Precision: 0.91
Gradient Boosting Tree fitted  on balanced sample - Recall: 0.92
Gradient Boosting Tree fitted  on balanced sample - F1: 0.91


So far, this model provides the best balance between precision and recall

## Step 4:  Model explanation and evaluation

Given the performance for the 3 models, the Gradient Boosting Decision Tree Classifier provides the best balance between recall and precision with f1 score of 0.91.

To improve the model, we can explore some areas such as: 
    1. Explore the features more (with a correlation heatmap for instance) and study the impact of removing 
        certain features to the model performance
    2. Explore the data more to address outliers 
    3. Tune the parameters with GridSearchCV and K-fold cross-validation with different parameters 
    4. Explore other models
    5. Obtain more data  