In [None]:
df.REASON.value_counts()

## Class distribution

In [None]:
(df.BAD.value_counts()/df.shape[0])*100

- The target is composed of about 80% majority (negative) class and 20% minority (positive) class, so the data set is moderately imbalanced


- The objective of this project is to maximize predictive performance on the minority (positive) class, i.e., the sensitivity (recall) for applicants who defaulted on their loan or were seriously delinquent.
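To see why recall rather than accuracy is the target here, the sketch below uses made-up labels with the same 80/20 split and a trivial "model" that always predicts the majority class. The arrays are hypothetical and are not taken from the HMEQ data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels with the same 80/20 split as the BAD column
y_true = np.array([0] * 80 + [1] * 20)

# A trivial "model" that always predicts the majority (negative) class
y_pred = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_pred)
baseline_recall = recall_score(y_true, y_pred)

print('Accuracy:', baseline_accuracy)
print('Recall on the positive class:', baseline_recall)
```

The baseline looks good on accuracy while finding none of the defaulters, which is why accuracy alone is a poor guide for this problem.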

## Statistical description

In [None]:
df.describe()

- The average loan is about 11,207 dollars and the average collateral is about 57,386 dollars

## Correlation coefficient

In [None]:
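# Restrict to numeric columns; recent pandas versions raise an error on object columns otherwise
corr_matrix = df.corr(numeric_only=True)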

In [None]:
# Styler.set_precision was removed in recent pandas versions; format(precision=...) replaces it
corr_matrix.style.background_gradient().format(precision=1)

- We can see that the property value (VALUE) and the amount due on the existing mortgage (MORTDUE) have a strong correlation of 0.9
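Strongly correlated pairs can also be pulled out of a correlation matrix programmatically rather than read off the heat map. The sketch below does this on a small made-up frame; the column names `a`, `b`, `c` and the 0.8 cut-off are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data; 'a' and 'b' are nearly proportional
demo = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                     'b': [2, 4, 6, 8, 11],
                     'c': [3, 1, 4, 1, 5]})

corr = demo.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Pairs whose absolute correlation exceeds the (assumed) 0.8 cut-off
strong_pairs = pairs[pairs.abs() > 0.8]
print(strong_pairs)
```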

## Exploratory data analysis

Next, let's look at the scatter plot for proper visualization

Let's check for the class distribution after splitting 

In [None]:
print('Class training distribution:\n', (y_train.value_counts()/X_train.shape[0])*100)
print('\n')
print('Class test distribution:\n', (y_test.value_counts()/X_test.shape[0])*100)

- Indeed, the training and test class distributions are representative of the original class distribution. This is the result of stratified sampling
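As a minimal illustration of stratified sampling, the sketch below splits made-up data with 20% positives and checks that both parts keep that proportion. `X_demo` and `y_demo` are hypothetical stand-ins, not the HMEQ features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 1000 rows, 20% positives
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.array([0] * 800 + [1] * 200)

# stratify=y_demo keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

train_pos_rate = y_tr.mean()
test_pos_rate = y_te.mean()

print('Positive rate in training split:', train_pos_rate)
print('Positive rate in test split:', test_pos_rate)
```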

# Data preprocessing

As we saw above, there is a substantial amount of missing values. Before we build a machine learning model, we will first impute the missing values so that the models can be trained on complete data

In [None]:
from sklearn.base import TransformerMixin
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelBinarizer

## DataFrame Imputer

We now create a dataframe imputer to impute missing categorical and numerical values.

In [None]:
# This is based on some nice code by 'sveitser' at http://stackoverflow.com/a/25562948
class DataFrameImputer(TransformerMixin):
    """Impute missing categorical and numerical values.

    Columns of dtype object are imputed with the most frequent value in the column.
    Columns of other types are imputed with the median of the column.
    """
        
    def fit(self, X, y= None):
        
        self.impute = pd.Series([X[col].value_counts().index[0] if X[col].dtype == np.dtype('O') 
                                else X[col].median() for col in X.columns], index = X.columns)
        return self
    
    def transform(self, X, y= None):
        
        return X.fillna(self.impute)
    

In [None]:
# Instantiate the DataFrame Imputer class
imputer = DataFrameImputer()

# Fit transform the training set
X_train_imputed = imputer.fit_transform(X_train)

# Only transform the test set (reuse the statistics learned on the training set)
X_test_imputed = imputer.transform(X_test)

Now we can cross-check for missing values after imputation

In [None]:
X_train_imputed.isnull().sum()

In [None]:
X_test_imputed.isnull().sum()
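The same idea can be checked on a toy frame: fill values are learned from the training data only and then reused on the test data. The frames below are made up (only the column names mirror HMEQ), and the sketch uses `pd.api.types.is_numeric_dtype` instead of the dtype comparison in the class above, which is an assumption made for portability across pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames; the values are made up for illustration
train_demo = pd.DataFrame({'LOAN': [1000.0, 2000.0, np.nan, 4000.0],
                           'JOB': ['Office', 'Office', 'Sales', np.nan]})
test_demo = pd.DataFrame({'LOAN': [np.nan, 1500.0],
                          'JOB': ['Sales', np.nan]})

# Learn the fill values from the training frame only:
# median for numeric columns, most frequent value otherwise
fill_values = pd.Series(
    [train_demo[col].median() if pd.api.types.is_numeric_dtype(train_demo[col])
     else train_demo[col].mode()[0] for col in train_demo.columns],
    index=train_demo.columns)

train_filled = train_demo.fillna(fill_values)
test_filled = test_demo.fillna(fill_values)

print(test_filled)
```

Reusing the training statistics on the test frame avoids leaking information from the test set into preprocessing.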

## Feature scaling and handling categorical attributes

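In this section, we will scale the numerical features and convert the categorical attributes to numeric values. These are standard preprocessing steps, since most machine learning algorithms expect numeric inputs and many work better when features are on a comparable scale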

In [None]:
# List of categorical attributes
cat_subset = list(X.select_dtypes('O'))

# List of numerical attributes
num_subset = list(X.select_dtypes('number'))

# Binarize the categorical attributes
cat_attribs = [([cat], LabelBinarizer()) for cat in cat_subset]

# Scale the numerical attributes
num_attribs = [([num], StandardScaler()) for num in num_subset]

# Build a dataframe mapper pipeline
mapper = DataFrameMapper(cat_attribs + num_attribs)

# Fit transform the training set
X_train_prepared = mapper.fit_transform(X_train_imputed)

# Only transform the test set
X_test_prepared = mapper.transform(X_test_imputed)

In [None]:
# Size of the data after pre-processing
print('Training set size after pre-processing:', X_train_prepared.shape)
print('Test set size after pre-processing:', X_test_prepared.shape)
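For readers without `sklearn_pandas`, a comparable preprocessing step can be sketched with scikit-learn's own `ColumnTransformer`. This is a swapped-in alternative, not the mapper used above: it one-hot encodes with `OneHotEncoder` rather than `LabelBinarizer`, so a two-level column yields two columns instead of one. The toy frame is made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame with one categorical and one numerical column
demo = pd.DataFrame({'REASON': ['DebtCon', 'HomeImp', 'DebtCon', 'HomeImp'],
                     'LOAN': [1000.0, 2000.0, 3000.0, 4000.0]})

# sparse_threshold=0 forces a dense array as output
preprocess = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['REASON']),
    ('num', StandardScaler(), ['LOAN']),
], sparse_threshold=0)

# Fit on the (toy) training data; new data would only be passed to transform
demo_prepared = preprocess.fit_transform(demo)

print(demo_prepared.shape)
```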

## Class variable to numpy array

In this section, we convert the class variable to a NumPy array, the format used for the labels in the rest of this notebook

In [None]:
y_train_label = y_train.values
y_test_label = y_test.values

In [None]:
# Size of the target variable
print(y_train_label.shape)
print(y_test_label.shape)

In [None]:
print('Training set class distribution:', np.bincount(y_train_label))
print('Test set class distribution:',  np.bincount(y_test_label))

# Machine Learning Algorithms

In [None]:
# Ignore deprecated warning
import warnings
warnings.filterwarnings("ignore")

# Machine learning models
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from xgboost.sklearn import XGBClassifier

# Model performance metrics
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, auc
from sklearn.metrics import roc_curve, roc_auc_score, average_precision_score, precision_recall_curve

Let's define some helper functions

In [None]:
# Model selection by cross-validation
def model_selection_cv(model, n_training_samples, n_training_labels, cv_fold, scoring = None):
    
        # Fit the imbalanced training set
        model.fit(n_training_samples, n_training_labels)
        
        # Compute the cross-validated score (metric set by `scoring`, number of folds by `cv_fold`)
        score = cross_val_score(model, n_training_samples, n_training_labels, cv = cv_fold, scoring = scoring)

        # Make class predictions with cross-validation
        y_val_pred = cross_val_predict(model, n_training_samples, n_training_labels, cv = cv_fold)

        # Make probability predictions for the positive class with cross-validation
        y_pred_proba = cross_val_predict(model,n_training_samples, n_training_labels, 
                                         cv = cv_fold, method = 'predict_proba')[:,1]

        # Print results
        print('CV score: %f (%f)'%(score.mean(), score.std()))
        print('AUROC: %f'%(roc_auc_score(n_training_labels, y_pred_proba)))
        print('Predicted classes:', np.unique(y_val_pred))
        print('Confusion matrix:\n', confusion_matrix(n_training_labels, y_val_pred))
        print('Classification report:\n', classification_report(n_training_labels, y_val_pred))
        print('#####################################################################')   

In [None]:
# Model prediction on the test set
def model_prediction(model, n_training_samples, n_training_labels, n_test_samples, n_test_labels):
   
        # Fit the training set
        model.fit(n_training_samples, n_training_labels)

        # Make prediction on the test set
        y_predict = model.predict(n_test_samples)

        # Compute the accuracy of the model
        accuracy = accuracy_score(n_test_labels, y_predict)

        # Predict probability
        y_predict_proba = model.predict_proba(n_test_samples)[:,1]

        print('Test accuracy:  %f'%(accuracy))
        print('AUROC: %f'%(roc_auc_score(n_test_labels, y_predict_proba)))
        print('AUPRC: %f'%(average_precision_score(n_test_labels, y_predict_proba)))
        print('Predicted classes:', np.unique(y_predict))
        print('Confusion matrix:\n', confusion_matrix(n_test_labels, y_predict))
        print('Classification report:\n', classification_report(n_test_labels, y_predict))
        print('#####################################################################') 

In [None]:
# ROC and PR Curves
def Plot_ROC_Curve_and_PRC(model, n_training_samples, n_training_labels, n_test_samples, n_test_labels,
                         color= None, label =None): 
                        
    model.fit(n_training_samples, n_training_labels)

    y_pred_proba = model.predict_proba(n_test_samples)[:, 1]

    # Compute the fpr and tpr for each classifier
    fpr, tpr, thresholds = roc_curve(n_test_labels, y_pred_proba)
    
    # Compute the precisions and recalls for the classifier
    precisions, recalls, thresholds = precision_recall_curve(n_test_labels, y_pred_proba)
    
    # Compute the area under the ROC curve for each classifier
    area_auc =roc_auc_score(n_test_labels, y_pred_proba)
    
    # Compute the area under the PR curve for the classifier
    area_prc = auc(recalls, precisions)
    
    # ROC Curve
    plt.subplot(121)
    plt.plot(fpr, tpr, color = color, label = (label) %area_auc)
    plt.plot([0,1], [0,1], 'k--')
    plt.axis([0,1,0,1])
    plt.xlabel('False positive rate (FPR)')
    plt.ylabel('True positive rate (TPR)')
    plt.title('ROC Curve')
    plt.legend(loc = 'best')
    
    # PR Curve
    plt.subplot(122)
    plt.plot(recalls, precisions, color = color, label = (label) %area_prc)
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title('Precision-Recall Curve')
    plt.legend(loc = 'best')
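Note that the helpers above summarize the PR curve in two different ways: `average_precision_score` uses a step-wise sum, while `auc(recalls, precisions)` uses the trapezoidal rule, so the two numbers need not agree. The sketch below compares them on made-up labels and scores.

```python
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# Hypothetical labels and scores, made up for illustration
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

precisions, recalls, _ = precision_recall_curve(y_true, y_score)

# Trapezoidal area under the PR curve (linear interpolation between points)
area_trapezoid = auc(recalls, precisions)

# Average precision (step-wise sum, no interpolation)
area_step = average_precision_score(y_true, y_score)

print('Trapezoidal AUPRC:', area_trapezoid)
print('Average precision:', area_step)
```

When comparing models, it is safest to stick to one of the two summaries throughout.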

# 1. Random Forest

In [None]:
# Cross-validation prediction
model_selection_cv(RandomForestClassifier(), X_train_prepared, y_train_label, cv_fold = 10, scoring = 'accuracy')

## A. Hyperparameter tuning to optimize AUPRC

In [None]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

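# 'auto' is omitted: it was an alias of 'sqrt' for classifiers and is rejected by recent scikit-learn versions
params = {'n_estimators': [100, 300, 500],'max_features': ['sqrt', 'log2', None],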
          'max_depth': range(2,10,1),'min_samples_split': range(2,10,1),'min_samples_leaf': range(1,10,2),
          'bootstrap': [True, False],'class_weight': ['balanced', 'balanced_subsample',{0:1, 1:4}, {0:1, 1:10}]}
          
rf =  RandomForestClassifier()
                         
rsearch_rf = RandomizedSearchCV(estimator = rf, param_distributions = params, n_iter = 100,
                                  scoring = 'average_precision', cv = 10, n_jobs = -1, 
                                  random_state = 42, verbose = 1)  
                                                        
rsearch_rf.fit(X_train_prepared, y_train_label)

print('Best score:', rsearch_rf.best_score_) 
print('Best hyperparameters:', rsearch_rf.best_params_) 
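The search above can take a long time on the full grid. A scaled-down sketch of the same pattern, on synthetic data from `make_classification` and with a deliberately tiny, hypothetical parameter grid, shows the mechanics of a randomized search scored by average precision.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, imbalanced data (roughly 80/20); not the HMEQ data
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     weights=[0.8, 0.2], random_state=42)

# A deliberately small, hypothetical search space
demo_params = {'n_estimators': [20, 50],
               'max_depth': [2, 4, 6],
               'class_weight': ['balanced', None]}

demo_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=demo_params,
    n_iter=4,                      # sample 4 of the 12 combinations
    scoring='average_precision',   # same metric as in the search above
    cv=3,
    random_state=42)

demo_search.fit(X_demo, y_demo)

print('Best score:', demo_search.best_score_)
print('Best hyperparameters:', demo_search.best_params_)
```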

## B. Prediction on the test set

In [None]:
model_prediction(rsearch_rf.best_estimator_, X_train_prepared, y_train_label, X_test_prepared, y_test_label)

# 2. XGBoost

In [None]:
import xgboost as xgb
# Cross-validation prediction
model_selection_cv(xgb.XGBClassifier(), X_train_prepared, y_train_label, cv_fold = 10, scoring = 'accuracy')

## A. Hyperparameter tuning  to optimize AUPRC

In [None]:
# import train test split library
from sklearn.model_selection import train_test_split

In [None]:
# Stratified sampling with a 20% test set and an 80% training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size =0.2, random_state = 42, stratify = y)

In [None]:
print('Training set size:', X_train.shape)
print('Class training set size:', y_train.shape)
print('Test set size:', X_test.shape)
print('Class test set size:', y_test.shape)

In [None]:
ratio = sum(y_train_label == 0)/sum(y_train_label ==1)
ratio
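The ratio of negative to positive samples is the value the XGBoost documentation suggests as a starting point for `scale_pos_weight`. On made-up labels with an 80/20 split, the computation reduces to the following sketch (`y_demo` is hypothetical).

```python
import numpy as np

# Hypothetical labels: 80 negatives and 20 positives
y_demo = np.array([0] * 80 + [1] * 20)

# counts[0] is the number of negatives, counts[1] the number of positives
counts = np.bincount(y_demo)
demo_ratio = counts[0] / counts[1]

print('Negative-to-positive ratio:', demo_ratio)
```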

In [None]:
import xgboost as xgb

# Use a distinct name so the estimator does not shadow the xgboost module alias
xgb_clf =  xgb.XGBClassifier(objective = 'binary:logistic', scale_pos_weight = ratio, n_estimators = 5000)
                         
params = {'max_depth': range(1,10,2), 'min_child_weight': range(1,6,2),'gamma':[i/100 for i in range(0,6)],
         'subsample':[i/10 for i in range(5,9)],'colsample_bytree': [i/10 for i in range(5,9)],
         'learning_rate': [0.01, 0.02, 0.05, 0.1]}
              
rsearch_xgb = RandomizedSearchCV(estimator = xgb_clf, param_distributions = params, n_iter = 200,  
                                  scoring = 'average_precision', cv = 10, n_jobs = -1, 
                                  random_state = 42, verbose = 1)  
                                                        
rsearch_xgb.fit(X_train_prepared, y_train_label)

print('Best score:', rsearch_xgb.best_score_) 
print('Best hyperparameters:', rsearch_xgb.best_params_) 

In [None]:
model_prediction(rsearch_xgb.best_estimator_, X_train_prepared, y_train_label, X_test_prepared, y_test_label)

## ROC and PR Curve

# Home Equity  (HMEQ) Loan
The Home Equity Loan (HMEQ) data set reports characteristics and delinquency information for 5,960 home equity loans. A home equity loan is a loan where the obligor uses the equity in their home as the underlying collateral.


# Attribute Information

The data set has the following characteristics:

1). BAD: 1 = applicant defaulted on loan or seriously delinquent; 0 = applicant paid loan

2). LOAN: Amount of the loan request

3). MORTDUE: Amount due on existing mortgage

4). VALUE: Value of current property

5). REASON: DebtCon = debt consolidation; HomeImp = home improvement

6). JOB: Occupational categories

7). YOJ: Years at present job

8). DEROG: Number of major derogatory reports

9). DELINQ: Number of delinquent credit lines

10). CLAGE: Age of oldest credit line in months

11). NINQ: Number of recent credit inquiries

12). CLNO: Number of credit lines

13). DEBTINC: Debt-to-income ratio

# Data Set Information

[CREDIT RISK ANALYTICS](http://www.creditriskanalytics.net/citation-formats.html)


# Objective

The objective of this project is to maximize predictive performance on applicants who defaulted on their loan or were seriously delinquent. In other words, we will select models based on high sensitivity (recall).

# Practice Skills
Listed below are the practice skills we will learn in this project

- Data cleaning

- Advanced classification techniques like random forest and xgboost

- Imbalanced learning

- Receiver operating characteristic (ROC) and precision-recall (PR) curves

## Data visualization and manipulation libraries

In [None]:
%matplotlib inline
# Ignore deprecated warning
import warnings
warnings.filterwarnings("ignore")


# Pandas and numpy for data manipulation
import pandas as pd
import numpy as np

# Matplotlib  for data visualization
import matplotlib.pyplot as plt

# Seaborn for data visualization
import seaborn as sns

## Load and examine data

In [None]:
df = pd.read_csv('hmeq.csv')

In [None]:
df.head()

## Data shape

In [None]:
df.shape

## Data type and missing values

In [None]:
df.info()

- There are two categorical attributes and 11 numerical attributes


- There are also missing values in many attributes. We will come back to them later

## Categorical count

In [None]:
# Set font scale and style
sns.set(font_scale = 1)
sns.set_style('ticks')

# List a few attributes
attrib_list = ['LOAN','MORTDUE', 'VALUE', 'CLAGE']

# Make the pair plot
sns.pairplot(data = df.dropna(), vars = attrib_list, palette="hls", hue = 'BAD', height = 3)

Next, let's look at the boxplot which is useful for visualizing outliers

In [None]:
# Set font scale and style
sns.set(font_scale = 1.5)
sns.set_style('ticks')

fig, axes = plt.subplots(2,2, figsize = (20,10))

# Make the box plots
sns.boxplot(data = df.dropna(), x ='BAD',  y= 'MORTDUE', hue = 'JOB', palette="husl", ax = axes[0,0])
sns.boxplot(data = df.dropna(), x ='BAD',  y= 'LOAN', hue = 'JOB', palette="husl", ax = axes[0,1])
sns.boxplot(data = df.dropna(), x ='BAD',  y= 'VALUE', hue = 'JOB', palette="husl", ax = axes[1,0])
sns.boxplot(data = df.dropna(), x ='BAD',  y= 'DEBTINC', hue = 'JOB', palette="husl", ax = axes[1,1])

- A thorough check of the numerical attributes shows that there are no extreme outliers

- The box plots also show that self-employed applicants are likely to default on their loans
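A box plot's default whiskers extend 1.5 times the interquartile range (IQR) beyond the quartiles, and points outside them are drawn as outliers. The same rule can be applied numerically; the sketch below does so on a made-up series, with 1.5 as the assumed multiplier.

```python
import pandas as pd

# Hypothetical values with one obvious outlier
values = pd.Series([10, 12, 11, 13, 12, 11, 100], name='demo')

# Quartiles and interquartile range
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Flag points beyond 1.5 * IQR from the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(outliers)
```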

Let's look at the relationship of loan and the class using the bar chart