
## **Home Credit Default Risk Prediction using Several Algorithms and Pipelines**


### **Introduction**

This notebook continues the Home Credit Default Risk project. The goal is to build a model that predicts which loan applicants will default.

Preprocessing, feature engineering, cross-validation, and hyperparameter optimization will be carried out to choose the best-performing model.

In [1]:
%pip install catboost



In [2]:
import numpy as np
import pandas as pd
import time
from scipy.stats import skew
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
import math

from sklearn.model_selection import cross_val_score, KFold, cross_validate, RandomizedSearchCV, train_test_split, StratifiedKFold
from sklearn.feature_selection import SelectFromModel

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.metrics import roc_auc_score



import warnings
warnings.filterwarnings('ignore')

In [4]:
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 1000)

In [5]:
train= pd.read_csv('/content/drive/MyDrive/home-credit-default-risk/application_train.csv')

test = pd.read_csv('/content/drive/MyDrive/home-credit-default-risk/application_test.csv')

train_df= train.drop('SK_ID_CURR', axis= 1)
test_df= test.drop('SK_ID_CURR', axis= 1)

In [6]:
test_df.shape, train_df.shape

((48744, 120), (307511, 121))

In [7]:
train.head(2)

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,NAME_TYPE_SUITE,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,OWN_CAR_AGE,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,ORGANIZATION_TYPE,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,APARTMENTS_AVG,BASEMENTAREA_AVG,YEARS_BEGINEXPLUATATION_AVG,YEARS_BUILD_AVG,COMMONAREA_AVG,ELEVATORS_AVG,ENTRANCES_AVG,FLOORSMAX_AVG,FLOORSMIN_AVG,LANDAREA_AVG,LIVINGAPARTMENTS_AVG,LIVINGAREA_AVG,NONLIVINGAPARTMENTS_AVG,NONLIVINGAREA_AVG,APARTMENTS_MODE,BASEMENTAREA_MODE,YEARS_BEGINEXPLUATATION_MODE,YEARS_BUILD_MODE,COMMONAREA_MODE,ELEVATORS_MODE,ENTRANCES_MODE,FLOORSMAX_MODE,FLOORSMIN_MODE,LANDAREA_MODE,LIVINGAPARTMENTS_MODE,LIVINGAREA_MODE,NONLIVINGAPARTMENTS_MODE,NONLIVINGAREA_MODE,APARTMENTS_MEDI,BASEMENTAREA_MEDI,YEARS_BEGINEXPLUATATION_MEDI,YEARS_BUILD_MEDI,COMMONAREA_MEDI,ELEVATORS_MEDI,ENTRANCES_MEDI,FLOORSMAX_MEDI,FLOORSMIN_MEDI,LANDAREA_MEDI,LIVINGAPARTMENTS_MEDI,LIVINGAREA_MEDI,NONLIVINGAPARTMENTS_MEDI,NONLIVINGAREA_MEDI,FONDKAPREMONT_MODE,HOUSETYPE_MODE,TOTALAREA_MODE,WALLSMATERIAL_MODE,EMERGENCYSTATE_MODE,OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,DAYS_LAST_PHONE_CHANGE,FLAG_DOCUMENT_2,FLAG_DOCUMENT_3,FLAG_DOCUMENT_4,FLAG_DOCUMENT_5,FLAG_DOCUMENT_6,FLAG_DOCUMENT_7,FLAG_DOCUMENT_8,FLAG_DOCUMENT_9,FLAG_DOCUMENT_10,FLAG_DOCUMENT_11,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,351000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.018801,-9461,-637,-3648.0,-2120,,1,1,0,1,1,0,Laborers,1.0,2,2,WEDNESDAY,10,0,0,0,0,0,0,Business Entity Type 3,0.083037,0.262949,0.139376,0.0247,0.0369,0.9722,0.6192,0.0143,0.0,0.069,0.0833,0.125,0.0369,0.0202,0.019,0.0,0.0,0.0252,0.0383,0.9722,0.6341,0.0144,0.0,0.069,0.0833,0.125,0.0377,0.022,0.0198,0.0,0.0,0.025,0.0369,0.9722,0.6243,0.0144,0.0,0.069,0.0833,0.125,0.0375,0.0205,0.0193,0.0,0.0,reg oper account,block of flats,0.0149,"Stone, brick",No,2.0,2.0,2.0,2.0,-1134.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,1129500.0,Family,State servant,Higher education,Married,House / apartment,0.003541,-16765,-1188,-1186.0,-291,,1,1,0,1,1,0,Core staff,2.0,1,1,MONDAY,11,0,0,0,0,0,0,School,0.311267,0.622246,,0.0959,0.0529,0.9851,0.796,0.0605,0.08,0.0345,0.2917,0.3333,0.013,0.0773,0.0549,0.0039,0.0098,0.0924,0.0538,0.9851,0.804,0.0497,0.0806,0.0345,0.2917,0.3333,0.0128,0.079,0.0554,0.0,0.0,0.0968,0.0529,0.9851,0.7987,0.0608,0.08,0.0345,0.2917,0.3333,0.0132,0.0787,0.0558,0.0039,0.01,reg oper account,block of flats,0.0714,Block,No,1.0,0.0,1.0,0.0,-828.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


#### **Building Pipelines for Preprocessing and Modelling**

Because the dataset contains many missing values and categorical variables, custom transformers are needed, and they must be integrated into a Scikit-Learn pipeline for building the models. These transformer classes are built on top of the Scikit-Learn base classes `BaseEstimator` and `TransformerMixin`.
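As a minimal sketch of what the two base classes contribute, the toy transformer below (a hypothetical `AddRatio` class with made-up column names `a` and `b`, not part of the notebook) defines only `fit` and `transform`. `TransformerMixin` then supplies `fit_transform`, and `BaseEstimator` supplies `get_params`, which is what lets the class sit inside a `Pipeline`.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class AddRatio(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: adds the ratio of two columns."""

    def __init__(self, numerator="a", denominator="b"):
        self.numerator = numerator
        self.denominator = denominator

    def fit(self, X, y=None):
        # Nothing to learn; returning self lets the step be chained.
        return self

    def transform(self, X):
        X = X.copy()  # avoid mutating the caller's DataFrame
        X["ratio"] = X[self.numerator] / X[self.denominator]
        return X


toy = pd.DataFrame({"a": [1.0, 4.0], "b": [2.0, 8.0]})
pipe = Pipeline(steps=[("ratio", AddRatio())])

out = pipe.fit_transform(toy)     # fit_transform is inherited from TransformerMixin
params = AddRatio().get_params()  # get_params is inherited from BaseEstimator
print(out)
print(params)
```

Copying `X` inside `transform` is a design choice for this sketch; the notebook's transformers modify the incoming DataFrame directly.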


In [8]:
class TransformColumns(BaseEstimator, TransformerMixin):
  
  def __init__(self, transform_ordcols = True):
    self.transform_ordcols = transform_ordcols

  def fit(self, X, y=None):
    return self
    
  def convert_days(self, days_negative):
    days_positive = -1 * days_negative
    year_positive= days_positive/365
    year_positive= math.ceil(year_positive)
    return year_positive
   
  def transform(self, X, y=None):
    """
    Transforms time-based features
    
    Returns:
            X: the modified dataset      
    """
    if self.transform_ordcols:
      X['YEARS_BIRTH'] = X['DAYS_BIRTH'].apply(self.convert_days)
      X['YEARS_EMPLOYED'] = X['DAYS_EMPLOYED'].apply(self.convert_days)
      X['YEARS_REGISTRATION'] = X['DAYS_REGISTRATION'].apply(self.convert_days)
      X['YEARS_ID_PUBLISH'] = X['DAYS_ID_PUBLISH'].apply(self.convert_days)

    # Return X even when transform_ordcols is False, so the next pipeline step never receives None
    return X
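The row-wise `apply` with `math.ceil` works only while the `DAYS_*` columns contain no missing values, because `math.ceil` raises a `ValueError` on NaN. A vectorized alternative is sketched below; `days_to_years` is a hypothetical helper name, and the sample values are the `DAYS_BIRTH` and `DAYS_EMPLOYED` entries of the first row shown earlier plus one deliberately missing value.

```python
import numpy as np
import pandas as pd


def days_to_years(days: pd.Series) -> pd.Series:
    """Convert negative day offsets to positive whole years, rounding up.

    Hypothetical helper; NaN inputs stay NaN instead of raising an error.
    """
    return np.ceil(-days / 365)


days = pd.Series([-9461, -637, np.nan])
years = days_to_years(days)
print(years.tolist())
```

The result is a float Series, since NaN cannot be stored in an integer column.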



In [9]:
class InputMissing(BaseEstimator, TransformerMixin):
  def __init__(self, input_missingsvals = True):
    self.input_missingsvals = input_missingsvals

  def fit(self, X, y=None):
    return self
    
  def transform(self, X, y=None):
    """
    Imputes missing values by filling them with 0
    
    Returns:
            X: the modified dataset      
    """
    if self.input_missingsvals:
      X= X.fillna(0)
    return X

In [10]:
class GenerateFeatures(BaseEstimator, TransformerMixin):
  def __init__(self, add_features = True):
    self.add_features = add_features

  def fit(self, X, y=None):
    return self
    
  def transform(self, X, y=None):
    """
    Generates new features
    
    Returns:
            X: the modified dataset      
    """
    if self.add_features:
      X['CREDIT_INCOME_RATIO'] = X['AMT_CREDIT'] / X['AMT_INCOME_TOTAL']
      X['ANNUITY_INCOME_RATIO'] = X['AMT_ANNUITY'] / X['AMT_INCOME_TOTAL']
      X['CREDIT_TERM'] = X['AMT_ANNUITY'] / X['AMT_CREDIT']
      X['DAYS_EMPLOYED_RATIO'] = X['DAYS_EMPLOYED'] / X['DAYS_BIRTH']
      X['GOODS_PRICE_LOAN_DIFFERENCE'] = X['AMT_GOODS_PRICE'] - X['AMT_CREDIT']      
    return X

In [11]:
class RemoveSkewness(BaseEstimator, TransformerMixin):
  def __init__(self):
    pass

  def fit(self, X, y=None):
    return self
    
  def transform(self, X, y=None):
    """
    Eliminates skewness from the skewed features
    
    Returns:
            X: the modified dataset      
    """
    X_numeric= X[['AMT_INCOME_TOTAL', 'AMT_CREDIT',
       'REGION_POPULATION_RELATIVE']]
    skewness = X_numeric.apply(lambda x: skew(x))
    skewness = skewness[abs(skewness) > 0.5]
    skewed_features = skewness.index
    X[skewed_features] = np.log1p(X[skewed_features])
    return X
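To illustrate the idea behind `RemoveSkewness`, the sketch below measures skewness before and after `np.log1p` on synthetic, right-skewed data. The log-normal sample and its parameters are assumptions chosen for illustration; they are not drawn from the Home Credit data.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

# Synthetic right-skewed "income-like" values (illustrative assumption)
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=12.0, sigma=1.0, size=10_000))

skew_before = skew(income)
skew_after = skew(np.log1p(income))

print("skewness before log1p:", skew_before)
print("skewness after log1p:", skew_after)
```

Note that the class above decides which columns to transform from the data passed to `transform`, so the choice could in principle differ between the training and test sets.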

In [12]:
class LabelEncode(BaseEstimator, TransformerMixin):
  def __init__(self):
    pass

  def fit(self, X, y=None):
    return self
    
  def transform(self, X, y=None):
    """
    Transforms nominal features to discrete values
    
    Returns:
            X: the modified dataset      
    """
    cat_cols= ['NAME_CONTRACT_TYPE', 'CODE_GENDER','FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 
               'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE', 'WEEKDAY_APPR_PROCESS_START',
               'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE']
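    for cat_col in cat_cols:
      # pd.factorize assigns integer codes in order of first appearance, so codes
      # computed separately for the train and test sets are not guaranteed to match
      encoded, categories = pd.factorize(X[cat_col])
      X[cat_col]= encoded
    return X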
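Because `pd.factorize` is called inside `transform`, each dataset gets its own code assignment. One possible way to keep train and test encodings consistent is to learn the categories in `fit` and reuse them in `transform`. The sketch below uses a hypothetical `FittedLabelEncode` class and toy data; it is an alternative under those assumptions, not the encoder used to produce the results in this notebook.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class FittedLabelEncode(BaseEstimator, TransformerMixin):
    """Hypothetical variant of LabelEncode that learns categories in fit."""

    def __init__(self, cat_cols=None):
        self.cat_cols = cat_cols

    def fit(self, X, y=None):
        # Remember the categories seen in the training data, in order of appearance.
        self.categories_ = {col: pd.factorize(X[col])[1] for col in self.cat_cols}
        return self

    def transform(self, X):
        X = X.copy()
        for col in self.cat_cols:
            # Reuse the training categories; unseen values are encoded as -1.
            X[col] = pd.Categorical(X[col], categories=self.categories_[col]).codes
        return X


train_toy = pd.DataFrame({"CODE_GENDER": ["M", "F", "M"]})
test_toy = pd.DataFrame({"CODE_GENDER": ["F", "M", "XNA"]})

encoder = FittedLabelEncode(cat_cols=["CODE_GENDER"]).fit(train_toy)
train_codes = encoder.transform(train_toy)["CODE_GENDER"].tolist()
test_codes = encoder.transform(test_toy)["CODE_GENDER"].tolist()
print(train_codes, test_codes)
```

Here `"M"` maps to the same code in both frames, whereas calling `pd.factorize` on `test_toy` alone would assign code 0 to `"F"` because it appears first.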

In [13]:

class DropColumns(BaseEstimator, TransformerMixin):
  def __init__(self, remove_cols = True):
    self.remove_cols= remove_cols

  def fit(self, X, y=None):
    return self
    
  def transform(self, X, y=None):
    """
    Deletes unwanted columns
    
    Returns:
            X: the modified dataset      
    """
    if self.remove_cols:
      X.drop(['DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH'], axis= 1, inplace=True)
    return X


### **Model Building and Evaluation**

In [14]:
transformer_pipeline= Pipeline(steps=[
    ('transformer', TransformColumns()),
    ('imputer', InputMissing()),
    ('new_features', GenerateFeatures()),
    ('remove_columns', DropColumns()),
    ('encoder', LabelEncode()),
    ('scaler', StandardScaler())])

In [15]:
train_data= train_df.drop('TARGET', axis=1)
transformer_pipeline.fit(train_data)

processed_train = transformer_pipeline.transform(train_data)
processed_test= transformer_pipeline.transform(test_df)

processed_train_data = pd.DataFrame(processed_train)
processed_test_data = pd.DataFrame(processed_test)


y= train_df['TARGET']

In [16]:
X_train, X_valid, y_train, y_valid= train_test_split(processed_train_data, y, test_size= 0.2, random_state= 2000, stratify= y)
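
The models below are scored on this single stratified hold-out split. As a hedged sketch of the cross-validation mentioned in the introduction, stratified k-fold scoring with AUC could look like the following. The synthetic data from `make_classification`, its class weights, and the choice of five folds are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data standing in for the processed training set (assumption)
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=10, weights=[0.92, 0.08], random_state=200
)

# Stratification keeps the class ratio roughly equal in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=200)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_toy, y_toy, cv=cv, scoring="roc_auc"
)
print("Mean AUC:", scores.mean(), "Std:", scores.std())
```

Reporting the mean and spread across folds gives a less split-dependent estimate than one validation score, at the cost of training each model several times.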

### **Logistic Regression**

In [17]:
lr = LogisticRegression(random_state= 200)

#Training time
start = time.time()
lr.fit(X_train, y_train)
lr_training_time= time.time() - start

#Inference time
start = time.time()
y_pred= lr.predict_proba(X_valid.loc[[861]])[:, 1]

lr_inference_time= time.time() - start


In [18]:
y_pred_train_lr= lr.predict_proba(X_train)[:, 1]
lr_roc_train= roc_auc_score(y_train, y_pred_train_lr)
print('Training AUC Score is', roc_auc_score(y_train, y_pred_train_lr))

y_pred_lr = lr.predict_proba(X_valid)[:, 1]
lr_roc_valid= roc_auc_score(y_valid, y_pred_lr)
print('Validation AUC Score is', roc_auc_score(y_valid, y_pred_lr))

Training AUC Score is 0.7301975305016859
Validation AUC Score is 0.7259908396235177
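
The same fit, time, and score steps are repeated for every model that follows. A small helper could capture the pattern; `fit_and_score` below is a hypothetical function, shown on synthetic data with two of the Scikit-Learn models, and it times inference on the first validation row rather than on a specific index label.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def fit_and_score(model, X_tr, y_tr, X_va, y_va):
    """Hypothetical helper: time training and single-row inference, then compute AUC."""
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    training_time = time.perf_counter() - start

    start = time.perf_counter()
    model.predict_proba(X_va[:1])  # first validation row only
    inference_time = time.perf_counter() - start

    return {
        "training_time": training_time,
        "inference_time": inference_time,
        "auc_train": roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1]),
        "auc_valid": roc_auc_score(y_va, model.predict_proba(X_va)[:, 1]),
    }


# Synthetic stand-in for the processed data (assumption)
X_toy, y_toy = make_classification(n_samples=1000, n_features=10, random_state=200)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=2000, stratify=y_toy
)

models = {
    "LR": LogisticRegression(max_iter=1000, random_state=200),
    "DTC": DecisionTreeClassifier(random_state=200),
}
scores_by_model = {name: fit_and_score(m, X_tr, y_tr, X_va, y_va) for name, m in models.items()}
print(scores_by_model)
```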


### **Decision Tree**

In [19]:
dtc = DecisionTreeClassifier(random_state= 200)

#Training time
start = time.time()
dtc.fit(X_train, y_train)
dtc_training_time= time.time() - start

#Inference time
start = time.time()
y_pred= dtc.predict_proba(X_valid.loc[[861]])[:, 1]

dtc_inference_time= time.time() - start


In [20]:
y_pred_train_dtc= dtc.predict_proba(X_train)[:, 1]
dtc_roc_train= roc_auc_score(y_train, y_pred_train_dtc)
print('Training AUC Score is', roc_auc_score(y_train, y_pred_train_dtc))

y_pred_dtc = dtc.predict_proba(X_valid)[:, 1]
dtc_roc_valid= roc_auc_score(y_valid, y_pred_dtc)
print('Validation AUC Score is', roc_auc_score(y_valid, y_pred_dtc))

Training AUC Score is 1.0
Validation AUC Score is 0.5364411861487378


### **Random Forest**

In [21]:
rfc = RandomForestClassifier(random_state= 200)

#Training time
start = time.time()
rfc.fit(X_train, y_train)
rfc_training_time= time.time() - start

#Inference time
start = time.time()
y_pred= rfc.predict_proba(X_valid.loc[[861]])[:, 1]

rfc_inference_time= time.time() - start


In [22]:
y_pred_train_rfc= rfc.predict_proba(X_train)[:, 1]
rfc_roc_train= roc_auc_score(y_train, y_pred_train_rfc)
print('Training AUC Score is', roc_auc_score(y_train, y_pred_train_rfc))

y_pred_rfc = rfc.predict_proba(X_valid)[:, 1]
rfc_roc_valid= roc_auc_score(y_valid, y_pred_rfc)
print('Validation AUC Score is', roc_auc_score(y_valid, y_pred_rfc))

Training AUC Score is 1.0
Validation AUC Score is 0.7113711292642896


### **GradientBoostingClassifier**

In [23]:
gbc = GradientBoostingClassifier(random_state= 200)

#Training time
start = time.time()
gbc.fit(X_train, y_train)
gbc_training_time= time.time() - start

#Inference time
start = time.time()
y_pred= gbc.predict_proba(X_valid.loc[[861]])[:, 1]

gbc_inference_time= time.time() - start


In [24]:
y_pred_train_gbc= gbc.predict_proba(X_train)[:, 1]
gbc_roc_train= roc_auc_score(y_train, y_pred_train_gbc)
print('Training AUC Score is', roc_auc_score(y_train, y_pred_train_gbc))

y_pred_gbc = gbc.predict_proba(X_valid)[:, 1]
gbc_roc_valid= roc_auc_score(y_valid, y_pred_gbc)
print('Validation AUC Score is', roc_auc_score(y_valid, y_pred_gbc))

Training AUC Score is 0.7663734734461962
Validation AUC Score is 0.7553100932891271


### **LightGBM**

In [25]:
lgbm = LGBMClassifier(random_state= 200)

#Training time
start = time.time()
lgbm.fit(X_train, y_train)
lgbm_training_time= time.time() - start

#Inference time
start = time.time()
y_pred= lgbm.predict_proba(X_valid.loc[[861]])[:, 1]

lgbm_inference_time= time.time() - start


In [26]:
y_pred_train_lgbm= lgbm.predict_proba(X_train)[:, 1]
lgbm_roc_train= roc_auc_score(y_train, y_pred_train_lgbm)
print('Training AUC Score is', roc_auc_score(y_train, y_pred_train_lgbm))

y_pred_lgbm = lgbm.predict_proba(X_valid)[:, 1]
lgbm_roc_valid= roc_auc_score(y_valid, y_pred_lgbm)
print('Validation AUC Score is', roc_auc_score(y_valid, y_pred_lgbm))

Training AUC Score is 0.8059476873694331
Validation AUC Score is 0.7613765280519474


### **XGBoost**

In [27]:
xgb= XGBClassifier(random_state= 200)

#Training time
start = time.time()
xgb.fit(X_train, y_train)
xgb_training_time= time.time() - start

#Inference time
start = time.time()
y_pred= xgb.predict_proba(X_valid.loc[[861]])[:, 1]

xgb_inference_time= time.time() - start


In [28]:
y_pred_train_xgb= xgb.predict_proba(X_train)[:, 1]
xgb_roc_train= roc_auc_score(y_train, y_pred_train_xgb)
print('Training AUC Score is', roc_auc_score(y_train, y_pred_train_xgb))

y_pred_xgb = xgb.predict_proba(X_valid)[:, 1]
xgb_roc_valid= roc_auc_score(y_valid, y_pred_xgb)
print('Validation AUC Score is', roc_auc_score(y_valid, y_pred_xgb))

Training AUC Score is 0.7643278346616884
Validation AUC Score is 0.7556429496553344


### **CatBoost**

In [29]:
cbc= CatBoostClassifier(iterations= 1500, learning_rate= 0.01, depth= 8, random_state= 200)

#Training time
start = time.time()
cbc.fit(X_train, y_train)
cbc_training_time= time.time() - start

#Inference time
start = time.time()
y_pred= cbc.predict_proba(X_valid.loc[[861]])[:, 1]

cbc_inference_time= time.time() - start


0:	learn: 0.6826862	total: 245ms	remaining: 6m 6s
1:	learn: 0.6723548	total: 419ms	remaining: 5m 13s
2:	learn: 0.6616854	total: 623ms	remaining: 5m 10s
3:	learn: 0.6517522	total: 815ms	remaining: 5m 4s
4:	learn: 0.6418356	total: 1.01s	remaining: 5m 2s
5:	learn: 0.6323219	total: 1.18s	remaining: 4m 52s
6:	learn: 0.6230904	total: 1.37s	remaining: 4m 51s
7:	learn: 0.6141820	total: 1.51s	remaining: 4m 41s
8:	learn: 0.6051144	total: 1.71s	remaining: 4m 43s
9:	learn: 0.5968815	total: 1.9s	remaining: 4m 43s
10:	learn: 0.5889696	total: 2.06s	remaining: 4m 38s
11:	learn: 0.5809567	total: 2.22s	remaining: 4m 34s
12:	learn: 0.5729855	total: 2.37s	remaining: 4m 30s
13:	learn: 0.5649513	total: 2.54s	remaining: 4m 29s
14:	learn: 0.5576995	total: 2.75s	remaining: 4m 32s
15:	learn: 0.5507980	total: 2.92s	remaining: 4m 30s
16:	learn: 0.5438394	total: 3.08s	remaining: 4m 28s
17:	learn: 0.5367909	total: 3.27s	remaining: 4m 29s
18:	learn: 0.5297684	total: 3.45s	remaining: 4m 28s
19:	learn: 0.5231857
...

In [30]:
y_pred_train_cbc= cbc.predict_proba(X_train)[:, 1]
cbc_roc_train= roc_auc_score(y_train, y_pred_train_cbc)
print('Training AUC Score is', roc_auc_score(y_train, y_pred_train_cbc))

y_pred_cbc = cbc.predict_proba(X_valid)[:, 1]
cbc_roc_valid= roc_auc_score(y_valid, y_pred_cbc)
print('Validation AUC Score is', roc_auc_score(y_valid, y_pred_cbc))

Training AUC Score is 0.795210204517923
Validation AUC Score is 0.7612654067168042


In [34]:
test_preds = cbc.predict_proba(processed_test_data)[:, 1]

In [35]:
file = {'SK_ID_CURR': test['SK_ID_CURR'], 'TARGET': test_preds}

submission = pd.DataFrame(file, columns = ['SK_ID_CURR', 'TARGET'])

submission.to_csv('home_pipeline_Submission3_cbc.csv', index = False)

In [39]:
# Times are in seconds; the AUC columns come from the training and hold-out validation splits
results = pd.DataFrame({'Models': ['LR', 'DTC', 'RFC', 'GBC', 'LGBM', 'XGB', 'Catboost'],
                        'Training time': [lr_training_time, dtc_training_time, rfc_training_time, gbc_training_time, lgbm_training_time, xgb_training_time, cbc_training_time],
                        'Inference time': [lr_inference_time, dtc_inference_time, rfc_inference_time, gbc_inference_time, lgbm_inference_time, xgb_inference_time, cbc_inference_time],
                        'auc_roc_train': [lr_roc_train, dtc_roc_train, rfc_roc_train, gbc_roc_train, lgbm_roc_train, xgb_roc_train, cbc_roc_train],
                        'auc_roc_valid': [lr_roc_valid, dtc_roc_valid, rfc_roc_valid, gbc_roc_valid, lgbm_roc_valid, xgb_roc_valid, cbc_roc_valid]})
results.sort_values(by='auc_roc_valid', ascending=False)

| | Models | Training time | Inference time | auc_roc_train | auc_roc_valid |
|---|---|---|---|---|---|
| 4 | LGBM | 21.497956 | 0.003895 | 0.805948 | 0.761377 |
| 6 | Catboost | 277.727717 | 0.008424 | 0.79521 | 0.761265 |
| 5 | XGB | 86.121 | 0.002927 | 0.764328 | 0.755643 |
| 3 | GBC | 279.665533 | 0.003955 | 0.766373 | 0.75531 |
| 0 | LR | 9.291129 | 0.009142 | 0.730198 | 0.725991 |
| 2 | RFC | 126.620911 | 0.014931 | 1.0 | 0.711371 |
| 1 | DTC | 26.140144 | 0.003394 | 1.0 | 0.536441 |


Although LightGBM had a marginally higher validation AUC than CatBoost (0.7614 vs. 0.7613), CatBoost achieved the higher score on the private leaderboard (0.72958).