# Challenge : Predicting Loan pre-delinquency using Random Forest classifier


## Introduction
in this challenge i will be working with Carbon's loan disbursement dataset to predict whether or not a loan will be defaulted on or paid back.The dataset contains details about clients, their location ,  income, employment status and other features used to determine whether a loan should be given or not. The goal of this challenge is  to build both a random forest classifier and a neural network model. Once the models have been trained and tested,i will then evaluate and compare them and explain which is best.
This notebook is organised as follows :
1. Data Assessment
2. Data preparation/Cleaning
3. Building the models
4. Model Evaluation and Comparision






In [0]:
#import some packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')

% matplotlib inline

## 1 Assessment and Data cleaning
in this session i will be assessing the dataset and inspecting, if there are other data quality issues that needs cleaning. i will just perform operations to understand my dataset better and clean the data 

In [0]:
#load the dataset into a pandas dataframe
#set  na_values='Null' in read_csv to detect the nulls.
df = pd.read_csv('paylaterloan.csv',na_values='Null')
#set the max number of columns
pd.options.display.max_columns = 33

  interactivity=interactivity, compiler=compiler, result=result)


In [0]:
#preview the dataset
df.head()

Unnamed: 0,loanID,clientId,clientIncome,dueDate,incomeVerified,clientAge,clientGender,clientMaritalStatus,clientLoanPurpose,clientResidentialStauts,clientState,clientTimeAtEmployer,clientNumberPhoneContacts,clientAvgCallsPerDay,loanType,loanNumber,applicationDate,approvalDate,declinedDate,disbursementDate,payout_status,paidAt,loanAmount,interestRate,loanTerm,max_amount_taken,max_tenor_taken,repaidDate,Firstpaymentdate,settleDays,firstPaymentRatio,firstPaymentDefault,loanDefault
0,302039538823,755398623,52500.0,2018-09-10 06:35:11 UTC,False,29,FEMALE,Single,business,Rented,KANO,7,257.0,153.0,paylater,6,2018-07-12,2018-07-12,,2018-07-12,SUCCESS,2018-09-04 11:54:00 UTC,16000,20.0,60,1,1,2018-08-01 11:30:31 UTC,2018-08-13 12:00:00 UTC,-12,0.0,0,0
1,302251936174,915689736,52500.0,2018-10-21 07:13:29 UTC,False,25,MALE,Single,business,Rented,LAGOS,21,3964.0,269.426415,paylater,9,2018-08-22,2018-08-22,,2018-08-22,SUCCESS,2018-09-06 04:44:04 UTC,14500,15.0,60,0,1,2018-09-06 05:34:55 UTC,2018-09-21 12:00:00 UTC,-15,0.0,0,0
2,302270229003,292629156,35000.0,2018-10-23 11:00:00 UTC,False,32,MALE,Single,education,Rented,ANAMBRA,29,1140.0,77.147826,paylater,2,2018-08-25,2018-08-25,,2018-08-25,SUCCESS,,19500,15.0,60,0,1,2018-11-27 01:33:56 UTC,2018-09-24 12:00:00 UTC,64,0.0,1,1
3,301929229685,671710636,35000.0,2018-08-18 04:21:05 UTC,False,28,FEMALE,Married,business,Own Residence,OSUN,36+,2764.0,31.678112,paylater,4,2018-06-19,2018-06-19,,2018-06-19,SUCCESS,2018-07-10 11:23:31 UTC,19500,15.0,60,1,1,2018-07-09 06:48:44 UTC,2018-07-19 12:00:00 UTC,-10,0.0,0,0
4,301841795739,367769827,35000.0,2018-08-01 07:31:40 UTC,False,34,MALE,Married,medical,Rented,ONDO,36+,504.0,3.0,paylater,5,2018-06-02,2018-06-02,,2018-06-02,SUCCESS,2018-08-09 06:05:37 UTC,17500,12.5,60,1,1,2018-07-14 02:28:10 UTC,2018-07-02 12:00:00 UTC,12,0.0,1,0


In [0]:
#check the data types and check for null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159596 entries, 0 to 159595
Data columns (total 33 columns):
loanID                       159596 non-null object
clientId                     159596 non-null int64
clientIncome                 159596 non-null float64
dueDate                      159596 non-null object
incomeVerified               159493 non-null object
clientAge                    159596 non-null int64
clientGender                 159596 non-null object
clientMaritalStatus          159590 non-null object
clientLoanPurpose            159596 non-null object
clientResidentialStauts      158460 non-null object
clientState                  159595 non-null object
clientTimeAtEmployer         155402 non-null object
clientNumberPhoneContacts    156888 non-null float64
clientAvgCallsPerDay         156909 non-null float64
loanType                     159596 non-null object
loanNumber                   159596 non-null int64
applicationDate              159596 non-null object
appro

In [0]:
# check unique values in clientAge column
df.clientAge.unique()

array([ 29,  25,  32,  28,  34,  35,  40,  36,  30,  48,  31,  33,  46,
        39,  47,  23,  42,  26,  24,  37,  27,  38,  43,  50,  22,  21,
        44,  45,  52,  49,  54,  41,  53,  55,  59,  51,  56,  58,  57,
        60,  20,  19, 138,  18])

In [0]:
#look up the values in incomeVerified column
df.incomeVerified.value_counts()

False    133286
True      26207
Name: incomeVerified, dtype: int64

In [0]:
#no duplicates
sum(df.duplicated())

0

In [0]:
#check gender count of clients
df.clientGender.value_counts()

MALE      109666
FEMALE     49930
Name: clientGender, dtype: int64

In [0]:
#check values in column
df.clientLoanPurpose.unique()

array(['business', 'education', 'medical', 'other', 'house'], dtype=object)

In [0]:
#check marital status column
df.clientMaritalStatus.unique()

array(['Single', 'Married', 'Separated', 'Widowed', 'Divorced', nan],
      dtype=object)

In [0]:
df.clientResidentialStauts.unique()

array(['Rented', 'Own Residence', 'Family Owned', 'Employer Provided',
       'Temp. Residence', nan], dtype=object)

In [0]:
df.clientState.unique()

array(['KANO', 'LAGOS', 'ANAMBRA', 'OSUN', 'ONDO', 'OYO', 'RIVERS',
       'AKWA IBOM', 'KWARA', 'ENUGU', 'ABUJA', 'OGUN', 'EDO', 'NIGER',
       'BAYELSA', 'EKITI', 'DELTA', 'BENUE', 'CROSS RIVER', 'KADUNA',
       'ABIA', 'PLATEAU', 'EBONYI', 'KOGI', 'IMO', 'ADAMAWA', 'JIGAWA',
       'NASARAWA', 'KEBBI', 'KATSINA', 'BAUCHI', 'SOKOTO', 'TARABA',
       'ZAMFARA', 'GOMBE', 'BORNO', 'YOBE', 'OJO', nan, 'LAGOS '],
      dtype=object)

### Assessment


*   All the date columns are of the wrong datatype.
*   clientTimeAtEmployer, is of type String instead of int and, contain Null entries.
* clientNumberPhoneContacts and
clientAvgCallsPerDay contain nulls
* clientAge contains an outlier, a client aged 138 years old.
* clientId  should not be a String data type.
* incomeVerified column have Null entries.
* clientResidentialStauts column have Null entries.
* declinedDate column populated with Null entries.
* clientResidentialStauts column wrongly splet.

### Cleaning
* Change date columns to datatime datatype.
* Set clientTimeAtEmployer, clientNumberPhoneContacts and   clientAvgCallsPerDay to Integers and float.
* Replace client age of 138 years with average age of clients.
* Drop declinedDate column.
* Set client id to String datatype.
* Rename clientResidentialStauts column
* Replace Null values in numeric columns  with average

In [0]:
#create copy of dataset
df_cleaned = df.copy()

### Set date columns to datetime

In [0]:
#create a list of date columns
date_columns = ['applicationDate', 'approvalDate','declinedDate',
       'disbursementDate']

def toDateTime(cols):
  for c in cols:
   df_cleaned[c] =  pd.to_datetime(df_cleaned[c])

  #call the function
toDateTime(date_columns)

In [0]:
#a list of date columns with time-stamp
date_ = ['repaidDate','dueDate','paidAt','Firstpaymentdate']
#for loop
for cols in date_:
  df_cleaned[cols] =  pd.to_datetime(df_cleaned[cols],format='%Y-%m-%d %H:%M:%S %Z', 
                                       errors='coerce')

### Set columns to datatype float and int

In [0]:
#clientNumberPhoneContacts contains null entries
#replace null with average
df_cleaned.clientNumberPhoneContacts.fillna(df_cleaned.clientNumberPhoneContacts.mean(),inplace=True)

In [0]:
df_cleaned.clientNumberPhoneContacts.unique()

array([  257.,  3964.,  1140., ..., 12887.,  7570.,  4281.])

In [0]:
df_cleaned.clientNumberPhoneContacts.dtype

dtype('float64')

In [0]:
#clientNumberPhoneContact to int
df_cleaned.clientNumberPhoneContacts = df_cleaned.clientNumberPhoneContacts.astype(int)

In [0]:
#check datatype
df_cleaned.clientNumberPhoneContacts.dtype

dtype('int64')

In [0]:
#clientAvgCallsPerDay contains null entries
#replace null with average
df_cleaned.clientAvgCallsPerDay.fillna(df_cleaned.clientAvgCallsPerDay.mean(),inplace=True)


In [0]:
# clientAvgCallsPerDay to type float
df_cleaned.clientAvgCallsPerDay = df_cleaned.clientAvgCallsPerDay.astype(float)
#round of to 2 significant figures
df_cleaned.clientAvgCallsPerDay = round(df_cleaned.clientAvgCallsPerDay,2)

In [0]:
#check datatype
df_cleaned.clientAvgCallsPerDay.dtype

dtype('float64')

In [0]:
#clientTimeAtEmployer contains null entries
#replace null with '0' 
#time at employer is in months , so i am of them thinking those with negatives have not worked up to a month
df_cleaned.clientTimeAtEmployer.fillna(0,inplace=True)

In [0]:
 df_cleaned.clientTimeAtEmployer.unique()

array(['7', '21', '29', '36+', '0', '1', '27', '28', '11', '17', '18',
       '5', '34', '26', '2', '9', '3', '16', '4', '22', '6', 0, '24',
       '14', '19', '15', '20', '12', '13', '31', '32', '25', '33', '10',
       '30', '8', '23', '35', '-6', '-7', '-5'], dtype=object)

In [0]:
#time at employer is in months , so i am of them thinking those with negatives have not worked up to a month

#replace values with '-' with '0'
df_cleaned.clientTimeAtEmployer.replace(to_replace=['-6', '-7'],value='0', inplace=True)

In [0]:
#remove '+' in front of 36 then convert clientTimeAtEmployer to int
df_cleaned.clientTimeAtEmployer = df_cleaned.clientTimeAtEmployer.str.rstrip('+')


In [0]:
#convert to float first
df_cleaned.clientTimeAtEmployer = df_cleaned.clientTimeAtEmployer.astype(float)

In [0]:
#fill nan in type float with average
df_cleaned.clientTimeAtEmployer.fillna(df_cleaned.clientTimeAtEmployer.mean(),inplace=True)

In [0]:
#convert to integer
df_cleaned.clientTimeAtEmployer = df_cleaned.clientTimeAtEmployer.astype(int)

In [0]:
#check datatype
df_cleaned.clientTimeAtEmployer.dtype

dtype('int64')

### Set clientID to string

In [0]:
##clientID to string
df_cleaned.clientId = df_cleaned.clientId.astype(str)

In [0]:
##check
df_cleaned.clientId.dtype

dtype('O')

### Replace client age 138 with average age of clients

In [0]:
df_cleaned.clientAge.replace(138,value=33.5,inplace=True)

In [0]:
#check
df_cleaned.query('clientAge == 138')

### Rename clientResidentialStauts column

In [0]:
 df_cleaned.rename(columns={"clientResidentialStauts":"clientResidentialStatus"},inplace=True)

In [0]:
df_cleaned.head()

Unnamed: 0,loanID,clientId,clientIncome,dueDate,incomeVerified,clientAge,clientGender,clientMaritalStatus,clientLoanPurpose,clientResidentialStatus,clientState,clientTimeAtEmployer,clientNumberPhoneContacts,clientAvgCallsPerDay,loanType,loanNumber,applicationDate,approvalDate,declinedDate,disbursementDate,payout_status,paidAt,loanAmount,interestRate,loanTerm,max_amount_taken,max_tenor_taken,repaidDate,Firstpaymentdate,settleDays,firstPaymentRatio,firstPaymentDefault,loanDefault
0,302039538823,755398623,52500.0,2018-09-10 06:35:11,False,29.0,FEMALE,Single,business,Rented,KANO,7,257,153.0,paylater,6,2018-07-12,2018-07-12,NaT,2018-07-12,SUCCESS,2018-09-04 11:54:00,16000,20.0,60,1,1,2018-08-01 11:30:31,2018-08-13 12:00:00,-12,0.0,0,0
1,302251936174,915689736,52500.0,2018-10-21 07:13:29,False,25.0,MALE,Single,business,Rented,LAGOS,21,3964,269.43,paylater,9,2018-08-22,2018-08-22,NaT,2018-08-22,SUCCESS,2018-09-06 04:44:04,14500,15.0,60,0,1,2018-09-06 05:34:55,2018-09-21 12:00:00,-15,0.0,0,0
2,302270229003,292629156,35000.0,2018-10-23 11:00:00,False,32.0,MALE,Single,education,Rented,ANAMBRA,29,1140,77.15,paylater,2,2018-08-25,2018-08-25,NaT,2018-08-25,SUCCESS,NaT,19500,15.0,60,0,1,2018-11-27 01:33:56,2018-09-24 12:00:00,64,0.0,1,1
3,301929229685,671710636,35000.0,2018-08-18 04:21:05,False,28.0,FEMALE,Married,business,Own Residence,OSUN,36,2764,31.68,paylater,4,2018-06-19,2018-06-19,NaT,2018-06-19,SUCCESS,2018-07-10 11:23:31,19500,15.0,60,1,1,2018-07-09 06:48:44,2018-07-19 12:00:00,-10,0.0,0,0
4,301841795739,367769827,35000.0,2018-08-01 07:31:40,False,34.0,MALE,Married,medical,Rented,ONDO,36,504,3.0,paylater,5,2018-06-02,2018-06-02,NaT,2018-06-02,SUCCESS,2018-08-09 06:05:37,17500,12.5,60,1,1,2018-07-14 02:28:10,2018-07-02 12:00:00,12,0.0,1,0


### Drop declined date rows
I will be dropping rows which do not have nulls in them. as this is contradictory to what the feature represents, since the loan was approved it should not then turn up as declined later.

In [0]:
df_cleaned[df_cleaned.declinedDate.isna() == False]

Unnamed: 0,loanID,clientId,clientIncome,dueDate,incomeVerified,clientAge,clientGender,clientMaritalStatus,clientLoanPurpose,clientResidentialStatus,clientState,clientTimeAtEmployer,clientNumberPhoneContacts,clientAvgCallsPerDay,loanType,loanNumber,applicationDate,approvalDate,declinedDate,disbursementDate,payout_status,paidAt,loanAmount,interestRate,loanTerm,max_amount_taken,max_tenor_taken,repaidDate,Firstpaymentdate,settleDays,firstPaymentRatio,firstPaymentDefault,loanDefault
5936,303127712,744689741,17500.0,2018-10-24 01:58:37,False,36.0,FEMALE,Married,business,Rented,LAGOS,36,3356,11.43,paylater,7,2018-04-27,2018-04-27,2018-05-07,2018-04-27,SUCCESS,NaT,73000,7.5,180,1,1,NaT,2018-05-28 12:00:00,269,1.0,1,1
7063,303054285,744689741,17500.0,2018-10-09 04:43:09,False,36.0,FEMALE,Married,business,Rented,LAGOS,36,3084,11.61,paylater,7,2018-04-12,2018-04-12,2018-04-18,2018-04-12,SUCCESS,NaT,73000,7.5,180,1,1,NaT,2018-05-12 12:00:00,285,1.0,1,1
17321,303078941,659790230,35000.0,2018-10-15 04:52:50,False,29.0,MALE,Single,education,Family Owned,LAGOS,5,346,57.32,paylater,10,2018-04-17,2018-04-17,2018-04-19,2018-04-17,SUCCESS,NaT,83000,7.5,180,0,1,NaT,2018-05-17 12:00:00,280,1.0,1,1
23736,302263901520,210673692,25000.0,2018-11-22 06:43:26,False,45.0,FEMALE,Married,house,Family Owned,KWARA,36,1253,16.01,paylater,4,2018-08-24,2018-08-24,2018-09-05,2018-08-24,SUCCESS,NaT,20000,12.5,90,0,1,NaT,2018-09-23 12:00:00,151,1.0,1,1
28414,301940309311,564984031,35000.0,2018-12-19 08:37:35,False,36.0,FEMALE,Married,education,Rented,LAGOS,36,729,0.38,paylater,4,2018-06-21,2018-06-21,2018-11-01,2018-06-22,SUCCESS,NaT,79000,7.5,180,1,1,2018-12-05 09:10:47,2018-07-23 12:00:00,135,0.0,1,1
39185,303060958,838864909,20000.0,2018-10-10 10:36:33,False,47.0,FEMALE,Married,business,Rented,OYO,36,1780,6.57,paylater,5,2018-04-13,2018-04-13,2018-04-18,2018-04-13,SUCCESS,NaT,60000,7.5,180,1,1,NaT,2018-05-13 12:00:00,284,1.0,1,1
52882,301793798267,698195772,25000.0,2018-11-20 02:41:52,True,32.0,FEMALE,Married,business,Rented,OYO,36,1923,63.45,paylater,4,2018-05-24,2018-05-24,2018-11-01,2018-05-24,SUCCESS,NaT,58000,5.0,180,1,1,2018-10-24 12:00:00,2018-06-25 12:00:00,121,0.0,1,1


In [0]:
#drop rows without declineddate null
df_cleaned = df_cleaned[df_cleaned.declinedDate.isna()]

In [0]:
# 7 rows where dropped
df_cleaned.shape

(159589, 33)

## #Investigate the loan default distribution of classes

In [0]:
#check count of  classes in loan default
df_cleaned.loanDefault.value_counts()

0    115304
1     44285
Name: loanDefault, dtype: int64

In [0]:
#proportion of defaults
#proportion of people who pay loans
round(df_cleaned.loanDefault.value_counts()[0]/len(df),2)

0.72

In [0]:
#proportion of defaulters 
round(df.loanDefault.value_counts()[1]/len(df),2)

0.28

28% are defaulters and 72% paid back their loans.

## 3. Data preparation
in this section i will perform several operations to prepare my data for training. i will handle missing values, one-hot-encode categorical data, balance the classes, select relevant features i deem fit and split my data into a training and testing set.

### Missing Values
some of the features contained Nulls, i have taken care of the numerical features with nulls above. i would fill the categorical features that have nulls with The mode (Most frequent value.) then change all to type category.

Categorical columns with nulls :
clientMaritalStatus , clientState, clientResidentialStatus,
IncomeVerified.


**ClientMaritalStatus**

In [0]:
#clientMaritalStatus
#find the mode
df_cleaned.clientMaritalStatus.value_counts()

Married      85486
Single       71360
Separated     1795
Widowed        939
Divorced         3
Name: clientMaritalStatus, dtype: int64

In [0]:
#replace null with Mode which is Married
df_cleaned.clientMaritalStatus.fillna('Married',inplace=True)

In [0]:
#check
sum(df_cleaned.clientMaritalStatus.isna())

0

In [0]:
#set data type to category
df_cleaned.clientMaritalStatus = df_cleaned.clientMaritalStatus.astype('category')

In [0]:
#check data type
df_cleaned.clientMaritalStatus.dtype

CategoricalDtype(categories=['Divorced', 'Married', 'Separated', 'Single', 'Widowed'], ordered=False)

**ClientState**

In [0]:
#clientState
#find the mode
df_cleaned.clientState.value_counts()

LAGOS          60662
OGUN           14798
ABUJA          13999
OYO            12729
RIVERS          7829
DELTA           5081
KWARA           3968
OSUN            3816
ONDO            3571
KADUNA          3496
EDO             2663
NIGER           2462
BENUE           2043
EKITI           2023
AKWA IBOM       1960
KOGI            1896
PLATEAU         1715
CROSS RIVER     1608
NASARAWA        1589
ENUGU           1566
ANAMBRA         1496
ABIA            1426
IMO             1287
KANO            1233
BAYELSA         1179
ADAMAWA          569
BAUCHI           418
SOKOTO           398
EBONYI           392
TARABA           336
KEBBI            279
GOMBE            265
KATSINA          248
ZAMFARA          221
BORNO            210
YOBE              83
JIGAWA            72
LAGOS              1
OJO                1
Name: clientState, dtype: int64

In [0]:
#replace null with Mode which is Lagos
df_cleaned.clientState.fillna('LAGOS',inplace=True)

In [0]:
#rename OJO to OYO State
df_cleaned.clientState.replace("OJO",value="OYO",inplace=True)

In [0]:
#set data type to category
df_cleaned.clientMaritalStatus = df_cleaned.clientState.astype('category')

**clientResidentialStatus**

In [0]:
#find the mode
df_cleaned.clientResidentialStatus.value_counts()

Rented               100529
Own Residence         26406
Family Owned          25666
Employer Provided      5589
Temp. Residence         263
Name: clientResidentialStatus, dtype: int64

In [0]:
#replace null with Mode which is Rented
df_cleaned.clientResidentialStatus.fillna('Rented',inplace=True)

In [0]:
#set data type to category
df_cleaned.clientResidentialStatus = df_cleaned.clientResidentialStatus.astype('category')

**IncomeVerified**

In [0]:
#find the mode
df_cleaned.incomeVerified.value_counts()


False    133280
True      26206
Name: incomeVerified, dtype: int64

In [0]:
#replace null with Mode which is False
df_cleaned.incomeVerified.fillna('False',inplace=True)

In [0]:
#set data type to category
df_cleaned.incomeVerified = df_cleaned.incomeVerified.astype('category')

**Set remaining categorical data to type category**

In [0]:
#set data type to category
df_cleaned.clientGender = df_cleaned.clientGender.astype('category')
df_cleaned.clientLoanPurpose = df_cleaned.clientLoanPurpose.astype('category')

In [0]:
#save data
df_cleaned.to_csv('cleanData',index=False)

### One-hot-encoding
Many machine learning algorithms cannot work with categorical data directly. The categories must be converted into numbers. so i will be performing One-hot-encoding on certain categorical features. before this i will select my X features and create my ouput Y(Label).

In [0]:
#select Features
X = df_cleaned[['clientIncome', 'incomeVerified', 'clientAge',
       'clientGender', 'clientMaritalStatus', 'clientLoanPurpose',
       'clientResidentialStatus', 'clientState', 'clientTimeAtEmployer',
       'clientNumberPhoneContacts', 'clientAvgCallsPerDay','loanNumber','loanAmount',
       'interestRate', 'loanTerm', 'max_amount_taken', 'max_tenor_taken','settleDays', 'firstPaymentRatio','firstPaymentDefault']]
Y = df_cleaned['loanDefault']

In [0]:
#20 features selected
X.shape

(159589, 20)

In [0]:
len(Y)

159589

In [0]:
#one hot encode categorical features
X = pd.get_dummies(X,columns=['clientMaritalStatus','incomeVerified','clientResidentialStatus','clientGender','clientState','clientLoanPurpose'])

In [0]:
X.head()

Unnamed: 0,clientIncome,clientAge,clientTimeAtEmployer,clientNumberPhoneContacts,clientAvgCallsPerDay,loanNumber,loanAmount,interestRate,loanTerm,max_amount_taken,max_tenor_taken,settleDays,firstPaymentRatio,firstPaymentDefault,clientMaritalStatus_ABIA,clientMaritalStatus_ABUJA,...,clientState_NIGER,clientState_OGUN,clientState_ONDO,clientState_OSUN,clientState_OYO,clientState_PLATEAU,clientState_RIVERS,clientState_SOKOTO,clientState_TARABA,clientState_YOBE,clientState_ZAMFARA,clientLoanPurpose_business,clientLoanPurpose_education,clientLoanPurpose_house,clientLoanPurpose_medical,clientLoanPurpose_other
0,52500.0,29.0,7,257,153.0,6,16000,20.0,60,1,1,-12,0.0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
1,52500.0,25.0,21,3964,269.43,9,14500,15.0,60,0,1,-15,0.0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
2,35000.0,32.0,29,1140,77.15,2,19500,15.0,60,0,1,64,0.0,1,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
3,35000.0,28.0,36,2764,31.68,4,19500,15.0,60,1,1,-10,0.0,0,0,0,...,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0
4,35000.0,34.0,36,504,3.0,5,17500,12.5,60,1,1,12,0.0,1,0,0,...,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0


In [0]:
X.dtypes

clientIncome                       float64
clientAge                          float64
clientTimeAtEmployer                 int64
clientNumberPhoneContacts            int64
clientAvgCallsPerDay               float64
loanNumber                           int64
loanAmount                           int64
interestRate                       float64
loanTerm                             int64
max_amount_taken                     int64
max_tenor_taken                      int64
settleDays                           int64
firstPaymentRatio                  float64
firstPaymentDefault                  int64
clientMaritalStatus_ABIA             uint8
clientMaritalStatus_ABUJA            uint8
clientMaritalStatus_ADAMAWA          uint8
clientMaritalStatus_AKWA IBOM        uint8
clientMaritalStatus_ANAMBRA          uint8
clientMaritalStatus_BAUCHI           uint8
clientMaritalStatus_BAYELSA          uint8
clientMaritalStatus_BENUE            uint8
clientMaritalStatus_BORNO            uint8
clientMarit

### Train and Test split
i will Split the data into training set (70%), and test set (30%). Training set will be used to fit the model, and test set will be to evaluate the best model.

In [0]:
# Using Skicit-learn to split data into training and testing sets
from sklearn.model_selection import train_test_split
# Split the data into training and testing sets
X_train,Y_train,X_labels,Y_labels = train_test_split(X, Y, test_size = 0.3, random_state = 42)

### Balance the Classes
Classification problems in most real world applications have imbalanced data sets. In other words, the positive examples (minority class) are a lot less than negative examples (majority class). in our dataset we have 72% (no default)negative values and 28% positive(loan default).class imbalance influences a learning algorithm during training by making the decision rule biased towards the majority class, that optimizes the model to make predictions based on the majority class in the dataset. so i am going  to balance the data set to achieve a model that is able to generalize and make good predictions on the minority class.


## 4. Buliding The Models.
The next step is to build the models. i will build a random forest and neural network, that both perform a binary classification on each loan.
the section is divided into the two different Classifiers, i will be building. Also i will perform hyper parameter tuning on both models to improve their performance. The following Metrics, will be used to evaluate our final models.

1.  **Accuracy** :
It’s the ratio of the correctly labeled subjects to the whole pool of subjects.
Accuracy is the most intuitive one.
2. **Precision** :
Precision is the ratio of the correctly +ve labeled by our program to all +ve labeled.
3. **Recall ** :
Recall is the ratio of the correctly +ve labeled by our program to all who are diabetic in reality.
4. ** F1-score**
F1 Score is the weighted average of precision and recall.
5. ** Auc ** :If you randomly chose one positive and one negative observation, AUC represents the likelihood that your classifier will assign a higher predicted probability to the positive observation.

.**General Flow**:
I will first build a base model, then do hyper parameter tuning, select the model with the best parameters. then use class balancing techiques  on the best model to build a model that predicts the minority class with higher accuracy(recall) i.e improving the Recall of the model. after training each model i scored each model using Kfold Cv  and Auc metric then evaluate model performance on test set. Cross-validation is primarily used in applied machine learning to estimate the skill of a machine learning model on unseen data. That is, to use a limited sample in order to estimate how the model is expected to perform in general when used to make predictions on data not used during the training of the model.

## Random Forest Classifier
Most important hyperparameters of Random Forest:
* n_estimators = n of trees
* max_features = max number of features considered for splitting a node
* max_depth = max number of levels in each decision tree
* min_samples_split = min number of data points placed in a node before the node is split
* min_samples_leaf = min number of data points allowed in a leaf node
* bootstrap = method for sampling data points (with or without replacement)





**Build base model**

In [0]:
#import evaluation metrics
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score, f1_score
#import classifier
from sklearn.ensemble import RandomForestClassifier

In [0]:
# Instantiate model
base_rf = RandomForestClassifier(random_state=42)

In [0]:
# Train the model on training data
base_rf.fit(X_train,X_labels)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=42, verbose=0, warm_start=False)

In [0]:
#score the model based on the AUC
#first with Cv
from sklearn.model_selection import cross_val_score

In [0]:
#model performance after training
base_score = cross_val_score(base_rf,X_train, X_labels,
                         scoring="roc_auc", cv=10)

In [0]:
#AUC 
base_score.mean()

0.874788073428107

In [0]:
#predict
base_predict = base_rf.predict(Y_train)

**Evaluate the performance of base model on test data**

In [0]:
#accuracy
accuracy_score(base_predict,Y_labels)

0.8672431438895503

In [0]:
#recall
recall_score(Y_labels,base_predict)

0.5995316159250585

In [0]:
#Precision
precision_score(Y_labels,base_predict)

0.8826604382159938

In [0]:
#f1 score
f1_score(Y_labels,base_predict)

0.7140543458700738

### Hyperparameter tuning
in machine learning, hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process. By contrast, the values of other parameters (typically node weights) are learned. i will be performing both Random Search and Grid Search techniques.

### Random Search
Random search is a technique where random combinations of the hyperparameters are used to find the best solution for the built model. i am going to evaluate a wide range of values for each hyperparameter to find the best. Using Scikit-Learn’s RandomizedSearchCV method, we can define a grid of hyperparameter ranges, and randomly sample from the grid.

In [0]:
#random search
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 25, stop = 100, num = 3)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(5, 10, num = 20)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

In [0]:
## Use the random grid to search for best hyperparameters
# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator = base_rf, param_distributions = random_grid,scoring='roc_auc', n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)

In [0]:
# Fit the random search model
rf_random.fit(X_train,X_labels)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:  3.5min
[Parallel(n_jobs=-1)]: Done 158 tasks      | elapsed: 12.3min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 23.0min finished


RandomizedSearchCV(cv=3, error_score='raise-deprecating',
          estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=42, verbose=0, warm_start=False),
          fit_params=None, iid='warn', n_iter=100, n_jobs=-1,
          param_distributions={'n_estimators': [25, 62, 100], 'max_features': ['auto', 'sqrt'], 'max_depth': [5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 10, None], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4], 'bootstrap': [True, False]},
          pre_dispatch='2*n_jobs', random_state=42, refit=True,
          return_train_score='warn', scoring='roc_auc', verbose=2)

In [0]:
rf_random.best_score_

0.9034379397792716

In [0]:
rf_random.best_params_

{'bootstrap': True,
 'max_depth': 10,
 'max_features': 'sqrt',
 'min_samples_leaf': 2,
 'min_samples_split': 5,
 'n_estimators': 100}

In [0]:
#set the best model to best_grid
rf_random = rf_random.best_estimator_

In [0]:
#score the model
random_score = cross_val_score(rf_random,X_train, X_labels,
                         scoring="roc_auc", cv=10)

In [0]:
#AUC 
random_score.mean()

0.9035897050309252

**Evaluate the performance of the random search model on the test set**

In [0]:
#predict
random_predict = rf_random.predict(Y_train)

In [0]:
#accuracy
accuracy_score(random_predict,Y_labels)

0.8731750109655994

In [0]:
#recall
recall_score(Y_labels,random_predict)

0.5690111052353253

In [0]:
#Precision
precision_score(Y_labels,random_predict)

0.9535384225851373

In [0]:
#f1 score
f1_score(Y_labels,random_predict)

0.7127176381529144

### Grid Search
Grid search is a technique which tends to find the right set of hyperparameters for the particular model.Random search allows us to narrow down the range for each hyperparameter. we can explicitly specify every combination of settings to try, using GridSearchCV.

In [0]:
from sklearn.model_selection import GridSearchCV
# Create the parameter grid based on the results of random search 
from sklearn.model_selection import GridSearchCV
# Create the parameter grid based on the results of random search 
param_grid = [
{'n_estimators': [10,25], 
 'max_features': [5, 10], 
 'max_depth': [10, 50, None], 
 'bootstrap': [True, False],
 'min_samples_leaf': [1, 3, 5],
    'min_samples_split': [5,8, 10],}
]

In [0]:
# Instantiate the grid search model
rf_grid_search = GridSearchCV(estimator = base_rf, param_grid = param_grid, 
                          cv = 3,scoring='roc_auc' ,n_jobs = -1, verbose = 2)

In [0]:
# Fit the grid search model
rf_grid_search.fit(X_train,X_labels)

In [0]:
#the best params
rf_grid_search.best_params_

{'bootstrap': False, 'max_depth': 10, 'max_features': 10, 'n_estimators': 25}

In [0]:
#set the best model to best_grid
best_grid = rf_grid_search.best_estimator_

In [0]:
#make predictions with the best model
predictions =  best_grid.predict(Y_train)

**Evaluate** the Gridsearch model

In [0]:
#accuracy
accuracy_score(predictions,Y_labels)

0.8731123503978946

In [0]:
#F1
f1_score(Y_labels,predictions)

0.7134028400245318

In [0]:
#recall
recall_score(Y_labels,predictions)

0.5712019339729546

In [0]:
#precision
precision_score(Y_labels,predictions)

0.9498743718592965

**Manual tuning**
from the parameters obtained from the grid search and random search i was able to find a better performing model using manual tuning. i did others which will not be represented in this notebook as some took too long to run.

In [0]:
#manual tuning model
best_rf = RandomForestClassifier(random_state=42,bootstrap=False,max_depth=50,
                                  max_features=20,n_estimators=100)

In [0]:
best_rf.fit(X_train,X_labels)

RandomForestClassifier(bootstrap=False, class_weight=None, criterion='gini',
            max_depth=50, max_features=20, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=42, verbose=0, warm_start=False)

In [0]:
#score after traing
best_score = cross_val_score(best_rf,X_train, X_labels,
                         scoring="roc_auc", cv=3)

In [0]:
#auc score
best_score.mean()

**Evaluate the model**

In [0]:
#prediction
Best_predict = best_rf.predict(Y_train)

In [0]:
#accuracy
accuracy_score(Best_predict,Y_labels)

0.8701673037157717

In [0]:
#F1
f1_score(Y_labels,Best_predict)

0.7188093730208993

In [0]:
#precision
precision_score(Y_labels,Best_predict)

0.8958168902920284

In [0]:
#recall
recall_score(Y_labels,Best_predict)

0.600211528291909

### Balance the classes and train on the best performing model

Oversampling and undersampling both have major drawbacks like prone to overfitting in oversampling minority class and loss of information by reducing the majority class.
the technique i will be using to balance the data set is:

* **Synthetic Minority Oversampling Technique (SMOTE):** It over-samples the minority class but using synthesized examples. It operates on feature space not the data space.
i will be using a library - imbalanced learn to perform this operations

In [0]:
from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import RobustScaler

In [0]:
best_rf = RandomForestClassifier(random_state=42,bootstrap=False,max_depth=50,
                                  max_features=20,n_estimators=100)

In [0]:
#perform SMOTE)
#first scale the data
scalar = RobustScaler()
X_train = scalar.fit_transform(X_train)
Y_train = scalar.fit_transform(Y_train)

In [0]:
smote = SMOTE(random_state=42,ratio='minority')

In [0]:
#res means resampled
X_train_res, X_labels_res = smote.fit_sample(X_train, X_labels)

In [0]:
#train on resampled data
best_rf.fit(X_train_res, X_labels_res)

RandomForestClassifier(bootstrap=False, class_weight=None, criterion='gini',
            max_depth=50, max_features=20, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=42, verbose=0, warm_start=False)

In [0]:
#score model
score = cross_val_score(best_rf,X_train_res, X_labels_res,
                         scoring="roc_auc", cv=3)

In [0]:
#AUC
score.mean()

0.9723146572915938

**Evaluate  model**

In [0]:
#predict
res_predict = best_rf.predict(Y_train)

In [0]:
#recall
recall_score(Y_labels,res_predict)

0.6447835612298859

In [0]:
#precision
precision_score(Y_labels,res_predict)

0.8267945364719558

In [0]:
#F1
f1_score(Y_labels,res_predict)

0.7245331069609506

In [0]:
#accuracy
accuracy_score(res_predict,Y_labels)

0.8644443051987384

## 5.  Model Evaluation and Comparision

Comparing the performance of models at training and testing.
The business problem being to reduce the risk of Carbon losing money due to clients defaulting on loans, My focus was to build a model that was geared towards having a High recall(predicting positvie values as positive) and High AUC, accuracy is no longer a good measure of performance for different models because if we simply predict all examples to belong to the negative class , we would achieve a High accuray always. After building the base model , i performed some Hyperparameter tuning to find a model with a better performance, i was able to settle on a model with recall of  0.6. i then balanced the data set using SMOTE technique to oversample the minority class. the final model had the highest** Precision : 0.83,** **Accuracy : 0.86**
**Recall : 0.64**, and an **AUC : of  0.9**. i Know further tuning could be performed on the model but that was not possible for me since, some of the hyperparamters i tried using took too long to Train aslo the random and gridsearch ran far too long. i'll see how well my NN classifier performs in the other notebook.