### For this assignment you will explore publicly available data from LendingClub.com. Lending Club connects people who need money (borrowers) with people who have money (investors). As an investor, you would want to lend to people whose profile indicates a high probability of paying you back. You will build a model that helps predict this.

### We will use lending data from 2007-2010 and try to classify whether or not the borrower paid back their loan in full.

credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
purpose: The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "home_improvement", "major_purchase", "small_business", and "all_other").
int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
installment: The monthly installments owed by the borrower if the loan is funded.
log.annual.inc: The natural log of the self-reported annual income of the borrower.
dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
fico: The FICO credit score of the borrower.
days.with.cr.line: The number of days the borrower has had a credit line.
revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).
revol.util: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).
inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months.
delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
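pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).
not.fully.paid: 1 if the loan was not paid back in full, and 0 otherwise. This is the target variable.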
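
Two of the columns described above are stored in transformed units: `log.annual.inc` is a natural log and `int.rate` is a proportion. The following is a minimal sketch of converting them back to readable units, using values from the first row of the data. The helper names `income_from_log` and `rate_as_percent` are illustrative and not part of the original notebook.

```python
import math

def income_from_log(log_income):
    """Invert the natural log to recover the annual income."""
    return math.exp(log_income)

def rate_as_percent(rate):
    """Convert a proportion such as 0.1189 to a percentage."""
    return rate * 100

# Values taken from the first row of the data set shown in this notebook.
print(round(income_from_log(11.350407)))
print(round(rate_as_percent(0.1189), 2))
```

Reading the features in their original units can make the later feature importances easier to interpret.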

In [46]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import PolynomialFeatures
import warnings
warnings.filterwarnings("ignore")
np.random.seed(42)

In [47]:
df = pd.read_csv('loan_data.csv')
df.head()

Unnamed: 0,credit.policy,purpose,int.rate,installment,log.annual.inc,dti,fico,days.with.cr.line,revol.bal,revol.util,inq.last.6mths,delinq.2yrs,pub.rec,not.fully.paid
0,1,debt_consolidation,0.1189,829.1,11.350407,19.48,737,5639.958333,28854,52.1,0,0,0,0
1,1,credit_card,0.1071,228.22,11.082143,14.29,707,2760.0,33623,76.7,0,0,0,0
2,1,debt_consolidation,0.1357,366.86,10.373491,11.63,682,4710.0,3511,25.6,1,0,0,0
3,1,debt_consolidation,0.1008,162.34,11.350407,8.1,712,2699.958333,33667,73.2,1,0,0,0
4,1,credit_card,0.1426,102.92,11.299732,14.97,667,4066.0,4740,39.5,0,1,0,0


In [48]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9578 entries, 0 to 9577
Data columns (total 14 columns):
credit.policy        9578 non-null int64
purpose              9578 non-null object
int.rate             9578 non-null float64
installment          9578 non-null float64
log.annual.inc       9578 non-null float64
dti                  9578 non-null float64
fico                 9578 non-null int64
days.with.cr.line    9578 non-null float64
revol.bal            9578 non-null int64
revol.util           9578 non-null float64
inq.last.6mths       9578 non-null int64
delinq.2yrs          9578 non-null int64
pub.rec              9578 non-null int64
not.fully.paid       9578 non-null int64
dtypes: float64(6), int64(7), object(1)
memory usage: 1.0+ MB


In [49]:
y=df.pop('not.fully.paid')

In [50]:
y.value_counts()

0    8045
1    1533
Name: not.fully.paid, dtype: int64
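
The counts above show that the classes are imbalanced. As a rough sketch, the accuracy of a trivial model that always predicts the majority class can be worked out from those counts alone; this is computed on the full data set, so the figure on any particular test split may differ slightly.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 8045, 1: 1533}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)
majority_accuracy = counts[majority_class] / total

print(f"total loans: {total}")
print(f"share not fully paid: {counts[1] / total:.3f}")
print(f"accuracy of always predicting {majority_class}: {majority_accuracy:.3f}")
```

Any model accuracy reported later is worth comparing against this figure of roughly 0.84.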

In [51]:
df.isnull().sum()

credit.policy        0
purpose              0
int.rate             0
installment          0
log.annual.inc       0
dti                  0
fico                 0
days.with.cr.line    0
revol.bal            0
revol.util           0
inq.last.6mths       0
delinq.2yrs          0
pub.rec              0
dtype: int64

In [52]:
import pandas_profiling

In [53]:
pandas_profiling.ProfileReport(df)

Dataset info
0,1
Number of variables,13
Number of observations,9578
Total Missing (%),0.0%
Total size in memory,972.8 KiB
Average record size in memory,104.0 B

Variable types
0,1
Numeric,11
Categorical,1
Boolean,1
Date,0
Text (Unique),0
Rejected,0
Unsupported,0

Variable: credit.policy (Boolean)
0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.80497

0,1
1,7710
0,1868

Value,Count,Frequency (%),Unnamed: 3
1,7710,80.5%,
0,1868,19.5%,

Variable: days.with.cr.line (Numeric)
0,1
Distinct count,2687
Unique (%),28.1%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,4560.8
Minimum,178.96
Maximum,17640
Zeros (%),0.0%

0,1
Minimum,178.96
5-th percentile,1320.0
Q1,2820.0
Median,4140.0
Q3,5730.0
95-th percentile,9330.0
Maximum,17640.0
Range,17461.0
Interquartile range,2910.0

0,1
Standard deviation,2496.9
Coef of variation,0.54748
Kurtosis,1.9379
Mean,4560.8
MAD,1897.7
Skewness,1.1557
Sum,43683000
Variance,6234700
Memory size,74.9 KiB

Value,Count,Frequency (%),Unnamed: 3
3660.0,50,0.5%,
3630.0,48,0.5%,
3990.0,46,0.5%,
4410.0,44,0.5%,
3600.0,41,0.4%,
2550.0,38,0.4%,
4080.0,38,0.4%,
3690.0,37,0.4%,
1800.0,37,0.4%,
4020.0,35,0.4%,

Value,Count,Frequency (%),Unnamed: 3
178.95833330000002,1,0.0%,
180.04166669999998,3,0.0%,
181.0,1,0.0%,
183.0416667,1,0.0%,
209.0416667,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
16260.0,1,0.0%,
16350.0,1,0.0%,
16652.0,1,0.0%,
17616.0,1,0.0%,
17639.95833,1,0.0%,

Variable: delinq.2yrs (Numeric)
0,1
Distinct count,11
Unique (%),0.1%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.16371
Minimum,0
Maximum,13
Zeros (%),88.3%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,0
95-th percentile,1
Maximum,13
Range,13
Interquartile range,0

0,1
Standard deviation,0.54621
Coef of variation,3.3365
Kurtosis,71.433
Mean,0.16371
MAD,0.28913
Skewness,6.0618
Sum,1568
Variance,0.29835
Memory size,74.9 KiB

Value,Count,Frequency (%),Unnamed: 3
0,8458,88.3%,
1,832,8.7%,
2,192,2.0%,
3,65,0.7%,
4,19,0.2%,
5,6,0.1%,
6,2,0.0%,
7,1,0.0%,
13,1,0.0%,
11,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,8458,88.3%,
1,832,8.7%,
2,192,2.0%,
3,65,0.7%,
4,19,0.2%,

Value,Count,Frequency (%),Unnamed: 3
6,2,0.0%,
7,1,0.0%,
8,1,0.0%,
11,1,0.0%,
13,1,0.0%,

Variable: dti (Numeric)
0,1
Distinct count,2529
Unique (%),26.4%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,12.607
Minimum,0
Maximum,29.96
Zeros (%),0.9%

0,1
Minimum,0.0
5-th percentile,1.27
Q1,7.2125
Median,12.665
Q3,17.95
95-th percentile,23.65
Maximum,29.96
Range,29.96
Interquartile range,10.738

0,1
Standard deviation,6.884
Coef of variation,0.54606
Kurtosis,-0.90036
Mean,12.607
MAD,5.796
Skewness,0.023941
Sum,120750
Variance,47.389
Memory size,74.9 KiB

Value,Count,Frequency (%),Unnamed: 3
0.0,89,0.9%,
10.0,19,0.2%,
0.6,16,0.2%,
13.16,13,0.1%,
19.2,13,0.1%,
15.1,13,0.1%,
12.0,13,0.1%,
6.0,13,0.1%,
13.28,12,0.1%,
10.8,12,0.1%,

Value,Count,Frequency (%),Unnamed: 3
0.0,89,0.9%,
0.01,1,0.0%,
0.02,1,0.0%,
0.03,1,0.0%,
0.04,2,0.0%,

Value,Count,Frequency (%),Unnamed: 3
29.72,1,0.0%,
29.74,1,0.0%,
29.9,1,0.0%,
29.95,1,0.0%,
29.96,1,0.0%,

Variable: fico (Numeric)
0,1
Distinct count,44
Unique (%),0.5%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,710.85
Minimum,612
Maximum,827
Zeros (%),0.0%

0,1
Minimum,612
5-th percentile,657
Q1,682
Median,707
Q3,737
95-th percentile,782
Maximum,827
Range,215
Interquartile range,55

0,1
Standard deviation,37.971
Coef of variation,0.053416
Kurtosis,-0.42231
Mean,710.85
MAD,31.264
Skewness,0.47126
Sum,6808486
Variance,1441.8
Memory size,74.9 KiB

Value,Count,Frequency (%),Unnamed: 3
687,548,5.7%,
682,536,5.6%,
692,498,5.2%,
697,476,5.0%,
702,472,4.9%,
707,444,4.6%,
667,438,4.6%,
677,427,4.5%,
717,424,4.4%,
662,414,4.3%,

Value,Count,Frequency (%),Unnamed: 3
612,2,0.0%,
617,1,0.0%,
622,1,0.0%,
627,2,0.0%,
632,6,0.1%,

Value,Count,Frequency (%),Unnamed: 3
807,45,0.5%,
812,33,0.3%,
817,6,0.1%,
822,5,0.1%,
827,1,0.0%,

Variable: inq.last.6mths (Numeric)
0,1
Distinct count,28
Unique (%),0.3%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1.5775
Minimum,0
Maximum,33
Zeros (%),38.0%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,1
Q3,2
95-th percentile,5
Maximum,33
Range,33
Interquartile range,2

0,1
Standard deviation,2.2002
Coef of variation,1.3948
Kurtosis,26.288
Mean,1.5775
MAD,1.4949
Skewness,3.5842
Sum,15109
Variance,4.8411
Memory size,74.9 KiB

Value,Count,Frequency (%),Unnamed: 3
0,3637,38.0%,
1,2462,25.7%,
2,1384,14.4%,
3,864,9.0%,
4,475,5.0%,
5,278,2.9%,
6,165,1.7%,
7,100,1.0%,
8,72,0.8%,
9,47,0.5%,

Value,Count,Frequency (%),Unnamed: 3
0,3637,38.0%,
1,2462,25.7%,
2,1384,14.4%,
3,864,9.0%,
4,475,5.0%,

Value,Count,Frequency (%),Unnamed: 3
27,1,0.0%,
28,1,0.0%,
31,1,0.0%,
32,1,0.0%,
33,1,0.0%,

Variable: installment (Numeric)
0,1
Distinct count,4788
Unique (%),50.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,319.09
Minimum,15.67
Maximum,940.14
Zeros (%),0.0%

0,1
Minimum,15.67
5-th percentile,65.559
Q1,163.77
Median,268.95
Q3,432.76
95-th percentile,756.27
Maximum,940.14
Range,924.47
Interquartile range,268.99

0,1
Standard deviation,207.07
Coef of variation,0.64894
Kurtosis,0.13791
Mean,319.09
MAD,165.63
Skewness,0.91252
Sum,3056200
Variance,42879
Memory size,74.9 KiB

Value,Count,Frequency (%),Unnamed: 3
317.72,41,0.4%,
316.11,34,0.4%,
319.47,29,0.3%,
381.26,27,0.3%,
662.68,27,0.3%,
156.1,24,0.3%,
320.95,24,0.3%,
669.33,23,0.2%,
334.67,23,0.2%,
188.02,23,0.2%,

Value,Count,Frequency (%),Unnamed: 3
15.67,1,0.0%,
15.69,1,0.0%,
15.75,1,0.0%,
15.76,1,0.0%,
15.91,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
916.95,2,0.0%,
918.02,2,0.0%,
922.42,1,0.0%,
926.83,2,0.0%,
940.14,1,0.0%,

Variable: int.rate (Numeric)
0,1
Distinct count,249
Unique (%),2.6%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.12264
Minimum,0.06
Maximum,0.2164
Zeros (%),0.0%

0,1
Minimum,0.06
5-th percentile,0.0774
Q1,0.1039
Median,0.1221
Q3,0.1407
95-th percentile,0.167
Maximum,0.2164
Range,0.1564
Interquartile range,0.0368

0,1
Standard deviation,0.026847
Coef of variation,0.21891
Kurtosis,-0.22432
Mean,0.12264
MAD,0.021441
Skewness,0.16442
Sum,1174.6
Variance,0.00072076
Memory size,74.9 KiB

Value,Count,Frequency (%),Unnamed: 3
0.1253,354,3.7%,
0.0894,299,3.1%,
0.1183,243,2.5%,
0.1218,215,2.2%,
0.0963,210,2.2%,
0.1114,206,2.2%,
0.08,198,2.1%,
0.1287,197,2.1%,
0.1148,193,2.0%,
0.0932,187,2.0%,

Value,Count,Frequency (%),Unnamed: 3
0.06,8,0.1%,
0.0639,4,0.0%,
0.0676,9,0.1%,
0.0705,23,0.2%,
0.0712,9,0.1%,

Value,Count,Frequency (%),Unnamed: 3
0.2052,4,0.0%,
0.2086,6,0.1%,
0.209,2,0.0%,
0.2121,7,0.1%,
0.2164,2,0.0%,

Variable: log.annual.inc (Numeric)
0,1
Distinct count,1987
Unique (%),20.7%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,10.932
Minimum,7.5475
Maximum,14.528
Zeros (%),0.0%

0,1
Minimum,7.5475
5-th percentile,9.9179
Q1,10.558
Median,10.929
Q3,11.291
95-th percentile,11.918
Maximum,14.528
Range,6.9809
Interquartile range,0.73288

0,1
Standard deviation,0.61481
Coef of variation,0.056239
Kurtosis,1.609
Mean,10.932
MAD,0.46903
Skewness,0.028668
Sum,104710
Variance,0.37799
Memory size,74.9 KiB

Value,Count,Frequency (%),Unnamed: 3
11.00209984,308,3.2%,
10.81977828,248,2.6%,
10.59663473,224,2.3%,
10.30895266,224,2.3%,
10.71441777,221,2.3%,
11.22524339,196,2.0%,
11.15625052,165,1.7%,
10.77895629,149,1.6%,
10.91508846,147,1.5%,
11.08214255,146,1.5%,

Value,Count,Frequency (%),Unnamed: 3
7.547501682999999,1,0.0%,
7.60090246,1,0.0%,
8.101677747,1,0.0%,
8.160518247,1,0.0%,
8.188689124,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
13.71015004,2,0.0%,
13.99783211,1,0.0%,
14.12446477,1,0.0%,
14.18015367,1,0.0%,
14.52835448,1,0.0%,

Variable: pub.rec (Numeric)
0,1
Distinct count,6
Unique (%),0.1%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.062122
Minimum,0
Maximum,5
Zeros (%),94.2%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,0
95-th percentile,1
Maximum,5
Range,5
Interquartile range,0

0,1
Standard deviation,0.26213
Coef of variation,4.2196
Kurtosis,38.781
Mean,0.062122
MAD,0.11699
Skewness,5.1264
Sum,595
Variance,0.06871
Memory size,74.9 KiB

Value,Count,Frequency (%),Unnamed: 3
0,9019,94.2%,
1,533,5.6%,
2,19,0.2%,
3,5,0.1%,
5,1,0.0%,
4,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,9019,94.2%,
1,533,5.6%,
2,19,0.2%,
3,5,0.1%,
4,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
1,533,5.6%,
2,19,0.2%,
3,5,0.1%,
4,1,0.0%,
5,1,0.0%,

Variable: purpose (Categorical)
0,1
Distinct count,7
Unique (%),0.1%
Missing (%),0.0%
Missing (n),0

0,1
debt_consolidation,3957
all_other,2331
credit_card,1262
Other values (4),2028

Value,Count,Frequency (%),Unnamed: 3
debt_consolidation,3957,41.3%,
all_other,2331,24.3%,
credit_card,1262,13.2%,
home_improvement,629,6.6%,
small_business,619,6.5%,
major_purchase,437,4.6%,
educational,343,3.6%,

Variable: revol.bal (Numeric)
0,1
Distinct count,7869
Unique (%),82.2%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,16914
Minimum,0
Maximum,1207359
Zeros (%),3.4%

0,1
Minimum,0.0
5-th percentile,127.7
Q1,3187.0
Median,8596.0
Q3,18250.0
95-th percentile,57654.0
Maximum,1207359.0
Range,1207359.0
Interquartile range,15062.0

0,1
Standard deviation,33756
Coef of variation,1.9958
Kurtosis,259.66
Mean,16914
MAD,15560
Skewness,11.161
Sum,162001946
Variance,1139500000
Memory size,74.9 KiB

Value,Count,Frequency (%),Unnamed: 3
0,321,3.4%,
255,10,0.1%,
298,10,0.1%,
682,9,0.1%,
346,8,0.1%,
182,6,0.1%,
1085,6,0.1%,
2229,6,0.1%,
8035,5,0.1%,
6,5,0.1%,

Value,Count,Frequency (%),Unnamed: 3
0,321,3.4%,
1,5,0.1%,
2,2,0.0%,
3,1,0.0%,
4,2,0.0%,

Value,Count,Frequency (%),Unnamed: 3
407794,1,0.0%,
508961,1,0.0%,
602519,1,0.0%,
952013,1,0.0%,
1207359,1,0.0%,

Variable: revol.util (Numeric)
0,1
Distinct count,1035
Unique (%),10.8%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,46.799
Minimum,0
Maximum,119
Zeros (%),3.1%

0,1
Minimum,0.0
5-th percentile,1.1
Q1,22.6
Median,46.3
Q3,70.9
95-th percentile,94.0
Maximum,119.0
Range,119.0
Interquartile range,48.3

0,1
Standard deviation,29.014
Coef of variation,0.61998
Kurtosis,-1.1165
Mean,46.799
MAD,24.835
Skewness,0.059985
Sum,448240
Variance,841.84
Memory size,74.9 KiB

Value,Count,Frequency (%),Unnamed: 3
0.0,297,3.1%,
0.5,26,0.3%,
0.3,22,0.2%,
73.7,22,0.2%,
47.8,22,0.2%,
3.3,21,0.2%,
0.1,21,0.2%,
0.2,20,0.2%,
0.7,20,0.2%,
1.0,20,0.2%,

Value,Count,Frequency (%),Unnamed: 3
0.0,297,3.1%,
0.04,1,0.0%,
0.1,21,0.2%,
0.2,20,0.2%,
0.3,22,0.2%,

Value,Count,Frequency (%),Unnamed: 3
106.2,1,0.0%,
106.4,1,0.0%,
106.5,1,0.0%,
108.8,1,0.0%,
119.0,1,0.0%,

Unnamed: 0,credit.policy,purpose,int.rate,installment,log.annual.inc,dti,fico,days.with.cr.line,revol.bal,revol.util,inq.last.6mths,delinq.2yrs,pub.rec
0,1,debt_consolidation,0.1189,829.1,11.350407,19.48,737,5639.958333,28854,52.1,0,0,0
1,1,credit_card,0.1071,228.22,11.082143,14.29,707,2760.0,33623,76.7,0,0,0
2,1,debt_consolidation,0.1357,366.86,10.373491,11.63,682,4710.0,3511,25.6,1,0,0
3,1,debt_consolidation,0.1008,162.34,11.350407,8.1,712,2699.958333,33667,73.2,1,0,0
4,1,credit_card,0.1426,102.92,11.299732,14.97,667,4066.0,4740,39.5,0,1,0


In [60]:
df['purpose'] = df['purpose'].astype('category')

In [61]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9578 entries, 0 to 9577
Data columns (total 13 columns):
credit.policy        9578 non-null int64
purpose              9578 non-null category
int.rate             9578 non-null float64
installment          9578 non-null float64
log.annual.inc       9578 non-null float64
dti                  9578 non-null float64
fico                 9578 non-null int64
days.with.cr.line    9578 non-null float64
revol.bal            9578 non-null int64
revol.util           9578 non-null float64
inq.last.6mths       9578 non-null int64
delinq.2yrs          9578 non-null int64
pub.rec              9578 non-null int64
dtypes: category(1), float64(6), int64(6)
memory usage: 907.7 KB


In [62]:
encodedDF = pd.get_dummies(df[['purpose']])

In [63]:
encodedDF.head()


Unnamed: 0,purpose_all_other,purpose_credit_card,purpose_debt_consolidation,purpose_educational,purpose_home_improvement,purpose_major_purchase,purpose_small_business
0,0,0,1,0,0,0,0
1,0,1,0,0,0,0,0
2,0,0,1,0,0,0,0
3,0,0,1,0,0,0,0
4,0,1,0,0,0,0,0
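
The table above has one 0/1 indicator column per distinct `purpose` value. The idea can be sketched in plain Python as follows; this only illustrates the encoding and is not how `pd.get_dummies` is implemented. The function name `one_hot` is hypothetical.

```python
def one_hot(values, prefix):
    """Return (column_names, rows) with one 0/1 indicator column per distinct value."""
    categories = sorted(set(values))
    columns = [f"{prefix}_{c}" for c in categories]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return columns, rows

# A three-row example using purposes that appear in the data.
purposes = ["debt_consolidation", "credit_card", "debt_consolidation"]
columns, rows = one_hot(purposes, "purpose")
print(columns)
for row in rows:
    print(row)
```

Each row contains exactly one 1, so the indicator columns always sum to 1 across a row.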


In [64]:
# Replace 'purpose' with its dummy columns; 'credit.policy' is dropped from the features as well.
df = pd.concat([encodedDF,df.drop(['credit.policy','purpose'],axis=1)],axis=1)

In [65]:
df.head(10)

Unnamed: 0,purpose_all_other,purpose_credit_card,purpose_debt_consolidation,purpose_educational,purpose_home_improvement,purpose_major_purchase,purpose_small_business,int.rate,installment,log.annual.inc,dti,fico,days.with.cr.line,revol.bal,revol.util,inq.last.6mths,delinq.2yrs,pub.rec
0,0,0,1,0,0,0,0,0.1189,829.1,11.350407,19.48,737,5639.958333,28854,52.1,0,0,0
1,0,1,0,0,0,0,0,0.1071,228.22,11.082143,14.29,707,2760.0,33623,76.7,0,0,0
2,0,0,1,0,0,0,0,0.1357,366.86,10.373491,11.63,682,4710.0,3511,25.6,1,0,0
3,0,0,1,0,0,0,0,0.1008,162.34,11.350407,8.1,712,2699.958333,33667,73.2,1,0,0
4,0,1,0,0,0,0,0,0.1426,102.92,11.299732,14.97,667,4066.0,4740,39.5,0,1,0
5,0,1,0,0,0,0,0,0.0788,125.13,11.904968,16.98,727,6120.041667,50807,51.0,0,0,0
6,0,0,1,0,0,0,0,0.1496,194.02,10.714418,4.0,667,3180.041667,3839,76.8,0,0,1
7,1,0,0,0,0,0,0,0.1114,131.22,11.0021,11.08,722,5116.0,24220,68.6,0,0,0
8,0,0,0,0,1,0,0,0.1134,87.19,11.407565,17.25,682,3989.0,69909,51.1,1,0,0
9,0,0,1,0,0,0,0,0.1221,84.12,10.203592,10.0,707,2730.041667,5630,23.0,1,0,0


In [66]:
train, test, y_train, y_test = train_test_split(df,y,test_size = 0.2,random_state=101)

In [67]:
lr = LogisticRegression()
lr.fit(train, y_train)
y_pred = lr.predict(test)
print('Accuracy score baseline:', accuracy_score(y_test,y_pred))

Accuracy score baseline: 0.8465553235908142


In [71]:
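def fit_predict(train, test, y_train, y_test, scaler, max_depth,
                criterion='entropy', max_features=1, min_samples_split=4):
    """Scale the features, fit a decision tree and print its test accuracy."""
    # Fit the scaler on the training data only, then apply it to the test data.
    train_scaled = scaler.fit_transform(train)
    test_scaled = scaler.transform(test)
    # Pass criterion by keyword; recent scikit-learn versions reject it positionally.
    dt = DecisionTreeClassifier(criterion=criterion, max_depth=max_depth,
                                random_state=42, max_features=max_features,
                                min_samples_split=min_samples_split)
    dt.fit(train_scaled, y_train)
    y_pred = dt.predict(test_scaled)
    print(accuracy_score(y_test, y_pred))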

In [72]:
dt = DecisionTreeClassifier()
dt.fit(train, y_train)
y_pred = dt.predict(test)
print(accuracy_score(y_test,y_pred))

0.7437369519832986


### Max depth tuning

In [75]:
for i in range(1,20):
    print('Accuracy score using max_depth =', i, end= ': ')
    fit_predict(train, test, y_train, y_test, StandardScaler(), i)

Accuracy score using max_depth = 1: 0.8470772442588727
Accuracy score using max_depth = 2: 0.8470772442588727
Accuracy score using max_depth = 3: 0.8465553235908142
Accuracy score using max_depth = 4: 0.8470772442588727
Accuracy score using max_depth = 5: 0.8465553235908142
Accuracy score using max_depth = 6: 0.8465553235908142
Accuracy score using max_depth = 7: 0.843945720250522
Accuracy score using max_depth = 8: 0.8397703549060542
Accuracy score using max_depth = 9: 0.8423799582463466
Accuracy score using max_depth = 10: 0.8366388308977035
Accuracy score using max_depth = 11: 0.8324634655532359
Accuracy score using max_depth = 12: 0.8303757828810021
Accuracy score using max_depth = 13: 0.8282881002087683
Accuracy score using max_depth = 14: 0.8204592901878914
Accuracy score using max_depth = 15: 0.819937369519833
Accuracy score using max_depth = 16: 0.8094989561586639
Accuracy score using max_depth = 17: 0.8021920668058455
Accuracy score using max_depth = 18: 0.7990605427974948
Acc

### Max features tuning

In [76]:
for i in np.arange(0.1, 1.0, 0.1):
    print('Accuracy score using max_features =', i, end= ': ')
    fit_predict(train, test, y_train, y_test, StandardScaler(), max_depth = 4, max_features=i)

Accuracy score using max_features = 0.1: 0.8470772442588727
Accuracy score using max_features = 0.2: 0.8470772442588727
Accuracy score using max_features = 0.30000000000000004: 0.8470772442588727
Accuracy score using max_features = 0.4: 0.8444676409185804
Accuracy score using max_features = 0.5: 0.8434237995824635
Accuracy score using max_features = 0.6: 0.8455114822546973
Accuracy score using max_features = 0.7000000000000001: 0.8444676409185804
Accuracy score using max_features = 0.8: 0.842901878914405
Accuracy score using max_features = 0.9: 0.843945720250522


### Min samples split tuning

In [78]:
for i in range(2,10):
    print('Accuracy score using min sample split=', i, end= ': ')
    fit_predict(train, test, y_train, y_test, StandardScaler(), 4, max_features= 0.1, min_samples_split=i)

Accuracy score using min sample split= 2: 0.8470772442588727
Accuracy score using min sample split= 3: 0.8470772442588727
Accuracy score using min sample split= 4: 0.8470772442588727
Accuracy score using min sample split= 5: 0.8470772442588727
Accuracy score using min sample split= 6: 0.8470772442588727
Accuracy score using min sample split= 7: 0.8470772442588727
Accuracy score using min sample split= 8: 0.8470772442588727
Accuracy score using min sample split= 9: 0.8470772442588727


### Criterion tuning

In [79]:
for i in ['gini', 'entropy']:
    print('Accuracy score using criterion =', i, end= ': ')
    fit_predict(train, test, y_train, y_test, StandardScaler(), 4, 
                max_features= 0.1, min_samples_split=2,criterion = i)

Accuracy score using criterion = gini: 0.8465553235908142
Accuracy score using criterion = entropy: 0.8470772442588727
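
The two criteria compared above are different impurity measures for a tree node. As a small sketch under the assumption of a binary target, both can be written as functions of `p`, the share of class 1 in the node; the function names are illustrative.

```python
import math

def gini(p):
    """Gini impurity of a binary node where p is the share of class 1."""
    return 1.0 - p**2 - (1.0 - p)**2

def entropy(p):
    """Shannon entropy (in bits) of a binary node; 0 for a pure node."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

# 0.16 is roughly the share of not-fully-paid loans in this data set.
for p in (0.0, 0.16, 0.5):
    print(f"p={p:.2f}  gini={gini(p):.3f}  entropy={entropy(p):.3f}")
```

Both measures are zero for a pure node and largest at `p = 0.5`, which is one reason the two criteria often lead to similar trees.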


In [80]:
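def create_poly(train, test, degree):
    """Expand train and test features into polynomial terms up to the given degree."""
    poly = PolynomialFeatures(degree=degree)
    train_poly = poly.fit_transform(train)
    # Reuse the transformer fitted on the training data instead of refitting on test.
    test_poly = poly.transform(test)
    return train_poly, test_poly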

In [81]:
for degree in [1,2,3,4]:
    train_poly, test_poly = create_poly(train, test, degree)
    print('Polynomial degree',degree)
    fit_predict(train_poly, test_poly, y_train, y_test, StandardScaler(), 4,
               max_features = 0.1, min_samples_split = 2, criterion = 'entropy')
    print(10*'-')
    
train_poly, test_poly = create_poly(train, test, 2)

Polynomial degree 1
0.8470772442588727
----------
Polynomial degree 2
0.8434237995824635
----------
Polynomial degree 3
0.8455114822546973
----------
Polynomial degree 4
0.8470772442588727
----------
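
To put the polynomial experiment above in context, it can help to count how many columns each degree produces. The sketch below assumes 18 input columns (the width of the feature matrix after one-hot encoding) and the default `PolynomialFeatures` settings, which include a bias column and all interaction terms; the helper name is hypothetical.

```python
from math import comb

def n_poly_features(n_inputs, degree):
    """Count monomials of total degree <= degree in n_inputs variables (bias term included)."""
    return comb(n_inputs + degree, degree)

n_inputs = 18  # assumed: columns left after one-hot encoding in this notebook
for degree in (1, 2, 3, 4):
    print(degree, n_poly_features(n_inputs, degree))
```

The column count grows quickly with the degree, so higher degrees give the tree many more correlated candidate features to split on.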


### Random Forest

In [83]:
from sklearn.ensemble import RandomForestClassifier

In [84]:
rf = RandomForestClassifier(criterion='gini')

In [85]:
rf.fit(train,y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [86]:
pred_rf = rf.predict(test)

In [87]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test,pred_rf))

0.8382045929018789


In [88]:
from sklearn.model_selection import GridSearchCV

In [89]:
params = {'n_estimators':[200,500,700],'max_depth':[10,15,18,20],
         'min_samples_leaf':[3,5,7]}

In [90]:
gs = GridSearchCV(rf,params,verbose=3)

In [91]:
gs.fit(train,y_train)

Fitting 3 folds for each of 36 candidates, totalling 108 fits
[CV] max_depth=10, min_samples_leaf=3, n_estimators=200 ..............
[CV]  max_depth=10, min_samples_leaf=3, n_estimators=200, score=0.8367906066536204, total=   1.9s
[CV] max_depth=10, min_samples_leaf=3, n_estimators=200 ..............


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.1s remaining:    0.0s


[CV]  max_depth=10, min_samples_leaf=3, n_estimators=200, score=0.8386844166014096, total=   2.0s
[CV] max_depth=10, min_samples_leaf=3, n_estimators=200 ..............


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    4.4s remaining:    0.0s


[CV]  max_depth=10, min_samples_leaf=3, n_estimators=200, score=0.8378378378378378, total=   2.1s
[CV] max_depth=10, min_samples_leaf=3, n_estimators=500 ..............
[CV]  max_depth=10, min_samples_leaf=3, n_estimators=500, score=0.8383561643835616, total=   5.2s
[CV] max_depth=10, min_samples_leaf=3, n_estimators=500 ..............
[CV]  max_depth=10, min_samples_leaf=3, n_estimators=500, score=0.8386844166014096, total=   5.2s
[CV] max_depth=10, min_samples_leaf=3, n_estimators=500 ..............
[CV]  max_depth=10, min_samples_leaf=3, n_estimators=500, score=0.8382295338817078, total=   4.7s
[CV] max_depth=10, min_samples_leaf=3, n_estimators=700 ..............
[CV]  max_depth=10, min_samples_leaf=3, n_estimators=700, score=0.837573385518591, total=   7.0s
[CV] max_depth=10, min_samples_leaf=3, n_estimators=700 ..............
[CV]  max_depth=10, min_samples_leaf=3, n_estimators=700, score=0.8386844166014096, total=   6.9s
[CV] max_depth=10, min_samples_leaf=3, n_estimators=700 ..

[CV]  max_depth=15, min_samples_leaf=7, n_estimators=700, score=0.8383561643835616, total=   8.0s
[CV] max_depth=15, min_samples_leaf=7, n_estimators=700 ..............
[CV]  max_depth=15, min_samples_leaf=7, n_estimators=700, score=0.8382928739232577, total=   7.8s
[CV] max_depth=15, min_samples_leaf=7, n_estimators=700 ..............
[CV]  max_depth=15, min_samples_leaf=7, n_estimators=700, score=0.8382295338817078, total=   8.3s
[CV] max_depth=18, min_samples_leaf=3, n_estimators=200 ..............
[CV]  max_depth=18, min_samples_leaf=3, n_estimators=200, score=0.8379647749510764, total=   2.6s
[CV] max_depth=18, min_samples_leaf=3, n_estimators=200 ..............
[CV]  max_depth=18, min_samples_leaf=3, n_estimators=200, score=0.8386844166014096, total=   2.5s
[CV] max_depth=18, min_samples_leaf=3, n_estimators=200 ..............
[CV]  max_depth=18, min_samples_leaf=3, n_estimators=200, score=0.836662749706228, total=   2.6s
[CV] max_depth=18, min_samples_leaf=3, n_estimators=500 ..

[CV]  max_depth=20, min_samples_leaf=7, n_estimators=200, score=0.8390759592795615, total=   2.2s
[CV] max_depth=20, min_samples_leaf=7, n_estimators=200 ..............
[CV]  max_depth=20, min_samples_leaf=7, n_estimators=200, score=0.8386212299255777, total=   2.2s
[CV] max_depth=20, min_samples_leaf=7, n_estimators=500 ..............
[CV]  max_depth=20, min_samples_leaf=7, n_estimators=500, score=0.8383561643835616, total=   5.5s
[CV] max_depth=20, min_samples_leaf=7, n_estimators=500 ..............
[CV]  max_depth=20, min_samples_leaf=7, n_estimators=500, score=0.8382928739232577, total=   5.8s
[CV] max_depth=20, min_samples_leaf=7, n_estimators=500 ..............
[CV]  max_depth=20, min_samples_leaf=7, n_estimators=500, score=0.8378378378378378, total=   5.9s
[CV] max_depth=20, min_samples_leaf=7, n_estimators=700 ..............
[CV]  max_depth=20, min_samples_leaf=7, n_estimators=700, score=0.837573385518591, total=   8.1s
[CV] max_depth=20, min_samples_leaf=7, n_estimators=700 ..

[Parallel(n_jobs=1)]: Done 108 out of 108 | elapsed: 10.9min finished


GridSearchCV(cv=None, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_estimators': [200, 500, 700], 'max_depth': [10, 15, 18, 20], 'min_samples_leaf': [3, 5, 7]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=3)
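
The "36 candidates, totalling 108 fits" reported in the log follows directly from the size of the parameter grid. A minimal sketch of that arithmetic is shown below; the enumeration order here need not match the order `GridSearchCV` uses internally.

```python
from itertools import product

params = {
    "n_estimators": [200, 500, 700],
    "max_depth": [10, 15, 18, 20],
    "min_samples_leaf": [3, 5, 7],
}
n_folds = 3  # taken from the "Fitting 3 folds" log line above

# Every combination of one value per parameter is one candidate model.
candidates = [dict(zip(params, combo)) for combo in product(*params.values())]
print(len(candidates))            # number of parameter combinations
print(len(candidates) * n_folds)  # number of model fits
```

Because the number of fits is the product of all list lengths and the fold count, adding one more value to any list noticeably increases the run time.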

In [92]:
gs.best_params_

{'max_depth': 20, 'min_samples_leaf': 3, 'n_estimators': 700}

In [93]:
gs.best_estimator_

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=20, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=3, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=700, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [95]:
# Refit with the best grid-search parameters; all other arguments keep their defaults.
# max_features='sqrt' is what 'auto' meant for classifiers; newer scikit-learn releases no longer accept 'auto'.
rf1 = RandomForestClassifier(n_estimators=700, max_depth=20, min_samples_leaf=3,
                             max_features='sqrt', oob_score=True, random_state=101)

In [96]:
rf1.fit(train,y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=20, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=3, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=700, n_jobs=1,
            oob_score=True, random_state=101, verbose=0, warm_start=False)

In [97]:
rf1.oob_score_

0.8371182458888019
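
The out-of-bag score above is computed from rows that a given tree did not see in its bootstrap sample. As a rough sketch of why such rows always exist, the probability that a row is left out of a bootstrap sample of size `n` is `(1 - 1/n)^n`, which approaches `1/e`. The training size used here is an assumption derived from the 80/20 split of 9578 rows.

```python
import math

def oob_fraction(n_samples):
    """Probability that a given row is absent from a bootstrap sample of size n_samples."""
    return (1.0 - 1.0 / n_samples) ** n_samples

n_train = 7662  # assumed: 9578 rows minus a 20% test split (1916 rows)
print(f"{oob_fraction(n_train):.4f}")
print(f"{math.exp(-1):.4f}")  # large-sample limit
```

So each tree leaves out a bit more than a third of the training rows, which is what makes the out-of-bag estimate possible without a separate validation set.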

In [98]:
pred_rf1 = rf1.predict(test)

In [99]:
print(accuracy_score(y_test,pred_rf1))

0.8470772442588727


In [100]:
rf1.feature_importances_

array([0.01108376, 0.00579265, 0.01263026, 0.00330451, 0.00522013,
       0.00209029, 0.01252235, 0.11033317, 0.11604686, 0.10804861,
       0.10353267, 0.0852303 , 0.11193036, 0.11239541, 0.11199535,
       0.07135686, 0.01029957, 0.00618689])

In [101]:
sorted(list(zip(rf1.feature_importances_,train.columns)),reverse=True)

[(0.11604686235100856, 'installment'),
 (0.11239541419408874, 'revol.bal'),
 (0.11199535271947572, 'revol.util'),
 (0.11193036163345264, 'days.with.cr.line'),
 (0.11033316904054125, 'int.rate'),
 (0.10804860511115291, 'log.annual.inc'),
 (0.10353266660616904, 'dti'),
 (0.08523030305497716, 'fico'),
 (0.07135685827701271, 'inq.last.6mths'),
 (0.012630257954099463, 'purpose_debt_consolidation'),
 (0.012522352377548973, 'purpose_small_business'),
 (0.011083759436649096, 'purpose_all_other'),
 (0.010299566417078918, 'delinq.2yrs'),
 (0.006186889018194423, 'pub.rec'),
 (0.00579264721918362, 'purpose_credit_card'),
 (0.005220129735632066, 'purpose_home_improvement'),
 (0.003304513982240676, 'purpose_educational'),
 (0.0020902908714939047, 'purpose_major_purchase')]
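
Because `purpose` was split into seven indicator columns, its importance is spread across seven small entries in the list above. One possible way to compare it with the numeric features is to add those entries back together, as sketched below with the values rounded to six decimals. This is a simple illustration, not a method used in the original notebook.

```python
# Importances copied (rounded) from the sorted output above.
importances = {
    "installment": 0.116047, "revol.bal": 0.112395, "revol.util": 0.111995,
    "days.with.cr.line": 0.111930, "int.rate": 0.110333, "log.annual.inc": 0.108049,
    "dti": 0.103533, "fico": 0.085230, "inq.last.6mths": 0.071357,
    "purpose_debt_consolidation": 0.012630, "purpose_small_business": 0.012522,
    "purpose_all_other": 0.011084, "delinq.2yrs": 0.010300, "pub.rec": 0.006187,
    "purpose_credit_card": 0.005793, "purpose_home_improvement": 0.005220,
    "purpose_educational": 0.003305, "purpose_major_purchase": 0.002090,
}

# Sum the one-hot columns back into a single "purpose" entry.
grouped = {}
for name, value in importances.items():
    key = "purpose" if name.startswith("purpose_") else name
    grouped[key] = grouped.get(key, 0.0) + value

for name, value in sorted(grouped.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} {value:.4f}")
```

Even when combined, `purpose` stays below each of the nine leading numeric features in this ranking.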

In [102]:
df.corr()

Unnamed: 0,purpose_all_other,purpose_credit_card,purpose_debt_consolidation,purpose_educational,purpose_home_improvement,purpose_major_purchase,purpose_small_business,int.rate,installment,log.annual.inc,dti,fico,days.with.cr.line,revol.bal,revol.util,inq.last.6mths,delinq.2yrs,pub.rec
purpose_all_other,1.0,-0.220935,-0.475848,-0.1093,-0.150359,-0.124004,-0.149076,-0.124,-0.203103,-0.080077,-0.125825,0.067184,-0.056574,-0.067728,-0.138535,0.017795,0.016658,-0.030451
purpose_credit_card,-0.220935,1.0,-0.32685,-0.075076,-0.103279,-0.085176,-0.102397,-0.042109,0.000774,0.072942,0.084476,-0.012512,0.04622,0.072316,0.091321,-0.03364,-0.008817,0.014842
purpose_debt_consolidation,-0.475848,-0.32685,1.0,-0.161698,-0.222441,-0.183451,-0.220542,0.123607,0.161658,-0.026214,0.179149,-0.154132,-0.009318,0.005785,0.211869,-0.04424,-0.000697,0.026845
purpose_educational,-0.1093,-0.075076,-0.161698,1.0,-0.051094,-0.042138,-0.050658,-0.019618,-0.09451,-0.119799,-0.035325,-0.013012,-0.042621,-0.034743,-0.053128,0.024243,-0.002214,-0.013521
purpose_home_improvement,-0.150359,-0.103279,-0.222441,-0.051094,1.0,-0.057967,-0.069687,-0.050697,0.023024,0.116375,-0.092788,0.097474,0.068087,0.003258,-0.114449,0.043827,-0.013098,0.004704
purpose_major_purchase,-0.124004,-0.085176,-0.183451,-0.042138,-0.057967,1.0,-0.057472,-0.068978,-0.079836,-0.03102,-0.077719,0.067129,-0.020561,-0.062395,-0.108079,-0.001445,0.004085,-0.011734
purpose_small_business,-0.149076,-0.102397,-0.220542,-0.050658,-0.069687,-0.057472,1.0,0.151247,0.145654,0.09154,-0.069245,0.063292,0.034883,0.083069,-0.060962,0.042567,-0.004148,-0.005595
int.rate,-0.124,-0.042109,0.123607,-0.019618,-0.050697,-0.068978,0.151247,1.0,0.27614,0.056383,0.220006,-0.714821,-0.124022,0.092527,0.464837,0.20278,0.156079,0.098162
installment,-0.203103,0.000774,0.161658,-0.09451,0.023024,-0.079836,0.145654,0.27614,1.0,0.448102,0.050202,0.086039,0.183297,0.233625,0.081356,-0.010419,-0.004368,-0.03276
log.annual.inc,-0.080077,0.072942,-0.026214,-0.119799,0.116375,-0.03102,0.09154,0.056383,0.448102,1.0,-0.054065,0.114576,0.336896,0.37214,0.054881,0.029171,0.029203,0.016506


### LOGISTIC REGRESSION ACCURACY BASELINE -> 0.8465553235908142
### DECISION TREE ACCURACY BASELINE -> 0.7437369519832986
### DECISION TREE ACCURACY AFTER TUNING -> 0.8470772442588727
### RANDOM FOREST ACCURACY BASELINE -> 0.8382045929018789
### RANDOM FOREST ACCURACY AFTER GRIDSEARCHCV -> 0.8470772442588727
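
All of the tuned accuracies above sit close to the share of fully repaid loans in the full data set (8045 of 9578). On imbalanced data like this, accuracy alone cannot tell a useful model from one that always predicts 0. The sketch below uses hypothetical labels, not the notebook's actual predictions, to show how precision and recall for the positive class expose that case; the function name is illustrative.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision and recall for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical labels: 16 repaid loans (0) and 4 loans not fully paid (1).
y_true = [0] * 16 + [1] * 4
always_zero = [0] * 20
print(binary_metrics(y_true, always_zero))
```

A model that never flags a risky loan scores 0.8 accuracy here but zero recall, so reporting recall or a confusion matrix alongside accuracy would give a fuller picture of the models compared above.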