# Project description

Credit has played a key role in the economy for centuries and some form of credit has existed since the beginning of commerce. We'll be working with financial lending data from [Lending Club](https://www.lendingclub.com/). Lending Club is a marketplace for personal loans that matches borrowers who are seeking a loan with investors looking to lend money and make a return. You can read more about their marketplace [here](https://www.lendingclub.com/company/about-us?).

Each borrower completes a comprehensive application, providing their financial history, the reason for the loan, and more. Lending Club evaluates each borrower's credit score using historical data and their own data science process to assign an interest rate to the borrower. The interest rate determines how much the borrower has to pay back in addition to the requested loan amount. You can read more about the interest rate that Lending Club assigns [here](https://www.lendingclub.com/loans/personal-loans/rates-fees). Lending Club also tries to verify the information the borrower provides, but it can't verify all of it (usually for regulatory reasons).

A higher interest rate means that the borrower is riskier and less likely to pay back the loan, while a lower interest rate means that the borrower has a good credit history and is more likely to pay back the loan. The interest rates range from 5.32% all the way to 30.99%, and each borrower is given a [grade](https://www.lendingclub.com/investing/investor-education/interest-rates-and-fees) according to the interest rate they were assigned. If the borrower accepts the interest rate, then the loan is listed on the Lending Club marketplace.

Investors are primarily interested in receiving a return on their investments. Approved loans are listed on the Lending Club website, where qualified investors can browse recently approved loans, the borrower's credit score, the purpose for the loan, and other information from the application. Once they're ready to back a loan, they select the amount of money they want to fund. Once a loan's requested amount is fully funded, the borrower receives the money they requested minus the [origination](https://help.lendingclub.com/hc/en-us/articles/214463677) fee that Lending Club charges.

The borrower will make monthly payments back to Lending Club either over 36 months or over 60 months. Lending Club redistributes these payments to the investors. This means that investors don't have to wait until the full amount is paid off before they see a return in money. If a loan is fully paid off on time, the investors make a return which corresponds to the interest rate the borrower had to pay in addition to the requested amount. Many loans aren't completely paid off on time and some borrowers default on the loan.
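As a rough illustration of how a fixed monthly payment relates to the loan amount, interest rate, and term, the sketch below applies the standard amortization formula. It is an assumption that Lending Club computes installments this way, and the function name `monthly_installment` is hypothetical. For a 5,000 loan at 10.65% over 36 months, the result agrees to within a cent with the `installment` value of 162.87 listed for such a loan in the dataset preview.

```python
def monthly_installment(principal, annual_rate_pct, n_months):
    """Fixed monthly payment under the standard amortization formula.

    Assumes interest compounds monthly at annual_rate_pct / 12.
    """
    r = annual_rate_pct / 100 / 12  # monthly interest rate
    if r == 0:
        return principal / n_months
    return principal * r / (1 - (1 + r) ** -n_months)


payment = monthly_installment(5000, 10.65, 36)
total_repaid = payment * 36
print(f"monthly payment: {payment:.2f}, total repaid: {total_repaid:.2f}")
```

The difference between `total_repaid` and the principal is the interest that is redistributed to investors if the loan is paid off in full.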

While Lending Club has to be extremely savvy and rigorous with their credit modelling, investors on Lending Club need to be equally savvy about determining which loans are more likely to be paid off. At first, you may wonder why investors put money into anything but low-interest loans. The incentive investors have to back higher-interest loans is, well, the higher interest! If investors believe the borrower can pay back the loan, even if they have a weak financial history, then investors can make more money through the larger additional amount the borrower has to pay.

Most investors use a portfolio strategy to invest small amounts in many loans, with healthy mixes of low-, medium-, and high-interest loans. In this project, we'll focus on the mindset of a conservative investor who only wants to invest in the loans that have a good chance of being paid off on time. To do that, we'll need to first understand the features in the dataset and then experiment with building machine learning models that reliably predict if a loan will be paid off or not.


Lending Club periodically releases data for all of the approved and declined loan applications on their [website](https://www.lendingclub.com/investing/peer-to-peer).

In this project, we'll focus on approved loans data from 2007 to 2011, since a good number of the loans have already finished. You'll find the dataset in 'data/loans_2007.csv'.

You'll also find a [data dictionary](data/LCDataDictionary.xlsx) (in Excel format) which contains information on the different column names.


# Problem Statement

We would like to build a machine learning model that can accurately predict if a borrower will pay off their loan on time or not.

# Instructions

1. Read and explore the dataset.
2. Perform data cleaning tasks that are useful to model our problem.
3. Define what features we want to use and which column represents the target column we want to predict. 
4. Perform necessary data preparation to start training machine learning models.
5. 
    a. Make predictions about whether or not a loan will be paid off on time.

    b. Our objective is to fund enough loans that are paid off on time to offset our losses from loans that aren't paid off. An error metric will help us determine if our algorithm will make us money or lose us money. Select an error metric that will help us figure out when our model is performing well, and when it's performing poorly.
6. Evaluate your model.
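
To see why the choice of error metric matters here, consider a minimal sketch on made-up labels (not the Lending Club data) in which 86% of loans are paid off. A "model" that approves every loan scores high accuracy yet catches none of the bad loans, which is why the true positive rate and false positive rate are more informative than accuracy alone. The helper name `rates` is hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels: 0 = paid off, 1 = not paid off (86 vs 14 loans).
y_true = np.array([0] * 86 + [1] * 14)
y_all_paid = np.zeros_like(y_true)  # "model" that predicts every loan is paid off


def rates(y_true, y_pred):
    """Return (true positive rate, false positive rate)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tp / (tp + fn), fp / (fp + tn)


acc = accuracy_score(y_true, y_all_paid)
tpr, fpr = rates(y_true, y_all_paid)
print(f"accuracy={acc:.2f}, TPR={tpr:.2f}, FPR={fpr:.2f}")
```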

In [85]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from imblearn.over_sampling import SMOTE
warnings.filterwarnings('ignore')

In [7]:
data = pd.read_csv('D:/Data Camp/Mini-project II/mini-project2/mini-project2/data/loans_2007.csv')
data.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,last_pymnt_amnt,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,1077501,1296599.0,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,...,171.62,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,1077430,1314167.0,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,...,119.66,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,1077175,1313524.0,2400.0,2400.0,2400.0,36 months,15.96%,84.33,C,C5,...,649.91,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,1076863,1277178.0,10000.0,10000.0,10000.0,36 months,13.49%,339.31,C,C1,...,357.48,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
4,1075358,1311748.0,3000.0,3000.0,3000.0,60 months,12.69%,67.79,B,B5,...,67.79,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0


In [8]:
data.columns

Index(['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
       'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'emp_title',
       'emp_length', 'home_ownership', 'annual_inc', 'verification_status',
       'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'title', 'zip_code',
       'addr_state', 'dti', 'delinq_2yrs', 'earliest_cr_line',
       'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'revol_util',
       'total_acc', 'initial_list_status', 'out_prncp', 'out_prncp_inv',
       'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int',
       'total_rec_late_fee', 'recoveries', 'collection_recovery_fee',
       'last_pymnt_d', 'last_pymnt_amnt', 'last_credit_pull_d',
       'collections_12_mths_ex_med', 'policy_code', 'application_type',
       'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt',
       'pub_rec_bankruptcies', 'tax_liens'],
      dtype='object')

In [9]:
data.loan_status.value_counts()

Fully Paid                                             33136
Charged Off                                             5634
Does not meet the credit policy. Status:Fully Paid      1988
Current                                                  961
Does not meet the credit policy. Status:Charged Off      761
Late (31-120 days)                                        24
In Grace Period                                           20
Late (16-30 days)                                          8
Default                                                    3
Name: loan_status, dtype: int64

In [10]:
data = data[(data['loan_status'] == 'Fully Paid') | (data['loan_status'] == 'Charged Off')]

In [11]:
data.loan_status.value_counts()

Fully Paid     33136
Charged Off     5634
Name: loan_status, dtype: int64

In [12]:
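# Encode the target: 0 = Fully Paid, 1 = Charged Off.
# Series.map avoids chained assignment on a filtered frame (SettingWithCopyWarning).
data = data.assign(loan_status=data['loan_status'].map({'Fully Paid': 0, 'Charged Off': 1}))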

In [13]:
data.loan_status.astype(float)

0        0.0
1        1.0
2        0.0
3        0.0
5        0.0
        ... 
39781    0.0
39782    0.0
39783    0.0
39784    0.0
39785    0.0
Name: loan_status, Length: 38770, dtype: float64

In [14]:
len(data.columns)

52

In [15]:

# Lists of the columns we will keep as categorical or numerical features

categorical_features=[]
numerical_features=[]

In [16]:
data.columns[:10]

Index(['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
       'term', 'int_rate', 'installment', 'grade', 'sub_grade'],
      dtype='object')

In [17]:

categorical_features.append('term')
numerical_features.append('loan_amnt')
numerical_features.append('installment')
numerical_features.append('int_rate')

In [18]:
data.columns[10:20]

Index(['emp_title', 'emp_length', 'home_ownership', 'annual_inc',
       'verification_status', 'issue_d', 'loan_status', 'pymnt_plan',
       'purpose', 'title'],
      dtype='object')

In [19]:
categorical_features.append('home_ownership')
categorical_features.append('purpose')
categorical_features.append('verification_status')
numerical_features.append('emp_length')
numerical_features.append('annual_inc')

In [20]:
data.columns[20:30]

Index(['zip_code', 'addr_state', 'dti', 'delinq_2yrs', 'earliest_cr_line',
       'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'revol_util'],
      dtype='object')

In [21]:
numerical_features.append('pub_rec')
numerical_features.append('earliest_cr_line')
numerical_features.append('delinq_2yrs')
numerical_features.append('revol_util')
numerical_features.append('revol_bal')
numerical_features.append('open_acc')
numerical_features.append('inq_last_6mths')
numerical_features.append('dti')

In [22]:
data.columns[30:40]

Index(['total_acc', 'initial_list_status', 'out_prncp', 'out_prncp_inv',
       'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int',
       'total_rec_late_fee', 'recoveries'],
      dtype='object')

In [24]:
numerical_features.append('total_acc')

In [25]:
data.columns[41:]

Index(['last_pymnt_d', 'last_pymnt_amnt', 'last_credit_pull_d',
       'collections_12_mths_ex_med', 'policy_code', 'application_type',
       'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt',
       'pub_rec_bankruptcies', 'tax_liens'],
      dtype='object')

In [26]:

numerical_features.append('tax_liens')
numerical_features.append('last_credit_pull_d')
numerical_features.append('pub_rec_bankruptcies')

In [27]:
print('We have', len(categorical_features),'categorical_features and',len(numerical_features),'numerical_features')

We have 4 categorical_features and 17 numerical_features


In [28]:
categorical_features

['term', 'home_ownership', 'purpose', 'verification_status']

In [29]:
numerical_features

['loan_amnt',
 'installment',
 'int_rate',
 'emp_length',
 'annual_inc',
 'pub_rec',
 'earliest_cr_line',
 'delinq_2yrs',
 'revol_util',
 'revol_bal',
 'open_acc',
 'inq_last_6mths',
 'dti',
 'total_acc',
 'tax_liens',
 'last_credit_pull_d',
 'pub_rec_bankruptcies']

In [30]:
#Cleaning 

data.isna().sum()

id                               0
member_id                        0
loan_amnt                        0
funded_amnt                      0
funded_amnt_inv                  0
term                             0
int_rate                         0
installment                      0
grade                            0
sub_grade                        0
emp_title                     2396
emp_length                    1036
home_ownership                   0
annual_inc                       0
verification_status              0
issue_d                          0
loan_status                      0
pymnt_plan                       0
purpose                          0
title                           11
zip_code                         0
addr_state                       0
dti                              0
delinq_2yrs                      0
earliest_cr_line                 0
inq_last_6mths                   0
open_acc                         0
pub_rec                          0
revol_bal           

In [31]:
data.dropna(subset=['tax_liens','revol_util','last_credit_pull_d'],inplace=True)

In [32]:
data.loan_status[data.emp_length.isna()==True].value_counts()

0    806
1    227
Name: loan_status, dtype: int64

In [33]:
data.loan_status[data.pub_rec_bankruptcies.isna()==True].value_counts()

0    540
1    117
Name: loan_status, dtype: int64

In [34]:
data.dropna(subset=['pub_rec_bankruptcies','emp_length'],inplace=True)

In [35]:
# drop=True discards the old index instead of adding it as an extra 'index' column
data.reset_index(drop=True, inplace=True)

In [37]:
#dataframe for categorical data
data_categorical=pd.DataFrame()
for col in categorical_features:
    data_categorical=pd.concat([ data_categorical, data[col]], axis= 1)

data_categorical.head()

Unnamed: 0,term,home_ownership,purpose,verification_status
0,36 months,RENT,credit_card,Verified
1,60 months,RENT,car,Source Verified
2,36 months,RENT,small_business,Not Verified
3,36 months,RENT,other,Source Verified
4,36 months,RENT,wedding,Source Verified


In [38]:
#dataframe for numerical data
data_numerical=pd.DataFrame()
for col in numerical_features:
    data_numerical=pd.concat([ data_numerical, data[col]], axis= 1)

data_numerical.head()

Unnamed: 0,loan_amnt,installment,int_rate,emp_length,annual_inc,pub_rec,earliest_cr_line,delinq_2yrs,revol_util,revol_bal,open_acc,inq_last_6mths,dti,total_acc,tax_liens,last_credit_pull_d,pub_rec_bankruptcies
0,5000.0,162.87,10.65%,10+ years,24000.0,0.0,Jan-1985,0.0,83.7%,13648.0,3.0,1.0,27.65,9.0,0.0,Jun-2016,0.0
1,2500.0,59.83,15.27%,< 1 year,30000.0,0.0,Apr-1999,0.0,9.4%,1687.0,3.0,5.0,1.0,4.0,0.0,Sep-2013,0.0
2,2400.0,84.33,15.96%,10+ years,12252.0,0.0,Nov-2001,0.0,98.5%,2956.0,2.0,2.0,8.72,10.0,0.0,Jun-2016,0.0
3,10000.0,339.31,13.49%,10+ years,49200.0,0.0,Feb-1996,0.0,21%,5598.0,10.0,1.0,20.0,37.0,0.0,Apr-2016,0.0
4,5000.0,156.46,7.90%,3 years,36000.0,0.0,Nov-2004,0.0,28.3%,7963.0,9.0,3.0,11.2,12.0,0.0,Jan-2016,0.0


In [39]:
data_numerical.isna().sum()

loan_amnt               0
installment             0
int_rate                0
emp_length              0
annual_inc              0
pub_rec                 0
earliest_cr_line        0
delinq_2yrs             0
revol_util              0
revol_bal               0
open_acc                0
inq_last_6mths          0
dti                     0
total_acc               0
tax_liens               0
last_credit_pull_d      0
pub_rec_bankruptcies    0
dtype: int64

In [40]:
data_numerical['revol_util']=data_numerical['revol_util'].str.strip('%').astype(float)

In [41]:
data_numerical['int_rate']=data_numerical['int_rate'].str.strip('%').astype(float)

In [42]:
# Convert month-year strings (e.g. 'Jan-1985') to Unix timestamps in seconds
for col in ['last_credit_pull_d', 'earliest_cr_line']:
    data_numerical[col] = pd.DatetimeIndex(data_numerical[col]).astype(np.int64)*1e-9
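
A more explicit variant of this conversion, sketched on a few made-up values in the same `Mon-YYYY` format, parses the strings with a stated format and derives seconds since the Unix epoch by subtraction, so it does not rely on the integer representation of the datetime type. The helper name `month_year_to_epoch_seconds` is hypothetical, and the sketch assumes English month abbreviations.

```python
import pandas as pd


def month_year_to_epoch_seconds(s):
    """Convert strings like 'Jan-1985' to seconds since 1970-01-01."""
    parsed = pd.to_datetime(s, format="%b-%Y")
    return (parsed - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)


sample = pd.Series(["Jan-1985", "Apr-1999", "Jan-1970"])
epoch_seconds = month_year_to_epoch_seconds(sample)
print(epoch_seconds.tolist())
```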

In [43]:
data_numerical.emp_length.unique()

array(['10+ years', '< 1 year', '3 years', '8 years', '9 years',
       '4 years', '5 years', '1 year', '6 years', '2 years', '7 years'],
      dtype=object)

In [44]:
mapping_dict = {
    'emp_length': {
        '10+ years': 10,
        '9 years': 9,
        '8 years': 8,
        '7 years': 7,
        '6 years': 6,
        '5 years': 5,
        '4 years': 4,
        '3 years': 3,
        '2 years': 2,
        '1 year': 1,
        '< 1 year': 0,
    }
}

data_numerical.replace(mapping_dict, inplace=True)
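
An equivalent way to apply this kind of mapping, shown here on a toy Series, is `Series.map`, which leaves any label missing from the dictionary as `NaN` so that unexpected categories are easy to spot. The variable names below are illustrative only.

```python
import pandas as pd

# Same label-to-years mapping as above, built programmatically.
emp_length_years = {"< 1 year": 0, "1 year": 1}
emp_length_years.update({f"{n} years": n for n in range(2, 10)})
emp_length_years["10+ years"] = 10

toy = pd.Series(["10+ years", "< 1 year", "3 years", "n/a"])
mapped = toy.map(emp_length_years)  # labels not in the dict become NaN
print(mapped.tolist())
```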

In [45]:
data_numerical.head()

Unnamed: 0,loan_amnt,installment,int_rate,emp_length,annual_inc,pub_rec,earliest_cr_line,delinq_2yrs,revol_util,revol_bal,open_acc,inq_last_6mths,dti,total_acc,tax_liens,last_credit_pull_d,pub_rec_bankruptcies
0,5000.0,162.87,10.65,10,24000.0,0.0,473385600.0,0.0,83.7,13648.0,3.0,1.0,27.65,9.0,0.0,1464739000.0,0.0
1,2500.0,59.83,15.27,0,30000.0,0.0,922924800.0,0.0,9.4,1687.0,3.0,5.0,1.0,4.0,0.0,1377994000.0,0.0
2,2400.0,84.33,15.96,10,12252.0,0.0,1004573000.0,0.0,98.5,2956.0,2.0,2.0,8.72,10.0,0.0,1464739000.0,0.0
3,10000.0,339.31,13.49,10,49200.0,0.0,823132800.0,0.0,21.0,5598.0,10.0,1.0,20.0,37.0,0.0,1459469000.0,0.0
4,5000.0,156.46,7.9,3,36000.0,0.0,1099267000.0,0.0,28.3,7963.0,9.0,3.0,11.2,12.0,0.0,1451606000.0,0.0


In [46]:
# astype returns a new object, so assign the result back to keep the conversion
data_numerical = data_numerical.astype(float)

In [47]:
data_numerical.describe()

Unnamed: 0,loan_amnt,installment,int_rate,emp_length,annual_inc,pub_rec,earliest_cr_line,delinq_2yrs,revol_util,revol_bal,open_acc,inq_last_6mths,dti,total_acc,tax_liens,last_credit_pull_d,pub_rec_bankruptcies
count,36989.0,36989.0,36989.0,36989.0,36989.0,36989.0,36989.0,36989.0,36989.0,36989.0,36989.0,36989.0,36989.0,36989.0,36989.0,36989.0,36989.0
mean,11172.230258,325.073393,11.999874,4.967477,69456.09,0.054178,859815800.0,0.146152,48.999896,13381.643462,9.305848,0.869502,13.33784,22.159345,0.0,1411559000.0,0.04185
std,7383.9922,208.844669,3.707859,3.552827,63964.66,0.235504,209338300.0,0.491792,28.310762,15844.117774,4.376308,1.067945,6.65307,11.402571,0.0,54088940.0,0.200924
min,500.0,16.08,5.42,0.0,4000.0,0.0,-757382400.0,0.0,0.0,0.0,2.0,0.0,0.0,2.0,0.0,1185926000.0,0.0
25%,5500.0,167.73,8.94,2.0,41200.0,0.0,757382400.0,0.0,25.6,3730.0,6.0,0.0,8.22,14.0,0.0,1370045000.0,0.0
50%,10000.0,280.61,11.86,4.0,60000.0,0.0,899251200.0,0.0,49.5,8885.0,9.0,1.0,13.43,20.0,0.0,1425168000.0,0.0
75%,15000.0,429.99,14.54,9.0,83000.0,0.0,1001894000.0,0.0,72.5,17043.0,12.0,1.0,18.6,29.0,0.0,1464739000.0,0.0
max,35000.0,1305.19,24.59,10.0,6000000.0,4.0,1225498000.0,11.0,99.9,149588.0,44.0,8.0,29.99,90.0,0.0,1464739000.0,2.0


In [48]:
data_categorical.isna().sum()

term                   0
home_ownership         0
purpose                0
verification_status    0
dtype: int64

In [49]:
data_categorical.info()

<class 'pandas.core.frame.DataFrame'>
Index: 36989 entries, 0 to 36988
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   term                 36989 non-null  object
 1   home_ownership       36989 non-null  object
 2   purpose              36989 non-null  object
 3   verification_status  36989 non-null  object
dtypes: object(4)
memory usage: 1.4+ MB


In [50]:
# tax_liens is 0 for every remaining row (std is 0 in describe()), so it carries no information
data_numerical.drop(columns='tax_liens', inplace=True)

In [51]:
data_categorical.head()

Unnamed: 0,term,home_ownership,purpose,verification_status
0,36 months,RENT,credit_card,Verified
1,60 months,RENT,car,Source Verified
2,36 months,RENT,small_business,Not Verified
3,36 months,RENT,other,Source Verified
4,36 months,RENT,wedding,Source Verified


In [52]:
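# One-hot encode each categorical column, dropping the last level as the
# reference category so the dummy columns are not perfectly collinear
for col in data_categorical.columns:
    dummies = pd.get_dummies(data_categorical[col])
    dummies.drop(dummies.columns[-1],axis=1,inplace=True)
    data_categorical = pd.concat([data_categorical, dummies], axis=1)
    data_categorical.drop(col, axis=1, inplace=True)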
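
For comparison, `pd.get_dummies` can encode every categorical column in one call. The sketch below uses a toy frame; note that `drop_first=True` drops the first level of each column rather than the last, and column names are prefixed with the source column, so the result is not identical to the loop above.

```python
import pandas as pd

toy = pd.DataFrame({
    "term": ["36 months", "60 months", "36 months"],
    "home_ownership": ["RENT", "OWN", "MORTGAGE"],
})

# One call encodes both columns; dtype=int yields 0/1 integers.
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

Prefixed names also avoid relying on case alone to tell apart levels such as `OTHER` (home ownership) and `other` (purpose).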

In [54]:
data_categorical.head()

Unnamed: 0,36 months,MORTGAGE,OTHER,OWN,car,credit_card,debt_consolidation,educational,home_improvement,house,major_purchase,medical,moving,other,renewable_energy,small_business,vacation,Not Verified,Source Verified
0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0
3,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1
4,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1


In [55]:

# astype returns a new object, so assign the result back to keep the conversion
data_categorical = data_categorical.astype(int)

In [61]:
#Scaling

scaler = StandardScaler()
scaled_num = pd.DataFrame(scaler.fit_transform(data_numerical),columns=data_numerical.columns)

# Combine numerical and categorical features into one DataFrame containing all features used in the modelling stage

X=pd.concat([scaled_num,data_categorical],axis=1)

y=data.loan_status
print(X.shape, y.shape)
X.head()

(36989, 35) (36989,)


Unnamed: 0,loan_amnt,installment,int_rate,emp_length,annual_inc,pub_rec,earliest_cr_line,delinq_2yrs,revol_util,revol_bal,...,house,major_purchase,medical,moving,other,renewable_energy,small_business,vacation,Not Verified,Source Verified
0,-0.835905,-0.776681,-0.364063,1.416503,-0.710653,-0.230056,-1.845986,-0.297185,1.225703,0.016811,...,0,0,0,0,0,0,0,0,0,0
1,-1.17448,-1.270068,0.881956,-1.398195,-0.61685,-0.230056,0.301473,-0.297185,-1.398777,-0.738116,...,0,0,0,0,0,0,0,0,0,1
2,-1.188022,-1.152755,1.06805,1.416503,-0.89432,-0.230056,0.691507,-0.297185,1.748479,-0.658022,...,0,0,0,0,0,0,1,0,1,0
3,-0.158755,0.068169,0.401888,1.416503,-0.31668,-0.230056,-0.175235,-0.297185,-0.989033,-0.491271,...,0,0,0,0,1,0,0,0,0,1
4,-0.835905,-0.807374,-1.10574,-0.553785,-0.523047,-0.230056,1.143865,-0.297185,-0.731177,-0.342002,...,0,0,0,0,0,0,0,0,0,1


In [71]:
pip install imblearn

Collecting imblearn
  Downloading imblearn-0.0-py2.py3-none-any.whl (1.9 kB)
Collecting imbalanced-learn
  Downloading imbalanced_learn-0.8.1-py3-none-any.whl (189 kB)
Installing collected packages: imbalanced-learn, imblearn
Successfully installed imbalanced-learn-0.8.1 imblearn-0.0
Note: you may need to restart the kernel to use updated packages.
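
The cell that creates `X_train`, `X_test`, `y_train`, and `y_test` is not shown in this notebook. The training class counts printed further down (25,392 + 4,199 = 29,591 of 36,989 rows) are consistent with an 80/20 split, so a minimal sketch under that assumption could look like the following. The synthetic data, `random_state`, and `stratify` argument are assumptions for illustration, not the original settings.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix X and target y built above.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f1", "f2", "f3"])
y = pd.Series((rng.random(1000) < 0.14).astype(int), name="loan_status")

# Hold out 20% of the rows; stratify keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```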


In [86]:
# Report the true positive rate (TPR) and false positive rate (FPR) of our model.
# Positive class = 1 (Charged Off), negative class = 0 (Fully Paid).

def perf(y_test, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    fpr = round(fp / (fp + tn), 2) * 100  # share of fully paid loans wrongly flagged
    tpr = round(tp / (tp + fn), 2) * 100  # share of charged-off loans correctly flagged

    print(tpr, '% of the positives are correctly identified, and ', fpr, '% of the negatives are incorrectly flagged as positive.')

In [76]:

print(y_train.value_counts())  # original class distribution
y_train_for_smote=[i for i in y_train]
X_train_resampled, y_train_resampled = SMOTE().fit_resample(X_train, y_train_for_smote) 
print(pd.Series(y_train_resampled).value_counts())  # class distribution after SMOTE oversampling
y_train_resampled=list(y_train_resampled)
y_train=list(y_train)

0    25392
1     4199
Name: loan_status, dtype: int64
0    25392
1    25392
dtype: int64
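
Conceptually, SMOTE creates each synthetic minority sample by picking a minority point, choosing one of its nearest minority-class neighbours, and interpolating at a random position on the segment between them. The sketch below illustrates only that interpolation step with NumPy on made-up points; it is not the `imblearn` implementation, and the function name `smote_like_samples` is hypothetical.

```python
import numpy as np


def smote_like_samples(X_min, n_new, k=2, seed=0):
    """Generate n_new points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise distances between minority points; ignore self-distance.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]  # k nearest per point

    base = rng.integers(0, len(X_min), size=n_new)  # random base points
    pick = rng.integers(0, k, size=n_new)           # which neighbour to use
    nb = neighbours[base, pick]
    gap = rng.random((n_new, 1))                    # position on the segment
    return X_min[base] + gap * (X_min[nb] - X_min[base])


X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(X_min, n_new=6)
print(synthetic.shape)
```

Because every synthetic point lies between two existing minority points, oversampling adds no information beyond the original minority samples; it only rebalances the classes seen during training.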


In [77]:
X_train.head()

Unnamed: 0,loan_amnt,installment,int_rate,emp_length,annual_inc,pub_rec,earliest_cr_line,delinq_2yrs,revol_util,revol_bal,...,house,major_purchase,medical,moving,other,renewable_energy,small_business,vacation,Not Verified,Source Verified
30450,-0.944249,-0.930912,-1.210924,-0.835255,-0.460512,-0.230056,0.967627,-0.297185,0.462731,-0.730921,...,0,1,0,0,0,0,0,0,1,0
5820,-0.90362,-0.882311,-1.10574,-0.835255,-0.226004,-0.230056,0.540859,-0.297185,-1.374051,-0.758187,...,0,0,0,0,0,0,1,0,0,1
18938,0.518395,0.748589,-0.639158,0.853564,-0.132264,-0.230056,-1.055186,-0.297185,-0.639338,2.847932,...,0,0,0,0,1,0,0,0,0,1
16295,-0.90362,-0.833805,0.183431,-0.553785,0.712026,-0.230056,-1.959488,11.903252,-0.628741,-0.478458,...,0,1,0,0,0,0,0,0,0,0
28941,-0.429615,-0.257005,0.399191,-0.272316,-0.147835,-0.230056,1.369218,-0.297185,-0.960775,-0.709714,...,0,0,0,0,0,0,0,0,0,0


In [78]:
model=LogisticRegression()
model.fit(X_train_resampled,y_train_resampled)

LogisticRegression()

In [94]:
training_preds = model.predict(X_train_resampled)
X_test2=np.array(X_test)
y_pred = model.predict(X_test2)
training_accuracy = accuracy_score(y_train_resampled, training_preds)
val_accuracy = accuracy_score(y_test, y_pred)

print("Training Accuracy: {:.4}%".format(training_accuracy * 100))
print("Validation accuracy: {:.4}%".format(val_accuracy * 100))
perf(y_test,y_pred)

Training Accuracy: 76.18%
Validation accuracy: 74.91%
40.0 % of the positives are correctly identified, and  19.0 % of the negatives are incorrectly flagged as positive.


### Conclusion


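The logistic regression model flags 40% of the loans that end up charged off (the true positive rate), but it also misclassifies 19% of the loans that are fully paid (the false positive rate). An investor who follows the model would therefore fund about 81% of the loans that are paid off, while still funding about 60% of the loans that are not.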

Interest rate, date of last credit pull, and length of the loan (term) are the most important features in determining whether a borrower will default.
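
One way to back up a claim about feature importance with a logistic regression is to rank features by the absolute value of their coefficients, which is meaningful when the inputs are on a comparable scale, as they are after standardization. The sketch below uses synthetic data with made-up feature names; it is not the notebook's fitted model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 3)), columns=["signal", "noise_a", "noise_b"])
# Only "signal" drives the label in this toy setup.
y = (X["signal"] + 0.3 * rng.normal(size=500) > 0).astype(int)

model = LogisticRegression().fit(X, y)
# Rank features by coefficient magnitude, largest first.
importance = (
    pd.Series(model.coef_[0], index=X.columns)
    .abs()
    .sort_values(ascending=False)
)
print(importance.index[0])
```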