## Data Dictionary

The dataset consists of the following fields:
* Loan ID: A unique identifier for the loan.
* Customer ID: A unique identifier for the customer. Customers may have more than one loan.
* Loan Status: A categorical variable indicating if the loan was paid back or defaulted.
* Current Loan Amount: The loan amount that was either fully paid off or defaulted on.
* Term: A categorical variable indicating whether the loan is short-term or long-term.
* Credit Score: A value between 0 and 800 indicating the riskiness of the borrower's credit history.
* Years in current job: A categorical variable indicating how many years the customer has been in their current job.
* Home Ownership: A categorical variable indicating home ownership. Values are "Rent", "Home Mortgage", and "Own". If the value is "Own", the customer is a homeowner with no mortgage.
* Annual Income: The customer's annual income.
* Purpose: A description of the purpose of the loan.
* Monthly Debt: The customer's monthly payment for their existing loans.
* Years of Credit History: The years since the first entry in the customer's credit history.
* Months since last delinquent: Months since the customer's last delinquent loan payment.
* Number of Open Accounts: The total number of open credit cards.
* Number of Credit Problems: The number of credit problems in the customer records.
* Current Credit Balance: The current total debt for the customer.
* Maximum Open Credit: The maximum credit limit for all credit sources.
* Bankruptcies: The number of bankruptcies.
* Tax Liens: The number of tax liens.



In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import numpy as np
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', 100)

#Regression
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge,Lasso
from sklearn.linear_model import ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import ExtraTreeRegressor
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error,r2_score,mean_absolute_error
from sklearn.preprocessing import normalize,scale

#Classification
from sklearn.naive_bayes import MultinomialNB,GaussianNB,BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report,accuracy_score,confusion_matrix,precision_score,recall_score,f1_score


In [2]:
def regression_(x,y):
    
    lr=LinearRegression()
    r=Ridge()
    l=Lasso()
    e=ElasticNet()
    kn=KNeighborsRegressor()
    et=ExtraTreeRegressor()
    gb=GradientBoostingRegressor()
    dt=DecisionTreeRegressor()
    xgb=XGBRegressor()
       
    algos=[lr,r,l,e,kn,et,gb,dt,xgb]
    algos_names=['LinearRegressor','Ridge','Lasso','ElasticNet','KNeighbors','ExtraTree','GradientBoosting','DecisionTree','XGB']
    
    x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.2,random_state=42)
    
    r_score=[]
    mse=[]
    mae=[]
    
    result=pd.DataFrame(columns=['R_square','MSE','MAE'],index=algos_names)
    
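    # NOTE: each model is fit and scored on the full data set, so the metrics
    # below are in-sample; the train/test split above is not used here.
    for algo in algos:
        pred=algo.fit(x,y).predict(x)
        r_score.append(r2_score(y,pred))
        # Square root of the mean squared error, i.e. RMSE (stored in the 'MSE' column).
        mse.append(mean_squared_error(y,pred)**.5)
        mae.append(mean_absolute_error(y,pred))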
    
    
    result.R_square=r_score
    result.MSE=mse
    result.MAE=mae
    
    return result.sort_values('R_square',ascending=False)

In [3]:
def classification_(train,y):
    
    
    g=GaussianNB()
    b=BernoulliNB()
    k=KNeighborsClassifier()
    svc=SVC()
    d=DecisionTreeClassifier()
    log=LogisticRegression()
    gbc=GradientBoostingClassifier()
    mn=MultinomialNB()
    rf=RandomForestClassifier()
    ab=AdaBoostClassifier()

    x=train
    y=y
    x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.2,random_state=42)
    algos=[g,b,k,svc,d,log,gbc,mn,rf,ab]
    algos_name=['Gaussian','Bernoulli','KNeigbors','SVC','DecisionTree','LogisticRegr','GradientBoosting','Multinominal','RandomForest','AdaBoost']
    
    accuracy = []
    precision = []
    recall = []
    f1 = []
   
    result=pd.DataFrame(columns=['AccuracyScore','PrecisionScore','RecallScore','f1_Score'],index=algos_name)
    
    for i in algos:
        
        predict=i.fit(x_train,y_train).predict(x_test)
        
        accuracy.append(accuracy_score(y_test,predict))
        precision.append(precision_score(y_test,predict))
        recall.append(recall_score(y_test,predict))
        f1.append(f1_score(y_test,predict))
        

    
    
    result.AccuracyScore=accuracy
    result.PrecisionScore=precision
    result.RecallScore=recall
    result.f1_Score=f1
    
    
    return result.sort_values('f1_Score',ascending=False)

In [4]:
df = pd.read_csv("LoansTrainingSet.csv")
df.drop_duplicates(inplace=True)
df

| | Loan ID | Customer ID | Loan Status | Current Loan Amount | Term | Credit Score | Years in current job | Home Ownership | Annual Income | Purpose | Monthly Debt | Years of Credit History | Months since last delinquent | Number of Open Accounts | Number of Credit Problems | Current Credit Balance | Maximum Open Credit | Bankruptcies | Tax Liens |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 000025bb-5694-4cff-b17d-192b1a98ba44 | 5ebc8bb1-5eb9-4404-b11b-a6eebc401a19 | Fully Paid | 11520 | Short Term | 741.0 | 10+ years | Home Mortgage | 33694.0 | Debt Consolidation | $584.03 | 12.3 | 41.0 | 10 | 0 | 6760 | 16056 | 0.0 | 0.0 |
| 1 | 00002c49-3a29-4bd4-8f67-c8f8fbc1048c | 927b388d-2e01-423f-a8dc-f7e42d668f46 | Fully Paid | 3441 | Short Term | 734.0 | 4 years | Home Mortgage | 42269.0 | other | $1,106.04 | 26.3 | NaN | 17 | 0 | 6262 | 19149 | 0.0 | 0.0 |
| 2 | 00002d89-27f3-409b-aa76-90834f359a65 | defce609-c631-447d-aad6-1270615e89c4 | Fully Paid | 21029 | Short Term | 747.0 | 10+ years | Home Mortgage | 90126.0 | Debt Consolidation | $1,321.85 | 28.8 | NaN | 5 | 0 | 20967 | 28335 | 0.0 | 0.0 |
| 3 | 00005222-b4d8-45a4-ad8c-186057e24233 | 070bcecb-aae7-4485-a26a-e0403e7bb6c5 | Fully Paid | 18743 | Short Term | 747.0 | 10+ years | Own Home | 38072.0 | Debt Consolidation | $751.92 | 26.2 | NaN | 9 | 0 | 22529 | 43915 | 0.0 | 0.0 |
| 4 | 0000757f-a121-41ed-b17b-162e76647c1f | dde79588-12f0-4811-bab0-e2b07f633fcd | Fully Paid | 11731 | Short Term | 746.0 | 4 years | Rent | 50025.0 | Debt Consolidation | $355.18 | 11.5 | NaN | 12 | 0 | 17391 | 37081 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 256979 | fffef5b7-be99-4666-ac70-2a397d2ee435 | 7211a8e3-cba4-4132-b939-222eed8a662c | Fully Paid | 3911 | Short Term | NaN | 2 years | Rent | NaN | Debt Consolidation | $1,706.58 | 19.9 | NaN | 16 | 0 | 43992 | 44080 | 0.0 | 0.0 |
| 256980 | ffffca93-aa8c-4123-b8ff-7852f6df889a | 616fef0c-8f09-4327-9b5c-48fcfaa52934 | Fully Paid | 5078 | Short Term | 737.0 | 10+ years | Own Home | 77186.0 | Debt Consolidation | $1,376.47 | 19.1 | 47.0 | 9 | 0 | 1717 | 9758 | 0.0 | 0.0 |
| 256981 | ffffcb2e-e48e-4d2c-a0d6-ed6bce5bfdbe | 971a6682-183b-4a52-8bce-1d3429ade295 | Charged Off | 12116 | Short Term | 7460.0 | 9 years | Home Mortgage | 52504.0 | Debt Consolidation | $297.96 | 15.1 | 82.0 | 8 | 0 | 3315 | 20090 | 0.0 | 0.0 |
| 256982 | ffffcb2e-e48e-4d2c-a0d6-ed6bce5bfdbe | 971a6682-183b-4a52-8bce-1d3429ade295 | Charged Off | 12116 | Short Term | 746.0 | 9 years | Home Mortgage | 52504.0 | Debt Consolidation | $297.96 | 15.1 | 82.0 | 8 | 0 | 3315 | 20090 | 0.0 | 0.0 |


In [5]:
df.isnull().sum()

Loan ID                              0
Customer ID                          0
Loan Status                          0
Current Loan Amount                  0
Term                                 0
Credit Score                     59346
Years in current job             10444
Home Ownership                       0
Annual Income                    59346
Purpose                              0
Monthly Debt                         0
Years of Credit History              0
Months since last delinquent    131427
Number of Open Accounts              0
Number of Credit Problems            0
Current Credit Balance               0
Maximum Open Credit                  0
Bankruptcies                       492
Tax Liens                           23
dtype: int64

In [6]:
df["Loan Status"].value_counts()

Fully Paid     176191
Charged Off     64183
Name: Loan Status, dtype: int64
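
The counts above give roughly a 73/27 split between the two classes, so a model that always predicts "Fully Paid" already reaches about 73% accuracy. A minimal sketch of that baseline, using only the counts printed above (the variable names are illustrative):

```python
# Class counts copied from the value_counts() output above.
counts = {"Fully Paid": 176191, "Charged Off": 64183}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = counts[majority_class] / total

print(majority_class, round(baseline_accuracy, 4))
```

Any classifier's accuracy on this data is best read against this baseline rather than against 50%.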

In [7]:
df["Loan Status"] = df["Loan Status"].replace(["Fully Paid", "Charged Off"], [1,0]).astype(int)

In [8]:
df["Home Ownership"].value_counts()

Home Mortgage    117231
Rent             101209
Own Home          21389
HaveMortgage        545
Name: Home Ownership, dtype: int64

In [9]:
df["Home Ownership"] = df["Home Ownership"].replace(["Home Mortgage", "Rent", "Own Home", "HaveMortgage"], [0,1,2,3]).astype(int)
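
The integer codes above impose an arbitrary order on the ownership categories and keep "HaveMortgage" separate from "Home Mortgage". One alternative, sketched here on a toy frame rather than the real dataset, is one-hot encoding with `pd.get_dummies`. The toy values and the decision to merge the two mortgage labels are assumptions; the data dictionary does not say they are the same category.

```python
import pandas as pd

# Toy frame with the four labels seen in the value_counts() output.
toy = pd.DataFrame({"Home Ownership": ["Home Mortgage", "Rent", "Own Home", "HaveMortgage"]})

# Assumption: "HaveMortgage" is a variant spelling of "Home Mortgage".
toy["Home Ownership"] = toy["Home Ownership"].replace("HaveMortgage", "Home Mortgage")

# One indicator column per category; no ordering is implied.
encoded = pd.get_dummies(toy, columns=["Home Ownership"], dtype=int)
print(encoded)
```

Tree-based models are usually tolerant of integer codes, while linear and distance-based models tend to be more sensitive to the implied ordering.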

In [10]:
df["Term"].value_counts()

Short Term    182177
Long Term      58197
Name: Term, dtype: int64

In [11]:
df["Term"] = df["Term"].replace(["Short Term", "Long Term"], [1,0]).astype(int)

In [12]:
df["Purpose"].value_counts()

Debt Consolidation      190656
Home Improvements        14106
other                    13221
Other                     9115
Business Loan             4275
Buy a Car                 3150
Medical Bills             2687
Take a Trip               1467
Buy House                 1445
Educational Expenses       252
Name: Purpose, dtype: int64

In [13]:
df["Purpose"] = df["Purpose"].replace("Other","other")

In [14]:
df["Maximum Open Credit"].value_counts()

0        1503
0         223
15662      18
9749       18
14770      18
         ... 
55588       1
34676       1
92025       1
82288       1
62371       1
Name: Maximum Open Credit, Length: 87188, dtype: int64

In [15]:
df["Monthly Debt"].value_counts()

$0.00         241
$847.85        12
$838.10        12
$636.87        11
$837.00        11
             ... 
$3,097.91       1
$2,068.23       1
$347.17         1
$1,405.80       1
$2,525.82       1
Name: Monthly Debt, Length: 129115, dtype: int64

In [16]:
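# regex=False makes "$" a literal character instead of the regex end-of-string anchor.
df["Monthly Debt"] = df["Monthly Debt"].str.replace("$", "", regex=False)
df["Monthly Debt"] = df["Monthly Debt"].str.replace(",", "", regex=False)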

In [17]:
df['Monthly Debt'] = df['Monthly Debt'].astype(float)
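
The currency clean-up in the last two cells can be checked on a few sample strings. This is a small sketch on values copied from the table above, not the full column; `pd.to_numeric` with `errors="coerce"` is used here as an assumption about how unparseable entries should be handled (they become NaN instead of raising).

```python
import pandas as pd

raw = pd.Series(["$584.03", "$1,106.04", "$0.00"])

# regex=False treats "$" and "," as literal characters rather than regex syntax.
cleaned = raw.str.replace("$", "", regex=False).str.replace(",", "", regex=False)

# errors="coerce" turns anything that still is not a number into NaN.
monthly_debt = pd.to_numeric(cleaned, errors="coerce")
print(monthly_debt.tolist())
```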

In [18]:
df["Tax Liens"].value_counts() 

0.0     236037
1.0       3055
2.0        811
3.0        223
4.0        116
5.0         54
6.0         28
9.0          8
8.0          8
7.0          6
10.0         3
11.0         2
Name: Tax Liens, dtype: int64

In [19]:
df["Tax Liens"] = df["Tax Liens"].fillna(0)

In [20]:
df["Bankruptcies"].value_counts()

0.0    214858
1.0     23906
2.0       902
3.0       168
4.0        29
5.0        15
6.0         3
7.0         1
Name: Bankruptcies, dtype: int64

In [21]:
df["Bankruptcies"] = df["Bankruptcies"].fillna(0)

In [22]:
df["Years in current job"].value_counts()

10+ years    73965
2 years      21972
< 1 year     19684
3 years      19337
5 years      16759
1 year       15670
4 years      15138
6 years      13654
7 years      13073
8 years      11417
9 years       9261
Name: Years in current job, dtype: int64

In [23]:
# Map each label to a number of years; missing values stay NaN.
job_years = {"< 1 year": 0, "1 year": 1, "2 years": 2, "3 years": 3,
             "4 years": 4, "5 years": 5, "6 years": 6, "7 years": 7,
             "8 years": 8, "9 years": 9, "10+ years": 10}
df["Years in current job"] = df["Years in current job"].map(job_years)

In [24]:
df["Years in current job"] = df["Years in current job"].fillna(df["Years in current job"].median())

In [25]:
df["Credit Score"].value_counts()

747.0     5512
746.0     5326
740.0     5278
741.0     5245
742.0     4995
          ... 
6070.0       3
5910.0       3
5900.0       2
5930.0       2
5860.0       1
Name: Credit Score, Length: 334, dtype: int64

In [26]:
df['Credit Score']=df['Credit Score'].apply(lambda x:x/10 if x>800 else x)
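
Scores such as 7460.0 in the table above look like valid scores with an extra trailing zero, which is what the division by 10 corrects. The same rule can be written without `apply`; this is a sketch on three sample values, with `Series.where` keeping entries that satisfy the condition and replacing the rest.

```python
import numpy as np
import pandas as pd

scores = pd.Series([741.0, 7460.0, np.nan])

# Keep values that are at most 800; replace the others with value / 10.
# NaN fails the comparison, is replaced by NaN / 10, and so stays NaN.
fixed = scores.where(scores <= 800, scores / 10)
print(fixed.tolist())
```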

In [27]:
df["Credit Score"] = df["Credit Score"].fillna(df["Credit Score"].median())

In [28]:
df = df.drop(["Loan ID", "Customer ID", "Months since last delinquent", 'Maximum Open Credit',"Purpose"],axis=1)

In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 240374 entries, 0 to 256983
Data columns (total 14 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   Loan Status                240374 non-null  int32  
 1   Current Loan Amount        240374 non-null  int64  
 2   Term                       240374 non-null  int32  
 3   Credit Score               240374 non-null  float64
 4   Years in current job       240374 non-null  float64
 5   Home Ownership             240374 non-null  int32  
 6   Annual Income              181028 non-null  float64
 7   Monthly Debt               240374 non-null  float64
 8   Years of Credit History    240374 non-null  float64
 9   Number of Open Accounts    240374 non-null  int64  
 10  Number of Credit Problems  240374 non-null  int64  
 11  Current Credit Balance     240374 non-null  int64  
 12  Bankruptcies               240374 non-null  float64
 13  Tax Liens                  24

## Regression

In [30]:
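# .copy() gives independent frames, so the later column edits do not act on a view of df.
missing = df[df["Annual Income"].isnull()].copy()
filled = df[df["Annual Income"].notnull()].copy()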

In [31]:
abs(filled.corr()["Annual Income"]).sort_values(ascending = False)

Annual Income                1.000000
Monthly Debt                 0.475043
Current Credit Balance       0.307302
Years of Credit History      0.155994
Home Ownership               0.152087
Number of Open Accounts      0.147549
Years in current job         0.070728
Loan Status                  0.070346
Term                         0.067576
Bankruptcies                 0.047207
Tax Liens                    0.038451
Current Loan Amount          0.022396
Number of Credit Problems    0.016035
Credit Score                 0.014700
Name: Annual Income, dtype: float64

In [32]:
y=filled["Annual Income"]
x=filled[["Monthly Debt","Current Credit Balance"]]

In [33]:
missing.shape , filled.shape

((59346, 14), (181028, 14))

In [34]:
type(x), type(y)

(pandas.core.frame.DataFrame, pandas.core.series.Series)

In [35]:
regression_(x,y)

| | R_square | MSE | MAE |
|---|---|---|---|
| ExtraTree | 0.999637 | 1076.120654 | 25.704262 |
| DecisionTree | 0.999637 | 1076.120654 | 25.704262 |
| XGB | 0.500406 | 39912.622542 | 22472.758178 |
| KNeighbors | 0.39506 | 43919.560161 | 20921.391949 |
| GradientBoosting | 0.277965 | 47982.274714 | 23136.728557 |
| LinearRegressor | 0.234012 | 49421.156321 | 23603.115213 |
| Ridge | 0.234012 | 49421.156321 | 23603.115213 |
| Lasso | 0.234012 | 49421.156321 | 23603.115176 |
| ElasticNet | 0.234012 | 49421.156321 | 23603.113643 |
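
`regression_` fits and scores every model on the same rows, so the scores above are in-sample. An unpruned tree can memorize its training data, which likely explains the near-perfect R² for ExtraTree and DecisionTree. The sketch below shows the effect on synthetic data (a noisy linear target, not the loan dataset); all names and numbers in it are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: two features, target depends on the first plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(2000, 2))
y = 3.0 * X[:, 0] + rng.normal(0, 5, size=2000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit on the training rows only.
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# Score once on the rows the tree has seen and once on rows it has not.
r2_in_sample = r2_score(y_train, tree.predict(X_train))
r2_held_out = r2_score(y_test, tree.predict(X_test))
print(f"in-sample R^2: {r2_in_sample:.3f}, held-out R^2: {r2_held_out:.3f}")
```

Comparing models on the held-out score is the more reliable way to choose an imputation model.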


In [36]:
del missing["Annual Income"]

In [37]:
etr = ExtraTreeRegressor()
pred = etr.fit(x,y).predict(missing[["Monthly Debt","Current Credit Balance"]])

In [38]:
#missing["Annual Income"] = pred
missing.loc[:,'Annual Income']=pred

In [39]:
#df = missing.merge(filled)
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way.
df=pd.concat([missing, filled])

In [40]:
df.isnull().sum()

Loan Status                  0
Current Loan Amount          0
Term                         0
Credit Score                 0
Years in current job         0
Home Ownership               0
Monthly Debt                 0
Years of Credit History      0
Number of Open Accounts      0
Number of Credit Problems    0
Current Credit Balance       0
Bankruptcies                 0
Tax Liens                    0
Annual Income                0
dtype: int64

## Classification

In [41]:
y = df['Loan Status']
x = df.drop("Loan Status",axis=1)
x = pd.get_dummies(x,drop_first=True)

In [42]:
classification_(x,y)

| | AccuracyScore | PrecisionScore | RecallScore | f1_Score |
|---|---|---|---|---|
| RandomForest | 0.81793 | 0.833025 | 0.941683 | 0.884028 |
| GradientBoosting | 0.753697 | 0.770135 | 0.949022 | 0.850271 |
| LogisticRegr | 0.736973 | 0.736977 | 0.999944 | 0.848554 |
| SVC | 0.736911 | 0.736911 | 1.0 | 0.848531 |
| Bernoulli | 0.737015 | 0.737333 | 0.999012 | 0.848454 |
| AdaBoost | 0.752075 | 0.772247 | 0.941118 | 0.848361 |
| KNeigbors | 0.718794 | 0.788324 | 0.8454 | 0.815865 |
| DecisionTree | 0.731773 | 0.834045 | 0.793999 | 0.813529 |
| Gaussian | 0.414415 | 0.990427 | 0.207356 | 0.342918 |
| Multinominal | 0.410588 | 1.0 | 0.200158 | 0.333553 |
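
The SVC row (accuracy equal to precision, recall of 1.0) is the pattern produced by a model that labels every row as class 1. The sketch below reproduces that pattern on toy labels; the labels are made up for illustration and only mimic the approximate class balance.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy labels: 6 of 8 are positive.
y_true = [1, 1, 1, 0, 1, 1, 1, 0]
y_pred = [1] * len(y_true)  # a "model" that always predicts the positive class

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# Recall for the negative class shows what the positive-class metrics hide.
recall_negative = recall_score(y_true, y_pred, pos_label=0)

print(accuracy, precision, recall, round(f1, 4), recall_negative)
```

Since "Charged Off" (class 0) is the class of interest for credit risk, its recall is worth reporting next to the scores above.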


## Deep Learning

In [44]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.2,random_state=42)

In [45]:
import tensorflow as tf
model = tf.keras.Sequential([
                            tf.keras.layers.Dense(512, activation="relu"),
                            tf.keras.layers.Dense(512, activation="relu"),
                            tf.keras.layers.Dropout(0.25),
                            tf.keras.layers.Flatten(),
                            tf.keras.layers.Dense(512, activation="relu"),
                            tf.keras.layers.Dropout(0.50),
                            tf.keras.layers.Dense(512, activation="relu"),
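                            # NOTE: softmax over a single unit always outputs 1.0;
                            # sigmoid is the usual activation for a one-unit binary output.
                            tf.keras.layers.Dense(1, activation="softmax")])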
model.compile(optimizer = "adam", loss = "binary_crossentropy", metrics = ["accuracy"])
history = model.fit(x_train,y_train, batch_size=128, epochs=40, verbose=1, validation_data=(x_test, y_test))

Epoch 1/40
...
Epoch 40/40


In [46]:
_, accuracy = model.evaluate(x_test, y_test)
print("Accuracy: %.2f" % (accuracy*100))

Accuracy: 73.69
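
This accuracy matches the SVC row in the classification table (0.736911), which is consistent with a network that predicts class 1 for every test row. A likely cause is the final `Dense(1, activation="softmax")` layer: softmax normalizes across the units of a layer, so with a single unit the output is always 1.0. The NumPy sketch below illustrates this without TensorFlow; the `softmax` and `sigmoid` helpers are written here for illustration and are not Keras functions.

```python
import numpy as np

def softmax(z):
    # Normalizes along the last axis so each row sums to 1.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(z):
    # Squashes each value independently into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Three samples with one output unit each (shape (3, 1)), as a Dense(1) layer produces.
logits = np.array([[-4.0], [0.0], [4.0]])

print(softmax(logits).ravel())  # every entry is 1.0, whatever the logit
print(sigmoid(logits).ravel())  # varies with the logit
```

Using `activation="sigmoid"` on the output layer with `binary_crossentropy` is the usual pairing for a single-unit binary classifier; the accuracy above would need to be re-measured after such a change.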
