<center><h1>E-Signing of Loan Based on Financial History</h1></center>

<div>
<img src="img/logo.jpg" width="900"/>
</div>

<h4>E-Signature:</h4>
<br>
<p style="text-indent:5em">Electronic signatures aren’t exactly a novelty. They have been around since the American Civil War, during which contracts were signed through Morse code. In a modern setting, an e-Sign refers to a unique, digitised, encrypted personal identifier, which is different in essence from a ‘wet’ signature created by hand. The e-Sign is meant to complete transactions, loops, and agreements electronically.</p>

<p style="text-indent:5em">In India, the e-Sign has been granted legal status by amendments to various laws, namely the Information Technology Act, Indian Evidence Act and the Negotiable Instruments Act. Early adopters in the financial sector have started using e-Sign to get customers to sign loan and card applications, and loan agreements.</p>



### Importing the Library

In [None]:
#Importing libraries for reading, writing and performing basic operations
import pandas as pd
import numpy as np

#Importing library for Visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#Importing the library for evaluating the model
from sklearn.metrics import accuracy_score

import warnings
warnings.filterwarnings("ignore")

### Reading the Data

In [None]:
df=pd.read_csv("financial_data.csv")
df.head()

### Exploring the Data

In [None]:
#Finding the count of each class of dependent variable
print(sum(df["e_signed"]==1))
print(sum(df["e_signed"]==0))

In [None]:
#Checking for percentage of missing data in each column
percent_missing = df.isnull().sum() * 100 / len(df)                
missing_value_df = pd.DataFrame({'percent_missing': percent_missing})          

missing_value_df

<p style="text-indent:5em">The data is not highly imbalanced, since the ratio between the counts of the two classes of the dependent variable is not large.</p>
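
As a rough illustration of how class balance can be quantified, the sketch below computes the share of each class and the majority-to-minority ratio. The `e_signed` Series here is a hypothetical toy sample, not the real column from `financial_data.csv`.

```python
import pandas as pd

# Hypothetical toy labels standing in for df["e_signed"] (1 = signed, 0 = not signed)
e_signed = pd.Series([1, 0, 1, 1, 0, 1, 0, 1], name="e_signed")

# Absolute count of each class
class_counts = e_signed.value_counts()

# Share of each class; shares close to 0.5 indicate a roughly balanced target
class_share = e_signed.value_counts(normalize=True)

# Ratio of the majority class count to the minority class count
imbalance_ratio = class_counts.max() / class_counts.min()

print(class_counts)
print(class_share)
print(f"majority/minority ratio: {imbalance_ratio:.2f}")
```

A ratio close to 1 suggests that plain accuracy is a reasonable first metric; a much larger ratio would call for metrics that account for imbalance.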

In [None]:
#Dropping the unnecessary columns
dataset2 = df.drop(columns = ['entry_id', 'pay_schedule', 'e_signed'])

fig = plt.figure(figsize=(15, 12))

plt.suptitle('Histograms of Numerical Columns', fontsize=20)
for i in range(dataset2.shape[1]):
    plt.subplot(6, 3, i + 1)
    f = plt.gca()
    f.set_title(dataset2.columns.values[i])

    vals = np.size(dataset2.iloc[:, i].unique())
    if vals >= 100:
        vals = 100

    plt.hist(dataset2.iloc[:, i], bins=vals, color='#3F5D7D')

plt.tight_layout(rect=[0, 0.03, 1, 0.95])

In [None]:
#Correlation of independent variable with dependent variable

dataset2.corrwith(df.e_signed).plot.bar(figsize=(20,10),title="Correlation with E-Signed",
                                             fontsize=20,rot=45,grid= True,color=['pink','green',
                                                                                  'blue','cyan','magenta'])

### Building the ANN


<div>
<img src="img/ann.png" width="500"/>
</div>

In [None]:
#Assigning the data to a new variable and creating dummy variables for the categorical column
data=df
dummy=pd.get_dummies(data["pay_schedule"])
dummy=dummy.drop(labels=["bi-weekly"],axis=1)
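
Dropping one dummy column is what avoids the so-called dummy variable trap. The sketch below shows why on a hypothetical five-row sample of `pay_schedule` (the values are assumed for illustration): with every category encoded, each row sums to 1, so one column is always implied by the others.

```python
import pandas as pd

# Hypothetical sample of the pay_schedule column
pay = pd.DataFrame(
    {"pay_schedule": ["bi-weekly", "weekly", "monthly", "semi-monthly", "weekly"]}
)

# One indicator column per category
dummies = pd.get_dummies(pay["pay_schedule"])

# With all columns present, every row sums to 1, so the columns are linearly dependent
row_sums = dummies.sum(axis=1)

# Dropping one category removes that redundancy; an all-zero row now means "bi-weekly"
reduced = dummies.drop(columns=["bi-weekly"])

print(row_sums.tolist())
print(reduced.astype(int))
```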

In [None]:

data=data.drop(["pay_schedule"],axis=1)

In [None]:

data=pd.concat([data,dummy],axis=1)

In [None]:

data.shape

In [None]:
# Separating the data into dependent and independent variables. The response is the e_signed column,
# which is the value the model has to predict
response = data["e_signed"]
dataset = data.drop(columns = ["e_signed", "entry_id"])

In [None]:
#Transforming the data using StandardScaler
from sklearn.preprocessing import StandardScaler
sc_X= StandardScaler()

In [None]:
# Splitting into Train and Test Set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(dataset,
                                                    response,
                                                    test_size = 0.2,
                                                    random_state = 0)

In [None]:
#Fitting and transforming our data
X_train=sc_X.fit_transform(X_train)
X_test=sc_X.transform(X_test)
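
The fit-on-train, transform-on-test pattern used above can be checked on synthetic data. This is a minimal sketch with assumed names (`X_demo`, `y_demo`) and random numbers in place of the loan features: the training columns end up with mean 0 and standard deviation 1, while the test columns are only approximately standardized because they reuse the training statistics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric feature matrix standing in for the loan data
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # mean and std are learned from the training rows only
X_te_scaled = scaler.transform(X_te)      # the test rows reuse those training statistics

print("train means:", X_tr_scaled.mean(axis=0).round(3))
print("test means: ", X_te_scaled.mean(axis=0).round(3))
```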

In [None]:
X_train.shape

#### Building ANN Using Keras

<div>
<img src="img/keras.png" width="500"/>
</div>

In [None]:
# Importing the Keras libraries and packages
import keras
from keras.models import Sequential
from keras.layers import Dense


classifier = Sequential()

In [None]:
#Adding the layers in the ANN; the input dimension is read from the training data (21 features here)
classifier.add(Dense(units = 10, kernel_initializer = 'uniform', activation = 'relu', input_dim = X_train.shape[1]))


classifier.add(Dense(units = 10, kernel_initializer = 'uniform', activation = 'relu'))


classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))


classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

In [None]:
# Fitting the ANN to the Training set
classifier.fit(X_train, y_train, batch_size = 10, epochs = 100)

In [None]:
#Predicting on the test data using the model
pred=classifier.predict(X_test)

In [None]:

y_pred = (pred > 0.5)

In [None]:
#accuracy_score expects the true labels first, then the predictions
accuracy_score(y_test,y_pred)
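
To make the thresholding step concrete, here is a small sketch with made-up probabilities (the arrays `probs` and `y_true` are hypothetical). It converts sigmoid outputs into class labels with the same 0.5 cut-off and adds a confusion matrix, which is an optional extra beyond the accuracy score used in this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical sigmoid outputs, shaped (n_samples, 1) like the output of a Keras predict call
probs = np.array([[0.91], [0.35], [0.62], [0.08], [0.50], [0.77]])
y_true = np.array([1, 0, 0, 0, 1, 1])

# Values strictly above 0.5 become class 1; ravel() flattens the column into a 1-D label vector
labels = (probs > 0.5).astype(int).ravel()

acc_demo = accuracy_score(y_true, labels)
cm_demo = confusion_matrix(y_true, labels)  # rows are true classes, columns are predicted classes

print(labels)
print(acc_demo)
print(cm_demo)
```

Note that a probability of exactly 0.5 is assigned to class 0 because the comparison is strict.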

## Other Machine Learning Models

#### Feature Engineering


    
<div>
    <br>
<img src="img/ml.jpg" width="500"/>
</div>

##### AGE

We will group our age data as follows:
<div>
    <br>
<img src="img/age_group.png" width="500"/>
</div>

In [None]:
#Labelling applicants aged 45 or younger as YOUTH and all others as SENIOR
AGE=np.where(df["age"]<=45,"YOUTH","SENIOR").tolist()

#### Months Employed and Personal Account

We will convert the employment duration and the personal account age to months as follows:
<br>
<p style="text-indent:5em">EMPLOYED = months_employed + years_employed * 12</p>
<p style="text-indent:5em">PA = personal_account_m + personal_account_y * 12</p>
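
A quick worked example of the conversion, using two hypothetical applicants rather than rows from the dataset: someone employed for 2 years and 3 months has 3 + 2 * 12 = 27 months of employment.

```python
import pandas as pd

# Hypothetical rows: the first applicant has 2 years 3 months of employment
# and a personal account that is 1 year 6 months old
toy = pd.DataFrame({
    "years_employed": [2, 0],
    "months_employed": [3, 11],
    "personal_account_y": [1, 4],
    "personal_account_m": [6, 0],
})

# Total duration in months = months + years * 12
toy["EMPLOYED"] = toy["months_employed"] + toy["years_employed"] * 12
toy["PA"] = toy["personal_account_m"] + toy["personal_account_y"] * 12

print(toy[["EMPLOYED", "PA"]])
```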


In [None]:
#Total employment duration in months
EMPLOYED=(df["months_employed"]+df["years_employed"]*12).tolist()

In [None]:
#Total age of the personal account in months
PA=(df["personal_account_m"]+df["personal_account_y"]*12).tolist()

In [None]:
#Average of risk scores 2 to 5
avg_risk=((df["risk_score_2"]+df["risk_score_3"]+df["risk_score_4"]+df["risk_score_5"])/4).tolist()

In [None]:
#Average of the two external quality scores
ext_quality=((df["ext_quality_score_2"]+df["ext_quality_score"])/2).tolist()

#### Merging all the features into Single Dataframe



In [None]:
#Converting all lists into DataFrame
AGE=pd.DataFrame(AGE)
EMPLOYED=pd.DataFrame(EMPLOYED)
PA=pd.DataFrame(PA)
avg_risk=pd.DataFrame(avg_risk)
ext_quality=pd.DataFrame(ext_quality)

#Concatenating all the features
featured=pd.concat([AGE,EMPLOYED,PA,avg_risk,ext_quality],axis=1)

In [None]:
featured.columns=["AGE","EMPLOYED","PA","RISK","QUALITY"]

In [None]:
featured.head()

In [None]:
#Creating Dummy Variable
dummy1=pd.get_dummies(featured["AGE"])
dummy1.head()

In [None]:
#Dropping the AGE Variable since dummy variable is created
featured=featured.drop(["AGE"],axis=1)

#Concatenating the data and the dummy variables
featured=pd.concat([featured,dummy1],axis=1)
featured.head()

#### Classifying the Data into dependent and independent Variable

In [None]:
#Dependent Variable
dep="e_signed"

#Selecting all the columns as independent variables
ind=df.columns.tolist()

In [None]:
#Removing the dependent variable from independent
ind.remove(dep)
ind.remove("entry_id")

In [None]:
#Selecting the Data
X=df[ind]
Y=df[dep]

In [None]:
#Concatenating the engineered features with the original data
X=pd.concat([X,featured],axis=1)
X.head()

In [None]:
#Dropping the original columns that the engineered features replace
X=X.drop(labels=["age","months_employed","years_employed","personal_account_m","personal_account_y","risk_score_2","risk_score_3"
                ,"risk_score_4","risk_score_5","ext_quality_score_2","ext_quality_score"],axis=1)

In [None]:
X.head()

In [None]:
#Creating Dummy Variable
dummy2=pd.get_dummies(X["pay_schedule"])

#Dropping one category to avoid the dummy variable trap
dummy2=dummy2.drop(labels=["bi-weekly"],axis=1)
dummy2.head()

In [None]:
#Dropping the pay_schedule since dummy variable is created
X=X.drop(labels=["pay_schedule"],axis=1)

#Concatenating the data and the dummy variables
X=pd.concat([X,dummy2],axis=1)
X.head()

#### Data Transformation and Model Building

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [None]:
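#Initializing the scaler
scale=StandardScaler()

##### Splitting the Data

In [None]:
#Splitting before scaling so that the scaler only sees the training data
xtrain,xtest,ytrain,ytest=train_test_split(X,Y,test_size=0.25,random_state=0)

In [None]:
#Fitting the scaler on the training set and applying the same transformation to the test set
xtrain=scale.fit_transform(xtrain)
xtest=scale.transform(xtest)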

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

In [None]:
#Initializing the object
model=RandomForestClassifier()

<div>
    <br>
<img src="img/random.png" width="600"/>
</div>

In [None]:
#Fitting the model
model.fit(xtrain,ytrain)

In [None]:
#Predictions on test data
pred=model.predict(xtest)

In [None]:
accuracy_score(ytest,pred)

### Gradient Boosting


<div>
    <br>
<img src="img/gradient.png" width="600"/>
</div>



<a href="https://medium.com/mlreview/gradient-boosting-from-scratch-1e317ae4587d">Read More</a>

In [None]:
lr_list = [0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1]

#Training one model per learning rate and reporting its scores inside the loop,
#so that every learning rate is evaluated and not only the last one
for learning_rate in lr_list:
    gb_clf = GradientBoostingClassifier(n_estimators=20, learning_rate=learning_rate, max_features=2, max_depth=2, random_state=0)
    gb_clf.fit(xtrain, ytrain)

    print("Learning rate: ", learning_rate)
    print("Accuracy score (training): {0:.3f}".format(gb_clf.score(xtrain, ytrain)))
    print("Accuracy score (validation): {0:.3f}".format(gb_clf.score(xtest, ytest)))
    print()
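
One possible way to compare learning rates is to store the scores and pick the best one programmatically. The sketch below does this on synthetic data from `make_classification`; the names `X_syn`, `results` and `best_lr` are assumptions for illustration, and in the notebook `xtrain`, `xtest`, `ytrain` and `ytest` would take the place of the synthetic split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the loan dataset
X_syn, y_syn = make_classification(n_samples=400, n_features=8, random_state=0)
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)

results = {}
for lr in [0.05, 0.1, 0.25, 0.5, 1]:
    clf = GradientBoostingClassifier(n_estimators=20, learning_rate=lr,
                                     max_features=2, max_depth=2, random_state=0)
    clf.fit(X_a, y_a)
    # Store (training accuracy, held-out accuracy) for this learning rate
    results[lr] = (clf.score(X_a, y_a), clf.score(X_b, y_b))

# Pick the learning rate with the highest held-out accuracy
best_lr = max(results, key=lambda k: results[k][1])

print(results)
print("best learning rate:", best_lr)
```

Selecting a hyperparameter on the same split that is later reported as the test score tends to give an optimistic estimate, so a separate validation split would be the more careful choice.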

### XGBoost


<div>
    <br>
<img src="img/xg_boost.jpeg" width="600"/>
</div>

In [None]:
#Implementing xgboost
from xgboost import XGBClassifier

In [None]:
#Initializing the model
xgb_clf = XGBClassifier()
xgb_clf.fit(xtrain, ytrain)

In [None]:
score = xgb_clf.score(xtest, ytest)
print(score)

### Support Vector Machine


<br>
<img src="img/sv.jpg" height=100px >

<p>For further reading, click <a href="https://monkeylearn.com/blog/introduction-to-support-vector-machines-svm/">here</a>.</p>

In [None]:
from sklearn.svm import SVC
classifier = SVC(random_state = 0, kernel = 'rbf')
classifier.fit(xtrain, ytrain)

In [None]:
# Predicting Test Set
y_pred = classifier.predict(xtest)

In [None]:
#Finding the Accuracy Score
acc = accuracy_score(ytest, y_pred)
acc


<br>
<h4>Models implemented:</h4>

<ul>
    <li>Artificial Neural Network</li>
    <li>Random Forest Classifier</li>
    <li>Gradient Boosting</li>
    <li>Support Vector Machine</li>
    <li>XGBoost</li>
</ul>

<center><h2><u>Conclusion</u></h2></center>
    
 <ul>
    <li>The XGBoost algorithm performs best, with an accuracy of 62%.</li>
    <li>The ANN without feature engineering performs far better than the SVM and Random Forest models with feature engineering.</li>
    <li>Although the accuracy is not very high, the model can still help banks judge whether a customer is risky.</li>
</ul>
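
A side-by-side comparison like the one summarized above could be produced by collecting the test accuracies in one table. The following is only a sketch on synthetic data with scikit-learn models; the dictionary `candidates` and the variable names are assumptions, and the ANN and XGBoost models are left out to keep the example dependency-light.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data, not the loan dataset
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)

candidates = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "Support Vector Machine": SVC(kernel="rbf", random_state=0),
}

# Fit every model on the same split and record its test accuracy
scores = {}
for name, est in candidates.items():
    est.fit(X_a, y_a)
    scores[name] = accuracy_score(y_b, est.predict(X_b))

# Sort so that the best-performing model appears first
comparison = pd.Series(scores, name="test_accuracy").sort_values(ascending=False)
print(comparison)
```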