# Building an ML model to predict loan approval

This is the third tutorial practice from AceleradDev DS week 1. It covers a more advanced topic: gaining experience in putting a model into production by serving it through an API.

Here I document and follow the steps of the [Tutorial to deploy Machine Learning models in Production as APIs (using Flask)](https://www.analyticsvidhya.com/blog/2017/09/machine-learning-models-as-apis-using-flask/), aiming for a better comprehension of machine learning development in Python.

## Import libraries

In [62]:
import os
import json
import numpy as np
import pandas as pd
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
import dill as pickle

import warnings

warnings.filterwarnings("ignore")

## Load and describe data

In [86]:
# importing data
data = pd.read_csv('../data/train.csv')

In [5]:
list(data.columns)

['Loan_ID',
 'Gender',
 'Married',
 'Dependents',
 'Education',
 'Self_Employed',
 'ApplicantIncome',
 'CoapplicantIncome',
 'LoanAmount',
 'Loan_Amount_Term',
 'Credit_History',
 'Property_Area',
 'Loan_Status']

In [6]:
data.shape

(614, 13)

In [7]:
for _ in data.columns:
    print("The number of null values in: {} == {}".format(_, data[_].isnull().sum()))

The number of null values in: Loan_ID == 0
The number of null values in: Gender == 13
The number of null values in: Married == 3
The number of null values in: Dependents == 15
The number of null values in: Education == 0
The number of null values in: Self_Employed == 32
The number of null values in: ApplicantIncome == 0
The number of null values in: CoapplicantIncome == 0
The number of null values in: LoanAmount == 22
The number of null values in: Loan_Amount_Term == 14
The number of null values in: Credit_History == 50
The number of null values in: Property_Area == 0
The number of null values in: Loan_Status == 0
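
The same per-column count can be obtained in a single call, since `isnull().sum()` works on a whole frame. A minimal sketch on a made-up three-row frame (the values are illustrative and not taken from `train.csv`):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the loan data (values are made up)
toy = pd.DataFrame({
    'Gender': ['Male', np.nan, 'Female'],
    'LoanAmount': [128.0, np.nan, np.nan],
    'Education': ['Graduate', 'Graduate', 'Not Graduate'],
})

# One Series with the number of nulls per column, largest first
null_counts = toy.isnull().sum().sort_values(ascending=False)
print(null_counts)
```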


Check values in variables with missing data. 

In [8]:
missing_pred = ['Dependents', 'Self_Employed', 'Loan_Amount_Term', 'Gender', 'Married']

for values in missing_pred:
    print("List of unique labels for {}: {}".format(values, set(data[values])))

List of unique labels for Dependents: {nan, '0', '1', '3+', '2'}
List of unique labels for Self_Employed: {nan, 'Yes', 'No'}
List of unique labels for Loan_Amount_Term: {nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 12.0, 36.0, 300.0, 180.0, 60.0, 84.0, 480.0, 360.0, 240.0, 120.0}
List of unique labels for Gender: {nan, 'Female', 'Male'}
List of unique labels for Married: {nan, 'Yes', 'No'}
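
The repeated `nan` entries listed for `Loan_Amount_Term` appear because `nan` never compares equal to itself, so a Python `set` built from a float column can hold several of them. A small sketch with made-up values shows pandas' own helpers, which treat all missing values as one:

```python
import numpy as np
import pandas as pd

# Made-up loan terms with two missing entries
terms = pd.Series([360.0, np.nan, 120.0, np.nan, 360.0])

# unique() collapses every missing value into a single nan
unique_terms = terms.unique()

# value_counts(dropna=False) also reports how often each value occurs
counts = terms.value_counts(dropna=False)
print(unique_terms)
print(counts)
```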


They will be treated following these rules:

- Dependents: Assumption that there are no dependents
- Self_Employed: Assumption that the applicant is not self-employed
- Loan_Amount_Term: Assumption that the loan amount term is the mean value
- LoanAmount: Assumption that the loan amount is the mean value
- Credit_History: Assumption that the person has a credit history
- Married: If nothing specified, applicant is not married
- Gender: Assuming the gender is Male for the missing values

So far we have only tried to understand the data and decide how to handle it. Let's proceed.

## Data preparation

We'll start by splitting the dataset into train and test sets.

In [9]:
# creating train and test data
pred_var= ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed',
           'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term',
           'Credit_History', 'Property_Area']

X_train, X_test, y_train, y_test = train_test_split(
    data[pred_var], data['Loan_Status'],
    test_size=0.25, random_state=42
)
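
With `test_size=0.25` on 614 rows, scikit-learn rounds the test share up, which is where the 460/154 split seen later in this notebook comes from. A quick check on placeholder data (an index array instead of the real frame):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder rows: only the count (614) matches the loan data
rows = np.arange(614)

train_rows, test_rows = train_test_split(rows, test_size=0.25, random_state=42)
print(len(train_rows), len(test_rows))
```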

Then we'll work through the pre-processing steps one by one; they will later be bundled into a custom estimator.

In [10]:
X_train['Dependents'] = X_train['Dependents'].fillna('0')
X_train['Self_Employed'] = X_train['Self_Employed'].fillna('No')
X_train['Loan_Amount_Term'] = X_train['Loan_Amount_Term'].fillna(X_train['Loan_Amount_Term'].mean())
X_train['Credit_History'] = X_train['Credit_History'].fillna(1)
X_train['Married'] = X_train['Married'].fillna('No')
X_train['Gender'] = X_train['Gender'].fillna('Male')
X_train['LoanAmount'] = X_train['LoanAmount'].fillna(X_train['LoanAmount'].mean())

There are many string labels in the Gender, Married, Education, Self_Employed, Property_Area and Dependents columns.

In [11]:
label_columns = ['Gender', 'Married', 'Education', 'Self_Employed', 'Property_Area', 'Dependents']

for _ in label_columns:
    print("List of unique labels {}:{}".format(_, set(X_train[_])))

List of unique labels Gender:{'Female', 'Male'}
List of unique labels Married:{'Yes', 'No'}
List of unique labels Education:{'Graduate', 'Not Graduate'}
List of unique labels Self_Employed:{'Yes', 'No'}
List of unique labels Property_Area:{'Rural', 'Urban', 'Semiurban'}
List of unique labels Dependents:{'1', '3+', '0', '2'}


They will be converted to numeric codes:

In [12]:
gender_values = {'Female' : 0, 'Male' : 1} 
married_values = {'No' : 0, 'Yes' : 1}
education_values = {'Graduate' : 0, 'Not Graduate' : 1}
employed_values = {'No' : 0, 'Yes' : 1}
property_values = {'Rural' : 0, 'Urban' : 1, 'Semiurban' : 2}
dependent_values = {'3+': 3, '0': 0, '2': 2, '1': 1}

X_train.replace({'Gender': gender_values,
                 'Married': married_values,
                 'Education': education_values,
                 'Self_Employed': employed_values,
                 'Property_Area': property_values,
                 'Dependents': dependent_values},
                inplace=True)

In [13]:
X_train.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
92,1,1,2,1,0,3273,1820.0,81.0,360.0,1.0,1
304,1,0,0,0,0,4000,2500.0,140.0,360.0,1.0,0
68,1,1,3,1,1,7100,0.0,125.0,60.0,1.0,1
15,1,0,0,0,0,4950,0.0,125.0,360.0,1.0,1
211,1,1,3,0,0,3430,1250.0,128.0,360.0,0.0,2


In [14]:
X_train.dtypes

Gender                 int64
Married                int64
Dependents             int64
Education              int64
Self_Employed          int64
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area          int64
dtype: object

In [15]:
for _ in X_train.columns:
    print("The number of null values in:{} == {}".format(_, X_train[_].isnull().sum()))

The number of null values in:Gender == 0
The number of null values in:Married == 0
The number of null values in:Dependents == 0
The number of null values in:Education == 0
The number of null values in:Self_Employed == 0
The number of null values in:ApplicantIncome == 0
The number of null values in:CoapplicantIncome == 0
The number of null values in:LoanAmount == 0
The number of null values in:Loan_Amount_Term == 0
The number of null values in:Credit_History == 0
The number of null values in:Property_Area == 0


Now there are no more null values and all variables are numeric, so we'll convert the pandas dataframe to a numpy array:

In [16]:
X_train = X_train.values  # as_matrix() was removed in pandas 1.0

In [17]:
X_train.shape

(460, 11)

We'll create a custom pre-processing estimator that will help us write cleaner pipelines and simplify future deployments:

In [18]:
# Custom Pre-Processing estimator for our use-case 
class PreProcessing(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def transform(self, df):
        # Selected variables
        pred_var = ['Gender','Married','Dependents','Education','Self_Employed','ApplicantIncome',\
                    'CoapplicantIncome','LoanAmount','Loan_Amount_Term','Credit_History','Property_Area']
        
        # Work on a copy so the caller's dataframe is not modified
        df = df[pred_var].copy()

        # Replace missing data
        df['Dependents'] = df['Dependents'].fillna('0')
        df['Self_Employed'] = df['Self_Employed'].fillna('No')
        df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(self.term_mean_)
        df['Credit_History'] = df['Credit_History'].fillna(1)
        df['Married'] = df['Married'].fillna('No')
        df['Gender'] = df['Gender'].fillna('Male')
        df['LoanAmount'] = df['LoanAmount'].fillna(self.amt_mean_)
        
        # Set factors
        gender_values = {'Female' : 0, 'Male' : 1} 
        married_values = {'No' : 0, 'Yes' : 1}
        education_values = {'Graduate' : 0, 'Not Graduate' : 1}
        employed_values = {'No' : 0, 'Yes' : 1}
        property_values = {'Rural' : 0, 'Urban' : 1, 'Semiurban' : 2}
        dependent_values = {'3+': 3, '0': 0, '2': 2, '1': 1}
        df.replace({'Gender': gender_values, 'Married': married_values, 'Education': education_values, \
                    'Self_Employed': employed_values, 'Property_Area': property_values, \
                    'Dependents': dependent_values}, inplace=True)
        
        return df.values  # as_matrix() was removed in pandas 1.0

    def fit(self, df, y=None, **fit_params):
        self.term_mean_ = df['Loan_Amount_Term'].mean()
        self.amt_mean_ = df['LoanAmount'].mean()
        
        return self
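
The key design point is that `fit` learns the fill values from the training data only, and `transform` reuses them on any later frame, so test rows never influence the imputation. A stripped-down sketch of the same pattern, using a hypothetical `MeanFiller` class and made-up numbers:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class MeanFiller(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: fill one column with its training mean."""

    def __init__(self, column='LoanAmount'):
        self.column = column

    def fit(self, df, y=None):
        # Learn the statistic from the data passed to fit (the training set)
        self.mean_ = df[self.column].mean()
        return self

    def transform(self, df):
        # Work on a copy so the caller's frame is left untouched
        df = df.copy()
        df[self.column] = df[self.column].fillna(self.mean_)
        return df


train = pd.DataFrame({'LoanAmount': [100.0, 200.0, np.nan]})
test = pd.DataFrame({'LoanAmount': [np.nan, 500.0]})

filler = MeanFiller().fit(train)
filled_test = filler.transform(test)
print(filled_test)
```

The missing test value is filled with the training mean rather than with a statistic computed on the test frame itself.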

To make sure that this works, let's do a test run for it:

In [87]:
X_train, X_test, y_train, y_test = train_test_split(
    data[pred_var], data['Loan_Status'],
    test_size=0.25, random_state=42
)

In [20]:
X_train.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
92,Male,Yes,2,Not Graduate,No,3273,1820.0,81.0,360.0,1.0,Urban
304,Male,No,0,Graduate,No,4000,2500.0,140.0,360.0,1.0,Rural
68,Male,Yes,3+,Not Graduate,Yes,7100,0.0,125.0,60.0,1.0,Urban
15,Male,No,0,Graduate,No,4950,0.0,125.0,360.0,1.0,Urban
211,Male,Yes,3+,Graduate,No,3430,1250.0,128.0,360.0,0.0,Semiurban


In [21]:
for _ in X_train.columns:
    print("The number of null values in:{} == {}".format(_, X_train[_].isnull().sum()))

The number of null values in:Gender == 11
The number of null values in:Married == 1
The number of null values in:Dependents == 11
The number of null values in:Education == 0
The number of null values in:Self_Employed == 20
The number of null values in:ApplicantIncome == 0
The number of null values in:CoapplicantIncome == 0
The number of null values in:LoanAmount == 16
The number of null values in:Loan_Amount_Term == 11
The number of null values in:Credit_History == 36
The number of null values in:Property_Area == 0


The datasets are back in their raw state. Now let's see if our custom transformer works.

In [88]:
preprocess = PreProcessing()
preprocess

PreProcessing()

In [90]:
preprocess.fit(X_train)

PreProcessing()

In [91]:
X_train_transformed = preprocess.transform(X_train)
X_train_transformed.shape

(460, 11)

In [47]:
X_train_transformed

array([[  1.,   1.,   2., ..., 360.,   1.,   1.],
       [  1.,   0.,   0., ..., 360.,   1.,   0.],
       [  1.,   1.,   3., ...,  60.,   1.,   1.],
       ...,
       [  0.,   0.,   0., ..., 360.,   1.,   1.],
       [  0.,   0.,   0., ..., 240.,   1.,   2.],
       [  1.,   1.,   0., ..., 360.,   1.,   1.]])

It's working! Let's transform the test dataset as well:

In [26]:
X_test_transformed = preprocess.transform(X_test)

In [27]:
X_test_transformed.shape

(154, 11)

In [28]:
X_test_transformed

array([[  1.,   1.,   0., ..., 360.,   1.,   2.],
       [  1.,   1.,   0., ..., 360.,   1.,   2.],
       [  1.,   1.,   2., ..., 360.,   1.,   0.],
       ...,
       [  0.,   1.,   0., ..., 360.,   1.,   2.],
       [  1.,   1.,   2., ..., 360.,   0.,   2.],
       [  1.,   0.,   0., ..., 480.,   1.,   1.]])

At this point the predictors $x$ are ready to go. The final data processing step is to convert the response variable to a numpy array.

In [48]:
y_train.head()

92     Y
304    Y
68     Y
15     Y
211    N
Name: Loan_Status, dtype: object

In [49]:
y_train = y_train.replace({'Y':1, 'N':0}).values  # as_matrix() was removed in pandas 1.0
y_test = y_test.replace({'Y':1, 'N':0}).values

In [50]:
y_train[:5]

array([1, 1, 1, 1, 0])

## Build model

Now it's time to build our model.

We'll set the grid search for parameter optimization.

In [51]:
param_grid = {"randomforestclassifier__n_estimators" : [10, 20, 30],
             "randomforestclassifier__max_depth" : [None, 6, 8, 10],
             "randomforestclassifier__max_leaf_nodes": [None, 5, 10, 20], 
             "randomforestclassifier__min_impurity_split": [0.1, 0.2, 0.3]}

After that we build a pipeline that chains the pre-processing step and the classifier.

In [52]:
pipe = make_pipeline(PreProcessing(), RandomForestClassifier())
pipe

Pipeline(memory=None,
         steps=[('preprocessing', PreProcessing()),
                ('randomforestclassifier',
                 RandomForestClassifier(bootstrap=True, class_weight=None,
                                        criterion='gini', max_depth=None,
                                        max_features='auto',
                                        max_leaf_nodes=None,
                                        min_impurity_decrease=0.0,
                                        min_impurity_split=None,
                                        min_samples_leaf=1, min_samples_split=2,
                                        min_weight_fraction_leaf=0.0,
                                        n_estimators='warn', n_jobs=None,
                                        oob_score=False, random_state=None,
                                        verbose=0, warm_start=False))],
         verbose=False)
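
The `randomforestclassifier` step name shown in the pipeline, and the matching prefix in the grid keys, come from `make_pipeline`, which names each step after its lowercased class name. `GridSearchCV` then routes every `step__parameter` key to the matching step. A small sketch, using `StandardScaler` as a stand-in first step (an assumption; the notebook uses `PreProcessing`):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# StandardScaler is only a placeholder for the custom pre-processing step
demo_pipe = make_pipeline(StandardScaler(), RandomForestClassifier())

step_names = list(demo_pipe.named_steps)
print(step_names)

# Every tunable parameter is exposed as '<step name>__<parameter>'
forest_params = [name for name in demo_pipe.get_params()
                 if name.startswith('randomforestclassifier__')]
print(forest_params[:5])
```

Note that `min_impurity_split` was removed in scikit-learn 1.0; on recent versions `min_impurity_decrease` is the parameter to tune instead.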

Then we run the grid search with 3-fold cross-validation.

In [53]:
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3)
grid

GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('preprocessing', PreProcessing()),
                                       ('randomforestclassifier',
                                        RandomForestClassifier(bootstrap=True,
                                                               class_weight=None,
                                                               criterion='gini',
                                                               max_depth=None,
                                                               max_features='auto',
                                                               max_leaf_nodes=None,
                                                               min_impurity_decrease=0.0,
                                                               min_impurity_split=None,
                                                               min_samples_leaf=1,
            

In [55]:
grid.fit(X_train, y_train)

GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('preprocessing', PreProcessing()),
                                       ('randomforestclassifier',
                                        RandomForestClassifier(bootstrap=True,
                                                               class_weight=None,
                                                               criterion='gini',
                                                               max_depth=None,
                                                               max_features='auto',
                                                               max_leaf_nodes=None,
                                                               min_impurity_decrease=0.0,
                                                               min_impurity_split=None,
                                                               min_samples_leaf=1,
            

In [56]:
print("Best parameters: {}".format(grid.best_params_))

Best parameters: {'randomforestclassifier__max_depth': None, 'randomforestclassifier__max_leaf_nodes': 5, 'randomforestclassifier__min_impurity_split': 0.1, 'randomforestclassifier__n_estimators': 30}


In [57]:
print("Test set score: {:.2f}".format(grid.score(X_test, y_test)))

Test set score: 0.77
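
For a classifier, `grid.score` reports plain accuracy unless a `scoring` argument is passed to `GridSearchCV`. Because accuracy hides which class the errors fall in, a confusion matrix is a useful companion. A sketch on made-up labels (not the notebook's predictions):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels: 1 = loan approved, 0 = rejected
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_pred = np.array([1, 1, 1, 1, 0, 1, 1, 1])

accuracy = accuracy_score(y_true, y_pred)
# Rows are true classes, columns are predicted classes, in label order [0, 1]
matrix = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(accuracy)
print(matrix)
```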


## Make predictions for separate test data

In [58]:
test_df = pd.read_csv('../data/test.csv', encoding="utf-8-sig")
test_df = test_df.head()
test_df

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
0,LP001015,Male,Yes,0,Graduate,No,5720,0,110.0,360.0,1.0,Urban
1,LP001022,Male,Yes,1,Graduate,No,3076,1500,126.0,360.0,1.0,Urban
2,LP001031,Male,Yes,2,Graduate,No,5000,1800,208.0,360.0,1.0,Urban
3,LP001035,Male,Yes,2,Graduate,No,2340,2546,100.0,360.0,,Urban
4,LP001051,Male,No,0,Not Graduate,No,3276,0,78.0,360.0,1.0,Urban


In [59]:
grid.predict(test_df)

array([1, 1, 1, 1, 1])

## Saving the machine learning model

In [64]:
filename = 'model_v1.pk'
with open('../models/'+filename, 'wb') as file:
    pickle.dump(grid, file)

Then we load the saved model and run it on `test_df` to check that it gives the same predictions.

In [66]:
with open('../models/'+filename ,'rb') as f:
    loaded_model = pickle.load(f)
    
loaded_model.predict(test_df)

array([1, 1, 1, 1, 1])
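
A round trip like this can also be checked programmatically rather than by eye. The sketch below uses the standard-library `pickle` and a small classifier trained on random data as stand-ins (both are assumptions; the notebook uses `dill` and the fitted grid search), and writes to a temporary directory instead of `../models/`:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Random stand-in data; only the save/load mechanics matter here
rng = np.random.RandomState(0)
X = rng.rand(40, 3)
y = (X[:, 0] > 0.5).astype(int)

model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model_demo.pk')
    with open(path, 'wb') as f:
        pickle.dump(model, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions exactly
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print(same_predictions)
```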

Woohoo! It's working like a charm!

## Create the API with Flask

In [68]:
# Filename: server.py
from io import StringIO

import dill as pickle  # the model was saved with dill, so it is loaded the same way
import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)


@app.errorhandler(400)
def bad_request(error=None):
    # Response sent when the payload is missing, empty or cannot be parsed
    message = {'status': 400, 'message': 'Bad Request: ' + request.url}
    resp = jsonify(message)
    resp.status_code = 400
    return resp


@app.route('/predict', methods=['POST'])
def apicall():
    # API Call

    # Pandas dataframe (sent as a payload) from API Call.
    # The client sends a JSON-encoded string, so get_json() returns a str here.
    try:
        test_json = request.get_json()
        test = pd.read_json(StringIO(test_json), orient='records')

        # To resolve the issue of TypeError: Cannot compare types 'ndarray(dtype=int64)' and 'str'
        test['Dependents'] = [str(x) for x in list(test['Dependents'])]

        # Getting the Loan_IDs separated out
        loan_ids = test['Loan_ID']

    except Exception:
        return bad_request()

    clf = 'model_v1.pk'

    if test.empty:
        return bad_request()
    else:
        # Load the saved model
        print("Loading the model...")
        loaded_model = None
        with open('./models/' + clf, 'rb') as f:
            loaded_model = pickle.load(f)

        print("The model has been loaded...doing predictions now...")
        predictions = loaded_model.predict(test)

        # Pair each Loan_ID with its prediction in a new pandas dataframe
        prediction_series = list(pd.Series(predictions))

        final_predictions = pd.DataFrame(list(zip(loan_ids, prediction_series)))

        # We can be as creative as we like in sending the responses,
        # but we need to send the response codes as well.
        responses = jsonify(predictions=final_predictions.to_json(orient="records"))
        responses.status_code = 200

        return responses

In [78]:
import json
import requests

# Setting the headers to send and accept json responses
header = {'Content-Type': 'application/json',
          'Accept': 'application/json'}

# Reading test batch
df = pd.read_csv('../data/test.csv', encoding="utf-8-sig")
df = df.head()

#Converting Pandas Dataframe to json
data = df.to_json(orient='records')
data

'[{"Loan_ID":"LP001015","Gender":"Male","Married":"Yes","Dependents":"0","Education":"Graduate","Self_Employed":"No","ApplicantIncome":5720,"CoapplicantIncome":0,"LoanAmount":110.0,"Loan_Amount_Term":360.0,"Credit_History":1.0,"Property_Area":"Urban"},{"Loan_ID":"LP001022","Gender":"Male","Married":"Yes","Dependents":"1","Education":"Graduate","Self_Employed":"No","ApplicantIncome":3076,"CoapplicantIncome":1500,"LoanAmount":126.0,"Loan_Amount_Term":360.0,"Credit_History":1.0,"Property_Area":"Urban"},{"Loan_ID":"LP001031","Gender":"Male","Married":"Yes","Dependents":"2","Education":"Graduate","Self_Employed":"No","ApplicantIncome":5000,"CoapplicantIncome":1800,"LoanAmount":208.0,"Loan_Amount_Term":360.0,"Credit_History":1.0,"Property_Area":"Urban"},{"Loan_ID":"LP001035","Gender":"Male","Married":"Yes","Dependents":"2","Education":"Graduate","Self_Employed":"No","ApplicantIncome":2340,"CoapplicantIncome":2546,"LoanAmount":100.0,"Loan_Amount_Term":360.0,"Credit_History":null,"Property_Are

In [82]:
# POST <url>/predict
resp = requests.post("http://0.0.0.0:8000/predict",
                     data = json.dumps(data),
                     headers = header)

ConnectionError: HTTPConnectionPool(host='0.0.0.0', port=8000): Max retries exceeded with url: /predict (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3063f76438>: Failed to establish a new connection: [Errno 111] Connection refused',))

In [83]:
resp.status_code

NameError: name 'resp' is not defined

In [None]:
resp.json()
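
The `ConnectionError` above means nothing was listening on port 8000 when the request was sent, that is, `server.py` had not been started. While developing, the endpoint can also be exercised without a running server through Flask's built-in test client. A self-contained sketch, with a hypothetical rule standing in for the saved model (the rule and the trimmed payload are assumptions made for illustration):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route('/predict', methods=['POST'])
def predict():
    # Hypothetical stand-in for the real model: approve when Credit_History == 1
    records = request.get_json()
    predictions = [
        {'Loan_ID': r['Loan_ID'],
         'prediction': int(r.get('Credit_History') == 1.0)}
        for r in records
    ]
    return jsonify(predictions=predictions)


# The test client sends requests straight to the app, no network involved
client = app.test_client()
payload = [
    {'Loan_ID': 'LP001015', 'Credit_History': 1.0},
    {'Loan_ID': 'LP001035', 'Credit_History': None},
]
resp = client.post('/predict', json=payload)
status = resp.status_code
body = resp.get_json()
print(status, body)
```

Once the route behaves as expected here, the same request can be sent with `requests.post` against a started server.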