# Loan Status Prediction

We are going to build a model to predict whether a loan application is approved.

In [97]:
df = spark.read.csv("gs://newbms1-dataproc/loans.csv", header=True, inferSchema=True)
df.count()

614

In [98]:
df.show(5)

+--------+------+-------+----------+------------+-------------+---------------+-----------------+----------+----------------+--------------+-------------+-----------+
| Loan_ID|Gender|Married|Dependents|   Education|Self_Employed|ApplicantIncome|CoapplicantIncome|LoanAmount|Loan_Amount_Term|Credit_History|Property_Area|Loan_Status|
+--------+------+-------+----------+------------+-------------+---------------+-----------------+----------+----------------+--------------+-------------+-----------+
|LP001002|  Male|     No|         0|    Graduate|           No|           5849|              0.0|      null|             360|             1|        Urban|          Y|
|LP001003|  Male|    Yes|         1|    Graduate|           No|           4583|           1508.0|       128|             360|             1|        Rural|          N|
|LP001005|  Male|    Yes|         0|    Graduate|          Yes|           3000|              0.0|        66|             360|             1|        Urban|          Y|

In [99]:
df.printSchema()

root
 |-- Loan_ID: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Married: string (nullable = true)
 |-- Dependents: string (nullable = true)
 |-- Education: string (nullable = true)
 |-- Self_Employed: string (nullable = true)
 |-- ApplicantIncome: integer (nullable = true)
 |-- CoapplicantIncome: double (nullable = true)
 |-- LoanAmount: integer (nullable = true)
 |-- Loan_Amount_Term: integer (nullable = true)
 |-- Credit_History: integer (nullable = true)
 |-- Property_Area: string (nullable = true)
 |-- Loan_Status: string (nullable = true)



## Data Preprocessing

In [100]:
# Drop any unnecessary columns 
df = df.drop('CoapplicantIncome')
df.show(3)
# Drop records with null values 
df2 = df.dropna()
df2.count()

+--------+------+-------+----------+---------+-------------+---------------+----------+----------------+--------------+-------------+-----------+
| Loan_ID|Gender|Married|Dependents|Education|Self_Employed|ApplicantIncome|LoanAmount|Loan_Amount_Term|Credit_History|Property_Area|Loan_Status|
+--------+------+-------+----------+---------+-------------+---------------+----------+----------------+--------------+-------------+-----------+
|LP001002|  Male|     No|         0| Graduate|           No|           5849|      null|             360|             1|        Urban|          Y|
|LP001003|  Male|    Yes|         1| Graduate|           No|           4583|       128|             360|             1|        Rural|          N|
|LP001005|  Male|    Yes|         0| Graduate|          Yes|           3000|        66|             360|             1|        Urban|          Y|
+--------+------+-------+----------+---------+-------------+---------------+----------+----------------+--------------+-------------+-----------+

480

In [101]:
df2.show(3)

+--------+------+-------+----------+------------+-------------+---------------+----------+----------------+--------------+-------------+-----------+
| Loan_ID|Gender|Married|Dependents|   Education|Self_Employed|ApplicantIncome|LoanAmount|Loan_Amount_Term|Credit_History|Property_Area|Loan_Status|
+--------+------+-------+----------+------------+-------------+---------------+----------+----------------+--------------+-------------+-----------+
|LP001003|  Male|    Yes|         1|    Graduate|           No|           4583|       128|             360|             1|        Rural|          N|
|LP001005|  Male|    Yes|         0|    Graduate|          Yes|           3000|        66|             360|             1|        Urban|          Y|
|LP001006|  Male|    Yes|         0|Not Graduate|           No|           2583|       120|             360|             1|        Urban|          Y|
+--------+------+-------+----------+------------+-------------+---------------+----------+----------------+--------------+-------------+-----------+

In [102]:
# Convert monthly income to annual income
df2 = df2.withColumn('annualIncome',(df2.ApplicantIncome*12)).drop('ApplicantIncome')
df2.show(3)

+--------+------+-------+----------+------------+-------------+----------+----------------+--------------+-------------+-----------+------------+
| Loan_ID|Gender|Married|Dependents|   Education|Self_Employed|LoanAmount|Loan_Amount_Term|Credit_History|Property_Area|Loan_Status|annualIncome|
+--------+------+-------+----------+------------+-------------+----------+----------------+--------------+-------------+-----------+------------+
|LP001003|  Male|    Yes|         1|    Graduate|           No|       128|             360|             1|        Rural|          N|       54996|
|LP001005|  Male|    Yes|         0|    Graduate|          Yes|        66|             360|             1|        Urban|          Y|       36000|
|LP001006|  Male|    Yes|         0|Not Graduate|           No|       120|             360|             1|        Urban|          Y|       30996|
+--------+------+-------+----------+------------+-------------+----------+----------------+--------------+-------------+-----------+------------+

In [103]:
# Let's convert categorical values to numerical values
from pyspark.ml.feature import StringIndexer

# List the categorical columns to index (Loan_Status is the target column)
categoricalColumns = ['Gender', 'Married', 'Education', 'Property_Area', 'Self_Employed', 'Loan_Status']

# For each column: fit a StringIndexer, add the '<column>_idx' column, and drop the original
for categoricalCol in categoricalColumns:
    indexer = StringIndexer(inputCol = categoricalCol, outputCol = categoricalCol + '_idx')
    indexer_model = indexer.fit(df2)
    df2 = indexer_model.transform(df2)
    df2 = df2.drop(categoricalCol)

df2.show(20)

+--------+----------+----------+----------------+--------------+------------+----------+-----------+-------------+-----------------+-----------------+---------------+
| Loan_ID|Dependents|LoanAmount|Loan_Amount_Term|Credit_History|annualIncome|Gender_idx|Married_idx|Education_idx|Property_Area_idx|Self_Employed_idx|Loan_Status_idx|
+--------+----------+----------+----------------+--------------+------------+----------+-----------+-------------+-----------------+-----------------+---------------+
|LP001003|         1|       128|             360|             1|       54996|       0.0|        0.0|          0.0|              2.0|              0.0|            1.0|
|LP001005|         0|        66|             360|             1|       36000|       0.0|        0.0|          0.0|              1.0|              1.0|            0.0|
|LP001006|         0|       120|             360|             1|       30996|       0.0|        0.0|          1.0|              1.0|              0.0|            0.0|
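
The index values above follow from StringIndexer's default ordering, which assigns 0.0 to the most frequent value in a column, 1.0 to the next most frequent, and so on. This is why `Loan_Status = N` becomes 1.0 for LP001003. The plain-Python sketch below only illustrates that rule; the function name `fit_string_indexer`, the toy input list, and the alphabetical tie-break are assumptions made for illustration, not Spark's implementation.

```python
from collections import Counter

def fit_string_indexer(values):
    """Map each distinct string to an index, most frequent value first."""
    counts = Counter(values)
    # Sort by descending frequency; break ties alphabetically (an assumption).
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

# Toy data (hypothetical): "Y" occurs more often than "N".
loan_status = ["Y", "N", "Y", "Y", "N"]
mapping = fit_string_indexer(loan_status)
print(mapping)
```

Because the mapping depends on frequencies in the data, it is worth checking which original value each index stands for before interpreting per-label metrics.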

In [104]:
from pyspark.sql.functions import when

# Convert the string-valued Dependents column ("0", "1", "2", "3+") to numbers, treating "3+" as 3
df2 = df2.withColumn("Dependents", when(df2.Dependents == "3+", 3)
                                 .when(df2.Dependents == "2", 2)
                                 .when(df2.Dependents == "1", 1)
                                 .when(df2.Dependents == "0", 0)
                                 .otherwise(df2.Dependents))

In [105]:
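# Save the pre-processed data; toPandas() collects all rows to the driver, which is fine for 480 records
df2.toPandas().to_csv("gs://newbms1-dataproc/loans2-dt/loans.csv", index=False)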

## Modeling

In [106]:
# Read the pre-processed data into a dataframe
df = spark.read.csv("gs://newbms1-dataproc/loans2-dt/loans.csv", header=True, inferSchema=True)
df.show(20)

+--------+----------+----------+----------------+--------------+------------+----------+-----------+-------------+-----------------+-----------------+---------------+
| Loan_ID|Dependents|LoanAmount|Loan_Amount_Term|Credit_History|annualIncome|Gender_idx|Married_idx|Education_idx|Property_Area_idx|Self_Employed_idx|Loan_Status_idx|
+--------+----------+----------+----------------+--------------+------------+----------+-----------+-------------+-----------------+-----------------+---------------+
|LP001003|         1|       128|             360|             1|       54996|       0.0|        0.0|          0.0|              2.0|              0.0|            1.0|
|LP001005|         0|        66|             360|             1|       36000|       0.0|        0.0|          0.0|              1.0|              1.0|            0.0|
|LP001006|         0|       120|             360|             1|       30996|       0.0|        0.0|          1.0|              1.0|              0.0|            0.0|

In [107]:
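# The index values imply a numerical ordering that the categories do not have,
# so let's apply one-hot encoding (OHE) to them.
# Note: in Spark 3.0 and later this class is named OneHotEncoder.
from pyspark.ml.feature import OneHotEncoderEstimator

# Create an encoder object with the input and output columns specified.
# The output columns are vectors.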
encoder = OneHotEncoderEstimator(inputCols=['Gender_idx','Married_idx','Education_idx','Self_Employed_idx','Property_Area_idx'], 
                                 outputCols=["Gender_idx_vector","Married_idx_vector", "Education_idx_vector","Self_Employed_idx_vector","Property_Area_idx_vector"])

In [108]:
# Encoder first identifies the categories in the data
encoder_model = encoder.fit(df)

In [109]:
# Encoder then converts the categorical data into encoded vectors, in a new column
df = encoder_model.transform(df)
df.show(10)

+--------+----------+----------+----------------+--------------+------------+----------+-----------+-------------+-----------------+-----------------+---------------+------------------------+--------------------+------------------------+-----------------+------------------+
| Loan_ID|Dependents|LoanAmount|Loan_Amount_Term|Credit_History|annualIncome|Gender_idx|Married_idx|Education_idx|Property_Area_idx|Self_Employed_idx|Loan_Status_idx|Self_Employed_idx_vector|Education_idx_vector|Property_Area_idx_vector|Gender_idx_vector|Married_idx_vector|
+--------+----------+----------+----------------+--------------+------------+----------+-----------+-------------+-----------------+-----------------+---------------+------------------------+--------------------+------------------------+-----------------+------------------+
|LP001003|         1|       128|             360|             1|       54996|       0.0|        0.0|          0.0|              2.0|              0.0|            1.0|         

In [110]:
# Rename the target column to "label", the default label column name that Spark ML classifiers look for
df = df.withColumnRenamed("Loan_Status_idx","label")

In [111]:
# Import the VectorAssembler class. It combines the feature columns into a single feature vector,
# which is the input format that PySpark machine learning algorithms require.
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=["Dependents", "LoanAmount", "Loan_Amount_Term", "Credit_History", "annualIncome", "Gender_idx_vector",
                                       "Married_idx_vector", "Education_idx_vector","Self_Employed_idx_vector","Property_Area_idx_vector"],outputCol='features')
df = assembler.transform(df)
df.show(5)

+--------+----------+----------+----------------+--------------+------------+----------+-----------+-------------+-----------------+-----------------+-----+------------------------+--------------------+------------------------+-----------------+------------------+--------------------+
| Loan_ID|Dependents|LoanAmount|Loan_Amount_Term|Credit_History|annualIncome|Gender_idx|Married_idx|Education_idx|Property_Area_idx|Self_Employed_idx|label|Self_Employed_idx_vector|Education_idx_vector|Property_Area_idx_vector|Gender_idx_vector|Married_idx_vector|            features|
+--------+----------+----------+----------------+--------------+------------+----------+-----------+-------------+-----------------+-----------------+-----+------------------------+--------------------+------------------------+-----------------+------------------+--------------------+
|LP001003|         1|       128|             360|             1|       54996|       0.0|        0.0|          0.0|              2.0|          

In [112]:
df.select('features','label').show(3,truncate=False)

+-----------------------------------------------------+-----+
|features                                             |label|
+-----------------------------------------------------+-----+
|[1.0,128.0,360.0,1.0,54996.0,1.0,1.0,1.0,1.0,0.0,0.0]|1.0  |
|[0.0,66.0,360.0,1.0,36000.0,1.0,1.0,1.0,0.0,0.0,1.0] |0.0  |
|[0.0,120.0,360.0,1.0,30996.0,1.0,1.0,0.0,1.0,0.0,1.0]|0.0  |
+-----------------------------------------------------+-----+
only showing top 3 rows
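
The 11 entries in each `features` vector above can be accounted for as five numeric columns followed by the one-hot vectors, which by default omit the slot for the last category. The plain-Python sketch below rebuilds the first row (LP001003) by hand. The category counts (two for each binary column, three for `Property_Area_idx`) are assumptions inferred from the vector length and the index values shown earlier, and `one_hot_drop_last` is a hypothetical helper, not a Spark function.

```python
def one_hot_drop_last(index, num_categories):
    """One-hot encode an index, omitting the slot for the last category."""
    vec = [0.0] * (num_categories - 1)
    if index < num_categories - 1:
        vec[index] = 1.0
    return vec

# Assumed number of categories per indexed column (inferred, not read from Spark).
num_categories = {
    "Gender_idx": 2,
    "Married_idx": 2,
    "Education_idx": 2,
    "Self_Employed_idx": 2,
    "Property_Area_idx": 3,
}

# Values of the first row (LP001003) as printed in the table above.
row = {
    "Dependents": 1, "LoanAmount": 128, "Loan_Amount_Term": 360,
    "Credit_History": 1, "annualIncome": 54996,
    "Gender_idx": 0, "Married_idx": 0, "Education_idx": 0,
    "Self_Employed_idx": 0, "Property_Area_idx": 2,
}

numeric_cols = ["Dependents", "LoanAmount", "Loan_Amount_Term",
                "Credit_History", "annualIncome"]

# Concatenate numeric values and encoded vectors in the assembler's column order.
features = [float(row[c]) for c in numeric_cols]
for col, n in num_categories.items():
    features += one_hot_drop_last(row[col], n)

print(features)
```

The last category of each column is represented by an all-zero vector, which is why `Property_Area_idx = 2.0` contributes `0.0, 0.0` at the end of the first row.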



In [113]:
# Split the data into training and test datasets
train, test = df.randomSplit([0.8,0.2], seed=1)

[train.count(), test.count()]

[387, 93]
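
The split is 387/93 rather than exactly 384/96 because `randomSplit` treats the weights as sampling probabilities, not as exact quotas. The plain-Python sketch below illustrates that idea with an independent random draw per row; it is a simplified stand-in, not Spark's implementation, and `random_split` is a hypothetical helper, so its counts will not match the Spark output.

```python
import random

def random_split(rows, train_fraction, seed):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row lands in train with probability train_fraction.
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(480))  # stand-in for the 480 pre-processed records
train_rows, test_rows = random_split(rows, 0.8, seed=1)
print(len(train_rows), len(test_rows))
```

Fixing the seed makes the split reproducible, but the sizes still only approximate the requested proportions.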

In [114]:
from pyspark.ml.classification import DecisionTreeClassifier
tree = DecisionTreeClassifier()
tree_model = tree.fit(train)
prediction = tree_model.transform(test)

In [115]:
prediction.select('label','prediction','probability','features').show(10, truncate=False)

+-----+----------+----------------------------------------+-----------------------------------------------------+
|label|prediction|probability                             |features                                             |
+-----+----------+----------------------------------------+-----------------------------------------------------+
|0.0  |0.0       |[0.7,0.3]                               |[2.0,267.0,360.0,1.0,65004.0,1.0,1.0,1.0,0.0,0.0,1.0]|
|0.0  |0.0       |[0.7931034482758621,0.20689655172413793]|[2.0,200.0,360.0,1.0,36876.0,1.0,1.0,1.0,1.0,0.0,1.0]|
|1.0  |1.0       |[0.0,1.0]                               |[0.0,116.0,360.0,0.0,31200.0,1.0,1.0,0.0,1.0,1.0,0.0]|
|0.0  |0.0       |[0.8805970149253731,0.11940298507462686]|[0.0,122.0,360.0,1.0,33588.0,1.0,1.0,1.0,1.0,1.0,0.0]|
|1.0  |0.0       |[0.7931034482758621,0.20689655172413793]|[1.0,106.0,360.0,1.0,56304.0,1.0,0.0,1.0,0.0,0.0,0.0]|
|0.0  |0.0       |[0.8805970149253731,0.11940298507462686]|[0.0,134.0,360.0,1.0,47292.0,

In [116]:
# Count the true positive, false positive, false negative, and true negative cases (the confusion matrix)
prediction.groupBy("label", "prediction").count().show()

+-----+----------+-----+
|label|prediction|count|
+-----+----------+-----+
|  1.0|       1.0|   15|
|  0.0|       1.0|    2|
|  1.0|       0.0|   19|
|  0.0|       0.0|   57|
+-----+----------+-----+



In [117]:
# Calculating accuracy using a PySpark method
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(prediction)
print("Accuracy = %g" % accuracy)
print("Test Error = %g" % (1.0 - accuracy))

Accuracy = 0.774194
Test Error = 0.225806


In [118]:
# Precision
from pyspark.mllib.evaluation import MulticlassMetrics

results = prediction.select(['prediction', 'label'])
# MulticlassMetrics expects a resilient distributed dataset (RDD) of (prediction, label) pairs
predictionAndLabels = results.rdd
# Compute precision for label 1.0
metrics = MulticlassMetrics(predictionAndLabels)
precision = metrics.precision(label=1.0)
print("Precision: %g" % precision)

Precision: 0.882353


In [119]:
# Recall
recall = metrics.recall(label=1.0)
print("Recall: %g" % recall)

Recall: 0.441176


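- In this case, precision is the more relevant metric: approving an application that should have been rejected (a false positive for the "approved" class) leads to delinquency.
- Recall does not take false positives into account, so on its own it does not measure this risk.
- Note that StringIndexer assigned 1.0 to Loan_Status = N (see LP001003 in the indexed output), so the reported precision of 0.88 and recall of 0.44 describe the rejected class. For the approved class (label 0.0), the confusion matrix gives a precision of 57 / (57 + 19) = 0.75, so the model is fairly, but not highly, precise when it approves a loan.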