<a href="http://www.calstatela.edu/centers/hipic"><img align="left" src="https://avatars2.githubusercontent.com/u/4156894?v=3&s=100">
</a>
<img align="right" alt="California State University, Los Angeles" src="http://www.calstatela.edu/sites/default/files/groups/California%20State%20University%2C%20Los%20Angeles/master_logo_full_color_horizontal_centered.svg" style="width: 360px;"/>

# CIS5560 Term Project Tutorial

------
#### Authors: [Nikita Marathe](https://www.linkedin.com/in/nikita-dhanraj-marathe-544323138); [Nikeeta Akbari](https://www.linkedin.com/in/nikeeta-akbari/); [Rohit Tiwari](https://www.linkedin.com/in/rohit-tiwari-626012102/); [Zeeshan Khan](https://www.linkedin.com/in/zeeshan-k-0a6330b4/)

#### Instructor: [Jongwook Woo](https://www.linkedin.com/in/jongwook-woo-7081a85)

#### Date: 05/18/2018

## Home Mortgage Prediction Model using Decision Tree Classifier

In this exercise, you will implement a Decision Tree Classifier model to predict the status of a home mortgage application.

### Pre-requisites:

1. A Spark cluster with the default configuration, as provided by Databricks Community Edition.
2. The Home Mortgage Disclosure Act (HMDA) dataset, available for download here: https://www.consumerfinance.gov/data-research/hmda/explore
3. A Databricks Community Edition account. Sign up for free here: https://databricks.com/try-databricks

### Creating a Cluster

Sign in to your Databricks account, select the Clusters option on the left, and click Create Cluster. Enter a name for the cluster and click Create Cluster.

### Overview

Follow the steps below to build, train, and test the model from the source data:

1. Import HMDA_Data.csv as a table in Databricks.
2. Change the data type of each column as required.
3. Preprocess the data by removing rows with missing values.
4. Prepare the data by selecting the features (input columns) and the label (output column).
5. Split the data into training and testing sets using data.randomSplit(); rename label to trueLabel in the test set.
6. Transform the feature columns into a vector using VectorAssembler.
7. Set the features and label from the vector.
8. Build a Decision Tree Classifier model with the label and features.
9. Build a Pipeline with two stages: VectorAssembler and Decision Tree Classifier.
10. Train the model.
11. Prepare the testing DataFrame with features and label from the vector.
12. Predict on the testing DataFrame using the model trained in step 10.
13. Assess the performance of the model using AUC, precision, and recall.

### Upload CSV as a Table

The data for this project is provided as a CSV file containing details of Home Mortgage Application status. The data includes specific characteristics (or features) for each application, as well as a label column indicating the status of the application.

### Prepare the Data

First, import the libraries you will need and prepare the training and test data:

In [8]:
# Import Spark SQL and Spark ML libraries
from pyspark.sql.types import *
from pyspark.sql.functions import *

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator


### Read your csv file from its table at Databricks

In [10]:
# Load the source data
csv = sqlContext.sql("SELECT * FROM final_csv")

### Select Feature Columns and label Column
Using the DataFrame `select` method, we select the 15 feature columns that are most important for our predictive model, plus one label column, `action_taken_name` (aliased as `label`), which is the value the model predicts.

In [12]:
# Select features and label
data1 = csv.select("tract_to_msamd_income","population", "minority_population", "loan_amount_000s", "applicant_income_000s","purchaser_type_name","preapproval_name","owner_occupancy_name","loan_type_name","lien_status_name","co_applicant_sex_name","co_applicant_race_name_1","co_applicant_ethnicity_name","agency_name","agency_abbr",col("action_taken_name").alias("label"))

### Perform Data Transformation
After selecting the feature columns, it is important to clean the dataset. For this purpose, we use the dropna() function, which drops every row that contains a null value.

In [14]:
# Drop every row that has at least one null value
data2 = data1.dropna()
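
To make the effect of `dropna()` concrete, here is a minimal plain-Python sketch of the same rule on a few made-up rows. The row values and the helper name `drop_rows_with_nulls` are illustrative assumptions, not part of the notebook or of Spark.

```python
# Toy rows standing in for mortgage applications; the values are made up.
rows = [
    {"loan_amount_000s": 120, "applicant_income_000s": 65, "label": 1},
    {"loan_amount_000s": 300, "applicant_income_000s": None, "label": 0},
    {"loan_amount_000s": None, "applicant_income_000s": None, "label": 1},
    {"loan_amount_000s": 85, "applicant_income_000s": 40, "label": 0},
]


def drop_rows_with_nulls(rows):
    """Keep only the rows in which every column has a value."""
    return [row for row in rows if all(value is not None for value in row.values())]


clean = drop_rows_with_nulls(rows)
print(len(rows), "rows before,", len(clean), "rows after")
```

A single missing value is enough to remove a row, so it is worth comparing the row counts of `data1` and `data2`. If too many rows disappear, `dropna()` also accepts a `subset` argument that limits the check to the listed columns.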

### Splitting of Data
We use the randomSplit() function to split the data: 70% for training and 30% for testing. Generally, the larger share of the data is used for training and the smaller share for testing.

In [16]:
# Split the data
splits = data2.randomSplit([0.7, 0.3])
train = splits[0]
test = splits[1].withColumnRenamed("label", "trueLabel")
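
The split above is random, so each run produces different training and test sets and the split sizes are only approximately 70/30. The plain-Python sketch below illustrates the idea with an independent weighted coin flip per row and a fixed seed. The helper `random_split` and the seed value 42 are assumptions made for illustration, and Spark's own sampling differs in detail.

```python
import random


def random_split(rows, train_fraction, seed):
    """Assign each row to train or test independently, like a weighted coin flip."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test


rows = list(range(1000))  # stand-in for 1000 applications
train_rows, test_rows = random_split(rows, train_fraction=0.7, seed=42)
print(len(train_rows), len(test_rows))  # close to 700 and 300, but not necessarily exact
```

In the notebook, the same reproducibility can be requested with the optional `seed` argument, for example `data2.randomSplit([0.7, 0.3], seed=42)`.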

### Define the Pipeline
Now define a pipeline with two stages. The first stage, a VectorAssembler, combines the feature columns into a single vector column. The second stage trains a classification model using the Decision Tree algorithm; its input parameters are the label column (`label`, aliased from action_taken_name) and the features column (the vector created in the first stage). The pipeline is then built from the assembler and the Decision Tree classifier.

In [18]:
vectorAssembler = VectorAssembler(inputCols = ["tract_to_msamd_income","population", "minority_population", "loan_amount_000s", "applicant_income_000s","purchaser_type_name","preapproval_name","owner_occupancy_name","loan_type_name","lien_status_name","co_applicant_sex_name","co_applicant_race_name_1","co_applicant_ethnicity_name","agency_name","agency_abbr"], outputCol="features")
#Model1 - Decision Tree 
decision = DecisionTreeClassifier(labelCol="label", featuresCol= "features")
pipeline = Pipeline(stages=[vectorAssembler, decision])
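
Conceptually, the assembler stage turns the listed columns of each row into one numeric vector. The plain-Python sketch below mimics that for a single made-up row. The function `assemble_features` and the sample values are hypothetical, and the real VectorAssembler produces Spark ML vectors rather than Python lists.

```python
def assemble_features(row, input_cols):
    """Concatenate the named columns of one row into a single list of floats."""
    features = []
    for name in input_cols:
        value = row[name]
        if isinstance(value, str):
            # Mirrors the restriction that string columns cannot be assembled directly
            raise TypeError("column %r holds a string; encode it as a number first" % name)
        features.append(float(value))
    return features


# One made-up application row
row = {"population": 5120, "loan_amount_000s": 240, "applicant_income_000s": 95, "label": 1}
features = assemble_features(row, ["population", "loan_amount_000s", "applicant_income_000s"])
print(features)
```

VectorAssembler accepts numeric, boolean, and vector columns only, so any column listed in `inputCols` that is still stored as a string would need to be converted or indexed before the pipeline is fit.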

### Train the Model

In [20]:
# Fit the pipeline (VectorAssembler + Decision Tree) on the training data
model = pipeline.fit(train)

### Test the Model
Now you're ready to apply the model to the test data.

In [22]:
prediction = model.transform(test)
predicted = prediction.select("prediction", "trueLabel")

### Review the Area Under ROC
One way to assess the performance of a classification model is to measure the area under the ROC curve (AUC) for the model. The spark.ml library includes a **BinaryClassificationEvaluator** class that you can use to compute this.

In [24]:
evaluator = BinaryClassificationEvaluator(labelCol="trueLabel", rawPredictionCol="prediction", metricName="areaUnderROC")
auc = evaluator.evaluate(prediction)
print("AUC = {}".format(auc))
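
Because the evaluator above is given the hard 0/1 `prediction` column as its score, the ROC curve has only one interior point, and the area under it reduces to the average of the true positive rate and the true negative rate. The plain-Python sketch below shows that arithmetic on made-up (prediction, true label) pairs. The function name and the data are illustrative assumptions.

```python
def auc_from_hard_predictions(pairs):
    """pairs: list of (prediction, true_label) tuples, each value 0 or 1."""
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    tn = sum(1 for p, y in pairs if p == 0 and y == 0)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    tpr = tp / float(tp + fn)  # true positive rate (recall)
    tnr = tn / float(tn + fp)  # true negative rate (specificity)
    # Trapezoid area under the ROC through (0, 0), (1 - tnr, tpr), (1, 1)
    return (tpr + tnr) / 2.0


# Made-up results: 3 of 4 positives and 3 of 4 negatives classified correctly
pairs = [(1, 1), (1, 1), (1, 1), (0, 1), (0, 0), (0, 0), (0, 0), (1, 0)]
toy_auc = auc_from_hard_predictions(pairs)
print(toy_auc)
```

Passing the model's `rawPrediction` or `probability` column as `rawPredictionCol` instead would evaluate the ROC over many thresholds and would generally give a different AUC value.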

### Review the Recall And Precision
Another way to assess the performance of a classification model is to measure the precision and recall.

In [26]:
tp = float(predicted.filter("prediction == 1.0 AND trueLabel == 1").count())
fp = float(predicted.filter("prediction == 1.0 AND trueLabel == 0").count())
tn = float(predicted.filter("prediction == 0.0 AND trueLabel == 0").count())
fn = float(predicted.filter("prediction == 0.0 AND trueLabel == 1").count())
metrics = spark.createDataFrame([("Precision", tp / (tp + fp)), ("Recall", tp / (tp + fn))],["metric", "value"])
metrics.show()
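
The two formulas used in the cell above are precision = TP / (TP + FP) and recall = TP / (TP + FN). The plain-Python sketch below applies them to made-up counts and guards against a zero denominator. The helper `precision_recall` and the counts are assumptions for illustration only, not results from the HMDA data.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall), using 0.0 when a denominator is zero."""
    predicted_positive = tp + fp  # everything the model labeled positive
    actual_positive = tp + fn     # everything that truly is positive
    precision = tp / float(predicted_positive) if predicted_positive else 0.0
    recall = tp / float(actual_positive) if actual_positive else 0.0
    return precision, recall


# Made-up counts: 90 true positives, 10 false positives, 30 false negatives
precision, recall = precision_recall(tp=90, fp=10, fn=30)
print("Precision:", precision, "Recall:", recall)
```

The guard matters because the unguarded division raises a ZeroDivisionError if the model never predicts the positive class.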

### Result shows:
* AUC: 0.898
* Precision: 0.982
* Recall: 0.812
