# Decision Tree Learning: Prosper Loan Dataset

A decision tree is a learned set of rules that lets us make decisions about data.
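To make "a set of rules" concrete, here is a minimal sketch of what a very small tree amounts to once it is written out as code. The function name and the thresholds are hypothetical and made up for illustration; they are not learned from the Prosper data, and the 0/1 outputs simply stand for two class labels.

```python
# Hypothetical hand-written rules, only to show the shape of a decision tree.
# The thresholds are invented for illustration, not learned from any data.
def predict_label(credit_score, stated_monthly_income):
    """Walk from the root to a leaf and return a class label (0 or 1)."""
    if credit_score >= 700:            # root node: first question asked
        return 1                       # leaf
    if stated_monthly_income >= 5000:  # second-level node
        return 1                       # leaf
    return 0                           # leaf

print(predict_label(720, 3000))
print(predict_label(650, 6000))
print(predict_label(650, 2000))
```

A learned tree has the same structure; the difference is that the training algorithm chooses the questions and thresholds from the data.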

We are going to look at the Prosper loan dataset, which records the history of loans made by Prosper.

In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


print('Spark UI running on http://YOURIPADDRESS:' + sc.uiWebUrl.split(':')[2])

## Step 1: Load the Data

In [None]:
## small file, start with this
datafile = "/data/prosper-loan/prosper-loan-data-sample.csv"

## this is a large file
# datafile = "/data/prosper-loan/prosper-loan-data.csv.gz"

In [None]:
%%time
data = spark.read. \
          option("header", "true"). \
          option("inferSchema", "true").  \
          csv(datafile)

In [None]:
print("read {:,} records".format(data.count()))
# schema
data.printSchema()

In [None]:
## print with pandas
data.limit(10).toPandas()

In [None]:
## TODO : select a few columns 
## start with: 'LoanStatus', 'ProsperScore',  'EmploymentStatus', 'CreditScore', 'StatedMonthlyIncome'
## we can add more later

select_columns = ['LoanStatus', 'ProsperScore', '???', '???', '???']

## Note : vector columns can only hold numbers, so don't include categorical columns here
## And definitely not 'LoanStatus'  (if you are curious, include it and see what happens!)
## 'EmpIndex' does not exist yet; it is created from 'EmploymentStatus' in Step 3
vector_columns = ['ProsperScore', 'EmpIndex', 'CreditScore', 'StatedMonthlyIncome']



In [None]:
## TODO : Extract only the columns we are interested in
## Hint : 'select_columns'
prosper = data.select(???)  

print (prosper.count())
prosper.limit(10).toPandas()

## Step 2 : Clean Data

In [None]:
## TODO :  Drop any NA, null values.  
## Hint : Using `.na.drop()`
prosper_clean = prosper.na.???()

print("Original record count {:,}, cleaned records count {:,},  dropped {:,}"\
      .format(prosper.count(), prosper_clean.count(), (prosper.count() - prosper_clean.count())))
prosper_clean.show()


## Look at some summary data

**=> Q : What do the summaries below say about the cardinality of the categorical columns?**


In [None]:
## TODO : use 'describe()' and then 'toPandas()'
prosper_clean.???().???()

In [None]:
## TODO : Look at some summaries
## We are going to group counts by 'LoanStatus',  'EmploymentStatus'

prosper_clean.groupBy('LoanStatus').count().show()
prosper_clean.groupBy('EmploymentStatus').count().show()

## Step 3: Converting Categorical columns 

Convert categorical columns to numeric.   
Here, let's convert the **EmploymentStatus** column.
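As a rough picture of what string indexing does, the sketch below maps each distinct string to a number in plain Python, without Spark. It mirrors StringIndexer's default ordering, where the most frequent label gets index 0.0. The helper name `fit_string_index`, the sample values, and the alphabetical tie-break are assumptions of this sketch.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct string to a float index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency; break ties alphabetically (sketch assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Made-up sample column, not taken from the dataset.
statuses = ["Employed", "Full-time", "Employed", "Retired", "Employed", "Full-time"]

mapping = fit_string_index(statuses)            # the "fit" step
emp_index = [mapping[s] for s in statuses]      # the "transform" step

print(mapping)
print(emp_index)
```

The two steps correspond to `fit` (learn the mapping from the data) and `transform` (apply it to every row), which is why the notebook calls both.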

In [None]:
from pyspark.ml.feature import StringIndexer

## TODO : Create a StringIndexer with inputCol='EmploymentStatus'  and outputCol = 'EmpIndex'
strIndexer_employment = StringIndexer(inputCol="???", outputCol="???")
prosper_indexed = strIndexer_employment.fit(prosper_clean).transform(prosper_clean)

prosper_indexed.limit(10).toPandas()

## Step 4: Build feature vectors using VectorAssembler.

In [None]:
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=vector_columns, outputCol="features")
feature_vector = assembler.transform(prosper_indexed)
feature_vector = feature_vector.withColumn("label", feature_vector["LoanStatus"])

feature_vector.limit(10).toPandas()

## Step 5: Split Data into training and test.

We will split the data into training and test sets.  (You know the drill by now.)
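The idea behind a weighted random split can be sketched in plain Python: each row is assigned independently at random, so the resulting sizes are close to the requested fractions but not exact. The helper `random_split` and the toy rows are hypothetical. Spark's `randomSplit` also accepts an optional `seed` argument if you want repeatable splits.

```python
import random

def random_split(rows, train_fraction, seed):
    """Assign each row to train or test independently at random."""
    rng = random.Random(seed)   # fixed seed makes the split repeatable
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test

rows = list(range(1000))        # toy stand-in for the real records
train, test = random_split(rows, 0.7, seed=42)

# Sizes are roughly 70% / 30%, not exactly 700 / 300.
print("training set =", len(train))
print("testing set  =", len(test))
```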

In [None]:
## TODO :  Split the data into 70% training and 30% test sets 
## Hint : 0.7   , 0.3
(training, test) =  feature_vector.randomSplit([???,  ???])
print("training set = " , training.count())
print("testing set = " , test.count())

## Step 6: Decision Tree

### 6.1 Create Tree

In [None]:
from pyspark.ml.classification import DecisionTreeClassifier


## TODO : Create a DecisionTreeClassifier with maxBins set to 5000
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features", maxBins=???)

### 6.2 Train Tree

In [None]:
%%time

## TODO : train with training data set
## Hint : training

print ("training starting...")
tree_model = dt.fit(???)
print ("training done")

### 6.3 Print Tree

In [None]:
## TODO : Observe the output
## how many nodes does the tree have?
print(tree_model)
print()
print(tree_model.toDebugString)

### 6.4 Predict

In [None]:
## TODO : create predictions using test dataset
## Hint : test
predictions = tree_model.transform(???)

## drop the vector-valued columns so the table is easier to read
predictions2 = predictions.drop('rawPrediction', 'probability')
predictions2.limit(10).toPandas()


## Step 7: Evaluate the model.

Let us check to see how the model did, using accuracy as a measure.
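Accuracy is simply the fraction of rows where the predicted label equals the true label. A minimal plain-Python sketch, using made-up labels and a hypothetical helper name `accuracy_score_simple`:

```python
def accuracy_score_simple(labels, predictions):
    """Return the fraction of predictions that match the true labels."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy of an empty set")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

# Made-up values for illustration only.
true_labels = [1, 0, 1, 1, 0]
predicted   = [1, 0, 0, 1, 1]

print(accuracy_score_simple(true_labels, predicted))  # 3 of 5 correct -> 0.6
```

Keep in mind that accuracy can look good on an imbalanced dataset even when the model rarely predicts the minority class, which is one reason to also inspect the confusion matrix.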

### 7.1 Model Accuracy

In [None]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Compare (prediction, true label) pairs and compute accuracy on the test set
evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("accuracy ",  accuracy)


### 7.2 Confusion Matrix

In [None]:
predictions.groupBy('LoanStatus').pivot('prediction', [0,1]).count().na.fill(0).orderBy('LoanStatus').show()
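The pivot above counts how many rows fall into each (actual, predicted) pair. The same bookkeeping can be sketched in plain Python; the helper `confusion_counts` and the sample labels are made up for illustration.

```python
from collections import Counter

def confusion_counts(labels, predictions, classes=(0, 1)):
    """Return {actual: {predicted: count}} for every pair of classes."""
    pairs = Counter(zip(labels, predictions))
    return {
        actual: {pred: pairs.get((actual, pred), 0) for pred in classes}
        for actual in classes
    }

# Made-up values for illustration only.
actual_labels = [1, 0, 1, 1, 0, 0]
predicted     = [1, 0, 0, 1, 1, 0]

matrix = confusion_counts(actual_labels, predicted)
for actual, row in matrix.items():
    # Each row is one actual class; each column is one predicted class.
    print(actual, row)
```

The diagonal entries (actual equals predicted) are the correct predictions; the off-diagonal entries show which kind of mistake the model makes more often.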

## Step 8: Improve Accuracy

### Add more data
In Step 1, change `datafile` to the full dataset and see how the accuracy above changes.

### Add more features
Look at the schema of the full dataset.  Are there any columns you want to add?