# Decision Tree Learning: Prosper Loan Dataset

A decision tree is a learned set of rules that lets us make decisions about data.
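As a rough illustration of what "a learned set of rules" means, the sketch below hand-writes a tiny tree as nested conditions. The function name, thresholds, and outcome labels are invented for illustration and were not learned from the Prosper data; a real tree derives its splits from training examples.

```python
def toy_loan_rule(credit_score, debt_to_income):
    """Hand-written stand-in for a learned tree (all thresholds are hypothetical)."""
    if credit_score >= 700:
        # High-score branch: only a very high debt load changes the outcome.
        if debt_to_income > 0.5:
            return "Defaulted"
        return "Completed"
    # Low-score branch: a low debt load still predicts repayment.
    if debt_to_income < 0.2:
        return "Completed"
    return "Defaulted"


print(toy_loan_rule(720, 0.30))  # Completed
print(toy_loan_rule(640, 0.45))  # Defaulted
```

Each path from the first test down to a returned value is one rule; training a tree amounts to choosing those tests automatically.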

We are going to look at the Prosper loan dataset, which records the history of loans made by Prosper.

In [7]:
%matplotlib inline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StandardScaler
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

print('Spark UI running on http://YOURIPADDRESS:' + sc.uiWebUrl.split(':')[2])

Spark UI running on http://YOURIPADDRESS:4041


## Step 1: Load the Data

In [4]:
dataset = spark.read.csv("/data/prosper-loan/prosper-loan-data.csv.gz", 
                         header=True, inferSchema=True)


In [None]:
dataset.show(20)

In [2]:
# Define our columns for convenience.

columns = ['Term', 'BorrowerRate', 'ProsperRating (numeric)', 'ProsperScore', 'EmploymentStatusDuration', 'IsBorrowerHomeowner',
           'CreditScore', 'CurrentCreditLines', 'OpenCreditLines',
           'TotalCreditLinespast7years', 'OpenRevolvingAccounts', 'OpenRevolvingMonthlyPayment',
           'InquiriesLast6Months', 'TotalInquiries', 'CurrentDelinquencies', 'AmountDelinquent',
           'DelinquenciesLast7Years', 'PublicRecordsLast10Years', 'PublicRecordsLast12Months',
           'RevolvingCreditBalance', 'BankcardUtilization', 'AvailableBankcardCredit', 'TotalTrades',
           'TradesNeverDelinquent (percentage)', 'TradesOpenedLast6Months', 'DebtToIncomeRatio',
           'IncomeVerifiable', 'StatedMonthlyIncome', 'TotalProsperLoans', 'TotalProsperPaymentsBilled',
           'OnTimeProsperPayments', 'ProsperPaymentsLessThanOneMonthLate', 'ProsperPaymentsOneMonthPlusLate',
           'ProsperPrincipalBorrowed', 'ProsperPrincipalOutstanding', 'LoanOriginalAmount',
           'MonthlyLoanPayment', 'Recommendations', 'InvestmentFromFriendsCount', 'InvestmentFromFriendsAmount',
           'Investors', 'YearsWithCredit']

categorical_columns = ["BorrowerState", "EmploymentStatus", "ListingCategory"]
categorical_indexers = ["BorrowerState_index", "EmploymentStatus_index", "ListingCategory_index"]




In [5]:
dataset.select(columns).show(10)

(Output truncated: the first 10 rows of the selected columns, beginning with Term, BorrowerRate, ProsperRating (numeric), ProsperScore, EmploymentStatusDuration, IsBorrowerHomeowner, ...)

## Step 2: Examine the contents of the categorical columns.

Let's look at the contents of our categorical columns.

**=> TODO: Group by the categorical columns LoanStatus, BorrowerState, EmploymentStatus, and ListingCategory, and see the breakdowns by count**

In [None]:
dataset.groupBy(???).count().show()
dataset.groupBy(???).count().show()
dataset.groupBy(???).count().show()
dataset.groupBy(???).count().show(60)

**=> What does that say about the cardinality of these categorical columns?**
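Cardinality here means the number of distinct values in a column. The same group-and-count idea can be sketched in plain Python; the values below are made up and are not taken from the real dataset.

```python
from collections import Counter

# Hypothetical values standing in for one categorical column.
employment_status = ["Employed", "Full-time", "Employed", "Self-employed",
                     "Employed", "Full-time", "Retired"]

counts = Counter(employment_status)   # value -> number of rows
cardinality = len(counts)             # number of distinct values

for value, n in counts.most_common():
    print(value, n)
print("cardinality:", cardinality)    # cardinality: 4
```

A column with only a handful of distinct values behaves very differently in a tree from one with dozens, which is why the counts are worth checking before indexing.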



## Step 3: Convert Categorical Columns

We need to convert categorical columns to numerics.  Remember, Spark ML can *only* handle numeric columns.  There's a tool called StringIndexer that will help us here.

Because there are a lot of indexers, we build a pipeline to help us out here. 
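To make the indexing step concrete, here is a plain-Python sketch of the idea, assuming the default behavior where the most frequent label gets index 0. The helper names `fit_string_index` and `apply_string_index` are hypothetical, and this is not Spark's implementation.

```python
from collections import Counter


def fit_string_index(values):
    """Map each label to an index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending count, then by label so ties are deterministic.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def apply_string_index(mapping, value):
    """Send unseen labels to one extra bucket, in the spirit of handleInvalid="keep"."""
    return mapping.get(value, float(len(mapping)))


states = ["CA", "NY", "CA", "TX", "CA", "NY"]   # hypothetical sample
mapping = fit_string_index(states)
print(mapping)                            # {'CA': 0.0, 'NY': 1.0, 'TX': 2.0}
print(apply_string_index(mapping, "ZZ"))  # 3.0
```

The resulting numbers are identifiers, not magnitudes: index 2.0 is not "twice" index 1.0, which matters later when deciding whether a feature is treated as categorical or continuous.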

**=> TODO: Pass the list of all our indexers into the pipeline**

In [8]:
indexers = [StringIndexer(inputCol=column, outputCol=column+"_index", handleInvalid="keep").fit(dataset) for column in categorical_columns ]

print(len(indexers))

pipeline = Pipeline(stages=???)
df_r = pipeline.fit(dataset).transform(dataset)


3


## Step 4: Drop all NAs.

We're just going to drop NAs.
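Dropping NAs means discarding every row that has a missing value in any of the selected columns. A minimal plain-Python sketch of that rule, on invented rows:

```python
rows = [
    {"Term": 36, "CreditScore": 700, "LoanStatus": "Completed"},
    {"Term": 60, "CreditScore": None, "LoanStatus": "Current"},
    {"Term": 36, "CreditScore": 640, "LoanStatus": None},
]

# Keep only rows in which every field has a value.
complete_rows = [row for row in rows if all(v is not None for v in row.values())]

print(len(rows), "->", len(complete_rows))  # 3 -> 1
```

The approach is simple but can discard a large share of the data when many columns are sparsely populated, so it is worth comparing row counts before and after.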

**=> TODO: Drop all NAs**

In [None]:
na_dropped = df_r.select(columns + categorical_indexers + ['LoanStatus']).???() #TODO: Drop NAs


## Step 5: Build feature vectors using VectorAssembler.

In [None]:
assembler = VectorAssembler(inputCols=columns + categorical_indexers, outputCol="features")
fv = assembler.transform(na_dropped)

## Step 6: Build Indexers

We are going to build the label indexer, which creates a label column from the loan status.
We will also add a feature indexer, which identifies which features are categorical. (We should have 3.)

In [None]:
# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="LoanStatus", outputCol="indexedLabel").fit(fv)


In [None]:
# Automatically identify categorical features, and index them.
# We specify maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(fv)
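The `maxCategories` rule can be sketched in plain Python: a feature counts as categorical only if it has at most that many distinct values. The function name and the sample vectors are hypothetical, and this is an illustration of the rule rather than Spark's code.

```python
def categorical_feature_indices(vectors, max_categories):
    """Return indices of features with at most max_categories distinct values."""
    num_features = len(vectors[0])
    categorical = []
    for j in range(num_features):
        distinct = {vec[j] for vec in vectors}
        if len(distinct) <= max_categories:
            categorical.append(j)
    return categorical


# Hypothetical feature vectors: [Term, BorrowerRate, IsBorrowerHomeowner]
vectors = [
    [36.0, 0.158, 1.0],
    [60.0, 0.092, 0.0],
    [36.0, 0.275, 1.0],
    [12.0, 0.120, 0.0],
    [60.0, 0.201, 1.0],
]
print(categorical_feature_indices(vectors, max_categories=4))  # [0, 2]
```

Under this rule, any indexed column with more than four distinct values would be treated as continuous, so it is worth checking which features actually end up flagged as categorical.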


## Step 7: Split Data into training and test.

We will split the data into training and test sets.  (You know the drill by now.)
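For intuition, a per-row random split can be sketched in plain Python. The helper `random_split` is hypothetical; the point is that drawing a random number per row gives approximately, not exactly, the requested proportions, and that a fixed seed makes the split reproducible.

```python
import random


def random_split(rows, train_fraction, seed):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(1000))                 # stand-in for DataFrame rows
train, test = random_split(rows, train_fraction=0.7, seed=42)
print(len(train), len(test))             # roughly 700 and 300, not exact
```

Every row lands in exactly one of the two sets, so the model is never evaluated on rows it was trained on.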

**=> TODO: Split the dataset into 70% training and 30% test**


In [None]:

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) =  ???


## Step 8: Run the pipeline that will fit our decision tree

We have a 3-stage pipeline here:

 1. LabelIndexer
 2. FeatureIndexer
 3. DecisionTreeClassifier
 
Running the pipeline will do all three.  Note that our other indexer pipeline already ran above.
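The fit-then-transform chaining that a pipeline performs can be sketched with two toy stages. The classes `AddOne` and `Center` and the helper functions are invented for illustration and only mimic the general pattern, not Spark's `Pipeline` class.

```python
class AddOne:
    """Toy stage: nothing to learn; transform adds 1 to each value."""
    def fit(self, data):
        return self

    def transform(self, data):
        return [x + 1 for x in data]


class Center:
    """Toy stage: fit learns the mean; transform subtracts it."""
    def fit(self, data):
        self.mean = sum(data) / len(data)
        return self

    def transform(self, data):
        return [x - self.mean for x in data]


def fit_pipeline(stages, data):
    """Fit each stage on the output of the previous one."""
    fitted = []
    for stage in stages:
        model = stage.fit(data)
        data = model.transform(data)
        fitted.append(model)
    return fitted


def run_pipeline(fitted, data):
    """Apply already-fitted stages in order, without refitting."""
    for model in fitted:
        data = model.transform(data)
    return data


fitted = fit_pipeline([AddOne(), Center()], [1.0, 2.0, 3.0])
print(run_pipeline(fitted, [10.0]))  # [8.0]
```

The detail worth noticing is that new data is transformed with what was learned during fitting (here, a mean of 3.0), which is how test data gets the same label and feature encoding as the training data.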

**=> TODO: Add labelIndexer, featureIndexer, and dt to our pipeline**



In [None]:

# Train a DecisionTree model.
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", maxBins=100)

# Chain indexers and tree in a Pipeline
pipeline = Pipeline(stages=[???, ???, ???])
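For intuition about what fitting the tree involves, the sketch below scores candidate split points on a single feature by Gini impurity, which, as far as I know, is the default impurity measure for `DecisionTreeClassifier`. The data and function names are hypothetical, and Spark's real implementation differs, for example by binning continuous features (hence `maxBins`) instead of trying every observed value.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_threshold(values, labels):
    """Try each observed value as a threshold; return (threshold, weighted impurity)."""
    n = len(values)
    best = (None, gini(labels))
    for t in sorted(set(values)):
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        if not left or not right:
            continue  # a split must send rows to both sides
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical credit scores and loan outcomes.
scores = [580, 600, 640, 700, 720, 760]
status = ["Defaulted", "Defaulted", "Defaulted",
          "Completed", "Completed", "Completed"]
print(gini(status))                    # 0.5
print(best_threshold(scores, status))  # (640, 0.0)
```

A tree repeats this search across all features at each node, then recurses into the two resulting subsets until a stopping condition is met.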



In [None]:
# Train model.  This also runs the indexers.

model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "indexedLabel", "features").show(5)



## Step 9: Evaluate the model.

Let us check to see how the model did, using accuracy as a measure.
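Accuracy is the fraction of predictions that match the true label, and test error is one minus that. A minimal sketch on made-up indexed values (the function name is hypothetical):

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true label."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


# Hypothetical indexed predictions and true labels.
predicted = [0.0, 1.0, 0.0, 2.0, 0.0]
actual = [0.0, 1.0, 1.0, 2.0, 0.0]

acc = accuracy(predicted, actual)
print("accuracy =", acc)            # accuracy = 0.8
print("test error =", 1.0 - acc)
```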

In [None]:
# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g " % (1.0 - accuracy))


In [None]:

treeModel = model.stages[2]
# summary only
print(treeModel)