# Pipeline: Prosper Loan Dataset

A pipeline chains several processing steps (indexing, feature assembly, scaling, model training) into a single object that can be fit and applied as one unit.

We are going to look at the Prosper loan dataset, which records a history of loans made by Prosper.

In [1]:
# initialize Spark Session
import os
import sys
top_dir = os.path.abspath(os.path.join(os.getcwd(), "../"))
if top_dir not in sys.path:
    sys.path.append(top_dir)

from init_spark import init_spark
spark = init_spark()
spark

Initializing Spark...
Spark found in :  /Users/sujee/spark
Spark config:
	 spark.app.name=TestApp
	spark.master=local[*]
	executor.memory=2g
	spark.sql.warehouse.dir=/var/folders/lp/qm_skljd2hl4xtps5vw0tdgm0000gn/T/tmp2vwjn9y8
	some_property=some_value
Spark UI running on port 4040


## Step 1: Load the Data

In [2]:
dataset = spark.read.csv("/data/prosper-loan/prosper-loan-data.csv.gz", 
                         header=True, inferSchema=True)


In [3]:
import pandas as pd 

dataset.limit(5).toPandas()

| | Term | LoanStatus | BorrowerRate | ProsperRating (numeric) | ProsperScore | ListingCategory | BorrowerState | EmploymentStatus | EmploymentStatusDuration | IsBorrowerHomeowner | ... | ProsperPaymentsOneMonthPlusLate | ProsperPrincipalBorrowed | ProsperPrincipalOutstanding | LoanOriginalAmount | MonthlyLoanPayment | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors | YearsWithCredit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 36 | 1 | 0.158 | 4.0 | 6.0 | Unknown | CO | Self-employed | 2.0 | True | ... | 0.0 | 0.0 | 0.0 | 9425 | 330.43 | 0 | 0 | 0.0 | 258 | 13 |
| 1 | 36 | 1 | 0.1325 | 4.0 | 6.0 | Unknown | Unknown | Full-time | 19.0 | False | ... | 0.0 | 0.0 | 0.0 | 1000 | 33.81 | 0 | 0 | 0.0 | 53 | 14 |
| 2 | 36 | 0 | 0.1435 | 5.0 | 4.0 | Debt | AL | Employed | 1.0 | False | ... | 0.0 | 0.0 | 0.0 | 4000 | 137.39 | 0 | 0 | 0.0 | 1 | 18 |
| 3 | 36 | 0 | 0.3177 | 1.0 | 5.0 | Household | FL | Other | 121.0 | True | ... | 0.0 | 0.0 | 0.0 | 4000 | 173.71 | 0 | 0 | 0.0 | 10 | 15 |
| 4 | 36 | 1 | 0.2075 | 4.0 | 6.0 | Unknown | MI | Full-time | 36.0 | False | ... | 0.0 | 0.0 | 0.0 | 3000 | 112.64 | 0 | 0 | 0.0 | 53 | 11 |


In [4]:
# Define our columns for convenience.

columns = ['Term', 'BorrowerRate', 'ProsperRating (numeric)', 'ProsperScore', 'EmploymentStatusDuration', 'IsBorrowerHomeowner',
           'CreditScore', 'CurrentCreditLines', 'OpenCreditLines',
           'TotalCreditLinespast7years', 'OpenRevolvingAccounts', 'OpenRevolvingMonthlyPayment',
           'InquiriesLast6Months', 'TotalInquiries', 'CurrentDelinquencies', 'AmountDelinquent',
           'DelinquenciesLast7Years', 'PublicRecordsLast10Years', 'PublicRecordsLast12Months',
           'RevolvingCreditBalance', 'BankcardUtilization', 'AvailableBankcardCredit', 'TotalTrades',
           'TradesNeverDelinquent (percentage)', 'TradesOpenedLast6Months', 'DebtToIncomeRatio',
           'IncomeVerifiable', 'StatedMonthlyIncome', 'TotalProsperLoans', 'TotalProsperPaymentsBilled',
           'OnTimeProsperPayments', 'ProsperPaymentsLessThanOneMonthLate', 'ProsperPaymentsOneMonthPlusLate',
           'ProsperPrincipalBorrowed', 'ProsperPrincipalOutstanding', 'LoanOriginalAmount',
           'MonthlyLoanPayment', 'Recommendations', 'InvestmentFromFriendsCount', 'InvestmentFromFriendsAmount',
           'Investors', 'YearsWithCredit']

categorical_columns = ["BorrowerState", "EmploymentStatus", "ListingCategory"]
categorical_indexers = ["BorrowerState_index", "EmploymentStatus_index", "ListingCategory_index"]




In [5]:
dataset.select(columns).limit(5).toPandas()

| | Term | BorrowerRate | ProsperRating (numeric) | ProsperScore | EmploymentStatusDuration | IsBorrowerHomeowner | CreditScore | CurrentCreditLines | OpenCreditLines | TotalCreditLinespast7years | ... | ProsperPaymentsOneMonthPlusLate | ProsperPrincipalBorrowed | ProsperPrincipalOutstanding | LoanOriginalAmount | MonthlyLoanPayment | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors | YearsWithCredit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 36 | 0.158 | 4.0 | 6.0 | 2.0 | True | 640.0 | 5.0 | 4.0 | 12.0 | ... | 0.0 | 0.0 | 0.0 | 9425 | 330.43 | 0 | 0 | 0.0 | 258 | 13 |
| 1 | 36 | 0.1325 | 4.0 | 6.0 | 19.0 | False | 640.0 | 2.0 | 2.0 | 10.0 | ... | 0.0 | 0.0 | 0.0 | 1000 | 33.81 | 0 | 0 | 0.0 | 53 | 14 |
| 2 | 36 | 0.1435 | 5.0 | 4.0 | 1.0 | False | 680.0 | 9.0 | 7.0 | 29.0 | ... | 0.0 | 0.0 | 0.0 | 4000 | 137.39 | 0 | 0 | 0.0 | 1 | 18 |
| 3 | 36 | 0.3177 | 1.0 | 5.0 | 121.0 | True | 700.0 | 10.0 | 9.0 | 18.0 | ... | 0.0 | 0.0 | 0.0 | 4000 | 173.71 | 0 | 0 | 0.0 | 10 | 15 |
| 4 | 36 | 0.2075 | 4.0 | 6.0 | 36.0 | False | 620.0 | 4.0 | 4.0 | 13.0 | ... | 0.0 | 0.0 | 0.0 | 3000 | 112.64 | 0 | 0 | 0.0 | 53 | 11 |


## Step 2: Drop all NAs

Drop every row that contains a missing (NA) value.

**=> TODO: Drop all NAs**

In [6]:
dataset = dataset.na.drop()
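
Called with no arguments, `na.drop()` removes every row that has a null in any column. The plain-Python sketch below illustrates that row-wise rule; the sample rows are made up for illustration, and this is not how Spark implements it.

```python
# Hypothetical sample rows; None stands in for a null value.
rows = [
    {"Term": 36, "BorrowerState": "CO", "CreditScore": 640.0},
    {"Term": 36, "BorrowerState": None, "CreditScore": 680.0},
    {"Term": 60, "BorrowerState": "FL", "CreditScore": None},
    {"Term": 36, "BorrowerState": "MI", "CreditScore": 620.0},
]


def drop_rows_with_nulls(rows):
    """Keep only the rows in which every value is present."""
    return [row for row in rows if all(value is not None for value in row.values())]


clean_rows = drop_rows_with_nulls(rows)
print(len(rows), "->", len(clean_rows))
```

A single missing field is enough to lose the whole row, so it is worth comparing row counts before and after this step.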

## Step 3: Examine the contents of the categorical columns

Let's look at the contents of our categorical columns.

**=> TODO: Group by the categorical columns LoanStatus, BorrowerState, EmploymentStatus, and ListingCategory, and see the breakdowns by count**

In [7]:
dataset.groupBy('LoanStatus').count().show()
dataset.groupBy('BorrowerState').count().show()
dataset.groupBy('EmploymentStatus').count().show()
dataset.groupBy('ListingCategory').count().show(60)

+----------+-----+
|LoanStatus|count|
+----------+-----+
|         1|33530|
|         0|16194|
+----------+-----+

+-------------+-----+
|BorrowerState|count|
+-------------+-----+
|           SC|  424|
|           AZ|  882|
|           LA|  346|
|           MN| 1186|
|           NJ| 1128|
|           DC|  186|
|           OR|  938|
|           VA| 1434|
|           RI|  161|
|           KY|  395|
|           WY|   62|
|           NH|  221|
|           MI| 1665|
|           NV|  371|
|           WI|  785|
|           ID|  314|
|           CA| 6800|
|           NE|  257|
|           CT|  597|
|           MT|  165|
+-------------+-----+
only showing top 20 rows

+----------------+-----+
|EmploymentStatus|count|
+----------------+-----+
|        Employed|18393|
|       Part-time| 1060|
|   Self-employed| 3045|
|    Not employed|  583|
|           Other|  924|
|       Full-time|25016|
|         Retired|  703|
+----------------+-----+

+---------------+-----+
|ListingCategory|count|
+------

**=> What does that say about the cardinality of these categorical columns?**
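
Cardinality is simply the number of distinct values in a column. In the output above, EmploymentStatus has 7 distinct values, while BorrowerState has more than the 20 rows shown. As a rough sketch of how one might measure this, here is a plain-Python version on a made-up sample (the rows and the `cardinality` helper are illustrative assumptions, not part of the notebook):

```python
from collections import Counter

# Hypothetical sample of (BorrowerState, EmploymentStatus) values.
rows = [
    ("CA", "Employed"), ("CA", "Full-time"), ("MI", "Full-time"),
    ("VA", "Employed"), ("NJ", "Full-time"), ("CA", "Retired"),
]


def cardinality(values):
    """Number of distinct values in a column."""
    return len(set(values))


states = [state for state, _ in rows]
statuses = [status for _, status in rows]

# Print the cardinality and the single most common value of each column.
print("BorrowerState:", cardinality(states), Counter(states).most_common(1))
print("EmploymentStatus:", cardinality(statuses), Counter(statuses).most_common(1))
```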



## Step 4: Converting categorical columns

We need to convert categorical columns to numeric ones. Remember, Spark ML can *only* handle numeric columns. A tool called StringIndexer will help us here.

Because there are several indexers (one per categorical column), we will build a pipeline to run them together.

**=> TODO: Enter the list of all our indexers into the pipeline**

In [8]:
from pyspark.ml.feature import StringIndexer

indexers = [StringIndexer(inputCol=column, outputCol=column+"_index", handleInvalid="keep").fit(dataset) for column in categorical_columns ]
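
To make the indexing concrete: by default StringIndexer orders labels by descending frequency, so the most common label gets index 0.0, and `handleInvalid="keep"` places labels not seen during fitting into one extra bucket at the end. The plain-Python sketch below mimics that behaviour on made-up values. The function names are hypothetical and the alphabetical tie-break is an assumption of the sketch.

```python
from collections import Counter


def fit_string_index(values):
    """Map each label to an index, most frequent first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def transform_keep_invalid(mapping, values):
    """Unseen labels share one extra index equal to the number of known labels."""
    unseen_index = float(len(mapping))
    return [mapping.get(value, unseen_index) for value in values]


# Hypothetical training and test values for EmploymentStatus.
train = ["Full-time", "Employed", "Full-time", "Retired", "Employed", "Full-time"]
mapping = fit_string_index(train)
indexed = transform_keep_invalid(mapping, ["Employed", "Full-time", "Student"])
print(mapping)
print(indexed)
```

Here "Student" never appears in the training values, so it lands in the extra bucket instead of raising an error.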


## Step 5: Build feature vectors using VectorAssembler.

In [9]:
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=columns + categorical_indexers, outputCol="features")
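
VectorAssembler concatenates the listed input columns into one feature vector per row. With the 42 entries in `columns` plus the 3 indexed categorical columns, each vector has 45 entries, which matches the vector size shown in the prediction output later on. A minimal plain-Python sketch of the idea follows; the `assemble` helper and the sample row are illustrative assumptions.

```python
def assemble(row, input_cols):
    """Concatenate the named columns of one row into a single list of floats."""
    return [float(row[col]) for col in input_cols]


# Hypothetical row: numeric columns plus one already-indexed categorical column.
row = {
    "Term": 36,
    "BorrowerRate": 0.158,
    "IsBorrowerHomeowner": True,  # booleans become 1.0 / 0.0
    "BorrowerState_index": 4.0,
}
input_cols = ["Term", "BorrowerRate", "IsBorrowerHomeowner", "BorrowerState_index"]
features = assemble(row, input_cols)
print(features)
```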

## Step 6: Build Indexers

We are going to create the label indexer, which will produce a label column for loan status.
We will also add a feature indexer, which identifies which features are categorical (we should have 3).

In [10]:
# Index labels, adding metadata to the label column.
# The indexer is not fit here; the pipeline fits it on the training data.
labelIndexer = StringIndexer(inputCol="LoanStatus", outputCol="indexedLabel")


In [11]:
from pyspark.ml.feature import VectorIndexer

# Automatically identify categorical features, and index them.
# We specify maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4)
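
The rule VectorIndexer applies can be sketched in plain Python: count the distinct values of each feature, and treat the feature as categorical only if that count is at most `maxCategories`. The helper name and sample vectors below are made up for illustration.

```python
def find_categorical_features(vectors, max_categories):
    """Return {feature_position: sorted distinct values} for features with
    at most max_categories distinct values; the rest count as continuous."""
    categorical = {}
    num_features = len(vectors[0])
    for position in range(num_features):
        distinct = sorted({vector[position] for vector in vectors})
        if len(distinct) <= max_categories:
            categorical[position] = distinct
    return categorical


# Hypothetical feature vectors: [Term, BorrowerRate, IsBorrowerHomeowner]
vectors = [
    [36.0, 0.158, 1.0],
    [36.0, 0.1325, 0.0],
    [60.0, 0.1435, 0.0],
    [12.0, 0.3177, 1.0],
    [36.0, 0.2075, 0.0],
]
print(find_categorical_features(vectors, max_categories=4))
```

Under this rule, a column with more than four distinct values, such as an indexed state column covering dozens of states, would be left as continuous when `maxCategories=4`.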


In [12]:
# Scaler
from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(inputCol="indexedFeatures", outputCol="scaledFeatures")
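
With its default settings (`withStd=True`, `withMean=False`), StandardScaler divides each feature by its sample standard deviation and does not center it. A small plain-Python sketch of that computation for one feature is shown below. The values are made up, and the zero-deviation guard is a choice made for this sketch.

```python
import statistics


def fit_scaler(column):
    """Learn the sample standard deviation (n - 1 denominator) of one feature."""
    return statistics.stdev(column)


def scale(column, std):
    """Divide by the learned deviation; a zero deviation maps the feature to 0.0."""
    return [value / std if std != 0 else 0.0 for value in column]


# Hypothetical LoanOriginalAmount values.
amounts = [9425.0, 1000.0, 4000.0, 4000.0, 3000.0]
std = fit_scaler(amounts)
scaled = scale(amounts, std)

# After scaling, the feature has unit standard deviation (up to rounding).
print(round(statistics.stdev(scaled), 6))
```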

## Step 7: Split Data into training and test.

We will split the data into training and test sets. (You know the drill by now.)

**=> TODO: Split dataset into 70% training, 30% test**


In [13]:

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) =  dataset.randomSplit([.70, .30])
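
`randomSplit` assigns rows to the splits at random, so the resulting sizes are only approximately 70% and 30%, and they change from run to run unless a seed is passed (for example `dataset.randomSplit([.70, .30], seed=42)`, where the seed value is arbitrary). The plain-Python sketch below illustrates the idea; the `random_split` helper is hypothetical and is not Spark's sampling code.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row independently to a split, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random() * total
        cumulative = 0.0
        for index, weight in enumerate(weights):
            cumulative += weight
            if draw < cumulative:
                splits[index].append(row)
                break
        else:
            splits[-1].append(row)  # guard against floating-point round-off
    return splits


rows = list(range(1000))  # hypothetical row ids
training, test = random_split(rows, [0.70, 0.30], seed=42)
print(len(training), len(test))
```

Fixing the seed makes the split, and therefore the reported test error, reproducible.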


## Step 8: Run the pipeline that will fit our random forest

We have an 8-stage pipeline here:

 1. CategoryIndexer for BorrowerState
 2. CategoryIndexer for EmploymentStatus
 3. CategoryIndexer for ListingCategory
 4. VectorAssembler
 5. LabelIndexer
 6. FeatureIndexer
 7. Scaler
 8. RandomForestClassifier

Running the pipeline executes all of these stages in order. Note that the category indexers were already fit on the full dataset above, so the pipeline only applies them.

**=> TODO: Add the indexers list plus assembler, labelIndexer, featureIndexer, scaler, and rf to our pipeline**
 HINT: You should have 8 separate items


In [14]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier

# Configure a random forest classifier (it is trained when the pipeline is fit).
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="scaledFeatures", maxBins=10000, numTrees=20)

stages = indexers + [assembler, labelIndexer, featureIndexer, scaler, rf] # TODO enter the eight stages to the pipeline

# Chain the indexers, assembler, scaler, and forest in a Pipeline
pipeline = Pipeline(stages=stages)



In [15]:
# Train model.  This also runs the indexers.

model = pipeline.fit(trainingData)



For the test set, we don't want to retrain the random forest, but we do want to run the rest of the pipeline. Calling `transform` on the fitted model applies every stage without refitting anything.
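
The difference between fitting and transforming can be sketched with a toy pipeline in plain Python. The classes below are hypothetical stand-ins, not Spark classes. Unlike Spark, where `Pipeline.fit` returns a separate fitted model, this sketch returns the same object for brevity.

```python
class MeanCenterer:
    """Toy estimator: fit() learns a mean, transform() reuses it."""

    def fit(self, values):
        self.mean = sum(values) / len(values)
        return self

    def transform(self, values):
        return [value - self.mean for value in values]


class Doubler:
    """Toy stateless stage: there is nothing to learn in fit()."""

    def fit(self, values):
        return self

    def transform(self, values):
        return [2 * value for value in values]


class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, values):
        # Fit each stage in order, passing the transformed data to the next one.
        for stage in self.stages:
            values = stage.fit(values).transform(values)
        return self

    def transform(self, values):
        # Apply the already-fitted stages; nothing is re-learned here.
        for stage in self.stages:
            values = stage.transform(values)
        return values


model = ToyPipeline([MeanCenterer(), Doubler()]).fit([1.0, 2.0, 3.0])
print(model.transform([10.0]))
```

The mean used on the new value is the one learned from the training values, which is the same reason the test data is passed through the fitted pipeline rather than a freshly fitted one.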


In [16]:

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "indexedLabel", "scaledFeatures").show(5)




+----------+------------+--------------------+
|prediction|indexedLabel|      scaledFeatures|
+----------+------------+--------------------+
|       0.0|         1.0|(45,[0,1,2,3,4,6,...|
|       0.0|         1.0|(45,[0,1,2,3,4,6,...|
|       0.0|         1.0|[0.0,1.2533138996...|
|       0.0|         1.0|[0.0,1.3455228602...|
|       0.0|         1.0|[0.0,1.3455228602...|
+----------+------------+--------------------+
only showing top 5 rows



## Step 9: Evaluate the model.

Let's see how the model did, using accuracy as the measure.

In [17]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g " % (1.0 - accuracy))


Test Error = 0.311943 
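
One way to put this number in context is to compare it with the error of always predicting the majority class. The sketch below uses the LoanStatus counts printed earlier and assumes the test split has roughly the same class mix as the whole dataset, which is an assumption rather than something the notebook checks.

```python
# Class counts from the LoanStatus breakdown printed earlier
# (whole dataset after dropping NAs, not the test split itself).
counts = {1: 33530, 0: 16194}
reported_test_error = 0.311943

total = sum(counts.values())
majority_share = max(counts.values()) / total
baseline_error = 1.0 - majority_share  # error of always predicting the majority class

print("Baseline error = %g" % baseline_error)
print("Improvement over baseline = %g" % (baseline_error - reported_test_error))
```

Under that assumption, the forest's test error is only a little below the majority-class baseline.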


In [18]:

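# Note: stages[2] is the third category indexer, as the output below shows;
# the fitted random forest is the last stage, model.stages[-1].
treeModel = model.stages[2]
# summary only
print(treeModel)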

StringIndexer_4673adf6b3c794d4ade8
