# Decision Tree Learning: Prosper Loan Dataset

Let's look at a decision tree classification example in Spark MLlib.

Here, we load the Prosper loan dataset, which records borrower, credit, and repayment details for each loan. We read the gzipped CSV file into a Spark DataFrame, inspect it, and then train a decision tree to predict `LoanStatus`.

In [139]:
%matplotlib inline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StandardScaler
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator


In [3]:
dataset = spark.read.csv("../../datasets/prosper-loan/prosper-loan-data.csv.gz", header=True, inferSchema=True)


In [106]:
columns = ["Term","BorrowerAPR","BorrowerRate","LenderYield",
           "ListingCategory (numeric)","EmploymentStatusDuration",
           "IsBorrowerHomeowner","CurrentlyInGroup","CreditScoreRangeLower","CreditScoreRangeUpper",
           "CurrentCreditLines","OpenCreditLines","TotalCreditLinespast7years",
           "OpenRevolvingAccounts","OpenRevolvingMonthlyPayment","InquiriesLast6Months","TotalInquiries",
           "CurrentDelinquencies","AmountDelinquent","DelinquenciesLast7Years","PublicRecordsLast10Years",
           "PublicRecordsLast12Months","RevolvingCreditBalance","BankcardUtilization","AvailableBankcardCredit",
           "TotalTrades","TradesNeverDelinquent (percentage)","TradesOpenedLast6Months","DebtToIncomeRatio",
           "IncomeVerifiable","StatedMonthlyIncome", "LoanCurrentDaysDelinquent",
           "LoanMonthsSinceOrigination","LoanOriginalAmount",
           "MonthlyLoanPayment","LP_CustomerPayments","LP_CustomerPrincipalPayments",
           "LP_InterestandFees","LP_ServiceFees","LP_CollectionFees","LP_GrossPrincipalLoss",
           "LP_NetPrincipalLoss","LP_NonPrincipalRecoverypayments","PercentFunded","Recommendations",
           "InvestmentFromFriendsCount","InvestmentFromFriendsAmount","Investors"
]

categorical_columns = ["CreditGrade", "BorrowerState", "Occupation", "EmploymentStatus", "IncomeRange", 
                       "LoanOriginationQuarter"]

categorical_indexers = ["CreditGrade_index", "BorrowerState_index", "Occupation_index", "EmploymentStatus_index", "IncomeRange_index", 
                       "LoanOriginationQuarter_index"]

boolean_columns = ["IsBorrowerHomeowner","CurrentlyInGroup", 'IncomeVerifiable']

null_columns = ["EstimatedEffectiveYield","EstimatedLoss","EstimatedReturn","ProsperRating (numeric)",
                "ProsperRating (Alpha)","ProsperScore", "TotalProsperLoans","TotalProsperPaymentsBilled",
                "OnTimeProsperPayments","ProsperPaymentsLessThanOneMonthLate","ProsperPaymentsOneMonthPlusLate",
                "ProsperPrincipalBorrowed","ProsperPrincipalOutstanding", "ScorexChangeAtTimeOfListing",
                "LoanFirstDefaultedCycleNumber"
               ]


In [107]:
dataset.select(null_columns).show(1)

+-----------------------+-------------+---------------+-----------------------+---------------------+------------+-----------------+--------------------------+---------------------+-----------------------------------+-------------------------------+------------------------+---------------------------+---------------------------+-----------------------------+
|EstimatedEffectiveYield|EstimatedLoss|EstimatedReturn|ProsperRating (numeric)|ProsperRating (Alpha)|ProsperScore|TotalProsperLoans|TotalProsperPaymentsBilled|OnTimeProsperPayments|ProsperPaymentsLessThanOneMonthLate|ProsperPaymentsOneMonthPlusLate|ProsperPrincipalBorrowed|ProsperPrincipalOutstanding|ScorexChangeAtTimeOfListing|LoanFirstDefaultedCycleNumber|
+-----------------------+-------------+---------------+-----------------------+---------------------+------------+-----------------+--------------------------+---------------------+-----------------------------------+-------------------------------+------------------------+----

In [108]:
dataset.select(categorical_columns).show(1)

+-----------+-------------+----------+----------------+--------------+----------------------+
|CreditGrade|BorrowerState|Occupation|EmploymentStatus|   IncomeRange|LoanOriginationQuarter|
+-----------+-------------+----------+----------------+--------------+----------------------+
|          C|           CO|     Other|   Self-employed|$25,000-49,999|               Q3 2007|
+-----------+-------------+----------+----------------+--------------+----------------------+
only showing top 1 row



In [118]:
dataset.select(columns).show(10)


In [114]:
dataset.show()


In [115]:
dataset.groupBy('LoanStatus').count().show()

+--------------------+-----+
|          LoanStatus|count|
+--------------------+-----+
|           Defaulted| 5018|
|          Chargedoff|11992|
|FinalPaymentInPro...|  205|
|           Completed|38074|
|Past Due (61-90 d...|  313|
|Past Due (>120 days)|   16|
|           Cancelled|    5|
|Past Due (1-15 days)|  806|
|Past Due (91-120 ...|  304|
|             Current|56576|
|Past Due (31-60 d...|  363|
|Past Due (16-30 d...|  265|
+--------------------+-----+
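
The counts above show a heavily skewed label: `Current` and `Completed` together account for most rows. As a rough yardstick for the accuracy reported later, the sketch below computes the share of the largest class, which is what a model that always predicts that class would score on the full dataset. The counts are copied from the output above (labels kept as displayed, including the truncated ones), the variable names are illustrative, and the figure will shift somewhat once rows with nulls are dropped.

```python
# Class counts copied from the groupBy('LoanStatus') output above.
# Labels are kept exactly as displayed, including the truncated ones.
status_counts = {
    "Defaulted": 5018,
    "Chargedoff": 11992,
    "FinalPaymentInPro...": 205,
    "Completed": 38074,
    "Past Due (61-90 d...": 313,
    "Past Due (>120 days)": 16,
    "Cancelled": 5,
    "Past Due (1-15 days)": 806,
    "Past Due (91-120 ...": 304,
    "Current": 56576,
    "Past Due (31-60 d...": 363,
    "Past Due (16-30 d...": 265,
}

total_rows = sum(status_counts.values())
majority_label = max(status_counts, key=status_counts.get)
majority_share = status_counts[majority_label] / total_rows

print(f"rows: {total_rows}")
print(f"majority class: {majority_label} ({majority_share:.1%})")
```

A test accuracy close to this share would suggest the tree is doing little more than predicting the dominant class.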



In [103]:
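# Index each categorical string column into a numeric "<column>_index" column.
# handleInvalid="keep" puts null or unseen values in an extra bucket instead of raising an error.
indexers = [StringIndexer(inputCol=column, outputCol=column + "_index", handleInvalid="keep")
            for column in categorical_columns]

# The pipeline fits every indexer on the dataset, then applies them in order.
pipeline = Pipeline(stages=indexers)
df_r = pipeline.fit(dataset).transform(dataset)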

df_r.show()


In [136]:
# Keep only the model inputs and the label, then drop every row that has a null in any of them.
na_dropped = df_r.select(columns + categorical_indexers + ['LoanStatus']).na.drop()


In [148]:
# Combine all input columns into a single vector column named "features".
assembler = VectorAssembler(inputCols=columns + categorical_indexers, outputCol="features")
fv = assembler.transform(na_dropped)

In [142]:
# Index labels, adding metadata to the label column.
# Fit on all of fv (before the train/test split) so every label gets an index.
labelIndexer = StringIndexer(inputCol="LoanStatus", outputCol="indexedLabel").fit(fv)
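
By default `StringIndexer` assigns indices by descending label frequency, so the most common status becomes `0.0`. The sketch below mimics that ordering in plain Python for the four most frequent statuses, using counts from the `LoanStatus` table earlier. It only approximates what `labelIndexer` produces: the counts come from the full dataset rather than `fv`, and `frequency_desc_index` is a hypothetical helper, not a Spark API.

```python
def frequency_desc_index(counts):
    """Map each label to an index, most frequent label first (hypothetical helper)."""
    # Sort by descending count; break ties alphabetically for a deterministic result.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Counts for the four most frequent statuses, copied from the LoanStatus table.
top_status_counts = {
    "Defaulted": 5018,
    "Chargedoff": 11992,
    "Completed": 38074,
    "Current": 56576,
}

label_index = frequency_desc_index(top_status_counts)
for label, index in label_index.items():
    print(index, label)
```

On the fitted model, `labelIndexer.labels` lists the actual ordering, which is what the values in `indexedLabel` and `prediction` refer to.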


In [141]:
# Automatically identify categorical features, and index them.
# We specify maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(fv)
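
The `maxCategories` rule can be pictured with a small plain-Python sketch: a feature is indexed as categorical when its number of distinct values does not exceed the threshold, and is otherwise left as continuous. `is_categorical` and the sample values are made up for illustration, and the sketch ignores details of the real `VectorIndexer`, such as how it re-indexes category values or treats categorical metadata already attached to the input column.

```python
def is_categorical(values, max_categories=4):
    """Return True if the feature has at most `max_categories` distinct values."""
    return len(set(values)) <= max_categories

# Made-up sample values, for illustration only.
term_months = [12, 36, 60, 36, 36, 60]                          # 3 distinct values
borrower_rate = [0.158, 0.092, 0.275, 0.0974, 0.2085, 0.1314]   # 6 distinct values

print(is_categorical(term_months))    # within the threshold: categorical
print(is_categorical(borrower_rate))  # above the threshold: continuous
```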


In [149]:

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = fv.randomSplit([0.7, 0.3])


In [150]:

# Train a DecisionTree model.
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")

# Chain indexers and tree in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])

# Train model.  This also runs the indexers.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "indexedLabel", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g " % (1.0 - accuracy))

treeModel = model.stages[2]
# summary only
print(treeModel)

IllegalArgumentException: 'requirement failed: DecisionTree requires maxBins (= 32) to be at least as large as the number of values in each categorical feature, but categorical feature 50 has 68 values. Considering remove this and other categorical features with a large number of values, or add more training examples.'
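
The exception comes from the tree's binning limit: a categorical feature needs at least one bin per category, and the default `maxBins` is 32. Feature 50 can be traced back to a column name because `VectorAssembler` lays out the vector in the order of `columns + categorical_indexers`. The sketch below does that lookup in plain Python and then shows one possible fix, raising `maxBins` to at least the 68 values reported in the message. `make_tree` is a hypothetical helper, the count of 48 is the length of the `columns` list above, and the PySpark import is deferred so the lookup runs without Spark. The output does not show why the feature is categorical despite `maxCategories=4`; a plausible explanation is that the nominal metadata written by `StringIndexer` carries through into `indexedFeatures`.

```python
# Layout of the assembled vector: the 48 entries of `columns` come first (indices 0-47),
# followed by the StringIndexer outputs in this order.
NUM_BASE_COLUMNS = 48  # len(columns) in the notebook
categorical_indexers = ["CreditGrade_index", "BorrowerState_index", "Occupation_index",
                        "EmploymentStatus_index", "IncomeRange_index",
                        "LoanOriginationQuarter_index"]

failing_feature_index = 50   # from the exception message
reported_categories = 68     # from the exception message

failing_feature = categorical_indexers[failing_feature_index - NUM_BASE_COLUMNS]
print(failing_feature)


def make_tree(max_bins=reported_categories):
    """Build the classifier with enough bins for the reported categorical feature."""
    # Imported here so the lookup above can run without a Spark installation.
    from pyspark.ml.classification import DecisionTreeClassifier
    return DecisionTreeClassifier(labelCol="indexedLabel",
                                  featuresCol="indexedFeatures",
                                  maxBins=max_bins)
```

In the notebook, `dt = make_tree()` would stand in for the `DecisionTreeClassifier(...)` line before the pipeline is refit; if the same error reappears for another feature, `max_bins` needs to go higher. The alternative suggested by the message itself is to drop high-cardinality columns such as `Occupation_index` from the feature list.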