# Pipeline: Prosper Loan Dataset

Pipelines are useful for chaining many processing steps into a single unit that can be fit and applied in one call.

We are going to look at the Prosper loan dataset. This dataset shows a history of loans made by Prosper.

In [None]:
%matplotlib inline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StandardScaler
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

print('Spark UI running on http://YOURIPADDRESS:' + sc.uiWebUrl.split(':')[2])

## Step 1: Load the Data

In [None]:
dataset = spark.read.csv("/data/prosper-loan/prosper-loan-data.csv.gz", 
                         header=True, inferSchema=True)


In [None]:
dataset.show(20)

In [None]:
# Define our columns for convenience.

columns = ['Term', 'BorrowerRate', 'ProsperRating (numeric)', 'ProsperScore', 'EmploymentStatusDuration', 'IsBorrowerHomeowner',
           'CreditScore', 'CurrentCreditLines', 'OpenCreditLines',
           'TotalCreditLinespast7years', 'OpenRevolvingAccounts', 'OpenRevolvingMonthlyPayment',
           'InquiriesLast6Months', 'TotalInquiries', 'CurrentDelinquencies', 'AmountDelinquent',
           'DelinquenciesLast7Years', 'PublicRecordsLast10Years', 'PublicRecordsLast12Months',
           'RevolvingCreditBalance', 'BankcardUtilization', 'AvailableBankcardCredit', 'TotalTrades',
           'TradesNeverDelinquent (percentage)', 'TradesOpenedLast6Months', 'DebtToIncomeRatio',
           'IncomeVerifiable', 'StatedMonthlyIncome', 'TotalProsperLoans', 'TotalProsperPaymentsBilled',
           'OnTimeProsperPayments', 'ProsperPaymentsLessThanOneMonthLate', 'ProsperPaymentsOneMonthPlusLate',
           'ProsperPrincipalBorrowed', 'ProsperPrincipalOutstanding', 'LoanOriginalAmount',
           'MonthlyLoanPayment', 'Recommendations', 'InvestmentFromFriendsCount', 'InvestmentFromFriendsAmount',
           'Investors', 'YearsWithCredit']

categorical_columns = ["BorrowerState", "EmploymentStatus", "ListingCategory"]
categorical_indexers = ["BorrowerState_index", "EmploymentStatus_index", "ListingCategory_index"]




In [None]:
dataset.select(columns).show(10)

## Step 2: Drop all NAs

Go ahead and drop every row that contains an NA (null) value.

**=> TODO: Drop all NAs**

In [None]:
dataset = ???  # TODO: drop NAs

## Step 3: Examine the contents of the categorical columns

Let's look at the contents of our categorical columns.

**=> TODO: Group by each of the categorical columns LoanStatus, BorrowerState, EmploymentStatus, and ListingCategory, and see the breakdown by count**

In [None]:
dataset.groupBy(???).count().show()
dataset.groupBy(???).count().show()
dataset.groupBy(???).count().show()
dataset.groupBy(???).count().show(60)

**=> What does that say about the cardinality of these categorical columns?**
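
As a reminder of what "cardinality" means here, the following is a minimal plain-Python sketch, not Spark code. The sample values are hypothetical stand-ins for one categorical column; the idea is that a group-by-and-count produces one row per distinct value, and the cardinality is the number of such rows.

```python
from collections import Counter

# Hypothetical sample values standing in for one categorical column.
employment_status = ["Employed", "Full-time", "Employed", "Self-employed",
                     "Employed", "Full-time", "Retired"]

# Similar in spirit to a group-by followed by a count:
# one count per distinct value.
counts = Counter(employment_status)

# Cardinality is the number of distinct values in the column.
cardinality = len(counts)

for value, n in counts.most_common():
    print(value, n)
print("cardinality:", cardinality)
```

A column with a handful of distinct values is low cardinality; a column such as a state code, with dozens of values, is comparatively high.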



## Step 4: Converting categorical columns

We need to convert categorical columns to numerics. Remember, Spark ML can *only* handle numeric columns. There's a tool called StringIndexer that will help us here.

Because we need one indexer per categorical column, we build them all in a list. The list goes into our pipeline later on.

**=> TODO: fill in the input and output column for each indexer**
HINT: take each column, and then output the column + "_index"

In [None]:
print(categorical_columns)

indexers = [StringIndexer(inputCol=???, outputCol=???, handleInvalid="keep").\
            fit(dataset) for column in categorical_columns ]
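
To build intuition for what a string indexer produces, here is a rough plain-Python sketch. It is not Spark's implementation; the helper name `build_string_index` and the sample states are made up for illustration. It follows StringIndexer's default behaviour of giving the most frequent value index 0, and it mimics `handleInvalid="keep"` by sending unseen values to one extra bucket after the known labels.

```python
from collections import Counter

def build_string_index(values):
    """Map each distinct string to an index, most frequent value first."""
    counts = Counter(values)
    # Sort by descending frequency; break ties alphabetically.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

# Hypothetical sample data for illustration.
states = ["CA", "TX", "CA", "NY", "CA", "TX"]
index = build_string_index(states)

# Unseen values go to one extra bucket after the known labels,
# similar in spirit to handleInvalid="keep".
unseen_index = float(len(index))
indexed = [index.get(s, unseen_index) for s in states + ["ZZ"]]

print(index)
print(indexed)
```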


## Step 5: Build feature vectors using VectorAssembler.

**=> TODO: enter input cols as columns + categorical_indexers, outputCol = features**



In [None]:
assembler = VectorAssembler(inputCols=???, outputCol=???) #TODO: create vector assembler

## Step 6: Build Indexers

We are going to create the label indexer, which will make us a label column for loan status.
We will also add a feature indexer, which identifies which features are categorical (we should have 3).

**=> TODO: Which column is your output label?**
**=> TODO: Enter input column for label**

In [None]:
# Index labels, adding metadata to the label column.
# This indexer is not fit here; it is fit when the pipeline is fit.
labelIndexer = StringIndexer(inputCol="???", outputCol="indexedLabel")


In [None]:
# Automatically identify categorical features, and index them.
# We specify maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4)
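
The `maxCategories` rule can be pictured with a small plain-Python sketch. This is only an illustration of the decision rule, not what VectorIndexer does internally; the function name and the sample rows are hypothetical. A feature position is treated as categorical when it has at most `max_categories` distinct values, and as continuous otherwise.

```python
def categorical_feature_positions(rows, max_categories):
    """Return positions of features with at most max_categories distinct values."""
    n_features = len(rows[0])
    positions = []
    for j in range(n_features):
        distinct = {row[j] for row in rows}
        if len(distinct) <= max_categories:
            positions.append(j)
    return positions

# Hypothetical feature vectors: [term, borrower rate, homeowner flag]
rows = [
    [36.0, 0.15, 1.0],
    [60.0, 0.22, 0.0],
    [36.0, 0.09, 1.0],
    [12.0, 0.31, 0.0],
    [60.0, 0.18, 1.0],
    [36.0, 0.27, 0.0],
]

categorical = categorical_feature_positions(rows, max_categories=4)
print(categorical)
```

In this toy data the term (3 distinct values) and the homeowner flag (2 distinct values) count as categorical, while the borrower rate (6 distinct values) counts as continuous.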


Let's scale the data. We will use StandardScaler for this. By default it rescales each feature to unit standard deviation; centering to zero mean is off unless `withMean=True` is set.

**=> TODO: instantiate StandardScaler with inputCol=indexedFeatures, outputCol=scaledFeatures**

In [None]:
# Scaler

scaler = StandardScaler(inputCol="???", outputCol="???")
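
As an illustration of unit-standard-deviation scaling, here is a minimal plain-Python sketch on a single column. It is not Spark's code; the function name and the sample incomes are hypothetical. It uses the sample standard deviation (dividing by n - 1) and does no mean centering, which matches the scaler's default settings. Mapping a zero-variance column to 0.0 is an assumption of this sketch.

```python
import statistics

def scale_to_unit_std(column):
    """Divide every value by the sample standard deviation of the column."""
    std = statistics.stdev(column)  # sample standard deviation (n - 1)
    if std == 0:
        # Assumption for this sketch: a constant column maps to all zeros.
        return [0.0 for _ in column]
    return [x / std for x in column]

# Hypothetical monthly incomes for illustration.
incomes = [2000.0, 4000.0, 6000.0]
scaled = scale_to_unit_std(incomes)

print(scaled)
print(statistics.stdev(scaled))
```

After scaling, features measured in very different units (dollars, counts, ratios) end up on comparable scales.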

## Step 7: Split the data into training and test sets

We will split the data into training and test sets. (You know the drill by now.)

**=> TODO: Split dataset into 70% training, 30% test**


In [None]:

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = ???
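
One thing worth knowing about random splits: when each row is assigned independently at random, the resulting sizes are only approximately 70/30. The plain-Python sketch below illustrates that idea; it is not Spark code, and the helper `random_split` and the seed value are made up for this example. Passing a seed makes the split reproducible.

```python
import random

def random_split(rows, train_fraction, seed):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test

rows = list(range(1000))
train, test = random_split(rows, 0.7, seed=42)

# Sizes are close to 700 and 300, but not guaranteed to be exact.
print(len(train), len(test))
```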


## Step 8: Run the pipeline that will fit our random forest

We have an 8-stage pipeline here:

 1. CategoryIndexer1
 2. CategoryIndexer2
 3. CategoryIndexer3
 4. VectorAssembler 
 5. LabelIndexer
 6. FeatureIndexer
 7. Scaler
 8. RandomForestClassifier
 
Running the pipeline will run all eight. Note that the three category indexers were already fit above, so the pipeline simply applies them.

**=> TODO: Add indexers list plus assembler, labelIndexer, featureIndexer, scaler, and rf to our pipeline**
HINT: You should have 8 separate items


In [None]:

# Define a RandomForest classifier (it is trained when the pipeline is fit).
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="scaledFeatures", maxBins=10000, numTrees=20)

stages = indexers + [???] # TODO enter the five remaining stages of the pipeline

# Chain indexers and forest in a Pipeline
pipeline = Pipeline(stages=stages)
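
To see what fitting a pipeline amounts to conceptually, here is a toy plain-Python sketch. The classes `MeanCenterer` and `Shift` and the helper functions are invented for illustration and are not part of Spark. The idea it shows: stages that need fitting are fit on the data as it looks at that point, already-fitted stages are used as-is, and each stage's output feeds the next stage.

```python
class Shift:
    """Toy transformer: adds a constant to every value."""
    def __init__(self, offset):
        self.offset = offset

    def transform(self, data):
        return [x + self.offset for x in data]

class MeanCenterer:
    """Toy estimator: learns the mean and returns a fitted transformer."""
    def fit(self, data):
        mean = sum(data) / len(data)
        return Shift(-mean)

def fit_pipeline(stages, data):
    """Fit each stage in order, feeding transformed data to the next stage."""
    fitted = []
    for stage in stages:
        # Stages with fit() are estimators; the rest are already transformers.
        model = stage.fit(data) if hasattr(stage, "fit") else stage
        data = model.transform(data)
        fitted.append(model)
    return fitted

def run_pipeline(fitted, data):
    """Apply every fitted stage in order."""
    for model in fitted:
        data = model.transform(data)
    return data

stages = [Shift(10.0), MeanCenterer()]
fitted = fit_pipeline(stages, [1.0, 2.0, 3.0])
result = run_pipeline(fitted, [4.0])
print(result)
```

This is also why already-fitted indexers can sit in the same stage list as unfitted stages: anything that is already a transformer is simply applied.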



In [None]:
# Train model.  This also runs the indexers.

model = pipeline.fit(trainingData)



Let's make predictions.

**=> TODO: make predictions on our test data**

In [None]:

# Make predictions.
predictions = model.transform(???) #Make predictions on test data.

# Select example rows to display.
predictions.select("prediction", "indexedLabel", "scaledFeatures").show(5)




## Step 9: Evaluate the model.

Let us check to see how the model did, using accuracy as a measure.

In [None]:
# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g " % (1.0 - accuracy))
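
For reference, accuracy is the fraction of rows whose prediction equals the true label, and the test error printed above is one minus that. The plain-Python sketch below works through the arithmetic on a few hypothetical (prediction, label) pairs; it is not the evaluator's implementation.

```python
def accuracy(pairs):
    """Fraction of (prediction, label) pairs where the two values agree."""
    correct = sum(1 for predicted, actual in pairs if predicted == actual)
    return correct / len(pairs)

# Hypothetical (prediction, indexedLabel) pairs for illustration.
pairs = [(0.0, 0.0), (1.0, 0.0), (2.0, 2.0), (0.0, 0.0), (1.0, 1.0)]

acc = accuracy(pairs)
test_error = 1.0 - acc
print("Accuracy = %g, Test Error = %g" % (acc, test_error))
```

Keep in mind that accuracy can look good on imbalanced labels even when the rare classes are predicted poorly, so it is worth checking the label counts from the earlier group-by step.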


In [None]:

# The trained random forest is the last stage of the fitted pipeline.
rfModel = model.stages[-1]
# summary only
print(rfModel)