# Lending Club Classifier and .deploy() Demo

To set up Spark 2.0 (required for this demo) in a Jupyter notebook, follow these [instructions](https://github.com/combust-ml/mleap/wiki/Setting-up-a-Spark-2.0-notebook-with-MLeap-and-Toree).

This demo will show you how to:
1. Load the research dataset from S3
2. Construct a feature transformer pipeline using commonly available transformers in Spark
3. Train and deploy our classifiers to a public model server hosted on the combust.ml cloud using .deploy()

NOTE: To run the actual deploy step, you need to do one of the following:
1. Get a key from combust.ml. It's easy, just email us!
2. Fire up the combust cloud server on your local machine. This is also easy: send us an email and we'll send you a Docker image.

## Background on the Dataset

The dataset used for the demo was pulled together from the publicly available [Lending Club Statistics datasets](https://www.lendingclub.com/info/download-data.action). The original data provided by Lending Club (Issued and Rejected loans) is not standardized, so for this demo we've gone ahead and pulled together only the common set of fields for you.

### Step 0: Load libraries and data

For now, you have to download the [data](https://s3-us-west-2.amazonaws.com/mleap-demo/datasources/airbnb.avro.zip) from S3 yourself. We suggest placing it in your /tmp directory; the code below reads the dataset from `/tmp/lending_club.avro`.

Once [TOREE-345](https://issues.apache.org/jira/browse/TOREE-345) is fixed, we won't have to deal with complicated notebook setup and will just be able to include:

```scala
%AddDeps ml.combust.mleap mleap-spark_2.11 0.3.0 --transitive
```

For now, make sure to clone [mleap](https://github.com/combust-ml/mleap) and run:

```bash
sbt "+ publishLocal"
sbt publishM2
```

In [12]:
%AddDeps ml.combust.mleap mleap-spark_2.11 0.3.0-SNAPSHOT --transitive --repository file:///Users/mikhail/.m2/repository
%AddDeps com.databricks spark-avro_2.11 3.0.1

import org.apache.spark.ml.mleap.feature.OneHotEncoder
import org.apache.spark.ml.feature.{PolynomialExpansion, StandardScaler, StringIndexer, VectorAssembler}
import org.apache.spark.ml.classification.{RandomForestClassifier, LogisticRegression}
import org.apache.spark.ml.{Pipeline, PipelineStage}

Marking ml.combust.mleap:mleap-spark_2.11:0.3.0-SNAPSHOT for download
Preparing to fetch from:
-> file:/var/folders/b2/swftcs0s4bvfpqfvn3rm0pzw0000gn/T/toree_add_deps2959412719271534823/
-> file:/Users/mikhail/.m2/repository
-> https://repo1.maven.org/maven2
-> New file at /var/folders/b2/swftcs0s4bvfpqfvn3rm0pzw0000gn/T/toree_add_deps2959412719271534823/https/repo1.maven.org/maven2/com/lihaoyi/sourcecode_2.11/0.1.1/sourcecode_2.11-0.1.1.jar
-> New file at /var/folders/b2/swftcs0s4bvfpqfvn3rm0pzw0000gn/T/toree_add_deps2959412719271534823/https/repo1.maven.org/maven2/io/spray/spray-json_2.11/1.3.2/spray-json_2.11-1.3.2.jar
-> New file at /var/folders/b2/swftcs0s4bvfpqfvn3rm0pzw0000gn/T/toree_add_deps2959412719271534823/https/repo1.maven.org/maven2/com/lihaoyi/fastparse-utils_2.11/0.3.7/fastparse-utils_2.11-0.3.7.jar
-> New file at /Users/mikhail/.m2/repository/ml/combust/mleap/mleap-runtime_2.11/0.3.0-SNAPSHOT/mleap-runtime_2.11-0.3.0-SNAPSHOT.jar
-> New file at /Users/mikhail/.m2/repos

IMPORTANT!!! You may have to run this next block of code a few times to get it to work - this is due to another bug in Toree. For me, running it twice works.

In [14]:
// Step 1. Load our Lending Club dataset

val inputFile = "file:///tmp/lending_club.avro"
val outputFileRf = "/tmp/transformer.rf.ml"
val outputFileLr = "/tmp/transformer.lr.ml"

// The dataset is never reassigned, so a val is sufficient
val dataset = spark.sqlContext.read.format("com.databricks.spark.avro").
  load(inputFile)

println(dataset.count())

755473


In [20]:
dataset.select("loan_amount", "fico_score_group_fnl", "dti", "emp_length", "state", "approved", "loan_title").show(5)

+-----------+--------------------+------+----------+-----+--------+------------------+
|loan_amount|fico_score_group_fnl|   dti|emp_length|state|approved|        loan_title|
+-----------+--------------------+------+----------+-----+--------+------------------+
|     1000.0|           650 - 700|   0.1|   4 years|   NM|     0.0|Wedding/Engagement|
|     1000.0|           700 - 800|   0.1|  < 1 year|   MA|     0.0|Debt Consolidation|
|    11000.0|           700 - 800|   0.1|    1 year|   MD|     0.0|Debt Consolidation|
|     6000.0|           650 - 700|0.3864|  < 1 year|   MA|     0.0|             Other|
|     1500.0|           500 - 550|0.0943|  < 1 year|   MD|     0.0|             Other|
+-----------+--------------------+------+----------+-----+--------+------------------+
only showing top 5 rows



### Cap DTI

This capping step will be available as a custom transformer. For now, we apply it with Spark SQL, capping `dti` at 10.0.
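
The cap used in this notebook replaces any `dti` of 10.0 or more with 10.0, which is the same as taking the minimum of `dti` and the cap. A minimal sketch of that per-row logic in plain Scala follows; the name `capDti` and the use of `Option` to model NULLs are assumptions for illustration, not part of MLeap or Spark.

```scala
// Hypothetical helper: cap a debt-to-income value at an upper bound.
// None stands in for a SQL NULL, which a CASE WHEN expression passes through unchanged.
def capDti(dti: Option[Double], cap: Double = 10.0): Option[Double] =
  dti.map(value => math.min(value, cap))

val sample = Seq(Some(0.1), Some(0.3864), Some(43.78), None)
val capped = sample.map(capDti(_))
println(capped) // List(Some(0.1), Some(0.3864), Some(10.0), None)
```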

In [None]:
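// Register the raw dataset as a temporary view so it can be queried with SQL
dataset.createOrReplaceTempView("df")

// Cap dti at 10.0; later cells refer to this DataFrame as datasetImputed
val datasetImputed = spark.sqlContext.sql("""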
    select
        loan_amount,
        fico_score_group_fnl,
        case when dti >= 10.0
            then 10.0
            else dti
        end as dti,
        emp_length,
        state,
        loan_title,
        approved
    from df
""")

### Let's take a look at some summary statistics

In [3]:
// Most popular states (original dataset)

spark.sqlContext.sql(f"""
    select 
        state,
        count(*) as n,
        cast(avg(loan_amount) as decimal(12,2)) as loan_amount,
        cast(avg(dti) as decimal(12,2)) as dti,
        cast(avg(approved) as decimal(12,2)) as approved
    from df
    group by state
    order by count(*) desc
""").show(15)

+-----+-----+-----------+-----+--------+
|state|    n|loan_amount|  dti|approved|
+-----+-----+-----------+-----+--------+
|   CA|99793|   13687.90|14.38|    0.00|
|   TX|62049|   13165.46| 8.94|    0.00|
|   NY|60715|   13244.03|10.81|    0.00|
|   FL|60051|   12488.73| 8.85|    0.00|
|   PA|33167|   12776.74| 7.87|    0.00|
|   IL|31487|   13224.26| 8.45|    0.00|
|   GA|29000|   12362.53|11.25|    0.00|
|   OH|28511|   12159.61| 7.90|    0.00|
|   NJ|27665|   13935.15|10.38|    0.00|
|   VA|23556|   12950.66|12.43|    0.00|
|   MI|20696|   12641.24| 8.77|    0.00|
|   NC|20389|   12588.16| 4.59|    0.00|
|   MA|18808|   12456.19| 9.57|    0.00|
|   MD|17859|   12100.79| 7.52|    0.00|
|   AZ|16281|   12820.07| 9.81|    0.00|
+-----+-----+-----------+-----+--------+
only showing top 15 rows



In [1]:
// Most popular loan titles (original dataset)

spark.sqlContext.sql(f"""
    select 
        loan_title,
        count(*) as n,
        cast(avg(loan_amount) as decimal(12,2)) as loan_amount,
        cast(avg(dti) as decimal(12,2)) as dti,
        cast(avg(approved) as decimal(12,2)) as approved
    from df
    group by loan_title
    order by count(*) desc
""").show(15)

+--------------------+------+-----------+-----+--------+
|          loan_title|     n|loan_amount|  dti|approved|
+--------------------+------+-----------+-----+--------+
|  Debt Consolidation|293152|   14973.08| 2.99|    0.00|
|               Other|179838|    9404.06|23.50|    0.00|
|Home/Home Improve...| 59073|   15292.14| 2.92|    0.00|
|  Payoff Credit Card| 57827|   16015.58| 4.29|    0.00|
|    Car Payment/Loan| 46369|   10368.69| 3.49|    0.00|
|       Business Loan| 41286|   18215.65|13.60|    0.00|
|      Health/Medical| 20129|    7452.36| 4.80|    0.00|
|              Moving| 18460|    6638.31|14.80|    0.00|
|  Wedding/Engagement| 12867|   10197.83| 4.57|    0.00|
|            Vacation|  9729|    5627.50| 6.62|    0.00|
|             College|  7631|    7974.38|43.78|    0.00|
|    Renewable Energy|  3165|    9794.15|11.15|    0.00|
|        Payoff Bills|  2302|   10826.49| 1.56|    0.00|
|       Personal Loan|  2256|    9496.62| 0.43|    0.00|
|          Motorcycle|   520|  

### Step 2: Define continuous and categorical features and filter nulls

In [None]:
// Step 2. Define the continuous and categorical feature columns
val continuousFeatures = Array("loan_amount",
  "dti")

val categoricalFeatures = Array("loan_title",
  "emp_length",
  "state",
  "fico_score_group_fnl")

val allFeatures = continuousFeatures.union(categoricalFeatures)

In [None]:
// Filter all null values
val allCols = allFeatures.union(Seq("approved")).map(datasetImputed.col)
val nullFilter = allCols.map(_.isNotNull).reduce(_ && _)
val datasetImputedFiltered = datasetImputed.select(allCols: _*).filter(nullFilter).persist()

println(datasetImputedFiltered.count())
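
The `reduce(_ && _)` call above folds one `isNotNull` condition per column into a single conjunction, so a row survives only if every feature and the label are present. The same idea on plain Scala collections, as a rough sketch (the `Application` case class and its fields are hypothetical stand-ins for DataFrame rows):

```scala
// Hypothetical record type; None stands in for a NULL column value.
case class Application(loanAmount: Option[Double], dti: Option[Double], approved: Option[Double])

// One "is present" predicate per column, mirroring allCols.map(_.isNotNull).
val predicates: Seq[Application => Boolean] =
  Seq(_.loanAmount.isDefined, _.dti.isDefined, _.approved.isDefined)

// Fold the predicates into a single conjunction, mirroring reduce(_ && _).
def allPresent(row: Application): Boolean = predicates.map(p => p(row)).reduce(_ && _)

val rows = Seq(
  Application(Some(1000.0), Some(0.1), Some(0.0)),
  Application(Some(6000.0), None, Some(0.0))
)
val kept = rows.filter(allPresent)
println(kept.size)
```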

### Step 3: Split data into training and validation

In [None]:
val Array(trainingDataset, validationDataset) = datasetImputedFiltered.randomSplit(Array(0.7, 0.3))

### Step 4: Continuous Feature Pipeline

In [None]:
val continuousFeatureAssembler = new VectorAssembler(uid = "continuous_feature_assembler").
    setInputCols(continuousFeatures).
    setOutputCol("unscaled_continuous_features")

val continuousFeatureScaler = new StandardScaler(uid = "continuous_feature_scaler").
    setInputCol("unscaled_continuous_features").
    setOutputCol("scaled_continuous_features")

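// PolynomialExpansion operates on a Vector column, so it reads the assembled
// continuous features (loan_amount and dti) rather than the raw loan_amount column
val ContinuousFeaturePolynomialExpansion = new PolynomialExpansion(uid = "polynomial_expansion_loan_amount").
    setInputCol("unscaled_continuous_features").
    setOutputCol("loan_amount_polynomial_expansion")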
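
To make the scaling step concrete: with Spark's default settings (`withStd = true`, `withMean = false`), `StandardScaler` divides each feature by its sample standard deviation without centering it. The plain-Scala sketch below reproduces that arithmetic for a single column; the helper names are hypothetical and the values are the five `loan_amount` rows shown earlier.

```scala
// Corrected (n - 1) sample standard deviation of one column.
def sampleStd(xs: Seq[Double]): Double = {
  val mean = xs.sum / xs.size
  math.sqrt(xs.map(x => math.pow(x - mean, 2)).sum / (xs.size - 1))
}

// Divide every value by the column's standard deviation (no mean centering).
def scaleToUnitStd(xs: Seq[Double]): Seq[Double] = {
  val std = sampleStd(xs)
  // A constant column has std 0; map it to 0.0 instead of dividing by zero.
  xs.map(x => if (std == 0.0) 0.0 else x / std)
}

val loanAmounts = Seq(1000.0, 1000.0, 11000.0, 6000.0, 1500.0)
val scaled = scaleToUnitStd(loanAmounts)
println(scaled)
```

After scaling, the column has unit standard deviation but keeps its original ordering and sign.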

### Step 5: Categorical Feature Pipeline

In [None]:
val categoricalFeatureIndexers = categoricalFeatures.map {
    feature => new StringIndexer(uid = s"string_indexer_$feature").
      setInputCol(feature).
      setOutputCol(s"${feature}_index")
}

val categoricalFeatureOneHotEncoders = categoricalFeatureIndexers.map {
    indexer => new OneHotEncoder(uid = s"oh_encoder_${indexer.getOutputCol}").
      setInputCol(indexer.getOutputCol).
      setOutputCol(s"${indexer.getOutputCol}_oh")
}
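
For intuition about what these two stages produce: Spark's `StringIndexer` assigns index 0 to the most frequent label, and Spark's `OneHotEncoder` drops the last category by default, so the final label is encoded as all zeros. The sketch below mimics this on a handful of `state` values in plain Scala. It assumes the MLeap `OneHotEncoder` imported above follows the same default, and the alphabetical tie-break is an assumption made only to keep the example deterministic.

```scala
// Plain-Scala sketch of frequency-based indexing followed by one-hot encoding.
val states = Seq("MA", "NM", "MD", "MA", "MD", "MA")

// Most frequent label gets index 0; ties are broken alphabetically here (an assumption).
val labels: Seq[String] = states
  .groupBy(identity)
  .toSeq
  .sortBy { case (label, rows) => (-rows.size, label) }
  .map(_._1)
val index: Map[String, Int] = labels.zipWithIndex.toMap

// One slot per label except the last, which is encoded as all zeros.
def oneHot(i: Int, numLabels: Int): Seq[Double] =
  Seq.tabulate(numLabels - 1)(slot => if (slot == i) 1.0 else 0.0)

states.distinct.foreach { s =>
  println(s"$s -> ${index(s)} -> ${oneHot(index(s), labels.size)}")
}
```

The random forest in this notebook consumes the indexed columns directly, while the logistic regression consumes the one-hot vectors.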

### Step 6: Assemble our features and feature pipeline

In [None]:
val featureColsRf = categoricalFeatureIndexers.map(_.getOutputCol).union(Seq("scaled_continuous_features", "loan_amount_polynomial_expansion"))
val featureColsLr = categoricalFeatureOneHotEncoders.map(_.getOutputCol).union(Seq("scaled_continuous_features"))

// Assemble all processed categorical and continuous features into a single feature vector
val featureAssemblerLr = new VectorAssembler(uid = "feature_assembler_lr").
    setInputCols(featureColsLr).
    setOutputCol("features_lr")
    
val featureAssemblerRf = new VectorAssembler(uid = "feature_assembler_rf").
    setInputCols(featureColsRf).
    setOutputCol("features_rf")

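// The polynomial expansion stage must be part of the pipeline because features_rf reads its output column
val estimators: Array[PipelineStage] = Array[PipelineStage](continuousFeatureAssembler, continuousFeatureScaler, ContinuousFeaturePolynomialExpansion).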
    union(categoricalFeatureIndexers).
    union(categoricalFeatureOneHotEncoders).
    union(Seq(featureAssemblerLr, featureAssemblerRf))

val featurePipeline = new Pipeline(uid = "feature_pipeline").
    setStages(estimators)
val sparkFeaturePipelineModel = featurePipeline.fit(datasetImputedFiltered)

println("Finished constructing the pipeline")

### Step 7: Train Random Forest Classifier

In [None]:
// Step 7. Create our random forest model
val randomForest = new RandomForestClassifier(uid = "random_forest_classifier").
    setFeaturesCol("features_rf").
    setLabelCol("approved").
    setPredictionCol("approved_prediction")

val sparkPipelineEstimatorRf = new Pipeline().setStages(Array(sparkFeaturePipelineModel, randomForest))
val sparkPipelineRf = sparkPipelineEstimatorRf.fit(datasetImputedFiltered)

println("Complete: Training Random Forest")

### Step 8: Train Logistic Regression Model

In [None]:
val logisticRegression = new LogisticRegression(uid = "logistic_regression").
    setFeaturesCol("features_lr").
    setLabelCol("approved").
    setPredictionCol("approved_prediction")

val sparkPipelineEstimatorLr = new Pipeline().setStages(Array(sparkFeaturePipelineModel, logisticRegression))
val sparkPipelineLr = sparkPipelineEstimatorLr.fit(datasetImputedFiltered)

println("Complete: Training Logistic Regression")

### Step 9: Load the libraries for .deploy()

In [None]:
import ml.combust.bundle.BundleRegistry
import ml.combust.mleap.spark.SparkSupport._
import ml.combust.client.spark.SparkSupport._
import ml.combust.client.model.MultiModelClient
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import scala.concurrent.duration._
import scala.concurrent.Await

implicit val hr = BundleRegistry("spark")

### Step 10: Set up the ActorSystem

In [None]:
implicit val system = ActorSystem("combust-client")
implicit val materializer = ActorMaterializer()
implicit val ec = system.dispatcher

### Step 11: Define model servers to send your models to

In [None]:
val client = MultiModelClient.builder[String]().
  withServer("model-server-1", "http://combust-model-server-4671c502.124d9c7a.svc.dockerapp.io:8889").
  withServer("model-server-2", "http://combust-model-server-9ac83b2d.801b9d51.svc.dockerapp.io:8890").
  build()

### Step 12: Deploy your LR and RF pipelines to the model servers

In [None]:
Await.result(client.deploy(sparkPipelineLr, "airbnb_lr").sequence, 10.seconds)

In [None]:
Await.result(client.deploy(sparkPipelineRf, "airbnb_rf").sequence, 10.seconds)
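
`Await.result` blocks the cell until the future completes or the 10-second timeout elapses, in which case it throws a `TimeoutException`. The `.sequence` call presumably collapses one future per model server into a single future, much like the standard library's `Future.sequence`. The sketch below shows that pattern with plain Scala futures only: `fakeDeploy` and `demo_model` are hypothetical stand-ins, not the combust client API.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical stand-in for one deploy call per model server.
def fakeDeploy(server: String, model: String): Future[String] =
  Future(s"$model deployed to $server")

val servers = Seq("model-server-1", "model-server-2")

// Turn Seq[Future[String]] into Future[Seq[String]], then block for at most 10 seconds.
val all: Future[Seq[String]] = Future.sequence(servers.map(fakeDeploy(_, "demo_model")))
val results = Await.result(all, 10.seconds)
results.foreach(println)
```

If you try this inside the notebook session, drop the `Implicits.global` import and rely on the `ec` defined in Step 10, since two implicit execution contexts in scope would be ambiguous.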

### Step 13 (Optional): Serialize your models to bundle.ml

In [None]:
sparkPipelineLr.serializeToBundle(new java.io.File("/tmp/model.lr"))
sparkPipelineRf.serializeToBundle(new java.io.File("/tmp/model.rf"))

### Step 14 (Optional): Undeploy your models from the model server

In [None]:
Await.result(client.undeploy("airbnb_rf").sequence, 10.seconds)

In [None]:
Await.result(client.undeploy("airbnb_lr").sequence, 10.seconds)