# Lending Club Classifier and .deploy() Demo

To set up Spark 2.0 (required for this demo) in a Jupyter notebook, follow these [instructions](https://github.com/combust-ml/mleap/wiki/Setting-up-a-Spark-2.0-notebook-with-MLeap-and-Toree).

This demo will show you how to:
1. Load the research dataset from S3
2. Construct a feature transformer pipeline using commonly available transformers in Spark
3. Train and deploy our classifiers to a public model server hosted on the combust.ml cloud using .deploy()

NOTE: To run the actual deploy step you have to either:
1. Get a key from combust.ml - it's easy, just email us!
2. Fire up the combust cloud server on your local machine - also easy, send us an email and we'll send you a docker image.

## Background on the Dataset

The dataset used for this demo was assembled from the publicly available [Lending Club Statistics datasets](https://www.lendingclub.com/info/download-data.action). The original data provided by Lending Club (Issued and Rejected loans) is not standardized, so for this demo we've kept only the set of fields common to both.

### Step 0: Load libraries and data

For now, you have to download the [data](https://s3-us-west-2.amazonaws.com/mleap-demo/datasources/airbnb.avro.zip) from S3 yourself. We suggest placing it in your /tmp directory.
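
Because the load step reads from a fixed local path, it can help to confirm the file is in place before starting Spark. The following is a minimal sketch, not part of the original notebook: `locateDataset` is a hypothetical helper, and the path `/tmp/lending_club.avro` is the one used by the load cell later in this notebook.

```scala
import java.nio.file.{Files, Path, Paths}

// Hypothetical helper (not from the original notebook): returns Right(path)
// if a file or directory exists at the given location, Left(message) otherwise.
def locateDataset(location: String): Either[String, Path] = {
  val path = Paths.get(location)
  if (Files.exists(path)) Right(path)
  else Left(s"Dataset not found at $location; download and unzip it first")
}

// Path taken from the load step of this notebook.
locateDataset("/tmp/lending_club.avro") match {
  case Right(p)  => println(s"Found dataset at $p")
  case Left(msg) => println(msg)
}
```

Returning an `Either` keeps the check side-effect free, so the caller decides whether a missing file should stop the notebook or just print a warning.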

Once [TOREE-345](https://issues.apache.org/jira/browse/TOREE-345) is fixed, we won't have to deal with complicated notebook setup and will just be able to include:

```scala
%AddDeps ml.combust.mleap mleap-spark_2.11 0.3.0 --transitive
```

For now, make sure to clone [mleap](https://github.com/combust-ml/mleap) and run:

```bash
sbt "+ publishLocal"
sbt publishM2
```

In [12]:
%AddDeps ml.combust.mleap mleap-spark_2.11 0.3.0-SNAPSHOT --transitive --repository file:///Users/mikhail/.m2/repository
%AddDeps com.databricks spark-avro_2.11 3.0.1

import org.apache.spark.ml.mleap.feature.OneHotEncoder
import org.apache.spark.ml.feature.{StandardScaler, StringIndexer, VectorAssembler}
import org.apache.spark.ml.classification.{RandomForestClassifier, LogisticRegression}
import org.apache.spark.ml.{Pipeline, PipelineStage}

Marking ml.combust.mleap:mleap-spark_2.11:0.3.0-SNAPSHOT for download
Preparing to fetch from:
-> file:/var/folders/b2/swftcs0s4bvfpqfvn3rm0pzw0000gn/T/toree_add_deps2959412719271534823/
-> file:/Users/mikhail/.m2/repository
-> https://repo1.maven.org/maven2
-> New file at /var/folders/b2/swftcs0s4bvfpqfvn3rm0pzw0000gn/T/toree_add_deps2959412719271534823/https/repo1.maven.org/maven2/com/lihaoyi/sourcecode_2.11/0.1.1/sourcecode_2.11-0.1.1.jar
-> New file at /var/folders/b2/swftcs0s4bvfpqfvn3rm0pzw0000gn/T/toree_add_deps2959412719271534823/https/repo1.maven.org/maven2/io/spray/spray-json_2.11/1.3.2/spray-json_2.11-1.3.2.jar
-> New file at /var/folders/b2/swftcs0s4bvfpqfvn3rm0pzw0000gn/T/toree_add_deps2959412719271534823/https/repo1.maven.org/maven2/com/lihaoyi/fastparse-utils_2.11/0.3.7/fastparse-utils_2.11-0.3.7.jar
-> New file at /Users/mikhail/.m2/repository/ml/combust/mleap/mleap-runtime_2.11/0.3.0-SNAPSHOT/mleap-runtime_2.11-0.3.0-SNAPSHOT.jar
-> New file at /Users/mikhail/.m2/repos

IMPORTANT: You may have to run the next block of code a few times before it works; this is due to another bug in Toree. For me, running it twice works.

In [14]:
// Step 1. Load our Lending Club dataset

val inputFile = "file:///tmp/lending_club.avro"
val outputFileRf = "/tmp/transformer.rf.ml"
val outputFileLr = "/tmp/transformer.lr.ml"

var dataset = spark.sqlContext.read.format("com.databricks.spark.avro").
  load(inputFile)

println(dataset.count())

755473
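
Before selecting specific fields, it can be useful to confirm that the loaded data actually contains them. The sketch below assumes the column names used in the next cell of this notebook; `missingColumns` is a hypothetical helper written in plain Scala so it can be checked without a Spark session.

```scala
// Hypothetical helper (not from the original notebook): lists the required
// column names that are absent from the available ones, in the order required.
def missingColumns(available: Seq[String], required: Seq[String]): Seq[String] =
  required.filterNot(c => available.contains(c))

// Column names taken from the select call in this notebook.
val requiredColumns = Seq("loan_amount", "fico_score_group_fnl", "dti",
  "emp_length", "state", "approved", "loan_title")

// In the notebook, pass the DataFrame's columns as the first argument:
//   missingColumns(dataset.columns, requiredColumns)
```

An empty result means every expected field is present; any names returned point to a mismatch between the downloaded file and the fields this demo relies on.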


In [19]:
dataset.select("loan_amount", "fico_score_group_fnl", "dti", "emp_length", "state", "approved", "loan_title").show(5)

+-----------+--------------------+------+----------+-----+--------+------------------+
|loan_amount|fico_score_group_fnl|   dti|emp_length|state|approved|        loan_title|
+-----------+--------------------+------+----------+-----+--------+------------------+
|     1000.0|           650 - 700|   0.1|   4 years|   NM|     0.0|Wedding/Engagement|
|     1000.0|           700 - 800|   0.1|  < 1 year|   MA|     0.0|Debt Consolidation|
|    11000.0|           700 - 800|   0.1|    1 year|   MD|     0.0|Debt Consolidation|
|     6000.0|           650 - 700|0.3864|  < 1 year|   MA|     0.0|             Other|
|     1500.0|           500 - 550|0.0943|  < 1 year|   MD|     0.0|             Other|
+-----------+--------------------+------+----------+-----+--------+------------------+
only showing top 5 rows
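
All five rows in the preview above have `approved` equal to 0.0, so it may be worth checking how the label is distributed across the full dataset before training a classifier. The following is a minimal sketch that assumes `approved` is the label column; `approvalCounts` is a hypothetical helper, not part of the original notebook.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper (not from the original notebook): counts rows per
// distinct value of the "approved" column, sorted by that value.
def approvalCounts(df: DataFrame): DataFrame =
  df.groupBy("approved").count().orderBy("approved")

// In the notebook, apply it to the loaded dataset:
//   approvalCounts(dataset).show()
```

A heavily skewed split would suggest looking beyond plain accuracy when evaluating the classifiers.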

