# MLeap.deploy() Demo

To set up Spark 2.0 (required for this demo) to run from a Jupyter notebook, follow these [instructions](https://github.com/combust-ml/mleap/wiki/Setting-up-a-Spark-2.0-notebook-with-MLeap-and-Toree).

This demo will show you how to:
1. Load the research dataset from S3
2. Construct a feature transformer pipeline using commonly available transformers in Spark
3. Deploy your model to a public model server hosted on the combust.ml cloud using .deploy()

NOTE: To run the actual deploy step you have to either:
1. Get a key from combust.ml - it's easy, just email us!
2. Fire up the combust cloud server on your local machine - also easy, send us an email and we'll send you a docker image.

## Background on the Dataset

The dataset used for the demo was pulled together from individual cities' data found [here](http://insideairbnb.com/get-the-data.html). We've also gone ahead and pulled the individual datasets and relevant features into this [research dataset](https://s3-us-west-2.amazonaws.com/mleap-demo/datasources/airbnb.avro.zip) stored as avro.

### Step 0: Load libraries and data

For now, you have to download the [data](https://s3-us-west-2.amazonaws.com/mleap-demo/datasources/airbnb.avro.zip) from S3 yourself. We suggest unzipping it into your /tmp directory.

In [1]:
// %AddDeps ml.combust.mleap mleap-spark_2.11 0.3.0 --transitive

import org.apache.spark.ml.mleap.feature.OneHotEncoder
import org.apache.spark.ml.feature.{StandardScaler, StringIndexer, VectorAssembler}
import org.apache.spark.ml.regression.{RandomForestRegressor, LinearRegression}
import org.apache.spark.ml.{Pipeline, PipelineStage}

IMPORTANT: You may have to run the next block of code a few times before it works. This is due to a bug in Toree. For me, running it twice works.

In [2]:
// Step 1. Load our Airbnb dataset

val inputFile = "file:////tmp/airbnb.avro"
val outputFileRf = "/tmp/transformer.rf.ml"
val outputFileLr = "/tmp/transformer.lr.ml"

var dataset = spark.sqlContext.read.format("com.databricks.spark.avro").
  load(inputFile)

var datasetFiltered = dataset.filter("price >= 50 AND price <= 750 and bathrooms > 0.0")
println(dataset.count())
println(datasetFiltered.count())

389255
321588


### Step 1: Standardize the data for our demo 


In [3]:
datasetFiltered.registerTempTable("df")

val datasetImputed = spark.sqlContext.sql(f"""
    select
        id,
        city,
        case when state in('NY', 'CA', 'London', 'Berlin', 'TX' ,'IL', 'OR', 'DC', 'WA')
            then state
            else 'Other'
        end as state,
        space,
        price,
        bathrooms,
        bedrooms,
        room_type,
        host_is_superhost,
        cancellation_policy,
        case when security_deposit is null
            then 0.0
            else security_deposit
        end as security_deposit,
        price_per_bedroom,
        case when number_of_reviews is null
            then 0.0
            else number_of_reviews
        end as number_of_reviews,
        case when extra_people is null
            then 0.0
            else extra_people
        end as extra_people,
        instant_bookable,
        case when cleaning_fee is null
            then 0.0
            else cleaning_fee
        end as cleaning_fee,
        case when review_scores_rating is null
            then 80.0
            else review_scores_rating
        end as review_scores_rating,
        case when square_feet is not null and square_feet > 100
            then square_feet
            when (square_feet is null or square_feet <=100) and (bedrooms is null or bedrooms = 0)
            then 350.0
            else 380 * bedrooms
        end as square_feet
    from df
    where bedrooms is not null
""")


datasetImputed.select("square_feet", "price", "bedrooms", "bathrooms", "cleaning_fee").describe().show()

+-------+------------------+------------------+------------------+------------------+-----------------+
|summary|       square_feet|             price|          bedrooms|         bathrooms|     cleaning_fee|
+-------+------------------+------------------+------------------+------------------+-----------------+
|  count|            321588|            321588|            321588|            321588|           321588|
|   mean| 546.7441757777032|131.54961006007687|1.3352426085550455| 1.199068373198005|37.64188340360959|
| stddev|363.39839582374066| 90.10912788720098|0.8466586601060778|0.4830590051262673|42.64237791484579|
|    min|             104.0|              50.0|               0.0|               0.5|              0.0|
|    max|           32292.0|             750.0|              10.0|               8.0|            700.0|
+-------+------------------+------------------+------------------+------------------+-----------------+
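The imputation rules in the query above can also be read as plain functions. The following is a minimal sketch in plain Scala (no Spark required); `imputeSquareFeet` and `orDefault` are hypothetical names introduced here for illustration and do not appear in the notebook.

```scala
// Mirrors the square_feet CASE expression; None stands in for SQL NULL.
// bedrooms is a plain Double because the query keeps only non-null bedrooms.
def imputeSquareFeet(squareFeet: Option[Double], bedrooms: Double): Double =
  squareFeet match {
    case Some(sf) if sf > 100 => sf             // keep reported values above 100
    case _ if bedrooms == 0   => 350.0          // missing or tiny value, no bedrooms
    case _                    => 380 * bedrooms // missing or tiny value, estimate from bedrooms
  }

// Mirrors the "case when x is null then <default> else x end" columns.
def orDefault(value: Option[Double], default: Double): Double =
  value.getOrElse(default)

val reported  = imputeSquareFeet(Some(500.0), 2.0)
val noBedroom = imputeSquareFeet(None, 0.0)
val estimated = imputeSquareFeet(Some(50.0), 3.0)
val rating    = orDefault(None, 80.0)
```

Here a listing that reports 50 square feet and 3 bedrooms is treated as unreliable and replaced with the bedroom-based estimate.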



### Step 1.1: Take a look at some summary statistics of the data

In [4]:
// Most popular states (filtered dataset, which is what "df" refers to at this point)

spark.sqlContext.sql(f"""
    select 
        state,
        count(*) as n,
        cast(avg(price) as decimal(12,2)) as avg_price,
        max(price) as max_price
    from df
    group by state
    order by count(*) desc
""").show()

+-------------+-----+---------+---------+
|        state|    n|avg_price|max_price|
+-------------+-----+---------+---------+
|           NY|48362|   146.75|    750.0|
|           CA|44716|   158.76|    750.0|
|Île-de-France|40732|   107.74|    750.0|
|       London|17532|   117.71|    750.0|
|          NSW|14416|   167.96|    750.0|
|       Berlin|13098|    81.01|    650.0|
|Noord-Holland| 8890|   128.56|    750.0|
|          VIC| 8636|   144.49|    750.0|
|North Holland| 7636|   134.60|    700.0|
|           IL| 7544|   141.85|    750.0|
|           ON| 7186|   129.05|    750.0|
|           TX| 6702|   196.59|    750.0|
|           WA| 5858|   132.48|    750.0|
|    Catalonia| 5748|   106.39|    720.0|
|           BC| 5522|   133.14|    750.0|
|           DC| 5476|   136.56|    720.0|
|       Québec| 5116|   104.98|    700.0|
|    Catalunya| 4570|    99.36|    675.0|
|       Veneto| 4486|   131.71|    700.0|
|           OR| 4330|   114.02|    700.0|
+-------------+-----+---------+---------+

In [5]:
// Most expensive popular cities (original dataset)
dataset.registerTempTable("df")

spark.sqlContext.sql(f"""
    select 
        city,
        count(*) as n,
        cast(avg(price) as decimal(12,2)) as avg_price,
        max(price) as max_price
    from df
    group by city
    order by avg(price) desc
""").filter("n>25").show()

+--------------------+---+---------+---------+
|                city|  n|avg_price|max_price|
+--------------------+---+---------+---------+
|          Palm Beach| 68|   491.28|   1500.0|
|              Malibu|337|   377.53|   4500.0|
|   Pacific Palisades| 36|   326.00|    850.0|
|         Watsonville| 80|   319.70|    782.0|
|       Darling Point| 65|   309.03|   2001.0|
|       Bilgola Beach| 32|   300.44|    890.0|
|        Avalon Beach| 88|   278.93|   1000.0|
|              Avalon| 82|   270.15|    850.0|
|             Del Mar| 40|   266.20|    900.0|
|            Tamarama|153|   258.26|   1000.0|
|       Playa Del Rey| 34|   255.76|    599.0|
|            La Jolla|124|   254.70|   2400.0|
| Rancho Palos Verdes| 85|   253.44|   1250.0|
|     Manhattan Beach|249|   252.19|   1000.0|
|La CañAda Flintridge| 32|   250.88|    900.0|
| Sydney Olympic Park| 40|   250.55|    520.0|
|              Mosman|239|   246.82|   3701.0|
|            Capitola| 72|   246.50|    650.0|
|          Bi

### Step 2: Define continuous and categorical features and filter nulls

In [25]:
// Step 2. Create our feature pipeline and train it on the entire dataset
val continuousFeatures = Array("bathrooms",
  "bedrooms",
  "security_deposit",
  "cleaning_fee",
  "extra_people",
  "number_of_reviews",
  "square_feet",
  "review_scores_rating")

val categoricalFeatures = Array("room_type",
  "host_is_superhost",
  "cancellation_policy",
  "instant_bookable",
  "state")

val allFeatures = continuousFeatures.union(categoricalFeatures)

In [7]:
// Filter all null values
val allCols = allFeatures.union(Seq("price")).map(datasetImputed.col)
val nullFilter = allCols.map(_.isNotNull).reduce(_ && _)
val datasetImputedFiltered = datasetImputed.select(allCols: _*).filter(nullFilter).persist()

println(datasetImputedFiltered.count())

321588


### Step 3: Split data into training and validation 

In [8]:
val Array(trainingDataset, validationDataset) = datasetImputedFiltered.randomSplit(Array(0.7, 0.3))
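Because `randomSplit` assigns each row at random, the two sets are only approximately 70% and 30% of the data, and they change between runs unless a seed is supplied (Spark's `randomSplit` has an overload that takes a seed as its second argument). The sketch below illustrates the idea in plain Scala. `splitRows` is a hypothetical helper, and this is not Spark's implementation.

```scala
import scala.util.Random

// Assign each row to the training side with probability trainFraction.
// A fixed seed makes the assignment reproducible.
def splitRows[A](rows: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
  val rng = new Random(seed)
  rows.partition(_ => rng.nextDouble() < trainFraction)
}

val rows = (1 to 1000).toVector
val (training, validation) = splitRows(rows, trainFraction = 0.7, seed = 42L)
val (trainingAgain, _)     = splitRows(rows, trainFraction = 0.7, seed = 42L)
```

Every row lands in exactly one of the two sets, and repeating the call with the same seed reproduces the same split.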

### Step 4: Continuous Feature Pipeline

In [9]:
val continuousFeatureAssembler = new VectorAssembler(uid = "continuous_feature_assembler").
    setInputCols(continuousFeatures).
    setOutputCol("unscaled_continuous_features")

val continuousFeatureScaler = new StandardScaler(uid = "continuous_feature_scaler").
    setInputCol("unscaled_continuous_features").
    setOutputCol("scaled_continuous_features")
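To make the scaling step concrete: with Spark's default `StandardScaler` settings (`withStd = true`, `withMean = false`), each feature is divided by its sample standard deviation and is not centered. The following plain-Scala sketch shows that computation for a single feature column. `sampleStd` and `scaleToUnitStd` are hypothetical names, and this is an illustration rather than Spark's code.

```scala
// Corrected (n - 1) sample standard deviation of one feature column.
def sampleStd(xs: Seq[Double]): Double = {
  val mean = xs.sum / xs.size
  math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / (xs.size - 1))
}

// Divide every value by the column's standard deviation (no mean centering).
// A constant column has zero deviation and is mapped to 0.0.
def scaleToUnitStd(xs: Seq[Double]): Seq[Double] = {
  val std = sampleStd(xs)
  if (std == 0.0) xs.map(_ => 0.0) else xs.map(_ / std)
}

val bedrooms = Seq(2.0, 4.0, 6.0)
val scaled   = scaleToUnitStd(bedrooms)
```

For the values 2, 4 and 6 the sample standard deviation is 2, so the scaled column is 1, 2 and 3.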

### Step 5: Categorical Feature Pipeline

In [10]:
val categoricalFeatureIndexers = categoricalFeatures.map {
    feature => new StringIndexer(uid = s"string_indexer_$feature").
      setInputCol(feature).
      setOutputCol(s"${feature}_index")
}
val categoricalFeatureOneHotEncoders = categoricalFeatureIndexers.map {
    indexer => new OneHotEncoder(uid = s"oh_encoder_${indexer.getOutputCol}").
      setInputCol(indexer.getOutputCol).
      setOutputCol(s"${indexer.getOutputCol}_oh")
}
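As a rough illustration of what these two stages compute, the sketch below indexes labels by descending frequency (as Spark's `StringIndexer` does by default) and then one-hot encodes the index while dropping the last category (as Spark's `OneHotEncoder` does by default). The notebook uses the MLeap variant of `OneHotEncoder`, and the assumption here is that it behaves the same way. `fitStringIndex` and `oneHot` are hypothetical names, and ties are broken alphabetically only to keep the sketch deterministic.

```scala
// Most frequent label gets index 0, the next most frequent gets 1, and so on.
def fitStringIndex(values: Seq[String]): Map[String, Int] =
  values.groupBy(identity).toSeq
    .map { case (label, occurrences) => (label, occurrences.size) }
    .sortBy { case (label, count) => (-count, label) }
    .map { case (label, _) => label }
    .zipWithIndex
    .toMap

// One slot per category except the last, which is encoded as all zeros.
def oneHot(index: Int, numCategories: Int): Vector[Double] =
  Vector.tabulate(numCategories - 1)(i => if (i == index) 1.0 else 0.0)

val roomTypes = Seq("Entire home/apt", "Private room", "Entire home/apt",
                    "Shared room", "Private room", "Entire home/apt")
val index     = fitStringIndex(roomTypes)
val encoded   = roomTypes.map(r => oneHot(index(r), index.size))
```

With three room types, each encoded vector has two slots, and the least frequent type ("Shared room") is represented by the all-zero vector.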

### Step 6: Assemble our features and feature pipeline

Note that we have slightly different feature pipelines for LR and RF: the LR pipeline uses the one-hot encoded categorical features, while the RF pipeline uses the string-indexed categorical features directly. Both include the scaled continuous features. This is done purely for demonstration purposes.

In [11]:
val featureColsRf = categoricalFeatureIndexers.map(_.getOutputCol).union(Seq("scaled_continuous_features"))
val featureColsLr = categoricalFeatureOneHotEncoders.map(_.getOutputCol).union(Seq("scaled_continuous_features"))

// assemble all processed categorical and continuous features into a single feature vector
val featureAssemblerLr = new VectorAssembler(uid = "feature_assembler_lr").
    setInputCols(featureColsLr).
    setOutputCol("features_lr")
val featureAssemblerRf = new VectorAssembler(uid = "feature_assembler_rf").
    setInputCols(featureColsRf).
    setOutputCol("features_rf")

val estimators: Array[PipelineStage] = Array(continuousFeatureAssembler, continuousFeatureScaler).
    union(categoricalFeatureIndexers).
    union(categoricalFeatureOneHotEncoders).
    union(Seq(featureAssemblerLr, featureAssemblerRf))

val featurePipeline = new Pipeline(uid = "feature_pipeline").
    setStages(estimators)

val sparkFeaturePipelineModel = featurePipeline.fit(datasetImputedFiltered)

println("Finished constructing the pipeline")

Finished constructing the pipeline


### Step 7: Train Random Forest Model

In [12]:
// Create our random forest model
val randomForest = new RandomForestRegressor(uid = "random_forest_regression").
    setFeaturesCol("features_rf").
    setLabelCol("price").
    setPredictionCol("price_prediction")

val sparkPipelineEstimatorRf = new Pipeline().setStages(Array(sparkFeaturePipelineModel, randomForest))
val sparkPipelineRf = sparkPipelineEstimatorRf.fit(datasetImputedFiltered)

println("Complete: Training Random Forest")

Complete: Training Random Forest


### Step 8: Train Linear Regression Model

In [13]:
// Create our linear regression model
val linearRegression = new LinearRegression(uid = "linear_regression").
    setFeaturesCol("features_lr").
    setLabelCol("price").
    setPredictionCol("price_prediction")

val sparkPipelineEstimatorLr = new Pipeline().setStages(Array(sparkFeaturePipelineModel, linearRegression))
val sparkPipelineLr = sparkPipelineEstimatorLr.fit(datasetImputedFiltered)

println("Complete: Training Linear Regression")

Complete: Training Linear Regression


### Step 9 (Optional): Serialize your models to bundle.ml

In [15]:
sparkPipelineLr.serializeToBundle(new java.io.File("/tmp/model.lr"))
sparkPipelineRf.serializeToBundle(new java.io.File("/tmp/model.rf"))

### Step 10 (Optional): Deserialize your models from bundle.ml

In [4]:
import ml.combust.mleap.runtime.MleapContext.defaultContext
import ml.combust.mleap.runtime.MleapSupport._
import java.io.File

In [11]:
val (bundleLr, mleapTransformerLr) = new File("/tmp/model.lr").deserializeBundle()
val (bundleRf, mleapTransformerRf) = new File("/tmp/model.rf").deserializeBundle()

### Step 11 (Optional): Manually Create the LeapFrame

In [4]:
import ml.combust.mleap.runtime._
import ml.combust.mleap.runtime.types._
import org.apache.spark.ml.linalg.Vectors

In [5]:
val schema = StructType(Seq(StructField("features", TensorType.doubleVector()))).get
val dataset = LocalDataset(Seq(Row(Vectors.dense(Array(20.0, 10.0, 5.0)))))
val frame = LeapFrame(schema, dataset)

### Step 12 (Optional): Deserialize and Score a LeapFrame from Json

In [None]:
import ml.combust.mleap.runtime.serialization.FrameReader

In [1]:
val s = scala.io.Source.fromURL("https://s3-us-west-2.amazonaws.com/mleap-demo/frame.json").mkString

println(s)

{
  "schema": {
    "fields": [{
      "name": "state",
      "type": "string"
    }, {
      "name": "bathrooms",
      "type": "double"
    }, {
      "name": "square_feet",
      "type": "double"
    }, {
      "name": "bedrooms",
      "type": "double"
    }, {
      "name": "security_deposit",
      "type": "double"
    }, {
      "name": "cleaning_fee",
      "type": "double"
    }, {
      "name": "extra_people",
      "type": "double"
    }, {
      "name": "number_of_reviews",
      "type": "double"
    }, {
      "name": "review_scores_rating",
      "type": "double"
    }, {
      "name": "room_type",
      "type": "string"
    }, {
      "name": "host_is_superhost",
      "type": "string"
    }, {
      "name": "cancellation_policy",
      "type": "string"
    }, {
      "name": "instant_bookable",
      "type": "string"
    }]
  },
  "rows": [["NY", 2.0, 1250.0, 3.0, 50.0, 30.0, 2.0, 56.0, 90.0, "Entire home/apt", "1.0", "strict", "1.0"]]
}
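The printed frame pairs a list of typed fields with rows whose values follow the same order. The sketch below renders one row into that layout using plain Scala. `Field`, `jsonValue` and `frameJson` are hypothetical names, and the layout is inferred only from the sample printed above, not from the MLeap serialization specification.

```scala
// Field name paired with its type name, as in the "schema.fields" entries above.
final case class Field(name: String, dataType: String)

// Strings are quoted and escaped; numbers are written as-is.
def jsonValue(value: Any): String = value match {
  case s: String => "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""
  case other     => other.toString
}

// Render a schema and a single row; values must follow the field order.
def frameJson(fields: Seq[Field], row: Seq[Any]): String = {
  require(fields.size == row.size, "row must have one value per field")
  val fieldJson = fields
    .map(f => s"""{"name": ${jsonValue(f.name)}, "type": ${jsonValue(f.dataType)}}""")
    .mkString(", ")
  val rowJson = row.map(jsonValue).mkString(", ")
  s"""{"schema": {"fields": [$fieldJson]}, "rows": [[$rowJson]]}"""
}

val json = frameJson(
  Seq(Field("state", "string"), Field("bathrooms", "double")),
  Seq("NY", 2.0)
)
```

A real request would need all thirteen fields shown above, since the pipelines expect every feature column to be present.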



In [9]:
val bytes = s.getBytes("UTF-8")
val frame = FrameReader("ml.combust.mleap.json").fromBytes(bytes)

In [20]:
val frameLr = mleapTransformerLr.transform(frame).get.select("price_prediction").get

val frameRf = mleapTransformerRf.transform(frame).get.select("price_prediction").get

println("Price LR: " + frameLr.dataset(0).getDouble(0))
println("Price RF: " + frameRf.dataset(0).getDouble(0))

Price LR: 232.62463916675608
Price RF: 219.09900439790786


### Step 13 (Optional): Score a Spark DataFrame

In [6]:
import ml.combust.mleap.spark.SparkSupport._


In [3]:
val inputFile = "file:////tmp/airbnb.avro"

var dataset = spark.sqlContext.read.format("com.databricks.spark.avro").
  load(inputFile)

var datasetFiltered = dataset.filter("price >= 50 AND price <= 750 and bathrooms > 0.0")

In [11]:
datasetFiltered.registerTempTable("df")

val datasetImputed = spark.sqlContext.sql(f"""
    select
        id,
        city,
        case when state in('NY', 'CA', 'London', 'Berlin', 'TX' ,'IL', 'OR', 'DC', 'WA')
            then state
            else 'Other'
        end as state,
        space,
        price,
        bathrooms,
        bedrooms,
        room_type,
        host_is_superhost,
        cancellation_policy,
        case when security_deposit is null
            then 0.0
            else security_deposit
        end as security_deposit,
        price_per_bedroom,
        case when number_of_reviews is null
            then 0.0
            else number_of_reviews
        end as number_of_reviews,
        case when extra_people is null
            then 0.0
            else extra_people
        end as extra_people,
        instant_bookable,
        case when cleaning_fee is null
            then 0.0
            else cleaning_fee
        end as cleaning_fee,
        case when review_scores_rating is null
            then 80.0
            else review_scores_rating
        end as review_scores_rating,
        case when square_feet is not null and square_feet > 100
            then square_feet
            when (square_feet is null or square_feet <=100) and (bedrooms is null or bedrooms = 0)
            then 350.0
            else 380 * bedrooms
        end as square_feet
    from df
    where bedrooms is not null
""")


datasetImputed.select("square_feet", "price", "bedrooms", "bathrooms", "cleaning_fee").describe().show()

+-------+------------------+------------------+------------------+------------------+-----------------+
|summary|       square_feet|             price|          bedrooms|         bathrooms|     cleaning_fee|
+-------+------------------+------------------+------------------+------------------+-----------------+
|  count|            321588|            321588|            321588|            321588|           321588|
|   mean| 546.7441757777032|131.54961006007687|1.3352426085550455| 1.199068373198005|37.64188340360959|
| stddev|363.39839582374066| 90.10912788720098|0.8466586601060778|0.4830590051262673|42.64237791484579|
|    min|             104.0|              50.0|               0.0|               0.5|              0.0|
|    max|           32292.0|             750.0|              10.0|               8.0|            700.0|
+-------+------------------+------------------+------------------+------------------+-----------------+



In [26]:
val allCols = allFeatures.union(Seq("price")).map(datasetImputed.col)
val nullFilter = allCols.map(_.isNotNull).reduce(_ && _)
val datasetImputedFiltered = datasetImputed.select(allCols: _*).filter(nullFilter).persist()


In [27]:
val sparkDataframe = mleapTransformerLr.sparkTransform(datasetImputedFiltered)

In [16]:
sparkDataframe.columns

Array(id, city, state, space, price, bathrooms, bedrooms, room_type, host_is_superhost, cancellation_policy, security_deposit, price_per_bedroom, number_of_reviews, extra_people, instant_bookable, cleaning_fee, review_scores_rating, square_feet, unscaled_continuous_features, scaled_continuous_features, room_type_index, host_is_superhost_index, cancellation_policy_index, instant_bookable_index, state_index, room_type_index_oh, host_is_superhost_index_oh, cancellation_policy_index_oh, instant_bookable_index_oh, state_index_oh, features_lr, features_rf, price_prediction)

In [29]:
sparkDataframe.select("bedrooms", "bathrooms", "price", "price_prediction").show()

Name: org.apache.spark.SparkException
Message: Job aborted due to stage failure: Task 0 in stage 13.0 failed 1 times, most recent failure: Lost task 0.0 in stage 13.0 (TID 13, localhost): scala.MatchError: 8.0 (of class java.math.BigDecimal)
	at ml.combust.mleap.core.feature.VectorAssemblerModel$$anonfun$apply$2.apply(VectorAssemblerModel.scala:32)
	at ml.combust.mleap.core.feature.VectorAssemblerModel$$anonfun$apply$2.apply(VectorAssemblerModel.scala:32)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
	at ml.combust.mleap.core.feature.VectorAssemblerModel.apply(VectorAssemblerModel.scala:32)
	at ml.combust.mleap.runtime.transformer.feature.VectorAssembler$$anonfun$1.apply(VectorAssembler.scala:15)
	at ml.combust.mleap.runtime.transformer.feature.VectorAssembler$$anonfun$1.apply(VectorAssembler.scala:15)
	at org.apache.spark.sql.mleap.UserDefinedFunctionConverters$$anonfun$conve
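The `scala.MatchError: 8.0 (of class java.math.BigDecimal)` in the trace above indicates that a value reached the MLeap vector assembler as a `java.math.BigDecimal` rather than a `Double`. The following plain-Scala sketch, which is not MLeap's code, shows how a pattern match that only handles `Double` fails in this way and how converting values to `Double` first avoids it. `assembleStrict` and `toDouble` are hypothetical names. In the notebook, the analogous step would presumably be casting any decimal-typed columns to double before calling `sparkTransform`. That is an assumption and has not been verified here.

```scala
// Hypothetical strict assembler: accepts only Double values,
// so any other numeric class raises scala.MatchError.
def assembleStrict(values: Seq[Any]): Array[Double] =
  values.map { case d: Double => d }.toArray

// Hypothetical helper: normalise any numeric value to Double first.
def toDouble(value: Any): Double = value match {
  case d: Double           => d
  case n: java.lang.Number => n.doubleValue()
}

val row: Seq[Any] = Seq(2.0, new java.math.BigDecimal("8.0"))

// The raw row fails, the converted row assembles cleanly.
val failed    = scala.util.Try(assembleStrict(row)).isFailure
val assembled = assembleStrict(row.map(toDouble))
```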

### Step 10: Load the libraries for .deploy()

In [14]:
import ml.combust.bundle.BundleRegistry
import ml.combust.mleap.spark.SparkSupport._
import ml.combust.client.spark.SparkSupport._
import ml.combust.client.model.MultiModelClient
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import scala.concurrent.duration._
import scala.concurrent.Await

implicit val hr = BundleRegistry("spark")

Name: Compile Error
Message: <console>:25: error: object client is not a member of package ml.combust
       import ml.combust.client.spark.SparkSupport._
                         ^
StackTrace: 

### Step 11: Set up the ActorSystem

In [None]:
implicit val system = ActorSystem("combust-client")
implicit val materializer = ActorMaterializer()
implicit val ec = system.dispatcher

### Step 12: Define model servers to send your models to

In a future release of this notebook, we'll add directions on how to set up your public cloud account.

For now, send us an email at mikhail@combust.ml to get an access key or a docker image of the model server.

In [None]:
val client = MultiModelClient.builder[String]().
  withServer("model-server-1", "http://combust-model-server-4671c502.124d9c7a.svc.dockerapp.io:8889").
  withServer("model-server-2", "http://combust-model-server-9ac83b2d.801b9d51.svc.dockerapp.io:8890").
  build()

### Step 13: Deploy your LR and RF pipelines to the model servers

In [None]:
Await.result(client.deploy(sparkPipelineLr, "airbnb_lr").sequence, 10.seconds)

In [None]:
Await.result(client.deploy(sparkPipelineRf, "airbnb_rf").sequence, 10.seconds)

### Step 14 (Optional): Undeploy your models from the model server

In [None]:
Await.result(client.undeploy("airbnb_rf").sequence, 10.seconds)

In [None]:
Await.result(client.undeploy("airbnb_lr").sequence, 10.seconds)