# Description
----
Getting started with MLeap

https://mleap-docs.combust.ml/getting-started/

## Typical MLeap Workflow

A typical MLeap workflow consists of three parts:

1. Training: Write your ML pipelines the same way you do today
2. Serialization: Serialize all of the data processing (the ML pipeline) and the algorithms to Bundle.ML
3. Execution: Use the MLeap runtime to execute your serialized pipeline without dependencies on Spark or scikit-learn (you'll still need TensorFlow binaries)


## Serialization

Once your pipeline is trained, MLeap can serialize the entire ML/data pipeline and your trained algorithm (linear models, tree-based models, neural networks) to Bundle.ML. Serialization produces a `bundle`: a physical representation of your pipeline and algorithm that you can deploy, share, and inspect piece by piece.
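
Since a bundle can be written as a zip file, its pieces can be listed with the Python standard library. The sketch below is only an illustration: `list_bundle_entries` is a hypothetical helper name, and it assumes the bundle was saved in zip format at a path you supply.

```python
import zipfile


def list_bundle_entries(zip_path):
    """Return the sorted names of all files stored inside a zip-format bundle."""
    with zipfile.ZipFile(zip_path) as bundle:
        return sorted(bundle.namelist())
```

For a bundle written by the pipeline later in this notebook, the listing would be expected to include entries such as `root/<stage name>.node/model.json`, which is the layout visible in the stack trace further down.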



## Execution

The goal of MLeap was initially to enable scoring of Spark's ML pipelines without the dependency on Spark. That functionality is powered by MLeap Runtime, which loads your serialized bundle and executes it on incoming dataframes (LeapFrames).

Did we mention that MLeap Runtime is extremely fast? We have recorded benchmarks of microsecond execution on LeapFrames and sub-5ms response times when part of a RESTful API service.

**Note: As of right now, MLeap Runtime is only provided as a Java/Scala library, but we do plan to add Python bindings in the future.**



# Load libraries and data

In [6]:
# imports for adhoc notebooks
from IPython.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))

# magics
%load_ext blackcellmagic
# start cell with `%%black` to format using `black`

%load_ext autoreload
# start cell with `%autoreload` to reload module
# https://ipython.org/ipython-doc/stable/config/extensions/autoreload.html

In [2]:
from mleap import pyspark

from pyspark.ml.linalg import Vectors
from mleap.pyspark.spark_support import SimpleSparkSerializer
from pyspark.ml.feature import VectorAssembler, StandardScaler, OneHotEncoder, StringIndexer
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.regression import LinearRegression
from pyspark.ml.classification import LogisticRegression

from pyspark.sql import SparkSession

import src.spark.utils as uts

## Setup spark

In [40]:
%autoreload
mleap_jars = uts.get_mleap_jars()
# avro_jar = '/Users/bartev/dev/github-bv/san-tan/lrn-spark/references/spark-avro_2.11-4.0.0.jar'
spark = (
    SparkSession
    .builder
    .appName('mleap-ex')
    .config('spark.jars', mleap_jars)
#     .config('spark.jars', f"{mleap_jars},{avro_jar}")
    .getOrCreate()
    )
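
`uts.get_mleap_jars()` is a project-local helper whose source is not shown in this notebook. Since `spark.jars` takes a comma-separated list of jar paths, a stand-in could look roughly like the sketch below. The name `find_jars` and the idea of keeping all MLeap jars in a single directory are assumptions made for illustration, not the author's implementation.

```python
import glob
import os


def find_jars(jar_dir):
    """Return a comma-separated string of absolute paths to the *.jar files in jar_dir."""
    pattern = os.path.join(os.path.abspath(jar_dir), "*.jar")
    # Sort so the result is stable across runs.
    return ",".join(sorted(glob.glob(pattern)))
```

The returned string can be passed directly as the value of `.config('spark.jars', ...)`.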


In [41]:
spark

spark.stop()

# Simple MLeap example
----
https://mleap-docs.combust.ml/py-spark/

## Create simple spark pipeline

### Import MLeap serialization functionality for PySpark

In [69]:
import mleap.pyspark
from mleap.pyspark.spark_support import SimpleSparkSerializer

### Import standard PySpark Transformers and packages

In [70]:
from pyspark.ml.feature import VectorAssembler, StandardScaler, OneHotEncoder, StringIndexer
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql import Row

### Create a test data frame

In [74]:
sc = spark.sparkContext

In [84]:
l = [('Alice', 1), ('Bob', 2)]
rdd = sc.parallelize(l)
Person = Row('name', 'age')
person = rdd.map(lambda r: Person(*r))
df2 = spark.createDataFrame(person)
df2.collect()

[Row(name='Alice', age=1), Row(name='Bob', age=2)]

In [83]:
df2.show()

+-----+---+
| name|age|
+-----+---+
|Alice|  1|
|  Bob|  2|
+-----+---+



### Build a very simple pipeline using two transformers

In [85]:
string_indexer = StringIndexer(inputCol='name', outputCol='name_string_index')

feature_assembler = VectorAssembler(inputCols=[string_indexer.getOutputCol()],
                                    outputCol='features')

feature_pipeline = [string_indexer, feature_assembler]

featurePipeline = Pipeline(stages=feature_pipeline)

fittedPipeline = featurePipeline.fit(df2)

In [104]:
type(fittedPipeline)

pyspark.ml.pipeline.PipelineModel

In [103]:
fittedPipeline.serializeToBundle??

### Serialize to zip file
----
To serialize to a zip file, make sure the URI begins with `jar:file:` and ends with `.zip`.

For example: `jar:file:/tmp/mleap-bundle.zip`
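
The cells that follow build this URI with `uts.create_mleap_fname`, a project-local helper that is not shown here. A minimal sketch of an equivalent is given below; `make_bundle_uri` is a hypothetical name, and the behavior (absolute path prefixed with `jar:file:`) is inferred from the outputs printed in this notebook rather than from the helper's source.

```python
import os


def make_bundle_uri(filename, directory="."):
    """Build a 'jar:file:' URI for a zip bundle located in `directory`."""
    if not filename.endswith(".zip"):
        raise ValueError("bundle file name must end with .zip")
    full_path = os.path.abspath(os.path.join(directory, filename))
    return "jar:file:" + full_path
```

On a POSIX system, `make_bundle_uri("mleap-bundle.zip", "/tmp")` yields the example URI given above.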



In [92]:
import os
current_path = os.getcwd()

In [93]:
current_path

'/Users/bartev/dev/github-bv/san-tan/lrn-spark/notebooks'

In [100]:
%autoreload
uts.create_mleap_fname('foo.zip', '.')

'jar:file:/Users/bartev/dev/github-bv/san-tan/lrn-spark/notebooks/foo.zip'

In [101]:
fname = uts.create_mleap_fname('pyspark.example.zip', '.')
print(fname)
fittedPipeline.serializeToBundle(fname,
                                fittedPipeline.transform(df2))

jar:file:/Users/bartev/dev/github-bv/san-tan/lrn-spark/notebooks/pyspark.example.zip


Py4JJavaError: An error occurred while calling o911.serializeToBundle.
: java.nio.file.FileAlreadyExistsException: root/VectorAssembler_ce2795015298.node/model.json
	at com.sun.nio.zipfs.ZipFileSystem.newOutputStream(ZipFileSystem.java:516)
	at com.sun.nio.zipfs.ZipPath.newOutputStream(ZipPath.java:790)
	at com.sun.nio.zipfs.ZipFileSystemProvider.newOutputStream(ZipFileSystemProvider.java:285)
	at java.nio.file.Files.newOutputStream(Files.java:216)
	at java.nio.file.Files.write(Files.java:3292)
	at ml.combust.bundle.serializer.JsonFormatModelSerializer.write(ModelSerializer.scala:48)
	at ml.combust.bundle.serializer.ModelSerializer$$anonfun$write$2$$anonfun$apply$1.apply$mcV$sp(ModelSerializer.scala:89)
	at ml.combust.bundle.serializer.ModelSerializer$$anonfun$write$2$$anonfun$apply$1.apply(ModelSerializer.scala:89)
	at ml.combust.bundle.serializer.ModelSerializer$$anonfun$write$2$$anonfun$apply$1.apply(ModelSerializer.scala:89)
	at scala.util.Try$.apply(Try.scala:192)
	at ml.combust.bundle.serializer.ModelSerializer$$anonfun$write$2.apply(ModelSerializer.scala:89)
	at ml.combust.bundle.serializer.ModelSerializer$$anonfun$write$2.apply(ModelSerializer.scala:89)
	at scala.util.Success.flatMap(Try.scala:231)
	at ml.combust.bundle.serializer.ModelSerializer.write(ModelSerializer.scala:88)
	at ml.combust.bundle.serializer.NodeSerializer$$anonfun$write$1.apply(NodeSerializer.scala:85)
	at ml.combust.bundle.serializer.NodeSerializer$$anonfun$write$1.apply(NodeSerializer.scala:81)
	at scala.util.Try$.apply(Try.scala:192)
	at ml.combust.bundle.serializer.NodeSerializer.write(NodeSerializer.scala:81)
	at ml.combust.bundle.serializer.GraphSerializer$$anonfun$writeNode$1.apply(GraphSerializer.scala:34)
	at ml.combust.bundle.serializer.GraphSerializer$$anonfun$writeNode$1.apply(GraphSerializer.scala:30)
	at scala.util.Try$.apply(Try.scala:192)
	at ml.combust.bundle.serializer.GraphSerializer.writeNode(GraphSerializer.scala:30)
	at ml.combust.bundle.serializer.GraphSerializer$$anonfun$write$2.apply(GraphSerializer.scala:21)
	at ml.combust.bundle.serializer.GraphSerializer$$anonfun$write$2.apply(GraphSerializer.scala:21)
	at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
	at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
	at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
	at ml.combust.bundle.serializer.GraphSerializer.write(GraphSerializer.scala:20)
	at org.apache.spark.ml.bundle.ops.PipelineOp$$anon$1.store(PipelineOp.scala:21)
	at org.apache.spark.ml.bundle.ops.PipelineOp$$anon$1.store(PipelineOp.scala:14)
	at ml.combust.bundle.serializer.ModelSerializer$$anonfun$write$1.apply(ModelSerializer.scala:87)
	at ml.combust.bundle.serializer.ModelSerializer$$anonfun$write$1.apply(ModelSerializer.scala:83)
	at scala.util.Try$.apply(Try.scala:192)
	at ml.combust.bundle.serializer.ModelSerializer.write(ModelSerializer.scala:83)
	at ml.combust.bundle.serializer.NodeSerializer$$anonfun$write$1.apply(NodeSerializer.scala:85)
	at ml.combust.bundle.serializer.NodeSerializer$$anonfun$write$1.apply(NodeSerializer.scala:81)
	at scala.util.Try$.apply(Try.scala:192)
	at ml.combust.bundle.serializer.NodeSerializer.write(NodeSerializer.scala:81)
	at ml.combust.bundle.serializer.BundleSerializer$$anonfun$write$1.apply(BundleSerializer.scala:34)
	at ml.combust.bundle.serializer.BundleSerializer$$anonfun$write$1.apply(BundleSerializer.scala:29)
	at scala.util.Try$.apply(Try.scala:192)
	at ml.combust.bundle.serializer.BundleSerializer.write(BundleSerializer.scala:29)
	at ml.combust.bundle.BundleWriter.save(BundleWriter.scala:31)
	at ml.combust.mleap.spark.SimpleSparkSerializer$$anonfun$serializeToBundleWithFormat$2.apply(SimpleSparkSerializer.scala:26)
	at ml.combust.mleap.spark.SimpleSparkSerializer$$anonfun$serializeToBundleWithFormat$2.apply(SimpleSparkSerializer.scala:25)
	at resource.AbstractManagedResource$$anonfun$5.apply(AbstractManagedResource.scala:88)
	at scala.util.control.Exception$Catch$$anonfun$either$1.apply(Exception.scala:125)
	at scala.util.control.Exception$Catch$$anonfun$either$1.apply(Exception.scala:125)
	at scala.util.control.Exception$Catch.apply(Exception.scala:103)
	at scala.util.control.Exception$Catch.either(Exception.scala:125)
	at resource.AbstractManagedResource.acquireFor(AbstractManagedResource.scala:88)
	at resource.ManagedResourceOperations$class.apply(ManagedResourceOperations.scala:26)
	at resource.AbstractManagedResource.apply(AbstractManagedResource.scala:50)
	at resource.DeferredExtractableManagedResource$$anonfun$tried$1.apply(AbstractManagedResource.scala:33)
	at scala.util.Try$.apply(Try.scala:192)
	at resource.DeferredExtractableManagedResource.tried(AbstractManagedResource.scala:33)
	at ml.combust.mleap.spark.SimpleSparkSerializer.serializeToBundleWithFormat(SimpleSparkSerializer.scala:27)
	at ml.combust.mleap.spark.SimpleSparkSerializer.serializeToBundle(SimpleSparkSerializer.scala:17)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
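
The `FileAlreadyExistsException` above comes from the zip file system refusing to write an entry that is already present. One plausible cause (an assumption, not something this notebook confirms) is a bundle zip left at the same path by an earlier run of the cell. Under that assumption, deleting the old file before serializing again may help. `remove_existing_bundle` is a hypothetical helper:

```python
import os


def remove_existing_bundle(bundle_uri):
    """Delete the zip file behind a 'jar:file:' URI if it exists.

    Returns True when a file was removed, False otherwise.
    """
    prefix = "jar:file:"
    if not bundle_uri.startswith(prefix):
        raise ValueError("expected a URI starting with 'jar:file:'")
    path = bundle_uri[len(prefix):]
    if os.path.isfile(path):
        os.remove(path)
        return True
    return False
```

Calling `remove_existing_bundle(fname)` just before `serializeToBundle(fname, ...)` would make the cell safe to re-run, if a stale zip is indeed the cause.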


# Databricks AirBnB example

Avro version of the data load: `df = spark.read.format("com.databricks.spark.avro").load("file:////tmp/airbnb.avro")`

In [105]:
df = (
    spark.read.format("csv")
    .option("header", "true")
    .option('inferSchema','true')
    .load("/Users/bartev/dev/gitpie/mleap-demo/data/airbnb.csv")
)

In [106]:
df.columns

['id',
 'name',
 'price',
 'bedrooms',
 'bathrooms',
 'room_type',
 'square_feet',
 'host_is_superhost',
 'state',
 'cancellation_policy',
 'security_deposit',
 'cleaning_fee',
 'extra_people',
 'number_of_reviews',
 'price_per_bedroom',
 'review_scores_rating',
 'instant_bookable']

In [107]:
df.show()

+-------+--------------------+-----+--------+---------+---------------+-----------+-----------------+---------+-------------------+----------------+------------+------------+-----------------+-----------------+--------------------+----------------+
|     id|                name|price|bedrooms|bathrooms|      room_type|square_feet|host_is_superhost|    state|cancellation_policy|security_deposit|cleaning_fee|extra_people|number_of_reviews|price_per_bedroom|review_scores_rating|instant_bookable|
+-------+--------------------+-----+--------+---------+---------------+-----------+-----------------+---------+-------------------+----------------+------------+------------+-----------------+-----------------+--------------------+----------------+
|1949687|Delectable Victor...| 80.0|     1.0|      1.0|Entire home/apt|       null|              0.0|   London|           moderate|           100.0|        20.0|        10.0|                8|             80.0|                94.0|             0.0|
|386

In [108]:
df.dtypes

[('id', 'int'),
 ('name', 'string'),
 ('price', 'double'),
 ('bedrooms', 'double'),
 ('bathrooms', 'double'),
 ('room_type', 'string'),
 ('square_feet', 'double'),
 ('host_is_superhost', 'double'),
 ('state', 'string'),
 ('cancellation_policy', 'string'),
 ('security_deposit', 'double'),
 ('cleaning_fee', 'double'),
 ('extra_people', 'double'),
 ('number_of_reviews', 'int'),
 ('price_per_bedroom', 'double'),
 ('review_scores_rating', 'double'),
 ('instant_bookable', 'double')]

In [109]:
datasetFiltered = (df.filter("price >= 50")
    .filter("price <= 750")
    .filter('bathrooms > 0.0'))

In [110]:
print(df.count())
print(datasetFiltered.count())

389255
321588


In [111]:
datasetFiltered.columns

['id',
 'name',
 'price',
 'bedrooms',
 'bathrooms',
 'room_type',
 'square_feet',
 'host_is_superhost',
 'state',
 'cancellation_policy',
 'security_deposit',
 'cleaning_fee',
 'extra_people',
 'number_of_reviews',
 'price_per_bedroom',
 'review_scores_rating',
 'instant_bookable']

In [112]:
# registerTempTable is deprecated; createOrReplaceTempView is its replacement
datasetFiltered.createOrReplaceTempView('df')

q = """
select
    id,
    case when state in ('NY', 'CA', 'London', 'Berlin',
        'TX', 'IL', 'OR', 'DC', 'WA')
        then state
        else 'Other'
        end as state,
    price,
    bathrooms,
    bedrooms,
    room_type,
    host_is_superhost,
    cancellation_policy,
    case when security_deposit is null
        then 0.0
        else security_deposit
        end as security_deposit,
    price_per_bedroom,
    case when number_of_reviews is null
        then 0.0
        else number_of_reviews
        end as number_of_reviews,
    case when extra_people is null
        then 0.0
        else extra_people
        end as extra_people,
    instant_bookable,
    case when cleaning_fee is null
        then 0.0
        else cleaning_fee
        end as cleaning_fee,
    case when review_scores_rating is null
        then 0.0
        else review_scores_rating
        end as review_scores_rating,
    case when square_feet is not null and square_feet > 100
        then square_feet
        when (square_feet is null or square_feet <=100)
            and (bedrooms is null or bedrooms = 0)
        then 350.0
        else 380 * bedrooms        
        end as square_feet,
    case when bathrooms >= 2
        then 1.0
        else 0.0
        end as n_bathrooms_more_than_two
from df
where bedrooms is not null
        """
datasetImputed = spark.sql(q)
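
The `square_feet` rule is the least obvious branch of the query, so here is the same logic as a plain Python function. This is only an illustrative restatement (`impute_square_feet` is a hypothetical name, and `None` stands in for SQL `NULL`). It is not code the notebook runs.

```python
def impute_square_feet(square_feet, bedrooms):
    """Mirror the CASE expression for square_feet; None stands in for SQL NULL."""
    if square_feet is not None and square_feet > 100:
        # A plausible reported value is kept as-is.
        return square_feet
    # From here on, square_feet is missing or <= 100.
    if bedrooms is None or bedrooms == 0:
        return 350.0
    return 380 * bedrooms
```

For example, a listing with no reported area and two bedrooms is assigned `380 * 2.0 = 760.0`, while a studio (zero bedrooms) with no reported area is assigned `350.0`.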

In [113]:
(
    datasetImputed.select(
        "square_feet", "price", "bedrooms", "bathrooms", "cleaning_fee"
    )
    .describe()
    .show()
)

+-------+------------------+------------------+------------------+-------------------+------------------+
|summary|       square_feet|             price|          bedrooms|          bathrooms|      cleaning_fee|
+-------+------------------+------------------+------------------+-------------------+------------------+
|  count|            321588|            321588|            321588|             321588|            321588|
|   mean| 546.7441757777032|131.54961006007687|1.3352426085550455|  1.199068373198005| 37.64188340360959|
| stddev|363.39839582373594| 90.10912788720096|0.8466586601060732|0.48305900512627586|42.642377914845966|
|    min|             104.0|              50.0|               0.0|                0.5|               0.0|
|    max|           32292.0|             750.0|              10.0|                8.0|             700.0|
+-------+------------------+------------------+------------------+-------------------+------------------+



## Look at some summary statistics of the data

In [114]:
# most popular states

spark.sql("""
select
    state,
    count(*) as n,
    cast(avg(price) as decimal(12,2)) as avg_price,
    max(price) as max_price
from df
group by state
order by n desc
""").show()

+-------------+-----+---------+---------+
|        state|    n|avg_price|max_price|
+-------------+-----+---------+---------+
|           NY|48362|   146.75|    750.0|
|           CA|44716|   158.76|    750.0|
|Île-de-France|40732|   107.74|    750.0|
|       London|17542|   117.72|    750.0|
|          NSW|14416|   167.96|    750.0|
|       Berlin|13098|    81.01|    650.0|
|Noord-Holland| 8890|   128.56|    750.0|
|          VIC| 8636|   144.49|    750.0|
|North Holland| 7636|   134.60|    700.0|
|           IL| 7544|   141.85|    750.0|
|           ON| 7186|   129.05|    750.0|
|           TX| 6702|   196.59|    750.0|
|           WA| 5858|   132.48|    750.0|
|    Catalonia| 5748|   106.39|    720.0|
|           BC| 5522|   133.14|    750.0|
|           DC| 5476|   136.56|    720.0|
|       Québec| 5116|   104.98|    700.0|
|    Catalunya| 4570|    99.36|    675.0|
|       Veneto| 4486|   131.71|    700.0|
|           OR| 4330|   114.02|    700.0|
+-------------+-----+---------+---------+

In [116]:
datasetImputed.limit(10).toPandas()

| | id | state | price | bathrooms | bedrooms | room_type | host_is_superhost | cancellation_policy | security_deposit | price_per_bedroom | number_of_reviews | extra_people | instant_bookable | cleaning_fee | review_scores_rating | square_feet | n_bathrooms_more_than_two |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1949687 | London | 80.0 | 1.0 | 1.0 | Entire home/apt | 0.0 | moderate | 100.0 | 80.0 | 8.0 | 10.0 | 0.0 | 20.0 | 94.0 | 380.0 | 0.0 |
| 1 | 144337 | London | 200.0 | 1.5 | 1.0 | Private room | 0.0 | strict | 300.0 | 200.0 | 24.0 | 20.0 | 0.0 | 0.0 | 84.0 | 250.0 | 0.0 |
| 2 | 1372647 | London | 75.0 | 1.0 | 1.0 | Private room | 0.0 | flexible | 0.0 | 75.0 | 3.0 | 10.0 | 0.0 | 0.0 | 100.0 | 380.0 | 0.0 |
| 3 | 2440394 | London | 70.0 | 1.0 | 1.0 | Private room | 1.0 | moderate | 100.0 | 70.0 | 3.0 | 0.0 | 0.0 | 30.0 | 100.0 | 380.0 | 0.0 |
| 4 | 1949687 | London | 80.0 | 1.0 | 1.0 | Entire home/apt | 0.0 | moderate | 100.0 | 80.0 | 8.0 | 10.0 | 0.0 | 20.0 | 94.0 | 380.0 | 0.0 |
| 5 | 144337 | London | 200.0 | 1.5 | 1.0 | Private room | 0.0 | strict | 300.0 | 200.0 | 24.0 | 20.0 | 0.0 | 0.0 | 84.0 | 250.0 | 0.0 |
| 6 | 1372647 | London | 75.0 | 1.0 | 1.0 | Private room | 0.0 | flexible | 0.0 | 75.0 | 3.0 | 10.0 | 0.0 | 0.0 | 100.0 | 380.0 | 0.0 |
| 7 | 2440394 | London | 70.0 | 1.0 | 1.0 | Private room | 0.0 | moderate | 100.0 | 70.0 | 3.0 | 0.0 | 0.0 | 30.0 | 100.0 | 380.0 | 0.0 |
| 8 | 4754275 | Other | 110.0 | 1.0 | 1.0 | Entire home/apt | 0.0 | moderate | 0.0 | 110.0 | 3.0 | 0.0 | 0.0 | 50.0 | 87.0 | 380.0 | 0.0 |
| 9 | 8154300 | Other | 91.0 | 1.0 | 2.0 | Entire home/apt | 0.0 | moderate | 170.0 | 45.5 | 2.0 | 0.0 | 0.0 | 0.0 | 100.0 | 760.0 | 0.0 |


## Define continuous and categorical features

In [144]:
continuous_features = [
    "bathrooms",
    "bedrooms",
    "security_deposit",
    "cleaning_fee",
    "extra_people",
    "number_of_reviews",
    "square_feet",
    "review_scores_rating",
]
categorical_features = [
    "state",
    "room_type",
    "host_is_superhost",
    "cancellation_policy",
    "instant_bookable",
]

all_features = continuous_features + categorical_features

In [145]:
dataset_imputed = datasetImputed.persist()

## Split data into train and validation

In [146]:
[training_dataset, validation_dataset] = dataset_imputed.randomSplit([0.7, 0.3])

## Continuous feature pipeline

In [147]:
continuous_feature_assembler = VectorAssembler(
    inputCols=continuous_features, outputCol="unscaled_continuous_features"
)

continuous_feature_scaler = StandardScaler(
    inputCol="unscaled_continuous_features",
    outputCol="scaled_continuous_features",
    withStd=True,
    withMean=True,
)
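
With `withMean=True` and `withStd=True`, the scaler centers each feature on its mean and divides by its standard deviation. The plain Python sketch below shows that arithmetic for a single column. It assumes the sample (n - 1) standard deviation, which is my understanding of what Spark's `StandardScaler` uses. `standardize` is a hypothetical name.

```python
import statistics


def standardize(values):
    """Center values on their mean and scale by the sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample (n - 1) standard deviation
    return [(v - mean) / std for v in values]
```

For the column `[1.0, 2.0, 3.0]` the mean is 2.0 and the sample standard deviation is 1.0, so the scaled values are -1.0, 0.0 and 1.0.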

## Categorical features pipeline

In [148]:
import src.models.train_model as tm

In [149]:
%autoreload
categorical_feature_indexers = tm.make_string_indexer_list(categorical_features)

# ohe_input_cols = tm.get_output_col_names(categorical_feature_indexers)

categorical_feature_ohe = tm.make_one_hot_encoder(categorical_features)

In [150]:
categorical_feature_ohe.getInputCols()

['state_Index',
 'room_type_Index',
 'host_is_superhost_Index',
 'cancellation_policy_Index',
 'instant_bookable_Index']

In [151]:
categorical_feature_ohe.getOutputCols()

['state_OHE',
 'room_type_OHE',
 'host_is_superhost_OHE',
 'cancellation_policy_OHE',
 'instant_bookable_OHE']
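
`tm.make_string_indexer_list` and `tm.make_one_hot_encoder` are project-local helpers that are not shown. Judging by the column names above, they presumably wrap `StringIndexer` and `OneHotEncoder`. As a rough illustration of what those two stages compute, here is a plain Python sketch. It assumes Spark's default settings as I understand them: labels indexed by descending frequency, and the last category dropped from the one-hot vector. Both function names are hypothetical, and Spark emits sparse vectors rather than lists.

```python
from collections import Counter


def fit_string_index(values):
    """Map each label to an index, most frequent label first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}


def one_hot(index, n_labels, drop_last=True):
    """Encode an index as a 0/1 list; with drop_last the final label maps to all zeros."""
    size = n_labels - 1 if drop_last else n_labels
    return [1.0 if i == index else 0.0 for i in range(size)]
```

With the labels `["NY", "CA", "NY", "Other", "NY", "CA"]`, `NY` gets index 0, `CA` index 1 and `Other` index 2. `one_hot(0, 3)` is `[1.0, 0.0]` and `one_hot(2, 3)` is `[0.0, 0.0]`.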

## Assemble features and feature pipeline

In [152]:
estimatorLr = (
    [continuous_feature_assembler, continuous_feature_scaler]
    + categorical_feature_indexers
    + [categorical_feature_ohe]
)

featurePipeline = Pipeline(stages=estimatorLr)

sparkFeaturePipelineModel = featurePipeline.fit(dataset_imputed)

print("Finished constructing the pipeline")

Finished constructing the pipeline


## Train a Linear Regression Model

In [158]:
linearRegression = LinearRegression(featuresCol='scaled_continuous_features',
                                   labelCol='price',
                                   predictionCol='price_prediction',
                                   maxIter=10,
                                   regParam=0.3,
                                   elasticNetParam=0.8)

pipeline_lr = [sparkFeaturePipelineModel, linearRegression]

sparkPipelineEstimatorLr = Pipeline(stages=pipeline_lr)

sparkPipelineLr = sparkPipelineEstimatorLr.fit(dataset_imputed)

print('Complete: Training Linear Regression')

Complete: Training Linear Regression
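
The `regParam` and `elasticNetParam` arguments above control an elastic-net penalty that mixes L1 and L2 regularization. As I understand Spark's documented objective, the penalty is `regParam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)` with `alpha = elasticNetParam`. The sketch below evaluates that expression for a given weight vector. `elastic_net_penalty` is a hypothetical helper, not part of Spark.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Return reg_param * (alpha * L1 + (1 - alpha) / 2 * L2^2) for the given weights."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    alpha = elastic_net_param
    return reg_param * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)
```

With `elasticNetParam=0.8` as used here, most of the penalty comes from the L1 term, which tends to push some coefficients to exactly zero.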


## Train a Logistic Regression Model

In [163]:
logisticRegression = LogisticRegression(featuresCol='scaled_continuous_features',
                                       labelCol='n_bathrooms_more_than_two',
                                       predictionCol='n_bathrooms_more_than_two_prediction',
                                       maxIter=10)

pipeline_log_r = [sparkFeaturePipelineModel, logisticRegression]

sparkPipelineEstimatorLogr = Pipeline(stages=pipeline_log_r)

sparkPipelineLogr = sparkPipelineEstimatorLogr.fit(dataset_imputed)

print('Complete: Training Logistic Regression')

Complete: Training Logistic Regression


## Serialize the model to Bundle.ML

In [165]:
fname_lr = uts.create_mleap_fname('pyspark.lr.zip', '../models/')
fname_logr = uts.create_mleap_fname('pyspark.logr.zip', '../models/')

In [168]:
for mod, fname in [(sparkPipelineLr, fname_lr),
                   (sparkPipelineLogr, fname_logr)]:
    mod.serializeToBundle(fname, mod.transform(dataset_imputed))

In [170]:
ls ../models/

lr-pipeline-model/  pyspark.logr.zip  pyspark.lr.zip


## (Optional) Deserialize from Bundle.ML

In [171]:
sparkPipelineLR_des = PipelineModel.deserializeFromBundle(fname_lr)

In [172]:
type(sparkPipelineLR_des)

pyspark.ml.pipeline.PipelineModel