# Sparkling Water Pipeline Productionalization

## Background

Sparkling Water provides access to H2O algorithms and publishes an API to integrate them as part of regular Spark pipelines. This feature allows for seamless training and deployment of H2O algorithms in the Spark environment. Furthermore, thanks to MOJO (java binary) representations of trained H2O models, production pipelines do not require access to H2O runtime. This enables a wide variety of deployment scenarios. Similarly, Sparkling Water can be used for deploying MOJOs from Driverless AI models.

Moreover, by supporting Python and Scala environments, we enable a simple transfer of modeling results between data scientists ("Python land") and production ("JVM land").


## Goal

The goals of this hands-on are two-fold:
  - Show integration of H2O models into Spark pipelines using PySpark and PySparkling,
  - Demonstrate deployment of the trained pipeline in the context of JVM and Spark streaming.
  
Our modeling goal is to predict sentiment of Amazon food reviews. For this purpose, we use a pre-processed dataset from [SNAP repository](https://snap.stanford.edu/data/web-FineFoods.html). The dataset contains multiple columns but for simplicity, we will use only `date`, `summary` and overall `score`. The score helps us to approximate sentiment.

![Scenario](./img/scenario.png)

## Environment preparation

First, let's verify that `SparkSession` is available in the notebook environment. We do not need to explicitly create a `SparkSession` as it is automatically created for us
during startup of the Jupyter notebook. This works because Jupyter is configured with a Spark kernel.


In [1]:
spark

### Prepare `H2OContext`

We will start `H2OContext` in the so-called _internal backend_ mode. The means H2O is sharing the JVM with Spark (see details in [Sparkling Water documentation](https://github.com/h2oai/sparkling-water/blob/rel-2.2/doc/tutorials/backends.rst)).

The following call initializes H2O on each Spark executor in the Spark cluster.

In [2]:
from pysparkling import *
hc = H2OContext.getOrCreate(spark)

Connecting to H2O server at http://172.17.0.2:54321 ... successful.


0,1
H2O cluster uptime:,11 secs
H2O cluster timezone:,Etc/UTC
H2O data parsing timezone:,UTC
H2O cluster version:,3.24.0.5
H2O cluster version age:,"21 days, 18 hours and 10 minutes"
H2O cluster name:,sparkling-water-h2o_local-1562781759793
H2O cluster total nodes:,1
H2O cluster free memory:,6.991 Gb
H2O cluster total cores:,4
H2O cluster allowed cores:,4



Sparkling Water Context:
 * Sparkling Water Version: 2.4.13
 * H2O name: sparkling-water-h2o_local-1562781759793
 * cluster size: 1
 * list of used nodes:
  (executorId, host, port)
  ------------------------
  (driver,f1fa966296b2,54321)
  ------------------------

  Open H2O Flow in browser: http://172.17.0.2:54321 (CMD + click in Mac OSX)

    


> Note: the reported IP is the private IP of the docker container where the demo is running. To open H2O Flow in your own browser, copy your browser URL and replace the port with 54321.
>
> For example, my Jupyter notebook's URL is `http://52.202.98.125:8888`. After opening a new browser tab or window, copy the address and replace port `8888` with `54321`:
>`http://52.202.98.125:54321`.

## Data preparation

We are going to use H2O to load the data since it does a pretty good job of guessing all the nuances of input formats. We will then pass the data to Spark.

_(Alternatively, one could load the data directly into a Spark dataframe, bypassing H2O completely. In this case, leveraging H2O for data input is considerably easier.)_

In [3]:
import h2o
reviews_h2o = h2o.upload_file("../../data/amazon_reviews/AmazonReviews_Train.csv", "reviews.hex")

Parse progress: |█████████████████████████████████████████████████████████| 100%


### Explore data table in H2O Flow

At this point, we can access H2O Flow and explore data and its properties directly there.

### Convert H2O frame to Spark frame so we can pass it as the input to the pipeline

After data exploration, we can start with data munging. Since we are going to use Spark for these steps, we will pass the dataframe from H2O to Spark.

In [4]:
reviews_spark = hc.as_spark_frame(reviews_h2o)

#### Trick #1: Save the original Spark schema

At this point, we will save the input data schema to be used later in the deployed Spark streaming application.

In [5]:
reviews_spark.printSchema()

with open('schema.json','w') as f:
    f.write(str(reviews_spark.schema.json()))

root
 |-- Id: integer (nullable = false)
 |-- ProductId: string (nullable = false)
 |-- UserId: string (nullable = false)
 |-- ProfileName: string (nullable = false)
 |-- HelpfulnessNumerator: short (nullable = false)
 |-- HelpfulnessDenominator: short (nullable = false)
 |-- Score: byte (nullable = false)
 |-- Time: integer (nullable = false)
 |-- Summary: string (nullable = false)
 |-- Text: string (nullable = false)



## Now let's define all the stages for the pipeline

The Spark pipelines are composed of various transformers. In our example, we combine a few Spark transformers to clean up textual data and transform it into numerical format. The pipeline is finalized by training an H2O XGBoost binomial model.

> Note: The pipeline stages are not executed right away, they are executed during each fit and transform call.

### Define transformer to drop unnecessary columns
The Spark `SQLTransformer` allows for using SQL to munge data.

As part of this transformer, we convert timestamp to a human readable date string.

For this example, we are selecting just the `Score`, `Time` and `Summary` columns. The goal of this analysis is to predict sentiment, i.e., whether the review is positive or negative. The review can be influenced by several aspects. The `Summary` is of course the mostly important information, but `Time` can influence the model as well. For example, people may tend to give higher reviews on Friday evenings because there's a weekend in front of them. :)

In [6]:
from pyspark.ml.feature import SQLTransformer
colSelect = SQLTransformer(
    statement="SELECT Score, from_unixtime(Time) as Time, Summary FROM __THIS__")

#### Trick #2: Explore intermediate results
To explore intermediate results, we can invoke the defined transformer directly. Note that this will cause Spark to execute the transformer as well as all unevaluated upstream code. 

In [7]:
selected = colSelect.transform(reviews_spark)
selected.show()

+-----+-------------------+--------------------+
|Score|               Time|             Summary|
+-----+-------------------+--------------------+
|    5|2011-04-27 00:00:00|Good Quality Dog ...|
|    1|2012-09-07 00:00:00|   Not as Advertised|
|    4|2008-08-18 00:00:00|"Delight" says it...|
|    2|2011-06-13 00:00:00|      Cough Medicine|
|    5|2012-10-21 00:00:00|         Great taffy|
|    4|2012-07-12 00:00:00|          Nice Taffy|
|    5|2012-06-20 00:00:00|Great!  Just as g...|
|    5|2012-05-03 00:00:00|Wonderful, tasty ...|
|    5|2011-11-23 00:00:00|          Yay Barley|
|    5|2012-10-26 00:00:00|    Healthy Dog Food|
|    5|2005-02-08 00:00:00|The Best Hot Sauc...|
|    5|2010-08-27 00:00:00|My cats LOVE this...|
|    1|2012-06-13 00:00:00|My Cats Are Not F...|
|    4|2010-11-05 00:00:00|   fresh and greasy!|
|    5|2010-03-12 00:00:00|Strawberry Twizzl...|
|    5|2009-12-29 00:00:00|Lots of twizzlers...|
|    2|2012-09-20 00:00:00|          poor taste|
|    5|2012-08-16 00

### Define transformer to create multiple time features based on the `Time` column

The `Time` column is stored internally as a timestamp. To be useful in modeling, we need to extract the time information in a format that is understandable by the predictive algorithms we employ. We can use SparkSQL data methods such as `month`, `dayofmonth`, etc. to engineer multiple new features from the timestamp information. 

In [8]:
refineTime = SQLTransformer(
    statement="""
    SELECT  Score,
            Summary, 
            dayofmonth(Time) as Day, 
            month(Time) as Month, 
            year(Time) as Year, 
            weekofyear(Time) as WeekNum, 
            date_format(Time, 'EEE') as Weekday, 
            hour(Time) as HourOfDay, 
            IF(date_format(Time, 'EEE')='Sat' OR date_format(Time, 'EEE')='Sun', 1, 0) as Weekend, 
            CASE 
                WHEN month(TIME)=12 OR month(Time)<=2 THEN 'Winter' 
                WHEN month(TIME)>=3 OR month(Time)<=5 THEN 'Spring' 
                WHEN month(TIME)>=6 AND month(Time)<=9 THEN 'Summer' 
                ELSE 'Fall' END as Season 
    FROM __THIS__""")

Now inspect the updated data

In [9]:
refined = refineTime.transform(selected)
refined.show()

+-----+--------------------+---+-----+----+-------+-------+---------+-------+------+
|Score|             Summary|Day|Month|Year|WeekNum|Weekday|HourOfDay|Weekend|Season|
+-----+--------------------+---+-----+----+-------+-------+---------+-------+------+
|    5|Good Quality Dog ...| 27|    4|2011|     17|    Wed|        0|      0|Spring|
|    1|   Not as Advertised|  7|    9|2012|     36|    Fri|        0|      0|Spring|
|    4|"Delight" says it...| 18|    8|2008|     34|    Mon|        0|      0|Spring|
|    2|      Cough Medicine| 13|    6|2011|     24|    Mon|        0|      0|Spring|
|    5|         Great taffy| 21|   10|2012|     42|    Sun|        0|      1|Spring|
|    4|          Nice Taffy| 12|    7|2012|     28|    Thu|        0|      0|Spring|
|    5|Great!  Just as g...| 20|    6|2012|     25|    Wed|        0|      0|Spring|
|    5|Wonderful, tasty ...|  3|    5|2012|     18|    Thu|        0|      0|Spring|
|    5|          Yay Barley| 23|   11|2011|     47|    Wed|      

### Remove neutral reviews and classify the Scores

We are not interested in the neutral reviews (reviews with the `Score=3`) as they would not add much information to the model. This is a fairly standard approach in NPS (net promoter score) type analyses, and common in particular in sentiment analysis. 

In [10]:
from pyspark.sql.types import FloatType
from pyspark.sql.functions import col, udf
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, RegexTokenizer, StopWordsRemover, IDF, CountVectorizer

filterScore = SQLTransformer(
    statement="""
    SELECT  IF(Score<3,'NEGATIVE', 'POSITIVE') as Sentiment, Summary, Day, Month, Year,
            WeekNum, Weekday, HourOfDay, Weekend, Season 
    FROM __THIS__ WHERE Score !=3 """)

 Inspect the data

In [11]:
filtered = filterScore.transform(refined)
filtered.show()

+---------+--------------------+---+-----+----+-------+-------+---------+-------+------+
|Sentiment|             Summary|Day|Month|Year|WeekNum|Weekday|HourOfDay|Weekend|Season|
+---------+--------------------+---+-----+----+-------+-------+---------+-------+------+
| POSITIVE|Good Quality Dog ...| 27|    4|2011|     17|    Wed|        0|      0|Spring|
| NEGATIVE|   Not as Advertised|  7|    9|2012|     36|    Fri|        0|      0|Spring|
| POSITIVE|"Delight" says it...| 18|    8|2008|     34|    Mon|        0|      0|Spring|
| NEGATIVE|      Cough Medicine| 13|    6|2011|     24|    Mon|        0|      0|Spring|
| POSITIVE|         Great taffy| 21|   10|2012|     42|    Sun|        0|      1|Spring|
| POSITIVE|          Nice Taffy| 12|    7|2012|     28|    Thu|        0|      0|Spring|
| POSITIVE|Great!  Just as g...| 20|    6|2012|     25|    Wed|        0|      0|Spring|
| POSITIVE|Wonderful, tasty ...|  3|    5|2012|     18|    Thu|        0|      0|Spring|
| POSITIVE|          

### Tokenize the message

Here we use Spark's [RegexTokenizer](https://spark.apache.org/docs/2.1.0/ml-features.html#tokenizer) to tokenize the messages.

In [12]:
regexTokenizer = RegexTokenizer(inputCol="Summary",
                                outputCol="tokenized_summary",
                                pattern="[, ]",
                                toLowercase=True)

Inspect the data

In [13]:
tokenized = regexTokenizer.transform(filtered)
tokenized.show()

+---------+--------------------+---+-----+----+-------+-------+---------+-------+------+--------------------+
|Sentiment|             Summary|Day|Month|Year|WeekNum|Weekday|HourOfDay|Weekend|Season|   tokenized_summary|
+---------+--------------------+---+-----+----+-------+-------+---------+-------+------+--------------------+
| POSITIVE|Good Quality Dog ...| 27|    4|2011|     17|    Wed|        0|      0|Spring|[good, quality, d...|
| NEGATIVE|   Not as Advertised|  7|    9|2012|     36|    Fri|        0|      0|Spring|[not, as, adverti...|
| POSITIVE|"Delight" says it...| 18|    8|2008|     34|    Mon|        0|      0|Spring|["delight", says,...|
| NEGATIVE|      Cough Medicine| 13|    6|2011|     24|    Mon|        0|      0|Spring|   [cough, medicine]|
| POSITIVE|         Great taffy| 21|   10|2012|     42|    Sun|        0|      1|Spring|      [great, taffy]|
| POSITIVE|          Nice Taffy| 12|    7|2012|     28|    Thu|        0|      0|Spring|       [nice, taffy]|
| POSITIVE

### Remove unnecessary words

Some words do not bring much information for the resulting model. For this, we use Spark's [StopWordsRemover](https://spark.apache.org/docs/2.1.0/ml-features.html#stopwordsremover) to clean the data.

In [14]:
stopWordsRemover = StopWordsRemover(inputCol=regexTokenizer.getOutputCol(),
                                    outputCol="CleanedSummary",
                                    caseSensitive=False)

Inspect the data

In [15]:
stopWordsRemoved = stopWordsRemover.transform(tokenized)
stopWordsRemoved.select(["Sentiment", "Summary", "CleanedSummary"]).show()

+---------+--------------------+--------------------+
|Sentiment|             Summary|      CleanedSummary|
+---------+--------------------+--------------------+
| POSITIVE|Good Quality Dog ...|[good, quality, d...|
| NEGATIVE|   Not as Advertised|        [advertised]|
| POSITIVE|"Delight" says it...|   ["delight", says]|
| NEGATIVE|      Cough Medicine|   [cough, medicine]|
| POSITIVE|         Great taffy|      [great, taffy]|
| POSITIVE|          Nice Taffy|       [nice, taffy]|
| POSITIVE|Great!  Just as g...|[great!, good, ex...|
| POSITIVE|Wonderful, tasty ...|[wonderful, tasty...|
| POSITIVE|          Yay Barley|       [yay, barley]|
| POSITIVE|    Healthy Dog Food|[healthy, dog, food]|
| POSITIVE|The Best Hot Sauc...|[best, hot, sauce...|
| POSITIVE|My cats LOVE this...|[cats, love, "die...|
| NEGATIVE|My Cats Are Not F...|[cats, fans, new,...|
| POSITIVE|   fresh and greasy!|    [fresh, greasy!]|
| POSITIVE|Strawberry Twizzl...|[strawberry, twiz...|
| POSITIVE|Lots of twizzlers

### Hash the words

NLP (natural language processing) for predictive modeling is based on the idea that text can be represented as numeric values. These values are then fed into any algorithm the user chooses. One choice of numeric representation uses [CountVectorizer](https://spark.apache.org/docs/2.1.0/ml-features.html#countvectorizer).

`CountVectorizer` is very similar to the [HashingTF](https://spark.apache.org/docs/2.1.0/ml-features.html#tf-idf) function, except that it preserves the mapping from the index back to the word using an internal vocabulary.

For example, if the word `Dog` is stored in the hash at the index `100`, we can get the word back as `countVectorizerModel.vocabulary[100]`.

#### Trick #3: Set minDF parameter to limit number of words

The `minDF` parameter ensures that only words which occur more the `minDF` times in our case are included. This both speeds the process of modeling and ensures that outliers (infrequent words) do not affect our model that much.

In [16]:
countVectorizer = CountVectorizer(inputCol=stopWordsRemover.getOutputCol(),
                                  outputCol="frequencies",
                                  minDF=100)

#### Trick #4: Manually train the count vectorizer so we can see how it behaves before we execute the pipeline


In [17]:
countVecModel = countVectorizer.fit(stopWordsRemoved)

See the vocabulary:

In [18]:
print("Vocabulary size is " + str(len(countVecModel.vocabulary)))
print(countVecModel.vocabulary[:10])

Vocabulary size is 1528
[u'great', u'good', u'best', u'love', u'coffee', u'tea', u'product', u'taste', u'delicious', u'excellent']


Inspect the data

In [19]:
vectorized = countVecModel.transform(stopWordsRemoved)
vectorized.select(["Sentiment", "CleanedSummary", "frequencies"]).show()

+---------+--------------------+--------------------+
|Sentiment|      CleanedSummary|         frequencies|
+---------+--------------------+--------------------+
| POSITIVE|[good, quality, d...|(1528,[1,10,12,35...|
| NEGATIVE|        [advertised]|  (1528,[620],[1.0])|
| POSITIVE|   ["delight", says]|  (1528,[402],[1.0])|
| NEGATIVE|   [cough, medicine]|        (1528,[],[])|
| POSITIVE|      [great, taffy]|(1528,[0,1428],[1...|
| POSITIVE|       [nice, taffy]|(1528,[29,1428],[...|
| POSITIVE|[great!, good, ex...|(1528,[1,59,126],...|
| POSITIVE|[wonderful, tasty...|(1528,[15,37,1428...|
| POSITIVE|       [yay, barley]|        (1528,[],[])|
| POSITIVE|[healthy, dog, food]|(1528,[10,12,21],...|
| POSITIVE|[best, hot, sauce...|(1528,[2,44,86,45...|
| POSITIVE|[cats, love, "die...|(1528,[3,12,23,41...|
| NEGATIVE|[cats, fans, new,...|(1528,[12,41,79],...|
| POSITIVE|    [fresh, greasy!]|   (1528,[83],[1.0])|
| POSITIVE|[strawberry, twiz...|(1528,[13,19,667]...|
| POSITIVE|[lots, twizzlers,

### Create an Inverse Document Frequency (IDF) model

Here we use Spark's [tf-idf](https://spark.apache.org/docs/2.1.0/ml-features.html#tf-idf) method to model the importance of a term in a document to the given set of data. Please see the [Spark documentation](https://spark.apache.org/docs/2.1.0/ml-features.html#tf-idf) for more information on TF-IDF.

In [20]:
idf = IDF(inputCol=countVectorizer.getOutputCol(),
          outputCol="tf_idf_frequencies",
          minDocFreq=1)

Manually train the IDF model to see the results before we execute the pipeline,

In [21]:
idfModel = idf.fit(vectorized)

Inspect the data

In [22]:
afterIdf = idfModel.transform(vectorized)
afterIdf.select(["Sentiment", "CleanedSummary", "frequencies", "tf_idf_frequencies"]).show()

+---------+--------------------+--------------------+--------------------+
|Sentiment|      CleanedSummary|         frequencies|  tf_idf_frequencies|
+---------+--------------------+--------------------+--------------------+
| POSITIVE|[good, quality, d...|(1528,[1,10,12,35...|(1528,[1,10,12,35...|
| NEGATIVE|        [advertised]|  (1528,[620],[1.0])|(1528,[620],[7.32...|
| POSITIVE|   ["delight", says]|  (1528,[402],[1.0])|(1528,[402],[6.86...|
| NEGATIVE|   [cough, medicine]|        (1528,[],[])|        (1528,[],[])|
| POSITIVE|      [great, taffy]|(1528,[0,1428],[1...|(1528,[0,1428],[2...|
| POSITIVE|       [nice, taffy]|(1528,[29,1428],[...|(1528,[29,1428],[...|
| POSITIVE|[great!, good, ex...|(1528,[1,59,126],...|(1528,[1,59,126],...|
| POSITIVE|[wonderful, tasty...|(1528,[15,37,1428...|(1528,[15,37,1428...|
| POSITIVE|       [yay, barley]|        (1528,[],[])|        (1528,[],[])|
| POSITIVE|[healthy, dog, food]|(1528,[10,12,21],...|(1528,[10,12,21],...|
| POSITIVE|[best, hot, sa

### Remove Summary Column

Recall from above that predictive algorithms do not understand string values very well. This is why we transformed the text data of the `Summary` column using TF-IDF. We will keep the numeric representations of `Summary` and drop the original text so that we do not confuse the model.

In [23]:
removeSummary = SQLTransformer(
    statement="""
    SELECT Sentiment, Day, Month, Year, WeekNum, Weekday, HourOfDay, Weekend, Season, tf_idf_frequencies
    FROM __THIS__ """)

Inspect the data

In [24]:
removedSummary = removeSummary.transform(afterIdf)
removedSummary.show()

+---------+---+-----+----+-------+-------+---------+-------+------+--------------------+
|Sentiment|Day|Month|Year|WeekNum|Weekday|HourOfDay|Weekend|Season|  tf_idf_frequencies|
+---------+---+-----+----+-------+-------+---------+-------+------+--------------------+
| POSITIVE| 27|    4|2011|     17|    Wed|        0|      0|Spring|(1528,[1,10,12,35...|
| NEGATIVE|  7|    9|2012|     36|    Fri|        0|      0|Spring|(1528,[620],[7.32...|
| POSITIVE| 18|    8|2008|     34|    Mon|        0|      0|Spring|(1528,[402],[6.86...|
| NEGATIVE| 13|    6|2011|     24|    Mon|        0|      0|Spring|        (1528,[],[])|
| POSITIVE| 21|   10|2012|     42|    Sun|        0|      1|Spring|(1528,[0,1428],[2...|
| POSITIVE| 12|    7|2012|     28|    Thu|        0|      0|Spring|(1528,[29,1428],[...|
| POSITIVE| 20|    6|2012|     25|    Wed|        0|      0|Spring|(1528,[1,59,126],...|
| POSITIVE|  3|    5|2012|     18|    Thu|        0|      0|Spring|(1528,[15,37,1428...|
| POSITIVE| 23|   11|

### Create an XGBoost model using H2O

Up to this point, all of our data wrangling and feature engineering efforts have used Spark methods exclusively. Now we turn to H2O to train an H2O XGBoost model on the `Sentiment` column (using default settings). Note that there are many more steps involved with tuning an XGBoost model which we omit here. The full documentation for XGBoost is available at [H2O Documentation](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/xgboost.html)

In [25]:
from pysparkling.ml import ColumnPruner, H2OXGBoost

xgboost = H2OXGBoost(splitRatio=0.8,
             featuresCols=[idf.getOutputCol()],
             labelCol="Sentiment")

###  Create the pipeline by defining all the stages

Now we have all the pieces ready and can define the final pipeline.

In [26]:
pipeline = Pipeline(stages=[colSelect,
                            refineTime,
                            filterScore,
                            regexTokenizer,
                            stopWordsRemover,
                            countVectorizer,
                            idf,
                            removeSummary,
                            xgboost])

## Train the pipeline model

The `fit` call calls each trasformer and estimator in the pipeline and creates so called the `PipelineModel`. The model is trained from the cleaned data from previous transformers and the final model is ready to accept the raw data to make predictions

In [27]:
model = pipeline.fit(reviews_spark)

### Try predictions

First, let's load the data that we can use for prediction

In [28]:
reviews_h2o_pred = h2o.upload_file("../../data/amazon_reviews/AmazonReviews_Predictions.csv", "reviews_preds.hex")

Parse progress: |█████████████████████████████████████████████████████████| 100%


Then pass the data to Spark so that we can run the Spark pipeline on it,

In [29]:
reviews_spark_pred = hc.as_spark_frame(reviews_h2o_pred)

Now run the predictions:

In [30]:
model.transform(reviews_spark_pred).show()

+---------+---+-----+----+-------+-------+---------+-------+------+--------------------+--------------------+
|Sentiment|Day|Month|Year|WeekNum|Weekday|HourOfDay|Weekend|Season|  tf_idf_frequencies|   prediction_output|
+---------+---+-----+----+-------+-------+---------+-------+------+--------------------+--------------------+
| POSITIVE|  8|    6|2012|     23|    Fri|        0|      0|Spring|(1528,[48,647,131...|[0.21051210165023...|
| POSITIVE| 15|   12|2011|     50|    Thu|        0|      0|Winter|        (1528,[],[])|[0.21051210165023...|
| POSITIVE| 14|    9|2011|     37|    Wed|        0|      0|Spring|(1528,[264,306],[...|[0.21051210165023...|
| POSITIVE| 20|   10|2011|     42|    Thu|        0|      0|Spring|(1528,[26,452],[4...|[0.21051210165023...|
| POSITIVE|  9|    9|2012|     36|    Sun|        0|      1|Spring|(1528,[36,1409],[...|[0.21051210165023...|
| POSITIVE|  8|    2|2012|      6|    Wed|        0|      0|Winter|        (1528,[],[])|[0.21051210165023...|
| NEGATIVE

## Save the pipeline model

Later we can use the pipeline model in Scala to demonstrate the deployment of the pipeline in the JVM world.

In [31]:
model.write().overwrite().save("reviews_pipeline.model")

#### Trick #5: Check variable importances

We can inspect the model in H2O Flow and see the variable importances. However, we do not have information about the words, just the indices. We can ask the `CountVectorizer` what word is on the specific index to see what words affect our model the most.

In [32]:
model.stages[5].vocabulary[0]

u'great'

## Let's Deploy the Application

Up to this point, we have defined the PySpark pipeline. We will now demonstrate its deployment using the PySpark Streaming application in python, where the pipeline defined above will receive raw streaming data and run predictions on them in real time.

The steps will be:

 1. Load the schema from the schema file.
 1. Load the pipeline from the pipeline file.
 1. Create an input data stream and pass it the schema. The input data stream will point to a directory where a new csv files will be coming from different streaming sources.
 1. Create and output the data stream. For the purposes of this tutorial, we store the data into memory and also into a SparkSQL table.
 1. We can inspect the predictions in "real time" by regularly displaying the content of the desired table.

In [33]:
# Check again we have spark available
spark

In [34]:
# 1. Load exported schema of input data
from pyspark.sql.types import StructType
import json

schema = StructType.fromJson(json.load(open("schema.json", 'r')))
print(schema)

StructType(List(StructField(Id,IntegerType,false),StructField(ProductId,StringType,false),StructField(UserId,StringType,false),StructField(ProfileName,StringType,false),StructField(HelpfulnessNumerator,ShortType,false),StructField(HelpfulnessDenominator,ShortType,false),StructField(Score,ByteType,false),StructField(Time,IntegerType,false),StructField(Summary,StringType,false),StructField(Text,StringType,false)))


In [35]:
# 2. Load the exported pipeline model
from pyspark.ml import PipelineModel
pipeline_model = PipelineModel.load("reviews_pipeline.model/")

In [36]:
# Start Streaming
from subprocess import Popen
Popen(["./start_streaming.sh"])

<subprocess.Popen at 0x7f052b98e650>

In [37]:
!ls output

0.csv  1.csv  2.csv


In [38]:
# 3. Prepare the input data stream
input_data_stream = spark.readStream.schema(schema).csv("output")

In [39]:
# 4. Prepare the output data stream
output_data_stream = pipeline_model.transform(input_data_stream)

# Start processing the input data
output_data_stream.writeStream.format("memory").queryName("predictions").start()

<pyspark.sql.streaming.StreamingQuery at 0x7f052b956150>

In [40]:
# List the output
import time
while(True):
    spark.sql("select * from predictions").show()
    time.sleep(5)

+---------+---+-----+----+-------+-------+---------+-------+------+--------------------+--------------------+
|Sentiment|Day|Month|Year|WeekNum|Weekday|HourOfDay|Weekend|Season|  tf_idf_frequencies|   prediction_output|
+---------+---+-----+----+-------+-------+---------+-------+------+--------------------+--------------------+
| POSITIVE|  8|    2|2005|      6|    Tue|        0|      0|Winter|(1528,[2,44,86,45...|[0.21051210165023...|
| POSITIVE| 18|    8|2008|     34|    Mon|        0|      0|Spring|(1528,[402],[6.86...|[0.21051210165023...|
| POSITIVE| 12|    7|2012|     28|    Thu|        0|      0|Spring|(1528,[29,1431],[...|[0.21051210165023...|
| NEGATIVE| 13|    6|2012|     24|    Wed|        0|      0|Spring|(1528,[12,41,79],...|[0.21051210165023...|
| POSITIVE| 27|    8|2010|     34|    Fri|        0|      0|Spring|(1528,[3,12,23,41...|[0.21051210165023...|
| POSITIVE| 20|    6|2012|     25|    Wed|        0|      0|Spring|(1528,[1,59,126],...|[0.21051210165023...|
| POSITIVE


+---------+---+-----+----+-------+-------+---------+-------+------+--------------------+--------------------+
|Sentiment|Day|Month|Year|WeekNum|Weekday|HourOfDay|Weekend|Season|  tf_idf_frequencies|   prediction_output|
+---------+---+-----+----+-------+-------+---------+-------+------+--------------------+--------------------+
| POSITIVE|  8|    2|2005|      6|    Tue|        0|      0|Winter|(1528,[2,44,86,45...|[0.21051210165023...|
| POSITIVE| 18|    8|2008|     34|    Mon|        0|      0|Spring|(1528,[402],[6.86...|[0.21051210165023...|
| POSITIVE| 12|    7|2012|     28|    Thu|        0|      0|Spring|(1528,[29,1431],[...|[0.21051210165023...|
| NEGATIVE| 13|    6|2012|     24|    Wed|        0|      0|Spring|(1528,[12,41,79],...|[0.21051210165023...|
| POSITIVE| 27|    8|2010|     34|    Fri|        0|      0|Spring|(1528,[3,12,23,41...|[0.21051210165023...|
| POSITIVE| 20|    6|2012|     25|    Wed|        0|      0|Spring|(1528,[1,59,126],...|[0.21051210165023...|
| POSITIV


+---------+---+-----+----+-------+-------+---------+-------+------+--------------------+--------------------+
|Sentiment|Day|Month|Year|WeekNum|Weekday|HourOfDay|Weekend|Season|  tf_idf_frequencies|   prediction_output|
+---------+---+-----+----+-------+-------+---------+-------+------+--------------------+--------------------+
| POSITIVE|  8|    2|2005|      6|    Tue|        0|      0|Winter|(1528,[2,44,86,45...|[0.21051210165023...|
| POSITIVE| 18|    8|2008|     34|    Mon|        0|      0|Spring|(1528,[402],[6.86...|[0.21051210165023...|
| POSITIVE| 12|    7|2012|     28|    Thu|        0|      0|Spring|(1528,[29,1431],[...|[0.21051210165023...|
| NEGATIVE| 13|    6|2012|     24|    Wed|        0|      0|Spring|(1528,[12,41,79],...|[0.21051210165023...|
| POSITIVE| 27|    8|2010|     34|    Fri|        0|      0|Spring|(1528,[3,12,23,41...|[0.21051210165023...|
| POSITIVE| 20|    6|2012|     25|    Wed|        0|      0|Spring|(1528,[1,59,126],...|[0.21051210165023...|
| POSITIV

KeyboardInterrupt: 

### Let's see it in practice!