<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

## SparkML on Streaming Data

Let's load the pipeline model we saved earlier and apply it to some streaming data!

In [4]:
%run "./Includes/Classroom_Setup"

In [5]:
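from pyspark.ml import PipelineModel

# Load the fitted pipeline saved earlier.
# `userhome` is expected to come from the Classroom_Setup notebook run above.
fileName = userhome + "/tmp/DT_Pipeline"
pipelineModel = PipelineModel.load(fileName)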

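We can simulate streaming data by reading a static Parquet dataset as a stream. Setting `maxFilesPerTrigger` to 1 makes Spark pick up one file per micro-batch.

NOTE: You must specify a schema when creating a streaming source DataFrame from files (unless streaming schema inference has been explicitly enabled).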

In [7]:
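from pyspark.sql.types import DoubleType, StringType, StructField, StructType

# Declare the schema up front; the streaming reader needs it.
schema = StructType([
  StructField("rating", DoubleType()),
  StructField("review", StringType())])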

streamingData = (spark
                 .readStream
                 .schema(schema)
                 .option("maxFilesPerTrigger", 1)
                 .parquet("/mnt/training/movie-reviews/imdb/imdb_ratings_50k.parquet"))
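To get a feel for what `maxFilesPerTrigger` does, here is a minimal pure-Python sketch of the idea: a list of files is consumed in groups of at most N per trigger. The helper `micro_batches` and the file names are hypothetical stand-ins, not part of Spark or of the dataset above.

```python
from typing import Iterator, List


def micro_batches(files: List[str], max_files_per_trigger: int) -> Iterator[List[str]]:
    """Yield successive groups of at most `max_files_per_trigger` files."""
    if max_files_per_trigger < 1:
        raise ValueError("max_files_per_trigger must be at least 1")
    for start in range(0, len(files), max_files_per_trigger):
        yield files[start:start + max_files_per_trigger]


# Hypothetical file names standing in for the part files of a Parquet directory.
part_files = [f"part-{i:05d}.parquet" for i in range(5)]

one_per_trigger = list(micro_batches(part_files, 1))
two_per_trigger = list(micro_batches(part_files, 2))

print(len(one_per_trigger))  # 5 micro-batches, one file each
print(len(two_per_trigger))  # 3 micro-batches: 2 + 2 + 1 files
```

With a limit of 1, every file becomes its own micro-batch, which is what makes a static dataset behave like a slowly arriving stream.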

Why is this stream taking so long? What configuration should we set?

In [9]:
stream = (pipelineModel
          .transform(streamingData)
          .groupBy("label", "prediction")
          .count()
          .sort("label", "prediction"))

display(stream)

label,prediction,count
0.0,0.0,12876
0.0,1.0,12122
1.0,0.0,3047
1.0,1.0,21949


In [10]:
spark.conf.get("spark.sql.shuffle.partitions")

In [11]:
spark.conf.set("spark.sql.shuffle.partitions", "8")
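Why would this setting matter? Spark's default for `spark.sql.shuffle.partitions` is 200, so each micro-batch's `groupBy` shuffles into 200 partitions even though the aggregation only has four (label, prediction) groups. The rough sketch below counts how many partitions would actually receive data. Note the assumption: Python's built-in `hash` is used purely as a stand-in, since Spark uses its own hash function, and `non_empty_partitions` is a hypothetical helper.

```python
from collections import Counter

# The four (label, prediction) groups produced by the aggregation.
groups = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]


def non_empty_partitions(keys, num_partitions):
    """Count partitions that receive at least one key under hash partitioning.

    Python's built-in hash is only a stand-in here; Spark uses its own hash function.
    """
    assignment = Counter(hash(key) % num_partitions for key in keys)
    return len(assignment)


for num_partitions in (200, 8):
    busy = non_empty_partitions(groups, num_partitions)
    idle = num_partitions - busy
    print(f"{num_partitions} partitions: {busy} hold data, {idle} are empty")
```

Whatever the hash function, at most four partitions can hold data, so with 200 partitions nearly all of the per-batch tasks do no useful work.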

Let's try this again with the smaller number of shuffle partitions.

In [13]:
stream = (pipelineModel
          .transform(streamingData)
          .groupBy("label", "prediction")
          .count()
          .sort("label", "prediction"))

display(stream)

label,prediction,count
0.0,0.0,12876
0.0,1.0,12122
1.0,0.0,3047
1.0,1.0,21949
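The grouped counts above form a confusion matrix, so we can read a few summary metrics off them. This is a small worked example in plain Python; it assumes that label `1.0` is the positive class, which the notebook does not state explicitly.

```python
# Counts copied from the (label, prediction) table above.
# Assumption: label 1.0 is treated as the positive class.
counts = {
    (0.0, 0.0): 12876,  # true negatives
    (0.0, 1.0): 12122,  # false positives
    (1.0, 0.0): 3047,   # false negatives
    (1.0, 1.0): 21949,  # true positives
}

tn, fp = counts[(0.0, 0.0)], counts[(0.0, 1.0)]
fn, tp = counts[(1.0, 0.0)], counts[(1.0, 1.0)]
total = tn + fp + fn + tp

accuracy = (tp + tn) / total   # share of all reviews predicted correctly
precision = tp / (tp + fp)     # share of predicted positives that are correct
recall = tp / (tp + fn)        # share of actual positives that were found

print(f"total={total}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Under that assumption, the model finds most of the positive reviews but also labels many negative reviews as positive.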


Let's write our results to an in-memory table, which we can then query with SQL.

In [15]:
import re

# Keep only letters, digits, and underscores so the result is a valid table name
streamingView = re.sub(r'\W', '', username)
checkpointFile = userhome + "/tmp/checkPoint"
dbutils.fs.rm(checkpointFile, True) # Clear out the checkpointing directory

(stream
 .writeStream
 .format("memory")
 .option("checkpointLocation", checkpointFile)
 .outputMode("complete")
 .queryName(streamingView)
 .start())
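The query name doubles as the name of the in-memory table, so it has to be a valid identifier. As a quick illustration of the `\W` substitution used above, here is a sketch with a hypothetical username (the real one presumably comes from the classroom setup); `to_view_name` is a made-up helper for this example.

```python
import re


def to_view_name(raw_name: str) -> str:
    """Strip every character that is not a letter, digit, or underscore."""
    return re.sub(r"\W", "", raw_name)


# Hypothetical username used only for illustration.
print(to_view_name("first.last@example.com"))  # firstlastexamplecom
```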

In [16]:
display(spark.sql("select * from " + streamingView))

label,prediction,count
0.0,0.0,9615
0.0,1.0,9055
1.0,0.0,2304
1.0,1.0,16521

These counts are lower than the totals shown earlier, most likely because the table was queried before the stream had processed every file. With `complete` output mode the table is refreshed after each micro-batch, so re-running the query shows updated counts.


&copy; 2018 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>