# Structured Streaming

This lab needs a streaming data source.

For background, refer to [sdg], p. 44, and to the official guide: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html


We can create a source by running a Kafka server that simulates a live data stream.

Instructions for setting up the Kafka server are in `prepare_kafka_server.md` in the root directory of this repo.


## Basic Concepts
<img src="https://spark.apache.org/docs/latest/img/structured-streaming-stream-as-a-table.png">
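The idea in the figure, a stream treated as a table that keeps growing, can be illustrated without Spark. The following is a minimal plain-Python sketch: the country values and the `run_query` helper are made up for illustration, and re-running the query over the whole table is only the conceptual model (Spark computes results incrementally and does not keep the full input table).

```python
from collections import Counter

# Hypothetical records; in this lab they would arrive from Kafka.
incoming = [
    ["United Kingdom", "France"],  # rows arriving before trigger 1
    ["United Kingdom"],            # rows arriving before trigger 2
    ["Germany", "France"],         # rows arriving before trigger 3
]

input_table = []  # the conceptually unbounded input table


def run_query(table):
    """The 'query': count rows per country over the whole table so far."""
    return dict(Counter(table))


results = []
for new_rows in incoming:
    input_table.extend(new_rows)             # new data = new rows appended
    results.append(run_query(input_table))   # result table after this trigger

for trigger, result in enumerate(results, start=1):
    print(trigger, result)
```

Each trigger sees a longer input table, so each result table covers all data received so far.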


## The plan
You will read data from a Kafka data source using the streaming API.



In [1]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
from pyspark.sql.types import *
import os,time


In [2]:

# Schema for retail data 
SCHEMA = "InvoiceNo INT ,StockCode INT,Description STRING ,Quantity INT,InvoiceDate DATE,UnitPrice FLOAT,CustomerID FLOAT, country STRING"

# spark.jars.packages downloads the Kafka connector from Maven Central --> you need an internet connection
# The connector version (3.2.0 here) must match the Spark version you run!
spark = SparkSession.builder.appName('streaming')\
    .config("spark.kryoserializer.buffer.max", "512m")\
    .config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0')\
    .getOrCreate()




:: loading settings :: url = jar:file:/usr/local/spark-3.2.0-bin-hadoop3.2/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/jovyan/.ivy2/cache
The jars for the packages stored in: /home/jovyan/.ivy2/jars
org.apache.spark#spark-sql-kafka-0-10_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-1a574f2b-8893-40fa-8b22-435f9aa036fd;1.0
	confs: [default]
	found org.apache.spark#spark-sql-kafka-0-10_2.12;3.2.0 in central
	found org.apache.spark#spark-token-provider-kafka-0-10_2.12;3.2.0 in central
	found org.apache.kafka#kafka-clients;2.8.0 in central
	found org.lz4#lz4-java;1.7.1 in central
	found org.xerial.snappy#snappy-java;1.1.8.4 in central
	found org.slf4j#slf4j-api;1.7.30 in central
	found org.apache.hadoop#hadoop-client-runtime;3.3.1 in central
	found org.spark-project.spark#unused;1.0.0 in central
	found org.apache.hadoop#hadoop-client-api;3.3.1 in central
	found org.apache.htrace#htrace-core4;4.1.0-incubating in central
	found commons-logging#commons-logging;1.1.3 in central
	found com.google.code.findbugs#jsr305;3.0.0 in central

In [4]:
kafka_server = "kafka:9092"  # internal name in the Docker network
topic = "retail"             # the topic name where the data is stored

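## Read the data stream into a regular DataFrame

A regular (batch) read covers every record in the topic at the time the query runs. As more data arrives, the DataFrame gets bigger and bigger -- so BE CAREFUL!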

In [5]:
static_df = spark.read\
                  .format("kafka")\
                  .option("kafka.bootstrap.servers", kafka_server)\
                  .option("subscribe", topic)\
                  .option("startingOffsets", "earliest")\
                  .option("failOnDataLoss",False)\
                  .load()
retail_data = static_df.select(f.from_csv(f.decode("value", "US-ASCII"), schema=SCHEMA).alias("value")).select("value.*")

In [6]:
%%time 
# On my PC, count() and show() each take a fixed ~3 seconds (?!)
# This is probably due to a Spark config: https://stackoverflow.com/questions/59916338/why-is-there-a-delay-in-the-launch-of-spark-executors
print("%d records in frame" % retail_data.count())
retail_data.show(5)

                                                                                

542214 records in frame
+---------+---------+--------------------+--------+-----------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|InvoiceDate|UnitPrice|CustomerID|       country|
+---------+---------+--------------------+--------+-----------+---------+----------+--------------+
|     null|     null|         Description|    null|       null|     null|      null|       Country|
|   536365|     null|WHITE HANGING HEA...|       6| 2010-12-01|     2.55|   17850.0|United Kingdom|
|   536365|    71053| WHITE METAL LANTERN|       6| 2010-12-01|     3.39|   17850.0|United Kingdom|
|   536365|     null|CREAM CUPID HEART...|       8| 2010-12-01|     2.75|   17850.0|United Kingdom|
|   536365|     null|KNITTED UNION FLA...|       6| 2010-12-01|     3.39|   17850.0|United Kingdom|
+---------+---------+--------------------+--------+-----------+---------+----------+--------------+
only showing top 5 rows

CPU times: user 4.59 ms, sys: 2.32 ms, total: 6.9 m
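The first row shown above appears to be the CSV header line, delivered as an ordinary record: its text columns hold the column names, and every typed column is null. The null `StockCode` values in later rows suggest that some stock codes are not plain integers either. The following is a plain-Python sketch of that behaviour, not Spark's actual parser; the `COLUMNS` list mirrors only part of `SCHEMA`, and the `X123` stock code is a made-up value.

```python
import csv
import io

# Column names and casters loosely mirroring part of SCHEMA (illustrative only).
COLUMNS = [("InvoiceNo", int), ("StockCode", int), ("Description", str), ("Quantity", int)]


def parse_permissive(line):
    """Parse one CSV line; any field that fails its cast becomes None."""
    fields = next(csv.reader(io.StringIO(line)))
    row = {}
    for (name, cast), raw in zip(COLUMNS, fields):
        try:
            row[name] = cast(raw)
        except ValueError:
            row[name] = None  # keep the row, blank out the bad field
    return row


header_row = parse_permissive("InvoiceNo,StockCode,Description,Quantity")
good_row = parse_permissive("536365,71053,WHITE METAL LANTERN,6")
odd_row = parse_permissive("536365,X123,HYPOTHETICAL ITEM,6")  # made-up stock code

print(header_row)
print(good_row)
print(odd_row)
```

If the nulls matter for your analysis, one option is to declare such columns as `STRING` in the schema and filter out the header row explicitly.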

## Read the data stream using the streaming API

It does not make sense to read unbounded (potentially infinite) data into a regular DataFrame.

Let's read it in streaming mode instead (a.k.a. micro-batch processing).

In [7]:
OFFSETS_PER_TRIGGER = 5000
streaming_df = spark.readStream\
                  .format("kafka")\
                  .option("kafka.bootstrap.servers", kafka_server)\
                  .option("subscribe", topic)\
                  .option("startingOffsets", "earliest")\
                  .option("failOnDataLoss",False)\
                  .option("maxOffsetsPerTrigger", OFFSETS_PER_TRIGGER )\
                  .load()\
                  .select(f.from_csv(f.decode("value", "US-ASCII"), schema=SCHEMA).alias("value")).select("value.*")

In [8]:
# Let's see the structure of the DF
streaming_df

DataFrame[InvoiceNo: int, StockCode: int, Description: string, Quantity: int, InvoiceDate: date, UnitPrice: float, CustomerID: float, country: string]
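The `maxOffsetsPerTrigger` option caps how many Kafka offsets each micro-batch may consume. To get a feel for what that means for a backlog, here is a plain-Python sketch. It assumes a single-partition topic whose records are all present when the query starts; `plan_micro_batches` is a hypothetical helper, not a Spark API, and the total of 542214 is the record count reported by the batch read earlier.

```python
def plan_micro_batches(total_offsets, max_offsets_per_trigger):
    """Return the size of each micro-batch needed to drain the backlog."""
    sizes = []
    remaining = total_offsets
    while remaining > 0:
        size = min(remaining, max_offsets_per_trigger)  # never exceed the cap
        sizes.append(size)
        remaining -= size
    return sizes


batches = plan_micro_batches(542214, 5000)
print(len(batches), batches[0], batches[-1])  # prints: 109 5000 2214
```

Under these assumptions every batch except the last is exactly 5000 records.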

In [9]:
country_counts = streaming_df.groupBy('country').count()
count_countries_query = country_counts.writeStream\
    .queryName('num_countries')\
    .format("console")\
    .outputMode("complete")\
    .start()

# https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html



23/02/21 11:23:37 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-37b1bef1-b86f-4954-818c-f11f3fd33f95. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
23/02/21 11:23:37 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.


In [10]:
# Wait a while, as if we had something better to do.
# Meanwhile, Spark runs the query on each incoming micro-batch.
time.sleep(20)
count_countries_query.stop()
# If you don't stop the query, it will run forever, waiting for more data to arrive from the input.


                                                                                

-------------------------------------------
Batch: 0
-------------------------------------------
+--------------+-----+
|       country|count|
+--------------+-----+
|       Germany|   30|
|        France|   20|
|          EIRE|   24|
|        Norway|   73|
|       Country|    2|
|     Australia|   14|
|United Kingdom| 4835|
|   Netherlands|    2|
+--------------+-----+



                                                                                

-------------------------------------------
Batch: 1
-------------------------------------------
+--------------+-----+
|       country|count|
+--------------+-----+
|       Germany|  181|
|        France|  106|
|       Belgium|   12|
|         Italy|   24|
|          EIRE|  109|
|     Lithuania|   34|
|        Norway|   73|
|         Spain|    5|
|   Switzerland|    6|
|         Japan|   16|
|       Country|    4|
|        Poland|    8|
|      Portugal|    7|
|     Australia|   14|
|United Kingdom| 9399|
|   Netherlands|    2|
+--------------+-----+



                                                                                

-------------------------------------------
Batch: 2
-------------------------------------------
+--------------+-----+
|       country|count|
+--------------+-----+
|       Germany|  213|
|        France|  167|
|       Belgium|   12|
|         Italy|   25|
|          EIRE|  145|
|     Lithuania|   34|
|        Norway|   73|
|         Spain|    5|
|       Iceland|   31|
|   Switzerland|    6|
|         Japan|   16|
|       Country|    6|
|        Poland|    8|
|      Portugal|   14|
|     Australia|   14|
|United Kingdom|14229|
|   Netherlands|    2|
+--------------+-----+



                                                                                

-------------------------------------------
Batch: 3
-------------------------------------------
+--------------+-----+
|       country|count|
+--------------+-----+
|       Germany|  254|
|        France|  197|
|       Belgium|   12|
|         Italy|   25|
|          EIRE|  145|
|     Lithuania|   35|
|        Norway|  147|
|         Spain|    5|
|       Iceland|   31|
|   Switzerland|    6|
|         Japan|   17|
|       Country|    8|
|        Poland|    8|
|      Portugal|   80|
|     Australia|   22|
|United Kingdom|19006|
|   Netherlands|    2|
+--------------+-----+



23/02/21 11:23:58 ERROR WriteToDataSourceV2Exec: Data source write support org.apache.spark.sql.execution.streaming.sources.MicroBatchWrite@6ee6191f is aborting.
23/02/21 11:23:58 ERROR WriteToDataSourceV2Exec: Data source write support org.apache.spark.sql.execution.streaming.sources.MicroBatchWrite@6ee6191f aborted.
23/02/21 11:23:58 WARN Shell: Interrupted while joining on: Thread[Thread-8606,5,]
java.lang.InterruptedException
	at java.base/java.lang.Object.wait(Native Method)
	at java.base/java.lang.Thread.join(Thread.java:1300)
	at java.base/java.lang.Thread.join(Thread.java:1375)
	at org.apache.hadoop.util.Shell.joinThread(Shell.java:1043)
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:1003)
	at org.apache.hadoop.util.Shell.run(Shell.java:901)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:1307)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:1289)
	at org.apache.hadoop.fs.File
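The batches printed above show cumulative counts: the query uses `outputMode("complete")`, so every batch prints the whole result table, including countries that received no new rows in that batch (Norway stays at 73 across batches 0 to 2). A minimal plain-Python sketch of that behaviour follows; the `complete_mode` helper and the sample country values are made up for illustration and are not Spark code.

```python
from collections import Counter


def complete_mode(batches):
    """Yield the full running-count table after each micro-batch."""
    state = Counter()        # aggregation state kept across batches
    for batch in batches:
        state.update(batch)  # fold only the new rows into the state
        yield dict(state)    # 'complete' mode: emit every group, changed or not


# Hypothetical micro-batches of country values.
sample_batches = [
    ["Norway", "Germany", "Norway"],
    ["Germany", "France"],
    ["France"],
]

tables = list(complete_mode(sample_batches))
for i, table in enumerate(tables):
    print("Batch:", i, table)
```

Only the new rows are processed in each batch, yet the printed table always reflects everything seen so far.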

# Another example - reading text from a network connection

Copied verbatim (word for word) from https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#quick-example

*FIRST* run the data source in another window, and then run the cell below. If you run the cell first, it will complain about "Connection refused", which means there is no input.

When you have had enough, close the data source, and the cell will finish automatically (because it detects that the connection was terminated).
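The Spark quick example uses Netcat (`nc -lk 9999`) as the data source. If Netcat is not available, a small Python script can play the same role. The following is a sketch under that assumption: `open_server` and `send_lines` are hypothetical helper names, and the script serves a single client and then closes the connection.

```python
import socket
import time


def open_server(host="localhost", port=9999):
    """Create a listening TCP socket (port 0 lets the OS pick a free port)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    return server


def send_lines(server, lines, delay=1.0):
    """Wait for one client, send it each line, then close the connection."""
    conn, _ = server.accept()
    with conn:
        for line in lines:
            conn.sendall((line + "\n").encode("utf-8"))
            time.sleep(delay)  # spread the lines over several micro-batches
    server.close()
```

To use it, run `send_lines(open_server(), ["noam is here", "some is not here"])` in a separate Python session before starting the streaming cell. Once the last line is sent the connection closes, which ends the streaming query.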

In [None]:
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

# Create DataFrame representing the stream of input lines from connection to localhost:9999
lines = spark \
    .readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# Split the lines into words
words = lines.select(
   explode(
       split(lines.value, " ")
   ).alias("word")
)

# Generate running word count
wordCounts = words.groupBy("word").count()

# Start running the query that prints the running counts to the console
query = wordCounts \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

query.awaitTermination()

23/02/21 11:27:37 WARN TextSocketSourceProvider: The socket source should not be used for production applications! It does not support recovery.
23/02/21 11:27:37 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-bf22d1ca-7c4a-404e-8c59-2262a2065570. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
23/02/21 11:27:37 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
                                                                                

-------------------------------------------
Batch: 0
-------------------------------------------
+----+-----+
|word|count|
+----+-----+
+----+-----+



                                                                                

-------------------------------------------
Batch: 1
-------------------------------------------
+----+-----+
|word|count|
+----+-----+
|  is|    1|
|noam|    1|
|here|    1|
+----+-----+



                                                                                

-------------------------------------------
Batch: 2
-------------------------------------------
+----+-----+
|word|count|
+----+-----+
|some|    1|
| not|    1|
|  is|    2|
|noam|    1|
|here|    2|
+----+-----+



                                                                                

-------------------------------------------
Batch: 3
-------------------------------------------
+-----+-----+
| word|count|
+-----+-----+
| some|    1|
|  not|    2|
|nice,|    1|
|  it?|    1|
|   is|    4|
|spark|    1|
| noam|    1|
| here|    2|
|   or|    1|
+-----+-----+



23/02/21 11:28:38 WARN TextSocketMicroBatchStream: Stream closed by localhost:9999


# What we did not cover here

This was just a taste of the streaming API. 

New features are added from time to time, so checking the docs is always advised.

Some interesting topics to follow up on:
* Selection, projection
* Handling errors (duplication, recovery ...)
* Window operations
* Join operations

# Check yourself

* What will happen if you run `country_counts.show()`? Why?
* Change `OFFSETS_PER_TRIGGER` to 100. How does it affect the processing?