#### Structured Streaming

With Structured Streaming, you can take the same operations that you perform in batch mode using Spark's structured APIs and run them in a streaming fashion. This can reduce latency and allow for incremental processing.


* Stream processing continuously incorporates new data to compute the result. The input is unbounded, so the result has multiple versions over time.
* Batch processing computes the result only once, over a fixed input dataset.
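
To make the difference concrete, here is a minimal plain-Python sketch (not Spark code). The function names `batch_total` and `streaming_totals` are made up for illustration. The batch version returns one result, while the streaming version emits a new version of the result for every record it sees.

```python
from typing import Iterable, Iterator, List


def batch_total(dataset: List[int]) -> int:
    """Batch: the input is fixed, so the result is computed exactly once."""
    return sum(dataset)


def streaming_totals(source: Iterable[int]) -> Iterator[int]:
    """Stream: fold each new record into the previous result and emit a new version."""
    total = 0
    for record in source:
        total += record
        yield total


events = [5, 3, 7]
print(batch_total(events))             # 15
print(list(streaming_totals(events)))  # [5, 8, 15]
```

The last version of the streaming result equals the batch result, which is why the same logic can serve both modes.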

Stream and batch processing often need to work together.

Stream processing use cases: notifications, real-time reporting, incremental ETL, real-time decision making, online ML, updating data to serve in real time.

Advantages of stream processing: lower latency, efficient updates, automatic bookkeeping on new data.

Is it expensive? Not necessarily: Databricks lets you schedule a stream to process only the currently available data.

##### Micro-batch processing

Data from a source (for example Kafka) can arrive faster than it can be consumed record by record. The micro-batch model addresses this -> we collect data for a set interval of time, the trigger interval, and process it as one small batch.
There are 2 models for stream processing systems: **continuous or micro-batch processing**.
In the continuous model, each node in the system continuously listens for messages from other nodes and outputs new updates to its downstream nodes.
Micro-batch processing waits to accumulate small batches of input data, then processes each batch in parallel.
When you have to decide between these 2 modes, consider your desired latency and the total cost of operation (TCO).
Micro-batch systems typically deliver latencies from about 100 ms to a few seconds depending on the application; continuous processing can go lower, down to about 1 ms in Spark.
With micro-batches, new rows are appended to an unbounded table.
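
A toy plain-Python sketch of the micro-batch idea, assuming events carry an arrival time in seconds. The `micro_batches` helper and the sample events are hypothetical and only model the grouping; Spark's actual scheduler is more involved.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Event = Tuple[float, str]  # (arrival time in seconds, payload)


def micro_batches(events: List[Event], trigger_interval: float) -> Dict[int, List[str]]:
    """Group events into micro-batches by the trigger interval they arrived in."""
    batches: Dict[int, List[str]] = defaultdict(list)
    for arrival, payload in events:
        batch_id = int(arrival // trigger_interval)
        batches[batch_id].append(payload)
    return dict(batches)


events = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (2.7, "d"), (2.9, "e")]
batches = micro_batches(events, trigger_interval=1.0)

unbounded_table: List[str] = []
for batch_id in sorted(batches):
    unbounded_table.extend(batches[batch_id])  # new rows are appended to the table
    print(batch_id, batches[batch_id], len(unbounded_table))
```

A longer trigger interval gives fewer, larger batches (higher latency, less overhead); a shorter one gives the opposite.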

Input sources: Kafka, Event Hubs, files.
For testing: sockets and generator.
`spark.readStream <insert input configuration>`

Configure data stream writer: 
`spark.readStream <insert input configuration>
.filter(col("event_name") == "finalize")
.groupBy("traffic_source").count()
.writeStream
<insert sink configurations>`

Output sinks: Kafka, Event Hubs, files, foreach.
For debugging: console, memory.

Sinks specify the destination of the result set of a stream.

File sink -> idempotent, so it can provide end-to-end exactly-once semantics in a Structured Streaming job.

**Output modes**:
* append: add new records only.
* update: update changed records.
* complete: rewrite full output.
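
The three modes can be compared with a toy plain-Python model (not the Spark API). The helpers `aggregated_output` and `appended_output` and the sample batches are invented for illustration. Update and complete are shown on a running count per key, and append on a non-aggregated stream, since in Spark append mode on an aggregation additionally needs a watermark.

```python
from collections import Counter
from typing import Dict, List


def aggregated_output(batches: List[List[str]], mode: str) -> List[Dict[str, int]]:
    """Rows a running count per key would send to the sink after each micro-batch."""
    counts: Counter = Counter()
    written: List[Dict[str, int]] = []
    for batch in batches:
        counts.update(batch)
        if mode == "complete":
            written.append(dict(counts))  # rewrite the full result table
        elif mode == "update":
            written.append({key: counts[key] for key in set(batch)})  # changed rows only
        else:
            raise ValueError(f"unsupported mode for this toy aggregation: {mode}")
    return written


def appended_output(batches: List[List[str]]) -> List[List[str]]:
    """Append mode on a non-aggregated stream: each batch contributes only its new rows."""
    return [list(batch) for batch in batches]


batches = [["email", "google"], ["email"], ["direct", "google"]]
print(aggregated_output(batches, "complete"))
print(aggregated_output(batches, "update"))
print(appended_output(batches))
```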

**Trigger types**:

* Default: process each micro-batch as soon as the previous one has been processed.
* Fixed interval: micro-batch processing is kicked off at the user-specified interval.
* One-time: process all of the available data as a single micro-batch and then automatically stop the query.
* Continuous processing: long-running tasks that continuously read, process, and write data as soon as events are available.
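
A simplified plain-Python model of how the default and fixed-interval triggers differ in when each micro-batch starts. The function `batch_start_times` and the durations are hypothetical. It assumes data is always available, and that a batch that overruns the interval is followed immediately by the next one, which is a simplification of Spark's real scheduling.

```python
from typing import List, Optional


def batch_start_times(durations_ms: List[int], interval_ms: Optional[int] = None) -> List[int]:
    """Start time (ms) of each micro-batch, given how long each one takes to process.

    interval_ms=None models the default trigger: start as soon as the previous batch ends.
    Otherwise it models a fixed interval: wait out the interval unless the batch overran it.
    """
    starts: List[int] = []
    now = 0
    for duration in durations_ms:
        starts.append(now)
        finished = now + duration
        if interval_ms is None:
            now = finished
        else:
            now = max(finished, now + interval_ms)
    return starts


durations = [200, 1500, 300]
print(batch_start_times(durations))                    # [0, 200, 1700]
print(batch_start_times(durations, interval_ms=1000))  # [0, 1000, 2500]
```

In this model the second batch takes 1500 ms, longer than the 1000 ms interval, so the third batch starts right when it finishes instead of waiting for another full interval.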

Fault tolerance: guaranteed by the combination of checkpointing and write-ahead logs, idempotent sinks, and replayable data sources.
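
How these three pieces fit together can be sketched with a toy plain-Python simulation (not Spark internals). All class and function names here are made up, and the checkpoint is just a dict. The offset range of each batch is logged before processing, the source can re-read that range, and the sink overwrites by batch id, so a crash between write and commit does not produce duplicates.

```python
from typing import Dict, List, Optional, Tuple


class ReplayableSource:
    """Source that can re-read any offset range, like a log or a directory of files."""

    def __init__(self, records: List[str]) -> None:
        self._records = records

    def latest_offset(self) -> int:
        return len(self._records)

    def read(self, start: int, end: int) -> List[str]:
        return self._records[start:end]


class IdempotentSink:
    """Sink keyed by batch id: writing the same batch twice leaves a single copy."""

    def __init__(self) -> None:
        self.batches: Dict[int, List[str]] = {}

    def write(self, batch_id: int, rows: List[str]) -> None:
        self.batches[batch_id] = list(rows)

    def rows(self) -> List[str]:
        return [row for batch_id in sorted(self.batches) for row in self.batches[batch_id]]


def run(source: ReplayableSource, sink: IdempotentSink, checkpoint: dict,
        batch_size: int, crash_in_batch: Optional[int] = None) -> None:
    """Process all available data, resuming from whatever the checkpoint recorded."""
    planned: Dict[int, Tuple[int, int]] = checkpoint["planned"]
    while True:
        batch_id = checkpoint["committed"]  # next batch to run
        if batch_id not in planned:
            start = planned[batch_id - 1][1] if batch_id > 0 else 0
            end = min(start + batch_size, source.latest_offset())
            if start == end:
                return  # no new data
            planned[batch_id] = (start, end)  # write-ahead log: record the range first
        start, end = planned[batch_id]
        sink.write(batch_id, source.read(start, end))
        if batch_id == crash_in_batch:
            raise RuntimeError("simulated crash after write, before commit")
        checkpoint["committed"] = batch_id + 1


source = ReplayableSource(["a", "b", "c", "d", "e"])
sink = IdempotentSink()
checkpoint = {"planned": {}, "committed": 0}

try:
    run(source, sink, checkpoint, batch_size=2, crash_in_batch=1)
except RuntimeError:
    pass  # the query died after writing batch 1 but before committing it

run(source, sink, checkpoint, batch_size=2)  # restart from the checkpoint
print(sink.rows())  # ['a', 'b', 'c', 'd', 'e']
```

On restart, batch 1 is replayed over the same logged offset range and overwrites its earlier output, so every record appears exactly once.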

Some streaming query operations: stop stream, await termination, status, is active, recent progress, name, ID, runID

In [0]:
# Obtain an initial streaming DataFrame from a Parquet-format file source.

schema = "device STRING, ecommerce STRUCT<purchase_revenue_in_usd: DOUBLE, total_item_quantity: BIGINT, unique_items: BIGINT>, event_name STRING, event_previous_timestamp BIGINT, event_timestamp BIGINT, geo STRUCT<city: STRING, state: STRING>, items ARRAY<STRUCT<coupon: STRING, item_id: STRING, item_name: STRING, item_revenue_in_usd: DOUBLE, price_in_usd: DOUBLE, quantity: BIGINT>>, traffic_source STRING, user_first_touch_timestamp BIGINT, user_id STRING"

df = (spark
      .readStream
      .schema(schema)
      .option("maxFilesPerTrigger", 1)
      .parquet("/mnt/training/ecommerce/events/events.parquet")
     )

df.isStreaming

Out[2]: True

In [0]:
# Apply some transformations, producing new streaming DataFrames.

from pyspark.sql.functions import col  # only col is used in this cell

emailTrafficDF = (df
                  .filter(col("traffic_source") == "email")
                  .withColumn("mobile", col("device").isin(["iOS", "Android"]))
                  .select("user_id", "event_timestamp", "mobile")
                 )

emailTrafficDF.isStreaming

Out[3]: True

In [0]:
# Take the final streaming DataFrame (our result table) and write it to a file sink in "append" mode.

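# `userhome` is assumed to be a per-user base path defined earlier (e.g. by the course setup notebook).
checkpointPath = userhome + "/email_traffic/checkpoint"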
outputPath = userhome + "/email_traffic/output"

devicesQuery = (emailTrafficDF
                .writeStream
                .outputMode("append")
                .format("parquet")
                .queryName("email_traffic")
                .trigger(processingTime="1 second")
                .option("checkpointLocation", checkpointPath)
                .start(outputPath)
               )

##### Monitor streaming query 

-> Use the streaming query "handle" to monitor and control it.

In [0]:
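# Unique ID of the query; it persists across restarts from the same checkpoint.
devicesQuery.id

In [0]:
# Current status: message, whether data is available, whether a trigger is active.
devicesQuery.status

In [0]:
# Metrics for the most recent progress update (None if there is none yet).
devicesQuery.lastProgress

In [0]:
# Block for up to 5 seconds; returns True if the query terminated within that time.
devicesQuery.awaitTermination(5)

In [0]:
# Stop the streaming query.
devicesQuery.stop()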