# Writing Continuous Applications with Structured Streaming Python APIs in Apache Spark

Tutorial for <img src="https://databricks.com/wp-content/uploads/2018/12/pydata-logo-4.png" alt="" width="6%"/> Miami

---
title: Continuous Streaming - Event-time Aggregation and Watermarking in Structured Streaming
authors:
- Michael Johns
- Modified by Jules Damji (for PyData Miami)

created_at: 2018-10-02
updated_at: 2019-1-5

tldr: Demonstrates event-time aggregation, watermarks, windows, late data handling
---

# Event-time Aggregation and Watermarking in Structured Streaming

Continuous applications often need to make near real-time decisions based on statistics aggregated in real time, such as the health of and readings from IoT devices, or the detection of anomalous behavior. In this notebook, we briefly explore how easily streaming aggregations can be expressed in Structured Streaming, and how naturally late and out-of-order data is handled.

### Stateful Incremental Execution 

Structured Streaming lets users express a streaming query the same way as a batch query; the Spark SQL engine then incrementalizes the query and executes it on streaming data.
* The Spark SQL engine internally maintains the intermediate aggregations as fault-tolerant state.
* At every trigger, the state is read and updated in the state store, and all updates are saved to the write-ahead log.
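
The idea of incremental, stateful execution can be sketched in plain Python. This is only a toy model, not how Spark implements it: the `state` dictionary stands in for the state store, `process_trigger` is a hypothetical helper, and the sample batches are made up. Fault tolerance and the write-ahead log are not modeled.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for the state store:
# deviceId -> (sum of signal strengths, number of readings)
state = defaultdict(lambda: (0.0, 0))

def process_trigger(batch):
    """Fold one micro-batch of (deviceId, signalStrength) pairs into the
    state and return the updated running average for each device touched."""
    for device_id, signal in batch:
        total, count = state[device_id]
        state[device_id] = (total + signal, count + 1)
    updated = {}
    for device_id, _ in batch:
        total, count = state[device_id]
        updated[device_id] = total / count
    return updated

# Two made-up triggers: only the new records are read each time,
# yet the averages reflect everything seen so far.
first = process_trigger([(1, 10.0), (2, 20.0)])
second = process_trigger([(1, 30.0)])
print(first)
print(second)
```

Only the intermediate `(sum, count)` pairs are kept between triggers, which is why the engine never needs to re-read old input to update an aggregate.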

<img src="https://demo.cloud.databricks.com/files/mjohns/streaming/watermarking/fault-tolerant-exactly-once-stateful-stream-processing-in-structured-streaming.png" width="40%"/>


<sub>Reference [Blog Event-time Aggregation and Watermarking](https://databricks.com/blog/2017/05/08/event-time-aggregation-watermarking-apache-sparks-structured-streaming.html)</sub>

<sub>Another [good reference](http://vishnuviswanath.com/spark_structured_streaming.html) to understand Sliding vs Tumbling Windows and Watermarking</sub>

## Setup

In [5]:
%run "./setup/setup_data"

### Input path for sensors.

In [7]:
sensor_path = "/mnt/jules-pydata/Streaming/continuous_streaming/streaming_sensor/"
dbutils.fs.head("{}streaming-sensor_file-1.json".format(sensor_path), 233)

In [8]:
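# Import only the names used in this notebook; wildcard imports from
# pyspark.sql.functions shadow Python builtins such as sum, min and max.
from pyspark.sql.functions import col, window
from pyspark.sql.types import (
  DoubleType, LongType, StringType, StructType, TimestampType
)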

jsonSchema = (
  StructType()
  .add("timestamp", TimestampType())
  .add("deviceId", LongType())
  .add("deviceType", StringType())
  .add("signalStrength", DoubleType())
)

In [9]:
# Streaming DataFrame with schema
# [timestamp: timestamp, deviceId: bigint, deviceType: string, signalStrength: double]
eventsDF = (
  spark
    .readStream
    .schema(jsonSchema)
    .option("maxFilesPerTrigger", 1)  # read one file per trigger to slow the demo down
    .json(sensor_path)  # the source directory of JSON files
)

## Standard Aggregations (not Windowed)

Let's start with a standard, non-windowed aggregation: the average signal strength grouped by `deviceId`.

In [11]:
display(eventsDF.groupBy("deviceId").avg("signalStrength"))

## Aggregations on Windows over Event-Time

In many cases, rather than running aggregations over the whole stream, you want aggregations over data bucketed by time windows (say, every 5 minutes or every hour). For example, you may want the average signal strength over the last 5 minutes to check whether devices have started to behave anomalously. The example below counts events per 5-minute window. To do this, we need to:
* Move beyond just _processing-time_ windows (when data hits the system)
* Handle _event-time_ windows (when events actually happened, reflected in a field in the data itself) 

<img src="https://demo.cloud.databricks.com/files/mjohns/streaming/watermarking/mapping-of-event-time-to-5-min-tumbling-windows.png" width="40%"/>

__Notice each window is a group for which running counts are calculated.__

In [13]:
display(
  eventsDF 
    .groupBy(window("timestamp", "5 minutes"))
    .count()
)
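
To make the bucketing concrete, here is a minimal plain-Python sketch of how an event time maps to a 5-minute tumbling window. It assumes windows are aligned to the Unix epoch; the helper names `tumbling_window` and `t` and the sample timestamps are illustrative only, not part of the notebook's data.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
WINDOW = timedelta(minutes=5)

def tumbling_window(ts, width=WINDOW):
    """Return the [start, end) window containing ts, aligned to the epoch."""
    start = ts - (ts - EPOCH) % width
    return start, start + width

def t(hour, minute):
    """Hypothetical helper: a UTC timestamp on an arbitrary demo day."""
    return datetime(2018, 10, 2, hour, minute, tzinfo=timezone.utc)

# Made-up event times: two fall into 12:00-12:05, one into 12:05-12:10.
events = [t(12, 2), t(12, 4), t(12, 7)]
counts = Counter(tumbling_window(ts) for ts in events)

for (start, end), n in sorted(counts.items()):
    print(start.strftime("%H:%M"), end.strftime("%H:%M"), n)
```

Each event lands in exactly one window, and the window start is inclusive while the end is exclusive.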

You can also define overlapping windows by specifying both the window length and the sliding interval (example below).

<img src="https://demo.cloud.databricks.com/files/mjohns/streaming/watermarking/mapping-of-event-time-to-overlapping-windows-of-length-10-mins-and-sliding-interval-5-mins.png" width="40%"/>

__Notice this grouping strategy automatically handles _late_ and _out-of-order data_ — the late event would just update older window groups instead of the latest ones.__

In [15]:
display (
  eventsDF
    .groupBy(window("timestamp", "10 minutes", "5 minutes"))
    .count()
)
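
With overlapping windows, a single event belongs to more than one group. The sketch below lists the windows a timestamp falls into for a 10-minute length and a 5-minute slide. It is a plain-Python illustration under the assumption that window starts are aligned to multiples of the slide since the Unix epoch; `sliding_windows` and `t` are hypothetical helpers.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def sliding_windows(ts, length=timedelta(minutes=10), slide=timedelta(minutes=5)):
    """Return every [start, end) window of the given length that contains ts."""
    # Latest window start at or before ts.
    start = ts - (ts - EPOCH) % slide
    windows = []
    # Walk backwards one slide at a time while the window still covers ts.
    while start + length > ts:
        windows.append((start, start + length))
        start -= slide
    return sorted(windows)

def t(hour, minute):
    """Hypothetical helper: a UTC timestamp on an arbitrary demo day."""
    return datetime(2018, 10, 2, hour, minute, tzinfo=timezone.utc)

for start, end in sliding_windows(t(12, 7)):
    print(start.strftime("%H:%M"), end.strftime("%H:%M"))
```

An event at 12:07 is counted in both the 12:00-12:10 and the 12:05-12:15 windows, which is why the counts across overlapping windows add up to more than the number of events.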

### A complex end-to-end query

Here is an end-to-end illustration of a query grouped by both `deviceId` and the overlapping windows. It shows how the final result changes as new data is processed with 5-minute triggers.

<img src="https://demo.cloud.databricks.com/files/mjohns/streaming/watermarking/late-data-handling-in-windowed-grouped-aggregation.png" width="40%"/>

__Notice how the late, out-of-order record [12:04, dev2] updated an old window’s count.__

In [18]:
# Group by both deviceId and sliding window; window_size is the window length in minutes
display (
  eventsDF 
    .groupBy(
      "deviceId",
      window("timestamp", "10 minutes", "5 minutes")) 
    .count()
    .withColumn(
      "window_size", 
      (col("window.end").cast("Long") - col("window.start").cast("Long")) / 60
    )
    .orderBy("window.end", ascending=False)
)
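
The effect of a late record on an older window can be reproduced with a small plain-Python model. Times are simplified to minutes since midnight, the events are made up (they are not the exact records in the figure), and `windows_for`, `hhmm` and `process` are hypothetical helpers.

```python
from collections import Counter

LENGTH, SLIDE = 10, 5  # window length and sliding interval, in minutes

def hhmm(hour, minute):
    """Minutes since midnight, used here as a simplified event time."""
    return hour * 60 + minute

def windows_for(minute):
    """Start minutes of all windows [start, start + LENGTH) containing `minute`."""
    latest = minute - minute % SLIDE
    return list(range(latest - LENGTH + SLIDE, latest + SLIDE, SLIDE))

counts = Counter()  # (deviceId, window start) -> running count

def process(batch):
    """Apply one trigger's worth of (event time, deviceId) records."""
    for minute, device in batch:
        for start in windows_for(minute):
            counts[(device, start)] += 1

# First trigger: both records arrive on time.
process([(hhmm(12, 7), "dev1"), (hhmm(12, 8), "dev2")])
before = counts[("dev2", hhmm(12, 0))]

# Second trigger: one on-time record plus a late record stamped 12:04.
process([(hhmm(12, 13), "dev1"), (hhmm(12, 4), "dev2")])
after = counts[("dev2", hhmm(12, 0))]

print(before, after)
```

Because grouping is keyed on the event time carried in the record, the late record updates the 12:00-12:10 window for `dev2` regardless of when it arrives.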

## Watermarking to Limit State while Handling Late Data
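The arrival of late data can result in updates to older windows. This makes it hard to decide which old aggregates will no longer be updated and can therefore be dropped from the state store to limit its size. In Apache Spark 2.1+, watermarking enables automatic dropping of old state data. A watermark is a moving threshold in event time that trails the maximum event time seen by the query by a user-specified delay; data older than the watermark is considered "too late".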

<img src="https://demo.cloud.databricks.com/files/mjohns/streaming/watermarking/watermarking-in-windowed-grouped-aggregation.png" width="40%"/>

__Notice in the example above, a "too late" event arrives between the processing-times 12:20 and 12:25. The watermark is used to differentiate between late and the “too-late” events and treat them accordingly.__

A much simpler example that illustrates the concept behind watermarking can be found in [this blog](http://vishnuviswanath.com/spark_structured_streaming.html).

<img src="https://databricks.com/wp-content/uploads/2019/01/watermarking_concept.png" width="40%"/>

In [20]:
# Same example as above, just showing `group by` for device with watermarking to define boundaries for "too late" data.

display (
  eventsDF 
    .withWatermark("timestamp", "10 minutes") 
    .groupBy(
      "deviceId",
      window("timestamp", "10 minutes", "5 minutes")) 
    .count()
    .withColumn(
      "window_size", 
      (col("window.end").cast("Long") - col("window.start").cast("Long")) / 60
    )
    .orderBy("window.end", ascending=False)
)
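
The following plain-Python sketch models how a watermark separates late from "too late" data and bounds the state. It is a simplified model under stated assumptions, not Spark's implementation: the watermark is recomputed at the end of each trigger as the maximum event time seen minus the delay, a window is dropped once its end is at or before the watermark, and the exact boundary conditions are an assumption. Times are minutes since midnight, the events are made up, and `WatermarkedCounter` is a hypothetical class.

```python
LENGTH, SLIDE, DELAY = 10, 5, 10  # window length, slide, watermark delay (minutes)

def hhmm(hour, minute):
    """Minutes since midnight, used here as a simplified event time."""
    return hour * 60 + minute

def windows_for(minute):
    """Start minutes of all windows [start, start + LENGTH) containing `minute`."""
    latest = minute - minute % SLIDE
    return list(range(latest - LENGTH + SLIDE, latest + SLIDE, SLIDE))

class WatermarkedCounter:
    def __init__(self):
        self.counts = {}            # (deviceId, window start) -> count
        self.max_event_time = None  # largest event time seen so far
        self.watermark = None       # undefined until the first trigger ends
        self.dropped = []           # (event time, deviceId, window start) ignored

    def process(self, batch):
        for minute, device in batch:
            for start in windows_for(minute):
                # Ignore updates to windows that the watermark has passed.
                if self.watermark is not None and start + LENGTH <= self.watermark:
                    self.dropped.append((minute, device, start))
                    continue
                key = (device, start)
                self.counts[key] = self.counts.get(key, 0) + 1
            if self.max_event_time is None or minute > self.max_event_time:
                self.max_event_time = minute
        if self.max_event_time is None:
            return
        # Advance the watermark and evict state that can no longer change.
        self.watermark = self.max_event_time - DELAY
        for key in [k for k in self.counts if k[1] + LENGTH <= self.watermark]:
            del self.counts[key]

wc = WatermarkedCounter()
wc.process([(hhmm(12, 14), "dev1")])                         # watermark -> 12:04
wc.process([(hhmm(12, 21), "dev1"), (hhmm(12, 9), "dev2")])  # 12:09 is late but accepted
wc.process([(hhmm(12, 4), "dev2")])                          # 12:04 is too late
print(wc.watermark, len(wc.counts), wc.dropped)
```

The late 12:09 record is still counted because its windows end after the watermark in force when it arrives, while the 12:04 record arrives after the watermark has moved to 12:11 and is ignored. State for the 12:00-12:10 window is evicted once the watermark passes its end.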

_Structured Streaming’s windowing strategy handles key streaming aggregations: __windows over event-time and late and out-of-order data__. This windowing strategy allows the Structured Streaming engine to implement watermarking, in which data that arrives too late can be discarded. As a result of this design, we can manage the size of the state store._