In [4]:
%run "./Includes/Classroom-Setup"

<h2>Streaming Aggregations</h2>

Continuous applications often require near real-time decisions on real-time, aggregated statistics.

Some examples include:
* Aggregating errors in data from IoT devices by type.
* Detecting anomalous behavior in a server's log file by aggregating by country.
* Analyzing behavior in instant messages via hashtags.

However, in the case of streams, you generally don't want to run aggregations over the entire dataset.

<h2>Windowing</h2>

If we were using a static DataFrame to produce an aggregate count, we could use `groupBy()` and `count()`. 

With a stream, we instead accumulate counts within a sliding window, answering questions like "How many records are we getting every second?"

The following illustration, from the <a href="https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html" target="_blank">Structured Streaming Programming Guide</a>, helps us understand how it works:

<img src="http://spark.apache.org/docs/latest/img/structured-streaming-window.png">
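To make the idea concrete, here is a minimal pure-Python sketch (no Spark involved) of how records can be assigned to fixed, non-overlapping 1-hour windows and counted per window and action. The helper names `window_start` and `count_by_window` and the sample records are hypothetical; this only illustrates the bucketing idea and is not how Structured Streaming is implemented.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def window_start(event_time, size):
    """Floor event_time to the start of the fixed-size window that contains it."""
    elapsed = event_time - EPOCH
    return EPOCH + (elapsed // size) * size

def count_by_window(records, size):
    """Count records per (window start, action) pair."""
    counts = Counter()
    for event_time, action in records:
        counts[(window_start(event_time, size), action)] += 1
    return counts

# Hypothetical sample records, not taken from the course dataset
records = [
    (datetime(2016, 7, 26, 2, 5, tzinfo=timezone.utc), "Open"),
    (datetime(2016, 7, 26, 2, 50, tzinfo=timezone.utc), "Open"),
    (datetime(2016, 7, 26, 3, 10, tzinfo=timezone.utc), "Close"),
]
counts = count_by_window(records, timedelta(hours=1))
```

Both `Open` records fall into the window starting at 02:00, and the `Close` record falls into the window starting at 03:00.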

<h2>Event Time vs Receipt Time</h2>

**Event Time** is the time at which the event occurred in the real world.

**Event Time** is **NOT** something maintained by the Structured Streaming framework. 

At best, Structured Streaming only knows about **Receipt Time** - the time a piece of data arrived in Spark.
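The difference matters as soon as data arrives late. The following pure-Python sketch uses hypothetical `(event_hour, receipt_hour)` pairs to show that the same records produce different hourly counts depending on which timestamp is used for grouping.

```python
from collections import Counter

# Hypothetical records: (event_hour, receipt_hour).
# The event hour is carried inside the data itself; the receipt hour is
# stamped by the receiving system when the record arrives.
arrivals = [
    (2, 2),
    (2, 3),  # occurred in hour 2 but arrived late, in hour 3
    (3, 3),
]

by_event_time = Counter(event for event, _ in arrivals)
by_receipt_time = Counter(receipt for _, receipt in arrivals)
```

Grouping by event time attributes the late record to hour 2, where it actually happened; grouping by receipt time attributes it to hour 3.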

<h2>Windowed Streaming Example</h2>

For this example, we will examine the files in `/mnt/training/sensor-data/accelerometer/time-series-stream.json/`.

Each line in the file contains a JSON record with two fields: `time` and `action`.

New files are being written to this directory continuously (aka streaming).

Theoretically, there is no end to this process.

Let's start by looking at the head of one such file:

In [19]:
%fs head dbfs:/mnt/training/sensor-data/accelerometer/time-series-stream.json/file-0.json

Let's try to analyze these files interactively. 

First configure a schema.

The schema must be specified for file-based Structured Streams. 
Because of the simplicity of the schema, we can use the simpler, DDL-formatted, string representation of the schema.

In [21]:
inputPath = "dbfs:/mnt/training/sensor-data/accelerometer/time-series-stream.json/"

jsonSchema = "time timestamp, action string"

With the schema defined, we can create the initial DataFrame `inputDF` and then `countsDF`, which represents our aggregation:

In [23]:
from pyspark.sql.functions import window, col

inputDF = (spark
  .readStream                                 # Returns an instance of DataStreamReader
  .schema(jsonSchema)                         # Set the schema of the JSON data
  .option("maxFilesPerTrigger", 1)            # Treat a sequence of files as a stream, one file at a time
  .json(inputPath)                            # Specifies the format, path and returns a DataFrame
)

countsDF = (inputDF
  .groupBy(col("action"),                     # Aggregate by action...
           window(col("time"), "1 hour"))     # ...then by a 1 hour window
  .count()                                    # For the aggregate, produce a count
  .select(col("window.start").alias("start"), # Elevate field to column
          col("count"),                       # Include count
          col("action"))                      # Include action
  .orderBy(col("start"))                      # Sort by the start time
)

To view the results of our query, pass the DataFrame `countsDF` to the `display()` function.

As we did in the previous lesson, we are going to specify the stream's name so that we can have better control over it.

In [25]:
myStreamName = "lesson03_ps"
display(countsDF,  streamName = myStreamName)

start,count,action
2016-07-26T02:00:00.000+0000,179,Open
2016-07-26T02:00:00.000+0000,11,Close
2016-07-26T03:00:00.000+0000,344,Close
2016-07-26T03:00:00.000+0000,1001,Open
2016-07-26T04:00:00.000+0000,815,Close
2016-07-26T04:00:00.000+0000,999,Open
2016-07-26T05:00:00.000+0000,328,Open
2016-07-26T05:00:00.000+0000,323,Close


### Performance Considerations

If you run that query, as is, it will take a surprisingly long time to start generating data. What's the cause of the delay? 

If you expand the **Spark Jobs** component, you can see where the time goes.

It's our `groupBy()`. `groupBy()` causes a _shuffle_, and, by default, Spark SQL shuffles to 200 partitions. In addition, we're doing a _stateful_ aggregation: one that requires Structured Streaming to maintain and aggregate data over time.

When doing a stateful aggregation, Structured Streaming must maintain an in-memory _state map_ for each window within each partition. For fault tolerance reasons, the state map has to be saved after a partition is processed, and it needs to be saved somewhere fault-tolerant. To meet those requirements, the Streaming API saves the maps to a distributed store. On some clusters, that will be HDFS. Databricks uses the DBFS.

That means that every time it finishes processing a window, the Streaming API writes its internal map to disk. The write has some overhead, typically between 1 and 2 seconds.

In [27]:
untilStreamIsReady(myStreamName)

Before proceeding, we need to stop any running streams.

In [29]:
# for s in spark.streams.active: # Iterate over all active streams
#   s.stop()                     # Stop the stream

# As mentioned in lesson #2, we have provided additional methods for working with streams, and in  
# this case, for dealing with the rare exceptions that may arise as a result of terminating a stream.
# Listed above is the logical equivalent to this operation.
stopAllStreams()

One way to reduce this overhead is to reduce the number of partitions Spark shuffles to.

In most cases, you want a 1-to-1 mapping of partitions to cores for streaming applications.
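As a rough back-of-the-envelope illustration, the sketch below estimates how many scheduling "waves" of shuffle tasks a single trigger needs. The function `task_waves` and the 8-core cluster size are hypothetical, and the model is deliberately simplified: it assumes one task per partition, one task per core at a time, and ignores skew and scheduling overhead.

```python
import math

def task_waves(shuffle_partitions, cores):
    """Number of rounds needed to run one task per partition on the given cores."""
    return math.ceil(shuffle_partitions / cores)

cores = 8  # hypothetical cluster size

default_waves = task_waves(200, cores)   # default of 200 shuffle partitions
tuned_waves = task_waves(cores, cores)   # one partition per core
```

Under these assumptions, the default setting needs 25 waves per trigger on 8 cores, while a 1-to-1 mapping needs only one. Each partition also has its own state to save, so fewer partitions means fewer state writes.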

Rerun the query below and notice the performance improvement.

Once the data is loaded, render a line graph with the following settings:
* **Keys** set to `start`
* **Series groupings** set to `action`
* **Values** set to `count`

In [32]:
spark.conf.set("spark.sql.shuffle.partitions", sc.defaultParallelism)

display(countsDF,  streamName = myStreamName)

start,count,action
2016-07-26T02:00:00.000+0000,179,Open
2016-07-26T02:00:00.000+0000,11,Close
2016-07-26T03:00:00.000+0000,344,Close
2016-07-26T03:00:00.000+0000,1001,Open
2016-07-26T04:00:00.000+0000,999,Open
2016-07-26T04:00:00.000+0000,815,Close
2016-07-26T05:00:00.000+0000,323,Close
2016-07-26T05:00:00.000+0000,328,Open


Wait until stream is done initializing...

In [34]:
untilStreamIsReady(myStreamName)

When you are done, stop all the streaming jobs.

In [36]:
stopAllStreams()

<h2>Problem with Generating Many Windows</h2>

We are generating a window for every 1 hour aggregate. 

_Every window_ has to be separately persisted and maintained.

Over time, this aggregated data will build up in the driver.

The end result is a massive slowdown, if not an out-of-memory (OOM) error.

<h2>Watermarking</h2>

A better solution to the problem is to define a cut-off: a point after which Structured Streaming is allowed to throw saved windows away.

That's what _watermarking_ allows us to do.

### Refining our previous example

Below is our previous example with watermarking. 

We're telling Structured Streaming to keep no more than 2 hours of aggregated data.

In [41]:
watermarkedDF = (inputDF
  .withWatermark("time", "2 hours")             # Specify a 2-hour watermark
  .groupBy(col("action"),                       # Aggregate by action...
           window(col("time"), "1 hour"))       # ...then by a 1 hour window
  .count()                                      # For each aggregate, produce a count
  .select(col("window.start").alias("start"),   # Elevate field to column
          col("count"),                         # Include count
          col("action"))                        # Include action
  .orderBy(col("start"))                        # Sort by the start time
)
display(watermarkedDF, streamName = myStreamName) # Start the stream and display it

start,count,action
2016-07-26T02:00:00.000+0000,179,Open
2016-07-26T02:00:00.000+0000,11,Close
2016-07-26T03:00:00.000+0000,1001,Open
2016-07-26T03:00:00.000+0000,344,Close
2016-07-26T04:00:00.000+0000,815,Close
2016-07-26T04:00:00.000+0000,999,Open
2016-07-26T05:00:00.000+0000,328,Open
2016-07-26T05:00:00.000+0000,323,Close


In the example above:
* Data that arrives more than 2 hours behind the latest event time processed may be dropped.
* Data that arrives within 2 hours of the latest event time processed will never be dropped.

More specifically, any data less than 2 hours behind the latest data processed till then is guaranteed to be aggregated.

However, the guarantee is strict only in one direction. 

Data delayed by more than 2 hours is not guaranteed to be dropped; it may or may not get aggregated. 

The more delayed the data is, the less likely the engine is to process it.
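The watermark rule can be sketched in plain Python. The function `run_with_watermark` below is hypothetical and uses integer hours as event times. It is a deliberately strict model: it advances the watermark after every record, whereas the engine advances it between micro-batches, which is why late data is not guaranteed to be dropped in a real stream.

```python
from collections import Counter

def run_with_watermark(event_hours, delay_hours):
    """Count events per hour, dropping events that fall behind the watermark."""
    counts = Counter()
    dropped = []
    max_seen = None  # latest event time observed so far
    for hour in event_hours:
        # The watermark trails the latest event time seen by the allowed delay
        if max_seen is not None and hour < max_seen - delay_hours:
            dropped.append(hour)  # too late: behind the watermark
            continue
        counts[hour] += 1
        max_seen = hour if max_seen is None else max(max_seen, hour)
    return counts, dropped

# Hypothetical arrival order of event hours; some arrive out of order
counts, dropped = run_with_watermark([2, 3, 2, 5, 4, 2], delay_hours=2)
```

The second `2` arrives while the latest event hour is 3, so it is still within the 2-hour allowance and is counted. The last `2` arrives after hour 5 has been seen, which puts it behind the watermark (5 - 2 = 3), so this strict model drops it.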

Wait until stream is done initializing...

In [44]:
untilStreamIsReady(myStreamName)

Stop all the streams.

In [46]:
stopAllStreams()

In [48]:
%run "./Includes/Classroom-Cleanup"