# Importing Libraries


In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, col, window

# Start Spark Session


In [2]:
spark = SparkSession.builder.getOrCreate()

## Other Configurations


In [3]:
spark.conf.set("spark.sql.repl.eagerEval.enabled", True)

# Streaming Aggregations

Continuous applications often require near real-time decisions on real-time, aggregated statistics.

Some examples include:

- Aggregating errors in data from IoT devices by type
- Detecting anomalous behavior in a server's log file by aggregating by country
- Doing behavior analysis on instant messages via hashtags

However, in the case of streams, you generally don't want to run aggregations over the entire dataset.

> What problems might you encounter if you aggregate over a stream's entire dataset?

While streams have a definitive start, there conceptually is no end to the flow of data.

Because there is no "end" to a stream, the size of the dataset grows in perpetuity.

This means that your cluster will eventually run out of resources.

Instead of aggregating over the entire dataset, you can aggregate over data grouped by windows of time (say, every 5 minutes or every hour).

This is referred to as `windowing`.

## Windowing

If we were using a static DataFrame to produce an aggregate count, we could use `groupBy()` and `count()`.

Instead we accumulate counts within a sliding window, answering questions like "How many records are we getting every second?"

- **Sliding windows**: The windows overlap and a single event may be aggregated into multiple windows.

- **Tumbling windows**: The windows do not overlap and a single event will be aggregated into only one window.

The following illustration, from the <a href="https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html" target="_blank">Structured Streaming Programming Guide</a>, shows how sliding windows work:

<img src="http://spark.apache.org/docs/latest/img/structured-streaming-window.png">
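The difference between the two window types can be sketched in plain Python. This is a minimal illustration, not Spark's implementation: the helper `windows_for` is hypothetical, and it assumes windows are aligned to the Unix epoch and cover the half-open interval `[start, end)`.

```python
from datetime import datetime, timedelta


def windows_for(event_time, size, slide=None):
    """Return the [start, end) windows that contain event_time.

    Hypothetical helper: windows are assumed to be aligned to the Unix epoch.
    When no slide is given, the slide equals the size (a tumbling window).
    """
    slide = slide or size
    epoch = datetime(1970, 1, 1)
    # Start of the latest window that begins at or before the event.
    start = event_time - ((event_time - epoch) % slide)
    result = []
    # Walk backwards while the window still covers the event.
    while start + size > event_time:
        result.append((start, start + size))
        start -= slide
    return sorted(result)


event = datetime(2016, 7, 26, 2, 12)

# Tumbling: the event lands in exactly one 10-minute window.
print(windows_for(event, timedelta(minutes=10)))

# Sliding: 10-minute windows that start every 5 minutes overlap,
# so the same event is counted in two of them.
print(windows_for(event, timedelta(minutes=10), timedelta(minutes=5)))
```

With a 10-minute size and a 5-minute slide, every event belongs to two windows, which is why sliding windows produce more state than tumbling windows of the same size.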


## Event Time vs Receipt Time

- **Event Time** is the time at which the event occurred in the real world.

- **Event Time** is **NOT** something maintained by the Structured Streaming framework.

At best, Structured Streaming only knows about **Receipt Time** - the time a piece of data arrived in Spark.

### What are some examples of **Event Time** and of **Receipt Time**?

#### Examples of _Event Time_:

- The timestamp recorded in each record of a log file
- The instant at which an IoT device took a measurement
- The moment a REST API received a request

#### Examples of _Receipt Time_:

- A timestamp added to a DataFrame the moment it was processed by Spark
- The timestamp extracted from an hourly log file's file name
- The time at which an IoT hub received a report of a device's measurement
  - Presumably offset by some delay from when the measurement was taken

### What are some of the inherent problems with using **Receipt Time**?

The main problem with using **Receipt Time** is accuracy. For example:

- The time at which an IoT device takes a measurement and the time at which it is reported can differ by several minutes.
  - This could have significant ramifications for security and health devices, for example.
- The timestamp embedded in an hourly log file's name can be off by up to one hour, making correlations to other events extremely difficult.
- The timestamp added by Spark as part of a DataFrame transformation can be off by hours, weeks, or even months, depending on when the event occurred and when the job ran.

### When might it be OK to use **Receipt Time** instead of **Event Time**?

When accuracy is not a significant concern, that is, when **Receipt Time** is close enough to **Event Time**.

One example would be IoT events that can be delayed by minutes while the resolution of your query is days or months (close enough).
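The "close enough" argument can be checked with a small sketch. The timestamps and the 4-minute delay below are hypothetical values chosen for illustration, and `bucket` is a helper invented for this example.

```python
from datetime import datetime, timedelta


def bucket(ts, resolution):
    """Truncate a timestamp to the start of its hour or day (hypothetical helper)."""
    if resolution == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    if resolution == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported resolution: {resolution}")


# Hypothetical IoT measurement that reaches the hub 4 minutes late.
event_time = datetime(2016, 7, 26, 14, 58)
receipt_time = event_time + timedelta(minutes=4)

for resolution in ("hour", "day"):
    same = bucket(event_time, resolution) == bucket(receipt_time, resolution)
    print(f"{resolution}: same bucket = {same}")
```

At daily resolution both timestamps fall into the same bucket, so the delay does not change the result. At hourly resolution the record is counted in the wrong hour. A delay that crosses midnight would break the daily case too, so "close enough" only holds for most records, not all of them.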


# Windowed Streaming Example

Each line of each file in the `./datasets/time-series/` directory contains a JSON record with two fields: `time` and `action`.
New files are being written to this directory continuously (aka streaming).
Theoretically, there is no end to this process.


# Reading the FileStream


In [10]:
myschema = "time timestamp, action string"

inputDF = (
    spark.readStream.schema(myschema)
    .option("maxFilesPerTrigger", 10)
    .json("./datasets/time-series/")
)

countsDF = inputDF.groupBy(col("action"), window(col("time"), "1 hour")).count()

In [15]:
# streamingQuery = (
#     countsDF.writeStream.format("json")
#     .trigger(processingTime="1 seconds")
#     .option("checkpointLocation", "chkpt")
#     .outputMode("append")
#     .start("test")
# )
# streamingQuery.awaitTermination()


# query = countsDF.writeStream \
#     .format("console") \
#     .outputMode("update") \
#     .start()


streamingQuery = (
    countsDF.writeStream  # Get the DataStreamWriter from our streaming DataFrame
    .queryName("stream_1p")  # Name the query
    .trigger(processingTime="1 seconds")  # Configure a 1-second micro-batch
    .format("json")  # Specify the sink type: JSON files
    .option(
        "checkpointLocation", "./streaming/checkpointdir/"
    )  # Specify the location of checkpoint files & write-ahead logs
    .outputMode("append")  # Write only new data to the sink
    .start("./streaming/outputdir/")  # Start the job, writing to the specified directory
)

In [16]:
streamingQuery.awaitTermination(6)  # Block the current thread for up to 6 seconds while the stream runs
streamingQuery.stop()  # Stop the stream

In [17]:
spark.read.json("./streaming/outputdir/").show(truncate=False)

+------+-----+--------------------------------------------------------------+
|action|count|window                                                        |
+------+-----+--------------------------------------------------------------+
|Scroll|1215 |{2016-07-26T08:30:00.000+05:30, 2016-07-26T07:30:00.000+05:30}|
|Click |1216 |{2016-07-26T08:30:00.000+05:30, 2016-07-26T07:30:00.000+05:30}|
|Close |1224 |{2016-07-26T08:30:00.000+05:30, 2016-07-26T07:30:00.000+05:30}|
|Open  |1220 |{2016-07-26T08:30:00.000+05:30, 2016-07-26T07:30:00.000+05:30}|
+------+-----+--------------------------------------------------------------+



## Performance Considerations

If you run that query, as is, it will take a surprisingly long time to start generating data. What's the cause of the delay?

If you expand the **Spark Jobs** component, you'll see something like this:

<img src="https://files.training.databricks.com/images/structured-streaming-shuffle-partitions-200.png"/>

It's our `groupBy()`. `groupBy()` causes a _shuffle_, and, by default, Spark SQL shuffles to 200 partitions. In addition, we're doing a _stateful_ aggregation: one that requires Structured Streaming to maintain and aggregate data over time.

When doing a stateful aggregation, Structured Streaming must maintain an in-memory _state map_ for each window within each partition. For fault tolerance reasons, the state map has to be saved after a partition is processed, and it needs to be saved somewhere fault-tolerant. To meet those requirements, the Streaming API saves the maps to a distributed store. On some clusters, that will be HDFS. Azure Databricks uses the DBFS.

That means that every time it finishes processing a window, the Streaming API writes its internal map to disk. The write has some overhead, typically between 1 and 2 seconds.


## What's the cause of the delay?

- `groupBy()` causes a **shuffle**
- By default, this produces **200 partitions**
- Plus a **stateful aggregation** to be maintained **over time**

This results in:

- Maintenance of an **in-memory state map** for **each window** within **each partition**
- Writing of the state map to a fault-tolerant store
  - On some clusters, that will be HDFS
  - Azure Databricks uses the DBFS
- Around 1 to 2 seconds overhead


## Shuffle Partition Best Practices

One way to reduce this overhead is to reduce the number of partitions Spark shuffles to.
In most cases, you want a 1-to-1 mapping of partitions to cores for streaming applications.
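To see why 200 shuffle partitions are wasteful for this dataset, consider how few grouping keys there are. The sketch below is a rough illustration only: `used_partitions` is a hypothetical helper, `crc32` stands in for Spark's own hash function, and the key strings are made up to represent four actions within a single window.

```python
import zlib


def used_partitions(keys, num_partitions):
    """Return the set of partition ids that receive at least one key."""
    # crc32 is a deterministic stand-in; Spark uses a different hash function.
    return {zlib.crc32(k.encode("utf-8")) % num_partitions for k in keys}


# Hypothetical grouping keys for a single 1-hour window: one per action.
keys = [f"{action}|2016-07-26 07:30" for action in ("Open", "Close", "Click", "Scroll")]

for n in (200, 8):
    used = used_partitions(keys, n)
    print(f"{n} partitions: {len(used)} hold data, {n - len(used)} are empty")
```

However the keys hash, at most four partitions can hold data for one window, so with 200 partitions at least 196 are empty yet still have to be scheduled and have their state saved on every micro-batch.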

## Run query with proper setting for shuffle partitions

Rerun the query below and notice the performance improvement.
Once the data is loaded, render a line graph in which:

- **Keys** is set to `start`
- **Series groupings** is set to `action`
- **Values** is set to `count`


In [25]:
spark.conf.set("spark.sql.shuffle.partitions", spark.sparkContext.defaultParallelism)


streamingQuery = (
    countsDF.writeStream  # Get the DataStreamWriter from our streaming DataFrame
    .queryName("stream_3p")  # Name the query
    .trigger(processingTime="1 seconds")  # Configure a 1-second micro-batch
    .format("json")  # Specify the sink type: JSON files
    .option(
        "checkpointLocation", "./streaming/checkpointdir2/"
    )  # Specify the location of checkpoint files & write-ahead logs
    .outputMode("append")  # Write only new data to the sink
    .start("./streaming/outputdir2/")  # Start the job, writing to the specified directory
)

streamingQuery.awaitTermination(6)  # Block the current thread for up to 6 seconds while the stream runs
streamingQuery.stop()  # Stop the stream

In [26]:
spark.read.json("./streaming/outputdir2/").show(truncate=False)

+------+-----+--------------------------------------------------------------+
|action|count|window                                                        |
+------+-----+--------------------------------------------------------------+
|Scroll|1215 |{2016-07-26T08:30:00.000+05:30, 2016-07-26T07:30:00.000+05:30}|
|Click |1216 |{2016-07-26T08:30:00.000+05:30, 2016-07-26T07:30:00.000+05:30}|
|Close |1224 |{2016-07-26T08:30:00.000+05:30, 2016-07-26T07:30:00.000+05:30}|
|Open  |1220 |{2016-07-26T08:30:00.000+05:30, 2016-07-26T07:30:00.000+05:30}|
+------+-----+--------------------------------------------------------------+



# Stop all streams


In [27]:
for s in spark.streams.active:  # Iterate over all active streams
    s.stop()  # Stop the stream

# Problem with Generating Many Windows

We are generating one window for every 1-hour aggregate.
_Every window_ has to be separately persisted and maintained.
Over time, this aggregated data will build up in the driver.
The end result is a massive slowdown, if not an out-of-memory (OOM) error.

## How do we fix that problem?

One simple solution is to increase the size of our window (say, to 2 hours).
That way, we're generating fewer windows.
But if the job runs for a long time, we're still building up an unbounded set of windows.
Eventually, we could hit resource limits.
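A quick back-of-the-envelope calculation shows why a larger window only delays the problem. The function `state_entries` is a hypothetical helper, and the figure of four distinct actions is taken from the query output shown earlier.

```python
def state_entries(hours_running, window_hours, distinct_actions=4):
    """Estimate how many (action, window) aggregates exist after a given runtime."""
    windows = -(-hours_running // window_hours)  # ceiling division
    return windows * distinct_actions


for days in (1, 30, 365):
    hours = days * 24
    print(
        f"{days:>3} days: "
        f"1-hour windows -> {state_entries(hours, 1)}, "
        f"2-hour windows -> {state_entries(hours, 2)}"
    )
```

Doubling the window size halves the number of aggregates, but the total still grows linearly with the running time of the job.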

# Watermarking

A better solution to the problem is to define a cut-off: a point after which Structured Streaming commits the windowed data to the sink (or throws it away if the sink is console or memory, which is what `display()` mimics).
That's what _watermarking_ allows us to do.
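The idea behind the cut-off can be sketched in plain Python. This is a simplified, per-event model and not Spark's implementation (Spark updates the watermark between micro-batches). It assumes the watermark is the latest event time seen minus the allowed delay, and that a window is finalized once its end is at or before the watermark. The function `simulate` and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def simulate(events, window_size, delay):
    """Count events per tumbling window, finalizing windows behind the watermark."""
    epoch = datetime(1970, 1, 1)
    max_event_time = None
    open_windows = {}  # window start -> running count (state kept in memory)
    finalized = {}     # window start -> final count (state can be released)
    dropped = []       # events that arrived after their window was finalized

    for ts in events:
        start = ts - ((ts - epoch) % window_size)
        watermark = max_event_time - delay if max_event_time else None
        if watermark is not None and start + window_size <= watermark:
            dropped.append(ts)  # too late: the window is already closed
            continue
        open_windows[start] = open_windows.get(start, 0) + 1
        max_event_time = ts if max_event_time is None else max(max_event_time, ts)
        watermark = max_event_time - delay
        # Finalize every open window that now lies entirely behind the watermark.
        for s in sorted(open_windows):
            if s + window_size <= watermark:
                finalized[s] = open_windows.pop(s)
    return finalized, open_windows, dropped


day = datetime(2016, 7, 26)
events = [
    day.replace(hour=10, minute=5),
    day.replace(hour=10, minute=40),
    day.replace(hour=11, minute=10),
    day.replace(hour=13, minute=5),   # pushes the watermark to 11:05
    day.replace(hour=10, minute=50),  # late: the 10:00 window is already closed
    day.replace(hour=12, minute=20),  # late, but still within the 2-hour delay
]

finalized, still_open, dropped = simulate(events, timedelta(hours=1), timedelta(hours=2))
print("finalized:", finalized)
print("still open:", still_open)
print("dropped:", dropped)
```

Only windows that are still within the allowed delay remain in memory, so the amount of state stays bounded no matter how long the job runs.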


In [29]:
watermarkedDF = (
    inputDF.withWatermark("time", "2 hours")  # Specify a 2-hour watermark
    .groupBy(
        col("action"),  # Aggregate by action...
        window(col("time"), "1 hour"),  # ...then by a 1-hour window
    )
    .count()  # For each aggregate, produce a count
)


spark.conf.set("spark.sql.shuffle.partitions", spark.sparkContext.defaultParallelism)


streamingQuery = (
    watermarkedDF.writeStream  # Get the DataStreamWriter from the watermarked DataFrame
    .queryName("stream_4p")  # Name the query
    .trigger(processingTime="1 seconds")  # Configure a 1-second micro-batch
    .format("json")  # Specify the sink type: JSON files
    .option(
        "checkpointLocation", "./streaming/checkpointdir3/"
    )  # Specify the location of checkpoint files & write-ahead logs
    .outputMode("append")  # Write only new data to the sink
    .start("./streaming/outputdir3/")  # Start the job, writing to the specified directory
)

streamingQuery.awaitTermination(6)  # Block the current thread for up to 6 seconds while the stream runs
streamingQuery.stop()  # Stop the stream

spark.read.json("./streaming/outputdir3/").show(truncate=False)

+------+-----+--------------------------------------------------------------+
|action|count|window                                                        |
+------+-----+--------------------------------------------------------------+
|Scroll|1215 |{2016-07-26T08:30:00.000+05:30, 2016-07-26T07:30:00.000+05:30}|
|Click |1216 |{2016-07-26T08:30:00.000+05:30, 2016-07-26T07:30:00.000+05:30}|
|Close |1224 |{2016-07-26T08:30:00.000+05:30, 2016-07-26T07:30:00.000+05:30}|
|Open  |1220 |{2016-07-26T08:30:00.000+05:30, 2016-07-26T07:30:00.000+05:30}|
+------+-----+--------------------------------------------------------------+

