# 08 Window functions

Window functions operate on a group of rows, referred to as a window, and *calculate a return value for each row based on the group of rows*. Window functions are useful for processing tasks such as calculating a moving average, computing a cumulative statistic, or accessing the value of rows given the relative position of the current row.
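
To make the "one return value per row" idea concrete, here is a minimal plain-Python sketch (no Spark required) that mimics a running total and a "previous value" lookup within each partition. The sample rows and the function name `with_window_columns` are hypothetical. In PySpark, the same result would typically be expressed with `Window.partitionBy("user").orderBy("day")` together with `sum("amount").over(w)` and `lag("amount").over(w)`.

```python
from collections import defaultdict

# Hypothetical sample rows: (user, day, amount)
rows = [
    ("a", 1, 10.0),
    ("a", 2, 20.0),
    ("b", 1, 5.0),
    ("a", 3, 30.0),
    ("b", 2, 15.0),
]

def with_window_columns(rows):
    """Return one output row per input row, adding a running total and the
    previous amount, both computed within each user's partition ordered by day."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[0]].append(row)

    result = []
    for user in sorted(partitions):
        running_total = 0.0
        previous = None  # no previous row at the start of a partition
        for _, day, amount in sorted(partitions[user], key=lambda r: r[1]):
            running_total += amount
            result.append((user, day, amount, running_total, previous))
            previous = amount
    return result

windowed = with_window_columns(rows)
for row in windowed:
    print(row)
```

Note that the output has exactly as many rows as the input: unlike a `GROUP BY`, a window function does not collapse the group.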


TODO:
* take some from https://www.databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html  ( from 2015 )
* https://medium.com/expedia-group-tech/deep-dive-into-apache-spark-window-functions-7b4e39ad3c86
* Add examples
 
    

<hr>

# Event time processing
[SDG] chapter 22. [TODO:Maybe move this part to sdg folder?]

Event time is the time embedded in the data itself. It is most often, *though not required
to be*, the time at which an event actually occurred. Using event time matters because it gives a
more robust way of comparing events against one another than the time at which the system
happened to receive them. The challenge is that event data can arrive late or out of order,
so the stream processing system must be able to handle both.


The [SDG] has a full chapter on this topic. Here we only touch on the windowing functionality.

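Unlike SQL window functions, which keep every input row and compute a value for it from a frame of related rows, an event-time window groups rows into buckets by time span and produces one aggregated row per bucket (and per additional grouping key).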

Note: It is also possible to build windows based on a number of rows (e.g. "take the average of the last 500 rows"). Look for "count-based windows".
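
As a rough illustration of the count-based idea, the following plain-Python sketch keeps the most recent `size` values and reports their mean after each new value. It is not a Spark API; the function name `rolling_mean` and the sample values are made up for the example.

```python
from collections import deque

def rolling_mean(values, size):
    """Yield the mean of the most recent `size` values after each new value."""
    window = deque(maxlen=size)  # the oldest value is dropped automatically
    total = 0.0
    for value in values:
        if len(window) == size:
            total -= window[0]  # this value is about to fall out of the window
        window.append(value)
        total += value
        yield total / len(window)

means = list(rolling_mean([1, 2, 3, 4, 5], size=3))
print(means)
```

The window here is defined by how many rows it holds, not by how much time it spans, which is the key difference from the event-time windows used in the rest of this section.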


# Windows on Event Time

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

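datapath = "../data/sdg"  # no trailing slash; the paths below add their own "/"
spark = SparkSession.builder.getOrCreate()

# Keep the number of shuffle partitions small for a local run
spark.conf.set("spark.sql.shuffle.partitions", 5)
#static = spark.read.json("/data/activity-data")

# Read one part file as a static DataFrame; its schema is reused for the stream
static = spark.read.json(datapath + "/activity-data/part-00000*.json")

# Stream the same directory, reading at most 10 files per trigger
streaming = spark\
.readStream\
.schema(static.schema)\
.option("maxFilesPerTrigger", 10)\
.json(datapath + "/activity-data")
#streaming.printSchema()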

The first step in event-time analysis is to convert the timestamp column into the proper Spark
SQL timestamp type. Our current column holds Unix time in nanoseconds (stored as a long),
so we divide it by 10^9 to get seconds before casting it to a timestamp:

In [None]:
withEventTime = streaming.selectExpr("*",
"cast(cast(Creation_Time as double)/1000000000 as timestamp) as event_time")

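The arithmetic behind that cast can be checked outside Spark. The sketch below converts a nanosecond Unix timestamp to a UTC `datetime` using only the standard library; the sample value `example_ns` and the helper name `nanos_to_datetime` are hypothetical, chosen only to show the division by 10^9.

```python
from datetime import datetime, timezone

NANOS_PER_SECOND = 1_000_000_000

def nanos_to_datetime(creation_time_ns):
    """Convert Unix time in nanoseconds to a timezone-aware UTC datetime.
    Precision below one microsecond is discarded."""
    seconds, nanos = divmod(creation_time_ns, NANOS_PER_SECOND)
    whole_seconds = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return whole_seconds.replace(microsecond=nanos // 1000)

example_ns = 1_424_686_735_175_000_000  # hypothetical Creation_Time value
event_time = nanos_to_datetime(example_ns)
print(event_time.isoformat())
```

Using `divmod` on the integer avoids the rounding that a floating-point division can introduce for values this large.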
## Tumbling Windows - non-overlapping intervals

In [None]:
# Count how many events in every 10 minute interval
withEventTime.groupBy(window(col("event_time"), "10 minutes")).count()\
.writeStream\
.queryName("events_per_window")\
.format("memory")\
.outputMode("complete")\
.start()

The output goes to the memory sink, which keeps the results in an in-memory table named after the query and is meant **for debugging only**, so we can use SQL to query it:

In [None]:
spark.sql("SELECT * FROM events_per_window").show(truncate=False)
# you can of course do 'SELECT count'   or 'SELECT window.start' etc.
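
To see how an event ends up in exactly one tumbling window, here is a small plain-Python sketch of the bucket arithmetic, assuming windows are aligned to the Unix epoch (Spark's default when no start offset is given). The function name `tumbling_window` and the sample event time are hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def tumbling_window(event_time, size):
    """Return the single [start, end) window of length `size` containing event_time."""
    offset = (event_time - EPOCH) % size  # how far the event is into its window
    start = event_time - offset
    return start, start + size

event = datetime(2015, 2, 23, 10, 18, 55, tzinfo=timezone.utc)  # hypothetical event
start, end = tumbling_window(event, timedelta(minutes=10))
print(start.isoformat(), "->", end.isoformat())
```

Because consecutive windows share no time span, every event contributes to exactly one count.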

Perform the aggregation on multiple columns (the event_time window and User):

In [None]:
withEventTime.groupBy(window(col("event_time"), "10 minutes"), "User").count()\
.writeStream\
.queryName("events_per_window_user")\
.format("memory")\
.outputMode("complete")\
.start()

In [None]:
spark.sql("SELECT * FROM events_per_window_user").show(truncate=False)

## Sliding Windows
Let's count events in 60-minute windows that slide forward every 8 minutes, so consecutive windows overlap.

In [None]:
from pyspark.sql.functions import window, col
withEventTime.groupBy(window(col("event_time"), "60 minutes", "8 minutes"))\
.count()\
.writeStream\
.queryName("events_per_window_60_8")\
.format("memory")\
.outputMode("complete")\
.start()

In [None]:
spark.sql("SELECT * FROM events_per_window_60_8 ORDER BY window.start ASC").show(33,truncate=False)
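
With sliding windows a single event is counted in several overlapping windows. The plain-Python sketch below lists every 60-minute window, starting on a multiple of 8 minutes since the Unix epoch, that contains a given event. The epoch alignment is an assumption matching Spark's default, and `sliding_windows` and the sample event time are hypothetical names and values.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def sliding_windows(event_time, size, slide):
    """Return every [start, end) window of length `size`, starting on a
    multiple of `slide` since the epoch, that contains event_time."""
    # Latest window start that is not after the event
    start = event_time - (event_time - EPOCH) % slide
    windows = []
    while start + size > event_time:  # window still covers the event
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)

event = datetime(2015, 2, 23, 10, 18, 55, tzinfo=timezone.utc)  # hypothetical event
windows = sliding_windows(event, timedelta(minutes=60), timedelta(minutes=8))
for start, end in windows:
    print(start.time(), "->", end.time())
```

This is why the counts in the sliding-window table add up to more than the number of input events.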

### Handling Late Data with Watermarks
The preceding examples are great, but they have a flaw. We never specified how late we expect
to see data. This means that Spark is going to need to store that intermediate data forever because
we never specified a watermark, or a time at which we don’t expect to see any more data. This
applies to all stateful processing that operates on event time. We must specify this watermark in
order to age-out data in the stream (and, therefore, state) so that we don’t overwhelm the system over a long period of time.


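A **watermark** is an amount of time following a given event or set of events after
which we do not expect to see any more data from that time. In Spark it is given as a delay
threshold: the engine tracks the maximum event time seen so far and treats data older than
that maximum minus the threshold as too late.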

We need to specify watermarks because if we did not, we’d need to keep all of our windows around forever, expecting them to be updated forever.
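
The following plain-Python sketch replays events in arrival order and applies a watermark computed as "maximum event time seen so far minus the allowed delay". It is a simplification under stated assumptions: Spark advances the watermark once per micro-batch rather than per event, and the function name `process_with_watermark` and the sample times are hypothetical.

```python
from datetime import datetime, timedelta

def process_with_watermark(event_times, delay):
    """Replay event times in arrival order; return (accepted, dropped)."""
    max_event_time = None
    accepted, dropped = [], []
    for t in event_times:
        # Watermark = latest event time seen so far minus the allowed delay
        watermark = None if max_event_time is None else max_event_time - delay
        if watermark is not None and t < watermark:
            dropped.append(t)  # older than the watermark: considered too late
        else:
            accepted.append(t)
            if max_event_time is None or t > max_event_time:
                max_event_time = t
    return accepted, dropped

arrivals = [
    datetime(2015, 2, 23, 10, 0),
    datetime(2015, 2, 23, 10, 40),
    datetime(2015, 2, 23, 10, 5),   # 35 minutes behind the maximum
    datetime(2015, 2, 23, 10, 15),  # 25 minutes behind the maximum
]
accepted, dropped = process_with_watermark(arrivals, timedelta(minutes=30))
print("dropped:", dropped)
```

Anything behind the watermark can never be updated again, which is what allows the engine to discard the corresponding state.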

In [None]:
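# Accept data up to 30 minutes late, then count per 10-minute window sliding every 5 minutes.
# Note: in "complete" output mode Spark must retain every aggregate, so the watermark
# cannot drop old state here; state is only aged out in "append" or "update" mode.
withEventTime\
.withWatermark("event_time", "30 minutes")\
.groupBy(window(col("event_time"), "10 minutes", "5 minutes"))\
.count()\
.writeStream\
.queryName("events_per_window_WM")\
.format("memory")\
.outputMode("complete")\
.start()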

In [None]:
spark.sql("SELECT * FROM events_per_window_WM ORDER BY window.start ASC").show(33,truncate=False)

### Dropping Duplicates in a Stream

In this example, we consider a row a duplicate if it has the same User and event_time as a row seen earlier.

In [None]:
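# Drop rows that repeat an already seen (User, event_time) pair, then count rows per user.
# The watermark bounds how long Spark has to remember pairs it has already seen.
query = withEventTime\
.withWatermark("event_time", "5 seconds")\
.dropDuplicates(["User", "event_time"])\
.groupBy("User")\
.count()\
.writeStream\
.queryName("pydeduplicated")\
.format("memory")\
.outputMode("complete")\
.start()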

In [None]:
spark.sql("SELECT * FROM pydeduplicated").show(truncate=False)
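
The role of the watermark in deduplication can be sketched in plain Python: the set of already seen keys only needs to hold entries that are still within the allowed delay. This is a simplified model, assuming event times are integer seconds and the watermark advances per event (Spark advances it per micro-batch); `deduplicate` and the sample events are hypothetical.

```python
def deduplicate(events, delay):
    """events: iterable of (user, event_time_seconds) in arrival order.
    Return (unique_events, keys_still_held_in_state)."""
    seen = set()
    max_event_time = float("-inf")
    unique = []
    for user, event_time in events:
        if event_time < max_event_time - delay:
            continue  # behind the watermark: dropped as too late
        key = (user, event_time)
        if key not in seen:
            seen.add(key)
            unique.append(key)
        max_event_time = max(max_event_time, event_time)
        # Evict keys behind the new watermark; duplicates of them would be dropped anyway
        seen = {k for k in seen if k[1] >= max_event_time - delay}
    return unique, seen

events = [("a", 100), ("a", 100), ("b", 101), ("a", 110), ("a", 100)]
unique, state = deduplicate(events, delay=5)
print(unique)
print(state)
```

Without the eviction step, `seen` would grow without bound on an endless stream.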

In [None]:
# what is the status of our queries?
query.status

In [None]:
query.recentProgress

<br><br>
The following topics also appear in the chapter, but there was not enough time to discuss them here.

## Arbitrary Stateful Processing
### Time-Outs
### Output Modes