# ***Event Time and Window Operation***

Sensors can be far from the data center (DC) where their data are processed. Suppose we receive streaming data from IoT devices: a reading from a device 10 km from the DC may be processed after a reading from a device 1 km away, even if the first reading was generated earlier. Let's see how to overcome this issue.

Input streaming records usually carry time information:
- It is the time when the data was generated
- It is usually called event-time

You want to use the time when the data was generated (i.e., the event-time) rather than the time at which Spark receives it.

Spark allows defining windows based on the event-time input column and then applying aggregation functions over each window.

For each structured streaming query on which you want to apply a window computation, you must specify:

- The name of the event-time column in the input (streaming) DataFrame

- The characteristics of the (sliding) windows
    - windowDuration
    - slideDuration
        - Do not set it if you want non-overlapped windows
    
The **window(timeColumn, windowDuration, slideDuration=None)** function is used inside the standard groupBy() method to specify the characteristics of the windows.

**N.B.**: Windows can be used only with queries that apply aggregation functions!
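
To make the roles of windowDuration and slideDuration concrete, the following is a minimal pure-Python sketch (no Spark involved) of how an event-time can be mapped to the windows that contain it. The helper name `windows_for` is hypothetical, and the sketch assumes whole-second timestamps and windows aligned to the Unix epoch; it illustrates the idea and is not Spark's actual implementation.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def windows_for(event_time, window_seconds, slide_seconds=None):
    """Return the sorted [start, end) windows that contain event_time.

    Hypothetical helper: assumes whole-second timestamps and window
    starts aligned to multiples of the slide since the Unix epoch.
    """
    # No slide means non-overlapped (tumbling) windows
    if slide_seconds is None:
        slide_seconds = window_seconds

    t = int((event_time - EPOCH).total_seconds())

    # Latest window start that is not after the event-time
    start = t - (t % slide_seconds)

    result = []
    # A window [start, start + duration) contains t while start > t - duration
    while start > t - window_seconds:
        result.append((EPOCH + timedelta(seconds=start),
                       EPOCH + timedelta(seconds=start + window_seconds)))
        start -= slide_seconds
    return sorted(result)


reading_time = datetime(2016, 3, 11, 9, 0, 5)

# Non-overlapped windows of 2 seconds: the reading falls in exactly one window
tumbling = windows_for(reading_time, 2)

# Sliding windows of 2 seconds that move every second: two windows overlap
sliding = windows_for(reading_time, 2, 1)

print(tumbling)
print(sliding)
```

With no slideDuration each reading belongs to exactly one window, while with a slide shorter than the window duration the same reading contributes to several overlapping windows.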


### **Example 1**

Input

- A stream of records retrieved from localhost:9999

- Each input record is a reading about the status of a station of a bike sharing system at a specific timestamp

- Each input reading has the format
                    [ stationId, #free_slots, #used_slots, timestamp]

- timestamp is the event-time column

Output

- For each stationId, print on the standard output the total number of received input readings with a number of free slots equal to 0 in each window

- The query is executed for each window

- Set windowDuration to 2 seconds and no slideDuration

- i.e., non-overlapped window


In [None]:
from pyspark.sql.types import *
from pyspark.sql.functions import split
from pyspark.sql.functions import window

# Create a "receiver" DataFrame that will connect to localhost:9999
recordsDF = spark.readStream\
.format("socket") \
.option("host", "localhost") \
.option("port", 9999) \
.load()

In [None]:
# The input records are characterized by one single column called value
# of type string
# Example of an input record: s1,0,3,2016-03-11 09:00:04
# Define four more columns by splitting the input column value
# New columns:
# - stationId
# - freeslots
# - usedslots
# - timestamp

readingsDF = recordsDF\
.withColumn("stationId", split(recordsDF.value, ',')[0].cast("string"))\
.withColumn("freeslots", split(recordsDF.value, ',')[1].cast("integer"))\
.withColumn("usedslots", split(recordsDF.value, ',')[2].cast("integer"))\
.withColumn("timestamp", split(recordsDF.value, ',')[3].cast("timestamp"))

In [None]:
# Filter data
# Use the standard filter transformation
fullReadingsDF = readingsDF.filter("freeslots=0")

# Count the number of readings with a number of free slots equal to 0
# for each stationId in each window.
# windowDuration = 2 seconds
# no overlapping windows
countsDF = fullReadingsDF\
.groupBy(window(fullReadingsDF.timestamp, "2 seconds"), "stationId")\
.agg({"*":"count"})\
.sort("window")

In [None]:
# The result of the structured streaming query will be stored/printed on
# the console "sink"
# complete output mode
# (append mode cannot be used for aggregation queries without a watermark)
queryCountWindowStreamWriter = countsDF \
.writeStream \
.outputMode("complete") \
.format("console")\
.option("truncate", "false")

# Start the execution of the query (it will be executed until it is explicitly stopped)
queryCountWindow = queryCountWindowStreamWriter.start()

## **Late Events**

Spark handles data that arrive later than expected based on their event-time. Such data are called late data.

Every time new data are processed, the result is computed by combining the old aggregate values with the new data, using the event-time column instead of the time at which Spark receives the data.

The code is the same as in “Event Time and Window Operations: Example 3”!

**Late data are automatically handled by Spark**
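
The following pure-Python sketch (no Spark involved) illustrates the idea of combining old aggregate values with newly arrived data by event-time. The names `window_start`, `update_counts`, and `state` are hypothetical, and the sketch assumes 2-second non-overlapped windows aligned to the Unix epoch and the record format `stationId,freeslots,usedslots,timestamp`. It is a simplified model of the behavior described above, not Spark's implementation.

```python
from collections import Counter
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
WINDOW_SECONDS = 2  # assumed window duration, no slide


def window_start(event_time):
    """Start of the non-overlapped window that contains event_time."""
    t = int((event_time - EPOCH).total_seconds())
    return EPOCH + timedelta(seconds=t - (t % WINDOW_SECONDS))


def update_counts(state, batch):
    """Merge a new batch of records into the per-window counts.

    Each record is assigned to a window by its event-time, so a record
    that arrives late still updates the (older) window it belongs to.
    """
    for line in batch:
        station_id, free_slots, used_slots, ts = line.split(",")
        if int(free_slots) != 0:
            continue  # keep only readings with no free slots
        event_time = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        state[window_start(event_time)] += 1
    return state


state = Counter()

# First batch: one full station at 09:00:04, one non-full reading
update_counts(state, ["s1,0,3,2016-03-11 09:00:04",
                      "s2,2,1,2016-03-11 09:00:05"])

# Second batch: a newer reading, which opens a later window
update_counts(state, ["s1,0,3,2016-03-11 09:00:08"])

# Third batch: a late reading generated at 09:00:05 arrives only now;
# it is added to the old 09:00:04-09:00:06 window, not to the newest one
update_counts(state, ["s3,0,5,2016-03-11 09:00:05"])

for start in sorted(state):
    print(start, state[start])
```

The late record changes the count of a window that had already been reported, which is why the whole result table is printed again in complete output mode.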

### **Example 2**

Input

- A stream of records retrieved from localhost:9999

- Each input record is a reading about the status of a station of a bike sharing system at a specific timestamp

- Each input reading has the format
                    [ stationId, #free_slots, #used_slots, timestamp]

- timestamp is the event-time column

Output

- For each window, print on the standard output the total number of received input readings with a number of free slots equal to 0

- The query is executed for each window

- Set windowDuration to 2 seconds and no slideDuration

- i.e., non-overlapped windows

In [None]:
...

# Filter data
# Use the standard filter transformation
fullReadingsDF = readingsDF.filter("freeslots=0")

# Count the number of readings with a number of free slots equal to 0
# in each window (no grouping by stationId).
# windowDuration = 2 seconds
# no overlapping windows
countsDF = fullReadingsDF\
.groupBy(window(fullReadingsDF.timestamp, "2 seconds"))\
.agg({"*":"count"})\
.sort("window")

...