# ***Watermarking***

Watermarking is a Spark Structured Streaming feature that lets the user specify a threshold for late data, so that the engine can clean up old state accordingly.

In many real streaming applications, results related to old event times are no longer needed, so they can be dropped to improve the efficiency of the application.

Specifically, to run windowed queries for days, it is necessary for the system to bound the amount of intermediate in-memory state it accumulates.

This means the system needs to know when an old aggregate can be dropped from the in-memory state because the application is not going to receive late data for that aggregate any more. Watermarking was introduced in Spark 2.1 to enable this.

You can define the watermark of a query by specifying the event time column and the threshold on how late the data is expected to be in terms of event time.
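
The effect of these two parameters can be illustrated with a small plain-Python simulation. It is a sketch of the idea, not Spark code. It assumes event times expressed in minutes, 10-minute tumbling windows, a 15-minute lateness threshold, and a watermark that advances after every record (Spark recomputes it once per micro-batch). The names `run_with_watermark`, `WINDOW`, and `THRESHOLD` are hypothetical.

```python
from collections import defaultdict

WINDOW = 10      # tumbling window length in minutes (assumed value)
THRESHOLD = 15   # how late data may be, in minutes (assumed value)


def run_with_watermark(event_times):
    """Count records per window while dropping state behind the watermark."""
    state = defaultdict(int)   # window start -> running count (in-memory state)
    finalized = {}             # windows whose state has been cleaned up
    dropped_late = []          # records that arrived behind the watermark
    max_seen = float("-inf")
    watermark = float("-inf")

    for t in event_times:
        start = (t // WINDOW) * WINDOW
        if start + WINDOW <= watermark:
            # The window of this record is already finalized: ignore the record.
            dropped_late.append(t)
        else:
            state[start] += 1

        # Watermark = largest event time seen so far minus the threshold.
        max_seen = max(max_seen, t)
        watermark = max_seen - THRESHOLD

        # Drop every window that ends at or before the watermark.
        for old in [s for s in state if s + WINDOW <= watermark]:
            finalized[old] = state.pop(old)

    return dict(state), finalized, dropped_late


state, finalized, dropped_late = run_with_watermark([1, 12, 5, 27, 8, 41, 3])
print(state, finalized, dropped_late)
```

The record with event time 5 is late but still within the threshold, so it updates its window. The records with event times 8 and 3 arrive after their window has passed the watermark and are ignored, while only the two most recent windows remain in state.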


# ***Join Operations***

Spark Structured Streaming also supports join operations:

- Between two streaming DataFrames (usually to combine the batches associated with the same time slot)

- Between a streaming DataFrame and a static DataFrame

The result of the streaming join is generated incrementally.

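To match future input against past input, the engine has to buffer past rows as state, and this state would grow without bound unless old data is discarded. For the engine to drop old data, you must define watermark thresholds on both input streams so that it knows how delayed the input can be.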
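
A minimal plain-Python sketch of this idea follows. It is not how Spark implements joins: it assumes a single lateness threshold shared by both inputs, a watermark that advances after every row, and a join condition of the form "same key, and the right row occurs at most `max_gap` time units after the left row". The function name `incremental_join` and its parameters are hypothetical.

```python
def incremental_join(events, max_gap, lateness):
    """Join two interleaved streams incrementally with bounded state.

    events: iterable of (side, key, event_time), where side is "L" or "R".
    A left row matches a right row when the keys are equal and
    0 <= right_time - left_time <= max_gap.
    """
    left, right = [], []        # buffered rows, each stored as (key, event_time)
    matches = []                # join results, emitted as rows arrive
    max_seen = float("-inf")
    watermark = float("-inf")

    for side, key, t in events:
        if t < watermark:
            continue            # later than the threshold allows: drop the row

        if side == "L":
            matches += [(key, t, tr) for k, tr in right
                        if k == key and 0 <= tr - t <= max_gap]
            left.append((key, t))
        else:
            matches += [(key, tl, t) for k, tl in left
                        if k == key and 0 <= t - tl <= max_gap]
            right.append((key, t))

        max_seen = max(max_seen, t)
        watermark = max_seen - lateness

        # Future rows have event_time >= watermark, so a buffered row that can
        # no longer satisfy the time condition is safe to discard.
        left = [(k, tl) for k, tl in left if tl + max_gap >= watermark]
        right = [(k, tr) for k, tr in right if tr >= watermark]

    return matches, left, right


events = [("L", "a", 0), ("R", "a", 30), ("L", "b", 100),
          ("R", "b", 110), ("R", "a", 50), ("L", "c", 400)]
matches, left, right = incremental_join(events, max_gap=60, lateness=120)
print(matches, left, right)
```

Without the two eviction lines, both buffers would keep every row ever received. With them, only the row that can still find a match stays in state at the end of the run.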

The methods **join()** and **withWatermark()** are used to join streaming DataFrames. The join method is similar to the one available for static DataFrames.

In [None]:
from pyspark.sql.functions import expr
impressions = spark.readStream. ...
clicks = spark.readStream. ...


# Apply watermarks on event-time columns:
# impressions may be up to 2 hours late, clicks up to 3 hours late
impressionsWithWatermark = impressions.withWatermark("impressionTime", "2 hours")
clicksWithWatermark = clicks.withWatermark("clickTime", "3 hours")

# Join with event-time constraints:
# a click matches an impression of the same ad made at most 1 hour earlier
impressionsWithWatermark.join(
    clicksWithWatermark,
    expr("""
        clickAdId = impressionAdId AND
        clickTime >= impressionTime AND
        clickTime <= impressionTime + interval 1 hour
    """)
)
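
The watermark thresholds and the time-range condition together determine how long each side has to be kept in state. The following back-of-the-envelope calculation in plain Python illustrates the reasoning for the values used above. It is a simplified sketch, not Spark's internal algorithm, and the helper `join_state_retention` is hypothetical.

```python
from datetime import timedelta


def join_state_retention(left_lateness, right_lateness, min_gap, max_gap):
    """Estimate how long each side must be buffered, in event time.

    Assumed join condition:
        left_time + min_gap <= right_time <= left_time + max_gap
    """
    # A left row can still be matched by a right row that occurs up to
    # max_gap later and then arrives up to right_lateness late.
    keep_left = right_lateness + max_gap
    # A right row can still be matched by a left row that occurred at least
    # min_gap earlier and arrives up to left_lateness late.
    keep_right = max(left_lateness - min_gap, timedelta(0))
    return keep_left, keep_right


keep_impressions, keep_clicks = join_state_retention(
    left_lateness=timedelta(hours=2),    # watermark on impressionTime
    right_lateness=timedelta(hours=3),   # watermark on clickTime
    min_gap=timedelta(0),                # clickTime >= impressionTime
    max_gap=timedelta(hours=1),          # clickTime <= impressionTime + 1 hour
)
print(keep_impressions, keep_clicks)
```

Under these assumptions, impressions are kept for 4 hours of event time (a click that is 3 hours late may refer to an impression made up to 1 hour before it), and clicks are kept for 2 hours (an impression may arrive up to 2 hours late).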