[SPARK-18124] Observed delay based Event Time Watermarks #15702
This PR adds a new method
An example that emits windowed counts of records, waiting up to 5 minutes for late data to arrive.
df.withWatermark("eventTime", "5 minutes")
  .groupBy(window($"eventTime", "1 minute") as 'window)
  .count()
  .writeStream
  .format("console")
  .outputMode("append") // In append mode, we only output finalized aggregations.
  .start()
Calculating the watermark.
The current event time is computed by looking at the
Note that since we must coordinate this value across partitions occasionally, the actual watermark used is only guaranteed to be at least
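To make the idea of a delay-based watermark concrete, here is a minimal, Spark-free sketch in plain Scala. It assumes the watermark trails the largest event time observed so far by the configured delay and never moves backwards; that reading is inferred from the PR title, not from the details elided above. It also ignores coordination across partitions. The names `WatermarkTracker`, `advance`, `isLate`, and `WatermarkDemo` are hypothetical and are not part of Spark.

```scala
// Hypothetical, Spark-free model of an observed-delay watermark.
// Nothing here is Spark API; all names are illustrative only.
final case class WatermarkTracker(delayMs: Long, watermarkMs: Long = 0L) {

  // Fold one batch of event times (epoch millis) into the tracker.
  // Assumption: the watermark trails the largest event time seen by
  // `delayMs` and never moves backwards.
  def advance(batchEventTimesMs: Seq[Long]): WatermarkTracker =
    if (batchEventTimesMs.isEmpty) this
    else {
      val candidate = batchEventTimesMs.max - delayMs
      copy(watermarkMs = math.max(watermarkMs, candidate))
    }

  // Assumption: a record counts as too late once its event time
  // falls strictly below the current watermark.
  def isLate(eventTimeMs: Long): Boolean = eventTimeMs < watermarkMs
}

object WatermarkDemo {
  val fiveMinutesMs: Long = 5 * 60 * 1000L

  // Helper: minutes expressed as milliseconds.
  def minutes(m: Long): Long = m * 60 * 1000L

  // Batch 1 peaks at minute 600, so the watermark becomes minute 595.
  val afterBatch1: WatermarkTracker =
    WatermarkTracker(fiveMinutesMs).advance(Seq(minutes(598), minutes(600)))

  // Batch 2 contains only older data; the watermark must not regress.
  val afterBatch2: WatermarkTracker =
    afterBatch1.advance(Seq(minutes(596)))
}
```

In this model a record stamped at minute 594 would be treated as late after the first batch, while one stamped at minute 596 would still be accepted. In a real distributed job the effective watermark may lag further behind, as the note above points out.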
This mechanism was chosen for the initial implementation over processing time for two reasons:
Other notable implementation details
Remaining in this PR
There are some natural additional features that we should consider for future work:
A very dumb question (I apologize): there is nothing stopping a user from actually using processing time as the watermark with this API either. One can easily do
My biggest confusion here, which I couldn't find documented, is the type of the watermark column. Does it need to be a timestamp type, or can it be LongType?
Not a dumb question! You can certainly use processing time if those are the semantics you require. I do think there is a little bit of work we need to do to ensure determinism for these functions. Specifically,
Good point on the documentation. The thing you are missing is that it must be used in a window function, which does require