
[SPARK-18124] Observed delay based Event Time Watermarks #15702

Closed
wants to merge 12 commits

Conversation

@marmbrus
Contributor

commented Oct 31, 2016

This PR adds a new method withWatermark to the Dataset API, which can be used to specify an event time watermark. An event time watermark allows the streaming engine to reason about the point in time after which we no longer expect to see late data. This PR also augments StreamExecution to use this watermark for several purposes:

  • To know when a given time window aggregation is finalized and thus results can be emitted when using output modes that do not allow updates (e.g. Append mode).
  • To minimize the amount of state that we need to keep for on-going aggregations, by evicting state for groups that are no longer expected to change. We do, however, still maintain all state if the query requires it (i.e. if the event time is not present in the groupBy or when running in Complete mode).

An example that emits windowed counts of records, waiting up to 5 minutes for late data to arrive.

df.withWatermark("eventTime", "5 minutes")
  .groupBy(window($"eventTime", "1 minute") as 'window)
  .count()
  .writeStream
  .format("console")
  .mode("append") // In append mode, we only output finalized aggregations.
  .start()

Calculating the watermark.

The watermark is computed by taking the MAX(eventTime) seen this epoch across all of the partitions in the query and subtracting a user-defined delayThreshold. An additional constraint is that the watermark must increase monotonically.

Note that since we must coordinate this value across partitions occasionally, the actual watermark used is only guaranteed to be at least delayThreshold behind the maximum observed event time. In some cases we may still process records that arrive more than delayThreshold late.
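As a rough illustration only (not the code in this PR), the per-trigger update amounts to something like the following; the names maxObservedEventTimeMs and delayThresholdMs are illustrative:

```scala
// Illustrative sketch: watermark = max observed event time minus the
// user-defined delay, and it never moves backwards.
class WatermarkTracker(delayThresholdMs: Long) {
  private var currentWatermarkMs: Long = 0L

  def update(maxObservedEventTimeMs: Long): Long = {
    val candidate = maxObservedEventTimeMs - delayThresholdMs
    if (candidate > currentWatermarkMs) {
      currentWatermarkMs = candidate // enforce monotonic increase
    }
    currentWatermarkMs
  }
}
```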

This mechanism was chosen for the initial implementation over processing time for two reasons:

  • it is robust to downtime that could affect processing delay
  • it does not require syncing of time or timezones between the producer and the processing engine.

Other notable implementation details

  • A new trigger metric eventTimeWatermark outputs the current value of the watermark.
  • We mark the event time column in the Attribute metadata using the key spark.watermarkDelay. This allows downstream operations to know which column holds the event time. Operations like window propagate this metadata (a small lookup sketch follows this list).
  • explain() marks the watermark with a suffix of -T${delayMs} to ease debugging of how this information is propagated.
  • Currently, we don't filter out late records, but instead rely on the state store to avoid emitting records that are both added and filtered in the same epoch.
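A hedged sketch of how the spark.watermarkDelay tag mentioned above could be looked up from a DataFrame's schema. This helper is illustrative, not part of this PR, and the key is later updated to include an Ms suffix per the review below:

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical inspection of which column carries the watermark delay;
// "spark.watermarkDelay" is the key named in the description above.
def findEventTimeColumn(df: DataFrame): Option[String] = {
  val delayKey = "spark.watermarkDelay"
  df.schema.fields
    .find(f => f.metadata.contains(delayKey))
    .map(f => s"${f.name} (delay = ${f.metadata.getLong(delayKey)} ms)")
}
```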

Remaining in this PR

  • The test for recovery is currently failing as we don't record the watermark used in the offset log. We will need to do so to ensure determinism, but this is deferred until #15626 is merged.

Other follow-ups

There are some natural additional features that we should consider for future work:

  • Ability to write records that arrive too late to some external store in case any out-of-band remediation is required.
  • Update mode so you can get partial results before a group is evicted.
  • Other mechanisms for calculating the watermark. In particular, a watermark based on quantiles would be more robust to outliers (a rough sketch follows).
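A rough sketch of the quantile idea above (purely illustrative; no such mechanism exists in this PR):

```scala
// Illustrative only: derive the watermark from a high quantile of observed
// event times instead of the max, so one far-future outlier has less effect.
def quantileWatermarkMs(eventTimesMs: Seq[Long],
                        quantile: Double,
                        delayThresholdMs: Long): Option[Long] = {
  if (eventTimesMs.isEmpty) None
  else {
    val sorted = eventTimesMs.sorted
    val idx = math.min(sorted.length - 1,
      math.max(0, math.ceil(quantile * sorted.length).toInt - 1))
    Some(sorted(idx) - delayThresholdMs)
  }
}
```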
child: SparkPlan) extends SparkPlan {

// TODO: Use Spark SQL Metrics?
val maxEventTime = new MaxLong

@marmbrus

marmbrus Oct 31, 2016

Author Contributor

@zsxwing am I doing this right?

@@ -252,6 +252,10 @@ public static long parseSecondNano(String secondNano) throws IllegalArgumentExce
public final int months;
public final long microseconds;

public final long milliseconds() {
return this.microseconds / MICROS_PER_MILLI;

@rxin

rxin Nov 1, 2016

Contributor

2 space indent

@marmbrus marmbrus changed the title [SPARK-18124] Observed-delay based Even Time Watermarks [SPARK-18124] Observed-delay based Event Time Watermarks Nov 1, 2016

@@ -536,6 +535,37 @@ class Dataset[T] private[sql](
}

/**
* Defines an event time watermark for this [[Dataset]]. This watermark tracks a point in time

@rxin

rxin Nov 1, 2016

Contributor

need a tag here for experimental

*/
@Experimental
@InterfaceStability.Evolving
def withWatermark(eventTime: String, delay: String): Dataset[T] = withTypedPlan {

@rxin

rxin Nov 1, 2016

Contributor

you'd need one that takes in a column wouldn't you?

@SparkQA


commented Nov 1, 2016

Test build #67839 has finished for PR 15702 at commit 5b92132.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@marmbrus

Contributor Author

commented Nov 1, 2016

@ericl - flaky test... Should we turn it off for now?

retest this please

@ericl

Contributor

commented Nov 1, 2016

I'm still trying to find a failure that includes https://github.com/apache/spark/pull/15701/files. Until then it's hard to debug.

Another option might be turning off or adding a retry around this particular test, I'll make another PR for that.

@marmbrus marmbrus changed the title [SPARK-18124] Observed-delay based Event Time Watermarks [SPARK-18124] Observed delay based Event Time Watermarks Nov 1, 2016

@SparkQA


commented Nov 1, 2016

Test build #67866 has finished for PR 15702 at commit 14a728e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
override def toString: String = s"$name#${exprId.id}$typeSuffix"
/** Used to signal the column used to calculate an eventTime watermark (e.g. a#1-T{delayMs}) */
private def delaySuffix = if (metadata.contains(EventTimeWatermark.delayKey)) {
s"-T${metadata.getLong(EventTimeWatermark.delayKey)}"

@brkyvz

brkyvz Nov 1, 2016

Contributor

is this in milliseconds or microseconds like timestamp type?

if (a semanticEquals eventTime) {
val updatedMetadata = new MetadataBuilder()
.withMetadata(a.metadata)
.putLong(EventTimeWatermark.delayKey, delay.milliseconds)

@brkyvz

brkyvz Nov 1, 2016

Contributor

I'm a bit confused. Normally Spark SQL uses microsecond precision for TimestampType. When it converts it to LongType, it uses second precision. Here we're using milliseconds. Wouldn't that be super confusing to reason about?

@marmbrus

marmbrus Nov 1, 2016

Author Contributor

I switched it to using CalendarInterval to make it clearer what units were being used where. I chose milliseconds because it seemed like the right granularity. Microseconds are too short for the global coordination required and seconds lack granularity. It should be easy to change, and I'm open to that if there's consensus this is too confusing though.

@marmbrus

marmbrus Nov 2, 2016

Author Contributor

Updating the key to include Ms

@brkyvz

Contributor

commented Nov 1, 2016

A very dumb question (I apologize): there is nothing stopping a user from actually using processing time as the watermark with this API either. One can easily do df.withColumn("timestamp", current_timestamp()).withWatermark("timestamp"). I like that we're suggesting users use eventTime for stability, but we're not actually constraining them, right?

My biggest confusion here, that I couldn't find documented was the Type of the watermark column. Does it need to be timestamp type or can it be LongType?

@marmbrus

Contributor Author

commented Nov 1, 2016

Not a dumb question! You can certainly use processing time if those are the semantics you require. I do think there is a little bit of work we need to do to ensure determinism for these functions. Specifically, now() in SQL is supposed to return the moment query execution began. For streaming we'll need to record this and make sure that we give the same time even in the case of failures (based on the batch id with the current execution model).

Good point on the documentation. The thing you are missing is that it must be used in a window function, which does require TimestampType. I can see how to make this more clear.
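For illustration only (not code from this PR), a hedged sketch combining the processing-time idea above with the requirement that the watermark column be a TimestampType used in window(); it assumes a streaming df and a SparkSession named spark:

```scala
// Hypothetical usage: current_timestamp() yields a TimestampType column, so
// it can serve as a processing-time watermark and feed a window() aggregation.
import org.apache.spark.sql.functions.{current_timestamp, window}
import spark.implicits._ // assumes a SparkSession named `spark`

val byProcessingTime = df // assumes `df` is a streaming Dataset
  .withColumn("procTime", current_timestamp())
  .withWatermark("procTime", "1 minute")
  .groupBy(window($"procTime", "1 minute"))
  .count()
```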

@SparkQA


commented Nov 2, 2016

Test build #67939 has finished for PR 15702 at commit 311e7c0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
* using output modes that do not allow updates.
* - To minimize the amount of state that we need to keep for on-going aggregations.
*
* The current event time is computed by looking at the `MAX(eventTime)` seen in an epoch across

@koeninger

koeninger Nov 2, 2016

Contributor
  • Should this be "The current watermark is computed..." ?
  • what is an epoch, it isn't mentioned in the docs or elsewhere in the PR

@marmbrus

marmbrus Nov 2, 2016

Author Contributor

Changed to watermark. For epoch, I really just mean "during some period of time where we decide to coordinate across the partitions". This happens at batch boundaries now, but that is not part of the contract we are promising. I just removed that word to avoid confusion.

* Spark will use this watermark for several purposes:
* - To know when a given time window aggregation can be finalized and thus can be emitted when
* using output modes that do not allow updates.
* - To minimize the amount of state that we need to keep for on-going aggregations.

@koeninger

koeninger Nov 2, 2016

Contributor

For append, this sounds like the intention is to emit only once the watermark has passed, and to drop state.
But for other output modes, it's not clear from reading this what the effect of the watermark is on emission and dropping state.

* @param eventTime the name of the column that contains the event time of the row.
* @param delayThreshold the minimum delay to wait for data to arrive late, relative to the latest
* record that has been processed in the form of an interval
* (e.g. "1 minute" or "5 hours").

@koeninger

koeninger Nov 2, 2016

Contributor

Should this make it clear what the minimum useful granularity is (ms)?

@marmbrus

marmbrus Nov 2, 2016

Author Contributor

That seems like more of an implementation detail, rather than a contract of the API. The real contract is stated above as the actual watermark used is only guaranteed to be at least 'delayThreshold' behind the actual event time. There aren't really any bounds we can promise without knowing more about the query (even ms).

}

// Update and output modified rows from the StateStore.
case Some(Update) =>

@koeninger

koeninger Nov 2, 2016

Contributor

I'm not clear on why the semantics of Update mean that watermarks shouldn't be used to remove state.

@CodingCat

CodingCat Nov 2, 2016

Contributor

@koeninger, Update should allow late data to correct the previous results even if they are later than the threshold; a similar implementation is in http://cdn.oreillystatic.com/en/assets/1/event/160/Triggers%20in%20Apache%20Beam%20_incubating_%20Presentation.pdf (search withLateFirings)... correct me if I'm wrong

@koeninger

koeninger Nov 2, 2016

Contributor

To put it the other way, do the docs in this PR tell you as a user that for any output mode other than Append, you are potentially keeping unlimited aggregate state in memory, regardless of whether you set a watermark?

@marmbrus

marmbrus Nov 2, 2016

Author Contributor

The only output modes that are supported publicly are Complete and Append (update is only available internally for tests). When we add support for Update (I'd like to do this soon), it should also evict tuples which can no longer be updated due to their group falling beneath the watermark. I thought that it was fairly clear that Complete would need to retain the complete set of aggregate state, but I'm happy to make this more explicit if others are confused by this.

@koeninger

koeninger Nov 2, 2016

Contributor

Yes, I think it's a good idea to explicitly say for each output mode whether watermarks affect emit and evict. Just so I'm clear, the intention is

Append: affects emit, affects evict
Update: doesn't affect emit, affects evict
Complete: doesn't affect emit, no eviction

Is that right?

@marmbrus

marmbrus Nov 2, 2016

Author Contributor

That is correct.

@amitsela

amitsela Nov 13, 2016

Member

Generally, updates should be able to take late arrivals into account (with respect to EndOfWindow) and allow acting upon a user-defined strategy, such as updating for each following element.

@koeninger

Contributor

commented Nov 2, 2016

Given the concerns Ofir raised about a single far-future event screwing up monotonic event time, do you want to document that problem even if there isn't an enforced filter for it?

@SparkQA


commented Nov 2, 2016

Test build #67998 has finished for PR 15702 at commit 2685771.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@zsxwing
Member

left a comment

Looks good overall. My comments can be addressed later.

import org.apache.spark.unsafe.types.CalendarInterval
import org.apache.spark.util.AccumulatorV2

class MaxLong(protected var currentValue: Long = 0)

@zsxwing

zsxwing Nov 2, 2016

Member

nit: protected -> private

@zsxwing

zsxwing Nov 2, 2016

Member

nit: Could you document that this one only supports positive longs?


class MaxLong(protected var currentValue: Long = 0)
extends AccumulatorV2[Long, Long]
with Serializable {

@zsxwing

zsxwing Nov 2, 2016

Member

nit: not needed. AccumulatorV2 is already Serializable.


case class ValueAdded(key: UnsafeRow, value: UnsafeRow) extends StoreUpdate

case class ValueUpdated(key: UnsafeRow, value: UnsafeRow) extends StoreUpdate

case class KeyRemoved(key: UnsafeRow) extends StoreUpdate
case class ValueRemoved(key: UnsafeRow, value: UnsafeRow) extends StoreUpdate

@zsxwing

zsxwing Nov 2, 2016

Member

Any special reason to change this? It seems weird to add an unused field value.

@marmbrus

marmbrus Nov 2, 2016

Author Contributor

It is used. We need the value to emit the result upon eviction.

streamMetrics.reportTriggerDetail(EVENT_TIME_WATERMARK, newWatermark)
currentEventTimeWatermark = newWatermark
} else {
logTrace(s"Event time didn't move: $newWatermark < $currentEventTimeWatermark")

@zsxwing

zsxwing Nov 2, 2016

Member

We need to call streamMetrics.reportTriggerDetail(EVENT_TIME_WATERMARK, newWatermark) here. Otherwise, the trigger details won't have EVENT_TIME_WATERMARK for this batch.

}.headOption.foreach { newWatermark =>
if (newWatermark > currentEventTimeWatermark) {
logInfo(s"Updating eventTime watermark to: $newWatermark ms")
streamMetrics.reportTriggerDetail(EVENT_TIME_WATERMARK, newWatermark)

@zsxwing

zsxwing Nov 2, 2016

Member

Is it fine to just set EVENT_TIME_WATERMARK to 0 if the first batch doesn't have any data (e.g., the filter specified by the user drops all data)?

@marmbrus

marmbrus Nov 2, 2016

Author Contributor

I think that's okay?

@zsxwing

zsxwing Nov 2, 2016

Member

I suggest just fixing it since it's pretty easy. Just if (newWatermark == 0) "-" else newWatermark.toString

@marmbrus

marmbrus Nov 2, 2016

Author Contributor

I see, that makes sense. I actually just moved it out so we only report if it's non-zero.

child.execute().mapPartitions { iter =>
val getEventTime = UnsafeProjection.create(eventTime :: Nil, child.output)
iter.map { row =>
maxEventTime.add(getEventTime(row).getLong(0))

@zsxwing

zsxwing Nov 2, 2016

Member

Just a small question: which place will check the eventTime type? I guess getLong just throws an exception if the format is wrong. Can we fail it before starting the spark job?

@marmbrus

marmbrus Nov 2, 2016

Author Contributor

Added to checkAnalysis.

CheckAnswer((10, 3)),
AddData(inputData, 10), // 10 is later than 15 second watermark
CheckAnswer((10, 3)),
AddData(inputData, 25), // 10 is later than 15 second watermark

@zsxwing

zsxwing Nov 2, 2016

Member

nit: the comment is wrong

@SparkQA


commented Nov 2, 2016

Test build #3397 has finished for PR 15702 at commit 2685771.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.
@SparkQA


commented Nov 2, 2016

Test build #3398 has finished for PR 15702 at commit 2685771.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.
case etw: EventTimeWatermark =>
etw.eventTime.dataType match {
case s: StructType
if s.find(_.name == "start").map(_.dataType).contains(TimestampType) =>

@zsxwing

zsxwing Nov 2, 2016

Member

nit: Option.contains is not in Scala 2.10.

@marmbrus

marmbrus Nov 2, 2016

Author Contributor

really? lame...

@marmbrus

marmbrus Nov 2, 2016

Author Contributor

Oh... it should also check the end of the window, not the start...

@SparkQA


commented Nov 2, 2016

Test build #68023 has finished for PR 15702 at commit 379255d.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.
@SparkQA


commented Nov 2, 2016

Test build #68025 has finished for PR 15702 at commit 7a9b6dd.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.
@tdas
Contributor

left a comment

Major feedback - Python API for withWatermark()?
Other than that, it's looking good.

@@ -155,6 +155,16 @@ trait CheckAnalysis extends PredicateHelper {
}

operator match {
case etw: EventTimeWatermark =>
etw.eventTime.dataType match {
case s: StructType

@tdas

tdas Nov 3, 2016

Contributor

Which high level case is caught by this condition?

@marmbrus

marmbrus Nov 9, 2016

Author Contributor

The result of a window operation.

}

override def add(v: Long): Unit = {
if (value < v) { currentValue = v }

@tdas

tdas Nov 3, 2016

Contributor

nit: less confusing to read if if (currentValue < v) { currentValue = v }.
In fact, why not use math.max?

}

override def merge(other: AccumulatorV2[Long, Long]): Unit = {
if (currentValue < other.value) {

@tdas

tdas Nov 3, 2016

Contributor

nit: same as above, why not use math.max
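For reference, a minimal sketch of the accumulator shape under discussion, using math.max as suggested above; the actual class in this PR may differ:

```scala
import org.apache.spark.util.AccumulatorV2

// Illustrative MaxLong accumulator: tracks the maximum value added.
// As noted earlier in the review, it only behaves sensibly for positive
// longs, since the initial value is 0.
class MaxLong(private var currentValue: Long = 0L) extends AccumulatorV2[Long, Long] {
  override def isZero: Boolean = currentValue == 0L
  override def copy(): MaxLong = new MaxLong(currentValue)
  override def reset(): Unit = { currentValue = 0L }
  override def add(v: Long): Unit = { currentValue = math.max(currentValue, v) }
  override def merge(other: AccumulatorV2[Long, Long]): Unit = {
    currentValue = math.max(currentValue, other.value)
  }
  override def value: Long = currentValue
}
```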

@SparkQA


commented Nov 3, 2016

Test build #68029 has finished for PR 15702 at commit 1d4784f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@tdas

Contributor

commented Nov 10, 2016

LGTM, pending tests.

@SparkQA


commented Nov 11, 2016

Test build #68496 has finished for PR 15702 at commit de601bb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@marmbrus

Contributor Author

commented Nov 11, 2016

jenkins, test this please

@SparkQA


commented Nov 11, 2016

Test build #68504 has finished for PR 15702 at commit de601bb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@SparkQA


commented Nov 14, 2016

Test build #68631 has finished for PR 15702 at commit 87d8618.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@marmbrus

Contributor Author

commented Nov 14, 2016

jenkins test this please

@SparkQA


commented Nov 15, 2016

Test build #68637 has finished for PR 15702 at commit 87d8618.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@tdas

Contributor

commented Nov 15, 2016

I am merging this to master and 2.1

@asfgit asfgit closed this in c071878 Nov 15, 2016

asfgit pushed a commit that referenced this pull request Nov 15, 2016

[SPARK-18124] Observed delay based Event Time Watermarks

Author: Michael Armbrust <michael@databricks.com>

Closes #15702 from marmbrus/watermarks.

(cherry picked from commit c071878)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

uzadude added a commit to uzadude/spark that referenced this pull request Jan 27, 2017

[SPARK-18124] Observed delay based Event Time Watermarks

Author: Michael Armbrust <michael@databricks.com>

Closes apache#15702 from marmbrus/watermarks.