[FLINK-5697] [kinesis] Add periodic per-shard watermark support #6980
What is the purpose of the change
Adds support for periodic per-shard watermarks to the Kinesis consumer. This functionality is off by default and can be enabled by setting an optional watermark assigner on the consumer. When enabled, the watermarking also optionally supports idle shard detection based on a configurable interval of inactivity.
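To make the idea concrete, here is a minimal, self-contained sketch of per-shard watermark tracking with idle shard detection. It is not the code from this PR: the class `PerShardWatermarkTracker`, its methods, and the rule that the subtask watermark is the minimum over its non-idle shards are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: one watermark per shard, combined as the minimum over non-idle shards. */
public class PerShardWatermarkTracker {

    /** Highest event timestamp seen and last activity time (processing time) for one shard. */
    private static final class ShardState {
        long maxEventTime = Long.MIN_VALUE;
        long lastActivity;
    }

    private final Map<String, ShardState> shards = new HashMap<>();
    private final long idleIntervalMillis; // assumption: a value <= 0 disables idle detection
    private long lastEmitted = Long.MIN_VALUE;

    public PerShardWatermarkTracker(long idleIntervalMillis) {
        this.idleIntervalMillis = idleIntervalMillis;
    }

    /** Records an event read from a shard; nowMillis is the current processing time. */
    public void onRecord(String shardId, long eventTime, long nowMillis) {
        ShardState state = shards.computeIfAbsent(shardId, k -> new ShardState());
        state.maxEventTime = Math.max(state.maxEventTime, eventTime);
        state.lastActivity = nowMillis;
    }

    /**
     * Called periodically. Returns the current subtask watermark, which only moves forward.
     * Shards with no activity for at least idleIntervalMillis are skipped so that they do
     * not hold the watermark back.
     */
    public long onPeriodicEmit(long nowMillis) {
        long min = Long.MAX_VALUE;
        for (ShardState state : shards.values()) {
            boolean idle = idleIntervalMillis > 0
                    && nowMillis - state.lastActivity >= idleIntervalMillis;
            if (!idle) {
                min = Math.min(min, state.maxEventTime);
            }
        }
        if (min != Long.MAX_VALUE && min > lastEmitted) {
            lastEmitted = min;
        }
        return lastEmitted;
    }
}
```

In this sketch an idle shard that resumes later cannot pull the watermark back, so its older events would arrive behind the watermark.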
Brief change log
Verifying this change
This change added tests and can be verified as follows:
Added a unit test. More test coverage is planned with the subsequent work on shared watermark state and the emit queue, as discussed on the ML. This change is ported from the Lyft internal codebase, where it is used in production.
Does this pull request potentially affect one of the following parts:
There is a caveat with this implementation that the docs should perhaps mention. The caveat is that it may produce spurious late events when processing a backlog of data.
Here's an example of when that may occur. Imagine that subtask 1 is processing shard A and subtask 2 is processing shard B. Shard A has reached 6:00 in event time (as per the assigner), and so subtask 1 emits the corresponding watermark. At this point, subtask 1 has made the irrevocable assertion that subsequent events will be past 6:00. Meanwhile, shard B is at 5:30 and undergoes a split into C and D. If either child shard is subsequently assigned to subtask 1, its events will be considered late due to the assertion made earlier.
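The scenario can be reproduced with a toy model. This is only a sketch: `ReshardLateEventDemo` and `Subtask` are hypothetical names, times are expressed as minutes since midnight, and an event is treated as late when its timestamp is at or below the watermark the subtask has already emitted.

```java
public class ReshardLateEventDemo {

    /** Minimal stand-in for a subtask's output watermark; it can only move forward. */
    static final class Subtask {
        private long watermark = Long.MIN_VALUE;

        void emitWatermark(long timestamp) {
            watermark = Math.max(watermark, timestamp);
        }

        /** Assumption: an event at or below the emitted watermark counts as late. */
        boolean isLate(long eventTime) {
            return eventTime <= watermark;
        }
    }

    /** Converts a wall-clock style hour and minute into minutes since midnight. */
    static long minutes(int hour, int minute) {
        return hour * 60L + minute;
    }

    public static void main(String[] args) {
        Subtask subtask1 = new Subtask();

        // Shard A has reached 6:00, so subtask 1 emits that watermark.
        subtask1.emitWatermark(minutes(6, 0));

        // Shard B was at 5:30 when it split into C and D; C is then assigned to subtask 1.
        long firstEventFromC = minutes(5, 31);

        System.out.println("event from C late? " + subtask1.isLate(firstEventFromC));
    }
}
```

Running the demo reports the first event from shard C as late, even though it is in order within its own shard.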
@EronWright that's correct and I will make sure to document this. Even our planned follow-up work won't be able to address this resharding scenario. I think we will only be able to address it with the new source design that is currently under discussion (which should permit centralized discovery and more sophisticated splitting/shard distribution).