[FLINK-4341] Let idle consumer subtasks emit max value watermarks and fail on resharding #2414

tzulitai · 2016-08-24T17:12:17Z

This is a short-term fix, until the min-watermark service for the JobManager described in the JIRA discussion is available.

The fix works as follows: subtasks that are idle because they are not assigned any shards on startup emit a Long.MAX_VALUE watermark. To avoid corrupting the watermarks on resharding, we also deliberately fail hard, but only when new shards are assigned to idle subtasks. So if no subtask is idle on startup (i.e., when the total number of shards >= consumer parallelism), the Kinesis consumer can still handle resharding transparently, as before, without failing.
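A minimal sketch of why the Long.MAX_VALUE watermark helps, not taken from the PR's diff. It assumes a simple round-robin shard assignment, and the class and method names (`IdleSubtaskSketch`, `assignedShards`, `combinedWatermark`) are hypothetical. It relies on the fact that a downstream operator advances its event time to the minimum watermark across its inputs, so an input that never emits a watermark holds everything back.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch for illustration; not the actual FlinkKinesisConsumer code. */
final class IdleSubtaskSketch {

    /** Assumed round-robin assignment of shard indices to one subtask. */
    static List<Integer> assignedShards(int subtaskIndex, int parallelism, int totalShards) {
        List<Integer> shards = new ArrayList<>();
        for (int shard = subtaskIndex; shard < totalShards; shard += parallelism) {
            shards.add(shard);
        }
        return shards;
    }

    /** A downstream operator's event time is the minimum watermark over all of its inputs. */
    static long combinedWatermark(long[] inputWatermarks) {
        long min = Long.MAX_VALUE;
        for (long watermark : inputWatermarks) {
            min = Math.min(min, watermark);
        }
        return min;
    }

    public static void main(String[] args) {
        int parallelism = 4;
        int totalShards = 2; // fewer shards than subtasks, so some subtasks are idle

        for (int subtask = 0; subtask < parallelism; subtask++) {
            boolean idle = assignedShards(subtask, parallelism, totalShards).isEmpty();
            System.out.println("subtask " + subtask + (idle ? " is idle" : " reads shards"));
        }

        // Without the fix, idle subtasks never emit a watermark; modeled here as Long.MIN_VALUE.
        long[] withoutFix = {100L, 120L, Long.MIN_VALUE, Long.MIN_VALUE};
        // With the fix, idle subtasks emit Long.MAX_VALUE and drop out of the minimum.
        long[] withFix = {100L, 120L, Long.MAX_VALUE, Long.MAX_VALUE};

        System.out.println("stuck without fix: " + (combinedWatermark(withoutFix) == Long.MIN_VALUE));
        System.out.println("combined watermark with fix: " + combinedWatermark(withFix));
    }
}
```

With 2 shards and parallelism 4, subtasks 2 and 3 are idle under this assumed assignment. Once they report Long.MAX_VALUE, the combined watermark follows the slowest active subtask instead of staying at the initial value.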

I've tested exactly-once with our manual tests (with and w/o resharding) and the fix works nicely: the exactly-once guarantee is retained even though resharding is no longer handled transparently.

However, without this change I can't reproduce the unbounded state and the exceeded Akka frame size with window operators (perhaps the window I'm testing with is too simple?), so I'm not sure whether this change also fixes that issue correctly. I'll need a bit of help to clarify this.

This change should also go into the 1.1.2 bugfix release branch.

R: @rmetzger and @aljoscha for review. Thanks in advance!


rmetzger · 2016-08-25T14:27:47Z

Thank you for opening a pull request to fix the issue.

I think we also need to cover another case: What happens when the number of shards has been reduced in a resharding and some fetchers are now without a shard? I think in that case, the worker also needs to emit a final Long.MAX_VALUE, and it has to fail once it gets a shard assigned again.
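The two cases discussed here (a subtask that starts without shards, and one that loses all of its shards after a reshard) can be sketched as a small guard that is checked after every shard discovery round. This is only an illustration under assumptions: `ReshardingGuardSketch` and `onShardDiscovery` are hypothetical names, and the real consumer's discovery mechanism is more involved. The reason for failing is that a watermark cannot be lowered again after Long.MAX_VALUE has been emitted.

```java
import java.util.OptionalLong;

/** Hypothetical per-subtask guard; not the actual FlinkKinesisConsumer code. */
final class ReshardingGuardSketch {

    /** True once this subtask has emitted its final Long.MAX_VALUE watermark. */
    private boolean emittedMaxWatermark = false;

    /**
     * Called after each shard discovery round with the number of shards this subtask
     * is currently reading. Returns the watermark to emit, if any.
     *
     * @throws IllegalStateException if a shard arrives after the final watermark was emitted
     */
    OptionalLong onShardDiscovery(int activeShardCount) {
        if (activeShardCount == 0) {
            if (!emittedMaxWatermark) {
                // Idle from the start, or idle after a reshard: emit the final watermark once.
                emittedMaxWatermark = true;
                return OptionalLong.of(Long.MAX_VALUE);
            }
            return OptionalLong.empty();
        }
        if (emittedMaxWatermark) {
            // The watermark can no longer go back down, so fail hard and let the job restart.
            throw new IllegalStateException(
                "Shard assigned to a subtask that already emitted Long.MAX_VALUE");
        }
        return OptionalLong.empty();
    }

    public static void main(String[] args) {
        ReshardingGuardSketch guard = new ReshardingGuardSketch();
        System.out.println(guard.onShardDiscovery(2)); // active: nothing special to emit
        System.out.println(guard.onShardDiscovery(0)); // lost all shards: final watermark
        try {
            guard.onShardDiscovery(1); // a shard comes back: fail hard
        } catch (IllegalStateException e) {
            System.out.println("failed: " + e.getMessage());
        }
    }
}
```

A subtask that never becomes idle never trips the guard, which matches the description that resharding stays transparent as long as no subtask has emitted the final watermark.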

tzulitai · 2016-08-25T15:10:45Z

Ah yes, correct. I'll update this soon.

aljoscha · 2016-08-27T06:54:52Z

Minus @rmetzger's comment this looks good to merge! Thanks for fixing this @tzulitai!

tzulitai · 2016-08-27T09:18:58Z

To cover the missing case @rmetzger mentioned, the complete fix turns out to be more complicated than I expected: it has to determine the correct case after every reshard, and getting that right might require a little bit of rework on the current shard discovery mechanism.

Heads-up: this will probably need a re-review. Sorry for the delay; I'm still working on it and hope to update the PR by the end of today ;) I'll notify you when it's ready.


tzulitai · 2016-08-28T07:53:30Z

@rmetzger, @aljoscha the fix is ready for another review now, thanks!

rmetzger · 2016-08-29T09:43:58Z

Thank you for the pull request. I'll merge it to master and the release-1.1 branch.

… fail on resharding This no longer allows the Kinesis consumer to transparently handle resharding. This is a short-term workaround until we have a min-watermark notification service available in the JobManager. This closes #2414

tzulitai · 2016-08-29T10:06:39Z

Thanks @rmetzger !


[FLINK-4341] Let idle consumer subtasks emit max value watermarks and…

bc8e50d

… fail on resharding This no longer allows the Kinesis consumer to transparently handle resharding. This is a short-term workaround until we have a min-watermark notification service available in the JobManager.

tzulitai added 2 commits August 28, 2016 15:40

[FLINK-4341] Fully consider all cases to emit max value watermark / fail

d8d0942

[FLINK-4341] Inform checkpointing must be enabled in the workaround n…

57b4cb7

…otice

[FLINK-4341] Remove unnecessary synchronized block

ddbd4b5

asfgit closed this in 7b574cf Aug 29, 2016

rmetzger added the component=Connectors/Common label Mar 14, 2019
