Skip to content

Conversation

tzulitai
Copy link
Contributor

@tzulitai tzulitai commented Aug 24, 2016

This is a short-term fix, until the min-watermark service for the JobManager described in the JIRA discussion is available.

The way this fix works is that we let idle subtasks that initially don't get assigned shards emit a Long.MAX_VALUE watermark. Also, to avoid messing up the watermarks on resharding, we only deliberately fail hard if the new shards are assigned to idle subtasks. So, if all subtasks are not initially idle on startup (i.e., when total number of shards >= consumer parallelism), the Kinesis consumer can still transparently handle resharding like before without failing.

I've tested exactly-once with our manual tests (with and w/o resharding) and the fix works nicely, still retaining exactly-once guarantee despite non-transparency.

However, I can't reproduce the unbounded state & akka frame size exceeding with window operators w/o this change (perhaps the window I'm testing with is too simple?), so I'm not sure if that issue is also correctly fixed with this change; I'll need a bit of help to let us clarify this.

This change should also go into the 1.1.2 bugfix release branch.

R: @rmetzger and @aljoscha for review. Thanks in advance!

… fail on resharding

This no longer allows the Kinesis consumer to transparently handle resharding.
This is a short-term workaround until we have a min-watermark notification service available in the JobManager.
@rmetzger
Copy link
Contributor

Thank you for opening a pull request to fix the issue.

I think we also need to cover another case: What happens when the number of shards has been reduced in a resharding and some fetchers are now without a shard? I think in that case, the worker also needs to emit a final Long.MAX_VALUE, and it has to fail once it gets a shard assigned again.

@tzulitai
Copy link
Contributor Author

Ah yes, correct. I'll update this soon.

@aljoscha
Copy link
Contributor

Minus @rmetzger's comment this looks good to merge! Thanks for fixing this @tzulitai!

@tzulitai
Copy link
Contributor Author

tzulitai commented Aug 27, 2016

To include the missing case @rmetzger mentioned, it turns out the complete fix is actually more complicated than I expected to perform correct case determination after every reshard, and perhaps might require a little bit of rework on the current shard discovery mechanism to get it right.

Heads-up notice that this will probably need a re-review. Sorry for the delay, I'm currently still on it, hopefully will update the PR by the end of today ;) I'll notify when it's ready.

@tzulitai
Copy link
Contributor Author

tzulitai commented Aug 28, 2016

@rmetzger, @aljoscha the fix is ready for another review now, thanks!

@rmetzger
Copy link
Contributor

Thank you for the pull request. I'll merge it to master and the release-1.1 branch.

@asfgit asfgit closed this in 7b574cf Aug 29, 2016
asfgit pushed a commit that referenced this pull request Aug 29, 2016
… fail on resharding

This no longer allows the Kinesis consumer to transparently handle resharding.
This is a short-term workaround until we have a min-watermark notification service available in the JobManager.

This closes #2414
@tzulitai
Copy link
Contributor Author

Thanks @rmetzger !

haoch pushed a commit to haoch/flink that referenced this pull request Sep 10, 2016
… fail on resharding

This no longer allows the Kinesis consumer to transparently handle resharding.
This is a short-term workaround until we have a min-watermark notification service available in the JobManager.

This closes apache#2414
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants