[FLINK-30334][runtime] Fix noMoreSplits event handling for HybridSource #21464

Closed

Member

@chucheng92 chucheng92 commented Dec 8, 2022

What is the purpose of the change

The hasNoMoreSplits check in SourceCoordinator#handleRequestSplitEvent does not account for HybridSource. As a result, HybridSource does not read data from the next child source, which leads to data loss and unexpected runtime behavior.
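
To illustrate the failure mode described above, here is a minimal, self-contained sketch. It assumes the coordinator forwards a split request to the enumerator only while the subtask's noMoreSplits flag is unset. GatedCoordinatorSketch and its members are hypothetical stand-ins, not Flink's SourceCoordinator.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the coordinator; not Flink's SourceCoordinator.
final class GatedCoordinatorSketch {
    // One flag per subtask; Java initializes boolean arrays to false.
    private final boolean[] subtaskHasNoMoreSplits;

    // Subtask ids whose split requests actually reached the enumerator.
    final List<Integer> forwardedRequests = new ArrayList<>();

    GatedCoordinatorSketch(int parallelism) {
        this.subtaskHasNoMoreSplits = new boolean[parallelism];
    }

    // Marks the subtask as finished; assumed to be called by the enumerator.
    void signalNoMoreSplits(int subtask) {
        subtaskHasNoMoreSplits[subtask] = true;
    }

    // Forwards the request only while the flag is unset (assumed gating logic).
    void handleRequestSplitEvent(int subtask) {
        if (!subtaskHasNoMoreSplits[subtask]) {
            forwardedRequests.add(subtask);
        }
    }
}
```

If the enumerator signals noMoreSplits when the first child source is exhausted, every later request from that subtask is dropped by the gate, so the next child source would never get to hand out its splits.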

Brief change log

Add an interception strategy so that the hasNoMoreSplits check works correctly with HybridSourceSplitEnumerator.

Verifying this change

Add two new test cases that cover the code path affected by this issue:

HybridSourceSplitEnumeratorTest#testInterceptNoMoreSplitEvent
SourceCoordinatorContextTest#testSignalNoMoreSplits

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@chucheng92 chucheng92 changed the title [FLINK-30334] Reset SourceCoordinator#handleRequestSplitEvent without… [FLINK-30334] Reset SourceCoordinator#handleRequestSplitEvent without violent hasNoMoreSplits check Dec 8, 2022
@flinkbot
Collaborator

flinkbot commented Dec 8, 2022

CI report:

Bot commands

The @flinkbot bot supports the following commands:

  • @flinkbot run azure: re-run the last Azure build

@chucheng92 chucheng92 changed the title [FLINK-30334] Reset SourceCoordinator#handleRequestSplitEvent without violent hasNoMoreSplits check [FLINK-30334] Reset SourceCoordinator#handleRequestSplitEvent because wrong hasNoMoreSplits check Dec 8, 2022
@chucheng92 chucheng92 changed the title [FLINK-30334] Reset SourceCoordinator#handleRequestSplitEvent because wrong hasNoMoreSplits check [FLINK-30334] Reset SourceCoordinator#handleRequestSplitEvent check logic because wrong hasNoMoreSplits check Dec 8, 2022
@chucheng92 chucheng92 changed the title [FLINK-30334] Reset SourceCoordinator#handleRequestSplitEvent check logic because wrong hasNoMoreSplits check [FLINK-30334][runtime] Reset SourceCoordinator#handleRequestSplitEvent check logic because hasNoMoreSplits not consider the hybridsource situation Dec 8, 2022
@chucheng92 chucheng92 changed the title [FLINK-30334][runtime] Reset SourceCoordinator#handleRequestSplitEvent check logic because hasNoMoreSplits not consider the hybridsource situation [FLINK-30334][runtime] Reset SourceCoordinator#handleRequestSplitEvent check logic because hasNoMoreSplits check not consider the hybridsource situation Dec 8, 2022
@chucheng92 chucheng92 changed the title [FLINK-30334][runtime] Reset SourceCoordinator#handleRequestSplitEvent check logic because hasNoMoreSplits check not consider the hybridsource situation [FLINK-30334][runtime] Fix SourceCoordinator#handleRequestSplitEvent hasNoMoreSplits check not consider the hybridsource situation Dec 20, 2022
@chucheng92 chucheng92 force-pushed the FLINK-30334 branch 3 times, most recently from d463f68 to 16f81a4 Compare December 21, 2022 09:02
@chucheng92
Member Author

@zhuzhurk Hi, could you help review this? Thanks a lot.

@zhuzhurk zhuzhurk self-assigned this Dec 22, 2022
Contributor

@zhuzhurk zhuzhurk left a comment

Thanks for opening this PR! @chucheng92
My main concern is that we should avoid changing the public interface SplitEnumeratorContext, since that is not necessary. The other changes look good to me.

* @param subtask The index of the operator's parallel subtask that shall be signaled it will
* receive splits later.
*/
default void signalIntermediateNoMoreSplits(int subtask) {}
Contributor

Let's avoid adding this method to SplitEnumeratorContext, which is a @Public interface.
The intermediate noMoreSplits event is specific to HybridSource and is hard for users to understand or handle.

I think we can add the method directly to SourceCoordinatorContext, then use instanceof and a type cast in HybridSourceSplitEnumerator to invoke it.

Member Author

@chucheng92 chucheng92 Dec 23, 2022

@zhuzhurk Yes, SplitEnumeratorContext is a public API; I added it as a default method to avoid any impact. We cannot use realContext instanceof SourceCoordinatorContext directly: HybridSourceSplitEnumerator is in flink-connector-base, while SourceCoordinatorContext is in flink-runtime, so how can we reference it? Using reflection or adding a runtime dependency would both be ugly.

Contributor

@zhuzhurk zhuzhurk Dec 23, 2022

I think we can add an interface in flink-core, e.g.

@Internal
interface SupportsIntermediateNoMoreSplits {
    void signalIntermediateNoMoreSplits(int subtask);
}

SourceCoordinatorContext should implement it. And HybridSourceSplitEnumerator should check that the realContext is an instance of SupportsIntermediateNoMoreSplits.
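
A minimal sketch of this opt-in pattern, assuming stand-in types so that it compiles without Flink on the classpath. EnumeratorContextSketch, RecordingCoordinatorContext, PlainContext and HybridEnumeratorSketch are hypothetical names, and the fallback to the regular signal for contexts that do not opt in is an assumption rather than something stated in this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the @Public SplitEnumeratorContext; only the method needed here.
interface EnumeratorContextSketch {
    void signalNoMoreSplits(int subtask);
}

// The capability interface proposed above (annotated @Internal in Flink;
// the annotation is omitted here to keep the sketch dependency-free).
interface SupportsIntermediateNoMoreSplits {
    void signalIntermediateNoMoreSplits(int subtask);
}

// Hypothetical runtime-side context that opts in to the capability.
class RecordingCoordinatorContext
        implements EnumeratorContextSketch, SupportsIntermediateNoMoreSplits {
    final List<String> events = new ArrayList<>();

    @Override
    public void signalNoMoreSplits(int subtask) {
        events.add("final:" + subtask);
    }

    @Override
    public void signalIntermediateNoMoreSplits(int subtask) {
        events.add("intermediate:" + subtask);
    }
}

// Hypothetical context that does not opt in (e.g. a third-party implementation).
class PlainContext implements EnumeratorContextSketch {
    final List<String> events = new ArrayList<>();

    @Override
    public void signalNoMoreSplits(int subtask) {
        events.add("final:" + subtask);
    }
}

final class HybridEnumeratorSketch {
    // Uses the intermediate signal when the context supports it; otherwise
    // falls back to the regular signal (assumed fallback behavior).
    static void signalIntermediate(EnumeratorContextSketch context, int subtask) {
        if (context instanceof SupportsIntermediateNoMoreSplits) {
            ((SupportsIntermediateNoMoreSplits) context).signalIntermediateNoMoreSplits(subtask);
        } else {
            context.signalNoMoreSplits(subtask);
        }
    }
}
```

Because the capability interface would live in flink-core, the enumerator in flink-connector-base can test for it without depending on flink-runtime, and the @Public interface stays unchanged.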

@zhuzhurk
Contributor

@chucheng92 FYI, the community is preparing to release 1.16.1. I hope we can get this issue fixed in this release.
Could you check whether the suggested approach works?

@chucheng92 chucheng92 force-pushed the FLINK-30334 branch 2 times, most recently from 6cdf6d6 to ae9e955 Compare December 26, 2022 04:18
@chucheng92
Member Author

chucheng92 commented Dec 26, 2022

@chucheng92 FYI, the community is preparing to release 1.16.1. I hope we can get this issue fixed in this release. Could you check whether the suggested approach works?

Yes, I implemented it the way you suggested, and it works well. I have updated the PR. PTAL, thanks.

@chucheng92 chucheng92 force-pushed the FLINK-30334 branch 2 times, most recently from 3d0b9bd to 2f4ba4a Compare December 26, 2022 05:41

@Override
public void signalIntermediateNoMoreSplits(int subtask) {
subtaskHasNoMoreSplits[subtask] = false;
Contributor

This method should be empty. The values should be false initially rather than being set to false in this method.

@@ -157,7 +158,17 @@ private void sendSwitchSourceEvent(int subtaskId, int sourceIndex) {
LOG.debug("Restoring splits to subtask={} {}", subtaskId, splits);
context.assignSplits(
new SplitsAssignment<>(Collections.singletonMap(subtaskId, splits)));
context.signalNoMoreSplits(subtaskId);
if (context instanceof SupportsIntermediateNoMoreSplits) {
Contributor

It would be better to extract a reusable method, e.g.

private static void signalNoMoreSplits(
    SplitEnumeratorContext<HybridSourceSplit> context,
    int subtaskId,
    int sourceIndex,
    int sourceSize);
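
The suggestion above gives only a signature. One way the decision inside such a helper could look is sketched below, under the assumption that only the last child source sends the final signal and that contexts without intermediate support always receive the final one. NoMoreSplitsSignal and NoMoreSplitsDecisionSketch are hypothetical names, not part of Flink.

```java
// Which kind of signal the helper would send (hypothetical enum).
enum NoMoreSplitsSignal {
    FINAL,
    INTERMEDIATE
}

final class NoMoreSplitsDecisionSketch {
    // sourceIndex is the zero-based index of the current child source and
    // sourceSize the total number of child sources in the hybrid source.
    static NoMoreSplitsSignal choose(
            boolean supportsIntermediate, int sourceIndex, int sourceSize) {
        boolean isLastSource = sourceIndex >= sourceSize - 1;
        if (isLastSource || !supportsIntermediate) {
            // Nothing follows, or the context cannot tell the two signals apart.
            return NoMoreSplitsSignal.FINAL;
        }
        // More child sources follow, so the subtask must keep requesting splits.
        return NoMoreSplitsSignal.INTERMEDIATE;
    }
}
```

Keeping the decision in one place means every call site that currently calls context.signalNoMoreSplits(subtaskId) can go through the same rule.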

// It's an intermediate noMoreSplit event, notify subtask to deal with this event.
callInCoordinatorThread(
() -> {
subtaskHasNoMoreSplits[subtask] = false;
Contributor

Why does it need to be set to false?

Contributor

As I understand it, it should be false initially and set to true on signalNoMoreSplits(int). When a failover happens, the value is reset to false. There seems to be no need for signalIntermediateNoMoreSplits(int) to do the reset.

Member Author

Yes, there is no need to set it to false. I have deleted it, thanks.
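
The flag lifecycle discussed in this thread can be sketched as follows. NoMoreSplitsFlagsSketch and onSubtaskReset are hypothetical names chosen for illustration; only the three transitions (false initially, true on signalNoMoreSplits, reset on failover) come from the discussion.

```java
// Hypothetical holder for the per-subtask flags; not Flink's SourceCoordinatorContext.
final class NoMoreSplitsFlagsSketch {
    private final boolean[] subtaskHasNoMoreSplits;

    NoMoreSplitsFlagsSketch(int parallelism) {
        // Java initializes boolean arrays to false, so no explicit reset is needed.
        this.subtaskHasNoMoreSplits = new boolean[parallelism];
    }

    // Final signal: the subtask will never receive splits again.
    void signalNoMoreSplits(int subtask) {
        subtaskHasNoMoreSplits[subtask] = true;
    }

    // Intermediate signal: deliberately leaves the flag untouched.
    void signalIntermediateNoMoreSplits(int subtask) {
        // Intentionally empty with respect to the flag.
    }

    // Failover of a subtask clears its flag (hypothetical method name).
    void onSubtaskReset(int subtask) {
        subtaskHasNoMoreSplits[subtask] = false;
    }

    boolean hasNoMoreSplits(int subtask) {
        return subtaskHasNoMoreSplits[subtask];
    }
}
```

Since the flag can only become true through the final signal, an intermediate signal never finds it set on a healthy subtask, which is why writing false there is redundant.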

@chucheng92 chucheng92 force-pushed the FLINK-30334 branch 2 times, most recently from 39fda04 to de3b039 Compare December 26, 2022 06:57
@chucheng92
Member Author

@zhuzhurk Thanks for reviewing. I have fixed the problems you commented on. PTAL.

Contributor

@zhuzhurk zhuzhurk left a comment

Thanks for addressing all the comments! @chucheng92
LGTM.

@chucheng92
Member Author

@flinkbot run azure

// test add splits back, then SUBTASK0 restore splitFromSource0 split
// reset splits assignment & previous subtaskHasNoMoreSplits flag.
context.getSplitsAssignmentSequence().clear();
Whitebox.setInternalState(context, "subtaskHasNoMoreSplits", new boolean[] {false, false});
Contributor

It would be better to add a method resetNoMoreSplits(int) to MockSplitEnumeratorContext instead of using reflection to do this.

Member Author

@zhuzhurk Thanks for your suggestion.

…hasNoMoreSplits check not consider the hybridsource situation
@chucheng92
Member Author

@flinkbot run azure

@zhuzhurk
Contributor

Merging.

@zhuzhurk zhuzhurk changed the title [FLINK-30334][runtime] Fix SourceCoordinator#handleRequestSplitEvent hasNoMoreSplits check not consider the hybridsource situation [FLINK-30334][runtime] Fix noMoreSplits event handling for HybridSource Dec 27, 2022
@zhuzhurk zhuzhurk closed this in 6e4e6c6 Dec 27, 2022
chucheng92 added a commit to chucheng92/flink that referenced this pull request Feb 3, 2023
sergeitsar pushed a commit to fentik/flink that referenced this pull request Feb 8, 2023
sergeitsar pushed a commit to fentik/flink that referenced this pull request Feb 8, 2023
akkinenivijay pushed a commit to krisnaru/flink that referenced this pull request Feb 11, 2023
@chucheng92 chucheng92 deleted the FLINK-30334 branch June 14, 2023 03:56