Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-19936] Make SinkITCase stable #13898

Closed
wants to merge 2 commits into from
Closed

Conversation

guoweiM
Copy link
Contributor

@guoweiM guoweiM commented Nov 3, 2020

What is the purpose of the change

In the streaming execution mode there are two uncertain things:

  1. The order of receiving the checkpoint complete.(Committer receive first or GlobalCommitter receive first)
  2. Whether the operator can receive the last checkpoint
    This patch does following two things to resolve SinkITCase's unstable owning to these two uncertainty

This pr does two things

  1. Changing how to verifying the output of GlobalCommitter for the case 1
  2. Let the FiniteTestSource exit only when the Committer/GlobalCommitter received all the elements.

Brief change log

  • This patch changes how to verifying the output of GlobalCommitter to resolve SinkITCase's unstable owning to this uncertainty.
  • FiniteTestSource exits only after user provide condition is satisfied or time out.
  • All streaming mode ITCases exit only when the Committer/GlobalCommitter receive all the data or time out

Verifying this change

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (not applicable)

The order of receiving the checkpoint complete is uncertain.(`Committer` receive first or `GlobalCommitter` receive first)
This patch changes how to verifying the output of `GlobalCommitter` to resolve `SinkITCase`'s unstable owning to this uncertainty.
@flinkbot
Copy link
Collaborator

flinkbot commented Nov 3, 2020

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit e9e477b (Tue Nov 03 06:38:13 UTC 2020)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!
  • This pull request references an unassigned Jira ticket. According to the code contribution guide, tickets need to be assigned before starting with the implementation work.

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into to Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot
Copy link
Collaborator

flinkbot commented Nov 3, 2020

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build
  • @flinkbot run azure re-run the last Azure build

@kl0u
Copy link
Contributor

kl0u commented Nov 3, 2020

@guoweiM The build fails on AZP. Also could you explain a bit what the problem is so that I can understand what is the goal of the solution?

@kl0u
Copy link
Contributor

kl0u commented Nov 3, 2020

From the solution, it seems that you think that the source of the instability is that the notification for the completion of the checkpoint does not come before the test exits, right?

@guoweiM
Copy link
Contributor Author

guoweiM commented Nov 4, 2020

From the solution, it seems that you think that the source of the instability is that the notification for the completion of the checkpoint does not come before the test exits, right?

Yes. In the streaming execution mode the Committer/globalCommitter depends on the checkpoint complete so I let the FiniteTestSource waits until the Committer/GlobalCommitter receive all the datas.
(Of course if we finish the 《FLIP-147: Support Checkpoints After Tasks Finished》 there will be no this problem.)

Another thing that makes the test unstable is the uncertain order in which the Committer and GlobalCommitter receives the checkpoint complete notification. It leads that we could not know the output of GlobalCommitter("1+3" or "1+2+3" or "1"?) when a checkpoint is complete. So this pr also changes how to verify the output of GlobalCommitter. In the new verification way we verify the splitted output of the GlobalCommitter.

Following is an example that GlobalCommitter receive the checkpoint complete before the no.3 Committer.

image

…/GlobalCommitter receive all the elements.

This patch does two things
1. FiniteTestSource exits only after user provide condition is satisfied or time out.
2. All streaming mode ITCases exit only when the Committer/GlobalCommitter receive all the data or time out.

fix compile error
Copy link
Contributor

@kl0u kl0u left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work @guoweiM . LGTM, I will also let it run on my AZP and then merge when I have the green light.

@kl0u kl0u closed this in 7a2d76d Nov 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants