
[FLINK-31803] Harden UpdateJobResourceRequirementsRecoveryITCase. #22408

Merged: 1 commit merged into apache:master on Apr 21, 2023

Conversation

@dmvk (Member) commented on Apr 17, 2023

This fixes a race condition where HA data might have been accidentally cleaned up due to the job transitioning to a terminal state.

https://issues.apache.org/jira/browse/FLINK-31803
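The discussion below traces the problem to the default NoRestartStrategy: when a TaskManager disconnects during shutdown, the task failure fails the whole job, the job reaches a terminal state, and its HA data is cleaned up. A minimal sketch of the hardening idea under that assumption, using the standard Flink restart-strategy options (an illustration of the mechanism, not the literal patch):

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.RestartStrategyOptions;

import java.time.Duration;

class RestartStrategyHardeningSketch {
    // Hypothetical helper: give the test job a restart strategy so a transient
    // TaskManager disconnect during cluster shutdown does not push the job into
    // FAILED (as the default NoRestartStrategy would), which would in turn
    // trigger cleanup of the job's HA data before the second cluster can recover it.
    static Configuration withRestartStrategy(Configuration configuration) {
        configuration.set(RestartStrategyOptions.RESTART_STRATEGY, "fixed-delay");
        configuration.set(
                RestartStrategyOptions.RESTART_STRATEGY_FIXED_DELAY_ATTEMPTS, Integer.MAX_VALUE);
        configuration.set(
                RestartStrategyOptions.RESTART_STRATEGY_FIXED_DELAY_DELAY, Duration.ofMillis(10));
        return configuration;
    }
}
```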

@flinkbot (Collaborator) commented on Apr 17, 2023

CI report:

Bot commands: the @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

@zentol (Contributor) left a comment

Which terminal state did the job enter, and why?

Couldn't the test still fail if the update call is made while a restarted job is still initializing?


configuration.set(JobManagerOptions.SCHEDULER, JobManagerOptions.SchedulerType.Adaptive);
@zentol (Contributor):

I don't understand this change.

@dmvk (Member, Author):

This is meant in combination with:

        assumeThat(ClusterOptions.isAdaptiveSchedulerEnabled(configuration)).isTrue();

which ensures this is already set.

@dmvk (Member, Author):

The main idea is to skip this test when smoke testing the default scheduler, because it can run for > 10s.

@zentol (Contributor) commented on Apr 18, 2023:
I'd just consider this a test for the adaptive scheduler, and we don't categorically skip those in PRs.

If test times are a concern, well then those should be addressed anyway.

@dmvk (Member, Author):

> I'd just consider this a test for the adaptive scheduler, and we don't categorically skip those in PRs.

Fixed.

> If test times are a concern, well then those should be addressed anyway.

The proper fix would be a real HA setup for testing that doesn't require running ZooKeeper, which is out of scope for now 😢

@dmvk (Member, Author) commented on Apr 17, 2023:

> Which terminal state did the job enter, and why?

FAILED, because during the TM disconnect the task failed and the NoRestartStrategy tore the job down.

> Couldn't the test still fail if the update call is made while a restarted job is still initializing?

I think this is prevented by calling waitUntilJobInitializationFinished before the update, or am I missing something?
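For illustration, that wait amounts to polling the job status until it leaves INITIALIZING before the requirements update is sent. A rough sketch against the MiniCluster API, assuming a MiniCluster handle is available (the actual test relies on the existing waitUntilJobInitializationFinished utility; the helper below is a made-up stand-in):

```java
import org.apache.flink.api.common.JobID;
import org.apache.flink.api.common.JobStatus;
import org.apache.flink.runtime.minicluster.MiniCluster;

class JobInitializationGuardSketch {
    // Hypothetical stand-in for waitUntilJobInitializationFinished: poll the job
    // status until it leaves INITIALIZING, so the resource requirements update is
    // only issued against a fully initialized job.
    static void awaitInitialized(MiniCluster miniCluster, JobID jobId) throws Exception {
        while (true) {
            JobStatus status = miniCluster.getJobStatus(jobId).get();
            if (status == JobStatus.INITIALIZING) {
                Thread.sleep(50);
                continue;
            }
            if (status.isGloballyTerminalState()) {
                throw new IllegalStateException("Job reached terminal state " + status);
            }
            return; // Initialization finished; safe to send the update now.
        }
    }
}
```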

dmvk requested a review from zentol on April 18, 2023 at 07:09
@dmvk (Member, Author) commented on Apr 18, 2023:

> Couldn't the test still fail if the update call is made while a restarted job is still initializing?

We don't make any more updates after spinning up the 2nd cluster. The job restart this PR refers to is caused by a race condition during closeAsyncWithoutCleaningHighAvailabilityData, which happens after the update.
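For context, the ordering being discussed looks roughly like the sketch below: the requirements update happens on the first cluster, that cluster is then closed without cleaning HA data, and only afterwards is the second cluster started to recover the job. This is a condensed illustration, not the actual test; it assumes the two MiniClusters share the same HA backend (which the real test wires up explicitly) and elides the update call and assertions.

```java
import org.apache.flink.api.common.JobID;
import org.apache.flink.api.common.JobStatus;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.minicluster.MiniCluster;
import org.apache.flink.runtime.minicluster.MiniClusterConfiguration;

class RecoveryOrderingSketch {
    static void restartClusterAndAwaitRecovery(
            MiniCluster firstCluster, Configuration configuration, JobID jobId) throws Exception {
        // Shut down the first cluster but keep the HA data (job graph, updated
        // requirements) so the second cluster can recover the job. The race being
        // fixed: if the job reaches a terminal state around this point, its HA
        // data is cleaned up and there is nothing left to recover.
        firstCluster.closeAsyncWithoutCleaningHighAvailabilityData().get();

        try (MiniCluster secondCluster =
                new MiniCluster(
                        new MiniClusterConfiguration.Builder()
                                .setConfiguration(configuration)
                                .build())) {
            secondCluster.start();
            // No further updates are issued here; we only wait for the recovered
            // job to reach RUNNING again.
            while (secondCluster.getJobStatus(jobId).get() != JobStatus.RUNNING) {
                Thread.sleep(50);
            }
        }
    }
}
```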

@zentol (Contributor) commented on Apr 18, 2023:

> FAILED, because during the TM disconnect the task failed and the NoRestartStrategy tore the job down.

How does the restart strategy fix that? If the TM disconnects (which I assume happens due to the shutdown), surely the job won't reach a running state again.

Ah

> This fixes a race condition where HA data might have been accidentally cleaned up due to the job transitioning to a terminal state.
@dmvk (Member, Author) commented on Apr 20, 2023:

@zentol all comments should be addressed, PTAL

dmvk merged commit c2ab806 into apache:master on Apr 21, 2023
dmvk deleted the FLINK-31803 branch on April 21, 2023 at 10:39