[FLINK-31803] Harden UpdateJobResourceRequirementsRecoveryITCase. #22408
Which terminal state did the job enter, and why?
Couldn't the test still fail if the update call is made while a restarted job is still initializing?
configuration.set(JobManagerOptions.SCHEDULER, JobManagerOptions.SchedulerType.Adaptive);
I don't understand this change.
This is meant in combination with:
assumeThat(ClusterOptions.isAdaptiveSchedulerEnabled(configuration)).isTrue();
which ensures this is already set
The main idea is to skip this test when smoke testing the default scheduler, because it can run for more than 10 seconds.
I'd just consider this a test for the adaptive scheduler, and we don't categorically skip those in PRs.
If test times are a concern, well then those should be addressed anyway.
> I'd just consider this a test for the adaptive scheduler, and we don't categorically skip those in PRs.
Fixed.
> If test times are a concern, well then those should be addressed anyway.
The proper fix would be to have a real HA setup for testing that doesn't require running ZooKeeper, which is out of scope for now 😢
The job entered FAILED: the task failed during the TM disconnect, and the NoRestartStrategy tore the job down.
I think this is prevented by calling |
We don't make any more updates after spinning up the 2nd cluster. The job restart this PR refers to is caused by race conditions during |
Ah |
This fixes a race condition where HA data might have been accidentally cleaned up due to job transition to the terminal state.
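To make the race easier to picture, here is a toy model in plain Java. It is not Flink code: `HaCleanupModel`, `HaStore`, `RestartPolicy`, and `onTaskFailure` are hypothetical names invented for this sketch. It assumes only what the conversation states, namely that a task failure with no restarts allowed drives the job to FAILED, and that reaching a terminal state triggers cleanup of the HA data.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model (not Flink code): HA data is dropped once a job reaches a terminal state. */
final class HaCleanupModel {

    enum JobState { RUNNING, RESTARTING, FAILED }

    /** Hypothetical stand-in for a restart strategy decision. */
    interface RestartPolicy {
        boolean canRestart();
    }

    /** Hypothetical stand-in for the HA job store. */
    static final class HaStore {
        private final Set<String> jobs = new HashSet<>();

        void put(String jobId) { jobs.add(jobId); }

        void remove(String jobId) { jobs.remove(jobId); }

        boolean contains(String jobId) { return jobs.contains(jobId); }
    }

    /** Applies a task failure; HA data is removed only if the job ends up in FAILED. */
    static JobState onTaskFailure(String jobId, RestartPolicy policy, HaStore store) {
        JobState next = policy.canRestart() ? JobState.RESTARTING : JobState.FAILED;
        if (next == JobState.FAILED) {
            // Terminal state: the job is considered done, so its HA data is cleaned up.
            store.remove(jobId);
        }
        return next;
    }

    public static void main(String[] args) {
        // No restarts allowed: the failure is terminal and the HA entry disappears.
        HaStore noRestart = new HaStore();
        noRestart.put("job-1");
        JobState s1 = onTaskFailure("job-1", () -> false, noRestart);
        System.out.println(s1 + ", recoverable=" + noRestart.contains("job-1"));

        // Restarts allowed: the job is not terminal, so the HA entry survives.
        HaStore withRestart = new HaStore();
        withRestart.put("job-1");
        JobState s2 = onTaskFailure("job-1", () -> true, withRestart);
        System.out.println(s2 + ", recoverable=" + withRestart.contains("job-1"));
    }
}
```

In this model, a second cluster can only recover the job in the second case. How the PR itself avoids the terminal transition is not shown here.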
@zentol all comments should be addressed, PTAL
https://issues.apache.org/jira/browse/FLINK-31803