
[FLINK-31803] Harden UpdateJobResourceRequirementsRecoveryITCase. #22408

Merged: 1 commit merged into apache:master on Apr 21, 2023

Conversation

@dmvk (Member) commented on Apr 17, 2023

This fixes a race condition where HA data might have been accidentally cleaned up due to the job transitioning to a terminal state.

https://issues.apache.org/jira/browse/FLINK-31803
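The discussion below traces the problem to the default NoRestartStrategy: when a TaskManager disconnects during shutdown, the task failure fails the whole job, the job reaches a terminal state, and its HA data is cleaned up. A minimal sketch of the hardening idea under that assumption, using the standard Flink restart-strategy options (an illustration of the mechanism, not the literal patch):

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.RestartStrategyOptions;

import java.time.Duration;

class RestartStrategyHardeningSketch {
    // Hypothetical helper: give the test job a restart strategy so a transient
    // TaskManager disconnect during cluster shutdown does not push the job into
    // FAILED (as the default NoRestartStrategy would), which would in turn
    // trigger cleanup of the job's HA data before the second cluster can recover it.
    static Configuration withRestartStrategy(Configuration configuration) {
        configuration.set(RestartStrategyOptions.RESTART_STRATEGY, "fixed-delay");
        configuration.set(
                RestartStrategyOptions.RESTART_STRATEGY_FIXED_DELAY_ATTEMPTS, Integer.MAX_VALUE);
        configuration.set(
                RestartStrategyOptions.RESTART_STRATEGY_FIXED_DELAY_DELAY, Duration.ofMillis(10));
        return configuration;
    }
}
```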

@flinkbot (Collaborator) commented on Apr 17, 2023

CI report:

Bot commands: the @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

@zentol (Contributor) left a comment

Which terminal state did the job enter, and why?

Couldn't the test still fail if the update call is made while a restarted job is still initializing?


configuration.set(JobManagerOptions.SCHEDULER, JobManagerOptions.SchedulerType.Adaptive);
@zentol (Contributor):

I don't understand this change.

@dmvk (Member, Author):

This is meant in combination with:

        assumeThat(ClusterOptions.isAdaptiveSchedulerEnabled(configuration)).isTrue();

which ensures this is already set.

@dmvk (Member, Author):

The main idea is to skip this test when smoke testing the default scheduler, because it can run for > 10s.

@zentol (Contributor) commented on Apr 18, 2023:
I'd just consider this a test for the adaptive scheduler, and we don't categorically skip those in PRs.

If test times are a concern, well then those should be addressed anyway.

@dmvk (Member, Author):

> I'd just consider this a test for the adaptive scheduler, and we don't categorically skip those in PRs.

Fixed.

> If test times are a concern, well then those should be addressed anyway.

The proper fix would be a real HA setup for testing that doesn't require running ZooKeeper, which is out of scope for now 😢

@dmvk (Member, Author) commented on Apr 17, 2023:

> Which terminal state did the job enter, and why?

FAILED, because during the TM disconnect the task failed and the NoRestartStrategy tore the job down.

> Couldn't the test still fail if the update call is made while a restarted job is still initializing?

I think this is prevented by calling waitUntilJobInitializationFinished before the update, or am I missing something?
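For illustration, that wait amounts to polling the job status until it leaves INITIALIZING before the requirements update is sent. A rough sketch against the MiniCluster API, assuming a MiniCluster handle is available (the actual test relies on the existing waitUntilJobInitializationFinished utility; the helper below is a made-up stand-in):

```java
import org.apache.flink.api.common.JobID;
import org.apache.flink.api.common.JobStatus;
import org.apache.flink.runtime.minicluster.MiniCluster;

class JobInitializationGuardSketch {
    // Hypothetical stand-in for waitUntilJobInitializationFinished: poll the job
    // status until it leaves INITIALIZING, so the resource requirements update is
    // only issued against a fully initialized job.
    static void awaitInitialized(MiniCluster miniCluster, JobID jobId) throws Exception {
        while (true) {
            JobStatus status = miniCluster.getJobStatus(jobId).get();
            if (status == JobStatus.INITIALIZING) {
                Thread.sleep(50);
                continue;
            }
            if (status.isGloballyTerminalState()) {
                throw new IllegalStateException("Job reached terminal state " + status);
            }
            return; // Initialization finished; safe to send the update now.
        }
    }
}
```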

dmvk requested a review from zentol on April 18, 2023 at 07:09
@dmvk (Member, Author) commented on Apr 18, 2023:

> Couldn't the test still fail if the update call is made while a restarted job is still initializing?

We don't make any more updates after spinning up the 2nd cluster. The job restart this PR refers to is caused by a race condition during closeAsyncWithoutCleaningHighAvailabilityData, which happens after the update.
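For context, the ordering being discussed looks roughly like the sketch below: the requirements update happens on the first cluster, that cluster is then closed without cleaning HA data, and only afterwards is the second cluster started to recover the job. This is a condensed illustration, not the actual test; it assumes the two MiniClusters share the same HA backend (which the real test wires up explicitly) and elides the update call and assertions.

```java
import org.apache.flink.api.common.JobID;
import org.apache.flink.api.common.JobStatus;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.minicluster.MiniCluster;
import org.apache.flink.runtime.minicluster.MiniClusterConfiguration;

class RecoveryOrderingSketch {
    static void restartClusterAndAwaitRecovery(
            MiniCluster firstCluster, Configuration configuration, JobID jobId) throws Exception {
        // Shut down the first cluster but keep the HA data (job graph, updated
        // requirements) so the second cluster can recover the job. The race being
        // fixed: if the job reaches a terminal state around this point, its HA
        // data is cleaned up and there is nothing left to recover.
        firstCluster.closeAsyncWithoutCleaningHighAvailabilityData().get();

        try (MiniCluster secondCluster =
                new MiniCluster(
                        new MiniClusterConfiguration.Builder()
                                .setConfiguration(configuration)
                                .build())) {
            secondCluster.start();
            // No further updates are issued here; we only wait for the recovered
            // job to reach RUNNING again.
            while (secondCluster.getJobStatus(jobId).get() != JobStatus.RUNNING) {
                Thread.sleep(50);
            }
        }
    }
}
```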

@zentol (Contributor) commented on Apr 18, 2023:

> FAILED, because during the TM disconnect the task failed and the NoRestartStrategy tore the job down.

How does the restart strategy fix that? If the TM disconnects (which I assume happens due to the shutdown), surely the job won't reach a running state again.

Ah

> This fixes a race condition where HA data might have been accidentally cleaned up due to the job transitioning to a terminal state.
@dmvk (Member, Author) commented on Apr 20, 2023:

@zentol all comments should be addressed, PTAL

dmvk merged commit c2ab806 into apache:master on Apr 21, 2023
dmvk deleted the FLINK-31803 branch on April 21, 2023 at 10:39