
[FLINK-32362] [connectors/common] increase the robustness of announceCombinedWatermark to cover the case task failover #22806

Merged

Conversation

LoveHeat

See FLINK-32362.

What is the purpose of the change

(For example: This pull request makes task deployment go through the blob server, rather than through RPC. That way we avoid re-transferring them on each deployment (during recovery).)

Brief change log

(for example:)

  • The TaskInfo is stored in the blob store at job creation time as a persistent artifact
  • Deployments RPC transmits only the blob storage reference
  • TaskManagers retrieve the TaskInfo from the blob cache

Verifying this change

Please make sure both new and modified tests in this PR follow the conventions defined in our code quality guide: https://flink.apache.org/contributing/code-style-and-quality-common.html#testing

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end deployment with large payloads (100MB)
  • Extended integration test for recovery after master (JobManager) failure
  • Added test that validates that TaskInfo is transferred only once across recoveries
  • Manually verified the change by running a 4 node cluster with 2 JobManagers and 4 TaskManagers, a stateful streaming program, and killing one JobManager and two TaskManagers during the execution, verifying that recovery happens correctly.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

@flinkbot
Collaborator

flinkbot commented Jun 16, 2023

CI report:

Bot commands
The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

@LoveHeat LoveHeat force-pushed the FLINK-32362-fix-souceAlignment-task-lost branch from cf9faa1 to c38eed0 Compare June 30, 2023 06:50
@LoveHeat LoveHeat changed the title [FLINK-32362] [SourceAlignment] increase the robustness of announceCombinedWatermark to cover the case task failover [FLINK-32362] [connectors/common] increase the robustness of announceCombinedWatermark to cover the case task failover Jun 30, 2023
Comment on lines +231 to +235
final OperatorCoordinator.SubtaskGateway gateway =
subtaskGateways.getOnlyGatewayAndNotCheckReady(subtaskId);
if (gateway != null) {
gateway.sendEvent(event);
}
Contributor

I think there might be a race condition. The coordinator might run this check successfully and start sending an event while, simultaneously, the receiver starts restarting. Shouldn't we have a try/catch here and actually check whether the exception is "retry-able", i.e. things like "subtask has failed", "subtask not ready", or "task is initialising"? 🤔
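A minimal sketch of the kind of "retry-able exception" classification suggested here. None of these names exist in Flink; the matching on message text is purely illustrative of the idea, not how a real implementation would inspect exception types.

```java
// Hypothetical sketch (not Flink code): classify an exception from
// gateway.sendEvent() as retry-able (transient subtask state) or fatal.
final class RetryableCheck {
    static boolean isRetryable(Throwable t) {
        String msg = t.getMessage() == null ? "" : t.getMessage();
        // Transient subtask states could be retried on the next periodic
        // announce; anything else would still escalate to a failover.
        return msg.contains("subtask has failed")
                || msg.contains("subtask not ready")
                || msg.contains("task is initialising");
    }
}
```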

Author


There may be no race condition, because all the operations that get/update SourceCoordinatorContext::subtaskGateways execute in the CoordinatorExecutor (a single-thread executor).
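A minimal illustration (not Flink code) of this argument: when every read and update of the gateway map is submitted to the same single-thread executor, the operations are serialized and cannot race with each other. The class and method names below are invented for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// All map access is funneled through one thread, so a get can never
// interleave with a put; this is the "single thread executor" guarantee.
final class SingleThreadGateways {
    private final Map<Integer, String> subtaskGateways = new HashMap<>();
    private final ExecutorService coordinatorExecutor =
            Executors.newSingleThreadExecutor();

    Future<?> register(int subtaskId, String gateway) {
        return coordinatorExecutor.submit(
                () -> subtaskGateways.put(subtaskId, gateway));
    }

    Future<String> lookup(int subtaskId) {
        return coordinatorExecutor.submit(() -> subtaskGateways.get(subtaskId));
    }

    void shutdown() {
        coordinatorExecutor.shutdown();
    }
}
```

Note this only protects the map itself; as discussed below, it says nothing about whether the subtask behind a gateway is still running.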

Member

@1996fanrui 1996fanrui Jul 4, 2023

Hi @LoveHeat , as I understand it, the operations on subtaskGateways aren't a race condition. Piotr means that when you intend to call gateway.sendEvent(event), the gateway itself may not have a race condition, but the corresponding subtask may be restarting, causing gateway.sendEvent(event) to fail, right?

If so, it's probably fine too. Looking at SourceCoordinatorContext#sendEventToSourceOperator -> SubtaskGatewayImpl#sendEvent: if sendEvent fails, it won't throw the exception directly; it calls subtaskAccess.triggerTaskFailover instead.

BTW, I'm not sure whether SubtaskGatewayImpl#sendEvent should throw FlinkRuntimeException at all. From the current code, the coordinatorExecutor cannot tolerate any exception, and most calls to SubtaskGatewayImpl#sendEvent happen inside the coordinatorExecutor thread. So the coordinatorExecutor will fail after any FlinkRuntimeException is thrown, whereas the expected behavior is that only the task fails.

Contributor

But the corresponding subtask may be restarting, causing gateway.sendEvent(event) to fail, right?

Yes, that's my concern.

Re your 2nd and 3rd paragraphs:
If sendEvent fails, it will throw an exception that bubbles up to the catch-all safety net in schedulePeriodTask, and that will cause the whole job to fail over. Which I guess is currently mostly fine, unless that source is used in a batch job with multiple failover regions. In that case, the expected behaviour is that only the subtasks belonging to the failover region containing the source subtasks restart, not the whole job. That's my main worry with this race condition. That's why I suggested:

Shouldn't we have a try/catch here and actually check whether the exception is "retry-able", i.e. things like "subtask has failed", "subtask not ready", or "task is initialising"? 🤔
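A hypothetical sketch (not the actual Flink code) of the catch-all safety net being described: any exception escaping the periodic announce call is caught at the outermost level and escalates to a whole-job failover, rather than restarting only the affected failover region. All names here are invented for illustration.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// One iteration of a periodic task guarded by a catch-all safety net.
final class PeriodicAnnounceDemo {
    static boolean runOnce(Runnable announce, AtomicBoolean jobFailedOver) {
        try {
            announce.run();          // e.g. announceCombinedWatermark()
            return true;
        } catch (Throwable t) {
            jobFailedOver.set(true); // catch-all: fail over the whole job
            return false;
        }
    }
}
```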

Author

@LoveHeat LoveHeat Jul 4, 2023

But gateway.sendEvent() can only throw three kinds of exceptions:

  1. Task is not ready: I think this exception will never be thrown, because we have already checked that the task is ready before calling sendEvent().
  2. Event serialization failed: I think this exception cannot be retried; we should just fail the job (but I don't know whether failing the job is OK for a batch job?).
  3. As you said: if sendEvent fails, it won't throw the exception directly but will call subtaskAccess.triggerTaskFailover.

Other exceptions (like OOM or JVM errors) also cannot be retried.

(Correct me if I misunderstood your opinion 😃)

@pnowojski @1996fanrui

Contributor

Ahh, I think you are right. Thanks for pointing this out. In that case this code should be fine.

Contributor

@pnowojski pnowojski left a comment

thanks for the fix!

LGTM assuming green build

@pnowojski
Contributor

There seems to be a related failure?

Jun 30 07:41:32 07:41:32.084 [ERROR] Failures: 
Jun 30 07:41:32 07:41:32.084 [ERROR]   SourceCoordinatorAlignmentTest.testWatermarkAlignmentWithTwoGroups:129->assertLatestWatermarkAlignmentEvent:284 

@snuyanzin
Contributor

Just FYI: you probably need to rebase onto the latest commit, since this issue was fixed in https://issues.apache.org/jira/browse/FLINK-32495.

@LoveHeat LoveHeat force-pushed the FLINK-32362-fix-souceAlignment-task-lost branch from c38eed0 to 5c2518a Compare July 5, 2023 02:07
@1996fanrui
Member

Hi @LoveHeat , thanks for the fix.

Could you help backport this PR to release-1.16 and release-1.17? It's better to wait until all CIs pass before merging.

@LoveHeat
Author

LoveHeat commented Jul 5, 2023

1.16 : #22953
1.17: #22954
done @1996fanrui

@1996fanrui
Member

1996fanrui commented Jul 5, 2023

The CI failed due to org.apache.flink.architecture.rules.ConnectorRules.

Hi @snuyanzin , could you help take a look? I see you changed the ArchTag a little in FLINK-32379 recently.

Follow this JIRA: https://issues.apache.org/jira/browse/FLINK-32539

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=50958&view=logs&j=fc5181b0-e452-5c8f-68de-1097947f6483&t=995c650b-6573-581c-9ce6-7ad4cc038461&l=23061

@snuyanzin
Contributor

I confirm that the test_ci misc failure

Jul 05 03:44:53 03:44:53.419 [ERROR] Errors: 
Jul 05 03:44:53 03:44:53.419 [ERROR]   Updating frozen violations is disabled (enable by configuration freeze.store.default.allowStoreUpdate=true)
Jul 05 03:44:53 03:44:53.419 [INFO] 

is related to FLINK-32539 (the root cause is FLINK-27415).
Currently, all builds are failing on the archunit tests after that merge.
FLINK-32539 aims to fix that; the PR has been submitted and approved, but it needs to wait for a successful CI run.

@1996fanrui
Member

Thanks @snuyanzin for the feedback~

@1996fanrui
Member

Hi @LoveHeat , FLINK-32539 is merged, could you rebase the master again? 😂

…CombinedWatermark to cover the case task failover
@LoveHeat LoveHeat force-pushed the FLINK-32362-fix-souceAlignment-task-lost branch from 5c2518a to a82fc51 Compare July 5, 2023 10:44
@LoveHeat
Author

LoveHeat commented Jul 5, 2023

Hi @LoveHeat , FLINK-32539 is merged, could you rebase the master again? 😂

done~ @1996fanrui @pnowojski

@1996fanrui
Member

Thanks to @LoveHeat for the fix, and to @pnowojski for the review.

CI passed, merging~

@1996fanrui 1996fanrui merged commit 5509c61 into apache:master Jul 5, 2023
@LoveHeat LoveHeat deleted the FLINK-32362-fix-souceAlignment-task-lost branch July 6, 2023 02:47
@LoveHeat LoveHeat restored the FLINK-32362-fix-souceAlignment-task-lost branch July 6, 2023 02:47
@LoveHeat LoveHeat deleted the FLINK-32362-fix-souceAlignment-task-lost branch July 6, 2023 02:48