
[FLINK-32362] [connectors/common] increase the robustness of announceCombinedWatermark to cover the case task failover #22806

Merged

Conversation

LoveHeat

See FLINK-32362.

What is the purpose of the change

(For example: This pull request makes task deployment go through the blob server, rather than through RPC. That way we avoid re-transferring them on each deployment (during recovery).)

Brief change log

(for example:)

  • The TaskInfo is stored in the blob store at job creation time as a persistent artifact
  • Deployments RPC transmits only the blob storage reference
  • TaskManagers retrieve the TaskInfo from the blob cache

Verifying this change

Please make sure both new and modified tests in this PR follow the conventions defined in our code quality guide: https://flink.apache.org/contributing/code-style-and-quality-common.html#testing

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end deployment with large payloads (100MB)
  • Extended integration test for recovery after master (JobManager) failure
  • Added test that validates that TaskInfo is transferred only once across recoveries
  • Manually verified the change by running a 4 node cluster with 2 JobManagers and 4 TaskManagers, a stateful streaming program, and killing one JobManager and two TaskManagers during the execution, verifying that recovery happens correctly.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

@flinkbot
Collaborator

flinkbot commented Jun 16, 2023

CI report:

Bot commands
The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

@LoveHeat LoveHeat force-pushed the FLINK-32362-fix-souceAlignment-task-lost branch from cf9faa1 to c38eed0 Compare June 30, 2023 06:50
@LoveHeat LoveHeat changed the title [FLINK-32362] [SourceAlignment] increase the robustness of announceCombinedWatermark to cover the case task failover [FLINK-32362] [connectors/common] increase the robustness of announceCombinedWatermark to cover the case task failover Jun 30, 2023
Comment on lines +231 to +235
final OperatorCoordinator.SubtaskGateway gateway =
subtaskGateways.getOnlyGatewayAndNotCheckReady(subtaskId);
if (gateway != null) {
gateway.sendEvent(event);
}
Contributor

I think there might be a race condition. The coordinator might run this check successfully and start sending an event while, simultaneously, the receiver starts restarting. Shouldn't we have a try/catch here and actually check whether the exception is "retry-able", i.e. things like "subtask has failed", "subtask not ready", or "task is initialising"? 🤔
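A minimal sketch of the kind of "retry-able exception" classification suggested here. None of these names exist in Flink; the matching on message text is purely illustrative of the idea, not how a real implementation would inspect exception types.

```java
// Hypothetical sketch (not Flink code): classify an exception from
// gateway.sendEvent() as retry-able (transient subtask state) or fatal.
final class RetryableCheck {
    static boolean isRetryable(Throwable t) {
        String msg = t.getMessage() == null ? "" : t.getMessage();
        // Transient subtask states could be retried on the next periodic
        // announce; anything else would still escalate to a failover.
        return msg.contains("subtask has failed")
                || msg.contains("subtask not ready")
                || msg.contains("task is initialising");
    }
}
```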

Author


There may be no race condition, because all the operations that get/update SourceCoordinatorContext::subtaskGateways execute in the CoordinatorExecutor (a single-thread executor).
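A minimal illustration (not Flink code) of this argument: when every read and update of the gateway map is submitted to the same single-thread executor, the operations are serialized and cannot race with each other. The class and method names below are invented for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// All map access is funneled through one thread, so a get can never
// interleave with a put; this is the "single thread executor" guarantee.
final class SingleThreadGateways {
    private final Map<Integer, String> subtaskGateways = new HashMap<>();
    private final ExecutorService coordinatorExecutor =
            Executors.newSingleThreadExecutor();

    Future<?> register(int subtaskId, String gateway) {
        return coordinatorExecutor.submit(
                () -> subtaskGateways.put(subtaskId, gateway));
    }

    Future<String> lookup(int subtaskId) {
        return coordinatorExecutor.submit(() -> subtaskGateways.get(subtaskId));
    }

    void shutdown() {
        coordinatorExecutor.shutdown();
    }
}
```

Note this only protects the map itself; as discussed below, it says nothing about whether the subtask behind a gateway is still running.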

Member

@1996fanrui 1996fanrui Jul 4, 2023

Hi @LoveHeat , as I understand it, the operations on subtaskGateways aren't a race condition. Piotr means that when you intend to call gateway.sendEvent(event), the gateway itself may not have a race condition, but the corresponding subtask may be restarting, causing gateway.sendEvent(event) to fail, right?

If so, it's probably fine too. Looking at SourceCoordinatorContext#sendEventToSourceOperator -> SubtaskGatewayImpl#sendEvent: if sendEvent fails, it won't throw the exception directly; it calls subtaskAccess.triggerTaskFailover instead.

BTW, I'm not sure whether SubtaskGatewayImpl#sendEvent should throw FlinkRuntimeException at all. From the current code, the coordinatorExecutor cannot tolerate any exception, and most calls to SubtaskGatewayImpl#sendEvent happen inside the coordinatorExecutor thread. So the coordinatorExecutor will fail after any FlinkRuntimeException is thrown, whereas the expected behavior is that only the task fails.

Contributor

But the corresponding subtask may be restarting, causing gateway.sendEvent(event) to fail, right?

Yes, that's my concern.

Re your 2nd and 3rd paragraphs:
If sendEvent fails, it will throw an exception that bubbles up to the catch-all safety net in schedulePeriodTask, and that will cause the whole job to fail over. Which I guess is currently mostly fine, unless that source is used in a batch job with multiple failover regions. In that case, the expected behaviour is that only the subtasks belonging to the failover region containing the source subtasks restart, not the whole job. That's my main worry with this race condition. That's why I suggested:

Shouldn't we have a try/catch here and actually check whether the exception is "retry-able", i.e. things like "subtask has failed", "subtask not ready", or "task is initialising"? 🤔
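A hypothetical sketch (not the actual Flink code) of the catch-all safety net being described: any exception escaping the periodic announce call is caught at the outermost level and escalates to a whole-job failover, rather than restarting only the affected failover region. All names here are invented for illustration.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// One iteration of a periodic task guarded by a catch-all safety net.
final class PeriodicAnnounceDemo {
    static boolean runOnce(Runnable announce, AtomicBoolean jobFailedOver) {
        try {
            announce.run();          // e.g. announceCombinedWatermark()
            return true;
        } catch (Throwable t) {
            jobFailedOver.set(true); // catch-all: fail over the whole job
            return false;
        }
    }
}
```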

Author

@LoveHeat LoveHeat Jul 4, 2023

But gateway.sendEvent() can only throw three kinds of exceptions:

  1. Task is not ready: I think this exception will never be thrown, because we have already checked that the task is ready before calling sendEvent().
  2. Event serialization failed: I think this exception cannot be retried; we should just fail the job (but I don't know whether failing the job is OK for a batch job?).
  3. As you said: if sendEvent fails, it won't throw the exception directly but will call subtaskAccess.triggerTaskFailover.

Other exceptions (like OOM or JVM errors) also cannot be retried.

(Correct me if I misunderstood your opinion 😃)

@pnowojski @1996fanrui

Contributor

Ahh, I think you are right. Thanks for pointing this out. In that case this code should be fine.

Contributor

@pnowojski pnowojski left a comment

thanks for the fix!

LGTM assuming green build

@pnowojski
Contributor

There seems to be a related failure?

Jun 30 07:41:32 07:41:32.084 [ERROR] Failures: 
Jun 30 07:41:32 07:41:32.084 [ERROR]   SourceCoordinatorAlignmentTest.testWatermarkAlignmentWithTwoGroups:129->assertLatestWatermarkAlignmentEvent:284 

@snuyanzin
Contributor

Just FYI: you probably need to rebase onto the latest commit, since this issue was fixed in https://issues.apache.org/jira/browse/FLINK-32495.

@LoveHeat LoveHeat force-pushed the FLINK-32362-fix-souceAlignment-task-lost branch from c38eed0 to 5c2518a Compare July 5, 2023 02:07
@1996fanrui
Member

Hi @LoveHeat , thanks for the fix.

Could you help backport this PR to release-1.16 and release-1.17? It's better to wait until all CIs pass before merging.

@LoveHeat
Author

LoveHeat commented Jul 5, 2023

1.16 : #22953
1.17: #22954
done @1996fanrui

@1996fanrui
Member

1996fanrui commented Jul 5, 2023

The CI failed due to org.apache.flink.architecture.rules.ConnectorRules.

Hi @snuyanzin , could you help take a look? I see you changed the ArchTag a little in FLINK-32379 recently.

Follow this JIRA: https://issues.apache.org/jira/browse/FLINK-32539

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=50958&view=logs&j=fc5181b0-e452-5c8f-68de-1097947f6483&t=995c650b-6573-581c-9ce6-7ad4cc038461&l=23061

@snuyanzin
Contributor

I confirm that the test_ci misc failure

Jul 05 03:44:53 03:44:53.419 [ERROR] Errors: 
Jul 05 03:44:53 03:44:53.419 [ERROR]   Updating frozen violations is disabled (enable by configuration freeze.store.default.allowStoreUpdate=true)
Jul 05 03:44:53 03:44:53.419 [INFO] 

is related to FLINK-32539 (the root cause is FLINK-27415).
Currently, all builds are failing on the archunit tests after that merge.
FLINK-32539 aims to fix that; the PR has been submitted and approved, but it needs to wait for a successful CI run.

@1996fanrui
Member

Thanks @snuyanzin for the feedback~

@1996fanrui
Member

Hi @LoveHeat , FLINK-32539 is merged, could you rebase the master again? 😂

…CombinedWatermark to cover the case task failover
@LoveHeat LoveHeat force-pushed the FLINK-32362-fix-souceAlignment-task-lost branch from 5c2518a to a82fc51 Compare July 5, 2023 10:44
@LoveHeat
Author

LoveHeat commented Jul 5, 2023

Hi @LoveHeat , FLINK-32539 is merged, could you rebase the master again? 😂

done~ @1996fanrui @pnowojski

@1996fanrui
Member

Thanks to @LoveHeat for the fix, and to @pnowojski for the review.

CI passed, merging~

@1996fanrui 1996fanrui merged commit 5509c61 into apache:master Jul 5, 2023
@LoveHeat LoveHeat deleted the FLINK-32362-fix-souceAlignment-task-lost branch July 6, 2023 02:47
@LoveHeat LoveHeat restored the FLINK-32362-fix-souceAlignment-task-lost branch July 6, 2023 02:47
@LoveHeat LoveHeat deleted the FLINK-32362-fix-souceAlignment-task-lost branch July 6, 2023 02:48