@kristoffSC
Contributor

What is the purpose of the change

Fixes FLINK-29627, where recovering more than one Committable caused an IllegalStateException and prevented the cluster from starting.

The fix merges committables that belong to the same subtaskId during recovery in the Sink V2 architecture.

Brief change log

  • Enhance CheckpointSimpleVersionedSerializer::deserialize to call SubtaskCommittableManager::merge for committables with the same subtaskId.
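The merge-on-deserialize idea can be sketched in plain Java. This is a minimal illustration, not the Flink implementation: `SubtaskManager`, its fields, and the `strict`/`merging` helpers are hypothetical stand-ins for `SubtaskCommittableManager` and the deserialization loop, and the sketch assumes the original failure came from building the map in a way that rejects duplicate keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical stand-in for SubtaskCommittableManager: a subtask id plus its pending committables.
class SubtaskManager {
    final int subtaskId;
    final List<Integer> committables;

    SubtaskManager(int subtaskId, List<Integer> committables) {
        this.subtaskId = subtaskId;
        this.committables = new ArrayList<>(committables);
    }

    // Returns a new manager holding the committables of both inputs (same subtask id assumed).
    SubtaskManager merge(SubtaskManager other) {
        List<Integer> all = new ArrayList<>(committables);
        all.addAll(other.committables);
        return new SubtaskManager(subtaskId, all);
    }
}

class MergeOnRecoverySketch {
    // Rejects duplicate subtask ids: Collectors.toMap without a merge function
    // throws IllegalStateException when two elements map to the same key.
    static Map<Integer, SubtaskManager> strict(List<SubtaskManager> recovered) {
        return recovered.stream()
                .collect(Collectors.toMap(m -> m.subtaskId, Function.identity()));
    }

    // Merges managers that share a subtask id instead of failing.
    static Map<Integer, SubtaskManager> merging(List<SubtaskManager> recovered) {
        Map<Integer, SubtaskManager> bySubtask = new HashMap<>();
        for (SubtaskManager m : recovered) {
            bySubtask.merge(m.subtaskId, m, SubtaskManager::merge);
        }
        return bySubtask;
    }

    public static void main(String[] args) {
        // Two recovered managers for the same subtask id (hypothetical values).
        List<SubtaskManager> recovered = Arrays.asList(
                new SubtaskManager(1, Arrays.asList(10)),
                new SubtaskManager(1, Arrays.asList(20)));
        try {
            strict(recovered);
        } catch (IllegalStateException e) {
            System.out.println("strict grouping failed on a duplicate key");
        }
        System.out.println(merging(recovered).get(1).committables);
    }
}
```

Using `Map.merge` keeps the grouping in a single pass and leaves the decision of how two managers combine to the manager type itself.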

Verifying this change

  • Add new test CommittableCollectorSerializerTest::testCommittablesForSameSubtaskIdV2SerDe to verify deserialization of multiple committables for the same subtaskId.
  • Enhance CommitterOperatorTest::testStateRestore to include committables for the same subtaskId.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (yes)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (not applicable)

@kristoffSC kristoffSC marked this pull request as ready for review October 18, 2022 13:56
@flinkbot
Collaborator

flinkbot commented Oct 18, 2022

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

Contributor

@fapaul fapaul left a comment

Left some comments to improve the test setup, hopefully. I'll take another more in-depth look tomorrow, but I think we can finish it up by EOD tomorrow.

Please also adjust the commit message formatting, i.e., [FLINK-29627].

assertThat(subtaskCommittableManagerCheckpoint2.getSubtaskId())
.isEqualTo(subtaskId);

int i = 0;
Contributor

Since expectedNumberOfPendingRequestsPerCommittable and committableIterator always have the same size, you can zip them (Guava's Streams.zip); it also makes the loop look nicer.

Contributor Author

I have implemented your suggestion;
however, Streams.zip requires a BiFunction as its third argument, and that function has to return something.
In my case it returns a Pair of the CheckpointCommittableManagerImpl and List<Integer> expectedNumberOfPendingRequestsPerCommittable elements from the respective streams.

In my opinion it is slightly less readable than the previous version with the while loop, but I'm OK with Streams.zip, since we don't have to maintain the index.
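For readers unfamiliar with the pattern, here is a rough sketch of the zipping approach in plain Java. It does not use Guava: the `zip` helper is a hand-rolled stand-in that mirrors the shape of `Streams.zip` (two inputs plus a combining `BiFunction`), and the string "managers" and the expected values are hypothetical placeholders for the real test objects.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

class ZipSketch {
    // Hand-rolled stand-in for Guava's Streams.zip: combines elements pairwise
    // and stops at the end of the shorter input.
    static <A, B, R> List<R> zip(
            List<A> left, List<B> right, BiFunction<? super A, ? super B, R> combine) {
        List<R> out = new ArrayList<>();
        Iterator<A> l = left.iterator();
        Iterator<B> r = right.iterator();
        while (l.hasNext() && r.hasNext()) {
            out.add(combine.apply(l.next(), r.next()));
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical data: the names stand in for CheckpointCommittableManagerImpl instances.
        List<String> managers = Arrays.asList("checkpoint-1", "checkpoint-2");
        List<List<Integer>> expected = Arrays.asList(Arrays.asList(1), Arrays.asList(2, 3));

        // The combining function has to return something, so each pair is wrapped in a Map.Entry.
        List<Map.Entry<String, List<Integer>>> pairs =
                zip(managers, expected,
                        (m, e) -> new SimpleImmutableEntry<String, List<Integer>>(m, e));

        // No index to maintain: each pair carries the manager and its expectation together.
        for (Map.Entry<String, List<Integer>> pair : pairs) {
            System.out.println(pair.getKey() + " -> " + pair.getValue());
        }
    }
}
```

The trade-off matches the one described in the comment: the intermediate pair type adds some noise, but the loop no longer needs a manually maintained index.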

.map(CommitRequestImpl::getCommittable)
.collect(Collectors.toList()))
.containsExactly((Integer) expectedPendingRequestCount);
} else if (expectedPendingRequestCount instanceof Integer[]) {
Contributor

Can't you use containsExactlyElementsOf and always pass a list of integers?

Contributor Author

Thanks for the hint. I was looking for an alternative method in the API but must have missed that one; I was focused on finding one that also checks the order, as containsExactly does.

I've implemented your suggestion, which made the assertPendingRequests method much simpler.
However, assertCommittableCollector now has to accept List<List<Integer>> expectedNumbersOfPendingRequestsPerCommittable as an argument.

The reason is that in the testCommittableCollectorV2SerDe test we have two committable managers with one committable each, whereas in the testCommittablesForSameSubtaskIdV2SerDe test we have one committable manager with two committables after recovery.
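To make the two shapes concrete, here is a small plain-Java illustration of why the expectation becomes a list of lists: the outer list has one entry per committable manager, the inner list one entry per committable. The values and the `matches` helper are hypothetical, and `List.equals` stands in for an order-sensitive AssertJ check such as `containsExactlyElementsOf`.

```java
import java.util.Arrays;
import java.util.List;

class ExpectationShapes {
    // Order-sensitive comparison of each manager's committables against its expected list.
    static boolean matches(
            List<List<Integer>> actualPerManager, List<List<Integer>> expectedPerManager) {
        return actualPerManager.equals(expectedPerManager);
    }

    public static void main(String[] args) {
        // Two managers with one committable each (hypothetical values).
        List<List<Integer>> twoManagers = Arrays.asList(Arrays.asList(1), Arrays.asList(2));
        // One manager holding two committables after recovery (hypothetical values).
        List<List<Integer>> oneManager = Arrays.asList(Arrays.asList(1, 2));

        System.out.println(matches(twoManagers, Arrays.asList(Arrays.asList(1), Arrays.asList(2))));
        System.out.println(matches(oneManager, Arrays.asList(Arrays.asList(1, 2))));
        // The same committables grouped differently do not match.
        System.out.println(matches(oneManager, twoManagers));
    }
}
```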

@kristoffSC kristoffSC force-pushed the FLINK-29627_master_noSinkItTest branch 3 times, most recently from 2bd6daf to 9817234 Compare October 19, 2022 09:25
@kristoffSC kristoffSC requested a review from fapaul October 19, 2022 09:29
Contributor

@fapaul fapaul left a comment

LGTM % rename one of the variables for the inline comment.

Can you also create the backports for 1.16/1.15?

CommittableCollector<Integer> committableCollector) {
int expectedNumberOfSubtasks,
CommittableCollector<Integer> committableCollector,
List<List<Integer>> expectedNumbersOfPendingRequestsPerCommittable) {
Contributor

Nit: The variable name is misleading. Shouldn't this be committablesPerSubtaskPerCheckpoint or something similar? expectedNumbers implies it refers to a count, but we are matching the content of the committables.

Contributor Author

done

… more than 1 committable.

Recovering more than one Committable causes an `IllegalStateException` and prevents the cluster from starting.

When we recover the `CheckpointCommittableManager`, we deserialize `SubtaskCommittableManager` instances from the recovery state and put them into a `Map<Integer, SubtaskCommittableManager<CommT>>`. The key of this map is the subtaskId of the recovered manager. However, this fails if more than one committable has to be recovered for the same subtaskId, because the key is then duplicated.

The fix calls `SubtaskCommittableManager::merge` if a manager has already been deserialized for this subtaskId.

Signed-off-by: Krzysztof Chmielewski <krzysiek.chmielewski@gmail.com>
@kristoffSC
Contributor Author

kristoffSC commented Oct 19, 2022

@fapaul
Backports:
1.15 - #21113
1.16 - #21115

@kristoffSC kristoffSC changed the title from "[Flink 29627][streaming] Fix duplicate key exception during recover more than 1 committable." to "[Flink 29627][streaming] Fix duplicate key exception during recovery more than 1 committable." Oct 19, 2022
@kristoffSC
Contributor Author

@flinkbot run azure

@fapaul fapaul merged commit ac044f8 into apache:master Oct 20, 2022