Skip to content

Conversation

@1996fanrui
Copy link
Member

@1996fanrui 1996fanrui commented Jul 12, 2023

What is the purpose of the change

Fix thread-safe bug of channel state executor when some subtasks are closed while other subtasks are starting.

When one subtask is closing, it will call the ChannelStateWriteRequestExecutorImpl#releaseSubtask. The releaseSubtask method will hold 2 locks at the different time.

  • The first one is ChannelStateWriteRequestExecutorImpl#lock, and update the isRegistering from true to false.
  • The second one is onRegistered.accept(this) will hold ChannelStateWriteRequestExecutorFactory#lock and update the ChannelStateWriteRequestExecutorFactory#executor

However, these 2 locks are held at the different time. The following calls will meet a bug:

  • Subtask 1 finished the first step of releaseSubtask, and doesn't hold the second lock.
  • Subtask 2 is starting, and call the executor#registerSubtask

Subtask2 will throw This executor has been registered.

Brief change log

Ensure that update the isRegistering and call the onRegistered within one registerLock. They should guarantee idempotence.

Verifying this change

Created 2 threads, one thread creates the executor and register subtask, one thread releases these subtask. This bug can reproduce by the following test .

This change added tests and can be verified as follows:

  • ChannelStateWriteRequestExecutorFactoryTest#testSomeSubtasksCloseDuringOtherSubtasksStarting

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive):no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no

@flinkbot
Copy link
Collaborator

flinkbot commented Jul 12, 2023

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

Copy link
Member

@reswqa reswqa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @1996fanrui for the quick fix, this changes overall looks good to me. I only left some minor comment.

Copy link
Member

@reswqa reswqa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update, +1 for merging.

…or when some subtasks are closed while other subtasks are starting
@1996fanrui
Copy link
Member Author

Thanks a lot @reswqa for the review, merging~

@1996fanrui 1996fanrui merged commit bc4c21e into apache:master Jul 13, 2023
@1996fanrui 1996fanrui deleted the 32049/UC-executor-registered branch July 13, 2023 07:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants