feat(kafka): ensure allocated resources are removed on failures #10813
Merged
Example failure: https://github.com/emqx/emqx/actions/runs/5070096710/jobs/9105822319#step:7:515
The attempt here is to set up the spy as early as possible, before the bridge starts, so that we avoid missing rebalancing events.
…or_one`
Using `simple_one_for_one` has a potential race condition: we read the PID of the resource manager before trying to remove a resource, and then that PID changes, either because the process was dead at first or because it crashed and was restarted. Later we use this stale PID to try to remove the child from the supervisor. Under such circumstances, the restarting child might linger in the supervisor, leaking resources.
By using the resource ID itself as the child ID (and using the `one_for_one` restart strategy), we ensure the child is truly removed.
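The idea in this commit message can be sketched roughly as follows. This is a minimal illustration, not the actual `emqx_resource_manager_sup` code: the module name `res_sup_sketch`, the functions `ensure_child/1` and `remove_child/1`, and the placeholder worker are all hypothetical. It relies only on standard OTP behaviour: under `one_for_one`, `supervisor:terminate_child/2` and `supervisor:delete_child/2` take the child ID, whereas under `simple_one_for_one` a child can only be terminated by PID.

```erlang
-module(res_sup_sketch).
-behaviour(supervisor).

-export([start_link/0, ensure_child/1, remove_child/1]).
-export([init/1, start_worker/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% one_for_one with no static children; children are added dynamically.
    SupFlags = #{strategy => one_for_one, intensity => 10, period => 10},
    {ok, {SupFlags, []}}.

%% Start a child whose ID is the resource ID itself (hypothetical helper).
ensure_child(ResId) ->
    ChildSpec = #{
        id => ResId,
        start => {?MODULE, start_worker, [ResId]},
        restart => transient,
        shutdown => 5000,
        type => worker
    },
    case supervisor:start_child(?MODULE, ChildSpec) of
        {ok, Pid} -> {ok, Pid};
        {error, {already_started, Pid}} -> {ok, Pid};
        {error, Reason} -> {error, Reason}
    end.

%% Remove by resource ID. No PID is looked up beforehand, so a crash and
%% restart between lookup and removal cannot leave a stale reference.
remove_child(ResId) ->
    case supervisor:terminate_child(?MODULE, ResId) of
        ok -> supervisor:delete_child(?MODULE, ResId);
        {error, not_found} -> ok;
        {error, Reason} -> {error, Reason}
    end.

%% Placeholder worker standing in for a real resource manager process.
start_worker(_ResId) ->
    Pid = spawn_link(fun() -> receive stop -> ok end end),
    {ok, Pid}.
```

Because the child is addressed by a stable ID, removal targets whatever process currently occupies that slot, including one that is in the middle of a restart.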
zmstone reviewed May 26, 2023
Co-authored-by: Zaiming (Stone) Shi <zmstone@gmail.com>
zmstone reviewed May 29, 2023
zmstone reviewed May 29, 2023
zmstone approved these changes May 29, 2023
Comment on lines +374 to +375
    kind => Kind,
    error => Error,
Suggested change
-   kind => Kind,
-   error => Error,
+   exception => Kind,
+   reason => Error,
I'll fix it in the next `on_stop` refactoring PR.
Fixes https://emqx.atlassian.net/browse/EMQX-9936
This also introduces the following changes that were uncovered during development:
Improved Kafka Consumer logging when the node is shutting down to reduce noise.
We now ensure the bridge resources are removed if an exception happens during its creation to avoid leaking resources.
We changed the `emqx_resource_manager_sup` restart strategy from `simple_one_for_one` to `one_for_one`.
Using `simple_one_for_one` has a potential race condition: we read the PID of the resource manager before trying to remove a resource, and then that PID changes, either because the process was dead at first or because it crashed and was restarted. Later we use this stale PID to try to remove the child from the supervisor. Under such circumstances, the restarting child might linger in the supervisor, leaking resources.
By using the resource ID itself as the child ID (and using the `one_for_one` restart strategy), we ensure the child is truly removed.
Summary
🤖 Generated by Copilot at 1f74d26
This pull request adds resource management and telemetry for the Kafka and Pulsar bridges using the `emqx_resource` module. It also updates the version numbers, the change logs, and the test cases for the bridge applications. The changes aim to improve the reliability and observability of the bridges.
PR Checklist
Please convert it to a draft if any of the following conditions are not met. Reviewers may skip over until all the items are checked:
`changes/{ce,ee}/(feat|perf|fix)-<PR-id>.en.md` files
Checklist for CI (.github/workflows) changes
`changes/` dir for user-facing artifacts update