Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(kafka): ensure allocated resources are removed on failures #10813

Merged
merged 10 commits into from
May 29, 2023

Conversation

thalesmg
Copy link
Contributor

@thalesmg thalesmg commented May 24, 2023

Fixes https://emqx.atlassian.net/browse/EMQX-9936

This also introduces the following changes that were uncovered during development:

  • Improved Kafka Consumer logging when the node is shutting down to reduce noise.

  • We now ensure the bridge resources are removed if an exception happens during its creation to avoid leaking resources.

  • We changed the emqx_resource_manager_sup restart strategy from simple_one_for_one to one_for_one.

    Using simple_one_for_one has a potential race condition issue where we read the PID of the resource manager before trying to remove a resource, and then that PID changes because it was either dead at first, or it crashed and changed, and later we use this stale PID to try to remove it from the supervisor. Under such circumstances, the restarting child might linger in the supervisor, leaking resources.

    By using the resource ID itself as a child ID (and using one_for_one restart strategy), we ensure the child is truly removed.

Summary

🤖 Generated by Copilot at 1f74d26

This pull request adds resource management and telemetry for the kafka and pulsar bridges using the emqx_resource module. It also updates the version numbers, the change logs, and the test cases for the bridge applications. The changes aim to improve the reliability and observability of the bridges.

PR Checklist

Please convert it to a draft if any of the following conditions are not met. Reviewers may skip over until all the items are checked:

  • Added tests for the changes
  • Changed lines covered in coverage report
  • Change log has been added to changes/{ce,ee}/(feat|perf|fix)-<PR-id>.en.md files
  • For internal contributor: there is a jira ticket to track this change
  • If there should be document changes, a PR to emqx-docs.git is sent, or a jira ticket is created to follow up
  • Schema changes are backward compatible

Checklist for CI (.github/workflows) changes

  • If changed package build workflow, pass this action (manual trigger)
  • Change log has been added to changes/ dir for user-facing artifacts update

@thalesmg thalesmg force-pushed the refactor-kafka-on-stop-v50 branch 2 times, most recently from 6f6d5c4 to 0154d89 Compare May 25, 2023 18:42
@thalesmg thalesmg force-pushed the refactor-kafka-on-stop-v50 branch from 0154d89 to 6308110 Compare May 25, 2023 19:38
…or_one`

Using `simple_one_for_one` has a potential race condition issue where
we read the PID of the resource manager before trying to remove a
resource, and then that PID changes because it was either dead at
first, or it crashed and changed, and later we use this stale PID to
try to remove it from the supervisor.  Under such circumstances, the
restarting child might linger in the supervisor, leaking resources.

By using the resource ID itself as a child ID (and using `one_for_one`
restart strategy), we ensure the child is truly removed.
@thalesmg thalesmg force-pushed the refactor-kafka-on-stop-v50 branch from 6308110 to 32e6213 Compare May 25, 2023 21:07
@thalesmg thalesmg marked this pull request as ready for review May 26, 2023 14:00
@thalesmg thalesmg requested a review from a team as a code owner May 26, 2023 14:00
Makefile Outdated Show resolved Hide resolved
@coveralls
Copy link
Collaborator

coveralls commented May 26, 2023

Pull Request Test Coverage Report for Build 5094422552

  • 52 of 59 (88.14%) changed or added relevant lines in 7 files are covered.
  • 132 unchanged lines in 17 files lost coverage.
  • Overall coverage decreased (-0.04%) to 81.595%

Changes Missing Coverage Covered Lines Changed/Added Lines %
apps/emqx_bridge/src/emqx_bridge.erl 5 6 83.33%
apps/emqx_bridge_kafka/src/emqx_bridge_kafka_impl_consumer.erl 20 23 86.96%
apps/emqx_bridge_kafka/src/emqx_bridge_kafka_impl_producer.erl 17 20 85.0%
Files with Coverage Reduction New Missed Lines %
apps/emqx_bridge_kafka/src/emqx_bridge_kafka_impl_consumer.erl 1 81.76%
apps/emqx_gateway_mqttsn/src/emqx_mqttsn_channel.erl 1 73.18%
apps/emqx_gateway_mqttsn/src/emqx_mqttsn_frame.erl 1 63.48%
apps/emqx_s3/src/emqx_s3_profile_conf.erl 1 93.07%
apps/emqx/src/emqx_cm.erl 1 91.6%
apps/emqx/src/emqx_limiter/src/emqx_limiter_schema.erl 1 75.44%
apps/emqx/src/emqx_mqueue.erl 1 89.86%
apps/emqx/src/emqx_quic_data_stream.erl 1 75.21%
apps/emqx_machine/src/emqx_machine_boot.erl 2 77.46%
apps/emqx/src/emqx_banned.erl 2 89.39%
Totals Coverage Status
Change from base Build 5080911934: -0.04%
Covered Lines: 28626
Relevant Lines: 35083

💛 - Coveralls

thalesmg and others added 2 commits May 26, 2023 11:29
Co-authored-by: Zaiming (Stone) Shi <zmstone@gmail.com>
@thalesmg thalesmg requested a review from lafirest as a code owner May 26, 2023 19:44
Comment on lines +374 to +375
kind => Kind,
error => Error,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
kind => Kind,
error => Error,
exception => Kind,
reason => Error,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll fix it in the next on_stop refactoring PR.

@thalesmg thalesmg merged commit 67e182e into emqx:master May 29, 2023
164 checks passed
@thalesmg thalesmg deleted the refactor-kafka-on-stop-v50 branch May 29, 2023 19:49
@yanzhiemq
Copy link
Collaborator

yanzhiemq commented Jun 13, 2023

Enhancements

  • Refactored Kafka Producer and Consumer bridges to avoid leaking resources during crashes at creation.

@yanzhiemq
Copy link
Collaborator

yanzhiemq commented Jun 13, 2023

增强

  • 重构了Kafka 生产者和消费者桥接,在创建过程中避免资源泄漏。

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants