feat(kafka): ensure allocated resources are removed on failures #10813

thalesmg · 2023-05-24T19:46:46Z

Fixes https://emqx.atlassian.net/browse/EMQX-9936

This also introduces the following changes that were uncovered during development:

Improved Kafka Consumer logging when the node is shutting down to reduce noise.
We now ensure the bridge resources are removed if an exception happens during its creation to avoid leaking resources.
We changed the emqx_resource_manager_sup restart strategy from simple_one_for_one to one_for_one.

Using simple_one_for_one has a potential race condition issue where we read the PID of the resource manager before trying to remove a resource, and then that PID changes because it was either dead at first, or it crashed and changed, and later we use this stale PID to try to remove it from the supervisor. Under such circumstances, the restarting child might linger in the supervisor, leaking resources.

By using the resource ID itself as a child ID (and using one_for_one restart strategy), we ensure the child is truly removed.

Summary

`🤖 Generated by Copilot at 1f74d26`

This pull request adds resource management and telemetry for the kafka and pulsar bridges using the emqx_resource module. It also updates the version numbers, the change logs, and the test cases for the bridge applications. The changes aim to improve the reliability and observability of the bridges.

PR Checklist

Please convert it to a draft if any of the following conditions are not met. Reviewers may skip over until all the items are checked:

Added tests for the changes
Changed lines covered in coverage report
Change log has been added to changes/{ce,ee}/(feat|perf|fix)-<PR-id>.en.md files
For internal contributor: there is a jira ticket to track this change
If there should be document changes, a PR to emqx-docs.git is sent, or a jira ticket is created to follow up
Schema changes are backward compatible

Checklist for CI (.github/workflows) changes

If changed package build workflow, pass this action (manual trigger)
Change log has been added to changes/ dir for user-facing artifacts update

Example failure: https://github.com/emqx/emqx/actions/runs/5070096710/jobs/9105822319#step:7:515 The attempt here is to setup the spy as early as possible, before the bridge starts, so we avoid missing rebalancing events.

…or_one` Using `simple_one_for_one` has a potential race condition issue where we read the PID of the resource manager before trying to remove a resource, and then that PID changes because it was either dead at first, or it crashed and changed, and later we use this stale PID to try to remove it from the supervisor. Under such circumstances, the restarting child might linger in the supervisor, leaking resources. By using the resource ID itself as a child ID (and using `one_for_one` restart strategy), we ensure the child is truly removed.

Makefile

coveralls · 2023-05-26T14:09:07Z

Pull Request Test Coverage Report for Build 5094422552

52 of 59 (88.14%) changed or added relevant lines in 7 files are covered.
132 unchanged lines in 17 files lost coverage.
Overall coverage decreased (-0.04%) to 81.595%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
apps/emqx_bridge/src/emqx_bridge.erl	5	6	83.33%
apps/emqx_bridge_kafka/src/emqx_bridge_kafka_impl_consumer.erl	20	23	86.96%
apps/emqx_bridge_kafka/src/emqx_bridge_kafka_impl_producer.erl	17	20	85.0%

Files with Coverage Reduction	New Missed Lines	%
apps/emqx_bridge_kafka/src/emqx_bridge_kafka_impl_consumer.erl	1	81.76%
apps/emqx_gateway_mqttsn/src/emqx_mqttsn_channel.erl	1	73.18%
apps/emqx_gateway_mqttsn/src/emqx_mqttsn_frame.erl	1	63.48%
apps/emqx_s3/src/emqx_s3_profile_conf.erl	1	93.07%
apps/emqx/src/emqx_cm.erl	1	91.6%
apps/emqx/src/emqx_limiter/src/emqx_limiter_schema.erl	1	75.44%
apps/emqx/src/emqx_mqueue.erl	1	89.86%
apps/emqx/src/emqx_quic_data_stream.erl	1	75.21%
apps/emqx_machine/src/emqx_machine_boot.erl	2	77.46%
apps/emqx/src/emqx_banned.erl	2	89.39%

Totals
Change from base Build 5080911934:	-0.04%
Covered Lines:	28626
Relevant Lines:	35083

💛 - Coveralls

Co-authored-by: Zaiming (Stone) Shi <zmstone@gmail.com>

apps/emqx_resource/src/emqx_resource_manager.erl

zmstone · 2023-05-29T15:06:17Z

apps/emqx_bridge/src/emqx_bridge.erl

+                            kind => Kind,
+                            error => Error,


Suggested change

kind => Kind,

error => Error,

exception => Kind,

reason => Error,

I'll fix it in the next on_stop refactoring PR.

yanzhiemq · 2023-06-13T06:03:42Z

Enhancements

Refactored Kafka Producer and Consumer bridges to avoid leaking resources during crashes at creation.

yanzhiemq · 2023-06-13T06:03:44Z

增强

重构了Kafka 生产者和消费者桥接，在创建过程中避免资源泄漏。

docs: improve changelog

297b8b3

thalesmg force-pushed the refactor-kafka-on-stop-v50 branch from 1f74d26 to 7cb8a3a Compare May 24, 2023 19:47

thalesmg mentioned this pull request May 24, 2023

feat(pulsar): ensure allocated resources are removed on failures (v5.0) #10778

Merged

4 tasks

thalesmg force-pushed the refactor-kafka-on-stop-v50 branch 2 times, most recently from 1411d77 to 9984bd2 Compare May 25, 2023 18:05

ci: ensure cluster.hocon is removed before running tests

34be8b3

thalesmg force-pushed the refactor-kafka-on-stop-v50 branch 2 times, most recently from 6f6d5c4 to 0154d89 Compare May 25, 2023 18:42

thalesmg added 5 commits May 25, 2023 16:38

refactor(pulsar): use macros for allocatable resources

42b3769

feat(kafka): ensure allocated resources are removed on failures

5df7314

fix(kafka): improve shutdown and health check logs during shutdown

0ca3f51

test(kafka_consumer): attempt to stabilize cluster tests

cb34bc5

Example failure: https://github.com/emqx/emqx/actions/runs/5070096710/jobs/9105822319#step:7:515 The attempt here is to setup the spy as early as possible, before the bridge starts, so we avoid missing rebalancing events.

fix(bridge): remove bridge if exception happens during creation

e435171

thalesmg force-pushed the refactor-kafka-on-stop-v50 branch from 0154d89 to 6308110 Compare May 25, 2023 19:38

thalesmg force-pushed the refactor-kafka-on-stop-v50 branch from 6308110 to 32e6213 Compare May 25, 2023 21:07

thalesmg marked this pull request as ready for review May 26, 2023 14:00

thalesmg requested a review from a team as a code owner May 26, 2023 14:00

zmstone reviewed May 26, 2023

View reviewed changes

Makefile Outdated Show resolved Hide resolved

thalesmg and others added 2 commits May 26, 2023 11:29

ci: improve script

7188346

Co-authored-by: Zaiming (Stone) Shi <zmstone@gmail.com>

test: fix flaky authn test

d27f593

thalesmg requested a review from lafirest as a code owner May 26, 2023 19:44

zmstone reviewed May 29, 2023

View reviewed changes

apps/emqx_resource/src/emqx_resource_manager.erl Show resolved Hide resolved

zmstone reviewed May 29, 2023

View reviewed changes

apps/emqx_resource/src/emqx_resource_manager.erl Show resolved Hide resolved

zmstone approved these changes May 29, 2023

View reviewed changes

thalesmg merged commit 67e182e into emqx:master May 29, 2023
164 checks passed

thalesmg deleted the refactor-kafka-on-stop-v50 branch May 29, 2023 19:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(kafka): ensure allocated resources are removed on failures #10813

feat(kafka): ensure allocated resources are removed on failures #10813

thalesmg commented May 24, 2023 •

edited

Loading

coveralls commented May 26, 2023 •

edited

Loading

zmstone May 29, 2023

thalesmg May 29, 2023

yanzhiemq commented Jun 13, 2023 •

edited

Loading

yanzhiemq commented Jun 13, 2023 •

edited

Loading

feat(kafka): ensure allocated resources are removed on failures #10813

feat(kafka): ensure allocated resources are removed on failures #10813

Conversation

thalesmg commented May 24, 2023 • edited Loading

Summary

🤖 Generated by Copilot at 1f74d26

PR Checklist

Checklist for CI (.github/workflows) changes

coveralls commented May 26, 2023 • edited Loading

Pull Request Test Coverage Report for Build 5094422552

💛 - Coveralls

zmstone May 29, 2023

Choose a reason for hiding this comment

thalesmg May 29, 2023

Choose a reason for hiding this comment

yanzhiemq commented Jun 13, 2023 • edited Loading

Enhancements

yanzhiemq commented Jun 13, 2023 • edited Loading

增强

thalesmg commented May 24, 2023 •

edited

Loading

`🤖 Generated by Copilot at 1f74d26`

coveralls commented May 26, 2023 •

edited

Loading

yanzhiemq commented Jun 13, 2023 •

edited

Loading

yanzhiemq commented Jun 13, 2023 •

edited

Loading