Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: avoid logging unnecessary errors in async cleanup functions #11065

Merged

Conversation

SergeTupchiy
Copy link
Contributor

@SergeTupchiy SergeTupchiy commented Jun 15, 2023

Cleanup functions that access ETS tables may fail with badarg error during EMQX shutdown.
They are called asynchronously by emqx_pool workers and accessed ETS tables may be already destroyed as their owners are shut down. This fix catches ETS badarg errors silently and individually, before they can be caught and logged by emqx_pool.

Fixes EMQX-9992

Summary

🤖 Generated by Copilot at 86e3cdb

This pull request refactors the code that uses ets tables in various modules to improve error handling and readability. It introduces a new module emqx_utils_ets that provides wrapper functions and a macro for ets operations. It also moves some broker-related functions to the emqx_broker module.

PR Checklist

Please convert it to a draft if any of the following conditions are not met. Reviewers may skip over until all the items are checked:

  • Added tests for the changes
  • Changed lines covered in coverage report
  • Change log has been added to changes/{ce,ee}/(feat|perf|fix)-<PR-id>.en.md files
  • For internal contributor: there is a jira ticket to track this change
  • If there should be document changes, a PR to emqx-docs.git is sent, or a jira ticket is created to follow up
  • Schema changes are backward compatible

Checklist for CI (.github/workflows) changes

  • If changed package build workflow, pass this action (manual trigger)
  • Change log has been added to changes/ dir for user-facing artifacts update

@SergeTupchiy SergeTupchiy requested review from a team and lafirest as code owners June 15, 2023 16:12
@SergeTupchiy SergeTupchiy force-pushed the EMQX-9992-fix-async-cleanup-error-logs branch from 86e3cdb to d1ccf03 Compare June 15, 2023 16:20
@coveralls
Copy link
Collaborator

coveralls commented Jun 15, 2023

Pull Request Test Coverage Report for Build 5553583838

  • 12 of 15 (80.0%) changed or added relevant lines in 3 files are covered.
  • 32 unchanged lines in 16 files lost coverage.
  • Overall coverage decreased (-0.007%) to 82.444%

Changes Missing Coverage Covered Lines Changed/Added Lines %
apps/emqx/src/emqx_cm.erl 2 3 66.67%
apps/emqx/src/emqx_broker_helper.erl 7 9 77.78%
Files with Coverage Reduction New Missed Lines %
apps/emqx_bridge_gcp_pubsub/src/emqx_bridge_gcp_pubsub_consumer_worker.erl 1 93.72%
apps/emqx_dashboard/src/emqx_dashboard_desc_cache.erl 1 94.44%
apps/emqx_ft/src/emqx_ft_assembler.erl 1 85.11%
apps/emqx_ft/src/emqx_ft_responder.erl 1 88.24%
apps/emqx_ft/src/emqx_ft_storage_exporter.erl 1 88.24%
apps/emqx_ft/src/emqx_ft_storage_exporter_fs.erl 1 80.43%
apps/emqx_ft/src/emqx_ft_storage_fs.erl 1 82.84%
apps/emqx_gateway/src/emqx_gateway_schema.erl 1 94.44%
apps/emqx_gateway/src/emqx_gateway_utils.erl 1 86.49%
apps/emqx/src/emqx_connection.erl 1 83.7%
Totals Coverage Status
Change from base Build 5546449724: -0.007%
Covered Lines: 31209
Relevant Lines: 37855

💛 - Coveralls

Comment on lines 62 to 64
-spec lookup(ets:tab(), term()) -> [tuple()].
lookup(Tab, Key) ->
?safe_ets(ets:lookup(Tab, Key), []).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it really necessary to add a try-catch block in every ets operation ?
Maybe the cost too high? is it possible to only handle this during shutdown?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also thought about adding a generic try catch block directly in the cleanup functions but ultimately opted for this solution instead.
As the goal is to avoid extra logging, it seems slightly risky to catch and hide any badarg error as it's quite generic (may be potentially triggered by other code besides ETS).
As to the performance cost, I think try catch is cheap enough as long as stack trace is not inspected.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, I think this fix introduces a side effect for the general call, keeping the affected scope in the clean up would be better.
At the same time, I think try-catch is always expensive because it needs to save the calling environment for restoration when some exception happens

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try catch without attempting to retrieve the stacktrace should not impact performance in a observable way.
we can try to run some simple benchmark test to verify it.

Copy link
Member

@zmstone zmstone Jul 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with zhongwen and firest that hiding badarg error in a generic call path may hinder the exposure of some potential issues especially during tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback!
After additional discussion with @zmstone I've removed safe ETS wrappers and added try .. catch only in relevant shutdown functions that use async worker pool.

@keynslug
Copy link
Contributor

My 5 cents: sounds like startup order / dependency problem to me, that is usually generally solvable on the supervision tree level. For example, with emqx_cm the problem seems to be that common emqx_pool outlives emqx_cm, I'd say we need to start dedicated emqx_cm_pool and put it after emqx_cm in the startup order (so it shuts down before). Nice thing is it would work with any resources that might be tied to a process lifetime (or its dependencies), not just ETS tables.

@SergeTupchiy
Copy link
Contributor Author

@keynslug,
You are totally right about the root problem and your idea must be a good alternative.
Given all the feedback, I'll revisit this PR and implement something smarter (probably dedicated pools as you suggest).

Cleanup functions that access ETS tables may fail with `badarg` error during EMQX shutdown.
They are called asynchronously by `emqx_pool` workers and accessed ETS tables
may be already destroyed as their owners are shut down.
This fix catches ETS `badarg` errors before they can be caught and logged by `emqx_pool`.

Fixes: EMQX-9992
@SergeTupchiy SergeTupchiy force-pushed the EMQX-9992-fix-async-cleanup-error-logs branch from d1ccf03 to 950d5ed Compare July 14, 2023 11:25
Copy link
Member

@zhongwencool zhongwencool left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@SergeTupchiy SergeTupchiy merged commit 8480a26 into emqx:master Jul 20, 2023
139 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants