
perf(webhook): add async retries and evaluate reply callback in fresh process #10690

Merged
merged 4 commits into emqx:master from perf-webhook-retry-async-reply-v50 on May 17, 2023

Conversation

thalesmg
Contributor

@thalesmg thalesmg commented May 12, 2023

(Partially) fixes https://emqx.atlassian.net/browse/EMQX-9637

There are two optimizations made to improve the throughput of the webhook bridge.

  • Requests failing with `normal` or `{shutdown, normal}` are retried indefinitely.

    Since this bridge is called using `async` callback mode, and `ehttpc` frequently returns errors of the form `normal` and `{shutdown, normal}` that are retried "for free" by `ehttpc`, we add this behavior to async requests as well. Other errors are retried too, but not "for free": at most 3 attempts are made.

    This is important because, when using buffer workers, we should avoid making them enter the blocked state, since that halts all progress and makes throughput plummet.

  • The reply callback emqx_connector_http:reply_delegator is evaluated in a fresh process.

    This surprisingly simple change yields a big performance improvement in throughput.

    While the previous commit achieves ~ 55 k messages / s in throughput under some test conditions (100 k concurrent publishers publishing 1 QoS 1 message per second), the simple change in this commit improves it further to ~ 63 k messages / s.

    Benchmarks indicated that evaluating one reply function is consistently quite fast (~ 20 µs), which makes this performance gain counterintuitive. Perhaps, although each call is cheap, `ehttpc` calls several of these in a row when there are several in-flight requests, and those costs might add up in latency.
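The two optimizations above can be sketched roughly as follows. This is a simplified illustration, not the actual `emqx_connector_http` code: the module and function names here are hypothetical helpers, and only the retry policy and the fresh-process spawn mirror the behavior described in this PR.

```erlang
-module(webhook_async_sketch).
-export([on_async_reply/2, should_retry/2]).

-define(MAX_ATTEMPTS, 3).

%% Optimization 2: evaluate the reply callback in a fresh process,
%% so the ehttpc worker never runs user callbacks inline.
on_async_reply(ReplyFun, Result) ->
    erlang:spawn(fun() -> ReplyFun(Result) end).

%% Optimization 1: `normal' and `{shutdown, normal}' errors are
%% retried indefinitely ("for free"); any other error gets at most
%% 3 attempts in total.
should_retry(normal, _Attempt) ->
    true;
should_retry({shutdown, normal}, _Attempt) ->
    true;
should_retry(_OtherError, Attempt) when Attempt < ?MAX_ATTEMPTS ->
    true;
should_retry(_OtherError, _Attempt) ->
    false.
```

The point of `on_async_reply/2` is that the spawning itself is the change: the callback body is unmodified, it just no longer executes on the ehttpc worker's critical path.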

This also makes a slight change in the load balancing of requests to ehttpc from the webhook bridge: now the routing key is the message's clientid rather than the calling PID. This should have the same behavior as before, except in the event of a client takeover. In this case, requests will still be forwarded to the same HTTP connection for the same client (assuming that such connection is still alive when the takeover is complete) rather than a different one.
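The routing change amounts to hashing the clientid instead of the caller PID when picking a pool worker, so the same client deterministically maps to the same HTTP connection even across a takeover. A minimal sketch (the surrounding pool API is hypothetical; only the `erlang:phash2/2` hashing is the point):

```erlang
%% Map a client to one of PoolSize ehttpc workers by hashing its
%% clientid. phash2/2 returns 0..PoolSize-1; workers are 1-indexed.
pick_worker(PoolSize, ClientId) ->
    erlang:phash2(ClientId, PoolSize) + 1.
```

Because the hash input is stable per client (unlike a caller PID, which changes on takeover), requests from a given client keep landing on the same worker as long as the pool size is unchanged.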

Summary

🤖 Generated by Copilot at 4ed859c

This pull request enhances the HTTP connector with a retry mechanism and fixes a bug related to gproc counters. It also updates some dependencies in mix.exs and rebar.config to use more reliable sources and versions.

PR Checklist

Please convert this PR to a draft if any of the following conditions are not met. Reviewers may skip review until all items are checked:

  • Added tests for the changes
  • Changed lines covered in coverage report
  • Change log has been added to changes/{ce,ee}/(feat|perf|fix)-<PR-id>.en.md files
  • For internal contributor: there is a jira ticket to track this change
  • [na] If there should be document changes, a PR to emqx-docs.git is sent, or a jira ticket is created to follow up
  • [na] Schema changes are backward compatible

Checklist for CI (.github/workflows) changes

  • [na] If changed package build workflow, pass this action (manual trigger)
  • [na] Change log has been added to changes/ dir for user-facing artifacts update

@thalesmg thalesmg force-pushed the perf-webhook-retry-async-reply-v50 branch 2 times, most recently from 2af2628 to f61a5a4 Compare May 12, 2023 18:03
@coveralls
Collaborator

coveralls commented May 12, 2023

Pull Request Test Coverage Report for Build 5003109574

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 35 unchanged lines in 20 files lost coverage.
  • Overall coverage decreased (-0.01%) to 81.857%

Files with coverage reduction:

| File | New Missed Lines | % |
| --- | --- | --- |
| apps/emqx_bridge_dynamo/src/emqx_bridge_dynamo_connector.erl | 1 | 76.47% |
| apps/emqx_bridge/src/emqx_bridge.erl | 1 | 89.68% |
| apps/emqx_bridge/src/emqx_bridge_resource.erl | 1 | 76.34% |
| apps/emqx_conf/src/emqx_conf_schema.erl | 1 | 92.86% |
| apps/emqx_connector/src/emqx_connector_http.erl | 1 | 66.67% |
| apps/emqx_dashboard/src/emqx_dashboard_monitor.erl | 1 | 65.97% |
| apps/emqx_resource/src/emqx_resource_buffer_worker.erl | 1 | 94.32% |
| apps/emqx/src/emqx_channel.erl | 1 | 87.88% |
| apps/emqx/src/emqx_config.erl | 1 | 87.4% |
| apps/emqx/src/emqx_connection.erl | 1 | 85.96% |
Totals:

  • Change from base Build 5001529675: -0.01%
  • Covered Lines: 27125
  • Relevant Lines: 33137

💛 - Coveralls

@thalesmg thalesmg marked this pull request as ready for review May 12, 2023 19:41
@thalesmg thalesmg requested review from a team, JimMoen and lafirest as code owners May 12, 2023 19:41
@@ -24,7 +24,7 @@
 {deps, [
     {emqx_utils, {path, "../emqx_utils"}},
     {lc, {git, "https://github.com/emqx/lc.git", {tag, "0.3.2"}}},
-    {gproc, {git, "https://github.com/emqx/gproc", {tag, "0.9.0.1"}}},
+    {gproc, {git, "https://github.com/uwiger/gproc", {tag, "0.9.1"}}},
Contributor
better to sync and use our fork.

Contributor Author

Our 0.9.0.1 was tagged just because 0.9.1 was not ready upstream yet. Both are basically in the same state; 0.9.0.1 just additionally has an appup for v4.4.

I talked with @zmstone , and I understood that we wanted to use upstream if possible here. 🤔

Member

Sorry.
For gproc, since we already forked (with appup etc in place), it makes more sense to use our fork instead.

Contributor Author

Reverted back to using our fork.

✔️

@thalesmg thalesmg force-pushed the perf-webhook-retry-async-reply-v50 branch from f61a5a4 to 0a55c1e Compare May 15, 2023 13:38
This is a performance improvement for the webhook bridge.

Since this bridge is called using `async` callback mode, and `ehttpc`
frequently returns errors of the form `normal` and `{shutdown,
normal}` that are retried "for free" by `ehttpc`, we add this behavior
to async requests as well.  Other errors are retried too, but not
"for free": at most 3 attempts are made.

This is important because, when using buffer workers, we should avoid
making them enter the `blocked` state, since that halts all progress
and makes throughput plummet.
This surprisingly simple change yields a big performance improvement
in throughput.

While the previous commit achieves ~ 55 k messages / s
in throughput under some test conditions (100 k concurrent publishers
publishing 1 QoS 1 message per second), the simple change in this
commit improves it further to ~ 63 k messages / s.

Benchmarks indicated that evaluating one reply function is
consistently quite fast (~ 20 µs), which makes this performance gain
counterintuitive.  Perhaps, although each call is cheap, `ehttpc`
calls several of these in a row when there are several in-flight
requests, and those costs might add up in latency.
@thalesmg thalesmg force-pushed the perf-webhook-retry-async-reply-v50 branch from 0a55c1e to bc7d0d5 Compare May 17, 2023 12:21
@thalesmg thalesmg merged commit b2afe4e into emqx:master May 17, 2023
161 of 162 checks passed
@thalesmg thalesmg deleted the perf-webhook-retry-async-reply-v50 branch May 17, 2023 14:06
@yanzhiemq
Copy link
Collaborator

yanzhiemq commented Jun 13, 2023

Enhancements

  • Added a retry mechanism to the webhook bridge to improve throughput.

    This optimization retries failed requests without blocking the buffering layer, which can improve throughput under high message rates.

@yanzhiemq
Copy link
Collaborator

yanzhiemq commented Jun 13, 2023

Enhancements

  • Added a retry mechanism to the webhook bridge, aiming to improve throughput.

    This optimization allows requests to be retried on failure without blocking the buffering layer, thereby improving throughput under high message rates.
