
perf(webhook): add async retries and evaluate reply callback in fresh process #10690

Merged
merged 4 commits into emqx:master from perf-webhook-retry-async-reply-v50 on May 17, 2023

Conversation

thalesmg
Contributor

@thalesmg thalesmg commented May 12, 2023

(Partially) fixes https://emqx.atlassian.net/browse/EMQX-9637

There are two optimizations made to improve the throughput of the webhook bridge.

  • Requests failing with `normal` or `{shutdown, normal}` are retried indefinitely.

    Since this bridge is called using `async` callback mode, and `ehttpc` frequently returns errors of the form `normal` and `{shutdown, normal}` that are retried "for free" by `ehttpc`, we add this behavior to async requests as well. Other errors are retried too, but not "for free": at most 3 attempts are made.

    This is important because, when using buffer workers, we should avoid making them enter the blocked state, since that halts all progress and makes throughput plummet.

  • The reply callback emqx_connector_http:reply_delegator is evaluated in a fresh process.

    This surprisingly simple change yields a big performance improvement in throughput.

    While the previous commit achieves ~ 55 k messages / s in throughput under some test conditions (100 k concurrent publishers publishing 1 QoS 1 message per second), the simple change in this commit improves it further to ~ 63 k messages / s.

    Benchmarks indicated that evaluating one reply function is consistently quite fast (~ 20 µs), which makes this performance gain counterintuitive. Perhaps, although each call is cheap, `ehttpc` calls several of these in a row when there are several in-flight requests, and those costs might add up in latency.
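The two optimizations above can be sketched roughly as follows. This is a simplified illustration, not the actual `emqx_connector_http` code: the module and function names here are hypothetical helpers, and only the retry policy and the fresh-process spawn mirror the behavior described in this PR.

```erlang
-module(webhook_async_sketch).
-export([on_async_reply/2, should_retry/2]).

-define(MAX_ATTEMPTS, 3).

%% Optimization 2: evaluate the reply callback in a fresh process,
%% so the ehttpc worker never runs user callbacks inline.
on_async_reply(ReplyFun, Result) ->
    erlang:spawn(fun() -> ReplyFun(Result) end).

%% Optimization 1: `normal' and `{shutdown, normal}' errors are
%% retried indefinitely ("for free"); any other error gets at most
%% 3 attempts in total.
should_retry(normal, _Attempt) ->
    true;
should_retry({shutdown, normal}, _Attempt) ->
    true;
should_retry(_OtherError, Attempt) when Attempt < ?MAX_ATTEMPTS ->
    true;
should_retry(_OtherError, _Attempt) ->
    false.
```

The point of `on_async_reply/2` is that the spawning itself is the change: the callback body is unmodified, it just no longer executes on the ehttpc worker's critical path.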

This also makes a slight change in the load balancing of requests to ehttpc from the webhook bridge: now the routing key is the message's clientid rather than the calling PID. This should have the same behavior as before, except in the event of a client takeover. In this case, requests will still be forwarded to the same HTTP connection for the same client (assuming that such connection is still alive when the takeover is complete) rather than a different one.
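The routing change amounts to hashing the clientid instead of the caller PID when picking a pool worker, so the same client deterministically maps to the same HTTP connection even across a takeover. A minimal sketch (the surrounding pool API is hypothetical; only the `erlang:phash2/2` hashing is the point):

```erlang
%% Map a client to one of PoolSize ehttpc workers by hashing its
%% clientid. phash2/2 returns 0..PoolSize-1; workers are 1-indexed.
pick_worker(PoolSize, ClientId) ->
    erlang:phash2(ClientId, PoolSize) + 1.
```

Because the hash input is stable per client (unlike a caller PID, which changes on takeover), requests from a given client keep landing on the same worker as long as the pool size is unchanged.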

Summary

🤖 Generated by Copilot at 4ed859c

This pull request enhances the HTTP connector with a retry mechanism and fixes a bug related to gproc counters. It also updates some dependencies in mix.exs and rebar.config to use more reliable sources and versions.

PR Checklist

Please convert this PR to a draft if any of the following conditions are not met. Reviewers may skip review until all items are checked:

  • Added tests for the changes
  • Changed lines covered in coverage report
  • Change log has been added to changes/{ce,ee}/(feat|perf|fix)-<PR-id>.en.md files
  • For internal contributor: there is a jira ticket to track this change
  • [na] If there should be document changes, a PR to emqx-docs.git is sent, or a jira ticket is created to follow up
  • [na] Schema changes are backward compatible

Checklist for CI (.github/workflows) changes

  • [na] If changed package build workflow, pass this action (manual trigger)
  • [na] Change log has been added to changes/ dir for user-facing artifacts update

@thalesmg thalesmg force-pushed the perf-webhook-retry-async-reply-v50 branch 2 times, most recently from 2af2628 to f61a5a4 Compare May 12, 2023 18:03
@coveralls
Collaborator

coveralls commented May 12, 2023

Pull Request Test Coverage Report for Build 5003109574

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 35 unchanged lines in 20 files lost coverage.
  • Overall coverage decreased (-0.01%) to 81.857%

Files with coverage reduction:

| File | New Missed Lines | % |
| --- | --- | --- |
| apps/emqx_bridge_dynamo/src/emqx_bridge_dynamo_connector.erl | 1 | 76.47% |
| apps/emqx_bridge/src/emqx_bridge.erl | 1 | 89.68% |
| apps/emqx_bridge/src/emqx_bridge_resource.erl | 1 | 76.34% |
| apps/emqx_conf/src/emqx_conf_schema.erl | 1 | 92.86% |
| apps/emqx_connector/src/emqx_connector_http.erl | 1 | 66.67% |
| apps/emqx_dashboard/src/emqx_dashboard_monitor.erl | 1 | 65.97% |
| apps/emqx_resource/src/emqx_resource_buffer_worker.erl | 1 | 94.32% |
| apps/emqx/src/emqx_channel.erl | 1 | 87.88% |
| apps/emqx/src/emqx_config.erl | 1 | 87.4% |
| apps/emqx/src/emqx_connection.erl | 1 | 85.96% |
Totals:

  • Change from base Build 5001529675: -0.01%
  • Covered Lines: 27125
  • Relevant Lines: 33137

💛 - Coveralls

@thalesmg thalesmg marked this pull request as ready for review May 12, 2023 19:41
@thalesmg thalesmg requested review from a team, JimMoen and lafirest as code owners May 12, 2023 19:41
@@ -24,7 +24,7 @@
 {deps, [
     {emqx_utils, {path, "../emqx_utils"}},
     {lc, {git, "https://github.com/emqx/lc.git", {tag, "0.3.2"}}},
-    {gproc, {git, "https://github.com/emqx/gproc", {tag, "0.9.0.1"}}},
+    {gproc, {git, "https://github.com/uwiger/gproc", {tag, "0.9.1"}}},
Contributor
better to sync and use our fork.

Contributor Author

Our 0.9.0.1 was tagged just because 0.9.1 was not ready upstream yet. Both are basically in the same state; 0.9.0.1 just additionally has an appup for v4.4.

I talked with @zmstone , and I understood that we wanted to use upstream if possible here. 🤔

Member

Sorry.
For gproc, since we already forked (with appup etc in place), it makes more sense to use our fork instead.

Contributor Author

Reverted back to using our fork.

✔️

@thalesmg thalesmg force-pushed the perf-webhook-retry-async-reply-v50 branch from f61a5a4 to 0a55c1e Compare May 15, 2023 13:38
This is a performance improvement for the webhook bridge.

Since this bridge is called using `async` callback mode, and `ehttpc`
frequently returns errors of the form `normal` and `{shutdown,
normal}` that are retried "for free" by `ehttpc`, we add this behavior
to async requests as well.  Other errors are retried too, but not
"for free": at most 3 attempts are made.

This is important because, when using buffer workers, we should avoid
making them enter the `blocked` state, since that halts all progress
and makes throughput plummet.
This surprisingly simple change yields a big performance improvement
in throughput.

While the previous commit achieves ~ 55 k messages / s
in throughput under some test conditions (100 k concurrent publishers
publishing 1 QoS 1 message per second), the simple change in this
commit improves it further to ~ 63 k messages / s.

Benchmarks indicated that evaluating one reply function is
consistently quite fast (~ 20 µs), which makes this performance gain
counterintuitive.  Perhaps, although each call is cheap, `ehttpc`
calls several of these in a row when there are several in-flight
requests, and those costs might add up in latency.
@thalesmg thalesmg force-pushed the perf-webhook-retry-async-reply-v50 branch from 0a55c1e to bc7d0d5 Compare May 17, 2023 12:21
@thalesmg thalesmg merged commit b2afe4e into emqx:master May 17, 2023
161 of 162 checks passed
@thalesmg thalesmg deleted the perf-webhook-retry-async-reply-v50 branch May 17, 2023 14:06
@yanzhiemq
Copy link
Collaborator

yanzhiemq commented Jun 13, 2023

Enhancements

  • Added a retry mechanism to the webhook bridge to improve throughput.

    This optimization retries failed requests without blocking the buffering layer, which can improve throughput under high message rates.

@yanzhiemq
Copy link
Collaborator

yanzhiemq commented Jun 13, 2023

Enhancements

  • Added a retry mechanism to the webhook bridge, aiming to improve throughput.

    This optimization allows requests to be retried on failure without blocking the buffering layer, thereby improving throughput under high message rates.
