Add Data Transfer Limits for each connection, configurable by vhost #88

Merged: 1 commit merged into bloomberg:main on Jun 30, 2022

Conversation

adamncasey (Contributor) commented on Jun 10, 2022

This commit adds support for limiting the amount of data that can be
read from the client-side (ingress) socket in a one-second period.

The aim is to prevent a single connection from overwhelming the proxy,
while also providing an extra tool for reining in an out-of-control vhost,
giving both RabbitMQ and amqpprox time to recover.

These limits are configurable using:

```
amqpprox_ctl ... LIMIT DATA_RATE DEFAULT 500000000        # About 500MB per second limit per connection on all vhosts
amqpprox_ctl ... LIMIT DATA_RATE VHOST vhostname 5000000  # Restrict 'vhostname' down to about 5MB/s
```

The limits are enforced by counting the number of bytes read from the
socket during each read operation, and pausing for one second before
starting a read operation if the connection is at the limit.

This might transfer a bit more data than the precise limit, since amqpprox
always passes the whole read buffer to the server after a read operation, which
technically could overshoot the limit; the next read operation then pauses
until the next 'tick' (up to one second). In testing, the data transferred per
second stays roughly at the configured limit, so this isn't a huge problem.
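
A minimal sketch of the counter this describes (illustrative only; apart from `remainingQuota()` and `getQuota()`, which appear in the diff, the names here are assumptions rather than the exact code in this PR):

```
#include <cstddef>

// Sketch of a per-connection data-rate counter along the lines described
// above: bytes read are recorded against a per-second quota, and a
// once-per-second timer tick resets the remaining quota to the configured
// limit. The read loop checks remainingQuota() before starting each read
// and, if it is zero, waits for the next tick before reading again.
class DataRateLimitSketch {
    std::size_t d_quota     = 0;  // configured bytes-per-second limit
    std::size_t d_remaining = 0;  // bytes still allowed in the current tick

  public:
    void setQuota(std::size_t quota)
    {
        d_quota     = quota;
        d_remaining = quota;
    }

    std::size_t getQuota() const { return d_quota; }

    std::size_t remainingQuota() const { return d_remaining; }

    // Called after each socket read with the number of bytes consumed.
    void recordUsage(std::size_t bytes)
    {
        d_remaining = (bytes >= d_remaining) ? 0 : d_remaining - bytes;
    }

    // Called on each one-second timer tick: start a fresh quota period.
    void onTimer() { d_remaining = d_quota; }
};
```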

I tested and compared maximum data throughput before and after this
change - with no limit enabled and with a high limit enabled. No
particular differences were seen:
Data transfer rates, all measured in MiB/s:

| Run # | Without Change | With Change, No Limit | With Change, 2000 MiB/s Limit | With Change, 50 MiB/s Limit (per client) |
|-------|----------------|-----------------------|-------------------------------|------------------------------------------|
| 1     | 1058           | 1078                  | 1064                          | 497                                      |
| 2     | 1056           | 1030                  | 1060                          | 497                                      |
| 3     | 1049           | 1023                  | 1054                          | 497                                      |

Tested with amqpprox and the performance tester running on the same machine: a VM with 8x 2.5 GHz cores and 32 GB of RAM.

I ran the performance tester using these arguments:

```
$ RUST_LOG=info cargo run --release -- --address amqp://127.0.0.1:5672 --listen-address 0.0.0.0:30424 --message-size 10000000 --num-messages 500 --clients 10 --max-threads 50
```

This runs 10 parallel client threads, each transmitting ~5 GB (500 messages of ~10 MB) as fast as they can. Each test run took between ~45 seconds and ~2-3 minutes, depending on whether the rate limit was applied. Longer tests better reflect the true peak data rate achieved, since some setup/teardown time is included in the summary calculations. Using bmon I could see that most of the tests peaked at around 2.2-2.4 GiB/s, which means our results here aren't too far off that, since the number in bmon reflects the client -> proxy -> server data flow all over the same interface.

When amqpprox has a large number of connections which have hit their
limit at least once since being opened (this triggers the one second
reset interval) there is some additional CPU load from this change.

I ran the following experiment:

  1. Start amqpprox
  2. Connect 21k clients which each send a 100-byte message every minute
  3. Measure CPU usage of amqpprox using pidstat 15 5 -ru -p <pid>

In a few different scenarios:

  1. amqpprox without this change
  2. With the change, with no data rate limits configured
  3. With the change, with a low data rate limit configured (50 bytes per second)
  4. With the change, with a high data rate limit configured but with all timers triggering.
    • Triggering all timers was achieved by setting the data rate limit very low (50 B/s) for a couple of minutes, then setting it high (2 GB/s).

Results:

| CPU usage tests | Without Change | No Limit | Low limit | All connections triggering timers |
|-----------------|----------------|----------|-----------|-----------------------------------|
| %CPU sample 1   | 17.93          | 13.4     | 19.53     | 20.8                              |
| %CPU sample 2   | 15.73          | 18.4     | 12.73     | 24.27                             |
| %CPU sample 3   | 18.8           | 14.53    | 18.8      | 21.47                             |
| %CPU sample 4   | 17.73          | 18.07    | 13.33     | 23.87                             |
| %CPU sample 5   | 18.73          | 13.4     | 19.47     | 20.67                             |
| %CPU average    | 17.79          | 15.56    | 16.77     | 22.21                             |

For each scenario, amqpprox served the same setup: 21k clients, each publishing a 100-byte message every minute.

TODO:

  • Re-run CPU usage comparisons with 20k connections and put the results on this PR
  • Figure out the clang-format changes going on here - unsure why it's reformatting some code which looks fine.

```
if (d_dataRateLimit.remainingQuota() == 0) {
    if (d_dataRateTimerStarted) {
        d_dataRateTimer.cancel();
        d_dataRateTimer.expires_at(d_dataRateTimer.expiry());
```

Contributor:

This seems out of line with the description of the PR as I understood it. It seems like this is resetting the exceeded limit on the next 1-second clock tick, which could be anything in the range 0.0s-1.0s, rather than waiting for a full 1.0 second period after hitting a limit.

adamncasey (author):

Awkward. I believe I started out with waiting a whole second each time, but changed to this after failing to hit the expected performance test results (not that surprising). I just collected some data on throughput when being limited, and it looks like the method in the code is a lot more predictable - which I think is preferable here. I'll update the PR description to reflect the current behaviour - but equally, let me know if you prefer waiting the whole second.

Contributor:

No, I was not happy with the 1 second approach, so I went looking at the code. I think this approach is better, but I am still somewhat concerned about the jitter in confirm times it might introduce, and I wonder if there's a way to smooth the rate limiting out without it being too expensive to implement (it doesn't look like there's any leaky bucket or similar implementation in boost).

If we do decide to take this approach, is the cost of the timers expensive? Because if it might be, you could have a single 1-second clock and just maintain a map of the sockets that are being limited and need calling back to do the socket read on the next cycle.
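
(For illustration only, a rough sketch of that single-clock idea - not code from this PR; all names here are made up:)

```
#include <boost/asio.hpp>

#include <chrono>
#include <functional>
#include <utility>
#include <vector>

// Sketch of the suggestion above: one shared one-second timer instead of a
// timer per throttled connection. A connection that exhausts its quota
// registers a resume callback; the shared timer fires once per second and
// resumes every paused read in one pass. Assumes everything runs on a
// single io_context thread.
class SharedRateLimitClock {
    boost::asio::steady_timer          d_timer;
    std::vector<std::function<void()>> d_pendingReads;  // paused connections

  public:
    explicit SharedRateLimitClock(boost::asio::io_context &ioContext)
    : d_timer(ioContext)
    {
        scheduleTick();
    }

    // Called by a connection that has hit its limit for the current tick.
    void resumeOnNextTick(std::function<void()> resumeRead)
    {
        d_pendingReads.push_back(std::move(resumeRead));
    }

  private:
    void scheduleTick()
    {
        d_timer.expires_after(std::chrono::seconds(1));
        d_timer.async_wait([this](const boost::system::error_code &error) {
            if (error) {
                return;  // e.g. operation_aborted on shutdown
            }
            auto pending = std::move(d_pendingReads);
            d_pendingReads.clear();
            for (auto &resume : pending) {
                resume();  // restart the paused socket read
            }
            scheduleTick();
        });
    }
};
```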

adamncasey (author):

Yeah, as you say, the timers aren't free - the cost of ASIO figuring out which timers are ready does start to show in perf results - but the CPU usage benchmarks I did below seem to indicate this isn't overly expensive. Do you agree? We could switch the implementation to have a single timer which runs every second - I think that would work, it just needs a bit more thought around socket ownership and how many timers are needed to keep latency / CPU spikes to a minimum. At the moment a lower-hanging performance fruit is tidying up statistics collection when there are 20k connections (there's some perf output kicking around internally for this).

adamncasey (author) commented on Jun 20, 2022:

And I think this is everything you already know, but to make it explicit: if someone is pushing a lot of data to the broker they will see increased confirm times when they get throttled, but it's worth noting that we don't do anything to hold back broker confirms flowing to them.

We could introduce a smoother threshold to help bursty clients which would like to push (for example) a large amount of data in only one second out of every 30.

I could go for supporting something like an X limit every second, but allowing up to Y to accumulate if unused - which looks roughly like what tc exposes in one of its modes (sketched below). We did discuss something a bit more complicated than this for connection rate limits, but eventually decided to go for just a simple limit, to keep it easy for users to understand when they get rate limited.

I'm unsure how much of a difference it would make overall, though, since 1 second seems like only a medium period of time for a (mostly) asynchronous application like RabbitMQ. At the moment I don't think we have a great way to measure whether this would make a difference - not to say that means we should stick with what is currently in the PR - it just makes it a bit trickier to decide.
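
(For reference, the "X every second, accumulate up to Y if unused" idea is essentially a token bucket; a minimal illustrative sketch, not part of this PR:)

```
#include <algorithm>
#include <cstddef>

// Token-bucket sketch of the idea above: d_ratePerSecond (X) tokens are added
// on each one-second tick, unused tokens accumulate up to d_burstCapacity (Y),
// and each read consumes tokens equal to the bytes transferred.
class TokenBucketSketch {
    std::size_t d_ratePerSecond;  // X: bytes allowed per second
    std::size_t d_burstCapacity;  // Y: maximum unused allowance that can build up
    std::size_t d_tokens;         // currently available bytes

  public:
    TokenBucketSketch(std::size_t ratePerSecond, std::size_t burstCapacity)
    : d_ratePerSecond(ratePerSecond)
    , d_burstCapacity(burstCapacity)
    , d_tokens(burstCapacity)
    {
    }

    // Called on the existing one-second timer tick.
    void onTimer()
    {
        d_tokens = std::min(d_burstCapacity, d_tokens + d_ratePerSecond);
    }

    // Record bytes read; returns false if the connection should now pause
    // until the next tick.
    bool consume(std::size_t bytes)
    {
        if (bytes > d_tokens) {
            d_tokens = 0;
            return false;
        }
        d_tokens -= bytes;
        return true;
    }
};
```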

adamncasey (author) commented on Jun 15, 2022

Most of this comment is now in the PR description, but this does still note the difference between sleeping an entire second, and sleeping for the remainder of the current rate "tick" (the latter is implemented in this PR).

Data transfer rates, all measured in MiB/s:

| Run # | Without Change | With Change, No Limit | With Change, 2000 MiB/s Limit | With Change, 50 MiB/s Limit (per client) | With Change, 200 MiB/s, 1.0s sleep | With Change, 50 MiB/s, 1.0s sleep |
|-------|----------------|-----------------------|-------------------------------|------------------------------------------|------------------------------------|-----------------------------------|
| 1     | 1058           | 1078                  | 1064                          | 497                                      | 1063                               | 393                               |
| 2     | 1056           | 1030                  | 1060                          | 497                                      | 1076                               | 434                               |
| 3     | 1049           | 1023                  | 1054                          | 497                                      | 1054                               | 401                               |

Tested with amqpprox and the performance tester running on the same machine: a VM with 8x 2.5 GHz cores and 32 GB of RAM.

I ran the performance tester using these arguments:

```
$ RUST_LOG=info cargo run --release -- --address amqp://127.0.0.1:5672 --listen-address 0.0.0.0:30424 --message-size 10000000 --num-messages 500 --clients 10 --max-threads 50
```

This runs 10 parallel client threads, each transmitting ~5 GB (500 messages of ~10 MB) as fast as they can. Each test run took between ~45 seconds and ~2-3 minutes, depending on whether the rate limit was applied. Longer tests better reflect the true peak data rate achieved, since some setup/teardown time is included in the summary calculations. Using bmon I could see that most of the tests peaked at around 2.2-2.4 GiB/s, which means our results here aren't too far off that, since the number in bmon reflects the client -> proxy -> server data flow all over the same interface.

adamncasey force-pushed the datarate branch 2 times, most recently from 72a11a6 to 098a41b, on June 20, 2022 at 10:29
```
@@ -293,7 +406,7 @@ void LimitControlCommand::handleCommand(const std::string & /* command */,
     }

     auto vhostOrDefault = readVhostOrDefault(iss);
-    if (!readVhostOrDefault) {
+    if (!vhostOrDefault) {
```

adamncasey (author):

This is a bug in main at the moment.

Chinmay1412 (Contributor) left a comment:

I think we also need to add the commands to README.md, and document each command in detail here: https://github.com/bloomberg/amqpprox/blob/main/docs/config.md

```
/**
 * Fetch the currently configured quota
 */
std::size_t getQuota();
```

Contributor:

Suggested change:

```
-std::size_t getQuota();
+std::size_t getQuota() const;
```

```
    d_defaultDataRateQuota = quota;
}

void DataRateLimitManager::setDefaultDataRateAlarm(std::size_t quota)
```

Contributor:

Don't we need to maintain a default data rate limit per vhost?

adamncasey (author):

Since the counters which tick down are stored individually under each Session, and changing the values already propagates down into the sessions, the manager only needs to store the values sessions would be assigned when they're created. For that we can just store the default values, plus any vhost overrides which exist.
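
(Roughly, the lookup side of that looks like the sketch below; only the default/alarm setters and d_defaultDataRateQuota appear in the diff, the rest of the names are illustrative:)

```
#include <cstddef>
#include <string>
#include <unordered_map>

// Sketch of the idea described above: sessions copy their quota when they are
// created (and when limits change), so the manager only needs to hold the
// default value plus any per-vhost overrides.
class DataRateLimitManagerSketch {
    std::size_t                                  d_defaultDataRateQuota = 0;
    std::unordered_map<std::string, std::size_t> d_vhostDataRateQuota;

  public:
    void setDefaultDataRateQuota(std::size_t quota)
    {
        d_defaultDataRateQuota = quota;
    }

    void setVhostDataRateQuota(const std::string &vhost, std::size_t quota)
    {
        d_vhostDataRateQuota[vhost] = quota;
    }

    // Value assigned to a new session for `vhost`: the override if one
    // exists, otherwise the default.
    std::size_t getDataRateQuota(const std::string &vhost) const
    {
        auto it = d_vhostDataRateQuota.find(vhost);
        return it != d_vhostDataRateQuota.end() ? it->second
                                                : d_defaultDataRateQuota;
    }
};
```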

```
@@ -331,6 +371,46 @@ class MaybeSecureSocketAdaptor {
                         handler);
    }

    if (!d_alarmed && d_dataRateAlarm.remainingQuota() == 0) {
        if (d_dataRateTimerStarted) {
            LOG_INFO << "Data Rate Alarm: Hit "
```

Contributor:

Should we log this at warning level, with a specific keyword string to alarm on?

adamncasey (author):

I was going for "Data Rate Alarm" being the string we alarm on.

adamncasey (author) commented:

There are a few commits here now, but squashing would make reviewing awkward. I'll squash before merging.

willhoy (Contributor) left a comment:

Think this is good to go for now

```
void onTimer(const boost::system::error_code &error)
{
    if (error == boost::asio::error::operation_aborted) {
        return;
```

Contributor:

Take it this isn't worth logging, e.g. that the timer was cancelled?

adamncasey (author):

Yeah, it gets cancelled every time a connection hits its data rate limit, and also whenever a socket is closed.

Commit message:

This commit adds support for limiting the amount of data which can be
read from the client-side (ingress) socket in a one second time period.

The aim is to be able to avoid one single connection being able to
overwhelm the proxy, while also providing an extra tool for limiting an
out of control vhost - giving time for both RabbitMQ, and amqpprox to
recover.

These limits are configurable using
```
amqpprox_ctl ... LIMIT DATA_RATE DEFAULT 500000000        # About 500MB per second limit per connection on all vhosts

amqpprox_ctl ... LIMIT DATA_RATE VHOST vhostname 5000000  # Restrict 'vhostname' down to about 5MB/s
```

The limits are enforced by counting the number of bytes read from the
socket during each read operation, and pausing for one second before
starting a read operation if the connection is at the limit.

This might transfer more or less data than the precise limit, since amqpprox
will always pass the whole read buffer to the server after a read operation -
which could lead to overshooting the limit - but the next read operation will
pause for a whole second after doing so, which could undershoot. In testing,
the data transferred per second stays roughly at the configured limit.

I tested and compared maximum data throughput before and after this
change - with no limit enabled and with a high limit enabled. No
particular differences were seen.

When amqpprox has a large number of connections which have hit their
limit at least once since being opened (this triggers the one second
reset interval) there is some additional CPU load from this change.
adamncasey merged commit 89eac7b into bloomberg:main on Jun 30, 2022
adamncasey deleted the datarate branch on June 30, 2022 at 12:49
adamncasey added a commit to adamncasey/amqpprox that referenced this pull request on Oct 18, 2022:

This commit changes MaybeSecureSocketAdaptor to template a few critical
types, which allows us to write unit tests against some of its behaviours.

In theory this could probably also replace the SocketIntercept things, which
were a nice solution when the implementation of this class passed straight through
to the asio types.

Since working around some asio bugs bloomberg#69
and adding support for data rate limits bloomberg#88,
the MaybeSecureSocketAdaptor class has expanded enough that we should
be writing unit tests for it - and probably should have done so before now.
willhoy pushed a commit that referenced this pull request Oct 19, 2022