Stop limiting num-connections based on num-known-IPs (Improve S3-Express performance) #407

Merged: 5 commits into main from ignore-ip-count, Feb 29, 2024

Conversation

graebm (Contributor) commented Feb 28, 2024

Issue:
Disappointing S3-Express performance.

Description of changes:
Stop limiting num-connections based on num-known-IPs.

Diagnosing the issue:
We found that num-connections never got very high, because num-connections scales with num-known-IPs. S3-Express endpoints have very few IPs, so their num-connections never scaled very high.

The algorithm was adding 10 connections per known-IP. On a 100Gb/s machine, this maxed out at 250 connections once 25 IPs were known. But S3-Express endpoints only have 4 unique IPs, so they never got higher than 40 connections.
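
As a rough sketch of that math (illustrative names, not the library's literal code):

    #include <stdint.h>

    /* Sketch of the old scaling described above: 10 connections per known IP,
     * capped at 250 on a 100Gb/s machine. */
    static uint32_t s_old_max_connections(uint32_t num_known_ips) {
        const uint32_t connections_per_ip = 10;
        const uint32_t cap = 250;
        uint32_t n = num_known_ips * connections_per_ip;
        return n < cap ? n : cap;
    }
    /* S3-Express: 4 IPs -> stuck at 40. Vanilla S3: 25+ IPs -> the full 250. */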

This algorithm was written back when S3 returned 1 IP per DNS query. The intention was to throttle connections until more IPs were known, in order to spread load among S3's server fleet. However, as of Aug 2023, S3 provides multiple IPs per DNS query, so we can now scale up to max connections after the first DNS query and still spread the load.

We also believed that spreading load was a key to good performance. But I found that spreading the load didn't have much impact on performance (at least now, in 2024, on the 100Gb/s machine I was using). Tests where I hard-coded a single IP and hit it with max-connections didn't differ much from tests where the load was spread among 8 IPs or 100 IPs.

I want to get this change out quickly and help S3-Express, so I picked magic numbers where the num-connections math ends up with the same result as the old algorithm. Normal S3 performance is mildly improved (max-connections is reached immediately, instead of scaling up over 30sec as it finds more IPs). S3 Express performance is MUCH improved.
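
A sketch of the shape of the new math (the per-connection throughput constant is my assumption, chosen only so that 100 Gb/s reproduces the old 250-connection cap; it is not the PR's actual magic number):

    #include <stdint.h>

    /* Sketch: derive max connections straight from the throughput target,
     * independent of how many IPs are known. */
    static uint32_t s_new_max_connections(double target_throughput_gbps) {
        const double assumed_gbps_per_connection = 0.4; /* assumption: 100 Gb/s / 250 connections */
        const uint32_t min_connections = 10; /* floor, cf. g_min_num_connections in the diff below */
        uint32_t n = (uint32_t)(target_throughput_gbps / assumed_gbps_per_connection);
        return n > min_connections ? n : min_connections;
    }
    /* 100 Gb/s -> 250 connections after the first DNS answer, matching the old max. */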

Future Work:
Improve this algorithm further:

  • expect higher throughput on connections to S3 Express
  • expect lower throughput on connections transferring small objects
  • dynamic scaling without a bunch of magic numbers ??? (sounds cool, but I don't have any ideas how this would work yet)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

TingDaoK (Contributor) left a comment

Quoting the README: "DNS Load Balancing: DNS resolver continuously harvests Amazon S3 IP addresses. When load is spread across the S3 fleet, overall throughput is better than if all connections were hammering the same IP simultaneously."

Should we update the readme here?

@@ -151,32 +152,9 @@ uint32_t aws_s3_client_get_max_active_connections(
     struct aws_s3_client *client,
     struct aws_s3_meta_request *meta_request) {
     AWS_PRECONDITION(client);
+    (void)meta_request;
TingDaoK (Contributor) commented:

Why not just remove meta_request from the args? It's a private helper.

Or do we even need this helper? Can we just have it as a member of client? It won't change after the client is initialized now, right?

graebm (Author) replied:

I'm pretty sure we'll be using meta_request again in the near future. If we're going to use different magic numbers for S3-Express vs S3-Vanilla, there will be some meta_request->storage_class variable that we'll need to check.
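
For illustration, the kind of branch being described might look like this (the storage_class field, the enum value, and both numbers are hypothetical; none of them exist in aws-c-s3 today):

    /* Hypothetical sketch: pick magic numbers per storage class. */
    static double s_expected_gbps_per_connection(const struct aws_s3_meta_request *meta_request) {
        if (meta_request->storage_class == AWS_S3_STORAGE_CLASS_EXPRESS) { /* hypothetical field/enum */
            return 0.8; /* assumption: higher throughput per connection to S3 Express */
        }
        return 0.4; /* assumption: vanilla S3 */
    }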

@@ -1687,7 +1669,7 @@ static bool s_s3_client_should_update_meta_request(
     size_t num_known_vips = client->vtable->get_host_address_count(
         client->client_bootstrap->host_resolver, endpoint->host_name, AWS_GET_HOST_ADDRESS_COUNT_RECORD_TYPE_A);
     if (num_known_vips == 0 && (client->threaded_data.num_requests_being_prepared +
-                                client->threaded_data.request_queue_size) >= g_max_num_connections_per_vip) {
+                                client->threaded_data.request_queue_size) >= g_min_num_connections) {
TingDaoK (Contributor) commented:

do we still want this?

graebm (Author) replied Feb 28, 2024:

I don't know yes/no for certain. I wanted to limit the extent of these changes, so I'm keeping the old behavior where we limit the number of requests in the prep stage when we don't know any IP addresses for the endpoint yet.

I also wondered if this was added in response to a specific issue, or just something that was added superstitiously. The check was added in this PR: Multiple Bucket Support. There's a specific commit (Preventing too many requests from being prepared) where Ryan adds this check. So it seems pretty deliberate.

So maybe this could be removed? Or the numbers tweaked? We'll address that another day.

Commit: "…d. My tests showed similar performance for 1 vs 100 IPs. But with 1 IP the results were definitely more variable (some IPs were slower/faster than others)"
codecov-commenter commented:

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.37%. Comparing base (593c2ab) to head (e677691).


@@           Coverage Diff           @@
##             main     #407   +/-   ##
=======================================
  Coverage   89.37%   89.37%           
=======================================
  Files          20       20           
  Lines        5863     5854    -9     
=======================================
- Hits         5240     5232    -8     
+ Misses        623      622    -1     
Files                Coverage Δ
source/s3_client.c   88.16% <100.00%> (-0.11%) ⬇️

... and 1 file with indirect coverage changes

graebm merged commit 59569e3 into main on Feb 29, 2024
30 checks passed
graebm deleted the ignore-ip-count branch on Feb 29, 2024 at 00:29