Stop limiting num-connections based on num-known-IPs (Improve S3-Express performance) #407

Merged: 5 commits into main from ignore-ip-count, Feb 29, 2024

Conversation

graebm (Contributor) commented Feb 28, 2024

Issue:
Disappointing S3-Express performance.

Description of changes:
Stop limiting num-connections based on num-known-IPs.

Diagnosing the issue:
We found that num-connections never got very high, because num-connections scales with num-known-IPs. S3-Express endpoints have very few IPs, so their num-connections never scaled very high.

The algorithm was adding 10 connections per known-IP. On a 100Gb/s machine, this maxed out at 250 connections once 25 IPs were known. But S3-Express endpoints only have 4 unique IPs, so they never got higher than 40 connections.
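
As a rough sketch of that math (illustrative names, not the library's literal code):

    #include <stdint.h>

    /* Sketch of the old scaling described above: 10 connections per known IP,
     * capped at 250 on a 100Gb/s machine. */
    static uint32_t s_old_max_connections(uint32_t num_known_ips) {
        const uint32_t connections_per_ip = 10;
        const uint32_t cap = 250;
        uint32_t n = num_known_ips * connections_per_ip;
        return n < cap ? n : cap;
    }
    /* S3-Express: 4 IPs -> stuck at 40. Vanilla S3: 25+ IPs -> the full 250. */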

This algorithm was written back when S3 returned 1 IP per DNS query. The intention was to throttle connections until more IPs were known, in order to spread load among S3's server fleet. However, as of Aug 2023, S3 provides multiple IPs per DNS query, so we can now scale up to max connections after the first DNS query and still spread the load.

We also believed that spreading load was a key to good performance. But I found that spreading the load didn't have much impact on performance (at least now, in 2024, on the 100Gb/s machine I was using). Tests where I hard-coded a single IP and hit it with max-connections didn't differ much from tests where the load was spread among 8 IPs or 100 IPs.

I want to get this change out quickly and help S3-Express, so I picked magic numbers where the num-connections math ends up with the same result as the old algorithm. Normal S3 performance is mildly improved (max-connections is reached immediately, instead of scaling up over 30sec as it finds more IPs). S3 Express performance is MUCH improved.
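
A sketch of the shape of the new math (the per-connection throughput constant is my assumption, chosen only so that 100 Gb/s reproduces the old 250-connection cap; it is not the PR's actual magic number):

    #include <stdint.h>

    /* Sketch: derive max connections straight from the throughput target,
     * independent of how many IPs are known. */
    static uint32_t s_new_max_connections(double target_throughput_gbps) {
        const double assumed_gbps_per_connection = 0.4; /* assumption: 100 Gb/s / 250 connections */
        const uint32_t min_connections = 10; /* floor, cf. g_min_num_connections in the diff below */
        uint32_t n = (uint32_t)(target_throughput_gbps / assumed_gbps_per_connection);
        return n > min_connections ? n : min_connections;
    }
    /* 100 Gb/s -> 250 connections after the first DNS answer, matching the old max. */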

Future Work:
Improve this algorithm further:

  • expect higher throughput on connections to S3 Express
  • expect lower throughput on connections transferring small objects
  • dynamic scaling without a bunch of magic numbers ??? (sounds cool, but I don't have any ideas how this would work yet)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

TingDaoK (Contributor) left a comment

Quoting the README: "DNS Load Balancing: DNS resolver continuously harvests Amazon S3 IP addresses. When load is spread across the S3 fleet, overall throughput is better than if all connections were hammering the same IP simultaneously."

Should we update the readme here?

@@ -151,32 +152,9 @@ uint32_t aws_s3_client_get_max_active_connections(
     struct aws_s3_client *client,
     struct aws_s3_meta_request *meta_request) {
     AWS_PRECONDITION(client);
+    (void)meta_request;
TingDaoK (Contributor) commented:

Why not just remove meta_request from the args? It's a private helper.

Or do we even need this helper? Can we just have it as a member of client? It won't change after the client is initialized now, right?

graebm (Author) replied:

I'm pretty sure we'll be using meta_request again in the near future. If we're going to use different magic numbers for S3-Express vs S3-Vanilla, there will be some meta_request->storage_class variable that we'll need to check.
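
For illustration, the kind of branch being described might look like this (the storage_class field, the enum value, and both numbers are hypothetical; none of them exist in aws-c-s3 today):

    /* Hypothetical sketch: pick magic numbers per storage class. */
    static double s_expected_gbps_per_connection(const struct aws_s3_meta_request *meta_request) {
        if (meta_request->storage_class == AWS_S3_STORAGE_CLASS_EXPRESS) { /* hypothetical field/enum */
            return 0.8; /* assumption: higher throughput per connection to S3 Express */
        }
        return 0.4; /* assumption: vanilla S3 */
    }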

@@ -1687,7 +1669,7 @@ static bool s_s3_client_should_update_meta_request(
     size_t num_known_vips = client->vtable->get_host_address_count(
         client->client_bootstrap->host_resolver, endpoint->host_name, AWS_GET_HOST_ADDRESS_COUNT_RECORD_TYPE_A);
     if (num_known_vips == 0 && (client->threaded_data.num_requests_being_prepared +
-                                client->threaded_data.request_queue_size) >= g_max_num_connections_per_vip) {
+                                client->threaded_data.request_queue_size) >= g_min_num_connections) {
TingDaoK (Contributor) commented:

do we still want this?

graebm (Author) replied Feb 28, 2024:

I don't know yes/no for certain. I wanted to limit the extent of these changes, so I'm keeping the old behavior where we limit the number of requests in the prep stage when we don't know any IP addresses for the endpoint yet.

I also wondered if this was added in response to a specific issue, or just something that was added superstitiously. The check was added in this PR: Multiple Bucket Support. There's a specific commit (Preventing too many requests from being prepared) where Ryan adds this check. So it seems pretty deliberate.

So maybe this could be removed? Or the numbers tweaked? We'll address that another day.

Commit: "…d. My tests showed similar performance for 1 vs 100 IPs. But with 1 IP the results were definitely more variable (some IPs were slower/faster than others)"
codecov-commenter commented:

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.37%. Comparing base (593c2ab) to head (e677691).


@@           Coverage Diff           @@
##             main     #407   +/-   ##
=======================================
  Coverage   89.37%   89.37%           
=======================================
  Files          20       20           
  Lines        5863     5854    -9     
=======================================
- Hits         5240     5232    -8     
+ Misses        623      622    -1     
Files                Coverage Δ
source/s3_client.c   88.16% <100.00%> (-0.11%) ⬇️

... and 1 file with indirect coverage changes

graebm merged commit 59569e3 into main on Feb 29, 2024
30 checks passed
graebm deleted the ignore-ip-count branch on Feb 29, 2024 at 00:29