Remove arbitrary concurrency limiter for uplods and replace with a configurable limit that is disabled by default. #642

TheGrizzlyDev · 2024-05-09T17:37:45Z

The existing limit doesn't really help to achieve the goals it states as it cannot help in conditions where multiple clients are using RE at the same time, which are the common exercise conditions. Since the flag can be used to limit quickly uploads in case of some issues with RBE I opten in for keeping it in some form, by making it a setting, though it likely isn't very useful, so it's disabled by default. Instead we should rely on other control mechanisms like number of connections and TCP or HTTP/2 flow control or, like in this case, explicit errors coming from the RE protocol itself (like RESOURCE_EXHAUSTED or others).

configurable limit that is disabled by default.

JakobDegen · 2024-05-14T00:13:58Z

cc @benbrittain who originally added this

The existing limit doesn't really help to achieve the goals it states as it cannot help in conditions where multiple clients are using RE at the same time, which are the common exercise conditions

Yeah, but even if they're the common exercise conditions, it's still pretty easy to imagine that they might not be the failure conditions. Ie you could imagine some server that has one thread per client or something like that and falls over if a client shows up trying to write too many big files at once.

I think there's a nice in-between option to go for though, which both preserves this property and sounds closer to what you actually want: Can we only apply this concurrency to the non-batched uploads, and let batched uploads proceed at unlimited concurrency?

TheGrizzlyDev · 2024-05-14T10:15:19Z

Yeah, but even if they're the common exercise conditions, it's still pretty easy to imagine that they might not be the failure conditions.

Which is why there is an optional configuration to limit concurrency. Any default value would be making likely incorrect assumptions about the implementation and exercise conditions of a RE server and this is just wrong as there are multiple implementations and all of them are wildly different and can run on extremely large distributed clusters with a wide array of different hardware and software specifications. This invariably results in lower performances by default with no real impact. On the other hand if such a limit is necessary for a RE server they can state this setting in their docs. I think this is already a middle ground and brings the potential benefit of limiing concurrency whilst keeping a performant setup by default, wdyt?

TheGrizzlyDev · 2024-05-17T13:37:37Z

friendly ping @JakobDegen :)

TheGrizzlyDev · 2024-05-28T19:49:44Z

Ping :)

valadez-abel · 2024-06-03T16:21:29Z

I tested this on our RE setup, and without it the client seemed to be stuck for 40+ minutes on re_upload stages, but with this PR, it's no longer blocked on re_upload.

zjturner · 2024-06-06T23:40:11Z

Agree that it would be great to see some forward progress on this. @JakobDegen is there anything you're waiting for?

TheGrizzlyDev · 2024-06-17T15:20:35Z

ping @JakobDegen

JakobDegen · 2024-07-01T19:02:12Z

@TheGrizzlyDev yeah, sorry, I realized that in my previous message I managed to actually not right any of the things that I was actually most worried about.

The reason that I think this kind of a concurrency limiter makes sense isn't so much the server side of things, but rather the client side. When it comes to batched uploads, or other uploads that are coming from memory and not from disk, I think I'm willing to trust the system network stack to make sure that resources are effectively utilized. But for non-batched uploads, those are generally reading from disk as a part of the request, and limiting concurrency the concurrency of disk accesses in buck2 does seem like something we should do, if for no other reason than that we don't want to run out of file descriptors. Thoughts?

TheGrizzlyDev · 2024-07-02T10:04:06Z

those are generally reading from disk as a part of the request, and limiting concurrency the concurrency of disk accesses in buck2 does seem like something we should do, if for no other reason than that we don't want to run out of file descriptors. Thoughts?

if that's the case then this limiter still wouldn't help and you'd need a completely different type of limiter that controls open file descriptors within a whole build. Also, I find it highly unlikely this could happen, whereas the current limiter is certain to slow down uploads by a huge factor.

thoughtpolice · 2024-07-02T13:19:41Z

But for non-batched uploads, those are generally reading from disk as a part of the request, and limiting concurrency the concurrency of disk accesses in buck2 does seem like something we should do, if for no other reason than that we don't want to run out of file descriptors. Thoughts?

I think that's really going to depend on a number of factors. A modern NVMe drive will love to eat through a large queue of concurrent read IOs, while a spinning rust hard drive will slow to a crawl. So, I think you need to take actual drive geometry/filesystem type into account if you want to handle concurrent I/Os intelligently and respectuflly. In practice I think just having a path for HDDs (single threaded I/O path with async reads of single files) and "not HDDs" (concurrent/multithreaded) is enough, and to leave the concurrent case uncapped and let the block/IO layer handle it (I think even most networked block devices/filesystems will handle the concurrency reasonably well.)

Regarding file descriptors, while I would like it if Buck2 was more respectful of FD limits, I think that it's probably just as good in practice to encourage people to uncap their fd limits if the alternative is throttling, and at the same time use loud diagnostics to inform them of that. For example, when using Reindeer/Buck2 in a Rust project, I hit #419 about a million times a day on my MBA, so I have to start every shell with ulimit -n 10240 before running Buck. This is because macOS has a pathetic default limit of 256 open FDs.

Now, I would like it if Buck respected this, because it would mean I didn't have to ulimit every time, but I suspect silently respecting it will probably hurt performance more than just uncapping it by 20x or whatever, too. I think actually it would be best if Buck would try to setrlimit(RLIMIT_NOFILE) a high number of fd's immediately on start and then warn if it can't do more than like 4096 of them or something. That's a completely arbitrary number but my thinking is that, in practice if you're going to use Buck on a large scale project, you probably really do want to have a high number of open fd's so that you can push your concurrency limits. If the user has a low number of available fd's, that's a misconfiguration that's going to hurt. It's no different from any other highly concurrent workload (web server, etc) in that regard.

The real thing is that an fd limit is at least in theory solvable pretty quickly with a quick call to ulimit, but this upload limiter currently in place is pretty much a massive performance killer that's silent. I had personally observed really bad re_upload times for very small (10ish MiB) tarballs in my own tests and had no idea this was the reason. Uncapping it has had a huge benefit in practice; the fd limit hasn't nearly been as troublesome in contrast and has an easy mitigation.

zjturner · 2024-07-02T17:35:24Z

I’ll also add that for large projects, you’ll already hit your fd limit just by the fs monitor, so in practice everyone with a large enough project is already going to uncap.

Even beyond that, the concurrency in inherently limited by the max number of concurrent actions, as well as by natural serialization points in the build, which is all to say that the present behavior is far worse than any potential impact caused by increasing the number of file descriptors

TheGrizzlyDev · 2024-07-11T09:38:17Z

@JakobDegen ping :)

cjhopman · 2024-07-18T16:00:37Z

There may be better longer term solutions here, but this seems like a fine improvement over the current state. I think we should just change the key to something like max_concurrent_uploads_per_action (it's per upload_request, but that should basically be 1 per action and per_action i think is more understandable). I'd lean towards prefixing it with oss_ since it's only going to affect the oss client (or maybe with grpc_), but either way is fine.

facebook-github-bot · 2024-07-18T16:01:40Z

@cjhopman has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2024-07-29T22:56:14Z

@cjhopman has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Remove arbitrary concurrency limiter for uplods and replace with a

fc16cb6

configurable limit that is disabled by default.

facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 9, 2024

TheGrizzlyDev closed this May 9, 2024

TheGrizzlyDev deleted the remove-arbitrary-upload-concurrency-limiter branch May 9, 2024 20:07

TheGrizzlyDev restored the remove-arbitrary-upload-concurrency-limiter branch May 10, 2024 10:05

TheGrizzlyDev reopened this May 10, 2024

TheGrizzlyDev mentioned this pull request May 10, 2024

Add find_missing_blobs call on uploads to check if a blob has already been uploaded. #641

Closed

facebook-github-bot closed this in f5e4523 Jul 31, 2024

TheGrizzlyDev deleted the remove-arbitrary-upload-concurrency-limiter branch July 31, 2024 08:40

thoughtpolice mentioned this pull request May 12, 2025

Buck2 bugs and other notes thoughtpolice/a#15

Open

36 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Remove arbitrary concurrency limiter for uplods and replace with a configurable limit that is disabled by default. #642

Remove arbitrary concurrency limiter for uplods and replace with a configurable limit that is disabled by default. #642

Uh oh!

TheGrizzlyDev commented May 9, 2024

Uh oh!

JakobDegen commented May 14, 2024

Uh oh!

TheGrizzlyDev commented May 14, 2024

Uh oh!

TheGrizzlyDev commented May 17, 2024

Uh oh!

TheGrizzlyDev commented May 28, 2024

Uh oh!

valadez-abel commented Jun 3, 2024

Uh oh!

zjturner commented Jun 6, 2024

Uh oh!

TheGrizzlyDev commented Jun 17, 2024

Uh oh!

JakobDegen commented Jul 1, 2024

Uh oh!

TheGrizzlyDev commented Jul 2, 2024

Uh oh!

thoughtpolice commented Jul 2, 2024 •

edited

Loading

Uh oh!

zjturner commented Jul 2, 2024

Uh oh!

TheGrizzlyDev commented Jul 11, 2024

Uh oh!

cjhopman commented Jul 18, 2024

Uh oh!

facebook-github-bot commented Jul 18, 2024

Uh oh!

facebook-github-bot commented Jul 29, 2024

Uh oh!

Uh oh!

Remove arbitrary concurrency limiter for uplods and replace with a configurable limit that is disabled by default. #642

Remove arbitrary concurrency limiter for uplods and replace with a configurable limit that is disabled by default. #642

Uh oh!

Conversation

TheGrizzlyDev commented May 9, 2024

Uh oh!

JakobDegen commented May 14, 2024

Uh oh!

TheGrizzlyDev commented May 14, 2024

Uh oh!

TheGrizzlyDev commented May 17, 2024

Uh oh!

TheGrizzlyDev commented May 28, 2024

Uh oh!

valadez-abel commented Jun 3, 2024

Uh oh!

zjturner commented Jun 6, 2024

Uh oh!

TheGrizzlyDev commented Jun 17, 2024

Uh oh!

JakobDegen commented Jul 1, 2024

Uh oh!

TheGrizzlyDev commented Jul 2, 2024

Uh oh!

thoughtpolice commented Jul 2, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zjturner commented Jul 2, 2024

Uh oh!

TheGrizzlyDev commented Jul 11, 2024

Uh oh!

cjhopman commented Jul 18, 2024

Uh oh!

facebook-github-bot commented Jul 18, 2024

Uh oh!

facebook-github-bot commented Jul 29, 2024

Uh oh!

Uh oh!

thoughtpolice commented Jul 2, 2024 •

edited

Loading