Remove arbitrary concurrency limiter for uploads and replace with a configurable limit that is disabled by default. #642
Conversation
cc @benbrittain who originally added this
Yeah, but even if those are the common exercise conditions, it's still pretty easy to imagine that they might not be the failure conditions. I.e. you could imagine some server that has one thread per client or something like that and falls over if a client shows up trying to write too many big files at once. I think there's a nice in-between option to go for though, which both preserves this property and sounds closer to what you actually want: can we only apply this concurrency limit to the non-batched uploads, and let batched uploads proceed at unlimited concurrency?
Which is why there is an optional configuration to limit concurrency. Any default value would be making likely-incorrect assumptions about the implementation and exercise conditions of an RE server, and that's just wrong: there are multiple implementations, all of them wildly different, and they can run on extremely large distributed clusters with a wide array of hardware and software specifications. A default limit invariably results in lower performance with no real benefit. On the other hand, if such a limit is necessary for a given RE server, it can state this setting in its docs. I think this is already a middle ground: it keeps the potential benefit of limiting concurrency whilst keeping a performant setup by default. Wdyt?
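For illustration, this is roughly the shape such an opt-in setting could take in `.buckconfig`. The section name follows buck2's existing RE client config, but the key name here is hypothetical, not necessarily what this PR ends up using:

```ini
# Hypothetical .buckconfig entry; the actual key name is discussed below.
[buck2_re_client]
# Leaving this unset (the default) imposes no client-side cap on upload
# concurrency. Set it only if your RE server's docs call for throttling.
# upload_concurrency_limit = 64
```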
friendly ping @JakobDegen :)
Ping :)
I tested this on our RE setup, and without it the client seemed to be stuck for 40+ minutes on re_upload stages, but with this PR it's no longer blocked on re_upload.
Agree that it would be great to see some forward progress on this. @JakobDegen is there anything you're waiting for?
ping @JakobDegen
@TheGrizzlyDev yeah, sorry, I realized that in my previous message I managed to not actually write any of the things that I was most worried about. The reason that I think this kind of a concurrency limiter makes sense isn't so much the server side of things, but rather the client side. When it comes to batched uploads, or other uploads that are coming from memory and not from disk, I'm willing to trust the system network stack to make sure that resources are effectively utilized. But non-batched uploads generally read from disk as a part of the request, and limiting the concurrency of disk accesses in buck2 does seem like something we should do, if for no other reason than that we don't want to run out of file descriptors. Thoughts?
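To make the concern concrete, here is a minimal sketch (illustrative only, not buck2's actual internals) of the kind of limiter I have in mind: a semaphore that caps how many files are open for upload at once, while leaving the concurrency of uploads that are already in memory untouched. `MAX_OPEN_FOR_UPLOAD` and `upload_blob` are made-up names:

```rust
// Sketch only: cap concurrent file opens with a semaphore. The constant and
// `upload_blob` are illustrative stand-ins, not real buck2 APIs.
use std::path::PathBuf;
use std::sync::Arc;
use tokio::sync::Semaphore;

const MAX_OPEN_FOR_UPLOAD: usize = 512;

async fn upload_files(paths: Vec<PathBuf>) -> std::io::Result<()> {
    let permits = Arc::new(Semaphore::new(MAX_OPEN_FOR_UPLOAD));
    let mut tasks = Vec::new();
    for path in paths {
        let permits = Arc::clone(&permits);
        tasks.push(tokio::spawn(async move {
            // The permit bounds simultaneously open fds; network concurrency
            // for blobs already in memory (e.g. batched ones) is unaffected.
            let _permit = permits.acquire_owned().await.expect("semaphore closed");
            let bytes = tokio::fs::read(&path).await?;
            upload_blob(bytes).await
        }));
    }
    for task in tasks {
        task.await.expect("upload task panicked")?;
    }
    Ok(())
}

async fn upload_blob(_bytes: Vec<u8>) -> std::io::Result<()> {
    // Stand-in for the actual CAS upload call.
    Ok(())
}
```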
If that's the case then this limiter still wouldn't help: you'd need a completely different type of limiter that controls open file descriptors across a whole build. Also, I find it highly unlikely this could happen, whereas the current limiter is certain to slow down uploads by a huge factor.
I think that's really going to depend on a number of factors. A modern NVMe drive will love to eat through a large queue of concurrent read IOs, while a spinning-rust hard drive will slow to a crawl. So I think you need to take actual drive geometry/filesystem type into account if you want to handle concurrent I/Os intelligently and respectfully. In practice I think just having a path for HDDs (single-threaded I/O path with async reads of single files) and "not HDDs" (concurrent/multithreaded) is enough, leaving the concurrent case uncapped and letting the block/IO layer handle it (I think even most networked block devices/filesystems will handle the concurrency reasonably well).

Regarding file descriptors, while I would like it if Buck2 were more respectful of fd limits, I think it's probably just as good in practice to encourage people to uncap their fd limits if the alternative is throttling, and at the same time use loud diagnostics to inform them of that. For example, when using Reindeer/Buck2 in a Rust project, I hit #419 about a million times a day on my MBA, so I have to start every shell with a `ulimit -n` bump. Now, I would like it if Buck respected this, because it would mean I didn't have to do that. The real thing is that an fd limit is at least in theory solvable pretty quickly with a quick call to `setrlimit`.
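For the record, a quick sketch of what that call looks like from Rust (assuming the `libc` crate; Unix-only, and not code from this PR):

```rust
// Sketch: raise the soft fd limit to the hard cap at startup, which is
// roughly what `ulimit -n` does for a shell session.
fn raise_nofile_limit() -> std::io::Result<()> {
    unsafe {
        let mut lim = libc::rlimit { rlim_cur: 0, rlim_max: 0 };
        if libc::getrlimit(libc::RLIMIT_NOFILE, &mut lim) != 0 {
            return Err(std::io::Error::last_os_error());
        }
        lim.rlim_cur = lim.rlim_max; // soft limit up to the hard maximum
        if libc::setrlimit(libc::RLIMIT_NOFILE, &lim) != 0 {
            return Err(std::io::Error::last_os_error());
        }
    }
    Ok(())
}
```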
I'll also add that for large projects, you'll already hit your fd limit just from the fs monitor, so in practice everyone with a large enough project is already going to uncap. Even beyond that, the concurrency is inherently limited by the max number of concurrent actions, as well as by natural serialization points in the build, which is all to say that the present behavior is far worse than any potential impact caused by increasing the number of file descriptors.
@JakobDegen ping :)
There may be better longer-term solutions here, but this seems like a fine improvement over the current state. I think we should just change the key to something like …
@cjhopman has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
The existing limit doesn't really help achieve its stated goals, as it cannot help in conditions where multiple clients are using RE at the same time, which are the common exercise conditions. Since the flag can be used to quickly limit uploads in case of issues with RBE, I opted to keep it in some form by making it a setting, though it likely isn't very useful, so it's disabled by default. Instead we should rely on other control mechanisms, like the number of connections and TCP or HTTP/2 flow control, or, as in this case, explicit errors coming from the RE protocol itself (like `RESOURCE_EXHAUSTED` or others).
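As an illustration of relying on the protocol's own backpressure signal, here is a minimal sketch of a client-side reaction to `RESOURCE_EXHAUSTED` (assuming tonic; `upload_with_backoff` and `do_upload` are hypothetical names, with `do_upload` standing in for the actual upload RPC):

```rust
// Sketch: retry with exponential backoff only when the server itself says it
// is overloaded, instead of guessing a fixed client-side concurrency cap.
use std::time::Duration;
use tonic::{Code, Status};

async fn upload_with_backoff<F, Fut>(mut do_upload: F) -> Result<(), Status>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<(), Status>>,
{
    let mut delay = Duration::from_millis(100);
    for _ in 0..5 {
        match do_upload().await {
            // RESOURCE_EXHAUSTED is the server's explicit "slow down" signal.
            Err(status) if status.code() == Code::ResourceExhausted => {
                tokio::time::sleep(delay).await;
                delay *= 2;
            }
            other => return other,
        }
    }
    do_upload().await // last attempt; propagate whatever it returns
}
```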