[C++][Flight] See if reading/writing to gRPC get/put streams asynchronously helps performance #26337
Comments
Yibo Cai / @cyb70289:
David Li / @lidavidm:
David Li / @lidavidm: This way we would do the I/O-bound work (gRPC read/write) and the CPU-bound work (Arrow record batch encoding/decoding) on separate threads, with readahead on the I/O side. In our tests it did not have any benefit. I didn't test the actual async gRPC APIs; however, unless there is a major difference in how those are implemented on the gRPC side, I'd be doubtful that they'd help by themselves unless they unlock some opportunity to parallelize/pipeline work. But if you are investigating, we'd be curious to see the results! It could definitely improve how ergonomic the APIs are and/or open a path to asyncio support in the Python bindings. It might also improve latency rather than throughput (our tests have focused on throughput).
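For illustration, the pattern being described looks roughly like the sketch below: a bounded queue fed by an encoding thread and drained by the thread doing the gRPC writes. All of the names here are placeholders rather than the actual prototype code.

```cpp
// Illustrative sketch only: a background thread encodes record batches into
// payloads while the caller drains the queue and writes them to the stream.
// Payload / SendAll stand in for the real Flight and IPC machinery.
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <optional>
#include <thread>
#include <vector>

struct Payload {};  // placeholder for an encoded FlightData / IPC payload

class BoundedQueue {
 public:
  explicit BoundedQueue(size_t capacity) : capacity_(capacity) {}

  void Push(Payload p) {
    std::unique_lock<std::mutex> lock(mu_);
    not_full_.wait(lock, [&] { return items_.size() < capacity_; });
    items_.push_back(std::move(p));
    not_empty_.notify_one();
  }

  void Close() {
    std::lock_guard<std::mutex> lock(mu_);
    closed_ = true;
    not_empty_.notify_all();
  }

  std::optional<Payload> Pop() {
    std::unique_lock<std::mutex> lock(mu_);
    not_empty_.wait(lock, [&] { return !items_.empty() || closed_; });
    if (items_.empty()) return std::nullopt;  // closed and fully drained
    Payload p = std::move(items_.front());
    items_.pop_front();
    not_full_.notify_one();
    return p;
  }

 private:
  std::mutex mu_;
  std::condition_variable not_full_, not_empty_;
  std::deque<Payload> items_;
  size_t capacity_;
  bool closed_ = false;
};

// CPU-bound work (encoding/compression) happens on the producer thread;
// the consumer only does the I/O-bound gRPC write.
void SendAll(const std::vector<int>& batches) {
  BoundedQueue queue(/*capacity=*/4);  // readahead depth
  std::thread encoder([&] {
    for (const auto& batch : batches) {
      (void)batch;
      queue.Push(Payload{});  // encode + (optionally) compress the batch here
    }
    queue.Close();
  });
  while (auto payload = queue.Pop()) {
    // blocking gRPC stream write of *payload would go here
  }
  encoder.join();
}
```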
yibocai#1: For streams = 1, I see a stable improvement in speed (1200 -> 1700) and latency (90 -> 70).
David Li / @lidavidm: What is the effect when no compression is used?
yibocai#1: Per my profiling, 70% of CPU time is spent in compression and 30% in packet sending, so an improvement of ~30% looks reasonable when interleaving them.
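A quick back-of-the-envelope check of that estimate: sequentially, each batch costs roughly 0.7 (compress) + 0.3 (send) = 1.0 units of time; overlapped on two threads, the per-batch cost is bounded by the slower stage, max(0.7, 0.3) = 0.7, i.e. about a 30% reduction, ignoring synchronization overhead.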
David Li / @lidavidm: Current master:
This branch:
yibocai#1:
yibocai#1: My POC test code is too rough to use easily. I hardcoded compression in client.cc; the master branch won't use compression by default. I meant that you should comment out the INTERLEAVE_PREPARE_AND_SEND macro to benchmark against the master branch. I will provide a patch adding command-line options to exercise both code paths easily.
David Li / @lidavidm: I'll try to test the three cases here (no compression, compression, compression with background thread) on a pair of EC2 instances when I get a chance - testing over localhost probably isn't the fairest comparison.
yibocai#1:
David Li / @lidavidm: I tested two EC2 t3.xlarge instances (4 cores, 16 GB RAM). They should have ~5 Gbps of bandwidth between them. All benchmarks were run as
With compression, with background thread:
With compression, without background thread:
Without compression, without background thread:
In short, for Flight, it seems compression is simply not worth it, regardless of whether there's a background thread. This tradeoff may change when less bandwidth is available. It does seem p99 latency is better. And there are other factors: for instance, the benchmark uses random data, which may not compress well; a different dataset may perform better. ZSTD is relatively fast, but here we aren't tuning it for compression/decompression speed.
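For reference, tuning the codec would look roughly like the sketch below on the client side, assuming the current Result-based C++ Flight API and ZSTD level 1 as an example; this is illustrative, not the benchmark's actual code.

```cpp
// Sketch: enable ZSTD IPC body compression with an explicit (fast) level for a
// Flight DoPut call. Descriptor/schema plumbing and error handling are minimal.
#include <arrow/flight/client.h>
#include <arrow/ipc/options.h>
#include <arrow/result.h>
#include <arrow/util/compression.h>

arrow::Status PutWithZstd(arrow::flight::FlightClient* client,
                          const arrow::flight::FlightDescriptor& descriptor,
                          const std::shared_ptr<arrow::Schema>& schema) {
  auto ipc_options = arrow::ipc::IpcWriteOptions::Defaults();
  // Level 1 favors speed over ratio; the default level may be slower.
  ARROW_ASSIGN_OR_RAISE(
      auto codec, arrow::util::Codec::Create(arrow::Compression::ZSTD,
                                             /*compression_level=*/1));
  ipc_options.codec = std::move(codec);

  arrow::flight::FlightCallOptions call_options;
  call_options.write_options = ipc_options;

  // The writer compresses each record batch body before it hits the wire.
  ARROW_ASSIGN_OR_RAISE(auto stream,
                        client->DoPut(call_options, descriptor, schema));
  // ... write record batches via stream.writer ...
  return stream.writer->Close();
}
```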
yibocai#1: Shall we replace the random test data with some real-world data? Otherwise, the compression test in the current benchmark will always be a pure loss.
David Li / @lidavidm: The command in all cases was
Master (no compression):
flight-poc, with compression, without asynchronous compression:
flight-poc, with compression, with async compression:
So it seems that with real data, things get even worse. Async compression is better than sync compression, but neither is in the ballpark of simply not compressing. Of course, this is all over localhost, which is likely not fair to compression, so maybe I should try over EC2 next (~600 MiB/s max bandwidth).
David Li / @lidavidm: Without compression:
flight-poc, with sync compression:
flight-poc, with async compression:
It still doesn't seem very beneficial. Maybe if we have a very compressible dataset, and/or tune the compressor used? |
yibocai#1: What about encryption? Is it handled transparently by gRPC?
Just to add to the comment immediately above: I've been doing some benchmarking to see how much of an impact TLS has, and it can halve or even cut throughput to a third, at least for large (~GiB) streaming RPCs; this pattern holds for at least C++ and Go. See https://github.com/amoeba/grpc_bench?tab=readme-ov-file#results.
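On the encryption question above: in the C++ Flight API, TLS is essentially transparent once you connect to a TLS location; the encryption itself is handled by the underlying gRPC channel. A minimal sketch, assuming the current Result-based API (host, port, and certificate handling are illustrative):

```cpp
// Sketch: connect a Flight client over TLS; gRPC performs the actual
// encryption underneath. Certificate handling here is illustrative.
#include <arrow/flight/client.h>
#include <arrow/result.h>

arrow::Result<std::unique_ptr<arrow::flight::FlightClient>> ConnectTls(
    const std::string& host, int port, const std::string& root_certs_pem) {
  ARROW_ASSIGN_OR_RAISE(auto location,
                        arrow::flight::Location::ForGrpcTls(host, port));
  auto options = arrow::flight::FlightClientOptions::Defaults();
  options.tls_root_certs = root_certs_pem;  // PEM bundle; gRPC falls back to defaults if empty
  return arrow::flight::FlightClient::Connect(location, options);
}
```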
We don't use any asynchronous concepts in the way Flight is implemented now, i.e. IPC deconstruction/reconstruction (which may include compression!) is not performed concurrently with moving FlightData objects through the gRPC machinery, which may yield suboptimal performance.
It might be better to apply an actor-type approach where a dedicated thread retrieves and prepares the next raw IPC message (within a Future) while the current IPC message is being processed, so that reading/writing to/from the gRPC stream is not blocked on the IPC code doing its work.
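A minimal sketch of that readahead idea, purely illustrative (PrepareMessage and WriteToGrpcStream are placeholders, not Arrow or gRPC APIs): prepare message N+1 on a worker thread while message N is being written to the stream.

```cpp
// Illustrative only: overlap preparation (IPC encode + optional compression)
// of the next message with writing the current one to the gRPC stream.
#include <future>

struct FlightDataLike {};  // stands in for the wire-level FlightData payload

// CPU-bound in reality: encode and (optionally) compress batch `index`.
FlightDataLike PrepareMessage(int /*index*/) { return FlightDataLike{}; }
// I/O-bound in reality: blocking write to the gRPC stream.
void WriteToGrpcStream(const FlightDataLike&) {}

void WriteAll(int num_messages) {
  if (num_messages <= 0) return;
  auto next = std::async(std::launch::async, PrepareMessage, 0);
  for (int i = 0; i < num_messages; ++i) {
    FlightDataLike current = next.get();
    if (i + 1 < num_messages) {
      next = std::async(std::launch::async, PrepareMessage, i + 1);
    }
    WriteToGrpcStream(current);  // overlaps with preparing message i + 1
  }
}
```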
Reporter: Wes McKinney / @wesm
Note: This issue was originally created as ARROW-10351. Please see the migration documentation for further details.