Perf about ParquetRecordBatchStream vs ParquetRecordBatchReader #2916
Comments
This is expected, see the investigation under #1473. The TLDR is that in the absence of resource contention, synchronous blocking code will often outperform the corresponding asynchronous code. This is especially true of file IO, where there aren't stable non-blocking operating system APIs, and so tokio implements this by offloading the task of reading from the files to a separate blocking thread pool. Eventually projects like tokio-uring may address this. The advantage of async comes where either:
Async is about efficiently multiplexing work. If you don't have anything to multiplex, you aren't going to see a return from it.
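To make the thread-pool point concrete, here is a minimal std-only sketch of the two paths. It is a simplified model, not tokio's actual implementation: `read_sync` and `read_offloaded` are hypothetical names, and a freshly spawned thread plus a channel stands in for tokio's blocking pool. The offloaded version does the same blocking read, but pays for a thread handoff and a channel round trip on top.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::{Path, PathBuf};
use std::sync::mpsc;
use std::thread;

/// Direct blocking read on the calling thread.
fn read_sync(path: &Path) -> io::Result<Vec<u8>> {
    fs::read(path)
}

/// Simplified model of offloading (assumption: NOT tokio's real code).
/// The same blocking read runs on another thread, and the caller
/// waits for the result to come back over a channel.
fn read_offloaded(path: &Path) -> io::Result<Vec<u8>> {
    let (tx, rx) = mpsc::channel();
    let owned: PathBuf = path.to_path_buf();
    thread::spawn(move || {
        // Ignore the send error: it only fails if the receiver is gone.
        let _ = tx.send(fs::read(&owned));
    });
    rx.recv().expect("worker thread dropped the sender")
}

fn main() -> io::Result<()> {
    // Write a small scratch file so the example is self-contained.
    let path = std::env::temp_dir().join("offload_sketch.bin");
    fs::File::create(&path)?.write_all(&[7u8; 4096])?;

    let direct = read_sync(&path)?;
    let offloaded = read_offloaded(&path)?;
    assert_eq!(direct, offloaded);
    println!("read {} bytes both ways", direct.len());

    fs::remove_file(&path)?;
    Ok(())
}
```

Both functions return identical bytes; the only difference is the extra hop, which is pure overhead when there is nothing else for the calling thread to do in the meantime.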
Thanks for the quick reply, your point makes sense to me. I just didn't expect performance to drop by 20%. For now, I think a practical solution is to create two Parquet readers and choose one depending on whether the files are on local disk or in remote object storage.
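A rough sketch of that selection logic, assuming the location is given as a path or URL string. `ReaderKind` and `choose_reader` are hypothetical names for illustration and are not part of the parquet crate; the real readers would be constructed in each branch.

```rust
/// Which reader to construct (hypothetical enum, for illustration only).
#[derive(Debug, PartialEq, Eq)]
enum ReaderKind {
    /// Blocking reader, e.g. ParquetRecordBatchReader.
    Sync,
    /// Async reader, e.g. ParquetRecordBatchStream.
    Async,
}

/// Pick a reader from the location string.
/// Assumption: plain paths and `file://` URLs are local; any other
/// scheme (s3://, gs://, https://, ...) is treated as remote storage.
fn choose_reader(location: &str) -> ReaderKind {
    match location.split_once("://") {
        None | Some(("file", _)) => ReaderKind::Sync,
        Some(_) => ReaderKind::Async,
    }
}

fn main() {
    for loc in ["/data/a.parquet", "file:///data/a.parquet", "s3://bucket/a.parquet"] {
        println!("{loc} -> {:?}", choose_reader(loc));
    }
}
```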
We used to do something like this in DataFusion. However, on a contended system that is performing other query processing tasks, it was not found to make an appreciable difference, so we moved away from this approach.
Perf in practice is hard to measure, with so many factors to consider. In my case, I re-tested with a Parquet file of 104022899 rows (4G), and the two readers took 10s vs 15s, a 50% slowdown in total. I hope this data can help others with a similar issue.
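For anyone comparing their own numbers, the arithmetic behind those figures can be sketched as follows. The helper names are hypothetical, and the code assumes 10s is the faster run and 15s the slower one, which the comment does not state explicitly.

```rust
/// Rows processed per second for a run.
fn rows_per_sec(rows: u64, secs: f64) -> f64 {
    rows as f64 / secs
}

/// Extra time of the slower run, as a percentage of the faster run.
fn slowdown_pct(fast_secs: f64, slow_secs: f64) -> f64 {
    (slow_secs - fast_secs) / fast_secs * 100.0
}

fn main() {
    let rows: u64 = 104_022_899;
    // Assumption: 10s is the faster run, 15s the slower one.
    let (fast, slow) = (10.0, 15.0);

    println!("fast: {:.0} rows/s", rows_per_sec(rows, fast));
    println!("slow: {:.0} rows/s", rows_per_sec(rows, slow));
    println!("slowdown: {:.0}%", slowdown_pct(fast, slow));
}
```

Note that the same pair of timings reads as 50% more wall-clock time or as roughly a one-third drop in throughput, depending on which run is taken as the baseline.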
Which part is this question about
API Usage & Perf
Describe your question
I created two benchmarks based on the example code, and in my environment this is what I got:
The tested data is:
This is the schema of the Parquet file:
Additional context
I dug into the parquet crate's source code and found that both readers call `build_array_reader` to read the Parquet file, so the difference may lie above this layer.