
Perf about ParquetRecordBatchStream vs ParquetRecordBatchReader #2916

Closed
jiacai2050 opened this issue Oct 24, 2022 · 4 comments
Labels: parquet (Changes to the parquet crate), question (Further information is requested)

Comments

@jiacai2050
Contributor

jiacai2050 commented Oct 24, 2022

Which part is this question about
API Usage & Perf

Describe your question

I created two benchmarks based on the example code, and in my environment this is what I got:

  • ParquetRecordBatchReader took 4s
  • ParquetRecordBatchStream took 5s
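For reference, the two code paths being compared can be sketched roughly as below. This is a minimal sketch, not the exact benchmark code; it assumes a local Parquet file path and the `parquet` (with the `async` feature), `tokio`, and `futures` crates:

```rust
use futures::TryStreamExt;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ParquetRecordBatchStreamBuilder;

// Synchronous path: ParquetRecordBatchReader over a std::fs::File.
fn read_sync(path: &str) -> Result<usize, Box<dyn std::error::Error>> {
    let file = std::fs::File::open(path)?;
    let reader = ParquetRecordBatchReaderBuilder::try_new(file)?.build()?;
    let mut rows = 0;
    for batch in reader {
        rows += batch?.num_rows();
    }
    Ok(rows)
}

// Asynchronous path: ParquetRecordBatchStream over a tokio::fs::File.
async fn read_async(path: &str) -> Result<usize, Box<dyn std::error::Error>> {
    let file = tokio::fs::File::open(path).await?;
    let stream = ParquetRecordBatchStreamBuilder::new(file).await?.build()?;
    let rows = stream
        .try_fold(0usize, |acc, batch| async move { Ok(acc + batch.num_rows()) })
        .await?;
    Ok(rows)
}
```

Both ultimately decode the same column data; the difference is only in how the file IO is driven.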

The tested data is:

  • total rows: 40935755
  • row group: 4998

This is the schema of the Parquet file:

message arrow_schema {
  required int64 tsid (INTEGER(64,false));
  required int64 enddate (TIMESTAMP(MILLIS,false));
  optional int64 id;
  optional int64 code;
  optional binary source (STRING);
  optional int64 innercode;
  optional int64 del;
  optional int64 jsid;
  optional int64 updatetime (TIMESTAMP(MILLIS,false));
  optional double weight;
}

Additional context
I dug into the parquet crate's source code and found that both call build_array_reader to read the Parquet file, so the difference must lie above that layer.

@jiacai2050 jiacai2050 added the question Further information is requested label Oct 24, 2022
@tustvold
Contributor

This is expected, see the investigation under #1473.

The TLDR is that in the absence of resource contention, synchronous blocking code will often outperform the corresponding asynchronous code. This is especially true of file IO, where there aren't stable non-blocking operating system APIs, and so tokio implements this by offloading the task of reading from the files to a separate blocking thread pool. Eventually projects like tokio-uring may address this.

The advantage of async comes where either:

  • You are communicating over some network connection, e.g. to object storage
  • There is resource contention, where instead of blocking the thread on IO, you could be getting on with processing some other part of the query

Async is about efficiently multiplexing work; if you don't have anything to multiplex, you aren't going to see a return from it.

@jiacai2050
Contributor Author

Thanks for the quick reply, your point makes sense to me.

I just didn't expect it to be 20% slower; maybe io-uring is one solution, I will try it in future development.

For now, I think a practical solution is to create two Parquet readers and choose one depending on whether the file is local or in remote object storage.
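That dispatch could look something like the sketch below. All names here are hypothetical, not arrow-rs APIs; it only illustrates the selection logic:

```rust
// Hypothetical dispatch between a sync and an async Parquet read path,
// chosen by where the data lives. Enum and function names are
// illustrative only.
#[derive(Debug, PartialEq)]
enum ReadPath {
    SyncReader,  // ParquetRecordBatchReader
    AsyncStream, // ParquetRecordBatchStream
}

enum Source {
    LocalFile(String),
    ObjectStore(String),
}

fn pick_read_path(src: &Source) -> ReadPath {
    match src {
        // Local files: the blocking reader avoids the thread-pool
        // handoff overhead discussed above.
        Source::LocalFile(_) => ReadPath::SyncReader,
        // Remote object storage: the async stream can overlap network
        // IO with other work.
        Source::ObjectStore(_) => ReadPath::AsyncStream,
    }
}

fn main() {
    assert_eq!(
        pick_read_path(&Source::LocalFile("a.parquet".into())),
        ReadPath::SyncReader
    );
    assert_eq!(
        pick_read_path(&Source::ObjectStore("s3://bucket/a.parquet".into())),
        ReadPath::AsyncStream
    );
    println!("dispatch ok");
}
```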

@tustvold
Contributor

For now, I think a practical solution is to create two Parquet readers and choose one depending on whether the file is local or in remote object storage.

We used to do something similar in DataFusion; however, on a contended system that is performing other query-processing tasks it was not found to make an appreciable difference, so we moved away from this approach.

@jiacai2050
Contributor Author

Perf in practice is hard to measure; there are many factors to consider.

In my case, I re-tested with a Parquet file of 104022899 rows (4 GB); the costs were 10s vs 15s, a 50% slowdown.

Hope this data can help others with a similar issue.
