
Perf about ParquetRecordBatchStream vs ParquetRecordBatchReader #2916

Closed
jiacai2050 opened this issue Oct 24, 2022 · 4 comments
Labels: parquet (Changes to the parquet crate), question (Further information is requested)

Comments

@jiacai2050
Contributor

jiacai2050 commented Oct 24, 2022

Which part is this question about
API Usage & Perf

Describe your question

I created two benchmarks based on the example code, and in my environment this is what I got:

  • ParquetRecordBatchReader took 4s
  • ParquetRecordBatchStream took 5s
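For reference, the two code paths being compared can be sketched roughly as below. This is a minimal sketch, not the exact benchmark code; it assumes a local Parquet file path and the `parquet` (with the `async` feature), `tokio`, and `futures` crates:

```rust
use futures::TryStreamExt;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ParquetRecordBatchStreamBuilder;

// Synchronous path: ParquetRecordBatchReader over a std::fs::File.
fn read_sync(path: &str) -> Result<usize, Box<dyn std::error::Error>> {
    let file = std::fs::File::open(path)?;
    let reader = ParquetRecordBatchReaderBuilder::try_new(file)?.build()?;
    let mut rows = 0;
    for batch in reader {
        rows += batch?.num_rows();
    }
    Ok(rows)
}

// Asynchronous path: ParquetRecordBatchStream over a tokio::fs::File.
async fn read_async(path: &str) -> Result<usize, Box<dyn std::error::Error>> {
    let file = tokio::fs::File::open(path).await?;
    let stream = ParquetRecordBatchStreamBuilder::new(file).await?.build()?;
    let rows = stream
        .try_fold(0usize, |acc, batch| async move { Ok(acc + batch.num_rows()) })
        .await?;
    Ok(rows)
}
```

Both ultimately decode the same column data; the difference is only in how the file IO is driven.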

The tested data is:

  • total rows: 40935755
  • row group: 4998

This is the schema of the Parquet file:

message arrow_schema {
  required int64 tsid (INTEGER(64,false));
  required int64 enddate (TIMESTAMP(MILLIS,false));
  optional int64 id;
  optional int64 code;
  optional binary source (STRING);
  optional int64 innercode;
  optional int64 del;
  optional int64 jsid;
  optional int64 updatetime (TIMESTAMP(MILLIS,false));
  optional double weight;
}

Additional context
I dug into the parquet crate's source code and found that both call build_array_reader to read the Parquet file, so the difference must lie above that layer.

@jiacai2050 jiacai2050 added the question Further information is requested label Oct 24, 2022
@tustvold
Contributor

This is expected, see the investigation under #1473.

The TLDR is that in the absence of resource contention, synchronous blocking code will often outperform the corresponding asynchronous code. This is especially true of file IO, where there aren't stable non-blocking operating system APIs, and so tokio implements this by offloading the task of reading from the files to a separate blocking thread pool. Eventually projects like tokio-uring may address this.

The advantage of async comes where either:

  • You are communicating over some network connection, e.g. to object storage
  • There is resource contention, where instead of blocking the thread on IO, you could be getting on with processing some other part of the query

Async is about efficiently multiplexing work; if you don't have anything to multiplex, you aren't going to see a return from it.

@jiacai2050
Contributor Author

Thanks for the quick reply, your point makes sense to me.

I just didn't expect it to be 20% slower; maybe io-uring is one solution, I will try it in future development.

For now, I think a practical solution is to create two Parquet readers and choose one depending on whether the file is local or in remote object storage.
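That dispatch could look something like the sketch below. All names here are hypothetical, not arrow-rs APIs; it only illustrates the selection logic:

```rust
// Hypothetical dispatch between a sync and an async Parquet read path,
// chosen by where the data lives. Enum and function names are
// illustrative only.
#[derive(Debug, PartialEq)]
enum ReadPath {
    SyncReader,  // ParquetRecordBatchReader
    AsyncStream, // ParquetRecordBatchStream
}

enum Source {
    LocalFile(String),
    ObjectStore(String),
}

fn pick_read_path(src: &Source) -> ReadPath {
    match src {
        // Local files: the blocking reader avoids the thread-pool
        // handoff overhead discussed above.
        Source::LocalFile(_) => ReadPath::SyncReader,
        // Remote object storage: the async stream can overlap network
        // IO with other work.
        Source::ObjectStore(_) => ReadPath::AsyncStream,
    }
}

fn main() {
    assert_eq!(
        pick_read_path(&Source::LocalFile("a.parquet".into())),
        ReadPath::SyncReader
    );
    assert_eq!(
        pick_read_path(&Source::ObjectStore("s3://bucket/a.parquet".into())),
        ReadPath::AsyncStream
    );
    println!("dispatch ok");
}
```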

@tustvold
Contributor

For now, I think a practical solution is to create two Parquet readers and choose one depending on whether the file is local or in remote object storage.

We used to do something similar in DataFusion; however, on a contended system that is performing other query-processing tasks it was not found to make an appreciable difference, so we moved away from this approach.

@jiacai2050
Contributor Author

Perf in practice is hard to measure; there are many factors to consider.

In my case, I re-tested with a Parquet file of 104022899 rows (4 GB); the costs were 10s vs 15s, a 50% slowdown.

Hope this data can help others with a similar issue.
