GH-41760: [C++][Parquet] Add file metadata read/write benchmark #41761
Conversation
@mapleFU @emkornfield FYI

Benchmark results here. We see that performance is O(num_columns * num_row_groups).
writer_properties_ =
    prop_builder.version(ParquetVersion::PARQUET_2_6)->disable_dictionary()->build();
Should we later add a benchmark for dropping statistics? (This is useful in some cases.)
(Besides, should we later add a benchmark for byte-array statistics decoding?)
What do you mean by "drop statistics"?
(A decryption benchmark can also be added later.)
> (Besides, should we later add a benchmark for byte-array statistics decoding?)

We can add benchmarks for whatever may be a performance bottleneck :-)
> What do you mean by "drop statistics"?

By default, Parquet enables statistics for each column, which adds a Statistics structure to every column chunk's metadata. We could add a benchmark for dropping the statistics. And for ByteArray columns, the statistics might be rather long.
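For context, a minimal sketch (not from this PR; it uses the public parquet-cpp metadata API and assumes `reader` is an already opened `parquet::ParquetFileReader`) of where these per-column-chunk statistics surface when reading the footer:

```cpp
// Illustrative only: look at the encoded statistics of the first column chunk.
// For BYTE_ARRAY columns the encoded min/max values can be long strings, which
// inflates the Thrift-encoded column chunk metadata.
// (needs "parquet/metadata.h" and "parquet/statistics.h")
std::shared_ptr<parquet::FileMetaData> metadata = reader->metadata();
std::unique_ptr<parquet::RowGroupMetaData> row_group = metadata->RowGroup(0);
std::unique_ptr<parquet::ColumnChunkMetaData> column_chunk = row_group->ColumnChunk(0);
if (column_chunk->is_stats_set()) {
  std::shared_ptr<parquet::Statistics> stats = column_chunk->statistics();
  // Size of the encoded min/max values stored in the footer for this chunk.
  size_t encoded_stats_bytes = stats->EncodeMin().size() + stats->EncodeMax().size();
  (void)encoded_stats_bytes;
}
```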
And we can add that in later patches; this could be a good start.
The stats are already encoded, so I think it is pretty trivial. But it may be different if it is a BYTE_ARRAY type with large strings.

That'd be valuable, but here we want to benchmark the metadata cost in isolation from other factors?
It's a good idea to also try this with the statistics dropped, since these do have a big impact and have been heavily discussed. That is to say, run the numbers with builder.disable_statistics(). Based on previous profiling I did in #39676, this should cut the metadata read time by about 1/3. Still, the main conclusion that the scaling is O(num_columns * num_row_groups) should hold. @pitrou
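For concreteness, a minimal sketch of that suggestion (a variation on the benchmark's existing writer properties, not part of this PR; `disable_statistics()` is a `parquet::WriterProperties::Builder` method):

```cpp
// Sketch only: same writer properties as the benchmark, but with column
// statistics disabled so their serialization cost is excluded from the run.
parquet::WriterProperties::Builder prop_builder;
std::shared_ptr<parquet::WriterProperties> writer_properties =
    prop_builder.version(ParquetVersion::PARQUET_2_6)
        ->disable_dictionary()
        ->disable_statistics()
        ->build();
```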
Also, thanks a bunch for going to the effort to create these benchmarks! I think they are very helpful for the discussion!
  benchmark.ReadFile(contents);
}
state.SetItemsProcessed(state.iterations());
state.counters["file_size"] = static_cast<double>(contents->size());
Add metadata()->SerializeToString().size() as the metadata size here?
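A minimal sketch of that suggestion (assuming the benchmark keeps the `reader` in scope; `SerializeToString()` is a public `parquet::FileMetaData` method):

```cpp
// Sketch only: report the Thrift-serialized footer size as its own counter,
// next to the existing file_size counter.
state.counters["metadata_size"] =
    static_cast<double>(reader->metadata()->SerializeToString().size());
```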
What additional information would it give?
The file has been written with different batches and multiple columns. The file_size might also change if the encoding changes. Instead of the batch size, shouldn't we focus more on the file metadata Thrift size / roundtrip, since no "data" is being read and here we only benchmark reading the metadata?
> The file has been written with different batches and multiple columns.

Varying the number of columns is important here.

> The file_size might also change if the encoding changes.

Did you check the data? It's tiny (one INT32 value per column chunk). The entire file will consist of metadata overhead.
> Did you check the data? It's tiny (one INT32 value per column chunk). The entire file will consist of metadata overhead.

I'll download one. I think we'd better get the "concrete" size of the footer, but we can keep the current implementation if you insist.
(Besides, the footer read defaults to 64 KiB; if the footer is greater than 64 KiB, another round of reading and parsing might be run here.)
"Metadata" is not only the footer, it's all the Thrift overhead + other signaling : page headers, statistics, column chunk metadata...
You're right.

auto reader = ParquetFileReader::Open(source, props);
auto metadata = reader->metadata();

Anyway, this only reads from the footer?
I think the concern from @mapleFU is that the write and read paths in this benchmark are not identical. On the read path, only the file footer is loaded. However, there is more work done on the write path (e.g. encoding and writing page headers).
Ah, I see, that's a good point. I can try to rewrite the write benchmark so as to only measure the metadata write part of writing a file, though that's dependent on where exactly it happens in the Parquet API calls.
Ok, I'm not sure it makes sense, as metadata and data are quite intertwined in practice (why would we not measure page header serialization?).
The reason the benchmarks are not symmetrical is that reading and writing are not symmetrical. You have to write an entire Parquet file at once (there's no practical way to avoid that), while you can read a Parquet file piecewise or partially.
I can rename the write benchmark to something else if that's less misleading...
Besides changing the benchmark names, I've fixed a bug where the benchmark was writing too many rows. Updated benchmark numbers:
@pitrou, thank you very much for your work on these benchmarks. They actually confirm the observation we made months ago. Great job! 👍
Force-pushed from 0db476c to c54125d.
I'll merge once CI is green.
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit f3d4639. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 12 possible false positives for unstable benchmarks that are known to sometimes produce them.
…apache#41761)

Following the discussions on the Parquet ML (see [this thread](https://lists.apache.org/thread/5jyhzkwyrjk9z52g0b49g31ygnz73gxo) and [this thread](https://lists.apache.org/thread/vs3w2z5bk6s3c975rrkqdttr1dpsdn7h)), and the various complaints about poor Parquet metadata performance on wide schemas, this adds a benchmark to measure the overhead of Parquet file metadata parsing or serialization for different numbers of row groups and columns.

Sample output:

```
-----------------------------------------------------------------------------------------------------------------------
Benchmark                                                               Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------------
WriteFileMetadataAndData/num_columns:1/num_row_groups:1            11743 ns        11741 ns        59930 data_size=54 file_size=290 items_per_second=85.1726k/s
WriteFileMetadataAndData/num_columns:1/num_row_groups:100         843137 ns       842920 ns          832 data_size=5.4k file_size=20.486k items_per_second=1.18635k/s
WriteFileMetadataAndData/num_columns:1/num_row_groups:1000       8232304 ns      8230294 ns           85 data_size=54k file_size=207.687k items_per_second=121.502/s
WriteFileMetadataAndData/num_columns:10/num_row_groups:1          101214 ns       101190 ns         6910 data_size=540 file_size=2.11k items_per_second=9.8824k/s
WriteFileMetadataAndData/num_columns:10/num_row_groups:100       8026185 ns      8024361 ns           87 data_size=54k file_size=193.673k items_per_second=124.621/s
WriteFileMetadataAndData/num_columns:10/num_row_groups:1000     81370293 ns     81343455 ns            8 data_size=540k file_size=1.94392M items_per_second=12.2936/s
WriteFileMetadataAndData/num_columns:100/num_row_groups:1         955862 ns       955528 ns          733 data_size=5.4k file_size=20.694k items_per_second=1.04654k/s
WriteFileMetadataAndData/num_columns:100/num_row_groups:100     80115516 ns     80086117 ns            9 data_size=540k file_size=1.94729M items_per_second=12.4866/s
WriteFileMetadataAndData/num_columns:100/num_row_groups:1000   856428565 ns    856065370 ns            1 data_size=5.4M file_size=19.7673M items_per_second=1.16814/s
WriteFileMetadataAndData/num_columns:1000/num_row_groups:1       9330003 ns      9327439 ns           75 data_size=54k file_size=211.499k items_per_second=107.211/s
WriteFileMetadataAndData/num_columns:1000/num_row_groups:100   834609159 ns    834354590 ns            1 data_size=5.4M file_size=19.9623M items_per_second=1.19853/s
ReadFileMetadata/num_columns:1/num_row_groups:1                     3824 ns         3824 ns       182381 data_size=54 file_size=290 items_per_second=261.518k/s
ReadFileMetadata/num_columns:1/num_row_groups:100                  88519 ns        88504 ns         7879 data_size=5.4k file_size=20.486k items_per_second=11.299k/s
ReadFileMetadata/num_columns:1/num_row_groups:1000                849558 ns       849391 ns          825 data_size=54k file_size=207.687k items_per_second=1.17731k/s
ReadFileMetadata/num_columns:10/num_row_groups:1                   19918 ns        19915 ns        35449 data_size=540 file_size=2.11k items_per_second=50.2138k/s
ReadFileMetadata/num_columns:10/num_row_groups:100                715822 ns       715667 ns          975 data_size=54k file_size=193.673k items_per_second=1.3973k/s
ReadFileMetadata/num_columns:10/num_row_groups:1000              7017008 ns      7015432 ns          100 data_size=540k file_size=1.94392M items_per_second=142.543/s
ReadFileMetadata/num_columns:100/num_row_groups:1                 175988 ns       175944 ns         3958 data_size=5.4k file_size=20.694k items_per_second=5.68363k/s
ReadFileMetadata/num_columns:100/num_row_groups:100              6814382 ns      6812781 ns          103 data_size=540k file_size=1.94729M items_per_second=146.783/s
ReadFileMetadata/num_columns:100/num_row_groups:1000            77858645 ns     77822157 ns            9 data_size=5.4M file_size=19.7673M items_per_second=12.8498/s
ReadFileMetadata/num_columns:1000/num_row_groups:1               1670001 ns      1669563 ns          419 data_size=54k file_size=211.499k items_per_second=598.959/s
ReadFileMetadata/num_columns:1000/num_row_groups:100            77339599 ns     77292924 ns            9 data_size=5.4M file_size=19.9623M items_per_second=12.9378/s
```

* GitHub Issue: apache#41760

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>