[C++][Parquet] Add file metadata read/write benchmark #41760

Closed
pitrou opened this issue May 21, 2024 · 2 comments

pitrou commented May 21, 2024

Describe the enhancement requested

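Following the discussions on the Parquet ML (see [this thread](https://lists.apache.org/thread/5jyhzkwyrjk9z52g0b49g31ygnz73gxo) and [this thread](https://lists.apache.org/thread/vs3w2z5bk6s3c975rrkqdttr1dpsdn7h)), we should add a benchmark to measure the overhead of Parquet file metadata parsing or serialization for different numbers of row groups and columns.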

Component(s)

C++, Parquet

pitrou self-assigned this May 21, 2024
pitrou added a commit that referenced this issue May 22, 2024
Following the discussions on the Parquet ML (see [this thread](https://lists.apache.org/thread/5jyhzkwyrjk9z52g0b49g31ygnz73gxo) and [this thread](https://lists.apache.org/thread/vs3w2z5bk6s3c975rrkqdttr1dpsdn7h)), and the various complaints about poor Parquet metadata performance on wide schemas, this adds a benchmark to measure the overhead of Parquet file metadata parsing or serialization for different numbers of row groups and columns.

Sample output:
```
-----------------------------------------------------------------------------------------------------------------------
Benchmark                                                             Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------------
WriteFileMetadataAndData/num_columns:1/num_row_groups:1           11743 ns        11741 ns        59930 data_size=54 file_size=290 items_per_second=85.1726k/s
WriteFileMetadataAndData/num_columns:1/num_row_groups:100        843137 ns       842920 ns          832 data_size=5.4k file_size=20.486k items_per_second=1.18635k/s
WriteFileMetadataAndData/num_columns:1/num_row_groups:1000      8232304 ns      8230294 ns           85 data_size=54k file_size=207.687k items_per_second=121.502/s
WriteFileMetadataAndData/num_columns:10/num_row_groups:1         101214 ns       101190 ns         6910 data_size=540 file_size=2.11k items_per_second=9.8824k/s
WriteFileMetadataAndData/num_columns:10/num_row_groups:100      8026185 ns      8024361 ns           87 data_size=54k file_size=193.673k items_per_second=124.621/s
WriteFileMetadataAndData/num_columns:10/num_row_groups:1000    81370293 ns     81343455 ns            8 data_size=540k file_size=1.94392M items_per_second=12.2936/s
WriteFileMetadataAndData/num_columns:100/num_row_groups:1        955862 ns       955528 ns          733 data_size=5.4k file_size=20.694k items_per_second=1.04654k/s
WriteFileMetadataAndData/num_columns:100/num_row_groups:100    80115516 ns     80086117 ns            9 data_size=540k file_size=1.94729M items_per_second=12.4866/s
WriteFileMetadataAndData/num_columns:100/num_row_groups:1000  856428565 ns    856065370 ns            1 data_size=5.4M file_size=19.7673M items_per_second=1.16814/s
WriteFileMetadataAndData/num_columns:1000/num_row_groups:1      9330003 ns      9327439 ns           75 data_size=54k file_size=211.499k items_per_second=107.211/s
WriteFileMetadataAndData/num_columns:1000/num_row_groups:100  834609159 ns    834354590 ns            1 data_size=5.4M file_size=19.9623M items_per_second=1.19853/s

ReadFileMetadata/num_columns:1/num_row_groups:1                    3824 ns         3824 ns       182381 data_size=54 file_size=290 items_per_second=261.518k/s
ReadFileMetadata/num_columns:1/num_row_groups:100                 88519 ns        88504 ns         7879 data_size=5.4k file_size=20.486k items_per_second=11.299k/s
ReadFileMetadata/num_columns:1/num_row_groups:1000               849558 ns       849391 ns          825 data_size=54k file_size=207.687k items_per_second=1.17731k/s
ReadFileMetadata/num_columns:10/num_row_groups:1                  19918 ns        19915 ns        35449 data_size=540 file_size=2.11k items_per_second=50.2138k/s
ReadFileMetadata/num_columns:10/num_row_groups:100               715822 ns       715667 ns          975 data_size=54k file_size=193.673k items_per_second=1.3973k/s
ReadFileMetadata/num_columns:10/num_row_groups:1000             7017008 ns      7015432 ns          100 data_size=540k file_size=1.94392M items_per_second=142.543/s
ReadFileMetadata/num_columns:100/num_row_groups:1                175988 ns       175944 ns         3958 data_size=5.4k file_size=20.694k items_per_second=5.68363k/s
ReadFileMetadata/num_columns:100/num_row_groups:100             6814382 ns      6812781 ns          103 data_size=540k file_size=1.94729M items_per_second=146.783/s
ReadFileMetadata/num_columns:100/num_row_groups:1000           77858645 ns     77822157 ns            9 data_size=5.4M file_size=19.7673M items_per_second=12.8498/s
ReadFileMetadata/num_columns:1000/num_row_groups:1              1670001 ns      1669563 ns          419 data_size=54k file_size=211.499k items_per_second=598.959/s
ReadFileMetadata/num_columns:1000/num_row_groups:100           77339599 ns     77292924 ns            9 data_size=5.4M file_size=19.9623M items_per_second=12.9378/s
```
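One rough way to read the table above is to normalize each row by the number of column chunks it describes, i.e. `num_columns * num_row_groups`. The sketch below does that for three rows copied from the sample output. The names `Sample`, `NumColumnChunks`, `ReadNsPerChunk`, `WriteNsPerChunk`, `OverheadBytesPerChunk` and `PrintReport` are hypothetical helpers for illustration and are not part of the benchmark. It assumes the counter suffixes are decimal (`k` = 1000, `M` = 1000000), and it treats `file_size - data_size` only as an approximation of the non-data bytes in the file, not as an exact metadata size.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// One (num_columns, num_row_groups) configuration from the sample output.
// Times are in nanoseconds, sizes in bytes (decimal suffixes assumed).
struct Sample {
  std::int64_t num_columns;
  std::int64_t num_row_groups;
  double write_ns;   // "Time" column of WriteFileMetadataAndData
  double read_ns;    // "Time" column of ReadFileMetadata
  double data_size;  // data_size counter
  double file_size;  // file_size counter (rounded as printed)
};

// Three rows copied from the sample output above.
static const Sample kSamples[] = {
    {1, 1, 11743.0, 3824.0, 54.0, 290.0},
    {100, 100, 80115516.0, 6814382.0, 540000.0, 1947290.0},
    {1000, 100, 834609159.0, 77339599.0, 5400000.0, 19962300.0},
};

// One column chunk exists per (column, row group) pair.
std::int64_t NumColumnChunks(const Sample& s) {
  return s.num_columns * s.num_row_groups;
}

// Metadata read time divided by the number of column chunks.
double ReadNsPerChunk(const Sample& s) {
  assert(NumColumnChunks(s) > 0);
  return s.read_ns / static_cast<double>(NumColumnChunks(s));
}

// Write time (data and metadata together) per column chunk.
double WriteNsPerChunk(const Sample& s) {
  assert(NumColumnChunks(s) > 0);
  return s.write_ns / static_cast<double>(NumColumnChunks(s));
}

// Bytes in the file beyond the counted data, per column chunk.
// Assumption: this difference approximates non-data overhead only.
double OverheadBytesPerChunk(const Sample& s) {
  assert(NumColumnChunks(s) > 0);
  return (s.file_size - s.data_size) / static_cast<double>(NumColumnChunks(s));
}

// Prints one normalized line per sample.
void PrintReport() {
  for (const Sample& s : kSamples) {
    std::printf("cols=%lld row_groups=%lld read=%.1f ns/chunk "
                "write=%.1f ns/chunk overhead=%.1f bytes/chunk\n",
                static_cast<long long>(s.num_columns),
                static_cast<long long>(s.num_row_groups),
                ReadNsPerChunk(s), WriteNsPerChunk(s),
                OverheadBytesPerChunk(s));
  }
}
```

Calling `PrintReport()` from a `main` function prints one line per configuration. For the 1000-column, 100-row-group row, this normalization gives roughly 773 ns per column chunk for reading the metadata and roughly 8346 ns per column chunk for writing data and metadata.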

* GitHub Issue: #41760

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

pitrou commented May 22, 2024

Issue resolved by pull request #41761.

pitrou added this to the 17.0.0 milestone May 22, 2024
pitrou closed this as completed May 22, 2024

rok commented May 22, 2024

Similar benchmark here: lancedb/lance#2367

vibhatha pushed a commit to vibhatha/arrow that referenced this issue May 25, 2024
…apache#41761)