
refactor: optimize read of small row group in parquet #13530

Merged · 51 commits into datafuselabs:main on Dec 7, 2023

Conversation

@zenus (Contributor) commented on Nov 1, 2023:

I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/

Summary

Normally, reading a Parquet file on S3 is broken down into several steps:

1. Read the file metadata during `read_partitions`.
2. Read the column chunks that are needed.

When a row group is small, it is more efficient to merge the reads of its column chunks into fewer, larger requests. This is nothing new; I just reuse the idea of the block reader in the fuse engine. I ran a simple test, and the change is indeed profitable.
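For illustration, here is a minimal sketch of the range-merging idea (not the PR's actual code; `merge_ranges` and its signature are hypothetical), mirroring what the `storage_io_min_bytes_for_seek` setting controls in the test below:

```rust
use std::ops::Range;

/// Hypothetical helper: coalesce column-chunk byte ranges whose gaps are
/// smaller than `min_bytes_for_seek` into single merged reads.
fn merge_ranges(mut ranges: Vec<Range<u64>>, min_bytes_for_seek: u64) -> Vec<Range<u64>> {
    ranges.sort_by_key(|r| r.start);
    let mut merged: Vec<Range<u64>> = Vec::with_capacity(ranges.len());
    for r in ranges {
        match merged.last_mut() {
            // Gap is small: read through it instead of issuing a second request.
            Some(last) if r.start <= last.end + min_bytes_for_seek => {
                last.end = last.end.max(r.end);
            }
            _ => merged.push(r),
        }
    }
    merged
}
```

With `min_bytes_for_seek = 0` every chunk becomes its own request, which matches the `storage_io_min_bytes_for_seek=0` baseline in the session below.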

```
mysql> SELECT * FROM INSPECT_PARQUET('@data/ontime_200.parquet');
+----------------------------------------------+-------------+----------+----------------+-----------------+--------------------------------+----------------------------------+
| created_by                                   | num_columns | num_rows | num_row_groups | serialized_size | max_row_groups_size_compressed | max_row_groups_size_uncompressed |
+----------------------------------------------+-------------+----------+----------------+-----------------+--------------------------------+----------------------------------+
| Arrow2 - Native Rust implementation of Arrow |         109 |      199 |              1 |           28087 |                          15197 |                           107581 |
+----------------------------------------------+-------------+----------+----------------+-----------------+--------------------------------+----------------------------------+
1 row in set (0.03 sec)
Read 1 rows, 448.00 B in 0.013 sec., 76.17 rows/sec., 33.33 KiB/sec.

mysql> set global storage_io_min_bytes_for_seek=0;
Query OK, 0 rows affected (0.02 sec)

mysql> select * from @data/ontime_200.parquet limit 1\G;
*************************** 1. row ***************************
                           year: 2020
                        quarter: 4
                          month: 12
               .....
1 row in set (0.06 sec)
Read 199 rows, 141.85 KiB in 0.029 sec., 6.93 thousand rows/sec., 4.83 MiB/sec.

ERROR:
No query specified

mysql> set global storage_io_min_bytes_for_seek=100;
Query OK, 0 rows affected (0.02 sec)

mysql> select * from @data/ontime_200.parquet limit 1\G;
*************************** 1. row ***************************
                           year: 2020
                        quarter: 4
                          month: 12
     ....
1 row in set (0.06 sec)
Read 199 rows, 141.85 KiB in 0.025 sec., 7.94 thousand rows/sec., 5.53 MiB/sec.

mysql> set global use_parquet2=0;
Query OK, 0 rows affected (0.02 sec)

mysql> set global storage_io_min_bytes_for_seek=0;
Query OK, 0 rows affected (0.03 sec)

mysql> select * from @data/ontime_200.parquet limit 1\G;
*************************** 1. row ***************************
                           year: 2020
                        quarter: 4
                          month: 12
                     dayofmonth: 1
       .......
1 row in set (0.06 sec)
Read 199 rows, 141.85 KiB in 0.030 sec., 6.53 thousand rows/sec., 4.55 MiB/sec.

mysql> set global storage_io_min_bytes_for_seek=100;
Query OK, 0 rows affected (0.02 sec)

mysql> select * from @data/ontime_200.parquet limit 1\G;
*************************** 1. row ***************************
                           year: 2020
                        quarter: 4
                          month: 12
               ........
1 row in set (0.06 sec)
Read 199 rows, 141.85 KiB in 0.022 sec., 8.99 thousand rows/sec., 6.26 MiB/sec.
```

With merged reads enabled (`storage_io_min_bytes_for_seek=100`), throughput improves from 4.83 to 5.53 MiB/sec on the parquet2 path, and from 4.55 to 6.26 MiB/sec with `use_parquet2=0`.



@zenus changed the title from "inited" to "refactor: optimize read of small row group in parquet" on Nov 1, 2023.
@github-actions added the pr-refactor label (this PR changes the code base without new features or bugfix) on Nov 1, 2023.
@BohuTANG (Member) commented on Nov 3, 2023:

Can we do a performance bench for this PR?

@zenus (Contributor, Author) commented on Nov 3, 2023:

@BohuTANG Let's hold off on the performance bench; @youngsofun, please review it first.

@RinChanNOWWW (Member) left a comment:

I think the implementation is wrong. We should read all the small row groups at once. The Parquet2Groups processing in this PR is actually the same as the original Parquet2RowGroup.
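For illustration, a minimal sketch of "reading all the small row groups at once" (hypothetical types and threshold, not the PR's actual code): consecutive small row groups are packed into one part so their column chunks can be fetched in a single merged I/O pass, while large row groups keep their own part.

```rust
/// Assumed minimal row-group metadata, for illustration only.
struct RowGroupMeta {
    compressed_size: u64,
}

/// Hypothetical sketch: group consecutive small row groups into one part.
fn coalesce_row_groups(row_groups: Vec<RowGroupMeta>, small_threshold: u64) -> Vec<Vec<RowGroupMeta>> {
    let mut parts = Vec::new();
    let mut current: Vec<RowGroupMeta> = Vec::new();
    let mut current_size = 0u64;
    for rg in row_groups {
        if rg.compressed_size > small_threshold {
            // A large row group is its own part, as before this PR.
            if !current.is_empty() {
                parts.push(std::mem::take(&mut current));
                current_size = 0;
            }
            parts.push(vec![rg]);
        } else {
            // Accumulate small row groups until the part is big enough.
            current_size += rg.compressed_size;
            current.push(rg);
            if current_size >= small_threshold {
                parts.push(std::mem::take(&mut current));
                current_size = 0;
            }
        }
    }
    if !current.is_empty() {
        parts.push(current);
    }
    parts
}
```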

@zenus (Contributor, Author) commented on Nov 6, 2023:

@RinChanNOWWW OK, let me fix it.

@sundy-li (Member) commented on Nov 6, 2023:

> We should read all the small row groups at once

Yes, but I think it's better to unify this with `read_columns_data_by_merge_io`; it's a reading strategy.
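As a standalone sketch of what such a merge-IO read path does (the real `read_columns_data_by_merge_io` in the fuse block reader is more involved; this illustrative version only shows the core idea): issue one read per merged range, then slice each original column-chunk range back out of the buffer that covers it.

```rust
use std::ops::Range;

/// Illustrative only: recover per-column-chunk bytes from merged reads.
fn slice_chunks(
    merged: &[(Range<u64>, Vec<u8>)], // merged range -> bytes fetched for it
    chunk_ranges: &[Range<u64>],      // original per-column-chunk ranges
) -> Vec<Vec<u8>> {
    chunk_ranges
        .iter()
        .map(|c| {
            // Find the merged read that fully covers this chunk.
            let (m, buf) = merged
                .iter()
                .find(|(m, _)| m.start <= c.start && c.end <= m.end)
                .expect("every chunk range is covered by some merged range");
            let lo = (c.start - m.start) as usize;
            let hi = (c.end - m.start) as usize;
            buf[lo..hi].to_vec()
        })
        .collect()
}
```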

Comment on lines 265 to 271:

```rust
let mut groups = HashMap::with_capacity(parts.len());

for (gid, p) in parts.into_iter().enumerate() {
    max_compression_ratio = max_compression_ratio
        .max(p.uncompressed_size() as f64 / p.compressed_size() as f64);
    max_compressed_size = max_compressed_size.max(p.compressed_size());
    partitions.push(Arc::new(
        Box::new(ParquetPart::Parquet2RowGroup(p)) as Box<dyn PartInfo>
    ));
    groups.insert(gid, p);
```
A member reviewer commented:

Sorry for the late review. Do you mean each file as a part? That doesn't make sense to me.

The reviewer added:

Maybe you can explain more in code comments and/or the PR summary, so we can understand your code better.

@zenus (Contributor, Author) replied:

I am fixing it; the code will be pushed tonight.

@zenus (Contributor, Author) replied:

@youngsofun please help check.

@sundy-li (Member) commented on Dec 7, 2023:

@zenus Thank you for your persistence and contribution.

@BohuTANG merged commit 526f904 into datafuselabs:main on Dec 7, 2023.
68 checks passed
Labels: pr-refactor (this PR changes the code base without new features or bugfix)

Successfully merging this pull request may close this issue: Feature: optimize read of small row group in parquet

5 participants