
Faster hive part filters #4746

Merged 2 commits into duckdb:master on Sep 19, 2022

Conversation

samansmink
Contributor

Did some profiling and found that the regex was recompiled for every file, which is actually quite expensive. Precompiling the regex gains quite a big speedup. To benchmark this, I replicated the use case in issue #4339, with 120 x 30 partitions using the TPC-H SF0.1 lineitem table in each folder.
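To illustrate the optimization described above, here is a minimal sketch (with hypothetical names, not DuckDB's actual internals): the key=value regex is compiled once and reused for every path, instead of constructing a fresh `std::regex` per file.

```cpp
#include <cassert>
#include <map>
#include <regex>
#include <string>

// Compile the hive "key=value" regex ONCE, at namespace scope, and reuse it
// for every file path. The pattern is illustrative, not DuckDB's exact regex.
static const std::regex HIVE_PART_RE("[\\/\\\\]([^\\/\\?]+)=([^\\/\\n\\?]+)");

// Extract hive partition key/value pairs from a file path.
std::map<std::string, std::string> ParseHivePartitions(const std::string &path) {
	std::map<std::string, std::string> result;
	auto begin = std::sregex_iterator(path.begin(), path.end(), HIVE_PART_RE);
	for (auto it = begin; it != std::sregex_iterator(); ++it) {
		// group 1 = partition column name, group 2 = partition value
		result[(*it)[1].str()] = (*it)[2].str();
	}
	return result;
}
```

With thousands of files, the per-file work is reduced to running the already-compiled matcher, which is where the measured speedup comes from.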

Original:

SELECT COUNT(*) FROM parquet_scan('./hive_test_data/*/*/*.parquet', HIVE_PARTITIONING=1) WHERE country='belize' AND date='2106-01-01';
Run Time (s): real 0.315 user 0.233535 sys 0.081209

This PR

SELECT COUNT(*) FROM parquet_scan('./hive_test_data/*/*/*.parquet', HIVE_PARTITIONING=1) WHERE country='belize' AND date='2106-01-01';
Run Time (s): real 0.109 user 0.034185 sys 0.075329

Baseline with single file queried

SELECT COUNT(*) FROM parquet_scan('./hive_test_data2/country=belize/date=2106-01-01/*.parquet', HIVE_PARTITIONING=1);
Run Time (s): real 0.002 user 0.001802 sys 0.000728

Note that there's still quite a bit of overhead, most of which comes from the fact that using the hive partition filters still requires a full listing from the filesystem. This may be improved further though; I will look at it tomorrow.

@hannes merged commit 3d23491 into duckdb:master on Sep 19, 2022
@Mytherin
Collaborator

I wonder if it would not be better to replace the regex entirely with a simple FSM to scan the string for key/value pairs?
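A minimal sketch of what such an FSM could look like (purely illustrative, not DuckDB code): a single pass over the path that splits on separators and collects key=value components, with no regex machinery at all.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Single-pass scanner for "key=value" path components. Two states: reading a
// key (before '='), reading a value (after '='). A sketch only: it does not
// distinguish directories from the final filename.
std::vector<std::pair<std::string, std::string>> ScanHivePairs(const std::string &path) {
	std::vector<std::pair<std::string, std::string>> pairs;
	enum class State { KEY, VALUE } state = State::KEY;
	std::string key;
	std::string component;
	// iterate one past the end, treating end-of-string as a separator
	for (size_t i = 0; i <= path.size(); i++) {
		char c = i < path.size() ? path[i] : '/';
		if (c == '/' || c == '\\') {
			// end of a path component: emit it if we saw "key=value"
			if (state == State::VALUE && !key.empty()) {
				pairs.emplace_back(key, component);
			}
			key.clear();
			component.clear();
			state = State::KEY;
		} else if (c == '=' && state == State::KEY) {
			key = component;
			component.clear();
			state = State::VALUE;
		} else {
			component += c;
		}
	}
	return pairs;
}
```

One pass, no backtracking, no compiled pattern object to manage, at the cost of being less declarative than a regex.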

@Mytherin
Collaborator

Actually I wonder, is the regex even correct? What if you have column/directory names with slashes in them? (e.g. /velocity m\/s=10/)?

@hannes
Member

hannes commented Sep 19, 2022

Actually I wonder, is the regex even correct? What if you have column/directory names with slashes in them? (e.g. /velocity m\/s=10/)?

I wonder if Hive itself supports this :)

@Mytherin
Collaborator

Fair question :)

@samansmink
Contributor Author

Performance-wise, I don't think it really matters, because most time is currently spent listing the files; but indeed, partition names with slashes in them do not currently work.

@Mytherin
Collaborator

I wonder, could we not construct the path directly from the filters, rather than having to do a file listing? At least once we have determined the structure of the hive partitioning.

For example, if we know the hive partitioning in a particular folder is /<country>/<date>/*.parquet we could construct the path /country=netherlands/date=1992-01-01/*.parquet and skip most of the file listings/globbing.
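The path-construction idea above could be sketched roughly like this (all names hypothetical): given the known order of partition columns and a set of equality filters, build the concrete glob directly, falling back to wildcards for unconstrained columns.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Build a concrete glob pattern from equality filters on hive partition
// columns. Sketch only: real filter pushdown would also need to handle
// non-equality filters, which cannot be expressed this way.
std::string BuildPartitionGlob(const std::vector<std::string> &partition_columns,
                               const std::map<std::string, std::string> &equality_filters,
                               const std::string &root) {
	std::string glob = root;
	for (auto &col : partition_columns) {
		auto it = equality_filters.find(col);
		// use the filter value when available, otherwise a wildcard
		glob += "/" + (it != equality_filters.end() ? col + "=" + it->second : "*");
	}
	return glob + "/*.parquet";
}
```

With both columns constrained, the glob resolves to a single directory and the recursive listing is skipped entirely.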

@samansmink
Contributor Author

More a note to future me, but feel free to comment :)

@Mytherin and I discussed some ideas on further improving the hive partition filter speed:

One solution would be to transform the filters into glob patterns, although this would limit the types of filter operations pretty severely, as we could only do some casts and equality checks. So this is probably not what we want.

What we would really want is to make the listing process itself faster, as it requires multiple syscalls. As far as I can see, there's no good way of doing this. To illustrate: an ls -R ./hive_test_data call over the partitions suffers similar overhead to what we experience in the DuckDB listing function. What we could do, however, is apply the filters before we actually list and stat all the files. The issue is that we currently resolve the file glob during the bind, which we need to do to figure out the schema of the parquet file. What we would need is a combined glob+filter function that, while listing the files, immediately checks whether they pass the filter, so that we don't need to stat non-matching files at all (saving a syscall) and don't recurse into directories that don't match. This preserves the possibility for complex filter operations while also yielding a considerable speedup.

Practically, an idea would be to have a parquetGlob function that only returns the first file matching the glob, use this single filename to finish the bind, and then in the ParquetComplexFilterPushdown actually resolve the partially resolved glob using the combined glob+filter function. Ideally we would apply this feature automatically, and only when there are filters on hive partitions, as it would unnecessarily mess up ParquetCardinality and ParquetScanMaxThreads if no filters are present. Otherwise, this could be a manually configurable option if that proves tricky.

@parisni

parisni commented Oct 14, 2022

Usually (Hive, Spark, Trino...), partition pruning is efficient when a Hive metastore is used to fetch the paths from partition predicates; this is basically an index of partition paths. When no metastore is used, those engines fall back to the current DuckDB implementation, which as I understand it is: 1. full file listing, 2. file filtering based on the partition predicates.
I still wonder why they did not spend effort to improve this, as you are trying here.

prevents recursing into directories that don't match

This looks like an interesting approach.

I hope the implementation will keep multiple filesystems in mind (such as S3).

@mvanaltvorst

One solution would be to transform the filters into glob patterns, although this would limit the types of filter operations pretty severely, as we could only do some casts and equality checks. So this is probably not what we want.

What we would really want is to make the listing process itself faster, as it requires multiple syscalls. As far as I can see, there's no good way of doing this. To illustrate: an ls -R ./hive_test_data call over the partitions suffers similar overhead to what we experience in the DuckDB listing function. What we could do, however, is apply the filters before we actually list and stat all the files. The issue is that we currently resolve the file glob during the bind, which we need to do to figure out the schema of the parquet file. What we would need is a combined glob+filter function that, while listing the files, immediately checks whether they pass the filter, so that we don't need to stat non-matching files at all (saving a syscall) and don't recurse into directories that don't match. This preserves the possibility for complex filter operations while also yielding a considerable speedup.

Practically, an idea would be to have a parquetGlob function that only returns the first file matching the glob, use this single filename to finish the bind, and then in the ParquetComplexFilterPushdown actually resolve the partially resolved glob using the combined glob+filter function. Ideally we would apply this feature automatically, and only when there are filters on hive partitions, as it would unnecessarily mess up ParquetCardinality and ParquetScanMaxThreads if no filters are present. Otherwise, this could be a manually configurable option if that proves tricky.

I found your comment after experiencing slow parquet_scan performance myself. Currently, there's no suitable tool that allows me to get ~10GB dataframe slices out of a 1TB+ hive-partitioned dataset. DuckDB would be the perfect fit, except it doesn't prune its recursive directory search early despite the filters I try to apply. Do you know if anyone is working on this? Or would it be worth creating a new issue?

@samansmink
Contributor Author

Hi @mvanaltvorst, thanks for your comment! This sounds like a good topic to open a discussion for, so we can go into a bit more detail there. I haven't looked at this in a while, but it definitely seems like an interesting improvement to look into!
