
Faster hive part filters #4746

Merged 2 commits into duckdb:master on Sep 19, 2022

Conversation

samansmink
Contributor

Did some profiling and found that the regex was recompiled for every file, which is actually quite expensive. Precompiling the regex gains quite a big speedup. To benchmark this, I replicated the use case in issue #4339, with 120 x 30 partitions using the TPC-H SF0.1 lineitem table in each folder.
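To illustrate the optimization described above, here is a minimal sketch (with hypothetical names, not DuckDB's actual internals): the key=value regex is compiled once and reused for every path, instead of constructing a fresh `std::regex` per file.

```cpp
#include <cassert>
#include <map>
#include <regex>
#include <string>

// Compile the hive "key=value" regex ONCE, at namespace scope, and reuse it
// for every file path. The pattern is illustrative, not DuckDB's exact regex.
static const std::regex HIVE_PART_RE("[\\/\\\\]([^\\/\\?]+)=([^\\/\\n\\?]+)");

// Extract hive partition key/value pairs from a file path.
std::map<std::string, std::string> ParseHivePartitions(const std::string &path) {
	std::map<std::string, std::string> result;
	auto begin = std::sregex_iterator(path.begin(), path.end(), HIVE_PART_RE);
	for (auto it = begin; it != std::sregex_iterator(); ++it) {
		// group 1 = partition column name, group 2 = partition value
		result[(*it)[1].str()] = (*it)[2].str();
	}
	return result;
}
```

With thousands of files, the per-file work is reduced to running the already-compiled matcher, which is where the measured speedup comes from.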

Original:

SELECT COUNT(*) FROM parquet_scan('./hive_test_data/*/*/*.parquet', HIVE_PARTITIONING=1) WHERE country='belize' AND date='2106-01-01';
Run Time (s): real 0.315 user 0.233535 sys 0.081209

This PR

SELECT COUNT(*) FROM parquet_scan('./hive_test_data/*/*/*.parquet', HIVE_PARTITIONING=1) WHERE country='belize' AND date='2106-01-01';
Run Time (s): real 0.109 user 0.034185 sys 0.075329

Baseline with single file queried

SELECT COUNT(*) FROM parquet_scan('./hive_test_data2/country=belize/date=2106-01-01/*.parquet', HIVE_PARTITIONING=1);
Run Time (s): real 0.002 user 0.001802 sys 0.000728

Note that there's still quite a bit of overhead, most of which comes from the fact that using the hive partition filters still requires a full listing from the filesystem. This may be improved further though; I will look at it tomorrow.

@hannes merged commit 3d23491 into duckdb:master on Sep 19, 2022
@Mytherin
Collaborator

I wonder if it would not be better to replace the regex entirely with a simple FSM to scan the string for key/value pairs?
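A minimal sketch of what such an FSM could look like (purely illustrative, not DuckDB code): a single pass over the path that splits on separators and collects key=value components, with no regex machinery at all.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Single-pass scanner for "key=value" path components. Two states: reading a
// key (before '='), reading a value (after '='). A sketch only: it does not
// distinguish directories from the final filename.
std::vector<std::pair<std::string, std::string>> ScanHivePairs(const std::string &path) {
	std::vector<std::pair<std::string, std::string>> pairs;
	enum class State { KEY, VALUE } state = State::KEY;
	std::string key;
	std::string component;
	// iterate one past the end, treating end-of-string as a separator
	for (size_t i = 0; i <= path.size(); i++) {
		char c = i < path.size() ? path[i] : '/';
		if (c == '/' || c == '\\') {
			// end of a path component: emit it if we saw "key=value"
			if (state == State::VALUE && !key.empty()) {
				pairs.emplace_back(key, component);
			}
			key.clear();
			component.clear();
			state = State::KEY;
		} else if (c == '=' && state == State::KEY) {
			key = component;
			component.clear();
			state = State::VALUE;
		} else {
			component += c;
		}
	}
	return pairs;
}
```

One pass, no backtracking, no compiled pattern object to manage, at the cost of being less declarative than a regex.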

@Mytherin
Collaborator

Actually I wonder, is the regex even correct? What if you have column/directory names with slashes in them? (e.g. /velocity m\/s=10/)?

@hannes
Member

hannes commented Sep 19, 2022

Actually I wonder, is the regex even correct? What if you have column/directory names with slashes in them? (e.g. /velocity m\/s=10/)?

I wonder if Hive itself supports this :)

@Mytherin
Collaborator

Fair question :)

@samansmink
Contributor Author

Performance-wise, I don't think it really matters, because most time is currently spent listing the files; but indeed, partition names with slashes in them do not currently work.

@Mytherin
Collaborator

I wonder, could we not construct the path directly from the filters, rather than having to do a file listing? At least once we have determined the structure of the hive partitioning.

For example, if we know the hive partitioning in a particular folder is /<country>/<date>/*.parquet we could construct the path /country=netherlands/date=1992-01-01/*.parquet and skip most of the file listings/globbing.
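The path-construction idea above could be sketched roughly like this (all names hypothetical): given the known order of partition columns and a set of equality filters, build the concrete glob directly, falling back to wildcards for unconstrained columns.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Build a concrete glob pattern from equality filters on hive partition
// columns. Sketch only: real filter pushdown would also need to handle
// non-equality filters, which cannot be expressed this way.
std::string BuildPartitionGlob(const std::vector<std::string> &partition_columns,
                               const std::map<std::string, std::string> &equality_filters,
                               const std::string &root) {
	std::string glob = root;
	for (auto &col : partition_columns) {
		auto it = equality_filters.find(col);
		// use the filter value when available, otherwise a wildcard
		glob += "/" + (it != equality_filters.end() ? col + "=" + it->second : "*");
	}
	return glob + "/*.parquet";
}
```

With both columns constrained, the glob resolves to a single directory and the recursive listing is skipped entirely.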

@samansmink
Contributor Author

More a note to future me, but feel free to comment :)

@Mytherin and I discussed some ideas on further improving the hive partition filter speed:

One solution would be to transform the filters into glob patterns, although this would limit the types of filter operations pretty severely, as we could only do some casts and equality checks. So this is probably not what we want.

What we would really want is to make the listing process itself faster, as it requires multiple syscalls. As far as I can see, there's no good way of doing this. To illustrate: an ls -R ./hive_test_data call over the partitions suffers similar overhead to what we experience in the DuckDB listing function. What we could do, however, is apply the filters before we actually list and stat all the files. The issue is that we currently resolve the file glob during the bind, which we need to do to figure out the schema of the parquet file. What we would need is a combined glob+filter function that, while listing the files, immediately checks whether they pass the filter, so that we don't need to stat non-matching files at all (saving a syscall) and don't recurse into directories that don't match. This preserves the possibility for complex filter operations while also yielding a considerable speedup.

Practically, an idea would be to have a parquetGlob function that only returns the first file matching the glob, use this single filename to finish the bind, and then in the ParquetComplexFilterPushdown actually resolve the partially resolved glob using the combined glob+filter function. Ideally we would apply this feature automatically, and only when there are filters on hive partitions, as it would unnecessarily mess up ParquetCardinality and ParquetScanMaxThreads if no filters are present. Otherwise, this could be a manually configurable option if that proves tricky.

@parisni

parisni commented Oct 14, 2022

Usually (Hive, Spark, Trino...), partition pruning is efficient when a Hive metastore is used to fetch the paths from partition predicates; this is basically an index of partition paths. When no metastore is used, those engines fall back to the current DuckDB implementation, which as I understand it is: 1. full file listing, 2. file filtering based on the partition predicates.
I still wonder why they did not spend effort to improve this, as you are trying here.

prevents recursing into directories that don't match

This looks like an interesting approach.

I hope the implementation will keep multiple filesystems in mind (such as S3).

@mvanaltvorst

One solution would be to transform the filters into glob patterns, although this would limit the types of filter operations pretty severely, as we could only do some casts and equality checks. So this is probably not what we want.

What we would really want is to make the listing process itself faster, as it requires multiple syscalls. As far as I can see, there's no good way of doing this. To illustrate: an ls -R ./hive_test_data call over the partitions suffers similar overhead to what we experience in the DuckDB listing function. What we could do, however, is apply the filters before we actually list and stat all the files. The issue is that we currently resolve the file glob during the bind, which we need to do to figure out the schema of the parquet file. What we would need is a combined glob+filter function that, while listing the files, immediately checks whether they pass the filter, so that we don't need to stat non-matching files at all (saving a syscall) and don't recurse into directories that don't match. This preserves the possibility for complex filter operations while also yielding a considerable speedup.

Practically, an idea would be to have a parquetGlob function that only returns the first file matching the glob, use this single filename to finish the bind, and then in the ParquetComplexFilterPushdown actually resolve the partially resolved glob using the combined glob+filter function. Ideally we would apply this feature automatically, and only when there are filters on hive partitions, as it would unnecessarily mess up ParquetCardinality and ParquetScanMaxThreads if no filters are present. Otherwise, this could be a manually configurable option if that proves tricky.

I found your comment after experiencing slow parquet_scan performance myself. Currently, there's no suitable tool that allows me to get ~10GB dataframe slices out of a 1TB+ hive-partitioned dataset. DuckDB would be the perfect fit, except it doesn't prune its recursive directory search early despite the filters I try to apply. Do you know if anyone is working on this? Or would it be worth creating a new issue?

@samansmink
Contributor Author

Hi @mvanaltvorst, thanks for your comment! This sounds like a good topic to open a discussion for, so we can go into a bit more detail there. I haven't looked at this in a while, but it definitely seems like an interesting improvement to look into!
