Faster hive part filters #4746
Conversation
I wonder if it would not be better to replace the regex entirely with a simple FSM to scan the string for key/value pairs?
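One way to read that suggestion (a hypothetical sketch, not code from this PR) is a single linear pass over the path's `/`-delimited segments, splitting each on the first `=`:

```cpp
#include <map>
#include <string>

// Hypothetical sketch (not the PR's code): scan a path such as
// "data/year=2012/month=6/file.parquet" for hive-style key=value
// segments with one linear pass instead of a regex.
std::map<std::string, std::string> ParseHivePartitions(const std::string &path) {
	std::map<std::string, std::string> result;
	std::string::size_type pos = 0;
	while (pos < path.size()) {
		// end of the current '/'-delimited segment
		auto slash = path.find('/', pos);
		if (slash == std::string::npos) {
			slash = path.size();
		}
		// a '=' inside the segment splits key from value
		auto eq = path.find('=', pos);
		if (eq != std::string::npos && eq < slash) {
			result.emplace(path.substr(pos, eq - pos), path.substr(eq + 1, slash - eq - 1));
		}
		pos = slash + 1;
	}
	return result;
}
```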
Actually I wonder, is the regex even correct? What if you have column/directory names with slashes in them? (e.g. …)
I wonder if Hive itself supports this :)
Fair question :)
Performance-wise I don't think it really matters, because most of the time is currently spent listing the files — but indeed, partition names with slashes in them currently do not work.
I wonder, could we not construct the path directly from the filters, rather than having to do a file listing? At least after we have determined the structure of the hive partitioning. For example, if we know the hive partitioning in a particular folder is …
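As a rough sketch of that idea (all names here are hypothetical): once the partition columns and their directory order are known, the query's equality filters could be turned into a glob that only touches the matching subtree:

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch: build a glob such as "data/year=2012/month=*/*.parquet"
// from the partition columns (in directory order) and the query's equality
// filters, instead of listing the whole tree.
std::string BuildPartitionGlob(const std::string &root, const std::vector<std::string> &partition_columns,
                               const std::unordered_map<std::string, std::string> &equality_filters) {
	std::string glob = root;
	for (const auto &col : partition_columns) {
		auto it = equality_filters.find(col);
		if (it != equality_filters.end()) {
			glob += "/" + col + "=" + it->second; // value pinned by a filter
		} else {
			glob += "/" + col + "=*"; // unconstrained: match any value
		}
	}
	return glob + "/*.parquet";
}
```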
More a note to future me, but feel free to comment :) @Mytherin and I discussed some ideas on further improving the hive partition filter speed. One solution would be to transform the filters into glob patterns, although this would limit the types of filter operations pretty severely, as we can only do some casts and equality checks — so this is probably not what we want. What we would really want is to make the listing process itself faster, as it requires multiple syscalls; as far as I can see there's no good way of doing this. To illustrate: an … Practically, an idea would be to have a …
Usually (in Hive, Spark, Trino, ...) partition pruning is efficient when a Hive metastore is used to fetch the paths from partition predicates; this is basically an index of partition paths. When no metastore is used, those engines fall back to something like the current DuckDB implementation, which is, to my understanding: 1. a full file listing, 2. file filtering based on the partition predicates.
This looks like an interesting approach. Hope the implementation will have multiple filesystems in mind (such as S3).
I found your comment after experiencing slow parquet_scan performance myself. Currently, there's no suitable tool that allows me to get ~10GB dataframe slices out of a 1TB+ hive-partitioned dataset. DuckDB would be the perfect fit, except it doesn't prune its recursive directory search early despite the filters I try to apply. Do you know if anyone is working on this? Or would it be worth creating a new issue?
Hi @mvanaltvorst, thanks for your comment! This sounds like a good point to open a discussion for — then we can discuss it in a bit more detail there. I haven't looked at this in a while, but it definitely seems like an interesting improvement to look into!
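For context, the early pruning described above could look roughly like this (a hypothetical sketch over a local filesystem with `std::filesystem`; an object-store version such as S3 would prune listing prefixes the same way):

```cpp
#include <filesystem>
#include <functional>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical sketch: walk the partition tree, but never descend into a
// "key=value" directory that already fails a filter, instead of listing
// everything first and filtering afterwards. "keep" takes (key, value) and
// returns whether the partition can still match.
void CollectFiles(const fs::path &dir, const std::function<bool(const std::string &, const std::string &)> &keep,
                  std::vector<fs::path> &out) {
	for (const auto &entry : fs::directory_iterator(dir)) {
		if (entry.is_directory()) {
			auto name = entry.path().filename().string();
			auto eq = name.find('=');
			if (eq != std::string::npos && !keep(name.substr(0, eq), name.substr(eq + 1))) {
				continue; // pruned: this partition subtree can never match
			}
			CollectFiles(entry.path(), keep, out);
		} else {
			out.push_back(entry.path());
		}
	}
}
```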
Did some profiling and found that compiling the regex was done for every file, which is actually quite expensive. Precompiling the regex gains quite a big speedup. To benchmark this I replicated the use case in issue #4339, with 120 × 30 partitions, using the TPC-H SF0.1 lineitem table in each folder.
Original:
This PR:
Baseline with a single file queried:
Note that there's still quite a bit of overhead, most of which is due to the fact that using the hive partition filters does require getting a full listing from the filesystem. This may be improved further though; will look at it tomorrow.
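To make the fix above concrete, here is a minimal sketch (using `std::regex` and an illustrative pattern, not DuckDB's actual regex engine or pattern) — the point is simply that the pattern is compiled once rather than once per file:

```cpp
#include <map>
#include <regex>
#include <string>
#include <vector>

// Hypothetical sketch of the fix: compile the "key=value" pattern a single
// time and reuse it for every file in the listing, rather than recompiling
// it per file. The pattern itself is illustrative.
std::vector<std::map<std::string, std::string>> ParseListing(const std::vector<std::string> &files) {
	static const std::regex kPartitionRegex("[/\\\\]([^/\\\\=]+)=([^/\\\\]+)"); // compiled once
	std::vector<std::map<std::string, std::string>> result;
	for (const auto &file : files) {
		std::map<std::string, std::string> partitions;
		for (auto it = std::sregex_iterator(file.begin(), file.end(), kPartitionRegex);
		     it != std::sregex_iterator(); ++it) {
			partitions.emplace((*it)[1].str(), (*it)[2].str());
		}
		result.push_back(std::move(partitions));
	}
	return result;
}
```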