Bloom Filter Support in Parquet Reader/Writer #14597
Conversation
@hannes this looks really nice. I wonder why there is a limitation to dictionary-encoded columns. Wouldn't UUID columns benefit greatly from equality checks using Bloom filters?
Is that the right way to handle this parameter? Consider the case of nested types; if I set

This comes out of testing the actual data that inspired #14888 with your PR; unfortunately I can't share the exact data.

```sql
copy (
    select map(
        [(trunc(random() * 200) + 0.1 * i)::varchar for i in generate_series(1, 10)],
        [(trunc(random() * 200) + 0.1 * i)::varchar for i in generate_series(1, 10)]
    ) as m
    from generate_series(1, 100000)
) to './test.parquet';
```

gives an example of the case I'm most concerned about. With the removal of

(None of this is to say I don't want dictionary encodings on nested types, of course! Thank you for the hard work.)
Thanks! We can pick up the partitioned write regression separately.
This PR adds support for Parquet Bloom filters. Bloom filters are small set-approximation data structures that, for a given value, can either exclude its presence with certainty or indicate its presence with some confidence.
With this PR, DuckDB will automatically create a Bloom filter for each column chunk in each row group of Parquet files, as long as dictionary encoding is used. Dictionary encoding is used when there is a considerable amount of value duplication. A new parameter to the `COPY` command with the Parquet format, `dictionary_size_limit`, controls the maximum size of this dictionary per row group; the default value is 1% of the configured `row_group_size`. For column chunks with fewer distinct values than the limit, dictionary encoding is used and a Bloom filter is created. There is also a new parameter to control the desired false-positive ratio, `bloom_filter_false_positive_ratio`, with a default of 0.01 (1%). The previously used parameter `dictionary_compression_ratio_threshold` is removed, but still accepted (with no effect).
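As a sketch of how these options might be set explicitly (the table and file names are illustrative, and the values shown are not the defaults described above):

```sql
-- Illustrative only: write with an explicit dictionary size limit and a
-- tighter Bloom filter false-positive ratio than the 1% default.
COPY lineitem TO 'lineitem.parquet'
    (FORMAT parquet,
     DICTIONARY_SIZE_LIMIT 50000,
     BLOOM_FILTER_FALSE_POSITIVE_RATIO 0.001);
```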
The presence of a Bloom filter can be checked with the `parquet_metadata` function, which gains two new columns, `bloom_filter_offset` and `bloom_filter_length`. To enable pre-fetching, the Bloom filters for all row groups are co-located just in front of the Parquet metadata footer.
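A minimal sketch of such a check could look as follows; the file name is a placeholder, and the other selected columns (`row_group_id`, `path_in_schema`) are assumed to be pre-existing `parquet_metadata` columns.

```sql
-- Inspect which column chunks carry a Bloom filter and where it is stored.
SELECT row_group_id, path_in_schema, bloom_filter_offset, bloom_filter_length
FROM parquet_metadata('my_file.parquet')
WHERE bloom_filter_offset IS NOT NULL;
```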
When reading data from a Parquet file with a constant filter, e.g. `SELECT c1 FROM 'my_file.parquet' WHERE c2 = 42`, DuckDB will now first check whether a Bloom filter is present for each row group. If one is present, it probes the Bloom filter with the provided constant; if a match can be excluded, the row group is skipped.

Whether a Bloom filter excludes a given constant can now be checked with the new `parquet_bloom_probe` function. This function takes a Parquet file name, a column name and an arbitrary constant. For each row group, it returns a boolean `bloom_filter_excludes` indicating whether the row group was excluded by the Bloom filter for the given constant.
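A usage sketch, with an illustrative file, column and constant:

```sql
-- For each row group, report (among other metadata) whether its Bloom filter
-- rules out c2 = 42 via the bloom_filter_excludes column.
SELECT *
FROM parquet_bloom_probe('my_file.parquet', 'c2', 42);
```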
This PR also greatly expands the data types for which dictionary encoding can be used, from only strings to most data types. This leads to a size reduction in many cases; for example, the Parquet file for the TPC-H `lineitem` table at scale factor 1 shrinks from 261 MB to 210 MB. There is a minor slowdown of around 5%, since the additional dictionaries take slightly longer to create.