
Bloom Filter Support in Parquet Reader/Writer #14597


Merged: 55 commits into duckdb:main on Dec 3, 2024

Conversation

hannes
Member

@hannes hannes commented Oct 28, 2024

This PR adds support for Parquet Bloom filters. Bloom filters are small, approximate set data structures that, for a given value, can either exclude the presence of that value with certainty or indicate its presence with some confidence.

With this PR, DuckDB will automatically create a Bloom filter for each column chunk in each row group of Parquet files, as long as dictionary encoding is used. Dictionary encoding is used when there is a considerable amount of value duplication. A new parameter to the COPY command with the Parquet format, dictionary_size_limit, controls the maximum size of this dictionary per row group. The default value is 1% of the configured row_group_size. For column chunks with fewer distinct values than the limit, dictionary encoding is used and a Bloom filter is created. There is also a new parameter to control the desired false positive ratio, bloom_filter_false_positive_ratio, with a default of 0.01 (1%). The previously used parameter dictionary_compression_ratio_threshold is removed, but still accepted (with no effect).
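
For illustration, a COPY invocation that sets both new options could look roughly like this; the table name and the concrete values below are placeholders, not recommendations:

copy lineitem to 'lineitem.parquet' (
    format parquet,
    row_group_size 100000,
    -- illustrative: cap the per-row-group dictionary at 10000 entries
    dictionary_size_limit 10000,
    -- illustrative: target a 0.1% false positive ratio for the Bloom filters
    bloom_filter_false_positive_ratio 0.001
);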

The presence of a Bloom filter can be checked with the parquet_metadata function, which gains two new columns, bloom_filter_offset and bloom_filter_length. To enable pre-fetching, the Bloom filters for all row groups are co-located just in front of the Parquet metadata footer.
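
The new columns can be inspected alongside the existing parquet_metadata columns such as row_group_id and path_in_schema; the file name below is a placeholder:

select row_group_id, path_in_schema, bloom_filter_offset, bloom_filter_length
from parquet_metadata('my_file.parquet');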

When reading data from a Parquet file with a constant filter, e.g. SELECT c1 FROM 'my_file.parquet' WHERE c2 = 42, DuckDB will now first check if a Bloom filter is present for each row group. If present, it will probe the Bloom filter with the constant value provided. If a match can be excluded, the row group is skipped.
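
A quick way to see whether row groups are being skipped is to run such a query under EXPLAIN ANALYZE and compare the reported timings and scanned row counts with and without the constant filter (file and column names as in the example above):

explain analyze select c1 from 'my_file.parquet' where c2 = 42;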

Whether a Bloom filter excludes a given constant can now be checked with the new function parquet_bloom_probe. This function takes a Parquet file name, a column name, and an arbitrary constant. For each row group, the function returns a boolean bloom_filter_excludes indicating whether the row group was excluded by the Bloom filter for the given constant.
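
For example, probing the Bloom filters of the file above for the constant 42 in column c2 could look like this (the output columns beyond bloom_filter_excludes are not spelled out here, so select * is used):

select * from parquet_bloom_probe('my_file.parquet', 'c2', 42);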

This PR also greatly expands the set of data types for which dictionary encoding can be used, from only strings to most data types. This leads to a size reduction in many cases; for example, the Parquet file size for the TPC-H lineitem table at scale factor 1 is reduced from 261 MB to 210 MB. There is a minor slowdown of around 5%, since the additional dictionaries take slightly longer to create.

@duckdb-draftbot duckdb-draftbot marked this pull request as draft October 29, 2024 06:41
@hannes hannes marked this pull request as ready for review October 29, 2024 06:41
@hannes hannes marked this pull request as ready for review November 8, 2024 12:39
@duckdb-draftbot duckdb-draftbot marked this pull request as draft November 11, 2024 08:37
@hannes hannes marked this pull request as ready for review November 11, 2024 12:13
@alippai

alippai commented Nov 12, 2024

@hannes this looks really nice. I wonder why there is a limitation to dictionary-encoded columns. Wouldn't UUID columns benefit greatly from equality checks using Bloom filters?

@duckdb-draftbot duckdb-draftbot marked this pull request as draft November 13, 2024 12:42
@dmitriy-helius

@hannes this looks really nice. I wonder why there is a limitation to dictionary-encoded columns. Wouldn't UUID columns benefit greatly from equality checks using Bloom filters?
+1 on high cardinality fields and bloom filters. It would be awesome to have that functionality available. Otherwise, very nice!

@marvold-mw

marvold-mw commented Nov 19, 2024

A new parameter to the COPY command with the Parquet format, dictionary_size_limit, controls the maximum size of this dictionary per row group. The default value is 1% of the configured row_group_size.

Is that the right way to handle this parameter? Consider the case of nested types; if I set row_group_size to 100K for data with a map column with an average of 10 entries, the key and value columns in the row group each have ~1M values. The default 1% of row group size is still 1K, not 10K. If I raise dictionary_size_limit, that affects other columns as well, though; a string column in the same row group has only 100K values.

dictionary_compression_ratio_threshold was nice to have as a single parameter, but if Bloom filters make that challenging, a ratio of the number of values in a row group would be more helpful than an absolute number. (That is, have a parameter that defaults to 0.01 and adjust from there.)

This comes out of testing, with your PR, the actual data that inspired #14888; unfortunately I can't share the exact data.

copy (
    select map(
        [(trunc(random() * 200) + 0.1 * i)::varchar for i in generate_series(1, 10)],
        [(trunc(random() * 200) + 0.1 * i)::varchar for i in generate_series(1, 10)]
    ) as m
    from generate_series(1, 100000)
) to './test.parquet';

gives an example of the case I'm most concerned about. With the removal of dictionary_compression_ratio_threshold, it becomes much more important for me to adjust parameters, but at the same time it also becomes harder to do so.

(None of this is to say I don't want dictionary encodings on nested types, of course! Thank you for the hard work.)

@hannes hannes marked this pull request as ready for review November 25, 2024 13:09
@duckdb-draftbot duckdb-draftbot marked this pull request as draft November 29, 2024 09:56
@hannes hannes marked this pull request as ready for review November 29, 2024 09:56
@Mytherin Mytherin merged commit f86ed2d into duckdb:main Dec 3, 2024
42 of 44 checks passed
@Mytherin
Collaborator

Mytherin commented Dec 3, 2024

Thanks! We can pick up the partitioned write regression separately.

krlmlr added a commit to duckdb/duckdb-r that referenced this pull request Dec 27, 2024
Bloom Filter Support in Parquet Reader/Writer (duckdb/duckdb#14597)
github-actions bot added a commit to duckdb/duckdb-r that referenced this pull request Dec 28, 2024
Bloom Filter Support in Parquet Reader/Writer (duckdb/duckdb#14597)

Co-authored-by: krlmlr <krlmlr@users.noreply.github.com>
Labels
Needs Documentation, Ready For Review