audit and create a document for bloom filter configurations #3138

Closed
jimexist opened this issue Nov 19, 2022 · 7 comments · Fixed by #3165
Assignees: jimexist
Labels: enhancement (Any new improvement worthy of an entry in the changelog), parquet (Changes to the parquet crate)

Comments

@jimexist
Member

Thank you @Jimexist -- this is very cool. I went through the code fairly thoroughly. I had some minor suggestions / comments for documentation and code structure but nothing that would block merging.

I think the biggest thing I would like to discuss is "what parameters to expose for the writer API". For example, will users of this feature be able to set "fpp" and "ndv" reasonably? I suppose knowing the number of distinct values before writing a parquet file is reasonable, but maybe not the expected number of distinct values for each row group.

I did some research into other implementations. Here are the Spark settings: https://spark.apache.org/docs/latest/configuration.html

| Property Name | Default | Meaning | Since Version |
|---|---|---|---|
| spark.sql.optimizer.runtime.bloomFilter.creationSideThreshold | 10MB | Size threshold of the bloom filter creation side plan. Estimated size needs to be under this value to try to inject bloom filter. | 3.3.0 |
| spark.sql.optimizer.runtime.bloomFilter.enabled | false | When true and if one side of a shuffle join has a selective predicate, we attempt to insert a bloom filter in the other side to reduce the amount of shuffle data. | 3.3.0 |
| spark.sql.optimizer.runtime.bloomFilter.expectedNumItems | 1000000 | The default number of expected items for the runtime bloom filter | 3.3.0 |
| spark.sql.optimizer.runtime.bloomFilter.maxNumBits | 67108864 | The max number of bits to use for the runtime bloom filter | 3.3.0 |
| spark.sql.optimizer.runtime.bloomFilter.maxNumItems | 4000000 | The max allowed number of expected items for the runtime bloom filter | 3.3.0 |
| spark.sql.optimizer.runtime.bloomFilter.numBits | 8388608 | The default number of bits to use for the runtime bloom filter | 3.3.0 |

the arrow parquet C++ writer seems to allow for the fpp setting

https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N5arrow8adapters3orc12WriteOptions16bloom_filter_fppE

double bloom_filter_fpp = 0.05
The upper limit of the false-positive rate of the bloom filter, default 0.05.

Databricks seems to expose the fpp, max_fpp, and num distinct values:
https://docs.databricks.com/sql/language-manual/delta-create-bloomfilter-index.html

Originally posted by @alamb in #3119 (review)
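For a concrete sense of what such writer-side knobs could look like, here is a minimal sketch; the builder method names below are illustrative rather than settled API:

```rust
use parquet::file::properties::WriterProperties;

fn main() {
    // Illustrative knobs only: a master switch plus the two sizing inputs
    // under discussion (false-positive probability and expected distinct values).
    let _props = WriterProperties::builder()
        .set_bloom_filter_enabled(true)   // hypothetical name at this point in the discussion
        .set_bloom_filter_fpp(0.05)       // desired false-positive probability
        .set_bloom_filter_ndv(1_000_000)  // expected distinct values per row group
        .build();
}
```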

@jimexist self-assigned this Nov 19, 2022
@jimexist
Member Author

This is considered a follow-up of:

@jimexist
Member Author

FYI, in Spark there's also a document covering the options that can be set for the Parquet bloom filter: https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html

@alamb
Contributor

alamb commented Nov 21, 2022

Do you have any suggestions? After a few more days of thought I don't have anything better than ndv and fpp.

The only other possibility I have is to keep this crate simpler and simply expose set_bloom_filter_size, letting users explicitly specify the size. It isn't ideal, but perhaps it would be ok if we added a pointer to the canonical ndv/fpp calculations?
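For reference, the canonical calculation alluded to here is presumably the standard Bloom filter sizing formula `m = -n * ln(p) / (ln 2)^2`, where `n` is the ndv and `p` the fpp. For example, `n = 1,000,000` and `p = 0.05` works out to roughly 6.2 million bits, or about 0.75 MiB.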

@jimexist
Member Author

jimexist commented Nov 22, 2022

@alamb I believe we should start simple and support only two parameters:

  1. whether the bloom filter is enabled, as a master switch
  2. fpp in (0, 1.0); with this we'd assume all items are unique, use the row count per row group to calculate a bitset size, and cap that at 128 MiB for an unreasonably small fpp (e.g. 0.0000001); for a very large fpp (e.g. 0.9999) the minimum is 32 (a sketch of this sizing logic follows below)

Controlling disk size directly does not quite make sense, or is counterintuitive, because users would then need to both estimate the number of unique items per row group and know how to derive fpp from that. In most cases, having a maximum fpp is good enough.
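A rough sketch of the sizing rule described above, assuming the classic Bloom filter formula and the bounds mentioned here (the crate's actual split-block implementation may size things differently):

```rust
/// Sketch only: treat the per-row-group row count as the number of distinct
/// values, derive a bitset size from the requested fpp, then clamp it so that
/// extreme fpp values cannot produce an absurd filter.
fn bitset_size_bytes(rows_in_row_group: u64, fpp: f64) -> usize {
    assert!(fpp > 0.0 && fpp < 1.0);
    // Classic Bloom filter sizing: m = -n * ln(p) / (ln 2)^2 bits.
    let bits = -(rows_in_row_group as f64) * fpp.ln() / std::f64::consts::LN_2.powi(2);
    let bytes = (bits / 8.0).ceil() as usize;
    // Cap at 128 MiB for unreasonably small fpp; floor at 32 bytes for very large fpp.
    bytes.clamp(32, 128 * 1024 * 1024)
}
```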

cc @tustvold

@alamb
Contributor

alamb commented Nov 22, 2022

I like the idea of specifying fpp (and it follows the arrow C++ model).

    with which we'd assume all unique items

I think that makes sense as the main use case for bloom filters is high cardinality / close to unique columns.

Perhaps we can document this case clearly (i.e. "bloom filters will likely only help for almost-unique data like ids and uuids; for other types, sorting/clustering and min/max statistics will work as well if not better").

@jimexist
Member Author

Turns out I have to allow users to specify ndv and have it default to, say, 1 million. The current code architecture encodes data as it streams in, which means there's no good way to know in advance how many rows will be written.

@alamb added the parquet (Changes to the parquet crate) label Nov 25, 2022
@alamb
Contributor

alamb commented Nov 25, 2022

label_issue.py automatically added labels {'parquet'} from #3165

@alamb added the enhancement (Any new improvement worthy of an entry in the changelog) label Nov 25, 2022