audit and create a document for bloom filter configurations #3138
Comments
This is considered a follow-up of:
FYI, in Spark there is also a document regarding the options that can be set for Parquet bloom filters: https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html
Do you have any suggestions? After a few more days of thought, I don't have anything better than ndv and fpp. The only other possibility I have is to keep this crate simpler and simply expose …
@alamb I believe we should start simple and support only two params: ndv and fpp.
Controlling disk size does not quite make sense and is counterintuitive, because users would then need to both estimate the number of unique items per row group and know how to derive fpp from that. In most cases, having a maximum fpp is good enough. cc @tustvold
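For reference, the standard bloom filter sizing math shows how an fpp target plus an ndv estimate determines the filter size. Below is a minimal Rust sketch of the classic formula m = -n·ln(p)/(ln 2)²; it is illustrative only, and the split-block bloom filters used by Parquet size themselves with a closely related calculation:

```rust
use std::f64::consts::LN_2;

/// Estimated number of bits a classic bloom filter needs to hold `ndv`
/// distinct values with a false positive probability of `fpp`:
/// m = -n * ln(p) / (ln 2)^2
fn optimal_num_bits(ndv: u64, fpp: f64) -> u64 {
    assert!(fpp > 0.0 && fpp < 1.0, "fpp must be in (0, 1)");
    (-(ndv as f64) * fpp.ln() / (LN_2 * LN_2)).ceil() as u64
}

fn main() {
    // 1 million distinct values at 1% fpp needs roughly 9.6 Mbit (~1.2 MB)
    let bits = optimal_num_bits(1_000_000, 0.01);
    println!("{bits} bits (~{} KiB)", bits / 8 / 1024);
}
```

This is also why a maximum fpp is the friendlier knob: the filter size falls out of the formula once ndv is known or estimated.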
I like the idea of specifying fpp (and it follows the Arrow C++ model).
I think that makes sense, as the main use case for bloom filters is high-cardinality / close-to-unique columns. Perhaps we can document that case clearly (i.e. "bloom filters will likely only help for almost-unique data like ids and uuids; for other types, sorting/clustering and min/max statistics will work as well, if not better").
It turns out I have to allow users to specify ndv, with a default of, say, 1 million. The current code architecture requires streaming encoding, which means there is no good way to know in advance how many rows will be written.
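To make the proposal concrete, here is a sketch of what such a writer configuration could look like. The builder methods below are illustrative names for the two proposed knobs, not a committed API:

```rust
use parquet::file::properties::WriterProperties;

fn main() {
    // Hypothetical configuration surface: enable bloom filters and set the
    // two proposed knobs. ndv would default to 1_000_000 when unset, per the
    // discussion above; fpp caps the false positive probability per filter.
    let _props = WriterProperties::builder()
        .set_bloom_filter_enabled(true)   // illustrative builder method
        .set_bloom_filter_fpp(0.05)       // maximum false positive probability
        .set_bloom_filter_ndv(1_000_000)  // expected distinct values per row group
        .build();
}
```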
I think the biggest thing I would like to discuss is "what parameters to expose for the writer API". I was thinking, for example: will users of this feature be able to set "fpp" and "ndv" reasonably? I suppose having the number of distinct values before writing a Parquet file is reasonable, but maybe not the expected number of distinct values for each row group.
I did some research on other implementations. Here are the Spark settings: https://spark.apache.org/docs/latest/configuration.html
The Arrow C++ writer seems to allow an fpp setting:
https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N5arrow8adapters3orc12WriteOptions16bloom_filter_fppE
Databricks seems to expose fpp, max_fpp, and the number of distinct values:
https://docs.databricks.com/sql/language-manual/delta-create-bloomfilter-index.html
Originally posted by @alamb in #3119 (review)