audit and create a document for bloom filter configurations #3138

jimexist · 2022-11-19T05:29:53Z

    Thank you @Jimexist  -- this is very cool. I went through the code fairly thoroughly. I had some minor suggestions / comments for documentation and code structure but nothing that would block merging.

I think the biggest thing I would like to discuss is "what parameters to expose for the writer API". I was thinking, for example, will users of this feature be able to set "fpp" and "ndv" reasonably? I suppose having the number of distinct values before writing a parquet file is reasonable, but maybe not the expected number of distinct values for each row group.

I did some research of other implementations. Here are the spark settingss https://spark.apache.org/docs/latest/configuration.html

spark.sql.optimizer.runtime.bloomFilter.creationSideThreshold	10MB	Size threshold of the bloom filter creation side plan. Estimated size needs to be under this value to try to inject bloom filter.	3.3.0
spark.sql.optimizer.runtime.bloomFilter.enabled	false	When true and if one side of a shuffle join has a selective predicate, we attempt to insert a bloom filter in the other side to reduce the amount of shuffle data.	3.3.0
spark.sql.optimizer.runtime.bloomFilter.expectedNumItems	1000000	The default number of expected items for the runtime bloomfilter	3.3.0
spark.sql.optimizer.runtime.bloomFilter.maxNumBits	67108864	The max number of bits to use for the runtime bloom filter	3.3.0
spark.sql.optimizer.runtime.bloomFilter.maxNumItems	4000000	The max allowed number of expected items for the runtime bloom filter	3.3.0
spark.sql.optimizer.runtime.bloomFilter.numBits	8388608	The default number of bits to use for the runtime bloom filter	3.3.0

the arrow parquet C++ writer seems to allow for the fpp setting

https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N5arrow8adapters3orc12WriteOptions16bloom_filter_fppE

double bloom_filter_fpp = 0.05
The upper limit of the false-positive rate of the bloom filter, default 0.05.

Databricks seems to expose the fpp, max_fpp, and num distinct values:
https://docs.databricks.com/sql/language-manual/delta-create-bloomfilter-index.html

Originally posted by @alamb in #3119 (review)

The text was updated successfully, but these errors were encountered:

jimexist · 2022-11-19T05:30:12Z

this is considered a follow up of:

Support bloom filter reading and writing for parquet #3023

jimexist · 2022-11-19T05:34:56Z

FYI in spark there's also a document regarding options that can be set for parquet bloom filter: https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html

alamb · 2022-11-21T23:02:42Z

Do you have any suggestions? After a few more days of thought I don't have anything better than ndv and fpp.

The only other possibly I have is to keep this crate simpler and simply expose set_bloom_filter_size and have the users explicitly specify the size. It isn't ideal, but perhaps it would be ok if we added a pointer to the canonical ndv/fpp calculations?

jimexist · 2022-11-22T14:49:01Z

@alamb i believe we should start simple, to support only 2 params:

whether bloom filter is enabled as a master switch
fpp (0, 1.0), with which we'd assume all unique items, and use that row count per row group to calculate a bitset size, but cap that to 128MiB for unreasonably small fpp e.g. 0.0000001; for very large fpp e.g. 0.9999 the minimal is 32.

controlling disk size does not quite make sense or is counter intuitive because users then need to both estimate unique number of items per row group as well as know how to derive fpp from that - in most cases, having a maxinum fpp is good enough

cc @tustvold

alamb · 2022-11-22T14:56:22Z

I like the idea of specifying fpp (and it follows the arrow C++model)

with which we'd assume all unique items

I think that makes sense as the main use case for bloom filters is high cardinality / close to unique columns.

Perhaps we can document the case clearly (aka "bloom filters will likely only help for almost unique data like "ids" and "uuids", for other types sorting /clustering and min/max statistics will work as well if not better)

jimexist · 2022-11-23T08:00:51Z

turns out i have to allow users to specify ndv and have that defaults to say 1 million. the current code architect requires flow encoding which means there's no good way to know in advance how many num of rows will be written.

alamb · 2022-11-25T10:36:39Z

label_issue.py automatically added labels {'parquet'} from #3165

jimexist self-assigned this Nov 19, 2022

jimexist mentioned this issue Nov 19, 2022

parquet bloom filter part III: add sbbf writer, remove bloom default feature, add reader properties #3119

Merged

jimexist mentioned this issue Nov 23, 2022

bloom filter part IV: adjust writer properties, bloom filter properties, and incorporate into column encoder #3165

Merged

tustvold closed this as completed in #3165 Nov 24, 2022

alamb added the parquet Changes to the parquet crate label Nov 25, 2022

alamb added the enhancement Any new improvement worthy of a entry in the changelog label Nov 25, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

audit and create a document for bloom filter configurations #3138

audit and create a document for bloom filter configurations #3138

jimexist commented Nov 19, 2022

jimexist commented Nov 19, 2022

jimexist commented Nov 19, 2022

alamb commented Nov 21, 2022

jimexist commented Nov 22, 2022 •

edited

Loading

alamb commented Nov 22, 2022

jimexist commented Nov 23, 2022

alamb commented Nov 25, 2022

audit and create a document for bloom filter configurations #3138

audit and create a document for bloom filter configurations #3138

Comments

jimexist commented Nov 19, 2022

jimexist commented Nov 19, 2022

jimexist commented Nov 19, 2022

alamb commented Nov 21, 2022

jimexist commented Nov 22, 2022 • edited Loading

alamb commented Nov 22, 2022

jimexist commented Nov 23, 2022

alamb commented Nov 25, 2022

jimexist commented Nov 22, 2022 •

edited

Loading