Faceting/Dynamic indexing (with MVs) #47
Comments
This becomes less useful the more files are merged, honestly. E.g. if a day is reduced to 3 files, then a given value probably appears in all 3 files, and unless we pull it out into its own column it's a full scan of the files anyway.
Ultimately this would be more useful if we had lower-level control over the parquet file reading. For example, if we could say a value lives in {...} row groups in files {...}, then we could go directly to those byte ranges, but we don't have this low-level control within duckdb or clickhouse to read that way. A custom parquet reader could probably do this, and it could be written in python.
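As a rough illustration of the "go directly to those byte ranges" idea, here is a minimal sketch using only the standard library. The `read_ranges` helper and the `(offset, length)` index layout are assumptions made up for this example; it only fetches raw bytes and does not decode parquet, which a real custom reader would still have to do.

```python
import io

def read_ranges(fileobj, ranges):
    """Read only the given (offset, length) byte ranges from a seekable file.

    `ranges` is a hypothetical per-file index entry: where the row groups
    containing a given facet value start and how long they are.
    """
    chunks = []
    for offset, length in sorted(ranges):
        fileobj.seek(offset)          # jump straight to the row group
        chunks.append(fileobj.read(length))
    return chunks

# Stand-in for a parquet file; a real reader would open the object from storage.
blob = io.BytesIO(b"0123456789abcdefghijklmnopqrstuvwxyz")
chunks = read_ranges(blob, [(20, 5), (4, 6)])
```

The point is only that the meta store would need to track row-group offsets per facet value, not just file names.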
Instead of using a JSONB column and a GIN index, we could make a separate table that is itself the index. That way we could do sparse indexing and probably be far more efficient with both size and cardinality.
Moving to example, as this is more a way to use icedb than a feature, and it really is just a materialized view, because the data should be copied to handle this most efficiently.
Closed as these are just materialized views
"Facets" is loosely taken from Datadog's terminology, but this would allow for dynamic "indexing" of columns not in the partition strategy.
We can add additional columns to the meta store called `facet_keys` and `facet_values` to keep track of the known keys and values of additional columns inside the parquet file. Schema like:
Facet values should be stored in arrays like:
A secondary GIN index will then allow us to track the values so that they can be considered in queries for increased filtering. This would allow a sort of "indexing" on additional columns without double-writing (i.e. a second table).
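To make the lookup concrete, here is a minimal in-memory sketch of what the GIN index would be doing. The row layout, file names, and helper functions are assumptions for illustration only; the proposal itself stores these in JSONB columns in the meta store, and a plain Python inverted index stands in for the GIN index here.

```python
from collections import defaultdict

# Hypothetical meta store rows, one per parquet file. Only the name
# facet_values comes from the proposal; the layout is assumed.
META_ROWS = [
    {"file": "d=2023-01-01/a.parquet",
     "facet_values": {"user_id": ["u1", "u2"], "region": ["eu"]}},
    {"file": "d=2023-01-01/b.parquet",
     "facet_values": {"user_id": ["u3"], "region": ["us"]}},
    {"file": "d=2023-01-01/c.parquet",
     "facet_values": {"user_id": ["u1"], "region": ["us"]}},
]

def build_inverted_index(rows):
    """Map (facet key, facet value) -> set of files containing that value."""
    index = defaultdict(set)
    for row in rows:
        for key, values in row["facet_values"].items():
            for value in values:
                index[(key, value)].add(row["file"])
    return index

def files_matching(index, key, value):
    """Return the files known to contain `value` for facet `key`."""
    return sorted(index.get((key, value), set()))

index = build_inverted_index(META_ROWS)
matches = files_matching(index, "user_id", "u1")
```

Only the files returned by `files_matching` would need to be handed to the query engine for that predicate.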
Facets will have to be defined, and will not backfill on previous data. The known facets will need to be stored in the DB as well in a new table.
Facets can have any data type, since they are JSONB columns. We will have to match query predicates to these facets similar to #45
This should be exposed entirely as python functions so that any query engine can be used. As long as the predicate can be intercepted, faceting can be supported (otherwise I guess really ugly functions could be used).
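A minimal sketch of what such a python function might look like, assuming the predicate has already been intercepted and reduced to simple equality filters. The `prune_files` name and the row layout are hypothetical. Because facets are not backfilled, files with no recorded facet for a column cannot be pruned and must still be scanned.

```python
def prune_files(meta_rows, equality_predicates):
    """Return files that may contain rows matching all equality predicates."""
    candidates = []
    for row in meta_rows:
        facets = row.get("facet_values") or {}
        skip = False
        for column, wanted in equality_predicates.items():
            known = facets.get(column)
            # No recorded facet (e.g. an older file, never backfilled):
            # we cannot rule the file out, so it stays a candidate.
            if known is not None and wanted not in known:
                skip = True
                break
        if not skip:
            candidates.append(row["file"])
    return candidates

rows = [
    {"file": "old.parquet", "facet_values": None},
    {"file": "a.parquet", "facet_values": {"region": ["eu"], "user_id": ["u1", "u2"]}},
    {"file": "b.parquet", "facet_values": {"region": ["us"], "user_id": ["u1"]}},
]
candidates = prune_files(rows, {"region": "eu", "user_id": "u1"})
```

The resulting file list could then be passed to whichever query engine is in use, which keeps the faceting logic independent of the engine.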