Faceting/Dynamic indexing (with MVs) #47

Closed
danthegoodman1 opened this issue Jul 4, 2023 · 5 comments
Labels
example How to do something with IceDB

Comments

@danthegoodman1
Owner

The term "facets" is loosely borrowed from Datadog's terminology. This would allow for dynamic "indexing" of columns that are not in the partition strategy.

We can add additional columns to the meta store, called facet_keys and facet_values, to keep track of the known keys and values of additional columns inside the parquet file.

Schema like:

facets JSONB

Facet values should be stored in arrays like:

{
  "some.known.path": [1, 2, "a"]
}

A secondary GIN index will then allow us to track the values so that they can be considered in queries for increased filtering. This would allow a sort of "indexing" on additional columns without double-writing (i.e. a second table).
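As a rough illustration of the filtering idea, here is a minimal Python sketch that prunes files using per-file facets. The in-memory `meta_store` dict, the file names, and the `candidate_files` helper are all hypothetical stand-ins for the real meta store; in Postgres the same lookup would presumably be a JSONB containment check backed by the GIN index.

```python
from typing import Any

# Hypothetical in-memory stand-in for the meta store: one entry per parquet
# file, with a "facets" mapping shaped like the JSONB example above.
meta_store: dict[str, dict[str, list[Any]]] = {
    "file_a.parquet": {"some.known.path": [1, 2, "a"]},
    "file_b.parquet": {"some.known.path": [3]},
    "file_c.parquet": {},  # written before the facet was defined (no backfill)
}


def candidate_files(meta: dict[str, dict[str, list[Any]]], key: str, value: Any) -> list[str]:
    """Return the files that may contain rows where `key` equals `value`."""
    keep = []
    for path, facets in meta.items():
        if key not in facets:
            # No facet data recorded for this file, so it cannot be ruled out.
            keep.append(path)
        elif value in facets[key]:
            keep.append(path)
    return keep


print(candidate_files(meta_store, "some.known.path", "a"))
```

Files with no recorded facet data are kept rather than skipped, since without backfill there is no way to know what they contain.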

Facets will have to be defined, and will not backfill on previous data. The known facets will need to be stored in the DB as well in a new table.

Facets can have any data type, since they are stored in a JSONB column. We will have to match query predicates to these facets, similar to #45.

This should be exposed entirely as Python functions so that any query engine can be used. As long as the predicate can be intercepted, faceting can be supported (otherwise I guess really ugly functions could be used).
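One possible shape for such engine-agnostic functions is sketched below. It assumes predicates have already been intercepted and reduced to simple `(column, value)` equality pairs; the `Predicate` type, `prune`, `faceted_query`, and the `execute` callback are hypothetical names, not part of IceDB.

```python
from typing import Any, Callable

# Hypothetical types: an equality predicate and the per-file facet mapping.
Predicate = tuple[str, Any]
FileFacets = dict[str, dict[str, list[Any]]]


def prune(files: FileFacets, predicates: list[Predicate]) -> list[str]:
    """Keep files whose facets do not rule out any of the equality predicates."""
    def may_match(facets: dict[str, list[Any]]) -> bool:
        # A file survives if, for every predicate, the facet is either
        # unknown for that file or lists the requested value.
        return all(col not in facets or val in facets[col] for col, val in predicates)

    return [path for path, facets in files.items() if may_match(facets)]


def faceted_query(
    files: FileFacets,
    predicates: list[Predicate],
    execute: Callable[[list[str], list[Predicate]], Any],
) -> Any:
    """Prune the file list, then hand it to whatever query engine the caller wraps."""
    return execute(prune(files, predicates), predicates)


# Usage with a fake engine that just reports which files it was asked to scan.
files = {
    "a.parquet": {"user.id": [1, 2]},
    "b.parquet": {"user.id": [3]},
    "old.parquet": {},
}
print(faceted_query(files, [("user.id", 3)], lambda paths, preds: paths))
```

The engine only ever sees a shorter file list, so the same functions could sit in front of DuckDB, ClickHouse, or anything else that accepts explicit file paths.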

@danthegoodman1 added the documentation and enhancement labels Jul 4, 2023
@danthegoodman1
Owner Author

Honestly, this becomes less useful the more the files get merged. E.g. if a day is reduced to 3 files, then a given value probably appears in all 3 files, and unless we pull it out into its own column it's a full scan of the files anyway.

@danthegoodman1
Owner Author

Ultimately this would be more useful if we had lower-level control over the parquet file reading. For example, if we could say a value lives in row groups {...} in files {...}, then we could read just those byte ranges directly. We don't have this low-level control within DuckDB or ClickHouse to read that way. A custom parquet reader could probably do this, and it could be written in Python.
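To make the row-group idea concrete, here is a small sketch of the bookkeeping side only: an index from facet value to the row groups that contain it. The `build_index` and `locate` helpers and the observation tuples are hypothetical; the actual reading of those row groups would be up to a custom reader.

```python
from collections import defaultdict
from typing import Any, Iterable

# Hypothetical observation recorded at write time:
# (file path, row group number, facet key, facet value)
Observation = tuple[str, int, str, Any]


def build_index(observations: Iterable[Observation]) -> dict:
    """Map (facet key, facet value) -> {file path: set of row group numbers}."""
    index: dict = defaultdict(lambda: defaultdict(set))
    for path, row_group, key, value in observations:
        index[(key, value)][path].add(row_group)
    return index


def locate(index: dict, key: str, value: Any) -> dict[str, list[int]]:
    """Return the row groups per file that may contain `key` == `value`."""
    hits = index.get((key, value), {})
    return {path: sorted(groups) for path, groups in hits.items()}


observations = [
    ("a.parquet", 0, "some.known.path", "a"),
    ("a.parquet", 2, "some.known.path", "a"),
    ("b.parquet", 1, "some.known.path", "a"),
    ("b.parquet", 0, "some.known.path", 3),
]
index = build_index(observations)
print(locate(index, "some.known.path", "a"))
```

A reader with row-group access could then fetch only the listed groups instead of scanning each file end to end.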

@danthegoodman1
Owner Author

Instead of using a JSONB column and a GIN index, we could make a table that IS an index. That way we could do sparse indexing and probably be far more efficient with both size and cardinality.
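One reading of "a table that is an index" is a narrow table keyed on (facet key, facet value, file), sketched here with Python's built-in `sqlite3` as a stand-in for the real meta store. The `facet_index` table, its column names, and storing values as text are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# WITHOUT ROWID stores the rows in the primary-key B-tree itself,
# so the table effectively is the index.
conn.execute(
    """
    CREATE TABLE facet_index (
        facet_key   TEXT NOT NULL,
        facet_value TEXT NOT NULL,
        file_path   TEXT NOT NULL,
        PRIMARY KEY (facet_key, facet_value, file_path)
    ) WITHOUT ROWID
    """
)

# Only combinations that were actually observed get a row.
conn.executemany(
    "INSERT OR IGNORE INTO facet_index VALUES (?, ?, ?)",
    [
        ("some.known.path", "a", "file_a.parquet"),
        ("some.known.path", "1", "file_a.parquet"),
        ("some.known.path", "3", "file_b.parquet"),
        ("some.known.path", "a", "file_a.parquet"),  # duplicate, ignored
    ],
)


def files_for(conn: sqlite3.Connection, key: str, value: object) -> list[str]:
    """Return the files recorded as containing `key` == `value`."""
    rows = conn.execute(
        "SELECT file_path FROM facet_index "
        "WHERE facet_key = ? AND facet_value = ? ORDER BY file_path",
        (key, str(value)),
    )
    return [row[0] for row in rows]


print(files_for(conn, "some.known.path", "a"))
```

Compared with a JSONB array per file, each lookup is a plain primary-key range scan, and nothing is stored for values that never occurred.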

@danthegoodman1 added the example label and removed the documentation and enhancement labels Aug 9, 2023
@danthegoodman1
Owner Author

Moving this to example, as it is more a way to use IceDB than a feature. It is also really just a materialized view, because the data should be copied to handle this most efficiently.

@danthegoodman1 changed the title from "Faceting/Dynamic indexing" to "Faceting/Dynamic indexing (with MVs)" Aug 9, 2023
@danthegoodman1
Owner Author

Closed as these are just materialized views
