Faceting/Dynamic indexing (with MVs) #47

danthegoodman1 · 2023-07-04T01:04:48Z

Facets are kind of taken from Datadog's terminology, but this would allow for dynamic "indexing" of columns not in the partition strategy.

We can add additional columns to the meta store called facet_keys and facet_values that keep track of the known keys and values of additional columns inside the parquet file.

Schema like:

facets JSONB

Facet values should be stored in arrays like:

{
  "some.known.path": [1, 2, "a"]
}
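As a rough sketch of how that facet map could be built for one file at write time: the helper names `get_path` and `collect_facets` are made up for illustration and are not part of icedb, and the rows are assumed to be nested dicts addressed by dotted paths.

```python
import json


def get_path(row, dotted_path):
    """Walk a nested dict along a dotted path; return (found, value)."""
    current = row
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return False, None
        current = current[part]
    return True, current


def collect_facets(rows, facet_paths):
    """Return {path: [distinct values]} for the rows going into one parquet file."""
    seen = {path: {} for path in facet_paths}  # dicts keep first-seen order
    for row in rows:
        for path in facet_paths:
            found, value = get_path(row, path)
            if found:
                # Key on the JSON encoding so 1, 1.0, True and "1" stay distinct
                seen[path][json.dumps(value)] = value
    # Paths that never appeared are left out of the result
    return {path: list(vals.values()) for path, vals in seen.items() if vals}


rows = [
    {"some": {"known": {"path": 1}}},
    {"some": {"known": {"path": 2}}},
    {"some": {"known": {"path": "a"}}},
    {"some": {"known": {"path": 1}}},  # duplicate value, recorded once
    {"other": 5},                      # row without the facet path
]
facets = collect_facets(rows, ["some.known.path"])
print(json.dumps(facets))  # {"some.known.path": [1, 2, "a"]}
```

The resulting dict is what would be stored in the JSONB column for that file.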

A secondary GIN index will then allow us to track the values so that they can be considered in queries for increased filtering. This would allow a sort of "indexing" on additional columns without double-writing (i.e. a second table).

Facets will have to be defined, and will not backfill on previous data. The known facets will need to be stored in the DB as well in a new table.

Facets can have any data type, since they are JSONB columns. We will have to match query predicates to these facets, similar to #45.

This should be exposed entirely as Python functions so that any query engine can be used. As long as the predicate can be intercepted, faceting can be supported (otherwise I guess really ugly functions could be used).
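A minimal sketch of what such a function might look like for a simple equality predicate. `candidate_files` and the file names are hypothetical, and the meta store lookup is replaced by an in-memory list so the example stays engine-agnostic. Files with no facet recorded (written before the facet was defined, since there is no backfill) cannot be ruled out, so they are always kept.

```python
def candidate_files(files, facet_key, wanted_value):
    """Return paths of files that may contain rows where facet_key == wanted_value.

    `files` is a list of (path, facets) pairs, where facets is the
    {"some.known.path": [values...]} dict recorded for that file, or None
    for files written before the facet was defined.
    """
    keep = []
    for path, facets in files:
        values = (facets or {}).get(facet_key)
        if values is None:
            # Nothing recorded for this key: stay conservative and keep the file
            keep.append(path)
        elif wanted_value in values:
            keep.append(path)
    return keep


files = [
    ("a.parquet", {"user.plan": ["free", "pro"]}),
    ("b.parquet", {"user.plan": ["free"]}),
    ("old.parquet", None),  # predates the facet definition
]
print(candidate_files(files, "user.plan", "pro"))  # ['a.parquet', 'old.parquet']
```

The returned list is what would be handed to the query engine in place of the full file listing.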


danthegoodman1 · 2023-07-04T01:14:49Z

This becomes less useful the more merged the files become, honestly. E.g. if reducing a day to 3 files, then something probably appears in all 3 files, and unless we pull it out into its own column it's a full scan of the files anyway.

danthegoodman1 · 2023-07-04T01:22:20Z

Ultimately this would be more useful if we had lower-level control over the parquet file reading. For example, if we could say it lives in {...} row groups in files {...}, then we could just go directly to those byte ranges, but we don't have this low-level control within DuckDB or ClickHouse to read that way. A custom parquet reader could probably do this, which could be written in Python.
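To make the byte-range idea concrete, here is a small sketch of the planning step such a custom reader might do. It assumes a hypothetical index that already maps a facet value to `(file, offset, length)` row-group locations (the offsets themselves would come from each parquet file's footer metadata); `plan_reads` is an illustrative name, and no actual parquet decoding happens here.

```python
from collections import defaultdict


def plan_reads(locations):
    """Group (file, offset, length) row-group locations into merged byte ranges per file."""
    by_file = defaultdict(list)
    for path, offset, length in locations:
        by_file[path].append((offset, offset + length))

    plan = {}
    for path, ranges in by_file.items():
        ranges.sort()
        merged = [ranges[0]]
        for start, end in ranges[1:]:
            last_start, last_end = merged[-1]
            if start <= last_end:
                # Touching or overlapping ranges become a single read
                merged[-1] = (last_start, max(last_end, end))
            else:
                merged.append((start, end))
        plan[path] = merged
    return plan


# Hypothetical row-group locations for one facet value
locations = [
    ("a.parquet", 4, 100),
    ("a.parquet", 104, 50),   # directly follows the previous row group
    ("a.parquet", 1000, 20),
    ("b.parquet", 4, 10),
]
print(plan_reads(locations))
# {'a.parquet': [(4, 154), (1000, 1020)], 'b.parquet': [(4, 14)]}
```

Each `(start, end)` pair is a half-open byte range that could be fetched with a single ranged read.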

danthegoodman1 · 2023-07-04T01:24:48Z

Instead of using a JSONB column and a GIN index, we could make a table that IS an index. That way we could do sparse-indexing and probably be far more efficient with both size and cardinality.
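One possible reading of the table-as-index idea, sketched with Python's built-in `sqlite3` as a stand-in for the real meta store. The table name `facet_index`, its columns, and the helper `files_for` are assumptions for illustration. `WITHOUT ROWID` makes SQLite store the rows directly in the primary-key B-tree, so the table itself is the index.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE facet_index (
        facet_key   TEXT NOT NULL,
        facet_value TEXT NOT NULL,  -- JSON-encoded so 1 and "1" stay distinct
        file_path   TEXT NOT NULL,
        PRIMARY KEY (facet_key, facet_value, file_path)
    ) WITHOUT ROWID
    """
)

# One row per (key, value, file) that actually occurs; nothing is stored otherwise
entries = [
    ("user.plan", "pro", "a.parquet"),
    ("user.plan", "free", "a.parquet"),
    ("user.plan", "free", "b.parquet"),
    ("user.id", 1, "b.parquet"),
]
conn.executemany(
    "INSERT OR IGNORE INTO facet_index VALUES (?, ?, ?)",
    [(key, json.dumps(value), path) for key, value, path in entries],
)


def files_for(conn, facet_key, value):
    """Return the files recorded as containing facet_key == value."""
    rows = conn.execute(
        "SELECT file_path FROM facet_index "
        "WHERE facet_key = ? AND facet_value = ? ORDER BY file_path",
        (facet_key, json.dumps(value)),
    ).fetchall()
    return [row[0] for row in rows]


print(files_for(conn, "user.plan", "free"))  # ['a.parquet', 'b.parquet']
```

Compared with a JSONB array per file, lookups here are a prefix scan on the primary key rather than a containment check against every file's array.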

danthegoodman1 · 2023-08-09T11:47:56Z

Moving to example, as this is more a way to use IceDB than a feature, and it really is just a materialized view, because data should be copied to handle this most efficiently.

danthegoodman1 · 2023-08-13T16:34:40Z

Closed as these are just materialized views

danthegoodman1 added the documentation and enhancement labels Jul 4, 2023

danthegoodman1 mentioned this issue Jul 4, 2023

(Custom) table engine bindings chdb-io/chdb#52

Open

danthegoodman1 added the example label and removed the documentation and enhancement labels Aug 9, 2023

danthegoodman1 changed the title ~~Faceting/Dynamic indexing~~ Faceting/Dynamic indexing (with MVs) Aug 9, 2023

danthegoodman1 closed this as completed Aug 13, 2023
