feat(ste_vec): add GIN-indexable containment using jsonb arrays #156
Changes from all commits
8fa9a84
1606121
071c4fe
4dcf9f9
1a8fa97
b5e840f
1cb7c18
abd8418
c6b1927
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -5,14 +5,14 @@ | |
| > If you are using Protect.js, see the [Protect.js schema](https://github.com/cipherstash/protectjs/blob/main/docs/reference/schema.md). | ||
|
|
||
| The following functions allow you to configure indexes for encrypted columns. | ||
| All these functions modify the `eql_v2_configuration` table in your database, and are added during the EQL installation. | ||
| All these functions modify the `public.eql_v2_configuration` table in your database, and are added during the EQL installation. | ||
|
|
||
| > **IMPORTANT:** When you modify or add search configuration index, you must re-encrypt data that's already been stored in the database. | ||
| > The CipherStash encryption solution will encrypt the data based on the current state of the configuration. | ||
|
|
||
| ### Configuring search (`eql_v2.add_search_config`) | ||
|
|
||
| Add an index to an encrypted column. | ||
| Add an index to an encrypted column. Returns the updated configuration as JSONB. | ||
|
|
||
| ```sql | ||
| SELECT eql_v2.add_search_config( | ||
|
|
@@ -31,6 +31,7 @@ SELECT eql_v2.add_search_config( | |
| | `index_name` | The index kind | Required | | ||
| | `cast_as` | The PostgreSQL type decrypted data will be cast to | Optional. Defaults to `text` | | ||
| | `opts` | Index options | Optional for `match` indexes, required for `ste_vec` indexes (see below) | | ||
| | `migrating` | Skip auto-migration if true | Optional. Defaults to `false`. Set to `true` for batch operations | | ||
|
|
||
| #### Option (`cast_as`) | ||
|
|
||
|
|
@@ -60,33 +61,33 @@ The default match index options are: | |
| "tokenizer": { | ||
| "kind": "ngram", | ||
| "token_length": 3 | ||
| } | ||
| "token_filters": { | ||
| "kind": "downcase" | ||
| } | ||
| }, | ||
| "token_filters": [ | ||
| {"kind": "downcase"} | ||
| ] | ||
| } | ||
| ``` | ||
|
|
||
| - `tokenFilters`: a list of filters to apply to normalize tokens before indexing. | ||
| - `token_filters`: a list of filters to apply to normalize tokens before indexing. | ||
| - `tokenizer`: determines how input text is split into tokens. | ||
| - `m`: The size of the backing [bloom filter](https://en.wikipedia.org/wiki/Bloom_filter) in bits. Defaults to `2048`. | ||
| - `bf`: The size of the backing [bloom filter](https://en.wikipedia.org/wiki/Bloom_filter) in bits. Defaults to `2048`. | ||
| - `k`: The maximum number of bits set in the bloom filter per term. Defaults to `6`. | ||
|
|
||
| **Token filters** | ||
|
|
||
| There are currently only two token filters available: `downcase` and `upcase`. These are used to normalise the text before indexing and are also applied to query terms. An empty array can also be passed to `tokenFilters` if no normalisation of terms is required. | ||
| The `downcase` token filter is available to normalise text before indexing and is also applied to query terms. An empty array can also be passed to `token_filters` if no normalisation of terms is required. | ||
|
|
||
| **Tokenizer** | ||
|
|
||
| There are two `tokenizer`s provided: `standard` and `ngram`. | ||
| `standard` simply splits text into tokens using this regular expression: `/[ ,;:!]/`. | ||
| `ngram` splits the text into n-grams and accepts a configuration object that allows you to specify the `tokenLength`. | ||
|
|
||
| **m** and **k** | ||
| **bf** and **k** | ||
|
|
||
| `k` and `m` are optional fields for configuring [bloom filters](https://en.wikipedia.org/wiki/Bloom_filter) that back full text search. | ||
| `k` and `bf` are optional fields for configuring [bloom filters](https://en.wikipedia.org/wiki/Bloom_filter) that back full text search. | ||
|
|
||
| `m` is the size of the bloom filter in bits. `filterSize` must be a power of 2 between `32` and `65536` and defaults to `2048`. | ||
| `bf` is the size of the bloom filter in bits. It must be a power of 2 between `32` and `65536` and defaults to `2048`. | ||
|
|
||
| `k` is the number of hash functions to use per term. | ||
| This determines the maximum number of bits that will be set in the bloom filter per term. | ||
|
|
@@ -103,7 +104,9 @@ Try to ensure that the string you search for is at least as long as the `tokenLe | |
|
|
||
| #### Options for ste_vec indexes (`opts`) | ||
|
|
||
| An ste_vec index on a encrypted JSONB column enables the use of PostgreSQL's `@>` and `<@` [containment operators](https://www.postgresql.org/docs/16/functions-json.html#FUNCTIONS-JSONB-OP-TABLE). | ||
| An ste_vec index on an encrypted JSONB column enables the use of PostgreSQL's `@>` and `<@` [containment operators](https://www.postgresql.org/docs/16/functions-json.html#FUNCTIONS-JSONB-OP-TABLE). | ||
|
|
||
| > **Note:** The `@>` and `<@` operators work directly on `eql_v2_encrypted` types, allowing simple query syntax like `encrypted_col @> search_term`. | ||
|
|
||
| An ste_vec index requires one piece of configuration: the `prefix` (a string) which is passed as an info string to a MAC (Message Authenticated Code). | ||
| This ensures that all of the encrypted values are unique to that prefix. | ||
|
|
@@ -204,7 +207,7 @@ A query prior to encrypting and indexing looks like a structurally similar subse | |
| } | ||
| ``` | ||
|
|
||
| The expression `cs_ste_vec_v2(encrypted_account) @> cs_ste_vec_v2($query)` would match all records where the `encrypted_account` column contains a JSONB object with an "account" key containing an object with an "email" key where the value is the string "alice@example.com". | ||
| The expression `encrypted_account @> $query` would match all records where the `encrypted_account` column contains a JSONB object with an "account" key containing an object with an "email" key where the value is the string "alice@example.com". | ||
|
Contributor
An example of an equivalent plaintext Postgres query would be illustrative.
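
For reference, a plaintext equivalent might look like the sketch below. The table `accounts_plain` and the column `account_doc` are hypothetical names chosen for illustration; only the `@>` operator and the shape of the query document come from the documentation under review.

```sql
-- Hypothetical plaintext table; names are illustrative, not part of EQL.
CREATE TEMP TABLE accounts_plain (
  id          int PRIMARY KEY,
  account_doc jsonb NOT NULL
);

INSERT INTO accounts_plain (id, account_doc) VALUES
  (1, '{"account": {"email": "alice@example.com", "name": "Alice"}}'),
  (2, '{"account": {"email": "bob@example.com"}}');

-- Plain jsonb containment: matches rows whose document contains an "account"
-- object with an "email" key equal to "alice@example.com" (row 1 here).
SELECT id
FROM accounts_plain
WHERE account_doc @> '{"account": {"email": "alice@example.com"}}'::jsonb;
```

The encrypted form `encrypted_account @> $query` is intended to read the same way, with both sides being `eql_v2_encrypted` values rather than plain `jsonb`.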
||
|
|
||
| When reduced to a prefix list, it would look like this: | ||
|
|
||
|
|
@@ -224,9 +227,26 @@ When reduced to a prefix list, it would look like this: | |
|
|
||
| Which is then turned into an ste_vec of hashes which can be directly queries against the index. | ||
|
|
||
| #### GIN indexing for ste_vec | ||
|
|
||
| For efficient containment queries on large tables, you can create a GIN index using the `eql_v2.jsonb_array()` function: | ||
|
|
||
| ```sql | ||
| -- Create GIN index for containment queries | ||
| CREATE INDEX idx_encrypted_jsonb ON mytable USING GIN (eql_v2.jsonb_array(encrypted_col)); | ||
|
|
||
| -- Query using containment (will use the GIN index) | ||
| SELECT * FROM mytable WHERE encrypted_col @> $1::eql_v2_encrypted; | ||
| ``` | ||
|
|
||
| The following helper functions are available for GIN-indexed containment queries: | ||
| - `eql_v2.jsonb_array(val)` - Extracts encrypted JSONB as an array for GIN indexing | ||
| - `eql_v2.jsonb_contains(a, b)` - GIN-indexable containment check (`a @> b`) | ||
| - `eql_v2.jsonb_contained_by(a, b)` - GIN-indexable "is contained by" check (`a <@ b`) | ||
|
Comment on lines +243 to +245
Contributor
We'll need to review how the mapper handles the transformation and ensure that it invokes the correct EQL functions.
||
|
|
||
| ### Modifying an index (`eql_v2.modify_search_config`) | ||
|
|
||
| Modifies an existing index configuration. | ||
| Modifies an existing index configuration. Returns the updated configuration as JSONB. | ||
| Accepts the same parameters as `eql_v2.add_search_config` | ||
|
|
||
| ```sql | ||
|
|
@@ -240,9 +260,22 @@ SELECT eql_v2.modify_search_config( | |
| ); | ||
| ``` | ||
|
|
||
| **Example:** | ||
|
|
||
| ```sql | ||
| -- Update match index options to increase bloom filter size | ||
| SELECT eql_v2.modify_search_config( | ||
| 'users', | ||
| 'email', | ||
| 'match', | ||
| 'text', | ||
| '{"bf": 4096, "k": 8}'::jsonb | ||
| ); | ||
| ``` | ||
|
|
||
| ### Removing an index (`eql_v2.remove_search_config`) | ||
|
|
||
| Removes an index configuration from the column. | ||
| Removes an index configuration from the column. Returns the updated configuration as JSONB. | ||
|
|
||
| ```sql | ||
| SELECT eql_v2.remove_search_config( | ||
|
|
@@ -253,6 +286,17 @@ SELECT eql_v2.remove_search_config( | |
| ); | ||
| ``` | ||
|
|
||
| **Example:** | ||
|
|
||
| ```sql | ||
| -- Remove the match index from the email column | ||
| SELECT eql_v2.remove_search_config( | ||
| 'users', | ||
| 'email', | ||
| 'match' | ||
| ); | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ### Didn't find what you wanted? | ||
|
|
||
When reading about Bloom filters, `k` is usually described as the number of hash functions (each of which will produce a hash that maps to one bit). Not a hill I'll die on, a minor nit really.
Might not hurt to link to a Bloom filter calculator though, so that users can set the config to appropriate values.
e.g. https://di-mgt.com.au/bloom-calculator.html
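
As a rough alternative to an external calculator, the textbook approximation for a Bloom filter's false-positive rate, p ≈ (1 - e^(-k·n/m))^k, can be evaluated directly in PostgreSQL. This is only a sketch: it assumes the match index behaves like an idealized Bloom filter, and `n` (the number of distinct tokens stored per value) is an assumed figure that is not taken from the documentation.

```sql
-- Approximate false-positive rate for an idealized Bloom filter.
--   bf = filter size in bits (the `bf` option)
--   k  = hash functions per term (the `k` option)
--   n  = distinct tokens stored in one filter (assumed value, for illustration)
SELECT
  bf,
  k,
  n,
  round(power(1 - exp(-(k * n)::float8 / bf), k)::numeric, 8) AS approx_false_positive_rate
FROM (VALUES
  (2048, 6, 50),   -- documented defaults
  (4096, 8, 50)    -- values used in the modify_search_config example
) AS cfg(bf, k, n);
```

Longer values produce more n-gram tokens, so `n` should be estimated from realistic data before settling on `bf` and `k`.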