feat(ste_vec): add GIN-indexable containment using jsonb arrays #156
Add ste_vec_jsonb() functions that return jsonb[] instead of eql_v2_encrypted[], enabling efficient GIN indexed containment queries.

The root cause of GIN index failures at scale was that eql_v2_encrypted lacked hash operator class support required by PostgreSQL's array_ops. By converting to jsonb[], which has native hash support, GIN indexes work correctly with containment operators (@>, <@).

New SQL functions:
- eql_v2.ste_vec_jsonb(jsonb) -> jsonb[]
- eql_v2.ste_vec_jsonb(eql_v2_encrypted) -> jsonb[]
- eql_v2.ste_vec_contains_jsonb(a, b) -> boolean (@> wrapper)
- eql_v2.ste_vec_is_contained_by_jsonb(a, b) -> boolean (<@ wrapper)

Test infrastructure:
- 500-row test dataset for GIN index testing
- 5 containment tests verifying index usage
- Helper functions for GIN index creation and verification
Rename GIN-indexable containment functions to use cleaner, purpose-based naming that doesn't expose the internal ste_vec structure:
- ste_vec_jsonb() -> jsonb_array()
- ste_vec_contains_jsonb() -> jsonb_contains()
- ste_vec_is_contained_by_jsonb() -> jsonb_contained_by()

The new names follow the existing jsonb_* pattern in the codebase and clearly indicate these functions are for JSONB containment operations.
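As a rough illustration of how the renamed functions are meant to fit together, the sketch below creates a GIN expression index over `eql_v2.jsonb_array()` and shows the two query forms the commits mention (direct operator and helper function). The table `users`, the column `encrypted_account`, the index name, and the prepared-statement names are hypothetical. Only the `eql_v2.*` function names come from this PR, and the assumption that both helper arguments can be `eql_v2_encrypted` values is not confirmed by the source.

```sql
-- Hypothetical table with an encrypted JSONB column (names are illustrative).
CREATE TABLE users (
  id bigserial PRIMARY KEY,
  encrypted_account eql_v2_encrypted
);

-- Expression index over the jsonb[] form of the encrypted value.
CREATE INDEX users_encrypted_account_gin
  ON users USING GIN (eql_v2.jsonb_array(encrypted_account));

-- Direct form: the WHERE expression matches the indexed expression exactly.
PREPARE find_direct(eql_v2_encrypted) AS
  SELECT id
  FROM users
  WHERE eql_v2.jsonb_array(encrypted_account) @> eql_v2.jsonb_array($1);

-- Helper form: the wrapper described in this PR as equivalent to a @> b.
PREPARE find_helper(eql_v2_encrypted) AS
  SELECT id
  FROM users
  WHERE eql_v2.jsonb_contains(encrypted_account, $1);
```

The direct form keeps the indexed expression visible in the query, which makes it easier to check against `EXPLAIN` output when confirming that the planner picked the GIN index.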
- Fix needless borrow in assert_uses_index
- Apply cargo fmt formatting
- Add TOC entry for GIN Indexes section
- Add cross-reference in B-tree limitations pointing to GIN alternative
- Add comprehensive GIN Indexes section covering:
  - When to use GIN indexes
  - Creating a GIN index with jsonb_array()
  - Query patterns (direct and helper function)
  - Verifying index usage with EXPLAIN
  - GIN vs B-tree comparison table
Verified and corrected documentation against current implementation:
- Replace obsolete cs_ste_vec_v2 function with direct @> operator syntax
- Fix bloom filter parameter names (m/filterSize → bf)
- Fix token_filters JSON structure (object → array) and add missing comma
- Change tokenFilters to token_filters (snake_case)
- Remove non-existent upcase token filter reference
- Add GIN indexing section for ste_vec containment queries
- Document migrating parameter in parameter table
- Add return type (JSONB) to all config function descriptions
- Add SQL examples for modify_search_config and remove_search_config
- Specify public.eql_v2_configuration schema
- Add note about @> operator working directly on eql_v2_encrypted types
- Add "Indexed Containment Queries" subsection after containment operators
- Add "GIN-Indexable Functions" section with jsonb_array, jsonb_contains, and jsonb_contained_by function documentation
- Add cross-references to database-indexes.md GIN section
Verify that containment queries use Seq Scan before index creation and switch to using the GIN index after creation. Exports explain_query and assert_uses_seq_scan helpers to support this test pattern.
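The before/after check described in this commit can be approximated by hand with `EXPLAIN`. This is only a sketch: the table `users`, the column `encrypted_account`, and the index name are hypothetical, and the subquery reuses a stored value as the search term purely so the example does not need a literal ciphertext.

```sql
-- 1. Before the index exists, the plan is expected to show a Seq Scan on users.
EXPLAIN
SELECT id
FROM users
WHERE eql_v2.jsonb_array(encrypted_account)
      @> eql_v2.jsonb_array((SELECT encrypted_account FROM users LIMIT 1));

-- 2. Create the GIN expression index and refresh planner statistics.
CREATE INDEX users_encrypted_account_gin
  ON users USING GIN (eql_v2.jsonb_array(encrypted_account));
ANALYZE users;

-- 3. Re-run the same query. With enough rows, the plan is expected to show a
--    Bitmap Index Scan on users_encrypted_account_gin (GIN indexes are only
--    used through bitmap scans).
EXPLAIN
SELECT id
FROM users
WHERE eql_v2.jsonb_array(encrypted_account)
      @> eql_v2.jsonb_array((SELECT encrypted_account FROM users LIMIT 1));
```

On very small tables the planner may still prefer a sequential scan after the index is created. For a diagnostic session, `SET enable_seqscan = off;` discourages that choice so the index path becomes visible.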
SQL examples using undefined 'search_value' variable would cause "column does not exist" errors if copied verbatim. Changed to $1 placeholder to indicate query parameter usage. Fixes common issue identified by dual-verification review.
Documentation fixes (from cross-checked verification):
- E1-1: Add integer -> operator documentation for array indexing
- E1-2: Clarify sv/ste_vec is a storage field, not an index term type
- E1-3: Correct ->> return type from "ciphertext" to "encrypted value as text"
- E1-4: Document ste_vec(jsonb) overload alongside ste_vec(eql_v2_encrypted)
- E1-5: Add note about -> vs ->> operator asymmetry (integer index support)

Rust lint fixes:
- Apply cargo fmt formatting to lib.rs and containment_with_index_tests.rs
yujiyokoo
left a comment
Looks good and tests and the accompanying explanation look good, but I'm not super familiar with SQL...
  - `tokenizer`: determines how input text is split into tokens.
- - `m`: The size of the backing [bloom filter](https://en.wikipedia.org/wiki/Bloom_filter) in bits. Defaults to `2048`.
+ - `bf`: The size of the backing [bloom filter](https://en.wikipedia.org/wiki/Bloom_filter) in bits. Defaults to `2048`.
  - `k`: The maximum number of bits set in the bloom filter per term. Defaults to `6`.
When reading about Bloom filters, k is usually described as the number of hash functions (each of which will produce a hash that maps to one bit).
Not a hill I'll die on, a minor nit really.
Might not hurt to link to a Bloom filter calculator though, so that users can set the config to appropriate values.
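In lieu of an external calculator, the standard Bloom filter approximation can be evaluated directly in PostgreSQL. With a filter of `m` bits, `k` bits set per term, and `n` distinct terms inserted, the false positive probability is approximately `(1 - e^(-k*n/m))^k`. The sketch below plugs in the documented defaults (`bf` = 2048, `k` = 6) and a hypothetical range of term counts. Whether EQL's match index follows this textbook model exactly is an assumption.

```sql
-- Approximate false positive rate for a Bloom filter with m bits and k bits
-- set per term, as the number of distinct terms n grows.
SELECT
  n AS distinct_terms,
  round(power(1 - exp(-(k * n)::numeric / m), k), 6) AS approx_false_positive_rate
FROM (VALUES (2048, 6)) AS cfg(m, k)          -- documented defaults: bf = 2048, k = 6
CROSS JOIN generate_series(50, 400, 50) AS n; -- hypothetical term counts
```

Changing the `VALUES` row makes it easy to compare candidate `bf` and `k` settings against the number of tokens a typical column value is expected to produce.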
- The expression `cs_ste_vec_v2(encrypted_account) @> cs_ste_vec_v2($query)` would match all records where the `encrypted_account` column contains a JSONB object with an "account" key containing an object with an "email" key where the value is the string "alice@example.com".
+ The expression `encrypted_account @> $query` would match all records where the `encrypted_account` column contains a JSONB object with an "account" key containing an object with an "email" key where the value is the string "alice@example.com".
An example of an equivalent plaintext Postgres query would be illustrative.
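One possible plaintext counterpart, using PostgreSQL's native `jsonb` containment, might look like the following. The table `users_plain` and its sample rows are hypothetical and exist only to make the query self-contained.

```sql
-- Hypothetical plaintext table mirroring the encrypted example.
CREATE TABLE users_plain (
  id bigserial PRIMARY KEY,
  account jsonb
);

INSERT INTO users_plain (account) VALUES
  ('{"account": {"email": "alice@example.com", "name": "Alice"}}'),
  ('{"account": {"email": "bob@example.com"}}');

-- Matches rows whose document contains an "account" object with an "email"
-- key equal to "alice@example.com"; extra keys such as "name" do not matter.
SELECT id
FROM users_plain
WHERE account @> '{"account": {"email": "alice@example.com"}}'::jsonb;
```

Only the first row matches, because containment requires every key and value in the right-hand document to be present at the same path in the left-hand one.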
- `eql_v2.jsonb_array(val)` - Extracts encrypted JSONB as an array for GIN indexing
- `eql_v2.jsonb_contains(a, b)` - GIN-indexable containment check (`a @> b`)
- `eql_v2.jsonb_contained_by(a, b)` - GIN-indexable "is contained by" check (`a <@ b`)
We'll need to review how the mapper handles the transformation and ensure that it invokes the correct EQL functions.
freshtonic
left a comment
Approved, with comments.
This is really good.
Add jsonb_array() functions that return jsonb[] instead of eql_v2_encrypted[], enabling efficient
GIN indexed containment queries.
The root cause of GIN index failures at scale was that eql_v2_encrypted lacked hash operator class
support required by PostgreSQL's array_ops. By converting to jsonb[], which has native hash
support, GIN indexes work correctly with containment operators (@>, <@).
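New SQL functions:
- eql_v2.jsonb_array(jsonb) -> jsonb[]
- eql_v2.jsonb_array(eql_v2_encrypted) -> jsonb[]
- eql_v2.jsonb_contains(a, b) -> boolean (@> wrapper)
- eql_v2.jsonb_contained_by(a, b) -> boolean (<@ wrapper)

Test infrastructure:
- 500-row test dataset for GIN index testing
- 5 containment tests verifying index usage
- Helper functions for GIN index creation and verification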