@tobyhede commented Dec 11, 2025

Add jsonb_array() functions that return jsonb[] instead of eql_v2_encrypted[], enabling efficient
GIN indexed containment queries.

The root cause of GIN index failures at scale was that eql_v2_encrypted lacked hash operator class
support required by PostgreSQL's array_ops. By converting to jsonb[], which has native hash
support, GIN indexes work correctly with containment operators (@>, <@).

New SQL functions:

  • eql_v2.jsonb_array(jsonb) -> jsonb[]
  • eql_v2.jsonb_array(eql_v2_encrypted) -> jsonb[]
  • eql_v2.jsonb_contains(a, b) -> boolean (@> wrapper)
  • eql_v2.jsonb_contained_by(a, b) -> boolean (<@ wrapper)

Test infrastructure:

  • 500-row test dataset for GIN index testing
  • 5 containment tests verifying index usage
  • Helper functions for GIN index creation and verification
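A minimal sketch of how the functions listed above might be used together. The `users` table, the `encrypted_account` column, and the index name are hypothetical and do not come from this PR; the argument types passed to `eql_v2.jsonb_contains` are likewise an assumption.

```sql
-- Hypothetical table with one encrypted JSONB column.
CREATE TABLE users (
    id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    encrypted_account eql_v2_encrypted
);

-- Expression index over the jsonb[] form of the encrypted column.
CREATE INDEX users_encrypted_account_gin
    ON users USING GIN (eql_v2.jsonb_array(encrypted_account));

-- Direct form: the WHERE clause repeats the indexed expression so the
-- planner can match it. $1 is the encrypted search term bound by the client;
-- the cast picks the eql_v2_encrypted overload of jsonb_array.
SELECT id
FROM users
WHERE eql_v2.jsonb_array(encrypted_account)
      @> eql_v2.jsonb_array($1::eql_v2_encrypted);

-- Helper form: the @> wrapper described above (argument types assumed).
SELECT id
FROM users
WHERE eql_v2.jsonb_contains(encrypted_account, $1::eql_v2_encrypted);
```

Both queries are intended to express the same containment check; which one a given client or mapper emits is outside the scope of this sketch.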

Add ste_vec_jsonb() functions that return jsonb[] instead of
eql_v2_encrypted[], enabling efficient GIN indexed containment queries.

The root cause of GIN index failures at scale was that eql_v2_encrypted
lacked hash operator class support required by PostgreSQL's array_ops.
By converting to jsonb[], which has native hash support, GIN indexes
work correctly with containment operators (@>, <@).

New SQL functions:
- eql_v2.ste_vec_jsonb(jsonb) -> jsonb[]
- eql_v2.ste_vec_jsonb(eql_v2_encrypted) -> jsonb[]
- eql_v2.ste_vec_contains_jsonb(a, b) -> boolean (@> wrapper)
- eql_v2.ste_vec_is_contained_by_jsonb(a, b) -> boolean (<@ wrapper)

Test infrastructure:
- 500-row test dataset for GIN index testing
- 5 containment tests verifying index usage
- Helper functions for GIN index creation and verification

Rename GIN-indexable containment functions to use cleaner, purpose-based
naming that doesn't expose the internal ste_vec structure:

- ste_vec_jsonb() -> jsonb_array()
- ste_vec_contains_jsonb() -> jsonb_contains()
- ste_vec_is_contained_by_jsonb() -> jsonb_contained_by()

The new names follow the existing jsonb_* pattern in the codebase and
clearly indicate these functions are for JSONB containment operations.

- Fix needless borrow in assert_uses_index
- Apply cargo fmt formatting
- Add TOC entry for GIN Indexes section
- Add cross-reference in B-tree limitations pointing to GIN alternative
- Add comprehensive GIN Indexes section covering:
  - When to use GIN indexes
  - Creating a GIN index with jsonb_array()
  - Query patterns (direct and helper function)
  - Verifying index usage with EXPLAIN
  - GIN vs B-tree comparison table

Verified and corrected documentation against current implementation:
- Replace obsolete cs_ste_vec_v2 function with direct @> operator syntax
- Fix bloom filter parameter names (m/filterSize → bf)
- Fix token_filters JSON structure (object → array) and add missing comma
- Change tokenFilters to token_filters (snake_case)
- Remove non-existent upcase token filter reference
- Add GIN indexing section for ste_vec containment queries
- Document migrating parameter in parameter table
- Add return type (JSONB) to all config function descriptions
- Add SQL examples for modify_search_config and remove_search_config
- Specify public.eql_v2_configuration schema
- Add note about @> operator working directly on eql_v2_encrypted types
- Add "Indexed Containment Queries" subsection after containment operators
- Add "GIN-Indexable Functions" section with jsonb_array, jsonb_contains,
  and jsonb_contained_by function documentation
- Add cross-references to database-indexes.md GIN section

Verify that containment queries use Seq Scan before index creation
and switch to using the GIN index after creation. Exports explain_query
and assert_uses_seq_scan helpers to support this test pattern.
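
The before/after check described above can be approximated by hand in `psql`. This is only a sketch: the `users` table, column, and index names are hypothetical, and the search term is taken from an existing row so that no hand-written ciphertext is needed. The plan the planner actually picks depends on table size and statistics.

```sql
-- Before the index exists, the plan is expected to contain
-- "Seq Scan on users".
EXPLAIN
SELECT id
FROM users
WHERE eql_v2.jsonb_array(encrypted_account)
      @> eql_v2.jsonb_array((SELECT encrypted_account FROM users LIMIT 1));

-- Create the GIN expression index and refresh planner statistics.
CREATE INDEX users_encrypted_account_gin
    ON users USING GIN (eql_v2.jsonb_array(encrypted_account));
ANALYZE users;

-- Re-running the same EXPLAIN should now reference
-- users_encrypted_account_gin (typically via a Bitmap Index Scan),
-- provided the planner judges the index cheaper than a sequential scan.
EXPLAIN
SELECT id
FROM users
WHERE eql_v2.jsonb_array(encrypted_account)
      @> eql_v2.jsonb_array((SELECT encrypted_account FROM users LIMIT 1));
```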

SQL examples using undefined 'search_value' variable would cause
"column does not exist" errors if copied verbatim. Changed to $1
placeholder to indicate query parameter usage.

Fixes common issue identified by dual-verification review.
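
To illustrate the difference, here is a small sketch with hypothetical table, column, and statement names. PostgreSQL parses a bare identifier in this position as a column reference, whereas `$1` is bound by the client at execution time.

```sql
-- A bare identifier is treated as a column reference, so this fails with
--   ERROR:  column "search_value" does not exist
--
--   SELECT id FROM users WHERE encrypted_account @> search_value;

-- With a placeholder, the encrypted search term is supplied as a parameter.
PREPARE find_account (eql_v2_encrypted) AS
    SELECT id
    FROM users
    WHERE encrypted_account @> $1;
```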

Documentation fixes (from cross-checked verification):
- E1-1: Add integer -> operator documentation for array indexing
- E1-2: Clarify sv/ste_vec is a storage field, not an index term type
- E1-3: Correct ->> return type from "ciphertext" to "encrypted value as text"
- E1-4: Document ste_vec(jsonb) overload alongside ste_vec(eql_v2_encrypted)
- E1-5: Add note about -> vs ->> operator asymmetry (integer index support)

Rust lint fixes:
- Apply cargo fmt formatting to lib.rs and containment_with_index_tests.rs

@yujiyokoo left a comment

Looks good. The tests and the accompanying explanation look good too, but I'm not super familiar with SQL...

  - `tokenizer`: determines how input text is split into tokens.
- - `m`: The size of the backing [bloom filter](https://en.wikipedia.org/wiki/Bloom_filter) in bits. Defaults to `2048`.
+ - `bf`: The size of the backing [bloom filter](https://en.wikipedia.org/wiki/Bloom_filter) in bits. Defaults to `2048`.
  - `k`: The maximum number of bits set in the bloom filter per term. Defaults to `6`.

When reading about Bloom filters, k is usually described as the number of hash functions (each of which will produce a hash that maps to one bit).

Not a hill I'll die on, a minor nit really.

Might not hurt to link to a Bloom filter calculator though, so that users can set the config to appropriate values.

e.g. https://di-mgt.com.au/bloom-calculator.html
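
For reference, calculators of this kind are based on the standard Bloom filter approximation below. It assumes a textbook filter with m bits (the `bf` parameter), k hash functions, and n distinct tokens inserted; whether EQL's filter matches these assumptions exactly is not confirmed here.

```latex
p \approx \left(1 - e^{-kn/m}\right)^{k},
\qquad
k_{\mathrm{opt}} = \frac{m}{n}\,\ln 2
```

Under those assumptions, the defaults m = 2048 and k = 6 are optimal at roughly n = (2048 × ln 2) / 6 ≈ 237 distinct tokens per filter, where the false-positive rate is about 2^(-6) ≈ 1.6%.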

- The expression `cs_ste_vec_v2(encrypted_account) @> cs_ste_vec_v2($query)` would match all records where the `encrypted_account` column contains a JSONB object with an "account" key containing an object with an "email" key where the value is the string "alice@example.com".
+ The expression `encrypted_account @> $query` would match all records where the `encrypted_account` column contains a JSONB object with an "account" key containing an object with an "email" key where the value is the string "alice@example.com".

An example of an equivalent plaintext Postgres query would be illustrative.
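
One possible shape for such an example, assuming a hypothetical `users` table with an unencrypted `jsonb` column named `account_data` (neither name appears in the PR):

```sql
-- Plaintext equivalent using PostgreSQL's built-in jsonb containment.
-- Matches rows whose document contains an "account" object whose "email"
-- key holds the string "alice@example.com".
SELECT *
FROM users
WHERE account_data @> '{"account": {"email": "alice@example.com"}}'::jsonb;
```

In the encrypted case the right-hand side would be an encrypted query term supplied as a parameter rather than a JSON literal.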

Comment on lines +243 to +245
- `eql_v2.jsonb_array(val)` - Extracts encrypted JSONB as an array for GIN indexing
- `eql_v2.jsonb_contains(a, b)` - GIN-indexable containment check (`a @> b`)
- `eql_v2.jsonb_contained_by(a, b)` - GIN-indexable "is contained by" check (`a <@ b`)

We'll need to review how the mapper handles the transformation and ensure that it invokes the correct EQL functions.

@freshtonic left a comment

Approved, with comments.

This is really good.

@calvinbrewer merged commit bf6e01f into main on Dec 11, 2025 (4 checks passed)
@calvinbrewer deleted the jsonb_contains_clean branch on December 11, 2025 14:08