diff --git a/README.md b/README.md index 335ca86e..604bfa63 100644 --- a/README.md +++ b/README.md @@ -228,6 +228,7 @@ EQL provides specialized functions to interact with encrypted data: - **`cs_match_v1(val JSONB)`**: Enables basic full-text search. - **`cs_unique_v1(val JSONB)`**: Retrieves the unique index for enforcing uniqueness. - **`cs_ore_v1(val JSONB)`**: Retrieves the Order-Revealing Encryption index for range queries. +- **`cs_ste_vec_v1(val JSONB)`**: Retrieves the Structured Encryption Vector for containment queries. ### Index functions @@ -245,7 +246,7 @@ cs_add_index(table_name text, column_name text, index_name text, cast_as text, o | column_name | Name of target column | Required | index_name | The index kind | Required. | cast_as | The PostgreSQL type decrypted data will be cast to | Optional. Defaults to `text` -| opts | Index options | Optional for `match` indexes (see below) +| opts | Index options | Optional for `match` indexes, required for `ste_vec` indexes (see below) ##### cast_as @@ -256,6 +257,7 @@ Supported types: - big_int - boolean - date + - jsonb ##### match opts @@ -278,6 +280,137 @@ The default Match index options are: } ``` +- `tokenFilters`: a list of filters to apply to normalise tokens before indexing. +- `tokenizer`: determines how input text is split into tokens. +- `m`: The size of the backing [bloom filter](https://en.wikipedia.org/wiki/Bloom_filter) in bits. Defaults to `2048`. +- `k`: The maximum number of bits set in the bloom filter per term. Defaults to `6`. + +**Token Filters** + +There are currently only two token filters available `downcase` and `upcase`. These are used to normalise the text before indexing and are also applied to query terms. An empty array can also be passed to `tokenFilters` if no normalisation of terms is required. + +**Tokenizer** + +There are two `tokenizer`s provided: `standard` and `ngram`. +The `standard` simply splits text into tokens using this regular expression: `/[ ,;:!]/`. +The `ngram` tokenizer splits the text into n-grams and accepts a configuration object that allows you to specify the `tokenLength`. + +**m** and **k** + +`k` and `m` are optional fields for configuring [bloom filters](https://en.wikipedia.org/wiki/Bloom_filter) that back full text search. + +`m` is the size of the bloom filter in bits. `filterSize` must be a power of 2 between `32` and `65536` and defaults to `2048`. + +`k` is the number of hash functions to use per term. +This determines the maximum number of bits that will be set in the bloom filter per term. +`k` must be an integer from `3` to `16` and defaults to `6`. + +**Caveats around n-gram tokenization** + +While using n-grams as a tokenization method allows greater flexibility when doing arbitrary substring matches, it is important to bear in mind the limitations of this approach. +Specifically, searching for strings _shorter_ than the `tokenLength` parameter will not _generally_ work. + +If you're using n-gram as a token filter, then a token that is already shorter than the `tokenLength` parameter will be kept as-is when indexed, and so a search for that short token will match that record. +However, if that same short string only appears as a part of a larger token, then it will not match that record. +In general, therefore, you should try to ensure that the string you search for is at least as long as the `tokenLength` of the index, except in the specific case where you know that there are shorter tokens to match, _and_ you are explicitly OK with not returning records that have that short string as part of a larger token. + +##### ste_vec opts + +An ste_vec index on a encrypted JSONB column enables the use of PostgreSQL's `@>` and `<@` [containment operators](https://www.postgresql.org/docs/16/functions-json.html#FUNCTIONS-JSONB-OP-TABLE). + +An ste_vec index requires one piece of configuration: the `prefix` (a string) which is functionally similar to a salt for the hashing process. + +Within a dataset, encrypted columns indexed using an ste_vec that use different prefixes can never compare as equal. +Containment queries that manage to mix index terms from multiple columns will never return a positive result. +This is by design. + +The index is generated from a JSONB document by first flattening the structure of the document such that a hash can be generated for each unique path prefix to a node. + +The complete set of JSON types is supported by the indexer. +Null values are ignored by the indexer. + +- Object `{ ... }` +- Array `[ ... ]` +- String `"abc"` +- Boolean `true` +- Number `123.45` + +For a document like this: + +```json +{ + "account": { + "email": "alice@example.com", + "name": { + "first_name": "Alice", + "last_name": "McCrypto", + }, + "roles": [ + "admin", + "owner", + ] + } +} +``` + +Hashes would be produced from the following list of entries: + +```json +[ + [Obj, Key("account"), Obj, Key("email"), String("alice@example.com")], + [Obj, Key("account"), Obj, Key("name"), Obj, Key("first_name"), String("Alice")], + [Obj, Key("account"), Obj, Key("name"), Obj, Key("last_name"), String("McCrypto")], + [Obj, Key("account"), Obj, Key("roles"), Array, String("admin")], + [Obj, Key("account"), Obj, Key("roles"), Array, String("owner")], +] +``` + +Using the first entry to illustrate how an entry is converted to hashes: + +```json +[Obj, Key("account"), Obj, Key("email"), String("alice@example.com")] +``` + +The hashes would be generated for all prefixes of the full path to the leaf node. + +```json +[ + [Obj], + [Obj, Key("account")], + [Obj, Key("account"), Obj], + [Obj, Key("account"), Obj, Key("email")], + [Obj, Key("account"), Obj, Key("email"), String("alice@example.com")], + // (remaining leaf nodes omitted) +] +``` + +Query terms are processed in the same manner as the input document. + +A query prior to encrypting & indexing looks like a structurally similar subset of the encrypted document, for example: + +```json +{ "account": { "email": "alice@example.com", "roles": "admin" }} +``` + +The expression `cs_ste_vec_v1(encrypted_account) @> cs_ste_vec_v1($query)` would match all records where the `encrypted_account` column contains a JSONB object with an "account" key containing an object with an "email" key where the value is the string "alice@example.com". + +When reduced to a prefix list, it would look like this: + +```json +[ + [Obj], + [Obj, Key("account")], + [Obj, Key("account"), Obj], + [Obj, Key("account"), Obj, Key("email")], + [Obj, Key("account"), Obj, Key("email"), String("alice@example.com")] + [Obj, Key("account"), Obj, Key("roles")], + [Obj, Key("account"), Obj, Key("roles"), Array], + [Obj, Key("account"), Obj, Key("roles"), Array, String("admin")] +] +``` + +Which is then turned into an ste_vec of hashes which can be directly queries against the index. + #### cs_modify_index ```sql @@ -335,6 +468,15 @@ cs_ore_v1(val jsonb) Extracts an ore index from the `jsonb` value. Returns `null` if no ore index is present. +#### cs_ste_vec_v1 + +```sql +cs_ste_vec_v1(val jsonb) +``` + +Extracts an ste_vec index from the `jsonb` value. +Returns `null` if no ste_vec index is present. + ### Data Format Encrypted data is stored as `jsonb` with a specific schema: @@ -383,7 +525,8 @@ Cipherstash proxy handles the encoding, and EQL provides the functions. | c | Ciphertext | Ciphertext value. Encrypted by proxy. Required if kind is plaintext/pt or encrypting/et. | m.1 | Match index | Ciphertext index value. Encrypted by proxy. | o.1 | ORE index | Ciphertext index value. Encrypted by proxy. -| u.1 | Uniqueindex | Ciphertext index value. Encrypted by proxy. +| u.1 | Unique index | Ciphertext index value. Encrypted by proxy. +| sv.1 | STE vector index | Ciphertext index value. Encrypted by proxy. #### Helper packages diff --git a/WHYEQL.md b/WHYEQL.md index e015a031..d3bb0376 100644 --- a/WHYEQL.md +++ b/WHYEQL.md @@ -26,6 +26,7 @@ A variety of searchable encryption techniques are available, including: - **Matching** - Equality or partial matches - **Ordering** - comparison operations using order revealing encryption - **Uniqueness** - enforcing unique constraints +- **Containment** - containment queries using structured encryption ### What is encryption in use?