README.md

EQL provides specialized functions to interact with encrypted data:
- **`cs_match_v1(val JSONB)`**: Enables basic full-text search.
- **`cs_unique_v1(val JSONB)`**: Retrieves the unique index for enforcing uniqueness.
- **`cs_ore_v1(val JSONB)`**: Retrieves the Order-Revealing Encryption index for range queries.
- **`cs_ste_vec_v1(val JSONB)`**: Retrieves the Structured Encryption Vector for containment queries.
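
As a minimal sketch, these functions appear in queries like the following (the `users` table, `encrypted_email` column, and the proxy-encrypted `$1` parameter are hypothetical):

```sql
-- Equality check via the unique index (hypothetical schema)
SELECT * FROM users
WHERE cs_unique_v1(encrypted_email) = cs_unique_v1($1);

-- Basic full-text search via the match index
SELECT * FROM users
WHERE cs_match_v1(encrypted_email) @> cs_match_v1($1);
```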

### Index functions

#### cs_add_index

```sql
cs_add_index(table_name text, column_name text, index_name text, cast_as text, opts jsonb)
```

| Parameter | Description | Notes
| --------- | ----------- | -----
| table_name | Name of target table | Required
| column_name | Name of target column | Required
| index_name | The index kind | Required
| cast_as | The PostgreSQL type decrypted data will be cast to | Optional. Defaults to `text`
| opts | Index options | Optional for `match` indexes, required for `ste_vec` indexes (see below)
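
For example, a hedged sketch of adding a match index (the table and column names are hypothetical):

```sql
-- Add a match index with default options
SELECT cs_add_index('users', 'encrypted_email', 'match', 'text');
```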

##### cast_as

Supported types:
- big_int
- boolean
- date
- jsonb

##### match opts

The default Match index options are:

```json
{
  "tokenFilters": [
    {
      "kind": "downcase"
    }
  ],
  "tokenizer": {
    "kind": "ngram",
    "tokenLength": 3
  },
  "m": 2048,
  "k": 6
}
```

- `tokenFilters`: a list of filters to apply to normalise tokens before indexing.
- `tokenizer`: determines how input text is split into tokens.
- `m`: the size of the backing [bloom filter](https://en.wikipedia.org/wiki/Bloom_filter) in bits. Defaults to `2048`.
- `k`: the maximum number of bits set in the bloom filter per term. Defaults to `6`.

**Token Filters**

There are currently only two token filters available: `downcase` and `upcase`. These are used to normalise the text before indexing and are also applied to query terms. An empty array can also be passed to `tokenFilters` if no normalisation of terms is required.

**Tokenizer**

There are two `tokenizer`s provided: `standard` and `ngram`.
The `standard` tokenizer simply splits text into tokens using this regular expression: `/[ ,;:!]/`.
The `ngram` tokenizer splits the text into n-grams and accepts a configuration object that allows you to specify the `tokenLength`.
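
As a sketch, a match index configured with the `ngram` tokenizer might look like this (the `kind` discriminator and the exact JSON shape of the opts are assumptions, as are the table and column names):

```sql
-- Hedged sketch: ngram tokenizer with a token length of 3,
-- downcasing tokens before indexing
SELECT cs_add_index('users', 'encrypted_email', 'match', 'text',
  '{"tokenizer": {"kind": "ngram", "tokenLength": 3}, "tokenFilters": [{"kind": "downcase"}]}'
);
```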

**m** and **k**

`k` and `m` are optional fields for configuring [bloom filters](https://en.wikipedia.org/wiki/Bloom_filter) that back full text search.

`m` is the size of the bloom filter in bits. It must be a power of 2 between `32` and `65536` and defaults to `2048`.

`k` is the number of hash functions to use per term.
This determines the maximum number of bits that will be set in the bloom filter per term.
`k` must be an integer from `3` to `16` and defaults to `6`.

**Caveats around n-gram tokenization**

While using n-grams as a tokenization method allows greater flexibility when doing arbitrary substring matches, it is important to bear in mind the limitations of this approach.
Specifically, searching for strings _shorter_ than the `tokenLength` parameter will not _generally_ work.

If you're using the `ngram` tokenizer, a token that is already shorter than the `tokenLength` parameter will be kept as-is when indexed, so a search for that short token will match that record.
However, if that same short string only appears as part of a larger token, it will not match that record.
For example, with a `tokenLength` of 3, a record containing the standalone token `ca` would match a search for `ca`, but a record where `ca` appears only inside the larger token `cart` would not.
In general, therefore, you should try to ensure that the string you search for is at least as long as the `tokenLength` of the index, except in the specific case where you know that there are shorter tokens to match, _and_ you are explicitly OK with not returning records that have that short string as part of a larger token.

##### ste_vec opts

An ste_vec index on an encrypted JSONB column enables the use of PostgreSQL's `@>` and `<@` [containment operators](https://www.postgresql.org/docs/16/functions-json.html#FUNCTIONS-JSONB-OP-TABLE).
An ste_vec index requires one piece of configuration: the `prefix` (a string), which is functionally similar to a salt for the hashing process.

Within a dataset, encrypted columns indexed using an ste_vec with different prefixes can never compare as equal.
Containment queries that mix index terms from multiple columns will never return a positive result.
This is by design.
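
A hedged sketch of configuring an ste_vec index with a prefix (the table, column, and prefix value are hypothetical, and the exact opts shape is an assumption):

```sql
-- The prefix acts like a salt, scoping index terms to this column
SELECT cs_add_index('users', 'encrypted_account', 'ste_vec', 'jsonb',
  '{"prefix": "users/encrypted_account"}'
);
```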

The index is generated from a JSONB document by first flattening the structure of the document such that a hash can be generated for each unique path prefix to a node.

The indexer supports the complete set of JSON types, with the exception of null values, which are ignored:

- Object `{ ... }`
- Array `[ ... ]`
- String `"abc"`
- Boolean `true`
- Number `123.45`

For a document like this:

```json
{
  "account": {
    "email": "alice@example.com",
    "name": {
      "first_name": "Alice",
      "last_name": "McCrypto"
    },
    "roles": [
      "admin",
      "owner"
    ]
  }
}
```

Hashes would be produced from the following list of entries:

```json
[
  [Obj, Key("account"), Obj, Key("email"), String("alice@example.com")],
  [Obj, Key("account"), Obj, Key("name"), Obj, Key("first_name"), String("Alice")],
  [Obj, Key("account"), Obj, Key("name"), Obj, Key("last_name"), String("McCrypto")],
  [Obj, Key("account"), Obj, Key("roles"), Array, String("admin")],
  [Obj, Key("account"), Obj, Key("roles"), Array, String("owner")]
]
```

Using the first entry to illustrate how an entry is converted to hashes:

```json
[Obj, Key("account"), Obj, Key("email"), String("alice@example.com")]
```

Hashes would be generated for all prefixes of the full path to the leaf node:

```json
[
  [Obj],
  [Obj, Key("account")],
  [Obj, Key("account"), Obj],
  [Obj, Key("account"), Obj, Key("email")],
  [Obj, Key("account"), Obj, Key("email"), String("alice@example.com")]
  // (remaining leaf nodes omitted)
]
```

Query terms are processed in the same manner as the input document.

A query prior to encrypting & indexing looks like a structurally similar subset of the encrypted document, for example:

```json
{ "account": { "email": "alice@example.com", "roles": "admin" }}
```

The expression `cs_ste_vec_v1(encrypted_account) @> cs_ste_vec_v1($query)` would match all records where the `encrypted_account` column contains a JSONB object with an "account" key containing an object with an "email" key where the value is the string "alice@example.com", and a "roles" key whose value contains the string "admin".
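
In a full query this might look like the following sketch (the `users` table is hypothetical; `$1` is the query term, encrypted by the proxy):

```sql
SELECT * FROM users
WHERE cs_ste_vec_v1(encrypted_account) @> cs_ste_vec_v1($1);
```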

When reduced to a prefix list, it would look like this:

```json
[
  [Obj],
  [Obj, Key("account")],
  [Obj, Key("account"), Obj],
  [Obj, Key("account"), Obj, Key("email")],
  [Obj, Key("account"), Obj, Key("email"), String("alice@example.com")],
  [Obj, Key("account"), Obj, Key("roles")],
  [Obj, Key("account"), Obj, Key("roles"), Array],
  [Obj, Key("account"), Obj, Key("roles"), Array, String("admin")]
]
```

This is then turned into an ste_vec of hashes, which can be queried directly against the index.

#### cs_modify_index

```sql
cs_modify_index(table_name text, column_name text, index_name text, cast_as text, opts jsonb)
```

#### cs_ore_v1

```sql
cs_ore_v1(val jsonb)
```
Extracts an ore index from the `jsonb` value.
Returns `null` if no ore index is present.
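
As a sketch, the extracted ore index supports range queries via comparison operators (hypothetical table and column; `$1` is the proxy-encrypted bound):

```sql
-- Find records with an encrypted date before the supplied bound
SELECT * FROM users
WHERE cs_ore_v1(encrypted_created_at) < cs_ore_v1($1);
```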

#### cs_ste_vec_v1

```sql
cs_ste_vec_v1(val jsonb)
```

Extracts an ste_vec index from the `jsonb` value.
Returns `null` if no ste_vec index is present.

### Data Format

Encrypted data is stored as `jsonb` with a specific schema.

CipherStash Proxy handles the encoding, and EQL provides the functions.
| Field | Name | Description
| ----- | ---- | -----------
| c | Ciphertext | Ciphertext value. Encrypted by proxy. Required if kind is plaintext/pt or encrypting/et.
| m.1 | Match index | Ciphertext index value. Encrypted by proxy.
| o.1 | ORE index | Ciphertext index value. Encrypted by proxy.
| u.1 | Unique index | Ciphertext index value. Encrypted by proxy.
| sv.1 | STE vector index | Ciphertext index value. Encrypted by proxy.

#### Helper packages

WHYEQL.md

A variety of searchable encryption techniques are available, including:
- **Matching** - equality or partial matches
- **Ordering** - comparison operations using order-revealing encryption
- **Uniqueness** - enforcing unique constraints
- **Containment** - containment queries using structured encryption

### What is encryption in use?
