Merge pull request #1 from 4bakker/has

added _has as operator
botsquad · Jan 24, 2024 · 6d2b286 · 6d2b286
2 parents 66e9f24 + 38e0147
commit 6d2b286
Show file tree

Hide file tree

Showing 9 changed files with 418 additions and 353 deletions.
diff --git a/.github/workflows/elixir.yml b/.github/workflows/elixir.yml
@@ -2,9 +2,9 @@ name: Elixir CI
 
 on:
   push:
-    branches: [ master ]
+    branches: [master]
   pull_request:
-    branches: [ master ]
+    branches: [master]
 
 permissions:
   contents: read
@@ -18,24 +18,24 @@ jobs:
       MIX_ENV: test
     strategy:
       matrix:
-        otp: ["24.0"]
-        elixir: ["1.12"]
+        otp: ["26.0", "24.0"]
+        elixir: ["1.15", "1.14"]
 
     steps:
-    - uses: actions/checkout@v3
-    - uses: erlef/setup-elixir@v1
-      with:
-        otp-version: ${{matrix.otp}}
-        elixir-version: ${{matrix.elixir}}
-    - name: Restore dependencies cache
-      uses: actions/cache@v3
-      with:
-        path: deps
-        key: ${{ runner.os }}-mix-${{ hashFiles('**/mix.lock') }}
-        restore-keys: ${{ runner.os }}-mix-
-    - name: Install dependencies
-      run: mix deps.get
-    - name: Compile
-      run: mix compile --warnings-as-errors
-    - name: Run tests
-      run: mix test
+      - uses: actions/checkout@v3
+      - uses: erlef/setup-elixir@v1
+        with:
+          otp-version: ${{matrix.otp}}
+          elixir-version: ${{matrix.elixir}}
+      - name: Restore dependencies cache
+        uses: actions/cache@v3
+        with:
+          path: deps
+          key: ${{ runner.os }}-mix-${{ hashFiles('**/mix.lock') }}
+          restore-keys: ${{ runner.os }}-mix-
+      - name: Install dependencies
+        run: mix deps.get
+      - name: Compile
+        run: mix compile --warnings-as-errors
+      - name: Run tests
+        run: mix test
diff --git a/README.md b/README.md
@@ -1,28 +1,298 @@
 # MatchEngine
 
-[![Elixir CI](https://github.com/botsquad/match_engine/actions/workflows/elixir.yml/badge.svg)](https://github.com/botsquad/match_engine/actions/workflows/elixir.yml) [![Hex.pm](https://img.shields.io/hexpm/v/match_engine.svg)]()
+[![Build Status](https://github.com/botsquad/match_engine/workflows/elixir/badge.svg)](https://github.com/botsquad/match_engine)
+[![Module Version](https://img.shields.io/hexpm/v/match_engine.svg)](https://hex.pm/packages/match_engine)
+[![Hex Docs](https://img.shields.io/badge/hex-docs-lightgreen.svg)](https://hexdocs.pm/match_engine/)
+[![Total Download](https://img.shields.io/hexpm/dt/match_engine.svg)](https://hex.pm/packages/match_engine)
+[![License](https://img.shields.io/hexpm/l/match_engine.svg)](https://github.com/botsquad/match_engine/blob/master/LICENSE)
+[![Last Updated](https://img.shields.io/github/last-commit/botsquad/match_engine.svg)](https://github.com/botsquad/match_engine/commits/master)
 
-A query language for filtering and scoring of documents, inspired by
-the MongoDB query language and Solr. The query language consists of
-nested Elixir _keyword list_. Each component of the query consists of
-a _key_ part and a _value_ part. The key part is either a logic
-operator (and/or/not), or a reference to a field, the value part is
-either a plain value, or a value operator.
+<!-- MDOC !-->
+
+MatchEngine is an in-memory matching/filtering engine with
+MongoDB-like query syntax.
+
+## Introduction
+
+The query language consists of nested Elixir "keyword list". Each
+component of the query consists of a _key_ part and a _value_
+part. The key part is either a logic operator (and/or/not), or a
+reference to a field, the value part is either a plain value, or a
+value operator.
 
 When a query is run against a document, where each term is scored
-individually and then summed. (This implies "or").
+individually and then summed. (This implies "or"). Some example
+queries:
+
+```
+[title: "hoi"]
+[title: [_eq: "hoi"]]
+[_and: [name: "John", age: 36]]
+[_or: [name: "John", age: 36]]
+[_not: [title: "foo"]]
+```
 
-Example queries:
+Two ways of saying "Score all documents in which the title equals `"hoi"`":
 
 ```
 [title: "hoi"]
 [title: [_eq: "hoi"]]
-[_and: [name: "Arjan", age: 36]]
-[_or: [name: "Arjan", age: 36]]
+```
+
+Combining various matchers with logic operators:
+
+```
+[_and: [name: "John", age: 36]]
+[_or: [name: "John", age: 36]]
+[_not: [title: "foo"]]
+```
+
+Performing matches in nested objects is also possible; the query
+simply follows the shape of the data.
+
+Given a document consisting of a nested structure, `%{"user" => %{"name" => "John"}}`:
+
+"User name equals John":
+
+```
+[user: [name: "John"]]
+```
+
+"User name does not equal John":
+
+```
+[_not: [user: [name: "John"]]]
+```
+
+> Note that this is a different approach for nesting fields than MongoDB, which uses dot notation for field nesting.
+
+## Query execution
+
+The queries can be run by calling `MatchEngine.score_all/2` or `MatchEngine.filter_all/2`.
+
+Queries are first preprocessed, and then executed on a list of search
+"documents". A "document" is just a normal Elixir map, with string
+keys.
+
+The preprocessing phase compiles any regexes, checks whether all
+operators exist, and de-nests nested field structures.
+
+The query phase runs the preprocessed query for each document in the
+list, by calculating the score for the given document, given the
+query. When using filter_all/2, documents with a zero score are
+removed from the input list. When using score_all, the list is
+sorted on score, descending, and this score, including any
+additional metadata, is returned in a `"_match"` map inside the
+document.
+
+## Value operators
+
+_Value operators_ work on an individual field. Various operators can
+be used to calculate a score for a given field.
+
+### `_eq`
+
+Scores on the equality of the argument.
+
+```
+[title: "hello"]
+[title: [_eq: "hello"]]
+```
+
+### `_ne`
+
+Scores on the *in*equality of the argument. ("Not equals")
+
+```
+[title: [_ne: "hello"]]
+```
+
+### `_has`
+
+Scores when the document's value is a member of the given list.
+
+```
+[role: [_in: ["developer", "freelancer"]]]
+```
+
+### `_hasnt`
+
+Scores when the document's value contains a member of the given list or contains the given word or words
+
+```
+[tag: [_has: ["production"]]]
+[title: [_has: "The"]]
+[title: [_has: ["The", "title"]]]
+```
+
+### `_in`
+
+Scores when the document's value is a member of the given list.
+
+```
+[role: [_in: ["developer", "freelancer"]]]
+```
+
+### `_nin`
+
+Scores when the document's value is _not_ a member of the given list.
+
+```
+[role: [_nin: ["recruiter"]]]
+```
+
+### `_lt`, `_gt`, `_lte`, `_gte`
+
+Scores on using the comparison operators <, >, <= and >=.
+
+```
+[age: [_gt: 18]]
+```
+
+### `_sim`
+
+Normalized string similarity. The max of the Normalised Levenshtein
+distance and Jaro distance.
+
+### `_regex`
+
+Match a regular expression. The input is a string, which gets compiled
+into a regex. This operator scores on the length of match divided by
+the total string length. It is possible to add named captures to the
+regex, which then get added to the `_match` metadata map, as seen in the following exapmle:
+
+```
+# regex matches entire string, 100% score
+assert %{"score" => 1} == score([title: [_regex: "foo"]], %{"title" => "foo"})
+# regex matches with a capture called 'name'. It is boosted by weight.
+assert %{"score" => 1.6, "name" => "food"} == score([title: [_regex: "(?P<name>foo[dl])", w: 4]], %{"title" => "foodtrucks"})
+```
+
+The regex match can also be inversed, where the document value is
+treated as the regular expression, and the query input is treated as
+the string to be matched. (No captures are supported in this case).
+
+```
+assert %{"score" => 0.5} == score([title: [_regex: "foobar", inverse: true]], %{"title" => "foo"})
+```
+
+### `_geo`
+
+Calculate document score based on its geographical distance to a given
+point. The geo distance (both in the operator and in the document) can
+be given as:
+
+- A regular list, e.g. `[4.56, 52.33]`
+- A keyword list, e.g. `[lat: 52.33, lon: 4.56]`
+- A map with atom keys, e.g. `%{lat: 52.33, lon: 4.56}`
+- A map with string keys, e.g. `%{"lat" => 52.33, "lon" => 4.56}`
+
+The calculated `distance` is returned in meters, as part of the `_match` map.
+
+An extra argument, `max_distance` can be given to the operator which
+specifies the maximum cutoff point. It defaults to 100km. (100_000).
+Distance is scored logarithmically with respect to the maximum
+distance.
+
+```
+doc = %{"location" => %{"lat" => 52.340500999999996, "lon" => 4.8832816}}
+q = [location: [_geo: [lat: 52.340500999999996, lon: 4.8832816]]]
+assert %{"score" => 1, "distance" => 0.0} == score(q, doc)
+```
+
+When `radius` is given as an option, all geo points that are within
+the radius will score a 1 and the max_distance scoring will be in
+effect for distances larger than the radius.
+
+### `_geo_poly`
+
+Calculate document score based on its containment inside a given
+geographical polygon.
+
+Accepts a list of geographical coordinates, each in the same format
+as `_geo`.
+
+Like `_geo`, the `max_distance` option can be given to the operator
+which specifies the maximum cutoff point. It defaults to
+100km. (100_000). Distance is scored logarithmically with respect
+to the maximum distance.
+
+When the point is inside the polygon, the score is always 1. Only
+when the point is outside the polygon, the geographical distance
+from the document point to the closest point on the edge of the
+polygon is calculated and scored based on the `max_distance`
+setting.
+
+### `_time`
+
+Score by an UTC timestamp, relative to the given time.
+
+```
+t1 = "2018-02-19T15:29:53.672235Z"
+t2 = "2018-02-19T15:09:53.672235Z"
+assert %{"score" => s} = score([inserted_at: [_time: t1]], %{"inserted_at" => t2})
+```
+
+This way, documents can be returned in order of recency.
+
+## Logic operators
+
+### `_and`
+
+Combine matchers, multiplying the score. When one of the matchers
+returns 0, the total score is 0 as well.
+
+```
+[_and: [name: "John", age: 36]]
+```
+
+### `_or`
+
+Combine matchers, adding the scores.
+
+```
+[_or: [name: "John", id: 12]]
+```
+
+### `_not`
+
+Reverse the score of the nested matchers. (when score > 0, return 0, otherwise, return 1.
+
+```
 [_not: [title: "foo"]]
 ```
 
-Full documentation can be found at [https://hexdocs.pm/match_engine](https://hexdocs.pm/match_engine/MatchEngine.html).
+### Matcher weights
+
+`w: 10` can be added to a matcher term to boost its score by the given weight.
+
+```
+[title: [_eq: "Pete", w: 5], summary: [_sim: "hello", w: 2]]
+```
+
+`b: true` can be added to force a score of 1 when the score is > 0.
+
+```
+[title: [_sim: "hello", b: true]]
+```
+
+## Map syntax for queries
+
+Instead of keyword lists, queries can also be specified as maps. In
+this case, the keys of the map need to be strings. Query maps are
+meant to be used from user-generated input, and can be easily created from JSON files.
+
+```
+[_not: [title: "foo"]]
+# can also be written as:
+%{"_not" => %{"title" => "foo"}}
+
+[title: [_eq: "Pete", w: 5], summary: [_sim: "hello", w: 2]]
+# can also be written as:
+%{"title" => %{"_eq" => "Pete", "w" => 5}, "summary" => %{"_sim" => "hello", "w" => 2}}
+```
+
+<!-- MDOC !-->
 
 ## Installation