![arangodb](https://github.com/arangodb/interactive_tutorials/blob/master/notebooks/img/ArangoDB_logo.png?raw=1)

# Fuzzy Search 

<a href="https://colab.research.google.com/github/arangodb/interactive_tutorials/blob/master/notebooks/FuzzySearch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[ArangoSearch](https://www.arangodb.com/why-arangodb/full-text-search-engine-arangosearch/) provides information retrieval features, natively integrated into ArangoDB’s query language and with support for all data models. It is primarily a full-text search engine, a much more powerful alternative to the full-text index type.
Check this [ArangoSearch notebook](https://colab.research.google.com/github/arangodb/interactive_tutorials/blob/master/notebooks/ArangoSearch.ipynb) for an introduction to ArangoSearch.

When dealing with real-world text retrieval, we often care not only about exact matches to our search phrase but also about near matches such as typos or alternative spellings.
“Fuzzy search” is an umbrella term referring to a set of algorithms for such approximate matching. Usually such algorithms evaluate some similarity measure showing how close a search term is to the items in a dictionary. Then a search engine can make a decision on which results have to be shown first.
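
To make the idea of "evaluate a similarity measure, then rank" concrete, here is a minimal sketch in plain Python. It uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure; this is not one of the measures ArangoSearch uses, and the function name `rank_by_similarity` and the small term list are made up for illustration.

```python
from difflib import SequenceMatcher


def rank_by_similarity(query, dictionary):
    """Score every dictionary term against the query and sort best-first."""
    scored = [
        (term, SequenceMatcher(None, query, term).ratio())
        for term in dictionary
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# "galxy" is a misspelling of "galaxy"; the intended term should rank first.
ranked = rank_by_similarity("galxy", ["gallery", "star", "galaxy", "galley"])
for term, score in ranked:
    print(f"{term:10s} {score:.3f}")
```

A search engine does essentially the same thing at scale: it scores candidate terms from its index and shows the closest ones first.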

If you'd like to learn the theoretical aspects of how fuzzy search works under the hood, be sure to read [this article](https://www.arangodb.com/2020/07/deep-and-fuzzy-dive-into-search/).

In this notebook we will explore the different implementations of fuzzy search in [ArangoSearch](https://www.arangodb.com/why-arangodb/full-text-search-engine-arangosearch/):
* [N-Gram Similarity](https://www.arangodb.com/docs/stable/aql/functions-string.html#ngram_similarity)
* [N-Gram Positional Similarity](https://www.arangodb.com/docs/stable/aql/functions-string.html#ngram_positional_similarity)
* [N-Gram Match](https://www.arangodb.com/docs/stable/aql/functions-arangosearch.html#ngram_match)
* [Levenshtein Distance](https://www.arangodb.com/docs/stable/aql/functions-string.html#levenshtein_distance)
* [Levenshtein Match](https://www.arangodb.com/docs/stable/aql/functions-arangosearch.html#levenshtein_match)

# Setup 

Before getting started with ArangoSearch, we need to prepare our environment and create a temporary database on Oasis, ArangoDB's managed service.

In [None]:
%%capture
!git clone -b oasis_connector --single-branch https://github.com/arangodb/interactive_tutorials.git
!rsync -av interactive_tutorials/ ./ --exclude=.git
!chmod -R 755 ./tools
!git clone -b imdb_no_ratings --single-branch https://github.com/arangodb/interactive_tutorials.git imdb_no_ratings
!rsync -av imdb_no_ratings/data ./data
!pip3 install pyarango
!pip3 install "python-arango>=5.0"

In [None]:
import json
import requests
import sys
import oasis
import time
import textwrap

from pyArango.connection import *
from arango import ArangoClient

Create the temporary database:

In [None]:
# Retrieve tmp credentials from ArangoDB Tutorial Service
login = oasis.getTempCredentials(tutorialName="FuzzyArangoSearch", credentialProvider="https://tutorials.arangodb.cloud:8529/_db/_system/tutorialDB/tutorialDB")

# Connect to the temp database
# Please note that we use the python-arango driver as it has better support for ArangoSearch 
database = oasis.connect_python_arango(login)

In [None]:
print("https://"+login["hostname"]+":"+str(login["port"]))
print("Username: " + login["username"])
print("Password: " + login["password"])
print("Database: " + login["dbName"])

Feel free to use the above URL to check out the ArangoDB WebUI!

##  IMDB Example Dataset

![imdb](https://github.com/arangodb/interactive_tutorials/blob/master/notebooks/img/IMDB_graph.png?raw=1)

Last, but not least we will import the [IMDB Example Dataset](https://github.com/arangodb/example-datasets/tree/master/Graphs/IMDB) including information about various movies, actors, directors, ... as a graph. 
*Note: the included arangorestore only works on Linux or Windows systems. If you want to run this notebook on a different OS, please use the appropriate arangorestore from the [Download area](https://www.arangodb.com/download-major/). For more information on how to use the ArangoDB client tools, [see the documentation](https://www.arangodb.com/docs/stable/programs.html).*

## Linux:

In [None]:
! ./tools/arangorestore -c none --server.endpoint http+ssl://{login["hostname"]}:{login["port"]} --server.username {login["username"]} --server.database {login["dbName"]} --server.password {login["password"]} --default-replication-factor 3  --input-directory "data/data/imdb"

# Create First View

An ArangoSearch view contains references to documents stored in different collections. 
This makes it possible to perform complex federated searches, even over a complete graph including vertex and edge collections.

In [None]:
# Create an ArangoSearch view.
database.create_arangosearch_view(
    name='v_imdb'
)

Let us check it is actually there:

In [None]:
print(database["v_imdb"])

Next, we will create a [custom analyzer](https://www.arangodb.com/docs/stable/arangosearch-analyzers.html) to preprocess the values.
Note that, in order to support n-gram similarity the analyzer must have:
* At least the "position" and "frequency" features enabled
* The same min and max values
* preserveOriginal set to False

In [None]:
# Delete in case analyzer existed before
database.delete_analyzer('fuzzy_search_bigram', ignore_missing=True)

database.create_analyzer(
        name='fuzzy_search_bigram',
        analyzer_type='ngram',
        properties={  
        "min": 2,  
        "max": 2,  
        "preserveOriginal": False 
        }, 
        features=["position", "frequency", "norm"] 
    )
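
The three requirements listed above can be expressed as a small pre-flight check. This is only a sketch: `check_ngram_analyzer` is a hypothetical helper written for this notebook, not part of python-arango or ArangoDB, and it only inspects the dictionaries you are about to pass to `create_analyzer`.

```python
def check_ngram_analyzer(properties, features):
    """Return a list of problems that would prevent n-gram similarity matching."""
    problems = []
    # "position" and "frequency" must both be enabled.
    if not {"position", "frequency"} <= set(features):
        problems.append('features must include "position" and "frequency"')
    # min and max must be present and identical.
    if "min" not in properties or properties.get("min") != properties.get("max"):
        problems.append('"min" and "max" must be set to the same value')
    # preserveOriginal must be explicitly False.
    if properties.get("preserveOriginal") is not False:
        problems.append('"preserveOriginal" must be False')
    return problems


# The configuration used for fuzzy_search_bigram passes all three checks.
print(check_ngram_analyzer(
    {"min": 2, "max": 2, "preserveOriginal": False},
    ["position", "frequency", "norm"],
))
```

An empty list means the configuration satisfies the requirements; otherwise each entry names one violated rule.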

Next, we need to link the view and our custom analyzer:

In [None]:
link = { 
  "includeAllFields": True,
  "fields" : { 
      "title" : { "analyzers" : [ "fuzzy_search_bigram"] },
      "description" : { "analyzers" : [ "fuzzy_search_bigram"] }
      }
}


database.update_arangosearch_view(
    name='v_imdb',
    properties={'links': { 'imdb_vertices': link }}
)

As the indexing might take a few seconds, let us have a brief look at what is actually going on.

When you link a collection to an ArangoSearch View, you can choose which individual fields to link or specify to link all fields. It might be helpful to think about linking fields in the same way you think about indexing attributes, although not exactly the same. When you link data to a view it is indexed in a way that allows for quick retrieval. This process also stores the data in a way that allows for the ArangoSearch-specific AQL functions to perform unique queries such as tokenizing, stemming, removing stop words, and as we will see in this notebook complex matching functions.

An additional benefit and a difference to typical indexing is that you are able to link multiple collections to one view and apply the desired analyzers. The image below shows how the collections are linked, analyzed and then made available via the view. When performing queries you can use all the typical AQL functions against a view, the same way that you would with a collection name. Though, the real benefit comes when using ArangoSearch-specific functions and you start taking advantage of features such as ranking.

![ArangoSearch](https://github.com/arangodb/interactive_tutorials/blob/master/notebooks/img/ArangoSearch_Arch.jpg?raw=1)

By now our view should be ready, so let us issue the first test query and look for short Comedy Movies.

In [None]:
cursor = database.aql.execute(
  """
  FOR d IN v_imdb 
    SEARCH d.type == "Movie" 
    AND 
    d.genre == "Comedy" 
    AND 
    d.runtime IN 10..50 
    RETURN d.title
  """
)
# Iterate through the result cursor
print('\033[4mMovie Titles\033[0m ')

for doc in cursor:
  print(doc)

If we set up everything correctly there should be 13 results, containing comedies that are less than 50 minutes long, such as:
 * Finders Keepers
 * Stuart Dee
 * The Pawnshop
 * Robot Wrecks

Now that we have finished the setup, let's move on to the functions that make up fuzzy search. 
As mentioned at the beginning of this notebook, fuzzy search comes in the form of various [N-Gram Similarity](https://www.arangodb.com/docs/stable/aql/functions-string.html#ngram_similarity) and [Levenshtein distance](https://www.arangodb.com/docs/stable/aql/functions-string.html#levenshtein_distance) AQL functions. 

# N-Gram Similarity

`NGRAM_SIMILARITY(input, target, ngramSize) → similarity`

N-gram similarity is a measure for the difference between two strings represented by counting how long the longest sequence of matching n-grams is, divided by target’s total n-gram count. To better understand this concept let's start with a simple example. The below query compares the phrase `quick fox` to the similar phrase of `quick foxx` (additional `x`). These are similar phrases and as such, they should have a high n-gram similarity. 

```
Case Conversion Note:
N-gram analyzers do not currently support case conversion. 
Matching the case of the input to the case of the stored terms gives the best results. NGRAM_MATCH still works without it, but will be less accurate.
```

Go ahead and execute the query below:

In [None]:
cursor = database.aql.execute(
"""
RETURN NGRAM_SIMILARITY(
"quick fox",
"quick foxx", 
2)"""
)
# Iterate through the result cursor
for doc in cursor:
  print('\033[4mNGRAM_SIMILARITY\033[0m ')
  print(doc)


With an n-gram size of 2, the n-gram similarity between both strings is 0.888. The closer the similarity is to 1, the more similar the strings are. Feel free to experiment with other combinations such as `NGRAM_SIMILARITY( "same string","same string", 2)` or vary the ngramSize.

N-gram functions such as this split each string into n-grams of the supplied size, 2 in our query above. This means that the function compares the two phrases broken up into their 2-character n-grams (note that the space counts as a character):

  ```
  quick fox         --         quick foxx
  ----------------------------------------
  qu                --         qu (match)
  ui                --         ui (match)
  ic                --         ic (match)
  ck                --         ck (match)
  "k "              --         "k " (match)
  " f"              --         " f" (match)
  fo                --         fo (match)
  ox                --         ox (match)
  (none)            --         xx (no match)
  ```
If we use simple math here, 8 of the 9 n-grams in the target have a match, which is roughly 89% when an extra `x` is supplied. N-gram similarity and distance are not always as simple as this, but hopefully this provides a quick intro to the basic concept of n-gram matching and similarity. If you would like to take a deep dive into this topic, a paper published by [Grzegorz Kondrak at the University of Alberta](https://webdocs.cs.ualberta.ca/~kondrak/papers/spire05.pdf) is a great resource.
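
The description of n-gram similarity given above (longest sequence of matching n-grams, divided by the target's n-gram count) can be sketched in a few lines of plain Python. This is a simplified illustration under that stated definition, not ArangoDB's implementation; the names `ngrams`, `lcs_length` and `ngram_similarity_sketch` are made up for this example.

```python
def ngrams(text, n):
    """Split text into overlapping n-grams; spaces count as characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def ngram_similarity_sketch(input_text, target, n):
    """Longest run of matching n-grams divided by the target's n-gram count."""
    target_grams = ngrams(target, n)
    if not target_grams:
        return 0.0
    return lcs_length(ngrams(input_text, n), target_grams) / len(target_grams)


print(ngrams("quick fox", 2))
score = ngram_similarity_sketch("quick fox", "quick foxx", 2)
print(round(score, 3))  # 0.889, i.e. 8 matching bigrams out of 9
```

For this pair of phrases the sketch gives 8/9, in line with the 0.888 returned by the query. Results for other inputs may differ from what ArangoDB returns.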

### N-Gram Positional Similarity
`NGRAM_POSITIONAL_SIMILARITY(input, target, ngramSize) → similarity`

While [NGRAM_SIMILARITY()](https://www.arangodb.com/docs/stable/aql/functions-string.html#ngram_similarity) only counts fully matching n-grams, [NGRAM_POSITIONAL_SIMILARITY()](https://www.arangodb.com/docs/stable/aql/functions-string.html#ngram_positional_similarity) also considers partially matching ones. Let us look at how that affects the returned scores. 
In this first example we compare the `NGRAM_SIMILARITY` and `NGRAM_POSITIONAL_SIMILARITY` scores using the same two phrases as in our previous example, this time with an n-gram size of 3. These phrases are so similar that counting partial matches doesn't make any difference, so we get the same scores.

In [None]:
cursor = database.aql.execute(
"""
RETURN
{"NGRAM_SIMILARITY" : NGRAM_SIMILARITY(
"quick fox",
"quick foxx", 
3),
"NGRAM_POSITIONAL_SIMILARITY" : NGRAM_POSITIONAL_SIMILARITY(
"quick fox",
"quick foxx", 
3)}"""
)
# Iterate through the result cursor
for doc in cursor:
  print('\033[4mNGRAM_SIMILARITY\033[0m ', '\033[4mNGRAM_POSITIONAL_SIMILARITY\033[0m '.rjust(44))
  print(doc['NGRAM_SIMILARITY'], str(doc['NGRAM_POSITIONAL_SIMILARITY']).rjust(25))

If we start to change a few more letters in the phrases, the difference between the two functions becomes clearer. The score for `NGRAM_POSITIONAL_SIMILARITY` is nearly double that of `NGRAM_SIMILARITY` because it also counts the partial matches. This provides us with some additional 'fuzziness' by making the matching requirement a bit more lenient.

In [None]:
cursor = database.aql.execute(
"""
RETURN
{"NGRAM_SIMILARITY" : NGRAM_SIMILARITY(
"quick fox",
"quirky foxx", 
3),
"NGRAM_POSITIONAL_SIMILARITY" : NGRAM_POSITIONAL_SIMILARITY(
"quick fox",
"quirky foxx", 
3)}"""
)
# Iterate through the result cursor
for doc in cursor:
  print('\033[4mNGRAM_SIMILARITY\033[0m ', '\033[4mNGRAM_POSITIONAL_SIMILARITY\033[0m '.rjust(44))
  print(doc['NGRAM_SIMILARITY'], str(doc['NGRAM_POSITIONAL_SIMILARITY']).rjust(25))

Depending on your requirements, the decision to count partially matching n-grams adds some 'fuzziness' that may help provide some context to your searches.

[NGRAM_SIMILARITY](https://www.arangodb.com/docs/stable/aql/functions-string.html#ngram_similarity) and [NGRAM_POSITIONAL_SIMILARITY](https://www.arangodb.com/docs/stable/aql/functions-string.html#ngram_positional_similarity) are two new functions that come with ArangoDB 3.7 and can be used to improve text searches but have a drawback of not being able to utilize the indexing benefits of views. They are still very powerful string functions and can offer a lot of functionality for text queries.

<br>

## N-Gram Match

`NGRAM_MATCH(path, target, threshold, analyzer) -> bool`

However, [NGRAM_MATCH](https://www.arangodb.com/docs/stable/aql/functions-arangosearch.html#ngram_match) is able to use the indexing of ArangoSearch views and is what we will look at next.

Let us start by using the [NGRAM_MATCH](https://www.arangodb.com/docs/stable/aql/functions-arangosearch.html#ngram_match) function to find a movie using a phrase supplied by the user. 

The main takeaway from this example is that this exact phrase does not exist in the description, and the search terms contain quite a few typos as well. Nevertheless, thanks to the ngram analyzer and the NGRAM_MATCH function, the search is able to find the movie based on the relevance of the words supplied. This is a good start, and at the end of the notebook we will see how combining this with Levenshtein can really help to round out your fuzzy searches.

In [None]:
cursor = database.aql.execute(
"""
FOR d IN v_imdb 
  SEARCH NGRAM_MATCH(
    d.description, 
    'rodo Same goo to Moardoor', 
    0.6, 
    'fuzzy_search_bigram'
    )
  LET score = BM25(d)
  SORT score DESC
  RETURN { 
    Title:d.title, 
    Description:d.description, 
    Score:score 
    }
"""
)
# Iterate through the result cursor
for doc in cursor:
  print('\033[4mTitle: ' + doc['Title'] + '\033[0m')
  print('\033[4mDescription:\033[0m ',textwrap.fill(doc['Description'], 90))
  print('\033[4mScore:\033[0m ',str(doc['Score']))
  print(' ')

The `NGRAM_MATCH` syntax follows the typical ArangoSearch function structure. You first supply the field you would like to search, then the search term(s), then the threshold, which must be between `0.0` and `1.0`, and finally the analyzer to apply to the search terms. The threshold of `0.6` is the new addition: it determines how much ‘fuzziness’ we still consider to be a match.
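
To experiment with different phrases and thresholds without editing the query string, the values can be passed as AQL bind variables. The sketch below assumes that `NGRAM_MATCH` accepts bind variables for its target and threshold arguments; `NGRAM_MATCH_QUERY` and `ngram_match_bind_vars` are names invented for this example, and the range check simply mirrors the `0.0` to `1.0` rule described above.

```python
NGRAM_MATCH_QUERY = """
FOR d IN v_imdb
  SEARCH NGRAM_MATCH(d.description, @phrase, @threshold, 'fuzzy_search_bigram')
  SORT BM25(d) DESC
  LIMIT @limit
  RETURN { Title: d.title, Score: BM25(d) }
"""


def ngram_match_bind_vars(phrase, threshold=0.6, limit=5):
    """Build the bind variables, rejecting thresholds outside 0.0..1.0."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return {"phrase": phrase, "threshold": threshold, "limit": limit}


bind_vars = ngram_match_bind_vars("rodo Same goo to Moardoor", threshold=0.6)
print(bind_vars)

# With the connection created earlier in this notebook, the query could be run as:
# cursor = database.aql.execute(NGRAM_MATCH_QUERY, bind_vars=bind_vars)
```

Lowering the threshold admits more loosely matching descriptions, while raising it towards `1.0` requires closer agreement with the search phrase.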

The similarity is calculated by counting how long the longest sequence of matching n-grams is, divided by the target’s total ngram count. Only fully matching n-grams are counted.

The analyzer we used was configured with a min and max of 2, which means it looks at words 2 letters at a time. This is useful for determining the longest common sequence and context. The idea behind n-gram matching is searching for similar words, but not necessarily exact matches. One of the simplest ways of calculating similarity between two words is calculating the longest common sequence (LCS) of letters. The longer the LCS is the more similar the words are. However, this approach has one big disadvantage – absence of context. For example, words `connection` and `fonetica` have a long LCS (o-n-e-t-i) but very different meanings. To add some context, ngram sequences are used.

Each word is split into a series of letter groups and these groups are then matched. If we use the same words but calculate similarity based on 3-grams (an ngram analyzer with min and max of 3), we get a better similarity measure: con-onn-nne-nec-ect-cti-tio-ion vs. fon-one-net-eti-tic-ica gives a much shorter LCS (zero matches). To remove the effect of length differences, we normalize the LCS length by the word length. The result is a rating between 0 (no match at all) and 1 (fully matched). 
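
The `connection` / `fonetica` comparison can be reproduced with a short, self-contained Python sketch. The helper names are made up for illustration and the code only demonstrates the idea of comparing letters versus 3-grams; it is not how ArangoSearch evaluates matches internally.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def ngrams(word, n):
    """Split a word into overlapping n-grams."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


a, b = "connection", "fonetica"

# Letter by letter, the words share the sequence o-n-e-t-i.
char_lcs = lcs_length(a, b)

# Compared as 3-grams, no group of three letters is shared.
trigram_lcs = lcs_length(ngrams(a, 3), ngrams(b, 3))

print("letters:", char_lcs)     # 5
print("3-grams:", trigram_lcs)  # 0
```

The drop from 5 shared letters to 0 shared 3-grams is the "context" that n-grams add: letters only count when their neighbours match as well.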

Increasing the ngram size is not always the best choice, because it also increases the accuracy requirement of the search. Scores would be much lower for the search above if we had chosen an ngram size of 3. We would need to lower our threshold, which can result in less relevant results being returned.


# Levenshtein

ArangoDB comes with two forms of the Levenshtein matching algorithm, [LEVENSHTEIN_DISTANCE](https://www.arangodb.com/docs/stable/aql/functions-string.html#levenshtein_distance) and [LEVENSHTEIN_MATCH](https://www.arangodb.com/docs/stable/aql/functions-arangosearch.html#levenshtein_match). These AQL functions provide two similar approaches for adding 'fuzziness' to your AQL queries. While the functions are similar, there are some important differences, which we will discuss in this section alongside some examples. 

### Levenshtein Distance
`LEVENSHTEIN_DISTANCE(value1, value2) → levenshteinDistance`

[Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) is another measure for the difference between two strings, represented by the minimum number of single-character transformations required to move from one string to the other. Let us consider a concrete example:

In [None]:
cursor = database.aql.execute(
"""
RETURN LEVENSHTEIN_DISTANCE(
"The quick brown fox jumps over the lazy dog", 
"The quick black dog jumps over the brown fox")"""
)
# Iterate through the result cursor
for doc in cursor:
  print("Edit Distance Transformations Required: ", doc)




Here we need a minimum of 13 transformations to move from one string to the other. 
Feel free to find a minimum sequence for this transformation or experiment with other combinations such as `LEVENSHTEIN_DISTANCE("ab", "ba")`. Once the distance has been calculated, it can be used in other parts of your application logic and even within the same query. 

This functionality provides some added control over your text analysis by handling what to do with the distance measure once you have it. However, in most situations you may prefer to find the relevance, distance, and determine if the keywords or phrases match some user supplied input, with one function or statement. This functionality is where Levenshtein Match comes in and is what we will review next.

Before we go and as a nice transition to looking at Levenshtein Match, here are some key differences:
* Distance is considered a string function, not tied to ArangoSearch
* Distance does not take advantage of ArangoSearch indexing
* Distance uses Damerau and treats transpositions of 2 adjacent characters atomically
* Match does not use Damerau by default, but can be optionally enabled
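
To see what the Damerau option changes, here is a minimal edit-distance sketch in plain Python. It implements the textbook dynamic-programming algorithm with an optional adjacent-transposition step (the "optimal string alignment" variant); it is an illustration only, not ArangoDB's implementation, and the function name `edit_distance` is made up for this example.

```python
def edit_distance(a, b, transpositions=False):
    """Minimum number of single-character edits to turn a into b.

    With transpositions=True, swapping two adjacent characters
    counts as one edit instead of two.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all characters of a
    for j in range(cols):
        d[0][j] = j  # insert all characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]


print(edit_distance("ab", "ba"))                       # 2
print(edit_distance("ab", "ba", transpositions=True))  # 1
print(edit_distance("galxy", "galaxy"))                # 1
```

Without transpositions, swapping two neighbouring letters costs two edits; with them enabled it costs one, which makes common typing slips cheaper to match.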

### Levenshtein Match


`LEVENSHTEIN_MATCH(path, target, distance, transpositions, maxTerms) -> bool`

`LEVENSHTEIN_MATCH` matches documents where the Levenshtein distance between the stored attribute value and the supplied search value is lower than or equal to `distance`. This takes the power of the above Levenshtein distance function and combines it with filtering and relevance matching.

```
Analyzer Note:
For our LEVENSHTEIN_MATCH examples we will use a text analyzer, instead of ngram, that has stemming disabled.
Stemming is disabled as a convenience to avoid terms not matching due to a stemmed word, e.g. galaxy becomes galaxi when stemmed.
```
The following code block:
* Creates the analyzer named `en_tokenizer`
* Updates our link definition object
* Updates the `v_imdb` view definition


In [None]:
# Delete in case analyzer existed before
database.delete_analyzer('en_tokenizer', ignore_missing=True)

# Create a new english text analyzer to tokenize our text
database.create_analyzer(
        name='en_tokenizer',
        analyzer_type='text',
        properties={
            'locale': 'en',
            'stemming': False
        }, 
        features=["position","norm", "frequency"] 
    )

# Update the link definition object
link = { 
  "includeAllFields": True,
  "fields" : { 
      "title" : { "analyzers" : [ "fuzzy_search_bigram", "en_tokenizer"] },
      "description" : { "analyzers" : [ "fuzzy_search_bigram", "en_tokenizer"] }
      }
}

# Update the ArangoSearch view with the new link definition
database.update_arangosearch_view(
    name='v_imdb',
    properties={'links': { 'imdb_vertices': link }}
)


To continue exploring how Levenshtein Match can leverage edit distance scoring with term matching functionality, run the query below:

In [None]:
# Execute the query
cursor = database.aql.execute(
  """
  FOR d IN v_imdb
    SEARCH ANALYZER(LEVENSHTEIN_MATCH(
      d.title, 
      'galxy', 
      2,
      true,
      3
      ), 
    "en_tokenizer")
    SORT BM25(d) DESC 
    LIMIT 10
    RETURN {
      "Title": d.title,
      "Score": BM25(d)
      }
      """)
# Iterate through the result cursor
for doc in cursor:
  print('Title: ', doc['Title'], )
  print('Score: ', str(doc['Score']))
  print(' ')

The above query is an example of a user searching for a movie title with a search term that contains a typo. The user intended to type `galaxy` but accidentally left out an `a`, an easy mistake to make. Thanks to the [LEVENSHTEIN_MATCH](https://www.arangodb.com/docs/stable/aql/functions-arangosearch.html#levenshtein_match) function we have accommodated this very common scenario. 

Inserting the missing `a` is a single edit, which is within the maximum distance of `2` supplied in this query, so titles containing `galaxy` are matched even though the search term is misspelled.

The `3` supplied here is optional and specifies the maximum number of terms, such as `galaxy`, to take into account. The higher this number is, the more results you are likely to get, which makes sorting by relevance very important.

### Levenshtein Match + Phrase Search
In practice it will be common to combine [LEVENSHTEIN_MATCH](https://www.arangodb.com/docs/stable/aql/functions-arangosearch.html#levenshtein_match) with other ArangoSearch AQL functions, [PHRASE](https://www.arangodb.com/docs/stable/aql/functions-arangosearch.html#phrase) being a likely choice.
The Phrase function honors the position of the search term, where Levenshtein Match just looks for the word to exist and evaluates it based on relevance to the term.

This combination is so common that Levenshtein Match comes with a second syntax style that works perfectly with PHRASE. The array syntax variant shown in the query below allows for omitting the initial path argument, as it is already supplied by the PHRASE function. This combination gives us the best of both worlds, precise control with the flexibility of fuzzy search.

This query looks for movie titles that contain the word `star` immediately followed by a second word that [LEVENSHTEIN_MATCH](https://www.arangodb.com/docs/stable/aql/functions-arangosearch.html#levenshtein_match) matches against `wr`, with Damerau transpositions set to true.

In [None]:
cursor = database.aql.execute(
    """
FOR d IN v_imdb
  SEARCH PHRASE(d.title, [ 'star', { LEVENSHTEIN_MATCH : ['wr', 2, true] } ], "en_tokenizer")
    SORT BM25(d) DESC 
    RETURN d.title
    """)
for doc in cursor:
  print(doc)

You can continue using multiple [LEVENSHTEIN_MATCH](https://www.arangodb.com/docs/stable/aql/functions-arangosearch.html#levenshtein_match) functions even in a single Phrase statement. This makes it possible to search for phrases where every word possibly has a typo, along with the additional [PHRASE options](https://www.arangodb.com/docs/stable/aql/functions-arangosearch.html#phrase) such as skipTokens.

In [None]:
cursor = database.aql.execute(
"""
LET phraseStructure = [ 
  { LEVENSHTEIN_MATCH : ['lrd', 1, true] }, 
  2, // offset between adjacent phrase parts
  { LEVENSHTEIN_MATCH : ['rng', 2, true] }
]
FOR d IN v_imdb
  SEARCH PHRASE(d.title, phraseStructure, "en_tokenizer")
    SORT BM25(d) DESC 
    RETURN d.title
""")
for doc in cursor:
  print(doc)
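
When the search words come from user input, the phrase structure can be built in Python rather than written by hand. The sketch below is an assumption-laden illustration: `fuzzy_phrase` is a hypothetical helper, it applies one maximum distance to every word (the query above uses 1 and 2), and it assumes the resulting array can be passed to `PHRASE` as a bind variable.

```python
def fuzzy_phrase(words, max_distance=1, transpositions=True, skip_tokens=0):
    """Build a PHRASE array with one LEVENSHTEIN_MATCH part per word.

    skip_tokens is inserted between adjacent parts when it is non-zero.
    """
    structure = []
    for index, word in enumerate(words):
        if index > 0 and skip_tokens:
            structure.append(skip_tokens)
        structure.append(
            {"LEVENSHTEIN_MATCH": [word, max_distance, transpositions]}
        )
    return structure


phrase = fuzzy_phrase(["lrd", "rng"], max_distance=2, skip_tokens=2)
print(phrase)

# With the connection created earlier in this notebook, it could be used as:
# cursor = database.aql.execute(
#     "FOR d IN v_imdb "
#     "SEARCH PHRASE(d.title, @phrase, 'en_tokenizer') "
#     "SORT BM25(d) DESC RETURN d.title",
#     bind_vars={"phrase": phrase},
# )
```

The printed list has the same shape as the `phraseStructure` defined in the query above: a match object, the number of tokens to skip, then another match object.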

### Combined Fuzziness

#### Differences

**Levenshtein**

If your searches are typically single-word searches, Levenshtein match is usually the better option. With these types of searches, [NGRAM_MATCH](https://www.arangodb.com/docs/stable/aql/functions-arangosearch.html#ngram_match) is not as performant and adds more CPU overhead. Also, when indexing, ngram analyzers store positional information, which results in larger indexes even for single and small words.

**N-Gram**

N-gram is better for longer search terms. Some use cases where using n-gram functions is ideal include:
 * Genome Sequencing
 * Languages with longer words, such as German
 * Log and Sensor data that may have long connected strings

**Combination**

While there are differences between the two, they can combine to bring a high level of fuzziness and accuracy to your searches.

The example below shows how you can use [NGRAM_MATCH](https://www.arangodb.com/docs/stable/aql/functions-arangosearch.html#ngram_match) to match portions of the words and then use [LEVENSHTEIN_MATCH](https://www.arangodb.com/docs/stable/aql/functions-arangosearch.html#levenshtein_match) to boost documents that match whole words and phrases. You gain the benefit of error checking and context from [NGRAM_MATCH](https://www.arangodb.com/docs/stable/aql/functions-arangosearch.html#ngram_match) while maintaining accuracy and relevance with [LEVENSHTEIN_MATCH](https://www.arangodb.com/docs/stable/aql/functions-arangosearch.html#levenshtein_match). Both functions contribute to the overall score, giving you a balance of both. 

Notice as well the sharp drop in the score for the movies whose description doesn't contain `Luke Skywalker`. This is BOOST and [LEVENSHTEIN_MATCH](https://www.arangodb.com/docs/stable/aql/functions-arangosearch.html#levenshtein_match) in action!

Note: We defined the `phraseStructure` separately here, just for convenience and readability. This all could have been done in-line, in the same statement.

In [None]:
cursor = database.aql.execute("""
LET input = "Luk Skywlker"
LET phraseStructure = [
    {
      "LEVENSHTEIN_MATCH": [
        "luk",
        3,
        true
      ]
    },
    {
      "LEVENSHTEIN_MATCH": [
        "skywlker",
        3,
        true
      ]
    }
  ]
FOR d IN v_imdb 
  SEARCH NGRAM_MATCH(d.description, input, 0.6, 'fuzzy_search_bigram') // matches part of the words to provide context
         OR
         BOOST(PHRASE(d.description, phraseStructure, 'en_tokenizer'), 10) // matches whole words to boost documents containing the matched words
  SORT BM25(d) DESC  
  RETURN {
    "Title" : d.title,
    "Description": d.description,
    "Score": BM25(d)
    }"""
)
for doc in cursor:
  print('\033[4mTitle: ' + doc['Title'] + '\033[0m')
  print('\033[4mDescription:\033[0m ',textwrap.fill(doc['Description'], 90))
  print('\033[4mScore:\033[0m ',str(doc['Score']))
  print(' ')

# Further Links

Hopefully, you can now see the potential that fuzzy search has with ArangoSearch. If you would like to continue learning more about ArangoDB and ArangoSearch here are some great next steps to get you started!

* ArangoSearch Demo on Oasis (Just follow the onboarding guide)
  * https://cloud.arangodb.com

* ArangoSearch Documentation
  * https://www.arangodb.com/docs/stable/arangosearch.html

* ArangoSearch Training Center
  * https://www.arangodb.com/arangodb-training-center/search/arangosearch/