<a href="https://colab.research.google.com/github/arangodb/interactive_tutorials/blob/master/notebooks/ArangoSearch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![arangodb](https://github.com/arangodb/interactive_tutorials/blob/master/notebooks/img/ArangoDB_logo.png?raw=1)

# ArangoSearch

ArangoSearch provides information retrieval features, natively integrated into ArangoDB’s query language and with support for all data models. It is primarily a full-text search engine, a much more powerful alternative to the full-text index type.

ArangoSearch introduces the concept of Views which can be seen as virtual collections. Each View represents an inverted index to provide fast full-text searching over one or multiple linked collections and holds the configuration for the search capabilities, such as the attributes to index. It can cover multiple or even all attributes of the documents in the linked collections. Search results can be sorted by their similarity ranking to return the best matches first using popular scoring algorithms.
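
To build some intuition for what an inverted index is, here is a minimal pure-Python sketch. It is only an illustration of the idea, not how ArangoSearch is implemented; the function name `build_inverted_index` and the two sample documents are invented for this example.

```python
from collections import defaultdict


def build_inverted_index(docs, fields):
    """Map each lowercase word to the set of document keys containing it."""
    index = defaultdict(set)
    for key, doc in docs.items():
        for field in fields:
            # Naive tokenization: lowercase and split on whitespace.
            for word in str(doc.get(field, "")).lower().split():
                index[word].add(key)
    return index


# Hypothetical documents, invented purely for illustration.
docs = {
    "m1": {"title": "Space Journey", "description": "a journey across the galaxy"},
    "m2": {"title": "Quiet Town", "description": "a drama in a small town"},
}

index = build_inverted_index(docs, fields=["title", "description"])
print(sorted(index["journey"]))  # ['m1']
print(sorted(index["a"]))        # ['m1', 'm2']
```

Looking up a word is a single dictionary access instead of a scan over every document, which is the property that makes full-text search over large collections fast.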

Configurable Analyzers are available for text processing, such as for tokenization, language-specific word stemming, case conversion, removal of diacritical marks (accents) from characters and more. Analyzers can be used standalone or in combination with Views for sophisticated searching.

# Setup 

Before getting started with ArangoSearch, we need to prepare our environment and create a temporary database on ArangoDB's managed service, Oasis.

In [None]:
%%capture
!git clone -b oasis_connector --single-branch https://github.com/arangodb/interactive_tutorials.git
!rsync -av interactive_tutorials/ ./ --exclude=.git
!chmod -R 755 ./tools
!git clone -b imdb_complete --single-branch https://github.com/arangodb/interactive_tutorials.git imdb_complete
!rsync -av imdb_complete/data/imdb_dump/ ./imdb_dump/
!pip3 install pyarango
!pip3 install "python-arango>=5.0"

In [None]:
import json
import requests
import sys
import oasis
import time

from pyArango.connection import *
from arango import ArangoClient

Create the temporary database:

In [None]:
# Retrieve tmp credentials from ArangoDB Tutorial Service
login = oasis.getTempCredentials('ArangoSearchIMDBTutorial', credentialProvider='https://tutorials.arangodb.cloud:8529/_db/_system/tutorialDB/tutorialDB')

# Connect to the temp database
# Please note that we use the python-arango driver as it has better support for ArangoSearch 
database = oasis.connect_python_arango(login)

In [None]:
print("https://"+login["hostname"]+":"+str(login["port"]))
print("Username: " + login["username"])
print("Password: " + login["password"])
print("Database: " + login["dbName"])

Feel free to use the above URL to check out the WebUI!

##  IMDB Example Dataset

![imdb](https://github.com/arangodb/interactive_tutorials/blob/master/notebooks/img/IMDB_graph.png?raw=1)

Last but not least, we will import the [IMDB Example Dataset](https://github.com/arangodb/example-datasets/tree/master/Graphs/IMDB), which includes information about various movies, actors, directors, and more as a graph.
*Note: the included `arangorestore` only works on Linux or Windows systems. If you want to run this notebook on a different OS, please use the appropriate `arangorestore` from the [Download area](https://www.arangodb.com/download-major/).*

## Linux:

In [None]:
!./tools/arangorestore -c none --server.endpoint http+ssl://{login["hostname"]}:{login["port"]} --server.username {login["username"]} --server.database {login["dbName"]} --server.password {login["password"]} --default-replication-factor 3  --input-directory "imdb_dump"

# Create First View

As discussed above, an ArangoSearch view contains references to documents stored in different collections. 
This makes it possible to perform complex federated searches, even over a complete graph including vertex and edge collections.

In [None]:
# Create an ArangoSearch view.
database.create_arangosearch_view(
    name='v_imdb'
)

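Let us check that it is actually there:

In [None]:
# Fetch the view's details from the server; indexing with database["v_imdb"]
# would only return a collection wrapper without checking that the view exists.
print(database.view("v_imdb"))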

As of now this View is empty, so we need to link it to a collection (here, `imdb_vertices`).

In [None]:
link = {
  "includeAllFields": True,
  "fields" : { "description" : { "analyzers" : [ "text_en" ] } }
}

database.update_arangosearch_view(
    name='v_imdb',
    properties={'links': { 'imdb_vertices': link }}
)
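
If you want to analyze more than one attribute, the link definition can be generated instead of written by hand. The helper below is a hypothetical convenience function (`build_link` is not part of python-arango); it only builds the same dictionary structure used above and does not contact the server.

```python
def build_link(text_fields, analyzer="text_en", include_all_fields=True):
    """Return a link definition that applies `analyzer` to each listed field."""
    return {
        "includeAllFields": include_all_fields,
        "fields": {name: {"analyzers": [analyzer]} for name in text_fields},
    }


# Same link as above, generated by the helper.
example_link = build_link(["description"])
print(example_link)

# A variant that also analyzes the title attribute.
title_and_description = build_link(["description", "title"])
print(title_and_description)
```

The resulting dictionary can be passed as the value for `'imdb_vertices'` in the `links` property, exactly like the hand-written `link` above.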

As the indexing might take a few seconds, let us have a brief look at what is actually going on.

![ArangoSearch](https://github.com/arangodb/interactive_tutorials/blob/master/notebooks/img/ArangoSearch_Arch.jpg?raw=1)

To fill the View, ArangoSearch runs the indexed attributes through the specified Analyzers (`"analyzers" : [ "text_en" ]` in our case). Analyzers parse input values and transform them into sets of sub-values, for example by breaking up text into words with language-specific tokenization and stemming.
Let us check how the `text_en` Analyzer transforms an input into tokens:

In [None]:
cursor = database.aql.execute(
  'RETURN TOKENS("I like ArangoDB because it rocks!", "text_en")'
)
# Iterate through the result cursor
for doc in cursor:
  print(doc)
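
For comparison, the following pure-Python sketch imitates a few of the steps an Analyzer performs: case conversion, accent removal, and tokenization. It is a simplified stand-in and not the `text_en` implementation; in particular it does no stemming, so its output will generally differ from what `TOKENS()` returns. The name `toy_analyze` is made up for this example.

```python
import re
import unicodedata


def toy_analyze(text):
    """Lowercase, strip accents, and split on non-alphanumeric characters."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Keep runs of ASCII letters and digits as tokens.
    return re.findall(r"[a-z0-9]+", no_accents.lower())


print(toy_analyze("I like ArangoDB because it rocks!"))
# ['i', 'like', 'arangodb', 'because', 'it', 'rocks']
print(toy_analyze("Crème brûlée"))
# ['creme', 'brulee']
```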

By now our View should be ready, so let us issue the first query and look for short drama movies.

In [None]:

cursor = database.aql.execute(
    """
    FOR d IN v_imdb
      SEARCH d.type == "Movie" AND d.genres == "['Drama']" AND d.runtime IN 10..50
      RETURN d.title
    """
)
# Iterate through the result cursor
for doc in cursor:
  print(doc)

At this point you might wonder whether you could have achieved the same results with a simple AQL `FILTER`:

In [None]:
cursor = database.aql.execute(
  """FOR d IN v_imdb FILTER d.type == "Movie" AND d.genres == "['Drama']" AND d.runtime IN 10..50 RETURN d.title"""
)
# Iterate through the result cursor
for doc in cursor:
  print(doc)

The difference between the two queries is that the `SEARCH` query uses the inverted index of the previously created View, whereas the `FILTER` query has to post-process the entire result set.
Furthermore, `SEARCH` queries allow us to do other cool things, which we will explore next.

In the next example we retrieve all movies mentioning “Star wars” in the description.

In [None]:
cursor = database.aql.execute(
"""FOR d IN v_imdb 
SEARCH PHRASE(d.description, "Star wars", "text_en") 
RETURN {title:d.title, description: d.description}"""
)
# Iterate through the result cursor
for doc in cursor:
  print(doc)

## Proximity Search

Proximity searching is a way to search for two or more words that occur within a certain number of words from each other.
In the next example, we are looking for the word sequence `in <any word> galaxy` in the description of a movie.
Feel free to try other values!

In [None]:
# Execute the query
cursor = database.aql.execute(
  'FOR d IN v_imdb SEARCH PHRASE(d.description, "in", 1, "galaxy", "text_en") RETURN {title:d.title, description: d.description}'
)
# Iterate through the result cursor
for doc in cursor:
  print(doc)
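
The matching rule behind the query above can be sketched in plain Python. This is an illustrative approximation that assumes the number between the two phrase parts means "exactly this many tokens in between"; it works on whitespace-split tokens rather than on real Analyzer output, and `phrase_with_gap` is a hypothetical helper name.

```python
def phrase_with_gap(tokens, first, gap, second):
    """Return True if `second` occurs exactly `gap` tokens after `first`."""
    offset = gap + 1  # distance between the positions of the two words
    return any(
        tokens[i] == first and tokens[i + offset] == second
        for i in range(len(tokens) - offset)
    )


# Invented sample sentences, split naively on whitespace.
print(phrase_with_gap("far away in a galaxy".split(), "in", 1, "galaxy"))   # True
print(phrase_with_gap("in the distant galaxy".split(), "in", 1, "galaxy"))  # False
print(phrase_with_gap("in galaxy".split(), "in", 1, "galaxy"))              # False
```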

## Ranking and Document Relevance

Great, now we can identify documents containing a specific phrase,
but especially with large document bases we need to be able to rank documents based on their relevance.
ArangoSearch supports the following two schemes:

* [Okapi BM25](https://en.wikipedia.org/wiki/Okapi_BM25)

* [TFIDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)

You can learn more about ranking in the [documentation](https://www.arangodb.com/docs/3.6/aql/functions-arangosearch.html#scoring-functions).
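
To get a feeling for how such a score behaves, here is a small self-contained BM25 sketch. It uses one common textbook variant of the formula with the frequently used parameters `k1=1.2` and `b=0.75`; the exact formula inside ArangoSearch may differ, and the three-document corpus is invented for illustration.

```python
import math


def bm25_score(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against query terms with Okapi BM25."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # documents containing the term
        tf = doc_tokens.count(term)               # occurrences in this document
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Length normalization: longer-than-average documents are penalized.
        norm = tf + k1 * (1 - b + b * len(doc_tokens) / avg_len)
        score += idf * tf * (k1 + 1) / norm
    return score


# Hypothetical mini corpus, already tokenized.
corpus = [
    "a journey across the galaxy".split(),
    "a drama in a small town".split(),
    "galaxy wars in a distant galaxy".split(),
]
scores = [bm25_score(["galaxy"], doc, corpus) for doc in corpus]
ranking = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
print(ranking)  # [2, 0, 1]
```

The document that mentions "galaxy" twice ranks first, the one that mentions it once comes second, and the document without the term scores zero.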

So let us find movies matching the following keywords: “amazing, action, world, alien, sci-fi, science, documental, galaxy”

In [None]:
cursor = database.aql.execute(
  """FOR d IN v_imdb 
  SEARCH ANALYZER(d.description IN TOKENS('amazing action world alien sci-fi science documental galaxy', 'text_en'), 'text_en') 
  SORT BM25(d) DESC 
  LIMIT 10 
  RETURN {"title": d.title, "description" : d.description}"""
)
# Iterate through the result cursor
for doc in cursor:
  print(doc)

Another crucial feature of ArangoSearch is the ability to fine-tune the document scores computed by the relevance models at query time. That functionality is exposed in AQL via the `BOOST` function.
So let us tweak our previous query to prefer “galaxy” over the other keywords.

In [None]:
cursor = database.aql.execute(
"""FOR d IN v_imdb 
   SEARCH ANALYZER(d.description IN TOKENS('amazing action world alien sci-fi science documental', 'text_en') ||
   BOOST(d.description IN TOKENS('galaxy', 'text_en'), 5), 'text_en') 
   SORT BM25(d) DESC 
   LIMIT 10 
   RETURN {"title": d.title, "description" : d.description}"""
)
# Iterate through the result cursor
for doc in cursor:
  print(doc)
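
To experiment with different keywords and boost factors without editing the query string each time, the query can be parameterized with AQL bind variables. The sketch below only builds the query text and the bind variables; `build_boosted_query` is a hypothetical helper, and the parameterized form has not been run against a server here, so treat it as a starting point rather than a verified query.

```python
def build_boosted_query(keywords, boosted, factor, limit=10):
    """Return (aql, bind_vars) for a BM25-ranked search that boosts one term."""
    aql = """
    FOR d IN v_imdb
      SEARCH ANALYZER(
        d.description IN TOKENS(@keywords, 'text_en') ||
        BOOST(d.description IN TOKENS(@boosted, 'text_en'), @factor),
        'text_en')
      SORT BM25(d) DESC
      LIMIT @limit
      RETURN {"title": d.title, "description": d.description}
    """
    bind_vars = {
        "keywords": " ".join(keywords),  # TOKENS() expects a single string
        "boosted": boosted,
        "factor": factor,
        "limit": limit,
    }
    return aql, bind_vars


aql, bind_vars = build_boosted_query(
    ["amazing", "action", "world", "alien"], boosted="galaxy", factor=5
)
print(bind_vars)
# With a live connection this could be executed as:
# cursor = database.aql.execute(aql, bind_vars=bind_vars)
```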

## ArangoSearch Meets Graph

One of the coolest features of ArangoDB, being a multi-model database, is that we can combine different data models and query capabilities.
So, for example, we can easily combine ArangoSearch with a graph traversal. Recall that our IMDB dataset is a graph with edges connecting
the movies we have been looking at to their respective actors, genres, or directors. Let us explore this and look up the director for each of the sci-fi movies above.

In [None]:
cursor = database.aql.execute(
"""
FOR d IN v_imdb 
   SEARCH ANALYZER(d.description IN TOKENS('amazing action world alien sci-fi science documental', 'text_en') ||
    BOOST(d.description IN TOKENS('galaxy', 'text_en'), 5), 'text_en') 
   SORT BM25(d) DESC 
   LIMIT 10 
     FOR vertex, edge, path IN 1..1 INBOUND d imdb_edges
     FILTER path.edges[0].$label == "DIRECTED"
     RETURN DISTINCT {"director" : vertex.name, "movie" : d.title} 
"""
)
# Iterate through the result cursor
for doc in cursor:
  print(doc)
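
The query returns one `{"director": ..., "movie": ...}` pair per row, so a movie with several directors appears more than once. A small post-processing step can group the rows by movie. The helper `directors_by_movie` and the sample rows below are placeholders invented for this sketch, not actual query results.

```python
from collections import defaultdict


def directors_by_movie(rows):
    """Group director/movie pairs into {movie: sorted list of directors}."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["movie"]].append(row["director"])
    return {movie: sorted(names) for movie, names in grouped.items()}


# Placeholder rows in the shape returned by the query above.
sample_rows = [
    {"director": "Director B", "movie": "Movie X"},
    {"director": "Director A", "movie": "Movie X"},
    {"director": "Director C", "movie": "Movie Y"},
]
print(directors_by_movie(sample_rows))
# {'Movie X': ['Director A', 'Director B'], 'Movie Y': ['Director C']}
```

With a live connection, passing `list(cursor)` instead of `sample_rows` should give the same kind of grouping for the real results.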

# Further Links

* https://www.arangodb.com/docs/stable/arangosearch.html

* https://www.arangodb.com/arangodb-training-center/search/arangosearch/