<a href="https://colab.research.google.com/github/cw00dw0rd/ArangoNotebooks/blob/master/ArangoSearch_3_7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%%capture
!git clone -b UpdateSearch --single-branch https://github.com/cw00dw0rd/ArangoNotebooks.git
!rsync -av ArangoNotebooks/ ./ --exclude=.git
!pip3 install pyarango
!pip3 install "python-arango>=5.0"
!pip3 install graphviz

In [2]:
import json
import requests
import sys
import oasis
import time

from pyArango.connection import *
from arango import ArangoClient

In [3]:
%tb
# Retrieve tmp credentials from ArangoDB Tutorial Service
login = oasis.getTempCredentials(tutorialName="ArangoSearch_3-7 Overview", tempURL='https://tutorials.arangodb.cloud:8529/_db/_system/tutorialDB/tutorialDB')
# Connect to the temp database
# Please note that we use the python-arango driver as it has better support for ArangoSearch 
database = oasis.connect_python_arango(login)

No traceback available to show.


Requesting new temp credentials.
Temp database ready to use.


In [4]:
print("https://"+login["hostname"]+":"+str(login["port"]))
print("Username: " + login["username"])
print("Password: " + login["password"])
print("Database: " + login["dbName"])

https://tutorials.arangodb.cloud:8529
Username: TUTzjxbgtl9ryi2axzjv7h2i6
Password: TUTcgkg0wop4sb3k3d3nqqia4
Database: TUTav2cg67kudja67k54dac


In [5]:
%%capture
!chmod 755 ./tools/arangorestore

restore = !./tools/arangorestore -c none --server.endpoint http+ssl://{login["hostname"]}:{login["port"]} --server.username {login["username"]} --server.database {login["dbName"]} --server.password {login["password"]} --default-replication-factor 3  --input-directory "./data" 

In [6]:
# Scan the arangorestore output for the ERROR token and fail loudly if it appears
response = str(restore).split(" ")
if "ERROR" in response:
  print("ERROR")
  print(restore)
  raise Exception("arangorestore reported an error")

print("restored successfully")


restored successfully


In [7]:
#@title Query Formatting Functions Hidden Here 
import itertools
from types import *
import pprint
pp = pprint.PrettyPrinter(indent=4)

aql = database.aql

def getQueryStatistics(query):
  # Execute the query
  with_stored_values = aql.execute(query,profile = True)
  with_stored_explain = aql.explain(query, all_plans = False, max_plans=1)
  return with_stored_values, with_stored_explain

def printQueryProfile(header1, query1, header2, query2):
  print(header1.ljust(62, ' '), header2)
  print("-" * 120)
  
  def formatValue(query, rule):
    # Render one column entry; blank padding when this query has no such rule
    if rule not in query:
      return " ".ljust(30, ' ')
    if rule in ('viewValuesVars', 'condition', 'scorers'):
      # Only the name is printed for these entries
      return str(rule).ljust(30, ' ')
    if isinstance(query[rule], (list, dict)):
      text = " ".ljust(30, ' ')
      for item in query[rule]:
        text = text + str(item).ljust(30, ' ') + "\n"
      return text
    return str(query[rule]).ljust(30, ' ')

  # Rows follow the keys of query1, padded to the length of the longer input
  for rule, _ in itertools.zip_longest(query1, query2):
    r = str(rule).ljust(20, ' ')
    print(r, formatValue(query1, rule), "|".ljust(10, ' '), r, formatValue(query2, rule))

# ArangoSearch Improvements in ArangoDB 3.7

***Disclaimer: These functionalities are currently available only in the Beta release of the upcoming ArangoDB 3.7. You can download the Beta preview here.***

ArangoDB 3.7 will come with many new features and improvements for ArangoSearch, ArangoDB’s integrated search and ranking engine.

This tutorial will detail those features with examples available via an interactive Colab notebook.

The new features include:

* Fuzzy filter support
* Stored Values in ArangoSearch Views
* A new LIKE operator
* Enhanced PHRASE functionality

To make it more fun, we will use the IMDB dataset, which contains data about movies such as their titles and descriptions.

# Stored Values in ArangoSearch Views

Typically, when performing queries, you `FILTER` on some criteria to locate a document and then return either the entire document or a few of its attributes. This is already fast in ArangoDB, and ArangoSearch Views can now make it faster: a view can store selected document attributes alongside its index. A query that needs only those attributes reads them directly from the view, so the server does not have to load each full document from the storage engine and extract the requested attributes from it. To add stored values to a view, list the attributes in the `storedValues` property when creating the view, like so:

```
db._createView("imdb_with_stored_values", "arangosearch", { storedValues: [ "title" ] })
```
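Since this notebook talks to the database through python-arango, the same view definition can be sketched as a Python dictionary. This is a minimal sketch under stated assumptions, not the setup used to build the tutorial's views: the helper name `build_view_properties`, the collection name `imdb_vertices`, and the `includeAllFields` link option are illustrative choices. It also uses the object form of `storedValues` (a list of groups, each with a `fields` list) rather than the plain string list shown above.

```python
import json

def build_view_properties(collection, stored_fields):
    """Return ArangoSearch view properties that link `collection` and store `stored_fields`."""
    return {
        # Link the (assumed) collection and index all of its attributes
        "links": {collection: {"includeAllFields": True}},
        # One group of stored values, written in the object form with a "fields" list
        "storedValues": [{"fields": list(stored_fields)}],
    }

# "imdb_vertices" is an assumed collection name used only for illustration
properties = build_view_properties("imdb_vertices", ["title"])
print(json.dumps(properties, indent=2))

# With a live connection, the dictionary could be passed to python-arango, e.g.:
# database.create_arangosearch_view("imdb_with_stored_values", properties=properties)
```

The dictionary is built separately from the driver call so that it can be inspected or adjusted before the view is created.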

Running the following query against our IMDB dataset:

1. Can perform faster than the same query against a view without stored values
2. Applies late document materialization, thanks to the sort+limit combination

``` 
FOR d IN imdb_with_stored_values
SEARCH d.type == 'Movie'
SORT bm25(d), d.title
LIMIT 100
RETURN d.title
```

This is a trivial example, as we are just retrieving all of the movies (not actors) in the dataset and sorting by score and movie title. However, even with this simple example, execution time can be multiple times faster than without stored values. Running the cell below, which executes the query against a view with stored values and a view without them, prints a comparison of the optimizer rules and execution stats.

Values are now also stored implicitly when you set a `primarySort`. Storing groups of values can bring further performance gains: the query optimizer prefers a stored group over individually stored values if the group covers the query more completely. For further details on this feature, please refer to the documentation on stored values.
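To make the idea of grouped stored values more concrete, here is a hypothetical view definition with a `primarySort` and two stored groups, plus a small reader-side helper that lists which groups contain every attribute a query returns. The attribute choices and the `covering_groups` helper are assumptions for illustration. The helper is only a way to reason about coverage and does not reproduce the optimizer's actual logic.

```python
# Hypothetical view definition: attribute choices are for illustration only
grouped_properties = {
    "primarySort": [{"field": "title", "direction": "asc"}],
    "storedValues": [
        {"fields": ["title", "type"]},   # attributes stored together as one group
        {"fields": ["description"]},     # a single stored attribute
    ],
}

def covering_groups(properties, needed_fields):
    """Return the stored groups that contain every field in needed_fields."""
    needed = set(needed_fields)
    return [
        group["fields"]
        for group in properties["storedValues"]
        if needed <= set(group["fields"])
    ]

print(covering_groups(grouped_properties, ["title", "type"]))
print(covering_groups(grouped_properties, ["title", "description"]))
```

The first call reports the `["title", "type"]` group, while the second returns an empty list because no single group holds both `title` and `description`.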

In [8]:
import itertools

aql = database.aql

# Execute the query
with_stored_values = getQueryStatistics(
"""
FOR d IN imdb_with_stored_values 
  SEARCH d.type == 'Movie' 
  SORT bm25(d), d.title 
  LIMIT 5000 
  RETURN d.title
""")

no_stored_values = getQueryStatistics(
"""
FOR d IN imdb_no_stored_values 
  SEARCH d.type == 'Movie' 
  SORT bm25(d), d.title 
  LIMIT 5000 
  RETURN d.title
""")
nodes = with_stored_values[1]['nodes']
printQueryProfile("With Stored Values", with_stored_values[0].profile(), "No Stored Values", no_stored_values[0].profile())

for i in itertools.zip_longest(with_stored_values[1]['nodes'], no_stored_values[1]['nodes'], fillvalue={'rule': 'Not Needed'}):
  printQueryProfile("With Stored Values", i[0], "No Stored Values", i[1])

With Stored Values                                             No Stored Values
------------------------------------------------------------------------------------------------------------------------
initializing         3.8868747651577e-06            |          initializing         3.02889384329319e-06          
parsing              0.00011293101124465466         |          parsing              9.075016714632511e-05         
optimizing ast       9.234994649887085e-06          |          optimizing ast       8.609844371676445e-06         
loading collections  5.170004442334175e-06          |          loading collections  5.063135176897049e-06         
instantiating plan   5.869404412806034e-05          |          instantiating plan   3.548501990735531e-05         
optimizing plan      0.04023073799908161            |          optimizing plan      0.0015453689265996218         
executing            0.116419947007671              |          executing            0.08724166406318545      