# Description

> Compare and contrast the legacy `$text` and `$regex` operators with Atlas Search (`$search`).

Review the comparison table on this page first: https://www.mongodb.com/developer/article/Atlas-Search-vs-regex/

Queries built on the [regex](https://www.mongodb.com/docs/manual/reference/operator/query/regex/) and [text](https://www.mongodb.com/docs/manual/text-search/) operators tend to cause frustration as requirements grow, and our opinion is that `$search` mitigates that risk.

This proof runs the same query against the same collection with each approach and compares:

1. Index size
2. Query syntax
3. Elapsed time
4. Returned documents/relevance
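
A minimal sketch of how the elapsed-time measurement could be wrapped in a helper. The names `timed_query` and `fake_query` are hypothetical and not part of this notebook; the cells in this document time only the `aggregate` call itself, whereas this helper also drains the cursor inside the timed region. `fake_query` stands in for a real query so the sketch runs without a cluster.

```python
import time
from typing import Callable, Iterable, List, Tuple


def timed_query(run_query: Callable[[], Iterable[dict]]) -> Tuple[List[dict], float]:
    """Run a query and return (documents, elapsed milliseconds).

    The result is materialized inside the timed region, so the measurement
    covers fetching every document, not only the initial call.
    """
    start = time.perf_counter()
    docs = list(run_query())
    elapsed_ms = (time.perf_counter() - start) * 1000
    return docs, elapsed_ms


# Hypothetical stand-in for something like:
#   lambda: client[collection].aggregate(pipeline)
def fake_query():
    return iter([{"title": "Fight Club"}, {"title": "The Club"}])


docs, elapsed_ms = timed_query(fake_query)
print(f"{len(docs)} documents in {elapsed_ms:.3f} ms")
```

`time.perf_counter()` is used instead of `time.time()` because it is monotonic and intended for measuring short intervals.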

In [28]:
# import our libraries, instantiate our classes
import json 
from pymongo import MongoClient, TEXT
import pprint
import time

mongo_uri = "INSERT_YOUR_CLUSTER_STRING_HERE"

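# uses the MongoDB sample dataset (sample_mflix) loaded in Atlas
collection = 'movies'
db = 'sample_mflix'
# note: `client` is the sample_mflix database handle, not the MongoClient itself
client = MongoClient(mongo_uri)[db]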

# Document sample
```
{
  "_id": {
    "$oid": "573a1394f29313caabcde70a"
  },
  "plot": "Notorious gunfighter Jimmy Ringo rides into town to find his true love, who doesn't want to see him. He hasn't come looking for trouble, but trouble finds him around every corner.",
  "genres": ["Drama", "Western"],
  "runtime": 85,
  "rated": "NOT RATED",
  "cast": ["Gregory Peck", "Helen Westcott", "Millard Mitchell", "Jean Parker"],
  "num_mflix_comments": 1,
  "poster": "https://m.media-amazon.com/images/M/MV5BYjBiNmNiOWUtZTJhYi00N2JkLTgwYWItYTdhMjA0M2VlNGU0XkEyXkFqcGdeQXVyMjI4MjA5MzA@._V1_SY1000_SX677_AL_.jpg",
  "title": "The Gunfighter",
  "fullplot": "A reformed Gunfighter Jimmy Ringo is on his way to a sleepy town in the hope of a reunion with his estranged sweetheart and their young son who he has never seen. On arrival, a chance meeting with some old friends including the town's Marshal gives the repentant Jimmy some respite. But as always Jimmy's reputation has already cast its shadow, this time in the form of three vengeful cowboys hot on his trail and a local gunslinger hoping to use Jimmy to make a name for himself. With a showdown looming, the town is soon in a frenzy as news of Jimmy's arrival spreads. His movements are restricted to the saloon while a secret meeting with his son can be arranged giving him ideas of a long term reunion with his family far removed from his wild past.",
  "languages": ["English"],
  "released": {
    "$date": {
      "$numberLong": "-611107200000"
    }
  },
  "directors": ["Henry King"],
  "writers": ["William Bowers (screenplay)", "William Sellers (screenplay)", "William Bowers (story)", "Andrè De Toth (story)"],
  "awards": {
    "wins": 0,
    "nominations": 2,
    "text": "Nominated for 1 Oscar. Another 1 nomination."
  },
  "lastupdated": "2015-08-21 00:31:52.783000000",
  "year": 1950,
  "imdb": {
    "rating": 7.7,
    "votes": 6395,
    "id": 42531
  },
  "countries": ["USA"],
  "type": "movie",
  "tomatoes": {
    "viewer": {
      "rating": 3.9,
      "numReviews": 1205,
      "meter": 89
    },
    "dvd": {
      "$date": {
        "$numberLong": "1210636800000"
      }
    },
    "critic": {
      "rating": 8,
      "numReviews": 9,
      "meter": 100
    },
    "lastUpdated": {
      "$date": {
        "$numberLong": "1441131150000"
      }
    },
    "rotten": 0,
    "production": "Twentieth Century Fox",
    "fresh": 9
  }
}
```

## Text Index

In [33]:
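# drop the index first so the cell can be re-run
from pymongo.errors import OperationFailure

try:
    client[collection].drop_index('title_text')
except OperationFailure:
    # raised by the server when the index does not exist
    print("index doesn't exist, yet.")

# create text index; should return 'title_text'
client[collection].create_index([("title", TEXT)], default_language='english')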

'title_text'

In [38]:
# get index size; `client` is the database handle, while `db` is only its name (a str)
stats = client.command('collStats', collection)
text_index_size_in_kb = stats['indexDetails']['title_text']['block-manager']['file size in bytes'] / 1000

# run query; the term is all lowercase because $text is case insensitive.
# for an exact phrase, wrap it in escaped quotes: search_term = "\"fight club\""
search_term = "fight club"

pipeline = [
    {
        '$match': {
            '$text': {
                '$search': search_term
            }
        }
    }, {
        '$project': {
            "title": 1,
            '_id':0,
            "searchScore":{'$meta':'textScore'}
        }
    },
    {
        '$sort':{"searchScore":-1}
    },
    {
        '$limit':15
    }    
]

# execute and measure elapsed time
start_time = time.time()
text_query = client[collection].aggregate(pipeline)
end_time = time.time()

print(f'TEXT QUERY for "fight club" in {collection} \n')
print('documents returned: ')
pprint.pprint(list(text_query))
print('\n')

print('index size:', text_index_size_in_kb, ' KBs \n')

print(f'elapsed time in MS: {(end_time - start_time) * 1000}')


TEXT QUERY for "fight club" in movies 

documents returned: 
[{'searchScore': 1.5, 'title': 'Fight Club'},
 {'searchScore': 1.25, 'title': 'Fight, Zatoichi, Fight'},
 {'searchScore': 1.25, 'title': 'Fight, Zatoichi, Fight'},
 {'searchScore': 1.0, 'title': 'Fighting'},
 {'searchScore': 1.0, 'title': 'The Club'},
 {'searchScore': 1.0, 'title': 'Clubbed'},
 {'searchScore': 1.0, 'title': 'Why We Fight'},
 {'searchScore': 1.0, 'title': 'The Club'},
 {'searchScore': 0.75, 'title': 'Gourmet Club'},
 {'searchScore': 0.75, 'title': 'Club Paradise'},
 {'searchScore': 0.75, 'title': 'The Monster Club'},
 {'searchScore': 0.75, 'title': 'The Fighting Lady'},
 {'searchScore': 0.75, 'title': 'The Cotton Club'},
 {'searchScore': 0.75, 'title': 'Typhoon Club'},
 {'searchScore': 0.75, 'title': 'Geography Club'}]


index size: 819.2  KBs 

elapsed time in MS: 15.308141708374023


### Notice a couple things:
    
1. The `searchScore` delta between "Fight Club" (1.5) and "Fight, Zatoichi, Fight" (1.25) is small, even though the first result is so clearly the correct document.
2. A collection can have only one text index, so every field you need to query must be included in that single index.
3. There is no custom scoring beyond the per-field weights set on the index.
4. It is not clear how the scores are calculated (is it TF-IDF?). See the code: https://github.com/mongodb/mongo/blob/e97e4ff09cdb2398b571683312b2ddf92694a025/src/mongo/db/fts/fts_spec.cpp#L212-L232, which appears to multiply the score by 1.1 for an exact match.
5. Results are not sorted by score by default, so an explicit `$sort` on the `textScore` metadata is needed to put the most relevant documents first.
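
On how the scores are calculated: the sketch below is one reading of the linked `fts_spec.cpp`, not an official specification. It assumes the input terms are already stemmed and stripped of stop words (so "The Club" becomes `["club"]`), and the names `term_scores` and `text_score` are hypothetical. Under those assumptions it reproduces the scores in the output above, which suggests the score depends on term frequency within the field but not on how rare the term is across the collection.

```python
from collections import defaultdict
from typing import Dict, List


def term_scores(doc_terms: List[str], weight: float = 1.0,
                exact_match: bool = False) -> Dict[str, float]:
    """Per-term scores for one field, given stemmed, stop-word-free terms."""
    count: Dict[str, int] = defaultdict(int)
    freq: Dict[str, float] = defaultdict(float)
    for term in doc_terms:
        count[term] += 1
        # each repeat of a term contributes half as much as the previous one
        freq[term] += 1 / (2 ** (count[term] - 1))

    num_tokens = len(doc_terms)
    scores = {}
    for term in count:
        # scale by the share of the field that this term occupies
        coeff = 0.5 * count[term] / num_tokens + 0.5
        # assumed from the source: a 1.1 bonus when the field is exactly the term
        adjustment = 1.1 if exact_match else 1.0
        scores[term] = weight * freq[term] * coeff * adjustment
    return scores


def text_score(doc_terms: List[str], query_terms: List[str]) -> float:
    """Sum the document's per-term scores over the distinct query terms."""
    scores = term_scores(doc_terms)
    return sum(scores.get(term, 0.0) for term in set(query_terms))


query = ["fight", "club"]
for title, terms in [
    ("Fight Club", ["fight", "club"]),
    ("Fight, Zatoichi, Fight", ["fight", "zatoichi", "fight"]),
    ("The Club", ["club"]),
    ("Gourmet Club", ["gourmet", "club"]),
]:
    print(title, round(text_score(terms, query), 2))
```

This prints 1.5, 1.25, 1.0 and 0.75, matching the recorded scores for those titles.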

#### Notes:
1. B-Tree data structure under the hood
2. Limited language support: https://www.mongodb.com/docs/manual/reference/text-search-languages/#std-label-text-search-languages
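
On the one-text-index limit: a single text index can still cover several fields, each with its own weight. The sketch below only builds the arguments for `create_index`; the helper name `text_index_spec` and the weights 10 and 2 are illustrative assumptions, not values used elsewhere in this notebook.

```python
from typing import Dict, List, Tuple

TEXT = "text"  # same value as pymongo.TEXT, repeated here to avoid needing a connection


def text_index_spec(weights: Dict[str, int]) -> Tuple[List[Tuple[str, str]], Dict[str, int]]:
    """Return (keys, weights) for one text index covering several fields."""
    keys = [(field, TEXT) for field in weights]
    return keys, dict(weights)


# illustrative weights: a match in the title counts five times a match in the plot
keys, weights = text_index_spec({"title": 10, "plot": 2})
print(keys)
print(weights)

# With a live connection this could be applied as:
#   client[collection].create_index(keys, weights=weights, default_language='english')
```

Because only one text index is allowed per collection, the existing `title_text` index would have to be dropped before creating this one.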


## Regex Query

In [46]:
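# the pattern matches the whole word "Fight", or "Club" at the end of a word
# (alternation binds loosest, so only the first alternative has a leading \b)
search_term = "\\b(Fight)\\b|(Club)\\b"

# run query. Regex is case sensitive unless '$options': 'i' is set.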
pipeline = [
    {
        '$match': {
            "title": {
                '$regex': search_term
            }
        }
    }, {
        '$project': {
            "title": 1, 
            '_id': 0
        }
    }, {
        '$limit': 16
    }
]

# execute and measure elapsed time
start_time = time.time()
regex_query = client[collection].aggregate(pipeline)
end_time = time.time()

print(f'REGEX QUERY for "{search_term}" in {collection} \n')
print('documents returned: ')
pprint.pprint(list(regex_query))
print('\n')

print('index size:', text_index_size_in_kb, ' KBs \n')

print(f'elapsed time in MS: {(end_time - start_time) * 1000}')


REGEX QUERY for "\b(Fight)\b|(Club)\b" in movies 

documents returned: 
[{'title': 'The Cheyenne Social Club'},
 {'title': 'I Will Fight No More Forever'},
 {'title': 'The Club'},
 {'title': 'The Monster Club'},
 {'title': 'The Cotton Club'},
 {'title': 'Typhoon Club'},
 {'title': 'Club Paradise'},
 {'title': 'Fight Back to School'},
 {'title': 'The Cemetery Club'},
 {'title': 'The Joy Luck Club'},
 {'title': 'The Baby-Sitters Club'},
 {'title': 'The First Wives Club'},
 {'title': 'The Boys Club'},
 {'title': 'The Players Club'},
 {'title': 'Fight, Zatoichi, Fight'},
 {'title': 'Fight Club'}]


index size: 819.2  KBs 

elapsed time in MS: 60.05597114562988


### Notice a couple things:
    
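1. Great precision but no flexibility: a regex match is boolean (a title either matches or it does not), so there is no relevance score and no room for error.
2. `$regex` cannot use the text index, and an unanchored pattern like this one cannot use a regular index efficiently either, so the server has to examine every document or index key. The query took about 60 ms here versus about 15 ms for `$text`; the index size printed above is still that of the text index.
3. The regex pattern syntax is not intuitive, and it gets worse as requirements grow.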
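
To illustrate the syntax burden, here is a minimal sketch of building a whole-word pattern from user input and wrapping it in a `$match` stage. The helper names `whole_word_pattern` and `regex_match_stage` are hypothetical. Python's `re` module is used as a local stand-in to check the pattern; MongoDB evaluates `$regex` with its own PCRE-based engine, so behavior may differ for more exotic patterns.

```python
import re
from typing import Any, Dict, List


def whole_word_pattern(words: List[str]) -> str:
    """Build a pattern that matches any of `words` as a whole word."""
    # escape user input so characters like '.' or '(' are treated literally
    alternatives = "|".join(re.escape(word) for word in words)
    return rf"\b({alternatives})\b"


def regex_match_stage(field: str, pattern: str, ignore_case: bool = False) -> Dict[str, Any]:
    """Wrap the pattern in a $match stage, optionally case insensitive."""
    condition: Dict[str, Any] = {"$regex": pattern}
    if ignore_case:
        condition["$options"] = "i"
    return {"$match": {field: condition}}


pattern = whole_word_pattern(["Fight", "Club"])
titles = ["Fight Club", "The Club", "Fighting", "fight club", "Clubbed"]

# local check of which titles the pattern would accept
matches = [t for t in titles if re.search(pattern, t)]
matches_ignore_case = [t for t in titles if re.search(pattern, t, re.IGNORECASE)]

print(pattern)
print(matches)
print(matches_ignore_case)
print(regex_match_stage("title", pattern, ignore_case=True))
```

Grouping both words inside one pair of `\b` anchors applies the word boundary to every alternative, and "Fighting" and "Clubbed" are rejected because no stemming takes place.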

#### Why Regex?
1. Query vs. search: a query is precise, whereas search is tolerant of how humans actually type.
2. Best of both worlds: regex is also supported in Atlas Search: https://www.mongodb.com/docs/atlas/atlas-search/regex/#lucene-regular-expression-behavior

## Search Query

In [50]:
# get index size
# TODO

# run query
search_term = "fight club"
pipeline = [
    {
        '$search': {
            'text': {
                'query': search_term,
                'path': "title"
            }
        }
    }, {
        '$project': {
            "title": 1, 
            '_id': 0,
            "searchScore":{'$meta':'searchScore'}
        }
    }, 
    {
        '$sort':{"searchScore":-1}
    },
    {
        '$limit': 10
    }
]

# execute and measure elapsed time
start_time = time.time()
search_query = client[collection].aggregate(pipeline)
end_time = time.time()

print(f'SEARCH QUERY for "{search_term}" in {collection} \n')
print('documents returned: ')
pprint.pprint(list(search_query))
print('\n')

# TODO: call atlas search API to get index size
print('index size: 1.2 MB')

print(f'elapsed time in MS: {(end_time - start_time) * 1000}')


SEARCH QUERY for "fight club" in movies 

documents returned: 
[{'searchScore': 7.272419452667236, 'title': 'Fight Club'},
 {'searchScore': 4.605154037475586, 'title': 'Fight, Zatoichi, Fight'},
 {'searchScore': 4.605154037475586, 'title': 'Fight, Zatoichi, Fight'},
 {'searchScore': 3.8719658851623535, 'title': 'Girl Fight'},
 {'searchScore': 3.8719658851623535, 'title': 'Street Fight'},
 {'searchScore': 3.400453567504883, 'title': 'Geography Club'},
 {'searchScore': 3.400453567504883, 'title': 'Club Paradise'},
 {'searchScore': 3.400453567504883, 'title': 'Suicide Club'},
 {'searchScore': 3.400453567504883, 'title': 'Typhoon Club'},
 {'searchScore': 3.400453567504883, 'title': 'Suicide Club'}]


index size: 1.2 MB
elapsed time in MS: 22.75705337524414


![fts index](fts_index.png)

### Notice a couple things:

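1. The index size (1.2 MB) is not much larger than the standard text index (819.2 KB).
2. The documents are ordered by relevance, and "Fight Club" (7.27) is clearly separated from the runner-up (4.61).
3. The query syntax is no more complex than `$text`.
4. Speed is on par with `$text` (about 23 ms versus about 15 ms).
5. Queries can be made typo tolerant with the `fuzzy` option of the `text` operator.
6. The query string is sent exactly as the user types it, with no need to inject extra characters such as regex boundaries or escaped quotes.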
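
A minimal sketch of the typo-tolerance point, assuming the same default search index on `title` that the query above uses. The helper name `build_search_pipeline` and the misspelled query string are hypothetical; only the pipeline is built here, so no results are claimed.

```python
from typing import Any, Dict, List


def build_search_pipeline(query: str, path: str = "title",
                          max_edits: int = 1, limit: int = 10) -> List[Dict[str, Any]]:
    """Build a $search pipeline whose text operator tolerates typos."""
    if max_edits not in (1, 2):
        raise ValueError("maxEdits must be 1 or 2")
    return [
        {
            "$search": {
                "text": {
                    "query": query,
                    "path": path,
                    # allow each term to differ from the query by max_edits characters
                    "fuzzy": {"maxEdits": max_edits},
                }
            }
        },
        {"$project": {"title": 1, "_id": 0, "searchScore": {"$meta": "searchScore"}}},
        {"$limit": limit},
    ]


# the user's input is passed through unchanged, typo included
pipeline = build_search_pipeline("fight clib")
print(pipeline[0])

# With a live connection this could be run as:
#   client[collection].aggregate(pipeline)
```

No explicit `$sort` stage is included because `$search` already returns documents in descending score order.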