In this walkthrough, we will build an application with full-text search by adding features incrementally and exposing them through a simple REST API.

## Setup

1. Create a cluster in Atlas:

<img src="assets/create_cluster.png" style="width: 300px;"/>

2. Import data into the cluster or use existing data.

<img src="assets/import_data.png" style="width: 300px;"/>

3. Create the full-text search (FTS) index:

<img src="assets/create_index.png" style="width: 300px;"/>


## Install Prerequisites

In [None]:
! pip install pymongo pygments

In [None]:
# Helper for pretty-printing query results as colorized JSON

import json

from bson import json_util
from pygments import highlight
from pygments.formatters import Terminal256Formatter
from pygments.lexers import JsonLexer
from pygments.style import Style
from pygments.token import Token


class MyStyle(Style):
    """Color scheme applied to the JSON output."""
    styles = {
        Token.String: 'ansigreen',
        Token.Literal: 'ansibrightyellow',
        Token.Keyword: 'ansimagenta',
        Token.Operator: 'ansibrightmagenta'
    }


def pp(doc):
    """Pretty-print a document, or a cursor of documents, as colorized JSON."""
    # json_util serializes BSON types (ObjectId, datetime) that the json module cannot
    formatted_json = json.dumps(json.loads(json_util.dumps(doc)), indent=4)
    colorful_json = highlight(formatted_json, JsonLexer(), Terminal256Formatter(style=MyStyle))
    print(colorful_json)

In [None]:
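import json

import pymongo
from bson import json_util
from config import mongo_uri

# Certificate validation is disabled here for convenience only; keep it enabled in production.
# PyMongo 4 removed `ssl_cert_reqs`; `tlsAllowInvalidCertificates` is its replacement.
conn = pymongo.MongoClient(mongo_uri, tlsAllowInvalidCertificates=True)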

movies_collection = conn['sample_mflix']['movies']

## Basic Search

Run a simple text search.

In [None]:
pipeline = [
    {
        '$search': {
            'text': {
                'query': "fight club",
                'path': "title"
            }
        }
    },
    {
        '$project': {
            'title':1,
            '_id':0,
            'score': {
                '$meta': 'searchScore'
            }
        }
    }
]
docs = movies_collection.aggregate(pipeline)
pp(docs)

# Keep `conn` open: the cells below reuse it.

## Fuzzy

Fuzzy matching, often referred to as approximate string matching, finds strings that match a pattern approximately rather than exactly. It is commonly used to handle misspellings and typing errors ("fat fingering").

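**maxEdits** sets the maximum Levenshtein distance allowed between a query term and a matching term. The Levenshtein distance is the minimum number of single-character edits (insertions, deletions, or substitutions) needed to turn one string into the other. In the query below, "might" and "cub" are each one edit away from "fight" and "club".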
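
To make the distance concrete, here is a minimal sketch of the classic dynamic-programming computation. The function name `levenshtein` is our own, and this is only an illustration of the metric, not a description of how the search engine computes it internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits that turn a into b."""
    # previous[j] is the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


for typed, intended in [("might", "fight"), ("cub", "club")]:
    print(typed, "->", intended, levenshtein(typed, intended))
```

Both pairs are one edit apart, so each typed term falls within a `maxEdits` of 1 or 2.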

In [None]:
pipeline = [
    {
        '$search': {
            'text': {
                'query': "might cub",
                'path': "title",
                'fuzzy':{
                    'maxEdits':2
                }
            }
        }
    },
    {
        '$project': {
            'title':1,
            '_id':0,
            'score': {
                '$meta': 'searchScore'
            }
        }
    }
]
docs = movies_collection.aggregate(pipeline)
pp(docs)

# Keep `conn` open: the cells below reuse it.
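
The basic and fuzzy pipelines differ only in the `fuzzy` option, so a single helper could produce both. The sketch below is one possible shape for it; `build_search_pipeline` and its parameters are hypothetical names introduced for illustration, not part of PyMongo or Atlas Search.

```python
def build_search_pipeline(query, path="title", max_edits=None):
    """Build a $search + $project pipeline; fuzzy matching is on when max_edits is set."""
    text = {'query': query, 'path': path}
    if max_edits is not None:
        # Same option as in the fuzzy cell above
        text['fuzzy'] = {'maxEdits': max_edits}
    return [
        {'$search': {'text': text}},
        {'$project': {path: 1, '_id': 0, 'score': {'$meta': 'searchScore'}}},
    ]


exact = build_search_pipeline("fight club")
fuzzy = build_search_pipeline("might cub", max_edits=2)
print(fuzzy[0])
```

Passing either result to `movies_collection.aggregate(...)` on an open connection runs the query as in the cells above. Building the pipeline in one place also gives a REST handler a single function to call.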

## Highlighting

Add a relevance score and hit highlights to the results.

In [None]:
pipeline = [
    {
        '$search': {
            'text': {
                'query': "fight",
                'path': "title"
            },
            # Return the matched fragments of `title` with each result
            'highlight': {
                'path': "title"
            }
        }
    },
    {
        '$project': {
            'title':1,
            '_id':0,
            'score': {
                '$meta': 'searchScore'
            },
            'highlights': {
                '$meta': 'searchHighlights'
            }
        }
    }
]

docs = movies_collection.aggregate(pipeline)
pp(docs)

conn.close()
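
Each entry in `highlights` carries a `texts` list whose items are tagged as either `hit` or `text`. As a rough sketch of how a client might display them, the helper below wraps hits in a tag. The name `render_highlights` and the `<mark>` tags are our own choices, and the sample document is hand-written to mirror the result shape rather than copied from real query output.

```python
def render_highlights(highlights, open_tag="<mark>", close_tag="</mark>"):
    """Join the fragments of each highlight entry, wrapping the hits in tags."""
    rendered = []
    for entry in highlights:
        parts = [
            f"{open_tag}{t['value']}{close_tag}" if t['type'] == 'hit' else t['value']
            for t in entry['texts']
        ]
        rendered.append("".join(parts))
    return rendered


# Hand-written sample in the shape of a `searchHighlights` value (not real output)
sample = [{
    'path': 'title',
    'score': 1.0,
    'texts': [
        {'value': 'Fight', 'type': 'hit'},
        {'value': ' Club', 'type': 'text'},
    ],
}]
print(render_highlights(sample))
```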