# Semantic search
The notebook below is a really quick and dirty example of how one could implement a semantic search engine using `sentence_transformers` (part of [sbert](https://www.sbert.net/index.html)) and OpenSearch (the open source fork of Elasticsearch that AWS backs).

## Glossary
* Model - Basically a collection of mathematical functions that can predict or transform data. These usually require training in some way so they can understand the world
* Embedding - A word or sentence converted to numbers, usually stored as a list or vector
* Distance - In this context, this is how far two sets of numbers are from one another. Related to:
    * K Nearest Neighbours - K being the number of nearby values to return
    * Dot product and cosine similarity - ways of evaluating distance
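
To make the two distance measures in the glossary concrete, here is a minimal sketch in plain Python. The three-number "embeddings" are made up for illustration; real models produce hundreds of values per sentence.

```python
import math


def dot_product(a, b):
    """Sum of element-wise products. Grows with vector length as well as alignment."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by both vector lengths, so only direction matters."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)


# Toy vectors, invented for this example only.
short = [1.0, 2.0, 3.0]
longer = [2.0, 4.0, 6.0]  # same direction as `short`, twice the length
other = [3.0, -1.0, 0.5]

print(dot_product(short, longer))        # 28.0
print(cosine_similarity(short, longer))  # very close to 1.0: same direction
print(cosine_similarity(short, other))   # much lower: different direction
```

The dot product rewards long vectors, while cosine similarity ignores length, which is why the two can rank results differently.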

## Semantic what?
Okay, so what I mean is "searching based on meaning, not on exact words or synonym lists". The models used below try to extract the actual meaning of a sentence and represent it mathematically as a list of numbers. Two sentences with similar meanings will have similar lists of numbers, and how far apart two lists are is what we refer to as distance. Bigger models use longer lists of numbers but also take longer to encode and match. It's a trade-off that I've not really explored here, preferring to just see how well different models work.

In the olden days, this kind of thing just used single words, but that often failed to catch the meaning behind things. Glasses, for example, could be for your face or for your wine; it's all about context. Later attempts combined the scores for all the words in a sentence to get the aggregate meaning, but this often led to an over-representation of junk words such as `the` or `and` (or `or` for that matter!). More recently, models have been built that take the whole sentence in and try to get an aggregate meaning. I've not done the reading to decide which model does what in the below, but I have focused on sentence-based approaches. This is a double-edged sword: we will probably understand the meaning of the sentence better, but our vectors will be bigger (and thus more expensive to create and search), and match scores can be low (as there's more meaning in the weight bench description than just `gym`). The low scores should be fine, as `gym` will still score more highly than something nonsensical, such as `cat`.

## How could this look on AWS?
Read [this AWS post](https://aws.amazon.com/blogs/machine-learning/building-an-nlu-powered-search-application-with-amazon-sagemaker-and-the-amazon-es-knn-feature/) for an example. It is basically what I've done here, except that it hosts the encoding of the sentences to numbers on SageMaker. The advantage of this is that SageMaker has hosted endpoints that are specifically adapted to run rapid predictions from complex models, meaning you will likely get results faster than using a Lambda/ECS container of similar size.

## Methodology
I've loaded up two product descriptions from Amazon.co.uk, one for a tent and one for a weights bench. I've then tried a few different models (see the sections below for details), using both the whole description and just the noun phrases from the description. The advantage of keeping only the nouns is that the things we are looking for are likely nouns, so you are less likely to get non-target matches. The disadvantage is that there is less context to learn from.

My queries are just randomly chosen words. Some are in the descriptions, some are not but are related to them, and others are just randomly chosen words from deep in my subconscious. You can make what you will of the fact `bra` is in there. Both the descriptions and the queries are encoded to lists of numbers (often called vectors) and then the distance between them is compared.
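
The comparison step boils down to scoring every stored vector against the query vector and keeping the best K. Here is a rough sketch of that brute-force version in plain Python. The names and three-number vectors are hypothetical stand-ins, not real model output.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length lists of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def k_nearest(query_vec, doc_vecs, k):
    """Score every document against the query and return the k best as (score, name)."""
    scored = ((cosine(query_vec, vec), name) for name, vec in doc_vecs.items())
    return heapq.nlargest(k, scored)


# Made-up "embeddings", purely for illustration.
doc_vecs = {
    "bench": [0.9, 0.1, 0.0],
    "tent": [0.1, 0.9, 0.2],
    "kettle": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]

for score, name in k_nearest(query_vec, doc_vecs, k=2):
    print(f"{name}: {score:.3f}")
```

Scoring everything like this is fine for a handful of documents. With lots of them you want an index that avoids comparing against every vector, which is what the OpenSearch section further down is for.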


## Warnings
* I've not done this on lots of data, just the two examples below. If you've got a load of descriptions then hit me up and we can play!
* I have a GPU so this runs really fast. Your mileage in production may vary. If you need it faster then use a smaller model (but you may lose accuracy in doing so, it's a trade-off)
* A simple word embedding using extracted nouns may be the fastest option and I've not done that here, preferring sentence embeddings. As with the above point, by losing the sentences you lose some meaning. The resultant search would struggle to differentiate glasses on your face and glasses for your wine!
* You may need to play with Opensearch's settings to get the right distance metrics for your model, also not done here.

## Right tools, right job
Amazon Kendra will also do this for you but it's not cheap (over £1,000 a month). Reading the brief, it also seems optimised for internal documents over products, but I've not tested this.

It's important to state that this is for search. If your problem is classifying products into categories, you could run searches on the category names and decide on a confidence cutoff, but that may get clunky; a better option may be AWS Comprehend if you have time and money. Similarly, if your problem is one of "what does this person want to buy", you may have more luck with AWS Personalize [sic], which will do variations on product recommendation systems for you. This can group similar people and similar products together.


In [1]:
from sentence_transformers import SentenceTransformer, util
from textblob import TextBlob
import json
from opensearchpy import OpenSearch

## On first run you may have to uncomment these three and run them to get everything installed. I think this is just needed for the noun stripping so if you don't need that you don't need this.
# import nltk
# nltk.download('brown')
# nltk.download('punkt')

In [2]:
psg_gym_bench = 'The BodyMax CF302 Flat Bench with Dumbbell Rack allows you to create an exciting and varied workout to help strengthen, tone and promote weight loss, whilst also keeping your gym floor clutter free! Stylish and classic, this flat bench has been thoughtfully constructed with durable upholstery and high-density padding to ensure maximum comfort while you train.'

psg_tent = "The Theta 4 Tent is a four-person tunnel tent, uniquely featuring two sleeping cabins and a moveable front wall, allowing you to choose whether you have a covered porch or a larger living area inside your tent. With such an incredible size of 340(W) x 480(D) x 190cm(H), the Theta 4 has more than plenty of space inside – perfect for a family getaway, a couples’ retreat, or a trip shared among friends."

queries = ["gym", "holiday", "bench", "bra", "television", "exercise", "exercise equipment",
           "weights bench" , "rower", "weights", "tent", "outdoors", "camping"]

# What do the vectors look like?
Each passage or query is encoded into a vector of numbers. Here are the first ten values:

In [3]:
m = SentenceTransformer('msmarco-distilroberta-base-v2')
m.encode(psg_gym_bench).tolist()[0:10]

[-0.29896676540374756,
 -0.9481991529464722,
 0.7112562656402588,
 0.2055627405643463,
 -0.11523410677909851,
 -0.06249874085187912,
 0.48820844292640686,
 -0.6752977967262268,
 0.43621954321861267,
 0.6652036309242249]

In [4]:
m.encode("gym").tolist()[0:10]

[-0.0926285833120346,
 0.12392871081829071,
 -0.008582912385463715,
 0.4329494535923004,
 0.9176907539367676,
 -0.2596074044704437,
 0.34361034631729126,
 -0.06777438521385193,
 -0.22211872041225433,
 -0.6315351724624634]

## Choosing a model
Based on [this](https://www.sbert.net/examples/applications/semantic-search/README.html#symmetric-vs-asymmetric-semantic-search) link we are probably looking at an asymmetric search here (short queries against longer documents) so should favour those kinds of models. For this, the Bing-derived `msmarco` suite seems to be a good fit ([link](https://www.sbert.net/docs/pretrained-models/msmarco-v3.html)).

There is also a need to pick your distance measure, of which there are several, but the above links suggest that dot product and cosine are good places to start and cosine is often better for "short" descriptions. Do note that OpenSearch's k-NN search is `Approximate Nearest Neighbours` by default, which is optimised for fast search rather than exact results, and its distance measure is a separate setting on the index. You can switch that to cosine, but it may slow you down a bit.
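
One thing worth knowing when choosing between the two: if every vector is scaled to length 1 first, dot product and cosine similarity give the same number. A small sketch with made-up two-number vectors:

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalise(v):
    """Scale a vector so its length (L2 norm) is 1."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors, invented for illustration.
a = [3.0, 4.0]
b = [4.0, 3.0]

print(dot(a, b))                        # 24.0, depends on vector length
print(cosine(a, b))                     # roughly 0.96
print(dot(normalise(a), normalise(b)))  # roughly 0.96 again
```

In `sentence_transformers`, `model.encode(..., normalize_embeddings=True)` should give you unit-length vectors. Models tuned for dot product generally expect the raw, unnormalised vectors though, so treat this as something to test against your model rather than a rule.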


In [5]:
def try_model(mod_name, match_algo, just_nouns=False, psg_in=psg_gym_bench, search_terms=queries):
    """
    This function encodes the description passage into a vector and then compares it to the vector of each of the search terms.
    Results are presented in descending order.
    """
    if just_nouns:
        psg_in = TextBlob(psg_in).noun_phrases
        psg_in = " ".join(psg_in)
    model = SentenceTransformer(mod_name)
    passage_embedding_1 = model.encode([psg_in])
    search_terms = {q: model.encode(q) for q in search_terms}

    if match_algo.lower() == "dot":
        match_func = util.dot_score
    elif match_algo.lower() == "cosine":
        match_func = util.pytorch_cos_sim  # Older name for util.cos_sim, they compute the same thing
    else:
        raise ValueError("Invalid Match Function!")

    return sorted([(float(match_func(v, passage_embedding_1)), k) for k,v in search_terms.items()], reverse=True)

This one was recommended in the getting started section for semantic search [here](https://www.sbert.net/examples/applications/semantic-search/README.html#). It works the same with cosine and dot, but it's supposed to be cosine according to the docs.

In [6]:
try_model('multi-qa-MiniLM-L6-cos-v1', "cosine") # This was recommended in the tutorial

[(0.6330403089523315, 'weights bench'),
 (0.5748993158340454, 'bench'),
 (0.4849099814891815, 'gym'),
 (0.3846684694290161, 'exercise equipment'),
 (0.3389934301376343, 'weights'),
 (0.19906282424926758, 'exercise'),
 (0.16111910343170166, 'bra'),
 (0.16074158251285553, 'rower'),
 (0.1323268860578537, 'tent'),
 (0.10875529050827026, 'camping'),
 (0.030693121254444122, 'outdoors'),
 (-0.06338250637054443, 'television'),
 (-0.06988093256950378, 'holiday')]

In [7]:
try_model('multi-qa-MiniLM-L6-cos-v1', "cosine", just_nouns=True) # This was recommended in the tutorial

[(0.6421118974685669, 'weights bench'),
 (0.6113723516464233, 'bench'),
 (0.43030375242233276, 'gym'),
 (0.3676666021347046, 'weights'),
 (0.2904892861843109, 'exercise equipment'),
 (0.18365596234798431, 'exercise'),
 (0.1546277403831482, 'rower'),
 (0.15009814500808716, 'tent'),
 (0.14600059390068054, 'bra'),
 (0.09983384609222412, 'camping'),
 (0.07700767368078232, 'outdoors'),
 (-0.004586204886436462, 'television'),
 (-0.03594575822353363, 'holiday')]

The documentation for sbert suggests the v3 of this model is good for asymmetric search (when your queries are short and your documents long). My library couldn't download v3 so I grabbed v2.

In [8]:
try_model('msmarco-distilroberta-base-v2', "cosine")

[(0.6697664260864258, 'weights bench'),
 (0.5055959224700928, 'exercise equipment'),
 (0.4895481467247009, 'bench'),
 (0.3891954720020294, 'weights'),
 (0.36898455023765564, 'exercise'),
 (0.21264904737472534, 'bra'),
 (0.16299866139888763, 'camping'),
 (0.09984882175922394, 'rower'),
 (0.06350091099739075, 'gym'),
 (0.009487345814704895, 'outdoors'),
 (-0.0696445032954216, 'holiday'),
 (-0.07525771111249924, 'television'),
 (-0.10196186602115631, 'tent')]

In [9]:
try_model('msmarco-distilroberta-base-v2', "cosine", just_nouns=True)

[(0.5739995241165161, 'weights bench'),
 (0.4067000150680542, 'weights'),
 (0.38276442885398865, 'exercise equipment'),
 (0.33077865839004517, 'bench'),
 (0.2035306841135025, 'exercise'),
 (0.1936309039592743, 'camping'),
 (0.09358666092157364, 'bra'),
 (0.06547510623931885, 'outdoors'),
 (0.05741633102297783, 'gym'),
 (0.019634557887911797, 'rower'),
 (-0.07226055860519409, 'holiday'),
 (-0.07655307650566101, 'television'),
 (-0.1295216977596283, 'tent')]

In [10]:
try_model('msmarco-distilroberta-base-v2', "cosine", psg_in=psg_tent) # trying with tent

[(0.3185180425643921, 'tent'),
 (0.250637948513031, 'outdoors'),
 (0.13116911053657532, 'camping'),
 (0.08360445499420166, 'bench'),
 (0.06664702296257019, 'television'),
 (0.03521207720041275, 'exercise equipment'),
 (0.034268710762262344, 'holiday'),
 (0.0004281101282685995, 'weights bench'),
 (-0.01946995034813881, 'gym'),
 (-0.02163332886993885, 'bra'),
 (-0.02525954321026802, 'rower'),
 (-0.032034821808338165, 'exercise'),
 (-0.05440262332558632, 'weights')]

This is a dot optimised model, trying it just to see

In [11]:
# dot optimised
try_model('msmarco-distilbert-base-v4', "dot")

[(86.560546875, 'weights bench'),
 (64.37287139892578, 'gym'),
 (62.79330825805664, 'bench'),
 (57.10064697265625, 'exercise equipment'),
 (43.40785598754883, 'weights'),
 (41.65951919555664, 'exercise'),
 (29.711671829223633, 'rower'),
 (29.216854095458984, 'bra'),
 (14.020475387573242, 'tent'),
 (13.067646026611328, 'outdoors'),
 (9.146438598632812, 'camping'),
 (3.8990843296051025, 'television'),
 (-23.02558135986328, 'holiday')]

In [12]:
try_model('msmarco-distilbert-base-v4', "dot", just_nouns=True)

[(60.13253402709961, 'weights bench'),
 (41.857872009277344, 'bench'),
 (37.81570816040039, 'weights'),
 (36.69390869140625, 'exercise equipment'),
 (36.47909927368164, 'gym'),
 (16.717876434326172, 'rower'),
 (10.237001419067383, 'exercise'),
 (5.496613502502441, 'bra'),
 (3.109081506729126, 'tent'),
 (-1.6696975231170654, 'outdoors'),
 (-3.579359531402588, 'camping'),
 (-9.178805351257324, 'television'),
 (-26.839519500732422, 'holiday')]

# Reranking
This may be something worth looking into, I've not at this point. [Link](https://www.sbert.net/examples/applications/retrieve_rerank/README.html)
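
For a rough idea of the shape of it: you pull a shortlist with something cheap, then re-score only the shortlist with something slower and more precise. The sketch below is a generic two-stage function, not sbert's API. Both scoring functions are hypothetical word-overlap stand-ins so that it runs without any model; per the linked page, the precise stage would normally be a cross-encoder.

```python
def retrieve_then_rerank(query, docs, cheap_score, precise_score, shortlist=10, top_k=3):
    """Stage 1: shortlist docs by a cheap score. Stage 2: re-order the shortlist by a precise one."""
    shortlisted = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:shortlist]
    return sorted(shortlisted, key=lambda d: precise_score(query, d), reverse=True)[:top_k]


def shared_words(query, doc):
    """Stand-in cheap scorer: number of words the query and document share."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def jaccard(query, doc):
    """Stand-in precise scorer: shared words divided by all distinct words."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0


docs = ["park bench", "weights bench with dumbbell rack", "tunnel tent"]
print(retrieve_then_rerank("weights bench", docs, shared_words, jaccard, shortlist=2, top_k=2))
```

The point of the split is cost: the precise scorer only ever sees `shortlist` documents, however many you have indexed.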

# Opensearch
To see how this works IRL I've plumbed it into an OpenSearch instance. This is an older version that I had lying around; I'd def suggest you use the latest as this functionality is quite new. Documentation exists [here](https://opensearch.org/), and the K-nearest Neighbours (KNN) implementation that will find you the closest match is documented [here](https://opensearch.org/docs/latest/search-plugins/knn/index/).

The different ways it measures distance are documented [here](https://opensearch.org/docs/latest/search-plugins/knn/approximate-knn/#spaces). The default (I think) is `l2` (Euclidean distance), so to get optimum performance on the above models you probably want to flip it to `cosinesimil`.

In [13]:
# Sorry, not sharing my login here!
with open(r"C:\Users\robert.mansfield\.passes\test_es.json", "r") as f:
    creds = json.loads(f.read())

es = OpenSearch(hosts=creds["host"], http_auth=(creds['user'], creds['pass']))

my_idx = 'rob_semantic_test'
# es.cat.indices()

In [16]:
# This model did pretty well above but it does give a pretty big vector to search!
m = SentenceTransformer('msmarco-distilbert-base-v4')
len(m.encode(psg_gym_bench).tolist())

768

In [17]:
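# Create the index, dropping any earlier copy first. Should probably try to activate the cosine similarity at some point. Think my test ES is too old
try:
    es.indices.delete(index=my_idx)
except Exception:
    pass  # Most likely the index doesn't exist yet, which is fine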

idx = {
    "settings": {
        "index.knn": True,
    },
    "mappings": {
        "properties": {
            "vector": {
                "type": "knn_vector",
                "dimension": len(m.encode(psg_gym_bench).tolist()),  # This is the length of your model's output vector.
                # "method":{"space_type": "cosinesimil"}  # I think this is how you set this but you need a newer Opensearch than I have to hand.
            },
            "description": {
                "type": "text"
            }
        }
    }
}

es.indices.create(index=my_idx, body=idx)

{'acknowledged': True,
 'shards_acknowledged': True,
 'index': 'rob_semantic_test'}
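
For reference, this is roughly what I'd expect the index body to look like with cosine similarity switched on, going by the OpenSearch k-NN docs. It is untested here (my instance is too old). The helper name `knn_index_body` is made up for this sketch, and the `lucene` engine default is an assumption: which engines support `cosinesimil` depends on your OpenSearch version, so check the docs for yours.

```python
def knn_index_body(dimension, space_type="cosinesimil", engine="lucene", field="vector"):
    """Build an index definition with one knn_vector field and a text description field."""
    return {
        "settings": {"index.knn": True},
        "mappings": {
            "properties": {
                field: {
                    "type": "knn_vector",
                    "dimension": dimension,  # Must match the length of your model's output vector
                    "method": {
                        "name": "hnsw",            # Approximate nearest neighbour algorithm
                        "space_type": space_type,  # How distance is measured
                        "engine": engine,          # Assumption: pick one your version supports
                    },
                },
                "description": {"type": "text"},
            }
        },
    }


body = knn_index_body(768)
print(body["mappings"]["properties"]["vector"]["method"])
```

You would then pass `body` to `es.indices.create(index=my_idx, body=body)` in place of the `idx` dictionary above.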

In [18]:
# Add some documents
es.create(index=my_idx,
          body={"vector": m.encode(psg_gym_bench).tolist(), "description": psg_gym_bench},
          id=1
          )

es.create(index=my_idx,
          body={"vector": m.encode(psg_tent).tolist(), "description": psg_tent},
          id=2
          )


{'_index': 'rob_semantic_test',
 '_type': '_doc',
 '_id': '2',
 '_version': 1,
 'result': 'created',
 '_shards': {'total': 2, 'successful': 2, 'failed': 0},
 '_seq_no': 0,
 '_primary_term': 1}

In [19]:
def do_query(term):
    """
    Do a search based on distance to the given vector. Return just the score and the description
    """
    qry = {
        "size": 2,  # Max results
        "query": {
            "knn": {
                "vector": {  # This is the one with your column name, I stupidly called my column vector!
                    "vector": m.encode(term).tolist(),
                    "k": 2 # Max results per shard
                }
            }
        }
    }

    res = es.search(body=qry, index=my_idx)  # Keyword arguments work across client versions
    return [(h['_source']["description"], h["_score"]) for h in res["hits"]["hits"]]

In [20]:
do_query("gym")

[('The BodyMax CF302 Flat Bench with Dumbbell Rack allows you to create an exciting and varied workout to help strengthen, tone and promote weight loss, whilst also keeping your gym floor clutter free! Stylish and classic, this flat bench has been thoughtfully constructed with durable upholstery and high-density padding to ensure maximum comfort while you train.',
  0.004655081),
 ('The Theta 4 Tent is a four-person tunnel tent, uniquely featuring two sleeping cabins and a moveable front wall, allowing you to choose whether you have a covered porch or a larger living area inside your tent. With such an incredible size of 340(W) x 480(D) x 190cm(H), the Theta 4 has more than plenty of space inside – perfect for a family getaway, a couples’ retreat, or a trip shared among friends.',
  0.002962307)]

In [21]:
do_query("tent")

[('The Theta 4 Tent is a four-person tunnel tent, uniquely featuring two sleeping cabins and a moveable front wall, allowing you to choose whether you have a covered porch or a larger living area inside your tent. With such an incredible size of 340(W) x 480(D) x 190cm(H), the Theta 4 has more than plenty of space inside – perfect for a family getaway, a couples’ retreat, or a trip shared among friends.',
  0.005420531),
 ('The BodyMax CF302 Flat Bench with Dumbbell Rack allows you to create an exciting and varied workout to help strengthen, tone and promote weight loss, whilst also keeping your gym floor clutter free! Stylish and classic, this flat bench has been thoughtfully constructed with durable upholstery and high-density padding to ensure maximum comfort while you train.',
  0.0031871875)]

In [22]:
# Note the description doesn't actually include the word exercise anywhere!
do_query("exercise equipment")

[('The BodyMax CF302 Flat Bench with Dumbbell Rack allows you to create an exciting and varied workout to help strengthen, tone and promote weight loss, whilst also keeping your gym floor clutter free! Stylish and classic, this flat bench has been thoughtfully constructed with durable upholstery and high-density padding to ensure maximum comfort while you train.',
  0.0044708215),
 ('The Theta 4 Tent is a four-person tunnel tent, uniquely featuring two sleeping cabins and a moveable front wall, allowing you to choose whether you have a covered porch or a larger living area inside your tent. With such an incredible size of 340(W) x 480(D) x 190cm(H), the Theta 4 has more than plenty of space inside – perfect for a family getaway, a couples’ retreat, or a trip shared among friends.',
  0.0029711577)]

In [24]:
do_query("bench")

[('The BodyMax CF302 Flat Bench with Dumbbell Rack allows you to create an exciting and varied workout to help strengthen, tone and promote weight loss, whilst also keeping your gym floor clutter free! Stylish and classic, this flat bench has been thoughtfully constructed with durable upholstery and high-density padding to ensure maximum comfort while you train.',
  0.0045321896),
 ('The Theta 4 Tent is a four-person tunnel tent, uniquely featuring two sleeping cabins and a moveable front wall, allowing you to choose whether you have a covered porch or a larger living area inside your tent. With such an incredible size of 340(W) x 480(D) x 190cm(H), the Theta 4 has more than plenty of space inside – perfect for a family getaway, a couples’ retreat, or a trip shared among friends.',
  0.0027877921)]

In [25]:
do_query("cat")

[('The Theta 4 Tent is a four-person tunnel tent, uniquely featuring two sleeping cabins and a moveable front wall, allowing you to choose whether you have a covered porch or a larger living area inside your tent. With such an incredible size of 340(W) x 480(D) x 190cm(H), the Theta 4 has more than plenty of space inside – perfect for a family getaway, a couples’ retreat, or a trip shared among friends.',
  0.0027689987),
 ('The BodyMax CF302 Flat Bench with Dumbbell Rack allows you to create an exciting and varied workout to help strengthen, tone and promote weight loss, whilst also keeping your gym floor clutter free! Stylish and classic, this flat bench has been thoughtfully constructed with durable upholstery and high-density padding to ensure maximum comfort while you train.',
  0.0027312732)]

In [26]:
do_query("holiday")

[('The Theta 4 Tent is a four-person tunnel tent, uniquely featuring two sleeping cabins and a moveable front wall, allowing you to choose whether you have a covered porch or a larger living area inside your tent. With such an incredible size of 340(W) x 480(D) x 190cm(H), the Theta 4 has more than plenty of space inside – perfect for a family getaway, a couples’ retreat, or a trip shared among friends.',
  0.002747939),
 ('The BodyMax CF302 Flat Bench with Dumbbell Rack allows you to create an exciting and varied workout to help strengthen, tone and promote weight loss, whilst also keeping your gym floor clutter free! Stylish and classic, this flat bench has been thoughtfully constructed with durable upholstery and high-density padding to ensure maximum comfort while you train.',
  0.0025693618)]
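
The scores above look tiny compared with the cosine numbers earlier. That is because, assuming the index is using the default `l2` space, OpenSearch reports `1 / (1 + d)` where `d` is the squared Euclidean distance between the two vectors (per the k-NN spaces docs linked above). A small sketch to turn a score back into a distance; the function names are my own:

```python
import math


def l2_score_to_squared_distance(score):
    """Invert score = 1 / (1 + d), where d is the squared Euclidean distance."""
    return 1.0 / score - 1.0


def squared_distance_to_l2_score(squared_distance):
    """The forward direction: what OpenSearch reports for an l2 index."""
    return 1.0 / (1.0 + squared_distance)


gym_score = 0.004655081  # Score of the bench description for the "gym" query above
squared = l2_score_to_squared_distance(gym_score)

print(round(squared, 1))             # squared distance, a bit under 214
print(round(math.sqrt(squared), 1))  # plain Euclidean distance
```

So a score of 0.0047 versus 0.0030 is still a meaningful gap. The vectors are just a long way apart in absolute terms because they aren't normalised.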