# Semantic search

Copyleft 2022 Forrest Sheng Bao, [a.k.a. Prof. Kung Fun 孔方教授](https://www.youtube.com/c/ForrestBao/videos)

Conventional text search (for example, pressing Ctrl+F in your browser) is basically string matching, and it is not very smart. For example, if you are a lawyer in China and want to find all the cases your firm has handled in the US, you will have to search once for "U.S.", a second time for "United States", a third time for "America", and so on.

In contrast, semantic search "understands" your query and matches "U.S.", "U.S.A.", "America", and "United States" all at once.
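To make the limitation of string matching concrete, here is a minimal sketch in plain Python. The helper name `substring_search` is made up for this illustration, and the three sentences are borrowed from Demo 1 further down. It only imitates what Ctrl+F roughly does; it is not how any particular browser implements search.

```python
# Plain substring search, roughly what Ctrl+F does.
documents = [
    "We helped Alibaba go IPO in the U.S.",
    "We defended our clients in the United States.",
    "We have practiced laws for 20 years in America.",
]

def substring_search(query, documents):
    """Return the documents that contain the query as an exact substring.

    Hypothetical helper, written only for this illustration.
    """
    return [doc for doc in documents if query in doc]

# Each spelling of the country has to be searched separately.
for query in ["U.S.", "United States", "America", "U.S.A."]:
    print(query.ljust(15), "->", substring_search(query, documents))
```

Each of the first three queries finds only the one sentence that spells the country the same way, and "U.S.A." finds nothing at all.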

We will build our semantic search on [Sentence-BERT](https://www.sbert.net/) (EMNLP 2019), which is trained by forcing the model to learn sentence similarities.

## Preparation

In [1]:
# ! pip3 install -U sentence-transformers
# ! pip3 install torch

import torch
import sentence_transformers # import SentenceTransformer, util
import typing

embedder = sentence_transformers.SentenceTransformer('all-MiniLM-L6-v2') 

## Then comes our function

In [2]:
def semantic_search(query:str, documents:typing.List[str], number_of_matches = 5):
    """Search a list of _documents_ against a query
    """

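    # Embed the query and every document into the same vector space
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    document_embeddings = embedder.encode(documents, convert_to_tensor=True)
    # Cosine similarity between the query and each document
    cos_scores = sentence_transformers.util.cos_sim(query_embedding, document_embeddings)[0]
    # torch.topk raises an error if k exceeds the number of documents, so cap it
    top_matches = torch.topk(cos_scores, k=min(number_of_matches, len(documents)))

    # Pair each top-scoring document with its score, best match first
    top_matching_documents = [(documents[idx], score) for score, idx in zip(top_matches[0], top_matches[1]) ]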

    for document, score in top_matching_documents:
        print (document.ljust(80, "-"), " {:.2f}% match".format(score*100))

    return top_matching_documents
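
For readers curious about what the cosine-similarity and top-k steps compute, here is a rough pure-Python sketch. The vectors are made-up 3-dimensional toys standing in for real sentence embeddings, and the helper names `cosine_similarity` and `top_k` are invented for this illustration; the function above relies on the library and PyTorch versions instead.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(scores, k):
    """Return (score, index) pairs for the k highest scores, best first."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return [(score, idx) for idx, score in ranked[:k]]

# Made-up toy vectors, NOT real embeddings.
toy_query = [1.0, 0.0, 0.0]
toy_documents = [
    [0.0, 1.0, 0.0],  # perpendicular to the query
    [1.0, 1.0, 0.0],  # 45 degrees away from the query
    [2.0, 0.0, 0.0],  # same direction as the query, different length
]

scores = [cosine_similarity(toy_query, doc) for doc in toy_documents]
print(top_k(scores, k=2))
```

Note that the third toy vector scores highest even though it is twice as long as the query: cosine similarity looks only at direction, not magnitude.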


## Now, let's spin!

### Demo 1

We will search for the query "U.S.A." You will see how the sentences below match the query even when "US", "United States", or "America" is not in the sentence.

In [3]:
documents = [
    "We helped Alibaba go IPO in the U.S.",
    "We defended our clients in the United States.",
    "We have practiced laws for 20 years in America.", 
    "We have offices in California, Delaware, and Iowa.",
    "We love ramen.",
    "Suits is what we wear and our favorite show."      
]

query = "U.S.A."

_ = semantic_search(query, documents, number_of_matches=len(documents))

We defended our clients in the United States.-----------------------------------  39.46% match
We helped Alibaba go IPO in the U.S.--------------------------------------------  28.98% match
We have offices in California, Delaware, and Iowa.------------------------------  25.40% match
We have practiced laws for 20 years in America.---------------------------------  24.19% match
Suits is what we wear and our favorite show.------------------------------------  12.52% match
We love ramen.------------------------------------------------------------------  11.28% match


A clear cliff in match scores can be seen when you move from the sentences about the US or US states to the ones about "ramen" and "suits".
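
If you want to find such a cliff automatically, one simple heuristic is to cut the ranked list at the largest drop between neighbouring scores. The sketch below is only an illustration of that idea, not part of the search function: the helper name `cut_at_largest_gap` is made up, and the scores are copied by hand from the Demo 1 output above.

```python
# Scores copied from the Demo 1 output, best match first (in percent).
scores = [39.46, 28.98, 25.40, 24.19, 12.52, 11.28]

def cut_at_largest_gap(scores):
    """Return how many top results sit above the largest drop.

    Hypothetical helper: assumes scores are sorted from best to worst
    and that there are at least two of them.
    """
    gaps = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]
    return gaps.index(max(gaps)) + 1

keep = cut_at_largest_gap(scores)
print("Keep the top", keep, "results")
```

On these numbers the largest drop falls right before the "suits" and "ramen" sentences. The drop after the very first result is almost as large, though, so treat this heuristic as fragile and check it against your own data.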


If we change the query to "France", all the sentences have low match scores:

In [4]:
_ = semantic_search("France", documents, number_of_matches=len(documents))

We love ramen.------------------------------------------------------------------  16.16% match
We helped Alibaba go IPO in the U.S.--------------------------------------------  13.26% match
We have offices in California, Delaware, and Iowa.------------------------------  12.95% match
We have practiced laws for 20 years in America.---------------------------------  12.52% match
Suits is what we wear and our favorite show.------------------------------------  10.48% match
We defended our clients in the United States.-----------------------------------  10.26% match


### Demo 2

Another thing that bothers people is spelling variation, e.g., "email" vs. "e-mail". Conventional search cannot treat the two as the same. But semantic search can!

In [7]:
documents = [
    "We love email over WeChat.",
    "I sent you an E-mail.",
    "I do not get any mail today.",
    "In an Open Letter to the citizens, he said no. " 
]

query = "e-mail"

_ = semantic_search(query, documents, number_of_matches=len(documents))

I sent you an E-mail.-----------------------------------------------------------  56.72% match
We love email over WeChat.------------------------------------------------------  53.48% match
I do not get any mail today.----------------------------------------------------  40.75% match
In an Open Letter to the citizens, he said no. ---------------------------------  6.82% match


As you can see, there is a clear match score cliff from "E-mail" and "email" to "mail", and an even steeper one down to "Open Letter". And the query, "e-mail", does not even appear verbatim in the original text.

# How can I use this powerful thing in my company? 

Want to use the latest NLP technology in your company but have no NLP engineer? 
Visit http://nlp.llc or email forrest dot bao at gmail dot com and we will help you set it up! 

NLP, LLC., the power of text understanding in the hands of everyone! 