<a href="https://colab.research.google.com/github/amitkag85/AILearning/blob/master/10_Semantic_Search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Semantic Search
The core concept of [semantic search](https://en.wikipedia.org/wiki/Semantic_search) is that you can search by **meaning** rather than exact string matching. This has many advantages, such as the ability to find items that mean the same thing but are different on the surface level (e.g. synonyms).

The building blocks for this are:
*   prepping data (e.g. splitting a document into sub-strings)
*   a vectorizer (something that turns input into a vector)
*   an (efficient) nearest neighbor algorithm
*   a way to store and retrieve metadata (e.g. a `dict`, relational database, etc.)

There are two main operations we need to cover, namely indexing (writing) and querying (reading). Of course you could optionally handle updating and deleting as well, but we will ignore that for now.

It should be noted that this can work with **any** type of data that you can turn into a vector representation (images, audio, etc.) but for this example we will be working with text.
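
To make the building blocks concrete, here is a minimal sketch of the two operations (indexing and querying) using only NumPy. Everything in it is a hypothetical stand-in: `VOCAB`, `toy_vectorize`, and `ToyIndex` are illustration names, and the word-count vectorizer does not capture meaning the way a trained model does. It only shows how the pieces fit together.

```python
import numpy as np

# Hypothetical toy vocabulary; a real system would use a trained model instead.
VOCAB = ['cat', 'dog', 'pet', 'stock', 'market', 'price']


def toy_vectorize(text):
    """Count vocabulary words in the text and scale the result to unit length."""
    words = text.lower().split()
    vec = np.array([words.count(w) for w in VOCAB], dtype='float32')
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


class ToyIndex:
    """Brute-force nearest neighbor search plus a list for metadata."""

    def __init__(self):
        self.vectors = []   # one vector per indexed item
        self.metadata = []  # the original string for each vector

    def add(self, text):
        # indexing (writing): vectorize and store alongside the metadata
        self.vectors.append(toy_vectorize(text))
        self.metadata.append(text)

    def query(self, text, k=1):
        # querying (reading): score every stored vector, keep the best k
        scores = np.stack(self.vectors) @ toy_vectorize(text)
        best = np.argsort(-scores)[:k]
        return [(self.metadata[i], float(scores[i])) for i in best]


toy_index = ToyIndex()
for doc in ['my cat is a lovely pet', 'the stock market fell', 'dog and cat']:
    toy_index.add(doc)

toy_result = toy_index.query('stock price', k=1)
print(toy_result)
```

The rest of this notebook swaps each toy piece for a real one: a trained model for the vectorizer and FAISS for the nearest neighbor search.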

## Installing Libraries

In [None]:
!pip install beautifulsoup4 faiss-cpu InstructorEmbedding lxml nltk numpy requests scikit-learn sentence-transformers==2.2.2 torch tqdm

## Splitting Text Documents
When working with text, you often do not want to treat an entire document as a single entity. This can be for practical reasons, such as it being too long to use as an input to your vectorizer, or for design reasons, such as wanting to search for locations within a document.

Here we will be working with sentences within a document. In order to split a string into sentences, we can use an `nltk` tokenizer called `punkt`. Under the hood this looks for periods and other punctuation that generally denote the end of a sentence, and uses a pre-trained unsupervised model of abbreviations, collocations, and sentence-starting words to decide which of them are real boundaries. It is not perfect, but it will get the job done.

In [None]:
import nltk.data
nltk.download('punkt')
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')

text = '''
Punkt knows that the periods in Mr. Smith and Johann S. Bach
do not mark sentence boundaries.  And sometimes sentences
can start with non-capitalized words.  i is a good variable
name.
'''
for i, s in enumerate(sent_detector.tokenize(text.strip())):
    print(i, s)

## Vectorizing Text
This is the true secret sauce of semantic search. All of the meaning that we will ultimately be trying to search by must be encoded into a vector *somehow*. This is typically done with an ML model, more specifically a deep neural network, which is trained on some notion of **similarity** between inputs. Since the model's definition of similarity completely determines how results will be calculated, choosing a model that is trained on the domain of your task is critical.

**NOTE:** Often the term `embedding` is used interchangeably with vectorizing. They effectively mean the same thing.

The [BERT](https://en.wikipedia.org/wiki/BERT_(language_model)) family of models is typically used for vectorizing text. Specifically when using sentences as inputs, a sentence-transformer is used, which is just a specific way of formatting the output from a BERT model.

A recent paper introduced the [Instructor](https://github.com/HKUNLP/instructor-embedding) model, which takes an instruction along with each input. The instruction defines the domain, input type, and task used to calculate similarity. It was trained on a wide variety of (English) domains and embedding tasks, and at the time of writing is one of the best performing options available.

In [None]:
from InstructorEmbedding import INSTRUCTOR
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = INSTRUCTOR('hkunlp/instructor-large')
sentences_a = [['Represent the Science sentence: ',
                'Parton energy loss in QCD matter'],
               ['Represent the Financial statement: ',
                'The Federal Reserve on Wednesday raised its benchmark interest rate.']]

sentences_b = [['Represent the Science sentence: ',
                'The Chiral Phase Transition in Dissipative Dynamics'],
               ['Represent the Financial statement: ',
                'The funds rose less than 0.5 per cent on Friday']]

embeddings_a = vectorizer.encode(sentences_a)
embeddings_b = vectorizer.encode(sentences_b)
similarities = cosine_similarity(embeddings_a,embeddings_b)
print(similarities)
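
As a rough sketch of what `cosine_similarity` computes, the function below scales every row to unit length and then takes dot products. The name `cosine_similarity_matrix` and the toy arrays are made up for illustration and are not part of scikit-learn.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    # scale each row to unit length
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    # dot products of unit vectors are cosine similarities
    return a_unit @ b_unit.T


toy_a = np.array([[1.0, 0.0], [1.0, 1.0]])
toy_b = np.array([[2.0, 0.0], [0.0, 3.0]])
toy_sims = cosine_similarity_matrix(toy_a, toy_b)
print(toy_sims)
```

Entry `[i, j]` compares row `i` of the first input with row `j` of the second. If the vectors are already unit length, the scaling step changes nothing and a plain inner product gives the same numbers.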

[FAISS](https://github.com/facebookresearch/faiss/wiki) (Facebook AI Similarity Search) is an efficient library for exact and [approximate nearest neighbors](https://ignite.apache.org/docs/latest/machine-learning/binary-classification/ann#:~:text=An%20approximate%20nearest%20neighbor%20search,good%20as%20the%20exact%20one.) search.

In [None]:
embeddings_a.shape

In [None]:
import faiss

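# 768 is the embedding size reported by embeddings_a.shape above
# 'Flat' is an exact (brute-force) index, here scored by inner product
index = faiss.index_factory(768, 'Flat', faiss.METRIC_INNER_PRODUCT)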

## Speeding Things Up
You may have noticed that running our vectorizer earlier took a long time. Instructor is a large [transformer](https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)) and thus does a **LOT** of computation. This can be sped up by using a GPU if you have one available. Fortunately, Google Colab offers GPU runtimes (go to the Runtime menu, select Change Runtime Type, select GPU from the list, and click Save).

**NOTE:** Changing the runtime type will restart the machine, so you will need to run everything again.

The Instructor model that we are using was implemented using [PyTorch](https://pytorch.org/), so we can use the following code to put our model on the GPU.

In [None]:
import torch

# use the GPU (cuda) if we can, otherwise use CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)

# move the model to the selected device (the GPU if one was found)
vectorizer = vectorizer.to(device)

## Toy Data

We will be using the transcript of the [Quanex Building Products Corporation (NX) Q2 2023 Earnings Call](https://news.alphastreet.com/quanex-building-products-corporation-nx-q2-2023-earnings-call-transcript/).

In [None]:
import requests
from bs4 import BeautifulSoup

# the URL of the transcript
url = 'https://news.alphastreet.com/quanex-building-products-corporation-nx-q2-2023-earnings-call-transcript/'
response = requests.get(url)
response.raise_for_status()  # stop early if the download failed

# parse the HTML so that we can work with it easily
soup = BeautifulSoup(response.content, 'lxml')

sentences = []

# the transcript itself is inside this particular div
for tag in soup.find_all('div', {'class': 'highlighter-content'}):
    # split each block into individual sentences
    sentences.extend(sent_detector.tokenize(tag.text.strip()))

print(len(sentences))

In [None]:
sentences[:10]

## Batching
When using deep learning models, you generally want to batch your inputs. This means passing multiple inputs to the model at the same time, which allows us to use as much parallel compute power as we have. The helper function below splits a list of inputs into a list of lists of at most size `n`.

**NOTE:** The choice of batch size (`n` in this function) depends on how much memory your device has, so expect some trial and error.

In [None]:
def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]
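
A quick illustration of the helper on a small list (the function is repeated so the snippet runs on its own; `example_batches` is just an illustration name):

```python
def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]


# 7 items in batches of 3: the last batch holds the leftover item
example_batches = list(chunks(list(range(7)), 3))
print(example_batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

The last batch is simply shorter when the list length is not a multiple of `n`.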

## Vectorizing
Now we will process each sentence from our dataset into vectors. For Instructor, this requires pairing each sentence with an instruction string telling the model how to determine similarity. The steps are:

*   pair each sentence with the instruction
*   create batches of size `batch_size` to pass to the model
*   pass each batch to the model and collect the vectors in a list
*   turn the list of vectors into a 2D numpy array that FAISS expects


In [None]:
import numpy as np
from tqdm import tqdm  # for a progress bar

instruction = 'Represent the financial sentence for information retrieval: '

batch_size = 40
batches = list(chunks([[instruction, sentence] for sentence in sentences],
                      batch_size))

vectors = []
with torch.no_grad():
    for batch in tqdm(batches):
        vectors.extend(vectorizer.encode(batch))

# make our list of vectors into a 2D array that FAISS needs
vectors = np.array(vectors)
print(vectors.shape)

## Indexing
Here we add our vectors to the FAISS index.

In [None]:
index.add(vectors)
print(index.ntotal)

## Searching
Now we can finally search through our data. To do this, we need to convert our query into a vector just like we did when we added our dataset to the index. Then we search the index, asking for `k` results.

Searching a FAISS index returns two things: the score of each result relative to the query (FAISS calls these distances), and the ID of each result. Both of these are 2D NumPy arrays. The outer dimension exists because you query FAISS with a **batch** of queries (even a single query must be passed as a batch of one). So the results always have the shape `(num_queries, num_results)`.
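
The `(num_queries, num_results)` shape can be seen in a small brute-force imitation written with NumPy only. This is a sketch of the output layout, not of how FAISS works internally, and all names and random data here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_stored = rng.random((10, 4), dtype=np.float32)   # 10 stored vectors, 4 dimensions
toy_queries = rng.random((2, 4), dtype=np.float32)   # a batch of 2 queries
toy_k = 3                                            # results per query

# inner product of every query with every stored vector: shape (2, 10)
toy_scores = toy_queries @ toy_stored.T

# positions of the toy_k highest scores for each query, best first
toy_I = np.argsort(-toy_scores, axis=1)[:, :toy_k]
# the scores that belong to those positions
toy_D = np.take_along_axis(toy_scores, toy_I, axis=1)

print(toy_D.shape, toy_I.shape)  # (2, 3) (2, 3)
```

Row `0` of each array holds the results for the first query, which is why the search cell in this notebook reads `I[0]` and `D[0]`.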

In [None]:
# a string to search for
query = 'statements that represent opportunities for the business'
query_vector = vectorizer.encode([[instruction, query]])
k = 5  # how many results to return

# D is the score of each result (inner product here, which equals
# cosine similarity only when the vectors are unit length)
# I is the ID of each result
D, I = index.search(query_vector, k)
print('matched IDs', I)

# the ID of the match is based on the order the vectors were inserted
for i, sentence_id in enumerate(I[0]):
    print(sentences[sentence_id], D[0][i])

## Saving and Loading

Reading and writing the index from disk is straightforward. As written, this code will save the index to a location that will disappear once the runtime is disconnected. To save it to your Google Drive, click the `Mount Drive` button and choose a path to save it to.

**NOTE:** Keep in mind that saving the index alone does not save the metadata (in our case, the sentences list). You would need to handle that separately.

In [None]:
faiss.write_index(index, 'transcript.index')
index = faiss.read_index('transcript.index')
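
One simple way to handle the metadata separately is to write the list of sentences to a JSON file next to the index. This is a minimal sketch: the helper names, the file name, and the example sentences are all hypothetical, and a temporary directory is used so the snippet runs anywhere.

```python
import json
import os
import tempfile


def save_metadata(items, path):
    """Write the list of strings to disk as JSON, preserving order."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(items, f)


def load_metadata(path):
    """Read the list of strings back in the same order."""
    with open(path, encoding='utf-8') as f:
        return json.load(f)


example_sentences = ['First sentence.', 'Second sentence.']
metadata_path = os.path.join(tempfile.mkdtemp(), 'transcript.sentences.json')

save_metadata(example_sentences, metadata_path)
restored = load_metadata(metadata_path)
print(restored == example_sentences)  # True
```

Order matters here because the FAISS ID of a vector is its position in the list, so the list has to come back exactly as it was saved.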

# Going Further
This is the bare minimum you need in order to get started with semantic search. However, this only scratches the surface of what needs to happen to go beyond a toy example.

## Index Types
Here we used the `"Flat"` FAISS index type, which performs an exhaustive (exact) k-nearest neighbor search. While this is the simplest type, it is also the slowest at scale, because every query is compared against every stored vector, and it is generally only practical for up to roughly tens of thousands of vectors. There are many other, more efficient index types to choose from depending on your needs. [Here](https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index) is a good guide for deciding how to configure FAISS.

## Metadata
In this example, we used an in-memory list of strings to store our sentences, and the ID returned by FAISS corresponded to the position of the associated sentence in that list. If your application requires more than simple lookups, or your data is too large to fit into memory all at once, you likely need something like a relational database. The main consideration is that FAISS IDs can only be integers, so you need a way to coordinate integer IDs between the index and the database. Some index types (for example, an index wrapped in `IndexIDMap`) let you provide explicit IDs when adding vectors via `add_with_ids`, so that the ID is not tied to the order in which vectors were added.
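
A minimal sketch of the database side, using Python's built-in `sqlite3` with an in-memory database. The table layout, the `lookup` helper, and the placeholder rows are all hypothetical; the point is only that an integer primary key can play the role of the FAISS ID.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE sentences (id INTEGER PRIMARY KEY, text TEXT, speaker TEXT)')

# placeholder rows; the id column would match the IDs stored in the index
rows = [(0, 'example sentence zero', 'speaker A'),
        (1, 'example sentence one', 'speaker B'),
        (2, 'example sentence two', 'speaker A')]
conn.executemany('INSERT INTO sentences VALUES (?, ?, ?)', rows)
conn.commit()


def lookup(conn, ids):
    """Fetch rows for the given IDs, keeping the order of the IDs."""
    # FAISS returns NumPy integers, so convert them to plain ints first
    ids = [int(i) for i in ids]
    placeholders = ','.join('?' for _ in ids)
    cursor = conn.execute(
        f'SELECT id, text, speaker FROM sentences WHERE id IN ({placeholders})',
        ids)
    found = {row[0]: row for row in cursor}
    return [found[i] for i in ids]


matches = lookup(conn, [2, 0])
print(matches)
```

Keeping the result order matters because the IDs come back from the index ranked by score, while SQL makes no promise about the order of an `IN` query.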

## Batteries Included Options
Here we intentionally focused on the index aspect of semantic search in order to get a better understanding of the key components it uses. However, in practice it can often be preferable to use a service that handles most or all of these pieces directly. Some options are:

*   [Milvus](https://milvus.io/) is an open source vector index + database. It still requires you to handle vectorization.
*   [Pinecone](https://www.pinecone.io/) is a closed-source, managed, cloud-based vector index + database. Again, you need to handle vectorization.
*   [Weaviate](https://weaviate.io/) is an open source vector index + database that also offers support for vectorization and has a managed cloud option.
*   [Hugging Face](https://huggingface.co/inference-endpoints) offers inference endpoints for a number of models, which you could use for vectorization.
*   [OpenAI](https://platform.openai.com/docs/guides/embeddings) has an embedding model that you can use via their API.

Every option has its own strengths, weaknesses, and considerations. As a rule of thumb, for prototyping or small local projects, the setup we used here is fine. If you need something that will be available over the internet, you can still host what we have, but scaling and maintaining it can quickly become a major undertaking. In those situations you are probably better off using a more robust option unless you have a good reason not to.
