# **Goal: Create a code explanation for each cell as text below it.**

**Creating a hybrid search system using**
* Embeddings for semantic search (sentence_transformers)
* BM25 for keyword ranking (Sparse retrieval)
* FAISS as an index.









In [54]:
# !pip install sentence-transformers

In [55]:
# !pip install rank_bm25

In [56]:
# !pip install faiss-cpu

In [57]:
import sentence_transformers

In [58]:
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import faiss

### ^Importing the libraries needed for the hybrid search system

In [59]:
documents = [
    "Artificial Intelligence is changing the world.",
    "Machine Learning is a subset of AI.",
    "Deep Learning is a subset of Machine Learning.",
    "Natural Language Processing involves understanding text.",
    "Computer Vision allows machines to see and understand.",
    "AI includes areas like NLP and Computer Vision.",
    "The Pyramids of Giza are architectural marvels.",
    "Mozart was a prolific composer during the classical era.",
    "Mount Everest is the tallest mountain on Earth.",
    "The Nile is one of the world's longest rivers.",
    "Van Gogh's Starry Night is a popular piece of art.",
    "Basketball is a sport played with a round ball and two teams."
]

In [60]:
query = "Tell me about AI in text and vision."

### ^Writing sample documents and query to test the retrieval system

In [61]:
tokenized_corpus = [doc.split(" ") for doc in documents]

### ^Splitting each document into tokens on single spaces (case and punctuation are kept as-is)
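Splitting on a single space keeps punctuation and case, so a query token such as `vision.` will not match `Vision` in a document. A minimal sketch of a normalizing tokenizer, assuming lowercase alphanumeric tokens are acceptable (the `tokenize` helper is hypothetical and not part of this notebook):

```python
import re

def tokenize(text):
    """Lowercase the text and return its alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

tokens = tokenize("Tell me about AI in text and vision.")
print(tokens)
```

Applying such a tokenizer to both the corpus and the query would change the BM25 scores, so the outputs shown later in this notebook would no longer match exactly.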

In [62]:
bm25 = BM25Okapi(tokenized_corpus)

### ^Initializing the BM25Okapi model with the tokenized corpus. This enables keyword-based ranking of the documents by their relevance to the query.
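For intuition about what a BM25 score measures, here is a pure-Python sketch of a textbook BM25 formula. It is an illustration only: the `bm25_scores` helper, the toy documents, and the smoothed IDF are assumptions, and `rank_bm25` uses its own IDF variant, so the numbers will not match `BM25Okapi` exactly.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with a textbook BM25 formula."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(d) for d in tokenized_docs) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for d in tokenized_docs for term in set(d))
    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Smoothed IDF (always positive); an assumption of this sketch.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Length normalization: long documents are penalized via b.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

# Made-up toy corpus, already tokenized.
toy_docs = [["ai", "and", "vision"], ["mount", "everest"], ["ai", "text"]]
toy_scores = bm25_scores(["ai", "text"], toy_docs)
print(toy_scores)
```

Documents that share no token with the query score zero, and a document matching more (or rarer) query terms scores higher.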

In [63]:
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

In [64]:
document_embeddings = model.encode(documents)

### ^Loads the paraphrase-MiniLM-L6-v2 model and uses it to convert each document (sentence) into a vector embedding.

In [65]:
index = faiss.IndexFlatL2(document_embeddings.shape[1])

### ^Creates an empty FAISS flat index for exact similarity search using L2 (Euclidean) distance; its dimension matches the embedding size
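A flat L2 index compares the query against every stored vector. The numpy sketch below shows that exhaustive search for intuition only; the `l2_search` helper and the toy vectors are hypothetical, and this is not how FAISS is implemented internally. Like `IndexFlatL2`, it reports squared distances.

```python
import numpy as np

def l2_search(doc_vectors, query_vector, k):
    """Return (squared distances, indices) of the k rows nearest to the query."""
    diffs = doc_vectors - query_vector        # broadcast the query over all rows
    sq_dists = np.sum(diffs ** 2, axis=1)     # squared L2 distance per document
    order = np.argsort(sq_dists)[:k]          # smallest distance first
    return sq_dists[order], order

# Made-up 2-dimensional "embeddings".
vectors = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]], dtype="float32")
dists, ids = l2_search(vectors, np.array([0.9, 0.0], dtype="float32"), k=2)
print(ids, dists)
```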

In [66]:
index.add(np.array(document_embeddings).astype('float32'))


### ^Converts the document embeddings to float32 (the dtype FAISS expects) and adds them to the FAISS index, making them searchable. Note that the later steps search sub_index rather than this full index.

In [67]:
top_n = 10

### ^Defines top_n = 10, so each search stage keeps the 10 best-matching documents for the query

In [68]:
bm25_scores = bm25.get_scores(query.split(" "))

### ^Computes a BM25 relevance score for every document against the query tokens

In [69]:
top_docs_indices = np.argsort(bm25_scores)[-top_n:]

### ^Selects the indices of the top_n (10) documents with the highest BM25 scores; np.argsort sorts in ascending order, so the best match is last
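Because `np.argsort` sorts in ascending order, the slice `[-top_n:]` holds the highest-scoring indices with the best one last. In this notebook that order does not affect the final ranking, since the candidates are reranked afterwards. A small illustration with made-up scores:

```python
import numpy as np

scores = np.array([0.0, 2.5, 0.7, 1.9])   # hypothetical BM25 scores
top_k = 3

ascending = np.argsort(scores)[-top_k:]   # best document is last
descending = ascending[::-1]              # best document is first
print(ascending, descending)
```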

In [70]:
top_docs_embeddings = [document_embeddings[i] for i in top_docs_indices]

### ^Collects the embeddings of the top BM25-ranked documents

In [71]:
query_embedding = model.encode([query])

### ^Encodes the query into a vector embedding

In [72]:
sub_index = faiss.IndexFlatL2(top_docs_embeddings[0].shape[0])

### ^Creates an empty FAISS index for the candidate documents, again using L2 distance search.

In [73]:
sub_index.add(np.array(top_docs_embeddings).astype('float32'))

### ^Converts the embeddings of the top documents to float32 and adds them to sub_index

In [74]:
_, sub_dense_ranked_indices = sub_index.search(np.array(query_embedding).astype('float32'), top_n)

### ^Searches sub_index for the top_n embeddings nearest to the query embedding; the distances are discarded and only the indices are kept

In [75]:
sub_dense_ranked_indices


array([[9, 8, 1, 0, 6, 7, 2, 4, 3, 5]])

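### ^Shows the search result: positions within sub_index (that is, within top_docs_indices), ordered from nearest to farthest from the query. These are not yet positions in the original documents list.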

In [76]:
final_ranked_indices = [top_docs_indices[i] for i in sub_dense_ranked_indices[0]]

### ^Converts the top semantic search results to their positions in the original documents list.
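The mapping step is a plain index lookup: each position returned by the sub-index search selects one entry of the candidate list. A tiny illustration with hypothetical values (not the ones from this notebook):

```python
# Hypothetical values for illustration only.
candidate_ids = [7, 2, 5]   # positions in the original documents list
sub_ranking = [2, 0, 1]     # positions inside the candidate subset, best first

# Translate subset positions back to original document positions.
final_ids = [candidate_ids[i] for i in sub_ranking]
print(final_ids)
```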

In [77]:
ranked_docs = [documents[i] for i in final_ranked_indices]

### ^Retrieves the text of the top documents in order of relevance

In [78]:
ranked_docs

['AI includes areas like NLP and Computer Vision.',
 'Computer Vision allows machines to see and understand.',
 'Natural Language Processing involves understanding text.',
 'Deep Learning is a subset of Machine Learning.',
 "Van Gogh's Starry Night is a popular piece of art.",
 'Basketball is a sport played with a round ball and two teams.',
 'Mozart was a prolific composer during the classical era.',
 "The Nile is one of the world's longest rivers.",
 'The Pyramids of Giza are architectural marvels.',
 'Mount Everest is the tallest mountain on Earth.']

# Provide a brief description of the process this code implements.

This code implements a hybrid document retrieval process. It is hybrid because it combines keyword-based and semantic search:
* It scores documents by how well their words match the words in the query (BM25)

* It turns the documents and the query into vector embeddings (SentenceTransformer) to capture their meaning

* It uses a FAISS index to find the documents whose embeddings are closest to the query embedding

* In the end, it takes the top keyword matches and reorders them by semantic similarity
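The two stages described above can be condensed into one function. The sketch below uses only numpy and made-up inputs so that it runs without the models; the `hybrid_rerank` helper, the toy scores, and the 2-dimensional vectors are assumptions for illustration, standing in for the BM25 scores and sentence embeddings computed in the notebook.

```python
import numpy as np

def hybrid_rerank(keyword_scores, doc_vectors, query_vector, top_n):
    """Keep the top_n documents by keyword score, then order them by L2 distance."""
    keyword_scores = np.asarray(keyword_scores)
    candidates = np.argsort(keyword_scores)[-top_n:]      # stage 1: keyword filter
    sq_dists = np.sum((doc_vectors[candidates] - query_vector) ** 2, axis=1)
    return candidates[np.argsort(sq_dists)].tolist()      # stage 2: semantic rerank

# Toy inputs: 4 documents with made-up 2-dimensional embeddings.
doc_vectors = np.array([[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [5.0, 5.0]])
keyword_scores = [0.2, 0.0, 1.3, 0.8]
query_vector = np.array([1.0, 0.0])

ranking = hybrid_rerank(keyword_scores, doc_vectors, query_vector, top_n=3)
print(ranking)
```

In this toy example, document 1 is the closest vector to the query but has the lowest keyword score, so the keyword stage removes it before the semantic stage runs. That is the trade-off of filtering by keywords first.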