# **Goal: Create a code explanation for each cell as text below it.**

**Creating a hybrid search system using:**
* Embeddings for semantic search (sentence_transformers)
* BM25 for keyword ranking (sparse retrieval)
* FAISS as the vector index

In [None]:
# !pip install sentence-transformers

In [None]:
# !pip install rank_bm25

In [None]:
# !pip install faiss-cpu

In [1]:
import sentence_transformers



In [4]:
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import faiss

In [5]:
documents = [
    "Artificial Intelligence is changing the world.",
    "Machine Learning is a subset of AI.",
    "Deep Learning is a subset of Machine Learning.",
    "Natural Language Processing involves understanding text.",
    "Computer Vision allows machines to see and understand.",
    "AI includes areas like NLP and Computer Vision.",
    "The Pyramids of Giza are architectural marvels.",
    "Mozart was a prolific composer during the classical era.",
    "Mount Everest is the tallest mountain on Earth.",
    "The Nile is one of the world's longest rivers.",
    "Van Gogh's Starry Night is a popular piece of art.",
    "Basketball is a sport played with a round ball and two teams."
]

In [6]:
query = "Tell me about AI in text and vision."

In [7]:
tokenized_corpus = [doc.split(" ") for doc in documents]
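
The cell above tokenizes by splitting on single spaces, so letter case and trailing punctuation stay attached to each token. BM25 only counts exact token matches, which means the query token `vision.` does not match `Vision` in the corpus. The following is a minimal sketch of a normalizing tokenizer, assuming lowercasing and regex word extraction are acceptable for this data; `simple_tokenize` is a hypothetical helper and not part of the notebook.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

query = "Tell me about AI in text and vision."
doc = "AI includes areas like NLP and Computer Vision."

# Plain split keeps case and punctuation, so only exact matches overlap.
print(sorted(set(query.split(" ")) & set(doc.split(" "))))
# ['AI', 'and']

# Normalized tokens also match "vision" despite the case and the period.
print(sorted(set(simple_tokenize(query)) & set(simple_tokenize(doc))))
# ['ai', 'and', 'vision']
```

If a tokenizer like this were adopted, it would have to be applied to both the corpus and the query, and the BM25 scores and rankings shown later in this notebook would change.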

In [8]:
bm25 = BM25Okapi(tokenized_corpus)
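
`BM25Okapi` builds term statistics over the tokenized corpus so that documents can later be scored against a query. To show roughly what that scoring involves, here is a pure-Python sketch of the textbook Okapi BM25 formula. It is an illustration only: `bm25_scores` is a hypothetical function, the values `k1=1.5` and `b=0.75` are commonly used settings assumed here, and the `rank_bm25` implementation may differ in details such as how it handles negative IDF values.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_corpus, k1=1.5, b=0.75):
    """Score every document against the query with the Okapi BM25 formula."""
    n_docs = len(tokenized_corpus)
    avg_len = sum(len(doc) for doc in tokenized_corpus) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in tokenized_corpus for term in set(doc))
    scores = []
    for doc in tokenized_corpus:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue  # terms absent from the document contribute nothing
            n_t = doc_freq[term]
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5))
            tf = term_freq[term]
            # Length normalization: longer-than-average documents are penalized.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

corpus = [d.split(" ") for d in [
    "AI includes areas like NLP and Computer Vision.",
    "Mount Everest is the tallest mountain on Earth.",
    "The Nile is one of the world's longest rivers.",
    "Mozart was a prolific composer during the classical era.",
]]
scores = bm25_scores("Tell me about AI".split(" "), corpus)
print(scores)  # only the first document contains a query token
```

In this plain form, a term that appears in more than half of the documents gets a negative IDF, which is one reason library implementations often adjust the formula.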

In [9]:
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

Loading weights: 100%|██████████| 103/103 [00:00<00:00, 273.19it/s, Materializing param=pooler.dense.weight]
BertModel LOAD REPORT from: sentence-transformers/paraphrase-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.


In [10]:
document_embeddings = model.encode(documents)

In [11]:
index = faiss.IndexFlatL2(document_embeddings.shape[1])

In [12]:
index.add(np.array(document_embeddings).astype('float32'))
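
`faiss.IndexFlatL2` stores the vectors as they are and performs an exact, brute-force nearest-neighbor search, reporting squared Euclidean distances in ascending order. The sketch below mimics that behavior in pure Python on toy 2-D vectors. `flat_l2_search` is a hypothetical stand-in for illustration, not the FAISS API.

```python
def flat_l2_search(vectors, query, k):
    """Return (squared distances, indices) of the k stored vectors nearest to query."""
    dists = [sum((v_i - q_i) ** 2 for v_i, q_i in zip(v, query)) for v in vectors]
    # Sort positions by distance and keep the k closest.
    order = sorted(range(len(vectors)), key=lambda i: dists[i])[:k]
    return [dists[i] for i in order], order

vectors = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
distances, indices = flat_l2_search(vectors, [0.9, 1.2], k=2)
print(indices)
# [1, 0]
```

Note that the full index built in this cell is not queried later in the notebook; the search in stage 2 runs on a smaller index that holds only the BM25 candidates.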


In [13]:
top_n = 10

In [14]:
bm25_scores = bm25.get_scores(query.split(" "))

In [15]:
top_docs_indices = np.argsort(bm25_scores)[-top_n:]
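
`np.argsort(bm25_scores)[-top_n:]` returns the indices of the `top_n` highest scores in ascending order (best match last), and it keeps documents whose score is zero when fewer than `top_n` documents contain a query token. Below is a pure-Python analogue that returns the best match first and skips zero-score documents. Dropping zero scores is an assumption of this sketch, not the notebook's behavior, and `top_n_indices` is a hypothetical helper.

```python
def top_n_indices(scores, top_n):
    """Indices of the top_n highest scores, best first, skipping zero scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i for i in order[:top_n] if scores[i] > 0]

scores = [0.0, 1.2, 0.0, 0.4, 2.5]
print(top_n_indices(scores, top_n=3))
# [4, 1, 3]
print(top_n_indices(scores, top_n=4))
# [4, 1, 3]
```

The order of the candidates does not affect the final result here, because stage 2 re-ranks them by embedding distance; which documents survive the cut does.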

In [16]:
top_docs_embeddings = [document_embeddings[i] for i in top_docs_indices]

In [17]:
query_embedding = model.encode([query])

In [18]:
sub_index = faiss.IndexFlatL2(top_docs_embeddings[0].shape[0])

In [19]:
sub_index.add(np.array(top_docs_embeddings).astype('float32'))

In [20]:
_, sub_dense_ranked_indices = sub_index.search(np.array(query_embedding).astype('float32'), top_n)

In [21]:
sub_dense_ranked_indices


array([[9, 8, 1, 0, 6, 7, 2, 4, 3, 5]])

In [22]:
final_ranked_indices = [top_docs_indices[i] for i in sub_dense_ranked_indices[0]]
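
The indices returned by `sub_index.search` are positions within the candidate list, not positions in `documents`, so the cell above maps them back through `top_docs_indices`. The sketch below shows the same mapping on made-up values. It also guards against `-1`, which FAISS uses to pad results when `k` is larger than the number of stored vectors; without the guard, `-1` would silently select the last candidate. All values here are hypothetical.

```python
# Hypothetical stage 1 output: original document ids kept by BM25.
top_docs_indices = [7, 2, 5]
# Hypothetical stage 2 output: positions in the candidate list, nearest first,
# padded with -1 because k (4) exceeds the number of candidates (3).
sub_ranked = [2, 0, 1, -1]

# Map candidate positions back to original document ids, ignoring padding.
final_ranked_indices = [top_docs_indices[i] for i in sub_ranked if i >= 0]
print(final_ranked_indices)
# [5, 7, 2]
```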

In [23]:
ranked_docs = [documents[i] for i in final_ranked_indices]

In [24]:
ranked_docs

['AI includes areas like NLP and Computer Vision.',
 'Computer Vision allows machines to see and understand.',
 'Natural Language Processing involves understanding text.',
 'Deep Learning is a subset of Machine Learning.',
 "Van Gogh's Starry Night is a popular piece of art.",
 'Basketball is a sport played with a round ball and two teams.',
 'Mozart was a prolific composer during the classical era.',
 "The Nile is one of the world's longest rivers.",
 'The Pyramids of Giza are architectural marvels.',
 'Mount Everest is the tallest mountain on Earth.']

# Provide a brief description of the process this code implements.

In [None]:
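# This is a two-stage hybrid retrieval process.
# Stage 1 (sparse): BM25 scores every document against the whitespace-split query tokens,
#   and the top_n highest-scoring documents are kept as candidates.
# Stage 2 (dense): the candidates' Sentence Transformers embeddings are added to a FAISS
#   IndexFlatL2, and the candidates are re-ranked by L2 distance to the query embedding,
#   so the final order reflects semantic similarity to the query.
# With top_n = 10 and only 12 documents, stage 1 removes just two documents here.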
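
The two stages described above can be sketched as a single function. This is a simplified illustration under stated assumptions, not the notebook's implementation: `two_stage_search` is hypothetical, and the scorer and embedder are passed in as plain callables so the example runs without `rank_bm25`, `sentence_transformers`, or `faiss`. The toy `overlap` and `toy_embed` functions stand in for BM25 and the embedding model purely for demonstration.

```python
def two_stage_search(query, documents, keyword_score, embed, top_n):
    """Filter by keyword score, then re-rank the survivors by embedding distance."""
    # Stage 1: keep the top_n documents by keyword score.
    kw = [keyword_score(query, doc) for doc in documents]
    candidates = sorted(range(len(documents)), key=lambda i: kw[i], reverse=True)[:top_n]

    # Stage 2: order the candidates by squared L2 distance to the query embedding.
    q = embed(query)

    def sq_dist(i):
        return sum((a - b) ** 2 for a, b in zip(embed(documents[i]), q))

    return [documents[i] for i in sorted(candidates, key=sq_dist)]

def overlap(query, doc):
    """Toy keyword score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def toy_embed(text):
    """Toy embedding: vowel counts. Real embeddings come from a trained model."""
    return [text.lower().count(c) for c in "aeiou"]

docs = ["red apple pie", "green apple tart", "blue sky", "apple"]
results = two_stage_search("apple pie", docs, overlap, toy_embed, top_n=3)
print(results)
# ['red apple pie', 'apple', 'green apple tart']
```

In the notebook, the same roles are played by `bm25.get_scores` for the keyword score and `model.encode` plus `faiss.IndexFlatL2` for the embedding distance.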