<a href="https://colab.research.google.com/github/erenarkangil/personalized_chatbot/blob/main/rag_for_hybrid_search_(2).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG Chat Bot for Hybrid Search

This is the accompanying notebook for the [Oct 19 (2023) RAG for Hybrid Search meetup](https://www.pinecone.io/community/events/sf-meetup-october-2023/).

---

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/generation/rag-for-hybrid/rag-for-hybrid-search.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/generation/rag-for-hybrid/rag-for-hybrid-search.ipynb)

Quick notes:
- You will need an OpenAI API Key
- You will need a Pinecone account (API key & environment)
- Cells that preview data are commented out, so that users can more easily navigate the notebook on Github. Run the notebook in Colab with these cells un-commented to see data previews.



In [None]:
!pip3 install colab-xterm # Just makes the shell commands interactive, in case you have to press ENTER or type in 'Y/n' etc.
%load_ext colabxterm

In [None]:
# Install libraries

!pip install pymupdf
!pip install faiss-cpu
!pip install huggingface-hub==0.25.2

#!pip install  pinecone-text==0.5.4
!pip install  unstructured==0.10.24
#!pip install  sentence-transformers==2.2.2
!pip install  langchain==0.0.327
!pip install  openai==0.28.1
!pip install  pdfminer.six
#!pip install  pdf2image==1.16.3
!pip install python-dotenv==1.0.0
#!pip install pytesseract==0.3.10
#!pip install  unstructured_pytesseract==0.3.12
#!pip install  huggingface-hub==0.20.2
!pip install  numpy==2.0.0

In [None]:
from sentence_transformers import SentenceTransformer


In [None]:
!pip show sentence-transformers

In [None]:
#!apt-get install poppler-utils

In [None]:
#sudo apt install tesseract-ocr
#!sudo apt install libtesseract-dev

Imports

In [None]:
#!pip uninstall -y sentence-transformers numpy
#!pip install --no-cache-dir sentence-transformers

In [None]:
import pinecone
import re
from uuid import uuid4
from typing import IO, Any, Dict, List, Tuple
from copy import deepcopy
import requests

from unstructured.partition.pdf import partition_pdf
from unstructured.documents.elements import Text
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document
#from pinecone import Pinecone
import openai
from pinecone.core.client.model.query_response import QueryResponse

from pinecone_text.sparse import BM25Encoder

Set up the environment variables we'll need. We recommend using `dotenv`. It's a super simple way to keep your variables safe, but accessible. Simply create a `.env` file with your secrets in it, and use the Python `dotenv` and `os` libraries to load them.

To import your `.env` file into Colab, upload it (or create it) in the `/content/` dir.
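
Once the secrets are loaded, it can help to fail early with a readable message when one is missing. The following is a minimal sketch using only the standard library; `missing_env_vars` is a hypothetical helper (not part of `dotenv`), and the example values are made up.

```python
import os
from typing import Iterable, List, Mapping


def missing_env_vars(names: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or empty in `env` (hypothetical helper)."""
    return [name for name in names if not env.get(name)]


# An in-memory mapping stands in for the real environment here (made-up values)
example_env = {"PINECONE_API_KEY": "abc123", "OPENAI_API_KEY": ""}
required = ["PINECONE_API_KEY", "PINECONE_ENV", "OPENAI_API_KEY"]

missing = missing_env_vars(required, example_env)
if missing:
    print(f"Missing or empty environment variables: {', '.join(missing)}")
```

In the notebook itself you would call `missing_env_vars(required)` after `load_dotenv()` so that it reads from `os.environ`.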

In [None]:
%load_ext dotenv
import os  # needed for os.getenv below

from dotenv import load_dotenv

In [None]:
# Make sure dotenv is in our kernel environment & working

load_dotenv()

In [None]:
pinecone_api_key = os.getenv('PINECONE_API_KEY')  # You can get your Pinecone api key and env (e.g. "us-east-1") at app.pinecone.io
pinecone_env = os.getenv('PINECONE_ENV')
openai_api_key = os.getenv('OPENAI_API_KEY')


In [None]:
# Let's make sure our dotenv secrets loaded correctly

assert len(pinecone_api_key) > 0
assert len(pinecone_env) > 0
assert len(openai_api_key) > 0

# Download some articles we're interested in learning more about.

Remember, hybrid search is best for knowledge that contains a lot of unique keywords you'd like to search for, along with concepts you'd like clarity on. Data that works well for this includes medical data, most types of research data, and data with lots of entities in it.

We'll be using Arxiv.org articles about different vector search algorithms for this demo. They've got lots of jargon and concepts that'll work great for hybrid search!

In [None]:
import requests
import os

def get_pdf(base_url: str, filename: str) -> None:
    """
    Download a PDF file from a GitHub repository and write it locally.

    :param base_url: URL of the GitHub directory containing the file you want to download.
    :param filename: Name of the PDF file; appended to `base_url` and used as the local filename.
    """
    res = requests.get(base_url + filename)
    # Check if the request was successful (HTTP status code 200)
    if res.status_code == 200:
        with open(filename, 'wb') as f:
            f.write(res.content)
        print(f"PDF downloaded and saved as {filename}")
    else:
        print(f"Failed to download the PDF. HTTP status code: {res.status_code}")

In [None]:
# Download our files to the /content/ dir in Colab

github_dir = "https://github.com/pinecone-io/examples/raw/master/learn/generation/rag-for-hybrid/"
filenames = ["freshdiskann_paper.pdf", "hnsw_paper.pdf", "ivfpq_paper.pdf"]

for f in filenames:
  get_pdf(github_dir, f)


In [None]:
# Read in our file paths
# Note: change this path to your local dir if running this notebook locally (i.e. not on Colab)

freshdisk = os.path.join("/content/", filenames[0])
hnsw = os.path.join("/content/", filenames[1])
ivfpq = os.path.join("/content/", filenames[2])


# Partitioning & Cleaning our PDFs

This step is optional. Partitioning simply uses ML to break a document up into pages, paragraphs, the title, etc. It's a nice-to-have that allows you to exclude certain elements you might not want to index, such as an article's bibliography (although we'll keep that since it could be useful information).

If you want to skip this step, you can just read the PDFs into text or json, etc. and make your chunks straight from that object(s).

Note: this notebook assumes you have partitioned your PDF. If you want to run this notebook from start to finish as-is, you'll need to run this step.

In [None]:
import nltk
#nltk.download('punkt_tab')
#nltk.download('averaged_perceptron_tagger')
nltk.download('all')

In [None]:
# Let's partition all of our PDFs and store their partitions in a dictionary for easy retrieval & inspection later

# Note: This takes a few mins to run (~12 mins; will be faster if running locally (~3 mins))

partitioned_files = {
    "freshdisk": partition_pdf(freshdisk, url=None, strategy = 'ocr_only'),
    "hnsw": partition_pdf(hnsw, url=None, strategy = 'ocr_only'),
    "ivfpq": partition_pdf(ivfpq, url=None, strategy = 'ocr_only'),
}


In [None]:
import pickle

with open('/content/partitioned_files.pkl', 'rb') as f:  # Adjust path if needed
    partitioned_files = pickle.load(f)

# Verify the loaded object
type(partitioned_files), len(partitioned_files)  # Example check

In [None]:
# Let's make an archived copy of partitioned_files dict so if we mess it up while cleaning, we don't have to re-ocr our PDFs:

partitioned_files_copy = deepcopy(partitioned_files)

In [None]:
# partitioned_files.get('freshdisk')

You can see in the preview above that each of our PDFs now has elements classifying different parts of the text, such as `Text`, `Title`, and `EmailAddress`.

Data cleaning matters a lot when it comes to hybrid search, because for the keyword-search part we care about each individual token (word).

Let's filter out all of the email addresses to start with, since we don't need those for any reason.

In [None]:
def remove_unwanted_categories(elements: Dict[str, List[Text]], unwanted_cat: str) -> None:
    """
    Remove partitions containing an unwanted category.

    :parameter elements: Partitioned pieces of our documents.
    :parameter unwanted_cat: The name of the category we'd like filtered out.
    """
    for key, value in elements.items():
        elements[key] = [i for i in value if i.category != unwanted_cat]


In [None]:
# Remove unwanted EmailAddress category from dictionary of partitioned PDFs

remove_unwanted_categories(partitioned_files, 'EmailAddress')

No more `EmailAddress` elements!

In [None]:
# partitioned_files.get('freshdisk')

To actually see what our elements are, we can call the `.text` attribute of each object:

In [None]:
# Text preview of what's actually in one of our dictionary items:

# [i.text for i in partitioned_files.get('freshdisk')]

You can see there are weird things like blank spaces, single letters, etc. as their own partitions. We don't want these either, so let's get rid of them.

You can also see where single words were split across line breaks -- these are identifiable by a word fragment ending with a `- `. For these, we want to get rid of the `- ` and squish the word back together, so it makes sense.

(You can also see that not all of the email addresses were caught by Unstructured's ML. It's too cumbersome to go through each doc and weed those out by hand, so we'll just have to leave them for now)
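
The two clean-up steps described above can be sketched on plain strings. This is an illustrative variation, not the notebook's exact approach: `drop_tiny_partitions` and `rejoin_split_word` are hypothetical names, and the regex only joins a `- ` that sits between two lowercase letters, which is an assumption meant to leave things like numeric ranges alone.

```python
import re
from typing import List


def drop_tiny_partitions(texts: List[str]) -> List[str]:
    """Keep only strings that have more than one character after trimming whitespace."""
    return [t for t in texts if len(t.strip()) > 1]


# Matches "- " only when it sits between two lowercase letters, e.g. "funda- mental"
_SPLIT_WORD = re.compile(r'(?<=[a-z])- (?=[a-z])')


def rejoin_split_word(text: str) -> str:
    """Remove the hyphen-plus-space left behind by end-of-line hyphenation."""
    return _SPLIT_WORD.sub('', text)


sample = ["  ", "a", "This is funda- mental; see items 3 - 5.", "k"]
cleaned = [rejoin_split_word(t) for t in drop_tiny_partitions(sample)]
print(cleaned)
```

Note that a blanket `replace('- ', '')` would also turn `3 - 5` into `3 5`. Neither approach can tell a hyphenated line break from a genuine hyphenated compound, so some manual inspection is still worthwhile.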

In [None]:
# Remove empty spaces & single-letter/-digit partitions:

def remove_space_and_single_partitions(elements: Dict[str, List[Text]]) -> None:
    """
    Remove empty partitions & partitions with lengths of 1.

    :parameter elements: Partitioned pieces of our documents.
    """
    for key, value in elements.items():
        elements[key] = [i for i in value if len(i.text.strip()) > 1 ]

In [None]:
remove_space_and_single_partitions(partitioned_files)

No more single-character partitions or partitions with only whitespace, perfect!

In [None]:
# [i.text for i in partitioned_files.get('freshdisk')]

Let's now get rid of those strange words that have been split across line breaks (e.g. `funda- mental`):

In [None]:
# Note: this function transforms our elements into their text representations

def rejoin_split_words(elements: Dict[str, List[Text]]) -> None:
    """
    Rejoin words that are split over line breaks.

    :parameter elements: Partitioned pieces of our documents.
    """
    for key, value in elements.items():
        # Convert every partition to text; partitions without '- ' pass through unchanged
        elements[key] = [i.text.replace('- ', '') for i in value]



In [None]:
rejoin_split_words(partitioned_files)

In [None]:
# partitioned_files.get('freshdisk')

You can see now that we've sewn those split words back together:

The last cleaning step we'll want to take is removing the inline citations, e.g. `[6, 9, 11, 16, 32, 35, 38, 43, 59]` and `[12]`.
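
To see what this kind of pattern does before running it over whole documents, here is a small standalone sketch. It uses the same bracketed-number idea as the cell below, with one assumed tweak: a leading `\s*` so the space in front of the citation is removed too. `strip_citations` is a hypothetical name and the sample sentence is made up.

```python
import re

# Bracketed lists of integers such as [12] or [6, 9, 11], plus any whitespace just before them
CITATION = re.compile(r'\s*\[\s*(?:\d+\s*,\s*)*\d+\s*\]')


def strip_citations(text: str) -> str:
    """Remove inline numeric citations from a string."""
    return CITATION.sub('', text)


sample = "Graph indexes [6, 9, 11] beat trees [12] on recall."
print(strip_citations(sample))

# Brackets that do not contain only numbers are left alone
print(strip_citations("See appendix [A] for details."))
```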

In [None]:
def remove_inline_citation_numbers(elements: Dict[str, List[str]]) -> None:
    """
    Remove inline citation numbers from partitions.

    :parameter elements: Partitioned pieces of our documents (plain strings at this point).
    """
    # Matches bracketed number lists such as [12] or [6, 9, 11]; compiled once, outside the loop
    pattern = re.compile(r'\[\s*(\d+\s*,\s*)*\d+\s*\]')
    for key, value in elements.items():
        elements[key] = [pattern.sub('', i) for i in value]



In [None]:
remove_inline_citation_numbers(partitioned_files)

We've still got some weird numbers in there, but it's pretty good!

In [None]:
# partitioned_files.get('freshdisk')

Now that we've cleaned our data, we can zip all the partitions (per PDF) back together so we're starting our chunking from a single, coherent text object.

In [None]:
# Sew our partitions back together, per PDF:

def stitch_partitions_back_together(elements: Dict[str, List[Text]]) -> None:
    """
    Stitch partitions back into single string object.

    :parameter elements:  Partitioned pieces of our documents.
    """
    for key, value in elements.items():
        elements[key] = ' '.join(value)

In [None]:
stitch_partitions_back_together(partitioned_files)

Good to go! Each of our PDFs is now a single, cleaned glob of text data.

In [None]:
partitioned_files

In [None]:
# Let's save our cleaned files to a new variable that makes more sense w/the current state

cleaned_files = partitioned_files

# Chunking our PDF content

Chunking is integral to achieving great relevance with vector search, whether that's sparse vector search, dense vector search, or hybrid vector search.

From our [chunking strategy post](https://www.pinecone.io/learn/chunking-strategies/):

> The main reason for chunking is to ensure we’re embedding a piece of content with as little noise as possible that is still semantically relevant . . . For example, in semantic search, we index a corpus of documents, with each document containing valuable information on a specific topic. By applying an effective chunking strategy, we can ensure our search results accurately capture the essence of the user’s query. If our chunks are too small or too large, it may lead to imprecise search results or missed opportunities to surface relevant content. As a rule of thumb, if the chunk of text makes sense without the surrounding context to a human, it will make sense to the language model as well. Therefore, finding the optimal chunk size for the documents in the corpus is crucial to ensuring that the search results are accurate and relevant.

We need to chunk our PDFs' (text) data into sizable chunks that are semantically coherent and dense with contextual information.

We'll use LangChain's `RecursiveCharacterTextSplitter` since it's a super easy utility that makes chunking quick and customizable. You should experiment with different chunk sizes and overlap values to see how the resulting chunks differ. You want each chunk to make a reasonable amount of sense as a stand-alone data object. After some experimentation on our end, we will choose a `chunk_size` of `512` and a `chunk_overlap` of `35` (characters).
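
To build intuition for what `chunk_size` and `chunk_overlap` mean in characters, here is a deliberately simplified sketch. It is not how LangChain's splitter works (that one tries to break on separators such as paragraphs and sentences first); `chunk_by_characters` is a hypothetical function that just slides a fixed-size window over the text.

```python
from typing import List


def chunk_by_characters(text: str, chunk_size: int = 512, chunk_overlap: int = 35) -> List[str]:
    """Slide a window of `chunk_size` characters, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


text = "abcdefghij" * 100  # 1,000 characters of filler text
chunks = chunk_by_characters(text)

print(len(chunks), [len(c) for c in chunks])
# The last 35 characters of one chunk are the first 35 of the next
print(chunks[0][-35:] == chunks[1][:35])
```

Larger overlaps repeat more context across neighbouring chunks at the cost of more vectors to embed and store.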

In [None]:
def generate_chunks(doc: str, chunk_size: int = 512, chunk_overlap: int = 35) -> List[Document]:
    """
    Generate chunks of a certain size and character overlap.

    :param doc: Document we want to turn into chunks.
    :param chunk_size: Desired maximum size of our chunks, in characters.
    :param chunk_overlap: Desired # of characters that will overlap across chunks.

    :return: Chunk representations of the given document.
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )

    return splitter.create_documents([doc])



In [None]:
def chunk_documents(docs: Dict[str, str], chunk_size: int = 512, chunk_overlap: int = 35) -> None:
    """
    Iterate over documents and chunk each one.

    :parameter docs: The documents we want to chunk (one cleaned string per PDF).
    :param chunk_size: Desired maximum size of our chunks, in characters.
    :param chunk_overlap: Desired # of characters that will overlap across chunks.
    """
    for key, value in docs.items():
        # Pass the sizes through so non-default values actually take effect
        chunks = generate_chunks(value, chunk_size=chunk_size, chunk_overlap=chunk_overlap)
        docs[key] = [c.page_content for c in chunks]  # Grab the text representation of the chunks via the `page_content` attribute


In [None]:
chunk_documents(cleaned_files)

In [None]:
chunked_files = cleaned_files

Check out our chunks!

# Create Dense Embeddings of our Chunks

Hybrid search needs both dense embeddings and sparse embeddings of the same content in order to work. Let's start with dense embeddings.

We'll use the `'all-MiniLM-L12-v2'` [model](https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2) hosted by HuggingFace to create our dense embeddings. It's currently high on their [MTEB (Massive Text Embedding Benchmark) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) (Reranking section), so it's a pretty safe bet. This will output dense vectors of 384 dimensions.

Note: if you're playing around with this notebook, make sure to save your chunks and embeddings (both sparse and dense) in `pkl` [files](https://stackoverflow.com/questions/11218477/how-can-i-use-pickle-to-save-a-dict-or-any-other-python-object), so that you don't have to wait for the embeddings to generate again if you want to rerun any steps in this notebook.
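
A minimal sketch of that save-and-reload pattern, using only the standard library. The helper names, the file name, and the toy data are all placeholders; in the notebook you would pass your real chunk and embedding objects instead.

```python
import os
import pickle
import tempfile
from typing import Any


def save_pickle(obj: Any, path: str) -> None:
    """Serialize `obj` to `path` with pickle."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_pickle(path: str) -> Any:
    """Load a pickled object back from `path`."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Toy stand-ins for chunks and dense embeddings
artifacts = {"chunks": ["first chunk", "second chunk"], "dense": [[0.1, 0.2], [0.3, 0.4]]}

path = os.path.join(tempfile.mkdtemp(), "embeddings.pkl")
save_pickle(artifacts, path)
restored = load_pickle(path)
print(restored == artifacts)
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.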

We'll have to create a dense embedding of each of our PDFs' chunks:

In [None]:
def produce_embeddings(chunks: List[str]) -> List[Any]:
    """
    Produce dense embeddings for each chunk.

    :param chunks: The chunks we want to create dense embeddings of.

    :return: Dense embeddings (one NumPy array per chunk) produced by our SentenceTransformer model `all-MiniLM-L12-v2`.
    """
    # Note: the model is loaded on every call; load it once outside the function if you call this often
    model = SentenceTransformer('all-MiniLM-L12-v2')
    embeddings = []
    for c in chunks:
        embedding = model.encode(c)
        embeddings.append(embedding)
    return embeddings


In [None]:
freshdisk_dembeddings = produce_embeddings(chunked_files.get('freshdisk'))  # each of these takes ~30s to run

In [None]:
type(freshdisk_dembeddings)

In [None]:
(chunked_files.get('freshdisk'))

In [None]:
hnsw_dembeddings = produce_embeddings(chunked_files.get('hnsw'))

In [None]:
ivfpq_dembeddings = produce_embeddings(chunked_files.get('ivfpq'))

In [None]:
# We can confirm the shape of each of our dense embeddings is (384,):

# Make binary lists to keep track of any shapes that are *not* (384,)
# (`.shape` is a tuple, so we compare against (384,) rather than the integer 384)
freshdisk_assertion = [1 for i in freshdisk_dembeddings if i.shape != (384,)]
hnsw_assertion = [1 for i in hnsw_dembeddings if i.shape != (384,)]
ivfpq_assertion = [1 for i in ivfpq_dembeddings if i.shape != (384,)]

# Sum up our lists. If there are any embeddings that are not of shape (384,), these sums will be > 0
assert sum(freshdisk_assertion) == 0
assert sum(hnsw_assertion) == 0
assert sum(ivfpq_assertion) == 0

# Create Sparse Embeddings of our Chunks

Now we can create our sparse embeddings. We will use the BM25 algorithm to create our sparse embeddings. The resulting vector will represent an inverted index of the tokens in our chunks, constrained by things like chunk length.

Pinecone has an awesome [text library](https://github.com/pinecone-io/pinecone-text) that makes generating these vectors super easy. We also have [a great notebook](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/search/semantic-search/sparse/bm25/bm25-vector-generation.ipynb) all about BM25 encodings.

Since we're using a ML-implemented version of BM25, we need to "fit" the model to our corpus. To do this, we'll combine all 3 of our PDFs together, so that the BM25 model can compute all the token frequencies etc correctly. We'll then encode each of our documents with our "fitted" model.
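
To make the "fit, then score" idea concrete, here is a self-contained sketch of textbook Okapi BM25 scoring. It is not how `pinecone-text` implements its encoder internally; the tokenizer, the `CorpusStats` class, the parameter defaults (`k1=1.2`, `b=0.75`), and the three toy documents are all assumptions for illustration.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CorpusStats:
    n_docs: int
    avg_doc_len: float
    doc_freq: Dict[str, int]  # number of documents each token appears in


def tokenize(text: str) -> List[str]:
    # Naive lowercase/whitespace tokenizer (assumption)
    return text.lower().split()


def fit(corpus: List[str]) -> CorpusStats:
    """Collect the corpus-level statistics BM25 needs: doc count, average length, doc frequencies."""
    docs = [tokenize(d) for d in corpus]
    doc_freq = Counter(tok for d in docs for tok in set(d))
    avg_len = sum(len(d) for d in docs) / len(docs)
    return CorpusStats(len(docs), avg_len, dict(doc_freq))


def bm25_score(query: str, doc: str, stats: CorpusStats, k1: float = 1.2, b: float = 0.75) -> float:
    """Score one document against a query with the standard BM25 formula."""
    tf = Counter(tokenize(doc))
    doc_len = sum(tf.values())
    score = 0.0
    for term in set(tokenize(query)):
        n = stats.doc_freq.get(term, 0)
        idf = math.log((stats.n_docs - n + 0.5) / (n + 0.5) + 1)
        f = tf.get(term, 0)
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * doc_len / stats.avg_doc_len))
    return score


corpus = [
    "hnsw builds a layered graph",
    "ivfpq compresses vectors with product quantization",
    "freshdiskann updates a graph index on disk",
]
stats = fit(corpus)
scores = [bm25_score("graph index", d, stats) for d in corpus]
print(scores)
```

The document containing both query terms scores highest, the one containing neither scores zero, and rarer terms contribute more through the IDF factor.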

In [None]:
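# Gather the chunks from all of our PDFs into 1 corpus.
# BM25Encoder.fit expects a list of strings (one per document), so we keep the chunks
# as separate items instead of joining them into a single long string.

corpus = []

for name, chunks in chunked_files.items():
    corpus.extend(chunks)

In [None]:
len(corpus)  # Awesome, we've got lots o' chunks here for our BM25 model to learn token frequencies from :)

In [None]:
# Initialize BM25 and fit to our corpus

bm25 = BM25Encoder()
bm25.fit(corpus)  # takes ~30s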

In [None]:
# Create embeddings for each chunk
freshdisk_sembeddings = [bm25.encode_documents(i) for i in chunked_files.get('freshdisk')]

In [None]:
hnsw_sembeddings = [bm25.encode_documents(i) for i in chunked_files.get('hnsw')]

In [None]:
ivfpq_sembeddings = [bm25.encode_documents(i) for i in chunked_files.get('ivfpq')]

Let's look at the sparse embeddings for one of our PDFs.

You'll see that each of a PDF's chunks has now been transformed into a dictionary with `indices` and `values` keys.
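
As a rough illustration of that format, the sketch below builds two toy sparse vectors and computes their dot product, which is how a sparse query and a sparse document are compared. The index numbers and weights are made up, and `sparse_dot` is a hypothetical helper.

```python
from typing import Dict, List, Union

SparseVector = Dict[str, List[Union[int, float]]]


def sparse_dot(a: SparseVector, b: SparseVector) -> float:
    """Dot product of two sparse vectors: only indices present in both contribute."""
    b_lookup = dict(zip(b["indices"], b["values"]))
    return sum(v * b_lookup.get(i, 0.0) for i, v in zip(a["indices"], a["values"]))


# Each index identifies a token; each value is that token's weight (toy numbers)
doc_vec: SparseVector = {"indices": [3, 17, 42], "values": [0.5, 0.25, 0.75]}
query_vec: SparseVector = {"indices": [17, 42, 99], "values": [1.0, 0.5, 0.25]}

print(sparse_dot(query_vec, doc_vec))  # 1.0*0.25 + 0.5*0.75 = 0.625
```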

In [None]:
# freshdisk_sembeddings

In [None]:
# We want the # of chunks per PDF to be equal to the # of sparse embeddings we've generated. Let's check that:

assert len(freshdisk_sembeddings) == len(chunked_files.get('freshdisk'))
assert len(hnsw_sembeddings) == len(chunked_files.get('hnsw'))
assert len(ivfpq_sembeddings) == len(chunked_files.get('ivfpq'))

In [None]:
len(chunked_files.get('freshdisk')) +  len(chunked_files.get('hnsw')) + len(chunked_files.get('ivfpq'))

In [None]:
len(chunked_files.get('freshdisk'))

# Getting Our Embeddings into Pinecone

Now that we have made our sparse and dense embeddings, it's time to index them into our Pinecone index.

One thing to note is that only [p1 and s1 pods support hybrid search](https://docs.pinecone.io/docs/indexes). Since we're not concerned about high throughput for a demo, we'll go with s1, which is optimized for storage over throughput.

Hybrid search indexes inherently also need `"dotproduct"` as their similarity `metric`.
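
The cells below build a FAISS `IndexFlatL2`, which ranks by squared L2 distance rather than dot product. For dense vectors scaled to unit length the two give the same ordering, because the squared distance equals `2 - 2 * dot`. The sketch below checks this on toy vectors; whether your embeddings are actually unit-normalized is an assumption worth verifying for your model.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def normalize(v: List[float]) -> List[float]:
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


query = normalize([1.0, 2.0, 2.0])
docs = [normalize(v) for v in ([2.0, 1.0, 2.0], [-1.0, 0.0, 1.0], [1.0, 2.0, 2.5])]

# Highest dot product first vs. smallest squared L2 distance first
by_dot = sorted(range(len(docs)), key=lambda i: -dot(query, docs[i]))
by_l2 = sorted(range(len(docs)), key=lambda i: squared_l2(query, docs[i]))
print(by_dot, by_l2)

# For unit vectors: ||q - d||^2 == 2 - 2 * (q . d)
for d in docs:
    assert abs(squared_l2(query, d) - (2 - 2 * dot(query, d))) < 1e-9
```

This equivalence does not carry over to unnormalized vectors, such as raw BM25 weights.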

In [None]:

def create_ids(chunks: List[str]) -> List[str]:
    """Create unique IDs for each document chunk."""
    return [str(uuid4()) for _ in range(len(chunks))]

# Generate unique IDs
freshdisk_ids = create_ids(chunked_files.get('freshdisk'))
hnsw_ids = create_ids(chunked_files.get('hnsw'))
ivfpq_ids = create_ids(chunked_files.get('ivfpq'))

In [None]:
import faiss
import numpy as np

# Define the vector dimension (must match your embedding size)
dimension = 384

# Create a FAISS index
f_index = faiss.IndexFlatL2(dimension)

In [None]:
np.array(freshdisk_dembeddings, dtype=np.float32)

In [None]:
freshdisk_dembeddings_np = np.array(freshdisk_dembeddings, dtype=np.float32)
hnsw_dembeddings_np = np.array(hnsw_dembeddings, dtype=np.float32)
ivfpq_dembeddings_np = np.array(ivfpq_dembeddings, dtype=np.float32)

# Add vectors to the FAISS index
#index.add(freshdisk_dembeddings_np)
#index.add(hnsw_dembeddings_np)
#index.add(ivfpq_dembeddings_np)

# Verify number of vectors added
#print("Total vectors in index:", index.ntotal)

In [None]:
def search_faiss(query_embedding, k=5):
    """
    Perform a similarity search in the FAISS index.

    :param query_embedding: The embedding of the query text.
    :param k: Number of closest matches to return.

    :return: List of top-k indices and distances.
    """
    query_embedding_np = np.array(query_embedding, dtype=np.float32).reshape(1, -1)
    distances, indices = f_index.search(query_embedding_np, k)
    return indices[0], distances[0]

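# Example query
# Note: f_index holds no vectors until we add them further below; FAISS fills empty result slots with index -1
query_embedding = freshdisk_dembeddings[0]  # Use one of your embeddings as a test query
indices, distances = search_faiss(query_embedding)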

print("Top matches:", indices)
print("Distances:", distances)

We'll create an index object out of the index we just made. We'll make this with Pinecone's [GRPC client](https://docs.pinecone.io/docs/performance-tuning#using-the-grpc-client-to-get-higher-upsert-speeds), since it's a little faster for upserts:


We need unique IDs for all of our objects, which is easy with the `uuid` library in Python. We generated them above with `create_ids`, so let's check them:

In [None]:
# Let's make sure we have the same # of IDs as there are chunks:

assert len(freshdisk_ids) == len(chunked_files.get('freshdisk'))
assert len(hnsw_ids) == len(chunked_files.get('hnsw'))
assert len(ivfpq_ids) == len(chunked_files.get('ivfpq'))

Now that we have our IDs, we can make our composite sparse-dense objects that we'll index into Pinecone. These will take 4 components:
- Our IDs
- Our sparse embeddings
- Our dense embeddings
- Our chunks

We'll use the actual text content of our PDFs (stored in our chunks) as metadata. This allows the end user to see the content of what's being returned by their search instead of just the sparse/dense vectors. In order to store our chunks' textual data in digestible metadata object for Pinecone, we'll want to turn each chunk into a dict that has a `'text'` key to hold the chunk value.

In [None]:
def create_metadata_objs(doc: List[str]) -> List[Dict[str, str]]:
    """
    Create objects to store as metadata alongside our sparse and dense vectors in our hybrid Pinecone index.

    :param doc: Chunks of a document we'd like to use while creating metadata objects.

    :return: Metadata objects with a "text" key and a value that points to the text content of each chunk.
    """
    return [{'text': d} for d in doc]

In [None]:
freshdisk_metadata = create_metadata_objs(chunked_files.get('freshdisk'))
hnsw_metadata = create_metadata_objs(chunked_files.get('hnsw'))
ivfpq_metadata = create_metadata_objs(chunked_files.get('ivfpq'))

In [None]:
# Preview

freshdisk_metadata[0]

In [None]:
def create_composite_objs(ids: List[str], sembeddings: List[Dict[str, List[Any]]], dembeddings: List[Any], metadata: List[Dict[str, str]]) -> List[Dict[str, Any]]:
    """
    Create objects for indexing into Pinecone. Each object contains a document ID (which corresponds to the chunk, not the larger document),
    the chunk's sparse embedding, the chunk's dense embedding, and the chunk's corresponding metadata object.

    :param ids: Unique IDs of the chunks we want to index.
    :param sembeddings: Sparse embedding representations of the chunks we want to index.
    :param dembeddings: Dense embedding representations of the chunks we want to index.
    :param metadata: Metadata objects with a "text" key and a value that points to the text content of each chunk.

    :return: Composite objects in the correct format for ingest into Pinecone.
    """
    to_index = []

    for i in range(len(metadata)):
        to_index_obj = {
                'id': ids[i],
                'sparse_values': sembeddings[i],
                'values': dembeddings[i],
                'metadata': metadata[i]
            }
        to_index.append(to_index_obj)
    return to_index

In [None]:
freshdisk_com_objs = create_composite_objs(freshdisk_ids, freshdisk_sembeddings, freshdisk_dembeddings, freshdisk_metadata)
hnsw_com_objs = create_composite_objs(hnsw_ids, hnsw_sembeddings, hnsw_dembeddings, hnsw_metadata)
ivfpq_com_objs = create_composite_objs(ivfpq_ids, ivfpq_sembeddings, ivfpq_dembeddings, ivfpq_metadata)

In [None]:
len(freshdisk_dembeddings_np) + len(hnsw_dembeddings_np) + len(ivfpq_dembeddings_np)

Now we can index ("upsert") our objects into our Pinecone index!

In [None]:
f_index.add(freshdisk_dembeddings_np)
f_index.add(hnsw_dembeddings_np)
f_index.add(ivfpq_dembeddings_np)


In [None]:
# Woo we have our vectors (252) in our index!

total_vectors = f_index.ntotal
print(f"Total vectors in the index: {total_vectors}")

# Check the index type
print(f"Index type: {type(f_index)}")

# If your index is using a specific search algorithm (e.g., IVF or HNSW), you can access more information:
if isinstance(f_index, faiss.IndexIVFFlat):
    print(f"Number of centroids: {f_index.nlist}")
elif isinstance(f_index, faiss.IndexHNSWFlat):
    print(f"HNSW efConstruction: {f_index.hnsw.efConstruction}")

# Query Our Hybrid Docs

Now that we have all of our hybrid vector objects in our Pinecone index, we can issue some queries!

Since issuing a query to a vector index requires the query to be vectorized in the same way as the objects in the index are vectorized (so they can match up in vector space), for hybrid queries we'll have to vectorize the query *twice*! Once as a sparse vector and once as a dense vector. We then send both of those vectors to Pinecone to get items back.

In [None]:
query = "What is the responsibility of each BSP Manager?"


Create sparse embedding from query

Note: do *not* refit the bm25 model here. We want to keep the token frequencies etc from when we fit it to the text from our PDFs!

You might be wondering how the model gets "refit" when the corpus changes. The answer is a little complicated, but essentially this is a special implementation of BM25 (which usually runs online) that has precomputed frequencies for English words, based on the MSMarco dataset. So, when you add new docs to the corpus, you don't have to "refit" the BM25 model; it just looks up the word frequencies from the MSMarco dataset.

More here: https://github.com/pinecone-io/pinecone-text/blob/main/pinecone_text/sparse/bm25_encoder.py#L255



In [None]:
query_sembedding = bm25.encode_queries(query)

In [None]:
# Cool! We can see there are only a few values in here, because BM25 automatically removed stop words like "what" and "is"

query_sembedding

In [None]:
# Create dense embedding
query = "What is the responsibility of each BSP Manager?"

query_dembedding = produce_embeddings([query])

In [None]:
type(query_dembedding)
#query_dembedding

Pinecone vector search has a cool user feature where you can weight the sparse vectors higher or lower (i.e. of more or less importance) than the dense vectors. This is controlled by the `alpha` parameter. An `alpha` of 0 means you're doing a totally keyword-based search (i.e. only over sparse vectors), while an `alpha` of 1 means you're doing a totally semantic search (i.e. only over dense vectors).

Let's make a function that'll let us weight our vectors by alpha.

(We'll also include `k`, which is the number of docs we want to retrieve)
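
To see what `alpha` does to a final score, here is a small worked sketch. Scaling the query's dense values by `alpha` and its sparse values by `1 - alpha` and then taking dot products against a document is the same as computing `alpha * dense_dot + (1 - alpha) * sparse_dot`. All vectors below are toy values and the helper names are hypothetical.

```python
from typing import Dict, List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def sparse_dot(a: Dict[str, list], b: Dict[str, list]) -> float:
    b_lookup = dict(zip(b["indices"], b["values"]))
    return sum(v * b_lookup.get(i, 0.0) for i, v in zip(a["indices"], a["values"]))


def hybrid_score(q_dense, q_sparse, d_dense, d_sparse, alpha: float) -> float:
    """Convex combination of dense and sparse dot products, weighted by alpha."""
    if not 0 <= alpha <= 1:
        raise ValueError("Alpha must be between 0 and 1")
    return alpha * dot(q_dense, d_dense) + (1 - alpha) * sparse_dot(q_sparse, d_sparse)


q_dense, d_dense = [0.6, 0.8], [0.8, 0.6]                 # dense dot product: 0.96
q_sparse = {"indices": [7, 21], "values": [1.0, 1.0]}
d_sparse = {"indices": [21, 30], "values": [0.5, 0.9]}    # sparse dot product: 0.5

for alpha in (0.0, 0.5, 1.0):
    print(alpha, round(hybrid_score(q_dense, q_sparse, d_dense, d_sparse, alpha), 3))
```

At `alpha = 0` only the keyword match counts, at `alpha = 1` only the semantic match counts, and values in between blend the two.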

In [None]:
'''# Integrate alpha and top-k

def weight_by_alpha(sparse_embedding: Dict[str, List[Any]], dense_embedding: List[float], alpha: float) -> Tuple[Dict[str, List[Any]], List[float]]:
    """
    Weight the values of our sparse and dense embeddings by the parameter alpha (0-1).

    :param sparse_embedding: Sparse embedding representation of one of our documents (or chunks).
    :param dense_embedding: Dense embedding representation of one of our documents (or chunks).
    :param alpha: Weighting parameter between 0-1 that controls the impact of sparse or dense embeddings on the retrieval and ranking
        of returned docs (chunks) in our index.

    :return: Weighted sparse and dense embeddings for one of our documents (chunks).
    """
    if alpha < 0 or alpha > 1:
        raise ValueError("Alpha must be between 0 and 1")
    hsparse = {
        'indices': sparse_embedding['indices'],
        'values':  [v * (1 - alpha) for v in sparse_embedding['values']]
    }
    hdense = [v * alpha for v in dense_embedding]
    return hsparse, hdense'''

Now let's make a function that'll query our Pinecone index while taking into account whatever `alpha` and `k` values we want to pass:

In [None]:
'''# Note this doesn't have any genAI in it yet


def issue_hybrid_query(sparse_embedding: Dict[str, List[Any]], dense_embedding: List[float], alpha: float, top_k: int) -> QueryResponse:
    """
    Send properly formatted hybrid search query to Pinecone index and get back `k` ranked results (ranked by dot product similarity, as
        defined when we made our index).

    :param sparse_embedding: Sparse embedding representation of one of our documents (or chunks).
    :param dense_embedding: Dense embedding representation of one of our documents (or chunks).
    :param alpha: Weighting parameter between 0-1 that controls the impact of sparse or dense embeddings on the retrieval and ranking
        of returned docs (chunks) in our index.
    :param top_k: The number of documents (chunks) we want back from Pinecone.

    :return: QueryResponse object from Pinecone containing top-k results.
    """
    scaled_sparse, scaled_dense = weight_by_alpha(sparse_embedding, dense_embedding, alpha)

    result = index.query(
        vector=scaled_dense,
        sparse_vector=scaled_sparse,
        top_k=top_k,
        include_metadata=True
    )
    return result'''

In [None]:
'''
def issue_hybrid_query(sparse_embedding: Dict[str, List[Any]], dense_embedding: List[float], alpha: float, top_k: int) -> List[Tuple[int, float]]:
    """
    Send a hybrid query to the FAISS index and get back top-k ranked results.

    :param sparse_embedding: Sparse embedding representation of one of our documents (or chunks).
    :param dense_embedding: Dense embedding representation of one of our documents (or chunks).
    :param alpha: Weighting parameter between 0-1 that controls the impact of sparse or dense embeddings on the retrieval and ranking
        of returned docs (chunks) in our index.
    :param top_k: The number of documents (chunks) we want back from FAISS.

    :return: List of tuples where each tuple is (index, similarity score) for the top-k results.
    """
    # Weight the sparse and dense embeddings by alpha
    scaled_sparse, scaled_dense = weight_by_alpha(sparse_embedding, dense_embedding, alpha)

    # Convert the weighted dense vector to a numpy array for FAISS
    dense_vector = np.array(scaled_dense, dtype=np.float32).reshape(1, -1)  # Reshaped to match FAISS input shape

    # Perform the FAISS search for the dense vector
    D, I = f_index.search(dense_vector, top_k)  # D: distances (similarities), I: indices

    # Combine the results with the weighted sparse embeddings (assuming sparse embeddings are handled separately)
    # Here, we return the results from FAISS based on the dense vector query.
    results = [(I[0][i], D[0][i]) for i in range(top_k)]
    return results

'''

In [None]:
def query_faiss_index(faiss_index: faiss.Index, query_embedding: np.ndarray, top_k: int) -> list:
    """
    Perform a query on a FAISS index to get the top-k most similar results.

    :param faiss_index: The FAISS index.
    :param query_embedding: The query embedding to search for (shape: (1, dimension)).
    :param top_k: The number of results to return.
    :return: A list of tuples containing (index, similarity_score) for the top-k results.
    """
    # Ensure the query embedding is in the correct format
    query_embedding = np.array(query_embedding, dtype=np.float32).reshape(1, -1)

    # Perform the search to get the top-k results
    distances, indices = faiss_index.search(query_embedding, top_k)

    # Return the results as a list of (index, similarity_score) tuples
    return list(zip(indices[0], distances[0]))



results = query_faiss_index(f_index, query_dembedding, 5)


In [None]:
results

In [None]:
all_texts_1 = list(chunked_files.get('freshdisk')) + list(chunked_files.get('hnsw')) + list(chunked_files.get('ivfpq'))


In [None]:
print(all_texts_1[0])
chunked_files.get('freshdisk')[0]

In [None]:
print(type(f_index))
print(f_index.d)
print(f_index.ntotal)  # This shows the total number of vectors in the index


Let's issue a pure semantic search:

In [None]:
# Note: `query_dembedding` holds one embedding per input string, so we take the first row [0] for our single query.
# With `weight_by_alpha`, alpha=1.0 keeps only the dense (semantic) part and zeroes out the sparse part.

issue_hybrid_query(query_sembedding, query_dembedding[0], 1.0, 5)

And now a pure keyword search. You can see how many more domain-specific words are in these results:

You can see the differences above: when we issue a purely semantic search, our search results are about what the idea of "nearest neighbors" is; in our keyword search, the vast majority of our search results are just exact-word matches for the tokens "nearest" and "neighbors". Most of them are just citations from the HNSW article's bibliography!

Can we get the best of both worlds? In an ideal world, my search results would both tell me "about" the concept of nearest neighbors and contain things like citations that I could read more about later.

Let's see if we can get a combination of semantic and keyword search by toggling our `alpha` value:

In [None]:
issue_hybrid_query(query_sembedding, query_dembedding[0], 0.2, 5)  # closer to 0.0 = closer to pure keyword search; closer to 1.0 = closer to pure semantic search
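
To see why scaling the query vectors by `alpha` changes the ranking, note that for inner-product scoring, scaling the dense query by `alpha` and the sparse query by `1 - alpha` is the same as taking a weighted sum of the two per-document scores. The sketch below checks this on made-up toy vectors. The helpers `dot` and `sparse_dot` and all numbers are illustrative assumptions, not part of the notebook's pipeline.

```python
from typing import Dict, List


def dot(a: List[float], b: List[float]) -> float:
    """Plain inner product of two dense vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


def sparse_dot(q: Dict[str, list], d: Dict[str, list]) -> float:
    """Inner product of two sparse vectors in {'indices': [...], 'values': [...]} form."""
    d_lookup = dict(zip(d["indices"], d["values"]))
    return sum(v * d_lookup.get(i, 0.0) for i, v in zip(q["indices"], q["values"]))


# Toy query and document (made-up numbers, for illustration only)
query_dense = [0.5, 0.5]
query_sparse = {"indices": [3, 7], "values": [1.0, 2.0]}
doc_dense = [1.0, 0.0]
doc_sparse = {"indices": [7, 9], "values": [0.5, 4.0]}

alpha = 0.2

# Option 1: scale the query vectors first (as weight_by_alpha does), then score
scaled_dense = [v * alpha for v in query_dense]
scaled_sparse = {
    "indices": query_sparse["indices"],
    "values": [v * (1 - alpha) for v in query_sparse["values"]],
}
score_scaled = dot(scaled_dense, doc_dense) + sparse_dot(scaled_sparse, doc_sparse)

# Option 2: score each part separately, then take the weighted sum
score_weighted = alpha * dot(query_dense, doc_dense) + (1 - alpha) * sparse_dot(query_sparse, doc_sparse)

print(score_scaled, score_weighted)
```

Both options give the same score up to floating-point rounding, which is why a single `alpha` is enough to slide between keyword-heavy and semantic-heavy rankings.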

In [None]:
def get_text_from_faiss_indices(faiss_indices, chunked_files):
    """
    Given FAISS indices, return the corresponding text chunks.

    :param faiss_indices: List of FAISS indices returned from a search.
    :param chunked_files: Dictionary containing chunked text files (currently unused;
        the lookup goes through the flattened `all_texts` list).

    :return: List of tuples (index, corresponding_text)
    """
    # `all_texts` is the flat list of chunks, assumed to be in the same order as the vectors in the index
    documents = all_texts

    # Map FAISS indices to the corresponding text chunks
    return [(faiss_index, documents[faiss_index]) for faiss_index in faiss_indices]

# Example of FAISS indices returned (e.g., from index.search)
faiss_indices = [0,19,8,72,0]  # Top 5 indices returned from FAISS search

result = get_text_from_faiss_indices(faiss_indices, chunked_files)

# Output the result
for index, text in result:
    print(f"Index: {index}, Text: {text}")

Amazing! You can see that our first couple of search results are not very different from our pure keyword search. But further down the results list, we get an equation we can use to calculate KNN. That's a bit more useful than #3 in our pure keyword search, which is a bibliography entry. That's likely because semantic search is in the mix too: the hybrid query fetches items with lots of domain-specific terms (keyword search), but also items that demonstrate the "aboutness" of KNN (semantic search).


# Let's take a closer look. For science!

Above, you can see the subtle ranking differences across each search type. For the most part, `document 8` is the top document, except in `hybrid_1`, `hybrid_2` and `semantic`. In those search types, `document 10` is the top document.

It's up to you and your stakeholders to find the ideal `alpha` for your use case(s).

For our use case, anything at or above `alpha=0.3` seems to give similar results, so the impact of `alpha` is most discernible between `0.0` and `0.3`.
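
One way to find the range where `alpha` matters is to sweep it and watch when the top-ranked documents stop changing. A minimal sketch, assuming you already have one dense and one sparse score per document. The scores, document names, and the helper `rank_for_alpha` are made up for illustration.

```python
from typing import Dict, List

# Hypothetical per-document scores (illustrative values only)
dense_scores: Dict[str, float] = {"doc_8": 0.62, "doc_10": 0.80, "doc_3": 0.40}
sparse_scores: Dict[str, float] = {"doc_8": 0.90, "doc_10": 0.30, "doc_3": 0.50}


def rank_for_alpha(alpha: float, top_k: int = 2) -> List[str]:
    """Rank documents by alpha * dense + (1 - alpha) * sparse, best first."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("Alpha must be between 0 and 1")
    combined = {
        doc: alpha * dense_scores[doc] + (1 - alpha) * sparse_scores[doc]
        for doc in dense_scores
    }
    return sorted(combined, key=combined.get, reverse=True)[:top_k]


# Sweep alpha from 0.0 (pure keyword) to 1.0 (pure semantic) in steps of 0.1
rankings = {a / 10: rank_for_alpha(a / 10) for a in range(11)}
for alpha, ranking in rankings.items():
    print(f"alpha={alpha:.1f} -> {ranking}")
```

Reading the printed rankings top to bottom shows the `alpha` value at which the top document flips, which is the region worth tuning.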

Cool!

# Incorporating GenAI

Now, hybrid search is cool enough, but what if you don't want to spend time sifting through your index's search results? What if you just want a single answer to a query?

That's where GenAI comes in.

We will make a retrieval augmented generation (RAG) pipeline that will make this happen.

Large language models (LLMs) are trained on general Internet text, so they do not know much specific information, especially when it lives in PDFs they never had access to (like the ones in our index). We need to give them this information!

We do this by first sending our query to our index and grabbing some search results. We then attach these search results to our original query and send *both* to the LLM. That way, the LLM knows what we want to ask it, can pull from its general knowledge store, *and* has a specialized knowledge store (our search results) so that it can get us extra-specific information.

Let's try it out:

In [None]:
def generate_augmented_queries(faiss_indices, chunked_files, query):
    """
    Given FAISS indices, build an augmented query: the retrieved chunks followed by the original query.

    :param faiss_indices: List of FAISS indices returned from a search.
    :param chunked_files: Dictionary containing chunked text files (currently unused;
        chunks are read from the flattened `all_texts_1` list).
    :param query: The original query to augment with context.

    :return: The augmented query string (context chunks, a separator, then the query).
    """
    # Retrieve the chunked text corresponding to the FAISS indices
    context = [all_texts_1[faiss_index] for faiss_index in faiss_indices]

    # Combine the context with the query in the format desired
    hybrid_augmented_query = "\n\n---\n\n".join(context) + "\n\n-----\n\n" + query

    # Return the augmented query
    return hybrid_augmented_query



# Dynamically provided FAISS indices (these could come from a FAISS search result)
faiss_indices = [0, 19, 8, 72, 0]


# Generate augmented queries
hybrid_augmented_query= generate_augmented_queries(faiss_indices, chunked_files, query)

# Output the results
print("Hybrid Augmented Query:")
print(hybrid_augmented_query)
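
The hard-coded index list above contains `0` twice, so the same chunk would appear twice in the prompt. Below is a minimal sketch of one way to clean search results before building the context, assuming results arrive as `(index, score)` pairs like those returned by `query_faiss_index`. The helper name `unique_valid_indices` is hypothetical. FAISS pads its result with `-1` when it finds fewer than `top_k` neighbors, so those entries are skipped too.

```python
from typing import Iterable, List, Tuple


def unique_valid_indices(results: Iterable[Tuple[int, float]], n_docs: int) -> List[int]:
    """Keep the first occurrence of each in-range index, preserving rank order."""
    seen = set()
    cleaned = []
    for idx, _score in results:
        idx = int(idx)  # FAISS returns numpy integers; cast for list indexing
        if idx < 0 or idx >= n_docs or idx in seen:
            continue
        seen.add(idx)
        cleaned.append(idx)
    return cleaned


# Illustrative results: a duplicate (0) and a -1 padding entry
example_results = [(0, 0.91), (19, 0.85), (8, 0.80), (-1, 0.0), (0, 0.91)]
print(unique_valid_indices(example_results, n_docs=100))
```

The cleaned list could then be passed as `faiss_indices` to `generate_augmented_queries`, with `n_docs=len(all_texts_1)`.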


In [None]:
'''# We are then going to combine this "context" with our original query in a format that our LLM likes:

hybrid_augmented_query = "\n\n---\n\n".join(hybrid_context)+"\n\n-----\n\n"+query
pure_keyword_augmented_query = "\n\n---\n\n".join(pure_keyword_context)+"\n\n-----\n\n"+query
pure_semantic_augmented_query = "\n\n---\n\n".join(pure_keyword_context)+"\n\n-----\n\n"+query'''

In [None]:
print(hybrid_augmented_query)

In [None]:
# We are then going to give our LLM some instructions for how to act:

query = 'What is the responsibility of each BSP Manager?'

primer = """You are a Q&A bot, a highly intelligent system that answers
user questions based on the information provided by the user above
each question. If the information cannot be found in the information
provided by the user, you truthfully say "I don't know, sorry".
"""

In [None]:
# Now we query our LLM with our augmented query & our primer!

# Our hybrid query:

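import os

# Prefer reading the key from the environment; 'yourkey' is only a placeholder fallback
openai.api_key = os.getenv("OPENAI_API_KEY", "yourkey")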


hybrid_res = openai.ChatCompletion.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": primer},
        {"role": "user", "content": hybrid_augmented_query}
    ]
)

hybrid_res

You can see subtle differences across the different results above. It's up to you and your stakeholders to figure out what type of search (semantic, keyword, hybrid) offers the most relevant information for your end users.

# What if we take out our Pinecone vectors altogether?
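
To test this, the same question can be sent to the model twice: once with the retrieved chunks prepended and once on its own. Here is a minimal sketch of building both message lists in the chat format used above. The helper `build_messages` and the sample strings are hypothetical, and no API call is made.

```python
from typing import Dict, List, Optional


def build_messages(primer: str, query: str, context: Optional[List[str]] = None) -> List[Dict[str, str]]:
    """Build chat messages; with context, chunks are joined and placed above the query."""
    if context:
        user_content = "\n\n---\n\n".join(context) + "\n\n-----\n\n" + query
    else:
        user_content = query  # baseline: the model sees only the question
    return [
        {"role": "system", "content": primer},
        {"role": "user", "content": user_content},
    ]


# Illustrative stand-ins for the notebook's primer, query, and retrieved chunks
sample_primer = "You are a Q&A bot."
sample_query = "What is COAM?"
sample_chunks = ["chunk one text", "chunk two text"]

rag_messages = build_messages(sample_primer, sample_query, sample_chunks)
baseline_messages = build_messages(sample_primer, sample_query)

print(rag_messages[1]["content"])
print(baseline_messages[1]["content"])
```

Either list can be passed as `messages` to `openai.ChatCompletion.create`, so the two answers can be compared side by side.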

In [None]:

# Get the response
answer = hybrid_res['choices'][0]['message']['content']
print(answer)

In [None]:
print(query)

print(answer[:89])  # First part of the answer
print(answer[89:])  # Second part of the answer
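
Slicing at a fixed character offset such as `89` only suits one particular answer and can cut a word in half. Below is a small sketch that uses the standard library's `textwrap` to wrap any answer at word boundaries. The sample string and the width of 90 are arbitrary choices for illustration.

```python
import textwrap

# Stand-in for the LLM answer (illustrative text only)
sample_answer = (
    "Hybrid search combines dense embeddings, which capture meaning, with sparse "
    "embeddings, which capture exact terms, and ranks documents using both signals."
)

# Wrap at word boundaries so no line exceeds 90 characters
wrapped = textwrap.fill(sample_answer, width=90)
print(wrapped)
```

In the notebook, `textwrap.fill(answer, width=90)` would replace the two slicing calls.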


In [None]:
query = "What is COAM?"

query_dembedding = produce_embeddings([query])
results = query_faiss_index(f_index, query_dembedding, 5)

In [None]:
results

We can see that RAG really does have a huge impact! Without our PDFs, ChatGPT doesn't know much helpful detail at all! Nor can it give us bibliographic data for articles we might want to look up later!

In [None]:
all_texts_1[83]

In [None]:
faiss_indices = [83,10,67,22,52]


# Generate augmented queries
hybrid_augmented_query= generate_augmented_queries(faiss_indices, chunked_files, query)

# Output the results
print("Hybrid Augmented Query:")
print(hybrid_augmented_query)


In [None]:
hybrid_res = openai.ChatCompletion.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": primer},
        {"role": "user", "content": hybrid_augmented_query}
    ]
)

hybrid_res
# Get the response
answer = hybrid_res['choices'][0]['message']['content']
print(answer)

In [None]:
print(query)

print(answer[:90])  # First part of the answer
print(answer[90:])  # Second part of the answer
