In [None]:
%load_ext autoreload
%autoreload 2


In this notebook, I will systematically compare LlamaBot's `QueryBot` class
using multiscale vs. single-scale embeddings.

What are multiscale embeddings all about?
To understand this, I need to walk back and define RAG: Retrieval Augmented Generation.
RAG is a technique used in the context of LLMs to synthesize responses to queries.
At its most basic, RAG begins by splitting a document into smaller chunks of text,
embedding each chunk into a vector space,
and storing the embeddings inside a vector storage system.
Then, we take in a human's query, embed it,
and search for the `k` chunks whose embeddings are most similar to the query's,
typically measured by cosine similarity.
Finally, those `k` chunks (or a subset of them) are stuffed into the LLM context
for the LLM to use when generating a response.
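
To make the retrieval step concrete, here is a minimal sketch of top-`k` search by cosine similarity.
It is not how `QueryBot` is implemented:
`toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model,
and the chunks are made-up toy sentences rather than text from any paper.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks whose embeddings are most similar to the query's."""
    query_vec = toy_embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, toy_embed(chunk)),
        reverse=True,
    )
    return ranked[:k]


# Toy chunks, purely for illustration.
chunks = [
    "the library opens at nine every morning",
    "bread is baked fresh in the kitchen",
    "trains depart from platform four",
]
top = retrieve_top_k("when does the library open", chunks, k=1)
print(top)
```

In a real RAG system, the embedding function would be a learned model
and the sorted scan would be replaced by a vector store lookup,
but the ranking logic is the same.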


Now, queries to documents can sometimes be small in nature,
involving individual facts about the document
that can be stored within a small context.
For such queries, retrieving a large fragment of the document based on embeddings
and stuffing it all into a context window is quite wasteful and limiting.
On the other hand, some queries about documents can be quite large,
needing to cover multiple sections of the document in one shot.
In this case, if our embeddings' source text chunk sizes are large,
we may not be able to stuff the full set of associated chunks
into the language model context.
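
A quick back-of-the-envelope calculation illustrates the trade-off.
The context budget below is a hypothetical number, not a property of any particular model,
and budget and chunk size are assumed to be measured in the same unit (tokens or characters).

```python
def chunks_that_fit(context_budget: int, chunk_size: int) -> int:
    """Number of whole chunks of a given size that fit into the context budget."""
    return context_budget // chunk_size


CONTEXT_BUDGET = 4000  # hypothetical budget reserved for retrieved text

for chunk_size in [200, 2000, 5000]:
    n = chunks_that_fit(CONTEXT_BUDGET, chunk_size)
    print(f"chunk size {chunk_size}: {n} chunk(s) fit")
```

With small chunks, many distinct passages fit but each carries little context;
with large chunks, only a few fit, and the largest may not fit at all.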


One idea that came to my mind was to try _multiscale embeddings_.
What do I mean here?
By that, I mean chunking the same source text at several different sizes.
For example, if the source text is a book,
we can chunk it into chapters, paragraphs, and sentences.
Then, we can embed each of these chunks into a vector space.
When performing a similarity search, 
the nature of the query should naturally dictate 
whether a larger chunk needs to be retrieved
or if a smaller chunk is needed, or both.
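
Here is a minimal sketch of what multiscale chunking could look like.
It assumes simple, non-overlapping, character-based splitting;
`chunk_text` and `multiscale_chunks` are hypothetical helpers for illustration,
and `QueryBot` may well split differently (for example, by tokens or with overlap).

```python
def chunk_text(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive, non-overlapping chunks of at most chunk_size characters."""
    return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]


def multiscale_chunks(text: str, chunk_sizes: list[int]) -> list[tuple[int, str]]:
    """Chunk the same text once per size and pool all (size, chunk) pairs into one list."""
    pooled = []
    for size in chunk_sizes:
        pooled.extend((size, chunk) for chunk in chunk_text(text, size))
    return pooled


text = "x" * 1000  # placeholder document of 1,000 characters
pooled = multiscale_chunks(text, chunk_sizes=[200, 500])
print(len(pooled))
```

Every chunk in the pooled list would then be embedded and stored in the same index,
so that a similarity search can return small and large chunks side by side.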


But does this idea work?
That's what I'd like to evaluate in this notebook.
To test the idea, I will use `QueryBot`'s latest configuration,
in which I implemented both multiscale and single-scale embeddings.

In [None]:
from llamabot import QueryBot


The document that we will work with is the paper "The simplicity of protein sequence-function relationships". It's long enough and complex enough, but also not too expensive to process. First, I need to load my Zotero library:

In [None]:
from pathlib import Path

from llamabot.zotero.library import ZoteroLibrary

ZOTERO_JSON_DIR = Path.home() / ".llamabot/zotero/zotero_index/"
library = ZoteroLibrary(json_dir=ZOTERO_JSON_DIR)  # this is my pre-cached library


Then we download the PDF and load it into two separate instances of `QueryBot`: one with a single chunk size and one with multiple chunk sizes.

In [None]:
paper_key = library.key_title_map(inverse=True)[
    "The simplicity of protein sequence-function relationships"
]
entry = library[paper_key]
fpath = entry.download_pdf(Path("/tmp"))

docbot_single = QueryBot(
    "You are an expert in answering questions about a paper.",
    doc_paths=[fpath],
    chunk_sizes=[2000],
)

docbot_multi = QueryBot(
    "You are an expert in answering questions about a paper.",
    doc_paths=[fpath],
    chunk_sizes=[200, 500, 1000, 2000, 5000],
)


In [None]:
prompt1 = "Please summarize this paper for me."
_ = docbot_single(prompt1)


In [None]:
_ = docbot_multi(prompt1)


In [None]:
# Lengths (in characters) of the chunks retrieved by the single-scale bot for prompt1.
sources_single = docbot_single.source_nodes[prompt1]
for node in sources_single:
    print(len(node.node.text))


In [None]:
# Lengths (in characters) of the chunks retrieved by the multiscale bot for prompt1.
sources_multi = docbot_multi.source_nodes[prompt1]
for node in sources_multi:
    print(len(node.node.text))


In [None]:
prompt2 = "Please help me explain first and second order effects."

_ = docbot_single(prompt2)


In [None]:
_ = docbot_multi(prompt2)


In [None]:
# Lengths (in characters) of the chunks retrieved by the single-scale bot for prompt2.
sources_single = docbot_single.source_nodes[prompt2]
for node in sources_single:
    print(len(node.node.text))
