# <a id='toc1_'></a>[Retrieval](#toc0_)

### <a id='toc1_1_1_'></a>[Why RAG ?](#toc0_)

LLMs are usually good for answering questions based on their knowledge from training data. However, for more complex and knowledge-intensive tasks, it's possible to build a language model-based system that accesses external knowledge sources to complete tasks. This enables more factual consistency, improves reliability of the generated responses, and helps to mitigate the problem of "hallucination".

RAG (retrieval-augmented generation) takes an input and retrieves a set of relevant supporting documents from a given source (e.g., Wikipedia). The documents are concatenated as context with the original input prompt and fed to the text generator, which produces the final output. This makes RAG adaptive to situations where facts evolve over time, which is very useful because an LLM's parametric knowledge is static. RAG allows language models to bypass retraining and access the latest information for generating reliable outputs via retrieval-based generation.

When building your Q&A chatbot, you may want to give it knowledge about your company, for example. Here we want to build a chatbot that can answer questions about the work already done at Artefact and point us to the most relevant documents we have already produced.

Here is what the process looks like:

![Alt Text](https://miro.medium.com/v2/resize:fit:1400/1*UyhiO87T-hejRhqI7EwvgA.png)

### <a id='toc1_1_2_'></a>[Chunking](#toc0_)

In this notebook, we will explore the Retrieval part of the pipeline. The retriever is the component that searches the vector store for the chunks that match the user query and extracts them so that they can be fed to the LLM as context. The retriever is the core of the RAG architecture and uses a similarity analysis to identify the relevant embedded chunks within the vector space.

Here, we will showcase basic retrieval, and then study three more advanced ways to extract the relevant context.

**Table of contents**<a id='toc0_'></a>    
- [Retrieval](#toc1_)    
    - [Why RAG ?](#toc1_1_1_)    
    - [Chunking](#toc1_1_2_)    
- [Setup](#toc2_)    
- [Strategies](#toc3_)    
  - [Context extension Retriever](#toc3_1_)    
  - [Multi-query Retriever](#toc3_2_)    
  - [Self-query Retriever](#toc3_3_)    
- [To go the extra mile - Combination of retrieval methods](#toc4_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc2_'></a>[Setup](#toc0_)

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import os
from pathlib import Path

from dotenv import load_dotenv

os.chdir(Path.cwd().joinpath(".."))
print(Path.cwd())
load_dotenv(override=True)

In [None]:
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.ensemble import EnsembleRetriever
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.schema import Document
from langchain_chroma import Chroma
from langchain_community.retrievers import BM25Retriever

from lib.models import embeddings, llm
from lib.utils import (
    build_vector_store,
    drop_document_duplicates,
    get_all_documents,
    load_documents,
    load_vector_store,
    split_documents_basic,
)

The basic retriever simply embeds the query using the same embedding model as the chunks, evaluates the similarity between the embedded query and the vectorized chunks, and returns the chunks with the highest similarity.

The important parameters for a basic retriever are the chosen similarity measure (Chroma's default is L2 distance, but cosine similarity is also commonly used) and the number of retrieved chunks, denoted k.
However, the similarity measure needs to be specified when building the vector store, as it influences how the vectors are positioned.
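To make the difference between the two measures concrete, here is a minimal sketch in plain Python. The vectors are made up for illustration, and the helper functions are not part of Chroma or LangChain. It shows that the two measures can rank the same candidates differently: L2 distance is sensitive to vector length, while cosine distance only looks at direction.

```python
import math


def l2_distance(x: list[float], y: list[float]) -> float:
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_distance(x: list[float], y: list[float]) -> float:
    """Cosine distance, i.e. 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


# Toy 2-D "embeddings" (illustrative values only).
query = [1.0, 0.0]
same_direction_longer = [3.0, 0.0]  # same direction as the query, larger norm
slightly_rotated = [1.0, 0.5]       # similar norm, slightly different direction

l2_long = l2_distance(query, same_direction_longer)
l2_rot = l2_distance(query, slightly_rotated)
cos_long = cosine_distance(query, same_direction_longer)
cos_rot = cosine_distance(query, slightly_rotated)

print(f"L2:     longer={l2_long:.3f}  rotated={l2_rot:.3f}")
print(f"cosine: longer={cos_long:.3f}  rotated={cos_rot:.3f}")
```

With L2 the rotated vector is the closest candidate, whereas with cosine distance the longer vector pointing in the same direction wins. When embeddings are normalized to unit length, the two measures produce the same ranking.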

We build our simple vector store from the two PDF documents we use, with cosine similarity, which is commonly preferred for high-dimensional embeddings.

In [None]:
BASE_CHUNK_SIZE = 512

# build vector_store
base_documents = split_documents_basic(load_documents("data/3_docs"), BASE_CHUNK_SIZE, include_linear_index=True)

build_vector_store(
    base_documents,
    embeddings,
    collection_name="3_docs",
    distance_function="cosine",
    erase_existing=True,
)

# Load Vector store / retriever
chroma_vector_store = load_vector_store(embeddings, "3_docs")
chroma_vector_store_retriever = chroma_vector_store.as_retriever()

In [None]:
test_query = "How can cloud adoption help financial institutions"
embedded_test_query = embeddings.embed_query(test_query)
embedded_test_query[:5]

Chroma encapsulates the whole process in a simple search method that takes text as input, but for the sake of explanation we will use a more detailed approach. We will use k=5 and display the score of each chunk.

The scoring method used here is cosine distance, which is calculated as `1 - cosine_similarity(x, y)`, with `cosine_similarity(x, y) = dot(x, y) / (norm(x) * norm(y))`.

This score can take values in a [0,2] range, with smaller scores indicating more similarity.
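As a small worked example of this score, the sketch below computes the cosine distance in plain Python and uses it to pick the k closest chunks. The toy vectors, the chunk names and the `top_k` helper are illustrative assumptions; this is not how Chroma implements its search internally.

```python
import heapq
import math


def cosine_distance(x: list[float], y: list[float]) -> float:
    """1 - dot(x, y) / (norm(x) * norm(y))."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


# The three reference points of the [0, 2] range.
identical = cosine_distance([1.0, 2.0], [1.0, 2.0])    # same direction
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])   # unrelated directions
opposite = cosine_distance([1.0, 2.0], [-1.0, -2.0])   # opposite directions


def top_k(query_vector: list[float], chunk_vectors: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Return the k (name, distance) pairs with the smallest cosine distance."""
    scored = [(name, cosine_distance(query_vector, vec)) for name, vec in chunk_vectors.items()]
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


# Hypothetical 2-D embeddings standing in for real chunk vectors.
toy_chunks = {
    "a": [1.0, 0.1],
    "b": [0.0, 1.0],
    "c": [-1.0, 0.0],
    "d": [0.9, 0.4],
}
best = top_k([1.0, 0.0], toy_chunks, k=2)
for name, distance in best:
    print(f"{name}: {distance:.4f}")
```

Identical directions give a distance close to 0, orthogonal vectors give 1, and opposite vectors give 2, which matches the range stated above.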

Here we extract the chunks using two methods. The first one, which also retrieves the score, takes the query itself as input and embeds it automatically using the model specified when loading/building the vector store. The second method is more low-level: it takes a vector corresponding to the embedded query and returns the similar chunks. However, it does not provide the scores.

In [None]:
retrieved_chunks_basic_wscore = chroma_vector_store.similarity_search_with_score(
    test_query,
    k=5,
)

retrieved_chunks_basic = chroma_vector_store.similarity_search_by_vector(
    embedding=embedded_test_query,
    k=5,
)

for chunk in retrieved_chunks_basic_wscore:
    print(
        f"""SCORE:
        {chunk[1]}

        CHUNK:
        {chunk[0].page_content}
        """
    )

# <a id='toc3_'></a>[Strategies](#toc0_)

## <a id='toc3_1_'></a>[Context extension Retriever](#toc0_)

This retrieval strategy is a good starting point for augmenting the context used to answer the query. The basic premise is to retrieve the chunks that are linked, in the document structure, to the chunks fetched by similarity. The assumption is that such chunks can also contain bits of relevant information.

A common and simple way to implement this is to add the adjacent chunks, i.e. for each chunk retrieved by similarity, to also retrieve the chunks corresponding to the text right before and after it. For this case, we want the overlap between the chunks to be limited, as a large overlap would defeat the purpose. Here we have around 10% overlap, which is fine.

This is simple to implement if the order of the documents was encoded into the chunk metadata (which we did when creating our vector store):

In [None]:
def extract_indexed_chunk(vector_store: Chroma, source: str, index: int) -> Document:
    """Extracts a chunk from the vector_store based on a linear index and a document source.

    Args:
        vector_store (Chroma): chroma vector store
        source (str): file name
        index (int): linear index corresponding to chunk
    """
    try:
        indexed_chunk = vector_store.get(
            where={
                "$and": [
                    {"source": source},
                    {"linear_index": index},
                ]
            }
        )
        return Document(
            page_content=indexed_chunk["documents"][0],
            metadata=indexed_chunk["metadatas"][0],
        )
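    except IndexError as err:
        # No chunk matches this (source, index) pair, e.g. at a document boundary.
        raise IndexError(f"No chunk with linear_index={index} in source {source!r}") from err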


def retrieve_adjacent_chunks(retrieved_chunks: list[Document], vector_store: Chroma) -> list[Document]:
    """From a list of chunks, augments list with adjacent chunks if they exist.

    Args:
        retrieved_chunks (list[Document]): list of chunks
        vector_store (Chroma): chroma vector store

    Returns:
        list[Document]: augmented list of chunks
    """
    retrieved_chunks_with_adjacents = []
    for chunk in retrieved_chunks:
        try:
            retrieved_chunks_with_adjacents.append(
                extract_indexed_chunk(
                    vector_store,
                    source=chunk.metadata["source"],
                    index=chunk.metadata["linear_index"] - 1,
                )
            )
        except IndexError:
            pass
        retrieved_chunks_with_adjacents.append(chunk)
        try:
            retrieved_chunks_with_adjacents.append(
                extract_indexed_chunk(
                    vector_store,
                    source=chunk.metadata["source"],
                    index=chunk.metadata["linear_index"] + 1,
                )
            )
        except IndexError:
            pass
    return drop_document_duplicates(retrieved_chunks_with_adjacents)

Note that a persisted vector store can be loaded without being rebuilt by using the Chroma class with a persistent directory. Here we reuse the store loaded earlier.

We then retrieve the adjacent chunks:

In [None]:
adj_chunks_list = retrieve_adjacent_chunks(retrieved_chunks_basic, chroma_vector_store)

adj_chunks_list_unique = drop_document_duplicates(adj_chunks_list)

print(len(adj_chunks_list))
print(len(adj_chunks_list_unique))
adj_chunks_list_unique[:10]

## <a id='toc3_2_'></a>[Multi-query Retriever](#toc0_)

A more advanced and quite common retriever augmentation technique is the multi-query retriever. The idea is to use an LLM to reformulate the user question in several different ways, then perform the retrieval for each new query and return all unique chunks.

This makes the retrieval depend less on the form of the query (syntax, vocabulary...) and more on its semantic meaning. LangChain includes this functionality, allowing a retriever to be initialized from an LLM and a vector store.
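The core of the idea can be sketched in a few lines of plain Python. In this simplified illustration, the reformulated queries are hard-coded (in the real retriever they come from the LLM), and `fake_retrieve` with its lookup table is a hypothetical stand-in for the vector store retriever. It is not LangChain's implementation.

```python
from typing import Callable


def multi_query_retrieve(queries: list[str], retrieve: Callable[[str], list[str]]) -> list[str]:
    """Run every query through `retrieve` and return the unique chunks in first-seen order."""
    seen: set[str] = set()
    unique_chunks: list[str] = []
    for query in queries:
        for chunk in retrieve(query):
            if chunk not in seen:
                seen.add(chunk)
                unique_chunks.append(chunk)
    return unique_chunks


# Hypothetical retrieval results: each query maps to the chunks it would fetch.
FAKE_INDEX = {
    "How can cloud adoption help financial institutions": ["chunk_1", "chunk_2"],
    "What benefits does the cloud bring to banks?": ["chunk_2", "chunk_3"],
    "Why should financial firms move to the cloud?": ["chunk_3", "chunk_4"],
}


def fake_retrieve(query: str) -> list[str]:
    """Stand-in for a vector store retriever."""
    return FAKE_INDEX.get(query, [])


# Original question plus two reformulations, all retrieved and merged.
all_chunks = multi_query_retrieve(list(FAKE_INDEX), fake_retrieve)
print(all_chunks)
```

Each reformulation surfaces chunks that the original wording alone would have missed, and the deduplication keeps the final context from containing the same chunk twice.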

We configure logging in order to display the new queries generated by the multi-query retriever.

In [None]:
import logging

logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

In [None]:
mq_retriever = MultiQueryRetriever.from_llm(
    retriever=chroma_vector_store.as_retriever(search_kwargs={"k": 5}),
    llm=llm,
    include_original=True,
)

The parameter for controlling the number of reformulated queries is within the prompt itself. The default Langchain prompt specifies three reformulations.

We can see an example of retrieved documents here:

In [None]:
mq_retriever.invoke(test_query)

## <a id='toc3_3_'></a>[Self-query Retriever](#toc0_)

A lesser used but still interesting method is the self-query retriever. This technique uses an LLM to filter on the metadata based on the query, when the query contains information that maps to it. In our case, the semantic information in the metadata is limited, but the technique can work well with richer metadata. We can still showcase an example where the SQ retriever filters on the available metadata, such as the source document or the page number.
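To illustrate what this filtering amounts to, here is a minimal sketch in plain Python. The `Chunk` and `StructuredQuery` classes, the file names and the hard-coded filter are all hypothetical: in the real retriever, an LLM derives the filter from the user question, and the filtered chunks are then ranked by similarity.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


@dataclass
class StructuredQuery:
    """Hypothetical output of the LLM step: a free-text query plus metadata equality filters."""
    query: str
    filters: dict = field(default_factory=dict)


def apply_metadata_filter(chunks: list[Chunk], structured: StructuredQuery) -> list[Chunk]:
    """Keep only the chunks whose metadata matches every filter."""
    return [
        chunk
        for chunk in chunks
        if all(chunk.metadata.get(key) == value for key, value in structured.filters.items())
    ]


# Illustrative chunks with made-up file names.
chunks = [
    Chunk("Cloud adoption lowers infrastructure costs.", {"source": "finance_report.pdf", "page": 6}),
    Chunk("Data quality requires clear ownership.", {"source": "data_quality.pdf", "page": 6}),
    Chunk("Banks can scale analytics in the cloud.", {"source": "finance_report.pdf", "page": 7}),
]

# Hard-coded here; an LLM would produce this from a question mentioning the finance report.
structured = StructuredQuery(query="cloud adoption benefits", filters={"source": "finance_report.pdf"})
candidates = apply_metadata_filter(chunks, structured)
print([chunk.text for chunk in candidates])
```

Only the chunks from the requested source remain as candidates, so the similarity search no longer competes with chunks from unrelated documents.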

In [None]:
sq_test_query = "In the Data for Finance Report, how can cloud adoption help financial institutions"

For this retriever to work, we need to add some context, notably specifying what the metadata and content correspond to:

In [None]:
metadata_description = [
    AttributeInfo(
        name="linear_index",
        description="The order of the document within the greater corpus",
        type="integer",
    ),
    AttributeInfo(
        name="page",
        description="The page number of the document",
        type="integer",
    ),
    AttributeInfo(
        name="source",
        description="The source file from which the document was extracted",
        type="string",
    ),
]

content_description = "Reports on application of data-driven techniques (including AI/ML) to the financial sector, as well as convictions on data quality management"

We can now initialize and invoke our retriever

In [None]:
sf_retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=chroma_vector_store,
    metadata_field_info=metadata_description,
    document_contents=content_description,
)
sf_query = "Do a summary of the page 6"
results = sf_retriever.invoke(sf_query)
for chunk in results:
    print(f"\npage N°{chunk.metadata['page']} : {chunk.page_content[0:100].strip()}...")

# <a id='toc4_'></a>[To go the extra mile - Combination of retrieval methods](#toc0_)

Finally, we will showcase a custom retriever that combines multiple methods. In this case, we will integrate BM25, a ranking function that scores texts with a term-based similarity metric unrelated to any embedding model (think of it as close to TF-IDF). We will use it in combination with a multi-query retriever to refine our retrieved chunks. Then, we will augment the retrieved chunks with their neighbors to get a refined set of chunks.
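For intuition on how a BM25-style score works, here is a minimal sketch in plain Python. It uses a common Okapi BM25 formulation with a smoothed IDF, whitespace tokenization, and the conventional parameters `k1=1.5` and `b=0.75`. All of these are assumptions made for illustration, and the exact formula used by the library behind `BM25Retriever` may differ.

```python
import math
from collections import Counter


def bm25_scores(query: str, documents: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with an Okapi BM25 variant."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Rare terms get a higher weight than terms found in most documents.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term frequency saturates with k1 and is normalized by document length with b.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores


docs = [
    "cloud adoption helps banks scale analytics",
    "data quality management needs clear ownership",
    "banks invest in cloud infrastructure and cloud security",
]
scores = bm25_scores("cloud banks", docs)
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc}")
```

The second document shares no word with the query and scores zero, however close its meaning might be. This is the word-based behavior that complements embedding-based retrieval.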

The goal of combining a non-embedding retriever (BM25) with an embedding-based one is to leverage the strengths of both. The former is good at capturing word-based similarity, while the latter captures semantic similarity. We use the LangChain EnsembleRetriever class to create a combination retriever. This class integrates the Reciprocal Rank Fusion algorithm to rank the combined chunks, prioritizing those that are returned by both methods.
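The fusion step can be sketched as follows in plain Python. This is a simplified, weighted version of Reciprocal Rank Fusion. The constant `c=60` is the value conventionally used for this algorithm and is assumed here, and the chunk identifiers are made up. It is not a copy of the EnsembleRetriever code.

```python
from collections import defaultdict


def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse several ranked lists: each item scores weight / (rank + c) per list it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical results from a keyword retriever and an embedding retriever.
keyword_ranking = ["chunk_a", "chunk_b", "chunk_c"]
embedding_ranking = ["chunk_d", "chunk_b", "chunk_a"]

fused = weighted_rrf([keyword_ranking, embedding_ranking], weights=[0.5, 0.5])
print(fused)
```

The chunks returned by both retrievers (`chunk_a` and `chunk_b`) end up ahead of `chunk_d`, even though `chunk_d` was ranked first by one of the retrievers.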

For the BM25 retriever, we cannot invoke the vector store directly, as it simply takes the list of documents before embedding. We have developed a function to get this list from the vector database, without having to reload the raw documents.

In [None]:
document_list = get_all_documents(chroma_vector_store)

bm25_retriever = BM25Retriever.from_documents(document_list)
bm25_retriever.k = 5

mq_retriever = MultiQueryRetriever.from_llm(
    retriever=chroma_vector_store.as_retriever(search_kwargs={"k": 5}),
    llm=llm,
    include_original=True,
)

We can now create our ensemble retriever and invoke it:

In [None]:
combined_retriever = EnsembleRetriever(retrievers=[bm25_retriever, mq_retriever], weights=[0.5, 0.5])

combined_chunks = combined_retriever.invoke(test_query)

Finally, we also retrieve the adjacent chunks to get the extended context:

In [None]:
combination_chunks = retrieve_adjacent_chunks(combined_chunks, chroma_vector_store)

combination_chunks