## Dense X Retrieval: Propositions as retrieval units for a RAG App


Dense retrieval has emerged as a crucial method for obtaining relevant context or knowledge in open-domain NLP tasks. However, the choice of the retrieval unit, i.e., the pieces of text in which the corpus is indexed, such as a document, passage, or sentence, is often overlooked when a learned dense retriever is applied to a retrieval corpus at inference time. Chen et al. found that the choice of retrieval unit significantly influences the performance of both retrieval and downstream tasks.

This notebook, which joins the pieces of code from the LangChain template, demonstrates the multi-vector indexing strategy proposed by Chen et al. in [Dense X Retrieval: What Retrieval Granularity Should We Use?](https://arxiv.org/abs/2312.06648). The prompt directs an LLM to generate de-contextualized "propositions", which can be vectorized to increase retrieval accuracy.



### Load the API Keys

In [1]:
from dotenv import load_dotenv

# Load the environment variables
load_dotenv()

True

## Define the Propositional Chain to build the index

In [5]:
from langchain.output_parsers.openai_tools import JsonOutputToolsParser
from langchain_community.chat_models import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda




Set the prompt to extract the propositions as units for retrieval

In [6]:
# Modified from the paper to be more robust to benign prompt injection
# https://arxiv.org/abs/2312.06648
# @misc{chen2023dense,
#       title={Dense X Retrieval: What Retrieval Granularity Should We Use?},
#       author={Tong Chen and Hongwei Wang and Sihao Chen and Wenhao Yu and Kaixin Ma
#               and Xinran Zhao and Hongming Zhang and Dong Yu},
#       year={2023},
#       eprint={2312.06648},
#       archivePrefix={arXiv},
#       primaryClass={cs.CL}
# }
PROMPT = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            """Decompose the "Content" into clear and simple propositions, ensuring they are interpretable out of
context.
1. Split compound sentence into simple sentences. Maintain the original phrasing from the input
whenever possible.
2. For any named entity that is accompanied by additional descriptive information, separate this
information into its own distinct proposition.
3. Decontextualize the proposition by adding necessary modifier to nouns or entire sentences
and replacing pronouns (e.g., "it", "he", "she", "they", "this", "that") with the full name of the
entities they refer to.
4. Present the results as a list of strings, formatted in JSON.

Example:

Input: Title: Ēostre. Section: Theories and interpretations, Connection to Easter Hares. Content:
The earliest evidence for the Easter Hare (Osterhase) was recorded in south-west Germany in
1678 by the professor of medicine Georg Franck von Franckenau, but it remained unknown in
other parts of Germany until the 18th century. Scholar Richard Sermon writes that "hares were
frequently seen in gardens in spring, and thus may have served as a convenient explanation for the
origin of the colored eggs hidden there for children. Alternatively, there is a European tradition
that hares laid eggs, since a hare’s scratch or form and a lapwing’s nest look very similar, and
both occur on grassland and are first seen in the spring. In the nineteenth century the influence
of Easter cards, toys, and books was to make the Easter Hare/Rabbit popular throughout Europe.
German immigrants then exported the custom to Britain and America where it evolved into the
Easter Bunny."
Output: [ "The earliest evidence for the Easter Hare was recorded in south-west Germany in
1678 by Georg Franck von Franckenau.", "Georg Franck von Franckenau was a professor of
medicine.", "The evidence for the Easter Hare remained unknown in other parts of Germany until
the 18th century.", "Richard Sermon was a scholar.", "Richard Sermon writes a hypothesis about
the possible explanation for the connection between hares and the tradition during Easter", "Hares
were frequently seen in gardens in spring.", "Hares may have served as a convenient explanation
for the origin of the colored eggs hidden in gardens for children.", "There is a European tradition
that hares laid eggs.", "A hare’s scratch or form and a lapwing’s nest look very similar.", "Both
hares and lapwing’s nests occur on grassland and are first seen in the spring.", "In the nineteenth
century the influence of Easter cards, toys, and books was to make the Easter Hare/Rabbit popular
throughout Europe.", "German immigrants exported the custom of the Easter Hare/Rabbit to
Britain and America.", "The custom of the Easter Hare/Rabbit evolved into the Easter Bunny in
Britain and America."]""",  # noqa
        ),
        ("user", "Decompose the following:\n{input}"),
    ]
)


Create the propositional chain

In [7]:
def get_propositions(tool_calls: list) -> list:
    if not tool_calls:
        raise ValueError("No tool calls found")
    return tool_calls[0]["args"]["propositions"]


def empty_proposals(x):
    # Model couldn't generate proposals
    return []


proposition_chain = (
    PROMPT
    | ChatOpenAI(model="gpt-3.5-turbo-16k").bind(
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "decompose_content",
                    "description": "Return the decomposed propositions",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "propositions": {
                                "type": "array",
                                "items": {"type": "string"},
                            }
                        },
                        "required": ["propositions"],
                    },
                },
            }
        ],
        tool_choice={"type": "function", "function": {"name": "decompose_content"}},
    )
    | JsonOutputToolsParser()
    | get_propositions
).with_fallbacks([RunnableLambda(empty_proposals)])
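
To make the data flow concrete, the sketch below feeds `get_propositions` a hand-written payload in the shape that `JsonOutputToolsParser` is assumed to produce here (a list of dicts, each with an `args` entry). The payload contents are illustrative, not real model output. The helper `safe_get_propositions` is hypothetical: it only approximates the empty-list fallback with a plain `try`/`except`, whereas the chain above relies on `with_fallbacks`.

```python
def get_propositions(tool_calls: list) -> list:
    # Same logic as the notebook cell, repeated so this sketch runs on its own.
    if not tool_calls:
        raise ValueError("No tool calls found")
    return tool_calls[0]["args"]["propositions"]


def safe_get_propositions(tool_calls: list) -> list:
    # Hypothetical helper: approximates the chain's fallback to an empty list.
    try:
        return get_propositions(tool_calls)
    except (ValueError, KeyError, IndexError):
        return []


# Hand-written example payload (assumed shape, not real model output).
example_tool_calls = [
    {
        "type": "decompose_content",
        "args": {
            "propositions": [
                "Georg Franck von Franckenau was a professor of medicine.",
                "Richard Sermon was a scholar.",
            ]
        },
    }
]

print(safe_get_propositions(example_tool_calls))
print(safe_get_propositions([]))
```

Returning an empty list instead of raising means a chunk that the model fails to decompose simply contributes no propositions to the index, rather than aborting the whole batch.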


## Define the components of the Main RAG Chain

This section builds the components of the RAG chain and then creates it.

In [17]:
from pathlib import Path

from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import LocalFileStore
from langchain_community.vectorstores import Chroma
from langchain_core.runnables import RunnablePassthrough
from langchain_core.load import load
from langchain_core.output_parsers import StrOutputParser
from langchain_core.pydantic_v1 import BaseModel



In [9]:
DOCSTORE_ID_KEY = "doc_id"
DOCSTORE_DIR = "."

Define the retriever for the RAG chain

In [18]:
def get_multi_vector_retriever(docstore_id_key: str, collection_name: str):
    """Create the composed retriever object."""
    vectorstore = Chroma(
        collection_name=collection_name,
        persist_directory=str(Path(DOCSTORE_DIR) / "chroma_db_proposals"),
        embedding_function=OpenAIEmbeddings(),
    )
    store = LocalFileStore(
        str(Path(DOCSTORE_DIR) / "multi_vector_retriever_metadata")
    )
    return MultiVectorRetriever(
        vectorstore=vectorstore,
        byte_store=store,
        id_key=docstore_id_key,
    )
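
The idea behind a multi-vector retriever can be sketched without any external services: search over the small units (propositions), then return the larger parent chunks they point to. The toy below is an approximation under stated assumptions, not the `MultiVectorRetriever` implementation. The bag-of-words `embed` function stands in for a dense embedding model, and all data, ids, and helper names are hypothetical.

```python
import math
from collections import Counter

DOC_ID_KEY = "doc_id"  # mirrors DOCSTORE_ID_KEY in this notebook

# Hypothetical parent chunks (the role played by the docstore).
parent_store = {
    "chunk-a": "The Transformer relies entirely on self-attention. It uses no recurrence.",
    "chunk-b": "Convolutional networks process grid-like data such as images.",
}

# Hypothetical propositions (the role played by the vector store),
# each tagged with the id of the chunk it was extracted from.
proposition_index = [
    ("The Transformer relies entirely on self-attention.", {DOC_ID_KEY: "chunk-a"}),
    ("The Transformer uses no recurrence.", {DOC_ID_KEY: "chunk-a"}),
    ("Convolutional networks process grid-like data.", {DOC_ID_KEY: "chunk-b"}),
]


def embed(text: str) -> Counter:
    # Stand-in for a dense embedding: a bag-of-words count vector.
    return Counter(text.lower().replace(".", "").replace("?", "").split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, k: int = 2) -> list:
    # 1. Rank propositions by similarity to the query.
    q = embed(query)
    ranked = sorted(
        proposition_index, key=lambda item: cosine(q, embed(item[0])), reverse=True
    )
    # 2. Map the top-k propositions to parent ids, dropping duplicates in order.
    parent_ids = []
    for _, metadata in ranked[:k]:
        if metadata[DOC_ID_KEY] not in parent_ids:
            parent_ids.append(metadata[DOC_ID_KEY])
    # 3. Return the parent chunks, not the propositions themselves.
    return [parent_store[pid] for pid in parent_ids]


print(retrieve("Does the Transformer use recurrence?"))
```

Both top-ranked propositions in this query belong to the same parent, so only one chunk is returned. Matching happens at proposition granularity while the generator still sees the full chunk.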

Define the RAG chain

In [19]:
def format_docs(docs: list) -> str:
    loaded_docs = [load(doc) for doc in docs]
    return "\n".join(
        [
            f"<Document id={i}>\n{doc.page_content}\n</Document>"
            for i, doc in enumerate(loaded_docs)
        ]
    )

def rag_chain(retriever):
    """
    The RAG chain

    :param retriever: A retriever (runnable) that fetches the context documents for the model.
    :return: A runnable chain implementing the multi-vector RAG pipeline.
    """
    model = ChatOpenAI(temperature=0, model="gpt-4-1106-preview", max_tokens=1024)
    prompt = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                "You are an AI assistant. Answer based on the retrieved documents:"
                "\n<Documents>\n{context}\n</Documents>",
            ),
            ("user", "{question}?"),
        ]
    )

    # Define the RAG pipeline
    chain = (
        {
            "context": retriever | format_docs,
            "question": RunnablePassthrough(),
        }
        | prompt
        | model
        | StrOutputParser()
    )

    return chain
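
As a rough illustration of what the model receives, the sketch below reproduces the document wrapping and the system template with plain string formatting. `FakeDocument` and `format_docs_plain` are hypothetical stand-ins introduced only for this example: the real `format_docs` also calls `load` on each stored document, and the real prompt is rendered by `ChatPromptTemplate`.

```python
from dataclasses import dataclass


@dataclass
class FakeDocument:
    # Hypothetical stand-in for a LangChain Document; only page_content is used.
    page_content: str


def format_docs_plain(docs: list) -> str:
    # Same wrapping as format_docs, minus the load() deserialization step.
    return "\n".join(
        f"<Document id={i}>\n{doc.page_content}\n</Document>"
        for i, doc in enumerate(docs)
    )


SYSTEM_TEMPLATE = (
    "You are an AI assistant. Answer based on the retrieved documents:"
    "\n<Documents>\n{context}\n</Documents>"
)

docs = [FakeDocument("First chunk."), FakeDocument("Second chunk.")]
system_message = SYSTEM_TEMPLATE.format(context=format_docs_plain(docs))
print(system_message)
```

Wrapping each chunk in numbered `<Document>` tags gives the model clear boundaries between retrieved chunks.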


## Create the Retriever and the RAG Chain

In [20]:
# Create the multi-vector retriever
retriever = get_multi_vector_retriever(DOCSTORE_ID_KEY, "attention-paper")


In [21]:
# Create RAG chain
chain = rag_chain(retriever)


In [22]:
# Add typing for input
class Question(BaseModel):
    __root__: str


chain = chain.with_types(input_type=Question)

## Ingest the data, define the propositional retriever and build the index

In [23]:

import uuid
from typing import Sequence

from bs4 import BeautifulSoup as Soup
from langchain_core.documents import Document
from langchain_core.runnables import Runnable

# For our example, we'll load docs from the web
from langchain.text_splitter import RecursiveCharacterTextSplitter  # noqa
from langchain_community.document_loaders.recursive_url_loader import (
    RecursiveUrlLoader,
)  # noqa


Functions to build the index and include the documents

In [24]:

def add_documents(
    retriever,
    propositions: Sequence[Sequence[str]],
    docs: Sequence[Document],
    id_key: str = DOCSTORE_ID_KEY,
):
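    # Derive one deterministic id per chunk. The source URL alone is shared by
    # every chunk split from the same page, so the chunk index is appended to
    # keep chunks from overwriting each other in the docstore.
    doc_ids = [
        str(uuid.uuid5(uuid.NAMESPACE_DNS, f"{doc.metadata['source']}#{i}"))
        for i, doc in enumerate(docs)
    ]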
    prop_docs = [
        Document(page_content=prop, metadata={id_key: doc_ids[i]})
        for i, props in enumerate(propositions)
        for prop in props
        if prop
    ]
    retriever.vectorstore.add_documents(prop_docs)
    retriever.docstore.mset(list(zip(doc_ids, docs)))


def create_index(
    docs: Sequence[Document],
    indexer: Runnable,
    docstore_id_key: str = DOCSTORE_ID_KEY,
    collection_name: str = "default",
):
    """
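    Create a retriever that indexes docs and their propositions.

    :param docs: Documents to index
    :param indexer: Runnable that generates the propositions for each doc
    :param docstore_id_key: Metadata key under which each proposition stores its parent doc id
    :param collection_name: Name of the Chroma collection that holds the propositions
    :return: Retriever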
    """
    print("Creating multi-vector retriever")
    retriever = get_multi_vector_retriever(docstore_id_key, collection_name)
    propositions = indexer.batch(
        [{"input": doc.page_content} for doc in docs], {"max_concurrency": 10}
    )

    add_documents(
        retriever,
        propositions,
        docs,
        id_key=docstore_id_key,
    )

    return retriever
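
The linking step can be tried in isolation with the standard library. The sketch below uses hypothetical source URLs and propositions to show two things: `uuid.uuid5` is name-based, so the same namespace and name always yield the same id, and the nested comprehension flattens per-document proposition lists into (proposition, parent id) pairs while skipping empty strings.

```python
import uuid

# Hypothetical inputs: one source string and one list of propositions per document.
sources = ["https://example.com/a", "https://example.com/b"]
propositions = [["Fact one about A.", "", "Fact two about A."], ["Fact about B."]]

# uuid5 is a name-based UUID: the same namespace and name always give the same id.
doc_ids = [str(uuid.uuid5(uuid.NAMESPACE_DNS, src)) for src in sources]

# Flatten to (proposition, parent id) pairs, skipping empty propositions.
pairs = [
    (prop, doc_ids[i])
    for i, props in enumerate(propositions)
    for prop in props
    if prop
]

for prop, doc_id in pairs:
    print(doc_id, prop)
```

Deterministic ids make re-ingestion repeatable, since the same name maps to the same docstore key on every run. The flip side is that any two inputs sharing a name string also share an id.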


Build the index using the propositional chain to generate the decomposed propositions

In [25]:

# Could add more parsing here, as it's very raw.
loader = RecursiveUrlLoader(
    "https://ar5iv.labs.arxiv.org/html/1706.03762",
    max_depth=2,
    extractor=lambda x: Soup(x, "html.parser").text,
)
data = loader.load()
print(f"Loaded {len(data)} documents")

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=8000, chunk_overlap=0)
all_splits = text_splitter.split_documents(data)
print(f"Split into {len(all_splits)} documents")

# Create retriever
retriever_multi_vector_img = create_index(
    all_splits,
    proposition_chain,
    DOCSTORE_ID_KEY,
    "attention-paper",  # must match the collection the query-time retriever reads from
)

Loaded 1 documents
Split into 7 documents
Creating multi-vector retriever


## Invoke the Chain

In [26]:
chain.invoke("How are transformers related to convolutional neural networks?")

'Transformers and Convolutional Neural Networks (CNNs) are both types of neural network architectures, but they are designed for different purposes and operate on different principles.\n\nConvolutional Neural Networks (CNNs) are designed primarily for processing data that has a known grid-like topology, such as image data. They are characterized by their use of convolutional layers, which apply a convolution operation to the input to capture the local dependencies and the spatial hierarchy in the data. This makes them particularly well-suited for tasks like image recognition, as they can efficiently process the pixel data and learn features like edges, textures, and shapes.\n\nTransformers, on the other hand, were introduced by Vaswani et al. in the paper "Attention Is All You Need" (2017) and are designed to handle sequential data, such as text for natural language processing tasks. The key innovation in transformers is the attention mechanism, which allows the model to weigh the infl