# Enhanced RAG with NVIDIA NIM Rankers
by Bilge Yucel ([X](https://x.com/bilgeycl), [LinkedIn](https://www.linkedin.com/in/bilge-yucel/))


**Ranking** refers to assigning a relevance score to each document based on how well it matches the query. Adding a ranker component to a RAG pipeline lets you retrieve a larger candidate set, which helps **recall** (retrieving relevant documents), and then keep only the best-scoring candidates, which helps **precision** (selecting the most relevant ones). The ranker, typically a fine-tuned **LLM**, reorders the retrieved document chunks so that the most relevant ones appear at the top. This adds an extra model call, so the gain is in accuracy rather than speed.

By prioritizing the right documents, ranking increases the likelihood of providing the LLM with the best context, which improves the quality of generated responses.
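To make the retrieve-then-rerank idea concrete, here is a minimal, dependency-free sketch. Both scoring functions are hypothetical stand-ins chosen for illustration (word counting instead of embeddings, a length-normalised overlap instead of a fine-tuned ranking model), and the documents are toy strings. It is not how the NVIDIA models score documents.

```python
# Toy two-stage retrieval. The scorers below are illustrative assumptions,
# not the embedding or ranking models used later in this cookbook.

def words(text):
    return text.lower().split()

def first_stage_score(query, doc):
    """Coarse score: how many words in the doc (with repeats) occur in the query."""
    query_words = set(words(query))
    return sum(1 for w in words(doc) if w in query_words)

def rerank_score(query, doc):
    """Finer score: shared vocabulary, normalised by the doc's vocabulary size."""
    query_words, doc_words = set(words(query)), set(words(doc))
    return len(query_words & doc_words) / len(doc_words) if doc_words else 0.0

def retrieve(query, docs, top_k):
    return sorted(docs, key=lambda d: first_stage_score(query, d), reverse=True)[:top_k]

def rerank(query, candidates, top_k):
    return sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)[:top_k]

docs = [
    "Kingston Morning is an album",
    "Big Stone Gap is a town with a big stone quarry and a gap in the big stone hills",
    "Big Stone Gap is a film directed by Adriana Trigiani",
    "Nola is a romantic comedy film",
]
query = "who directed big stone gap"

candidates = retrieve(query, docs, top_k=3)   # wide net: favours repeated keywords
best = rerank(query, candidates, top_k=1)     # narrow down to the best match

print(candidates[0])
print(best[0])
```

In this toy setup the first stage puts the keyword-heavy "town" text first, and the second stage promotes the text that actually mentions the director. The same shape (large `top_k` for retrieval, small `top_k` for ranking) is what the pipeline in this cookbook uses.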

In this cookbook, we will build a pipeline with the [NvidiaRanker](https://docs.haystack.deepset.ai/docs/nvidiaranker) and compare the answers of a basic RAG pipeline with those of a RAG pipeline enhanced with a ranker.

## Installation

Start by installing the `nvidia-haystack` and `datasets` packages:

In [None]:
!pip install nvidia-haystack datasets

The run captured in this notebook resolved `nvidia-haystack` 0.0.5, `haystack-ai` 2.6.1, and `datasets` 3.0.2.

## Dataset

Load the [HotpotQA dataset](https://huggingface.co/datasets/hotpotqa/hotpot_qa) from Hugging Face:

In [None]:
from datasets import load_dataset

data = load_dataset('hotpotqa/hotpot_qa', 'distractor', trust_remote_code=True)

The download produces a train split with 90,447 examples and a validation split with 7,405 examples.

Let's check an entry to understand the data structure. In HotpotQA, each entry includes a question, a ground-truth answer, context sentences, and titles.

In [None]:
data["validation"][4]

{'id': '5a8e3ea95542995a26add48d',
 'question': 'The director of the romantic comedy "Big Stone Gap" is based in what New York city?',
 'answer': 'Greenwich Village, New York City',
 'type': 'bridge',
 'level': 'hard',
 'supporting_facts': {'title': ['Big Stone Gap (film)', 'Adriana Trigiani'],
  'sent_id': [0, 0]},
 'context': {'title': ['Just Another Romantic Wrestling Comedy',
   'Kingston Morning',
   'Nola (film)',
   'Adriana Trigiani',
   'Great Eastern Conventions',
   'New York Society of Model Engineers',
   'Clinton, Minnesota',
   "Hamish and Andy's Gap Year",
   'I Love NY (2015 film)',
   'Big Stone Gap (film)'],
  'sentences': [['Just Another Romantic Wrestling Comedy is a 2006 film starring April Hunter and Joanie Laurer.',
    ' This Romantic comedy film was premiered at New Jersey and New York City on December 1, 2006 and was released on DVD in the United States and the United Kingdom on April 17, 2007.',
    ' After the film\'s DVD release "Just Another Romantic Wres

We'll now convert the HotpotQA dataset entries into Haystack Documents. We'll merge the sentences of each context paragraph into a single chunk and store its title as metadata in the Haystack Document object.

In [None]:
from haystack.dataclasses.document import Document

def convert_hotpot_dataset(data):
    doc_chunks = []

    for item in data:
        # Collect the relevant content
        context_dict = {item['context']["title"][i]: item['context']["sentences"][i] for i in range(len(item['context']["title"]))}

        # Convert to Haystack Documents
        for k, v in context_dict.items():
            content = ''.join(v).strip()
            doc_chunks.append(Document(content=content, meta={"title":k}))

    return doc_chunks

documents = convert_hotpot_dataset(data["validation"])


In [None]:
print(documents[0])

Document(id=52837c52309de1827a1ca76200774451a415376a784eba418d851d0fd59196cc, content: 'Ed Wood is a 1994 American biographical period comedy-drama film directed and produced by Tim Burton...', meta: {'title': 'Ed Wood (film)'})
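The merging step can be illustrated without the dataset or Haystack installed. The sketch below uses a made-up entry with placeholder titles and sentences that only mimics the `context` layout shown earlier, and it returns plain dictionaries rather than `Document` objects. The helper name `merge_context` is hypothetical.

```python
# Hypothetical, shortened entry that mimics the HotpotQA "context" layout.
toy_entry = {
    "context": {
        "title": ["Title A", "Title B"],
        "sentences": [
            ["First sentence of A.", " Second sentence of A."],
            ["Only sentence of B."],
        ],
    }
}

def merge_context(entry):
    """Join each paragraph's sentences into one chunk and keep its title as metadata."""
    titles = entry["context"]["title"]
    sentence_lists = entry["context"]["sentences"]
    return [
        {"content": "".join(sentences).strip(), "meta": {"title": title}}
        for title, sentences in zip(titles, sentence_lists)
    ]

chunks = merge_context(toy_entry)
for chunk in chunks:
    print(chunk)
```

Because the sentences in the dataset already carry their leading spaces, joining with an empty string is enough to rebuild readable text.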


## Indexing Documents

To create a pipeline to index our documents, we need an NVIDIA NIM API key. You can get 1k credits for free after [signing up](https://org.ngc.nvidia.com/setup/personal-keys) for NVIDIA's platform. Once you have your API key, set it as the `NVIDIA_API_KEY` environment variable.

In [None]:
import os
from getpass import getpass

# Prompt for the key rather than hard-coding it in the notebook
os.environ["NVIDIA_API_KEY"] = getpass("Enter your NVIDIA API key: ")

Next, create a pipeline and index your documents. For embeddings, we'll use the [`nvidia/nv-embedqa-e5-v5`](https://docs.api.nvidia.com/nim/reference/nvidia-nv-embedqa-e5-v5) model through [NvidiaDocumentEmbedder](https://docs.haystack.deepset.ai/docs/nvidiadocumentembedder).

In [None]:
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.dataclasses.document import Document
from haystack.components.writers import DocumentWriter
from haystack_integrations.components.embedders.nvidia import NvidiaDocumentEmbedder
from haystack import Pipeline
from haystack.document_stores.types import DuplicatePolicy
from haystack.components.preprocessors import DocumentSplitter

document_store = InMemoryDocumentStore()
embedder = NvidiaDocumentEmbedder(model="nvidia/nv-embedqa-e5-v5",
                                  api_url="https://integrate.api.nvidia.com/v1")

indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=DocumentSplitter(split_length=350, split_overlap=50), name="splitter")
indexing_pipeline.add_component(instance=embedder, name="embedder")
indexing_pipeline.add_component(instance=DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP), name="writer")
indexing_pipeline.connect("splitter", "embedder")
indexing_pipeline.connect("embedder.documents", "writer.documents")

indexing_pipeline.run({"splitter":{"documents": documents[:500]}}) # We don't need to index all documents

print(document_store.count_documents())

Calculating embeddings: 100%|██████████| 16/16 [00:10<00:00,  1.53it/s]


505
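The splitter settings (`split_length=350`, `split_overlap=50`) can be pictured as a sliding window over the words of each document. The function below is a simplified approximation written for illustration, assuming word-based splitting. It is not Haystack's `DocumentSplitter` implementation, and the name `split_by_words` is hypothetical.

```python
# Simplified word-window splitter (illustrative only).
# Assumes split_length > split_overlap.

def split_by_words(text, split_length, split_overlap):
    words = text.split()
    if not words:
        return []
    step = split_length - split_overlap  # how far each window advances
    chunks = []
    start = 0
    while True:
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # this window already reached the end of the text
        start += step
    return chunks

short_doc = " ".join(f"w{i}" for i in range(120))  # fits in a single window
long_doc = " ".join(f"w{i}" for i in range(400))   # needs a second window

short_chunks = split_by_words(short_doc, 350, 50)
long_chunks = split_by_words(long_doc, 350, 50)

print(len(short_chunks), len(long_chunks))
print(long_chunks[1].split()[0])  # first word of the overlapping second chunk
```

Most paragraphs are shorter than 350 words and pass through as a single chunk, while a longer one yields additional overlapping chunks. That is one plausible reason the store can report slightly more entries than the 500 documents passed in, although the exact count also depends on how duplicates are handled.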


## Enhanced RAG with Ranker
Let's now create a RAG pipeline with a ranker. To embed the query, we'll initialize the [NvidiaTextEmbedder](https://docs.haystack.deepset.ai/docs/nvidiatextembedder) with the same embedding model used for indexing. For ranking, we'll initialize the [NvidiaRanker](https://docs.haystack.deepset.ai/docs/nvidiaranker) with the `nvidia/nv-rerankqa-mistral-4b-v3` model. We'll set the `top_k` value of the retriever to 30 and of the ranker to 5. Thus, we'll retrieve 30 documents but only pass the 5 most relevant ones as context to the LLM.

For generation, we'll initialize [NvidiaGenerator](https://docs.haystack.deepset.ai/docs/nvidiagenerator) with the `meta/llama3-70b-instruct` model.

In [None]:
from haystack import Pipeline
from haystack.utils.auth import Secret
from haystack.components.builders import PromptBuilder
from haystack_integrations.components.embedders.nvidia import NvidiaTextEmbedder
from haystack_integrations.components.generators.nvidia import NvidiaGenerator
from haystack_integrations.components.rankers.nvidia import NvidiaRanker
from haystack.components.retrievers import InMemoryEmbeddingRetriever

embedder = NvidiaTextEmbedder(model="nvidia/nv-embedqa-e5-v5",
                              api_url="https://integrate.api.nvidia.com/v1")

retriever = InMemoryEmbeddingRetriever(document_store=document_store, top_k=30)
ranker = NvidiaRanker(
    model="nvidia/nv-rerankqa-mistral-4b-v3",
    top_k=5
)

prompt = """Answer the question given the context.
Question: {{ query }}
Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}
Answer:
"""
prompt_builder = PromptBuilder(template=prompt)

generator = NvidiaGenerator(
    model="meta/llama3-70b-instruct",
    model_arguments={
        "max_tokens": 1024
    }
)

enhanced_rag = Pipeline()
enhanced_rag.add_component("embedder", embedder)
enhanced_rag.add_component("retriever", retriever)
enhanced_rag.add_component("ranker", ranker)
enhanced_rag.add_component("prompt_builder", prompt_builder)
enhanced_rag.add_component("generator", generator)

enhanced_rag.connect("embedder.embedding", "retriever.query_embedding")
enhanced_rag.connect("retriever", "ranker")
enhanced_rag.connect("ranker.documents", "prompt_builder.documents")
enhanced_rag.connect("prompt_builder", "generator")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7c0d9504f4f0>
🚅 Components
  - embedder: NvidiaTextEmbedder
  - retriever: InMemoryEmbeddingRetriever
  - ranker: NvidiaRanker
  - prompt_builder: PromptBuilder
  - generator: NvidiaGenerator
🛤️ Connections
  - embedder.embedding -> retriever.query_embedding (List[float])
  - retriever.documents -> ranker.documents (List[Document])
  - ranker.documents -> prompt_builder.documents (List[Document])
  - prompt_builder.prompt -> generator.prompt (str)
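The last two connections mean that whatever order the ranker returns is the order in which the chunks appear in the prompt. The sketch below approximates the prompt template with plain string formatting to show this. It is not the `PromptBuilder` itself, whitespace differs slightly from what Jinja would render, and the chunk strings are placeholders.

```python
# Plain-Python approximation of the prompt template used in this cookbook.
# The real pipeline renders the Jinja template through PromptBuilder.

def build_prompt(query, documents):
    context = "\n".join(f"    {content}" for content in documents)
    return (
        "Answer the question given the context.\n"
        f"Question: {query}\n"
        "Context:\n"
        f"{context}\n"
        "Answer:\n"
    )

# Hypothetical ranker output, already sorted from most to least relevant
ranked_docs = ["most relevant chunk", "second chunk", "third chunk"]

prompt = build_prompt("example question?", ranked_docs)
print(prompt)
```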

Let's run the pipeline with some questions and compare the answers:

In [None]:
# Another question to try (answer is "no"):
# question = "Are the Laleli Mosque and Esma Sultan Mansion located in the same neighborhood?"
question = "The director of the romantic comedy 'Big Stone Gap' is based in what New York city?" # answer is "Greenwich Village, New York City"

enhanced_rag.run({
    "embedder": {"text": question},
    "ranker": {"query": question},
    "prompt_builder": {"query": question}
})

{'embedder': {'meta': {'usage': {'prompt_tokens': 23, 'total_tokens': 23}}},
 'generator': {'replies': ["The director of the romantic comedy 'Big Stone Gap', Adriana Trigiani, is based in Greenwich Village, New York City."],
  'meta': [{'role': 'assistant',
    'usage': {'prompt_tokens': 380,
     'total_tokens': 408,
     'completion_tokens': 28},
    'finish_reason': 'stop'}]}}

## Basic RAG Pipeline

For comparison, let's define a basic pipeline (without a ranker) and see the result for the same questions.

In [None]:
from haystack import Pipeline
from haystack.utils.auth import Secret
from haystack.components.builders import PromptBuilder
from haystack_integrations.components.embedders.nvidia import NvidiaTextEmbedder
from haystack_integrations.components.generators.nvidia import NvidiaGenerator
from haystack.components.retrievers import InMemoryEmbeddingRetriever

embedder = NvidiaTextEmbedder(model="nvidia/nv-embedqa-e5-v5",
                              api_url="https://integrate.api.nvidia.com/v1")

retriever = InMemoryEmbeddingRetriever(document_store=document_store, top_k=5)

prompt = """Answer the question given the context.
Question: {{ query }}
Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}
Answer:
"""
prompt_builder = PromptBuilder(template=prompt)

generator = NvidiaGenerator(
    model="meta/llama3-70b-instruct",
    model_arguments={
        "max_tokens": 1024
    }
)

rag = Pipeline()
rag.add_component("embedder", embedder)
rag.add_component("retriever", retriever)
rag.add_component("prompt_builder", prompt_builder)
rag.add_component("generator", generator)

rag.connect("embedder.embedding", "retriever.query_embedding")
rag.connect("retriever", "prompt_builder.documents")
rag.connect("prompt_builder", "generator")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7c0d98944610>
🚅 Components
  - embedder: NvidiaTextEmbedder
  - retriever: InMemoryEmbeddingRetriever
  - prompt_builder: PromptBuilder
  - generator: NvidiaGenerator
🛤️ Connections
  - embedder.embedding -> retriever.query_embedding (List[float])
  - retriever.documents -> prompt_builder.documents (List[Document])
  - prompt_builder.prompt -> generator.prompt (str)

In [None]:
# Another question to try (answer is "no"):
# question = "Are the Laleli Mosque and Esma Sultan Mansion located in the same neighborhood?"
question = "The director of the romantic comedy 'Big Stone Gap' is based in what New York city?" # answer is "Greenwich Village, New York City"

rag.run({
    "embedder": {"text": question},
    "prompt_builder": {"query": question}
})

{'embedder': {'meta': {'usage': {'prompt_tokens': 23, 'total_tokens': 23}}},
 'generator': {'replies': ['The answer is Brooklyn.'],
  'meta': [{'role': 'assistant',
    'usage': {'prompt_tokens': 473,
     'total_tokens': 479,
     'completion_tokens': 6},
    'finish_reason': 'stop'}]}}
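To compare the two runs programmatically, you can pull the first reply out of each result dictionary and check it against the ground-truth answer. The sketch below copies the replies shown in the outputs of this cookbook. The substring check in `contains_answer` is a deliberately simple, hypothetical metric, not a proper evaluation.

```python
# Replies copied from the two pipeline outputs shown in this cookbook.
ground_truth = "Greenwich Village, New York City"

enhanced_result = {
    "generator": {
        "replies": [
            "The director of the romantic comedy 'Big Stone Gap', Adriana Trigiani, "
            "is based in Greenwich Village, New York City."
        ]
    }
}
basic_result = {"generator": {"replies": ["The answer is Brooklyn."]}}

def first_reply(result):
    """Return the first generated reply from a pipeline result."""
    return result["generator"]["replies"][0]

def contains_answer(reply, answer):
    """Naive check: does the reply mention the expected answer verbatim?"""
    return answer.lower() in reply.lower()

scores = {
    "enhanced": contains_answer(first_reply(enhanced_result), ground_truth),
    "basic": contains_answer(first_reply(basic_result), ground_truth),
}
print(scores)  # {'enhanced': True, 'basic': False}
```

A single question is only an anecdote. Running the same check over many validation entries would give a rough accuracy figure for each pipeline.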

## Conclusion

This recipe compares two RAG pipelines: a basic RAG pipeline and an enhanced version that includes an `NvidiaRanker` with the `nvidia/nv-rerankqa-mistral-4b-v3` model. Both pass 5 documents to the LLM as context, but the enhanced pipeline selects them by reranking 30 retrieved candidates, so the LLM receives more relevant documents. For the sample question, the enhanced pipeline answered correctly ("Greenwich Village, New York City"), while the basic pipeline answered "Brooklyn".

For a detailed evaluation, read the full [blog post](https://haystack.deepset.ai/blog/rag-with-nvidia-nim-ranker).