# RAG for Patent Question Answering with Reranker

Most retrieval-augmented generation (RAG) pipelines follow a common recipe: take a user’s question, retrieve relevant documents, and feed them to a language model to generate a response. This works reasonably well — until it doesn't.

When dealing with complex domains like patents, the limitations of naive retrieval become glaring:
- The language is dense and technical.
- Similarity-based retrievers often surface verbose but irrelevant sections.
- Critical information may be buried across long documents.

In this notebook, we’ll build a more **robust and domain-aware RAG system** specifically designed to answer technical and legal questions over patents. To improve retrieval quality, we’ll incorporate a **reranker** — a model that sits between retrieval and generation, reshuffling candidate passages to surface the most answer-relevant chunks.

This system will:
- Load and structure unstructured patent filings using the [Unstructured Platform](https://unstructured.io/).
- Ingest data into a [Pinecone](https://www.pinecone.io/product/) vector database for fast semantic retrieval.
- Re-rank retrieved candidates using **Cohere’s `rerank-english-v3.0`**.
- Answer user questions using **GPT-4o** grounded in the reranked context.

We’ll go step by step — starting with document ingestion and ending with an end-to-end QA pipeline that performs well even on nuanced queries.

Let’s dive in.


# Preparing the Data
To prepare our patent data for retrieval and reranking, we need to first break down the raw PDFs into structured chunks. This step is foundational for any RAG pipeline, and it’s where [Unstructured](https://unstructured.io) comes in.

The Unstructured API lets us:
- Extract clean, structured content from any document.
- Generate metadata, chunk text, and prepare it for downstream applications.

## Setting Up the Unstructured Client

Before we can begin parsing raw patent documents, we need to set up access to the [Unstructured platform](https://unstructured.io). The Unstructured Platform API allows us to programmatically process documents, extract structured elements, and prepare them for chunking and embedding, all from within this notebook.

[Contact us](https://unstructured.io/enterprise) to get access or log in if you're already a user.


In [None]:
!pip install -U "unstructured-client"



If you haven’t already:
1. Log in to [platform.unstructured.io](https://platform.unstructured.io)
2. In the sidebar, go to **API Keys**.
3. Click **New Key**, give it a name like `"patent-qna-notebook"`, and copy the key.


In [None]:
import os
import time
from google.colab import userdata
from unstructured_client import UnstructuredClient

Fetching the keys from Colab Secrets!

In [None]:
os.environ['UNSTRUCTURED_API_KEY'] = userdata.get("UNSTRUCTURED_API_KEY")
client = UnstructuredClient(api_key_auth=os.getenv("UNSTRUCTURED_API_KEY"))

In [None]:
# utility for inspecting responses in a readable way
def pretty_print_model(response_model):
    print(response_model.model_dump_json(indent=4))

## Setting up the S3 Source Connector


For this demo, we will authenticate with an AWS access key and secret.
Fetch the corresponding values and add them to the Secrets in Colab as `AWS_ACCESS` and `AWS_SECRET`, the names the cell below reads.


Similarly, fetch the S3 URI of the bucket or folder, formatted as `s3://my-bucket/` (if the files are in the bucket's root) or `s3://my-bucket/my-folder/`, and add it to the Secrets as `S3_REMOTE_URL`.
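
A malformed URI is easy to miss until the connector fails, so a quick local check can help. The sketch below only mirrors the two formats shown above; `is_valid_s3_remote_url` is a hypothetical helper, not part of the Unstructured SDK, and the trailing-slash requirement is an assumption taken from those examples.

```python
from urllib.parse import urlparse


def is_valid_s3_remote_url(url: str) -> bool:
    """Return True if url looks like s3://bucket/ or s3://bucket/folder/."""
    parsed = urlparse(url)
    # The scheme must be s3 and a bucket name must be present
    if parsed.scheme != "s3" or not parsed.netloc:
        return False
    # Assumption from the examples above: the path ends with a slash
    return parsed.path.endswith("/")


print(is_valid_s3_remote_url("s3://my-bucket/"))
print(is_valid_s3_remote_url("s3://my-bucket/my-folder/"))
print(is_valid_s3_remote_url("my-bucket/my-folder"))
```

In the notebook, the same check could be applied to `os.environ['S3_REMOTE_URL']` before creating the source connector.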





For other authentication options and more details, refer to the [S3 source connector documentation](https://docs.unstructured.io/api-reference/workflow/sources/s3).

In [None]:
os.environ['AWS_ACCESS'] = userdata.get('AWS_ACCESS')
os.environ['AWS_SECRET'] = userdata.get('AWS_SECRET')
os.environ['S3_REMOTE_URL'] = userdata.get('S3_REMOTE_URL')

In [None]:
from unstructured_client.models.operations import CreateSourceRequest
from unstructured_client.models.shared import CreateSourceConnector

source_response = client.sources.create_source(
    request=CreateSourceRequest(
        create_source_connector=CreateSourceConnector(
            name=f"Reranker Tutorial Source Connector_",
            type="s3",
            config={
              "key": os.environ.get('AWS_ACCESS'),
              "secret": os.environ.get('AWS_SECRET'),
              "remote_url": os.environ.get('S3_REMOTE_URL'),
              "recursive": True
            }
        )
    )
)

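The same connector can also be created with the SDK's typed config classes, as in the next cell. Both cells build the same connector, so run only one of them to avoid creating a duplicate.

In [None]: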
from unstructured_client.models.operations import CreateSourceRequest
from unstructured_client.models.shared import (
    CreateSourceConnector,
    SourceConnectorType,
    S3SourceConnectorConfigInput
)

source_response = client.sources.create_source(
    request=CreateSourceRequest(
        create_source_connector=CreateSourceConnector(
            name=f"Reranker Tutorial Source Connector_",
            type=SourceConnectorType.S3,
            config=S3SourceConnectorConfigInput(
                key=os.environ.get('AWS_ACCESS'),
                secret=os.environ.get('AWS_SECRET'),
                remote_url=os.environ.get('S3_REMOTE_URL'),
                recursive=True
            )
        )
    )
)

In [None]:
pretty_print_model(source_response.source_connector_information)

{
    "config": {
        "anonymous": false,
        "recursive": true,
        "remote_url": "s3://ajay-uns-devrel-content/mm-agentic-rag/",
        "key": "**********",
        "secret": "**********"
    },
    "created_at": "2025-08-06T14:57:07.277627Z",
    "id": "e63b3e59-58e7-4e0b-90b3-85a7a6f5ad69",
    "name": "Reranker Tutorial Source Connector_",
    "type": "s3",
    "updated_at": "2025-08-06T14:57:07.416960Z"
}


## Setting up the Pinecone Destination Connector

Now that we’ve defined our document source (from S3), the next step is to configure where the processed chunks should go. For that, we’re using **Pinecone** — a fast, scalable vector database that's perfect for similarity search.

In our case, we’ll send embedded chunks of patent text to Pinecone, where they can later be searched via semantic queries.

---

### 🌲 Why Pinecone?

Pinecone is optimized for storing and querying high-dimensional vector embeddings. It provides:
- Scalable infrastructure for similarity search.
- Fast approximate nearest neighbor lookup.
- Simple API access for indexing and querying.

In this setup, Unstructured handles:
- Preprocessing the data (partitioning, chunking, embedding).
- Pushing the output vectors directly into our Pinecone index.

---


To connect Unstructured with Pinecone, you’ll need:

- **API Key**: Found under the API Keys tab in the Pinecone dashboard.
- **Index Name**: Create one manually from the dashboard, and ensure it’s in the "Serverless" environment.
- **Namespace** (optional): Used to logically group your documents inside the index.

If you haven’t already:
1. Go to [https://app.pinecone.io](https://app.pinecone.io) and sign in.
2. Create a **Serverless Index**.
3. Note the **index name** and **API key** from the dashboard.

Store both values securely in Colab secrets:

In [None]:
os.environ['PINECONE_INDEX'] = userdata.get('PINECONE_INDEX')
os.environ['PINECONE_API_KEY'] = userdata.get('PINECONE_API_KEY')


In [None]:
from unstructured_client.models.operations import CreateDestinationRequest
from unstructured_client.models.shared import CreateDestinationConnector

destination_response = client.destinations.create_destination(
    request=CreateDestinationRequest(
        create_destination_connector=CreateDestinationConnector(
            name=f"Reranker Tutorial Destination Connector_",
            type="pinecone",
            config={
                "index_name": os.environ.get("PINECONE_INDEX"),
                "api_key": os.environ.get("PINECONE_API_KEY"),
                "batch_size": 50,
                "namespace": "Default" # Default Option
            }
        )
    )
)

pretty_print_model(destination_response.destination_connector_information)

{
    "config": {
        "api_key": "**********",
        "batch_size": 50,
        "index_name": "uns-demo-2",
        "namespace": "Default"
    },
    "created_at": "2025-08-06T14:57:09.636042Z",
    "id": "3122da51-b23b-415c-a544-e329ba964c66",
    "name": "Reranker Tutorial Destination Connector_",
    "type": "pinecone",
    "updated_at": "2025-08-06T14:57:09.739495Z"
}


Next, we’ll wire everything together into a full document processing workflow.


## Creating a Document Processing Workflow

Now that we have access to our data, the next step is setting up how it should be processed.

We'll define a simple but powerful document pipeline using three key types of processing nodes:

- **Partitioner**  
  This step takes raw, unstructured files and extracts structured content from them.  
  We'll use a **Vision-Language Model (VLM) Partitioner**, which leverages a model capable of understanding both text and layout information from documents — pulling out elements from each page with higher fidelity.

- **Chunker**  
  After partitioning, the extracted elements are grouped into manageable "chunks."  
  Chunking ensures that during retrieval, we can focus only on the most relevant sections of a document — not the whole thing.

- **Embedder**  
  Finally, we'll generate vector embeddings for each chunk of text.  
  Embeddings are numeric representations that capture the meaning of the text, making it searchable and retrievable later on. We'll rely on an embedding provider to handle this step for us.

Each node plays a critical role in making our documents **retrieval-ready** for downstream RAG applications.

If you're curious about the different configuration options available for these processing steps, you can explore more details in the [Concepts documentation](https://docs.unstructured.io/ui/document-elements).
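
To build intuition for what a chunker's size and overlap settings control, here is a toy sketch of fixed-size character windows that share a few characters with their neighbor. It is an illustration only: `chunk_with_overlap` is a hypothetical helper, and the platform's title-based chunker also respects document structure, which this sketch ignores.

```python
def chunk_with_overlap(text: str, max_characters: int, overlap: int) -> list[str]:
    """Split text into windows of at most max_characters, where each window
    repeats the last `overlap` characters of the previous one."""
    if overlap >= max_characters:
        raise ValueError("overlap must be smaller than max_characters")
    step = max_characters - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_characters])
        # Stop once the current window reaches the end of the text
        if start + max_characters >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in chunk_with_overlap(sample, max_characters=10, overlap=3):
    print(chunk)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.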


In [None]:
from unstructured_client.models.shared import (
    WorkflowNode,
    WorkflowType,
    Schedule
)

# Partition: extract structured elements from each page with a VLM
partition_node = WorkflowNode(
    name="Partitioner",
    subtype="vlm",
    type="partition",
    settings={
        "provider": "anthropic",
        "model": "claude-3-7-sonnet-20250219",
    }
)

# Chunk: group the extracted elements into retrieval-sized pieces
chunk_node = WorkflowNode(
    name="Chunker",
    subtype="chunk_by_title",
    type="chunk",
    settings={
        "new_after_n_chars": 1000,
        "max_characters": 4096,
        "overlap": 150
    }
)

# Embed: generate a vector for each chunk
embedder_node = WorkflowNode(
    name="Embedder",
    subtype="azure_openai",
    type="embed",
    settings={
        "model_name": "text-embedding-3-large"
    }
)


response = client.workflows.create_workflow(
    request={
        "create_workflow": {
            "name": f"Reranker Tutorial Workflow_{time.time()}",
            "source_id": source_response.source_connector_information.id,
            "destination_id": destination_response.destination_connector_information.id,
            "workflow_type": WorkflowType.CUSTOM,
            # Nodes run in the order listed
            "workflow_nodes": [
                partition_node,
                chunk_node,
                embedder_node
            ]
        }
    }
)

pretty_print_model(response.workflow_information)
workflow_id = response.workflow_information.id

{
    "created_at": "2025-08-06T14:57:11.657721Z",
    "destinations": [
        "3122da51-b23b-415c-a544-e329ba964c66"
    ],
    "id": "974f7a59-df45-469e-94d2-09e0ec1f2500",
    "name": "Reranker Tutorial Workflow_1754492231.632472",
    "sources": [
        "e63b3e59-58e7-4e0b-90b3-85a7a6f5ad69"
    ],
    "status": "active",
    "workflow_nodes": [
        {
            "name": "Partitioner",
            "subtype": "vlm",
            "type": "partition",
            "id": "639c45c1-8009-4bfa-80a1-5e0e4b325467",
            "settings": {
                "provider": "anthropic",
                "provider_api_key": null,
                "model": "claude-3-7-sonnet-20250219",
                "output_format": "text/html",
                "prompt": null,
                "format_html": true,
                "unique_element_ids": true,
                "is_dynamic": false,
                "allow_fast": true
            }
        },
        {
            "name": "Chunker",
            "subt

## Running the workflow

Now that we've defined how we want to process our documents, let's start the workflow and wait for it to complete:

In [None]:
res = client.workflows.run_workflow(
    request={
        "workflow_id": workflow_id,
    }
)

pretty_print_model(res.job_information)

{
    "created_at": "2025-08-06T14:57:13.160450Z",
    "id": "5dfaecab-e7f5-4ff2-84b6-6b460756bdf6",
    "status": "SCHEDULED",
    "workflow_id": "974f7a59-df45-469e-94d2-09e0ec1f2500",
    "workflow_name": "Reranker Tutorial Workflow_1754492231.632472",
    "job_type": "ephemeral"
}


In [None]:
response = client.jobs.list_jobs(
    request={
        "workflow_id": workflow_id
    }
)

last_job = response.response_list_jobs[0]
job_id = last_job.id
print(f"job_id: {job_id}")

job_id: 5dfaecab-e7f5-4ff2-84b6-6b460756bdf6


Now that we've created and started a job, we can poll Unstructured's `get_job` endpoint and check its status every 30 seconds until it completes.

In [None]:
import time

def poll_job_status(job_id, wait_time=30):
    while True:
        response = client.jobs.get_job(
            request={
                "job_id": job_id
            }
        )

        job = response.job_information

        if job.status == "SCHEDULED":
            print(f"Job is scheduled, polling again in {wait_time} seconds...")
            time.sleep(wait_time)
        elif job.status == "IN_PROGRESS":
            print(f"Job is in progress, polling again in {wait_time} seconds...")
            time.sleep(wait_time)
        else:
            print("Job is completed")
            break

    return job

job = poll_job_status(job_id)
pretty_print_model(job)

Job is scheduled, polling again in 30 seconds...
Job is in progress, polling again in 30 seconds...
Job is in progress, polling again in 30 seconds...
Job is in progress, polling again in 30 seconds...
Job is in progress, polling again in 30 seconds...
Job is in progress, polling again in 30 seconds...
Job is in progress, polling again in 30 seconds...
Job is in progress, polling again in 30 seconds...
Job is completed
{
    "created_at": "2025-08-06T14:57:13.160450",
    "id": "5dfaecab-e7f5-4ff2-84b6-6b460756bdf6",
    "status": "COMPLETED",
    "workflow_id": "974f7a59-df45-469e-94d2-09e0ec1f2500",
    "workflow_name": "Reranker Tutorial Workflow_1754492231.632472",
    "job_type": "ephemeral",
    "runtime": "PT0S"
}
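
The loop above treats every status other than `SCHEDULED` and `IN_PROGRESS` as completion and has no upper bound on waiting time. A hedged variant is sketched below: it returns whatever final status it sees and gives up after a timeout. `wait_for_job` and its parameters are hypothetical names, and the status source is simulated so the sketch runs without the API.

```python
import time
from typing import Callable

# Statuses the notebook's loop keeps polling on
PENDING_STATUSES = {"SCHEDULED", "IN_PROGRESS"}


def wait_for_job(get_status: Callable[[], str],
                 wait_time: float = 30,
                 timeout: float = 3600) -> str:
    """Poll get_status() until it leaves the pending states or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status not in PENDING_STATUSES:
            # Return the final status so the caller can tell success from failure
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job still {status} after {timeout} seconds")
        time.sleep(wait_time)


# Simulated status sequence standing in for the real endpoint
fake_statuses = iter(["SCHEDULED", "IN_PROGRESS", "COMPLETED"])
final_status = wait_for_job(lambda: next(fake_statuses), wait_time=0, timeout=5)
print(final_status)
```

In the notebook, `get_status` could be `lambda: client.jobs.get_job(request={"job_id": job_id}).job_information.status`, and the caller can then check that the returned status is `"COMPLETED"` before moving on.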


At this point, we've completed all the foundational steps:

- Extracted structured elements from raw documents using a **Partitioner**.
- Organized the extracted content into manageable chunks with a **Chunker**.
- Generated vector embeddings for those chunks through an **Embedder**.

Our processed data is now stored and ready for retrieval.

Next, we'll connect the pieces together and build a RAG pipeline that can answer questions grounded in this freshly structured knowledge base.



# RAG 🧠

With our patent documents now chunked, embedded, and stored in Pinecone — we’re ready to move into the **retrieval-augmented generation (RAG)** phase.

In this section, we'll wire together:
- A **retriever**, backed by Pinecone, to pull relevant chunks.
- A **reranker**, using Cohere’s `rerank-english-v3.0`, to boost the most contextually relevant results.
- A **generator**, using OpenAI’s `gpt-4o`, to produce accurate, grounded answers based on that refined context.

We’ll also wrap these into a clean RAG pipeline using LangChain’s modular components.


For this portion, we will be using:

- **`pinecone-client`**: Native SDK to interact with Pinecone vector indices (for inserting, querying, and managing embeddings).
- **`cohere`**: Official client to access Cohere’s APIs — including rerankers and language models.
- **`langchain-*`**: A modular framework for chaining together LLMs, retrievers, tools, rerankers, and more — perfect for building custom RAG pipelines.

Once everything's installed, we'll connect to our vector store, load our reranker, and build a chain that retrieves → reranks → generates.


In [None]:
!pip install pinecone-client langchain-pinecone langchain-openai langchain-community cohere --upgrade --quiet



In [None]:
import os
import requests
import urllib3
from google.colab import userdata
import cohere
from pinecone import Pinecone
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_pinecone import PineconeVectorStore
from langchain.schema import Document
from langchain.callbacks import get_openai_callback


Now that we’ve installed our libraries, it’s time to wire up the APIs. We’ll be using three providers in this RAG pipeline:


- **Cohere**: for reranking retrieved chunks based on their actual relevance to a query.
- **OpenAI**: for generating answers with `gpt-4o`.
- **Pinecone**: to query the vector index we populated earlier.

We’ll securely fetch each API key from Colab secrets:

In [None]:
# Set your API keys using Colab userdata
os.environ['COHERE_API_KEY'] = userdata.get("COHERE_API_KEY")
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
os.environ["PINECONE_API_KEY"] = userdata.get("PINECONE_API_KEY")
os.environ["PINECONE_INDEX"] = userdata.get("PINECONE_INDEX")

# Initialize Cohere client
cohere_client = cohere.Client(os.environ['COHERE_API_KEY'])
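
With several providers involved, it can help to fail fast when a key is missing or empty rather than discover it mid-pipeline. The sketch below is one way to do that; `missing_keys` is a hypothetical helper, and it is demonstrated on a plain dictionary so it runs anywhere.

```python
import os

# Keys this section of the notebook reads
REQUIRED_KEYS = ["COHERE_API_KEY", "OPENAI_API_KEY", "PINECONE_API_KEY", "PINECONE_INDEX"]


def missing_keys(required, env=None):
    """Return the names in `required` that are absent or empty in env."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Demo environment: one key is empty and one is absent
demo_env = {"COHERE_API_KEY": "x", "OPENAI_API_KEY": "", "PINECONE_API_KEY": "y"}
print(missing_keys(REQUIRED_KEYS, demo_env))
```

Calling `missing_keys(REQUIRED_KEYS)` with no second argument checks `os.environ` instead.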

### 🛠 Fixing Pinecone in Colab

If you're running this in a Colab environment, Pinecone’s client can sometimes misbehave due to Colab’s proxy settings.

This little fix disables warnings and clears proxy-related environment variables:

In [None]:
# run this to ensure pinecone client works in your colab environment
urllib3.disable_warnings()

# Clear proxy environment variables that might cause connection issues
proxy_vars = ['HTTP_PROXY', 'HTTPS_PROXY', 'http_proxy', 'https_proxy']
for var in proxy_vars:
    if var in os.environ:
        del os.environ[var]

original_getproxies = requests.utils.getproxies
requests.utils.getproxies = lambda: {}

Before wiring things up, here’s a breakdown of the core functions used in this section.

- `connect_pinecone(index_name: str)`  
  Sets up a connection to the Pinecone index and wraps it as a LangChain-compatible vectorstore using OpenAI’s `text-embedding-3-large`. Returns the vectorstore so we can use it for retrieval.

- `retrieve_docs(vectorstore, query: str, k: int = 20)`  
  Performs basic similarity search against the vectorstore. Grabs the top-k chunks closest to the query based on embeddings.

- `rerank_docs(query: str, docs: list[Document], top_n: int = 5)`  
  Takes the initial retrieved results and reorders them using Cohere’s reranker model. This lets us prioritize documents that are actually useful for answering the question — not just semantically close.

- `generate_answer(query: str, docs: list[Document])`  
  Feeds the reranked context to GPT-4o to generate a final answer.
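
To make "closest to the query based on embeddings" concrete, here is a minimal sketch of top-k selection by cosine similarity. It assumes cosine as the similarity metric (the metric actually used depends on how the index was created) and uses made-up three-dimensional vectors in place of real embeddings; `cosine_similarity` and `top_k` are illustrative helpers, not library calls.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k):
    """Return the names of the k chunks most similar to the query vector."""
    ranked = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Toy vectors standing in for real embeddings
chunks = {
    "claims": [0.9, 0.1, 0.0],
    "abstract": [0.7, 0.7, 0.0],
    "drawings": [0.0, 0.1, 0.9],
}
print(top_k([1.0, 0.0, 0.0], chunks, k=2))
```

A vector database does the same ranking with approximate nearest-neighbor search, so it doesn't have to compare the query against every stored vector.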

In [None]:
def connect_pinecone(index_name: str):
    """
    Connect to Pinecone vectorstore

    Args:
        index_name: Name of the Pinecone index

    Returns:
        Configured vectorstore
    """
    try:
        embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
        pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
        index = pc.Index(index_name)

        vectorstore = PineconeVectorStore(
            index=index,
            embedding=embeddings,
            text_key="text",
            namespace='Default'
        )

        print(f"Connected to Pinecone index: {index_name}")
        return vectorstore

    except Exception as e:
        print(f"Failed to connect to Pinecone: {e}")
        return None


def retrieve_docs(vectorstore, query: str, k: int = 20):
    """
    Retrieve documents from vectorstore

    Args:
        vectorstore: Pinecone vectorstore
        query: Search query
        k: Number of documents to retrieve

    Returns:
        List of relevant documents
    """
    try:
        docs = vectorstore.similarity_search(query, k=k)
        print(f"Retrieved {len(docs)} documents")
        return docs
    except Exception as e:
        print(f"Document retrieval failed: {e}")
        return []

def rerank_docs(query: str, docs: list[Document], top_n: int = 5):
    """
    Rerank documents using Cohere's reranking model

    Args:
        query: Original search query
        docs: List of retrieved documents
        top_n: Number of top documents to return

    Returns:
        List of reranked documents
    """
    try:
        response = cohere_client.rerank(
            query=query,
            documents=[doc.page_content for doc in docs],
            top_n=top_n,
            model="rerank-english-v3.0"
        )

        reranked_docs = [docs[r.index] for r in response.results]
        print(f"Reranked to top {len(reranked_docs)} documents")
        return reranked_docs

    except Exception as e:
        print(f"Reranking failed: {e}")
        return docs[:top_n]  # Fallback

def generate_answer(query: str, docs: list[Document]):
    """
    Generate answer using retrieved documents

    Args:
        query: User question
        docs: List of relevant documents

    Returns:
        Generated answer
    """
    try:
        llm = ChatOpenAI(model="gpt-4o", temperature=0)
        context = "\n\n".join([doc.page_content for doc in docs])

        prompt = f"""Answer the following question using the context below. Answer only based on the context provided, if there is not enough information, mention that there's not enough information:

        Context:
        {context}

        Question: {query}

        Answer:"""
        # Track token usage and cost for this single call
        with get_openai_callback() as cb:
            response = llm.invoke(prompt)

        result = {
            "answer": response.content,
            "prompt_tokens": cb.prompt_tokens,
            "completion_tokens": cb.completion_tokens,
            "total_tokens": cb.total_tokens,
            "total_cost": cb.total_cost
        }
        return result

    except Exception as e:
        print(f"Answer generation failed: {e}")
        return None

# Connect to vectorstore
vectorstore = connect_pinecone(os.environ["PINECONE_INDEX"])



Connected to Pinecone index: uns-demo1


### Vanilla RAG

We’ll start with a simple retrieval-augmented generation setup: grab the top-k documents from Pinecone using embedding similarity, and pass them directly to GPT-4o.


In [None]:
class BasicRAGSystem:
    def __init__(self, vectorstore, k=10):
        self.vectorstore = vectorstore
        self.k = k

    def query(self, question):
        """Execute basic RAG pipeline"""

        # Retrieve documents
        docs = retrieve_docs(self.vectorstore, question, k=self.k)

        # Generate answer
        answer = generate_answer(question, docs)


        result = {
            "documents": docs,
            "num_docs": len(docs)
        }
        result.update(answer)

        return result

basic_rag = BasicRAGSystem(vectorstore, k=10)



In [None]:
test_query = "What is the primary function of the context analysis engine described in US11886826B1?"

print("Basic RAG Results:")
print("-" * 50)
basic_result = basic_rag.query(test_query)

print(f"Answer: {basic_result['answer']}")
print(f"Retrieved {basic_result['num_docs']} documents")
print(f"Total Tokens: {basic_result['total_tokens']}")

Basic RAG Results:
--------------------------------------------------
Retrieved 10 documents
Answer: The primary function of the context analysis engine described in US11886826B1 is to analyze input data and/or user instructions to output a set of context parameters associated with the input data. These context parameters may include information such as location ("where"), person ("who"), time period or time of day ("when"), event ("what"), or causal reasoning ("why") associated with the input data. The context analysis engine may also retain the output of the set of context parameters through multiple iterations of execution, allowing for retention of context information for changes without needing to reload large amounts of information.
Retrieved 10 documents
Total Tokens: 7741


The vanilla setup gets the right answer here. It finds the relevant chunk in the top 10 and generates a clean response.

Now let's try a more complex question:

In [None]:
test_query = "Which of the two patents does not reference reward‑based optimization, and what training approach does it use instead?"

print("Basic RAG Results:")
print("-" * 50)
basic_result = basic_rag.query(test_query)

print(f"Answer: {basic_result['answer']}")
print(f"Retrieved {basic_result['num_docs']} documents")
print(f"Total Tokens: {basic_result['total_tokens']}")

Basic RAG Results:
--------------------------------------------------
Retrieved 10 documents
Answer: There's not enough information to determine which of the two patents does not reference reward-based optimization and what training approach it uses instead.
Retrieved 10 documents
Total Tokens: 7234


Even though we retrieved 10 chunks, none had what we needed. Let's try a different approach to fetch the chunks **most relevant** to the query.

### RAG with Reranking


Plain vector search can only get us so far. It’s fast and useful, but it’s not perfect: sometimes the right chunk doesn’t make it into the top 10.

To fix this, we add a reranking step.

Here’s how it works:

- First, we fetch a **larger set of candidate chunks** — say 30 — from the vectorstore.
- Then we use a **reranker model** (in this case, Cohere’s `rerank-english-v3.0`) to score each chunk by how well it matches the question.
- We keep only the **top-N** (e.g. top 10) reranked chunks and send those to the LLM.

This extra scoring step helps surface the most relevant content, especially for nuanced or multi-part questions that vector search might miss.
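
The over-fetch, score, and trim pattern described above can be sketched without any API calls. In this sketch a simple word-overlap score stands in for the reranker model; `toy_relevance` and `rerank` are hypothetical helpers, and a real reranker scores query-passage pairs with a trained model rather than by counting shared words.

```python
def toy_relevance(query: str, passage: str) -> float:
    """Stand-in scorer: fraction of query words that appear in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & passage_words) / len(query_words)


def rerank(query: str, candidates: list[str], top_n: int) -> list[str]:
    """Score every candidate against the query and keep the top_n best."""
    ranked = sorted(candidates, key=lambda p: toy_relevance(query, p), reverse=True)
    return ranked[:top_n]


# Candidates in the order a vector search might have returned them
candidates = [
    "the model is trained with a reward signal",
    "figures show the system architecture",
    "training is iterative and uses labeled data",
]
print(rerank("iterative training labeled data", candidates, top_n=2))
```

The passage that was last in the retrieval order moves to the front because it matches the query best under the scorer, which is the effect we want from the real reranker.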




In [None]:
class EnhancedRAGSystem:
    def __init__(self, vectorstore, k=40, top_n=20):
        self.vectorstore = vectorstore
        self.k = k
        self.top_n = top_n

    def query(self, question):
        """Execute enhanced RAG pipeline with reranking"""

        initial_docs = retrieve_docs(self.vectorstore, question, k=self.k)

        reranked_docs = rerank_docs(question, initial_docs, top_n=self.top_n)

        answer = generate_answer(question, reranked_docs)

        result = {
            "documents": reranked_docs,
            "initial_docs": initial_docs,
            "num_docs": len(reranked_docs)
        }
        result.update(answer)
        return result

# Initialize enhanced RAG system to fetch 30 candidate docs -> 10 reranked docs
enhanced_rag = EnhancedRAGSystem(vectorstore, k=30, top_n=10)


And now, the query that failed with vanilla RAG:

In [None]:
test_query = "Which of the two patents does not reference reward‑based optimization, and what training approach does it use instead?"

print("\nEnhanced RAG with Reranking:")
print("-" * 50)
enhanced_result = enhanced_rag.query(test_query)

print(f"Answer: {enhanced_result['answer']}")
print(f"Retrieved {enhanced_result['num_docs']} documents (from {len(enhanced_result['initial_docs'])} initial)")
print(f"Total Tokens: {enhanced_result['total_tokens']}")


Enhanced RAG with Reranking:
--------------------------------------------------
Retrieved 30 documents
Reranked to top 10 documents
Answer: The patent US 11,886,826 B1 does not reference reward-based optimization. Instead, it uses an iterative training approach based on one or more datasets, which may include user instruction data or user-labeled data.
Retrieved 10 documents (from 30 initial)
Total Tokens: 7662


So what changed?

Turns out the key chunk was buried deeper in the retrieval set, somewhere in the top 30, but not in the top 10 that vanilla RAG uses.

With reranking, we’re able to pull it up and pass it to the LLM, which now has enough signal to answer correctly.
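
One way to check this kind of claim is to compare where each reranked chunk sat in the original retrieval order. The sketch below does that for plain strings; `rank_shifts` is a hypothetical helper, and it assumes the chunk texts are unique (with duplicates, the last occurrence's rank is reported).

```python
def rank_shifts(initial: list[str], reranked: list[str]) -> list[tuple[str, int, int]]:
    """For each reranked item, return (item, original_rank, new_rank), 1-based."""
    original_rank = {item: i for i, item in enumerate(initial, start=1)}
    return [
        (item, original_rank[item], new_rank)
        for new_rank, item in enumerate(reranked, start=1)
    ]


# Toy chunk identifiers standing in for chunk texts
initial = ["c1", "c2", "c3", "c4", "c5"]
reranked = ["c4", "c1", "c5"]
for item, old, new in rank_shifts(initial, reranked):
    print(f"{item}: retrieval rank {old} -> reranked position {new}")
```

In the notebook, the inputs could be `[d.page_content for d in enhanced_result["initial_docs"]]` and `[d.page_content for d in enhanced_result["documents"]]`. Any chunk whose original rank is above 10 would have been invisible to the vanilla pipeline.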


### Why not send the entire context to the LLM?

Let's test it out.

In [None]:
basic_rag = BasicRAGSystem(vectorstore, k=30)
test_query = "Which of the two patents does not reference reward‑based optimization, and what training approach does it use instead?"

print("Basic RAG Results:")
print("-" * 50)
basic_result = basic_rag.query(test_query)

print(f"Answer: {basic_result['answer']}")
print(f"Retrieved {basic_result['num_docs']} documents")
print(f"Total Tokens: {basic_result['total_tokens']}")

Basic RAG Results:
--------------------------------------------------
Retrieved 30 documents
Answer: The patent that does not reference reward-based optimization is US 2024/0256582 A1. Instead, it uses a training approach that involves generating a set of search results for a search query and providing the set of search results as part of an input prompt to guide a generative AI model in generating a summary response of the set of search results.
Retrieved 30 documents
Total Tokens: 22521


In [None]:

print("\nEnhanced RAG with Reranking:")
print("-" * 50)
enhanced_result = enhanced_rag.query(test_query)

print(f"Answer: {enhanced_result['answer']}")
print(f"Retrieved {enhanced_result['num_docs']} documents (from {len(enhanced_result['initial_docs'])} initial)")
print(f"Total Tokens: {enhanced_result['total_tokens']}")


Enhanced RAG with Reranking:
--------------------------------------------------
Retrieved 30 documents
Reranked to top 10 documents
Answer: The patent US 11,886,826 B1 does not reference reward-based optimization. Instead, it uses an iterative training approach based on one or more datasets, which may include user instruction data or user-labeled data.
Retrieved 10 documents (from 30 initial)
Total Tokens: 7662


Here, vanilla RAG with all 30 chunks as context gave a confused answer that names a different patent than the reranked pipeline does, and the cost difference is large: 22,521 total tokens versus 7,662.

- **Vanilla RAG (k=30)** sends all 30 chunks straight to the LLM.
- **Reranked RAG** pulls 30 candidates, scores them, and keeps only the top 10.

That’s roughly **3x fewer tokens** for a more focused answer.

This isn’t just about cost. With longer inputs, LLM latency also goes up.  
Reranking helps us trim the fat and stay within context limits without sacrificing accuracy.
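
The token counts reported in the two runs above make the comparison easy to quantify. This small calculation uses only those two numbers; the per-query saving is specific to this question and this index, so treat it as an illustration rather than a general figure.

```python
# Total tokens reported in the runs above
vanilla_tokens = 22521   # vanilla RAG with k=30
reranked_tokens = 7662   # 30 candidates reranked down to 10

ratio = vanilla_tokens / reranked_tokens
saved = vanilla_tokens - reranked_tokens

print(f"Vanilla uses {ratio:.2f}x the tokens of the reranked pipeline")
print(f"Tokens saved per query: {saved}")
```

Multiplying the saved tokens by your model's per-token price and expected query volume gives a rough estimate of the saving, though the reranker call has its own cost that should be subtracted.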

So if you're going to over-fetch from the vector store, it's almost always better to rerank before you send.

If you’re building a question-answering or document-heavy application, try plugging in a reranker.  
It’s a simple addition that can boost accuracy, trim cost, and make your LLMs look smarter.

You can adapt the exact same setup to papers, reports, contracts — anything longform where chunk retrieval alone might not cut it.

Start from this notebook, swap in your own data, and see what changes.