---
title: "What is Retrieval Augmented Generation (RAG)?"
date: 2025-06-26
description-meta: "What is RAG and how does it work?"
categories:
  - llm
  - rag
  - python
---

## What is RAG?

Retrieval Augmented Generation (RAG) is a technique that improves LLM answers by supplying the model with external information before it generates a response.

1. **Retrieve:** The system starts by searching a specific knowledge base for relevant information about the query.
2. **Augment:** This retrieved information is added to your original question/prompt.
3. **Generate:** The LLM uses both your question and the provided information to create a better answer.
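
The three steps above can be sketched in a few lines of plain Python. This is a toy illustration, not a real pipeline: the knowledge base sentences are invented, `retrieve` scores documents by simple word overlap, and `generate` is a placeholder standing in for an actual LLM call.

```python
def retrieve(query: str, knowledge_base: list[str], k: int = 2) -> list[str]:
    """Rank documents by how many query words they share (toy scoring)."""
    query_terms = set(query.lower().split())
    scored = [
        (len(query_terms & set(doc.lower().split())), doc) for doc in knowledge_base
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]


def augment(query: str, documents: list[str]) -> str:
    """Put the retrieved documents in front of the user's question."""
    context = "\n".join(f"- {doc}" for doc in documents)
    return f"Context:\n{context}\n\nQuestion: {query}"


def generate(prompt: str) -> str:
    """Placeholder for an LLM call; a real system would send `prompt` to a model."""
    return f"[LLM answer based on a prompt of {len(prompt)} characters]"


# Invented knowledge base, used only for illustration.
knowledge_base = [
    "The daily transfer limit is set per account.",
    "Customer service is available by phone and chat.",
    "Cards can be blocked from the mobile app.",
]
query = "What is the daily transfer limit?"
docs = retrieve(query, knowledge_base, k=1)
prompt = augment(query, docs)
print(prompt)
print(generate(prompt))
```

In a real system, the retrieval step would use term-based or embedding-based search, and `generate` would call a model.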

It’s useful because it can reduce "hallucinations", lets the model use current data, and builds trust with users by (potentially) providing citations.


## Vector databases 

A vector database (VectorDB) is designed to store and query data as vector embeddings (numerical representations).

Some popular vector databases are:

1. New generation: [Qdrant](https://qdrant.tech/), [Chroma](https://www.trychroma.com/), and [Pinecone](https://www.pinecone.io/).
2. Old generation: [Elasticsearch](https://www.elastic.co/) and [Postgres+PGVector](https://github.com/pgvector/pgvector).

New generation doesn't mean better, only more recent. Many of the newer providers had to "re-discover" concepts that the older generation already offered, such as BM25-based retrieval.

### Term-based retrieval

Term-based retrieval finds the most relevant documents by matching the terms in the query against the terms in each stored document.

TF-IDF:

- Counts how often a term appears in this document (TF).
- Measures how rare the word is across all documents (IDF).
- Highlights terms important and unique to this specific document.
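
As a rough sketch of the idea, the snippet below computes one common TF-IDF variant (term frequency normalized by document length, unsmoothed logarithmic IDF). Other variants exist, and the tiny corpus is invented for illustration.

```python
import math
from collections import Counter


def tf_idf(term: str, doc_tokens: list[str], corpus: list[list[str]]) -> float:
    """TF-IDF of `term` in one document, relative to the whole corpus."""
    # TF: share of the document's tokens that are this term.
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    # IDF: terms that appear in fewer documents get a larger weight.
    df = sum(1 for tokens in corpus if term in tokens)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf


# Invented corpus, already tokenized by whitespace.
corpus = [
    "the card has a daily limit".split(),
    "the card can be blocked".split(),
    "the app shows the balance".split(),
]

for term in ["the", "card", "limit"]:
    print(term, round(tf_idf(term, corpus[0], corpus), 4))
```

A word that appears in every document ("the") scores zero, while a word unique to the first document ("limit") scores highest.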

Okapi BM25: Extends TF-IDF with term-frequency saturation (repeated occurrences of a term add less and less to the score) and document-length normalization.
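
A minimal sketch of BM25 scoring, assuming the widely used formulation with parameters `k1` (saturation) and `b` (length normalization) and a smoothed IDF. The default values and the small corpus are illustrative assumptions, not taken from any particular library.

```python
import math
from collections import Counter


def bm25_score(
    query_terms: list[str],
    doc_tokens: list[str],
    corpus: list[list[str]],
    k1: float = 1.5,
    b: float = 0.75,
) -> float:
    """Score one document against a query with BM25."""
    n_docs = len(corpus)
    avg_len = sum(len(tokens) for tokens in corpus) / n_docs
    counts = Counter(doc_tokens)
    # Longer-than-average documents are penalized, shorter ones are boosted.
    length_norm = 1 - b + b * len(doc_tokens) / avg_len
    score = 0.0
    for term in query_terms:
        df = sum(1 for tokens in corpus if term in tokens)
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        tf = counts[term]
        # The ratio saturates: each extra occurrence of the term adds less.
        score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score


# Invented corpus, already tokenized by whitespace.
corpus = [
    "daily limit for card payments".split(),
    "limit limit limit limit".split(),
    "the app shows the account balance".split(),
]
scores = [bm25_score(["limit"], doc, corpus) for doc in corpus]
print([round(s, 3) for s in scores])
```

The second document repeats "limit" four times, yet its score is well below four times the first document's score, which is the saturation effect.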

### Embedding-based retrieval

- Small dataset: Use k-nearest neighbors (k-NN).
  - Calculate the similarity score between the query vector and every other vector stored in the VectorDB.
  - Sort all the vectors based on these similarity scores.
  - Return the 'k' most similar vectors (relative to the query).
- Big dataset: Use approximate nearest neighbor (ANN) search, such as locality-sensitive hashing (LSH) or hierarchical navigable small world (HNSW) graphs.
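
The brute-force k-NN steps can be sketched as follows, assuming cosine similarity as the similarity score. The two-dimensional vectors are made up for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn(query_vector: list[float], vectors: list[list[float]], k: int = 2):
    """Return the k (similarity, index) pairs most similar to the query."""
    # 1. Score the query against every stored vector.
    scored = [
        (cosine_similarity(query_vector, vector), idx)
        for idx, vector in enumerate(vectors)
    ]
    # 2. Sort by similarity, highest first.
    scored.sort(reverse=True)
    # 3. Keep the top k.
    return scored[:k]


# Made-up 2D "embeddings".
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
query_vector = [1.0, 0.2]
top = knn(query_vector, vectors, k=2)
print(top)
```

Every query compares against every stored vector, so the cost grows linearly with the dataset. That is why large collections switch to ANN indexes, which trade a little accuracy for speed.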

### Why use vector databases?

If you have a small dataset, there's no real reason to use a vector database. But if you're dealing with thousands or millions of documents, you'll need to use a vector database to efficiently retrieve the most relevant documents.

Retrieving only the most relevant documents, instead of sending everything to the LLM, matters because:

1. The more noise in the context provided to the LLM, the more likely it is to produce bad output.
2. It takes more time to process a longer context.
3. It costs more to process a longer context.
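
To make the cost point concrete, here is a back-of-the-envelope calculation. Every number in it (the price, the document size, the chunk size) is hypothetical and only illustrates the arithmetic.

```python
def estimate_input_cost(num_tokens: int, price_per_million_tokens: float) -> float:
    """Input cost in dollars for a prompt of `num_tokens` tokens."""
    return num_tokens / 1_000_000 * price_per_million_tokens


# All numbers below are hypothetical.
price_per_million_tokens = 1.00  # dollars per million input tokens
full_document_tokens = 50_000    # sending the whole document on every question
retrieved_tokens = 3 * 250       # sending only three chunks of about 250 tokens

full_cost = estimate_input_cost(full_document_tokens, price_per_million_tokens)
retrieved_cost = estimate_input_cost(retrieved_tokens, price_per_million_tokens)
print(f"Full document: ${full_cost:.5f} per question")
print(f"Retrieved chunks: ${retrieved_cost:.5f} per question")
```

Under these assumptions the retrieved context is dozens of times cheaper per question, and the gap widens as the document grows.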


## RAG without a vector database

In [1]:
import os

import chromadb
import tiktoken
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction
from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_openai import ChatOpenAI
from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

load_dotenv()

True

### Read the document

In [None]:
file_path = "assets/bbva.pdf"
loader = PyPDFLoader(file_path)
pages = []

for page in loader.lazy_load():
    pages.append(page)

### Generate response

In [2]:
model = ChatOpenAI(model="gpt-4.1-mini", temperature=0)

system_prompt = """
You are a helpful assistant that can answer questions about the provided context.

Please cite the page number used to answer the question. Write the page number in the format "Page X" at the end of your answer. 

If the answer is not found in the context, please say so.
"""
user_message = """
Please answer the following question based on the context provided:

Question: {question}

Documents:
{documents}
"""

# Concatenate every page into a single context string, labelled with 1-based page numbers.
context = ""
for i, page in enumerate(pages):
    context += f"--- PAGE {i + 1} ---\n{page.page_content}\n\n"


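def get_response(prompt_vars: dict) -> str:
    # Fill the user prompt template with the question and documents, then call the model.
    messages = [
        SystemMessage(content=system_prompt),
        HumanMessage(content=user_message.format(**prompt_vars)),
    ]
    response = model.invoke(messages)
    return response.content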


question = "What is the main idea of the document?"
response = get_response({"question": question, "documents": context})
print(response)


In [None]:
question = "What are the daily transaction limits?"
response = get_response({"question": question, "documents": context})
print(response)

## RAG with vector search

In [None]:
openai_ef = OpenAIEmbeddingFunction(api_key=os.getenv("OPENAI_API_KEY"))
vector_db = chromadb.PersistentClient()

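# Start from a clean collection so re-running this cell doesn't duplicate documents.
try:
    vector_db.delete_collection("bbva")
except Exception:
    # The collection doesn't exist yet, so there is nothing to delete.
    pass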

collection = vector_db.create_collection("bbva", embedding_function=openai_ef)

### Split and index documents

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
all_splits = text_splitter.split_documents(pages)
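
To show what `chunk_size` and `chunk_overlap` mean, here is a simplified sliding-window splitter in plain Python. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one also tries to split on separators such as paragraphs and sentences). The function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split `text` into windows of `chunk_size` characters that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start : start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)
```

Each chunk repeats the last two characters of the previous one. The overlap keeps a sentence that straddles a chunk boundary from being cut off from its context.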

In [None]:
collection.add(
    documents=[split.page_content for split in all_splits],
    metadatas=[split.metadata for split in all_splits],
    ids=[str(i) for i in range(len(all_splits))],
)

### Query the database

In [4]:
collection.query(
    query_texts=["What are the daily transaction limits?", "Is there a monthly limit?"],
    n_results=3,
)
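
The result of `collection.query` is a dictionary of nested lists, with one inner list per query text. The sketch below uses a hand-written stand-in with the same shape (the ids, texts, pages, and distances are invented) to show how to pair up the results of each query. The helper `results_for_query` is a hypothetical name.

```python
# Hand-written stand-in for a query result with two query texts and two hits each.
# All values are invented; only the nesting mirrors the real result.
mock_results = {
    "ids": [["12", "4"], ["7", "12"]],
    "documents": [
        ["chunk about daily limits", "chunk about cards"],
        ["chunk about monthly limits", "chunk about daily limits"],
    ],
    "metadatas": [[{"page": 3}, {"page": 1}], [{"page": 2}, {"page": 3}]],
    "distances": [[0.21, 0.35], [0.18, 0.40]],
}


def results_for_query(results: dict, query_idx: int) -> list[tuple]:
    """Pair up document text, metadata, and distance for one query."""
    return list(
        zip(
            results["documents"][query_idx],
            results["metadatas"][query_idx],
            results["distances"][query_idx],
        )
    )


for document, metadata, distance in results_for_query(mock_results, 0):
    print(f"page={metadata['page']} distance={distance} text={document!r}")
```

When a single question is queried, its results sit at index `0` of each list.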



### Generate a response

In [None]:
from langsmith import traceable

model = ChatOpenAI(model="gpt-4.1-mini", temperature=0)

system_prompt = """
You are a helpful assistant that can answer questions about the provided context.

Please cite the page number used to answer the question. Write the page number in the format "Page X" at the end of your answer. 

If the answer is not found in the context, please say so.
"""

user_message = """
Please answer the following question based on the context provided:

Question: {question}

Documents:
{documents}
"""


@traceable
def get_relevant_docs(question: str):
    relevant_docs = collection.query(query_texts=question, n_results=3)
    documents = relevant_docs["documents"][0]
    metadatas = relevant_docs["metadatas"][0]
    return [
        {"page_content": doc, "type": "Document", "metadata": metadata}
        for doc, metadata in zip(documents, metadatas)
    ]


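def get_context(relevant_docs: list[dict]) -> str:
    context = ""
    for doc in relevant_docs:
        # PyPDFLoader stores 0-based page indices; add 1 so citations match the PDF's page numbers.
        page_number = doc["metadata"]["page"] + 1
        context += f"--- PAGE {page_number} ---\n{doc['page_content']}\n\n"
    return context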


@traceable
def get_messages(question: str, relevant_docs: list[dict]):
    prompt_vars = {"question": question, "documents": get_context(relevant_docs)}
    messages = [
        SystemMessage(content=system_prompt),
        HumanMessage(content=user_message.format(**prompt_vars)),
    ]
    return messages


@traceable
def get_response(question: str):
    relevant_docs = get_relevant_docs(question)
    messages = get_messages(question, relevant_docs)
    response = model.invoke(messages)
    return response.content


question = "What are the customer service channels?"
response = get_response(question)
print(response)