# RAG with LangChain

- Step-1 : Extract the PDF text
- Step-2 : Chunk the extracted PDF text
- Step-3 : Create a vector store with the PDF chunks
- Step-4 : Create a retriever which returns the relevant chunks
- Step-5 : Build context from the relevant chunk texts
- Step-6 : Build the RAG chain using the RAG prompt, LLM and string output parser
- Step-7 : Run the RAG chain to get the answer

In [1]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda
from dotenv import load_dotenv
import os



In [2]:
# Read .env file
load_dotenv()

api_key = os.environ.get("OPENAI_API_KEY")

if api_key:
    print("OpenAI API key loaded successfully!")
else:
    print("Error: OPENAI_API_KEY not found")

OpenAI API key loaded successfully!
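
The cell above only prints a message when the key is missing, so later cells would still run and fail at the first OpenAI call. A minimal sketch of a stricter check follows; `require_env` and the `DEMO_KEY` variable are hypothetical names used for illustration, not part of `dotenv` or LangChain.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} not found; add it to your .env file")
    return value


# Demo with a throwaway variable so the sketch runs without a real key
os.environ["DEMO_KEY"] = "abc123"
print(require_env("DEMO_KEY"))
```

In the notebook this could be called as `require_env("OPENAI_API_KEY")` right after `load_dotenv()`.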


### Extract PDF text

In [3]:
import requests

# 1. Get your project's root directory (e.g., /path/to/rag_learning)
current_dir = os.getcwd()

# 2. Define a path for the 'content' folder
content_dir = os.path.join(current_dir, "content")

# 3. Create the 'content' folder if it doesn't already exist
os.makedirs(content_dir, exist_ok=True)

# 4. Define the full path for the downloaded PDF
pdf_path = os.path.join(content_dir, "sparql_query_translation.pdf")

# 5. Download the PDF file (fail early on HTTP errors or a stalled connection)
pdf_url = 'https://arxiv.org/pdf/2507.10045.pdf'
response = requests.get(pdf_url, timeout=30)
response.raise_for_status()

# 6. Save the file to your 'content' folder
with open(pdf_path, 'wb') as file:
    file.write(response.content)

print("Successfully downloaded")

Successfully downloaded
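
A download can return an HTML error page instead of the PDF, and saving it would only surface later as a loader error. The sketch below checks the leading bytes before writing. It assumes the file begins with the usual `%PDF-` header; `save_if_pdf` is a hypothetical helper, and the demo uses in-memory bytes rather than a real download.

```python
import os
import tempfile


def save_if_pdf(data: bytes, path: str) -> None:
    """Write data to path only if it starts with the PDF magic bytes."""
    if not data.startswith(b"%PDF-"):
        raise ValueError("Payload does not look like a PDF; not saving it")
    with open(path, "wb") as file:
        file.write(data)


# Demo with toy bytes instead of response.content
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "demo.pdf")
    save_if_pdf(b"%PDF-1.7\n% toy payload", target)
    print(os.path.exists(target))
```

In the notebook, `save_if_pdf(response.content, pdf_path)` could stand in for the plain `open(...).write(...)` step.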


In [4]:
from typing import List
from langchain_core.documents import Document

def pdf_extract(pdf_path: str) -> List[Document]:
    """
    Extracts text from a PDF file using PyPDFLoader.

    Parameters:
    pdf_path (str): The file path of the PDF to be extracted.

    Returns:
    List[Document]: A list of Document objects containing the extracted text from the PDF.
    """

    print("PDF file text is extracted...")
    loader = PyPDFLoader(pdf_path)
    pdf_text = loader.load()

    return pdf_text

In [5]:
pdf_text = pdf_extract(pdf_path)

PDF file text is extracted...


In [6]:
print(f"Number of documents = {len(pdf_text)}")

Number of documents = 18


### Chunk PDF text

In [7]:
def pdf_chunk(pdf_text: List[Document]) -> List[Document]:
    """
    Splits extracted PDF text into smaller chunks using RecursiveCharacterTextSplitter.

    Parameters:
    pdf_text (List[Document]): A list of Document objects containing extracted text from a PDF.

    Returns:
    List[Document]: A list of chunked Document objects.
    """

    print("PDF file text is chunked....")
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = text_splitter.split_documents(pdf_text)

    return chunks

In [8]:
chunks = pdf_chunk(pdf_text)
print(f"Number of chunks = {len(chunks)}")

PDF file text is chunked....
Number of chunks = 67
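
To get a feel for how `chunk_size` and `chunk_overlap` interact, here is a simplified fixed-window splitter. It is only an illustration: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and newlines, so its chunks will not match this sketch exactly. `sliding_chunks` is a hypothetical name.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# → ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the role the 100-character overlap plays in the cell above.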


In [9]:
print(chunks[0])

page_content='July 2025
Automating SPARQL Query Translations
between DBpedia and Wikidata
Malte Christian BARTELSa,1, Debayan BANERJEE a and Ricardo USBECK a
a Leuphana University of Lüneburg, Lüneburg, Germany
ORCiD ID: Malte Christian Bartels https://orcid.org/0009-0006-2113-3322, Debayan
Banerjee https://orcid.org/0000-0001-7626-8888, Ricardo Usbeck
https://orcid.org/0000-0002-0191-7211
Abstract. Purpose: This paper investigates whether state-of-the-art Large Lan-
guage Models (LLMs) can automatically translate SPARQL between popular
Knowledge Graph (KG) schemas. We focus on translations between the DBpedia
and Wikidata KG, and later on DBLP and OpenAlex KG. This study addresses a
notable gap in KG interoperability research by evaluating LLM performance on
SPARQL-to-SPARQL translation.
Methodology: Two benchmarks are assembled, where the first aligns 100 DBpe-
dia–Wikidata queries from QALD-9-Plus dataset; the second contains 100 DBLP' metadata={'producer': 'pikepdf 8.15.1', 'creato

### Create Vector Store

In [10]:
persistent_directory = os.path.join(current_dir, "db", "chroma_db_pdf_langchain")

def create_vector_store(chunks: List[Document], db_path: str) -> Chroma:
    """
    Creates a Chroma vector store from chunked documents.

    Parameters:
    chunks (List[Document]): A list of chunked Document objects.
    db_path (str): The directory path to persist the vector store.

    Returns:
    Chroma: A Chroma vector store containing the embedded documents.
    """

    print("Chroma vector store is created...\n")
    # Ensure your OPENAI_API_KEY is already loaded in the environment before this runs!
    embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")

    # Create and persist the database specifically for these chunks
    db = Chroma.from_documents(
        documents=chunks,
        embedding=embedding_model,
        persist_directory=db_path
    )

    return db

# Build the vector store in the persistent directory
db = create_vector_store(chunks, persistent_directory)

Chroma vector store is created...
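
The retrieval output further down returns the same chunk twice, which is consistent with this cell having been run more than once against the same `persist_directory` (an assumption; the notebook does not confirm it). One possible guard is to derive a stable ID for every chunk. The sketch below only computes such IDs; `chunk_id` is a hypothetical helper. To my knowledge `Chroma.from_documents` accepts an optional `ids` list, but verify this, and how repeated IDs are handled, against your installed version.

```python
import hashlib


def chunk_id(source: str, page: int, text: str) -> str:
    """Derive a deterministic ID from where a chunk came from and what it contains."""
    payload = f"{source}|{page}|{text}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:32]


# Toy rows standing in for (metadata['source'], metadata['page'], page_content)
rows = [
    ("paper.pdf", 0, "first chunk"),
    ("paper.pdf", 0, "second chunk"),
    ("paper.pdf", 0, "first chunk"),  # exact repeat of the first row
]
ids = [chunk_id(*row) for row in rows]
print(len(ids), len(set(ids)))
# → 3 2
```

Because the ID depends only on the chunk itself, re-running the ingestion produces the same IDs instead of fresh random ones.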



### Retrieve relevant chunks

In [11]:
def retrieve_context(db: Chroma, query: str) -> List[Document]:
    """
    Retrieves relevant document chunks from the Chroma vector store based on a query.

    Parameters:
    db (Chroma): The Chroma vector store containing embedded documents.
    query (str): The query string to search for relevant document chunks.

    Returns:
    List[Document]: A list of retrieved relevant document chunks.
    """

    retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 2})
    print("Relevant chunks are retrieved...\n")
    relevant_chunks = retriever.invoke(query)

    return relevant_chunks

In [12]:
query = "Explain the paper approach in one line"

relevant_chunks = retrieve_context(db, query)
print(f"Number of relevant chunks = {len(relevant_chunks)}")

Relevant chunks are retrieved...

Number of relevant chunks = 2
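
Conceptually, a similarity retriever with `k=2` scores every stored vector against the query vector and keeps the two best. The sketch below does this with cosine similarity on toy 2-d vectors. It is an assumption for illustration only: the metric and index Chroma actually uses depend on how the collection is configured, and real embeddings have far more dimensions. `cosine_similarity` and `top_k` are hypothetical names.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          doc_vecs: Sequence[Sequence[float]],
          k: int = 2) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k most similar document vectors."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
best = top_k([1.0, 0.1], doc_vecs, k=2)
print([index for index, _ in best])
# → [0, 2]
```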


In [13]:
for i, chunk in enumerate(relevant_chunks):
    print(f"Chunk-{i}")
    print(chunk)
    print("\n")

Chunk-0
page_content='ran Associates, Inc.; 2022. p. 24824-37. Available from: https:
//proceedings.neurips.cc/paper_files/paper/2022/file/
9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf.' metadata={'title': 'Automating SPARQL Query Translations between DBpedia and Wikidata', 'source': 'c:\\Users\\malte\\Desktop\\Coding\\rag_learning\\content\\sparql_query_translation.pdf', 'total_pages': 18, 'arxivid': 'https://arxiv.org/abs/2507.10045v1', 'page_label': '18', 'author': 'Malte Christian Bartels; Debayan Banerjee; Ricardo Usbeck', 'trapped': '/False', 'creator': 'arXiv GenPDF (tex2pdf:)', 'doi': 'https://doi.org/10.48550/arXiv.2507.10045', 'creationdate': '', 'license': 'http://creativecommons.org/licenses/by/4.0/', 'page': 17, 'producer': 'pikepdf 8.15.1', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5'}


Chunk-1
page_content='ran Associates, Inc.; 2022. p. 24824-37. Available from: https:
//proceedings.neurips.cc/pape

### Build context

In [14]:
def build_context(relevant_chunks: List[Document]) -> str:
    """
    Builds a context string from retrieved relevant document chunks.

    Parameters:
    relevant_chunks (List[Document]): A list of retrieved relevant document chunks.

    Returns:
    str: A concatenated string containing the content of the relevant chunks.
    """

    print("Context is built from relevant chunks")
    context = "\n\n".join([chunk.page_content for chunk in relevant_chunks])

    return context

In [15]:
context = build_context(relevant_chunks)
print(context)

Context is built from relevant chunks
ran Associates, Inc.; 2022. p. 24824-37. Available from: https:
//proceedings.neurips.cc/paper_files/paper/2022/file/
9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf.

ran Associates, Inc.; 2022. p. 24824-37. Available from: https:
//proceedings.neurips.cc/paper_files/paper/2022/file/
9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf.
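
The context above contains the same passage twice, which spends prompt tokens without adding information. A minimal sketch of a de-duplicating join follows, operating on plain strings; `build_unique_context` is a hypothetical variant of `build_context`, not a LangChain function.

```python
from typing import Iterable, List


def build_unique_context(texts: Iterable[str]) -> str:
    """Join texts with blank lines, skipping empty strings and exact repeats."""
    seen = set()
    unique: List[str] = []
    for text in texts:
        key = text.strip()
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return "\n\n".join(unique)


print(build_unique_context(["ran Associates, Inc.; 2022.", "ran Associates, Inc.; 2022.", "Abstract."]))
```

With the notebook's objects it could be called as `build_unique_context(chunk.page_content for chunk in relevant_chunks)`.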


### Combine all the steps into one function

In [16]:
from typing import Dict
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

def get_context(inputs: Dict[str, str]) -> Dict[str, str]:
    """
    Creates or loads a vector store for a given PDF file and extracts relevant chunks based on a query.

    Args:
        inputs (Dict[str, str]): A dictionary containing the following keys:
            - 'pdf_path' (str): Path to the PDF file.
            - 'query' (str): The user query.
            - 'db_path' (str): Path to the vector database.

    Returns:
        Dict[str, str]: A dictionary containing:
            - 'context' (str): Extracted relevant context.
            - 'query' (str): The user query.
    """
    pdf_path, query, db_path = inputs['pdf_path'], inputs['query'], inputs['db_path']

    # Create new vector store if it does not exist
    if not os.path.exists(db_path):
        print("Creating a new vector store...\n")
        pdf_text = pdf_extract(pdf_path)
        chunks = pdf_chunk(pdf_text)
        db = create_vector_store(chunks, db_path)

    # Load the existing vector store
    else:
        print("Loading the existing vector store\n")
        embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")
        db = Chroma(persist_directory=db_path, embedding_function=embedding_model)

    relevant_chunks = retrieve_context(db, query)
    context = build_context(relevant_chunks)

    return {'context': context, 'query': query}
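
`get_context` treats any existing `db_path` as a populated store, so an empty directory left behind by a failed run would be loaded as if it held embeddings. A slightly stricter check is sketched below; `store_is_populated` is a hypothetical helper, and "contains at least one entry" is only a rough proxy for "contains a usable Chroma collection".

```python
import os
import tempfile


def store_is_populated(db_path: str) -> bool:
    """True only if db_path is a directory that contains at least one entry."""
    if not os.path.isdir(db_path):
        return False
    with os.scandir(db_path) as entries:
        return any(entries)


with tempfile.TemporaryDirectory() as tmp:
    print(store_is_populated(tmp))  # empty directory
    # → False
    open(os.path.join(tmp, "marker.bin"), "wb").close()
    print(store_is_populated(tmp))  # directory with one file
    # → True
```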

### Build RAG Chain

In [17]:
template = """You are an AI model trained for question answering. Answer the
given question based on the given context only.

Question: {query}

Context: {context}

If the answer is not present in the given context, respond with: The answer to this question is not available
in the provided content.
"""

rag_prompt = ChatPromptTemplate.from_template(template)

llm = ChatOpenAI(model='gpt-4o-mini')

str_parser = StrOutputParser()

rag_chain = (
    RunnableLambda(get_context)
    | rag_prompt
    | llm
    | str_parser
)
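
The `|` operator in the chain above feeds each step's output into the next step. The toy class below imitates that idea with plain Python so the data flow is easy to follow. It is not LangChain's implementation; `Step`, `PROMPT` and `toy_chain` are hypothetical names, and an upper-casing function stands in for the LLM call.

```python
from typing import Any, Callable


class Step:
    """Wraps a function so steps can be chained with the | operator."""

    def __init__(self, func: Callable[[Any], Any]) -> None:
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        # Compose left to right: run this step first, then the next one
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value: Any) -> Any:
        return self.func(value)


PROMPT = "Question: {query}\nContext: {context}"

fetch = Step(lambda inputs: {"query": inputs["query"], "context": "toy context"})
fill = Step(lambda values: PROMPT.format(**values))
shout = Step(str.upper)  # stand-in for the LLM and output parser

toy_chain = fetch | fill | shout
print(toy_chain.invoke({"query": "hi"}))
```

The first step returns a dict with `query` and `context`, the same shape `get_context` produces so the prompt template can fill both placeholders.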

### Run RAG Chain

In [18]:
# Set the chroma DB path
current_dir = "/content/rag"
persistent_directory = os.path.join(current_dir, "db", "chroma_db_pdf_langchain")

In [19]:
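# Download the PDF file
import requests

pdf_url = 'https://arxiv.org/pdf/2507.10045.pdf'
response = requests.get(pdf_url, timeout=30)
response.raise_for_status()

# Make sure the target folder exists before writing
os.makedirs('content', exist_ok=True)
pdf_path = 'content/sparql_query_translation.pdf'
with open(pdf_path, 'wb') as file:
    file.write(response.content)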

In [20]:
# Write the query
query = 'What was zero-shot learning used for in the paper?'

In [21]:
answer = rag_chain.invoke({'pdf_path': pdf_path, 'query': query, 'db_path': persistent_directory})
print(f"Query:{query}\n")
print(f"Generated answer:{answer}")

Loading the existing vector store

Relevant chunks are retrieved...

Context is built from relevant chunks
Query:What was zero-shot learning used for in the paper?

Generated answer:Zero-shot learning was used in the paper to mitigate baseline limitations, particularly by including an entity-relation mapping variable to enhance the zero-shot prompt. This approach was aimed at directly quantifying the impact of schema alignment information when applied to the models Llama 3.1-8B and Mistral-Large-Instruct-2407.
