# AI That Reads and Understands PDFs with LangGraph, OpenAI, and PyPDF

Easily build an AI that reads and understands your PDF documents using LangGraph, OpenAI, and PyPDF. Perfect for developers, researchers, and teams who need to extract insights and answer questions from documents with minimal effort.

## Overview

- **Input Parameters:** PDF document(s), user question  
- **Output Structure:** Extracted context, summarized insights, and direct answers to questions based on document content  
- **Technologies:** LangGraph, OpenAI API, PyPDF  
- **Use Cases:** Document search and QA, legal and research document analysis, internal knowledge base exploration, and intelligent PDF assistants

## Setup

### Jupyter Notebook

This and other tutorials are perhaps most conveniently run in a [Jupyter notebook](https://jupyter.org/). Working through guides in an interactive environment is a good way to understand them. See the [installation instructions](https://jupyter.org/install) to get set up.

### Installation

This tutorial requires the following LangChain packages, plus `pypdf` for reading PDFs:

In [None]:
%pip install --quiet --upgrade langchain-community langgraph langchain-core "langchain[openai]" typing_extensions langchain-text-splitters pypdf

### LangSmith

Many of the applications you build with LangChain will contain multiple steps with multiple LLM calls.
As these applications get more complex, it becomes crucial to be able to inspect what exactly is going on inside your chain or agent.
The best way to do this is with [LangSmith](https://smith.langchain.com).

After you sign up at the link above, make sure to set your environment variables to start logging traces:

In [None]:
import getpass
import os

os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = getpass.getpass()

## Step 1: Load, Chunk, Embed, and Index the PDF

In this step, we will:

1. Load a PDF file from a [URL](https://arxiv.org/pdf/2312.10997) 
2. Split it into smaller chunks
3. Generate semantic embeddings using OpenAI's `text-embedding-3-large`
4. Store the embeddings in an in-memory vector store for fast querying

### 🔐 OpenAI API Key Setup

To generate embeddings using OpenAI’s `text-embedding-3-large` model, you’ll need an `OPENAI_API_KEY`.

Follow these steps to get your API key:

1. Go to [https://platform.openai.com/api-keys](https://platform.openai.com/api-keys)
2. Log into your OpenAI account (or sign up if you don’t have one)
3. Click **"Create new secret key"**
4. Copy the generated API key and store it securely

In [None]:
os.environ["OPENAI_API_KEY"] = getpass.getpass()
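
When a notebook is re-run, prompting for every key again gets tedious. A minimal sketch of one way to avoid that, assuming a small helper (the name `ensure_env` is hypothetical, not part of any library) that only prompts when the variable is missing:

```python
import getpass
import os


def ensure_env(name: str) -> None:
    """Prompt for a secret only if the environment variable is not already set."""
    if not os.environ.get(name):
        os.environ[name] = getpass.getpass(f"{name}: ")


# Hypothetical usage: call once per required key before creating any clients.
# ensure_env("OPENAI_API_KEY")
```

The call is left commented out so that running the cell does not block on input; uncomment it in place of the direct `getpass` assignment if you prefer this pattern.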

To start, we'll load a PDF document from a public URL and prepare it for processing.  
This involves three key steps:

1. Initializing the OpenAI embedding model (`text-embedding-3-large`)
2. Setting up an in-memory vector store to hold our document chunks
3. Loading the PDF content using LangChain's `PyPDFLoader`

This will give us the raw text we need before splitting and embedding it.


In [None]:
from langchain_openai import OpenAIEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_community.document_loaders import PyPDFLoader

# PDF file URL
file_path = "https://arxiv.org/pdf/2312.10997"

# Initialize embedding model
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

# Initialize in-memory vector store
vector_store = InMemoryVectorStore(embeddings)

# Load PDF
loader = PyPDFLoader(file_path=file_path)
docs = loader.load()

print(docs[0].page_content[:500])

### 📚 What Are Chunks?

Chunks are small, coherent pieces of text extracted from the original document.  
Because large language models (LLMs) like GPT have a limited context window, a long document often cannot be passed to the model in one piece.

By splitting the content into overlapping segments (chunks), we enable:

- Efficient semantic search
- Better relevance during question answering
- Reduced risk of token overflow errors

In this tutorial, we use a chunk size of **1000 characters** with **200-character overlap** to preserve context.
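
To make the idea of overlapping chunks concrete, here is a simplified sketch that slides a fixed-size character window over a string. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on paragraph, line, and word boundaries), and the function name `chunk_text` is made up for illustration. Small numbers are used so the overlap is easy to see.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_text(sample, chunk_size=10, chunk_overlap=4)
print(pieces)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk repeats the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.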


In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split the text into chunks of up to 1000 characters with 200 characters of overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
text_splits = text_splitter.split_documents(docs)

# Let's inspect the first chunk
text_splits[0].page_content[:500]

### 🧠 What Are Vectors?

Vectors are high-dimensional numerical representations of text.  
They are created by passing a text chunk through an **embedding model**, such as OpenAI's `text-embedding-3-large`.

These vectors capture the **semantic meaning** of the text. Texts with similar meaning will have embeddings (vectors) that are close to each other in vector space.

We use these vectors to:

- Search for similar content
- Match user questions to relevant document chunks
- Enable context-aware AI responses
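
"Close to each other in vector space" is usually measured with cosine similarity. The sketch below computes it by hand for three made-up 3-dimensional vectors; real embeddings come from the model and have far more dimensions, and the values here are invented purely for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy "embeddings" (assumption: not produced by any real model).
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print(cosine_similarity(cat, kitten))   # close to 1: similar meaning
print(cosine_similarity(cat, invoice))  # close to 0: unrelated
```

A score near 1 means the vectors point in nearly the same direction, which is what we expect for texts with similar meaning.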

Now that we have the document split into chunks, let's generate embeddings for each chunk and store them in our vector store.

In [None]:
# Add all chunks to the vector store (embedding happens under the hood)
vector_store.add_documents(documents=text_splits)

## Step 2: Retrieve & Generate

Now that we’ve embedded and indexed the document, we can use it to answer user questions.

This step involves two parts:

1. **Retrieve** the most relevant document chunks based on a user query using semantic similarity.
2. **Generate** a concise, context-aware response using a language model (LLM) and a custom prompt.

We will use OpenAI’s GPT model to generate answers grounded in the retrieved document context.

### 🧠 Define the Prompt Template for Retrieval-Augmented Generation (RAG)

The prompt template instructs the LLM to answer from the retrieved context and to say so when it doesn't know the answer.  
This reduces the risk of hallucinations and helps keep answers grounded in the document.


In [None]:
from langchain_core.prompts import PromptTemplate

rag_prompt_template = PromptTemplate(
    template="""
You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question} 
Context: {context} 
Answer:
""",
    input_variables=["question", "context"],
)
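
To preview roughly what the model will receive, the same wording can be filled in with plain `str.format`. This is only a sketch: `TEMPLATE`, the sample question, and the `chunks` list are made up for illustration, and it assumes that simple `{name}` placeholders render the same way here as they do in `PromptTemplate`.

```python
TEMPLATE = """
You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question}
Context: {context}
Answer:
"""

# Made-up chunks standing in for retrieved documents.
chunks = [
    "RAG combines retrieval with generation.",
    "It grounds answers in external text.",
]

preview = TEMPLATE.format(
    question="What is RAG?",
    context="\n\n".join(chunks),  # chunks separated by a blank line
)
print(preview)
```

Printing the filled prompt is a quick way to check that the question and context end up where you expect before spending tokens on a model call.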


### ❓ Enter a Question Related to the [PDF](https://arxiv.org/pdf/2312.10997)

Here are some example questions you can ask based on the content of the paper:

- *What problems does RAG solve in large language models?*
- *What are the three paradigms of RAG described in the paper?*
- *How does RAG mitigate hallucination in LLMs?*
- *What is Modular RAG and how is it different from Naive RAG?*
- *What challenges in RAG research are mentioned in the paper?*

Feel free to enter your own question related to the paper.


🔍 **What happens after you enter a question?**

When you type your question and run the next cell:

1. The system will **search the indexed document chunks** for the most semantically similar content.
2. It uses **vector similarity** (cosine similarity) to find the most relevant passages in the PDF.
3. These passages are stored in memory and used as **context for the language model** in the next step.

This helps the model give precise, grounded answers rather than guessing.
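
The search step can be sketched in a few lines of plain Python: score every stored chunk against the query vector and keep the `k` best. The store contents, vectors, and function names below are invented for illustration; in the tutorial the vector store does this for you, and `similarity_search` accepts a `k` argument (4 by default) to control how many chunks come back.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def top_k(query_vec, store, k=4):
    """Return the k chunk texts whose vectors are most similar to the query."""
    scored = ((cosine(query_vec, vec), text) for text, vec in store)
    return [text for _, text in heapq.nlargest(k, scored)]


# Toy store of (text, vector) pairs with made-up vectors.
store = [
    ("RAG retrieves external documents.", [0.9, 0.1, 0.0]),
    ("Hallucinations are unsupported claims.", [0.1, 0.9, 0.1]),
    ("Fine-tuning updates model weights.", [0.0, 0.2, 0.9]),
]
query = [0.8, 0.3, 0.0]  # pretend embedding of a question about retrieval

print(top_k(query, store, k=2))
# ['RAG retrieves external documents.', 'Hallucinations are unsupported claims.']
```

A larger `k` gives the model more context to work with, at the cost of a longer prompt and a higher chance of including loosely related passages.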

In [None]:
# Let the user enter a custom question about the RAG survey paper
user_question = input("Question: ")

# Perform semantic similarity search
retrieved_docs = vector_store.similarity_search(user_question)

# Store in simulated state
state = {
    "question": user_question,
    "docs": retrieved_docs
}

### 📄 Prepare the Context and Generate an Answer

Now that we’ve retrieved the most relevant document chunks, we’ll format them into a single text block.  
We then pass this context along with the user’s question into our prompt template.

Finally, we invoke the language model (GPT-4) to generate a concise answer based strictly on the provided context.


In [None]:
from langchain_openai import ChatOpenAI

# Format the context into a string
context = "\n\n".join(doc.page_content for doc in state["docs"])

# Fill the prompt template
question_prompt = rag_prompt_template.invoke({
    "question": state["question"],
    "context": context
})

# Initialize the language model
llm = ChatOpenAI(model="gpt-4")

# Get the response
response = llm.invoke(question_prompt)

# Show the answer
print(response.content)


## Step 3: Build a RAG Workflow Using LangGraph

In this step, we'll use [LangGraph](https://github.com/langchain-ai/langgraph) to orchestrate the entire **Retrieval-Augmented Generation (RAG)** pipeline as a stateful workflow.

LangGraph allows us to define the steps in a computation as nodes in a graph, making the pipeline modular and reusable.

Our workflow will:

1. Accept a user question
2. Retrieve the most relevant chunks from the vector store
3. Use a language model to generate a concise, grounded answer
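
Before bringing in LangGraph, it can help to see the core idea in plain Python: each node is a function that reads the current state and returns a partial update, and the runner merges that update into the state before calling the next node. The sketch below covers only this linear case; `run_pipeline`, `fake_retrieve`, and `fake_generate` are hypothetical stand-ins, and LangGraph itself adds much more (branching, checkpointing, streaming).

```python
def run_pipeline(nodes, state):
    """Run nodes in order, merging each node's partial update into the state."""
    for node in nodes:
        state = {**state, **node(state)}
    return state


# Stand-in nodes (hypothetical): the real ones call a vector store and an LLM.
def fake_retrieve(state):
    return {"docs": [f"chunk about {state['question']}"]}


def fake_generate(state):
    return {"answer": f"Based on {len(state['docs'])} chunk(s)."}


result = run_pipeline([fake_retrieve, fake_generate], {"question": "RAG"})
print(result)
# {'question': 'RAG', 'docs': ['chunk about RAG'], 'answer': 'Based on 1 chunk(s).'}
```

Because each node only returns the keys it changes, nodes stay small and can be reordered or swapped without touching the rest of the pipeline.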

### Define the Application State

We'll start by defining the data structure (state) that will be passed between nodes in the graph.


In [None]:
from typing import TypedDict, List
from langchain_core.documents import Document

class RagState(TypedDict):
    question: str
    docs: List[Document]
    answer: str

### 🔍 Define the Retrieval Function

This function receives the question, performs a similarity search using our in-memory vector store,  
and returns the top matching documents. These will be passed to the next step in the graph.


In [None]:
def retrieve(state: RagState):
    retrieved_docs = vector_store.similarity_search(state["question"])
    return {"docs": retrieved_docs}

### 🧠 Define the Answer Generation Function

This function takes the retrieved documents and the original question,  
formats them using our prompt template, and calls the LLM to generate a final answer.


In [None]:
def generate(state: RagState):
    context = "\n\n".join(doc.page_content for doc in state["docs"])
    question_prompt = rag_prompt_template.invoke({
        "question": state["question"],
        "context": context
    })
    response = llm.invoke(question_prompt)
    return {"answer": response.content}

### 🔄 Create the Workflow Using LangGraph

Now we'll connect the nodes: first retrieval, then generation.  
We also specify the entry point and the terminal step using LangGraph's API.

In [None]:
from langgraph.graph import StateGraph, END

# Create the graph
workflow = StateGraph(RagState)

# Add nodes
workflow.add_node("retrieve", retrieve)
workflow.add_node("generate", generate)

# Define flow
workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "generate")
workflow.add_edge("generate", END)

# Compile the graph into an executable app
app = workflow.compile()


### ▶️ Run the Workflow with a Custom Question

You can now run the complete RAG workflow by providing a new question.
The system will retrieve context and generate an answer in a fully automated fashion.

In [None]:
question = input("Ask a question about the RAG paper: ")

result = app.invoke({"question": question})

print("🔍 Answer:\n", result["answer"])