# 🦙 RAG Pipeline for PDF Analysis using LlamaIndex 🦙

In the data folder we have Amazon's quarterly financial statement (the Q2 2025 Form 10-Q). Here's how we're going to pull that into our knowledge base:

1. Extract data using `pdfplumber`
2. Convert to `LlamaIndex Documents`
3. Generate vector embeddings using `LlamaIndex` and `MiniLM`

In [None]:
%pip install llama-index openai pdfplumber pandas sentence-transformers llama-index-embeddings-huggingface

## 📄 Step 1: Extract data from PDF 📄

`pdfplumber` and `pandas` make quick work of this. 

What’s impressive about `pdfplumber` is how cleanly it extracts tables as lists of rows, which `pandas` then turns into `DataFrames`.

In [2]:
import pdfplumber
import pandas as pd


# --- 1️⃣ Extract from PDF ---
pdf_path = "data/amazon-10q-2025-q2.pdf"

texts = []
with pdfplumber.open(pdf_path) as pdf:
    for page in pdf.pages:
        # Extract plain text
        text = page.extract_text() or ""
        if text.strip():
            texts.append(text.strip())

        # Extract tables
        tables = page.extract_tables()
        for table in tables:
            df = pd.DataFrame(table)
            # Turn table into a readable string
            table_str = df.to_string(index=False, header=False)
            texts.append("Extracted table:\n" + table_str)

print(f"Extracted {len(texts)} text blocks from PDF.")



Extracted 104 text blocks from PDF.


The `texts` list contains one element for each page of text and one for each extracted table.

*Note: if the table format isn't clean, you may see single table rows parsed into their own object*
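When a table comes out messy, a little cleanup before storing it can help. The sketch below is an alternative to the `df.to_string(...)` step, not what this notebook runs: it assumes the table is a list of rows whose empty cells are `None` (the shape `page.extract_tables()` returns), and the helper name `table_to_text` and the sample data are made up for illustration.

```python
from typing import List, Optional

Table = List[List[Optional[str]]]


def table_to_text(table: Table) -> str:
    """Render a list-of-rows table as pipe-separated lines (hypothetical helper)."""
    lines = []
    for row in table:
        # Empty cells arrive as None; normalise them to empty strings.
        cells = [(cell or "").strip() for cell in row]
        # Skip rows that carry no content at all.
        if any(cells):
            lines.append(" | ".join(cells))
    return "\n".join(lines)


# Made-up sample shaped like one table from page.extract_tables()
sample_table = [
    ["Segment", "2024", "2025"],
    [None, None, None],
    ["AWS", "26,281", None],
]

print(table_to_text(sample_table))
```

Dropping fully empty rows keeps layout-only rows out of the knowledge base, while partially filled rows are kept so no figures are lost.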



In [18]:
# EXPLORE THIS DATA FOR YOURSELF

###################
#PAGE 7 OF THE PDF IS THE FIRST PAGE WITH ONLY TEXT
#-------------------
#print(texts[10])


###################
# THIS IS THE EXTRACTED TABLE FROM PDF PAGE 3
#-------------------
#print(texts[3])


###################
# OR SEE ALL OF THEM
#-------------------
# for index, t in enumerate(texts):
#   print(f"{index}---------")
#   print(t[:200])

## 🦙 Step 2: Convert to LlamaIndex documents 🦙

A `LlamaIndex` Document is the core data unit that LlamaIndex uses for 
indexing and retrieval. It holds both the raw text content and optional 
metadata (like source, page number, or timestamps) to preserve context.

Converting the raw text into Document objects allows LlamaIndex to:
- Split and preprocess text intelligently (e.g., chunking, cleaning)
- Track source attribution for retrieved passages
- Embed, index, and query across data types uniformly

In this case, each page of text and each table extracted from the PDF 
becomes its own Document, forming a mini knowledge base for the RAG pipeline.

In [None]:
from llama_index.core import Document

documents = [Document(text=t) for t in texts if t.strip()]
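The cell above stores only the text. To get the source attribution described earlier, each block would also need metadata. The sketch below prepares that metadata in plain Python; the helper `build_records`, the metadata keys, and the sample blocks are assumptions for illustration. Each record could then be passed on as `Document(text=record["text"], metadata=record["metadata"])`.

```python
from typing import Dict, List


def build_records(blocks: List[str], source: str) -> List[Dict]:
    """Pair each non-empty text block with simple metadata (hypothetical helper)."""
    records = []
    for position, text in enumerate(blocks):
        if not text.strip():
            continue  # same filter as the Document list comprehension
        # Table blocks were prefixed with "Extracted table:" during extraction.
        kind = "table" if text.startswith("Extracted table:") else "page_text"
        records.append(
            {
                "text": text,
                "metadata": {"source": source, "block_index": position, "kind": kind},
            }
        )
    return records


# Made-up blocks standing in for the real `texts` list
sample_blocks = [
    "Net sales increased in the second quarter.",
    "Extracted table:\nAWS 26,281 30,873",
    "   ",
]
records = build_records(sample_blocks, "data/amazon-10q-2025-q2.pdf")
print(records[1]["metadata"])
```

Recording the block index makes it possible to trace a retrieved passage back to its position in `texts`.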

## 🤖 Step 3: Generate Vector Embeddings 🤖

Vector embeddings turn each document into a list of numbers that captures its meaning, so passages relevant to a question can be found by similarity rather than by exact keyword match.

The papers listed at the end of this document give more background. For now, here is what you need to know.

`VectorStoreIndex` from `LlamaIndex` does two things:
1. Embeds each document with `all-MiniLM-L6-v2` to create a searchable vector index
2. Organizes these embeddings into a structure optimized for semantic search
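To make "semantic search" concrete, here is a toy sketch of the underlying idea: rank documents by the cosine similarity between their vectors and the query vector. It is not how `VectorStoreIndex` is implemented internally. The three-dimensional vectors are invented for illustration (`all-MiniLM-L6-v2` produces 384-dimensional vectors), and `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], docs: List[Sequence[float]], k: int = 2) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs for the k most similar documents."""
    scored = [(i, cosine_similarity(query, vec)) for i, vec in enumerate(docs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Invented toy embeddings: documents 0 and 2 point in nearly the same direction.
toy_doc_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
]
toy_query = [1.0, 0.0, 0.0]

ranked = top_k(toy_query, toy_doc_vectors, k=2)
print(ranked)
```

The query engine in Step 4 follows the same pattern at a larger scale: embed the question, retrieve the closest documents, and hand them to the LLM as context.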


In [None]:
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# downloads the model from huggingface and caches it
embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
amazon_10q_index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)

## ❓ Step 4: Query our new knowledge base ❓

In [23]:
# --- 4️⃣ Query ---
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-5-mini")
query_engine = amazon_10q_index.as_query_engine(llm=llm)

In [None]:
from IPython.display import Markdown

query = "Breakdown the sales numbers for me."
response = query_engine.query(query)

Markdown(f"### 🧠 Sales Numbers\n\n{response}")

### 🧠 Sales Numbers

Here’s a concise breakdown of the sales numbers (amounts in millions):

Net sales by product/service group
- Three months ended June 30
  - Online stores: $55,392 (2024) → $61,485 (2025)
  - Physical stores: $5,206 → $5,595
  - Third-party seller services: $36,201 → $40,348
  - Advertising services: $12,771 → $15,694
  - Subscription services: $10,866 → $12,208
  - AWS: $26,281 → $30,873
  - Other: $1,260 → $1,499
  - Consolidated total: $147,977 → $167,702

- Six months ended June 30
  - Online stores: $110,062 → $118,892
  - Physical stores: $10,408 → $11,128
  - Third-party seller services: $70,797 → $76,860
  - Advertising services: $24,595 → $29,615
  - Subscription services: $21,588 → $23,923
  - AWS: $51,318 → $60,140
  - Other: $2,522 → $2,811
  - Consolidated total: $291,290 → $323,369

Net sales by operating segment
- Three months ended June 30
  - North America: $90,033 → $100,068
  - International: $31,663 → $36,761
  - AWS: $26,281 → $30,873
  - Consolidated total: $147,977 → $167,702

- Six months ended June 30
  - North America: $176,374 → $192,955
  - International: $63,598 → $70,274
  - AWS: $51,318 → $60,140
  - Consolidated total: $291,290 → $323,369

Growth and mix highlights
- Consolidated year-over-year growth: +13% (Q2 2025) and +11% (six months 2025).
- Segment growth (year over year): North America +11% (Q2) and +9% (six months); International +16% (Q2) and +10% (six months); AWS +17% (Q2) and +17% (six months).
- Q2 2025 net sales mix: North America 60%, International 22%, AWS 18%.

Foreign exchange impact
- Foreign exchange increased consolidated net sales by $1.5 billion for Q2 2025; it did not have a significant impact on the six-month period.
- Foreign exchange reduced North America net sales by $169 million in Q2 2025 and by $582 million for the six months ended June 30, 2025.

If you want, I can reformat this into a table, show percentage changes by product group, or highlight the largest contributors to growth.

## 🛑 Try that again 🔄

The numbers I spot-checked were accurate, but a figure like 147,000 million is hard to read. Let's ask for the amounts in billions instead.

In [31]:
from IPython.display import Markdown

query = "Breakdown the sales numbers changes for me. Use billions instead of millions."
response = query_engine.query(query)

Markdown(f"### 🧠 Sales Numbers\n\n{response}")

### 🧠 Sales Numbers

Three months ended June 30 (Q2)
- North America: $90.033 bn → $100.068 bn; +$10.035 bn (+11.2%)
- International: $31.663 bn → $36.761 bn; +$5.098 bn (+16.1%)
- AWS: $26.281 bn → $30.873 bn; +$4.592 bn (+17.5%)
- Consolidated: $147.977 bn → $167.702 bn; +$19.725 bn (+13.3%)

Net sales by type (Q2)
- Net product sales: $61.569 bn → $68.246 bn; +$6.677 bn (+10.8%)
- Net service sales: $86.408 bn → $99.456 bn; +$13.048 bn (+15.1%)

Six months ended June 30
- North America: $176.374 bn → $192.955 bn; +$16.581 bn (+9.4%)
- International: $63.598 bn → $70.274 bn; +$6.676 bn (+10.5%)
- AWS: $51.318 bn → $60.140 bn; +$8.822 bn (+17.2%)
- Consolidated: $291.290 bn → $323.369 bn; +$32.079 bn (+11.0%)

Net sales by type (six months)
- Net product sales: $122.484 bn → $132.216 bn; +$9.732 bn (+7.9%)
- Net service sales: $168.806 bn → $191.153 bn; +$22.347 bn (+13.2%)
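Spot checks like these are easy to script. The sketch below recomputes the consolidated Q2 line from the figures in millions reported in the first response; the helper names `to_billions` and `pct_change` are made up for illustration.

```python
def to_billions(millions: float) -> float:
    """Convert an amount in millions to billions."""
    return millions / 1000


def pct_change(old: float, new: float) -> float:
    """Year-over-year change as a percentage of the old value."""
    return (new - old) / old * 100


# Consolidated Q2 net sales in millions, taken from the first response
q2_2024_total = 147_977
q2_2025_total = 167_702

print(f"${to_billions(q2_2024_total):.3f} bn -> ${to_billions(q2_2025_total):.3f} bn")
print(f"change: {to_billions(q2_2025_total - q2_2024_total):+.3f} bn "
      f"({pct_change(q2_2024_total, q2_2025_total):+.1f}%)")
```

Running the same two helpers over the other rows is a quick way to confirm that the model's unit conversion and percentages agree with the extracted table.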

# üèÅüèÅüèÅ We're done! üèÅüèÅüèÅ

Results of this quality were out of reach only a matter of months ago. In my runs, downgrading to `gpt-4o-mini` produced virtually unusable output, while all of the numbers I spot-checked in the `gpt-5-mini` responses were accurate.

The paper path that led us here:

| Date | Paper | Lab | Description |
| --- | --- | --- | --- |
| June 2017 | [Attention Is All You Need](https://arxiv.org/abs/1706.03762) | Google | The Transformer paper. It laid the foundation for LLMs |
| October 2018 | [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) | Google | Uses the Transformer architecture to derive the semantic meaning of text chunks |
| August 2019 | [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084) | UKPLab | Cuts the time to find the most similar pair among 10,000 sentences from about 65 hours with BERT to about 5 seconds. The `all-MiniLM-L6-v2` model we used is trained with the sentence-transformers framework that grew out of this work |




## 📦 Package notes 📦

llama-index-embeddings-huggingface
- creates vector embeddings of text
- wraps embedding model (in this case... sentence-transformers/all-MiniLM-L6-v2)

pdfplumber
- https://github.com/jsvine/pdfplumber
- built by data journalist Jeremy Singer-Vine (jsvine)
- builds on the `pdfminer.six` parsing engine

pandas
- Data analysis tool
- Handles structured and labeled data


sentence-transformers
- UKPLab (Ubiquitous Knowledge Processing Lab) at TU Darmstadt, Germany
- built on top of huggingface transformers
- specifically to turn sentences (or paragraphs or pages) into embeddings
- utilized with SDET

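all-MiniLM-L6-v2
- published under the sentence-transformers project, which originated at UKPLab (Ubiquitous Knowledge Processing Lab) at TU Darmstadt, Germany
- fine-tuned from a compact MiniLM base model to produce 384-dimensional sentence embeddings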
