# 🚢 RAG Pipeline for PDF Analysis using LlamaIndex 🦙

*June 18, 2023:*
The Titan submersible began a dive to the Titanic wreckage. It never resurfaced.

*October 2, 2025:*
The National Transportation Safety Board (NTSB) published this report of its investigation.

Source: https://www.ntsb.gov/investigations/AccidentReports/Reports/MIR2536.pdf

We analyze this PDF report in 4 steps:
1. Extract data using `pdfplumber`
2. Convert to `LlamaIndex Documents`
3. Generate vector embeddings using `LlamaIndex` and `MiniLM`
4. Query the resulting index with an LLM

In [None]:
%pip install llama-index openai pdfplumber pandas sentence-transformers llama-index-embeddings-huggingface

## 📄 Step 1: Extract data from PDF 📄

`pdfplumber` and `pandas` make quick work of this.

What's impressive about `pdfplumber` is how well it extracts tables: each one comes back as a list of rows that drops straight into a clean `pandas` `DataFrame`.

In [None]:
import pdfplumber
import pandas as pd


# --- 1️⃣ Extract from PDF ---
pdf_path = "data/ntsb-titan-submarine-report.pdf"

texts = []
with pdfplumber.open(pdf_path) as pdf:
    for page in pdf.pages:
        # Extract plain text
        text = page.extract_text() or ""
        if text.strip():
            texts.append(text.strip())

        # Extract tables
        tables = page.extract_tables()
        for table in tables:
            df = pd.DataFrame(table)
            # Turn table into a readable string
            table_str = df.to_string(index=False, header=False)
            texts.append("Extracted table:\n" + table_str)

print(f"Extracted {len(texts)} text blocks from PDF.")



This `texts` list contains one element for each page that has text and one for each extracted table.

*Note: if a table's formatting isn't clean, individual rows may be parsed as separate tables, each becoming its own text block.*
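
It helps to see what a raw table looks like before it becomes text. As far as we know, `page.extract_tables()` returns each table as a list of rows, each row a list of cell strings, with `None` for empty cells. The sketch below is one possible way to tidy such a table; `table_to_text` and the sample rows are hypothetical, not taken from the report or from `pdfplumber` itself.

```python
def table_to_text(table, sep=" | "):
    """Join a pdfplumber-style table (a list of rows) into readable lines."""
    lines = []
    for row in table:
        # Empty cells arrive as None; multi-line cells contain "\n"
        cells = [(cell or "").replace("\n", " ").strip() for cell in row]
        if any(cells):  # skip rows that are completely empty
            lines.append(sep.join(cells))
    return "\n".join(lines)


# Made-up sample in the shape we assume extract_tables() returns
sample_table = [
    ["Acronym", "Meaning"],
    ["NTSB", "National Transportation\nSafety Board"],
    [None, None],
]
print(table_to_text(sample_table))
```

A helper like this could stand in for the `df.to_string(...)` call if the pandas output turns out too wide or noisy for embedding.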



In [None]:
# EXPLORE THIS DATA FOR YOURSELF

###################
#PAGE 1
#-------------------
print("------ TITLE PAGE -------")
print(texts[0][:300])


###################
# THIS IS THE ACRONYM AND ABBREVIATION TABLE
# (notice each table row is its own text block)
#-------------------
print("------ PAGE 5 ACRONYM TABLE -------")
print('\n'.join(texts[6:12]))


## 🦙 Step 2: Convert to LlamaIndex documents 🦙

A `LlamaIndex` Document is the core data unit that LlamaIndex uses for 
indexing and retrieval. It holds both the raw text content and optional 
metadata (like source, page number, or timestamps) to preserve context.

Converting the raw text into Document objects allows LlamaIndex to:
- Split and preprocess text intelligently (e.g., chunking, cleaning)
- Track source attribution for retrieved passages
- Embed, index, and query across data types uniformly

In this case, each page of text and each table extracted from the PDF 
becomes its own Document, forming a mini knowledge base for the RAG pipeline.

In [None]:
from llama_index.core import Document

documents = [Document(text=t) for t in texts if t.strip()]
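
The cell above builds each Document from text alone. Because Documents can also carry metadata, one option is to record the page number and block type during extraction and pass them along. This is a minimal sketch, assuming a hypothetical `build_records` helper and made-up page contents; each record is shaped to map onto `Document(text=..., metadata=...)`.

```python
def build_records(pages, source):
    """pages: list of (page_text, [table_str, ...]) tuples, one per PDF page."""
    records = []
    for page_number, (page_text, table_strs) in enumerate(pages, start=1):
        if page_text.strip():
            records.append({
                "text": page_text.strip(),
                "metadata": {"source": source, "page": page_number, "kind": "text"},
            })
        for table_str in table_strs:
            records.append({
                "text": "Extracted table:\n" + table_str,
                "metadata": {"source": source, "page": page_number, "kind": "table"},
            })
    return records


# Made-up stand-in for what the extraction loop would collect
pages = [
    ("Title page text", []),
    ("", ["NTSB  National Transportation Safety Board"]),
]
records = build_records(pages, source="ntsb-titan-submarine-report.pdf")
for record in records:
    print(record["metadata"])

# With llama-index installed, each record could become a Document:
# documents = [Document(text=r["text"], metadata=r["metadata"]) for r in records]
```

With page numbers attached, a retrieved passage can be traced back to where it appears in the report.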

## 🤖 Step 3: Generate Vector Embeddings 🤖

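Vector embeddings turn each piece of text into a list of numbers, so passages with similar meaning end up close together. That is what lets the pipeline find the passages relevant to a question and hand them to the LLM.

You can find more on this at the end of the document. For now, here is what you need to know...

`VectorStoreIndex` (from `llama_index.core`) does two things:
1. Creates a vector embedding of each document (split into chunks where needed) using `all-MiniLM-L6-v2`
2. Organizes these embeddings into a structure optimized for semantic search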
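
To make "semantic search" concrete: at query time the question is embedded too, and the stored vectors are ranked by how closely they point in the same direction. Below is a toy sketch of that ranking using cosine similarity, which we believe is the default for LlamaIndex's in-memory store (treat that as an assumption). The three-dimensional vectors and the `rank_by_similarity` helper are made up; real `all-MiniLM-L6-v2` embeddings have 384 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs, top_k=2):
    """Return the names of the top_k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]


# Made-up embeddings standing in for three text blocks
toy_index = {
    "hull design": [0.9, 0.1, 0.0],
    "acronym table": [0.0, 0.2, 0.9],
    "pressure testing": [0.7, 0.6, 0.1],
}
toy_query = [0.8, 0.3, 0.0]  # made-up embedding of a question about the hull
print(rank_by_similarity(toy_query, toy_index))
```

The query engine in Step 4 passes the top-ranked passages to the LLM as context.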


In [None]:
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# downloads the model from huggingface and caches it
embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
titan_report_index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)

## ❓ Step 4: Query our new knowledge base ❓

In [None]:
# --- 4️⃣ Query ---
from llama_index.llms.openai import OpenAI

# Requires the OPENAI_API_KEY environment variable to be set
llm = OpenAI(model="gpt-5-mini")
query_engine = titan_report_index.as_query_engine(llm=llm)

In [None]:
from IPython.display import Markdown

query = "What could have prevented this tragedy?"
response = query_engine.query(query)

Markdown(f"### 🧠 Response\n\n{response}")

### ❓ Step 4.1: Now try yours! ❓

In [None]:

query = ""  # type your own question here
response = query_engine.query(query)

Markdown(f"### 🧠 Submarine answer: \n\n{response}")

# üèÅüèÅüèÅ We're done! üèÅüèÅüèÅ

The paper path that led us here:

| Date | Paper | Lab | Description |
| --- | --- | --- | --- |
| June 2017 | [Attention Is All You Need](https://arxiv.org/abs/1706.03762) | Google | Transformers paper. This built the foundation for LLMs |
| October 2018 | [Bidirectional encoder representations from transformers (BERT)](https://arxiv.org/abs/1810.04805) | Google | Uses the transformer architecture to derive the semantic meaning of text chunks |
| August 2019 | [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084) | UKPLab | Cuts the time to find the most similar pair among 10,000 sentences from about 65 hours with BERT to about 5 seconds. The `all-MiniLM-L6-v2` model we used comes from the `sentence-transformers` library that grew out of this work |
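
The 65-hour figure comes down to pair counting. Here is a sketch of the arithmetic, assuming (as the Sentence-BERT abstract describes) that plain BERT needs one forward pass per sentence pair, while a sentence-embedding model encodes each sentence once and compares the vectors afterward. The function names are ours.

```python
def pairwise_inferences(n_sentences):
    """Forward passes needed when every unordered pair is scored together."""
    return n_sentences * (n_sentences - 1) // 2


def embedding_inferences(n_sentences):
    """Forward passes needed when each sentence is embedded once."""
    return n_sentences


n = 10_000
print(f"pairwise:  {pairwise_inferences(n):,}")
print(f"embedding: {embedding_inferences(n):,}")
```

The pairwise count grows quadratically with the number of sentences, which is why embedding each sentence once makes search over a whole report practical.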




## 📦 Package notes 📦

`llama-index`
- Open-source framework for building RAG pipelines

`llama-index-embeddings-huggingface`
- Creates vector embeddings of text
- Wraps an embedding model (in this case, `sentence-transformers/all-MiniLM-L6-v2`)

`pdfplumber`
- https://github.com/jsvine/pdfplumber
- Built by data journalist Jeremy Singer-Vine (jsvine)
- Builds on the `pdfminer.six` parsing engine

`pandas`
- Data analysis tool
- Handles structured and labeled data

`sentence-transformers`
- From UKPLab (Ubiquitous Knowledge Processing Lab) at TU Darmstadt, Germany
- Turns sentences (or paragraphs or pages) into embeddings
- The model `all-MiniLM-L6-v2` does this very quickly; read the Sentence-BERT paper above to see how
