

Unstructured: “Document Understanding” & Chunking


### Explanation for Code Block 1
In this first cell, we pull down a series of WSU PDFs, merge them into one “mega-PDF,” and set up access to our vector database. We use **`requests`** to reliably download each file over HTTPS, and **`pypdf`**’s `PdfReader`/`PdfWriter` to concatenate all pages into a single document. The `glob` and `os` modules handle file paths and directories (`os.makedirs` ensures our `pdfs/` folder exists). After writing out `all_transfer_docs.pdf`, we load configuration values (API keys) from a separate `pinecone.init` file via **`configparser`**—this keeps secrets out of source control. Finally, we import the Pinecone client and instantiate it with our API key so we can start building a semantic index.  
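For reference, here is a minimal sketch of what `pinecone.init` might contain and how `configparser` reads it. The layout is an assumption inferred from the `config['DEFAULT'][...]` lookups in the cell below, and the key values are placeholders. The sketch parses from a string so it runs without a real file.

```python
import configparser

# Hypothetical contents of pinecone.init (placeholder values, not real keys).
SAMPLE_INI = """
[DEFAULT]
PINECONE_API_KEY = pc-placeholder-key
OPENAI_API_KEY = sk-placeholder-key
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE_INI)  # the notebook calls config.read('pinecone.init') instead

pinecone_key = config["DEFAULT"]["PINECONE_API_KEY"]
openai_key = config["DEFAULT"]["OPENAI_API_KEY"]
print(pinecone_key, openai_key)
```

Note that `config.read(...)` skips missing files silently and returns the list of files it actually parsed, so an empty return value is a quick way to spot a misplaced `pinecone.init` before the key lookups raise `KeyError`.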


In [7]:
import requests
from pypdf import PdfReader, PdfWriter
import glob
import os

pdf_urls = [
    "https://wpcdn.web.wsu.edu/wp-daesa/uploads/sites/3116/2024/04/Transfer-Center-Brochure.pdf",
    "https://registrar.wsu.edu/media/fzepa10r/admission.pdf",
    "https://em.wsu.edu/media/751323/washington-45-list-of-one-year-transfer-courses.pdf",
    "https://s3.wp.wsu.edu/uploads/sites/917/2017/11/Student-Guide.pdf",
    "https://s3.wp.wsu.edu/uploads/sites/16/2024/02/Transfer-Student-Course-Planner-2023-2024-updated-1.pdf",
    "https://s3.wp.wsu.edu/uploads/sites/917/2015/07/WSU-Articulation-Handbook.pdf",
    "https://s3.wp.wsu.edu/uploads/sites/2920/2024/10/Video-Text-Description_How-to-Send-Your-Transcripts-to-WSU.pdf",
]

os.makedirs("pdfs", exist_ok=True)

# Download each PDF into pdfs/, using the last URL path segment as the filename
downloaded_paths = []
for url in pdf_urls:
    filename = url.split("/")[-1]
    dest_path = os.path.join("pdfs", filename)
    resp = requests.get(url, timeout=30)  # fail instead of hanging on a stalled server
    resp.raise_for_status()               # stop on HTTP errors such as 404
    with open(dest_path, "wb") as f:
        f.write(resp.content)
    downloaded_paths.append(dest_path)
    print(f"Downloaded {filename}")

# Merge all downloaded PDFs into one mega-PDF
writer = PdfWriter()
for path in downloaded_paths:
    reader = PdfReader(path)
    for page in reader.pages:
        writer.add_page(page)

all_pdf = "all_transfer_docs.pdf"
with open(all_pdf, "wb") as f:
    writer.write(f)

print(f"Merged {len(downloaded_paths)} files into {all_pdf}")

import configparser
from langchain.text_splitter import CharacterTextSplitter
import sys
print(sys.executable)  # show which Python interpreter (virtual environment) is running
from pinecone import Pinecone, ServerlessSpec
import pinecone

# Load API keys from pinecone.init, a plain-text INI file kept separate from the
# notebook so secrets stay out of source control.
config = configparser.ConfigParser()
config.read('pinecone.init')
PINECONE_API_KEY = config['DEFAULT']['PINECONE_API_KEY']
OPENAI_API_KEY = config['DEFAULT']['OPENAI_API_KEY']

print(pinecone.__version__)  # 7.2.0 in this run

# Connect to Pinecone
pc = Pinecone(api_key=PINECONE_API_KEY)


Downloaded Transfer-Center-Brochure.pdf
Downloaded admission.pdf
Downloaded washington-45-list-of-one-year-transfer-courses.pdf
Downloaded Student-Guide.pdf
Downloaded Transfer-Student-Course-Planner-2023-2024-updated-1.pdf
Downloaded WSU-Articulation-Handbook.pdf
Downloaded Video-Text-Description_How-to-Send-Your-Transcripts-to-WSU.pdf
Merged 7 files into all_transfer_docs.pdf
c:\Users\jeffd\OneDrive\Desktop\ConnectAI\RAG_phase2\.pine_env\Scripts\python.exe
7.2.0


### Explanation for Code Block 2

Here we transform the raw PDF into clean, chunked text ready for embedding. We rely on **Unstructured** (`partition_pdf`) to break the PDF into logical “elements” (text, tables, images) with OCR fallback via Tesseract, which we select by setting `OCR_AGENT`. Then, **LangChain**’s `RecursiveCharacterTextSplitter` slices long passages into overlapping chunks (500 characters with 100-char overlap) to preserve context edges. A helper `remove_odd_chars` collapses whitespace and strips non-ASCII noise. We also define `filter_by_type` to drop metadata fields Pinecone won’t accept (e.g., complex objects). Finally, we loop through each element, split and merge small fragments into sensible chunks, and save the result to `clean_text_cache.json` for fast reloads later.  
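
The clean-up and small-fragment merge described above can be tried in isolation. This sketch restates the notebook's `remove_odd_chars` and wraps the merge loop in a helper called `merge_small_chunks`; that name and its `min_len` parameter are introduced here for illustration, with the default of 40 characters taken from the notebook's loop. The sample strings are made up.

```python
import re

def remove_odd_chars(text):
    text = re.sub(r"\s+", " ", text)          # collapse runs of whitespace
    text = re.sub(r"[^\x20-\x7E]", "", text)  # drop anything outside printable ASCII
    return text.strip()

def merge_small_chunks(chunks, min_len=40):
    """Fold fragments shorter than min_len into the next long-enough chunk."""
    merged, buff = [], ""
    for ch in chunks:
        if len(ch.strip()) < min_len:
            buff += " " + ch.strip()          # hold short fragments back
        else:
            merged.append((buff + " " + ch).strip())
            buff = ""
    if buff.strip():                          # trailing fragments form a final chunk
        merged.append(buff.strip())
    return merged

# Made-up sample fragments: a short heading, a full sentence, a short trailer.
sample = [
    "About Us",
    "The Transfer Center helps students plan their move to WSU.",
    "Contact",
]
merged = merge_small_chunks(sample)
print(merged)

cleaned = remove_odd_chars("Transfer\t Center\n\n  Brochure\u2122")
print(cleaned)
```

The short heading is prepended to the sentence that follows it, while the trailing fragment has nothing to attach to and becomes its own chunk.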
# PDF Parsing, Chunking, and Vector Indexing with Pinecone

This notebook walks through parsing PDFs with `unstructured`, splitting the extracted text into context-preserving chunks using LangChain, and pushing those chunks into a Pinecone index that automatically embeds and stores vectors.
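
To see what overlapping chunks look like, here is a deliberately simplified fixed-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers natural separators such as paragraph and line breaks); the function name `sliding_chunks` and the tiny sizes are illustrative assumptions.

```python
def sliding_chunks(text, chunk_size, overlap):
    """Cut text into fixed-size windows that share `overlap` characters with their neighbor."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    # Stop once the remaining text is already covered by the previous window.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]

text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(text, chunk_size=10, overlap=4)
for c in chunks:
    print(c)
```

Each window repeats the last four characters of the one before it, which is the same idea as the notebook's 100-character overlap on 500-character chunks: a sentence cut at a chunk boundary still appears whole in at least one neighbor.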

---

## 1. Parse PDF with `partition_pdf`

The `partition_pdf` call in the next cell:

- uses YOLOX for layout detection
- enables table extraction
- sets the Tesseract OCR language to English
- extracts tables as base64 image blocks

In [None]:
# Extract structured elements from the PDF
from importlib.metadata import version, PackageNotFoundError
from langchain.text_splitter import RecursiveCharacterTextSplitter
from unstructured.partition.pdf import partition_pdf
import json
import re

def get_pkg_version(pkg_name: str) -> str:
    """Return the installed version of a package, or a note if it is missing."""
    try:
        return version(pkg_name)
    except PackageNotFoundError:
        return f"{pkg_name} not installed"

print("pinecone:", get_pkg_version("pinecone"))
print("langchain:", get_pkg_version("langchain"))
print("unstructured:", get_pkg_version("unstructured"))
print("unstructured-inference", get_pkg_version("unstructured-inference"))  # 1.0.5 in this run

# Step 1: partition the PDF into raw elements (Unstructured handles this part).
# Select Tesseract as the OCR agent before calling partition_pdf.


os.environ["OCR_AGENT"] = "unstructured.partition.utils.ocr_models.tesseract_ocr.OCRAgentTesseract"


elements = partition_pdf(
    filename=all_pdf,
    strategy="hi_res",
    hi_res_model_name="yolox",  # Instead of detectron2_onnx
    skip_infer_table_types=[],            # Enable table extraction
    languages=["eng"],                    # Specify languages for OCR
    extract_image_block_types=["Table"]   # Extract table images as base64
)
def remove_odd_chars(text):
    text = re.sub(r"\s+", " ", text)  # collapse whitespace
    text = re.sub(r"[^\x20-\x7E]", "", text)  # remove non-ASCII characters
    return text.strip()

max_chunk_size = 500
overlap_size = 100

# The splitter breaks long elements into chunks; the overlap keeps context from being lost at chunk edges
splitter = RecursiveCharacterTextSplitter(chunk_size=max_chunk_size, chunk_overlap=overlap_size)

# Helper: keep only the metadata value types that Pinecone accepts
def filter_by_type(metadata):
    filtered_data = {}
    for k, v in metadata.items():
        if isinstance(v, (str, int, float, bool)):
            filtered_data[k] = v
        # Lists are kept only when every item is a string
        elif isinstance(v, list) and all(isinstance(x, str) for x in v):
            filtered_data[k] = v
    return filtered_data

# Main loop: clean each element, split it, merge tiny fragments, and attach metadata
clean_text = []
for elem in elements:
    # Skip empty entries
    if elem is None:
        continue
    # Only elements that carry a "text" attribute are processed
    elif hasattr(elem, "text"):

        merged_chunks = []
        buff = ""
        chunks = splitter.split_text(remove_odd_chars(elem.text))

        # Hold back fragments under 40 characters so they don't become noisy standalone chunks
        for ch in chunks:
            if len(ch.strip()) < 40:
                buff += " " + ch.strip()
            else:
                # Prepend any buffered fragments to the next full-size chunk
                full_chunk = (buff + " " + ch).strip()
                merged_chunks.append(full_chunk)
                buff = ""
        if buff.strip():
            merged_chunks.append(buff.strip())

        # Every merged chunk from this element shares the element's metadata
        # (an empty dict if the element has none), with unsupported types removed
        metadata = elem.metadata.to_dict() if hasattr(elem, "metadata") else {}
        filtered_metadata = filter_by_type(metadata)

        for chunk in merged_chunks:
            # Store the text and its metadata for the vector DB
            clean_text.append({"chunk_text": chunk, "metadata": filtered_metadata})

            
             
# Report how many chunks are ready for Pinecone
print(f"Prepared {len(clean_text)} chunks.")

# Cache the chunks on disk so later cells can reload them without re-parsing the PDF

with open("clean_text_cache.json", "w", encoding="utf-8") as f:
    json.dump(clean_text, f, ensure_ascii=False, indent=2)

pinecone: 7.2.0
langchain: 0.3.26
unstructured: 0.18.1
unstructured-inference 1.0.5
Prepared 519 chunks.




In [13]:

# Connect to Pinecone
pc = Pinecone(api_key=PINECONE_API_KEY)

# Create serverless index using llama-text-embed-v2 (auto-embedding)
index_name = "rag2riches"

if not pc.has_index(index_name):
    pc.create_index_for_model(
        name=index_name,
        cloud="aws",
        region="us-east-1",
        embed={
            "model": "llama-text-embed-v2",
            "field_map": {"text": "chunk_text"}
        }
    )
    print("Integrated embedding index created.")
else:
    print("Index already exists, skipping creation.")

index = pc.Index(index_name)


Index already exists, skipping creation.


### Explanation for Code Block 3

This cell ensures our Pinecone index exists and is configured for integrated (automatic) embedding. We define the index name (`rag2riches`) and check for it with `pc.has_index()`. If it is absent, we call `pc.create_index_for_model`, telling Pinecone to host the index on AWS in `us-east-1` and to embed text with the **`llama-text-embed-v2`** model, so we never compute or upload vectors ourselves. The `field_map` argument instructs Pinecone to pull the text to embed from our `chunk_text` field. Because creation is guarded by the existence check, re-running the cell skips creation instead of attempting it again. Finally, we bind `index = pc.Index(index_name)` so subsequent calls (upsert, query) know where to read and write.

---

In [None]:
'''
Load pre-chunked documents, validate and filter metadata,
construct Pinecone upsert records, and push them in batches.
'''

# Standard library imports
import json  # For reading and writing JSON data
from datetime import datetime  # For parsing and formatting date strings

# Load the cached clean text chunks from disk
with open("clean_text_cache.json", "r", encoding="utf-8") as f:
    clean_text = json.load(f)

def validate_metadata(metadata):
    """
    Ensure all metadata fields are of an allowed type (str, int, float, bool, or list of str).
    Print out any invalid fields for debugging.
    """
    for key, value in metadata.items():
        if isinstance(value, (str, int, float, bool)):
            continue  # Basic scalar types are fine
        elif isinstance(value, list) and all(isinstance(item, str) for item in value):
            continue  # Lists of strings are allowed
        else:
            print(f"Invalid field: {key} — type: {type(value)} — value: {value}")

# Prepare a list to accumulate Pinecone records
records = []

# Define keywords to exclude or remap in metadata
excluded_keywords = ["parent_id"]

# Iterate through each document chunk
for idx, doc in enumerate(clean_text):
    filtered_metadata = {}
    metadata = doc.get("metadata", {})

    # Process each metadata key-value pair
    for key, value in metadata.items():
        if key == "parent_id":
            # Remap 'parent_id' to a new metadata field 'source_doc'
            filtered_metadata["source_doc"] = value
        elif any(term in key.lower() for term in excluded_keywords):
            # Skip any other keys containing excluded keywords
            continue
        elif isinstance(value, str):
            # Attempt to parse ISO date strings and reformat them
            try:
                dt = datetime.fromisoformat(value)
                filtered_metadata[key] = dt.strftime("%B_%Y")  # e.g., "June_2025"
                continue  # Move to the next metadata item
            except ValueError:
                pass  # Not a date string; leave it for later checks
        if isinstance(value, (str, int, float, bool)):
            # Acceptable scalar metadata
            filtered_metadata[key] = value
        elif isinstance(value, list) and all(isinstance(item, str) for item in value):
            # Acceptable list metadata
            filtered_metadata[key] = value
        else:
            # Report and skip unexpected types
            print(f"Filtered out key: {key}, value: {value} (type: {type(value)})")

    # Validate the filtered metadata to catch any remaining issues
    validate_metadata(filtered_metadata)

    # Build the record to upsert into Pinecone
    record = {
        "id": str(idx),  # Unique string ID per chunk
        "text": doc["chunk_text"],  # Text to embed; the key must match the source field named in the index's field_map
        **filtered_metadata  # Unpack metadata into top-level fields; a nested metadata dict breaks the upsert
    }

    # Print the first record for inspection
    if idx == 0:
        print(f"Printing record: {record}")

    # Add to the list of records to upsert
    records.append(record)

def batch(iterable, batch_size):
    """
    Yield successive batches of the given size from the iterable.
    Pinecone limits batch upsert to 96 records at a time.
    """
    for start in range(0, len(iterable), batch_size):
        yield iterable[start : start + batch_size]

# Upsert records into Pinecone in batches
for batch_records in batch(records, 96):
    # Print a sample record for logging (can remove in production)
    print(json.dumps(records[0], indent=2))
    index.upsert_records(namespace="default", records=batch_records)


Printing record: {'id': '0', 'text': 'About Us', 'detection_class_prob': 0.7652540802955627, 'last_modified': 'June_2025', 'languages': ['eng'], 'page_number': 1}
{
  "id": "0",
  "text": "About Us",
  "detection_class_prob": 0.7652540802955627,
  "last_modified": "June_2025",
  "languages": [
    "eng"
  ],
  "page_number": 1
}

### Explanation for Code Block 4

Now that our index is ready, we load the pre-chunked data and turn it into Pinecone “records.” We read `clean_text_cache.json`, rebuild each chunk’s metadata (remapping `parent_id` to `source_doc`, shortening ISO date strings to a `Month_Year` label, and dropping unsupported types), and then run `validate_metadata`, which prints any field whose type is still unsupported. Each chunk becomes a dictionary with a unique `id`, the actual `text`, and the filtered metadata (e.g., page number, languages). Metadata must not be nested: `**filtered_metadata` unpacks every metadata key into a top-level field of the record, next to `id` and `text`. Because Pinecone limits batch sizes to 96 records, we define a simple `batch()` generator that yields sublists of up to 96 records. We print the first record for sanity-checking, then call `index.upsert_records(namespace="default", records=batch_records)` for each batch to bulk-load our text records into the cloud index, where Pinecone embeds them.

## 5. Recap: Workflow Stages

1. Extract structured content from PDFs using `unstructured`.
2. Slice content into manageable, overlapping chunks with LangChain.
3. Assign metadata and unique IDs to each chunk.
4. Use Pinecone’s integrated embedding to embed and store chunks for RAG or search.
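
To make the record flattening and the 96-record batching from the upsert step concrete, here is a small offline sketch that needs no Pinecone connection. The helper `flatten_record`, the stand-in chunks, and the sample timestamp are illustrative assumptions; `batch` mirrors the notebook's generator, and 519 is the chunk count reported by the chunking cell.

```python
from datetime import datetime

def batch(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def flatten_record(idx, doc):
    """Build one flat record: id, text, then each metadata key at top level."""
    return {"id": str(idx), "text": doc["chunk_text"], **doc.get("metadata", {})}

# 519 stand-in chunks (hypothetical content, realistic count).
docs = [{"chunk_text": f"chunk {i}", "metadata": {"page_number": 1}} for i in range(519)]
records = [flatten_record(i, d) for i, d in enumerate(docs)]

sizes = [len(b) for b in batch(records, 96)]
print(sizes)       # five full batches of 96 plus a final batch of 39
print(records[0])  # metadata keys sit beside id and text, not nested

# ISO timestamps are shortened to a Month_Year label, as in the upsert cell.
label = datetime.fromisoformat("2025-06-15T10:30:00").strftime("%B_%Y")
print(label)
```

Slicing past the end of a list is safe in Python, which is why the last batch simply comes out shorter without any special-case code.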


In [19]:
from openai import OpenAI
client = OpenAI(api_key=OPENAI_API_KEY)

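def query_with_rag(question, top_k=5):
    # Semantic search: the integrated-embedding index embeds the question text itself
    result = index.search(
        namespace="default",  # the namespace the records were upserted into
        query={"inputs": {"text": question}, "top_k": top_k},
    )

    # Build context from the stored text of each hit. The upsert cell stores it
    # under "text", while the index field_map names "chunk_text", so check both.
    context = "\n\n".join(
        hit["fields"].get("chunk_text") or hit["fields"].get("text", "")
        for hit in result["result"]["hits"]
    )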

    # Prompt OpenAI using the new client-based API
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant answering based only on the context provided."
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}"
            }
        ],
        temperature=0
    )

    return response.choices[0].message.content


### Explanation for Code Block 5

This cell defines our RAG (Retrieval-Augmented Generation) query function. We create an **OpenAI Python client** with `OpenAI(api_key=…)`. Inside `query_with_rag`, we first run a semantic search over our embedded chunks in the Pinecone index (`top_k` controls how many hits come back). We then stitch the top matches’ text into a single `context` string. Finally, we prompt GPT-4 with a system message that restricts it to the provided context, followed by a user message containing the context and the question. The temperature is set to 0 to keep answers as repeatable as possible. The function returns the model’s reply, wrapping search and generation in one call.
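
The prompt-assembly half of this function can be exercised without calling Pinecone or OpenAI. In the sketch below, `build_messages` and the sample texts are hypothetical stand-ins; the system prompt and the `Context:` / `Question:` layout are copied from the cell above.

```python
SYSTEM_PROMPT = "You are a helpful assistant answering based only on the context provided."

def build_messages(question, chunk_texts):
    """Join retrieved chunk texts into one context block and wrap it in chat messages."""
    context = "\n\n".join(t for t in chunk_texts if t)  # skip empty hits
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]

# Hypothetical retrieved texts; real ones would come from the Pinecone search results.
retrieved = [
    "Transfer students should send official transcripts.",
    "",
    "Apply by the priority date.",
]
messages = build_messages("How do I send transcripts?", retrieved)
print(messages[1]["content"])
```

Printing the assembled prompt is a quick check when an answer reports that the context lacks the requested information: an empty `Context:` section points at retrieval rather than at the model.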

---

In [16]:
answer = query_with_rag("How advanced does my math need to be to enter as a junior in software engineering into wsu?")
print("Answer:\n", answer)


Answer:
 The context provided does not include specific information about the math requirements for entering as a junior in software engineering at WSU (Washington State University or any other institution referred to as WSU). Please refer to WSU's specific course requirements or contact the university directly for accurate information.


### Explanation for Code Block 6

Here we demonstrate our end-to-end pipeline in action. By calling  