# UnstructuredPDFLoader
1. Uses the `unstructured` library, which applies layout analysis and heuristics to partition a PDF into elements such as titles, paragraphs, and list items.

2. In many cases preserves more of the visual structure than PyPDFLoader.

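### Performance Metrics (Loader):
1. Total characters, alphanumeric characters, newline count, and token count: Indicate the volume and cleanliness of the extracted text.

2. Content-to-noise ratio: Share of alphanumeric characters in the total character count. Spaces and punctuation count as noise under this definition.

3. Structural preservation: Average number of newlines per loaded document. With `mode="elements"` each document is a single element, not a page.

4. Processing time / memory usage: Wall-clock time and change in process memory (RSS) while loading.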
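
The time and memory measurements can be sketched with the standard library alone. This is a minimal sketch, not the notebook's method: the helper name `measure` is hypothetical, and `tracemalloc` only sees allocations made through Python's allocator, so memory used by native extensions is not counted (the psutil RSS delta used in the notebook covers the whole process instead).

```python
import time
import tracemalloc
from contextlib import contextmanager


@contextmanager
def measure(results: dict):
    """Record elapsed seconds and peak Python allocations (MB) into `results`."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        yield results
    finally:
        results["seconds"] = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        results["peak_mb"] = peak / (1024 ** 2)


# Usage: wrap only the work being measured
stats = {}
with measure(stats):
    data = [str(i) * 10 for i in range(100_000)]  # stand-in for loader.load()
```

Wrapping only the call under test keeps idle time between notebook cells out of the measurement.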

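# RecursiveCharacterTextSplitter
1. Splits text into chunks of up to `chunk_size` characters, trying each separator in priority order and carrying up to `chunk_overlap` characters of context between adjacent chunks.

2. Good for predictable, size-bounded chunking; it does not take semantics into account.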
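
The idea of separator priority can be illustrated with a small pure-Python sketch. This is a simplified illustration, not LangChain's implementation: the function `recursive_split` is hypothetical, it adds no overlap, and it assumes length is measured in characters.

```python
def recursive_split(text, chunk_size, separators):
    """Split on the highest-priority separator, merge pieces greedily,
    and recurse with the remaining separators on pieces that are too long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep = separators[0] if separators else ""
    rest = separators[1:]
    if sep == "":
        # Last resort: hard cuts every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits, keep merging
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece alone is too long: try the next separator
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph.\n\nSecond paragraph is a bit longer than the first."
pieces = recursive_split(sample, chunk_size=30, separators=["\n\n", " ", ""])
# pieces == ["First paragraph.", "Second paragraph is a bit", "longer than the first."]
```

The paragraph break is used first; only the second paragraph, which is too long on its own, falls through to splitting on spaces.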

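### Performance Metrics (Splitter):
1. Chunk Size CV: Coefficient of variation of chunk lengths (standard deviation divided by mean). Lower values mean more uniform chunks.

2. Context redundancy: Intended to estimate how well overlap preserves context. It is approximated here by the fraction of chunks that are exact duplicates of another chunk.

3. Metadata accuracy: Whether metadata is retained in chunks.

4. Speed and memory use: Track processing cost.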
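
As a worked example of the Chunk Size CV, the sketch below uses only the standard library. The function name `chunk_size_cv` is hypothetical; `statistics.pstdev` is the population standard deviation, which matches the default behaviour of `np.std`.

```python
import statistics


def chunk_size_cv(lengths):
    """Coefficient of variation: population std dev divided by mean."""
    mean = statistics.fmean(lengths)
    if mean == 0:
        return 0.0
    return statistics.pstdev(lengths) / mean


uniform = chunk_size_cv([1000, 1000, 1000])  # identical lengths give 0.0
uneven = chunk_size_cv([100, 500, 900])      # about 0.653
```

An empty list of lengths raises `statistics.StatisticsError`, so callers may want to check for that case first.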

In [1]:
# Install LangChain modules and support libraries
!pip install langchain langchain-community langchain-text-splitters "unstructured[pdf]" psutil pypdf pdfminer.six transformers



In [9]:
!pip install pdfminer.six

Collecting pdfminer.six
  Downloading pdfminer_six-20250416-py3-none-any.whl.metadata (4.1 kB)
Downloading pdfminer_six-20250416-py3-none-any.whl (5.6 MB)
Installing collected packages: pdfminer.six
Successfully installed pdfminer.six-20250416


In [2]:
import os
import time
import numpy as np
import psutil

from langchain_community.document_loaders import UnstructuredPDFLoader
from transformers import AutoTokenizer

In [3]:
# Initialize tokenizer to estimate token count (approximating GPT-like models)
tokenizer = AutoTokenizer.from_pretrained("gpt2")



In [4]:
# Path to PDF
pdf_path = "/content/testPDF.pdf"  # Update with your file path

# Initialize loader
loader = UnstructuredPDFLoader(pdf_path, mode="elements")

In [5]:
# Start timing and memory tracking
start_time = time.time()
process = psutil.Process(os.getpid())
start_mem = process.memory_info().rss

In [6]:
# Load document
docs = loader.load()

# Combine the text of all loaded elements (mode="elements" returns one Document per element, not per page)
full_text = " ".join(doc.page_content for doc in docs)



In [7]:
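# ------ Loader Performance Metrics ------
total_chars = len(full_text)
alphanumeric_chars = sum(c.isalnum() for c in full_text)
# Elements were joined with spaces above, so only newlines inside an element are counted
newline_chars = full_text.count('\n')
token_count = len(tokenizer.encode(full_text))
# Guard against an empty extraction (for example, a scanned PDF without OCR)
content_to_noise_ratio = alphanumeric_chars / total_chars if total_chars else 0.0
processing_time_loader = time.time() - start_time
mem_used_loader = (process.memory_info().rss - start_mem) / (1024 ** 2)
# With mode="elements", len(docs) is the number of elements, not pages
structural_preservation_score = newline_chars / len(docs) if docs else 0.0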

In [8]:
loader_metrics = {
    "Total Characters": total_chars,
    "Alphanumeric Characters": alphanumeric_chars,
    "Newline Characters": newline_chars,
    "Token Count": token_count,
    "Content-to-Noise Ratio": round(content_to_noise_ratio, 3),
    "Structural Preservation": round(structural_preservation_score, 3),
    "Processing Time (s)": round(processing_time_loader, 3),
    "Memory Usage (MB)": round(mem_used_loader, 3),
}

print("📊 UnstructuredPDFLoader Metrics:")
for k, v in loader_metrics.items():
    print(f"{k}: {v}")

📊 UnstructuredPDFLoader Metrics:
Total Characters: 1264
Alphanumeric Characters: 1031
Newline Characters: 0
Token Count: 248
Content-to-Noise Ratio: 0.816
Structural Preservation: 0.0
Processing Time (s): 14.813
Memory Usage (MB): 224.25


In [9]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [10]:
# Start time and memory for splitter
start_time_splitter = time.time()
# Initialize RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ".", " ", ""]
)
start_mem_split = psutil.Process().memory_info().rss

In [11]:
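# Split documents
chunks = splitter.split_documents(docs)

# The timer and memory baseline were set in the previous cell, so these figures
# also include any delay between running the two cells
splitter_time = time.time() - start_time_splitter
splitter_mem = (psutil.Process().memory_info().rss - start_mem_split) / (1024 ** 2)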

In [12]:
# Get chunk contents
chunk_texts = [c.page_content for c in chunks]

# Compute Chunk Size CV
chunk_lengths = [len(c) for c in chunk_texts]
chunk_size_cv = np.std(chunk_lengths) / np.mean(chunk_lengths)

In [13]:
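# Rough proxy for context redundancy: fraction of chunks that are exact duplicates.
# This does not measure partial overlap, so 0.0 only means no two chunks are identical.
overlap_estimate = 1 - (len(set(chunk_texts)) / len(chunk_texts))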
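
A more direct way to estimate overlap is to compare each chunk with its successor. The sketch below is one possible approach under stated assumptions, not part of the original notebook: the helper names are hypothetical, and it assumes overlapping text appears verbatim at the end of one chunk and the start of the next. Whitespace stripping by the splitter can break that assumption, and `split_documents` splits each input document separately, so chunks from different elements are not expected to overlap.

```python
def shared_overlap(left: str, right: str) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def mean_overlap_ratio(texts):
    """Average shared overlap between neighbours, relative to the left chunk's length."""
    pairs = list(zip(texts, texts[1:]))
    if not pairs:
        return 0.0
    return sum(shared_overlap(a, b) / len(a) for a, b in pairs if a) / len(pairs)


demo_texts = ["alpha beta gamma", "gamma delta", "epsilon"]
ratio = mean_overlap_ratio(demo_texts)  # (5/16 + 0) / 2 = 0.15625
```

In the notebook, `mean_overlap_ratio(chunk_texts)` could be reported alongside the duplicate-based estimate.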

In [14]:
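# Metadata accuracy: every Document has a `metadata` attribute, so this only
# confirms the attribute exists, not that its contents were carried over
metadata_accuracy = 1.0 if all(hasattr(chunk, "metadata") for chunk in chunks) else 0.0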
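
A stricter check would look at the metadata contents. The following is a minimal sketch, assuming chunks expose a `metadata` dict and that a `source` key is the one worth verifying; the function name `metadata_retention` and the choice of key are assumptions, and the `SimpleNamespace` objects are only stand-ins for LangChain `Document` objects.

```python
from types import SimpleNamespace


def metadata_retention(chunks, required_keys=("source",)):
    """Fraction of chunks whose metadata contains every key in `required_keys`."""
    if not chunks:
        return 0.0
    kept = sum(
        all(key in (chunk.metadata or {}) for key in required_keys)
        for chunk in chunks
    )
    return kept / len(chunks)


# Stand-ins for Document objects, for illustration only
demo = [
    SimpleNamespace(page_content="a", metadata={"source": "testPDF.pdf"}),
    SimpleNamespace(page_content="b", metadata={}),
]
score = metadata_retention(demo)  # 0.5
```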

In [15]:
splitter_metrics = {
    "Chunk Size CV": round(chunk_size_cv, 3),
    "Estimated Overlap/Context Redundancy": round(overlap_estimate, 3),
    "Metadata Accuracy": metadata_accuracy,
    "Splitter Processing Time (s)": round(splitter_time, 3),
    "Memory Usage (MB)": round(splitter_mem, 3),
}

print("\n📊 RecursiveCharacterTextSplitter Metrics:")
for k, v in splitter_metrics.items():
    print(f"{k}: {v}")


📊 RecursiveCharacterTextSplitter Metrics:
Chunk Size CV: 0.654
Estimated Overlap/Context Redundancy: 0.0
Metadata Accuracy: 1.0
Splitter Processing Time (s): 13.383
Memory Usage (MB): 0.0
