# PyPDFLoader
Extracts text from a PDF one page at a time, returning one document per page.

### Performance Metrics (Loader)

1. Total characters, alphanumeric ratio, and token count indicate extraction volume and noise.
2. Structural preservation is approximated by the average number of newline characters per page.
3. Processing time and memory usage measure efficiency.

# SemanticChunker
Splits text into sentences, then groups neighbouring sentences that are semantically close into coherent chunks.
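
The idea of splitting on sentences and then cutting where neighbouring sentences drift apart can be sketched in a few lines. This is a toy illustration, not LangChain's implementation: the word-count "embedding", the `semantic_chunks` helper, and the fixed `max_distance` threshold are all assumptions made for the example, whereas the real splitter relies on an embedding model.

```python
import math
import re
from collections import Counter


def bow_vector(sentence):
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def semantic_chunks(text, max_distance=0.9):
    """Split into sentences, then start a new chunk wherever two adjacent
    sentences are farther apart than max_distance (a hypothetical threshold)."""
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]
    if not sentences:
        return []
    vectors = [bow_vector(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, nxt, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine_distance(prev, nxt) > max_distance:
            chunks.append(" ".join(current))  # close the current chunk at the breakpoint
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sample = (
    "Caches store hot data in memory. Caches reduce load on the database. "
    "Penguins live in Antarctica. Penguins eat fish."
)
for chunk in semantic_chunks(sample):
    print(chunk)
```

On this sample the two cache sentences share a word and stay together, while the jump to penguins shares none and triggers a break, giving two chunks.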

### Performance Metrics (Splitter)

1. Chunk size variability (coefficient of variation, CV), semantic continuity between adjacent chunks (scored with a CrossEncoder), and metadata correctness.
2. Processing time and memory usage.

In [1]:
# Install LangChain components and transformers for tokenization and semantic evaluation
!pip install langchain langchain-community langchain-text-splitters
!pip install sentence-transformers transformers psutil


In [8]:
!pip install pypdf

Collecting pypdf
  Downloading pypdf-5.4.0-py3-none-any.whl.metadata (7.3 kB)
Downloading pypdf-5.4.0-py3-none-any.whl (302 kB)
Installing collected packages: pypdf
Successfully installed pypdf-5.4.0


In [2]:
import os
import time
import numpy as np
import psutil

from langchain_community.document_loaders import PyPDFLoader
from transformers import AutoTokenizer

In [3]:
# Load a tokenizer to estimate token count. GPT-2's vocabulary differs from GPT-4's,
# so treat the count as a rough approximation rather than an exact GPT-4 figure.
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # Replace with a GPT-4 tokenizer if needed

# Path to your PDF file
pdf_path = "/content/SystemDesign.pdf"


In [4]:
# Initialize the loader
loader = PyPDFLoader(pdf_path)

In [5]:
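# Start timing and memory tracking.
# Run this cell immediately before the load cell: any time spent in cells
# executed in between is included in the reported processing time.
start_time = time.time()
process = psutil.Process(os.getpid())
start_mem = process.memory_info().rss  # baseline resident memory (RSS) in bytes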

In [9]:
# Load documents using PyPDFLoader
docs = loader.load()

# Concatenate full text from all pages
full_text = " ".join([doc.page_content for doc in docs])

In [10]:
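# ------ Loader Performance Metrics ------
total_chars = len(full_text)
alphanumeric_chars = sum(c.isalnum() for c in full_text)
newline_chars = full_text.count('\n')
# Only the number of ids is used, so the tokenizer's max-sequence-length warning is harmless here
token_count = len(tokenizer.encode(full_text))
# Share of alphanumeric characters; whitespace and punctuation count as "noise"
content_to_noise_ratio = alphanumeric_chars / total_chars
# Elapsed time since the timer cell ran, so it also covers the metric computations above
processing_time_loader = time.time() - start_time
# Growth in resident memory (RSS) since the baseline, in MB
mem_used_loader = (psutil.Process().memory_info().rss - start_mem) / (1024 ** 2)
# Average number of newline characters per page
structural_preservation_score = newline_chars / len(docs)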

Token indices sequence length is longer than the specified maximum sequence length for this model (64843 > 1024). Running this sequence through the model will result in indexing errors
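
The warning above only matters if the token ids are fed to the GPT-2 model; here they are just counted. If the warning is distracting, one option is to count tokens page by page and sum the results. The sketch below assumes a hypothetical helper `count_tokens_by_page` and uses a whitespace splitter as a stand-in encoder so it runs without downloading a model; in the notebook the `encode` argument would be `tokenizer.encode`. Per-page totals can differ slightly from encoding the concatenated text, because subword merges across page boundaries are lost.

```python
def count_tokens_by_page(pages, encode):
    """Sum token counts page by page so no single call sees the whole document."""
    return sum(len(encode(page)) for page in pages)


def whitespace_encode(text):
    """Stand-in encoder for illustration only: whitespace splitting, not BPE."""
    return text.split()


pages = ["Load balancers spread traffic.", "Caches cut latency."]
print(count_tokens_by_page(pages, whitespace_encode))  # 4 + 3 = 7
```

In the notebook this would be called as `count_tokens_by_page([d.page_content for d in docs], tokenizer.encode)`.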


In [11]:
loader_metrics = {
    "Total Characters": total_chars,
    "Alphanumeric Characters": alphanumeric_chars,
    "Newline Characters": newline_chars,
    "Token Count": token_count,
    "Content-to-Noise Ratio": round(content_to_noise_ratio, 3),
    "Structural Preservation": round(structural_preservation_score, 3),
    "Processing Time (s)": round(processing_time_loader, 3),
    "Memory Usage (MB)": round(mem_used_loader, 3),
}

print("📊 Loader Metrics:")
for k, v in loader_metrics.items():
    print(f"{k}: {v}")

📊 Loader Metrics:
Total Characters: 283172
Alphanumeric Characters: 229288
Newline Characters: 3552
Token Count: 64843
Content-to-Noise Ratio: 0.81
Structural Preservation: 34.485
Processing Time (s): 38.127
Memory Usage (MB): 67.391
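
Because the timer starts in one cell and stops in another, the processing time above can include work unrelated to loading. A small context manager keeps the measurement tied to a single block of code. This is a sketch under stated assumptions: `measure` is a hypothetical helper, and it swaps psutil's RSS readings for the standard library's `tracemalloc`, which tracks only Python-level allocations, so its memory numbers are not comparable with the psutil figures reported here.

```python
import time
import tracemalloc
from contextlib import contextmanager


@contextmanager
def measure(results, label):
    """Record wall-clock seconds and peak Python-heap MB for the enclosed block."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        results[label] = {"seconds": elapsed, "peak_mb": peak / (1024 ** 2)}


results = {}
with measure(results, "build_list"):
    data = [str(i) for i in range(100_000)]  # stand-in for loader.load()

print(results["build_list"])
```

Wrapping only `docs = loader.load()` in such a block would separate loading cost from tokenization and metric computation.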


In [13]:
!pip install langchain-experimental

Collecting langchain-experimental
  Downloading langchain_experimental-0.3.4-py3-none-any.whl.metadata (1.7 kB)
Downloading langchain_experimental-0.3.4-py3-none-any.whl (209 kB)
Installing collected packages: langchain-experimental
Successfully installed langchain-experimental-0.3.4


In [14]:
from langchain_experimental.text_splitter import SemanticChunker
from sentence_transformers import CrossEncoder

In [15]:
# Initialize semantic coherence model
semantic_model = CrossEncoder("cross-encoder/stsb-roberta-base")


In [19]:
# Import from langchain_community (installed above); `langchain.embeddings` is a deprecated import path
from langchain_community.embeddings import HuggingFaceEmbeddings

# Start splitter timer (the measured time includes downloading and loading the embedding model)
start_time_splitter = time.time()

# Initialize embeddings (requires sentence-transformers)
embeddings = HuggingFaceEmbeddings(model_name="all-mpnet-base-v2")  # Or any other sentence-transformers model

# Semantic chunking
splitter = SemanticChunker(embeddings=embeddings)  # SemanticChunker needs an embeddings object
chunks = splitter.split_documents(docs)

splitter_time = time.time() - start_time_splitter


In [20]:
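# Resident memory growth since the baseline taken before loading, in MB.
# This figure is cumulative: it includes the loaded PDF and the downloaded models, not just the splitter.
splitter_mem = (psutil.Process().memory_info().rss - start_mem) / (1024 ** 2)

# Get chunk texts
chunk_texts = [chunk.page_content for chunk in chunks]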

In [21]:
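# Semantic Coherence Testing: score each pair of adjacent chunks with the
# cross-encoder (higher score = more similar). Inputs longer than the model's
# maximum sequence length are truncated, so very long chunks are only partly compared.
pairs = [(chunk_texts[i], chunk_texts[i + 1]) for i in range(len(chunk_texts) - 1)]
semantic_scores = semantic_model.predict(pairs)  # one batched call instead of one call per pair

# Mean similarity across all adjacent pairs
semantic_flow = float(np.mean(semantic_scores))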

In [22]:
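# Chunk Size Coefficient of Variation (CV): standard deviation of chunk lengths
# (in characters) divided by their mean; 0 means all chunks are the same size
chunk_lengths = [len(c) for c in chunk_texts]
chunk_size_cv = np.std(chunk_lengths) / np.mean(chunk_lengths)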
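
As a quick sanity check on the formula, the CV can be worked out by hand on a few made-up chunk lengths. The sketch below uses the standard library's population standard deviation, which matches NumPy's default `np.std`; the helper name and the sample lengths are hypothetical.

```python
import statistics


def coefficient_of_variation(lengths):
    """Population standard deviation divided by the mean."""
    return statistics.pstdev(lengths) / statistics.fmean(lengths)


# Mean is 200 and the population std is about 81.65, so CV is about 0.408
print(round(coefficient_of_variation([100, 200, 300]), 3))  # 0.408

# Identical chunk sizes give a CV of exactly 0
print(coefficient_of_variation([200, 200, 200]))  # 0.0
```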

In [23]:
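# Metadata accuracy estimation.
# This only checks that each chunk has a `metadata` attribute, which every
# Document has, so it is a presence check rather than a correctness check.
metadata_accuracy = 1.0 if all(hasattr(chunk, "metadata") for chunk in chunks) else 0.0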
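
A stricter variant could check that each chunk actually carries the fields it should have inherited from its page. The sketch below is an assumption about what "correct" means here: it requires the `source` and `page` keys that PyPDFLoader attaches to each page, and it reports the fraction of chunks that have both. `Doc` is a hypothetical stand-in for a LangChain `Document`, and `metadata_accuracy` is a hypothetical helper, so the example runs on its own.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def metadata_accuracy(chunks, required_keys=("source", "page")):
    """Fraction of chunks whose metadata has a non-None value for every required key."""
    if not chunks:
        return 0.0
    ok = sum(
        all(chunk.metadata.get(key) is not None for key in required_keys)
        for chunk in chunks
    )
    return ok / len(chunks)


sample_chunks = [
    Doc("intro", {"source": "/content/SystemDesign.pdf", "page": 0}),
    Doc("orphan", {}),  # lost its metadata
]
print(metadata_accuracy(sample_chunks))  # 0.5
```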

In [24]:
splitter_metrics = {
    "Chunk Size CV": round(chunk_size_cv, 3),
    "Semantic Flow": round(semantic_flow, 3),
    "Metadata Accuracy": metadata_accuracy,
    "Splitter Processing Time (s)": round(splitter_time, 3),
    "Memory Usage (MB)": round(splitter_mem, 3),
}
print("\n📊 Splitter Metrics:")
for k, v in splitter_metrics.items():
    print(f"{k}: {v}")


📊 Splitter Metrics:
Chunk Size CV: 0.759
Semantic Flow: 0.522
Metadata Accuracy: 1.0
Splitter Processing Time (s): 1356.532
Memory Usage (MB): 2385.102
