### 🌐 UnstructuredHTMLLoader to load HTML content (you can download HTML files and use them locally).

### 🧩 RecursiveCharacterTextSplitter to split the extracted content.

### 📈 Performance metrics evaluated for both the loader and the splitter.

| Component      | Loader                                      | Splitter                         |
| -------------- | ------------------------------------------- | -------------------------------- |
| Tool Used      | `UnstructuredHTMLLoader`                    | `RecursiveCharacterTextSplitter` |
| Strength       | Handles raw HTML with tag parsing           | Structure-agnostic splitting     |
| Weakness       | Doesn’t deeply parse HTML semantic sections | Ignores actual HTML semantics    |
| Metrics        | Character count, Token cost, C:N ratio      | Chunk CV, Processing speed       |
| Coherence Test | Optional STS-B CrossEncoder                 | ✅ Score for continuity           |


In [3]:
# 📘 Notebook: UnstructuredHTMLLoader with RecursiveCharacterTextSplitter
# 🧠 Objective: Load an HTML file using UnstructuredHTMLLoader and analyze performance

# ✅ Step 1: Install necessary packages
!pip install -q langchain unstructured tiktoken psutil transformers langchain_community


In [4]:
# ✅ Step 2: Import required modules
from langchain_community.document_loaders import UnstructuredHTMLLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
import requests, time, psutil, os, re
import numpy as np
import tiktoken

In [5]:
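# ✅ Step 3: Download HTML content from the Wikipedia URL
url = "https://en.wikipedia.org/wiki/Roadside_Picnic"
response = requests.get(url, timeout=30)
response.raise_for_status()  # fail early instead of saving an error page
html_content = response.text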

In [6]:
# Save it locally so UnstructuredHTMLLoader can process it
with open("roadside_picnic.html", "w", encoding="utf-8") as f:
    f.write(html_content)

In [7]:
# ✅ Step 3 (continued): Load the local HTML file and time the loader
# The file saved above sits in the working directory (/content on Colab)
html_path = "roadside_picnic.html"  # 🔁 Replace with your own HTML file if needed
start_time = time.time()
process = psutil.Process(os.getpid())
initial_mem = process.memory_info().rss / 1024 / 1024
loader = UnstructuredHTMLLoader(file_path=html_path)
docs = loader.load()
end_time = time.time()

In [8]:
final_mem = process.memory_info().rss / 1024 / 1024

# ✅ Step 4: Evaluate Loader Performance Metrics
text = "\n".join([doc.page_content for doc in docs])
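The memory figure in this notebook is the change in process RSS, which may also include libraries that get imported lazily during `loader.load()`. As a complementary view, here is a minimal sketch using the standard-library `tracemalloc` module. `measure` is a hypothetical helper, not part of the notebook, and `tracemalloc` only sees allocations made through Python's allocator, so memory used inside C extensions may not be counted.

```python
import time
import tracemalloc


def measure(fn, *args, **kwargs):
    """Run fn once; return (result, elapsed seconds, peak traced MB)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak_bytes / 1024 / 1024


# Stand-in workload so the sketch runs without the HTML file.
data, seconds, peak_mb = measure(lambda: [0] * 100_000)
print(f"{seconds:.4f} s, peak {peak_mb:.2f} MB")
```

With the loader from the cell above, the call would look like `docs, seconds, peak_mb = measure(loader.load)`.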

In [9]:
def count_tokens(text):
    enc = tiktoken.get_encoding("cl100k_base")
    return len(enc.encode(text))

def content_to_noise(text):
    alphanum = len(re.findall(r'\w', text))
    total_chars = len(text)
    return round(alphanum / total_chars, 4) if total_chars > 0 else 0
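To make the content-to-noise ratio concrete, the sketch below reruns the same definition on a few toy strings (the sample strings are illustrative, not from the notebook). Note that `\w` also matches underscores and Unicode letters, so "alphanumeric" is a slight simplification.

```python
import re


def content_to_noise(text):
    # Same definition as the cell above: share of characters matching \w.
    word_chars = len(re.findall(r"\w", text))
    return round(word_chars / len(text), 4) if text else 0


print(content_to_noise("Roadside Picnic"))  # 14 of 15 characters match \w
print(content_to_noise("a_b - c!"))         # the underscore counts as content
print(content_to_noise("\n\n\n"))           # whitespace only
```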

In [10]:
loader_metrics = {
    "Total Character Count": len(text),
    "Alphanumeric Character Count": len(re.findall(r'\w', text)),
    "Newline Character Count": text.count("\n"),
    "Token Count (GPT-4 Encoding)": count_tokens(text),
    "Content-to-Noise Ratio": content_to_noise(text),
    "Processing Time (sec)": round(end_time - start_time, 2),
    "Memory Usage (MB)": round(final_mem - initial_mem, 2),
    "Structural Preservation": "⚠️ Text only: HTML tags are not kept in the default mode"
}

In [11]:
print("🔍 Loader Performance Metrics (UnstructuredHTMLLoader):")
for k, v in loader_metrics.items():
    print(f"{k}: {v}")

🔍 Loader Performance Metrics (UnstructuredHTMLLoader):
Total Character Count: 26470
Alphanumeric Character Count: 20980
Newline Character Count: 354
Token Count (GPT-4 Encoding): 6543
Content-to-Noise Ratio: 0.7926
Processing Time (sec): 5.78
Memory Usage (MB): 326.88
Structural Preservation: ⚠️ Text only: HTML tags are not kept in the default mode
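A few derived figures help when reading these numbers. The sketch below only does arithmetic on the values printed above; the variable names are illustrative, and the tokens-per-chunk figure is a rough average estimate rather than a guarantee for any single chunk.

```python
# Values copied from the loader output above.
total_chars = 26470
alnum_chars = 20980
newlines = 354
tokens = 6543

chars_per_token = total_chars / tokens      # average characters per token
content_ratio = alnum_chars / total_chars   # matches the reported C:N ratio
newline_share = newlines / total_chars      # fraction of text that is "\n"

# Rough token estimate for a 512-character chunk at this average density.
est_tokens_per_512_chars = 512 / chars_per_token

print(f"chars per token: {chars_per_token:.2f}")
print(f"content ratio:   {content_ratio:.4f}")
print(f"newline share:   {newline_share:.2%}")
print(f"~tokens per 512 chars: {est_tokens_per_512_chars:.0f}")
```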


In [12]:
# ✅ Step 5: Use RecursiveCharacterTextSplitter to split the text
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", " ", ""]
)
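The `separators` list is tried in order: paragraph breaks first, then line breaks, then spaces, then individual characters. A minimal pure-Python sketch of that fallback idea follows. It is an illustration only, not LangChain's implementation: the hypothetical `fallback_split` helper drops the separators and does not merge small pieces back together or apply `chunk_overlap`, both of which the real splitter handles.

```python
def fallback_split(text, separators, chunk_size):
    """Return pieces of at most chunk_size characters (hypothetical helper)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    first, remaining = separators[0], separators[1:]
    pieces = []
    for part in text.split(first):
        # Parts that are still too long fall through to the next separator.
        pieces.extend(fallback_split(part, remaining, chunk_size))
    return pieces


sample = "First paragraph here.\n\nSecond one\nwith two lines."
for piece in fallback_split(sample, ["\n\n", "\n", " ", ""], chunk_size=12):
    print(repr(piece))
```

In this toy run, "Second one" fits within 12 characters and survives as a whole line, while the longer lines fall back to word-level splitting.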

In [13]:
split_start = time.time()
split_docs = splitter.split_documents(docs)
split_end = time.time()

In [14]:
chunks = [doc.page_content for doc in split_docs]
chunk_lengths = [len(chunk) for chunk in chunks]
chunk_tokens = [count_tokens(chunk) for chunk in chunks]

def chunk_size_cv(lengths):
    mean = np.mean(lengths)
    std = np.std(lengths)
    return round(std / mean, 4) if mean > 0 else 0
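The chunk size CV is the coefficient of variation: standard deviation divided by mean, so 0 means every chunk has the same length. The sketch below reproduces the same calculation with the standard library on toy lengths (the sample lists are illustrative). `np.std` defaults to the population standard deviation, which is why `statistics.pstdev` is the matching stdlib call.

```python
from statistics import mean, pstdev


def chunk_size_cv(lengths):
    # Population standard deviation over mean, as in the NumPy version above.
    m = mean(lengths)
    return round(pstdev(lengths) / m, 4) if m > 0 else 0


print(chunk_size_cv([400, 400, 400]))  # identical chunks
print(chunk_size_cv([100, 300]))       # mean 200, std 100
```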

In [15]:
split_metrics = {
    "Total Chunks": len(chunks),
    "Avg Chunk Size (chars)": round(np.mean(chunk_lengths), 2),
    "Chunk Size CV": chunk_size_cv(chunk_lengths),
    "Token Range": f"{min(chunk_tokens)} - {max(chunk_tokens)}",
    "Processing Speed (MB/s)": round((len(text)/1024/1024) / (split_end - split_start), 4),
    "Memory Efficiency": "✔️ Efficient for plain text extraction",
    "Metadata Accuracy": "❌ No semantic metadata maintained"
}

In [16]:
print("\n📊 Splitter Performance Metrics (RecursiveCharacterTextSplitter):")
for k, v in split_metrics.items():
    print(f"{k}: {v}")


📊 Splitter Performance Metrics (RecursiveCharacterTextSplitter):
Total Chunks: 74
Avg Chunk Size (chars): 369.47
Chunk Size CV: 0.4444
Token Range: 5 - 197
Processing Speed (MB/s): 7.6864
Memory Efficiency: ✔️ Efficient for plain text extraction
Metadata Accuracy: ❌ No semantic metadata maintained
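The overview table lists an optional STS-B CrossEncoder coherence test that the cells above do not run. One possible sketch is given below, under several assumptions: `adjacent_coherence`, `load_stsb_scorer`, and `word_overlap` are hypothetical helpers, the `sentence-transformers` package is not part of the install line in Step 1, and the model name `cross-encoder/stsb-roberta-base` is one choice among several. The toy `word_overlap` scorer is there only so the sketch runs without downloading a model.

```python
from statistics import mean


def adjacent_coherence(chunks, score_pair):
    """Average similarity between each chunk and the one that follows it."""
    pairs = list(zip(chunks, chunks[1:]))
    if not pairs:
        return None
    return round(mean(score_pair(a, b) for a, b in pairs), 4)


def load_stsb_scorer(model_name="cross-encoder/stsb-roberta-base"):
    # Assumption: sentence-transformers is installed (pip install sentence-transformers).
    from sentence_transformers import CrossEncoder
    model = CrossEncoder(model_name)
    return lambda a, b: float(model.predict([(a, b)])[0])


def word_overlap(a, b):
    # Toy stand-in scorer: Jaccard overlap of lowercase word sets.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


print(adjacent_coherence(["the zone", "the zone is dangerous", "stalkers enter"], word_overlap))
```

To score the real chunks, the call would be `adjacent_coherence(chunks, load_stsb_scorer())`, which downloads the model on first use.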
