# RAG-Based HTML Information Extraction

This notebook implements a Retrieval-Augmented Generation (RAG) system to extract structured information from HTML documents.

## System Architecture
1. **Chunking**: Split HTML into semantic chunks (tags, sections)
2. **Embedding**: Use `nomic-ai/nomic-embed-code-GGUF` for code-aware embeddings
3. **Retrieval**: FAISS vector store for efficient similarity search
4. **Generation**: LM Studio hosted model to generate structured JSON output

## Scenarios
- **Scenario 1**: E-commerce (books) - extract name and price
- **Scenario 2**: Job listings - extract title, location, salary, company
- **Scenario 3**: Club listings - extract names, logo links, websites
- **Scenario 4**: Hidden information - extract property details, coordinates


In [1]:
import os
import json
import re
import time
from pathlib import Path
from typing import List, Dict, Any, Optional
from dataclasses import dataclass

import httpx
import numpy as np
from bs4 import BeautifulSoup
import faiss
from llama_cpp import Llama

# Paths
DATA_DIR = Path("../data/html").resolve()
EMBEDDING_MODEL_PATH = Path.home() / ".cache/nomic-embed-code-v1.5.Q4_K_M.gguf"  # Adjust path

# LM Studio configuration
LMSTUDIO_BASE_URL = os.getenv("LMSTUDIO_BASE_URL", "http://localhost:1234/v1")
LMSTUDIO_MODEL = os.getenv("LMSTUDIO_MODEL", "qwen2.5-7b-instruct-1m")
LMSTUDIO_API_KEY = os.getenv("LMSTUDIO_API_KEY", "lm-studio")


## 1. HTML Chunking Strategy

We'll treat each container element (for example `div`, `article`, `li`, `tr`) as one chunk and record its ancestor tag path and attributes as metadata, so the HTML structure survives the split.
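
To make the idea concrete before the full implementation, here is a minimal sketch that uses only the standard library's `html.parser` instead of BeautifulSoup. The `TinyChunker` class, its reduced tag set, and the sample markup are hypothetical and exist only for illustration; the 20-character minimum mirrors the threshold used in `chunk_html`.

```python
from html.parser import HTMLParser


class TinyChunker(HTMLParser):
    """Hypothetical stand-in for chunk_html: one chunk per container tag."""

    CONTAINERS = {"article", "div", "section", "li", "tr", "p"}
    VOID = {"br", "hr", "img", "input", "link", "meta"}  # tags with no closing tag

    def __init__(self, min_text_len=20):
        super().__init__()
        self.min_text_len = min_text_len
        self.stack = []        # currently open tags, outermost first
        self.open_chunks = []  # (stack depth, ancestor path, collected text parts)
        self.chunks = []       # finished (ancestor path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        if tag in self.CONTAINERS:
            self.open_chunks.append((len(self.stack) + 1, " > ".join(self.stack), []))
        self.stack.append(tag)

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Text belongs to every container that is still open around it
            for _, _, parts in self.open_chunks:
                parts.append(text)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # ignore stray or void closing tags
        while self.stack:
            depth = len(self.stack)
            closed = self.stack.pop()
            if self.open_chunks and self.open_chunks[-1][0] == depth:
                _, path, parts = self.open_chunks.pop()
                text = " ".join(parts)
                if len(text) >= self.min_text_len:  # skip very short chunks
                    self.chunks.append((path, text))
            if closed == tag:
                break


# Made-up sample markup, not taken from the scenario files
sample = (
    "<html><body>"
    "<div class='book'><p>Example Book Title, price 12.50</p></div>"
    "<p>short</p>"
    "</body></html>"
)
chunker = TinyChunker()
chunker.feed(sample)
for path, text in chunker.chunks:
    print(f"{path} | {text}")
```

Note that a nested container and its parent both become chunks, which is also how `chunk_html` behaves: the same text can appear in several overlapping chunks.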


In [2]:
@dataclass
class HTMLChunk:
    """Represents a chunk of HTML with metadata."""
    content: str
    tag_path: str  # e.g., "html > body > div.container > article"
    attributes: Dict[str, str]
    chunk_id: int
    

def extract_tag_path(element) -> str:
    """Build a CSS-like path for an element."""
    path_parts = []
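    for parent in element.parents:
        # Skip the BeautifulSoup root object, whose name is "[document]"
        if parent.name is None or parent.name == '[document]':
            continue
        name = parent.name
        if parent.get('class'):
            name += f".{parent.get('class')[0]}"
        elif parent.get('id'):
            name += f"#{parent.get('id')}"
        path_parts.append(name)
    # element.parents yields the nearest ancestor first, so keep the first five
    # entries and reverse them to read outermost-to-innermost
    return " > ".join(reversed(path_parts[:5]))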


def chunk_html(html_content: str, max_chunk_size: int = 1000) -> List[HTMLChunk]:
    """Chunk HTML into semantic units.
    
    Strategy:
    - Extract meaningful containers (divs, articles, sections, li, tr, etc.)
    - Include attributes (class, id, data-*) as metadata
    - Keep chunks small enough for embedding but large enough for context
    """
    soup = BeautifulSoup(html_content, 'lxml')
    chunks = []
    chunk_id = 0

    # Expanded set of target tags for more thorough HTML extraction
    target_tags = [
        'article', 'div', 'section', 'li', 'tr', 'dl', 'aside', 'main', 'header',
        'footer', 'nav', 'table', 'thead', 'tbody', 'tfoot', 'ul', 'ol', 'dt',
        'dd', 'figure', 'figcaption', 'form', 'fieldset', 'legend', 'h1', 'h2',
        'h3', 'h4', 'h5', 'h6', 'pre', 'code', 'blockquote', 'address', 'summary',
        'details', 'p'
    ]

    for tag_name in target_tags:
        elements = soup.find_all(tag_name)
        for elem in elements:
            text = elem.get_text(separator=' ', strip=True)

            # Skip empty or very short chunks
            if len(text) < 20:
                continue

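            # Get the element's outer HTML, truncated to max_chunk_size characters
            # (truncation can cut a tag in half; the LLM only sees this as text)
            content = str(elem)[:max_chunk_size]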

            # Extract attributes
            attrs = {k: ' '.join(v) if isinstance(v, list) else v
                     for k, v in elem.attrs.items()}

            # Build tag path
            tag_path = extract_tag_path(elem)

            chunks.append(HTMLChunk(
                content=content,
                tag_path=tag_path,
                attributes=attrs,
                chunk_id=chunk_id
            ))
            chunk_id += 1

    # Also extract script tags with JSON data
    for script in soup.find_all('script', type='application/json'):
        if script.string and len(script.string) > 50:
            chunks.append(HTMLChunk(
                content=script.string[:max_chunk_size],
                tag_path="script[type=application/json]",
                attributes=script.attrs,
                chunk_id=chunk_id
            ))
            chunk_id += 1

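    # Extract inline script data (like __NEXT_DATA__). JSON scripts were already
    # handled above, so skip them here to avoid duplicate chunks.
    for script in soup.find_all('script', id=True):
        if script.get('type') == 'application/json':
            continue
        if script.string and ('{' in script.string or '[' in script.string):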
            chunks.append(HTMLChunk(
                content=script.string[:max_chunk_size],
                tag_path=f"script#{script.get('id')}",
                attributes=script.attrs,
                chunk_id=chunk_id
            ))
            chunk_id += 1

    return chunks


## 2. Embedding Model Setup

We'll load the `nomic-embed-code` GGUF model through llama.cpp so that the embeddings reflect markup and code structure, not just the visible text.


In [3]:
class NomicEmbedder:
    """Wrapper for nomic-embed-code using llama.cpp."""
    
    def __init__(self, model_path: Path, embedding_dim: int = 768):
        """Initialize the embedding model.
        
        Args:
            model_path: Path to the GGUF model file
            embedding_dim: Expected embedding dimension (defaults to 768;
                verify this against the model you actually load)
        """
        if not model_path.exists():
            raise FileNotFoundError(
                f"Embedding model not found at {model_path}. "
                f"Download from: https://huggingface.co/nomic-ai/nomic-embed-code-GGUF"
            )
        
        self.model = Llama(
            model_path=str(model_path),
            embedding=True,
            n_ctx=2048,
            n_batch=512,
            verbose=False
        )
        self.embedding_dim = embedding_dim
    
    def embed_text(self, text: str) -> np.ndarray:
        """Generate embedding for a single text."""
        # Task prefix for indexed documents; confirm the expected prefix on the
        # model card of the model you load
        prefixed_text = f"search_document: {text}"
        embedding = self.model.embed(prefixed_text)
        return np.array(embedding, dtype=np.float32)
    
    def embed_batch(self, texts: List[str]) -> np.ndarray:
        """Generate embeddings for multiple texts."""
        embeddings = []
        for text in texts:
            embeddings.append(self.embed_text(text))
        return np.vstack(embeddings)
    
    def embed_query(self, query: str) -> np.ndarray:
        """Generate embedding for a search query."""
        prefixed_query = f"search_query: {query}"
        embedding = self.model.embed(prefixed_query)
        return np.array(embedding, dtype=np.float32)


## 3. Vector Store and Retrieval

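FAISS provides the similarity search over the embedded chunks. Vectors are L2-normalized before they go into an inner-product index, so the returned scores are cosine similarities.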
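
As a rough illustration of what an exact inner-product search over normalized vectors computes, the sketch below does the same ranking in plain Python. The function names and the two-dimensional toy vectors are hypothetical; the notebook itself uses FAISS for this step.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k_by_inner_product(query, vectors, k):
    """Brute-force search: score every vector, then keep the k best."""
    q = l2_normalize(query)
    scored = []
    for idx, vec in enumerate(vectors):
        v = l2_normalize(vec)
        # For unit vectors the inner product equals the cosine similarity
        scored.append((sum(a * b for a, b in zip(q, v)), idx))
    scored.sort(reverse=True)
    return scored[:k]


toy_vectors = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
results = top_k_by_inner_product([2.0, 0.0], toy_vectors, k=2)
for score, idx in results:
    print(f"vector {idx}: cosine similarity {score:.3f}")
```

Because of the normalization, vector length does not affect the ranking: only the direction of each vector relative to the query matters.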


In [4]:
class HTMLVectorStore:
    """Vector store for HTML chunks using FAISS."""
    
    def __init__(self, embedder: NomicEmbedder):
        self.embedder = embedder
        self.index: Optional[faiss.Index] = None
        self.chunks: List[HTMLChunk] = []
        self.dimension = embedder.embedding_dim
    
    def build_index(self, html_content: str, chunk_size: int = 1000):
        """Build FAISS index from HTML content."""
        print(f"Chunking HTML (target size: {chunk_size})...")
        self.chunks = chunk_html(html_content, max_chunk_size=chunk_size)
        print(f"Created {len(self.chunks)} chunks")
        
        if not self.chunks:
            raise ValueError("No chunks created from HTML")
        
        # Extract text for embedding
        print("Generating embeddings...")
        texts_to_embed = []
        for chunk in self.chunks:
            # Combine content with metadata for richer embeddings
            text = f"{chunk.tag_path}\n"
            if chunk.attributes:
                attrs_str = " ".join([f"{k}={v}" for k, v in chunk.attributes.items()])
                text += f"Attributes: {attrs_str}\n"
            text += chunk.content
            texts_to_embed.append(text[:2000])  # Limit for embedding
        
        embeddings = self.embedder.embed_batch(texts_to_embed)
        
        # Build FAISS index
        print("Building FAISS index...")
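        # Use the width the model actually produced, in case it differs from the
        # embedding_dim passed to the embedder
        self.dimension = embeddings.shape[1]
        self.index = faiss.IndexFlatIP(self.dimension)  # Inner product (cosine similarity)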
        
        # Normalize embeddings for cosine similarity
        faiss.normalize_L2(embeddings)
        self.index.add(embeddings)
        
        print(f"Index built with {self.index.ntotal} vectors")
    
    def retrieve(self, query: str, top_k: int = 5) -> List[Dict[str, Any]]:
        """Retrieve top-k most relevant chunks for a query."""
        if self.index is None:
            raise ValueError("Index not built. Call build_index first.")
        
        # Embed query
        query_embedding = self.embedder.embed_query(query).reshape(1, -1)
        faiss.normalize_L2(query_embedding)
        
        # Search
        scores, indices = self.index.search(query_embedding, top_k)
        
        # Format results
        results = []
        for score, idx in zip(scores[0], indices[0]):
            # FAISS pads with -1 when fewer than top_k vectors exist
            if 0 <= idx < len(self.chunks):
                chunk = self.chunks[idx]
                results.append({
                    "content": chunk.content,
                    "tag_path": chunk.tag_path,
                    "attributes": chunk.attributes,
                    "score": float(score),
                    "chunk_id": chunk.chunk_id
                })
        
        return results


## 4. LLM Generation

Use LM Studio to generate structured JSON from retrieved chunks.
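
Models often wrap their JSON in a markdown fence or surround it with prose, so the reply has to be cleaned before parsing. The sketch below is a standalone variant of that idea with two sample replies; `parse_model_json` and the sample strings are hypothetical and are not the notebook's own helper.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way so the example can sit inside a fenced block


def parse_model_json(reply: str):
    """Return the first JSON value found in a model reply (hypothetical helper)."""
    fenced = re.search(FENCE + r"(?:json)?\s*([\s\S]+?)" + FENCE, reply)
    candidate = fenced.group(1).strip() if fenced else reply.strip()
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        pass
    # Fall back to the widest bracketed span in the raw reply
    for opener, closer in (("[", "]"), ("{", "}")):
        start, end = reply.find(opener), reply.rfind(closer)
        if 0 <= start < end:
            try:
                return json.loads(reply[start:end + 1])
            except json.JSONDecodeError:
                continue
    raise ValueError(f"Could not extract JSON from: {reply[:80]!r}")


# Made-up replies in the two shapes that commonly occur
fenced_reply = "Here you go:\n" + FENCE + 'json\n[{"name": "Book A", "price": 12.5}]\n' + FENCE
prose_reply = 'The result is {"title": "Engineer", "salary": null} as requested.'

books = parse_model_json(fenced_reply)
job = parse_model_json(prose_reply)
print(books)
print(job)
```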


In [5]:
def lmstudio_chat(
    messages: List[Dict[str, str]], 
    model: Optional[str] = None,
    temperature: float = 0.1, 
    max_tokens: int = 2048
) -> str:
    """Call LM Studio's OpenAI-compatible chat endpoint."""
    base = LMSTUDIO_BASE_URL.rstrip("/")
    if not base.endswith("/v1"):
        base = base + "/v1"
    url = f"{base}/chat/completions"
    
    headers = {"Authorization": f"Bearer {LMSTUDIO_API_KEY}"}
    payload = {
        "model": model or LMSTUDIO_MODEL,
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "stream": False,
    }
    
    with httpx.Client(timeout=180) as client:
        resp = client.post(url, headers=headers, json=payload)
        resp.raise_for_status()
        data = resp.json()
    
    return data["choices"][0]["message"]["content"]


def extract_json_from_response(response: str) -> Any:
    """Extract JSON from LLM response, handling markdown code blocks."""
    # Try to find JSON in code blocks
    match = re.search(r"```(?:json)?\s*([\s\S]+?)```", response)
    if match:
        json_str = match.group(1).strip()
    else:
        json_str = response.strip()
    
    try:
        return json.loads(json_str)
    except json.JSONDecodeError:
        # Try to find JSON object/array in the text
        for pattern in [r'\{[\s\S]+\}', r'\[[\s\S]+\]']:
            match = re.search(pattern, response)
            if match:
                try:
                    return json.loads(match.group(0))
                except json.JSONDecodeError:
                    continue
        raise ValueError(f"Could not extract JSON from response: {response[:200]}")


## 5. RAG Pipeline

Combine retrieval and generation for end-to-end extraction.
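
The generation step retries when the model's reply is not valid JSON, feeding the parse error back as a follow-up message. The sketch below isolates that loop with a stubbed model so it runs without LM Studio; `generate_with_retries`, `fake_llm`, and the canned replies are hypothetical.

```python
import json


def generate_with_retries(ask, messages, max_retries=2):
    """Call `ask` until its reply parses as JSON or the retries run out.

    `ask` is any callable mapping a message list to a reply string
    (a stand-in for the real LLM call).
    """
    history = list(messages)  # copy so the caller's list is not mutated
    last_error = None
    for attempt in range(max_retries + 1):
        reply = ask(history)
        try:
            return {"success": True, "data": json.loads(reply), "attempts": attempt + 1}
        except json.JSONDecodeError as exc:
            last_error = str(exc)
            # Show the model its own reply plus the error before asking again
            history.append({"role": "assistant", "content": reply})
            history.append({
                "role": "user",
                "content": f"Invalid JSON ({exc}). Return ONLY valid JSON.",
            })
    return {"success": False, "error": last_error, "attempts": max_retries + 1}


# Stub model: the first reply is prose, the second is valid JSON
canned_replies = iter(["Sure! Here are the books.", '[{"name": "Book A", "price": 12.5}]'])


def fake_llm(history):
    return next(canned_replies)


outcome = generate_with_retries(fake_llm, [{"role": "user", "content": "Extract all books"}])
print(outcome)
```

Returning a result dictionary with an explicit `success` flag in every branch keeps downstream code from having to guess which keys are present.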


In [6]:
class RAGHTMLExtractor:
    """RAG-based HTML information extraction system."""
    
    def __init__(self, embedder: NomicEmbedder):
        self.embedder = embedder
        self.vector_store = HTMLVectorStore(embedder)
    
    def index_html(self, html_content: str, chunk_size: int = 1000):
        """Index HTML content for retrieval."""
        self.vector_store.build_index(html_content, chunk_size=chunk_size)
    
    def extract(
        self, 
        query: str, 
        top_k: int = 8,
        temperature: float = 0.1,
        max_retries: int = 2
    ) -> Dict[str, Any]:
        """Extract structured information using RAG.
        
        Args:
            query: Natural language query describing what to extract
            top_k: Number of chunks to retrieve
            temperature: LLM temperature
            max_retries: Number of retries if JSON parsing fails
            
        Returns:
            Dictionary with extracted information
        """
        # Retrieve relevant chunks
        print(f"\nRetrieving top-{top_k} chunks for query: {query}")
        retrieved_chunks = self.vector_store.retrieve(query, top_k=top_k)
        
        if not retrieved_chunks:
            return {"success": False, "error": "No relevant chunks found", "data": [], "query": query}
        
        # Show retrieval results
        print("\nTop retrieved chunks:")
        for i, chunk in enumerate(retrieved_chunks[:3], 1):
            print(f"  {i}. Score: {chunk['score']:.3f} | Path: {chunk['tag_path'][:60]}")
        
        # Build context from retrieved chunks
        context_parts = []
        for i, chunk in enumerate(retrieved_chunks, 1):
            context_parts.append(f"--- Chunk {i} (score: {chunk['score']:.3f}) ---")
            context_parts.append(f"Path: {chunk['tag_path']}")
            if chunk['attributes']:
                attrs = ', '.join([f"{k}={v}" for k, v in list(chunk['attributes'].items())[:3]])
                context_parts.append(f"Attributes: {attrs}")
            context_parts.append(f"Content:\n{chunk['content'][:800]}")
            context_parts.append("")
        
        context = "\n".join(context_parts)
        
        # Build prompt
        system_prompt = """You are an expert at extracting structured information from HTML.
Given HTML chunks retrieved for a specific query, extract the requested information and return it as valid JSON.

Rules:
- Return ONLY valid JSON (array or object)
- Extract ALL items found in the chunks
- Use null for missing fields
- Be precise with data types (numbers as numbers, not strings)
- For prices/salaries, extract numeric values when possible
- Do not truncate or limit results unless explicitly requested
"""
        
        user_prompt = f"""Query: {query}

Retrieved HTML chunks:
{context}

Extract the requested information and return as JSON. If the query asks for multiple items, return an array. Each object should have clear field names matching the query."""
        
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]
        
        # Generate with retries
        response = ""  # Stays empty if the request fails before any reply arrives
        for attempt in range(max_retries + 1):
            try:
                print(f"\nGenerating response (attempt {attempt + 1}/{max_retries + 1})...")
                response = lmstudio_chat(messages, temperature=temperature)
                
                # Extract and parse JSON
                result = extract_json_from_response(response)
                
                print(f"✓ Successfully extracted {len(result) if isinstance(result, list) else 1} item(s)")
                return {
                    "success": True,
                    "data": result,
                    "query": query,
                    "chunks_used": len(retrieved_chunks)
                }
                
            except Exception as e:
                print(f"✗ Attempt {attempt + 1} failed: {e}")
                if attempt < max_retries:
                    # Add feedback for retry
                    messages.append({"role": "assistant", "content": response})
                    messages.append({
                        "role": "user", 
                        "content": f"The response was not valid JSON. Error: {e}. Please return ONLY valid JSON."
                    })
                else:
                    return {
                        "success": False,
                        "error": str(e),
                        "raw_response": response,
                        "query": query
                    }
    
    def extract_from_file(
        self, 
        html_path: Path, 
        query: str,
        top_k: int = 8,
        chunk_size: int = 1000
    ) -> Dict[str, Any]:
        """Extract information from an HTML file."""
        print(f"\n{'='*80}")
        print(f"Processing: {html_path.name}")
        print(f"Query: {query}")
        print(f"{'='*80}")
        
        # Load HTML
        html_content = html_path.read_text(encoding='utf-8', errors='ignore')
        
        # Index
        self.index_html(html_content, chunk_size=chunk_size)
        
        # Extract
        result = self.extract(query, top_k=top_k)
        
        return result


## 6. Initialize System

Load the embedding model and create the RAG extractor.


In [7]:
# Initialize embedder
print("Loading embedding model...")
try:
    embedder = NomicEmbedder(EMBEDDING_MODEL_PATH)
    print(f"✓ Loaded nomic-embed-code from {EMBEDDING_MODEL_PATH}")
except FileNotFoundError as e:
    print(f"✗ {e}")
    print("\nTo download the model:")
    print("  huggingface-cli download nomic-ai/nomic-embed-code-GGUF nomic-embed-code-v1.5.Q4_K_M.gguf --local-dir ~/.cache/")
    raise

# Create RAG extractor
rag_extractor = RAGHTMLExtractor(embedder)
print("✓ RAG system initialized")


Loading embedding model...
✗ Embedding model not found at /Users/ardyh/.cache/nomic-embed-code-v1.5.Q4_K_M.gguf. Download from: https://huggingface.co/nomic-ai/nomic-embed-code-GGUF

To download the model:
  huggingface-cli download nomic-ai/nomic-embed-code-GGUF nomic-embed-code-v1.5.Q4_K_M.gguf --local-dir ~/.cache/


FileNotFoundError: Embedding model not found at /Users/ardyh/.cache/nomic-embed-code-v1.5.Q4_K_M.gguf. Download from: https://huggingface.co/nomic-ai/nomic-embed-code-GGUF

## 7. Test Scenarios

Run the extraction on all test scenarios.


### Scenario 1: E-commerce Book Store


In [None]:
scenario1_result = rag_extractor.extract_from_file(
    html_path=DATA_DIR / "scenario1_books.html",
    query="Extract all books with their name and price",
    top_k=10,
    chunk_size=800
)

if scenario1_result["success"]:
    print("\n📚 Sample extracted books:")
    print(json.dumps(scenario1_result["data"][:3], indent=2))
    print(f"\nTotal books extracted: {len(scenario1_result['data'])}")
else:
    print(f"\n✗ Extraction failed: {scenario1_result['error']}")


### Scenario 2: Job Listings


In [None]:
scenario2_result = rag_extractor.extract_from_file(
    html_path=DATA_DIR / "scenario2_jobs.html",
    query="Extract job title, location, salary, and company name from all job listings",
    top_k=12,
    chunk_size=1200
)

if scenario2_result["success"]:
    print("\n💼 Sample extracted jobs:")
    print(json.dumps(scenario2_result["data"][:3], indent=2))
    print(f"\nTotal jobs extracted: {len(scenario2_result['data'])}")
else:
    print(f"\n✗ Extraction failed: {scenario2_result['error']}")


### Scenario 3: Club Listings


In [None]:
scenario3_result = rag_extractor.extract_from_file(
    html_path=DATA_DIR / "scenario3_clubs.html",
    query="Get the club names, logo image links and their official websites",
    top_k=10,
    chunk_size=1000
)

if scenario3_result["success"]:
    print("\n⚽ Sample extracted clubs:")
    print(json.dumps(scenario3_result["data"][:3], indent=2))
    print(f"\nTotal clubs extracted: {len(scenario3_result['data'])}")
else:
    print(f"\n✗ Extraction failed: {scenario3_result['error']}")


### Scenario 4: Property Details (Hidden Information)


In [None]:
scenario4_result = rag_extractor.extract_from_file(
    html_path=DATA_DIR / "scenario4_property.html",
    query="Return the property name, address, latitude and longitude",
    top_k=8,
    chunk_size=1500
)

if scenario4_result["success"]:
    print("\n🏠 Extracted property details:")
    print(json.dumps(scenario4_result["data"], indent=2))
else:
    print(f"\n✗ Extraction failed: {scenario4_result['error']}")


## 8. Results Summary


In [None]:
def count_items(result: Dict[str, Any]) -> int:
    """Count extracted items; a failed run counts as zero."""
    if not result.get("success"):
        return 0
    data = result.get("data")
    return len(data) if isinstance(data, list) else 1


scenario_results = {
    "Scenario 1 (Books)": scenario1_result,
    "Scenario 2 (Jobs)": scenario2_result,
    "Scenario 3 (Clubs)": scenario3_result,
    "Scenario 4 (Property)": scenario4_result,
}

results_summary = {
    name: {
        "success": result.get("success", False),
        "items_extracted": count_items(result),
        "chunks_used": result.get("chunks_used", 0),
    }
    for name, result in scenario_results.items()
}

print("\n" + "="*80)
print("RESULTS SUMMARY")
print("="*80)
print(json.dumps(results_summary, indent=2))

# Save results
output_dir = Path("../generated").resolve()
output_dir.mkdir(exist_ok=True)

output_file = output_dir / f"rag_extraction_results_{int(time.time())}.json"
with open(output_file, 'w') as f:
    json.dump({
        "scenario1": scenario1_result,
        "scenario2": scenario2_result,
        "scenario3": scenario3_result,
        "scenario4": scenario4_result,
        "summary": results_summary
    }, f, indent=2)

print(f"\n✓ Results saved to: {output_file}")


## 9. Interactive Extraction

Test custom queries.


In [None]:
def interactive_extract(scenario_file: str, custom_query: str):
    """Run a custom extraction query."""
    result = rag_extractor.extract_from_file(
        html_path=DATA_DIR / scenario_file,
        query=custom_query,
        top_k=10
    )
    
    if result["success"]:
        print("\n✓ Extraction successful!")
        print(json.dumps(result["data"], indent=2)[:1000])  # First 1000 chars
    else:
        print(f"\n✗ Failed: {result['error']}")
    
    return result

# Example: Custom query (uncomment to use)
# custom_result = interactive_extract(
#     scenario_file="scenario1_books.html",
#     custom_query="Find all books with 5-star ratings and extract their titles and prices"
# )


## Notes

### Model Setup
1. **Embedding Model**: Download nomic-embed-code GGUF from HuggingFace:
   ```bash
   huggingface-cli download nomic-ai/nomic-embed-code-GGUF nomic-embed-code-v1.5.Q4_K_M.gguf --local-dir ~/.cache/
   ```

2. **LM Studio**: Ensure LM Studio is running with a model loaded (e.g., Qwen2.5-7B-Instruct)
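
The notebook reads the server address from the `LMSTUDIO_BASE_URL` environment variable and tolerates values with or without a trailing `/v1`. The following sketch shows that normalization in isolation; `chat_completions_url` is a hypothetical helper name, and the addresses are examples.

```python
def chat_completions_url(base_url: str) -> str:
    """Build the chat endpoint URL from a base address, adding /v1 if missing."""
    base = base_url.rstrip("/")
    if not base.endswith("/v1"):
        base += "/v1"
    return f"{base}/chat/completions"


# Both spellings of the base address resolve to the same endpoint
url_plain = chat_completions_url("http://localhost:1234")
url_versioned = chat_completions_url("http://localhost:1234/v1/")
print(url_plain)
print(url_versioned)
```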

### Performance Tuning
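- **chunk_size**: Maximum chunk length in characters. Smaller chunks (500-800) give more precise extraction, larger ones (1000-1500) keep more context. `extract` passes only the first 800 characters of each chunk to the LLM, so larger values mainly affect what gets embedded.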
- **top_k**: More chunks (10-15) for comprehensive extraction, fewer (5-8) for speed
- **temperature**: Lower (0.0-0.2) for consistent structured output
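
When picking `top_k`, it helps to estimate how much prompt the retrieved chunks will occupy. The sketch below does that arithmetic using the 800-character cap that `extract` applies to each chunk. The 4 characters per token ratio and the 60-character per-chunk overhead are rough assumptions, not measured values, and `estimate_context_tokens` is a hypothetical helper.

```python
def estimate_context_tokens(top_k, chars_per_chunk=800, overhead_chars=60, chars_per_token=4.0):
    """Rough prompt-size estimate for the retrieved context.

    chars_per_chunk: content cap applied per chunk when the prompt is built
    overhead_chars:  assumed size of the per-chunk header lines (score, path, attributes)
    chars_per_token: assumed average; real tokenizers vary, especially on HTML
    """
    total_chars = top_k * (chars_per_chunk + overhead_chars)
    return int(total_chars / chars_per_token)


for k in (5, 10, 15):
    print(f"top_k={k}: about {estimate_context_tokens(k)} context tokens")
```

Compare the estimate with the context length configured for the model in LM Studio, leaving room for the system prompt and the `max_tokens` reserved for the answer.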

### Advantages of RAG Approach
- ✓ Handles large HTML files efficiently
- ✓ Retrieves only relevant sections
- ✓ Better context understanding with semantic search
- ✓ Scalable to multiple documents
- ✓ Works with hidden/embedded JSON data

### Dependencies
```bash
pip install beautifulsoup4 lxml faiss-cpu llama-cpp-python httpx numpy
```
