# Preserving Text Structure

When extracting text from documents, maintaining the original structure is crucial for downstream AI tasks. In this notebook, we'll learn how to preserve hierarchical structure from Markdown and HTML documents.

## Learning Objectives

By the end of this notebook, you will be able to:
- Parse structured documents (Markdown, HTML) while preserving hierarchy
- Extract metadata about document structure (headings, sections)
- Create searchable document maps with a table of contents
- Handle structured data like tables within documents
- Build section-based search functionality

## Why Structure Matters

Preserving text structure provides several benefits:
- **Better chunking** - Respects semantic boundaries
- **More accurate search** - Maintains context and hierarchy
- **Improved Q&A** - Preserves relationships between sections
- **Structured data handling** - Properly extracts tables and lists
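
To make "respects semantic boundaries" concrete, here is a minimal standard-library sketch of heading-aware splitting. The helper name `split_by_headings` and the sample text are illustrative assumptions, not part of LlamaIndex. The sketch ignores edge cases such as `#` characters inside fenced code blocks, and it drops any text that appears before the first heading.

```python
import re

# ATX-style Markdown heading: one to six '#' characters, a space, then the title
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown: str) -> list[dict]:
    """Split Markdown into sections, recording each section's heading path."""
    sections = []
    stack = []  # (level, title) pairs for the headings currently in scope
    current = None
    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            level, title = len(match.group(1)), match.group(2).strip()
            # Headings at the same or a deeper level are no longer ancestors
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, title))
            current = {"path": [t for _, t in stack], "level": level, "lines": []}
            sections.append(current)
        elif current is not None:
            current["lines"].append(line)
    return [
        {"path": s["path"], "level": s["level"], "text": "\n".join(s["lines"]).strip()}
        for s in sections
    ]


# Hypothetical sample document used only for this illustration
sample = """# Guide
## Setup
Install the package.
### Options
Use defaults.
## Usage
Run it.
"""

for section in split_by_headings(sample):
    print(" > ".join(section["path"]), "|", section["text"])
```

Each section carries the full list of its ancestor headings, so a chunk never crosses a heading boundary and can always be traced back to its place in the document.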

In [None]:
## Parsing Markdown with Structure Preservation

from llama_index.core.schema import Document
from llama_index.core.node_parser import MarkdownNodeParser
import textwrap

# Sample markdown document with clear hierarchical structure
markdown_text = """
# AI Engineering Fundamentals

## Introduction to Vector Databases

Vector databases are specialized database systems designed to store and query vector embeddings efficiently.

### Key Advantages
- Efficient similarity search
- Scalable to billions of vectors
- Support for metadata filtering

### Common Operations
1. **Vector Indexing**: Creating data structures for efficient search
2. **Approximate Nearest Neighbor Search**: Finding similar vectors quickly
3. **Hybrid Search**: Combining vector similarity with metadata filters

## Working with Embeddings

Embeddings are dense numerical representations of data that capture semantic meaning.

### Popular Embedding Models
- OpenAI text-embedding-ada-002
- Sentence Transformers
- CLIP for image embeddings
"""

# Create a document
document = Document(text=markdown_text)

# Create a parser that recognizes markdown structure
markdown_parser = MarkdownNodeParser()

# Parse the document into structured nodes
nodes = markdown_parser.get_nodes_from_documents([document])

# Display the resulting nodes
print("Markdown Parsing Results:\n")
print("=" * 80)
print(f"\nTotal nodes created: {len(nodes)}\n")

for i, node in enumerate(nodes, 1):
    print(f"Node {i}:")
    # shorten() appends its own placeholder when it truncates, so set it explicitly
    print(f"  Text: {textwrap.shorten(node.text, width=70, placeholder='...')}")
    print(f"  Header path: {node.metadata.get('header_path', 'N/A')}")
    print()

print("✓ Markdown structure preserved with hierarchical metadata")

In [None]:
## Parsing HTML with Structure Preservation

from llama_index.core.schema import Document  # Re-imported so this cell runs on its own
from llama_index.core.node_parser import HTMLNodeParser
from bs4 import BeautifulSoup
import textwrap

# Sample HTML document with headings, lists, and tables
html_text = """
<html>
<body>
  <h1>AI Engineering Fundamentals</h1>
  
  <h2>Introduction to Vector Databases</h2>
  <p>Vector databases are specialized database systems designed to store and query vector embeddings efficiently.</p>
  
  <h3>Key Advantages</h3>
  <ul>
    <li>Efficient similarity search</li>
    <li>Scalable to billions of vectors</li>
    <li>Support for metadata filtering</li>
  </ul>
  
  <h3>Common Operations</h3>
  <ol>
    <li><b>Vector Indexing</b>: Creating data structures for efficient search</li>
    <li><b>Approximate Nearest Neighbor Search</b>: Finding similar vectors quickly</li>
    <li><b>Hybrid Search</b>: Combining vector similarity with metadata filters</li>
  </ol>
  
  <h2>Working with Embeddings</h2>
  <p>Embeddings are dense numerical representations of data that capture semantic meaning.</p>
  
  <table border="1">
    <tr>
      <th>Model Name</th>
      <th>Dimensions</th>
      <th>Use Case</th>
    </tr>
    <tr>
      <td>text-embedding-ada-002</td>
      <td>1536</td>
      <td>General text embeddings</td>
    </tr>
    <tr>
      <td>all-MiniLM-L6-v2</td>
      <td>384</td>
      <td>Efficient semantic search</td>
    </tr>
  </table>
</body>
</html>
"""

# Create a document
html_document = Document(text=html_text)

# Create HTML parser
html_parser = HTMLNodeParser()

# Parse the document into nodes
html_nodes = html_parser.get_nodes_from_documents([html_document])

# Display parsing results
print("HTML Parsing Results:\n")
print("=" * 80)
print(f"\nTotal HTML nodes created: {len(html_nodes)}\n")

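preview_count = 10
for i, node in enumerate(html_nodes[:preview_count], 1):  # Show the first 10 nodes
    tag = node.metadata.get('tag', 'N/A')
    print(f"Node {i} [{tag}]:")
    print(f"  {textwrap.shorten(node.text, width=70, placeholder='...')}")

# Only report a remainder when there are nodes beyond the preview
remaining = len(html_nodes) - preview_count
if remaining > 0:
    print(f"\n... and {remaining} more nodes")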

# Extract tables specifically
def extract_tables(html_content):
    """Extract table data from HTML"""
    soup = BeautifulSoup(html_content, 'html.parser')
    tables = soup.find_all('table')

    extracted_tables = []
    for table in tables:
        rows = table.find_all('tr')
        table_data = []

        for row in rows:
            cols = row.find_all(['td', 'th'])
            row_data = [col.text.strip() for col in cols]
            table_data.append(row_data)

        extracted_tables.append(table_data)

    return extracted_tables

# Extract and display tables
tables = extract_tables(html_text)
print("\n" + "=" * 80)
print("Extracted Table:\n")
for row in tables[0]:
    print(f"  {row}")

print("\n✓ HTML structure preserved with tag metadata and table extraction")

In [None]:
## Creating a Searchable Document Map

def create_document_map(nodes):
    """
    Create a searchable map of document sections.
    
    Args:
        nodes: List of parsed document nodes
        
    Returns:
        dict: Document map with section information
    """
    document_map = {}

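    for i, node in enumerate(nodes):
        # The parser's metadata may not include "heading"/"heading_level" keys
        # (MarkdownNodeParser records "header_path"), so derive them from the
        # node's first line when it is a Markdown heading. This assumes the
        # parser keeps the heading line in the node text.
        stripped = node.text.strip()
        first_line = stripped.splitlines()[0] if stripped else ""
        hashes = len(first_line) - len(first_line.lstrip("#"))
        if 1 <= hashes <= 6 and first_line[hashes:hashes + 1] == " ":
            heading = first_line[hashes:].strip()
            level = hashes
        else:
            # Fall back to metadata, then to a generic section label
            heading = node.metadata.get("heading", f"Section {i+1}")
            level = node.metadata.get("heading_level", 0)

        # Add to document map with indent based on level
        # (sections that share a heading overwrite each other in this simple map)
        indent = "  " * (level - 1) if level > 0 else ""
        document_map[heading] = {
            "index": i,
            "level": level,
            "text": node.text,
            "display": f"{indent}{heading}"
        }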

    return document_map

def find_section(query, doc_map):
    """
    Find sections that match a query string.
    
    Args:
        query: Search term
        doc_map: Document map to search
        
    Returns:
        list: Matching sections with snippets
    """
    matches = []

    for heading, info in doc_map.items():
        # Check if query is in heading or content
        if query.lower() in heading.lower() or query.lower() in info['text'].lower():
            matches.append((heading, info))

    return matches

# Create document map from markdown nodes
doc_map = create_document_map(nodes)

# Display the document structure as a table of contents
print("Document Table of Contents:\n")
print("=" * 80)
for heading, info in doc_map.items():
    print(f"  {info['display']}")

# Perform section-based searches
print("\n" + "=" * 80)
print("Section-Based Search Examples:\n")

search_terms = ["advantages", "embedding models", "indexing"]

for term in search_terms:
    print(f"Searching for '{term}':")
    results = find_section(term, doc_map)

    if results:
        for heading, info in results:
            print(f"  ✓ Found in: {info['display']}")
            
            # Extract a relevant snippet
            text = info['text']
            term_pos = text.lower().find(term.lower())
            if term_pos >= 0:
                start = max(0, term_pos - 40)
                snippet = text[start:start+100].strip() + "..."
                print(f"    Snippet: {snippet}")
    else:
        print("  ✗ No results found")
    print()

print("✓ Document map enables efficient section-based navigation and search")

## Summary

We've explored how to preserve and leverage document structure during text extraction:

### Key Techniques

1. **Markdown Parsing** - MarkdownNodeParser preserves:
   - Heading hierarchy (H1, H2, H3)
   - Header paths showing document structure
   - Section relationships

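2. **HTML Parsing** - HTMLNodeParser preserves:
   - Tag information (headings, paragraphs, list items) as `tag` metadata on each node
   - Text content grouped by the tag it came from
   - Table rows and columns are not reconstructed here; this notebook extracts them separately with BeautifulSoup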

3. **Document Maps** - Create searchable structures with:
   - Table of contents generation
   - Section-based navigation
   - Hierarchical search

4. **Table Extraction** - BeautifulSoup enables:
   - Structured table data extraction
   - Conversion to usable formats
   - Integration with other parsers
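
As one possible reading of "conversion to usable formats", the sketch below turns the list-of-rows shape returned by `extract_tables` into header-keyed records and a Markdown table. The helper names `rows_to_records` and `rows_to_markdown` are illustrative assumptions, and the code assumes the first row holds the column headers.

```python
def rows_to_records(table_rows):
    """Turn [[header...], [cell...], ...] into a list of dicts keyed by header."""
    if not table_rows:
        return []
    header, *body = table_rows
    # zip() stops at the shorter sequence, so ragged rows are silently truncated
    return [dict(zip(header, row)) for row in body]


def rows_to_markdown(table_rows):
    """Render the same rows as a pipe-delimited Markdown table."""
    if not table_rows:
        return ""
    header, *body = table_rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


# Rows in the shape extract_tables() produces for the sample HTML table
rows = [
    ["Model Name", "Dimensions", "Use Case"],
    ["text-embedding-ada-002", "1536", "General text embeddings"],
    ["all-MiniLM-L6-v2", "384", "Efficient semantic search"],
]

records = rows_to_records(rows)
print(records[0]["Dimensions"])
print(rows_to_markdown(rows))
```

Records suit filtering and metadata lookups, while the Markdown rendering keeps the table readable when it is embedded in a text chunk.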

### Benefits of Structure Preservation

- **Better retrieval** - Search respects document hierarchy
- **Context preservation** - Maintains relationships between sections
- **Improved chunking** - Creates semantically meaningful chunks
- **Enhanced metadata** - Richer information for downstream tasks
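
One way search can respect hierarchy is to restrict matches to a subtree of the document. The sketch below assumes each section is a plain dict with a `path` list of ancestor headings and a `text` string; both that shape and the `search_within` helper are hypothetical and are not a LlamaIndex API.

```python
def search_within(sections, query, scope=()):
    """Return sections under `scope` (a heading-path prefix) whose text contains `query`."""
    scope = list(scope)
    needle = query.lower()
    return [
        s for s in sections
        # A section is in scope when its path starts with the scope prefix
        if s["path"][:len(scope)] == scope and needle in s["text"].lower()
    ]


ROOT = "AI Engineering Fundamentals"
sections = [
    {"path": [ROOT, "Introduction to Vector Databases"],
     "text": "Vector databases are specialized database systems designed to "
             "store and query vector embeddings efficiently."},
    {"path": [ROOT, "Introduction to Vector Databases", "Key Advantages"],
     "text": "Efficient similarity search"},
    {"path": [ROOT, "Working with Embeddings"],
     "text": "Embeddings are dense numerical representations of data that "
             "capture semantic meaning."},
    {"path": [ROOT, "Working with Embeddings", "Popular Embedding Models"],
     "text": "CLIP for image embeddings"},
]

everywhere = search_within(sections, "embeddings")
scoped = search_within(sections, "embeddings", scope=[ROOT, "Working with Embeddings"])
print(len(everywhere), len(scoped))
```

The unscoped query also matches the vector database introduction, while the scoped query only returns sections under "Working with Embeddings".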

### Best Practices

- Use structure-aware parsers (Markdown, HTML) when possible
- Preserve header paths and hierarchy metadata
- Extract tables and structured data separately when needed
- Build document maps for navigation and search
- Maintain section context in your chunks
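
A simple way to maintain section context is to prefix every chunk with its heading breadcrumb before embedding or indexing it. The following is a minimal sketch under that assumption; `chunk_with_context`, its word-based window, and the `max_words` default are illustrative choices, not a recommendation from any particular library.

```python
def chunk_with_context(path, text, max_words=50):
    """Split `text` into word windows and prefix each with its heading breadcrumb."""
    breadcrumb = " > ".join(path)
    words = text.split()
    chunks = []
    for start in range(0, len(words), max_words):
        body = " ".join(words[start:start + max_words])
        # Keep the breadcrumb on its own line so it is easy to strip later
        chunks.append(f"{breadcrumb}\n{body}" if breadcrumb else body)
    return chunks


chunks = chunk_with_context(
    ["AI Engineering Fundamentals", "Working with Embeddings"],
    "Embeddings are dense numerical representations of data that capture semantic meaning.",
    max_words=6,
)
for chunk in chunks:
    print(chunk)
    print("-" * 40)
```

Because every chunk repeats the breadcrumb, a chunk retrieved in isolation still says which section it came from.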

Structure-aware parsing enables more sophisticated AI applications that understand document organization, not just raw text.