## Debugging Bullet Detection - Page 27

This cell debugs how bullets (•, ◦, etc.) are detected in the PDF on page 27. It extracts raw PDF data and prints:
- Block and line indices where bullets appear
- The text content after each bullet
- X positions of bullet markers

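This helps verify that the PDF parsing identifies bullets and their locations correctly. The detail lines (`line_text`, `first_content_x`) are printed only when a bullet shares its line with other text, so a run that prints only `Found bullet` lines means every bullet sits on a line of its own.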

In [1]:
# Debug bullet detection for page 27
import fitz

doc = fitz.open('Teradata Package for Python User Guide.pdf')
page = doc.load_page(26)  # page 27 (0-indexed)
data = page.get_text('dict')

print("Checking bullet detection for page 27:")
for i, block in enumerate(data['blocks']):
    if block['type'] == 0:
        for j, line in enumerate(block['lines']):
            spans = line['spans']
            if spans:
                line_text = ''
                has_bullet_chars = False
                first_content_span = None
                
                for span in spans:
                    span_text = span['text'].strip()
                    if span_text in ['•', '\u2022', '◦', '\u25E6', '*', '-']:
                        has_bullet_chars = True
                        print(f'Block {i}, Line {j}: Found bullet "{span_text}"')
                        continue
                    if first_content_span is None and span['text'].strip():
                        first_content_span = span
                    line_text += span['text'] + " "
                
                line_text = line_text.strip()
                if line_text and has_bullet_chars:
                    print(f'  line_text: "{line_text}"')
                    print(f'  has_bullet_chars: {has_bullet_chars}')
                    if first_content_span:
                        x_pos = first_content_span['origin'][0]
                        print(f'  first_content_x: {x_pos}')
                    print()

Checking bullet detection for page 27:
Block 7, Line 0: Found bullet "•"
Block 8, Line 0: Found bullet "◦"
Block 9, Line 0: Found bullet "◦"
Block 10, Line 0: Found bullet "◦"
Block 13, Line 0: Found bullet "•"
Block 14, Line 0: Found bullet "◦"
Block 15, Line 0: Found bullet "◦"
Block 16, Line 0: Found bullet "◦"
Block 17, Line 0: Found bullet "◦"
Block 18, Line 0: Found bullet "◦"
Block 19, Line 0: Found bullet "◦"
Block 20, Line 0: Found bullet "◦"
Block 21, Line 0: Found bullet "◦"
Block 22, Line 0: Found bullet "◦"
Block 23, Line 0: Found bullet "◦"


## Debugging Bullet Detection - Page 32

Same scan as the previous cell, applied to page 32, the first page of a different chapter, to check that bullet detection behaves consistently across the PDF.

Useful for comparing bullet patterns and X positions across different pages.
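
Because the two debug cells differ only in the page number, the scan could be wrapped in a helper. The following is a minimal sketch, not part of the notebook: `find_bullet_spans` is a hypothetical name, and the sample dict is synthetic data shaped like the output of `page.get_text('dict')`, with made-up coordinates.

```python
BULLETS = {'\u2022', '\u25E6', '*', '-'}  # same characters the debug cells test for


def find_bullet_spans(page_dict):
    """Return (block_idx, line_idx, bullet, x) for every bullet-only span."""
    hits = []
    for i, block in enumerate(page_dict.get('blocks', [])):
        if block.get('type') != 0:  # 0 = text block; skip everything else
            continue
        for j, line in enumerate(block.get('lines', [])):
            for span in line.get('spans', []):
                text = span.get('text', '').strip()
                if text in BULLETS:
                    x = span.get('origin', (0, 0))[0]
                    hits.append((i, j, text, x))
    return hits


# Synthetic stand-in for page.get_text('dict'); coordinates are invented.
sample = {'blocks': [
    {'type': 0, 'lines': [
        {'spans': [{'text': '\u2022', 'origin': (60.0, 100.0)}]},
        {'spans': [{'text': 'First item', 'origin': (72.0, 100.0)}]},
    ]},
    {'type': 1},  # non-text block, ignored
    {'type': 0, 'lines': [
        {'spans': [{'text': ' \u25E6 ', 'origin': (78.0, 120.0)}]},
    ]},
]}

bullet_hits = find_bullet_spans(sample)
for hit in bullet_hits:
    print(hit)
```

With a real document, the same helper would be called as `find_bullet_spans(doc.load_page(31).get_text('dict'))`.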

In [2]:
# Debug bullet detection for page 32
import fitz

doc = fitz.open('Teradata Package for Python User Guide.pdf')
page = doc.load_page(31)  # page 32 (0-indexed)
data = page.get_text('dict')

print("Checking bullet detection for page 32:")
for i, block in enumerate(data['blocks']):
    if block['type'] == 0:
        for j, line in enumerate(block['lines']):
            spans = line['spans']
            if spans:
                line_text = ''
                has_bullet_chars = False
                first_content_span = None
                
                for span in spans:
                    span_text = span['text'].strip()
                    if span_text in ['•', '\u2022', '◦', '\u25E6', '*', '-']:
                        has_bullet_chars = True
                        print(f'Block {i}, Line {j}: Found bullet "{span_text}"')
                        continue
                    if first_content_span is None and span['text'].strip():
                        first_content_span = span
                    line_text += span['text'] + " "
                
                line_text = line_text.strip()
                if line_text and has_bullet_chars:
                    print(f'  line_text: "{line_text}"')
                    print(f'  has_bullet_chars: {has_bullet_chars}')
                    if first_content_span:
                        x_pos = first_content_span['origin'][0]
                        print(f'  first_content_x: {x_pos}')
                    print()

Checking bullet detection for page 32:
Block 0, Line 0: Found bullet "•"
Block 1, Line 0: Found bullet "•"
Block 2, Line 0: Found bullet "•"
Block 4, Line 0: Found bullet "•"
Block 9, Line 0: Found bullet "•"
Block 13, Line 0: Found bullet "•"


## Import Dependencies

Imports the required libraries for PDF parsing and file operations:
- `pymupdf` (imported under its legacy alias `fitz`): For reading and extracting text from PDF files
- `os`: For file and directory operations
- `re`: For regex pattern matching in text processing
- `json`: For working with JSON data (e.g., exporting raw metadata)

In [3]:
import pymupdf as fitz
import os
import re
import json

## Configuration: Chapter Map & PDF Metadata

Defines the structure of the PDF document:
- **CHAPTER_MAP**: A list of tuples containing (chapter_title, start_page, end_page) for each chapter/section in the Teradata Package for Python User Guide, with 1-indexed, inclusive page numbers
- **PDF_FILE**: The path to the PDF file to be parsed

The chapter map is used to split the extracted content into separate markdown files per chapter, preserving document structure.
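
As an illustration of how such a map can drive the per-chapter split, here is a minimal sketch. It assumes the page numbers are 1-indexed and inclusive; `chapter_for_page` and `chapter_filename` are hypothetical helpers, and the file-naming scheme is an assumption rather than the notebook's actual output format.

```python
import re

# Three entries copied from CHAPTER_MAP; the full list works the same way.
SAMPLE_CHAPTER_MAP = [
    ("Table of Contents", 3, 6),
    ("Introduction to Teradata Package for Python", 7, 26),
    ("Installing, Uninstalling, and Upgrading Teradata Package for Python", 27, 31),
]


def chapter_for_page(page_num, chapter_map):
    """Return the title of the chapter containing a 1-indexed page, or None."""
    for title, start, end in chapter_map:
        if start <= page_num <= end:
            return title
    return None


def chapter_filename(index, title):
    """Hypothetical naming scheme: zero-padded index plus a slug of the title."""
    slug = re.sub(r'[^a-z0-9]+', '_', title.lower()).strip('_')
    return f"{index:02d}_{slug}.md"


print(chapter_for_page(27, SAMPLE_CHAPTER_MAP))
print(chapter_for_page(2, SAMPLE_CHAPTER_MAP))   # page before the first chapter
print(chapter_filename(0, SAMPLE_CHAPTER_MAP[0][0]))
```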

In [4]:
# Configuration: Chapter Map & PDF Metadata
CHAPTER_MAP = [
    ("Table of Contents", 3, 6),
    ("Introduction to Teradata Package for Python", 7, 26),
    ("Installing, Uninstalling, and Upgrading Teradata Package for Python", 27, 31),
    ("teradataml Components", 32, 42),
    ("DataFrames Setup and Basics (Sources, Non-Default DB, UAF)", 43, 60),
    ("DataFrame Manipulation (Core API)", 61, 164),
    ("DataFrame Metadata, Rotation, Saving, and Export", 165, 202),
    ("Executing Python Functions Inside Database Engine 20", 203, 242),
    ("teradataml DataFrame Column", 243, 279),
    ("teradataml Window Aggregates", 280, 288),
    ("Context to Teradata Vantage", 289, 301),
    ("teradataml Options", 302, 319),
    ("teradataml Utility and General Functions", 320, 399),
    ("teradataml Open-Source Machine Learning Functions", 400, 457),
    ("Script Methods (SCRIPT Table Operator)", 458, 477),
    ("Series (DataFrame Column Sequence)", 478, 481),
    ("BYOM (Bring Your Own Model) Management", 482, 519),
    ("Working with Geospatial Data", 520, 571),
    ("Exploratory Data Analysis (EDA UI)", 572, 576),
    ("Plotting in teradataml", 577, 611),
    ("Hyperparameter Tuning in teradataml", 612, 693),
    ("AutoML Overview and Methods", 694, 718),
    ("AutoML Examples", 719, 1108),
    ("AutoDataPrep", 1109, 1150),
    ("Feature Store in teradataml", 1151, 1190),
    ("Using Teradata Vantage Analytic Functions with Teradata Package for Python", 1191, 1235),
    ("Appendix A: Teradata Package for Python Limitations and Considerations", 1236, 1260),
    ("Appendix B: Using teradataml with Native Object Store", 1261, 1276),
    ("Appendix C: teradataml Extension with SQLAlchemy", 1277, 1295),
    ("Appendix D: Data Type Mapping", 1296, 1297),
    ("Appendix E: Additional Information", 1298, 1301)
]

PDF_FILE = "Teradata Package for Python User Guide.pdf"

## Constants: Font Sizes & Thresholds

Defines the constants used for PDF parsing:
- **Font sizes**: H2 (17.955 pt), H3 (15.96 pt), H4 (11.55 pt in Arial-Black, or 13.965 pt bold) for heading detection
- **BODY_TEXT_SIZE**: 10.5 points (standard text)
- **CODE_FONT**: 'Consolas' (identifies code blocks)
- **BOLD_FLAG**: 16 (bit flag for bold text)
- **Bullet character sets**: BLACK_BULLETS (•, *) for top-level items, WHITE_BULLETS (◦) for nested lists
- **Y_MERGE_TOLERANCE**: 5 points (merge lines within this vertical distance)
- **APPENDIX_START_INDEX**: 26 (0-based index of Appendix A in the chapter map)

In [5]:
# X-coordinate indentation mapping (generalized across PDF)
# Rule: Indentation strictly based on X thresholds (0/2/4/6 spaces)
# Rule: Nested levels must have at least +2 more spaces than parent
X_INDENT_RANGES = [
    (50, 75, 0, None, False),        # Level 0: X ~50-75, 0 spaces, no bullet (main content)
    (59, 73, 0, '*', True),          # Level 0 black bullet: X ~59-73, 0 spaces, with bullet •
    (72, 85, 2, '*', True),          # Level 1 bullet: X ~72-85, 2 spaces, with bullet • or ◦
    (77, 90, 2, None, False),        # Level 1 content: X ~77-90, 2 spaces, no bullet
    (85, 115, 4, '*', True),         # Level 2 bullet/content: X ~85-115, 4 spaces, with bullet or no bullet
    (111, 130, 4, None, False),      # Level 2 content: X ~111-130, 4 spaces, no bullet
    (125, 150, 6, None, False),      # Level 3 content: X ~125-150, 6 spaces, no bullet
]

## Define PDF Constants

Defines the font sizes, styles, and tolerance thresholds used throughout PDF parsing:
- **Font sizes**: H2 (17.955 pt), H3 (15.96 pt), H4 (11.55 pt in Arial-Black, or 13.965 pt bold) for heading detection
- **Body text**: 10.5 pt standard text size
- **Code font**: Consolas (identifies code blocks)
- **Bold flag**: 16 (bit flag for bold text)
- **Bullet characters**: BLACK_BULLETS (•, *) for top-level items and WHITE_BULLETS (◦) for nested lists; BULLET_CHARS additionally includes `-`
- **Tolerances**: Y_MERGE_TOLERANCE (5 pt) for row merging, TOLERANCE (0.01 pt) for font-size comparison
- **APPENDIX_START_INDEX**: 26, the 0-based index in CHAPTER_MAP where the appendices begin (Appendix A)
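
To make the bold flag and the size tolerance concrete, the sketch below shows how the two checks behave on a few hand-picked values. The helper names `is_bold` and `matches_size` are illustrative only; the notebook performs these checks inline.

```python
BOLD_FLAG = 16      # bit 4 of a PyMuPDF span's 'flags' value
TOLERANCE = 0.01    # allowed difference when comparing font sizes
H2_SIZE_SPECIFIC = 17.954999923706055


def is_bold(flags):
    """True when the bold bit is set, whatever other bits are present."""
    return bool(flags & BOLD_FLAG)


def matches_size(size, target, tol=TOLERANCE):
    """Compare font sizes with a tolerance instead of exact float equality."""
    return abs(size - target) < tol


print(is_bold(20))   # 20 = 16 (bold) + 4 (serifed)
print(is_bold(4))    # serifed only
print(matches_size(17.955, H2_SIZE_SPECIFIC))
print(matches_size(17.9, H2_SIZE_SPECIFIC))
```

Comparing with a tolerance matters because the sizes reported by the PDF are floats such as 17.954999923706055, which an exact comparison against 17.955 would miss.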

In [6]:
# Constants: PDF Font Sizes, Styles, and Tolerance Thresholds
TOLERANCE = 0.01
H2_SIZE_SPECIFIC = 17.954999923706055
H3_SIZE_GENERIC = 15.960000038146973
H4_SIZE_ARIAL_BLACK = 11.550000190734863
H4_SIZE_BOLD_SECONDARY = 13.96500015258789
BODY_TEXT_SIZE = 10.5
BOLD_FLAG = 16
CODE_FONT = 'Consolas'
H4_FONT = 'Arial-Black'
Y_MERGE_TOLERANCE = 5.0  # Lines within 5 points vertically are part of same row
APPENDIX_START_INDEX = 26  # Chapter index where appendices begin

# Bullet character detection (consolidated - used in multiple places)
BULLET_CHARS = {'•', '\u2022', '◦', '\u25E6', '*', '-'}
WHITE_BULLETS = {'◦', '\u25E6'}  # Nested bullets
BLACK_BULLETS = {'•', '\u2022', '*'}  # Main level bullets



## X-Coordinate Indentation Mapping

Maps X-coordinate positions from the PDF to markdown indentation levels:
- **Rule 1**: Indentation is strictly based on X thresholds (0/2/4/6 spaces)
- **Rule 2**: Each nested level is indented at least 2 spaces more than its parent

Each range tuple: `(x_min, x_max, indent_spaces, bullet_marker, is_bullet)`
- x_min, x_max: X-coordinate boundaries in points (x_min inclusive, x_max exclusive)
- indent_spaces: Number of spaces to apply (0, 2, 4, or 6)
- bullet_marker: Markdown bullet marker for the range (`*`) or `None`
- is_bullet: Whether this range is for bullets

This ensures consistent hierarchical indentation across all content.

In [7]:
def get_indent_from_x(x_pos, bullet_char=None):
    """
    Determine indentation level from X coordinate.
    
    Indentation is determined STRICTLY by X coordinate thresholds (0/2/4/6 spaces).
    Nested levels must have at least +2 more spaces than parent level.
    
    Args:
        x_pos: X coordinate from PDF (PRIMARY source of truth)
        bullet_char: bullet character - currently unused, kept for API compatibility
    
    Returns:
        indent_spaces: number of spaces for indentation (0, 2, 4, 6, etc.)
    """
    # Check X coordinate ranges (PRIMARY source of truth)
    for x_min, x_max, indent_spaces, bullet_marker, is_bullet in X_INDENT_RANGES:
        if x_min <= x_pos < x_max:
            return indent_spaces
    
    # Fallback: no range matched. Estimate from the distance past 125
    # (with the ranges above this is only reached for x_pos >= 150).
    if x_pos >= 125:
        extra_levels = max(0, int((x_pos - 125) / 20))
        return 4 + extra_levels * 2
    
    return 0  # Default: main level

## Utility Function: Get Indentation from X Coordinate

**Purpose**: Convert PDF X-coordinate positions to markdown indentation spaces.

**Logic**:
1. Scan X_INDENT_RANGES in order and take the first range where `x_min <= x_pos < x_max`
2. Return that range's `indent_spaces` (PRIMARY source of truth)
3. Fallback when no range matches and x_pos >= 125 (with the current ranges, only x_pos >= 150): return `4 + 2 * int((x_pos - 125) / 20)`
4. Default: Return 0 spaces

**Key Design**: Indentation is STRICTLY based on X coordinates, not on bullet characters. This ensures consistent formatting across the document.
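
The lookup can be checked on a handful of X positions. The block below is a condensed copy of `X_INDENT_RANGES` and `get_indent_from_x` so that it runs on its own; the sample positions are arbitrary. Because several ranges overlap, the first matching range wins, which is worth keeping in mind when tuning the thresholds.

```python
X_INDENT_RANGES = [
    (50, 75, 0, None, False),
    (59, 73, 0, '*', True),
    (72, 85, 2, '*', True),
    (77, 90, 2, None, False),
    (85, 115, 4, '*', True),
    (111, 130, 4, None, False),
    (125, 150, 6, None, False),
]


def get_indent_from_x(x_pos, bullet_char=None):
    """Condensed copy of the notebook function (bullet_char is unused)."""
    for x_min, x_max, indent_spaces, _marker, _is_bullet in X_INDENT_RANGES:
        if x_min <= x_pos < x_max:
            return indent_spaces  # first matching range wins
    if x_pos >= 125:
        return 4 + max(0, int((x_pos - 125) / 20)) * 2
    return 0


samples = [40, 60, 73, 80, 100, 127, 140, 150, 170]
indents = {x: get_indent_from_x(x) for x in samples}
for x, spaces in indents.items():
    print(f"x={x:>3} -> {spaces} spaces")
```

Two results follow from the overlaps: X = 73 falls in both the 50-75 and 72-85 ranges and gets 0 spaces, and X = 127 falls in both 111-130 and 125-150 and gets 4 spaces rather than 6.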

In [8]:
# PDF Block Processing Functions

def merge_table_columns_in_block(block):
    """
    Merge table columns by grouping lines at similar Y coordinates.
    Special handling: Do NOT merge bullet characters with content text.
    """
    if not block.get('lines'):
        return block.get('lines', [])  # avoids a KeyError when the block has no 'lines' key
    
    # Identify which lines contain ONLY a bullet character
    pure_bullet_lines = set()
    for line_idx, line in enumerate(block.get('lines', [])):
        spans = line.get('spans', [])
        if len(spans) == 1 and spans[0]['text'].strip() in BULLET_CHARS:
            pure_bullet_lines.add(line_idx)
    
    # Collect all lines with their Y coordinate, excluding pure bullets from merge
    y_groups = {}
    for line_idx, line in enumerate(block.get('lines', [])):
        if not line.get('spans'):
            continue
        
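        # Don't merge pure bullet lines - keep them separate.
        # NOTE: the small integer key normally sorts ahead of the float Y keys
        # in the final sort, so bullet groups are emitted first. The bullet stays
        # directly before its content only when a block's single bullet is its
        # first line.
        if line_idx in pure_bullet_lines:
            y_groups[len(y_groups)] = [line]  # Each bullet gets its own group
            continue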
        
        y_pos = line['bbox'][1]  # Top of bbox
        
        # Find matching Y group (within tolerance), but only for non-bullet lines
        matched_y = None
        for existing_y in y_groups:
            if isinstance(existing_y, (int, float)) and existing_y < 1000:
                if abs(y_pos - existing_y) < Y_MERGE_TOLERANCE:
                    matched_y = existing_y
                    break
        
        if matched_y is None:
            matched_y = y_pos
            y_groups[matched_y] = []
        
        y_groups[matched_y].append(line)
    
    # Sort each group by X coordinate and merge spans
    merged_lines = []
    for key in sorted(y_groups.keys(), key=lambda k: k if isinstance(k, (int, float)) else float('inf')):
        lines_in_group = y_groups[key]
        
        if len(lines_in_group) == 1:
            merged_lines.append(lines_in_group[0])
        else:
            # Multiple lines at same Y - merge them
            lines_sorted = sorted(lines_in_group, key=lambda l: l['bbox'][0])
            
            # Combine spans from all lines, adding spaces between them
            merged_spans = []
            for line_idx, line in enumerate(lines_sorted):
                for span_idx, span in enumerate(line['spans']):
                    span_copy = dict(span)
                    if line_idx > 0 or span_idx > 0:
                        span_copy['text'] = ' ' + span['text']
                    merged_spans.append(span_copy)
            
            # Create merged line
            merged_line = {
                'spans': merged_spans,
                'wmode': lines_sorted[0].get('wmode', 0),
                'dir': lines_sorted[0].get('dir', [1.0, 0.0]),
                'bbox': [
                    lines_sorted[0]['bbox'][0],   # Left of first
                    lines_sorted[0]['bbox'][1],   # Top of first
                    lines_sorted[-1]['bbox'][2],  # Right of last
                    lines_sorted[-1]['bbox'][3],  # Bottom of last
                ]
            }
            merged_lines.append(merged_line)
    
    return merged_lines

def identify_bullet_lines(merged_block_lines):
    """Identify which lines contain ONLY a bullet character."""
    bullet_lines = set()
    for line_idx, line in enumerate(merged_block_lines):
        spans = line.get('spans', [])
        if len(spans) == 1 and spans[0]['text'].strip() in BULLET_CHARS:
            bullet_lines.add(line_idx)
    return bullet_lines

## Line Info Extraction & Helper Functions

**extract_line_info()**:
- Extracts metadata from each line: text, font size, flags, font name, X position, bullet info
- Skips noise: lines whose first span is 24.0 pt or 8.0 pt, chapter/appendix running headers (e.g. `3: ...` or `A: ...`), punctuation-only lines, and bare numbers 1-10
- Detects whether the preceding line was a pure-bullet line and, if so, captures the bullet character and its X position
- Returns a list of dicts with line info for processing

**should_close_code_block()**: Detects headings that should close open code blocks

**get_heading_level()**: Determines markdown heading level (2-4) from font characteristics

**is_code_block_text()**: Checks if line is in code font (Consolas) at body text size (10.5 pt)

## PDF Block Processing Functions

**merge_table_columns_in_block()**:
- Merges table columns by grouping lines at similar Y coordinates
- Preserves bullet characters on separate lines (not merged with content)
- Handles multi-column layouts in the PDF

**identify_bullet_lines()**:
- Identifies which lines contain ONLY a bullet character (•, ◦, etc.)
- Returns a set of line indices containing pure bullets
- Used to properly handle bullets separately from their content

In [9]:
# Line Info Extraction & Processing

def extract_line_info(merged_block_lines, bullet_lines, page_num_1idx):
    """
    Extract and process line metadata from merged block lines.
    
    Returns:
        List of dicts with: text, size, flags, font, is_bold, x_pos, has_bullet, bullet_char, bullet_x_pos
    """
    lines_info = []
    
    for line_idx, line in enumerate(merged_block_lines):
        spans = line.get('spans', [])
        if not spans:
            continue
        
        # SKIP pure bullet lines - they only serve to mark the next line
        if line_idx in bullet_lines:
            continue

        first_span = spans[0]
        span_size = first_span.get('size', 10)
        line_flags = first_span.get('flags', 0)
        line_font = first_span.get('font', 'Arial')
        x_pos = first_span.get('origin', [0, 0])[0]
        
        # Skip certain font sizes (headers, noise)
        if round(span_size, 1) in [24.0, 8.0]:
            continue

        line_text = "".join([s['text'] for s in spans])
        line_text = clean_page_noise(line_text, page_num_1idx).strip()
        line_text_clean = line_text.lstrip(' ')
        
        # Skip chapter/appendix headers and noise
        if re.match(r'^(\d{1,2}|[A-E]):\s+\S', line_text_clean):
            continue
        
        if not line_text_clean or re.fullmatch(r'[\.\-\(\)\s]*', line_text_clean):
            continue
        
        if line_text_clean.strip() in ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10"]:
            continue
        
        is_bold_heading = (line_flags & BOLD_FLAG)
        
        # Determine if this line had a bullet in previous line
        has_bullet_from_previous = False
        bullet_char = None
        bullet_x_pos = None
        if line_idx > 0 and (line_idx - 1) in bullet_lines:
            has_bullet_from_previous = True
            prev_line = merged_block_lines[line_idx - 1]
            if prev_line.get('spans'):
                bullet_text = prev_line['spans'][0]['text'].strip()
                bullet_char = bullet_text
                bullet_x_pos = prev_line['spans'][0].get('origin', [0, 0])[0]
        
        lines_info.append({
            'text': line_text_clean,
            'size': span_size,
            'flags': line_flags,
            'font': line_font,
            'is_bold': is_bold_heading,
            'x_pos': x_pos,
            'has_bullet': has_bullet_from_previous,
            'bullet_char': bullet_char,
            'bullet_x_pos': bullet_x_pos,
        })
    
    return lines_info

def should_close_code_block(line_info):
    """Determine if we should close any open code block for this line (heading detection)."""
    span_size = line_info['size']
    line_flags = line_info['flags']
    line_font = line_info['font']
    is_bold_heading = line_info['is_bold']
    
    # Check if this is a heading that should close code blocks
    if abs(span_size - H2_SIZE_SPECIFIC) < TOLERANCE and is_bold_heading:
        return True
    elif abs(span_size - H3_SIZE_GENERIC) < TOLERANCE and is_bold_heading:
        return True
    elif H4_FONT in line_font and abs(span_size - H4_SIZE_ARIAL_BLACK) < TOLERANCE:
        return True
    elif round(span_size, 3) == 13.965 and is_bold_heading:
        return True
    
    return False

def get_heading_level(line_info):
    """Get markdown heading level if line is a heading, else None."""
    span_size = line_info['size']
    line_flags = line_info['flags']
    line_font = line_info['font']
    is_bold_heading = line_info['is_bold']
    
    if abs(span_size - H2_SIZE_SPECIFIC) < TOLERANCE and is_bold_heading:
        return 2
    elif abs(span_size - H3_SIZE_GENERIC) < TOLERANCE and is_bold_heading:
        return 3
    elif H4_FONT in line_font and abs(span_size - H4_SIZE_ARIAL_BLACK) < TOLERANCE:
        return 4
    elif round(span_size, 3) == 13.965 and is_bold_heading:
        return 4
    
    return None

def is_code_block_text(line_info):
    """Check if line is code text (code font + correct size)."""
    return abs(line_info['size'] - BODY_TEXT_SIZE) < TOLERANCE and line_info['font'] == CODE_FONT

## Core Page Parsing Logic

**parse_page_with_metadata()**:
- Main function that parses a single PDF page
- Handles multi-page code block continuity via state tracking
- For each line, determines:
  1. Is it a heading? → Close code block, add markdown heading
  2. Is it code? → Add to code block with proper formatting
  3. Is it regular text? → Apply hierarchical indentation based on X coordinate

**Key features**:
- Tracks `in_code_block` state across pages
- Applies indentation strictly from X_INDENT_RANGES
- Handles numbered items and bullet markers
- Returns markdown text and updated state for next page
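
The state hand-off between pages can be sketched in isolation. This is a simplified illustration, not the notebook's parser: `render_page` and `render_document` are hypothetical names, each page is reduced to a list of `(text, is_code)` pairs, and the result is a tuple rather than the `{'markdown', 'state'}` dict that `parse_page_with_metadata()` returns.

```python
FENCE = "`" * 3  # built this way to keep literal fences out of the example


def render_page(lines, state=None):
    """Render one page of (text, is_code) pairs; return (markdown, state)."""
    in_code = bool(state and state.get('in_code_block'))
    out = []
    for text, is_code in lines:
        if is_code and not in_code:
            out.append(FENCE + "python")  # open a code block
            in_code = True
        elif not is_code and in_code:
            out.append(FENCE)             # body text closes the open block
            in_code = False
        out.append(text)
    return "\n".join(out), {'in_code_block': in_code}


def render_document(pages):
    """Thread the state through all pages and close a dangling block."""
    state, chunks = None, []
    for lines in pages:
        markdown, state = render_page(lines, state)
        chunks.append(markdown)
    if state and state['in_code_block']:
        chunks.append(FENCE)  # the last page ended inside a code block
    return "\n".join(chunks)


pages = [
    [("Intro text", False), (">>> x = 1", True)],     # page ends inside code
    [(">>> print(x)", True), ("Closing text", False)],
]
document = render_document(pages)
print(document)
```

The point of the pattern is that a code block split by a page break produces one fenced block instead of two, and that the caller is responsible for closing a block still open after the last page.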

In [10]:
def parse_page_with_metadata(page, page_num_1idx, previous_page_state=None):
    """
    Parse a page with table detection, code vs output distinction, and hierarchical indentation.
    
    Args:
        page: PyMuPDF page object
        page_num_1idx: 1-indexed page number
        previous_page_state: dict tracking state from previous page
    
    Returns:
        dict with 'markdown': str and 'state': dict for next page
    """
    if previous_page_state is None:
        previous_page_state = {'in_code_block': False, 'pending_lines': []}
    
    data = page.get_text("dict")
    blocks = data['blocks']
    markdown_lines = []
    pending_numbered_item = None
    
    in_code_block = previous_page_state['in_code_block']
    
    if previous_page_state['pending_lines']:
        markdown_lines.extend(previous_page_state['pending_lines'])
        in_code_block = True

    # STEP 1: Prepare raw blocks for table detection
    all_blocks_with_raw_lines = []
    all_merged_blocks = {}
    
    for block_idx, block in enumerate(blocks):
        if block['type'] != 0:
            continue
        
        raw_lines = block.get('lines', [])
        all_blocks_with_raw_lines.append((block_idx, raw_lines))
        
        merged_block_lines = merge_table_columns_in_block(block)
        all_merged_blocks[block_idx] = merged_block_lines
    
    # STEP 2: Detect and convert tables
    page_table_markdown, table_line_refs = detect_and_convert_tables_in_page(all_blocks_with_raw_lines, page_num_1idx)
    
    # Build set of blocks that contain tables (from table_line_refs)
    table_blocks_set = set(b_idx for b_idx, _ in table_line_refs)

    # STEP 3: Process each block
    table_inserted = False
    for block_idx, block in enumerate(blocks):
        if block['type'] != 0:
            continue
        
        if block_idx not in all_merged_blocks:
            continue
        
        # If this block has table content, insert the table markdown before processing it
        if block_idx in table_blocks_set and not table_inserted:
            if in_code_block:
                markdown_lines.append("```\n")
                in_code_block = False
            markdown_lines.append(page_table_markdown)
            markdown_lines.append('\n\n')
            table_inserted = True
            # Skip all blocks that contain table content
            continue
        
        # Skip remaining table blocks (but don't re-insert table)
        if block_idx in table_blocks_set:
            continue
        
        merged_block_lines = all_merged_blocks[block_idx]
        bullet_lines = identify_bullet_lines(merged_block_lines)
        lines_info = extract_line_info(merged_block_lines, bullet_lines, page_num_1idx)
        
        # Process each line in the block
        for current in lines_info:
            line_text_clean = current['text']
            
            # Handle numbered items
            if re.match(r'^\d+\.$', line_text_clean.strip()):
                pending_numbered_item = line_text_clean.strip()
                continue
            
            if pending_numbered_item and line_text_clean:
                if not line_text_clean.startswith(('###', '####', '```', '*')):
                    line_text_clean = f"{pending_numbered_item} {line_text_clean}"
                    pending_numbered_item = None
            
            # CONSOLIDATED: Check if heading (closes code blocks automatically)
            if should_close_code_block(current):
                if in_code_block:
                    markdown_lines.append("```\n")
                    in_code_block = False
                
                heading_level = get_heading_level(current)
                heading_marker = '#' * heading_level
                markdown_lines.append(f"{heading_marker} {line_text_clean}\n")
                continue
            
            # Code/Output handling
            if is_code_block_text(current):
                is_code = is_code_line(line_text_clean, current['font'])
                
                if is_code:
                    # Start or continue python code block with REPL prompt
                    if not in_code_block:
                        markdown_lines.append("```python\n")
                        in_code_block = True
                    markdown_lines.append(line_text_clean + "\n")
                else:
                    # This is output (code font text after >>> lines)
                    if in_code_block:
                        markdown_lines.append(line_text_clean + "\n")
                    else:
                        # Orphaned code-font text - treat as regular code
                        if not in_code_block:
                            markdown_lines.append("```python\n")
                            in_code_block = True
                        markdown_lines.append(line_text_clean + "\n")
            
            else:
                # Regular body text (non-code-font) - apply hierarchical indentation
                if in_code_block:
                    markdown_lines.append("```\n")
                    in_code_block = False
                
                # Determine indentation from X coordinate (strictly based on X_INDENT_RANGES)
                indent_x_pos = current['bullet_x_pos'] if current['has_bullet'] and current['bullet_x_pos'] is not None else current['x_pos']
                indent_spaces = get_indent_from_x(indent_x_pos, current['bullet_char'])
                
                indent_str = ' ' * indent_spaces
                
                # Apply bullet marker if present
                if current['has_bullet']:
                    markdown_lines.append(f"{indent_str}* {line_text_clean}\n")
                else:
                    markdown_lines.append(f"{indent_str}{line_text_clean}\n")
    
    # Determine final state for next page
    final_state = {
        'in_code_block': in_code_block,
        'pending_lines': []
    }
    
    final_text = "".join(markdown_lines)
    final_text = re.sub(r'\n{3,}', '\n\n', final_text)
    
    return {
        'markdown': final_text.strip(),
        'state': final_state
    }

## Bullet Conversion & File Export Functions

**convert_bullets_in_text()**:
- Post-processing step that converts any leftover PDF bullet characters to markdown format
- Replaces • and ◦ with * (markdown bullet syntax)
- Keeps the existing leading whitespace for • bullets
- Forces at least 4 spaces of indentation for ◦ (nested) bullets

**export_raw_page_metadata()**:
- Debugging utility that exports raw PDF metadata for a specific page as JSON
- Helps diagnose PDF parsing issues
- Outputs: raw text dict structure from PyMuPDF with all font/position info

In [11]:
# Bullet Conversion & File Export Functions

def convert_bullets_in_text(text):
    """
    Convert old bullet characters (• and ◦) to markdown format (* with proper indentation).
    
    This is a post-processing step for any bullets that weren't converted during parsing.
    """
    lines = text.split('\n')
    converted_lines = []
    
    for line in lines:
        # Replace • with * (main bullet)
        if '•' in line or '\u2022' in line:
            match = re.match(r'^(\s*)([•\u2022])\s*(.*)', line)
            if match:
                spaces, bullet, content = match.groups()
                converted_lines.append(f"{spaces}* {content}")
            else:
                line = line.replace('•', '*').replace('\u2022', '*')
                converted_lines.append(line)
        
        # Replace ◦ with * (nested bullet) - ensure 4 spaces
        elif '◦' in line or '\u25E6' in line:
            match = re.match(r'^(\s*)([◦\u25E6])\s*(.*)', line)
            if match:
                spaces, bullet, content = match.groups()
                if len(spaces) < 4:
                    spaces = '    '  # Force 4 spaces for nested
                converted_lines.append(f"{spaces}* {content}")
            else:
                line = line.replace('◦', '*').replace('\u25E6', '*')
                converted_lines.append(line)
        else:
            converted_lines.append(line)
    
    return '\n'.join(converted_lines)

def export_raw_page_metadata(pdf_path, page_num_1idx, output_filename):
    """Export raw PDF metadata for a specific page (for debugging)."""
    try:
        doc = fitz.open(pdf_path)
        if 0 < page_num_1idx <= doc.page_count:
            page_num_0idx = page_num_1idx - 1
            page = doc.load_page(page_num_0idx)
            data = page.get_text("dict")
            with open(output_filename, "w", encoding="utf-8") as f:
                f.write(json.dumps(data, indent=4))
            print(f"✨ Successfully exported raw metadata for page {page_num_1idx} to '{output_filename}'.")
        else:
            print(f"Error: Page number {page_num_1idx} is out of bounds.")
    except FileNotFoundError:
        print(f"Error: PDF file not found at '{pdf_path}'.")
    except Exception as e:
        print(f"Error during metadata export: {e}")
    finally:
        if 'doc' in locals() and doc:
            doc.close()
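
One caveat with dumping the dict directly: as far as I know, image blocks in PyMuPDF's `dict` output carry the raw image as `bytes`, which `json.dumps` cannot serialize, so on pages with images the export above would end in the generic error branch. The sketch below shows one possible workaround using the `default` hook of `json.dumps`; `json_safe` is a hypothetical helper and the sample dict is synthetic.

```python
import json


def json_safe(obj):
    """Fallback for json.dumps: describe binary payloads instead of failing."""
    if isinstance(obj, (bytes, bytearray)):
        return f"<{len(obj)} bytes>"
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


# Synthetic stand-in for page.get_text("dict") with one image block.
sample = {'blocks': [
    {'type': 0, 'lines': [{'spans': [{'text': 'hello', 'size': 10.5}]}]},
    {'type': 1, 'ext': 'png', 'image': b'\x89PNG\r\n'},
]}

dumped = json.dumps(sample, indent=4, default=json_safe)
print(dumped)
```

An alternative worth checking against the PyMuPDF documentation is `page.get_text("json")`, which returns a JSON string directly.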

## Extract Table Rows from PDF Blocks

**extract_table_rows_from_blocks()**:
- Extracts table rows from a range of PDF blocks
- Groups PDF spans by Y coordinate (same Y = same row, within a 5-point tolerance)
- For each row, groups cells by X coordinate (rounded to the nearest 10 points)
- Identifies all unique column positions across the table
- Pads all rows to same column count with empty strings
- Returns: List of rows, each row is a list of cell strings
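
The grouping steps above can be sketched compactly on synthetic data. This is a simplified illustration under stated assumptions, not the notebook function: `rows_from_spans` is a hypothetical name, and it takes ready-made `(y, x, text)` triples instead of PDF blocks.

```python
def rows_from_spans(spans, y_tol=5, x_bucket=10):
    """Group (y, x, text) triples into padded table rows."""
    # Group by Y: a span joins the first existing row within y_tol points.
    y_groups = {}
    for y, x, text in sorted(spans, key=lambda s: s[0]):
        key = next((k for k in y_groups if abs(y - k) < y_tol), y)
        y_groups.setdefault(key, []).append((x, text))

    def bucket(x):
        # Snap X to the nearest multiple of x_bucket to get a column key.
        return round(x / x_bucket) * x_bucket

    # Column positions: every X bucket seen anywhere in the table.
    columns = sorted({bucket(x) for cells in y_groups.values() for x, _ in cells})

    rows = []
    for key in sorted(y_groups):
        cells = {}
        for x, text in y_groups[key]:
            cells[bucket(x)] = cells.get(bucket(x), "") + text
        row = [cells.get(col, "").strip() for col in columns]  # pad missing cells
        if any(row):
            rows.append(row)
    return rows


# Invented coordinates: a header row and a data row with one empty cell.
spans = [
    (100.0, 72.0, "Name"), (100.4, 201.0, "Type"), (100.0, 322.0, "Default"),
    (115.0, 72.0, "host"), (115.2, 201.0, "str"),
]
table_rows = rows_from_spans(spans)
for row in table_rows:
    print(row)
```

Note that Python's `round()` rounds exact halves to the nearest even integer, so X positions of 25 and 35 land in the 20 and 40 buckets respectively.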

In [12]:
def extract_table_rows_from_blocks(all_blocks_with_raw_lines, table_start, table_end):
    """
    Extract table rows by:
    1. Collecting all lines from all blocks with their Y coordinates
    2. Grouping lines by Y coordinate (same Y = same row)
    3. For each row, extracting cells by X position
    4. Ensuring all rows have consistent column count (padding with empty strings)
    
    Args:
        all_blocks_with_raw_lines: List of (block_idx, raw_lines) tuples
        table_start, table_end: Block range
    
    Returns: List of rows, each row is a list of cells
    """
    block_dict = dict(all_blocks_with_raw_lines)
    
    # Step 1: Collect all lines with their Y and X coordinates
    all_lines_with_coords = []
    for block_idx in range(table_start, table_end + 1):
        if block_idx not in block_dict:
            continue
        
        for line in block_dict[block_idx]:
            if not line.get('spans'):
                continue
            
            # Each span in a line can be at different X
            for span in line.get('spans', []):
                y = span.get('origin', [0, 0])[1]
                x = span.get('origin', [0, 0])[0]
                text = span.get('text', '')
                all_lines_with_coords.append((y, x, text))
    
    # Step 2: Group by Y coordinate (within 5px tolerance)
    y_groups = {}
    for y, x, text in sorted(all_lines_with_coords, key=lambda item: item[0]):
        # Find matching Y group
        matched_y = None
        for existing_y in y_groups:
            if abs(y - existing_y) < 5:  # Y tolerance
                matched_y = existing_y
                break
        
        if matched_y is None:
            matched_y = y
            y_groups[matched_y] = []
        
        y_groups[matched_y].append((x, text))
    
    # Step 3: Identify all unique X column positions across entire table
    all_x_positions = set()
    for y_key in y_groups.keys():
        cells_list = y_groups[y_key]
        for x, text in cells_list:
            x_key = round(x / 10) * 10  # Round X to nearest 10px
            all_x_positions.add(x_key)
    
    sorted_x_positions = sorted(all_x_positions)
    
    # Step 4: Build rows by organizing cells by X position
    rows = []
    for y_key in sorted(y_groups.keys()):
        cells_list = y_groups[y_key]
        
        # Group by X position (cells in same X column)
        x_groups = {}
        for x, text in cells_list:
            x_key = round(x / 10) * 10  # Round X to nearest 10px
            if x_key not in x_groups:
                x_groups[x_key] = []
            x_groups[x_key].append(text)
        
        # Create row with cells for ALL column positions (fill missing with empty strings)
        row = []
        for x_key in sorted_x_positions:
            if x_key in x_groups:
                cell_text = "".join(x_groups[x_key]).strip()
            else:
                cell_text = ""
            row.append(cell_text)
        
        if any(row):  # Only add if row has at least one non-empty cell
            rows.append(row)
    
    return rows

In [13]:
def detect_table_blocks_in_page(all_blocks_with_raw_lines):
    """
    Detect table blocks by identifying aligned X-coordinates across multiple rows.
    
    A table is detected when:
    1. There are 2+ X positions (multi-column layout)
    2. Font size is ~10.028pt (table text size)
    3. X gap between columns >= 100px (distinct columns, not just formatting)
    4. Footer text is EXCLUDED via regex pattern, not by size
    
    Footer pattern: "Package for Python User Guide, Release xx.xx"
    This ensures robust footer filtering across future PDF updates.
    
    Args:
        all_blocks_with_raw_lines: List of (block_idx, raw_lines) tuples
    
    Returns:
        List of (table_start_block, table_end_block) tuples for detected tables
    """
    # Regex to detect footer/header text (version-agnostic)
    FOOTER_PATTERN = re.compile(r'Package for Python User Guide,?\s+Release\s+[\d.]+', re.IGNORECASE)
    TABLE_FONT_SIZE = 10.028  # Table text font size (with tolerance)
    FONT_SIZE_TOLERANCE = 0.05
    MIN_X_GAP = 100  # Minimum gap between columns
    
    # Step 1: Find blocks that contain table-like text
    table_candidate_blocks = []
    
    for block_idx, raw_lines in all_blocks_with_raw_lines:
        if not raw_lines:
            continue
        
        # Check if this block has table-like characteristics
        has_table_font = False
        x_positions = set()
        
        for line in raw_lines:
            for span in line.get('spans', []):
                # Filter out footer text via regex
                span_text = span.get('text', '')
                if FOOTER_PATTERN.search(span_text):
                    continue  # Skip footer lines
                
                # Check font size
                span_size = span.get('size', 0)
                if abs(span_size - TABLE_FONT_SIZE) < FONT_SIZE_TOLERANCE:
                    has_table_font = True
                    x = span.get('origin', [0, 0])[0]
                    x_key = round(x / 10) * 10  # Round X to nearest 10px
                    x_positions.add(x_key)
        
        # Determine if this is a table block
        if has_table_font and len(x_positions) >= 2:
            # Check if columns are distinct (gap >= MIN_X_GAP)
            sorted_x = sorted(x_positions)
            is_table = False
            for i in range(len(sorted_x) - 1):
                if sorted_x[i+1] - sorted_x[i] >= MIN_X_GAP:
                    is_table = True
                    break
            
            if is_table:
                table_candidate_blocks.append(block_idx)
    
    # Step 2: Group consecutive table blocks into ranges
    if not table_candidate_blocks:
        return []
    
    table_ranges = []
    range_start = table_candidate_blocks[0]
    range_end = table_candidate_blocks[0]
    
    for block_idx in table_candidate_blocks[1:]:
        if block_idx == range_end + 1:
            range_end = block_idx
        else:
            table_ranges.append((range_start, range_end))
            range_start = block_idx
            range_end = block_idx
    
    table_ranges.append((range_start, range_end))
    return table_ranges
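
Step 2 of the function above (grouping consecutive candidate blocks into ranges) can be looked at in isolation. The sketch below is an equivalent standalone formulation, not the notebook's own code; `group_consecutive` and the sample indices are hypothetical.

```python
def group_consecutive(indices):
    """Collapse sorted block indices into inclusive (start, end) ranges."""
    ranges = []
    for idx in indices:
        if ranges and idx == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], idx)  # extend the current run
        else:
            ranges.append((idx, idx))          # start a new run
    return ranges


candidate_blocks = [3, 4, 5, 9, 12, 13]  # hypothetical candidate block indices
block_ranges = group_consecutive(candidate_blocks)
print(block_ranges)  # [(3, 5), (9, 9), (12, 13)]
```

A single non-table block between two table blocks therefore splits one table into two ranges.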

## Format Markdown Table & Convert Tables

**format_markdown_table()**:
- Converts extracted table rows to markdown format (| separated columns)
- Deduplicates consecutive identical rows (header repetition in PDFs)
- Pads all rows to same column count with empty strings
- Returns properly formatted markdown table

**detect_and_convert_tables_in_page()**:
- Master function that orchestrates table detection and conversion
- Detects table block ranges, extracts rows, converts to markdown
- Combines multiple tables with newlines between them
- Returns: (markdown_table_str, set of block_references)
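
To make the expected output shape concrete, here is a small self-contained sketch of the row-to-markdown step. It assumes rows are plain lists of strings; `rows_to_markdown` and the sample rows are hypothetical, and unlike the notebook function it only lowercases cells when comparing rows (it does not collapse whitespace).

```python
def rows_to_markdown(rows):
    """Render rows (lists of cell strings) as a pipe table, first row as header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    # Pad on copies so the caller's lists are left untouched.
    padded = [list(row) + [""] * (width - len(row)) for row in rows]
    # Drop a row when it repeats the one directly before it (case-insensitive).
    unique = [row for i, row in enumerate(padded)
              if i == 0 or [c.lower() for c in row] != [c.lower() for c in padded[i - 1]]]
    header, *body = unique
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("-" * max(1, len(c)) for c in header) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


# Hypothetical rows with a repeated header, as happens at a page break.
sample_rows = [["Argument", "Description"],
               ["Argument", "Description"],
               ["data", "Input DataFrame"]]
markdown = rows_to_markdown(sample_rows)
print(markdown)
```

The repeated header collapses to one, and the separator row takes its dash counts from the header cell widths.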

In [14]:
def format_markdown_table(rows):
    """
    Convert extracted table rows to markdown table format.
    
    Handles:
    1. Deduplication of consecutive identical rows (header repetition)
    2. Proper column alignment
    3. Markdown pipe separators
    
    Args:
        rows: List of lists, each inner list is a row of cells
    
    Returns:
        Formatted markdown table string
    """
    if not rows or all(len(row) == 0 for row in rows):
        return ""
    
    def normalize_row(row):
        """Normalize row for deduplication: lowercase, collapse whitespace."""
        normalized_cells = [re.sub(r'\s+', ' ', cell.strip().lower()) for cell in row]
        return normalized_cells
    
    # Deduplicate consecutive identical rows (header rows appearing multiple times)
    deduplicated = []
    prev_normalized = None
    
    for row in rows:
        curr_normalized = normalize_row(row)
        if curr_normalized != prev_normalized:
            deduplicated.append(row)
            prev_normalized = curr_normalized
    
    rows = deduplicated
    if not rows:
        return ""
    
    # Determine column count
    num_cols = max(len(row) for row in rows)
    if num_cols == 0:
        return ""
    
    # Pad every row to the same column count (on copies, so the caller's lists are not mutated)
    rows = [row + [""] * (num_cols - len(row)) for row in rows]
    
    # Build markdown table
    markdown_lines = []
    
    # Header row
    markdown_lines.append("| " + " | ".join(rows[0]) + " |")
    
    # Separator row
    separator = "| " + " | ".join(["-" * max(1, len(cell)) for cell in rows[0]]) + " |"
    markdown_lines.append(separator)
    
    # Data rows
    for row in rows[1:]:
        markdown_lines.append("| " + " | ".join(row) + " |")
    
    return "\n".join(markdown_lines)

In [15]:
def detect_and_convert_tables_in_page(all_blocks_with_raw_lines, page_num_1idx):
    """
    Master function: Detect tables on page and convert to markdown format.
    
    Args:
        all_blocks_with_raw_lines: List of (block_idx, raw_lines) tuples
        page_num_1idx: Page number (1-indexed, for debugging)
    
    Returns:
        Tuple of (markdown_table_str, set of (block_idx, 0) references, one per block covered by a table)
    """
    # Detect table block ranges
    table_ranges = detect_table_blocks_in_page(all_blocks_with_raw_lines)
    
    if not table_ranges:
        return "", set()
    
    # Extract and convert each table range
    all_table_markdown = []
    all_table_refs = set()
    
    for table_start, table_end in table_ranges:
        # Extract rows from this table range
        rows = extract_table_rows_from_blocks(all_blocks_with_raw_lines, table_start, table_end)
        
        # Convert to markdown
        table_md = format_markdown_table(rows)
        
        if table_md:
            all_table_markdown.append(table_md)
            # Track which blocks contain this table
            for block_idx in range(table_start, table_end + 1):
                all_table_refs.add((block_idx, 0))  # Store block reference
    
    # Combine all tables with newlines between them
    combined_markdown = "\n\n".join(all_table_markdown) if all_table_markdown else ""
    
    return combined_markdown, all_table_refs

## Table Detection and Markdown Conversion

**detect_table_blocks()**:
- Identifies table blocks by detecting aligned X-coordinates across multiple rows
- A table is detected when 2+ rows share similar X-coordinate boundaries
- Returns list of table blocks and their column boundaries

**extract_table_rows()**:
- Extracts rows from a detected table block
- Consolidates multi-line cells using proper line breaks
- Returns structured row data with column values

**is_table_header_row()**:
- Detects table headers by checking for bold font and specific font sizes
- Headers are typically 10pt Arial-Bold in this PDF

**format_markdown_table()**:
- Converts extracted table data to markdown table format
- Handles multi-line cells with `<br>` tags for proper markdown rendering
- Ensures clean, readable output for GitHub Copilot

**detect_and_convert_tables()**:
- Master function that processes all blocks in a page
- Detects tables, extracts rows, formats as markdown
- Integrates seamlessly with existing parsing pipeline


In [16]:
# Main Extraction & Chapter Processing

def extract_and_split_by_chapter(pdf_path, chapter_map, output_dir="teradataml_user_guide"):
    """
    Extract chapters from PDF and split into markdown files.
    Handles multi-page code block continuity within chapters.
    
    Args:
        pdf_path: Path to PDF file
        chapter_map: List of (title, start_page, end_page) tuples
        output_dir: Output directory for markdown files
    """
    os.makedirs(output_dir, exist_ok=True)

    try:
        doc = fitz.open(pdf_path)
        total_pages = doc.page_count

        for idx, (title, start_page_1idx, end_page_1idx) in enumerate(chapter_map):
            start_page_0idx = start_page_1idx - 1
            end_page_0idx = end_page_1idx - 1
            
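            if start_page_0idx < 0 or end_page_0idx >= total_pages:
                # Skip chapters whose page range falls outside the document, but report it
                print(f"Warning: skipping '{title}' (pages {start_page_1idx}-{end_page_1idx} out of range).")
                continue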

            chapter_text = [f"# {title}\n\n"] 
            page_state = {'in_code_block': False, 'pending_lines': []}
            
            # Process each page in chapter
            for page_num_0idx in range(start_page_0idx, end_page_0idx + 1):
                page = doc.load_page(page_num_0idx)
                result = parse_page_with_metadata(page, page_num_0idx + 1, page_state)
                
                markdown_text = result['markdown']
                page_state = result['state']
                
                chapter_text.append(markdown_text)
                chapter_text.append('\n')
            
            # Close any open code block at end of chapter
            if page_state['in_code_block']:
                chapter_text.append("```\n")
            
            # Post-process: convert any remaining old bullet characters
            final_text = "".join(chapter_text).strip()
            final_text = convert_bullets_in_text(final_text)
            
            # Determine output filename with prefix
            if idx == 0:
                prefix = "00"
            elif idx < APPENDIX_START_INDEX:
                prefix = f"{idx:02d}"
            else:
                appendix_index = idx - APPENDIX_START_INDEX
                prefix = chr(ord('A') + appendix_index)
            
            safe_title = sanitize_title(title)
            output_filename = os.path.join(output_dir, f"{prefix}_{safe_title}.md")
            
            # Write to file
            with open(output_filename, "w", encoding="utf-8") as f:
                f.write(final_text)
            print(f"✓ Generated {output_filename}")
            
    except Exception as e:
        print(f"Error during extraction: {e}")
    finally:
        if 'doc' in locals() and doc:
            doc.close()

## Helper Functions: Title Sanitization & Noise Removal

**sanitize_title()**:
- Cleans chapter titles for safe filenames
- Removes chapter/appendix prefixes
- Removes special characters, keeps word chars + parentheses/brackets
- Replaces spaces with underscores
- Example: "Chapter 1: DataFrames Setup" → "DataFrames_Setup"

**clean_page_noise()**:
- Strips header/footer noise from extracted text
- Removes file name header ("Teradata® Package for Python User Guide...")
- Removes page numbers and chapter title fragments
- Cleans up multiple newlines and non-breaking spaces
- Post-processing to ensure clean, readable markdown

**is_code_line()**:
- Determines if a line is actual code (vs output)
- Checks if line is in Consolas font and starts with >>> or ...
- Used to distinguish between code prompts and output in code blocks
- Returns True only if entire line is code (REPL prompt)
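
The three sanitization steps can be traced on sample titles. This sketch reuses the same regular expressions as the notebook's `sanitize_title()` under a different name (`sanitize_title_sketch`) so it can run on its own; the appendix title is made up for illustration.

```python
import re


def sanitize_title_sketch(title):
    # 1. Drop a leading "Chapter N:" or "Appendix X:" prefix.
    title = re.sub(r'^(Chapter\s\d{1,2}|Appendix\s[A-E]):\s*', '', title, flags=re.IGNORECASE)
    # 2. Remove everything except word characters, whitespace, (), [] and hyphens.
    title = re.sub(r'[^\w\s()\[\]-]', '', title)
    # 3. Collapse runs of whitespace/hyphens into single underscores.
    return re.sub(r'[\s-]+', '_', title).strip('_')


chapter_name = sanitize_title_sketch("Chapter 1: DataFrames Setup and Basics")
appendix_name = sanitize_title_sketch("Appendix B: Time-Series Notes (Draft)")  # hypothetical title
print(chapter_name)   # DataFrames_Setup_and_Basics
print(appendix_name)  # Time_Series_Notes_(Draft)
```

Parentheses survive step 2, which is why some generated filenames contain them.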


In [17]:
def sanitize_title(title):
    """
    Clean chapter titles for use in filenames.
    
    Steps:
    1. Remove leading chapter/appendix prefix (e.g., "Chapter 1:", "Appendix A:")
    2. Remove special characters (keep only word chars, spaces, parentheses, brackets, hyphens)
    3. Replace spaces/hyphens with underscores
    
    Args:
        title: Raw chapter title from PDF
    
    Returns:
        Sanitized title safe for filenames
    
    Example:
        "Chapter 1: DataFrames Setup and Basics" → "DataFrames_Setup_and_Basics"
    """
    title = re.sub(r'^(Chapter\s\d{1,2}|Appendix\s[A-E]):\s*', '', title, flags=re.IGNORECASE)
    title = re.sub(r'[^\w\s()\[\]-]', '', title)
    title = re.sub(r'[\s-]+', '_', title).strip('_')
    return title

In [18]:
def clean_page_noise(text, page_number):
    """
    Strip header/footer noise and non-breaking spaces from extracted text.
    
    Removes:
    1. File header ("Teradata® Package for Python User Guide, Release 20.00")
    2. Page numbers in headers/footers
    3. Chapter title fragments used in headers
    4. Multiple consecutive newlines
    5. Non-breaking spaces (U+00A0)
    
    Args:
        text: Raw extracted text from PDF page
        page_number: Current page number (1-indexed, used to find page number in headers)
    
    Returns:
        Clean text with noise removed
    """
    FILE_NAME_REGEX = re.escape("Teradata® Package for Python User Guide, Release 20.00")
    text = re.sub(FILE_NAME_REGEX, '', text)
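    # Remove the page number only when it stands alone on a line; an unanchored
    # pattern would also strip the same digits from body text and code samples.
    page_num_regex = re.escape(str(page_number))
    text = re.sub(rf'^[ \t]*{page_num_regex}[ \t]*$', '', text, flags=re.MULTILINE)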
    chapter_title_fragments = [
        r'Context to Teradata Vantage', r'teradataml DataFrame Column', r'Executing Python Functions Inside Database',
        r'DataFrames for Tables and Views', r'teradataml Window Aggregates', r'teradataml Options',
        r'teradataml Utility and General Functions', r'Engine 20', r'Table and Views',
        r'Installing, Uninstalling, and Upgrading Teradata Package for Python'
    ]
    fragment_pattern = r'|'.join(re.escape(f) for f in chapter_title_fragments)
    noise_regex = re.compile(
        r'^\s*(\d{1,2}:\s*.*(?:' + fragment_pattern + r').*|.*(?:' + fragment_pattern + r')\s*\d{1,2}\s*)\s*$', 
        flags=re.IGNORECASE | re.MULTILINE
    )
    text = re.sub(noise_regex, '', text).strip()
    text = re.sub(r'\n\s*\n', '\n\n', text).strip()
    text = text.replace('\u00a0', ' ')
    return text

In [19]:
def is_code_line(line_text_clean, line_font):
    """
    Determine if a line is actual Python code (vs output from code execution).
    
    A line is considered code if:
    1. It's in the CODE_FONT (Consolas)
    2. It starts with Python REPL prompt (>>> or ...)
    
    Used to distinguish between:
    - Code: ">>> result = df.head()"
    - Output: "Name  Age  Score"
    
    Args:
        line_text_clean: The text content of the line (already trimmed)
        line_font: Font name extracted from PDF (e.g., 'Consolas', 'Arial')
    
    Returns:
        True if line is code (has REPL prompt), False if it's output
    """
    if line_font != CODE_FONT:
        return False
    return line_text_clean.startswith(('>>>', '...'))

## Main Extraction & Chapter Processing

**extract_and_split_by_chapter()**:
- Main orchestration function that processes the entire PDF
- For each chapter in CHAPTER_MAP:
  1. Initialize page state tracker
  2. Parse each page using parse_page_with_metadata()
  3. Track code block state across pages for continuity
  4. Close any open code blocks at chapter end
  5. Post-process to convert remaining bullet characters
  6. Generate output filename with prefix (00, 01-25 for chapters, A-E for appendices)
  7. Write markdown to file

**Outputs**: One markdown file per chapter in `teradataml_user_guide/` directory
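
The filename prefix scheme in step 6 can be sketched as follows. The value 26 for the appendix start index is an assumption inferred from the "01-25 for chapters" wording above (the notebook's `APPENDIX_START_INDEX` is defined elsewhere), and `chapter_prefix` is a hypothetical helper.

```python
APPENDIX_START = 26  # assumption: index 0 is front matter, 1-25 are chapters


def chapter_prefix(idx, appendix_start=APPENDIX_START):
    """Return '00'-'25' for front matter and chapters, 'A', 'B', ... for appendices."""
    if idx < appendix_start:
        return f"{idx:02d}"
    return chr(ord('A') + idx - appendix_start)


prefixes = [chapter_prefix(i) for i in (0, 7, 25, 26, 30)]
print(prefixes)  # ['00', '07', '25', 'A', 'E']
```

Zero-padded numeric prefixes keep the generated files in chapter order when sorted by name.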

In [20]:
# Execution: Extract and Process PDF

if os.path.exists(PDF_FILE):
    extract_and_split_by_chapter(PDF_FILE, CHAPTER_MAP)
    print("\n✅ Extraction complete with all bullet characters converted to markdown!")
else:
    print(f"Error: PDF file not found at '{PDF_FILE}'.")

✓ Generated teradataml_user_guide/00_Table_of_Contents.md
✓ Generated teradataml_user_guide/01_Introduction_to_Teradata_Package_for_Python.md
✓ Generated teradataml_user_guide/02_Installing_Uninstalling_and_Upgrading_Teradata_Package_for_Python.md
✓ Generated teradataml_user_guide/03_teradataml_Components.md
✓ Generated teradataml_user_guide/04_DataFrames_Setup_and_Basics_(Sources_Non_Default_DB_UAF).md
✓ Generated teradataml_user_guide/05_DataFrame_Manipulation_(Core_API).md
✓ Generated teradataml_user_guide/06_DataFrame_Metadata_Rotation_Saving_and_Export.md
✓ Generated teradataml_user_guide/07_Executing_Python_Functions_Inside_Database_Engine_20.md
✓ Generated teradatam

## Execute PDF Extraction Workflow

Runs the main extraction orchestration:
1. Checks if PDF file exists at PDF_FILE path
2. Calls extract_and_split_by_chapter() to parse entire PDF
3. Processes all chapters per CHAPTER_MAP
4. Generates markdown files in teradataml_user_guide/ directory

**Output**: Confirmation message + list of generated chapter files

In [21]:
"""
Consolidate Markdown Tables: Merge split tables across pages

Algorithm:
- Find table blocks (header line followed by separator |---|...)
- For consecutive table blocks, if headers match and columns align, merge them
- For single contiguous blocks with repeated headers inside, remove the duplicate headers
- Only processes markdown files (post-extraction consolidation)
"""
import re
from pathlib import Path

TABLE_HEADER_RE = re.compile(r"^\s*\|.*\|\s*$")
SEPARATOR_RE = re.compile(r"^\s*\|(?:\s*-+\s*\|)+\s*$")

def normalize_row(row):
    # Normalize header row for comparison: lowercase, collapse whitespace
    return re.sub(r"\s+", " ", row.strip().lower())

def find_table_blocks(lines):
    blocks = []  # list of (start_idx, end_idx)
    i = 0
    n = len(lines)
    while i < n:
        if TABLE_HEADER_RE.match(lines[i]):
            # candidate header, check next line is separator
            if i + 1 < n and SEPARATOR_RE.match(lines[i+1]):
                start = i
                j = i + 2
                # extend until a non-table line
                while j < n and lines[j].strip().startswith('|'):
                    j += 1
                end = j - 1
                blocks.append((start, end))
                i = j
                continue
        i += 1
    return blocks

## Deduplicate & Merge Table Blocks

**dedupe_duplicate_headers_within_block()**:
- Removes repeated header+separator pairs within a single table block
- Happens when PDF has headers that repeat due to page breaks
- Returns: (deduplicated_lines, changed_flag)

**merge_blocks_in_lines()**:
- Merges consecutive table blocks with matching headers and column count
- Only merges if intermediate lines are blank (no content between tables)
- Iteratively processes until no more merges are possible
- Returns: True if any merges occurred
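
The intended effect of both functions can be shown on a tiny example. The sketch below is a single-pass alternative written for illustration, not the notebook's in-place splicing approach: it drops a repeated header and separator whenever only blank lines (or nothing) separate it from the previous rows of the same table. The name `merge_split_tables` and the sample lines are hypothetical.

```python
import re

SEP = re.compile(r"^\s*\|(?:\s*-+\s*\|)+\s*$")


def merge_split_tables(lines):
    """Merge tables separated only by blank lines when their headers match."""
    out = []
    header = None          # normalized header of the table currently being written
    pending_blanks = []    # blank lines seen since the last non-blank line
    i = 0
    while i < len(lines):
        line = lines[i]
        is_header = (line.lstrip().startswith("|") and i + 1 < len(lines)
                     and SEP.match(lines[i + 1]))
        if is_header and header is not None and line.strip().lower() == header:
            # Same header again: drop the blanks, the header and its separator.
            pending_blanks = []
            i += 2
            continue
        if line.strip() == "":
            pending_blanks.append(line)
        else:
            out.extend(pending_blanks)
            pending_blanks = []
            out.append(line)
            if is_header:
                header = line.strip().lower()
            elif not line.lstrip().startswith("|"):
                header = None  # ordinary text ends the current table
        i += 1
    out.extend(pending_blanks)
    return out


# Hypothetical table split in two by a page break.
split_table = ["| A | B |", "| - | - |", "| 1 | 2 |", "",
               "| A | B |", "| - | - |", "| 3 | 4 |", "", "Some text"]
merged = merge_split_tables(split_table)
print("\n".join(merged))
```

Tables with different headers, or with text between them, are left untouched.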

In [22]:
def dedupe_duplicate_headers_within_block(block_lines):
    """
    Given a list of lines that form a contiguous markdown table block,
    remove any repeated header+separator pairs that appear later in the block.
    """
    if len(block_lines) < 3:
        return block_lines, False

    header = normalize_row(block_lines[0])
    changed = False
    i = 2  # start after header and separator
    out_lines = block_lines[:2]
    while i < len(block_lines):
        line = block_lines[i]
        # if this line looks like the header and next is a separator -> skip both
        next_line = block_lines[i+1] if i+1 < len(block_lines) else ''
        if normalize_row(line) == header and SEPARATOR_RE.match(next_line):
            # skip header+separator
            changed = True
            i += 2
            # continue without adding these lines
            continue
        out_lines.append(line)
        i += 1

    return out_lines, changed

def merge_blocks_in_lines(lines):
    changed = False
    blocks = find_table_blocks(lines)
    # iterate and attempt merges
    i = 1
    while i < len(blocks):
        prev_start, prev_end = blocks[i-1]
        cur_start, cur_end = blocks[i]
        # check intermediate lines between prev_end and cur_start are blank
        inter = ''.join(lines[prev_end+1:cur_start]).strip()
        if inter != '':
            i += 1
            continue
        # get header rows
        header_prev = normalize_row(lines[prev_start])
        header_cur = normalize_row(lines[cur_start])
        if header_prev != header_cur:
            i += 1
            continue
        # count columns by splitting header on | and ignoring empty edges
        def col_count(row):
            parts = [p.strip() for p in row.strip().strip('|').split('|')]
            return len(parts)
        if col_count(lines[prev_start]) != col_count(lines[cur_start]):
            i += 1
            continue
        # perform merge: replace the blank gap and the whole cur block with cur's
        # data rows (header and separator dropped), so they directly follow prev.
        # Inserting first and then deleting a cur-sized slice would start the
        # deletion at the blank gap and leave cur's last rows duplicated.
        data_rows = lines[cur_start+2:cur_end+1]
        lines[prev_end+1:cur_end+1] = data_rows
        changed = True
        # rebuild blocks and restart scanning
        blocks = find_table_blocks(lines)
        i = 1
    return changed

## Process Files & Execute Consolidation

**process_file()**:
- Main consolidation processor for a single markdown file
- Merges consecutive table blocks first
- Then dedupes headers within each block
- Writes changes back to file if any modifications occurred
- Returns: True if file was modified

**consolidate_md_tables()**:
- Entry point: processes all markdown files in a directory
- Calls process_file() on each markdown file
- Tracks and reports total files changed
- Returns: count of files that were modified

In [23]:
def process_file(path: Path):
    text = path.read_text(encoding='utf-8')
    lines = text.splitlines(keepends=True)
    changed = merge_blocks_in_lines(lines)
    # Also dedupe repeated header+separator pairs that may appear inside a single block
    blocks = find_table_blocks(lines)
    for start, end in reversed(blocks):
        block_lines = lines[start:end+1]
        new_block, block_changed = dedupe_duplicate_headers_within_block(block_lines)
        if block_changed:
            lines[start:end+1] = new_block
            changed = True
    if changed:
        path.write_text(''.join(lines), encoding='utf-8')
    return changed

def consolidate_md_tables(dir_path):
    """Process all markdown files in directory to consolidate split tables."""
    target = Path(dir_path)
    files = list(target.rglob('*.md')) if target.is_dir() else [target] if target.is_file() else []
    
    total_changed = 0
    for f in files:
        changed = process_file(f)
        if changed:
            print(f'✓ Merged tables in: {f.name}')
            total_changed += 1
    
    return total_changed

## Exporting Raw PDF Metadata for Debugging


In [25]:
export_raw_page_metadata(PDF_FILE, 1119, "page_1119_raw_metadata_debug.txt")

✨ Successfully exported raw metadata for page 1119 to 'page_1119_raw_metadata_debug.txt'.
