## Import Dependencies

Imports the required libraries for PDF parsing and file operations:
- `pymupdf` (fitz): For reading and extracting text from PDF files
- `os`: For file and directory operations
- `re`: For regex pattern matching in text processing
- `json`: For working with JSON data (e.g., exporting raw metadata)

In [131]:
import pymupdf as fitz
import os
import re
import json

## Configuration: Chapter Map & PDF Metadata

Defines the structure of the PDF document:
- **CHAPTER_MAP**: A list of tuples containing (chapter_title, start_page, end_page) for each chapter/section in the Teradata Package for Python User Guide
- **PDF_FILE**: The path to the PDF file to be parsed

The chapter map is used to split the extracted content into separate markdown files per chapter, preserving document structure.
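
The lookup implied by the chapter map can be sketched as follows. This is a minimal illustration, assuming page ranges are 1-indexed and inclusive; `SAMPLE_CHAPTER_MAP`, `chapter_for_page`, and `chapter_filename` (including its naming scheme) are hypothetical names that do not appear elsewhere in this notebook.

```python
import re

# Shortened copy of the chapter map, for illustration only:
# (title, start_page, end_page), assumed 1-indexed and inclusive.
SAMPLE_CHAPTER_MAP = [
    ("Table of Contents", 3, 6),
    ("Introduction to Teradata Package for Python", 7, 26),
    ("teradataml Components", 32, 42),
]

def chapter_for_page(page_num, chapter_map):
    """Return the title of the chapter that contains page_num, or None."""
    for title, start_page, end_page in chapter_map:
        if start_page <= page_num <= end_page:
            return title
    return None

def chapter_filename(index, title):
    """Hypothetical naming scheme: zero-padded index plus a slug of the title."""
    slug = re.sub(r'[^a-z0-9]+', '_', title.lower()).strip('_')
    return f"{index:02d}_{slug}.md"

print(chapter_for_page(10, SAMPLE_CHAPTER_MAP))
# Introduction to Teradata Package for Python
print(chapter_filename(1, SAMPLE_CHAPTER_MAP[1][0]))
# 01_introduction_to_teradata_package_for_python.md
```

Pages that fall outside every range (for example, page 30 in the shortened sample) return `None`, so a caller can decide whether to skip them or attach them to a neighboring chapter.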

In [132]:
# Configuration: Chapter Map & PDF Metadata
CHAPTER_MAP = [
    ("Table of Contents", 3, 6),
    ("Introduction to Teradata Package for Python", 7, 26),
    ("Installing, Uninstalling, and Upgrading Teradata Package for Python", 27, 31),
    ("teradataml Components", 32, 42),
    ("DataFrames Setup and Basics (Sources, Non-Default DB, UAF)", 43, 60),
    ("DataFrame Manipulation (Core API)", 61, 164),
    ("DataFrame Metadata, Rotation, Saving, and Export", 165, 202),
    ("Executing Python Functions Inside Database Engine 20", 203, 242),
    ("teradataml DataFrame Column", 243, 279),
    ("teradataml Window Aggregates", 280, 288),
    ("Context to Teradata Vantage", 289, 301),
    ("teradataml Options", 302, 319),
    ("teradataml Utility and General Functions", 320, 399),
    ("teradataml Open-Source Machine Learning Functions", 400, 457),
    ("Script Methods (SCRIPT Table Operator)", 458, 477),
    ("Series (DataFrame Column Sequence)", 478, 481),
    ("BYOM (Bring Your Own Model) Management", 482, 519),
    ("Working with Geospatial Data", 520, 571),
    ("Exploratory Data Analysis (EDA UI)", 572, 576),
    ("Plotting in teradataml", 577, 611),
    ("Hyperparameter Tuning in teradataml", 612, 693),
    ("AutoML Overview and Methods", 694, 718),
    ("AutoML Examples", 719, 1108),
    ("AutoDataPrep", 1109, 1150),
    ("Feature Store in teradataml", 1151, 1190),
    ("Using Teradata Vantage Analytic Functions with Teradata Package for Python", 1191, 1235),
    ("Appendix A: Teradata Package for Python Limitations and Considerations", 1236, 1260),
    ("Appendix B: Using teradataml with Native Object Store", 1261, 1276),
    ("Appendix C: teradataml Extension with SQLAlchemy", 1277, 1295),
    ("Appendix D: Data Type Mapping", 1296, 1297),
    ("Appendix E: Additional Information", 1298, 1301)
]

PDF_FILE = "Teradata Package for Python User Guide.pdf"

## Constants: X-Coordinate Indentation Ranges

Defines `X_INDENT_RANGES`, the lookup table that maps a line's X coordinate (in points) to a markdown indentation of 0, 2, 4, or 6 spaces. Each entry is `(x_min, x_max, indent_spaces, bullet_marker, is_bullet)`. The ranges overlap, so the order of the entries matters: the lookup function defined later returns the indentation of the first range that contains the X coordinate.

In [133]:
# X-coordinate indentation mapping (generalized across PDF)
# Rule: Indentation strictly based on X thresholds (0/2/4/6 spaces)
# Rule: Nested levels must have at least +2 more spaces than parent
X_INDENT_RANGES = [
    (50, 75, 0, None, False),        # Level 0: X ~50-75, 0 spaces, no bullet (main content)
    (59, 73, 0, '*', True),          # Level 0 black bullet: X ~59-73, 0 spaces, with bullet •
    (72, 85, 2, '*', True),          # Level 1 bullet: X ~72-85, 2 spaces, with bullet • or ◦
    (77, 90, 2, None, False),        # Level 1 content: X ~77-90, 2 spaces, no bullet
    (85, 115, 4, '*', True),         # Level 2 bullet/content: X ~85-115, 4 spaces, with bullet or no bullet
    (111, 130, 4, None, False),      # Level 2 content: X ~111-130, 4 spaces, no bullet
    (125, 150, 6, None, False),      # Level 3 content: X ~125-150, 6 spaces, no bullet
]

## Define PDF Constants

Defines all font sizes, styles, and tolerance thresholds used throughout PDF parsing:
- **Font sizes**: H2 (17.95pt), H3 (15.96pt), H4 (11.55pt in Arial-Black, or 13.965pt bold) for heading detection
- **Body text**: 10.5pt standard text size
- **Code font**: Consolas (identifies code blocks)
- **Bold flag**: 16 (bit flag for bold text)
- **Bullet characters**: BLACK_BULLETS (•, *) and WHITE_BULLETS (◦) for nested lists
- **Tolerances**: Y_MERGE_TOLERANCE (5pt), font-size TOLERANCE (0.01pt)
- **APPENDIX_START_INDEX**: 26, the zero-based index in CHAPTER_MAP where the appendices begin
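
To illustrate how these constants are typically applied, the sketch below compares a span's font size against a target within a tolerance and tests the bold bit of the span flags. The helper names `size_matches` and `is_bold` are hypothetical; the constant values are copied from the list above.

```python
TOLERANCE = 0.01
H2_SIZE = 17.954999923706055
BOLD_FLAG = 16  # bit 4 of a PyMuPDF span's 'flags' value

def size_matches(span_size, target, tolerance=TOLERANCE):
    """True when span_size is within `tolerance` points of `target`."""
    return abs(span_size - target) < tolerance

def is_bold(flags):
    """True when the bold bit is set, regardless of the other style bits."""
    return bool(flags & BOLD_FLAG)

print(size_matches(17.955, H2_SIZE))  # True
print(size_matches(18.0, H2_SIZE))    # False
print(is_bold(20))                    # True  (20 = 16 bold + 4 serifed)
print(is_bold(4))                     # False
```

Comparing with a tolerance rather than `==` matters because the sizes reported by the PDF are floats such as 17.954999923706055, not round numbers.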

In [134]:
# Constants: PDF Font Sizes, Styles, and Tolerance Thresholds
TOLERANCE = 0.01
H2_SIZE_SPECIFIC = 17.954999923706055
H3_SIZE_GENERIC = 15.960000038146973
H4_SIZE_ARIAL_BLACK = 11.550000190734863
H4_SIZE_BOLD_SECONDARY = 13.96500015258789
BODY_TEXT_SIZE = 10.5
BOLD_FLAG = 16
CODE_FONT = 'Consolas'
H4_FONT = 'Arial-Black'
Y_MERGE_TOLERANCE = 5.0  # Lines within 5 points vertically are part of same row
APPENDIX_START_INDEX = 26  # Chapter index where appendices begin

# Bullet character detection (consolidated - used in multiple places)
BULLET_CHARS = {'•', '\u2022', '◦', '\u25E6', '*', '-'}
WHITE_BULLETS = {'◦', '\u25E6'}  # Nested bullets
BLACK_BULLETS = {'•', '\u2022', '*'}  # Main level bullets



In [135]:
# Logging Setup
import logging

# Configure logging to write to a file
logging.basicConfig(
    filename='extraction_log.txt',
    level=logging.INFO,
    format='%(message)s',  # Only the message, no timestamp etc.
    filemode='w'  # Overwrite each run
)
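# Raise the root logger to WARNING so the logger.info() trace calls below are suppressed;
# comment this line out to record the full trace in extraction_log.txt
logging.getLogger().setLevel(logging.WARNING)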
logger = logging.getLogger(__name__)

## X-Coordinate Indentation Mapping

Maps X-coordinate positions from the PDF to markdown indentation levels:
- **Rule 1**: Indentation is strictly based on X thresholds (0/2/4/6 spaces)
- **Rule 2**: Nested levels must have at least +2 more spaces than parent level

Each range tuple: `(x_min, x_max, indent_spaces, bullet_marker, is_bullet)`
- x_min, x_max: X-coordinate boundaries (in points)
- indent_spaces: Number of spaces to apply (0, 2, 4, or 6)
- bullet_marker: Markdown bullet marker for the range (`*`), or None when the range has no bullet
- is_bullet: Whether this range is for bullets

This ensures consistent hierarchical indentation across all content.
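
Because the ranges overlap, the order of the entries decides which indentation wins. The sketch below shows a first-match lookup over a shortened table; `SAMPLE_RANGES` and `indent_for_x` are hypothetical names, and the sketch omits the extra fallback for large X values that the notebook's own function applies.

```python
# Shortened range table for illustration:
# (x_min, x_max, indent_spaces, bullet_marker, is_bullet)
SAMPLE_RANGES = [
    (50, 75, 0, None, False),
    (72, 85, 2, '*', True),
    (85, 115, 4, '*', True),
]

def indent_for_x(x_pos, ranges):
    """Return the indent of the FIRST range containing x_pos, else 0."""
    for x_min, x_max, indent_spaces, _marker, _is_bullet in ranges:
        if x_min <= x_pos < x_max:
            return indent_spaces
    return 0

print(indent_for_x(73.0, SAMPLE_RANGES))  # 0  (50-75 is listed before 72-85)
print(indent_for_x(80.0, SAMPLE_RANGES))  # 2
print(indent_for_x(90.0, SAMPLE_RANGES))  # 4
```

An X position of 73 falls inside both the first and second ranges, but the first entry is returned, so any reordering of the table changes the resulting indentation.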

In [136]:
def get_indent_from_x(x_pos, bullet_char=None):
    """
    Determine indentation level from X coordinate.
    
    Indentation is determined STRICTLY by X coordinate thresholds (0/2/4/6 spaces).
    Nested levels must have at least +2 more spaces than parent level.
    
    Args:
        x_pos: X coordinate from PDF (PRIMARY source of truth)
        bullet_char: bullet character - currently unused, kept for API compatibility
    
    Returns:
        indent_spaces: number of spaces for indentation (0, 2, 4, 6, etc.)
    """

    # bullet_char parameter is ignored (kept for API compatibility)
    # Check X coordinate ranges (PRIMARY source of truth)
    logger.info("log_040: Determining indentation from X coordinate")
    for x_min, x_max, indent_spaces, bullet_marker, is_bullet in X_INDENT_RANGES:
        if x_min <= x_pos < x_max:
            logger.info("log_041: x_min <= x_pos < x_max")
            return indent_spaces
    # Fallback for X positions beyond the last range (125-150 is already matched above,
    # so in practice this runs for x_pos >= 150): add 2 spaces per extra 20 points past 125
    if x_pos >= 125:
        logger.info("log_042: x_pos >= 125 fallback")
        extra_levels = max(0, int((x_pos - 125) / 20))
        result = 4 + extra_levels * 2
        return result
    return 0  # Default: main level

## PDF Block Processing: Merge Table Columns

**merge_table_columns_in_block()**:
- Merges table columns by grouping lines at similar Y coordinates within a PDF block
- Special handling: Does NOT merge bullet characters with content text (keeps them separate)
- Combines spans from multiple lines at the same Y position into single merged lines
- Returns a list of merged line dicts for table processing; the input block itself is not modified
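
The Y-grouping idea can be sketched on simplified data. This is only an illustration: it works on plain `(x, y, text)` tuples instead of PyMuPDF line dicts, it leaves out the special handling of bullet lines, and `group_by_y` is a hypothetical name.

```python
Y_MERGE_TOLERANCE = 5.0

def group_by_y(lines, tolerance=Y_MERGE_TOLERANCE):
    """Merge (x, y, text) fragments whose Y values lie within `tolerance`.

    Returns one string per row, rows ordered top to bottom and
    fragments within a row ordered left to right.
    """
    rows = []  # each entry: [anchor_y, [(x, text), ...]]
    for x, y, text in lines:
        for row in rows:
            if abs(y - row[0]) < tolerance:
                row[1].append((x, text))
                break
        else:
            # No existing row is close enough: start a new one
            rows.append([y, [(x, text)]])
    rows.sort(key=lambda row: row[0])
    return [" ".join(text for _x, text in sorted(cells)) for _y, cells in rows]

fragments = [
    (70, 100.0, "Name"), (200, 101.5, "Type"),
    (70, 120.0, "id"), (200, 119.0, "INTEGER"),
]
print(group_by_y(fragments))  # ['Name Type', 'id INTEGER']
```

Each row is anchored at the Y value of the first fragment that created it, which mirrors how the notebook keys its groups by the first matching Y coordinate.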

In [137]:
# PDF Block Processing Functions

def merge_table_columns_in_block(block):
    """
    Merge table columns by grouping lines at similar Y coordinates.
    Special handling: Do NOT merge bullet characters with content text.
    """
    logger.info("log_043: Starting merge_table_columns_in_block")
    # Identify which lines contain ONLY a bullet character
    pure_bullet_lines = set()
    for line_idx, line in enumerate(block.get('lines', [])):
        spans = line.get('spans', [])
        if len(spans) == 1 and spans[0]['text'].strip() in BULLET_CHARS:
            logger.info(f"log_044: Found pure bullet line at index {line_idx}")
            pure_bullet_lines.add(line_idx)
    # Collect all lines with their Y coordinate, excluding pure bullets from merge
    y_groups = {}
    for line_idx, line in enumerate(block.get('lines', [])):
        # Don't merge pure bullet lines - keep them separate
        if line_idx in pure_bullet_lines:
            logger.info(f"log_046: Line {line_idx} is pure bullet, separate group")
            y_groups[len(y_groups)] = [line]  # Each bullet gets its own group
            continue
        y_pos = line['bbox'][1]  # Top of bbox
        # Find matching Y group (within tolerance), but only for non-bullet lines
        matched_y = None
        for existing_y in y_groups:
            if isinstance(existing_y, (int, float)) and existing_y < 1000:
                if abs(y_pos - existing_y) < Y_MERGE_TOLERANCE:
                    logger.info(f"log_047: Line {line_idx} matched Y group {existing_y}")
                    matched_y = existing_y
                    break
        if matched_y is None:
            logger.info(f"log_048: Line {line_idx} starts new Y group {y_pos}")
            matched_y = y_pos
            y_groups[matched_y] = []
        y_groups[matched_y].append(line)
    # Sort each group by X coordinate and merge spans
    merged_lines = []
    for key in sorted(y_groups.keys(), key=lambda k: k if isinstance(k, (int, float)) else float('inf')):
        lines_in_group = y_groups[key]
        if len(lines_in_group) == 1:
            logger.info(f"log_049: Y group {key} has 1 line, no merge")
            merged_lines.append(lines_in_group[0])
        else:
            logger.info(f"log_050: Y group {key} has {len(lines_in_group)} lines, merging")
            # Multiple lines at same Y - merge them
            lines_sorted = sorted(lines_in_group, key=lambda l: l['bbox'][0])
            # Combine spans from all lines, adding spaces between them
            merged_spans = []
            for line_idx, line in enumerate(lines_sorted):
                for span_idx, span in enumerate(line['spans']):
                    span_copy = dict(span)
                    if line_idx > 0 or span_idx > 0:
                        span_copy['text'] = ' ' + span['text']
                    merged_spans.append(span_copy)
            # Create merged line
            merged_line = {
                'spans': merged_spans,
                'wmode': lines_sorted[0].get('wmode', 0),
                'dir': lines_sorted[0].get('dir', [1.0, 0.0]),
                'bbox': [
                    lines_sorted[0]['bbox'][0],   # Left of first
                    lines_sorted[0]['bbox'][1],   # Top of first
                    lines_sorted[-1]['bbox'][2],  # Right of last
                    lines_sorted[-1]['bbox'][3],  # Bottom of last
                ]
            }
            merged_lines.append(merged_line)
    return merged_lines

## PDF Block Processing: Identify Bullet Lines

**identify_bullet_lines()**:
- Identifies which lines in a merged block contain ONLY a bullet character (•, ◦, etc.)
- Returns a set of line indices containing pure bullets
- Used to properly handle bullets separately from their content during processing

In [138]:
def identify_bullet_lines(merged_block_lines):
    """Identify which lines contain ONLY a bullet character."""
    logger.info("log_051: Starting identify_bullet_lines")
    bullet_lines = set()
    for line_idx, line in enumerate(merged_block_lines):
        spans = line.get('spans', [])
        if len(spans) == 1 and spans[0]['text'].strip() in BULLET_CHARS:
            logger.info(f"log_052: Found pure bullet line at index {line_idx}")
            bullet_lines.add(line_idx)
    return bullet_lines

## Line Info Extraction & Helper Functions

**extract_line_info()**:
- Extracts metadata from each line: text, font size, flags, font name, X position, bullet info
- Skips noise: headers, footers, font sizes 24.0 or 8.0
- Detects whether the line directly before was a pure bullet line and, if so, captures its bullet character and X position
- Returns list of dicts with line info for processing

Helper functions used by the page parser:
- **should_close_code_block()**: Detects headings that should close open code blocks
- **get_heading_level()**: Determines markdown heading level (2-4) from font characteristics
- **is_code_block_text()**: Checks if line is in code font (Consolas) at correct size (10.5pt)
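
The way a pure bullet line marks the line that follows it can be sketched on plain strings. This is a simplified illustration, not the notebook's function: `attach_bullets` is a hypothetical name, and only the punctuation-noise filter is reproduced.

```python
import re

def attach_bullets(lines, bullet_indices):
    """Drop pure-bullet and noise lines; flag lines that follow a bullet.

    lines: list of str
    bullet_indices: set of indexes of lines holding only a bullet character
    """
    result = []
    for idx, text in enumerate(lines):
        if idx in bullet_indices:
            continue  # the bullet itself only marks the next line
        if re.fullmatch(r'[\.\-\(\)\s]*', text):
            continue  # empty or punctuation-only noise
        has_bullet = idx > 0 and (idx - 1) in bullet_indices
        result.append({'text': text, 'has_bullet': has_bullet})
    return result

rows = attach_bullets(['\u2022', 'First item', '....', 'Plain text'], {0})
print(rows)
# [{'text': 'First item', 'has_bullet': True}, {'text': 'Plain text', 'has_bullet': False}]
```

The bullet check looks only at the immediately preceding index, so the bullet and its content must stay adjacent in the merged line list for the pairing to work.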

In [139]:
def extract_line_info(merged_block_lines, bullet_lines, page_num_1idx):
    """
    Extract and process line metadata from merged block lines.
    
    Returns:
        List of dicts with: text, size, flags, font, is_bold, x_pos, has_bullet, bullet_char, bullet_x_pos
    """
    logger.info("log_053: Starting extract_line_info")
    lines_info = []
    for line_idx, line in enumerate(merged_block_lines):
        spans = line.get('spans', [])

        # SKIP pure bullet lines - they only serve to mark the next line
        if line_idx in bullet_lines:
            logger.info(f"log_054: Line {line_idx} is pure bullet, skipping")
            continue
        if not spans:
            continue  # Nothing to extract from a line without spans
        first_span = spans[0]
        span_size = first_span.get('size', 10)
        line_flags = first_span.get('flags', 0)
        line_font = first_span.get('font', 'Arial')
        x_pos = first_span.get('origin', [0, 0])[0]
        # Skip certain font sizes (headers, noise)
        if round(span_size, 1) in [24.0, 8.0]:
            logger.info(f"log_055: Line {line_idx} has header/noise font size, skipping")
            continue
        line_text = "".join([s['text'] for s in spans])
        line_text = clean_page_noise(line_text, page_num_1idx).strip()
        line_text_clean = line_text.lstrip(' ')
        # Skip chapter/appendix headers and noise
        if re.match(r'^(\d{1,2}|[A-E]):\s+\S', line_text_clean):
            logger.info(f"log_056: Line {line_idx} matches chapter/appendix header, skipping")
            continue
        if not line_text_clean or re.fullmatch(r'[\.\-\(\)\s]*', line_text_clean):
            logger.info(f"log_057: Line {line_idx} is empty/noise, skipping")
            continue
        if line_text_clean.strip() in ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10"]:
            logger.info(f"log_058: Line {line_idx} is page number, skipping")
            continue
        is_bold_heading = bool(line_flags & BOLD_FLAG)
        # Determine if this line had a bullet in previous line
        has_bullet_from_previous = False
        bullet_char = None
        bullet_x_pos = None
        if line_idx > 0 and (line_idx - 1) in bullet_lines:
            logger.info(f"log_059: Line {line_idx} has bullet from previous line")
            has_bullet_from_previous = True
            prev_line = merged_block_lines[line_idx - 1]
            if prev_line.get('spans'):
                bullet_text = prev_line['spans'][0]['text'].strip()
                bullet_char = bullet_text
                bullet_x_pos = prev_line['spans'][0].get('origin', [0, 0])[0]
        lines_info.append({
            'text': line_text_clean,
            'size': span_size,
            'flags': line_flags,
            'font': line_font,
            'is_bold': is_bold_heading,
            'x_pos': x_pos,
            'has_bullet': has_bullet_from_previous,
            'bullet_char': bullet_char,
            'bullet_x_pos': bullet_x_pos,
        })
    return lines_info


## Heading & Code Detection Helpers

**should_close_code_block()**:
- Returns True when a line is a heading (H2, H3, or either H4 variant), which means any open code block must be closed first

**get_heading_level()**:
- Returns the markdown heading level (2, 3, or 4) based on font size, bold flag, and font name, or None if the line is not a heading

**is_code_block_text()**:
- Returns True when a line is set in the code font (Consolas) at body text size (10.5pt)

In [140]:
def should_close_code_block(line_info):
    """Determine if we should close any open code block for this line (heading detection)."""
    logger.info("log_060: Starting should_close_code_block")
    span_size = line_info['size']
    line_flags = line_info['flags']
    line_font = line_info['font']
    is_bold_heading = line_info['is_bold']
    # Check if this is a heading that should close code blocks
    if abs(span_size - H2_SIZE_SPECIFIC) < TOLERANCE and is_bold_heading:
        logger.info("log_061: Detected H2 heading")
        return True
    elif abs(span_size - H3_SIZE_GENERIC) < TOLERANCE and is_bold_heading:
        logger.info("log_062: Detected H3 heading")
        return True
    elif H4_FONT in line_font and abs(span_size - H4_SIZE_ARIAL_BLACK) < TOLERANCE:
        logger.info("log_063: Detected H4 heading (Arial Black)")
        return True
    elif round(span_size, 3) == 13.965 and is_bold_heading:
        logger.info("log_064: Detected H4 heading (size 13.965)")
        return True
    return False

def get_heading_level(line_info):
    """Get markdown heading level if line is a heading, else None."""
    logger.info("log_065: Starting get_heading_level")
    span_size = line_info['size']
    line_flags = line_info['flags']
    line_font = line_info['font']
    is_bold_heading = line_info['is_bold']
    if abs(span_size - H2_SIZE_SPECIFIC) < TOLERANCE and is_bold_heading:
        logger.info("log_066: Heading level 2 detected")
        return 2
    elif abs(span_size - H3_SIZE_GENERIC) < TOLERANCE and is_bold_heading:
        logger.info("log_067: Heading level 3 detected")
        return 3
    elif H4_FONT in line_font and abs(span_size - H4_SIZE_ARIAL_BLACK) < TOLERANCE:
        logger.info("log_068: Heading level 4 detected (Arial Black)")
        return 4
    elif round(span_size, 3) == 13.965 and is_bold_heading:
        logger.info("log_069: Heading level 4 detected (size 13.965)")
        return 4
    return None

def is_code_block_text(line_info):
    """Check if line is code text (code font + correct size)."""
    logger.info("log_070: Starting is_code_block_text")
    return abs(line_info['size'] - BODY_TEXT_SIZE) < TOLERANCE and line_info['font'] == CODE_FONT


## Core Page Parsing Logic

**parse_page_with_metadata()**:
- Main function that parses a single PDF page
- Handles multi-page code block continuity via state tracking
- For each line, determines:
  1. Is it a heading? → Close code block, add markdown heading
  2. Is it code? → Add to code block with proper formatting
  3. Is it regular text? → Apply hierarchical indentation based on X coordinate

**Key features**:
- Tracks `in_code_block` state across pages
- Applies indentation strictly from X_INDENT_RANGES
- Emits `*` bullet markers for lines that follow a pure bullet line
- Returns markdown text and updated state for next page
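
The cross-page state tracking can be sketched in isolation. This is a reduced illustration that assumes each line has already been classified as code or not; `render_page` is a hypothetical name, and headings, tables, and indentation are left out.

```python
FENCE = "`" * 3  # markdown code fence, built this way to keep the example readable

def render_page(lines, state=None):
    """Render (text, is_code) pairs to markdown, carrying fence state across pages.

    Returns (markdown, new_state); pass new_state into the next page's call.
    """
    in_code_block = bool(state and state.get('in_code_block'))
    out = []
    for text, is_code in lines:
        if is_code and not in_code_block:
            out.append(FENCE + "python")  # open a block on the first code line
            in_code_block = True
        elif not is_code and in_code_block:
            out.append(FENCE)             # close the block when prose resumes
            in_code_block = False
        out.append(text)
    return "\n".join(out), {'in_code_block': in_code_block}

page1 = [("Example:", False), (">>> df.head()", True)]
page2 = [("   id", True), ("Next section", False)]

md1, state = render_page(page1)         # the block is still open at the page end
md2, state = render_page(page2, state)  # continues the block, then closes it
print(state)  # {'in_code_block': False}
```

Because the first page ends inside a code block, the second page does not emit a new opening fence; it appends to the open block and closes it once regular text appears.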

In [141]:
def parse_page_with_metadata(page, page_num_1idx, previous_page_state=None):
    """
    Parse a page with table detection, code vs output distinction, and hierarchical indentation.
    
    FIX: Changed to insert tables at their correct block positions instead of all at once.
    
    Args:
        page: PyMuPDF page object
        page_num_1idx: 1-indexed page number
        previous_page_state: dict tracking state from previous page
    
    Returns:
        dict with 'markdown': str and 'state': dict for next page
    """
    logger.info(f"log_018: {page_num_1idx:04d} Starting parse_page_with_metadata")
    
    data = page.get_text("dict")
    logger.info(f"log_020: {page_num_1idx:04d} Got text dict")
    blocks = data['blocks']
    logger.info(f"log_021: {page_num_1idx:04d} Extracted {len(blocks)} blocks")
    markdown_lines = []
    pending_numbered_item = None
    
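    # previous_page_state defaults to None (e.g. for the first page), so guard the lookup
    in_code_block = bool(previous_page_state and previous_page_state.get('in_code_block', False))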
    
    # STEP 1: Prepare raw blocks for table detection
    all_blocks_with_raw_lines = []
    all_merged_blocks = {}
    
    for block_idx, block in enumerate(blocks):
        if block['type'] != 0:
            continue
        
        raw_lines = block.get('lines', [])
        all_blocks_with_raw_lines.append((block_idx, raw_lines))
        
        merged_block_lines = merge_table_columns_in_block(block)
        all_merged_blocks[block_idx] = merged_block_lines
    
    logger.info(f"log_023: {page_num_1idx:04d} Prepared {len(all_blocks_with_raw_lines)} blocks for table detection")

    # STEP 2: Detect and convert tables
    table_list, table_line_refs = detect_and_convert_tables_in_page(all_blocks_with_raw_lines, page_num_1idx)
    logger.info(f"log_024: {page_num_1idx:04d} Detected and converted tables")

    # Build dict mapping block_idx to its table markdown
    # table_dict[block_idx] = markdown string to insert at this block
    table_dict = {}  # block_idx -> markdown_str
    for start_block, end_block, table_markdown in table_list:
        # Insert table at the START block of this range
        table_dict[start_block] = table_markdown
    
    # Build set of blocks that contain tables (from table_line_refs)
    table_blocks_set = set(b_idx for b_idx, _ in table_line_refs)
    logger.info(f"log_025: {page_num_1idx:04d} Table blocks set: {len(table_blocks_set)}")

    # STEP 3: Process each block in order, inserting tables at their correct positions
    for block_idx, block in enumerate(blocks):
        if block['type'] != 0:
            continue
        
        if block_idx not in all_merged_blocks:
            continue
        
        # FIX: Insert table at this specific block position (if it's the start of a table range)
        if block_idx in table_dict:
            if in_code_block:
                logger.info(f"log_026: {page_num_1idx:04d} Closing code block for table")
                markdown_lines.append("```\n")
                in_code_block = False
            logger.info(f"log_027: {page_num_1idx:04d} Inserting table at block {block_idx}")
            markdown_lines.append(table_dict[block_idx])
            markdown_lines.append('\n\n')
            # Skip all blocks that are part of this table
            continue
        
        # Skip remaining table blocks (they're part of a table already inserted)
        if block_idx in table_blocks_set:
            continue
        
        merged_block_lines = all_merged_blocks[block_idx]
        bullet_lines = identify_bullet_lines(merged_block_lines)
        lines_info = extract_line_info(merged_block_lines, bullet_lines, page_num_1idx)
        
        # Process each line in the (possibly combined) block
        for current in lines_info:
            line_text_clean = current['text']
            
            # CONSOLIDATED: Check if heading (closes code blocks automatically)
            if should_close_code_block(current):
                if in_code_block:
                    logger.info(f"log_030: {page_num_1idx:04d} Closing code block for heading")
                    markdown_lines.append("```\n")
                    in_code_block = False
                
                heading_level = get_heading_level(current)
                heading_marker = '#' * heading_level
                logger.info(f"log_031: {page_num_1idx:04d} Adding heading")
                markdown_lines.append(f"{heading_marker} {line_text_clean}\n")
                continue
            
            # Code/Output handling
            if is_code_block_text(current):
                is_code = is_code_line(line_text_clean, current['font'])
                
                if is_code:
                    # Start or continue python code block with REPL prompt
                    if not in_code_block:
                        logger.info(f"log_032: {page_num_1idx:04d} Starting code block")
                        markdown_lines.append("```python\n")
                        in_code_block = True
                    logger.info(f"log_033: {page_num_1idx:04d} Adding code line")
                    markdown_lines.append(line_text_clean + "\n")
                else:
                    # This is output (code font text after >>> lines)
                    if in_code_block:
                        logger.info(f"log_034: {page_num_1idx:04d} Adding output in code block")
                        markdown_lines.append(line_text_clean + "\n")
                    else:
                        # Orphaned code-font text - treat as regular code
                        # (in_code_block is already known to be False here)
                        logger.info(f"log_035: {page_num_1idx:04d} Starting code block for output")
                        markdown_lines.append("```python\n")
                        in_code_block = True
                        markdown_lines.append(line_text_clean + "\n")
            
            else:
                # Regular body text (non-code-font) - apply hierarchical indentation
                if in_code_block:
                    logger.info(f"log_036: {page_num_1idx:04d} Closing code block for body")
                    markdown_lines.append("```\n")
                    in_code_block = False
                
                # Determine indentation from X coordinate (strictly based on X_INDENT_RANGES)
                indent_x_pos = current['bullet_x_pos'] if current['has_bullet'] and current['bullet_x_pos'] is not None else current['x_pos']
                indent_spaces = get_indent_from_x(indent_x_pos, current['bullet_char'])
                
                indent_str = ' ' * indent_spaces
                
                # Apply bullet marker if present
                if current['has_bullet']:
                    logger.info(f"log_037: {page_num_1idx:04d} Adding bullet line")
                    markdown_lines.append(f"{indent_str}* {line_text_clean}\n")
                else:
                    logger.info(f"log_038: {page_num_1idx:04d} Adding body line")
                    markdown_lines.append(f"{indent_str}{line_text_clean}\n")
    
    # Determine final state for next page
    final_state = {
        'in_code_block': in_code_block,
        'pending_lines': []
    }
    
    final_text = "".join(markdown_lines)
    final_text = re.sub(r'\n{3,}', '\n\n', final_text)
    logger.info(f"log_039: {page_num_1idx:04d} Finalized markdown")
    
    return {
        'markdown': final_text.strip(),
        'state': final_state
    }

## Bullet Conversion & File Export Functions

**convert_bullets_in_text()**:
- Post-processing step that converts any bullet characters left over after parsing to markdown format
- Replaces • and ◦ with * (markdown bullet syntax)
- Preserves the indentation already applied during X-coordinate parsing
- Normalizes the spacing after each leading bullet marker to a single space
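
The core substitution can be sketched with a single regular expression. This is a condensed illustration rather than the notebook's function: `BULLET_RE` and `to_markdown_bullet` are hypothetical names, and the sketch only rewrites bullets at the start of a line.

```python
import re

# Leading whitespace, one bullet character (• or ◦), optional spaces, then the content
BULLET_RE = re.compile(r'^(\s*)[\u2022\u25E6]\s*(.*)')

def to_markdown_bullet(line):
    """Rewrite a leading • or ◦ as '* ', keeping the original indentation."""
    match = BULLET_RE.match(line)
    if match:
        spaces, content = match.groups()
        return f"{spaces}* {content}"
    return line

print(repr(to_markdown_bullet("\u2022 top level")))        # '* top level'
print(repr(to_markdown_bullet("  \u25E6   nested item")))  # '  * nested item'
print(repr(to_markdown_bullet("no bullet here")))          # 'no bullet here'
```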


In [142]:
# Bullet Conversion & File Export Functions

def convert_bullets_in_text(text):
    """
    Convert old bullet characters (• and ◦) to markdown format (* with proper indentation).
    
    This is a post-processing step for any bullets that weren't converted during parsing.
    """
    logger.info("log_071: Starting convert_bullets_in_text")
    lines = text.split('\n')
    converted_lines = []
    for line in lines:
        # Replace • with * (main bullet)
        if '•' in line or '\u2022' in line:
            logger.info("log_072: Found main bullet line")
            match = re.match(r'^(\s*)([•\u2022])\s*(.*)', line)
            if match:
                logger.info("log_073: Main bullet line matches regex")
                spaces, bullet, content = match.groups()
                converted_lines.append(f"{spaces}* {content}")
            else:
                logger.info("log_074: Main bullet line does not match regex, replacing directly")
                line = line.replace('•', '*').replace('\u2022', '*')
                converted_lines.append(line)
        # Replace ◦ with * (nested bullet) - the line's existing indentation is kept as-is
        elif '◦' in line or '\u25E6' in line:
            logger.info("log_075: Found nested bullet line")
            match = re.match(r'^(\s*)([◦\u25E6])\s*(.*)', line)
            if match:
                logger.info("log_076: Nested bullet line matches regex")
                spaces, bullet, content = match.groups()
                converted_lines.append(f"{spaces}* {content}")
            else:
                logger.info("log_078: Nested bullet line does not match regex, replacing directly")
                line = line.replace('◦', '*').replace('\u25E6', '*')
                converted_lines.append(line)
        else:
            converted_lines.append(line)
    return '\n'.join(converted_lines)


## Detect Table Blocks in Page

**detect_table_blocks_in_page()**:
- Detects table structures by identifying header blocks and their associated data blocks
- Uses font characteristics and X-coordinate alignment to identify table components
- Returns ranges of blocks that contain tables
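
The header test based on column gaps can be sketched on a bare list of X positions. This is a simplified illustration assuming a 30-point minimum gap and a limit of five columns; `is_header_candidate` is a hypothetical name, and the font-size and footer checks are omitted.

```python
MIN_X_GAP = 30  # minimum horizontal distance, in points, between distinct columns

def is_header_candidate(x_positions, min_gap=MIN_X_GAP, max_columns=5):
    """True when 2..max_columns distinct X positions include at least one wide gap."""
    xs = sorted(set(x_positions))
    if not 2 <= len(xs) <= max_columns:
        return False  # a single column, or so many positions it is likely wrapped text
    return any(right - left >= min_gap for left, right in zip(xs, xs[1:]))

print(is_header_candidate([70, 130, 180]))                # True
print(is_header_candidate([70, 75]))                      # False (gap of only 5)
print(is_header_candidate([10, 50, 90, 130, 170, 210]))   # False (6 positions)
```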

In [143]:
def detect_table_blocks_in_page(all_blocks_with_raw_lines):
    """
    Detect table blocks by identifying header blocks and grouping all data blocks
    that fall within the header's column threshold range.
    
    Key insight: A table consists of:
    1. A header block with 2+ distinct columns (at least one gap >= MIN_X_GAP)
    2. All consecutive data blocks whose X coordinates fall within the header's range
    3. Stops at footer or header blocks
    4. ONLY the FIRST occurrence of a multi-column block in a given X range is treated as a header
       This avoids treating data rows (Block 1 with 2 columns) as separate headers
    
    For multi-block-per-row tables (like page 10):
    - Header (Block 0): X=[70, 130, 180] defines column thresholds
    - Block 1: X=[70, 130] → within header range ✓
    - Block 2: X=[180] → within header range ✓
    - Block 3: X=[190+] → outside range (separate table or plain text)
    
    Args:
        all_blocks_with_raw_lines: List of (block_idx, raw_lines) tuples
    
    Returns:
        List of (table_start_block, table_end_block) tuples for detected tables
    """
    logger.info("log_079: Starting detect_table_blocks_in_page")
    # Regex to detect footer/header text (version-agnostic)
    FOOTER_PATTERN = re.compile(r'Package for Python User Guide,?\s+Release\s+[\d.]+', re.IGNORECASE)
    TABLE_FONT_SIZE = 10.028  # Table text font size (with tolerance)
    FONT_SIZE_TOLERANCE = 0.05
    MIN_X_GAP = 30  # Minimum gap between columns (distinct columns)
    X_THRESHOLD_TOLERANCE = 100  # Allow 100px tolerance when checking if block X is within header range
    # Step 1: Find all blocks with table-like font characteristics
    block_characteristics = {}  # Store characteristics for each block
    header_candidates = []  # Blocks with 2-5 columns and at least one gap >= MIN_X_GAP
    for block_idx, raw_lines in all_blocks_with_raw_lines:
        # Check if this block has table-like characteristics
        has_table_font = False
        x_positions = set()
        x_min, x_max = float('inf'), 0
        for line in raw_lines:
            for span in line.get('spans', []):
                # Filter out footer text via regex
                span_text = span.get('text', '')
                if FOOTER_PATTERN.search(span_text):
                    logger.info(f"log_081: Block {block_idx} span matches footer pattern, skipping")
                    continue  # Skip footer lines
                # Check font size
                span_size = span.get('size', 0)
                if abs(span_size - TABLE_FONT_SIZE) < FONT_SIZE_TOLERANCE:
                    logger.info(f"log_082: Block {block_idx} span matches table font size")
                    has_table_font = True
                    x = span.get('origin', [0, 0])[0]
                    x_key = x  # Use exact X value, do not round
                    x_positions.add(x_key)
                    x_min = min(x_min, x_key)
                    x_max = max(x_max, x_key)
        # Store characteristics for this block
        block_characteristics[block_idx] = {
            'has_table_font': has_table_font,
            'x_positions': x_positions,
            'x_min': x_min if x_min != float('inf') else 0,
            'x_max': x_max,
            'num_columns': len(x_positions)
        }
        # Identify header candidate blocks: have 2+ X positions with distinct columns (gap >= MIN_X_GAP)
        # BUT REJECT blocks with >5 unique X positions (likely wrapped text, not a table)
        if has_table_font and len(x_positions) >= 2 and len(x_positions) <= 5:
            sorted_x = sorted(x_positions)
            for i in range(len(sorted_x) - 1):
                if sorted_x[i+1] - sorted_x[i] >= MIN_X_GAP:
                    logger.info(f"log_083: Block {block_idx} is header candidate")
                    header_candidates.append(block_idx)
                    break
    # Pre-compute block colors and Y positions to avoid repeated iterations
    block_color_map = {}
    block_min_y_map = {}
    for block_idx, raw_lines in all_blocks_with_raw_lines:
        # Extract color
        color = None
        for line in raw_lines:
            for span in line.get('spans', []):
                color = span.get('color')
                if color is not None:
                    break
            if color is not None:
                break
        block_color_map[block_idx] = color
        # Extract min Y
        min_y = float('inf')
        for line in raw_lines:
            for span in line.get('spans', []):
                y = span.get('origin', [0, 0])[1]
                min_y = min(min_y, y)
        block_min_y_map[block_idx] = min_y if min_y != float('inf') else None
    # Step 2: Filter to ONLY keep headers (not data rows that look like headers)
    # If two header candidates have overlapping X ranges, only the FIRST is a header
    if not header_candidates:
        logger.info("log_084: No header candidates found")
        return []
    real_headers = []
    used_x_ranges = []
    for header_idx in sorted(header_candidates):
        # Check for non-table blocks between the last real header and this candidate
        # If a non-table block exists, it's a separator, so we reset the X-range memory
        last_header = real_headers[-1] if real_headers else -1
        has_separator = False
        for i in range(last_header + 1, header_idx):
            if i in block_characteristics and not block_characteristics[i]['has_table_font']:
                logger.info(f"log_085: Separator found between headers at {i}")
                has_separator = True
                break
        if has_separator:
            # A non-table block was found, so this is a new section. Reset used ranges.
            logger.info(f"log_086: Resetting used_x_ranges due to separator before header {header_idx}")
            used_x_ranges = []
        header_chars = block_characteristics[header_idx]
        header_x_positions = header_chars['x_positions']
        header_x_min = min(header_x_positions) if header_x_positions else 0
        header_x_max = max(header_x_positions) if header_x_positions else 0
        # Check if this X range overlaps with any existing header's X range
        is_overlapping = False
        for existing_min, existing_max in used_x_ranges:
            # Ranges count as overlapping if they intersect or lie within X_THRESHOLD_TOLERANCE of each other
            if (header_x_min <= existing_max + X_THRESHOLD_TOLERANCE and 
                header_x_max >= existing_min - X_THRESHOLD_TOLERANCE):
                logger.info(f"log_087: Header {header_idx} X range overlaps with previous header")
                is_overlapping = True
                break
        if not is_overlapping:
            # This is a truly new header (different table)
            logger.info(f"log_088: Header {header_idx} is a real header")
            real_headers.append(header_idx)
            used_x_ranges.append((header_x_min, header_x_max))
    # Step 2B: Extend header ranges to include ALL previous blocks with same color
    # This handles cases where table headers span multiple blocks with varying Y positions
    # (e.g., page 203 blocks 14-15-16, where 14-15 have same color but different Y than 16)
    # Track: (original_header_idx) -> (extended_start_idx, extended_end_idx)
    header_extension_map = {}  # Maps original header to (start, end) of extended range
    blocks_to_skip = set()  # Blocks that were absorbed into extended headers
    for header_idx in real_headers:
        header_start_idx = header_idx
        header_end_idx = header_idx
        header_color = block_color_map[header_idx]
        # Look backward to include ALL consecutive blocks with same color
        # This ensures multi-block headers are fully captured even if blocks have different Y positions
        check_idx = header_idx - 1
        while check_idx >= 0:
            # Check if previous block has table font
            if check_idx not in block_characteristics or not block_characteristics[check_idx]['has_table_font']:
                break
            # Check if previous block has same color
            prev_color = block_color_map.get(check_idx)
            if prev_color == header_color:
                logger.info(f"log_089: Extending header {header_idx} backward to {check_idx}")
                header_start_idx = check_idx
                blocks_to_skip.add(check_idx)
                check_idx -= 1
            else:
                # Color changed, stop looking backward
                break
        # Store the extended range
        header_extension_map[header_idx] = (header_start_idx, header_end_idx)
    # Step 3: Group blocks into tables starting from each real header block
    table_ranges = []
    processed_blocks = set()
    for header_idx in sorted(real_headers):

        # Get extended range for this header
        header_start_idx, header_end_idx = header_extension_map[header_idx]
        # Recompute header X positions from ALL blocks in extended range
        header_x_positions = set()
        for h_idx in range(header_start_idx, header_end_idx + 1):
            if h_idx in block_characteristics:
                header_x_positions.update(block_characteristics[h_idx]['x_positions'])

        header_x_min = min(header_x_positions)
        header_x_max = max(header_x_positions)
        # Mark all blocks in extended header range as processed
        for h_idx in range(header_start_idx, header_end_idx + 1):
            processed_blocks.add(h_idx)
        # Store the extended header as a single table range
        logger.info(f"log_092: Storing extended header range {header_start_idx}-{header_end_idx}")
        table_ranges.append((header_start_idx, header_end_idx))
        # Now look for consecutive DATA blocks that belong to this table
        # These can have continuation blocks merged (blocks with no column 1 content)
        data_range_start = None
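        # Iterate up to the highest block index so gaps in block numbering cannot raise KeyError
        for candidate_idx in range(header_end_idx + 1, max(block_characteristics) + 1):
            candidate_chars = block_characteristics.get(candidate_idx)
            if candidate_chars is None:
                continue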
            # Stop at the first block without table font: the table has ended (body text or footer)
            if not candidate_chars['has_table_font']:
                logger.info(f"log_094: Candidate block {candidate_idx} has no table font, stopping data range")
                # Save any pending data range before stopping
                if data_range_start is not None:
                    logger.info(f"log_095: Saving data range {data_range_start}-{candidate_idx-1}")
                    table_ranges.append((data_range_start, candidate_idx - 1))
                    data_range_start = None
                break
            # Check if candidate is another real header (different table)
            if candidate_idx in real_headers:
                logger.info(f"log_096: Candidate block {candidate_idx} is another real header, stopping data range")
                # Save any pending data range and stop
                if data_range_start is not None:
                    logger.info(f"log_097: Saving data range {data_range_start}-{candidate_idx-1}")
                    table_ranges.append((data_range_start, candidate_idx - 1))
                    data_range_start = None
                break
            # Check if candidate block's X coordinates mean it belongs to table
            candidate_x_positions = candidate_chars['x_positions']
            if candidate_x_positions:  # Has at least one X position
                candidate_x_min = min(candidate_x_positions)
                # STOP if text extends significantly to the left of column 1
                if candidate_x_min < header_x_min - X_THRESHOLD_TOLERANCE:
                    logger.info(f"log_098: Candidate block {candidate_idx} is outside table boundary, stopping data range")
                    # This block is outside the table boundary. Save any pending data range and stop.
                    if data_range_start is not None:
                        logger.info(f"log_099: Saving data range {data_range_start}-{candidate_idx-1}")
                        table_ranges.append((data_range_start, candidate_idx - 1))
                        data_range_start = None
                    break
                # CONTINUE: This block belongs to the table
                if data_range_start is None:
                    logger.info(f"log_100: Starting new data range at {candidate_idx}")
                    data_range_start = candidate_idx
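                processed_blocks.add(candidate_idx)
        # The loop ran to the end without a stop condition: save a data range
        # that extends through the last block on the page
        if data_range_start is not None:
            table_ranges.append((data_range_start, candidate_idx))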
    return table_ranges


## Extract Table Rows from Blocks

**extract_table_rows_from_blocks()**:
- Extracts structured table rows from PDF blocks using column thresholds
- Handles multi-block rows by merging blocks without column 1 content
- Returns a list of rows with aligned cell data
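
The threshold and merge rules above can be sketched on toy data. This is a minimal illustration under stated assumptions, not the notebook's implementation: `cluster_x`, `column_for`, and `merge_blocks` are hypothetical helper names, the input is a simplified list of `(x, text)` spans per block rather than real PyMuPDF span dictionaries, and only the range-based column assignment is shown (the nearest-threshold branch for wide-gap tables is left out).

```python
def cluster_x(xs, gap=30):
    """Collapse X origins closer than `gap` into one column threshold."""
    thresholds = []
    for x in sorted(set(xs)):
        if not thresholds or x - thresholds[-1] > gap:
            thresholds.append(x)
    return thresholds


def column_for(x, thresholds):
    """Return the index of the last threshold at or left of x (0 if none)."""
    idx = 0
    for i, t in enumerate(thresholds):
        if x >= t:
            idx = i
    return idx


def merge_blocks(blocks, thresholds):
    """Build rows; a block with no column-0 text continues the previous row."""
    rows = []
    for spans in blocks:
        cells = [""] * len(thresholds)
        for x, text in spans:
            col = column_for(x, thresholds)
            cells[col] = (cells[col] + " " + text).strip()
        if cells[0] or not rows:
            rows.append(cells)
        else:
            # Continuation block: append its text to the previous row's cells
            prev = rows[-1]
            for col, text in enumerate(cells):
                if text:
                    prev[col] = (prev[col] + " " + text).strip()
    return rows


# Hypothetical toy input: one list of (x, text) spans per block
header = [(72.0, "Argument"), (200.0, "Description")]
data = [
    [(72.0, "data"), (200.5, "Input teradataml")],
    [(200.5, "DataFrame.")],  # no column-0 text: continuation of the row above
    [(72.0, "columns"), (200.5, "Column names.")],
]
thresholds = cluster_x([x for x, _ in header])
rows = merge_blocks([header] + data, thresholds)
```

Here the second data block carries text only in the second column, so it is folded into the `data` row instead of starting a new one.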

In [144]:
def extract_table_rows_from_blocks(all_blocks_with_raw_lines, table_start, table_end, header_thresholds=None):
    """
    Extract table rows by aligning cells to column thresholds from the TITLE ROW or provided thresholds.
    Merges blocks that have no content in column 1 with the previous block.
    
    Key Logic:
    1. If header_thresholds provided, use those. Otherwise, title blocks (table_start to table_end) define thresholds
    2. Column 1 = [min_threshold, threshold[1]), Column 2 = [threshold[1], threshold[2]), etc.
    3. For each subsequent block:
       - If it has NO content in column 1 range, merge it with previous block
       - If it HAS content in column 1, start a new row
    4. Initialize empty columns as "" when merging blocks
    
    Args:
        all_blocks_with_raw_lines: List of (block_idx, raw_lines) tuples
        table_start, table_end: Block range
        header_thresholds: Optional pre-computed column thresholds (use when extracting data-only ranges)
    
    Returns: List of rows, each row is a list of cells (text strings)
    """
    logger.info("log_104: Starting extract_table_rows_from_blocks")
    block_dict = dict(all_blocks_with_raw_lines)
    # Step 1: Collect all spans with block_idx, Y, X, text
    all_spans_with_coords = []
    for block_idx in range(table_start, table_end + 1):
        for line in block_dict[block_idx]:
            for span in line.get('spans', []):
                y = span.get('origin', [0, 0])[1]
                x = span.get('origin', [0, 0])[0]
                text = span.get('text', '')
                all_spans_with_coords.append((block_idx, y, x, text))

    # Step 2: Extract column thresholds from provided thresholds OR title blocks
    if header_thresholds is not None:
        logger.info("log_108: Using provided header_thresholds")
        column_thresholds = sorted(header_thresholds)
    else:
        # Extract thresholds from title block range (ALL blocks from table_start to table_end)
        # This handles cases where headers span multiple blocks (e.g., page 86 blocks 11-12)
        title_block_spans = [(y, x, t) for b, y, x, t in all_spans_with_coords if table_start <= b <= table_end]
        if title_block_spans:
            logger.info("log_109: Extracting thresholds from title block spans")
            x_raw = [x for y, x, text in title_block_spans]
            # Cluster X positions within 30px as same column
            # Use actual X values (not rounded) to preserve precision for range checks
            clustered_x = []
            for x_val in sorted(set(x_raw)):
                if not clustered_x or abs(x_val - clustered_x[-1]) > 30:
                    clustered_x.append(x_val)
            column_thresholds = sorted(clustered_x)


    # Step 3: Helper function to assign X coordinate to column index
    def assign_to_column_idx(x_coord):
        """
        Assign an X coordinate to a column index.
        If any two adjacent thresholds are more than 100 units apart, pick the nearest threshold.
        Otherwise, use range-based assignment: column i covers [threshold[i], threshold[i+1]).
        """

        # Check if this is a wide-gap table (columns >100px apart)
        gaps = [column_thresholds[i+1] - column_thresholds[i] for i in range(len(column_thresholds)-1)]
        has_wide_gap = any(gap > 100 for gap in gaps)
        if has_wide_gap:
            logger.info("log_114: Wide-gap table detected, using nearest-column assignment")
            nearest_idx = min(range(len(column_thresholds)), 
                             key=lambda i: abs(x_coord - column_thresholds[i]))
            return nearest_idx
        else:
            # Use range-based assignment for narrow columns
            for i in range(len(column_thresholds)):
                if i == len(column_thresholds) - 1:
                    # Last column: threshold[i] to infinity
                    if x_coord >= column_thresholds[i]:
                        logger.info(f"log_115: Assigning x_coord {x_coord} to last column {i}")
                        return i
                else:
                    # Middle column: threshold[i] to threshold[i+1]
                    if column_thresholds[i] <= x_coord < column_thresholds[i + 1]:
                        logger.info(f"log_116: Assigning x_coord {x_coord} to column {i}")
                        return i
            # Fallback: assign to nearest threshold
            logger.info(f"log_117: Fallback assigning x_coord {x_coord} to nearest column")
            nearest_idx = min(range(len(column_thresholds)), 
                             key=lambda i: abs(x_coord - column_thresholds[i]))
            return nearest_idx
    # Step 4: Group spans by block first, then assign columns within each block
    block_spans_dict = {}  # {block_idx: {col_idx: [text_list]}}
    for block_idx, y, x, text in all_spans_with_coords:
        col_idx = assign_to_column_idx(x)
        if block_idx not in block_spans_dict:
            block_spans_dict[block_idx] = {}
        if col_idx not in block_spans_dict[block_idx]:
            block_spans_dict[block_idx][col_idx] = []
        block_spans_dict[block_idx][col_idx].append(text)
    # Step 5: Combine text for each cell and identify blocks with no column 0 content
    rows = []
    current_row = None  # Will accumulate columns from merged blocks
    current_row_cols = None  # Tracks which columns have been filled
    for block_idx in sorted(block_spans_dict.keys()):
        block_cols = block_spans_dict[block_idx]
        # Check if this block has content in column 0 (first column)
        has_col_0 = 0 in block_cols
        if has_col_0:
            logger.info(f"log_118: Block {block_idx} has column 0, starting new row")
            # Start a new row
            if current_row is not None:
                logger.info(f"log_119: Saving previous row for block {block_idx}")
                rows.append(current_row)
            # Initialize new row with all columns
            current_row = []
            current_row_cols = set()
            for col_idx in range(len(column_thresholds)):
                if col_idx in block_cols:
                    combined_text = ' '.join(block_cols[col_idx]).strip()
                    current_row.append(combined_text)
                    current_row_cols.add(col_idx)
                else:
                    current_row.append("")
        else:
            logger.info(f"log_120: Block {block_idx} has no column 0, merging with previous row")
            # No column 0 content: merge with previous row
            if current_row is None:
                logger.info(f"log_121: No current row, initializing empty row for block {block_idx}")
                current_row = [""] * len(column_thresholds)
                current_row_cols = set()
            # Merge block content into current row
            for col_idx in range(len(column_thresholds)):
                if col_idx in block_cols:
                    combined_text = ' '.join(block_cols[col_idx]).strip()
                    # Append to existing cell if already has content
                    if current_row[col_idx]:
                        logger.info(f"log_122: Appending to existing cell in column {col_idx} for block {block_idx}")
                        current_row[col_idx] += " " + combined_text
                    else:
                        logger.info(f"log_123: Setting cell in column {col_idx} for block {block_idx}")
                        current_row[col_idx] = combined_text
                    current_row_cols.add(col_idx)
    # Step 6: Add the last row
    if current_row is not None and any(current_row):
        logger.info("log_124: Adding final row")
        rows.append(current_row)
    return rows


## Format Markdown Table & Convert Tables

**format_markdown_table()**:
- Converts extracted table rows to markdown format (pipe-separated columns)
- Deduplicates consecutive identical rows (header repetition in PDFs)
- Pads all rows to the same column count with empty strings
- Returns properly formatted markdown table
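
The deduplication and padding steps can be sketched as follows. This is a simplified stand-in rather than the function defined in the next cell: `dedupe_and_pad` and `to_markdown` are hypothetical names, and the sketch copies each row before padding so the caller's lists are left unchanged.

```python
import re


def dedupe_and_pad(rows):
    """Drop consecutive duplicate rows, then pad rows to equal length."""
    kept, prev_key = [], None
    for row in rows:
        # Compare case-insensitively with whitespace collapsed
        key = [re.sub(r"\s+", " ", cell.strip().lower()) for cell in row]
        if key != prev_key:
            kept.append(list(row))  # copy so the caller's rows are untouched
            prev_key = key
    width = max((len(r) for r in kept), default=0)
    return [r + [""] * (width - len(r)) for r in kept]


def to_markdown(rows):
    """Render rows as a pipe table; the first row becomes the header."""
    rows = dedupe_and_pad(rows)
    if not rows:
        return ""
    lines = ["| " + " | ".join(rows[0]) + " |"]
    lines.append("| " + " | ".join("-" * max(1, len(c)) for c in rows[0]) + " |")
    lines += ["| " + " | ".join(r) + " |" for r in rows[1:]]
    return "\n".join(lines)


table = to_markdown([
    ["Name", "Type"],
    ["name", "TYPE"],  # repeated header, differs only in case
    ["id"],            # short row, padded to two cells
])
```

The repeated header is dropped because it normalizes to the same key as the first row, and the one-cell row gains an empty second cell.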


In [145]:
def format_markdown_table(rows):
    """
    Convert extracted table rows to markdown table format.
    
    Handles:
    1. Deduplication of consecutive identical rows (header repetition)
    2. Padding rows to a uniform column count
    3. Markdown pipe separators
    
    Args:
        rows: List of lists, each inner list is a row of cells
    
    Returns:
        Formatted markdown table string
    """
    logger.info("log_125: Starting format_markdown_table")

    def normalize_row(row):
        """Normalize row for deduplication: lowercase, collapse whitespace."""
        normalized_cells = [re.sub(r'\s+', ' ', cell.strip().lower()) for cell in row]
        return normalized_cells
    # Deduplicate consecutive identical rows (header rows appearing multiple times)
    deduplicated = []
    prev_normalized = None
    for row in rows:
        curr_normalized = normalize_row(row)
        if curr_normalized != prev_normalized:
            deduplicated.append(row)
            prev_normalized = curr_normalized
    rows = deduplicated

    # Nothing to format: return "" so callers' `if table_md:` checks skip this table
    if not rows:
        return ""

    # Determine column count
    num_cols = max(len(row) for row in rows)
    logger.info(f"log_128: Number of columns detected: {num_cols}")

    # Ensure all rows have same column count
    for row in rows:
        while len(row) < num_cols:
            row.append("")
    # Build markdown table
    markdown_lines = []
    # Header row
    markdown_lines.append("| " + " | ".join(rows[0]) + " |")
    # Separator row
    separator = "| " + " | ".join(["-" * max(1, len(cell)) for cell in rows[0]]) + " |"
    markdown_lines.append(separator)
    # Data rows
    for i, row in enumerate(rows[1:], start=1):
        logger.info(f"log_130: Adding data row {i}")
        markdown_lines.append("| " + " | ".join(row) + " |")
    return "\n".join(markdown_lines)


## Detect and Convert Tables in Page

**detect_and_convert_tables_in_page()**:
- Orchestrates table detection and conversion to markdown
- Combines header and data rows into properly formatted tables
- Returns markdown table strings and references to processed blocks
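
The way a header range is combined with the data range that follows it can be illustrated with plain tuples. This is a sketch under a simplifying assumption: `pair_header_and_data` is a hypothetical helper that pairs ranges on adjacency alone, whereas the function below also requires that thresholds were extracted for the header range.

```python
def pair_header_and_data(table_ranges):
    """Merge each range with the range that starts right after it ends."""
    merged = []
    i = 0
    while i < len(table_ranges):
        start, end = table_ranges[i]
        if i + 1 < len(table_ranges) and table_ranges[i + 1][0] == end + 1:
            # Adjacent ranges: treat the first as header, the second as data
            merged.append((start, table_ranges[i + 1][1]))
            i += 2
        else:
            merged.append((start, end))
            i += 1
    return merged


# Hypothetical ranges: blocks 4-5 are a header, 6-9 its data; 14 stands alone
ranges = pair_header_and_data([(4, 5), (6, 9), (14, 14)])
# ranges == [(4, 9), (14, 14)]
```

Pairing before formatting is what keeps each logical table to a single header row and a single separator row.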

In [146]:
def detect_and_convert_tables_in_page(all_blocks_with_raw_lines, page_num_1idx):
    """
    Master function: Detect tables on page and convert to markdown format.
    
    Returns a list of (start_block, end_block, markdown) tuples rather than one combined
    markdown string, so that parse_page_with_metadata() can insert each table at its block position.
    
    Args:
        all_blocks_with_raw_lines: List of (block_idx, raw_lines) tuples
        page_num_1idx: Page number (1-indexed, for debugging)
    
    Returns:
        Tuple of (table_list, set of (block_idx, line_idx) table references)
        where table_list = [(start_block, end_block, markdown_str), ...]
    """
    logger.info(f"log_139: Starting detect_and_convert_tables_in_page for page {page_num_1idx}")
    # Detect table block ranges
    table_ranges = detect_table_blocks_in_page(all_blocks_with_raw_lines)
    logger.info(f"log_140: Detected table ranges: {table_ranges}")
    if not table_ranges:
        logger.info("log_141: No table ranges found, returning empty list and set")
        return [], set()
    # Build a map of header ranges to their thresholds
    header_thresholds_map = {}  # Maps header_end_idx -> thresholds
    for idx, (table_start, table_end) in enumerate(table_ranges):
        if idx + 1 < len(table_ranges):
            next_start, next_end = table_ranges[idx + 1]
            if next_start == table_end + 1:
                rows = extract_table_rows_from_blocks(all_blocks_with_raw_lines, table_start, table_end)
                if rows:
                    block_dict = dict(all_blocks_with_raw_lines)
                    x_positions = []
                    for block_idx in range(table_start, table_end + 1):
                        header_block_lines = block_dict.get(block_idx, [])
                        for line in header_block_lines:
                            for span in line.get('spans', []):
                                x = span.get('origin', [0, 0])[0]
                                x_positions.append(x)
                    clustered_x = []
                    for x_val in sorted(set(x_positions)):
                        if not clustered_x or abs(x_val - clustered_x[-1]) > 30:
                            clustered_x.append(x_val)
                    header_thresholds_map[table_end] = sorted(clustered_x)
    # Extract and convert each table range
    all_table_markdown = []
    all_table_refs = set()
    table_ranges_with_idx = []  # Track (start, end, index_in_all_table_markdown)
    i = 0
    while i < len(table_ranges):
        table_start, table_end = table_ranges[i]
        logger.info(f"log_142: Processing table range {table_start}-{table_end}")
        if table_end in header_thresholds_map and i + 1 < len(table_ranges):
            next_start, next_end = table_ranges[i + 1]
            if next_start == table_end + 1:
                logger.info(f"log_143: Header range {table_start}-{table_end} followed by data range {next_start}-{next_end}")
                header_thresholds = header_thresholds_map.get(table_end, None)
                header_rows = extract_table_rows_from_blocks(all_blocks_with_raw_lines, table_start, table_end, header_thresholds)
                data_rows = extract_table_rows_from_blocks(all_blocks_with_raw_lines, next_start, next_end, header_thresholds)
                combined_rows = header_rows + data_rows
                table_md = format_markdown_table(combined_rows)
                if table_md:
                    logger.info(f"log_144: Table markdown generated for blocks {table_start}-{next_end}")
                    all_table_markdown.append(table_md)
                    table_ranges_with_idx.append((table_start, next_end, len(all_table_markdown) - 1))
                    for block_idx in range(table_start, next_end + 1):
                        all_table_refs.add((block_idx, 0))
                i += 2
                continue
        header_thresholds = None
        rows = extract_table_rows_from_blocks(all_blocks_with_raw_lines, table_start, table_end, header_thresholds)
        table_md = format_markdown_table(rows)
        if table_md:
            logger.info(f"log_145: Table markdown generated for blocks {table_start}-{table_end}")
            all_table_markdown.append(table_md)
            table_ranges_with_idx.append((table_start, table_end, len(all_table_markdown) - 1))
            for block_idx in range(table_start, table_end + 1):
                all_table_refs.add((block_idx, 0))
        i += 1
    # Build table list: (start_block, end_block, markdown)
    table_list = [(start, end, all_table_markdown[idx]) for start, end, idx in table_ranges_with_idx]
    logger.info(f"log_146: Returning {len(table_list)} table ranges for page {page_num_1idx}")
    return table_list, all_table_refs

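## Extract All Tables from Blocks

**extract_all_tables_from_blocks()**:
- Detects tables on a page and pairs each header range with the data range that immediately follows it
- Extracts data rows using the header's column thresholds, so each table has one header and one separator
- Returns all tables as a single markdown string, plus references to the processed blocks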
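
This function differs from `detect_and_convert_tables_in_page()` mainly in its return value: one combined markdown string instead of a list of positioned tables. A minimal sketch of that relationship, assuming a hypothetical `combine_tables` helper and made-up table contents:

```python
def combine_tables(table_list):
    """Join per-table markdown, separating tables with one blank line."""
    return "\n\n".join(md for _start, _end, md in table_list)


# Hypothetical (start_block, end_block, markdown) tuples
table_list = [
    (4, 9, "| A | B |\n| - | - |\n| 1 | 2 |"),
    (14, 14, "| C |\n| - |\n| 3 |"),
]
combined = combine_tables(table_list)
```

The combined form loses the block positions, so it suits callers that only need the tables themselves.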


In [147]:
def extract_all_tables_from_blocks(all_blocks_with_raw_lines, page_num_1idx):
    """
    Detect all tables on a page and return them as one combined markdown string.

    Key Logic:
    - Header blocks (extended or single-block ranges) are extracted to get column thresholds
    - Data blocks following headers are extracted using the HEADER's thresholds
    - Header and data rows are COMBINED into ONE table before formatting
    - This ensures only ONE header and separator per logical table
    - Handles extended headers (e.g., blocks 11-12 merged into single header range)
    
    Args:
        all_blocks_with_raw_lines: List of (block_idx, raw_lines) tuples
        page_num_1idx: Page number (1-indexed, for debugging)
    
    Returns:
        Tuple of (markdown_table_str, set of (block_idx, line_idx) table references)
    """
    logger.info(f"log_131: Starting extract_all_tables_from_blocks for page {page_num_1idx}")
    # Detect table block ranges
    table_ranges = detect_table_blocks_in_page(all_blocks_with_raw_lines)
    if not table_ranges:
        logger.info("log_132: No table ranges detected, returning empty string and set")
        return "", set()
    # Build a map of header ranges to their thresholds
    header_thresholds_map = {}  # Maps header_end_idx -> thresholds
    for idx, (table_start, table_end) in enumerate(table_ranges):
        # Check if there's a next range that immediately follows this range
        if idx + 1 < len(table_ranges):
            next_start, next_end = table_ranges[idx + 1]
            if next_start == table_end + 1:
                # This range is followed by data blocks
                # Extract thresholds from this header range
                rows = extract_table_rows_from_blocks(all_blocks_with_raw_lines, table_start, table_end)
                if rows:
                    block_dict = dict(all_blocks_with_raw_lines)
                    x_positions = []
                    for block_idx in range(table_start, table_end + 1):
                        header_block_lines = block_dict.get(block_idx, [])
                        for line in header_block_lines:
                            for span in line.get('spans', []):
                                x = span.get('origin', [0, 0])[0]
                                x_positions.append(x)
                    clustered_x = []
                    for x_val in sorted(set(x_positions)):
                        if not clustered_x or abs(x_val - clustered_x[-1]) > 30:
                            clustered_x.append(x_val)
                    header_thresholds_map[table_end] = sorted(clustered_x)
    # Extract and convert each table range
    all_table_markdown = []
    all_table_refs = set()
    i = 0
    while i < len(table_ranges):
        table_start, table_end = table_ranges[i]
        # Check if this range is a header (immediately followed by data blocks)
        if table_end in header_thresholds_map and i + 1 < len(table_ranges):
            next_start, next_end = table_ranges[i + 1]
            if next_start == table_end + 1:
                logger.info(f"log_133: Header range {table_start}-{table_end} followed by data range {next_start}-{next_end}")
                header_thresholds = header_thresholds_map.get(table_end, None)
                header_rows = extract_table_rows_from_blocks(all_blocks_with_raw_lines, table_start, table_end, header_thresholds)
                data_rows = extract_table_rows_from_blocks(all_blocks_with_raw_lines, next_start, next_end, header_thresholds)
                combined_rows = header_rows + data_rows
                table_md = format_markdown_table(combined_rows)
                if table_md:
                    logger.info(f"log_134: Table markdown generated for blocks {table_start}-{next_end}")
                    all_table_markdown.append(table_md)
                    for block_idx in range(table_start, next_end + 1):
                        all_table_refs.add((block_idx, 0))
                i += 2  # Skip both header and data ranges
                continue
        header_thresholds = None
        rows = extract_table_rows_from_blocks(all_blocks_with_raw_lines, table_start, table_end, header_thresholds)
        table_md = format_markdown_table(rows)
        if table_md:
            logger.info(f"log_135: Table markdown generated for blocks {table_start}-{table_end}")
            all_table_markdown.append(table_md)
            for block_idx in range(table_start, table_end + 1):
                all_table_refs.add((block_idx, 0))
        i += 1
    # Combine all table markdowns with newlines between them
    combined_markdown = "\n\n".join(all_table_markdown) if all_table_markdown else ""
    logger.info("log_136: Returning combined markdown and table refs")
    return combined_markdown, all_table_refs


## Main Extraction & Chapter Processing

**extract_and_split_by_chapter()**:
- Main orchestration function that processes the entire PDF
- For each chapter in CHAPTER_MAP:
  1. Initialize page state tracker
  2. Parse each page using parse_page_with_metadata()
  3. Track code block state across pages for continuity
  4. Close any open code blocks at chapter end
  5. Post-process to convert remaining bullet characters
  6. Generate output filename with prefix (00, 01-25 for chapters, A-E for appendices)
  7. Write markdown to file

**Outputs**: One markdown file per chapter in `teradataml_user_guide/` directory
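
The filename scheme in step 6 can be sketched as below. Several names here are assumptions: `APPENDIX_START_INDEX = 26` is inferred from the "01-25 for chapters" range above and may differ from the value the notebook defines, `slugify` is a hypothetical stand-in for `sanitize_title()`, and `chapter_prefix` and `output_path` are illustrative helpers.

```python
import os
import re

APPENDIX_START_INDEX = 26  # assumed value; the notebook defines its own


def chapter_prefix(idx, appendix_start=APPENDIX_START_INDEX):
    """Two-digit prefix for chapters, a letter (A, B, ...) for appendices."""
    if idx < appendix_start:
        return f"{idx:02d}"
    return chr(ord("A") + idx - appendix_start)


def slugify(title):
    """Hypothetical stand-in for sanitize_title(): keep word characters only."""
    return re.sub(r"\W+", "_", title).strip("_")


def output_path(idx, title, output_dir="teradataml_user_guide"):
    """Build the markdown path for the chapter at position idx."""
    return os.path.join(output_dir, f"{chapter_prefix(idx)}_{slugify(title)}.md")


path = output_path(3, "teradataml Components")
```

Zero-padded numeric prefixes keep the chapter files in reading order when the directory is sorted by name.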

In [148]:
# Main Extraction & Chapter Processing

def extract_and_split_by_chapter(pdf_path, chapter_map, output_dir="teradataml_user_guide"):
    """
    Extract chapters from PDF and split into markdown files.
    Handles multi-page code block continuity within chapters.
    
    Args:
        pdf_path: Path to PDF file
        chapter_map: List of (title, start_page, end_page) tuples
        output_dir: Output directory for markdown files
    """
    os.makedirs(output_dir, exist_ok=True)

    try:
        doc = fitz.open(pdf_path)
        total_pages = doc.page_count
        logger.info(f"log_001: Opened PDF {pdf_path} with {total_pages} pages")

        for idx, (title, start_page_1idx, end_page_1idx) in enumerate(chapter_map):
            logger.info(f"log_002: Processing chapter {idx}: '{title}' pages {start_page_1idx}-{end_page_1idx}")
            start_page_0idx = start_page_1idx - 1
            end_page_0idx = end_page_1idx - 1

            logger.info(f"log_004: Starting processing for chapter {idx}: '{title}'")
            chapter_text = [f"# {title}\n\n"] 
            page_state = {'in_code_block': False, 'pending_lines': []}
            
            # Process each page in chapter
            for page_num_0idx in range(start_page_0idx, end_page_0idx + 1):
                logger.info(f"log_005: Processing page {page_num_0idx + 1:04d} for chapter {idx}: '{title}'")
                page = doc.load_page(page_num_0idx)
                result = parse_page_with_metadata(page, page_num_0idx + 1, page_state)
                logger.info(f"log_006: Parsed page {page_num_0idx + 1:04d} for chapter {idx}: '{title}'")
                
                markdown_text = result['markdown']
                page_state = result['state']
                
                chapter_text.append(markdown_text)
                chapter_text.append('\n')
            
            # Close any open code block at end of chapter
            if page_state['in_code_block']:
                logger.info(f"log_007: Closing open code block for chapter {idx}: '{title}'")
                chapter_text.append("```\n")
            
            # Post-process: convert any remaining old bullet characters
            final_text = "".join(chapter_text).strip()
            final_text = convert_bullets_in_text(final_text)
            logger.info(f"log_008: Converted bullets for chapter {idx}: '{title}'")
            
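            # Determine output filename prefix: "00" for the table of contents,
            # a zero-padded index for chapters, a letter for appendices
            if idx == 0:
                prefix = "00"
                logger.info(f"log_009: Set prefix '00' for chapter {idx}: '{title}'")
            elif idx < APPENDIX_START_INDEX:
                prefix = f"{idx:02d}"
                logger.info(f"log_010: Set prefix '{prefix}' for chapter {idx}: '{title}'")
            else:
                # Assumption: appendices are lettered in order (A, B, ...), as the
                # section description states. Without this branch, `prefix` would
                # keep the value from the previous iteration.
                prefix = chr(ord("A") + idx - APPENDIX_START_INDEX)
                logger.info(f"log_011: Set prefix '{prefix}' for appendix {idx}: '{title}'")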
            
            safe_title = sanitize_title(title)
            logger.info(f"log_012: Sanitized title to '{safe_title}' for chapter {idx}: '{title}'")
            output_filename = os.path.join(output_dir, f"{prefix}_{safe_title}.md")
            
            # Write to file
            with open(output_filename, "w", encoding="utf-8") as f:
                f.write(final_text)
            logger.info(f"log_013: Wrote file '{output_filename}' for chapter {idx}: '{title}'")
            print(f"✓ Generated {output_filename}")
            
    except Exception as e:
        # Log at error level with the traceback so failures are not buried among info messages
        logger.exception(f"log_014: Error during extraction: {e}")
        print(f"Error during extraction: {e}")
    finally:
        if 'doc' in locals() and doc:
            doc.close()
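
Because the function converts each 1-indexed range to 0-indexed page numbers and loads pages directly, a bad entry in the chapter map only surfaces partway through a run. A pre-flight check is one way to catch that earlier. The sketch below is a hypothetical helper (`validate_chapter_map` is not part of the notebook) and assumes chapters are listed in page order without overlap.

```python
def validate_chapter_map(chapter_map, total_pages):
    """Return a list of problems found; an empty list means the map looks sane."""
    problems = []
    previous_end = 0
    for title, start, end in chapter_map:
        if start > end:
            problems.append(f"{title!r}: start page {start} is after end page {end}")
        if start < 1 or end > total_pages:
            problems.append(f"{title!r}: pages {start}-{end} fall outside 1-{total_pages}")
        if start <= previous_end:
            problems.append(f"{title!r}: starts at {start}, overlapping the previous chapter")
        previous_end = max(previous_end, end)
    return problems


# Two real entries from CHAPTER_MAP plus one deliberately broken entry
sample_map = [
    ("Table of Contents", 3, 6),
    ("Introduction to Teradata Package for Python", 7, 26),
    ("Broken example entry", 25, 40),
]
for problem in validate_chapter_map(sample_map, total_pages=30):
    print(problem)
```

In the notebook this could be called with `doc.page_count` as `total_pages` before the chapter loop starts.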

## Helper Functions: Title Sanitization & Noise Removal

**sanitize_title()**:
- Cleans chapter titles for safe filenames
- Removes chapter/appendix prefixes
- Removes special characters, keeping word characters, parentheses, and brackets
- Replaces runs of spaces and hyphens with underscores
- Example: "Chapter 1: DataFrames Setup" → "DataFrames_Setup"


In [149]:
def sanitize_title(title):
    """
    Clean chapter titles for use in filenames.
    
    Steps:
    1. Remove leading chapter/appendix prefix (e.g., "Chapter 1:", "Appendix A:")
    2. Remove special characters (keep only word chars, spaces, parentheses, brackets, hyphens)
    3. Replace spaces/hyphens with underscores
    
    Args:
        title: Raw chapter title from PDF
    
    Returns:
        Sanitized title safe for filenames
    
    Example:
        "Chapter 1: DataFrames Setup and Basics" → "DataFrames_Setup_and_Basics"
    """
    logger.info(f"log_015: Sanitizing title '{title}'")
    title = re.sub(r'^(Chapter\s\d{1,2}|Appendix\s[A-E]):\s*', '', title, flags=re.IGNORECASE)
    title = re.sub(r'[^\w\s()\[\]-]', '', title)
    title = re.sub(r'[\s-]+', '_', title).strip('_')
    return title
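
A quick way to see the three steps in action is to run the same regular expressions on a few titles. The sketch below repeats the function body without the `logger` call so that it runs on its own. The appendix title is a made-up example.

```python
import re


def sanitize_title(title):
    """Same three regex steps as above, minus logging."""
    # 1. Drop a leading "Chapter N:" or "Appendix X:" prefix
    title = re.sub(r'^(Chapter\s\d{1,2}|Appendix\s[A-E]):\s*', '', title, flags=re.IGNORECASE)
    # 2. Remove everything except word chars, whitespace, (), [] and hyphens
    title = re.sub(r'[^\w\s()\[\]-]', '', title)
    # 3. Collapse runs of whitespace/hyphens into single underscores
    return re.sub(r'[\s-]+', '_', title).strip('_')


examples = [
    "Chapter 1: DataFrames Setup and Basics",
    "DataFrames Setup and Basics (Sources, Non-Default DB, UAF)",
    "Appendix A: Example Title",  # hypothetical title, for illustration only
]
for raw in examples:
    print(f"{raw!r} -> {sanitize_title(raw)!r}")
```

The second example shows that commas are dropped while parentheses survive, which matches the generated filename for chapter 04.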

## Clean Page Noise

**clean_page_noise()**:
- Removes header, footer, and page number artifacts from extracted text
- Cleans up formatting issues like non-breaking spaces
- Returns clean, readable text for further processing

In [150]:
def clean_page_noise(text, page_number):
    """
    Strip header/footer noise and non-breaking spaces from extracted text.
    
    Removes:
    1. File header ("Teradata® Package for Python User Guide, Release 20.00")
    2. Page numbers in headers/footers
    3. Chapter title fragments used in headers
    4. Multiple consecutive newlines
    5. Non-breaking spaces (U+00A0)
    
    Args:
        text: Raw extracted text from PDF page
        page_number: Current page number (1-indexed, used to find page number in headers)
    
    Returns:
        Clean text with noise removed
    """
    logger.info(f"log_016: {page_number:04d} Cleaning page noise")
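    # Remove the running file header
    FILE_NAME_REGEX = re.escape("Teradata® Package for Python User Guide, Release 20.00")
    text = re.sub(FILE_NAME_REGEX, '', text)
    # Remove the page number only where it stands alone on a line; an unanchored
    # match would also delete the same digits inside body text (e.g. "Engine 20")
    page_num_regex = re.escape(str(page_number))
    text = re.sub(rf'^[ \t]*{page_num_regex}[ \t]*$', '', text, flags=re.MULTILINE)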
    chapter_title_fragments = [
        r'Context to Teradata Vantage', r'teradataml DataFrame Column', r'Executing Python Functions Inside Database',
        r'DataFrames for Tables and Views', r'teradataml Window Aggregates', r'teradataml Options',
        r'teradataml Utility and General Functions', r'Engine 20', r'Table and Views',
        r'Installing, Uninstalling, and Upgrading Teradata Package for Python'
    ]
    fragment_pattern = r'|'.join(re.escape(f) for f in chapter_title_fragments)
    noise_regex = re.compile(
        r'^\s*(\d{1,2}:\s*.*(?:' + fragment_pattern + r').*|.*(?:' + fragment_pattern + r')\s*\d{1,2}\s*)\s*$', 
        flags=re.IGNORECASE | re.MULTILINE
    )
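    # Drop header/footer lines that match a chapter-title fragment
    text = noise_regex.sub('', text).strip()
    # Collapse runs of blank lines into a single blank line
    text = re.sub(r'\n\s*\n', '\n\n', text).strip()
    # Replace non-breaking spaces with regular spaces
    text = text.replace('\u00a0', ' ')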
    return text
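
The page-number step is sensitive to how its pattern is anchored. The sketch below uses a made-up page of text (not taken from the PDF) to compare a pattern that matches the digits anywhere with one that matches them only on a line of their own.

```python
import re

# Hypothetical page text: a header line, a standalone page number, and a body sentence
sample = "Teradata Vantage\n20\nDatabase Engine 20 supports 20 new functions."
page_number = 20
num = re.escape(str(page_number))

# Unanchored: every "20" is replaced, including the ones inside the sentence
unanchored = re.sub(rf'\s*{num}\s*', '\n', sample)

# Anchored: only a line consisting solely of the page number is removed
anchored = re.sub(rf'^[ \t]*{num}[ \t]*$', '', sample, flags=re.MULTILINE)

print(repr(unanchored))
print(repr(anchored))
```

The anchored form leaves an empty line behind, which the later blank-line collapse can absorb.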

## Determine Code Lines

**is_code_line()**:
- Distinguishes between Python code and execution output
- Checks for REPL prompts and code font characteristics
- Returns boolean indicating if a line contains code

In [151]:
def is_code_line(line_text_clean, line_font):
    """
    Determine if a line is actual Python code (vs output from code execution).
    
    A line is considered code only if both conditions hold:
    1. It's in the CODE_FONT (Consolas)
    2. It starts with a Python REPL prompt (>>> or ...)
    
    Used to distinguish between:
    - Code: ">>> result = df.head()"
    - Output: "Name  Age  Score"
    
    Args:
        line_text_clean: The text content of the line (already trimmed)
        line_font: Font name extracted from PDF (e.g., 'Consolas', 'Arial')
    
    Returns:
        True if the line is in CODE_FONT and starts with a REPL prompt, False otherwise (e.g., output)
    """
    logger.info(f"log_137: Checking if code line with font '{line_font}' and text '{line_text_clean}'")
    if line_font != CODE_FONT:
        logger.info("log_138: Font does not match CODE_FONT, returning False")
        return False
    # Within the code font, only lines with a REPL prompt count as code
    return line_text_clean.startswith(('>>>', '...'))
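
The combined effect of the two checks is easiest to see on a few sample lines. The sketch below repeats the function without logging so that it runs on its own. `CODE_FONT = "Consolas"` is an assumption taken from the docstring, since the constant is defined elsewhere in the notebook.

```python
CODE_FONT = "Consolas"  # assumed value, per the docstring above


def is_code_line(line_text_clean, line_font):
    """Code means: rendered in the code font AND starting with a REPL prompt."""
    if line_font != CODE_FONT:
        return False
    return line_text_clean.startswith(('>>>', '...'))


samples = [
    (">>> result = df.head()", "Consolas"),  # prompt + code font -> code
    ("... .sort('Age')", "Consolas"),        # continuation prompt -> code
    ("Name  Age  Score", "Consolas"),        # code font, no prompt -> output
    (">>> result = df.head()", "Arial"),     # prompt, wrong font -> not code
]
for text, font in samples:
    print(f"{font:9} {text!r:28} -> {is_code_line(text, font)}")
```

Note that code typed without a REPL prompt would be classified as output under this rule.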


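## Export Raw Page Metadata

**export_raw_page_metadata()**:
- Dumps the raw `page.get_text("dict")` structure for a single page to a file as JSON
- Intended for debugging how a specific page is laid out (fonts, spans, positions)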

In [152]:
def export_raw_page_metadata(pdf_path, page_num_1idx, output_filename):
    """Export raw PDF metadata for a specific page (for debugging)."""
    try:
        doc = fitz.open(pdf_path)
        if 0 < page_num_1idx <= doc.page_count:
            page_num_0idx = page_num_1idx - 1
            page = doc.load_page(page_num_0idx)
            data = page.get_text("dict")
            with open(output_filename, "w", encoding="utf-8") as f:
                f.write(json.dumps(data, indent=4))
            print(f"✨ Successfully exported raw metadata for page {page_num_1idx} to '{output_filename}'.")
        else:
            print(f"Error: Page number {page_num_1idx} is out of bounds.")
    except FileNotFoundError:
        print(f"Error: PDF file not found at '{pdf_path}'.")
    except Exception as e:
        print(f"Error during metadata export: {e}")
    finally:
        if 'doc' in locals() and doc:
            doc.close()
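
Once a page has been exported, the nested structure can be summarized rather than read by eye. The sketch below counts characters per font. It assumes the blocks → lines → spans layout that PyMuPDF documents for `get_text("dict")`, and it runs on a hand-built `sample_page` rather than real PDF output. `summarize_fonts` is a hypothetical helper.

```python
from collections import Counter


def summarize_fonts(page_dict):
    """Count characters of text per font in a get_text("dict")-style structure."""
    counts = Counter()
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so .get() skips them
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                counts[span["font"]] += len(span["text"])
    return counts


# Hand-built stand-in for one page's dict output
sample_page = {
    "blocks": [
        {"type": 0, "lines": [
            {"spans": [{"font": "Arial", "size": 10.0, "text": "Example 1"}]},
            {"spans": [{"font": "Consolas", "size": 9.0, "text": ">>> df.head()"}]},
        ]},
        {"type": 1},  # image block: no text lines
    ]
}
print(summarize_fonts(sample_page))
```

A summary like this gives a quick check on which font name to use for `CODE_FONT` on a given page.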

In [153]:
export_raw_page_metadata(PDF_FILE, 204, "page_204_raw_metadata_debug.txt")

✨ Successfully exported raw metadata for page 204 to 'page_204_raw_metadata_debug.txt'.


## Execution: Extract and Process PDF

This cell runs `extract_and_split_by_chapter()` over `CHAPTER_MAP` if `PDF_FILE` exists, and reports an error otherwise.

In [None]:
# Execution: Extract and Process PDF

if os.path.exists(PDF_FILE):
    extract_and_split_by_chapter(PDF_FILE, CHAPTER_MAP)
    print("\n✅ Extraction complete with all bullet characters converted to markdown!")
else:
    print(f"Error: PDF file not found at '{PDF_FILE}'.")

✓ Generated teradataml_user_guide/00_Table_of_Contents.md
✓ Generated teradataml_user_guide/01_Introduction_to_Teradata_Package_for_Python.md
✓ Generated teradataml_user_guide/02_Installing_Uninstalling_and_Upgrading_Teradata_Package_for_Python.md
✓ Generated teradataml_user_guide/03_teradataml_Components.md
✓ Generated teradataml_user_guide/04_DataFrames_Setup_and_Basics_(Sources_Non_Default_DB_UAF).md
✓ Generated teradataml_user_guide/05_DataFrame_Manipulation_(Core_API).md
✓ Generated teradataml_user_guide/06_DataFrame_Metadata_Rotation_Saving_and_Export.md
✓ Generated teradataml_user_guide/07_Executing_Python_Functions_Inside_Database_Engine_20.md
✓ Generated teradataml_user_guide/08_teradataml_DataFrame_Column.md
✓ Generated teradataml_user_guide/09_teradataml_Window_Aggregates.md
✓ Generated teradataml_user_guide/10_Context_to_Teradata_Vantage.md
✓ Generated teradataml_user_guide/11_teradataml_Options.md
✓ Generated teradataml_user_guide/12_teradataml_Utility_and_General_Function

In [None]:
# Export raw metadata for page 286 to debug table issue
export_raw_page_metadata(PDF_FILE, 286, "page_286_raw_metadata_debug.txt")

✨ Successfully exported raw metadata for page 286 to 'page_286_raw_metadata_debug.txt'.


In [None]:
# tests