# Page Fetching for Pagination Strategies

This notebook loads pagination plans from Stage 1 and executes the actual data fetching for each strategy.

## Strategies:
1. **Full Table** - Single query for entire table
2. **Row by Row** - Fetch each row individually using key values
3. **Attribute-based** - Fetch pages by partition values
4. **Classic Pagination** - Offset-based iterative fetching
5. **Range-based Alphabetic** - Fetch by alphabetic ranges (A-F, G-L, etc.)
6. **Range-based Semantic** - Fetch by semantic categories (continents, etc.)
7. **Range-based Unrestricted** - Fetch by unrestricted ranges (benchmark)
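
To make the list above concrete, here is a rough sketch of the `pagination_config` keys that each fetch function in this notebook reads from a plan. The key names are taken from the code further down; the helper `missing_config_keys` and the sample plan values are hypothetical and exist only for illustration.

```python
from typing import Any, Dict, List

# Keys each strategy reads from plan['pagination_config'] in this notebook.
# full_table only reads plan['metadata'], so it needs no config keys.
REQUIRED_CONFIG_KEYS: Dict[str, List[str]] = {
    "full_table": [],
    "row_by_row": ["key_columns", "key_values"],
    "attribute_based": ["partition_column", "partition_values"],
    "classic_pagination": ["page_size", "sort_order"],
    "range_based_alphabetic": ["partition_column", "ranges"],
    "range_based_semantic": ["partition_column", "ranges"],
    "range_based_unrestricted": ["partition_column", "ranges"],
}


def missing_config_keys(strategy: str, plan: Dict[str, Any]) -> List[str]:
    """Return the config keys a plan lacks for the given strategy (hypothetical helper)."""
    config = plan.get("pagination_config", {})
    return [key for key in REQUIRED_CONFIG_KEYS[strategy] if key not in config]


# Made-up plan, not taken from the real Stage 1 output.
example_plan = {
    "table_id": "t001",
    "metadata": {"table_title": "countries of the world", "columns": ["Country", "Capital"]},
    "pagination_config": {"page_size": 10},
}
print(missing_config_keys("classic_pagination", example_plan))  # ['sort_order']
```

A check like this could run before any LLM call, so that an incomplete plan fails fast instead of raising a `KeyError` midway through a run.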

## Output:
- CSV files: Aggregated table data (for metrics calculation)
- JSON files: Detailed execution logs with per-page metrics
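
As a minimal sketch of how the JSON logs might be consumed afterwards, the snippet below sums a few per-page fields (`parse_success`, `rows_returned`, `latency`) that the fetch functions record for every page. The helper name `summarize_log` and the sample log are assumptions made for the demo, not part of the pipeline.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def summarize_log(json_path: Path) -> Dict[str, Any]:
    """Summarize per-page metrics from one execution log (hypothetical helper)."""
    with open(json_path, "r", encoding="utf-8") as f:
        log = json.load(f)
    pages = log.get("pages", [])
    return {
        "pages": len(pages),
        "parsed_ok": sum(1 for p in pages if p.get("parse_success", False)),
        "rows": sum(p.get("rows_returned", 0) for p in pages),
        "latency": round(sum(p.get("latency", 0) for p in pages), 3),
    }


# Made-up log with two pages, written to a temporary file for the demo.
sample_log = {
    "table_id": "t001",
    "pages": [
        {"page_number": 1, "parse_success": True, "rows_returned": 12, "latency": 1.5},
        {"page_number": 2, "parse_success": False, "rows_returned": 0, "latency": 0.75},
    ],
}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "t001.json"
    path.write_text(json.dumps(sample_log), encoding="utf-8")
    summary = summarize_log(path)

print(summary)
```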

## Setup and Imports

In [1]:
from __future__ import annotations

import json
import os
import re
import time
import pandas as pd
from pathlib import Path
from datetime import datetime
from typing import Dict, List, Any, Optional, Tuple
from concurrent.futures import ThreadPoolExecutor, as_completed
import openai

# ===== CONFIGURATION =====
# MAX_TABLES: Limit number of tables to process (None for all)
MAX_TABLES = 20

# MAX_PAGES: Limit pages per table per strategy (None for all, useful for testing)
MAX_PAGES = None

# PARALLEL_STRATEGIES: Run strategies in parallel per table
PARALLEL_STRATEGIES = True

# MAX_WORKERS: Number of parallel strategy executions per table
MAX_WORKERS = 5

# MAX_RETRIES: Number of retries for failed LLM calls
MAX_RETRIES = 3

# MAX_PAGINATION_PAGES: Failsafe for classic_pagination
MAX_PAGINATION_PAGES = 10

# Paths
ROOT = Path('.')
PROCESSING_ROOT = ROOT / 'processing'
STRATEGY_DIR = PROCESSING_ROOT / '1_strategy'
OUTPUT_ROOT = PROCESSING_ROOT / '2_fetching'

# Find the most recent strategy directory
strategy_subdirs = sorted([d for d in STRATEGY_DIR.iterdir() if d.is_dir()], reverse=True)
if not strategy_subdirs:
    raise FileNotFoundError(f"No strategy data found in {STRATEGY_DIR}")

LATEST_STRATEGY_DIR = strategy_subdirs[0]

# Find ground truth data directory
DATA_DIR = PROCESSING_ROOT / '0_data'
data_subdirs = sorted([d for d in DATA_DIR.iterdir() if d.is_dir()], reverse=True)
if not data_subdirs:
    raise FileNotFoundError(f"No ground truth data found in {DATA_DIR}")

LATEST_DATA_DIR = data_subdirs[0]

print('Strategy directory:', LATEST_STRATEGY_DIR.resolve())
print('Ground truth directory:', LATEST_DATA_DIR.resolve())
print('Output base:', OUTPUT_ROOT.resolve())

Strategy directory: /Users/bef/Desktop/TablePagination/processing/1_strategy/20251015_230052
Ground truth directory: /Users/bef/Desktop/TablePagination/processing/0_data/20251004_213355
Output base: /Users/bef/Desktop/TablePagination/processing/2_fetching
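
The "most recent directory" lookup above relies on the run folders being named `YYYYMMDD_HHMMSS`: for fixed-width, zero-padded timestamps, lexicographic order matches chronological order. A small self-contained sketch of that idea, using a temporary directory and the hypothetical helper `latest_subdir`:

```python
import tempfile
from pathlib import Path
from typing import Optional


def latest_subdir(parent: Path) -> Optional[Path]:
    """Return the subdirectory whose name sorts last, or None if there is none."""
    subdirs = sorted((d for d in parent.iterdir() if d.is_dir()), reverse=True)
    return subdirs[0] if subdirs else None


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ["20251004_213355", "20251015_230052", "20250930_101500"]:
        (root / name).mkdir()
    (root / "notes.txt").write_text("not a directory")  # plain files are ignored
    latest = latest_subdir(root)
    latest_name = latest.name if latest else None

print(latest_name)  # 20251015_230052
```

Note that this only holds while every folder follows the same naming pattern; a stray directory such as `backup` would sort after all timestamps and be picked instead.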


In [None]:
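# Configure the API client
# Read the API key from api_key.txt, falling back to the OPENROUTER_API_KEY environment variable
api_key_file = ROOT / 'api_key.txt'
if api_key_file.exists():
    with open(api_key_file, 'r') as f:
        OPENROUTER_API_KEY = f.read().strip()
    print('✓ API key loaded from api_key.txt')
else:
    OPENROUTER_API_KEY = os.environ.get('OPENROUTER_API_KEY', '').strip()
    if not OPENROUTER_API_KEY:
        raise ValueError('No API key found. Please create api_key.txt or set OPENROUTER_API_KEY environment variable')
    print('✓ API key loaded from OPENROUTER_API_KEY environment variable')

client = openai.OpenAI(
    api_key=OPENROUTER_API_KEY,
    # base_url="https://openrouter.ai/api/v1"  # uncomment to route requests through OpenRouter
)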

MODEL = 'gpt-4o-mini-search-preview-2025-03-11'
print(f'Using model: {MODEL}')

✓ API key loaded from api_key.txt
Using model: gpt-4o-mini-search-preview-2025-03-11


## Helper Functions

In [3]:
def normalize_field(s: str) -> str:
    """Normalize field names (matching original experiment logic)."""
    s = s.lower().replace(" ","_").replace("-","_").replace(".", "").replace(",","_")\
            .replace("(", "").replace(")", "").replace(":", "").replace('"','').replace("'","")\
            .replace("/", "")
    return re.sub('_+', '_', s)


def call_llm_with_metrics_split(system_msg: str, user_msg: str, retry_count: int = 0) -> Dict[str, Any]:
    """
    Call LLM with separate system and user messages (matching original experiment).
    """
    start_time = time.time()
    prompt_length = len(system_msg) + len(user_msg)
    
    try:
        response = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "system", "content": system_msg},
                     {"role": "user", "content": user_msg}],
            # temperature=0
        )
        
        latency = time.time() - start_time
        content = response.choices[0].message.content
        
        # Check if content is None
        if content is None:
            return {
                'success': False,
                'error': 'Response content is None',
                'metrics': {
                    'latency': round(latency, 3),
                    'prompt_length': prompt_length,
                    'retry_count': retry_count,
                    'timestamp': datetime.now().isoformat()
                }
            }
        
        content = content.strip()
        response_length = len(content)
        
        # Extract token usage
        usage = response.usage
        prompt_tokens = usage.prompt_tokens if usage else 0
        completion_tokens = usage.completion_tokens if usage else 0
        total_tokens = usage.total_tokens if usage else 0
        
        return {
            'success': True,
            'response': content,
            'metrics': {
                'latency': round(latency, 3),
                'prompt_tokens': prompt_tokens,
                'completion_tokens': completion_tokens,
                'total_tokens': total_tokens,
                'prompt_length': prompt_length,
                'response_length': response_length,
                'retry_count': retry_count,
                'timestamp': datetime.now().isoformat()
            }
        }
    except Exception as e:
        latency = time.time() - start_time
        return {
            'success': False,
            'error': str(e),
            'metrics': {
                'latency': round(latency, 3),
                'prompt_length': prompt_length,
                'retry_count': retry_count,
                'timestamp': datetime.now().isoformat()
            }
        }


def call_llm_with_retries(prompt: Optional[str] = None, system_msg: Optional[str] = None, user_msg: Optional[str] = None, max_retries: int = MAX_RETRIES) -> Dict[str, Any]:
    """Call LLM with retry logic. Supports either single prompt or system+user messages."""
    # Default result so the function still returns a well-formed dict if max_retries is 0
    result: Dict[str, Any] = {'success': False, 'error': 'No attempts made', 'metrics': {}}
    for attempt in range(max_retries):
        if system_msg and user_msg:
            result = call_llm_with_metrics_split(system_msg, user_msg, retry_count=attempt)
        else:
            result = call_llm_with_metrics_split("You are a retriever of facts.", prompt, retry_count=attempt)
        if result['success']:
            return result
        print(f"  Attempt {attempt + 1}/{max_retries} failed: {result['error']}")
        if attempt < max_retries - 1:
            time.sleep(1)  # Brief delay before the next attempt
    
    return result  # Return last failed attempt


def parse_json_response(response: str, expect_list: bool = True) -> Tuple[Optional[Any], Dict[str, Any]]:
    """
    Extract and parse JSON from LLM response (matching original parsing logic).
    
    Args:
        response: The raw LLM response text
        expect_list: True for list responses, False for dict responses
    
    Returns:
        - parsed_data: List/Dict or None
        - parse_metrics: dict with parse_success, json_start_pos, json_end_pos
    """
    metrics = {
        'parse_success': False,
        'json_start_pos': -1,
        'json_end_pos': -1
    }
    
    try:
        # Matching original logic: find JSON boundaries
        if expect_list:
            start_char, end_char = '[', ']'
        else:
            start_char, end_char = '{', '}'
        
        # Extract JSON portion
        start_pos = response.find(start_char)
        end_pos = response.rfind(end_char)
        
        if start_pos != -1 and end_pos != -1:
            json_str = response[start_pos:end_pos + 1]
            metrics['json_start_pos'] = start_pos
            metrics['json_end_pos'] = end_pos + 1
            
            parsed = json.loads(json_str)
            
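            # Handle wrapped responses such as {"rows": [...]}; a single-field row
            # with a scalar value is a genuine row and must not be unwrapped
            if isinstance(parsed, dict) and len(parsed) == 1:
                inner = next(iter(parsed.values()))
                if isinstance(inner, (dict, list)):
                    parsed = inner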
            
            metrics['parse_success'] = True
            return parsed, metrics
        
        # Fallback: manual parsing (matching original fallback logic)
        if not expect_list and start_char not in response and end_char not in response:
            return None, metrics
            
        split_response = response.split("{")
        response_json = []
        for s in split_response[1:]:
            split_s = s.split("}")
            if len(split_s) > 1:
                content = split_s[0]
                attributes = content.split(",")
                elements = {}
                for attr in attributes:
                    knv = attr.split(":")
                    if len(knv) > 1:
                        parsed_k = "%s" % knv[0].replace('"','').strip()
                        parsed_v = "%s" % knv[1].replace('"','').strip()
                        elements[parsed_k] = parsed_v
                
                if elements:
                    response_json.append(elements)
        
        if response_json:
            metrics['parse_success'] = True
            return (response_json if expect_list else response_json[0]), metrics
        
        return None, metrics
        
    except Exception as e:
        # Covers json.JSONDecodeError as well as any unexpected parsing failure
        metrics['parse_error'] = str(e)
        return None, metrics

## Strategy 1: Full Table

In [4]:
def fetch_full_table(plan: Dict[str, Any]) -> Dict[str, Any]:
    """
    Fetch entire table in one query (matching original B004 approach).
    """
    meta = plan['metadata']
    table_title = meta['table_title']
    columns = meta['columns']
    table_id = plan['table_id']
    
    # Normalize field names and build response format (matching B004)
    num_fields = len(columns)
    norm_fields = [normalize_field(f) for f in columns]
    fields_json = []
    for field in norm_fields:
        fields_json.append(f'"{field}": "{field}"')
    response_format = ', '.join(fields_json)
    
    # Build prompt matching original B004 template exactly
    system_msg = "You are a retriever of facts."
    user_msg = f"""List {table_title} - as many as possible to fit into response.
The response will be formatted as JSON shown below.
Each element of the response will contain {num_fields} fields: {', '.join(columns)}.
Do not output any additional text that is not in JSON format.

RESPONSE FORMAT:
[{{
    {response_format}
}}]"""
    
    print(f"  Fetching full table...")
    
    # Call LLM with separate system and user messages
    result = call_llm_with_retries(system_msg=system_msg, user_msg=user_msg)
    
    page_data = {
        'page_number': 1,
        'system_msg': system_msg,
        'user_msg': user_msg,
        'prompt': user_msg,  # For backwards compatibility
        'raw_response': result.get('response', ''),
        'error': result.get('error'),
        **result['metrics']
    }
    
    if result['success']:
        parsed_data, parse_metrics = parse_json_response(result['response'], expect_list=True)
        page_data.update(parse_metrics)
        page_data['rows_returned'] = len(parsed_data) if parsed_data else 0
        page_data['parsed_data'] = parsed_data if parsed_data else []
    else:
        page_data['parse_success'] = False
        page_data['rows_returned'] = 0
        page_data['parsed_data'] = []
    
    return {
        'pages': [page_data],
        'all_rows': page_data['parsed_data']
    }
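
To see what the `RESPONSE FORMAT` part of the prompt looks like once rendered, here is a small sketch that builds it for three hypothetical, already-normalized field names. `build_response_format` is an illustrative helper; it mirrors the loop in `fetch_full_table` but is not part of the notebook.

```python
from typing import List


def build_response_format(fields: List[str]) -> str:
    """Render placeholder JSON pairs such as '"name": "name"' for each field."""
    return ", ".join(f'"{field}": "{field}"' for field in fields)


# Hypothetical, already-normalized field names.
fields = ["country", "capital", "population"]
template = f"""RESPONSE FORMAT:
[{{
    {build_response_format(fields)}
}}]"""
print(template)
```

The doubled braces in the f-string are escapes, so the model sees a single `[{` and `}]` around the placeholder pairs.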

## Strategy 2: Row by Row

In [5]:
def fetch_row_by_row(plan: Dict[str, Any]) -> Dict[str, Any]:
    """
    Fetch each row individually using key values from plan (matching original B008 approach).
    """
    meta = plan['metadata']
    config = plan['pagination_config']
    
    table_title = meta['table_title']
    columns = meta['columns']
    key_columns = config['key_columns']
    key_values = config['key_values']
    table_id = plan['table_id']
    
    # Normalize field names
    norm_fields = [normalize_field(f) for f in columns]
    norm_keys = [normalize_field(k) for k in key_columns]
    
    pages = []
    all_rows = []
    
    # Apply MAX_PAGES limit if set
    keys_to_fetch = key_values[:MAX_PAGES] if MAX_PAGES else key_values
    total_keys = len(key_values)
    
    if MAX_PAGES and len(key_values) > MAX_PAGES:
        print(f"  Limited to first {MAX_PAGES} rows out of {total_keys}")
    
    for page_num, key_combo in enumerate(keys_to_fetch, 1):
        # Build WHERE clause for this row
        key_conditions = []
        for key in key_columns:
            norm_key = normalize_field(key)
            key_value = key_combo.get(norm_key, 'UNKNOWN')
            key_conditions.append(f"{key} = {key_value}")
        row_key = '(' + ', '.join(key_conditions) + ')'
        
        # Build response format with key values filled in (matching B008)
        fields_json = []
        for field in norm_fields:
            # Use key values where available, field names otherwise
            if field in key_combo:
                fields_json.append(f'"{field}": "{key_combo[field]}"')
            else:
                fields_json.append(f'"{field}": "{field}"')
        response_format = ', '.join(fields_json)
        
        # Build prompt matching original B008 template exactly
        system_msg = "You are a retriever of facts."
        key_column_desc = f"The key column{'s' if len(key_columns) > 1 else ''} in the table {'are' if len(key_columns) > 1 else 'is'} {', '.join(key_columns)}"
        user_msg = f"""We want to create a table with the detailed information about {table_title}.
Columns in the table are {', '.join(columns)}.
{key_column_desc}.
Retrieve a single row whose key is {row_key}.
The response will be formatted as JSON dictionary shown below.
Pay special attention to wrap all property names and values in double quotes!

RESPONSE FORMAT:
{{
    {response_format}
}}"""
        
        print(f"  Fetching row {page_num}/{len(keys_to_fetch)}...")
        
        result = call_llm_with_retries(system_msg=system_msg, user_msg=user_msg)
        
        page_data = {
            'page_number': page_num,
            'key_values': key_combo,
            'system_msg': system_msg,
            'user_msg': user_msg,
            'prompt': user_msg,
            'raw_response': result.get('response', ''),
            'error': result.get('error'),
            **result['metrics']
        }
        
        if result['success']:
            # Parse as dict (single row), then wrap in list
            parsed_dict, parse_metrics = parse_json_response(result['response'], expect_list=False)
            page_data.update(parse_metrics)
            
            if parsed_dict:
                parsed_data = [parsed_dict]  # Wrap dict in list
                page_data['rows_returned'] = 1
                page_data['parsed_data'] = parsed_data
                all_rows.extend(parsed_data)
            else:
                page_data['rows_returned'] = 0
                page_data['parsed_data'] = []
        else:
            page_data['parse_success'] = False
            page_data['rows_returned'] = 0
            page_data['parsed_data'] = []
        
        pages.append(page_data)
    
    return {
        'pages': pages,
        'all_rows': all_rows
    }
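
The two pieces that distinguish a row-by-row prompt are the row key and a response format in which the known key values are pre-filled. The sketch below reproduces both for a made-up composite key; it assumes, for brevity, that the key column names are already normalized, and both helper names are hypothetical.

```python
from typing import Dict, List


def render_row_key(key_columns: List[str], key_combo: Dict[str, str]) -> str:
    """Render the key as '(col = value, ...)', using 'UNKNOWN' for missing values."""
    conditions = [f"{key} = {key_combo.get(key, 'UNKNOWN')}" for key in key_columns]
    return "(" + ", ".join(conditions) + ")"


def render_row_format(fields: List[str], key_combo: Dict[str, str]) -> str:
    """Pre-fill key fields with their values; other fields keep their name as placeholder."""
    return ", ".join(f'"{field}": "{key_combo.get(field, field)}"' for field in fields)


# Made-up example with a two-column key.
fields = ["country", "year", "gdp"]
key_columns = ["country", "year"]
key_combo = {"country": "France", "year": "2020"}

print(render_row_key(key_columns, key_combo))  # (country = France, year = 2020)
print(render_row_format(fields, key_combo))    # "country": "France", "year": "2020", "gdp": "gdp"
```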

## Strategy 3: Attribute-based

In [6]:
def fetch_attribute_based(plan: Dict[str, Any]) -> Dict[str, Any]:
    """
    Fetch pages by partition values (matching original approach).
    """
    meta = plan['metadata']
    config = plan['pagination_config']
    
    table_title = meta['table_title']
    columns = meta['columns']
    partition_column = config['partition_column']
    partition_values = config['partition_values']
    table_id = plan['table_id']
    
    # Normalize field names
    num_fields = len(columns)
    norm_fields = [normalize_field(f) for f in columns]
    
    pages = []
    all_rows = []
    
    # Apply MAX_PAGES limit
    values_to_fetch = partition_values[:MAX_PAGES] if MAX_PAGES else partition_values
    total_values = len(partition_values)
    
    if MAX_PAGES and len(partition_values) > MAX_PAGES:
        print(f"  Limited to first {MAX_PAGES} partitions out of {total_values}")
    
    for page_num, value in enumerate(values_to_fetch, 1):
        # Build response format
        fields_json = [f'"{field}": "{field}"' for field in norm_fields]
        response_format = ', '.join(fields_json)
        
        # Build prompt matching original approach
        system_msg = "You are a retriever of facts."
        user_msg = f"""List rows from {table_title} where {partition_column} = {value}.
The response will be formatted as JSON shown below.
Each element of the response will contain {num_fields} fields: {', '.join(columns)}.
Do not output any additional text that is not in JSON format.

RESPONSE FORMAT:
[{{
    {response_format}
}}]"""
        
        print(f"  Fetching partition {page_num}/{len(values_to_fetch)}: {partition_column}={value}...")
        
        result = call_llm_with_retries(system_msg=system_msg, user_msg=user_msg)
        
        page_data = {
            'page_number': page_num,
            'partition_value': value,
            'system_msg': system_msg,
            'user_msg': user_msg,
            'prompt': user_msg,
            'raw_response': result.get('response', ''),
            'error': result.get('error'),
            **result['metrics']
        }
        
        if result['success']:
            parsed_data, parse_metrics = parse_json_response(result['response'], expect_list=True)
            page_data.update(parse_metrics)
            page_data['rows_returned'] = len(parsed_data) if parsed_data else 0
            page_data['parsed_data'] = parsed_data if parsed_data else []
            
            if parsed_data:
                all_rows.extend(parsed_data)
        else:
            page_data['parse_success'] = False
            page_data['rows_returned'] = 0
            page_data['parsed_data'] = []
        
        pages.append(page_data)
    
    return {
        'pages': pages,
        'all_rows': all_rows
    }
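
Pages fetched for different partition values are concatenated without any cross-page check, so the same row may appear more than once if the model places it in two partitions. A minimal sketch of counting such duplicates, independent of key order, is shown below; `count_duplicates` is a hypothetical helper and the rows are invented.

```python
from typing import Any, Dict, List, Tuple


def count_duplicates(rows: List[Dict[str, Any]]) -> Tuple[int, int]:
    """Return (unique_count, duplicate_count); rows with equal key/value pairs count as the same."""
    seen = set()
    duplicates = 0
    for row in rows:
        # Sort the items so key order does not matter; str() makes every value hashable.
        fingerprint = tuple(sorted((key, str(value)) for key, value in row.items()))
        if fingerprint in seen:
            duplicates += 1
        else:
            seen.add(fingerprint)
    return len(seen), duplicates


# Two made-up pages, as if fetched for two different partition values.
page_a = [{"city": "Istanbul", "country": "Turkey"}, {"city": "Paris", "country": "France"}]
page_b = [{"country": "Turkey", "city": "Istanbul"}, {"city": "Tokyo", "country": "Japan"}]

print(count_duplicates(page_a + page_b))  # (3, 1)
```

Converting values with `str()` is a simplification: it treats `1` and `"1"` as equal, which may or may not be desirable for the metrics computed later.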

## Strategy 4: Classic Pagination

In [7]:
def fetch_classic_pagination(plan: Dict[str, Any]) -> Dict[str, Any]:
    """
    Iterative offset-based pagination (matching original approach).
    Stops when LLM returns empty result or hits MAX_PAGINATION_PAGES failsafe.
    """
    meta = plan['metadata']
    config = plan['pagination_config']
    
    table_title = meta['table_title']
    columns = meta['columns']
    page_size = config['page_size']
    sort_order = config['sort_order']
    table_id = plan['table_id']
    
    # Normalize field names
    num_fields = len(columns)
    norm_fields = [normalize_field(f) for f in columns]
    
    pages = []
    all_rows = []
    offset = 0
    page_num = 0
    
    max_pages = min(MAX_PAGES, MAX_PAGINATION_PAGES) if MAX_PAGES else MAX_PAGINATION_PAGES
    
    while page_num < max_pages:
        page_num += 1
        
        # Build response format
        fields_json = [f'"{field}": "{field}"' for field in norm_fields]
        response_format = ', '.join(fields_json)
        
        # Build prompt matching original approach
        system_msg = "You are a retriever of facts."
        sort_desc = ', '.join(sort_order)
        user_msg = f"""List rows {offset + 1} to {offset + page_size} from {table_title}, sorted by {sort_desc}.
The response will be formatted as JSON shown below.
Each element of the response will contain {num_fields} fields: {', '.join(columns)}.
Do not output any additional text that is not in JSON format.

If there are no more rows at this offset, respond with an empty list: []

RESPONSE FORMAT:
[{{
    {response_format}
}}]"""
        
        print(f"  Fetching page {page_num} (offset {offset}, size {page_size})...")
        
        result = call_llm_with_retries(system_msg=system_msg, user_msg=user_msg)
        
        page_data = {
            'page_number': page_num,
            'offset': offset,
            'page_size': page_size,
            'system_msg': system_msg,
            'user_msg': user_msg,
            'prompt': user_msg,
            'raw_response': result.get('response', ''),
            'error': result.get('error'),
            **result['metrics']
        }
        
        if result['success']:
            parsed_data, parse_metrics = parse_json_response(result['response'], expect_list=True)
            page_data.update(parse_metrics)
            page_data['rows_returned'] = len(parsed_data) if parsed_data else 0
            page_data['parsed_data'] = parsed_data if parsed_data else []
            
            if parsed_data and len(parsed_data) > 0:
                all_rows.extend(parsed_data)
                offset += page_size
            else:
                # Empty result, stop pagination
                print(f"  Stopping: Empty result at offset {offset}")
                page_data['stop_reason'] = 'empty_result'
                pages.append(page_data)
                break
        else:
            page_data['parse_success'] = False
            page_data['rows_returned'] = 0
            page_data['parsed_data'] = []
        
        pages.append(page_data)
    
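    # Report the failsafe only if the loop did not already stop on an empty result
    if page_num >= max_pages and not (pages and pages[-1].get('stop_reason')):
        print(f"  Stopping: Hit max pages limit ({max_pages})")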
    
    return {
        'pages': pages,
        'all_rows': all_rows
    }
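
The two stop conditions of the offset loop (an empty page, or the page cap) can be illustrated without any LLM call. The sketch below replaces the model with a stub that serves slices of a fixed 25-row list; `paginate` and `fake_fetch` are hypothetical names, and retries and parse failures are left out.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]


def paginate(fetch_page: Callable[[int, int], List[Row]], page_size: int, max_pages: int) -> Tuple[List[Row], str]:
    """Collect pages until one comes back empty or max_pages requests have been made."""
    rows: List[Row] = []
    offset = 0
    for _ in range(max_pages):
        page = fetch_page(offset, page_size)
        if not page:
            return rows, "empty_result"
        rows.extend(page)
        offset += page_size  # advance by the requested size, as the notebook does
    return rows, "max_pages"


# Stand-in for the LLM call: serves slices of a fixed 25-row table.
TABLE = [{"id": str(i)} for i in range(1, 26)]


def fake_fetch(offset: int, size: int) -> List[Row]:
    return TABLE[offset:offset + size]


rows, reason = paginate(fake_fetch, page_size=10, max_pages=10)
print(len(rows), reason)  # 25 empty_result
rows, reason = paginate(fake_fetch, page_size=10, max_pages=2)
print(len(rows), reason)  # 20 max_pages
```

Unlike this stub, a language model is not guaranteed to return an empty list once the table is exhausted, which is why the notebook keeps `MAX_PAGINATION_PAGES` as a failsafe.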

## Strategies 5-7: Range-based (Alphabetic, Semantic, Unrestricted)

In [8]:
def fetch_range_based(plan: Dict[str, Any]) -> Dict[str, Any]:
    """
    Fetch pages by defined ranges (matching original approach).
    """
    meta = plan['metadata']
    config = plan['pagination_config']
    
    table_title = meta['table_title']
    columns = meta['columns']
    partition_column = config['partition_column']
    ranges = config['ranges']
    table_id = plan['table_id']
    
    # Normalize field names
    num_fields = len(columns)
    norm_fields = [normalize_field(f) for f in columns]
    
    pages = []
    all_rows = []
    
    # Apply MAX_PAGES limit
    ranges_to_fetch = ranges[:MAX_PAGES] if MAX_PAGES else ranges
    total_ranges = len(ranges)
    
    if MAX_PAGES and len(ranges) > MAX_PAGES:
        print(f"  Limited to first {MAX_PAGES} ranges out of {total_ranges}")
    
    for page_num, range_spec in enumerate(ranges_to_fetch, 1):
        # Handle different range types
        if 'category' in range_spec:
            # Semantic/category-based range
            category = range_spec['category']
            filter_condition = f"{partition_column} represents {category}"
            range_display = f"{category}"
        else:
            # Numeric/alphabetical range with gte/lt
            gte = range_spec.get('gte', '')
            lt = range_spec.get('lt', '')
            filter_condition = f"{partition_column} >= {gte} and {partition_column} < {lt}"
            range_display = f"[{gte}, {lt})"
        
        # Build response format
        fields_json = [f'"{field}": "{field}"' for field in norm_fields]
        response_format = ', '.join(fields_json)
        
        # Build prompt matching original approach
        system_msg = "You are a retriever of facts."
        user_msg = f"""List rows from {table_title} where {filter_condition}.
The response will be formatted as JSON shown below.
Each element of the response will contain {num_fields} fields: {', '.join(columns)}.
Do not output any additional text that is not in JSON format.

RESPONSE FORMAT:
[{{
    {response_format}
}}]"""
        
        print(f"  Fetching range {page_num}/{len(ranges_to_fetch)}: {partition_column} {range_display}...")
        
        result = call_llm_with_retries(system_msg=system_msg, user_msg=user_msg)
        
        page_data = {
            'page_number': page_num,
            'range': range_spec,
            'system_msg': system_msg,
            'user_msg': user_msg,
            'prompt': user_msg,
            'raw_response': result.get('response', ''),
            'error': result.get('error'),
            **result['metrics']
        }
        
        if result['success']:
            parsed_data, parse_metrics = parse_json_response(result['response'], expect_list=True)
            page_data.update(parse_metrics)
            page_data['rows_returned'] = len(parsed_data) if parsed_data else 0
            page_data['parsed_data'] = parsed_data if parsed_data else []
            
            if parsed_data:
                all_rows.extend(parsed_data)
        else:
            page_data['parse_success'] = False
            page_data['rows_returned'] = 0
            page_data['parsed_data'] = []
        
        pages.append(page_data)
    
    return {
        'pages': pages,
        'all_rows': all_rows
    }
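
The only part of `fetch_range_based` that differs between variants is how a range entry is turned into a filter phrase. A compact sketch of that mapping follows; `describe_range` is a hypothetical helper and the range values are invented examples of the two shapes the function handles (`category`, or `gte`/`lt`).

```python
from typing import Any, Dict


def describe_range(partition_column: str, range_spec: Dict[str, Any]) -> str:
    """Turn one range entry into the filter phrase used in the prompt."""
    if "category" in range_spec:
        # Semantic range: a named category instead of bounds.
        return f"{partition_column} represents {range_spec['category']}"
    # Half-open interval: lower bound inclusive, upper bound exclusive.
    gte = range_spec.get("gte", "")
    lt = range_spec.get("lt", "")
    return f"{partition_column} >= {gte} and {partition_column} < {lt}"


print(describe_range("country", {"gte": "A", "lt": "G"}))  # country >= A and country < G
print(describe_range("country", {"category": "Europe"}))   # country represents Europe
```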

## Main Execution: Process All Strategies

In [9]:
# Strategy function mapping
STRATEGY_FUNCTIONS = {
    'full_table': fetch_full_table,
    'row_by_row': fetch_row_by_row,
    'attribute_based': fetch_attribute_based,
    'classic_pagination': fetch_classic_pagination,
    'range_based_alphabetic': fetch_range_based,
    'range_based_semantic': fetch_range_based,
    'range_based_unrestricted': fetch_range_based
}

# Load all strategies
strategies_to_process = []
range_based_variants = ['alphabetic', 'semantic', 'unrestricted']

for variant in range_based_variants:
    strategy_name = f'range_based_{variant}'
    strategy_dir = LATEST_STRATEGY_DIR / 'range_based'
    if not strategy_dir.exists():
        print(f"Skipping {strategy_name}: directory not found")
        continue
    
    json_files = sorted(strategy_dir.glob('*.json'))
    # Filter out error files
    json_files = [f for f in json_files if not f.name.startswith('_')]
    
    for json_file in json_files:
        with open(json_file, 'r', encoding='utf-8') as f:
            plan = json.load(f)
        # Create a copy of the plan for this variant
        variant_plan = plan.copy()
        variant_plan['strategy'] = strategy_name
        variant_plan['pagination_config'] = plan[f'pagination_config_{variant}']
        strategies_to_process.append((strategy_name, variant_plan))

# Now load other strategies
for strategy_name in ['full_table', 'row_by_row', 'attribute_based', 'classic_pagination']:
    strategy_dir = LATEST_STRATEGY_DIR / strategy_name
    if not strategy_dir.exists():
        print(f"Skipping {strategy_name}: directory not found")
        continue
    
    json_files = sorted(strategy_dir.glob('*.json'))
    # Filter out error files
    json_files = [f for f in json_files if not f.name.startswith('_')]
    
    for json_file in json_files:
        with open(json_file, 'r', encoding='utf-8') as f:
            plan = json.load(f)
        strategies_to_process.append((strategy_name, plan))

print(f"Found {len(strategies_to_process)} strategy-table combinations")

# Filter to only tables that have plans for ALL range-based strategies (apples-to-apples comparison)
from collections import Counter
all_strategies = ['range_based_alphabetic', 'range_based_semantic', 'range_based_unrestricted']
# Count only range-based plans so that plans for other strategies do not inflate the per-table count
table_strategy_counts = Counter(plan['table_id'] for s, plan in strategies_to_process if s in all_strategies)
tables_with_all_strategies = [table_id for table_id, count in table_strategy_counts.items() 
                               if count == len(all_strategies)]

print(f"\nFiltering to tables with plans for all {len(all_strategies)} strategies:")
print(f"  Total unique tables: {len(table_strategy_counts)}")
print(f"  Tables with all strategies: {len(tables_with_all_strategies)}")

strategies_to_process = [(s, p) for s, p in strategies_to_process 
                         if p['table_id'] in tables_with_all_strategies]
print(f"  Filtered to {len(strategies_to_process)} strategy-table combinations")

# Apply MAX_TABLES limit across all strategies
if MAX_TABLES:
    # Select table_ids in sorted order so the chosen subset is reproducible across runs
    table_ids = sorted(set(p['table_id'] for _, p in strategies_to_process))[:MAX_TABLES]
    strategies_to_process = [(s, p) for s, p in strategies_to_process if p['table_id'] in table_ids]
    print(f"Limited to {len(strategies_to_process)} combinations for {len(table_ids)} tables")

print(f"Processing {len(strategies_to_process)} strategy-table combinations...")

Skipping full_table: directory not found
Skipping row_by_row: directory not found
Skipping attribute_based: directory not found
Skipping classic_pagination: directory not found
Found 72 strategy-table combinations

Filtering to tables with plans for all 3 strategies:
  Total unique tables: 24
  Tables with all strategies: 24
  Filtered to 72 strategy-table combinations
Limited to 60 combinations for 20 tables
Processing 60 strategy-table combinations...
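
The coverage filter in the cell above can also be written with sets, which makes the intent (keep a table only if every required strategy has a plan) explicit and is unaffected by plans for other strategies. This is a set-based variant offered as a sketch, not the notebook's own code; the helper name and the sample pairs are made up.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def tables_with_full_coverage(pairs: List[Tuple[str, str]], required: Set[str]) -> List[str]:
    """Return sorted table ids that have a plan for every required strategy."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for strategy, table_id in pairs:
        seen[table_id].add(strategy)
    # `required <= strategies` is a subset test: all required strategies are present.
    return sorted(table_id for table_id, strategies in seen.items() if required <= strategies)


required = {"range_based_alphabetic", "range_based_semantic", "range_based_unrestricted"}
pairs = [
    ("range_based_alphabetic", "t1"),
    ("range_based_semantic", "t1"),
    ("range_based_unrestricted", "t1"),
    ("range_based_alphabetic", "t2"),
    ("range_based_semantic", "t2"),
    ("full_table", "t2"),  # extra strategy, does not make up for the missing variant
]
print(tables_with_full_coverage(pairs, required))  # ['t1']
```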


In [10]:
# Create output directory
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
output_dir = OUTPUT_ROOT / timestamp
output_dir.mkdir(parents=True, exist_ok=True)

print(f"Output directory: {output_dir}")

# Create subdirectories for each strategy
for strategy_name in STRATEGY_FUNCTIONS.keys():
    strategy_dir = output_dir / strategy_name
    strategy_dir.mkdir(exist_ok=True)

# Group strategies by table_id for parallel execution
from collections import defaultdict
tables_with_strategies = defaultdict(list)
for strategy_name, plan in strategies_to_process:
    table_id = plan['table_id']
    tables_with_strategies[table_id].append((strategy_name, plan))

print(f"\nGrouped into {len(tables_with_strategies)} tables with strategies")

# Process each table with parallel strategy execution
results_summary = []


def process_strategy(strategy_name: str, plan: Dict[str, Any]) -> Dict[str, Any]:
    """Process a single strategy-table combination."""
    table_id = plan['table_id']
    table_name = plan['table_name']
    
    try:
        # Execute fetching strategy
        fetch_func = STRATEGY_FUNCTIONS[strategy_name]
        fetch_result = fetch_func(plan)
        
        pages = fetch_result['pages']
        all_rows = fetch_result['all_rows']
        
        # Calculate aggregated metrics
        total_pages = len(pages)
        successful_pages = sum(1 for p in pages if p.get('parse_success', False))
        failed_pages = total_pages - successful_pages
        
        total_latency = sum(p.get('latency', 0) for p in pages)
        avg_latency = total_latency / total_pages if total_pages > 0 else 0
        
        total_tokens = sum(p.get('total_tokens', 0) for p in pages)
        total_llm_calls = sum(1 + p.get('retry_count', 0) for p in pages)
        
        total_rows_fetched = len(all_rows)
        
        # Check for duplicate rows
        unique_rows = set()
        duplicate_count = 0
        for row in all_rows:
            row_key = tuple(sorted(row.items()))
            if row_key in unique_rows:
                duplicate_count += 1
            unique_rows.add(row_key)
        
        # Check column consistency
        if all_rows:
            column_sets = [set(row.keys()) for row in all_rows]
            columns_consistent = all(cs == column_sets[0] for cs in column_sets)
        else:
            columns_consistent = True
        
        error_rate = failed_pages / total_pages if total_pages > 0 else 0
        
        # Build execution summary
        execution_summary = {
            'table_id': table_id,
            'table_name': table_name,
            'strategy': strategy_name,
            'metadata': plan['metadata'],
            'pagination_config': plan['pagination_config'],
            'execution_metadata': {
                'timestamp': timestamp,
                'total_pages': total_pages,
                'successful_pages': successful_pages,
                'failed_pages': failed_pages,
                'total_llm_calls': total_llm_calls,
                'total_latency': round(total_latency, 3),
                'avg_latency': round(avg_latency, 3),
                'total_tokens': total_tokens,
                'total_rows_fetched': total_rows_fetched,
                'unique_rows': len(unique_rows),
                'duplicate_rows': duplicate_count,
                'columns_consistent': columns_consistent,
                'error_rate': round(error_rate, 4)
            },
            'pages': pages
        }
        
        # Save JSON log
        json_path = output_dir / strategy_name / f"{table_id}.json"
        with open(json_path, 'w', encoding='utf-8') as f:
            json.dump(execution_summary, f, indent=2, ensure_ascii=False)
        
        # Save CSV if we have data (matching B004 approach: normalize columns to match reference)
        if all_rows:
            try:
                # Load reference table to get proper column names
                ref_csv_path = LATEST_DATA_DIR / f"{table_id}.csv"
                df_ref = pd.read_csv(ref_csv_path)
                
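                # Normalize column names per row before building the DataFrame, so
                # header variants that normalize to the same name merge into one
                # column. Duplicate labels would otherwise make df[norm_ref_cols]
                # return extra columns and the rename below fail with
                # "ValueError: Length mismatch".
                normalized_rows = [
                    {normalize_field(col): value for col, value in row.items()}
                    for row in all_rows
                ]
                df = pd.DataFrame(normalized_rows)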
                
                # Normalize reference column names to create mapping
                norm_ref_cols = [normalize_field(col) for col in df_ref.columns]
                
                # Ensure fetched df has same columns as reference (reorder and add missing)
                missing_cols = [col for col in norm_ref_cols if col not in df.columns]
                for col in missing_cols:
                    df[col] = None  # Add missing columns with None
                
                # Reorder to match reference
                df = df[norm_ref_cols]
                
                # Restore original column names from reference
                df.columns = df_ref.columns
                
                # Drop duplicates based on key columns
                key_columns = plan['metadata']['key_columns']
                if key_columns:
                    df = df.drop_duplicates(subset=key_columns)
                
                csv_path = output_dir / strategy_name / f"{table_id}.csv"
                df.to_csv(csv_path, index=False, encoding='utf-8')
                print(f"  ✓ {strategy_name}/{table_id}: {len(df)} rows, {total_tokens} tokens, {total_latency:.2f}s")
            except Exception as e:
                print(f"  ✗ {strategy_name}/{table_id}: CSV save failed: {e}")
                import traceback
                traceback.print_exc()
        else:
            print(f"  ⚠ {strategy_name}/{table_id}: No rows to save")
        
        return {
            'strategy': strategy_name,
            'table_id': table_id,
            'success': True,
            **execution_summary['execution_metadata']
        }
        
    except Exception as e:
        print(f"  ✗ {strategy_name}/{table_id}: Error: {e}")
        import traceback
        traceback.print_exc()
        return {
            'strategy': strategy_name,
            'table_id': table_id,
            'success': False,
            'error': str(e)
        }


# Process each table with parallel strategies
for table_num, (table_id, strategies) in enumerate(tables_with_strategies.items(), 1):
    print(f"\n{'='*70}")
    print(f"[{table_num}/{len(tables_with_strategies)}] Processing table: {table_id}")
    print(f"{'='*70}")
    print(f"Strategies: {', '.join(s for s, _ in strategies)}")
    
    if PARALLEL_STRATEGIES:
        # Execute strategies in parallel
        with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
            futures = {
                executor.submit(process_strategy, strategy_name, plan): (strategy_name, plan['table_id'])
                for strategy_name, plan in strategies
            }
            
            for future in as_completed(futures):
                result = future.result()
                results_summary.append(result)
    else:
        # Sequential execution (for debugging)
        for strategy_name, plan in strategies:
            print(f"\n  {strategy_name.upper()}")
            result = process_strategy(strategy_name, plan)
            results_summary.append(result)

print(f"\n{'='*70}")
print("SUMMARY")
print(f"{'='*70}")
print(f"Processed: {len(results_summary)} strategy-table combinations")
print(f"Successful: {sum(1 for r in results_summary if r['success'])}")
print(f"Failed: {sum(1 for r in results_summary if not r['success'])}")
print(f"{'='*70}")

Output directory: processing/2_fetching/20251016_001443

Grouped into 20 tables with strategies

[1/20] Processing table: 10_men_butterfly_100m_2009
Strategies: range_based_alphabetic, range_based_semantic, range_based_unrestricted
  Fetching range 1/4: Name [A, G)...
  Fetching range 1/6: Nationality Africa...
  Fetching range 1/22: Time [49.00, 49.50)...
  Fetching range 2/6: Nationality Asia...
  Fetching range 2/22: Time [49.50, 50.00)...
  Fetching range 3/22: Time [50.00, 50.50)...
  Fetching range 4/22: Time [50.50, 51.00)...
  Fetching range 5/22: Time [51.00, 51.50)...
  Fetching range 2/4: Name [G, M)...
  Fetching range 3/6: Nationality Europe...
  Fetching range 3/4: Name [M, S)...
  Fetching range 6/22: Time [51.50, 52.00)...
  Fetching range 4/4: Name [S, [)...
  Fetching range 4/6: Nationality North America...
  Fetching range 7/22: Time [52.00, 52.50)...
  ✓ range_based_alphabetic/10_men_butterfly_100m_2009: 105 rows, 5809 tokens, 34.20s
  Fetching range 8/22: Time [52.

Traceback (most recent call last):
  File "/var/folders/s8/dzjx_6bx24960zc7sd87xsq400xh_z/T/ipykernel_33115/1136395299.py", line 125, in process_strategy
    df.columns = df_ref.columns
    ^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/pandas/core/generic.py", line 6313, in __setattr__
    return object.__setattr__(self, name, value)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "properties.pyx", line 69, in pandas._libs.properties.AxisProperty.__set__
  File "/opt/homebrew/lib/python3.11/site-packages/pandas/core/generic.py", line 814, in _set_axis
    self._mgr.set_axis(axis, labels)
  File "/opt/homebrew/lib/python3.11/site-packages/pandas/core/internals/managers.py", line 238, in set_axis
    self._validate_set_axis(axis, new_labels)
  File "/opt/homebrew/lib/python3.11/site-packages/pandas/core/internals/base.py", line 98, in _validate_set_axis
    raise ValueError(
ValueError: Length mismatch: Expected axis has 10 elements, new values have 9 elements


  Fetching range 7/12: Year [1960, 1970)...
  ✓ range_based_alphabetic/2_belgium_demographics_1900_2011: 16 rows, 8668 tokens, 46.78s
  Fetching range 8/12: Year [1970, 1980)...
  Fetching range 9/12: Year [1980, 1990)...
  Fetching range 10/12: Year [1990, 2000)...
  Fetching range 11/12: Year [2000, 2010)...
  Fetching range 12/12: Year [2010, 2012)...
  ✓ range_based_unrestricted/2_belgium_demographics_1900_2011: 111 rows, 16408 tokens, 90.14s

[12/20] Processing table: 32_fa_cup_qualifying_rounds_1999_2000
Strategies: range_based_alphabetic, range_based_semantic, range_based_unrestricted
  Fetching range 1/4: Home Team [A, G)...
  Fetching range 1/5: Home Team Premier League (top-tier professional clubs)...
  Fetching range 1/43: Tie [Preliminary:1, Preliminary:26)...
  Fetching range 2/5: Home Team Football League (professional clubs)...
  Fetching range 2/43: Tie [Preliminary:26, Preliminary:51)...
  Fetching range 2/4: Home Team [G, N)...
  Fetching range 3/5: Home Team National

Traceback (most recent call last):
  File "/var/folders/s8/dzjx_6bx24960zc7sd87xsq400xh_z/T/ipykernel_33115/1136395299.py", line 125, in process_strategy
    df.columns = df_ref.columns
    ^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/pandas/core/generic.py", line 6313, in __setattr__
    return object.__setattr__(self, name, value)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "properties.pyx", line 69, in pandas._libs.properties.AxisProperty.__set__
  File "/opt/homebrew/lib/python3.11/site-packages/pandas/core/generic.py", line 814, in _set_axis
    self._mgr.set_axis(axis, labels)
  File "/opt/homebrew/lib/python3.11/site-packages/pandas/core/internals/managers.py", line 238, in set_axis
    self._validate_set_axis(axis, new_labels)
  File "/opt/homebrew/lib/python3.11/site-packages/pandas/core/internals/base.py", line 98, in _validate_set_axis
    raise ValueError(
ValueError: Length mismatch: Expected axis has 12 elements, new values have 9 elements


  Fetching range 2/10: Year [1930, 1940)...
  Fetching range 2/4: Opponent [G, N)...
  Fetching range 3/10: Year [1940, 1950)...
  Fetching range 4/10: Year [1950, 1960)...
  Fetching range 2/5: Opponent Asia...
  Fetching range 5/10: Year [1960, 1970)...
  Fetching range 3/4: Opponent [N, T)...
  Fetching range 3/5: Opponent Europe...
  Fetching range 6/10: Year [1970, 1980)...
  Fetching range 4/5: Opponent Americas...
  Fetching range 4/4: Opponent [T, [)...
  Fetching range 7/10: Year [1980, 1990)...
  Fetching range 5/5: Opponent Africa...
  ✓ range_based_alphabetic/41_new_zealand_football_results_1922_2012: 24 rows, 3279 tokens, 24.27s
  ✓ range_based_semantic/41_new_zealand_football_results_1922_2012: 43 rows, 3441 tokens, 24.41s
  Fetching range 8/10: Year [1990, 2000)...
  Fetching range 9/10: Year [2000, 2010)...
  Fetching range 10/10: Year [2010, 2020)...
  ✓ range_based_unrestricted/41_new_zealand_football_results_1922_2012: 62 rows, 4400 tokens, 42.30s

[19/20] Processing

## Save Final Summary

In [11]:
# Calculate aggregate statistics across all strategies
execution_end_time = datetime.now()
# Parse timestamp: format is YYYYMMDD_HHMMSS
execution_start_time = datetime.strptime(timestamp, '%Y%m%d_%H%M%S')

total_execution_time = (execution_end_time - execution_start_time).total_seconds()

# Aggregate metrics
total_tables_processed = len(set(r['table_id'] for r in results_summary if r['success']))
total_strategies_run = len([r for r in results_summary if r['success']])
total_failures = len([r for r in results_summary if not r['success']])

successful_results = [r for r in results_summary if r.get('success', False)]

total_pages_fetched = sum(r.get('total_pages', 0) for r in successful_results)
total_llm_calls_made = sum(r.get('total_llm_calls', 0) for r in successful_results)
total_tokens_used = sum(r.get('total_tokens', 0) for r in successful_results)
total_latency_seconds = sum(r.get('total_latency', 0) for r in successful_results)
total_rows_fetched = sum(r.get('total_rows_fetched', 0) for r in successful_results)

# Calculate averages
avg_latency_per_page = total_latency_seconds / total_pages_fetched if total_pages_fetched > 0 else 0
avg_tokens_per_call = total_tokens_used / total_llm_calls_made if total_llm_calls_made > 0 else 0
avg_pages_per_strategy = total_pages_fetched / total_strategies_run if total_strategies_run > 0 else 0

# Collect all errors
errors_list = [
    {
        'strategy': r['strategy'],
        'table_id': r['table_id'],
        'error': r.get('error', 'Unknown error')
    }
    for r in results_summary if not r.get('success', False)
]

# Strategy breakdown
strategy_breakdown = {}
for strategy in STRATEGY_FUNCTIONS.keys():
    strategy_results = [r for r in successful_results if r.get('strategy') == strategy]
    if strategy_results:
        strategy_breakdown[strategy] = {
            'tables_processed': len(strategy_results),
            'total_pages': sum(r.get('total_pages', 0) for r in strategy_results),
            'total_tokens': sum(r.get('total_tokens', 0) for r in strategy_results),
            'total_latency': round(sum(r.get('total_latency', 0) for r in strategy_results), 3),
            'avg_latency': round(sum(r.get('avg_latency', 0) for r in strategy_results) / len(strategy_results), 3),
            'total_rows': sum(r.get('total_rows_fetched', 0) for r in strategy_results),
            'error_rate': round(sum(r.get('error_rate', 0) for r in strategy_results) / len(strategy_results), 4)
        }

# Build comprehensive summary
summary = {
    'timestamp': timestamp,
    'execution_time': {
        'start': timestamp,
        'end': execution_end_time.strftime('%Y%m%d_%H%M%S'),
        'total_seconds': round(total_execution_time, 2),
        'total_minutes': round(total_execution_time / 60, 2)
    },
    'configuration': {
        'strategy_directory': str(LATEST_STRATEGY_DIR),
        'max_tables': MAX_TABLES,
        'max_pages': MAX_PAGES,
        'max_retries': MAX_RETRIES,
        'max_pagination_pages': MAX_PAGINATION_PAGES,
        'model': MODEL
    },
    'aggregate_metrics': {
        'total_tables_processed': total_tables_processed,
        'total_strategy_table_combinations': total_strategies_run,
        'total_failures': total_failures,
        'success_rate': round(total_strategies_run / (total_strategies_run + total_failures), 4) if (total_strategies_run + total_failures) > 0 else 0,
        'total_pages_fetched': total_pages_fetched,
        'total_llm_calls': total_llm_calls_made,
        'total_tokens_used': total_tokens_used,
        'total_latency_seconds': round(total_latency_seconds, 2),
        'total_rows_fetched': total_rows_fetched,
        'avg_latency_per_page': round(avg_latency_per_page, 3),
        'avg_tokens_per_call': round(avg_tokens_per_call, 1),
        'avg_pages_per_strategy': round(avg_pages_per_strategy, 1)
    },
    'strategy_breakdown': strategy_breakdown,
    'errors': errors_list,
    'detailed_results': results_summary
}

summary_file = output_dir / '_summary.json'
with open(summary_file, 'w', encoding='utf-8') as f:
    json.dump(summary, f, indent=2, ensure_ascii=False)

print(f"\n{'='*70}")
print("EXECUTION SUMMARY")
print(f"{'='*70}")
print(f"Total execution time: {summary['execution_time']['total_minutes']:.2f} minutes")
print(f"Tables processed: {total_tables_processed}")
print(f"Strategy combinations: {total_strategies_run} successful, {total_failures} failed")
print(f"Total pages fetched: {total_pages_fetched}")
print(f"Total LLM calls: {total_llm_calls_made}")
print(f"Total tokens used: {total_tokens_used:,}")
print(f"Total rows fetched: {total_rows_fetched}")
print(f"Avg latency per page: {avg_latency_per_page:.2f}s")
print(f"Avg tokens per call: {avg_tokens_per_call:.1f}")
if errors_list:
    print(f"\n⚠ {len(errors_list)} errors occurred (see _summary.json for details)")
print(f"\nFinal summary saved to {summary_file}")
print(f"All results saved to {output_dir}")
print(f"{'='*70}")


EXECUTION SUMMARY
Total execution time: 30.93 minutes
Tables processed: 20
Strategy combinations: 60 successful, 0 failed
Total pages fetched: 443
Total LLM calls: 443
Total tokens used: 444,318
Total rows fetched: 7132
Avg latency per page: 7.31s
Avg tokens per call: 1003.0

Final summary saved to processing/2_fetching/20251016_001443/_summary.json
All results saved to processing/2_fetching/20251016_001443


## Next Steps

The fetched data has been saved to `processing/2_fetching/<timestamp>/`.

Each strategy subdirectory contains:
- **JSON files**: Detailed execution logs with per-page metrics, prompts, responses, and aggregated statistics
- **CSV files**: Aggregated table data ready for metrics calculation
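
As a minimal sketch of reading these outputs back, assuming the `<run_dir>/<strategy>/<table_id>.json` and `.csv` layout used by the saving code above. The helper name `load_fetch_outputs` is hypothetical and not part of this notebook. The CSV may be absent when a strategy returned no rows or the CSV save failed, so the sketch returns `None` for the rows in that case:

```python
import csv
import json
from pathlib import Path


def load_fetch_outputs(run_dir, strategy, table_id):
    """Return (execution_log, rows) for one strategy/table pair.

    `rows` is a list of dicts read from the CSV, or None when no CSV
    was written for this pair. (Hypothetical helper for illustration.)
    """
    base = Path(run_dir) / strategy

    # The JSON log holds 'execution_metadata' and the per-page records.
    with open(base / f"{table_id}.json", encoding="utf-8") as f:
        execution_log = json.load(f)

    # The CSV is optional: it is only written when rows were fetched.
    csv_path = base / f"{table_id}.csv"
    rows = None
    if csv_path.exists():
        with open(csv_path, newline="", encoding="utf-8") as f:
            rows = list(csv.DictReader(f))

    return execution_log, rows
```

Note that `csv.DictReader` returns every value as a string; use `pd.read_csv` instead if typed columns are needed.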

### Metrics Available:

**Per-page metrics:**
- latency, tokens (prompt/completion/total), retry_count
- parse_success, rows_returned, JSON extraction position
- timestamp, raw_response, parsed_data

**Per-table-strategy metrics:**
- total_pages, successful_pages, failed_pages, total_llm_calls
- total_latency, avg_latency, total_tokens
- total_rows_fetched, unique_rows, duplicate_rows
- columns_consistent, error_rate
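
The per-table-strategy values are derived from the per-page records. The sketch below mirrors the aggregation done in `process_strategy`, using the same page keys (`parse_success`, `latency`, `total_tokens`, `retry_count`). The function name `aggregate_page_metrics` and the sample page values are made up for illustration and do not come from a real run:

```python
def aggregate_page_metrics(pages):
    """Roll per-page records up into per-table-strategy metrics."""
    total_pages = len(pages)
    successful = sum(1 for p in pages if p.get("parse_success", False))
    failed = total_pages - successful
    total_latency = sum(p.get("latency", 0) for p in pages)

    return {
        "total_pages": total_pages,
        "successful_pages": successful,
        "failed_pages": failed,
        # Each page costs one call plus one call per retry.
        "total_llm_calls": sum(1 + p.get("retry_count", 0) for p in pages),
        "total_latency": round(total_latency, 3),
        "avg_latency": round(total_latency / total_pages, 3) if total_pages else 0,
        "total_tokens": sum(p.get("total_tokens", 0) for p in pages),
        "error_rate": round(failed / total_pages, 4) if total_pages else 0,
    }


# Illustrative page records (made-up values)
pages = [
    {"parse_success": True, "latency": 2.0, "total_tokens": 100, "retry_count": 0},
    {"parse_success": False, "latency": 4.0, "total_tokens": 50, "retry_count": 2},
]
metrics = aggregate_page_metrics(pages)
```

Here `error_rate` is the share of pages whose response could not be parsed, so a page that succeeds after retries still counts as successful while adding to `total_llm_calls`.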

### To calculate accuracy metrics:
Use the CSV files with the evaluation logic from `X101_Calculate_Metrics.ipynb` to compute:
- Keys F1, Precision, Recall
- Non-keys F1, Precision, Recall
- Overall F1, Precision, Recall
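
The evaluation logic in `X101_Calculate_Metrics.ipynb` is not shown here. As a rough sketch of what a set-based precision, recall, and F1 computation over key values could look like, assuming exact matching of key values between the fetched CSV and the reference table (the function name `precision_recall_f1` and the sample keys are hypothetical, and the actual notebook may match keys differently):

```python
def precision_recall_f1(predicted_keys, reference_keys):
    """Set-based precision, recall, and F1 under exact key matching."""
    predicted = set(predicted_keys)
    reference = set(reference_keys)
    true_positives = len(predicted & reference)

    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Made-up key values for illustration: 2 of 3 fetched keys are correct,
# and 2 of 4 reference keys were found.
fetched = ["b", "c", "x"]
reference = ["b", "c", "d", "e"]
precision, recall, f1 = precision_recall_f1(fetched, reference)
```

For composite keys, each element passed in would be a tuple of the key column values for one row.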