# Strategy Planning for Pagination

This notebook determines the pagination strategy for each table and outputs a plan for the next step (fetching pages).

## Strategies Implemented:
1. **Full Table** - Fetch entire table in one query
2. **Row by Row** - Fetch all keys, then fetch each row individually
3. **Attribute-based** - Ask LLM which column to partition by, then fetch pages by distinct values
4. **Classic Pagination** - Offset-based pagination with configurable page size
5. **Range-based** - Generate alphabetic (by starting letter), semantic (by categories), and unrestricted (original LLM choice) pagination plans

## Setup and Imports

In [1]:
from __future__ import annotations

import json
import os
from pathlib import Path
from datetime import datetime
from typing import Dict, List, Any, Optional
from concurrent.futures import ThreadPoolExecutor, as_completed
import openai

# ===== CONFIGURATION =====
# MAX_TABLES: Set to a number to process only that many tables (for testing/sampling)
# Set to None to process all tables
MAX_TABLES = None  # e.g., 1 for single table test, 3 for sampling first 3 tables, None for all

# PARALLEL_STRATEGIES: Set to True to run strategies in parallel (faster)
PARALLEL_STRATEGIES = True
MAX_WORKERS = 5  # Number of parallel strategy workers (set to number of strategies)

# Paths
ROOT = Path('.')
PROCESSING_ROOT = ROOT / 'processing'
DATA_DIR = PROCESSING_ROOT / '0_data'
OUTPUT_ROOT = PROCESSING_ROOT / '1_strategy'

# Find the most recent data directory
data_subdirs = sorted([d for d in DATA_DIR.iterdir() if d.is_dir()], reverse=True)
if not data_subdirs:
    raise FileNotFoundError(f"No data found in {DATA_DIR}")

LATEST_DATA_DIR = data_subdirs[0]

print('Output base:', OUTPUT_ROOT.resolve())
print('Data directory:', LATEST_DATA_DIR.resolve())

Output base: /Users/bef/Desktop/TablePagination/processing/1_strategy
Data directory: /Users/bef/Desktop/TablePagination/processing/0_data/20251004_213355


In [2]:
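# Configure OpenRouter
# Read the API key from api_key.txt, falling back to the OPENROUTER_API_KEY environment variable
api_key_file = ROOT / 'api_key.txt'
if api_key_file.exists():
    OPENROUTER_API_KEY = api_key_file.read_text(encoding='utf-8').strip()
    print('✓ API key loaded from api_key.txt')
else:
    OPENROUTER_API_KEY = os.environ.get('OPENROUTER_API_KEY', '').strip()
    if not OPENROUTER_API_KEY:
        raise ValueError('No API key found. Please create api_key.txt or set OPENROUTER_API_KEY environment variable')
    print('✓ API key loaded from OPENROUTER_API_KEY')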

# Create OpenAI client configured for OpenRouter
client = openai.OpenAI(
    api_key=OPENROUTER_API_KEY,
    base_url="https://openrouter.ai/api/v1"
)

# Model to use for LLM queries
MODEL = 'openai/gpt-4.1-mini'  # OpenRouter format: provider/model
print(f'Using model: {MODEL}')

✓ API key loaded from api_key.txt
Using model: openai/gpt-4.1-mini


In [3]:
# Load all table data from step 0
def load_all_tables(data_dir: Path) -> List[Dict[str, Any]]:
    """Load all JSON files from the data directory."""
    tables = []
    for json_file in sorted(data_dir.glob('*.json')):
        with open(json_file, 'r', encoding='utf-8') as f:
            data = json.load(f)
            # Add file info
            data['file_path'] = str(json_file)
            data['table_id'] = json_file.stem  # e.g., "0_republican_straw_polls_2012"
            tables.append(data)
    return tables

tables = load_all_tables(LATEST_DATA_DIR)
print(f'Loaded {len(tables)} tables')
print(f'Sample table ID: {tables[0]["table_id"]}')

Loaded 48 tables
Sample table ID: 10_men_butterfly_100m_2009


## Helper Functions

In [4]:
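def call_llm(prompt: str, response_format: str = "text") -> Optional[str]:
    """Make a single LLM call and return the stripped response text, or None on failure.

    `response_format` is currently unused and kept only for call-site compatibility.
    """
    try:
        response = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": prompt}],
            temperature=0
        )
        # The message content can be None (e.g. an empty completion), so guard before stripping
        content = response.choices[0].message.content
        return content.strip() if content else None
    except Exception as e:
        print(f"LLM call failed: {e}")
        return None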

def normalize_field(s: str) -> str:
    """Normalize field names (from ChatGPT35_RowByRow_FirstExample)."""
    import re
    s = s.lower().replace(" ", "_").replace("-", "_").replace(".", "").replace(",", "_")\
            .replace("(", "").replace(")", "").replace(":", "").replace('"', '').replace("'", "")\
            .replace("/", "")
    return re.sub('_+', '_', s)
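
The strategies below all clean up an LLM response the same way before parsing it: trim everything before the first `[` and after the last `]`, then call `json.loads`. A minimal sketch of that step as a single helper follows; the name `extract_json_array` is an assumption for illustration and is not used elsewhere in this notebook.

```python
import json
from typing import Any, List, Optional


def extract_json_array(text: Optional[str]) -> Optional[List[Any]]:
    """Parse the span from the first '[' to the last ']' as a JSON list.

    Returns None when there is no such span, when it is not valid JSON,
    or when the parsed value is not a list.
    """
    if not text:
        return None
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, list) else None


print(extract_json_array('Sure! ["a", "b"] Hope that helps.'))  # ['a', 'b']
print(extract_json_array("no list here"))                        # None
```

Returning `None` for every failure mode would let callers keep the existing `if not ...: return None` pattern, at the cost of losing the specific parse error message.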

## Strategy 1: Full Table

In [5]:
def plan_full_table(table_data: Dict[str, Any]) -> Dict[str, Any]:
    """
    Strategy: Fetch the entire table in one query.
    No LLM calls needed.
    """
    meta = table_data['meta']
    
    return {
        "table_id": table_data['table_id'],
        "table_name": meta.get('name', ''),
        "strategy": "full_table",
        "metadata": {
            "table_title": meta.get('table_title', ''),
            "columns": meta.get('columns', []),
            "key_columns": meta.get('keys', [])
        },
        "pagination_config": {}
    }

## Strategy 2: Row by Row

In [6]:
def plan_row_by_row(table_data: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """
    Strategy: Fetch all key values first, then fetch each row individually.
    Makes 1 LLM call to get all key combinations.
    """
    meta = table_data['meta']
    table_title = meta.get('table_title', '')
    keys = meta.get('keys', [])
    
    if not keys:
        print(f"Warning: No keys defined for {table_data['table_id']}")
        return None
    
    # Normalize key names
    norm_keys = [normalize_field(k) for k in keys]
    
    # Build prompt to fetch all keys (inspired by ChatGPT35_RowByRow_FirstExample)
    key_columns_desc = f"The key column{'s' if len(keys) > 1 else ''} in the table {'are' if len(keys) > 1 else 'is'} {', '.join(keys)}"
    
    keys_json_format = ', '.join([f'"{nk}": "{nk}"' for nk in norm_keys])
    
    keys_prompt = f"""You are a retriever of facts.
We want to create a table with the detailed information about {table_title}.
{key_columns_desc}.
List all {', '.join(keys)} entities for the table.
The response will be formatted as JSON list shown below.

RESPONSE FORMAT:
[{{
    {keys_json_format}
}}]"""
    
    print(f"Fetching keys for {table_data['table_id']}...")
    response = call_llm(keys_prompt)
    
    if not response:
        print(f"Failed to fetch keys for {table_data['table_id']}")
        return None
    
    # Parse the response to extract key values
    try:
        # Clean up response to extract JSON
        if not response.startswith("[") and "[" in response:
            response = response[response.find("["):]
        if not response.endswith("]") and "]" in response:
            response = response[:response.rfind("]") + 1]
        
        key_values = json.loads(response)
        
        if not isinstance(key_values, list):
            print(f"Invalid response format for {table_data['table_id']}")
            return None
            
    except json.JSONDecodeError as e:
        print(f"Failed to parse keys response for {table_data['table_id']}: {e}")
        return None
    
    return {
        "table_id": table_data['table_id'],
        "table_name": meta.get('name', ''),
        "strategy": "row_by_row",
        "metadata": {
            "table_title": table_title,
            "columns": meta.get('columns', []),
            "key_columns": keys
        },
        "pagination_config": {
            "key_columns": keys,
            "key_values": key_values,
            "total_rows": len(key_values)
        }
    }

## Strategy 3: Attribute-based

In [7]:
def plan_attribute_based(table_data: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """
    Strategy: Ask LLM which column to partition by, then get distinct values.
    Makes 2 LLM calls:
    1. Ask which column to use for partitioning
    2. Get all distinct values for that column
    """
    meta = table_data['meta']
    table_title = meta.get('table_title', '')
    columns = meta.get('columns', [])
    
    if not columns:
        print(f"Warning: No columns defined for {table_data['table_id']}")
        return None
    
    # Call 1: Ask which column to partition by
    partition_prompt = f"""You are helping to paginate a table about {table_title}.
The table has the following columns: {', '.join(columns)}.

Which single column would be best to use for partitioning/grouping this table's data?
Choose a column that would create meaningful, balanced groups.

Respond with ONLY the column name, nothing else."""
    
    print(f"Asking LLM for partition column for {table_data['table_id']}...")
    partition_column = call_llm(partition_prompt)
    
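    # Tolerate surrounding quotes and case differences in the LLM's answer
    answer = (partition_column or '').strip().strip('"').strip("'")
    partition_column = next((c for c in columns if c.lower() == answer.lower()), None)
    if partition_column is None:
        print(f"Invalid partition column '{answer}' for {table_data['table_id']}")
        return None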
    
    # Call 2: Get all distinct values for that column
    values_prompt = f"""You are a retriever of facts.
We want to paginate a table about {table_title}.
List all distinct values of the column "{partition_column}" in this table.

RESPONSE FORMAT:
["{partition_column}_value1", "{partition_column}_value2", ...]"""
    
    print(f"Fetching distinct values for column '{partition_column}'...")
    response = call_llm(values_prompt)
    
    if not response:
        print(f"Failed to fetch values for {table_data['table_id']}")
        return None
    
    # Parse the response
    try:
        if not response.startswith("[") and "[" in response:
            response = response[response.find("["):]
        if not response.endswith("]") and "]" in response:
            response = response[:response.rfind("]") + 1]
        
        partition_values = json.loads(response)
        
        if not isinstance(partition_values, list):
            print(f"Invalid response format for {table_data['table_id']}")
            return None
            
    except json.JSONDecodeError as e:
        print(f"Failed to parse values response for {table_data['table_id']}: {e}")
        return None
    
    return {
        "table_id": table_data['table_id'],
        "table_name": meta.get('name', ''),
        "strategy": "attribute_based",
        "metadata": {
            "table_title": table_title,
            "columns": columns,
            "key_columns": meta.get('keys', [])
        },
        "pagination_config": {
            "partition_column": partition_column,
            "partition_values": partition_values,
            "total_partitions": len(partition_values)
        }
    }

## Strategy 4: Classic Pagination (Offset)

In [8]:
def plan_classic_pagination(table_data: Dict[str, Any], page_size: int = 10) -> Optional[Dict[str, Any]]:
    """
    Strategy: Classic offset-based pagination.
    No LLM calls - just configuration.
    The fetch notebook will iteratively fetch pages.
    """
    meta = table_data['meta']
    keys = meta.get('keys', [])
    
    if not keys:
        print(f"Warning: No keys defined for {table_data['table_id']}")
        return None
    
    # Default sort order: ascending by key columns
    sort_order = [f"{key} ASC" for key in keys]
    
    return {
        "table_id": table_data['table_id'],
        "table_name": meta.get('name', ''),
        "strategy": "classic_pagination",
        "metadata": {
            "table_title": meta.get('table_title', ''),
            "columns": meta.get('columns', []),
            "key_columns": keys
        },
        "pagination_config": {
            "page_size": page_size,
            "primary_keys": keys,
            "sort_order": sort_order
        }
    }
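
The plan above only records `page_size` and a sort order; the offsets themselves are left to the fetch notebook. As a rough sketch of how a consumer might turn `page_size` into offset windows, assuming the total row count is known or estimated up front (the fetch notebook may instead iterate until a page comes back empty), consider the hypothetical helper `page_windows`:

```python
from typing import Iterator, Tuple


def page_windows(total_rows: int, page_size: int = 10) -> Iterator[Tuple[int, int]]:
    """Yield (offset, limit) pairs that together cover total_rows rows."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    for offset in range(0, total_rows, page_size):
        # The last page may be shorter than page_size
        yield offset, min(page_size, total_rows - offset)


print(list(page_windows(25, 10)))  # [(0, 10), (10, 10), (20, 5)]
```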

## Strategy 5: Range-based Pagination

In [9]:
def plan_range_based(table_data: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """
    Strategy: Generate alphabetic, semantic, and unrestricted range-based pagination plans.
    Makes up to 6 LLM calls: 2 for alphabetic, 2 for semantic, 2 for unrestricted.
    Returns all three configs or None if any fails.
    """
    meta = table_data['meta']
    table_title = meta.get('table_title', '')
    columns = meta.get('columns', [])
    
    if not columns:
        print(f"Warning: No columns defined for {table_data['table_id']}")
        return None
    
    configs = {}
    
    # 1. Alphabetic strategy
    alphabetic_prompt = f"""You are helping to paginate a table about {table_title}.
The table has the following columns: {', '.join(columns)}.

Suggest a column that contains text data (like names, titles, or categories) that can be alphabetized.
Then suggest bucketing by starting letter (e.g., A-F, G-M, etc.).

Respond in the format: "<column_name>, <bucketing_description>"
Example: "Title, by starting letter ranges A-F, G-M, N-S, T-Z"
"""
    
    print(f"Asking LLM for alphabetic strategy for {table_data['table_id']}...")
    alphabetic_response = call_llm(alphabetic_prompt)
    
    if alphabetic_response:
        parts = alphabetic_response.split(',', 1)
        if len(parts) == 2:
            partition_column = parts[0].strip().strip('"').strip("'")
            bucketing_criteria = parts[1].strip().strip('"').strip("'")
            
            # Get ranges for alphabetic
            ranges_prompt = f"""You are a retriever of facts.
For a table about {table_title}, we want to paginate by {partition_column} using {bucketing_criteria}.

List all the alphabetic ranges needed. For each range, provide the lower bound (inclusive) and upper bound (exclusive).

RESPONSE FORMAT (JSON array of objects):
[
    {{"gte": "A", "lt": "G"}},
    {{"gte": "G", "lt": "M"}}
]

Example for starting letters:
[
    {{"gte": "A", "lt": "F"}},
    {{"gte": "F", "lt": "K"}}
]
"""
            
            print(f"Fetching alphabetic ranges for {partition_column}...")
            ranges_response = call_llm(ranges_prompt)
            
            if ranges_response:
                try:
                    if not ranges_response.startswith("[") and "[" in ranges_response:
                        ranges_response = ranges_response[ranges_response.find("["):]
                    if not ranges_response.endswith("]") and "]" in ranges_response:
                        ranges_response = ranges_response[:ranges_response.rfind("]") + 1]
                    
                    ranges = json.loads(ranges_response)
                    
                    if isinstance(ranges, list):
                        configs['alphabetic'] = {
                            "partition_column": partition_column,
                            "bucketing_criteria": bucketing_criteria,
                            "ranges": ranges,
                            "total_ranges": len(ranges)
                        }
                except json.JSONDecodeError as e:
                    print(f"Failed to parse alphabetic ranges for {table_data['table_id']}: {e}")
    
    # 2. Semantic strategy
    semantic_prompt = f"""You are helping to paginate a table about {table_title}.
The table has the following columns: {', '.join(columns)}.

Choose a column that contains truly categorical data that can be divided into meaningful semantic groups.
ABSOLUTELY DO NOT choose any date, time, year, or temporal columns.
ABSOLUTELY DO NOT choose numeric columns like ranks, scores, counts, or measurements.
Choose columns like: countries, regions, types, categories, names of things, or qualitative attributes.

Then suggest bucketing by meaningful semantic categories (like continents for countries, or product types for items).

Respond in the format: "<column_name>, <bucketing_description>"
Example: "Country, by continent (Asia, Europe, Americas, Africa, Oceania)"
Example: "Type, by category (food, clothing, electronics, books)"
Example: "Genre, by type (fiction, non-fiction, biography, poetry)"
"""
    
    print(f"Asking LLM for semantic strategy for {table_data['table_id']}...")
    semantic_response = call_llm(semantic_prompt)
    
    if semantic_response:
        parts = semantic_response.split(',', 1)
        if len(parts) == 2:
            partition_column = parts[0].strip().strip('"').strip("'")
            bucketing_criteria = parts[1].strip().strip('"').strip("'")
            
            # Stronger validation - reject temporal, numeric, and unique-identifier columns.
            # Compare whole words rather than substrings: a substring test would reject
            # "Country" (contains "count") even though the prompt suggests it.
            forbidden_keywords = {'year', 'date', 'time', 'rank', 'score', 'points', 'count', 'number', 'id', 'age', 'population', 'gdp', 'earnings', 'price', 'cost', 'salary', 'wage'}
            identifier_keywords = {'name', 'title', 'id', 'code', 'key', 'identifier'}
            column_words = set(partition_column.lower().replace('_', ' ').replace('-', ' ').split())
            column_words |= {w.rstrip('s') for w in column_words}  # treat simple plurals as the singular
            if column_words & forbidden_keywords:
                print(f"Rejected forbidden column '{partition_column}' for semantic")
            elif column_words & identifier_keywords:
                print(f"Rejected identifier column '{partition_column}' for semantic")
            elif 'decade' in bucketing_criteria.lower() or 'year' in bucketing_criteria.lower() or 'date' in bucketing_criteria.lower():
                print(f"Rejected temporal bucketing '{bucketing_criteria}' for semantic")
            else:
                # Get ranges for semantic (actually categories)
                ranges_prompt = f"""You are a retriever of facts.
For a table about {table_title}, we want to paginate by {partition_column} using {bucketing_criteria}.

List all the semantic categories needed. These should be meaningful qualitative groups, NOT dates, numbers, or column names.

RESPONSE FORMAT (JSON array of strings):
["category1", "category2", "category3"]

Example for "Country, by continent":
["Asia", "Europe", "North America", "South America", "Africa", "Oceania"]

Example for "Type, by category":
["Food", "Clothing", "Electronics", "Books", "Home Goods"]
"""
                
                print(f"Fetching semantic categories for {partition_column}...")
                ranges_response = call_llm(ranges_prompt)
                
                if ranges_response:
                    try:
                        if not ranges_response.startswith("[") and "[" in ranges_response:
                            ranges_response = ranges_response[ranges_response.find("["):]
                        if not ranges_response.endswith("]") and "]" in ranges_response:
                            ranges_response = ranges_response[:ranges_response.rfind("]") + 1]
                        
                        categories = json.loads(ranges_response)
                        
                        if isinstance(categories, list) and len(categories) > 0:
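                            # Validate that categories are strings, not column names, and not temporal.
                            # Temporal indicators are matched as whole words so that a category
                            # such as "Beverages" (contains "age") is not rejected.
                            column_names = [col.lower() for col in columns]
                            temporal_indicators = {'decade', 'year', 'century', 'period', 'era', 'age'}
                            
                            valid_categories = True
                            for cat in categories:
                                if not isinstance(cat, str):
                                    valid_categories = False
                                    break
                                cat_lower = cat.lower()
                                cat_words = {w.rstrip('s') for w in cat_lower.replace('-', ' ').split()}
                                if cat_lower in column_names or cat_words & temporal_indicators:
                                    valid_categories = False
                                    break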
                            
                            if valid_categories:
                                # Convert categories to range-like format for consistency
                                ranges = [{"category": cat} for cat in categories]
                                configs['semantic'] = {
                                    "partition_column": partition_column,
                                    "bucketing_criteria": bucketing_criteria,
                                    "ranges": ranges,
                                    "total_ranges": len(ranges)
                                }
                            else:
                                print(f"Categories contain invalid content: {categories}")
                    except json.JSONDecodeError as e:
                        print(f"Failed to parse semantic categories for {table_data['table_id']}: {e}")
    
    # 3. Unrestricted strategy (original logic from 1_Strategy.ipynb)
    unrestricted_prompt = f"""You are helping to paginate a table about {table_title}.
The table has the following columns: {', '.join(columns)}.

Suggest the best column to use for range-based pagination and describe how to bucket the data.
For example: "year, by decade" or "price, by $100 ranges" or "date, by month".

Respond in the format: "<column_name>, <bucketing_description>"
Example: "year, by decade"
"""
    
    print(f"Asking LLM for unrestricted range strategy for {table_data['table_id']}...")
    unrestricted_response = call_llm(unrestricted_prompt)
    
    if unrestricted_response:
        parts = unrestricted_response.split(',', 1)
        if len(parts) == 2:
            partition_column = parts[0].strip().strip('"').strip("'")
            bucketing_criteria = parts[1].strip().strip('"').strip("'")
            
            # Get ranges for unrestricted (same as original)
            ranges_prompt = f"""You are a retriever of facts.
For a table about {table_title}, we want to paginate by {partition_column} using {bucketing_criteria}.

List all the ranges needed. For each range, provide the lower bound (inclusive) and upper bound (exclusive).

RESPONSE FORMAT (JSON array of objects):
[
    {{"gte": "lower_value", "lt": "upper_value"}},
    {{"gte": "lower_value", "lt": "upper_value"}}
]

Example for "year by decade":
[
    {{"gte": "1980", "lt": "1990"}},
    {{"gte": "1990", "lt": "2000"}}
]
"""
            
            print(f"Fetching unrestricted ranges for {partition_column} {bucketing_criteria}...")
            ranges_response = call_llm(ranges_prompt)
            
            if ranges_response:
                try:
                    if not ranges_response.startswith("[") and "[" in ranges_response:
                        ranges_response = ranges_response[ranges_response.find("["):]
                    if not ranges_response.endswith("]") and "]" in ranges_response:
                        ranges_response = ranges_response[:ranges_response.rfind("]") + 1]
                    
                    ranges = json.loads(ranges_response)
                    
                    if isinstance(ranges, list):
                        configs['unrestricted'] = {
                            "partition_column": partition_column,
                            "bucketing_criteria": bucketing_criteria,
                            "ranges": ranges,
                            "total_ranges": len(ranges)
                        }
                except json.JSONDecodeError as e:
                    print(f"Failed to parse unrestricted ranges for {table_data['table_id']}: {e}")
    
    # Only return if we have all three configs
    if len(configs) == 3:
        return {
            "table_id": table_data['table_id'],
            "table_name": meta.get('name', ''),
            "strategy": "range_based",
            "metadata": {
                "table_title": table_title,
                "columns": columns,
                "key_columns": meta.get('keys', [])
            },
            "pagination_config_alphabetic": configs['alphabetic'],
            "pagination_config_semantic": configs['semantic'],
            "pagination_config_unrestricted": configs['unrestricted']
        }
    else:
        print(f"Could not generate all three configs for {table_data['table_id']} (got {len(configs)})")
        return None
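
`plan_range_based` accepts whatever ranges the LLM returns as long as they parse as a JSON list, so gaps are possible. The sample output in this notebook, for instance, ends its alphabetic ranges at `{"gte": "T", "lt": "U"}`, which leaves names starting with U-Z in no bucket. A minimal sketch of a coverage check follows, assuming single-letter bounds compared as plain strings; `uncovered_letters` is a hypothetical helper, not part of the notebook.

```python
from typing import Dict, List


def uncovered_letters(ranges: List[Dict[str, str]]) -> List[str]:
    """Return the uppercase letters A-Z not covered by any half-open [gte, lt) range."""
    letters = [chr(c) for c in range(ord("A"), ord("Z") + 1)]
    return [
        letter for letter in letters
        if not any(r["gte"] <= letter < r["lt"] for r in ranges)
    ]


# Ranges copied from the sample output for 10_men_butterfly_100m_2009
sample = [
    {"gte": "A", "lt": "G"},
    {"gte": "G", "lt": "N"},
    {"gte": "N", "lt": "T"},
    {"gte": "T", "lt": "U"},
]
print(uncovered_letters(sample))  # ['U', 'V', 'W', 'X', 'Y', 'Z']
```

A non-empty result could be treated like any other failed attempt, so that the existing retry loop asks the LLM again.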

## Main Execution: Generate Plans for All Tables

In [10]:
# Configuration: which strategies to run
STRATEGIES_TO_RUN = {
    'full_table': plan_full_table,
    'row_by_row': plan_row_by_row,
    'attribute_based': plan_attribute_based,
    'classic_pagination': plan_classic_pagination,
    'range_based': plan_range_based
}

# Choose which strategies to execute (comment out ones you don't want)
ACTIVE_STRATEGIES = [
    # 'full_table',
    # 'row_by_row',
    # 'attribute_based',
    # 'classic_pagination',
    'range_based',
]

# Apply MAX_TABLES limit if set
if MAX_TABLES is not None:
    print(f"📊 LIMITED RUN: Processing first {MAX_TABLES} table(s)")
    tables_to_process = tables[:MAX_TABLES]
else:
    tables_to_process = tables

print(f"Active strategies: {ACTIVE_STRATEGIES}")
print(f"Processing {len(tables_to_process)} table(s) out of {len(tables)} total...")

Active strategies: ['range_based']
Processing 48 table(s) out of 48 total...


In [11]:
# Create output directory with timestamp
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
output_dir = OUTPUT_ROOT / timestamp
output_dir.mkdir(parents=True, exist_ok=True)

print(f"Output directory: {output_dir}")

# Create subdirectories for each strategy upfront
for strategy_name in ACTIVE_STRATEGIES:
    strategy_dir = output_dir / strategy_name
    strategy_dir.mkdir(exist_ok=True)

# Process all tables for each strategy
results = {strategy: [] for strategy in ACTIVE_STRATEGIES}
errors = {strategy: [] for strategy in ACTIVE_STRATEGIES}

def process_strategy(strategy_name: str):
    """Process all tables for a single strategy with retry logic."""
    strategy_results = []
    strategy_errors = []
    MAX_RETRIES = 3
    
    print(f"\n{'='*60}")
    print(f"Running strategy: {strategy_name.upper()}")
    print(f"{'='*60}\n")
    
    strategy_func = STRATEGIES_TO_RUN[strategy_name]
    strategy_dir = output_dir / strategy_name
    
    for i, table in enumerate(tables_to_process):
        print(f"[{strategy_name}] [{i+1}/{len(tables_to_process)}] Processing {table['table_id']}...")
        
        plan = None
        last_error = None
        
        # Retry loop
        for attempt in range(MAX_RETRIES):
            try:
                plan = strategy_func(table)
                
                if plan:
                    # Success!
                    break
                else:
                    last_error = 'Function returned None'
                    if attempt < MAX_RETRIES - 1:
                        print(f"[{strategy_name}]   ⚠ Attempt {attempt + 1}/{MAX_RETRIES} failed (returned None), retrying...")
                    
            except Exception as e:
                last_error = str(e)
                if attempt < MAX_RETRIES - 1:
                    print(f"[{strategy_name}]   ⚠ Attempt {attempt + 1}/{MAX_RETRIES} failed: {e}, retrying...")
        
        # After all retries, check if we succeeded
        if plan:
            strategy_results.append(plan)
            
            # Save immediately after successful processing
            table_id = plan['table_id']
            output_file = strategy_dir / f"{table_id}.json"
            with open(output_file, 'w', encoding='utf-8') as f:
                json.dump(plan, f, indent=2, ensure_ascii=False)
            
            retry_msg = f" (after {attempt + 1} attempts)" if attempt > 0 else ""
            print(f"[{strategy_name}]   ✓ Success{retry_msg} - Saved to {output_file.name}")
        else:
            # All retries failed
            strategy_errors.append({
                'table_id': table['table_id'],
                'error': last_error,
                'attempts': MAX_RETRIES
            })
            print(f"[{strategy_name}]   ✗ Failed after {MAX_RETRIES} attempts: {last_error}")
        
        # Save errors incrementally too
        if strategy_errors:
            errors_file = strategy_dir / '_errors.json'
            with open(errors_file, 'w', encoding='utf-8') as f:
                json.dump(strategy_errors, f, indent=2, ensure_ascii=False)
    
    print(f"\n[{strategy_name}] Completed: {len(strategy_results)} successes, {len(strategy_errors)} errors")
    return strategy_name, strategy_results, strategy_errors

# Run strategies in parallel or sequentially
if PARALLEL_STRATEGIES:
    print(f"\n⚡ Running {len(ACTIVE_STRATEGIES)} strategies in PARALLEL with {MAX_WORKERS} workers")
    
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        # Submit all strategy tasks
        future_to_strategy = {
            executor.submit(process_strategy, strategy_name): strategy_name 
            for strategy_name in ACTIVE_STRATEGIES
        }
        
        # Collect results as they complete
        for future in as_completed(future_to_strategy):
            strategy_name, strategy_results, strategy_errors = future.result()
            results[strategy_name] = strategy_results
            errors[strategy_name] = strategy_errors
else:
    print(f"\n🔄 Running {len(ACTIVE_STRATEGIES)} strategies SEQUENTIALLY")
    
    for strategy_name in ACTIVE_STRATEGIES:
        strategy_name, strategy_results, strategy_errors = process_strategy(strategy_name)
        results[strategy_name] = strategy_results
        errors[strategy_name] = strategy_errors

print(f"\n{'='*60}")
print("SUMMARY")
print(f"{'='*60}")
for strategy_name in ACTIVE_STRATEGIES:
    print(f"{strategy_name}: {len(results[strategy_name])} tables processed")
print(f"{'='*60}")

Output directory: processing/1_strategy/20251015_022607

⚡ Running 1 strategies in PARALLEL with 5 workers

Running strategy: RANGE_BASED

[range_based] [1/48] Processing 10_men_butterfly_100m_2009...
Asking LLM for alphabetic strategy for 10_men_butterfly_100m_2009...
Fetching alphabetic ranges for Name...
Asking LLM for semantic strategy for 10_men_butterfly_100m_2009...
Fetching semantic categories for Nationality...
Asking LLM for unrestricted range strategy for 10_men_butterfly_100m_2009...
Fetching unrestricted ranges for Time by 0.5-second intervals...
[range_based]   ✓ Success - Saved to 10_men_butterfly_100m_2009.json
[range_based] [2/48] Processing 11_playstation_3_cooperative_games...

## Save Final Summary

In [12]:
# All individual files have been saved incrementally during processing
# Now just save the final summary

summary = {
    'timestamp': timestamp,
    'max_tables_limit': MAX_TABLES,
    'total_tables': len(tables),
    'processed_tables': len(tables_to_process),
    'strategies': {
        strategy_name: {
            'success_count': len(results[strategy_name]),
            'error_count': len(errors[strategy_name])
        }
        for strategy_name in ACTIVE_STRATEGIES
    }
}

summary_file = output_dir / '_summary.json'
with open(summary_file, 'w', encoding='utf-8') as f:
    json.dump(summary, f, indent=2, ensure_ascii=False)

print(f"\nFinal summary saved to {summary_file}")
print(f"\nAll done! Results saved to {output_dir}")


Final summary saved to processing/1_strategy/20251015_022607/_summary.json

All done! Results saved to processing/1_strategy/20251015_022607


## Sample Output Inspection

In [13]:
# Inspect a sample output from the first active strategy that produced results
for strategy_name in ACTIVE_STRATEGIES:
    if results[strategy_name]:
        print(f"\n{'='*60}")
        print(f"Sample output for {strategy_name.upper()}")
        print(f"{'='*60}")
        sample = results[strategy_name][0]
        print(json.dumps(sample, indent=2))
        break  # Show just one example


Sample output for RANGE_BASED
{
  "table_id": "10_men_butterfly_100m_2009",
  "table_name": "men_butterfly_100m_2009",
  "strategy": "range_based",
  "metadata": {
    "table_title": "men's 100 metre butterfly results in heats at the 2009 World Aquatics Championships",
    "columns": [
      "Name",
      "Nationality",
      "Time",
      "Heat",
      "Lane"
    ],
    "key_columns": [
      "Name"
    ]
  },
  "pagination_config_alphabetic": {
    "partition_column": "Name",
    "bucketing_criteria": "by starting letter ranges A-F, G-M, N-S, T-Z",
    "ranges": [
      {
        "gte": "A",
        "lt": "G"
      },
      {
        "gte": "G",
        "lt": "N"
      },
      {
        "gte": "N",
        "lt": "T"
      },
      {
        "gte": "T",
        "lt": "U"
      }
    ],
    "total_ranges": 4
  },
  "pagination_config_semantic": {
    "partition_column": "Nationality",
    "bucketing_criteria": "by continent (Africa, Asia, Europe, Americas, Oceania)",
    "ranges": [


## Next Steps

The pagination plans have been generated and saved to `processing/1_strategy/<timestamp>/`.

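Each strategy creates a subdirectory containing one JSON file per table, with these fields:
- **table_id**: Unique identifier
- **table_name**: Table name from the source metadata
- **strategy**: The pagination approach used
- **metadata**: Table information (title, columns, keys)
- **pagination_config**: Strategy-specific configuration for the fetch notebook. `range_based` plans instead carry three configs: `pagination_config_alphabetic`, `pagination_config_semantic`, and `pagination_config_unrestricted`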
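
As a rough sketch of how a later step might read these files back, assuming the directory layout described above, the hypothetical helper `load_plans` below loads every per-table plan in one strategy subdirectory and skips bookkeeping files such as `_errors.json`, whose names start with an underscore.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_plans(strategy_dir: Path) -> List[Dict[str, Any]]:
    """Load every per-table plan in strategy_dir, in filename order."""
    plans = []
    for path in sorted(strategy_dir.glob("*.json")):
        if path.name.startswith("_"):
            continue  # bookkeeping file (e.g. _errors.json), not a plan
        with open(path, "r", encoding="utf-8") as f:
            plans.append(json.load(f))
    return plans
```

It would be called with a path such as `Path('processing/1_strategy') / '<timestamp>' / 'range_based'`, with `<timestamp>` replaced by the run directory.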

### For the next notebook (2_FetchPages.ipynb):
1. Load these JSON files
2. For each pagination plan, execute the fetching logic that matches its strategy. With only `range_based` active, that means fetching data in buckets for each of the three configs:
   - **Alphabetic** (`pagination_config_alphabetic`): ranges with `gte`/`lt` letter bounds
   - **Semantic** (`pagination_config_semantic`): ranges with a `category` for each semantic group
   - **Unrestricted** (`pagination_config_unrestricted`): ranges with `gte`/`lt` bounds of any type (benchmark reference)
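
The three configs use two different bucket shapes: `{"gte": ..., "lt": ...}` for alphabetic and unrestricted, and `{"category": ...}` for semantic. The sketch below shows one way a consumer could handle both shapes when turning a bucket into a filter phrase. The helper `describe_range` and the wording of the phrases are assumptions for illustration; the fetch notebook's actual prompts are not shown here.

```python
from typing import Dict


def describe_range(column: str, bucket: Dict[str, str]) -> str:
    """Render one bucket as a filter phrase, handling both bucket shapes."""
    if "category" in bucket:
        # Semantic buckets carry a single category label
        return f'{column} in category "{bucket["category"]}"'
    if "gte" in bucket and "lt" in bucket:
        # Alphabetic and unrestricted buckets are half-open intervals [gte, lt)
        return f'{column} >= "{bucket["gte"]}" and {column} < "{bucket["lt"]}"'
    raise ValueError(f"Unrecognized bucket shape: {bucket}")


print(describe_range("Name", {"gte": "A", "lt": "G"}))
print(describe_range("Nationality", {"category": "Europe"}))
```

Raising on an unrecognized shape, rather than skipping it, makes a malformed plan visible before any pages are fetched.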

### To enable more strategies:
Uncomment the strategies you want in the `ACTIVE_STRATEGIES` list above and re-run the notebook.