# Strategy Planning for Pagination

This notebook builds a pagination plan for each table under each active strategy and writes those plans for the next step (fetching pages).

## Strategies Implemented:
1. **Full Table** - Fetch entire table in one query
2. **Row by Row** - Fetch all keys, then fetch each row individually
3. **Attribute-based** - Ask the LLM which column to partition by, then fetch pages by distinct values
4. **Classic Pagination** - Offset-based pagination with configurable page size
5. **Range-based** - Ask the LLM for a column and bucketing criteria, then fetch by ranges
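
As a rough way to compare the five strategies, the sketch below estimates how many fetch requests each would need for one table. The function name `estimated_fetch_requests` and its parameters are hypothetical, and the counts assume one request per row, partition value, range, or page, which the fetch notebook may not follow exactly.

```python
from math import ceil


def estimated_fetch_requests(strategy: str, n_rows: int, *,
                             page_size: int = 10,
                             n_partitions: int = 0) -> int:
    """Rough number of fetch requests for one table (illustrative assumption only)."""
    if strategy == "full_table":
        return 1                      # one query for the whole table
    if strategy == "row_by_row":
        return n_rows                 # one request per key value
    if strategy in ("attribute_based", "range_based"):
        return n_partitions           # one request per distinct value or range
    if strategy == "classic_pagination":
        return max(1, ceil(n_rows / page_size))
    raise ValueError(f"unknown strategy: {strategy}")
```

For example, `estimated_fetch_requests("classic_pagination", 25, page_size=10)` gives 3 under these assumptions.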

## Setup and Imports

In [None]:
from __future__ import annotations

import json
import os
from pathlib import Path
from datetime import datetime
from typing import Dict, List, Any, Optional
from concurrent.futures import ThreadPoolExecutor, as_completed
import openai

# ===== CONFIGURATION =====
# MAX_TABLES: Set to a number to process only that many tables (for testing/sampling)
# Set to None to process all tables
MAX_TABLES = 1  # e.g., 1 for single table test, 3 for sampling first 3 tables, None for all

# PARALLEL_STRATEGIES: Set to True to run strategies in parallel (faster)
PARALLEL_STRATEGIES = True
MAX_WORKERS = 5  # Number of parallel strategy workers (set to number of strategies)

# Paths
ROOT = Path('.')
PROCESSING_ROOT = ROOT / 'processing'
DATA_DIR = PROCESSING_ROOT / '0_data'
OUTPUT_ROOT = PROCESSING_ROOT / '1_strategy'

# Find the most recent data directory
data_subdirs = sorted([d for d in DATA_DIR.iterdir() if d.is_dir()], reverse=True)
if not data_subdirs:
    raise FileNotFoundError(f"No data found in {DATA_DIR}")

LATEST_DATA_DIR = data_subdirs[0]

print('Output base:', OUTPUT_ROOT.resolve())
print('Data directory:', LATEST_DATA_DIR.resolve())

Output base: /Users/bef/Desktop/TablePagination/processing/1_strategy
Data directory: /Users/bef/Desktop/TablePagination/processing/0_data/20251004_213355


In [41]:
# Configure OpenRouter
# Read the API key from the environment; never hardcode secrets in the notebook
OPENROUTER_API_KEY = os.getenv('OPENROUTER_API_KEY')
if not OPENROUTER_API_KEY:
    raise RuntimeError('Set the OPENROUTER_API_KEY environment variable before running this notebook')

# Create OpenAI client configured for OpenRouter
client = openai.OpenAI(
    api_key=OPENROUTER_API_KEY,
    base_url="https://openrouter.ai/api/v1"
)

# Model to use for LLM queries
MODEL = 'openai/gpt-4o-mini'  # OpenRouter format: provider/model

In [42]:
# Load all table data from step 0
def load_all_tables(data_dir: Path) -> List[Dict[str, Any]]:
    """Load all JSON files from the data directory."""
    tables = []
    for json_file in sorted(data_dir.glob('*.json')):
        with open(json_file, 'r', encoding='utf-8') as f:
            data = json.load(f)
            # Add file info
            data['file_path'] = str(json_file)
            data['table_id'] = json_file.stem  # e.g., "0_republican_straw_polls_2012"
            tables.append(data)
    return tables

tables = load_all_tables(LATEST_DATA_DIR)
print(f'Loaded {len(tables)} tables')
print(f'Sample table ID: {tables[0]["table_id"]}')

Loaded 48 tables
Sample table ID: 10_men_butterfly_100m_2009


## Helper Functions

In [43]:
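def call_llm(prompt: str, response_format: str = "text") -> Optional[str]:
    """Make a single LLM call and return the stripped response text, or None on failure.

    Note: response_format is accepted but currently unused.
    """
    try:
        response = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": prompt}],
            temperature=0
        )
        content = response.choices[0].message.content
        # content can be None (e.g. an empty completion), so guard before stripping
        return content.strip() if content is not None else None
    except Exception as e:
        print(f"LLM call failed: {e}")
        return None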

def normalize_field(s: str) -> str:
    """Normalize field names (from ChatGPT35_RowByRow_FirstExample)."""
    import re
    s = s.lower().replace(" ", "_").replace("-", "_").replace(".", "").replace(",", "_")\
            .replace("(", "").replace(")", "").replace(":", "").replace('"', '').replace("'", "")\
            .replace("/", "")
    return re.sub('_+', '_', s)
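
Because `call_llm` returns `None` on any failure, a transient network or rate-limit error currently fails the whole plan. A minimal retry sketch follows; `call_with_retry` and its parameters are hypothetical names, not part of this notebook, and the exponential backoff schedule is an assumption.

```python
import time
from typing import Callable, Optional


def call_with_retry(fn: Callable[[], Optional[str]],
                    attempts: int = 3,
                    base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    """Call fn until it returns a non-None value or the attempts run out."""
    for attempt in range(attempts):
        result = fn()
        if result is not None:
            return result
        if attempt < attempts - 1:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    return None
```

It could be used as `call_with_retry(lambda: call_llm(prompt))`; the `sleep` parameter exists only so the delay can be stubbed out in tests.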

## Strategy 1: Full Table

In [44]:
def plan_full_table(table_data: Dict[str, Any]) -> Dict[str, Any]:
    """
    Strategy: Fetch the entire table in one query.
    No LLM calls needed.
    """
    meta = table_data['meta']
    
    return {
        "table_id": table_data['table_id'],
        "table_name": meta.get('name', ''),
        "strategy": "full_table",
        "metadata": {
            "table_title": meta.get('table_title', ''),
            "columns": meta.get('columns', []),
            "key_columns": meta.get('keys', [])
        },
        "pagination_config": {}
    }

## Strategy 2: Row by Row

In [45]:
def plan_row_by_row(table_data: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """
    Strategy: Fetch all key values first, then fetch each row individually.
    Makes 1 LLM call to get all key combinations.
    """
    meta = table_data['meta']
    table_title = meta.get('table_title', '')
    keys = meta.get('keys', [])
    
    if not keys:
        print(f"Warning: No keys defined for {table_data['table_id']}")
        return None
    
    # Normalize key names
    norm_keys = [normalize_field(k) for k in keys]
    
    # Build prompt to fetch all keys (inspired by ChatGPT35_RowByRow_FirstExample)
    key_columns_desc = f"The key column{'s' if len(keys) > 1 else ''} in the table {'are' if len(keys) > 1 else 'is'} {', '.join(keys)}"
    
    keys_json_format = ', '.join([f'"{nk}": "{nk}"' for nk in norm_keys])
    
    keys_prompt = f"""You are a retriever of facts.
We want to create a table with the detailed information about {table_title}.
{key_columns_desc}.
List all {', '.join(keys)} entities for the table.
The response will be formatted as JSON list shown below.

RESPONSE FORMAT:
[{{
    {keys_json_format}
}}]"""
    
    print(f"Fetching keys for {table_data['table_id']}...")
    response = call_llm(keys_prompt)
    
    if not response:
        print(f"Failed to fetch keys for {table_data['table_id']}")
        return None
    
    # Parse the response to extract key values
    try:
        # Clean up response to extract JSON
        if not response.startswith("[") and "[" in response:
            response = response[response.find("["):]
        if not response.endswith("]") and "]" in response:
            response = response[:response.rfind("]") + 1]
        
        key_values = json.loads(response)
        
        if not isinstance(key_values, list):
            print(f"Invalid response format for {table_data['table_id']}")
            return None
            
    except json.JSONDecodeError as e:
        print(f"Failed to parse keys response for {table_data['table_id']}: {e}")
        return None
    
    return {
        "table_id": table_data['table_id'],
        "table_name": meta.get('name', ''),
        "strategy": "row_by_row",
        "metadata": {
            "table_title": table_title,
            "columns": meta.get('columns', []),
            "key_columns": keys
        },
        "pagination_config": {
            "key_columns": keys,
            "key_values": key_values,
            "total_rows": len(key_values)
        }
    }
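
The bracket-trimming and `json.loads` steps in `plan_row_by_row` recur in the attribute-based and range-based planners. One way to factor them out is sketched below; `extract_json_array` is a hypothetical helper name, and it returns `None` instead of printing so that the caller decides how to report the failure.

```python
import json
from typing import Any, List, Optional


def extract_json_array(text: str) -> Optional[List[Any]]:
    """Return the JSON array embedded in text, or None if none can be parsed."""
    start = text.find("[")
    end = text.rfind("]")
    if start == -1 or end < start:
        return None
    try:
        # Drop any prose or markdown fencing around the outermost brackets
        parsed = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, list) else None
```

With this helper, a response wrapped in explanatory text or a markdown code fence is handled the same way as a bare array.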

## Strategy 3: Attribute-based

In [46]:
def plan_attribute_based(table_data: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """
    Strategy: Ask LLM which column to partition by, then get distinct values.
    Makes 2 LLM calls:
    1. Ask which column to use for partitioning
    2. Get all distinct values for that column
    """
    meta = table_data['meta']
    table_title = meta.get('table_title', '')
    columns = meta.get('columns', [])
    
    if not columns:
        print(f"Warning: No columns defined for {table_data['table_id']}")
        return None
    
    # Call 1: Ask which column to partition by
    partition_prompt = f"""You are helping to paginate a table about {table_title}.
The table has the following columns: {', '.join(columns)}.

Which single column would be best to use for partitioning/grouping this table's data?
Choose a column that would create meaningful, balanced groups.

Respond with ONLY the column name, nothing else."""
    
    print(f"Asking LLM for partition column for {table_data['table_id']}...")
    partition_column = call_llm(partition_prompt)
    
    if not partition_column or partition_column not in columns:
        print(f"Invalid partition column '{partition_column}' for {table_data['table_id']}")
        return None
    
    # Call 2: Get all distinct values for that column
    values_prompt = f"""You are a retriever of facts.
We want to paginate a table about {table_title}.
List all distinct values of the column "{partition_column}" in this table.

RESPONSE FORMAT:
["{partition_column}_value1", "{partition_column}_value2", ...]"""
    
    print(f"Fetching distinct values for column '{partition_column}'...")
    response = call_llm(values_prompt)
    
    if not response:
        print(f"Failed to fetch values for {table_data['table_id']}")
        return None
    
    # Parse the response
    try:
        if not response.startswith("[") and "[" in response:
            response = response[response.find("["):]
        if not response.endswith("]") and "]" in response:
            response = response[:response.rfind("]") + 1]
        
        partition_values = json.loads(response)
        
        if not isinstance(partition_values, list):
            print(f"Invalid response format for {table_data['table_id']}")
            return None
            
    except json.JSONDecodeError as e:
        print(f"Failed to parse values response for {table_data['table_id']}: {e}")
        return None
    
    return {
        "table_id": table_data['table_id'],
        "table_name": meta.get('name', ''),
        "strategy": "attribute_based",
        "metadata": {
            "table_title": table_title,
            "columns": columns,
            "key_columns": meta.get('keys', [])
        },
        "pagination_config": {
            "partition_column": partition_column,
            "partition_values": partition_values,
            "total_partitions": len(partition_values)
        }
    }
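
The check `partition_column not in columns` rejects answers that differ from a column name only by quoting or case, such as `"heat"` for `Heat`. A more tolerant match could look like the sketch below; `match_column` is a hypothetical helper, and treating case-insensitive matches as valid is an assumption about what the planner should accept.

```python
from typing import List, Optional

QUOTE_CHARS = "\"'`"


def match_column(answer: str, columns: List[str]) -> Optional[str]:
    """Map an LLM answer to one of the known column names, or None if no match."""
    cleaned = answer.strip().strip(QUOTE_CHARS).strip().casefold()
    for column in columns:
        if column.casefold() == cleaned:
            return column  # return the canonical spelling from the metadata
    return None
```

Returning the canonical spelling keeps `partition_column` consistent with the `columns` list stored in the plan.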

## Strategy 4: Classic Pagination (Offset)

In [47]:
def plan_classic_pagination(table_data: Dict[str, Any], page_size: int = 10) -> Optional[Dict[str, Any]]:
    """
    Strategy: Classic offset-based pagination.
    No LLM calls - just configuration.
    The fetch notebook will iteratively fetch pages.
    """
    meta = table_data['meta']
    keys = meta.get('keys', [])
    
    if not keys:
        print(f"Warning: No keys defined for {table_data['table_id']}")
        return None
    
    # Default sort order: ascending by key columns
    sort_order = [f"{key} ASC" for key in keys]
    
    return {
        "table_id": table_data['table_id'],
        "table_name": meta.get('name', ''),
        "strategy": "classic_pagination",
        "metadata": {
            "table_title": meta.get('table_title', ''),
            "columns": meta.get('columns', []),
            "key_columns": keys
        },
        "pagination_config": {
            "page_size": page_size,
            "primary_keys": keys,
            "sort_order": sort_order
        }
    }
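
The plan stores only `page_size` and the sort order; the iteration itself is left to the fetch notebook. A minimal sketch of such a loop is shown below. `fetch_all_pages`, the `fetch_page(offset, limit)` callable, and the `max_pages` safety cap are all hypothetical, and stopping on the first short page is an assumption about how the end of the table is detected.

```python
from typing import Any, Callable, Dict, List


def fetch_all_pages(fetch_page: Callable[[int, int], List[Dict[str, Any]]],
                    page_size: int,
                    max_pages: int = 100) -> List[Dict[str, Any]]:
    """Request pages at increasing offsets until a page comes back short."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    rows: List[Dict[str, Any]] = []
    for page_index in range(max_pages):
        page = fetch_page(page_index * page_size, page_size)
        rows.extend(page)
        if len(page) < page_size:
            break  # a short (or empty) page means there is nothing left
    return rows
```

The cap guards against a model that keeps returning full pages indefinitely.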

## Strategy 5: Range-based Pagination

In [None]:
def plan_range_based(table_data: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """
    Strategy: Ask LLM for column and bucketing criteria, then define ranges.
    Makes 2 LLM calls:
    1. Ask for the column and a bucketing description
    2. Ask for the concrete ranges
    Returns None if either call fails or cannot be parsed.
    """
    meta = table_data['meta']
    table_title = meta.get('table_title', '')
    columns = meta.get('columns', [])
    
    if not columns:
        print(f"Warning: No columns defined for {table_data['table_id']}")
        return None
    
    # Ask LLM for column and bucketing strategy
    range_prompt = f"""You are helping to paginate a table about {table_title}.
The table has the following columns: {', '.join(columns)}.

Suggest the best column to use for range-based pagination and describe how to bucket the data.
For example: "year, by decade" or "price, by $100 ranges" or "date, by month".

Respond in the format: "<column_name>, <bucketing_description>"
Example: "year, by decade"
"""
    
    print(f"Asking LLM for range strategy for {table_data['table_id']}...")
    response = call_llm(range_prompt)
    
    if not response:
        print(f"Failed to get range strategy for {table_data['table_id']}")
        return None
    
    # Parse response (expected format: "column, bucketing")
    parts = response.split(',', 1)
    if len(parts) != 2:
        print(f"Invalid range response format for {table_data['table_id']}: {response}")
        return None
    
    partition_column = parts[0].strip().strip('"').strip("'")
    bucketing_criteria = parts[1].strip().strip('"').strip("'")
    
    # Now ask for the actual ranges
    ranges_prompt = f"""You are a retriever of facts.
For a table about {table_title}, we want to paginate by {partition_column} using {bucketing_criteria}.

List all the ranges needed. For each range, provide the lower bound (inclusive) and upper bound (exclusive).

RESPONSE FORMAT (JSON array of objects):
[
    {{"gte": "lower_value", "lt": "upper_value"}},
    {{"gte": "lower_value", "lt": "upper_value"}}
]

Example for "year by decade":
[
    {{"gte": "1980", "lt": "1990"}},
    {{"gte": "1990", "lt": "2000"}}
]
"""
    
    print(f"Fetching ranges for {partition_column} {bucketing_criteria}...")
    ranges_response = call_llm(ranges_prompt)
    
    if not ranges_response:
        print(f"Failed to get ranges for {table_data['table_id']}")
        return None
    
    # Parse ranges
    try:
        if not ranges_response.startswith("[") and "[" in ranges_response:
            ranges_response = ranges_response[ranges_response.find("["):]
        if not ranges_response.endswith("]") and "]" in ranges_response:
            ranges_response = ranges_response[:ranges_response.rfind("]") + 1]
        
        ranges = json.loads(ranges_response)
        
        if not isinstance(ranges, list):
            print(f"Invalid ranges format for {table_data['table_id']}")
            return None
            
    except json.JSONDecodeError as e:
        print(f"Failed to parse ranges for {table_data['table_id']}: {e}")
        return None
    
    return {
        "table_id": table_data['table_id'],
        "table_name": meta.get('name', ''),
        "strategy": "range_based",
        "metadata": {
            "table_title": table_title,
            "columns": columns,
            "key_columns": meta.get('keys', [])
        },
        "pagination_config": {
            "partition_column": partition_column,
            "bucketing_criteria": bucketing_criteria,
            "ranges": ranges,
            "total_ranges": len(ranges)
        }
    }
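
`plan_range_based` only checks that the parsed response is a list, so malformed range objects would reach the fetch step. The sketch below shows one possible shape check; `validate_ranges` is a hypothetical helper, and flagging non-contiguous neighbours is an assumption, since some tables may legitimately have gaps.

```python
from typing import Any, List


def validate_ranges(ranges: List[Any]) -> List[str]:
    """Return a list of problems found; an empty list means the shape looks fine."""
    problems: List[str] = []
    for i, item in enumerate(ranges):
        if not isinstance(item, dict) or "gte" not in item or "lt" not in item:
            problems.append(f"range {i}: expected an object with 'gte' and 'lt'")
    for i in range(1, len(ranges)):
        prev, cur = ranges[i - 1], ranges[i]
        if (isinstance(prev, dict) and isinstance(cur, dict)
                and "lt" in prev and "gte" in cur
                and prev["lt"] != cur["gte"]):
            # Upper bound of one bucket should equal the lower bound of the next
            problems.append(f"range {i}: does not start where range {i - 1} ends")
    return problems
```

A caller could log the returned problems and still save the plan, or return `None` as the other failure paths do.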

## Main Execution: Generate Plans for All Tables

In [None]:
# Configuration: which strategies to run
STRATEGIES_TO_RUN = {
    'full_table': plan_full_table,
    'row_by_row': plan_row_by_row,
    'attribute_based': plan_attribute_based,
    'classic_pagination': plan_classic_pagination,
    'range_based': plan_range_based
}

# Choose which strategies to execute (comment out ones you don't want)
ACTIVE_STRATEGIES = [
    'full_table',
    'row_by_row',
    'attribute_based',
    'classic_pagination',
    'range_based',
]

# Apply MAX_TABLES limit if set
if MAX_TABLES is not None:
    print(f"📊 LIMITED RUN: Processing first {MAX_TABLES} table(s)")
    tables_to_process = tables[:MAX_TABLES]
else:
    tables_to_process = tables

print(f"Active strategies: {ACTIVE_STRATEGIES}")
print(f"Processing {len(tables_to_process)} table(s) out of {len(tables)} total...")

🧪 TEST RUN MODE: Processing only 1 table
Active strategies: ['full_table', 'row_by_row', 'attribute_based', 'classic_pagination', 'range_based']
Processing 1 table(s)...


In [None]:
# Create output directory with timestamp
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
output_dir = OUTPUT_ROOT / timestamp
output_dir.mkdir(parents=True, exist_ok=True)

print(f"Output directory: {output_dir}")

# Create subdirectories for each strategy upfront
for strategy_name in ACTIVE_STRATEGIES:
    strategy_dir = output_dir / strategy_name
    strategy_dir.mkdir(exist_ok=True)

# Process all tables for each strategy
results = {strategy: [] for strategy in ACTIVE_STRATEGIES}
errors = {strategy: [] for strategy in ACTIVE_STRATEGIES}

def process_strategy(strategy_name: str):
    """Process all tables for a single strategy."""
    strategy_results = []
    strategy_errors = []
    
    print(f"\n{'='*60}")
    print(f"Running strategy: {strategy_name.upper()}")
    print(f"{'='*60}\n")
    
    strategy_func = STRATEGIES_TO_RUN[strategy_name]
    strategy_dir = output_dir / strategy_name
    
    for i, table in enumerate(tables_to_process):
        print(f"[{strategy_name}] [{i+1}/{len(tables_to_process)}] Processing {table['table_id']}...")
        
        try:
            plan = strategy_func(table)
            
            if plan:
                strategy_results.append(plan)
                
                # Save immediately after successful processing
                table_id = plan['table_id']
                output_file = strategy_dir / f"{table_id}.json"
                with open(output_file, 'w', encoding='utf-8') as f:
                    json.dump(plan, f, indent=2, ensure_ascii=False)
                
                print(f"[{strategy_name}]   ✓ Success - Saved to {output_file.name}")
            else:
                strategy_errors.append({
                    'table_id': table['table_id'],
                    'error': 'Function returned None'
                })
                print(f"[{strategy_name}]   ✗ Failed")
                
        except Exception as e:
            strategy_errors.append({
                'table_id': table['table_id'],
                'error': str(e)
            })
            print(f"[{strategy_name}]   ✗ Error: {e}")
        
        # Save errors incrementally too
        if strategy_errors:
            errors_file = strategy_dir / '_errors.json'
            with open(errors_file, 'w', encoding='utf-8') as f:
                json.dump(strategy_errors, f, indent=2, ensure_ascii=False)
    
    print(f"\n[{strategy_name}] Completed: {len(strategy_results)} successes, {len(strategy_errors)} errors")
    return strategy_name, strategy_results, strategy_errors

# Run strategies in parallel or sequentially
if PARALLEL_STRATEGIES:
    print(f"\n⚡ Running {len(ACTIVE_STRATEGIES)} strategies in PARALLEL with {MAX_WORKERS} workers")
    
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        # Submit all strategy tasks
        future_to_strategy = {
            executor.submit(process_strategy, strategy_name): strategy_name 
            for strategy_name in ACTIVE_STRATEGIES
        }
        
        # Collect results as they complete
        for future in as_completed(future_to_strategy):
            strategy_name, strategy_results, strategy_errors = future.result()
            results[strategy_name] = strategy_results
            errors[strategy_name] = strategy_errors
else:
    print(f"\n🔄 Running {len(ACTIVE_STRATEGIES)} strategies SEQUENTIALLY")
    
    for strategy_name in ACTIVE_STRATEGIES:
        strategy_name, strategy_results, strategy_errors = process_strategy(strategy_name)
        results[strategy_name] = strategy_results
        errors[strategy_name] = strategy_errors

print(f"\n{'='*60}")
print("SUMMARY")
print(f"{'='*60}")
for strategy_name in ACTIVE_STRATEGIES:
    print(f"{strategy_name}: {len(results[strategy_name])} tables processed")
print(f"{'='*60}")

Output directory: processing/1_strategy/20251004_213906

Running strategy: FULL_TABLE

[1/1] Processing 10_men_butterfly_100m_2009...
  ✓ Success - Saved to 10_men_butterfly_100m_2009.json

Completed full_table: 1 successes, 0 errors

Running strategy: ROW_BY_ROW

[1/1] Processing 10_men_butterfly_100m_2009...
Fetching keys for 10_men_butterfly_100m_2009...
  ✓ Success - Saved to 10_men_butterfly_100m_2009.json

Completed row_by_row: 1 successes, 0 errors

Running strategy: ATTRIBUTE_BASED

[1/1] Processing 10_men_butterfly_100m_2009...
Asking LLM for partition column for 10_men_butterfly_100m_2009...
Fetching distinct values for column 'Heat'...
  ✓ Success - Saved to 10_men_butterfly_100m_2009.

## Save Final Summary

In [None]:
# All individual files have been saved incrementally during processing
# Now just save the final summary

summary = {
    'timestamp': timestamp,
    'max_tables_limit': MAX_TABLES,
    'total_tables': len(tables),
    'processed_tables': len(tables_to_process),
    'strategies': {
        strategy_name: {
            'success_count': len(results[strategy_name]),
            'error_count': len(errors[strategy_name])
        }
        for strategy_name in ACTIVE_STRATEGIES
    }
}

summary_file = output_dir / '_summary.json'
with open(summary_file, 'w', encoding='utf-8') as f:
    json.dump(summary, f, indent=2, ensure_ascii=False)

print(f"\nFinal summary saved to {summary_file}")
print(f"\nAll done! Results saved to {output_dir}")


Final summary saved to processing/1_strategy/20251004_213906/_summary.json

All done! Results saved to processing/1_strategy/20251004_213906


## Sample Output Inspection

In [52]:
# Inspect a sample output from the first strategy that produced results
for strategy_name in ACTIVE_STRATEGIES:
    if results[strategy_name]:
        print(f"\n{'='*60}")
        print(f"Sample output for {strategy_name.upper()}")
        print(f"{'='*60}")
        sample = results[strategy_name][0]
        print(json.dumps(sample, indent=2))
        break  # Show just one example


Sample output for FULL_TABLE
{
  "table_id": "10_men_butterfly_100m_2009",
  "table_name": "men_butterfly_100m_2009",
  "strategy": "full_table",
  "metadata": {
    "table_title": "men's 100 metre butterfly results in heats at the 2009 World Aquatics Championships",
    "columns": [
      "Name",
      "Nationality",
      "Time",
      "Heat",
      "Lane"
    ],
    "key_columns": [
      "Name"
    ]
  },
  "pagination_config": {}
}


## Next Steps

The pagination plans have been generated and saved to `processing/1_strategy/<timestamp>/`.

Each strategy gets its own subdirectory with one JSON file per table. Each file contains:
- **table_id**: Unique identifier
- **table_name**: Table name from the source metadata
- **strategy**: The pagination approach used
- **metadata**: Table information (title, columns, keys)
- **pagination_config**: Strategy-specific configuration for the fetch notebook
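
A consumer of these files might load and sanity-check them as in the sketch below. `load_plans` and `REQUIRED_FIELDS` are hypothetical names; the field list mirrors the keys the planner functions return, and files whose names start with an underscore (such as `_errors.json`) are skipped.

```python
import json
from pathlib import Path
from typing import Any, Dict, List

REQUIRED_FIELDS = ("table_id", "table_name", "strategy", "metadata", "pagination_config")


def load_plans(strategy_dir: Path) -> List[Dict[str, Any]]:
    """Load every plan file in one strategy subdirectory and check its top-level keys."""
    plans: List[Dict[str, Any]] = []
    for path in sorted(strategy_dir.glob("*.json")):
        if path.name.startswith("_"):
            continue  # bookkeeping files such as _errors.json are not plans
        with open(path, "r", encoding="utf-8") as f:
            plan = json.load(f)
        missing = [key for key in REQUIRED_FIELDS if key not in plan]
        if missing:
            raise ValueError(f"{path.name} is missing fields: {missing}")
        plans.append(plan)
    return plans
```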

### For the next notebook (2_FetchPages.ipynb):
1. Load these JSON files
2. For each pagination plan, execute the appropriate fetching logic:
   - **full_table**: One query for entire table
   - **row_by_row**: Use key_values to fetch each row
   - **attribute_based**: Use partition_column and partition_values to fetch filtered pages
   - **classic_pagination**: Iteratively fetch pages using page_size and primary_keys
   - **range_based**: Use ranges to fetch data in buckets
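
The per-strategy steps above can be sketched as a small dispatcher that lists the units of work a plan implies. `describe_fetch_units` is a hypothetical function that only produces human-readable descriptions; the actual prompts and fetching logic belong to the next notebook and are not shown here.

```python
from typing import Any, Dict, List


def describe_fetch_units(plan: Dict[str, Any]) -> List[str]:
    """List the fetch units implied by a plan, one description per request group."""
    cfg = plan["pagination_config"]
    strategy = plan["strategy"]
    if strategy == "full_table":
        return ["entire table"]
    if strategy == "row_by_row":
        return [f"row with keys {kv}" for kv in cfg["key_values"]]
    if strategy == "attribute_based":
        column = cfg["partition_column"]
        return [f"{column} = {value}" for value in cfg["partition_values"]]
    if strategy == "range_based":
        column = cfg["partition_column"]
        return [f"{r['gte']} <= {column} < {r['lt']}" for r in cfg["ranges"]]
    if strategy == "classic_pagination":
        order = ", ".join(cfg["sort_order"])
        return [f"pages of {cfg['page_size']} rows ordered by {order}"]
    raise ValueError(f"unknown strategy: {strategy}")
```

Raising on an unknown strategy makes a mismatch between the two notebooks visible immediately instead of silently skipping a plan.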

### To change which strategies run:
Comment out or restore entries in the `ACTIVE_STRATEGIES` list above and re-run the notebook.