# Pagination and Large Result Sets

This notebook demonstrates how to efficiently work with large result sets using:
- The `paginate()` method for automatic pagination
- Manual pagination with `offset` and `limit`
- Best practices for handling large datasets

## When to Use Pagination

Use pagination when:
- You expect more than 50-100 results
- You want to process results in batches
- You need to implement infinite scroll or "load more"
- Memory constraints require processing data in chunks

## Setup

In [None]:
# Add parent directory to path for local development
import sys
import os
sys.path.insert(0, os.path.abspath('..'))

In [None]:
from nanohubremote import Session
from nanohubresults import Results

# Initialize session
auth_data = {
    "grant_type": "personal_token",
    "token": "YOUR_TOKEN_HERE"
}
session = Session(auth_data, url="https://nanohub.org/api")
results = Results(session)

print("✓ Connected to nanoHUB API")

## Method 1: Automatic Pagination with `paginate()`

The `paginate()` method automatically fetches results in pages and yields them one at a time. This is the easiest way to iterate over large result sets.

### How it works:
1. Fetches results in pages (default: 50 per page)
2. Automatically handles offset increments
3. Stops when no more results are available
4. Yields individual results, not pages
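
The four steps above can be sketched as a plain Python generator. This is only an illustration of the general technique, not the library's actual implementation: `paginate_sketch`, `fetch_page`, and `fake_fetch` are hypothetical names, and the in-memory list stands in for real API calls. The early exit on a short page is an extra assumption that saves one empty request.

```python
from typing import Callable, Dict, Iterator, List


def paginate_sketch(
    fetch_page: Callable[[int, int], List[Dict]],  # hypothetical: (offset, limit) -> one page
    per_page: int = 50,
) -> Iterator[Dict]:
    """Yield individual results while fetching one page at a time."""
    offset = 0
    while True:
        page = fetch_page(offset, per_page)  # step 1: fetch a page
        if not page:
            break  # step 3: stop when no results remain
        yield from page  # step 4: yield results, not pages
        if len(page) < per_page:
            break  # assumption: a short page means this was the last one
        offset += per_page  # step 2: advance the offset


# Stand-in data source: 120 fake records instead of a real API
_fake_records = [{"squid": f"demo/{i}"} for i in range(120)]


def fake_fetch(offset: int, limit: int) -> List[Dict]:
    return _fake_records[offset:offset + limit]


collected = list(paginate_sketch(fake_fetch, per_page=50))
print(len(collected))
```

Because the generator is lazy, breaking out of the consuming loop early also stops any further page requests.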

In [None]:
print("Setting up pagination query...")

query = results.query("2dfets", simtool=False) \
    .filter("input.Ef", ">", 0) \
    .select("input.Ef", "input.Lg", "output.f11")

print("Query ready. Will fetch results in pages of 10.")

### Basic Pagination Example

In [None]:
print("Iterating over results...\n")
print(f"{'#':<6} {'SQUID':<50} {'Ef (V)':<10} {'Lg (nm)':<10}")
print("-" * 80)

count = 0
for result in query.paginate(per_page=10):
    count += 1
    
    # Show details for first 5 results
    if count <= 5:
        squid = result.get('squid', '')
        if len(squid) > 50:
            squid = squid[:47] + '...'  # Truncate long IDs to fit the 50-character column
        ef = result.get('input.Ef', 'N/A')
        lg = result.get('input.Lg', 'N/A')
        print(f"{count:<6} {squid:<50} {ef:<10} {lg:<10}")
    elif count == 6:
        print("... (processing remaining results)")
    
    # Safety limit for demo
    if count >= 50:
        print("\nReached demo limit of 50 results")
        break

print(f"\n✓ Processed {count} results total")

## Method 2: Manual Pagination

For more control, you can implement pagination manually using `offset` and `limit`.

In [None]:
print("Manual pagination example\n")

page_size = 10
page_num = 0
total_results = 0

while page_num < 3:  # Fetch first 3 pages
    offset = page_num * page_size
    
    print(f"Fetching page {page_num + 1} (offset={offset}, limit={page_size})...")
    
    page_query = results.query("2dfets", simtool=False) \
        .filter("input.Ef", ">", 0.2) \
        .select("input.Ef", "input.Lg") \
        .limit(page_size) \
        .offset(offset)
    
    response = page_query.execute()
    page_results = response.get('results', [])
    
    if not page_results:
        print("  No more results")
        break
    
    print(f"  Retrieved {len(page_results)} results")
    total_results += len(page_results)
    page_num += 1

print(f"\n✓ Total results fetched: {total_results}")

## Processing Data in Batches

For large datasets, you might want to process and save data in batches to avoid memory issues.

In [None]:
import json

print("Processing results in batches...\n")

batch_query = results.query("2dfets", simtool=False) \
    .filter("input.Ef", ">", 0) \
    .select("input.Ef", "input.Lg", "input.temperature")

batch_size = 10
batch_num = 0
batch_data = []

for result in batch_query.paginate(per_page=10):
    batch_data.append(result)
    
    # When batch is full, save it
    if len(batch_data) >= batch_size:
        batch_file = f"batch_{batch_num:03d}.json"
        with open(batch_file, 'w') as f:
            json.dump(batch_data, f, indent=2)
        
        print(f"Saved batch {batch_num} to {batch_file} ({len(batch_data)} results)")
        
        batch_num += 1
        batch_data = []
    
    # Demo limit
    if batch_num >= 3:
        break

# Save any remaining data
if batch_data:
    batch_file = f"batch_{batch_num:03d}.json"
    with open(batch_file, 'w') as f:
        json.dump(batch_data, f, indent=2)
    print(f"Saved final batch {batch_num} to {batch_file} ({len(batch_data)} results)")
    batch_num += 1  # Count the final partial batch as a saved file

# batch_num now equals the number of files written
print(f"\n✓ Saved {batch_num} batch files")

## Collecting Statistics Across Pages

You can collect statistics without loading all data into memory by updating running totals (count, sum, min, max) as each result arrives.

In [None]:
print("Collecting statistics across all pages...\n")

stats_query = results.query("2dfets", simtool=False) \
    .filter("input.Ef", ">", 0) \
    .select("input.Ef", "input.Lg", "input.temperature")

# Initialize statistics
stats = {
    'count': 0,
    'ef_sum': 0,
    'ef_min': float('inf'),
    'ef_max': float('-inf'),
    'lg_sum': 0,
    'temp_counts': {}
}

# Process results
for result in stats_query.paginate(per_page=20):
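    ef = result.get('input.Ef')
    lg = result.get('input.Lg')
    temp = result.get('input.temperature')
    
    # Skip incomplete records so a missing value is not counted as 0
    if ef is None or lg is None:
        continue
    
    stats['count'] += 1
    stats['ef_sum'] += ef
    stats['ef_min'] = min(stats['ef_min'], ef)
    stats['ef_max'] = max(stats['ef_max'], ef)
    stats['lg_sum'] += lg
    
    # Compare against None so a temperature of 0 is still counted
    if temp is not None:
        stats['temp_counts'][temp] = stats['temp_counts'].get(temp, 0) + 1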
    
    # Demo limit
    if stats['count'] >= 100:
        break

# Calculate averages
if stats['count'] > 0:
    print("Statistics Summary:")
    print(f"  Total results: {stats['count']}")
    print(f"\n  Fermi Energy:")
    print(f"    Min: {stats['ef_min']:.3f} V")
    print(f"    Max: {stats['ef_max']:.3f} V")
    print(f"    Avg: {stats['ef_sum'] / stats['count']:.3f} V")
    print(f"\n  Gate Length:")
    print(f"    Avg: {stats['lg_sum'] / stats['count']:.1f} nm")
    print(f"\n  Temperature distribution:")
    for temp, count in sorted(stats['temp_counts'].items()):
        print(f"    {temp} K: {count} results ({count/stats['count']*100:.1f}%)")
else:
    print("No results to analyze")
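
Running sums like the ones above give a mean but not a spread. If a variance is also wanted, one common single-pass option is Welford's algorithm, sketched below. `RunningStats` is a hypothetical helper written for this illustration, not part of `nanohubresults`, and the sample values are made up.

```python
class RunningStats:
    """Single-pass count, mean, variance, min and max (Welford's algorithm)."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean
        self.minimum = float("inf")
        self.maximum = float("-inf")

    def update(self, value: float) -> None:
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def variance(self) -> float:
        # Population variance; returns 0.0 when nothing has been seen yet
        return self._m2 / self.count if self.count else 0.0


ef_stats = RunningStats()
for value in [1.0, 2.0, 3.0, 4.0]:  # made-up values standing in for input.Ef
    ef_stats.update(value)

print(ef_stats.mean, ef_stats.variance, ef_stats.minimum, ef_stats.maximum)
```

Inside a `paginate()` loop, `update()` would be called once per result, so memory use stays constant regardless of how many results are processed.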

## Performance Comparison

Let's compare the performance of different page sizes.

In [None]:
import time

print("Performance comparison of different page sizes:\n")
print(f"{'Page Size':<12} {'Time (s)':<12} {'Results':<10} {'Pages':<10}")
print("-" * 50)

for page_size in [5, 10, 20, 50]:
    perf_query = results.query("2dfets", simtool=False) \
        .filter("input.Ef", ">", 0.25) \
        .filter("input.Ef", "<", 0.35) \
        .select("input.Ef")
    
    start_time = time.perf_counter()  # Monotonic clock, better suited to timing than time.time()
    count = 0
    
    for result in perf_query.paginate(per_page=page_size):
        count += 1
        
        # Limit to 50 results for fair comparison
        if count >= 50:
            break
    
    elapsed = time.perf_counter() - start_time
    # Pages needed to deliver `count` results (ceiling division)
    pages = -(-count // page_size)
    print(f"{page_size:<12} {elapsed:<12.3f} {count:<10} {pages:<10}")

print("\n💡 Tip: Larger page sizes are generally faster but use more memory per page.")

## Summary

In this notebook, you learned:
1. ✓ How to use `paginate()` for automatic pagination
2. ✓ How to implement manual pagination with offset/limit
3. ✓ How to process large datasets in batches
4. ✓ How to collect statistics efficiently
5. ✓ Performance considerations for different page sizes

## Best Practices

### Choose the Right Page Size
- **Small (10-20)**: Good for interactive applications, lower memory
- **Medium (50)**: Default, balanced performance
- **Large (100+)**: Faster for bulk processing, higher memory

### When to Use Each Method
- **`paginate()`**: When you want to process all results sequentially
- **Manual pagination**: When you need specific pages (e.g., page 5 only)
- **Batch processing**: When dealing with very large datasets
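
Fetching a specific page manually comes down to converting a page number into an offset. A minimal sketch, assuming 1-based page numbers; `page_to_offset` is a hypothetical helper, not a library function:

```python
def page_to_offset(page_number: int, page_size: int) -> int:
    """Convert a 1-based page number into a 0-based result offset."""
    if page_number < 1 or page_size < 1:
        raise ValueError("page_number and page_size must be positive")
    return (page_number - 1) * page_size


offset = page_to_offset(5, 10)
print(offset)  # 40: page 5 starts after the first four pages of 10 results
```

The resulting value can then be passed to `.offset()` together with `.limit(page_size)`, as in the manual pagination example.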

### Memory Management
- Process data as you iterate, don't collect all results first
- Save to disk in batches for very large datasets
- Use generators and iterators to minimize memory usage
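
One way to combine these points is a small generator that groups any iterator into fixed-size batches, so only one batch is held in memory at a time. `in_batches` is a hypothetical helper, and `range(25)` stands in for the results yielded by `paginate()`:

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def in_batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of up to batch_size items without materializing the input."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))  # pull at most batch_size items
        if not batch:
            return  # input exhausted
        yield batch


batch_sizes = [len(batch) for batch in in_batches(range(25), 10)]
print(batch_sizes)  # [10, 10, 5]
```

Each yielded batch could be written to disk and then discarded before the next one is pulled.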

### Performance Tips
- Select only the fields you need
- Use filters to reduce result set size
- Consider caching frequently accessed data
- Balance page size between API calls and memory usage
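
The trade-off in the last tip can be made concrete with simple arithmetic: the number of requests is the result count divided by the page size, rounded up. The figure of 1000 results below is made up for illustration, and `requests_needed` is a hypothetical helper:

```python
import math


def requests_needed(total_results: int, page_size: int) -> int:
    """Number of page requests required to fetch total_results items."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    return math.ceil(total_results / page_size)


# Larger pages mean fewer round trips but more results held in memory per page
for page_size in (10, 50, 100):
    print(f"page_size={page_size:<4} requests={requests_needed(1000, page_size)}")
```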

## Advanced Topics

For production applications, consider:
- Implementing retry logic for failed requests
- Adding progress bars (e.g., with `tqdm`)
- Parallel processing of pages
- Caching strategies with `requests-cache`
- Database storage for large datasets
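
As an example of the first item, retry logic is often written as a small wrapper with exponential backoff. The sketch below is one possible approach under stated assumptions: `with_retries` and `flaky` are hypothetical names, the failure is simulated, and the exception types raised by the nanoHUB client are not known here, so the caller chooses which ones to retry on.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    operation: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),  # narrow this in real code
) -> T:
    """Call operation(), retrying with exponential backoff on failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** (attempt - 1))  # waits 1x, 2x, 4x, ... base_delay
    raise RuntimeError("unreachable")  # keeps type checkers satisfied


# Simulated operation that fails twice and then succeeds
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated failure")
    return "ok"


outcome = with_retries(flaky, max_attempts=3, base_delay=0.01, retry_on=(ConnectionError,))
print(outcome, calls["n"])
```

In the manual pagination loop, the same wrapper could be applied as `with_retries(lambda: page_query.execute())`.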