# On-Disk Learning Part 2: Advanced Techniques

**Optimize your research workflow with intelligent caching and storage options**

---

## What You'll Learn

- ✅ DAG-based incremental caching (2-3× faster iteration)
- ✅ Storage backend options (files vs mmap)
- ✅ Parallel processing (3-4× speedup)
- ✅ Best practices for iterative experimentation

**Time:** 20-25 minutes  
**Prerequisites:** Complete Part 1 first

---

## 1. The Iteration Problem

Research involves testing multiple transform combinations:

### Without Caching
```python
# Baseline
config1 = {"clique_lifting": {...}}  # 40 minutes

# Add feature A  
config2 = {"clique_lifting": {...}, "feature_A": {...}}  # 54 minutes (reprocesses clique!)

# Try feature B
config3 = {"clique_lifting": {...}, "feature_B": {...}}  # 54 minutes (reprocesses clique!)

# Total: 148 minutes, of which 80 minutes were spent reprocessing clique_lifting!
```

### With DAG Caching
```python
# Baseline
config1 = {"clique_lifting": {...}}  # 40 minutes (computed once, then cached)

# Add feature A
config2 = {"clique_lifting": {...}, "feature_A": {...}}  # 14 minutes (reuses clique!)

# Try feature B  
config3 = {"clique_lifting": {...}, "feature_B": {...}}  # 14 minutes (reuses clique!)

# Total: 68 minutes - saved 80 minutes (2.2× faster!)
```
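The totals in both comparisons follow from two per-stage costs read off the comments above: 40 minutes for `clique_lifting` and 14 minutes for one additional feature transform. The short sketch below reproduces the arithmetic; the helper `total_minutes` is a hypothetical name used only for illustration.

```python
# Per-stage timings taken from the comparison above (minutes).
CLIQUE_MIN = 40   # clique_lifting on its own
FEATURE_MIN = 14  # one extra feature transform on top of a cached lifting


def total_minutes(num_variants: int, cached: bool) -> int:
    """Total time for one baseline run plus `num_variants` feature variants."""
    baseline = CLIQUE_MIN
    # Without caching, every variant recomputes the lifting before its feature.
    per_variant = FEATURE_MIN if cached else CLIQUE_MIN + FEATURE_MIN
    return baseline + num_variants * per_variant


without_cache = total_minutes(2, cached=False)  # 40 + 54 + 54
with_cache = total_minutes(2, cached=True)      # 40 + 14 + 14

print(f"Without caching: {without_cache} min")
print(f"With caching:    {with_cache} min")
print(f"Saved: {without_cache - with_cache} min "
      f"({without_cache / with_cache:.1f}x faster)")
```

The saving grows linearly with the number of variants, because each one avoids a full 40-minute lifting pass.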

In [None]:
import warnings
warnings.filterwarnings('ignore')

from omegaconf import OmegaConf
from topobench.data.datasets import SyntheticGraphDataset
from topobench.data.preprocessor import OnDiskInductivePreprocessor
import time

# Create dataset
dataset = SyntheticGraphDataset(num_samples=2000, num_nodes=20, num_features=16, seed=42)
print(f"Dataset ready: {len(dataset)} graphs")

## 2. DAG Caching in Action

### Experiment 1: Baseline Transform

In [None]:
# Baseline: SimplicialCliqueLifting (expensive transform)
config1 = OmegaConf.create({
    "clique_lifting": {
        "transform_type": "lifting",
        "transform_name": "SimplicialCliqueLifting",
        "complex_dim": 2
    }
})

print("[Experiment 1] Baseline transform")
start = time.time()

preprocessor1 = OnDiskInductivePreprocessor(
    dataset=dataset,
    data_dir="./data/dag_demo",
    transforms_config=config1,
    storage_backend="files",  # Fast for development
    num_workers=4
)

time1 = time.time() - start
print(f"Time: {time1:.1f}s")
print("Transform cached at: ./data/dag_demo/transform_chain/")

### Experiment 2: Add Feature Transform (Reuses Clique!)

In [None]:
# Add ProjectionSum (clique_lifting will be reused!)
config2 = OmegaConf.create({
    "clique_lifting": {  # Same config = cached!
        "transform_type": "lifting",
        "transform_name": "SimplicialCliqueLifting",
        "complex_dim": 2
    },
    "projection": {  # NEW transform
        "transform_type": "feature",
        "transform_name": "ProjectionSum"
    }
})

print("\n[Experiment 2] Add ProjectionSum")
start = time.time()

preprocessor2 = OnDiskInductivePreprocessor(
    dataset=dataset,
    data_dir="./data/dag_demo",  # Same directory!
    transforms_config=config2,
    storage_backend="files",
    num_workers=4
)

time2 = time.time() - start
speedup = time1 / time2 if time2 > 0 else 0

print(f"Time: {time2:.1f}s")
print(f"Speedup: {speedup:.1f}× (reused clique_lifting!)")
print("Look for message: 'Reusing 1 cached transform(s)!' above")

### How It Works

TopoBench creates a separate cache directory for each transform in the chain:

```
data_dir/transform_chain/
  DataTransform_0_abc123/  ← SimplicialCliqueLifting
  DataTransform_1_def456/  ← ProjectionSum
```
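To check which transforms are already cached, you can list the subdirectories of `transform_chain`. The sketch below relies only on the directory layout shown above; `list_cached_transforms` is a hypothetical helper, not part of TopoBench.

```python
from pathlib import Path


def list_cached_transforms(data_dir: str) -> list[str]:
    """Return the names of cache directories under <data_dir>/transform_chain."""
    chain_dir = Path(data_dir) / "transform_chain"
    if not chain_dir.is_dir():
        return []  # Nothing has been cached yet
    return sorted(p.name for p in chain_dir.iterdir() if p.is_dir())


# Directory used by the experiments in this section.
for name in list_cached_transforms("./data/dag_demo"):
    print(name)
```

If the loop prints nothing, the preprocessor has not yet written a cache to that `data_dir`.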

Cache key = **transform_id** (position) + **hash** (parameters)

- Same config → reuse cache ✅
- Changed params → new cache ✅
- Different position → new cache ✅
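The three rules above can be illustrated with a minimal sketch of a position-plus-hash key. This is an assumption about the general idea, not TopoBench's actual hashing code: the function `cache_dir_name`, the use of SHA-256 over sorted JSON, and the 8-character digest are all chosen here for illustration.

```python
import hashlib
import json


def cache_dir_name(position: int, params: dict) -> str:
    """Hypothetical key: chain position plus a short hash of the parameters."""
    # sort_keys makes the hash independent of dictionary insertion order.
    canonical = json.dumps(params, sort_keys=True)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:8]
    return f"DataTransform_{position}_{digest}"


lifting = {
    "transform_type": "lifting",
    "transform_name": "SimplicialCliqueLifting",
    "complex_dim": 2,
}

original = cache_dir_name(0, lifting)
same = cache_dir_name(0, dict(lifting))                   # same config
changed = cache_dir_name(0, {**lifting, "complex_dim": 3})  # changed params
moved = cache_dir_name(1, lifting)                        # different position

print("same config reuses key:    ", original == same)
print("changed params reuse key:  ", original == changed)
print("different position reuses: ", original == moved)
```

Including the position in the key means a transform is recomputed whenever anything upstream of it changes order, which is what keeps a cached result consistent with its inputs.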

---

## 3. Storage Backends: Files vs Mmap

TopoBench offers two storage backends:

### Files Backend (Development)
- ⚡ **Fast iteration** (3-4× with parallel workers)
- 📊 **Clear DAG benefits** (cache reuse is easy to see)
- 💾 **Larger disk usage** (~4-5× more than mmap)
- 🎯 **Use when:** Experimenting, iterating

### Mmap Backend (Production)  
- 💾 **Compressed** (4-5× smaller)
- 🚀 **Fast I/O** during training
- ⚠️ **Slower preprocessing** (compression overhead)
- 🎯 **Use when:** Final deployment, limited disk

### Side-by-Side Comparison

In [None]:
# Create small dataset for comparison
small_dataset = SyntheticGraphDataset(num_samples=500, num_nodes=20)
config = OmegaConf.create({"clique": {"transform_type": "lifting", "transform_name": "SimplicialCliqueLifting", "complex_dim": 2}})

# Files backend
print("Files backend (development):")
start = time.time()
preprocessor_files = OnDiskInductivePreprocessor(
    dataset=small_dataset,
    data_dir="./data/compare_files",
    transforms_config=config,
    storage_backend="files",
    num_workers=4
)
time_files = time.time() - start
print(f"  Time: {time_files:.1f}s")

In [None]:
# Mmap backend  
print("\nMmap backend (production):")
start = time.time()
preprocessor_mmap = OnDiskInductivePreprocessor(
    dataset=small_dataset,
    data_dir="./data/compare_mmap",
    transforms_config=config,
    storage_backend="mmap",
    compression="lz4",
    num_workers=1  # Use 1 worker with mmap
)
time_mmap = time.time() - start
print(f"  Time: {time_mmap:.1f}s")

In [None]:
# Check disk usage
import subprocess
try:
    files_size = subprocess.check_output(['du', '-sh', './data/compare_files']).decode().split()[0]
    mmap_size = subprocess.check_output(['du', '-sh', './data/compare_mmap']).decode().split()[0]
    print(f"\nDisk usage:")
    print(f"  Files: {files_size}")
    print(f"  Mmap:  {mmap_size} (compressed)")
except (subprocess.CalledProcessError, OSError):  # du missing or failed
    print("\n(Disk usage check requires Unix du command)")
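On systems without `du`, the same comparison can be done in pure Python. The sketch below sums file sizes with `pathlib`; the helpers `dir_size_bytes` and `human_readable` are hypothetical names. Note that it reports apparent file sizes, which can differ slightly from the block-based figure that `du` prints.

```python
from pathlib import Path


def dir_size_bytes(path: str) -> int:
    """Sum the sizes of all regular files below `path` (0 if it is missing)."""
    root = Path(path)
    if not root.exists():
        return 0
    return sum(f.stat().st_size for f in root.rglob("*") if f.is_file())


def human_readable(num_bytes: int) -> str:
    """Format a byte count with binary units (B, KiB, MiB, GiB)."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        # Stop at the first unit that fits, or at the largest unit available.
        if size < 1024 or unit == "GiB":
            return f"{size:.1f} {unit}"
        size /= 1024


# Directories written by the two comparison cells above.
for label, path in [("Files", "./data/compare_files"),
                    ("Mmap", "./data/compare_mmap")]:
    print(f"  {label}: {human_readable(dir_size_bytes(path))}")
```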

### Decision Guide

**Choose FILES when:**
- ✅ Iterating on transform combinations
- ✅ Prototyping pipelines
- ✅ Disk space is abundant
- ✅ Want fastest development cycle

**Choose MMAP when:**
- ✅ Finalizing pipeline for production
- ✅ Disk space is limited
- ✅ Training repeatedly on same data
- ✅ Want optimal storage + I/O

---

## 4. Parallel Processing

Enable multi-core processing for faster preprocessing:

### Performance with Different Worker Counts

In [None]:
# Test different worker counts
test_dataset = SyntheticGraphDataset(num_samples=1000, num_nodes=20)
config = OmegaConf.create({"clique": {"transform_type": "lifting", "transform_name": "SimplicialCliqueLifting", "complex_dim": 2}})

for workers in [1, 2, 4]:
    print(f"\nTesting {workers} worker(s):")
    start = time.time()
    
    preprocessor = OnDiskInductivePreprocessor(
        dataset=test_dataset,
        data_dir=f"./data/parallel_{workers}",
        transforms_config=config,
        storage_backend="files",
        num_workers=workers,
        force_reload=True  # Reprocess to measure time
    )
    
    elapsed = time.time() - start
    print(f"  Time: {elapsed:.1f}s")
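To turn the raw timings into speedups, store each `elapsed` value in a dictionary keyed by worker count and compare against the single-worker run. The sketch below shows the calculation; `speedups` is a hypothetical helper, and the numbers in `example_timings` are placeholders for illustration, not measurements.

```python
def speedups(timings: dict[int, float]) -> dict[int, float]:
    """Map worker count -> speedup relative to the single-worker run."""
    baseline = timings[1]
    return {workers: baseline / t for workers, t in sorted(timings.items())}


# PLACEHOLDER values for illustration only; replace with your own `elapsed`
# measurements, e.g. timings[workers] = elapsed inside the loop above.
example_timings = {1: 12.0, 2: 6.0, 4: 4.0}

for workers, speedup in speedups(example_timings).items():
    # Efficiency is the fraction of ideal (linear) scaling that was achieved.
    efficiency = speedup / workers
    print(f"{workers} worker(s): {speedup:.1f}x speedup, "
          f"{efficiency:.0%} parallel efficiency")
```

Efficiency well below 100% at higher worker counts usually points to a serial stage such as I/O or compression.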

### Best Practices

```python
# Development: Use parallel processing with files
OnDiskInductivePreprocessor(
    ...,
    storage_backend="files",
    num_workers=7  # Use N-1 cores (7 on an 8-core machine)
)

# Production: Use mmap with 1 worker
OnDiskInductivePreprocessor(
    ...,
    storage_backend="mmap",
    compression="lz4",
    num_workers=1  # Compression is sequential
)
```
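The two configurations above can be captured in a small helper so the worker count adapts to the machine instead of being hard-coded. This is a sketch: `preprocessor_kwargs` is a hypothetical function, and it only returns the keyword arguments already used in this tutorial.

```python
import os


def preprocessor_kwargs(phase: str) -> dict:
    """Return the backend settings this tutorial suggests for a given phase."""
    if phase == "development":
        cores = os.cpu_count() or 1  # cpu_count() may return None
        # Leave one core free for the rest of the system (N-1 rule).
        return {"storage_backend": "files", "num_workers": max(1, cores - 1)}
    if phase == "production":
        return {"storage_backend": "mmap", "compression": "lz4", "num_workers": 1}
    raise ValueError(f"Unknown phase: {phase!r}")


print(preprocessor_kwargs("development"))
print(preprocessor_kwargs("production"))
```

The result can be unpacked into the constructor alongside the usual arguments, for example `OnDiskInductivePreprocessor(dataset=dataset, data_dir="./data/workflow", transforms_config=config, **preprocessor_kwargs("development"))`.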

**Why 1 worker with mmap?** Compression creates a bottleneck that limits parallel speedup to ~2× instead of 3-4×.
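One way to reason about this bottleneck is Amdahl's law, which relates the speedup on `n` workers to the fraction of work that stays serial. The sketch below is a simplified model, assuming 4 workers as in the cells above; it is not a measurement of TopoBench, and the real bottleneck may behave differently.

```python
def amdahl_speedup(serial_fraction: float, workers: int) -> float:
    """Amdahl's law: speedup when only the non-serial part is parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)


def serial_fraction_for(speedup: float, workers: int) -> float:
    """Invert Amdahl's law: serial fraction implied by an observed speedup."""
    return (workers / speedup - 1.0) / (workers - 1.0)


WORKERS = 4  # assumed worker count, matching the examples in this tutorial

for observed in (2.0, 3.0, 4.0):
    fraction = serial_fraction_for(observed, WORKERS)
    print(f"{observed:.0f}x speedup on {WORKERS} workers implies "
          f"a serial fraction of about {fraction:.2f}")
```

Under this model, a 2× speedup on 4 workers corresponds to roughly one third of the work being serial, which is consistent with a sequential compression stage dominating the pipeline.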

---

## 5. Complete Workflow Example

### Realistic Research Scenario

Goal: find the best transform combination for your task.

In [None]:
# Phase 1: Development (files backend + parallel)
dataset = SyntheticGraphDataset(num_samples=1000, num_nodes=30)

# Baseline
config_baseline = OmegaConf.create({
    "lifting": {"transform_type": "lifting", "transform_name": "SimplicialCliqueLifting", "complex_dim": 2}
})

preprocessor_baseline = OnDiskInductivePreprocessor(
    dataset=dataset,
    data_dir="./data/workflow",
    transforms_config=config_baseline,
    storage_backend="files",
    num_workers=4
)
print("Baseline created")

In [None]:
# Try variant A (reuses lifting!)
config_A = OmegaConf.create({
    "lifting": {"transform_type": "lifting", "transform_name": "SimplicialCliqueLifting", "complex_dim": 2},
    "features": {"transform_type": "feature", "transform_name": "ProjectionSum"}
})

preprocessor_A = OnDiskInductivePreprocessor(
    dataset=dataset,
    data_dir="./data/workflow",  # Reuse cache!
    transforms_config=config_A,
    storage_backend="files",
    num_workers=4
)
print("Variant A created (reused lifting)")

In [None]:
# Phase 2: Production (convert best to mmap)
# Assuming variant A won, convert to compressed format
preprocessor_production = OnDiskInductivePreprocessor(
    dataset=dataset,
    data_dir="./data/production",
    transforms_config=config_A,  # Best config
    storage_backend="mmap",
    compression="lz4",
    num_workers=1
)
print("Production version created (compressed)")

---

## 6. Summary

### What You Learned

1. ✅ **DAG caching:** Automatically reuses transforms (2-3× faster iteration)
2. ✅ **Storage backends:** Files (fast dev) vs Mmap (small storage)
3. ✅ **Parallel processing:** 3-4× speedup with multiple workers
4. ✅ **Workflows:** Development → experimentation → production

### Key Takeaways

| Feature | Benefit | When to Use |
|---------|---------|-------------|
| **DAG Caching** | 2-3× faster | Always (automatic) |
| **Files Backend** | Fast iteration | Development |
| **Mmap Backend** | 4-5× smaller | Production |
| **Parallel (files)** | 3-4× speedup | Large datasets |

### Recommended Workflow

```python
# 1. Development: Fast iteration
dev = OnDiskInductivePreprocessor(
    ...,  # dataset, data_dir, transforms_config
    storage_backend="files",
    num_workers=7  # N-1 cores on an 8-core machine
)

# 2. Experiment: Try variations (DAG cache speeds this up!)
# ... iterate on transforms ...

# 3. Production: Convert final pipeline
prod = OnDiskInductivePreprocessor(
    ...,  # dataset, data_dir, transforms_config
    storage_backend="mmap",
    compression="lz4",
    num_workers=1
)
```

### Resources

- **`README_DAG_CACHING.md`**: Complete technical reference
- **`SPEED_VS_COMPRESSION_TRADEOFF.md`**: Detailed backend comparison
- **GitHub Issues**: Ask questions and report issues

---

**Congratulations!** You've mastered TopoBench's on-disk preprocessing.

You can now:
- Train on datasets beyond RAM ✅
- Iterate rapidly with DAG caching ✅
- Choose the right backend for each phase ✅
- Optimize preprocessing with parallel workers ✅

**Happy researching!** 🚀