# Distributed Computing

This tutorial covers running computations across multiple workers. You'll learn:

- **Jobs 2.0** — DataJoint's job coordination system
- **Multi-process** — Parallel workers on one machine
- **Multi-machine** — Cluster-scale computation
- **Error handling** — Recovery and monitoring

In [None]:
import datajoint as dj
import numpy as np
import time

schema = dj.Schema('tutorial_distributed')

# Clean up from previous runs, then recreate an empty schema
schema.drop(prompt=False)
schema = dj.Schema('tutorial_distributed')

## Setup

In [None]:
@schema
class Experiment(dj.Manual):
    definition = """
    exp_id : int
    ---
    n_samples : int
    """

@schema
class Analysis(dj.Computed):
    definition = """
    -> Experiment
    ---
    result : float64
    compute_time : float32
    """

    def make(self, key):
        start = time.time()
        n = (Experiment & key).fetch1('n_samples')
        result = float(np.mean(np.random.randn(n) ** 2))
        time.sleep(0.1)
        self.insert1({**key, 'result': result, 'compute_time': time.time() - start})

In [None]:
Experiment.insert([{'exp_id': i, 'n_samples': 10000} for i in range(20)])
print(f"To compute: {len(Analysis.key_source - Analysis)}")

## Direct vs Distributed Mode

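**Direct mode** (default): `populate()` runs without any coordination, which suits a single worker.

**Distributed mode** (`reserve_jobs=True`): workers coordinate through the jobs table, so each pending key is reserved by one worker before it is computed.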

In [None]:
# Distributed mode: reserve each job before computing it; stop after 5 make() calls
Analysis.populate(reserve_jobs=True, max_calls=5, display_progress=True)
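
The idea behind reservation can be sketched without a database. The following is a toy model only, not DataJoint's implementation: `ReservationTable` and `run_workers` are hypothetical names, threads stand in for workers, and an in-memory dictionary stands in for the jobs table. It shows that when every worker must claim a key before computing it, each key is handled exactly once.

```python
import threading


class ReservationTable:
    """Hypothetical in-memory stand-in for a shared jobs table."""

    def __init__(self):
        self._owners = {}
        self._lock = threading.Lock()

    def reserve(self, key, worker_id):
        """Claim `key` for `worker_id`; return False if it is already claimed."""
        with self._lock:
            if key in self._owners:
                return False
            self._owners[key] = worker_id
            return True


def run_workers(keys, n_workers=4):
    """Run n_workers threads that all try to claim every key."""
    table = ReservationTable()
    # One private result list per worker, so no list is shared between threads.
    claimed = {worker_id: [] for worker_id in range(n_workers)}

    def worker(worker_id):
        for key in keys:
            if table.reserve(key, worker_id):
                claimed[worker_id].append(key)  # the "computation" would go here

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return claimed


claimed = run_workers(range(20), n_workers=4)
all_claimed = sorted(k for ks in claimed.values() for k in ks)
print(all_claimed == list(range(20)))  # every key claimed once, by exactly one worker
```

How the keys are split between workers varies from run to run; only the "exactly once" property is guaranteed by the reservation step.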

## The Jobs Table

In [None]:
# Refresh job queue
result = Analysis.jobs.refresh()
print(f"Added: {result['added']}")

# Check status
for status, count in Analysis.jobs.progress().items():
    print(f"{status}: {count}")
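
As a rough mental model of the two calls above, the sketch below mimics a refresh-then-count cycle with plain Python containers. It is an illustration under assumptions, not the library's code: the functions `refresh` and `progress` here are local toy functions, and the status names `"pending"` and `"error"` are chosen for the example.

```python
from collections import Counter


def refresh(jobs, key_source, computed):
    """Queue every key that is neither computed nor already tracked."""
    new_keys = [k for k in key_source if k not in computed and k not in jobs]
    for k in new_keys:
        jobs[k] = "pending"
    return {"added": len(new_keys)}


def progress(jobs):
    """Count tracked jobs per status."""
    return dict(Counter(jobs.values()))


key_source = set(range(20))   # all keys that should eventually be computed
computed = {0, 1, 2, 3, 4}    # keys that already have results
jobs = {}                     # toy jobs table: key -> status

first = refresh(jobs, key_source, computed)    # queues the 15 missing keys
second = refresh(jobs, key_source, computed)   # nothing new to queue
jobs[5] = "error"                              # pretend one job failed

print(first, second, progress(jobs))
```

Calling the toy `refresh` twice in a row adds nothing the second time, which is the property that makes it safe to call before every status check.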

## Multi-Process and Multi-Machine

The `processes=N` parameter spawns multiple worker processes on one machine. However, this requires table classes to be defined in importable Python modules (not notebooks): `multiprocessing` relies on pickle, which passes a class to a worker process as a reference to its module and name, so each worker must be able to import the definition itself.
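
The pickling constraint can be seen with the standard library alone. This small sketch uses stand-in classes (`ModuleLevelTable` and `LocalTable` are hypothetical, not DataJoint tables) to show that pickle records only where a class lives, and fails when the class cannot be found by its module and name.

```python
import pickle


class ModuleLevelTable:
    """Stand-in for a table class defined at the top level of a module."""

    def make(self, key):
        return "marker-string-inside-method-body"


def make_local_class():
    """Return a class that cannot be looked up by module and name."""

    class LocalTable:
        pass

    return LocalTable


def can_pickle(obj):
    """Return True if pickle can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        return False


payload = pickle.dumps(ModuleLevelTable)

# The payload names the class but does not carry its method bodies.
print(b"ModuleLevelTable" in payload)
print(b"marker-string-inside-method-body" in payload)

# A class without an importable location cannot be pickled at all.
print(can_pickle(make_local_class()))
```

Because only the reference travels, the receiving process has to import the same module to rebuild the class, which is why table definitions belong in a file such as `pipeline.py`.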

For production use, define your tables in a module and run workers as scripts:

```python
# pipeline.py - Define your tables
import datajoint as dj
schema = dj.Schema('my_pipeline')

@schema
class Analysis(dj.Computed):
    definition = """..."""
    def make(self, key): ...
```

```python
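# worker.py - Run workers
from pipeline import Analysis

# Guard the entry point: multiprocessing may re-import this script in each worker
if __name__ == "__main__":
    # Option 1: single machine, 4 processes
    Analysis.populate(reserve_jobs=True, processes=4)

    # Option 2: run this loop on each machine; it exits once a pass computes nothing
    while True:
        result = Analysis.populate(reserve_jobs=True, max_calls=100, suppress_errors=True)
        if result['success_count'] == 0:
            break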
```
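
The polling loop in `worker.py` can be factored into a small helper and exercised without a database. This is a sketch under assumptions: `run_until_idle` and `make_fake_populate` are hypothetical helpers, and the fake returns only the `success_count` key that the loop above reads.

```python
def run_until_idle(populate_once, max_rounds=1000):
    """Call populate_once() until a round reports no successes.

    max_rounds is a safety bound so the loop always terminates.
    Returns the total number of successful jobs.
    """
    total = 0
    for _ in range(max_rounds):
        result = populate_once()
        if result["success_count"] == 0:
            break
        total += result["success_count"]
    return total


def make_fake_populate(n_jobs, batch_size):
    """Build a stand-in for populate() that completes up to batch_size jobs per call."""
    remaining = [n_jobs]

    def populate_once():
        done = min(batch_size, remaining[0])
        remaining[0] -= done
        return {"success_count": done}

    return populate_once


total = run_until_idle(make_fake_populate(n_jobs=250, batch_size=100))
print(total)
```

With a real table, the callable could be `lambda: Analysis.populate(reserve_jobs=True, max_calls=100, suppress_errors=True)`. Note that a round in which every job fails also reports zero successes and ends the loop.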

In this notebook, we'll demonstrate distributed coordination with a single process:

In [None]:
# Complete remaining jobs with distributed coordination
Analysis.populate(reserve_jobs=True, display_progress=True)
print(f"Computed: {len(Analysis())}")

## Error Handling

In [None]:
# View errors
print(f"Errors: {len(Analysis.jobs.errors)}")

# Retry failed jobs: clear their error entries, then populate again
Analysis.jobs.errors.delete()
Analysis.populate(reserve_jobs=True, suppress_errors=True)
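
The clear-and-retry pattern above can be illustrated with a toy model. Everything here is hypothetical (`populate_sim`, `flaky_make`, and the rule that keys with a recorded error are skipped until the record is cleared); it is meant to show the control flow, not DataJoint internals.

```python
def populate_sim(keys, make, done, errors, suppress_errors=False):
    """Toy populate: skip finished or errored keys, record failures if suppressed."""
    for key in keys:
        if key in done or key in errors:
            continue
        try:
            make(key)
        except Exception as exc:
            if not suppress_errors:
                raise
            errors[key] = f"{type(exc).__name__}: {exc}"
        else:
            done.add(key)


attempts = {}


def flaky_make(key):
    """Fail on the first attempt for every fifth key, succeed otherwise."""
    attempts[key] = attempts.get(key, 0) + 1
    if key % 5 == 0 and attempts[key] == 1:
        raise RuntimeError(f"transient failure for key {key}")


done, errors = set(), {}
keys = range(10)

populate_sim(keys, flaky_make, done, errors, suppress_errors=True)
failed_first_pass = sorted(errors)  # keys 0 and 5 failed once

errors.clear()  # plays the role of deleting the error entries
populate_sim(keys, flaky_make, done, errors, suppress_errors=True)

print(failed_first_pass, len(done), errors)
```

In this model a failed key stays blocked until its error record is removed, so clearing errors is the explicit signal that a retry is wanted.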

## Quick Reference

| Option | Description |
|--------|-------------|
| `reserve_jobs=True` | Coordinate workers through the jobs table |
| `processes=N` | Run N worker processes on one machine |
| `max_calls=N` | Limit `make()` calls per `populate()` run |
| `suppress_errors=True` | Continue with remaining jobs when one fails |

In [None]:
schema.drop(prompt=False)