Summary
When using PyTorch with multiprocessing in OpenEvolve (e.g. KernelBench's `eval_kernel_against_ref`), CUDA requires the `spawn` start method to avoid the well-known "Cannot re-initialize CUDA in forked subprocess" error (see the PyTorch docs).
However, after setting this in `cli.py`:

```python
# cli.py
import asyncio
import multiprocessing as mp

def main() -> int:
    """
    Main entry point

    Returns:
        Exit code
    """
    # BUGFIX: allow CUDA in multiprocessing workers
    mp.set_start_method("spawn", force=True)
    return asyncio.run(main_async())
```

I observed that worker processes do not release CUDA resources after each `evaluate_program` call. As a result, GPU memory grows continuously and is never freed until the entire evolution run stops.
This is consistent with PyTorch's documentation and with general Python multiprocessing behavior: a CUDA context lives for the lifetime of the worker process, so it is not released until the worker exits.
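This worker-lifetime behavior can be demonstrated without a GPU. In the sketch below (my own minimal example, not OpenEvolve code), a default `ProcessPoolExecutor` reuses the same worker process for every task, which is exactly why per-process state such as a CUDA context survives across evaluations. The `fork` context is used only so the snippet runs at module level; worker lifetime behaves the same under `spawn`.

```python
import os
import multiprocessing
from concurrent.futures import ProcessPoolExecutor

def worker_pid(_):
    # Report which OS process executed this task
    return os.getpid()

# "fork" keeps this snippet runnable at top level; OpenEvolve needs "spawn"
# for CUDA, but worker longevity is the same under either start method.
ctx = multiprocessing.get_context("fork")

with ProcessPoolExecutor(max_workers=1, mp_context=ctx) as pool:
    pids = list(pool.map(worker_pid, range(3)))

# All three tasks ran in the same long-lived worker process,
# so any per-process state (e.g. a CUDA context) persisted across them.
assert len(set(pids)) == 1
```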
What I tried
To force worker processes to restart after each task (so that CUDA contexts are destroyed), I modified the executor initialization in `process_parallel.py`:

```python
# process_parallel.py
def start(self) -> None:
    self.executor = ProcessPoolExecutor(
        max_workers=self.num_workers,
        initializer=_worker_init,
        initargs=(config_dict, self.evaluation_file, current_env),
        # BUGFIX: recycle each worker after a single task so its CUDA
        # context is destroyed on exit (requires Python 3.11+)
        max_tasks_per_child=1,
    )
```

This ensures that each evaluation runs in a fresh process, so GPU memory is properly reclaimed.
Question
Before submitting a PR, I would like to confirm:
Does adding `max_tasks_per_child=1` introduce any risks for OpenEvolve's generality or execution model?
Specifically:
- Are there parts of OpenEvolve that depend on long-lived worker processes?
- Is there any state that workers are expected to preserve across tasks?
- Would forcing a full process respawn per task degrade performance in some scenarios (e.g., expensive model initialization)?
- Would a user-settable configuration be preferable?
Why this matters
Without recycling workers after each task, CUDA contexts accumulate indefinitely, especially with inline kernels or any operation that initializes PyTorch's CUDA state. This leads to GPU memory leaks and stale CUDA contexts that are never reset.
If this aligns with the project’s architecture, I am happy to open a PR to add this behavior (or make it configurable).