# Designing complex, reusable and scalable scientific workflows with Pydra

> Ghislain VAILLANT, Inria

## Motivation

> Scientific workflows often require sophisticated analyses that encompass a large **collection of algorithms**. These algorithms are not necessarily designed to **work together** and are written by **different authors**.
>
> Some may be written in Python, while others might require calling **external programs**. It is common practice to create semi-manual workflows that require the scientists to **handle the files** and **interact with partial results** from algorithms and external tools.
>
> This approach is conceptually simple and easy to implement, but the resulting workflow is often **time consuming**, **error-prone** and **difficult to share** with others.
> 
> -- <cite>[Pydra's Documentation](https://nipype.github.io/pydra/)</cite>

## Roadmap

- Prerequisites
- Core components
- Advanced features
- Case study
- Support channels

## Prerequisites

- Python 3.8+
- Type annotations
- Data classes

### Installation

To install the core package:

```shell
$ pip install pydra==0.22
```

To install Pydra task packages, for instance ANTs:

```shell
$ pip install pydra-ants
```

### Type annotations

- Proposed in [PEP 484](https://peps.python.org/pep-0484/)
- Available since Python 3.5
- Supported by the language syntax and the [typing](https://docs.python.org/3/library/typing.html) module
- Enhanced by subsequent Python releases

Standard function definition.

In [None]:
def scale(factor, vector):
    return [factor * x for x in vector]

Definition with type annotations.

In [None]:
from typing import List

# Type alias for convenience.
Vector = List[float]

def scale(factor: float, vector: Vector) -> Vector:
    return [factor * x for x in vector]
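
Annotations are stored on the function and can be read back by tools, but Python does not enforce them at runtime. The following is a minimal sketch of both points using only the standard `typing` module. It redefines `scale` so that it is self-contained, and it does not describe how Pydra itself consumes annotations.

```python
from typing import List, get_type_hints

Vector = List[float]

def scale(factor: float, vector: Vector) -> Vector:
    return [factor * x for x in vector]

# Annotations are kept as metadata that tools can inspect.
hints = get_type_hints(scale)
print(hints["factor"])  # <class 'float'>

# They are not checked at runtime: an int factor is accepted as is.
print(scale(2, [1.0, 2.5]))  # [2.0, 5.0]
```

Static checkers such as mypy would flag nothing here either, since `int` is accepted where `float` is expected.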

### Data classes

- Proposed in [PEP 557](https://peps.python.org/pep-0557/)
- Available since Python 3.7
- Implemented in [dataclasses](https://docs.python.org/3/library/dataclasses.html) module
- Enhanced by third-party libraries such as [attrs](https://www.attrs.org/)

Simple record definition.

In [None]:
import attrs

@attrs.define
class GeoPoint:
    lat: float
    lon: float

In [None]:
swansea = GeoPoint(51.62, -3.94)

print(swansea)

Record with custom fields.

In [None]:
from attrs import define, field, validators

# attrs validators receive the instance, the attribute and the new value.
def validate_lat(instance, attribute, value):
    if abs(value) > 90:
        raise ValueError(
            f"Latitude must be in range [-90, 90], got {value}.")

def validate_lon(instance, attribute, value):
    if abs(value) > 180:
        raise ValueError(
            f"Longitude must be in range [-180, 180], got {value}.")

@define(kw_only=True)   # Forbid positional arguments in __init__.
class CustomGeoPoint:
    lat: float = field(
        validator=[validators.instance_of(float), validate_lat])

    lon: float = field(
        validator=[validators.instance_of(float), validate_lon])

    alt: float = field(
        default=0.0, metadata={"recorded_by": "$DEVICE"})

In [None]:
swansea = CustomGeoPoint(lat=51.62, lon=-3.94)  # Okay!

print(swansea)

In [None]:
%xmode Minimal

In [None]:
swansea = CustomGeoPoint(151.62, -3.94)             # Oops! TypeError: positional arguments are forbidden.

In [None]:
swansea = CustomGeoPoint(lat=151.62, lon=-3.94)     # Oops! ValueError: latitude is out of range.
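
For comparison, a similar record can be sketched with the standard library `dataclasses` module alone. The class name `StdGeoPoint` is hypothetical and chosen only to avoid clashing with the attrs classes above. Validation is moved to `__post_init__` because `dataclasses` has no per-field validator hook, and keyword-only initialization is left out because `dataclasses` only supports it from Python 3.10.

```python
from dataclasses import dataclass, field, fields

@dataclass
class StdGeoPoint:
    lat: float
    lon: float
    # Free-form metadata can be attached to a field, as with attrs.
    alt: float = field(default=0.0, metadata={"recorded_by": "$DEVICE"})

    def __post_init__(self):
        # Runs right after the generated __init__ has set the fields.
        if abs(self.lat) > 90:
            raise ValueError(
                f"Latitude must be in range [-90, 90], got {self.lat}.")
        if abs(self.lon) > 180:
            raise ValueError(
                f"Longitude must be in range [-180, 180], got {self.lon}.")

swansea = StdGeoPoint(lat=51.62, lon=-3.94)
print(swansea)  # StdGeoPoint(lat=51.62, lon=-3.94, alt=0.0)

# Field metadata is read back through dataclasses.fields().
alt_field = next(f for f in fields(StdGeoPoint) if f.name == "alt")
print(alt_field.metadata["recorded_by"])  # $DEVICE
```

The attrs version stays more compact once several fields need validators or converters, which is one reason to prefer it for larger specifications.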

## Core components

Tasks, workflows and shell specifications.

### Python tasks

Defining a function task.

In [None]:
from pathlib import Path
from pydra.mark import task

# Define a Python task.
@task
def cwd() -> Path:
    return Path.cwd()

Running a task.

In [None]:
# Instantiate a task. The variable is not named `task`,
# which would shadow the decorator imported above.
cwd_task = cwd()

# Run and get the results.
result = cwd_task()

print(result.output.out)

### Shell tasks

Defining a shell command task.

In [None]:
from pydra.engine.task import ShellCommandTask

# Define a shell task.
class Pwd(ShellCommandTask):
    executable = "pwd"

# Instantiate a task.
pwd_task = Pwd()

# Run and get the results.
result = pwd_task()

print(result.output.stdout)
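
Conceptually, a shell task wraps an external command and captures its output. As a point of comparison, the same `pwd` call can be made by hand with the standard `subprocess` module. This is only an illustration of the underlying idea, not a description of how Pydra runs commands internally, and it assumes a POSIX system where `pwd` is available on the `PATH`.

```python
import subprocess

# Run the external program and capture its standard output as text.
# check=True raises CalledProcessError on a non-zero exit status.
completed = subprocess.run(
    ["pwd"], capture_output=True, text=True, check=True)

print(completed.returncode)
print(completed.stdout.strip())
```

Doing this by hand for every tool means repeating the argument handling and output collection each time, which is the bookkeeping a task abstraction is meant to take over.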

Defining input specifications.

In [None]:
from attrs import define, field
from pydra.engine.specs import ShellSpec, SpecInfo

# Define the input specification.
@define(kw_only=True)
class InputSpec(ShellSpec):
    level: int = field(
        metadata={"help_string": "max level", "argstr": "-L"}
    )
    path: str = field(
        metadata={
            "help_string": "input path",
            "mandatory": True,
            "argstr": "",
        }
    )

# Define the shell task.
class Tree(ShellCommandTask):
    executable = "tree"

    # Associate the specifications with the task definition.
    input_spec = SpecInfo(name="Inputs", bases=(InputSpec,))


Testing input specifications.

In [None]:
from pathlib import Path

# Instantiate a task. The path is passed as a string
# to match the type declared in the specification.
tree_task = Tree(path=str(Path.cwd()), level=1)

# Check the shell command.
print(tree_task.cmdline)

Output specifications.

### Workflows

Composing tasks in a workflow.

Submitting a workflow for execution.

### Shell specifications

Mutually exclusive parameters with `xor`.

Dependent parameters with `requires`.

Custom formatting with `formatter`.

## Complex workflows

Container tasks, map-reduce semantics and nested workflows.
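
The map-reduce idea can be sketched with the standard library alone: the same function is applied independently to each input (map), then the partial results are combined into one value (reduce). The names `square`, `inputs` and `total` are hypothetical, and the code is a conceptual illustration rather than Pydra's API.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from operator import add

def square(x: int) -> int:
    return x * x

inputs = [1, 2, 3, 4]

# Map: apply the same function to each input independently.
# executor.map returns the results in input order.
with ThreadPoolExecutor() as executor:
    squares = list(executor.map(square, inputs))

# Reduce: combine the partial results into a single value.
total = reduce(add, squares)

print(squares)  # [1, 4, 9, 16]
print(total)    # 30
```

Because each mapped call is independent, the map step can be run concurrently without changing the result.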

## Advanced features

Workflow submission and customization options.

Submission strategies.

Global caching.
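
Result caching is commonly built by keying stored outputs on a hash of the inputs, so that a repeated call with identical inputs reads the stored result instead of recomputing it. The following is a conceptual sketch under that assumption, using only the standard library. The function `cached_scale`, the `calls` list and the temporary cache directory are hypothetical, and none of it reflects Pydra's actual implementation.

```python
import hashlib
import json
import tempfile
from pathlib import Path

# Hypothetical cache location, kept in a temporary directory.
cache_dir = Path(tempfile.mkdtemp())
calls = []  # Records the keys that were actually computed.

def cached_scale(factor: float, vector: list) -> list:
    # Derive a stable cache key from the inputs.
    payload = json.dumps({"factor": factor, "vector": vector}, sort_keys=True)
    key = hashlib.sha256(payload.encode()).hexdigest()
    cache_file = cache_dir / f"{key}.json"

    # Cache hit: reuse the stored result.
    if cache_file.exists():
        return json.loads(cache_file.read_text())

    # Cache miss: compute, then store the result for later calls.
    calls.append(key)
    result = [factor * x for x in vector]
    cache_file.write_text(json.dumps(result))
    return result

first = cached_scale(2.0, [1.0, 2.5])
second = cached_scale(2.0, [1.0, 2.5])

print(first == second, len(calls))  # True 1
```

Pointing several runs at the same directory is what makes such a cache shared, at the cost of having to clear it when the computation itself changes.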

## Case study

A realistic example from the neuroimaging community.

## Support channels

- Documentation: `https://nipype.github.io/pydra`
- Issues: `https://github.com/nipype/pydra/issues`
- Discussions: `https://github.com/nipype/pydra/discussions`
- Live chat: `https://mattermost.brainhack.org/brainhack/channels/nipype`
- Cohacking: `https://meet.jit.si/pydra`

### Tutorial

![The Pydra tutorial homepage](./assets/pydra-tutorial.png)

> https://nipype.github.io/pydra-tutorial

### Q&A

![The NeuroStars homepage](./assets/neurostars-homepage.png)

> https://neurostars.org

### Pydra task packages

![Pydra task packages on PyPI](./assets/pypi-packages.png)

> https://pypi.org/search/?q=pydra

### Summary