In [None]:
%matplotlib widget

# Introduction to Sciline

<h4><i>Data processing workflow management tool.</i></h4>

<h3><a href="https://scipp.github.io/sciline/">scipp.github.io/sciline/</a></h3>

<br>

Sciline is an open-source library developed by ESS for managing and visualizing data processing workflows (sometimes called "pipelines").

It defines workflows as directed acyclic graphs (DAGs) where the nodes are inputs, intermediate results, or final results, and edges are dependencies between the nodes.

This has some benefits:

- Any (named) intermediate result in the pipeline can be computed.
- The dependencies between intermediate results can be visualized.
- Implementations of intermediate results can be replaced.
- Results that are expensive to compute can be cached.
- We can be certain that the computed result has not been corrupted by running Jupyter cells out of order.

<br><br>

## Terminology

A Sciline **workflow** (or "pipeline") is defined by a set of transformations, called **providers**. Each provider specifies which inputs are needed to compute one specific output quantity.

The input and output quantities of the providers are called **domain types**.

A provider is a Python function with type annotations:

```python
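# Example provider
import os
from typing import NewType

import scipp as sc
import scippnexus as snx

# Domain types; the underlying types chosen here are illustrative assumptions.
RunNumber = NewType("RunNumber", int)
ProposalNumber = NewType("ProposalNumber", int)
DataPath = NewType("DataPath", str)
LoadedNexusData = NewType("LoadedNexusData", sc.DataGroup)


def load_run(
    run_number: RunNumber,
    proposal_number: ProposalNumber,
    data_dir_path: DataPath,
) -> LoadedNexusData:
    filename = f'{proposal_number}_{run_number:06d}.hdf'
    path = os.path.join(data_dir_path, filename)

    with snx.File(path) as f:
        # Indexing with an empty tuple loads the entire file content.
        data = f[()]

    return data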
```

In the example above, computing the `LoadedNexusData` quantity requires the `RunNumber`, `ProposalNumber`, and `DataPath` quantities.

`LoadedNexusData`, `RunNumber`, `ProposalNumber` and `DataPath` are "domain types".

The domain types are the nodes in the workflow graph, and they represent the inputs, intermediate results, or final results of the workflow.
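
As a minimal sketch of what a domain type looks like in practice: `typing.NewType` gives an existing type a distinct name, and the annotations of a provider can be inspected with the standard library. The names `Temperature`, `Pressure`, `Density` and `compute_density` are hypothetical and serve only as illustration.

```python
from typing import NewType, get_type_hints

# Hypothetical domain types: each one is a distinct name wrapping a plain float.
Temperature = NewType("Temperature", float)
Pressure = NewType("Pressure", float)
Density = NewType("Density", float)


def compute_density(pressure: Pressure, temperature: Temperature) -> Density:
    """Hypothetical provider: ideal-gas density of dry air in kg/m^3."""
    specific_gas_constant = 287.05  # J/(kg K), dry air
    return Density(pressure / (specific_gas_constant * temperature))


# The annotations declare what the provider needs and what it produces.
hints = get_type_hints(compute_density)
output_type = hints.pop("return")
input_types = list(hints.values())

print(output_type.__name__)               # Density
print([t.__name__ for t in input_types])  # ['Pressure', 'Temperature']

# At runtime, a NewType value is just an instance of the underlying type.
print(type(Temperature(293.15)))          # <class 'float'>
```

The distinct names are what matter: two quantities may both be floats, but as domain types they are different nodes in the graph.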

<br>

### Inputs, intermediate results, and final results

A typical data reduction workflow has some "**Inputs**", some "**Intermediate results**" and some "**Final results**".

Sciline does not distinguish between those, but it is useful to make a loose distinction:

**Inputs** are typically:

- The name of one or more NeXus files.
- Parameters defining a region of interest (ROI),
  - for example the wavelength range.
- The number of histogram bins.

**Intermediate results** are typically:

- List of events with associated coordinates, masks, and weights.
  - "Coordinates" such as wavelength, scattering angle, etc.
- Monitor wavelength histogram.
- Various calibration factors that are computed or loaded from file.

**Final results** are typically:

- Curve describing the scattering cross section $S(Q)$ as a function of momentum transfer (SANS).
- List of peaks and associated intensities (diffraction).
- Etc.

![Sciline graph example](sciline-graph-example.svg "Illustration of a Sciline workflow graph")

### How does Sciline know how to build the graph?

Each domain type is **unique**. Given a list of all functions to be used in the workflow,
Sciline connects the nodes in the graph by matching the return type of each provider to the parameter types of the other providers.

Think of it like puzzle pieces that fit into each other:

<img src="sciline-puzzle-1.svg" width="600">

<br><br><br>

<img src="sciline-puzzle-2.svg" width="600">
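
The puzzle-piece idea can be sketched in a few lines of plain Python. This is a toy resolver written for illustration under simplifying assumptions (one provider per return type, no generics, no caching); it is not how Sciline is implemented. All names in it (`SourceName`, `RawText`, `WordCount`, `build_lookup`, `compute`) are hypothetical.

```python
from typing import NewType, get_type_hints

# Hypothetical domain types for illustration.
SourceName = NewType("SourceName", str)
RawText = NewType("RawText", str)
WordCount = NewType("WordCount", int)


def load(source: SourceName) -> RawText:
    """Pretend to read a file; returns fixed text so the sketch is self-contained."""
    return RawText(f"contents of {source}")


def count_words(text: RawText) -> WordCount:
    return WordCount(len(text.split()))


def build_lookup(providers):
    """Map each return type to the provider that produces it."""
    return {get_type_hints(p)["return"]: p for p in providers}


def compute(target, lookup, params):
    """Resolve `target` by recursively resolving the inputs of its provider."""
    if target in params:
        return params[target]
    if target not in lookup:
        raise KeyError(f"No provider or parameter for {target.__name__}")
    provider = lookup[target]
    hints = get_type_hints(provider)
    hints.pop("return")
    kwargs = {name: compute(tp, lookup, params) for name, tp in hints.items()}
    return provider(**kwargs)


lookup = build_lookup([load, count_words])
result = compute(WordCount, lookup, {SourceName: "run_1.txt"})
print(result)  # 3
```

Passing a value for an intermediate type, for example `{RawText: "a b"}`, means `load` is never called, because the resolver finds the value before it looks for a provider.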

## Example

### Creating a pipeline

In [None]:
import os
from typing import NewType

import sciline as sl
import scipp as sc

from scippneutron.conversion.graph.beamline import beamline
from scippneutron.conversion.graph.tof import elastic

import sans_utils as utils


# Start by defining the domain types
# - quantities representing input parameters, intermediate results and the final results of the pipeline.

Foldername = NewType("Foldername", str)
"""Folder name for measurements."""

RawData = NewType("RawData", sc.DataArray)
"""Raw loaded data."""

CoordTransformGraph = NewType("CoordTransformGraph", dict)
"""Graph describing coordinate transformations."""

WavelengthData = NewType("WavelengthData", sc.DataArray)
"""Data with wavelength coordinate."""

QData = NewType("QData", sc.DataArray)
"""Data with Q coordinate."""

QBins = NewType("QBins", sc.Variable)
"""Bin edges in Q."""

QHistogram = NewType("QHistogram", sc.DataArray)
"""Data histogrammed in Q bins."""


def load(folder: Foldername) -> RawData:
    """Load raw data from file"""
    return utils.load_sans(folder)


def to_wavelength(
    data: RawData, graph: CoordTransformGraph
) -> WavelengthData:
    """Compute wavelength for events"""
    return data.transform_coords("wavelength", graph=graph)


def to_Q(data: WavelengthData, graph: CoordTransformGraph) -> QData:
    """Compute Q for events"""
    return data.transform_coords("Q", graph=graph)


def to_histogram(events: QData, qbins: QBins) -> QHistogram:
    """Histogram data in Q bins"""
    return events.hist(Q=qbins)


graph = {**beamline(scatter=True), **elastic("tof")}

workflow = sl.Pipeline(
    # List the providers that make up the workflow.
    (load, to_wavelength, to_Q, to_histogram),
    # Optionally, assign values to domain types.
    params={CoordTransformGraph: graph}
)

workflow.visualize(graph_attr={'rankdir': 'LR'})

Some domain types are drawn in red with a dashed border. Those are the domain types that **lack a definition**: no provider returns them and no value has been set for them.
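
To make "lacks a definition" concrete, the following sketch lists the input types that no provider returns and no parameter sets. It uses only the standard library and hypothetical names (`Radius`, `Height`, `BaseArea`, `Volume`, `missing_definitions`). It illustrates the idea and is not Sciline's own implementation.

```python
from typing import NewType, get_type_hints

# Hypothetical domain types and providers for illustration.
Radius = NewType("Radius", float)
Height = NewType("Height", float)
BaseArea = NewType("BaseArea", float)
Volume = NewType("Volume", float)


def base_area(radius: Radius) -> BaseArea:
    return BaseArea(3.141592653589793 * radius**2)


def volume(area: BaseArea, height: Height) -> Volume:
    return Volume(area * height)


def missing_definitions(providers, params):
    """Names of input types that no provider returns and no parameter sets."""
    produced = {get_type_hints(p)["return"] for p in providers}
    required = set()
    for p in providers:
        hints = get_type_hints(p)
        hints.pop("return")
        required.update(hints.values())
    missing = required - produced - set(params)
    return sorted(t.__name__ for t in missing)


print(missing_definitions([base_area, volume], params={}))
# ['Height', 'Radius']
print(missing_definitions([base_area, volume], params={Radius: 1.0, Height: 2.0}))
# []
```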

In [None]:
workflow

### Setting parameters

In [None]:
workflow[QBins] = sc.linspace("Q", 5.0e-3, 0.19, 201, unit="1/angstrom")
workflow[Foldername] = utils.fetch_data("3-mcstas/SANS_with_sample_many_neutrons")

In [None]:
workflow

In [None]:
workflow.visualize(graph_attr={'rankdir': 'LR'})

### Computing quantities

In [None]:
q_hist = workflow.compute(QHistogram)
q_hist

In [None]:
wavelength_data = workflow.compute(WavelengthData)
wavelength_data

In [None]:
two_results = workflow.compute((WavelengthData, QHistogram))
two_results

In [None]:
two_results[QHistogram]

### Replacing an intermediate result

In [None]:
wavelength_data = wavelength_data.assign_masks(
    wavelength_too_high = wavelength_data.coords['wavelength'] >= sc.scalar(6.5, unit='angstrom')
)
workflow[WavelengthData] = wavelength_data

workflow.visualize(graph_attr={'rankdir': 'LR'})

## Common errors

In [None]:
workflow = sl.Pipeline(
    # Missing to_Q provider!
    (load, to_wavelength, to_histogram),
)
# Graph is disconnected
workflow.visualize()

In [None]:
# Uncomment the next line to see the exception
#workflow.compute(QHistogram)

In [None]:
def bad_to_histogram(events: QData, qbins: QBins) -> QHistogram:
    """Histogram data in Q bins"""
    # Wrong coordinate name!
    return events.hist(q=qbins)

workflow = sl.Pipeline(
    (load, to_wavelength, to_Q, bad_to_histogram,),
)
workflow[Foldername] = utils.fetch_data("3-mcstas/SANS_with_sample_many_neutrons")
workflow[CoordTransformGraph] = graph
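# QBins is declared as a sc.Variable, so assign bin edges of that type.
workflow[QBins] = sc.linspace("Q", 5.0e-3, 0.19, 201, unit="1/angstrom")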

# Uncomment the next line to see the exception
#workflow.compute(QHistogram)

## Generic domain types

Sometimes we want to replicate parts of a workflow and apply it to a different input.

A typical case is when we have a sample measurement and want to correct it by a background measurement.

In that case many of the processing steps are identical, but ultimately we want to subtract the background measurement from the sample measurement.

Generic domain types let us define domain types that represent "Y of the X" such as:
- `Filename[Background]`: Filename of the background run.
- `QHistogram[Sample]`: The Q-histogram of the sample run.
- `QHistogram[Background]`: The Q-histogram of the background run.
- Etc.


### Example: Generic domain types

In [None]:
from typing import NewType, TypeVar
import sciline

_fake_filesystem = {
    'file102.txt': [1, 2, float('nan'), 3],
    'file103.txt': [1, 2, 3, 4],
    'file104.txt': [1, 2, 3, 4, 5],
    'file105.txt': [1, 2, 3],
    'background.txt': [0.1, 0.1],
}

# Define concrete RunType values we will use.
Sample = NewType('Sample', int)
Background = NewType('Background', int)

# Define generic domain types
RunType = TypeVar('RunType', Sample, Background)

# sciline.Scope makes Filename a "generic" domain type that depends on RunType.
class Filename(sciline.Scope[RunType, str], str): ...


class RawData(sciline.Scope[RunType, dict], dict): ...


class CleanedData(sciline.Scope[RunType, list], list): ...


# Define normal domain types
ScaleFactor = NewType('ScaleFactor', float)
BackgroundSubtractedData = NewType('BackgroundSubtractedData', list)
Result = NewType('Result', float)


def load(filename: Filename[RunType]) -> RawData[RunType]:
    """Load the data from the filename."""
    data = _fake_filesystem[filename]
    return {'data': data, 'meta': {'filename': filename}}


def clean(raw_data: RawData[RunType]) -> CleanedData[RunType]:
    """Clean the data, removing NaNs."""
    import math
    return [x for x in raw_data['data'] if not math.isnan(x)]


def subtract_background(
    data: CleanedData[Sample], background: CleanedData[Background]
) -> BackgroundSubtractedData:
    return [x - sum(background) for x in data]


def process(data: BackgroundSubtractedData, param: ScaleFactor) -> Result:
    """Process the data, multiplying the sum by the scale factor."""
    return sum(data) * param


providers = [load, clean, process, subtract_background]
workflow = sciline.Pipeline(providers)
workflow.visualize(Result)

The `load` and `clean` providers are re-used for both the sample run and the background run.

This is very common in practice.

In [None]:
workflow[ScaleFactor] = 2.0
workflow[Filename[Sample]] = 'file102.txt'
workflow[Filename[Background]] = 'background.txt'
workflow

In [None]:
workflow.compute(CleanedData[Sample])

In [None]:
workflow.compute(CleanedData[Background])

In [None]:
workflow.compute(Result)

## How will Sciline be used at ESS?

Most instruments will have **one or more** associated Sciline workflows.

The workflows will be the basic interface to the data reduction software.

On top of that interface we can build simpler but less flexible interfaces.

- But that will take time.
- In the early days after HC the interface to the data reduction will be mainly in the form of Sciline workflows.

### What do I need to know?

1. How to figure out **what quantity to compute** with the workflow.
   - Look at the workflow graph and read the documentation page of the technique package.
2. How to figure out **what parameters are needed** to compute the target quantity.
   - Error messages tell you what is missing, or you can look at the workflow graph.
3. **How to set parameters** on the workflow.
4. **How to compute** the desired quantity.
5. How to read and **understand common error messages**.


## There will be Sciline exercises in the training session later!