# Backend API Demo

This notebook walks through the core features of StreamWeave against the local dev stack:

- Prefect pipeline orchestration
- rclone data transfers from simulated CIFS instrument shares
- Pre- and post-transfer hook system (file filtering, metadata enrichment)
- Fine-grained file access control (users, groups, projects)

**Prerequisites:** the dev stack must be running at `https://streamweave.local`.
See [Local Development](development.md) for setup instructions.

This page is a static rendering of a Jupyter Notebook, which you can <a href="./backend-demo.ipynb" download>&#x2913; download </a> to run locally.

## Prerequisites

- Docker and Docker Compose installed
- `uv` installed for Python package management
- The repo cloned and the dev stack already running (see [Local Development](development.md))

## Initial setup

Before running the notebook, start Jupyter from the repo root:

```bash
cd backend
uv sync
uv run jupyter lab ../docs/backend-demo.ipynb
```

You will also need to bring up the development Docker stack and complete at least the first few steps of the [local dev deployment](development.md) setup:

> First, redirect the `streamweave.local` DNS name to your local machine by adding the following to `/etc/hosts` (macOS/Linux) or `C:\Windows\System32\drivers\etc\hosts` (Windows):
> 
> ```
> 127.0.0.1 streamweave.local
> ```


> Then, from the repository root, run the following to bring up the development stack:
> 
> ```bash
> docker compose -f docker-compose.yml -f docker-compose.dev.yml up
> ```



The following cell defines the HTTP clients and helper functions that are used throughout the notebook:

In [41]:
import httpx
import json
import os
import subprocess
import threading
import time
import warnings
from pathlib import Path

# Find the repo root regardless of where Jupyter was launched from
def _find_repo_root():
    p = Path.cwd()
    while p != p.parent:
        if (p / "docker-compose.yml").exists():
            return p
        p = p.parent
    raise RuntimeError("Could not find repo root (no docker-compose.yml found)")

REPO_ROOT = _find_repo_root()
DEV_COMPOSE = f"-f {REPO_ROOT}/docker-compose.yml -f {REPO_ROOT}/docker-compose.dev.yml"

# Dev stack credentials (set in docker-compose.dev.yml)
ADMIN_EMAIL = "admin@example.com"
ADMIN_PASSWORD = "adminpassword"

BASE_URL = "https://streamweave.local"
PREFECT_API_URL = "https://streamweave.local/prefect/api"

_limits = httpx.Limits(max_connections=10, max_keepalive_connections=5, keepalive_expiry=300)
client = httpx.Client(base_url=BASE_URL, timeout=30, verify=str(REPO_ROOT / "caddy/certs/ca.crt"), limits=_limits)
prefect = httpx.Client(base_url=PREFECT_API_URL, timeout=30, verify=str(REPO_ROOT / "caddy/certs/ca.crt"), limits=_limits)


def pp(resp, n: int | None = None):
    """Pretty-print a JSON response. Prints first `n` items if given."""
    try:
        data = resp.json()
        if n is not None and isinstance(data, list) and len(data) > n:
            data = [*data[:n], "..."]
        print(json.dumps(data, indent=2))
    except Exception:
        print(f"HTTP {resp.status_code}: {resp.text}")

def pp_dict(data, n: int | None = None):
    """Pretty-print a dictionary or list. Prints first `n` items if given."""
    if n is not None and isinstance(data, list) and len(data) > n:
        data = [*data[:n], "..."]
    print(json.dumps(data, indent=2))


def run(cmd, **kwargs):
    """Run a shell command, streaming stdout normally and stderr in yellow."""
    YELLOW = "\033[33m"
    RESET = "\033[0m"
    with subprocess.Popen(
        cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, **kwargs
    ) as proc:
        def _stream_stderr():
            for line in proc.stderr:
                print(f"{YELLOW}{line}{RESET}", end="", flush=True)
        t = threading.Thread(target=_stream_stderr)
        t.start()
        for line in proc.stdout:
            print(line, end="", flush=True)
        t.join()
    if proc.returncode != 0:
        warnings.warn(f"Command exited with code {proc.returncode}: {cmd}")
    return proc


def wait_for_flow_run(flow_run_id: str, timeout: int = 120) -> str:
    """Wait for a Prefect flow run to complete. Returns the final state type."""
    terminal_states = ("COMPLETED", "FAILED", "CANCELLED", "CRASHED")
    print(f"Waiting for flow run {flow_run_id} to complete...")
    for attempt in range(timeout):
        flow_run = prefect.get(f"/flow_runs/{flow_run_id}").json()
        state = flow_run.get("state", {}).get("type", "UNKNOWN")
        if state in terminal_states:
            print(f"Flow run finished with state: {state}")
            return state
        print(f"  State: {state} (attempt {attempt + 1}/{timeout})")
        time.sleep(1)
    print(f"Warning: Flow run did not complete within {timeout} seconds")
    return "TIMEOUT"
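
The `wait_for_flow_run` helper above counts attempts, so its real timeout grows with the duration of each HTTP call. The following is a minimal sketch of the same polling idea using a monotonic deadline instead. The name `poll_until_terminal` and the `fetch_state` callable are hypothetical and not part of StreamWeave or Prefect; a fake state sequence stands in for the Prefect lookup so the example runs without the dev stack.

```python
import time
from typing import Callable

TERMINAL_STATES = ("COMPLETED", "FAILED", "CANCELLED", "CRASHED")


def poll_until_terminal(
    fetch_state: Callable[[], str],
    timeout: float = 120.0,
    interval: float = 1.0,
) -> str:
    """Call `fetch_state` until it returns a terminal state or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        state = fetch_state()
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            return "TIMEOUT"
        time.sleep(interval)


# Stand-in for the Prefect lookup (hypothetical data, no network access needed).
_fake_states = iter(["SCHEDULED", "PENDING", "RUNNING", "COMPLETED"])
result = poll_until_terminal(lambda: next(_fake_states), timeout=5, interval=0.01)
print(result)
```

In the notebook, `fetch_state` could wrap the same `prefect.get(f"/flow_runs/{flow_run_id}")` call that `wait_for_flow_run` makes.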

For reference, the dev stack starts the following Docker services:

| Service | URL | Description |
|---|---|---|
| `postgres` | — | Application database |
| `redis` | — | Prefect cache |
| `prefect-postgres` | — | Prefect's internal database |
| `prefect-server` | `https://streamweave.local/prefect/` | Prefect UI + API (admin-only; log in through the main StreamWeave URL first) |
| `api` | `https://streamweave.local/api/` | StreamWeave FastAPI backend |
| `worker` | — | Prefect worker with rclone |
| `frontend` | `https://streamweave.local` | StreamWeave Vite frontend dev server (hot reload) |
| `caddy` | `https://streamweave.local` | HTTPS reverse proxy |
| `mailpit` | `https://streamweave.local/mail/` | SMTP catch-all for outgoing emails |
| `s3-dev` | `https://streamweave.local/s3/` | S3-compatible dev storage |
| `dev-seed` | — | Seeds sample data on startup, then exits |
| `instruments-init` | — | One-shot container that copies `sample_data/` into named volumes, then exits |
| `samba-instruments` | — | Single Samba server exposing all 4 instrument shares (`nmr`, `hplc`, `ms`, `tem`) on port 4461 |

### Wait for services to be ready

The `dev-seed` container runs once on startup and populates the database with sample
instruments, storage locations, schedules, and hooks. Re-running is safe — existing
records are skipped.

In [42]:
# Check the API for health to make sure services are ready
for attempt in range(60):
    try:
        resp = client.get("/health")
        if resp.status_code == 200:
            print("API is ready.")
            break
    except httpx.RequestError:
        pass
    print(f"Waiting for API... (attempt {attempt + 1}/60)")
    time.sleep(2)
else:
    raise RuntimeError("API did not become available")

API is ready.


In [43]:
_ = run("docker compose ps")

NAME                              IMAGE                        COMMAND                  SERVICE             CREATED          STATUS                             PORTS
streamweave-api-1                 streamweave-api              "sh -c 'alembic upgr…"   api                 53 seconds ago   Up 40 seconds (healthy)            0.0.0.0:8000->8000/tcp, [::]:8000->8000/tcp
streamweave-caddy-1               caddy:alpine                 "caddy run --config …"   caddy               52 seconds ago   Up 40 seconds                      0.0.0.0:80->80/tcp, [::]:80->80/tcp, 0.0.0.0:443->443/tcp, [::]:443->443/tcp
streamweave-frontend-1            streamweave-frontend-dev     "docker-entrypoint.s…"   frontend            52 seconds ago   Up 40 seconds                      
streamweave-mailpit-1             axllent/mailpit:latest       "/mailpit --webroot …"   mailpit             53 seconds ago   Up 51 seconds (unhealthy)          0.0.0.0:1025->1025/tcp, [::]:1025->1025/tcp
streamweave-postgres-1      

### Check API status

In [44]:
# Wait for the API to be ready (retries up to 30 seconds)
for attempt in range(30):
    try:
        resp = client.get("/health")
        pp(resp)
        break
    except httpx.RequestError:
        print(f"Waiting for API... (attempt {attempt + 1}/30)")
        time.sleep(1)
else:
    raise RuntimeError("API did not become available within 30 seconds")

{
  "status": "ok"
}


Expected: `{"status": "ok"}`

The Prefect UI is accessible at **https://streamweave.local/prefect/** (you must first log in at https://streamweave.local/ with an admin account).

## 2. Get an Auth Token

The dev stack automatically creates an admin account on startup via `ensure_admin.py`.
The default credentials are `admin@example.com` / `adminpassword` and can be overridden
with the `ADMIN_EMAIL` and `ADMIN_PASSWORD` environment variables. The cell below also adds the bearer token
to the default headers of both the StreamWeave and Prefect API clients, so all later calls are authorized.

In [45]:
resp = client.post("/auth/jwt/login", data={"username": ADMIN_EMAIL, "password": ADMIN_PASSWORD})
TOKEN = resp.json()["access_token"]
AUTH = {"Authorization": f"Bearer {TOKEN}"}
client.headers["Authorization"] = f"Bearer {TOKEN}"
prefect.headers["Authorization"] = f"Bearer {TOKEN}"
print(f"Token acquired (first 20 chars): {TOKEN[:20]}...")

Token acquired (first 20 chars): eyJhbGciOiJIUzI1NiIs...
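
The `eyJ...` prefix printed above is the base64url encoding of a JSON object, which is how a JWT header begins. The following is a minimal sketch for inspecting the header and payload of such a token without verifying its signature, which is useful only for debugging. The helper names and the sample token are hypothetical: the token is built locally, and its claims (`sub`, `exp`) are made-up values, not what StreamWeave actually issues.

```python
import base64
import json


def decode_jwt_segment(segment: str) -> dict:
    """Decode one base64url-encoded JWT segment (header or payload) into a dict."""
    padded = segment + "=" * (-len(segment) % 4)  # restore the stripped padding
    return json.loads(base64.urlsafe_b64decode(padded))


def peek_jwt(token: str) -> tuple[dict, dict]:
    """Return (header, payload) WITHOUT verifying the signature."""
    header_b64, payload_b64, _signature = token.split(".")
    return decode_jwt_segment(header_b64), decode_jwt_segment(payload_b64)


def _encode(obj: dict) -> str:
    """Encode a dict as an unpadded base64url JSON segment."""
    raw = json.dumps(obj, separators=(",", ":")).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


# Hypothetical token for illustration only; the claims and signature are invented.
sample_token = ".".join([
    _encode({"alg": "HS256", "typ": "JWT"}),
    _encode({"sub": "example-user-id", "exp": 1772345678}),
    "not-a-real-signature",
])
header, payload = peek_jwt(sample_token)
print(header)
print(payload)
```

Passing the notebook's `TOKEN` to `peek_jwt` should work the same way, assuming the API issues a standard three-segment JWT.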


## 3. Verify Seeded Data

### Counts of data

This cell verifies that the seed container initialized all of the expected data. A `✗` in the `OK` column marks a count that differs from the expected value. The later cells in this section show the API responses for each data type individually.

In [46]:
resources = [
    ("Service accounts",  client.get("/api/service-accounts").json(), 3),
    ("Storage locations", client.get("/api/storage-locations").json(), 3),
    ("Instruments",       client.get("/api/instruments").json(), 4),
    ("Schedules",         client.get("/api/schedules").json(), 4),
    ("Hooks",             client.get("/api/hooks").json(), 3),
    ("Users",             client.get("/api/admin/users").json(), 4),
]

w = max(len(label) for label, *_ in resources)
print(f"{'Resource':<{w}}  {'Count':>5}  {'Expected':>8}  {'OK':>4}")
print("-" * (w + 23))
for label, data, expected in resources:
    count = len(data)
    ok = "✓" if count == expected else "✗"
    print(f"{label:<{w}}  {count:>5}  {expected:>8}  {ok:>4}")

Resource           Count  Expected    OK
----------------------------------------
Service accounts       3         3     ✓
Storage locations      3         3     ✓
Instruments            4         4     ✓
Schedules              4         4     ✓
Hooks                  3         3     ✓
Users                  5         4     ✗


### 3a. Instruments

Expected: 4 instruments — Bruker AVANCE III 600 MHz NMR, Waters Acquity UPLC-MS,
Thermo Orbitrap Exploris 480, and FEI Titan Themis 300 TEM (offline for maintenance, so `enabled` is `false`).

In [47]:
resp = client.get("/api/instruments")
pp(resp)

[
  {
    "id": "4a02b04d-5ac3-4af8-a88e-e3e201228349",
    "name": "Bruker AVANCE III 600 MHz NMR",
    "description": "600 MHz solution NMR for small-molecule and protein characterization",
    "location": "Chemistry Building, Room 102",
    "pid": null,
    "cifs_host": "samba-instruments",
    "cifs_share": "nmr",
    "cifs_base_path": "/",
    "service_account_id": "2de1b861-abb4-4d0b-bd40-79929d1f77d3",
    "transfer_adapter": "rclone",
    "transfer_config": null,
    "enabled": true,
    "created_at": "2026-03-01T04:53:10.168617Z",
    "updated_at": "2026-03-01T04:53:10.168617Z",
    "deleted_at": null
  },
  {
    "id": "dc7ad357-3207-431c-b065-04d1d2723f5f",
    "name": "Waters Acquity UPLC-MS",
    "description": "Ultra-performance liquid chromatography with mass spectrometry detection",
    "location": "Analytical Core, Room 210",
    "pid": null,
    "cifs_host": "samba-instruments",
    "cifs_share": "hplc",
    "cifs_base_path": "/",
    "service_account_id": "33658607-9

### 3b. Storage locations

Expected: 3 storage locations — **Local POSIX archive** (`/storage/posix-archive`),
**S3 dev bucket** (rclone → `s3-dev:9000`), and **Samba archive share** (CIFS).

In [48]:
resp = client.get("/api/storage-locations")
pp(resp)

[
  {
    "id": "1da88c98-dbec-4634-b61e-20b29a932282",
    "name": "Local POSIX archive",
    "type": "posix",
    "connection_config": null,
    "base_path": "/storage/posix-archive",
    "enabled": true,
    "created_at": "2026-03-01T04:53:10.096887Z",
    "updated_at": "2026-03-01T04:53:10.096887Z",
    "deleted_at": null
  },
  {
    "id": "52fc3927-2977-4d2d-b017-922f5603927d",
    "name": "S3 dev bucket",
    "type": "s3",
    "connection_config": {
      "bucket": "instruments",
      "region": "us-east-1",
      "endpoint_url": "http://s3-dev:9000",
      "access_key_id": "devkey",
      "secret_access_key": "****"
    },
    "base_path": "instruments",
    "enabled": true,
    "created_at": "2026-03-01T04:53:10.101919Z",
    "updated_at": "2026-03-01T04:53:10.101919Z",
    "deleted_at": null
  },
  {
    "id": "177c1f7b-7c94-4da0-944e-be4a9af2c1a5",
    "name": "Samba archive share",
    "type": "cifs",
    "connection_config": {
      "host": "samba-archive",
      "share": 

### 3c. Schedules

Expected: 4 schedules with non-null `prefect_deployment_id` — the dev seed creates
schedules via the API, which triggers Prefect deployment creation automatically.

In [49]:
resp = client.get("/api/schedules")
schedules = resp.json()
pp(resp)

[
  {
    "id": "4f8738db-7351-4ed5-b903-9157a34060c8",
    "instrument_id": "4a02b04d-5ac3-4af8-a88e-e3e201228349",
    "default_storage_location_id": "1da88c98-dbec-4634-b61e-20b29a932282",
    "cron_expression": "0 1 * * *",
    "prefect_deployment_id": "e2ca1ca9-f4ab-44ae-8bfe-24d9138920bf",
    "enabled": true,
    "created_at": "2026-03-01T04:53:10.190914Z",
    "updated_at": "2026-03-01T04:53:10.193129Z",
    "deleted_at": null
  },
  {
    "id": "abf4b318-b19b-47bc-b833-1f3bc496b9d1",
    "instrument_id": "4a02b04d-5ac3-4af8-a88e-e3e201228349",
    "default_storage_location_id": "52fc3927-2977-4d2d-b017-922f5603927d",
    "cron_expression": "0 2 * * *",
    "prefect_deployment_id": "e2ca1ca9-f4ab-44ae-8bfe-24d9138920bf",
    "enabled": true,
    "created_at": "2026-03-01T04:53:11.608074Z",
    "updated_at": "2026-03-01T04:53:11.610840Z",
    "deleted_at": null
  },
  {
    "id": "4420a247-a785-48ab-8caf-3f8bb2b8d649",
    "instrument_id": "dc7ad357-3207-431c-b065-04d1d2723f5f",

### 3d. Hooks

Expected: 3 hooks:

- **Auto-assign file access on transfer** (`post_transfer`, `access_assignment`)
- **NMR metadata enrichment** (`post_transfer`, `metadata_enrichment`, scoped to NMR instrument)
- **File size filter — skip temp files** (`pre_transfer`, `file_filter`, excludes `*.tmp`, `*.lock`, `~*`)

In [50]:
resp = client.get("/api/hooks")
pp(resp)

[
  {
    "id": "b781584a-4023-498d-8784-fe27e5de3761",
    "name": "Auto-assign file access on transfer",
    "description": "Grants the instrument owner read access to every transferred file",
    "trigger": "post_transfer",
    "implementation": "builtin",
    "builtin_name": "access_assignment",
    "script_path": null,
    "webhook_url": null,
    "config": null,
    "instrument_id": null,
    "priority": 0,
    "enabled": true,
    "deleted_at": null
  },
  {
    "id": "884d9b30-3d3c-46c2-b729-9fc0e6a0783a",
    "name": "NMR metadata enrichment",
    "description": "Extracts pulse programme, solvent, and nucleus from Bruker acqus files",
    "trigger": "post_transfer",
    "implementation": "builtin",
    "builtin_name": "metadata_enrichment",
    "script_path": null,
    "webhook_url": null,
    "config": null,
    "instrument_id": "4a02b04d-5ac3-4af8-a88e-e3e201228349",
    "priority": 10,
    "enabled": true,
    "deleted_at": null
  },
  {
    "id": "952fbafa-4c3a-40f7-a857-1

In [51]:
# Save the NMR schedule ID (first schedule) for use in later steps
instruments = client.get("/api/instruments").json()
nmr = next(i for i in instruments if "NMR" in i["name"])
NMR_SCHEDULE = next(
    s for s in schedules if s["instrument_id"] == nmr["id"]
)
SCHEDULE_ID = NMR_SCHEDULE["id"]
print(f"NMR instrument: {nmr['name']}")
print(f"Schedule ID: {SCHEDULE_ID}")
print(f"Prefect deployment ID: {NMR_SCHEDULE.get('prefect_deployment_id')}")

NMR instrument: Bruker AVANCE III 600 MHz NMR
Schedule ID: 4f8738db-7351-4ed5-b903-9157a34060c8
Prefect deployment ID: e2ca1ca9-f4ab-44ae-8bfe-24d9138920bf


## 4. Test Prefect Integration

### 4a. Check Prefect UI

Open **https://streamweave.local/** in a browser and log in as the default admin (this saves a cookie so you can access the Prefect interface), then view the Prefect dashboard by clicking `Admin -> Prefect Dashboard`.
You should see:

- **Deployments** tab: 3 deployments named `harvest-{instrument_name}`
- **Work Pools** tab: a pool named **streamweave-worker-pool** with an active worker

### 4b. Trigger a manual harvest

In [52]:
resp = client.post(f"/api/schedules/{SCHEDULE_ID}/trigger", headers=AUTH)
pp(resp)
FLOW_RUN_ID = resp.json().get("flow_run_id")
print(f"\nView the run at https://streamweave.local/prefect/runs/flow-run/{FLOW_RUN_ID}")
wait_for_flow_run(FLOW_RUN_ID)

{
  "flow_run_id": "1b823504-9ecb-410d-84df-a3302827e0c1",
  "schedule_id": "4f8738db-7351-4ed5-b903-9157a34060c8"
}

View the run at https://streamweave.local/prefect/runs/flow-run/1b823504-9ecb-410d-84df-a3302827e0c1
Waiting for flow run 1b823504-9ecb-410d-84df-a3302827e0c1 to complete...
  State: PENDING (attempt 1/120)
  State: PENDING (attempt 2/120)
  State: PENDING (attempt 3/120)
  State: PENDING (attempt 4/120)
  State: PENDING (attempt 5/120)
  State: PENDING (attempt 6/120)
  State: PENDING (attempt 7/120)
  State: PENDING (attempt 8/120)
  State: PENDING (attempt 9/120)
  State: PENDING (attempt 10/120)
  State: PENDING (attempt 11/120)
  State: PENDING (attempt 12/120)
  State: RUNNING (attempt 13/120)
Flow run finished with state: COMPLETED


'COMPLETED'

Expected response:
```json
{
  "flow_run_id": "<uuid>",
  "schedule_id": "<uuid>"
}
```

### 4c. Monitor in Prefect UI

Go to **https://streamweave.local/prefect/flow-runs** (or the link above) and watch the triggered flow run. It will:

1. Run `discover_files_task` — discovers files on the NMR's Samba share (10, if this is the first time it has run)
2. Run `transfer_single_file_task` for each new file — transfers via rclone

The run details in the Prefect interface show the ten original files being found and transferred:

<img src="_static/prefect_flowrun_logs.png" width="80%" style="margin-left: 3em">

## 5. Verify Harvest Results

### 5a. File discovery

The harvest should have found 10 example files from the NMR instrument:

In [53]:
resp = client.get("/api/files", headers=AUTH)
print(f"\nFound {len(resp.json())} files")
pp(resp, n=3)


Found 10 files
[
  {
    "id": "040d940d-7438-44e5-9a18-1dd06f3b7c26",
    "persistent_id": "ark:/99999/fk4th4upeoohrdbbbt724njnc2ocq",
    "persistent_id_type": "ark",
    "instrument_id": "4a02b04d-5ac3-4af8-a88e-e3e201228349",
    "source_path": "20260201_alanine_13C/pdata/1/1r",
    "filename": "1r",
    "size_bytes": 8192,
    "source_mtime": "2026-03-01T04:52:53.030000Z",
    "xxhash": "f90b1bb50d3a727b",
    "sha256": null,
    "first_discovered_at": "2026-03-01T04:54:02.887230Z",
    "metadata_": {},
    "owner_id": null
  },
  {
    "id": "f2b8a998-fed7-47f1-9434-2520ef7e0875",
    "persistent_id": "ark:/99999/fk4fdf6rvnfgjegze5scb4leec3xi",
    "persistent_id_type": "ark",
    "instrument_id": "4a02b04d-5ac3-4af8-a88e-e3e201228349",
    "source_path": "20260201_alanine_13C/fid",
    "filename": "fid",
    "size_bytes": 16384,
    "source_mtime": "2026-03-01T04:52:53.036000Z",
    "xxhash": "b9ae9fcc0155a1c7",
    "sha256": null,
    "first_discovered_at": "2026-03-01T04:54:0

For each file you should see:
- `persistent_id` starting with `ark:/99999/fk4...` (unique ARK identifier)
- `instrument_id` matching the harvested instrument
- `source_path` matching the file's path on the instrument
- `filename` — the file name
- `xxhash` — checksum computed after transfer

### 5b. File transfers

Likewise, there should be 10 transfer records, one for each file:

In [54]:
resp = client.get("/api/transfers", headers=AUTH)
print(f"Found {len(resp.json())} transfers")
pp(resp, n=5)

Found 10 transfers
[
  {
    "id": "e3269a79-107f-4d22-9b55-a9f3ad981e57",
    "file_id": "c9b4368b-a4a5-4653-8a83-d32adfd3536f",
    "storage_location_id": "1da88c98-dbec-4634-b61e-20b29a932282",
    "destination_path": "/storage/posix-archive/Bruker AVANCE III 600 MHz NMR/20260210_ethanol_COSY/acqus",
    "transfer_adapter": "rclone",
    "status": "completed",
    "bytes_transferred": 485,
    "source_checksum": null,
    "dest_checksum": "13a9a75aebeedaa8",
    "checksum_verified": false,
    "started_at": "2026-03-01T04:54:02.420789Z",
    "completed_at": "2026-03-01T04:54:02.467524Z",
    "error_message": null,
    "prefect_flow_run_id": null
  },
  {
    "id": "0938d447-03c9-4b1e-8bf2-54c43b4866db",
    "file_id": "5c006485-c547-4fdf-a5b5-04f3d1748798",
    "storage_location_id": "1da88c98-dbec-4634-b61e-20b29a932282",
    "destination_path": "/storage/posix-archive/Bruker AVANCE III 600 MHz NMR/20260210_ethanol_COSY/fid",
    "transfer_adapter": "rclone",
    "status": "complet

Each transfer should have:
- `status`: `"completed"` or `"skipped"`
- `dest_checksum` — xxhash of the transferred file
- `destination_path` — where the file was written under `/storage/`
- `bytes_transferred` — file size
- `started_at` and `completed_at` timestamps
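
The checks listed above can be automated with a small summary helper. This is a sketch, not part of StreamWeave: `summarize_transfers` is a hypothetical name, the sample records are hand-written in the shape of the response above, and the `"failed"` status value is an assumption used only to show how a problem record would surface.

```python
from collections import Counter


def summarize_transfers(transfers: list[dict]) -> dict:
    """Count transfers per status and collect the IDs of records that need attention."""
    by_status = Counter(t.get("status") for t in transfers)
    problems = [
        t["id"]
        for t in transfers
        if t.get("status") not in ("completed", "skipped") or t.get("error_message")
    ]
    total_bytes = sum(t.get("bytes_transferred") or 0 for t in transfers)
    return {"by_status": dict(by_status), "problems": problems, "total_bytes": total_bytes}


# Hand-written sample records; the IDs are placeholders, not real transfer IDs.
sample = [
    {"id": "t1", "status": "completed", "bytes_transferred": 485, "error_message": None},
    {"id": "t2", "status": "skipped", "bytes_transferred": None, "error_message": None},
    {"id": "t3", "status": "failed", "bytes_transferred": 0, "error_message": "boom"},
]
summary = summarize_transfers(sample)
print(summary)
```

In the notebook, the same function could be applied to `client.get("/api/transfers").json()`.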


### 5c. Check files on disk (in the source directory)

In [55]:
_ = run(f"docker compose {DEV_COMPOSE} exec samba-instruments find /data/nmr -type f | sort")

/data/nmr/20260115_glucose_1H/acqus
/data/nmr/20260115_glucose_1H/fid
/data/nmr/20260115_glucose_1H/pdata/1/1r
/data/nmr/20260115_glucose_1H/pdata/1/procs
/data/nmr/20260201_alanine_13C/acqus
/data/nmr/20260201_alanine_13C/fid
/data/nmr/20260201_alanine_13C/pdata/1/1r
/data/nmr/20260210_ethanol_COSY/acqus
/data/nmr/20260210_ethanol_COSY/fid
/data/nmr/20260210_ethanol_COSY/pdata/1/1r


### 5d. Check files on disk (in the storage directory)

In [56]:
_ = run(f"docker compose {DEV_COMPOSE} exec api tree /storage/posix-archive/")

/storage/posix-archive/
└── Bruker AVANCE III 600 MHz NMR
    ├── 20260115_glucose_1H
    │   ├── acqus
    │   ├── fid
    │   └── pdata
    │       └── 1
    │           ├── 1r
    │           └── procs
    ├── 20260201_alanine_13C
    │   ├── acqus
    │   ├── fid
    │   └── pdata
    │       └── 1
    │           └── 1r
    └── 20260210_ethanol_COSY
        ├── acqus
        ├── fid
        └── pdata
            └── 1
                └── 1r

11 directories, 10 files


## 6. Test Pre-Transfer Hook (File Filter)

The **file size filter** hook skips zero-byte files and files matching
`*.tmp`, `*.lock`, `~*` patterns.
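
The following is a minimal sketch of the decision such a filter makes, assuming shell-style glob matching on the file name. It is an illustration only and not the built-in hook's actual implementation; `should_skip` and `EXCLUDE_PATTERNS` are hypothetical names.

```python
from fnmatch import fnmatchcase

# Patterns taken from the hook description above.
EXCLUDE_PATTERNS = ("*.tmp", "*.lock", "~*")


def should_skip(filename: str, size_bytes: int, patterns=EXCLUDE_PATTERNS) -> bool:
    """Return True if the file is empty or its name matches an exclude pattern."""
    if size_bytes == 0:
        return True
    # fnmatchcase matches case-sensitively on every platform.
    return any(fnmatchcase(filename, pattern) for pattern in patterns)


for name, size in [("scratch.tmp", 10), ("empty.txt", 0), ("fid", 16384), ("~lock.doc", 42)]:
    print(f"{name:<12} skip={should_skip(name, size)}")
```

Under these assumptions, `scratch.tmp` is skipped because of its name and `empty.txt` because of its size, while a regular data file such as `fid` passes through.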

### 6a. Add a temp and empty file to the NMR instrument share

In [57]:
# Write a .tmp file and empty .txt file into the samba-instruments volume
# via docker exec - this simulates two new files being created by the instrument
_ = run(f'docker compose {DEV_COMPOSE} exec samba-instruments sh -c '
  '"echo temp data > /data/nmr/scratch.tmp && truncate -s 0 /data/nmr/empty.txt"')
print("Files in NMR share:\n-------------------")
_ = run(f'docker compose {DEV_COMPOSE} exec samba-instruments sh -c '
      '"cd /data/nmr && find . -type f | sort"')

Files in NMR share:
-------------------
./20260115_glucose_1H/acqus
./20260115_glucose_1H/fid
./20260115_glucose_1H/pdata/1/1r
./20260115_glucose_1H/pdata/1/procs
./20260201_alanine_13C/acqus
./20260201_alanine_13C/fid
./20260201_alanine_13C/pdata/1/1r
./20260210_ethanol_COSY/acqus
./20260210_ethanol_COSY/fid
./20260210_ethanol_COSY/pdata/1/1r
./empty.txt
./scratch.tmp


### 6b. Trigger another harvest

In [58]:
resp = client.post(f"/api/schedules/{SCHEDULE_ID}/trigger")
IGNORE_FLOW_RUN_ID = resp.json().get("flow_run_id")
pp(resp)
print(f"\nView the run at https://streamweave.local/prefect/runs/flow-run/{IGNORE_FLOW_RUN_ID}")
wait_for_flow_run(IGNORE_FLOW_RUN_ID)

{
  "flow_run_id": "7bdfcd9a-fefa-4be9-9528-6d9fc5b5522d",
  "schedule_id": "4f8738db-7351-4ed5-b903-9157a34060c8"
}

View the run at https://streamweave.local/prefect/runs/flow-run/7bdfcd9a-fefa-4be9-9528-6d9fc5b5522d
Waiting for flow run 7bdfcd9a-fefa-4be9-9528-6d9fc5b5522d to complete...
  State: SCHEDULED (attempt 1/120)
  State: SCHEDULED (attempt 2/120)
  State: SCHEDULED (attempt 3/120)
  State: SCHEDULED (attempt 4/120)
  State: SCHEDULED (attempt 5/120)
  State: SCHEDULED (attempt 6/120)
  State: SCHEDULED (attempt 7/120)
  State: PENDING (attempt 8/120)
  State: PENDING (attempt 9/120)
  State: PENDING (attempt 10/120)
  State: PENDING (attempt 11/120)
  State: PENDING (attempt 12/120)
  State: PENDING (attempt 13/120)
  State: PENDING (attempt 14/120)
  State: PENDING (attempt 15/120)
  State: PENDING (attempt 16/120)
  State: PENDING (attempt 17/120)
  State: PENDING (attempt 18/120)
  State: PENDING (attempt 19/120)
  State: PENDING (attempt 20/120)
Flow run finished with 

'COMPLETED'

The run details in the Prefect interface show the two new files being found and skipped:

<img src="_static/prefect_flowrun_logs_skip_tmp.png" width="80%" style="margin-left: 3em">

### 6c. Verify the .tmp file was skipped

StreamWeave has a demonstration pre-transfer hook that ignores certain file patterns. These can be configured easily on a per-instrument basis. The following cells show that `scratch.tmp` is found during file discovery but is not transferred, because the pre-transfer hook blocks it.


In [59]:
print("Files:\n------")
resp = client.get("/api/files")
pp(resp, n=4)
# nmr/scratch.tmp and nmr/empty.txt will be in the file list printed out at this step,
# since they were discovered, but we will confirm they were not transferred in the next step

Files:
------
[
  {
    "id": "0f79af7d-2df4-4e65-bec6-7f83cd3b7350",
    "persistent_id": "ark:/99999/fk4e4czt2fvuffgtk7dbjomesv3aa",
    "persistent_id_type": "ark",
    "instrument_id": "4a02b04d-5ac3-4af8-a88e-e3e201228349",
    "source_path": "scratch.tmp",
    "filename": "scratch.tmp",
    "size_bytes": 10,
    "source_mtime": "2026-03-01T04:54:04.087000Z",
    "xxhash": null,
    "sha256": null,
    "first_discovered_at": "2026-03-01T04:54:24.095756Z",
    "metadata_": {},
    "owner_id": null
  },
  {
    "id": "ed055185-5b27-45ca-8edf-e3d1453953e9",
    "persistent_id": "ark:/99999/fk4zpthh45nubdt5jtb5zmcw4uxvi",
    "persistent_id_type": "ark",
    "instrument_id": "4a02b04d-5ac3-4af8-a88e-e3e201228349",
    "source_path": "empty.txt",
    "filename": "empty.txt",
    "size_bytes": 0,
    "source_mtime": "2026-03-01T04:54:04.087000Z",
    "xxhash": null,
    "sha256": null,
    "first_discovered_at": "2026-03-01T04:54:24.083013Z",
    "metadata_": {},
    "owner_id": null
  

In [60]:
# firmly assert that the scratch.tmp file was found
files = resp.json()
scratch = next((f for f in files if f["filename"] == "scratch.tmp"), None)
assert scratch is not None, "FAIL: scratch.tmp file record not found"

# firmly assert that the scratch.tmp file was not transferred
transfers = client.get(f"/api/transfers?file_id={scratch['id']}").json()
assert all(t["status"] == "skipped" for t in transfers), "FAIL: scratch.tmp should only have skipped transfers"
print("PASS: scratch.tmp was correctly filtered by the pre-transfer hook (transfer skipped)")

PASS: scratch.tmp was correctly filtered by the pre-transfer hook (transfer skipped)


## 7. Test Post-Transfer Hook (Metadata Enrichment)

StreamWeave supports post-transfer hooks that can extract scientific metadata
from file contents or from file paths. Encoding metadata in folder and file names is a common laboratory convention.

This example will configure a hook with regex rules that extract the **date**, **compound**, and **nucleus**
from Bruker NMR folder names like `20260115_glucose_1H/`.

### 7a. Update the hook with extraction rules

In [61]:
# Find the NMR metadata enrichment hook
hooks = client.get("/api/hooks").json()
nmr_hook = next(h for h in hooks if "NMR" in h["name"])
pp_dict(nmr_hook)

{
  "id": "884d9b30-3d3c-46c2-b729-9fc0e6a0783a",
  "name": "NMR metadata enrichment",
  "description": "Extracts pulse programme, solvent, and nucleus from Bruker acqus files",
  "trigger": "post_transfer",
  "implementation": "builtin",
  "builtin_name": "metadata_enrichment",
  "script_path": null,
  "webhook_url": null,
  "config": null,
  "instrument_id": "4a02b04d-5ac3-4af8-a88e-e3e201228349",
  "priority": 10,
  "enabled": true,
  "deleted_at": null
}


In [62]:
# Update it with regex rules to extract experiment metadata from the path
resp = client.patch(f"/api/hooks/{nmr_hook['id']}", json={
    "config": {
        "rules": [
            {
                "source": "path",
                "pattern": r"^(?P<date>\d{8})_(?P<compound>[^_/]+)_(?P<nucleus>[^/]+)/",
            }
        ]
    }
})
print(f"Updated hook: {resp.status_code}")
pp(resp)

Updated hook: 200
{
  "id": "884d9b30-3d3c-46c2-b729-9fc0e6a0783a",
  "name": "NMR metadata enrichment",
  "description": "Extracts pulse programme, solvent, and nucleus from Bruker acqus files",
  "trigger": "post_transfer",
  "implementation": "builtin",
  "builtin_name": "metadata_enrichment",
  "script_path": null,
  "webhook_url": null,
  "config": {
    "rules": [
      {
        "source": "path",
        "pattern": "^(?P<date>\\d{8})_(?P<compound>[^_/]+)_(?P<nucleus>[^/]+)/"
      }
    ]
  },
  "instrument_id": "4a02b04d-5ac3-4af8-a88e-e3e201228349",
  "priority": 10,
  "enabled": true,
  "deleted_at": null
}
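
Before re-harvesting, the pattern can be tried locally against a few source paths. This is a sketch using Python's standard `re` module. How the hook applies the pattern internally is an assumption, and `extract_metadata` is a hypothetical helper, but the regex itself is the one configured above.

```python
import re

# Same pattern as in the hook config above.
PATTERN = re.compile(r"^(?P<date>\d{8})_(?P<compound>[^_/]+)_(?P<nucleus>[^/]+)/")


def extract_metadata(source_path: str) -> dict:
    """Return the named groups captured from `source_path`, or {} if there is no match."""
    match = PATTERN.match(source_path)
    return match.groupdict() if match else {}


for path in ["20260115_glucose_1H/acqus", "20260201_alanine_13C/pdata/1/1r", "scratch.tmp"]:
    print(path, "->", extract_metadata(path))
```

Paths that do not start with a `<date>_<compound>_<nucleus>/` folder, such as `scratch.tmp`, yield no metadata under this sketch.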


### 7b. Clear transferred files and re-harvest

Delete all file records and transferred files so the harvest runs fresh:

In [63]:
_ = run(f"docker compose {DEV_COMPOSE} exec worker rm -rf /storage/posix-archive/*")
_ = run(f'docker compose {DEV_COMPOSE} exec postgres psql -U streamweave -c "DELETE FROM file_transfers; DELETE FROM file_records;"')

resp = client.post(f"/api/schedules/{SCHEDULE_ID}/trigger", headers=AUTH)
METADATA_FLOW_RUN_ID = resp.json().get("flow_run_id")
pp(resp)

DELETE 12
DELETE 12
{
  "flow_run_id": "43b94d64-72e4-4f34-b721-6a2c96da1b8b",
  "schedule_id": "4f8738db-7351-4ed5-b903-9157a34060c8"
}


Once the run finishes (call `wait_for_flow_run(METADATA_FLOW_RUN_ID)` to block until it does), the run details in the Prefect interface show the metadata being extracted from the paths and
added to the file records:

<img src="_static/prefect_flowrun_logs_metadata_enrichment.png" width="80%" style="margin-left: 3em">

In [68]:
# Fetching file records from the API displays extracted metadata under the "metadata_" key:
resp = client.get("/api/files")
pp(resp, n=4)

[
  {
    "id": "93eb26f3-f3bd-4f21-8634-d9b55c56b34c",
    "persistent_id": "ark:/99999/fk4jjgv3ohf7jgchgjwiohkquiapa",
    "persistent_id_type": "ark",
    "instrument_id": "4a02b04d-5ac3-4af8-a88e-e3e201228349",
    "source_path": "20260201_alanine_13C/pdata/1/1r",
    "filename": "1r",
    "size_bytes": 8192,
    "source_mtime": "2026-03-01T04:52:53.030000Z",
    "xxhash": "f90b1bb50d3a727b",
    "sha256": null,
    "first_discovered_at": "2026-03-01T04:54:42.079250Z",
    "metadata_": {
      "date": "20260201",
      "compound": "alanine",
      "nucleus": "13C"
    },
    "owner_id": null
  },
  {
    "id": "772b1ebe-5d83-4d56-a7b0-1500a8b7c34d",
    "persistent_id": "ark:/99999/fk4x3ctp6weovhqhmsx7tg53lec5m",
    "persistent_id_type": "ark",
    "instrument_id": "4a02b04d-5ac3-4af8-a88e-e3e201228349",
    "source_path": "20260201_alanine_13C/fid",
    "filename": "fid",
    "size_bytes": 16384,
    "source_mtime": "2026-03-01T04:52:53.036000Z",
    "xxhash": "b9ae9fcc0155a1c7",

## 8. User-Scoped Access Control Demo

Files are private by default. Access is granted explicitly to users, groups, or projects via the `FileAccessGrant` system.
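
The private-by-default rule can be pictured with a small in-memory model. This is a hypothetical sketch, not StreamWeave's implementation: the `Grant` and `Principal` classes, the `"group"` and `"project"` grantee type strings, and the admin bypass are all assumptions made for illustration. Only the `"user"` grantee type appears in the API calls in this notebook.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Grant:
    file_id: str
    grantee_type: str  # assumed values: "user", "group", or "project"
    grantee_id: str


@dataclass
class Principal:
    user_id: str
    group_ids: set[str] = field(default_factory=set)
    project_ids: set[str] = field(default_factory=set)
    is_admin: bool = False


def can_read(principal: Principal, file_id: str, grants: list[Grant]) -> bool:
    """Private by default: readable only by admins or through a matching grant."""
    if principal.is_admin:
        return True
    ids_by_type = {
        "user": {principal.user_id},
        "group": principal.group_ids,
        "project": principal.project_ids,
    }
    return any(
        g.file_id == file_id and g.grantee_id in ids_by_type.get(g.grantee_type, set())
        for g in grants
    )


# Placeholder IDs for illustration.
grants = [Grant("file-1", "user", "chemist"), Grant("file-2", "group", "nmr-lab")]
chemist = Principal("chemist")
print(can_read(chemist, "file-1", grants))
print(can_read(chemist, "file-2", grants))
```

With no matching grant, `can_read` returns `False`, which mirrors the empty file list a regular user sees before any access is granted.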

### 8a. Get regular user token and user ID

In [69]:
resp = client.post("/auth/jwt/login", data={"username": "chemist@example.com", "password": "devpass123!"})
USER_TOKEN = resp.json()["access_token"]
USER_AUTH = {"Authorization": f"Bearer {USER_TOKEN}"}

resp = client.get("/users/me", headers=USER_AUTH)
USER_ID = resp.json()["id"]
print(f"User ID: {USER_ID}")

User ID: ae2c9997-3f45-44ba-9437-adbc5762fd34


### 8b. Verify user sees no files (no access granted)

In [70]:
resp = client.get("/api/files", headers=USER_AUTH)
print("Files:", resp.json())
# Expected: []

resp = client.get("/api/transfers", headers=USER_AUTH)
print("Transfers:", resp.json())
# Expected: []

Files: []
Transfers: []


### 8c. Grant direct user access to a file

In [71]:
# Pick a file to grant access to
FILE_ID = client.get("/api/files", headers=AUTH).json()[0]["id"]
print(f"File ID: {FILE_ID}")

# Grant the user access (admin-only endpoint)
resp = client.post(f"/api/files/{FILE_ID}/access", headers=AUTH, json={
    "grantee_type": "user",
    "grantee_id": USER_ID,
})
pp(resp)

File ID: 93eb26f3-f3bd-4f21-8634-d9b55c56b34c
{
  "id": "5a9b65e0-b8c0-4e78-9fbd-7160008c4397",
  "file_id": "93eb26f3-f3bd-4f21-8634-d9b55c56b34c",
  "grantee_type": "user",
  "grantee_id": "ae2c9997-3f45-44ba-9437-adbc5762fd34",
  "granted_at": "2026-03-01T04:54:58.092172Z"
}


Expected response:
```json
{
  "id": "<grant-uuid>",
  "file_id": "<file-uuid>",
  "grantee_type": "user",
  "grantee_id": "<user-uuid>",
  "granted_at": "2026-02-23T..."
}
```

### 8d. Verify user now sees the granted file

In [72]:
resp = client.get("/api/files", headers=USER_AUTH)
print(f"Files visible to user: {len(resp.json())}")
# Expected: exactly 1 file

resp = client.get(f"/api/files/{FILE_ID}", headers=USER_AUTH)
pp(resp)
# Expected: 200 with full file details

Files visible to user: 1
{
  "id": "93eb26f3-f3bd-4f21-8634-d9b55c56b34c",
  "persistent_id": "ark:/99999/fk4jjgv3ohf7jgchgjwiohkquiapa",
  "persistent_id_type": "ark",
  "instrument_id": "4a02b04d-5ac3-4af8-a88e-e3e201228349",
  "source_path": "20260201_alanine_13C/pdata/1/1r",
  "filename": "1r",
  "size_bytes": 8192,
  "source_mtime": "2026-03-01T04:52:53.030000Z",
  "xxhash": "f90b1bb50d3a727b",
  "sha256": null,
  "first_discovered_at": "2026-03-01T04:54:42.079250Z",
  "metadata_": {
    "date": "20260201",
    "compound": "alanine",
    "nucleus": "13C"
  },
  "owner_id": null
}


### 8e. Verify 404 for files without access

In [73]:
OTHER_FILE = client.get("/api/files", headers=AUTH).json()[1]["id"]

resp = client.get(f"/api/files/{OTHER_FILE}", headers=USER_AUTH)
pp(resp)
# Expected: {"detail": "File not found"} (404, not 403 — avoids leaking existence)

{
  "detail": "File not found"
}
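The reason for returning 404 rather than 403 can be illustrated with a small model. This is a sketch only, not StreamWeave's implementation: `lookup_file`, `files`, and `visible_ids` are hypothetical names. The function returns the same response whether the file is missing or merely not granted to the caller, so the two cases cannot be distinguished from outside.

```python
NOT_FOUND = (404, {"detail": "File not found"})


def lookup_file(file_id, files, visible_ids):
    """Return (status, body) for a single-file request.

    files:       dict mapping file id -> stored record (hypothetical store)
    visible_ids: set of file ids the caller has been granted
    """
    record = files.get(file_id)
    # A missing file and a hidden file deliberately take the same branch.
    if record is None or file_id not in visible_ids:
        return NOT_FOUND
    return 200, record


files = {
    "f1": {"id": "f1", "filename": "1r"},
    "f2": {"id": "f2", "filename": "fid"},
}
visible = {"f1"}

print(lookup_file("f1", files, visible))       # granted file
print(lookup_file("f2", files, visible))       # exists, but not granted
print(lookup_file("missing", files, visible))  # does not exist
```

Because the last two calls produce identical responses, a caller cannot probe for the existence of files they have no grant for.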


### 8f. List and revoke a grant

In [74]:
# List grants for the file (admin only)
resp = client.get(f"/api/files/{FILE_ID}/access", headers=AUTH)
pp(resp)

# Revoke the grant
GRANT_ID = resp.json()[0]["id"]
resp = client.delete(f"/api/files/{FILE_ID}/access/{GRANT_ID}", headers=AUTH)
print(f"Delete status: {resp.status_code}")
# Expected: 204

# Verify user can no longer see the file
resp = client.get(f"/api/files/{FILE_ID}", headers=USER_AUTH)
pp(resp)
# Expected: {"detail": "File not found"}

[
  {
    "id": "5a9b65e0-b8c0-4e78-9fbd-7160008c4397",
    "file_id": "93eb26f3-f3bd-4f21-8634-d9b55c56b34c",
    "grantee_type": "user",
    "grantee_id": "ae2c9997-3f45-44ba-9437-adbc5762fd34",
    "granted_at": "2026-03-01T04:54:58.092172Z"
  }
]
Delete status: 204
{
  "detail": "File not found"
}
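The `# Expected:` comments in these cells can also be turned into checks that fail loudly. The helper below is a minimal sketch; `expect_status` is a hypothetical name and is not part of the notebook's helper cell.

```python
def expect_status(actual, expected, what="request"):
    """Raise AssertionError if an HTTP status code is not the expected one."""
    if actual != expected:
        raise AssertionError(f"{what}: expected HTTP {expected}, got {actual}")
    return actual


# In the notebook this would be called as, for example:
#   expect_status(resp.status_code, 204, "revoke grant")
expect_status(204, 204, "revoke grant")  # passes silently

try:
    expect_status(200, 404, "read after revoke")
except AssertionError as exc:
    print(exc)
```

A check like this stops the notebook at the first unexpected response instead of relying on the reader to compare output against a comment.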


## 9. Group-Based Access Demo

File access can also be granted through groups, which are named collections of users. A grant made to a group applies to every member of that group.
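The later cells in this section use a `GROUP_ID` variable. One way to set it, assuming the group is selected by name from the `/api/groups` listing, is sketched below. `find_group_id` is a hypothetical helper, and the sample records copy two entries from the demo data purely for illustration; they do not indicate which group the example user belongs to.

```python
def find_group_id(groups, name):
    """Return the id of the single group whose name matches exactly."""
    matches = [g["id"] for g in groups if g["name"] == name]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one group named {name!r}, found {len(matches)}"
        )
    return matches[0]


# In the notebook: groups = client.get("/api/groups", headers=AUTH).json()
groups = [
    {"id": "d5b79142-6eed-4849-a6e8-1500e6960cc9", "name": "Chemistry & Chemical Biology"},
    {"id": "35c6d45e-f229-47c9-ac13-aeed656895f8", "name": "Proteomics Core"},
]

GROUP_ID = find_group_id(groups, "Chemistry & Chemical Biology")
print(f"Group ID: {GROUP_ID}")
```

Selecting by name rather than by list position keeps the cell stable if the order of the listing changes.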

### 9a. Get group for the example chemistry user

In [75]:
# Get the group that chemist@example.com belongs to
resp = client.get("/api/groups", headers=AUTH)
pp(resp)

[
  {
    "id": "d5b79142-6eed-4849-a6e8-1500e6960cc9",
    "name": "Chemistry & Chemical Biology",
    "description": "Organic and inorganic chemistry researchers using NMR and HPLC",
    "created_at": "2026-03-01T04:53:11.893526Z",
    "updated_at": "2026-03-01T04:53:11.893526Z"
  },
  {
    "id": "35c6d45e-f229-47c9-ac13-aeed656895f8",
    "name": "Proteomics Core",
    "description": "Mass spectrometry and proteomics platform users",
    "created_at": "2026-03-01T04:53:11.904828Z",
    "updated_at": "2026-03-01T04:53:11.904828Z"
  },
  {
    "id": "5aa89405-1980-4d93-b285-777b9fc71f17",
    "name": "EM Facility",
    "description": "Electron microscopy facility operators and approved users",
    "created_at": "2026-03-01T04:53:11.913416Z",
    "updated_at": "2026-03-01T04:53:11.913416Z"
  },
  {
    "id": "156a6764-12ac-404d-bf79-5ff2aa3a00f0",
    "name": "Analytical Core",
    "description": "Cross-departmental analytical instrumentation users",
    "created_at": "2026-03-01T04:5

In [34]:
# Add the regular user to the group
resp = client.post(f"/api/groups/{GROUP_ID}/members", headers=AUTH, json={"user_id": USER_ID})
pp(resp)

Group ID: 646512c1-d81b-434a-b798-aeb402cd5dca
{
  "group_id": "646512c1-d81b-434a-b798-aeb402cd5dca",
  "user_id": "e20b41d4-b5b4-4b54-bc10-6d8fd3c72d2c"
}


### 9b. Grant the group access to a file

In [35]:
resp = client.post(f"/api/files/{FILE_ID}/access", headers=AUTH, json={
    "grantee_type": "group",
    "grantee_id": GROUP_ID,
})
pp(resp)

{
  "id": "ca7145e7-ae7b-4518-8657-58461cfd898f",
  "file_id": "fc38212d-4f71-405c-be81-26aa8f173beb",
  "grantee_type": "group",
  "grantee_id": "646512c1-d81b-434a-b798-aeb402cd5dca",
  "granted_at": "2026-02-24T17:06:11.640838Z"
}


### 9c. Verify user sees the file via group membership

In [36]:
resp = client.get(f"/api/files/{FILE_ID}", headers=USER_AUTH)
pp(resp)
# Expected: 200 — user can see the file because they're in the granted group

{
  "id": "fc38212d-4f71-405c-be81-26aa8f173beb",
  "persistent_id": "ark:/99999/fk4biqspimyync43nspm5elixje3m",
  "persistent_id_type": "ark",
  "instrument_id": "4bca34e9-5309-4d8d-b78c-af9ef64845f8",
  "source_path": "microscope/user_a/experiment_001/scan_01.csv",
  "filename": "scan_01.csv",
  "size_bytes": 23,
  "source_mtime": "2026-02-24T17:05:59.485000Z",
  "xxhash": "7c0d455ba2126c3d",
  "sha256": null,
  "first_discovered_at": "2026-02-24T17:06:10.729750Z",
  "metadata_": {
    "username": "user_a",
    "experiment": "experiment_001"
  },
  "owner_id": null
}


### 9d. Groups CRUD (Create, Read, Update, Delete)

In [37]:
# List groups
print("=== All groups ===")
pp(client.get("/api/groups", headers=AUTH))

# Get group details
print("\n=== Group details ===")
pp(client.get(f"/api/groups/{GROUP_ID}", headers=AUTH))

# List group members
print("\n=== Group members ===")
pp(client.get(f"/api/groups/{GROUP_ID}/members", headers=AUTH))

# Update group
print("\n=== Update group ===")
pp(client.patch(f"/api/groups/{GROUP_ID}", headers=AUTH, json={"description": "Updated description"}))

=== All groups ===
[
  {
    "id": "646512c1-d81b-434a-b798-aeb402cd5dca",
    "name": "Lab A Researchers",
    "description": "All researchers in Lab A",
    "created_at": "2026-02-24T17:06:11.627889Z",
    "updated_at": "2026-02-24T17:06:11.627889Z"
  }
]

=== Group details ===
{
  "id": "646512c1-d81b-434a-b798-aeb402cd5dca",
  "name": "Lab A Researchers",
  "description": "All researchers in Lab A",
  "created_at": "2026-02-24T17:06:11.627889Z",
  "updated_at": "2026-02-24T17:06:11.627889Z"
}

=== Group members ===
[
  {
    "group_id": "646512c1-d81b-434a-b798-aeb402cd5dca",
    "user_id": "e20b41d4-b5b4-4b54-bc10-6d8fd3c72d2c"
  }
]

=== Update group ===
{
  "id": "646512c1-d81b-434a-b798-aeb402cd5dca",
  "name": "Lab A Researchers",
  "description": "Updated description",
  "created_at": "2026-02-24T17:06:11.627889Z",
  "updated_at": "2026-02-24T17:06:11.661656Z"
}


In [38]:
# Remove member
resp = client.delete(f"/api/groups/{GROUP_ID}/members/{USER_ID}", headers=AUTH)
print(f"Remove member status: {resp.status_code}")
# Expected: 204

# Verify user lost access (group membership removed)
resp = client.get(f"/api/files/{FILE_ID}", headers=USER_AUTH)
pp(resp)
# Expected: {"detail": "File not found"}

Remove member status: 204
{
  "detail": "File not found"
}


## 10. Project-Based File Access Demo

Projects can contain both individual users and entire groups. When a file is granted to a project, all members (direct users + users in member groups) can see it.
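The resolution rule described above can be modelled in a few lines. This is a sketch under stated assumptions, not the backend's actual query: `AccessModel` and its field names are hypothetical, and grants are stored as plain `(file_id, grantee_type, grantee_id)` tuples.

```python
from dataclasses import dataclass, field


@dataclass
class AccessModel:
    group_members: dict = field(default_factory=dict)   # group id -> set of user ids
    project_users: dict = field(default_factory=dict)   # project id -> set of user ids
    project_groups: dict = field(default_factory=dict)  # project id -> set of group ids
    grants: set = field(default_factory=set)            # (file_id, grantee_type, grantee_id)

    def principals_for(self, user_id):
        """Every (type, id) pair through which this user can receive a grant."""
        groups = {g for g, members in self.group_members.items() if user_id in members}
        principals = {("user", user_id)} | {("group", g) for g in groups}
        for project, users in self.project_users.items():
            if user_id in users:
                principals.add(("project", project))
        for project, member_groups in self.project_groups.items():
            if member_groups & groups:
                principals.add(("project", project))
        return principals

    def can_see(self, user_id, file_id):
        return any(
            (file_id, kind, ident) in self.grants
            for kind, ident in self.principals_for(user_id)
        )


model = AccessModel()
model.group_members["chem"] = {"chemist"}
model.project_groups["study"] = {"chem"}
model.project_users["study"] = {"postdoc"}
model.grants.add(("file-1", "project", "study"))

print(model.can_see("chemist", "file-1"))  # via group membership in the project
print(model.can_see("postdoc", "file-1"))  # via direct project membership
model.project_users["study"].discard("postdoc")
print(model.can_see("postdoc", "file-1"))  # membership removed
```

Computing the set of principals first keeps the grant check itself a simple membership test, which mirrors how removing a membership immediately removes access in the cells that follow.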


### 10a. Create a project with user and group members

In [39]:
# Re-add user to the group (removed in previous step)
client.post(f"/api/groups/{GROUP_ID}/members", headers=AUTH, json={"user_id": USER_ID})

# Create project
resp = client.post("/api/projects", headers=AUTH, json={
    "name": "Microscopy Study 2026",
    "description": "Main research project",
})
PROJECT_ID = resp.json()["id"]
print(f"Project ID: {PROJECT_ID}")

# Add the group as a project member
resp = client.post(f"/api/projects/{PROJECT_ID}/members", headers=AUTH, json={
    "member_type": "group",
    "member_id": GROUP_ID,
})
pp(resp)

Project ID: 7d75d750-a051-4405-88de-ddfaaff5e98e
{
  "id": "b13ade48-a23e-4547-a0a3-57949ce214cd",
  "project_id": "7d75d750-a051-4405-88de-ddfaaff5e98e",
  "member_type": "group",
  "member_id": "646512c1-d81b-434a-b798-aeb402cd5dca"
}


### 10b. Grant the project access to a file

In [40]:
# Clean up previous grants on the file
grants = client.get(f"/api/files/{FILE_ID}/access", headers=AUTH).json()
for g in grants:
    client.delete(f"/api/files/{FILE_ID}/access/{g['id']}", headers=AUTH)
print(f"Cleaned up {len(grants)} existing grants")

# Grant project access
resp = client.post(f"/api/files/{FILE_ID}/access", headers=AUTH, json={
    "grantee_type": "project",
    "grantee_id": PROJECT_ID,
})
pp(resp)

Cleaned up 1 existing grants
{
  "id": "7d714f34-dbeb-494e-913b-1aa576fe1a86",
  "file_id": "fc38212d-4f71-405c-be81-26aa8f173beb",
  "grantee_type": "project",
  "grantee_id": "7d75d750-a051-4405-88de-ddfaaff5e98e",
  "granted_at": "2026-02-24T17:06:11.719943Z"
}


### 10c. Verify user sees the file via project → group → user chain

In [41]:
resp = client.get(f"/api/files/{FILE_ID}", headers=USER_AUTH)
pp(resp)
# Expected: 200 — user can see the file because:
#   user ∈ group → group ∈ project → project has file grant

{
  "id": "fc38212d-4f71-405c-be81-26aa8f173beb",
  "persistent_id": "ark:/99999/fk4biqspimyync43nspm5elixje3m",
  "persistent_id_type": "ark",
  "instrument_id": "4bca34e9-5309-4d8d-b78c-af9ef64845f8",
  "source_path": "microscope/user_a/experiment_001/scan_01.csv",
  "filename": "scan_01.csv",
  "size_bytes": 23,
  "source_mtime": "2026-02-24T17:05:59.485000Z",
  "xxhash": "7c0d455ba2126c3d",
  "sha256": null,
  "first_discovered_at": "2026-02-24T17:06:10.729750Z",
  "metadata_": {
    "username": "user_a",
    "experiment": "experiment_001"
  },
  "owner_id": null
}


### 10d. Test direct user membership in projects

In [42]:
# Create a second user
resp = client.post("/auth/register", json={"email": "postdoc@test.org", "password": "testpassword123"})
pp(resp)

resp = client.post("/auth/jwt/login", data={"username": "postdoc@test.org", "password": "testpassword123"})
POSTDOC_TOKEN = resp.json()["access_token"]
POSTDOC_AUTH = {"Authorization": f"Bearer {POSTDOC_TOKEN}"}
POSTDOC_ID = client.get("/users/me", headers=POSTDOC_AUTH).json()["id"]
print(f"Postdoc ID: {POSTDOC_ID}")

# Add postdoc directly to the project (not via group)
resp = client.post(f"/api/projects/{PROJECT_ID}/members", headers=AUTH, json={
    "member_type": "user",
    "member_id": POSTDOC_ID,
})
pp(resp)

# Postdoc can also see the file
resp = client.get(f"/api/files/{FILE_ID}", headers=POSTDOC_AUTH)
print(f"\nPostdoc file access status: {resp.status_code}")
# Expected: 200

{
  "id": "48b4470f-f365-4eb4-9a13-526122f9fd62",
  "email": "postdoc@test.org",
  "is_active": true,
  "is_superuser": false,
  "is_verified": false,
  "role": "user"
}
Postdoc ID: 48b4470f-f365-4eb4-9a13-526122f9fd62
{
  "id": "4c14f918-a3e4-437c-a402-35784ca749ee",
  "project_id": "7d75d750-a051-4405-88de-ddfaaff5e98e",
  "member_type": "user",
  "member_id": "48b4470f-f365-4eb4-9a13-526122f9fd62"
}

Postdoc file access status: 200


### 10e. Projects CRUD

In [43]:
# List projects
print("=== All projects ===")
pp(client.get("/api/projects", headers=AUTH))

# List project members
print("\n=== Project members ===")
pp(client.get(f"/api/projects/{PROJECT_ID}/members", headers=AUTH))
# Expected: 2 members (1 group + 1 direct user)

=== All projects ===
[
  {
    "id": "7d75d750-a051-4405-88de-ddfaaff5e98e",
    "name": "Microscopy Study 2026",
    "description": "Main research project",
    "created_at": "2026-02-24T17:06:11.689293Z",
    "updated_at": "2026-02-24T17:06:11.689293Z"
  }
]

=== Project members ===
[
  {
    "id": "b13ade48-a23e-4547-a0a3-57949ce214cd",
    "project_id": "7d75d750-a051-4405-88de-ddfaaff5e98e",
    "member_type": "group",
    "member_id": "646512c1-d81b-434a-b798-aeb402cd5dca"
  },
  {
    "id": "4c14f918-a3e4-437c-a402-35784ca749ee",
    "project_id": "7d75d750-a051-4405-88de-ddfaaff5e98e",
    "member_type": "user",
    "member_id": "48b4470f-f365-4eb4-9a13-526122f9fd62"
  }
]


In [44]:
# Remove postdoc from project
resp = client.delete(f"/api/projects/{PROJECT_ID}/members/{POSTDOC_ID}", headers=AUTH)
print(f"Remove member status: {resp.status_code}")
# Expected: 204

# Postdoc loses access
resp = client.get(f"/api/files/{FILE_ID}", headers=POSTDOC_AUTH)
pp(resp)
# Expected: {"detail": "File not found"}

Remove member status: 204
{
  "detail": "File not found"
}


### 10f. Non-admin users cannot manage groups/projects/grants

In [45]:
# All of these should return 403
for endpoint in ["/api/groups", "/api/projects", f"/api/files/{FILE_ID}/access"]:
    resp = client.get(endpoint, headers=USER_AUTH)
    print(f"GET {endpoint}: {resp.status_code} — {resp.json()}")
# Expected: {"detail": "Admin access required"}

GET /api/groups: 403 — {'detail': 'Admin access required'}
GET /api/projects: 403 — {'detail': 'Admin access required'}
GET /api/files/fc38212d-4f71-405c-be81-26aa8f173beb/access: 403 — {'detail': 'Admin access required'}



## 11. File & Transfer API Filtering Demo



### 11a. Filter files by instrument

In [46]:
INSTRUMENT_ID = instruments[0]["id"]

resp = client.get(f"/api/files?instrument_id={INSTRUMENT_ID}", headers=AUTH)
print(f"Files for instrument {INSTRUMENT_ID}: {len(resp.json())}")

Files for instrument 4bca34e9-5309-4d8d-b78c-af9ef64845f8: 6
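The filter above is written directly into the URL with an f-string. An alternative is to build the query string from a dictionary so that values are escaped consistently. The sketch below uses only the standard library; `build_url` is a hypothetical helper, and `instrument_id` is the only file filter shown in this guide.

```python
from urllib.parse import urlencode


def build_url(path, **filters):
    """Append non-empty filters to a path as an escaped query string."""
    query = urlencode({k: v for k, v in filters.items() if v is not None})
    return f"{path}?{query}" if query else path


print(build_url("/api/files", instrument_id="4bca34e9-5309-4d8d-b78c-af9ef64845f8"))
print(build_url("/api/files", instrument_id=None))
```

With httpx the same effect is available without a helper by passing the dictionary as `params`, for example `client.get("/api/files", params={"instrument_id": INSTRUMENT_ID}, headers=AUTH)`.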


### 11b. Filter transfers by file

In [47]:
FILE_ID = client.get("/api/files", headers=AUTH).json()[0]["id"]

resp = client.get(f"/api/transfers?file_id={FILE_ID}", headers=AUTH)
pp(resp)

[
  {
    "id": "4bf5abaf-b2ed-412b-87f6-3c7326d80ba9",
    "file_id": "fc38212d-4f71-405c-be81-26aa8f173beb",
    "storage_location_id": "7cb0e261-f511-4ff3-96af-0bd3323d0ba7",
    "destination_path": "/storage/archive/Microscope 01/microscope/user_a/experiment_001/scan_01.csv",
    "transfer_adapter": "rclone",
    "status": "completed",
    "bytes_transferred": 23,
    "source_checksum": null,
    "dest_checksum": "7c0d455ba2126c3d",
    "checksum_verified": false,
    "started_at": "2026-02-24T17:06:10.730384Z",
    "completed_at": "2026-02-24T17:06:10.790206Z",
    "error_message": null,
    "prefect_flow_run_id": null
  }
]


### 11c. Get single file by ID

In [48]:
resp = client.get(f"/api/files/{FILE_ID}", headers=AUTH)
pp(resp)

{
  "id": "fc38212d-4f71-405c-be81-26aa8f173beb",
  "persistent_id": "ark:/99999/fk4biqspimyync43nspm5elixje3m",
  "persistent_id_type": "ark",
  "instrument_id": "4bca34e9-5309-4d8d-b78c-af9ef64845f8",
  "source_path": "microscope/user_a/experiment_001/scan_01.csv",
  "filename": "scan_01.csv",
  "size_bytes": 23,
  "source_mtime": "2026-02-24T17:05:59.485000Z",
  "xxhash": "7c0d455ba2126c3d",
  "sha256": null,
  "first_discovered_at": "2026-02-24T17:06:10.729750Z",
  "metadata_": {
    "username": "user_a",
    "experiment": "experiment_001"
  },
  "owner_id": null
}


Verify all fields are present: `persistent_id`, `persistent_id_type`, `source_path`, `filename`, `xxhash`, `first_discovered_at`, `metadata_`.

### 11d. Get single transfer by ID

In [49]:
TRANSFER_ID = client.get("/api/transfers", headers=AUTH).json()[0]["id"]

resp = client.get(f"/api/transfers/{TRANSFER_ID}", headers=AUTH)
pp(resp)

{
  "id": "0866d35b-f4f2-492d-a562-2a816fd27da8",
  "file_id": "37bc26ec-ed36-4148-ba35-e27b4b3103ac",
  "storage_location_id": "7cb0e261-f511-4ff3-96af-0bd3323d0ba7",
  "destination_path": "/storage/archive/Microscope 01/microscope/experiment_001.csv",
  "transfer_adapter": "rclone",
  "status": "completed",
  "bytes_transferred": 302,
  "source_checksum": null,
  "dest_checksum": "2660dec29d8c3f7b",
  "checksum_verified": false,
  "started_at": "2026-02-24T17:06:10.399059Z",
  "completed_at": "2026-02-24T17:06:10.458032Z",
  "error_message": null,
  "prefect_flow_run_id": null
}
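The timestamps in a transfer record can be used to derive how long the transfer took. The following sketch uses only fields that appear in the response above; `parse_ts` and `transfer_summary` are hypothetical helper names.

```python
from datetime import datetime


def parse_ts(value):
    """Parse an ISO 8601 timestamp with a trailing 'Z' into an aware datetime."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def transfer_summary(transfer):
    """Condense a transfer record into duration, size, and verification state."""
    elapsed = parse_ts(transfer["completed_at"]) - parse_ts(transfer["started_at"])
    return {
        "seconds": elapsed.total_seconds(),
        "bytes": transfer["bytes_transferred"],
        "verified": transfer["checksum_verified"],
    }


transfer = {
    "bytes_transferred": 302,
    "checksum_verified": False,
    "started_at": "2026-02-24T17:06:10.399059Z",
    "completed_at": "2026-02-24T17:06:10.458032Z",
}
summary = transfer_summary(transfer)
print(summary)
```

The `replace("Z", "+00:00")` step is there because `datetime.fromisoformat` only accepts the `Z` suffix from Python 3.11 onward.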


## 12. Test Schedule CRUD with Prefect Sync

### 12a. Create a new schedule

In [50]:
resp = client.post("/api/schedules", headers=AUTH, json={
    "instrument_id": INSTRUMENT_ID,
    "default_storage_location_id": STORAGE_ID,
    "cron_expression": "0 */6 * * *",
    "enabled": True,
})
pp(resp)

{
  "id": "f1598bd0-af1d-403d-8713-7431e03e5ec1",
  "instrument_id": "4bca34e9-5309-4d8d-b78c-af9ef64845f8",
  "default_storage_location_id": "7cb0e261-f511-4ff3-96af-0bd3323d0ba7",
  "cron_expression": "0 */6 * * *",
  "prefect_deployment_id": "a28a83b2-4d95-45ba-acd3-6b664c9ba1e7",
  "enabled": true,
  "created_at": "2026-02-24T17:06:11.923544Z",
  "updated_at": "2026-02-24T17:06:11.925086Z"
}


Check that `prefect_deployment_id` is populated, which indicates that the corresponding Prefect deployment was created.
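The cron expression `0 */6 * * *` means minute 0 of every sixth hour. The sketch below expands the minute and hour fields to make that concrete. It is an illustration only: `expand_field` and `fire_times` are hypothetical helpers, they handle just `*`, steps, ranges, and comma lists, they ignore the day, month, and weekday fields, and they do not reproduce how Prefect itself parses schedules.

```python
def expand_field(field, lo, hi):
    """Expand one cron field such as '*/6', '0', '1-5' or '0,30' into a sorted list."""
    values = set()
    for part in field.split(","):
        base, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if step < 1:
            raise ValueError(f"step must be positive in {part!r}")
        if base == "*":
            start, end = lo, hi
        elif "-" in base:
            first, last = base.split("-", 1)
            start, end = int(first), int(last)
        else:
            start = end = int(base)
        if not (lo <= start <= end <= hi):
            raise ValueError(f"{part!r} is outside {lo}-{hi}")
        values.update(range(start, end + 1, step))
    return sorted(values)


def fire_times(cron_expression):
    """Daily HH:MM fire times, assuming the last three fields are all '*'."""
    minute, hour, *_ = cron_expression.split()
    return [
        f"{h:02d}:{m:02d}"
        for h in expand_field(hour, 0, 23)
        for m in expand_field(minute, 0, 59)
    ]


print(fire_times("0 */6 * * *"))
print(fire_times("0 */12 * * *"))
```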

### 12b. Update the schedule

In [51]:
client.get("/api/schedules", headers=AUTH).json()

[{'id': '725fa608-f4da-40f5-980a-2aa977ffe7e3',
  'instrument_id': '4bca34e9-5309-4d8d-b78c-af9ef64845f8',
  'default_storage_location_id': '7cb0e261-f511-4ff3-96af-0bd3323d0ba7',
  'cron_expression': '*/15 * * * *',
  'prefect_deployment_id': 'a28a83b2-4d95-45ba-acd3-6b664c9ba1e7',
  'enabled': True,
  'created_at': '2026-02-24T17:05:41.521453Z',
  'updated_at': '2026-02-24T17:05:41.523843Z'},
 {'id': '0293924e-a6de-459c-98c0-6d73303d576b',
  'instrument_id': '65186f54-074d-430e-b2ca-c24b4861e990',
  'default_storage_location_id': '7cb0e261-f511-4ff3-96af-0bd3323d0ba7',
  'cron_expression': '*/15 * * * *',
  'prefect_deployment_id': '97ee019b-aa52-43b0-84e9-f472b9951c50',
  'enabled': True,
  'created_at': '2026-02-24T17:05:42.798288Z',
  'updated_at': '2026-02-24T17:05:42.799405Z'},
 {'id': 'a86e9e55-9518-4b9f-9a24-321f3ea3b428',
  'instrument_id': '20bfa6b2-87a4-4d41-9c97-d7f59bd92723',
  'default_storage_location_id': '51c43e6b-e524-4fd0-aa92-3b40d8fe7cda',
  'cron_expression': '*/

In [52]:
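# Note: this selects the first schedule returned by the list endpoint,
# which is not necessarily the schedule created in 12a.
NEW_SCHEDULE_ID = client.get("/api/schedules", headers=AUTH).json()[0]["id"]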

resp = client.patch(f"/api/schedules/{NEW_SCHEDULE_ID}", headers=AUTH, json={
    "cron_expression": "0 */12 * * *",
})
pp(resp)

{
  "id": "725fa608-f4da-40f5-980a-2aa977ffe7e3",
  "instrument_id": "4bca34e9-5309-4d8d-b78c-af9ef64845f8",
  "default_storage_location_id": "7cb0e261-f511-4ff3-96af-0bd3323d0ba7",
  "cron_expression": "0 */12 * * *",
  "prefect_deployment_id": "a28a83b2-4d95-45ba-acd3-6b664c9ba1e7",
  "enabled": true,
  "created_at": "2026-02-24T17:05:41.521453Z",
  "updated_at": "2026-02-24T17:06:12.010835Z"
}


Verify in the Prefect UI that the deployment schedule was updated.

## 13. Test Idempotent Discovery

Trigger the same harvest twice — the second run should find zero new files.
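Idempotent discovery can be pictured as an insert-if-absent step keyed on a stable identity. The sketch below assumes that identity is the pair of instrument and source path; that is an assumption for illustration, and the real harvester may also consider checksums or modification times. `discover` and `known` are hypothetical names.

```python
def discover(known, instrument_id, listing):
    """Record files not seen before and return the newly discovered paths."""
    new_paths = []
    for path, size_bytes in listing:
        key = (instrument_id, path)  # assumed identity of a file record
        if key not in known:
            known[key] = {"size_bytes": size_bytes}
            new_paths.append(path)
    return new_paths


known = {}
listing = [("a/fid", 16384), ("a/acqus", 512)]

first = discover(known, "nmr-01", listing)
second = discover(known, "nmr-01", listing)

print(f"First run:  {len(first)} new files")
print(f"Second run: {len(second)} new files")
```

The second run finds nothing new because every key is already present, which is the property the cell below checks against the live API by comparing file counts.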

In [53]:
SCHEDULE_ID

'725fa608-f4da-40f5-980a-2aa977ffe7e3'

In [54]:
# First trigger
resp = client.post(f"/api/schedules/{SCHEDULE_ID}/trigger", headers=AUTH)
FLOW_RUN_ID = resp.json().get("flow_run_id")
print("Trigger 1:", resp.json())
wait_for_flow_run(FLOW_RUN_ID)

before = len(client.get("/api/files", headers=AUTH).json())
print(f"Files before: {before}\n")

# Second trigger
resp = client.post(f"/api/schedules/{SCHEDULE_ID}/trigger", headers=AUTH)
print("Trigger 2:", resp.json())
FLOW_RUN_ID = resp.json().get("flow_run_id")
wait_for_flow_run(FLOW_RUN_ID)

after = len(client.get("/api/files", headers=AUTH).json())
print(f"Files after: {after}\n")

if before == after:
    print("PASS: No duplicate files")
else:
    print("FAIL: Duplicate files created")

Trigger 1: {'flow_run_id': '12a6ac99-7067-4e23-a245-692db7d52a42', 'schedule_id': '725fa608-f4da-40f5-980a-2aa977ffe7e3'}
Waiting for flow run 12a6ac99-7067-4e23-a245-692db7d52a42 to complete...
  State: SCHEDULED (attempt 1/120)
  State: SCHEDULED (attempt 2/120)
  State: SCHEDULED (attempt 3/120)
  State: SCHEDULED (attempt 4/120)
  State: SCHEDULED (attempt 5/120)
  State: PENDING (attempt 6/120)
  State: RUNNING (attempt 7/120)
Flow run finished with state: COMPLETED
Files before: 6

Trigger 2: {'flow_run_id': '836fecca-bbb1-4c9a-8280-bdf253751018', 'schedule_id': '725fa608-f4da-40f5-980a-2aa977ffe7e3'}
Waiting for flow run 836fecca-bbb1-4c9a-8280-bdf253751018 to complete...
  State: SCHEDULED (attempt 1/120)
  State: SCHEDULED (attempt 2/120)
  State: SCHEDULED (attempt 3/120)
  State: SCHEDULED (attempt 4/120)
  State: SCHEDULED (attempt 5/120)
  State: SCHEDULED (attempt 6/120)
  State: SCHEDULED (attempt 7/120)
  State: SCHEDULED (attempt 8/120)
  State: SCHEDULED (attempt 9/12

### 7c. Check enriched metadata

In [None]:
wait_for_flow_run(METADATA_FLOW_RUN_ID)

resp = client.get("/api/files", headers=AUTH)
for f in resp.json():
    print(json.dumps({
        "filename": f["filename"],
        "source_path": f["source_path"],
        "metadata_": f.get("metadata_"),
    }, indent=2))

Files inside dated experiment folders should have extracted metadata:
```json
{
  "filename": "acqus",
  "source_path": "20260115_glucose_1H/acqus",
  "metadata_": {
    "date": "20260115",
    "compound": "glucose",
    "nucleus": "1H"
  }
}
```
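The mapping from a folder name such as `20260115_glucose_1H` to the three metadata keys can be expressed as a regular expression with named groups. The pattern below is an assumption inferred from the example output, not the hook's actual configuration, and `extract_metadata` is a hypothetical helper.

```python
import re

# Assumed folder layout: <8-digit date>_<compound>_<nucleus>/...
FOLDER_PATTERN = re.compile(
    r"^(?P<date>\d{8})_(?P<compound>[^_/]+)_(?P<nucleus>[^_/]+)/"
)


def extract_metadata(source_path):
    """Return metadata parsed from the top-level folder, or {} if it does not match."""
    match = FOLDER_PATTERN.match(source_path)
    return match.groupdict() if match else {}


print(extract_metadata("20260115_glucose_1H/acqus"))
print(extract_metadata("20260201_alanine_13C/pdata/1/1r"))
print(extract_metadata("scratch.tmp"))
```

Paths that do not sit inside a dated experiment folder simply yield an empty dictionary, so files outside that layout are left without extracted metadata.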

## 15. Cleanup

Run the cells below to remove the test files and tear down the dev stack, or tear the stack down from a terminal:

```bash
docker compose -f docker-compose.yml -f docker-compose.dev.yml down -v
```

In [None]:
# Remove test files written to the NMR instrument share
run(f'docker compose {DEV_COMPOSE} exec samba-instruments sh -c '
    '"rm -rf /data/nmr/scratch.tmp /data/nmr/empty.txt /data/nmr/user_a /data/nmr/user_b 2>/dev/null; true"')

# Tear down the dev stack
_ = run(f"docker compose {DEV_COMPOSE} down -v")

client.close()


Confirm no containers are still running:

In [None]:
_ = run(f"docker compose {DEV_COMPOSE} ps")

## 16. Conclusion

This guide demonstrated the core capabilities of the **StreamWeave** backend, a research data management platform designed to automate the discovery, transfer, and governance of instrument-generated data.

### Features Covered

- **Automated File Discovery & Transfer** — Schedules that automatically harvest files from instrument sources, with checksum verification and idempotent processing
- **Persistent Identifiers (ARK)** — Every discovered file receives a unique, standards-compliant ARK identifier for long-term reference
- **Workflow Orchestration** — Prefect-powered flow execution with real-time monitoring, manual triggers, and scheduled runs
- **Extensible Hooks System** — Pre-transfer hooks for filtering files and post-transfer hooks for metadata enrichment
- **Fine-Grained Access Control** — User, group, and project-based permissions with hierarchical inheritance
- **Full API Coverage** — RESTful endpoints for instruments, storage locations, schedules, files, transfers, and access management

### Use Cases

StreamWeave is ideal for:
- Research core facilities managing data from multiple scientific instruments
- Laboratories requiring automated data archival with provenance tracking
- Organizations needing compliant data governance with audit trails



### Interested in deploying StreamWeave for your organization?

For deployment assistance, custom integrations, or enterprise support, contact us at:

**[https://datasophos.co/#contact](https://datasophos.co/#contact)**