# OME-Zarr Metadata Enrichment — Starter Notebook

This notebook walks through the OME-Zarr outputs from the brieflow pipeline test data.
It demonstrates how to inspect, enrich, and validate zarr store metadata so the undergrad
can implement each enrichment step for real.

**Prerequisites:** Run `bash run_brieflow_omezarr.sh` from the `small_test_analysis/` directory first.

In [None]:
import json
import sys
from pathlib import Path

import numpy as np
import pandas as pd
import zarr

# Add brieflow workflow to path so we can import its libraries
sys.path.insert(0, str(Path("../../../workflow").resolve()))
from lib.shared.io import read_image, save_image

# Root paths
ZARR_ROOT = Path("../brieflow_output_zarr")
assert ZARR_ROOT.exists(), f"Run the zarr pipeline first: {ZARR_ROOT}"

## 1. Explore the output directory structure

The pipeline produces three module outputs, each with `images/`, `hcs/`, and tabular data:

In [None]:
for module in ["preprocess", "sbs", "phenotype"]:
    module_dir = ZARR_ROOT / module
    if not module_dir.exists():
        continue
    subdirs = sorted(d.name for d in module_dir.iterdir() if d.is_dir())
    print(f"{module}/  →  {subdirs}")

## 2. Read pipeline metadata (parquet)

The preprocess module extracts hardware metadata from the raw images.
This is where pixel sizes, channel info, and other acquisition parameters live.
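
Before relying on these fields, it can help to confirm that the parquet actually contains them. The sketch below is a hypothetical helper (`missing_metadata_columns` is not part of brieflow) that lists which expected columns are absent. The column names are the ones this notebook reads later, and the toy values are made up for illustration.

```python
import pandas as pd

# Columns this notebook reads later; adjust if the pipeline schema differs.
REQUIRED_COLUMNS = [
    "pixel_size_x",
    "pixel_size_y",
    "pixel_size_z",
    "objective_magnification",
    "channels",
]


def missing_metadata_columns(df: pd.DataFrame, required=REQUIRED_COLUMNS) -> list:
    """Return the required column names that are absent from df, in order."""
    return [col for col in required if col not in df.columns]


# Toy frame standing in for combined_metadata.parquet (values are invented)
toy_meta = pd.DataFrame(
    {"pixel_size_x": [0.5], "pixel_size_y": [0.5], "channels": [4]}
)
missing = missing_metadata_columns(toy_meta)
print(f"Missing columns: {missing}")
```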

In [None]:
# Read phenotype metadata
ph_meta = pd.read_parquet(
    ZARR_ROOT / "preprocess/metadata/phenotype/1/A1/combined_metadata.parquet"
)
print(f"Shape: {ph_meta.shape}")
print(f"Columns: {list(ph_meta.columns)}")
ph_meta

In [None]:
# Key fields for metadata enrichment
print("Pixel sizes (from hardware):")
print(f"  X: {ph_meta['pixel_size_x'].iloc[0]} µm")
print(f"  Y: {ph_meta['pixel_size_y'].iloc[0]} µm")
print(f"  Z: {ph_meta['pixel_size_z'].iloc[0]} µm")
print(f"Objective: {ph_meta['objective_magnification'].iloc[0]}x")
print(f"Channels: {ph_meta['channels'].iloc[0]}")

## 3. Inspect a per-tile zarr store

Each tile is written as an independent OME-Zarr store with pyramid levels.
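
As a rough illustration of what "pyramid levels" means in the metadata, the sketch below summarizes the `datasets` list of a multiscale entry. `summarize_multiscales` is a hypothetical helper and the sample dict is hand-written, assuming the `attributes.ome.multiscales` layout that the cells in this section read.

```python
def summarize_multiscales(zarr_json: dict) -> list:
    """Return (path, scale) for each dataset of the first multiscale entry."""
    ms = zarr_json["attributes"]["ome"]["multiscales"][0]
    summary = []
    for ds in ms["datasets"]:
        # Take the first "scale" transformation, or None if there is none
        scale = next(
            (t["scale"] for t in ds["coordinateTransformations"] if t["type"] == "scale"),
            None,
        )
        summary.append((ds["path"], scale))
    return summary


# Hand-written sample with two levels (not real pipeline output)
sample = {
    "attributes": {
        "ome": {
            "multiscales": [
                {
                    "datasets": [
                        {"path": "0", "coordinateTransformations": [{"type": "scale", "scale": [1.0, 1.0, 1.0]}]},
                        {"path": "1", "coordinateTransformations": [{"type": "scale", "scale": [1.0, 2.0, 2.0]}]},
                    ]
                }
            ]
        }
    }
}
levels = summarize_multiscales(sample)
for path, scale in levels:
    print(f"level {path}: scale={scale}")
```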

In [None]:
# Pick a phenotype tile
tile_store = ZARR_ROOT / "preprocess/images/phenotype/1/A1/2/image.zarr"

# Read zarr.json (OME-NGFF v0.5 uses zarr v3 format)
with open(tile_store / "zarr.json") as f:
    tile_meta = json.load(f)

print("=== zarr.json ===")
print(json.dumps(tile_meta, indent=2))

In [None]:
# Inspect the OME multiscale metadata
ome = tile_meta["attributes"]["ome"]
ms = ome["multiscales"][0]

print("Axes:")
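for ax in ms["axes"]:
    # Flag only spatial axes that actually lack a unit
    note = "  ←  MISSING: 'unit' key" if ax["type"] == "space" and "unit" not in ax else ""
    print(f"  {ax['name']} ({ax['type']}){note}")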

print("\nPyramid levels:")
for ds in ms["datasets"]:
    scales = ds["coordinateTransformations"][0]["scale"]
    print(f"  level {ds['path']}: scale={scales}  ←  MISSING: real pixel sizes")

print("\nOmero channels:")
omero = tile_meta["attributes"].get("omero", {})
for ch in omero.get("channels", []):
    print(f"  {ch['label']}  ←  MISSING: real name, contrast window")

In [None]:
# Read image data at different pyramid levels
root = zarr.open_group(str(tile_store), mode="r")
for level in ["0", "1", "2", "3", "4"]:
    if level in root:
        arr = root[level]
        print(
            f"Level {level}: shape={arr.shape}, dtype={arr.dtype}, chunks={arr.chunks}"
        )

In [None]:
# Compare: using the high-level read_image() API
img = read_image(tile_store)
print(f"read_image() shape: {img.shape}, dtype: {img.dtype}")
print(
    f"  (on-disk level-0 shape is {root['0'].shape}; read_image() squeezes any singleton channel dim)"
)

## 4. Inspect a label store

Segmentation outputs (nuclei, cells) are zarr stores with an `image-label` attribute.
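
The cell below looks for `image-label` at the top level of `attributes`. OME-NGFF v0.5 nests OME metadata under the `ome` key, so a store could carry the attribute in either place depending on how it was written. The sketch below is a hypothetical helper (`find_image_label`) that checks both locations. Which one the pipeline uses is an assumption to verify against real output.

```python
def find_image_label(zarr_json: dict):
    """Return the image-label metadata dict, or None if the store is not a label."""
    attrs = zarr_json.get("attributes", {})
    # v0.5-style location: attributes.ome["image-label"]
    if "image-label" in attrs.get("ome", {}):
        return attrs["ome"]["image-label"]
    # Top-level location, as checked elsewhere in this notebook
    return attrs.get("image-label")


nested = {"attributes": {"ome": {"image-label": {"source": {"image": "../../"}}}}}
top_level = {"attributes": {"image-label": {}}}
plain_image = {"attributes": {"ome": {"multiscales": []}}}

for name, meta in [("nested", nested), ("top_level", top_level), ("plain_image", plain_image)]:
    print(f"{name}: is label = {find_image_label(meta) is not None}")
```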

In [None]:
label_store = ZARR_ROOT / "sbs/images/1/A1/0/nuclei.zarr"
with open(label_store / "zarr.json") as f:
    label_meta = json.load(f)

is_label = "image-label" in label_meta.get("attributes", {})
print(f"Is label store: {is_label}")
print(f"Label dtype: {zarr.open_group(str(label_store), mode='r')['0'].dtype}")
print("  ← Should be int32 for segmentation masks")

## 5. Inspect the HCS plate stores

The HCS assembly step creates spec-compliant plate zarrs using symlinks to per-tile data.
Each module that produces zarr images gets its own HCS output:
- **preprocess** — per-modality: `hcs/sbs/1.zarr`, `hcs/phenotype/1.zarr`
- **sbs** — `hcs/1.zarr` (aligned + intermediates + labels)
- **phenotype** — `hcs/1.zarr` (aligned + intermediates + labels)
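
Judging from the paths used in this notebook, a per-tile store such as `sbs/images/1/A1/0/` corresponds to the HCS field `sbs/hcs/1.zarr/A/1/0`. The sketch below shows that mapping with a hypothetical helper (`tile_to_hcs_field`). It assumes well IDs are a letter row followed by a numeric column, which should be checked against the real assembly step.

```python
import re
from pathlib import PurePosixPath


def tile_to_hcs_field(plate: str, well: str, tile: str) -> PurePosixPath:
    """Map (plate, well, tile) to the relative HCS field path, e.g. 1.zarr/A/1/0."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", well)
    if match is None:
        raise ValueError(f"Unrecognised well id: {well!r}")
    row, col = match.groups()
    # HCS layout: <plate>.zarr/<row>/<column>/<field>
    return PurePosixPath(f"{plate}.zarr") / row / col / tile


field = tile_to_hcs_field("1", "A1", "0")
print(field)
```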

In [None]:
# Compare HCS layout across all three modules
hcs_roots = {
    "sbs": ZARR_ROOT / "sbs/hcs",
    "phenotype": ZARR_ROOT / "phenotype/hcs",
    "preprocess": ZARR_ROOT / "preprocess/hcs",
}

for module, hcs_dir in hcs_roots.items():
    if not hcs_dir.exists():
        print(f"{module}: no HCS output")
        continue
    plate_zarrs = sorted(hcs_dir.rglob("*.zarr"))
    paths = [str(p.relative_to(hcs_dir)) for p in plate_zarrs]
    print(f"{module}:  {paths}")

In [None]:
# Inspect plate + well metadata for SBS and phenotype
for module, plate_path in [
    ("sbs", ZARR_ROOT / "sbs/hcs/1.zarr"),
    ("phenotype", ZARR_ROOT / "phenotype/hcs/1.zarr"),
]:
    with open(plate_path / "zarr.json") as f:
        pmeta = json.load(f)
    plate = pmeta["attributes"]["ome"]["plate"]
    wells = [w["path"] for w in plate["wells"]]
    print(f"--- {module} plate 1 ---")
    print(f"  Rows: {[r['name'] for r in plate['rows']]}")
    print(f"  Columns: {[c['name'] for c in plate['columns']]}")
    print(f"  Wells: {wells}")

    # Show fields in first well
    well_dir = plate_path / wells[0]  # well paths are already "row/column"
    with open(well_dir / "zarr.json") as f:
        wmeta = json.load(f)
    fields = [img["path"] for img in wmeta["attributes"]["ome"]["well"]["images"]]
    print(f"  Fields in {wells[0]}: {fields}")
    print()

In [None]:
# Field contents side-by-side: SBS vs phenotype
# SBS has more intermediate image types (per-cycle); phenotype is simpler
for module, field_path in [
    ("sbs", ZARR_ROOT / "sbs/hcs/1.zarr/A/1/0"),
    ("phenotype", ZARR_ROOT / "phenotype/hcs/1.zarr/A/1/2"),
]:
    print(f"--- {module} field ---")
    for item in sorted(field_path.iterdir()):
        kind = "symlink" if item.is_symlink() else "dir" if item.is_dir() else "file"
        print(f"  {item.name:30s}  ({kind})")
    print()

In [None]:
# Labels: verify they match the direct per-tile stores (both modules)
for module, field_labels, direct_nuclei in [
    (
        "sbs",
        ZARR_ROOT / "sbs/hcs/1.zarr/A/1/0/labels",
        ZARR_ROOT / "sbs/images/1/A1/0/nuclei.zarr",
    ),
    (
        "phenotype",
        ZARR_ROOT / "phenotype/hcs/1.zarr/A/1/2/labels",
        ZARR_ROOT / "phenotype/images/1/A1/2/nuclei.zarr",
    ),
]:
    print(f"--- {module} labels ---")
    with open(field_labels / "zarr.json") as f:
        lmeta = json.load(f)
    label_names = lmeta["attributes"]["ome"]["labels"]
    print(f"  Available: {label_names}")

    # Verify HCS path gives same data as direct path
    nuc_hcs = read_image(field_labels / "nuclei")
    nuc_direct = read_image(direct_nuclei)
    print(f"  HCS shape: {nuc_hcs.shape}, direct shape: {nuc_direct.shape}")
    print(f"  Identical: {np.array_equal(nuc_hcs, nuc_direct)}")
    print()

In [None]:
# Preprocess has per-modality HCS subdirs (sbs/, phenotype/) — unique layout
preprocess_hcs = ZARR_ROOT / "preprocess/hcs"
print("Preprocess HCS layout (per-modality):")
for modality_dir in sorted(preprocess_hcs.iterdir()):
    if not modality_dir.is_dir():
        continue
    plate_zarrs = sorted(modality_dir.glob("*.zarr"))
    for pz in plate_zarrs:
        with open(pz / "zarr.json") as f:
            pmeta = json.load(f)
        wells = [w["path"] for w in pmeta["attributes"]["ome"]["plate"]["wells"]]
        print(f"  {modality_dir.name}/{pz.name}: wells={wells}")

# Show field contents — preprocess SBS has per-cycle image subgroups
field = preprocess_hcs / "sbs/1.zarr/A/1/2"
print("\nPreprocess SBS field contents:")
for item in sorted(field.iterdir()):
    kind = "symlink" if item.is_symlink() else "dir" if item.is_dir() else "file"
    print(f"  {item.name:30s}  ({kind})")

---

## 6. Metadata enrichment tasks

The sections below are templates for each enrichment the undergrad will implement.
Each shows what the metadata currently looks like, what it *should* look like
per [OME-NGFF v0.5](https://ngff.openmicroscopy.org/latest/), and a prototype
for writing it.

### References
- [OME-NGFF v0.5 spec](https://ngff.openmicroscopy.org/latest/)
- [HCS plate layout](https://ngff.openmicroscopy.org/latest/#hcs-layout)
- [BioHub spec](../../../zarr3_biohub_spec.md)

### 6a. Axis units

**Current:** Axes have `name` and `type` but no `unit`.

**Target:** Spatial axes should have `"unit": "micrometer"` per OME-NGFF spec.
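
One way to express this as a reusable step is sketched below. `add_space_units` is a hypothetical helper. It returns a copy, leaves non-spatial axes alone, and does not overwrite a unit that is already present.

```python
import copy


def add_space_units(axes: list, unit: str = "micrometer") -> list:
    """Return a copy of axes where every spatial axis has a unit."""
    enriched = copy.deepcopy(axes)
    for ax in enriched:
        if ax.get("type") == "space":
            ax.setdefault("unit", unit)  # keep an existing unit untouched
    return enriched


axes_before = [
    {"name": "c", "type": "channel"},
    {"name": "y", "type": "space"},
    {"name": "x", "type": "space"},
]
axes_after = add_space_units(axes_before)
for ax in axes_after:
    print(ax)
```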

In [None]:
# Current axes
print("Current axes (missing units):")
for ax in ms["axes"]:
    print(f"  {ax}")

# What they should look like:
target_axes = [
    {"name": "c", "type": "channel"},
    {"name": "y", "type": "space", "unit": "micrometer"},
    {"name": "x", "type": "space", "unit": "micrometer"},
]
print("\nTarget axes (with units):")
for ax in target_axes:
    print(f"  {ax}")

### 6b. Pixel sizes in coordinateTransformations

**Current:** All scale factors are `[1.0, 1.0, 1.0]` (placeholder).

**Target:** Use real pixel sizes from the hardware metadata parquet.
At level 0, scale = `[1.0, pixel_size_y, pixel_size_x]`.
At level N, scale = `[1.0, pixel_size_y * 2^N, pixel_size_x * 2^N]`.
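
The formula above can be sketched as a small function. `build_scale_transforms` is a hypothetical name, and the pixel size in the usage line is an arbitrary illustrative value, not one read from the pipeline.

```python
def build_scale_transforms(px_y: float, px_x: float, n_levels: int, factor: int = 2) -> list:
    """Return one coordinateTransformations list per pyramid level (axes: c, y, x)."""
    transforms = []
    for level in range(n_levels):
        step = factor**level  # each level is coarsened by `factor` in y and x
        transforms.append(
            [{"type": "scale", "scale": [1.0, px_y * step, px_x * step]}]
        )
    return transforms


per_level = build_scale_transforms(px_y=0.5, px_x=0.5, n_levels=3)
for level, transform in enumerate(per_level):
    print(f"level {level}: {transform[0]['scale']}")
```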

In [None]:
# Get pixel sizes from hardware metadata
px_x = ph_meta["pixel_size_x"].iloc[0]
px_y = ph_meta["pixel_size_y"].iloc[0]
print(f"Pixel size from metadata: X={px_x} µm, Y={px_y} µm")

# Build correct coordinateTransformations for 5 pyramid levels
coarsening_factor = 2
n_levels = 5
print("\nTarget coordinateTransformations:")
for i in range(n_levels):
    scale_y = px_y * (coarsening_factor**i)
    scale_x = px_x * (coarsening_factor**i)
    print(f"  level {i}: scale=[1.0, {scale_y:.4f}, {scale_x:.4f}]")

### 6c. Channel names

**Current:** Channels are labeled `c0`, `c1`, `c2`, `c3` (generic).

**Target:** Meaningful names from config, e.g., `DAPI`, `COXIV`, `CENPA`, `WGA`.
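
A minimal sketch of the renaming step follows, assuming the channel entries are dicts with a `label` key, as in the cells of this notebook. `rename_channels` is a hypothetical helper. It keeps any other keys on each channel and refuses to proceed if the number of names does not match.

```python
def rename_channels(channels: list, names: list) -> list:
    """Return new channel dicts with labels replaced by names, other keys kept."""
    if len(channels) != len(names):
        raise ValueError(
            f"Got {len(names)} names for {len(channels)} channels"
        )
    return [{**ch, "label": name} for ch, name in zip(channels, names)]


generic = [{"label": "c0", "active": True}, {"label": "c1", "active": True}]
renamed = rename_channels(generic, ["DAPI", "WGA"])
print([ch["label"] for ch in renamed])
```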

In [None]:
# Current
print("Current channel labels:")
for ch in omero.get("channels", []):
    print(f"  {ch['label']}")

# From config
phenotype_channels = ["DAPI", "COXIV", "CENPA", "WGA"]
sbs_channels = ["DAPI", "G", "T", "A", "C"]

print(f"\nTarget phenotype channels: {phenotype_channels}")
print(f"Target SBS channels: {sbs_channels}")

### 6d. Contrast limits (rendering window)

**Current:** No `window` key in channel metadata.

**Target:** Each channel gets `"window": {"start": p1, "end": p99, "min": 0, "max": dtype_max}`,
where `start` and `end` are the 1st and 99th percentile intensities and `max` is the largest value the dtype can hold.
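
The window computation can be sketched as a standalone function that runs on synthetic data. `compute_window` is a hypothetical name. Using `1.0` as the maximum for non-integer dtypes assumes float images are normalized to that range.

```python
import numpy as np


def compute_window(channel: np.ndarray, lo: float = 1.0, hi: float = 99.0) -> dict:
    """Build an omero-style window dict from percentile intensities."""
    start, end = np.percentile(channel, [lo, hi])
    if np.issubdtype(channel.dtype, np.integer):
        dtype_max = int(np.iinfo(channel.dtype).max)
    else:
        dtype_max = 1.0  # assumption: float images are scaled to [0, 1]
    return {"start": float(start), "end": float(end), "min": 0, "max": dtype_max}


# Synthetic ramp 0..100 standing in for one channel of a tile
ramp = np.arange(101, dtype=np.uint16).reshape(1, 101)
window = compute_window(ramp)
print(window)
```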

In [None]:
# Compute contrast limits from actual data
img = read_image(tile_store)
print(f"Image shape: {img.shape} (channels, y, x)")

for i, ch_name in enumerate(phenotype_channels):
    ch_data = img[i]
    p1 = float(np.percentile(ch_data, 1))
    p99 = float(np.percentile(ch_data, 99))
    dtype_max = (
        int(np.iinfo(ch_data.dtype).max)
        if np.issubdtype(ch_data.dtype, np.integer)
        else 1.0
    )
    print(
        f"  {ch_name}: window={{start: {p1:.0f}, end: {p99:.0f}, min: 0, max: {dtype_max}}}"
    )

### 6e. Label dtype and segmentation metadata

**Current:** Labels may be float or have inconsistent dtypes.

**Target:** Segmentation masks should be `int32`. The `image-label` attribute can
include source info (method, label identity).
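
Converting to `int32` is only safe if the values are whole numbers that fit in the `int32` range. The sketch below is a hypothetical helper (`to_int32_labels`) that checks both conditions before casting. It is not the pipeline's actual conversion.

```python
import numpy as np


def to_int32_labels(labels: np.ndarray) -> np.ndarray:
    """Cast a label array to int32, refusing lossy conversions."""
    if labels.dtype == np.int32:
        return labels
    if not np.issubdtype(labels.dtype, np.integer):
        # Float labels must be finite and integer-valued
        if not np.all(np.isfinite(labels)) or not np.array_equal(labels, np.round(labels)):
            raise ValueError("Labels contain non-integer or non-finite values")
    info = np.iinfo(np.int32)
    if labels.size and (labels.min() < info.min or labels.max() > info.max):
        raise ValueError("Label values do not fit in int32")
    return labels.astype(np.int32)


float_labels = np.array([[0.0, 1.0], [2.0, 2.0]])
converted = to_int32_labels(float_labels)
print(converted.dtype, converted.tolist())
```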

In [None]:
# Check label dtypes across modules
for label_path in sorted(ZARR_ROOT.rglob("nuclei.zarr")):
    root = zarr.open_group(str(label_path), mode="r")
    dtype = root["0"].dtype
    rel = label_path.relative_to(ZARR_ROOT)
    status = "OK" if dtype == np.int32 else f"NEEDS CONVERSION (currently {dtype})"
    print(f"  {rel}: {status}")

---

## 7. Prototype: writing enriched metadata to a zarr store

This shows how to modify a zarr store's `zarr.json` in place.
The undergrad should use this pattern to implement each enrichment.
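
When editing `zarr.json` in place on real outputs, an interrupted write could leave a truncated file. One common mitigation, sketched below with a hypothetical helper (`write_json_atomic`), is to write to a temporary file in the same directory and then rename it over the target. The usage runs in a throwaway directory rather than a pipeline store.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path, payload: dict) -> None:
    """Write payload to path via a temp file in the same directory, then rename."""
    path = Path(path)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(payload, f, indent=2)
        # Replaces the target in one step when both files share a filesystem
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


# Usage on a throwaway directory (not a real pipeline store)
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "zarr.json"
    write_json_atomic(target, {"attributes": {"ome": {"version": "0.5"}}})
    roundtrip = json.loads(target.read_text())
    leftovers = [p.name for p in Path(tmp).iterdir() if p.name != "zarr.json"]
print(roundtrip)
```

Note that `tempfile.mkstemp` creates files readable only by the owner, so permissions may need adjusting if the store is shared.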

In [None]:
import shutil

# Work on a copy so we don't modify pipeline outputs
demo_store = Path("./demo_enriched.zarr")
if demo_store.exists():
    shutil.rmtree(demo_store)
shutil.copytree(tile_store, demo_store)

# Read current metadata
with open(demo_store / "zarr.json") as f:
    meta = json.load(f)

# --- Enrich axes with units ---
ms_meta = meta["attributes"]["ome"]["multiscales"][0]
for ax in ms_meta["axes"]:
    if ax["type"] == "space":
        ax["unit"] = "micrometer"

# --- Enrich pixel sizes ---
px_x = ph_meta["pixel_size_x"].iloc[0]
px_y = ph_meta["pixel_size_y"].iloc[0]
for i, ds in enumerate(ms_meta["datasets"]):
    factor = 2**i
    ds["coordinateTransformations"] = [
        {"type": "scale", "scale": [1.0, px_y * factor, px_x * factor]}
    ]

# --- Enrich channel names + contrast limits ---
img = read_image(tile_store)
channels_enriched = []
for i, name in enumerate(phenotype_channels):
    ch_data = img[i]
    p1, p99 = float(np.percentile(ch_data, 1)), float(np.percentile(ch_data, 99))
    dtype_max = (
        int(np.iinfo(ch_data.dtype).max)
        if np.issubdtype(ch_data.dtype, np.integer)
        else 1.0
    )
    channels_enriched.append(
        {
            "label": name,
            "active": True,
            "color": "FFFFFF",
            "window": {"start": p1, "end": p99, "min": 0, "max": dtype_max},
        }
    )
# setdefault avoids a KeyError if the store has no "omero" block yet
meta["attributes"].setdefault("omero", {})["channels"] = channels_enriched

# --- Write back ---
with open(demo_store / "zarr.json", "w") as f:
    json.dump(meta, f, indent=2)

print("Enriched zarr.json:")
print(json.dumps(meta, indent=2))

In [None]:
# Verify the enriched store still reads correctly
img_enriched = read_image(demo_store)
img_original = read_image(tile_store)
print(
    f"Data unchanged after metadata enrichment: {np.array_equal(img_enriched, img_original)}"
)

# Cleanup
shutil.rmtree(demo_store)

---

## 8. What needs to happen in the pipeline

Once the enrichment logic is prototyped here, integrate it into the pipeline:

1. **`save_image()` / `write_image_omezarr()`** — Accept and write richer metadata
   (pixel sizes, channel names, contrast limits)
2. **Snakemake rules** — Pass config values (pixel_size_um, channel names) as `params`
3. **Scripts** — Forward `snakemake.params` metadata into `save_image()` calls
4. **Config YAML** — Add `pixel_size_um`, channel name lists if not already present

### Key question: does metadata propagate through read → process → write?

To test this:
1. Write metadata to a store
2. Read with `read_image()`
3. Process the array (e.g., crop, filter)
4. Write to a new store with `save_image()`
5. Check: is the metadata preserved?

Currently `read_image()` returns a numpy array and discards metadata,
so metadata will NOT propagate automatically. The pipeline needs explicit forwarding.
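
One possible shape for explicit forwarding is to carry the array and its metadata together, so that processing steps return both. The sketch below is purely illustrative: `ImageWithMeta` and `crop` are hypothetical and not part of brieflow, and the pixel size and channel names are placeholder values.

```python
from dataclasses import dataclass, replace

import numpy as np


@dataclass(frozen=True)
class ImageWithMeta:
    """Array plus the metadata that should survive processing (hypothetical)."""

    data: np.ndarray  # shape (channels, y, x)
    pixel_size_um: float
    channel_names: tuple


def crop(img: ImageWithMeta, y0: int, y1: int, x0: int, x1: int) -> ImageWithMeta:
    """Crop in y/x; pixel size and channel names are unchanged by cropping."""
    return replace(img, data=img.data[:, y0:y1, x0:x1])


original = ImageWithMeta(
    data=np.zeros((2, 8, 8), dtype=np.uint16),
    pixel_size_um=0.5,
    channel_names=("DAPI", "WGA"),
)
cropped = crop(original, 2, 6, 2, 6)
print(cropped.data.shape, cropped.pixel_size_um, cropped.channel_names)
```

A step that changes resolution (for example downsampling) would need to update `pixel_size_um` itself, which is the kind of decision that cannot be made automatically.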

### Visual validation with Napari

After enriching metadata, verify it renders correctly in Napari using
`tests/viewer/load_omezarr_in_napari.py`. That script reads the same
`omero.pixel_size`, `omero.channels`, and `labels/` metadata you're writing here.
It uses `omero.channels[].window` for contrast limits when available, falling back
to percentile-based limits otherwise — so enrichment has a visible effect.

```bash
# On your laptop (not cluster — needs a display)
conda create -n napari-viz -c conda-forge python=3.11 napari zarr numpy -y
conda activate napari-viz

# Per-tile store:
python tests/viewer/load_omezarr_in_napari.py output/sbs/images/1/A1/0/aligned.zarr

# HCS field (labels nested under labels/):
python tests/viewer/load_omezarr_in_napari.py output/sbs/hcs/1.zarr/A/1/0
```

What to check:
- Channel names visible (not `c0`, `c1`, ...)
- Scale bar shows correct physical coordinates
- Contrast limits produce sensible default rendering
- Segmentation labels overlay correctly on images
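
Some of these checks can be approximated on the metadata before opening Napari. The sketch below is a hypothetical pre-check (`enrichment_problems`), assuming the `attributes.ome.multiscales` and `attributes.omero.channels` layout used in this notebook. The placeholder-scale test is a heuristic and would wrongly flag a store whose real pixel size is exactly 1.0.

```python
import re


def enrichment_problems(zarr_json: dict) -> list:
    """Return human-readable problems found in a store's metadata dict."""
    problems = []
    attrs = zarr_json.get("attributes", {})
    multiscales = attrs.get("ome", {}).get("multiscales") or [{}]
    ms = multiscales[0]

    for ax in ms.get("axes", []):
        if ax.get("type") == "space" and "unit" not in ax:
            problems.append(f"axis {ax.get('name')!r} has no unit")

    datasets = ms.get("datasets", [])
    if datasets:
        transforms = datasets[0].get("coordinateTransformations", [])
        scale = next((t["scale"] for t in transforms if t.get("type") == "scale"), None)
        if scale is None or all(s == 1.0 for s in scale):
            problems.append("level 0 scale looks like a placeholder")

    for ch in attrs.get("omero", {}).get("channels", []):
        label = ch.get("label", "")
        if re.fullmatch(r"c\d+", label):
            problems.append(f"channel {label!r} still has a generic name")
        if "window" not in ch:
            problems.append(f"channel {label!r} has no contrast window")
    return problems


raw = {
    "attributes": {
        "ome": {"multiscales": [{
            "axes": [{"name": "c", "type": "channel"}, {"name": "y", "type": "space"}, {"name": "x", "type": "space"}],
            "datasets": [{"path": "0", "coordinateTransformations": [{"type": "scale", "scale": [1.0, 1.0, 1.0]}]}],
        }]},
        "omero": {"channels": [{"label": "c0"}]},
    }
}
enriched = {
    "attributes": {
        "ome": {"multiscales": [{
            "axes": [{"name": "c", "type": "channel"}, {"name": "y", "type": "space", "unit": "micrometer"}, {"name": "x", "type": "space", "unit": "micrometer"}],
            "datasets": [{"path": "0", "coordinateTransformations": [{"type": "scale", "scale": [1.0, 0.5, 0.5]}]}],
        }]},
        "omero": {"channels": [{"label": "DAPI", "window": {"start": 0, "end": 1, "min": 0, "max": 65535}}]},
    }
}
raw_problems = enrichment_problems(raw)
enriched_problems = enrichment_problems(enriched)
print(raw_problems)
print(enriched_problems)
```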

In [None]:
# Demonstrate: metadata does NOT survive read → write roundtrip
demo_src = Path("./demo_src.zarr")
demo_dst = Path("./demo_dst.zarr")

# Write with pixel sizes
save_image(img_original, demo_src, pixel_size=0.65, channel_names=phenotype_channels)

# Read back (only gets array, metadata is lost)
arr = read_image(demo_src)

# Write to new store (no metadata forwarded)
save_image(arr, demo_dst)

# Compare metadata
with open(demo_src / "zarr.json") as f:
    src_meta = json.load(f)
with open(demo_dst / "zarr.json") as f:
    dst_meta = json.load(f)

src_channels = [ch["label"] for ch in src_meta["attributes"]["omero"]["channels"]]
dst_channels = [ch["label"] for ch in dst_meta["attributes"]["omero"]["channels"]]
print(f"Source channels: {src_channels}")
print(f"Dest channels:   {dst_channels}  ← reverted to generic names")
print("\n→ Metadata must be explicitly forwarded through pipeline scripts.")

# Cleanup
shutil.rmtree(demo_src)
shutil.rmtree(demo_dst)