# Measurement Set Staging Workflow

This notebook walks through staging Measurement Set (MS) jobs on the fast NVMe scratch volume and archiving the finished products back to `/data`.

## 1. Prerequisites

- You are running this notebook from the `dsa110-contimg` repository (any subdirectory is fine).
- `scripts/scratch_sync.sh` is executable.
- The `casa6` conda environment is available and contains the MS conversion tools.
- A set of UVH5 inputs exists under `/data/incoming/<run_id>/`.
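
The checks in this list can be scripted before any data is moved. The following is a minimal sketch, not part of the repository: `check_prerequisites` is a hypothetical helper, and it only confirms that `conda` is on `PATH` rather than inspecting the `casa6` environment itself.

```python
import os
import shutil
from pathlib import Path


def check_prerequisites(repo_root, input_dir):
    """Return a list of problems; an empty list means every check passed."""
    problems = []

    # The sync helper must exist and carry an execute bit.
    script = Path(repo_root) / "scripts" / "scratch_sync.sh"
    if not script.is_file():
        problems.append(f"missing script: {script}")
    elif not os.access(script, os.X_OK):
        problems.append(f"script is not executable: {script}")

    # Only checks that conda is reachable, not that casa6 exists.
    if shutil.which("conda") is None:
        problems.append("conda not found on PATH")

    # The UVH5 inputs must be present and non-empty.
    inputs = Path(input_dir)
    if not inputs.is_dir():
        problems.append(f"input directory does not exist: {inputs}")
    elif not any(inputs.iterdir()):
        problems.append(f"input directory is empty: {inputs}")

    return problems
```

Calling `check_prerequisites(REPO_ROOT, INPUT_DIR)` after the setup cell and printing the result gives a quick go/no-go summary.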

The cells below set up convenience variables and check that the scratch area has room.

In [None]:
from pathlib import Path
import os
import subprocess

# Resolve the repository root using git so notebook location does not matter.
REPO_ROOT = Path(subprocess.check_output([
    "git", "rev-parse", "--show-toplevel"
]).decode().strip())

# Configure the dataset you plan to convert.
RUN_ID = "0834_555_transit"  # <-- change if you are working on a different run
INPUT_DIR = f"/data/incoming/{RUN_ID}"  # UVH5 inputs (leave on /data unless you want to stage them)
SCRATCH_ROOT = "/scratch/dsa110-contimg"
SCRATCH_MS = f"{SCRATCH_ROOT}/data-samples/ms/{RUN_ID}"


# Export for use in subsequent shell cells.
os.environ["REPO_ROOT"] = str(REPO_ROOT)
os.environ["RUN_ID"] = RUN_ID
os.environ["INPUT_DIR"] = INPUT_DIR
os.environ["SCRATCH_ROOT"] = SCRATCH_ROOT
os.environ["SCRATCH_MS"] = SCRATCH_MS

print(f"Repository root: {REPO_ROOT}")
print(f"Run identifier: {RUN_ID}")
print(f"Input directory: {INPUT_DIR}")
print(f"Scratch MS directory: {SCRATCH_MS}")

Repository root: /data/dsa110-contimg
Run identifier: 0834_555_transit
Input directory: /data/incoming/0834_555_transit
Scratch MS directory: /scratch/dsa110-contimg/data-samples/ms/0834_555_transit


In [2]:
%%bash
set -euo pipefail
cd "$REPO_ROOT"

scripts/scratch_sync.sh status

Scratch usage (/scratch/dsa110-contimg):
  (empty)

Filesystem usage:
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p2  916G  370G  500G  43% /
/dev/sdb1        13T  4.7T  8.1T  37% /data
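
The status output shows free space, but judging whether it is enough is left to the reader. A rough sketch of an automated check follows; `directory_size` and `has_room` are hypothetical helpers, and the default `headroom` factor of 2.0 is an arbitrary assumption, since the size of the resulting Measurement Set relative to its UVH5 inputs is not stated here.

```python
import shutil
from pathlib import Path


def directory_size(path):
    """Total size in bytes of all regular files under path."""
    return sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file())


def has_room(source_dir, target_volume, headroom=2.0):
    """True if target_volume has at least headroom x size(source_dir) free."""
    needed = directory_size(source_dir) * headroom
    free = shutil.disk_usage(target_volume).free
    return free >= needed
```

For example, `has_room(INPUT_DIR, "/scratch")` could gate the staging and conversion cells.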


## 2. (Optional) Stage UVH5 inputs to scratch

Sequential reads from `/data` are usually fast enough, so this step is optional. Run it if you intend to reuse the same UVH5 set multiple times or want to keep *all* I/O on ext4.

Remove `--dry-run` to perform the copy.

In [4]:
%%bash
set -euo pipefail
cd "$REPO_ROOT"

scripts/scratch_sync.sh stage "incoming/$RUN_ID" --dry-run

Running rsync from /data/incoming/0834_555_transit/ to /scratch/dsa110-contimg/incoming/0834_555_transit/
          2.33G 100% 2165.86GB/s    0:00:00 (xfr#16, to-chk=0/17)

sent 447 bytes  received 67 bytes  1.03K bytes/sec
total size is 2.33G  speedup is 4,524,470.04 (DRY RUN)
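
The same subcommands can also be driven from Python, which makes it harder to drop `--dry-run` by accident. This is a sketch under assumptions: `scratch_sync_cmd` and `run_scratch_sync` are hypothetical wrappers, and they use only the subcommands (`stage`, `archive`, `clean`, `status`) and the `--dry-run` flag that appear in this notebook.

```python
import subprocess
from pathlib import Path

# Subcommands seen in this notebook that take a relative path argument.
ACTIONS_WITH_PATH = {"stage", "archive", "clean"}


def scratch_sync_cmd(repo_root, action, rel_path=None, dry_run=True):
    """Build the argument list for scripts/scratch_sync.sh (dry run by default)."""
    script = str(Path(repo_root) / "scripts" / "scratch_sync.sh")
    if action == "status":
        return [script, "status"]
    if action not in ACTIONS_WITH_PATH:
        raise ValueError(f"unknown action: {action!r}")
    if not rel_path:
        raise ValueError(f"action {action!r} needs a relative path")
    cmd = [script, action, rel_path]
    if dry_run:
        cmd.append("--dry-run")
    return cmd


def run_scratch_sync(repo_root, action, rel_path=None, dry_run=True):
    """Run the helper from the repository root and raise on a non-zero exit."""
    cmd = scratch_sync_cmd(repo_root, action, rel_path, dry_run)
    return subprocess.run(cmd, cwd=repo_root, check=True)
```

With this wrapper, a real copy requires an explicit `dry_run=False`, for example `run_scratch_sync(REPO_ROOT, "stage", f"incoming/{RUN_ID}", dry_run=False)`.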


## 3. Run the MS converter on scratch

Point the converter at the scratch directory so the Measurement Set is written on ext4. Update arguments as needed for your workflow.
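
The converter takes a start and end time, and a malformed or reversed window is cheap to catch before launching `conda run`. A minimal sketch, assuming the timestamps follow the `YYYY-MM-DD HH:MM:SS` form used in the shell cell of this section; `parse_window` is a hypothetical helper, and the converter itself may accept other formats.

```python
from datetime import datetime

# Assumed format, taken from the START_TIME / END_TIME values in this notebook.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_window(start_time, end_time):
    """Parse both timestamps and require that the window has positive length."""
    start = datetime.strptime(start_time, TIME_FORMAT)
    end = datetime.strptime(end_time, TIME_FORMAT)
    if end <= start:
        raise ValueError(f"end time {end_time!r} is not after start time {start_time!r}")
    return start, end
```

`datetime.strptime` raises `ValueError` on a malformed string, so a single `try`/`except ValueError` around `parse_window` covers both failure modes.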

The cell below prints the converter command with `<start_time>` and `<end_time>` placeholders. Fill them in, review the command, then run it either in a terminal or in the optional shell cell that follows.

In [None]:
# Join with backslash continuations so the printed command can be pasted into a shell.
convert_cmd = " \\\n".join([
    "conda run -n casa6 python -m dsa110_contimg.conversion.strategies.hdf5_orchestrator",
    f"    \"{INPUT_DIR}\"",
    f"    \"{SCRATCH_MS}\"",
    "    \"<start_time>\"",
    "    \"<end_time>\"",
    "    --log-level INFO",
])
print(convert_cmd)

conda run -n casa6 python -m dsa110_contimg.conversion.strategies.hdf5_orchestrator \
    "/data/incoming/0834_555_transit" \
    "/scratch/dsa110-contimg/data-samples/ms/0834_555_transit" \
    "<start_time>" \
    "<end_time>" \
    --log-level INFO


In [None]:
%%bash
set -euo pipefail
cd "$REPO_ROOT"

START_TIME="2025-10-03 15:15:30"
END_TIME="2025-10-03 15:16:00"

PYTHONPATH=${PYTHONPATH:-}
export PYTHONPATH="$REPO_ROOT/src${PYTHONPATH:+:$PYTHONPATH}"

conda run -n casa6 python -m dsa110_contimg.conversion.strategies.hdf5_orchestrator \
    "$INPUT_DIR" \
    "$SCRATCH_MS" \
    "$START_TIME" \
    "$END_TIME" \
    --log-level DEBUG \
    --scratch-dir "$SCRATCH_ROOT"



TypeError: %d format: a real number is required, not NoneType


CondaError: KeyboardInterrupt


## 4. Validate results while they reside on scratch

Insert any QA or imaging steps here. Keeping the intermediate products on `/scratch` avoids slow writes on NTFS.
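
As a first, very shallow check before real QA, it helps to confirm that the converter actually left something table-shaped on scratch. A Measurement Set is a casacore table directory, which stores a `table.dat` file at its top level. The sketch below relies only on that fact; `looks_like_ms` and `find_measurement_sets` are hypothetical helpers, and whether the converter writes one MS at `SCRATCH_MS` or several inside it is an assumption handled by checking both.

```python
from pathlib import Path


def looks_like_ms(path):
    """True if path is a directory with a top-level table.dat file."""
    path = Path(path)
    return path.is_dir() and (path / "table.dat").is_file()


def find_measurement_sets(root):
    """Return root itself if it is an MS, else its immediate MS-like children."""
    root = Path(root)
    if looks_like_ms(root):
        return [root]
    if not root.is_dir():
        return []
    return sorted(p for p in root.iterdir() if looks_like_ms(p))
```

An empty result from `find_measurement_sets(SCRATCH_MS)` means the conversion produced nothing usable. A non-empty result says nothing about data quality, so it complements the QA steps rather than replacing them.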

## 5. Archive the finished Measurement Set back to `/data`

When the dataset looks good, sync it to the long-term volume. The dry-run below previews the copy; remove `--dry-run` to commit.

In [None]:
%%bash
set -euo pipefail
cd "$REPO_ROOT"

scripts/scratch_sync.sh archive "data-samples/ms/$RUN_ID" --dry-run
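
Once the real archive has run, the copy on `/data` can be compared against the scratch copy before anything is deleted. The sketch below compares relative file paths and sizes only, not checksums. `file_manifest` and `compare_trees` are hypothetical helpers, and the archive destination is left as a parameter because the path mapping used by `scratch_sync.sh archive` is not shown in this notebook.

```python
from pathlib import Path


def file_manifest(root):
    """Map each file's path relative to root onto its size in bytes."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): p.stat().st_size
        for p in root.rglob("*")
        if p.is_file()
    }


def compare_trees(scratch_dir, archive_dir):
    """Return (missing, size_mismatch): files absent from or differing in the archive."""
    src = file_manifest(scratch_dir)
    dst = file_manifest(archive_dir)
    missing = sorted(set(src) - set(dst))
    size_mismatch = sorted(k for k in src.keys() & dst.keys() if src[k] != dst[k])
    return missing, size_mismatch
```

Two empty lists indicate that every scratch file has a same-sized counterpart in the archive. A checksum pass would be stricter at the cost of rereading all the data.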

## 6. Clean up scratch storage

After confirming the archive on `/data`, remove the scratch copy to free SSD space. Again, remove `--dry-run` to actually delete the staged data.

In [None]:
%%bash
set -euo pipefail
cd "$REPO_ROOT"

scripts/scratch_sync.sh clean "data-samples/ms/$RUN_ID" --dry-run

In [None]:
%%bash
set -euo pipefail
cd "$REPO_ROOT"

scripts/scratch_sync.sh status