# 1. Register datasets and paths (minimal, reproducible)
Create a single source of truth, `configs/datasets.yaml`, that records where each raw EEG/EMG dataset lives locally, together with the target sampling rate and channel subset needed to match OpenBCI hardware. Notebooks stay reserved for experiments and `/src` for reusable code.
#### Rationale:
- Guarantee **reproducibility** by centralising dataset paths and preprocessing targets.
- Enforce a **clear contract** early (names, rates, channels) before any loader code.
- Enable **device-matching** via explicit resampling targets (e.g., 125 Hz).
- Keep **scope small**: write only one config file now; no loaders yet.
- Promote **human readability** so that configuration can be easily reviewed or version-controlled.

In [2]:
# Step 1: Create a reproducible dataset registry at configs/datasets.yaml
# ------------------------------------------------------------------------
# This code writes a YAML configuration file that records:
#   1. The local folder where each dataset has been extracted.
#   2. The desired target sampling frequency (so all data later matches OpenBCI rates).
#   3. The subset of channels that are relevant for motor-imagery or muscle activity.
# The cell is idempotent: re-running it simply overwrites the file with the same content.

from pathlib import Path

# Resolve the project root as the parent of the current working directory.
# This assumes the notebook runs from a first-level subfolder of neurassist/.
PROJECT_ROOT = Path("..").resolve()

# Define the path to the 'configs' directory.
CONFIG_DIR = PROJECT_ROOT / "configs"

# Create it if it does not exist to ensure a consistent structure.
CONFIG_DIR.mkdir(parents=True, exist_ok=True)

# Compose the YAML content. Triple quotes allow multi-line strings for readability.
# Each dataset section includes local_path, target_sampling_hz, and channels_keep.
YAML_TEXT = """# configs/datasets.yaml
# Central registry of dataset paths and sampling targets.
# Edit only the paths to point to local copies of the datasets.
# All subsequent scripts will read from this file.

eeg:
  physionet_eegmmidb:
    local_path: "data_raw/eeg/physionet_eegmmidb"
    target_sampling_hz: 125
    channels_keep: ["C3", "Cz", "C4", "FC3", "FC4", "CP3", "CP4"]

  bnci_002_2014:
    local_path: "data_raw/eeg/bnci_002_2014"
    target_sampling_hz: 125
    channels_keep: ["C3", "Cz", "C4", "FC3", "FC4", "CP3", "CP4"]

emg:
  ninapro_db2:
    local_path: "data_raw/emg/ninapro_db2"
    target_sampling_hz: 250
    channels_keep: ["forearm_flexors", "forearm_extensors", "biceps_brachii", "triceps_brachii"]
"""

# Define the full path of the output YAML file.
OUTPUT_FILE = CONFIG_DIR / "datasets.yaml"

# Write the YAML text to disk using UTF-8 encoding.
# If the file already exists, it will be overwritten with the same content.
OUTPUT_FILE.write_text(YAML_TEXT, encoding="utf-8")

# Print confirmation to verify the file creation and its absolute path.
print(f'Dataset registry written successfully at: {OUTPUT_FILE}')

Dataset registry written successfully at: /lab/px/anichlabs/neurassist/configs/datasets.yaml
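
The "clear contract" from the rationale can also be checked mechanically once the registry is parsed into a dictionary. The sketch below is a minimal illustration, not part of the project code: `REQUIRED_KEYS` and `check_entry` are hypothetical names, and the two sample entries are hand-written stand-ins for parsed YAML (the second is deliberately malformed).

```python
# Hypothetical contract check; the key names mirror configs/datasets.yaml from Step 1.
REQUIRED_KEYS = {"local_path": str, "target_sampling_hz": int, "channels_keep": list}


def check_entry(name, entry):
    """Return a list of human-readable problems found in one dataset entry."""
    problems = []
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in entry:
            problems.append(f"{name}: missing '{key}'")
        elif not isinstance(entry[key], expected_type):
            problems.append(f"{name}: '{key}' should be {expected_type.__name__}")
    return problems


# A well-formed entry, shaped like the parsed YAML would be.
good = {
    "local_path": "data_raw/eeg/physionet_eegmmidb",
    "target_sampling_hz": 125,
    "channels_keep": ["C3", "Cz", "C4"],
}

# A deliberately broken entry: the rate is a string and channels_keep is absent.
bad = {"local_path": "data_raw/emg/ninapro_db2", "target_sampling_hz": "250"}

print(check_entry("physionet_eegmmidb", good))
print(check_entry("ninapro_db2", bad))
```

Returning a list of problems instead of raising on the first one lets a single run report every malformed entry, which is convenient when several paths or keys are edited at once.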


# 2. Verify dataset registry and local folders
Read the YAML registry created in Step 1 and confirm that each dataset path exists locally and is not empty. This ensures that later loaders have valid data sources before any preprocessing or modelling starts.
#### Rationale:
- Validate **file-system consistency** early to avoid silent path errors later.
- Build a small reusable utility to check dataset readiness.
- Teach how to use the `yaml` library safely (`yaml.safe_load`) for reproducible configuration parsing.
- Reinforce good engineering practice: verify assumptions before executing data pipelines.
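
To make the `yaml.safe_load` point concrete, the short sketch below contrasts a plain configuration fragment with a document that carries a Python-specific tag. It assumes PyYAML is installed; both YAML strings are invented for illustration and are not part of the registry.

```python
import yaml  # PyYAML

# A plain registry fragment parses into ordinary dicts, strings and ints.
plain = yaml.safe_load("eeg:\n  demo:\n    target_sampling_hz: 125\n")
print(plain)

# A document with a Python-specific tag asks the loader to call a function.
# The safe loader has no constructor for such tags and refuses to build it.
hostile = "!!python/object/apply:os.system ['echo unsafe']"
try:
    yaml.safe_load(hostile)
    rejected = False
except yaml.YAMLError as exc:
    rejected = True
    print(f"Rejected: {type(exc).__name__}")
```

Because the safe loader only builds standard YAML types, a registry edited by hand or pulled from version control cannot trigger code execution when it is parsed.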

In [None]:
# Step 2: Validate datasets.yaml and check dataset folders.
# ---------------------------------------------------------
# This code reads the YAML file created in Step 1 and verifies:
#   1. The file exists and is readable.
#   2. Each dataset section has a 'local_path' field.
#   3. Each referenced directory exists and contains at least one file.
# This prevents downstream errors when trying to load missing datasets.

from pathlib import Path
import yaml  # PyYAML; install with `pip install pyyaml` if it is missing.
import os

# Define paths.
PROJECT_ROOT = Path("..").resolve()
CONFIG_FILE  = PROJECT_ROOT / "configs" / "datasets.yaml"

# Check that the YAML file exists.
if not CONFIG_FILE.exists():
    raise FileNotFoundError(f'Configuration file not found: {CONFIG_FILE}')

# Load the YAML content safely.
with CONFIG_FILE.open("r", encoding="utf-8") as f:
    CONFIG = yaml.safe_load(f)

# Function to verify that each dataset path exists and is non-empty.
def verify_dataset(config_dict):
    """
    Iterate through EEG and EMG entries, verifying that the given
    local_path exists and contains at least one file or subdirectory.
    """
    for domain, datasets in config_dict.items