# Sidecar Metadata (JSON/YAML) and Data Dictionaries

In medical data integration, metadata determines whether a dataset can be shared across systems, reproduced, and interpreted correctly. This notebook covers sidecar metadata files and data dictionaries, two common ways to document and standardize medical datasets.

## What is Sidecar Metadata?

Sidecar metadata refers to separate files that accompany primary data files, containing descriptive information about the data structure, collection parameters, and other relevant details. In medical imaging and research, these are commonly stored as JSON or YAML files.
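
A sidecar usually shares the base name of the data file it describes and differs only in its extension. The sketch below illustrates that pairing. The helper name `sidecar_path_for` and the example file names are hypothetical, and the special handling of `.nii.gz` is an assumption about which double extensions appear in the dataset.

```python
from pathlib import Path


def sidecar_path_for(data_file, sidecar_suffix=".json"):
    """Return the path of the sidecar that would accompany data_file."""
    path = Path(data_file)
    name = path.name
    # Compressed NIfTI files carry a double extension, so strip both parts
    if name.endswith(".nii.gz"):
        base = name[: -len(".nii.gz")]
    else:
        base = path.stem
    return path.with_name(base + sidecar_suffix)


print(sidecar_path_for("data/sub-01_T1w.nii.gz"))
print(sidecar_path_for("scan.dcm", sidecar_suffix=".yaml"))
```

Keeping the pairing rule in one function means a change of convention only has to be made in one place.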

Let's start by importing the necessary libraries for working with JSON and YAML files.

In [None]:
import json
import yaml
import pandas as pd
from pathlib import Path

## Creating a JSON Sidecar File

Let's create a simple JSON sidecar file for a medical imaging dataset following BIDS (Brain Imaging Data Structure) conventions.

In [None]:
# Create metadata for an MRI scan
# BIDS expresses timing fields in seconds; other units are noted inline
mri_metadata = {
    "Modality": "MR",
    "MagneticFieldStrength": 3.0,  # tesla
    "Manufacturer": "Siemens",
    "ManufacturersModelName": "Prisma",
    "InstitutionName": "University Medical Center",
    "SequenceName": "T1_MPRAGE",
    "SliceThickness": 1.0,  # millimeters
    "RepetitionTime": 2.3,  # seconds
    "EchoTime": 0.00226,  # seconds
    "FlipAngle": 8  # degrees
}
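
DICOM headers report repetition and echo times in milliseconds, whereas BIDS sidecars store them in seconds, so the values usually need converting. The following is a minimal sketch of that conversion; the function name `dicom_timing_to_bids` and its parameter names are hypothetical.

```python
def dicom_timing_to_bids(repetition_time_ms, echo_time_ms):
    """Convert DICOM timing values (milliseconds) to BIDS-style seconds."""
    return {
        "RepetitionTime": repetition_time_ms / 1000.0,
        "EchoTime": echo_time_ms / 1000.0,
    }


# Example values chosen to match the sidecar above (TR = 2300 ms, TE = 2.26 ms)
timing = dicom_timing_to_bids(repetition_time_ms=2300, echo_time_ms=2.26)
print(timing)
```

In practice, conversion tools typically perform this step when they generate the sidecar, so a manual conversion like this is mainly useful for checking values.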

Now let's save this metadata as a JSON file and display its contents.

In [None]:
# Save as JSON
with open('T1w.json', 'w') as f:
    json.dump(mri_metadata, f, indent=2)

# Display the JSON content
print("JSON Sidecar Content:")
print(json.dumps(mri_metadata, indent=2))
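
A sidecar is only useful if downstream code can read it back unchanged. The sketch below writes a small metadata dictionary to a temporary directory and reloads it to confirm the round trip. The dictionary contents and the file name `example_T1w.json` are illustrative, not part of the dataset above.

```python
import json
import tempfile
from pathlib import Path

example_metadata = {"Modality": "MR", "MagneticFieldStrength": 3.0, "FlipAngle": 8}

# Use a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    sidecar_path = Path(tmp_dir) / "example_T1w.json"
    sidecar_path.write_text(json.dumps(example_metadata, indent=2), encoding="utf-8")
    loaded_metadata = json.loads(sidecar_path.read_text(encoding="utf-8"))

round_trip_ok = loaded_metadata == example_metadata
print("Round trip preserved the metadata:", round_trip_ok)
```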

## Creating a YAML Sidecar File

YAML is another popular format for metadata due to its human-readable structure. Let's create the same metadata in YAML format.

In [None]:
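# Save as YAML
# sort_keys=False keeps the original key order; yaml.dump sorts keys alphabetically by default
with open('T1w.yaml', 'w') as f:
    yaml.dump(mri_metadata, f, default_flow_style=False, sort_keys=False)

# Display the YAML content
print("YAML Sidecar Content:")
print(yaml.dump(mri_metadata, default_flow_style=False, sort_keys=False))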
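
Reading a YAML sidecar back is the mirror image of writing it. The sketch below uses `yaml.safe_dump` and `yaml.safe_load`, which restrict the document to plain data types and are the usual choice for files from other sources. It assumes PyYAML 5.1 or later for the `sort_keys` argument, and the example dictionary is illustrative.

```python
import yaml

example_metadata = {"Modality": "MR", "MagneticFieldStrength": 3.0, "FlipAngle": 8}

# Serialize without reordering keys, then parse the text back into a dict
yaml_text = yaml.safe_dump(example_metadata, sort_keys=False)
loaded_metadata = yaml.safe_load(yaml_text)

print(yaml_text)
print("Round trip preserved the metadata:", loaded_metadata == example_metadata)
```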

## Data Dictionaries

A data dictionary provides detailed descriptions of variables in a dataset. Let's create a data dictionary for a clinical research study.

In [None]:
# Create a data dictionary for clinical variables
data_dictionary = [
    {
        "variable_name": "patient_id",
        "description": "Unique patient identifier",
        "type": "string",
        "format": "PID-XXXXX",
        "required": True
    },
    {
        "variable_name": "age",
        "description": "Patient age at time of scan",
        "type": "integer",
        "units": "years",
        "min_value": 18,
        "max_value": 100,
        "required": True
    },
    {
        "variable_name": "diagnosis",
        "description": "Primary diagnosis",
        "type": "categorical",
        "levels": {
            "HC": "Healthy Control",
            "MCI": "Mild Cognitive Impairment",
            "AD": "Alzheimer's Disease"
        },
        "required": True
    },
    {
        "variable_name": "mmse_score",
        "description": "Mini-Mental State Examination score",
        "type": "integer",
        "min_value": 0,
        "max_value": 30,
        "missing_values": [-999, "NA"],
        "required": False
    }
]
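
Before using a data dictionary to validate data, it can help to check that the dictionary itself is well formed. The sketch below is one possible check. The function name `check_dictionary` and the choice of `variable_name`, `description`, and `type` as the minimum set of keys are assumptions made for illustration.

```python
# Assumed minimum set of keys every entry should define
REQUIRED_KEYS = {"variable_name", "description", "type"}


def check_dictionary(entries):
    """Return a list of problems found in a list of data dictionary entries."""
    problems = []
    seen_names = set()
    for index, entry in enumerate(entries):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"entry {index}: missing keys {sorted(missing)}")
        name = entry.get("variable_name")
        if name is not None and name in seen_names:
            problems.append(f"entry {index}: duplicate variable_name '{name}'")
        seen_names.add(name)
        # A range is only meaningful if the minimum does not exceed the maximum
        if "min_value" in entry and "max_value" in entry:
            if entry["min_value"] > entry["max_value"]:
                problems.append(f"entry {index}: min_value exceeds max_value")
    return problems


example_entries = [
    {"variable_name": "age", "description": "Age at scan", "type": "integer",
     "min_value": 18, "max_value": 100},
    {"variable_name": "age", "type": "integer"},  # duplicate name, no description
]
problems = check_dictionary(example_entries)
for problem in problems:
    print(problem)
```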

Let's convert this data dictionary to a pandas DataFrame to view it as a table. Fields that an entry does not define (for example, `units` for `patient_id`) appear as `NaN`.

In [None]:
# Convert to DataFrame
dd_df = pd.DataFrame(data_dictionary)
dd_df

Save the data dictionary as both JSON, which preserves nested fields, and CSV, which is easy to open in a spreadsheet.

In [None]:
# Save as JSON
with open('data_dictionary.json', 'w') as f:
    json.dump(data_dictionary, f, indent=2)

# Save as CSV
dd_df.to_csv('data_dictionary.csv', index=False)

print("Data dictionary saved as 'data_dictionary.json' and 'data_dictionary.csv'")
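
CSV has no native representation for nested values, so fields such as `levels` are written as plain strings. One way to keep them recoverable is to JSON-encode nested fields before writing and decode them after reading. The sketch below shows this with the standard `csv` module and an in-memory buffer; the helper `encode_row`, the `NESTED_FIELDS` tuple, and the two example entries are hypothetical.

```python
import csv
import io
import json

NESTED_FIELDS = ("levels",)  # assumed list of fields holding dicts or lists
FIELDNAMES = ["variable_name", "type", "levels"]

entries = [
    {"variable_name": "diagnosis", "type": "categorical",
     "levels": {"HC": "Healthy Control", "AD": "Alzheimer's Disease"}},
    {"variable_name": "age", "type": "integer"},
]


def encode_row(entry):
    """Prepare one entry for CSV: JSON-encode nested fields, blank out absent ones."""
    row = {}
    for field in FIELDNAMES:
        value = entry.get(field)
        if value is None:
            row[field] = ""
        elif field in NESTED_FIELDS:
            row[field] = json.dumps(value)
        else:
            row[field] = value
    return row


# Write to an in-memory buffer instead of a file to keep the example self-contained
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
writer.writeheader()
for entry in entries:
    writer.writerow(encode_row(entry))

# Read the rows back and decode the nested field of the first entry
buffer.seek(0)
rows = list(csv.DictReader(buffer))
restored_levels = json.loads(rows[0]["levels"])
print(restored_levels)
```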

## Validating Data Against the Dictionary

Let's create a sample dataset and validate it against our data dictionary.

In [None]:
# Create sample data
sample_data = pd.DataFrame({
    'patient_id': ['PID-00001', 'PID-00002', 'PID-00003'],
    'age': [65, 72, 45],
    'diagnosis': ['HC', 'MCI', 'AD'],
    'mmse_score': [29, 24, -999]  # -999 represents missing value
})

sample_data
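
Sentinel codes such as -999 distort summary statistics if they are treated as real measurements. The sketch below replaces declared missing-value codes with `NaN` before computing a mean. The series and the list of codes are illustrative copies of the values used in this notebook.

```python
import pandas as pd

example_scores = pd.Series([29, 24, -999], name="mmse_score")
missing_codes = [-999, "NA"]

# mask() replaces every value for which the condition is True with NaN
cleaned_scores = example_scores.mask(example_scores.isin(missing_codes))

naive_mean = example_scores.mean()    # includes the sentinel value
clean_mean = cleaned_scores.mean()    # NaN values are skipped by default
print("Mean with sentinel included:", naive_mean)
print("Mean with sentinel removed:", clean_mean)
```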

Now let's create a simple validation function to check if our data conforms to the data dictionary specifications.

In [None]:
def validate_data(df, data_dict):
    """Validate a DataFrame against a data dictionary and return a list of messages."""
    validation_results = []

    for var_info in data_dict:
        var_name = var_info['variable_name']

        if var_name not in df.columns:
            # Absent variables are reported only when they are required
            if var_info.get('required', False):
                validation_results.append(f"❌ Required variable '{var_name}' is missing")
            continue

        issues = []
        # Exclude declared missing-value codes and NaN before checking values
        missing_codes = var_info.get('missing_values', [])
        values = df[var_name][~df[var_name].isin(missing_codes)].dropna()

        # Check value ranges for numeric variables
        if 'min_value' in var_info and (values < var_info['min_value']).any():
            issues.append(f"⚠️  Variable '{var_name}' has values below minimum ({var_info['min_value']})")
        if 'max_value' in var_info and (values > var_info['max_value']).any():
            issues.append(f"⚠️  Variable '{var_name}' has values above maximum ({var_info['max_value']})")

        # Check categorical levels
        if var_info['type'] == 'categorical' and 'levels' in var_info:
            invalid_levels = set(values.unique()) - set(var_info['levels'])
            if invalid_levels:
                issues.append(f"⚠️  Variable '{var_name}' has invalid levels: {invalid_levels}")

        # Report success only when no issue was found for this variable
        validation_results.extend(issues or [f"✅ Variable '{var_name}' validated successfully"])

    return validation_results

Run the validation on our sample data.

In [None]:
# Validate the sample data
results = validate_data(sample_data, data_dictionary)
for result in results:
    print(result)
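
The data dictionary also declares a `format` for `patient_id` (`PID-XXXXX`), which a validator can check as well. The sketch below translates such a template into a regular expression and lists values that do not match. It assumes that each `X` stands for exactly one digit and that every other character is literal; that interpretation and the function names are assumptions, since the dictionary does not define the template syntax.

```python
import re


def format_to_regex(format_spec):
    """Translate a template such as 'PID-XXXXX' into a compiled regex (X = one digit)."""
    pattern = "".join(r"\d" if char == "X" else re.escape(char) for char in format_spec)
    return re.compile(pattern)


def find_format_violations(values, format_spec):
    """Return the values that do not match the template in full."""
    regex = format_to_regex(format_spec)
    return [value for value in values if not regex.fullmatch(str(value))]


violations = find_format_violations(["PID-00001", "PID-123", "pid-00002"], "PID-XXXXX")
print(violations)
```

A template containing a literal `X` would need a different placeholder convention, which is one reason some data dictionaries store a regular expression directly instead.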

## Creating a Complete Metadata Package

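Let's create a comprehensive metadata package that includes both sidecar metadata and data dictionary information. Note that BIDS itself keeps these pieces in separate files (`dataset_description.json`, per-scan sidecars, and `participants.json`), so the single file below is a convenience bundle rather than a BIDS-valid layout.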

In [None]:
# Create a complete metadata package
metadata_package = {
    "dataset_description": {
        "Name": "Alzheimer's Disease Neuroimaging Study",
        "BIDSVersion": "1.6.0",
        "License": "CC BY 4.0",
        "Authors": ["Dr. Smith", "Dr. Johnson"],
        "Acknowledgements": "Funded by NIH Grant #12345",
        "HowToAcknowledge": "Please cite our paper: Smith et al. (2023)",
        "ReferencesAndLinks": ["https://doi.org/10.1234/example"]
    },
    "imaging_parameters": mri_metadata,
    "participant_variables": data_dictionary
}

# Save the complete package
with open('dataset_metadata.json', 'w') as f:
    json.dump(metadata_package, f, indent=2)

print("Complete metadata package created successfully!")
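
BIDS requires the `Name` and `BIDSVersion` fields in a dataset description, so a quick check before publishing can catch an incomplete package. The sketch below is a minimal version of such a check; the function name `missing_description_fields` and the two example dictionaries are hypothetical.

```python
# Fields that BIDS marks as required in dataset_description.json
REQUIRED_DESCRIPTION_FIELDS = ("Name", "BIDSVersion")


def missing_description_fields(description):
    """Return the required fields that are absent or empty in a dataset description."""
    return [field for field in REQUIRED_DESCRIPTION_FIELDS if not description.get(field)]


complete_description = {"Name": "Example Study", "BIDSVersion": "1.6.0"}
incomplete_description = {"License": "CC BY 4.0"}

print(missing_description_fields(complete_description))
print(missing_description_fields(incomplete_description))
```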

Display a summary of the metadata package structure.

In [None]:
# Display package structure
print("Metadata Package Structure:")
print("-" * 50)
for key, value in metadata_package.items():
    print(f"📁 {key}")
    if isinstance(value, dict):
        for subkey in value:
            print(f"   └── {subkey}")
    elif isinstance(value, list):
        print(f"   └── {len(value)} items")

## Exercise

Create a sidecar metadata file and data dictionary for a hypothetical COVID-19 patient monitoring dataset that includes:

1. **Sidecar metadata** (save as both JSON and YAML):
   - Study name: "COVID-19 Longitudinal Monitoring"
   - Data collection period: "2020-03-01 to 2021-12-31"
   - Institution: Your choice
   - Ethics approval number: "ETH-2020-001"

2. **Data dictionary** with at least 5 variables including:
   - Patient ID (string, required)
   - Temperature (numeric, range 35-42°C)
   - Oxygen saturation (numeric, range 0-100%)
   - COVID test result (categorical: positive/negative/pending)
   - At least one more clinical variable of your choice

3. Create sample data for 5 patients and validate it against your data dictionary

4. Combine everything into a single metadata package and save it as 'covid_metadata_package.json'

Bonus: Create a function that automatically generates a human-readable report from your metadata package.