## Two methods for converting synthetic test data to MEDS format

Version 1 (likely preferred): Passes options on the command line with _--overrides_, the way the official example notebook does.

Version 2 (may still need fixing): Uses a _pipeline_config.yaml_ file instead of overrides. Note that _data_input_dir_ doesn't work as described in the repo.  
&nbsp;&nbsp;&nbsp;&nbsp; As of the time this script was created (2025-07-19), the YAML seems to require listing _${input_dir}_/_${input_dir}_ instead of just _${input_dir}_.  
&nbsp;&nbsp;&nbsp;&nbsp; That's not what the official docs indicate it should be.

### Make sure you install the MEDS-extract package by running the following command:

In [1]:
# Uncomment this line to install the MEDS-extract package if you haven't done so already
# !pip install MEDS-extract

To start, import some useful packages:

In [2]:
from pathlib import Path
from pretty_print_directory import PrintConfig, print_directory
import polars as pl

Assign some variables:

In [3]:
input_dir = "./raw_data"
output_dir = "./MEDS_output"
event_config_file = "event_config.yaml"
dataset_name = "Synthetic_Dataset"
dataset_version = "1.0"

Take a quick look at the input file list

In [4]:
DATA_ROOT = Path(input_dir)
print_directory(DATA_ROOT)

└── synthetic_data.csv
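
Before running either version, it can help to confirm that the input directory exists and actually holds CSV files, since a wrong path is easy to miss. The helper below is a minimal sketch using only the standard library; the name `list_input_csvs` is hypothetical and not part of MEDS-extract.

```python
from pathlib import Path


def list_input_csvs(input_dir: str) -> list[Path]:
    """Return the CSV files directly under input_dir, sorted by name.

    Hypothetical pre-flight helper (not part of MEDS-extract).
    Raises FileNotFoundError if the directory is missing or holds no CSVs.
    """
    root = Path(input_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"Input directory not found: {root}")
    csv_files = sorted(root.glob("*.csv"))
    if not csv_files:
        raise FileNotFoundError(f"No CSV files found in: {root}")
    return csv_files
```

In this notebook it would be called as `list_input_csvs(input_dir)`.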


## Version 1

This version runs the command as shown in the official example notebook.

In [5]:
# Run the MEDS_Extract command with the specified configurations
!MEDS_transform-pipeline \
    pkg://MEDS_extract.configs._extract.yaml \
    --overrides \
    input_dir={input_dir} \
    output_dir={output_dir} \
    event_conversion_config_fp={event_config_file} \
    dataset.name={dataset_name} \
    dataset.version={dataset_version} 
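
Outside a notebook the `!` shell magic is not available. One possible alternative, sketched here, is to assemble the same argument list in plain Python and hand it to `subprocess`. The function names `build_extract_command` and `run_extract` are hypothetical; the arguments themselves are copied from the cell above.

```python
import shlex
import subprocess


def build_extract_command(input_dir, output_dir, event_config_file,
                          dataset_name, dataset_version):
    """Assemble the Version 1 command as an argument list (hypothetical helper)."""
    return [
        "MEDS_transform-pipeline",
        "pkg://MEDS_extract.configs._extract.yaml",
        "--overrides",
        f"input_dir={input_dir}",
        f"output_dir={output_dir}",
        f"event_conversion_config_fp={event_config_file}",
        f"dataset.name={dataset_name}",
        f"dataset.version={dataset_version}",
    ]


def run_extract(cmd):
    """Run the command; requires MEDS-extract to be installed."""
    subprocess.run(cmd, check=True)


cmd = build_extract_command(
    "./raw_data", "./MEDS_output", "event_config.yaml", "Synthetic_Dataset", "1.0"
)
# Print the command for inspection; call run_extract(cmd) to execute it.
print(shlex.join(cmd))
```

Passing a list rather than a single string avoids shell quoting problems if a path or dataset name contains spaces.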

## Version 2

This version runs the command with a _pipeline_config.yaml_ file that is configured with the appropriate settings.  
&nbsp;&nbsp;&nbsp;&nbsp; Remember: As of the time this script was created (2025-07-19), the YAML seems to require listing _${input_dir}_/_${input_dir}_ instead of just _${input_dir}_.  
&nbsp;&nbsp;&nbsp;&nbsp; That's not what the official docs indicate it should be.

In [6]:
!MEDS_transform-pipeline pipeline_config.yaml

# Viewing the Outputs

The final data, omitting logs, to keep the output small:

In [7]:
output_data_root = Path(output_dir) / "data"
print_directory(output_data_root, PrintConfig(ignore_regex=r"\.logs"))

├── held_out
│   └── 0.parquet
├── train
│   └── 0.parquet
└── tuning
    └── 0.parquet


The final metadata, omitting logs, to keep the output small:

In [8]:
output_metadata_root = Path(output_dir) / "metadata"
print_directory(output_metadata_root, PrintConfig(ignore_regex=r"\.logs"))

├── .shards.json
├── codes.parquet
├── dataset.json
└── subject_splits.parquet


Peek into some of the files:

In [9]:
for fp in sorted(output_data_root.rglob("*.parquet")):
    print(fp.relative_to(output_data_root))
    display(pl.read_parquet(fp).head(6))

held_out/0.parquet


subject_id,time,code,numeric_value
i64,datetime[μs],str,f32
119031,2024-08-05 17:40:58.690041,"""MED002""",119.300003
119031,2024-08-08 10:04:42.859157,"""DX001""",120.489998
119031,2024-09-20 17:08:54.150772,"""DX005""",128.979996
119031,2024-09-21 01:44:48.734192,"""LAB002""",77.18
119031,2024-11-04 10:34:47.889702,"""DX005""",98.730003
119031,2025-01-19 19:43:59.433472,"""DX005""",116.099998


train/0.parquet


subject_id,time,code,numeric_value
i64,datetime[μs],str,f32
64447,2024-08-12 04:36:25.035832,"""LAB002""",115.660004
64447,2024-08-29 21:41:03.141390,"""MED001""",92.309998
64447,2024-09-04 03:50:49.234321,"""DX005""",112.199997
64447,2024-09-05 13:25:58.000016,"""DX002""",99.480003
64447,2024-09-06 09:51:29.417816,"""MED001""",101.830002
64447,2024-09-06 11:41:57.383762,"""DX005""",100.459999


tuning/0.parquet


subject_id,time,code,numeric_value
i64,datetime[μs],str,f32
19594,2024-07-15 01:40:31.157512,"""DX004""",115.32
19594,2024-08-13 10:21:46.713997,"""DX005""",106.489998
19594,2024-09-23 07:17:26.193540,"""MED002""",71.800003
19594,2024-10-19 01:13:42.248642,"""MED003""",101.290001
19594,2024-10-21 01:14:40.410467,"""MED002""",91.529999
19594,2024-11-12 06:14:08.095228,"""DX004""",81.120003


In [10]:
print((output_metadata_root / "dataset.json").read_text())

{"dataset_name": "Synthetic_Dataset", "dataset_version": "1.0", "etl_name": "MEDS_transforms", "etl_version": "0.6.0", "meds_version": "0.4.0", "created_at": "2025-07-20T04:52:46.592332+00:00"}
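
Since `dataset.json` is plain JSON, the name and version passed to the pipeline can be verified directly. The following is a minimal sketch using only the standard library; `check_dataset_metadata` is a hypothetical helper, `EXPECTED_KEYS` lists the keys seen in the output above rather than anything mandated by the MEDS schema, and the sample string is the output printed above.

```python
import json
from datetime import datetime

# Keys observed in the dataset.json printed above (an assumption based on
# this run, not a statement of which keys MEDS requires).
EXPECTED_KEYS = {
    "dataset_name", "dataset_version", "etl_name",
    "etl_version", "meds_version", "created_at",
}


def check_dataset_metadata(raw_json: str, expected_name: str,
                           expected_version: str) -> dict:
    """Parse dataset.json text and confirm it matches the requested settings."""
    metadata = json.loads(raw_json)
    missing = EXPECTED_KEYS - metadata.keys()
    if missing:
        raise ValueError(f"Missing keys: {sorted(missing)}")
    if metadata["dataset_name"] != expected_name:
        raise ValueError(f"Unexpected dataset_name: {metadata['dataset_name']}")
    if metadata["dataset_version"] != expected_version:
        raise ValueError(f"Unexpected dataset_version: {metadata['dataset_version']}")
    # Raises ValueError if created_at is not an ISO 8601 timestamp.
    datetime.fromisoformat(metadata["created_at"])
    return metadata


raw = (
    '{"dataset_name": "Synthetic_Dataset", "dataset_version": "1.0", '
    '"etl_name": "MEDS_transforms", "etl_version": "0.6.0", '
    '"meds_version": "0.4.0", "created_at": "2025-07-20T04:52:46.592332+00:00"}'
)
metadata = check_dataset_metadata(raw, "Synthetic_Dataset", "1.0")
print(metadata["etl_name"], metadata["etl_version"])
```

In the notebook, `raw` would be replaced by `(output_metadata_root / "dataset.json").read_text()`.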


We can see that by default, the codes file has the right schema but is empty, as we extracted no metadata in this pipeline.

In [11]:
display(pl.read_parquet(output_metadata_root / "codes.parquet"))

code,description,parent_codes
str,str,list[str]


Show some split listings

In [12]:
display(pl.read_parquet(output_metadata_root / "subject_splits.parquet"))

subject_id,split
i64,str
7304150,"""train"""
8887065,"""train"""
4934115,"""train"""
1632631,"""train"""
948483,"""train"""
…,…
4501946,"""held_out"""
3345942,"""held_out"""
2351057,"""held_out"""
4228028,"""held_out"""


In [13]:
# For each split, display the first 3 subject_ids.
subject_splits = pl.read_parquet(output_metadata_root / "subject_splits.parquet")
for split_type in subject_splits['split'].unique(maintain_order=True):
    print(f"Split: {split_type}")
    split_data = subject_splits.filter(pl.col('split') == split_type)
    display(split_data.head(3))
    print("\n")


Split: train


subject_id,split
i64,str
7304150,"""train"""
8887065,"""train"""
4934115,"""train"""




Split: tuning


subject_id,split
i64,str
7197926,"""tuning"""
19594,"""tuning"""
5289418,"""tuning"""




Split: held_out


subject_id,split
i64,str
9817725,"""held_out"""
9896237,"""held_out"""
738977,"""held_out"""




