# 05 - Synthetic Data Generation with YData

**Purpose**: This notebook generates synthetic data that mimics the statistical properties of the real training and testing datasets. It fits a `RegularSynthesizer` from the `ydata-sdk` and, in a second pass, a `GaussianCopulaSynthesizer` from `sdv`. The synthetic data is intended for model validation, data augmentation, and robustness testing.

**Inputs**:
- Real data, accessed via the `BayesianData` class, which is then split into training and testing sets.

**Outputs**:
- `profile_real_train_{date}.html` & `profile_real_test_{date}.html`: `ydata-profiling` reports for the real datasets.
- `synth_train.ipc` & `synth_test.ipc`: The generated synthetic datasets in wide format, saved as Arrow IPC files.
- `synth_train_long.ipc`, `synth_test_long.ipc`, `synth_long.ipc`: The synthetic data reshaped into a long format.

### Key Steps:
1.  **Profile Real Data**: Generates and saves detailed `ydata-profiling` reports for the real training and testing sets to understand their distributions.
2.  **Train Synthesizer**: Fits a `RegularSynthesizer` from `ydata-sdk` on the real training data.
3.  **Generate Synthetic Data**: Uses the trained synthesizer to generate a new sample of synthetic data.
4.  **Evaluate Synthetic Data**: Creates a `SyntheticDataProfile` to compare the statistical properties of the generated data against the original data, ensuring quality and fidelity.
5.  **Save Artifacts**: Saves the synthetic data sampled in Section 5.2 in both wide and long formats to IPC files for use in other notebooks.
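
To make the wide versus long distinction in step 5 concrete, here is a minimal sketch of the reshaping in plain Python. The rows and the two feature names are illustrative stand-ins, and `wide_to_long` is a hypothetical helper; the notebook itself reshapes with polars `unpivot` over the full `FEATURES` list, and the row order produced there may differ from this sketch.

```python
# Illustrative only: the values below are made up, not taken from the dataset.
ID_COLS = ["infant", "risk", "category", "age_bracket"]
FEATURE_COLS = ["Ankle_IQRx", "Wrist_medianx"]  # stand-in for FEATURES


def wide_to_long(rows, id_cols, feature_cols):
    """Return one output row per (input row, feature) pair."""
    long_rows = []
    for row in rows:
        ids = {col: row[col] for col in id_cols}
        for feature in feature_cols:
            long_rows.append({**ids, "feature": feature, "value": row[feature]})
    return long_rows


wide = [
    {"infant": "A", "risk": 0, "category": 0, "age_bracket": 1,
     "Ankle_IQRx": 0.12, "Wrist_medianx": 0.50},
    {"infant": "B", "risk": 1, "category": 0, "age_bracket": 2,
     "Ankle_IQRx": 0.34, "Wrist_medianx": 0.75},
]

long = wide_to_long(wide, ID_COLS, FEATURE_COLS)
print(len(long))  # 2 rows x 2 features = 4 long rows
```

Each long row keeps the identifier columns and adds a `feature` / `value` pair, which is the layout the `*_long.ipc` files use.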

# Profile Real Data

### 5.1 Profile Real Data and Generate Synthetic Data with YData

This cell is the core of the notebook, performing a multi-step process to generate synthetic data:
1.  **Setup**: Imports libraries, sets the YData license key, and defines constants for the number of synthetic rows to generate.
2.  **Load Real Data**: Instantiates `BayesianData` and prepares the wide-format training and testing sets as Pandas DataFrames.
3.  **Profile Real Data**: Uses `ydata-profiling` to generate and save detailed HTML reports for both the real training and testing data, which helps in understanding the data's characteristics before synthesis.
4.  **Fit Synthesizer**: Trains a `RegularSynthesizer` from the YData SDK on the real training data.
5.  **Generate & Profile Synthetic Data**: Samples new data and creates a `SyntheticDataProfile` to compare the generated data with the original data, providing a quality assessment.
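
The row-count constants from the setup step split the total number of synthetic rows between the two `risk` classes. The sketch below reproduces that arithmetic with the same values the cell uses (`2000` rows, a class-0 share of `0.740741`); `split_rows` is a hypothetical helper, not part of the notebook.

```python
import math


def split_rows(total_rows: int, share_class_0: float) -> tuple[int, int]:
    """Split total_rows into two class counts; class 0 gets the ceiling of its share."""
    class_0 = math.ceil(share_class_0 * total_rows)
    # The second class takes the remainder, so the two counts always sum to total_rows.
    return class_0, total_rows - class_0


risk_0_rows, risk_1_rows = split_rows(2000, 0.740741)
print(risk_0_rows, risk_1_rows)  # 1482 518
```

Computing the second count as a remainder, rather than rounding both shares independently, avoids off-by-one totals.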

In [None]:
%reload_ext autoreload
%autoreload 2

import os
import math
from datetime import datetime

import polars as pl
from polars import DataFrame
import polars.selectors as cs
# from sdv.metadata import Metadata
# from sdv.single_table import GaussianCopulaSynthesizer
# from sdv.sampling import Condition
# from sdv.evaluation.single_table import run_diagnostic, evaluate_quality, get_column_plot
import pandas as pd
from ydata_profiling import ProfileReport
from ydata.dataset import Dataset
from ydata.metadata import Metadata
from ydata.report import SyntheticDataProfile
from ydata.synthesizers.regular.model import RegularSynthesizer

# from early_markers.cribsy.common.data import get_dataframes, get_merged_dataframe
from early_markers.cribsy.common.constants import JSON_DIR, IPC_DIR, RAND_STATE, FEATURES, HTML_DIR
from early_markers.cribsy.common.bayes import BayesianData

# Ydata Key: ef3d3f6c-3b14-4309-8b95-e1ef605918fb
os.environ['YDATA_LICENSE_KEY'] = '{ef3d3f6c-3b14-4309-8b95-e1ef605918fb}'

NUM_ROWS = 2000
RISK_0_ROWS = math.ceil(0.740741 * NUM_ROWS)
RISK_1_ROWS = NUM_ROWS - RISK_0_ROWS

TODAY = datetime.today().strftime("%Y%m%d")

bd = BayesianData()

cols = ["infant", "category", "risk", "age_bracket"] + FEATURES

df_train = bd.base_train_wide.select(cols).to_pandas()
df_test = bd.base_test_wide.select(cols).to_pandas()


rpt_train = ProfileReport(
    df_train,
    title="Real Training Data",
    explorative=True,
)
rpt_train.to_file(HTML_DIR / f'profile_real_train_{TODAY}.html')

rpt_test = ProfileReport(
    df_test,
    title="Real Testing Data"
)
rpt_test.to_file(HTML_DIR / f'profile_real_test_{TODAY}.html')

meta_train = Metadata(Dataset(df_train))
synth_train = RegularSynthesizer()
synth_train.fit(df_train, meta_train, random_state=RAND_STATE)
sample_train = synth_train.sample(1_000)
meta_synth_train = Metadata(sample_train)
profile_train = SyntheticDataProfile(
    df_train,
    sample_train,
    metadata=meta_synth_train,
    target="risk",
    data_types=synth_train.data_types,
)

### 5.2 Configure SDV Metadata, Generate Synthetic Data, and Save

This cell generates synthetic data with the `sdv` (Synthetic Data Vault) library, a second synthesizer used alongside the YData SDK from Section 5.1:
1.  **Detect Metadata**: Detects the schema and data types from the full wide-format DataFrame (`bd.base_wide`) and saves them to `sdv_metadata.json`.
2.  **Update Column Type**: Overrides the detected type of the `risk` column so that it is treated as categorical.
3.  **Define Distributions**: Specifies a marginal distribution (`norm`, `beta`, `gamma`, or `gaussian_kde`) for each feature, separately for the training and testing sets, to guide the `GaussianCopulaSynthesizer`.
4.  **Add Constraints**: Adds a `FixedCombinations` constraint so that `risk` and `category` only appear in combinations observed in the real data.
5.  **Fit and Sample**: Fits one synthesizer per split. The training sample is drawn unconditionally (`NUM_ROWS` rows); the testing sample is drawn with `sample_from_conditions`, which fixes the number of `risk` 0 and `risk` 1 rows.
6.  **Save Data**: Saves the final synthetic datasets to IPC files in both wide and long formats.
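
As a rough illustration of what the constraint step asks for, the sketch below collects the `(risk, category)` pairs seen in some real rows and checks candidate rows against them. It is plain Python with made-up rows and hypothetical helper names, and it only mimics the effect of a fixed-combinations rule; it makes no claim about how SDV implements the constraint internally.

```python
def observed_combinations(rows, columns):
    """Collect the set of value tuples that occur in the given columns."""
    return {tuple(row[col] for col in columns) for row in rows}


def satisfies(row, columns, allowed):
    """True if the row's values for `columns` form an allowed combination."""
    return tuple(row[col] for col in columns) in allowed


# Hypothetical "real" rows; only these (risk, category) pairs are allowed.
real_rows = [
    {"risk": 0, "category": 0},
    {"risk": 1, "category": 0},
    {"risk": 1, "category": 1},
]
columns = ["risk", "category"]
allowed = observed_combinations(real_rows, columns)

candidates = [
    {"risk": 0, "category": 0},  # seen in the real rows
    {"risk": 0, "category": 1},  # never seen, so it is rejected
]
valid = [row for row in candidates if satisfies(row, columns, allowed)]
print(valid)  # [{'risk': 0, 'category': 0}]
```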

In [None]:

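# SDV imports for this cell. The SDV `Metadata` class is aliased so that it does not
# shadow `ydata.metadata.Metadata`, which is imported in Section 5.1.
from sdv.metadata import Metadata as SDVMetadata
from sdv.sampling import Condition
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.evaluation.single_table import run_diagnostic, evaluate_quality

metadata = SDVMetadata.detect_from_dataframe(
    data=bd.base_wide.select(cols).to_pandas(),
    table_name="features",
)
metadata.update_column(
    column_name="risk",
    sdtype="categorical",
)
metadata.save_to_json(JSON_DIR / "sdv_metadata.json", mode="overwrite")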


distro_train = {
    # "risk_raw": "beta",
    # "category": "beta",
    "Ankle_IQRaccx": "gaussian_kde",
    "Ankle_IQRaccy": "gaussian_kde",
    "Ankle_IQRvelx": "gaussian_kde",
    "Ankle_IQRvely": "gaussian_kde",
    "Ankle_IQRx": "norm",
    "Ankle_IQRy": "norm",
    "Ankle_lrCorr_x": "beta",
    "Ankle_meanent": "gaussian_kde",
    "Ankle_medianvelx": "gaussian_kde",
    "Ankle_medianvely": "norm",
    "Ankle_medianx": "norm",
    "Ankle_mediany": "gaussian_kde",
    "Ear_lrCorr_x": "norm",
    "Elbow_IQR_acc_angle": "gaussian_kde",
    "Elbow_IQR_vel_angle": "gaussian_kde",
    "Elbow_entropy_angle": "beta",
    "Elbow_lrCorr_angle": "beta",
    "Elbow_lrCorr_x": "norm",
    "Elbow_mean_angle": "gaussian_kde",
    "Elbow_median_vel_angle": "gaussian_kde",
    "Elbow_stdev_angle": "norm",
    "Eye_lrCorr_x": "norm",  # ***
    "Hip_IQR_acc_angle": "gaussian_kde",
    "Hip_IQR_vel_angle": "gaussian_kde",
    "Hip_entropy_angle": "beta",
    "Hip_lrCorr_angle": "beta",
    # "Hip_lrCorr_x": "gaussian_kde",  # ***
    "Hip_mean_angle": "gaussian_kde",
    "Hip_median_vel_angle": "gaussian_kde",
    "Hip_stdev_angle": "gaussian_kde",
    "Knee_IQR_acc_angle": "gaussian_kde",
    "Knee_IQR_vel_angle": "gaussian_kde",
    "Knee_entropy_angle": "gaussian_kde",
    "Knee_lrCorr_angle": "norm",
    "Knee_lrCorr_x": "gaussian_kde",
    "Knee_mean_angle": "gaussian_kde",
    "Knee_median_vel_angle": "gaussian_kde",
    "Knee_stdev_angle": "beta",
    "Shoulder_IQR_acc_angle": "gaussian_kde",
    "Shoulder_IQR_vel_angle": "gaussian_kde",
    "Shoulder_entropy_angle": "norm",
    "Shoulder_lrCorr_angle": "norm",
    # "Shoulder_lrCorr_x": "beta",
    "Shoulder_mean_angle": "gaussian_kde",
    "Shoulder_median_vel_angle": "beta",
    "Shoulder_stdev_angle": "gaussian_kde",
    "Wrist_IQRaccx": "gaussian_kde",
    "Wrist_IQRaccy": "gaussian_kde",
    "Wrist_IQRvelx": "gaussian_kde",
    "Wrist_IQRvely": "gaussian_kde",
    "Wrist_IQRx": "gaussian_kde",
    "Wrist_IQRy": "gaussian_kde",
    "Wrist_lrCorr_x": "gaussian_kde", # beta
    "Wrist_meanent": "gaussian_kde",
    "Wrist_medianvelx": "gaussian_kde",
    "Wrist_medianvely": "gaussian_kde",
    "Wrist_medianx": "norm",
    "Wrist_mediany": "gaussian_kde",
}

distro_test = {
    # "risk_raw": "beta",
    # "category": "beta",
    "Ankle_IQRaccx": "gaussian_kde",
    "Ankle_IQRaccy": "gaussian_kde",
    "Ankle_IQRvelx": "gaussian_kde",
    "Ankle_IQRvely": "gaussian_kde",
    "Ankle_IQRx": "gaussian_kde",
    "Ankle_IQRy": "gaussian_kde",
    "Ankle_lrCorr_x": "beta",
    "Ankle_meanent": "gaussian_kde",
    "Ankle_medianvelx": "gaussian_kde",
    "Ankle_medianvely": "gaussian_kde",
    "Ankle_medianx": "gamma",
    "Ankle_mediany": "gaussian_kde",
    "Ear_lrCorr_x": "norm",
    "Elbow_IQR_acc_angle": "gaussian_kde",
    "Elbow_IQR_vel_angle": "gaussian_kde",
    "Elbow_entropy_angle": "gaussian_kde",
    "Elbow_lrCorr_angle": "beta",
    "Elbow_lrCorr_x": "norm",
    "Elbow_mean_angle": "gaussian_kde",
    "Elbow_median_vel_angle": "gaussian_kde",
    "Elbow_stdev_angle": "norm",
    "Eye_lrCorr_x": "gaussian_kde",  # ***
    "Hip_IQR_acc_angle": "gaussian_kde",
    "Hip_IQR_vel_angle": "gaussian_kde",
    "Hip_entropy_angle": "gaussian_kde",
    "Hip_lrCorr_angle": "gaussian_kde",
    # "Hip_lrCorr_x": "gaussian_kde",  # ***
    "Hip_mean_angle": "gaussian_kde",
    "Hip_median_vel_angle": "gaussian_kde",
    "Hip_stdev_angle": "gaussian_kde",
    "Knee_IQR_acc_angle": "gaussian_kde",
    "Knee_IQR_vel_angle": "gaussian_kde",
    "Knee_entropy_angle": "gaussian_kde",
    "Knee_lrCorr_angle": "norm",
    "Knee_lrCorr_x": "gaussian_kde",
    "Knee_mean_angle": "gaussian_kde",
    "Knee_median_vel_angle": "gaussian_kde",
    "Knee_stdev_angle": "beta",
    "Shoulder_IQR_acc_angle": "gaussian_kde",
    "Shoulder_IQR_vel_angle": "gaussian_kde",
    "Shoulder_entropy_angle": "norm",
    "Shoulder_lrCorr_angle": "norm",
    # "Shoulder_lrCorr_x": "beta",
    "Shoulder_mean_angle": "gaussian_kde",
    "Shoulder_median_vel_angle": "beta",
    "Shoulder_stdev_angle": "gaussian_kde",
    "Wrist_IQRaccx": "gaussian_kde",
    "Wrist_IQRaccy": "gaussian_kde",
    "Wrist_IQRvelx": "gaussian_kde",
    "Wrist_IQRvely": "gaussian_kde",
    "Wrist_IQRx": "gaussian_kde",
    "Wrist_IQRy": "gaussian_kde",
    "Wrist_lrCorr_x": "beta",
    "Wrist_meanent": "gaussian_kde",
    "Wrist_medianvelx": "gaussian_kde",
    "Wrist_medianvely": "gaussian_kde",
    "Wrist_medianx": "gaussian_kde",
    "Wrist_mediany": "gaussian_kde",
}

category_risk_constraint = {
    "constraint_class": "FixedCombinations",
    "constraint_parameters": {
        "column_names": ["risk", "category"]
    }
}

risk_0_condition = Condition(
    num_rows=RISK_0_ROWS,
    column_values={"risk": 0}
)
risk_1_condition = Condition(
    num_rows=RISK_1_ROWS,
    column_values={"risk": 1}
)

synth_train = GaussianCopulaSynthesizer(
    metadata,
    numerical_distributions=distro_train,
    enforce_min_max_values=True,
)
synth_train.add_constraints(
    constraints=[
        category_risk_constraint
    ]
)

synth_train.fit(df_train)  # df_train is already a pandas DataFrame (see Section 5.1)
synth_train._set_random_state(RAND_STATE)
synth_data_train = synth_train.sample(num_rows=NUM_ROWS)
df_synth_train = pl.DataFrame(synth_data_train)

synth_test = GaussianCopulaSynthesizer(
    metadata,
    numerical_distributions=distro_test,
    enforce_min_max_values=True,
)
synth_test.add_constraints(
    constraints=[
        category_risk_constraint
    ]
)


synth_test.fit(df_test)  # df_test is already a pandas DataFrame (see Section 5.1)
synth_test._set_random_state(RAND_STATE)
synth_data_test = synth_test.sample_from_conditions(
    conditions=[risk_0_condition, risk_1_condition]
)
df_synth_test = pl.DataFrame(synth_data_test)

df_synth_train.write_ipc(IPC_DIR / "synth_train.ipc")
df_synth_test.write_ipc(IPC_DIR / "synth_test.ipc")

# Identifier columns kept on every long-format row.
long_index = ["infant", "risk", "category", "age_bracket"]

df_synth_train_long = df_synth_train.unpivot(
    on=FEATURES, index=long_index, variable_name="feature", value_name="value"
)
df_synth_train_long.write_ipc(IPC_DIR / "synth_train_long.ipc")

df_synth_test_long = df_synth_test.unpivot(
    on=FEATURES, index=long_index, variable_name="feature", value_name="value"
)
df_synth_test_long.write_ipc(IPC_DIR / "synth_test_long.ipc")

df_synth_long = df_synth_train_long.vstack(df_synth_test_long)
df_synth_long.write_ipc(IPC_DIR / "synth_long.ipc")

# Evaluate Synthetic Data

### 5.3 Evaluate Synthetic Data Quality

These cells use `sdv`'s evaluation utilities to assess the quality of the generated synthetic data. `run_diagnostic` checks basic validity and structure, while `evaluate_quality` scores how closely the synthetic data matches the real data, both per column (`Column Shapes`) and across column pairs (`Column Pair Trends`). The `Column Shapes` visualization shows the per-column scores, which makes poorly reproduced features easy to spot.
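
For intuition about what a column-shape score measures, the sketch below computes one minus the two-sample Kolmogorov-Smirnov statistic for a single numeric column in plain Python. That SDV scores numerical columns with a statistic of this kind is an assumption here, not something this notebook verifies; the function name `ks_complement` and the sample values are illustrative.

```python
from bisect import bisect_right


def ks_complement(real, synthetic):
    """Return 1 - max |ECDF_real(x) - ECDF_synthetic(x)| over all observed x."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    max_gap = 0.0
    for x in set(real_sorted) | set(synth_sorted):
        # Empirical CDF: fraction of values <= x in each sample.
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return 1.0 - max_gap


real = [0.1, 0.2, 0.3, 0.4]
print(ks_complement(real, real))                  # identical samples -> 1.0
print(ks_complement(real, [1.1, 1.2, 1.3, 1.4]))  # disjoint ranges -> 0.0
```

A score near 1 means the two empirical distributions nearly coincide; a score near 0 means they barely overlap.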

### 5.4 Manual Inspection of Risk Distributions

The final set of cells performs manual checks on the `risk` distribution within the synthetic and real datasets. This provides a quick, hands-on verification that the synthesizers have reasonably preserved the prevalence of different risk categories.
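
The kind of check these cells perform can be sketched without any DataFrame library: count the rows per `risk` value and divide by the total. The labels below are made up for illustration, and `risk_prevalence` is a hypothetical helper.

```python
from collections import Counter


def risk_prevalence(labels):
    """Map each risk value to (count, share of all rows)."""
    counts = Counter(labels)
    total = len(labels)
    return {value: (n, n / total) for value, n in sorted(counts.items())}


labels = [0, 0, 0, 1, 0, 1, 0, 0, 0, 0]  # hypothetical risk column
for value, (n, share) in risk_prevalence(labels).items():
    print(f"{value}: {n} | {share:.1%}")
# 0: 8 | 80.0%
# 1: 2 | 20.0%
```

Running the same summary on a real split and its synthetic counterpart shows at a glance whether the class balance was preserved.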

In [None]:
# df_train and synth_data_train are already pandas DataFrames, so no conversion is needed.
diagnostic = run_diagnostic(df_train, synth_data_train, metadata)
quality_report = evaluate_quality(df_train, synth_data_train, metadata)
quality_report.get_visualization("Column Shapes")

In [None]:
# df_test and synth_data_test are already pandas DataFrames, so no conversion is needed.
diagnostic = run_diagnostic(df_test, synth_data_test, metadata)
quality_report = evaluate_quality(df_test, synth_data_test, metadata)
quality_report.get_visualization("Column Shapes")

In [None]:
# df_synth.group_by("risk").len().with_columns(
#     pct=pl.col("len") / df_synth.height
# )
# 0: 862 | 86.2%
# 1:  29 |  2.9%
# 2:  64 |  6.4%
# 3:  45 |  4.5%


In [None]:
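# df_test is a pandas DataFrame (Section 5.1); convert it to use the polars API.
df_test_pl = pl.from_pandas(df_test)
df_test_pl.group_by("risk").len().with_columns(
    pct=pl.col("len") / df_test_pl.height
)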
# 0: 0 |  0.0%
# 1: 5 | 26.3%
# 2: 9 | 47.4%
# 3: 5 | 26.3%

In [None]:
df_synth_test.group_by("risk").len().with_columns(
    pct=pl.col("len") / df_synth_test.height
)

In [None]:

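# df_train is a pandas DataFrame (Section 5.1); convert it to use the polars API.
df_train_pl = pl.from_pandas(df_train)
df_train_pl.group_by("risk").len().with_columns(
    pct=pl.col("len") / df_train_pl.height
)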
# 0: 124 | 100.0%
# 1:   0 |   0.0%
# 2:   0 |   0.0%
# 3:   0 |   0.0%

# 0 + 1 : 129 | 90.2%
# 2 + 3 :  14 |  9.8%

# 14 / 143 = 9.8%

In [None]:
df_synth_train.group_by("risk").len().with_columns(
    pct=pl.col("len") / df_synth_train.height
)

In [None]:
# %reload_ext autoreload
# %autoreload 2
# 
# import polars as pl
# from polars import DataFrame
# from sdv.metadata import Metadata
# from sdv.single_table import GaussianCopulaSynthesizer
# from sdv.evaluation.single_table import run_diagnostic, evaluate_quality, get_column_plot
# 
# from early_markers.cribsy.common.data import get_dataframes, get_merged_dataframe
# from early_markers.cribsy.common.constants import JSON_DIR, IPC_DIR, RAND_STATE, FEATURES
# 
# frames = get_dataframes()
# 
# df_long_all = get_merged_dataframe().rename({"Value": "value"}).with_columns(
#     feature=pl.concat_str("part", "feature_name", separator="_")
# ).filter(pl.col("part") != "umber")
#       
# df_wide_all: DataFrame = df_long_all.pivot(on="feature", index=["infant", "risk_raw", "category"], values=["value"]).drop("Shoulder_lrCorr_x")  # Shoulder_lrCorr_x is constant = 1
# 
# df_train = df_wide_all.filter(pl.col("category") == 0)
# df_test = df_wide_all.filter(pl.col("category") == 1)
# 
# metadata = Metadata.detect_from_dataframe(
#     data=df_wide_all.to_pandas(),
#     table_name="features",
# )
# metadata.update_column(
#     column_name="risk_raw",
#     sdtype = "categorical"
# )
# metadata.save_to_json(JSON_DIR / "sdv_metadata.json", mode="overwrite")
# 
# distributions = {
#     # "risk_raw": "beta",
#     # "category": "beta",
#     "Ankle_IQRaccx": "gaussian_kde",
#     "Ankle_IQRaccy": "gaussian_kde",
#     "Ankle_IQRvelx": "gaussian_kde",
#     "Ankle_IQRvely": "gaussian_kde",
#     "Ankle_IQRx": "norm",
#     "Ankle_IQRy": "beta",
#     "Ankle_lrCorr_x": "beta",
#     "Ankle_meanent": "gaussian_kde",
#     "Ankle_medianvelx": "gaussian_kde",
#     "Ankle_medianvely": "norm",
#     "Ankle_medianx": "norm",
#     "Ankle_mediany": "gaussian_kde",
#     "Ear_lrCorr_x": "norm",
#     "Elbow_IQR_acc_angle": "gaussian_kde",
#     "Elbow_IQR_vel_angle": "gaussian_kde",
#     "Elbow_entropy_angle": "beta",
#     "Elbow_lrCorr_angle": "beta",
#     "Elbow_lrCorr_x": "norm",
#     "Elbow_mean_angle": "gaussian_kde",
#     "Elbow_median_vel_angle": "gaussian_kde",
#     "Elbow_stdev_angle": "norm",
#     "Eye_lrCorr_x": "beta",  # ***
#     "Hip_IQR_acc_angle": "gaussian_kde",
#     "Hip_IQR_vel_angle": "gaussian_kde",
#     "Hip_entropy_angle": "beta",
#     "Hip_lrCorr_angle": "beta",
#     "Hip_lrCorr_x": "gaussian_kde",  # ***
#     "Hip_mean_angle": "gaussian_kde",
#     "Hip_median_vel_angle": "gaussian_kde",
#     "Hip_stdev_angle": "gaussian_kde",
#     "Knee_IQR_acc_angle": "gaussian_kde",
#     "Knee_IQR_vel_angle": "gaussian_kde",
#     "Knee_entropy_angle": "gaussian_kde",
#     "Knee_lrCorr_angle": "norm",
#     "Knee_lrCorr_x": "gaussian_kde",
#     "Knee_mean_angle": "gaussian_kde",
#     "Knee_median_vel_angle": "gaussian_kde",
#     "Knee_stdev_angle": "beta",
#     "Shoulder_IQR_acc_angle": "gaussian_kde",
#     "Shoulder_IQR_vel_angle": "gaussian_kde",
#     "Shoulder_entropy_angle": "norm",
#     "Shoulder_lrCorr_angle": "norm",
#     # "Shoulder_lrCorr_x": "beta",
#     "Shoulder_mean_angle": "gaussian_kde",
#     "Shoulder_median_vel_angle": "beta",
#     "Shoulder_stdev_angle": "beta",
#     "Wrist_IQRaccx": "gaussian_kde",
#     "Wrist_IQRaccy": "gaussian_kde",
#     "Wrist_IQRvelx": "gaussian_kde",
#     "Wrist_IQRvely": "gaussian_kde",
#     "Wrist_IQRx": "gaussian_kde",
#     "Wrist_IQRy": "gaussian_kde",
#     "Wrist_lrCorr_x": "beta",
#     "Wrist_meanent": "gaussian_kde",
#     "Wrist_medianvelx": "gaussian_kde",
#     "Wrist_medianvely": "gaussian_kde",
#     "Wrist_medianx": "norm",
#     "Wrist_mediany": "gaussian_kde",
# }
# 
# category_risk_constraint = {
#     "constraint_class": "FixedCombinations",
#     "constraint_parameters": {
#         "column_names": ["risk_raw", "category"]
#     }
# }
# 
# synthesizer = GaussianCopulaSynthesizer(
#     metadata,
#     numerical_distributions=distributions,
#     enforce_min_max_values=True,
# )
# synthesizer.add_constraints(
#     constraints=[
#         category_risk_constraint
#     ]
# )
# synthesizer.fit(df_wide_all.to_pandas())
# synthesizer._set_random_state(RAND_STATE)
# synthetic_data = synthesizer.sample(num_rows=1000)
# df_synth =  pl.DataFrame(synthetic_data).with_columns(
#     risk=pl.col("risk_raw") >= 2
# ).drop("risk_raw")

In [None]:
# import math
# %reload_ext autoreload
# %autoreload 2
# 
# import polars as pl
# from polars import DataFrame
# import polars.selectors as cs
# from sdv.metadata import Metadata
# from sdv.single_table import GaussianCopulaSynthesizer
# from sdv.evaluation.single_table import run_diagnostic, evaluate_quality, get_column_plot
# 
# from early_markers.cribsy.common.data import get_dataframes, get_merged_dataframe
# from early_markers.cribsy.common.constants import JSON_DIR, IPC_DIR, RAND_STATE, FEATURES
# 
# frames = get_dataframes()
# 
# df_long_all = get_merged_dataframe().rename({"Value": "value"}).with_columns(
#     feature=pl.concat_str("part", "feature_name", separator="_")
# ).filter(pl.col("part") != "umber")
#       
# df_wide_all: DataFrame = df_long_all.pivot(on="feature", index=["infant", "risk_raw", "category"], values=["value"]).drop(["Shoulder_lrCorr_x", "Hip_lrCorr_x"])  # Shoulder_lrCorr_x is constant = 1
# 
# df_train = df_wide_all.filter(pl.col("category") == 0).with_columns(
#     risk=(pl.col("risk_raw") >= 2).cast(pl.Int64)
# ).select(["infant", "category", "risk", cs.exclude(["infant", "category", "risk"])]).drop("risk_raw")
# 
# df_test = df_wide_all.filter(pl.col("category") == 1).with_columns(
#     risk=(pl.col("risk_raw") >= 2).cast(pl.Int64)
# ).select(["infant", "category", "risk", cs.exclude(["infant", "category", "risk"])]).drop("risk_raw")
# 
# df_train

# PEB 2025.03.26 22:28 => Need to create and validate a model based only on kiddos age >= 10 weeks
# Will need to ensure both train and test synth datasets are 90/10 no-risk/risk
