# 01 — Data Preparation, Fixed Splits & Configuration

**Goal**  
In this step, we create *reusable, fixed* cross-validation splits on the original dataset.  
The splits are saved to disk under `splits_dir` (here `../data/splits_k5_v1/`) together with a `manifest.json`.

Fixing the splits once ensures that all future experiments (e.g., with watermarked, anonymized, or synthetic datasets) are evaluated under exactly the same conditions.  
This keeps results **fair and directly comparable** across dataset versions.

> **We run this notebook.** Partners will need:
>
> - The `splits_dir/manifest.json` file, which contains the following fields:
>   - `name`: "Fixed Stratified K-Fold Splits"
>   - `k`
>   - `random_state`
>   - `target`
>   - `features`
>   - `rows`
>   - `splits`
> - The original **Heart Failure Clinical Records** dataset augmented with `row_id` to ensure reproducibility.
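
On the partner side, a quick structural check of the received manifest can catch a truncated or outdated file early. The sketch below is only an illustration and not part of this notebook: the helper name `check_manifest` and the sample values are hypothetical. The top-level keys come from the list above, and the per-fold fields (`fold`, `n_train`, `n_test`, ...) are the ones written in section 3.

```python
import json

# Top-level keys listed above for manifest.json.
REQUIRED_KEYS = {"name", "k", "random_state", "target", "features", "rows", "splits"}


def check_manifest(manifest):
    """Return a list of problems found in a manifest dict (empty list means OK).

    Hypothetical helper for illustration; not part of the notebook.
    """
    problems = []
    missing = REQUIRED_KEYS - manifest.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
        return problems
    if len(manifest["splits"]) != manifest["k"]:
        problems.append("number of split entries does not match k")
    for entry in manifest["splits"]:
        # In K-fold CV, each fold's train and test sets together cover all rows.
        if entry["n_train"] + entry["n_test"] != manifest["rows"]:
            problems.append(f"fold {entry['fold']}: n_train + n_test != rows")
    return problems


# Illustrative manifest (values made up for the example, not the real file).
sample = json.loads("""
{
  "name": "Fixed Stratified K-Fold Splits",
  "k": 2,
  "random_state": 42,
  "target": "DEATH_EVENT",
  "features": ["age", "sex"],
  "rows": 10,
  "splits": [
    {"fold": 1, "train_file": "train_ids_fold1.csv",
     "test_file": "test_ids_fold1.csv", "n_train": 5, "n_test": 5},
    {"fold": 2, "train_file": "train_ids_fold2.csv",
     "test_file": "test_ids_fold2.csv", "n_train": 5, "n_test": 5}
  ]
}
""")

print(check_manifest(sample))  # prints []
```

In practice the dict would come from `json.load` on the shared `manifest.json` rather than from an inline string.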


## 1) Configuration

- `data_csv` — path to the original dataset (e.g., *Heart Failure Clinical Records*, Chicco & Jurman, 2020).
- `features` — columns used as inputs.
- `target` — the binary target column.
- `k` — number of folds, with `shuffle=True` and fixed `random_state` to freeze splits.
- `splits_dir` — where split CSVs and `manifest.json` will be written.
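
To illustrate why a fixed `random_state` freezes the splits, the following minimal sketch builds the same shuffled `StratifiedKFold` twice on toy labels and compares the resulting test folds. The toy data and the helper name `fold_test_indices` are assumptions made for this example, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels, for illustration only (not the heart-failure data).
y = np.array([0] * 12 + [1] * 8)
X = np.zeros((len(y), 1))  # placeholder features; stratification uses y


def fold_test_indices(random_state):
    """Return the test indices of each fold as a list of tuples."""
    skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=random_state)
    return [tuple(int(i) for i in test_idx) for _, test_idx in skf.split(X, y)]


# Two independent splitters with the same seed yield identical folds.
same_seed_matches = fold_test_indices(42) == fold_test_indices(42)
print("Same seed, same folds:", same_seed_matches)  # prints True
```

Saving the resulting IDs to disk, as this notebook does, goes one step further: the splits then no longer depend on the scikit-learn version or on the row order at split time.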


In [1]:
# --- User config ---
from pathlib import Path
import pandas as pd

# Path to ORIGINAL dataset  
data_csv = Path("../data/heart_failure_clinical_records_dataset.csv") 

# Load dataset
df = pd.read_csv(data_csv)

# Baseline features & target  

target = "DEATH_EVENT"
# Use all columns except the target (and row_id, if the CSV already has one)
features = [col for col in df.columns if col not in (target, "row_id")]

# Cross‑validation setup
k = 5
random_state = 42

# Where to save the fixed splits + manifest
splits_dir = Path("../data/splits_k5_v1")   
splits_dir.mkdir(parents=True, exist_ok=True)
print(f"Splits will be saved to: {splits_dir.resolve()}")

Splits will be saved to: /donnees/home/elazzouzi/TracIA/use_cases/data/splits_k5_v1


## 2) Load data and add a stable `row_id`

We attach a `row_id` column (0..N-1) **once** on the original dataset. All transformed datasets MUST preserve this
`row_id` so splits remain valid and rows can be matched exactly.
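
A transformed dataset can be checked against this requirement before it is used with the fixed splits. The sketch below is a hypothetical helper (`row_id_problems` is not part of the notebook) run on toy frames. It assumes that a valid transformed dataset carries exactly the same set of `row_id` values as the original; a pipeline that legitimately drops rows would need a subset check instead.

```python
import pandas as pd


def row_id_problems(original, transformed):
    """Return a list of row_id problems in `transformed` (empty list means OK).

    Hypothetical helper for illustration; not part of the notebook.
    """
    if "row_id" not in transformed.columns:
        return ["transformed dataset has no 'row_id' column"]
    problems = []
    if transformed["row_id"].duplicated().any():
        problems.append("duplicate row_id values in transformed dataset")
    if set(transformed["row_id"]) != set(original["row_id"]):
        problems.append("row_id values differ from the original dataset")
    return problems


# Toy frames, for illustration only.
original = pd.DataFrame({"row_id": [0, 1, 2, 3], "age": [75.0, 55.0, 65.0, 50.0]})
# A transformation may reorder rows and alter values, as long as row_id survives.
transformed = pd.DataFrame({"row_id": [2, 0, 3, 1], "age": [64.8, 75.1, 50.2, 54.9]})

print(row_id_problems(original, transformed))  # prints []
```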


In [2]:
import pandas as pd

df = pd.read_csv(data_csv)
assert target in df.columns, f"Target column '{target}' not found in data."
for col in features:
    assert col in df.columns, f"Feature column '{col}' not found in data."

# Ensure a deterministic row_id  
if 'row_id' not in df.columns:
    df = df.reset_index(drop=True).copy()
    df['row_id'] = df.index
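else:
    # Enforce integer type
    df['row_id'] = df['row_id'].astype(int)
# row_id must identify each row uniquely, otherwise the splits are ambiguous
assert df['row_id'].is_unique, "row_id values must be unique."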

print(df.head())

# Optional: save a clean copy with row_id (owner's internal reference)
clean_with_ids = splits_dir / "heart_failure_clinical_records_dataset_with_row_id.csv"
df.to_csv(clean_with_ids, index=False)
print(f"Saved (owner‑only) copy with row_id to: {clean_with_ids}")

    age  anaemia  creatinine_phosphokinase  diabetes  ejection_fraction  \
0  75.0        0                       582         0                 20   
1  55.0        0                      7861         0                 38   
2  65.0        0                       146         0                 20   
3  50.0        1                       111         0                 20   
4  65.0        1                       160         1                 20   

   high_blood_pressure  platelets  serum_creatinine  serum_sodium  sex  \
0                    1  265000.00               1.9           130    1   
1                    0  263358.03               1.1           136    1   
2                    0  162000.00               1.3           129    1   
3                    0  210000.00               1.9           137    1   
4                    0  327000.00               2.7           116    0   

   smoking  time  DEATH_EVENT  row_id  
0        0     4            1       0  
1        0     6        

## 3) Create *fixed* stratified K‑fold splits and write CSVs

We store two CSV files **per fold**:
- `train_ids_foldX.csv` — a single column `row_id` with training row IDs
- `test_ids_foldX.csv` — a single column `row_id` with test row IDs

A `manifest.json` documents the split files and basic metadata.
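
On the consumer side, a fold can be rebuilt from these ID files by selecting rows on `row_id`. The sketch below uses a toy dataset and a temporary directory so that it runs on its own; the file names follow the `train_ids_foldX.csv` / `test_ids_foldX.csv` pattern described above, and everything else is made up for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy dataset with row_id, for illustration only.
df = pd.DataFrame({"row_id": range(6), "x": [10, 11, 12, 13, 14, 15]})

with tempfile.TemporaryDirectory() as tmp:
    splits_dir = Path(tmp)
    # Write ID files in the single-column layout described above.
    pd.DataFrame({"row_id": [0, 1, 3, 4]}).to_csv(
        splits_dir / "train_ids_fold1.csv", index=False)
    pd.DataFrame({"row_id": [2, 5]}).to_csv(
        splits_dir / "test_ids_fold1.csv", index=False)

    # Consumer side: read the IDs back.
    train_ids = pd.read_csv(splits_dir / "train_ids_fold1.csv")["row_id"]
    test_ids = pd.read_csv(splits_dir / "test_ids_fold1.csv")["row_id"]

# Select rows by row_id value, not by position in the frame.
train_df = df[df["row_id"].isin(train_ids)]
test_df = df[df["row_id"].isin(test_ids)]

print(len(train_df), len(test_df))  # prints 4 2
```

Selecting by `row_id` value rather than by position means the same ID files still pick the right rows when a transformed dataset arrives in a different row order.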


In [3]:
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=random_state)

splits_meta = []
y = df[target].values

for fold, (train_idx, test_idx) in enumerate(skf.split(df, y), start=1):
    # split() yields positional indices, so select with iloc rather than loc
    train_ids = df['row_id'].iloc[train_idx].to_frame()
    test_ids  = df['row_id'].iloc[test_idx].to_frame()

    train_file = splits_dir / f"train_ids_fold{fold}.csv"
    test_file  = splits_dir / f"test_ids_fold{fold}.csv"
    train_ids.to_csv(train_file, index=False)
    test_ids.to_csv(test_file, index=False)

    splits_meta.append({
        "fold": fold,
        "train_file": train_file.name,
        "test_file": test_file.name,
        "n_train": int(len(train_ids)),
        "n_test": int(len(test_ids))
    })

manifest = {
    "name": "Fixed Stratified K-Fold Splits",
    "k": k,
    "random_state": random_state,
    "target": target,
    "features": features,
    "rows": int(len(df)),
    "splits": splits_meta
}

import json
with open(splits_dir / "manifest.json", "w", encoding="utf-8") as f:
    json.dump(manifest, f, indent=2, ensure_ascii=False)

print(f"Created {k} folds. Manifest written to: {splits_dir / 'manifest.json'}")
splits_meta

Created 5 folds. Manifest written to: ../data/splits_k5_v1/manifest.json


[{'fold': 1,
  'train_file': 'train_ids_fold1.csv',
  'test_file': 'test_ids_fold1.csv',
  'n_train': 239,
  'n_test': 60},
 {'fold': 2,
  'train_file': 'train_ids_fold2.csv',
  'test_file': 'test_ids_fold2.csv',
  'n_train': 239,
  'n_test': 60},
 {'fold': 3,
  'train_file': 'train_ids_fold3.csv',
  'test_file': 'test_ids_fold3.csv',
  'n_train': 239,
  'n_test': 60},
 {'fold': 4,
  'train_file': 'train_ids_fold4.csv',
  'test_file': 'test_ids_fold4.csv',
  'n_train': 239,
  'n_test': 60},
 {'fold': 5,
  'train_file': 'train_ids_fold5.csv',
  'test_file': 'test_ids_fold5.csv',
  'n_train': 240,
  'n_test': 59}]
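
The fold sizes above are consistent with a proper partition: each fold's `n_train + n_test` equals 299, and the five test sets sum to 299. A stricter check compares the ID sets themselves. The sketch below runs such a check on toy labels so that it is self-contained; in this notebook's setting the ID sets would instead be read from the fold CSV files. All names and data here are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels, for illustration only (not the heart-failure data).
y = np.array([0] * 14 + [1] * 7)
row_ids = np.arange(len(y))
all_ids = set(row_ids.tolist())

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
folds = [
    (set(row_ids[train_idx].tolist()), set(row_ids[test_idx].tolist()))
    for train_idx, test_idx in skf.split(row_ids.reshape(-1, 1), y)
]

# 1) Within each fold, train and test share no rows and together cover all rows.
for train_ids, test_ids in folds:
    assert train_ids.isdisjoint(test_ids)
    assert train_ids | test_ids == all_ids

# 2) Across folds, every row appears in exactly one test set.
all_test_ids = [rid for _, test_ids in folds for rid in test_ids]
assert sorted(all_test_ids) == sorted(all_ids)

print("Split integrity checks passed.")
```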