# 00 - Data Preparation

**Purpose**: This notebook serves as the foundational step in the `early-markers` analysis pipeline. It is responsible for loading the raw data, performing essential cleaning and transformations, and structuring the dataset for all subsequent exploratory data analysis (EDA), feature selection, and modeling tasks.

**Inputs**:
- `features_merged.pkl`: The raw feature data stored as a Pandas DataFrame pickle file, located in `early_markers/emmacp_metrics/`.

**Outputs**:
- An in-memory, transformed Polars DataFrame. This notebook does not save a new file; its primary output is the cleaned data used to instantiate the `BayesianData` class in other notebooks.

### Key Transformation Steps:
1.  **Load Raw Data**: Reads the initial pickle file.
2.  **Filter Invalid Records**: Removes known bad data points.
3.  **Encode Risk**: Converts the multi-level `risk_raw` score into a binary `risk` classification (0: Normal, 1: At-Risk).
4.  **Assign Category**: Segregates data into `category` 1 (Training) and 2 (Testing) based on risk and original category assignments.
5.  **Unify Feature Column**: Creates a single, descriptive `feature` name from `part` and `feature_name`.
6.  **Integrate Age**: Reshapes the data to include `age_in_weeks` as a feature for each infant.

### 1.1 Imports and Initial Correlation Analysis

This cell performs the following actions:
- **Imports Libraries**: Imports necessary libraries such as `polars` for data manipulation and modules from the local `early_markers` codebase.
- **Instantiates `BayesianData`**: Creates an instance of the main data handling class, which loads and preprocesses the data in its constructor.
- **Computes Correlation Matrix**: Calculates the correlation matrix of `risk` and the `FEATURES` columns on the wide-format DataFrame (after dropping `Shoulder_lrCorr_x`) to identify highly correlated features.
- **Identifies Features to Drop**: Establishes two criteria for feature removal:
  1. `hi_corrs`: Features whose absolute pairwise correlation with another feature is at least `CORR_MAX` (0.9).
  2. `lo_risk_corrs`: Features whose absolute correlation with the `risk` target variable is below `CORR_MIN` (0.1).
- **Consolidates Drops**: For each entry in `hi_corrs`, selects either the feature or its correlates for removal, adds every feature in `lo_risk_corrs`, and de-duplicates the result into the final `drops` list to be excluded from modeling.
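
As a rough illustration of the two thresholds, the sketch below applies them to a small hand-written correlation table in plain Python, mirroring the comparisons used in the cell's code (`>=` for `CORR_MAX`, `<` for `CORR_MIN`, both on absolute values). The feature names `feat_a`, `feat_b`, `feat_c`, all correlation values, and the helper `screen_features` are invented for this example. The notebook cell itself works on the output of `DataFrame.corr()` and additionally decides which member of a highly correlated pair to drop.

```python
CORR_MIN = 0.1
CORR_MAX = 0.9

# Hypothetical symmetric correlation table: corr[a][b] is corr(a, b).
corr = {
    "risk":   {"risk": 1.0,  "feat_a": 0.45, "feat_b": 0.40, "feat_c": 0.05},
    "feat_a": {"risk": 0.45, "feat_a": 1.0,  "feat_b": 0.95, "feat_c": 0.20},
    "feat_b": {"risk": 0.40, "feat_a": 0.95, "feat_b": 1.0,  "feat_c": 0.10},
    "feat_c": {"risk": 0.05, "feat_a": 0.20, "feat_b": 0.10, "feat_c": 1.0},
}


def screen_features(corr, target="risk"):
    """Return (highly correlated feature pairs, features weakly tied to target)."""
    features = [name for name in corr if name != target]
    # Distinct feature pairs whose absolute correlation reaches CORR_MAX.
    hi_pairs = [
        (a, b)
        for i, a in enumerate(features)
        for b in features[i + 1:]
        if abs(corr[a][b]) >= CORR_MAX
    ]
    # Features whose absolute correlation with the target is below CORR_MIN.
    lo_target = [f for f in features if abs(corr[target][f]) < CORR_MIN]
    return hi_pairs, lo_target


hi_pairs, lo_target = screen_features(corr)
print(hi_pairs)   # [('feat_a', 'feat_b')]
print(lo_target)  # ['feat_c']
```

Here `feat_a` and `feat_b` are flagged as redundant with each other, while `feat_c` is flagged for carrying almost no signal about `risk`.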

In [None]:
%reload_ext autoreload
%autoreload 2

import polars as pl
from polars import DataFrame
import polars.selectors as cs

from early_markers.cribsy.common.constants import CSV_DIR, FEATURES
from early_markers.cribsy.common.bayes import BayesianData

CORR_MIN = 0.1
CORR_MAX = 0.9

bd = BayesianData()
df_base = bd.base_wide.drop("Shoulder_lrCorr_x")
df = df_base.select(["risk"] + FEATURES).corr().with_columns(
    feature=pl.Series(["risk"] + FEATURES)
)
df_corr = df.select([df.columns[-1]] + df.columns[:-1])

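# Map each feature to the correlates whose |correlation| is at least CORR_MAX.
hi_corrs = {}
for row in df_corr.rows(named=True):
    for k, v in row.items():
        # Skip the label column and the feature's correlation with itself.
        if k == "feature" or k == row["feature"]:
            continue
        # Skip correlates that already have their own entry, so a
        # symmetric pair is recorded only once.
        if k in hi_corrs:
            continue
        if abs(v) >= CORR_MAX:
            hi_corrs.setdefault(row["feature"], []).append(
                {
                    "correlate": k,
                    "value": v,
                }
            )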

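# Features whose |correlation| with risk is below CORR_MIN.
lo_risk_corrs = {}
rows = df_corr.rows(named=True)
# "risk" is the first selected column, so the first row holds every
# feature's correlation with risk.
risk = rows[0]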

for k, v in risk.items():
    if k in ["feature", "risk"]:
        continue
    if abs(v) < CORR_MIN:
        lo_risk_corrs[k] = v

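# Decide which member of each highly correlated group to drop.
losers = []
for k, v in hi_corrs.items():
    dropped_correlates = []
    for v2 in v:
        # NOTE: this compares |corr(k, risk)| with the signed pairwise
        # correlation v2["value"], not with the correlate's own
        # correlation to risk.
        if abs(risk[k]) > v2["value"]:
            dropped_correlates.append(v2["correlate"])
    if len(dropped_correlates) > 0:
        losers.extend(dropped_correlates)
    else:
        losers.append(k)

# Features weakly correlated with risk are always dropped.
losers.extend(lo_risk_corrs.keys())
drops = list(set(losers))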

# drops = ['Shoulder_IQR_vel_angle', 'Ankle_IQRaccx', 'Wrist_IQRaccx', 'Ankle_IQRvelx', 'Knee_IQR_vel_angle', 'Elbow_IQR_acc_angle', 'Shoulder_mean_angle', 'Ankle_IQRaccy', 'Shoulder_lrCorr_angle', 'Hip_entropy_angle', 'Elbow_mean_angle', 'Eye_lrCorr_x', 'Shoulder_entropy_angle', 'Knee_entropy_angle', 'Shoulder_IQR_acc_angle', 'Ankle_lrCorr_x', 'Hip_lrCorr_angle', 'Wrist_meanent', 'Wrist_IQRvelx', 'Wrist_mediany', 'Ankle_IQRvely', 'Shoulder_stdev_angle', 'Hip_IQR_acc_angle', 'Elbow_stdev_angle', 'Knee_IQR_acc_angle', 'Ankle_meanent', 'Ankle_medianx', 'Wrist_IQRy', 'Knee_lrCorr_angle', 'Hip_IQR_vel_angle', 'Elbow_IQR_vel_angle', 'Wrist_IQRaccy', 'Wrist_IQRvely', 'Elbow_lrCorr_x']

### 1.2 Data Loading and Transformation

This cell executes the core data preparation logic. It reads the raw data from the pickle file and applies the following `polars` transformations in a single chain:
1.  **Load Data**: Reads the `features_merged.pkl` file (via the `RAW_DATA` constant) into a Pandas DataFrame and immediately converts it to a Polars DataFrame.
2.  **Filter**: Removes known invalid rows, namely those where `part == 'umber'` or `infant == 'clin_100_6'`.
3.  **Rename**: Renames `risk` to `risk_raw` and `Value` to `value`.
4.  **Encode Risk**: Creates the binary `risk` column from `risk_raw`: 0 when `risk_raw <= 1`, 1 otherwise.
5.  **Assign Category**: Overwrites the `category` column: 1 (training) when the original `category` is 0 or the new `risk` is 0, and 2 (testing) otherwise.
6.  **Create Feature Name**: Concatenates `part` and `feature_name`, separated by an underscore, to create a single identifier for each feature.
7.  **Drop Columns**: Removes original columns that are now redundant (`part`, `feature_name`, `age_bracket`).
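
The risk and category rules can be restated in plain Python as a small truth table. This is only a sketch of the rule, not the `polars` implementation: the helper name `encode` and the sample input values are invented for illustration, and null handling is ignored.

```python
def encode(risk_raw, original_category):
    """Return (risk, category) for one record, following the rule above."""
    # Binary risk: scores of 1 or less are normal (0), anything higher is at-risk (1).
    risk = 0 if risk_raw <= 1 else 1
    # Category 1 (training) if the original category is 0 or the infant is
    # normal-risk; everything else goes to category 2 (testing).
    category = 1 if (original_category == 0 or risk == 0) else 2
    return risk, category


# Hypothetical (risk_raw, original category) combinations.
for risk_raw, original_category in [(0, 0), (1, 1), (2, 0), (3, 1)]:
    print(risk_raw, original_category, encode(risk_raw, original_category))
# 0 0 (0, 1)
# 1 1 (0, 1)
# 2 0 (1, 1)
# 3 1 (1, 2)
```

Only records that are both at-risk and outside original category 0 end up in category 2.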

In [None]:
import pandas as pd

from early_markers.cribsy.common.constants import RAW_DATA


pd_raw = pd.read_pickle(RAW_DATA)
df = (
    pl.from_pandas(pd_raw)
    .filter(pl.col("part") != "umber", pl.col("infant") != "clin_100_6")
    .rename({"risk": "risk_raw", "Value": "value"})
    .with_columns(
        risk=pl.when(pl.col("risk_raw") <= 1).then(0)
        .otherwise(1),
    ).with_columns(
        category=pl.when((pl.col("category") == 0) | (pl.col("risk") == 0)).then(1)
        .otherwise(2),
        feature=pl.concat_str(pl.col("part"), pl.col("feature_name"), separator="_")
    ).drop(["part", "feature_name", "age_bracket"])
)

### 1.3 Reshape Age as a Feature

This cell reshapes the data to treat `age_in_weeks` as its own feature, consistent with all other movement features.
1.  **Extract Unique Age**: Creates a temporary DataFrame (`df2`) containing the unique combinations of `infant`, `category`, `risk`, and `age_in_weeks`.
2.  **Create Age Feature**: Adds a new `feature` column with the literal value `"age_in_weeks"`, copies the age value into the `value` column, and drops the original `age_in_weeks` column.
3.  **Stack DataFrames**: Vertically stacks this new age-based DataFrame with the `infant`, `category`, `risk`, `feature`, and `value` columns of the original feature DataFrame.
4.  **Sort**: Sorts the combined DataFrame by infant and feature name. The result is displayed as the cell output and is not assigned to a variable.
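
To make the reshaping concrete, the sketch below walks through the same four steps on a few hand-written records in plain Python. The infant IDs and all values are invented. The cast to `float` reflects an assumption that `value` is a floating-point column, since stacked frames need matching column types.

```python
# Hypothetical long-format records, one row per infant and feature.
rows = [
    {"infant": "inf_1", "category": 1, "risk": 0, "age_in_weeks": 12,
     "feature": "Wrist_IQRy", "value": 0.8},
    {"infant": "inf_1", "category": 1, "risk": 0, "age_in_weeks": 12,
     "feature": "Ankle_medianx", "value": 0.3},
    {"infant": "inf_2", "category": 2, "risk": 1, "age_in_weeks": 15,
     "feature": "Wrist_IQRy", "value": 1.1},
]
id_cols = ("infant", "category", "risk")

# Steps 1-2: one "age_in_weeks" feature row per unique infant record.
unique_ages = sorted(
    {(r["infant"], r["category"], r["risk"], r["age_in_weeks"]) for r in rows}
)
age_rows = [
    {"infant": infant, "category": category, "risk": risk,
     "feature": "age_in_weeks", "value": float(age)}
    for infant, category, risk, age in unique_ages
]

# Step 3: keep the shared columns and append the age rows.
feature_rows = [{k: r[k] for k in (*id_cols, "feature", "value")} for r in rows]

# Step 4: sort by infant, then feature name.
long_rows = sorted(feature_rows + age_rows, key=lambda r: (r["infant"], r["feature"]))
for r in long_rows:
    print(r["infant"], r["feature"], r["value"])
# inf_1 Ankle_medianx 0.3
# inf_1 Wrist_IQRy 0.8
# inf_1 age_in_weeks 12.0
# inf_2 Wrist_IQRy 1.1
# inf_2 age_in_weeks 15.0
```

Each infant gains exactly one extra row, so age sits alongside the movement features in the same long format.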

In [None]:
df2 = (
    df.select(["infant", "category", "risk", "age_in_weeks"])
    .unique()
    .with_columns(
        feature=pl.lit("age_in_weeks"),
        value=pl.col("age_in_weeks")
    )
).drop("age_in_weeks")

df.select(["infant", "category", "risk", "feature", "value"]).vstack(df2).sort(["infant", "feature"])
