# 01 - Exploratory Data Analysis

**Purpose**: This notebook performs an initial exploratory data analysis (EDA) on the `early-markers` dataset. It calculates descriptive statistics, summarizes data distributions, and prepares the data for more detailed profiling.

**Inputs**:
- The merged feature data, accessed via the legacy `get_merged_dataframe()` function from `early_markers.cribsy.common.hold.data`.

**Outputs**:
- An in-memory dictionary (`eda`) containing key statistics about the dataset.
- In-memory Polars DataFrames (`df_train`, `df_test`) representing the wide-format training and testing sets.
- This notebook was previously used to generate `ydata-profiling` reports for the training and testing sets (e.g., `report_train.html`); that code is now commented out.

### Key Steps:
1.  **Load Data**: Fetches the complete, merged dataset.
2.  **Calculate Statistics**: Counts unique infants, unique features (distinct `part`/`feature_name` pairs), and total values (rows). It also breaks down infant counts by category (train/test), age bracket, and risk level.
3.  **Reshape Data**: Pivots the data from a long to a wide format, where each feature becomes a separate column.
4.  **Split Data**: Filters the wide-format DataFrame to create distinct training (`category == 0`) and testing (`category == 1`) sets.

### 2.1 Imports, Statistics, and Data Preparation

This cell consolidates all the EDA steps into a single block of code:
- **Imports**: Brings in `polars` and the `get_merged_dataframe` loader from the local codebase.
- **Data Loading**: Calls `get_merged_dataframe()` to load the initial dataset.
- **Descriptive Statistics**: Populates the `eda` dictionary by executing a series of `polars` queries to count and summarize the data across various dimensions (risk, category, age, etc.).
- **Data Reshaping**: Builds a `feature` label from `part` and `feature_name`, then converts the long-format DataFrame into a wide format with the `pivot` method so that each feature becomes a column, the layout machine learning models expect.
- **Train/Test Split**: Creates `df_train` and `df_test` by filtering the wide DataFrame based on the `category` column. These are the final artifacts prepared by this notebook.

In [None]:
%reload_ext autoreload
%autoreload 2

import polars as pl
from polars import DataFrame

from early_markers.cribsy.common.hold.data import get_merged_dataframe

df_merged = get_merged_dataframe()

# Summary counts. category: 0 = train, 1 = test; age_bracket: 0 = "lt10", 1 = "ge10" (per the key names).
eda = {
    "n_kids": df_merged.select("infant").n_unique("infant"),
    "n_features": df_merged.n_unique(["part", "feature_name"]),
    "n_values": df_merged.height,
    "n_kids_train": df_merged.filter(pl.col("category") == 0).select("infant").n_unique("infant"),
    "n_kids_test": df_merged.filter(pl.col("category") == 1).select("infant").n_unique("infant"),
    "n_kids_train_lt10": df_merged.filter(pl.col("category") == 0, pl.col("age_bracket") == 0).select("infant").n_unique("infant"),
    "n_kids_train_ge10": df_merged.filter(pl.col("category") == 0, pl.col("age_bracket") == 1).select("infant").n_unique("infant"),
    "n_kids_test_lt10": df_merged.filter(pl.col("category") == 1, pl.col("age_bracket") == 0).select("infant").n_unique("infant"),
    "n_kids_test_ge10": df_merged.filter(pl.col("category") == 1, pl.col("age_bracket") == 1).select("infant").n_unique("infant"),
    "n_kids_risk": {t[0]: t[1] for t in df_merged.select(["risk", "infant"]).group_by(["risk"]).n_unique().rows()},
    "n_kids_train_risk": {t[0]: t[1] for t in df_merged.filter(pl.col("category") == 0).select(["risk", "infant"]).group_by(["risk"]).n_unique().rows()},
    "n_kids_test_risk": {t[0]: t[1] for t in df_merged.filter(pl.col("category") == 1).select(["risk", "infant"]).group_by(["risk"]).n_unique().rows()},
    "n_kids_risk_raw": {t[0]: t[1] for t in df_merged.select(["risk_raw", "infant"]).group_by(["risk_raw"]).n_unique().rows()},
    "n_kids_train_risk_raw": {t[0]: t[1] for t in df_merged.filter(pl.col("category") == 0).select(["risk_raw", "infant"]).group_by(["risk_raw"]).n_unique().rows()},
    "n_kids_test_risk_raw": {t[0]: t[1] for t in df_merged.filter(pl.col("category") == 1).select(["risk_raw", "infant"]).group_by(["risk_raw"]).n_unique().rows()},
}
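
# Reuse the frame loaded above instead of loading it a second time.
# Build one label per (part, feature_name) pair and drop rows whose part is "umber".
df_long_all = (
    df_merged.rename({"Value": "value"})
    .with_columns(feature=pl.concat_str("part", "feature_name", separator="_"))
    .filter(pl.col("part") != "umber")
)

# Long -> wide: one row per (infant, risk_raw, category), one column per feature.
df_wide_all: DataFrame = df_long_all.pivot(on="feature", index=["infant", "risk_raw", "category"], values=["value"])

# category 0 is the training set, category 1 the testing set.
df_train = df_wide_all.filter(pl.col("category") == 0)
df_test = df_wide_all.filter(pl.col("category") == 1)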

