In [None]:
# 03 — Methodology Overview (Public Synthesis)

This notebook outlines the D7.4 pipeline at a high level (no proprietary models).

## Scope (Phase 1 — MIT)
- Data normalization & validation
- Temporal parsing and derived features
- Public variant of “data explosion” for aligned multi-value cells
- Minimal EDA and documentation
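
As a rough illustration of the "temporal parsing and derived features" item, the sketch below parses a timestamp column and derives a few calendar parts with plain pandas. It is not the project's `preprocessing` code: the function name `parse_and_derive` and the output columns `hour`, `weekday`, `month`, and `is_weekend` are assumptions chosen for the example, and the sample rows are made up.

```python
import pandas as pd


def parse_and_derive(df, source_col, target_col="dt"):
    """Hypothetical helper: parse a timestamp column and add simple time parts."""
    out = df.copy()
    # Unparseable values become NaT instead of raising, so they can be counted later.
    out[target_col] = pd.to_datetime(out[source_col], errors="coerce", utc=True)
    out["hour"] = out[target_col].dt.hour
    out["weekday"] = out[target_col].dt.dayofweek  # Monday=0 ... Sunday=6
    out["month"] = out[target_col].dt.month
    out["is_weekend"] = out["weekday"] >= 5
    return out


# Made-up sample: one valid timestamp and one invalid value.
sample = pd.DataFrame({"Date_and_hour": ["2019-11-30 01:50:00", "not a date"]})
parsed = parse_and_derive(sample, "Date_and_hour")
n_unparsed = int(parsed["dt"].isna().sum())  # simple validation signal
```

Counting `NaT` values after parsing is one cheap validation check; stricter rules (allowed date ranges, required columns) would sit on top of it.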

## Planned (Phase 2 — EUPL/GPL, post-publication)
- Temporal normalization & holiday calendars
- Learned encoders (token reconciliation, rare-token grouping)
- Clustering & severity prediction pipelines (about 82% accuracy in internal evaluation)
- Schema contracts & cross-table integrity checks
- Reproducible reporting templates


In [None]:
## Data Explosion (Public Variant)
**Idea:** when a row has multi-value fields (e.g., `"Seat Belt,Helmet"`), we:
1. Split each selected column on the separator (a comma here) into a list of tokens.
2. Align list lengths within each row, padding shorter lists with `None` where needed.
3. Zip the lists into tuples and **explode** the row into one row per tuple.
4. Optionally, one-hot encode the frequent tokens for simple baselines.
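
Steps 1 to 3 can be sketched with plain pandas as follows. This is a minimal illustration under stated assumptions, not the implementation behind `explode_aligned_columns`: the function name `explode_aligned_sketch` is hypothetical, missing cells are assumed to become a single `None`, and the sample values are made up. Exploding several columns at once requires pandas 1.3 or later.

```python
import pandas as pd


def explode_aligned_sketch(df, columns, sep=","):
    """Hypothetical stand-in for the public explode step (not the project's helper)."""
    out = df.reset_index(drop=True)

    # Step 1: split each chosen column into a list of stripped tokens.
    # Missing or non-string cells become a single-element list [None].
    split = {
        col: [
            [tok.strip() for tok in val.split(sep)] if isinstance(val, str) else [None]
            for val in out[col]
        ]
        for col in columns
    }

    # Step 2: pad every list in a row to the length of that row's longest list.
    widths = [max(len(split[col][i]) for col in columns) for i in range(len(out))]
    for col in columns:
        padded = [lst + [None] * (w - len(lst)) for lst, w in zip(split[col], widths)]
        out[col] = pd.Series(padded, dtype=object)

    # Step 3: explode the aligned lists together, one output row per position.
    return out.explode(columns, ignore_index=True)


# Made-up sample rows with unequal list lengths.
sample = pd.DataFrame(
    {
        "id": [1, 2],
        "Security_measures": ["Seat Belt,Helmet", "Airbag"],
        "User_of_security_measures": ["Yes", "Yes,No"],
    }
)
exploded = explode_aligned_sketch(
    sample, ["Security_measures", "User_of_security_measures"]
)
```

Padding before exploding matters because a multi-column explode expects the lists in a row to have the same length; without it, rows with mismatched counts would raise an error.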

The public variant keeps Phase 1 safe and reproducible without disclosing the advanced heuristics reserved for Phase 2.
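
For the optional one-hot step, a simple baseline could keep only tokens that appear often enough and encode those. The sketch below is an assumption-laden example rather than the project's encoder: the name `one_hot_frequent_tokens`, the `min_count` threshold, and the `tok` prefix are all hypothetical, and the token values are made up.

```python
import pandas as pd


def one_hot_frequent_tokens(tokens, min_count=2, prefix="tok"):
    """Hypothetical baseline: one-hot encode tokens seen at least `min_count` times."""
    counts = tokens.value_counts()  # missing values are ignored by default
    frequent = counts[counts >= min_count].index
    # Rare tokens are blanked out, so they produce an all-zero row below.
    kept = tokens.where(tokens.isin(frequent))
    return pd.get_dummies(kept, prefix=prefix, dtype=int)


# Made-up exploded token column.
tokens = pd.Series(["Seat Belt", "Helmet", "Seat Belt", "Airbag", None, "Helmet"])
encoded = one_hot_frequent_tokens(tokens, min_count=2)
```

Rows holding a rare or missing token end up with zeros in every indicator column, which keeps the feature matrix narrow on small samples.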


In [None]:
import pandas as pd

# Project-local helpers from this repository's preprocessing module.
from preprocessing import (
    add_accident_time_parts,
    explode_aligned_columns,
    parse_datetime_column,
)

CSV_PATH = "../data/accidents-corporels-de-la-circulation-millesime_eng_columns_selected_data_translated_sample.csv"
df = pd.read_csv(CSV_PATH)

# Parse the raw timestamp into a "dt" column (UTC), then derive time parts from it.
df = parse_datetime_column(df, "Date_and_hour", "dt", utc=True)
df = add_accident_time_parts(df, "dt")

# Explode the two aligned multi-value columns and preview the result.
cols = ["Security_measures", "User_of_security_measures"]
demo = explode_aligned_columns(df, cols, sep=",")
demo[cols].head(8)


In [None]:
## Next steps
- Replace the sample CSV with larger BAAC subsets as permitted.
- Track evaluation-ready artifacts (data splits, metrics) in Phase 2.
- Release full ML/DL components after publication under EUPL/GPL.
