# CAS (Cluster-Aware Severity) — End‑to‑End (data/cas layout)

This notebook reproduces the CAS pipeline using the repo layout:

```
k-diagram/
  data/
    cas/
      raw/
      preprocessed/
      modeling_results_ok/
      outputs/
      Readme.md
  examples/
    cas/
      scripts/
        prepare_cas_datasets.py
        preprocessing_cas_data.py
        cas_modeling.py
        results_config.py
        results_R1.py
        results_R2.py
        results_R3.py
        results_R4.py
        results_R5.py
        results_R6.py
        results_R7.py
        results_R8.py
        results_R9.py
      notebooks/
        CAS_end_to_end.ipynb  ← (this file)
      Readme.md
```

Run the notebook top‑to‑bottom. All artifacts live under `data/cas/`.


## 0) Paths & setup
Resolves the repository root when the notebook is opened from various
locations, then ensures `data/cas/` subfolders exist.

In [None]:
import os
import sys
from pathlib import Path


def find_repo_root(start: Path) -> Path:
    markers = ("pyproject.toml", ".git", "README.md")
    p = start.resolve()
    for _ in range(6):
        if (
            any((p / m).exists() for m in markers)
            and (p / "examples").exists()
        ):
            return p
        if p.parent == p:
            break
        p = p.parent
    return start.resolve()


repo_root = find_repo_root(Path.cwd())
DATA = repo_root / "data" / "cas"
RAW = DATA / "raw"
PRE = DATA / "preprocessed"
RESULTS = DATA / "modeling_results_ok"
FIG_OUT = DATA / "outputs"
SCRIPTS = repo_root / "examples" / "cas" / "scripts"

for p in [RAW, PRE, RESULTS, FIG_OUT]:
    p.mkdir(parents=True, exist_ok=True)

print("Repo root:", repo_root)
print("Scripts dir:", SCRIPTS)
print("Data root:", DATA)
print(" ├─ raw:", RAW)
print(" ├─ preprocessed:", PRE)
print(" ├─ modeling_results_ok:", RESULTS)
print(" └─ outputs:", FIG_OUT)

%matplotlib inline

Repo root: F:\repositories\k-diagram
Scripts dir: F:\repositories\k-diagram\examples\cas\scripts
Data root: F:\repositories\k-diagram\data\cas
 ├─ raw: F:\repositories\k-diagram\data\cas\raw
 ├─ preprocessed: F:\repositories\k-diagram\data\cas\preprocessed
 ├─ modeling_results_ok: F:\repositories\k-diagram\data\cas\modeling_results_ok
 └─ outputs: F:\repositories\k-diagram\data\cas\outputs
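
To see how the marker-based search behaves, the sketch below builds a throwaway directory tree and calls `find_repo_root` from a nested folder. The function body is repeated only so the block runs on its own; the `fake-repo` tree is an assumption made for illustration.

```python
import tempfile
from pathlib import Path


def find_repo_root(start: Path) -> Path:
    # Same logic as the setup cell, repeated so this sketch is self-contained.
    markers = ("pyproject.toml", ".git", "README.md")
    p = start.resolve()
    for _ in range(6):
        if any((p / m).exists() for m in markers) and (p / "examples").exists():
            return p
        if p.parent == p:
            break
        p = p.parent
    return start.resolve()


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: a marker file plus examples/ at the root.
    root = Path(tmp).resolve() / "fake-repo"
    nested = root / "examples" / "cas" / "notebooks"
    nested.mkdir(parents=True)
    (root / "pyproject.toml").write_text("")

    # Start three levels below the root, as the notebook would.
    found = find_repo_root(nested)
    print("Found root:", found.name)
```

The search climbs at most six levels and needs both a marker file and an `examples/` folder at the same level. If neither is found it returns the starting directory, so the printed paths are worth checking before running later cells.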


## 1) (Optional) Install dependencies
Uncomment the install commands below if you are running in a fresh environment.
If you install the optional TensorFlow backend, `tensorflow==2.15` is preferred.

In [1]:
# Make sure all dependencies are installed: numpy, pandas, matplotlib,
# lightgbm, statsmodels and scikit-learn. Install any that are missing.

# !pip install -U pyarrow fastparquet
# # Optional (XTFT & backend)
# # !pip install fusionlab-learn tensorflow


In [None]:
# Verify what kernel you’re using

import importlib

import pandas as pd

print("python:", sys.executable)
print("pandas:", pd.__version__)
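# Use import names here: the scikit-learn distribution is imported as "sklearn".
for mod in (
    "pyarrow",
    "fastparquet",
    "numpy",
    "pandas",
    "matplotlib",
    "lightgbm",
    "statsmodels",
    "sklearn",
):
    try:
        m = importlib.import_module(mod)
        # Not every module is guaranteed to expose __version__.
        print(mod, "OK:", getattr(m, "__version__", "unknown"))
    except Exception as e:
        print(mod, "FAILED ->", e)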

In [None]:
# Install into THIS kernel’s environment.
# If any import above failed, reinstall with the kernel’s own interpreter.
# Uncomment the two pip lines below only if packages such as pyarrow or
# fastparquet are missing from the selected kernel.

import sys

# !{sys.executable} -m pip uninstall -y pandas fastparquet pyarrow scipy scikit-learn lightgbm
# !{sys.executable} -m pip install -U --no-cache-dir pandas fastparquet pyarrow lightgbm scipy scikit-learn

## 2) Prepare raw → cleaned domain CSV/Parquet
Requires `examples/cas/scripts/prepare_cas_datasets.py` with a function
`prepare_all(wind_csv, hydro_csv, subs_csv, out_dir=...)`. It writes
cleaned domain files into `data/cas/preprocessed/`.

In [None]:
from importlib import util as _imp_util


def _load_py(path: Path):
    spec = _imp_util.spec_from_file_location(path.stem, str(path))
    mod = _imp_util.module_from_spec(spec)
    assert spec.loader is not None
    spec.loader.exec_module(mod)  # type: ignore
    return mod


prep_path = SCRIPTS / "prepare_cas_datasets.py"
assert prep_path.exists(), f"Missing: {prep_path}"
prep = _load_py(prep_path)

wind_csv = RAW / "gefcom_hourly.csv"  # or your own name
hydro_csv = RAW / "camels_timeseries.csv"  # or your own name
subs_csv = RAW / "egms_point.csv"  # or your own name

assert wind_csv.exists() and hydro_csv.exists() and subs_csv.exists(), (
    "Place the required CSVs in data/cas/raw/."
)

prepared = prep.prepare_all(wind_csv, hydro_csv, subs_csv, out_dir=PRE)
prepared
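
The `assert` above only reports that something is missing, not which file. A minimal sketch of a check that names each absent input follows. `missing_inputs` is a hypothetical helper, not part of `prepare_cas_datasets.py`, and the demo uses a temporary folder rather than `data/cas/raw/`.

```python
import tempfile
from pathlib import Path


def missing_inputs(raw_dir: Path, names: list[str]) -> list[str]:
    """Return the file names from `names` that are absent in `raw_dir`."""
    return [n for n in names if not (raw_dir / n).is_file()]


# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp)
    (raw / "gefcom_hourly.csv").write_text("placeholder\n")

    expected = ["gefcom_hourly.csv", "camels_timeseries.csv", "egms_point.csv"]
    missing = missing_inputs(raw, expected)
    print("Missing:", missing)
```

In the notebook, the same helper could be called with `RAW` and the three file names before `prep.prepare_all(...)`, raising `FileNotFoundError` when the returned list is non-empty.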

## 3) Build supervised frames per domain
Uses `examples/cas/scripts/preprocessing_cas_data.py` to produce
`supervised_long_*` files in `data/cas/preprocessed/` (CSV/Parquet).

In [None]:
preproc_path = SCRIPTS / "preprocessing_cas_data.py"
assert preproc_path.exists(), f"Missing: {preproc_path}"
pre = _load_py(preproc_path)

prefer_parquet = True

w_manifest = pre.preprocess_wind(
    PRE / "wind_clean.csv",
    outdir=PRE,
    suffix="_wind",
    prefer_parquet=prefer_parquet,
)
h_manifest = pre.preprocess_hydro(
    PRE / "hydro_clean.csv",
    outdir=PRE,
    suffix="_hydro",
    prefer_parquet=prefer_parquet,
)
s_manifest = pre.preprocess_subsidence(
    PRE / "subsidence_clean.csv",
    outdir=PRE,
    suffix="_subsidence",
    prefer_parquet=prefer_parquet,
)
w_manifest, h_manifest, s_manifest
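
Because the supervised frames may be written as Parquet or CSV, downstream cells need a reader that tolerates either. The sketch below shows one way to do that. `read_supervised` is a hypothetical helper, and the stem `supervised_long_wind` is an assumed file name inferred from the `supervised_long_*` pattern and the `_wind` suffix; check the returned manifests for the real paths.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_supervised(pre_dir: Path, stem: str) -> pd.DataFrame:
    """Load `<stem>.parquet` if present and readable, else `<stem>.csv`."""
    pq = pre_dir / f"{stem}.parquet"
    if pq.exists():
        try:
            return pd.read_parquet(pq)
        except ImportError:
            pass  # no Parquet engine installed; fall back to CSV
    return pd.read_csv(pre_dir / f"{stem}.csv")


# Demo with a CSV only, so no Parquet engine is needed to run this.
with tempfile.TemporaryDirectory() as tmp:
    d = Path(tmp)
    pd.DataFrame({"y": [1.0, 2.0]}).to_csv(
        d / "supervised_long_wind.csv", index=False
    )
    df = read_supervised(d, "supervised_long_wind")
    print(df.shape)
```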

## 4) Train & export predictions/metrics to `data/cas/modeling_results_ok/`
Requires `examples/cas/scripts/cas_modeling.py` with `run_domain(domain)`.
We repoint its path constants (if present) into `data/cas/`, run all
domains, and aggregate to `metrics_all_domains.csv`.

In [None]:
import pandas as pd

mdl_path = SCRIPTS / "cas_modeling.py"
assert mdl_path.exists(), f"Missing: {mdl_path}"
mdl = _load_py(mdl_path)

# Repoint if the module exposes these globals
if hasattr(mdl, "BASE_DIR"):
    mdl.BASE_DIR = DATA
if hasattr(mdl, "PQT_DIR"):
    mdl.PQT_DIR = PRE
if hasattr(mdl, "CSV_DIR"):
    mdl.CSV_DIR = PRE
if hasattr(mdl, "OUT_DIR"):
    mdl.OUT_DIR = RESULTS

RESULTS.mkdir(parents=True, exist_ok=True)
all_mets = []
for dom in ("hydro", "subsidence", "wind"):
    _pred_df, m = mdl.run_domain(dom)
    all_mets.append(m)

combo = pd.concat(all_mets, ignore_index=True)
combo.to_csv(RESULTS / "metrics_all_domains.csv", index=False)
combo.head()
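
The four `hasattr` checks above follow one pattern: overwrite a path constant only when the module already defines it. A compact sketch of that pattern is shown below. `repoint` is a hypothetical helper, and `fake_mdl` stands in for the loaded `cas_modeling` module, whose real attributes are not confirmed here.

```python
from pathlib import Path
from types import SimpleNamespace


def repoint(module, **paths) -> list[str]:
    """Set each attribute only if `module` already has it; return names changed."""
    changed = []
    for name, value in paths.items():
        if hasattr(module, name):
            setattr(module, name, value)
            changed.append(name)
    return changed


# Stand-in for the loaded script module (attribute values are made up).
fake_mdl = SimpleNamespace(BASE_DIR=Path("old"), OUT_DIR=Path("old/out"))

changed = repoint(
    fake_mdl,
    BASE_DIR=Path("data/cas"),
    PQT_DIR=Path("data/cas/preprocessed"),  # absent on fake_mdl, so skipped
    OUT_DIR=Path("data/cas/modeling_results_ok"),
)
print(changed)
```

Repointing a module global only affects code that reads the global when it is called. Any path the script derived from `BASE_DIR` at import time keeps its old value, so it may be worth printing the module's path constants before running `run_domain`.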

## 5) Build paper figures/tables into `data/cas/outputs/`
Runs plotting scripts with `BASE_DIR = DATA` so they read from
`modeling_results_ok/` and save images/tables under `outputs/`.

In [None]:
import runpy

# results_R1.py ... results_R9.py, matching the scripts listed in the repo layout.
plots = [SCRIPTS / f"results_R{n}.py" for n in range(1, 10)]
plots += [SCRIPTS / "results_config.py"]

for ps in plots:
    assert ps.exists(), f"Missing plotting script: {ps}"
    runpy.run_path(str(ps), init_globals={"BASE_DIR": DATA})

sorted(FIG_OUT.glob("*"))[:6]  # preview a few outputs
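
`runpy.run_path` pre-populates the script's globals with `init_globals`, but a script that assigns `BASE_DIR` unconditionally will overwrite the injected value. The sketch below uses a hypothetical `demo_plot.py`, not one of the `results_R*.py` scripts, to show a script that honours an injected `BASE_DIR` and falls back to a default otherwise.

```python
import runpy
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "demo_plot.py"
    # The script reads BASE_DIR from its globals if the caller injected it.
    script.write_text(
        "from pathlib import Path\n"
        "BASE_DIR = Path(globals().get('BASE_DIR', 'fallback'))\n"
        "OUT = BASE_DIR / 'outputs'\n"
    )

    # run_path returns the script's resulting globals as a dict.
    injected = runpy.run_path(str(script), init_globals={"BASE_DIR": "data/cas"})
    default = runpy.run_path(str(script))

print(injected["OUT"])
print(default["OUT"])
```

Whether the real plotting scripts read `BASE_DIR` this way is not confirmed by this notebook. If figures land somewhere other than `data/cas/outputs/`, the scripts' own `BASE_DIR` assignment is the first place to look.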

## Appendix — README quickstart for this layout
Use this in `examples/cas/Readme.md`.

```bash
git clone https://github.com/earthai-tech/k-diagram
cd k-diagram

mkdir -p data/cas/{raw,preprocessed,modeling_results_ok,outputs}

# Place inputs here (or your equivalents)
# data/cas/raw/gefcom_hourly.csv
# data/cas/raw/camels_timeseries.csv
# data/cas/raw/egms_point.csv

pip install -U numpy pandas matplotlib pyarrow lightgbm statsmodels scikit-learn
# Optional
# pip install fusionlab-learn tensorflow

jupyter lab  # open examples/cas/notebooks/CAS_end_to_end.ipynb
```
