<a href="https://colab.research.google.com/github/apropos0/Scheduling_Inference/blob/main/notebooks/01_data_and_features.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 01 - Data + Features

Goal:
- Load one or more experiment CSVs
- Confirm schema and balance across policy/workload/session
- Compute basic time-normalized features
- Save a clean artifact for downstream modeling

Expected input columns:
- timestamp, session_id, policy, workload
- task_clock, context_switches, cpu_migrations
- cycles, instructions, branches, branch_misses


In [None]:
import pandas as pd
import numpy as np
from pathlib import Path

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 120)
print("pandas", pd.__version__)
print("numpy", np.__version__)


## Load data

Set `DATA_PATHS` to one or more CSV files.

Common options:
- Upload CSV(s) into Colab runtime (left sidebar -> Files -> Upload), then use the uploaded filenames.
- Keep CSV(s) in the repo under `data/` and reference them as `data/<name>.csv`.

Tip: start with one session CSV first, then add more later.
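
If the CSVs live under `data/`, the list can also be built by globbing rather than typed by hand. This is a minimal sketch; `find_csvs` is a hypothetical helper, not part of the repo, and it assumes every `*.csv` in the directory shares the expected schema.

```python
from pathlib import Path

def find_csvs(directory="data", pattern="*.csv"):
    """Return sorted CSV paths under `directory` as strings.

    Returns an empty list if the directory does not exist.
    """
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted(str(p) for p in root.glob(pattern))

# Prints [] when there is no data/ directory in the working directory.
print(find_csvs())
```

The result could be assigned to `DATA_PATHS` directly; the empty-list check in the next cell still guards against a missing or empty directory.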

In [None]:
# Examples:
# DATA_PATHS = ["data/results_2025-12-31_A.csv"]
# DATA_PATHS = ["results.csv"]  # if uploaded into Colab

DATA_PATHS = []

if not DATA_PATHS:
    raise ValueError("Set DATA_PATHS to one or more CSV filenames.")

dfs = []
for p in DATA_PATHS:
    if not Path(p).exists():
        raise FileNotFoundError(f"File not found: {p}")
    dfs.append(pd.read_csv(p))

raw = pd.concat(dfs, ignore_index=True)
print("Loaded shape:", raw.shape)
raw.head()

In [None]:
expected_cols = [
    "timestamp","session_id","policy","workload",
    "task_clock","context_switches","cpu_migrations",
    "cycles","instructions","branches","branch_misses"
]

missing = [c for c in expected_cols if c not in raw.columns]
extra = [c for c in raw.columns if c not in expected_cols]

if missing:
    raise ValueError(f"Missing expected columns: {missing}")
print("Extra columns:", extra)

print("dtypes:\n", raw.dtypes)
print("\nMissing values per column:\n", raw.isna().sum())

In [None]:
print("Policies:\n", raw["policy"].value_counts(), "\n")
print("Workloads:\n", raw["workload"].value_counts(), "\n")
print("Sessions:\n", raw["session_id"].value_counts(), "\n")

print("Policy x Workload counts:\n")
pd.crosstab(raw["policy"], raw["workload"])

## Sanity checks

This is a quick scan of summary statistics to catch:
- parser errors
- missing values
- implausibly small or large values
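
Alongside `describe()`, the same checks can be expressed as explicit counts per column. The sketch below is one possible approach, not the notebook's method: `scan_counters` is a hypothetical helper, and treating zero and negative counters as suspicious is an assumption about this dataset.

```python
import pandas as pd

def scan_counters(frame, cols):
    """Count missing/unparseable, negative, and zero values per column."""
    # Values that cannot be parsed as numbers become NaN, so parser errors
    # are counted together with genuinely missing values.
    values = frame[cols].apply(pd.to_numeric, errors="coerce")
    return pd.DataFrame({
        "n_missing": values.isna().sum(),
        "n_negative": (values < 0).sum(),
        "n_zero": (values == 0).sum(),
    })

# Synthetic demo data with one problem of each kind.
demo = pd.DataFrame({
    "cycles": [100, 0, -5, None],
    "instructions": ["200", "oops", "300", "400"],
})
report = scan_counters(demo, ["cycles", "instructions"])
print(report)
```

On the real data this would be called as `scan_counters(raw, num_cols)`.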


In [None]:
num_cols = [
    "task_clock","context_switches","cpu_migrations",
    "cycles","instructions","branches","branch_misses"
]
raw[num_cols].describe(percentiles=[0.01, 0.05, 0.5, 0.95, 0.99])

## Feature engineering

Compute:
- time-normalized rates (per second)
- ratios like IPC and miss rate

These will be used in the modeling notebook.
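
As a worked example, assuming `task_clock` is in milliseconds: a row with 40 context switches over 2000 ms gives 40 / 2 s = 20 switches per second, and 6,000,000 instructions over 4,000,000 cycles gives an IPC of 1.5. The sketch below uses synthetic values to show this arithmetic and what happens to a row with zero runtime.

```python
import numpy as np
import pandas as pd

# Synthetic rows: the second has zero runtime and zero counters.
rows = pd.DataFrame({
    "task_clock": [2000.0, 0.0],  # milliseconds (assumed unit)
    "context_switches": [40, 3],
    "cycles": [4_000_000, 0],
    "instructions": [6_000_000, 0],
    "branches": [800_000, 0],
    "branch_misses": [16_000, 0],
})

task_sec = rows["task_clock"] / 1000.0

feats = pd.DataFrame({
    "cs_per_sec": rows["context_switches"] / task_sec,
    "ipc": rows["instructions"] / rows["cycles"],
    "branch_miss_rate": rows["branch_misses"] / rows["branches"],
}).replace([np.inf, -np.inf], np.nan)  # x/0 gives inf; 0/0 is already NaN

print(feats)
```

Zero denominators end up as NaN rather than raising, so such rows show up in the missing-value count and can be dropped or inspected later.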

In [None]:
df = raw.copy()

# perf task-clock is in milliseconds here
df["task_sec"] = df["task_clock"] / 1000.0

# rates
df["cs_per_sec"] = df["context_switches"] / df["task_sec"]
df["mig_per_sec"] = df["cpu_migrations"] / df["task_sec"]
df["cycles_per_sec"] = df["cycles"] / df["task_sec"]
df["instr_per_sec"] = df["instructions"] / df["task_sec"]
df["branches_per_sec"] = df["branches"] / df["task_sec"]

# ratios
df["ipc"] = df["instructions"] / df["cycles"]
df["branch_miss_rate"] = df["branch_misses"] / df["branches"]

# safety: division by zero yields +/-inf; convert to NaN so it is counted as missing
df.replace([np.inf, -np.inf], np.nan, inplace=True)

df[["cs_per_sec","mig_per_sec","ipc","branch_miss_rate"]].describe()

In [None]:
df.isna().sum().sort_values(ascending=False).head(15)

## Save clean dataset

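We save as Parquet because it loads faster than CSV and preserves dtypes.
This is the file Notebook 02 will load. Writing Parquet requires a Parquet engine such as `pyarrow` or `fastparquet`.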
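
To illustrate the dtype point, the sketch below round-trips a small synthetic frame through CSV and shows that datetime and categorical columns come back as plain text. Parquet stores the schema with the data, so the same round trip through `to_parquet` / `pd.read_parquet` would be expected to keep them. The frame and its values are made up for illustration.

```python
import io
import pandas as pd

# Synthetic frame with a datetime and a categorical column.
typed = pd.DataFrame({
    "timestamp": pd.to_datetime(["2025-01-01 00:00:00", "2025-01-01 00:00:01"]),
    "policy": pd.Categorical(["policy_a", "policy_b"]),
    "cycles": [2_000_000, 4_000_000],
})

# Round-trip through CSV in memory.
buf = io.StringIO()
typed.to_csv(buf, index=False)
buf.seek(0)
from_csv = pd.read_csv(buf)

print("before:\n", typed.dtypes)
print("after CSV round trip:\n", from_csv.dtypes)
```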


In [None]:
out_path = "clean_results.parquet"
df.to_parquet(out_path, index=False)
print("Wrote:", out_path)
