<a href="https://colab.research.google.com/github/equiphysics/education/blob/main/labs/Duque-Exercise/Duque_Lunge_Test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <span style='color:#e91e63'>Duque Weekly Lunge Test: HR + HRV Data Analysis</span>

This notebook is for **week-to-week tracking** during Duque’s 90-day fitness routine. The goal is simple: **keep the workout standardized**, collect clean heart data, and use the numbers to ask better questions—not to “diagnose” anything.

---

## How to use this notebook
- Read the prompt for each section.
- Run the code cells (Shift+Enter).
- When you see **#TODO**, you can change a few settings (you do not need to write new code).
- Use the text “Answer” cells to write short interpretations and record field notes.

---

## What we’re measuring (and why)
- **Heart Rate (HR):** how hard the heart is working during the session.
- **Heart Rate Variability (HRV):** beat-to-beat timing variation (from RR / NN intervals). Most useful in **quiet standing** and **standing recovery**.
- **Recovery:** how quickly HR drops after the trot sets.
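
As a rough illustration of the HRV bullet above, the two statistics this notebook uses later (RMSSD and SDNN) can be computed directly from a list of RR intervals. This is a minimal sketch on made-up RR values; the function names `rmssd` and `sdnn` are illustrative and are not defined elsewhere in the notebook.

```python
import numpy as np

def rmssd(rr_ms):
    """Root mean square of successive RR differences, in ms."""
    rr = np.asarray(rr_ms, dtype=float)
    diffs = np.diff(rr)
    return float(np.sqrt(np.mean(diffs ** 2)))

def sdnn(rr_ms):
    """Sample standard deviation of the RR intervals, in ms."""
    rr = np.asarray(rr_ms, dtype=float)
    return float(np.std(rr, ddof=1))

# Made-up RR intervals (ms), for illustration only
rr_example = [1500.0, 1520.0, 1480.0, 1510.0, 1490.0]
print("RMSSD (ms):", round(rmssd(rr_example), 2))
print("SDNN  (ms):", round(sdnn(rr_example), 2))
```

Both numbers assume the RR series is already free of artifacts; a single dropped or doubled beat can dominate RMSSD.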

---

## The standardized weekly test (summary)
We will analyze the same basic structure each week:
1. **Pre-test:** quiet standing baseline  
2. **Work:** lunge left, then lunge right (with walk + trot sets)  
3. **Post-test:** standing recovery

---

## Learning goals
By the end of this activity, you will be able to:
- Upload and parse Duque’s HR/HRV export files.
- Plot HR and RR time series and identify baseline / work / recovery segments.
- Compute **RMSSD** and **SDNN** on clean segments (especially standing).
- Estimate simple recovery metrics (e.g., recovery slope after trot sets).
- Compare **left vs right**.
- Flag artifacts and explain what you chose to exclude (and why).

---

## Data you’ll upload
You’ll be prompted to upload (file names may vary by date/time):
- **Initial resting HR file** (separate short recording taken at rest)
- **Workout HR file** (same format; recorded during the weekly lunge test)
- **ECG file** (for additional signal checks)
- **ACC file** (for movement context)

These HR files typically include: **HR**, **RR (ms)**, **timestamps (ms)**, and a **skin-contact flag**.
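
For orientation, an HR export can be thought of as a few `#` header lines followed by a comma-separated table. The file text in this sketch is invented (real header wording, column order, and values may differ); the column names `MS`, `HR`, `RR`, and `SC` are the ones the Step 0 loader looks for, and `SC == 1` is assumed to mean good skin contact, as in the Step 0 summary.

```python
import io
import pandas as pd

# Invented example text: real exports may have different headers and many more rows
example_text = """# Heart Rate Data
# Collection Timestamp: 1.19.26 12.11.36
MS, HR, RR, SC
0, 38, 1579, 1
1000, 39, 1538, 1
2000, 0, 0, 0
3000, 40, 1500, 1
"""

# Keep only non-empty, non-comment lines, then parse them as CSV
data_lines = [ln for ln in example_text.splitlines() if ln and not ln.startswith("#")]
df_example = pd.read_csv(io.StringIO("\n".join(data_lines)), skipinitialspace=True)

# Seconds since the first sample, and the rows with skin contact
df_example["t_s"] = (df_example["MS"] - df_example["MS"].iloc[0]) / 1000.0
good = df_example[df_example["SC"] == 1]
print(good[["t_s", "HR", "RR"]])
```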
  
---

## Important note
This notebook is **educational** and supports training decisions and discussion with your team. It is **not veterinary diagnosis**.



## Step 0 — Setup + Upload Data

Run the code cell below to:
1. Load the Python packages we’ll use throughout the notebook.
2. Prompt you to upload your data files.
3. Automatically load each file into a table (DataFrame) and summarize what was uploaded.

**Upload these files:**
- **Initial resting HR file** (separate short recording taken at rest)
- **Workout HR file** (recorded during the weekly lunge test)
- **Workout ECG** file
- **Workout ACC** file

After upload, you’ll select which HR file is **Resting** and which is **Workout** (if more than one HR file is present).


In [None]:

# =========================
# Step 0 — Setup + Upload (REQUIRED)
# Resting HR is separate; ECG + ACC are workout-only.
# =========================

import sys
import io
from pathlib import Path
from datetime import datetime

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Widgets in Colab
try:
    import ipywidgets as widgets
    from IPython.display import display
    from google.colab import output, files
    output.enable_custom_widget_manager()
except Exception as e:
    print("Widget setup note:", e)
    print("This cell needs Google Colab: the upload step and the selection dropdowns below will fail without it.")

print("Python:", sys.version.split()[0])
print("pandas:", pd.__version__)
print("numpy:", np.__version__)

# ---------- Helpers ----------

def _read_lines_strip_nulls(file_path: str) -> list[str]:
    """Read a text export while stripping null bytes (common in some device exports)."""
    data = Path(file_path).read_bytes().replace(b"\x00", b"")
    return data.decode("utf-8", errors="replace").splitlines()

def _parse_collection_timestamp(header_lines: list[str]):
    """
    Tries to parse: '# Collection Timestamp: 1.19.26 12.11.36'
    Interprets it as: month.day.year hour.minute.second.
    Returns datetime or None.
    """
    for line in header_lines:
        if "Collection Timestamp:" in line:
            raw = line.split("Collection Timestamp:", 1)[1].strip()
            for fmt in ("%m.%d.%y %H.%M.%S", "%m.%d.%y %I.%M.%S"):
                try:
                    return datetime.strptime(raw, fmt)
                except ValueError:
                    pass
            return None
    return None

def detect_file_type(file_path: str) -> str:
    """
    Detect file type from header.
    """
    lines = _read_lines_strip_nulls(file_path)
    header = "\n".join(lines[:60]).lower()

    if "heart rate data" in header:
        return "hr"
    if "electrocardiogram data" in header or "ecg data" in header:
        return "ecg"
    if "accelerometer data" in header:
        return "acc"
    return "unknown"

def load_device_export(file_path: str) -> dict:
    """
    Load an export text file into:
      {'type': ..., 'meta': ..., 'df': DataFrame}
    Adds:
      - t_s: seconds since first sample (from MS)
      - dt: absolute datetime if collection timestamp is parseable
    """
    lines = _read_lines_strip_nulls(file_path)

    header_lines = []
    data_start = None

    for i, line in enumerate(lines):
        if line.startswith("#"):
            header_lines.append(line)
            continue
        if line.strip() == "":
            continue
        data_start = i
        break

    if data_start is None:
        raise ValueError(f"Could not find data section in {file_path}")

    csv_text = "\n".join(lines[data_start:])
    df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)
    df.columns = [c.strip() for c in df.columns]

    meta = {
        "file_name": Path(file_path).name,
        "collection_dt": _parse_collection_timestamp(header_lines),
        "header_lines": header_lines[:],
    }

    if "MS" in df.columns:
        df["MS"] = pd.to_numeric(df["MS"], errors="coerce")
        df = df.dropna(subset=["MS"]).reset_index(drop=True)
        df["t_s"] = (df["MS"] - df["MS"].iloc[0]) / 1000.0
        if meta["collection_dt"] is not None:
            df["dt"] = meta["collection_dt"] + pd.to_timedelta(df["MS"], unit="ms")

    # Coerce common numeric columns if present
    for col in ["HR", "RR", "SC", "ECG", "ACCX", "ACCY", "ACCZ"]:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors="coerce")

    ftype = detect_file_type(file_path)
    return {"type": ftype, "meta": meta, "df": df}

def summarize_hr(df: pd.DataFrame) -> dict:
    out = {"rows": len(df)}
    out["duration_s"] = float(df["t_s"].iloc[-1]) if "t_s" in df.columns and len(df) else np.nan

    if "SC" in df.columns and len(df):
        out["skin_contact_%"] = float(100 * np.nanmean(df["SC"].values == 1))
    else:
        out["skin_contact_%"] = np.nan

    if "HR" in df.columns and len(df):
        # Use only samples with skin contact when the SC flag is available
        hr = df.loc[df["SC"] == 1, "HR"] if "SC" in df.columns else df["HR"]
        hr = hr.replace([np.inf, -np.inf], np.nan).dropna()
        out["HR_mean"] = float(hr.mean()) if len(hr) else np.nan
        out["HR_min"] = float(hr.min()) if len(hr) else np.nan
        out["HR_max"] = float(hr.max()) if len(hr) else np.nan
    else:
        out["HR_mean"] = out["HR_min"] = out["HR_max"] = np.nan

    return out

def ms_range(df: pd.DataFrame):
    """Return (minMS, maxMS) if MS exists, else (nan, nan)."""
    if df is None or "MS" not in df.columns or len(df) == 0:
        return (np.nan, np.nan)
    return (float(np.nanmin(df["MS"])), float(np.nanmax(df["MS"])))

# ---------- Upload ----------

print("\nUpload REQUIRED files now:")
print("  - Resting HR file (rest-only recording)")
print("  - Workout HR file (weekly lunge test)")
print("  - ECG file (recorded during workout)")
print("  - ACC file (recorded during workout)\n")
print("Tip: you can select all files at once in the upload dialog.\n")

uploaded = files.upload()
if not uploaded:
    raise RuntimeError("No files uploaded. Re-run this cell and upload ALL required files.")

uploaded_paths = [str(Path(k).resolve()) for k in uploaded.keys()]
print("\nUploaded:")
for p in uploaded_paths:
    print(" -", Path(p).name)

# ---------- Load + Organize ----------

DATA = {}
for p in uploaded_paths:
    try:
        DATA[p] = load_device_export(p)
    except Exception as e:
        print(f"\n⚠️ Could not load {Path(p).name}: {e}")

HR_FILES  = [p for p, d in DATA.items() if d["type"] == "hr"]
ECG_FILES = [p for p, d in DATA.items() if d["type"] == "ecg"]
ACC_FILES = [p for p, d in DATA.items() if d["type"] == "acc"]
UNK_FILES = [p for p, d in DATA.items() if d["type"] == "unknown"]

print("\nDetected file types:")
print("  HR :", [Path(p).name for p in HR_FILES])
print("  ECG:", [Path(p).name for p in ECG_FILES])
print("  ACC:", [Path(p).name for p in ACC_FILES])
if UNK_FILES:
    print("  ???:", [Path(p).name for p in UNK_FILES])

# ---------- Enforce requirements ----------

errors = []
if len(HR_FILES) < 2:
    errors.append(f"Need at least 2 HR files (resting + workout). Detected: {len(HR_FILES)}")
if len(ECG_FILES) < 1:
    errors.append("Missing ECG file (workout-only).")
if len(ACC_FILES) < 1:
    errors.append("Missing ACC file (workout-only).")
if len(UNK_FILES) > 0:
    errors.append(f"Some files were not recognized: {[Path(p).name for p in UNK_FILES]}")

if errors:
    msg = "\n".join([f"- {e}" for e in errors])
    raise ValueError(
        "Upload/typing check failed:\n"
        f"{msg}\n\nFix: re-run this cell and upload the correct exports."
    )

# ---------- Summary table ----------

summary_rows = []
for p in HR_FILES:
    df = DATA[p]["df"]
    s = summarize_hr(df)
    summary_rows.append({
        "file": Path(p).name,
        "type": "hr",
        "rows": s["rows"],
        "duration_s": round(s["duration_s"], 1) if np.isfinite(s["duration_s"]) else np.nan,
        "skin_contact_%": round(s["skin_contact_%"], 1) if np.isfinite(s["skin_contact_%"]) else np.nan,
        "HR_mean": round(s["HR_mean"], 1) if np.isfinite(s["HR_mean"]) else np.nan,
        "HR_min": round(s["HR_min"], 1) if np.isfinite(s["HR_min"]) else np.nan,
        "HR_max": round(s["HR_max"], 1) if np.isfinite(s["HR_max"]) else np.nan,
    })

for p in ECG_FILES + ACC_FILES:
    df = DATA[p]["df"]
    dur = float(df["t_s"].iloc[-1]) if "t_s" in df.columns and len(df) else np.nan
    summary_rows.append({
        "file": Path(p).name,
        "type": DATA[p]["type"],
        "rows": len(df),
        "duration_s": round(dur, 1) if np.isfinite(dur) else np.nan,
        "skin_contact_%": np.nan,
        "HR_mean": np.nan,
        "HR_min": np.nan,
        "HR_max": np.nan,
    })

SUMMARY = pd.DataFrame(summary_rows).sort_values(["type", "file"]).reset_index(drop=True)
display(SUMMARY)

# ---------- Select Resting vs Workout HR ----------

# Suggest resting = shortest HR duration, workout = longest HR duration
durations = []
for p in HR_FILES:
    df = DATA[p]["df"]
    dur = float(df["t_s"].iloc[-1]) if "t_s" in df.columns and len(df) else np.inf
    durations.append((p, dur))
durations_sorted = sorted(durations, key=lambda x: x[1])
suggested_rest = durations_sorted[0][0]
suggested_workout = durations_sorted[-1][0]

print("\nSelect which HR file is RESTING vs WORKOUT:")

rest_dropdown = widgets.Dropdown(
    options=[(Path(p).name, p) for p in HR_FILES],
    value=suggested_rest,
    description="Resting HR:",
    style={"description_width": "initial"},
    layout=widgets.Layout(width="60%")
)
workout_dropdown = widgets.Dropdown(
    options=[(Path(p).name, p) for p in HR_FILES],
    value=suggested_workout,
    description="Workout HR:",
    style={"description_width": "initial"},
    layout=widgets.Layout(width="60%")
)

# ---------- Select ECG + ACC (workout-only) ----------

ecg_dropdown = widgets.Dropdown(
    options=[(Path(p).name, p) for p in ECG_FILES],
    value=ECG_FILES[0],
    description="Workout ECG:",
    style={"description_width": "initial"},
    layout=widgets.Layout(width="60%")
)
acc_dropdown = widgets.Dropdown(
    options=[(Path(p).name, p) for p in ACC_FILES],
    value=ACC_FILES[0],
    description="Workout ACC:",
    style={"description_width": "initial"},
    layout=widgets.Layout(width="60%")
)

warn = widgets.HTML(value="")

REST_HR_PATH = None
WORKOUT_HR_PATH = None
WORKOUT_ECG_PATH = None
WORKOUT_ACC_PATH = None

def _update_paths(change=None):
    global REST_HR_PATH, WORKOUT_HR_PATH, WORKOUT_ECG_PATH, WORKOUT_ACC_PATH
    REST_HR_PATH = rest_dropdown.value
    WORKOUT_HR_PATH = workout_dropdown.value
    WORKOUT_ECG_PATH = ecg_dropdown.value
    WORKOUT_ACC_PATH = acc_dropdown.value

    messages = []
    if REST_HR_PATH == WORKOUT_HR_PATH:
        messages.append("⚠️ Resting and Workout HR are set to the same file. Choose two different HR files.")

    # Overlap checks (very simple): MS ranges should overlap for workout HR vs ECG/ACC
    df_work = DATA[WORKOUT_HR_PATH]["df"] if WORKOUT_HR_PATH else None
    df_ecg = DATA[WORKOUT_ECG_PATH]["df"] if WORKOUT_ECG_PATH else None
    df_acc = DATA[WORKOUT_ACC_PATH]["df"] if WORKOUT_ACC_PATH else None

    w0, w1 = ms_range(df_work)
    e0, e1 = ms_range(df_ecg)
    a0, a1 = ms_range(df_acc)

    def overlap_frac(x0, x1, y0, y1):
        if not np.isfinite([x0, x1, y0, y1]).all():
            return np.nan
        inter = max(0.0, min(x1, y1) - max(x0, y0))
        denom = max(1.0, (x1 - x0))
        return inter / denom

    oe = overlap_frac(w0, w1, e0, e1)
    oa = overlap_frac(w0, w1, a0, a1)

    if np.isfinite(oe) and oe < 0.2:
        messages.append("⚠️ Workout ECG time range barely overlaps Workout HR. Double-check you selected the matching workout ECG file.")
    if np.isfinite(oa) and oa < 0.2:
        messages.append("⚠️ Workout ACC time range barely overlaps Workout HR. Double-check you selected the matching workout ACC file.")

    if messages:
        warn.value = "<br>".join([f"<b style='color:#b71c1c'>{m}</b>" for m in messages])
    else:
        warn.value = ""

rest_dropdown.observe(_update_paths, names="value")
workout_dropdown.observe(_update_paths, names="value")
ecg_dropdown.observe(_update_paths, names="value")
acc_dropdown.observe(_update_paths, names="value")
_update_paths()

display(rest_dropdown, workout_dropdown, ecg_dropdown, acc_dropdown, warn)

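# ---------- Convenience variables for later cells ----------
# Note: these are snapshots of the selections at the moment this cell runs.
# Changing a dropdown afterwards updates the *_PATH variables but not these DataFrames.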

df_rest = DATA[REST_HR_PATH]["df"].copy()
df_work = DATA[WORKOUT_HR_PATH]["df"].copy()
df_ecg  = DATA[WORKOUT_ECG_PATH]["df"].copy()
df_acc  = DATA[WORKOUT_ACC_PATH]["df"].copy()

print("\nCurrent selections:")
print("  REST HR   :", Path(REST_HR_PATH).name)
print("  WORK HR   :", Path(WORKOUT_HR_PATH).name)
print("  WORK ECG  :", Path(WORKOUT_ECG_PATH).name)
print("  WORK ACC  :", Path(WORKOUT_ACC_PATH).name)

print("\n✅ Setup complete.")
print("Next we’ll make a sanity plot of Resting HR/RR and Workout HR/RR + ECG + ACC to confirm everything lines up.")


## Step 2 — Workout overview: HR and relative speed, with walk/trot/stand highlighted

This figure gives a clean “big picture” view of the workout.

We compute a **relative speed estimate** from the accelerometer (ACC) by:
1. converting ACC from milli-g → m/s²,
2. removing gravity/tilt using a rolling mean,
3. taking the magnitude of the remaining (dynamic) acceleration,
4. applying strong smoothing.

Then we classify each moment into **stand / walk / trot** using the *speed estimate* (three-level clustering), and we **shade the background** of both plots using those labels.

**Important:** the “speed” here is a *proxy* (relative intensity). It’s great for detecting structure (stand vs moving, walk-ish vs trot-ish), but it is not guaranteed to be true speed in m/s without calibration.
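
The four steps above can be sketched on synthetic data. This is a simplified stand-in, not the notebook's own cell: the helper `rolling_mean`, the 50 Hz sampling rate, and the made-up stand/walk/trot signal are all assumptions chosen for illustration.

```python
import numpy as np

def rolling_mean(x, win):
    """Centered moving average; near the edges it averages only the samples that exist."""
    kernel = np.ones(win)
    num = np.convolve(x, kernel, mode="same")
    den = np.convolve(np.ones_like(x), kernel, mode="same")
    return num / den

# Synthetic ACC in milli-g: 10 s still, 10 s small oscillation, 10 s large oscillation
fs = 50.0
t = np.arange(0, 30, 1 / fs)
amp_mg = np.where(t < 10, 0.0, np.where(t < 20, 200.0, 600.0))
freq = np.where(t < 20, 1.5, 2.5)
acc_mg = np.column_stack([
    amp_mg * np.sin(2 * np.pi * freq * t),
    np.zeros_like(t),
    1000.0 + amp_mg * np.cos(2 * np.pi * freq * t),  # gravity sits on this axis
])

g = 9.80665
acc_ms2 = acc_mg * g / 1000.0                                   # 1. milli-g -> m/s^2
win_g = int(round(3.0 * fs))
grav = np.column_stack([rolling_mean(acc_ms2[:, i], win_g) for i in range(3)])  # 2. gravity/tilt
dyn_mag = np.linalg.norm(acc_ms2 - grav, axis=1)                # 3. dynamic magnitude
proxy = rolling_mean(dyn_mag, int(round(2.5 * fs)))             # 4. strong smoothing

stand_mean = proxy[(t > 2) & (t < 8)].mean()
walk_mean = proxy[(t > 12) & (t < 18)].mean()
trot_mean = proxy[(t > 22) & (t < 28)].mean()
print("stand / walk / trot proxy:", np.round([stand_mean, walk_mean, trot_mean], 3))
```

The proxy rises with movement intensity, which is what the three-level clustering relies on; it says nothing about actual ground speed.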




In [None]:

# Step 2 — Two-panel plot: HR + relative speed, with stand/walk/trot shading from the speed proxy
# Requires: df_work (workout HR), df_acc (workout accelerometer), numpy as np, pandas as pd, matplotlib.pyplot as plt

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Patch

# -----------------------------
# #TODO parameters (tune these)
# -----------------------------
GRAVITY_WIN_S   = 3.0   # seconds: rolling-mean window for the gravity/tilt estimate (bigger = smoother)
SPEED_SMOOTH_S  = 2.5   # seconds: strong smoothing for speed proxy (bigger = smoother)
PLOT_FS         = 10    # Hz: downsample for plotting + classification
MIN_BOUT_S      = 4.0   # seconds: merge bouts shorter than this (reduces flicker)
SPEED_SCALE     = 1.0   # unitless scaling for the speed axis

# -----------------------------
# Helper functions
# -----------------------------
def kmeans_1d(x, k=3, n_iter=30, seed=0):
    """Simple 1D k-means (no sklearn). Returns the k cluster centers sorted in ascending order.

    `seed` is unused (the quantile initialization is deterministic); it is kept so existing calls still work.
    """
    x = np.asarray(x, dtype=float)
    x = x[np.isfinite(x)]
    if len(x) < k:
        raise ValueError("Not enough finite points for k-means.")
    # init centers: spread across quantiles
    qs = np.linspace(0.1, 0.9, k)
    centers = np.quantile(x, qs)

    for _ in range(n_iter):
        # assign
        d = np.abs(x[:, None] - centers[None, :])
        labels = np.argmin(d, axis=1)
        new_centers = centers.copy()
        for j in range(k):
            pts = x[labels == j]
            if len(pts) > 0:
                new_centers[j] = np.mean(pts)
        if np.allclose(new_centers, centers, rtol=1e-6, atol=1e-8):
            centers = new_centers
            break
        centers = new_centers

    # sort centers in ascending order (labels are assigned later by classify_speed)
    order = np.argsort(centers)
    centers_sorted = centers[order]
    return centers_sorted

def classify_speed(speed, centers):
    """Classify speed into 3 bins using midpoints between sorted centers."""
    c0, c1, c2 = centers
    b01 = 0.5*(c0 + c1)
    b12 = 0.5*(c1 + c2)
    # 0=stand, 1=walk, 2=trot
    labels = np.zeros_like(speed, dtype=int)
    labels[speed >= b01] = 1
    labels[speed >= b12] = 2
    return labels, (b01, b12)

def segments_from_labels(t, labels):
    """Return list of (label, t0, t1) contiguous segments."""
    t = np.asarray(t, dtype=float)
    labels = np.asarray(labels, dtype=int)
    segs = []
    if len(t) == 0:
        return segs
    start = 0
    for i in range(1, len(labels)):
        if labels[i] != labels[i-1]:
            segs.append((labels[i-1], t[start], t[i-1]))
            start = i
    segs.append((labels[-1], t[start], t[-1]))
    return segs

def merge_short_segments(segs, min_dur_s):
    """Merge segments shorter than min_dur_s into a neighbor (reduces flicker)."""
    if not segs:
        return segs
    segs = segs.copy()
    i = 0
    while i < len(segs):
        lab, t0, t1 = segs[i]
        dur = t1 - t0
        if dur < min_dur_s:
            # Prefer merging into the larger neighbor segment
            if i > 0 and i < len(segs) - 1:
                prev = segs[i-1]
                nxt = segs[i+1]
                prev_dur = prev[2] - prev[1]
                nxt_dur = nxt[2] - nxt[1]
                if prev_dur >= nxt_dur:
                    # merge into prev: extend prev end, drop current
                    segs[i-1] = (prev[0], prev[1], t1)
                    segs.pop(i)
                    continue
                else:
                    # merge into next: extend next start backward, drop current
                    segs[i+1] = (nxt[0], t0, nxt[2])
                    segs.pop(i)
                    continue
            elif i > 0:
                prev = segs[i-1]
                segs[i-1] = (prev[0], prev[1], t1)
                segs.pop(i)
                continue
            elif i < len(segs) - 1:
                nxt = segs[i+1]
                segs[i+1] = (nxt[0], t0, nxt[2])
                segs.pop(i)
                continue
            else:
                # only one segment; keep it
                i += 1
        else:
            i += 1

    # If merges created adjacent same-label segments, collapse them
    collapsed = []
    for lab, t0, t1 in segs:
        if collapsed and collapsed[-1][0] == lab:
            collapsed[-1] = (lab, collapsed[-1][1], t1)
        else:
            collapsed.append((lab, t0, t1))
    return collapsed

def shade_segments(ax, segs, colors, alpha=0.60):
    """Shade ax background according to segments."""
    for lab, t0, t1 in segs:
        ax.axvspan(t0, t1, color=colors[lab], alpha=alpha, linewidth=0)

# -----------------------------
# Prepare HR (remove HR==0)
# -----------------------------
work = df_work.copy()
work["HR"] = pd.to_numeric(work["HR"], errors="coerce")
work = work[work["HR"] > 0].copy()
work = work.sort_values("t_s")

# -----------------------------
# Prepare accelerometer → speed proxy
# -----------------------------
g = 9.80665  # m/s^2 per g

acc_ms2 = (df_acc[["ACCX", "ACCY", "ACCZ"]].to_numpy(dtype=float) * g) / 1000.0

# Estimate ACC sampling rate
if "MS" in df_acc.columns and len(df_acc) > 1:
    dt = np.nanmedian(np.diff(df_acc["MS"].to_numpy(dtype=float))) / 1000.0
else:
    dt = np.nanmedian(np.diff(df_acc["t_s"].to_numpy(dtype=float)))
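if not np.isfinite(dt) or dt <= 0:
    raise ValueError("Could not estimate the ACC sampling interval from the timestamps.")
fs = 1.0 / dt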

# Gravity/tilt (low-pass)
win_g = max(1, int(round(GRAVITY_WIN_S * fs)))
grav = (
    pd.DataFrame(acc_ms2, columns=["ax", "ay", "az"])
    .rolling(win_g, center=True, min_periods=1)
    .mean()
    .to_numpy()
)

# Dynamic acceleration magnitude
acc_dyn = acc_ms2 - grav
acc_dyn_mag = np.linalg.norm(acc_dyn, axis=1)

# Strong smoothing → speed proxy
win_s = max(1, int(round(SPEED_SMOOTH_S * fs)))
speed_proxy = (
    pd.Series(acc_dyn_mag)
    .rolling(win_s, center=True, min_periods=1)
    .mean()
    .to_numpy()
)

speed_est = SPEED_SCALE * speed_proxy

# Downsample for plotting + classification
step = max(1, int(round(fs / PLOT_FS)))
t_speed = df_acc["t_s"].to_numpy(dtype=float)[::step]
speed_plot = speed_est[::step]

# -----------------------------
# Classify stand / walk / trot from speed_plot
# -----------------------------
centers = kmeans_1d(speed_plot[np.isfinite(speed_plot)], k=3, n_iter=50, seed=0)
labels, (thr01, thr12) = classify_speed(speed_plot, centers)

# Build segments and merge short bouts
segs = segments_from_labels(t_speed, labels)
segs = merge_short_segments(segs, MIN_BOUT_S)

# Colors for background shading (0=stand, 1=walk, 2=trot)
bg_colors = {
    0: "lightgreen",  # stand
    1: "lightblue",   # walk
    2: "pink",        # trot
}

# -----------------------------
# Plot: 2 panels (HR, speed)
# -----------------------------
fig, (ax_hr, ax_spd) = plt.subplots(2, 1, sharex=True, figsize=(12, 6), constrained_layout=True)

# Shade both panels
shade_segments(ax_hr, segs, bg_colors, alpha=0.6)
shade_segments(ax_spd, segs, bg_colors, alpha=0.6)

# HR panel
ax_hr.plot(work["t_s"], work["HR"], color="crimson", linewidth=1.4)
ax_hr.set_ylabel("HR (bpm)")
ax_hr.set_title("Workout HR and ACC-based relative speed (stand/walk/trot highlighted)")

# Speed panel
ax_spd.plot(t_speed, speed_plot, color="blue", linewidth=1.4)
ax_spd.set_ylabel("relative speed (proxy)")
ax_spd.set_xlabel("time (s)")

# Legend for background labels
legend_patches = [
    Patch(facecolor=bg_colors[0], edgecolor="none", alpha=0.6, label="stand"),
    Patch(facecolor=bg_colors[1], edgecolor="none", alpha=0.6, label="walk"),
    Patch(facecolor=bg_colors[2], edgecolor="none", alpha=0.6, label="trot"),
]
ax_hr.legend(handles=legend_patches, loc="upper right")

# Print thresholds so you can sanity-check
print("Speed clustering centers:", np.round(centers, 4))
print("Thresholds (stand→walk, walk→trot):", np.round([thr01, thr12], 4))
print(f"Min bout merged: {MIN_BOUT_S:.1f} s")

plt.show()




## Step 3 — Identify protocol segments (Left/Right), flag missing parts, and flag *extra* added bouts

Now we use the ACC-based speed proxy to break the workout into **stand / walk / trot** bouts and map those onto the intended weekly protocol:

**Left direction**
- Walk L (5 min)
- Trot 1 L (2 min)
- Walk 1 L (2 min)
- Trot 2 L (2 min)
- Walk 2 L (2 min)

**Change directions**
- Stand / pause (variable)

**Right direction**
- Walk R (5 min)
- Trot 1 R (2 min)
- Walk 1 R (2 min)
- Trot 2 R (2 min)
- Walk 2 R (2 min)

**Finish**
- Final stand (variable)

Instead of forcing the workout into a fixed template, we will label the **segments that were actually detected** from the ACC-based speed estimate.

For each detected segment, you will choose:
- **Gait:** Stand / Walk / Trot  
- **Direction:** Left / Right / Transition / Unknown  
- **Ignore:** check this if the segment is noise, a stop to fix equipment, etc.

The table shows the code’s **suggested gait** and the **duration** to help you label quickly.

When you click **Save labels**, the notebook will create a clean `segments_labeled` table that we’ll use for HR/HRV analysis by segment.
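
To make the shape of the saved table concrete, here is a hand-made stand-in with a subset of the `segments_labeled` columns. All values are invented; the sketch only shows how ignored rows drop out and how time can be totaled by direction and gait.

```python
import pandas as pd

# Hand-made rows using the same column names as `segments_labeled`; values are invented
example_segments = pd.DataFrame([
    {"segment_id": 1, "t0_s": 0.0,   "t1_s": 300.0, "gait": "Walk",  "direction": "Left",       "ignore": False},
    {"segment_id": 2, "t0_s": 300.0, "t1_s": 420.0, "gait": "Trot",  "direction": "Left",       "ignore": False},
    {"segment_id": 3, "t0_s": 420.0, "t1_s": 435.0, "gait": "Stand", "direction": "Transition", "ignore": True},
    {"segment_id": 4, "t0_s": 435.0, "t1_s": 735.0, "gait": "Walk",  "direction": "Right",      "ignore": False},
    {"segment_id": 5, "t0_s": 735.0, "t1_s": 855.0, "gait": "Trot",  "direction": "Right",      "ignore": False},
])
example_segments["dur_s"] = example_segments["t1_s"] - example_segments["t0_s"]

# Drop ignored rows, then total the minutes per direction and gait
kept = example_segments[~example_segments["ignore"]]
minutes = kept.pivot_table(
    index="direction", columns="gait", values="dur_s", aggfunc="sum", fill_value=0.0
) / 60.0
print(minutes)
```

Comparing these totals against the protocol durations listed above is a quick way to spot a missing or extra bout.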

In [None]:

# Step 3 — Ask user to label each detected segment (gait + direction + ignore)
# Assumes you already ran Step 2 and have `segs` as a list of (label_id, t0, t1)
# where label_id: 0=stand, 1=walk, 2=trot (based on speed proxy classification).

import numpy as np
import pandas as pd

try:
    import ipywidgets as widgets
    from IPython.display import display
except Exception as e:
    raise RuntimeError("ipywidgets not available. Re-run Step 0 setup cell first.") from e

# ---- safety checks ----
if "segs" not in globals():
    raise ValueError("Missing `segs`. Re-run Step 2 (the plot with shading) first.")

label_name = {0: "Stand", 1: "Walk", 2: "Trot"}

# Build a clean segments table
segments_df = pd.DataFrame(
    [{
        "segment_id": i+1,
        "suggested_gait": label_name.get(lab, "Unknown"),
        "t0_s": float(t0),
        "t1_s": float(t1),
        "dur_s": float(t1 - t0),
        "dur_min": float(t1 - t0) / 60.0,
    } for i, (lab, t0, t1) in enumerate(segs)]
).sort_values("t0_s").reset_index(drop=True)

display(segments_df)

# --- UI controls for each segment ---
GAIT_OPTIONS = ["Stand", "Walk", "Trot"]
DIR_OPTIONS  = ["Left", "Right", "Transition", "Unknown"]

rows = []
widgets_rows = []

for _, r in segments_df.iterrows():
    seg_id = int(r["segment_id"])
    sug = r["suggested_gait"]
    dur_min = r["dur_min"]
    t0 = r["t0_s"]
    t1 = r["t1_s"]

    # default gait = suggested gait (if in options)
    gait_default = sug if sug in GAIT_OPTIONS else "Stand"

    ignore_chk = widgets.Checkbox(value=False, description="Ignore", indent=False)
    gait_dd = widgets.Dropdown(options=GAIT_OPTIONS, value=gait_default, description="Gait:",
                              layout=widgets.Layout(width="220px"))
    dir_dd  = widgets.Dropdown(options=DIR_OPTIONS, value="Unknown", description="Direction:",
                              layout=widgets.Layout(width="280px"))
    note_tx = widgets.Text(value="", description="Note:", layout=widgets.Layout(width="420px"))

    info = widgets.HTML(
        f"<b>Seg {seg_id}</b> "
        f"<span style='color:#555'>(suggested: {sug}, {dur_min:.2f} min, {t0:.1f}–{t1:.1f}s)</span>"
    )

    row = widgets.HBox([info, ignore_chk, gait_dd, dir_dd, note_tx])
    widgets_rows.append((seg_id, ignore_chk, gait_dd, dir_dd, note_tx))
    rows.append(row)

ui = widgets.VBox(rows)

save_btn = widgets.Button(description="Save labels", button_style="success")
out = widgets.Output()

def on_save(_):
    labeled_rows = []
    for seg_id, ignore_chk, gait_dd, dir_dd, note_tx in widgets_rows:
        base = segments_df.loc[segments_df["segment_id"] == seg_id].iloc[0].to_dict()
        base.update({
            "gait": gait_dd.value,
            "direction": dir_dd.value,
            "ignore": bool(ignore_chk.value),
            "note": note_tx.value.strip(),
        })
        labeled_rows.append(base)

    global segments_labeled
    segments_labeled = pd.DataFrame(labeled_rows).sort_values("t0_s").reset_index(drop=True)

    with out:
        out.clear_output()
        print("✅ Saved segments_labeled (we’ll use this in later steps).")
        display(segments_labeled)

        # quick summaries
        kept = segments_labeled[~segments_labeled["ignore"]].copy()

        if len(kept):
            print("\nSummary (time by gait):")
            display(kept.groupby("gait", as_index=False)["dur_s"].sum().assign(dur_min=lambda d: d["dur_s"]/60))

            print("\nSummary (time by direction):")
            display(kept.groupby("direction", as_index=False)["dur_s"].sum().assign(dur_min=lambda d: d["dur_s"]/60))

            print("\nSummary (time by direction × gait):")
            pivot = kept.pivot_table(index="direction", columns="gait", values="dur_s", aggfunc="sum", fill_value=0.0)
            display((pivot/60).round(2))  # minutes
        else:
            print("\n(Everything is marked ignore — nothing to summarize.)")

save_btn.on_click(on_save)

display(ui, save_btn, out)



## Step 4 — Recovery rate estimates: what we compute, what the tables mean, and what is “standard”

### Why we estimate recovery
Heart rate recovery after work is one of the simplest indicators we can track week-to-week. We are **not diagnosing** anything. We are trying to measure whether Duque:
- returns toward baseline more quickly over time (fitness/training effect), or
- shows slower recovery on a given day (fatigue, heat, stress, soreness, equipment issues, etc.).

---

## A. What “recovery rate” means in this notebook

We estimate recovery across two transition types:
1. **Trot → Walk**
2. **Walk → Stand** (including the final transition into the end stand)

For each transition, we look at how HR changes **immediately after the transition**, using a fixed early window (default: **60 seconds**). This gives a consistent number we can compare week-to-week.

---

## B. What the code actually does (step-by-step)

### 1) Segment-average HR
For every labeled segment (Walk / Trot / Stand), we compute:
- **HR_mean** = average HR over all HR samples whose timestamps fall inside that segment.

This gives a simple overview of effort level in each segment.
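
A minimal sketch of the segment average, using invented HR samples. The helper name `segment_hr_mean` and the choice to include both segment endpoints are assumptions for illustration, not necessarily what the Step 4 cell does.

```python
import numpy as np

# Invented HR samples: one per second, 60 bpm for 0-59 s, then 90 bpm for 60-119 s
t_s = np.arange(120.0)
hr = np.where(t_s < 60, 60.0, 90.0)

def segment_hr_mean(t_s, hr, t0, t1):
    """Mean HR over samples with t0 <= t <= t1; NaN if the segment contains no samples."""
    mask = (t_s >= t0) & (t_s <= t1)
    return float(np.mean(hr[mask])) if mask.any() else float("nan")

print("first segment :", segment_hr_mean(t_s, hr, 0.0, 59.0))
print("second segment:", segment_hr_mean(t_s, hr, 60.0, 119.0))
```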

### 2) End-of-segment HR (the “starting point” for recovery)
For a transition (e.g., Trot → Walk), we estimate HR at the *end* of the “from” segment:
- **HR_from_end_med** = the **median HR** over the last *W* seconds (default **20 s**) of the “from” segment.

We use a **median** because it is less sensitive to spikes/dropouts.

### 3) HR after the transition (the “early recovery” point)
Inside the “to” segment (Walk or Stand), we look at the **first 60 seconds**. Segments shorter than 1 minute are skipped, so the full 60-second window is always available.

We compute:
- **HR_to_end_med(first_60s)** = median HR over the **last W seconds** within the first 60 seconds of the “to” segment.

This means we are comparing:
- HR right before the transition (end of the “from” segment), versus
- HR after one minute of recovery (inside the “to” segment).

### 4) Two recovery numbers
We report **two versions** of recovery:

**(i) 60-second drop rate (primary)**
- **ΔHR_end60_minus_fromEnd** = HR_to_end_med(first_60s) − HR_from_end_med  
  (negative means HR dropped — recovery)
- **rate_bpm_per_min(ΔHR/60s)** = ΔHR divided by 1 minute  
  (so it is numerically the same as ΔHR, but explicitly “bpm/min”)

**(ii) Linear slope over the first 60 seconds (secondary)**
- **slope_bpm_per_min(fit_0-60s)** = best-fit straight-line slope of HR vs time over the first 60 seconds of the “to” segment.

This can be helpful if HR is noisy, but it is still a simple model.
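
The two recovery numbers can be sketched on an invented HR trace. The helper names (`median_in`, `recovery_numbers`), the half-open window boundaries, and the synthetic data are assumptions for illustration; `W` and `EARLY` use the defaults quoted in the text.

```python
import numpy as np

W = 20.0      # seconds: median window (default from the text)
EARLY = 60.0  # seconds: early-recovery window (default from the text)

def median_in(t, hr, a, b):
    """Median HR over samples with a <= t < b."""
    mask = (t >= a) & (t < b)
    return float(np.median(hr[mask]))

def recovery_numbers(t, hr, t_transition):
    """Return the 60-second drop and the linear slope after a transition."""
    hr_from_end = median_in(t, hr, t_transition - W, t_transition)
    hr_to_end = median_in(t, hr, t_transition + EARLY - W, t_transition + EARLY)
    delta = hr_to_end - hr_from_end
    early = (t >= t_transition) & (t < t_transition + EARLY)
    slope_per_s = np.polyfit(t[early], hr[early], 1)[0]
    return {
        "delta_hr": delta,
        "rate_bpm_per_min": delta / (EARLY / 60.0),
        "slope_bpm_per_min": float(slope_per_s * 60.0),
    }

# Invented trace: steady 120 bpm, then a linear fall of 0.5 bpm/s starting at t = 100 s
t = np.arange(0.0, 180.0)
hr = np.maximum(np.where(t < 100, 120.0, 120.0 - 0.5 * (t - 100)), 90.0)
result = recovery_numbers(t, hr, t_transition=100.0)
print(result)
```

The drop and the slope need not agree: the drop compares two median windows, while the slope uses every sample in the first 60 seconds.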

---

## C. What the recovery tables contain

Each row corresponds to a single transition that met our quality rules:
- **from_seg / to_seg**: segment IDs in your labeled table
- **HR_from_mean / HR_to_mean**: average HR in each entire segment
- **HR_from_end_med**: median HR in the last 20 seconds before transition
- **HR_to_end_med(first_60s)**: median HR near the end of the first 60 seconds after transition
- **ΔHR_end60_minus_fromEnd**: the HR change after 60 s (negative is “good recovery”)
- **rate_bpm_per_min(ΔHR/60s)**: same change expressed as bpm/min
- **slope_bpm_per_min(fit_0-60s)**: slope of HR during early recovery (optional)
- **Walk→Stand (final)** row: for the final stand, we skip tiny walk segments (settling steps) and pair the last full-length walk with the final stand, so we capture the true final recovery.
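
The backward search behind the final Walk→Stand row can be sketched on a toy segment list. The `(gait, duration)` tuples and the function name `final_walk_stand_pair` are hypothetical; the notebook itself works on a DataFrame of labeled segments.

```python
# Hypothetical (gait, duration_s) list in time order; all values are made up
segments = [("Trot", 180), ("Walk", 120), ("Stand", 15), ("Walk", 8), ("Stand", 240)]
MIN_SEG_S = 60

def final_walk_stand_pair(segs, min_seg_s=MIN_SEG_S):
    """Return (walk_index, stand_index) for the last long Stand and the
    nearest earlier long Walk, or None if either is missing."""
    stands = [i for i, (g, d) in enumerate(segs) if g == "Stand" and d >= min_seg_s]
    if not stands:
        return None
    stand_i = stands[-1]
    # Search backward, skipping blips and anything that is not a long Walk
    for j in range(stand_i - 1, -1, -1):
        gait, dur = segs[j]
        if gait == "Walk" and dur >= min_seg_s:
            return (j, stand_i)
    return None

print(final_walk_stand_pair(segments))
```

In this toy list the 8 s walk and the 15 s stand are skipped, so the pair is the 120 s walk (index 1) and the 240 s stand (index 4).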

---

## D. Is this “standard”? Should we fit an exponential?

There isn’t one single universally “standard” method across sports science and veterinary contexts, but there are common families:

### Common/simple approaches (very common for field tracking)
- **HR at 1 min post-exercise** (or 2 min, etc.)
- **ΔHR over first 60 s** (what we do)
- **linear slope over a short recovery window** (what we also report)

These are popular because they are:
- easy to compute,
- robust to imperfect field data,
- comparable across weeks.

### Exponential recovery models (also common in research, but more assumptions)
A classic model is:
- HR(t) = HR_rest + (HR0 − HR_rest) * exp(−t/τ)

Where **τ** is the recovery time constant.

This can be great if:
- you have a clean recovery segment,
- HR is sampled reliably,
- the recovery is approximately monotone,
- you have a well-defined baseline (HR_rest) and start point HR0.

But in real horse field data, recovery can be messy:
- movement during “stand,”
- environmental changes,
- sensor contact artifacts,
- handler interruptions,
- non-monotone HR (spikes).

So exponential fitting can be **more fragile** unless we do careful filtering and quality control.
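
For readers who want to experiment, here is a minimal sketch of estimating τ from the model above. It assumes HR_rest is already known (for example from the resting file) and uses a log-linear straight-line fit rather than full nonlinear least squares, which is a simplification. The function name `fit_tau_loglinear` and the data are hypothetical and noise-free.

```python
import numpy as np

def fit_tau_loglinear(t_s, hr_bpm, hr_rest):
    """Estimate (tau_s, HR0) from ln(HR - HR_rest) = ln(HR0 - HR_rest) - t/tau.
    Samples at or below hr_rest are dropped because the log is undefined there."""
    t_s = np.asarray(t_s, dtype=float)
    hr_bpm = np.asarray(hr_bpm, dtype=float)
    ok = hr_bpm > hr_rest
    if ok.sum() < 3:
        return float("nan"), float("nan")
    slope, intercept = np.polyfit(t_s[ok], np.log(hr_bpm[ok] - hr_rest), 1)
    if slope >= 0:
        return float("nan"), float("nan")   # HR is not decaying toward rest
    tau_s = -1.0 / slope
    hr0 = hr_rest + float(np.exp(intercept))
    return float(tau_s), hr0

# Synthetic recovery: HR_rest = 40, HR0 = 120, tau = 45 s (all made-up values)
t_demo = np.arange(0.0, 121.0, 1.0)
hr_demo = 40.0 + (120.0 - 40.0) * np.exp(-t_demo / 45.0)

tau_est, hr0_est = fit_tau_loglinear(t_demo, hr_demo, hr_rest=40.0)
print(round(tau_est, 2), round(hr0_est, 2))
```

On clean data like this the fit recovers τ and HR0 closely. On field data, spikes and samples near HR_rest dominate the logarithm, which is one concrete way the fragility described above shows up.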





In [None]:

# Step 4 — Recovery estimates + include final Walk→Stand (ignore brief walk blips during final stand)
# Requires: segments_labeled (from Step 3) and df_work (workout HR)

import numpy as np
import pandas as pd
from IPython.display import display

# -----------------------------
# #TODO parameters
# -----------------------------
MIN_SEG_S = 60.0          # segments must be at least 1 minute long to be used for recovery estimates
R_SEC     = 60.0          # recovery window: first 60 seconds of the "to" segment
W_SEC     = 20.0          # end-of-segment HR median window (seconds)
HR_TIME_OFFSET_S = 0.0    # shift segment times to line up with HR time axis if needed

WALK_BLIP_S = 30.0        # treat walk segments shorter than this as "blips" inside the final stand

# -----------------------------
# Prep HR signal (remove zeros, sort)
# -----------------------------
work = df_work.copy()
work["HR"] = pd.to_numeric(work["HR"], errors="coerce")
work = work[(work["HR"] > 0) & np.isfinite(work["t_s"])].dropna(subset=["HR", "t_s"])
work = work.sort_values("t_s").reset_index(drop=True)

t_hr = work["t_s"].to_numpy(dtype=float)
hr   = work["HR"].to_numpy(dtype=float)

def hr_stats(t0, t1):
    if t1 <= t0:
        return (np.nan, 0)
    m = (t_hr >= t0) & (t_hr <= t1)
    if not np.any(m):
        return (np.nan, 0)
    return (float(np.nanmean(hr[m])), int(np.sum(m)))

def hr_median_last(t0, t1, w=W_SEC):
    a = max(t0, t1 - w)
    m = (t_hr >= a) & (t_hr <= t1)
    if np.sum(m) < 3:
        return (np.nan, int(np.sum(m)))
    return (float(np.nanmedian(hr[m])), int(np.sum(m)))

def hr_slope_bpm_per_min(t0, t1):
    if t1 <= t0:
        return (np.nan, 0)
    m = (t_hr >= t0) & (t_hr <= t1)
    if np.sum(m) < 6:
        return (np.nan, int(np.sum(m)))
    x = t_hr[m]
    y = hr[m]
    slope_bpm_per_s = np.polyfit(x, y, 1)[0]
    return (float(slope_bpm_per_s * 60.0), int(np.sum(m)))

# -----------------------------
# Build segment table with average HR
# -----------------------------
if "segments_labeled" not in globals():
    raise ValueError("segments_labeled not found. Run Step 3 labeling and click Save labels first.")

seg = segments_labeled.copy().sort_values("t0_s").reset_index(drop=True)
seg["dur_s"] = seg["t1_s"] - seg["t0_s"]

seg["t0_hr"] = seg["t0_s"] + HR_TIME_OFFSET_S
seg["t1_hr"] = seg["t1_s"] + HR_TIME_OFFSET_S

seg_hr_rows = []
for _, r in seg.iterrows():
    t0, t1 = float(r["t0_hr"]), float(r["t1_hr"])
    mean_hr, n = hr_stats(t0, t1)
    seg_hr_rows.append({
        "segment_id": int(r["segment_id"]),
        "gait": r["gait"],
        "direction": r["direction"],
        "ignore": bool(r["ignore"]),
        "t0_s": float(r["t0_s"]),
        "t1_s": float(r["t1_s"]),
        "dur_s": float(r["dur_s"]),
        "dur_min": float(r["dur_s"]/60.0),
        "HR_mean": mean_hr,
        "HR_samples": n,
        "note": r.get("note",""),
    })

segments_hr = pd.DataFrame(seg_hr_rows)
print("Segment averages (all segments):")
display(segments_hr)

# Keep only non-ignored segments
kept = segments_hr[~segments_hr["ignore"]].copy().reset_index(drop=True)

# -----------------------------
# Generic transition finder (adjacent segments only)
# -----------------------------
def transition_table_adjacent(from_gait, to_gait, label):
    rows = []
    for i in range(len(kept) - 1):
        a = kept.iloc[i]
        b = kept.iloc[i+1]

        if a["gait"] != from_gait or b["gait"] != to_gait:
            continue

        if (a["dur_s"] < MIN_SEG_S) or (b["dur_s"] < MIN_SEG_S):
            continue

        # Recovery metrics (end-of-segment medians, 60 s drop rate, linear slope)
        # are computed by compute_transition_row, defined further down in this cell
        rows.append(compute_transition_row(a, b, label))

    return pd.DataFrame(rows)

# -----------------------------
# Special: final Walk → Stand (skip brief walk blips)
# -----------------------------
def final_walk_to_stand_transition():
    """
    Find the last meaningful Walk→Stand transition.
    Starting near the end:
      - find the final Stand segment that is >= MIN_SEG_S
      - search backward for the most recent Walk segment >= MIN_SEG_S
      - skip everything in between: walk blips (< WALK_BLIP_S), other short
        walks (< MIN_SEG_S), and any non-Walk segments
    Returns a dict with positional indices into `kept`, or None.
    """
    if len(kept) < 2:
        return None

    # candidate stand segments at end
    stand_idxs = [i for i in range(len(kept)) if kept.loc[i, "gait"] == "Stand" and kept.loc[i, "dur_s"] >= MIN_SEG_S]
    if not stand_idxs:
        return None

    stand_i = stand_idxs[-1]  # last stand
    # now search backward for last good walk before this stand, skipping short walk blips
    j = stand_i - 1
    while j >= 0:
        if kept.loc[j, "gait"] == "Walk":
            if kept.loc[j, "dur_s"] >= MIN_SEG_S:
                return {"walk_index": j, "stand_index": stand_i}
            elif kept.loc[j, "dur_s"] < WALK_BLIP_S:
                # ignore blip and keep moving left
                j -= 1
                continue
            else:
                # walk segment is between blip and "good"—still not long enough for estimate
                j -= 1
                continue
        else:
            j -= 1
    return None

def compute_transition_row(a, b, label):
    """Compute recovery metrics from segment a (from) to segment b (to)."""
    a_t0 = float(a["t0_s"] + HR_TIME_OFFSET_S)
    a_t1 = float(a["t1_s"] + HR_TIME_OFFSET_S)
    b_t0 = float(b["t0_s"] + HR_TIME_OFFSET_S)
    b_t1 = float(b["t1_s"] + HR_TIME_OFFSET_S)

    hr_from_end, _ = hr_median_last(a_t0, a_t1, W_SEC)
    b_early_end = min(b_t1, b_t0 + R_SEC)
    hr_to_early_end, _ = hr_median_last(b_t0, b_early_end, W_SEC)

    slope_bpm_min, n_fit = hr_slope_bpm_per_min(b_t0, b_early_end)

    if np.isfinite(hr_from_end) and np.isfinite(hr_to_early_end):
        delta_hr = hr_to_early_end - hr_from_end
        rate_bpm_min = delta_hr / (R_SEC/60.0)
    else:
        delta_hr = np.nan
        rate_bpm_min = np.nan

    return {
        "transition": label,
        "from_seg": int(a["segment_id"]),
        "to_seg": int(b["segment_id"]),
        "t_to_start_s": float(b["t0_s"]),
        "HR_from_mean": float(a["HR_mean"]),
        "HR_to_mean": float(b["HR_mean"]),
        "HR_from_end_med": hr_from_end,
        "HR_to_end_med(first_60s)": hr_to_early_end,
        "ΔHR_end60_minus_fromEnd": delta_hr,
        "rate_bpm_per_min(ΔHR/60s)": rate_bpm_min,
        "slope_bpm_per_min(fit_0-60s)": slope_bpm_min,
        "HR_samples_fit": n_fit,
    }

# -----------------------------
# Compute tables
# -----------------------------
tw = transition_table_adjacent("Trot", "Walk", "Trot→Walk")
ws_adj = transition_table_adjacent("Walk", "Stand", "Walk→Stand")

# Final walk→stand (robust)
final_pair = final_walk_to_stand_transition()
if final_pair is None:
    ws_final = pd.DataFrame([])
else:
    a = kept.iloc[final_pair["walk_index"]]
    b = kept.iloc[final_pair["stand_index"]]
    ws_final = pd.DataFrame([compute_transition_row(a, b, "Walk→Stand (final)")])

# -----------------------------
# Combined recovery report (major results only)
# -----------------------------
def _pick_cols(df, cols):
    if df is None or len(df) == 0:
        return pd.DataFrame(columns=cols)
    keep = [c for c in cols if c in df.columns]
    out = df[keep].copy()
    # ensure all columns exist for concat
    for c in cols:
        if c not in out.columns:
            out[c] = np.nan
    return out[cols]

MAJOR_COLS = [
    "transition",
    "from_seg", "to_seg",
    "t_to_start_s",
    "HR_from_mean", "HR_to_mean",
    "rate_bpm_per_min(ΔHR/60s)",
    "slope_bpm_per_min(fit_0-60s)",
]

tw_s  = _pick_cols(tw, MAJOR_COLS)
ws_s  = _pick_cols(ws_adj, MAJOR_COLS)
wf_s  = _pick_cols(ws_final, MAJOR_COLS)

combined_recovery = pd.concat([tw_s, ws_s, wf_s], ignore_index=True)

# Add a simple interpretation column
rate_col = "rate_bpm_per_min(ΔHR/60s)"
rate_num = pd.to_numeric(combined_recovery[rate_col], errors="coerce")
combined_recovery["interpretation"] = np.where(
    rate_num.isna(),
    "insufficient data",
    np.where(rate_num < 0, "HR dropping (recovery)", "HR not dropping / unclear")
)

# Add segment duration context (minutes), if segments_hr exists
if "segments_hr" in globals() and len(segments_hr):
    dur_from = segments_hr[["segment_id", "dur_min"]].rename(columns={"segment_id": "from_seg", "dur_min": "from_dur_min"})
    dur_to   = segments_hr[["segment_id", "dur_min"]].rename(columns={"segment_id": "to_seg",   "dur_min": "to_dur_min"})
    combined_recovery = combined_recovery.merge(dur_from, on="from_seg", how="left")
    combined_recovery = combined_recovery.merge(dur_to,   on="to_seg",   how="left")

# Round for readability
for c in ["t_to_start_s", "HR_from_mean", "HR_to_mean", rate_col, "slope_bpm_per_min(fit_0-60s)", "from_dur_min", "to_dur_min"]:
    if c in combined_recovery.columns:
        combined_recovery[c] = pd.to_numeric(combined_recovery[c], errors="coerce").round(4)

# Clean report view
REPORT_COLS = [
    "transition",
    "from_seg", "to_seg",
    "from_dur_min", "to_dur_min",
    "HR_from_mean", "HR_to_mean",
    rate_col,
    "slope_bpm_per_min(fit_0-60s)",
    "t_to_start_s",
]
REPORT_COLS = [c for c in REPORT_COLS if c in combined_recovery.columns]

combined_recovery_report = (
    combined_recovery[REPORT_COLS]
    .sort_values(["transition", "t_to_start_s"], na_position="last")
    .reset_index(drop=True)
)

print("\nCombined recovery report (major results only):")
display(combined_recovery_report)


## Step 5 — HRV metrics for (1) the initial resting stand and (2) the final stand

In this step we compute **time-domain HRV metrics** from **NN (RR) intervals** for two standing periods:

1. **Initial resting stand** (from the separate “resting HR” file)
2. **Final stand** (from the workout recording, using your labeled segments)

### What data we use
- HRV is computed from **NN / RR intervals** (in milliseconds).  
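- We **clean** the NN series to remove obvious artifacts: intervals outside the 300 to 2000 ms range, and intervals that jump more than 250 ms from the previous one (both thresholds can be changed under **#TODO**).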
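
A small sketch of a range filter followed by a jump filter is shown below, using the notebook's default thresholds and a made-up NN sequence. It illustrates the idea only and is not a substitute for inspecting the raw trace.

```python
import numpy as np

NN_MIN_MS, NN_MAX_MS, MAX_DIFF_MS = 300, 2000, 250   # notebook defaults

# Made-up NN intervals (ms) with a too-short value, a too-long value, and one spike
nn = np.array([1500, 1520, 150, 1490, 2600, 1510, 1900, 1530], dtype=float)

# 1) Range filter: drop intervals outside [NN_MIN_MS, NN_MAX_MS]
in_range = nn[(nn >= NN_MIN_MS) & (nn <= NN_MAX_MS)]

# 2) Jump filter: drop any interval that differs from its predecessor
#    by more than MAX_DIFF_MS
keep = np.ones(len(in_range), dtype=bool)
keep[1:] &= np.abs(np.diff(in_range)) <= MAX_DIFF_MS
cleaned = in_range[keep]

print(in_range)
print(cleaned)
```

Note that the interval right after the 1900 ms spike (1530 ms) is dropped as well, because each interval is compared with its unfiltered predecessor. This simple filter therefore errs on the side of removing a little too much.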

### Metrics reported (time-domain)
Let NN be the sequence of cleaned NN intervals (ms):
- **Mean NN (ms)**: average interval length  
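- **Mean HR (bpm)**: average heart rate, reported two ways: from the HR column when one is available (`Mean_HR_bpm_fromHR`) and as 60000 / Mean NN (`Mean_HR_bpm_fromNN`)
- **SDNN (ms)**: sample standard deviation of NN intervals  
- **RMSSD (ms)**: root mean square of successive differences, sqrt(mean( (diff(NN))² ))  
- **pNN50 (%)**: percent of successive NN differences whose absolute value exceeds 50 ms  
- **N_intervals**: how many NN intervals were used after cleaning (data quality indicator); if fewer than 10 remain, the HRV metrics are reported as NaN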
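
The definitions above can be checked by hand on a tiny, made-up NN sequence. Five intervals are far too few for a meaningful HRV estimate, so this is arithmetic practice only.

```python
import numpy as np

nn = np.array([1000, 1060, 1000, 1040, 1000], dtype=float)  # made-up NN intervals (ms)
diff_nn = np.diff(nn)                                       # successive differences: 60, -60, 40, -40

mean_nn = float(nn.mean())                       # 5100 / 5
mean_hr = 60000.0 / mean_nn                      # bpm derived from mean NN
sdnn = float(nn.std(ddof=1))                     # sample standard deviation
rmssd = float(np.sqrt(np.mean(diff_nn ** 2)))    # sqrt of mean squared difference
pnn50 = float(100.0 * np.mean(np.abs(diff_nn) > 50.0))  # two of four differences exceed 50 ms

print(f"Mean NN = {mean_nn:.0f} ms, Mean HR = {mean_hr:.1f} bpm")
print(f"SDNN = {sdnn:.1f} ms, RMSSD = {rmssd:.1f} ms, pNN50 = {pnn50:.0f} %")
```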

### Notes
- HRV is most meaningful during **quiet standing**. If the horse is shifting, stepping, or contact is poor, HRV will be less reliable.
- We trim a small amount off the start/end of each stand by default (20 s / 10 s for the resting file, 10 s / 10 s for the final stand) to avoid transition effects.


In [None]:

# Step 5 — HRV metrics for initial resting stand (separate file) + final stand (from labeled workout segments)
# Requires: segments_labeled (from Step 3) and df_work (workout HR dataframe already loaded earlier).
# This code will prompt for the resting HR file if df_rest is not already defined.

import numpy as np
import pandas as pd
from IPython.display import display

# -----------------------------
# #TODO parameters
# -----------------------------
# Basic NN artifact filters (tune if needed)
NN_MIN_MS = 300      # too short to be real (artifact)
NN_MAX_MS = 2000     # too long to be real (artifact)
MAX_DIFF_MS = 250    # drop successive NN jumps bigger than this (artifact)

# Trims (seconds) to avoid transition edges
REST_TRIM_START_S = 20
REST_TRIM_END_S   = 10

FINAL_TRIM_START_S = 10
FINAL_TRIM_END_S   = 10

# If HR and segment times are slightly misaligned, reuse the same offset you used earlier
HR_TIME_OFFSET_S = 0.0

# -----------------------------
# Helpers: load + standardize a Polar-style txt export
# -----------------------------
def load_hr_txt_to_df(path):
    """Robust-ish loader for tab/space/comma-separated Polar exports."""
    df = pd.read_csv(path, sep=None, engine="python")
    # normalize column names
    df.columns = [str(c).strip() for c in df.columns]
    return df

def ensure_time_seconds(df):
    """Ensure a 't_s' column exists."""
    if "t_s" in df.columns:
        df["t_s"] = pd.to_numeric(df["t_s"], errors="coerce")
        return df

    # Common alternatives
    col_lower = {c.lower(): c for c in df.columns}

    if "ms" in col_lower:
        c = col_lower["ms"]
        ms = pd.to_numeric(df[c], errors="coerce")
        df["t_s"] = (ms - ms.iloc[0]) / 1000.0
        return df

    # try timestamp-like columns
    for key in ["timestamp", "time", "datetime", "date_time", "date time"]:
        if key in col_lower:
            c = col_lower[key]
            ts = pd.to_datetime(df[c], errors="coerce", utc=False)
            df["t_s"] = (ts - ts.iloc[0]).dt.total_seconds()
            return df

    raise ValueError("Could not infer time column. Expected 't_s' or 'MS' or a timestamp column.")

def find_rr_column(df):
    """Find a likely NN/RR column and return its name."""
    candidates = [
        "rr", "rr_ms", "rr(ms)", "rr (ms)", "rrinterval", "rr_interval",
        "nn", "nn_ms", "ibi", "ibi_ms", "rri", "rri_ms"
    ]
    col_lower = {c.lower().replace(" ", ""): c for c in df.columns}
    for k in candidates:
        kk = k.lower().replace(" ", "")
        if kk in col_lower:
            return col_lower[kk]
    # fallback: any column whose name mentions an interval type (rr / nn / ibi)
    # together with 'ms' or 'interval'
    for c in df.columns:
        cl = c.lower()
        if ("rr" in cl or "nn" in cl or "ibi" in cl) and ("ms" in cl or "interval" in cl):
            return c
    raise ValueError("Could not find an RR/NN interval column in this file.")

def find_hr_column(df):
    """Find a likely HR column; return name or None."""
    col_lower = {c.lower().replace(" ", ""): c for c in df.columns}
    for k in ["hr", "heart_rate", "heartrate", "hr(bpm)"]:
        kk = k.lower().replace(" ", "")
        if kk in col_lower:
            return col_lower[kk]
    # fallback: any column that looks like HR
    for c in df.columns:
        cl = c.lower()
        if cl in ["hr", "heart rate", "heartrate"] or ("hr" in cl and "bpm" in cl):
            return c
    return None

def filter_nn(nn_ms):
    """Simple NN cleaning: range + big-jump removal.

    Returns (cleaned_nn, removed), where `removed` counts the intervals
    dropped by the range filter and the jump filter combined.
    """
    nn = pd.to_numeric(pd.Series(nn_ms), errors="coerce").dropna().to_numpy(dtype=float)
    n_in = len(nn)
    if n_in == 0:
        return nn, 0

    mask = (nn >= NN_MIN_MS) & (nn <= NN_MAX_MS)
    nn = nn[mask]

    if len(nn) < 3:
        return nn, int(n_in - len(nn))

    # drop each interval that differs from its predecessor by more than MAX_DIFF_MS
    d = np.abs(np.diff(nn))
    keep = np.ones(len(nn), dtype=bool)
    keep[1:] &= (d <= MAX_DIFF_MS)
    nn2 = nn[keep]
    removed = int(n_in - len(nn2))
    return nn2, removed

def hrv_metrics_from_nn(nn_ms):
    """Compute time-domain HRV metrics from NN (ms)."""
    nn = np.asarray(nn_ms, dtype=float)
    nn = nn[np.isfinite(nn)]
    out = {}
    out["N_intervals"] = int(len(nn))

    if len(nn) < 10:
        # Too few points for stable HRV
        out.update({
            "Mean_NN_ms": np.nan,
            "Mean_HR_bpm_fromNN": np.nan,
            "SDNN_ms": np.nan,
            "RMSSD_ms": np.nan,
            "pNN50_pct": np.nan
        })
        return out

    out["Mean_NN_ms"] = float(np.mean(nn))
    out["Mean_HR_bpm_fromNN"] = float(60000.0 / out["Mean_NN_ms"])
    out["SDNN_ms"] = float(np.std(nn, ddof=1))

    diff_nn = np.diff(nn)
    out["RMSSD_ms"] = float(np.sqrt(np.mean(diff_nn**2)))
    out["pNN50_pct"] = float(100.0 * np.mean(np.abs(diff_nn) > 50.0))
    return out

def compute_hrv_for_window(df, t0, t1, label, trim_start=0, trim_end=0):
    """Compute HRV on [t0+trim_start, t1-trim_end] using RR/NN column."""
    t0u = float(t0 + trim_start)
    t1u = float(t1 - trim_end)
    if t1u <= t0u:
        return {"segment": label, "t0_s": t0, "t1_s": t1, "used_start_s": t0u, "used_end_s": t1u, "note": "window too short after trimming"}

    rr_col = find_rr_column(df)
    hr_col = find_hr_column(df)

    sub = df[(df["t_s"] >= t0u) & (df["t_s"] <= t1u)].copy()
    rr = pd.to_numeric(sub[rr_col], errors="coerce").to_numpy(dtype=float)

    nn_clean, removed_jumps = filter_nn(rr)

    metrics = hrv_metrics_from_nn(nn_clean)
    metrics["segment"] = label
    metrics["t0_s"] = float(t0)
    metrics["t1_s"] = float(t1)
    metrics["used_start_s"] = float(t0u)
    metrics["used_end_s"] = float(t1u)
    metrics["RR_col"] = rr_col
    metrics["removed_after_range/jump_filter"] = int(removed_jumps)

    # Mean HR from HR column if available
    if hr_col is not None and hr_col in sub.columns:
        hr_vals = pd.to_numeric(sub[hr_col], errors="coerce")
        hr_vals = hr_vals[(hr_vals > 0) & np.isfinite(hr_vals)]
        metrics["Mean_HR_bpm_fromHR"] = float(hr_vals.mean()) if len(hr_vals) else np.nan
    else:
        metrics["Mean_HR_bpm_fromHR"] = np.nan

    return metrics

# -----------------------------
# Load resting file if needed
# -----------------------------
if "df_rest" not in globals():
    # Colab upload prompt (only works in Colab; elsewhere, load the file into df_rest yourself)
    try:
        from google.colab import files
        print("Upload the *resting* HR file (initial stand):")
        up = files.upload()
        rest_path = next(iter(up.keys()))
        df_rest = load_hr_txt_to_df(rest_path)
    except Exception as exc:
        # keep the original error attached so the real cause stays visible
        raise ValueError("df_rest not found. Please upload the resting HR file and assign it to df_rest.") from exc

# Standardize time columns for rest + workout
df_rest = ensure_time_seconds(df_rest)

# We assume df_work already exists from your earlier steps
if "df_work" not in globals():
    raise ValueError("df_work not found. Please load the workout HR file into df_work first.")

df_work = ensure_time_seconds(df_work)

# Coerce the HR column to numeric (HRV itself uses RR; non-positive HR values
# are dropped later, inside compute_hrv_for_window, when mean HR is computed)
if "HR" in df_work.columns:
    df_work["HR"] = pd.to_numeric(df_work["HR"], errors="coerce")
if "HR" in df_rest.columns:
    df_rest["HR"] = pd.to_numeric(df_rest["HR"], errors="coerce")

# -----------------------------
# Define the two windows:
#  (1) Initial resting stand: use the full rest file time range
#  (2) Final stand: last labeled Stand segment (not ignored)
# -----------------------------
# Initial rest window
rest_t0 = float(np.nanmin(df_rest["t_s"]))
rest_t1 = float(np.nanmax(df_rest["t_s"]))

# Final stand from labeled segments
if "segments_labeled" not in globals():
    raise ValueError("segments_labeled not found. Run Step 3 labeling and click Save labels first.")

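# Cast "ignore" to bool first so that ~ behaves as a logical NOT even if the column has object dtype
seg_keep = segments_labeled[(~segments_labeled["ignore"].astype(bool)) & (segments_labeled["gait"] == "Stand")].copy()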
if len(seg_keep) == 0:
    raise ValueError("No final stand segment found in segments_labeled (gait == 'Stand' and not ignored).")

final_stand = seg_keep.sort_values("t0_s").iloc[-1]
final_t0 = float(final_stand["t0_s"] + HR_TIME_OFFSET_S)
final_t1 = float(final_stand["t1_s"] + HR_TIME_OFFSET_S)

# -----------------------------
# Compute HRV metrics
# -----------------------------
rest_metrics = compute_hrv_for_window(
    df_rest, rest_t0, rest_t1,
    label="Initial stand (rest file)",
    trim_start=REST_TRIM_START_S,
    trim_end=REST_TRIM_END_S
)

final_metrics = compute_hrv_for_window(
    df_work, final_t0, final_t1,
    label="Final stand (workout file)",
    trim_start=FINAL_TRIM_START_S,
    trim_end=FINAL_TRIM_END_S
)

hrv_table = pd.DataFrame([rest_metrics, final_metrics])

# Keep the “headline” columns first
headline_cols = [
    "segment",
    "N_intervals",
    "Mean_HR_bpm_fromHR",
    "Mean_HR_bpm_fromNN",
    "Mean_NN_ms",
    "SDNN_ms",
    "RMSSD_ms",
    "pNN50_pct",
    "removed_after_range/jump_filter",
    "used_start_s",
    "used_end_s",
    "RR_col",
]
headline_cols = [c for c in headline_cols if c in hrv_table.columns]
hrv_table = hrv_table[headline_cols]

# Light rounding for display only (does not affect underlying calculations)
for c in ["Mean_HR_bpm_fromHR", "Mean_HR_bpm_fromNN", "Mean_NN_ms", "SDNN_ms", "RMSSD_ms", "pNN50_pct"]:
    if c in hrv_table.columns:
        hrv_table[c] = pd.to_numeric(hrv_table[c], errors="coerce").round(2)

print("HRV metrics (initial rest stand vs final stand):")
display(hrv_table)



## Step 6 — HRV over time with gait shading (two plots: normalized and raw)

In this section we visualize how HRV changes throughout the workout using a **moving-window** estimate computed from the **NN (RR) intervals**.

We make **two plots** (same gait shading on both):

### Plot A: Normalized HRV (local normalization)
We compute RMSSD in a moving window, then normalize it by the **local mean NN** computed over a (possibly longer) moving window:

- **RMSSD(t)**: computed over the last `WINDOW_S` seconds  
- **meanNN_local(t)**: mean NN over the last `NORM_WINDOW_S` seconds  
- **Normalized HRV**:  
  **CVRMSSD_local(t) = 100 × RMSSD(t) / meanNN_local(t)**

This gives a dimensionless percent-like measure that is less sensitive to overall heart rate level and is easier to compare across time and across weeks.
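
A quick numeric illustration of the normalization, using made-up values and a hypothetical helper `cvrmssd_pct`: the same raw RMSSD represents a larger fraction of the beat interval when the heart is beating faster.

```python
def cvrmssd_pct(rmssd_ms, mean_nn_ms):
    """Normalized HRV in percent: 100 * RMSSD / mean NN."""
    if mean_nn_ms <= 0:
        return float("nan")
    return 100.0 * rmssd_ms / mean_nn_ms

# Made-up values: identical raw RMSSD at a slow and a faster heart rate
stand = cvrmssd_pct(rmssd_ms=30.0, mean_nn_ms=1500.0)   # mean NN 1500 ms is 40 bpm
walk = cvrmssd_pct(rmssd_ms=30.0, mean_nn_ms=750.0)     # mean NN 750 ms is 80 bpm
print(stand, walk)
```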

### Plot B: Raw HRV (no normalization)
We plot **RMSSD(t)** in milliseconds. This is the same moving-window HRV estimate, but without dividing by mean NN.

### Notes
- HRV is most interpretable during **quiet standing**; movement and contact artifacts can distort it.
- We apply several NN cleaning steps (range filter, jump filter, and robust outlier removal) to reduce artifacts.
- Use these plots as **tracking and context**, not as diagnosis.
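
The robust outlier step can be illustrated with a median absolute deviation (MAD) score. The sketch below uses a made-up NN sequence and a hypothetical helper `mad_inliers`; the 0.6745 factor rescales the MAD so that the score is comparable to a z-score for normally distributed data.

```python
import numpy as np

def mad_inliers(x, z=4.0):
    """Boolean mask: True where the MAD-based robust z-score is within +/- z."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    if mad == 0:
        return np.ones(len(x), dtype=bool)   # no spread to judge against; keep everything
    robust_z = 0.6745 * (x - med) / mad
    return np.abs(robust_z) <= z

# Made-up NN intervals (ms): steady near 1500 ms with a single 1900 ms outlier
nn = np.array([1500, 1510, 1490, 1505, 1495, 1900, 1500], dtype=float)
mask = mad_inliers(nn, z=4.0)
print(nn[mask])
```

Because the median and MAD barely move when one extreme value is present, the 1900 ms interval is flagged while the ordinary beat-to-beat variation is kept.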



In [None]:

# Step 6 — HRV vs time with gait shading (two separate graphs)
# Plot A: normalized (local) CVRMSSD_local = 100 * RMSSD / meanNN_local
# Plot B: raw RMSSD (ms)
#
# Shading uses gait segments `segs` where label_id: 0=stand,1=walk,2=trot
# Legend includes ONLY gaits (shading), not the HRV lines.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

# -----------------------------
# #TODO parameters
# -----------------------------
WINDOW_S = 60                 # window for RMSSD (seconds)
NORM_WINDOW_S = 60           # window for local meanNN used for normalization (seconds)
MIN_NN_IN_WINDOW = 15         # min NN points required in RMSSD window

# Artifact cleaning
NN_MIN_MS = 300               # physiological bounds
NN_MAX_MS = 2000
MAX_DIFF_MS = 250             # drop beats with huge successive NN jumps
NN_OUTLIER_Z = 4.0            # MAD-based global outlier filter (bigger = less aggressive)

# Time alignment
HR_TIME_OFFSET_S = 0.0        # align beat times with segment times if needed

# Shading colors (0=stand, 1=walk, 2=trot)
bg_colors = {
    0: "lightgreen",  # stand
    1: "lightblue",   # walk
    2: "pink",        # trot
}
SHADE_ALPHA = 0.6

# Line colors
NORM_LINE_COLOR = "purple"
RAW_LINE_COLOR  = "black"

# -----------------------------
# Preconditions
# -----------------------------
if "df_work" not in globals():
    raise ValueError("df_work not found. Load the workout HR file into df_work first.")
if "segs" not in globals():
    raise ValueError("segs not found. Run the speed/segmentation step first (the one that produces `segs`).")

# -----------------------------
# Helpers
# -----------------------------
def ensure_time_seconds(df):
    """Ensure a 't_s' column exists."""
    if "t_s" in df.columns:
        df["t_s"] = pd.to_numeric(df["t_s"], errors="coerce")
        return df

    col_lower = {c.lower(): c for c in df.columns}
    if "ms" in col_lower:
        ms = pd.to_numeric(df[col_lower["ms"]], errors="coerce")
        df["t_s"] = (ms - ms.iloc[0]) / 1000.0
        return df

    for key in ["timestamp", "time", "datetime", "date_time", "date time"]:
        if key in col_lower:
            ts = pd.to_datetime(df[col_lower[key]], errors="coerce")
            df["t_s"] = (ts - ts.iloc[0]).dt.total_seconds()
            return df

    raise ValueError("Could not infer time column. Expected 't_s' or 'MS' or a timestamp-like column.")

def find_rr_column(df):
    """Find likely NN/RR interval column in ms."""
    candidates = [
        "rr", "rr_ms", "rr(ms)", "rr (ms)", "rrinterval", "rr_interval",
        "nn", "nn_ms", "ibi", "ibi_ms", "rri", "rri_ms"
    ]
    col_map = {c.lower().replace(" ", ""): c for c in df.columns}
    for k in candidates:
        kk = k.lower().replace(" ", "")
        if kk in col_map:
            return col_map[kk]
    for c in df.columns:
        cl = c.lower()
        if ("rr" in cl or "nn" in cl or "ibi" in cl) and ("ms" in cl or "interval" in cl):
            return c
    raise ValueError("Could not find an RR/NN interval column in df_work.")

def mad_based_inlier_mask(x, z=4.0):
    """True = keep; False = outlier."""
    x = np.asarray(x, dtype=float)
    m = np.nanmedian(x)
    mad = np.nanmedian(np.abs(x - m))
    if not np.isfinite(mad) or mad == 0:
        return np.isfinite(x)
    robust_z = 0.6745 * (x - m) / mad
    return np.isfinite(x) & (np.abs(robust_z) <= z)

def rmssd_raw(x):
    """RMSSD on a 1D array of NN intervals (ms)."""
    x = np.asarray(x, dtype=float)
    if len(x) < MIN_NN_IN_WINDOW:
        return np.nan
    dx = np.diff(x)
    if len(dx) == 0:
        return np.nan
    return float(np.sqrt(np.mean(dx * dx)))

def shade_segments(ax, segs, alpha=0.6):
    """Shade gait segments on x-axis in minutes. segs: list of (label_id, t0, t1)."""
    for lab, t0, t1 in segs:
        if lab not in bg_colors:
            continue
        ax.axvspan(t0/60.0, t1/60.0, alpha=alpha, color=bg_colors[lab], lw=0)

def add_gait_legend(ax):
    """Legend for gait shading only."""
    gait_labels = {0: "Stand", 1: "Walk", 2: "Trot"}
    patches = [mpatches.Patch(color=bg_colors[k], label=gait_labels[k], alpha=SHADE_ALPHA) for k in [0, 1, 2]]
    ax.legend(handles=patches, loc="best", title="Gait (shading)")

# -----------------------------
# Prepare beat series (time + NN)
# -----------------------------
df_work = ensure_time_seconds(df_work)
rr_col = find_rr_column(df_work)

beat = df_work[["t_s", rr_col]].copy()
beat["t_s"] = pd.to_numeric(beat["t_s"], errors="coerce") + HR_TIME_OFFSET_S
beat["NN_ms"] = pd.to_numeric(beat[rr_col], errors="coerce")
beat = beat.dropna(subset=["t_s", "NN_ms"]).sort_values("t_s").reset_index(drop=True)

# Range filter
beat = beat[(beat["NN_ms"] >= NN_MIN_MS) & (beat["NN_ms"] <= NN_MAX_MS)].reset_index(drop=True)

# Jump filter
if len(beat) >= 3:
    d = np.abs(np.diff(beat["NN_ms"].to_numpy(dtype=float)))
    keep = np.ones(len(beat), dtype=bool)
    keep[1:] &= (d <= MAX_DIFF_MS)
    beat = beat.loc[keep].reset_index(drop=True)

# Global MAD outlier filter
inliers = mad_based_inlier_mask(beat["NN_ms"].to_numpy(dtype=float), z=NN_OUTLIER_Z)
beat = beat.loc[inliers].reset_index(drop=True)

if len(beat) < 50:
    print("⚠️ Very few NN points after filtering. Consider loosening filters (MAX_DIFF_MS or NN_OUTLIER_Z).")

# -----------------------------
# Rolling metrics using time-based windows
# -----------------------------
idx = pd.to_timedelta(beat["t_s"], unit="s")
nn = pd.Series(beat["NN_ms"].to_numpy(dtype=float), index=idx)

rmssd = nn.rolling(f"{WINDOW_S}s").apply(rmssd_raw, raw=True)
mean_nn_local = nn.rolling(f"{NORM_WINDOW_S}s").mean()
cvrmssd_local = 100.0 * (rmssd / mean_nn_local)

ts_norm = pd.DataFrame({
    "t_s": beat["t_s"].to_numpy(dtype=float),
    "CVRMSSD_local_pct": cvrmssd_local.to_numpy(dtype=float),
}).dropna(subset=["CVRMSSD_local_pct"]).reset_index(drop=True)

ts_raw = pd.DataFrame({
    "t_s": beat["t_s"].to_numpy(dtype=float),
    "RMSSD_ms": rmssd.to_numpy(dtype=float),
}).dropna(subset=["RMSSD_ms"]).reset_index(drop=True)

# -----------------------------
# Plot A: Normalized HRV (local)
# -----------------------------
fig, ax = plt.subplots(figsize=(12, 4))
shade_segments(ax, segs, alpha=SHADE_ALPHA)
ax.plot(ts_norm["t_s"]/60.0, ts_norm["CVRMSSD_local_pct"], color=NORM_LINE_COLOR, linewidth=1.6)

ax.set_xlabel("Time (min)")
ax.set_ylabel("CVRMSSD_local (%) = 100 × RMSSD/meanNN_local")
ax.set_title("Normalized HRV over time (local normalization) with gait shading")
ax.grid(True, alpha=0.25)
add_gait_legend(ax)
plt.show()

# -----------------------------
# Plot B: Raw RMSSD (ms)
# -----------------------------
fig, ax = plt.subplots(figsize=(12, 4))
shade_segments(ax, segs, alpha=SHADE_ALPHA)
ax.plot(ts_raw["t_s"]/60.0, ts_raw["RMSSD_ms"], color=RAW_LINE_COLOR, linewidth=1.4)

ax.set_xlabel("Time (min)")
ax.set_ylabel(f"RMSSD (ms)  [window={WINDOW_S}s]")
ax.set_title("Raw HRV over time (RMSSD) with gait shading")
ax.grid(True, alpha=0.25)
add_gait_legend(ax)
plt.show()

