Clean up hdbscan validation workflow and plots #278
Merged
paco-barreras merged 7 commits into caroline-hdbscan-benchmarking on Apr 24, 2026
Conversation
…#277)

## The Problem

When a trajectory dataset contains both datetime and timestamp columns, NOMAD's fallback logic automatically picks one based on priority rules. But this creates a friction point: if you explicitly tell the code 'I want to use this column,' that request gets ignored whenever the defaults would pick something else. This matters because real-world data often has redundant time columns—maybe your raw data has both unix seconds and a formatted datetime, or you've added computed columns alongside the originals. When there's ambiguity, the code should respect your choice, not force you to delete columns just to make it cooperate.

## The Workaround (Before This Fix)

Consider hdbscan_validation_paper.py: it had to manually drop the datetime column to force the fallback to pick start_timestamp, even though the code explicitly mapped the timestamp columns. The preprocessing became:

```python
def _prep_truth(truth):
    t = truth.copy()
    # drop 'datetime' so _fallback_time_cols_dt resolves to 'start_timestamp'
    t = t.drop(columns=['datetime'], errors='ignore')
    # ... more manipulation
```

This is backwards. If you've explicitly specified `traj_cols={'timestamp': 'start_timestamp'}`, the fallback should use that, not ignore it.

## The Fix

The patch captures user-provided column mappings *before* loading defaults, then gives them priority. It's a minimal change with a clean rule: explicit mappings win. Now you can pass data with all its natural columns intact—the code listens to what you asked for.

Fixes #233
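As a minimal sketch of that priority rule (the function and variable names here are illustrative, not NOMAD's actual internals), the idea is to snapshot the explicit mapping before defaults are resolved and let it override them:

```python
# Illustrative sketch of "explicit mappings win": capture the caller's
# mapping before defaults are loaded, then overlay it on the defaults.
def resolve_traj_cols(traj_cols=None, defaults=None):
    explicit = dict(traj_cols or {})   # captured before defaults apply
    resolved = dict(defaults or {})
    resolved.update(explicit)          # explicit mappings take priority
    return resolved

# With both 'datetime' and 'start_timestamp' present, the explicit
# request is honored instead of the default-priority column:
cols = resolve_traj_cols(
    traj_cols={'timestamp': 'start_timestamp'},
    defaults={'datetime': 'datetime', 'timestamp': 'timestamp'},
)
assert cols['timestamp'] == 'start_timestamp'
```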
This branch cleans up the HDBSCAN validation notebook, along with some deeper refactors to nomad's functions.
## Validation
The first part of the change improves the validation path around `compute_visitation_errors`. It now lives with the rest of the stop-detection validation logic in `validation.py`, and the overlap/validation code can handle a separate `traj_cols` for the right-hand table when the predicted stops and the truth table do not use the same column names. That let me remove a lot of notebook-side transformations that were only there to work with that fragile code.
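A minimal sketch of the idea (not nomad's actual API; the function and parameter names below are made up): each table carries its own column mapping and gets normalized internally, so the caller never renames columns to a shared schema.

```python
import pandas as pd

def _normalize(df, cols):
    # cols maps canonical names -> this table's actual column names
    return df.rename(columns={v: k for k, v in cols.items()})

def compare_stops(pred, truth, pred_cols, truth_cols):
    # normalize each side with its own mapping; downstream logic can
    # then assume canonical names like 'timestamp'
    return _normalize(pred, pred_cols), _normalize(truth, truth_cols)

pred = pd.DataFrame({'timestamp': [10, 50]})
truth = pd.DataFrame({'start_timestamp': [12, 48]})
p, t = compare_stops(pred, truth,
                     pred_cols={'timestamp': 'timestamp'},
                     truth_cols={'timestamp': 'start_timestamp'})
```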
## Notebook

The notebook `hdbscan_validation_paper` is leaner. It no longer passes default `traj_cols` mappings into loaders just to restate the defaults, and it no longer drops diary rows with missing building IDs before validation. The general metrics now use the full truth diary, while category-specific slices happen naturally where the categories are actually used. I also fixed the stale `start_timestamp`/`timestamp` mismatch after the summarize-stop output switched to `keep_col_names=True`, and cleaned up the generation path so regenerated diaries keep `user_id`.
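For context on that mismatch, here is an assumed (not verified against nomad's source) reading of what `keep_col_names=True` implies for downstream code; the tiny table is made up:

```python
import pandas as pd

# Assumed semantics: with keep_col_names=True, the summarize-stop output
# keeps the input's own column names, so the table exposes
# 'start_timestamp' rather than a renamed default 'timestamp'.
stops = pd.DataFrame({'start_timestamp': [1700000000], 'duration': [900]})

# Downstream code reads the mapped name instead of hardcoding 'timestamp':
traj_cols = {'timestamp': 'start_timestamp'}
ts_col = traj_cols.get('timestamp', 'timestamp')
stop_starts = stops[ts_col]
```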
## Plotting

The plotting code also got reorganized. The notebook was mixing up two different statistical objects: the per-user distribution of a metric, and the uncertainty in the median metric estimate. Those are now shown separately.
`validation.py` now provides a small bootstrap summary helper plus two plotting helpers: one for per-user boxplots, and one for bootstrapped median estimates with interval whiskers. The boxplots are there to show the spread across users; the point-and-whisker plot is there to compare the estimated medians. That split makes the interpretation much clearer for this notebook.
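A minimal sketch of what such a bootstrap-median summary can look like (nomad's actual helper may differ in name and signature): resample the per-user values with replacement and take percentile whiskers around the median estimate.

```python
import numpy as np

def bootstrap_median(values, n_boot=1000, ci=0.95, rng=None):
    # resample per-user metric values with replacement, recording the
    # median of each resample to form a bootstrap distribution
    rng = rng if rng is not None else np.random.default_rng()
    values = np.asarray(values)
    medians = np.array([
        np.median(rng.choice(values, size=values.size, replace=True))
        for _ in range(n_boot)
    ])
    lo = np.percentile(medians, (1 - ci) / 2 * 100)
    hi = np.percentile(medians, (1 + ci) / 2 * 100)
    return np.median(values), lo, hi

est, lo, hi = bootstrap_median([0.12, 0.31, 0.08, 0.22, 0.15])
```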
For the grouped colors, the x-axis still uses the registry family labels such as `lachesis_coarse` and `lachesis_fine`, but the colors are grouped by the underlying base algorithm. That is piped through from the registry as `{algo['family']: algo['algorithm']}`, so variants of the same base method share a hue family without hardcoding the palette in the notebook (sketched below).

I ran the validation notebook end to end on the 250-agent dataset after these changes. The script completes successfully and writes figures that make sense to me.
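The promised sketch of the registry-driven hue grouping: the registry entries below are made up, and only the `{algo['family']: algo['algorithm']}` shape comes from the description above.

```python
import matplotlib.pyplot as plt

registry = [  # illustrative registry entries
    {'family': 'lachesis_coarse', 'algorithm': 'lachesis'},
    {'family': 'lachesis_fine',   'algorithm': 'lachesis'},
    {'family': 'hdbscan_default', 'algorithm': 'hdbscan'},
]

# map each family label to its base algorithm, then assign one color
# per base algorithm so variants of the same method share a hue
family_to_algo = {algo['family']: algo['algorithm'] for algo in registry}
base_algos = sorted(set(family_to_algo.values()))
algo_color = {a: plt.cm.tab10(i) for i, a in enumerate(base_algos)}
family_color = {fam: algo_color[alg] for fam, alg in family_to_algo.items()}
# 'lachesis_coarse' and 'lachesis_fine' now share the lachesis hue
```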