
Clean up hdbscan validation workflow and plots#278

Merged
paco-barreras merged 7 commits into caroline-hdbscan-benchmarking from validation-schema-flexibility
Apr 24, 2026

Conversation


@paco-barreras paco-barreras commented Apr 24, 2026

This branch cleans up the HDBSCAN validation notebook, with some deeper refactors to nomad's validation functions.

Validation

The first part of the change improves the validation path around compute_visitation_errors. It now lives with the rest of the stop-detection validation logic in validation.py, and the overlap / validation code can take a separate traj_cols mapping for the right-hand table when the predicted stops and the truth table do not use the same column names. That let me remove a lot of notebook-side transformations that were only there to work around the fragile old code.
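For example, a call could look something like this (a hypothetical sketch: the keyword names, in particular the separate right-hand mapping, are assumptions for illustration, not nomad's confirmed signature):

```python
from nomad import validation

# Hypothetical usage sketch; predicted_stops and truth_diary are assumed
# to be already-loaded DataFrames, and the keyword names are illustrative.
errors = validation.compute_visitation_errors(
    predicted_stops,  # left-hand table: stops detected by HDBSCAN
    truth_diary,      # right-hand table: ground-truth diary
    traj_cols={"user_id": "user_id", "timestamp": "start_timestamp"},
    # separate mapping for the right-hand table, whose columns differ
    truth_traj_cols={"user_id": "agent_id", "timestamp": "start_time"},
)
```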

Notebook

The notebook hdbscan_validation_paper is leaner. It no longer passes default traj_cols mappings into loaders just to restate the defaults, and it no longer drops diary rows with missing building IDs before validation. The general metrics now use the full truth diary, while category-specific slices happen naturally where the categories are actually used. I also fixed the stale start_timestamp / timestamp mismatch after the summarize-stop output switched to keep_col_names=True, and cleaned up the generation path so regenerated diaries keep user_id.

Plotting

The plotting code also got reorganized. The notebook was mixing up two different statistical objects: the per-user distribution of a metric, and uncertainty in the median metric estimate. Those are now shown separately. validation.py now provides a small bootstrap summary helper plus two plotting helpers: one for per-user boxplots, and one for bootstrapped median estimates with interval whiskers. The boxplots are there to show the spread across users; the point-and-whisker plot is there to compare the estimated medians. That split makes the interpretation much clearer for this notebook.
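For concreteness, the bootstrap summary can be something along these lines (a minimal sketch; the function name, interval width, and return shape are illustrative, not the actual helper):

```python
import numpy as np

def bootstrap_median_summary(per_user_values, n_boot=1000, ci=95, seed=0):
    """Bootstrap the median of a per-user metric; returns (median, lo, hi)."""
    rng = np.random.default_rng(seed)
    vals = np.asarray(per_user_values)
    boot_medians = np.array([
        np.median(rng.choice(vals, size=vals.size, replace=True))
        for _ in range(n_boot)
    ])
    half = (100 - ci) / 2
    lo, hi = np.percentile(boot_medians, [half, 100 - half])
    return float(np.median(vals)), float(lo), float(hi)
```

The boxplot helper then draws the raw per-user values directly, while the point-and-whisker helper draws one (median, lo, hi) triple per algorithm family.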

For the grouped colors, the x-axis still uses the registry family labels such as lachesis_coarse and lachesis_fine, but the colors are grouped by the underlying base algorithm. That is piped through from the registry as {algo['family']: algo['algorithm']}, so variants of the same base method share a hue family without hardcoding the palette in the notebook.
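Schematically, the hue mapping can be built like this (registry entries invented for illustration; only the family / algorithm fields described above are used):

```python
import seaborn as sns

# Illustrative registry entries; the real registry carries more fields.
registry = [
    {"family": "lachesis_coarse", "algorithm": "lachesis"},
    {"family": "lachesis_fine",   "algorithm": "lachesis"},
    {"family": "hdbscan",         "algorithm": "hdbscan"},
]

family_to_algo = {algo["family"]: algo["algorithm"] for algo in registry}
base_algos = sorted(set(family_to_algo.values()))
base_colors = dict(zip(base_algos, sns.color_palette("deep", len(base_algos))))

# Variants of the same base algorithm share a hue family.
palette = {family: base_colors[algo] for family, algo in family_to_algo.items()}
```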

I ran the validation notebook end to end on the 250-agent dataset after these changes. The script completes successfully and writes figures that make sense to me.

paco-barreras and others added 7 commits April 24, 2026 01:54
…#277)

## The Problem

When a trajectory dataset contains both datetime and timestamp columns,
NOMAD's fallback logic automatically picks one based on priority rules.
But this creates a friction point: if you explicitly tell the code 'I
want to use this column,' that request gets ignored if the defaults
would pick something else.

This matters because real-world data often has redundant time
columns—maybe your raw data has both unix seconds and formatted
datetime, or you've added computed columns alongside the originals. When
there's ambiguity, the code should respect your choice, not force you to
delete columns just to make it cooperate.
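To make the friction concrete (function and column names here are hypothetical, for illustration only):

```python
# The caller explicitly asks for 'start_timestamp'...
stops = detect_stops(df, traj_cols={"timestamp": "start_timestamp"})
# ...but if df also carries a 'datetime' column, the old fallback
# resolved the time column from 'datetime' and ignored the mapping.
```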

## The Workaround (Before This Fix)

Consider hdbscan_validation_paper.py: it had to manually drop the
datetime column to force the fallback to pick start_timestamp, even
though the code explicitly mapped to timestamp columns. The
preprocessing became:

```python
def _prep_truth(truth):
    t = truth.copy()
    # drop 'datetime' so _fallback_time_cols_dt resolves to 'start_timestamp'
    t = t.drop(columns=['datetime'], errors='ignore')
    # ... more manipulation
```

This is backwards. If you've explicitly specified traj_cols={'timestamp':
'start_timestamp'}, the fallback should use that, not ignore it.

## The Fix

The patch captures user-provided column mappings *before* loading
defaults, then gives them priority. It's a minimal change with a clean
rule: explicit mappings win.

Now you can pass data with all its natural columns intact—the code
listens to what you asked for.
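The precedence rule fits in a few lines (a schematic sketch, not the actual patch):

```python
def _resolve_traj_cols(user_traj_cols=None, defaults=None):
    """Schematic precedence: explicit user mappings win over defaults."""
    resolved = dict(defaults or {})        # start from library defaults
    resolved.update(user_traj_cols or {})  # explicit mappings override
    return resolved

# _resolve_traj_cols({'timestamp': 'start_timestamp'},
#                    {'timestamp': 'timestamp', 'datetime': 'datetime'})
# -> {'timestamp': 'start_timestamp', 'datetime': 'datetime'}
```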

Fixes #233
@paco-barreras paco-barreras merged commit f72788a into caroline-hdbscan-benchmarking Apr 24, 2026
@paco-barreras paco-barreras deleted the validation-schema-flexibility branch April 24, 2026 21:49
