Clean up hdbscan validation workflow and plots #278
Merged
paco-barreras merged 7 commits into caroline-hdbscan-benchmarking on Apr 24, 2026
Conversation
…#277)

## The Problem

When a trajectory dataset contains both datetime and timestamp columns, NOMAD's fallback logic automatically picks one based on priority rules. But this creates a friction point: if you explicitly tell the code 'I want to use this column,' that request gets ignored whenever the defaults would pick something else. This matters because real-world data often has redundant time columns—maybe your raw data has both unix seconds and a formatted datetime, or you've added computed columns alongside the originals. When there's ambiguity, the code should respect your choice, not force you to delete columns just to make it cooperate.

## The Workaround (Before This Fix)

Consider hdbscan_validation_paper.py: it had to manually drop the datetime column to force the fallback to pick start_timestamp, even though the code explicitly mapped the timestamp columns. The preprocessing became:

```python
def _prep_truth(truth):
    t = truth.copy()
    # drop 'datetime' so _fallback_time_cols_dt resolves to 'start_timestamp'
    t = t.drop(columns=['datetime'], errors='ignore')
    # ... more manipulation
```

This is backwards. If you've explicitly specified `traj_cols={'timestamp': 'start_timestamp'}`, the fallback should use that, not ignore it.

## The Fix

The patch captures user-provided column mappings *before* loading defaults, then gives them priority. It's a minimal change with a clean rule: explicit mappings win. Now you can pass data with all its natural columns intact—the code listens to what you asked for.

Fixes #233
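As a minimal sketch of that priority rule (the function and variable names here are illustrative, not NOMAD's actual internals), the idea is to snapshot the explicit mapping before defaults are resolved and let it override them:

```python
# Illustrative sketch of "explicit mappings win": capture the caller's
# mapping before defaults are loaded, then overlay it on the defaults.
def resolve_traj_cols(traj_cols=None, defaults=None):
    explicit = dict(traj_cols or {})   # captured before defaults apply
    resolved = dict(defaults or {})
    resolved.update(explicit)          # explicit mappings take priority
    return resolved

# With both 'datetime' and 'start_timestamp' present, the explicit
# request is honored instead of the default-priority column:
cols = resolve_traj_cols(
    traj_cols={'timestamp': 'start_timestamp'},
    defaults={'datetime': 'datetime', 'timestamp': 'timestamp'},
)
assert cols['timestamp'] == 'start_timestamp'
```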
This branch cleans up the HDBSCAN validation notebook, along with some deeper refactors to nomad's functions.
## Validation
The first part of the change improves the validation path around `compute_visitation_errors`. It now lives with the rest of the stop-detection validation logic in `validation.py`, and the overlap/validation code can handle a separate `traj_cols` for the right-hand table when the predicted stops and the truth table do not use the same column names. That let me remove a lot of notebook-side transformations that were only there to work with that fragile code.
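A minimal sketch of the idea (not nomad's actual API; the function and parameter names below are made up): each table carries its own column mapping and gets normalized internally, so the caller never renames columns to a shared schema.

```python
import pandas as pd

def _normalize(df, cols):
    # cols maps canonical names -> this table's actual column names
    return df.rename(columns={v: k for k, v in cols.items()})

def compare_stops(pred, truth, pred_cols, truth_cols):
    # normalize each side with its own mapping; downstream logic can
    # then assume canonical names like 'timestamp'
    return _normalize(pred, pred_cols), _normalize(truth, truth_cols)

pred = pd.DataFrame({'timestamp': [10, 50]})
truth = pd.DataFrame({'start_timestamp': [12, 48]})
p, t = compare_stops(pred, truth,
                     pred_cols={'timestamp': 'timestamp'},
                     truth_cols={'timestamp': 'start_timestamp'})
```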
## Notebook

The notebook `hdbscan_validation_paper` is leaner. It no longer passes default `traj_cols` mappings into loaders just to restate the defaults, and it no longer drops diary rows with missing building IDs before validation. The general metrics now use the full truth diary, while category-specific slices happen naturally where the categories are actually used. I also fixed the stale `start_timestamp`/`timestamp` mismatch after the summarize-stop output switched to `keep_col_names=True`, and cleaned up the generation path so regenerated diaries keep `user_id`.
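For context on that mismatch, here is an assumed (not verified against nomad's source) reading of what `keep_col_names=True` implies for downstream code; the tiny table is made up:

```python
import pandas as pd

# Assumed semantics: with keep_col_names=True, the summarize-stop output
# keeps the input's own column names, so the table exposes
# 'start_timestamp' rather than a renamed default 'timestamp'.
stops = pd.DataFrame({'start_timestamp': [1700000000], 'duration': [900]})

# Downstream code reads the mapped name instead of hardcoding 'timestamp':
traj_cols = {'timestamp': 'start_timestamp'}
ts_col = traj_cols.get('timestamp', 'timestamp')
stop_starts = stops[ts_col]
```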
## Plotting

The plotting code also got reorganized. The notebook was mixing up two different statistical objects: the per-user distribution of a metric, and the uncertainty in the median metric estimate. Those are now shown separately.
`validation.py` now provides a small bootstrap summary helper plus two plotting helpers: one for per-user boxplots, and one for bootstrapped median estimates with interval whiskers. The boxplots are there to show the spread across users; the point-and-whisker plot is there to compare the estimated medians. That split makes the interpretation much clearer for this notebook.
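A minimal sketch of what such a bootstrap-median summary can look like (nomad's actual helper may differ in name and signature): resample the per-user values with replacement and take percentile whiskers around the median estimate.

```python
import numpy as np

def bootstrap_median(values, n_boot=1000, ci=0.95, rng=None):
    # resample per-user metric values with replacement, recording the
    # median of each resample to form a bootstrap distribution
    rng = rng if rng is not None else np.random.default_rng()
    values = np.asarray(values)
    medians = np.array([
        np.median(rng.choice(values, size=values.size, replace=True))
        for _ in range(n_boot)
    ])
    lo = np.percentile(medians, (1 - ci) / 2 * 100)
    hi = np.percentile(medians, (1 + ci) / 2 * 100)
    return np.median(values), lo, hi

est, lo, hi = bootstrap_median([0.12, 0.31, 0.08, 0.22, 0.15])
```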
For the grouped colors, the x-axis still uses the registry family labels such as `lachesis_coarse` and `lachesis_fine`, but the colors are grouped by the underlying base algorithm. That is piped through from the registry as `{algo['family']: algo['algorithm']}`, so variants of the same base method share a hue family without hardcoding the palette in the notebook (sketched below).

I ran the validation notebook end to end on the 250-agent dataset after these changes. The script completes successfully and writes figures that make sense to me.
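The promised sketch of the registry-driven hue grouping: the registry entries below are made up, and only the `{algo['family']: algo['algorithm']}` shape comes from the description above.

```python
import matplotlib.pyplot as plt

registry = [  # illustrative registry entries
    {'family': 'lachesis_coarse', 'algorithm': 'lachesis'},
    {'family': 'lachesis_fine',   'algorithm': 'lachesis'},
    {'family': 'hdbscan_default', 'algorithm': 'hdbscan'},
]

# map each family label to its base algorithm, then assign one color
# per base algorithm so variants of the same method share a hue
family_to_algo = {algo['family']: algo['algorithm'] for algo in registry}
base_algos = sorted(set(family_to_algo.values()))
algo_color = {a: plt.cm.tab10(i) for i, a in enumerate(base_algos)}
family_color = {fam: algo_color[alg] for fam, alg in family_to_algo.items()}
# 'lachesis_coarse' and 'lachesis_fine' now share the lachesis hue
```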