Skip to content

Improved dataset loading; fixes, safeties, diagnostics, and better feedback#613

Merged
tlwillke merged 7 commits intomainfrom
improved_loader
Feb 5, 2026
Merged

Improved dataset loading; fixes, safeties, diagnostics, and better feedback#613
tlwillke merged 7 commits intomainfrom
improved_loader

Conversation

@jshook
Copy link
Copy Markdown
Contributor

@jshook jshook commented Feb 5, 2026

This fixes a regression in hdf5 data loading, but it also adds minor improvements for testing diagnostics.
Each functional change is shown on a themed commit, for easier review.

Changes:

  1. Fix off-by-one name extraction for .hdf5 dataset names.
  2. Add pom config to allow in-process jvm debugging of exec (exec:java).
  3. Add missing goal qualifiers to pom entries for exec, to allow IntelliJ to correctly map them.
  4. Add logging to dataset loader paths.
  5. Make argv to regex processing more defensive.
  6. Add warnings to users for invalid dataset name.
  7. Make hdf5 loader handle paths correctly.
  8. Remove invalid thrown exception in Optional code path.
  9. Add example config for yaml-based glove-25-angular, for testing/demonstration purposes.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Feb 5, 2026

Before you submit for review:

  • Does your PR follow guidelines from CONTRIBUTIONS.md?
  • Did you summarize what this PR does clearly and concisely?
  • Did you include performance data for changes which may be performance impacting?
  • Did you include useful docs for any user-facing changes or features?
  • Did you include useful javadocs for developer oriented changes, explaining new concepts or key changes?
  • Did you trigger and review regression testing results against the base branch via Run Bench Main?
  • Did you adhere to the code formatting guidelines (TBD)
  • Did you group your changes for easy review, providing meaningful descriptions for each commit?
  • Did you ensure that all files contain the correct copyright header?

If you did not complete any of these, then please explain below.

Copy link
Copy Markdown
Contributor

@MarkWolters MarkWolters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Collaborator

@tlwillke tlwillke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix works. The dataset loader is repeating the load 3 times, twice locally cached. This was not introduced by your PR, but can you please attempt a fix while you're at it?

@jshook
Copy link
Copy Markdown
Contributor Author

jshook commented Feb 5, 2026

The fix works. The dataset loader is repeating the load 3 times, twice locally cached. This was not introduced by your PR, but can you please attempt a fix while you're at it?

Yes, I'll take a look. I think this is due to no clear distinction between acquiring (a handle to) and accessing (the data therein). I'll see if there is a way to improve it which isn't onerous.

@tlwillke tlwillke self-requested a review February 5, 2026 18:15
@tlwillke
Copy link
Copy Markdown
Collaborator

tlwillke commented Feb 5, 2026

The fix works. The dataset loader is repeating the load 3 times, twice locally cached. This was not introduced by your PR, but can you please attempt a fix while you're at it?

Please disregard. It's actually the 3 required files per dataset. No fix required.

Copy link
Copy Markdown
Collaborator

@tlwillke tlwillke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tlwillke tlwillke merged commit e263cc8 into main Feb 5, 2026
13 checks passed
@tlwillke tlwillke deleted the improved_loader branch February 5, 2026 18:23
jshook added a commit that referenced this pull request Feb 12, 2026
…edback (#613)

* enable better diagnostics for dataset loading

* fix off-by-one error in extension truncation

* enable java-native invocation of bench for debugging

* qualify exec config to unconfuse IDE

* avoid possible NPE from maven null argv invocation

* make hdf5 loader more robust

* add example config for glove-25-angular.yml for testing
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants