
Conversation


@Kkuntal990 Kkuntal990 commented Nov 5, 2025

Summary

Adds Hugging Face Hub integration to braindecode, enabling easy sharing, versioning, and collaboration on EEG datasets. Datasets can be uploaded to and downloaded from the Hub using an optimized Zarr format.

Supports three dataset types:

  • WindowsDataset - Epoched data (mne.Epochs-based)
  • EEGWindowsDataset - Continuous raw data with windowing metadata
  • RawDataset - Continuous raw data without windowing

Key Features

Hub Integration:

  • push_to_hub() - Upload datasets with one command
  • from_pretrained() - Download shared datasets
  • Automatic README generation with metadata and usage info
  • Private/public repositories with version control
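
A minimal usage sketch of the two methods above, assuming a `BaseConcatDataset` called `windows_dataset` and a Hub repository id you control (exact keyword arguments may differ from the merged API):

```python
from braindecode.datasets import BaseConcatDataset

# Upload: converts the dataset to Zarr and pushes it to the Hub.
# The repo id and keyword arguments below are illustrative.
windows_dataset.push_to_hub(
    repo_id="username/my-eeg-dataset",
    private=True,
    commit_message="Initial upload",
)

# Download: fetches the Zarr store and rebuilds the dataset.
dataset = BaseConcatDataset.from_pretrained("username/my-eeg-dataset")
```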

Zarr Format (Optimized for Training):

  • Fastest random access among the benchmarked formats: 0.010 ms (critical for PyTorch DataLoader)
  • ~23% size reduction with blosc compression
  • Lossless float32 precision storage
  • Explicit dtype specification for clarity
  • Preserves all MNE metadata, channel info, and preprocessing history
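
To illustrate why Zarr suits this use case, here is a rough sketch of storing windowed signals with Blosc compression and an explicit float32 dtype (zarr 2.x API; the array name and chunking are assumptions, not the exact on-disk layout used by this PR):

```python
import numpy as np
import zarr
from numcodecs import Blosc

X = np.random.randn(100, 22, 1000).astype(np.float32)  # windows x channels x times

root = zarr.open_group("example_dataset.zarr", mode="w")
root.create_dataset(
    "signals",
    data=X,
    chunks=(1, 22, 1000),  # one window per chunk keeps random access fast
    compressor=Blosc(cname="zstd", clevel=5, shuffle=Blosc.SHUFFLE),
    dtype="float32",
)
print(root["signals"].shape, root["signals"].dtype)
```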

Note: Hub uses Zarr exclusively. Braindecode already supports local .fif format via dataset.save() and load_concat_dataset().
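
For comparison, the existing local workflow mentioned in the note above looks roughly like this (paths are illustrative):

```python
from braindecode.datautil import load_concat_dataset

# Save locally in .fif format.
dataset.save("./my_dataset", overwrite=True)

# Reload later without preloading the data into memory.
dataset = load_concat_dataset("./my_dataset", preload=False)
```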

Architecture

Registry Pattern - Eliminates circular imports by using a central registry where dataset classes register themselves. Both hub.py and base.py can access the registry without importing each other.
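
A minimal sketch of the registry idea (simplified; the real module is braindecode/datasets/registry.py and its helper names may differ slightly):

```python
_DATASET_REGISTRY = {}

def register_dataset(cls):
    """Class decorator: record the class under its name at definition time."""
    _DATASET_REGISTRY[cls.__name__] = cls
    return cls

def get_dataset_class(name):
    """Look up a registered dataset class without importing base.py."""
    return _DATASET_REGISTRY[name]

# In base.py, dataset classes register themselves when they are defined:
@register_dataset
class WindowsDataset:  # stand-in for the real class
    pass

# hub.py can now resolve classes by name without importing base.py:
windows_cls = get_dataset_class("WindowsDataset")
```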

Validation System:

  • Enforces dataset type consistency (no mixing types)
  • Checks that channel names and sampling frequency are uniform across all datasets in the concat
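
Roughly, the uniformity check behaves like the sketch below (illustrative only; the real logic lives in hub.py):

```python
def validate_uniform_info(ch_names_per_ds, sfreq_per_ds):
    """Raise ValueError unless all datasets share channels and sampling rate."""
    if len({tuple(chs) for chs in ch_names_per_ds}) > 1:
        raise ValueError(
            "Datasets in the concat have different channel names; "
            "pick a common channel set before pushing to the Hub."
        )
    if len(set(sfreq_per_ds)) > 1:
        raise ValueError(
            "Datasets in the concat have different sampling frequencies; "
            "resample to a common rate before pushing to the Hub."
        )

# Matching recordings pass silently; a mismatch raises with guidance.
validate_uniform_info([["C3", "C4"], ["C3", "C4"]], [250.0, 250.0])
```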

Integration with PR #813:

  • Compatible with improved preprocessor serialization
  • Preprocessing kwargs properly initialized on all dataset types

Changes

Modified Files:

  • braindecode/datasets/base.py - Registry decorators, preprocessing kwargs init
  • braindecode/datasets/hub.py - Core Hub logic, validation, dataset card
  • braindecode/datautil/hub_formats.py - Delegation pattern (67% duplication eliminated)
  • braindecode/datasets/registry.py - Registry pattern (NEW)
  • examples/datasets_io/plot_hub_integration.py - Comprehensive example

Tests:

  • test/unit_tests/datasets/test_hub_integration.py - 19 tests
  • test/unit_tests/datautil/test_hub_formats.py - 9 tests
  • Total: 28 tests, all passing ✅, 0 warnings

Test Coverage: Hub methods, zarr round-trip (all 3 types), validation, registry, overwrite

Live Demo Datasets

Successfully uploaded and tested (post upstream/master merge):

All verified for download, data integrity, and PyTorch DataLoader compatibility.

Checklist

  • All tests passing (28/28, 0 warnings)
  • Documentation updated (examples, docstrings)
  • No breaking changes to existing API
  • Error handling with helpful messages
  • Registry pattern eliminates circular imports
  • All three dataset types supported
  • Float32 precision with explicit dtype
  • Compatible with PR #813 (Improve Preprocessor serialisation)
  • Production ready (live datasets tested post-merge)

- Implemented Hub integration for braindecode datasets
- Added converters for HDF5, Zarr, NumPy+Parquet formats
- Created HubDatasetMixin with push_to_hub() and from_pretrained()
- Added comprehensive benchmarking script comparing all formats + FIF
- Benchmark configuration: 1000 subjects, size-optimized weights
- Includes documentation and test files
- Remove format parameter from push_to_hub(), use Zarr exclusively
- Remove HDF5 and NPZ_Parquet converters from hub_formats.py
- Reduce codebase from 1050 to 465 lines (~56% reduction)
- Update from_pretrained() to only support Zarr format
- Update documentation and examples
- Remove benchmark and test development files
- Add development files to .gitignore

Based on comprehensive benchmarking, Zarr provides optimal performance:
- Fastest random access: 0.010 ms (critical for PyTorch training)
- Fast save/load: 0.46s / 0.12s
- Good compression: ~23% size reduction with blosc
- Add 26 unit tests covering all Hub integration functionality
- Fix BaseConcatDataset to inherit from HubDatasetMixin
- Remove unreachable empty dataset validation
- Update README template default format reference
- Remove development documentation files (HUGGINGFACE_HUB_INTEGRATION.md, IMPLEMENTATION_STATUS.md)
- Update .gitignore to exclude development/benchmark files

Tests cover:
- Zarr format save/load (windowed and raw datasets)
- Compression options (blosc, different levels, uncompressed)
- Overwrite and partial loading functionality
- Preprocessing kwargs preservation
- Hub mixin availability and error handling
- Mock Hub uploads/downloads
- Channel info preservation

All 26 tests passing successfully.
@Kkuntal990 Kkuntal990 marked this pull request as draft November 5, 2025 04:36
Development files have been moved to ../braindecode_hub_development/
and are no longer in the repository.
@Kkuntal990 Kkuntal990 marked this pull request as ready for review November 5, 2025 04:46
@bruAristimunha
Collaborator

@PierreGtch, can you take a quick look please?

@bruAristimunha
Collaborator

Nice first round


@PierreGtch PierreGtch left a comment


Another nice PR @Kkuntal990

a few small comments, but the main two are:

  • You need to take into account the three types of RecordDataset we have, or raise a clear NotImplementedError. Currently, only two are considered.
  • Different datasets in the same concat can have different information. Two options:
      1. Check that all the elements in the concat have the EXACT same info, and raise an error if not. This will make it easier to browse datasets through the Hub.
      2. Save the different info of each dataset contained in the concat separately.

I'm more in favour of 1.

Kkuntal990 and others added 2 commits November 6, 2025 07:27
Code Standards:
- Change logger variable from 'logger' to 'log' (braindecode standard)
- Update circular import comments to '# Prevent circular import'
- Fix README API examples to show X, y, metainfo pattern
- Remove irrelevant library citation from dataset card

PR Feedback Addressed:
- Move imports to global scope (keep lazy where needed for circular imports)
- Replace all print() statements with log.info() for proper logging
- Update import error messages to suggest 'braindecode[hub]'
- Add braindecode.__version__ to format_info.json for reproducibility
- Implement proper dataset type detection for all 3 types:
  * BaseDataset (continuous) - raises NotImplementedError
  * EEGWindowsDataset (windowed with Raw)
  * WindowsDataset (windowed with Epochs)
- Add dataset uniformity validation in get_format_info()
  * Validates all datasets have same channels and sampling frequency
  * Raises ValueError if inconsistent
- Replace tempfile.TemporaryDirectory with pytest tmp_path fixtures

Test Improvements:
- Fix mock patching to use braindecode.datasets.hub_mixin.snapshot_download
- Update error type expectations to RuntimeError (wrapping internal errors)
- Remove tests for unsupported BaseDataset (continuous raw data)
- Update format_info assertions (removed is_windowed field)

All 24 tests passing successfully.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…r imports

Problem:
--------
The Hub integration code had circular import dependencies:
- base.py imports HubDatasetMixin from hub_mixin.py
- hub_mixin.py methods need dataset classes from base.py for isinstance() checks
- This created a circular dependency: base.py → hub_mixin.py → base.py

Previous workaround used lazy imports with _ensure_classes_loaded() and
module-level __getattr__(), which was complex and non-standard.

Solution:
---------
Implemented the Registry pattern to decouple modules:
1. Created registry.py - a central registry with zero dependencies
2. Dataset classes register themselves on definition using @register_dataset
3. Hub methods look up classes dynamically via get_dataset_class()
4. Type checking uses get_dataset_type() instead of direct isinstance()

Benefits:
---------
✓ Eliminates circular imports completely
✓ All imports at module level (PEP 8 compliant)
✓ No lazy imports or globals() manipulation
✓ Type-safe: registry provides actual class objects
✓ Extensible: easy to add new dataset types
✓ Clean, maintainable code following best practices

Changes:
--------
- NEW: braindecode/datasets/registry.py
  * register_dataset() decorator for class registration
  * get_dataset_class() for dynamic class lookup
  * get_dataset_type() for type detection without isinstance()
  * is_registered_dataset() for type checking

- MODIFIED: braindecode/datasets/base.py
  * Added @register_dataset to BaseDataset, WindowsDataset,
    EEGWindowsDataset, BaseConcatDataset

- RENAMED: hub_mixin.py → hub.py
  * Removed _ensure_classes_loaded() and __getattr__()
  * Removed all globals() manipulation
  * Updated all isinstance() checks to use registry
  * Updated class instantiations to use get_dataset_class()

- MODIFIED: braindecode/datautil/hub_formats.py
  * Replaced direct imports with registry lookups
  * Updated isinstance() checks to use get_dataset_type()
  * Updated loader functions to use get_dataset_class()

- MODIFIED: test/unit_tests/datasets/test_hub_integration.py
  * Removed outdated lazy import tests
  * Added test_registry_pattern_works()
  * Updated test_zarr_save_and_load()

All tests pass (5/5) with registry pattern implementation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@Kkuntal990
Collaborator Author

Another nice PR @Kkuntal990

a few small comments, but the main two are:

  • You need to take into account the three types of RecordDataset we have, or raise a clear NotImplementedError. Currently, only two are considered.

  • Different datasets in the same concat can have different information. Two options:

      1. Check that all the elements in the concat have the EXACT same info, and raise an error if not. This will make it easier to browse datasets through the Hub.
      2. Save the different info of each dataset contained in the concat separately.

I'm more in favour of 1.

  • Handled all 3 dataset types.
  • Raising an error if the concatenation mixes different dataset types. Added a test for that error. Is it fine for now, or do we need to handle mixed datasets? It would be a little tricky.

…d testing

- Add full RawDataset support for Hub integration (continuous data without windows)
- Add parametrized tests for WindowsDataset and EEGWindowsDataset
- Add preprocessing kwargs preservation and lazy loading tests
- Fix preprocessing kwargs restoration to set on individual datasets
- Update example to demonstrate all three dataset types
- Add comprehensive validation for mixed types, channels, and sampling frequencies
- Enhance error messages with helpful resampling guidance
- Add RawDataset tests to hub_formats module
- Total 26 tests passing (19 hub_integration + 7 hub_formats)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Kkuntal990 commented Nov 7, 2025

Hi @PierreGtch @bruAristimunha, I have made some architectural changes to the PR and left a couple of minor comments open, as I need more clarification on those.

Also added some info about the features and a comparison with Hugging Face, in case it is helpful.

Thank you for your time!

Also, sorry for the big PR. Next time I will modularize it better.

Kkuntal990 and others added 5 commits November 12, 2025 11:56
This commit fixes an ImportError raised when numcodecs is not installed, by using
a try/except at module level instead of a direct import.

Problem:
- numcodecs is an optional dependency (only with braindecode[hub])
- Direct import at module level caused ImportError for users without zarr
- This would break braindecode entirely, even for non-Hub users

Solution:
- Wrap numcodecs imports in try/except at module level
- Set NUMCODECS_AVAILABLE flag to track availability
- Check flag in _create_compressor() and raise helpful error if missing
- Apply same pattern to test file
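
A sketch of the pattern described above (the flag and function names mirror the commit message; the Blosc configuration is illustrative):

```python
try:
    import numcodecs
    NUMCODECS_AVAILABLE = True
except ImportError:
    NUMCODECS_AVAILABLE = False

def _create_compressor(level=5):
    """Build a Blosc compressor, or fail with an actionable message."""
    if not NUMCODECS_AVAILABLE:
        raise ImportError(
            "numcodecs is required for Hub support; "
            "install it with `pip install braindecode[hub]`."
        )
    return numcodecs.Blosc(cname="zstd", clevel=level)
```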

Changes in braindecode/datasets/hub.py:
- Lines 26-33: Add try/except for numcodecs imports with availability flag
- Lines 841-844: Check NUMCODECS_AVAILABLE in _create_compressor

Changes in test/unit_tests/datasets/test_hub_integration.py:
- Lines 18-26: Add try/except for zarr and numcodecs imports

Testing:
- All 33 hub tests pass
- Pre-commit checks all pass
- Module imports correctly with and without numcodecs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
The plot_hub_integration.py example requires huggingface_hub and zarr
packages to execute. Added these dependencies to the docs optional
dependencies so the example can run during documentation builds.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Added comprehensive tests for Hub integration functionality:

Priority 1 - Hub Integration (5 tests):
- test_push_to_hub_import_error: Verify ImportError when deps missing
- test_push_to_hub_success_mocked: Full push workflow with mocked API
- test_push_to_hub_upload_failure: RuntimeError on upload failure
- test_from_pretrained_import_error: ImportError for missing deps
- test_from_pretrained_404_error: FileNotFoundError for non-existent repos

Priority 2 - Error Paths (2 tests):
- test_create_compressor_zarr_not_available: zarr ImportError
- test_create_compressor_numcodecs_not_available: numcodecs ImportError

Priority 3 - Branch Coverage (2 tests):
- test_save_dataset_card_all_dataset_types: RawDataset branch testing
- test_push_to_hub_default_commit_message: Default message generation

All 33 tests passing with 0 warnings.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Replace hard-coded "test_token" strings with None in mocked unit tests.
Since these tests use fully mocked API calls, no real authentication
occurs and the token value is irrelevant. Using None eliminates
security scanner warnings and follows best practices.

Changes:
- test_push_to_hub_success_mocked: Use token=None instead of "test_token"
- Mock assertion updated to match the None token value

All 33 hub integration tests pass successfully.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Fixes dtype mismatch error on Windows where int32 columns were being
converted to int64 after Zarr save/load cycle. JSON serialization
doesn't preserve numpy/pandas dtype information, causing platform-
specific integer type defaults to take over.

Solution:
- Save metadata dtypes as separate JSON attribute during save
- Restore dtypes explicitly after loading metadata from JSON
- Applied to both WindowsDataset and EEGWindowsDataset paths

This ensures lossless round-trip preservation of metadata dtypes
across all platforms (Windows, macOS, Linux).
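
A sketch of the round-trip fix described above (assumes pandas metadata and a Zarr group `grp`; the attribute names are illustrative):

```python
import io
import pandas as pd

def save_metadata(grp, metadata: pd.DataFrame) -> None:
    # JSON does not carry numpy/pandas dtypes, so int32 can come back as int64.
    grp.attrs["metadata"] = metadata.to_json(orient="split")
    # Keep the dtypes in a separate attribute for an explicit restore on load.
    grp.attrs["metadata_dtypes"] = {c: str(t) for c, t in metadata.dtypes.items()}

def load_metadata(grp) -> pd.DataFrame:
    df = pd.read_json(io.StringIO(grp.attrs["metadata"]), orient="split")
    dtypes = grp.attrs.get("metadata_dtypes", {})
    # Casting back makes the round trip identical on Windows, macOS, and Linux.
    return df.astype({c: t for c, t in dtypes.items() if c in df.columns})
```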

Fixes: test_eegwindows_lossless_round_trip on Windows builds

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 9 comments.



Comment on lines +932 to +950
::

    from torch.utils.data import DataLoader

    # Create DataLoader for training
    train_loader = DataLoader(
        dataset,
        batch_size=32,
        shuffle=True,
        num_workers=4
    )

    # Training loop
    for X, y, metainfo in train_loader:
        # X shape: [batch_size, n_channels, n_times]
        # y shape: [batch_size]
        # metainfo shape: [batch_size, 2] (start and end indices)
        # Process your batch...

Copilot AI Nov 13, 2025


Similar RST syntax issue: mixing :: with explicit .. code-block:: directive. The :: at the end of line 932 should be removed, or the entire block should use simple literal block syntax (just :: with indentation).

Recommended fix - use markdown code fences for consistency with the rest of the README:

## Using with PyTorch DataLoader

```python
from torch.utils.data import DataLoader
...
```

Suggested change:

```python
from torch.utils.data import DataLoader

# Create DataLoader for training
train_loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4
)

# Training loop
for X, y, metainfo in train_loader:
    # X shape: [batch_size, n_channels, n_times]
    # y shape: [batch_size]
    # metainfo shape: [batch_size, 2] (start and end indices)
    # Process your batch...
```

@copilot open a new pull request to apply changes based on this feedback

PierreGtch and others added 4 commits November 13, 2025 11:03
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@bruAristimunha

okay 105 msgs in this PR, it is too much ;p

@bruAristimunha bruAristimunha enabled auto-merge (squash) November 13, 2025 16:30
@bruAristimunha bruAristimunha merged commit 712e962 into braindecode:master Nov 13, 2025
16 of 17 checks passed
@bruAristimunha

many thanks for all the work @Kkuntal990
