
Conversation


@Kkuntal990 Kkuntal990 commented Nov 5, 2025

Summary

Adds Hugging Face Hub integration to braindecode, enabling easy sharing, versioning, and collaboration on EEG datasets. Datasets can be uploaded to and downloaded from the Hub using an optimized Zarr format.

Supports three dataset types:

  • WindowsDataset - Epoched data (mne.Epochs-based)
  • EEGWindowsDataset - Continuous raw data with windowing metadata
  • RawDataset - Continuous raw data without windowing

Key Features

Hub Integration:

  • push_to_hub() - Upload datasets with one command
  • from_pretrained() - Download shared datasets
  • Automatic README generation with metadata and usage info
  • Private/public repositories with version control
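
A minimal usage sketch of the two methods above, assuming a `BaseConcatDataset` called `windows_dataset` and a Hub repository id you control (exact keyword arguments may differ from the merged API):

```python
from braindecode.datasets import BaseConcatDataset

# Upload: converts the dataset to Zarr and pushes it to the Hub.
# The repo id and keyword arguments below are illustrative.
windows_dataset.push_to_hub(
    repo_id="username/my-eeg-dataset",
    private=True,
    commit_message="Initial upload",
)

# Download: fetches the Zarr store and rebuilds the dataset.
dataset = BaseConcatDataset.from_pretrained("username/my-eeg-dataset")
```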

Zarr Format (Optimized for Training):

  • Fastest random access among the benchmarked formats: 0.010 ms (critical for PyTorch DataLoader)
  • ~23% size reduction with blosc compression
  • Lossless float32 precision storage
  • Explicit dtype specification for clarity
  • Preserves all MNE metadata, channel info, and preprocessing history
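
To illustrate why Zarr suits this use case, here is a rough sketch of storing windowed signals with Blosc compression and an explicit float32 dtype (zarr 2.x API; the array name and chunking are assumptions, not the exact on-disk layout used by this PR):

```python
import numpy as np
import zarr
from numcodecs import Blosc

X = np.random.randn(100, 22, 1000).astype(np.float32)  # windows x channels x times

root = zarr.open_group("example_dataset.zarr", mode="w")
root.create_dataset(
    "signals",
    data=X,
    chunks=(1, 22, 1000),  # one window per chunk keeps random access fast
    compressor=Blosc(cname="zstd", clevel=5, shuffle=Blosc.SHUFFLE),
    dtype="float32",
)
print(root["signals"].shape, root["signals"].dtype)
```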

Note: Hub uses Zarr exclusively. Braindecode already supports local .fif format via dataset.save() and load_concat_dataset().
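
For comparison, the existing local workflow mentioned in the note above looks roughly like this (paths are illustrative):

```python
from braindecode.datautil import load_concat_dataset

# Save locally in .fif format.
dataset.save("./my_dataset", overwrite=True)

# Reload later without preloading the data into memory.
dataset = load_concat_dataset("./my_dataset", preload=False)
```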

Architecture

Registry Pattern - Eliminates circular imports by using a central registry where dataset classes register themselves. Both hub.py and base.py can access the registry without importing each other.
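
A minimal sketch of the registry idea (simplified; the real module is braindecode/datasets/registry.py and its helper names may differ slightly):

```python
_DATASET_REGISTRY = {}

def register_dataset(cls):
    """Class decorator: record the class under its name at definition time."""
    _DATASET_REGISTRY[cls.__name__] = cls
    return cls

def get_dataset_class(name):
    """Look up a registered dataset class without importing base.py."""
    return _DATASET_REGISTRY[name]

# In base.py, dataset classes register themselves when they are defined:
@register_dataset
class WindowsDataset:  # stand-in for the real class
    pass

# hub.py can now resolve classes by name without importing base.py:
windows_cls = get_dataset_class("WindowsDataset")
```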

Validation System:

  • Enforces dataset type consistency (no mixing types)
  • Checks that channel names and sampling frequency are uniform across all datasets in the concat
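
Roughly, the uniformity check behaves like the sketch below (illustrative only; the real logic lives in hub.py):

```python
def validate_uniform_info(ch_names_per_ds, sfreq_per_ds):
    """Raise ValueError unless all datasets share channels and sampling rate."""
    if len({tuple(chs) for chs in ch_names_per_ds}) > 1:
        raise ValueError(
            "Datasets in the concat have different channel names; "
            "pick a common channel set before pushing to the Hub."
        )
    if len(set(sfreq_per_ds)) > 1:
        raise ValueError(
            "Datasets in the concat have different sampling frequencies; "
            "resample to a common rate before pushing to the Hub."
        )

# Matching recordings pass silently; a mismatch raises with guidance.
validate_uniform_info([["C3", "C4"], ["C3", "C4"]], [250.0, 250.0])
```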

Integration with PR #813:

  • Compatible with improved preprocessor serialization
  • Preprocessing kwargs properly initialized on all dataset types

Changes

Modified Files:

  • braindecode/datasets/base.py - Registry decorators, preprocessing kwargs init
  • braindecode/datasets/hub.py - Core Hub logic, validation, dataset card
  • braindecode/datautil/hub_formats.py - Delegation pattern (67% duplication eliminated)
  • braindecode/datasets/registry.py - Registry pattern (NEW)
  • examples/datasets_io/plot_hub_integration.py - Comprehensive example

Tests:

  • test/unit_tests/datasets/test_hub_integration.py - 19 tests
  • test/unit_tests/datautil/test_hub_formats.py - 9 tests
  • Total: 28 tests, all passing ✅, 0 warnings

Test Coverage: Hub methods, zarr round-trip (all 3 types), validation, registry, overwrite

Live Demo Datasets

Successfully uploaded and tested (post upstream/master merge):

All verified for download, data integrity, and PyTorch DataLoader compatibility.

Checklist

  • All tests passing (28/28, 0 warnings)
  • Documentation updated (examples, docstrings)
  • No breaking changes to existing API
  • Error handling with helpful messages
  • Registry pattern eliminates circular imports
  • All three dataset types supported
  • Float32 precision with explicit dtype
  • Compatible with PR #813 (Improve Preprocessor serialisation)
  • Production ready (live datasets tested post-merge)

- Implemented Hub integration for braindecode datasets
- Added converters for HDF5, Zarr, NumPy+Parquet formats
- Created HubDatasetMixin with push_to_hub() and from_pretrained()
- Added comprehensive benchmarking script comparing all formats + FIF
- Benchmark configuration: 1000 subjects, size-optimized weights
- Includes documentation and test files
- Remove format parameter from push_to_hub(), use Zarr exclusively
- Remove HDF5 and NPZ_Parquet converters from hub_formats.py
- Reduce codebase from 1050 to 465 lines (~56% reduction)
- Update from_pretrained() to only support Zarr format
- Update documentation and examples
- Remove benchmark and test development files
- Add development files to .gitignore

Based on comprehensive benchmarking, Zarr provides optimal performance:
- Fastest random access: 0.010 ms (critical for PyTorch training)
- Fast save/load: 0.46s / 0.12s
- Good compression: ~23% size reduction with blosc
- Add 26 unit tests covering all Hub integration functionality
- Fix BaseConcatDataset to inherit from HubDatasetMixin
- Remove unreachable empty dataset validation
- Update README template default format reference
- Remove development documentation files (HUGGINGFACE_HUB_INTEGRATION.md, IMPLEMENTATION_STATUS.md)
- Update .gitignore to exclude development/benchmark files

Tests cover:
- Zarr format save/load (windowed and raw datasets)
- Compression options (blosc, different levels, uncompressed)
- Overwrite and partial loading functionality
- Preprocessing kwargs preservation
- Hub mixin availability and error handling
- Mock Hub uploads/downloads
- Channel info preservation

All 26 tests passing successfully.
@Kkuntal990 Kkuntal990 marked this pull request as draft November 5, 2025 04:36
Development files have been moved to ../braindecode_hub_development/
and are no longer in the repository.
@Kkuntal990 Kkuntal990 marked this pull request as ready for review November 5, 2025 04:46
@bruAristimunha
Collaborator

@PierreGtch, can you take a quick look please?

@bruAristimunha
Collaborator

Nice first round


@PierreGtch PierreGtch left a comment


Another nice PR @Kkuntal990

a few small comments, but the main two are:

  • You need to take into account the three types of RecordDataset we have, or raise a clear NotImplementedError. Currently, only two are considered.
  • Different datasets in the same concat can have different information. Two options:
      1. Check that all the elements in the concat have the EXACT same info, and raise an error if not. This will make it easier to browse datasets through the Hub.
      2. Save the different info of each dataset contained in the concat separately.

I'm more in favour of 1.

Kkuntal990 and others added 2 commits November 6, 2025 07:27
Code Standards:
- Change logger variable from 'logger' to 'log' (braindecode standard)
- Update circular import comments to '# Prevent circular import'
- Fix README API examples to show X, y, metainfo pattern
- Remove irrelevant library citation from dataset card

PR Feedback Addressed:
- Move imports to global scope (keep lazy where needed for circular imports)
- Replace all print() statements with log.info() for proper logging
- Update import error messages to suggest 'braindecode[hub]'
- Add braindecode.__version__ to format_info.json for reproducibility
- Implement proper dataset type detection for all 3 types:
  * BaseDataset (continuous) - raises NotImplementedError
  * EEGWindowsDataset (windowed with Raw)
  * WindowsDataset (windowed with Epochs)
- Add dataset uniformity validation in get_format_info()
  * Validates all datasets have same channels and sampling frequency
  * Raises ValueError if inconsistent
- Replace tempfile.TemporaryDirectory with pytest tmp_path fixtures

Test Improvements:
- Fix mock patching to use braindecode.datasets.hub_mixin.snapshot_download
- Update error type expectations to RuntimeError (wrapping internal errors)
- Remove tests for unsupported BaseDataset (continuous raw data)
- Update format_info assertions (removed is_windowed field)

All 24 tests passing successfully.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…r imports

Problem:
--------
The Hub integration code had circular import dependencies:
- base.py imports HubDatasetMixin from hub_mixin.py
- hub_mixin.py methods need dataset classes from base.py for isinstance() checks
- This created a circular dependency: base.py → hub_mixin.py → base.py

Previous workaround used lazy imports with _ensure_classes_loaded() and
module-level __getattr__(), which was complex and non-standard.

Solution:
---------
Implemented the Registry pattern to decouple modules:
1. Created registry.py - a central registry with zero dependencies
2. Dataset classes register themselves on definition using @register_dataset
3. Hub methods look up classes dynamically via get_dataset_class()
4. Type checking uses get_dataset_type() instead of direct isinstance()

Benefits:
---------
✓ Eliminates circular imports completely
✓ All imports at module level (PEP 8 compliant)
✓ No lazy imports or globals() manipulation
✓ Type-safe: registry provides actual class objects
✓ Extensible: easy to add new dataset types
✓ Clean, maintainable code following best practices

Changes:
--------
- NEW: braindecode/datasets/registry.py
  * register_dataset() decorator for class registration
  * get_dataset_class() for dynamic class lookup
  * get_dataset_type() for type detection without isinstance()
  * is_registered_dataset() for type checking

- MODIFIED: braindecode/datasets/base.py
  * Added @register_dataset to BaseDataset, WindowsDataset,
    EEGWindowsDataset, BaseConcatDataset

- RENAMED: hub_mixin.py → hub.py
  * Removed _ensure_classes_loaded() and __getattr__()
  * Removed all globals() manipulation
  * Updated all isinstance() checks to use registry
  * Updated class instantiations to use get_dataset_class()

- MODIFIED: braindecode/datautil/hub_formats.py
  * Replaced direct imports with registry lookups
  * Updated isinstance() checks to use get_dataset_type()
  * Updated loader functions to use get_dataset_class()

- MODIFIED: test/unit_tests/datasets/test_hub_integration.py
  * Removed outdated lazy import tests
  * Added test_registry_pattern_works()
  * Updated test_zarr_save_and_load()

All tests pass (5/5) with registry pattern implementation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@Kkuntal990
Collaborator Author

Another nice PR @Kkuntal990

a few small comments, but the main two are:

  • You need to take into account the three types of RecordDataset we have, or raise a clear NotImplementedError. Currently, only two are considered.

  • Different datasets in the same concat can have different information. Two options:

      1. Check that all the elements in the concat have the EXACT same info, and raise an error if not. This will make it easier to browse datasets through the Hub.
      2. Save the different info of each dataset contained in the concat separately.

I'm more in favour of 1.

  • Handled all 3 dataset types.
  • Raising an error if the concatenation mixes different dataset types. Added a test for that error. Is it fine for now, or do we need to handle mixed datasets? It would be a little tricky.

…d testing

- Add full RawDataset support for Hub integration (continuous data without windows)
- Add parametrized tests for WindowsDataset and EEGWindowsDataset
- Add preprocessing kwargs preservation and lazy loading tests
- Fix preprocessing kwargs restoration to set on individual datasets
- Update example to demonstrate all three dataset types
- Add comprehensive validation for mixed types, channels, and sampling frequencies
- Enhance error messages with helpful resampling guidance
- Add RawDataset tests to hub_formats module
- Total 26 tests passing (19 hub_integration + 7 hub_formats)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Kkuntal990 commented Nov 7, 2025

Hi @PierreGtch @bruAristimunha, I have made some architectural changes to the PR and left a couple of minor comments open, as I need more clarification on those.

Also added some info about the features and a comparison with Hugging Face, in case it is helpful.

Thank you for your time!

Also, sorry for the big PR. Next time I will modularize it better.

Kkuntal990 and others added 5 commits November 12, 2025 11:56
This commit fixes an ImportError raised when numcodecs is not installed, by using
a try/except at module level instead of a direct import.

Problem:
- numcodecs is an optional dependency (only with braindecode[hub])
- Direct import at module level caused ImportError for users without zarr
- This would break braindecode entirely, even for non-Hub users

Solution:
- Wrap numcodecs imports in try/except at module level
- Set NUMCODECS_AVAILABLE flag to track availability
- Check flag in _create_compressor() and raise helpful error if missing
- Apply same pattern to test file
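
A sketch of the pattern described above (the flag and function names mirror the commit message; the Blosc configuration is illustrative):

```python
try:
    import numcodecs
    NUMCODECS_AVAILABLE = True
except ImportError:
    NUMCODECS_AVAILABLE = False

def _create_compressor(level=5):
    """Build a Blosc compressor, or fail with an actionable message."""
    if not NUMCODECS_AVAILABLE:
        raise ImportError(
            "numcodecs is required for Hub support; "
            "install it with `pip install braindecode[hub]`."
        )
    return numcodecs.Blosc(cname="zstd", clevel=level)
```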

Changes in braindecode/datasets/hub.py:
- Lines 26-33: Add try/except for numcodecs imports with availability flag
- Lines 841-844: Check NUMCODECS_AVAILABLE in _create_compressor

Changes in test/unit_tests/datasets/test_hub_integration.py:
- Lines 18-26: Add try/except for zarr and numcodecs imports

Testing:
- All 33 hub tests pass
- Pre-commit checks all pass
- Module imports correctly with and without numcodecs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
The plot_hub_integration.py example requires huggingface_hub and zarr
packages to execute. Added these dependencies to the docs optional
dependencies so the example can run during documentation builds.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Added comprehensive tests for Hub integration functionality:

Priority 1 - Hub Integration (5 tests):
- test_push_to_hub_import_error: Verify ImportError when deps missing
- test_push_to_hub_success_mocked: Full push workflow with mocked API
- test_push_to_hub_upload_failure: RuntimeError on upload failure
- test_from_pretrained_import_error: ImportError for missing deps
- test_from_pretrained_404_error: FileNotFoundError for non-existent repos

Priority 2 - Error Paths (2 tests):
- test_create_compressor_zarr_not_available: zarr ImportError
- test_create_compressor_numcodecs_not_available: numcodecs ImportError

Priority 3 - Branch Coverage (2 tests):
- test_save_dataset_card_all_dataset_types: RawDataset branch testing
- test_push_to_hub_default_commit_message: Default message generation

All 33 tests passing with 0 warnings.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Replace hard-coded "test_token" strings with None in mocked unit tests.
Since these tests use fully mocked API calls, no real authentication
occurs and the token value is irrelevant. Using None eliminates
security scanner warnings and follows best practices.

Changes:
- test_push_to_hub_success_mocked: Use token=None instead of "test_token"
- Mock assertion updated to match the None token value

All 33 hub integration tests pass successfully.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Fixes dtype mismatch error on Windows where int32 columns were being
converted to int64 after Zarr save/load cycle. JSON serialization
doesn't preserve numpy/pandas dtype information, causing platform-
specific integer type defaults to take over.

Solution:
- Save metadata dtypes as separate JSON attribute during save
- Restore dtypes explicitly after loading metadata from JSON
- Applied to both WindowsDataset and EEGWindowsDataset paths

This ensures lossless round-trip preservation of metadata dtypes
across all platforms (Windows, macOS, Linux).
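
A sketch of the round-trip fix described above (assumes pandas metadata and a Zarr group `grp`; the attribute names are illustrative):

```python
import io
import pandas as pd

def save_metadata(grp, metadata: pd.DataFrame) -> None:
    # JSON does not carry numpy/pandas dtypes, so int32 can come back as int64.
    grp.attrs["metadata"] = metadata.to_json(orient="split")
    # Keep the dtypes in a separate attribute for an explicit restore on load.
    grp.attrs["metadata_dtypes"] = {c: str(t) for c, t in metadata.dtypes.items()}

def load_metadata(grp) -> pd.DataFrame:
    df = pd.read_json(io.StringIO(grp.attrs["metadata"]), orient="split")
    dtypes = grp.attrs.get("metadata_dtypes", {})
    # Casting back makes the round trip identical on Windows, macOS, and Linux.
    return df.astype({c: t for c, t in dtypes.items() if c in df.columns})
```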

Fixes: test_eegwindows_lossless_round_trip on Windows builds

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 9 comments.



Comment on lines +932 to +950
::

    from torch.utils.data import DataLoader

    # Create DataLoader for training
    train_loader = DataLoader(
        dataset,
        batch_size=32,
        shuffle=True,
        num_workers=4
    )

    # Training loop
    for X, y, metainfo in train_loader:
        # X shape: [batch_size, n_channels, n_times]
        # y shape: [batch_size]
        # metainfo shape: [batch_size, 2] (start and end indices)
        # Process your batch...

Copilot AI Nov 13, 2025


Similar RST syntax issue: mixing :: with explicit .. code-block:: directive. The :: at the end of line 932 should be removed, or the entire block should use simple literal block syntax (just :: with indentation).

Recommended fix - use markdown code fences for consistency with the rest of the README:

## Using with PyTorch DataLoader

```python
from torch.utils.data import DataLoader
...
```

Suggested change:

```python
from torch.utils.data import DataLoader

# Create DataLoader for training
train_loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4
)

# Training loop
for X, y, metainfo in train_loader:
    # X shape: [batch_size, n_channels, n_times]
    # y shape: [batch_size]
    # metainfo shape: [batch_size, 2] (start and end indices)
    # Process your batch...
```

@copilot open a new pull request to apply changes based on this feedback

PierreGtch and others added 4 commits November 13, 2025 11:03
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@bruAristimunha

okay 105 msgs in this PR, it is too much ;p

@bruAristimunha bruAristimunha enabled auto-merge (squash) November 13, 2025 16:30
@bruAristimunha bruAristimunha merged commit 712e962 into braindecode:master Nov 13, 2025
16 of 17 checks passed
@bruAristimunha

many thanks for all the work @Kkuntal990
