Skip to content

Unify built-in storage protocols through StorageAdapter (Alt B from #1432) #1440

@dimitri-yatsenko

Description

@dimitri-yatsenko

Background

PR #1432 introduces a StorageAdapter ABC + entry-point plugin system for third-party storage protocols. The four built-in protocols (file, s3, gcs, azure) remain hardcoded in storage.py, producing a deliberate two-tier pattern: extension authors go through the ABC; built-ins keep their if self.protocol == "..." branches.

This issue tracks the follow-up work to collapse that two-tier pattern by routing all protocols through StorageAdapter, so built-ins and third-party adapters become first-class citizens of the same code path.

The user-facing API (dj.config.stores) does not change.

Scope

storage.py on master has ~22 protocol-conditional branches: 12 in setup methods (already addressable by the ABC drafted in #1432), and 10 in the I/O methods that are not yet addressable through the current four-method interface. This issue is about the I/O 10.

Method Line Branch nature Design question
put_file 470 safe_copy (file) vs fsspec.put_file (clouds) Atomicity contract
put_buffer 518 safe_write (file) vs fsspec.pipe_file (clouds) Atomicity contract
get_file 497 pathlib (file) vs fsspec.get_file (clouds)
get_buffer 549 open() (file) vs fsspec.cat_file (clouds)
remove 604 Path.unlink (file) vs fsspec.rm (clouds)
open 651 pre-write mkdir (file only)
put_folder 697 shutil.copytree (file) vs fsspec.put recursive (clouds) Recursive-op semantics
remove_folder 737 shutil.rmtree (file) vs fsspec.rm recursive (clouds) Recursive-op semantics
copy_from_url 796 mkdir + copy (file) vs fsspec.pipe (clouds)
copy_folder_from_url 847 mkdir (file) vs no-op (clouds)

Gating design questions

These must be resolved before any I/O branch is moved into an adapter. They are not blockers for #1432 itself.

  1. Atomic-write semantics. safe_write and safe_copy (utils.py:131,153) write to a .saving/.copying temp file and os.rename to the final path — a documented atomicity guarantee. fsspec.LocalFileSystem.put_file/pipe_file write directly. A LocalFileAdapter must preserve this contract; either by reimplementing the temp-file-then-rename pattern in the adapter, or by deciding (with explicit user-facing notice) that DataJoint no longer guarantees atomicity for the local backend.

  2. Recursive-op semantics. shutil.copytree and shutil.rmtree have specific behaviors around symlinks, partial failures, and metadata that fsspec.put recursive / fsspec.rm do not match 1:1. Document the divergences and decide whether to preserve shutil semantics for LocalFileAdapter or accept the fsspec semantics.

  3. Interface shape. Open question whether each I/O operation gets its own adapter method, or whether the adapter exposes a smaller surface (e.g. a local_atomic_writer hook that only file-protocol adapters override, with everything else delegating to the fsspec defaults). The 10 branches above are not all equally weighty — open (pre-write mkdir) and copy_folder_from_url (mkdir vs no-op) are nearly trivial; put_file/put_buffer/put_folder are the load-bearing ones.

Proposed phasing

Each phase is a separate, reviewable PR.

  • Phase 0 — test scaffolding. Add per-protocol unit tests for the four built-ins covering _create_filesystem, _full_path, get_url, and each I/O method. This is the prerequisite that makes the rest safe — currently only test_get_store_spec_file_protocol exercises StorageBackend per-protocol logic; everything else relies on integration tests, which don't typically stress atomicity or recursive-op edge cases. Block all subsequent phases on Phase 0.
  • Phase 1 — setup-layer migration. Route all four built-ins through StorageAdapter.create_filesystem, validate_spec, full_path, get_url. Eliminates ~12 setup branches. No I/O changes; design questions above are not yet engaged.
  • Phase 2 — design decision on atomicity + recursive ops. Write up the design proposal (a short doc or a discussion in this issue) addressing the gating questions. Get sign-off before coding.
  • Phase 3 — I/O migration. One protocol at a time (files3gcsazure), each in its own PR with the Phase-0 test suite as the regression backstop.

Non-goals

  • Changing dj.config.stores user-facing API.
  • Deprecating or removing fsspec; the adapters wrap it.
  • Adding new built-in protocols beyond the existing four.

Acceptance criteria

  • storage.py contains zero if self.protocol == "..." branches.
  • All four built-in adapters (FileAdapter, S3Adapter, GCSAdapter, AzureAdapter) are registered at import time and discoverable via get_storage_adapter(protocol).
  • Existing integration test suite passes unchanged.
  • New per-protocol unit tests exist (delivered in Phase 0) and cover atomicity + recursive-op semantics for the local backend.
  • Atomicity contract on the local backend is either preserved or, if changed, documented in release notes.

Tracking

This is the architectural follow-up to #1432. No target release is committed until the Phase 2 design discussion concludes — sequencing is against other work, not a fixed date.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions