Critical: dj_gc.scan reports referenced: 0 for every storage column; collect would delete live data #1442

@kushalbakshi

Description

TL;DR

dj_gc.scan(schema, store_name=...) reports hash_referenced: 0 and schema_paths_referenced: 0 for any populated schema, regardless of how many live rows reference external storage. Every external file is therefore classified as an orphan, and a subsequent dj_gc.collect(...) would delete live data.

The bug is a contract mismatch between two helper functions in gc.py. scan_hash_references (gc.py:234) and scan_schema_references (gc.py:296) read column values via table.to_arrays(attr_name), which routes through Expression.to_dicts (expression.py:899) and runs decode_attribute (codecs.py:518) for each value. The decoded outputs (numpy arrays, NpyRef / ObjectRef lazy handles, file paths, bytes) do not satisfy _extract_*_refs's isinstance(value, dict) and "path" in value check, so both helpers silently return empty sets. No exception fires, no warning is logged.

The bug affects every storage codec (<blob@>, <npy@>, <object@>, <attach@>, <hash@>) uniformly and is independent of storage backend. It slipped past CI because tests/integration/test_gc.py:163–165 mocks scan_hash_references, scan_schema_references, and list_*_paths directly — the production code path through to_arrays → decode_attribute is never exercised end-to-end.

Affected versions

  • master at the time of investigation (verified at commit dde64719).
  • gc.py has not diverged across recent feature branches; this is longstanding behavior, not a regression in flight.

Root cause

scan_hash_references and its schema-addressed twin, scan_schema_references, iterate over every table in each passed schema and read storage-column values via table.to_arrays(attr_name):

# src/datajoint/gc.py:234 (excerpt)
for attr_name, attr in table.heading.attributes.items():
    if not _uses_hash_storage(attr):
        continue
    try:
        values = table.to_arrays(attr_name)
        for value in values:
            for path, ref_store in _extract_hash_refs(value):
                if store_name is None or ref_store == store_name:
                    referenced.add(path)
    except Exception as e:
        logger.warning(f"Error scanning {table_name}.{attr_name}: {e}")

Expression.to_arrays (src/datajoint/expression.py:989) calls to_dicts(...), which runs decode_attribute(heading[name], row[name], ...) for every attribute (expression.py:899 → codecs.py:518). For storage columns this invokes the codec's decode() method, which:

  1. Parses the stored JSON metadata (the dict {"hash": ..., "path": ..., "store": ..., "size": ...} per the type-system spec),
  2. Downloads the actual bytes from external storage,
  3. Returns the deserialized payload (or a lazy ref handle).

So to_arrays('payload') returns the decoded values, not the raw metadata.
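To make the shape mismatch concrete, a minimal sketch with the two value shapes side by side (field names follow the metadata dict quoted above; the hash value is a placeholder):

import json
import numpy as np

# what the database column actually holds (raw codec metadata):
stored = json.dumps({
    "hash": "ab12...",
    "path": "_hash/test_gc_repro/ab12...",
    "store": "my_store",
    "size": 64,
})

# what to_arrays() hands to _extract_hash_refs after decode():
decoded = np.arange(64, dtype='uint8')

# json.loads(stored) -> dict with "path"    -> reference found
# decoded            -> ndarray, not a dict -> silently dropped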

_extract_hash_refs(value) (src/datajoint/gc.py:115) is designed for raw metadata:

def _extract_hash_refs(value):
    refs = []
    if value is None:
        return refs
    if isinstance(value, str):
        try:
            value = json.loads(value)
        except (json.JSONDecodeError, TypeError):
            return refs
    if isinstance(value, dict) and "path" in value:
        refs.append((value["path"], value.get("store")))
    return refs

None of the decoded codec outputs are dicts or JSON strings, and this is the documented contract for each codec, not an implementation accident:

Codec        decode() returns                      _extract_*_refs(...)   Documented at
<blob@>      numpy.ndarray (or stored object)      []                     reference/specs/type-system.md (BlobCodec example)
<npy@>       NpyRef lazy handle                    []                     reference/specs/npy-codec.md ("When you fetch an <npy@> attribute, you get an NpyRef object")
<object@>    ObjectRef lazy handle                 []                     reference/specs/type-system.md (ObjectCodec example)
<attach@>    local file path str after download    []                     how-to/use-object-storage.md
<hash@>      bytes                                 []                     reference/specs/type-system.md

The function silently returns [] for every value: no exception is raised, no warning is logged, the surrounding try/except never triggers, and referenced stays empty.
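Feeding each decoded shape from the table above into the helper makes the silent failure concrete (the dict example mirrors the raw-SQL output in the reproducer below):

import numpy as np
from datajoint.gc import _extract_hash_refs

print(_extract_hash_refs(np.arange(64, dtype='uint8')))   # ndarray            -> []
print(_extract_hash_refs(b'\x00\x01'))                    # bytes              -> []
print(_extract_hash_refs('/tmp/attach/file.dat'))         # path str, not JSON -> []
print(_extract_hash_refs({'path': '_hash/s/ab12', 'store': 'my_store'}))
# raw metadata dict -> [('_hash/s/ab12', 'my_store')]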

Three-way reproducer

Two to_arrays call paths produce identical wrong results; the raw SQL bypass shows what _extract_hash_refs actually expects.

import datajoint as dj
import numpy as np
from datajoint import gc as dj_gc
from datajoint.gc import _extract_hash_refs

dj.config['safemode'] = False

schema = dj.Schema('test_gc_repro')
if schema.exists:
    schema.drop(prompt=False)
schema = dj.Schema('test_gc_repro')

@schema
class T(dj.Manual):
    definition = '''
    rid : int
    ---
    payload : <blob@my_store>
    '''

T.insert1((1, np.arange(64, dtype='uint8')))

# 1. User-idiomatic call:
v_user = T().to_arrays('payload')[0]
print(type(v_user).__name__, _extract_hash_refs(v_user))
# → ndarray  []

# 2. GC-internal call (FreeTable via schema.get_table):
v_gc = schema.get_table('t')().to_arrays('payload')[0]
print(type(v_gc).__name__, _extract_hash_refs(v_gc))
# → ndarray  []

# 3. Raw SQL bypass — what `_extract_hash_refs` is actually designed for:
raw = list(schema.connection.query(
    'SELECT payload FROM "test_gc_repro"."t" LIMIT 1'
))[0][0]
print(type(raw).__name__, _extract_hash_refs(raw))
# → dict  [('_hash/test_gc_repro/<hash>', 'my_store')]   ← only this works

# Public API confirms the silent failure:
stats = dj_gc.scan(schema, store_name='my_store')
print(stats['hash_referenced'], stats['hash_stored'], stats['hash_orphaned'])
# → 0 1 1   ← collect() would delete this referenced file

The same pattern triggers for <npy@> / <object@> via scan_schema_references and _extract_schema_refs.
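For completeness, a hedged sketch of the schema-addressed variant, continuing the session above (the NpyRef type name comes from the codec table; the empty result from _extract_schema_refs is inferred from its parallel dict-shape check, not verified output):

from datajoint.gc import _extract_schema_refs

@schema
class N(dj.Manual):
    definition = '''
    rid : int
    ---
    arr : <npy@my_store>
    '''

N.insert1((1, np.arange(8)))

v = N().to_arrays('arr')[0]
print(type(v).__name__, _extract_schema_refs(v))
# expected: NpyRef  []   (same silent rejection on the schema-addressed path)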

Suggested fix

Option (1): iterate expr.cursor(as_dict=True) directly in scan_*_references and bypass decode_attribute. Smallest change, stays close to the existing code shape, produces exactly the dict / JSON-string values that _extract_*_refs was designed to consume.

The proposed fix is intentionally backend-agnostic. JSON columns come back as strings on MySQL (needing json.loads) and as already-decoded dicts on PostgreSQL (JSONB); _extract_hash_refs and _extract_schema_refs already handle both branches at gc.py:124–129 (the isinstance(value, str) → json.loads path plus the isinstance(value, dict) path). Cursor reads therefore produce the right shape across backends with no adapter dispatch.
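A minimal sketch of option (1), assuming expr.cursor(as_dict=True) yields raw row dicts as proposed; the attribute iteration mirrors the excerpt above, and _iter_tables is a hypothetical stand-in for however gc.py walks a schema's tables:

def scan_hash_references(schemas, store_name=None):
    """Collect referenced paths from raw stored metadata: no decode, no storage I/O."""
    referenced = set()
    for schema in schemas:
        for table_name, table in _iter_tables(schema):   # hypothetical iteration helper
            for attr_name, attr in table.heading.attributes.items():
                if not _uses_hash_storage(attr):
                    continue
                # project down to the storage column and read the cursor directly,
                # bypassing Expression.to_dicts / decode_attribute entirely
                for row in table.proj(attr_name).cursor(as_dict=True):
                    for path, ref_store in _extract_hash_refs(row[attr_name]):
                        if store_name is None or ref_store == store_name:
                            referenced.add(path)
    return referenced

The same reshaping applies to scan_schema_references with _extract_schema_refs.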

Two alternatives, listed for completeness — option (1) is the right call:

  • (2) Add decode_storage=False to Expression.to_dicts / to_arrays. Lets reference-discovery (and any future metadata-only operation) opt out of codec decode. Wider API change.
  • (3) Have the Codec base class expose a get_storage_metadata(stored_value) hook that returns the path/hash without I/O. Cleanest abstraction (sketched below) but requires touching every built-in codec and asking custom-codec authors to implement it.
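For concreteness, a minimal sketch of what the option (3) hook could look like; the Codec base-class shape and hook name are the proposal above, not existing API:

import json

class Codec:
    def get_storage_metadata(self, stored_value):
        """Return the raw {"path": ..., "store": ...} metadata without any I/O.

        Default covers the built-in JSON-metadata shape; a custom codec that
        stores references differently would override this.
        """
        if isinstance(stored_value, str):
            try:
                stored_value = json.loads(stored_value)
            except json.JSONDecodeError:
                return None
        return stored_value if isinstance(stored_value, dict) else None

scan_*_references would then call codec.get_storage_metadata(raw_value) and feed the result to _extract_*_refs, keeping reference discovery metadata-only.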

Test gap and proposed regression coverage

tests/integration/test_gc.py:163–165 patches the three internal functions:

@patch('datajoint.gc.scan_hash_references')
@patch('datajoint.gc.scan_schema_references')
@patch('datajoint.gc.list_*_paths')   # both list_stored_hashes and list_schema_paths

These tests are useful — they verify orchestration (scan composes references with file listings, returns the right stat keys, dry-run vs real, etc.). They should stay. What's missing is a non-mocked end-to-end test that would have caught this:

import numpy as np
import datajoint as dj
from datajoint import gc as dj_gc


def test_scan_finds_active_blob_reference():
    """scan() must report hash_referenced > 0 for a populated <blob@> column."""
    schema = dj.Schema('test_gc_e2e_blob')
    if schema.exists:
        schema.drop(prompt=False)
    schema = dj.Schema('test_gc_e2e_blob')

    @schema
    class T(dj.Manual):
        definition = '''
        rid : int
        ---
        payload : <blob@my_store>
        '''

    T.insert1((1, np.arange(64, dtype='uint8')))

    stats = dj_gc.scan(schema, store_name='my_store')
    assert stats['hash_referenced'] >= 1, (
        f"scan should find the active reference; got {stats}"
    )


def test_scan_finds_active_npy_reference():
    """Same shape, schema-addressed via <npy@>."""
    # ... <npy@> column, insert, assert schema_paths_referenced >= 1


def test_scan_finds_active_object_reference():
    """Same shape, schema-addressed via <object@>."""
    # ... <object@> column, insert, assert schema_paths_referenced >= 1

Three tests, one per codec family — covers the code path the mocked tests cannot reach. Happy to PR these alongside the fix if useful.

Performance side-effect (independent of correctness)

Because the broken path runs decode_attribute, scan downloads every external blob from storage only to deserialize it and discard the result through the silent type mismatch. A scan against a schema with 1 TB of blobs downloads 1 TB to produce referenced: 0. With option (1), scan becomes a metadata-only operation as it should be: it touches only the JSON metadata column in the database and never reaches storage during reference enumeration.

Implication for custom-codec authors

Third-party codec authors following the documented encode/decode contract (how-to/create-custom-codec.md, how-to/use-plugin-codecs.md, explanation/custom-codecs.md) will hit this same silent failure with no diagnostic. The canonical example in reference/specs/codec-api.md itself shows a ParquetCodec whose decode returns a ParquetRef object — exactly the kind of value that _extract_*_refs rejects.

Whichever fix lands needs to either:

  • (a) work uniformly via raw-metadata access regardless of codec — option (1) does this — or
  • (b) come with a documented codec-author contract for reference-discovery (e.g., codecs must implement a get_storage_metadata hook), updated codec docs, and a deprecation path for existing third-party codecs that don't implement it.

The current codec docs make no mention of GC or reference-discovery requirements, so authors have no way to know they need to do anything special.

Adjacent (low-severity)

The gc.py module docstring shows the usage example as:

import datajoint as dj
stats = dj.gc.scan(schema1, schema2, store_name='mystore')

This raises AttributeError: module 'datajoint' has no attribute 'gc' because gc is not registered in __init__.py's _lazy_modules. The actual working import is from datajoint import gc as dj_gc; dj_gc.scan(...). Fix: add "gc": (".gc", None) to _lazy_modules, or correct the docstring example.
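A minimal sketch of the registration fix, assuming _lazy_modules maps an attribute name to a (module path, attribute) tuple as the suggested entry implies:

# src/datajoint/__init__.py (sketch)
_lazy_modules = {
    # ... existing entries ...
    "gc": (".gc", None),   # None: expose the submodule itself, not an attribute of it
}

With that entry in place, the module-docstring example (dj.gc.scan(...)) resolves as written.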

Reproducer artifacts

Happy to share the standalone scripts and a screen-share-friendly notebook used during the investigation if useful (they live in a private repo at the moment but the bodies can be extracted and posted here on request).

Next steps

  1. Validate the suggested fix direction.
  2. Land option (1) plus the three regression tests above.
  3. Address the codec-author implication (probably easiest as a doc note: "your codec's decode return type does not affect GC; reference-discovery operates on raw stored metadata").
  4. (Optional) fix the dj.gc lazy-module exposure / docstring example.
