TL;DR
dj_gc.scan(schema, store_name=...) reports hash_referenced: 0 and schema_paths_referenced: 0 for any populated schema, regardless of how many live rows reference external storage. Every external file is therefore classified as an orphan, and a subsequent dj_gc.collect(...) would delete live data.
The bug is a contract mismatch between two helper functions in gc.py. scan_hash_references (gc.py:234) and scan_schema_references (gc.py:296) read column values via table.to_arrays(attr_name), which routes through Expression.to_dicts (expression.py:899) and runs decode_attribute (codecs.py:518) for each value. The decoded outputs (numpy arrays, NpyRef / ObjectRef lazy handles, file paths, bytes) do not satisfy _extract_*_refs's isinstance(value, dict) and "path" in value check, so both helpers silently return empty sets. No exception fires, no warning is logged.
The bug affects every storage codec (<blob@>, <npy@>, <object@>, <attach@>, <hash@>) uniformly and is independent of storage backend. It slipped past CI because tests/integration/test_gc.py:163–165 mocks scan_hash_references, scan_schema_references, and list_*_paths directly — the production code path through to_arrays → decode_attribute is never exercised end-to-end.
Affected versions
master at the time of investigation (verified at commit dde64719).
gc.py has not diverged across recent feature branches; this is longstanding behavior, not a regression in flight.
Root cause
scan_hash_references (and the schema-addressed twin) iterate every table in each passed schema and read storage-column values via table.to_arrays(attr_name):
# src/datajoint/gc.py:234 (excerpt)
for attr_name, attr in table.heading.attributes.items():
    if not _uses_hash_storage(attr):
        continue
    try:
        values = table.to_arrays(attr_name)
        for value in values:
            for path, ref_store in _extract_hash_refs(value):
                if store_name is None or ref_store == store_name:
                    referenced.add(path)
    except Exception as e:
        logger.warning(f"Error scanning {table_name}.{attr_name}: {e}")
Expression.to_arrays (src/datajoint/expression.py:989) calls to_dicts(...), which runs decode_attribute(heading[name], row[name], ...) for every attribute (expression.py:899 → codecs.py:518). For storage columns this invokes the codec's decode() method, which:
- Parses the stored JSON metadata (the dict {"hash": ..., "path": ..., "store": ..., "size": ...} per the type-system spec),
- Downloads the actual bytes from external storage,
- Returns the deserialized payload (or a lazy ref handle).
So to_arrays('payload') returns the decoded values, not the raw metadata.
_extract_hash_refs(value) (src/datajoint/gc.py:115) is designed for raw metadata:
def _extract_hash_refs(value):
    refs = []
    if value is None:
        return refs
    if isinstance(value, str):
        try:
            value = json.loads(value)
        except (json.JSONDecodeError, TypeError):
            return refs
    if isinstance(value, dict) and "path" in value:
        refs.append((value["path"], value.get("store")))
    return refs
None of the decoded codec outputs are dicts or JSON strings, and this is the documented contract for each codec, not an implementation accident:
| Codec | decode() returns | _extract_*_refs(...) | Documented at |
| --- | --- | --- | --- |
| <blob@> | numpy.ndarray (or stored object) | [] | reference/specs/type-system.md (BlobCodec example) |
| <npy@> | NpyRef lazy handle | [] | reference/specs/npy-codec.md — "When you fetch an <npy@> attribute, you get an NpyRef object" |
| <object@> | ObjectRef lazy handle | [] | reference/specs/type-system.md (ObjectCodec example) |
| <attach@> | local file path str after download | [] | how-to/use-object-storage.md |
| <hash@> | bytes | [] | reference/specs/type-system.md |
The function silently returns [] for every value. No exception, no warning, the surrounding try/except never triggers, and referenced stays empty.
Three-way reproducer
Two to_arrays call paths produce identical wrong results; the raw SQL bypass shows what _extract_hash_refs actually expects.
import datajoint as dj
import numpy as np
from datajoint import gc as dj_gc
from datajoint.gc import _extract_hash_refs
dj.config['safemode'] = False
schema = dj.Schema('test_gc_repro')
if schema.exists:
    schema.drop(prompt=False)
schema = dj.Schema('test_gc_repro')
@schema
class T(dj.Manual):
    definition = '''
    rid : int
    ---
    payload : <blob@my_store>
    '''
T.insert1((1, np.arange(64, dtype='uint8')))
# 1. User-idiomatic call:
v_user = T().to_arrays('payload')[0]
print(type(v_user).__name__, _extract_hash_refs(v_user))
# → ndarray []
# 2. GC-internal call (FreeTable via schema.get_table):
v_gc = schema.get_table('t')().to_arrays('payload')[0]
print(type(v_gc).__name__, _extract_hash_refs(v_gc))
# → ndarray []
# 3. Raw SQL bypass — what `_extract_hash_refs` is actually designed for:
raw = list(schema.connection.query(
    'SELECT payload FROM "test_gc_repro"."t" LIMIT 1'
))[0][0]
print(type(raw).__name__, _extract_hash_refs(raw))
# → dict [('_hash/test_gc_repro/<hash>', 'my_store')] ← only this works
# Public API confirms the silent failure:
stats = dj_gc.scan(schema, store_name='my_store')
print(stats['hash_referenced'], stats['hash_stored'], stats['hash_orphaned'])
# → 0 1 1 ← collect() would delete this referenced file
The same pattern triggers for <npy@> / <object@> via scan_schema_references and _extract_schema_refs.
Suggested fix
Option (1): iterate expr.cursor(as_dict=True) directly in scan_*_references and bypass decode_attribute. Smallest change, stays close to the existing code shape, produces exactly the dict / JSON-string values that _extract_*_refs was designed to consume.
The proposed fix is intentionally backend-agnostic. JSON columns come back as strings on MySQL (needing json.loads) and as already-decoded dicts on PostgreSQL (JSONB); _extract_hash_refs and _extract_schema_refs already handle both branches at gc.py:124–129 (the isinstance(value, str) → json.loads path plus the isinstance(value, dict) path). Cursor reads therefore produce the right shape across backends with no adapter dispatch.
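A minimal sketch of how the scan_hash_references loop could look under option (1). It assumes cursor(as_dict=True) yields raw stored column values as described above, and that the table expression supports a proj(attr_name) projection to narrow the read; both are assumptions to confirm against the codebase, not verified API.

# Sketch only: reads raw stored values via the cursor, never calling decode_attribute.
# The `proj` narrowing is assumed; adjust to however scan_*_references builds its query.
for attr_name, attr in table.heading.attributes.items():
    if not _uses_hash_storage(attr):
        continue
    expr = table.proj(attr_name)
    for row in expr.cursor(as_dict=True):
        # row[attr_name] is a JSON string (MySQL) or dict (PostgreSQL JSONB);
        # _extract_hash_refs already handles both shapes.
        for path, ref_store in _extract_hash_refs(row[attr_name]):
            if store_name is None or ref_store == store_name:
                referenced.add(path)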
Two alternatives, listed for completeness — option (1) is the right call:
- (2) Add decode_storage=False to Expression.to_dicts / to_arrays. Lets reference-discovery (and any future metadata-only operation) opt out of codec decode. Wider API change.
- (3) Have the Codec base class expose a get_storage_metadata(stored_value) hook that returns the path/hash without I/O. Cleanest abstraction, but it requires touching every built-in codec and asking custom-codec authors to implement it (see the sketch after this list).
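For comparison, the option (3) hook could take the shape below. The hook name and signature come from this report rather than an existing API, and HashCodec is a stand-in name for whichever built-in codec class the codebase actually defines:

import json

class Codec:
    # Hypothetical base-class hook (option 3): derive (path, store) from the raw
    # stored value with no storage I/O.
    def get_storage_metadata(self, stored_value):
        raise NotImplementedError

class HashCodec(Codec):  # stand-in name, for illustration only
    def get_storage_metadata(self, stored_value):
        # Accept both backend shapes: JSON string (MySQL) or dict (PostgreSQL JSONB).
        meta = json.loads(stored_value) if isinstance(stored_value, str) else stored_value
        return meta["path"], meta.get("store")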
Test gap and proposed regression coverage
tests/integration/test_gc.py:163–165 patches the three internal functions:
@patch('datajoint.gc.scan_hash_references')
@patch('datajoint.gc.scan_schema_references')
@patch('datajoint.gc.list_*_paths') # both list_stored_hashes and list_schema_paths
These tests are useful — they verify orchestration (scan composes references with file listings, returns the right stat keys, dry-run vs real, etc.). They should stay. What's missing is a non-mocked end-to-end test that would have caught this:
import datajoint as dj
import numpy as np
from datajoint import gc as dj_gc  # dj.gc is not exposed yet; see "Adjacent" below

def test_scan_finds_active_blob_reference():
    """scan() must report hash_referenced > 0 for a populated <blob@> column."""
    schema = dj.Schema('test_gc_e2e_blob')
    if schema.exists:
        schema.drop(prompt=False)
    schema = dj.Schema('test_gc_e2e_blob')

    @schema
    class T(dj.Manual):
        definition = '''
        rid : int
        ---
        payload : <blob@my_store>
        '''

    T.insert1((1, np.arange(64, dtype='uint8')))
    stats = dj_gc.scan(schema, store_name='my_store')
    assert stats['hash_referenced'] >= 1, (
        f"scan should find the active reference; got {stats}"
    )

def test_scan_finds_active_npy_reference():
    """Same shape, schema-addressed via <npy@>."""
    # ... <npy@> column, insert, assert schema_paths_referenced >= 1
    # (a completed sketch follows below)

def test_scan_finds_active_object_reference():
    """Same shape, schema-addressed via <object@>."""
    # ... <object@> column, insert, assert schema_paths_referenced >= 1
Three tests, one per codec family — covers the code path the mocked tests cannot reach. Happy to PR these alongside the fix if useful.
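For concreteness, here is a completed sketch of the <npy@> variant. It assumes <npy@> accepts a numpy array on insert the same way <blob@> does, and that the stat key is schema_paths_referenced as used in the TL;DR; adjust the insert shape if the codec requires something else.

import datajoint as dj
import numpy as np
from datajoint import gc as dj_gc

def test_scan_finds_active_npy_reference():
    """scan() must report schema_paths_referenced > 0 for a populated <npy@> column."""
    schema = dj.Schema('test_gc_e2e_npy')
    if schema.exists:
        schema.drop(prompt=False)
    schema = dj.Schema('test_gc_e2e_npy')

    @schema
    class T(dj.Manual):
        definition = '''
        rid : int
        ---
        payload : <npy@my_store>
        '''

    T.insert1((1, np.arange(64, dtype='uint8')))
    stats = dj_gc.scan(schema, store_name='my_store')
    assert stats['schema_paths_referenced'] >= 1, (
        f"scan should find the active reference; got {stats}"
    )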
Performance side-effect (independent of correctness)
Because the broken path runs decode_attribute, scan downloads every external blob from storage just to deserialize it, only for the silent type mismatch to discard the result. A scan against a schema with 1 TB of blobs downloads 1 TB to produce referenced: 0. With option (1), scan becomes the metadata-only operation it should be — it touches the JSON column on the database and never reaches storage during reference enumeration.
Implication for custom-codec authors
Third-party codec authors following the documented encode/decode contract (how-to/create-custom-codec.md, how-to/use-plugin-codecs.md, explanation/custom-codecs.md) will hit this same silent failure with no diagnostic. The canonical example in reference/specs/codec-api.md itself shows a ParquetCodec whose decode returns a ParquetRef object — exactly the kind of value that _extract_*_refs rejects.
Whichever fix lands needs to either:
- (a) work uniformly via raw-metadata access regardless of codec — option (1) does this — or
- (b) come with a documented codec-author contract for reference-discovery (e.g., codecs must implement a get_storage_metadata hook), updated codec docs, and a deprecation path for existing third-party codecs that don't implement it.
The current codec docs make no mention of GC or reference-discovery requirements, so authors have no way to know they need to do anything special.
Adjacent (low-severity)
The gc.py module docstring shows the usage example as:
import datajoint as dj
stats = dj.gc.scan(schema1, schema2, store_name='mystore')
This raises AttributeError: module 'datajoint' has no attribute 'gc' because gc is not registered in __init__.py's _lazy_modules. The actual working import is from datajoint import gc as dj_gc; dj_gc.scan(...). Fix: add "gc": (".gc", None) to _lazy_modules, or correct the docstring example.
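If the lazy-module route is taken, the change is one mapping entry. This sketch assumes _lazy_modules maps the public attribute name to a (relative module, symbol) tuple, which is what the "gc": (".gc", None) form implies:

# src/datajoint/__init__.py (sketch)
_lazy_modules = {
    # ... existing entries ...
    "gc": (".gc", None),  # makes dj.gc resolve, so the docstring example works
}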
Reproducer artifacts
Happy to share the standalone scripts and a screen-share-friendly notebook used during the investigation if useful (they live in a private repo at the moment but the bodies can be extracted and posted here on request).
Next steps
- Validate the suggested fix direction.
- Land option (1) plus the three regression tests above.
- Address the codec-author implication (probably easiest as a doc note: "your codec's decode return type does not affect GC; reference-discovery operates on raw stored metadata").
- (Optional) fix the dj.gc lazy-module exposure / docstring example.