Critical: dj_gc.scan reports referenced: 0 for every storage column; collect would delete live data #1442

@kushalbakshi

Description

TL;DR

dj_gc.scan(schema, store_name=...) reports hash_referenced: 0 and schema_paths_referenced: 0 for any populated schema, regardless of how many live rows reference external storage. Every external file is therefore classified as an orphan, and a subsequent dj_gc.collect(...) would delete live data.

The bug is a contract mismatch between two helper functions in gc.py. scan_hash_references (gc.py:234) and scan_schema_references (gc.py:296) read column values via table.to_arrays(attr_name), which routes through Expression.to_dicts (expression.py:899) and runs decode_attribute (codecs.py:518) for each value. The decoded outputs (numpy arrays, NpyRef / ObjectRef lazy handles, file paths, bytes) do not satisfy _extract_*_refs's isinstance(value, dict) and "path" in value check, so both helpers silently return empty sets. No exception fires, no warning is logged.

The bug affects every storage codec (<blob@>, <npy@>, <object@>, <attach@>, <hash@>) uniformly and is independent of storage backend. It slipped past CI because tests/integration/test_gc.py:163–165 mocks scan_hash_references, scan_schema_references, and list_*_paths directly — the production code path through to_arrays → decode_attribute is never exercised end-to-end.

Affected versions

  • master at the time of investigation (verified at commit dde64719).
  • gc.py has not diverged across recent feature branches; this is longstanding behavior, not a regression in flight.

Root cause

scan_hash_references and its schema-addressed twin, scan_schema_references, iterate over every table in each passed schema and read storage-column values via table.to_arrays(attr_name):

# src/datajoint/gc.py:234 (excerpt)
for attr_name, attr in table.heading.attributes.items():
    if not _uses_hash_storage(attr):
        continue
    try:
        values = table.to_arrays(attr_name)
        for value in values:
            for path, ref_store in _extract_hash_refs(value):
                if store_name is None or ref_store == store_name:
                    referenced.add(path)
    except Exception as e:
        logger.warning(f"Error scanning {table_name}.{attr_name}: {e}")

Expression.to_arrays (src/datajoint/expression.py:989) calls to_dicts(...), which runs decode_attribute(heading[name], row[name], ...) for every attribute (expression.py:899 → codecs.py:518). For storage columns this invokes the codec's decode() method, which:

  1. Parses the stored JSON metadata (the dict {"hash": ..., "path": ..., "store": ..., "size": ...} per the type-system spec),
  2. Downloads the actual bytes from external storage,
  3. Returns the deserialized payload (or a lazy ref handle).

So to_arrays('payload') returns the decoded values, not the raw metadata.
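To make the shape mismatch concrete, a minimal sketch with the two value shapes side by side (field names follow the metadata dict quoted above; the hash value is a placeholder):

import json
import numpy as np

# what the database column actually holds (raw codec metadata):
stored = json.dumps({
    "hash": "ab12...",
    "path": "_hash/test_gc_repro/ab12...",
    "store": "my_store",
    "size": 64,
})

# what to_arrays() hands to _extract_hash_refs after decode():
decoded = np.arange(64, dtype='uint8')

# json.loads(stored) -> dict with "path"    -> reference found
# decoded            -> ndarray, not a dict -> silently dropped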

_extract_hash_refs(value) (src/datajoint/gc.py:115) is designed for raw metadata:

def _extract_hash_refs(value):
    refs = []
    if value is None:
        return refs
    if isinstance(value, str):
        try:
            value = json.loads(value)
        except (json.JSONDecodeError, TypeError):
            return refs
    if isinstance(value, dict) and "path" in value:
        refs.append((value["path"], value.get("store")))
    return refs

None of the decoded codec outputs are dicts or JSON strings, and this is the documented contract for each codec, not an implementation accident:

Codec        decode() returns                      _extract_*_refs(...)   Documented at
<blob@>      numpy.ndarray (or stored object)      []                     reference/specs/type-system.md (BlobCodec example)
<npy@>       NpyRef lazy handle                    []                     reference/specs/npy-codec.md ("When you fetch an <npy@> attribute, you get an NpyRef object")
<object@>    ObjectRef lazy handle                 []                     reference/specs/type-system.md (ObjectCodec example)
<attach@>    local file path str after download    []                     how-to/use-object-storage.md
<hash@>      bytes                                 []                     reference/specs/type-system.md

The function silently returns [] for every value: no exception is raised, no warning is logged, the surrounding try/except never triggers, and referenced stays empty.
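Feeding each decoded shape from the table above into the helper makes the silent failure concrete (the dict example mirrors the raw-SQL output in the reproducer below):

import numpy as np
from datajoint.gc import _extract_hash_refs

print(_extract_hash_refs(np.arange(64, dtype='uint8')))   # ndarray            -> []
print(_extract_hash_refs(b'\x00\x01'))                    # bytes              -> []
print(_extract_hash_refs('/tmp/attach/file.dat'))         # path str, not JSON -> []
print(_extract_hash_refs({'path': '_hash/s/ab12', 'store': 'my_store'}))
# raw metadata dict -> [('_hash/s/ab12', 'my_store')]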

Three-way reproducer

Two to_arrays call paths produce identical wrong results; the raw SQL bypass shows what _extract_hash_refs actually expects.

import datajoint as dj
import numpy as np
from datajoint import gc as dj_gc
from datajoint.gc import _extract_hash_refs

dj.config['safemode'] = False

schema = dj.Schema('test_gc_repro')
if schema.exists:
    schema.drop(prompt=False)
schema = dj.Schema('test_gc_repro')

@schema
class T(dj.Manual):
    definition = '''
    rid : int
    ---
    payload : <blob@my_store>
    '''

T.insert1((1, np.arange(64, dtype='uint8')))

# 1. User-idiomatic call:
v_user = T().to_arrays('payload')[0]
print(type(v_user).__name__, _extract_hash_refs(v_user))
# → ndarray  []

# 2. GC-internal call (FreeTable via schema.get_table):
v_gc = schema.get_table('t')().to_arrays('payload')[0]
print(type(v_gc).__name__, _extract_hash_refs(v_gc))
# → ndarray  []

# 3. Raw SQL bypass — what `_extract_hash_refs` is actually designed for:
raw = list(schema.connection.query(
    'SELECT payload FROM "test_gc_repro"."t" LIMIT 1'
))[0][0]
print(type(raw).__name__, _extract_hash_refs(raw))
# → dict  [('_hash/test_gc_repro/<hash>', 'my_store')]   ← only this works

# Public API confirms the silent failure:
stats = dj_gc.scan(schema, store_name='my_store')
print(stats['hash_referenced'], stats['hash_stored'], stats['hash_orphaned'])
# → 0 1 1   ← collect() would delete this referenced file

The same pattern triggers for <npy@> / <object@> via scan_schema_references and _extract_schema_refs.
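For completeness, a hedged sketch of the schema-addressed variant, continuing the session above (the NpyRef type name comes from the codec table; the empty result from _extract_schema_refs is inferred from its parallel dict-shape check, not verified output):

from datajoint.gc import _extract_schema_refs

@schema
class N(dj.Manual):
    definition = '''
    rid : int
    ---
    arr : <npy@my_store>
    '''

N.insert1((1, np.arange(8)))

v = N().to_arrays('arr')[0]
print(type(v).__name__, _extract_schema_refs(v))
# expected: NpyRef  []   (same silent rejection on the schema-addressed path)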

Suggested fix

Option (1): iterate expr.cursor(as_dict=True) directly in scan_*_references and bypass decode_attribute. Smallest change, stays close to the existing code shape, produces exactly the dict / JSON-string values that _extract_*_refs was designed to consume.

The proposed fix is intentionally backend-agnostic. JSON columns come back as strings on MySQL (needing json.loads) and as already-decoded dicts on PostgreSQL (JSONB); _extract_hash_refs and _extract_schema_refs already handle both branches at gc.py:124–129 (the isinstance(value, str) → json.loads path plus the isinstance(value, dict) path). Cursor reads therefore produce the right shape across backends with no adapter dispatch.
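A minimal sketch of option (1), assuming expr.cursor(as_dict=True) yields raw row dicts as proposed; the attribute iteration mirrors the excerpt above, and _iter_tables is a hypothetical stand-in for however gc.py walks a schema's tables:

def scan_hash_references(schemas, store_name=None):
    """Collect referenced paths from raw stored metadata: no decode, no storage I/O."""
    referenced = set()
    for schema in schemas:
        for table_name, table in _iter_tables(schema):   # hypothetical iteration helper
            for attr_name, attr in table.heading.attributes.items():
                if not _uses_hash_storage(attr):
                    continue
                # project down to the storage column and read the cursor directly,
                # bypassing Expression.to_dicts / decode_attribute entirely
                for row in table.proj(attr_name).cursor(as_dict=True):
                    for path, ref_store in _extract_hash_refs(row[attr_name]):
                        if store_name is None or ref_store == store_name:
                            referenced.add(path)
    return referenced

The same reshaping applies to scan_schema_references with _extract_schema_refs.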

Two alternatives, listed for completeness — option (1) is the right call:

  • (2) Add decode_storage=False to Expression.to_dicts / to_arrays. Lets reference-discovery (and any future metadata-only operation) opt out of codec decode. Wider API change.
  • (3) Have the Codec base class expose a get_storage_metadata(stored_value) hook that returns the path/hash without I/O. Cleanest abstraction (sketched below) but requires touching every built-in codec and asking custom-codec authors to implement it.
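For concreteness, a minimal sketch of what the option (3) hook could look like; the Codec base-class shape and hook name are the proposal above, not existing API:

import json

class Codec:
    def get_storage_metadata(self, stored_value):
        """Return the raw {"path": ..., "store": ...} metadata without any I/O.

        Default covers the built-in JSON-metadata shape; a custom codec that
        stores references differently would override this.
        """
        if isinstance(stored_value, str):
            try:
                stored_value = json.loads(stored_value)
            except json.JSONDecodeError:
                return None
        return stored_value if isinstance(stored_value, dict) else None

scan_*_references would then call codec.get_storage_metadata(raw_value) and feed the result to _extract_*_refs, keeping reference discovery metadata-only.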

Test gap and proposed regression coverage

tests/integration/test_gc.py:163–165 patches the three internal functions:

@patch('datajoint.gc.scan_hash_references')
@patch('datajoint.gc.scan_schema_references')
@patch('datajoint.gc.list_*_paths')   # both list_stored_hashes and list_schema_paths

These tests are useful — they verify orchestration (scan composes references with file listings, returns the right stat keys, dry-run vs real, etc.). They should stay. What's missing is a non-mocked end-to-end test that would have caught this:

import numpy as np
import datajoint as dj
from datajoint import gc as dj_gc


def test_scan_finds_active_blob_reference():
    """scan() must report hash_referenced > 0 for a populated <blob@> column."""
    schema = dj.Schema('test_gc_e2e_blob')
    if schema.exists:
        schema.drop(prompt=False)
    schema = dj.Schema('test_gc_e2e_blob')

    @schema
    class T(dj.Manual):
        definition = '''
        rid : int
        ---
        payload : <blob@my_store>
        '''

    T.insert1((1, np.arange(64, dtype='uint8')))

    stats = dj_gc.scan(schema, store_name='my_store')
    assert stats['hash_referenced'] >= 1, (
        f"scan should find the active reference; got {stats}"
    )


def test_scan_finds_active_npy_reference():
    """Same shape, schema-addressed via <npy@>."""
    # ... <npy@> column, insert, assert schema_paths_referenced >= 1


def test_scan_finds_active_object_reference():
    """Same shape, schema-addressed via <object@>."""
    # ... <object@> column, insert, assert schema_paths_referenced >= 1

Three tests, one per codec family — covers the code path the mocked tests cannot reach. Happy to PR these alongside the fix if useful.

Performance side-effect (independent of correctness)

Because the broken path runs decode_attribute, scan downloads every external blob from storage only to deserialize it and discard the result through the silent type mismatch. A scan against a schema with 1 TB of blobs downloads 1 TB to produce referenced: 0. With option (1), scan becomes a metadata-only operation as it should be: it touches only the JSON metadata column in the database and never reaches storage during reference enumeration.

Implication for custom-codec authors

Third-party codec authors following the documented encode/decode contract (how-to/create-custom-codec.md, how-to/use-plugin-codecs.md, explanation/custom-codecs.md) will hit this same silent failure with no diagnostic. The canonical example in reference/specs/codec-api.md itself shows a ParquetCodec whose decode returns a ParquetRef object — exactly the kind of value that _extract_*_refs rejects.

Whichever fix lands needs to either:

  • (a) work uniformly via raw-metadata access regardless of codec — option (1) does this — or
  • (b) come with a documented codec-author contract for reference-discovery (e.g., codecs must implement a get_storage_metadata hook), updated codec docs, and a deprecation path for existing third-party codecs that don't implement it.

The current codec docs make no mention of GC or reference-discovery requirements, so authors have no way to know they need to do anything special.

Adjacent (low-severity)

The gc.py module docstring shows the usage example as:

import datajoint as dj
stats = dj.gc.scan(schema1, schema2, store_name='mystore')

This raises AttributeError: module 'datajoint' has no attribute 'gc' because gc is not registered in __init__.py's _lazy_modules. The actual working import is from datajoint import gc as dj_gc; dj_gc.scan(...). Fix: add "gc": (".gc", None) to _lazy_modules, or correct the docstring example.
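A minimal sketch of the registration fix, assuming _lazy_modules maps an attribute name to a (module path, attribute) tuple as the suggested entry implies:

# src/datajoint/__init__.py (sketch)
_lazy_modules = {
    # ... existing entries ...
    "gc": (".gc", None),   # None: expose the submodule itself, not an attribute of it
}

With that entry in place, the module-docstring example (dj.gc.scan(...)) resolves as written.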

Reproducer artifacts

Happy to share the standalone scripts and a screen-share-friendly notebook used during the investigation if useful (they live in a private repo at the moment but the bodies can be extracted and posted here on request).

Next steps

  1. Validate the suggested fix direction.
  2. Land option (1) plus the three regression tests above.
  3. Address the codec-author implication (probably easiest as a doc note: "your codec's decode return type does not affect GC; reference-discovery operates on raw stored metadata").
  4. (Optional) fix the dj.gc lazy-module exposure / docstring example.
