
fix: prevent scoped context cache pollution during context processing#249

Merged
mielvds merged 2 commits into digitalbazaar:master from jdsika:fix/scoped-context-cache-pollution
May 4, 2026

Conversation

Contributor

@jdsika jdsika commented Apr 24, 2026

Summary

Fixes a bug where @type: @vocab coercion silently fails when a JSON-LD context contains multiple terms sharing the same scoped @context -- a pattern that arises naturally with enum-typed properties.

This is the same underlying issue reported in #201, but with a minimal, targeted fix (1 line of production code + comment) instead of threading active_property paths through the entire processing stack.

Bug Description

Symptom

Given a context with multiple @type: @vocab terms that share an identical scoped @context:

{
  "Color": {
    "@id": "ex:Color",
    "@type": "@vocab",
    "@context": { "@vocab": "https://example.org/vocab/" }
  },
  "Shape": {
    "@id": "ex:Shape",
    "@type": "@vocab",
    "@context": { "@vocab": "https://example.org/vocab/" }
  }
}

Expansion of the first term works correctly, but subsequent terms produce {"@value": "Circle"} (plain literal) instead of {"@id": "https://example.org/vocab/Circle"} (IRI).

Root Cause

During context processing in _process_context() (line ~3326), the method iterates over all terms in the context, calling _create_term_definition() for each. After each term definition, if the term has a scoped @context, it is pre-validated by calling _process_context(rval, key_ctx, ...) recursively (lines 3337-3363). The result of this validation is discarded, but it has a critical side effect: it populates the ResolvedContext cache.

The cache key is rval['_uuid'], which is assigned once after cloning (line 3321) and never changes during the entire term definition loop. At the time the first scoped context is pre-validated, rval has only a partial set of term mappings (only terms processed so far). The processed result -- with incomplete mappings -- is cached.

Later, when the fully-built context is used during expansion and the same scoped context needs to be processed (line 2605), _process_context() finds a stale cache hit (same _uuid, same canonical scoped context) and returns the incomplete result. This causes get_context_value(active_ctx, term, '@type') to return None instead of '@vocab', so the value falls through to the @value branch instead of the @id branch in _expand_value().
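The stale-hit mechanism can be made concrete with a toy simulation. The cache, function, and field names below are simplified stand-ins chosen for illustration, not pyld's actual internals:

```python
import uuid

# Toy stand-ins for pyld internals; names and data shapes are illustrative only.
cache = {}  # keyed by (active context _uuid, canonical scoped context)

def process_scoped_context(active_ctx, scoped_key):
    """Process a scoped context against active_ctx, caching the result."""
    key = (active_ctx['_uuid'], scoped_key)
    if key in cache:
        return cache[key]  # hit: returns whatever was cached earlier
    result = dict(active_ctx['mappings'])  # snapshot of the mappings known *now*
    cache[key] = result
    return result

# Context processing: one _uuid is assigned before the term-definition loop.
rval = {'_uuid': str(uuid.uuid1()), 'mappings': {}}

rval['mappings']['Color'] = {'@type': '@vocab'}
process_scoped_context(rval, 'shared')  # pre-validation caches a partial result

rval['mappings']['Shape'] = {'@type': '@vocab'}  # defined after the cache filled

# Expansion time: same _uuid, same scoped context => stale cache hit.
seen = process_scoped_context(rval, 'shared')
print('Shape' in seen)  # False: 'Shape' was absent when the result was cached
```

In the real code, it is this missing mapping that makes the `@type` lookup return None for the later terms.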

Trace Evidence

Instrumented trace showing the cache pollution:

# During context processing -- pre-validation caches partial result:
_process_context depth=1 ctx=dict(4 keys) active_uuid=d3085059
  result mappings count: 12    <-- only 12 mappings (partial!)

# During expansion -- stale cache hit returns partial result:
_process_context depth=0 ctx=dict(4 keys) active_uuid=d3085059
  CACHE HIT => mappings=12 has_DrivableAreaType=False    <-- BUG!

# Result: @value instead of @id
{"@value": "RoadTypeMotorway"}    <-- should be {"@id": "...RoadTypeMotorway"}

Fix

One line of production code: after the term definition loop completes (all mappings are now in rval), regenerate rval['_uuid'] before freezing:

rval['_uuid'] = str(uuid.uuid1())

This ensures that expansion-time lookups of scoped contexts use a _uuid that was never used during pre-validation, so they miss the stale cache and process the scoped context against the complete active context.
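The effect can be sketched with a toy model of the cache (hypothetical names, not pyld's actual code): regenerating the identifier after the loop changes the cache key, so the expansion-time lookup misses the stale entry and snapshots the complete mappings.

```python
import uuid

cache = {}  # keyed by (active context _uuid, canonical scoped context)

def process_scoped_context(active_ctx, scoped_key):
    """Toy model: cache a snapshot of the current mappings per (_uuid, ctx)."""
    key = (active_ctx['_uuid'], scoped_key)
    if key not in cache:
        cache[key] = dict(active_ctx['mappings'])
    return cache[key]

rval = {'_uuid': str(uuid.uuid1()), 'mappings': {}}
rval['mappings']['Color'] = {'@type': '@vocab'}
process_scoped_context(rval, 'shared')   # pre-validation caches a partial result
rval['mappings']['Shape'] = {'@type': '@vocab'}

# The fix: a fresh _uuid once all term definitions exist.
rval['_uuid'] = str(uuid.uuid1())

complete = process_scoped_context(rval, 'shared')  # cache miss -> full snapshot
print('Shape' in complete)  # True: processed against the complete context
```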

Why This Approach

As @dlongley noted in #201 (comment):

the python version does not generate a new _uuid property when cloning an active context for modification [...] Which seems like the most natural place to do this in order for it to be a "unique object identifier", similar to using the object reference itself in jsonld.js

The broader suggestion of adding _uuid to _clone_active_context() is sound for general correctness, but alone it does not fix this specific bug -- the outer rval keeps its clone-time _uuid throughout the loop and into the final freeze, so the pre-validation cache entries would still match. The targeted regeneration after the loop is necessary.

Test Plan

New regression tests (5 tests in tests/test_scoped_context_cache.py):

  • test_single_vocab_term_expands_correctly -- baseline: a single @type: @vocab term (always worked)
  • test_many_shared_scoped_contexts_expand_correctly -- 30 enum terms with a shared scoped context; all must expand to @id
  • test_last_vocab_term_expands_with_large_context -- the last of 27 enum terms in a large context (most likely to fail due to the cache)
  • test_structured_value_still_works_with_scoped_context -- object values still use scoped context mappings (text, description, meaning)
  • test_mixed_plain_and_vocab_terms -- a mix of plain string terms and @type: @vocab terms in a 100+ key context

Without fix: 3 of 5 fail. With fix: all 5 pass.

Existing test suites -- zero regressions:

  • W3C JSON-LD API (specifications/json-ld-api/tests/) -- 1277 passed, 41 skipped
  • W3C JSON-LD Framing (specifications/json-ld-framing/tests/) -- 92 passed, 2 skipped
  • RDF Normalization (specifications/normalization/tests/) -- 121 passed, 2 skipped
  • pyld unit tests (tests/) -- 164 passed

Real-World Impact

This bug affects any JSON-LD context generated from schemas with enum-typed properties -- a common pattern in ontology management. We discovered it while implementing @type: @vocab context generation for LinkML enum slots (linkml/linkml#2497), where 27 enum properties in an OpenLABEL ontology all share identical scoped contexts. The bug caused all enum values to expand as plain literals instead of vocabulary IRIs, silently breaking SHACL validation downstream.

The rdflib JSON-LD implementation handles the same contexts correctly, confirming the context structure is valid per JSON-LD 1.1 section 4.2.3 (Type Coercion) and section 4.1.8 (Scoped Contexts).

Related

When a JSON-LD context contains multiple terms that share the same scoped
@context (e.g., enum-typed properties using @type: @vocab with a common
scoped @vocab), the pre-validation of scoped contexts during
_process_context() would cache the processed result keyed by rval['_uuid'].
Since rval is mutated (mappings added incrementally) during the loop, the
cached result contains only a partial set of term mappings.

Later, when the returned context is used during expansion and its scoped
contexts are processed, _process_context() would find a stale cache hit
(same _uuid, same scoped context canonical form) and return the incomplete
result. This causes @type coercion (e.g., @type: @vocab) to silently fail
for any term whose mapping was absent from the cached context, producing
@value literals instead of @id IRIs.

The fix regenerates rval['_uuid'] after all term definitions are created,
ensuring that expansion-time lookups of scoped contexts miss the
pre-validation cache and process against the complete active context.

All W3C JSON-LD API conformance tests (1277), framing tests (92), and
normalization tests (121) continue to pass.

Signed-off-by: Carlo van Driesten <carlo.van-driesten@bmw.de>
@jdsika jdsika force-pushed the fix/scoped-context-cache-pollution branch from ed47b3a to dbd5ced on April 27, 2026 08:50
Signed-off-by: Carlo van Driesten <carlo.van-driesten@bmw.de>
Collaborator

@mielvds mielvds left a comment


Thanks, this is a nice fix and allows us to close a couple of issues and PRs.
But for future reference w.r.t. AI-generated PRs:

  • could you make clear who I am talking to? @jdsika, an agent and @jdsika, or just an agent? Is @jdsika a human? No idea :)
  • I am OK with AI-assisted coding, but I am still on the fence about AI-generated PRs. AI made opening PRs really cheap, while reviewing them is still a manual task (else, what's the point?). This PR description is a bit long (although I agree that it is valuable context) and it gives me zero proof that @jdsika actually understands the code that is being provided. I suggest you summarize this yourself, or at least add some note that convinces me that you checked this.

@anatoly-scherbakov we also need to discuss testing structure. Function-based tests, fine, but I don't want to end up with hundreds of AI-generated tests that we can no longer navigate. And at the very least we should stick to one file per function

@jdsika
Contributor Author

jdsika commented May 4, 2026

Hey all,
You are talking to me :) but:
This PR description is AI-generated from a lengthy "self-conversation" -> I evaluated this PR purely against my test setup, the model, and the generated output.
I am not aware of the overall code base in pyld, so the review is crucial.

I am re-modelling the ASAM OpenLABEL ontology in LinkML (one PR as an example: ASCS-eV/ontology-management-base#66) and analysing the modelling gaps.
I create test data which I validate with my validation suite in OMB.
I consider three sources of error:

  • wrong modelling in open label v1
  • missing feature in linkml compiler
  • wrong model in open label v2 linkml model

I try to systematically narrow down the cases. The pyld issue was found by coincidence when a test case in LinkML failed after my feature improvement.

@jdsika
Contributor Author

jdsika commented May 4, 2026

And this is Carlo van Driesten from BMW -> simulation and test engineer - hello from sunny Munich

@mielvds mielvds merged commit b3309e2 into digitalbazaar:master May 4, 2026
15 of 16 checks passed
