fix: prevent scoped context cache pollution during context processing #249
When a JSON-LD context contains multiple terms that share the same scoped @context (e.g., enum-typed properties using @type: @vocab with a common scoped @vocab), the pre-validation of scoped contexts during _process_context() would cache the processed result keyed by rval['_uuid']. Since rval is mutated (mappings added incrementally) during the loop, the cached result contains only a partial set of term mappings.

Later, when the returned context is used during expansion and its scoped contexts are processed, _process_context() would find a stale cache hit (same _uuid, same scoped context canonical form) and return the incomplete result. This causes @type coercion (e.g., @type: @vocab) to silently fail for any term whose mapping was absent from the cached context, producing @value literals instead of @id IRIs.

The fix regenerates rval['_uuid'] after all term definitions are created, ensuring that expansion-time lookups of scoped contexts miss the pre-validation cache and process against the complete active context.

All W3C JSON-LD API conformance tests (1277), framing tests (92), and normalization tests (121) continue to pass.

Signed-off-by: Carlo van Driesten <carlo.van-driesten@bmw.de>
ed47b3a to
dbd5ced
Thanks, this is a nice fix and allows us to close a couple of issues and PRs.
But for future reference regarding AI-generated PRs:
- Could you make clear who I am talking to: @jdsika, an agent and @jdsika, or just an agent? Is @jdsika a human? No idea :)
- I am OK with AI-assisted coding, but I am still on the fence about AI-generated PRs. AI has made opening PRs really cheap, while reviewing them is still a manual task (else, what's the point?). This PR description is a bit long (although I agree that it is valuable context), and it gives me zero proof that @jdsika actually understands the code that is being provided. I suggest you just summarize this yourself, or at least add a note that convinces me that you checked this.
@anatoly-scherbakov we also need to discuss testing structure. Function-based tests are fine, but I don't want to end up with hundreds of AI-generated tests that we can no longer navigate. At the very least, we should stick to one file per function.
Hey all, I am re-modelling the ASAM OpenLabel ontology in LinkML (one PR as an example: ASCS-eV/ontology-management-base#66) and analysing the modelling gaps.
I am trying to systematically narrow down the cases. The pyld issue was a coincidental finding: a test case in linkml failed after my feature improvement.
And this is Carlo van Driesten from BMW, simulation and test engineer. Hello from sunny Munich!
Summary
Fixes a bug where @type: @vocab coercion silently fails when a JSON-LD context contains multiple terms sharing the same scoped @context -- a pattern that arises naturally with enum-typed properties.

This is the same underlying issue reported in #201, but with a minimal, targeted fix (1 line of production code + comment) instead of threading active_property paths through the entire processing stack.

Bug Description
Symptom
Given a context with multiple @type: @vocab terms that share an identical scoped @context:

    {
      "Color": {
        "@id": "ex:Color",
        "@type": "@vocab",
        "@context": { "@vocab": "https://example.org/vocab/" }
      },
      "Shape": {
        "@id": "ex:Shape",
        "@type": "@vocab",
        "@context": { "@vocab": "https://example.org/vocab/" }
      }
    }

Expansion of the first term works correctly, but subsequent terms produce {"@value": "Circle"} (plain literal) instead of {"@id": "https://example.org/vocab/Circle"} (IRI).

Root Cause
During context processing in _process_context() (line ~3326), the method iterates over all terms in the context, calling _create_term_definition() for each. After each term definition, if the term has a scoped @context, it is pre-validated by calling _process_context(rval, key_ctx, ...) recursively (lines 3337-3363). The result of this validation is discarded, but it has a critical side effect: it populates the ResolvedContext cache.

The cache key is rval['_uuid'], which is assigned once after cloning (line 3321) and never changes during the entire term definition loop. At the time the first scoped context is pre-validated, rval has only a partial set of term mappings (only the terms processed so far). The processed result, with its incomplete mappings, is cached.

Later, when the fully built context is used during expansion and the same scoped context needs to be processed (line 2605), _process_context() finds a stale cache hit (same _uuid, same canonical scoped context) and returns the incomplete result. This causes get_context_value(active_ctx, term, '@type') to return None instead of '@vocab', so the value falls through to the @value branch instead of the @id branch in _expand_value().

Trace Evidence
Instrumented trace showing the cache pollution:
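The sequence described under Root Cause can be reproduced in miniature. What follows is a toy model, not pyld code and not the instrumented trace: ScopedContextCache, build_context, and coercion_seen_at_expansion are hypothetical names, and "processing" a scoped context is reduced to taking a snapshot of the mappings known at that moment. It only illustrates how a cache keyed by an unchanging _uuid can hand back a partial result, and how regenerating the key after the loop avoids the stale hit.

```python
import uuid


class ScopedContextCache:
    """Toy stand-in for a processed-context cache (hypothetical, not pyld's)."""

    def __init__(self):
        self._entries = {}

    def get_or_process(self, active_ctx, scoped_key):
        key = (active_ctx["_uuid"], scoped_key)
        if key not in self._entries:
            # "Processing" is modelled as a snapshot of the current mappings.
            self._entries[key] = dict(active_ctx["mappings"])
        return self._entries[key]


def build_context(terms, cache, regenerate_uuid):
    """Mimic a term-definition loop that pre-validates scoped contexts."""
    ctx = {"_uuid": str(uuid.uuid4()), "mappings": {}}
    for term, definition in terms.items():
        ctx["mappings"][term] = definition["@type"]
        # Pre-validation: the result is discarded, but the cache is populated.
        cache.get_or_process(ctx, definition["scoped"])
    if regenerate_uuid:
        # Modelled fix: a fresh key once every mapping is in place.
        ctx["_uuid"] = str(uuid.uuid4())
    return ctx


def coercion_seen_at_expansion(ctx, cache, term, scoped_key):
    """Return the @type coercion visible for `term` inside the scoped context."""
    return cache.get_or_process(ctx, scoped_key).get(term)


SCOPED = "@vocab=https://example.org/vocab/"
TERMS = {
    "Color": {"@type": "@vocab", "scoped": SCOPED},
    "Shape": {"@type": "@vocab", "scoped": SCOPED},
}

# Without key regeneration: the snapshot taken after "Color" is reused.
buggy_cache = ScopedContextCache()
buggy_ctx = build_context(TERMS, buggy_cache, regenerate_uuid=False)
buggy_first = coercion_seen_at_expansion(buggy_ctx, buggy_cache, "Color", SCOPED)
buggy_second = coercion_seen_at_expansion(buggy_ctx, buggy_cache, "Shape", SCOPED)

# With key regeneration: the expansion-time lookup misses the stale entry.
fixed_cache = ScopedContextCache()
fixed_ctx = build_context(TERMS, fixed_cache, regenerate_uuid=True)
fixed_second = coercion_seen_at_expansion(fixed_ctx, fixed_cache, "Shape", SCOPED)
```

In this model the first term keeps its coercion in both runs, while the second term loses it only when the key is left unchanged, which mirrors the symptom reported above.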
Fix
One line of production code: after the term definition loop completes (all mappings are in rval), regenerate rval['_uuid'] before freezing.

This ensures that expansion-time lookups of scoped contexts use a _uuid that was never used during pre-validation, so they miss the stale cache and process the scoped context against the complete active context.

Why This Approach
As @dlongley noted in #201 (comment):
The broader suggestion of adding _uuid to _clone_active_context() is sound for general correctness, but alone it does not fix this specific bug: the outer rval keeps its clone-time _uuid throughout the loop and into the final freeze, so the pre-validation cache entries would still match. The targeted regeneration after the loop is necessary.

Test Plan
New regression tests (5 tests in tests/test_scoped_context_cache.py):
- test_single_vocab_term_expands_correctly: […] @type: @vocab term (always worked)
- test_many_shared_scoped_contexts_expand_correctly: […] @id
- test_last_vocab_term_expands_with_large_context
- test_structured_value_still_works_with_scoped_context: […] text, description, meaning)
- test_mixed_plain_and_vocab_terms: […] @type: @vocab terms in 100+ key context

Without fix: 3 of 5 fail. With fix: all 5 pass.
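As an illustration of what such a regression check might look like, here is a minimal sketch. It is not taken from tests/test_scoped_context_cache.py. The helper names build_document, expand_with_pyld, and find_uncoerced are hypothetical, the ex prefix mapping and the value "Red" are assumptions added so the document is self-contained, and whether two terms are enough to trigger the failure on an unpatched pyld has not been verified here. The only pyld call used is jsonld.expand().

```python
def build_document():
    """Build a document whose context has two terms sharing one scoped context."""
    scoped = {"@vocab": "https://example.org/vocab/"}
    context = {
        # Assumed prefix so that ex:Color and ex:Shape resolve to absolute IRIs.
        "ex": "https://example.org/",
        "Color": {"@id": "ex:Color", "@type": "@vocab", "@context": dict(scoped)},
        "Shape": {"@id": "ex:Shape", "@type": "@vocab", "@context": dict(scoped)},
    }
    # "Red" is an assumed sample value; "Circle" comes from the symptom above.
    return {"@context": context, "Color": "Red", "Shape": "Circle"}


def expand_with_pyld(doc):
    """Expand the document with pyld (imported lazily; requires pyld installed)."""
    from pyld import jsonld

    return jsonld.expand(doc)


def find_uncoerced(expanded):
    """Return property IRIs whose values are not IRI references."""
    failures = []
    for node in expanded:
        for prop, values in node.items():
            if prop.startswith("@"):
                continue  # skip keywords such as @id and @type
            if any("@id" not in value for value in values):
                failures.append(prop)
    return sorted(failures)
```

A test body could then assert that find_uncoerced(expand_with_pyld(build_document())) is an empty list. Returning the failing property IRIs rather than a boolean keeps the failure message informative when only some terms lose their coercion.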
Existing test suites -- zero regressions:
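- 1277 W3C JSON-LD API conformance tests (specifications/json-ld-api/tests/)
- 92 framing tests (specifications/json-ld-framing/tests/)
- 121 normalization tests (specifications/normalization/tests/)
- […] (tests/)

Real-World Impact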
This bug affects any JSON-LD context generated from schemas with enum-typed properties -- a common pattern in ontology management. We discovered it while implementing @type: @vocab context generation for LinkML enum slots (linkml/linkml#2497), where 27 enum properties in an OpenLABEL ontology all share identical scoped contexts. The bug caused all enum values to expand as plain literals instead of vocabulary IRIs, silently breaking SHACL validation downstream.

The rdflib JSON-LD implementation handles the same contexts correctly, confirming the context structure is valid per JSON-LD 1.1 section 4.2.3 (Type Coercion) and section 4.1.8 (Scoped Contexts).

Related
[…] active_property paths as cache keys)