fix: JSON-LD context-coerced @vector arrays produce single FlakeValue::Vector #1216

Merged
bplatz merged 1 commit into main from fix/vector-default-ctx
May 6, 2026

Conversation

@bplatz bplatz commented May 4, 2026

A @vector-typed property in @context paired with a bare JSON array (e.g. "ex:embedding": [0.1, 0.2, 0.3]) was committing as N scalar flakes, each tagged with the f:embeddingVector datatype. The flakes pointed at unallocated VECTOR_ID arena handles, producing "vector handle out of arena" errors at decode time and leaving rows that couldn't be queried, formatted, or retracted via SPARQL UPDATE.

This PR fixes the corruption at its source (JSON-LD expansion) and adds defense-in-depth so no parsing path can produce mis-shaped vector flakes again.

Changes

Root-cause fix

  • fluree-graph-json-ld/src/expand.rs: parse_node_value now detects a context-coerced @vector type (or the f:embeddingVector IRI form) and wraps a JSON array as a single {"@value":[...], "@type":"https://ns.flur.ee/db#embeddingVector"} value object instead of iterating it into N scalar @value objects. Both downstream consumers, the live transact JSON-LD path (expand_with_context_policy) and the bulk-import adapter (fluree_graph_json_ld::expand), now see one well-formed vector value.
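
The difference between the old and new expansion behavior can be sketched in a few lines. This is a minimal illustration, not the code in expand.rs: the `Json` enum, `ValueObject` struct, and `expand_value` function are hypothetical stand-ins (the real code presumably operates on full JSON values and a parsed context), and only the wrap-versus-iterate decision is modeled.

```rust
// Hypothetical stand-in for a JSON value; only what the sketch needs.
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Num(f64),
    Arr(Vec<Json>),
}

// Hypothetical expanded value object: {"@value": ..., "@type": ...}.
#[derive(Debug, Clone, PartialEq)]
struct ValueObject {
    value: Json,
    datatype: Option<String>,
}

const EMBEDDING_VECTOR: &str = "https://ns.flur.ee/db#embeddingVector";

/// Expand a property value, given the type the context coerces it to (if any).
fn expand_value(value: &Json, coerced_type: Option<&str>) -> Vec<ValueObject> {
    let is_vector =
        coerced_type == Some("@vector") || coerced_type == Some(EMBEDDING_VECTOR);
    match value {
        // Vector-coerced array: keep the whole array as ONE value object.
        Json::Arr(_) if is_vector => vec![ValueObject {
            value: value.clone(),
            datatype: Some(EMBEDDING_VECTOR.to_string()),
        }],
        // Any other array: one value object per element (multi-valued property).
        Json::Arr(items) => items
            .iter()
            .flat_map(|item| expand_value(item, coerced_type))
            .collect(),
        // Scalar: tag it with the coerced type, if there is one.
        scalar => vec![ValueObject {
            value: scalar.clone(),
            datatype: coerced_type.map(str::to_string),
        }],
    }
}
```

Before the fix, the vector-coerced case effectively fell into the second arm, so a three-element embedding became three scalar value objects that each carried the vector datatype.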

Shared coercion pipeline

  • fluree-db-core/src/coerce.rs
    • coerce_string_value is now pub and gained an f:embeddingVector arm that parses the "[..]" lexical form via the shared coerce_array_to_vector (f32 quantization).
    • coerce_number_value and coerce_bool_value now hard-error when the datatype is f:embeddingVector (previously a silent fall-through to Long/Double/Boolean, which was the original corruption path).
    • coerce_array_to_vector rejects empty arrays, which collide with the FlakeValue::max() sentinel and which the vector arena hard-rejects.
  • fluree-db-transact/src/value_convert.rs: convert_string_literal adds an f:embeddingVector arm that delegates to core::coerce::coerce_string_value. JSON-LD bulk import, Turtle, and SPARQL "[..]"^^f:embeddingVector now share a single f32-quantized parser.
  • fluree-graph-json-ld/src/adapter.rs: process_literal routes vector arrays through sink.term_literal with a stringified lexical form (Option B from the design discussion: keeps fluree-graph-ir stable and lets convert_string_literal's shared parser do the work).
  • fluree-db-transact/src/lower_sparql_update.rs: coerce_typed_value and coerce_typed_flake_value now produce LiteralValue::Vector / FlakeValue::Vector for f:embeddingVector typed literals (previously String).
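
As a rough illustration of what a shared "[..]" lexical parser with f32 quantization and empty-array rejection might look like, here is a standalone sketch. `parse_vector_lexical` and `VectorError` are hypothetical names, not the functions in coerce.rs, and the tokenizer is deliberately naive (comma split, no nested arrays, no special handling of non-finite values).

```rust
// Hypothetical error type for the sketch.
#[derive(Debug, PartialEq)]
enum VectorError {
    NotAnArray,
    Empty,
    BadElement(String),
}

/// Parse a "[0.1, 0.2]" lexical form into f32-quantized elements.
fn parse_vector_lexical(lexical: &str) -> Result<Vec<f32>, VectorError> {
    // Require surrounding brackets; a bare scalar is not a vector.
    let inner = lexical
        .trim()
        .strip_prefix('[')
        .and_then(|s| s.strip_suffix(']'))
        .ok_or(VectorError::NotAnArray)?;

    // Reject "[]" (and whitespace-only bodies) up front.
    if inner.trim().is_empty() {
        return Err(VectorError::Empty);
    }

    inner
        .split(',')
        .map(|tok| {
            let tok = tok.trim();
            tok.parse::<f64>()
                .map(|x| x as f32) // quantize every element to f32
                .map_err(|_| VectorError::BadElement(tok.to_string()))
        })
        .collect()
}
```

Routing every surface syntax through one function like this is what keeps quantization identical regardless of whether the vector arrived via JSON-LD, Turtle, or SPARQL.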

Write-path invariant guards (hard runtime errors)

A single validator (transact::generate::flakes::validate_value_dt_pair) is wired into all three sinks. It catches (FlakeValue, datatype) shape mismatches and empty vectors before a corrupt flake can hit the index:

  • FlakeGenerator::materialize_template — bails the whole transaction.
  • ImportSink::push_triple — sets encode_error so the bulk-import path surfaces a CommitCodecError::InvalidOp (was: silently dropping the spool record).
  • FlakeSink::build_flake — captures the first violation; finish() now returns Result<Vec<Flake>, TransactError> (was: Vec<Flake>, silent drop). stage_turtle_insert propagates via ApiError::from.
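
The guard pattern described above (one shape check, with a sink that records the first violation and surfaces it from finish()) can be sketched as follows. All names here (`Value`, `Datatype`, `ShapeError`, `check_shape`, `SketchSink`) are hypothetical simplifications, not the real FlakeValue, Sid-based datatypes, or sink types, and the exact set of mismatches the real validator rejects is an assumption.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Long(i64),
    Double(f64),
    Vector(Vec<f32>),
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Datatype {
    Long,
    Double,
    EmbeddingVector,
}

#[derive(Debug, PartialEq)]
enum ShapeError {
    ScalarTaggedAsVector,
    VectorTaggedAsScalar,
    EmptyVector,
}

/// Check that a value's shape agrees with its datatype tag.
fn check_shape(value: &Value, dt: Datatype) -> Result<(), ShapeError> {
    match (value, dt) {
        (Value::Vector(v), Datatype::EmbeddingVector) if v.is_empty() => {
            Err(ShapeError::EmptyVector)
        }
        (Value::Vector(_), Datatype::EmbeddingVector) => Ok(()),
        (Value::Vector(_), _) => Err(ShapeError::VectorTaggedAsScalar),
        (_, Datatype::EmbeddingVector) => Err(ShapeError::ScalarTaggedAsVector),
        // Scalar value with scalar datatype: not this guard's concern.
        _ => Ok(()),
    }
}

/// Sink that remembers the first violation instead of silently dropping it.
#[derive(Default)]
struct SketchSink {
    flakes: Vec<(Value, Datatype)>,
    first_error: Option<ShapeError>,
}

impl SketchSink {
    fn push(&mut self, value: Value, dt: Datatype) {
        if self.first_error.is_some() {
            return; // already failed; keep the first error
        }
        match check_shape(&value, dt) {
            Ok(()) => self.flakes.push((value, dt)),
            Err(e) => self.first_error = Some(e),
        }
    }

    /// Returns the collected flakes, or the first violation seen.
    fn finish(self) -> Result<Vec<(Value, Datatype)>, ShapeError> {
        match self.first_error {
            Some(e) => Err(e),
            None => Ok(self.flakes),
        }
    }
}
```

Matching on the value variant first means the common scalar-with-scalar-datatype case falls through to `Ok(())` after a couple of cheap discriminant checks.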

Documentation

  • docs/reference/vocabulary.md: the vector datatype section now documents the user-facing validation rules: elements must be numbers, values are quantized to f32, vectors must be non-empty, and scalar values are rejected. It adds the bare-array context-coercion form alongside the explicit @value form, and calls out that the same rules apply to JSON-LD, SPARQL, and Turtle.

Tests

  • fluree-db-api/tests/it_vector_corruption_repro.rs (new):
    • jsonld_context_vector_bare_array_round_trips_after_indexing — pins the post-fix end-to-end behavior: insert via context-coerced @vector → publish index → SELECT returns the 4-element vector.
    • jsonld_context_vector_empty_array_is_rejected — pins user-facing rejection of empty [] vectors.
    • jsonld_context_vector_bare_array_retracts_after_indexing (#[ignore]'d, see TODO below).
    • sparql_insert_data_embedding_vector_literal_round_trips_after_indexing (#[ignore]'d, see TODO below).
  • Unit tests added in coerce.rs (8 new), expand.rs (5 new — one per vector type form, plus non-vector regression), lower_sparql_update.rs (2 new), flake_sink.rs (1 new), generate/flakes.rs (3 new). All passing.

@bplatz bplatz requested review from aaj3f and zonotope May 4, 2026 17:11

@aaj3f aaj3f left a comment

(GitHub's inline comments are failing so going to put some notes that I would have put in individual files)

fluree-db-transact/src/generate/flakes.rs:397-435
This all looks good and right to me and makes sense. It is a bit hard to tell just from reviewing the code how much this could impact large-flake-batch transactions. This may be another good candidate for CI benchmarking, to see whether it regresses transaction benchmarks. If it did, I'm not sure what the next step would be. Possibly short-circuiting on the most common hot-path scenarios before getting into Sid compares?

fluree-graph-json-ld/src/adapter.rs:439-454
I understand the choice to convert Value::Array(...) into a string to unify the downstream handling, but I just wanted to make sure we're accepting the redundant serde round-trip rather than handling LiteralValue::Vector within fluree-graph-ir.

bplatz commented May 6, 2026

Here is a benchmark for the first point; it looks good:

Benchmark results — no regression

fluree-db-api/benches/insert_formats.rs (vector-free, the reviewer's "common hot
path"). Criterion baseline diff, 10 samples per scenario, 95% CIs.

┌─────────────────────────────┬─────────────┬───────────────┬──────────────────────────┬───────────┐
│ Scenario                    │ Main median │ Branch median │ Δ (95% CI)               │ Verdict   │
├─────────────────────────────┼─────────────┼───────────────┼──────────────────────────┼───────────┤
│ jsonld / 10txn × 100 nodes  │ 17.37 ms    │ 16.16 ms      │ [−7.0%, +0.3%] (p=0.11)  │ no change │
│ turtle / 10txn × 100 nodes  │ 10.58 ms    │ 9.92 ms       │ [−9.9%, −6.4%] (p<0.05)  │ improved  │
│ jsonld / 100txn × 100 nodes │ 431.95 ms   │ 377.94 ms     │ [−15.7%, −9.1%] (p<0.05) │ improved  │
│ turtle / 100txn × 100 nodes │ 341.86 ms   │ 305.21 ms     │ [−13.1%, −8.4%] (p<0.05) │ improved  │
└─────────────────────────────┴─────────────┴───────────────┴──────────────────────────┴───────────┘

Single-shot summary table (all 6 × 2 scenarios from the bench's own summary)
corroborates: every scenario at or better than main, including the 480K-flake mega
case (4547ms → 3940ms).
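
For readers who want to sanity-check the single-shot numbers, the relative change is just (branch − main) / main. The helper below is an illustrative sketch with a hypothetical name; it gives a point estimate only and does not reproduce Criterion's confidence intervals from the table.

```rust
/// Relative change of the branch timing versus main, in percent.
/// Negative means the branch is faster.
fn relative_change_pct(main_ms: f64, branch_ms: f64) -> f64 {
    (branch_ms - main_ms) / main_ms * 100.0
}
```

Applied to the 480K-flake case, `relative_change_pct(4547.0, 3940.0)` comes out at roughly −13.3%.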

@bplatz bplatz merged commit afae391 into main May 6, 2026
13 checks passed
@bplatz bplatz deleted the fix/vector-default-ctx branch May 6, 2026 19:22
@bplatz bplatz mentioned this pull request May 21, 2026