fix: JSON-LD context-coerced @vector arrays produce single FlakeValue::Vector #1216
aaj3f
left a comment
(GitHub's inline comments are failing, so I'm putting the notes I would have left on individual files here.)
fluree-db-transact/src/generate/flakes.rs:397-435
This all looks correct to me and makes sense. It is hard to tell from the code alone how much this could affect large-flake-batch transactions. This may be another good candidate for CI benchmarking, to see whether it regresses the transaction benchmarks. If it did, I'm not sure what the next step would be; possibly short-circuiting on the most common hot-path scenarios before getting into Sid compares?
fluree-graph-json-ld/src/adapter.rs:439-454
I understand the choice to convert Value::Array(...) into a string to unify the downstream handling, but I wanted to make sure we're deliberately accepting the redundant serde round-trip rather than handling LiteralValue::Vector within fluree-graph-ir.
Here is a benchmark for the first part, so it seems to be good:
Benchmark results: no regression
fluree-db-api/benches/insert_formats.rs (vector-free, the reviewer's "common hot …
[Single-shot summary table (all 6 × 2 scenarios from the bench's own summary); table contents not recovered]
A `@vector`-typed property in `@context` paired with a bare JSON array (e.g. `"ex:embedding": [0.1, 0.2, 0.3]`) was committing as N scalar flakes, each tagged with the `f:embeddingVector` datatype. The flakes pointed at unallocated `VECTOR_ID` arena handles, producing "vector handle out of arena" errors at decode time and leaving rows that couldn't be queried, formatted, or retracted via SPARQL UPDATE.
This PR fixes the corruption at its source (JSON-LD expansion) and adds defense-in-depth so no parsing path can produce mis-shaped vector flakes again.
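
To make the failure mode concrete, here is a minimal sketch of the two expansion shapes. It is not the `fluree-graph-json-ld` code: `Json`, `ValueObject`, `resolve_type`, `expand_per_element`, and `expand_vector_aware` are hypothetical stand-ins, and only the `@vector` keyword and the full datatype IRI are taken from the PR text.

```rust
// Hypothetical, simplified stand-ins; not the real expansion types.
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Num(f64),
    Arr(Vec<Json>),
}

/// Simplified expanded value object: {"@value": ..., "@type": ...}.
#[derive(Debug, Clone, PartialEq)]
struct ValueObject {
    value: Json,
    datatype: Option<String>,
}

const EMBEDDING_VECTOR: &str = "https://ns.flur.ee/db#embeddingVector";

/// Map the context-coerced type to a datatype IRI (assumed spellings only).
fn resolve_type(coerced: Option<&str>) -> Option<String> {
    match coerced {
        Some("@vector") => Some(EMBEDDING_VECTOR.to_string()),
        other => other.map(str::to_string),
    }
}

fn is_vector(coerced: Option<&str>) -> bool {
    resolve_type(coerced).as_deref() == Some(EMBEDDING_VECTOR)
}

/// Buggy shape: each array element becomes its own typed scalar value object.
fn expand_per_element(coerced: Option<&str>, value: &Json) -> Vec<ValueObject> {
    let datatype = resolve_type(coerced);
    match value {
        Json::Arr(items) => items
            .iter()
            .map(|v| ValueObject { value: v.clone(), datatype: datatype.clone() })
            .collect(),
        scalar => vec![ValueObject { value: scalar.clone(), datatype }],
    }
}

/// Fixed shape: a vector-coerced array is wrapped whole as one value object.
fn expand_vector_aware(coerced: Option<&str>, value: &Json) -> Vec<ValueObject> {
    match value {
        Json::Arr(_) if is_vector(coerced) => vec![ValueObject {
            value: value.clone(),
            datatype: Some(EMBEDDING_VECTOR.to_string()),
        }],
        // Everything else keeps ordinary per-element expansion.
        _ => expand_per_element(coerced, value),
    }
}
```

With a three-element array under `@vector` coercion, `expand_per_element` yields three vector-typed scalars (the corrupt shape), while `expand_vector_aware` yields a single value object holding the whole array.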
Changes
Root-cause fix
- `fluree-graph-json-ld/src/expand.rs`: `parse_node_value` now detects context-coerced `@vector` (or the `f:embeddingVector` IRI form) and wraps a JSON array as a single `{"@value":[...], "@type":"https://ns.flur.ee/db#embeddingVector"}` value-object instead of iterating it into N scalar `@value` objects. Both downstream consumers, the live transact JSON-LD path (`expand_with_context_policy`) and the bulk-import adapter (`fluree_graph_json_ld::expand`), now see one well-formed vector value.
Shared coercion pipeline
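
As a rough illustration of what a shared parser for the `"[..]"` lexical form has to do (reject non-arrays, reject empty arrays, quantize each element to f32), here is a minimal std-only sketch. `parse_vector_lexical` and `VectorError` are hypothetical names, and the hand-rolled splitting is an assumption for brevity; the changes in this section describe going through `coerce_array_to_vector` instead.

```rust
/// Errors a vector coercion step might report (hypothetical names).
#[derive(Debug, PartialEq)]
enum VectorError {
    NotAnArray,
    Empty,
    BadElement(String),
}

/// Parse a lexical form such as "[0.1, 0.2, 0.3]" into f32-quantized elements.
fn parse_vector_lexical(lexical: &str) -> Result<Vec<f32>, VectorError> {
    // A scalar lexical such as "0.5" has no brackets and is rejected here.
    let inner = lexical
        .trim()
        .strip_prefix('[')
        .and_then(|s| s.strip_suffix(']'))
        .ok_or(VectorError::NotAnArray)?;

    // Empty vectors are rejected up front.
    if inner.trim().is_empty() {
        return Err(VectorError::Empty);
    }

    inner
        .split(',')
        .map(|tok| {
            let tok = tok.trim();
            // Note: Rust's float parser also accepts "inf" and "NaN";
            // this sketch does not take a position on those.
            let n: f64 = tok
                .parse()
                .map_err(|_| VectorError::BadElement(tok.to_string()))?;
            // f32 quantization: narrow every element to single precision.
            Ok(n as f32)
        })
        .collect()
}
```

Because every entry point would call the same function, a value written as a JSON-LD array, a Turtle literal, or a SPARQL typed literal ends up with identical f32-rounded elements.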
- `fluree-db-core/src/coerce.rs`: `coerce_string_value` is now `pub` and gained an `f:embeddingVector` arm that parses the `"[..]"` lexical form via the shared `coerce_array_to_vector` (f32 quantization). `coerce_number_value` and `coerce_bool_value` now hard-error when the datatype is `f:embeddingVector` (was: silent fall-through to `Long`/`Double`/`Boolean`, the original corruption path). `coerce_array_to_vector` rejects empty arrays (they collide with the `FlakeValue::max()` sentinel, and the vector arena hard-rejects them).
- `fluree-db-transact/src/value_convert.rs`: `convert_string_literal` adds an `f:embeddingVector` arm that delegates to `core::coerce::coerce_string_value`. JSON-LD bulk import, Turtle, and SPARQL `"[..]"^^f:embeddingVector` now share a single f32-quantized parser.
- `fluree-graph-json-ld/src/adapter.rs`: `process_literal` routes vector arrays through `sink.term_literal` with a stringified lexical (Option B from the design discussion: keeps `fluree-graph-ir` stable; lets `convert_string_literal`'s shared parser do the work).
- `fluree-db-transact/src/lower_sparql_update.rs`: `coerce_typed_value` and `coerce_typed_flake_value` now produce `LiteralValue::Vector` / `FlakeValue::Vector` for `f:embeddingVector` typed literals (was: `String`).
Write-path invariant guards (hard runtime errors)
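
The guards in this section can be pictured as one check over `(value, datatype)` pairs plus a sink that remembers the first violation and reports it from `finish()`. The sketch below is only that picture: `Value`, `Datatype`, `check_pair`, and `Sink` are hypothetical simplifications, not the real `FlakeValue`, `validate_value_dt_pair`, or `FlakeSink`.

```rust
// Hypothetical, simplified stand-ins for the real value and datatype types.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Long(i64),
    Double(f64),
    Vector(Vec<f64>),
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Datatype {
    Long,
    Double,
    EmbeddingVector,
}

/// Reject mis-shaped (value, datatype) pairs and empty vectors.
fn check_pair(value: &Value, dt: Datatype) -> Result<(), String> {
    match (value, dt) {
        (Value::Vector(v), Datatype::EmbeddingVector) if v.is_empty() => {
            Err("empty vector".to_string())
        }
        (Value::Vector(_), Datatype::EmbeddingVector) => Ok(()),
        (Value::Vector(_), other) => {
            Err(format!("vector value with non-vector datatype {:?}", other))
        }
        (other, Datatype::EmbeddingVector) => {
            Err(format!("scalar {:?} tagged as embedding vector", other))
        }
        // Non-vector pairs are out of scope for this sketch.
        _ => Ok(()),
    }
}

/// A sink that keeps only the first violation instead of dropping bad rows.
#[derive(Default)]
struct Sink {
    flakes: Vec<(Value, Datatype)>,
    first_error: Option<String>,
}

impl Sink {
    fn push(&mut self, value: Value, dt: Datatype) {
        match check_pair(&value, dt) {
            Ok(()) => self.flakes.push((value, dt)),
            Err(e) => {
                // Later violations do not overwrite the first one.
                self.first_error.get_or_insert(e);
            }
        }
    }

    /// Surface the captured violation rather than returning a partial batch.
    fn finish(self) -> Result<Vec<(Value, Datatype)>, String> {
        match self.first_error {
            Some(e) => Err(e),
            None => Ok(self.flakes),
        }
    }
}
```

Returning a `Result` from `finish()` is what turns a silent drop into a hard error: a caller cannot obtain the flake list without handling the violation.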
A single validator (`transact::generate::flakes::validate_value_dt_pair`) is wired into all three sinks. It catches `(FlakeValue, datatype)` shape mismatches and empty vectors before a corrupt flake can hit the index:
- `FlakeGenerator::materialize_template`: bails the whole transaction.
- `ImportSink::push_triple`: sets `encode_error` so the bulk-import path surfaces a `CommitCodecError::InvalidOp` (was: silently dropping the spool record).
- `FlakeSink::build_flake`: captures the first violation; `finish()` now returns `Result<Vec<Flake>, TransactError>` (was: `Vec<Flake>`, silent drop). `stage_turtle_insert` propagates via `ApiError::from`.
Documentation
- `docs/reference/vocabulary.md`: the vector datatype section now documents the user-facing validation rules: element type must be number, f32 quantization, non-empty, scalar values rejected. Adds the bare-array context-coercion form alongside the explicit `@value` form. The same rules are called out as applying to JSON-LD, SPARQL, and Turtle.
Tests
- `fluree-db-api/tests/it_vector_corruption_repro.rs` (new):
  - `jsonld_context_vector_bare_array_round_trips_after_indexing`: pins the post-fix end-to-end behavior: insert via context-coerced `@vector` → publish index → SELECT returns the 4-element vector.
  - `jsonld_context_vector_empty_array_is_rejected`: pins user-facing rejection of empty `[]` vectors.
  - `jsonld_context_vector_bare_array_retracts_after_indexing` (`#[ignore]`'d, see TODO below).
  - `sparql_insert_data_embedding_vector_literal_round_trips_after_indexing` (`#[ignore]`'d, see TODO below).
- `coerce.rs` (8 new), `expand.rs` (5 new: one per vector type form, plus non-vector regression), `lower_sparql_update.rs` (2 new), `flake_sink.rs` (1 new), `generate/flakes.rs` (3 new). All passing.