Skip to content

fix: replace serde_avro_fast with apache-avro and optimize Avro reading#276

Merged
JingsongLi merged 5 commits into
apache:mainfrom
luoyuxia:fix-cargo-deny-license-check-main
Apr 23, 2026
Merged

fix: replace serde_avro_fast with apache-avro and optimize Avro reading#276
JingsongLi merged 5 commits into
apache:mainfrom
luoyuxia:fix-cargo-deny-license-check-main

Conversation

@luoyuxia
Copy link
Copy Markdown
Contributor

@luoyuxia luoyuxia commented Apr 22, 2026

Purpose

Linked issue: N/A

Replace serde_avro_fast (LGPL-3.0-only) with apache-avro to fix the license compatibility issue, and optimize the Avro reading path by eliminating the intermediate AvroValue conversion layer.

Brief change log

  • License fix: remove serde_avro_fast dependency (LGPL-3.0-only), use apache-avro for all Avro read/write paths
  • CI: bump cargo-deny from 0.14.22 to 0.19.0 so incompatible licenses fail the workflow
  • Refactor avro.rs: remove the custom AvroValue enum and avro_record_to_hash_map conversion, work directly with apache_avro::types::Value
  • Index caching: pre-compute field position once via field_index() from the first record, use O(1) indexed access for all subsequent rows instead of per-row linear search or HashMap lookup
  • Simplify to_avro_bytes: leverage the existing From<apache_avro::Error> impl to replace 4 repetitive map_err blocks with ?
  • Upgrade apache-avro: 0.17 → 0.21

Tests

  • All 482 existing tests pass
  • No behavioral changes

API and Format

No API or storage format changes.

Documentation

No.

@luoyuxia luoyuxia marked this pull request as draft April 22, 2026 11:33
@luoyuxia luoyuxia force-pushed the fix-cargo-deny-license-check-main branch 2 times, most recently from a1a73e4 to 8172dd1 Compare April 23, 2026 02:03
@luoyuxia luoyuxia marked this pull request as ready for review April 23, 2026 02:27
@luoyuxia luoyuxia force-pushed the fix-cargo-deny-license-check-main branch from ab991f3 to 9f1f4fd Compare April 23, 2026 02:35
@luoyuxia luoyuxia changed the title ci: upgrade cargo-deny to fail incompatible licenses fix: replace serde_avro_fast with apache-avro and optimize Avro reading Apr 23, 2026
@JingsongLi
Copy link
Copy Markdown
Contributor

JingsongLi commented Apr 23, 2026

@luoyuxia Thanks for the contribution!

Left comments from Claude 4.7:

  1. from_avro_bytes collects all records into Value::Array then deserializes the whole array (objects_file.rs, manifest.rs, index_manifest.rs). This creates an intermediate Vec plus a wrapping Value::Array, then from_value walks it again. For large files this doubles memory. Consider deserializing each record individually:
    reader
    .map(|r| r.map_err(Error::from).and_then(|v| from_value(&v).map_err(Error::from)))
    .collect::<Result<Vec>>()
  2. This avoids the intermediate collection entirely.
  3. get_field_at formatting — the match arm body has inconsistent brace style that may trip rustfmt:
    Value::Record(fields) => if let Some(i) = idx {
    fields.get(i).map(|(_, v)| v)
    } else {
    fields.iter().find(|(n, )| n == name).map(|(, v)| v)
    }
    .and_then(unwrap_value),
  4. The .and_then(unwrap_value) is chained on the if let expression, which is valid but hard to read. Wrapping the if let in braces or extracting to a local would be clearer.
  5. field_index returns None on empty slices — this is fine, but get_field_at with idx = None falls back to linear search. If the field genuinely doesn't exist in the schema, every row will do a linear scan and find nothing. Not a bug, but worth noting the fallback is O(n·m) in the degenerate case (n rows, m fields).
  6. build_map_column semantic change — the old code used get_field_raw (no union unwrapping) for maps, because AvroValue::Object was ambiguous between unions and maps. The new code uses get_field_at which calls unwrap_value and thus unwraps unions. With apache_avro::types::Value, Map and Union are distinct variants, so this is actually correct now. Just
    flagging it as a subtle behavioral change that's worth a test if one doesn't exist.
  7. DataUnexpected error variant (error.rs) — the message field is always set to "" in the From<apache_avro::Error> impl. Either drop the message field from this variant (use only source), or populate it with something useful like "avro error". An empty string in display("Paimon hitting unexpected avro error {}: {:?}", message, source) looks odd in logs.
  8. apache-avro 0.21 — HashMap iteration order in Value::Map — Avro maps use HashMap internally, so iteration order in build_map_column is non-deterministic. The old code also used HashMap, so this isn't a regression, but it's worth noting that map key/value ordering in the Arrow MapArray output is not guaranteed to be stable.
  9. Minor: index_manifest.rs test — the roundtrip test now manually unwraps Value::Union(_, inner). This is fine but somewhat fragile if the schema changes. Consider using from_value directly on the decoded value since from_value should handle union unwrapping.

…ue usage

Remove the custom AvroValue enum and conversion functions, working
directly with apache_avro::types::Value. This avoids per-record
HashMap allocation and recursive value conversion overhead.
Upgrade apache-avro from 0.17 to 0.21 and simplify to_avro_bytes by
leveraging the existing From<apache_avro::Error> impl to replace
repetitive map_err blocks with the ? operator.
@luoyuxia luoyuxia force-pushed the fix-cargo-deny-license-check-main branch from f075e25 to 8b88003 Compare April 23, 2026 04:15
@luoyuxia
Copy link
Copy Markdown
Contributor Author

1 & 2: fix
3: fix
4: fix
5: fix
6: not a code issue, add a new ut to cover it.
7: fix
8: don't need to fix, since this isn't a regression and not a problem
9: don't need to fix, it's only in test, the schema won't change

@JingsongLi Hi, I already address the comments.

Copy link
Copy Markdown
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@JingsongLi JingsongLi merged commit cb08370 into apache:main Apr 23, 2026
8 checks passed
luoyuxia added a commit to luoyuxia/paimon-rust that referenced this pull request Apr 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants