feat: replace apache-avro with serde_avro_fast and parallelize manifest reads #203
pub bucket: i32,
pub level: i32,
pub file_name: String,
pub extra_files: Vec<String>,
Identifier must stay aligned with the Java implementation; otherwise there will be exceptions.
luoyuxia left a comment
@JingsongLi Thanks for the PR. I left a minor comment. PTAL.
let mut deleted_entry_keys = HashSet::new();
let mut added_entries = Vec::new();

let mut map: IndexMap<Identifier, ManifestEntry> = IndexMap::with_capacity(entries.len());
I'm curious why this change is needed.
The previous implementation mostly follows py-paimon. With the current code, we may read two manifest entries like [Delete(id), Add(id)].
Because manifests are read concurrently, Delete(id) may be observed first. In that case, the later Add(id) would be kept by the current IndexMap + shift_remove logic.
However, IIUC, that is incorrect: once Delete(id) exists, the corresponding Add(id) should be filtered out rather than kept.
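To make the ordering concern concrete, here is a minimal sketch under stated assumptions. The `Identifier`, `Kind`, and `Entry` types are simplified, hypothetical stand-ins for the PR's types, and a plain `Vec` stands in for the `IndexMap` + `shift_remove` logic. It only illustrates how a single pass depends on arrival order while a two-phase filter does not; it is not the code from this PR.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified stand-in for the PR's `Identifier`.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct Identifier {
    pub bucket: i32,
    pub level: i32,
    pub file_name: String,
}

/// Hypothetical entry kind: a file is either added or deleted.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Kind {
    Add,
    Delete,
}

/// Hypothetical, simplified stand-in for `ManifestEntry`.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Entry {
    pub kind: Kind,
    pub id: Identifier,
}

/// Single pass: push on Add, remove on Delete.
/// A Delete that arrives before its Add removes nothing,
/// so the later Add survives. The result depends on order.
pub fn merge_single_pass(entries: &[Entry]) -> Vec<Identifier> {
    let mut live: Vec<Identifier> = Vec::with_capacity(entries.len());
    for e in entries {
        match e.kind {
            Kind::Add => live.push(e.id.clone()),
            Kind::Delete => live.retain(|id| id != &e.id),
        }
    }
    live
}

/// Two phases: collect every deleted identifier first,
/// then keep only the Adds that were never deleted.
/// The result does not depend on arrival order.
pub fn merge_two_phase(entries: &[Entry]) -> Vec<Identifier> {
    let deleted: HashSet<&Identifier> = entries
        .iter()
        .filter(|e| e.kind == Kind::Delete)
        .map(|e| &e.id)
        .collect();
    entries
        .iter()
        .filter(|e| e.kind == Kind::Add && !deleted.contains(&e.id))
        .map(|e| e.id.clone())
        .collect()
}
```

With the input [Delete(id), Add(id)], `merge_single_pass` keeps `id`, while `merge_two_phase` drops it for either ordering. The two-phase version needs all entries in memory before filtering, which is the trade-off discussed in this thread.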
There aren't many deleted entries, so changing this to read twice doesn't seem worthwhile: it would repeatedly read the manifests that contain the deleted entries.
If we don't read twice, the current implementation is quite reasonable, since we only need to traverse the entries once.
But I think the previous implementation is indeed safer. I will revert here.
…st reads

Switch Avro deserialization from apache-avro (Value intermediate repr) to serde_avro_fast (direct bytes→struct), eliminating redundant allocations for ~10-20x deserialization speedup. Read manifest files concurrently with buffer_unordered(64) instead of sequentially.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Purpose
Switch Avro deserialization from apache-avro (Value intermediate repr) to serde_avro_fast (direct bytes→struct), eliminating redundant allocations for ~10-20x deserialization speedup. Read manifest files concurrently with buffer_unordered(64) instead of sequentially.
Sub task of #173
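For readers unfamiliar with bounded concurrent reads, the idea can be sketched with the standard library alone. This sketch uses OS threads and a channel as a stand-in for the async `buffer_unordered(64)` mentioned above; it is not the PR's implementation. `read_all_bounded`, its parameters, and the `read` callback are hypothetical names. As with `buffer_unordered`, results come back in completion order, not input order.

```rust
use std::sync::{mpsc, Mutex};
use std::thread;

/// Hypothetical helper: applies `read` to every path using at most
/// `limit` worker threads. Results are returned in completion order,
/// which may differ from the order of `paths`.
pub fn read_all_bounded<T, F>(paths: Vec<String>, limit: usize, read: F) -> Vec<T>
where
    T: Send,
    F: Fn(&str) -> T + Sync,
{
    // Never spawn more workers than there are paths, and always at least one.
    let workers = limit.max(1).min(paths.len().max(1));
    // Shared work queue: each worker pulls the next path under the lock.
    let queue = Mutex::new(paths.into_iter());
    let (tx, rx) = mpsc::channel();

    thread::scope(|s| {
        for _ in 0..workers {
            let tx = tx.clone();
            let queue = &queue;
            let read = &read;
            s.spawn(move || loop {
                // The lock guard is dropped at the end of this statement,
                // so `read` runs without holding the lock.
                let next = queue.lock().unwrap().next();
                match next {
                    Some(path) => {
                        let _ = tx.send(read(&path));
                    }
                    None => break,
                }
            });
        }
    });

    // All workers have finished; drop the last sender so the
    // receiving iterator terminates once the buffer is drained.
    drop(tx);
    rx.iter().collect()
}
```

A call such as `read_all_bounded(paths, 64, |p| p.len())` would process the paths with up to 64 workers. Because completion order is not deterministic, any logic that consumes the results has to be order-independent, which is the point raised in the review thread.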
Brief change log
Tests
API and Format
Documentation