Conversation
Previously, the column writer accumulated raw definition and repetition levels in `Vec<i16>` sinks (`def_levels_sink` / `rep_levels_sink`) and only RLE-encoded them in bulk at page-flush time. Replace the two sinks with streaming `LevelEncoder` fields. Levels are now encoded incrementally as each `write_batch` call arrives, so only the compact encoded bytes are held in memory. At page flush, the encoder is consumed and its bytes are written directly into the page buffer; a fresh encoder is swapped in for the next page. Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
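To illustrate the streaming approach described in the commit message above, here is a minimal sketch. It is not the parquet crate's `LevelEncoder`: the types `ToyLevelEncoder` and `ToyColumnWriter`, the method names, and the byte layout (a 4-byte little-endian run length followed by a 2-byte little-endian level) are all assumptions made for the example. It only shows the shape of the idea: levels are folded into runs as each batch arrives, and the encoder is swapped for a fresh one at page flush.

```rust
/// Hypothetical toy encoder, not the parquet crate's `LevelEncoder`.
/// Each run is stored as a 4-byte LE count followed by a 2-byte LE level.
#[derive(Default)]
pub struct ToyLevelEncoder {
    bytes: Vec<u8>,          // encoded runs emitted so far
    run: Option<(i16, u32)>, // run still being extended: (level, count)
}

impl ToyLevelEncoder {
    /// Encode a batch incrementally; no raw `Vec<i16>` is retained.
    pub fn put(&mut self, levels: &[i16]) {
        for &level in levels {
            if let Some((value, count)) = self.run.as_mut() {
                if *value == level {
                    *count += 1;
                    continue;
                }
            }
            // The level changed (or this is the first one): close the old run.
            self.emit_run();
            self.run = Some((level, 1));
        }
    }

    fn emit_run(&mut self) {
        if let Some((value, count)) = self.run.take() {
            self.bytes.extend_from_slice(&count.to_le_bytes());
            self.bytes.extend_from_slice(&value.to_le_bytes());
        }
    }

    /// Consume the encoder and return the encoded bytes.
    pub fn consume(mut self) -> Vec<u8> {
        self.emit_run();
        self.bytes
    }
}

/// Hypothetical writer that holds a streaming encoder instead of a sink.
#[derive(Default)]
pub struct ToyColumnWriter {
    def_levels: ToyLevelEncoder,
}

impl ToyColumnWriter {
    pub fn write_batch(&mut self, def_levels: &[i16]) {
        self.def_levels.put(def_levels);
    }

    /// At page flush, swap in a fresh encoder and consume the old one.
    pub fn flush_page(&mut self) -> Vec<u8> {
        let encoder = std::mem::replace(&mut self.def_levels, ToyLevelEncoder::default());
        encoder.consume()
    }
}
```

In this sketch a run that spans two `write_batch` calls is still encoded as a single run, because the pending run lives in the encoder rather than in the batch.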
Previously, flushing a data page would `mem::replace` each level encoder with a freshly allocated one, consuming the old encoder to get its buffer. This allocated new internal `Vec`s on every page boundary. We now preserve the internal state of the encoder and reuse memory across pages. Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
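The buffer-reuse idea in the commit message above can be sketched as follows. This is an illustration under stated assumptions, not the actual change: `ReusableLevelEncoder`, `flush_into`, and the toy byte layout (4-byte little-endian run length, 2-byte little-endian level) are hypothetical. The point is that `Vec::clear` drops the contents but keeps the allocation, so the next page writes into the same memory.

```rust
/// Hypothetical toy encoder that is reset in place instead of replaced.
#[derive(Default)]
pub struct ReusableLevelEncoder {
    bytes: Vec<u8>,          // encoded runs for the current page
    run: Option<(i16, u32)>, // run still being extended: (level, count)
}

impl ReusableLevelEncoder {
    pub fn put(&mut self, levels: &[i16]) {
        for &level in levels {
            if let Some((value, count)) = self.run.as_mut() {
                if *value == level {
                    *count += 1;
                    continue;
                }
            }
            self.emit_run();
            self.run = Some((level, 1));
        }
    }

    fn emit_run(&mut self) {
        if let Some((value, count)) = self.run.take() {
            self.bytes.extend_from_slice(&count.to_le_bytes());
            self.bytes.extend_from_slice(&value.to_le_bytes());
        }
    }

    /// Append the encoded bytes to `page`, then reset for the next page.
    /// `clear` keeps the capacity, so no new `Vec` is allocated here.
    pub fn flush_into(&mut self, page: &mut Vec<u8>) {
        self.emit_run();
        page.extend_from_slice(&self.bytes);
        self.bytes.clear();
    }

    /// Exposed only so the example can observe that the buffer is reused.
    pub fn capacity(&self) -> usize {
        self.bytes.capacity()
    }
}
```

Compared with swapping in a fresh encoder via `mem::replace`, the trade-off in this sketch is one extra copy into the page buffer in exchange for no allocation at each page boundary.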
The benchmark failure appears to happen on main as well; I will file a follow-on ticket.
The `v1`, `v2`, and `max_buffer_size` functions required knowing the number of values upfront and pre-allocated buffers. All callers have been migrated to the streaming variants (`v1_streaming`, `v2_streaming`), so remove the dead code and switch the last remaining caller in test utils. Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
Force-pushed from 4d2010c to d55ca60.
🤖 Arrow criterion benchmark running (GKE). Comparing alamb/test_encoding_speed (d55ca60) to 1f07e54 (merge-base).
run benchmark arrow_writer |
🤖 Arrow criterion benchmark completed (GKE).
This is a dummy PR that contains all the changes, so that I can run the benchmarks on it automatically. The benchmark runner currently doesn't support branches called `main`.