[core] Add mosaic-core crate with format spec, reader, and writer#1
Conversation
Initial contribution of the Mosaic file format core library: - Format specification (spec.rs): magic bytes, version, footer layout, bucket layouts (monolithic and paged), encoding types - Schema (schema.rs): column metadata, bucket assignment, serialization - Writer (writer.rs, bucket_writer.rs): row-group-based writer with automatic encoding selection (const, dict, plain), zstd compression, and paged bucket support for wide tables - Reader (reader.rs, bucket_reader.rs): two-round IO strategy with range coalescing, column projection, paged bucket optimization - Types (types.rs, values.rs): type system mapping to Arrow types - Utilities (varint.rs, bpe.rs): variable-length integer encoding, byte-pair encoding for string compression
leaves12138
left a comment
There was a problem hiding this comment.
Thanks for the initial core contribution. The overall direction of the storage layout and the bucket/page abstraction looks reasonable to me, but I think we should fix the following before merging.
-
Blocking:
cargo fmt --all -- --checkcurrently fails.
On the current PR head (448e02c), rustfmt reports a formatting diff incore/src/reader.rsat EOF (new blank line at EOF). Please runcargo fmt --alland update the PR. This will also make the new GitHub Actionsfmtjob pass. -
Please add reader roundtrip coverage.
cargo test --workspace --all-targetscurrently passes, but it only runs 22 tests and they cover BPE, schema, bucket writer, and writer behavior. This PR introduces the fullMosaicReader, row-group reader, paged bucket reader, projection path, and type deserialization, but those paths are not directly covered by committed tests. I would like to see at least a few integration-style tests that write a file and then reopen it withMosaicReader, for example:- basic roundtrip with nulls / booleans / strings,
- paged-bucket roundtrip with column projection,
- decimal and timestamp-with-timezone schema/data roundtrip.
Validation I ran locally:
cargo fmt --all -- --check: failed as described above.cargo clippy --workspace --all-targets -- -D warnings: passed.cargo test --workspace --all-targets: passed.- I also ran a few temporary local reader roundtrip tests for nulls/booleans/strings, paged projection, decimal, and timestamp timezone; those passed, so this is mainly a committed coverage gap rather than a known reader correctness failure.
|
leaves12138
left a comment
There was a problem hiding this comment.
Thanks for the update. I re-ran the local checks on the latest head (6cb483a):
cargo fmt --all -- --check: passedcargo clippy --workspace --all-targets -- -D warnings: passedcargo test --workspace --all-targets: passed (22 tests)
The formatting blocker is fixed. I am OK with deferring the reader roundtrip coverage to the next PR, but please make sure that follow-up includes committed tests for basic read-after-write, paged projection, and type roundtrips such as decimal and timestamp. Approving this initial core PR.
Initial contribution of the Mosaic file format core library: