Skip to content

[core] Add mosaic-core crate with format spec, reader, and writer#1

Merged
JingsongLi merged 3 commits into
apache:mainfrom
JingsongLi:add-mosaic-core
May 18, 2026
Merged

[core] Add mosaic-core crate with format spec, reader, and writer#1
JingsongLi merged 3 commits into
apache:mainfrom
JingsongLi:add-mosaic-core

Conversation

@JingsongLi
Copy link
Copy Markdown
Contributor

Initial contribution of the Mosaic file format core library:

  • Format specification (spec.rs): magic bytes, version, footer layout, bucket layouts (monolithic and paged), encoding types
  • Schema (schema.rs): column metadata, bucket assignment, serialization
  • Writer (writer.rs, bucket_writer.rs): row-group-based writer with automatic encoding selection (const, dict, plain), zstd compression, and paged bucket support for wide tables
  • Reader (reader.rs, bucket_reader.rs): two-round IO strategy with range coalescing, column projection, paged bucket optimization
  • Types (types.rs, values.rs): type system mapping to Arrow types
  • Utilities (varint.rs, bpe.rs): variable-length integer encoding, byte-pair encoding for string compression

Initial contribution of the Mosaic file format core library:
- Format specification (spec.rs): magic bytes, version, footer layout,
  bucket layouts (monolithic and paged), encoding types
- Schema (schema.rs): column metadata, bucket assignment, serialization
- Writer (writer.rs, bucket_writer.rs): row-group-based writer with
  automatic encoding selection (const, dict, plain), zstd compression,
  and paged bucket support for wide tables
- Reader (reader.rs, bucket_reader.rs): two-round IO strategy with
  range coalescing, column projection, paged bucket optimization
- Types (types.rs, values.rs): type system mapping to Arrow types
- Utilities (varint.rs, bpe.rs): variable-length integer encoding,
  byte-pair encoding for string compression
Copy link
Copy Markdown

@leaves12138 leaves12138 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the initial core contribution. The overall direction of the storage layout and the bucket/page abstraction looks reasonable to me, but I think we should fix the following before merging.

  1. Blocking: cargo fmt --all -- --check currently fails.
    On the current PR head (448e02c), rustfmt reports a formatting diff in core/src/reader.rs at EOF (new blank line at EOF). Please run cargo fmt --all and update the PR. This will also make the new GitHub Actions fmt job pass.

  2. Please add reader roundtrip coverage.
    cargo test --workspace --all-targets currently passes, but it only runs 22 tests and they cover BPE, schema, bucket writer, and writer behavior. This PR introduces the full MosaicReader, row-group reader, paged bucket reader, projection path, and type deserialization, but those paths are not directly covered by committed tests. I would like to see at least a few integration-style tests that write a file and then reopen it with MosaicReader, for example:

    • basic roundtrip with nulls / booleans / strings,
    • paged-bucket roundtrip with column projection,
    • decimal and timestamp-with-timezone schema/data roundtrip.

Validation I ran locally:

  • cargo fmt --all -- --check: failed as described above.
  • cargo clippy --workspace --all-targets -- -D warnings: passed.
  • cargo test --workspace --all-targets: passed.
  • I also ran a few temporary local reader roundtrip tests for nulls/booleans/strings, paged projection, decimal, and timestamp timezone; those passed, so this is mainly a committed coverage gap rather than a known reader correctness failure.

@JingsongLi
Copy link
Copy Markdown
Contributor Author

@leaves12138

  1. cargo fmt --all -- --check: fixed.
  2. reader roundtrip coverage: I will add tests in next PR.

Copy link
Copy Markdown

@leaves12138 leaves12138 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. I re-ran the local checks on the latest head (6cb483a):

  • cargo fmt --all -- --check: passed
  • cargo clippy --workspace --all-targets -- -D warnings: passed
  • cargo test --workspace --all-targets: passed (22 tests)

The formatting blocker is fixed. I am OK with deferring the reader roundtrip coverage to the next PR, but please make sure that follow-up includes committed tests for basic read-after-write, paged projection, and type roundtrips such as decimal and timestamp. Approving this initial core PR.

@JingsongLi JingsongLi merged commit f7dc71a into apache:main May 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants