Add docs: design, rust-api, cpp-api, java-api and python-api by JingsongLi · Pull Request #9 · apache/paimon-mosaic

JingsongLi · 2026-05-19T08:54:36Z

No description provided.

leaves12138

Thanks for adding the docs. I focused on whether the documentation matches the current implementation and found several mismatches that should be fixed before merge. Local static link validation passed; the issues below are about API/format accuracy.

leaves12138 · 2026-05-19T09:08:19Z

+varint   sharedPrefixLen       (bytes shared with previous column name)
+varint   suffixLen             (bytes of new suffix)
+bytes    suffix                (suffixLen bytes, raw or BPE-encoded)
+TypeDescriptor</code></pre>


This schema layout is missing the serialized logical/global column index. The implementation writes global_idx between suffix and TypeDescriptor (core/src/schema.rs), and the reader decodes it before deserialize_field. Without documenting that varint, an independent reader/writer based on this spec would be off by one field and could not preserve the original schema column order.
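
A minimal sketch of the corrected per-column entry, with global_idx written between the suffix and the TypeDescriptor as this comment describes. Several details are assumptions rather than facts from the codebase: "varint" is taken to be unsigned LEB128, the suffix is written raw (no BPE), the first column is assumed to have an empty previous name, and the TypeDescriptor is passed in as opaque bytes. The names put_varint, shared_prefix_len and encode_column_entry are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Assumption: "varint" means unsigned LEB128 (7 data bits per byte, low bits first).
void put_varint(std::vector<std::uint8_t>& out, std::uint64_t v) {
    while (v >= 0x80) {
        out.push_back(static_cast<std::uint8_t>((v & 0x7F) | 0x80));
        v >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(v));
}

// Number of leading bytes that cur shares with prev.
std::size_t shared_prefix_len(const std::string& prev, const std::string& cur) {
    std::size_t n = 0;
    while (n < prev.size() && n < cur.size() && prev[n] == cur[n]) {
        ++n;
    }
    return n;
}

// Hypothetical encoder for one schema column entry. Field order:
// sharedPrefixLen, suffixLen, suffix, global_idx, TypeDescriptor.
std::vector<std::uint8_t> encode_column_entry(
        const std::string& prev_name,
        const std::string& name,
        std::uint64_t global_idx,
        const std::vector<std::uint8_t>& type_descriptor) {
    std::vector<std::uint8_t> out;
    const std::size_t shared = shared_prefix_len(prev_name, name);
    const std::string suffix = name.substr(shared);
    put_varint(out, shared);                              // sharedPrefixLen
    put_varint(out, suffix.size());                       // suffixLen
    out.insert(out.end(), suffix.begin(), suffix.end());  // suffix (raw bytes)
    put_varint(out, global_idx);                          // the field the spec omitted
    out.insert(out.end(), type_descriptor.begin(), type_descriptor.end());
    return out;
}
```

Under these assumptions, encoding "user_name" after "user_id" with global_idx 3 yields a shared prefix of 5, a 4-byte suffix "name", then the index byte 3, then the descriptor bytes. A reader that skips the index would treat that byte as the start of the TypeDescriptor.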

leaves12138 · 2026-05-19T09:08:19Z

+
+            <h3>Reading Stats</h3>
+<pre><code><span class="kw">for</span> rg_idx <span class="kw">in</span> <span class="num">0</span>..reader.num_row_groups() {
+    <span class="kw">let</span> stats = reader.row_group_stats(rg_idx);


row_group_stats returns io::Result<&[ColumnStats]> in ReaderAccess, so this snippet does not compile as written. It should unwrap/propagate the result before iterating, e.g. let stats = reader.row_group_stats(rg_idx)?; or .unwrap() in a sample.

leaves12138 · 2026-05-19T09:08:19Z

+            <span class="kw">return</span> (written == len) ? <span class="num">0</span> : <span class="num">-1</span>;
+        };
+        cbs.flush_fn = [file]() -&gt; <span class="kw">int</span> { <span class="kw">return</span> std::fflush(file.get()); };
+        cbs.get_pos_fn = [file, pos]() <span class="kw">mutable</span> -&gt; <span class="kw">int64_t</span> { <span class="kw">return</span> pos; };


The two lambdas capture separate copies of pos, so get_pos_fn always returns its own unchanged copy (0). The writer relies on get_pos_fn for file offsets, so this example can produce an invalid file. The tested code uses shared buffer state (buf.pos) for both callbacks; this snippet should do the same or use ftell/shared state.
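
The capture problem can be reproduced in isolation. The sketch below uses a hypothetical Callbacks struct (only the get_pos_fn name comes from the snippet under review; write_fn and the signatures are assumed) and compares by-value capture against state shared through std::shared_ptr, which is one possible fix among those suggested above.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>

// Hypothetical stand-in for the real callback struct.
struct Callbacks {
    std::function<int(const std::uint8_t*, std::size_t)> write_fn;
    std::function<std::int64_t()> get_pos_fn;
};

// Buggy pattern: each lambda captures its own copy of pos, so the
// position reported by get_pos_fn never advances.
Callbacks make_broken_callbacks() {
    std::int64_t pos = 0;
    Callbacks cbs;
    cbs.write_fn = [pos](const std::uint8_t*, std::size_t len) mutable -> int {
        pos += static_cast<std::int64_t>(len);  // updates write_fn's private copy
        return 0;
    };
    cbs.get_pos_fn = [pos]() mutable -> std::int64_t { return pos; };  // separate copy
    return cbs;
}

// Fixed pattern: both lambdas share one counter through a shared_ptr.
Callbacks make_shared_callbacks() {
    auto pos = std::make_shared<std::int64_t>(0);
    Callbacks cbs;
    cbs.write_fn = [pos](const std::uint8_t*, std::size_t len) -> int {
        *pos += static_cast<std::int64_t>(len);
        return 0;
    };
    cbs.get_pos_fn = [pos]() -> std::int64_t { return *pos; };
    return cbs;
}
```

After a single 10-byte write, the broken variant still reports position 0 while the shared variant reports 10.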

leaves12138 · 2026-05-19T09:08:19Z

+            </p>
+<pre><code><span class="cmt">// Writing with stats (arrow_schema is an ArrowSchema* from C Data Interface)</span>
+<span class="ty">mosaic</span>::<span class="ty">Writer</span> writer(std::move(cbs), arrow_schema, {
+    .compression = <span class="num">1</span>,


This says "Writing with stats" but it only sets compression; stats_columns and num_stats_columns remain unset, so get_row_group_statistics will return an empty vector. Please include a uint32_t stats_cols[] = { ... } and set both opts.stats_columns and opts.num_stats_columns.
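
A rough sketch of what the requested options setup could look like. The WriterOptions struct here is a hypothetical stand-in: the field names compression, stats_columns and num_stats_columns come from this thread, but their types, defaults, and the column indices are assumptions.

```cpp
#include <cstddef>
#include <cstdint>

namespace stats_sketch {

// Hypothetical stand-in for the real options struct; field types are assumed.
struct WriterOptions {
    int compression = 0;
    const std::uint32_t* stats_columns = nullptr;
    std::size_t num_stats_columns = 0;
};

// Column indices to collect statistics for (illustrative values). Static
// storage keeps the array alive for as long as the options point at it.
static const std::uint32_t kStatsCols[] = {0, 2};

inline WriterOptions make_options_with_stats() {
    WriterOptions opts;
    opts.compression = 1;
    opts.stats_columns = kStatsCols;  // which columns get statistics
    opts.num_stats_columns = sizeof(kStatsCols) / sizeof(kStatsCols[0]);
    return opts;
}

}  // namespace stats_sketch
```

Setting the pointer without the count (or the reverse) would leave the writer with no usable column list, which is why both fields are assigned together.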

leaves12138 · 2026-05-19T09:08:19Z

+<span class="ty">Schema</span> arrowSchema = <span class="kw">new</span> <span class="ty">Schema</span>(<span class="ty">Arrays</span>.asList(
+    <span class="ty">Field</span>.notNullable(<span class="str">"id"</span>, <span class="kw">new</span> <span class="ty">ArrowType.Int</span>(<span class="num">32</span>, <span class="kw">true</span>)),
+    <span class="ty">Field</span>.nullable(<span class="str">"name"</span>, <span class="ty">ArrowType.Utf8</span>.INSTANCE),
+    <span class="ty">Field</span>.nullable(<span class="str">"score"</span>, <span class="kw">new</span> <span class="ty">ArrowType.FloatingPoint</span>(DOUBLE)),


The Java snippets use new ArrowType.FloatingPoint(DOUBLE), but DOUBLE is not in scope with the shown imports. The implementation/tests use FloatingPointPrecision.DOUBLE (or a static import would be required). Please update all Java snippets that use DOUBLE so users can copy/paste them.

leaves12138 · 2026-05-19T09:09:13Z

+                    <tr><td>14</td><td>DECIMAL</td><td>varint precision, varint scale</td></tr>
+                    <tr><td>15</td><td>TIME</td><td>varint precision</td></tr>
+                    <tr><td>16</td><td>TIMESTAMP</td><td>varint precision</td></tr>
+                    <tr><td>17</td><td>TIMESTAMP_LTZ</td><td>varint precision</td></tr>


For TIMESTAMP_LTZ/Timestamp-with-timezone, the implementation writes more than just precision: it also writes the timezone string length and bytes (core/src/types.rs). The spec should include varint timezoneLength and bytes timezone, otherwise the documented type descriptor does not match files written by the current code.
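
To make the extra fields concrete, here is a sketch of the TIMESTAMP_LTZ parameters as this comment describes them: precision, then timezone length, then timezone bytes. The exact field order, the LEB128 varint encoding, and the function names are assumptions; the type id itself is left out because its framing is not shown in this thread.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Assumption: "varint" means unsigned LEB128 (7 data bits per byte, low bits first).
void put_varint(std::vector<std::uint8_t>& out, std::uint64_t v) {
    while (v >= 0x80) {
        out.push_back(static_cast<std::uint8_t>((v & 0x7F) | 0x80));
        v >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(v));
}

// Hypothetical encoder for the TIMESTAMP_LTZ parameters:
// varint precision, varint timezoneLength, bytes timezone.
std::vector<std::uint8_t> encode_timestamp_ltz_params(std::uint32_t precision,
                                                      const std::string& timezone) {
    std::vector<std::uint8_t> out;
    put_varint(out, precision);
    put_varint(out, timezone.size());
    out.insert(out.end(), timezone.begin(), timezone.end());
    return out;
}
```

With precision 6 and timezone "UTC" this produces five bytes under the stated assumptions: 6, 3, then the three timezone bytes. A reader that follows the precision-only table would stop after the first byte and misparse everything that follows.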

leaves12138

Thanks for the update. The previous issues look fixed, but I found two remaining doc/code mismatches that should be addressed before merge.

leaves12138 · 2026-05-19T09:52:49Z

+        arrow::ExportRecordBatch(*batch, &amp;ffi_array, &amp;ffi_schema);
+
+        <span class="ty">mosaic</span>::<span class="ty">Writer</span> writer(std::move(cbs), &amp;ffi_schema, {
+            .num_buckets = <span class="num">2</span>,


The C++ docs still say to compile with -std=c++17, but the examples use C++ designated initializers, which are a C++20 feature. This initializer is also out of declaration order for C++20 (WriterOptions declares compression before num_buckets), so strict compilation fails. Please either switch the docs/build commands to C++20 and order designators by the struct declaration, or avoid designated initializers and use WriterOptions opts; opts.num_buckets = 2; opts.compression = 1; ... so the examples match the documented C++17 build command.
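
The C++17-compatible alternative suggested here can be sketched as follows. WriterOptions is a hypothetical stand-in that keeps only the two fields named in this comment, declared in the order the comment gives (compression before num_buckets); the field types and defaults are assumed.

```cpp
namespace cpp17_sketch {

// Hypothetical stand-in; the real struct has more fields.
struct WriterOptions {
    int compression = 0;
    int num_buckets = 1;
};

// Plain member assignment compiles under -std=c++17 and does not depend
// on declaration order, unlike C++20 designated initializers, which
// would have to be written as {.compression = 1, .num_buckets = 2}.
inline WriterOptions make_options() {
    WriterOptions opts;
    opts.num_buckets = 2;
    opts.compression = 1;
    return opts;
}

}  // namespace cpp17_sketch
```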

leaves12138 · 2026-05-19T09:52:50Z

+                    <tr><td><code>VarBinary(n)</code></td><td>variable</td><td>Variable-length byte array with max length</td></tr>
+                    <tr><td><code>Bytes</code></td><td>variable</td><td>Unbounded byte array</td></tr>
+                    <tr><td><code>Decimal(p, s)</code></td><td>8 or variable</td><td>Exact numeric; compact (p&le;18) or large</td></tr>
+                    <tr><td><code>Timestamp(p)</code></td><td>8 or 12</td><td>Millis (p&le;3) or millis + nanos (p&gt;3)</td></tr>


This Timestamp summary does not match the implementation: Timestamp(Millisecond) and Timestamp(Microsecond) are both stored as 8-byte values, while only precision > 6 uses the 12-byte {millis, nanos_of_milli} representation. The detailed design page says this correctly; the home page should say something like millis (p <= 3), micros (p <= 6), or millis + nanos (p > 6).
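
The precision rule stated in this comment can be written as a small lookup. The enum and function names are hypothetical; only the thresholds (p <= 3, p <= 6, p > 6) and the 8-byte and 12-byte widths come from the comment.

```cpp
#include <cstddef>

// Storage layouts for Timestamp(p), as described in the review comment.
enum class TimestampLayout { Millis, Micros, MillisPlusNanos };

// Maps precision p to the layout the comment describes.
TimestampLayout timestamp_layout(int precision) {
    if (precision <= 3) return TimestampLayout::Millis;
    if (precision <= 6) return TimestampLayout::Micros;
    return TimestampLayout::MillisPlusNanos;  // {millis, nanos_of_milli}
}

// Millis and micros are both 8-byte values; only p > 6 uses 12 bytes.
std::size_t timestamp_width_bytes(int precision) {
    return timestamp_layout(precision) == TimestampLayout::MillisPlusNanos ? 12 : 8;
}
```

The home-page table put the 12-byte boundary at p > 3; with the rule above, Timestamp(6) is still 8 bytes wide.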

leaves12138

Thanks for the update. The previously reported documentation/API mismatches are fixed now. I rechecked the C++ examples against the documented C++17 command, the timestamp description, schema layout, stats examples, Rust Result handling, and Java FloatingPointPrecision usage. Static docs link validation also passed. CI is still queued at review time.

Add docs: design, rust-api, cpp-api, java-api and python-api

81e10da

leaves12138 requested changes May 19, 2026

leaves12138 reviewed May 19, 2026

Fix comments

487c228

leaves12138 requested changes May 19, 2026

Fix comments

46a5dff

leaves12138 approved these changes May 19, 2026

JingsongLi merged commit 6b5ef18 into apache:main May 19, 2026
5 checks passed

JingsongLi merged 3 commits into apache:main from JingsongLi:docs
