Skip to content

Add docs: design, rust-api, cpp-api, java-api and python-api#9

Merged
JingsongLi merged 3 commits into
apache:mainfrom
JingsongLi:docs
May 19, 2026
Merged

Add docs: design, rust-api, cpp-api, java-api and python-api#9
JingsongLi merged 3 commits into
apache:mainfrom
JingsongLi:docs

Conversation

@JingsongLi
Copy link
Copy Markdown
Contributor

No description provided.

Copy link
Copy Markdown

@leaves12138 leaves12138 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the docs. I focused on whether the documentation matches the current implementation and found several mismatches that should be fixed before merge. Local static link validation passed; the issues below are about API/format accuracy.

Comment thread docs/design.html
varint sharedPrefixLen (bytes shared with previous column name)
varint suffixLen (bytes of new suffix)
bytes suffix (suffixLen bytes, raw or BPE-encoded)
TypeDescriptor</code></pre>
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This schema layout is missing the serialized logical/global column index. The implementation writes global_idx between suffix and TypeDescriptor (core/src/schema.rs), and the reader decodes it before deserialize_field. Without documenting that varint, an independent reader/writer based on this spec would be off by one field and could not preserve the original schema column order.

Comment thread docs/rust-api.html Outdated

<h3>Reading Stats</h3>
<pre><code><span class="kw">for</span> rg_idx <span class="kw">in</span> <span class="num">0</span>..reader.num_row_groups() {
<span class="kw">let</span> stats = reader.row_group_stats(rg_idx);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

row_group_stats returns io::Result<&[ColumnStats]> in ReaderAccess, so this snippet does not compile as written. It should unwrap/propagate the result before iterating, e.g. let stats = reader.row_group_stats(rg_idx)?; or .unwrap() in a sample.

Comment thread docs/cpp-api.html Outdated
<span class="kw">return</span> (written == len) ? <span class="num">0</span> : <span class="num">-1</span>;
};
cbs.flush_fn = [file]() -&gt; <span class="kw">int</span> { <span class="kw">return</span> std::fflush(file.get()); };
cbs.get_pos_fn = [file, pos]() <span class="kw">mutable</span> -&gt; <span class="kw">int64_t</span> { <span class="kw">return</span> pos; };
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two lambdas capture separate copies of pos, so get_pos_fn always returns its own unchanged copy (0). The writer relies on get_pos_fn for file offsets, so this example can produce an invalid file. The tested code uses shared buffer state (buf.pos) for both callbacks; this snippet should do the same or use ftell/shared state.

Comment thread docs/cpp-api.html Outdated
</p>
<pre><code><span class="cmt">// Writing with stats (arrow_schema is an ArrowSchema* from C Data Interface)</span>
<span class="ty">mosaic</span>::<span class="ty">Writer</span> writer(std::move(cbs), arrow_schema, {
.compression = <span class="num">1</span>,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This says "Writing with stats" but it only sets compression; stats_columns and num_stats_columns remain unset, so get_row_group_statistics will return an empty vector. Please include a uint32_t stats_cols[] = { ... } and set both opts.stats_columns and opts.num_stats_columns.

Comment thread docs/java-api.html Outdated
<span class="ty">Schema</span> arrowSchema = <span class="kw">new</span> <span class="ty">Schema</span>(<span class="ty">Arrays</span>.asList(
<span class="ty">Field</span>.notNullable(<span class="str">"id"</span>, <span class="kw">new</span> <span class="ty">ArrowType.Int</span>(<span class="num">32</span>, <span class="kw">true</span>)),
<span class="ty">Field</span>.nullable(<span class="str">"name"</span>, <span class="ty">ArrowType.Utf8</span>.INSTANCE),
<span class="ty">Field</span>.nullable(<span class="str">"score"</span>, <span class="kw">new</span> <span class="ty">ArrowType.FloatingPoint</span>(DOUBLE)),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Java snippets use new ArrowType.FloatingPoint(DOUBLE), but DOUBLE is not in scope with the shown imports. The implementation/tests use FloatingPointPrecision.DOUBLE (or a static import would be required). Please update all Java snippets that use DOUBLE so users can copy/paste them.

Comment thread docs/design.html Outdated
<tr><td>14</td><td>DECIMAL</td><td>varint precision, varint scale</td></tr>
<tr><td>15</td><td>TIME</td><td>varint precision</td></tr>
<tr><td>16</td><td>TIMESTAMP</td><td>varint precision</td></tr>
<tr><td>17</td><td>TIMESTAMP_LTZ</td><td>varint precision</td></tr>
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For TIMESTAMP_LTZ/Timestamp-with-timezone, the implementation writes more than just precision: it also writes the timezone string length and bytes (core/src/types.rs). The spec should include varint timezoneLength and bytes timezone, otherwise the documented type descriptor does not match files written by the current code.

Copy link
Copy Markdown

@leaves12138 leaves12138 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. The previous issues look fixed, but I found two remaining doc/code mismatches that should be addressed before merge.

Comment thread docs/cpp-api.html Outdated
arrow::ExportRecordBatch(*batch, &amp;ffi_array, &amp;ffi_schema);

<span class="ty">mosaic</span>::<span class="ty">Writer</span> writer(std::move(cbs), &amp;ffi_schema, {
.num_buckets = <span class="num">2</span>,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The C++ docs still say to compile with -std=c++17, but the examples use C++ designated initializers, which are a C++20 feature. This initializer is also out of declaration order for C++20 (WriterOptions declares compression before num_buckets), so strict compilation fails. Please either switch the docs/build commands to C++20 and order designators by the struct declaration, or avoid designated initializers and use WriterOptions opts; opts.num_buckets = 2; opts.compression = 1; ... so the examples match the documented C++17 build command.

Comment thread docs/index.html Outdated
<tr><td><code>VarBinary(n)</code></td><td>variable</td><td>Variable-length byte array with max length</td></tr>
<tr><td><code>Bytes</code></td><td>variable</td><td>Unbounded byte array</td></tr>
<tr><td><code>Decimal(p, s)</code></td><td>8 or variable</td><td>Exact numeric; compact (p&le;18) or large</td></tr>
<tr><td><code>Timestamp(p)</code></td><td>8 or 12</td><td>Millis (p&le;3) or millis + nanos (p&gt;3)</td></tr>
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This Timestamp summary does not match the implementation: Timestamp(Millisecond) and Timestamp(Microsecond) are both stored as 8-byte values, while only precision > 6 uses the 12-byte {millis, nanos_of_milli} representation. The detailed design page says this correctly; the home page should say something like millis (p <= 3), micros (p <= 6), or millis + nanos (p > 6).

Copy link
Copy Markdown

@leaves12138 leaves12138 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. The previously reported documentation/API mismatches are fixed now. I rechecked the C++ examples against the documented C++17 command, the timestamp description, schema layout, stats examples, Rust Result handling, and Java FloatingPointPrecision usage. Static docs link validation also passed. CI is still queued at review time.

@JingsongLi JingsongLi merged commit 6b5ef18 into apache:main May 19, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants