Add docs: design, rust-api, cpp-api, java-api and python-api#9
Conversation
leaves12138
left a comment
There was a problem hiding this comment.
Thanks for adding the docs. I focused on whether the documentation matches the current implementation and found several mismatches that should be fixed before merge. Local static link validation passed; the issues below are about API/format accuracy.
| varint sharedPrefixLen (bytes shared with previous column name) | ||
| varint suffixLen (bytes of new suffix) | ||
| bytes suffix (suffixLen bytes, raw or BPE-encoded) | ||
| TypeDescriptor</code></pre> |
There was a problem hiding this comment.
This schema layout is missing the serialized logical/global column index. The implementation writes global_idx between suffix and TypeDescriptor (core/src/schema.rs), and the reader decodes it before deserialize_field. Without documenting that varint, an independent reader/writer based on this spec would be off by one field and could not preserve the original schema column order.
|
|
||
| <h3>Reading Stats</h3> | ||
| <pre><code><span class="kw">for</span> rg_idx <span class="kw">in</span> <span class="num">0</span>..reader.num_row_groups() { | ||
| <span class="kw">let</span> stats = reader.row_group_stats(rg_idx); |
There was a problem hiding this comment.
row_group_stats returns io::Result<&[ColumnStats]> in ReaderAccess, so this snippet does not compile as written. It should unwrap/propagate the result before iterating, e.g. let stats = reader.row_group_stats(rg_idx)?; or .unwrap() in a sample.
| <span class="kw">return</span> (written == len) ? <span class="num">0</span> : <span class="num">-1</span>; | ||
| }; | ||
| cbs.flush_fn = [file]() -> <span class="kw">int</span> { <span class="kw">return</span> std::fflush(file.get()); }; | ||
| cbs.get_pos_fn = [file, pos]() <span class="kw">mutable</span> -> <span class="kw">int64_t</span> { <span class="kw">return</span> pos; }; |
There was a problem hiding this comment.
The two lambdas capture separate copies of pos, so get_pos_fn always returns its own unchanged copy (0). The writer relies on get_pos_fn for file offsets, so this example can produce an invalid file. The tested code uses shared buffer state (buf.pos) for both callbacks; this snippet should do the same or use ftell/shared state.
| </p> | ||
| <pre><code><span class="cmt">// Writing with stats (arrow_schema is an ArrowSchema* from C Data Interface)</span> | ||
| <span class="ty">mosaic</span>::<span class="ty">Writer</span> writer(std::move(cbs), arrow_schema, { | ||
| .compression = <span class="num">1</span>, |
There was a problem hiding this comment.
This says "Writing with stats" but it only sets compression; stats_columns and num_stats_columns remain unset, so get_row_group_statistics will return an empty vector. Please include a uint32_t stats_cols[] = { ... } and set both opts.stats_columns and opts.num_stats_columns.
| <span class="ty">Schema</span> arrowSchema = <span class="kw">new</span> <span class="ty">Schema</span>(<span class="ty">Arrays</span>.asList( | ||
| <span class="ty">Field</span>.notNullable(<span class="str">"id"</span>, <span class="kw">new</span> <span class="ty">ArrowType.Int</span>(<span class="num">32</span>, <span class="kw">true</span>)), | ||
| <span class="ty">Field</span>.nullable(<span class="str">"name"</span>, <span class="ty">ArrowType.Utf8</span>.INSTANCE), | ||
| <span class="ty">Field</span>.nullable(<span class="str">"score"</span>, <span class="kw">new</span> <span class="ty">ArrowType.FloatingPoint</span>(DOUBLE)), |
There was a problem hiding this comment.
The Java snippets use new ArrowType.FloatingPoint(DOUBLE), but DOUBLE is not in scope with the shown imports. The implementation/tests use FloatingPointPrecision.DOUBLE (or a static import would be required). Please update all Java snippets that use DOUBLE so users can copy/paste them.
| <tr><td>14</td><td>DECIMAL</td><td>varint precision, varint scale</td></tr> | ||
| <tr><td>15</td><td>TIME</td><td>varint precision</td></tr> | ||
| <tr><td>16</td><td>TIMESTAMP</td><td>varint precision</td></tr> | ||
| <tr><td>17</td><td>TIMESTAMP_LTZ</td><td>varint precision</td></tr> |
There was a problem hiding this comment.
For TIMESTAMP_LTZ/Timestamp-with-timezone, the implementation writes more than just precision: it also writes the timezone string length and bytes (core/src/types.rs). The spec should include varint timezoneLength and bytes timezone, otherwise the documented type descriptor does not match files written by the current code.
leaves12138
left a comment
There was a problem hiding this comment.
Thanks for the update. The previous issues look fixed, but I found two remaining doc/code mismatches that should be addressed before merge.
| arrow::ExportRecordBatch(*batch, &ffi_array, &ffi_schema); | ||
|
|
||
| <span class="ty">mosaic</span>::<span class="ty">Writer</span> writer(std::move(cbs), &ffi_schema, { | ||
| .num_buckets = <span class="num">2</span>, |
There was a problem hiding this comment.
The C++ docs still say to compile with -std=c++17, but the examples use C++ designated initializers, which are a C++20 feature. This initializer is also out of declaration order for C++20 (WriterOptions declares compression before num_buckets), so strict compilation fails. Please either switch the docs/build commands to C++20 and order designators by the struct declaration, or avoid designated initializers and use WriterOptions opts; opts.num_buckets = 2; opts.compression = 1; ... so the examples match the documented C++17 build command.
| <tr><td><code>VarBinary(n)</code></td><td>variable</td><td>Variable-length byte array with max length</td></tr> | ||
| <tr><td><code>Bytes</code></td><td>variable</td><td>Unbounded byte array</td></tr> | ||
| <tr><td><code>Decimal(p, s)</code></td><td>8 or variable</td><td>Exact numeric; compact (p≤18) or large</td></tr> | ||
| <tr><td><code>Timestamp(p)</code></td><td>8 or 12</td><td>Millis (p≤3) or millis + nanos (p>3)</td></tr> |
There was a problem hiding this comment.
This Timestamp summary does not match the implementation: Timestamp(Millisecond) and Timestamp(Microsecond) are both stored as 8-byte values, while only precision > 6 uses the 12-byte {millis, nanos_of_milli} representation. The detailed design page says this correctly; the home page should say something like millis (p <= 3), micros (p <= 6), or millis + nanos (p > 6).
leaves12138
left a comment
There was a problem hiding this comment.
Thanks for the update. The previously reported documentation/API mismatches are fixed now. I rechecked the C++ examples against the documented C++17 command, the timestamp description, schema layout, stats examples, Rust Result handling, and Java FloatingPointPrecision usage. Static docs link validation also passed. CI is still queued at review time.
No description provided.