
feat(cubestore-cli): Support Arrow format for responses#10915

Merged
ovr merged 4 commits into master from feat/cubestore-cli-arrow-format
May 20, 2026

Conversation

@ovr
Member

@ovr ovr commented May 20, 2026

No description provided.

@github-actions github-actions Bot added the rust Pull requests that update Rust code label May 20, 2026
@claude
Contributor

claude Bot commented May 20, 2026

Claude finished @ovr's task in 2m 15s.


PR Review: feat(cubestore-cli): Support Arrow format for responses

  • Gather context and read changed files
  • Review code quality and correctness
  • Review security considerations
  • Review performance implications
  • Review test coverage
  • Provide detailed feedback

Summary

Clean, well-structured PR that upgrades the CubeStore CLI's wire protocol from the legacy stringified-rows FlatBuffer format to Apache Arrow IPC. The design is backwards-compatible — the client requests Arrow, but gracefully handles Legacy responses from older servers via the ResultData enum.

What's good

  • Backwards compatibility: The ResultData enum + ResponseFormat design cleanly handles both wire formats. The comment on ResponseFormat correctly notes that an older server can still answer with Legacy — this is the right approach.
  • pub use arrow: Re-exporting the arrow crate from cubestore-ws-transport ensures the CLI uses the exact same Arrow version, avoiding potential type mismatches across crate boundaries.
  • Arrow rendering: render_arrow_batches iterates Arrow arrays directly rather than materializing to Vec<Vec<Option<String>>> first — good for memory on large results.
  • Test coverage: New decode_arrow_ipc_result_with_nulls test exercises the full encode→decode path for Arrow IPC including null handling. Existing tests properly updated for the new ResultData enum.
  • No API break for consumers: The QueryResult struct still has a single pub data field — callers access columns/rows through the accessor methods.
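To make the two-format design concrete, here is a minimal std-only sketch of the shape described above. It is not the PR's code: `Batch` and `field_names` are hypothetical stand-ins for `arrow::record_batch::RecordBatch` and the Arrow schema, and the `rows` field and `num_rows` accessor are assumptions added for illustration.

```rust
/// Hypothetical stand-in for an Arrow record batch: one Vec of optional
/// cells per column. The real code holds Arrow `RecordBatch` values.
#[derive(Debug, Clone, PartialEq)]
pub struct Batch {
    pub columns: Vec<Vec<Option<String>>>,
}

/// The two payload shapes a server may answer with.
#[derive(Debug, Clone, PartialEq)]
pub enum ResultData {
    /// Older servers: stringified rows (the `rows` field is an assumption).
    Legacy {
        columns: Vec<String>,
        rows: Vec<Vec<Option<String>>>,
    },
    /// Newer servers: `field_names` stands in for the Arrow schema.
    Arrow {
        field_names: Vec<String>,
        batches: Vec<Batch>,
    },
}

#[derive(Debug, Clone, PartialEq)]
pub struct QueryResult {
    pub data: ResultData,
}

impl QueryResult {
    /// Column names, regardless of which wire format produced the result.
    pub fn get_columns(&self) -> Vec<String> {
        match &self.data {
            ResultData::Legacy { columns, .. } => columns.clone(),
            ResultData::Arrow { field_names, .. } => field_names.clone(),
        }
    }

    /// Total row count (hypothetical accessor, for illustration only).
    pub fn num_rows(&self) -> usize {
        match &self.data {
            ResultData::Legacy { rows, .. } => rows.len(),
            ResultData::Arrow { batches, .. } => batches
                .iter()
                .map(|b| b.columns.first().map_or(0, |c| c.len()))
                .sum(),
        }
    }
}
```

Callers that only go through the accessors do not need to know which variant they received, which is what keeps the change non-breaking for consumers.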

Issues found (see inline comments)

Severity | File              | Issue
Minor    | format.rs:95-105  | batch_formatters() uses .expect(), which will panic on unsupported Arrow types; consider a graceful fallback
Nit      | result.rs:70-77   | get_columns() allocates a Vec<String> on every call; Cow would avoid cloning on the Legacy path
Nit      | format.rs:45-79   | batch_formatters() is called twice per batch (sizing + rendering); acknowledged in the doc comment, fine for a CLI
Nit      | .gitignore        | .zed is IDE-specific; better kept in a personal global gitignore
Question | codec.rs:27       | Arrow is now unconditionally requested; should this be configurable via ClientConfig?
Question | codec.rs:137-149  | decode_arrow_ipc eagerly collects all batches; fine for a CLI, but worth noting if the transport crate gets reused

Verdict

The only substantive concern is the .expect() in batch_formatters — a malformed server response would crash the CLI instead of printing an error. Everything else is nits or forward-looking suggestions. Solid work overall.

Comment on lines +70 to 77
pub fn get_columns(&self) -> Vec<String> {
    match &self.data {
        ResultData::Legacy { columns, .. } => columns.clone(),
        ResultData::Arrow { schema, .. } => {
            schema.fields().iter().map(|f| f.name().clone()).collect()
        }
    }
}

Nit (performance): get_columns() allocates a new Vec<String> on every call. In exec.rs it's called once, but in format.rs and codec.rs it's called in paths where the result is used immediately and then dropped.

For the Legacy variant this clones all strings unnecessarily — the caller just needs to read names and count them. Consider returning Cow<'_, [String]> or a dedicated iterator/slice wrapper so the Legacy path avoids the clone:

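use std::borrow::Cow;

pub fn get_columns(&self) -> Cow<'_, [String]> {
    match &self.data {
        // Legacy already owns the names, so lend them out as a slice.
        ResultData::Legacy { columns, .. } => Cow::Borrowed(columns.as_slice()),
        // Arrow names live in the schema, so build an owned Vec.
        ResultData::Arrow { schema, .. } => {
            Cow::Owned(schema.fields().iter().map(|f| f.name().clone()).collect())
        }
    }
}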

Not a blocker — the column count is small — but worth noting for future optimization if this gets called in a hot path.
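For reference, a self-contained sketch of the same borrow-or-own pattern that compiles without the Arrow crate. `Field` and the simplified `ResultData` here are hypothetical stand-ins, not the PR's types; only the `Cow` usage is the point.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for an Arrow schema field.
pub struct Field {
    pub name: String,
}

/// Simplified stand-in for the PR's enum (assumption: other fields omitted).
pub enum ResultData {
    Legacy { columns: Vec<String> },
    Arrow { fields: Vec<Field> },
}

/// Borrow the names when they already exist as a slice, allocate otherwise.
pub fn column_names(data: &ResultData) -> Cow<'_, [String]> {
    match data {
        // No allocation: the caller reads the existing Vec through a slice.
        ResultData::Legacy { columns } => Cow::Borrowed(columns.as_slice()),
        // Allocation is unavoidable here: names must be gathered from fields.
        ResultData::Arrow { fields } => {
            Cow::Owned(fields.iter().map(|f| f.name.clone()).collect())
        }
    }
}
```

Because `Cow<'_, [String]>` dereferences to `[String]`, call sites that only use `.len()` or `.iter()` would not need to change.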

Comment on lines +95 to +105
fn batch_formatters<'a>(
    batch: &'a RecordBatch,
    options: &'a FormatOptions,
) -> Vec<ArrayFormatter<'a>> {
    batch
        .columns()
        .iter()
        .map(|col| ArrayFormatter::try_new(col.as_ref(), options))
        .collect::<Result<_, _>>()
        .expect("arrow ArrayFormatter init")
}

Concern: .expect("arrow ArrayFormatter init") will panic if Arrow encounters an unsupported data type. While this is unlikely with standard CubeStore types, a malformed server response could crash the CLI rather than printing an error.

Consider propagating the error or falling back gracefully:

fn batch_formatters<'a>(
    batch: &'a RecordBatch,
    options: &'a FormatOptions,
) -> Option<Vec<ArrayFormatter<'a>>> {
    batch.columns()
        .iter()
        .map(|col| ArrayFormatter::try_new(col.as_ref(), options))
        .collect::<Result<_, _>>()
        .ok()
}

Then skip rendering that batch (or fall back to debug formatting) if None.
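An alternative to `Option` is to return the `Result` itself so the CLI can print why rendering failed. The sketch below shows that pattern with std only; `ColumnType`, `FormatError`, `Formatter`, and `render` are hypothetical stand-ins and do not mirror the Arrow API.

```rust
use std::fmt;

/// Hypothetical column types; `Unsupported` models a type the formatter rejects.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ColumnType {
    Int,
    Text,
    Unsupported,
}

/// Hypothetical error type standing in for the Arrow error.
#[derive(Debug, PartialEq)]
pub struct FormatError(pub String);

impl fmt::Display for FormatError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}

/// Hypothetical formatter whose constructor can fail.
pub struct Formatter {
    pub ty: ColumnType,
}

impl Formatter {
    pub fn try_new(ty: ColumnType) -> Result<Self, FormatError> {
        match ty {
            ColumnType::Unsupported => {
                Err(FormatError("unsupported column type".to_string()))
            }
            other => Ok(Formatter { ty: other }),
        }
    }
}

/// Collecting into `Result<Vec<_>, _>` stops at the first error and returns it.
pub fn batch_formatters(types: &[ColumnType]) -> Result<Vec<Formatter>, FormatError> {
    types.iter().map(|&ty| Formatter::try_new(ty)).collect()
}

/// The caller decides how to report the failure instead of panicking.
pub fn render(types: &[ColumnType]) -> String {
    match batch_formatters(types) {
        Ok(formatters) => format!("{} column(s) ready", formatters.len()),
        Err(e) => format!("error: cannot render batch: {e}"),
    }
}
```

Keeping the error value (rather than `.ok()`) preserves the message, at the cost of threading a `Result` through the rendering functions.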

Comment thread rust/cube/cubestore-cli/src/format.rs Outdated
Comment on lines +45 to +79
fn render_arrow_batches(columns: &[String], batches: &[RecordBatch]) -> String {
    let ncols = columns.len();
    let fmt_options = FormatOptions::default().with_display_error(true);
    let mut widths = header_widths(columns);

    for batch in batches {
        let formatters = batch_formatters(batch, &fmt_options);
        let batch_cols = formatters.len().min(ncols);
        for row_idx in 0..batch.num_rows() {
            for col_idx in 0..batch_cols {
                let cell = arrow_cell(batch, &formatters, col_idx, row_idx);
                observe_width(&mut widths, col_idx, cell_len(cell.as_deref()));
            }
        }
    }

    let mut out = String::new();
    write_header(&mut out, columns, &widths);
    write_separator(&mut out, &widths);
    for batch in batches {
        let formatters = batch_formatters(batch, &fmt_options);
        let batch_cols = formatters.len().min(ncols);
        for row_idx in 0..batch.num_rows() {
            out.push('\n');
            for (i, &width) in widths.iter().enumerate() {
                let cell = if i < batch_cols {
                    arrow_cell(batch, &formatters, i, row_idx)
                } else {
                    None
                };
                write_cell(&mut out, i, cell.as_deref(), width);
            }
        }
    }
    out

Nit: batch_formatters() is called twice per batch — once in the sizing pass and once in the rendering pass. The doc comment acknowledges the trade-off (CPU vs. not materializing all rows into strings), which is fair.

If this ever becomes a bottleneck with large result sets, you could compute formatters once per batch and store (batch, formatters) pairs, but the current approach is clean and the overhead is likely negligible for CLI output.
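A rough sketch of the "build formatters once per batch" variant, using std only. `Batch`, `Formatter`, and the " | " layout are hypothetical stand-ins rather than the PR's helpers; the relevant part is that `prepared` is built once and reused by both the sizing and rendering passes.

```rust
/// Hypothetical stand-in batch: column-major optional integers.
pub struct Batch {
    pub columns: Vec<Vec<Option<i64>>>,
}

/// Hypothetical formatter borrowing one column of a batch.
pub struct Formatter<'a> {
    column: &'a [Option<i64>],
}

impl<'a> Formatter<'a> {
    /// Render one cell, or None for a null value.
    pub fn cell(&self, row: usize) -> Option<String> {
        self.column[row].map(|v| v.to_string())
    }
}

pub fn formatters(batch: &Batch) -> Vec<Formatter<'_>> {
    batch.columns.iter().map(|c| Formatter { column: c.as_slice() }).collect()
}

/// Pad each cell to its column width and join with " | ".
fn join_row(cells: impl Iterator<Item = String>, widths: &[usize]) -> String {
    cells
        .zip(widths)
        .map(|(c, &w)| format!("{c:<w$}"))
        .collect::<Vec<_>>()
        .join(" | ")
}

pub fn render(headers: &[&str], batches: &[Batch]) -> String {
    // Build formatters once per batch, stored next to the batch's row count.
    let prepared: Vec<(usize, Vec<Formatter<'_>>)> = batches
        .iter()
        .map(|b| (b.columns.first().map_or(0, |c| c.len()), formatters(b)))
        .collect();

    // Pass 1: column widths, reusing the prepared formatters.
    let mut widths: Vec<usize> = headers.iter().map(|h| h.len()).collect();
    for (rows, fmts) in &prepared {
        for row in 0..*rows {
            for (w, f) in widths.iter_mut().zip(fmts) {
                let len = f.cell(row).map_or("NULL".len(), |s| s.len());
                *w = (*w).max(len);
            }
        }
    }

    // Pass 2: rendering, again reusing the same formatters.
    let mut lines = vec![join_row(headers.iter().map(|h| h.to_string()), &widths)];
    for (rows, fmts) in &prepared {
        for row in 0..*rows {
            let cells = fmts
                .iter()
                .map(|f| f.cell(row).unwrap_or_else(|| "NULL".to_string()));
            lines.push(join_row(cells, &widths));
        }
    }
    lines.join("\n")
}
```

This still formats each cell twice (once to measure, once to print), so it only removes the repeated formatter construction, not the repeated formatting.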

Comment thread rust/cube/.gitignore
Comment on lines +137 to +149
fn decode_arrow_ipc(bytes: &[u8]) -> Result<QueryResult, TransportError> {
    let reader = StreamReader::try_new(Cursor::new(bytes), None)
        .map_err(|e| TransportError::Protocol(format!("arrow IPC open: {e}")))?;

    let schema = reader.schema();
    let batches: Vec<RecordBatch> = reader
        .collect::<Result<_, _>>()
        .map_err(|e| TransportError::Protocol(format!("arrow IPC read batch: {e}")))?;

    Ok(QueryResult {
        data: ResultData::Arrow { schema, batches },
    })
}

Question: decode_arrow_ipc collects all batches into a Vec<RecordBatch> eagerly. For very large result sets this means the entire result is held in memory. This is fine for a CLI tool, but if this transport crate is intended to be reused by other consumers, a streaming approach (yielding batches one at a time) might be worth considering in the future.

Also: is there a size limit consideration? The StreamReader will happily read whatever the server sends. The WS layer already has a 256MiB frame cap (from actor.rs), so that provides an implicit ceiling, which seems sufficient.
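To illustrate the eager-versus-streaming distinction, here is a std-only sketch. The framing is hypothetical and is NOT Arrow IPC: each frame is one length byte followed by that many payload bytes. `FrameStream`, `decode_all`, and `total_payload_len` are made-up names; the point is only that an iterator of `Result` items can be either collected up front or consumed one item at a time.

```rust
/// Hypothetical framing, not Arrow IPC: one length byte, then the payload.
pub struct FrameStream<'a> {
    rest: &'a [u8],
}

impl<'a> FrameStream<'a> {
    pub fn new(bytes: &'a [u8]) -> Self {
        FrameStream { rest: bytes }
    }
}

impl<'a> Iterator for FrameStream<'a> {
    type Item = Result<&'a [u8], String>;

    fn next(&mut self) -> Option<Self::Item> {
        // End of input: no more frames.
        let (&len, tail) = self.rest.split_first()?;
        let len = len as usize;
        if tail.len() < len {
            // Report the truncated frame once, then stop iterating.
            self.rest = &[];
            return Some(Err(format!(
                "truncated frame: need {len} bytes, have {}",
                tail.len()
            )));
        }
        let (frame, rest) = tail.split_at(len);
        self.rest = rest;
        Some(Ok(frame))
    }
}

/// Eager variant: every frame is held in the Vec at once.
pub fn decode_all(bytes: &[u8]) -> Result<Vec<&[u8]>, String> {
    FrameStream::new(bytes).collect()
}

/// Streaming variant: only the current frame is looked at in each step.
pub fn total_payload_len(bytes: &[u8]) -> Result<usize, String> {
    let mut total = 0;
    for frame in FrameStream::new(bytes) {
        total += frame?.len();
    }
    Ok(total)
}
```

In this sketch the input buffer is already fully in memory, so the saving is only in the decoded values; a real streaming design would also need to consider how bytes arrive from the WebSocket layer.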

Comment thread rust/cube/cubestore-ws-transport/src/codec.rs
@codecov

codecov Bot commented May 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.93%. Comparing base (0fa472b) to head (712b25f).

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10915      +/-   ##
==========================================
- Coverage   83.53%   78.93%   -4.60%     
==========================================
  Files         254      470     +216     
  Lines       75844    92837   +16993     
  Branches        0     3449    +3449     
==========================================
+ Hits        63353    73279    +9926     
- Misses      12491    19054    +6563     
- Partials        0      504     +504     
Flag Coverage Δ
cube-backend 58.44% <ø> (?)
cubesql 83.52% <ø> (-0.01%) ⬇️


@ovr ovr merged commit e4c8c43 into master May 20, 2026
91 of 93 checks passed
@ovr ovr deleted the feat/cubestore-cli-arrow-format branch May 20, 2026 13:55

Labels

rust Pull requests that update Rust code


2 participants