Establish nomenclature style guide and audit operator names for clarity #4419

@andygrove


Problem

Comet's documentation and operator names use "native" to mean three different things:

  1. Rust-implemented: "Comet* nodes that run natively in Rust" (docs/source/user-guide/latest/understanding-comet-plans.md)
  2. On the Comet pipeline: "Comet executes Spark's Scala and Java UDFs on the native Comet path" (docs/source/user-guide/latest/scala_java_udfs.md), even though the UDF runs as JVM bytecode produced by codegen
  3. Fully-Rust scan: CometNativeScan (fully Rust) vs CometBatchScan (DSv2 with JVM reader producing Arrow)

The shuffle naming is similarly muddled: CometExchange ("native shuffle") and CometColumnarExchange ("JVM columnar shuffle") are both columnar and both serialize via Arrow IPC, but only one of the operator names says "Columnar."

This will get worse, not better. The roadmap calls for Java/Scala UDFs that operate directly on Arrow columnar data; the Scala UDF codegen path already lands JVM bytecode inside what we call "the native Comet path." Comet's value proposition is shifting from "we rewrote operators in Rust" to "we keep your data Arrow-native end-to-end" — and the nomenclature hasn't caught up.

Proposed value prop

Comet keeps Spark queries Arrow-native end-to-end — operators, expressions, shuffle, and broadcast all stay in Apache Arrow columnar format, avoiding the per-row overhead of Spark's row-based engine. Within the Arrow-native pipeline, operators and expressions execute in either native Rust code (via Apache DataFusion) or JVM code that operates directly on Arrow batches.

Proposed vocabulary

The docs currently conflate three orthogonal concepts. Each axis should get its own distinct name.

Axis 1 — Pipeline membership

  • Comet pipeline — operators and expressions Comet handles, regardless of implementation language. Replaces "native Comet path," "on Comet," "accelerated by Comet."
  • Spark fallback — drops out of the pipeline into row-based Spark execution. (Already used consistently; keep it.)

Axis 2 — Implementation language

Within the Arrow-native pipeline, expressions and operators may be:

  • Rust-implemented / native Rust — Rust code via DataFusion.
  • JVM-implemented — Scala/Java code, including codegen'd UDFs.
  • Hybrid — JVM codegen with native callouts, where the JVM kernel is the primary execution path and Rust is invoked via JNI for specific subexpressions.

The implementation language is an internal detail; from the query's perspective the work is Arrow-native regardless. Hybrid impls trade JNI crossing cost for native compute speedups on subexpressions where the win is large enough to justify the boundary.

Axis 3 — Data format

  • Arrow-native — the unifying property of the Comet pipeline. Data is in Apache Arrow columnar format throughout; no per-row materialization or transition cost.
  • Arrow IPC — the wire/disk format used by Arrow-native shuffle and broadcast.
  • Spark rows / row-based: UnsafeRow, the format outside the pipeline.
  • Spark columnar — non-Comet ColumnarBatch from a Spark vectorized reader; requires a transition into Arrow-native.
  • Reserve "vectorized" for SIMD-specific contexts (the Parquet reader). Don't use it as a synonym for columnar.
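
To make the orthogonality of the three axes concrete, here is a minimal Python sketch that models them as independent fields. The class and field names (Pipeline, Language, DataFormat, Classification) are hypothetical and exist only for illustration; nothing like this is in the Comet codebase. It assumes Axis 2 only applies inside the Comet pipeline, as the Axis 2 text suggests.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Pipeline(Enum):          # Axis 1: pipeline membership
    COMET_PIPELINE = "Comet pipeline"
    SPARK_FALLBACK = "Spark fallback"


class Language(Enum):          # Axis 2: implementation language
    RUST = "Rust-implemented"
    JVM = "JVM-implemented"
    HYBRID = "Hybrid"


class DataFormat(Enum):        # Axis 3: data format
    ARROW_NATIVE = "Arrow-native"
    ARROW_IPC = "Arrow IPC"
    SPARK_ROWS = "Spark rows"
    SPARK_COLUMNAR = "Spark columnar"


@dataclass(frozen=True)
class Classification:
    """One value per axis (hypothetical helper, not a Comet API)."""
    pipeline: Pipeline
    data_format: DataFormat
    # Assumption: implementation language is only meaningful inside the pipeline.
    language: Optional[Language] = None

    def __post_init__(self):
        if self.pipeline is Pipeline.COMET_PIPELINE and self.language is None:
            raise ValueError("work in the Comet pipeline needs an implementation language")

    def phrase(self) -> str:
        """Render the classification using the proposed vocabulary."""
        if self.pipeline is Pipeline.SPARK_FALLBACK:
            return f"{self.pipeline.value} ({self.data_format.value})"
        return f"{self.language.value}, {self.data_format.value}, in the {self.pipeline.value}"


# The Scala UDF codegen path: JVM bytecode, but still inside the Comet pipeline.
scala_udf = Classification(Pipeline.COMET_PIPELINE, DataFormat.ARROW_NATIVE, Language.JVM)
print(scala_udf.phrase())
```

Keeping the axes as separate fields means a sentence in the docs can state each property explicitly instead of folding all three into the word "native".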

A new category to name

Arrow-native JVM expression — JVM code that operates directly on Arrow batches (the Scala UDF codegen path; future Arrow UDFs; future hybrid JVM/native impls). Distinct from existing JVM-side plumbing operators (CometUnion, CometCoalesce, CometCollectLimit) which coordinate Arrow batches but don't compute over column data.

Rule for "native"

Bare "native" as a vague adjective in prose is banned; replace with the specific axis it refers to. Compound forms where the binding word fixes the meaning are permitted:

| Form | Meaning | OK? |
| --- | --- | --- |
| native Rust / Rust-native | Rust implementation | yes |
| Arrow-native | Arrow columnar throughout | yes |
| native shuffle | Rust-implemented shuffle, paired with JVM shuffle | yes |
| native scan (in CometNativeScan context) | Rust scan, paired with CometBatchScan | yes |
| runs natively / native execution / the native path | ambiguous | no |
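
A rule like this is easier to keep if it can be checked mechanically. The following is a rough Python sketch of a docs lint that flags bare "native" / "natively" while letting the permitted compounds through. The function name find_bare_native and the allow-list regex are assumptions made for illustration; the allow-list is one reading of the permitted forms, not an agreed specification.

```python
import re

# Compound forms the rule permits (illustrative reading of the proposal).
ALLOWED = re.compile(
    r"\b(?:native Rust|Rust-native|Arrow-native|native shuffle|native scan)\b",
    re.IGNORECASE,
)

# Bare "native" or "natively" as a standalone word. Identifiers such as
# CometNativeScan or native_shuffle.md never match, because the letters or
# underscore next to "native" mean there is no word boundary.
BARE = re.compile(r"\bnative(?:ly)?\b", re.IGNORECASE)


def find_bare_native(text):
    """Return (line_number, column, word) for each bare use of 'native'."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        # Blank out permitted compounds so only ambiguous uses remain,
        # keeping column offsets identical to the original line.
        masked = ALLOWED.sub(lambda m: " " * len(m.group()), line)
        for m in BARE.finditer(masked):
            hits.append((lineno, m.start(), line[m.start():m.end()]))
    return hits


sample = "\n".join([
    "Comet* nodes that run natively in Rust",
    "Comet keeps data Arrow-native end-to-end",
    "UDFs run on the native Comet path",
    "CometNativeScan is a native scan",
])
for lineno, col, word in find_bare_native(sample):
    print(f"line {lineno}, col {col}: bare '{word}'")
```

A check like this would produce false positives in quoted text (for example, this issue quoting the old wording), so it is probably better suited to a warning than a hard CI failure.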

Proposed plan-node taxonomy

The "three kinds of nodes" framing in understanding-comet-plans.md should become four:

| Category | What it is | Example |
| --- | --- | --- |
| Arrow-native Rust operator | Rust compute over Arrow batches | CometProject, CometHashAggregate, CometSort |
| Arrow-native JVM expression | JVM compute over Arrow batches; may include hybrid impls that call native code for subexpressions | Scala UDF codegen (today); Arrow UDFs and hybrid JVM/native expressions (future) |
| Arrow-native JVM plumbing | JVM coordination of Arrow batches; no per-row compute | CometUnion, CometCoalesce, CometBroadcastExchange |
| Spark fallback | Row-based Spark execution | Project, HashAggregate, plain Exchange |

Hybrid JVM/native expressions fold into the "Arrow-native JVM expression" row rather than getting a fifth category — the user-visible property is the same (Arrow-native, executed via a JVM kernel), and treating the native callout as an implementation detail keeps the taxonomy from sprawling.
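
As a sketch of how the four-way taxonomy could be applied to plan output, the Python below maps the example node names from this issue to their categories. The helper names (NodeCategory, classify, summarize) are hypothetical. The lookup table holds only the examples listed in this issue, and treating any non-Comet-prefixed node as Spark fallback is an assumption. The "Arrow-native JVM expression" category has no entries because its examples are expressions, not plan nodes.

```python
from collections import Counter
from enum import Enum


class NodeCategory(Enum):
    RUST_OPERATOR = "Arrow-native Rust operator"
    JVM_EXPRESSION = "Arrow-native JVM expression"
    JVM_PLUMBING = "Arrow-native JVM plumbing"
    SPARK_FALLBACK = "Spark fallback"


# Only the examples named in this issue; a real table would need to be complete.
KNOWN_NODES = {
    "CometProject": NodeCategory.RUST_OPERATOR,
    "CometHashAggregate": NodeCategory.RUST_OPERATOR,
    "CometSort": NodeCategory.RUST_OPERATOR,
    "CometUnion": NodeCategory.JVM_PLUMBING,
    "CometCoalesce": NodeCategory.JVM_PLUMBING,
    "CometCollectLimit": NodeCategory.JVM_PLUMBING,
    "CometBroadcastExchange": NodeCategory.JVM_PLUMBING,
    "Project": NodeCategory.SPARK_FALLBACK,
    "HashAggregate": NodeCategory.SPARK_FALLBACK,
    "Exchange": NodeCategory.SPARK_FALLBACK,
}


def classify(node_name):
    """Return the category for a plan node, or None if it needs a human decision."""
    if node_name in KNOWN_NODES:
        return KNOWN_NODES[node_name]
    # Assumption: anything without the Comet prefix is plain Spark execution.
    if not node_name.startswith("Comet"):
        return NodeCategory.SPARK_FALLBACK
    return None  # unknown Comet node: do not guess its category


def summarize(node_names):
    """Count plan nodes per category, skipping unclassified Comet nodes."""
    return Counter(c for c in map(classify, node_names) if c is not None)


plan = ["CometProject", "CometSort", "CometUnion", "Exchange"]
for category, count in summarize(plan).items():
    print(f"{category.value}: {count}")
```

Returning None for unknown Comet nodes is deliberate: guessing a category from the name alone would reintroduce the ambiguity this issue is trying to remove.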

Proposed operator renames

Touches plan-stability goldens, external tooling, and user docs. Candidates, ranked by how misleading the current name is:

| Current | Proposed | Reason |
| --- | --- | --- |
| CometExchange | CometNativeShuffleExchange | Both shuffle implementations are columnar and both use Arrow IPC; the differentiator is implementation, not format |
| CometColumnarExchange | CometJvmShuffleExchange | Symmetric pair with the above |
| CometNativeExec | Keep (or rename to CometRustExec) | Internal wrapper, not user-visible in plans; lower priority |
| CometNativeScan | Keep | "Native" = Rust here, paired with CometBatchScan; already unambiguous as a compound |
| CometIcebergNativeScan, CometCsvNativeScan | Keep | Same |
| CometNativeColumnarToRow | Keep | Distinguishes from JVM CometColumnarToRow; consistent compound |

For renames that are user-visible in plans (CometExchange is the main one), provide a deprecation alias for one minor release and update plan-stability goldens in the same PR.
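
For the golden-file updates, a whole-word substitution is one way to apply both renames in a single pass without touching neighbouring names such as CometBroadcastExchange. The Python below is a sketch only: the function name rename_operators is hypothetical, the sample plan text is made up, and the real plan-stability goldens may need different handling.

```python
import re

# The two user-visible renames proposed in this issue.
RENAMES = {
    "CometExchange": "CometNativeShuffleExchange",
    "CometColumnarExchange": "CometJvmShuffleExchange",
}

# Match whole identifiers only, so names that merely contain "Exchange"
# (for example CometBroadcastExchange) are left alone.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, RENAMES)) + r")\b")


def rename_operators(plan_text):
    """Apply all renames in one pass; running it twice changes nothing further."""
    return _PATTERN.sub(lambda m: RENAMES[m.group(1)], plan_text)


# Illustrative plan text, not copied from a real golden file.
old_plan = "\n".join([
    "CometExchange hashpartitioning(a)",
    "+- CometColumnarExchange",
    "   +- CometBroadcastExchange",
])
print(rename_operators(old_plan))
```

Doing both substitutions in one regex pass avoids any ordering concerns between the two renames, and because neither new name matches an old one, the rewrite is safe to re-run.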

Documentation rewrite scope

Files that need substantive prose changes:

  • README.md — value-prop sentence using the Arrow-native framing
  • docs/source/index.md — same
  • docs/source/user-guide/latest/scala_java_udfs.md — replace "on the native Comet path" with "in the Comet pipeline"; explain the path is Arrow-native even though the UDF is JVM bytecode. Also clarify the existing line "e.g. myUdf(upper(s)) runs as one native unit" — the word "native" here is exactly the kind of ambiguity this issue exists to fix.
  • docs/source/user-guide/latest/understanding-comet-plans.md — extend the three-category list to four; update the shuffle section to match the rename
  • docs/source/contributor-guide/plugin_overview.md — "native Comet operators" → "Rust-implemented operators"; explain that JVM Arrow expressions (and future hybrid impls) slot into the same pipeline
  • docs/source/contributor-guide/native_shuffle.md + jvm_shuffle.md — align with new operator names; the file titles are accurate so they can stay
  • docs/source/about/gluten_comparison.md — the JVM-on-Arrow path is now a Comet differentiator worth mentioning

Migration plan

  1. Land the docs vocabulary first (no code changes, no plan goldens touched).
  2. Land operator renames with deprecation aliases in a single PR per pair.
  3. Update plan-stability goldens in the rename PRs.
  4. Remove deprecation aliases one minor release later.

Open questions

  • Should CometNativeExec rename in the same pass, or stay since it's internal-only?
  • Should the issue describe a parallel rename for any internal class names (e.g. CometNativeScanExec) beyond what shows up in plan output?
  • Any external tools that grep CometExchange we should ping before the rename lands?
  • Hybrid JVM/native expressions — the codegen-dispatch design opens the door to JVM kernels that call into Rust via JNI for specific subexpressions where the speedup outweighs the boundary cost. This isn't on the immediate roadmap but the vocabulary should accommodate it without further naming churn. The proposal treats hybrid impls as a flavor of Arrow-native JVM expression rather than a separate category, on the principle that implementation-language mix is an internal detail and the externally relevant property remains "the work happens inside the Arrow-native pipeline." Worth confirming the community agrees with this categorization before it becomes a contested rename later.
