Problem
Comet's documentation and operator names use "native" to mean three different things:
- Rust-implemented — "
Comet* nodes that run natively in Rust" (docs/source/user-guide/latest/understanding-comet-plans.md)
- On the Comet pipeline — "Comet executes Spark's Scala and Java UDFs on the native Comet path" (
docs/source/user-guide/latest/scala_java_udfs.md), even though the UDF runs as JVM bytecode produced by codegen
- Fully-Rust scan —
CometNativeScan (fully Rust) vs CometBatchScan (DSv2 with JVM reader producing Arrow)
The shuffle naming is similarly muddled: CometExchange ("native shuffle") and CometColumnarExchange ("JVM columnar shuffle") are both columnar and both serialize via Arrow IPC, but only one of the operator names says "Columnar."
This will get worse, not better. The roadmap calls for Java/Scala UDFs that operate directly on Arrow columnar data; the Scala UDF codegen path already lands JVM bytecode inside what we call "the native Comet path." Comet's value proposition is shifting from "we rewrote operators in Rust" to "we keep your data Arrow-native end-to-end" — and the nomenclature hasn't caught up.
Proposed value prop
Comet keeps Spark queries Arrow-native end-to-end — operators, expressions, shuffle, and broadcast all stay in Apache Arrow columnar format, avoiding the per-row overhead of Spark's row-based engine. Within the Arrow-native pipeline, operators and expressions execute in either native Rust code (via Apache DataFusion) or JVM code that operates directly on Arrow batches.
Proposed vocabulary
Three orthogonal concepts the docs currently conflate. We should name each axis distinctly.
Axis 1 — Pipeline membership
- Comet pipeline — operators and expressions Comet handles, regardless of implementation language. Replaces "native Comet path," "on Comet," "accelerated by Comet."
- Spark fallback — drops out of the pipeline into row-based Spark execution. (Already used consistently; keep it.)
Axis 2 — Implementation language
Within the Arrow-native pipeline, expressions and operators may be:
- Rust-implemented / native Rust — Rust code via DataFusion.
- JVM-implemented — Scala/Java code, including codegen'd UDFs.
- Hybrid — JVM codegen with native callouts, where the JVM kernel is the primary execution path and Rust is invoked via JNI for specific subexpressions.
The implementation language is an internal detail; from the query's perspective the work is Arrow-native regardless. Hybrid impls trade JNI crossing cost for native compute speedups on subexpressions where the win is large enough to justify the boundary.
Axis 3 — Data format
- Arrow-native — the unifying property of the Comet pipeline. Data is in Apache Arrow columnar format throughout; no per-row materialization or transition cost.
- Arrow IPC — the wire/disk format used by Arrow-native shuffle and broadcast.
- Spark rows / row-based —
UnsafeRow, the format outside the pipeline.
- Spark columnar — non-Comet
ColumnarBatch from a Spark vectorized reader; requires a transition into Arrow-native.
- Reserve "vectorized" for SIMD-specific contexts (the Parquet reader). Don't use it as a synonym for columnar.
A new category to name
Arrow-native JVM expression — JVM code that operates directly on Arrow batches (the Scala UDF codegen path; future Arrow UDFs; future hybrid JVM/native impls). Distinct from existing JVM-side plumbing operators (CometUnion, CometCoalesce, CometCollectLimit) which coordinate Arrow batches but don't compute over column data.
Rule for "native"
Bare "native" as a vague adjective in prose is banned; replace with the specific axis it refers to. Compound forms where the binding word fixes the meaning are permitted:
| Form |
Meaning |
OK? |
| native Rust / Rust-native |
Rust implementation |
yes |
| Arrow-native |
Arrow columnar throughout |
yes |
| native shuffle |
Rust-implemented shuffle, paired with JVM shuffle |
yes |
native scan (in CometNativeScan context) |
Rust scan, paired with CometBatchScan |
yes |
| runs natively / native execution / the native path |
ambiguous |
no |
Proposed plan-node taxonomy
The "three kinds of nodes" framing in understanding-comet-plans.md should become four:
| Category |
What it is |
Example |
| Arrow-native Rust operator |
Rust compute over Arrow batches |
CometProject, CometHashAggregate, CometSort |
| Arrow-native JVM expression |
JVM compute over Arrow batches; may include hybrid impls that call native code for subexpressions |
Scala UDF codegen (today); Arrow UDFs and hybrid JVM/native expressions (future) |
| Arrow-native JVM plumbing |
JVM coordination of Arrow batches; no per-row compute |
CometUnion, CometCoalesce, CometBroadcastExchange |
| Spark fallback |
Row-based Spark execution |
Project, HashAggregate, plain Exchange |
Hybrid JVM/native expressions fold into the "Arrow-native JVM expression" row rather than getting a fifth category — the user-visible property is the same (Arrow-native, executed via a JVM kernel), and treating the native callout as an implementation detail keeps the taxonomy from sprawling.
Proposed operator renames
Touches plan-stability goldens, external tooling, and user docs. Candidates, ranked by how misleading the current name is:
| Current |
Proposed |
Reason |
CometExchange |
CometNativeShuffleExchange |
Both shuffle implementations are columnar and both use Arrow IPC; the differentiator is implementation, not format |
CometColumnarExchange |
CometJvmShuffleExchange |
Symmetric pair with the above |
CometNativeExec |
Keep (or rename to CometRustExec) |
Internal wrapper, not user-visible in plans; lower priority |
CometNativeScan |
Keep |
"Native" = Rust here, paired with CometBatchScan; already unambiguous as a compound |
CometIcebergNativeScan, CometCsvNativeScan |
Keep |
Same |
CometNativeColumnarToRow |
Keep |
Distinguishes from JVM CometColumnarToRow; consistent compound |
For renames that are user-visible in plans (CometExchange is the main one), provide a deprecation alias for one minor release and update plan-stability goldens in the same PR.
Documentation rewrite scope
Files that need substantive prose changes:
README.md — value-prop sentence using the Arrow-native framing
docs/source/index.md — same
docs/source/user-guide/latest/scala_java_udfs.md — replace "on the native Comet path" with "in the Comet pipeline"; explain the path is Arrow-native even though the UDF is JVM bytecode. Also clarify the existing line "e.g. myUdf(upper(s)) runs as one native unit" — the word "native" here is exactly the kind of ambiguity this issue exists to fix.
docs/source/user-guide/latest/understanding-comet-plans.md — extend the three-category list to four; update the shuffle section to match the rename
docs/source/contributor-guide/plugin_overview.md — "native Comet operators" → "Rust-implemented operators"; explain that JVM Arrow expressions (and future hybrid impls) slot into the same pipeline
docs/source/contributor-guide/native_shuffle.md + jvm_shuffle.md — align with new operator names; the file titles are accurate so they can stay
docs/source/about/gluten_comparison.md — the JVM-on-Arrow path is now a Comet differentiator worth mentioning
Migration plan
- Land the docs vocabulary first (no code changes, no plan goldens touched).
- Land operator renames with deprecation aliases in a single PR per pair.
- Update plan-stability goldens in the rename PRs.
- Remove deprecation aliases one minor release later.
Open questions
- Should
CometNativeExec rename in the same pass, or stay since it's internal-only?
- Should the issue describe a parallel rename for any internal class names (e.g.
CometNativeScanExec) beyond what shows up in plan output?
- Any external tools that grep
CometExchange we should ping before the rename lands?
- Hybrid JVM/native expressions — the codegen-dispatch design opens the door to JVM kernels that call into Rust via JNI for specific subexpressions where the speedup outweighs the boundary cost. This isn't on the immediate roadmap but the vocabulary should accommodate it without further naming churn. The proposal treats hybrid impls as a flavor of Arrow-native JVM expression rather than a separate category, on the principle that implementation-language mix is an internal detail and the externally relevant property remains "the work happens inside the Arrow-native pipeline." Worth confirming the community agrees with this categorization before it becomes a contested rename later.
Problem
Comet's documentation and operator names use "native" to mean three different things:
Comet*nodes that run natively in Rust" (docs/source/user-guide/latest/understanding-comet-plans.md)docs/source/user-guide/latest/scala_java_udfs.md), even though the UDF runs as JVM bytecode produced by codegenCometNativeScan(fully Rust) vsCometBatchScan(DSv2 with JVM reader producing Arrow)The shuffle naming is similarly muddled:
CometExchange("native shuffle") andCometColumnarExchange("JVM columnar shuffle") are both columnar and both serialize via Arrow IPC, but only one of the operator names says "Columnar."This will get worse, not better. The roadmap calls for Java/Scala UDFs that operate directly on Arrow columnar data; the Scala UDF codegen path already lands JVM bytecode inside what we call "the native Comet path." Comet's value proposition is shifting from "we rewrote operators in Rust" to "we keep your data Arrow-native end-to-end" — and the nomenclature hasn't caught up.
Proposed value prop
Proposed vocabulary
Three orthogonal concepts the docs currently conflate. We should name each axis distinctly.
Axis 1 — Pipeline membership
Axis 2 — Implementation language
Within the Arrow-native pipeline, expressions and operators may be:
The implementation language is an internal detail; from the query's perspective the work is Arrow-native regardless. Hybrid impls trade JNI crossing cost for native compute speedups on subexpressions where the win is large enough to justify the boundary.
Axis 3 — Data format
UnsafeRow, the format outside the pipeline.ColumnarBatchfrom a Spark vectorized reader; requires a transition into Arrow-native.A new category to name
Arrow-native JVM expression — JVM code that operates directly on Arrow batches (the Scala UDF codegen path; future Arrow UDFs; future hybrid JVM/native impls). Distinct from existing JVM-side plumbing operators (
CometUnion,CometCoalesce,CometCollectLimit) which coordinate Arrow batches but don't compute over column data.Rule for "native"
Bare "native" as a vague adjective in prose is banned; replace with the specific axis it refers to. Compound forms where the binding word fixes the meaning are permitted:
CometNativeScancontext)CometBatchScanProposed plan-node taxonomy
The "three kinds of nodes" framing in
understanding-comet-plans.mdshould become four:CometProject,CometHashAggregate,CometSortCometUnion,CometCoalesce,CometBroadcastExchangeProject,HashAggregate, plainExchangeHybrid JVM/native expressions fold into the "Arrow-native JVM expression" row rather than getting a fifth category — the user-visible property is the same (Arrow-native, executed via a JVM kernel), and treating the native callout as an implementation detail keeps the taxonomy from sprawling.
Proposed operator renames
Touches plan-stability goldens, external tooling, and user docs. Candidates, ranked by how misleading the current name is:
CometExchangeCometNativeShuffleExchangeCometColumnarExchangeCometJvmShuffleExchangeCometNativeExecCometRustExec)CometNativeScanCometBatchScan; already unambiguous as a compoundCometIcebergNativeScan,CometCsvNativeScanCometNativeColumnarToRowCometColumnarToRow; consistent compoundFor renames that are user-visible in plans (
CometExchangeis the main one), provide a deprecation alias for one minor release and update plan-stability goldens in the same PR.Documentation rewrite scope
Files that need substantive prose changes:
README.md— value-prop sentence using the Arrow-native framingdocs/source/index.md— samedocs/source/user-guide/latest/scala_java_udfs.md— replace "on the native Comet path" with "in the Comet pipeline"; explain the path is Arrow-native even though the UDF is JVM bytecode. Also clarify the existing line "e.g.myUdf(upper(s))runs as one native unit" — the word "native" here is exactly the kind of ambiguity this issue exists to fix.docs/source/user-guide/latest/understanding-comet-plans.md— extend the three-category list to four; update the shuffle section to match the renamedocs/source/contributor-guide/plugin_overview.md— "native Comet operators" → "Rust-implemented operators"; explain that JVM Arrow expressions (and future hybrid impls) slot into the same pipelinedocs/source/contributor-guide/native_shuffle.md+jvm_shuffle.md— align with new operator names; the file titles are accurate so they can staydocs/source/about/gluten_comparison.md— the JVM-on-Arrow path is now a Comet differentiator worth mentioningMigration plan
Open questions
CometNativeExecrename in the same pass, or stay since it's internal-only?CometNativeScanExec) beyond what shows up in plan output?CometExchangewe should ping before the rename lands?