Establish nomenclature style guide and audit operator names for clarity

## Problem

Comet's documentation and operator names use **"native"** to mean three different things:

1. **Rust-implemented** — *"`Comet*` nodes that run natively in Rust"* (`docs/source/user-guide/latest/understanding-comet-plans.md`)
2. **On the Comet pipeline** — *"Comet executes Spark's Scala and Java UDFs on the native Comet path"* (`docs/source/user-guide/latest/scala_java_udfs.md`), even though the UDF runs as JVM bytecode produced by codegen
3. **Fully-Rust scan** — `CometNativeScan` (fully Rust) vs `CometBatchScan` (DSv2 with JVM reader producing Arrow)

The shuffle naming is similarly muddled: `CometExchange` ("native shuffle") and `CometColumnarExchange` ("JVM columnar shuffle") are *both* columnar and *both* serialize via Arrow IPC, but only one of the operator names says "Columnar."

This will get worse, not better. The roadmap calls for Java/Scala UDFs that operate directly on Arrow columnar data; the Scala UDF codegen path already lands JVM bytecode inside what we call "the native Comet path." Comet's value proposition is shifting from *"we rewrote operators in Rust"* to *"we keep your data Arrow-native end-to-end"* — and the nomenclature hasn't caught up.

## Proposed value prop

> Comet keeps Spark queries **Arrow-native end-to-end** — operators, expressions, shuffle, and broadcast all stay in Apache Arrow columnar format, avoiding the per-row overhead of Spark's row-based engine. Within the Arrow-native pipeline, operators and expressions execute in either native Rust code (via Apache DataFusion) or JVM code that operates directly on Arrow batches.

## Proposed vocabulary

Three orthogonal concepts the docs currently conflate. We should name each axis distinctly.

### Axis 1 — Pipeline membership

- **Comet pipeline** — operators and expressions Comet handles, regardless of implementation language. Replaces "native Comet path," "on Comet," "accelerated by Comet."
- **Spark fallback** — drops out of the pipeline into row-based Spark execution. (Already used consistently; keep it.)

### Axis 2 — Implementation language

Within the Arrow-native pipeline, expressions and operators may be:

- **Rust-implemented** / **native Rust** — Rust code via DataFusion.
- **JVM-implemented** — Scala/Java code, including codegen'd UDFs.
- **Hybrid** — JVM codegen with native callouts, where the JVM kernel is the primary execution path and Rust is invoked via JNI for specific subexpressions.

The implementation language is an internal detail; from the query's perspective the work is **Arrow-native** regardless. Hybrid impls trade JNI crossing cost for native compute speedups on subexpressions where the win is large enough to justify the boundary.

### Axis 3 — Data format

- **Arrow-native** — the unifying property of the Comet pipeline. Data is in Apache Arrow columnar format throughout; no per-row materialization or transition cost.
- **Arrow IPC** — the wire/disk format used by Arrow-native shuffle and broadcast.
- **Spark rows** / **row-based** — `UnsafeRow`, the format outside the pipeline.
- **Spark columnar** — non-Comet `ColumnarBatch` from a Spark vectorized reader; requires a transition into Arrow-native.
- Reserve **"vectorized"** for SIMD-specific contexts (the Parquet reader). Don't use it as a synonym for columnar.

### A new category to name

**Arrow-native JVM expression** — JVM code that operates directly on Arrow batches (the Scala UDF codegen path; future Arrow UDFs; future hybrid JVM/native impls). Distinct from existing **JVM-side plumbing operators** (`CometUnion`, `CometCoalesce`, `CometCollectLimit`) which coordinate Arrow batches but don't compute over column data.

### Rule for "native"

Bare **"native"** as a vague adjective in prose is banned; replace with the specific axis it refers to. Compound forms where the binding word fixes the meaning are permitted:

| Form | Meaning | OK? |
|---|---|---|
| *native Rust* / *Rust-native* | Rust implementation | yes |
| *Arrow-native* | Arrow columnar throughout | yes |
| *native shuffle* | Rust-implemented shuffle, paired with *JVM shuffle* | yes |
| *native scan* (in `CometNativeScan` context) | Rust scan, paired with `CometBatchScan` | yes |
| *runs natively* / *native execution* / *the native path* | ambiguous | no |

## Proposed plan-node taxonomy

The "three kinds of nodes" framing in `understanding-comet-plans.md` should become **four**:

| Category | What it is | Example |
|---|---|---|
| Arrow-native Rust operator | Rust compute over Arrow batches | `CometProject`, `CometHashAggregate`, `CometSort` |
| Arrow-native JVM expression | JVM compute over Arrow batches; may include hybrid impls that call native code for subexpressions | Scala UDF codegen (today); Arrow UDFs and hybrid JVM/native expressions (future) |
| Arrow-native JVM plumbing | JVM coordination of Arrow batches; no per-row compute | `CometUnion`, `CometCoalesce`, `CometBroadcastExchange` |
| Spark fallback | Row-based Spark execution | `Project`, `HashAggregate`, plain `Exchange` |

Hybrid JVM/native expressions fold into the "Arrow-native JVM expression" row rather than getting a fifth category — the user-visible property is the same (Arrow-native, executed via a JVM kernel), and treating the native callout as an implementation detail keeps the taxonomy from sprawling.

## Proposed operator renames

Touches plan-stability goldens, external tooling, and user docs. Candidates, ranked by how misleading the current name is:

| Current | Proposed | Reason |
|---|---|---|
| `CometExchange` | **`CometNativeShuffleExchange`** | Both shuffle implementations are columnar and both use Arrow IPC; the differentiator is implementation, not format |
| `CometColumnarExchange` | **`CometJvmShuffleExchange`** | Symmetric pair with the above |
| `CometNativeExec` | Keep (or rename to `CometRustExec`) | Internal wrapper, not user-visible in plans; lower priority |
| `CometNativeScan` | Keep | "Native" = Rust here, paired with `CometBatchScan`; already unambiguous as a compound |
| `CometIcebergNativeScan`, `CometCsvNativeScan` | Keep | Same |
| `CometNativeColumnarToRow` | Keep | Distinguishes from JVM `CometColumnarToRow`; consistent compound |

For renames that are user-visible in plans (`CometExchange` is the main one), provide a deprecation alias for one minor release and update plan-stability goldens in the same PR.

## Documentation rewrite scope

Files that need substantive prose changes:

- **`README.md`** — value-prop sentence using the Arrow-native framing
- **`docs/source/index.md`** — same
- **`docs/source/user-guide/latest/scala_java_udfs.md`** — replace "on the native Comet path" with "in the Comet pipeline"; explain the path is Arrow-native even though the UDF is JVM bytecode. Also clarify the existing line *"e.g. `myUdf(upper(s))` runs as one native unit"* — the word "native" here is exactly the kind of ambiguity this issue exists to fix.
- **`docs/source/user-guide/latest/understanding-comet-plans.md`** — extend the three-category list to four; update the shuffle section to match the rename
- **`docs/source/contributor-guide/plugin_overview.md`** — "native Comet operators" → "Rust-implemented operators"; explain that JVM Arrow expressions (and future hybrid impls) slot into the same pipeline
- **`docs/source/contributor-guide/native_shuffle.md`** + **`jvm_shuffle.md`** — align with new operator names; the file titles are accurate so they can stay
- **`docs/source/about/gluten_comparison.md`** — the JVM-on-Arrow path is now a Comet differentiator worth mentioning

## Migration plan

1. Land the docs vocabulary first (no code changes, no plan goldens touched).
2. Land operator renames with deprecation aliases in a single PR per pair.
3. Update plan-stability goldens in the rename PRs.
4. Remove deprecation aliases one minor release later.

## Open questions

- Should `CometNativeExec` rename in the same pass, or stay since it's internal-only?
- Should the issue describe a parallel rename for any internal class names (e.g. `CometNativeScanExec`) beyond what shows up in plan output?
- Any external tools that grep `CometExchange` we should ping before the rename lands?
- **Hybrid JVM/native expressions** — the codegen-dispatch design opens the door to JVM kernels that call into Rust via JNI for specific subexpressions where the speedup outweighs the boundary cost. This isn't on the immediate roadmap but the vocabulary should accommodate it without further naming churn. The proposal treats hybrid impls as a flavor of *Arrow-native JVM expression* rather than a separate category, on the principle that implementation-language mix is an internal detail and the externally relevant property remains "the work happens inside the Arrow-native pipeline." Worth confirming the community agrees with this categorization before it becomes a contested rename later.

Current	Proposed	Reason
`CometExchange`	`CometNativeShuffleExchange`	Both shuffle implementations are columnar and both use Arrow IPC; the differentiator is implementation, not format
`CometColumnarExchange`	`CometJvmShuffleExchange`	Symmetric pair with the above
`CometNativeExec`	Keep (or rename to `CometRustExec`)	Internal wrapper, not user-visible in plans; lower priority
`CometNativeScan`	Keep	"Native" = Rust here, paired with `CometBatchScan`; already unambiguous as a compound
`CometIcebergNativeScan`, `CometCsvNativeScan`	Keep	Same
`CometNativeColumnarToRow`	Keep	Distinguishes from JVM `CometColumnarToRow`; consistent compound

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Establish nomenclature style guide and audit operator names for clarity #4419

Problem

Proposed value prop

Proposed vocabulary

Axis 1 — Pipeline membership

Axis 2 — Implementation language

Axis 3 — Data format

A new category to name

Rule for "native"

Proposed plan-node taxonomy

Proposed operator renames

Documentation rewrite scope

Migration plan

Open questions

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Form	Meaning	OK?
native Rust / Rust-native	Rust implementation	yes
Arrow-native	Arrow columnar throughout	yes
native shuffle	Rust-implemented shuffle, paired with JVM shuffle	yes
native scan (in `CometNativeScan` context)	Rust scan, paired with `CometBatchScan`	yes
runs natively / native execution / the native path	ambiguous	no

Category	What it is	Example
Arrow-native Rust operator	Rust compute over Arrow batches	`CometProject`, `CometHashAggregate`, `CometSort`
Arrow-native JVM expression	JVM compute over Arrow batches; may include hybrid impls that call native code for subexpressions	Scala UDF codegen (today); Arrow UDFs and hybrid JVM/native expressions (future)
Arrow-native JVM plumbing	JVM coordination of Arrow batches; no per-row compute	`CometUnion`, `CometCoalesce`, `CometBroadcastExchange`
Spark fallback	Row-based Spark execution	`Project`, `HashAggregate`, plain `Exchange`

Establish nomenclature style guide and audit operator names for clarity #4419

Description

Problem

Proposed value prop

Proposed vocabulary

Axis 1 — Pipeline membership

Axis 2 — Implementation language

Axis 3 — Data format

A new category to name

Rule for "native"

Proposed plan-node taxonomy

Proposed operator renames

Documentation rewrite scope

Migration plan

Open questions

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions