
[DISCUSS] Simplify regex engine + incompatibility config model #4310

@andygrove

Description


What is the problem the feature request solves?

Comet has two regex engines (a native Rust engine via DataFusion and a JVM-backed UDF engine), and the current config model exposes two separate axes for selecting which one runs:

  • spark.comet.exec.regexp.engine (values rust or java) — picks the engine.
  • spark.comet.expression.<Expr>.allowIncompatible=true (per regex expression) and spark.comet.expression.regexp.allowIncompatible=true (pattern-level) — opt in to running known-incompatible cases.

The axes interact non-orthogonally and the result is confusing:

  • The Rust regex engine can never be fully Spark/Java-regex compatible because of fundamental differences in regex semantics (\d, \b, backreferences, lookaround, embedded flags, etc.). So opting into the Rust engine is, in effect, opting into incompatibilities.
  • Despite that, users currently still have to set per-expression allowIncompatible flags on top of choosing rust to actually run incompatible cases natively. The two opt-ins encode the same intent twice.
  • Some regex expressions (regexp_extract, regexp_extract_all, regexp_instr) have no Rust implementation at all. Today engine=rust causes those to fall back to Spark, regardless of any allowIncompatible setting. That's a third, implicit behavior layered on top of the two configs.

Describe the potential solution

Treat the engine choice as the regex-wide incompatibility opt-in. Concretely:

  1. Drop the per-expression *.allowIncompatible=true flags for regex expressions (they become redundant).

  2. Extend the existing SupportLevel ADT so an Incompatible result can name the config that already opted the user in:

    case class Incompatible(
        notes: Option[String] = None,
        optedInBy: Option[String] = None
    ) extends SupportLevel
  3. The gating logic in QueryPlanSerde.exprToProtoInternal treats Incompatible(_, Some(key)) as opted-in when that config is set, in addition to the existing per-expression allowIncompatible check.
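The gating check described above could take roughly the following shape. This is a hypothetical sketch, not the actual QueryPlanSerde code: the confIsSet helper and isOptedIn function are invented for illustration, and only the SupportLevel shapes follow the proposal.

```scala
// Sketch of the proposed gating logic. SupportLevel mirrors the ADT from
// this proposal; everything else (isOptedIn, the conf-lookup function) is
// illustrative, not Comet's real plumbing.
sealed trait SupportLevel
case class Compatible(notes: Option[String] = None) extends SupportLevel
case class Incompatible(
    notes: Option[String] = None,
    optedInBy: Option[String] = None) extends SupportLevel
case class Unsupported(notes: Option[String] = None) extends SupportLevel

def isOptedIn(
    level: SupportLevel,
    confIsSet: String => Boolean, // tests e.g. "spark.comet.exec.regexp.engine=rust"
    perExprAllowIncompatible: Boolean): Boolean =
  level match {
    case Compatible(_) => true
    // New path: the named config (e.g. the engine choice) already opted the user in.
    case Incompatible(_, Some(key)) if confIsSet(key) => true
    // Existing path: the per-expression allowIncompatible flag.
    case Incompatible(_, _) => perExprAllowIncompatible
    case Unsupported(_) => false
  }
```

For example, `isOptedIn(Incompatible(None, Some("spark.comet.exec.regexp.engine=rust")), Set("spark.comet.exec.regexp.engine=rust"), perExprAllowIncompatible = false)` returns true without any per-expression flag set.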

  4. Regex serdes read the engine config inside getSupportLevel:

    // RegExpReplace, StringSplit, RLike, etc.: rust supported but incompatible
    override def getSupportLevel(expr: RegExpReplace): SupportLevel =
      CometConf.regexpEngine.get match {
        case "java" => Compatible(None)
        case "rust" => Incompatible(
          Some("Rust regex engine differs from Java in subtle ways"),
          optedInBy = Some("spark.comet.exec.regexp.engine=rust"))
      }
    
    // RegExpExtract, RegExpExtractAll, RegExpInStr: java-only
    override def getSupportLevel(expr: RegExpExtract): SupportLevel =
      CometConf.regexpEngine.get match {
        case "java" => Compatible(None)
        case "rust" => Unsupported(Some(
          "Only supported via the java engine; set spark.comet.exec.regexp.engine=java"))
      }

After this change the model collapses to a single knob:

  • engine=java (default): every regex expression is fully Spark-compatible and runs natively via the JVM UDF bridge (with a JNI cost per batch).
  • engine=rust: regex expressions that have a Rust implementation run natively, with known semantic differences. Expressions without a Rust implementation fall back per the discussion point below. No per-expression flags involved.
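Under the collapsed model, user-facing configuration reduces to a single property (values shown are illustrative of the two modes):

```properties
# Default: every regex expression is fully Spark-compatible (JVM UDF bridge)
spark.comet.exec.regexp.engine=java

# Opt in to native Rust regex, accepting known semantic differences
spark.comet.exec.regexp.engine=rust
```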

Trade-offs

  • getSupportLevel becomes context-dependent (reads the engine config). That's a small philosophical shift: today's serdes don't read global config in getSupportLevel. The alternative is passing a context object through getSupportLevel, which would touch every existing serde. The cost of reading CometConf from a few regex serdes seems worth avoiding that churn.
  • Removes per-expression granularity for regex. A user who wants Rust for regexp_replace but Java for split no longer has a way to express that. I don't think that's a real use case, but worth flagging.

Discussion point: fallback path when the chosen engine doesn't support an expression

When the user picks engine=rust and runs regexp_extract (java-only), there are three plausible behaviors:

  1. Fall back to Spark (current behavior). The expression is tagged with a reason, the planner declines to wrap it in CometNativeExec, and Spark executes that operator. Pro: matches the user's explicit choice — they asked for Rust and we don't silently swap in something else. Con: breaks the native pipeline, losing native execution for the rest of the stage.

  2. Fall back to Comet's Java regex engine (JVM UDF). The expression still runs natively, just via the JVM UDF path instead of Rust. Pro: preserves the native pipeline, likely better end-to-end perf for queries that mix supported and unsupported regex. Con: the user said engine=rust and silently got a Java JNI call. Surprising; harder to reason about perf.

  3. Add a third value engine=rust-then-java (or engine=auto) that explicitly opts into "Rust where possible, Java otherwise." Pure engine=rust keeps option (1) semantics. Pro: makes the intent explicit and observable; users who pick rust-then-java know they are accepting JNI for the unsupported cases. Con: three engine values instead of two.
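The three behaviors could be compared with a small dispatch sketch. All names here (EngineChoice, resolve, hasRustImpl) are hypothetical, purely to make the option-3 semantics concrete:

```scala
// Illustrative sketch of the three fallback behaviors; not Comet code.
sealed trait Engine
case object JavaEngine extends Engine
case object RustEngine extends Engine

sealed trait EngineChoice
case object JavaChoice extends EngineChoice
case object RustChoice extends EngineChoice
case object RustThenJava extends EngineChoice // the proposed third value

// Which engine runs a given regex expression; None means fall back to Spark.
def resolve(choice: EngineChoice, hasRustImpl: Boolean): Option[Engine] =
  choice match {
    case JavaChoice => Some(JavaEngine)                  // every expression supported
    case RustChoice if hasRustImpl => Some(RustEngine)
    case RustChoice => None                              // option 1: Spark fallback
    case RustThenJava if hasRustImpl => Some(RustEngine)
    case RustThenJava => Some(JavaEngine)                // stays native via JVM UDF
  }
```

With rust-then-java, regexp_extract resolves to the Java engine and the native pipeline is preserved; with plain rust it falls back to Spark, matching "what you set is what you get."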

My current lean is option 3 — it preserves "what you set is what you get" for the binary rust/java cases and gives the hybrid path a name. But I'd like input from others before committing.

Related: do we want the same fallback question to apply to patterns the Rust engine can parse but doesn't handle correctly (the current regexp.allowIncompatible territory)? Today those run incompatibly when the user opts in; under option 3 we could route them through Java instead.
