## What is the problem the feature request solves?
Comet has two regex engines (the Rust DataFusion regex engine and a JVM-backed UDF engine), and the current config model exposes two seemingly independent axes for selecting which one runs:

- `spark.comet.exec.regexp.engine` ∈ {`rust`, `java`} — picks the engine.
- `spark.comet.expression.<Expr>.allowIncompatible=true` (per regex expression) and `spark.comet.expression.regexp.allowIncompatible=true` (pattern-level) — opt in to running known-incompatible cases.

In practice the axes are not independent at all, and the result is confusing:
- The Rust regex engine can never be fully Spark/Java-regex compatible because of fundamental differences in regex semantics (`\d`, `\b`, backreferences, lookaround, embedded flags, etc.). So opting into the Rust engine is, in effect, opting into incompatibilities.
- Despite that, users currently still have to set per-expression `allowIncompatible` flags on top of choosing `rust` to actually run incompatible cases natively. The two opt-ins encode the same intent twice.
- Some regex expressions (`regexp_extract`, `regexp_extract_all`, `regexp_instr`) have no Rust implementation at all. Today `engine=rust` causes those to fall back to Spark, regardless of any `allowIncompatible` setting. That's a third, implicit behavior layered on top of the two configs.
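To make the double opt-in concrete, running a known-incompatible pattern natively today requires both settings at once. (The expression name `RegExpReplace` below is an illustrative fill-in for the `<Expr>` placeholder, not a confirmed key.)

```
spark.comet.exec.regexp.engine=rust
spark.comet.expression.RegExpReplace.allowIncompatible=true
```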
## Describe the potential solution
Treat the engine choice as the regex-wide incompatibility opt-in. Concretely:
- Drop the per-expression `*.allowIncompatible=true` flags for regex expressions (they become redundant).
- Extend the existing `SupportLevel` ADT so an `Incompatible` result can name the config that already opted the user in:

  ```scala
  case class Incompatible(
      notes: Option[String] = None,
      optedInBy: Option[String] = None)
      extends SupportLevel
  ```
- The gating logic in `QueryPlanSerde.exprToProtoInternal` treats `Incompatible(_, Some(key))` as opted-in when that config is set, in addition to the existing per-expression `allowIncompatible` check.
- Regex serdes read the engine config inside `getSupportLevel`:

  ```scala
  // RegExpReplace, StringSplit, RLike, etc.: rust supported but incompatible
  override def getSupportLevel(expr: RegExpReplace): SupportLevel =
    CometConf.regexpEngine.get match {
      case "java" => Compatible(None)
      case "rust" =>
        Incompatible(
          Some("Rust regex engine differs from Java in subtle ways"),
          optedInBy = Some("spark.comet.exec.regexp.engine=rust"))
    }

  // RegExpExtract, RegExpExtractAll, RegExpInStr: java-only
  override def getSupportLevel(expr: RegExpExtract): SupportLevel =
    CometConf.regexpEngine.get match {
      case "java" => Compatible(None)
      case "rust" =>
        Unsupported(
          Some("Only supported via the java engine; set spark.comet.exec.regexp.engine=java"))
    }
  ```
After this change the model collapses to a single knob:

- `engine=java` (default): every regex expression is fully Spark-compatible and runs natively via the JVM UDF bridge (with a JNI cost per batch).
- `engine=rust`: regex expressions that have a Rust implementation run natively, with known semantic differences. Expressions without a Rust implementation fall back per the discussion point below. No per-expression flags are involved.
## Trade-offs
- `getSupportLevel` becomes context-dependent (reads the engine config). That's a small philosophical shift: today's serdes don't read global config in `getSupportLevel`. The alternative is passing a context object through `getSupportLevel`, which would touch every existing serde. The cost of reading `CometConf` from a few regex serdes seems worth avoiding that churn.
- Removes per-expression granularity for regex. A user who wants Rust for `regexp_replace` but Java for `split` no longer has a way to express that. I don't think that's a real use case, but it's worth flagging.
## Discussion point: fallback path when the chosen engine doesn't support an expression
When the user picks `engine=rust` and runs `regexp_extract` (java-only), there are three plausible behaviors:
1. Fall back to Spark (current behavior). The expression is tagged with a reason, the planner declines to wrap it in `CometNativeExec`, and Spark executes that operator. Pro: matches the user's explicit choice — they asked for Rust and we don't silently swap in something else. Con: breaks the native pipeline, losing native execution for the rest of the stage.
2. Fall back to Comet's Java regex engine (JVM UDF). The expression still runs natively, just via the JVM UDF path instead of Rust. Pro: preserves the native pipeline, likely better end-to-end perf for queries that mix supported and unsupported regex. Con: the user said `engine=rust` and silently got a Java JNI call. Surprising; harder to reason about perf.
3. Add a third value `engine=rust-then-java` (or `engine=auto`) that explicitly opts into "Rust where possible, Java otherwise." Pure `engine=rust` keeps option (1) semantics. Pro: makes the intent explicit and observable; users who pick `rust-then-java` know they are accepting JNI for the unsupported cases. Con: three engine values instead of two.
My current lean is option 3 — it preserves "what you set is what you get" for the binary `rust`/`java` cases and gives the hybrid path a name. But I'd like input from others before committing.
Related: do we want the same fallback question to apply to patterns the Rust engine can parse but doesn't handle correctly (the current `regexp.allowIncompatible` territory)? Today those run incompatibly when the user opts in; under option 3 we could route them through Java instead.
## Additional context
The `allowIncompatible` flags for non-regex expressions (e.g. `cast`) are unaffected; this change only touches the regex family.