
[DISCUSS] Simplify regex engine + incompatibility config model #4310

@andygrove

Description


What is the problem the feature request solves?

Comet has two regex engines (a native Rust engine via DataFusion and a JVM-backed UDF engine), and the current config model exposes two separate axes for selecting which one runs:

  • spark.comet.exec.regexp.engine (values rust or java) — picks the engine.
  • spark.comet.expression.<Expr>.allowIncompatible=true (per regex expression) and spark.comet.expression.regexp.allowIncompatible=true (pattern-level) — opt in to running known-incompatible cases.

The axes interact non-orthogonally and the result is confusing:

  • The Rust regex engine can never be fully Spark/Java-regex compatible because of fundamental differences in regex semantics (\d, \b, backreferences, lookaround, embedded flags, etc.). So opting into the Rust engine is, in effect, opting into incompatibilities.
  • Despite that, users currently still have to set per-expression allowIncompatible flags on top of choosing rust to actually run incompatible cases natively. The two opt-ins encode the same intent twice.
  • Some regex expressions (regexp_extract, regexp_extract_all, regexp_instr) have no Rust implementation at all. Today engine=rust causes those to fall back to Spark, regardless of any allowIncompatible setting. That's a third, implicit behavior layered on top of the two configs.

Describe the potential solution

Treat the engine choice as the regex-wide incompatibility opt-in. Concretely:

  1. Drop the per-expression *.allowIncompatible=true flags for regex expressions (they become redundant).

  2. Extend the existing SupportLevel ADT so an Incompatible result can name the config that already opted the user in:

    case class Incompatible(
        notes: Option[String] = None,
        optedInBy: Option[String] = None
    ) extends SupportLevel
  3. The gating logic in QueryPlanSerde.exprToProtoInternal treats Incompatible(_, Some(key)) as opted-in when that config is set, in addition to the existing per-expression allowIncompatible check.
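The gating check described above could take roughly the following shape. This is a hypothetical sketch, not the actual QueryPlanSerde code: the confIsSet helper and isOptedIn function are invented for illustration, and only the SupportLevel shapes follow the proposal.

```scala
// Sketch of the proposed gating logic. SupportLevel mirrors the ADT from
// this proposal; everything else (isOptedIn, the conf-lookup function) is
// illustrative, not Comet's real plumbing.
sealed trait SupportLevel
case class Compatible(notes: Option[String] = None) extends SupportLevel
case class Incompatible(
    notes: Option[String] = None,
    optedInBy: Option[String] = None) extends SupportLevel
case class Unsupported(notes: Option[String] = None) extends SupportLevel

def isOptedIn(
    level: SupportLevel,
    confIsSet: String => Boolean, // tests e.g. "spark.comet.exec.regexp.engine=rust"
    perExprAllowIncompatible: Boolean): Boolean =
  level match {
    case Compatible(_) => true
    // New path: the named config (e.g. the engine choice) already opted the user in.
    case Incompatible(_, Some(key)) if confIsSet(key) => true
    // Existing path: the per-expression allowIncompatible flag.
    case Incompatible(_, _) => perExprAllowIncompatible
    case Unsupported(_) => false
  }
```

For example, `isOptedIn(Incompatible(None, Some("spark.comet.exec.regexp.engine=rust")), Set("spark.comet.exec.regexp.engine=rust"), perExprAllowIncompatible = false)` returns true without any per-expression flag set.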

  4. Regex serdes read the engine config inside getSupportLevel:

    // RegExpReplace, StringSplit, RLike, etc.: rust supported but incompatible
    override def getSupportLevel(expr: RegExpReplace): SupportLevel =
      CometConf.regexpEngine.get match {
        case "java" => Compatible(None)
        case "rust" => Incompatible(
          Some("Rust regex engine differs from Java in subtle ways"),
          optedInBy = Some("spark.comet.exec.regexp.engine=rust"))
      }
    
    // RegExpExtract, RegExpExtractAll, RegExpInStr: java-only
    override def getSupportLevel(expr: RegExpExtract): SupportLevel =
      CometConf.regexpEngine.get match {
        case "java" => Compatible(None)
        case "rust" => Unsupported(Some(
          "Only supported via the java engine; set spark.comet.exec.regexp.engine=java"))
      }

After this change the model collapses to a single knob:

  • engine=java (default): every regex expression is fully Spark-compatible and runs natively via the JVM UDF bridge (with a JNI cost per batch).
  • engine=rust: regex expressions that have a Rust implementation run natively, with known semantic differences. Expressions without a Rust implementation fall back per the discussion point below. No per-expression flags involved.
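Under the collapsed model, user-facing configuration reduces to a single property (values shown are illustrative of the two modes):

```properties
# Default: every regex expression is fully Spark-compatible (JVM UDF bridge)
spark.comet.exec.regexp.engine=java

# Opt in to native Rust regex, accepting known semantic differences
spark.comet.exec.regexp.engine=rust
```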

Trade-offs

  • getSupportLevel becomes context-dependent (reads the engine config). That's a small philosophical shift: today's serdes don't read global config in getSupportLevel. The alternative is passing a context object through getSupportLevel, which would touch every existing serde. The cost of reading CometConf from a few regex serdes seems worth avoiding that churn.
  • Removes per-expression granularity for regex. A user who wants Rust for regexp_replace but Java for split no longer has a way to express that. I don't think that's a real use case, but worth flagging.

Discussion point: fallback path when the chosen engine doesn't support an expression

When the user picks engine=rust and runs regexp_extract (java-only), there are three plausible behaviors:

  1. Fall back to Spark (current behavior). The expression is tagged with a reason, the planner declines to wrap it in CometNativeExec, and Spark executes that operator. Pro: matches the user's explicit choice — they asked for Rust and we don't silently swap in something else. Con: breaks the native pipeline, losing native execution for the rest of the stage.

  2. Fall back to Comet's Java regex engine (JVM UDF). The expression still runs natively, just via the JVM UDF path instead of Rust. Pro: preserves the native pipeline, likely better end-to-end perf for queries that mix supported and unsupported regex. Con: the user said engine=rust and silently got a Java JNI call. Surprising; harder to reason about perf.

  3. Add a third value engine=rust-then-java (or engine=auto) that explicitly opts into "Rust where possible, Java otherwise." Pure engine=rust keeps option (1) semantics. Pro: makes the intent explicit and observable; users who pick rust-then-java know they are accepting JNI for the unsupported cases. Con: three engine values instead of two.
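The three behaviors could be compared with a small dispatch sketch. All names here (EngineChoice, resolve, hasRustImpl) are hypothetical, purely to make the option-3 semantics concrete:

```scala
// Illustrative sketch of the three fallback behaviors; not Comet code.
sealed trait Engine
case object JavaEngine extends Engine
case object RustEngine extends Engine

sealed trait EngineChoice
case object JavaChoice extends EngineChoice
case object RustChoice extends EngineChoice
case object RustThenJava extends EngineChoice // the proposed third value

// Which engine runs a given regex expression; None means fall back to Spark.
def resolve(choice: EngineChoice, hasRustImpl: Boolean): Option[Engine] =
  choice match {
    case JavaChoice => Some(JavaEngine)                  // every expression supported
    case RustChoice if hasRustImpl => Some(RustEngine)
    case RustChoice => None                              // option 1: Spark fallback
    case RustThenJava if hasRustImpl => Some(RustEngine)
    case RustThenJava => Some(JavaEngine)                // stays native via JVM UDF
  }
```

With rust-then-java, regexp_extract resolves to the Java engine and the native pipeline is preserved; with plain rust it falls back to Spark, matching "what you set is what you get."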

My current lean is option 3 — it preserves "what you set is what you get" for the binary rust/java cases and gives the hybrid path a name. But I'd like input from others before committing.

Related: do we want the same fallback question to apply to patterns the Rust engine can parse but doesn't handle correctly (the current regexp.allowIncompatible territory)? Today those run incompatibly when the user opts in; under option 3 we could route them through Java instead.
