
Port regex_extract#20308

Open
calcaura wants to merge 4 commits into apache:main from calcaura:regexp-extract

Conversation

@calcaura commented Feb 12, 2026

Which issue does this PR close?

Rationale for this change

  • Implement the Spark function `regexp_extract` in DataFusion.

What changes are included in this PR?

What changes are NOT included in this PR?

  • Support for LargeUtf8.
  • Support for Utf8View.

Are these changes tested?

  • Yes, Unit tests + SQL + CI
# Unit tests
cargo test --package datafusion-functions --lib -- regex::regexpextract::tests --nocapture
# SQL tests
cargo test --test sqllogictests -- regexp_extract

Are there any user-facing changes?

Yes (new regex function added to the docs).

@calcaura calcaura marked this pull request as draft February 12, 2026 10:33
@github-actions github-actions bot added the functions Changes to functions implementation label Feb 12, 2026
@github-actions github-actions bot added documentation Improvements or additions to documentation sqllogictest SQL Logic Tests (.slt) labels Feb 12, 2026
@calcaura calcaura marked this pull request as ready for review February 12, 2026 14:45
@Jefffrey
Contributor

cc @Omega359 @comphead did we ever land on a consensus regarding regexp_extract and regexp_substr? We had some PRs for them before and they seemed to lapse, but looks like there was still some discussion on which regex functions we include as part of datafusion

@Omega359
Contributor

> cc @Omega359 @comphead did we ever land on a consensus regarding regexp_extract and regexp_substr? We had some PRs for them before and they seemed to lapse, but looks like there was still some discussion on which regex functions we include as part of datafusion

The last I recall thinking about this was summarized in this comment. The functions, at least as seen in other DBs or query engines, are very similar, with extract being slightly more powerful by allowing one to define which group to extract.

Frankly, I could see datafusion having one function that does both (aliased to regexp_substr and regexp_extract) where an optional 'index' or 'group' can be provided (defaulting to 0) that denotes which capture group to return.

```
| 200 |
+---------------------------------------------------------+
```
Additional examples can be found [here](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/builtin_functions/regexp.rs)
Contributor

If we are not going to add examples to the regexp.rs file I would suggest removing this line.

argument(name = "str", description = "Column or column name"),
argument(
name = "regexp",
description = r#"a string representing a regular expression. The regex string should be a
Contributor

If this is indeed the case (java) this function belongs in the spark crate, not in the main datafusion functions crate.

) -> Result<ColumnarValue> {
let args = &args.args;

if args.len() != 2 && args.len() != 3 {
Contributor

I'm not sure how this could possibly work. If args.len() == 2 it'll fail the second condition, if 3, the first.

Author

If it's neither 2 nor 3, then it's an error.

So, if len == 2, it'll fail the check on 3, hence won't enter the branch.

Maybe written as following could read easier?

Suggested change
if args.len() != 2 && args.len() != 3 {
if !(args.len() == 2 || args.len() == 3) {
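A third spelling of the same arity check (my suggestion, not from the PR) is the `matches!` macro, which reads as a whitelist of accepted argument counts:

```rust
// Equivalent arity check using matches!: accept exactly 2 or 3 arguments.
fn check_arity(len: usize) -> Result<(), String> {
    if !matches!(len, 2 | 3) {
        return Err(format!("regexp_extract expects 2 or 3 arguments, got {len}"));
    }
    Ok(())
}
```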

@comphead
Contributor

From what I remember it was quite complicated to expose rust backed regexp into JVM world, because of rust/jvm regexp processing difference.
The major ones:

  • no backtracking in rust
  • groups
  • quantifiers diff
  • lookaheads

Theoretically we still can expose the function but Spark users need to be careful, accept the nuances and this needs to be documented.

Contributor

@Jefffrey left a comment

Sounds like we should proceed with adding this as a function given other dbs/engines have something similar; however we should probably approach this from an angle of adding it as a datafusion function, but not necessarily to match Spark exactly given what @comphead outlined.

/// Extracts a group that matches `regexp`. If `idx` is not specified,
/// it defaults to 1.
///
/// Matches Spark's DataFrame API: `regexp_extract(e: Column, exp: String, groupIdx: Int)`
Contributor

We probably should remove mention of Spark since we're adding this as a DataFusion function (i.e. not to the datafusion-spark crate)

use std::any::Any;
use std::sync::Arc;

// See https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.regexp_extract.html
Contributor

Same here

}

fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
use DataType::*;
Contributor

We don't need all these checks in return_type; we can simply return Ok(Utf8) as signature should guard this for us

}

// DataFusion passes either scalars or arrays. Convert to arrays.
let len = args
Contributor

We should just use make_scalar_function which handles this boilerplate for us if we don't want to deal with columnarvalues

Author

Thanks, I'll try to look into it!

}

/// Helper to build args for tests and external callers.
pub fn regexp_extract(args: &[ArrayRef]) -> Result<ArrayRef> {
Contributor

This doesn't need to be public


/// Helper to build args for tests and external callers.
pub fn regexp_extract(args: &[ArrayRef]) -> Result<ArrayRef> {
if args.len() != 3 {
Contributor

If it needs 3 arguments we should make the signature 3 distinct arguments instead of a slice

Author

Here's a small omission (either 2 or 3). If there's a desire to always have only 3 I can change it everywhere, but it'll make it diverge slightly from spark (where the group idx is optional and defaults to 1 when not specified).

}

#[cfg(test)]
mod tests {
Contributor

Could we move all these tests to be SLTs instead?

Author

I was following the existing pattern in this mod.

My $0.02: having unit tests written next to the definition helps future evolution (I, for one, find step-by-step debugging much more efficient).

If there's a strong desire to remove them, I can (but all the other unit tests should also be removed in order to be consistent).
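For reference, an SLT version of one of the basic cases might look like the following (the exact file and expected values here are illustrative, not taken from the PR):

```
# regexp_extract with an explicit group index
query T
SELECT regexp_extract('100-200', '(\d+)-(\d+)', 1);
----
100
```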

let pattern = &args[1];
let index = &args[2];

let values_array = values
Contributor

Can use as_string_array for easier downcasting here; same idea for int array below too



Development

Successfully merging this pull request may close these issues.

regexp_extract func from Spark

4 participants