
Port regex_extract#20308

Open
calcaura wants to merge 4 commits into apache:main from calcaura:regexp-extract

Conversation

@calcaura commented Feb 12, 2026

Which issue does this PR close?

Rationale for this change

  • Implement the Spark function `regexp_extract` in DataFusion.

What changes are included in this PR?

What changes are NOT included in this PR?

  • Support for LargeUtf8.
  • Support for Utf8View.

Are these changes tested?

  • Yes, Unit tests + SQL + CI
# Unit tests
cargo test --package datafusion-functions --lib -- regex::regexpextract::tests --nocapture
# SQL tests
cargo test --test sqllogictests -- regexp_extract

Are there any user-facing changes?

Yes (new regex function added to the docs).

@calcaura calcaura marked this pull request as draft February 12, 2026 10:33
@github-actions github-actions bot added the functions Changes to functions implementation label Feb 12, 2026
@github-actions github-actions bot added documentation Improvements or additions to documentation sqllogictest SQL Logic Tests (.slt) labels Feb 12, 2026
@calcaura calcaura marked this pull request as ready for review February 12, 2026 14:45
@Jefffrey
Contributor

cc @Omega359 @comphead did we ever land on a consensus regarding regexp_extract and regexp_substr? We had some PRs for them before and they seemed to lapse, but looks like there was still some discussion on which regex functions we include as part of datafusion

@Omega359
Contributor

> cc @Omega359 @comphead did we ever land on a consensus regarding regexp_extract and regexp_substr? We had some PRs for them before and they seemed to lapse, but looks like there was still some discussion on which regex functions we include as part of datafusion

The last I recall thinking about this was summarized in this comment. The functions, at least as seen in other DBs or query engines, are very similar, with extract being slightly more powerful by allowing one to define which group to extract.

Frankly, I could see datafusion having one function that does both (aliased to regexp_substr and regexp_extract) where an optional 'index' or 'group' can be provided (defaulting to 0) that denotes which capture group to return.

```
| 200 |
+---------------------------------------------------------+
```
Additional examples can be found [here](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/builtin_functions/regexp.rs)
Contributor

If we are not going to add examples to the regexp.rs file I would suggest removing this line.

argument(name = "str", description = "Column or column name"),
argument(
name = "regexp",
description = r#"a string representing a regular expression. The regex string should be a
Contributor

If this is indeed the case (java) this function belongs in the spark crate, not in the main datafusion functions crate.

) -> Result<ColumnarValue> {
let args = &args.args;

if args.len() != 2 && args.len() != 3 {
Contributor

I'm not sure how this could possibly work. If args.len() == 2 it'll fail the second condition, if 3, the first.

Author

If it's neither 2 nor 3, then it's an error.

So, if len == 2, it'll fail the check on 3, hence won't enter the branch.

Maybe written as following could read easier?

Suggested change
if args.len() != 2 && args.len() != 3 {
if !(args.len() == 2 || args.len() == 3) {
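A third spelling of the same arity check (my suggestion, not from the PR) is the `matches!` macro, which reads as a whitelist of accepted argument counts:

```rust
// Equivalent arity check using matches!: accept exactly 2 or 3 arguments.
fn check_arity(len: usize) -> Result<(), String> {
    if !matches!(len, 2 | 3) {
        return Err(format!("regexp_extract expects 2 or 3 arguments, got {len}"));
    }
    Ok(())
}
```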

@comphead
Contributor

From what I remember it was quite complicated to expose rust backed regexp into JVM world, because of rust/jvm regexp processing difference.
The major ones:

  • no backtracking in rust
  • groups
  • quantifiers diff
  • lookaheads

Theoretically we still can expose the function but Spark users need to be careful, accept the nuances and this needs to be documented.

Contributor

@Jefffrey left a comment

Sounds like we should proceed with adding this as a function given other dbs/engines have something similar; however we should probably approach this from an angle of adding it as a datafusion function, but not necessarily to match Spark exactly given what @comphead outlined.

/// Extracts a group that matches `regexp`. If `idx` is not specified,
/// it defaults to 1.
///
/// Matches Spark's DataFrame API: `regexp_extract(e: Column, exp: String, groupIdx: Int)`
Contributor

We probably should remove mention of Spark since we're adding this as a DataFusion function (i.e. not to the datafusion-spark crate)

use std::any::Any;
use std::sync::Arc;

// See https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.regexp_extract.html
Contributor

Same here

}

fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
use DataType::*;
Contributor

We don't need all these checks in return_type; we can simply return Ok(Utf8) as signature should guard this for us

}

// DataFusion passes either scalars or arrays. Convert to arrays.
let len = args
Contributor

We should just use make_scalar_function which handles this boilerplate for us if we don't want to deal with columnarvalues

Author

Thanks, I'll try to look into it!

}

/// Helper to build args for tests and external callers.
pub fn regexp_extract(args: &[ArrayRef]) -> Result<ArrayRef> {
Contributor

This doesn't need to be public


/// Helper to build args for tests and external callers.
pub fn regexp_extract(args: &[ArrayRef]) -> Result<ArrayRef> {
if args.len() != 3 {
Contributor

If it needs 3 arguments we should make the signature 3 distinct arguments instead of a slice

Author

Here's a small omission (either 2 or 3). If there's a desire to always have only 3 I can change it everywhere, but it'll make it diverge slightly from spark (where the group idx is optional and defaults to 1 when not specified).

}

#[cfg(test)]
mod tests {
Contributor

Could we move all these tests to be SLTs instead?

Author

I was following the existing pattern in this mod.

My $0.02: having unit tests written next to the definition helps future evolution (I, for one, find step-by-step debugging much more efficient).

If there's a strong desire to remove them, I can (but all the other unit tests should also be removed in order to be consistent).
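For reference, an SLT version of one of the basic cases might look like the following (the exact file and expected values here are illustrative, not taken from the PR):

```
# regexp_extract with an explicit group index
query T
SELECT regexp_extract('100-200', '(\d+)-(\d+)', 1);
----
100
```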

let pattern = &args[1];
let index = &args[2];

let values_array = values
Contributor

Can use as_string_array for easier downcasting here; same idea for int array below too



Development

Successfully merging this pull request may close these issues.

regexp_extract func from Spark

4 participants