Skip to content

[SPARK-57190][SQL] Fix API inconsistency for 4-argument regexp_replace#56240

Closed
pchintar wants to merge 1 commit into
apache:masterfrom
pchintar:regexp_replace
Closed

[SPARK-57190][SQL] Fix API inconsistency for 4-argument regexp_replace#56240
pchintar wants to merge 1 commit into
apache:masterfrom
pchintar:regexp_replace

Conversation

@pchintar
Copy link
Copy Markdown
Contributor

@pchintar pchintar commented Jun 1, 2026

What changes were proposed in this pull request?

Spark SQL already supports the 4-argument form of regexp_replace:

regexp_replace(str, regexp, replacement, position)

However, the corresponding Scala, PySpark, and Spark Connect APIs currently expose only the 3-argument variants.

This PR exposes the existing 4-argument functionality through:

  • Scala API (functions.regexp_replace)
  • PySpark API (functions.regexp_replace)
  • Spark Connect API (functions.regexp_replace)

and adds corresponding Scala, PySpark, and Connect test coverage.

Why are the changes needed?

The underlying SQL functionality already exists and is available through SQL expressions, but it is not accessible through the public Scala, PySpark, and Spark Connect APIs.

This creates an inconsistency between SQL and programmatic interfaces. Exposing the optional position argument aligns the public APIs with existing SQL functionality.

Does this PR introduce any user-facing change?

Yes.

Users can now specify the optional position argument through the Scala, PySpark, and Spark Connect APIs.

Before:

regexp_replace(col("s"), "(\\d+)", "--")

After:

regexp_replace(col("s"), "(\\d+)", "--", 5)

Similarly, PySpark users can now call:

F.regexp_replace("s", r"(\d+)", "--", 5)

How was this patch tested?

Added test coverage in:

  • StringFunctionsSuite
  • FunctionsTests.test_regexp_replace
  • SparkConnectFunctionTests.test_string_functions_multi_args

Verified with:

./build/sbt "sql-api/compile"

./build/sbt "sql/Test/compile"

./build/sbt \
"sql/testOnly org.apache.spark.sql.StringFunctionsSuite -- -z \"regex_replace / regex_extract\""

python3.11 -m pytest \
python/pyspark/sql/tests/test_functions.py::FunctionsTests::test_regexp_replace -v

python3.11 -m pytest \
python/pyspark/sql/tests/connect/test_connect_function.py::SparkConnectFunctionTests::test_string_functions_multi_args -v

Was this patch authored or co-authored using generative AI tooling?

Developed with assistance from GPT-5.

@pchintar pchintar force-pushed the regexp_replace branch 3 times, most recently from e203506 to e9a27de Compare June 1, 2026 08:57
@pchintar pchintar changed the title [SPARK-57190][SQL] Resolve API inconsistency for 4-argument regexp_replace [SPARK-57190][SQL] Fix API inconsistency for 4-argument regexp_replace Jun 1, 2026
@pchintar
Copy link
Copy Markdown
Contributor Author

pchintar commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 blocking, 0 non-blocking, 1 nit.
Clean, well-tested API addition that faithfully mirrors the existing 3-arg pattern. The one blocking item is the @since / versionchanged version.

Design / architecture (1)

  • (blocking) @since / versionchanged should be 4.3.0, not 4.2.0. branch-4.2 is already cut and master is 5.0.0-SNAPSHOT (SPARK-56740), so the next open feature release this new API can ship in is 4.3.0. This is the same fix made in SPARK-56820 for counter_diff ("was not merged in 4.2"). Affects functions.scala (2 sites) and PySpark builtin.py (1 site) — see inline.

Nits: 1 minor item (Scaladoc position vs pos param) — see inline.

* `position`.
*
* @group string_funcs
* @since 4.2.0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@since should be 4.3.0, not 4.2.0: 4.2 is already cut (branch-4.2 exists) and master is on 5.0.0-SNAPSHOT, so this new overload first ships in 4.3.0. Same correction as SPARK-56820 for counter_diff.

Suggested change
* @since 4.2.0
* @since 4.3.0

* `position`.
*
* @group string_funcs
* @since 4.2.0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here — @since should be 4.3.0.

Suggested change
* @since 4.2.0
* @since 4.3.0

Comment thread python/pyspark/sql/functions/builtin.py Outdated

.. versionchanged:: 3.4.0
Supports Spark Connect.
.. versionchanged:: 4.2.0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

versionchanged should be 4.3.0 — the position parameter first appears in 4.3.0, not the already-cut 4.2.0.

Suggested change
.. versionchanged:: 4.2.0
.. versionchanged:: 4.3.0

regexp_replace(e, lit(pattern), lit(replacement))

/**
* Replace all substrings of the specified string value that match regexp with rep, starting at
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: the prose says "starting at position", but the parameter is named pos — the backtick'd position implies an identifier that doesn't exist in this signature. Consider matching the prose to the param name (or renaming pos to position to align with the PySpark binding).

@shrirangmhalgi
Copy link
Copy Markdown
Contributor

The change LGTM. It is a clean API addition that mirrors the existing 3-arg pattern.

One minor suggestion: The PySpark docstring mentions the position parameter but the Examples section only shows 3-arg usage. Consider adding a doctest for the 4-arg form so that users discover it via help(F.regexp_replace):

>>> df = spark.createDataFrame([("100-200",)], ["str"])
>>> df.select(regexp_replace('str', r'(\d+)', '--', 5).alias('d')).collect()
[Row(d='100---')]

(Minor / non-blocking - just for docs completeness.)

@pchintar
Copy link
Copy Markdown
Contributor Author

pchintar commented Jun 3, 2026

@cloud-fan & @shrirangmhalgi thnx for the review! I made some changes as per your feedback:

  • I Updated the Scala @since annotations and the PySpark versionchanged annotation from 4.2.0 to 4.3.0.

  • I added a PySpark doctest example demonstrating the new 4-argument form of regexp_replace with the position parameter, and so the documentation now covers both the existing 3-argument usage and the new 4-argument usage.

  • I kept the Scala parameter name as pos instead of position because functions.scala already defines overloaded public methods named position(...); using position as the parameter name in this overload causes ambiguous resolution against those existing methods inside the same object. So, I instead updated the Scaladoc wording to explicitly reference pos, so that the documentation now matches the signature clearly.

@pchintar
Copy link
Copy Markdown
Contributor Author

pchintar commented Jun 3, 2026

@cloud-fan @cloud-fan all the CI checks have passed as well!

@cloud-fan
Copy link
Copy Markdown
Contributor

thanks, merging to master/4.x!

@cloud-fan cloud-fan closed this in d35df2e Jun 4, 2026
cloud-fan pushed a commit that referenced this pull request Jun 4, 2026
### What changes were proposed in this pull request?

Spark SQL already supports the 4-argument form of `regexp_replace`:

```sql
regexp_replace(str, regexp, replacement, position)
```

However, the corresponding Scala, PySpark, and Spark Connect APIs currently expose only the 3-argument variants.

This PR exposes the existing 4-argument functionality through:

* Scala API (`functions.regexp_replace`)
* PySpark API (`functions.regexp_replace`)
* Spark Connect API (`functions.regexp_replace`)

and adds corresponding Scala, PySpark, and Connect test coverage.

### Why are the changes needed?

The underlying SQL functionality already exists and is available through SQL expressions, but it is not accessible through the public Scala, PySpark, and Spark Connect APIs.

This creates an inconsistency between SQL and programmatic interfaces. Exposing the optional `position` argument aligns the public APIs with existing SQL functionality.

### Does this PR introduce *any* user-facing change?

Yes.

Users can now specify the optional `position` argument through the Scala, PySpark, and Spark Connect APIs.

Before:

```scala
regexp_replace(col("s"), "(\\d+)", "--")
```

After:

```scala
regexp_replace(col("s"), "(\\d+)", "--", 5)
```

Similarly, PySpark users can now call:

```python
F.regexp_replace("s", r"(\d+)", "--", 5)
```

### How was this patch tested?

Added test coverage in:

* `StringFunctionsSuite`
* `FunctionsTests.test_regexp_replace`
* `SparkConnectFunctionTests.test_string_functions_multi_args`

Verified with:

```bash
./build/sbt "sql-api/compile"

./build/sbt "sql/Test/compile"

./build/sbt \
"sql/testOnly org.apache.spark.sql.StringFunctionsSuite -- -z \"regex_replace / regex_extract\""

python3.11 -m pytest \
python/pyspark/sql/tests/test_functions.py::FunctionsTests::test_regexp_replace -v

python3.11 -m pytest \
python/pyspark/sql/tests/connect/test_connect_function.py::SparkConnectFunctionTests::test_string_functions_multi_args -v
```

### Was this patch authored or co-authored using generative AI tooling?

Developed with assistance from GPT-5.

Closes #56240 from pchintar/regexp_replace.

Authored-by: pchintar <89355405+pchintar@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit d35df2e)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@pchintar
Copy link
Copy Markdown
Contributor Author

pchintar commented Jun 4, 2026

thanks, merging to master/4.x!

thnx @cloud-fan I'm new to Spark, so is this approved yet bc I don't see the "Approved" in closed PRs for this?
Image 6-4-26 at 11 57 AM

@shrirangmhalgi
Copy link
Copy Markdown
Contributor

@pchintar In Apache Spark, committers can merge directly after reviewing - they don't always use GitHub's formal "Approve" button. The merged commit itself is the approval. Welcome to the project! 🎉

Reference - 463d804

@pchintar
Copy link
Copy Markdown
Contributor Author

pchintar commented Jun 4, 2026

Thank you @shrirangmhalgi & @cloud-fan for your timely feedback & approval!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants