Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement REGEXP_REPLACE for StringView #11025

Closed
Tracked by #10918
alamb opened this issue Jun 20, 2024 · 3 comments
Closed
Tracked by #10918

Implement REGEXP_REPLACE for StringView #11025

alamb opened this issue Jun 20, 2024 · 3 comments
Assignees
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Jun 20, 2024

Is your feature request related to a problem or challenge?

Part of #10918 where we are integrating StringView into DataFusion, initially targeting making ClickBench queries faster

In the ClickBench queries there is a REGEXP_REPLACE function on String columns such as

SELECT REGEXP_REPLACE("Referer", '^https?://(?:www\.)?([^/]+)/.*$', '\1') AS k, AVG(length("Referer")) AS l, COUNT(*) AS c, MIN("Referer") FROM hits WHERE "Referer" <> '' GROUP BY k HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25;

SELECT REGEXP_REPLACE("Referer", '^https?://(?:www\.)?([^/]+)/.*$', '\1') AS k, AVG(length("Referer")) AS l, COUNT(*) AS c, MIN("Referer") FROM hits WHERE "Referer" <> '' GROUP BY k HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25;

Describe the solution you'd like

I would like to be able to run REGEXP_REPLACE on string view columns

Given this table:

> CREATE OR REPLACE TABLE string_views AS VALUES (arrow_cast('http://foo.com/ff', 'Utf8View')), ('http://bar.com/bs'), ('http://foo.com/42'), (NULL);
0 row(s) fetched.
Elapsed 0.011 seconds.

> select *, arrow_typeof(column1) from string_views;
+-------------------+------------------------------------+
| column1           | arrow_typeof(string_views.column1) |
+-------------------+------------------------------------+
| http://foo.com/ff | Utf8View                           |
| http://bar.com/bs | Utf8View                           |
| http://foo.com/42 | Utf8View                           |
|                   | Utf8View                           |
+-------------------+------------------------------------+
4 row(s) fetched.
Elapsed 0.006 seconds.

I would like to be able to run this function

> SELECT REGEXP_REPLACE(column1, '^https?://(?:www\\.)?([^/]+)/.*$', '\\1') AS k from string_views;
Error during planning: The regexp_replace function can only accept strings. Got Utf8View

It works fine if you cast the column first to a string:

> SELECT REGEXP_REPLACE(arrow_cast(column1, 'Utf8'), '^https?://(?:www\\.)?([^/]+)/.*$', '\\1') AS k from string_views;
+---------+
| k       |
+---------+
| foo.com |
| bar.com |
| foo.com |
|         |
+---------+
4 row(s) fetched.
Elapsed 0.012 seconds.

Describe alternatives you've considered

We could add a coercion rule to automatically cast Utf8View to Utf8 which we probably should do in general to make it easy to work with Utf8View

However that is inefficient as it will involve copying all the strings. It would be much better to actually implement REGEXP_REPLACE for Utf8View arrays directly

I am hoping we can figure out a straightforward pattern that can generate code for any string function that works well for StringArray as well as LargeStringArry and StringViewArray

Here is how the regexp replace function is implemented now:

/// Replaces substring(s) matching a PCRE-like regular expression.
///
/// The full list of supported features and syntax can be found at
/// <https://docs.rs/regex/latest/regex/#syntax>
///
/// Supported flags with the addition of 'g' can be found at
/// <https://docs.rs/regex/latest/regex/#grouping-and-flags>
///
/// # Examples
///
/// ```ignore
/// # use datafusion::prelude::*;
/// # use datafusion::error::Result;
/// # #[tokio::main]
/// # async fn main() -> Result<()> {
/// let ctx = SessionContext::new();
/// let df = ctx.read_csv("tests/data/regex.csv", CsvReadOptions::new()).await?;
///
/// // use the regexp_replace function to replace substring(s) without flags
/// let df = df.with_column(
/// "a",
/// regexp_replace(vec![col("values"), col("patterns"), col("replacement")])
/// )?;
/// // use the regexp_replace function to replace substring(s) with flags
/// let df = df.with_column(
/// "b",
/// regexp_replace(vec![col("values"), col("patterns"), col("replacement"), col("flags")]),
/// )?;
///
/// // literals can be used as well
/// let df = df.with_column(
/// "c",
/// regexp_replace(vec![lit("foobarbequebaz"), lit("(bar)(beque)"), lit(r"\2")]),
/// )?;
///
/// df.show().await?;
///
/// # Ok(())
/// # }
/// ```
pub fn regexp_replace<T: OffsetSizeTrait>(args: &[ArrayRef]) -> Result<ArrayRef> {

Additional context

Please remember to target the string-view branch in DataFusion, rather than main with your PR

@alamb alamb added the enhancement New feature or request label Jun 20, 2024
@XiangpengHao
Copy link
Contributor

take

@XiangpengHao
Copy link
Contributor

I am hoping we can figure out a straightforward pattern that can generate code for any string function that works well for StringArray as well as LargeStringArry and StringViewArray

Agree, I have a prototype StringArrayType (apache/arrow-rs#5931) for LIKE operations (like, ilike, contains, begins_with etc). A similar pattern could be applied here.
Maybe we can directly use the StringArrayType trait here.

@alamb
Copy link
Contributor Author

alamb commented Jul 31, 2024

This was implemented in #11556 by @XiangpengHao and I added a test in #11753 to verify

@alamb alamb closed this as completed Jul 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants