Is your feature request related to a problem or challenge?
Part of #10918, where we are integrating StringView into DataFusion, initially targeting making ClickBench queries faster.
The ClickBench queries call REGEXP_REPLACE on String columns, for example (datafusion/benchmarks/queries/clickbench/queries.sql, line 29 in 5bfc11b):
SELECT REGEXP_REPLACE("Referer", '^https?://(?:www\.)?([^/]+)/.*$', '\1') AS k, AVG(length("Referer")) AS l, COUNT(*) AS c, MIN("Referer") FROM hits WHERE "Referer" <> '' GROUP BY k HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25;
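For readers unfamiliar with the pattern, the following is a hand-written Rust approximation of what the `REGEXP_REPLACE` call above computes for typical single-line URLs: it keeps only the host part of the referer and leaves non-matching input unchanged. It does not use a regex engine, and the names `referer_host` and `host_before_slash` are hypothetical helpers invented for this sketch, not DataFusion functions.

```rust
/// Mirrors `([^/]+)/.*$`: a non-empty run of non-slash characters,
/// then '/', then a tail that contains no newline.
fn host_before_slash(s: &str) -> Option<&str> {
    let slash = s.find('/')?;
    let (host, tail) = (&s[..slash], &s[slash + 1..]);
    if host.is_empty() || tail.contains('\n') {
        None
    } else {
        Some(host)
    }
}

/// Hypothetical helper approximating
/// `REGEXP_REPLACE(referer, '^https?://(?:www\.)?([^/]+)/.*$', '\1')`.
fn referer_host(referer: &str) -> &str {
    // `^https?://`
    let rest = match referer
        .strip_prefix("https://")
        .or_else(|| referer.strip_prefix("http://"))
    {
        Some(rest) => rest,
        None => return referer, // no match: input is returned unchanged
    };
    // `(?:www\.)?` is optional: try with it consumed first, then without.
    rest.strip_prefix("www.")
        .and_then(host_before_slash)
        .or_else(|| host_before_slash(rest))
        .unwrap_or(referer)
}

fn main() {
    assert_eq!(referer_host("https://www.foo.com/a/b"), "foo.com");
    assert_eq!(referer_host("http://bar.com/"), "bar.com");
    // No '/' after the host, so the pattern does not match.
    assert_eq!(referer_host("https://foo.com"), "https://foo.com");
    assert_eq!(referer_host(""), "");
}
```

Because the result is always a slice of the input, this kind of function is a natural fit for a view-based string layout.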
Describe the solution you'd like
I would like to be able to run REGEXP_REPLACE on string view columns. Currently this fails:
> SELECT REGEXP_REPLACE(column1, '^https?://(?:www\\.)?([^/]+)/.*$', '\\1') AS k from string_views;
Error during planning: The regexp_replace function can only accept strings. Got Utf8View
It works fine if you cast the column first to a string:
> SELECT REGEXP_REPLACE(arrow_cast(column1, 'Utf8'), '^https?://(?:www\\.)?([^/]+)/.*$', '\\1') AS k from string_views;
+---------+
| k |
+---------+
| foo.com |
| bar.com |
| foo.com |
| |
+---------+
4 row(s) fetched.
Elapsed 0.012 seconds.
Describe alternatives you've considered
We could add a coercion rule to automatically cast Utf8View to Utf8, which we probably should do in general to make Utf8View easy to work with.
However, that is inefficient because it copies all the strings. It would be much better to implement REGEXP_REPLACE for Utf8View arrays directly.
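To make the copying cost concrete, here is a minimal sketch using toy types rather than the real Arrow arrays. `ToyViewColumn` and `ToyUtf8Column` are hypothetical stand-ins invented for illustration. The view layout is simplified to (offset, length) pairs into one shared buffer, whereas real Arrow views are 16-byte structs that also inline short strings.

```rust
/// Toy stand-in for a Utf8View column: each element is an (offset, len)
/// view into a shared buffer. This is a simplification of the Arrow layout.
struct ToyViewColumn<'a> {
    buffer: &'a str,
    views: Vec<(usize, usize)>,
}

/// Toy stand-in for a Utf8 column: one contiguous buffer plus offsets.
struct ToyUtf8Column {
    data: String,
    offsets: Vec<usize>,
}

impl<'a> ToyViewColumn<'a> {
    /// Reading a value only slices the shared buffer; nothing is copied.
    fn value(&self, i: usize) -> &str {
        let (off, len) = self.views[i];
        &self.buffer[off..off + len]
    }

    /// Analogue of casting Utf8View to Utf8: every viewed byte is copied
    /// into a new contiguous buffer.
    fn cast_to_utf8(&self) -> ToyUtf8Column {
        let mut data = String::new();
        let mut offsets = vec![0];
        for i in 0..self.views.len() {
            data.push_str(self.value(i));
            offsets.push(data.len());
        }
        ToyUtf8Column { data, offsets }
    }
}

impl ToyUtf8Column {
    fn value(&self, i: usize) -> &str {
        &self.data[self.offsets[i]..self.offsets[i + 1]]
    }
}

fn main() {
    let buffer = "https://www.foo.com/page";
    // Two views over the same underlying bytes.
    let col = ToyViewColumn { buffer, views: vec![(0, 24), (12, 7)] };
    let cast = col.cast_to_utf8();
    assert_eq!(col.value(1), "foo.com");
    assert_eq!(cast.value(1), "foo.com");
    // The cast copied 24 + 7 bytes even though both views share one buffer.
    assert_eq!(cast.data.len(), 31);
}
```

A kernel that reads directly from the views avoids this intermediate copy of the input.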
I am hoping we can figure out a straightforward pattern that can generate code for any string function and that works well for StringArray as well as LargeStringArray and StringViewArray.
Here is how the regexp replace function is implemented now: datafusion/datafusion/functions/src/regex/regexpreplace.rs, lines 142 to 182 in 5bfc11b.
I am hoping we can figure out a straightforward pattern that can generate code for any string function that works well for StringArray as well as LargeStringArray and StringViewArray
Agree, I have a prototype StringArrayType (apache/arrow-rs#5931) for LIKE operations (like, ilike, contains, begins_with etc). A similar pattern could be applied here.
Maybe we can directly use the StringArrayType trait here.
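As a rough illustration of the pattern discussed in these comments, the sketch below writes one generic kernel against a small trait and runs it over two different string layouts. `ToyStringArray`, `ToyOffsets`, `ToyViews`, and `map_strings` are hypothetical names invented for this sketch. This is not the `StringArrayType` trait from apache/arrow-rs#5931, whose actual methods may differ.

```rust
/// Hypothetical trait for this sketch only (an assumption, not the
/// arrow-rs `StringArrayType` API).
trait ToyStringArray {
    fn len(&self) -> usize;
    fn value(&self, i: usize) -> &str;
}

/// Contiguous layout with offsets, in the spirit of StringArray
/// and LargeStringArray.
struct ToyOffsets {
    data: String,
    offsets: Vec<usize>,
}

/// Simplified view layout: (offset, len) pairs into a shared buffer.
struct ToyViews {
    buffer: String,
    views: Vec<(usize, usize)>,
}

impl ToyStringArray for ToyOffsets {
    fn len(&self) -> usize {
        self.offsets.len().saturating_sub(1)
    }
    fn value(&self, i: usize) -> &str {
        &self.data[self.offsets[i]..self.offsets[i + 1]]
    }
}

impl ToyStringArray for ToyViews {
    fn len(&self) -> usize {
        self.views.len()
    }
    fn value(&self, i: usize) -> &str {
        let (off, len) = self.views[i];
        &self.buffer[off..off + len]
    }
}

/// One generic kernel serves every layout that implements the trait.
fn map_strings<A: ToyStringArray>(array: &A, f: impl Fn(&str) -> String) -> Vec<String> {
    (0..array.len()).map(|i| f(array.value(i))).collect()
}

fn main() {
    let offsets = ToyOffsets { data: "foobar".to_string(), offsets: vec![0, 3, 6] };
    let views = ToyViews { buffer: "foobar".to_string(), views: vec![(0, 3), (3, 3)] };
    let upper = |s: &str| s.to_uppercase();
    assert_eq!(map_strings(&offsets, upper), vec!["FOO", "BAR"]);
    assert_eq!(map_strings(&views, upper), vec!["FOO", "BAR"]);
}
```

With static dispatch like this, the per-element function is written once and the compiler generates a specialized loop for each layout.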
Additional context
Please remember to target the string-view branch in DataFusion, rather than main, with your PR.