-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Closed
Labels
enhancementNew feature or requestNew feature or requestperformanceMake DataFusion fasterMake DataFusion faster
Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently by far the worst performing query in ClickBench for datafusion is query 28:
SELECT REGEXP_REPLACE("Referer", '^https?://(?:www.)?([^/]+)/.*$', '1') AS k, AVG(length("Referer")) AS l, COUNT(*) AS c, MIN("Referer") FROM hits WHERE "Referer" <> '' GROUP BY k HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25;
It seems this is due too REGEXP_REPLACE in this query.
Describe the solution you'd like
Introduce a benchmark and improve the performance of REGEXP_REPLACE (with a scalar) in DataFusion.
Describe alternatives you've considered
n/a
Additional context
There are some more efficient patterns in arrow-rs for writing (string/regex) kernels. This could be used here as well.
Optimizations that might be beneficial:
- reusing the null buffer from the input array (instead of rebuilding in the iterator)
- special casing scalar type, so no hashmap has to be used for storing / retrieving compiled regexes
Metadata
Metadata
Assignees
Labels
enhancementNew feature or requestNew feature or requestperformanceMake DataFusion fasterMake DataFusion faster