Skip to content

Improve performance of regex_replace #3518

@Dandandan

Description

@Dandandan

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently by far the worst performing query in ClickBench for datafusion is query 28:

SELECT REGEXP_REPLACE("Referer", '^https?://(?:www.)?([^/]+)/.*$', '1') AS k, AVG(length("Referer")) AS l, COUNT(*) AS c, MIN("Referer") FROM hits WHERE "Referer" <> '' GROUP BY k HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25;

It seems this is due too REGEXP_REPLACE in this query.

Describe the solution you'd like
Introduce a benchmark and improve the performance of REGEXP_REPLACE (with a scalar) in DataFusion.

Describe alternatives you've considered
n/a
Additional context

There are some more efficient patterns in arrow-rs for writing (string/regex) kernels. This could be used here as well.
Optimizations that might be beneficial:

  • reusing the null buffer from the input array (instead of rebuilding in the iterator)
  • special casing scalar type, so no hashmap has to be used for storing / retrieving compiled regexes

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or requestperformanceMake DataFusion faster

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions