Handle poorly written/untrusted regexp input gracefully via re2j #12628

@itschrispeck

Currently Pinot uses the java.util.regex package. This generally performs well, but it does not gracefully handle patterns that cause catastrophic backtracking.

For clusters in shared environments that take ad hoc queries, a poorly written regex can hold resources indefinitely. In the thread dump below, a query worker thread is still in java.util.regex.Pattern$CharPropertyGreedy.match hours after the problematic queries were executed:

"pqw-54" #121060 [121613] prio=5 os_prio=0 cpu=7109969.11ms elapsed=188953.55s tid=0x00007fd16a303800 nid=121613 runnable  [0x00007fe9c13fc000]
   java.lang.Thread.State: RUNNABLE
	at java.util.regex.Pattern$CharPropertyGreedy.match(java.base@21.0.2/Pattern.java:4470)
	at java.util.regex.Pattern$Start.match(java.base@21.0.2/Pattern.java:3787)
	at java.util.regex.Matcher.search(java.base@21.0.2/Matcher.java:1767)
	at java.util.regex.Matcher.find(java.base@21.0.2/Matcher.java:787)
	at org.apache.pinot.core.operator.filter.predicate.RegexpLikePredicateEvaluatorFactory$RawValueBasedRegexpLikePredicateEvaluator.applySV(RegexpLikePredicateEvaluatorFactory.java:129)

Google's RE2 library was created in part to address this:

RE2 was designed and implemented with an explicit goal of being able to handle regular expressions from untrusted users without risk. One of its primary guarantees is that the match time is linear in the length of the input string. It was also written with production concerns in mind: the parser, the compiler and the execution engines limit their memory usage by working within a configurable budget – failing gracefully when exhausted – and they avoid stack overflow by eschewing recursion.

re2 is used by other DBs such as ClickHouse.

If re2j seems to be the right approach, we could:

  1. Make a clean switch to re2j, which carries some behavior/performance differences
  2. Add a config to allow users to choose their desired implementation
  3. Take an additional argument in regexp_like (I feel we should not leak the implementation like this)
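
To make option 2 more concrete, here is a rough sketch of what a config-selected engine could look like. Everything in it is hypothetical: `RegexEngine`, `RegexMatchers`, and the config values are illustrative names, not existing Pinot classes or settings. The `RE2J` branch is left as a stub so the sketch compiles without the dependency; the comment shows roughly what it would call, given that `com.google.re2j.Pattern` mirrors the `compile`/`matcher`/`find` shape of `java.util.regex.Pattern`.

```java
import java.util.Locale;
import java.util.function.Predicate;
import java.util.regex.Pattern;

/** Hypothetical engine selector; names are illustrative, not Pinot's. */
enum RegexEngine {
  JAVA_UTIL,
  RE2J;

  /** Parses a config value such as "re2j"; falls back to JAVA_UTIL when unset. */
  static RegexEngine fromConfig(String value) {
    if (value == null || value.trim().isEmpty()) {
      return JAVA_UTIL;
    }
    // Throws IllegalArgumentException for unknown engine names.
    return valueOf(value.trim().toUpperCase(Locale.ROOT));
  }
}

/** Hypothetical factory that hides the engine behind a plain Predicate. */
final class RegexMatchers {
  private RegexMatchers() {
  }

  /** Compiles the regex once and returns a "contains a match" predicate. */
  static Predicate<String> compile(RegexEngine engine, String regex) {
    switch (engine) {
      case JAVA_UTIL: {
        Pattern pattern = Pattern.compile(regex);
        return value -> pattern.matcher(value).find();
      }
      case RE2J:
        // With re2j on the classpath this branch would be roughly:
        //   com.google.re2j.Pattern p = com.google.re2j.Pattern.compile(regex);
        //   return value -> p.matcher(value).find();
        throw new UnsupportedOperationException("re2j is not wired into this sketch");
      default:
        throw new IllegalArgumentException("Unknown engine: " + engine);
    }
  }
}
```

A caller would do something like `RegexMatchers.compile(RegexEngine.fromConfig(configValue), "fo+").test("xfooy")`. Keeping the engine behind a `Predicate` means the choice stays out of the `regexp_like` signature, which avoids the leak described in option 3.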
