Description
Currently Pinot uses the java.util.regex package. This generally performs well, but it does not gracefully handle patterns that cause catastrophic backtracking.
For clusters in shared environments that accept ad hoc queries, a poorly written regex can hold resources indefinitely. In the thread dump below, a query worker thread is still in java.util.regex.Pattern$CharPropertyGreedy.match hours after the problematic queries were executed:
"pqw-54" #121060 [121613] prio=5 os_prio=0 cpu=7109969.11ms elapsed=188953.55s tid=0x00007fd16a303800 nid=121613 runnable [0x00007fe9c13fc000]
java.lang.Thread.State: RUNNABLE
at java.util.regex.Pattern$CharPropertyGreedy.match(java.base@21.0.2/Pattern.java:4470)
at java.util.regex.Pattern$Start.match(java.base@21.0.2/Pattern.java:3787)
at java.util.regex.Matcher.search(java.base@21.0.2/Matcher.java:1767)
at java.util.regex.Matcher.find(java.base@21.0.2/Matcher.java:787)
at org.apache.pinot.core.operator.filter.predicate.RegexpLikePredicateEvaluatorFactory$RawValueBasedRegexpLikePredicateEvaluator.applySV(RegexpLikePredicateEvaluatorFactory.java:129)
Google's RE2 library was created in part to address this:
RE2 was designed and implemented with an explicit goal of being able to handle regular expressions from untrusted users without risk. One of its primary guarantees is that the match time is linear in the length of the input string. It was also written with production concerns in mind: the parser, the compiler and the execution engines limit their memory usage by working within a configurable budget – failing gracefully when exhausted – and they avoid stack overflow by eschewing recursion.
re2 is used by other DBs such as ClickHouse.
If re2j seems to be the right approach, we could:
- Make a clean switch to re2j, which carries some behavior and performance differences
- Add a config to allow users to choose their desired implementation
- Take an additional argument in regexp_like (I feel we should not leak the implementation like this)
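To make the config option more concrete, here is a minimal sketch of how an engine choice could sit behind a single compile step. Everything in it is hypothetical: the class name RegexEngineSketch, the Engine values, and the string used to select an engine are illustrations, not existing Pinot classes or config keys. The re2j branch is left as a comment so the sketch runs with only the JDK; it assumes re2j's Pattern.compile / matcher / find calls mirror java.util.regex.

```java
import java.util.Locale;
import java.util.function.Predicate;
import java.util.regex.Pattern;

final class RegexEngineSketch {
  /** Hypothetical engine choices; these are not existing Pinot config values. */
  enum Engine { JAVA_UTIL, RE2J }

  /**
   * Compiles a regex into a "contains a match" predicate using the engine
   * named by engineConfig (case-insensitive). Unknown names are rejected
   * by Engine.valueOf with an IllegalArgumentException.
   */
  static Predicate<String> compile(String engineConfig, String regex) {
    Engine engine = Engine.valueOf(engineConfig.trim().toUpperCase(Locale.ROOT));
    switch (engine) {
      case JAVA_UTIL: {
        // Current behavior: backtracking matcher from the JDK.
        Pattern pattern = Pattern.compile(regex);
        return value -> pattern.matcher(value).find();
      }
      case RE2J:
        // Assumption: with re2j on the classpath this branch would be roughly
        //   com.google.re2j.Pattern p = com.google.re2j.Pattern.compile(regex);
        //   return value -> p.matcher(value).find();
        throw new UnsupportedOperationException("re2j is not wired into this sketch");
      default:
        throw new IllegalArgumentException("Unknown regex engine: " + engineConfig);
    }
  }

  public static void main(String[] args) {
    Predicate<String> matcher = compile("java_util", "pin.t");
    System.out.println(matcher.test("apache pinot")); // true
    System.out.println(matcher.test("apache druid")); // false
  }
}
```

With this shape, callers such as the predicate evaluator would only see a Predicate<String>, so the choice of engine stays out of the regexp_like signature.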