Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regexp_like fusing #8818

Closed
wants to merge 2 commits into from
Closed

regexp_like fusing #8818

wants to merge 2 commits into from

Conversation

gortiz
Copy link
Contributor

@gortiz gortiz commented Jun 2, 2022

When executing queries like:

select col1, col2 from Table where regexp_like(col1, '/r1/') or regexp_like(col1, '/r2/')

Pinot has to scan the referred column twice. This PR creates an optimization that tries to fuse boolean algebra and regexp_like predicates. Specifically, as indicated in the Javadoc:

  • Queries where regexp_like(col1, 'r1') and regexp_like(col1, '/r2/') will be translated to where regexp_like(col1, '(?=r1)(?=r2)')
  • Queries where regexp_like(col1, 'r1') or regexp_like(col1, '/r2/') will be translated to where regexp_like(col1, '(?:r1)|(?:r2)')
  • Queries where not regexp_like(col1, 'r1') will be translated to where regexp_like(col1, '(?!r1)')

There are some tests that apply the optimization to more advanced cases.

Regex can be quite complex. By doing some analysis I'm sure that this optimization will break some regex. For example, predicates that use backrefences may change the semantics when this optimization is applied.

To know if the optimization can be applied or not, it would be necessary to analyze the regex, which would require to parse the regex into a AST. As far as I know Pinot doesn't have that, so this optimization is disabled by default and can be enabled by activating a new query option. There is an heuristic that applies a basic analysis to try to find incompatible expressions like backreferences. Even when the query option is enabled, the regexp that are detected as not optimizable by the heuristic will not be optimized.

Future improvements may include to translate some text_match, where this optimization may be useful.

PD: Originally this PR was focused on optimizing text_match instead of regexp_like. Some comments may still be referred to the original version

@codecov-commenter
Copy link

codecov-commenter commented Jun 2, 2022

Codecov Report

Merging #8818 (44ebaa5) into master (7b9e16b) will decrease coverage by 6.77%.
The diff coverage is 67.44%.

❗ Current head 44ebaa5 differs from pull request most recent head b8a2208. Consider uploading reports for the commit b8a2208 to get more accurate results

@@             Coverage Diff              @@
##             master    #8818      +/-   ##
============================================
- Coverage     69.82%   63.05%   -6.78%     
- Complexity     4658     4666       +8     
============================================
  Files          1741     1760      +19     
  Lines         91474    92058     +584     
  Branches      13674    13777     +103     
============================================
- Hits          63875    58048    -5827     
- Misses        23172    29846    +6674     
+ Partials       4427     4164     -263     
Flag Coverage Δ
integration1 ?
integration2 ?
unittests1 66.41% <70.07%> (+0.18%) ⬆️
unittests2 15.42% <55.81%> (+1.29%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...ot/broker/api/resources/PinotBrokerAppConfigs.java 0.00% <ø> (ø)
...e/pinot/broker/api/resources/PinotBrokerDebug.java 0.00% <ø> (-85.72%) ⬇️
...t/broker/api/resources/PinotBrokerHealthCheck.java 0.00% <ø> (ø)
...pinot/broker/api/resources/PinotBrokerRouting.java 0.00% <ø> (ø)
...pinot/broker/api/resources/PinotClientRequest.java 0.00% <ø> (-58.50%) ⬇️
...roker/requesthandler/BaseBrokerRequestHandler.java 33.02% <ø> (-40.23%) ⬇️
...roker/requesthandler/GrpcBrokerRequestHandler.java 0.00% <0.00%> (-80.00%) ⬇️
...requesthandler/MultiStageBrokerRequestHandler.java 0.00% <0.00%> (ø)
...e/pinot/common/utils/FileUploadDownloadClient.java 18.18% <0.00%> (-41.25%) ⬇️
...pinot/common/utils/fetcher/BaseSegmentFetcher.java 58.82% <0.00%> (-23.00%) ⬇️
... and 582 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7b9e16b...b8a2208. Read the comment docs.

@Jackie-Jiang
Copy link
Contributor

Wondering how much performance gain are we expecting here? text_match will use the Lucene text index to get the matching docs for the request. Essentially the comparison is between 2 simple requests vs 1 combined request. Some perf number would be helpful to justify the change.

@richardstartin
Copy link
Member

I thought this was supposed to be targeting regexp_like

@gortiz
Copy link
Contributor Author

gortiz commented Jun 3, 2022

I thought this was supposed to be targeting regexp_like

My bad. I looked for regexp in the docs and given that only found regexpExact, I thought it was text_match the operation we want to optimize.

Anyway, the idea is the same, so most of the code can be reused

@@ -255,6 +255,9 @@ private BaseFilterOperator constructPhysicalOperator(FilterContext filter, int n
|| textIndexReader instanceof NativeMutableTextIndex) {
throw new UnsupportedOperationException("TEXT_MATCH is not supported on native text index");
}
if (textIndexReader == null) {
throw new UnsupportedOperationException("TEXT_MATCH is not supported when there is no text index");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This moves the correctness check from execution to parsing time, which should be better, but makes ExplainPlanQueriesTest fail.

As indicated by its name, this tests does execute explain plan instead of the actual query. This is why the test didn't fail before. As this PR shouldn't be related to text_match but regexp_like, I'm going to remove this change for simplicity

@richardstartin
Copy link
Member

richardstartin commented Jun 3, 2022

I don’t think it’s possible to express backreferences in a LIKE statement, this optimisation would target long unions of LIKE statements generated by ORMs, and should be most effective in reducing the impact on a cluster when there is no text or FST index in place. I expect this would also help FST based regular expression evaluation, so getting some numbers to justify extending to FST regex evaluators would be great.

@gortiz
Copy link
Contributor Author

gortiz commented Jun 3, 2022

ORMs are not going to generate them, but a user may try to do something like regexp_like(col1, '(.+)=\1'), which will be positive in texts like a=a but false in b=a. I think this is very bizarre, but possible.

What I'm going to do is to try to detect the usage of backreferences in the regexp. To do so, I'm going to discard any regexp_like whose predicate matches with .*(?:(?:\\[0-9]+)|(?:\\k\<\w+\>)).*. The function may have false positives (for example the regexp\\1 will be a false positive) and false negatives (where is more difficult to find examples, maybe playing with utf8?), but it may be good enough to activate the query option by default in a future release.

@gortiz gortiz changed the title text_match fusing regexp_like fusing Jun 3, 2022
@gortiz
Copy link
Contributor Author

gortiz commented Jun 3, 2022

I've changed the implementation and the description of the PR to be focused on regexp_like instead of text_match.

Anyway, I would like to have some benchmarks before this feature is merged.

@richardstartin
Copy link
Member

I don’t believe the native FST supports backreferences, and I don’t think it would be easy to add support either. Not sure about the lucene FST index. As mentioned above, LIKE statements (e.g. LIKE ‘%foo%’) can’t express all regular expression features, which should reduce the scope of the edge cases to be considered.

This definitely needs benchmarking for at least the following scenarios;

  • Single like statement
  • N like statements combined with OR

for each of:

  • no FST index - both raw and dictionarized
  • Lucene FST index
  • Native FST index

@gortiz
Copy link
Contributor Author

gortiz commented Jun 3, 2022

I didn't know LIKE expressions were evaluated as REGEX_LIKE. Reading the code, it seems that the transformation is made in RequestContextUtils, which is evaluated after the filter optimization is done. This means that (in case I don't miss something else) the optimizations included here will not affect LIKE expressions.

I can modify the optimizer to add the ability to transform LIKE expressions into REGEX_LIKE the same way RequestContextUtils does right now. But I'm going to focus on the benchmark that tests the scenarios @richardstartin mentioned, but explicitly using REGEXP_LIKE for now.

@gortiz
Copy link
Contributor Author

gortiz commented Jun 10, 2022

My last commit includes a JMH benchmark. I found several interesting things. Please take a look, some of my conclusions are so shocking that it wouldn't surprise me if another pair of eyes discovered that I did something wrong.

TL;DR: I think we should apply this optimization only when executed in columns without indexes. This may make the implementation more difficult, but the performance gain may be very large.

Benchmarks are:

  • optimal1Like: SELECT INT_COL FROM MyTable WHERE DOMAIN_NAMES LIKE 'domain0'
  • optimal1Regex: SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, 'domain0')
  • optimal10: SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, 'domain\d')
  • increasing10Like: SELECT INT_COL FROM MyTable WHERE DOMAIN_NAMES LIKE '%domain0%' or DOMAIN_NAMES LIKE '%domain1%' or ... or DOMAIN_NAMES LIKE '%domain9%'
  • increasing10Regex: SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES LIKE, '^.domain0.*$') or ... or regexp_like(DOMAIN_NAMES LIKE, '^.domain9.*$')
  • increasing10Fusing: SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, '(?:^.*domain0.*$)|...|(?:^.*domain9.*$)')
  • decreasing9Like, decreasing9Regex and decreasing9Fusing: Are like increasing10Like, increasing10Regex and increasing10Fusing but going from 9 to 1
    • Note: they have in fact 9 elements. This a mistake. Initially queries where created dynamically and I did it wrong. But it was a lucky mistake that shows that Lucene fails when there are 10 fused expression but it doesn't fail when there are 9.

Last results I´ve got were:

Benchmark                               (_fstType)  Mode  Cnt     Score      Error  Units
BenchmarkFuseRegexp.decreasing9Fusing       LUCENE  avgt    5    24.198 ±   13.439  ms/op
BenchmarkFuseRegexp.decreasing9Fusing       NATIVE  avgt    5     0.263 ±    0.019  ms/op
BenchmarkFuseRegexp.decreasing9Fusing         null  avgt    5     0.476 ±    0.040  ms/op
BenchmarkFuseRegexp.decreasing9Like         LUCENE  avgt    5     0.615 ±    0.056  ms/op
BenchmarkFuseRegexp.decreasing9Like         NATIVE  avgt    5     0.515 ±    0.111  ms/op
BenchmarkFuseRegexp.decreasing9Like           null  avgt    5  3596.864 ± 1335.035  ms/op
BenchmarkFuseRegexp.decreasing9Regex        LUCENE  avgt    5     0.593 ±    0.362  ms/op
BenchmarkFuseRegexp.decreasing9Regex        NATIVE  avgt    5     0.422 ±    0.222  ms/op
BenchmarkFuseRegexp.decreasing9Regex          null  avgt    5  1053.673 ±  379.729  ms/op
BenchmarkFuseRegexp.increasing10Fusing      LUCENE  avgt    5     ERROR
BenchmarkFuseRegexp.increasing10Fusing      NATIVE  avgt    5     0.326 ±    0.138  ms/op
BenchmarkFuseRegexp.increasing10Fusing        null  avgt    5     0.226 ±    0.062  ms/op
BenchmarkFuseRegexp.increasing10Like        LUCENE  avgt    5     0.696 ±    0.121  ms/op
BenchmarkFuseRegexp.increasing10Like        NATIVE  avgt    5     0.540 ±    0.049  ms/op
BenchmarkFuseRegexp.increasing10Like          null  avgt    5  3543.339 ±  183.168  ms/op
BenchmarkFuseRegexp.increasing10Regex       LUCENE  avgt    5     0.590 ±    0.208  ms/op
BenchmarkFuseRegexp.increasing10Regex       NATIVE  avgt    5     0.435 ±    0.159  ms/op
BenchmarkFuseRegexp.increasing10Regex         null  avgt    5  1096.238 ±  289.928  ms/op
BenchmarkFuseRegexp.optimal10               LUCENE  avgt    5     0.169 ±    0.066  ms/op
BenchmarkFuseRegexp.optimal10               NATIVE  avgt    5     0.136 ±    0.011  ms/op
BenchmarkFuseRegexp.optimal10                 null  avgt    5     0.149 ±    0.082  ms/op
BenchmarkFuseRegexp.optimal1Like            LUCENE  avgt    5     0.179 ±    0.066  ms/op
BenchmarkFuseRegexp.optimal1Like            NATIVE  avgt    5     0.134 ±    0.029  ms/op
BenchmarkFuseRegexp.optimal1Like              null  avgt    5   470.387 ±  187.235  ms/op
BenchmarkFuseRegexp.optimal1Regex           LUCENE  avgt    5     0.172 ±    0.029  ms/op
BenchmarkFuseRegexp.optimal1Regex           NATIVE  avgt    5     0.141 ±    0.027  ms/op
BenchmarkFuseRegexp.optimal1Regex             null  avgt    5   243.160 ±   59.722  ms/op

After analyzing it, there are some points that are quite clear:

We cannot use this optimization with Lucene

The cost is too high when the expression gets complex. It even with java.lang.RuntimeException: org.apache.lucene.util.automaton.TooComplexToDeterminizeException: Determinizing .*(?:^.*domain0.*$)|(?:^.*domain1.*$)|(?:^.*domain2.*$)|(?:^.*domain3.*$)|(?:^.*domain4.*$)|(?:^.*domain5.*$)|(?:^.*domain6.*$)|(?:^.*domain7.*$)|(?:^.*domain8.*$)|(?:^.*domain9.*$).* would result in more than 10000 states. when the expression have 10 branches.

In general COL like 'whatever' expression is 2x-3x worse than regexp_like(COL, 'similar expression')

EDIT: This section is just wrong. The performance difference between optimal1Regex and optimal1Like is that optimal1regex uses features that the transformation from like to regex is not using, although it doesn't seem very difficult to achieve.

Reading the code it looks to me that Pinot automatically translate one to the other, but it that transformation is not applied when BaseQueriesTest.getBrokerResponse is called (which is great, because we can test the difference).

I've also tried the controller UI interface and this issue is repeated there. Using the github_events dataset,

select * from githubEvents where regexp_like(actor_url, '^.*victor\d.*$') limit 1

is quite faster than

select * from githubEvents where actor_url LIKE '%victor7%' limit 1

Pinot regex execute faster wildcards than a single values

It is quite clear in the optimal cases:

-- this is just a `where regexp_like(DOMAIN_NAMES, 'domain\d')`
BenchmarkFuseRegexp.optimal10                 null  avgt    5     0.149 ±    0.082  ms/op
-- this is just a `where regexp_like(DOMAIN_NAMES, 'domain0')`
BenchmarkFuseRegexp.optimal1Regex             null  avgt    5   243.160 ±   59.722  ms/op

At the beginning I thought it was an error in my code, but it is easy to try that using the github_events dataset.

select * from githubEvents where regexp_like(actor_url, 'victor\d') limit 1

Responds quite faster than

select * from githubEvents where regexp_like(actor_url, 'victor7') limit 1

The difference is not the 2.000x that the benchmark shows, but is consistently between 2-3x.

I'm guessing that the issue actually comes from Java regex engine, but it is just a hunch, I didn't actually test it.

Native or Lucene indexes don't actually increase the performance of this regexp

This benchmark wasn't focused on that and more studies should be done, but it seems clear that:

When there are several regexp in the actual expression, indexes are very useful.

Examples:

BenchmarkFuseRegexp.decreasing9Regex        LUCENE  avgt    5     0.593 ±    0.362  ms/op
BenchmarkFuseRegexp.decreasing9Regex        NATIVE  avgt    5     0.422 ±    0.222  ms/op
BenchmarkFuseRegexp.decreasing9Regex          null  avgt    5  1053.673 ±  379.729  ms/op

BenchmarkFuseRegexp.increasing10Regex       LUCENE  avgt    5     0.590 ±    0.208  ms/op
BenchmarkFuseRegexp.increasing10Regex       NATIVE  avgt    5     0.435 ±    0.159  ms/op
BenchmarkFuseRegexp.increasing10Regex         null  avgt    5  1096.238 ±  289.928  ms/op
  • When there is only one regexp in the actual expressions, indexes are not very useful.
    For example:
-- this actually has a single, quite complex regex expression
BenchmarkFuseRegexp.decreasing9Fusing       LUCENE  avgt    5    24.198 ±   13.439  ms/op
BenchmarkFuseRegexp.decreasing9Fusing       NATIVE  avgt    5     0.263 ±    0.019  ms/op
BenchmarkFuseRegexp.decreasing9Fusing         null  avgt    5     0.476 ±    0.040  ms/op
-- this actually has a single, quite complex regex expression
BenchmarkFuseRegexp.increasing10Fusing      LUCENE  avgt    5     ERROR
BenchmarkFuseRegexp.increasing10Fusing      NATIVE  avgt    5     0.326 ±    0.138  ms/op
BenchmarkFuseRegexp.increasing10Fusing        null  avgt    5     0.226 ±    0.062  ms/op
-- this is just a `where regexp_like(DOMAIN_NAMES, 'domain\d')`
BenchmarkFuseRegexp.optimal10               LUCENE  avgt    5     0.169 ±    0.066  ms/op
BenchmarkFuseRegexp.optimal10               NATIVE  avgt    5     0.136 ±    0.011  ms/op
BenchmarkFuseRegexp.optimal10                 null  avgt    5     0.149 ±    0.082  ms/op
-- this is just a `where regexp_like(DOMAIN_NAMES, 'domain0')`
BenchmarkFuseRegexp.optimal1Regex           LUCENE  avgt    5     0.172 ±    0.029  ms/op
BenchmarkFuseRegexp.optimal1Regex           NATIVE  avgt    5     0.141 ±    0.027  ms/op
BenchmarkFuseRegexp.optimal1Regex             null  avgt    5   243.160 ±   59.722  ms/op

The last one shows a big difference, but it may be related to the issue with regex without wildcards that I discussed above.

@Jackie-Jiang
Copy link
Contributor

@gortiz Thanks for the detailed analysis. This is awesome, and very interesting!

We cannot use this optimization with Lucene

The cost is too high when the expression gets complex. It even with java.lang.RuntimeException: org.apache.lucene.util.automaton.TooComplexToDeterminizeException: Determinizing .*(?:^.*domain0.*$)|(?:^.*domain1.*$)|(?:^.*domain2.*$)|(?:^.*domain3.*$)|(?:^.*domain4.*$)|(?:^.*domain5.*$)|(?:^.*domain6.*$)|(?:^.*domain7.*$)|(?:^.*domain8.*$)|(?:^.*domain9.*$).* would result in more than 10000 states. when the expression have 10 branches.

Curious how native FST handles it, and why is it so much faster than the Lucene FST. cc @atris

In general COL like 'whatever' expression is 2x-3x worse than regexp_like(COL, 'similar expression')

Reading the code it looks to me that Pinot automatically translate one to the other, but it that transformation is not applied when BaseQueriesTest.getBrokerResponse is called (which is great, because we can test the difference).

I think the reason is that it is not apple-to-apple comparison. col LIKE 'abc' is actually equivalent to regexp_like(col, '^abc$')

Pinot regex execute faster wildcards than a single values

It is quite clear in the optimal cases:

-- this is just a `where regexp_like(DOMAIN_NAMES, 'domain\d')`
BenchmarkFuseRegexp.optimal10                 null  avgt    5     0.149 ±    0.082  ms/op
-- this is just a `where regexp_like(DOMAIN_NAMES, 'domain0')`
BenchmarkFuseRegexp.optimal1Regex             null  avgt    5   243.160 ±   59.722  ms/op

At the beginning I thought it was an error in my code, but it is easy to try that using the github_events dataset.

select * from githubEvents where regexp_like(actor_url, 'victor\d') limit 1

Responds quite faster than

select * from githubEvents where regexp_like(actor_url, 'victor7') limit 1

The difference is not the 2.000x that the benchmark shows, but is consistently between 2-3x.

I'm guessing that the issue actually comes from Java regex engine, but it is just a hunch, I didn't actually test it.

Let's re-do the benchmark. I don't believe there can be 2000x difference. Also, this is very counter-intuitive, so I'd suggest let's figure out if it is actually from the java regex engine, or because of our code.

When there are several regexp in the actual expression, indexes are very useful.

Examples:

BenchmarkFuseRegexp.decreasing9Regex        LUCENE  avgt    5     0.593 ±    0.362  ms/op
BenchmarkFuseRegexp.decreasing9Regex        NATIVE  avgt    5     0.422 ±    0.222  ms/op
BenchmarkFuseRegexp.decreasing9Regex          null  avgt    5  1053.673 ±  379.729  ms/op

BenchmarkFuseRegexp.increasing10Regex       LUCENE  avgt    5     0.590 ±    0.208  ms/op
BenchmarkFuseRegexp.increasing10Regex       NATIVE  avgt    5     0.435 ±    0.159  ms/op
BenchmarkFuseRegexp.increasing10Regex         null  avgt    5  1096.238 ±  289.928  ms/op

I feel the inefficiency is from regexp_like(DOMAIN_NAMES, 'domain0') (exact match) which takes much longer than expected. We should check what is causing the huge performance difference

@atris
Copy link
Contributor

atris commented Jun 10, 2022

Yeah, the REGEXP_LIKE translation from LIKE is not equivalent.

Also, remember that any FST traversal will benefit most from a selective expression rather than a full term match.

Lastly, why do you say that indices don't help for single regexes? Unless I am misreading the benchmark result, even for a single regex case, native index is pretty fast?

@Jackie-Jiang
Copy link
Contributor

Thanks @atris for taking a look. I also have a question around how Lucene and native FST handle the query, specifically for these results:

SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, '(?:^.*domain9.*$)|...|(?:^.*domain1.*$)')
BenchmarkFuseRegexp.decreasing9Fusing       LUCENE  avgt    5    24.198 ±   13.439  ms/op
BenchmarkFuseRegexp.decreasing9Fusing       NATIVE  avgt    5     0.263 ±    0.019  ms/op
BenchmarkFuseRegexp.decreasing9Fusing         null  avgt    5     0.476 ±    0.040  ms/op

SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, '(?:^.*domain0.*$)|...|(?:^.*domain9.*$)')
BenchmarkFuseRegexp.increasing10Fusing      LUCENE  avgt    5     ERROR
BenchmarkFuseRegexp.increasing10Fusing      NATIVE  avgt    5     0.326 ±    0.138  ms/op
BenchmarkFuseRegexp.increasing10Fusing        null  avgt    5     0.226 ±    0.062  ms/op

We can see that Lucene FST is very slow in this case, but native is still performing okay. Trying to understand what is causing this difference.

@atris
Copy link
Contributor

atris commented Jun 13, 2022

I will check out the branch and give some comments tomorrow

@richardstartin
Copy link
Member

Thanks @atris for taking a look. I also have a question around how Lucene and native FST handle the query, specifically for these results:

SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, '(?:^.*domain9.*$)|...|(?:^.*domain1.*$)')
BenchmarkFuseRegexp.decreasing9Fusing       LUCENE  avgt    5    24.198 ±   13.439  ms/op
BenchmarkFuseRegexp.decreasing9Fusing       NATIVE  avgt    5     0.263 ±    0.019  ms/op
BenchmarkFuseRegexp.decreasing9Fusing         null  avgt    5     0.476 ±    0.040  ms/op

SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, '(?:^.*domain0.*$)|...|(?:^.*domain9.*$)')
BenchmarkFuseRegexp.increasing10Fusing      LUCENE  avgt    5     ERROR
BenchmarkFuseRegexp.increasing10Fusing      NATIVE  avgt    5     0.326 ±    0.138  ms/op
BenchmarkFuseRegexp.increasing10Fusing        null  avgt    5     0.226 ±    0.062  ms/op

We can see that Lucene FST is very slow in this case, but native is still performing okay. Trying to understand what is causing this difference.

The difference will be due to determinization of the automaton, which the native implementation doesn’t do.

@atris
Copy link
Contributor

atris commented Jun 13, 2022

Thanks @atris for taking a look. I also have a question around how Lucene and native FST handle the query, specifically for these results:

SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, '(?:^.*domain9.*$)|...|(?:^.*domain1.*$)')
BenchmarkFuseRegexp.decreasing9Fusing       LUCENE  avgt    5    24.198 ±   13.439  ms/op
BenchmarkFuseRegexp.decreasing9Fusing       NATIVE  avgt    5     0.263 ±    0.019  ms/op
BenchmarkFuseRegexp.decreasing9Fusing         null  avgt    5     0.476 ±    0.040  ms/op

SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, '(?:^.*domain0.*$)|...|(?:^.*domain9.*$)')
BenchmarkFuseRegexp.increasing10Fusing      LUCENE  avgt    5     ERROR
BenchmarkFuseRegexp.increasing10Fusing      NATIVE  avgt    5     0.326 ±    0.138  ms/op
BenchmarkFuseRegexp.increasing10Fusing        null  avgt    5     0.226 ±    0.062  ms/op

We can see that Lucene FST is very slow in this case, but native is still performing okay. Trying to understand what is causing this difference.

The difference will be due to determinization of the automaton, which the native implementation doesn’t do.

I doubt determinization alone can cause such a massive difference.

@atris
Copy link
Contributor

atris commented Jun 13, 2022

@Jackie-Jiang one thing is that the native index aggressively tries to prune branches, both while building the query automaton and during the traversal.

That is also what causes general suffix and prefix queries to be much faster.

@richardstartin
Copy link
Member

Thanks @atris for taking a look. I also have a question around how Lucene and native FST handle the query, specifically for these results:

SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, '(?:^.*domain9.*$)|...|(?:^.*domain1.*$)')
BenchmarkFuseRegexp.decreasing9Fusing       LUCENE  avgt    5    24.198 ±   13.439  ms/op
BenchmarkFuseRegexp.decreasing9Fusing       NATIVE  avgt    5     0.263 ±    0.019  ms/op
BenchmarkFuseRegexp.decreasing9Fusing         null  avgt    5     0.476 ±    0.040  ms/op

SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, '(?:^.*domain0.*$)|...|(?:^.*domain9.*$)')
BenchmarkFuseRegexp.increasing10Fusing      LUCENE  avgt    5     ERROR
BenchmarkFuseRegexp.increasing10Fusing      NATIVE  avgt    5     0.326 ±    0.138  ms/op
BenchmarkFuseRegexp.increasing10Fusing        null  avgt    5     0.226 ±    0.062  ms/op

We can see that Lucene FST is very slow in this case, but native is still performing okay. Trying to understand what is causing this difference.

The difference will be due to determinization of the automaton, which the native implementation doesn’t do.

I doubt determinization alone can cause such a massive difference.

Well, it’s an exponential time algorithm and Lucene refuses to even attempt to determinize some of the automata in the benchmark because they are too large. For an apples to apples comparison, native determinization should be re-enabled (by temporarily reverting/reversing #8237)

@atris
Copy link
Contributor

atris commented Jun 13, 2022

Thanks @atris for taking a look. I also have a question around how Lucene and native FST handle the query, specifically for these results:

SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, '(?:^.*domain9.*$)|...|(?:^.*domain1.*$)')
BenchmarkFuseRegexp.decreasing9Fusing       LUCENE  avgt    5    24.198 ±   13.439  ms/op
BenchmarkFuseRegexp.decreasing9Fusing       NATIVE  avgt    5     0.263 ±    0.019  ms/op
BenchmarkFuseRegexp.decreasing9Fusing         null  avgt    5     0.476 ±    0.040  ms/op

SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, '(?:^.*domain0.*$)|...|(?:^.*domain9.*$)')
BenchmarkFuseRegexp.increasing10Fusing      LUCENE  avgt    5     ERROR
BenchmarkFuseRegexp.increasing10Fusing      NATIVE  avgt    5     0.326 ±    0.138  ms/op
BenchmarkFuseRegexp.increasing10Fusing        null  avgt    5     0.226 ±    0.062  ms/op

We can see that Lucene FST is very slow in this case, but native is still performing okay. Trying to understand what is causing this difference.

The difference will be due to determinization of the automaton, which the native implementation doesn’t do.

I doubt determinization alone can cause such a massive difference.

Well, it’s an exponential time algorithm and Lucene refuses to even attempt to determinize some of the automata in the benchmark because they are too large. For an apples to apples comparison, native determinization should be re-enabled (by temporarily reverting/reversing #8237)

I am still failing to understand what the concern is.

Even if its plain determinization that causes native to be significantly faster, is that not a good thing?

@richardstartin
Copy link
Member

Thanks @atris for taking a look. I also have a question around how Lucene and native FST handle the query, specifically for these results:

SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, '(?:^.*domain9.*$)|...|(?:^.*domain1.*$)')
BenchmarkFuseRegexp.decreasing9Fusing       LUCENE  avgt    5    24.198 ±   13.439  ms/op
BenchmarkFuseRegexp.decreasing9Fusing       NATIVE  avgt    5     0.263 ±    0.019  ms/op
BenchmarkFuseRegexp.decreasing9Fusing         null  avgt    5     0.476 ±    0.040  ms/op

SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, '(?:^.*domain0.*$)|...|(?:^.*domain9.*$)')
BenchmarkFuseRegexp.increasing10Fusing      LUCENE  avgt    5     ERROR
BenchmarkFuseRegexp.increasing10Fusing      NATIVE  avgt    5     0.326 ±    0.138  ms/op
BenchmarkFuseRegexp.increasing10Fusing        null  avgt    5     0.226 ±    0.062  ms/op

We can see that Lucene FST is very slow in this case, but native is still performing okay. Trying to understand what is causing this difference.

The difference will be due to determinization of the automaton, which the native implementation doesn’t do.

I doubt determinization alone can cause such a massive difference.

Well, it’s an exponential time algorithm and Lucene refuses to even attempt to determinize some of the automata in the benchmark because they are too large. For an apples to apples comparison, native determinization should be re-enabled (by temporarily reverting/reversing #8237)

I am still failing to understand what the concern is.

Even if its plain determinization that causes native to be significantly faster, is that not a good thing?

@Jackie-Jiang was asking about the root cause of the difference. I suggested that the root cause is likely to be determinization, but it needs to be investigated. Where has concern been expressed?

@richardstartin
Copy link
Member

richardstartin commented Jun 13, 2022

Just circling back here having taken a quick look at BenchmarkFuseRegexp.decreasing9Fusing - I got similar results to Gonzalo:

Benchmark                              (_fstType)  Mode  Cnt   Score   Error  Units
BenchmarkFuseRegexp.decreasing9Fusing      LUCENE  avgt    5  32.525 ± 2.733  ms/op
BenchmarkFuseRegexp.decreasing9Fusing      NATIVE  avgt    5   0.487 ± 0.123  ms/op
BenchmarkFuseRegexp.decreasing9Fusing        null  avgt    5   0.845 ± 0.204  ms/op

Lucene spends virtually all of its time in minimization (which I mistakenly referred to as determinization earlier, but actually entails determinization):

Screenshot 2022-06-13 at 21 10 51

Now if I enable minimization in the native implementation by reverting #8237, the scores change drastically:

Benchmark                              (_fstType)  Mode  Cnt    Score    Error  Units
BenchmarkFuseRegexp.decreasing9Fusing      LUCENE  avgt    5   32.399 ±  3.005  ms/op
BenchmarkFuseRegexp.decreasing9Fusing      NATIVE  avgt    5  366.520 ± 40.574  ms/op
BenchmarkFuseRegexp.decreasing9Fusing        null  avgt    5    0.819 ±  0.172  ms/op

As do the profiles. Without minimization, there is no clear single bottleneck:
Screenshot 2022-06-13 at 21 14 47

With minimization, there is a single bottleneck, and it's in minimization :

Screenshot 2022-06-13 at 21 23 10

As to whether this is cause for concern is beside the point; this is the root cause of the difference observed, rather than a quirk of the benchmark which needs to be analysed and removed.

@gortiz
Copy link
Contributor Author

gortiz commented Jun 14, 2022

Lastly, why do you say that indices don't help for single regexes? Unless I am misreading the benchmark result, even for a single regex case, native index is pretty fast?

My benchmark may not be the best and it is only measuring a few regexp cases, but I wouldn't say that native gives us a significant advantage:

-- this actually has a single, quite complex regex expression
BenchmarkFuseRegexp.decreasing9Fusing       LUCENE  avgt    5    24.198 ±   13.439  ms/op
BenchmarkFuseRegexp.decreasing9Fusing       NATIVE  avgt    5     0.263 ±    0.019  ms/op <-- here native is faster by 1.8x
BenchmarkFuseRegexp.decreasing9Fusing         null  avgt    5     0.476 ±    0.040  ms/op
-- this actually has a single, quite complex regex expression
BenchmarkFuseRegexp.increasing10Fusing      LUCENE  avgt    5     ERROR
BenchmarkFuseRegexp.increasing10Fusing      NATIVE  avgt    5     0.326 ±    0.138  ms/op <-- here native is slower by 1.4x, although the difference is lower than the error margin
BenchmarkFuseRegexp.increasing10Fusing        null  avgt    5     0.226 ±    0.062  ms/op
-- this is just a `where regexp_like(DOMAIN_NAMES, 'domain\d')`
BenchmarkFuseRegexp.optimal10               LUCENE  avgt    5     0.169 ±    0.066  ms/op
BenchmarkFuseRegexp.optimal10               NATIVE  avgt    5     0.136 ±    0.011  ms/op  <-- here native is faster by 1.1x, although the difference is lower than the error margin
BenchmarkFuseRegexp.optimal10                 null  avgt    5     0.149 ±    0.082  ms/op

So we have one case where native is 1.8x faster and some others were it is equivalent to the java regex engine, which is known by its lack of performance. I mean, it is not like the native index is atrocious, but I would expect more from an index than a situational 1.8x performance increase.

To be clear, as I already said, this is not a benchmark on the FST itself and it is very narrow, so I don't want to get conclusions about whether we should or we should not use these indexes. I think this benchmark proves that there are very specific cases where these indexes are not very useful and therefore it would be nice to invest time to design proper tests with different kind of regex to verify whether FST indexes are in general useful evaluating regex or not.

@gortiz
Copy link
Contributor Author

gortiz commented Jun 14, 2022

I was able to invest more time in this topic today and I have to say that my initial benchmark was completely useless. Apart from some typos in the queries (which may have some impact), there were two problems that completely fooled me:

  • The implicit limit introduced by Pinot. As only the first 10 elements were returned, queries that actually found elements could finish faster.
  • Something is wrong with Lucene and Native indexes. I guess I'm not configuring them correctly, but they simply return 0 results when the regexp is a little bit complex (like having groups).

Therefore I've changed all the tests to return a count and each iteration is actually verifying whether the expected value is returned or not, in which case the benchmark is stopped. As a result, all fusing tests fail whenever there is an index. I hope it is something related to my test (ie I didn't configure the index properly) and not some semantic discrepancy these indexes have in relation how each engine evaluates regexp, as that could mean that there are customers receiving false results.

These are the results I've got. Note that some combinations are not shown. For example, decreasing9Fusing and LUCENE or NATIVE. That means that these combinations did not return what it was expected (in all these particular cases it means that they return 0 rows). Note that apart from the fusing case, benchmarks like optimal10, which is regexp_like(DOMAIN_NAMES, 'domain\d') did also failed.

Benchmark                                  (_fstType)  Mode  Cnt     Score      Error  Units
BenchmarkFuseRegexp.decreasing9Fusing            null  avgt    5  1438.185 ±   73.963  ms/op
BenchmarkFuseRegexp.decreasing9Like            LUCENE  avgt    5    41.927 ±    0.984  ms/op
BenchmarkFuseRegexp.decreasing9Like            NATIVE  avgt    5    40.232 ±    5.597  ms/op
BenchmarkFuseRegexp.decreasing9Like              null  avgt    5  3957.902 ±  154.228  ms/op
BenchmarkFuseRegexp.decreasing9Regex           LUCENE  avgt    5    40.744 ±    2.059  ms/op
BenchmarkFuseRegexp.decreasing9Regex           NATIVE  avgt    5    61.443 ±    1.301  ms/op
BenchmarkFuseRegexp.decreasing9Regex             null  avgt    5  2116.747 ±  602.570  ms/op
BenchmarkFuseRegexp.increasing10Fusing           null  avgt    5  1320.869 ±   45.509  ms/op
BenchmarkFuseRegexp.increasing10Like           LUCENE  avgt    5    42.253 ±    0.695  ms/op
BenchmarkFuseRegexp.increasing10Like           NATIVE  avgt    5    39.049 ±    1.829  ms/op
BenchmarkFuseRegexp.increasing10Like             null  avgt    5  5071.207 ± 3134.887  ms/op
BenchmarkFuseRegexp.increasing10Regex          LUCENE  avgt    5    44.084 ±    2.737  ms/op
BenchmarkFuseRegexp.increasing10Regex          NATIVE  avgt    5    39.650 ±    1.567  ms/op
BenchmarkFuseRegexp.increasing10Regex            null  avgt    5  2318.479 ±   75.632  ms/op
BenchmarkFuseRegexp.optimal10                    null  avgt    5   185.757 ±    3.426  ms/op
BenchmarkFuseRegexp.optimal10NotFound          LUCENE  avgt    5     0.212 ±    0.136  ms/op
BenchmarkFuseRegexp.optimal10NotFound          NATIVE  avgt    5     0.201 ±    0.140  ms/op
BenchmarkFuseRegexp.optimal10NotFound            null  avgt    5   248.525 ±   60.819  ms/op
BenchmarkFuseRegexp.optimal1Like               LUCENE  avgt    5    16.431 ±    3.533  ms/op
BenchmarkFuseRegexp.optimal1Like               NATIVE  avgt    5    16.586 ±    0.828  ms/op
BenchmarkFuseRegexp.optimal1Like                 null  avgt    5   468.731 ±  185.623  ms/op
BenchmarkFuseRegexp.optimal1LikeNotFound       LUCENE  avgt    5     0.197 ±    0.062  ms/op
BenchmarkFuseRegexp.optimal1LikeNotFound       NATIVE  avgt    5     0.170 ±    0.072  ms/op
BenchmarkFuseRegexp.optimal1LikeNotFound         null  avgt    5   477.176 ±  129.445  ms/op
BenchmarkFuseRegexp.optimal1Regex              LUCENE  avgt    5    15.485 ±    0.517  ms/op
BenchmarkFuseRegexp.optimal1Regex              NATIVE  avgt    5    16.711 ±    0.857  ms/op
BenchmarkFuseRegexp.optimal1Regex                null  avgt    5   205.676 ±    6.921  ms/op
BenchmarkFuseRegexp.optimal1RegexNotFound      LUCENE  avgt    5     0.212 ±    0.123  ms/op
BenchmarkFuseRegexp.optimal1RegexNotFound      NATIVE  avgt    5     0.186 ±    0.029  ms/op
BenchmarkFuseRegexp.optimal1RegexNotFound        null  avgt    5   239.459 ±   13.846  ms/op
BenchmarkFuseRegexp.selective2Fusing             null  avgt    5   541.530 ±  168.173  ms/op
BenchmarkFuseRegexp.selective2Like             LUCENE  avgt    5    19.961 ±    0.937  ms/op
BenchmarkFuseRegexp.selective2Like             NATIVE  avgt    5    20.387 ±    1.953  ms/op
BenchmarkFuseRegexp.selective2Like               null  avgt    5   916.620 ±   96.702  ms/op
BenchmarkFuseRegexp.selective2Regex            LUCENE  avgt    5    23.366 ±    0.795  ms/op
BenchmarkFuseRegexp.selective2Regex            NATIVE  avgt    5    20.037 ±    0.601  ms/op
BenchmarkFuseRegexp.selective2Regex              null  avgt    5   389.965 ±    7.284  ms/op

I've also changed RegexpPatternConverterUtils (and its relative test) to do not introduce useless ^.* at the beginning of the expression nor .*$ at the end. I've repeated the benchmark with that change and these are the results I've got. As you can see, most benchmarks didn't change that much, but the ones that use like and have no index are twice as fast, so they achieve the same numbers than their equivalent optimalXRegex, as expected.

Benchmark                                  (_fstType)  Mode  Cnt     Score     Error  Units
BenchmarkFuseRegexp.decreasing9Fusing            null  avgt    5  1365.403 ±  91.822  ms/op
BenchmarkFuseRegexp.decreasing9Like            LUCENE  avgt    5    40.553 ±   1.394  ms/op
BenchmarkFuseRegexp.decreasing9Like            NATIVE  avgt    5    39.057 ±   0.905  ms/op
BenchmarkFuseRegexp.decreasing9Like              null  avgt    5  2046.090 ±  40.428  ms/op
BenchmarkFuseRegexp.decreasing9Regex           LUCENE  avgt    5    42.584 ±   2.387  ms/op
BenchmarkFuseRegexp.decreasing9Regex           NATIVE  avgt    5    61.583 ±   1.893  ms/op
BenchmarkFuseRegexp.decreasing9Regex             null  avgt    5  2054.523 ±  75.452  ms/op
BenchmarkFuseRegexp.increasing10Fusing           null  avgt    5  1305.249 ±  65.576  ms/op
BenchmarkFuseRegexp.increasing10Like           LUCENE  avgt    5    43.328 ±   2.040  ms/op
BenchmarkFuseRegexp.increasing10Like           NATIVE  avgt    5    61.580 ±   1.149  ms/op
BenchmarkFuseRegexp.increasing10Like             null  avgt    5  2272.112 ±  99.393  ms/op
BenchmarkFuseRegexp.increasing10Regex          LUCENE  avgt    5    54.319 ±   2.048  ms/op
BenchmarkFuseRegexp.increasing10Regex          NATIVE  avgt    5    39.080 ±   1.817  ms/op
BenchmarkFuseRegexp.increasing10Regex            null  avgt    5  2267.062 ±  49.525  ms/op
BenchmarkFuseRegexp.optimal10                    null  avgt    5   187.012 ±  13.054  ms/op
BenchmarkFuseRegexp.optimal10NotFound          LUCENE  avgt    5     0.210 ±   0.103  ms/op
BenchmarkFuseRegexp.optimal10NotFound          NATIVE  avgt    5     0.201 ±   0.066  ms/op
BenchmarkFuseRegexp.optimal10NotFound            null  avgt    5   242.423 ±  17.612  ms/op
BenchmarkFuseRegexp.optimal1Like               LUCENE  avgt    5    16.395 ±   0.884  ms/op
BenchmarkFuseRegexp.optimal1Like               NATIVE  avgt    5    15.506 ±   0.591  ms/op
BenchmarkFuseRegexp.optimal1Like                 null  avgt    5   209.389 ±  11.456  ms/op
BenchmarkFuseRegexp.optimal1LikeNotFound       LUCENE  avgt    5     0.209 ±   0.129  ms/op
BenchmarkFuseRegexp.optimal1LikeNotFound       NATIVE  avgt    5     0.172 ±   0.021  ms/op
BenchmarkFuseRegexp.optimal1LikeNotFound         null  avgt    5   245.542 ±  35.323  ms/op
BenchmarkFuseRegexp.optimal1Regex              LUCENE  avgt    5    15.421 ±   0.804  ms/op
BenchmarkFuseRegexp.optimal1Regex              NATIVE  avgt    5    15.908 ±   0.491  ms/op
BenchmarkFuseRegexp.optimal1Regex                null  avgt    5   204.505 ±  13.134  ms/op
BenchmarkFuseRegexp.optimal1RegexNotFound      LUCENE  avgt    5     0.210 ±   0.089  ms/op
BenchmarkFuseRegexp.optimal1RegexNotFound      NATIVE  avgt    5     0.203 ±   0.107  ms/op
BenchmarkFuseRegexp.optimal1RegexNotFound        null  avgt    5   235.262 ±   8.533  ms/op
BenchmarkFuseRegexp.selective2Fusing             null  avgt    5   517.545 ± 193.977  ms/op
BenchmarkFuseRegexp.selective2Like             LUCENE  avgt    5    23.622 ±   1.178  ms/op
BenchmarkFuseRegexp.selective2Like             NATIVE  avgt    5    20.547 ±   0.778  ms/op
BenchmarkFuseRegexp.selective2Like               null  avgt    5   399.638 ±  54.746  ms/op
BenchmarkFuseRegexp.selective2Regex            LUCENE  avgt    5    23.883 ±   0.802  ms/op
BenchmarkFuseRegexp.selective2Regex            NATIVE  avgt    5    20.113 ±   0.858  ms/op
BenchmarkFuseRegexp.selective2Regex              null  avgt    5   416.987 ±  28.789  ms/op

With this results, the optimization introduced in this PR doesn't seem to be amazing, but to be honest I'm quite worried about what is going on with regexp and Lucene/Native indexes. As said above, I hope that the benchmark is not correctly configuring the indexes, but even with that, it seems very dangerous that an index can be configured in such a way that it affects the semantics of the queries.

@Jackie-Jiang @atris I would really appreciate if any of you can take a look at the benchmark trying to find what is going on.

@atris
Copy link
Contributor

atris commented Jun 14, 2022

@gortiz are you creating a FST index or a text index?

@atris
Copy link
Contributor

atris commented Jun 14, 2022

@gortiz another thing to remember is that neither Lucene nor native FST index follow the java regex pattern language.

If you look up the Lucene regex documentation, and look at the FST index and LIKE tests in Pinot, you will notice the difference.

Unfortunately, since we do not actually parse the regex engine until deep inside the FST territory, we have no way of reporting unknown characters. That may be a factor leading to 0 results for your queries

@gortiz
Copy link
Contributor Author

gortiz commented Jun 15, 2022

@gortiz another thing to remember is that neither Lucene nor native FST index follow the java regex pattern language.

If you look up the Lucene regex documentation, and look at the FST index and LIKE tests in Pinot, you will notice the difference.

Unfortunately, since we do not actually parse the regex engine until deep inside the FST territory, we have no way of reporting unknown characters. That may be a factor leading to 0 results for your queries

Wow, that is shocking and I would see that as a huge problem to use Pinot in a big team. In cases like that I would expect to have a new operation that does the search using the Lucene expression language or whatever we want to use and fail if there is no index. What you are saying is that if I have a column that is not indexed and suddenly a colleague of mine decides to index that column, queries I had will start to return different (and incorrect) results.

Anyway, if that is what was decided, that is what we have. I'm going to close this PR, as it doesn't make sense to apply this optimization if it can be produce false results when someone adds an index.

@gortiz gortiz closed this Jun 15, 2022
@gortiz gortiz deleted the join-regex branch June 15, 2022 06:37
@atris
Copy link
Contributor

atris commented Jun 15, 2022

@gortiz There has been a long comment trail on this PR so I am summarising my responses in this comment:

  1. Performance -- Like any other index, FST indices also have their worst case. FST is an automaton, and in an extreme scenario where a high number of branches are traversed for a query, the performance of the index will approach the brute force approximation.

In this case, the query you tried had an abnormally large number of branches, to the extent that Lucene decided to not even try and process the query. The average user rarely fires such queries, and in the corner cases where they do, they should be ok with a brute force latency, as long as their 99th percentile queries keep performing significantly better, which is the case as is evident from the benchmarks that test the FST index.

  1. Language Semantics -- Lucene has, for long, published its regex dialect clearly. Engines based on Lucene, such as ElasticSearch and Apache Solr, follow the same dialect (https://help.oncrawl.com/en/articles/1685982-lucene-regex-cheat-sheet)

It is arguable whether this is correct or not, but we have to keep historical perspective in mind, since many Pinot users migrate from one of these engines.

Please note that, from native FST index's perspective, it is pretty simple to support java regex language. The reason I did not do that was because the existing FST index was based on Lucene and thus supported Lucene's regex dialect. For back compatibility, native FST index supports the same dialect.

  1. Parsing And Reporting Invalid Regexes: I agree that in an ideal scenario, we should be parsing regexes and reporting characters that we do not accept. However, given the complexity of an average regex, the time complexity of doing this will be non trivial.

  2. Specific To This PR: I looked into the PR and while the effort is appreciative, I do not see significant performance improvement. Also, the API surface changes required for this change are high, which makes me a bit uneasy.

@gortiz
Copy link
Contributor Author

gortiz commented Jun 15, 2022

Yeah, I completely understand that we should maintain the Lucene language when Natife FST was created. My concern is with the fact that when Lucene index was added, we decided to accept a different language in regexp_match. IMHO that is a consistency problem that is not easy to explain to Pinot customers. Pinot doesn't include documentation about this operation and how it changes depending on whether there is an index or not and I feel sorry for whoever has to write that doc, because it seems quite difficult to describe.

IMHO a better solution would be to do not optimize regexp_match with indexes and to have another operation whose string argument was a Lucene expression.

It is too late to do that, so let's focus on useful things. This PR is closed, but I'm going to cherry pick the like to regex optimization to add it in another PR.

@Jackie-Jiang
Copy link
Contributor

@gortiz @atris Thanks for taking time looking into this issue. Even though we might not move forward with the query rewrite, we got several valuable insights from the benchmark.

With regard to the query syntax, +100 on we should keep the query behavior consistent with or without the FST index. Currently we try to convert java regex to Lucene search query using RegexpPatternConverterUtils.regexpLikeToLuceneRegExp(). The support is very limited though, and might only be able to cover the basic LIKE predicate.

We want to utilize the FST to accelerate the regex matching (LIKE and REGEXP_LIKE). There are 2 approaches:

  • Add a new predicate type, e.g. FST_LIKE and take the Lucene dialect
  • Make FST work on java regex

I personally prefer the second approach because it can improve the existing query performance, but it requires changes to the native FST, and we cannot support Lucene FST anymore. Or we can support both, and make the FST_LIKE only work with Lucene FST.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants