Do not perform another pass of Query Automaton Minimization #8237

Merged
merged 1 commit into from Feb 22, 2022

@atris atris commented Feb 21, 2022

The native text engine minimises the query automaton after construction using Hopcroft's algorithm. This can get expensive for large query automata, and it does not yield much improvement anyway, since the query automaton is built once and used once.
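
To illustrate why dropping the pass does not change query results: minimisation merges equivalent states but leaves the accepted language untouched, so it only trades construction time for a smaller automaton. The following is a minimal sketch under stated assumptions. `ToyDfa` and its hand-built six-state automaton (roughly the prefix pattern `ab%` over the alphabet `{a, b}`) are hypothetical and are not Pinot's nativefst `Automaton` API, and the sketch uses simple Moore-style partition refinement rather than Hopcroft's algorithm.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical toy DFA for illustration only; not Pinot's nativefst Automaton. */
final class ToyDfa {
    final String alphabet;   // symbol i is alphabet.charAt(i)
    final int[][] next;      // next[state][symbol]; every transition is defined
    final boolean[] accept;  // accept[state]; state 0 is the start state

    ToyDfa(String alphabet, int[][] next, boolean[] accept) {
        this.alphabet = alphabet;
        this.next = next;
        this.accept = accept;
    }

    /** Accepts strings over {a, b} that start with "ab"; deliberately has duplicate states. */
    static ToyDfa example() {
        int[][] next = {
            {1, 4}, // 0: start
            {5, 2}, // 1: saw "a"
            {3, 2}, // 2: saw "ab" (accepting)
            {3, 2}, // 3: accepting, equivalent to state 2
            {4, 4}, // 4: dead
            {5, 5}, // 5: dead, equivalent to state 4
        };
        boolean[] accept = {false, false, true, true, false, false};
        return new ToyDfa("ab", next, accept);
    }

    /** Runs the DFA from state 0; characters outside the alphabet reject. */
    boolean matches(String input) {
        int state = 0;
        for (int i = 0; i < input.length(); i++) {
            int symbol = alphabet.indexOf(input.charAt(i));
            if (symbol < 0) {
                return false;
            }
            state = next[state][symbol];
        }
        return accept[state];
    }

    /** Moore-style partition refinement; assumes every state is reachable. */
    ToyDfa minimize() {
        int n = next.length;
        int[] block = new int[n];
        for (int s = 0; s < n; s++) {
            block[s] = accept[s] ? 1 : 0; // initial split: accepting vs. non-accepting
        }
        int previousCount = 0;
        while (true) {
            // States stay together only if they agree on their own block
            // and on the block of every successor.
            Map<List<Integer>, Integer> ids = new HashMap<>();
            int[] refined = new int[n];
            for (int s = 0; s < n; s++) {
                List<Integer> signature = new ArrayList<>();
                signature.add(block[s]);
                for (int target : next[s]) {
                    signature.add(block[target]);
                }
                Integer id = ids.get(signature);
                if (id == null) {
                    id = ids.size(); // ids follow state order, so state 0 keeps id 0
                    ids.put(signature, id);
                }
                refined[s] = id;
            }
            block = refined;
            if (ids.size() == previousCount) {
                break; // refinement only splits, so an equal count means a stable partition
            }
            previousCount = ids.size();
        }
        int[][] minNext = new int[previousCount][alphabet.length()];
        boolean[] minAccept = new boolean[previousCount];
        for (int s = 0; s < n; s++) {
            minAccept[block[s]] = accept[s];
            for (int a = 0; a < alphabet.length(); a++) {
                minNext[block[s]][a] = block[next[s][a]];
            }
        }
        return new ToyDfa(alphabet, minNext, minAccept);
    }

    public static void main(String[] args) {
        ToyDfa raw = example();
        ToyDfa min = raw.minimize();
        System.out.println("states: " + raw.next.length + " -> " + min.next.length);
        System.out.println("abba: " + raw.matches("abba") + " / " + min.matches("abba"));
    }
}
```

In this toy case the automaton shrinks from six states to four while every input gets the same verdict from both versions. For an automaton that is matched once and then discarded, the smaller state count has little opportunity to pay back the cost of the extra pass.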

With this change, the performance numbers from BenchmarkNativeAndLuceneBasedLike are:

Benchmark                                (_fstType)  (_intBaseValue)  (_numBlocks)  (_numRows)                                                                    (_query)  Mode  Cnt    Score    Error  Units
BenchmarkNativeAndLuceneBasedLike.query      LUCENE             1000             0     2500000     SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE '%domain%'  avgt    5   40.436 ±  8.662  us/op
BenchmarkNativeAndLuceneBasedLike.query      LUCENE             1000             0     2500000  SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE 'www.domain%'  avgt    5   50.320 ±  4.254  us/op
BenchmarkNativeAndLuceneBasedLike.query      LUCENE             1000             1     2500000     SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE '%domain%'  avgt    5   42.378 ±  2.669  us/op
BenchmarkNativeAndLuceneBasedLike.query      LUCENE             1000             1     2500000  SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE 'www.domain%'  avgt    5   53.890 ±  2.951  us/op
BenchmarkNativeAndLuceneBasedLike.query      LUCENE             1000            10     2500000     SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE '%domain%'  avgt    5   47.751 ±  1.149  us/op
BenchmarkNativeAndLuceneBasedLike.query      LUCENE             1000            10     2500000  SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE 'www.domain%'  avgt    5   60.890 ±  1.949  us/op
BenchmarkNativeAndLuceneBasedLike.query      LUCENE             1000           100     2500000     SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE '%domain%'  avgt    5   93.937 ±  8.493  us/op
BenchmarkNativeAndLuceneBasedLike.query      LUCENE             1000           100     2500000  SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE 'www.domain%'  avgt    5  129.687 ± 16.903  us/op
BenchmarkNativeAndLuceneBasedLike.query      NATIVE             1000             0     2500000     SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE '%domain%'  avgt    5   55.362 ± 10.320  us/op
BenchmarkNativeAndLuceneBasedLike.query      NATIVE             1000             0     2500000  SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE 'www.domain%'  avgt    5   16.610 ±  1.297  us/op
BenchmarkNativeAndLuceneBasedLike.query      NATIVE             1000             1     2500000     SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE '%domain%'  avgt    5   54.800 ±  1.501  us/op
BenchmarkNativeAndLuceneBasedLike.query      NATIVE             1000             1     2500000  SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE 'www.domain%'  avgt    5   18.417 ±  0.696  us/op
BenchmarkNativeAndLuceneBasedLike.query      NATIVE             1000            10     2500000     SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE '%domain%'  avgt    5   60.187 ±  3.858  us/op
BenchmarkNativeAndLuceneBasedLike.query      NATIVE             1000            10     2500000  SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE 'www.domain%'  avgt    5   25.549 ±  1.694  us/op
BenchmarkNativeAndLuceneBasedLike.query      NATIVE             1000           100     2500000     SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE '%domain%'  avgt    5  106.765 ± 13.996  us/op
BenchmarkNativeAndLuceneBasedLike.query      NATIVE             1000           100     2500000  SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE 'www.domain%'  avgt    5   99.888 ±  1.029  us/op

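Note that for generic match queries ('%domain%'), Lucene and Native FST are in the same range from 0 blocks to 100 blocks, with Lucene ahead by roughly 12 to 15 us/op in this run. For prefix queries ('www.domain%'), Native FST is about 3x faster at 0 and 1 blocks, about 2.4x faster at 10 blocks, and about 1.3x faster (roughly 23% less time per operation) at 100 blocks.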

BenchmarkNativeAndLuceneBasedLike.query      LUCENE             1000             0     2500000  SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE 'www.domain%'  avgt    5   50.320 ±  4.254  us/op

BenchmarkNativeAndLuceneBasedLike.query      NATIVE             1000             0     2500000  SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE 'www.domain%'  avgt    5   16.610 ±  1.297  us/op

BenchmarkNativeAndLuceneBasedLike.query      LUCENE             1000           100     2500000  SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE 'www.domain%'  avgt    5  129.687 ± 16.903  us/op

BenchmarkNativeAndLuceneBasedLike.query      NATIVE             1000           100     2500000  SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE 'www.domain%'  avgt    5   99.888 ±  1.029  us/op

This behaviour was observed over multiple runs of the benchmark. Detailed results at:

https://docs.google.com/document/d/1Jd-Oe0F9gx9WAB1sa5YdW7KZ_EsPOJdcHK1bsON9JHM/edit?usp=sharing
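
For reference, the prefix-query comparison can be recomputed directly from the scores in the table above. This is a small arithmetic sketch, not part of the benchmark. The class and method names are made up for illustration, and the scores are copied from the 'www.domain%' rows of this single run, so they carry the error margins shown there.

```java
/** Ratio arithmetic on the 'www.domain%' rows above (scores in us/op, lower is better). */
final class PrefixSpeedup {
    /** How many times faster the NATIVE score is than the LUCENE score. */
    static double speedup(double luceneUsPerOp, double nativeUsPerOp) {
        return luceneUsPerOp / nativeUsPerOp;
    }

    /** Percentage of per-operation time saved by NATIVE relative to LUCENE. */
    static double timeSavedPercent(double luceneUsPerOp, double nativeUsPerOp) {
        return 100.0 * (luceneUsPerOp - nativeUsPerOp) / luceneUsPerOp;
    }

    public static void main(String[] args) {
        int[] numBlocks = {0, 1, 10, 100};
        double[] lucene = {50.320, 53.890, 60.890, 129.687};
        double[] nativeFst = {16.610, 18.417, 25.549, 99.888};
        for (int i = 0; i < numBlocks.length; i++) {
            System.out.printf("blocks=%d speedup=%.2fx timeSaved=%.1f%%%n",
                numBlocks[i],
                speedup(lucene[i], nativeFst[i]),
                timeSavedPercent(lucene[i], nativeFst[i]));
        }
    }
}
```

On these mean scores the ratio comes out at about 3.0x for 0 blocks, 2.4x for 10 blocks and 1.3x for 100 blocks. The error margins are ignored here, so the ratios are indicative only.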

@atris atris changed the title Do not do another pass of Query Automaton Minimization Do not perform another pass of Query Automaton Minimization Feb 21, 2022

codecov-commenter commented Feb 21, 2022

Codecov Report

Merging #8237 (d7cc6d7) into master (4f17ede) will decrease coverage by 6.95%.
The diff coverage is 78.43%.

❗ Current head d7cc6d7 differs from pull request most recent head 95554f2. Consider uploading reports for the commit 95554f2 to get more accurate results

@@             Coverage Diff              @@
##             master    #8237      +/-   ##
============================================
- Coverage     70.96%   64.00%   -6.96%     
+ Complexity     4320     4239      -81     
============================================
  Files          1626     1584      -42     
  Lines         85081    83250    -1831     
  Branches      12803    12608     -195     
============================================
- Hits          60377    53288    -7089     
- Misses        20545    26127    +5582     
+ Partials       4159     3835     -324     
Flag Coverage Δ
integration1 ?
integration2 ?
unittests1 66.95% <78.43%> (-0.40%) ⬇️
unittests2 14.10% <0.00%> (-0.02%) ⬇️

Impacted Files Coverage Δ
...ery/optimizer/filter/NumericalFilterOptimizer.java 83.22% <ø> (-2.49%) ⬇️
...inot/core/operator/filter/FilterOperatorUtils.java 80.00% <42.85%> (-7.68%) ⬇️
...ot/common/request/context/RequestContextUtils.java 72.77% <50.00%> (-1.60%) ⬇️
.../pinot/core/operator/filter/NotFilterOperator.java 55.55% <55.55%> (ø)
...ava/org/apache/pinot/core/plan/FilterPlanNode.java 87.85% <75.00%> (-1.37%) ⬇️
...core/operator/dociditerators/NotDocIdIterator.java 95.00% <95.00%> (ø)
...he/pinot/common/request/context/FilterContext.java 78.37% <100.00%> (ø)
.../apache/pinot/pql/parsers/pql2/ast/FilterKind.java 100.00% <100.00%> (ø)
...che/pinot/core/operator/docidsets/NotDocIdSet.java 100.00% <100.00%> (ø)
...egment/local/utils/nativefst/automaton/RegExp.java 42.96% <100.00%> (-1.49%) ⬇️
... and 389 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4f17ede...95554f2. Read the comment docs.

@richardstartin richardstartin left a comment

I observe a good speed up with this change.

richardstartin commented Feb 21, 2022

I got these numbers with JDK 11 (Corretto) on my MacBook Pro with the CLI args below:

java -jar pinot-perf/target/benchmarks.jar -wi 5 -i 5 -r 1 -w 2 -f 1 -bm avgt -jvmArgsPrepend "-ms4G -mx4G -XX:+AlwaysPreTouch -XX:+UseParallelGC" BenchmarkNativeAndLuceneBasedLike
Benchmark                                (_fstType)  (_intBaseValue)  (_numBlocks)  (_numRows)                                                                    (_query)  Mode  Cnt    Score   Error  Units
BenchmarkNativeAndLuceneBasedLike.query      LUCENE             1000             0     2500000     SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE '%domain%'  avgt    5   47.434 ± 0.878  us/op
BenchmarkNativeAndLuceneBasedLike.query      LUCENE             1000             0     2500000  SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE 'www.domain%'  avgt    5   49.754 ± 1.126  us/op
BenchmarkNativeAndLuceneBasedLike.query      LUCENE             1000             1     2500000     SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE '%domain%'  avgt    5   51.999 ± 0.557  us/op
BenchmarkNativeAndLuceneBasedLike.query      LUCENE             1000             1     2500000  SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE 'www.domain%'  avgt    5   53.120 ± 0.825  us/op
BenchmarkNativeAndLuceneBasedLike.query      LUCENE             1000            10     2500000     SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE '%domain%'  avgt    5   59.283 ± 1.135  us/op
BenchmarkNativeAndLuceneBasedLike.query      LUCENE             1000            10     2500000  SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE 'www.domain%'  avgt    5   62.024 ± 1.437  us/op
BenchmarkNativeAndLuceneBasedLike.query      LUCENE             1000           100     2500000     SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE '%domain%'  avgt    5  117.280 ± 0.580  us/op
BenchmarkNativeAndLuceneBasedLike.query      LUCENE             1000           100     2500000  SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE 'www.domain%'  avgt    5  164.621 ± 8.522  us/op
BenchmarkNativeAndLuceneBasedLike.query      NATIVE             1000             0     2500000     SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE '%domain%'  avgt    5   44.811 ± 1.573  us/op
BenchmarkNativeAndLuceneBasedLike.query      NATIVE             1000             0     2500000  SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE 'www.domain%'  avgt    5   14.112 ± 0.155  us/op
BenchmarkNativeAndLuceneBasedLike.query      NATIVE             1000             1     2500000     SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE '%domain%'  avgt    5   49.487 ± 0.414  us/op
BenchmarkNativeAndLuceneBasedLike.query      NATIVE             1000             1     2500000  SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE 'www.domain%'  avgt    5   19.014 ± 0.583  us/op
BenchmarkNativeAndLuceneBasedLike.query      NATIVE             1000            10     2500000     SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE '%domain%'  avgt    5   54.869 ± 0.861  us/op
BenchmarkNativeAndLuceneBasedLike.query      NATIVE             1000            10     2500000  SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE 'www.domain%'  avgt    5   29.343 ± 0.206  us/op
BenchmarkNativeAndLuceneBasedLike.query      NATIVE             1000           100     2500000     SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE '%domain%'  avgt    5  115.629 ± 5.838  us/op
BenchmarkNativeAndLuceneBasedLike.query      NATIVE             1000           100     2500000  SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE 'www.domain%'  avgt    5  128.661 ± 3.260  us/op

So there are no integer-multiple differences for unanchored patterns ('%domain%') in this run. Anchored prefixes ('www.domain%') are much faster than Lucene, though the native implementation appears to warm up faster. I can run this on some more stable machines, but we wouldn't see this kind of improvement by accident.

@atris atris merged commit fa0db64 into apache:master Feb 22, 2022
xiangfu0 pushed a commit to xiangfu0/pinot that referenced this pull request Feb 23, 2022