Remove the construction of second bitmap in text index reader to improve performance #5199

siddharthteotia · 2020-03-31T05:55:38Z

This is a follow-up to optimization implemented in PR #5177.

Since we now have pre-built mapping of luceneDocId to pinotDocId, we can directly build the result bitmap with pinotDocId. This PR removes the construction of second bitmap since earlier we had to do build the result in two phases -- (1) run search query to get luceneDocIDs in a bitmap. Iterate over this bitmap and build a second bitmap with corresponding pinotDocIds.

Now in our Lucene collector callback, we can directly build the final bitmap.

This change along with previous PR provides significant performance improvements.

Ran an increasingQPS test on real data (single segment with 10million docs and a text index). QPS was increased from 1 to 40

REPORT FOR TARGET QPS: 1.0
Current Target QPS: 1.0, Time Passed: 30084ms, Queries Executed: 30, Average QPS: 0.997207818109294, Average Broker Time: 44.6ms, Average Client Time: 50.1ms, Queries Queued: 0.

REPORT FOR TARGET QPS: 3.0
Current Target QPS: 3.0, Time Passed: 30255ms, Queries Executed: 90, Average QPS: 2.974714923153198, Average Broker Time: 50.3ms, Average Client Time: 53.24444444444445ms, Queries Queued: 0.

REPORT FOR TARGET QPS: 5.0
Current Target QPS: 5.0, Time Passed: 30412ms, Queries Executed: 150, Average QPS: 4.932263580165724, Average Broker Time: 41.44ms, Average Client Time: 44.1ms, Queries Queued: 0.

REPORT FOR TARGET QPS: 7.0
Current Target QPS: 7.0, Time Passed: 30489ms, Queries Executed: 210, Average QPS: 6.887730000983961, Average Broker Time: 41.82857142857143ms, Average Client Time: 44.00476190476191ms, Queries Queued: 0.

REPORT FOR TARGET QPS: 9.0
Current Target QPS: 9.0, Time Passed: 30868ms, Queries Executed: 270, Average QPS: 8.746922379162887, Average Broker Time: 43.385185185185186ms, Average Client Time: 45.27037037037037ms, Queries Queued: 0.

REPORT FOR TARGET QPS: 25.0
Current Target QPS: 25.0, Time Passed: 30233ms, Queries Executed: 694, Average QPS: 22.955049118512882, Average Broker Time: 37.27089337175793ms, Average Client Time: 38.53746397694525ms, Queries Queued: 0.

REPORT FOR TARGET QPS: 27.0
Current Target QPS: 27.0, Time Passed: 30254ms, Queries Executed: 740, Average QPS: 24.459575593309975, Average Broker Time: 39.71351351351351ms, Average Client Time: 40.87837837837838ms, Queries Queued: 0.

REPORT FOR TARGET QPS: 29.0
Current Target QPS: 29.0, Time Passed: 30147ms, Queries Executed: 798, Average QPS: 26.4702955517962, Average Broker Time: 37.06516290726817ms, Average Client Time: 38.22431077694235ms, Queries Queued: 0.

REPORT FOR TARGET QPS: 31.0
Current Target QPS: 31.0, Time Passed: 30160ms, Queries Executed: 843, Average QPS: 27.950928381962864, Average Broker Time: 37.79359430604982ms, Average Client Time: 38.86476868327402ms, Queries Queued: 0.

FINAL REPORT:
Current Target QPS: 39.0, Time Passed: 27344ms, Queries Executed: 947, Average QPS: 34.632826214160325, Average Broker Time: 36.91024287222809ms, Average Client Time: 37.91129883843717ms.

10th percentile: 4.0
25th percentile: 26.0
50th percentile: 37.0
90th percentile: 68.0
95th percentile: 75.0
99th percentile: 129.0
99.9th percentile: 150.0

The optimizations implemented in this and previous PR are not directly applicable to realtime (we have the exact same performance overhead in realtime too) since we can't have a pre-built mapping there. We mostly need to build a cache on-the-fly as queries are processed on realtime lucene index. A solution is in progress. Will put PR soon

performance

Jackie-Jiang

I would recommend implementing the real-time docId translator instead of having a separate real-time collector. You can also do that as the next step.
LGTM otherwise.

Jackie-Jiang · 2020-03-31T18:29:56Z

...ain/java/org/apache/pinot/core/realtime/impl/invertedindex/RealtimeLuceneDocIdCollector.java

+import org.roaringbitmap.buffer.MutableRoaringBitmap;
+
+/**
+ * DocID collector for Lucene search query. We have optimized


(OCD) Can we always use docId for naming instead of mixing use of docId and docID

I would recommend implementing the real-time docId translator instead of having a separate real-time collector. You can also do that as the next step.
LGTM otherwise.

Yes, that is the next step (implementing this optimization for realtime)

(OCD) Can we always use docId for naming instead of mixing use of docId and docID

Done.

siddharthteotia force-pushed the docid_optim_1 branch from 26ea0f8 to 20dfffd Compare March 31, 2020 16:38

Remove second bitmap in text index reader to improve

431e09b

performance

siddharthteotia force-pushed the docid_optim_1 branch from 20dfffd to 431e09b Compare March 31, 2020 17:44

Jackie-Jiang approved these changes Mar 31, 2020

View reviewed changes

review comments

223d044

siddharthteotia merged commit 1474344 into apache:master Mar 31, 2020

This was referenced Apr 3, 2020

Make text index query cache a configurable option #5176

Merged

Add inter segment tests for text search and fix bug for Lucene query parser creation #5226

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove the construction of second bitmap in text index reader to improve performance #5199

Remove the construction of second bitmap in text index reader to improve performance #5199

siddharthteotia commented Mar 31, 2020 •

edited

Loading

Jackie-Jiang left a comment

Jackie-Jiang Mar 31, 2020

siddharthteotia Mar 31, 2020

siddharthteotia Mar 31, 2020

Remove the construction of second bitmap in text index reader to improve performance #5199

Remove the construction of second bitmap in text index reader to improve performance #5199

Conversation

siddharthteotia commented Mar 31, 2020 • edited Loading

Jackie-Jiang left a comment

Choose a reason for hiding this comment

Jackie-Jiang Mar 31, 2020

Choose a reason for hiding this comment

siddharthteotia Mar 31, 2020

Choose a reason for hiding this comment

siddharthteotia Mar 31, 2020

Choose a reason for hiding this comment

siddharthteotia commented Mar 31, 2020 •

edited

Loading