LUCENE-10297: Speed up medium cardinality fields with readLongs and SIMD #530

gf2121 · 2021-12-08T10:29:59Z

We introduced a bitset optimization for extremely low cardinality fields in https://issues.apache.org/jira/browse/LUCENE-10233, but medium cardinality fields (e.g. 32 or 128 distinct values) can rarely trigger it. I'm trying to find a way to speed them up.

In apache/lucene-solr#1538, we tried using readLELongs to speed up BKD id blocks, but did not see an obvious gain from that approach. I think the reason is probably that we were trying to optimize the unsorted case (which typically happens for high cardinality fields), and the bottleneck of queries on high cardinality fields is visitDocValues rather than readDocIds.

However, medium cardinality fields may be good candidates for this optimization because they need to read lots of ids for each term. The basic idea is that we can compute the deltas of the sorted ids and encode/decode them like what we do in StoredFieldsInts. I benchmarked the optimization by mocking some random LongPoint values and querying them with PointInSetQuery. As expected, the medium cardinality fields got sped up and the high cardinality fields got roughly even results.
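To make the delta idea concrete, here is a minimal, self-contained sketch. It is not the code from this PR and it does not use any Lucene API: the class name `DeltaIdSketch`, the method names, and the layout (one header long holding the bit width, then deltas packed 16 or 32 bits each into longs) are all assumptions chosen for illustration, only loosely in the spirit of packing small ints into longs.

```java
/** Hypothetical illustration only; names and layout are assumptions, not the PR's code. */
final class DeltaIdSketch {

  /**
   * Encodes sorted, non-negative doc ids as deltas packed into longs.
   * Layout (assumed): out[0] = bits per delta (16 or 32), out[1..] = packed deltas.
   */
  static long[] encode(int[] sortedIds) {
    int n = sortedIds.length;
    int[] deltas = new int[n];
    int prev = 0;
    int orAll = 0; // OR of all deltas, used to find the highest set bit
    for (int i = 0; i < n; i++) {
      deltas[i] = sortedIds[i] - prev; // non-negative because the ids are sorted
      orAll |= deltas[i];
      prev = sortedIds[i];
    }
    int bits = (orAll >>> 16) == 0 ? 16 : 32;
    int perLong = 64 / bits;
    long[] out = new long[1 + (n + perLong - 1) / perLong];
    out[0] = bits;
    for (int i = 0; i < n; i++) {
      int shift = (i % perLong) * bits;
      out[1 + i / perLong] |= (deltas[i] & 0xFFFFFFFFL) << shift;
    }
    return out;
  }

  /** Decodes {@code count} ids by unpacking each delta and accumulating a prefix sum. */
  static int[] decode(long[] packed, int count) {
    int bits = (int) packed[0];
    int perLong = 64 / bits;
    long mask = (1L << bits) - 1;
    int[] ids = new int[count];
    int prev = 0;
    for (int i = 0; i < count; i++) {
      long word = packed[1 + i / perLong];
      prev += (int) ((word >>> ((i % perLong) * bits)) & mask);
      ids[i] = prev;
    }
    return ids;
  }

  public static void main(String[] args) {
    int[] ids = {3, 7, 7, 100, 65000};
    long[] packed = encode(ids);
    System.out.println(java.util.Arrays.toString(decode(packed, ids.length)));
  }
}
```

In this sketch, sorted ids whose gaps all fit in 16 bits take a quarter of a long each, so a reader can pull whole longs from the input and unpack them in a tight loop, which is the kind of loop a JIT can often vectorize.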

Benchmark Result

| doc count | field cardinality | query points | baseline (ms) | candidate (ms) | latency diff | baseline (QPS) | candidate (QPS) | QPS diff |
|---|---|---|---|---|---|---|---|---|
| 100000000 | 32 | 1 | 19 | 16 | -15.79% | 52.63 | 62.5 | 18.75% |
| 100000000 | 32 | 2 | 34 | 14 | -58.82% | 29.41 | 71.43 | 142.86% |
| 100000000 | 32 | 4 | 76 | 22 | -71.05% | 13.16 | 45.45 | 245.45% |
| 100000000 | 32 | 8 | 139 | 42 | -69.78% | 7.19 | 23.81 | 230.95% |
| 100000000 | 32 | 16 | 279 | 82 | -70.61% | 3.58 | 12.2 | 240.24% |
| 100000000 | 128 | 1 | 17 | 11 | -35.29% | 58.82 | 90.91 | 54.55% |
| 100000000 | 128 | 8 | 75 | 23 | -69.33% | 13.33 | 43.48 | 226.09% |
| 100000000 | 128 | 16 | 126 | 25 | -80.16% | 7.94 | 40 | 404.00% |
| 100000000 | 128 | 32 | 245 | 50 | -79.59% | 4.08 | 20 | 390.00% |
| 100000000 | 128 | 64 | 528 | 97 | -81.63% | 1.89 | 10.31 | 444.33% |
| 100000000 | 1024 | 1 | 3 | 2 | -33.33% | 333.33 | 500 | 50.00% |
| 100000000 | 1024 | 8 | 13 | 8 | -38.46% | 76.92 | 125 | 62.50% |
| 100000000 | 1024 | 32 | 31 | 19 | -38.71% | 32.26 | 52.63 | 63.16% |
| 100000000 | 1024 | 128 | 120 | 67 | -44.17% | 8.33 | 14.93 | 79.10% |
| 100000000 | 1024 | 512 | 480 | 133 | -72.29% | 2.08 | 7.52 | 260.90% |
| 100000000 | 8192 | 1 | 3 | 3 | 0.00% | 333.33 | 333.33 | 0.00% |
| 100000000 | 8192 | 16 | 18 | 15 | -16.67% | 55.56 | 66.67 | 20.00% |
| 100000000 | 8192 | 64 | 19 | 14 | -26.32% | 52.63 | 71.43 | 35.71% |
| 100000000 | 8192 | 512 | 69 | 43 | -37.68% | 14.49 | 23.26 | 60.47% |
| 100000000 | 8192 | 2048 | 236 | 134 | -43.22% | 4.24 | 7.46 | 76.12% |
| 100000000 | 1048576 | 1 | 3 | 2 | -33.33% | 333.33 | 500 | 50.00% |
| 100000000 | 1048576 | 16 | 18 | 19 | 5.56% | 55.56 | 52.63 | -5.26% |
| 100000000 | 1048576 | 64 | 17 | 17 | 0.00% | 58.82 | 58.82 | 0.00% |
| 100000000 | 1048576 | 512 | 34 | 32 | -5.88% | 29.41 | 31.25 | 6.25% |
| 100000000 | 1048576 | 2048 | 89 | 93 | 4.49% | 11.24 | 10.75 | -4.30% |
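
For readers checking the numbers, the QPS and diff columns in the table above appear to follow from the millisecond timings, assuming QPS is simply 1000 divided by the per-query latency. That relationship is inferred from the figures, not stated in the PR. A small sketch with hypothetical helper names:

```java
/** Hypothetical helpers; the QPS formula is an assumption inferred from the table. */
final class BenchColumns {

  /** Queries per second for one query taking {@code latencyMs} milliseconds. */
  static double qps(double latencyMs) {
    return 1000.0 / latencyMs;
  }

  /** Relative change from baseline to candidate, in percent. */
  static double diffPercent(double baseline, double candidate) {
    return (candidate - baseline) / baseline * 100.0;
  }

  public static void main(String[] args) {
    // First row of the table: baseline 19 ms, candidate 16 ms.
    System.out.printf("%.2f %.2f%n", qps(19), qps(16));
    System.out.printf("%.2f%% %.2f%%%n", diffPercent(19, 16), diffPercent(qps(19), qps(16)));
  }
}
```

For the first row this gives roughly 52.63 and 62.5 QPS, a latency change of about -15.79%, and a QPS change of about 18.75%, matching the table.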

gf2121 · 2021-12-13T07:13:25Z

lucene/core/src/java/org/apache/lucene/util/bkd/Run.java

+import org.apache.lucene.store.FSDirectory;
+
+/** java doc */
+public class Run {


This is the benchmark script. I'm posting it here in case someone would like to play with it. I'll delete this file later.

gf2121 added 10 commits December 8, 2021 17:23

stash

9646663

format

b738e1e

format

a4ceca0

iter

4dc5f87

java doc

0030115

iter

6b448be

remove unrelated codes

e8bb832

iter

ddd5127

spotless

289ab06

format

f513f5a

gf2121 changed the title ~~LUCENE-10297: Speed up medium cardinality fields with readLELongs and SIMD~~ LUCENE-10297: Speed up medium cardinality fields with readLongs and SIMD Dec 9, 2021

gf2121 added 3 commits December 9, 2021 15:38

CHANGES

83b957e

bug fix

a5b1f40

bench mark script

55e6574

gf2121 commented Dec 13, 2021

avoid duplicate loop

1654c91

gf2121 closed this Dec 15, 2021

asfimport mentioned this pull request Dec 15, 2021

Speed up medium cardinality fields with readLongs and SIMD [LUCENE-10297] #11333

Closed
