Use radix sort to speed up the sorting of deleted terms #12573

gf2121 · 2023-09-20T10:12:46Z

Description

Recently, we captured a flame graph in a scenario with frequent updates, and it showed that sorting deleted terms accounted for a large share of CPU time.

In scenarios with many deleted terms, most terms are likely to share the same field name. A data structure like Map<String, Map<BytesRef, Integer>> could therefore work better here than Map<Term, Integer>: it avoids comparing field names, and it lets us use MSBRadixSort to sort the term bytes within each field.
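To make the idea concrete, here is a minimal sketch of grouping deleted terms by field and sorting each field's term bytes with an MSB radix sort. It uses only the JDK and is not the code from this PR: the class `FieldGroupedDeletes`, its methods, and the use of `ByteBuffer` as a stand-in for `BytesRef` are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; names and structure are not Lucene's.
final class FieldGroupedDeletes {
  // field name -> (term bytes -> docIdUpto). ByteBuffer stands in for BytesRef
  // because it has content-based equals/hashCode.
  private final Map<String, Map<ByteBuffer, Integer>> byField = new HashMap<>();

  void put(String field, byte[] term, int docIdUpto) {
    byField.computeIfAbsent(field, f -> new HashMap<>())
        .put(ByteBuffer.wrap(term.clone()), docIdUpto);
  }

  // Returns the terms of one field in unsigned byte order. No field-name
  // comparison happens here because the grouping already separated fields.
  byte[][] sortedTerms(String field) {
    Map<ByteBuffer, Integer> terms =
        byField.getOrDefault(field, Collections.<ByteBuffer, Integer>emptyMap());
    byte[][] out = new byte[terms.size()][];
    int i = 0;
    for (ByteBuffer b : terms.keySet()) {
      out[i++] = b.array();
    }
    msbRadixSort(out, 0, out.length, 0);
    return out;
  }

  // Sorts a[from, to) assuming all entries share their first `depth` bytes.
  static void msbRadixSort(byte[][] a, int from, int to, int depth) {
    if (to - from <= 1) {
      return;
    }
    // Bucket 0: term has no byte at `depth`; buckets 1..256: unsigned byte value + 1.
    int[] counts = new int[257];
    for (int i = from; i < to; i++) {
      counts[bucket(a[i], depth)]++;
    }
    int[] starts = new int[257];
    int sum = from;
    for (int b = 0; b < 257; b++) {
      starts[b] = sum;
      sum += counts[b];
    }
    // Stable distribution into a scratch array, then copy back.
    byte[][] tmp = new byte[to - from][];
    int[] next = starts.clone();
    for (int i = from; i < to; i++) {
      int b = bucket(a[i], depth);
      tmp[next[b]++ - from] = a[i];
    }
    System.arraycopy(tmp, 0, a, from, to - from);
    // Terms in bucket 0 are identical (same prefix, same length), so only
    // buckets 1..256 need sorting on the next byte.
    for (int b = 1; b < 257; b++) {
      if (counts[b] > 1) {
        msbRadixSort(a, starts[b], starts[b] + counts[b], depth + 1);
      }
    }
  }

  private static int bucket(byte[] term, int depth) {
    return depth < term.length ? (term[depth] & 0xFF) + 1 : 0;
  }
}
```

The sketch recurses one level per byte and allocates a scratch array per call, which keeps it short; a production radix sorter would normally sort in place and fall back to a comparison sort for small buckets.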

We can also use BytesRefHash to implement the Map<BytesRef, Integer>, which gives a more compact memory layout, and BytesRefHash already contains an MSBRadixSort implementation that we can reuse.
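The memory-layout point can be illustrated with a simplified, JDK-only sketch of an id-based map: term bytes are appended to one shared byte pool, each distinct term gets a small integer id, and the per-term value lives in a parallel `int[]` indexed by that id. This is not BytesRefHash itself; `PackedTermMap` and all of its members are hypothetical names, and the hashing and growth policies are arbitrary choices for the example.

```java
import java.util.Arrays;

// Hypothetical sketch of an id-based layout; not Lucene's BytesRefHash.
final class PackedTermMap {
  private byte[] pool = new byte[64]; // all term bytes, stored back to back
  private int poolUsed = 0;
  private int[] offsets = new int[9]; // term id occupies pool[offsets[id] .. offsets[id + 1])
  private int[] values = new int[8];  // values[id] is the value mapped to term id
  private int[] table = new int[16];  // open-addressing table of ids, -1 means empty
  private int size = 0;

  PackedTermMap() {
    Arrays.fill(table, -1);
  }

  int size() {
    return size;
  }

  // Inserts the term or overwrites its value; returns the term's id.
  int put(byte[] term, int value) {
    int slot = findSlot(term);
    int id = table[slot];
    if (id < 0) {
      if (size == values.length) {
        values = Arrays.copyOf(values, size * 2);
        offsets = Arrays.copyOf(offsets, size * 2 + 1);
      }
      while (poolUsed + term.length > pool.length) {
        pool = Arrays.copyOf(pool, pool.length * 2);
      }
      System.arraycopy(term, 0, pool, poolUsed, term.length);
      poolUsed += term.length;
      id = size++;
      offsets[id + 1] = poolUsed;
      table[slot] = id;
      if (size * 2 > table.length) {
        rehash(); // keep the load factor at or below 0.5
      }
    }
    values[id] = value;
    return id;
  }

  int get(byte[] term, int missing) {
    int id = table[findSlot(term)];
    return id < 0 ? missing : values[id];
  }

  // Linear probing: returns the slot holding the term, or the empty slot where it belongs.
  private int findSlot(byte[] term) {
    int mask = table.length - 1;
    int slot = hash(term, 0, term.length) & mask;
    while (table[slot] >= 0 && !equalsAt(table[slot], term)) {
      slot = (slot + 1) & mask;
    }
    return slot;
  }

  private boolean equalsAt(int id, byte[] term) {
    int start = offsets[id];
    int len = offsets[id + 1] - start;
    if (len != term.length) {
      return false;
    }
    for (int i = 0; i < len; i++) {
      if (pool[start + i] != term[i]) {
        return false;
      }
    }
    return true;
  }

  private void rehash() {
    table = new int[table.length * 2];
    Arrays.fill(table, -1);
    int mask = table.length - 1;
    for (int id = 0; id < size; id++) {
      int slot = hash(pool, offsets[id], offsets[id + 1] - offsets[id]) & mask;
      while (table[slot] >= 0) {
        slot = (slot + 1) & mask;
      }
      table[slot] = id;
    }
  }

  private static int hash(byte[] b, int off, int len) {
    int h = 1;
    for (int i = 0; i < len; i++) {
      h = 31 * h + b[off + i];
    }
    return h;
  }
}
```

Compared with a `HashMap` of key objects and boxed `Integer` values, this layout keeps a few large primitive arrays instead of one small object per entry, which is the kind of saving the description refers to.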

Benchmark

We benchmarked the sort logic on 1,000,000 terms; the time taken decreased by 66%:

	Baseline	Candidate	Took Diff
total took	692	234	-66.18%

An end-to-end benchmark that deletes documents by term 1,000,000 times showed the time taken decreasing by 43% (with the default IndexWriter config):

	Baseline	Candidate	Took Diff
total took	1629	923	-43.34%

jpountz

Nice! I left some minor comments but it looks great in general.

lucene/core/src/java/org/apache/lucene/index/BufferedUpdates.java

jpountz

LGTM. I like that it also makes buffered deletes more memory-efficient as a side-effect.

gf2121 · 2023-09-22T05:32:33Z

Thanks @jpountz !

gf2121 added 6 commits September 20, 2023 16:11

stash

224b158

ordered

de0566c

stash

c214852

fix

e29dea0

fix

4fe9590

add CHANGES

8da10fd

gf2121 changed the title ~~Speed up sort on deleted terms~~ Use radix sort to speed up the sorting of deleted terms Sep 21, 2023

gf2121 added 2 commits September 21, 2023 16:41

fix change

80fe33c

reuse pool

28881b3

jpountz reviewed Sep 21, 2023

gf2121 added 3 commits September 21, 2023 21:58

review fix

14d90b6

conflict

283c421

improve diff

8c35ed1

jpountz approved these changes Sep 21, 2023

gf2121 merged commit 8b84f6c into apache:main Sep 22, 2023
4 checks passed

gf2121 added a commit to gf2121/lucene that referenced this pull request Sep 22, 2023

Use radix sort to speed up the sorting of deleted terms (apache#12573)

85addb1

gf2121 mentioned this pull request Sep 22, 2023

Use radix sort to speed up the sorting of deleted terms (Backport 9x) #12584

Merged

gf2121 added a commit that referenced this pull request Sep 22, 2023

Use radix sort to speed up the sorting of deleted terms (#12573)

d3a3391

gf2121 mentioned this pull request Oct 7, 2023

DeletedTerms#clear should reset ByteBlockPool #12630

Merged

gf2121 added this to the 9.9.0 milestone Oct 18, 2023
