Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use radix sort to speed up the sorting of deleted terms #12573

Merged
merged 11 commits into from
Sep 22, 2023

Conversation

gf2121
Copy link
Contributor

@gf2121 gf2121 commented Sep 20, 2023

Description

Recently, we captured a flame graph in a scene with frequent updates, which showed that sorting deleted terms occupied a high CPU ratio.

In scenarios with many deleted terms, most terms could have the same field name. So a data structure like Map<String, Map<BytesRef, Integer>> instead of Map<Term, Integer> could be better here —— We can avoid field name compare and use MSBRadixSort to sort the bytes for each field.

We can also take advantage of BytesRefHash to implement Map<BytesRef, Integer> to get a more efficient memory layout, and there's already a MSBRadixSort impl in BytesRefHash we can reuse.

Benchmark

We benchmarked the sort logic for 1,000,000 terms, showing 66% took decreasing:

  Baseline Candidate Took Diff
total took 692 234 -66.18%

An E2E benchmark that delete document with term 1,000,000 times showing took decreased 43% (with default iw config).

  Baseline Candidate Took Diff
total took 1629 923 -43.34%

@gf2121 gf2121 changed the title Speed up sort on deleted terms Use radix sort to speed up the sorting of deleted terms Sep 21, 2023
Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I left some minor comments but it looks great in general.

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I like that it also makes buffered deletes more memory-efficient as a side-effect.

@gf2121
Copy link
Contributor Author

gf2121 commented Sep 22, 2023

Thanks @jpountz !

@gf2121 gf2121 merged commit 8b84f6c into apache:main Sep 22, 2023
4 checks passed
gf2121 added a commit to gf2121/lucene that referenced this pull request Sep 22, 2023
@gf2121 gf2121 added this to the 9.9.0 milestone Oct 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants