Skip to content

Deduplicate min and max term in single-term FieldReader#13618

Merged
original-brownbear merged 1 commit intoapache:mainfrom
original-brownbear:save-heap-single-term-tree
Jul 31, 2024
Merged

Deduplicate min and max term in single-term FieldReader#13618
original-brownbear merged 1 commit intoapache:mainfrom
original-brownbear:save-heap-single-term-tree

Conversation

@original-brownbear
Copy link
Member

I noticed that single-term readers are an edge case but not that uncommon in Elasticsearch heap dumps. It seems quite common to have a constant value for some field across a complete segment (e.g. a version value that is repeated endlessly in logs).
Seems simple enough to deduplicate here to save a couple MB of heap, though it's admittedly an edge case that mostly matters for large segment counts.

I noticed that single-term readers are an edge case but not that
uncommon in Elasticsearch heap dumps. It seems quite common to have a
constant value for some field across a complete segment (e.g. a version
value that is repeated endlessly in logs).
Seems simple enough to deduplicate here to save a couple MB of heap.
Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like a corner case, but the change is simple enough. LGTM.

@original-brownbear
Copy link
Member Author

Thanks Adrien!

@original-brownbear original-brownbear merged commit 47650a4 into apache:main Jul 31, 2024
@original-brownbear original-brownbear added this to the 9.12.0 milestone Jul 31, 2024
@original-brownbear original-brownbear deleted the save-heap-single-term-tree branch July 31, 2024 18:38
original-brownbear added a commit that referenced this pull request Jul 31, 2024
I noticed that single-term readers are an edge case but not that
uncommon in Elasticsearch heap dumps. It seems quite common to have a
constant value for some field across a complete segment (e.g. a version
value that is repeated endlessly in logs).
Seems simple enough to deduplicate here to save a couple MB of heap.
original-brownbear added a commit to original-brownbear/lucene that referenced this pull request Mar 10, 2025
Found this case in Elasticsearch heap dump's and it'as the same as apache#13618 but for BKDReader
instances. Admittedly, as for terms this is a corner case but
observed use cases where the distribution of these values is heavily weighted
towards a single value would save up to a O(5M) in heap and some cache
pressure from this change.
original-brownbear added a commit to original-brownbear/lucene that referenced this pull request Mar 10, 2025
Found this case in Elasticsearch heap dump's and it'as the same as apache#13618 but for BKDReader
instances. Admittedly, as for terms this is a corner case but
observed use cases where the distribution of these values is heavily weighted
towards a single value would save up to a O(5M) in heap and some cache
pressure from this change.
original-brownbear added a commit that referenced this pull request Mar 10, 2025
Found this case in Elasticsearch heap dump's and it'as the same as #13618 but for BKDReader
instances. Admittedly, as for terms this is a corner case but
observed use cases where the distribution of these values is heavily weighted
towards a single value would save up to a O(5M) in heap and some cache
pressure from this change.
original-brownbear added a commit that referenced this pull request Mar 10, 2025
Found this case in Elasticsearch heap dump's and it'as the same as #13618 but for BKDReader
instances. Admittedly, as for terms this is a corner case but
observed use cases where the distribution of these values is heavily weighted
towards a single value would save up to a O(5M) in heap and some cache
pressure from this change.
hanbj pushed a commit to hanbj/lucene that referenced this pull request Mar 12, 2025
Found this case in Elasticsearch heap dump's and it'as the same as apache#13618 but for BKDReader
instances. Admittedly, as for terms this is a corner case but
observed use cases where the distribution of these values is heavily weighted
towards a single value would save up to a O(5M) in heap and some cache
pressure from this change.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants