LUCENE-9536: Optimize OrdinalMap when one segment contains all distinct values. #1948

jtibshirani · 2020-10-05T20:24:47Z

For doc values whose cardinality is not too high, it is common for some large
segments to contain all distinct values. In this case, we can check whether the
first segment's ords map perfectly to global ords, and if so store the global
ord deltas and first segment indices as `LongValues.ZEROES` to save some space.
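
For readers following along, here is a minimal sketch of the check described above. It is a toy and not the code in this PR: `FirstSegmentCheck` and `firstSegmentCoversAllValues` are hypothetical names, and a plain `long[]` stands in for the packed per-segment deltas. It ORs together the deltas between each first-segment ordinal and its global ordinal, loosely mirroring the `ordDeltaBits[0] == 0L && ordDeltas[0].size() == this.valueCount` condition in the diff.

```java
/** Toy illustration only; the class and method names are hypothetical, not Lucene API. */
final class FirstSegmentCheck {

  /**
   * firstSegmentToGlobal[i] is assumed to hold the global ordinal of the
   * first segment's ordinal i. Returns true when every delta is zero and the
   * segment has as many ordinals as there are global values.
   */
  static boolean firstSegmentCoversAllValues(long[] firstSegmentToGlobal, long valueCount) {
    long deltaBits = 0L;
    for (int segmentOrd = 0; segmentOrd < firstSegmentToGlobal.length; segmentOrd++) {
      // OR the deltas together: the result stays 0 only if every delta is 0.
      deltaBits |= firstSegmentToGlobal[segmentOrd] - segmentOrd;
    }
    return deltaBits == 0L && firstSegmentToGlobal.length == valueCount;
  }

  public static void main(String[] args) {
    // First segment holds all 4 values: segment ords equal global ords.
    System.out.println(firstSegmentCoversAllValues(new long[] {0, 1, 2, 3}, 4));
    // First segment is missing global ord 1, so some deltas are non-zero.
    System.out.println(firstSegmentCoversAllValues(new long[] {0, 2, 3}, 4));
  }
}
```

When the check passes, both lookup structures can be replaced by a shared all-zeroes instance, which is where the space saving comes from.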


jtibshirani · 2020-10-05T20:31:23Z

I used TestOrdinalMap to test a map with 10,000 terms and ~10 segments. In the scenario where one segment contains all ordinal values, it shows a small improvement:

baseline bytes used: 11184
new bytes used: 10536

jpountz

Thanks Julie, this looks good to me, I only left some minor comments. Do you know if we already have tests that exercise this optimization?

jpountz · 2020-10-13T06:48:16Z

lucene/core/src/java/org/apache/lucene/index/OrdinalMap.java

+    if (ordDeltaBits.length > 0 && ordDeltaBits[0] == 0L && ordDeltas[0].size() == this.valueCount) {
+      this.firstSegments = LongValues.ZEROES;
+      this.globalOrdDeltas = LongValues.ZEROES;
+      ramBytesUsed += RamUsageEstimator.shallowSizeOf(LongValues.ZEROES);


We could ignore it completely from ramBytesUsed, since this singleton is allocated anyway, regardless of whether the optimization uses it.

I added this to address a failure in TestOrdinalMap. But now I see it makes more sense to modify the test!

jpountz · 2020-10-13T06:50:40Z

lucene/core/src/java/org/apache/lucene/index/OrdinalMap.java

    resources.add(Accountables.namedAccountable("segment map", segmentMap));
-    // TODO: would be nice to return actual child segment deltas too, but the optimizations are confusing
+    // TODO: would be nice to return the ordinal and segment maps too, but it's not straightforward
+    //  because of optimizations.


Could we do something like `if (firstSegments != LongValues.ZEROES) { resources.add(Accountables.namedAccountable("first segments", firstSegments)); }`?

I think we'd need a cast here, since LongValues doesn't implement Accountable. Alternatively, we could consider a bigger change to have LongValues implement Accountable.

Update: I just saw LUCENE-9387, it probably doesn't make sense to increase usage of Accountable.
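
To make the trade-off in this thread concrete, here is a rough sketch of what reporting the child resource only in the non-optimized case could look like, including the cast mentioned above. Everything in it is a stand-in: `LongLookup`, `Sized`, `PackedLookup`, and `childResources` are hypothetical and only loosely play the roles of `LongValues`, `Accountable`, a packed reader, and `getChildResources`. The byte estimate is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy illustration only; none of these types are Lucene API. */
final class ChildResourcesSketch {

  /** Hypothetical stand-in for a long-to-long lookup. */
  interface LongLookup {
    long get(long index);
  }

  /** Hypothetical stand-in for something that can report its RAM usage. */
  interface Sized {
    long ramBytesUsed();
  }

  /** Shared singleton returning 0 for every index; it is allocated regardless. */
  static final LongLookup ZEROES = index -> 0L;

  /** A lookup backed by an array, which does report its own size. */
  static final class PackedLookup implements LongLookup, Sized {
    private final long[] values;

    PackedLookup(long[] values) {
      this.values = values;
    }

    @Override
    public long get(long index) {
      return values[(int) index];
    }

    @Override
    public long ramBytesUsed() {
      return 16L + 8L * values.length; // rough, illustrative estimate
    }
  }

  /** Reports the lookup as a child resource unless it is the shared singleton. */
  static List<String> childResources(LongLookup firstSegments) {
    List<String> resources = new ArrayList<>();
    if (firstSegments != ZEROES && firstSegments instanceof Sized) {
      // The cast is needed because the lookup type itself is not Sized.
      long bytes = ((Sized) firstSegments).ramBytesUsed();
      resources.add("first segments: " + bytes + " bytes");
    }
    return resources;
  }
}
```

The same identity comparison against the singleton is what lets the RAM estimate skip it entirely, as suggested in the earlier review comment.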

mikemccand · 2020-10-14T17:50:58Z

lucene/core/src/java/org/apache/lucene/index/OrdinalMap.java

+
+    // If the first segment contains all of the global ords, then we can apply a small optimization
+    // and hardcode the first segments and global ord deltas as all zeroes.
+    if (ordDeltaBits.length > 0 && ordDeltaBits[0] == 0L && ordDeltas[0].size() == this.valueCount) {


Hmm why only the first segment? Couldn't it be the 3rd segment, in addition, that matches the global ords?

Edit: ahh OK I understand now -- this opto is indeed specific to the first segment, so we can store this.firstSegments as all 0s. Good!

Do we (somewhere, couldn't find it here) pre-sort all segments by the cardinality descending? Then we could know all segments that meet this optimization are at the start of the segments list, and possibly building the ordinal map is faster (not sure). But then we would need to un-sort in the end to return the final OrdinalMap. But it might enable this opto to apply more often, except, I think we would then need an additional dereference on lookup, hmm.

Does our PackedLongValues.monotonicBuilder already optimize for the case where it is all 0s, for the case where another segment (not the first) has all the global values as well?

Do we (somewhere, couldn't find it here) pre-sort all segments by the cardinality descending?

We do in fact -- the segments are sorted by 'weight', which in all call sites corresponds to the number of unique terms. This was added in LUCENE-5782.

Does our PackedLongValues.monotonicBuilder already optimize for the case where it is all 0s, for the case where another segment (not the first) has all the global values as well?

When constructing the individual PackedInts.Reader instances, we do identify the all 0s case and use the lightweight PackedInts.NullReader. It's great we optimize that case, but it does mean this PR doesn't make an enormous space difference.

Do we (somewhere, couldn't find it here) pre-sort all segments by the cardinality descending?

We do in fact -- the segments are sorted by 'weight', which in all call sites corresponds to the number of unique terms. This was added in LUCENE-5782.

Ahh, so then we know the first segment will indeed have the most unique terms, and therefore the highest chance of having "all 0s" ord deltas.

I think 2nd and 3rd segments also might have all 0s ord deltas? But we can try to optimize that in a followon issue ... progress not perfection!

Does our PackedLongValues.monotonicBuilder already optimize for the case where it is all 0s, for the case where another segment (not the first) has all the global values as well?

When constructing the individual PackedInts.Reader instances, we do identify the all 0s case and use the lightweight PackedInts.NullReader. It's great we optimize that case, but it does mean this PR doesn't make an enormous space difference.

Got it. Well, it's great that all these layers optimize :)

I think 2nd and 3rd segments also might have all 0s ord deltas? But we can try to optimize that in a followon issue ... progress not perfection!

I wonder if you are confused here: the proposed changes optimize the mapping from global ordinals to the ordinals of one arbitrary segment. When a segment has all values, we can simplify by always picking this segment, but there is no need to optimize this for the 2nd or 3rd segments, since we only need to be able to translate global ordinals to the ordinal of a single segment. Or maybe I'm the one confused by what you were suggesting. :)

Aha! Sorry, I was indeed confused ;)

This is to enable "retrieve BytesRef for this global ordinal" use-case, right? For that, we first pick a segment to use (the first one also containing that BytesRef), then map to its segment-local ordinal, then retrieve the BytesRef for that using the existing doc values API for that segment.

We do not (need to, nor) expose an API today to "retrieve segment N's ordinal corresponding to global ordinal M". Only the reverse direction (segment N's ordinal M maps to global ordinal O).

I think I understand now!

This is to enable "retrieve BytesRef for this global ordinal" use-case, right?

Right!

We do not (need to, nor) expose an API today to "retrieve segment N's ordinal corresponding to global ordinal M"

Correct.
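
The lookup flow agreed on in this thread can be sketched as follows. This is a simplified model, not Lucene's implementation: `GlobalOrdLookup` is a hypothetical class, `String` arrays stand in for per-segment doc values, and a `LongUnaryOperator` stands in for `LongValues`. It also assumes the segment ordinal is obtained by subtracting the stored delta from the global ordinal.

```java
import java.util.function.LongUnaryOperator;

/** Toy illustration only; the class and its fields are hypothetical, not Lucene API. */
final class GlobalOrdLookup {

  /** Shared all-zeroes lookup, used when the first segment contains every value. */
  static final LongUnaryOperator ZEROES = ord -> 0L;

  private final String[][] segmentTerms;          // sorted terms of each segment
  private final LongUnaryOperator firstSegments;   // global ord -> first segment containing it
  private final LongUnaryOperator globalOrdDeltas; // global ord -> (global ord - segment ord)

  GlobalOrdLookup(
      String[][] segmentTerms, LongUnaryOperator firstSegments, LongUnaryOperator globalOrdDeltas) {
    this.segmentTerms = segmentTerms;
    this.firstSegments = firstSegments;
    this.globalOrdDeltas = globalOrdDeltas;
  }

  /** Picks a segment for the global ordinal, maps to its local ordinal, then reads the term. */
  String lookupTerm(long globalOrd) {
    int segment = (int) firstSegments.applyAsLong(globalOrd);
    long segmentOrd = globalOrd - globalOrdDeltas.applyAsLong(globalOrd);
    return segmentTerms[segment][(int) segmentOrd];
  }

  public static void main(String[] args) {
    // Segment 0 holds every term, so both lookups collapse to all zeroes.
    String[][] terms = {{"a", "b", "c"}, {"b"}};
    GlobalOrdLookup optimized = new GlobalOrdLookup(terms, ZEROES, ZEROES);
    System.out.println(optimized.lookupTerm(1));
  }
}
```

With both lookups set to the all-zeroes instance, every global ordinal resolves to segment 0 with an identical local ordinal, which is why only the first segment matters for this optimization.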

mikemccand · 2020-10-14T17:51:20Z

lucene/core/src/java/org/apache/lucene/index/OrdinalMap.java

+    this.valueCount = globalOrd;
+
+    // If the first segment contains all of the global ords, then we can apply a small optimization
+    // and hardcode the first segments and global ord deltas as all zeroes.


Insert possessive quote (first segment's)?

I don't think the possessive quote gives the right meaning? Perhaps I could say 'first segment indices' here to be more clear.

Oh, I thought it was the first segment's deltas as all zeros and also the global ord deltas as all zeros? But I'm OK with just rewording it to make it less controversial, or even just leaving this wording!

jtibshirani · 2020-10-15T21:30:58Z

Do you know if we already have tests that exercise this optimization?

I think several randomized tests will hit it, for example TestLucene80DocValuesFormat failed when I had a bug in the logic. I added a quick test case to TestOrdinalMap too for more solid coverage.

jtibshirani · 2020-10-26T18:43:36Z

@jpountz @mikemccand no rush, but this is ready for another look.

mikemccand

Thank you @jtibshirani, this opto makes sense to me now and looks great!

mikemccand · 2020-10-30T15:25:52Z

lucene/core/src/java/org/apache/lucene/index/OrdinalMap.java

+
+    // If the first segment contains all of the global ords, then we can apply a small optimization
+    // and hardcode the first segments and global ord deltas as all zeroes.
+    if (ordDeltaBits.length > 0 && ordDeltaBits[0] == 0L && ordDeltas[0].size() == this.valueCount) {


Aha! Sorry, I was indeed confused ;)

This is to enable "retrieve BytesRef for this global ordinal" use-case, right? For that, we first pick a segment to use (the first one also containing that BytesRef), then map to its segment-local ordinal, then retrieve the BytesRef for that using the existing doc values API for that segment.

We do not (need to, nor) expose an API today to "retrieve segment N's ordinal corresponding to global ordinal M". Only the reverse direction (segment N's ordinal M maps to global ordinal O).

I think I understand now!

mikemccand · 2020-10-30T15:31:15Z

lucene/core/src/test/org/apache/lucene/index/TestOrdinalMap.java

+import java.io.IOException;
+import java.lang.reflect.Field;
+import java.util.HashMap;
+


Does (would) https://issues.apache.org/jira/browse/LUCENE-9564 enforce import ordering check?

I think it would avoid this sort of change (I use IntelliJ which autoformats java imports at the end).


jpountz reviewed Oct 13, 2020


mikemccand reviewed Oct 14, 2020


jtibshirani added 3 commits October 15, 2020 11:35

Omit static singleton from RAM estimate.

55ac801

Clarify comment.

e9d7b94

Add test that covers the optomization.

a0330e3

mikemccand approved these changes Oct 30, 2020


jpountz merged commit 8f004f7 into apache:master Nov 2, 2020

jtibshirani deleted the ordinal-map branch November 2, 2020 17:22
