-
Notifications
You must be signed in to change notification settings - Fork 982
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up NumericDocValuesWriter with index sorting #12381
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments about the implementation but it looks like a good optimization to me. Can you also add a CHANGES entry under 9.8?
@@ -75,4 +75,9 @@ public DocIdSetIterator iterator() { | |||
public int cardinality() { | |||
return cardinality; | |||
} | |||
|
|||
/** Return the FixedBitSet of this set. */ | |||
public FixedBitSet bitSet() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather like to expose something like boolean dense()
instead of the internal bitset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, it is fixed
@@ -111,4 +111,70 @@ public void or(DocIdSetIterator iter) throws IOException { | |||
set(doc); | |||
} | |||
} | |||
|
|||
public static final BitSet all(int maxDoc) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We currently only have 2 implementations of BitSet
, which the JVM optimizes better than N implementations. Could we remove this special BitSet
implementation and use a special null
marker instead to imply that all docs match?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great suggestion! i tested earlier the virtual function can be inline always if there are only 2 implementations
d00b36c
to
b5a5b1f
Compare
@jpountz Thank you for comments, it's very helpful to me, the code has updated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some minor comments, it looks good to me otherwise.
@@ -76,7 +76,8 @@ public void flush(SegmentWriteState state, Sorter.DocMap sortMap, NormsConsumer | |||
NumericDocValuesWriter.sortDocValues( | |||
state.segmentInfo.maxDoc(), | |||
sortMap, | |||
new BufferedNorms(values, docsWithField.iterator())); | |||
new BufferedNorms(values, docsWithField.iterator()), | |||
docsWithField.dense() && sortMap.size() == docsWithField.cardinality()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only testing dense()
should be correct right?
docsWithField.dense() && sortMap.size() == docsWithField.cardinality()); | |
docsWithField.dense()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no,DocsWithFieldSet will update it's bitset when a doc with field really added only, in sparse, it will not call DocsWithFieldSet#add, so if the first 64 doc with field has added, and then some doc added without this filed, the docsWithField.dense() will return true, i think we can remove docsWithField.dense()
, use sortMap.size() == docsWithField.cardinality()
only for dense case, the sortMap.size() will return the number of documents for the LeafReader, what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you're right! thanks for explaining
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i have removed docsWithField.dense()
sortMap.size(), | ||
sortMap, | ||
oldValues, | ||
docsWithField.dense() && sortMap.size() == docsWithField.cardinality()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and likewise here?
docsWithField.dense() && sortMap.size() == docsWithField.cardinality()); | |
docsWithField.dense()); |
if (target < maxDoc) { | ||
return target; | ||
} | ||
return DocIdSetIterator.NO_MORE_DOCS; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the call site, this is only called when target
is less than maxDoc
, so thit could be simplified to just return target
without the target < maxDoc
check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch, it's updated
Description
like pr-399, the DocsWithFieldSet#add() can avoid create instance for FixedBitSet in dense scene, so in NumericDocValuesWriter#sortDocValues() we can do the same thing, just the way to judge dense or sparse, I'm not sure if it's rigorous enough
the benchmark for write ten SortedNumericDocValuesField, the optimization saves ~7% commit time