LUCENE-9450 Use BinaryDocValues in the taxonomy writer #1733
Conversation
Changes in this revision (incorporated from feedback on JIRA):
From the error log: …
I see why the NPE is happening -- I left a comment.
One concern with this change is that we are creating a new `BinaryDocValues` per facet ordinal we want to resolve. That is not very efficient -- it'd be better to create a single `BinaryDocValues`, then sort all facet ordinals we need to resolve in natural `int` order, then resolve them one by one by `.advanceExact`'ing in order. However, 1) that'd require major changes to the faceting API, and 2) perhaps, even with the inefficiency of making a new `BinaryDocValues` for every facet ordinal, this is still more efficient than loading stored fields for that document (which, at the default `Codec`, must decompress a block of documents each time unless you get lucky and subsequent ordinals happen to be in the same block). So, progress not perfection, but let's try to confirm with benchmarks that this change is indeed faster than the current stored fields solution, once we get tests passing.
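The bulk-lookup idea described above (sort the ordinals, then `advanceExact` through a single forward-only iterator) can be sketched with a tiny stand-in for the doc-values iterator. Everything here (`FakeBinaryDocValues`, `resolveAll`) is a hypothetical model for illustration, not Lucene's actual API:

```java
import java.util.Arrays;

// Minimal stand-in for Lucene's forward-only BinaryDocValues iterator; only
// advanceExact() and binaryValue() are modeled here.
class FakeBinaryDocValues {
    private final String[] labels; // label per ordinal (doc id); null = no value
    private int doc = -1;

    FakeBinaryDocValues(String[] labels) {
        this.labels = labels;
    }

    // Forward-only, like the real iterator: advancing backwards is illegal.
    boolean advanceExact(int target) {
        if (target < doc) {
            throw new IllegalStateException("cannot advance backwards");
        }
        doc = target;
        return target < labels.length && labels[target] != null;
    }

    String binaryValue() {
        return labels[doc];
    }
}

public class BulkOrdinalLookup {
    // Resolve many ordinals through ONE iterator by visiting them in ascending
    // order, instead of opening a fresh iterator per ordinal. Results come
    // back in sorted-ordinal order, which is why this would need API changes.
    static String[] resolveAll(FakeBinaryDocValues dv, int[] ordinals) {
        int[] sorted = ordinals.clone();
        Arrays.sort(sorted); // the forward-only iterator demands ascending doc ids
        String[] out = new String[sorted.length];
        for (int i = 0; i < sorted.length; i++) {
            out[i] = dv.advanceExact(sorted[i]) ? dv.binaryValue() : null;
        }
        return out;
    }
}
```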
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
Woohoo, tests all pass now? What a tiny change it turned out to be :) Can you try to run … We can also (separate follow-on issue!) better optimize the …
This looks promising! It is close! Thanks @gautamworah96.
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
This change looks good to me! I think the biggest issue is what to do about backwards compatibility: users who upgrade to this release will suddenly find their taxonomy index unreadable. We could 1) make this a Lucene 9.x-only change. Normally, for Lucene's main index, the next major release must be able to read all stable releases from the previous major release, but for the taxonomy index I suspect it is OK if we relax that and make a hard break in the index format. There are very few (but non-zero) users of Lucene's faceting. Or 2) we add basic backwards compatibility support here, and then we can push this to 8.x stable releases. E.g. if we could detect that we are opening an already created (stored-fields-based) taxonomy index, we use the old way, but if we are making a new taxonomy index, we use the new way. This would be pretty simple to build, I suspect. E.g. on opening the index, we could try to pull …
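The per-hit fallback being proposed in this thread (try the new doc-values field first, then fall back to the legacy stored field for pre-upgrade segments) can be sketched as follows. The `Map`-backed lookups and the `FacetLabelReader`/`demo` names are hypothetical stand-ins for Lucene's per-segment readers, used only to illustrate the control flow:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a read-time back-compat fallback for resolving facet ordinals:
// consult the new BinaryDocValues-style lookup first, and fall back to the
// legacy stored-field lookup for segments written before the format change.
public class FacetLabelReader {
    private final Map<Integer, String> docValuesLabels; // new-format segments
    private final Map<Integer, String> storedLabels;    // pre-upgrade segments

    FacetLabelReader(Map<Integer, String> dv, Map<Integer, String> stored) {
        this.docValuesLabels = dv;
        this.storedLabels = stored;
    }

    // BinaryDocValues first, then the stored field; the back-compat cost is
    // paid only for ordinals that live in old segments.
    String getPath(int ordinal) {
        String label = docValuesLabels.get(ordinal);
        return label != null ? label : storedLabels.get(ordinal);
    }

    // Small mixed-segment example: ordinal 1 predates the upgrade, ordinal 2
    // was written with the new format.
    static FacetLabelReader demo() {
        Map<Integer, String> dv = new HashMap<>();
        dv.put(2, "Author/Lisa");
        Map<Integer, String> stored = new HashMap<>();
        stored.put(1, "Author/Bob");
        return new FacetLabelReader(dv, stored);
    }
}
```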
Also add a TODO for the bulk lookup API
I think the back-compat layer at read-time is a good start, but it is not quite enough. Imagine Susan: she upgrades to Lucene with this change, pushed as it is now. Susan runs some queries against her index, resolves ordinals using the stored fields, and all is good. Susan then indexes some more documents with facet labels, and these new segments in the taxonomy index are written using `BinaryDocValues`. Could you try to add a test case showing this case? Have a look at Lucene's …
Thanks @gautamworah96 for this impactful change and @mikemccand for reviewing it.
On Sep 22, 2020, at 11:04 AM, Michael McCandless ***@***.***> wrote:
So I propose we get rid of the fullPathField altogether.
Wow, +1, this looks like it is (pre-existingly?) double-indexed? Maybe we should do this as a separate pre-cursor PR to this one (switch to StoredField when indexing the fullPathField)?
For maintaining backwards compatibility, we can read facet labels from new BinaryDocValues field, falling back to old StoredField if BinaryDocValues field does not exist or has no value for the docId. The performance penalty of doing so should be acceptable.
Yeah +1 to, on a hit by hit basis, try BinaryDocValues first, and then fallback to the StoredField. This is the cost of backwards compatibility ... though, for a fully new (all BinaryDocValues) index, the performance should be fine. Also, note that in Lucene 10.x we can remove that back-compat fallback.
Alternatively we can implement a special merge policy that takes care of moving data from old Stored field to BinaryDocValues field at the time of merge but that might be tricky to implement.
I think this would indeed be tricky.
Andrzej and I spent quite a bit of time trying to get something similar to work: adding docValues on the fly using a custom merge policy. We realized that you could create a docValues field from an indexed field for primitive types, since all the information was already in the index. But we never could get it working while active indexing was happening, so we resorted to a batch process, run on a quiescent index, that rewrote all segments and did the transformation along the way. The client decided that was good enough and didn't want to spend more time on it.
Our best guess was that there was a race condition that we somehow couldn’t find in the time allowed… Mostly just FYI...
FWIW,
Erick
Run all tests in a PR
This new revision contains the following changes:
Tests: …
Follow-up issue: …
Thanks @gautamworah96! So the new test failed with the previous revision, then you fixed the back compat, and the test now passes?
Thanks @gautamworah96 -- I left a couple minor comments, but otherwise I think this is ready!
...ne/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestBackwardsCompatibility.java
...ne/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestBackwardsCompatibility.java
Yes. The modified …
PR looks great! Thanks @gautamworah96 -- this is a nice performance gain for Lucene faceting. I will push the change soon to 9.0.
Thanks @gautamworah96 -- this looks great! I'll push soon!
Tested the commit with the original Lucene master branch and it passes successfully. This test was failing initially without the dependency.
The earlier revision of this PR had backwards compatibility test failures after merging because the Lucene codec has changed. There are minor changes in the … The current merge failure is due to this …
Thanks @gautamworah96 -- the test works for me now! I'll push soon.
Thanks for rebasing @gautamworah96!
…ene's facet implementation, yielding ~4-5% red-line QPS gain in pure faceting benchmarks (apache#1733)
This code change is a duplicate of the effort in the open source LUCENE-9450 as the open source PR change is not backwards compatible at the moment. Github PR link: apache#1733 TEST: bb release SIM: https://issues.amazon.com/issues/LUCENE-3117
Description
This PR modifies the taxonomy writer and reader implementations to use `BinaryDocValues` instead of stored fields.
The taxonomy index uses stored fields today, so each query must perform a number of stored-field lookups to resolve taxonomy ordinals back to human-presentable facet labels.
Solution
Change the storage format to use `BinaryDocValues`
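As a rough sketch, the writer-side change amounts to indexing the category path into a binary doc-values field instead of retrieving it through stored fields. This is illustrative only, not the committed diff; `Consts.FULL` and `fieldPath` follow the names used in the review discussion, and the exact fields written may differ:

```java
// Illustrative sketch only (not the committed change). The taxonomy writer
// builds one Document per category; the full category path used to be kept
// in a stored field and fetched via a stored-fields lookup at query time.
Document doc = new Document();

// Old approach: path retrievable only through stored fields:
//   doc.add(new StringField(Consts.FULL, fieldPath, Field.Store.YES));

// New approach (this PR): keep the path in a BinaryDocValues column keyed
// by the category's doc id / ordinal, avoiding stored-field block
// decompression at lookup time:
doc.add(new StringField(Consts.FULL, fieldPath, Field.Store.NO));
doc.add(new BinaryDocValuesField(Consts.FULL, new BytesRef(fieldPath)));

indexWriter.addDocument(doc);
```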
Tests
`ant test` fails because `.binaryValue()` throws a `NullPointerException`.
To reproduce the error:
ant test -Dtestcase=TestExpressionAggregationFacetsExample -Dtests.method=testSimple -Dtests.seed=4544BD51622879A4 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=si -Dtests.timezone=Antarctica/DumontDUrville -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
gives: …
3 other tests also fail at the function call
Checklist
Please review the following and check all that apply:
- I have developed this patch against the `master` branch.
- I have run `ant precommit` and the appropriate test suite.
- This is a draft PR.