
LUCENE-9450 Use BinaryDocValues in the taxonomy writer #1733

Merged: 26 commits, Nov 12, 2020

Conversation

@gautamworah96 (Contributor) commented Aug 10, 2020

Description

This PR modifies the taxonomy writer and reader implementations to use BinaryDocValues instead of stored fields.
The taxonomy index uses stored fields today and must do a number of stored field lookups for each query to resolve taxonomy ordinals back to human presentable facet labels.

Solution

Change the storage format to use DocValues

Tests

ant test fails because .binaryValue() throws a NullPointerException.

To reproduce the error:
ant test -Dtestcase=TestExpressionAggregationFacetsExample -Dtests.method=testSimple -Dtests.seed=4544BD51622879A4 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=si -Dtests.timezone=Antarctica/DumontDUrville -Dtests.asserts=true -Dtests.file.encoding=US-ASCII

gives

    [mkdir] Created dir: /Users/gauworah/opensource/mystuff/lucene-solr/lucene/build/demo/test/temp
   [junit4] <JUnit4> says Привет! Master seed: 4544BD51622879A4
   [junit4] Executing 1 suite with 1 JVM.
   [junit4] 
   [junit4] Started J0 PID(76859@localhost).
   [junit4] Suite: org.apache.lucene.demo.facet.TestExpressionAggregationFacetsExample
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestExpressionAggregationFacetsExample -Dtests.method=testSimple -Dtests.seed=4544BD51622879A4 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=si -Dtests.timezone=Antarctica/DumontDUrville -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
   [junit4] ERROR   0.61s | TestExpressionAggregationFacetsExample.testSimple <<<
   [junit4]    > Throwable #1: java.lang.NullPointerException
   [junit4]    >        at __randomizedtesting.SeedInfo.seed([4544BD51622879A4:7DF799AF45DBAD75]:0)
   [junit4]    >        at org.apache.lucene.index.MultiDocValues$3.binaryValue(MultiDocValues.java:403)
   [junit4]    >        at org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader.getPath(DirectoryTaxonomyReader.java:328)
   [junit4]    >        at org.apache.lucene.facet.taxonomy.FloatTaxonomyFacets.getTopChildren(FloatTaxonomyFacets.java:151)
   [junit4]    >        at org.apache.lucene.demo.facet.ExpressionAggregationFacetsExample.search(ExpressionAggregationFacetsExample.java:107)
   [junit4]    >        at org.apache.lucene.demo.facet.ExpressionAggregationFacetsExample.runSearch(ExpressionAggregationFacetsExample.java:118)
   [junit4]    >        at org.apache.lucene.demo.facet.TestExpressionAggregationFacetsExample.testSimple(TestExpressionAggregationFacetsExample.java:28)
   [junit4]    >        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   [junit4]    >        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   [junit4]    >        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   [junit4]    >        at java.base/java.lang.reflect.Method.invoke(Method.java:567)
   [junit4]    >        at java.base/java.lang.Thread.run(Thread.java:830)

Three other tests also fail at the same function call.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the master branch.
  • I have run ant precommit and the appropriate test suite.
  • I have added tests for my changes.
  • I have added documentation for the Ref Guide (for Solr changes only).

This is a draft PR

@gautamworah96 (Contributor, Author)

Changes in this revision (incorporated from feedback on JIRA):

  • Added a call to advanceExact() before calling .binaryValue() and an assert to check that the field exists in the index

  • Re-added the StringField with the Field.Store.YES changed to Field.Store.NO.

  • I've not added new tests at the moment. Trying to get the existing ones to work first.

From the error log: note that the code successfully executes the assert found statement (so the field does exist) and fails on the next line.
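The advanceExact-before-binaryValue contract this revision relies on can be illustrated with a small self-contained sketch. MockBinaryDocValues and AdvanceExactDemo below are hypothetical stand-ins, not Lucene's API; they only model the iterator rule that binaryValue() is valid solely after advanceExact(docId) has returned true:

```java
import java.util.Map;

// Hypothetical stand-in for Lucene's BinaryDocValues iterator contract:
// binaryValue() is only valid after advanceExact(docId) has returned true.
class MockBinaryDocValues {
    private final Map<Integer, String> values; // docId -> facet label (String here for simplicity)
    private String current;

    MockBinaryDocValues(Map<Integer, String> values) {
        this.values = values;
    }

    // Positions the iterator on docId; returns whether that doc has a value.
    boolean advanceExact(int docId) {
        current = values.get(docId);
        return current != null;
    }

    // Undefined behavior (here: an NPE) when not positioned on a doc with a value.
    String binaryValue() {
        if (current == null) {
            throw new NullPointerException("iterator not positioned on a value");
        }
        return current;
    }
}

public class AdvanceExactDemo {
    public static void main(String[] args) {
        MockBinaryDocValues dv = new MockBinaryDocValues(Map.of(3, "Author/Bob"));
        // The guarded pattern from this revision: advanceExact() before binaryValue().
        String label = dv.advanceExact(3) ? dv.binaryValue() : null;
        System.out.println(label); // prints Author/Bob
    }
}
```

Calling binaryValue() without the advanceExact() guard reproduces the kind of NPE seen in the test failure above.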

@mikemccand (Member) left a comment

I see why the NPE is happening -- I left a comment.

One concern with this change is that we are creating a new BinaryDocValues per facet ordinal we want to resolve. That is not very efficient -- it'd be better to create a single BinaryDocValues, then sort all the facet ordinals we need to resolve in natural int order, and resolve them one by one by advanceExact'ing in order. However, 1) that'd require major changes to the faceting API, and 2) even with the inefficiency of making a new BinaryDocValues for every facet ordinal, this is perhaps still more efficient than loading stored fields for that document (which, with the default Codec, must decompress a block of documents each time unless you get lucky and subsequent ordinals happen to be in the same block). So, progress not perfection, but let's try to confirm with benchmarks that this change is indeed faster than the current stored-fields solution, once we get tests passing.
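The single-iterator idea can be sketched as follows. ForwardOnlyValues and resolveAll are hypothetical stand-ins (not the faceting API); they show why ordinals must be visited in sorted order when sharing one forward-only iterator:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class BulkOrdinalResolve {
    // Hypothetical stand-in for a forward-only doc-values iterator:
    // advanceExact may only be called with non-decreasing docIds.
    static class ForwardOnlyValues {
        private final Map<Integer, String> values;
        private int lastDoc = -1;
        private String current;

        ForwardOnlyValues(Map<Integer, String> values) {
            this.values = values;
        }

        boolean advanceExact(int docId) {
            if (docId < lastDoc) {
                throw new IllegalStateException("cannot advance backwards");
            }
            lastDoc = docId;
            current = values.get(docId);
            return current != null;
        }

        String binaryValue() {
            return current;
        }
    }

    // Sort the ordinals first so a single shared iterator suffices.
    static Map<Integer, String> resolveAll(int[] ordinals, ForwardOnlyValues dv) {
        int[] sorted = ordinals.clone();
        Arrays.sort(sorted);
        Map<Integer, String> labels = new HashMap<>();
        for (int ord : sorted) {
            if (dv.advanceExact(ord)) {
                labels.put(ord, dv.binaryValue());
            }
        }
        return labels;
    }
}
```

Resolving unsorted ordinals directly against one iterator would attempt to advance backwards; sorting first is what makes the shared-iterator approach legal.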

@gautamworah96 gautamworah96 marked this pull request as ready for review August 13, 2020 06:47
@mikemccand (Member)

Woohoo, tests all pass now? What a tiny change it turned out to be :)

Can you try to run luceneutil benchmarks? Let's see if this is net/net faster. Even if it is the same speed, we should move forward -- stored fields are likely to get more compressed / slower to access over time, e.g. https://issues.apache.org/jira/browse/LUCENE-9447.

We can also (separate follow-on issue!) better optimize the ord -> FacetLabel lookup to do them in bulk, in order, so we can share a single BinaryDocValues instance per leaf per query.

@mikemccand (Member) left a comment

This looks promising! It is close! Thanks @gautamworah96.

@mikemccand (Member)

This change looks good to me!

I think the biggest issue is what to do about backwards compatibility. Users who upgrade to this release will suddenly find that their taxonomy index has become unreadable.

We could 1) make this a Lucene 9.x only change. Normally, for Lucene's main index, the next major release should be able to read all stable releases from the previous major release. But for the taxonomy index, I suspect it is OK if we relax that and make a hard break to the index. There are very few (but non-zero) users of Lucene's faceting.

Or 2) we add a basic backwards compatibility support here, and then we can push this to 8.x stable releases. E.g. if we could differentiate when we are opening an already created (based on stored fields) taxonomy index, use the old way, but if we are making a new taxonomy index, use the new way. This would be pretty simple to build, I suspect. E.g. on opening the index, we could try to pull BinaryDocValues and if it exists, we know it's the new way, else, use the old way, unless the index is empty, in which case, use the new way?
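Option 2's open-time detection could be sketched roughly like this (detect and its inputs are hypothetical simplifications, not Lucene's API):

```java
import java.util.List;

public class FormatDetection {
    enum ReadPath { BINARY_DOC_VALUES, STORED_FIELDS }

    // Each boolean stands in for "this segment carries the new BinaryDocValues
    // field". An empty index defaults to the new format.
    static ReadPath detect(List<Boolean> segmentHasDocValues) {
        if (segmentHasDocValues.isEmpty()) {
            return ReadPath.BINARY_DOC_VALUES;
        }
        for (boolean hasDv : segmentHasDocValues) {
            if (hasDv) {
                return ReadPath.BINARY_DOC_VALUES;
            }
        }
        return ReadPath.STORED_FIELDS; // legacy 8.x stored-fields index
    }
}
```

As the rest of the conversation discusses, a whole-index decision like this is not sufficient once old and new segments get merged together, which is why a finer-grained fallback is needed.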

@mikemccand (Member)

I think the back-compat layer at read-time is a good start, but is not quite enough.

Imagine Susan. She upgrades to Lucene with this change, pushed as it is now. Susan runs some queries against her index, resolving ordinals using the stored fields, and all is good. Susan indexes some more documents with facet labels, and these new segments in the taxonomy index are written using BinaryDocValues. Susan refreshes and runs some queries, and some ordinals resolve the old way (if they came from old segments) and some the new way (if they came from new segments). Life goes on, more documents are indexed. Suddenly, Susan's taxonomy index executes a merge! Merging old and new segments together, the newly merged segment now holds a mix of docs: some that used stored fields and others that used BinaryDocValues. The back-compat logic will become confused and incorrectly try to use BinaryDocValues instead of stored fields, and I think that assert will trip for such documents?

Could you try to add a test case showing this case? Have a look at Lucene's TestBackwardsCompatibility -- it tests the main index, but you can borrow the ideas (e.g. APIs to zip/unzip) to implement a new unit test confirming we are maintaining back-compat for taxonomy index?

@gautamworah96 gautamworah96 marked this pull request as draft September 7, 2020 20:50
@goankur (Contributor) commented Sep 21, 2020

Thanks @gautamworah96 for this impactful change and @mikemccand for reviewing it.
A few thoughts

  1. This change disables the STORED fields part but keeps the POSTINGS part here:
    fullPathField = new StringField(Consts.FULL, "", Field.Store.NO);
    This is unnecessary, as postings are already enabled for facet labels in FacetsConfig#L364-L399, including dimension drill-down. So I propose we get rid of the fullPathField altogether.

  2. For maintaining backwards compatibility, we can read facet labels from the new BinaryDocValues field, falling back to the old StoredField if the BinaryDocValues field does not exist or has no value for the docId. The performance penalty of doing so should be acceptable. Alternatively, we can implement a special merge policy that moves data from the old stored field to the BinaryDocValues field at merge time, but that might be tricky to implement.

@mikemccand (Member)

So I propose we get rid of the fullPathField altogether.

Wow, +1 -- this looks like it is (pre-existingly?) double-indexed. Maybe we should do this as a separate precursor PR to this one (switch to StoredField when indexing the fullPathField)?

For maintaining backwards compatibility, we can read facet labels from new BinaryDocValues field, falling back to old StoredField if BinaryDocValues field does not exist or has no value for the docId. The performance penalty of doing so should be acceptable.

Yeah, +1 to trying BinaryDocValues first on a hit-by-hit basis and then falling back to the StoredField. This is the cost of backwards compatibility ... though for a fully new (all BinaryDocValues) index, the performance should be fine. Also, note that in Lucene 10.x we can remove that back-compat fallback.
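The per-hit fallback could look roughly like this; LabelSource and getPathLabel are hypothetical stand-ins for the two lookup paths, not the actual DirectoryTaxonomyReader code:

```java
import java.util.Optional;

public class BackCompatLookup {
    // Hypothetical abstraction over one way of resolving an ordinal to a label.
    interface LabelSource {
        Optional<String> lookup(int ordinal);
    }

    // Try the new BinaryDocValues path first; fall back to the legacy
    // stored-fields path when the doc-values field is missing or has no value.
    static String getPathLabel(int ordinal, LabelSource docValues, LabelSource storedFields) {
        return docValues.lookup(ordinal)
                .or(() -> storedFields.lookup(ordinal))
                .orElse(null);
    }
}
```

For a fully new index the stored-fields source is never consulted, so the fallback should cost little on the hot path, and in Lucene 10.x the second lookup path can simply be dropped.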

Alternatively we can implement a special merge policy that takes care of moving data from old Stored field to BinaryDocValues field at the time of merge but that might be tricky to implement.

I think this would indeed be tricky.

@ErickErickson commented Sep 22, 2020 via email

@gautamworah96 (Contributor, Author)

This new revision contains the following changes:

  1. Changed the DirectoryTaxonomyReader to decode values in getPath based on the boolean result from .advanceExact
  2. Added a new TestBackwardsCompatibility test file that reads from an older 8.6.3 stored-fields index and updates it with new fields.
  3. Added a zip file containing the old index.

Tests:
gradlew check and gradlew test pass.
Verified that the test for generating the old index works as expected with Gradle (added an Ignore tag to it so that it is not executed on normal runs).

Follow up issue:
The new test class introduced in this PR is similar to org.apache.lucene.index.TestBackwardsCompatibility. That class has some outdated comments that reference ant instead of gradle.
I've created a separate follow-on issue for this.

@gautamworah96 gautamworah96 marked this pull request as ready for review October 30, 2020 17:11
@mikemccand (Member)

Thanks @gautamworah96! So the new test failed with the previous revision, then you fixed the back compat and then the test now passes?

@mikemccand (Member) left a comment

Thanks @gautamworah96 -- I left a couple minor comments, but otherwise I think this is ready!

@gautamworah96 (Contributor, Author)

Thanks @gautamworah96! So the new test failed with the previous revision, then you fixed the back compat and then the test now passes?

Yes. The modified if condition in getPath does the trick.

@mikemccand (Member) left a comment

PR looks great! Thanks @gautamworah96 -- this is a nice performance gain for Lucene faceting. I will push the change soon to 9.0.

@mikemccand (Member) left a comment

Thanks @gautamworah96 -- this looks great! I'll push soon!

Tested the commit with the original Lucene master branch and it passes successfully. This test was failing initially without the dependency.
@gautamworah96 (Contributor, Author)

The earlier revision of this PR had backwards-compatibility test failures after merging because the Lucene codec had changed.
I've added a testCompile dependency on Lucene's 8.6.3 backward codecs.

There are minor changes in the versions.lock file to address the "found dependencies that were not in the lock state" error.
This file was autogenerated by running ./gradlew --write-locks.

The current merge failure is due to this versions.lock file change.

@mikemccand (Member) left a comment

Thanks @gautamworah96 -- the test works for me now! I'll push soon.

@mikemccand (Member) left a comment

Thanks for rebasing @gautamworah96!

@mikemccand mikemccand merged commit 3f8f84f into apache:master Nov 12, 2020
msfroh pushed a commit to msfroh/lucene-solr that referenced this pull request Nov 18, 2020
…ene's facet implementation, yielding ~4-5% red-line QPS gain in pure faceting benchmarks (apache#1733)
epugh pushed a commit to epugh/lucene-solr-1 that referenced this pull request Jan 15, 2021
…ene's facet implementation, yielding ~4-5% red-line QPS gain in pure faceting benchmarks (apache#1733)
gsmiller pushed a commit to gsmiller/lucene-solr that referenced this pull request Mar 17, 2021
This code change is a duplicate of the effort in the open source LUCENE-9450 as the open source PR change is not backwards compatible at the moment.

Github PR link: apache#1733 TEST: bb release SIM: https://issues.amazon.com/issues/LUCENE-3117