
LUCENE-9450 Use BinaryDocValues in the taxonomy writer #1733

Merged: 26 commits, Nov 12, 2020

Conversation

@gautamworah96 (Contributor) commented Aug 10, 2020

Description

This PR modifies the taxonomy writer and reader implementations to use BinaryDocValues instead of stored fields.
The taxonomy index uses stored fields today and must do a number of stored field lookups for each query to resolve taxonomy ordinals back to human presentable facet labels.

Solution

Change the storage format to use DocValues

Tests

ant test fails because .binaryValue() throws a NullPointerException.

To reproduce the error:
ant test -Dtestcase=TestExpressionAggregationFacetsExample -Dtests.method=testSimple -Dtests.seed=4544BD51622879A4 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=si -Dtests.timezone=Antarctica/DumontDUrville -Dtests.asserts=true -Dtests.file.encoding=US-ASCII

gives

    [mkdir] Created dir: /Users/gauworah/opensource/mystuff/lucene-solr/lucene/build/demo/test/temp
   [junit4] <JUnit4> says Привет! Master seed: 4544BD51622879A4
   [junit4] Executing 1 suite with 1 JVM.
   [junit4] 
   [junit4] Started J0 PID(76859@localhost).
   [junit4] Suite: org.apache.lucene.demo.facet.TestExpressionAggregationFacetsExample
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestExpressionAggregationFacetsExample -Dtests.method=testSimple -Dtests.seed=4544BD51622879A4 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=si -Dtests.timezone=Antarctica/DumontDUrville -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
   [junit4] ERROR   0.61s | TestExpressionAggregationFacetsExample.testSimple <<<
   [junit4]    > Throwable #1: java.lang.NullPointerException
   [junit4]    >        at __randomizedtesting.SeedInfo.seed([4544BD51622879A4:7DF799AF45DBAD75]:0)
   [junit4]    >        at org.apache.lucene.index.MultiDocValues$3.binaryValue(MultiDocValues.java:403)
   [junit4]    >        at org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader.getPath(DirectoryTaxonomyReader.java:328)
   [junit4]    >        at org.apache.lucene.facet.taxonomy.FloatTaxonomyFacets.getTopChildren(FloatTaxonomyFacets.java:151)
   [junit4]    >        at org.apache.lucene.demo.facet.ExpressionAggregationFacetsExample.search(ExpressionAggregationFacetsExample.java:107)
   [junit4]    >        at org.apache.lucene.demo.facet.ExpressionAggregationFacetsExample.runSearch(ExpressionAggregationFacetsExample.java:118)
   [junit4]    >        at org.apache.lucene.demo.facet.TestExpressionAggregationFacetsExample.testSimple(TestExpressionAggregationFacetsExample.java:28)
   [junit4]    >        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   [junit4]    >        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   [junit4]    >        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   [junit4]    >        at java.base/java.lang.reflect.Method.invoke(Method.java:567)
   [junit4]    >        at java.base/java.lang.Thread.run(Thread.java:830)

Three other tests also fail at the same function call.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the master branch.
  • I have run ant precommit and the appropriate test suite.
  • I have added tests for my changes.
  • I have added documentation for the Ref Guide (for Solr changes only).

This is a draft PR

@gautamworah96 (Contributor, Author)

Changes in this revision (incorporated from feedback on JIRA):

  • Added a call to advanceExact() before calling .binaryValue() and an assert to check that the field exists in the index

  • Re-added the StringField with the Field.Store.YES changed to Field.Store.NO.

  • I've not added new tests at the moment. Trying to get the existing ones to work first.

From the error log: note that the code successfully executes the assert found statement (so the field does exist) and fails on the next line.
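The advanceExact-before-binaryValue contract this revision relies on can be illustrated with a small self-contained sketch. MockBinaryDocValues and AdvanceExactDemo below are hypothetical stand-ins, not Lucene's API; they only model the iterator rule that binaryValue() is valid solely after advanceExact(docId) has returned true:

```java
import java.util.Map;

// Hypothetical stand-in for Lucene's BinaryDocValues iterator contract:
// binaryValue() is only valid after advanceExact(docId) has returned true.
class MockBinaryDocValues {
    private final Map<Integer, String> values; // docId -> facet label (String here for simplicity)
    private String current;

    MockBinaryDocValues(Map<Integer, String> values) {
        this.values = values;
    }

    // Positions the iterator on docId; returns whether that doc has a value.
    boolean advanceExact(int docId) {
        current = values.get(docId);
        return current != null;
    }

    // Undefined behavior (here: an NPE) when not positioned on a doc with a value.
    String binaryValue() {
        if (current == null) {
            throw new NullPointerException("iterator not positioned on a value");
        }
        return current;
    }
}

public class AdvanceExactDemo {
    public static void main(String[] args) {
        MockBinaryDocValues dv = new MockBinaryDocValues(Map.of(3, "Author/Bob"));
        // The guarded pattern from this revision: advanceExact() before binaryValue().
        String label = dv.advanceExact(3) ? dv.binaryValue() : null;
        System.out.println(label); // prints Author/Bob
    }
}
```

Calling binaryValue() without the advanceExact() guard reproduces the kind of NPE seen in the test failure above.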

@mikemccand (Member) left a comment

I see why the NPE is happening -- I left a comment.

One concern with this change is that we are creating a new BinaryDocValues per facet ordinal we want to resolve. That is not very efficient -- it'd be better to create a single BinaryDocValues, then sort all the facet ordinals we need to resolve in natural int order, and resolve them one by one by advanceExact'ing in order. However, 1) that'd require major changes to the faceting API, and 2) even with the inefficiency of making a new BinaryDocValues for every facet ordinal, this is perhaps still more efficient than loading stored fields for that document (which, with the default Codec, must decompress a block of documents each time unless you get lucky and subsequent ordinals happen to be in the same block). So, progress not perfection, but let's try to confirm with benchmarks that this change is indeed faster than the current stored-fields solution, once we get tests passing.
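The single-iterator idea can be sketched as follows. ForwardOnlyValues and resolveAll are hypothetical stand-ins (not the faceting API); they show why ordinals must be visited in sorted order when sharing one forward-only iterator:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class BulkOrdinalResolve {
    // Hypothetical stand-in for a forward-only doc-values iterator:
    // advanceExact may only be called with non-decreasing docIds.
    static class ForwardOnlyValues {
        private final Map<Integer, String> values;
        private int lastDoc = -1;
        private String current;

        ForwardOnlyValues(Map<Integer, String> values) {
            this.values = values;
        }

        boolean advanceExact(int docId) {
            if (docId < lastDoc) {
                throw new IllegalStateException("cannot advance backwards");
            }
            lastDoc = docId;
            current = values.get(docId);
            return current != null;
        }

        String binaryValue() {
            return current;
        }
    }

    // Sort the ordinals first so a single shared iterator suffices.
    static Map<Integer, String> resolveAll(int[] ordinals, ForwardOnlyValues dv) {
        int[] sorted = ordinals.clone();
        Arrays.sort(sorted);
        Map<Integer, String> labels = new HashMap<>();
        for (int ord : sorted) {
            if (dv.advanceExact(ord)) {
                labels.put(ord, dv.binaryValue());
            }
        }
        return labels;
    }
}
```

Resolving unsorted ordinals directly against one iterator would attempt to advance backwards; sorting first is what makes the shared-iterator approach legal.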

@gautamworah96 gautamworah96 marked this pull request as ready for review August 13, 2020 06:47
@mikemccand (Member)

Woohoo, tests all pass now? What a tiny change it turned out to be :)

Can you try to run luceneutil benchmarks? Let's see if this is net/net faster. Even if it is the same speed, we should move forward -- stored fields are likely to get more compressed / slower to access over time, e.g. https://issues.apache.org/jira/browse/LUCENE-9447.

We can also (separate follow-on issue!) better optimize the ord -> FacetLabel lookup to do them in bulk, in order, so we can share a single BinaryDocValues instance per leaf per query.

@mikemccand (Member) left a comment

This looks promising! It is close! Thanks @gautamworah96.

@mikemccand (Member)

This change looks good to me!

I think the biggest issue is what to do about backwards compatibility. Users who upgrade to this release will suddenly find that their taxonomy index has become unreadable.

We could 1) make this a Lucene 9.x only change. Normally, for Lucene's main index, the next major release should be able to read all stable releases from the previous major release. But for the taxonomy index, I suspect it is OK if we relax that and make a hard break to the index. There are very few (but non-zero) users of Lucene's faceting.

Or 2) we add a basic backwards compatibility support here, and then we can push this to 8.x stable releases. E.g. if we could differentiate when we are opening an already created (based on stored fields) taxonomy index, use the old way, but if we are making a new taxonomy index, use the new way. This would be pretty simple to build, I suspect. E.g. on opening the index, we could try to pull BinaryDocValues and if it exists, we know it's the new way, else, use the old way, unless the index is empty, in which case, use the new way?
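Option 2's open-time detection could be sketched roughly like this (detect and its inputs are hypothetical simplifications, not Lucene's API):

```java
import java.util.List;

public class FormatDetection {
    enum ReadPath { BINARY_DOC_VALUES, STORED_FIELDS }

    // Each boolean stands in for "this segment carries the new BinaryDocValues
    // field". An empty index defaults to the new format.
    static ReadPath detect(List<Boolean> segmentHasDocValues) {
        if (segmentHasDocValues.isEmpty()) {
            return ReadPath.BINARY_DOC_VALUES;
        }
        for (boolean hasDv : segmentHasDocValues) {
            if (hasDv) {
                return ReadPath.BINARY_DOC_VALUES;
            }
        }
        return ReadPath.STORED_FIELDS; // legacy 8.x stored-fields index
    }
}
```

As the rest of the conversation discusses, a whole-index decision like this is not sufficient once old and new segments get merged together, which is why a finer-grained fallback is needed.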

@mikemccand (Member)

I think the back-compat layer at read-time is a good start, but is not quite enough.

Imagine Susan. She upgrades to Lucene with this change, pushed as it is now. Susan runs some queries against her index, resolving ordinals using the stored fields, and all is good. Susan indexes some more documents with facet labels, and these new segments in the taxonomy index are written using BinaryDocValues. Susan refreshes and runs some queries, and some ordinals resolve the old way (if they came from old segments) and some the new way (if they came from new segments). Life goes on, more documents are indexed. Suddenly, Susan's taxonomy index executes a merge! Merging old and new segments together, the newly merged segment now holds a mix of docs: some that used stored fields and others that used BinaryDocValues. The back-compat logic will become confused and incorrectly try to use BinaryDocValues instead of stored fields, and I think that assert will trip for such documents?

Could you try to add a test case showing this case? Have a look at Lucene's TestBackwardsCompatibility -- it tests the main index, but you can borrow the ideas (e.g. APIs to zip/unzip) to implement a new unit test confirming we are maintaining back-compat for taxonomy index?

@gautamworah96 gautamworah96 marked this pull request as draft September 7, 2020 20:50
@goankur (Contributor) commented Sep 21, 2020

Thanks @gautamworah96 for this impactful change and @mikemccand for reviewing it.
A few thoughts

  1. This change disables the STORED fields part but keeps the POSTINGS part here:
    fullPathField = new StringField(Consts.FULL, "", Field.Store.NO);
    This is unnecessary, as postings are already enabled for facet labels in FacetsConfig#L364-L399, including dimension drill-down. So I propose we get rid of the fullPathField altogether.

  2. For maintaining backwards compatibility, we can read facet labels from the new BinaryDocValues field, falling back to the old StoredField if the BinaryDocValues field does not exist or has no value for the docId. The performance penalty of doing so should be acceptable. Alternatively, we can implement a special merge policy that moves data from the old stored field to the BinaryDocValues field at merge time, but that might be tricky to implement.

@mikemccand (Member)

So I propose we get rid of the fullPathField altogether.

Wow, +1 -- this looks like it is (pre-existingly?) double-indexed. Maybe we should do this as a separate precursor PR to this one (switch to StoredField when indexing the fullPathField)?

For maintaining backwards compatibility, we can read facet labels from new BinaryDocValues field, falling back to old StoredField if BinaryDocValues field does not exist or has no value for the docId. The performance penalty of doing so should be acceptable.

Yeah, +1 to trying BinaryDocValues first on a hit-by-hit basis and then falling back to the StoredField. This is the cost of backwards compatibility ... though for a fully new (all BinaryDocValues) index, the performance should be fine. Also, note that in Lucene 10.x we can remove that back-compat fallback.
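The per-hit fallback could look roughly like this; LabelSource and getPathLabel are hypothetical stand-ins for the two lookup paths, not the actual DirectoryTaxonomyReader code:

```java
import java.util.Optional;

public class BackCompatLookup {
    // Hypothetical abstraction over one way of resolving an ordinal to a label.
    interface LabelSource {
        Optional<String> lookup(int ordinal);
    }

    // Try the new BinaryDocValues path first; fall back to the legacy
    // stored-fields path when the doc-values field is missing or has no value.
    static String getPathLabel(int ordinal, LabelSource docValues, LabelSource storedFields) {
        return docValues.lookup(ordinal)
                .or(() -> storedFields.lookup(ordinal))
                .orElse(null);
    }
}
```

For a fully new index the stored-fields source is never consulted, so the fallback should cost little on the hot path, and in Lucene 10.x the second lookup path can simply be dropped.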

Alternatively we can implement a special merge policy that takes care of moving data from old Stored field to BinaryDocValues field at the time of merge but that might be tricky to implement.

I think this would indeed be tricky.

@ErickErickson commented Sep 22, 2020 via email

@gautamworah96 (Contributor, Author)

This new revision contains the following changes:

  1. Changed the DirectoryTaxonomyReader to decode values in getPath based on the boolean result from .advanceExact
  2. Added a new TestBackwardsCompatibility test file that reads from an older 8.6.3 stored-fields index and updates it with new fields.
  3. Added a zip file containing the old index.

Tests:
gradlew check and gradlew test pass.
Verified that the test for generating the old index works as expected with Gradle (added an Ignore tag to it so that it is not executed on normal runs).

Follow up issue:
The new test class introduced in this PR is similar to org.apache.lucene.index.TestBackwardsCompatibility. That class has some outdated comments that reference ant instead of gradle.
I've created a separate follow-on issue for this.

@gautamworah96 gautamworah96 marked this pull request as ready for review October 30, 2020 17:11
@mikemccand (Member)

Thanks @gautamworah96! So the new test failed with the previous revision, then you fixed the back compat and then the test now passes?

@mikemccand (Member) left a comment

Thanks @gautamworah96 -- I left a couple minor comments, but otherwise I think this is ready!

@gautamworah96 (Contributor, Author)

Thanks @gautamworah96! So the new test failed with the previous revision, then you fixed the back compat and then the test now passes?

Yes. The modified if condition in getPath does the trick.

@mikemccand (Member) left a comment

PR looks great! Thanks @gautamworah96 -- this is a nice performance gain for Lucene faceting. I will push the change soon to 9.0.

@mikemccand (Member) left a comment

Thanks @gautamworah96 -- this looks great! I'll push soon!

Tested the commit with the original Lucene master branch and it passes successfully. This test was failing initially without the dependency.
@gautamworah96 (Contributor, Author)

The earlier revision of this PR had backwards-compatibility test failures after merging because the Lucene codec had changed.
I've added a testCompile dependency on Lucene's 8.6.3 backward codecs.

There are minor changes in the versions.lock file to address the "found dependencies that were not in the lock state" error.
This file was autogenerated by running ./gradlew --write-locks.

The current merge failure is due to this versions.lock file change.

@mikemccand (Member) left a comment

Thanks @gautamworah96 -- the test works for me now! I'll push soon.

@mikemccand (Member) left a comment

Thanks for rebasing @gautamworah96!

@mikemccand mikemccand merged commit 3f8f84f into apache:master Nov 12, 2020
msfroh pushed a commit to msfroh/lucene-solr that referenced this pull request Nov 18, 2020
…ene's facet implementation, yielding ~4-5% red-line QPS gain in pure faceting benchmarks (apache#1733)
epugh pushed a commit to epugh/lucene-solr-1 that referenced this pull request Jan 15, 2021
…ene's facet implementation, yielding ~4-5% red-line QPS gain in pure faceting benchmarks (apache#1733)
gsmiller pushed a commit to gsmiller/lucene-solr that referenced this pull request Mar 17, 2021
This code change is a duplicate of the effort in the open source LUCENE-9450 as the open source PR change is not backwards compatible at the moment.

Github PR link: apache#1733 TEST: bb release SIM: https://issues.amazon.com/issues/LUCENE-3117