LUCENE-9476 Add getBulkPath API for the Taxonomy index #2247

Open

gautamworah96 wants to merge 5 commits into apache:master from gautamworah96:LUCENE-9476

Contributor

gautamworah96 commented Jan 26, 2021

Description

In LUCENE-9450 we switched the Taxonomy index from Stored Fields to BinaryDocValues. In the resulting implementation of getPath, we create a new BinaryDocValues instance for each ordinal.
As a result, if the getPath API is called multiple times for ordinals in the same segment (that is, with the same readerIndex), we may traverse the same nodes over and over again.

This PR takes advantage of that fact by sorting the ordinals and then checking whether successive ordinals are in the same segment (have the same readerIndex). The check is done by trying to advanceExact to the target position; if that does not fail, we reuse the previous BinaryDocValues object.

Solution

Steps:

1. Sort all ordinals and remember their original positions, so that each path is stored at the correct position in the result.
2. Try to advanceExact to the target position using the previously calculated readerIndex. If the operation fails, find the correct segment for the ordinal and then advanceExact to the desired position.
3. Store this readerIndex for future ordinals.
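
The steps above can be sketched without any Lucene dependency. This is only a toy model under stated assumptions: `BulkPathSketch`, `docBases`, `labels`, and `segmentLookups` are hypothetical names, a plain `String[]` per segment stands in for BinaryDocValues, and a range check stands in for a successful advanceExact. It is not the code in this PR.

```java
import java.util.Arrays;
import java.util.Comparator;

/**
 * Toy model of the bulk lookup. Each "segment" owns a contiguous ordinal range
 * starting at docBases[s]; labels[s] stands in for that segment's BinaryDocValues.
 * All names here are hypothetical and not taken from the PR.
 */
final class BulkPathSketch {
  private final int[] docBases;    // ascending; docBases[0] is assumed to be 0
  private final String[][] labels; // labels[s][ordinal - docBases[s]]
  int segmentLookups = 0;          // how many times a segment had to be located

  BulkPathSketch(int[] docBases, String[][] labels) {
    this.docBases = docBases;
    this.labels = labels;
  }

  /** Returns labels in the caller's order; assumes every ordinal is valid. */
  String[] getBulkPath(int... ordinals) {
    // Step 1: sort positions by ordinal, so each result can be written back in place.
    Integer[] order = new Integer[ordinals.length];
    for (int i = 0; i < order.length; i++) {
      order[i] = i;
    }
    Arrays.sort(order, Comparator.comparingInt(i -> ordinals[i]));

    String[] result = new String[ordinals.length];
    int segment = -1; // Step 3: remembered across iterations
    for (int position : order) {
      int ordinal = ordinals[position];
      // Step 2: only locate a segment when the remembered one does not hold the ordinal.
      if (segment < 0 || !contains(segment, ordinal)) {
        segment = findSegment(ordinal);
        segmentLookups++;
      }
      result[position] = labels[segment][ordinal - docBases[segment]];
    }
    return result;
  }

  private boolean contains(int segment, int ordinal) {
    return ordinal >= docBases[segment]
        && ordinal < docBases[segment] + labels[segment].length;
  }

  private int findSegment(int ordinal) {
    int idx = Arrays.binarySearch(docBases, ordinal);
    // A miss returns -(insertionPoint) - 1; the owning segment is the one before it.
    return idx >= 0 ? idx : -idx - 2;
  }
}
```

Because the ordinals are visited in sorted order, each segment is located at most once per call, however the caller ordered its input.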

Tests

Added a new test for the API that compares the individual getPath results for each ordinal with the bulk FacetLabels returned by the getBulkPath API.

Checklist

Please review the following and check all that apply:

I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
I have created a Jira issue and added the issue ID to my pull request title.
I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
I have developed this patch against the master branch.
I have run ./gradlew check.
I have added tests for my changes.
I have added documentation for the Ref Guide (for Solr changes only).


          WIP: LUCENE-9476 Add basic functionality, basic tests

8a820f1

gautamworah96 changed the title ~~WIP: LUCENE-9476 Add basic functionality, basic tests~~ WIP: LUCENE-9476 Add getBulkPath API for the Taxonomy index


          Misc. style fixes

93bbe5b

mikemccand reviewed

Member

mikemccand left a comment

Thanks @gautamworah96

Once we iterate to a solid PR I am very curious how this helps facets performance -- we can switch luceneutil over to this bulk API to test.

shaie reviewed

shaie left a comment

I am slowly getting back on the horse here 😄 , so this review focuses mainly on style ..


          Fixed small bugs, improved style. Perf test remaining

fd73d7b

gautamworah96 changed the title ~~WIP: LUCENE-9476 Add getBulkPath API for the Taxonomy index~~ LUCENE-9476 Add getBulkPath API for the Taxonomy index

gautamworah96 requested a review from mikemccand

January 29, 2021 17:45

mikemccand reviewed

Member

mikemccand left a comment

Thanks @gautamworah96 -- looking closer!

lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java

                   }
                   return ret;
                 }
+                private FacetLabel getPathFromCache(int ordinal) {
+                  // TODO: can we use an int-based hash impl, such as IntToObjectMap,

Member

mikemccand Jan 29, 2021

Oooh that is a great idea, and low-hanging fruit, and would greatly reduce the RAM usage for this cache.

I think DirectoryTaxonomyWriter also has such a cache that we could change to a native map.

Could you open a spinoff issue?
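
To illustrate why an int-keyed map can use less RAM than a map with boxed `Integer` keys, here is a minimal open-addressing sketch. It is an assumption-laden illustration, not the IntToObjectMap named in the TODO and not an existing Lucene class: `IntObjectMapSketch` and all of its methods are hypothetical, it does not support null values or removal, and it has no LRU eviction.

```java
/**
 * Hypothetical int-to-object map using open addressing with linear probing.
 * Keys live in a primitive int[], so no Integer objects or entry nodes are allocated.
 */
final class IntObjectMapSketch<V> {
  private int[] keys;
  private Object[] values; // a null value marks an empty slot
  private int size;

  IntObjectMapSketch(int expectedSize) {
    int capacity = 16;
    while (capacity < expectedSize * 2) {
      capacity <<= 1; // keep capacity a power of two so we can mask instead of mod
    }
    keys = new int[capacity];
    values = new Object[capacity];
  }

  int size() {
    return size;
  }

  @SuppressWarnings("unchecked")
  V get(int key) {
    int mask = keys.length - 1;
    for (int i = slot(key, mask); values[i] != null; i = (i + 1) & mask) {
      if (keys[i] == key) {
        return (V) values[i];
      }
    }
    return null;
  }

  void put(int key, V value) {
    if (value == null) {
      throw new IllegalArgumentException("null values are not supported");
    }
    if ((size + 1) * 2 > keys.length) {
      grow(); // keep the load factor at or below 0.5 so probing always terminates
    }
    if (insert(keys, values, key, value)) {
      size++;
    }
  }

  /** Returns true if the key was not present before. */
  private static boolean insert(int[] ks, Object[] vs, int key, Object value) {
    int mask = ks.length - 1;
    int i = slot(key, mask);
    while (vs[i] != null && ks[i] != key) {
      i = (i + 1) & mask;
    }
    boolean isNew = vs[i] == null;
    ks[i] = key;
    vs[i] = value;
    return isNew;
  }

  private void grow() {
    int[] oldKeys = keys;
    Object[] oldValues = values;
    keys = new int[oldKeys.length * 2];
    values = new Object[oldValues.length * 2];
    for (int i = 0; i < oldKeys.length; i++) {
      if (oldValues[i] != null) {
        insert(keys, values, oldKeys[i], oldValues[i]);
      }
    }
  }

  private static int slot(int key, int mask) {
    int h = key * 0x9E3779B9; // spread sequential ordinals across the table
    return (h ^ (h >>> 16)) & mask;
  }
}
```

A real replacement for the reader's cache would also need the eviction behaviour of the existing cache, which this sketch leaves out.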

gautamworah96 added 2 commits

February 8, 2021 14:54


          Fixed a bug in multiple segments. The API now works for older indexes. Use parallel sort to fix duplicate ordinal bug. Add a test case for it. Minor fixes

f8425e4


          Style fix

0c53c3b

gautamworah96 requested a review from mikemccand

February 8, 2021 23:29

sonatype-lift bot reviewed

sonatype-lift bot reviewed

mikemccand reviewed

Member

mikemccand left a comment

Thanks @gautamworah96, looks close!

lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java

+                        // this check is only needed once to confirm that the index uses BinaryDocValues
+                        boolean success = values.advanceExact(ordinals[i] - leafReaderDocBase);
+                        if (success == false) {
+                          return getBulkPathForOlderIndexes(ordinals);

Member

mikemccand Feb 9, 2021

Hmm, I'm confused -- wouldn't an older index have no BinaryDocValues field? So, values would be null, and we should fall back then?

This code should hit NullPointerException on an old index I think? How come our backwards compatibility test didn't expose this?

lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java

+                  for (int i = 0; i < ordinalsLength; i++) {
+                    synchronized (categoryCache) {
+                      categoryCache.put(ordinals[i], bulkPath[originalPosition[i]]);

Member

mikemccand Feb 9, 2021

We will sometimes put ordinals back into the cache that were already there at the start of this method, right? I guess that's harmless. Or, maybe we should move this up above? Then we can do it only for those ordinals that were not already cached?

Contributor Author

gautamworah96 Feb 10, 2021

I think intuitively adding the ordinals back into the cache would not be a problem. This should also (theoretically) be faster than trying to get the lock again and again in a loop?

Member

mikemccand Feb 17, 2021

> This should also (theoretically) be faster than trying to get the lock again and again in a loop?

Hmm, I'm confused: this code is already getting the lock inside a for loop? I guess we could move the synchronized outside of the for loop? Or, maybe javac is doing this for us already? But let's make it explicit, or, let's just merge this for loop with the one before (and keep acquiring the lock inside the for loop)? One big benefit of the latter approach is that if all of the ordinals were already cached (hopefully typically a common case), we do not need any locking, but with this approach, we still do.
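
One possible shape for the write-back discussed here, as a sketch only: it hoists the synchronized block out of the loop and writes back only the ordinals that missed the cache, so the write lock is skipped entirely when everything was already cached. `CacheWriteBackSketch`, `loader`, and `writeLockAcquisitions` are hypothetical names, and a plain `HashMap` stands in for the reader's real cache; this is not the code in the PR.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntFunction;

/** Hypothetical sketch: cache only the misses, with one lock acquisition per phase. */
final class CacheWriteBackSketch {
  private final Map<Integer, String> categoryCache = new HashMap<>();
  private final IntFunction<String> loader; // stand-in for the index lookup
  int writeLockAcquisitions = 0;            // instrumentation for the example only

  CacheWriteBackSketch(IntFunction<String> loader) {
    this.loader = loader;
  }

  String[] getBulkPath(int... ordinals) {
    String[] result = new String[ordinals.length];
    List<Integer> missedPositions = new ArrayList<>();

    // Phase 1: read everything that is already cached under a single lock.
    synchronized (categoryCache) {
      for (int i = 0; i < ordinals.length; i++) {
        result[i] = categoryCache.get(ordinals[i]);
        if (result[i] == null) {
          missedPositions.add(i);
        }
      }
    }
    if (missedPositions.isEmpty()) {
      return result; // common case: no lookup and no write lock needed
    }

    // Phase 2: resolve the misses without holding the lock.
    for (int position : missedPositions) {
      result[position] = loader.apply(ordinals[position]);
    }

    // Phase 3: write back only the misses, acquiring the lock once.
    synchronized (categoryCache) {
      writeLockAcquisitions++;
      for (int position : missedPositions) {
        categoryCache.put(ordinals[position], result[position]);
      }
    }
    return result;
  }
}
```

The trade-off is an extra list of missed positions per call in exchange for never re-inserting entries that were already cached.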

Contributor Author

gautamworah96 commented Jun 9, 2021

> Once we iterate to a solid PR I am very curious how this helps facets performance -- we can switch luceneutil over to this bulk API to test.

Today both IntTaxonomyFacets and FloatTaxonomyFacets iteratively call getPath on all the top ordinals and then return the topChildren FacetLabels that the user wanted. With this API change, we could switch over IntTaxonomyFacets and FloatTaxonomyFacets to use this bulk API. All downstream children Facets such as FastTaxonomyFacetCount use this base getTopChildren function so all users will be able to benefit from this change by default.

Surprisingly, our luceneutil benchmark only tests the getTopChildren API, so we should be able to see the performance change with the stock luceneutil package.

gautamworah96 mentioned this pull request

LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader apache/lucene#179

Merged

6 tasks
