Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-9476 Add getBulkPath API for the Taxonomy index #2247

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

gautamworah96
Copy link
Contributor

Description

In LUCENE-9450 we switched the Taxonomy index from Stored Fields to BinaryDocValues. In the resulting implementation of the getPath code, we create a new BinaryDocValues's values instance for each ordinal.
It may happen that we may traverse over the same nodes over and over again if the getPath API is called multiple times for ordinals in the same segment/with the same readerIndex.

This PR takes advantage of that fact by sorting ordinals and then trying to find out if some of the ordinals are present in the same segment/have the same readerIndex (by trying to advanceExact to the correct position and not failing) thereby allowing us to reuse the previous BinaryDocValues object.

Solution

Steps:

  1. Sort all ordinals and remember their position so as to store the path in the correct position
  2. Try to advanceExact to the correct position with the previously calculated readerIndex. If the operation fails, try to find the correct segment for the ordinal and then advanceExact to the desired position.
  3. Store this position for future ordinals.

Tests

Added a new test for the API that compares the individual getPath results from ordinals with the bulk FacetLabels returned by the getBulkPath API

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the master branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Ref Guide (for Solr changes only).

@gautamworah96 gautamworah96 changed the title WIP: LUCENE-9476 Add basic functionality, basic tests WIP: LUCENE-9476 Add getBulkPath API for the Taxonomy index Jan 26, 2021
Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gautamworah96

Once we iterate to a solid PR I am very curious how this helps facets performance -- we can switch luceneutil over to this bulk API to test.

Copy link

@shaie shaie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am slowly getting back on the horse here 😄 , so this review focuses mainly on style ..

@gautamworah96 gautamworah96 changed the title WIP: LUCENE-9476 Add getBulkPath API for the Taxonomy index LUCENE-9476 Add getBulkPath API for the Taxonomy index Jan 29, 2021
Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gautamworah96 -- looking closer!

lucene/CHANGES.txt Outdated Show resolved Hide resolved
}

return ret;
}

private FacetLabel getPathFromCache(int ordinal) {
// TODO: can we use an int-based hash impl, such as IntToObjectMap,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oooh that is a great idea, and low-hanging fruit, and would greatly reduce the RAM usage for this cache.

I think DirectoryTaxonomyWriter also has such a cache that we could change to a native map.

Could you open a spinoff issue?

…. Use parallel sort to fix duplicate ordinal bug. Add a test case for it. Minor fixes
Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gautamworah96, looks close!

// this check is only needed once to confirm that the index uses BinaryDocValues
boolean success = values.advanceExact(ordinals[i] - leafReaderDocBase);
if (success == false) {
return getBulkPathForOlderIndexes(ordinals);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm confused -- wouldn't an older index have no BinaryDocValues field? So, values would be null, and we should fallback then?

This code should hit NullPointerException on an old index I think? How come our backwards compatibility test didn't expose this?


for (int i = 0; i < ordinalsLength; i++) {
synchronized (categoryCache) {
categoryCache.put(ordinals[i], bulkPath[originalPosition[i]]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will sometimes put ordinals back into the cache that were already there at the start of this method right? I guess that's harmless. Or, maybe we should move this up above? Then we can do it only for those ordinals that were not already cached?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think intuitively adding the ordinals back into the cache would not be a problem. This should also (theoretically) be faster than trying to get the lock again and again in a loop?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also (theoretically) be faster than trying to get the lock again and again in a loop?

Hmm, I'm confused: this code is already getting the lock inside a for loop? I guess we could move the synchronized outside of the for loop? Or, maybe javac is doing this for us already? But let's make it explicit, or, let's just merge this for loop with the one before (and keep acquiring the lock inside the for loop)? One big benefit of the latter approach is that if all of the ordinals were already cached (hopefully typically a common case), we do not need any locking, but with this approach, we still do.

@gautamworah96
Copy link
Contributor Author

Once we iterate to a solid PR I am very curious how this helps facets performance -- we can switch luceneutil over to this bulk API to test.

Today both IntTaxonomyFacets and FloatTaxonomyFacets iteratively call getPath on all the top ordinals and then return the topChildren FacetLabels that the user wanted. With this API change, we could switch over IntTaxonomyFacets and FloatTaxonomyFacets to use this bulk API. All downstream children Facets such as FastTaxonomyFacetCount use this base getTopChildren function so all users will be able to benefit from this change by default.

Surprisingly,
Our luceneutil benchmark only tests the getTopChildren API so we should be able to see the performance change with the stock luceneutil package.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants