Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Index arbitrary fields in taxonomy docs #12337

Merged
merged 7 commits into from
Feb 8, 2024

Conversation

stefanvodita
Copy link
Contributor

This addresses #12336 by letting users pass in an ordinal data appender to the taxonomy writer.
When the taxonomy writer indexes an ordinal data it will also add the custom fields that the user has requested.

Copy link
Contributor

@shaie shaie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a general comment about the usage of this feature. I don't think that passing such mapper to the constructor of TaxoWriter is the right way to go about it, since it implies one can know up front all the facet labels that will encountered during indexing and be able to map them to some Lucene field. Even if someone can know that up front (say when the population of values is known in advance), I don't think that's the way to do it.

On the other hand, once a facet was added, e.g. "Author/Bob", its value can't change (yet) and therefore specifying the payload per addDocument() call makes less sense since if I add two documents with the facet Author/Bob once with the payload value score=42L and then w/ score=68L, what should the code do?

But as I think about this feature and how do I see it mature over time, I DO think the payload should be given when ingesting the documents, and not from some global mapper. We can define the semantics of adding the different values as either hard-failing (a ValueForOrdinalAlreadyExistsException, or silently ignoring them. Conceptually it's the same with the mapper, e.g. someone might expect that if the mapping from Author/Bob -> 42L changed to Author/Bob -> 68L, then it would be reflected in the taxonomy, but since we've already added this label, we won't update it anymore (again, yet; I know the plan is to allow updating them).

Can we explore adding some arbitrary payload to FacetField? It doesn't have to be a strict value, we can add it as OrdinalPayloadProvider which will implement an API like addToDoc(Document doc) the payload fields.

If it's too difficult to add them to the current fields, we can create a new type of field for just that purpose, which will only be consumed by the TaxoWriter.

Directory taxoDir;
IndexReader taxoIndexReader;

private static final Map<String, Long> labelToScore;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you can use Map.of() to shorten the code

@shaie
Copy link
Contributor

shaie commented May 28, 2023

To add to the comment I wrote above, I was referring to something like AssociationFacetField which today allows indexing an arbitrary byte[] with ordinals. A similar field can be used to add other fields as well (numeric, string, stored, ...).

BTW, if all you're interested in at the moment is to associate some weight to a facet, perhaps the existing associations support will already provide what you need?

@stefanvodita
Copy link
Contributor Author

Thank you for the great feedback @shaie!

I’ve pushed a commit where I try to move the logic to new taxonomy reader/writer implementations. I’ve added an option to reindex the taxonomy using the new writer, hopefully it should make it clear that the user can’t change their custom ordinal data without reindexing.

We can also try what you were suggesting with payloads getting passed in for each document indexed in the main index if you still think making the oridnalDataAppender an attribute of the taxonomy writer is not the right approach.

Regarding association fields - we can use them to achieve the same behaviour, but we can also do it more efficiently. Association fields are per ordinal and per doc, but with this CR we’re getting fields that are per ordinal and across docs.

* This is like a {@link DirectoryTaxonomyReader}, except it provides access to the underlying
* {@link DirectoryReader} and full path field name.
*/
public class DirectoryTaxonomyIndexReader extends DirectoryTaxonomyReader {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've been testing similar approach locally and found that this class should override DirectoryTaxonomyReader .doOpenIfChanged method, otherwise after refresh SearcherAndTaxonomy.taxonomyReader becomes an instance of DirectoryTaxonomyReader again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great point @epotyom! I’ve thought a bit more about this and I’d like to consider exposing the IndexReader of DirectoryTaxonomyReader by making getInternalIndexReader public instead of protected. I actually like this solution better than what I coded up previously. It’s cleaner, it’s backwards compatible, and a user could have already gotten the IndexReader anyway by extending DirectoryTaxonomyReader. I’m curious if anyone has other ideas though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we open a separate PR to address this issue? It seems like a crab (a spinoff issue discovered while creating this PR that's fundamentally not otherwise related to this PR)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was more complicated in the first draft, but right now the only change is to make getInternalIndexReader public instead of protected. I think we have to have that in this PR, otherwise we’re offering a way to put data in the taxonomy, but no way to get it back out.

@stefanvodita
Copy link
Contributor Author

The commit I pushed makes DirectoryTaxonomyReader.getInternalIndexReader public. We also stop relying on the full path field. I’m not sure why I thought we needed it, we can use getPath/getBulkPath to get labels if we have the corresponding ordinal.

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Net/net I love this idea! It takes advantage of the unique approach (taxonomy sidecar index) that Lucene's taxonomy facets use to index normalized fields associated with taxonomy ordinal and not per document.

We will need to hash out / iterate on the exact API. Let's mark the new APIs @lucene.experimental so we reserve the right to break them on subsequent releases. API design is fundamentally hard and requires many real applications building on them before they iterate to something clean.

@@ -877,6 +886,26 @@ public synchronized void replaceTaxonomy(Directory taxoDir) throws IOException {
++indexEpoch;
}

/** Delete all taxonomy data and start over. */
public synchronized void reset() throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe rename to deleteAll (matching IndexWriter.deleteAll).

And could you add some javadoc warnings / when this might be used? If you fully delete your taxo index, you must also reindex any documents index referencing these same facets? Or, you must carefully regenerate the ordinals in the right order?

And perhaps it should be package private, if the intention is to only use it via ReindexingEnrichedDirectoryTaxonomyWriter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestions! I’ve done this in the new commit.

* the ordinal documents in the taxonomy. To update the custom data added to the docs, it is
* required to {@link #reindexWithNewOrdinalData(BiConsumer)}.
*/
public class ReindexingEnrichedDirectoryTaxonomyWriter extends DirectoryTaxonomyWriter {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that Lucene has updatable doc values, so in theory one could update/add a new doc values field into the taxonomy index without shifting ordinals. This might be a nice path to update fields associated with ordinals?

@mikemccand
Copy link
Member

But as I think about this feature and how do I see it mature over time, I DO think the payload should be given when ingesting the documents

Hmm -- I don't think that's great because we are forcing denormalization onto the user? This is fundamentally nicely normalized content (values per FacetLabel not Document), so we really should enable indexing it in a denormalized manner. If the user really wants to (inefficiently) denormalize they can use AssociationFacets already?

On the other hand, once a facet was added, e.g. "Author/Bob", its value can't change (yet)

Actually, I remember someone 😉 adding this nice feature long ago to Lucene to be able to update doc values in-place -- it seems like that could be a great mechanism for updating these denormalized FacetLabel values. But we should do that as a followon issue -- let's leave this PR to focus on getting these initial values into the taxo index.

We could also (later, separate PR) consider more radical changes to how the taxo index is stored to make it more efficient to update FacetLabel values.

@stefanvodita
Copy link
Contributor Author

Thank you for the review @mikemccand! I’ve integrated your feedback. Updatable doc values are definitely something to consider.
For comparison, I coded up an association facet field that is constant for a facet label across document. I think it helps highlight the advantages of the enriched taxonomy solution.

Copy link

github-actions bot commented Jan 8, 2024

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Jan 8, 2024
Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @stefanvodita -- this is a nice feature, giving apps some extensibility to store additional metadata onto each facet label.

@github-actions github-actions bot removed the Stale label Feb 6, 2024
@stefanvodita
Copy link
Contributor Author

Thank you for reviving the PR, Mike; it had been sitting around for a good while. I’ll leave it up for a few more days to see if there are other comments and merge if there aren’t.

@stefanvodita stefanvodita merged commit f339e24 into apache:main Feb 8, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants