-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Index arbitrary fields in taxonomy docs #12337
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a general comment about the usage of this feature. I don't think that passing such mapper
to the constructor of TaxoWriter is the right way to go about it, since it implies one can know up front all the facet labels that will encountered during indexing and be able to map them to some Lucene field. Even if someone can know that up front (say when the population of values is known in advance), I don't think that's the way to do it.
On the other hand, once a facet was added, e.g. "Author/Bob", its value can't change (yet) and therefore specifying the payload per addDocument()
call makes less sense since if I add two documents with the facet Author/Bob
once with the payload value score=42L
and then w/ score=68L
, what should the code do?
But as I think about this feature and how do I see it mature over time, I DO think the payload should be given when ingesting the documents, and not from some global mapper. We can define the semantics of adding the different values as either hard-failing (a ValueForOrdinalAlreadyExistsException
, or silently ignoring them. Conceptually it's the same with the mapper, e.g. someone might expect that if the mapping from Author/Bob -> 42L
changed to Author/Bob -> 68L
, then it would be reflected in the taxonomy, but since we've already added this label, we won't update it anymore (again, yet; I know the plan is to allow updating them).
Can we explore adding some arbitrary payload
to FacetField
? It doesn't have to be a strict value, we can add it as OrdinalPayloadProvider
which will implement an API like addToDoc(Document doc)
the payload fields.
If it's too difficult to add them to the current fields, we can create a new type of field for just that purpose, which will only be consumed by the TaxoWriter.
Directory taxoDir; | ||
IndexReader taxoIndexReader; | ||
|
||
private static final Map<String, Long> labelToScore; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: you can use Map.of()
to shorten the code
To add to the comment I wrote above, I was referring to something like BTW, if all you're interested in at the moment is to associate some weight to a facet, perhaps the existing associations support will already provide what you need? |
Thank you for the great feedback @shaie! I’ve pushed a commit where I try to move the logic to new taxonomy reader/writer implementations. I’ve added an option to reindex the taxonomy using the new writer, hopefully it should make it clear that the user can’t change their custom ordinal data without reindexing. We can also try what you were suggesting with payloads getting passed in for each document indexed in the main index if you still think making the Regarding association fields - we can use them to achieve the same behaviour, but we can also do it more efficiently. Association fields are per ordinal and per doc, but with this CR we’re getting fields that are per ordinal and across docs. |
* This is like a {@link DirectoryTaxonomyReader}, except it provides access to the underlying | ||
* {@link DirectoryReader} and full path field name. | ||
*/ | ||
public class DirectoryTaxonomyIndexReader extends DirectoryTaxonomyReader { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've been testing similar approach locally and found that this class should override DirectoryTaxonomyReader .doOpenIfChanged
method, otherwise after refresh SearcherAndTaxonomy.taxonomyReader
becomes an instance of DirectoryTaxonomyReader
again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great point @epotyom! I’ve thought a bit more about this and I’d like to consider exposing the IndexReader
of DirectoryTaxonomyReader
by making getInternalIndexReader
public instead of protected. I actually like this solution better than what I coded up previously. It’s cleaner, it’s backwards compatible, and a user could have already gotten the IndexReader
anyway by extending DirectoryTaxonomyReader
. I’m curious if anyone has other ideas though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we open a separate PR to address this issue? It seems like a crab (a spinoff issue discovered while creating this PR that's fundamentally not otherwise related to this PR)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was more complicated in the first draft, but right now the only change is to make getInternalIndexReader
public instead of protected. I think we have to have that in this PR, otherwise we’re offering a way to put data in the taxonomy, but no way to get it back out.
The commit I pushed makes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Net/net I love this idea! It takes advantage of the unique approach (taxonomy sidecar index) that Lucene's taxonomy facets use to index normalized fields associated with taxonomy ordinal and not per document.
We will need to hash out / iterate on the exact API. Let's mark the new APIs @lucene.experimental
so we reserve the right to break them on subsequent releases. API design is fundamentally hard and requires many real applications building on them before they iterate to something clean.
@@ -877,6 +886,26 @@ public synchronized void replaceTaxonomy(Directory taxoDir) throws IOException { | |||
++indexEpoch; | |||
} | |||
|
|||
/** Delete all taxonomy data and start over. */ | |||
public synchronized void reset() throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe rename to deleteAll
(matching IndexWriter.deleteAll
).
And could you add some javadoc warnings / when this might be used? If you fully delete your taxo index, you must also reindex any documents index referencing these same facets? Or, you must carefully regenerate the ordinals in the right order?
And perhaps it should be package private, if the intention is to only use it via ReindexingEnrichedDirectoryTaxonomyWriter
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestions! I’ve done this in the new commit.
* the ordinal documents in the taxonomy. To update the custom data added to the docs, it is | ||
* required to {@link #reindexWithNewOrdinalData(BiConsumer)}. | ||
*/ | ||
public class ReindexingEnrichedDirectoryTaxonomyWriter extends DirectoryTaxonomyWriter { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that Lucene has updatable doc values, so in theory one could update/add a new doc values field into the taxonomy index without shifting ordinals. This might be a nice path to update fields associated with ordinals?
Hmm -- I don't think that's great because we are forcing denormalization onto the user? This is fundamentally nicely normalized content (values per
Actually, I remember someone 😉 adding this nice feature long ago to Lucene to be able to update doc values in-place -- it seems like that could be a great mechanism for updating these denormalized We could also (later, separate PR) consider more radical changes to how the taxo index is stored to make it more efficient to update |
Thank you for the review @mikemccand! I’ve integrated your feedback. Updatable doc values are definitely something to consider. |
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @stefanvodita -- this is a nice feature, giving apps some extensibility to store additional metadata onto each facet label.
Thank you for reviving the PR, Mike; it had been sitting around for a good while. I’ll leave it up for a few more days to see if there are other comments and merge if there aren’t. |
b28ab38
to
81bc10b
Compare
This addresses #12336 by letting users pass in an ordinal data appender to the taxonomy writer.
When the taxonomy writer indexes an ordinal data it will also add the custom fields that the user has requested.