Index arbitrary fields in taxonomy docs #12337

stefanvodita · 2023-05-27T12:50:03Z

This addresses #12336 by letting users pass an ordinal data appender to the taxonomy writer.
When the taxonomy writer indexes an ordinal, it also adds the custom fields that the user has requested.

shaie

This is a general comment about the usage of this feature. I don't think that passing such a mapper to the constructor of TaxoWriter is the right way to go about it, since it implies one can know up front all the facet labels that will be encountered during indexing and be able to map them to some Lucene field. Even if someone can know that up front (say, when the population of values is known in advance), I don't think that's the way to do it.

On the other hand, once a facet has been added, e.g. "Author/Bob", its value can't change (yet), so specifying the payload per addDocument() call makes less sense: if I add two documents with the facet Author/Bob, once with the payload value score=42L and then with score=68L, what should the code do?

But as I think about this feature and how I see it maturing over time, I DO think the payload should be given when ingesting the documents, and not from some global mapper. We can define the semantics of adding different values as either hard-failing (a ValueForOrdinalAlreadyExistsException) or silently ignoring them. Conceptually it's the same with the mapper: someone might expect that if the mapping changed from Author/Bob -> 42L to Author/Bob -> 68L, it would be reflected in the taxonomy, but since we've already added this label, we won't update it anymore (again, yet; I know the plan is to allow updating them).
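
To make the two conflict semantics concrete, here is a minimal plain-Java sketch. It does not use any Lucene API: `LabelPayloads` and `OnConflict` are hypothetical names invented for illustration, and `ValueForOrdinalAlreadyExistsException` is only the name floated in the comment above, not an existing class.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical exception, named after the suggestion in this thread; not part of Lucene. */
class ValueForOrdinalAlreadyExistsException extends RuntimeException {
  ValueForOrdinalAlreadyExistsException(String label, long existing, long attempted) {
    super(label + " already has value " + existing + "; refusing " + attempted);
  }
}

/** Toy stand-in for per-label payload storage (assumption: one long value per label). */
class LabelPayloads {
  enum OnConflict { FAIL, IGNORE }

  private final Map<String, Long> values = new HashMap<>();
  private final OnConflict policy;

  LabelPayloads(OnConflict policy) {
    this.policy = policy;
  }

  /** Stores the payload the first time a label is seen; a later, different value follows the policy. */
  void add(String label, long value) {
    Long existing = values.putIfAbsent(label, value);
    if (existing != null && existing != value && policy == OnConflict.FAIL) {
      throw new ValueForOrdinalAlreadyExistsException(label, existing, value);
    }
    // Under IGNORE (or when the value is identical) the first value stays in place.
  }

  /** Returns the stored payload; callers are expected to ask only for labels they added. */
  long get(String label) {
    return values.get(label);
  }
}
```

Under either policy the first value wins, which matches the "can't change (yet)" constraint; the policies differ only in whether the caller is told about the conflict.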

Can we explore adding some arbitrary payload to FacetField? It doesn't have to be a strict value; we can add it as an OrdinalPayloadProvider which will implement an API like addToDoc(Document doc) to add the payload fields.

If it's too difficult to add them to the current fields, we can create a new type of field for just that purpose, which will only be consumed by the TaxoWriter.

shaie · 2023-05-28T05:18:08Z

lucene/facet/src/test/org/apache/lucene/facet/taxonomy/TestOrdinalData.java

+  Directory taxoDir;
+  IndexReader taxoIndexReader;
+
+  private static final Map<String, Long> labelToScore;


nit: you can use Map.of() to shorten the code
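
A short sketch of what the `Map.of()` suggestion could look like. The labels and scores below are made up for illustration, since the test's actual data is not shown in this thread, and `LabelScores` is a hypothetical holder class.

```java
import java.util.Map;

class LabelScores {
  // Hypothetical entries; Map.of (Java 9+) builds an immutable map in a single expression,
  // replacing a static initializer block that fills a HashMap.
  static final Map<String, Long> labelToScore =
      Map.of(
          "Author/Bob", 42L,
          "Author/Lisa", 68L);
}
```

`Map.of` accepts up to ten key/value pairs (use `Map.ofEntries` beyond that), rejects null keys and values, and throws on duplicate keys.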

shaie · 2023-05-28T05:56:03Z

To add to the comment I wrote above, I was referring to something like AssociationFacetField which today allows indexing an arbitrary byte[] with ordinals. A similar field can be used to add other fields as well (numeric, string, stored, ...).

BTW, if all you're interested in at the moment is to associate some weight to a facet, perhaps the existing associations support will already provide what you need?

stefanvodita · 2023-06-03T20:33:26Z

Thank you for the great feedback @shaie!

I’ve pushed a commit where I try to move the logic to new taxonomy reader/writer implementations. I’ve added an option to reindex the taxonomy using the new writer; hopefully this makes it clear that the user can’t change their custom ordinal data without reindexing.

We can also try what you were suggesting with payloads getting passed in for each document indexed in the main index if you still think making the ordinalDataAppender an attribute of the taxonomy writer is not the right approach.

Regarding association fields - we can use them to achieve the same behaviour, but we can also do it more efficiently. Association fields are per ordinal and per doc, but with this CR we’re getting fields that are per ordinal and across docs.
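
The efficiency argument can be illustrated with a toy count of stored values. This is a plain-Java sketch under simplifying assumptions (each label appears at most once per document, one value per label); `StorageComparison` and its methods are hypothetical and do not model Lucene's actual encoding.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StorageComparison {
  /** Values stored when every (document, label) pair carries its own copy, as with association fields. */
  static int perDocumentValues(List<List<String>> labelsPerDoc) {
    int count = 0;
    for (List<String> labels : labelsPerDoc) {
      count += labels.size();
    }
    return count;
  }

  /** Values stored when each distinct label carries a single copy, as with a field on the taxonomy doc. */
  static int perOrdinalValues(List<List<String>> labelsPerDoc) {
    Set<String> distinct = new HashSet<>();
    for (List<String> labels : labelsPerDoc) {
      distinct.addAll(labels);
    }
    return distinct.size();
  }
}
```

For three documents tagged `[Author/Bob]`, `[Author/Bob]`, and `[Author/Bob, Author/Lisa]`, the per-document approach stores four values while the per-ordinal approach stores two.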

epotyom · 2023-08-14T16:26:29Z

.../facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyIndexReader.java

+ * This is like a {@link DirectoryTaxonomyReader}, except it provides access to the underlying
+ * {@link DirectoryReader} and full path field name.
+ */
+public class DirectoryTaxonomyIndexReader extends DirectoryTaxonomyReader {


We've been testing a similar approach locally and found that this class should override the DirectoryTaxonomyReader.doOpenIfChanged method; otherwise, after a refresh, SearcherAndTaxonomy.taxonomyReader becomes an instance of DirectoryTaxonomyReader again.
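
The pitfall described above can be reproduced with a toy class hierarchy. This is a simplified plain-Java model, not the Lucene classes: `BaseReader`, `ForgetfulReader`, and `CarefulReader` are hypothetical, and the refresh method here always returns a new instance rather than checking whether anything changed.

```java
/** Toy stand-in for a reader whose refresh method constructs a new instance. */
class BaseReader {
  /** The base class only knows how to build itself, so this returns the base type. */
  BaseReader doOpenIfChanged() {
    return new BaseReader();
  }
}

/** Subclass that does not override: a refresh silently drops back to the base type. */
class ForgetfulReader extends BaseReader {}

/** Subclass that overrides with a covariant return type, so a refresh keeps the subclass. */
class CarefulReader extends BaseReader {
  @Override
  CarefulReader doOpenIfChanged() {
    return new CarefulReader();
  }
}
```

Any extra accessors that exist only on the subclass are lost after a refresh unless the override is in place, which is the behaviour reported for `SearcherAndTaxonomy.taxonomyReader`.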

stefanvodita · 2023-08-18

Great point @epotyom! I’ve thought a bit more about this and I’d like to consider exposing the IndexReader of DirectoryTaxonomyReader by making getInternalIndexReader public instead of protected. I actually like this solution better than what I coded up previously. It’s cleaner, it’s backwards compatible, and a user could have already gotten the IndexReader anyway by extending DirectoryTaxonomyReader. I’m curious if anyone has other ideas though.

mikemccand · 2023-09-11

Can we open a separate PR to address this issue? It seems like a crab (a spinoff issue discovered while creating this PR that's fundamentally not otherwise related to this PR)?

stefanvodita · 2023-09-12

This was more complicated in the first draft, but right now the only change is to make getInternalIndexReader public instead of protected. I think we have to have that in this PR, otherwise we’re offering a way to put data in the taxonomy, but no way to get it back out.

stefanvodita · 2023-08-21T18:33:10Z

The commit I pushed makes DirectoryTaxonomyReader.getInternalIndexReader public. We also stop relying on the full path field. I’m not sure why I thought we needed it; we can use getPath/getBulkPath to get labels if we have the corresponding ordinals.

mikemccand

Net/net I love this idea! It takes advantage of the unique approach (taxonomy sidecar index) that Lucene's taxonomy facets use to index normalized fields associated with each taxonomy ordinal rather than with each document.

We will need to hash out / iterate on the exact API. Let's mark the new APIs @lucene.experimental so we reserve the right to break them on subsequent releases. API design is fundamentally hard and requires many real applications building on them before they iterate to something clean.

mikemccand · 2023-09-11T16:29:51Z

lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java

@@ -877,6 +886,26 @@ public synchronized void replaceTaxonomy(Directory taxoDir) throws IOException {
    ++indexEpoch;
  }

+  /** Delete all taxonomy data and start over. */
+  public synchronized void reset() throws IOException {


Maybe rename to deleteAll (matching IndexWriter.deleteAll).

And could you add some javadoc warnings about when this might be used? If you fully delete your taxo index, you must also reindex any documents referencing these same facets? Or, you must carefully regenerate the ordinals in the right order?

And perhaps it should be package private, if the intention is to only use it via ReindexingEnrichedDirectoryTaxonomyWriter?
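
The ordinal-ordering concern can be seen in a toy model where ordinals are handed out in insertion order. `ToyTaxonomy` is a hypothetical plain-Java class, not the Lucene writer; the only Lucene fact it borrows is that ordinal 0 is reserved for the root.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy taxonomy that assigns ordinals in the order labels are first added. */
class ToyTaxonomy {
  private final Map<String, Integer> ordinals = new HashMap<>();

  /** Returns the existing ordinal for a label, or assigns the next free one. */
  int addCategory(String label) {
    Integer existing = ordinals.get(label);
    if (existing != null) {
      return existing;
    }
    int ordinal = ordinals.size() + 1; // ordinal 0 is left for the root
    ordinals.put(label, ordinal);
    return ordinal;
  }

  /** Drops every label, so ordinal assignment starts over. */
  void deleteAll() {
    ordinals.clear();
  }
}
```

If `Author/Bob` and then `Author/Lisa` are added, Lisa gets ordinal 2. After `deleteAll()`, adding `Author/Lisa` first gives it ordinal 1, so any main-index document that still refers to ordinal 2 no longer points at the right label unless it is reindexed or the labels are re-added in the original order.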

stefanvodita · 2023-09-12

Good suggestions! I’ve done this in the new commit.

mikemccand · 2023-09-11T16:30:38Z

...va/org/apache/lucene/facet/taxonomy/directory/ReindexingEnrichedDirectoryTaxonomyWriter.java

+ * the ordinal documents in the taxonomy. To update the custom data added to the docs, it is
+ * required to {@link #reindexWithNewOrdinalData(BiConsumer)}.
+ */
+public class ReindexingEnrichedDirectoryTaxonomyWriter extends DirectoryTaxonomyWriter {


Note that Lucene has updatable doc values, so in theory one could update/add a new doc values field into the taxonomy index without shifting ordinals. This might be a nice path to update fields associated with ordinals?

mikemccand · 2023-09-11T16:39:41Z

> But as I think about this feature and how I see it maturing over time, I DO think the payload should be given when ingesting the documents

Hmm -- I don't think that's great because we are forcing denormalization onto the user? This is fundamentally nicely normalized content (values per FacetLabel, not per Document), so we really should enable indexing it in a normalized manner. If the user really wants to (inefficiently) denormalize, they can already use AssociationFacets?

> On the other hand, once a facet was added, e.g. "Author/Bob", its value can't change (yet)

Actually, I remember someone 😉 adding this nice feature long ago to Lucene to be able to update doc values in-place -- it seems like that could be a great mechanism for updating these normalized FacetLabel values. But we should do that as a follow-on issue -- let's leave this PR to focus on getting these initial values into the taxo index.

We could also (later, separate PR) consider more radical changes to how the taxo index is stored to make it more efficient to update FacetLabel values.

stefanvodita · 2023-09-12T11:04:05Z

Thank you for the review @mikemccand! I’ve integrated your feedback. Updatable doc values are definitely something to consider.
For comparison, I coded up an association facet field that is constant for a facet label across documents. I think it helps highlight the advantages of the enriched taxonomy solution.

github-actions · 2024-01-08T12:24:05Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

mikemccand

Thanks @stefanvodita -- this is a nice feature, giving apps some extensibility to attach additional metadata to each facet label.

stefanvodita · 2024-02-06T18:07:39Z

Thank you for reviving the PR, Mike; it had been sitting around for a good while. I’ll leave it up for a few more days to see if there are other comments and merge if there aren’t.

shaie reviewed May 28, 2023

epotyom reviewed Aug 14, 2023

mikemccand reviewed Sep 11, 2023

Shradha26 mentioned this pull request Sep 13, 2023

[DISCUSS] Identifying Gaps in Lucene’s Faceting #12553

Open

github-actions bot added the Stale label Jan 8, 2024

mikemccand approved these changes Feb 5, 2024

github-actions bot removed the Stale label Feb 6, 2024

stefanvodita added 6 commits February 8, 2024 09:27

Index arbitrary fields in taxonomy docs

fe65354

Use new taxonomy implementations for custom ordinal data

1059645

Simplify access to taxo reader internals

955533d

Tidy

88542a9

Tune Javadoc and access modifiers

1f661bf

Add CHANGES

81bc10b

stefanvodita force-pushed the taxo-data branch from b28ab38 to 81bc10b February 8, 2024 09:34

Tidy again

55399b8

stefanvodita merged commit f339e24 into apache:main Feb 8, 2024
4 checks passed

stefanvodita added a commit that referenced this pull request Feb 8, 2024

Index arbitrary fields in taxonomy docs (#12337)

2d713d9

stefanvodita mentioned this pull request Mar 8, 2024

Demo indexing custom ordinal data in the taxonomy #13166

Open
