Index compatibility issue between Lucene 8 and Lucene 9 #1952

Closed
lintool opened this issue Aug 1, 2022 · 1 comment

lintool commented Aug 1, 2022

I encountered an issue with Lucene 9 reading indexes built by Lucene 8.
The exception is something along the lines of:

java.lang.IllegalStateException: unexpected docvalues type SORTED for field 'id' (expected=BINARY). Re-index with correct docvalues type.

The crux of the issue is the following:

In DefaultLuceneDocumentGenerator, we add the (external) docid as a DocValue:

    // Store the collection docid.
    document.add(new StringField(IndexArgs.ID, id, Field.Store.YES));
    // This is needed to break score ties by docid.
    document.add(new BinaryDocValuesField(IndexArgs.ID, new BytesRef(id)));

So that we can break ties by the docid, in SearchCollection we have a Sort:

  public static final Sort BREAK_SCORE_TIES_BY_DOCID =
      new Sort(SortField.FIELD_SCORE, new SortField(IndexArgs.ID, SortField.Type.STRING_VAL));

The reason we do this is to ensure consistent tie breaking, as outlined in this SIGIR 2019 paper.
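To illustrate why the secondary sort key matters, here is a minimal sketch in plain Java with no Lucene dependency. The Hit class and the rank method are hypothetical names made up for this example, not Anserini code. The comparator only mirrors the idea of sorting by score first and by docid second, so that tied documents come back in the same order no matter how they were collected.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class TieBreakSketch {
  // Hypothetical stand-in for a search hit: external docid plus score.
  static final class Hit {
    final String docid;
    final float score;

    Hit(String docid, float score) {
      this.docid = docid;
      this.score = score;
    }
  }

  // Higher score first; equal scores fall back to docid order (String.compareTo).
  static final Comparator<Hit> BREAK_SCORE_TIES_BY_DOCID = (a, b) -> {
    int byScore = Float.compare(b.score, a.score);
    return byScore != 0 ? byScore : a.docid.compareTo(b.docid);
  };

  // Returns docids in ranked order without modifying the input list.
  static List<String> rank(List<Hit> hits) {
    List<Hit> sorted = new ArrayList<>(hits);
    sorted.sort(BREAK_SCORE_TIES_BY_DOCID);
    List<String> docids = new ArrayList<>();
    for (Hit hit : sorted) {
      docids.add(hit.docid);
    }
    return docids;
  }

  public static void main(String[] args) {
    // The same three hits, collected in two different orders.
    List<Hit> first = Arrays.asList(
        new Hit("doc2", 2.0f), new Hit("doc1", 2.0f), new Hit("doc3", 1.0f));
    List<Hit> second = Arrays.asList(
        new Hit("doc3", 1.0f), new Hit("doc1", 2.0f), new Hit("doc2", 2.0f));
    System.out.println(rank(first));
    System.out.println(rank(second));
  }
}
```

Both calls print the same ranking, with doc1 ahead of doc2 even though their scores are tied. Without the docid key, the relative order of the two tied hits would depend on the input order.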

@tteofili indicated that this was a Lucene 8/Lucene 9 breaking change, due to this issue: fix SortedDocValues to no longer extend BinaryDocValues.

Reindexing with Lucene 9 fixes this issue.

Related, interesting tidbit from the SortField.Type.STRING_VAL javadoc:

Sort using term values as Strings, but comparing by value (using String.compareTo) for all comparisons. This is typically slower than STRING, which uses ordinals to do the sorting.
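
The difference between comparing by value and comparing by ordinal can be sketched without Lucene. This is only an illustration of the idea, where an ordinal is a term's position in the sorted set of distinct terms; it is not how Lucene stores or reads docvalues, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.TreeSet;

class OrdinalSortSketch {
  // Maps each value to its ordinal: its position in the sorted set of distinct values.
  static int[] toOrdinals(String[] values) {
    String[] dictionary = new TreeSet<String>(Arrays.asList(values)).toArray(new String[0]);
    int[] ordinals = new int[values.length];
    for (int i = 0; i < values.length; i++) {
      ordinals[i] = Arrays.binarySearch(dictionary, values[i]);
    }
    return ordinals;
  }

  // STRING_VAL-style comparison: look at the term values themselves.
  static int compareByValue(String[] values, int i, int j) {
    return values[i].compareTo(values[j]);
  }

  // STRING-style comparison: one integer comparison on precomputed ordinals.
  static int compareByOrdinal(int[] ordinals, int i, int j) {
    return Integer.compare(ordinals[i], ordinals[j]);
  }

  public static void main(String[] args) {
    String[] docids = {"doc2", "doc10", "doc1", "doc2"};
    int[] ordinals = toOrdinals(docids);
    System.out.println(Arrays.toString(ordinals)); // prints [2, 1, 0, 2]
  }
}
```

Both comparisons order any pair of entries the same way. The ordinal version pays for the string comparisons once, when the dictionary is built, and then compares integers.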

lintool added a commit that referenced this issue Aug 3, 2022
Addresses #1952 - add a flag -lucene8 that abandons consistent tie breaking,
so retrieval doesn't need to touch the docvalues. In the regression script, a 
similar option --lucene8 allows the score matching to be more lenient.

lintool commented Aug 3, 2022

Closed by #1953.

lintool closed this as completed Aug 3, 2022
lintool added a commit that referenced this issue Aug 17, 2022
+ Expose Lucene 8 backwards compatibility bindings in SimpleSearcher and SimpleImpactSearcher:
  Basically, if we detect Lucene 8 indexes, we disable consistent tie-breaking, which depends on docvalues; see #1952
+ General cleanup (fixed code formatting in SimpleImpactSearcher)
+ Remove main in SimpleSearcher
+ Change to Python method names (snake_case)