Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-9999: CombinedFieldQuery can fail with an exception when document is missing fields #185

Merged
merged 12 commits into from
Jul 17, 2021

Conversation

jimczi
Copy link
Contributor

@jimczi jimczi commented Jun 15, 2021

This change fixes a bug in MultiNormsLeafSimScorer that assumes that each
field should have a norm for every term/document.

As part of the fix, it adds validation that the fields have consistent norms
settings.

@jimczi jimczi added the bug label Jun 15, 2021
@jimczi jimczi changed the title CombinedFieldQuery can fail with an exception when document is missing fields LUCENE-9999: CombinedFieldQuery can fail with an exception when document is missing fields Jun 15, 2021
Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM besides a small comment I left.

if (norms != null) {
boolean found = norms.advanceExact(doc);
assert found;
if (norms != null && norms.advanceExact(doc)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that the old logic was correct?

The only case I can think of when it might fail is when some fields are indexed with norms and other fields without norms, but I'm not sure we should even try to support this case with CombinedFieldQuery.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We allow a StringField and a TextField to be used at the moment. We could disallow that upfront (all fields or none have norms enabled) because the old logic would fail with weird exceptions without this change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 from me to disallow it upfront, the combined score won't be meaningful if there is a mix of norms being disabled/ enabled.

Copy link
Member

@jtibshirani jtibshirani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jimczi for fixing this!

if (norms != null) {
boolean found = norms.advanceExact(doc);
assert found;
if (norms != null && norms.advanceExact(doc)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 from me to disallow it upfront, the combined score won't be meaningful if there is a mix of norms being disabled/ enabled.

@@ -85,7 +86,7 @@ public void testToString() {
assertEquals("CombinedFieldQuery((foo title^3.0)(bar baz))", builder.build().toString());
}

public void testSameScore() throws IOException {
public void testSparseFieldSameScore() throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little unclear to me what we aim to test with the 'same score' checks. The testCopyField test seems to best capture the core behavior -- we could have a version of testCopyField where one field is sparse?

@jtibshirani
Copy link
Member

@jimczi asked if I could finish the PR -- I pushed some changes:

  • Add a check that either all fields or no fields have norms enabled
  • Rework the testing strategy

Let me know if it still looks okay or if you have other comments.

Copy link
Contributor Author

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jtibshirani!

throw new IllegalArgumentException(
getClass().getSimpleName()
+ " requires norms to be consistent across fields: some fields cannot"
+ " have norms enabled, while others have norms disabled");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's kind of an edge case but should we also throw the exception if scoring is not needed ?

Copy link
Member

@jtibshirani jtibshirani Jun 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Properly throwing an error would require a more invasive change, since we rewrite the query very early if scores aren't needed:

} else {
// rewrite to a simple disjunction if the score is not needed.
Query bq = rewriteToBoolean();
return searcher.rewrite(bq).createWeight(searcher, ScoreMode.COMPLETE_NO_SCORES, boost);
However I could refactor this part in MultiNormsLeafSimScorer just to be clearer/ more solid?

This also made me notice -- I'm using the fact that LeafReader#getNormsValues is null to detect that norms were disabled. This seems accurate but I couldn't confirm it's really part of the method contract. I wonder if checking field infos is better.

Copy link
Member

@jtibshirani jtibshirani Jul 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing more research, I don't think it's possible to detect if norms are disabled upfront: LeafReader#getNormsValues could be null if a field uses norms but happens to be missing from a segment. Using field infos would have the same problem.

I opted to just document this requirement (as we do for the fact norms must be additive and encoded in a certain way). If the calling application has its own notion of schema, it can enforce the norms requirements upfront.

cbuescher pushed a commit to cbuescher/elasticsearch that referenced this pull request Jun 29, 2021
This commit moves a fix for a bug in MultiNormsLeafSimScorer in Lucene that
fixes an error in the new CombinedFieldQuery for missing values. Its based on
a PR for LUCENE-9999 (apache/lucene#185). This is a
temporary copy of the affected query and its updated dependencies that should be
removed again once we are able to use the original fix from Lucene.

Relastes to apache/lucene#185
cbuescher pushed a commit to elastic/elasticsearch that referenced this pull request Jun 29, 2021
This commit moves a fix for a bug in MultiNormsLeafSimScorer in Lucene that
fixes an error in the new CombinedFieldQuery for missing values. Its based on
a PR for LUCENE-9999 (apache/lucene#185). This is a
temporary copy of the affected query and its updated dependencies that should be
removed again once we are able to use the original fix from Lucene.

Relastes to apache/lucene#185
cbuescher pushed a commit to cbuescher/elasticsearch that referenced this pull request Jun 29, 2021
This commit moves a fix for a bug in MultiNormsLeafSimScorer in Lucene that
fixes an error in the new CombinedFieldQuery for missing values. Its based on
a PR for LUCENE-9999 (apache/lucene#185). This is a
temporary copy of the affected query and its updated dependencies that should be
removed again once we are able to use the original fix from Lucene.

Relastes to apache/lucene#185
cbuescher pushed a commit to elastic/elasticsearch that referenced this pull request Jun 30, 2021
This commit moves a fix for a bug in MultiNormsLeafSimScorer in Lucene that
fixes an error in the new CombinedFieldQuery for missing values. Its based on
a PR for LUCENE-9999 (apache/lucene#185). This is a
temporary copy of the affected query and its updated dependencies that should be
removed again once we are able to use the original fix from Lucene.

Relates to apache/lucene#185
jtibshirani added a commit to elastic/elasticsearch that referenced this pull request Jul 1, 2021
This commit moves a fix for a bug in MultiNormsLeafSimScorer in Lucene that
fixes an error in the new CombinedFieldQuery for missing values. Its based on
a PR for LUCENE-9999 (apache/lucene#185). This is a
temporary copy of the affected query and its updated dependencies that should be
removed again once we are able to use the original fix from Lucene.

Relastes to apache/lucene#185

Co-authored-by: Christoph Büscher <cbuescher@posteo.de>
@jtibshirani
Copy link
Member

@jpountz would you be able to take another look?

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe something we could check could be verifying that all fields that have a non-null FieldInfo have the same value for hasNorms on a per-segment basis. It wouldn't cover 100% of misuse, but probably most of it.

@jtibshirani
Copy link
Member

I tried out your idea. I opted to do it on a global basis in createWeight to ensure we always validate this, even when the query is rewritten.

I'm going to merge, but feel free to give post merge feedback if anything seems off.

@jtibshirani jtibshirani merged commit f333b70 into apache:main Jul 17, 2021
mikemccand pushed a commit to mikemccand/lucene that referenced this pull request Sep 3, 2021
…ent is missing fields (apache#185)

This change fixes a bug in `MultiNormsLeafSimScorer` that assumes that each
field should have a norm for every term/document.

As part of the fix, it adds validation that the fields have consistent norms
settings.
2lambda123 pushed a commit to 2lambda123/elastic-elasticsearch that referenced this pull request May 3, 2024
This commit moves a fix for a bug in MultiNormsLeafSimScorer in Lucene that
fixes an error in the new CombinedFieldQuery for missing values. Its based on
a PR for LUCENE-9999 (apache/lucene#185). This is a
temporary copy of the affected query and its updated dependencies that should be
removed again once we are able to use the original fix from Lucene.

Relastes to apache/lucene#185

Co-authored-by: Christoph Büscher <cbuescher@posteo.de>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants