-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LUCENE-9999: CombinedFieldQuery can fail with an exception when document is missing fields #185
LUCENE-9999: CombinedFieldQuery can fail with an exception when document is missing fields #185
Conversation
…ent is missing some fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM besides a small comment I left.
if (norms != null) { | ||
boolean found = norms.advanceExact(doc); | ||
assert found; | ||
if (norms != null && norms.advanceExact(doc)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that the old logic was correct?
The only case I can think of when it might fail is when some fields are indexed with norms and other fields without norms, but I'm not sure we should even try to support this case with CombinedFieldQuery
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We allow a StringField and a TextField to be used at the moment. We could disallow that upfront (all fields or none have norms enabled) because the old logic would fail with weird exceptions without this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 from me to disallow it upfront, the combined score won't be meaningful if there is a mix of norms being disabled/ enabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @jimczi for fixing this!
if (norms != null) { | ||
boolean found = norms.advanceExact(doc); | ||
assert found; | ||
if (norms != null && norms.advanceExact(doc)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 from me to disallow it upfront, the combined score won't be meaningful if there is a mix of norms being disabled/ enabled.
@@ -85,7 +86,7 @@ public void testToString() { | |||
assertEquals("CombinedFieldQuery((foo title^3.0)(bar baz))", builder.build().toString()); | |||
} | |||
|
|||
public void testSameScore() throws IOException { | |||
public void testSparseFieldSameScore() throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a little unclear to me what we aim to test with the 'same score' checks. The testCopyField
test seems to best capture the core behavior -- we could have a version of testCopyField
where one field is sparse?
@jimczi asked if I could finish the PR -- I pushed some changes:
Let me know if it still looks okay or if you have other comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jtibshirani!
throw new IllegalArgumentException( | ||
getClass().getSimpleName() | ||
+ " requires norms to be consistent across fields: some fields cannot" | ||
+ " have norms enabled, while others have norms disabled"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's kind of an edge case but should we also throw the exception if scoring is not needed ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Properly throwing an error would require a more invasive change, since we rewrite the query very early if scores aren't needed:
lucene/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java
Lines 262 to 265 in 9942d59
} else { | |
// rewrite to a simple disjunction if the score is not needed. | |
Query bq = rewriteToBoolean(); | |
return searcher.rewrite(bq).createWeight(searcher, ScoreMode.COMPLETE_NO_SCORES, boost); |
MultiNormsLeafSimScorer
just to be clearer/ more solid?
This also made me notice -- I'm using the fact that LeafReader#getNormsValues
is null
to detect that norms were disabled. This seems accurate but I couldn't confirm it's really part of the method contract. I wonder if checking field infos is better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doing more research, I don't think it's possible to detect if norms are disabled upfront: LeafReader#getNormsValues
could be null
if a field uses norms but happens to be missing from a segment. Using field infos would have the same problem.
I opted to just document this requirement (as we do for the fact norms must be additive and encoded in a certain way). If the calling application has its own notion of schema, it can enforce the norms requirements upfront.
This commit moves a fix for a bug in MultiNormsLeafSimScorer in Lucene that fixes an error in the new CombinedFieldQuery for missing values. Its based on a PR for LUCENE-9999 (apache/lucene#185). This is a temporary copy of the affected query and its updated dependencies that should be removed again once we are able to use the original fix from Lucene. Relastes to apache/lucene#185
This commit moves a fix for a bug in MultiNormsLeafSimScorer in Lucene that fixes an error in the new CombinedFieldQuery for missing values. Its based on a PR for LUCENE-9999 (apache/lucene#185). This is a temporary copy of the affected query and its updated dependencies that should be removed again once we are able to use the original fix from Lucene. Relastes to apache/lucene#185
This commit moves a fix for a bug in MultiNormsLeafSimScorer in Lucene that fixes an error in the new CombinedFieldQuery for missing values. Its based on a PR for LUCENE-9999 (apache/lucene#185). This is a temporary copy of the affected query and its updated dependencies that should be removed again once we are able to use the original fix from Lucene. Relastes to apache/lucene#185
This commit moves a fix for a bug in MultiNormsLeafSimScorer in Lucene that fixes an error in the new CombinedFieldQuery for missing values. Its based on a PR for LUCENE-9999 (apache/lucene#185). This is a temporary copy of the affected query and its updated dependencies that should be removed again once we are able to use the original fix from Lucene. Relates to apache/lucene#185
This commit moves a fix for a bug in MultiNormsLeafSimScorer in Lucene that fixes an error in the new CombinedFieldQuery for missing values. Its based on a PR for LUCENE-9999 (apache/lucene#185). This is a temporary copy of the affected query and its updated dependencies that should be removed again once we are able to use the original fix from Lucene. Relastes to apache/lucene#185 Co-authored-by: Christoph Büscher <cbuescher@posteo.de>
@jpountz would you be able to take another look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe something we could check could be verifying that all fields that have a non-null FieldInfo have the same value for hasNorms
on a per-segment basis. It wouldn't cover 100% of misuse, but probably most of it.
I tried out your idea. I opted to do it on a global basis in I'm going to merge, but feel free to give post merge feedback if anything seems off. |
…ent is missing fields (apache#185) This change fixes a bug in `MultiNormsLeafSimScorer` that assumes that each field should have a norm for every term/document. As part of the fix, it adds validation that the fields have consistent norms settings.
This commit moves a fix for a bug in MultiNormsLeafSimScorer in Lucene that fixes an error in the new CombinedFieldQuery for missing values. Its based on a PR for LUCENE-9999 (apache/lucene#185). This is a temporary copy of the affected query and its updated dependencies that should be removed again once we are able to use the original fix from Lucene. Relastes to apache/lucene#185 Co-authored-by: Christoph Büscher <cbuescher@posteo.de>
This change fixes a bug in
MultiNormsLeafSimScorer
that assumes that eachfield should have a norm for every term/document.
As part of the fix, it adds validation that the fields have consistent norms
settings.