Avoid negative scores with cross_fields type (#89016)
The `cross_fields` scoring type can produce negative scores when some documents
are missing fields. When blending term document frequencies, we take the maximum
document frequency across all fields. If one field appears in fewer documents
than another, the blended docFreq can exceed that field's docCount, and its IDF
can become negative. This is because IDF is calculated as
`Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))`, whose argument
drops below 1 once docFreq exceeds docCount.

This change adjusts the docFreq for each field to `Math.min(docCount, docFreq)`
so that the IDF can never become negative. It makes sense that the term document
frequency should never exceed the number of documents containing the field.
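
To make the arithmetic concrete, here is a minimal standalone sketch of the IDF formula quoted above, evaluated before and after the `Math.min(docCount, docFreq)` adjustment. The class name, method names, and sample numbers (a field present in 2 documents, with a blended docFreq of 10) are illustrative assumptions and are not part of this commit.

```java
// Illustrative sketch only; not code from this commit.
public class IdfClampSketch {

    // IDF as quoted in the commit message.
    public static double idf(long docCount, long docFreq) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // The adjustment described above: cap docFreq at the field's docCount.
    public static long clampDocFreq(long docCount, long docFreq) {
        return Math.min(docCount, docFreq);
    }

    public static void main(String[] args) {
        long sparseDocCount = 2;  // documents that contain the sparse field (assumed)
        long blendedDocFreq = 10; // maximum docFreq taken from a denser field (assumed)

        double before = idf(sparseDocCount, blendedDocFreq);
        double after = idf(sparseDocCount, clampDocFreq(sparseDocCount, blendedDocFreq));

        System.out.println("IDF with blended docFreq: " + before); // negative value
        System.out.println("IDF with clamped docFreq: " + after);  // positive value
    }
}
```

With the clamp in place, `docCount - docFreq` is never negative, so the argument of `Math.log` stays above 1 and the IDF stays positive.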
jtibshirani committed Sep 6, 2022
1 parent 07c28bb commit 59f96a2
Showing 4 changed files with 49 additions and 6 deletions.
6 changes: 6 additions & 0 deletions docs/changelog/89016.yaml
@@ -0,0 +1,6 @@
+pr: 89016
+summary: Avoid negative scores with `cross_fields` type
+area: Ranking
+type: bug
+issues:
+ - 44700
11 changes: 6 additions & 5 deletions docs/reference/query-dsl/multi-match-query.asciidoc
@@ -388,11 +388,12 @@ explanation:
 Also, accepts `analyzer`, `boost`, `operator`, `minimum_should_match`,
 `lenient` and `zero_terms_query`.
 
-WARNING: The `cross_fields` type blends field statistics in a way that does
-not always produce well-formed scores (for example scores can become
-negative). As an alternative, you can consider the
-<<query-dsl-combined-fields-query,`combined_fields`>> query, which is also
-term-centric but combines field statistics in a more robust way.
+WARNING: The `cross_fields` type blends field statistics in a complex way that
+can be hard to interpret. The score combination can even be incorrect, in
+particular when some documents contain some of the search fields, but not all
+of them. You should consider the
+<<query-dsl-combined-fields-query,`combined_fields`>> query as an alternative,
+which is also term-centric but combines field statistics in a more robust way.
 
 [[cross-field-analysis]]
 ===== `cross_field` and analysis
@@ -148,7 +148,10 @@ protected int compare(int i, int j) {
             if (prev > current) {
                 actualDf++;
             }
-            contexts[i] = ctx = adjustDF(reader.getContext(), ctx, Math.min(maxDoc, actualDf));
+
+            int docCount = reader.getDocCount(terms[i].field());
+            int newDocFreq = Math.min(actualDf, docCount);
+            contexts[i] = ctx = adjustDF(reader.getContext(), ctx, newDocFreq);
             prev = current;
             sumTTF += ctx.totalTermFreq();
         }
@@ -248,6 +248,39 @@ public void testMinTTF() throws IOException {
         dir.close();
     }
 
+    public void testMissingFields() throws IOException {
+        Directory dir = newDirectory();
+        IndexWriter w = new IndexWriter(dir, newIndexWriterConfig(new MockAnalyzer(random())));
+        FieldType ft = new FieldType(TextField.TYPE_NOT_STORED);
+        ft.freeze();
+
+        for (int i = 0; i < 10; i++) {
+            Document d = new Document();
+            d.add(new TextField("id", Integer.toString(i), Field.Store.YES));
+            d.add(new Field("dense", "foo", ft));
+            // Add a sparse field with high totalTermFreq but low docCount
+            if (i % 5 == 0) {
+                d.add(new Field("sparse", "foo", ft));
+                d.add(new Field("sparse", "one two three four five size", ft));
+            }
+            w.addDocument(d);
+        }
+        w.commit();
+
+        DirectoryReader reader = DirectoryReader.open(w);
+        IndexSearcher searcher = setSimilarity(newSearcher(reader));
+
+        String[] fields = new String[] { "dense", "sparse" };
+        Query query = BlendedTermQuery.dismaxBlendedQuery(toTerms(fields, "foo"), 0.1f);
+        TopDocs search = searcher.search(query, 10);
+        ScoreDoc[] scoreDocs = search.scoreDocs;
+        assertEquals(Integer.toString(0), reader.document(scoreDocs[0].doc).getField("id").stringValue());
+
+        reader.close();
+        w.close();
+        dir.close();
+    }
+
     public void testEqualsAndHash() {
         String[] fields = new String[1 + random().nextInt(10)];
         for (int i = 0; i < fields.length; i++) {
