LUCENE-10236: Updated field-weight used in CombinedFieldQuery scoring calculation, and added a test #444

zacharymorn · 2021-11-16T04:21:56Z

Description

Updated field-weight used in CombinedFieldQuery scoring calculation

Tests

Added a new test from LUCENE-10061: Implements dynamic pruning support for CombinedFieldsQuery #418
Run ./gradlew clean; ./gradlew check -Pvalidation.git.failOnModified=false

Checklist

Please review the following and check all that apply:

I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
I have created a Jira issue and added the issue ID to my pull request title.
I have given Lucene maintainers access to contribute to my PR branch. (optional but recommended)
I have developed this patch against the main branch.
I have run ./gradlew check.
I have added tests for my changes.

jimczi

LGTM, good catch @zacharymorn

jimczi · 2021-11-16T07:45:31Z

lucene/sandbox/src/java/org/apache/lucene/sandbox/search/MultiNormsLeafSimScorer.java

+            : "There is duplicated field ["
+                + field.field
+                + "] used to construct MultiNormsLeafSimScorer";
+        duplicateCheckingSet.add(field.field);


Could be assert duplicateCheckingSet.add(field.field) == false ?

Ah yes. I assume you meant assert duplicateCheckingSet.add(field.field) and have updated it accordingly.
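For readers following the exchange above: `Set.add` returns true only when the element was not already in the set, so `assert set.add(x)` fails on a repeated element, while the `== false` variant would fail on every new one. Below is a minimal standalone sketch of that idiom. The class and method names are hypothetical and are not taken from MultiNormsLeafSimScorer.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class, for illustration only (not Lucene code).
final class DuplicateFieldCheck {

  /** Returns the first repeated name, or null if all names are distinct. */
  static String firstDuplicate(List<String> fieldNames) {
    Set<String> seen = new HashSet<>();
    for (String name : fieldNames) {
      // Set.add returns false when the element is already present.
      if (seen.add(name) == false) {
        return name;
      }
    }
    return null;
  }

  /** Assertion form: fails on the first repeated name when run with -ea. */
  static void assertNoDuplicates(List<String> fieldNames) {
    Set<String> seen = new HashSet<>();
    for (String name : fieldNames) {
      assert seen.add(name) : "There is duplicated field [" + name + "]";
    }
  }

  public static void main(String[] args) {
    System.out.println(firstDuplicate(List.of("a", "b")));      // no repeat
    System.out.println(firstDuplicate(List.of("a", "b", "a"))); // "a" repeats
    assertNoDuplicates(List.of("a", "b"));
  }
}
```

When assertions are disabled, the `assert` expression is not evaluated at all, so the set is never populated. That is harmless here because the set exists only to support the check.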

jtibshirani

Thanks for catching this!

jtibshirani · 2021-11-16T17:55:32Z

lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestCombinedFieldQuery.java

@@ -165,6 +169,117 @@ public void testSameScore() throws IOException {
    dir.close();
  }

+  public void testSameScoreAndCollectionBetweenCompleteAndTopScores() throws IOException {


Could you explain how this test relates to the fix? Would it make sense to have a more targeted test similar to testCopyField?

This test was actually developed in another related PR, #418, and it generates duplicated field-weight pairs in the fields object to trigger the condition. As the existing tests don't currently trigger that condition, I thought this would be a good test to put in first for both PRs.
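As a rough illustration of why duplicated field-weight pairs can matter, here is a hypothetical sketch that is not Lucene's actual scoring code. It assumes a scorer that combines per-field values as a weighted sum over a list of pairs; under that assumption, a repeated pair contributes once per occurrence. `FieldAndWeight`, `weightedSum`, and `distinctByField` are names invented for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class, for illustration only (not Lucene code).
final class FieldWeightDemo {

  /** Hypothetical stand-in for a (field, weight) pair. */
  static final class FieldAndWeight {
    final String field;
    final float weight;

    FieldAndWeight(String field, float weight) {
      this.field = field;
      this.weight = weight;
    }
  }

  /** Sums weight * value over the list; a repeated pair is counted once per occurrence. */
  static float weightedSum(List<FieldAndWeight> fields, Map<String, Float> valueByField) {
    float sum = 0f;
    for (FieldAndWeight f : fields) {
      sum += f.weight * valueByField.get(f.field);
    }
    return sum;
  }

  /** Keeps the first pair seen for each field name, preserving order. */
  static List<FieldAndWeight> distinctByField(List<FieldAndWeight> fields) {
    Map<String, FieldAndWeight> byField = new LinkedHashMap<>();
    for (FieldAndWeight f : fields) {
      byField.putIfAbsent(f.field, f);
    }
    return new ArrayList<>(byField.values());
  }

  public static void main(String[] args) {
    Map<String, Float> values = Map.of("a", 10f, "b", 4f);
    List<FieldAndWeight> withDuplicate =
        List.of(
            new FieldAndWeight("a", 2f),
            new FieldAndWeight("b", 1f),
            new FieldAndWeight("a", 2f));
    System.out.println(weightedSum(withDuplicate, values)); // 44.0
    System.out.println(weightedSum(distinctByField(withDuplicate), values)); // 24.0
  }
}
```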

I guess if we were to have a focused test for this PR, it would still look very similar to this test (and testCopyField), maybe something like the following; it just ensures the test runs through without an assertion exception. The only differences between the tests are that I removed some doc content randomization and the result comparison between TOP_SCORES and COMPLETE collection.

  public void testShouldRunWithoutAssertionException() throws IOException {
    int numDocs =
        randomBoolean()
            ? atLeast(1000)
            : atLeast(128 * 8 * 8 * 3); // make sure some terms have skip data
    int numMatchDoc = randomIntBetween(200, 500);
    int numHits = atMost(100);
    int boost1 = Math.max(1, random().nextInt(5));
    int boost2 = Math.max(1, random().nextInt(5));

    Directory dir = newDirectory();
    Similarity similarity = randomCompatibleSimilarity();

    IndexWriterConfig iwc = new IndexWriterConfig();
    iwc.setSimilarity(similarity);
    RandomIndexWriter w = new RandomIndexWriter(random(), dir, iwc);

    // adding potentially matching doc
    for (int i = 0; i < numMatchDoc; i++) {
      Document doc = new Document();

      int freqA = random().nextInt(20) + 1;
      if (randomBoolean()) {
        for (int j = 0; j < freqA; j++) {
          doc.add(new TextField("a", "foo", Store.NO));
        }
      }

      freqA = random().nextInt(20) + 1;
      if (randomBoolean()) {
        for (int j = 0; j < freqA; j++) {
          doc.add(new TextField("a", "foo" + j, Store.NO));
        }
      }

      freqA = random().nextInt(20) + 1;
      if (randomBoolean()) {
        for (int j = 0; j < freqA; j++) {
          doc.add(new TextField("a", "zoo", Store.NO));
        }
      }

      int freqB = random().nextInt(20) + 1;
      if (randomBoolean()) {
        for (int j = 0; j < freqB; j++) {
          doc.add(new TextField("b", "zoo", Store.NO));
        }
      }

      freqB = random().nextInt(20) + 1;
      if (randomBoolean()) {
        for (int j = 0; j < freqB; j++) {
          doc.add(new TextField("b", "zoo" + j, Store.NO));
        }
      }

      int freqC = random().nextInt(20) + 1;
      for (int j = 0; j < freqC; j++) {
        doc.add(new TextField("c", "bla" + j, Store.NO));
      }
      w.addDocument(doc);
    }

    IndexReader reader = w.getReader();
    IndexSearcher searcher = newSearcher(reader);
    searcher.setSimilarity(similarity);

    CombinedFieldQuery query =
        new CombinedFieldQuery.Builder()
            .addField("a", (float) boost1)
            .addField("b", (float) boost2)
            .addTerm(new BytesRef("foo"))
            .addTerm(new BytesRef("zoo"))
            .build();

    TopScoreDocCollector completeCollector =
        TopScoreDocCollector.create(numHits, null, Integer.MAX_VALUE);
    searcher.search(query, completeCollector);

    reader.close();
    w.close();
    dir.close();
  }

I guess I'm fine either way? If you prefer a more focused test, I can replace it with the above one.

I think starting with this more focused test makes sense for this PR.

Sounds good. I've replaced the test with a more focused one.

zacharymorn · 2021-11-17T04:56:03Z

Thanks @jimczi @jtibshirani for the review and feedback!

jtibshirani

Looks good to me too!

zacharymorn · 2021-11-19T06:44:01Z

Will backport this to 8.11 and 8x.

zacharymorn · 2021-12-20T06:01:19Z

Will backport this to 8.11 and 8x.

Hi @jimczi @jtibshirani @mikemccand, just to confirm, as I saw there was a thread on the proper handling of branch_8x: this change should be backported to branch_8_11 (version 8.11.2), branch_9_0 (version 9.0.1), and branch_9x (version 9.1.0), but not to branch_8x, right?

jpountz · 2022-01-05T16:13:56Z

Correct, changes should no longer be backported to branch_8x.

LUCENE-10236: Updated field-weight used in CombinedFieldQuery scoring calculation, and added a test

5e1bd47

zacharymorn requested review from jpountz and jimczi November 16, 2021 04:21

zacharymorn mentioned this pull request Nov 16, 2021

LUCENE-10061: Implements dynamic pruning support for CombinedFieldsQuery #418

Open

jimczi approved these changes Nov 16, 2021

jtibshirani reviewed Nov 16, 2021

address feedback

d6a478f

zacharymorn requested review from jimczi and jtibshirani November 17, 2021 04:56

jtibshirani approved these changes Nov 17, 2021

address feedback - more focused test

78573f0

zacharymorn requested a review from jtibshirani November 18, 2021 05:10

jtibshirani approved these changes Nov 18, 2021

zacharymorn merged commit 07ee3ba into apache:main Nov 19, 2021

zacharymorn mentioned this pull request Jan 6, 2022

LUCENE-10236: Update field-weight used in CombinedFieldQuery scoring calculation (8.11.2 Backporting) apache/lucene-solr#2637

Open

zacharymorn added a commit to zacharymorn/lucene that referenced this pull request Jan 6, 2022

LUCENE-10236: Update field-weight used in CombinedFieldQuery scoring calculation (apache#444) (cherry picked from commit 07ee3ba)

cd7da4a

zacharymorn mentioned this pull request Jan 6, 2022

LUCENE-10236: Update field-weight used in CombinedFieldQuery scoring calculation (9.0.1 Backporting) #587

Closed

zacharymorn added a commit to zacharymorn/lucene that referenced this pull request Jan 6, 2022

LUCENE-10236: Update field-weight used in CombinedFieldQuery scoring calculation (apache#444) (cherry picked from commit 07ee3ba)

e568046

zacharymorn mentioned this pull request Jan 6, 2022

LUCENE-10236: Update field-weight used in CombinedFieldQuery scoring calculation (9.1.0 Backporting) #588

Merged
