@alexmm-amzn alexmm-amzn commented Sep 22, 2025

Description

Extends the FirstPassGroupingCollector to support pruning (for numeric sort fields using competitiveIterator) and skipping of non-competitive documents (for relevance score sorting using Scorable#setMinCompetitiveScore).

Both optimizations are enabled automatically and reduce the number of hits the collector has to process whenever circumstances allow.
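
For readers unfamiliar with the idea, here is a minimal sketch of why skipping is safe when groups are sorted by relevance score. It is a toy model, not the code in this PR and not Lucene's API: `FirstPassSkipSketch`, its methods, and the string group keys are all hypothetical names, and ties are assumed to lose against earlier documents. Once the top-N groups are full, the best score of the weakest group acts as the minimum competitive score, and any document scoring at or below it cannot change the set of top groups.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model (hypothetical, not Lucene code) of first-pass grouping sorted by score. */
class FirstPassSkipSketch {
  private final int topNGroups;
  // Best score seen so far for each group currently in the top N.
  private final Map<String, Float> bestScoreByGroup = new HashMap<>();
  // Documents scoring at or below this value cannot change the top-N groups.
  private float minCompetitiveScore = Float.NEGATIVE_INFINITY;
  private int collected = 0;
  private int skipped = 0;

  FirstPassSkipSketch(int topNGroups) {
    if (topNGroups < 1) {
      throw new IllegalArgumentException("topNGroups must be >= 1");
    }
    this.topNGroups = topNGroups;
  }

  void collect(String group, float score) {
    if (score <= minCompetitiveScore) {
      // Non-competitive: it can neither open a better group nor improve an existing one.
      skipped++;
      return;
    }
    collected++;
    Float best = bestScoreByGroup.get(group);
    if (best == null && bestScoreByGroup.size() == topNGroups) {
      // A new group that beats the weakest one replaces it.
      bestScoreByGroup.remove(weakestGroup());
    }
    if (best == null || score > best) {
      bestScoreByGroup.put(group, score);
    }
    if (bestScoreByGroup.size() == topNGroups) {
      // The bar only exists once the top N is full.
      minCompetitiveScore = bestScoreByGroup.get(weakestGroup());
    }
  }

  // Linear scan for the group with the lowest best score; fine for a sketch.
  private String weakestGroup() {
    String weakest = null;
    for (Map.Entry<String, Float> e : bestScoreByGroup.entrySet()) {
      if (weakest == null || e.getValue() < bestScoreByGroup.get(weakest)) {
        weakest = e.getKey();
      }
    }
    return weakest;
  }

  int collectedCount() { return collected; }
  int skippedCount() { return skipped; }
  Map<String, Float> groups() { return bestScoreByGroup; }

  public static void main(String[] args) {
    FirstPassSkipSketch c = new FirstPassSkipSketch(2);
    c.collect("a", 3f);
    c.collect("b", 5f);   // top 2 is now full
    c.collect("c", 2f);   // below the bar
    c.collect("c", 4f);   // beats the weakest group
    System.out.println("collected=" + c.collectedCount() + " skipped=" + c.skippedCount());
  }
}
```

In the actual collector the bar would be handed to the scorer (via Scorable#setMinCompetitiveScore) so that non-competitive documents are never visited at all, rather than being visited and rejected as in this sketch.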

@jainankitk Are we fine with enabling this by default, or does it need to be configurable (e.g. via a configurable hit threshold)?

Benchmark results from luceneutil for the TermBGroup1M scenario (which combines first- and second-pass grouping), run with a modified wikimedium.10M.nostopwords.tasks job. This scenario sorts by relevance score.

> grep TermBGroup1M tasks/wikimedium500.tasks > tasks/wikimedium.10M.nostopwords.tasks
> python src/python/localrun.py -source wikimediumall 

Running on m6a.2xlarge using Corretto 24:

Task           QPS baseline   StdDev   QPS my_modified_version   StdDev   Pct diff                  p-value
PKLookup             161.09  (12.8%)                    156.56  (11.2%)    -2.8% ( -23% -   24%)     0.460
TermBGroup1M          11.47  (14.8%)                     13.44  (13.7%)    17.1% (  -9% -   53%)     0.000

=> ~17% overall performance improvement (first+second pass).

@jpountz I'm getting some rare test failures for TestGrouping caused by the assert canSetMinCompetitiveScore assertion in AssertingScorer#setMinCompetitiveScore, even though the FirstPassGroupingCollector uses ScoreMode.TOP_SCORES in all configurations when it calls Scorable#setMinCompetitiveScore. Is this a known issue?

Reproduce with: gradlew test --tests TestGrouping.testRandom -Dtests.seed=EC2EC279F564DD82 -Dtests.locale=de-AT -Dtests.timezone=America/St_Thomas -Dtests.asserts=true -Dtests.file.encoding=UTF-8

Edit: This seems to be caused by the Weight that the unit tests instantiate with either ScoreMode.COMPLETE or ScoreMode.COMPLETE_NO_SCORES, regardless of the actual collectors. I updated the code to instantiate a new Weight instance for every collector, in line with that collector's ScoreMode.

@github-actions

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

…roupingCollector (apache#15136)

Also adjust CachingCollector behavior to disable minimum competitive scores to allow caching the exhaustive list of hits.

@github-actions github-actions bot added this to the 11.0.0 milestone Sep 23, 2025
@github-actions

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Oct 15, 2025

@jainankitk jainankitk left a comment

This mostly looks correct to me. I'll wait a couple of days in case someone has concerns about this change.


final Pruning pruning;
if (i == 0) {
  pruning = compIDXEnd >= 0 ? Pruning.GREATER_THAN : Pruning.GREATER_THAN_OR_EQUAL_TO;

Isn't pruning always equal to Pruning.GREATER_THAN since compIDXEnd = sortFields.length - 1?
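
To make the question concrete, here is a small standalone sketch of the ternary in isolation. It rests on two assumptions: that compIDXEnd really is sortFields.length - 1, as stated above, and that a group sort always has at least one sort field. The `Pruning` enum below is a local stand-in so the snippet compiles without Lucene, and `PruningChoiceSketch` and `choose` are hypothetical names.

```java
class PruningChoiceSketch {
  // Local stand-in for the two Pruning constants used in the diff (assumption: not Lucene's class).
  enum Pruning { GREATER_THAN, GREATER_THAN_OR_EQUAL_TO }

  // Mirrors the ternary from the diff, assuming compIDXEnd = sortFields.length - 1.
  static Pruning choose(int sortFieldCount) {
    int compIDXEnd = sortFieldCount - 1;
    return compIDXEnd >= 0 ? Pruning.GREATER_THAN : Pruning.GREATER_THAN_OR_EQUAL_TO;
  }

  public static void main(String[] args) {
    // Any non-empty sort yields GREATER_THAN; only an empty one would take the other branch.
    for (int n = 0; n <= 3; n++) {
      System.out.println(n + " sort field(s) -> " + choose(n));
    }
  }
}
```

Under those assumptions the GREATER_THAN_OR_EQUAL_TO branch is only reachable with zero sort fields, which is the case the comment is asking about.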

Comment on lines +115 to +116
scoreMode = groupSort.needsScores() ? ScoreMode.TOP_DOCS_WITH_SCORES : ScoreMode.TOP_DOCS;
canSetMinScore = false;

Do we not need ScoreMode.COMPLETE / ScoreMode.COMPLETE_NO_SCORES here because there is no totalHitsThreshold-like parameter?

@github-actions github-actions bot removed the Stale label Oct 28, 2025

@github-actions github-actions bot added the Stale label Nov 11, 2025