Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager #240

Conversation

zacharymorn
Copy link
Contributor

@zacharymorn zacharymorn commented Aug 12, 2021

Description / Solution

Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager), with the following changes:

  1. Refactor out TopScoreDocCollectorManager, TopFieldCollectorManager, TotalHitCountCollectorManager and FixedBitSetCollector
  2. Refactor some tests to use the above collector manager
    3. Refactor DrillSideways to use extracted out collector managers

#11041

Tests

Passed updated tests with ./gradlew clean; ./gradlew check -Ptests.nightly=true (currently there are nocommit messages that are failing precommit though)

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Lucene maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.

… of IndexSearcher#search(Query, CollectorManager)

Refactor out TopScoreDocCollectorManager, TopFieldCollectorManager, TotalHitCountCollectorManager and FixedBitSetCollector
Refactor some tests to use the above collector manager
/*
nocommit
Should the following two be passed in instead? Possible custom initialization based on executor status and slices?
On the other hand, in a single-threaded environment, shared HitsThresholdChecker and MaxScoreAccumulator should be fast without lock contention anyway?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you check if luceneutil benchmarks become slower with this change? Hopefully they're not slower and we can keep this logic simple.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the great suggestion! I just ran luceneutil, and it turns out that the changes do get slower due to this thread-safe HitsThresholdChecker (confirmed by playing with the atomic long inside a bit). Here is the full benchmark result:

                    TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
    HighTermTitleBDVSort       52.86     (14.0%)       42.72      (6.6%)  -19.2% ( -34% -    1%) 0.000
       HighTermMonthSort       31.66     (10.7%)       25.90      (6.3%)  -18.2% ( -31% -   -1%) 0.000
   HighTermDayOfYearSort       24.00     (15.7%)       20.02      (8.5%)  -16.6% ( -35% -    9%) 0.000
              TermDTSort       20.22      (9.0%)       17.13      (6.6%)  -15.3% ( -28% -    0%) 0.000
                 MedTerm     1270.02      (3.9%)     1208.29      (5.1%)   -4.9% ( -13% -    4%) 0.001
                HighTerm      941.14      (4.0%)      900.05      (5.9%)   -4.4% ( -13% -    5%) 0.006
            OrHighNotMed      531.63      (5.6%)      514.68      (6.2%)   -3.2% ( -14% -    9%) 0.089
                 LowTerm     1112.50      (5.1%)     1081.48      (6.2%)   -2.8% ( -13% -    8%) 0.118
             MedSpanNear       12.50      (2.2%)       12.18      (2.2%)   -2.5% (  -6% -    1%) 0.000
            HighSpanNear       12.82      (1.7%)       12.52      (1.7%)   -2.4% (  -5% -    1%) 0.000
            OrHighNotLow      562.50      (6.9%)      550.15      (5.6%)   -2.2% ( -13% -   11%) 0.269
           OrHighNotHigh      577.10      (5.6%)      565.51      (6.7%)   -2.0% ( -13% -   10%) 0.303
           OrNotHighHigh      633.91      (6.7%)      621.85      (6.0%)   -1.9% ( -13% -   11%) 0.345
    HighIntervalsOrdered        6.94      (3.1%)        6.81      (3.2%)   -1.9% (  -7% -    4%) 0.061
                PKLookup      174.06      (4.2%)      171.17      (4.5%)   -1.7% (  -9% -    7%) 0.230
               OrHighLow      283.56      (5.3%)      279.05      (5.8%)   -1.6% ( -12% -   10%) 0.365
     MedIntervalsOrdered       19.00      (3.7%)       18.74      (3.6%)   -1.4% (  -8% -    6%) 0.237
                  Fuzzy1       59.38     (11.6%)       58.65     (12.0%)   -1.2% ( -22% -   25%) 0.740
        HighSloppyPhrase       12.68      (2.7%)       12.53      (2.1%)   -1.1% (  -5% -    3%) 0.135
                  Fuzzy2       63.95      (8.6%)       63.40      (9.0%)   -0.9% ( -17% -   18%) 0.759
         MedSloppyPhrase       17.53      (2.7%)       17.38      (2.2%)   -0.8% (  -5% -    4%) 0.279
         LowSloppyPhrase       23.90      (3.0%)       23.70      (2.8%)   -0.8% (  -6% -    5%) 0.354
                 Respell       39.23      (3.1%)       38.90      (3.1%)   -0.8% (  -6% -    5%) 0.403
              AndHighMed       61.17      (3.8%)       60.72      (3.8%)   -0.7% (  -8% -    7%) 0.542
               OrHighMed       48.81      (3.0%)       48.46      (3.0%)   -0.7% (  -6% -    5%) 0.444
     LowIntervalsOrdered       14.64      (2.3%)       14.55      (2.3%)   -0.6% (  -5% -    4%) 0.380
             LowSpanNear       11.65      (1.6%)       11.57      (1.6%)   -0.6% (  -3% -    2%) 0.223
             AndHighHigh       16.57      (3.6%)       16.46      (4.0%)   -0.6% (  -7% -    7%) 0.608
                  IntNRQ       17.25     (28.0%)       17.17     (27.3%)   -0.5% ( -43% -   76%) 0.959
              OrHighHigh        9.28      (2.5%)        9.24      (2.5%)   -0.4% (  -5% -    4%) 0.635
               MedPhrase       72.99      (3.3%)       72.75      (3.7%)   -0.3% (  -7% -    6%) 0.763
               LowPhrase       78.02      (3.5%)       77.77      (4.0%)   -0.3% (  -7% -    7%) 0.786
   BrowseMonthSSDVFacets        3.19      (1.8%)        3.19      (1.7%)   -0.2% (  -3% -    3%) 0.653
BrowseDayOfYearSSDVFacets        3.01      (1.5%)        3.01      (1.2%)   -0.0% (  -2% -    2%) 0.991
                Wildcard       27.98      (2.8%)       27.98      (2.7%)    0.0% (  -5% -    5%) 0.981
              AndHighLow      454.42      (5.4%)      455.10      (4.6%)    0.1% (  -9% -   10%) 0.925
              HighPhrase      289.77      (5.1%)      290.54      (4.0%)    0.3% (  -8% -    9%) 0.854
            OrNotHighLow      521.43      (6.5%)      523.38      (5.7%)    0.4% ( -11% -   13%) 0.846
                 Prefix3       39.82      (4.8%)       40.02      (3.1%)    0.5% (  -7% -    8%) 0.707
            OrNotHighMed      560.49      (5.8%)      563.51      (5.5%)    0.5% ( -10% -   12%) 0.762
   BrowseMonthTaxoFacets        0.99      (4.8%)        1.00      (4.6%)    0.6% (  -8% -   10%) 0.702
    BrowseDateTaxoFacets        0.94      (4.6%)        0.94      (4.6%)    0.6% (  -8% -   10%) 0.683
BrowseDayOfYearTaxoFacets        0.94      (4.5%)        0.94      (4.5%)    0.7% (  -7% -   10%) 0.638

To fix the performance regression, but still try to keep the logic simple then, I'm considering adding an additional flag supportsConcurrency to the collector managers' constructor like such:

public TopScoreDocCollectorManager(int numHits, ScoreDoc after, int totalHitsThreshold, boolean supportsConcurrency) {
    ...
    this.hitsThresholdChecker = supportsConcurrency ? 
        HitsThresholdChecker.createShared(Math.max(totalHitsThreshold, numHits)) : 
        HitsThresholdChecker.create(Math.max(totalHitsThreshold, numHits));
    this.minScoreAcc = supportsConcurrency ? new MaxScoreAccumulator() : null;
  }

public TopFieldCollectorManager(Sort sort, int numHits, FieldDoc after, int totalHitsThreshold, boolean supportsConcurrency) {
   ...
    this.hitsThresholdChecker = supportsConcurrency ? 
        HitsThresholdChecker.createShared(Math.max(totalHitsThreshold, numHits)) : 
        HitsThresholdChecker.create(Math.max(totalHitsThreshold, numHits));
    this.minScoreAcc = supportsConcurrency ? new MaxScoreAccumulator() : null;
  }

What do you think about this approach?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jpountz, for now I've implemented the above approach in 4af7740, and the lucenenutil benchmark shows there's no obvious impact to performance now:

                    TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
             MedSpanNear        6.59      (5.6%)        6.30     (10.4%)   -4.5% ( -19% -   12%) 0.091
BrowseDayOfYearTaxoFacets        0.45      (6.5%)        0.43     (10.0%)   -4.3% ( -19% -   13%) 0.106
            OrNotHighLow      321.23      (8.0%)      308.14      (7.5%)   -4.1% ( -18% -   12%) 0.097
            HighSpanNear        2.96      (6.4%)        2.84      (9.5%)   -3.9% ( -18% -   12%) 0.129
    BrowseDateTaxoFacets        0.45      (6.9%)        0.43      (9.6%)   -3.9% ( -19% -   13%) 0.142
    HighIntervalsOrdered        4.12      (4.7%)        3.97      (8.7%)   -3.6% ( -16% -   10%) 0.102
   BrowseMonthSSDVFacets        1.68      (6.6%)        1.63     (11.7%)   -3.5% ( -20% -   15%) 0.239
        HighSloppyPhrase        6.41      (8.6%)        6.18     (10.0%)   -3.5% ( -20% -   16%) 0.233
                  IntNRQ       14.11      (6.2%)       13.62     (11.2%)   -3.5% ( -19% -   14%) 0.225
                Wildcard       14.90      (7.5%)       14.39      (8.2%)   -3.4% ( -17% -   13%) 0.170
             LowSpanNear        4.53      (5.7%)        4.38      (8.1%)   -3.3% ( -16% -   11%) 0.135
               OrHighLow      164.11      (6.2%)      158.84      (8.5%)   -3.2% ( -16% -   12%) 0.174
   BrowseMonthTaxoFacets        0.47      (6.1%)        0.45      (9.2%)   -3.2% ( -17% -   12%) 0.193
     LowIntervalsOrdered        3.02      (5.5%)        2.93      (9.1%)   -3.2% ( -16% -   12%) 0.185
         LowSloppyPhrase        8.80      (8.7%)        8.53      (9.7%)   -3.0% ( -19% -   16%) 0.298
     MedIntervalsOrdered        6.46      (5.8%)        6.26     (10.1%)   -3.0% ( -17% -   13%) 0.248
              AndHighMed       41.11      (6.6%)       39.91      (8.0%)   -2.9% ( -16% -   12%) 0.205
                  Fuzzy1       30.54     (13.2%)       29.75     (14.1%)   -2.6% ( -26% -   28%) 0.550
BrowseDayOfYearSSDVFacets        1.46      (6.1%)        1.43     (10.6%)   -2.6% ( -18% -   15%) 0.350
              HighPhrase      104.16      (5.5%)      101.53      (8.9%)   -2.5% ( -16% -   12%) 0.284
            OrHighNotLow      394.47      (6.8%)      385.27      (8.3%)   -2.3% ( -16% -   13%) 0.329
                  Fuzzy2       25.00      (8.9%)       24.42      (9.8%)   -2.3% ( -19% -   18%) 0.440
         MedSloppyPhrase        9.24      (8.3%)        9.03      (9.5%)   -2.3% ( -18% -   16%) 0.421
              OrHighHigh        4.58      (5.9%)        4.47      (7.7%)   -2.2% ( -14% -   12%) 0.304
            OrNotHighMed      326.95      (6.6%)      319.95      (7.1%)   -2.1% ( -14% -   12%) 0.322
               LowPhrase      128.75      (5.9%)      126.06      (8.9%)   -2.1% ( -15% -   13%) 0.382
               MedPhrase       34.71      (5.7%)       34.01      (8.0%)   -2.0% ( -14% -   12%) 0.354
           OrNotHighHigh      303.59      (6.7%)      297.55      (8.0%)   -2.0% ( -15% -   13%) 0.394
            OrHighNotMed      304.72      (5.9%)      299.56      (7.5%)   -1.7% ( -14% -   12%) 0.429
                 Prefix3       62.22      (5.1%)       61.22      (7.5%)   -1.6% ( -13% -   11%) 0.431
               OrHighMed       32.41      (6.8%)       32.02      (6.8%)   -1.2% ( -13% -   13%) 0.579
    HighTermTitleBDVSort       24.77     (20.2%)       24.53     (18.1%)   -1.0% ( -32% -   46%) 0.871
                 Respell       20.64      (5.3%)       20.47      (7.1%)   -0.8% ( -12% -   12%) 0.682
             AndHighHigh       23.51      (6.0%)       23.34      (6.7%)   -0.7% ( -12% -   12%) 0.727
                 LowTerm      724.06      (7.1%)      719.45      (5.6%)   -0.6% ( -12% -   12%) 0.751
              AndHighLow      272.87      (5.9%)      271.24      (7.3%)   -0.6% ( -13% -   13%) 0.777
                PKLookup       90.10      (5.5%)       89.65      (8.8%)   -0.5% ( -14% -   14%) 0.831
                HighTerm      876.85      (7.3%)      875.67      (7.8%)   -0.1% ( -14% -   16%) 0.955
           OrHighNotHigh      291.95      (8.8%)      291.77      (8.8%)   -0.1% ( -16% -   19%) 0.982
   HighTermDayOfYearSort       37.60     (16.6%)       37.65     (17.9%)    0.1% ( -29% -   41%) 0.981
       HighTermMonthSort       18.43     (18.6%)       18.50     (20.3%)    0.4% ( -32% -   48%) 0.954
              TermDTSort       41.45     (15.4%)       41.99     (15.1%)    1.3% ( -25% -   37%) 0.786
                 MedTerm      700.62      (6.8%)      711.10      (6.0%)    1.5% ( -10% -   15%) 0.461

Please let me know how this looks to you.

…hread-safe internal states are needed, and remove static methods in favor of constructors
…rillSideways#searchSequentially(query, collectorManager) to by pass caching

This is needed to pass test case from ./gradlew test --tests TestDrillSideways.testRandom -Dtests.seed=69A3EF02D8E3465E -Dtests.nightly=true
Copy link
Contributor

@gsmiller gsmiller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments. Looks good overall. Lots of little changes! Those test cases... :)

@@ -180,6 +185,7 @@ protected int withTopDocs(IndexSearcher searcher, Query q, TopDocs hits) throws
return res;
}

@Deprecated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if anyone out there has sub-classed this to provide their own Collector through overriding this method? It's possible right? Would we want to support this with a CollectorManager now instead? You might consider creating a new protected method--createCollectorManager--that sub-classes could override if they want, then add javadoc here with a @lucene.deprecated tag pointing users to the new method.

Also, if this change is going onto main, I might kill this method entirely and then add it back in as deprecated if you choose to backport to 8x.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if anyone out there has sub-classed this to provide their own Collector through overriding this method? It's possible right?

Hmm you are right. I reverted some changes in this class to still use the ReadTask#createCollector method, in case there's overriding from users.

Would we want to support this with a CollectorManager now instead? You might consider creating a new protected method--createCollectorManager--that sub-classes could override if they want, then add javadoc here with a @lucene.deprecated tag pointing users to the new method.

I think this createCollector method was originally added (11 years ago) to support benchmarking, and not sure how widely this is used actually? So hopefully marking it as deprecated for release 9.0 and removing it entirely once we release 10.0 should be good enough?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was more thinking along the lines of keeping the change you had but providing a protected method that would allow sub-classes to provide their own CollectorManager if they wanted to do so (which would give users a migration path if they were previously providing their own Collector). So in 9.0, I think you'd move completely to the CollectorManager approach like you had. What do you think of that approach?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see where you are coming from. However, upon looking at this method more closely, I'm afraid this method is effectively not useful, since the result of using this collector was commented out :D :

Collector collector = createCollector();
searcher.search(q, collector);
// hits = collector.topDocs();

So I'm now leaning more towards just removing this method altogether if no users ever noticed / complaint about this. What do you think @gsmiller @jpountz ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect it would still be useful to assess the performance of a collector but you probably have a point that if performance is the only thing that the benchmark module reports, then there's a chance that it's not actually used much.

I did a quick GitHub search to see how much code I could find that extends this code:

Given that the benchmark module is quite expert, I'd be ok with breaking hard here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good! I've removed this method, and also fixed the commented out code.

@@ -527,7 +503,10 @@ public TopDocs search(Query query, int n) throws IOException {
*
* @throws TooManyClauses If a query would exceed {@link IndexSearcher#getMaxClauseCount()}
* clauses.
* @deprecated This method is being deprecated in favor of {@link IndexSearcher#search(Query,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we like the @lucene.deprecated tag?

Is your reasoning behind keeping this in there for now that all of our uses of this (internal to Lucene) haven't yet migrated as part of this change? Would the plan me to migrate all usages off of this internally and then actually remove it on main/9.0, or are you thinking of keeping it around until 10.0? I think our backwards compatibility policy is such that we could just directly remove this on main/9.0, but then leave it like you have it (marked deprecated) if you choose to backport this to 8x. Since this method is so fundamental though, I could easily see an argument to keep it around for an extra major release to give users more time to migrate. Then again, the migration path seems pretty straight-forward. What do you think?

Copy link
Contributor Author

@zacharymorn zacharymorn Aug 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we like the @lucene.deprecated tag?

Do you mean @lucene.deprecated should be created and used instead of using @deprecated? I can't seems to find that tag is being used in lucene actually...I think @lucene.deprecated most likely won't be recognized by IDE or build tool to signal / warn the use of deprecated methods though.

Is your reasoning behind keeping this in there for now that all of our uses of this (internal to Lucene) haven't yet migrated as part of this change? Would the plan me to migrate all usages off of this internally and then actually remove it on main/9.0, or are you thinking of keeping it around until 10.0? I think our backwards compatibility policy is such that we could just directly remove this on main/9.0, but then leave it like you have it (marked deprecated) if you choose to backport this to 8x. Since this method is so fundamental though, I could easily see an argument to keep it around for an extra major release to give users more time to migrate. Then again, the migration path seems pretty straight-forward. What do you think?

As explained in https://issues.apache.org/jira/browse/LUCENE-10002?focusedCommentId=17397827&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17397827, I use deprecation instead of direct removal in this PR exactly for the reasons you mentioned:

  1. Not all internal uses have been migrated in this PR (which may require another few thousand lines of changes).
  2. I personally feel that we should give users more time for migration and testing out the changes, so 10.0 release might be a good timing for complete removal.

I would be interested in seeing how folks feel about the timing of removal, but after #1 is completed, complete removal of IndexSearcher#search(Query, Collector) in lucene codebase should require only straightforward deletion changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should try to keep IndexSearcher#search(Query, Collector) and the TopXXXCollector#create factory methods during the 9.x series, as I expect it to be quite commonly used. However in my opinion it would be fine to cut over expert classes like those in the benchmark module without providing such long bw compatibility.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 makes sense to keep it in 9.0. As for my comment about @lucene.deprecated, I was getting this mixed up with something else. Apologies for any confusion!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for my comment about @lucene.deprecated, I was getting this mixed up with something else. Apologies for any confusion!

Though, I think there are both @deprecated (javadoc tag) and @Deprecated (Java class annotation) -- we do both of them typically (looks like you did here @zacharymorn, great!).

@zacharymorn
Copy link
Contributor Author

Left a few comments. Looks good overall. Lots of little changes! Those test cases... :)

Thanks @gsmiller for the review! Yes there are indeed many small changes, and more will come :D

@@ -527,7 +503,10 @@ public TopDocs search(Query query, int n) throws IOException {
*
* @throws TooManyClauses If a query would exceed {@link IndexSearcher#getMaxClauseCount()}
* clauses.
* @deprecated This method is being deprecated in favor of {@link IndexSearcher#search(Query,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should try to keep IndexSearcher#search(Query, Collector) and the TopXXXCollector#create factory methods during the 9.x series, as I expect it to be quite commonly used. However in my opinion it would be fine to cut over expert classes like those in the benchmark module without providing such long bw compatibility.

this.hitsThresholdChecker =
supportsConcurrency
? HitsThresholdChecker.createShared(Math.max(totalHitsThreshold, numHits))
: HitsThresholdChecker.create(Math.max(totalHitsThreshold, numHits));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's also store supportsConcurrency in a member of this class so that we can check that users never create more than one collector when it is false?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I like this idea better than the approach described in #240 (comment), since it doesn't require an additional API to be added to CollectorManager for thread-safety! I've implemented this in the latest commit.

…or) to DrillSideways#searchSequentially(query, collectorManager) to by pass caching"

This reverts commit 86bac52.
@zacharymorn
Copy link
Contributor Author

Hi @jpountz @gsmiller , just want to check back on this PR to see if you have any further feedback?

@jpountz
Copy link
Contributor

jpountz commented Sep 15, 2021

Sorry @zacharymorn I have not forgotten about it but the change is very large and it's hard to find enough adjacent time to review it. I'll do my best to find time in the coming week.

@zacharymorn
Copy link
Contributor Author

Sorry @zacharymorn I have not forgotten about it but the change is very large and it's hard to find enough adjacent time to review it. I'll do my best to find time in the coming week.

No worry @jpountz , this PR is indeed large and time consuming! Please take your time and I appreciate your help on the review!

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments but the approach that you took so far looks good to me. I see that we still need to migrate some collectors like LargeNumHitsTopDocsCollector or the internal JoinUtil collectors to a CollectorManager?

@@ -180,6 +185,7 @@ protected int withTopDocs(IndexSearcher searcher, Query q, TopDocs hits) throws
return res;
}

@Deprecated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect it would still be useful to assess the performance of a collector but you probably have a point that if performance is the only thing that the benchmark module reports, then there's a chance that it's not actually used much.

I did a quick GitHub search to see how much code I could find that extends this code:

Given that the benchmark module is quite expert, I'd be ok with breaking hard here.


int ret = totalHitCountCollector.getTotalHits();
int ret = indexSearcher.search(booleanQuery.build(), new TotalHitCountCollectorManager());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could make it even better by using IndexSearcher#count.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -179,8 +178,7 @@ protected int countDocsWithClass() throws IOException {
if (query != null) {
q.add(query, BooleanClause.Occur.MUST);
}
indexSearcher.search(q.build(), classQueryCountCollector);
docCount = classQueryCountCollector.getTotalHits();
docCount = indexSearcher.search(q.build(), new TotalHitCountCollectorManager());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's simplify by using IndexSearcher#count?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Also updated the implementation of IndexSearcher#count to use TotalHitCountCollectorManager.

TotalHitCountCollector totalHitCountCollector = new TotalHitCountCollector();
indexSearcher.search(booleanQuery.build(), totalHitCountCollector);
return totalHitCountCollector.getTotalHits();
return indexSearcher.search(booleanQuery.build(), new TotalHitCountCollectorManager());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

likewise here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

* Same as above with additional parameters to allow passing in the threshold checker and the max
* score accumulator.
*/
static TopFieldCollector create(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed we don't support access to pkg-private code.


if (sort.fields.length == 0) {
throw new IllegalArgumentException("Sort must contain at least one field");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move as much validation as possible to the constructor, so that we would detect inconsistencies earlier: when users create the collector manager rather than when Lucene starts collecting hits?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh good catch! Updated.

return TopDocs.merge(sort, 0, numHits, topDocs);
}

public List<TopFieldCollector> getCollectors() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make it pkg-private?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this method was added to support early termination detection in tests like the following:

collectorManager.getCollectors().stream().anyMatch(TopFieldCollector::isEarlyTerminated)

I think Lucene users may have the need for this check as well?


searcher.search(testQuery, largeCollector);
searcher.search(testQuery, regularCollector);
TopDocs topDocs = searcher.search(testQuery, regularCollector);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep it called secondTopDocs, I think it helped readability

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

@jpountz
Copy link
Contributor

jpountz commented Sep 16, 2021

@zacharymorn @gsmiller If we try to do everything in a single PR, I worry that this will become very hard to review. I wonder if we should split by replacing the deprecation warnings of IndexSearcher#search(Query,Collector) and TopXXXCollector#create by some words about how the CollectorManager variants are recommended as they make it possible to search using multiple threads. Then when this PR is merged we can work more iteratively on replacing usage of Collector with CollectorManager, and we could add the deprecation warnings in the end once we have migrated all our code?

@zacharymorn
Copy link
Contributor Author

I left some comments but the approach that you took so far looks good to me. I see that we still need to migrate some collectors like LargeNumHitsTopDocsCollector or the internal JoinUtil collectors to a CollectorManager?

Thanks @jpountz for the review! Indeed there are more that need to be migrated as well. In fact, I have been working on the 2nd PR to migrate more "easy" ones, and after that, we still have the following:

Explanation related

ExplanationAsserter
MatchesAsserter

Facet related

FacetsCollector
RandomSamplingFacetsCollector

Grouping related

GroupFacetCollector
TopGroupsCollector
BlockGroupingCollector
FirstPassGroupingCollector
AllGroupHeadsCollector
AllGroupsCollector

Global ordinal related

GlobalOrdinalsCollector
GlobalOrdinalsWithScoreCollector

Anonymous classes

Anonymous Collector used in JoinUtil
Anonymous Collector used in TestJoinUtil
Anonymous Collector used in QueryUtils

Profiling related

ProfilerCollector
MemoryAccountingBitsetCollector
MonitorQueryCollector

Counting related

DocValuesStatsCollector
MyHitCollector
MatchCollector

Result diversification related

DistinctValuesCollector
DiversifiedTopDocsCollector

Other

CachingCollector
TerminateAfterCollector
TimeLimitingCollector
LargeNumHitsTopDocsCollector

As each of these categories / collectors will likely require more thoughts / time to understand the current logic, and come up with CollectorManager implementation that's also thread-safe, I'm thinking to create follow-up Jira sub-tickets for each of these categories to track them. What do you think?

I wonder if we should split by replacing the deprecation warnings of IndexSearcher#search(Query,Collector) and TopXXXCollector#create by some words about how the CollectorManager variants are recommended as they make it possible to search using multiple threads. Then when this PR is merged we can work more iteratively on replacing usage of Collector with CollectorManager, and we could add the deprecation warnings in the end once we have migrated all our code?

Yeah I can definitely follow this strategy as well. The only concern I may have is that, as can be seen above, it may still take some time to migrate all existing collectors to collectorManagers. So using suggestion comment for now and adding deprecation warning at the end may likely see a few more collectors without collectorManagers added accidentally (i.e. folks may not notice the comment when they just code against the IndexSearch#search(Query, Collector) API)?

@javanna
Copy link
Contributor

javanna commented Jan 10, 2022

hi @zacharymorn I looked at this issue a long while ago, before you started working on it, and I am now catching up. I see you made great progress on it! I also see that reviewing it as a single change is a bit of a challenge due to the size of the PR. Could I help out splitting this in smaller PRs so that we make progress towards getting the change in, step by step?

@zacharymorn
Copy link
Contributor Author

We are still a ways away (from seeing Lucene fully utilize available hardware concurrency available at search time to reduce query latencies)

For example: query concurrency is still tied to segment geometry, which is insane. An "optimized" (forceMerge to 1 segment) index loses all of its concurrency! Hideously, at my team (Amazon product search) we had to create a neutered merge policy that intentionally avoids "accidentally" making a lopsided index (where too many docs are in a single segment) because it harms our long pole query latencies (we run all queries concurrently, except close to red-line), due to this silly Lucene limitation.

We have a longstanding issue open for this: #9721, and it ought not be so difficult to fix because Lucene already has strong support for one thread to search a "slice" of a big segment, i.e. we can make the thread work units slices of segments instead of whole segments or collections of segments (what IndexSearcher now calls slices, confusingly).

Thanks @mikemccand for sharing the additional insight and context! Looks like there's already a lot of good discussions in that ticket as well. Will definitely take a look at that and see what I can contribute, once most of the work for deprecating IndexSearch#search(Query, Collector) has been completed.

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks @zacharymorn! I just think we should revert the lucene/benchmark change that breaks testing of a custom collector for this first step ...

@zacharymorn
Copy link
Contributor Author

Looks great, thanks @zacharymorn! I just think we should revert the lucene/benchmark change that breaks testing of a custom collector for this first step ...

Thanks @mikemccand for the review feedback and approval!

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thank you @zacharymorn! Awesome to see Lucene bit by bit progressing towards fully utilizing concurrent hardware by default. It's about time we pay down this longstanding tech debt.

@zacharymorn
Copy link
Contributor Author

Hi @mikemccand @jpountz @javanna @gsmiller , I have updated this PR to pick up the latest from main, as well as revert some changes to save them for follow-up PRs that address other collectors. This PR is now focused on the following:

  • Refactor out TopScoreDocCollectorManager, TopFieldCollectorManager
  • Refactor some tests to use the above collector manager

Could you please take a look, and let me know if you have any feedback?

Hi @jpountz @javanna @gsmiller, just want to have a quick follow-up to see if you may have any further feedback on this PR? I plan to merge it in the next few days if there's no major concern.

@javanna
Copy link
Contributor

javanna commented Nov 16, 2023

Thanks for reviving this PR @zacharymorn ! the changes look good to me, having top score doc and top field collector managers sounds like a natural next step, and removes code duplication. I wish that we could avoid having the supportConcurrency flag in them, but that's ok for now, and I don't have a good alternative in mind to avoid unnecessary overhead when there's a single slice.

One thing that I wonder is whether we are ok already deprecating search(Query, Collector) given that we have a lot of usages still within Lucene. I was thinking that we may want to replace them all before deprecating, and then remove the method in main, but this is where I got distracted from this project last year and ended up dropping it after all. Should we consider getting your changes in, minus the deprecation, perhaps, and split the work on removing usages of the method?

@zacharymorn
Copy link
Contributor Author

Thanks @javanna for the feedback!

One thing that I wonder is whether we are ok already deprecating search(Query, Collector) given that we have a lot of usages still within Lucene. I was thinking that we may want to replace them all before deprecating, and then remove the method in main, but this is where I got distracted from this project last year and ended up dropping it after all. Should we consider getting your changes in, minus the deprecation, perhaps, and split the work on removing usages of the method?

My current preference is actually to get the deprecation annotation in early, so that while we take on the remaining work to migrate other collector usage (which could take some time due to the amount of it), other contributors / application developers could start to take note of those deprecation messages, and stay away from using search(Query, Collector) in favor of search(Query, CollectorManager) as early as possible. Otherwise, there may be more and more of those usage being added into lucene and we will always be playing catch-up to migrate them until we deprecate those methods. What do you think?

@javanna
Copy link
Contributor

javanna commented Nov 21, 2023

That is fine with me @zacharymorn . Indeed I have observed as well that there will be new usages introduced while we work on removing current usages, and deprecating early can help with that.

@zacharymorn zacharymorn merged commit 38ca8d3 into apache:main Nov 28, 2023
4 checks passed
mikemccand added a commit that referenced this pull request Nov 29, 2023
@javanna
Copy link
Contributor

javanna commented Nov 30, 2023

heya @zacharymorn given that this is a deprecation, I guess you meant on backporting to branch_9x as well?

@zacharymorn
Copy link
Contributor Author

Hi @javanna , I was actually thinking to have it for 10.0.0 (added an entry into that section in CHANGES.txt), as deprecating IndexSearcher#search(Query, Collector) has a rather large impact to lucene applications (not that it requires immediate changes, but the method itself is very frequently used and hence requires some attentions). But I can change it to branch_9x instead if that's preferred?

@javanna
Copy link
Contributor

javanna commented Dec 4, 2023

I think we would do the removal in 10, as that's the type of change that we would make for a major release. Or can decide to delay the removal if need be, but delaying the deprecation kind of defeats the purpose of deprecating?

@zacharymorn
Copy link
Contributor Author

I think we would do the removal in 10, as that's the type of change that we would make for a major release. Or can decide to delay the removal if need be, but delaying the deprecation kind of defeats the purpose of deprecating?

Do you know what's the timeline we have for releasing version 10? Based on my quick look over the last few days, most likely it will take a good few months for us to completely remove that usage in lucene codebase (I'm also a bit occupied at the moment), hence we may need to delay the removal if version 10 is planned to be released within like a year.

With the above said, like I shared earlier, I feel getting the deprecation warning message into our codebase as early as possible is critical for lucene contributors to start moving away from using that method, so that we won't be playing catch-up with ourselves and will be able to remove that method one day. Checking the deprecation warning into main already achieved this purpose. It will be great for lucene application developer to see this as early as possible as well, but its not as urgent (and it will take them longer time to migrate anyway), hence I was having it for version 10 given its potential impact. But I could change it to be released for 9x if that's still preferred.

@javanna
Copy link
Contributor

javanna commented Dec 6, 2023

I think that it would be even better to get the deprecation out with 9x, regardless of whether we will effectively remove with 10. At least we let users know that they should use a different method. I don't see advantages around delaying a deprecation somehow. Thoughts @jpountz ?

@mikemccand
Copy link
Member

+1 to backport the deprecation message/tags to 9.x, even if we cannot fully remove our own internal usage of the deprecated APIs before releasing 10.0. The two can / should be decoupled since it's so much work?

Let's also open a dedicated spinoff issue to "remove all deprecated IS.search usage / IS.search methods in 10.0", and mark it as Blocker for now? If 10.0 wants to get out and we haven't done this blocker we can revisit how/when to do it, whether to downgrade, etc.?

zacharymorn added a commit to zacharymorn/lucene that referenced this pull request Dec 7, 2023
… of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager (apache#240)
@zacharymorn
Copy link
Contributor Author

Thanks both @javanna @mikemccand . I have adjusted the change entry and opened a backport PR, as well as a new spinoff issue for 10.0.0 #12892.

zacharymorn added a commit that referenced this pull request Dec 9, 2023
… of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager (#240) (#12891)
@javanna javanna modified the milestones: 10.0.0, 9.10.0 Dec 11, 2023
javanna pushed a commit that referenced this pull request Jul 29, 2024
…ethods (#create, #createSharedManager) (#13500)

These methods were deprecated in #240 which is part of Lucene 10.0.

Addresses #13499
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants