
CombinedFieldsQuery needs dynamic pruning support [LUCENE-10061] #11099

Open · asfimport opened this issue Aug 23, 2021 · 8 comments

asfimport commented Aug 23, 2021

CombinedFieldQuery's Scorer doesn't implement advanceShallow/getMaxScore, forcing Lucene to collect all matches in order to figure out the top-k hits.
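For context, these are the two Scorer hooks in question. Below is a minimal illustrative sketch (not actual Lucene code; the class name is made up) of what a scorer without impact support effectively tells the collector:

```java
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.Weight;

// Illustrative only: a scorer with no per-block impact data can only give
// these answers, which tell collectors that every document may be
// competitive and thereby disable early termination.
abstract class UnprunableScorer extends Scorer {

  protected UnprunableScorer(Weight weight) {
    super(weight);
  }

  @Override
  public int advanceShallow(int target) throws IOException {
    // No block boundaries to report: treat the whole postings list as one block.
    return DocIdSetIterator.NO_MORE_DOCS;
  }

  @Override
  public float getMaxScore(int upTo) throws IOException {
    // No per-block upper bound available, so nothing can ever be skipped.
    return Float.POSITIVE_INFINITY;
  }
}
```

Since getMaxScore can only answer Float.POSITIVE_INFINITY here, the minimum competitive score can never rise high enough for the collector to skip any block.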


Migrated from LUCENE-10061 by Adrien Grand (@jpountz), updated Feb 01 2022
Attachments: CombinedFieldQueryTasks.wikimedium.10M.nostopwords.tasks
Pull requests: #418

asfimport commented Oct 16, 2021

Zach Chen (@zacharymorn) (migrated from JIRA)

Hi @jpountz, I'm interested in working on this one, but have a question on its potential implementation and would like to get some advice on it.

I found #9359 during my research for this, and thought the solution should be very similar here (using merged impacts to prune docs that are not competitive), except perhaps for how the impacts get merged. However, while I understand that for SynonymQuery impacts can be merged efficiently by summing term frequencies for each unique norm value, since the impacts all come from the same field, I'm not sure how that could be done efficiently for CombinedFieldsQuery. If I understand it correctly, in order to merge impacts from multiple fields for CombinedFieldsQuery, we may need to compute all the possible summation combinations of competitive {freq, norm} across all fields, and then find the competitive ones among those. So for 4 fields with a list of 4 competitive impacts each, in the worst case the merge may need to compute 4 * 4 * 4 * 4 = 256 combinations of merged impacts ({field1FreqA + field2FreqB + field3FreqC + field4FreqD, field1NormA + field2NormB + field3NormC + field4NormD}), and then filter out the ones that are not competitive. This seems inefficient.
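To make the blow-up concrete, here is an illustrative sketch of the naive merge just described, using Lucene's org.apache.lucene.index.Impact (class and method names are made up):

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.Impact;

// Illustrative only: the naive cartesian-product merge described above.
// For F fields with N competitive impacts each, this enumerates N^F
// candidate {freq, norm} pairs, which is exactly the blow-up in question.
class NaiveImpactMerge {
  static List<Impact> merge(List<List<Impact>> perFieldImpacts) {
    List<Impact> merged = new ArrayList<>();
    merged.add(new Impact(0, 0L)); // start from the empty combination
    for (List<Impact> fieldImpacts : perFieldImpacts) {
      List<Impact> next = new ArrayList<>();
      for (Impact partial : merged) {
        for (Impact impact : fieldImpacts) {
          // Sum freqs and norms across one pick per field.
          next.add(new Impact(partial.freq + impact.freq, partial.norm + impact.norm));
        }
      }
      merged = next;
    }
    // A real implementation would also have to prune dominated pairs here
    // (lower freq and higher norm than some other pair); omitted for brevity.
    return merged;
  }
}
```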

I'm wondering if you may have any suggestion on this, or if using impacts for CombinedFieldsQuery pruning support is the right approach to begin with?

Adrien Grand (@jpountz) (migrated from JIRA)

in order to merge impacts from multiple fields for CombinedFieldsQuery, we may need to compute all the possible summation combinations of competitive {freq, norm} across all fields

I agree that there is a combinatorial explosion issue, and I fear that it's even worse than the example that you gave since we also need to consider the case when some fields do not match the query.

In the examples I've seen, there's often a field that has a much higher weight than other fields (e.g. a title field that has a 10x greater weight than a body field), so I am wondering if we could leverage this property to start from the impacts of the field that has the highest weight and see how we can cheaply incorporate impacts from other fields, even if this would overestimate the actual maximum score for the query.
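Sketching that idea for illustration: one cheap over-estimate would be to fold each lower-weight field into a single {max freq, min norm} bound and apply it on top of the dominant field's impact list. The class and method names below are made up, and it assumes a smaller encoded norm can only raise the score upper bound:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.Impact;

// Illustrative sketch: keep the highest-weight field's impacts intact and
// fold every other field in as a single per-field upper bound. The result
// overestimates the true max score but stays linear in the number of
// impacts instead of exponential.
class DominantFieldMerge {
  static List<Impact> merge(List<Impact> dominantField, List<List<Impact>> otherFields) {
    int extraFreq = 0;
    long minNorm = Long.MAX_VALUE;
    for (List<Impact> impacts : otherFields) {
      int maxFreq = 0;
      for (Impact impact : impacts) {
        maxFreq = Math.max(maxFreq, impact.freq);
        // Assumption: a smaller encoded norm can only increase the score
        // upper bound, so taking the minimum keeps the estimate safe.
        minNorm = Math.min(minNorm, impact.norm);
      }
      extraFreq += maxFreq; // pretend every other field matches at its best
    }
    List<Impact> merged = new ArrayList<>();
    for (Impact impact : dominantField) {
      merged.add(new Impact(impact.freq + extraFreq, Math.min(impact.norm, minNorm)));
    }
    return merged;
  }
}
```

This only errs on the safe side: an over-estimated max score can never cause a competitive document to be skipped, it can only make pruning less aggressive.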

Zach Chen (@zacharymorn) (migrated from JIRA)

Thanks for the confirmation @jpountz! I've actually given it a try in the last few days and just opened a WIP PR #418 for it, before seeing your comment above.

From the results of a few samples (documented in the PR), and assuming there's no bug in the implementation, it does seem that the basic pruning is most effective overall when there's a significant difference between the terms' doc frequencies (HighLow), but indeed slows things down when the doc frequencies are close (HighHigh / HighMed), most likely because the overhead of the combinatorial calculation / pruning logic outweighs the benefit of skipping. I will try to implement your optimization idea above as well and see how it performs.

In addition, I have been searching around to see if I can leverage luceneutil for benchmarking, but I can't seem to find a way to express combined fields queries like those in https://github.com/mikemccand/luceneutil/blob/master/tasks/wikimedium.10M.tasks . I'm wondering if you may have any pointers for that as well?

Adrien Grand (@jpountz) (migrated from JIRA)

For luceneutil integration, you will need to enhance TaskParser to create CombinedFieldQuery instances. You can look at how it handles minimumShouldMatch for inspiration.
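For illustration, here is a rough sketch of the query-building side that such a TaskParser change would ultimately need to produce (the helper and the task syntax are hypothetical; CombinedFieldQuery.Builder is the sandbox API):

```java
import org.apache.lucene.sandbox.search.CombinedFieldQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.BytesRef;

class CombinedFieldTaskParsing {
  // Hypothetical helper: turn a field spec like "title^4.0 body^2.0"
  // plus the query terms into a CombinedFieldQuery.
  static Query parse(String fieldSpec, String queryText) {
    CombinedFieldQuery.Builder builder = new CombinedFieldQuery.Builder();
    for (String field : fieldSpec.split("\\s+")) {
      String[] parts = field.split("\\^"); // "title^4.0" -> name, weight
      builder.addField(parts[0], Float.parseFloat(parts[1]));
    }
    for (String term : queryText.split("\\s+")) {
      builder.addTerm(new BytesRef(term));
    }
    return builder.build();
  }
}
```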

asfimport commented Oct 30, 2021

Zach Chen (@zacharymorn) (migrated from JIRA)

Thanks @jpountz for the pointer! I have created a spin-off task for luceneutil integration #11249, and will actually work on it first and circle back to this task afterward. 

Zach Chen (@zacharymorn) (migrated from JIRA)

Hi @jpountz, I've implemented a quick optimization to replace the combinatorial calculation with an upper-bound approximation (commit).

With this and other bug fixes / optimizations based on CPU profiling, I was able to get the following performance test results (the perf test index was rebuilt to enable norms for the title field, the task file is attached, and the luceneutil integration is available at https://github.com/mikemccand/luceneutil/pull/148):

# Run 1
                TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
     CFQHighHighHigh        4.64      (6.5%)        3.30      (4.7%)  -29.0% ( -37% -  -19%) 0.000
         CFQHighHigh       11.09      (6.0%)        9.61      (6.0%)  -13.3% ( -23% -   -1%) 0.000
            PKLookup      103.38      (4.4%)      108.04      (4.3%)    4.5% (  -4% -   13%) 0.001
       CFQHighMedLow       10.58      (6.1%)       12.30      (8.7%)   16.2% (   1% -   33%) 0.000
          CFQHighMed       10.70      (7.4%)       15.51     (11.2%)   44.9% (  24% -   68%) 0.000
       CFQHighLowLow        8.18      (8.2%)       12.87     (11.6%)   57.3% (  34% -   84%) 0.000
          CFQHighLow       14.57      (7.5%)       30.81     (15.1%)  111.4% (  82% -  144%) 0.000


# Run 2
                TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
     CFQHighHighHigh        5.33      (5.7%)        4.02      (7.7%)  -24.4% ( -35% -  -11%) 0.000
       CFQHighLowLow       17.14      (6.2%)       13.06      (5.4%)  -23.8% ( -33% -  -13%) 0.000
          CFQHighMed       17.37      (5.8%)       14.38      (7.7%)  -17.2% ( -29% -   -3%) 0.000
            PKLookup      103.57      (5.5%)      108.84      (5.9%)    5.1% (  -6% -   17%) 0.005
       CFQHighMedLow       11.25      (7.2%)       12.70      (9.0%)   12.9% (  -3% -   31%) 0.000
         CFQHighHigh        5.00      (6.2%)        7.54     (12.1%)   51.0% (  30% -   73%) 0.000
          CFQHighLow       21.60      (5.2%)       34.57     (14.1%)   60.0% (  38% -   83%) 0.000


# Run 3
                TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
     CFQHighHighHigh        5.40      (6.9%)        4.06      (5.1%)  -24.8% ( -34% -  -13%) 0.000
       CFQHighMedLow        7.64      (7.4%)        5.79      (6.3%)  -24.2% ( -35% -  -11%) 0.000
         CFQHighHigh       11.11      (7.0%)        9.60      (5.9%)  -13.6% ( -24% -    0%) 0.000
       CFQHighLowLow       21.21      (7.6%)       21.22      (6.6%)    0.0% ( -13% -   15%) 0.993
            PKLookup      103.15      (5.9%)      107.60      (6.9%)    4.3% (  -8% -   18%) 0.034
          CFQHighLow       21.85      (8.1%)       34.18     (13.5%)   56.4% (  32% -   84%) 0.000
          CFQHighMed       12.07      (8.4%)       19.98     (16.7%)   65.5% (  37% -   98%) 0.000


# Run 4
                TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
         CFQHighHigh        8.50      (5.8%)        6.85      (5.2%)  -19.5% ( -28% -   -8%) 0.000
       CFQHighMedLow       10.89      (5.7%)        8.96      (5.4%)  -17.8% ( -27% -   -7%) 0.000
          CFQHighMed        8.41      (5.8%)        7.74      (5.6%)   -7.9% ( -18% -    3%) 0.000
     CFQHighHighHigh        3.45      (6.7%)        3.38      (5.3%)   -2.0% ( -13% -   10%) 0.287
       CFQHighLowLow        7.82      (6.4%)        8.20      (7.5%)    4.8% (  -8% -   20%) 0.030
            PKLookup      103.50      (5.0%)      110.69      (5.4%)    6.9% (  -3% -   18%) 0.000
          CFQHighLow       11.46      (6.0%)       13.16      (6.7%)   14.8% (   1% -   29%) 0.000

I think overall this shows that the pruning is most effective when there's a significant difference between the terms' frequencies, but slows things down when they are close, as the cost of pruning outweighs the benefit of skipping. I'm wondering if we should then gate the pruning by checking the frequencies as well, but from some quick trials that seems to be an expensive operation. Do you have any recommendation for this scenario?

asfimport commented Nov 8, 2021

Adrien Grand (@jpountz) (migrated from JIRA)

Thanks for exploring this area @zacharymorn! I wonder if #10375 could be helpful to reduce the overhead of pruning, since MaxScore tends to have lower overhead than WAND.

I see that you tested with 4 and 2 as boost values. I wonder if it makes a difference if you try out e.g. 20 and 1 instead. I just looked again at table 3.1 in https://www.staff.city.ac.uk/~sbrp622/papers/foundations_bm25_review.pdf and the optimal weights that they found for title/body were 38.4/1 on one dataset and 13.5/1 on another dataset.

asfimport commented Nov 9, 2021

Zach Chen (@zacharymorn) (migrated from JIRA)

Thanks for exploring this area @zacharymorn!

No problem, I'm always interested in exploring and learning about lucene querying!

I wonder if #10375 could be helpful to reduce the overhead of pruning, since MaxScore tends to have lower overhead than WAND.

I think, from my current understanding and testing of CombinedFieldQuery, WANDScorer is actually not used there (it doesn't get rewritten to a BooleanQuery). In addition, the PR is already doing a MaxScore-like calculation based on competitive impacts to skip docs. Am I missing anything here?
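For context, the MaxScore-like skipping referred to here follows roughly this pattern (an illustrative sketch, not the PR's actual code; the class and method names are made up):

```java
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.Scorer;

class BlockMaxSkipper {
  // Find the next doc at or after `target` whose impact block could still
  // be competitive, skipping whole blocks whose max score is too low.
  static int nextCompetitiveDoc(Scorer scorer, int target, float minCompetitiveScore)
      throws IOException {
    DocIdSetIterator iterator = scorer.iterator();
    int doc = target;
    while (true) {
      // Last doc id covered by the current impact block.
      int blockEnd = scorer.advanceShallow(doc);
      if (scorer.getMaxScore(blockEnd) >= minCompetitiveScore) {
        return iterator.advance(doc); // block may contain a competitive hit
      }
      if (blockEnd == DocIdSetIterator.NO_MORE_DOCS) {
        return DocIdSetIterator.NO_MORE_DOCS; // no more blocks to inspect
      }
      doc = blockEnd + 1; // whole block is non-competitive: skip past it
    }
  }
}
```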

I see that you tested with 4 and 2 as boost values. I wonder if it makes a difference if you try out e.g. 20 and 1 instead. I just looked again at table 3.1 in https://www.staff.city.ac.uk/~sbrp622/papers/foundations_bm25_review.pdf and the optimal weights that they found for title/body were 38.4/1 on one dataset and 13.5/1 on another dataset.

Sounds good, I will give that a try!
