CombinedFieldsQuery needs dynamic pruning support [LUCENE-10061] #11099
Comments
Zach Chen (@zacharymorn) (migrated from JIRA) Hi @jpountz, I'm interested in working on this one, but have a question about its potential implementation and would like to get some advice. I found #9359 during my research, and I thought the solution should be very similar here (using merged impacts to prune docs that are not competitive), except perhaps for how the impacts get merged. While I understand that for SynonymQuery impacts can be merged efficiently by summing term frequencies for each unique norm value, since the impacts all come from the same field, I'm not sure how that could be done efficiently for CombinedFieldsQuery. If I understand correctly, in order to merge impacts from multiple fields for CombinedFieldsQuery, we may need to compute all possible summation combinations of competitive {freq, norm} pairs across fields, and then find the competitive ones among them. So for 4 fields with a list of 4 competitive impacts each, in the worst case the impacts merge may need to compute 4 * 4 * 4 * 4 = 256 combinations of merged impacts ({field1FreqA + field2FreqB + field3FreqC + field4FreqD, field1NormA + field2NormB + field3NormC + field4NormD}) and then filter out the ones that are not competitive. This seems inefficient. Do you have any suggestions for this, or is using impacts for CombinedFieldsQuery pruning support the right approach to begin with?
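To make the blow-up concrete, here is a minimal, self-contained sketch of the cross-product merge described above. The `Impact` record is a hypothetical stand-in for Lucene's per-term (freq, norm) impact pairs, not the real `Impact`/`ImpactsEnum` API:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the combinatorial impact merge: one merged impact per
// combination of per-field impacts, i.e. k^n candidates for n fields
// with k competitive impacts each, before any pruning happens.
public class NaiveImpactMerge {
    record Impact(long freq, long norm) {}

    static List<Impact> merge(List<List<Impact>> perField) {
        List<Impact> merged = new ArrayList<>();
        merged.add(new Impact(0, 0)); // identity: nothing merged yet
        for (List<Impact> impacts : perField) {
            List<Impact> next = new ArrayList<>();
            for (Impact acc : merged) {
                for (Impact i : impacts) {
                    // sum freqs and norms across fields for this combination
                    next.add(new Impact(acc.freq() + i.freq(), acc.norm() + i.norm()));
                }
            }
            merged = next;
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Impact> one = List.of(new Impact(1, 4), new Impact(3, 8),
                                   new Impact(7, 16), new Impact(15, 32));
        List<List<Impact>> fourFields = List.of(one, one, one, one);
        // 4 fields x 4 impacts each -> 4^4 = 256 merged candidates
        System.out.println(merge(fourFields).size()); // prints 256
    }
}
```

The quadruple-nested growth is exactly the 4 * 4 * 4 * 4 = 256 case from the comment; the subsequent competitive-filtering step only runs after all candidates exist.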
Adrien Grand (@jpountz) (migrated from JIRA)
I agree that there is a combinatorial explosion issue, and I fear that it's even worse than the example that you gave since we also need to consider the case when some fields do not match the query. In the examples I've seen, there's often a field that has a much higher weight than other fields (e.g. a title field that has a 10x greater weight than a body field), so I am wondering if we could leverage this property to start from the impacts of the field that has the highest weight and see how we can cheaply incorporate impacts from other fields, even if this would overestimate the actual maximum score for the query.
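The overestimation idea can be sketched as follows: instead of enumerating combinations, pair each field's largest freq with its smallest norm, even if no single impact in that field has both. For any score that increases with freq and decreases with norm, this bound can never fall below the true cross-product maximum. Everything below is a toy illustration (the `Impact` record and `score` function are hypothetical, not Lucene's impacts or BM25):

```java
import java.util.List;

public class UpperBoundSketch {
    record Impact(long freq, long norm) {}

    // Toy saturating score: increasing in freq, decreasing in norm.
    static double score(long freq, long norm) {
        return (double) freq / (freq + norm);
    }

    // Exact maximum over the full cross product: O(k^n) recursion.
    static double exactMax(List<List<Impact>> perField) {
        return exactMax(perField, 0, 0L, 0L);
    }

    private static double exactMax(List<List<Impact>> perField, int i, long f, long n) {
        if (i == perField.size()) {
            return score(f, n);
        }
        double best = 0;
        for (Impact imp : perField.get(i)) {
            best = Math.max(best, exactMax(perField, i + 1, f + imp.freq(), n + imp.norm()));
        }
        return best;
    }

    // Cheap O(k * n) upper bound: best-case freq paired with best-case norm
    // per field, summed across fields.
    static double upperBound(List<List<Impact>> perField) {
        long f = 0, n = 0;
        for (List<Impact> impacts : perField) {
            long maxFreq = 0, minNorm = Long.MAX_VALUE;
            for (Impact imp : impacts) {
                maxFreq = Math.max(maxFreq, imp.freq());
                minNorm = Math.min(minNorm, imp.norm());
            }
            f += maxFreq;
            n += minNorm;
        }
        return score(f, n);
    }
}
```

The trade-off is looseness: a looser bound skips fewer blocks, but computing it is linear in the number of impacts rather than exponential in the number of fields.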
Zach Chen (@zacharymorn) (migrated from JIRA) Thanks for the confirmation @jpountz! I've actually given it a try over the last few days and just opened a WIP PR #418 for it, before seeing your comment above. From the results of a few samples (documented in the PR), assuming there's no bug in the implementation, basic pruning does seem most effective when there's a significant difference between terms' doc frequencies (HighLow), but it indeed slows things down when doc frequencies are close (HighHigh / HighMed), most likely because the overhead of the combinatorial calculation / pruning logic outweighs the benefit of skipping. I will try to implement your optimization idea above as well and see how it performs. In addition, I have been looking into leveraging luceneutil for benchmarking, but I can't seem to find a way to express a combined fields query like those in https://github.com/mikemccand/luceneutil/blob/master/tasks/wikimedium.10M.tasks . Do you have any pointers for that as well?
Adrien Grand (@jpountz) (migrated from JIRA) For luceneutil integration, you will need to enhance TaskParser to create CombinedFieldQuery instances. You can look at how it handles minimumShouldMatch for inspiration.
Zach Chen (@zacharymorn) (migrated from JIRA) Thanks @jpountz for the pointer! I have created a spin-off task for luceneutil integration, #11249, and will work on it first and circle back to this task afterward.
Zach Chen (@zacharymorn) (migrated from JIRA) Hi @jpountz, I've implemented a quick optimization to replace the combinatorial calculation with an upper-bound approximation (commit). With this and other bug fixes / optimizations guided by a CPU profiler, I was able to get the following performance test results (the perf test index was rebuilt to enable norms for the title field; the task file is attached, and the luceneutil integration is available at https://github.com/mikemccand/luceneutil/pull/148):

# Run 1

```
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
CFQHighHighHigh 4.64 (6.5%) 3.30 (4.7%) -29.0% ( -37% - -19%) 0.000
CFQHighHigh 11.09 (6.0%) 9.61 (6.0%) -13.3% ( -23% - -1%) 0.000
PKLookup 103.38 (4.4%) 108.04 (4.3%) 4.5% ( -4% - 13%) 0.001
CFQHighMedLow 10.58 (6.1%) 12.30 (8.7%) 16.2% ( 1% - 33%) 0.000
CFQHighMed 10.70 (7.4%) 15.51 (11.2%) 44.9% ( 24% - 68%) 0.000
CFQHighLowLow 8.18 (8.2%) 12.87 (11.6%) 57.3% ( 34% - 84%) 0.000
CFQHighLow 14.57 (7.5%) 30.81 (15.1%) 111.4% ( 82% - 144%) 0.000
```

# Run 2

```
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
CFQHighHighHigh 5.33 (5.7%) 4.02 (7.7%) -24.4% ( -35% - -11%) 0.000
CFQHighLowLow 17.14 (6.2%) 13.06 (5.4%) -23.8% ( -33% - -13%) 0.000
CFQHighMed 17.37 (5.8%) 14.38 (7.7%) -17.2% ( -29% - -3%) 0.000
PKLookup 103.57 (5.5%) 108.84 (5.9%) 5.1% ( -6% - 17%) 0.005
CFQHighMedLow 11.25 (7.2%) 12.70 (9.0%) 12.9% ( -3% - 31%) 0.000
CFQHighHigh 5.00 (6.2%) 7.54 (12.1%) 51.0% ( 30% - 73%) 0.000
CFQHighLow 21.60 (5.2%) 34.57 (14.1%) 60.0% ( 38% - 83%) 0.000
```

# Run 3

```
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
CFQHighHighHigh 5.40 (6.9%) 4.06 (5.1%) -24.8% ( -34% - -13%) 0.000
CFQHighMedLow 7.64 (7.4%) 5.79 (6.3%) -24.2% ( -35% - -11%) 0.000
CFQHighHigh 11.11 (7.0%) 9.60 (5.9%) -13.6% ( -24% - 0%) 0.000
CFQHighLowLow 21.21 (7.6%) 21.22 (6.6%) 0.0% ( -13% - 15%) 0.993
PKLookup 103.15 (5.9%) 107.60 (6.9%) 4.3% ( -8% - 18%) 0.034
CFQHighLow 21.85 (8.1%) 34.18 (13.5%) 56.4% ( 32% - 84%) 0.000
CFQHighMed 12.07 (8.4%) 19.98 (16.7%) 65.5% ( 37% - 98%) 0.000
```

# Run 4

```
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
CFQHighHigh 8.50 (5.8%) 6.85 (5.2%) -19.5% ( -28% - -8%) 0.000
CFQHighMedLow 10.89 (5.7%) 8.96 (5.4%) -17.8% ( -27% - -7%) 0.000
CFQHighMed 8.41 (5.8%) 7.74 (5.6%) -7.9% ( -18% - 3%) 0.000
CFQHighHighHigh 3.45 (6.7%) 3.38 (5.3%) -2.0% ( -13% - 10%) 0.287
CFQHighLowLow 7.82 (6.4%) 8.20 (7.5%) 4.8% ( -8% - 20%) 0.030
PKLookup 103.50 (5.0%) 110.69 (5.4%) 6.9% ( -3% - 18%) 0.000
CFQHighLow 11.46 (6.0%) 13.16 (6.7%) 14.8% ( 1% - 29%) 0.000
```

I think overall this shows that the pruning is most effective when there's a significant difference between terms' frequencies, but slows things down when they are close, as the cost of pruning outweighs the benefit of skipping. I'm wondering if we should then gate the pruning by checking the frequencies as well, but from some quick trials that seems to be an expensive operation. Do you have any recommendation for this scenario?
Adrien Grand (@jpountz) (migrated from JIRA) Thanks for exploring this area @zacharymorn! I wonder if #10375 could be helpful to reduce the overhead of pruning, since Maxscore tends to have lower overhead than WAND. I see that you tested with 4 and 2 as boost values. I wonder if it makes a difference if you try out e.g. 20 and 1 instead. I just looked again at table 3.1 of https://www.staff.city.ac.uk/~sbrp622/papers/foundations_bm25_review.pdf and the optimal weights that they found for title/body were 38.4/1 on one dataset and 13.5/1 on another dataset.
Zach Chen (@zacharymorn) (migrated from JIRA)
No problem, I'm always interested in exploring and learning about Lucene querying!
From my current understanding and testing of CombinedFieldQuery, WANDScorer is actually not used there (the query doesn't get rewritten to a BooleanQuery). In addition, the PR already performs a Maxscore-like calculation based on competitive impacts to skip docs. Am I missing anything here?
Sounds good, I will give that a try!
CombinedFieldQuery's Scorer doesn't implement advanceShallow/getMaxScore, forcing Lucene to collect all matches in order to figure out the top-k hits.
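For context, advanceShallow/getMaxScore are what let a scorer expose a score upper bound for a range ("block") of documents, so that whole blocks can be skipped once the top-k threshold rises above their bound. A toy, self-contained illustration of that effect (this models the idea only; it is not Lucene's Scorer API):

```java
// Block-max skipping in miniature: each block of doc IDs carries a
// precomputed maximum score, and blocks whose max is below the current
// minimum competitive score are skipped without scoring any of their docs.
public class BlockMaxSkipping {
    record Block(int[] docs, float maxScore) {}

    // Returns how many docs actually get scored after block-level skipping.
    static int countScoredDocs(Block[] blocks, float minCompetitiveScore) {
        int scored = 0;
        for (Block block : blocks) {
            if (block.maxScore() < minCompetitiveScore) {
                continue; // nothing in this block can make the top-k
            }
            scored += block.docs().length; // only these docs are fully scored
        }
        return scored;
    }

    public static void main(String[] args) {
        Block[] blocks = {
            new Block(new int[] {0, 1, 2}, 0.3f),
            new Block(new int[] {10, 11}, 0.9f),
            new Block(new int[] {20, 21, 22, 23}, 0.2f),
        };
        // With a threshold of 0.5, only the middle block's 2 docs are scored.
        System.out.println(countScoredDocs(blocks, 0.5f)); // prints 2
    }
}
```

Without getMaxScore there is no per-block bound, which is why the current Scorer must score every match (the threshold-0 case below) to find the top-k hits.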
Migrated from LUCENE-10061 by Adrien Grand (@jpountz), updated Feb 01 2022
Attachments: CombinedFieldQueryTasks.wikimedium.10M.nostopwords.tasks
Linked issues:
Pull requests: #418