Initialize facet counting data structures lazily #12408

Merged: 3 commits, Jul 25, 2023

gsmiller
Contributor

This change proposes some faceting optimizations for situations where there are no hits to facet over. While this seems like an odd scenario, at Amazon product search we actually have some situations where this can become common (sparse queries that don't have results in some segments, or have no results altogether).

You could argue that the calling code should handle this optimization and avoid faceting altogether if there are no hits, but that only solves for the case of no results, not no results in some segments. I think it's also nice to move this optimization into the faceting module so that every user doesn't have to do this themselves.
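To make the distinction in the paragraph above concrete, here is a minimal sketch that compares the two guards. It is a toy model, not Lucene code: the class `SegmentGuardSketch`, its method names, and the representation of a search as an array of per-segment hit counts are all assumptions made for illustration.

```java
/** Hypothetical model: each array element is the number of hits one index segment produced. */
final class SegmentGuardSketch {

  /** Caller-side guard: faceting is skipped only when the whole query matched nothing. */
  static int segmentsVisitedWithCallerGuard(int[] hitsPerSegment) {
    int totalHits = 0;
    for (int hits : hitsPerSegment) {
      totalHits += hits;
    }
    if (totalHits == 0) {
      return 0; // no results at all, so the caller never invokes faceting
    }
    return hitsPerSegment.length; // otherwise every segment is visited, empty or not
  }

  /** Guard inside the faceting code: each empty segment is skipped individually. */
  static int segmentsVisitedWithPerSegmentGuard(int[] hitsPerSegment) {
    int visited = 0;
    for (int hits : hitsPerSegment) {
      if (hits == 0) {
        continue; // nothing to count in this segment
      }
      visited++;
    }
    return visited;
  }
}
```

For a query with hits in only two of four segments, the caller-side guard still visits all four, while the per-segment guard visits two. Both guards agree when the query has no results at all.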

Curious what people think of this idea. Happy to hear feedback, counterarguments, etc. :)

This change covers:

  • Taxonomy faceting
    • FastTaxonomyFacetCounts
    • TaxonomyFacetIntAssociations
    • TaxonomyFacetFloatAssociations
  • SSDV faceting
    • SortedSetDocValuesFacetCounts
    • ConcurrentSortedSetDocValuesFacetCounts
    • StringValueFacetCounts
  • Range faceting:
    • LongRangeFacetCounts
    • DoubleRangeFacetCounts
  • Long faceting:
    • LongValueFacetCounts

Left for a future iteration (I'll open a follow up issue if we move forward with this one):

  • RangeOnRange faceting
  • FacetSet faceting
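As a rough illustration of what "lazy" means here, the sketch below defers allocating the counts array until the first hit arrives. It is not the code in this PR: `LazyFacetCountsSketch`, its fields, and its methods are hypothetical stand-ins, and the real classes listed above have different structures and APIs.

```java
/** Hypothetical stand-in for a facet counter; not one of the Lucene classes listed above. */
final class LazyFacetCountsSketch {
  private final int cardinality; // number of distinct facet ordinals (assumed known up front)
  private int[] counts;          // stays null until the first hit is seen

  LazyFacetCountsSketch(int cardinality) {
    this.cardinality = cardinality;
  }

  /** Counts the ordinals of one segment's hits; a segment with no hits allocates nothing. */
  void countSegment(int[] hitOrdinals) {
    if (hitOrdinals.length == 0) {
      return; // skip both the allocation and the counting loop
    }
    if (counts == null) {
      counts = new int[cardinality]; // allocated once, on the first non-empty segment
    }
    for (int ord : hitOrdinals) {
      counts[ord]++;
    }
  }

  /** Readers must tolerate the unallocated state and report zero. */
  int getCount(int ord) {
    return counts == null ? 0 : counts[ord];
  }

  boolean isInitialized() {
    return counts != null;
  }
}
```

The cost of this pattern is that every read path has to handle the `null` case. The benefit is that a query with no hits never pays for an array sized to the full ordinal space.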

@gsmiller
Contributor Author

Note that luceneutil benchmarks (wikimedium10m) show no obvious change:

                            Task  QPS baseline      StdDev  QPS candidate      StdDev                Pct diff p-value
     BrowseRandomLabelSSDVFacets       10.42     (11.2%)       10.00      (6.3%)   -4.0% ( -19% -   15%) 0.165
            HighIntervalsOrdered        7.44      (3.6%)        7.34      (3.4%)   -1.2% (  -7% -    5%) 0.260
             MedIntervalsOrdered       26.25      (3.0%)       25.94      (2.6%)   -1.2% (  -6% -    4%) 0.191
             LowIntervalsOrdered       50.62      (2.6%)       50.04      (2.4%)   -1.1% (  -5% -    3%) 0.141
               HighTermTitleSort       91.96      (3.5%)       91.28      (5.4%)   -0.7% (  -9% -    8%) 0.611
                       LowPhrase      564.16      (2.8%)      560.27      (3.0%)   -0.7% (  -6% -    5%) 0.448
                      HighPhrase       60.59      (3.3%)       60.25      (3.7%)   -0.5% (  -7% -    6%) 0.626
                       MedPhrase       34.94      (2.2%)       34.77      (2.3%)   -0.5% (  -4% -    4%) 0.485
                HighSloppyPhrase        6.86      (3.5%)        6.83      (4.8%)   -0.5% (  -8% -    8%) 0.719
                     MedSpanNear       31.74      (1.4%)       31.60      (1.5%)   -0.5% (  -3% -    2%) 0.321
                     LowSpanNear      175.09      (1.7%)      174.36      (1.7%)   -0.4% (  -3% -    3%) 0.445
                      OrHighHigh       32.32      (4.4%)       32.20      (4.2%)   -0.4% (  -8% -    8%) 0.786
                          Fuzzy2       43.49      (1.4%)       43.35      (1.1%)   -0.3% (  -2% -    2%) 0.426
                          Fuzzy1       84.24      (1.2%)       83.98      (1.1%)   -0.3% (  -2% -    2%) 0.402
                 MedSloppyPhrase       66.92      (2.5%)       66.73      (2.7%)   -0.3% (  -5% -    5%) 0.723
                    HighSpanNear       27.97      (1.9%)       27.89      (2.1%)   -0.3% (  -4% -    3%) 0.648
                         Respell       56.04      (1.6%)       55.90      (1.7%)   -0.3% (  -3% -    3%) 0.615
                 LowSloppyPhrase        9.17      (3.0%)        9.14      (3.8%)   -0.3% (  -6% -    6%) 0.814
         AndHighMedDayTaxoFacets       18.77      (2.0%)       18.73      (2.0%)   -0.3% (  -4% -    3%) 0.690
          OrHighMedDayTaxoFacets        9.85      (5.1%)        9.82      (6.0%)   -0.3% ( -10% -   11%) 0.887
                         LowTerm      662.10      (5.3%)      660.84      (4.2%)   -0.2% (  -9% -    9%) 0.900
                       OrHighMed      130.08      (2.5%)      129.85      (3.3%)   -0.2% (  -5% -    5%) 0.850
            HighTermTitleBDVSort       20.10      (5.0%)       20.08      (4.5%)   -0.1% (  -9% -    9%) 0.935
        AndHighHighDayTaxoFacets       10.91      (2.6%)       10.91      (2.7%)   -0.0% (  -5% -    5%) 0.956
                       OrHighLow      155.39      (4.4%)      155.50      (4.9%)    0.1% (  -8% -    9%) 0.960
                        Wildcard       41.13      (2.7%)       41.17      (2.9%)    0.1% (  -5% -    5%) 0.924
            MedTermDayTaxoFacets       44.29      (2.7%)       44.34      (2.6%)    0.1% (  -5% -    5%) 0.892
                      TermDTSort      113.73      (2.1%)      113.89      (2.2%)    0.1% (  -4% -    4%) 0.832
           HighTermDayOfYearSort      228.36      (3.0%)      228.92      (3.0%)    0.2% (  -5% -    6%) 0.799
       BrowseDayOfYearSSDVFacets       13.10      (2.6%)       13.14      (2.3%)    0.3% (  -4% -    5%) 0.695
                         MedTerm      760.46      (5.1%)      763.10      (4.1%)    0.3% (  -8% -    9%) 0.812
                      AndHighLow      771.16      (3.1%)      775.01      (1.9%)    0.5% (  -4% -    5%) 0.533
               HighTermMonthSort     2542.63      (4.0%)     2556.42      (3.3%)    0.5% (  -6% -    8%) 0.640
                         Prefix3      114.92      (1.5%)      115.66      (1.5%)    0.6% (  -2% -    3%) 0.164
                    OrNotHighLow     1063.02      (2.4%)     1071.85      (2.5%)    0.8% (  -4% -    5%) 0.287
                          IntNRQ       69.75      (7.4%)       70.36      (7.5%)    0.9% ( -13% -   16%) 0.712
                      AndHighMed      175.09      (5.4%)      176.80      (5.7%)    1.0% (  -9% -   12%) 0.577
                        PKLookup      176.37      (2.5%)      178.19      (2.3%)    1.0% (  -3% -    5%) 0.170
                   OrHighNotHigh      445.72      (4.5%)      450.81      (5.6%)    1.1% (  -8% -   11%) 0.477
                    OrNotHighMed      354.86      (4.2%)      358.95      (5.1%)    1.2% (  -7% -   10%) 0.438
                     AndHighHigh       58.67      (5.7%)       59.38      (6.5%)    1.2% ( -10% -   14%) 0.528
                        HighTerm      557.96      (5.3%)      565.01      (4.9%)    1.3% (  -8% -   12%) 0.432
            BrowseDateSSDVFacets        3.60      (7.3%)        3.65      (8.0%)    1.3% ( -12% -   17%) 0.588
           BrowseMonthSSDVFacets       13.87      (2.5%)       14.08      (1.8%)    1.5% (  -2% -    5%) 0.032
                   OrNotHighHigh      249.87      (5.3%)      254.47      (6.1%)    1.8% (  -9% -   13%) 0.307
            BrowseDateTaxoFacets       30.57     (14.1%)       31.14      (4.5%)    1.9% ( -14% -   23%) 0.571
       BrowseDayOfYearTaxoFacets       30.78     (14.2%)       31.38      (4.7%)    2.0% ( -14% -   24%) 0.558
     BrowseRandomLabelTaxoFacets       22.89     (12.5%)       23.35      (2.5%)    2.0% ( -11% -   19%) 0.473
                    OrHighNotMed      278.47      (4.8%)      284.88      (6.6%)    2.3% (  -8% -   14%) 0.209
                    OrHighNotLow      312.77      (5.9%)      321.00      (7.6%)    2.6% ( -10% -   17%) 0.218
           BrowseMonthTaxoFacets       31.51     (14.7%)       32.62      (4.4%)    3.5% ( -13% -   26%) 0.300

@gsmiller
Contributor Author

In some internal benchmarks though, for situations where it's semi-common to get no results or sparse results, we see a 9.6% average latency reduction. In cases where results tend to be pretty dense, this still offers a 1.9% average latency reduction. So it's definitely helpful in situations where sparse/no-results are semi-common.

@mikemccand
Member

+1, this makes sense to me.

@gsmiller
Contributor Author

gsmiller commented Jul 5, 2023

Thanks @mikemccand. Just removed the errant "nocommit" comment I left hanging in the initial PR (doh!) and added a CHANGES entry, so this should be a clean change now.

Member

@mikemccand mikemccand left a comment

Thanks @gsmiller!

@gsmiller gsmiller merged commit 179b45b into apache:main Jul 25, 2023
4 checks passed
@gsmiller gsmiller deleted the explore/lazy-faceting branch July 25, 2023 19:20
@gsmiller gsmiller added this to the 10.0.0 milestone Jul 25, 2023
@jpountz
Contributor

jpountz commented Aug 3, 2023

Is this change the reason why we are seeing a major slowdown on AndHighMedDayTaxoFacets and speedup on OrHighMedDayTaxoFacets and OrHighMedDayTaxoFacets? I wouldn't expect these queries to have sparse hits. Maybe the introduction of counting tasks is also related to this change by making the JVM compile things differently? (facets were the only tasks to use non-scoring boolean queries before, not anymore)

stefanvodita pushed a commit to stefanvodita/lucene that referenced this pull request Apr 13, 2024
stefanvodita added a commit that referenced this pull request May 3, 2024
Co-authored-by: Greg Miller <gsmiller@gmail.com>
@stefanvodita stefanvodita modified the milestones: 10.0.0, 9.11.0 May 3, 2024