Initialize facet counting data structures lazily #12408

gsmiller · 2023-06-30T23:07:02Z

This change proposes some faceting optimizations for situations where there are no hits to facet over. While this seems like an odd scenario, at Amazon product search we actually have some situations where this can become common (sparse queries that don't have results in some segments, or have no results altogether).

You could argue that the calling code should handle this optimization and avoid faceting altogether if there are no hits, but that only covers the case of no results at all, not the case where only some segments have no results. I think it's also nice to move this optimization into the faceting module so that every user doesn't have to do this themselves.
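
A minimal sketch of the general idea, not the code in this PR: the counting array is allocated only when a segment actually contributes a hit, so an empty segment (or an empty result set) costs nothing. The class `LazyFacetCounts` and its methods are hypothetical names for illustration, and each "segment" is modelled as a plain array of facet ordinals rather than real Lucene per-segment hits.

```java
/**
 * Hypothetical illustration only; this is not the Lucene implementation.
 * Each "segment" is modelled as an array of facet ordinals, one per matching hit.
 */
final class LazyFacetCounts {
  private final int cardinality; // number of distinct facet ordinals
  private int[] counts; // stays null until the first hit is seen

  LazyFacetCounts(int cardinality) {
    this.cardinality = cardinality;
  }

  /** Counts one segment's hits; a segment with no hits does no work at all. */
  void countSegment(int[] hitOrdinals) {
    if (hitOrdinals.length == 0) {
      return; // nothing matched here, so skip allocation and iteration
    }
    if (counts == null) {
      counts = new int[cardinality]; // allocate on the first real hit
    }
    for (int ord : hitOrdinals) {
      counts[ord]++;
    }
  }

  /** Returns the count for an ordinal, treating "never allocated" as all zeros. */
  int getCount(int ord) {
    return counts == null ? 0 : counts[ord];
  }

  boolean isAllocated() {
    return counts != null;
  }

  public static void main(String[] args) {
    LazyFacetCounts facets = new LazyFacetCounts(3);
    facets.countSegment(new int[0]); // empty segment: no allocation
    System.out.println("allocated after empty segment: " + facets.isAllocated());
    facets.countSegment(new int[] {0, 2, 2}); // segment with three hits
    System.out.println("count for ordinal 2: " + facets.getCount(2));
  }
}
```

Note that every read path has to tolerate the array never having been allocated, which is why `getCount` checks for `null` before indexing.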

Curious what people think of this idea. Happy to hear feedback, counterarguments, etc. :)

This change covers:

- Taxonomy faceting
  - FastTaxonomyFacetCounts
  - TaxonomyFacetIntAssociations
  - TaxonomyFacetFloatAssociations
- SSDV faceting
  - SortedSetDocValuesFacetCounts
  - ConcurrentSortedSetDocValuesFacetCounts
  - StringValueFacetCounts
- Range faceting
  - LongRangeFacetCounts
  - DoubleRangeFacetCounts
- Long faceting
  - LongValueFacetCounts

Left for a future iteration (I'll open a follow-up issue if we move forward with this one):

- RangeOnRange faceting
- FacetSet faceting

gsmiller · 2023-06-30T23:07:54Z

Note that luceneutil benchmarks (wikimedium10m) show no obvious change:

                            Task  QPS baseline      StdDev  QPS candidate      StdDev                Pct diff p-value
     BrowseRandomLabelSSDVFacets       10.42     (11.2%)       10.00      (6.3%)   -4.0% ( -19% -   15%) 0.165
            HighIntervalsOrdered        7.44      (3.6%)        7.34      (3.4%)   -1.2% (  -7% -    5%) 0.260
             MedIntervalsOrdered       26.25      (3.0%)       25.94      (2.6%)   -1.2% (  -6% -    4%) 0.191
             LowIntervalsOrdered       50.62      (2.6%)       50.04      (2.4%)   -1.1% (  -5% -    3%) 0.141
               HighTermTitleSort       91.96      (3.5%)       91.28      (5.4%)   -0.7% (  -9% -    8%) 0.611
                       LowPhrase      564.16      (2.8%)      560.27      (3.0%)   -0.7% (  -6% -    5%) 0.448
                      HighPhrase       60.59      (3.3%)       60.25      (3.7%)   -0.5% (  -7% -    6%) 0.626
                       MedPhrase       34.94      (2.2%)       34.77      (2.3%)   -0.5% (  -4% -    4%) 0.485
                HighSloppyPhrase        6.86      (3.5%)        6.83      (4.8%)   -0.5% (  -8% -    8%) 0.719
                     MedSpanNear       31.74      (1.4%)       31.60      (1.5%)   -0.5% (  -3% -    2%) 0.321
                     LowSpanNear      175.09      (1.7%)      174.36      (1.7%)   -0.4% (  -3% -    3%) 0.445
                      OrHighHigh       32.32      (4.4%)       32.20      (4.2%)   -0.4% (  -8% -    8%) 0.786
                          Fuzzy2       43.49      (1.4%)       43.35      (1.1%)   -0.3% (  -2% -    2%) 0.426
                          Fuzzy1       84.24      (1.2%)       83.98      (1.1%)   -0.3% (  -2% -    2%) 0.402
                 MedSloppyPhrase       66.92      (2.5%)       66.73      (2.7%)   -0.3% (  -5% -    5%) 0.723
                    HighSpanNear       27.97      (1.9%)       27.89      (2.1%)   -0.3% (  -4% -    3%) 0.648
                         Respell       56.04      (1.6%)       55.90      (1.7%)   -0.3% (  -3% -    3%) 0.615
                 LowSloppyPhrase        9.17      (3.0%)        9.14      (3.8%)   -0.3% (  -6% -    6%) 0.814
         AndHighMedDayTaxoFacets       18.77      (2.0%)       18.73      (2.0%)   -0.3% (  -4% -    3%) 0.690
          OrHighMedDayTaxoFacets        9.85      (5.1%)        9.82      (6.0%)   -0.3% ( -10% -   11%) 0.887
                         LowTerm      662.10      (5.3%)      660.84      (4.2%)   -0.2% (  -9% -    9%) 0.900
                       OrHighMed      130.08      (2.5%)      129.85      (3.3%)   -0.2% (  -5% -    5%) 0.850
            HighTermTitleBDVSort       20.10      (5.0%)       20.08      (4.5%)   -0.1% (  -9% -    9%) 0.935
        AndHighHighDayTaxoFacets       10.91      (2.6%)       10.91      (2.7%)   -0.0% (  -5% -    5%) 0.956
                       OrHighLow      155.39      (4.4%)      155.50      (4.9%)    0.1% (  -8% -    9%) 0.960
                        Wildcard       41.13      (2.7%)       41.17      (2.9%)    0.1% (  -5% -    5%) 0.924
            MedTermDayTaxoFacets       44.29      (2.7%)       44.34      (2.6%)    0.1% (  -5% -    5%) 0.892
                      TermDTSort      113.73      (2.1%)      113.89      (2.2%)    0.1% (  -4% -    4%) 0.832
           HighTermDayOfYearSort      228.36      (3.0%)      228.92      (3.0%)    0.2% (  -5% -    6%) 0.799
       BrowseDayOfYearSSDVFacets       13.10      (2.6%)       13.14      (2.3%)    0.3% (  -4% -    5%) 0.695
                         MedTerm      760.46      (5.1%)      763.10      (4.1%)    0.3% (  -8% -    9%) 0.812
                      AndHighLow      771.16      (3.1%)      775.01      (1.9%)    0.5% (  -4% -    5%) 0.533
               HighTermMonthSort     2542.63      (4.0%)     2556.42      (3.3%)    0.5% (  -6% -    8%) 0.640
                         Prefix3      114.92      (1.5%)      115.66      (1.5%)    0.6% (  -2% -    3%) 0.164
                    OrNotHighLow     1063.02      (2.4%)     1071.85      (2.5%)    0.8% (  -4% -    5%) 0.287
                          IntNRQ       69.75      (7.4%)       70.36      (7.5%)    0.9% ( -13% -   16%) 0.712
                      AndHighMed      175.09      (5.4%)      176.80      (5.7%)    1.0% (  -9% -   12%) 0.577
                        PKLookup      176.37      (2.5%)      178.19      (2.3%)    1.0% (  -3% -    5%) 0.170
                   OrHighNotHigh      445.72      (4.5%)      450.81      (5.6%)    1.1% (  -8% -   11%) 0.477
                    OrNotHighMed      354.86      (4.2%)      358.95      (5.1%)    1.2% (  -7% -   10%) 0.438
                     AndHighHigh       58.67      (5.7%)       59.38      (6.5%)    1.2% ( -10% -   14%) 0.528
                        HighTerm      557.96      (5.3%)      565.01      (4.9%)    1.3% (  -8% -   12%) 0.432
            BrowseDateSSDVFacets        3.60      (7.3%)        3.65      (8.0%)    1.3% ( -12% -   17%) 0.588
           BrowseMonthSSDVFacets       13.87      (2.5%)       14.08      (1.8%)    1.5% (  -2% -    5%) 0.032
                   OrNotHighHigh      249.87      (5.3%)      254.47      (6.1%)    1.8% (  -9% -   13%) 0.307
            BrowseDateTaxoFacets       30.57     (14.1%)       31.14      (4.5%)    1.9% ( -14% -   23%) 0.571
       BrowseDayOfYearTaxoFacets       30.78     (14.2%)       31.38      (4.7%)    2.0% ( -14% -   24%) 0.558
     BrowseRandomLabelTaxoFacets       22.89     (12.5%)       23.35      (2.5%)    2.0% ( -11% -   19%) 0.473
                    OrHighNotMed      278.47      (4.8%)      284.88      (6.6%)    2.3% (  -8% -   14%) 0.209
                    OrHighNotLow      312.77      (5.9%)      321.00      (7.6%)    2.6% ( -10% -   17%) 0.218
           BrowseMonthTaxoFacets       31.51     (14.7%)       32.62      (4.4%)    3.5% ( -13% -   26%) 0.300

gsmiller · 2023-06-30T23:10:51Z

In some internal benchmarks though, for situations where it's semi-common to get no results or sparse results, we see a 9.6% average latency reduction. In cases where results tend to be pretty dense, this still offers a 1.9% average latency reduction. So it's definitely helpful in situations where sparse/no-results are semi-common.

mikemccand · 2023-07-03T12:05:36Z

+1, this makes sense to me.

gsmiller · 2023-07-05T15:24:22Z

Thanks @mikemccand. Just removed the errant "nocommit" comment I left hanging in the initial PR (doh!) and added a CHANGES entry, so this should be a clean change now.

mikemccand

Thanks @gsmiller!

jpountz · 2023-08-03T05:39:14Z

Is this change the reason why we are seeing a major slowdown on AndHighMedDayTaxoFacets and speedup on OrHighMedDayTaxoFacets and OrHighMedDayTaxoFacets? I wouldn't expect these queries to have sparse hits. Maybe the introduction of counting tasks is also related to this change by making the JVM compile things differently? (facets were the only tasks to use non-scoring boolean queries before, not anymore)

gsmiller force-pushed the explore/lazy-faceting branch from b6d77fa to e98d3b3 Compare July 5, 2023 13:20

mikemccand approved these changes Jul 18, 2023

gsmiller added 3 commits July 25, 2023 11:40

remove lingering nocommit comment I'd left myself... oops

60d619b

changes entry

0e0b5b4

gsmiller force-pushed the explore/lazy-faceting branch from e98d3b3 to 0e0b5b4 Compare July 25, 2023 18:52

gsmiller merged commit 179b45b into apache:main Jul 25, 2023
4 checks passed

gsmiller deleted the explore/lazy-faceting branch July 25, 2023 19:20

gsmiller added this to the 10.0.0 milestone Jul 25, 2023

mikemccand mentioned this pull request Aug 3, 2023

Nightly tasks should never have more than 5 queries in each category mikemccand/luceneutil#226

Open

stefanvodita mentioned this pull request Apr 13, 2024

Backport to 9x: Initialize facet counting data structures lazily #12408 #13300

Merged

stefanvodita modified the milestones: 10.0.0, 9.11.0 May 3, 2024

mikemccand mentioned this pull request May 10, 2024

Reduce duplication in taxonomy facets; always do counts #12966

Merged

shuttie mentioned this pull request Jun 17, 2024

NullPointerException in StringValueFacetCounts when using MultiCollectorManager #13493

Closed
