SOLR-14254: Docs for text tagger: FST50 trade-off #1332

dsmiley · 2020-03-09T17:36:40Z

Description

Please provide a short description of the changes you're making with this pull request.

Solution

Please provide a short description of the approach taken to implement your solution.

Tests

Please describe the tests you've developed or run to confirm this patch implements the feature or solves the problem.

Checklist

Please review the following and check all that apply:

I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
I have created a Jira issue and added the issue ID to my pull request title.
I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
I have developed this patch against the master branch.
I have run ant precommit and the appropriate test suite.
I have added tests for my changes.
I have added documentation for the Ref Guide (for Solr changes only).

gerlowskija · 2020-03-10T19:22:54Z

solr/solr-ref-guide/src/the-tagger-handler.adoc

+* Follow the recommended configuration field settings above.
+Additionally, for the best tagger performance, set `postingsFormat=FST50`.
+However, non-default postings formats have no backwards-compatibility guarantees, and so if you upgrade Solr then you may find a nasty exception on startup as it fails to read the older index.
+If the input text to be tagged is small (e.g. you are tagging queries or tweets) then the postings format choice isn't as important.


[Q] Interesting. I didn't realize that the performance difference between FST50 and the default shrinks as the individual document size gets smaller. Did you run a particular performance test to bear this out, or are you intuiting that behavior from knowing how postings formats work?

Is the performance comparable even if numTweets or whatever gets large and the posting-lists grow due to the sheer number of tiny docs?

I didn't realize that the performance difference between FST50 and the default shrinks as the individual document size gets smaller

The tagger works by looping over each token from the input and doing a term dictionary lookup on the local index. Logically, if your input text is small then there is less work to do than for large input text. Knowing this requires knowledge of the tagger, but not of how any particular postings format works. See? No, I didn't benchmark this ;-).
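
A minimal sketch of the reasoning above, assuming a plain Python set stands in for the index's term dictionary. It is not the tagger's actual implementation: the real lookup goes through Lucene's postings format and matches multi-word phrases, which this ignores. The function name `tag_tokens`, the dictionary contents, and both inputs are hypothetical. It only shows that the number of dictionary lookups grows with the number of input tokens, so the per-lookup cost of the postings format matters less for short input.

```python
from typing import Iterable, List, Set, Tuple


def tag_tokens(tokens: Iterable[str], term_dictionary: Set[str]) -> Tuple[List[str], int]:
    """Return the tokens found in the dictionary and the number of lookups performed."""
    matches: List[str] = []
    lookups = 0
    for token in tokens:
        lookups += 1  # one term-dictionary lookup per input token
        if token in term_dictionary:  # stand-in for the postings-format lookup
            matches.append(token)
    return matches, lookups


# Hypothetical dictionary and inputs, for illustration only.
dictionary = {"boston", "paris", "solr"}
short_input = "flights to boston".split()                      # query-sized input
long_input = ("flights to boston and paris " * 200).split()    # article-sized input

short_matches, short_lookups = tag_tokens(short_input, dictionary)
long_matches, long_lookups = tag_tokens(long_input, dictionary)
```

Under these assumptions the short input costs 3 lookups and the long input 1000, so any per-lookup speedup from a faster term dictionary is multiplied far more for the long input.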

FYI the SolrTextTagger was benchmarked a couple of years ago to compare the old "Memory" PF and FST50; see OpenSextant/SolrTextTagger#38 (comment). We never tried the default (blocktree). I believe the input data in that experiment were whole articles, and thus the results would be impacted by the postings format choice.

Makes sense.

I ran into that old perf-test comment back when I was raising SOLR-14254, but wasn't sure how relevant it was. The situation users face today is very different: "memory" (the most performant option) is gone entirely and the test doesn't even try to cover blocktree.

But that's independent of this PR. These docs are worth getting in as-is. If one of us (or someone else entirely) is able to shed more light on blocktree performance in the future, docs can be updated at that point. No reason to let the perfect get in the way of the good.

gerlowskija

I had one question, but LGTM. The wording is very clear and spells out the "cons" of either choice.

dsmiley · 2020-03-12T02:44:30Z

I thought it'd be good to add a solr-upgrade-notes warning too, both in 8.4 and 8.5

(cherry picked from commit cbd0dcb)

dsmiley requested a review from gerlowskija March 9, 2020 17:36

gerlowskija reviewed Mar 10, 2020

gerlowskija approved these changes Mar 10, 2020

dsmiley added 2 commits March 11, 2020 17:30

SOLR-14254: Docs for text tagger: FST50 trade-off

7438ea5

solr-upgrade-notes as well

57d9118

dsmiley force-pushed the taggerWarnPf branch from 1c78b96 to 57d9118 March 12, 2020 02:43

dsmiley merged commit cbd0dcb into apache:master Mar 14, 2020

dsmiley added a commit that referenced this pull request Mar 14, 2020

SOLR-14254: Docs for text tagger: FST50 trade-off (#1332)

30810b1

(cherry picked from commit cbd0dcb)

dsmiley added a commit that referenced this pull request Mar 14, 2020

SOLR-14254: Docs for text tagger: FST50 trade-off (#1332)

c4623ad

(cherry picked from commit cbd0dcb)

dsmiley deleted the taggerWarnPf branch August 2, 2023 22:17

dsmiley commented Mar 9, 2020

gerlowskija Mar 10, 2020

dsmiley Mar 10, 2020

dsmiley Mar 11, 2020

gerlowskija Mar 11, 2020

gerlowskija left a comment

dsmiley commented Mar 12, 2020
