New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SOLR-14254: Docs for text tagger: FST50 trade-off #1332
Conversation
* Follow the recommended configuration field settings above. | ||
Additionally, for the best tagger performance, set `postingsFormat=FST50`. | ||
However, non-default postings formats have no backwards-compatibility guarantees, and so if you upgrade Solr then you may find a nasty exception on startup as it fails to read the older index. | ||
If the input text to be tagged is small (e.g. you are tagging queries or tweets) then the postings format choice isn't as important. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Q] Interesting. I didn't realize that the FST50 vs default performance decreased the smaller the individual document size was. Did you do a particular performance test to bear this out, or are you just intuiting that behavior from knowing how postingsFormats work?
Is the performance comparable even if numTweets or whatever gets large and the posting-lists grow due to the sheer number of tiny docs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't realize that the FST50 vs default performance decreased the smaller the individual document size was
The tagger works by looping over each token from the input and doing a term dictionary lookup on the local index. Logically, if your input text is small then there is less work to do than for large input text. Knowing this requires tagger knowledge but not how any particular postings format works. See? No I didn't benchmark this ;-).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI the SolrTextTagger was benchmarked a couple years ago to compare the old "Memory" PF and FST50 -- OpenSextant/SolrTextTagger#38 (comment) we never tried the default (blocktree). I believe the input data in that experiment were whole articles, and thus would be impacted by the postings format choice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense.
I ran into that old perf-test comment back when I was raising SOLR-14254, but wasn't sure how relevant it was. The situation user's face today is very different: "memory" (the most performant option) is gone entirely and the test doesn't even try to cover blocktree.
But that's independent of this PR. These docs are worth getting in as-is. If one of us (or someone else entirely) is able to shed more light on blocktree performance in the future, docs can be updated at that point. No reason to let the perfect get in the way of the good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had one question, but LGTM. The wording is very clear and spells out the "cons" of either choice.
I thought it'd be good to add a solr-upgrade-notes warning too, both in 8.4 and 8.5 |
(cherry picked from commit cbd0dcb)
(cherry picked from commit cbd0dcb)
Description
Please provide a short description of the changes you're making with this pull request.
Solution
Please provide a short description of the approach taken to implement your solution.
Tests
Please describe the tests you've developed or run to confirm this patch implements the feature or solves the problem.
Checklist
Please review the following and check all that apply:
master
branch.ant precommit
and the appropriate test suite.