Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SOLR-14254: Docs for text tagger: FST50 trade-off #1332

Merged
merged 2 commits into from Mar 14, 2020

Conversation

dsmiley
Copy link
Contributor

@dsmiley dsmiley commented Mar 9, 2020

Description

Please provide a short description of the changes you're making with this pull request.

Solution

Please provide a short description of the approach taken to implement your solution.

Tests

Please describe the tests you've developed or run to confirm this patch implements the feature or solves the problem.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the master branch.
  • I have run ant precommit and the appropriate test suite.
  • I have added tests for my changes.
  • I have added documentation for the Ref Guide (for Solr changes only).

@dsmiley dsmiley requested a review from gerlowskija March 9, 2020 17:36
* Follow the recommended configuration field settings above.
Additionally, for the best tagger performance, set `postingsFormat=FST50`.
However, non-default postings formats have no backwards-compatibility guarantees, and so if you upgrade Solr then you may find a nasty exception on startup as it fails to read the older index.
If the input text to be tagged is small (e.g. you are tagging queries or tweets) then the postings format choice isn't as important.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Q] Interesting. I didn't realize that the FST50 vs default performance decreased the smaller the individual document size was. Did you do a particular performance test to bear this out, or are you just intuiting that behavior from knowing how postingsFormats work?

Is the performance comparable even if numTweets or whatever gets large and the posting-lists grow due to the sheer number of tiny docs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't realize that the FST50 vs default performance decreased the smaller the individual document size was

The tagger works by looping over each token from the input and doing a term dictionary lookup on the local index. Logically, if your input text is small then there is less work to do than for large input text. Knowing this requires tagger knowledge but not how any particular postings format works. See? No I didn't benchmark this ;-).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI the SolrTextTagger was benchmarked a couple years ago to compare the old "Memory" PF and FST50 -- OpenSextant/SolrTextTagger#38 (comment) we never tried the default (blocktree). I believe the input data in that experiment were whole articles, and thus would be impacted by the postings format choice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense.

I ran into that old perf-test comment back when I was raising SOLR-14254, but wasn't sure how relevant it was. The situation user's face today is very different: "memory" (the most performant option) is gone entirely and the test doesn't even try to cover blocktree.

But that's independent of this PR. These docs are worth getting in as-is. If one of us (or someone else entirely) is able to shed more light on blocktree performance in the future, docs can be updated at that point. No reason to let the perfect get in the way of the good.

Copy link
Contributor

@gerlowskija gerlowskija left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had one question, but LGTM. The wording is very clear and spells out the "cons" of either choice.

@dsmiley
Copy link
Contributor Author

dsmiley commented Mar 12, 2020

I thought it'd be good to add a solr-upgrade-notes warning too, both in 8.4 and 8.5

@dsmiley dsmiley merged commit cbd0dcb into apache:master Mar 14, 2020
dsmiley added a commit that referenced this pull request Mar 14, 2020
dsmiley added a commit that referenced this pull request Mar 14, 2020
@dsmiley dsmiley deleted the taggerWarnPf branch August 2, 2023 22:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants