port speller perf fix from 4.8 #4975
Merged
Addresses Issue #3350. This is a port of a servicing fix in .NET 4.7-4.8.
Description
PR #2871 added logic to the word-breaking phase to extend misspelled tokens ("words") with non-token characters ("punctuation") if such an extension fixes the misspelling. I've found ways to speed it up:
The logic checks each token and potential extension for misspellings, using an existing method ComprehensiveCheck that automatically populates each spelling error with a list of suggestions obtained from the native layer. These suggestions are never used, and the native calls were accounting for 80% of the time (says Trevor Fellman). The logic only cares whether any errors exist, so use a new method HasErrors that answers that question without asking for suggestions.
The check for potential extensions is skipped if the non-token characters that follow the token are all whitespace. In practice, those characters often include nulls ('\0') at the end. Removing these nulls first allows us to skip the extension test altogether in many cases - about 40% in my experiments.
Don't consider trailing whitespace or nulls in the non-token characters. (This generalizes the original heuristic that discarded the non-token characters if they were all whitespace.)
Don't consider whitespace or nulls interior to the non-token characters, until reaching a character that is not whitespace or null.
Cache the 10 most recent HasErrors results, and answer queries from the cache instead of calling the native spell-checker.
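As a rough illustration of the trailing-character rule in the list above, here is a minimal Python sketch (the actual change is in the WPF C# sources and is not reproduced here). The names `trim_non_token_tail` and `should_try_extension` are hypothetical, and the rule about whitespace or nulls interior to the non-token characters is deliberately not modeled.

```python
def trim_non_token_tail(chars: str) -> str:
    """Drop trailing whitespace and NUL characters from the run of
    non-token characters that follows a token (hypothetical helper)."""
    end = len(chars)
    # str.isspace() does not treat '\0' as whitespace, so check it explicitly.
    while end > 0 and (chars[end - 1].isspace() or chars[end - 1] == "\0"):
        end -= 1
    return chars[:end]


def should_try_extension(chars: str) -> bool:
    """The extension test is only worth running if something other than
    whitespace or NULs remains after trimming."""
    return bool(trim_non_token_tail(chars))
```

With this shape, a run such as `" \0\0"` trims to the empty string, so the extension test is skipped without calling the spell-checker at all.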
This speeds up the spell-checker quite a bit, but it is still slower than the original. The word-breaking phase still has to check each token for misspelling (which it didn't do before #2871); there's no getting around this expense.
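The caching step from the list above could look roughly like the following Python sketch. It assumes that "most recent" means least-recently-used eviction, which the description does not spell out. `HasErrorsCache` and the `native_has_errors` callable are hypothetical stand-ins for the real C# code and the native spell-checker call.

```python
from collections import OrderedDict
from typing import Callable


class HasErrorsCache:
    """Remembers the most recent has-errors answers (assumed LRU policy)."""

    def __init__(self, native_has_errors: Callable[[str], bool], capacity: int = 10):
        self._native = native_has_errors      # hypothetical expensive native call
        self._capacity = capacity
        self._results: "OrderedDict[str, bool]" = OrderedDict()
        self.native_calls = 0                 # counter kept only for illustration

    def has_errors(self, text: str) -> bool:
        if text in self._results:
            # Cache hit: mark as most recently used and skip the native call.
            self._results.move_to_end(text)
            return self._results[text]
        self.native_calls += 1
        result = self._native(text)
        self._results[text] = result
        if len(self._results) > self._capacity:
            # Evict the least recently used entry.
            self._results.popitem(last=False)
        return result
```

A small cache like this helps when the same token is checked repeatedly in quick succession. It does not remove the underlying cost of checking each distinct token once.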
Customer Impact
This bug is blocking migration to .NET Core.
Regression
Regression in .NET 5.0.
Testing
Ad-hoc with customer scenarios.
Standard regression testing.
Risk
Low. Straightforward port of .NETFx fix that was released last year.