-
Notifications
You must be signed in to change notification settings - Fork 1.2k
[release/3.1] SpellCheck incorrectly detects common abbreviations like "e.g." as misspellings #2753
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
e945a40 to
ebaaa93
Compare
cd40a1d to
e6368a7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks OK as far as functionality. What's the perf hit? It looks like
the next word is mispelled------------------------------------------n dashes
costs O(n^2), trying O(n) additions to the misspelling, each costing O(n). Maybe you should limit the heuristic to some small number of additions from the missing fragment.
...tNet.Wpf/src/PresentationFramework/System/Windows/Documents/WinRTSpellerInteropExtensions.cs
Outdated
Show resolved
Hide resolved
...tNet.Wpf/src/PresentationFramework/System/Windows/Documents/WinRTSpellerInteropExtensions.cs
Outdated
Show resolved
Hide resolved
Can we be certain that the concatenations will always be misspelled words in a row until they become correct? Then we could binary search for first correct position. If it would potentially flip-flop that's obviously out. |
|
Oops, my bad, accidentally hit close. |
Setting a upper limit to protect perf sounds like a good idea. It would be an arbitrary limit, and I have no basis for selecting such a constant really. Conservatively, I'd select
Though the solution looks reasonably elegant, I've gone through about 5-6 iterations/redesigns that were an order-of magnitude more complicated before arriving at this one. I had one design goal from the start - don't make or change linguistic decisions in WPF that's the purview of a different layer. This meant not changing the word-breakers output, and it certainly precluded commingling word-breaking and spell-checking together to improve the word-breaking results. The ideal solution would have been something like this:
This keeps us from introducing linguistic decisions in WPF - that's the domain of word-breaking and spell-checking API's and keeps things separated (word-breaking vs. spell-checking). Unfortunately, this turned out to be exceedingly hard to implement. The word-breaker changes were quite doable, but plumbing the changes into WPF's speller-stack wasn't easy. WPF does some word-breaking of its own (instead of relying completely on an external word-breaker API) which complicates matters, and I finally decided to compromise a bit on what the solution would look like. This solution now directly mutates the result of the word-breaker by involving the spell-checker as part of the decision-making process. It's not ideal, but it eliminates the need for extensive refactoring of the rest of the codebase. Now coming to your question:
We can't be certain, and I don't think it's our concern either. The word-breaker gave us a token, and that forms our basis or prefix. Whatever fragment we are left with that was discovered in the interstitial space between that token and the next one can only be concatenated to form a new token, if at all. The only possibilities look like these: If one of these don't come out as clean from the spell-checker, then we have no useful candidates for substituting We actually don't want to run a binary-search here - we are looking for the shortest-match. We don't know anything else about the linguistic characteristics of these permutations. There may be 0, 1, or a larger number of permutations that are spellcheck clean. Without having a great deal of working knowledge and vocabulary in several languages, I don't think we can reason about this casually - and so I think we should try to make conservative choices (i.e., look for the shortest-match). |
e6368a7 to
4e0b35b
Compare
|
Leaving the 'WIP" in title in place until servicing-approval happens. This doesn't apply to the PR targeting master #2871 |
df3f957 to
03ed1ea
Compare
…t the end of words like "etc." This results in well-formed words like "etc.", "Mr.", "Mrs." being marked as spelling-errors. Further, it is impossible to "correct" these words in WPF. Becasue of the way word-breaking is performed by MS.Data.Text.WordsSegmenter under the hood, these punctuations are always dropped during word-breaking. The older spell-checker (NaturalLanguage6.dll) used on Win7 does not have this defect. MS.Data.Text.WordsSegmenter has declined to 'fix'/'modify this behavior, so we are adding a workaround in WPF. This change introduces a wrapper on top of the word-breaker and recalculates the tokens, taking into account dropped text that occurs immediately after a token that is likely to fail spell-checking.
…reaking with aid from a spell-checker
…d organizing regions
f0dcccb to
0279bdf
Compare
|
@SamBent I'd like to present this to .NET Core shiproom since this is now checked into .NET 4.8 servicing. Can I get your signoff on this? |
...crosoft.DotNet.Wpf/src/PresentationFramework/System/Windows/Documents/WinRTSpellerInterop.cs
Show resolved
Hide resolved
...crosoft.DotNet.Wpf/src/PresentationFramework/System/Windows/Documents/WinRTSpellerInterop.cs
Show resolved
Hide resolved
...crosoft.DotNet.Wpf/src/PresentationFramework/System/Windows/Documents/WinRTSpellerInterop.cs
Show resolved
Hide resolved
...tNet.Wpf/src/PresentationFramework/System/Windows/Documents/WinRTSpellerInteropExtensions.cs
Show resolved
Hide resolved
|
This was rejected by the .NET Core Tactics (the forum where servicing related approval decisions are made for .NET Core). A request for fixing this on .NET Framework 4.8 and 4.7.2 had been initiated by one customer, and it has since been fixed in those versions of .NET (it may still be in the process of being delivered externally at this time). It has also been fixed in master (i.e., .NET 5). Given (a) this issue was introduced in .NET 4.6.2, which was a considerable time ago, (b) the first substantial report of this bug that came to the attention of the team didn't happen until very recently, and (c) the customer who requested fixes in .NET 4.8/4.7.2 hasn't expressed immediate plans to migrate to .NET Core, .NET Core Tactics stakeholders felt that this fix need not be taken in .NET Core 3.1 servicing at this time. If someone absolutely needs this fix on 3.1 and is unable to adopt .NET 5 (which will be complete not too far in the future) instead, then we can resurrect the discussion easily enough. |
Description
WPF's spell-checker on Win8+ ignores punctuations like
.(period) at the end of words likeetc.This results in well-formed words likeetc.,Mr.,Mrs.being marked as spelling-errors.Further, it is impossible to correct these words in WPF.
Customer Impact
This has been fixed in .NET Framework 4.8 servicing on behalf of a customer. It's being proposed for inclusion in .NET Core 3.1 servicing to ensure that customers upgrading from .NET Framework 4.8 to .NET Core 3.1 are not regressed.
Regression?
Not a regression in .NET Core 3.1.
For historical reference, this was a regression introduced in .NET Framework 4.6.1 when WPF switched its spell-checking implementation from NaturalLanguage6.dll to ISpellChecker/Windows.Data.WordsSegmenter based implementation.
This bug did not affect Windows 8 and older OS platforms.
Risk (see taxonomy)
Low.
The change is isolated, and well tested.
Link the PR to the original issue and/or the PR to master.
#2871
Packaging impact? (if a libraries change)
None
Additional Discussion
Because of the way word-breaking is performed by
MS.Data.Text.WordsSegmenterunder the hood, these punctuations are always dropped during word-breaking. The older spell-checker (NaturalLanguage6.dll) used on Win7 does not have this defect.MS.Data.Text.WordsSegmenterhas declined to 'fix'/'modify this behavior, so we are adding a workaround in WPF.This change introduces a wrapper on top of the word-breaker and recalculates the tokens, taking into account dropped text that occurs immediately after a token that is likely to fail spell-checking.