[release/3.1] SpellCheck incorrectly detects common abbreviations like "e.g." as misspellings #2753

vatsan-madhavan · 2020-03-11T00:51:58Z

Description

WPF's spell-checker on Win8+ ignores punctuations like . (period) at the end of words like etc. This results in well-formed words like etc., Mr., Mrs. being marked as spelling-errors.

Further, it is impossible to correct these words in WPF.

Customer Impact

This has been fixed in .NET Framework 4.8 servicing on behalf of a customer. It's being proposed for inclusion in .NET Core 3.1 servicing to ensure that customers upgrading from .NET Framework 4.8 to .NET Core 3.1 are not regressed.

Regression?

Not a regression in .NET Core 3.1.

For historical reference, this was a regression introduced in .NET Framework 4.6.1 when WPF switched its spell-checking implementation from NaturalLanguage6.dll to ISpellChecker/Windows.Data.WordsSegmenter based implementation.

This bug did not affect Windows 8 and older OS platforms.

Risk (see taxonomy)

Low.

The change is isolated, and well tested.

Link the PR to the original issue and/or the PR to master.

#2871

Packaging impact? (if a libraries change)

None

Additional Discussion

Because of the way word-breaking is performed by MS.Data.Text.WordsSegmenter under the hood, these punctuations are always dropped during word-breaking. The older spell-checker (NaturalLanguage6.dll) used on Win7 does not have this defect.

MS.Data.Text.WordsSegmenter has declined to 'fix'/'modify this behavior, so we are adding a workaround in WPF.

This change introduces a wrapper on top of the word-breaker and recalculates the tokens, taking into account dropped text that occurs immediately after a token that is likely to fail spell-checking.

vatsan-madhavan · 2020-03-12T22:08:48Z

@rladuca @SamBent PTAL. This is only marked as WIP to prevent accidentally merging.

SamBent

Looks OK as far as functionality. What's the perf hit? It looks like

          the next word is mispelled------------------------------------------n dashes

costs O(n^2), trying O(n) additions to the misspelling, each costing O(n). Maybe you should limit the heuristic to some small number of additions from the missing fragment.

...tNet.Wpf/src/PresentationFramework/System/Windows/Documents/WinRTSpellerInteropExtensions.cs

rladuca · 2020-03-13T00:15:50Z

Looks OK as far as functionality. What's the perf hit? It looks like
          the next word is mispelled------------------------------------------n dashes
costs O(n^2), trying O(n) additions to the misspelling, each costing O(n). Maybe you should limit the heuristic to some small number of additions from the missing fragment.

Can we be certain that the concatenations will always be misspelled words in a row until they become correct? Then we could binary search for first correct position. If it would potentially flip-flop that's obviously out.

rladuca · 2020-03-13T00:16:55Z

Oops, my bad, accidentally hit close.

vatsan-madhavan · 2020-03-21T23:48:27Z

Looks OK as far as functionality. What's the perf hit? It looks like
          the next word is mispelled------------------------------------------n dashes
costs O(n^2), trying O(n) additions to the misspelling, each costing O(n). Maybe you should limit the heuristic to some small number of additions from the missing fragment.

Setting a upper limit to protect perf sounds like a good idea. It would be an arbitrary limit, and I have no basis for selecting such a constant really. Conservatively, I'd select k=4 as a limit and leave it at that. I haven't come across an actual example that requires k>2, so that's probably a good margin. "I haven't come across..." isn't a strong basis for this choice, unfortunately. Open to suggestions.

Can we be certain that the concatenations will always be misspelled words in a row until they become correct? Then we could binary search for first correct position. If it would potentially flip-flop that's obviously out.

Though the solution looks reasonably elegant, I've gone through about 5-6 iterations/redesigns that were an order-of magnitude more complicated before arriving at this one. I had one design goal from the start - don't make or change linguistic decisions in WPF that's the purview of a different layer. This meant not changing the word-breakers output, and it certainly precluded commingling word-breaking and spell-checking together to improve the word-breaking results.

The ideal solution would have been something like this:

In the word-breaker, identify and keep track of alternate-forms for each token.
- Some alternate-forms are already identified by the word-breaking API's.
- We can augment this list by identifying additional alternate-forms by looking at the interstitial characters between tokens, in any.
- At the end of the day, don't change the word-breaking results; esp. don't bring in the speller to bear etc.
During spell-checking, take the alternate-forms of tokens into account.
- When a token fails spell-checking, check its alternate-forms and make additional decisions, etc.

This keeps us from introducing linguistic decisions in WPF - that's the domain of word-breaking and spell-checking API's and keeps things separated (word-breaking vs. spell-checking).

Unfortunately, this turned out to be exceedingly hard to implement. The word-breaker changes were quite doable, but plumbing the changes into WPF's speller-stack wasn't easy. WPF does some word-breaking of its own (instead of relying completely on an external word-breaker API) which complicates matters, and I finally decided to compromise a bit on what the solution would look like.

This solution now directly mutates the result of the word-breaker by involving the spell-checker as part of the decision-making process. It's not ideal, but it eliminates the need for extensive refactoring of the rest of the codebase.

Now coming to your question:

Can we be certain that the concatenations will always be misspelled words in a row until they become correct?

We can't be certain, and I don't think it's our concern either. The word-breaker gave us a token, and that forms our basis or prefix. Whatever fragment we are left with that was discovered in the interstitial space between that token and the next one can only be concatenated to form a new token, if at all.

The only possibilities look like these:

token
tokenf
tokenfr
tokenfra
tokenfrag
tokenfragm
tokenfragme
tokenfragmen
tokenfragment

If one of these don't come out as clean from the spell-checker, then we have no useful candidates for substituting token with, so we might as well not upend the word-breakers initial choice and stick with token.

We actually don't want to run a binary-search here - we are looking for the shortest-match. We don't know anything else about the linguistic characteristics of these permutations. There may be 0, 1, or a larger number of permutations that are spellcheck clean. Without having a great deal of working knowledge and vocabulary in several languages, I don't think we can reason about this casually - and so I think we should try to make conservative choices (i.e., look for the shortest-match).

vatsan-madhavan · 2020-04-10T22:22:53Z

Leaving the 'WIP" in title in place until servicing-approval happens. This doesn't apply to the PR targeting master #2871

…t the end of words like "etc." This results in well-formed words like "etc.", "Mr.", "Mrs." being marked as spelling-errors. Further, it is impossible to "correct" these words in WPF. Becasue of the way word-breaking is performed by MS.Data.Text.WordsSegmenter under the hood, these punctuations are always dropped during word-breaking. The older spell-checker (NaturalLanguage6.dll) used on Win7 does not have this defect. MS.Data.Text.WordsSegmenter has declined to 'fix'/'modify this behavior, so we are adding a workaround in WPF. This change introduces a wrapper on top of the word-breaker and recalculates the tokens, taking into account dropped text that occurs immediately after a token that is likely to fail spell-checking.

…reaking with aid from a spell-checker

…d organizing regions

vatsan-madhavan · 2020-06-03T22:27:12Z

@SamBent I'd like to present this to .NET Core shiproom since this is now checked into .NET 4.8 servicing. Can I get your signoff on this?

...crosoft.DotNet.Wpf/src/PresentationFramework/System/Windows/Documents/WinRTSpellerInterop.cs

...tNet.Wpf/src/PresentationFramework/System/Windows/Documents/WinRTSpellerInteropExtensions.cs

vatsan-madhavan · 2020-07-20T18:30:10Z

This was rejected by the .NET Core Tactics (the forum where servicing related approval decisions are made for .NET Core).

A request for fixing this on .NET Framework 4.8 and 4.7.2 had been initiated by one customer, and it has since been fixed in those versions of .NET (it may still be in the process of being delivered externally at this time). It has also been fixed in master (i.e., .NET 5).

Given (a) this issue was introduced in .NET 4.6.2, which was a considerable time ago, (b) the first substantial report of this bug that came to the attention of the team didn't happen until very recently, and (c) the customer who requested fixes in .NET 4.8/4.7.2 hasn't expressed immediate plans to migrate to .NET Core, .NET Core Tactics stakeholders felt that this fix need not be taken in .NET Core 3.1 servicing at this time.

If someone absolutely needs this fix on 3.1 and is unable to adopt .NET 5 (which will be complete not too far in the future) instead, then we can resurrect the discussion easily enough.

ghost requested a review from rladuca March 11, 2020 00:52

ghost added the PR metadata: Label to tag PRs, to facilitate with triage label Mar 11, 2020

vatsan-madhavan self-assigned this Mar 11, 2020

ghost requested a review from SamBent March 11, 2020 00:52

vatsan-madhavan force-pushed the 3.1/speller-fixup branch from e945a40 to ebaaa93 Compare March 12, 2020 20:24

vatsan-madhavan marked this pull request as ready for review March 12, 2020 20:29

vatsan-madhavan force-pushed the 3.1/speller-fixup branch from cd40a1d to e6368a7 Compare March 12, 2020 21:33

vatsan-madhavan added the * NO MERGE * metadata: The PR is not ready for merge yet (see discussion for detailed reasons) label Mar 12, 2020

SamBent reviewed Mar 12, 2020

View reviewed changes

rladuca suggested changes Mar 12, 2020

View reviewed changes

...tNet.Wpf/src/PresentationFramework/System/Windows/Documents/WinRTSpellerInteropExtensions.cs Outdated Show resolved Hide resolved

...tNet.Wpf/src/PresentationFramework/System/Windows/Documents/WinRTSpellerInteropExtensions.cs Outdated Show resolved Hide resolved

rladuca closed this Mar 13, 2020

rladuca reopened this Mar 13, 2020

vatsan-madhavan force-pushed the 3.1/speller-fixup branch from e6368a7 to 4e0b35b Compare April 10, 2020 20:27

vatsan-madhavan requested review from SamBent and rladuca April 10, 2020 20:28

vatsan-madhavan changed the title ~~WIP: SpellCheck incorrectly detects common abbreviations like "e.g." as misspellings~~ WIP: [release/3.1] SpellCheck incorrectly detects common abbreviations like "e.g." as misspellings Apr 10, 2020

vatsan-madhavan mentioned this pull request Apr 10, 2020

SpellCheck incorrectly detects common abbreviations like "e.g." as misspellings #2871

Merged

rladuca approved these changes Apr 10, 2020

View reviewed changes

vatsan-madhavan force-pushed the 3.1/speller-fixup branch from df3f957 to 03ed1ea Compare May 11, 2020 17:46

vatsan-madhavan added 6 commits June 3, 2020 15:25

Limit the number of alt-forms considered when running augmented wordb…

108232f

…reaking with aid from a spell-checker

Fixing use of inheritdoc

48237cc

Marking S.W.D.WinRTSpellerInterop.TextRange as internal readonly an…

ee6f642

…d organizing regions

Improving readability

cbc766b

Return Array.Empty when there are no tokens

6c2c6cc

When segments==null, init tokens to Array.Empty

0279bdf

vatsan-madhavan force-pushed the 3.1/speller-fixup branch from f0dcccb to 0279bdf Compare June 3, 2020 22:25

SamBent reviewed Jun 3, 2020

View reviewed changes

vatsan-madhavan added 3 commits June 4, 2020 10:58

Fixing blank line

f601882

Update TextRange ctor to match NetFx implementation.

80d9f51

Fixing missing :

fa75a9e

vatsan-madhavan requested a review from SamBent July 6, 2020 22:48

SamBent approved these changes Jul 6, 2020

View reviewed changes

Merge branch 'release/3.1' into 3.1/speller-fixup

4419fa0

vatsan-madhavan changed the title ~~WIP: [release/3.1] SpellCheck incorrectly detects common abbreviations like "e.g." as misspellings~~ [release/3.1] SpellCheck incorrectly detects common abbreviations like "e.g." as misspellings Jul 17, 2020

vatsan-madhavan added Servicing-consider Servicing-rejected and removed * NO MERGE * metadata: The PR is not ready for merge yet (see discussion for detailed reasons) Servicing-consider labels Jul 17, 2020

vatsan-madhavan closed this Jul 20, 2020

vatsan-madhavan deleted the 3.1/speller-fixup branch July 20, 2020 18:30

ghost locked as resolved and limited conversation to collaborators Apr 12, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[release/3.1] SpellCheck incorrectly detects common abbreviations like "e.g." as misspellings #2753

[release/3.1] SpellCheck incorrectly detects common abbreviations like "e.g." as misspellings #2753

Uh oh!

vatsan-madhavan commented Mar 11, 2020 •

edited

Loading

Uh oh!

vatsan-madhavan commented Mar 12, 2020

Uh oh!

SamBent left a comment •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

rladuca commented Mar 13, 2020

Uh oh!

rladuca commented Mar 13, 2020

Uh oh!

vatsan-madhavan commented Mar 21, 2020

Uh oh!

vatsan-madhavan commented Apr 10, 2020

Uh oh!

vatsan-madhavan commented Jun 3, 2020

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

vatsan-madhavan commented Jul 20, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

[release/3.1] SpellCheck incorrectly detects common abbreviations like "e.g." as misspellings #2753

[release/3.1] SpellCheck incorrectly detects common abbreviations like "e.g." as misspellings #2753

Uh oh!

Conversation

vatsan-madhavan commented Mar 11, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Customer Impact

Regression?

Risk (see taxonomy)

Link the PR to the original issue and/or the PR to master.

Packaging impact? (if a libraries change)

Additional Discussion

Uh oh!

vatsan-madhavan commented Mar 12, 2020

Uh oh!

SamBent left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

rladuca commented Mar 13, 2020

Uh oh!

rladuca commented Mar 13, 2020

Uh oh!

vatsan-madhavan commented Mar 21, 2020

Uh oh!

vatsan-madhavan commented Apr 10, 2020

Uh oh!

vatsan-madhavan commented Jun 3, 2020

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

vatsan-madhavan commented Jul 20, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

vatsan-madhavan commented Mar 11, 2020 •

edited

Loading

SamBent left a comment •

edited

Loading