Skip to content

Conversation

@vatsan-madhavan
Copy link
Member

@vatsan-madhavan vatsan-madhavan commented Mar 11, 2020

Description

WPF's spell-checker on Win8+ ignores punctuations like . (period) at the end of words like etc. This results in well-formed words like etc., Mr., Mrs. being marked as spelling-errors.

Further, it is impossible to correct these words in WPF.

Customer Impact

This has been fixed in .NET Framework 4.8 servicing on behalf of a customer. It's being proposed for inclusion in .NET Core 3.1 servicing to ensure that customers upgrading from .NET Framework 4.8 to .NET Core 3.1 are not regressed.

Regression?

Not a regression in .NET Core 3.1.

For historical reference, this was a regression introduced in .NET Framework 4.6.1 when WPF switched its spell-checking implementation from NaturalLanguage6.dll to ISpellChecker/Windows.Data.WordsSegmenter based implementation.

This bug did not affect Windows 8 and older OS platforms.

Risk (see taxonomy)

Low.

The change is isolated, and well tested.

Link the PR to the original issue and/or the PR to master.

#2871

Packaging impact? (if a libraries change)

None

Additional Discussion

Because of the way word-breaking is performed by MS.Data.Text.WordsSegmenter under the hood, these punctuations are always dropped during word-breaking. The older spell-checker (NaturalLanguage6.dll) used on Win7 does not have this defect.

MS.Data.Text.WordsSegmenter has declined to 'fix'/'modify this behavior, so we are adding a workaround in WPF.

This change introduces a wrapper on top of the word-breaker and recalculates the tokens, taking into account dropped text that occurs immediately after a token that is likely to fail spell-checking.

@ghost ghost requested a review from rladuca March 11, 2020 00:52
@ghost ghost added the PR metadata: Label to tag PRs, to facilitate with triage label Mar 11, 2020
@vatsan-madhavan vatsan-madhavan self-assigned this Mar 11, 2020
@ghost ghost requested a review from SamBent March 11, 2020 00:52
@vatsan-madhavan vatsan-madhavan marked this pull request as ready for review March 12, 2020 20:29
@vatsan-madhavan
Copy link
Member Author

@rladuca @SamBent PTAL. This is only marked as WIP to prevent accidentally merging.

@vatsan-madhavan vatsan-madhavan added the * NO MERGE * metadata: The PR is not ready for merge yet (see discussion for detailed reasons) label Mar 12, 2020
Copy link
Contributor

@SamBent SamBent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks OK as far as functionality. What's the perf hit? It looks like

          the next word is mispelled------------------------------------------n dashes

costs O(n^2), trying O(n) additions to the misspelling, each costing O(n). Maybe you should limit the heuristic to some small number of additions from the missing fragment.

@rladuca
Copy link
Member

rladuca commented Mar 13, 2020

Looks OK as far as functionality. What's the perf hit? It looks like

          the next word is mispelled------------------------------------------n dashes

costs O(n^2), trying O(n) additions to the misspelling, each costing O(n). Maybe you should limit the heuristic to some small number of additions from the missing fragment.

Can we be certain that the concatenations will always be misspelled words in a row until they become correct? Then we could binary search for first correct position. If it would potentially flip-flop that's obviously out.

@rladuca rladuca closed this Mar 13, 2020
@rladuca rladuca reopened this Mar 13, 2020
@rladuca
Copy link
Member

rladuca commented Mar 13, 2020

Oops, my bad, accidentally hit close.

@vatsan-madhavan
Copy link
Member Author

Looks OK as far as functionality. What's the perf hit? It looks like

          the next word is mispelled------------------------------------------n dashes

costs O(n^2), trying O(n) additions to the misspelling, each costing O(n). Maybe you should limit the heuristic to some small number of additions from the missing fragment.

Setting a upper limit to protect perf sounds like a good idea. It would be an arbitrary limit, and I have no basis for selecting such a constant really. Conservatively, I'd select k=4 as a limit and leave it at that. I haven't come across an actual example that requires k>2, so that's probably a good margin. "I haven't come across..." isn't a strong basis for this choice, unfortunately. Open to suggestions.

Can we be certain that the concatenations will always be misspelled words in a row until they become correct? Then we could binary search for first correct position. If it would potentially flip-flop that's obviously out.

Though the solution looks reasonably elegant, I've gone through about 5-6 iterations/redesigns that were an order-of magnitude more complicated before arriving at this one. I had one design goal from the start - don't make or change linguistic decisions in WPF that's the purview of a different layer. This meant not changing the word-breakers output, and it certainly precluded commingling word-breaking and spell-checking together to improve the word-breaking results.

The ideal solution would have been something like this:

  • In the word-breaker, identify and keep track of alternate-forms for each token.
    • Some alternate-forms are already identified by the word-breaking API's.
    • We can augment this list by identifying additional alternate-forms by looking at the interstitial characters between tokens, in any.
    • At the end of the day, don't change the word-breaking results; esp. don't bring in the speller to bear etc.
  • During spell-checking, take the alternate-forms of tokens into account.
    • When a token fails spell-checking, check its alternate-forms and make additional decisions, etc.

This keeps us from introducing linguistic decisions in WPF - that's the domain of word-breaking and spell-checking API's and keeps things separated (word-breaking vs. spell-checking).

Unfortunately, this turned out to be exceedingly hard to implement. The word-breaker changes were quite doable, but plumbing the changes into WPF's speller-stack wasn't easy. WPF does some word-breaking of its own (instead of relying completely on an external word-breaker API) which complicates matters, and I finally decided to compromise a bit on what the solution would look like.

This solution now directly mutates the result of the word-breaker by involving the spell-checker as part of the decision-making process. It's not ideal, but it eliminates the need for extensive refactoring of the rest of the codebase.

Now coming to your question:

Can we be certain that the concatenations will always be misspelled words in a row until they become correct?

We can't be certain, and I don't think it's our concern either. The word-breaker gave us a token, and that forms our basis or prefix. Whatever fragment we are left with that was discovered in the interstitial space between that token and the next one can only be concatenated to form a new token, if at all.

The only possibilities look like these:

token
tokenf
tokenfr
tokenfra
tokenfrag
tokenfragm
tokenfragme
tokenfragmen
tokenfragment

If one of these don't come out as clean from the spell-checker, then we have no useful candidates for substituting token with, so we might as well not upend the word-breakers initial choice and stick with token.

We actually don't want to run a binary-search here - we are looking for the shortest-match. We don't know anything else about the linguistic characteristics of these permutations. There may be 0, 1, or a larger number of permutations that are spellcheck clean. Without having a great deal of working knowledge and vocabulary in several languages, I don't think we can reason about this casually - and so I think we should try to make conservative choices (i.e., look for the shortest-match).

@vatsan-madhavan vatsan-madhavan changed the title WIP: SpellCheck incorrectly detects common abbreviations like "e.g." as misspellings WIP: [release/3.1] SpellCheck incorrectly detects common abbreviations like "e.g." as misspellings Apr 10, 2020
@vatsan-madhavan
Copy link
Member Author

Leaving the 'WIP" in title in place until servicing-approval happens. This doesn't apply to the PR targeting master #2871

…t the end of words like "etc." This results in well-formed words like "etc.", "Mr.", "Mrs." being marked as spelling-errors.

Further, it is impossible to "correct" these words in WPF.

Becasue of the way word-breaking is performed by MS.Data.Text.WordsSegmenter under the hood, these punctuations are always dropped during word-breaking. The older spell-checker (NaturalLanguage6.dll) used on Win7 does not have this defect.

MS.Data.Text.WordsSegmenter has declined to 'fix'/'modify this behavior, so we are adding a workaround in WPF.

This change introduces a wrapper on top of the word-breaker and recalculates the tokens, taking into account dropped text that occurs immediately after a token that is likely to fail spell-checking.
@vatsan-madhavan
Copy link
Member Author

@SamBent I'd like to present this to .NET Core shiproom since this is now checked into .NET 4.8 servicing. Can I get your signoff on this?

@vatsan-madhavan vatsan-madhavan requested a review from SamBent July 6, 2020 22:48
@vatsan-madhavan vatsan-madhavan changed the title WIP: [release/3.1] SpellCheck incorrectly detects common abbreviations like "e.g." as misspellings [release/3.1] SpellCheck incorrectly detects common abbreviations like "e.g." as misspellings Jul 17, 2020
@vatsan-madhavan vatsan-madhavan added Servicing-consider Servicing-rejected and removed * NO MERGE * metadata: The PR is not ready for merge yet (see discussion for detailed reasons) Servicing-consider labels Jul 17, 2020
@vatsan-madhavan
Copy link
Member Author

This was rejected by the .NET Core Tactics (the forum where servicing related approval decisions are made for .NET Core).

A request for fixing this on .NET Framework 4.8 and 4.7.2 had been initiated by one customer, and it has since been fixed in those versions of .NET (it may still be in the process of being delivered externally at this time). It has also been fixed in master (i.e., .NET 5).

Given (a) this issue was introduced in .NET 4.6.2, which was a considerable time ago, (b) the first substantial report of this bug that came to the attention of the team didn't happen until very recently, and (c) the customer who requested fixes in .NET 4.8/4.7.2 hasn't expressed immediate plans to migrate to .NET Core, .NET Core Tactics stakeholders felt that this fix need not be taken in .NET Core 3.1 servicing at this time.

If someone absolutely needs this fix on 3.1 and is unable to adopt .NET 5 (which will be complete not too far in the future) instead, then we can resurrect the discussion easily enough.

@vatsan-madhavan vatsan-madhavan deleted the 3.1/speller-fixup branch July 20, 2020 18:30
@ghost ghost locked as resolved and limited conversation to collaborators Apr 12, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

PR metadata: Label to tag PRs, to facilitate with triage Servicing-rejected

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants