-
Notifications
You must be signed in to change notification settings - Fork 1.2k
SpellCheck incorrectly detects common abbreviations like "e.g." as misspellings #2871
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
7 commits
Select commit
Hold shift + click to select a range
0fd06bd
WPF's spell-checker on Win8+ ignores punctuations like '.' (period) a…
vatsan-madhavan e4161b0
Limit the number of alt-forms considered when running augmented wordb…
vatsan-madhavan c00984e
Fixing use of inheritdoc
vatsan-madhavan 30fa0e9
Marking S.W.D.WinRTSpellerInterop.TextRange as `internal readonly` an…
vatsan-madhavan aab2b4e
Improving readability
vatsan-madhavan 042322a
Return Array.Empty when there are no tokens
vatsan-madhavan 3d24e0d
When segments==null, init tokens to Array.Empty
vatsan-madhavan File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
191 changes: 191 additions & 0 deletions
191
...t.Wpf/src/PresentationFramework/System/Windows/Documents/WinRTSpellerInteropExtensions.cs
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,191 @@ | ||
| // Licensed to the .NET Foundation under one or more agreements. | ||
| // The .NET Foundation licenses this file to you under the MIT license. | ||
| // See the LICENSE file in the project root for more information. | ||
|
|
||
| using MS.Internal.WindowsRuntime.Windows.Data.Text; | ||
| using System.Collections.Generic; | ||
|
|
||
| using System.Windows.Documents.MsSpellCheckLib; | ||
| using System.Windows.Documents.Tracing; | ||
| using static System.Windows.Documents.WinRTSpellerInterop; | ||
|
|
||
| namespace System.Windows.Documents | ||
| { | ||
| internal static class WinRTSpellerInteropExtensions | ||
| { | ||
| /// <summary> | ||
| /// Tokenizes <paramref name="text"/> using <paramref name="segmenter"/>, and then identifies fixes-up | ||
| /// the tokens to account for any missed text "in-between" those tokens. | ||
| /// </summary> | ||
| /// <param name="segmenter">Word-breaker instance</param> | ||
| /// <param name="text">The text being tokenized</param> | ||
| /// <param name="spellChecker">The spell-checker instance used to augment the tokenizing process</param> | ||
| /// <param name="owner">Calling <see cref="WinRTSpellerInterop"/> instance</param> | ||
| /// <returns></returns> | ||
| /// <remarks> | ||
| /// Windows.Data.Text.WordsSegmenter tends to drop punctuation characters like period ('.') | ||
| /// when tokenizing text. Though this behavior is compatible with a vast majority of text-processing | ||
| /// scenarios (like word-counting), it is not ideal for spell-checking. | ||
| /// | ||
| /// In this method, the following <paramref name="spellChecker"/> augmented heuristic is applied to update the token-list generated by | ||
| /// <paramref name="segmenter"/>. | ||
| /// | ||
| /// - Identify if any text 'missingFragment' has been dropped by the <paramref name="segmenter"/> | ||
| /// - If the token immediately preceding 'missingFragment', previousToken, has a spelling error, then attempt to | ||
| /// create new candiate tokens in the following order: | ||
| /// | ||
| /// previousToken + missingFragment[0..0] | ||
| /// previousToken + missingFragment[0..1] | ||
| /// previousToken + missingFragment[0..2] | ||
| /// ... | ||
| /// ... | ||
| /// previousToken + missingFragment[0..LEN-1], where LEN = LEN(missingFragment) | ||
| /// | ||
| /// - Select the first candidate token that is free of spelling errors, and replace 'previousToken' with it. | ||
| /// - For performance reasons, we choose a constant MAXLEN = 4 such that when LEN > MAXLEN, only MAXLEN | ||
| /// tokens are considered. | ||
| /// - MAXLEN = 4 is a somewhat arbitrary choice, though it seems more than sufficient to address common | ||
| /// problems this heuristic is intended to help with. | ||
| /// | ||
| /// - Typical word-breaking problems that have been observed empirically involve only one missed character, | ||
| /// for which MAXLEN=1 would be sufficient. MAXLEN=4 is chosen as a sufficiently-large tradeoff between | ||
| /// correctness and performance. | ||
| /// | ||
| /// - Also see https://github.com/dotnet/wpf/pull/2753#issuecomment-602120768 for a discussion related to this. | ||
| /// </remarks> | ||
| public static IReadOnlyList<SpellerSegment> ComprehensiveGetTokens( | ||
| this WordsSegmenter segmenter, | ||
| string text, | ||
| SpellChecker spellChecker, | ||
| WinRTSpellerInterop owner) | ||
| { | ||
| IReadOnlyList<WordSegment> tokens = segmenter?.GetTokens(text) ?? Array.Empty<WordSegment>(); | ||
| if (tokens.Count == 0) | ||
| { | ||
| return Array.Empty<SpellerSegment>(); | ||
| } | ||
|
|
||
| var allTokens = new List<SpellerSegment>(); | ||
| int predictedNextTokenStartPosition = 0; | ||
|
|
||
| for (int i = 0; i < tokens.Count; i++) | ||
| { | ||
| int nextTokenStartPosition = (int)tokens[i].SourceTextSegment.StartPosition; | ||
| int nextTokenLength = (int)tokens[i].SourceTextSegment.Length; | ||
|
|
||
| if (spellChecker != null) | ||
| { | ||
| if (nextTokenStartPosition > predictedNextTokenStartPosition) | ||
| { | ||
| // There is a "gap" between the last recorded token and the current token. | ||
| // Identify the missing token and add it as a "supplementary word segment" - but only if the token | ||
| // turns out to be a substantial one (i.e., if the string is non-blank/non-empty). | ||
| var missingFragment = | ||
| new SpellerSegment( | ||
| text, | ||
| new WinRTSpellerInterop.TextRange( | ||
| predictedNextTokenStartPosition, | ||
| nextTokenStartPosition - predictedNextTokenStartPosition), | ||
| spellChecker, | ||
| owner); | ||
| if (allTokens.Count > 0) | ||
| { | ||
| var substToken = GetSpellCheckCleanSubstitutionToken(spellChecker, text, allTokens[allTokens.Count - 1], missingFragment); | ||
| if (substToken != null) | ||
| { | ||
| allTokens[allTokens.Count - 1] = new SpellerSegment(text, substToken.Value, spellChecker, owner); | ||
| } | ||
| } | ||
| } | ||
| } | ||
|
|
||
|
|
||
| allTokens.Add( | ||
| new SpellerSegment( | ||
| text, | ||
| new WinRTSpellerInterop.TextRange( | ||
| nextTokenStartPosition, | ||
| nextTokenLength), | ||
| spellChecker, | ||
| owner)); | ||
| predictedNextTokenStartPosition = nextTokenStartPosition + nextTokenLength; | ||
| } | ||
|
|
||
| if (tokens.Count > 0 && | ||
| spellChecker?.ComprehensiveCheck(tokens[tokens.Count - 1].Text)?.Count != 0 && | ||
| predictedNextTokenStartPosition < text.Length) | ||
| { | ||
| // There is a token possibly missing at the end of the string | ||
| var missingFragment = | ||
| new SpellerSegment( | ||
| text, | ||
| new WinRTSpellerInterop.TextRange( | ||
| predictedNextTokenStartPosition, | ||
| text.Length - predictedNextTokenStartPosition), | ||
| spellChecker, | ||
| owner); | ||
|
|
||
| if (allTokens.Count > 0) | ||
| { | ||
| var substToken = GetSpellCheckCleanSubstitutionToken(spellChecker, text, allTokens[allTokens.Count - 1], missingFragment); | ||
| if (substToken != null) | ||
| { | ||
| allTokens[allTokens.Count - 1] = new SpellerSegment(text, substToken.Value, spellChecker, owner); | ||
| } | ||
| } | ||
| } | ||
|
|
||
| return allTokens.AsReadOnly(); | ||
| } | ||
|
|
||
| /// <summary> | ||
| /// Checks through combinations of <paramref name="lastToken"/> + substrings(<paramref name="missingFragment"/>) and | ||
| /// returns the first spellcheck-clean result. | ||
| /// </summary> | ||
| /// <param name="spellChecker">Spell-checker</param> | ||
| /// <param name="documentText">Overall document text within which the text-ranges are computed</param> | ||
| /// <param name="lastToken">Previous token immediately preceding <paramref name="missingFragment"/></param> | ||
| /// <param name="missingFragment">The missing-fragment identified immediately after <paramref name="lastToken"/></param> | ||
| /// <returns></returns> | ||
| /// <remarks> | ||
| /// See note about MAXLEN in <see cref="ComprehensiveGetTokens(WordsSegmenter, string, SpellChecker, WinRTSpellerInterop)"/> | ||
| /// which explains the rationale behind the value of the constant AlternateFormsMaximumCount. | ||
| /// </remarks> | ||
| private static WinRTSpellerInterop.TextRange? GetSpellCheckCleanSubstitutionToken( | ||
| SpellChecker spellChecker, | ||
| string documentText, | ||
| SpellerSegment lastToken, | ||
| SpellerSegment missingFragment) | ||
| { | ||
| const int AlternateFormsMaximumCount = 4; | ||
|
|
||
| if (string.IsNullOrWhiteSpace(missingFragment?.Text) || | ||
| string.IsNullOrWhiteSpace(lastToken?.Text) || | ||
| string.IsNullOrWhiteSpace(documentText)) | ||
| { | ||
| return null; | ||
| } | ||
|
|
||
| int altFormsCount = Math.Min(missingFragment.TextRange.Length, AlternateFormsMaximumCount); | ||
| var spellingErrors = spellChecker?.ComprehensiveCheck(lastToken.Text); | ||
| if (spellingErrors?.Count != 0) | ||
| { | ||
| // One of the substring-permutations of the missingFragment - when concatenated with 'lastToken' - could be a viable | ||
| // replacement for 'lastToken' | ||
| for (int i = 1; i <= altFormsCount; i++) | ||
| { | ||
| var altForm = documentText.Substring(lastToken.TextRange.Start, lastToken.TextRange.Length + i).TrimEnd(); | ||
| if (spellChecker?.ComprehensiveCheck(altForm)?.Count == 0) | ||
| { | ||
| // Use this altForm in place lastToken | ||
| return new WinRTSpellerInterop.TextRange( | ||
| lastToken.TextRange.Start, | ||
| altForm.Length); | ||
| } | ||
| } | ||
| } | ||
|
|
||
| return null; | ||
| } | ||
| } | ||
| } |
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.