English spellchecking #84

mrodin52 · 2021-02-02T08:01:00Z

Hello Team!
I am new to the project and I have a question.

I use Python 3.7 and have run into a problem with this test program:

from spellchecker import SpellChecker
spell = SpellChecker()                         
split_words = spell.split_words
spell_unknown = spell.unknown

words = split_words("That's how t and s don't fit.")
print(words)
misspelled = spell_unknown(words)
print(misspelled)

With pyspellchecker ver 0.5.4 the printout is:

['that', 's', 'how', 't', 'and', 's', 'don', 't', 'fit']
set()

So free-standing 't' and 's' are not marked as errors, and neither are contractions.

If I change the phrase to:

words = split_words("That is how that's and don't do not fit.")

and use pyspellchecker ver 0.5.6 the printout is:

['that', 'is', 'how', 'that', 's', 'and', 'don', 't', 'do', 'not', 'fit']
{'t', 's'}

So contractions are marked as mistakes again.

(I read barrust's comment of Oct 22, 2019.)

Please, assist.

barrust · 2021-02-06T13:50:48Z

So the issue is in the split_words() function. It uses a simple regex that pulls out runs of contiguous letters, so that's becomes two words: that and s. Try splitting on whitespace instead of using the utility function.

Note that the contraction itself isn't marked as a mistake; rather, it is turned into more than one word. So don't becomes don and t; don is a real word in English but t is not. don't should be checked as is. The issue is that split_words() isn't preserving contractions.
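To make the splitting behavior concrete, here is a minimal sketch. It assumes the simple regex is something like `\w+` applied to the lowercased text; `naive_split` is a hypothetical stand-in for illustration, not the library's actual implementation.

```python
import re


def naive_split(text):
    """Hypothetical stand-in for a simple word splitter.

    Assumes a pattern like \\w+: the apostrophe is not a word
    character, so every contraction is cut in two.
    """
    return re.findall(r"\w+", text.lower())


words = naive_split("That is how that's and don't do not fit.")
print(words)
# "that's" comes out as 'that' and 's'; "don't" comes out as 'don' and 't'
```

Under that assumption the pieces line up with the printouts reported above: the spell checker never sees don't as a single token, only don and t.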

mrodin52 · 2021-02-06T14:28:09Z

I am afraid that is not a solution, since there are punctuation marks (see the last word in my example), and "fit." ends up in misspelled.

By the way, what is the difference between ver 0.5.4 and ver 0.5.6 that produced different spelling results?

barrust · 2021-02-06T16:38:34Z

You can see the differences in the change log. The biggest are new dictionaries that attempt to fix these exact issues, a fix for Python 3.9, and the removal of Python 2.7 support.

As for how to parse your string, that isn't really this library's goal. The goal is to be simple to use, pure Python, and free of any dependencies.

I used the NLTK WhitespaceTokenizer to build the dictionaries (non-Spanish). It is up to you to figure out how you would like to parse your text to make it testable. If there is a good method that can be used to update the simplistic split_words() function, then a PR would be greatly appreciated.

For your instance, perhaps something like this would work:

from spellchecker import SpellChecker
spell = SpellChecker()

words = "That is how that's and don't do not fit.".split()
misspelled = spell.unknown(words)
# NOTE: this is based on a simple split. It is up to the user to figure out what is best!
# This example only deals with trailing punctuation, not leading.
for w in misspelled:
    if w.endswith((".", "?", ",", '"', "'", "!", "]", ")")) and w[:-1] in spell:
        # The word is not misspelled; it just had punctuation attached!
        # You would likely want to make sure there is not more
        # punctuation in a row, etc. But this is a possible
        # solution for your exact problem.
        print("({}) is not misspelled!".format(w))
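As a variation on the same idea, the punctuation could be stripped before the words are checked rather than afterwards. This is only a sketch under stated assumptions: `split_and_strip` is a hypothetical helper (not part of pyspellchecker), and it treats everything in `string.punctuation` as removable from both ends of a token.

```python
import string


def split_and_strip(text):
    """Hypothetical helper: split on whitespace, then strip leading and
    trailing punctuation from each token. Apostrophes inside a word
    survive, so contractions stay whole."""
    tokens = []
    for raw in text.split():
        token = raw.strip(string.punctuation).lower()
        if token:  # drop tokens that were nothing but punctuation
            tokens.append(token)
    return tokens


words = split_and_strip("That is how that's and don't do not fit.")
print(words)
```

The resulting list could then be passed to spell.unknown() as in the example above. One trade-off: a trailing apostrophe (as in a plural possessive) is stripped too, which may or may not be what you want.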

mrodin52 · 2021-02-06T19:35:45Z

Understood. Thank you very much.

barrust · 2021-02-12T13:15:30Z

Perhaps something like this would work?

From StackOverflow:

import re

rgx = re.compile(r"(\w[\w']*\w|\w)")
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!' 'A a'"
print(rgx.findall(s))

If this makes sense, I can update the basic split_words() function to do something like this.
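For illustration, here is roughly how that pattern behaves on the sentence from the original report. This is a sketch only: `split_keep_contractions` is a hypothetical name, and lowercasing the text first is an assumption, not necessarily what an updated split_words() would do.

```python
import re

# A word starts and ends with a word character; apostrophes are only
# allowed in between. Single-character words match the second branch.
WORD_RE = re.compile(r"\w[\w']*\w|\w")


def split_keep_contractions(text):
    """Hypothetical splitter that keeps contractions as single tokens."""
    return WORD_RE.findall(text.lower())


words = split_keep_contractions("That's how t and s don't fit.")
print(words)
# → ["that's", 'how', 't', 'and', 's', "don't", 'fit']
```

With this pattern the contractions stay intact and the trailing period is dropped, while the free-standing t and s remain separate tokens for the spell checker to judge.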

mrodin52 changed the title ~~Hello Team!~~ English spellshecking Feb 2, 2021

mrodin52 changed the title ~~English spellshecking~~ English spellchecking Feb 2, 2021

mrodin52 closed this as completed Feb 6, 2021

barrust reopened this Feb 12, 2021

barrust mentioned this issue Feb 22, 2021

Split word #87

Merged

barrust closed this as completed in #87 Feb 23, 2021
