English spellchecking #84

Closed
mrodin52 opened this issue Feb 2, 2021 · 5 comments · Fixed by #87

Comments

mrodin52 commented Feb 2, 2021

Hello Team!
I am new to the project and I have a question.

I use Python 3.7 and ran into a problem with this test program:

from spellchecker import SpellChecker
spell = SpellChecker()
split_words = spell.split_words
spell_unknown = spell.unknown

words = split_words("That's how t and s don't fit.")
print(words)
misspelled = spell_unknown(words)
print(misspelled)

With pyspellchecker ver 0.5.4 the printout is:

['that', 's', 'how', 't', 'and', 's', 'don', 't', 'fit']
set()

So the free-standing 't' and 's' are not marked as errors, and neither are the contractions.

If I change the phrase to:

words = split_words("That is how that's and don't do not fit.")

and use pyspellchecker ver 0.5.6 the printout is:

['that', 'is', 'how', 'that', 's', 'and', 'don', 't', 'do', 'not', 'fit']
{'t', 's'}

So contractions are marked as mistakes again.

(I read barrust's comment of Oct 22, 2019.)

Please assist.

@mrodin52 mrodin52 changed the title Hello Team! English spellshecking Feb 2, 2021
@mrodin52 mrodin52 changed the title English spellshecking English spellchecking Feb 2, 2021
Owner

barrust commented Feb 6, 2021

The issue is in the split_words() function. It uses a simple regex that extracts runs of contiguous letters, so that's becomes the two words that and s. Try splitting on whitespace instead of using the utility function.

Note that the contraction itself isn't marked as a mistake; the problem is that it is turned into more than one word. So don't becomes don and t; don is a real word in English but t is not. don't should be checked as is. The issue is that split_words() isn't preserving contractions.
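
To make the difference concrete, here is a minimal sketch comparing the two splitting strategies on the sentence from the original report. The function `split_on_letters` is a hypothetical stand-in for a letters-only regex splitter and is not the library's actual code; `split_on_whitespace` is likewise only an illustrative name.

```python
import re


def split_on_letters(text):
    # Hypothetical stand-in for a letters-only splitter (an assumption,
    # not pyspellchecker's real implementation): every run of a-z is a word,
    # so an apostrophe ends the current word.
    return re.findall(r"[a-z]+", text.lower())


def split_on_whitespace(text):
    # Plain whitespace split: contractions survive, but so does punctuation.
    return text.lower().split()


sentence = "That's how t and s don't fit."

print(split_on_letters(sentence))
# ['that', 's', 'how', 't', 'and', 's', 'don', 't', 'fit']

print(split_on_whitespace(sentence))
# ["that's", 'how', 't', 'and', 's', "don't", 'fit.']
```

The whitespace split keeps that's and don't intact, but the trailing period stays attached to the last word, so punctuation still has to be handled separately.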

Author

mrodin52 commented Feb 6, 2021

I am afraid that is not a solution, since the text contains punctuation marks (see the last word in my example), and "fit." ends up in misspelled.

By the way, what is the difference between ver 0.5.4 and ver 0.5.6 that produced different spelling results?

Owner

barrust commented Feb 6, 2021

You can see the differences in the changelog. The biggest are new dictionaries that attempt to fix these exact issues, a fix for Python 3.9, and the removal of Python 2.7 support.

As for how to parse your string, that isn't really this library's goal. The goal is to be simple to use, pure Python, and free of dependencies.

I used the NLTK WhitespaceTokenizer to build the dictionaries (non-Spanish). It is up to you to figure out how you would like to parse your text to make it testable. If there is a good method that can be used to update the simplistic split_words() function, then a PR would be greatly appreciated.

For your instance, perhaps something like this would work:

from spellchecker import SpellChecker
spell = SpellChecker()

words = "That is how that's and don't do not fit.".split()
misspelled = spell.unknown(words)
# NOTE: this is based on a simple split. Up to the user to figure out what is best!
# This example only deals with trailing punctuation, not leading.
for w in misspelled:
    if w.endswith((".", "?", ",", '"', "'", "!", "]", ")")) and w[:-1] in spell:
        # The word is not misspelled; it just had punctuation attached!
        # Likely, you would want to make sure there is not more than one
        # punctuation mark in a row, etc. But this is a possible solution
        # for your exact problem.
        print("({}) is not misspelled!".format(w))
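
The snippet above only handles a single trailing punctuation mark. As a rough sketch of one way to cover leading and repeated punctuation as well, the words can be stripped before they are checked. The helper name `tokenize` is hypothetical and not part of pyspellchecker; it uses only the standard library.

```python
import string


def tokenize(text):
    # Hypothetical helper (not part of pyspellchecker): split on whitespace,
    # then strip punctuation from both ends of every token. Apostrophes
    # inside a word (don't, that's) are left alone because str.strip only
    # touches the ends of the string.
    stripped = (w.strip(string.punctuation) for w in text.lower().split())
    # Drop tokens that consisted of punctuation only.
    return [t for t in stripped if t]


print(tokenize("That is how that's and don't do not fit."))
# ['that', 'is', 'how', "that's", 'and', "don't", 'do', 'not', 'fit']
```

The resulting list could then be passed to spell.unknown(). One trade-off of this approach is that a trailing possessive apostrophe (as in dogs') is stripped too, which may or may not be what you want.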

Author

mrodin52 commented Feb 6, 2021

Understood. Thank you very much.

@mrodin52 mrodin52 closed this as completed Feb 6, 2021
Owner

barrust commented Feb 12, 2021

Perhaps something like this would work?

From StackOverflow:

import re

rgx = re.compile(r"(\w[\w']*\w|\w)")
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!' 'A a'"
rgx.findall(s)

If this makes sense, I can update the basic split_words() function to do something like this.
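
For illustration, a sketch of what such a replacement might look like, applied to the sentence from the original report. The name `split_words_keep_contractions` is hypothetical, and this is not necessarily how the library ended up implementing the change.

```python
import re

# A word starts and ends with a word character and may contain apostrophes
# in between; a single word character also counts as a word.
# Note that \w also matches digits and underscores.
WORD_RE = re.compile(r"\w[\w']*\w|\w")


def split_words_keep_contractions(text):
    # Hypothetical name, an assumption for this sketch only.
    # Lowercase first, then pull out the words; surrounding quotes and
    # punctuation are dropped, internal apostrophes are kept.
    return WORD_RE.findall(text.lower())


print(split_words_keep_contractions("That is how that's and don't do not fit."))
# ['that', 'is', 'how', "that's", 'and', "don't", 'do', 'not', 'fit']
```

With this pattern, the quoted fragments in the StackOverflow sample ('Where are you!!' and 'A a') lose their surrounding quotes, while John's and wasn't stay whole.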

@barrust barrust reopened this Feb 12, 2021
@barrust barrust mentioned this issue Feb 22, 2021