dictionary inconsistency #115

Closed
mhillendahl opened this issue Feb 8, 2022 · 2 comments · Fixed by #165


mhillendahl commented Feb 8, 2022

Python 3.9.5
Windows 10 x64

Expected Behavior
Each language's dictionary contains only words from that language. Words written in a different language, deliberately or by mistake, are reported as unknown.

Observed Behavior
Each language's dictionary appears to contain its own words plus words from one or more other languages.
en contains words from English as expected, but also from Spanish, French, and German.
es contains words from Spanish as expected, but also from English and French.
fr contains words from French as expected, but also from English.
pt contains words from Portuguese as expected, but also from English.
de contains words from German as expected, but also from English.

Impact
Typos in the selected language are undetectable if they incidentally match one of the extra languages. (see console output below)

Steps to Reproduce

spellCheckerTest.py

from spellchecker import SpellChecker

langs = ['en', 'es', 'fr', 'pt', 'de']
words = ['word', 'palabra', 'mot', 'palavra', 'wort', 'notaword']

for code in langs:
    spell = SpellChecker(language=code)
    known = spell.known(words)
    unknown = spell.unknown(words)
    # Output format: <language> : <known words> (<unknown words>)
    print(f'{code} : {", ".join(sorted(known))} ({", ".join(sorted(unknown))})')

console output

en : mot, palabra, word, wort (notaword, palavra)
es : mot, palabra, word (notaword, palavra, wort)
fr : mot, word (notaword, palabra, palavra, wort)
pt : palavra, word (mot, notaword, palabra, wort)
de : word, wort (mot, notaword, palabra, palavra)
mhillendahl changed the title from "english dictionary contains spanish words" to "dictionary inconsistency" on Feb 9, 2022
barrust (Owner) commented Mar 19, 2022

The data used to build the dictionaries are pulled from the opensubtitles project. The build process is automated using script/build_dictionaries.py. I don't know of a good method to automatically find and flag each of these "cross-overs" but any help in making the build_dictionaries.py script more robust would be appreciated.
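
One possible way to flag such cross-overs automatically is to compare each word's relative frequency across the per-language word counts. The following is only a minimal sketch under stated assumptions, not what build_dictionaries.py does: `flag_crossovers`, the `ratio` threshold, and the toy counts are all hypothetical and invented for illustration. A word is flagged when its share of another language's corpus is at least `ratio` times its share of the target language's corpus.

```python
from typing import Dict, List

WordFreq = Dict[str, int]


def flag_crossovers(
    dictionaries: Dict[str, WordFreq], target: str, ratio: float = 10.0
) -> List[str]:
    """Return words in `target` that look like they belong to another language.

    Hypothetical helper: compares each word's share of its own corpus with
    its share of every other corpus and flags it when another language
    uses it at least `ratio` times more often.
    """
    # Total count per language, used to turn raw counts into shares.
    totals = {lang: sum(freq.values()) or 1 for lang, freq in dictionaries.items()}
    flagged = []
    for word, count in dictionaries[target].items():
        own_share = count / totals[target]
        for lang, freq in dictionaries.items():
            if lang == target or word not in freq:
                continue
            other_share = freq[word] / totals[lang]
            if other_share >= ratio * own_share:
                flagged.append(word)
                break  # one stronger language is enough to flag the word
    return sorted(flagged)


# Toy counts, invented for illustration only (not real dictionary data).
toy = {
    "en": {"word": 900, "the": 9000, "palabra": 2, "mot": 3},
    "es": {"palabra": 800, "el": 9000, "word": 4},
    "fr": {"mot": 700, "le": 9000, "word": 5},
}
print(flag_crossovers(toy, "en"))  # ['mot', 'palabra']
```

A check like this would miss words that are legitimately common in several languages, so the flagged list is probably best treated as candidates for review rather than removed outright.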

stephencawood (Contributor) commented Jun 6, 2023

There are almost 1.5 million entries in the English dictionary, which is clearly far too many. And the extra entries are not limited to French, Spanish, and German. For example:

"開かれた": 1,
"闇を切り裂いてさ": 2,
"阎东生": 1,
"阿昭": 1,
"降り出した雪": 10,
"限りがあるってのを知っていてムダにしちゃうんだろう": 2,

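Entries like the ones above could be dropped with a script check before the dictionary is written. This is a rough sketch only: `is_latin_word` is a hypothetical helper, and the assumption that English entries consist of Latin letters plus apostrophes and hyphens is mine, not something taken from the build script. It relies on the standard-library `unicodedata.name`, whose names for Latin letters start with "LATIN".

```python
import unicodedata


def is_latin_word(word: str) -> bool:
    """Return True if every letter in `word` is a Latin-script letter.

    Hypothetical filter: apostrophes and hyphens are allowed, any other
    non-letter character rejects the word, and at least one letter is required.
    """
    has_letter = False
    for ch in word:
        if ch.isalpha():
            # Unicode names of Latin letters begin with "LATIN",
            # e.g. "LATIN SMALL LETTER E WITH ACUTE".
            if not unicodedata.name(ch, "").startswith("LATIN"):
                return False
            has_letter = True
        elif ch not in "'-":
            return False
    return has_letter


# Sample entries: two from the list above plus invented Latin-script ones.
entries = {"word": 5, "開かれた": 1, "阿昭": 1, "don't": 3, "café": 2}
cleaned = {w: c for w, c in entries.items() if is_latin_word(w)}
print(cleaned)  # {'word': 5, "don't": 3, 'café': 2}
```

This only removes entries in other scripts. French, Spanish, and German words use Latin letters too, so they would pass this filter and need a separate check.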