Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: use wordnet instead of spell to find English words #3242

Merged
merged 3 commits into from Jul 3, 2023

Conversation

theseion
Copy link
Contributor

No description provided.

@fzipi
Copy link
Member

fzipi commented Jun 21, 2023

Ok, I went for another tool in the case of the Python version: https://pypi.org/project/spylls/. Do we want to use the same tool for both? E.g https://pypi.org/project/wn/ instead.

util/fp-finder/spell.sh Outdated Show resolved Hide resolved
@dune73
Copy link
Member

dune73 commented Jun 21, 2023

@theseion What's the advantage of wordnet over the spell/ispell/aspell family for us?

@fzipi are you booked on using python for the PHP stuff or can we switch to shell and use wordnet there too? I do not see anything out of the ordinary in the python scripts. If we keep python, I think the gen.sh should be transferred to python as well.

@fzipi
Copy link
Member

fzipi commented Jun 21, 2023

I don't mind using other scripting possibilities, shell included (if someone wants to port it).

@theseion
Copy link
Contributor Author

theseion commented Jun 22, 2023

WordNet is a corpus with additional linguistics information. The other programs use dictionaries to correct spelling. These dictionaries also contain words that aren't really English words (because users would probably still want correction on them) like "xterm" and "comm". In addition, they weren't built for finding English words so their interfaces are really not built for solving our problem.

I spent some time with aspell and hunspell and compared results to WordNet and WordNet produces the least false positives. To give you an idea:

  • WordNet does not think that "7z" is a word
  • WordNet does not know "xterm" or "comm"
  • WordNet is case insensitive
  • WordNet detects words from variants automatically. For example, "printing" is correctly detected as a form of "print". aspell does not detect that (without additional arguments)

WordNet isn't perfect of course but it easily beat the other choices. Added benefit: WordNet is very easy to use in Python (e.g. with nltk), and probably other languages.

@theseion
Copy link
Contributor Author

Primitive example of Python with nltk:

# don't store data in home
export NLTK_DATA=/tmp/nltk_corpi
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet

words = set(wordnet.words())
'xterm' in words

@RedXanadu
Copy link
Member

RedXanadu commented Jun 22, 2023

WordNet wn seems to be less universal than aspell which was originally talked about using. So potentially a similar problem with WordNet to the original problem with using spell: it's easily available for some OSs but not others, or packages may not be up to date, or it relies on 3rd party repos, etc.

There does appear to be a Debian package, though, which maybe should be a good rule for what tooling we do or don't include and rely on.

@dune73
Copy link
Member

dune73 commented Jun 22, 2023

That is a point. But the features @theseion described make it very worthwhile in my eyes. Thus:

Importance of FP > ease of installation of developer tool

@theseion
Copy link
Contributor Author

Fair point @RedXanadu. I did check for a Debian package. And since it's in Homebrew, it's available on macOS and Linux. And there's a Windows binary on the WordNet page too.

@theseion theseion requested a review from fzipi July 3, 2023 19:02
@dune73
Copy link
Member

dune73 commented Jul 3, 2023

Look over this and it looks good. The spell.sh is not overly beautiful, but improving that was not the goal of this PR. So all OK and I'll merge now.

(@fzipi : I know you've been assigned during the meeting, but I had a few spare cycles.)

@dune73 dune73 merged commit f2e3541 into coreruleset:v4.0/dev Jul 3, 2023
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants