words.txt lacks words that are in words_alpha.txt #93

Open
carlosaguilarmelchor opened this issue Jan 13, 2021 · 2 comments · May be fixed by #143

Comments

@carlosaguilarmelchor

carlosaguilarmelchor commented Jan 13, 2021

Example:

# cat words_alpha.txt|grep ^ned                                        
ned
nedder
neddy
neddies
nederlands
# cat words.txt|grep ^ned
nedder
neddies
#

The documentation states that words_alpha.txt is a subset of words.txt, which is apparently not the case at the moment.

@adsteel

adsteel commented Feb 6, 2022

I think this is just a case sensitivity issue.

$ cat words.txt|grep -i ^ned
NED
Neda
NEDC
Nedda
nedder
Neddy
Neddie
neddies
Neddra
Nederland
Nederlands
Nedi
Nedra
Nedrah
Nedry
Nedrow
Nedrud

While it would be nice for these files to be perfectly formatted, this is a good reminder to clean your data before doing calculations.
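To illustrate the case-sensitivity point, here is a minimal Python sketch that compares the two lists with and without lowercasing. The `missing_words` helper and the in-memory sample lists are hypothetical; the sample words are copied from the `grep` outputs earlier in this thread, not read from the real files.

```python
def missing_words(subset_words, full_words, ignore_case=True):
    """Return the words in subset_words that do not occur in full_words.

    Hypothetical helper for illustration; not part of the repository.
    """
    if ignore_case:
        # Normalize both sides to lowercase before comparing.
        subset_words = (w.lower() for w in subset_words)
        full_words = (w.lower() for w in full_words)
    return set(subset_words) - set(full_words)


# Sample data taken from the grep outputs quoted in this thread.
alpha_sample = ["ned", "nedder", "neddy", "neddies", "nederlands"]
words_sample = [
    "NED", "Neda", "NEDC", "Nedda", "nedder", "Neddy", "Neddie", "neddies",
    "Neddra", "Nederland", "Nederlands", "Nedi", "Nedra", "Nedrah", "Nedry",
    "Nedrow", "Nedrud",
]

# Case-sensitive comparison reports words that only differ in capitalization.
print(sorted(missing_words(alpha_sample, words_sample, ignore_case=False)))
# → ['ned', 'neddy', 'nederlands']

# Case-insensitive comparison finds nothing missing in this sample.
print(sorted(missing_words(alpha_sample, words_sample)))
# → []
```

For this sample the apparent gaps disappear once both lists are lowercased, which matches the explanation above. It says nothing about the full files.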

@JaviSorribes

This problem does exist, however. I found 25 missing words with these python3 commands (pasted here for reference):

> import requests
> r = requests.get('https://raw.githubusercontent.com/dwyl/english-words/words.txt')
> r.status_code
200
> w = set(r.text.lower().split())
> len(w)
466546
> r = requests.get('https://raw.githubusercontent.com/dwyl/english-words/words_alpha.txt')
> r.status_code
200
> wa = set(r.text.lower().split())
> len(wa)
370103
> missing = wa - w
> len(missing)
25
> missing
{'preinferredpreinferring', 'stegnosisstegnotic', 'tangantangan', 'false', 'sturdiersturdies', 'peroxidicperoxiding', 'gynecicgynecidal', 'coevolvedcoevolves', 'preobtrudingpreobtrusion', 'kestrelkestrels', 'aliyahaliyahs', 'coracoprocoracoid', 'cylindrocylindric', 'killeekillee', 'antinganting', 'epigonousepigons', 'snailfishessnailflower', 'outwardsoutwarred', 'regeneratoryregeneratress', 'cryptocurrency', 'quadriquadric', 'subsultorysubsultus', 'brigantinebrigantines', 'caducecaducean', 'hypophypophysism'}

Note that there is a separate problem: several words appear to have been merged together somehow. Even so, it is also true that not all words in words_alpha.txt are in words.txt (e.g. "false").
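One possible way to tell merged entries apart from genuinely missing words is to check whether a token splits into two known words. The sketch below is only an illustration under assumptions: `split_merged` is a hypothetical helper, and the small `vocabulary` set is made up for the example rather than loaded from either file.

```python
def split_merged(token, vocabulary):
    """Return every (left, right) split of token where both halves are known words.

    Hypothetical helper for illustration; vocabulary is any set of lowercase words.
    """
    return [
        (token[:i], token[i:])
        for i in range(1, len(token))
        if token[:i] in vocabulary and token[i:] in vocabulary
    ]


# Tiny stand-in vocabulary (an assumption, not the real word list).
vocabulary = {"kestrel", "kestrels", "aliyah", "aliyahs", "false"}

print(split_merged("kestrelkestrels", vocabulary))
# → [('kestrel', 'kestrels')]

print(split_merged("false", vocabulary))
# → []
```

A token with at least one valid split is likely two entries that lost their line break, while a token with no split (such as "false") is a candidate for a word that is truly absent. Legitimate compound words would also split, so the result needs a manual check.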

MBM1607 added a commit to MBM1607/english-words that referenced this issue Sep 2, 2022
- Resolves dwyl#93
  - By syncing using the given gen.sh script

- Resolves dwyl#135
  - By adding missing words

- Resolves dwyl#125
  - By adding missing words

- Resolves dwyl#123
  - By adding the missing words

- Resolves dwyl#122
  - By removing the incorrect two words, the correct words were already
    present
    - negqtiator
    - penetrolqgy
@MBM1607 linked a pull request on Sep 2, 2022 that will close this issue