Strange words - bug? #59

Open
ndvbd opened this issue Mar 12, 2019 · 14 comments
Labels
question: A question needs to be answered before progress can be made on this issue

Comments

@ndvbd

ndvbd commented Mar 12, 2019

There are words like "isn" "aren" "wouldn" - smells like a bug?

@nelsonic
Member

@NadavB I suspect the apostrophes may have been removed by mistake. Feel free to re-add them in. PR gladly accepted.

nelsonic added the "question" label Mar 12, 2019
@ndvbd
Author

ndvbd commented Mar 14, 2019

How?

@dbrakman

@NadavB As far as I can tell, it looks like the process would be ad hoc for this project. Open words.txt in a text editor, use a regex to find pairs of lines like (.*nt$).*n't$ and delete the lines that look bad.
Then, remove copies of the deleted lines from words_alpha, update the corresponding zip files (why?), and submit a pull request.
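
The manual search described above could also be sketched in Python. This is only an illustration, not a script from the repository: the function name `find_contraction_stems` is made up, and it assumes "looks bad" means an entry that equals a listed "n't" contraction with the "'t" cut off.

```python
def find_contraction_stems(words):
    """Return entries that equal a listed "n't" contraction minus its "'t"."""
    present = set(words)
    stems = []
    for word in words:
        # Assumption: a suspicious stem ends in "n" and its "n't" form is also listed.
        if word.endswith("n") and word + "'t" in present:
            stems.append(word)
    return stems


sample = ["aren", "aren't", "can", "can't", "isn", "isn't", "wooden"]
print(find_contraction_stems(sample))  # ['aren', 'can', 'isn']
```

Note that "can" is flagged even though it is a real word, so the output is a list of candidates to review by hand, not a list to delete blindly.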

@ndvbd
Author

ndvbd commented Mar 16, 2019

But where is the current code that generated words_alpha.txt from words.txt so we can modify it?

@dbrakman

I don't think it was ever committed. What I see in the history is that someone just added a words_alpha file, and other people modified it directly.
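
Since no generator script is in the history, the following is only a guess at what one might look like. It assumes words_alpha is meant to hold the entries of words.txt that consist solely of ASCII letters; the function name `alpha_only` is hypothetical.

```python
import string

ASCII_LETTERS = set(string.ascii_letters)


def alpha_only(words):
    """Keep only the entries made up entirely of ASCII letters."""
    kept = []
    for word in words:
        word = word.strip()
        # Drop empty lines and anything containing digits, apostrophes, or symbols.
        if word and set(word) <= ASCII_LETTERS:
            kept.append(word)
    return kept


print(alpha_only(["aren", "aren't", "2nd", "zoo"]))  # ['aren', 'zoo']
```

A filter like this only drops entries with non-letter characters. A stem such as "aren" passes through unchanged if it is already present in the input.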

@PeskyPotato

Also is "giggish" actually a word?

@dbrakman

@LameLemon I couldn't find a definition for "giggish," and it looks like it came from the original infochimps dataset. You can probably remove it.

To address the original question of whether strange words are a bug, I think we should say no and close the thread. The underlying reason for the presence of nonwords is the choice of data sources. More carefully curated corpora either cost more or have fewer words.

@ndvbd
Author

ndvbd commented Mar 17, 2019

@dbrakman so can you commit it please? Otherwise people can't contribute to it...

@dbrakman

@NadavB I understand why it should be committed, but I don't have that script. I didn't make these lists.

@ndvbd
Author

ndvbd commented Mar 17, 2019

Ahh, I understand. So if one of the authors sees this thread, please commit it, thanks...

@ndvbd
Author

ndvbd commented Mar 17, 2019

@dbrakman It won't help. The word "aren" is found in words.txt as well. So unless someone shows how words.txt was extracted from the corpus, I don't think this whole repository is usable at all.

@campbellgoe

'aaa' isn't a word either

@tiptyus82

H

tiptyus82 referenced this issue in pixijs/examples-v4 May 13, 2019
@ShahoodulHassan

> Also is "giggish" actually a word?

Yes it is: https://www.wordnik.com/words/giggish


7 participants