Exclude "bad words" from Corpora? #256

Closed
hugovk opened this issue Mar 17, 2017 · 5 comments

Comments

@hugovk
Collaborator

hugovk commented Mar 17, 2017

I ran wordfilter on Corpora:

./data/foods/herbs_n_spices.json may contain bad words: [u'spices']
./data/religion/fictional_religions.json may contain bad words: [u'Esoteric Order of Dagon']
./data/religion/religions.json may contain bad words: [u'Ancient Egyptian religion', u'Syrian-Egyptic Gnosticism']
./data/societies_and_groups/animal_welfare.json may contain bad words: [u'Japan', u'Pakistan', u'Egypt']
./data/words/word_clues/clues_five.json may contain bad words: [u'spade', u'tardy', u'blame', u'spice', u'flame', u'spook']
./data/words/word_clues/clues_four.json may contain bad words: [u'gash', u'lame', u'gimp']
./data/words/word_clues/clues_six.json may contain bad words: [u'spooks', u'spooky', u'blames', u'flames', u'script', u'spices', u'retard']

(This excludes keys matching "Description", "description", "descriptions", "scripts", "wine_descriptions".)

  1. Should any of those found words be removed from Corpora?
  2. Would it be useful to pop this into the CI?
    a. Just for information, or
    b. To fail a build if it finds something?

If yes to 2a, should some of those be added to a whitelist?
If yes to 2b, there needs to be a whitelist for any which aren't removed.
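For the sake of discussion, here is a minimal sketch of what such a CI check might look like. It is not the script that produced the output above and it does not call wordfilter itself: `BLACKLIST` and `ALLOWLIST` are hypothetical stand-ins, and the function names are made up for illustration. Only the excluded keys come from the note above.

```python
import json

# Hypothetical stand-ins: the real wordfilter ships its own, longer blacklist.
BLACKLIST = ["lame", "homo"]
# Keys skipped during the scan, as listed in the note above.
EXCLUDED_KEYS = {"Description", "description", "descriptions",
                 "scripts", "wine_descriptions"}
# Hypothetical whitelist of entries that are flagged but acceptable.
ALLOWLIST = {"blame", "flame"}


def is_flagged(text):
    """Case-insensitive substring match against the blacklist."""
    lowered = text.lower()
    return any(bad in lowered for bad in BLACKLIST)


def find_flagged(node):
    """Recursively collect flagged strings from parsed JSON, skipping excluded keys."""
    if isinstance(node, str):
        return [node] if is_flagged(node) else []
    if isinstance(node, list):
        return [hit for item in node for hit in find_flagged(item)]
    if isinstance(node, dict):
        return [hit
                for key, value in node.items()
                if key not in EXCLUDED_KEYS
                for hit in find_flagged(value)]
    return []


def unapproved_hits(json_text):
    """Flagged strings that are not covered by the whitelist."""
    return [hit for hit in find_flagged(json.loads(json_text))
            if hit not in ALLOWLIST]


def exit_code(hits, fail_on_hits):
    """Option 2a (information only) always returns 0; option 2b returns 1 on any hit."""
    return 1 if (hits and fail_on_hits) else 0


if __name__ == "__main__":
    sample = ('{"description": "a lame description", '
              '"words": ["blame", "lame", "gimp", "flame"]}')
    hits = unapproved_hits(sample)
    print("may contain bad words:", hits)
    print("exit code (2a):", exit_code(hits, fail_on_hits=False))
    print("exit code (2b):", exit_code(hits, fail_on_hits=True))
```

With option 2b, anything flagged has to be either removed from the data or added to the whitelist before the build passes, which is why the whitelist becomes mandatory in that case.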

@enkiv2
Contributor

enkiv2 commented Mar 17, 2017 via email

@blinkdog

This may also be relevant to the topic:
https://www.2600.com/googleblacklist/

@hugovk
Collaborator Author

hugovk commented Mar 17, 2017

Yes, wordfilter falls for Scunthorpe by design:

Also note that due to the complexities of the English language, I am considering anything containing the substring of a bad word to be blacklisted. For example, even though "homogenous" is not a bad word, it contains the substring "homo" and it gets filtered. The reason for this is that new slang pops up all the time using compound words and I can't possibly keep up with it. I'm willing to lose a few words like "homogenous" and "Pakistan" in order to avoid false negatives.

https://github.com/dariusk/wordfilter#documentation
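To make the trade-off concrete, here is a small sketch comparing substring matching with whole-word matching. It is an illustration only, not wordfilter's actual code: the two-entry `BLACKLIST` and both function names are assumptions.

```python
import re

# Hypothetical stand-in for the real blacklist.
BLACKLIST = ["homo", "lame"]


def substring_match(text):
    """Flag text if any blacklisted entry appears anywhere inside it."""
    lowered = text.lower()
    return any(bad in lowered for bad in BLACKLIST)


def whole_word_match(text):
    """Flag text only if a complete word equals a blacklisted entry."""
    words = re.findall(r"[a-z]+", text.lower())
    return any(word in BLACKLIST for word in words)


if __name__ == "__main__":
    for word in ["homogenous", "blame", "lamest", "lame"]:
        print(word, substring_match(word), whole_word_match(word))
```

Substring matching flags "homogenous" and "blame" (false positives, the Scunthorpe problem), while whole-word matching lets a derived form like "lamest" through (a false negative). The quoted documentation accepts the former in order to avoid the latter.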

@enkiv2
Contributor

enkiv2 commented Mar 20, 2017 via email

@dariusk
Owner

dariusk commented Mar 20, 2017

Yup -- I considered automatically testing for wordfilter words very early on but decided against it. This is curated and we can submit PRs to remove words we don't want.

@dariusk dariusk closed this as completed Mar 20, 2017