Exclude "bad words" from Corpora? #256

hugovk · 2017-03-17T06:56:23Z

I ran wordfilter on Corpora:

./data/foods/herbs_n_spices.json may contain bad words: [u'spices']
./data/religion/fictional_religions.json may contain bad words: [u'Esoteric Order of Dagon']
./data/religion/religions.json may contain bad words: [u'Ancient Egyptian religion', u'Syrian-Egyptic Gnosticism']
./data/societies_and_groups/animal_welfare.json may contain bad words: [u'Japan', u'Pakistan', u'Egypt']
./data/words/word_clues/clues_five.json may contain bad words: [u'spade', u'tardy', u'blame', u'spice', u'flame', u'spook']
./data/words/word_clues/clues_four.json may contain bad words: [u'gash', u'lame', u'gimp']
./data/words/word_clues/clues_six.json may contain bad words: [u'spooks', u'spooky', u'blames', u'flames', u'script', u'spices', u'retard']

(This excludes keys matching "Description", "description", "descriptions", "scripts", "wine_descriptions".)

1. Should any of those found words be removed from Corpora?
2. Would it be useful to pop this into the CI?
   a. Just for information, or
   b. To fail a build if it finds something?

If yes to 2a, should some of those be added to a whitelist?
If yes to 2b, there needs to be a whitelist for any which aren't removed.


enkiv2 · 2017-03-17T11:24:29Z

Wordfilter seems to be picking up on word components that are slurs (i.e., the Scunthorpe problem). You could probably lower the false positive rate by splitting each string on non-alphabetic characters and anchoring your check to the start and end of each resulting token. It seems like only the word clues contain actual slurs, and some of those are probably being used in the more common non-slur sense: specifically 'spade', 'spook', and 'gash'. I'm not sure if we care about being context-aware here, since it's pretty easy to take corpus elements out of context.
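A minimal sketch of the difference between substring matching and the split-and-anchor check suggested above. The blacklist here is a hypothetical two-entry stand-in, not wordfilter's real list, and the function names are made up for illustration:

```python
import re

# Hypothetical blacklist for illustration only; not wordfilter's actual list.
BLACKLIST = {"lame", "homo"}


def substring_hits(entries, blacklist=BLACKLIST):
    """Flag any entry that contains a blacklisted string anywhere inside it."""
    return [e for e in entries if any(b in e.lower() for b in blacklist)]


def whole_word_hits(entries, blacklist=BLACKLIST):
    """Flag an entry only when one of its alphabetic tokens is blacklisted.

    Splitting on runs of non a-z characters means each token is compared
    whole, which is equivalent to anchoring the check at both ends.
    Note: this simple split also breaks tokens at accented letters.
    """
    hits = []
    for entry in entries:
        tokens = re.split(r"[^a-z]+", entry.lower())
        if any(token in blacklist for token in tokens):
            hits.append(entry)
    return hits


entries = ["blame", "flame", "lame", "homogenous", "a lame excuse"]
print(substring_hits(entries))   # flags all five entries
print(whole_word_hits(entries))  # flags only 'lame' and 'a lame excuse'
```

The trade-off is the one wordfilter deliberately avoids: whole-token matching drops the false positives but will also miss compound words built around a blacklisted term.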


blinkdog · 2017-03-17T14:45:39Z

This may also be relevant to the topic:
https://www.2600.com/googleblacklist/

hugovk · 2017-03-17T16:01:12Z

Yes, wordfilter falls for Scunthorpe by design:

Also note that due to the complexities of the English language, I am considering anything containing the substring of a bad word to be blacklisted. For example, even though "homogenous" is not a bad word, it contains the substring "homo" and it gets filtered. The reason for this is that new slang pops up all the time using compound words and I can't possibly keep up with it. I'm willing to lose a few words like "homogenous" and "Pakistan" in order to avoid false negatives.

https://github.com/dariusk/wordfilter#documentation

enkiv2 · 2017-03-20T14:54:54Z

Which makes sense for most applications of wordfilter. With Twitter bots, we have both automation and lack of oversight, so the threat model is that offensive content will be produced either by accident or through the manipulation of some griefer, and the loss from rejecting even a false positive is no big deal. For Corpora (where everything must be approved by a human anyhow, and where every entry is the result of manual effort), the threat model is that someone has accidentally failed to scrub a word from a list and the person approving the request also failed to notice. Option 2b is likely to create a lot of work (and possibly a lot of strange gaps, if whitelists don't get updated often enough) while not actually providing more utility than option 2a. Changing the filter rules to eliminate some false positives would make 2b more reasonable. Someone submitting a pull request to this repo is very unlikely to be carefully constructing a list of hybrid slurs that wordfilter will miss and injecting them into otherwise innocent-looking JSON files. Someone on Twitter absolutely would do that kind of thing.
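To make the 2a/2b distinction concrete, here is a rough sketch of what such a CI check could look like. Everything in it is an assumption for illustration: the blacklist, the whitelist, the skipped keys, and the function names are hypothetical, and the substring check only imitates wordfilter's documented behaviour rather than calling the library:

```python
import json

# All three sets are hypothetical examples, not real project configuration.
BLACKLIST = {"lame"}
WHITELIST = {"blame", "flame"}   # entries a human has reviewed and accepted
SKIP_KEYS = {"description"}      # keys whose values are not scanned


def iter_strings(node):
    """Yield every string value nested inside parsed JSON, skipping SKIP_KEYS."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for key, value in node.items():
            if key not in SKIP_KEYS:
                yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)


def find_flagged(document):
    """Return non-whitelisted strings that contain a blacklisted substring."""
    return [
        s for s in iter_strings(document)
        if s not in WHITELIST and any(b in s.lower() for b in BLACKLIST)
    ]


def exit_code(flagged, strict):
    """Option 2a (strict=False) only reports; option 2b (strict=True) fails."""
    return 1 if strict and flagged else 0


doc = json.loads(
    '{"description": "lame placeholder text",'
    ' "clues": ["blame", "flame", "lame", "tardy"]}'
)
flagged = find_flagged(doc)
print(flagged)                           # ['lame']
print(exit_code(flagged, strict=False))  # 0: informational only (2a)
print(exit_code(flagged, strict=True))   # 1: build fails (2b)
```

In strict mode every false positive has to be added to the whitelist before the build passes, which is the maintenance cost described above.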


dariusk · 2017-03-20T18:16:41Z

Yup -- I considered automatically testing for wordfilter words very early on but decided against it. This is curated and we can submit PRs to remove words we don't want.

dariusk closed this as completed Mar 20, 2017
