Tokeniser exclusion list ignores last word in list #275

Closed
robbydigital opened this issue Jul 4, 2022 · 6 comments

Comments

robbydigital commented Jul 4, 2022

I'm filtering some commonly used words out of a corpus with the Tokenise processor and it only seems to be partially successful. For example, in one month there are 37,325 instances of one word. When I add the word to the reject list, there are still 6,307 instances of it. So it's catching most instances, but not all. I'm having the same issue with some common swear words that I'm trying to filter out: most are gone, but some remain. Is there a reason for this?

Thanks for any insight!

robbydigital commented Jul 11, 2022

Ok, so when I tokenised, my list of exclusions included "word", a few common curse words, and, as the last item in the list, "word's" (a common formulation of this word in my dataset). When I checked the tokenised strings against the dataset, I noticed the formulation "word's" had been included for some reason. I tried again with "word's" as the second item in the list of exclusions, and for whatever reason that worked. I don't really know why, but if anyone else has this problem, give that a go, I guess...

@stijn-uva stijn-uva changed the title Tokenise 'Always delete these words' not entirely successful Tokeniser exclusion list ignores last word in list Jul 11, 2022
@stijn-uva
Member

Thanks for the investigation @robbydigital :) This sounds like something we should actually double-check, so I'm reopening this to keep it on our to-do list!

@stijn-uva stijn-uva reopened this Jul 11, 2022
@dale-wahl
Member

@robbydigital, is the dataset available publicly? I do not see anything immediately obvious in the code that could account for that bug.

robbydigital commented Jul 11, 2022

@dale-wahl yes, it's public - it's this one: https://4cat.oilab.nl/results/2cca8dd122068fa0f4179f040eba1a01/. Hopefully that link should work. You should be able to see that I tried the tokeniser several times with various exclusion lists.

When my exclusion list was "npc, fuck, fucking, shit, npc's" I was still getting about 6,307 "npc" tokens. I looked through the JSON file of tokenised strings and compared it against the full dataset, and it appeared that "npc's" and "npcs" were both being tokenised as "npc" in the 2018-10 JSON file that I was checking the full dataset against.

When I revised the exclusion list to "npc, npc's, npcs, npc?, npc!, npc's!, npcs!, npcs?, fuck, fucking, shit" it solved the problem for me. I couldn't find any "npc" tokens in the 2018-10 JSON file.

Perhaps it's not actually a bug but an issue with excluding acronyms, although that doesn't explain why so many "npc's" were retained when I listed that on the exclusion list initially.

@dale-wahl
Member

So there are some interesting things going on here, but I think the order of operations is likely the cause of the behavior you experienced. The tokeniser handles each item as follows:

  1. Use the chosen tokenizer (tweet or word) to break a document into tokens/words
  2. Check if the exact token/word is in the reject list
  3. Stem the token/word
  4. Lemmatise the token/word

We reject words before they are stemmed/lemmatised so they match exactly what you intend. This way, if you really don't want to hear about "farmers", we do not also reject "farm", "farms", "farming", etc. However, if you reject the word "npc", the word "npcs" will not match, and the stemmer will then turn "npcs" into "npc" and count that word.

Looking at your datasets, this had nothing to do with ignoring the last word: "npcs" was simply not rejected, and was then changed to "npc" by the stemmer/lemmatiser. I hope that makes sense; there's a rough sketch of the logic below.
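
As a rough illustration of that order (just a sketch, not the actual 4CAT processor code; NLTK's SnowballStemmer and the hard-coded reject list here are stand-ins for whatever the processor is configured to use):

```python
# Illustrative sketch of the order described above: reject first, stem second.
# Not the real 4CAT code. Requires NLTK's punkt tokenizer data to be downloaded.
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
reject = {"npc", "fuck", "fucking", "shit"}   # exclusion list as entered

def tokenise(text):
    kept = []
    for token in word_tokenize(text.lower()):  # 1. split into tokens
        if token in reject:                    # 2. exact match against the reject list
            continue
        kept.append(stemmer.stem(token))       # 3. stem whatever survived
    return kept

# "npc" is rejected outright, but "npcs" is not an exact match,
# so it survives the reject check and is then stemmed down to "npc".
print(tokenise("npc npcs"))
```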

As a side note for others who may have a similar experience, depending on which tokenizer you choose, breaking words apart in the first step is handled differently.

nltk's word_tokenize has its own rules on breaking up apostrophes, which can sometimes seem odd. For example, it purposefully breaks a word like "we'll" into "we" and "'ll" (which represents "we" and "will"). You can look at some of the differences between word_tokenize and the TweetTokenizer here. It seems like the TweetTokenizer isn't breaking apostrophes in the same way.

I'm not exactly sure if either tokenizer would break "npc's" into "npc" and "'s" or would keep it as "npc's", but neither of those is an exact match to "npcs" which would leave some remaining "npc" stemmed tokens.
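
If anyone wants to check how either tokenizer splits a particular token, a quick snippet like this will show it (the exact splits can vary between NLTK versions, so I'd run it rather than trust my memory):

```python
# Print how NLTK's two tokenizers split tokens with apostrophes.
from nltk.tokenize import TweetTokenizer, word_tokenize

tweet_tokenizer = TweetTokenizer()
for text in ["we'll", "npc's", "npcs"]:
    print(text, "->", word_tokenize(text), "vs", tweet_tokenizer.tokenize(text))
```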

@dale-wahl
Member

I'm going to close this issue @robbydigital since I do not think there is an action for us to take, but do let us know if you experience anything else odd that we should take a look at or seems like a bug. Thanks for reporting!
