Tokeniser exclusion list ignores last word in list #275

Closed
robbydigital opened this issue Jul 4, 2022 · 6 comments

Comments

robbydigital commented Jul 4, 2022

I'm filtering some commonly used words out of a corpus with the Tokenise processor and it only seems to be partially successful. For example, in one month there are 37,325 instances of one word. When I add the word to the reject list, there are still 6,307 instances of it. So it's catching most instances, but not all. I'm having the same issue with some common swear words that I'm trying to filter out: most are gone, but some remain. Is there a reason for this?

Thanks for any insight!

robbydigital commented Jul 11, 2022

Ok, so when I tokenised, my list of exclusions included "word", a few common curse words, and, as the last item in the list, "word's" (a common formulation of this word in my dataset). When I checked the tokenised strings against the dataset, I noticed the formulation "word's" had been included for some reason. I tried again with "word's" as the second item in the list of exclusions, and for whatever reason that worked. I don't really know why, but if anyone else has this problem, give that a go, I guess...

@stijn-uva stijn-uva changed the title Tokenise 'Always delete these words' not entirely successful Tokeniser exclusion list ignores last word in list Jul 11, 2022
@stijn-uva
Member

Thanks for the investigation @robbydigital :) This sounds like something we should actually double-check, so I'm reopening this to keep it on our to-do list!

@stijn-uva stijn-uva reopened this Jul 11, 2022
@dale-wahl
Member

@robbydigital, is the dataset available publicly? I do not see anything immediately obvious in the code that could account for that bug.

robbydigital commented Jul 11, 2022

@dale-wahl yes, it's public - it's this one: https://4cat.oilab.nl/results/2cca8dd122068fa0f4179f040eba1a01/. Hopefully that link should work. You should be able to see that I tried the tokeniser several times with various exclusion lists.

When my exclusion list was "npc, fuck, fucking, shit, npc's" I was still getting about 6,307 "npc" tokens. I looked through the JSON file of tokenised strings and compared it against the full dataset, and it appeared that "npc's" and "npcs" were both being tokenised as "npc" in the 2018-10 JSON file that I was checking the full dataset against.

When I revised the exclusion list to "npc, npc's, npcs, npc?, npc!, npc's!, npcs!, npcs?, fuck, fucking, shit" it solved the problem for me. I couldn't find any "npc" tokens in the 2018-10 JSON file.

Perhaps it's not actually a bug but an issue with excluding acronyms, although that doesn't explain why so many "npc's" were retained when I listed that on the exclusion list initially.

@dale-wahl
Member

So there are some interesting things going on here, but I think the order of operations is likely the cause of the behavior you experienced. The tokeniser handles each item as follows:

  1. Use the chosen tokenizer (tweet or word) to break a document into tokens/words
  2. Check if the exact token/word is in the reject list
  3. Stem the token/word
  4. Lemmatise the token/word

We reject words before they are stemmed/lemmatised so they match exactly what you intend. This way, if you really don't want to hear about "farmers", we do not also reject "farm", "farms", "farming", etc. However, if you reject the word "npc", the word "npcs" will not match, and the stemmer will then turn "npcs" into "npc" and count that word.

Looking at your datasets, this had nothing to do with ignoring the last word: "npcs" was simply not rejected, and was then changed to "npc" by the stemmer/lemmatiser. I hope that makes sense; there's a rough sketch of the logic below.
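
As a rough illustration of that order (just a sketch, not the actual 4CAT processor code; NLTK's SnowballStemmer and the hard-coded reject list here are stand-ins for whatever the processor is configured to use):

```python
# Illustrative sketch of the order described above: reject first, stem second.
# Not the real 4CAT code. Requires NLTK's punkt tokenizer data to be downloaded.
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
reject = {"npc", "fuck", "fucking", "shit"}   # exclusion list as entered

def tokenise(text):
    kept = []
    for token in word_tokenize(text.lower()):  # 1. split into tokens
        if token in reject:                    # 2. exact match against the reject list
            continue
        kept.append(stemmer.stem(token))       # 3. stem whatever survived
    return kept

# "npc" is rejected outright, but "npcs" is not an exact match,
# so it survives the reject check and is then stemmed down to "npc".
print(tokenise("npc npcs"))
```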

As a side note for others who may have a similar experience, depending on which tokenizer you choose, breaking words apart in the first step is handled differently.

nltk's word_tokenize has its own rules on breaking up apostrophes, which can sometimes seem odd. For example, it purposefully breaks a word like "we'll" into "we" and "'ll" (which represents "we" and "will"). You can look at some of the differences between word_tokenize and the TweetTokenizer here. It seems like the TweetTokenizer isn't breaking apostrophes in the same way.

I'm not exactly sure if either tokenizer would break "npc's" into "npc" and "'s" or would keep it as "npc's", but neither of those is an exact match to "npcs" which would leave some remaining "npc" stemmed tokens.
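
If anyone wants to check how either tokenizer splits a particular token, a quick snippet like this will show it (the exact splits can vary between NLTK versions, so I'd run it rather than trust my memory):

```python
# Print how NLTK's two tokenizers split tokens with apostrophes.
from nltk.tokenize import TweetTokenizer, word_tokenize

tweet_tokenizer = TweetTokenizer()
for text in ["we'll", "npc's", "npcs"]:
    print(text, "->", word_tokenize(text), "vs", tweet_tokenizer.tokenize(text))
```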

@dale-wahl
Member

I'm going to close this issue @robbydigital since I do not think there is an action for us to take, but do let us know if you experience anything else odd that we should take a look at or seems like a bug. Thanks for reporting!
