incorrect result while running on large dataset #134

Open
un-lock-me opened this issue Feb 1, 2022 · 2 comments

Comments

un-lock-me commented Feb 1, 2022

Hello,

I am trying your tools and I have run into a weird bug. I would really appreciate it if you could share your thoughts on this issue. I have a dataset of, let's say, 1000 instances (some positive, some negative, and the rest neutral). When I run the tool on the CSV file, only a portion of each category is labeled correctly!
For example, "Great place" is labeled positive but "GREAT!" is labeled neutral. And if I remove the "Great place" instance from the dataset, then "Great" is labeled positive!

So I tried different scenarios to track down the bug, and the only conclusion I could reach is that it stops working correctly as the number of samples increases. But I don't understand why.

I tried another scenario as well. I ran the code on the CSV file and saved the results to a CSV file. Then I passed just "GREAT!" to the model right after it finished labeling the CSV file, and it was labeled neutral again! (If I pass "GREAT!" before running the model on the CSV file, it is labeled positive.) That more or less confirms what I said earlier.

Could you please tell me what the reason could be? The code seems very straightforward, so I don't know why this is happening.

Thanks in advance @cjhutto
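
The before/after scenario described above can be turned into a small repeatable check. The following is only a sketch, not part of VADER: `score` stands for any callable that returns a dict with a `compound` key (for example `SentimentIntensityAnalyzer().polarity_scores` from the `vaderSentiment` package), and the helper names `compound_before_and_after` and `is_order_independent` are hypothetical.

```python
from typing import Callable, Dict, Iterable

# Any callable that maps a text to a dict containing a "compound" score.
# With VADER this could be SentimentIntensityAnalyzer().polarity_scores.
Scorer = Callable[[str], Dict[str, float]]


def compound_before_and_after(
    probe: str, batch: Iterable[str], score: Scorer
) -> Dict[str, float]:
    """Score `probe`, then score every text in `batch`, then score `probe` again."""
    before = score(probe)["compound"]
    for text in batch:
        score(text)  # result is discarded; only possible side effects matter here
    after = score(probe)["compound"]
    return {"before": before, "after": after}


def is_order_independent(
    probe: str, batch: Iterable[str], score: Scorer, tol: float = 1e-9
) -> bool:
    """Return True if scoring `batch` in between does not change the probe's score."""
    result = compound_before_and_after(probe, batch, score)
    return abs(result["before"] - result["after"]) <= tol
```

If `is_order_independent("GREAT!", texts_from_csv, score)` returns `False`, something in the surrounding pipeline is carrying state between calls, which would narrow down where to look.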

Owner

cjhutto commented Feb 2, 2022

Hi @un-lock-me ... this does seem strange, indeed. 1000 instances should be extremely easy for VADER (I and others routinely use it for files with thousands and millions of records). Would you mind sharing a sample of the structure of the CSV file and your pipeline/code to show how you are parsing and processing the CSV file?
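
For reference, a minimal sketch of the kind of pipeline being asked about might look like the following. It is not the reporter's code. It assumes the CSV has a header row and a text column (the column name is a parameter), scores each row on its own, and maps the compound score to a label using the cutoffs suggested in the VADER README (0.05 and -0.05). `toy_scores` is a hypothetical stand-in so the block runs without VADER installed; in real use it would be replaced by `SentimentIntensityAnalyzer().polarity_scores`.

```python
import csv
import io
from typing import Callable, Dict, List, TextIO, Tuple

Scorer = Callable[[str], Dict[str, float]]


def to_label(compound: float) -> str:
    """Map a compound score in [-1, 1] to a label (cutoffs from the VADER README)."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"


def label_csv(handle: TextIO, text_column: str, score: Scorer) -> List[Tuple[str, str]]:
    """Score every row independently; nothing is carried over between rows."""
    results = []
    for row in csv.DictReader(handle):
        text = row[text_column]
        results.append((text, to_label(score(text)["compound"])))
    return results


def toy_scores(text: str) -> Dict[str, float]:
    """Hypothetical stand-in scorer, NOT VADER: keyword lookup for demonstration only."""
    lowered = text.lower()
    if "great" in lowered:
        return {"compound": 0.6}
    if "bad" in lowered:
        return {"compound": -0.6}
    return {"compound": 0.0}


if __name__ == "__main__":
    # With VADER installed, the scorer would instead be:
    #   from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    #   score = SentimentIntensityAnalyzer().polarity_scores
    sample = io.StringIO("text\nGreat place\nGREAT!\nbad service\n")
    for text, label in label_csv(sample, "text", toy_scores):
        print(f"{text!r}: {label}")
```

When reading from a real file, opening it with `open(path, newline="", encoding="utf-8")` is the usual choice for the `csv` module. Because each row is scored on its own, the label for "GREAT!" should not depend on which other rows are in the file.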

@Siddharth-Latthe-07

@un-lock-me, the VADER module works by looking up the lexical meaning of the words and phrases and then producing scores between -1 and +1. The sentiment output may differ depending on whether words are sent to the model individually ("Great", "place") or as a phrase ("Great place"). Apart from this, the difference in the sentiment output for the word "great" is sort of related to how the model processes words with symbols: "Great!" and "Great" might get different sentiment scores even though the word is the same, because their lexical meaning might differ.
Hope this helps
Thanks.


3 participants