
Drawing the line between positive, neutral and negative comments #4

Open
gam-ba opened this issue May 16, 2019 · 13 comments


@gam-ba

gam-ba commented May 16, 2019

Hi!

First of all: EXCELLENT JOB, it is really awesome and works like a charm.

Secondly, this is not really an "issue" but rather a question: where would you draw the line between positive, neutral and negative comments?

After trying out several statements, I was thinking about a threshold somewhere between 0.3 and 0.35, would you agree?

Thanks a lot!
Kind regards,

G

@DonPeregrina

Hi gam-ba

Interesting analysis! How did you discover that threshold?
I have definitely seen some comments from my users being tagged as 0.01 or 0.1 that seem neutral to me.

@gam-ba
Author

gam-ba commented May 17, 2019

Hello!

I've just manually "evaluated" several comments, but I'm not really sure whether that's a valid threshold (hence the question :) ).

@el-cornetillo
Owner

Hi gam-ba
Thank you for your kind message (and my apologies for this long answer!)

Regarding your question, I would say that it depends quite a lot on your dataset and what you are trying to do. If, for your application, it is tragic to predict a positive sentiment when the statement was actually rather negative (a "false positive"), you should set the threshold for binary prediction (1/0) higher. And the other way around if you care more about false negatives: only validate a negative sentiment when the probability is under a lower threshold.

Also, as DonPeregrina noticed in another issue, the model was trained on a corpus that has a few biases. For instance, it is mainly trained on South American/Argentine comments relating to hotels/restaurants/movies/... I had some really good performance on my dataset, but results can differ a bit when applied to other datasets because of these biases. My corpus also included a fair amount of insults, but once again they were highly biased towards how rude statements are phrased in Argentina. Think also about sarcasm and irony; I don't think my model would behave well with those.

So I cannot guarantee you any absolute thresholds, as it depends on both your dataset and your human judgement of how you want to handle the false positive/false negative trade-off (with regard to your application). If you have a labelled dataset, I would suggest using it to compute the thresholds that best fit your data.

Finally, I used a Naive Bayes model, which is not known for giving well-calibrated probabilities. That is, it might be good at putting the statement in the right box (positive or negative sentiment), but beyond that the probability scores might not be well calibrated/ranked. This is because of the "naive assumption" it makes, which causes many large and very small probabilities to be multiplied together, hence numerical instability. In practice, I observed that the final scores did not behave too badly, and I also set up some post-processing heuristics to rescale and improve the scores (see the README.md file).
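If you have labelled data of your own, one standard alternative (different from the heuristics shipped in this repo) is to recalibrate the scores with scikit-learn's CalibratedClassifierCV. A minimal sketch with toy placeholder data:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy labelled corpus -- replace with your own data and features
texts = ["muy bueno", "excelente lugar", "horrible", "pésimo servicio"] * 25
labels = [1, 1, 0, 0] * 25
X = TfidfVectorizer().fit_transform(texts)

# Fit Naive Bayes inside a calibrator, using cross-validated held-out folds
calibrated = CalibratedClassifierCV(MultinomialNB(), method="sigmoid", cv=5)
calibrated.fit(X, labels)
probs = calibrated.predict_proba(X)[:, 1]  # recalibrated P(positive)
```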

It was not the objective when I shared this project, but it is seeing a little success these days. I am very pleased it is actually being used, and I might consider doing a v2. The model in this package is, after all, quite simple and is absolutely not the state-of-the-art solution. When I get some time, I will try to gather more varied data and train some deep learning architectures, which I am quite confident will perform better (from my experience at work). It would also be the opportunity to rewrite a cleaner API.

To be continued!

@gam-ba
Author

gam-ba commented May 27, 2019

Hey! Thanks for the thoughtful reply!

> If, for your application, it is tragic to predict a positive sentiment when the statement was actually rather negative (a "false positive"), you should set the threshold for binary prediction (1/0) higher.

When you say "set the threshold", do you mean in this same classifier or as a complementary function of my own? In other words, is there a parameter in your classifier where we can set these thresholds?

In any case, my "concern" has more to do with finding ways of labelling "neutral" comments. I'm not really worried about false positives or false negatives, but rather about finding a way to sort my corpus into the three classical clusters of sentiment analysis (positive, negative and neutral).

> For instance, it is mainly trained on South American/Argentine comments relating to hotels/restaurants/movies/... I had some really good performance on my dataset, but results can differ a bit when applied to other datasets because of these biases. My corpus also included a fair amount of insults, but once again they were highly biased towards how rude statements are phrased in Argentina.

IMHO, I guess that's the best part of your model. I'm also working on several corpora written in Argentinian Spanish :) . What's more, all my material comes from social networks, so the vocabulary (including insults) should be quite similar.

> It was not the objective when I shared this project, but it is seeing a little success these days. I am very pleased it is actually being used, and I might consider doing a v2.

Here you already have someone looking forward to it!!

Thanks, once again :)

@DonPeregrina

Like gam-ba said, you have an audience now; it's a very cool project.

I have a question. I have not gone too deep into the source code, but is there a way I could change the corpus? Let's say I now want to feed it some Mexican Spanish student lexicon; is that possible?

Thanks a lot for your replies.

@el-cornetillo
Owner

> When you say "set the threshold", do you mean in this same classifier or as a complementary function of my own? In other words, is there a parameter in your classifier where we can set these thresholds?

Nope! The model only outputs a float p, which corresponds to the probability that the statement carries a positive sentiment. However, all you have to do is apply an "if/else" condition to it afterwards to sort the statement into one of your three boxes: neg, neutral, pos.

> In any case, my "concern" has more to do with finding ways of labelling "neutral" comments. I'm not really worried about false positives or false negatives, but rather about finding a way to sort my corpus into the three classical clusters of sentiment analysis (positive, negative and neutral).

I understand that this is the main point; however, I have no absolute thresholds to give you, as I said it can depend quite a lot on your dataset. For my application, I only cared about the pos/neg separation. I think you got it, but formally what you need is a rule of thumb like the one below (sketched in code right after the list):

  • if 0 < p < threshold1 ---> negative
  • if threshold1 < p < threshold2 ---> neutral
  • if threshold2 < p < 1 ---> positive

with threshold1 in the range (0, 0.5) and threshold2 in the range (0.5, 1).
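As a minimal sketch of that rule in code (the two threshold values are placeholders to tune on your own data, and I am assuming the package's import path; adjust it if needed):

```python
from classifier import SentimentClassifier  # assumed import path

clf = SentimentClassifier()

THRESHOLD_1 = 0.35  # placeholder: below this -> negative
THRESHOLD_2 = 0.65  # placeholder: above this -> positive

def three_way_label(text):
    p = clf.predict(text)  # the float p = P(positive sentiment) discussed above
    if p < THRESHOLD_1:
        return "negative"
    elif p < THRESHOLD_2:
        return "neutral"
    else:
        return "positive"

print(three_way_label("La comida estaba muy rica"))
```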

The best would be: if you have a dataset with a (neg, neutral, pos) label for each statement, compute all the prediction scores and see which pair of thresholds gives you the best results. Otherwise, set them manually, thinking about the trade-off we were talking about above. That is, if you want, for instance, to be sure that all the statements you classify as negative are really negative, threshold1 should be low. But then you run the risk that some negative comments will be classified as neutral. If you set it a bit higher, maybe some neutral comments will be classified as negative, but what you classify as neutral is more likely to really be neutral. Same thing with threshold2.
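If you do have such a labelled dataset, a sketch of that search might look like the following (the scores and gold labels are placeholders, and macro-F1 is just one reasonable selection metric among others):

```python
import numpy as np
from itertools import product
from sklearn.metrics import f1_score

def best_thresholds(scores, y_true):
    """Grid-search the (threshold1, threshold2) pair on labelled data."""
    best_t1, best_t2, best_f1 = None, None, -1.0
    for t1, t2 in product(np.arange(0.05, 0.50, 0.05),
                          np.arange(0.50, 1.00, 0.05)):
        y_pred = ["neg" if s < t1 else "neutral" if s < t2 else "pos"
                  for s in scores]
        f1 = f1_score(y_true, y_pred, average="macro")
        if f1 > best_f1:
            best_t1, best_t2, best_f1 = t1, t2, f1
    return best_t1, best_t2, best_f1

# scores = [clf.predict(s) for s in statements]  # prediction scores
# print(best_thresholds(scores, gold_labels))    # gold_labels: "neg"/"neutral"/"pos"
```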

> What's more, all my material comes from social networks, so the vocabulary (including insults) should be quite similar.

Good! v2 should also incorporate algorithms to deal with misspellings in a more robust manner (people tend to write with errors on social media, especially when they are insulting!)

@el-cornetillo
Owner

> I have a question. I have not gone too deep into the source code, but is there a way I could change the corpus? Let's say I now want to feed it some Mexican Spanish student lexicon; is that possible?

I looked quickly, and it really doesn't seem easy.

When you call clf = SentimentClassifier(), clf is an instance of a class I wrote to wrap around the real scikit-learn estimator.
You can access this estimator with: source_estimator = clf.classifier

At that point you have the real core of the classifier in the variable source_estimator.
What is it?
A scikit-learn "Pipeline" object that contains the following (a small inspection sketch follows the list):

  • A TfidfVectorizer, itself wrapped around a MarisaTrie structure to reduce memory usage
  • A feature selector, SelectKBest, also from scikit-learn
  • A classifier, MultinomialNB, also from scikit-learn
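You can check this yourself by listing the pipeline stages; printing the steps is the safest way to discover their names (import path assumed as before):

```python
from classifier import SentimentClassifier  # assumed import path

clf = SentimentClassifier()
source_estimator = clf.classifier  # the underlying scikit-learn Pipeline

# Print each stage: vectorizer, feature selector, Naive Bayes classifier
for name, step in source_estimator.steps:
    print(name, type(step).__name__)
```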

If you wanted to retrain on additional data, you would have to refit this Pipeline.

But first you would need to apply the exact same preprocessing function I used to your data (it does a lot of things; it is in the GitHub repo), and then tell the source_estimator object to fit on your additional data.
Problem is: this is not possible.

1/ The TfidfVectorizer needs access to the whole corpus to compute the tf-idf weights.
2/ MultinomialNB is not a scikit-learn classifier that can be retrained incrementally (it does not have the "warm_start" parameter according to the documentation, which seems normal when you think about how a Naive Bayes classifier is trained).

So even if you get to that point, calling the fit(X, y) method of source_estimator will train on your additional data but forget everything about the dataset I originally trained on.

I don't know if everything was clear, but the conclusion is that I think you would have a hard time trying to retrain the model, and even then it would forget the former classifier and train the architecture only on your additional data.

So the easiest way would be to either use it as it is, or retrain it from the beginning on your own data (the script under /master/classifier/script_global.py may give you some hints on how to do this).
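For the second route, here is a rough sketch of retraining a similar pipeline from scratch with the same three components described above. The parameters and toy data are illustrative, not the exact ones from script_global.py, and your texts should first go through the repo's preprocessing function:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB

# Toy corpus -- replace with your own preprocessed Mexican Spanish data
texts = ["que buena onda", "excelente", "que mal servicio", "terrible"] * 10
labels = [1, 1, 0, 0] * 10

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),        # no MarisaTrie wrapper in this sketch
    ("select", SelectKBest(chi2, k=5)),  # k is illustrative
    ("clf", MultinomialNB()),
])
pipeline.fit(texts, labels)
print(pipeline.predict(["que buena onda"]))  # -> [1]
```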

I am sorry, the architecture was designed as a "ready-to-use" classifier; it was not meant to be retrained. I am aware that this is a clear limitation, but as I said before, I had no idea the model would be used beyond my personal project when I designed it.
It's definitely something that should be added if I implement a v2!

@gam-ba
Author

gam-ba commented Jun 8, 2019

> The best would be: if you have a dataset with a (neg, neutral, pos) label for each statement, compute all the prediction scores and see which pair of thresholds gives you the best results. Otherwise, set them manually, thinking about the trade-off we were talking about above. That is, if you want, for instance, to be sure that all the statements you classify as negative are really negative, threshold1 should be low. But then you run the risk that some negative comments will be classified as neutral. If you set it a bit higher, maybe some neutral comments will be classified as negative, but what you classify as neutral is more likely to really be neutral. Same thing with threshold2.

Yes, I was thinking about something like that... I'll see. In any case, thank you for the advice!

> Good! v2 should also incorporate algorithms to deal with misspellings in a more robust manner (people tend to write with errors on social media, especially when they are insulting!)

I had to deal with that once, and I decided to use the Hunspell engine. It worked quite well, but I had to feed it a huge "exceptions" list to account for insults and local slang.
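For reference, a small sketch of that approach with the pyhunspell binding (the dictionary paths and slang words are placeholders):

```python
import hunspell  # pyhunspell binding

# Paths are placeholders -- point these at your installed es_AR dictionary
h = hunspell.HunSpell('/usr/share/hunspell/es_AR.dic',
                      '/usr/share/hunspell/es_AR.aff')

# Feed the "exceptions" list so slang and insults are not treated as misspellings
for word in ["boludo", "quilombo"]:
    h.add(word)  # added to the runtime dictionary only (not persisted)

print(h.spell("boludo"))      # True after add()
print(h.suggest("exelente"))  # suggestions for a misspelling
```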

Thank you, again!

@AnnaDeluca

Awesome work, thank you gam-ba :)

@val-st

val-st commented Jul 12, 2019

Hi! I am just wondering whether your plans to do a second version (as mentioned in this thread) are concrete? When would you consider releasing it? I would love to use this for an academic project!
(PS: sorry for posting in this thread, I am new to GitHub and don't know where the proper place to ask such a question would be.)

@el-cornetillo
Owner

Hi val-st
I am seeing so many things that could be improved (the machine learning itself, the run-time, and also the overall cleanliness of the code), so I'm quite sure I'll indeed do a second version one of these days. But to be honest, I cannot tell you when, and it will probably take some time.

In the meantime, feel free to use this first version in your academic project :)

@val-st

val-st commented Jul 15, 2019

Thank you, I will use this first version for now and am looking forward to an updated version at some point in the future!

@gastondg

Hi @aylliote! First of all, amazing work!

I was having this same issue, and I thought of labelling the different crawled datasets, including some neutral comments, and retraining the model using your pre- and post-processing work. The problem was that I didn't find the datasets; I only found "goodLexStem", "badLexStem", "cities", "countries" and "expressions".

Did you remove "mercadoPos.txt" and other similar files? Those could be useful in this scenario.
