Drawing the line between positive, neutral and negative comments #4
Hi gam-ba, interesting analysis. How did you discover that threshold?
Hello! I've just manually "evaluated" several comments, but I'm not really sure whether that's a valid threshold (hence the question :) ).
Hi gam-ba, to answer your question, I would say it depends quite a lot on your dataset and what you are trying to do. If, for your application, it is tragic to predict a positive sentiment when the statement was actually rather negative (a "false positive"), you should set the threshold for binary prediction (1-0) higher. And the other way around if you care more about false negatives (validating a negative sentiment when the probability is under a lower threshold).

Also, as DonPeregrina noticed in another issue, the model was trained on a corpus that has a few biases. For instance, it is mainly trained on South American/Argentine comments about hotels/restaurants/movies/... I had some really good performance on my dataset, but results can differ a bit when applied to other datasets because of these biases. My corpus also included a fair amount of insults, but once again these were highly biased towards how rude statements are made in Argentina. Think also about sarcasm and irony; I don't think my model would behave well with those.

So I can't guarantee you any absolute thresholds, as it depends on both your dataset and your human appreciation of how you want to handle the false positive/false negative trade-off for your application. If you have a labelled dataset, I would suggest using it to compute the thresholds that best fit your data.

Finally, I used a Naive Bayes model, which is not known for giving well-calibrated probabilities. That is, it might be good at putting a statement in the right box (positive or negative sentiment), but beyond that the probability scores might not be well calibrated/ranked. This is because of the "naive assumption" it makes, which causes big and very small probabilities to be multiplied separately, hence the numeric instability.
In practice, I observed that the final scores did not behave "sooo" badly, and I also set up some post-processing heuristics to rescale and improve the scores (see the README.md file). That was not the objective when I shared this project, but I am seeing a little success with it these days. I am very pleased it is actually being used, and I might consider doing a v2. The model in this package is, after all, quite simple and is absolutely not the state-of-the-art solution. When I get some time, I will try to gather more varied data and train some deep learning architectures, which I am quite confident will perform better (from my experience at work). It could also be the opportunity to rewrite a cleaner API. To be continued!
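The suggestion above, picking the cut-off from a labelled dataset, can be sketched in plain Python. The scores and labels below are made-up illustrations; in practice the scores would come from the classifier's output:

```python
# Sketch: pick the pos/neg cut-off that maximises accuracy on a small
# labelled dataset. The numbers here are invented for illustration.

def best_threshold(scores, labels, candidates=None):
    """Return the candidate threshold with the highest accuracy."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]

    def accuracy(t):
        preds = [1 if s >= t else 0 for s in scores]
        return sum(p == y for p, y in zip(preds, labels)) / len(labels)

    return max(candidates, key=accuracy)

scores = [0.05, 0.20, 0.45, 0.60, 0.80, 0.95]  # model outputs (invented)
labels = [0, 0, 0, 1, 1, 1]                    # 0 = negative, 1 = positive
print(best_threshold(scores, labels))          # some value in (0.45, 0.60]
```

The same search generalises to a pair of thresholds for the neg/neutral/pos case by iterating over candidate pairs.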
Hey! Thanks for the thoughtful reply!
When you say "set the threshold", do you mean in this same classifier or in a complementary function of my own? In other words, is there a parameter in your classifier where we can set these thresholds? In any case, my "concern" has more to do with finding ways of labelling "neutral" comments. I'm not really worried about false positives or false negatives, but rather about finding a way to sort my corpora into the three classical clusters of sentiment analysis (positive, negative and neutral).
IMHO, that's the best part of your model. I'm also working on several corpora written in Argentinian Spanish :) . What's more, all my material comes from social networks, so the vocabulary (including insults) should be quite alike.
Here you already have someone looking forward to it!! Thanks, once again :)
Like gam-ba said, you have an audience now, it's a very cool project. I have a question: I haven't gone too deep into the source code, but is there a way I could change the corpora? Let's say I now want to feed it some Mexican Spanish student lexicon, is that possible? Thanks a lot for your replies.
Nope! The model only outputs a float number p, which corresponds to the probability that the statement is associated with a positive sentiment. However, all you have to do is apply an "if/else" condition on it afterwards to sort the statement into one of your three boxes: neg, neutral, pos.
I understand that this is the main point, but I have no absolute thresholds to give you; as I said, it depends heavily on your dataset. For my application, I only cared about the pos/neg separation. I think you got it, but formally what you need is a rule of thumb like:
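A minimal sketch of that rule (the default cut-off values here are placeholders, not recommendations; p would be the float returned by the classifier, e.g. from clf.predict):

```python
# p is the probability of positive sentiment returned by the classifier.
# threshold1 and threshold2 are up to you; the defaults below are just
# placeholder values for illustration.
def bucket(p, threshold1=0.3, threshold2=0.7):
    if p < threshold1:
        return "neg"
    elif p < threshold2:
        return "neutral"
    else:
        return "pos"
```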
with threshold1 in the range (0, 0.5) and threshold2 in the range (0.5, 1). Best of all, if you have a dataset with a (neg, neutral, pos) label for each statement, compute all the prediction scores and see which pair of thresholds gives you the best results. Otherwise, set them manually, thinking about the trade-off we were talking about above. That is, if you want to be sure, for instance, that all the statements you classify as negative are actually really negative, threshold1 should be low. But then you run the risk that some negative comments will be classified as neutral. If you set it a bit higher, maybe some neutral comments will be classified as negative, but what you classify as neutral is more likely to be really neutral. Same thing with threshold2.
Good! v2 should also incorporate algorithms to deal with misspellings in a more robust manner (people tend to write with errors on social media, especially when they are insulting!)
I looked quickly, and it really doesn't seem easy. When you call clf = SentimentClassifier(), you get the real core of the classifier in the variable source_estimator.
If you wanted to retrain on additional data, you would have to apply this Pipeline, but you would first need to apply to your data the exact same preprocessing function I used (it does a lot of things; it is in the GitHub repo). The TfidfVectorizer also needs access to the whole corpus to compute the tf-idf weights. So even if you get to that point, calling the fit(X, y) method of source_estimator will train on your additional data but forget everything about the dataset I originally trained on. I don't know if everything was clear, but the conclusion is that I think you would have a hard time retraining the model, and it would still forget the former classifier and train the architecture only on your additional data. So the easiest way would be to either use it as it is, or retrain it from the beginning on your own data (maybe the script under /master/classifier/script_global.py would give you some hints on how to do this). I am sorry, the architecture was designed as a "ready-to-use" classifier; it was not meant to be retrained. I am aware that this is a clear limitation, but as I said before, I had no idea the model would be used beyond my personal project when I designed it.
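Retraining from scratch on your own data, as suggested above, could look roughly like this. This is a generic scikit-learn sketch with a similar TF-IDF + Naive Bayes architecture, not the repo's exact pipeline, preprocessing, or hyperparameters:

```python
# Generic sketch: train a TF-IDF + Naive Bayes classifier from scratch
# on your own labelled corpus. NOT the repo's exact pipeline; its
# preprocessing function and settings differ.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny invented corpus for illustration; use your real labelled data.
texts = ["qué buen servicio", "excelente película",
         "pésima atención", "muy malo todo"]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

clf = Pipeline([
    ("tfidf", TfidfVectorizer()),  # fitted on the whole corpus at once
    ("nb", MultinomialNB()),
])
clf.fit(texts, labels)

# predict_proba()[:, 1] gives a positivity score comparable in spirit
# to the original classifier's float output
score = clf.predict_proba(["excelente película"])[0][1]
```

As the comment above warns, fitting this pipeline learns only from the data you pass it; there is no incremental update of the previously trained model.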
Yes, I was thinking about something like that... I'll see. In any case, thank you for the advice!
I had to deal with that once, and I decided to use the Hunspell engine. It worked quite well, but I had to feed it a huge "exceptions" list to account for insults and local slang. Thank you, again!
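The Hunspell setup mentioned above is engine-specific, but the "exceptions list" idea itself can be sketched without it: check a custom whitelist of slang and insults first, and only then fall back to the engine's suggestions. Here the engine is replaced by a plain lookup table so the example stays self-contained; the words and corrections are invented for illustration:

```python
# Toy sketch of the "exceptions list" idea. Local slang/insults go in a
# whitelist checked before the spell engine; the engine itself (Hunspell
# in the comment above) is stood in for by a plain dict of corrections.
EXCEPTIONS = {"boludo", "che", "quilombo"}        # accepted as-is
CORRECTIONS = {"ola": "hola", "kiero": "quiero"}  # stand-in for engine suggestions

def normalise(word):
    if word in EXCEPTIONS:
        return word  # slang: never "correct" it
    return CORRECTIONS.get(word, word)

print([normalise(w) for w in ["che", "kiero", "comer"]])
# ['che', 'quiero', 'comer']
```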
Awesome work, thank you gam-ba :)
Hi! I am just wondering whether your plans for a second version (as mentioned in this thread) are concrete? When would you consider releasing it? I would love to use this for an academic project!
Hi val-st, at least feel free to use this first version in your academic project :)
Thank you, I will use this first version for now and am looking forward to an updated version at some point in the future!
Hi @aylliote, first of all, amazing work! I was having this same issue and I thought of giving a label to the different crawled datasets, including some neutral comments, and retraining the model using your work on pre- and post-processing. The problem was I couldn't find the datasets; I only found "goodLexStem", "badLexStem", "cities", "countries" and "expressions". Did you remove "mercadoPos.txt" and other similar files? They could be useful in this scenario.
Hi!
First of all: EXCELLENT JOB, it is really awesome and works like a charm.
Secondly, this is not really an "issue" but rather a question: where would you draw the line between positive, neutral and negative comments?
After trying out several statements, I was thinking about a threshold somewhere between 0.3 and 0.35, would you agree?
Thanks a lot!
Kind regards,
G