# Sentiment Analysis

- To determine the emotional affect of text.
- Generally done very poorly.
- Little agreement on sentiment.
- Lots of amazing machine learning performance scores.
- Works poorly on software engineering text.
- Beneath the Tip of the Iceberg: Current Challenges and New Directions in Sentiment Analysis Research https://arxiv.org/pdf/2005.00357.pdf

# LIWC

- LIWC (Linguistic Inquiry and Word Count): word lists grouped into categories.
- https://lit.eecs.umich.edu/geoliwc/liwc_dictionary.html
- Not just positive and negative.
- Pennebaker, J.W., & Francis, M.E. (1996). Cognitive, emotional, and language processes in disclosure. Cognition and Emotion, 10, 601-626.
- SE Paper: P. C. Rigby and A. E. Hassan, "What Can OSS Mailing Lists Tell Us? A Preliminary Psychometric Text Analysis of the Apache Developer Mailing List," Fourth International Workshop on Mining Software Repositories (MSR'07:ICSE Workshops 2007), Minneapolis, MN, 2007, pp. 23-23, doi: 10.1109/MSR.2007.35. https://ieeexplore.ieee.org/document/4228660
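
The word-list idea can be sketched in a few lines. This is only a toy: the three categories and their words below are made up for illustration and are not the real LIWC dictionary, and the trailing `*` stem convention is an assumption about how such lists are commonly written.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon for illustration only; NOT the real LIWC word lists.
# A trailing "*" marks a stem that matches any word starting with that prefix.
TOY_CATEGORIES = {
    "posemo": ["happy", "thank*", "great"],
    "negemo": ["hate", "broken", "annoy*"],
    "cogmech": ["think", "because", "understand*"],
}

def matches(word, pattern):
    """True if word equals the pattern, or starts with it when the pattern ends in '*'."""
    if pattern.endswith("*"):
        return word.startswith(pattern[:-1])
    return word == pattern

def category_rates(text, categories=TOY_CATEGORIES):
    """Return the fraction of tokens in `text` that fall into each category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        for name, patterns in categories.items():
            if any(matches(token, p) for p in patterns):
                counts[name] += 1
    total = len(tokens) or 1  # avoid dividing by zero on empty input
    return {name: counts[name] / total for name in categories}

rates = category_rates("Thanks, I think the build is broken because of my patch")
print(rates)
```

The result is a profile over several categories rather than a single positive/negative label, which is the point of the "not just positive and negative" bullet.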

# "Modern" Sentiment Analysis

- Instead of becoming more descriptive, modern sentiment analysis has been made easy for computers rather than useful for people.
- Typically 2 to 3 labels: Positive, Neutral, Negative, or just Positive and Negative
- Ignores context
- Simple classifiers.
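
A minimal sketch of what "ignores context" means for a bag-of-words representation. The two sentences are made-up examples; the tokenizer is a plain whitespace split, not the preprocessing used later in this notebook.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count tokens; word order is discarded."""
    return Counter(text.lower().split())

a = bag_of_words("the patch is good not bad")
b = bag_of_words("the patch is bad not good")

# Opposite meanings, identical features: any classifier built on these
# counts alone has to give both sentences the same label.
print(a == b)
```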

In [34]:
# NLTK portion based on
# https://www.digitalocean.com/community/tutorials/how-to-perform-sentiment-analysis-in-python-3-using-the-natural-language-toolkit-nltk
# by Shaumik Daityari, September 26, 2019
#
# Run this if you don't have twitter_samples
import nltk
nltk.download('twitter_samples')
# Run this if you don't have vader_lexicon
nltk.download('vader_lexicon')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     /home/hindle1/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/hindle1/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

# Step 1 Get data

- Here's some non-SE data.

In [35]:
import nltk
from nltk.corpus import twitter_samples

positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')
text = twitter_samples.strings('tweets.20150430-223406.json')

In [36]:
print(positive_tweets[0:10])
print(negative_tweets[0:10])

['#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)', '@Lamb2ja Hey James! How odd :/ Please call our Contact Centre on 02392441234 and we will be able to assist you :) Many thanks!', '@DespiteOfficial we had a listen last night :) As You Bleed is an amazing track. When are you in Scotland?!', '@97sides CONGRATS :)', 'yeaaaah yippppy!!!  my accnt verified rqst has succeed got a blue tick mark on my fb profile :) in 15 days', '@BhaktisBanter @PallaviRuhail This one is irresistible :)\n#FlipkartFashionFriday http://t.co/EbZ0L2VENM', "We don't like to keep our lovely customers waiting for long! We hope you enjoy! Happy Friday! - LWWF :) https://t.co/smyYriipxI", '@Impatientraider On second thought, there’s just not enough time for a DD :) But new shorts entering system. Sheep must be buying.', 'Jgh , but we have to go to Bayan :D bye', 'As an act of mischievousness, am calling the ETL layer of our in-house warehousing app Katam

# Preprocess -- remove stop words, etc.

In [37]:
import gensim
from gensim.parsing.preprocessing import preprocess_documents
positives = preprocess_documents( positive_tweets )
negatives = preprocess_documents( negative_tweets )

In [5]:
print(positives[0:10])
print(negatives[0:10])

[['followfridai', 'franc', 'int', 'pkuchli', 'milipol', 'pari', 'engag', 'member', 'commun', 'week'], ['lambja', 'hei', 'jame', 'odd', 'contact', 'centr', 'abl', 'assist', 'thank'], ['despiteoffici', 'listen', 'night', 'bleed', 'amaz', 'track', 'scotland'], ['side', 'congrat'], ['yeaaaah', 'yippppi', 'accnt', 'verifi', 'rqst', 'succe', 'got', 'blue', 'tick', 'mark', 'profil', 'dai'], ['bhaktisbant', 'pallaviruhail', 'irresist', 'flipkartfashionfridai', 'http', 'ebzlvenm'], ['like', 'love', 'custom', 'wait', 'long', 'hope', 'enjoi', 'happi', 'fridai', 'lwwf', 'http', 'smyyriipxi'], ['impatientraid', 'second', 'thought', 'there’', 'time', 'new', 'short', 'enter', 'sheep', 'bui'], ['jgh', 'bayan', 'bye'], ['act', 'mischiev', 'call', 'etl', 'layer', 'hous', 'wareh', 'app', 'katamari', 'well…', 'impli']]
[['hopeless', 'tmr'], ['kid', 'section', 'ikea', 'cute', 'shame', 'nearli', 'month'], ['hegelbon', 'heart', 'slide', 'wast', 'basket'], ['ketchburn', 'hate', 'japanes', 'bani'], ['dang', 's

# Reduce vocabulary and move to bag of words

In [39]:
from gensim.corpora import Dictionary
words = Dictionary(positives + negatives)
MAXWORDS=30000
words.filter_extremes(no_below=5, no_above=0.5, keep_n=MAXWORDS)
positive_docs = [words.doc2bow(doc) for doc in positives]
negative_docs = [words.doc2bow(doc) for doc in negatives]



In [40]:
print(positive_docs[0:10])
print(negative_docs[0:10])

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)], [(7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1)], [(13, 1), (14, 1), (15, 1), (16, 1)], [(17, 1)], [(18, 1), (19, 1), (20, 1), (21, 1)], [(22, 1), (23, 1), (24, 1), (25, 1)], [(24, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1)], [(35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1)], [(42, 1), (43, 1)], [(44, 1), (45, 1), (46, 1), (47, 1)]]
[[], [(152, 1), (180, 1), (619, 1), (630, 1), (1204, 1), (1298, 1)], [(554, 1), (712, 1)], [(545, 1), (703, 1)], [(6, 1), (73, 1), (259, 1)], [(24, 1), (179, 1), (181, 1), (1255, 1)], [(302, 1)], [(24, 1), (73, 1), (475, 1), (545, 1), (751, 1), (830, 1)], [(24, 1)], [(24, 1)]]
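
Each document above is a list of `(token_id, count)` pairs, and a document whose tokens were all filtered out becomes `[]`. The following pure-Python sketch imitates that idea on three made-up documents. It is not gensim's implementation: ids are assigned alphabetically here for simplicity, and only a `no_below`-style document-frequency filter is modelled.

```python
from collections import Counter

def build_vocab(docs, no_below=2):
    """Assign an integer id to every token that appears in at least `no_below` documents."""
    doc_freq = Counter(token for doc in docs for token in set(doc))
    kept = sorted(t for t, df in doc_freq.items() if df >= no_below)
    return {token: i for i, token in enumerate(kept)}

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs, skipping tokens that are not in the vocabulary."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

docs = [["happi", "fridai", "happi"], ["happi", "sad", "fridai"], ["hopeless", "tmr"]]
vocab = build_vocab(docs)
bows = [to_bow(d, vocab) for d in docs]
print(vocab)
print(bows)
```

The last toy document ends up empty because neither of its tokens survives the frequency filter, which is the same effect as the leading `[]` in the negative tweets above.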


# Train a classifier

In [41]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from gensim.matutils import sparse2full

X = [sparse2full(x,length=MAXWORDS) for x in positive_docs + negative_docs]
y = ["P" for x in positive_docs] + \
    ["N" for x in negative_docs]

X_train, X_test, y_train, y_test = train_test_split(
   X, y, test_size=0.33)


nb = MultinomialNB()
nb.fit(X_train, y_train)
y_hat_train = nb.predict(X_train)
y_hat_test  = nb.predict(X_test)

# Evaluate a classifier

In [42]:
from sklearn.metrics import confusion_matrix, f1_score

print(confusion_matrix(y_train, y_hat_train))
print(f1_score(y_train, y_hat_train, pos_label="P"))
print(confusion_matrix(y_test, y_hat_test))
print(f1_score(y_test, y_hat_test, pos_label="P"))


[[2337  996]
 [ 535 2832]]
0.7872133425990271
[[1105  562]
 [ 354 1279]]
0.7363270005757053
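
The F1 scores can be recomputed by hand from the confusion matrices printed above. This sketch assumes scikit-learn's layout: rows are true labels, columns are predicted labels, and labels are in sorted order ("N" before "P").

```python
def f1_from_confusion(matrix):
    """Compute precision, recall, and F1 for the second class of a 2x2 confusion matrix.

    Rows are true labels and columns are predicted labels, in sorted label order,
    so with labels "N" and "P" the layout is [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Values copied from the output above.
train_cm = [[2337, 996], [535, 2832]]
test_cm = [[1105, 562], [354, 1279]]

for name, matrix in (("train", train_cm), ("test", test_cm)):
    p, r, f1 = f1_from_confusion(matrix)
    print(f"{name}: precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

The gap between the train and test F1 is the usual sign that the model fits the training tweets better than unseen ones.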


# Here's a rule-based model called VADER

http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf

Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for
Sentiment Analysis of Social Media Text. Eighth International Conference on
Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.


In [48]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()
scores = [sia.polarity_scores(x) for x in positive_tweets + negative_tweets]
print(scores[0:10])
pn_scores = ["N" if x['neg'] > x['pos'] else "P" for x in scores]
print(f1_score(y, pn_scores,pos_label="P"))



[{'neg': 0.0, 'neu': 0.615, 'pos': 0.385, 'compound': 0.7579}, {'neg': 0.145, 'neu': 0.585, 'pos': 0.27, 'compound': 0.6229}, {'neg': 0.0, 'neu': 0.706, 'pos': 0.294, 'compound': 0.7959}, {'neg': 0.0, 'neu': 0.123, 'pos': 0.877, 'compound': 0.7983}, {'neg': 0.0, 'neu': 0.718, 'pos': 0.282, 'compound': 0.795}, {'neg': 0.0, 'neu': 0.565, 'pos': 0.435, 'compound': 0.6597}, {'neg': 0.063, 'neu': 0.417, 'pos': 0.52, 'compound': 0.9466}, {'neg': 0.0, 'neu': 0.87, 'pos': 0.13, 'compound': 0.4588}, {'neg': 0.0, 'neu': 0.619, 'pos': 0.381, 'compound': 0.7615}, {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}]
0.8531605113636365
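
The cell above forces every tweet into "P" or "N" by comparing `neg` and `pos`. The VADER documentation instead suggests thresholding the `compound` score at +/-0.05, which leaves room for a neutral label. A small sketch of that alternative, using three of the score dictionaries printed above (the function name and the "neutral" label are this sketch's own choices):

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to a label using the commonly cited +/-0.05 cutoffs."""
    if compound >= threshold:
        return "P"
    if compound <= -threshold:
        return "N"
    return "neutral"

# Three of the score dictionaries printed above, copied by hand.
sample_scores = [
    {"neg": 0.0, "neu": 0.615, "pos": 0.385, "compound": 0.7579},
    {"neg": 0.145, "neu": 0.585, "pos": 0.27, "compound": 0.6229},
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
]

labels = []
for s in sample_scores:
    notebook_rule = "N" if s["neg"] > s["pos"] else "P"
    labels.append((notebook_rule, label_from_compound(s["compound"])))
print(labels)
```

The third tweet has no sentiment-bearing words at all, yet the two-label rule counts it as "P"; ties like this inflate or deflate the F1 depending on which class they fall into.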


## From Ahmed, T., Bosu, A., Iqbal, A. and Rahimi, S.'s README for SentiCR

https://github.com/senticr/SentiCR

Or my Python 3 version:

https://github.com/abramhindle/SentiCR


# SentiCR

SentiCR is an automated sentiment analysis tool for code review comments. SentiCR uses supervised learning algorithms to train
models based on 1600 manually labeled code review comments (https://github.com/senticr/SentiCR/blob/master/SentiCR/oracle.xlsx). Features of SentiCR include:

- Special preprocessing steps to exclude URLs and code snippets
- Special preprocessing for emoticons
- Preprocessing steps for contractions
- Special handling of negation phrases through precise identification
- Optimized for the SE domain
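
To make the preprocessing bullets concrete, here is a heavily simplified sketch of those ideas. It is not SentiCR's code: the regular expression, the contraction table, the emoticon placeholders, and the `NOT_` prefix rule (mark only the single word after a negation) are all assumptions made for illustration.

```python
import re

# Hypothetical, heavily simplified stand-ins for the preprocessing ideas listed above.
URL_PATTERN = re.compile(r"https?://\S+")
CONTRACTIONS = {"don't": "do not", "doesn't": "does not", "won't": "will not", "i'm": "i am"}
EMOTICONS = {":)": "PositiveSentiment", ":(": "NegativeSentiment"}
NEGATIONS = {"not", "no", "never"}

def preprocess(comment):
    """Strip URLs, replace emoticons, expand contractions, and mark the word after a negation."""
    text = URL_PATTERN.sub(" ", comment.lower())
    for emoticon, placeholder in EMOTICONS.items():
        text = text.replace(emoticon, f" {placeholder} ")
    tokens = []
    for raw in text.split():
        tokens.extend(CONTRACTIONS.get(raw, raw).split())
    marked, negate_next = [], False
    for token in tokens:
        if negate_next and token not in NEGATIONS:
            marked.append("NOT_" + token)  # "not like" becomes "not", "NOT_like"
            negate_next = False
        else:
            marked.append(token)
            negate_next = token in NEGATIONS
    return marked

result = preprocess("I don't like this, see https://example.com/review :(")
print(result)
```

Marking negated words gives a bag-of-words classifier a separate feature for "like" and "NOT_like", which recovers a little of the context that plain word counts throw away.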

## Performance
Across one hundred ten-fold cross-validations, SentiCR achieved 83.03% accuracy (i.e., human-level accuracy), 67.84% precision,
58.35% recall, and 0.62 f-score with a Gradient Boosting Tree based model. Detailed cross-validation results are included here:
https://github.com/senticr/SentiCR/tree/master/cross-validation-results

## Cite

Ahmed, T. , Bosu, A., Iqbal, A. and Rahimi, S., "SentiCR: A Customized Sentiment Analysis Tool for Code Review Interactions", In Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering (NIER track).

```
@INPROCEEDINGS{Ahmed-et-al-SentiCR,
   author = {Ahmed, Toufique and Bosu, Amiangshu and Iqbal, Anindya and Rahimi, Shahram},
   title = {{SentiCR: A Customized Sentiment Analysis Tool for Code Review Interactions}},
   year = {2017},
   series = {ASE '17},
   booktitle = {32nd IEEE/ACM International Conference on Automated Software Engineering (NIER track)},
}
```

http://amiangshu.com/papers/senticr-ase.pdf



In [49]:
from SentiCR import  SentiCR

sentiment_analyzer=SentiCR()
senticr = sentiment_analyzer
# All examples are actual code review comments from Go lang

sentences=["I'm not sure I entirely understand what you are saying. "+\
           "However, looking at file_linux_test.go I'm pretty sure an interface type would be easier for people to use.",
           "I think it always returns it as 0.",
           "If the steal does not commit, there's no need to clean up _p_'s runq. If it doesn't commit,"+\
             " runqsteal just won't update runqtail, so it won't matter what's in _p_.runq.",
           "Please change the subject: s:internal/syscall/windows:internal/syscall/windows/registry:",
           "I don't think the name Sockaddr is a good choice here, since it means something very different in "+\
           "the C world.  What do you think of SocketConnAddr instead?",
           "could we use sed here? "+\
            " https://go-review.googlesource.com/#/c/10112/1/src/syscall/mkall.sh "+\
            " it will make the location of the build tag consistent across files (always before the package statement).",
           "Is the implementation hiding here important? This would be simpler still as: "+\
          " typedef struct GoSeq {   uint8_t *buf;   size_t off;   size_t len;   size_t cap; } GoSeq;",
           "Make sure you test both ways, or a bug that made it always return false would cause the test to pass. "+\
        " assertTrue(Testpkg.Negate(false)); "+\
        " assertFalse(Testpkg.Negate(true)); "+\
        " If you want to use the assertEquals form, be sure the message makes clear what actually happened and " +\
        "what was expected (e.g. Negate(true) != false). "]

for sent in sentences:
    score=sentiment_analyzer.get_sentiment_polarity(sent)
    print(sent+"\n Score: "+str(score))
    



Reading data from oracle..
Training classifier model..


  'stop_words.' % sorted(inconsistent))


I'm not sure I entirely understand what you are saying. However, looking at file_linux_test.go I'm pretty sure an interface type would be easier for people to use.
 Score: [-1.]
I think it always returns it as 0.
 Score: [0.]
If the steal does not commit, there's no need to clean up _p_'s runq. If it doesn't commit, runqsteal just won't update runqtail, so it won't matter what's in _p_.runq.
 Score: [-1.]
Please change the subject: s:internal/syscall/windows:internal/syscall/windows/registry:
 Score: [0.]
I don't think the name Sockaddr is a good choice here, since it means something very different in the C world.  What do you think of SocketConnAddr instead?
 Score: [-1.]
could we use sed here?  https://go-review.googlesource.com/#/c/10112/1/src/syscall/mkall.sh  it will make the location of the build tag consistent across files (always before the package statement).
 Score: [0.]
Is the implementation hiding here important? This would be simpler still as:  typedef struct GoSeq {   uin

In [50]:
scores = [sentiment_analyzer.get_sentiment_polarity(x) for x in positive_tweets + negative_tweets]
print(scores[0:10])
pn_scores = ["N" if x < 0 else "P" for x in scores]
print(f1_score(y, pn_scores,pos_label="P"))

[array([0.]), array([0.]), array([0.]), array([0.]), array([0.]), array([0.]), array([0.]), array([0.]), array([0.]), array([0.])]
0.6959335866047559


# Latest deep learning transformer-based sentiment analysis

- Hugging Face Transformers

In [51]:
# From Alexander Wong
# using transformers for sentiment analysis
import nltk
from transformers import pipeline
# Requires the punkt tokenizer data (nltk.download('punkt'))
sentence_tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")
sentiment_analyzer = pipeline(
    "sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english"
)

In [52]:
from nltk.tokenize import TweetTokenizer
from nltk.tokenize import  word_tokenize
tokenizer = TweetTokenizer()
N = 600
y_hug = ["P" for x in range(N)] + ["N" for x in range(N)]
sentences = [sentence_tokenizer.tokenize(x)[0] for x in positive_tweets[0:N] + negative_tweets[0:N]]


In [53]:
# run through the sentiment analyzer to get pos/neg label and score
sentiment_batches = [sentiment_analyzer(sentence) for sentence in sentences]

In [54]:
sentiment_batches[0]


[{'label': 'POSITIVE', 'score': 0.9260645508766174}]

In [55]:
sentiments = [sentence for batch in sentiment_batches for sentence in batch]
sentiments[0:10]

[{'label': 'POSITIVE', 'score': 0.9260645508766174},
 {'label': 'POSITIVE', 'score': 0.9929360747337341},
 {'label': 'POSITIVE', 'score': 0.9997010827064514},
 {'label': 'POSITIVE', 'score': 0.9924265146255493},
 {'label': 'NEGATIVE', 'score': 0.9931758046150208},
 {'label': 'POSITIVE', 'score': 0.9995020627975464},
 {'label': 'NEGATIVE', 'score': 0.9852420091629028},
 {'label': 'NEGATIVE', 'score': 0.9981480836868286},
 {'label': 'POSITIVE', 'score': 0.9309679865837097},
 {'label': 'POSITIVE', 'score': 0.9934095144271851}]

In [56]:
from sklearn.metrics import confusion_matrix, f1_score

y_hat = ["N" if x['label'] == 'NEGATIVE' else 'P' for x in sentiments]
print(f1_score(y_hug, y_hat,pos_label="P"))




0.6245353159851301


In [57]:
"""
From 
Pletea D, Vasilescu B, Serebrenik A. 
Security and emotion: sentiment analysis of security discussions on GitHub. 
In Proceedings of the 11th working conference on mining software repositories 
2014 May 31 (pp. 348-351).
"""
keywords = """access policy, access role, access-policy, access-role, accesspolicy, accessrole, aes, audit, authentic, authority,
authoriz, biometric, black list, black-list, blacklist, blacklist,
cbc, certificate, checksum, cipher, clearance, confidentiality,
cookie, crc, credential, crypt, csrf, decode, defensive programming,
defensive-programming, delegation, denial of service,
denial-of-service, diffie-hellman, dmz, dotfuscator, dsa,
ecdsa, encode, escrow, exploit, firewall, forge, forgery, gss
api, gss-api, gssapi, hack, hash, hmac, honey pot, honeypot, 
honeypot, inject, integrity, kerberos, ldap, login, malware,
md5, nonce, nss, oauth, obfuscat, open auth, openauth,
openauth, openid, owasp, password, pbkdf2, pgp, phishing, pki, privacy, private key, private-key, privatekey, privilege, 
public key, public-key, publickey, rbac, rc4, repudiation,
rfc 2898, rfc-2898, rfc2898, rijndael, rootkit, rsa, salt, saml,
sanitiz, secur, sha, shell code, shell-code, shellcode, shibboleth, 
signature, signed, signing, sing sign-on, single signon,
single-sign-on, smart assembly, smart-assembly, smartassembly, 
snif, spam, spnego, spoofing, spyware, ssl, sso,
steganography, tampering, trojan, trust, violat, virus, white
list, white-list, whitelist, x509, xss."""
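
Many entries in this list are stems ("authoriz", "secur", "crypt"), which suggests substring matching rather than whole-word matching. A minimal sketch of flagging security-related text that way, assuming case-insensitive containment and using only a hand-picked subset of the list. The paper's exact matching procedure is not reproduced here.

```python
# A hand-picked subset of the keyword list above, for illustration only.
SECURITY_STEMS = ["authoriz", "crypt", "password", "secur", "xss"]

def security_hits(text, stems=SECURITY_STEMS):
    """Return the stems that occur as case-insensitive substrings of `text`."""
    lowered = text.lower()
    return [stem for stem in stems if stem in lowered]

hits = security_hits("Please encrypt the password before storing it")
misses = security_hits("Rename this variable for clarity")
print(hits)
print(misses)
```

Substring matching is deliberately loose, so short stems can also fire on unrelated words; treat the hits as candidates, not ground truth.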

In [58]:
print(sentiment_analyzer(sentence_tokenizer.tokenize(keywords)))

[{'label': 'NEGATIVE', 'score': 0.9758636355400085}]


In [27]:
print(SentiCR().get_sentiment_polarity(keywords))

Reading data from oracle..
Training classifier model..


  'stop_words.' % sorted(inconsistent))


[0.]


In [59]:
print(sia.polarity_scores(keywords))

{'neg': 0.025, 'neu': 0.887, 'pos': 0.087, 'compound': 0.8225}


In [60]:
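# preprocess_documents expects a list of documents, so wrap the single string in a list;
# a bare string would be iterated character by character and yield an empty document.
keyword_doc = sparse2full(words.doc2bow(preprocess_documents([keywords])[0]),length=MAXWORDS)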
keyword_doc

array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)

In [61]:
nb.predict([keyword_doc])

array(['P'], dtype='<U1')

In [47]:
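# Only the last assignment takes effect; reorder the lines to try a different phrase.
phrase = "I forgot to charge my phone over night"
phrase = "The sky is blue"
phrase = "Stuff is not working out for me today"
phrase = "I don't like pizza"
# preprocess_documents expects a list of documents, so wrap the phrase in a list
custom_doc = sparse2full(words.doc2bow(preprocess_documents([phrase])[0]),length=MAXWORDS)
nb.predict([custom_doc])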

array(['P'], dtype='<U1')