# Sentiment Analysis
## Method #1 for Sentiment Analysis
Use established sentiment dictionaries (lexicons)

Acknowledgements:
* https://peerchristensen.netlify.app/post/fair-is-foul-and-foul-is-fair-a-tidytext-entiment-analysis-of-shakespeare-s-tragedies/

We'll look at the evolution of sentiment in Shakespeare's plays.

In [None]:
library('gutenbergr')
library(tidyverse)

# Needed for sentiment collection
library(tidytext)

In [None]:
shakespeare <- gutenberg_works(author == "Shakespeare, William") 

In [None]:
head(shakespeare)

In [None]:
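# Pick ten plays by row position in the metadata table
# (positions may shift if the Project Gutenberg metadata changes)
IDs <- shakespeare[c(16,24,34,35,54,55,56,57,58,59),]$gutenberg_id
shakespeare %>% filter(gutenberg_id %in% IDs)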

In [None]:
plays = gutenberg_download(IDs,meta_fields = "title")

In [None]:
plays

In [None]:
get_sentiments('bing')

tidytext also provides its own NLP tools. Here we use `unnest_tokens()` for word tokenization, which yields one word per row:

In [None]:
plays %>% 
group_by(title) %>% 
unnest_tokens(word, text)

In [None]:
plays %>% 
group_by(title) %>% 
unnest_tokens(word, text) %>% 
inner_join(get_sentiments("bing"))
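
To make the join concrete, here is a minimal sketch on a two-line toy text. The `toy` tibble and the three-word `mini_lexicon` are hypothetical stand-ins for `plays` and the Bing lexicon; only words that appear in the lexicon survive the `inner_join`.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data, standing in for `plays`
toy <- tibble(title = "Toy", text = c("Fair is foul", "and foul is fair"))

# Hypothetical three-word lexicon, standing in for get_sentiments("bing")
mini_lexicon <- tibble(word      = c("fair", "foul", "love"),
                       sentiment = c("positive", "negative", "positive"))

toy_sentiments <- toy %>%
  unnest_tokens(word, text) %>%          # one lowercased word per row
  inner_join(mini_lexicon, by = "word")  # drop words not in the lexicon

count(toy_sentiments, sentiment)
```

Words such as "is" and "and" are not in the lexicon, so they drop out and only the four sentiment-bearing tokens remain.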

In [None]:
sentiments <- plays %>% 
                group_by(title) %>% 
                unnest_tokens(word, text) %>% 
                inner_join(get_sentiments("bing"))

In [None]:
sentiments %>% group_by(title) %>% count(sentiment)

In [None]:
sentiments %>% group_by(title) %>% count(sentiment) %>%
 ggplot(aes(x = sentiment, y = n, fill = title)) + 
 geom_bar(stat = "identity") +
 facet_wrap(~title)

In [None]:
sentiments <- plays            %>% 
  group_by(title)             %>%
  unnest_tokens(word, text)   %>%      # tokenize words
  #anti_join(stop_words) %>%           # in case we would like to remove stop words
  inner_join(get_sentiments("bing"))   # keep only words found in the Bing lexicon

In [None]:
sentiments

In [None]:
sentiments <- mutate(sentiments, line = row_number())

In [None]:
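plays %>%
  group_by(title) %>%
  unnest_tokens(word, text) %>%                        # tokenize words
  inner_join(get_sentiments("bing")) %>%               # keep only words found in the Bing lexicon
  mutate(line = row_number()) %>%                      # position of each sentiment word within its play
  count(title, index = line %/% 100, sentiment) %>%    # counts per chunk of 100 sentiment words
  spread(sentiment, n, fill = 0) %>%                   # one column per sentiment
  mutate(sentiment = positive - negative) %>%          # net sentiment per chunk
  ggplot(aes(index, sentiment, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~title, scales = "free_x")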
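
The chunking step can be hard to follow on full plays, so here is a minimal sketch on six hypothetical sentiment words, using chunks of 3 instead of 100. The `toy` tibble is made up for illustration, and `pivot_wider()` is used as the current tidyr replacement for `spread()`.

```r
library(dplyr)
library(tidyr)

# Hypothetical sequence of six sentiment words from a single play
toy <- tibble(title = "Toy",
              sentiment = c("positive", "negative", "negative",
                            "positive", "positive", "positive"))

toy_net <- toy %>%
  mutate(line = row_number()) %>%                  # 1..6 (the real code numbers within each play)
  count(title, index = line %/% 3, sentiment) %>%  # integer division assigns rows to chunks 0, 1, 2
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)          # net sentiment per chunk

toy_net
```

Because `row_number()` starts at 1, the first chunk holds one row fewer than the others (rows 1 and 2 here, rows 1 to 99 with a divisor of 100).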

## Method #2 for Sentiment Analysis
Supervised Machine Learning with Naive Bayes

In [None]:
library(quanteda)

In [None]:
txt <- c(d1 = "Best Good Best",
         d2 = "Best Best Ok",
         d3 = "Best Blah",
         d4 = "Bad Worst Best",
         d5 = "Best Best Best Bad Worst")
txt <- tokens(txt)
trainingset <- dfm(txt, tolower = FALSE)
trainingclass <- factor(c("Y", "Y", "Y", "N", NA), ordered = TRUE)

In [None]:
trainingset

In [None]:
trainingclass

Classification with Naive Bayes.

We start from Bayes' theorem, where $c$ is a class and $F$ is the set of features (terms) observed in a document:

$$P(c|F) = \frac{P(F|c)P(c)}{P(F)}$$

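with the (sometimes unreasonable but also unreasonably effective) assumption that the individual features $f_1, \ldots, f_n$ making up $F$ are conditionally independent given the class:

$$P(F|c)P(c) = P(f_1|c)P(f_2|c) \cdots P(f_n|c)P(c)$$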
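
Since $P(F)$ is the same for every class, it can be dropped when classes are compared. As a sketch of the resulting decision rule (the symbol $\hat{c}$ for the predicted class is notation introduced here):

$$\hat{c} = \arg\max_c \; P(c)\prod_{i=1}^{n} P(f_i|c)$$

In practice the product is usually evaluated as a sum of logarithms, which avoids numerical underflow for long documents:

$$\hat{c} = \arg\max_c \; \left[\log P(c) + \sum_{i=1}^{n} \log P(f_i|c)\right]$$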

quanteda's Naive Bayes classifier lives in the separate `quanteda.textmodels` package:

In [None]:
library('quanteda.textmodels')

In [None]:
(tmod1 <- textmodel_nb(x = trainingset, y = trainingclass, prior = "docfreq"))

In [None]:
summary(tmod1)

In [None]:
trainingset

Example for `Best`:

* Class Y: d1, d2, d3 (8 tokens in total, 5 of them `Best`)
* Class N: d4 (3 tokens in total, 1 of them `Best`)
* P(Best|Y) = (2+2+1) / (3+3+2) = 5/8 = 0.625
* P(Best|N) = (1) / (3) = 0.333

These are the unsmoothed estimates. A term that never occurs in a class would get probability 0 and zero out the whole product. This is taken care of by add-one (Laplace) smoothing, which treats every term as if it occurred once more than it did:

* P(Best|Y) = (5 + 1) / (8 + 6) = 6/14 = 0.42857..., where the +1 in the numerator is the extra count for the term and the +6 in the denominator is one extra count for each of the 6 unique terms
* P(Best|N) = (1 + 1) / (3 + 6) = 2/9 = 0.222...
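
The arithmetic above can be reproduced in a few lines of base R. This is only a sketch: the `counts` matrix is typed in by hand from documents d1 to d4 rather than extracted from `trainingset`.

```r
# Term counts per class, entered by hand from the labelled documents d1-d4
counts <- rbind(
  Y = c(Best = 5, Good = 1, Ok = 1, Blah = 1, Bad = 0, Worst = 0),
  N = c(Best = 1, Good = 0, Ok = 0, Blah = 0, Bad = 1, Worst = 1)
)

# Unsmoothed estimates: term count / total tokens in the class
mle <- counts / rowSums(counts)

# Add-one (Laplace) smoothing: +1 per term, + number of unique terms in the denominator
smoothed <- (counts + 1) / (rowSums(counts) + ncol(counts))

round(mle[, "Best"], 3)       # 0.625 for Y, 0.333 for N
round(smoothed[, "Best"], 3)  # 0.429 for Y, 0.222 for N
```

After smoothing, terms such as `Bad` that never occur in class Y get a small non-zero probability (1/14) instead of 0.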

In [None]:
coef(tmod1)

In [None]:
predict(tmod1)

In [None]:
predict(tmod1, type = "prob")
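
For the unlabelled document d5, the posterior can also be worked out by hand. The sketch below assumes the multinomial model with add-one smoothing and document-frequency priors, and uses the smoothed term probabilities from the worked example (with P(Bad|Y) = P(Worst|Y) = 1/14 following from the same formula). It is a hand calculation, not a call into quanteda.

```r
# Smoothed term probabilities for the terms that occur in d5
p_Y <- c(Best = 6/14, Bad = 1/14, Worst = 1/14)
p_N <- c(Best = 2/9,  Bad = 2/9,  Worst = 2/9)

# Term counts in d5 = "Best Best Best Bad Worst"
d5 <- c(Best = 3, Bad = 1, Worst = 1)

# Document-frequency priors: 3 of the 4 labelled documents are Y
prior <- c(Y = 3/4, N = 1/4)

# Unnormalised scores: prior times the product of term probabilities
score <- c(Y = prior[["Y"]] * prod(p_Y ^ d5),
           N = prior[["N"]] * prod(p_N ^ d5))

# Normalise so the two posteriors sum to 1
posterior <- score / sum(score)
round(posterior, 3)   # roughly 0.69 for Y and 0.31 for N
```

The three occurrences of `Best` outweigh the two negative terms, so under this model d5 is assigned to class Y.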

In [None]:
# contrast with other priors
predict(textmodel_nb(trainingset, trainingclass, prior = "uniform"))

In [None]:
predict(textmodel_nb(trainingset, trainingclass, prior = "termfreq"))
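
As a rough illustration of how the three priors differ, the sketch below computes them by hand for the four labelled documents. It assumes, following my reading of the `textmodel_nb` documentation, that `"docfreq"` uses the share of documents per class and `"termfreq"` the share of total term counts per class; the variable names are made up for this example.

```r
# Labels and token counts of the labelled training documents d1-d4
classes        <- c(d1 = "Y", d2 = "Y", d3 = "Y", d4 = "N")
tokens_per_doc <- c(d1 = 3, d2 = 3, d3 = 2, d4 = 3)

# Every class equally likely
prior_uniform  <- c(N = 1/2, Y = 1/2)

# Share of documents per class: N = 1/4, Y = 3/4
prior_docfreq  <- prop.table(table(classes))

# Share of tokens per class: N = 3/11, Y = 8/11
prior_termfreq <- prop.table(tapply(tokens_per_doc, classes, sum))

prior_uniform
prior_docfreq
prior_termfreq
```

With so few documents the choice of prior matters: the further the prior leans toward Y, the more evidence a document needs before it is classified as N.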

In [None]:
tmod2 <- textmodel_nb(trainingset, trainingclass, distribution = "Bernoulli", prior = "docfreq")

In [None]:
coef(tmod2)

In [None]:
trainingset

In [None]:
dfm_weight(trainingset, scheme="boolean")

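With the Bernoulli distribution, only the presence of a term in a document matters, not how often it occurs. The probabilities are now the (smoothed) fraction of documents in each class that contain the term Best:
* P(Best|Y) = (3 + 1) / (3 + 2) = 4/5 = 0.8, since all 3 documents of class Y contain Best
* P(Best|N) = (1 + 1) / (1 + 2) = 2/3 = 0.667, since the single document of class N contains Best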
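
The same document-level counts can be tabulated for every term in base R. This is a sketch with a hand-typed presence matrix for d1 to d4, not output taken from `tmod2`.

```r
# Presence (1) or absence (0) of each term in the labelled documents d1-d4
presence <- rbind(
  d1 = c(Best = 1, Good = 1, Ok = 0, Blah = 0, Bad = 0, Worst = 0),
  d2 = c(Best = 1, Good = 0, Ok = 1, Blah = 0, Bad = 0, Worst = 0),
  d3 = c(Best = 1, Good = 0, Ok = 0, Blah = 1, Bad = 0, Worst = 0),
  d4 = c(Best = 1, Good = 0, Ok = 0, Blah = 0, Bad = 1, Worst = 1)
)
classes <- c("Y", "Y", "Y", "N")

# Number of documents per class containing each term (rows come out as N, Y)
docs_with_term <- rowsum(presence, classes)

# Number of documents per class, in the same N, Y order
docs_in_class <- as.vector(table(classes))

# Smoothed Bernoulli estimate: (docs with term + 1) / (docs in class + 2)
p_bernoulli <- (docs_with_term + 1) / (docs_in_class + 2)
round(p_bernoulli[, "Best"], 3)   # 0.667 for N, 0.8 for Y
```

The +2 in the denominator reflects the two possible outcomes per term (present or absent), in contrast to the +6 unique terms used in the multinomial case.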

In [None]:
predict(tmod2, newdata = trainingset[5, ], type = "prob")

In [None]:
predict(tmod2, newdata = trainingset[5, ])
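
The Bernoulli prediction for d5 can be checked by hand as well. The sketch below assumes document-frequency priors and the smoothed formula (documents with term + 1) / (documents in class + 2) for every term. Unlike the multinomial model, absent terms also contribute, through a factor of 1 - p.

```r
# Smoothed probabilities that a document of each class contains a term
p_Y <- c(Best = 4/5, Good = 2/5, Ok = 2/5, Blah = 2/5, Bad = 1/5, Worst = 1/5)
p_N <- c(Best = 2/3, Good = 1/3, Ok = 1/3, Blah = 1/3, Bad = 2/3, Worst = 2/3)

# d5 = "Best Best Best Bad Worst": only presence matters, counts are ignored
present <- c(Best = TRUE, Good = FALSE, Ok = FALSE, Blah = FALSE, Bad = TRUE, Worst = TRUE)

# Document-frequency priors: 3 of the 4 labelled documents are Y
prior <- c(Y = 3/4, N = 1/4)

# Present terms contribute p, absent terms contribute (1 - p)
likelihood <- function(p) prod(ifelse(present, p, 1 - p))

score <- c(Y = prior[["Y"]] * likelihood(p_Y),
           N = prior[["N"]] * likelihood(p_N))
posterior <- score / sum(score)
round(posterior, 3)   # roughly 0.191 for Y and 0.809 for N
```

Because the repeated `Best` counts only once here, the presence of `Bad` and `Worst` tips this hand calculation toward class N, the opposite of the multinomial result.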