# Performing Analysis on Tidy Text

## Packages to Install

In [20]:
install.packages("tidytext")
install.packages("textdata")
install.packages("dplyr")
install.packages("stringr")
install.packages("janeaustenr")




The downloaded binary packages are in
	/var/folders/2h/84wxzls579b1yv00g4jj02fh0000gn/T//RtmpkIayK9/downloaded_packages



## Sentiment Analysis
In the previous chapter, we explored the tidy text format and how it helps analyze word frequency and compare documents. Now, we shift focus to opinion mining or sentiment analysis, where we assess the emotional tone of a text. Just as human readers infer emotions like positivity, negativity, surprise, or disgust from words, we can use text mining tools to programmatically analyze a text's emotional content. A common approach to sentiment analysis is to break the text into individual words, assign sentiment scores to those words, and sum them to understand the overall sentiment of the text. While this isn’t the only method for sentiment analysis, it is widely used and fits naturally within the tidy data framework and toolset.
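The word-by-word approach described above can be sketched in a few lines of base R. This is only an illustration: `toy_lexicon` and `score_text()` are hypothetical names, and the scores are invented for the example rather than taken from any published lexicon.

```r
# Toy lexicon: hypothetical words and scores, invented for illustration
toy_lexicon <- c(happy = 2, love = 3, sad = -2, terrible = -3)

score_text <- function(text, lexicon) {
  # Lowercase the text and split it on anything that is not a letter or apostrophe
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  # Look up each word by name; words missing from the lexicon come back as NA
  scores <- lexicon[words]
  # Sum the scores of the matched words only
  sum(scores, na.rm = TRUE)
}

score_text("I love this happy ending", toy_lexicon)
score_text("A sad and terrible day", toy_lexicon)
```

The first call adds the scores for "love" and "happy"; the second adds those for "sad" and "terrible". Words without a lexicon entry contribute nothing to the total.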

### The `sentiments` Dataset 
As discussed earlier, there are several methods and dictionaries available for evaluating opinion or emotion in text. The **tidytext** package provides access to three commonly used general-purpose sentiment lexicons: **AFINN**, developed by [Finn Årup Nielsen](https://www2.imm.dtu.dk/pubdb/pubs/6010-full.html); **bing**, from [Bing Liu and collaborators](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html); and **nrc**, created by [Saif Mohammad and Peter Turney](https://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm). All three lexicons are based on unigrams, meaning they assign sentiment scores to individual words. The *nrc* lexicon categorizes words as positive, negative, or into specific emotions like joy, anger, sadness, and trust, using a simple "yes"/"no" system. The *bing* lexicon also applies a binary positive or negative label to each word, while the *AFINN* lexicon assigns numerical sentiment scores ranging from -5 (most negative) to 5 (most positive). These lexicons are distributed under different licenses, so it’s important to review the terms and ensure they align with your project requirements before use.


The `get_sentiments()` function retrieves a specific lexicon together with its sentiment measure. Passing the lexicon name (`"afinn"`, `"bing"`, or `"nrc"`) returns a tibble of words with their scores or categories, ready to use in a tidy text workflow. The AFINN and NRC lexicons are downloaded through the **textdata** package, which asks you to accept the lexicon's license the first time you request it.


In [14]:
library(tidytext)

print(get_sentiments("afinn"))


# A tibble: 2,477 x 2
   word       value
   <chr>      <dbl>
 1 abandon       -2
 2 abandoned     -2
 3 abandons      -2
 4 abducted      -2
 5 abduction     -2
 6 abductions    -2
 7 abhor         -3
 8 abhorred      -3
 9 abhorrent     -3
10 abhors        -3
# i 2,467 more rows


In [16]:
print(get_sentiments("bing"))

# A tibble: 6,786 x 2
   word        sentiment
   <chr>       <chr>
 1 2-faces     negative
 2 abnormal    negative
 3 abolish     negative
 4 abominable  negative
 5 abominably  negative
 6 abominate   negative
 7 abomination negative
 8 abort       negative
 9 aborted     negative
10 aborts      negative
# i 6,776 more rows


In [19]:
print(get_sentiments("nrc"))

# A tibble: 13,872 x 2
   word        sentiment
   <chr>       <chr>
 1 abacus      trust
 2 abandon     fear
 3 abandon     negative
 4 abandon     sadness
 5 abandoned   anger
 6 abandoned   fear
 7 abandoned   negative
 8 abandoned   sadness
 9 abandonment anger
10 abandonment fear
# i 13,862 more rows


The sentiment lexicons used in text mining were built either through crowdsourcing (for example, on Amazon Mechanical Turk) or through the manual work of individual researchers, and they were validated using datasets such as crowdsourced opinions, movie or restaurant reviews, or Twitter posts. Applying them to text that differs sharply in style or era, such as 200-year-old narrative fiction, may therefore give less precise results, although enough vocabulary is shared for the analysis to remain meaningful. Beyond these general-purpose lexicons, domain-specific options exist, such as lexicons designed for financial text.

Dictionary-based methods sum the sentiment scores of individual words, so they ignore context, qualifiers, and negation (e.g., "not good"). For narrative text with little sustained sarcasm or negation, this limitation matters less, and later chapters explore strategies for handling negation. The size of the text chunk being scored also matters: over a large section, positive and negative scores tend to cancel each other out, whereas sentence- or paragraph-level analysis often gives clearer results.
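Both limitations can be seen in a small base R sketch. The lexicon, the sentences, and the helper `score_chunk()` are all hypothetical, chosen only to make the effect visible.

```r
# Hypothetical mini-lexicon, invented for illustration
toy_lexicon <- c(good = 3, wonderful = 4, awful = -3, dreadful = -4)

# Sum the lexicon scores of the words in one piece of text
score_chunk <- function(text, lexicon) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  sum(lexicon[words], na.rm = TRUE)
}

# Negation is invisible to a word-level lookup: both phrases get the same score
score_chunk("good", toy_lexicon)
score_chunk("not good", toy_lexicon)

# Chunk size matters: opposite sentiments cancel when scored together
sentences <- c("The morning was wonderful.", "The evening was dreadful.")
whole_passage <- score_chunk(paste(sentences, collapse = " "), toy_lexicon)
per_sentence <- sapply(sentences, score_chunk, lexicon = toy_lexicon,
                       USE.NAMES = FALSE)
whole_passage
per_sentence
```

Scored as one chunk, the passage nets out to zero even though each sentence carries a strong sentiment of its own, which is why smaller units often give a clearer picture.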


### Sentiment Analysis with Inner Join

When text data is in a tidy format, sentiment analysis can be done as an inner join. This is one of the key advantages of treating text mining as a tidy data task: the tools and workflows stay consistent. Just as removing stop words is an `anti_join()` operation, sentiment analysis is an `inner_join()` operation that matches each word in the tokenized text to its entry in a sentiment lexicon. Words that do not appear in the lexicon are dropped from the result.
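A minimal sketch of the two joins on toy data may make the parallel concrete. The tibbles `tokens`, `toy_stop_words`, and `toy_lexicon` are hypothetical stand-ins for tokenized text, a stop word list, and a sentiment lexicon; only **dplyr** is assumed to be installed.

```r
suppressPackageStartupMessages(library(dplyr))

# Toy tokens, one word per row (hypothetical data)
tokens <- tibble(
  line = c(1, 1, 1, 2, 2),
  word = c("the", "happy", "friend", "a", "miserable")
)

# Hypothetical stand-ins for a stop word list and a sentiment lexicon
toy_stop_words <- tibble(word = c("the", "a"))
toy_lexicon <- tibble(
  word = c("happy", "friend", "miserable", "gloomy"),
  sentiment = c("positive", "positive", "negative", "negative")
)

# anti_join() drops rows whose word appears in the stop word list
content_words <- anti_join(tokens, toy_stop_words, by = "word")

# inner_join() keeps only words found in the lexicon and attaches their labels
scored <- inner_join(tokens, toy_lexicon, by = "word")
print(scored)
```

Because both tables share a column named `word`, the joins need no renaming. Lexicon entries that never occur in the text, such as "gloomy" here, simply do not appear in the result.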

To find the most common joy-related words in *Emma* using the NRC lexicon, we first need to convert the novel's text into a tidy format with one word per row, using `unnest_tokens()`, just like we did earlier. To preserve important context, we can also create additional columns that track which line and chapter each word appears in. This is done using `group_by()` to organize the text by book and `mutate()` to assign line numbers and detect chapters, often using a regular expression. Once the text is structured this way, we can easily filter for words associated with joy in the NRC lexicon and analyze their frequency within the novel.

Notice that we named the output column from `unnest_tokens()` as **word**, which is a practical and consistent choice. The sentiment lexicons and stop word datasets provided by **tidytext** also use **word** as the column name for individual tokens. By keeping this naming consistent, performing operations like `inner_join()` for sentiment analysis or `anti_join()` for removing stop words becomes straightforward and seamless, avoiding unnecessary renaming or data wrangling.

In [22]:
library(janeaustenr)
suppressPackageStartupMessages(library(dplyr))
library(stringr)

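tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    # Number the lines within each book
    linenumber = row_number(),
    # A running count of chapter headings gives each line's chapter number
    chapter = cumsum(str_detect(text, 
                                regex("^chapter [\\divxlc]", 
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  # Split each line of text into one word per row
  unnest_tokens(word, text)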
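To see how the `chapter` column in the cell above is built, it may help to run the same `str_detect()` and `cumsum()` combination on a handful of lines. The vector `lines` below is made-up placeholder text, not the actual contents of any novel.

```r
library(stringr)

# Placeholder lines standing in for the text column of one book (hypothetical)
lines <- c(
  "TITLE PAGE",
  "CHAPTER I",
  "First line of narration.",
  "Chapter II",
  "More narration.",
  "chapter 3",
  "A line that mentions chapters in passing."
)

# TRUE where a line starts with "chapter" followed by a digit or Roman numeral
is_heading <- str_detect(lines, regex("^chapter [\\divxlc]", ignore_case = TRUE))

# Each TRUE bumps the running total, so every line inherits its chapter number
chapter <- cumsum(is_heading)

data.frame(lines, is_heading, chapter)
```

Lines before the first heading get chapter 0, and the last line is not counted as a heading because the pattern is anchored to the start of the line.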

Now that the text is in a tidy format with one word per row, we can easily perform sentiment analysis. First, we use the NRC lexicon and apply `filter()` to select only the words associated with the emotion *joy*. Next, we filter the text data to include only words from *Emma* and use `inner_join()` to connect the text with the joy words from the lexicon. Finally, we use `count()` from **dplyr** to find the most common joy-related words in the novel. The results show mostly positive, happy words like *hope*, *friendship*, and *love*, which align with the theme of joy. However, some words like *found* or *present* may not always carry joyful meaning in Austen’s writing, a nuance that will be explored further in Section 2.4.

In [24]:
nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")

tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE) %>%
  print()

[1m[22mJoining with `by = join_by(word)`


[90m# A tibble: 301 x 2[39m
   word          n
   [3m[90m<chr>[39m[23m     [3m[90m<int>[39m[23m
[90m 1[39m good        359
[90m 2[39m friend      166
[90m 3[39m hope        143
[90m 4[39m happy       125
[90m 5[39m love        117
[90m 6[39m deal         92
[90m 7[39m found        92
[90m 8[39m present      89
[90m 9[39m kind         82
[90m10[39m happiness    76
[90m# i 291 more rows[39m


We can also examine how sentiment changes over the course of each novel with just a few lines of code, mostly **dplyr** functions. First, we use `inner_join()` with the Bing lexicon to attach a sentiment label (positive or negative) to each word in the tidy text. Then we count the positive and negative words in each section of each book to track how sentiment changes through the narrative. To define the sections, we create an **index** column that assigns each line to an 80-line block using integer division. The `%/%` operator performs integer division: `x %/% y` returns the largest whole number less than or equal to `x/y`. Dividing the line number by 80 in this way yields consecutive section numbers, which we can use to visualize how positive and negative sentiment varies throughout each novel.
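As a small illustration of the integer division described above, the following base R snippet maps a few line numbers to section indices. The `linenumber` values are arbitrary, picked to sit near the section boundaries.

```r
# Line numbers near the boundaries of 80-line sections (arbitrary examples)
linenumber <- c(1, 79, 80, 159, 160, 161)

# Integer division maps each line number to its section index
index <- linenumber %/% 80
index
```

Lines 1 and 79 fall in section 0, lines 80 and 159 in section 1, and lines 160 and 161 in section 2. Because line numbering starts at 1, section 0 holds 79 lines rather than 80, a small difference that rarely matters for this kind of analysis.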
