# Tidy Text Case Studies

## Acknowledgements
We would like to acknowledge the work of Julia Silge and David Robinson, whose materials were used under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 United States License. Their contributions to open data science education, particularly the book [Text Mining with R](https://www.tidytextmining.com), provided valuable resources for this project.

This notebook was created by Meara Cox using their code and examples as a foundation, with additional explanations and adaptations to support the goals of this project.

## Install Packages

In [None]:
# Install every package that is loaded later in this notebook.
install.packages(c("lubridate", "ggplot2", "dplyr", "readr", "tidytext",
                   "stringr", "tidyr", "scales", "purrr", "broom", "jsonlite"))

## Case study: Comparing Twitter Archives

Online text, especially from platforms like Twitter, receives a lot of attention in text analysis. Many sentiment lexicons used in this book were specifically designed for or validated on tweets, reflecting the unique language and style found there. Both authors of this book are active Twitter users, so this case study explores a comparison between the entire Twitter archives of [Julia](https://x.com/juliasilge) and [David](https://x.com/drob).


### Getting the Data and Distribution

Individuals can download their own Twitter archives by following [instructions provided on Twitter’s website](https://help.x.com/en/managing-your-account/how-to-download-your-x-archive). After downloading their archives, the next step is to load the data and use the **lubridate** package to convert the string-formatted timestamps into proper date-time objects. This conversion enables easier analysis of tweeting patterns over time. Initially, we can explore overall tweeting behavior by summarizing or visualizing the frequency and timing of tweets.
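
As a minimal sketch of that conversion, assuming a timestamp string in a `YYYY-MM-DD HH:MM:SS` layout (the sample value below is made up, not taken from either archive), `ymd_hms()` turns the string into a date-time object whose components can then be extracted:

```r
library(lubridate)

# Hypothetical timestamp string; real archive values may also carry a UTC offset.
raw_timestamp <- "2016-07-01 14:23:05"

# Parse the string into a POSIXct date-time (UTC unless a time zone is given).
parsed <- ymd_hms(raw_timestamp)

class(parsed)   # a date-time class rather than character
year(parsed)    # 2016
hour(parsed)    # 14
```

Once the column is a true date-time, it can be binned, filtered, and plotted on a continuous time axis.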


In [None]:
library(lubridate)
library(ggplot2)
library(dplyr)
library(readr)

tweets_julia <- read_csv("data/tweets_julia.csv", show_col_types = FALSE)
tweets_dave <- read_csv("data/tweets_dave.csv", show_col_types = FALSE)
tweets <- bind_rows(tweets_julia %>% 
                      mutate(person = "Julia"),
                    tweets_dave %>% 
                      mutate(person = "David")) %>%
  mutate(timestamp = ymd_hms(timestamp))

ggplot(tweets, aes(x = timestamp, fill = person)) +
  geom_histogram(position = "identity", bins = 20, show.legend = FALSE) +
  facet_wrap(~person, ncol = 1)

David and Julia currently tweet at roughly the same rate and joined Twitter about a year apart, but there was a period of about five years when David was inactive on the platform while Julia remained active. As a result, Julia has accumulated approximately four times as many tweets as David overall.

### Word Frequencies

We’ll start by using `unnest_tokens()` to convert our tweets into a tidy format with one token per row, but because tweets often include unique structures like hashtags, mentions, links, and special characters, we’ll apply some preprocessing first. We’ll exclude retweets to focus only on original content, then clean the text by removing URLs and unwanted characters such as ampersands.

When tokenizing, we’ll use a regular expression within `unnest_tokens()` to preserve mentions (starting with `@`) and hashtags (starting with `#`), which are meaningful in Twitter data. Because hashtags and mentions are retained, a simple `anti_join()` with a standard stop word list isn’t enough. Instead, we’ll use `filter()` to drop any token found in the stop word list, with or without its apostrophe, and `str_detect()` from the stringr package to keep only tokens that contain at least one letter, so usernames and hashtags stay intact for analysis.
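
To see what the splitting pattern does, here is a small sketch on a single made-up tweet (the text is hypothetical and used only for illustration). It applies the same regular expression used in this notebook and prints the resulting tokens:

```r
library(dplyr)
library(tidytext)

# Same splitting pattern as in this notebook: split on anything that is not a
# letter, digit, underscore, #, @, or an apostrophe inside a word.
unnest_reg <- "([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"

# A made-up tweet, for illustration only.
toy <- tibble(text = "Loving #rstats with @drob, it's great!")

toy_tokens <- toy %>%
  unnest_tokens(word, text, token = "regex", pattern = unnest_reg)

print(toy_tokens$word)
```

The hashtag and the mention should come through as single tokens, which would not be the case with the default word tokenizer.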

In [None]:
library(tidytext)
library(stringr)

replace_reg <- "https://t.co/[A-Za-z\\d]+|http://[A-Za-z\\d]+|&amp;|&lt;|&gt;|RT|https"
unnest_reg <- "([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"

tidy_tweets <- tweets %>% 
  filter(!str_detect(text, "^RT")) %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = unnest_reg) %>%
  filter(!word %in% stop_words$word,
         !word %in% str_remove_all(stop_words$word, "'"),
         str_detect(word, "[a-z]"))

We can now compute word frequencies for each individual. To do this, we first group the data by person and word, counting the number of times each word appears per person. Next, we perform a `left_join()` to bring in the total word counts for each person (Julia’s total is larger since she has more tweets overall). Finally, we calculate the frequency of each word for each person by dividing the word count by the total words they used.
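
The arithmetic is easier to follow on a tiny example. The sketch below uses made-up tokens (not the real tweets) and the same count, join, divide sequence:

```r
library(dplyr)

# Made-up tokens: one row per word occurrence, as produced by unnest_tokens().
toy_tokens <- tibble(
  person = c("Julia", "Julia", "Julia", "Julia", "David", "David"),
  word   = c("utah", "utah", "data", "home", "data", "data")
)

toy_frequency <- toy_tokens %>%
  count(person, word, sort = TRUE) %>%
  # Total words per person, joined back onto the per-word counts.
  left_join(count(toy_tokens, person, name = "total"), by = "person") %>%
  mutate(freq = n / total)

print(toy_frequency)
```

Here Julia used "utah" in 2 of her 4 tokens, so its frequency is 0.5, while every one of David's tokens is "data", giving a frequency of 1.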

In [None]:
frequency <- tidy_tweets %>% 
  count(person, word, sort = TRUE) %>% 
  # Join each person's total token count; name the key explicitly.
  left_join(tidy_tweets %>% 
              count(person, name = "total"),
            by = "person") %>%
  mutate(freq = n/total)

print(frequency)

This gives us a clean and tidy data frame, but to visualize those frequencies with one person’s frequency on the x-axis and the other’s on the y-axis, we need to reshape the data. We can use tidyr’s `pivot_wider()` function to transform the data frame so that each person’s word frequencies become separate columns, ready for plotting.
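
As a small sketch of the reshaping step, using made-up frequencies rather than the real ones, `pivot_wider()` turns one row per person-word pair into one row per word:

```r
library(dplyr)
library(tidyr)

# Made-up long-format frequencies: one row per person-word pair.
toy_long <- tibble(
  person = c("Julia", "Julia", "David"),
  word   = c("data", "utah", "data"),
  freq   = c(0.25, 0.50, 1.00)
)

# One row per word, one column per person; missing pairs become NA.
toy_wide <- toy_long %>%
  pivot_wider(names_from = person, values_from = freq)

print(toy_wide)
```

A word used by only one person ends up with `NA` in the other person's column, which is why such words cannot be placed on a two-axis plot.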

In [None]:
library(tidyr)

frequency <- frequency %>% 
  select(person, word, freq) %>% 
  pivot_wider(names_from = person, values_from = freq) %>%
  arrange(Julia, David)

print(frequency)

This reshaped data is now ready for visualization. We can use `geom_jitter()` to help spread out the points, reducing the visual clumping that happens with low-frequency words. Additionally, setting `check_overlap = TRUE` on the text labels will prevent them from overlapping excessively, so only a subset of labels will appear clearly on the plot.

In [None]:
suppressPackageStartupMessages(library(scales))

ggplot(frequency, aes(Julia, David)) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  geom_abline(color = "red")

Words that fall close to the diagonal line on the plot are used at roughly equal rates by both David and Julia. In contrast, words that sit farther from the line are more heavily favored by one person over the other. The plot only shows words, hashtags, and usernames that both individuals have used at least once in their tweets.

It’s clear from this visualization—and will continue to be throughout the chapter—that David and Julia have approached Twitter in very different ways over the years. David's tweets have been almost entirely professional since he became active, whereas Julia's early Twitter use was personal, and even now, her account retains more personal content than David's. This difference in usage patterns is reflected immediately in how they use language on the platform.

### Comparing Word Usage

We’ve already visualized a comparison of raw word frequencies across the entire Twitter histories of David and Julia. Now, let’s dig deeper by identifying which words are more or less likely to originate from each person’s account using the log odds ratio. To keep the comparison meaningful, we’ll limit the analysis to tweets from 2016—a year when David was consistently active on Twitter and Julia was transitioning into her data science career.

In [None]:
tidy_tweets <- tidy_tweets %>%
  filter(timestamp >= as.Date("2016-01-01"),
         timestamp < as.Date("2017-01-01"))

Next, we’ll use `str_detect()` to filter out Twitter usernames from our word column. Otherwise, the analysis would be dominated by names of people David or Julia know, which doesn’t tell us much about broader word usage differences.

Once usernames are removed, we count how many times each person uses each word and keep only words that appear at least 10 times across both accounts, to avoid results driven by very rare terms. After reshaping the data with `pivot_wider()`, we calculate the **log odds ratio** for each word as:

$$
\text{log odds ratio} = \ln \left( \frac{\left[ \frac{n + 1}{\text{total} + 1} \right]_{\text{David}}}{\left[ \frac{n + 1}{\text{total} + 1} \right]_{\text{Julia}}} \right)
$$

Here:

* $n$ is the count of the word for each person,
* $\text{total}$ is the total word count for each person,
* Adding 1 to both the count and the total keeps every ratio finite, so a word that one person never used does not produce a division by zero or a logarithm of zero.

This gives a clearer picture of which words are disproportionately associated with one person versus the other.
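
A worked example may help with the formula. The counts below are hypothetical, chosen only to make the arithmetic easy to check:

```r
# Hypothetical counts for a single word (not taken from the tweet data).
n_david <- 9
total_david <- 999
n_julia <- 0
total_julia <- 1999

# Smoothed rates: the +1 keeps Julia's rate above zero.
rate_david <- (n_david + 1) / (total_david + 1)   # 10 / 1000 = 0.01
rate_julia <- (n_julia + 1) / (total_julia + 1)   # 1 / 2000  = 0.0005

# Positive values lean toward David, negative values toward Julia.
log_odds_ratio <- log(rate_david / rate_julia)    # log(20), roughly 3
print(log_odds_ratio)
```

Without the added 1, Julia's rate would be zero and the ratio would be infinite.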


In [None]:
word_ratios <- tidy_tweets %>%
  filter(!str_detect(word, "^@")) %>%
  count(word, person) %>%
  group_by(word) %>%
  filter(sum(n) >= 10) %>%
  ungroup() %>%
  pivot_wider(names_from = person, values_from = n, values_fill = 0) %>%
  mutate(across(where(is.numeric), ~ (.x + 1) / (sum(.x) + 1))) %>%
  mutate(logratio = log(David / Julia)) %>%
  arrange(desc(logratio))

Here are some words that have been about equally likely to come from David or Julia’s account during 2016.

In [None]:
word_ratios %>% 
  arrange(abs(logratio)) %>%
  print()

To find the words most strongly associated with each account, we select the top 15 words with the highest positive log odds ratio (most likely from David’s account) and the top 15 words with the lowest (most negative) log odds ratio (most likely from Julia’s account).

Plotting these words will clearly show which terms are most distinctive for each person’s tweets, highlighting the differences in their language use during 2016.

In [None]:
word_ratios %>%
  group_by(logratio < 0) %>%
  slice_max(abs(logratio), n = 15) %>% 
  ungroup() %>%
  mutate(word = reorder(word, logratio)) %>%
  ggplot(aes(word, logratio, fill = logratio < 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  ylab("log odds ratio (David/Julia)") +
  scale_fill_discrete(name = "", labels = c("David", "Julia"))

David’s tweets tend to focus on professional topics like conferences and programming, reflecting his career interests, while Julia’s tweets highlight personal themes like family and regional topics such as Utah and Census data.

### Changes in Word Use

The previous section focused on overall word usage, but now we want to explore a different question: which words have changed in frequency the most over time in the Twitter feeds? In other words, which words have they tweeted about more or less as time has gone on?

To analyze this, we first create a new time variable in the data frame that groups each tweet into a specific time unit. Using lubridate’s `floor_date()` function, we round timestamps down to the start of each month, which works well for this year of tweets.

Once these time bins are set, we count how often each person used each word within each month. Then, we add columns showing the total words tweeted by each person per month and the total uses of each word by each person overall. Finally, we filter the data to keep only words that a person used more than 30 times, ensuring we focus on words with sufficient usage for meaningful analysis.
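
As a brief sketch of the binning step, using made-up timestamps, `floor_date()` maps every timestamp in a month to that month's first instant:

```r
library(lubridate)

# Made-up timestamps from two different months.
stamps <- ymd_hms(c("2016-03-15 09:30:00",
                    "2016-03-31 23:59:59",
                    "2016-04-01 00:00:00"))

# Round each timestamp down to the start of its month.
month_bins <- floor_date(stamps, unit = "1 month")
print(month_bins)
```

The first two timestamps share the March bin and the third starts the April bin, so counting by the floored value gives monthly totals.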


In [None]:
words_by_time <- tidy_tweets %>%
  filter(!str_detect(word, "^@")) %>%
  mutate(time_floor = floor_date(timestamp, unit = "1 month")) %>%
  count(time_floor, person, word) %>%
  group_by(person, time_floor) %>%
  mutate(time_total = sum(n)) %>%
  group_by(person, word) %>%
  mutate(word_total = sum(n)) %>%
  ungroup() %>%
  rename(count = n) %>%
  filter(word_total > 30)

print(words_by_time)

Each row in this data frame represents a single person’s use of a specific word within a particular time bin. The `count` column shows how many times that person used the word in that time period, `time_total` gives the total number of words the person tweeted during that time bin, and `word_total` indicates the total number of times that person used that word throughout the entire year. This dataset is ready to be used for modeling purposes.

Next, we can apply `nest()` from tidyr to create a data frame where each word corresponds to a list-column containing smaller data frames—essentially mini datasets—for each word. Let’s go ahead and do that, then examine the structure of the resulting nested data frame.
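
For readers new to list-columns, here is a small sketch on made-up counts (the values and the `month` column are hypothetical) showing what `nest()` produces:

```r
library(dplyr)
library(tidyr)

# Made-up monthly counts for two person-word pairs.
toy_counts <- tibble(
  person = c("David", "David", "Julia", "Julia", "Julia"),
  word   = c("ggplot2", "ggplot2", "#rstats", "#rstats", "#rstats"),
  month  = c(1, 2, 1, 2, 3),
  count  = c(5, 3, 8, 6, 2)
)

# Every column except person and word is packed into the `data` list-column.
toy_nested <- toy_counts %>%
  nest(data = c(-word, -person))

print(toy_nested)
```

Each row of the result holds one person-word pair, and its `data` entry is a small data frame with that pair's monthly rows.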

In [None]:
nested_data <- words_by_time %>%
  nest(data = c(-word, -person)) 

print(nested_data)

This data frame contains one row for each unique person-word pair, and the `data` column is a list-column holding smaller data frames for each of those pairs. We’ll use `map()` from the purrr package (Henry and Wickham 2018) to run our modeling function on each of these nested data frames.

Since the data involves counts, we’ll fit a generalized linear model using `glm()` with a binomial family. Essentially, this model answers a question like: “Was this particular word mentioned during this time bin (yes or no)? How does the likelihood of mentioning this word change over time?”
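
To make the model concrete, here is a minimal sketch on made-up monthly counts for a single word. One assumption to flag: `glm()` documents its two-column binomial response as `cbind(successes, failures)`, so this sketch passes `time_total - count` as the second column, whereas the cell in this notebook passes `time_total` itself. The two should give similar slopes when `count` is small relative to `time_total`.

```r
# Made-up data: uses of one word out of all words tweeted in each month.
toy <- data.frame(
  month      = 1:6,
  count      = c(2, 4, 6, 9, 12, 15),
  time_total = rep(500, 6)
)

# Two-column response: successes (uses of the word) and failures (other words).
toy_model <- glm(cbind(count, time_total - count) ~ month,
                 data = toy, family = "binomial")

# A positive slope means the word's share of all words grew over time.
coef(toy_model)["month"]
```

The sign of the `month` coefficient is what the later steps rely on: positive for words that became more common, negative for words that faded.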

In [None]:
suppressPackageStartupMessages(library(purrr))

nested_models <- nested_data %>%
  mutate(models = map(data, ~ glm(cbind(count, time_total) ~ time_floor, ., 
                                  family = "binomial")))

print(nested_models)

Now we see a new column holding the modeling results—this is another list-column, each entry containing a `glm` model object. The next step is to use `map()` together with `tidy()` from the broom package to extract the slope coefficients from each model.

Since we’re testing many slopes at once, some will look significant by chance alone, so we’ll adjust the p-values for multiple comparisons with `p.adjust()`, which applies the Holm correction by default. This helps control for false positives.
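
As a quick illustration of the adjustment, using three hypothetical p-values rather than the model output:

```r
# Hypothetical raw p-values from three separate tests.
p_raw <- c(0.01, 0.02, 0.03)

# p.adjust() uses the Holm method unless another `method` is given.
p_adj <- p.adjust(p_raw)
print(p_adj)   # 0.03 0.04 0.04
```

Holm multiplies the smallest p-value by the number of tests, the next by one fewer, and so on, then makes sure the adjusted values never decrease. Every adjusted value is at least as large as the raw one, so fewer results pass a fixed threshold.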

In [None]:
library(broom)

slopes <- nested_models %>%
  mutate(models = map(models, tidy)) %>%
  unnest(cols = c(models)) %>%
  filter(term == "time_floor") %>%
  mutate(adjusted.p.value = p.adjust(p.value))

Next, we’ll identify the most notable slopes—that is, the words whose usage frequencies have changed over time with moderate statistical significance in our tweets. By filtering the results for adjusted p-values below a chosen threshold (e.g., 0.05), we can pinpoint these words and understand which topics or terms have become more or less common during the period analyzed.

In [None]:
top_slopes <- slopes %>% 
  filter(adjusted.p.value < 0.05)

print(top_slopes)

To visualize these findings, we can create plots showing how the usage of these words has changed over the course of the year for both David and Julia. This will help illustrate trends in word frequency over time, highlighting the differences and similarities in their tweeting patterns throughout the year.

In [None]:
words_by_time %>%
  inner_join(top_slopes, by = c("word", "person")) %>%
  filter(person == "David") %>%
  ggplot(aes(time_floor, count/time_total, color = word)) +
  geom_line(linewidth = 1.3) +
  labs(x = NULL, y = "Word frequency")

This shows that David tweeted extensively about the UseR conference during the event, but his mentions dropped off quickly afterward. Additionally, his tweets about Stack Overflow increased toward the end of the year, while references to ggplot2 declined over time.

Now let’s plot words that have changed frequency in Julia’s tweets.


In [None]:
words_by_time %>%
  inner_join(top_slopes, by = c("word", "person")) %>%
  filter(person == "Julia") %>%
  ggplot(aes(time_floor, count/time_total, color = word)) +
  geom_line(linewidth = 1.3) +
  labs(x = NULL, y = "Word frequency")

All of Julia’s significant slopes are negative, indicating she hasn’t increased her use of any particular words over the year. Instead, she’s used a wider variety of words, with the ones shown in this plot appearing more frequently earlier in the year. For example, words related to sharing new blog posts—like the hashtag #rstats and the word “post”—have decreased in usage over time.

### Favorites and Retweets

Another key aspect of tweets is how often they get favorited or retweeted. To analyze which words are associated with higher engagement on Julia’s and David’s tweets, they created a separate dataset that includes favorites and retweets information. Since Twitter archives don’t include this data, they collected roughly 3,200 tweets for each of them directly from the Twitter API, covering about the last 18 months of activity—a period during which both of them increased their tweeting frequency and follower counts.


In [None]:
tweets_julia <- read_csv("data/juliasilge_tweets.csv", show_col_types = FALSE)
tweets_dave <- read_csv("data/drob_tweets.csv", show_col_types = FALSE)
tweets <- bind_rows(tweets_julia %>% 
                      mutate(person = "Julia"),
                    tweets_dave %>% 
                      mutate(person = "David")) %>%
  mutate(created_at = ymd_hms(created_at))

Now that we have this smaller, more recent dataset, we’ll again use `unnest_tokens()` to convert the tweets into a tidy format. We’ll filter out all retweets and replies to focus solely on the original tweets posted directly by David and Julia.

In [None]:
tidy_tweets <- tweets %>% 
  filter(!str_detect(text, "^(RT|@)")) %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = unnest_reg) %>%
  filter(!word %in% stop_words$word,
         !word %in% str_remove_all(stop_words$word, "'"))

print(tidy_tweets)

First, let’s examine how many times each person’s tweets were retweeted. We’ll calculate the total retweet count separately for David and Julia.

In [None]:
totals <- tidy_tweets %>% 
  group_by(person, id) %>% 
  summarise(rts = first(retweets)) %>% 
  group_by(person) %>% 
  summarise(total_rts = sum(rts))

print(totals)

Next, we find the typical retweet count for tweets containing each word. We group by tweet, word, and person so that each word is counted once per tweet, carrying along that tweet’s retweet count. Then, we group again by person and word to calculate the median retweet count for each word-person pair and the number of tweets in which that person used the word, saving the latter in a column called `uses`. After that, we join this summary with the total retweet counts per person and drop words whose median retweet count is zero. Finally, when displaying the results, we keep only words that were used at least five times.
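
The two-stage grouping can be sketched on a handful of made-up rows (the ids, words, and retweet counts below are hypothetical):

```r
library(dplyr)

# Made-up tidy tweets: one row per word per tweet, with the tweet's retweet count.
toy_tidy <- tibble(
  id       = c(1, 1, 2, 2, 3),
  person   = "Julia",
  word     = c("tidytext", "release", "tidytext", "blog", "tidytext"),
  retweets = c(10, 10, 4, 4, 7)
)

toy_word_rts <- toy_tidy %>%
  # One row per tweet-word pair, so a word repeated in a tweet counts once.
  group_by(id, word, person) %>%
  summarise(rts = first(retweets), .groups = "drop") %>%
  # Median retweets and number of tweets using each word.
  group_by(person, word) %>%
  summarise(retweets = median(rts), uses = n(), .groups = "drop")

print(toy_word_rts)
```

In this toy data "tidytext" appears in three tweets with 10, 4, and 7 retweets, so its median is 7 and its `uses` value is 3.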

In [None]:
word_by_rts <- tidy_tweets %>% 
  group_by(id, word, person) %>% 
  summarise(rts = first(retweets)) %>% 
  group_by(person, word) %>% 
  summarise(retweets = median(rts), uses = n()) %>%
  left_join(totals, by = "person") %>%
  filter(retweets != 0) %>%
  ungroup()

word_by_rts %>% 
  filter(uses >= 5) %>%
  arrange(desc(retweets)) %>%
  print()

At the top of this sorted data frame, the most retweeted words for both Julia and David relate to packages they contribute to, such as *gganimate* and *tidytext*. Next, we can create a plot showing the words with the highest median retweet counts for each account, highlighting which terms tend to garner more engagement on their tweets.

In [None]:
word_by_rts %>%
  filter(uses >= 5) %>%
  group_by(person) %>%
  slice_max(retweets, n = 10) %>% 
  arrange(retweets) %>%
  ungroup() %>%
  mutate(word = factor(word, unique(word))) %>%
  ggplot(aes(word, retweets, fill = person)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ person, scales = "free", ncol = 2) +
  coord_flip() +
  labs(x = NULL, 
       y = "Median # of retweets for tweets containing each word")

We notice many words related to R packages, including *tidytext*—the very package this book focuses on! The “0” values for David come from tweets mentioning package version numbers, like “broom 0.4.0” and similar.

Using a similar approach, we can analyze which words correspond to higher numbers of favorites. It will be interesting to see whether those words differ from the ones that lead to more retweets.

In [None]:
totals <- tidy_tweets %>% 
  group_by(person, id) %>% 
  summarise(favs = first(favorites)) %>% 
  group_by(person) %>% 
  summarise(total_favs = sum(favs))

word_by_favs <- tidy_tweets %>% 
  group_by(id, word, person) %>% 
  summarise(favs = first(favorites)) %>% 
  group_by(person, word) %>% 
  summarise(favorites = median(favs), uses = n()) %>%
  left_join(totals, by = "person") %>%
  filter(favorites != 0) %>%
  ungroup()

We have built the data frames we need. Now let’s make our visualization.

In [None]:
word_by_favs %>%
  filter(uses >= 5) %>%
  group_by(person) %>%
  slice_max(favorites, n = 10) %>% 
  arrange(favorites) %>%
  ungroup() %>%
  mutate(word = factor(word, unique(word))) %>%
  ggplot(aes(word, favorites, fill = person)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ person, scales = "free", ncol = 2) +
  coord_flip() +
  labs(x = NULL, 
       y = "Median # of favorites for tweets containing each word")

We notice some small differences between David and Julia, especially toward the lower end of the top 10 lists, but overall the words are mostly the same as those linked to retweets. Generally, the words that attract retweets also tend to attract favorites. A standout word for Julia in both charts is the hashtag for the NASA Datanauts program she’s been involved with.

## Case study: Mining NASA Metadata

[NASA](https://www.nasa.gov) maintains over 32,000 datasets that span topics ranging from Earth science to aerospace engineering to NASA's internal operations. To better understand the relationships between these datasets, we can analyze their metadata.

**What is metadata?**
Metadata refers to information that describes other data. In the context of NASA's datasets, this includes details like each dataset's title, description, responsible NASA organization, and human-assigned keywords. The metadata helps users understand the content and purpose of a dataset but does *not* include the dataset's actual measurements or results.

NASA strongly emphasizes open science, requiring [publicly accessible research outputs](https://www.nasa.gov/news-release/nasa-unveils-new-public-web-portal-for-research-results/). As part of this commitment, the metadata for all NASA datasets is available [online in JSON format](https://data.nasa.gov/data.json).

Below, we’ll treat NASA’s metadata as a text dataset and apply tidy text analysis methods to it. Using tools like word co-occurrence analysis, correlations, tf-idf, and topic modeling, we’ll explore questions such as:

* Can we identify relationships between datasets?
* Are there clusters of datasets with similar themes?
* How do different metadata fields (like titles, descriptions, and keywords) reveal patterns across NASA’s data catalog?

This approach demonstrates how text mining techniques can be applied to real-world, domain-specific metadata—whether in the space industry or any other field dealing with large collections of text. Let’s dive into the NASA dataset and begin exploring.

### How Data is Organized at NASA

We’ll start by downloading the JSON file and examining the field names contained in the metadata.

In [None]:
library(jsonlite)
metadata <- fromJSON("https://data.nasa.gov/data.json")
names(metadata$dataset)

From this, we can pull details ranging from the publishing organization for each dataset to the type of license under which they’re released.

In [None]:
class(metadata$dataset$title)

class(metadata$dataset$description)

class(metadata$dataset$keyword)

The title and description fields are saved as character vectors, while the keywords are stored as a list of character vectors.

#### Wrangling and Tidying the Data

We’ll create separate tidy data frames for the title, description, and keywords, making sure to keep the dataset IDs with each so we can link them together later if needed.

In [None]:
library(dplyr)

nasa_title <- tibble(id = metadata$dataset$`_id`$`$oid`, 
                     title = metadata$dataset$title)
print(nasa_title)

These are just a few sample dataset titles that we’ll be working with. You’ll notice the NASA-assigned IDs are included, and that some datasets share the same title.

In [None]:
nasa_desc <- tibble(id = metadata$dataset$`_id`$`$oid`, 
                    desc = metadata$dataset$description)

nasa_desc %>% 
  select(desc) %>% 
  slice_sample(n = 5) %>%
  print()

The output above shows five randomly sampled description fields from the metadata.

Next, we’ll create a tidy data frame for the keywords. Since these are stored in a list-column, we’ll use tidyr’s `unnest()` function to expand them.
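
As a small sketch of what `unnest()` does to a list-column, using made-up ids and keywords rather than the real metadata:

```r
library(dplyr)
library(tidyr)

# Made-up records: each dataset id has a character vector of keywords.
toy_meta <- tibble(
  id      = c("a1", "b2"),
  keyword = list(c("EARTH SCIENCE", "OCEANS"), "MARS")
)

# unnest() gives each keyword its own row, repeating the id as needed.
toy_keywords <- toy_meta %>%
  unnest(keyword)

print(toy_keywords)
```

The two-keyword dataset becomes two rows and the one-keyword dataset stays as one row, giving a tidy table with one keyword per row.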

In [None]:
library(tidyr)

nasa_keyword <- tibble(id = metadata$dataset$`_id`$`$oid`, 
                       keyword = metadata$dataset$keyword) %>%
  unnest(keyword)

print(nasa_keyword)