## Setting Up



### Install necessary packages



We begin by installing packages required for analysis and visualization.



In [1]:
install.packages("tidyverse")
install.packages("tidytext")
install.packages("topicmodels")
install.packages("ggplot2")
install.packages("RColorBrewer")

### Load libraries



Before we can use the libraries, we have to load them into our
environment.



In [1]:
library(tidyverse)
library(tidytext)
library(ggplot2)
library(topicmodels)
library(RColorBrewer)

### Load dataset



The data we will use is available for download as a CSV file from a
simple API created using Datasette. I have crafted a SQL query to get
just the data that we need. This query will return a table containing
each bill summary's unique ID in the database and its text.



In [1]:
data <- read.csv("https://llc.herokuapp.com/summaries.csv?sql=select%0D%0A++rowid%2C%0D%0A++content_text%0D%0Afrom%0D%0A++files%0D%0Awhere%0D%0A++%22path%22+like+%2274_2_s%25%22%0D%0Aorder+by%0D%0A++path&_size=max")
head(data)

## A Little Exploration



We start by getting just the text of the bills.



In [1]:
text <- data %>% select(content_text)

### Tokenizing



The goal is to find a reasonable set of words to feed into the LDA
algorithm for topic modeling. We start by tokenizing the text of the
bill summaries.



In [1]:
tokens <- text %>% unnest_tokens(word, content_text)
data("stop_words")
tokens <- tokens %>% anti_join(stop_words)
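
To make these two steps concrete, here is a minimal sketch on a made-up two-row data frame. The `toy` object and its sentences are illustrative assumptions, not rows from the dataset. `unnest_tokens()` lowercases the text, strips punctuation, and emits one row per word; `anti_join()` then drops every row whose `word` appears in `stop_words`.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data standing in for the bill summaries
toy <- tibble(
  rowid = 1:2,
  content_text = c("An Act to amend the Tariff Act.",
                   "Authorizes the Secretary of War to convey land.")
)

data("stop_words")

toy_tokens <- toy %>%
  unnest_tokens(word, content_text) %>%  # one lowercase word per row
  anti_join(stop_words, by = "word")     # drop common English words

toy_tokens
```

Passing `by = "word"` explicitly is optional, but it silences the message that `anti_join()` prints when it has to guess the join column.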

And now we can take a quick look at the tokens that appear more than 100
times in the bill summaries.



In [1]:
tokens %>%
  count(word, sort = TRUE) %>% 
  filter(n > 100) %>% 
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(n, word)) +
  geom_col() +
  labs(y = NULL)
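
The same frequency chart is useful after every filtering step, so it may be convenient to wrap the pipeline in a function. The sketch below is one way to do that; the name `plot_frequent_tokens`, its `min_n` argument, and the toy tokens are all hypothetical and not part of the original notebook.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: bar chart of words that occur more than `min_n` times
plot_frequent_tokens <- function(tokens, min_n = 100) {
  tokens %>%
    count(word, sort = TRUE) %>%
    filter(n > min_n) %>%
    mutate(word = reorder(word, n)) %>%
    ggplot(aes(n, word)) +
    geom_col() +
    labs(y = NULL)
}

# Made-up tokens so the sketch runs on its own
toy_tokens <- tibble(
  word = rep(c("land", "secretary", "tariff"), times = c(5, 3, 1))
)

p <- plot_frequent_tokens(toy_tokens, min_n = 2)
p
```

With the real data, the call would be `plot_frequent_tokens(tokens)`, which uses the same threshold of 100 as the pipeline above.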

There's a lot here that won't contribute to meaningful topic modeling,
specifically numbers and names of months. Let's get rid of those.



### Removing numbers and months



In [1]:
tokens <- tokens %>%
  filter(!grepl('[0-9]', word))

month_tokens <- tibble(month.name) %>% 
  unnest_tokens(word, month.name)

tokens <- tokens %>% anti_join(month_tokens)

### Token frequency without numbers and months



In [1]:
tokens %>%
  count(word, sort = TRUE) %>% 
  filter(n > 100) %>% 
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(n, word)) +
  geom_col() +
  labs(y = NULL)

This is better, but there's more we can do. First, let's remove some more
words that might clutter our model.



### Additional filters



In [1]:
bill_types <- read.csv("https://llc.herokuapp.com/summaries.csv?sql=select+distinct+bill_type+from+actions&_size=max")
sponsors <- read.csv("https://llc.herokuapp.com/summaries.csv?sql=select+distinct+sponsor+from+actions+where+sponsor+is+not+null+and+sponsor+%21%3D+%22%22&_size=max")
actions <- read.csv("https://llc.herokuapp.com/summaries.csv?sql=select+distinct+action+from+actions+where+action+is+not+null+and+action+%21%3D+%22%22+and+action+%21%3D+%22N%2FA%22&_size=max")

### More tokens



In [1]:
bill_type_tokens <- bill_types %>% 
  unnest_tokens(word, bill_type)
sponsor_tokens <- sponsors %>% 
  unnest_tokens(word, sponsor)
action_tokens <- actions %>% 
  unnest_tokens(word, action)

### Removing bill type and sponsor tokens



In [1]:
tokens <- tokens %>% 
  anti_join(bill_type_tokens) %>%
  anti_join(sponsor_tokens)

Here is another simple visualization.



### Token frequency without bill types and sponsors



In [1]:
tokens %>%
  count(word, sort = TRUE) %>% 
  filter(n > 100) %>% 
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(n, word)) +
  geom_col() +
  labs(y = NULL)

Not much has changed here, but words like "approved", "reported", and
"referred" are terms for bill actions that won't contribute meaningfully
to the topic modeling algorithm. We'll remove them next.



### Removing action tokens



In [1]:
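# Action words worth keeping: no digits, no "-ed" or "-ly" endings, no stop words
distinct_action_tokens <- action_tokens %>%
  distinct(word) %>%
  filter(!grepl('[0-9]', word)) %>%
  filter(!grepl('ed$|ly$', word)) %>%
  anti_join(stop_words)

# Everything else in the action vocabulary goes on the exclusion list
action_tokens_to_exclude <- action_tokens %>%
  anti_join(distinct_action_tokens)

# Drop the excluded action words from the bill summary tokens
tokens <- tokens %>%
  anti_join(action_tokens_to_exclude)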
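
The double `anti_join()` is easy to misread: the first table holds the action words that pass every filter, and the second holds everything that did not. A small sketch with made-up action strings shows which words land on the exclusion list. The `toy_*` names and the strings are assumptions for illustration only.

```r
library(dplyr)
library(tidytext)

data("stop_words")

# Hypothetical action strings, not taken from the database
toy_actions <- tibble(
  action = c("Referred to Committee on Finance",
             "Reported favorably",
             "Approved 1935")
)

toy_action_tokens <- toy_actions %>%
  unnest_tokens(word, action)

# Words that pass every filter are the ones that stay in the corpus
toy_keep <- toy_action_tokens %>%
  distinct(word) %>%
  filter(!grepl("[0-9]", word)) %>%
  filter(!grepl("ed$|ly$", word)) %>%
  anti_join(stop_words, by = "word")

# Everything else (numbers, "-ed"/"-ly" words, stop words) is excluded
toy_exclude <- toy_action_tokens %>%
  anti_join(toy_keep, by = "word") %>%
  distinct(word)

toy_exclude
```

Note that the `ed$|ly$` pattern is a blunt instrument: it also catches any other word that happens to end in "ed" or "ly", so it is worth inspecting the exclusion list before applying it.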

### Token frequency without bill types, sponsors, months, and action tokens to exclude

Another quick bar chart.



In [1]:
tokens %>%
  count(word, sort = TRUE) %>% 
  filter(n > 100) %>% 
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(n, word)) +
  geom_col() +
  labs(y = NULL)

This is enough to get going on topic modeling.



## Topic Modeling



We have to create a document term matrix for topic modeling. Luckily,
`tidytext` makes that very easy to do.

First, we get our tokens into the right shape for consumption by the
`cast_dtm` function.



In [1]:
tokens_for_dtm <- data %>% 
  unnest_tokens(word, content_text) %>% 
  filter(!grepl('[0-9]', word)) %>% 
  anti_join(stop_words) %>% 
  anti_join(month_tokens) %>% 
  anti_join(bill_type_tokens) %>%
  anti_join(sponsor_tokens) %>% 
  anti_join(action_tokens_to_exclude) %>% 
  count(rowid, word) %>% 
  mutate(document = rowid, term = word, count = n) %>% 
  select(document, term, count)

Next, we create our document term matrix.



In [1]:
senate_dtm <- tokens_for_dtm %>% 
  cast_dtm(document, term, count)
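
For intuition about what `cast_dtm()` produces, here is a minimal sketch on a hand-written count table. The `toy_counts` values are invented. Each distinct `document` becomes a row, each distinct `term` becomes a column, and `count` fills the cells.

```r
library(dplyr)
library(tidytext)
library(topicmodels)  # pulls in the DocumentTermMatrix machinery used by LDA()

# Hypothetical counts: one row per document-term pair
toy_counts <- tibble(
  document = c(1, 1, 2, 2, 3),
  term     = c("tariff", "land", "land", "war", "tariff"),
  count    = c(2, 1, 3, 1, 1)
)

toy_dtm <- toy_counts %>%
  cast_dtm(document, term, count)

dim(toy_dtm)        # number of documents, number of distinct terms
as.matrix(toy_dtm)  # dense view; only sensible for tiny matrices
```

The matrix is stored sparsely, so pairs that never occur (such as "war" in document 1) take no space until it is converted with `as.matrix()`.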

And now we can create our Latent Dirichlet Allocation object. We set
k = 12 topics, which is an arbitrary choice.



In [1]:
senate_lda <- LDA(senate_dtm, k = 12, control = list(seed = 1234))
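
Because the number of topics is arbitrary, it can help to compare a few candidate values. One common heuristic is perplexity on held-out documents, where lower is better. The sketch below is not part of the original analysis: it uses the `AssociatedPress` matrix bundled with `topicmodels` as a stand-in for `senate_dtm`, and the split sizes and candidate values are arbitrary assumptions chosen to keep the run short.

```r
library(topicmodels)

# AssociatedPress ships with topicmodels; it stands in for senate_dtm here
data("AssociatedPress", package = "topicmodels")
train    <- AssociatedPress[1:80, ]
held_out <- AssociatedPress[81:100, ]

candidate_k <- c(2, 4, 8)  # arbitrary values chosen for illustration

# Fit one model per candidate and score it on the held-out documents
perplexities <- sapply(candidate_k, function(k) {
  fit <- LDA(train, k = k, control = list(seed = 1234))
  perplexity(fit, newdata = held_out)
})

data.frame(k = candidate_k, perplexity = perplexities)
```

Perplexity is only a rough guide. A model with a slightly worse score can still produce topics that are easier to interpret, so it is worth looking at the top terms as well.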

We can turn the topics generated by the LDA algorithm into tidy data for
exploration.



In [1]:
senate_topics <- tidy(senate_lda, matrix = "beta")

And we can pull out the top ten terms for each topic, ranked by beta, the
[per-topic-per-word
probabilities](https://www.tidytextmining.com/topicmodeling.html#word-topic-probabilities).



In [1]:
senate_top_terms <- senate_topics %>% 
  group_by(topic) %>% 
  slice_max(beta, n = 10) %>% 
  ungroup() %>% 
  arrange(topic, -beta)
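
As a sanity check on what beta means, the sketch below fits a tiny model and confirms that each topic's beta values sum to 1 across the vocabulary, as a probability distribution should. It uses the `AssociatedPress` matrix bundled with `topicmodels` rather than `senate_dtm`, and the subset size, `k = 2`, and `toy_*` names are assumptions made for illustration.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Small stand-in corpus so the sketch runs without the Senate data
data("AssociatedPress", package = "topicmodels")
toy_lda <- LDA(AssociatedPress[1:50, ], k = 2, control = list(seed = 1234))

# One row per topic-term pair
toy_beta <- tidy(toy_lda, matrix = "beta")

# Each topic's betas should sum to (approximately) 1
beta_totals <- toy_beta %>%
  group_by(topic) %>%
  summarise(total = sum(beta))

# Top three terms per topic; with_ties = FALSE guarantees exactly three
toy_top <- toy_beta %>%
  group_by(topic) %>%
  slice_max(beta, n = 3, with_ties = FALSE) %>%
  ungroup()

beta_totals
toy_top
```

By default `slice_max()` keeps ties, so it can return more than `n` rows per group; `with_ties = FALSE` is used here only to make the row count predictable.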

Let's see what we've got.



In [1]:
senate_top_terms %>% 
  mutate(term = reorder_within(term, beta, topic)) %>% 
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free", ncol = 3) +
  scale_y_reordered()

Some high-frequency terms, like "act", "authorizes", and "public", show
up among the top terms of many topics and clutter our analysis. Let's
count how many topics each top term appears in and see what we can get
rid of.



In [1]:
top_terms_ordered <- senate_top_terms %>% arrange(desc(beta))
top_term_count <- top_terms_ordered %>% count(term) %>% arrange(desc(n))
top_term_count

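Let's remove the terms that appear among the top terms of more than 5
topics. Note that this only filters the table we plot; the fitted model
itself is unchanged.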



In [1]:
top_terms_to_remove <- top_term_count %>% filter(n > 5)
senate_top_terms <- senate_top_terms %>% anti_join(top_terms_to_remove)

And now let's take another look.



In [1]:
senate_top_terms %>% 
  mutate(term = reorder_within(term, beta, topic)) %>% 
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free", ncol = 3) +
  scale_y_reordered()

## Further analysis



There were [48 Senate committees](https://en.wikipedia.org/wiki/74th_United_States_Congress#Committees) during the 74th Congress, so it could be interesting to change k to 48 in our call to `LDA()`.  However, that's pretty computationally intensive and would be hard to run in a free cloud computing environment like Google Colab.  It could also be interesting to visualize `LDA()` with k = 48 in [LDAvis](https://github.com/cpsievert/LDAvis).

