# Bag of Words in R with `tidytext`

If you're using R, you almost certainly know about the [Tidyverse](https://www.tidyverse.org/) family of packages.  (If not, go fix that; they're easily the "killer app" for R.)  `tidytext` is a package built on the same tidy data principles: it transforms texts into a "one row per token per document" format.

If you're coming from other libraries or other NLP tools, this will be a very different way of thinking about your data.  Rather than performing operations on _documents,_ you'll mostly be performing operations on _words_.  There are pros and cons to this approach, and on balance they roughly cancel out.  The biggest downside is that if you're using a train-test split (like we are here), you'll need to do a little extra work to make sure that certain transformations are applied uniformly to all your different splits.

`tidytext` is great for exploratory analyses and anything where you're directly analysing _words._  The "one row per token per document" format is a very nice format for this kind of work.  Personally, I find it a little awkward for predictive modeling (at least compared to `text2vec`, which we'll see in the next notebook), but that's probably mostly a matter of familiarity.

We'll use `tidytext` as part of a predictive modeling project.  We have a bunch of Amazon reviews, and we want to predict the number of stars from the review text.  We have a training split and a testing split for our data; we'll train a model using the training data, and evaluate the model on the testing data.  We'll use a Bernoulli Naive Bayes model to do the predictions--this is a super simple form of Naive Bayes that assumes all your features are either 1 or 0.  (This will let us take some shortcuts a bit later on.)

In [1]:
# Dependencies---uncomment and run to install them
# install.packages("dplyr")      # if you don't know what dplyr is I can't help you
# install.packages("magrittr")   # pipes!
# install.packages("naivebayes") # Naive Bayes implementation
# install.packages("SnowballC")  # stemming
# install.packages("tidytext")   # text processing using Tidy data principles
# install.packages("yardstick")  # model metrics; part of the tidymodels suite

In [2]:
library(dplyr)
library(magrittr)
library(naivebayes)
library(SnowballC)
library(tidytext)
library(yardstick)


Attaching package: 'dplyr'


The following objects are masked from 'package:stats':

    filter, lag


The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union


naivebayes 0.9.7 loaded

For binary classification, the first factor level is assumed to be the event.
Use the argument `event_level = "second"` to alter this as needed.



In [3]:
# Load data
train <- read.csv("../../data/train.csv", stringsAsFactors = FALSE)
test <- read.csv("../../data/test.csv", stringsAsFactors = FALSE)
str(train)

'data.frame':	200000 obs. of  8 variables:
 $ review_id       : chr  "en_0964290" "en_0690095" "en_0311558" "en_0044972" ...
 $ product_id      : chr  "product_en_0740675" "product_en_0440378" "product_en_0399702" "product_en_0444063" ...
 $ reviewer_id     : chr  "reviewer_en_0342986" "reviewer_en_0133349" "reviewer_en_0152034" "reviewer_en_0656967" ...
 $ stars           : int  1 1 1 1 1 1 1 1 1 1 ...
 $ review_body     : chr  "Arrived broken. Manufacturer defect. Two of the legs of the base were not completely formed, so there was no wa"| __truncated__ "the cabinet dot were all detached from backing... got me" "I received my first order of this product and it was broke so I ordered it again. The second one was broke in m"| __truncated__ "This product is a piece of shit. Do not buy. Doesn't work, and then I try to call for customer support, it won'"| __truncated__ ...
 $ review_title    : chr  "I'll spend twice the amount of time boxing up the whole useless thing and send it back wit

`tidytext`'s main tool is the `unnest_tokens()` function.  It takes in a `data.frame` (or `data.frame`-compatible object like a `tibble`), the name of the output column that should store the tokens, and the name of the input column containing the text.  This function will then run some basic preprocessing (mainly lowercasing the text and tokenizing it) and spit out a `data.frame` in the "one row per token per document" format.

In [4]:
# create the bag-of-words representation in "one token per row" format.
tokenized <- unnest_tokens(
    train[, c("review_body", "review_id", "stars")], # dataframe to process
    token,      # name of the output column that stores the tokens
    review_body # name of the input column that stores the text
)

# Note that all columns other than `token` and `review_body`, above,
# are treated like document identifiers. (in this case, "review_id" 
# and "stars").  We won't actually use anything other than the
# review_id column, so the stars column is just used to demonstrate
# this behavior.
head(tokenized, 10)

,review_id,stars,token
,<chr>,<int>,<chr>
1,en_0964290,1,arrived
2,en_0964290,1,broken
3,en_0964290,1,manufacturer
4,en_0964290,1,defect
5,en_0964290,1,two
6,en_0964290,1,of
7,en_0964290,1,the
8,en_0964290,1,legs
9,en_0964290,1,of
10,en_0964290,1,the
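
To see the preprocessing more clearly, here's a minimal sketch on a tiny made-up data frame.  The `toy` data and its column names (`doc_id`, `text`) are hypothetical and exist only for this illustration; the point is just that the default tokenizer lowercases the text and strips punctuation.

```r
library(tidytext)

# Hypothetical two-document data frame, purely for illustration.
toy <- data.frame(
    doc_id = c("a", "b"),
    text = c("Arrived BROKEN. Do not buy!", "Works great, would buy again."),
    stringsAsFactors = FALSE
)

# One row per token per document; the `text` column is dropped
# and `doc_id` is carried along as the document identifier.
toy_tokens <- unnest_tokens(toy, token, text)
print(toy_tokens)
```

Each document contributes five rows here, and the tokens come out lowercased ("broken", not "BROKEN") with the punctuation gone.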


`tidytext` also gives us a dataframe--`stop_words`--containing a list of _stopwords_ that we usually want to remove from a bag-of-words analysis.  Bag-of-words is really a _meaning_-centric kind of analysis, and stopwords (like _the, to, for, a, with, in,_ etc.) don't usually contribute a lot to the "meaning" of a document.  They do contribute a lot to the _structure_ (grammar/syntax) of a document, but in a bag-of-words analysis, we don't usually care about that.

In [5]:
head(stop_words, 10)

word,lexicon
<chr>,<chr>
a,SMART
a's,SMART
able,SMART
about,SMART
above,SMART
according,SMART
accordingly,SMART
across,SMART
actually,SMART
after,SMART


There are a few different stopword lists in this table.  Everyone and their cat has their own specialized stopword list, and really, most of them are pretty functionally indistinguishable.

In [6]:
unique(stop_words$lexicon)

Let's be aggressive and remove any word that appears in any of these stopword lexicons.

In [7]:
tokenized <- filter(tokenized, !(token %in% stop_words$word))
head(tokenized, 10)

,review_id,stars,token
,<chr>,<int>,<chr>
1,en_0964290,1,arrived
2,en_0964290,1,broken
3,en_0964290,1,manufacturer
4,en_0964290,1,defect
5,en_0964290,1,legs
6,en_0964290,1,base
7,en_0964290,1,completely
8,en_0964290,1,formed
9,en_0964290,1,insert
10,en_0964290,1,casters
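
If removing every lexicon turns out to be too aggressive for your data, one option is to restrict the filter to a single lexicon.  The sketch below is an alternative, not what this notebook does: it keeps only the `snowball` lexicon and uses an `anti_join()` instead of `%in%`.  The `toy` data frame is hypothetical.

```r
library(dplyr)
library(tidytext)

# Keep only one of the stopword lexicons shipped in `stop_words`.
snowball_only <- filter(stop_words, lexicon == "snowball")

# Hypothetical tokens, purely for illustration.
toy <- data.frame(
    token = c("the", "legs", "of", "base"),
    stringsAsFactors = FALSE
)

# anti_join() keeps the rows of `toy` that have no match in the stopword list.
kept <- anti_join(toy, snowball_only, by = c("token" = "word"))
print(kept)
```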


Let's run the Porter Stemmer (a very simple stemming algorithm that is still empirically very useful--this is the same one that Gensim uses).

In [8]:
tokenized$token <- wordStem(tokenized$token)
head(tokenized, 10)

,review_id,stars,token
,<chr>,<int>,<chr>
1,en_0964290,1,arriv
2,en_0964290,1,broken
3,en_0964290,1,manufactur
4,en_0964290,1,defect
5,en_0964290,1,leg
6,en_0964290,1,base
7,en_0964290,1,complet
8,en_0964290,1,form
9,en_0964290,1,insert
10,en_0964290,1,caster
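
As a quick side-by-side, here's a small sketch that stems a handful of the words from the table above in isolation.  The `words` vector is just a hand-picked sample; `language = "porter"` is spelled out here for clarity.

```r
library(SnowballC)

# A few of the unstemmed tokens from the review shown above.
words <- c("arrived", "manufacturer", "completely", "legs", "casters")
stems <- wordStem(words, language = "porter")

# Show each word next to its stem.
print(data.frame(word = words, stem = stems))
```

Note that the stems ("arriv", "manufactur", "complet") aren't necessarily real words; all that matters is that related word forms map to the same string.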


A few other pieces of cleanup: we'll remove all non-letter characters...

In [9]:
# Remove all non-letter characters.
tokenized$token <- gsub("[^a-z]+", "", tokenized$token)
head(tokenized, 10)

,review_id,stars,token
,<chr>,<int>,<chr>
1,en_0964290,1,arriv
2,en_0964290,1,broken
3,en_0964290,1,manufactur
4,en_0964290,1,defect
5,en_0964290,1,leg
6,en_0964290,1,base
7,en_0964290,1,complet
8,en_0964290,1,form
9,en_0964290,1,insert
10,en_0964290,1,caster


...and all words with only 1 or 2 characters (usually these are just noise)...

In [10]:
# remove short tokens
tokenized <- filter(tokenized, nchar(token) > 2)
head(tokenized, 10)

,review_id,stars,token
,<chr>,<int>,<chr>
1,en_0964290,1,arriv
2,en_0964290,1,broken
3,en_0964290,1,manufactur
4,en_0964290,1,defect
5,en_0964290,1,leg
6,en_0964290,1,base
7,en_0964290,1,complet
8,en_0964290,1,form
9,en_0964290,1,insert
10,en_0964290,1,caster
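
The first ten rows happen not to change in either of the last two steps, so here's a small base-R sketch of what they do to messier tokens.  The `raw` vector is made up for illustration.

```r
# Hypothetical tokens containing digits and punctuation.
raw <- c("5star", "don't", "wifi6", "100", "ok", "leg")

# Strip everything that isn't a lowercase letter.
# Purely numeric tokens become empty strings.
cleaned <- gsub("[^a-z]+", "", raw)
print(cleaned)  # "star" "dont" "wifi" ""     "ok"   "leg"

# Keep tokens with at least 3 characters; this also drops the empty strings.
kept <- cleaned[nchar(cleaned) > 2]
print(kept)     # "star" "dont" "wifi" "leg"
```

The order matters a little: stripping characters first means the length filter also cleans up whatever empty or near-empty tokens the stripping leaves behind.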


...and remove any words that appear in >50% of our documents (these are usually "domain stopwords", or words that tend to have a really flat distribution that our model can't learn much from), and in <10 documents (these are just rare words that don't really tell us much in the first place--and there will be a _lot_ of these).  I've picked these specific thresholds kind of arbitrarily; feel free to play around with them and see how the results change.

In [11]:
# Remove terms that appear in >50% of all documents or in <10 documents.
n_docs <- n_distinct(tokenized$review_id)
kept_terms <- (
    tokenized
    # convert to one row per document+token combo; i.e. deduplicate
    # token entries within documents.
    %>% unique() 
    %>% group_by(token)
    # n = number of documents that contain the token
    %>% tally()
    # pct = fraction of all documents that contain the token
    %>% mutate(pct = n / n_docs)
    # keep only the terms that are neither too rare nor too common
    %>% filter(n >= 10 & pct <= 0.5)
)
tokenized <- filter(tokenized, token %in% kept_terms$token)
head(tokenized, 10)
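
To make the document-frequency idea concrete, here's a minimal sketch on a four-document toy corpus.  The data, the `min_docs`/`max_pct` names, and the tiny thresholds are all hypothetical; they're scaled down so the effect is visible by eye.

```r
library(dplyr)

# Hypothetical tokenized corpus: 4 documents, one row per token occurrence.
toy <- data.frame(
    review_id = c("r1", "r1", "r1", "r2", "r2", "r3", "r4"),
    token     = c("great", "great", "leg", "great", "base", "great", "leg"),
    stringsAsFactors = FALSE
)

n_docs <- n_distinct(toy$review_id)  # 4 documents

doc_freq <- (
    toy
    # one row per document+token pair, so repeats within a document count once
    %>% distinct(review_id, token)
    # number of documents containing each token
    %>% count(token, name = "n_docs_with_token")
    # fraction of all documents containing each token
    %>% mutate(pct = n_docs_with_token / n_docs)
)
print(doc_freq)

# Toy thresholds: keep tokens found in at least 2 documents
# and in at most 50% of documents.
min_docs <- 2
max_pct <- 0.5
kept <- filter(doc_freq, n_docs_with_token >= min_docs, pct <= max_pct)
print(kept$token)  # "leg"
```

Here "great" is dropped for being too common (3 of 4 documents) and "base" for being too rare (1 document), which leaves only "leg".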


Note that since we're going to use a Bernoulli Naive Bayes--which assumes our features are only 0 or 1--we haven't done any steps to calculate the number of times each token appears in each document.  We could do this with a pretty simple line of code:

```r
tokenized <- (
    tokenized
    %>% group_by(review_id, token)
    %>% tally()
)
```

...but this is actually very slow and inefficient given the size of our data.  There's probably a lot we could do to speed it up, but we don't need to do this in the first place--remember that the Bernoulli Naive Bayes model we're going to use assumes all features are either 1 or 0.  And, if we just use the `cast_sparse` function from `tidytext`--which converts this "long" format into a sparse `Matrix`--we'll get a 1/0 matrix out anyway. `cast_sparse` doesn't do any sort of pivoting or aggregation; when no value column is given it fills each document/token cell with 1, and duplicate pairs collapse into a single entry rather than being summed.

Long story short, `cast_sparse` will just give us a matrix of only 1/0 unless we do the extra steps to calculate the counts beforehand.  So we just won't do that extra work.
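
Here's a small sketch of that difference on a toy input where one token is repeated within a document.  The `toy` data frame and the `binary_bow`/`counts_bow` names are hypothetical.

```r
library(dplyr)
library(tidytext)

# Hypothetical tokens: "leg" appears twice in review r1.
toy <- data.frame(
    review_id = c("r1", "r1", "r2"),
    token     = c("leg", "leg", "base"),
    stringsAsFactors = FALSE
)

# No value column: every document/token pair that occurs becomes a 1.
binary_bow <- cast_sparse(toy, review_id, token)
print(binary_bow["r1", "leg"])  # 1

# Counting first and passing the count column gives real term counts.
counts_bow <- (
    toy
    %>% count(review_id, token)
    %>% cast_sparse(review_id, token, n)
)
print(counts_bow["r1", "leg"])  # 2
```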

In [12]:
# convert to a sparse Matrix so we can do some modeling.
# There are also other cast_* functions to cast to different
# formats (e.g. the DocumentTermMatrix format from the `tm` library
# for topic modeling, or the DFM format used by Quanteda).
train_bow <- cast_sparse(
    tokenized[,c("review_id", "token")],
    review_id,
    token
)

# verify that we only have 1s and 0s
print(max(train_bow))
print(min(train_bow))

[1] 1
[1] 0


We need to do a little extra bookkeeping to get the y values in the same order as the features after all this conversion.  None of the above steps will have changed the ordering of the documents, so we can safely get away with just a simple filter statement like so:

In [13]:
# filter to just the same list of review ids
train_y <- filter(train, review_id %in% rownames(train_bow))

# this should just print "TRUE", meaning there's a one-to-one correspondence
# between the review ids and the row labels in the bag of words matrix.
print(unique(train_y$review_id == rownames(train_bow)))

# pull out just the stars for use as our y-values for the naive bayes model
train_y <- train_y$stars

[1] TRUE
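
If you'd rather not rely on the row order being preserved, a slightly more defensive option is to look the labels up by ID with `match()`.  This is a sketch of an alternative, not what the cell above does; the toy data and the `bow_rows` vector (standing in for `rownames(train_bow)`) are hypothetical.

```r
# Hypothetical labels, keyed by review ID.
labels_df <- data.frame(
    review_id = c("r1", "r2", "r3", "r4"),
    stars     = c(1L, 5L, 3L, 2L),
    stringsAsFactors = FALSE
)

# Stand-in for rownames(train_bow): a different order, with r2 dropped
# (e.g. because all of its tokens were filtered out).
bow_rows <- c("r3", "r1", "r4")

# match() returns, for each matrix row, the position of its ID in labels_df,
# so the labels come out in exactly the matrix's row order.
aligned_stars <- labels_df$stars[match(bow_rows, labels_df$review_id)]
print(aligned_stars)  # 3 1 2
```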


Most of the above transformations are _stateless;_ the only stateful parts are the token filtering and the `cast_sparse` step, whose output columns depend on which tokens are present.  This is a bit of a problem for our data, since we have two different subsets (train and test), and we want to apply _the same transformations_ to the testing set as we do to the training set.  The `cast_sparse` step is the root of the issue here; we need some way to ensure that the columns are the same for the training and testing datasets.  But the identification of rare/common words to filter out is also a slight issue, albeit a much smaller one.

So instead, we'll do a bit of an end run around the issue.  We'll concatenate our training and testing datasets together, run the transformations, and then split them back out into separate matrices based on the row names in the resulting matrix.  It's not as hacky as it sounds!

In [14]:
string2bow <- function (df) {
    tokenized <- (
        df[, c("Split", "review_body", "review_id")]
        # tokenize
        %>% unnest_tokens(token, review_body)
        # remove stopwords
        %>% filter(!(token %in% stop_words$word))
        # stem
        %>% mutate(token = wordStem(token))
        # remove non-alpha characters
        %>% mutate(token = gsub("[^a-z]", "", token))
        # remove empty tokens and tokens with fewer than 3 characters
        %>% filter(nchar(token) > 2)
    )
    
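    # remove rare + common terms; but base this determination
    # only on the training dataset.  Here `n` counts token occurrences
    # (rows), and `pct` is each token's share of all training tokens.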
    common_terms <- (
        tokenized
        %>% filter(Split == "Train")
        %>% group_by(token)
        %>% tally()
        %>% mutate(pct = n / sum(n))
        %>% filter(!(n < 10 | pct > 0.5))
    )
    tokenized <- filter(tokenized, token %in% common_terms$token)
    
    # we need a numeric value to populate the sparse matrix with;
    # a constant 1 marks presence, which is all Bernoulli Naive Bayes needs.
    tokenized$n <- 1
    
    return(tokenized)
}

# add indicator columns so we can split the datasets apart again later.
train$Split <- "Train"
test$Split <- "Test"
data <- rbind(train, test)
tokens <- string2bow(data)

Now we cast to a sparse matrix and split the data back up into training and testing subsets.

In [15]:
bow <- cast_sparse(tokens, review_id, token, n)

# extract the y values and the split labels
labels <- filter(data, review_id %in% rownames(bow))$stars
splits <- filter(data, review_id %in% rownames(bow))$Split

# break the data back out into train and test
train_bow <- bow[splits == "Train",]
train_y <- labels[splits == "Train"]

test_bow <- bow[splits == "Test",]
test_y <- labels[splits == "Test"]
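
For comparison, here's a sketch of a different way to get matching columns: fix the vocabulary from the training data, then build the test matrix against that vocabulary directly.  This is not what this notebook does, and it uses `Matrix::sparseMatrix()` instead of `cast_sparse()`.  The `vocab` and `test_tokens` objects are hypothetical toy stand-ins.

```r
library(Matrix)

# Hypothetical vocabulary, as if taken from colnames() of the training matrix.
vocab <- c("base", "broken", "leg")

# Hypothetical tokenized test data; "unseen" never appeared in training.
test_tokens <- data.frame(
    review_id = c("t1", "t1", "t2", "t2"),
    token     = c("leg", "unseen", "broken", "broken"),
    stringsAsFactors = FALSE
)

# Drop out-of-vocabulary tokens and duplicate document/token pairs.
test_tokens <- unique(test_tokens[test_tokens$token %in% vocab, ])
doc_ids <- unique(test_tokens$review_id)

# Column positions come from the training vocabulary, so the test matrix
# has the same columns in the same order no matter which tokens occur.
toy_test_bow <- sparseMatrix(
    i = match(test_tokens$review_id, doc_ids),
    j = match(test_tokens$token, vocab),
    x = rep(1, nrow(test_tokens)),
    dims = c(length(doc_ids), length(vocab)),
    dimnames = list(doc_ids, vocab)
)
print(toy_test_bow)
```

The trade-off is more bookkeeping by hand, in exchange for never having to touch the test data while building the training features.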

That's it!  Now we're ready to train our Naive Bayes model and see how it does.  Since we have 5 classes and they're evenly balanced (I haven't explicitly calculated/shown that in this notebook, but it's easy to verify for yourself), random guessing would score about 0.2 on accuracy/F1, so that's the baseline we need to beat.
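
A quick way to check the class balance and the resulting baseline is a frequency table.  The sketch below uses a made-up, perfectly balanced `stars` vector; on the real data you would pass `train_y` instead.

```r
# Hypothetical balanced labels: 4 reviews for each of the 5 star ratings.
stars <- rep(1:5, each = 4)

# Share of each class; all equal when the classes are balanced.
class_share <- prop.table(table(stars))
print(class_share)

# Accuracy from always predicting the most frequent class.
baseline_accuracy <- max(class_share)
print(baseline_accuracy)  # 0.2
```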

In [16]:
nb <- bernoulli_naive_bayes(train_bow, as.factor(train_y))
preds <- predict(nb, newdata = test_bow)

"bernoulli_naive_bayes(): there are 1570 empty cells leading to zero estimates. Consider Laplace smoothing."

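The warning above suggests Laplace smoothing, which `bernoulli_naive_bayes()` exposes through its `laplace` argument.  Here's a minimal sketch on a tiny made-up matrix; the data is hypothetical, and I haven't re-run the real model with smoothing, so no claim here about how it changes the scores.

```r
library(naivebayes)

# Hypothetical 0/1 feature matrix: 4 documents, 2 tokens.
x <- matrix(
    c(1, 0,
      1, 1,
      0, 1,
      0, 0),
    ncol = 2, byrow = TRUE,
    dimnames = list(NULL, c("broken", "great"))
)
y <- factor(c("1", "1", "5", "5"))

# laplace = 1 adds a pseudo-count so no class/feature combination
# ends up with an estimated probability of exactly zero.
nb_smoothed <- bernoulli_naive_bayes(x, y, laplace = 1)
toy_preds <- predict(nb_smoothed, newdata = x)
print(toy_preds)
```

On the notebook's data, the equivalent change would be adding `laplace = 1` to the `bernoulli_naive_bayes(train_bow, as.factor(train_y))` call.
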

In [17]:
print("Accuracy:")
mean(preds == test_y)

[1] "Accuracy:"


In [18]:
print("F1 score:")
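f_meas(
    data = data.frame(preds = preds, true = as.factor(test_y)),
    truth = true,     # column holding the actual star ratings
    estimate = preds, # column holding the model's predictions
    beta = 1
)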

[1] "F1 score:"


.metric,.estimator,.estimate
<chr>,<chr>,<dbl>
f_meas,macro,0.411209


~0.4 isn't as high as it could be, but it's definitely better than 0.2!
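
If you're curious what the `macro` estimator is doing, here's a base-R sketch that computes a macro-averaged F1 by hand: one F1 per class, then an unweighted mean.  The `truth` and `pred` vectors are made up for illustration, and this is a simplified version that assumes every class is predicted at least once.

```r
# Hypothetical true labels and predictions over three classes.
lvls  <- c("a", "b", "c")
truth <- factor(c("a", "a", "b", "b", "c", "c"), levels = lvls)
pred  <- factor(c("a", "b", "b", "b", "c", "a"), levels = lvls)

# Confusion matrix: rows are true classes, columns are predicted classes.
cm <- table(truth, pred)

tp        <- diag(cm)          # correct predictions per class
precision <- tp / colSums(cm)  # of everything predicted as k, how much was k
recall    <- tp / rowSums(cm)  # of everything truly k, how much we found
f1        <- 2 * precision * recall / (precision + recall)

# Macro average: every class counts equally, regardless of its size.
macro_f1 <- mean(f1)
print(round(f1, 3))
print(macro_f1)
```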