# Bag of Words in R with `text2vec`

Unlike `tidytext`, `text2vec` is more oriented towards document-level transformations.  But like `tidytext`, it's designed to provide a few well-designed, narrow tools that you stick into your data processing pipeline.  (As opposed to: being a one-stop shop for all your NLP needs).  It also provides a nice pre-built tool for replicable, stateful tokens-to-matrix transformations.  Personally, I like `text2vec` more than `tidytext`, but the practical differences are pretty minor.

In [1]:
# Requirements
# install.packages("dplyr")      # if you don't know what dplyr is I can't help you
# install.packages("magrittr")   # pipes!
# install.packages("naivebayes") # Naive Bayes implementation
# install.packages("SnowballC")  # stemming
# install.packages("text2vec")   # tokenization and document-term matrix creation
# install.packages("yardstick")  # model metrics; part of the tidymodels suite

In [2]:
library(dplyr)
library(magrittr)
library(naivebayes)
library(SnowballC)
library(text2vec)
library(yardstick)


Attaching package: 'dplyr'


The following objects are masked from 'package:stats':

    filter, lag


The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union


naivebayes 0.9.7 loaded

For binary classification, the first factor level is assumed to be the event.
Use the argument `event_level = "second"` to alter this as needed.



# TidyText

TidyText has a lot of pretty easy-to-use tools that integrate right into the broader Tidyverse ecosystem.  TidyText doesn't actually offer a whole lot of tools--mostly, just an `unnest_tokens()` function that converts a column of strings into the "one token per row" format.  But once everything is in that format, it's pretty easy to work with using existing Tidyverse tools and techniques.  E.g. filtering, groupby-aggregate, etc.; it's a format that's very friendly to _stateless_ and even parallelized operations over tokens.

In [3]:
# Load data
train <- read.csv("../../data/train.csv", stringsAsFactors = FALSE)
test <- read.csv("../../data/test.csv", stringsAsFactors = FALSE)
str(train)

'data.frame':	200000 obs. of  8 variables:
 $ review_id       : chr  "en_0964290" "en_0690095" "en_0311558" "en_0044972" ...
 $ product_id      : chr  "product_en_0740675" "product_en_0440378" "product_en_0399702" "product_en_0444063" ...
 $ reviewer_id     : chr  "reviewer_en_0342986" "reviewer_en_0133349" "reviewer_en_0152034" "reviewer_en_0656967" ...
 $ stars           : int  1 1 1 1 1 1 1 1 1 1 ...
 $ review_body     : chr  "Arrived broken. Manufacturer defect. Two of the legs of the base were not completely formed, so there was no wa"| __truncated__ "the cabinet dot were all detached from backing... got me" "I received my first order of this product and it was broke so I ordered it again. The second one was broke in m"| __truncated__ "This product is a piece of shit. Do not buy. Doesn't work, and then I try to call for customer support, it won'"| __truncated__ ...
 $ review_title    : chr  "I'll spend twice the amount of time boxing up the whole useless thing and send it back wit

Let's do some of the same cleanup we did earlie with `tidytext`:

In [4]:
tokens <- (
    train$review_body
    %>% tolower()
    %>% gsub("[^a-z]+", " ", .)
)

`text2vec` provides its own tokenizers (we'll use `word_tokenizer()`, but there are others; check the documentation), and tools for converting the tokenized text into an iterator format that most of the rest of its functions seem to expect:

In [5]:
tokens <- word_tokenizer(tokens)
head(tokens, 10)

If we want to apply stemming, we should do it now:

In [6]:
tokens <- lapply(tokens, wordStem)
head(tokens)

In [7]:
tokens <- itoken(tokens)
tokens

<itoken>
  Inherits from: <CallbackIterator>
  Public:
    callback: function (x) 
    clone: function (deep = FALSE) 
    initialize: function (x, callback = identity) 
    is_complete: active binding
    length: active binding
    move_cursor: function () 
    nextElem: function () 
    x: GenericIterator, iterator, R6

Now, we need to create a _vocabulary_ dataframe and run some filtering on it.  `text2vec` will use this dataframe to convert the token iterator into a matrix in a repeatable fashion.

In [8]:
vocab <- create_vocabulary(tokens)
head(vocab, 10)

Unnamed: 0_level_0,term,term_count,doc_count
Unnamed: 0_level_1,<chr>,<int>,<int>
1,aaaand,1,1
2,aand,1,1
3,aback,1,1
4,abarth,1,1
5,abdl,1,1
6,abena,1,1
7,abf,1,1
8,abi,1,1
9,abiut,1,1
10,ablaz,1,1


Let's filter out super rare and super common words.  Anything we filter out will be ignored when `text2vec` creates the sparse matrix in a bit here.

In [9]:
vocab <- (
    vocab
    %>% filter(doc_count > 10)
    %>% filter(doc_count < (sum(vocab$doc_count) / 2))
)

Now we create a _vectorizer_ based on the vocab table--this is what the other functions will use to create a document-term matrix with consistent columns.

In [10]:
# vectorizer
vectorizer <- vocab_vectorizer(vocab)
# create a document-term matrix
co_occurrence_matrix <- create_dtm(tokens, vectorizer)
co_occurrence_matrix[1:10, 1:10]

as(<dgTMatrix>, "dgCMatrix") is deprecated since Matrix 1.5-0; do as(., "CsparseMatrix") instead

  [[ suppressing 10 column names 'abc', 'absent', 'ai' ... ]]



10 x 10 sparse Matrix of class "dgCMatrix"
                      
1  . . . . . . . . . .
2  . . . . . . . . . .
3  . . . . . . . . . .
4  . . . . . . . . . .
5  . . . . . . . . . .
6  . . . . . . . . . .
7  . . . . . . . . . .
8  . . . . . . . . . .
9  . . . . . . . . . .
10 . . . . . . . . . .

Let's combine that all into a single group of functions:

In [11]:
preprocess <- function(s) {
    return (
        s$review_body
        %>% tolower()
        %>% gsub("[^a-z]+", " ", .)
        %>% word_tokenizer()
        %>% lapply(wordStem)
        %>% itoken()
    )
}

train_tokens <- preprocess(train)
test_tokens <- preprocess(test)

vectorizer <- (
    train_tokens
    %>% create_vocabulary()
    %>% filter(doc_count > 10)
    %>% filter(doc_count < (sum(.$doc_count) / 2))
    %>% vocab_vectorizer()
)

train_bow <- create_dtm(train_tokens, vectorizer)
test_bow <- create_dtm(test_tokens, vectorizer)

Now we can fit and evaluate the Naive Bayes model as we did with `tidytext`.

In [12]:
# Now fit the Naive Bayes model as before.
nb <- bernoulli_naive_bayes(train_bow, as.factor(train$stars))
preds <- predict(nb, newdata = test_bow)

"bernoulli_naive_bayes(): there are 994 empty cells leading to zero estimates. Consider Laplace smoothing."


In [13]:
print("Accuracy:")
mean(preds == test$stars)

[1] "Accuracy:"


In [14]:
print("F1 score:")
f_meas(
    data = data.frame(preds = preds, true = as.factor(test$stars)),
    preds,
    true,
    beta = 1
)

[1] "F1 score:"


.metric,.estimator,.estimate
<chr>,<chr>,<dbl>
f_meas,macro,0.3608865


Ignore the fact that the accuracy went down compared to the `tidytext` notebook; we didn't do things like stopword filtering.  We could tweak the preprocessing we did here to get comparable results if we wanted to; as it stands, the comparison is a little apples-to-oranges.