# Word Embeddings in R

There are a handful of R packages for working with word embeddings, but we'll use `text2vec` again.  We'll repeat the same prediction task as before, but we'll represent the documents with word embeddings rather than bag-of-words, and we'll use a simple multi-layer perceptron instead of a Naive Bayes model.  (Word embeddings pair well with highly nonlinear models, such as tree-based models or neural networks.)

`text2vec` has tools both to load pre-trained word vectors and to train our own using the `GloVe` algorithm; we'll just train our own, since loading pre-trained ones requires going out and getting the vectors yourself.

In [1]:
# install if needed
# install.packages("caret")     # general machine learning library, provides a nice interface
#                               # to the RSNNS MLP implementation.
# install.packages("dplyr")     # general data munging
# install.packages("RSNNS")     # provides the MLP implementation we'll use
# install.packages("text2vec")  # text vectorization
# install.packages("yardstick") # evaluation metrics (used for the F1 score)

In [2]:
# Load data
train <- read.csv("../../data/train.csv", stringsAsFactors = FALSE)
test <- read.csv("../../data/test.csv", stringsAsFactors = FALSE)
str(train)

'data.frame':	200000 obs. of  8 variables:
 $ review_id       : chr  "en_0964290" "en_0690095" "en_0311558" "en_0044972" ...
 $ product_id      : chr  "product_en_0740675" "product_en_0440378" "product_en_0399702" "product_en_0444063" ...
 $ reviewer_id     : chr  "reviewer_en_0342986" "reviewer_en_0133349" "reviewer_en_0152034" "reviewer_en_0656967" ...
 $ stars           : int  1 1 1 1 1 1 1 1 1 1 ...
 $ review_body     : chr  "Arrived broken. Manufacturer defect. Two of the legs of the base were not completely formed, so there was no wa"| __truncated__ "the cabinet dot were all detached from backing... got me" "I received my first order of this product and it was broke so I ordered it again. The second one was broke in m"| __truncated__ "This product is a piece of shit. Do not buy. Doesn't work, and then I try to call for customer support, it won'"| __truncated__ ...
 $ review_title    : chr  "I'll spend twice the amount of time boxing up the whole useless thing and send it back wit

In [3]:
library(caret)
library(dplyr)
library(text2vec)
library(yardstick)

Loading required package: ggplot2

Loading required package: lattice


Attaching package: 'dplyr'


The following objects are masked from 'package:stats':

    filter, lag


The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union


For binary classification, the first factor level is assumed to be the event.
Use the argument `event_level = "second"` to alter this as needed.


Attaching package: 'yardstick'


The following objects are masked from 'package:caret':

    precision, recall, sensitivity, specificity




The general workflow for training your own `GloVe` vectors is roughly as follows:
- Tokenize your texts.
- Filter out super rare tokens.
- Build term-term co-occurrence matrix.
- Use GloVe on that matrix to factorize it and get the word vectors.
    - `text2vec` actually re-exports the GloVe implementation (and a few other things) from the `rsparse` package.  `rsparse` is designed for working with sparse matrices, including some common matrix factorization tools.  (GloVe is actually a matrix factorization algorithm!)
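
To make the first three steps concrete, here is a minimal base-R sketch on a two-sentence toy corpus.  It is only an illustration under some assumptions: the window size of 2 is made up for the example, no rare-token filtering is applied because the corpus is tiny, and the counts are plain unweighted counts (the real `create_tcm` may weight pairs by their distance, so its numbers need not match these).

```r
# Toy corpus: each document is already lowercased.
docs <- c("the cat sat on the mat", "the dog barked")

# Step 1: tokenize on whitespace.
tokens <- strsplit(docs, " ", fixed = TRUE)

# Step 2: build the vocabulary (no rare-token filtering on a corpus this small).
vocab <- sort(unique(unlist(tokens)))

# Step 3: count symmetric co-occurrences within a fixed window.
window <- 2  # hypothetical window size, chosen only for this example
tcm <- matrix(0, length(vocab), length(vocab), dimnames = list(vocab, vocab))
for (doc in tokens) {
    n <- length(doc)
    for (i in seq_len(n)) {
        # Look ahead at most `window` tokens, without running past the end.
        for (j in seq_len(min(window, n - i))) {
            a <- doc[i]
            b <- doc[i + j]
            tcm[a, b] <- tcm[a, b] + 1
            tcm[b, a] <- tcm[b, a] + 1
        }
    }
}

# Co-occurrence counts for "the" against every word in the vocabulary.
tcm["the", ]
```

The matrix that GloVe factorizes has this same shape (one row and one column per vocabulary term), just far larger and far sparser.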


Note that we aren't going to remove stopwords or super common words, and we aren't going to stem our texts.  Since word embeddings learn to encode _co-occurrence_ information, stopwords can actually provide useful information about how words are distributed.  E.g., a word that usually appears soon after _the_ is probably a noun or an adjective.  Different inflected forms of a word might also carry subtly different meanings that an embedding model can pick up on.  In theory, a bag-of-words model can pick up on this kind of meaning too, but doing so requires a massive increase in the sparsity and number of features, which can cause other reliability issues.  Since word embeddings are specifically designed to _not_ be sparse, we can do less preprocessing of our texts to preserve the maximum amount of information possible.

We'll train our vectors based on just the training dataset.  Most of this is just about copy-pasted from the `text2vec` examples in the package's documentation, minus a bit of the preprocessing.

In [4]:
# Create iterator over tokens.  The tokenization functions return
# a list of character vectors (one vector of tokens per document).
tokens <- (
    train$review_body
    %>% tolower()
    %>% gsub("[^a-z]+", " ", .)
    %>% word_tokenizer()
    %>% itoken()
)
vocab <- create_vocabulary(tokens)
head(vocab, 10)

    term           term_count  doc_count
    <chr>          <int>       <int>
 1  aaaand         1           1
 2  aand           1           1
 3  aas            1           1
 4  aback          1           1
 5  abandonment    1           1
 6  abarth         1           1
 7  abbreviated    1           1
 8  abbreviations  1           1
 9  abdl           1           1
10  abduction      1           1


In [5]:
# Remove rare tokens; the cutoff of 5 _total counts_ (not _document counts_,
# as with bag of words) is somewhat arbitrary on my part.
vocab <- filter(vocab, term_count >= 5)
head(vocab, 10)

    term            term_count  doc_count
    <chr>           <int>       <int>
 1  abandon         5           5
 2  abnormal        5           5
 3  absorber        5           5
 4  abusing         5           5
 5  academic        5           5
 6  accentuate      5           5
 7  accentuated     5           5
 8  accentuates     5           5
 9  accomplishes    5           5
10  accountability  5           5


In [6]:
# Use the vocabulary to create a vectorizer, then use
# that vectorizer to create our term co-occurrence matrix
# ("tcm" in text2vec lingo).
vectorizer <- vocab_vectorizer(vocab)
co_occurrence_matrix <- create_tcm(tokens, vectorizer, skip_grams_window = 5)
head(co_occurrence_matrix, 10)

  [[ suppressing 33 column names 'abandon', 'abnormal', 'absorber' ... ]]



10 x 17863 sparse Matrix of class "dgTMatrix"
                                                                              
abandon        . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
abnormal       . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
absorber       . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
abusing        . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
academic       . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
accentuate     . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
accentuated    . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
accentuates    . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
accomplishes   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
accountability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
                       
abandon        . ......
abnormal       . ......
absorber     

Now we fit the GloVe model.  We'll have it learn 300-dimensional vectors.

In [7]:
# see the `rsparse` library's documentation for more details about
# these functions.  This step might take a few minutes depending on
# your hardware.
glove = GlobalVectors$new(rank = 300, x_max=25)
wv_main = glove$fit_transform(
    co_occurrence_matrix,
    n_iter = 25,
    convergence_tol = 0.01,
    # NOTE: SET THIS LOWER IF YOU HAVE FEWER THREADS AVAILABLE
    # ON YOUR SYSTEM!
    n_threads = 8
)

INFO  [11:01:14.215] epoch 1, loss 0.2387
INFO  [11:01:20.213] epoch 2, loss 0.0909
INFO  [11:01:26.215] epoch 3, loss 0.0658
INFO  [11:01:32.237] epoch 4, loss 0.0485
INFO  [11:01:38.333] epoch 5, loss 0.0416
INFO  [11:01:44.948] epoch 6, loss 0.0368
INFO  [11:01:51.552] epoch 7, loss 0.0336
INFO  [11:01:58.826] epoch 8, loss 0.0310
INFO  [11:02:05.513] epoch 9, loss 0.0289
INFO  [11:02:12.040] epoch 10, loss 0.0271
INFO  [11:02:18.571] epoch 11, loss 0.0257
INFO  [11:02:25.151] epoch 12, loss 0.0244
INFO  [11:02:31.735] epoch 13, loss 0.0232
INFO  [11:02:38.273] epoch 14, loss 0.0222
INFO  [11:02:44.786] epoch 15, loss 0.0213
INFO  [11:02:51.082] epoch 16, loss 0.0205
INFO  [11:02:57.506] epoch 17, loss 0.0197
INFO  [11:03:03.888] epoch 18, loss 0.0191
INFO  [11:03:10.509] epoch 19, loss 0.0184
INFO  [11:03:17.163] epoch 20, loss 0.0179
INFO  [11:03:23.351] epoch 21, loss 0.0174
INFO  [11:03:30.527] epoch 22, loss 0.0169
INFO  [11:03:38.083] epoch 23, loss 0.0164
INFO  [11:03:47.193]

Per the `text2vec` documentation, this learns two vectors for each word: a "main" and a "context" vector.  The package authors recommend averaging or summing these together; we'll just sum them.

In [8]:
word_vectors = wv_main + t(glove$components)

Now we can get vectors for each word (truncated to just 10 dimensions, for the sake of sane output):

In [9]:
word_vectors["abandon", 1:10]
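
One common way to sanity-check trained vectors is to look up a word's nearest neighbours by cosine similarity.  The sketch below is an assumption-laden illustration rather than part of the original workflow: `nearest_words` is a hypothetical helper written in base R, and `toy_vectors` is made-up data so that the block runs on its own.  With the real vectors, the call would be something like `nearest_words(word_vectors, "abandon")`.

```r
# Return the `n` rows of `vectors` most similar to `term` by cosine similarity.
# `nearest_words` is a hypothetical helper, not part of text2vec.
nearest_words <- function(vectors, term, n = 5) {
    # Scale every row to unit length so a dot product equals cosine similarity.
    unit <- vectors / sqrt(rowSums(vectors^2))
    sims <- drop(unit %*% unit[term, ])
    # Drop the query word itself, then keep the top n.
    sims <- sims[names(sims) != term]
    head(sort(sims, decreasing = TRUE), n)
}

# Tiny made-up vectors, only to show the call; real ones come from GloVe.
toy_vectors <- rbind(
    good  = c(1.0, 0.9, 0.0),
    great = c(0.9, 1.0, 0.1),
    bad   = c(-1.0, -0.8, 0.0),
    table = c(0.0, 0.1, 1.0)
)
nearest_words(toy_vectors, "good", n = 2)
```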

Our next step is to get the summed word vectors for each document.  We're going to do this in a bit of a clever way: construct a document-term matrix (one row per document, one column per term, where each value is the number of times that term appears in that document), then multiply it by the matrix of GloVe vectors.  For a very simple sketch of why this works, consider the two sentences _the cat sat on the mat_ and _the dog barked_.  The computation is:

$$
\text{Vectorized} = \textbf{DV}
$$

where $\textbf{D}$ is our document-term matrix, and $\textbf{V}$ is our array of word vectors.  Filling in some values:

$$
\begin{align}
    \text{Vectorized} &= \underbrace{
        \begin{bmatrix}
            2 & 1 & 1 & 1 & 1 & 0 & 0 \\
            1 & 0 & 0 & 0 & 0 & 1 & 1
        \end{bmatrix}
    }_{\textbf{D}}
    \underbrace{
        \begin{bmatrix}
            v_{\text{the}}\\
            v_{\text{cat}}\\
            v_{\text{sat}}\\
            v_{\text{on}}\\
            v_{\text{mat}} \\
            v_{\text{dog}} \\
            v_{\text{barked}}
        \end{bmatrix}
    }_{\textbf{V}}\\
    &= \begin{bmatrix}
        2v_{\text{the}} + v_{\text{cat}} + v_{\text{sat}} + v_{\text{on}} + v_{\text{mat}} \\
        v_{\text{the}} + v_{\text{dog}} + v_{\text{barked}}
    \end{bmatrix}
\end{align}
$$

where $v_i$ is the word embedding vector for word $i$.  Note that if we take $\mathbf{D}$ and divide each row by its sum, so that each row sums to 1, this will calculate the average of the word vectors.

We're doing it this way, rather than manually summing or averaging the vectors, mostly for efficiency.  This formulation lets us punt the computations off to very fast, efficient linear algebra libraries, while also writing way less code!  (Realistically, explicitly summing the word vectors within each document probably isn't that much slower.  But this solution is so much cooler!)
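
The two-sentence example can be checked numerically in base R.  This is a minimal sketch: the matrix `D` matches the one in the equation above, while the 2-dimensional "embeddings" in `V` are made-up numbers standing in for real GloVe vectors.

```r
# Vocabulary in the same order as the columns of D in the equation.
vocab <- c("the", "cat", "sat", "on", "mat", "dog", "barked")

# Document-term matrix: one row per sentence, one column per word.
D <- rbind(
    c(2, 1, 1, 1, 1, 0, 0),  # "the cat sat on the mat"
    c(1, 0, 0, 0, 0, 1, 1)   # "the dog barked"
)
colnames(D) <- vocab

# Made-up 2-dimensional "embeddings", one row per word (not real GloVe output).
V <- cbind(dim1 = 1:7, dim2 = c(10, 20, 30, 40, 50, 60, 70))
rownames(V) <- vocab

# Summed word vectors per document.
summed <- D %*% V

# Dividing each row of D by its sum turns the sum into an average.
averaged <- (D / rowSums(D)) %*% V

summed
averaged
```

With a real document-term matrix, a document whose tokens are all outside the vocabulary has a row sum of zero, so the averaging variant would need a guard against division by zero.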

In [10]:
dtm <- create_dtm(tokens, vectorizer) %*% word_vectors

as(<dgTMatrix>, "dgCMatrix") is deprecated since Matrix 1.5-0; do as(., "CsparseMatrix") instead



In [11]:
dtm[1:10, 1:10]

10 x 10 Matrix of class "dgeMatrix"
         [,1]       [,2]       [,3]        [,4]      [,5]       [,6]       [,7]
1  16.8847337 -1.5966930 12.3255083  8.08903599 8.8072155 26.6074137 -18.572676
2   1.1124898 -0.1380209 -0.1977723 -0.60194675 0.7470795  1.9563452  -1.947656
3   8.9582622 -1.2017779  4.9813878  7.33242057 3.8924063 10.9962288 -11.436741
4   4.8123758 -1.5032346  2.8292792  4.93624175 3.0442225  8.2708395  -7.139036
5   0.8562205  1.1252240  1.2295044  3.11522275 1.5732660  6.2266281  -2.640486
6   4.7113183 -0.8208662  3.7866221  1.58915025 3.9922056  5.5243687  -2.180940
7   2.8899061 -0.4078928  2.1513048  0.71102101 0.2343536  3.5852977  -3.728535
8   2.3244985  0.1738621  1.1708788  1.03232248 1.3039819  0.9467438  -2.594803
9   2.9768242 -1.0697795  0.9845830  0.07709319 4.5567791  5.8693569  -3.592678
10 10.6938689 -0.6760623  7.0024062  6.36415726 8.0821253 16.1952134 -10.937990
         [,8]      [,9]     [,10]
1  -2.1972183 15.780257 24.285753
2   0.5850388  1

Let's wrap all this up in a single function to preprocess some texts once we've got our GloVe vectors trained.

In [12]:
preprocess <- function(df, vectorizer) {
    return (
        df$review_body
        %>% tolower()
        %>% gsub("[^a-z]+", " ", .)
        %>% word_tokenizer()
        %>% itoken(progressbar = FALSE)
        %>% create_dtm(vectorizer)
    )
}

train_vecs = preprocess(train, vectorizer) %*% word_vectors
train_y = train$stars
test_vecs = preprocess(test, vectorizer) %*% word_vectors
test_y = test$stars

And now we train a simple multi-layer perceptron and evaluate its performance: 

In [13]:
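# column names are required for training with caret; just use dummy ones
# since the columns/features in the matrix don't have any meaningful direct
# interpretation anyways.
colnames(train_vecs) <- seq_len(ncol(train_vecs))
colnames(test_vecs) <- seq_len(ncol(test_vecs))

mlp <- train(
    # training an MLP like this requires a base R matrix (hence as.matrix)
    # in order to do any automated preprocessing
    # NOTE: this call fits on the *test* split, which matches the
    # "5000 samples" in the output below, so the scores that follow are
    # measured on data the model was trained on.  Fit on train_vecs/train_y
    # instead for a genuinely held-out evaluation.
    x = as.matrix(test_vecs),
    y = as.factor(test_y),
    preProcess = c("center", "scale"),
    method ="mlp",
    # NOTE: the resampling output below reports size = 1, 3, 5, so this
    # value does not appear to be the one caret actually used.
    size = c(128, 64, 64),
    maxit = 10
)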
mlp

Multi-Layer Perceptron 

5000 samples
 300 predictor
   5 classes: '1', '2', '3', '4', '5' 

Pre-processing: centered (300), scaled (300) 
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 5000, 5000, 5000, 5000, 5000, 5000, ... 
Resampling results across tuning parameters:

  size  Accuracy   Kappa    
  1     0.3496974  0.1886772
  3     0.4231999  0.2790802
  5     0.4277810  0.2847745

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was size = 5.

In [14]:
preds <- predict(mlp, as.matrix(test_vecs))
head(preds)

In [15]:
print("Accuracy:")
mean(preds == test_y)

[1] "Accuracy:"


In [16]:
print("F1 score:")
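# yardstick expects the true labels first (`truth`), then the
# predictions (`estimate`), so name the arguments explicitly.
f_meas(
    data = data.frame(preds = preds, true = as.factor(test_y)),
    truth = true,
    estimate = preds,
    beta = 1
)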

[1] "F1 score:"


.metric,.estimator,.estimate
<chr>,<chr>,<dbl>
f_meas,macro,0.4566993
