# Bag of Words in R with `udpipe`

`udpipe` is an R library that provides an interface to the UDPipe language models.  These models do a lot of similar things to `spacy`'s models (in Python)--they're trained on a lot of the same data and for a lot of the same tasks. They're not as intensively optimized for speed, and they don't have _all_ of the same annotations as `spacy`, but they're a very good drop-in replacement in probably 95% of cases.

There's also the `spacyr` library, which lets you run `spacy` models in R.  It runs them via the `reticulate` package, which lets R call Python code, but I've never managed to get it to work.  It's always caused my R sessions to crash whenever I try to use it.  But other people don't seem to have this problem, so it might just be me.

In [1]:
# Requirements
# install.packages("dplyr")      # if you don't know what dplyr is I can't help you
# install.packages("magrittr")   # pipes!
# install.packages("naivebayes") # Naive Bayes implementation
# install.packages("SnowballC")  # stemming
# install.packages("tidytext")   # only using the stop_words dataframe from tidytext
# install.packages("udpipe")     # linguistic annotation models
# install.packages("yardstick")  # model metrics; part of the tidymodels suite

In [2]:
library(dplyr)
library(magrittr)
library(naivebayes)
library(SnowballC)
library(tidytext)
library(udpipe)
library(yardstick)


Attaching package: 'dplyr'


The following objects are masked from 'package:stats':

    filter, lag


The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union


naivebayes 0.9.7 loaded

For binary classification, the first factor level is assumed to be the event.
Use the argument `event_level = "second"` to alter this as needed.



In [3]:
# Load data
train <- read.csv("../../data/train.csv", stringsAsFactors = FALSE)
test <- read.csv("../../data/test.csv", stringsAsFactors = FALSE)
str(train)

'data.frame':	200000 obs. of  8 variables:
 $ review_id       : chr  "en_0964290" "en_0690095" "en_0311558" "en_0044972" ...
 $ product_id      : chr  "product_en_0740675" "product_en_0440378" "product_en_0399702" "product_en_0444063" ...
 $ reviewer_id     : chr  "reviewer_en_0342986" "reviewer_en_0133349" "reviewer_en_0152034" "reviewer_en_0656967" ...
 $ stars           : int  1 1 1 1 1 1 1 1 1 1 ...
 $ review_body     : chr  "Arrived broken. Manufacturer defect. Two of the legs of the base were not completely formed, so there was no wa"| __truncated__ "the cabinet dot were all detached from backing... got me" "I received my first order of this product and it was broke so I ordered it again. The second one was broke in m"| __truncated__ "This product is a piece of shit. Do not buy. Doesn't work, and then I try to call for customer support, it won'"| __truncated__ ...
 $ review_title    : chr  "I'll spend twice the amount of time boxing up the whole useless thing and send it back wit

`udpipe` is pretty easy to use.  There's only really a few steps:
1. Download the model.
1. Load the model.
1. Annotate text with the model.

(or, you can do all of those in one step--we'll see how in a few cells)

In [4]:
# download a model
downloaded_model <- udpipe_download_model(language = "english")
downloaded_model

Downloading udpipe model from https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.5/master/inst/udpipe-ud-2.5-191206/english-ewt-ud-2.5-191206.udpipe to C:/Users/andersonh/Documents/UA Projects/LAK 2023/demos/r/english-ewt-ud-2.5-191206.udpipe

 - This model has been trained on version 2.5 of data from https://universaldependencies.org

 - The model is distributed under the CC-BY-SA-NC license: https://creativecommons.org/licenses/by-nc-sa/4.0

 - Visit https://github.com/jwijffels/udpipe.models.ud.2.5 for model license details.

 - For a list of all models and their licenses (most models you can download with this package have either a CC-BY-SA or a CC-BY-SA-NC license) read the documentation at ?udpipe_download_model. For building your own models: visit the documentation by typing vignette('udpipe-train', package = 'udpipe')

Downloading finished, model stored at 'C:/Users/andersonh/Documents/UA Projects/LAK 2023/demos/r/english-ewt-ud-2.5-191206.udpipe'



language,file_model,url,download_failed,download_message
<chr>,<chr>,<chr>,<lgl>,<chr>
english-ewt,C:/Users/andersonh/Documents/UA Projects/LAK 2023/demos/r/english-ewt-ud-2.5-191206.udpipe,https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.5/master/inst/udpipe-ud-2.5-191206/english-ewt-ud-2.5-191206.udpipe,False,OK


Important note: while I'm not a lawyer, the fact that this model is distributed under a Creative Commons Attribution Share-Alike Non-commercial license (CC-BY-SA-NC) may have some ramification for how you have to distribute and license what you build with `UDPipe`.

Once the model is downloaded, use it to parse texts like so:

In [5]:
# load the model
udmodel <- udpipe_load_model(file = downloaded_model$file_model)

# annotate a piece of text with the model
doc <- (
    udpipe_annotate(udmodel, "UDPipe is a spaCy-like text annotation library for the R programming language.")
    %>% as.data.frame(detailed = TRUE)
)
doc

doc_id,paragraph_id,sentence_id,sentence,start,end,term_id,token_id,token,lemma,upos,xpos,feats,head_token_id,dep_rel,deps,misc
<chr>,<int>,<int>,<chr>,<int>,<int>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
doc1,1,1,UDPipe is a spaCy-like text annotation library for the R programming language.,1,6,1,1,UDPipe,UDPipe,PROPN,NNP,Number=Sing,9,nsubj,,
doc1,1,1,UDPipe is a spaCy-like text annotation library for the R programming language.,8,9,2,2,is,be,AUX,VBZ,Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin,9,cop,,
doc1,1,1,UDPipe is a spaCy-like text annotation library for the R programming language.,11,11,3,3,a,a,DET,DT,Definite=Ind|PronType=Art,9,det,,
doc1,1,1,UDPipe is a spaCy-like text annotation library for the R programming language.,13,17,4,4,spaCy,spaCy,NOUN,NN,Number=Sing,6,obl:npmod,,SpaceAfter=No
doc1,1,1,UDPipe is a spaCy-like text annotation library for the R programming language.,18,18,5,5,-,-,PUNCT,HYPH,,6,punct,,SpaceAfter=No
doc1,1,1,UDPipe is a spaCy-like text annotation library for the R programming language.,19,22,6,6,like,like,VERB,VB,VerbForm=Inf,9,amod,,
doc1,1,1,UDPipe is a spaCy-like text annotation library for the R programming language.,24,27,7,7,text,text,NOUN,NN,Number=Sing,9,compound,,
doc1,1,1,UDPipe is a spaCy-like text annotation library for the R programming language.,29,38,8,8,annotation,annotation,NOUN,NN,Number=Sing,9,compound,,
doc1,1,1,UDPipe is a spaCy-like text annotation library for the R programming language.,40,46,9,9,library,library,NOUN,NN,Number=Sing,0,root,,
doc1,1,1,UDPipe is a spaCy-like text annotation library for the R programming language.,48,50,10,10,for,for,ADP,IN,,14,case,,


Note that the output format is very `tidytext`-like: one row per token, one column per feature.

These annotations are extremely similar to what `spaCy` models add to token-level annotations.  A quick inventory:
- `doc_id`, `paragraph_id`, and `sentence_id` provide indicators about what document, paragraph, and sentence the string is from.
- `sentence` is the original string.
- `start` and `end` are the positions in the sentence where the token starts and ends (both are measured as the number of characters from the start of the sentence).
- `term_id` is a unique row identifier for the token, and is unique within the document.
- `token_id` is the token's position in the sentence, starting from 1.
- `token` is the original token.
- `lemma` is the lemmatized form of the token.
- `upos` and `xpos` are part of speech tags, with different levels of granularity.
- `feats` is a collection of miscellaneous morphological features, in the format: `Feature=Value|Feature=Value|...|Feature=Value`.
- `head_token_id` and `dep_rel` are related to the _dependency relationship_ between words.  Dependency grammars are one way to represent the syntactic structure of a sentence, though you don't see depedency grammars all that much outside of syntactic parsing models like this (usually, tree-based grammars dominate in Linguistics, but there are extremely strong correspondences between depednency and tree-based grammars).  `head_token_id` is the `token_id` of whatever token is the syntactic head of the current token (0 = the current token is the root of the sentence), and `dep_rel` specifies the syntactic relationship between this token and its syntactic head.
- `deps` is described in the documentation as "Enhanced dependency graph in the form of a list of head-deprel pairs," but I'll be honest, I'm not sure what this means, since it's usually NA for me.
- `misc` mostly contains flags about where space characters are in relation to the token, e.g., before or after the token.  Useful for reconstructing the original text.

However, some of the annotations that spaCy provides out-of-the-box are not present in UDPipe's output (e.g. stopword identification and named entity recognition).  But if you don't need those annotations, this obviously doesn't matter.

As an alternative to the above steps, you can use the `udpipe` function to run the whole annotation pipeline in one go.  Both this and the `udpipe_annotate` function can be parallelized with the `parallel.cores` argument, and the chunk size for parallel processing can be controlled with `parallel.chunk.size`, but be careful: according to the UDPipe documentation, the model is re-loaded _for each chunk._  This can cause pretty massive overhead if your chunk sizes are small.

In [6]:
doc <- udpipe(
    "UDPipe is a spaCy-like text annotation library for the R programming language.",
    object = "english",
    # this won't have any effect with just one document, but this is what
    parallel.cores = 2,
    parallel.chunk.size = 1,
)
doc

doc_id,paragraph_id,sentence_id,sentence,start,end,term_id,token_id,token,lemma,upos,xpos,feats,head_token_id,dep_rel,deps,misc
<chr>,<int>,<int>,<chr>,<int>,<int>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,1,1,UDPipe is a spaCy-like text annotation library for the R programming language.,1,6,1,1,UDPipe,UDPipe,PROPN,NNP,Number=Sing,9,nsubj,,
1,1,1,UDPipe is a spaCy-like text annotation library for the R programming language.,8,9,2,2,is,be,AUX,VBZ,Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin,9,cop,,
1,1,1,UDPipe is a spaCy-like text annotation library for the R programming language.,11,11,3,3,a,a,DET,DT,Definite=Ind|PronType=Art,9,det,,
1,1,1,UDPipe is a spaCy-like text annotation library for the R programming language.,13,17,4,4,spaCy,spaCy,NOUN,NN,Number=Sing,6,obl:npmod,,SpaceAfter=No
1,1,1,UDPipe is a spaCy-like text annotation library for the R programming language.,18,18,5,5,-,-,PUNCT,HYPH,,6,punct,,SpaceAfter=No
1,1,1,UDPipe is a spaCy-like text annotation library for the R programming language.,19,22,6,6,like,like,VERB,VB,VerbForm=Inf,9,amod,,
1,1,1,UDPipe is a spaCy-like text annotation library for the R programming language.,24,27,7,7,text,text,NOUN,NN,Number=Sing,9,compound,,
1,1,1,UDPipe is a spaCy-like text annotation library for the R programming language.,29,38,8,8,annotation,annotation,NOUN,NN,Number=Sing,9,compound,,
1,1,1,UDPipe is a spaCy-like text annotation library for the R programming language.,40,46,9,9,library,library,NOUN,NN,Number=Sing,0,root,,
1,1,1,UDPipe is a spaCy-like text annotation library for the R programming language.,48,50,10,10,for,for,ADP,IN,,14,case,,


UDPipe almost a drop-in replacement for TidyText's `unnest_tokens` function.  It requires a little bit of a re-write, but not much of one.

In [7]:
string2bow <- function (df) {
    tokenized <- (
        udpipe(
            x = df$review_body,
            object = "english",
            
            # disable part of speech tagging and syntactic parsing
            # for extra speed.
            tagger = "none",
            parser = "none",
            
            # print progress message every `trace` documents parsed.
            # This might not work as expected in a Jupyter notebook,
            # though, and you might get all the progress messages 
            # printed at once, after everything is completely finished.
            # This also doesn't seem to work with the parallel.cores
            # option enabled.
            trace = 2500,
            
            # parallelize for speed.  Give each job pretty large chunk
            # sizes so we can minimize the impact of the model loading
            # overhead.
            # parallel.cores = 8,
            # parallel.chunk.size = 2500,
            
            # specify document IDs.
            doc_id = c(df$id_and_split)
        )
        %>% as.data.frame()
        # remove stopwords
        %>% filter(!(lemma %in% stop_words$word))
        # remove non-alpha characters
        %>% mutate(lemma = gsub("[^a-z]", "", token))
        # remove empty tokens and tokens <2 characters
        %>% filter(nchar(lemma) > 2)
        # recreate the review_id and Split columns
        %>% mutate(
            review_id = gsub(";[^;]+", "", doc_id),
            Split = gsub("[^;]+;", "", doc_id)
        )
    )
    
    # remove rare + common terms; but base this determination
    # only on the training dataset.
    common_terms <- (
        tokenized
        %>% filter(Split == "Train")
        %>% group_by(lemma)
        %>% tally()
        %>% mutate(pct = n / sum(n))
        %>% filter(!(n < 10 | pct > 0.5))
    )
    tokenized <- filter(tokenized, lemma %in% common_terms$lemma)
    
    tokenized$n = 1
    
    return(tokenized)
}

# only running on on a subset of our data for the sake of the demo
# and speed--feel free to run it over the whole dataset on your own,
# but be prepared to wait a while before seeing any output in Jupyter.
train$Split = "Train"
test$Split = "Test"
data <- rbind(train, test)

# we can specify a doc_id column with udpipe and use it to track document-
# level metadata.  We'll need to track the training and testing splits,
# as well as review IDs, but the document ID for udpipe has to be a character
# vector.  So we'll just paste together the fields we need and split them
# back apart later.
data$id_and_split <- paste(data$review_id, data$Split, sep=";")

tokens <- string2bow(data)
head(tokens)

2023-03-03 10:02:49 Annotating text fragment 1/205000
2023-03-03 10:03:03 Annotating text fragment 2501/205000
2023-03-03 10:03:16 Annotating text fragment 5001/205000
2023-03-03 10:03:29 Annotating text fragment 7501/205000
2023-03-03 10:03:43 Annotating text fragment 10001/205000
2023-03-03 10:03:57 Annotating text fragment 12501/205000
2023-03-03 10:04:11 Annotating text fragment 15001/205000
2023-03-03 10:04:24 Annotating text fragment 17501/205000
2023-03-03 10:04:37 Annotating text fragment 20001/205000
2023-03-03 10:04:50 Annotating text fragment 22501/205000
2023-03-03 10:05:05 Annotating text fragment 25001/205000
2023-03-03 10:05:20 Annotating text fragment 27501/205000
2023-03-03 10:05:34 Annotating text fragment 30001/205000
2023-03-03 10:05:48 Annotating text fragment 32501/205000
2023-03-03 10:06:01 Annotating text fragment 35001/205000
2023-03-03 10:06:15 Annotating text fragment 37501/205000
2023-03-03 10:06:28 Annotating text fragment 40001/205000
2023-03-03 10:06:43 A

Unnamed: 0_level_0,doc_id,paragraph_id,sentence_id,sentence,start,end,term_id,token_id,token,lemma,upos,xpos,feats,head_token_id,dep_rel,deps,misc,review_id,Split,n
Unnamed: 0_level_1,<chr>,<int>,<int>,<chr>,<int>,<int>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>
1,en_0964290;Train,1,1,Arrived broken.,1,7,1,1,Arrived,rrived,,,,,,,,en_0964290,Train,1
2,en_0964290;Train,1,1,Arrived broken.,9,14,2,2,broken,broken,,,,,,,SpaceAfter=No,en_0964290,Train,1
3,en_0964290;Train,1,2,Manufacturer defect.,17,28,4,1,Manufacturer,anufacturer,,,,,,,,en_0964290,Train,1
4,en_0964290;Train,1,2,Manufacturer defect.,30,35,5,2,defect,defect,,,,,,,SpaceAfter=No,en_0964290,Train,1
5,en_0964290;Train,1,3,"Two of the legs of the base were not completely formed, so there was no way to insert the casters.",45,47,9,3,the,the,,,,,,,,en_0964290,Train,1
6,en_0964290;Train,1,3,"Two of the legs of the base were not completely formed, so there was no way to insert the casters.",49,52,10,4,legs,legs,,,,,,,,en_0964290,Train,1


From here, the rest of our bag-of-words code is basically just what we did for `tidtext`.  We'll just swap over to using the `lemma` column rather than the `token` column.

In [8]:
bow <- cast_sparse(tokens, review_id, lemma, n)

# extract the y values and the split labels
labels <- filter(data, review_id %in% rownames(bow))$stars
splits <- filter(data, review_id %in% rownames(bow))$Split

# break the data back out into train and test
train_bow <- bow[splits == "Train",]
train_y <- labels[splits == "Train"]

test_bow <- bow[splits == "Test",]
test_y <- labels[splits == "Test"]

In [9]:
nb <- bernoulli_naive_bayes(train_bow, as.factor(train_y))
preds <- predict(nb, newdata = test_bow)

"bernoulli_naive_bayes(): there are 2918 empty cells leading to zero estimates. Consider Laplace smoothing."


In [10]:
print("Accuracy:")
mean(preds == test_y)

[1] "Accuracy:"


In [11]:
print("F1 score:")
f_meas(
    data = data.frame(preds = preds, true = as.factor(test_y)),
    preds,
    true,
    beta = 1
)

[1] "F1 score:"


.metric,.estimator,.estimate
<chr>,<chr>,<dbl>
f_meas,macro,0.4334394
