CMDist: quick start guide
R package for Concept Mover's Distance, a measure of concept engagement in texts via word embeddings.
Install and load the
CMDist package from GitHub:
# install.packages("devtools") devtools::install_github("dustinstoltz/CMDist") library(CMDist)
To use the Concept Mover's Distance (CMD) function, you will need to transform your corpus into a document-term matrix (DTM). The preferred DTM is a sparse matrix as output by
cast_sparse function, but we have tried to make the package accommodate DTMs made with the
Quanteda package or a regular old base R matrix.
Word Embeddings Matrix
You will also need a matrix of word embeddings vectors (with the "words" as rownames), and ultimately, CMD is only as good as the word embeddings used.
Word embeddings vectors can be from a pre-trained source, for example, https://nlp.stanford.edu/projects/glove/ or https://fasttext.cc/docs/en/english-vectors.html. It might take little data wrangling to get them loaded as a matrix in R with rownames as the words.
You can also create your own embeddings trained on the corpus on which you are using CMD -- i.e. corpus-trained embeddings. For example, the
text2vec R package uses the GloVe method to train embeddings. As we discuss in our paper, the decision to use pre-trained vs corpus-trained is understudied as applied in the social-scientific context.
One important caveat: the word used to denote a concept need not be in the corpus, but it must be in the word embeddings matrix. If it is not, the function will stop and let you know. This means, obviously, that corpus-trained embeddings cannot be used with words not in the corpus (pre-trained must be used).
Selecting Terms Denoting Focal Concepts
The most difficult and important part of using Concept Mover's Distance is selecting terms that denote the concepts of interest. This should be driven by (a) theory, (b) prior literature, (c) domain knowledge, (d) the word embedding space. One way of double-checking that selected terms are approriate is to look at the term's nearest neighbors. Here we use the
sim2 function from
text2vec to get the cosine distance between "thinking" and its top 10 nearest neighbors.
cos.sim = text2vec::sim2(x = my.wv, y = my.wv["thinking", , drop = FALSE], method = "cosine") head(sort(cos.sim[,1], decreasing = TRUE), 10)
Once you have a DTM, word vector matrix and terms denoting focal concepts, the simplest use of
CMDist involves finding the closeness to a focal concept denoted by a single word. Here, we use the word "thinking."
doc.closeness <- CMDist(dtm = my.dtm, cw = "thinking", wv = my.wv)
However, we may be interested in specifying the concept somewhat with additional words: for example, we might want to capture "critical thinking." To handle this, it is as simple as specifying both words separated by a space in the quotes. This creates a pseudo-document that contains only "critical" and "thinking."
doc.closeness <- CMDist(dtm = my.dtm, cw = "critical thinking", wv = my.wv)
What if instead of a compound concept we are interested in a common concept represented with two words (i.e., a bigram)? First, just like with any other word, the ngram must be in the set of embeddings (the fastText pre-trained embeddings have a lot of n≥1 grams). As long as the ngram is in the embeddings, the only difference for
CMDist is that an underscore rather than a space needs to be placed between words:
doc.closeness <- CMDist(dtm = my.dtm, cw = "critical_thinking", wv = my.wv)
Multiple Distances at Once
An analysis might suggest multiple concepts are of interest. As running CMD can take some time, it is useful to get multiple distances at the same time. This, in effect, is adding more rows to our pseudo-document-term matrix. For example, in our JCSS paper, we compare how Shakespeare's plays engage with "death" against 200 other concepts.
# example 1 concept.words <- c("thinking", "critical", "thought") doc.closeness < CMDist(dtm = my.dtm, cw = concept.words, wv = my.wv) # example 2 concept.words <- c("critical thought", "critical_thinking") doc.closeness < CMDist(dtm = my.dtm, cw = concept.words, wv = my.wv)
Performance and Parallel CMDist
CMD relies on
RWMD, and while it is a more efficient rendering of Word Mover's Distance, it is still a very complex process and thus takes a while. One way to reduce complexity and thus time (without a noticeable drop in accuracy) is by removing very sparse terms in your DTM. Parallelizing is another option, so we decided to build it in. To use parallel calculations just set
parallel = TRUE. The default number of threads is 2, but you can set it as high as you have threads/cores (but usually you want to use less than your maximum).
doc.closeness <- CMDist(dtm = my.dtm, cw = "critical_thinking", wv = my.wv, parallel = TRUE, threads = 2)
As you can see from the figure below (which shows how quickly a single concept is estimated per run with different DTM sizes), there is an overhead to setting up parallel processing and the pay off is only really gained with larger matrices. When the DTM has about 5000 documents, their begins to be performance improvements with parallizing. However, specifying 6 threads doesn't have much more of an improvement over 2 threads, but we presume this is not the case for DTMs with document numbers above the limit of our example.
Note: These estimates are based off single runs of each size and thread count (so take with a grain of salt). The DTM is based on a large sample of news articles and with sparesness set to 0.99, the vocabulary size (i.e. the number of columns) was 4,869 (which is on the low end for text analysis).
Scaling Output and Vector Comparison Metric
The function comes with a few additional options. First, by default, the closeness scores are normalized using the
scale() function in R. If this is not desired, set
scale = FALSE. Second, the default vector comparison metric in
RWMD implementation is cosine, but the original Word Mover's Distance paper which our approach is based off used Euclidean distance to compare word embeddings vectors. Therefore, the default is
method = "cosine", but can be set to
method = "euclidean".
doc.closeness <- CMDist(dtm = my.dtm, cw = "thinking", wv = my.wv, scale = FALSE, method = "euclidean")
For more discussion of the math behind the measure see Stoltz and Taylor (2019) "Concept Mover's Distance" in the Journal of Computational Social Science. The replication code and data for that paper can be found here: https://github.com/dustinstoltz/concept_movers_distance_jcss