R text analysis: quanteda
Kasper Welbers & Wouter van Atteveldt 2019-1
- Step 1: Importing text and creating a quanteda corpus
- Step 2: Creating the DTM (or DFM)
- Step 3: Analysis
In this tutorial you will learn how to perform text analysis using the quanteda package. In the R Basics: getting started tutorial we introduced some of the techniques from this tutorial as a light introduction to R. In this and the following tutorials, the goal is to get more understanding of what actually happens ‘under the hood’ and which choices can be made, and to become more confident and proficient in using quanteda for text analysis.
The quanteda package
The quanteda package is an extensive text analysis suite for R. It covers everything you need to perform a variety of automatic text analysis techniques, and features clear and extensive documentation. Here we’ll focus on the main preparatory steps for text analysis, and on learning how to browse the quanteda documentation. The documentation for each function can also be found here.
For a more detailed explanation of the steps discussed here, you can read the paper Text Analysis in R (Welbers, van Atteveldt & Benoit, 2017).
library(quanteda) library(quanteda.textplots) library(quanteda.textstats)
Step 1: Importing text and creating a quanteda corpus
The first step is getting text into R in a proper format. stored in a variety of formats, from plain text and CSV files to HTML and PDF, and with different ‘encodings’. There are various packages for reading these file formats, and there is also the convenient readtext that is specialized for reading texts from a variety of formats.
From CSV files
For this tutorial, we will be importing text from a csv. For convenience, we’re using a csv that’s available online, but the process is the same for a csv file on your own computer. The data consists of the State of the Union speeches of US presidents, with each document (i.e. row in the csv) being a paragraph. The data will be imported as a data.frame.
library(readr) url = 'https://bit.ly/2QoqUQS' d = read_csv(url) head(d) ## view first 6 rows
We can now create a quanteda corpus with the
corpus() function. If you
want to learn more about this function, recall that you can use the
question mark to look at the documentation.
Here you see that for a data.frame, we need to specify which column contains the text field. Also, the text column must be a character vector.
corp = corpus(d, text_field = 'text') ## create the corpus corp
From text (or word/pdf) files
Rather than a csv file, your texts might be stored as separate files,
e.g. as .txt, .pdf, or .docx files. You can quite easily read these as
well with the
readtext function from the
readtext package. You might
have to install that package first with:
You can then call the readtext function on a particular file, or on a folder or zip archive of files directly.
library(readtext) url = "https://github.com/ccs-amsterdam/r-course-material/blob/master/data/files.zip?raw=true" texts = readtext(url) texts
As you can see, it automatically downloaded and unzipped the files, and converted the MS Word and PDF files into plain text.
I read them from an online source here, but you can also read them from your hard drive by specifying the path:
texts = reattext("c:/path/to/files") texts = reattext("/Users/me/Documents/files")
You can convert the texts directly into a corpus object as above:
corp2 = corpus(texts) corp2
Step 2: Creating the DTM (or DFM)
Many text analysis techniques only use the frequencies of words in documents. This is also called the bag-of-words assumption, because texts are then treated as bags of individual words. Despite ignoring much relevant information in the order of words and syntax, this approach has proven to be very powerfull and efficient.
The standard format for representing a bag-of-words is as a
document-term matrix (DTM). This is a matrix in which rows are
documents, columns are terms, and cells indicate how often each term
occured in each document. We’ll first create a small example DTM from a
few lines of text. Here we use quanteda’s
dfm() function, which stands
document-feature matrix (DFM), which is a more general form of a
text <- c(d1 = "Cats are awesome!", d2 = "We need more cats!", d3 = "This is a soliloquy about a cat.") dtm <- dfm(text, tolower=F) dtm
Here you see, for instance, that the word
soliloquy only occurs in the
third document. In this matrix format, we can perform calculations with
texts, like analyzing different sentiments of frames regarding cats, or
the computing the similarity betwee the third sentence and the first two
However, directly converting a text to a DTM is a bit crude. Note, for
instance, that the words
cat are given different
columns. In this DTM, “Cats” and “awesome” are as different as “Cats”
and “cats”, but for many types of analysis we would be more interested
in the fact that both texts are about felines, and not about the
specific word that is used. Also, for performance it can be useful (or
even necessary) to use fewer columns, and to ignore less interesting
words such as
is or very rare words such as
This can be achieved by using additional
preprocessing steps. In the
next example, we’ll again create the DTM, but this time we make all text
lowercase, ignore stopwords and punctuation, and perform
Simply put, stemming removes some parts at the ends of words to ignore
different forms of the same word, such as singular versus plural (“gun”
or “gun-s”) and different verb forms (“walk”,“walk-ing”,“walk-s”)
dtm = dfm(text, tolower=T, remove = stopwords('en'), stem = T, remove_punct=T) dtm
By now you should be able to understand better how the arguments in this
function work. The
tolower argument determines whether texts are
(TRUE) or aren’t (FALSE) converted to lowercase.
whether stemming is (TRUE) or isn’t (FALSE) used. The remove argument is
a bit more tricky. If you look at the documentation for the dfm function
?dfm) you’ll see that
remove can be used to give “a pattern of
user-supplied features to ignore”. In this case, we actually used
stopwords(), to get a list of english stopwords. You
can see for yourself.
This list of words is thus passed to the
remove argument in the
dfm() to ignore these words. If you are using texts in another
language, make sure to specify the language, such as stopwords(‘nl’) for
Dutch or stopwords(‘de’) for German.
There are various alternative preprocessing techniques, including more advanced techniques that are not implemented in quanteda. Whether, when and how to use these techniques is a broad topic that we won’t cover today. For more details about preprocessing you can read the Text Analysis in R paper cited above.
Filtering the DTM
For this tutorial, we’ll use the State of the Union speeches. We already
created the corpus above. We can now pass this corpus to the
function and set the preprocessing parameters.
dtm = dfm(corp, tolower=T, stem=T, remove=stopwords('en'), remove_punct=T) dtm
This dtm has 23,469 documents and 20,429 features (i.e. terms), and no longer shows the actual matrix because it simply wouldn’t fit. Depending on the type of analysis that you want to conduct, we might not need this many words, or might actually run into computational limitations.
Luckily, many of these 20K features are not that informative. The distribution of term frequencies tends to have a very long tail, with many words occuring only once or a few times in our corpus. For many types of bag-of-words analysis it would not harm to remove these words, and it might actually improve results.
We can use the
dfm_trim function to remove columns based on criteria
specified in the arguments. Here we say that we want to remove all terms
for which the frequency (i.e. the sum value of the column in the DTM) is
dtm = dfm_trim(dtm, min_termfreq = 10) dtm
Now we have about 5000 features left. See
?dfm_trim for more options.
Step 3: Analysis
Using the dtm we can now employ various techniques. You’ve already seen some of them in the introduction tutorial, but by now you should be able to understand more about the R syntax, and understand how to tinker with different parameters.
Word frequencies and wordclouds
Get most frequent words in corpus.
textplot_wordcloud(dtm, max_words = 50) ## top 50 (most frequent) words textplot_wordcloud(dtm, max_words = 50, color = c('blue','red')) ## change colors textstat_frequency(dtm, n = 10) ## view the frequencies
You can also inspect a subcorpus. For example, looking only at Obama
speeches. To subset the DTM we can use quanteda’s
dtm_subset(), but we
can also use the more general R subsetting techniques (as discussed last
week). Here we’ll use the latter for illustration.
docvars(dtm) we get a data.frame with the document variables.
docvars(dtm)$President, we get the character vector with
president names. Thus, with
docvars(dtm)$President == 'Barack Obama'
we look for all documents where the president was Obama. To make this
more explicit, we store the logical vector, that shows which documents
are ‘TRUE’, as is_obama. We then use this to select these rows from the
is_obama = docvars(dtm)$President == 'Barack Obama' obama_dtm = dtm[is_obama,] textplot_wordcloud(obama_dtm, max_words = 25)
Compare word frequencies between two subcorpora. Here we (again) first
use a comparison to get the is_obama vector. We then use this in the
textstat_keyness() function to indicate that we want to compare the
Obama documents (where is_obama is TRUE) to all other documents (where
is_obama is FALSE).
is_obama = docvars(dtm)$President == 'Barack Obama' ts = textstat_keyness(dtm, is_obama) head(ts, 20) ## view first 20 results
We can visualize these results, stored under the name
ts, by using the
As seen in the first tutorial, a keyword-in-context listing shows a given keyword in the context of its use. This is a good help for interpreting words from a wordcloud or keyness plot.
Since a DTM only knows word frequencies, the
kwic() function requires
the corpus object as input.
k = kwic(corp, 'freedom', window = 7) head(k, 10) ## only view first 10 results
kwic() function can also be used to focus an analysis on a
specific search term. You can use the output of the kwic function to
create a new DTM, in which only the words within the shown window are
included in the DTM. With the following code, a DTM is created that only
contains words that occur within 10 words from
terrorist, terror, etc.).
terror = kwic(corp, 'terror*') terror_corp = corpus(terror) terror_dtm = dfm(terror_corp, tolower=T, remove=stopwords('en'), stem=T, remove_punct=T)
Now you can focus an analysis on whether and how Presidents talk about
textplot_wordcloud(terror_dtm, max_words = 50) ## top 50 (most frequent) words
You can perform a basic dictionary search. In terms of query options this is less advanced than AmCAT, but quanteda offers more ways to analyse the dictionary results. Also, it supports the use of existing dictionaries, for instance for sentiment analysis (but mostly for english dictionaries).
An convenient way of using dictionaries is to make a DTM with the columns representing dictionary terms.
dict = dictionary(list(terrorism = 'terror*', economy = c('econom*', 'tax*', 'job*'), military = c('army','navy','military','airforce','soldier'), freedom = c('freedom','liberty'))) dict_dtm = dfm_lookup(dtm, dict, exclusive=TRUE) dict_dtm
The “4 features” are the four entries in our dictionary. Now you can perform all the analyses with dictionaries.
tk = textstat_keyness(dict_dtm, docvars(dict_dtm)$President == 'Barack Obama') textplot_keyness(tk)
You can also convert the dtm to a data frame to get counts of each concept per document (which you can then match with e.g. survey data)
df = convert(dict_dtm, to="data.frame") head(df)
Creating good dictionaries
A good dictionary means that all documents that match the dictionary are indeed about or contain the desired concept, and that all documents that contain the concept are matched.
To check this, you can manually annotate or code a sample of documents and compare the score with the dictionary hits.
You can also apply the keyword-in-context function to a dictionary to quickly check a set of matches and see if they make sense: