
Supervised Text Classification

Wouter van Atteveldt & Kasper Welbers April 2019

This handout contains a very brief introduction to using supervised machine learning for text classification in R; a more extensive tutorial is available separately.


We will use quanteda for text processing and some machine learning, and tidyverse for general data cleaning.
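Assuming both packages are installed, they can be loaded with:

```r
library(quanteda)   # text processing, dfm creation, and text models
library(tidyverse)  # general data cleaning and manipulation
```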


We also use caret for more machine learning options. This probably requires R version 3.5 or higher. If you have trouble installing it, you can still follow the quanteda part of this tutorial. The (commented-out) line below installs the packages needed for the actual model training:

#install.packages(c("caret", "e1071", "LiblineaR"))
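After running the install line once, load caret as usual:

```r
library(caret)  # unified interface for training and evaluating many model types
```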


For machine learning, we need annotated training data. Fortunately, many annotated review data sets are freely available. This example uses a corpus of movie reviews:

download.file("", "data_corpus_movies.rda")
load("data_corpus_movies.rda")
reviews = data_corpus_movies

Training and test data

We split off a test set for testing performance (using set.seed for reproducibility), and create subsets of the corpus for training and testing:

set.seed(1)  # any fixed seed makes the split reproducible
testset = sample(docnames(reviews), 500)
reviews_test =  reviews %>% corpus_subset(docnames(reviews) %in% testset)
reviews_train = reviews %>% corpus_subset(!docnames(reviews) %in% testset)
actual_train = as.factor(docvars(reviews_train, "Sentiment"))
actual_test = as.factor(docvars(reviews_test, "Sentiment"))

Training the model using quanteda

To prepare the data, we need to create document-feature matrices (dfms) for the training and test data.

Now, we create the training dfm, stemming words, dropping single-character tokens, and keeping only features that occur in at least 10 documents:

dfm_train = reviews_train %>%  dfm(stem = TRUE) %>% 
  dfm_select(min_nchar = 2) %>% dfm_trim(min_docfreq=10)

And train the model:

m_nb <- textmodel_nb(dfm_train, actual_train)

Testing the model

To see how well the model does, we test it on the test data. For this, it's important that the test data uses the same features (vocabulary) as the training data: the model contains parameters for these features, not for words that only occur in the test data.

dfm_test <- reviews_test %>% dfm(stem = TRUE) %>% 
  dfm_match(features = featnames(dfm_train))
nb_pred <- predict(m_nb, newdata = dfm_test)

To see how well we do, we compare the predicted sentiment to the actual sentiment:

mean(nb_pred == actual_test)

So (at least on my computer), 81% accuracy. Not bad for a first try -- but movie reviews are relatively easy; this will be a lot harder for most political text...

We can use the regular table command to create a cross-table, often called a 'confusion matrix' (as it shows what kind of errors the model makes):

confusion_matrix = table(actual_test, nb_pred)
confusion_matrix
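As a minimal illustration with made-up labels (not the review data), the confusion matrix cross-tabulates actual against predicted classes, and accuracy is the proportion of cases on its diagonal:

```r
# Toy example: 5 documents, 4 classified correctly
actual <- factor(c("pos", "pos", "neg", "neg", "pos"))
pred   <- factor(c("pos", "neg", "neg", "neg", "pos"))
cm <- table(actual, pred)
sum(diag(cm)) / sum(cm)  # same as mean(pred == actual): 0.8
```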

The caret package has a function that can be used to produce the standard metrics from this table:

confusionMatrix(nb_pred, actual_test, mode = "everything")

We can also inspect the actual parameters assigned to the words, to see which words are indicative of bad and good movies:

scores = t(m_nb$PcGw) %>% as_tibble(rownames = "word")
scores %>% arrange(-neg)

Interestingly, the most negative words seem more indicative of genres than of evaluations (and after installing spacyr on more Windows computers than I care to remember, I find the negative value for anaconda quite understandable...)

scores %>% arrange(-pos)

Aside: Scaling with Quanteda

Quanteda also allows for supervised and unsupervised scaling. Although wordfish is an unsupervised method (so it doesn't really belong in this tutorial), it produces a nice visualization:

m_wf <- textmodel_wordfish(dfm_train, sparse = TRUE)
topwords = c(scores %>% arrange(-pos) %$% head(word, 10), 
             scores %>% arrange(-neg) %$% head(word, 10))
textplot_scale1d(m_wf, margin = "features", 
                 highlighted = c(topwords, "coen", "scorses", "paltrow", "shakespear"))

(note the use of magrittr's %$% operator to expose the columns directly to the next function, rather than passing on the whole tibble)

I highlighted the most positive and negative words according to Naive Bayes (and some other words). Interestingly, the most positive/negative words are located mostly in the center of the scaling. So the (unsupervised) scaling captures genre more than sentiment, it seems.

Using Caret

Finally, let's use the caret library to train and test some models. First, we set the train control method to "none", as we don't want to do any resampling (like cross-validation) yet:

trctrl = trainControl(method = "none")
dtm_train = convert(dfm_train, to='matrix')
dtm_test = convert(dfm_test, to='matrix')

We show two algorithms here, but caret can be used to train a very large number of different models. Note that caret doesn't implement most algorithms itself, so you may need to install additional packages. Often, you also need to check the documentation for these packages (referenced in the caret docs) to understand exactly what the model does and what the hyperparameters are. See the caret documentation for more information.


Train a simple SVM from the LiblineaR package, setting the hyperparameters to (hopefully) sensible defaults:

m_svm = train(x = dtm_train, y = actual_train, method = "svmLinearWeights2",
              trControl = trctrl, tuneGrid = data.frame(cost = 1, Loss = 0, weight = 1))
svm_pred = predict(m_svm, newdata = dtm_test)
confusionMatrix(svm_pred, actual_test)

Note: For more information on the algorithm, including the meaning of the parameters and how to tune them, you need to consult the documentation of the underlying package. The caret documentation linked above will tell you which package is used (in this case: LiblineaR), and that package will contain a more technical explanation of the algorithm, generally including examples and references.

Neural Network

Train a simple neural network (using nnet), choosing a single hidden node and a small decay parameter. Note that we need to set MaxNWts to at least the number of weights in the network, which is roughly the number of features times the number of hidden nodes:

m_nn = train(x = dtm_train, y = actual_train, method = "nnet", 
             trControl = trctrl, tuneGrid = data.frame(size = 1, decay = 5e-4), MaxNWts = 6000)
nn_pred <- predict(m_nn, newdata = dtm_test)
confusionMatrix(nn_pred, actual_test)

Parameter tuning

Most algorithms have (hyper)parameters that need to be tuned, like the misclassification cost in an SVM or the number and size of hidden layers in a neural network. There are often no good theoretical grounds for setting these, so the best you can do is try many values and take the best-performing combination.

You can do this yourself, but caret also has built-in functions for an automatic grid search. For this, set tuneGrid to multiple values per parameter and choose a different trainControl method, such as cross-validation. See the caret documentation for more information.
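For example, a cross-validated grid search over the SVM hyperparameters from above might look like this (a sketch: the specific cost and weight values are arbitrary starting points, not recommendations):

```r
# 5-fold cross-validation instead of method = "none"
trctrl_cv = trainControl(method = "cv", number = 5)
# Multiple values per tunable parameter; caret tries every combination
grid = expand.grid(cost = c(0.25, 0.5, 1, 2), Loss = 0, weight = c(0.5, 1, 2))
m_svm_tuned = train(x = dtm_train, y = actual_train, method = "svmLinearWeights2",
                    trControl = trctrl_cv, tuneGrid = grid)
m_svm_tuned$bestTune  # the best parameter combination found
```

Note that cross-validated tuning retrains the model once per fold per parameter combination, so this can take a while on a full dfm.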