Supervised Text Classification
Wouter van Atteveldt & Kasper Welbers April 2019
- Training the model using quanteda
- Aside: Scaling with Quanteda
- Using Caret
This handout contains a very brief introduction to using supervised machine learning for text classification in R. Check out https://cran.r-project.org/web/packages/caret/vignettes/caret.html for a more extensive tutorial.
We will use:
- quanteda for text processing and some machine learning, and
- tidyverse for general data cleaning.
We also use:
- caret for more machine learning options. This probably requires R version 3.5. If you have trouble installing it, you can still follow the quanteda part of this tutorial. The first (commented) line below installs caret plus the packages needed for the actual model training.

#install.packages(c("caret", "e1071", "LiblineaR"))
library(quanteda)
library(tidyverse)
library(caret)
For machine learning, we need annotated training data. Fortunately, there are many review data sets freely available. This is an example using movie reviews:
download.file("http://i.amcat.nl/data_corpus_movies.rda", "data_corpus_movies.rda")
load("data_corpus_movies.rda")
reviews = data_corpus_movies
Training and test data
We split off a test set for testing performance (using set.seed for reproducibility), and create subsets of the corpus for training and testing:
set.seed(1)
testset = sample(docnames(reviews), 500)
reviews_test = reviews %>% corpus_subset(docnames(reviews) %in% testset)
reviews_train = reviews %>% corpus_subset(!docnames(reviews) %in% testset)
actual_train = as.factor(docvars(reviews_train, "Sentiment"))
actual_test = as.factor(docvars(reviews_test, "Sentiment"))
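As a quick sanity check (a minimal sketch, not part of the pipeline), we can verify that the two subsets are disjoint and together cover the whole corpus:

```r
# The training and test sets should not share any documents...
length(intersect(docnames(reviews_train), docnames(reviews_test)))  # should be 0
# ...and together they should cover the full corpus
ndoc(reviews_train) + ndoc(reviews_test) == ndoc(reviews)  # should be TRUE
```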
Training the model using quanteda
To prepare the data, we need to create matrices for the training and test data.
Now, we create the training dfm:
dfm_train = reviews_train %>%
  dfm(stem = TRUE) %>%
  dfm_select(min_nchar = 2) %>%
  dfm_trim(min_docfreq = 10)
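It can be useful to inspect the result before training; this is just an inspection step (not required for the model), showing how many documents and features survive stemming, selection, and trimming:

```r
dfm_train                        # prints the dimensions and sparsity of the matrix
head(featnames(dfm_train), 10)   # a sample of the (stemmed) features
topfeatures(dfm_train)           # the most frequent features
```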
And train the model:
m_nb <- textmodel_nb(dfm_train, actual_train)
summary(m_nb)
Testing the model
To see how well the model does, we test it on the test data. For this, it is important that the test data uses the same features (vocabulary) as the training data: the model contains parameters for these features, not for words that only occur in the test data.
dfm_test <- reviews_test %>%
  dfm(stem = TRUE) %>%
  dfm_match(featnames(dfm_train))
nb_pred <- predict(m_nb, newdata = dfm_test)
head(nb_pred)
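We can check that the matching worked and that the test matrix now has exactly the training vocabulary (a quick sanity check, not part of the pipeline):

```r
# After dfm_match, the test features should line up exactly with the training features
all(featnames(dfm_test) == featnames(dfm_train))  # should be TRUE
ncol(dfm_test) == ncol(dfm_train)                 # should be TRUE
```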
To see how well we do, we compare the predicted sentiment to the actual sentiment:
mean(nb_pred == actual_test)
So (at least on my computer), 81% accuracy. Not bad for a first try -- but movie reviews are quite simple; this will be a lot harder for most political text...
We can use the regular table command to create a cross-table, often called a 'confusion matrix' (as it shows what kind of errors the model makes):
confusion_matrix = table(actual_test, nb_pred)
confusion_matrix
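From this table we can also compute the standard metrics by hand. A minimal sketch, assuming (as in the table call above) that rows are the actual classes and columns the predictions:

```r
# Precision: of all documents predicted to be a class, how many really were?
precision = diag(confusion_matrix) / colSums(confusion_matrix)
# Recall: of all documents actually in a class, how many did we find?
recall = diag(confusion_matrix) / rowSums(confusion_matrix)
# F1: the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
round(rbind(precision, recall, f1), 2)
```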
The caret package has a function that can be used to produce the 'regular' metrics from this table:
confusionMatrix(nb_pred, actual_test, mode = "everything")
We can also inspect the actual parameters assigned to the words, to see which words are indicative of bad and good movies:
scores = t(m_nb$PcGw) %>% as_tibble(rownames = "word")
scores %>% arrange(-neg)
Interestingly, the most negative words seem more indicative of genres than of evaluations (and after installing spacyr on more Windows computers than I care to remember, I think the negative value for anaconda is certainly understandable...)
scores %>% arrange(-pos)
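Since PcGw contains P(class | word), the neg and pos columns sum to one for each word, and an uninformative word scores around 0.5 on both. A sketch (using dplyr verbs, loaded via the tidyverse) of ranking words by how far they are from that midpoint, i.e. how discriminative they are in either direction:

```r
scores %>%
  mutate(diff = pos - neg) %>%  # > 0: indicative of positive reviews; < 0: of negative
  arrange(-abs(diff)) %>%       # most discriminative words (either direction) first
  head(10)
```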
Aside: Scaling with Quanteda
Quanteda also allows for supervised and unsupervised scaling. Although wordfish is an unsupervised method (so it doesn't really belong in this tutorial), it produces a nice visualization:
library(magrittr)
set.seed(1)
m_wf <- textmodel_wordfish(dfm_train, sparse = TRUE)
topwords = c(scores %>% arrange(-pos) %$% head(word, 10),
             scores %>% arrange(-neg) %$% head(word, 10))
textplot_scale1d(m_wf, margin = "features",
                 highlighted = c(topwords, "coen", "scorses", "paltrow", "shakespear"))
(note the use of %$% to expose the columns directly to the next function without passing on the whole tibble)
I highlighted the most positive and negative words according to naive bayes (and some other words). The most positive/negative words are interestingly located mostly in the center of the 2-dimensional scaling. So the (unsupervised) scaling captures genre more than sentiment, it seems.
Using Caret
Finally, let's use the caret library to train and test some models. First, we set the train control method to "none", as we don't want to do any resampling (like cross-validation) yet:
trctrl = trainControl(method = "none")
dtm_train = convert(dfm_train, to = 'matrix')
dtm_test = convert(dfm_test, to = 'matrix')
We show two algorithms here, but caret can be used to train a very large number of different models. Note that caret doesn't include most algorithms, so you may need to install additional packages. Often, you also need to check the documentation for these packages (referenced in the caret docs) to understand exactly what the model does and what the hyperparameters are. See https://topepo.github.io/caret/train-models-by-tag.html for more information.
Train a simple SVM from the LiblineaR package, setting the hyperparameters to (hopefully) sensible defaults:
set.seed(1)
m_svm = train(x = dtm_train, y = actual_train,
              method = "svmLinearWeights2", trControl = trctrl,
              tuneGrid = data.frame(cost = 1, Loss = 0, weight = 1))
svm_pred = predict(m_svm, newdata = dtm_test)
confusionMatrix(svm_pred, actual_test)
Note: For more information on the algorithm, including the meaning of the parameters and how to tune them, you need to consult the documentation of the underlying package. The caret documentation linked above will tell you which package is used (in this case: LiblineaR), and that package will contain a more technical explanation of the algorithm, generally including examples and references.
Train a simple neural network (using nnet), choosing a hidden layer of a single unit and a small decay parameter. Note that we need to set MaxNWts to at least the total number of weights in the network, which is roughly the number of features times the size of the hidden layer (plus bias and output weights):
set.seed(1)
# MaxNWts must be at least (nfeatures+1)*size + (size+1)*noutputs; 6000 suffices here
m_nn = train(x = dtm_train, y = actual_train, method = "nnet",
             trControl = trctrl, tuneGrid = data.frame(size = 1, decay = 5e-4),
             MaxNWts = 6000)
nn_pred <- predict(m_nn, newdata = dtm_test)
confusionMatrix(nn_pred, actual_test)
Most algorithms have (hyper)parameters that need to be tuned, like the misclassification cost in an SVM or the number and size of hidden layers in a neural network. There are often no good theoretical grounds for setting these, so the best you can do is try a lot of values and take the best.
You can do this yourself, but caret also has built-in functions to do an automatic grid search. For this, set the tuneGrid to multiple values per parameter, and choose a different trainControl method, like cross-validation. See https://topepo.github.io/caret/random-hyperparameter-search.html for more information, and https://cran.r-project.org/web/packages/caret/vignettes/caret.html for a good tutorial.