## Load Libraries and Set Seed

In [1]:
library("text2vec")
library("glmnet")
library("slam")

set.seed(1528)

Loading required package: Matrix

Loaded glmnet 4.1-4



## Initial Processing

### Load Data

First, we load the entire dataset of movie reviews from the `alldata.tsv` file. We then clean the `review` column by replacing every HTML tag (anything enclosed in `<>`) with a space.

In [2]:
data = read.table("alldata.tsv",
                  stringsAsFactors = FALSE,
                  header = TRUE)
data$review = gsub('<.*?>', ' ', data$review)
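As a small illustration of why the pattern uses the lazy quantifier `.*?`, the sketch below applies the same `gsub` call to a made-up review string (the text is hypothetical, not taken from `alldata.tsv`) and compares it with a greedy pattern.

```r
# Hypothetical review text with inline HTML tags (not from the dataset).
sample_review = "A <b>great</b> movie.<br />Loved it."

# Lazy match: each tag is matched separately and replaced by a space.
cleaned = gsub("<.*?>", " ", sample_review)
print(cleaned)

# Greedy match: everything from the first '<' to the last '>' is removed,
# which also deletes the words between the tags.
greedy = gsub("<.*>", " ", sample_review)
print(greedy)
```

The lazy version keeps the words between tags, which is what we want for the reviews.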

### Initial Document-Term Matrix

We use the R package `text2vec` to construct the document-term ($DT$) matrix, allowing n-grams of up to 4 words. As a preprocessing step, we lowercase all the text and use `word_tokenizer` to tokenize each review. When creating the vocabulary, we filter out a pre-defined set of stop words. We then prune the vocabulary to drop rare terms (fewer than 10 occurrences over all documents) and terms that appear in fewer than $0.1\%$ or more than $50\%$ of documents.

In [3]:
it_data = itoken(data$review,
                 preprocessor = tolower,
                 tokenizer = word_tokenizer)
stop_words = c("i", "me", "my", "myself", 
               "we", "our", "ours", "ourselves", 
               "you", "your", "yours", 
               "their", "they", "his", "her", 
               "she", "he", "a", "an", "and",
               "is", "was", "are", "were", 
               "him", "himself", "has", "have", 
               "it", "its", "the", "us")
tmp.vocab = create_vocabulary(it_data,
                              stopwords = stop_words, 
                              ngram = c(1L, 4L))
tmp.vocab = prune_vocabulary(tmp.vocab, term_count_min = 10,
                             doc_proportion_max = 0.5,
                             doc_proportion_min = 0.001)
dtm_data = create_dtm(it_data, vocab_vectorizer(tmp.vocab))

as(<dgTMatrix>, "dgCMatrix") is deprecated since Matrix 1.5-0; do as(., "CsparseMatrix") instead



As expected, the vocabulary size (i.e., the number of columns of `dtm_data`) is greater than 30,000, which is larger than the sample size `n = 25000`.

In [4]:
dim(dtm_data)
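The pruning rule used when building the vocabulary can be sketched in base R on a toy count matrix. This is only an illustration of the idea, not `text2vec`'s internal implementation: the terms, counts, and scaled-down thresholds are made up, and treating the bounds as inclusive is an assumption.

```r
# Toy document-term count matrix: 4 documents (rows) x 3 terms (columns).
# Terms and counts are made up for illustration.
counts = matrix(c(1, 0, 2,
                  1, 0, 0,
                  1, 0, 0,
                  1, 3, 0),
                nrow = 4, byrow = TRUE,
                dimnames = list(NULL, c("movie", "masterpiece", "boring")))

term_count = colSums(counts)            # total occurrences of each term
doc_proportion = colMeans(counts > 0)   # share of documents containing each term

# Hypothetical thresholds scaled down for the toy data
# (the notebook uses 10, 0.001, and 0.5).
keep = term_count >= 2 &
  doc_proportion >= 0.25 &
  doc_proportion <= 0.5

kept_terms = colnames(counts)[keep]
print(kept_terms)
```

Here `movie` is dropped because it occurs in every document, which mirrors how very common terms are removed by `doc_proportion_max`.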

## Improve Interpretability

To improve the interpretability of the vocabulary, we apply a simple screening method based on the **two-sample t-test**. We split the reviews into two groups: (1) *positive* and (2) *negative*. For each term, we compute the $t$-statistic comparing its mean count across the two groups, and we keep only the top $k$ terms (here, $k = 2000$) with the largest absolute $t$-statistics. The hope is that the retained terms are the most strongly "positive" or "negative" ones, so that they contribute more meaningfully to the predicted sentiment and make the final model more interpretable.

In [6]:
v.size = dim(dtm_data)[2]
labels = data$sentiment

summ = matrix(0, nrow=v.size, ncol=4)
summ[, 1] = colapply_simple_triplet_matrix(
  as.simple_triplet_matrix(dtm_data[labels==1,]), mean)
summ[, 2] = colapply_simple_triplet_matrix(
  as.simple_triplet_matrix(dtm_data[labels==1,]), var)
summ[, 3] = colapply_simple_triplet_matrix(
  as.simple_triplet_matrix(dtm_data[labels==0,]), mean)
summ[, 4] = colapply_simple_triplet_matrix(
  as.simple_triplet_matrix(dtm_data[labels==0,]), var)

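n1 = sum(labels)      # number of positive reviews
n = length(labels)
n0 = n - n1           # number of negative reviews

# Two-sample t-statistic (unequal variances) for every term
myp = (summ[, 1] - summ[, 3]) / sqrt(summ[, 2] / n1 + summ[, 4] / n0)
# Keep the 2000 terms with the largest absolute t-statistic
id = order(abs(myp), decreasing=TRUE)[1:2000]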
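For a single term, the statistic stored in `myp` is the unequal-variance (Welch) two-sample $t$-statistic. The sketch below checks this on simulated counts; the data and the names `x_pos` and `x_neg` are hypothetical and only serve to compare the manual formula with base R's `t.test`.

```r
set.seed(1)
# Hypothetical per-document counts of one term in each group.
x_pos = rpois(30, lambda = 2)   # positive reviews
x_neg = rpois(40, lambda = 1)   # negative reviews

n1 = length(x_pos)
n0 = length(x_neg)

# Same formula as `myp` in the notebook, applied to a single term.
t_manual = (mean(x_pos) - mean(x_neg)) /
  sqrt(var(x_pos) / n1 + var(x_neg) / n0)

# t.test() uses the unequal-variance (Welch) test by default.
t_builtin = unname(t.test(x_pos, x_neg)$statistic)

print(c(manual = t_manual, builtin = t_builtin))
```

The two values agree, so ranking terms by `abs(myp)` is the same as ranking them by the magnitude of their Welch $t$-statistic.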

Some terms left out of the selection above could still be useful and aid interpretability. We therefore also consider terms that never appear in the positive reviews and, similarly, terms that never appear in the negative reviews. We detect them as terms whose count has zero variance within the corresponding group.

In [7]:
id1 = which(summ[, 2] == 0)
id0 = which(summ[, 4] == 0)
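The zero-variance check can be illustrated on a toy matrix of counts from one group. The terms and counts below are made up. Note that the condition strictly flags any term whose count is constant within the group, which for sparse review counts will usually mean the term is absent.

```r
# Toy counts of three made-up terms in the positive-review documents only.
pos_counts = matrix(c(0, 1, 2,
                      0, 1, 0,
                      0, 1, 3),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(NULL, c("dreadful", "film", "superb")))

# Per-term variance within the group.
pos_var = apply(pos_counts, 2, var)

# Terms with zero variance: "dreadful" never appears, while "film"
# appears exactly once in every document.
zero_var_terms = names(which(pos_var == 0))
print(zero_var_terms)
```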

Then, our chosen vocabulary would be the union of the above selections, `id`, `id0`, and `id1`.

In [9]:
words = colnames(dtm_data)
myvocab = words[union(id1, union(id, id0))]

At this point, the chosen vocabulary contains at least 2000 terms, so we apply a further selection step to reduce its size.

In [11]:
length(myvocab)

## Size Reduction using Lasso

We can utilize Lasso (with logistic regression) as a variable selector to reduce the size of our vocabulary.

In [12]:
it_data = itoken(data$review,
                 preprocessor = tolower,
                 tokenizer = word_tokenizer)
dtm_data = create_dtm(it_data,
                      vocab_vectorizer(create_vocabulary(myvocab,
                                                         ngram = c(1L, 4L))))

tmpfit = glmnet(x = dtm_data,
                y = data$sentiment, 
                alpha = 1,
                family='binomial')

The `glmnet` output `tmpfit` contains 98 sets of estimated $\beta$ values corresponding to 98 different lambda values. In particular, `tmpfit$df` gives the number of non-zero $\beta$ values (i.e., `df`) for each of the 98 estimates. Since we are interested in a vocabulary of fewer than 1000 terms, we choose the largest `df` among those below 1000 (here, the 42nd column) and store the corresponding terms (here, 983) in `myvocab`.

In [20]:
# Last column along the lambda path whose df is still below 1000
largest_idx = max(which(tmpfit$df < 1000))
largest_idx

myvocab = colnames(dtm_data)[which(tmpfit$beta[, largest_idx] != 0)]
length(myvocab)
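Counting the entries of `df` below 1000 gives the intended column only when `df` never decreases along the lambda path. The sketch below uses a made-up `df` sequence (not the values from `tmpfit`) to show how a count-based index and a position-based index can differ when `df` dips back under the threshold.

```r
# Hypothetical `df` sequence along a lambda path (not the notebook's values).
df_path = c(0, 12, 150, 640, 983, 1004, 998, 1310)

count_idx = length(df_path[df_path < 1000])   # number of entries below 1000
last_idx = max(which(df_path < 1000))         # last position below 1000

print(c(count_idx = count_idx, last_idx = last_idx))
print(df_path[c(count_idx, last_idx)])
```

With this sequence, the count-based index lands on a column whose `df` is above 1000, while the position-based index stays below it. When `df` is non-decreasing, both give the same column.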

## Save Vocabulary to File

Now, let's save the final vocabulary to a file (`myvocab.txt`), with each term in the vocabulary on a separate line.

In [21]:
write.table(myvocab, file = "myvocab.txt",
            quote = FALSE, row.names = FALSE, col.names = FALSE,
            sep = "\n")
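As a quick sanity check, a vocabulary written this way can be read back with `readLines`. The sketch below does a round trip through a temporary file; the terms are made up, and the underscore-joined n-grams assume the default n-gram separator of `text2vec`.

```r
# Made-up vocabulary, including n-grams joined by "_".
vocab = c("great", "not_good", "waste_of_time")

# Write to a temporary file using the same arguments as above.
path = tempfile(fileext = ".txt")
write.table(vocab, file = path,
            quote = FALSE, row.names = FALSE, col.names = FALSE,
            sep = "\n")

# Read the file back, one term per line.
vocab_back = readLines(path)
print(identical(vocab, vocab_back))
```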