In [16]:
!pip3 install nltk scikit-learn pandas



After installing, you need to import (activate) the packages every
session:

## Python code

In [17]:
# General packages and dictionary analysis
import os
import tarfile
import bz2
import urllib.request
import re
import pickle
import nltk
import joblib
import requests
import pandas as pd
import numpy as np
from nltk.tokenize import TreebankWordTokenizer
import matplotlib.pyplot as plt

# Supervised text classification
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
# import eli5
from nltk.sentiment import vader
from nltk.sentiment.vader import SentimentIntensityAnalyzer


In earlier chapters, you learned about both supervised and unsupervised
machine learning as well as about dealing with texts. This chapter brings
together these elements and discusses how to combine them to
automatically analyze large corpora of texts. After presenting
guidelines for choosing an appropriate approach in
<a href="#sec-deciding" class="quarto-xref">Section 2.6</a> and
downloading an example dataset in
<a href="#sec-reviewdataset" class="quarto-xref">Section 2.7</a>, we
discuss multiple techniques in detail. We begin with a very simple
top-down approach in
<a href="#sec-dictionary" class="quarto-xref">Section 2.10</a>, in which
we count occurrences of words from an *a priori* defined list of words.
In <a href="#sec-supervised" class="quarto-xref">Section 2.14</a>, we
still use pre-defined categories that we want to code, but let the
machine “learn” the rules of the coding itself. Finally, in
<a href="#sec-unsupervised" class="quarto-xref">Section 2.24</a>, we
employ a bottom-up approach in which we do not use any *a priori*
defined lists or coding schemes, but inductively extract topics from our
data.

## Deciding on the Right Method

When thinking about the computational analysis of texts, it is important
to realize that there is no single method that is always the right one. While
there are good choices and bad choices, we also cannot say that one
method is necessarily and always superior to another. Some methods are
more fashionable than others. For instance, there has been a growing
interest in topic models (see
<a href="#sec-unsupervised" class="quarto-xref">Section 2.24</a>) in the
past few years. While there are indeed very good applications for such
models, they are also sometimes applied to research questions and/or data
where they make much less sense. As always, the choice of method should follow
the research question and not the other way round. We therefore caution
you about reading
<a href="#sec-chap-text" class="quarto-xref">Section 1</a> selectively
because you want, for instance, to learn about supervised machine
learning or about unsupervised topic models. Instead, you should be
aware of very different approaches to make an informed decision on what
to use when.

@Boumans2016 provide useful guidelines for this. They place automatic
text analysis approaches on a continuum from deductive (or top-down) to
inductive (or bottom-up). At the deductive end of the spectrum, they
place dictionary approaches
(<a href="#sec-dictionary" class="quarto-xref">Section 2.10</a>). Here,
the researcher has strong *a priori* (theoretical) assumptions (for
instance, which topics exist in a news data set; or which words are
positive or negative) and can compile lists of words or rules based on
these assumptions. The computer then only needs to execute these rules.
At the inductive end of the spectrum, in contrast, lie approaches such
as topic models
(<a href="#sec-unsupervised" class="quarto-xref">Section 2.24</a>) where
few or no *a priori* assumptions are made, and where we exploratively
look for patterns in the data. Here, we typically do not know which
topics exist in advance. Supervised approaches
(<a href="#sec-supervised" class="quarto-xref">Section 2.14</a>) can be
placed in between: here, we do define categories *a priori* (we do know
which topics exist, and given an article, we know to which topic it
belongs), but we do not have any set of rules: we do not know which
words to look for or which exact rules to follow. These rules are to be
“learned” by the computer from the data.

Before we get into the details and implementations, let us discuss some
use cases of the three main approaches for the computational analysis of
text: dictionary (or rule-based) approaches, supervised machine
learning, and unsupervised machine learning.

Dictionary approaches excel under three conditions. First, the variable
we want to code is *manifest and concrete* rather than *latent and
abstract*: names of actors, specific physical objects, specific phrases,
etc., rather than feelings, frames, or topics. Second, all synonyms to
be included must be known beforehand. And third, the dictionary entries
must not have multiple meanings. For instance, coding for how often gun
control is mentioned in political speeches fits these criteria. There
are only so many ways to talk about it, and it is rather unlikely that
speeches about other topics contain a phrase like “gun control”.
Similarly, if we want to find references to Angela Merkel, Donald Trump,
or any other well-known politician, we can just directly search for
their names – even though problems arise when people have very common
surnames and are referred to by their surnames only.

Sadly, most interesting concepts are more complex to code. Take a
seemingly straightforward problem: distinguishing whether a news article
is about the economy or not. This is really easy to do for humans: there
may be some edge cases, but in general, people rarely need longer than a
few seconds to grasp whether an article is about the economy rather than
about sports, culture, etc. Yet, many of these articles won’t directly
state that they are about the economy by explicitly using the word
“economy”.

We may think of extending our dictionary not only with `econom.+` (a
regular expression that includes economists, economic, and so on), but
also come up with other words like “stock exchange”, “market”,
“company.” Unfortunately, we will quickly run into a problem that we
also faced when we discussed the precision-recall trade-off in
**?@sec-validation**: the more terms we add to our dictionary, the more
false positives we will get: articles about the geographical space
called “market”, about some celebrity being seen in “company” of someone
else, and so on.
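
The trade-off can be made concrete with a minimal sketch. The headlines and their manual labels below are hypothetical toy data, and the pattern uses `econom\w+` as a slightly tighter variant of the `econom.+` expression mentioned above; the point is only to show how a broader dictionary tends to gain recall while losing precision.

```python
import re

# Hypothetical toy headlines; the second element is a manual judgment
# of whether the text is about the economy.
docs = [
    ("Economists expect slower growth next year", True),
    ("Stock exchange closes at record high", True),
    ("Farmers market opens on the town square", False),
    ("Actor spotted in company of a mystery guest", False),
    ("Central bank raises interest rates", True),
]

# A narrow dictionary (one stem) and a broader one (stem plus extra terms)
narrow = re.compile(r"\beconom\w+", re.IGNORECASE)
broad = re.compile(r"\b(econom\w+|stock exchange|market|company)\b", re.IGNORECASE)


def evaluate(pattern):
    """Return (precision, recall) of a regex 'dictionary' on the toy data."""
    hits = [bool(pattern.search(text)) for text, _ in docs]
    truth = [label for _, label in docs]
    tp = sum(h and t for h, t in zip(hits, truth))
    fp = sum(h and not t for h, t in zip(hits, truth))
    fn = sum(t and not h for h, t in zip(hits, truth))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


narrow_scores = evaluate(narrow)
broad_scores = evaluate(broad)
print("narrow (precision, recall):", narrow_scores)
print("broad  (precision, recall):", broad_scores)
```

On this toy data, the broad pattern finds more of the economic headlines but also flags the "market" and "company" headlines, which is exactly the kind of false positive described above.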

From this example, we can conclude that often (1) it is easy for humans
to decide to which class a text belongs, but (2) it is very hard for
humans to come up with a list of words (or rules) on which their
judgment is based. Such a situation is perfect for applying supervised
machine learning: after all, it won’t take us much time to annotate,
say, 1000 articles based on whether they are about the economy or not
(probably this takes less time than thoroughly fine-tuning a list of
words to include or exclude); and the difficult part, deciding on the
exact rules underlying the decision to classify an article as economic,
is done by the computer in seconds. Supervised machine learning,
therefore, has replaced dictionary approaches in many areas.

Both dictionary (or rule-based) approaches and supervised machine
learning assume that you know in advance which categories (positive
versus negative; sports versus economy versus politics; …) exist. The
big strength of unsupervised approaches such as topic models is that you
can also apply them without this knowledge. They therefore allow you to
find patterns in data that you did not expect and can generate new
insights. This makes them particularly suitable for explorative research
questions. Using them for confirmatory tests, in contrast, is less
defensible: after all, if we are interested in knowing whether, say,
news site A published more about the economy than news site B, then it
would be a bit weird to pretend not to know that the topic “economy”
exists. In practice, mapping the topics that the topic
model produces onto such *a priori* existing categories can also be
challenging.

Despite all differences, all approaches share one requirement: you need
to “Validate. Validate. Validate” \[@Grimmer2013\]. Though it has been
done in the past, simply applying a dictionary without comparing the
performance to manual coding of the same concepts is not acceptable;
neither is using a supervised machine learning classifier without doing
the same; or blindly trusting a topic model without at least manually
checking whether the scores the model assigns to documents really
capture what the documents are about.

## Obtaining a Review Dataset

For the sections on dictionary and supervised approaches we will use a
dataset of movie reviews from the IMDB database \[@aclimdb\]. This
dataset is published as a compressed set of folders, with separate
folders for the train and test datasets and subfolders for positive and
negative reviews. Lots of other review datasets are available online,
for example for Amazon review data
([jmcauley.ucsd.edu/data/amazon/](https://jmcauley.ucsd.edu/data/amazon/)).

The IMDB dataset we will use is a relatively large file and it requires
a bit of processing, so it is smart to *cache* the data rather than
downloading and processing it every time you need it. This is done in
**?@exm-reviewdata**, which also serves as a nice example of how to
download and process files. Both R and Python follow the same basic
pattern. First, we check whether the cached file exists, and if it does
we read the data from that file. For R, we use the standard *RDS*
format, while for Python we use a compressed *pickle* file. The format
of the data is also slightly different, following the convention for
each language: In R we use the data frame returned by `readtext`, which
can read files from a folder or zip archive and return a data frame
containing one text per row. In Python, we have separate lists for the
train and test datasets and for the full texts and labels: `text_train`
are the training texts and `y_train` are the corresponding labels.

Downloading and caching IMDB review data.

## Python code

In [9]:
filename = "reviewdata.pickle.bz2"
if os.path.exists(filename):
    print(f"Using cached file {filename}")
    with bz2.BZ2File(filename, "r") as zipfile:
        data = pickle.load(zipfile)
    text_train, text_test, y_train, y_test = data
else:
    url = "https://cssbook.net/d/aclImdb_v1.tar.gz"
    print(f"Downloading from {url}")
    fn, _headers = urllib.request.urlretrieve(url, filename=None)
    t = tarfile.open(fn, mode="r:gz")
    text_train, text_test = [], []
    y_train, y_test = [], []
    for f in t.getmembers():
        m = re.match(r"aclImdb/(\w+)/(pos|neg)/", f.name)
        if not m:
            # skip folder names, other categories
            continue
        dataset, label = m.groups()
        text = t.extractfile(f).read().decode("utf-8")
        if dataset == "train":
            text_train.append(text)
            y_train.append(label)
        elif dataset == "test":
            text_test.append(text)
            y_test.append(label)
    data = text_train, text_test, y_train, y_test
    print(f"Saving to {filename}")
    with bz2.BZ2File(filename, "w") as zipfile:
        pickle.dump(data, zipfile)

Using cached file reviewdata.pickle.bz2


If the cached data file does not exist yet, the file is downloaded from
the Internet. In R, we then extract the file and call `readtext` on the
resulting folder. This automatically creates columns for the subfolders,
so in this case for the dataset and label. After this, we remove the
download file and the extracted folder, clean up the `reviewdata`, and
save it to the `reviewdata.rds` file. In Python, we can extract files
from the downloaded file directly, so we do not need to explicitly
extract it. We loop over all files in the archive, and use a regular
expression to select only text files and extract the label and dataset
name (see **?@sec-regular** for more information about regular
expressions). Then, we extract the text from the archive, and add the
text and the label to the appropriate list. Finally, the data is saved
as a compressed pickle file, so the next time we run this cell it does
not need to download the file again.

## Dictionary Approaches to Text Analysis

A straightforward way to automatically analyze text is to compile a list
of terms you are interested in and simply count how often they occur in
each document. For example, if you are interested in finding out whether
mentions of political parties in news articles change over the years,
you only need to compile a list of all party names and write a small
script to count them.
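
Such a counting script could look roughly like the sketch below. The party names, the articles, and the `year` field are hypothetical toy data chosen for illustration; a real analysis would read the articles from a corpus and probably needs a longer list of name variants.

```python
import re
from collections import Counter

# Hypothetical party names and toy articles, purely for illustration
parties = ["Labour", "Conservatives", "Greens"]
articles = [
    {"year": 2019, "text": "Labour and the Greens criticised the budget. Labour called for a vote."},
    {"year": 2020, "text": "The Conservatives defended the plan against the Greens."},
]

# One case-sensitive pattern with word boundaries, so that a word that merely
# starts with a party name (e.g. "Greensboro") is not counted
pattern = re.compile(r"\b(" + "|".join(map(re.escape, parties)) + r")\b")

counts_per_year = {}
for article in articles:
    counts = counts_per_year.setdefault(article["year"], Counter())
    counts.update(pattern.findall(article["text"]))

for year, counts in sorted(counts_per_year.items()):
    print(year, dict(counts))
```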

Historically, this is how sentiment analysis was done. Example
**?@exm-sentsimple** shows how to do a simple sentiment analysis based
on a list of positive and negative words. The logic is straightforward:
you count how often each positive word occurs in a text, you do the same
for the negative words, and then determine which occur more often.

Different approaches to a simple dictionary-based sentiment analysis:
counting and summing all words using a for-loop over all reviews
(Python) versus constructing a term-document matrix and looking up the
words in there (R). Note that both approaches would be possible in
either language.

## Python code

In [18]:
poswords = "https://cssbook.net/d/positive.txt"
negwords = "https://cssbook.net/d/negative.txt"
pos = set(requests.get(poswords).text.split("\n"))
neg = set(requests.get(negwords).text.split("\n"))
sentimentdict = {word: +1 for word in pos}
sentimentdict.update({word: -1 for word in neg})

scores = []
mytokenizer = TreebankWordTokenizer()
# For speed, we only take the first 100 reviews
for review in text_train[:100]:
    words = mytokenizer.tokenize(review)
    # we look up each word in the sentiment dict
    # and assign its value (with default 0)
    scores.append(sum(sentimentdict.get(word, 0) for word in words))
print(scores)

[-3, -4, 1, 3, -2, -7, -6, 9, 7, 7, 10, 5, -1, 2, 7, -4, 2, 21, 1, -1, 2, -3, -2, -11, -2, -3, -7, 2, 4, -22, 5, 4, 3, -5, -8, 1, -1, 0, 1, 8, 0, -4, 3, -7, -11, -6, 0, 3, -1, 0, 6, -1, -8, 7, -5, 2, 10, 5, 5, 1, 0, 7, 0, 0, 5, 1, -8, 4, 3, 18, 2, 0, -3, -2, 5, 0, -2, 1, 1, 12, -3, -4, -6, -2, 2, -7, -1, -10, -5, 3, 4, -3, -17, 1, -1, 7, -3, 4, 12, 3]


As you may already realize, there are a lot of downsides to this
approach. Most notably, our bag-of-words approach does not allow us to
account for negation: “not good” will be counted as positive. Relatedly,
we cannot handle modifiers such as “very good”. Also, all words are
either positive or negative, while “great” should be more positive than
“good”. More advanced dictionary-based sentiment analysis packages like
Vader \[@Hutto2014\] or SentiStrength \[@Thelwall2012\] include such
functionalities. Yet, as we will discuss in Section
<a href="#sec-supervised" class="quarto-xref">2.14</a>, these
off-the-shelf packages also perform very poorly in many sentiment analysis
tasks, especially outside of the domains they were developed for.
Dictionary-based sentiment analysis has been shown to be problematic
when analyzing news content (e.g. @Gonzalez-Bailon2015; @Boukes2019).
Such approaches are problematic when accuracy at the sentence level is
important, but may be satisfactory with longer texts for comparatively easy tasks
such as movie review classification \[@Reagan2017\], where there is
clear ground truth data and the genre convention implies that the whole
text is evaluative and evaluates one object (the film).
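
To give an impression of what handling negation, modifiers, and graded word weights can mean in practice, here is a deliberately crude sketch. The lexicon, the weights, and the rules are made up for this example and are not taken from Vader or SentiStrength, which use far more elaborate rule sets.

```python
# Toy lexicon and rules; hypothetical values, not taken from Vader or SentiStrength
lexicon = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0}
negations = {"not", "never", "no"}
intensifiers = {"very": 1.5, "really": 1.5}


def toy_sentiment(text):
    """Sum word scores, flipping after a negation and scaling after an intensifier."""
    score = 0.0
    modifier = 1.0
    for word in text.lower().split():
        word = word.strip(".,!?")
        if word in negations:
            modifier *= -1.0
        elif word in intensifiers:
            modifier *= intensifiers[word]
        elif word in lexicon:
            score += modifier * lexicon[word]
            modifier = 1.0  # modifiers only apply to the next sentiment word
        else:
            modifier = 1.0  # any other word resets pending modifiers
    return score


for example in ["good", "not good", "very good", "not very good", "great"]:
    print(f"{example!r}: {toy_sentiment(example)}")
```

Even this small rule set misses many cases (for instance, "not a good film" is scored as positive because the article resets the negation), which illustrates why hand-written rules quickly become hard to maintain.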

Still, there are many use cases where dictionary approaches work very
well. Because your list of words can contain anything, not just positive
or negative words, dictionary approaches have been used, for instance,
to measure the use of racist words or swearwords in online fora \[e.g.,
@Tulkens2016\]. Dictionary approaches are simple to understand and
straightforward, which can be a good argument for using them when it is
important that the method is not a black box but fully transparent even
without technical knowledge. Especially when the dictionary already
exists or is easy to create, it is also a very cheap method. However,
this comes at a price: dictionary approaches only perform well when
measuring concepts that are easy to operationalize. To put it bluntly: it’s great
for measuring the visibility of parties or organizations in the news,
but it’s not good for measuring concepts such as emotions or frames.

What gave dictionary approaches a bit of a bad name is that many
researchers applied them without validating them. This is especially
problematic when a dictionary is applied in a slightly different domain
than that for which it was originally made.

If you want to use a dictionary-based approach, we advise the following
procedure:

-   Construct a dictionary based on theoretical considerations and by
    closely reading a sample of example texts.
-   Code some articles manually and compare with the automated
    coding.
-   Improve your dictionary and check again.
-   Manually code a validation dataset of sufficient size. The
    required size depends a bit on how balanced your data is – if
    one code occurs very infrequently, you will need more data.
-   Calculate the agreement. You could use standard intercoder
    reliability measures used in manual content analysis, but we
    would also advise you to calculate precision and recall (see
    Section **?@sec-validation**).

Very extensive dictionaries will have a high recall (it becomes
increasingly unlikely that you “miss” a relevant document), but often
suffer from low precision (more documents will contain one of the words
even though they are irrelevant). Vice versa, a very short dictionary
will often be very precise, but miss a lot of documents. It depends on
your research question where the right balance lies, but to
substantially interpret your results, you need to be able to quantify
the performance of your dictionary-based approach.
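
Quantifying the performance could be sketched as follows. The two lists are hypothetical codes for ten validation documents (1 means the concept is present), and Cohen's kappa is used here only as one example of an intercoder reliability measure that scikit-learn happens to provide.

```python
from sklearn import metrics

# Hypothetical validation sample: 1 = concept present, 0 = concept absent
manual =     [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # human coding (treated as ground truth)
dictionary = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]  # automated dictionary coding

precision = metrics.precision_score(manual, dictionary)
recall = metrics.recall_score(manual, dictionary)
kappa = metrics.cohen_kappa_score(manual, dictionary)

print(f"precision={precision:.2f} recall={recall:.2f} kappa={kappa:.2f}")
```

Here the dictionary flags five documents of which three are correct (precision 0.60), and it finds three of the four relevant documents (recall 0.75).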

## How many documents do you need to calculate agreement with human annotators?

To determine the number of documents one needs to determine the
agreement between a human and a machine, one can follow the same
standards that are recommended for traditional manual content analysis.

For instance, @Krippendorff2004 provides a convenience table to look up
the required sample sizes for determining the agreement between two
human coders (p. 240). @Riffe2019 provide similar suggestions (p. 114).
In short, the sample size depends on the level of statistical
significance the researcher deems acceptable as well as on the
distribution of the data. In an extreme case, if only 5 out of 100 items
are to be coded as $x$, then in a sample of 20 items, such an item may
not even occur. In order to determine agreement between the automated
method and a human, we suggest using the same sample sizes that one would
use for calculating agreement between human coders. For
specific calculations, we refer to content analysis books such as the
two referenced here. To give a very rough ballpark figure (that
shouldn’t replace a careful calculation!), roughly 100 to 200 items will
cover many scenarios (assuming a small number of reasonably balanced
classes).
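
The extreme case mentioned above can be checked with a short calculation. This sketch assumes a simple random sample drawn without replacement, so that the number of rare items in the sample follows a hypergeometric distribution.

```python
from math import comb

population = 100  # items in total
rare = 5          # items that should be coded as x
sample = 20       # items drawn for validation

# Probability that the sample contains no item coded as x at all:
# choose all 20 items from the 95 items that are not x
p_none = comb(population - rare, sample) / comb(population, sample)
print(f"P(no item coded as x in the sample) = {p_none:.3f}")
```

Under these assumptions, roughly one in three samples of 20 items would not contain a single instance of the rare category, which makes it impossible to estimate recall for that category.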

## Supervised Text Analysis: Automatic Classification and Sentiment Analysis

For many applications, there are good reasons to use the dictionary
approach presented in the previous section. First, it is intuitively
understandable and results can – in principle – even be verified by
hand, which can be an advantage when transparency or communicability is
of high importance. Second, it is very easy to use. But as we have
discussed in
<a href="#sec-deciding" class="quarto-xref">Section 2.6</a>, dictionary
approaches in general perform less well the more abstract, non-manifest,
or complex a concept becomes. In the next section, we will make the case
that topics, but also sentiment, are in fact quite complex concepts
that are often hard to capture with dictionaries (or at least, crafting
a custom dictionary would be difficult). For instance, while “positive”
and “negative” seem straightforward categories at first sight, the more
we think about it, the more apparent it becomes how context-dependent they
actually are: in a dataset about the economy and stock market returns,
“increasing” may indicate something positive, whereas in a dataset about
unemployment rates the same word would indicate something negative. Thus,
machine learning can be a more appropriate technique for such tasks.

### Putting Together a Workflow

With the knowledge we gained in previous chapters, it is not difficult
to set up a supervised machine learning classifier to automatically
determine, for instance, the topic of a news article.

Let us recap the building blocks that we need. In
**?@sec-chap-introsml**, you learned how to use different classifiers,
how to evaluate them, and how to choose the best settings. However, in
these examples, we used numerical data as features; now, we have text.
In **?@sec-chap-dtm**, you learned how to turn text into numerical
features. And that’s all we need to get started!

Typical examples for supervised machine learning in the analysis of
communication include the classification of topics \[e.g.,
@Scharkow2011\], frames \[e.g., @Burscher2014\], user characteristics
such as gender or ideology, or sentiment.

Let us consider the case of sentiment analysis in more detail. Classical
sentiment analysis is done with a dictionary approach: you take a list
of positive words, a list of negative words, and count which occur more
frequently. Additionally, one may attach a weight to each word, such
that “perfect” gets a higher weight than “good”, for instance. An
obvious drawback is that these pure bag-of-words approaches cannot cope
with negation (“not good”) and intensifiers (“very good”), which is why
extensions have been developed that take these (and other features, such
as punctuation) into account \[@Thelwall2012; @Hutto2014;
@DeSmedt2012\].

But while available off-the-shelf packages that implement these extended
dictionary-based methods are very easy to use (in fact, they spit out a
sentiment score with one single line of code), it is questionable how
well they work in practice. After all, “sentiment” is not exactly a
clear, manifest concept for which we can enumerate a list of words. It
has been shown that results obtained with multiple of these packages
correlate very poorly with each other and with human annotations
\[@Boukes2019; @Chan2021\].

Consequently, it has been suggested that it is better to use supervised
machine learning to automatically code the sentiment of texts
\[@Gonzalez-Bailon2015; @vermeer2019seeing\]. However, you may need to
annotate documents from your own dataset: training a classifier on, for
instance, movie reviews and then using it to predict sentiment in
political texts violates the assumption that the training set, the test set, and
the unlabeled data that are to be classified are (at least in principle
and approximately) drawn from the same population.

To illustrate the workflow, we will use the ACL IMDB dataset, a large
dataset that consists of a training dataset of 25000 movie reviews (of
which 12500 are positive and 12500 are negative) and an equally sized
test dataset \[@aclimdb\]. It can be downloaded at
[ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz](https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz).

These data do not come in one file, but rather in a set of text files
that are sorted in different folders named after the dataset to which
they belong (`test` or `train`) and their label (`pos` and `neg`). This
means that we cannot simply use a pre-defined function to read them, but
we need to think of a way of reading the content into a data structure
that we can use. This data was loaded in **?@exm-reviewdata** above.

## Sparse versus dense matrices in Python and R

In a document-term matrix, you would typically find a lot of zeros: most
words do *not* appear in any given document. For instance, the reviews
in the IMDB dataset contain more than 100000 unique words. Hence, the
matrix has more than 100000 columns. Yet, most reviews only consist of a
couple of hundred words. As a consequence, more than 99% of the cells in
the table contain a zero. In a sparse matrix, we do not store all these
zeros, but only store the values for cells that actually contain a
value. This drastically reduces the memory needed. But even if you have
a huge amount of memory, this does not solve the issue: in R, the number
of cells in a matrix is limited to 2147483647. It is therefore
impossible to store a matrix with 100000 features and 25000 documents as
a dense matrix. Unfortunately, many models that you can run via *caret*
in R will convert your sparse document-term matrix to a dense matrix,
and hence are effectively only usable for very small datasets. An
alternative is using the *quanteda* package, which does use sparse
matrices throughout. However, at the time of writing this book, quanteda
only provides a very limited number of models. As all of these problems
do not arise in *scikit-learn*, you may want to consider using Python
for many text classification tasks.
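
A small sketch may help to make this tangible. The three documents are toy data, and the size estimate assumes 8 bytes per cell (64-bit integers) and uses the rounded dimensions from the text above rather than the exact dimensions of the IMDB matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus for illustration only
docs = [
    "a wonderful and moving film",
    "dull plot and wooden acting",
    "the soundtrack was wonderful",
]
X = CountVectorizer().fit_transform(docs)  # returns a SciPy sparse matrix

n_cells = X.shape[0] * X.shape[1]
density = X.nnz / n_cells  # share of cells that actually hold a value
print(f"shape={X.shape}, stored values={X.nnz}, density={density:.2f}")

# Back-of-the-envelope size of a dense matrix at the scale mentioned above,
# assuming 8 bytes per cell
cells_imdb = 25_000 * 100_000
dense_gib = cells_imdb * 8 / 1024**3
print(f"{cells_imdb:,} cells, about {dense_gib:.1f} GiB if stored densely")
print("more cells than the R limit of 2147483647:", cells_imdb > 2_147_483_647)
```

Even in this tiny corpus, most cells are empty; with realistic vocabularies the share of non-zero cells drops far below one percent.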

Let us now train our first classifier. We choose a Naïve Bayes
classifier with a simple count vectorizer (**?@exm-imdbbaseline**). In
the Python example, pay attention to the fitting of the vectorizer: we
fit on the training data *and* transform the training data with it, but
we only transform the test data *without re-fitting the vectorizer*.
Fitting, here, includes the decision about which words to include (by
definition, words that are not present in the training data are not
included; but we could also choose additional constraints, such as
excluding very rare or very common words), but also assigning an
(internally used) identifier (variable name) to each word. If we fit the
vectorizer again on the test data, these identifiers would no longer be
compatible with the ones the classifier was trained on. In R, the same
is achieved in a slightly different way: two term-document matrices are
created independently, before they are matched in such a way that only
the features that are present in the training matrix are retained in the
test matrix.

A word that is not present in the training data, but is present in the
test data, is thus ignored. If you want to use the information such
out-of-vocabulary words can entail (e.g., they may be synonyms),
consider using a word embedding approach (see **?@sec-wordembeddings**).
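
The following minimal sketch illustrates this behavior on toy sentences (the `toy_` names are chosen here only to avoid overwriting the variables used in the examples of this chapter). The vocabulary is learned from the training texts only, so words that appear exclusively in the test text get no column.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_train_docs = ["a great film", "a boring film"]
toy_test_docs = ["a great and superb film"]

toy_vectorizer = CountVectorizer()
toy_train = toy_vectorizer.fit_transform(toy_train_docs)  # learns the vocabulary
toy_test = toy_vectorizer.transform(toy_test_docs)        # reuses it, no re-fitting

vocabulary = sorted(toy_vectorizer.vocabulary_)
print(vocabulary)          # columns of both matrices, in this order
print(toy_test.toarray())  # "and" and "superb" are silently dropped
```

Note that the default tokenizer of `CountVectorizer` also ignores one-letter tokens such as "a", which is why the vocabulary contains only three words.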

We do not necessarily expect this first model to be the best classifier
we can come up with, but it provides us with a reasonable baseline. In
fact, even without any further adjustments, it works reasonably well:
precision is higher for positive reviews and recall is higher for
negative reviews (classifying a positive review as negative happens
twice as much as the reverse), but none of the values is concerningly
low.

Training a Naïve Bayes classifier with simple word counts as features

## Python code

In [19]:
vectorizer = CountVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(text_train)
X_test = vectorizer.transform(text_test)

nb = MultinomialNB()
nb.fit(X_train, y_train)

y_pred = nb.predict(X_test)

rep = metrics.classification_report(y_test, y_pred)
print(rep)

              precision    recall  f1-score   support

         neg       0.79      0.88      0.83     12500
         pos       0.86      0.76      0.81     12500

    accuracy                           0.82     25000
   macro avg       0.82      0.82      0.82     25000
weighted avg       0.82      0.82      0.82     25000



### Finding the Best Classifier

Let us start by comparing the two simple classifiers we know (Naïve
Bayes and Logistic Regression; see **?@sec-nb2dnn**) and the two
vectorizers that transform our texts into the two numerical representations
that we know: word counts and `tf.idf` scores (see **?@sec-chap-dtm**).

We can also tune some things in the vectorizer, such as filtering out
stopwords, or specifying a minimum number (or proportion) of documents
in which a word needs to occur in order to be included, or the maximum
number (or proportion) of documents in which it is allowed to occur. For
instance, it could make sense to say that a word that occurs in fewer
than $n=5$ documents is probably a spelling mistake or so unusual that
it just unnecessarily bloats our feature matrix; and on the other hand,
a word that occurs in more than 50% of all
documents is so common that it does not help us to distinguish between
different classes.
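
The effect of these two settings can be sketched on a toy corpus. Because the corpus has only four documents, the thresholds below (`min_df=2`, `max_df=0.75`) are scaled down from the values discussed above and are chosen purely for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the film was great",
    "the film was dull",
    "the plot was great",
    "the acting was superb",
]

full = CountVectorizer().fit(docs)
# min_df=2: keep only words that occur in at least 2 documents
# max_df=0.75: drop words that occur in more than 75% of the documents
pruned = CountVectorizer(min_df=2, max_df=0.75).fit(docs)

print(sorted(full.vocabulary_))
print(sorted(pruned.vocabulary_))
```

In this example, "the" and "was" are removed because they occur in every document, and the words that occur only once are removed as too rare, leaving just "film" and "great".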

We can try all of these things out by hand by just re-running the code
from **?@exm-imdbbaseline** and only changing the line in which the
vectorizer is specified and the line in which the classifier is
specified. However, copy-pasting essentially the same code is generally
not a good idea, as it makes your code unnecessarily long and increases
the likelihood of errors creeping in when you, for instance, need to
apply the same changes to multiple copies of the code. A more elegant
approach is outlined in **?@exm-basiccomparisons**: We define a function
that gives us a short summary of only the output we are interested in,
and then use a for-loop to iterate over all configurations we want to
evaluate, fit them and call the function we defined before. In fact,
with 23 lines of code, we manage to compare four different models, while
we already needed 15 lines (in **?@exm-imdbbaseline**) to evaluate only
one model.

An example of a custom function to give a brief overview of the
performance of four simple vectorizer-classifier combinations.

In [20]:
def short_classification_report(y_test, y_pred):
    print("    \tPrecision\tRecall")
    for label in set(y_pred):
        pr = metrics.precision_score(y_test, y_pred, pos_label=label)
        # named rc rather than re to avoid shadowing the re module
        rc = metrics.recall_score(y_test, y_pred, pos_label=label)
        print(f"{label}:\t{pr:0.2f}\t\t{rc:0.2f}")

In [21]:
configs = [
    ("NB-count", CountVectorizer(min_df=5, max_df=0.5), MultinomialNB()),
    ("NB-TfIdf", TfidfVectorizer(min_df=5, max_df=0.5), MultinomialNB()),
    (
        "LR-Count",
        CountVectorizer(min_df=5, max_df=0.5),
        LogisticRegression(solver="liblinear"),
    ),
    (
        "LR-TfIdf",
        TfidfVectorizer(min_df=5, max_df=0.5),
        LogisticRegression(solver="liblinear"),
    ),
]

for name, vectorizer, classifier in configs:
    print(name)
    X_train = vectorizer.fit_transform(text_train)
    X_test = vectorizer.transform(text_test)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    short_classification_report(y_test, y_pred)
    print("\n")

NB-count
    	Precision	Recall
pos:	0.87		0.77
neg:	0.79		0.88


NB-TfIdf
    	Precision	Recall
pos:	0.87		0.78
neg:	0.80		0.88


LR-Count
    	Precision	Recall
pos:	0.87		0.85
neg:	0.85		0.87


LR-TfIdf
    	Precision	Recall
pos:	0.89		0.88
neg:	0.88		0.89




The output of this little example already gives us quite a bit of
insight into how to tackle our specific classification tasks: first, we
see that a $tf\cdot idf$ representation seems to be slightly but
consistently superior to a count representation (this is often, but not
always, the case). Second, we see that the logistic regression performs
better than the Naïve Bayes classifier (again, this is often, but not
always, the case). In particular, in our case, the logistic regression
improved on the excessive misclassification of positive reviews as
negative, and achieves a very balanced performance.

There may be instances where one nevertheless may want to use a count
vectorizer with a Naïve Bayes classifier instead (especially if it is
too computationally expensive to estimate the other model), but for now,
we may settle on the best performing combination, logistic regression
with a $tf\cdot idf$ vectorizer. You could also try fitting a Support
Vector Machine instead, but we have little reason to believe that our
data isn’t linearly separable, which means that there is little reason
to believe that the SVM will perform better. Given the good performance
we already achieved, we decide to stick to the logistic regression for
now.

We can now go as far as we like, include more models, use
cross-validation and gridsearch (see **?@sec-crossvalidation**), etc.
However, our workflow now consists of *two* steps: fitting/transforming
our input data using a vectorizer, and fitting a classifier. To make
things easier, scikit-learn allows both steps to be combined into a
so-called pipeline (or pipe). **?@exm-basicpipe** shows how the loop in
**?@exm-basiccomparisons** can be rewritten using pipelines (the result
stays the same).

Instead of fitting vectorizer and classifier separately, they can be
combined in a pipeline.

In [22]:
for name, vectorizer, classifier in configs:
    print(name)
    pipe = make_pipeline(vectorizer, classifier)
    pipe.fit(text_train, y_train)
    y_pred = pipe.predict(text_test)
    short_classification_report(y_test, y_pred)
    print("\n")

NB-count
    	Precision	Recall
pos:	0.87		0.77
neg:	0.79		0.88


NB-TfIdf
    	Precision	Recall
pos:	0.87		0.78
neg:	0.80		0.88


LR-Count
    	Precision	Recall
pos:	0.87		0.85
neg:	0.85		0.87


LR-TfIdf
    	Precision	Recall
pos:	0.89		0.88
neg:	0.88		0.89




Such a pipeline lends itself very well to performing a gridsearch.
**?@exm-gridsearchlogreg** gives you an example. With
`LogisticRegression?` and `TfidfVectorizer?` (in Jupyter or IPython), we
can get a list of all possible hyperparameters that we may want to tune.
For instance, these could be the minimum and maximum frequency for words
to be included or whether we want to use only unigrams (single words) or
also bigrams (combinations of two words, see **?@sec-ngram**). For the
logistic regression, it may be the regularization hyperparameter C (the
inverse of the regularization strength: smaller values penalize complex
models more strongly). We can put all values for these parameters that
we want to consider in a dictionary, with a descriptive key (i.e., a
string with the step of the pipeline followed by two underscores and the
name of the hyperparameter) and a list of all values we want to consider
as the corresponding value.
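
If you are unsure which keys are valid, one option is to ask the pipeline itself. The following sketch assumes a pipeline with the step names `vectorizer` and `classifier` (our own choice of names) and lists every key of the form `step__hyperparameter` that `get_params()` reports.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# The step names ("vectorizer", "classifier") are our own choice
pipeline = Pipeline(
    steps=[
        ("vectorizer", TfidfVectorizer()),
        ("classifier", LogisticRegression(solver="liblinear")),
    ]
)

# Every key containing "__" refers to a hyperparameter of one of the steps
tunable = sorted(key for key in pipeline.get_params() if "__" in key)
for key in tunable:
    print(key)
```

Any of the printed keys can be used as a key in the dictionary that is passed to the gridsearch.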

The gridsearch procedure will then estimate all combinations of all
values, using cross-validation (see **?@sec-validation**). In our
example, we have $2 \times 2 \times 2 \times 3 = 24$ different models,
and $24 \text{ models} \times 5 \text{ folds} = 120$ models to estimate.
Hence, it may take you some time to run the code.
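
The arithmetic can be checked with a few lines of standard-library Python. This is only a sketch: the grid below has the same shape as the one discussed here (two values each for three vectorizer settings, three values for C), but the specific values are illustrative.

```python
from itertools import product

# Illustrative grid: 2 x 2 x 2 x 3 candidate values
grid = {
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "vectorizer__max_df": [0.5, 1.0],
    "vectorizer__min_df": [1, 5],
    "classifier__C": [0.01, 1, 100],
}
n_folds = 5

# Every combination of one value per hyperparameter is one candidate model
combinations = list(product(*grid.values()))
n_fits = len(combinations) * n_folds
print(f"{len(combinations)} combinations x {n_folds} folds = {n_fits} fits")
```

Adding a single extra value to any of the lists multiplies the number of fits, which is why large grids quickly become expensive.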

A gridsearch to find the best hyperparameters for a pipeline consisting
of a vectorizer and a classifier. Note that we can tune any parameter
that either the vectorizer or the classifier accepts as an input, not
only the four hyperparameters we chose in this example.

In [23]:
pipeline = Pipeline(
    steps=[
        ("vectorizer", TfidfVectorizer()),
        ("classifier", LogisticRegression(solver="liblinear")),
    ]
)
grid = {
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "vectorizer__max_df": [0.5, 1.0],
    # As an integer count, min_df must be at least 1;
    # recent scikit-learn versions reject 0 and those fits fail
    "vectorizer__min_df": [1, 5],
    "classifier__C": [0.01, 1, 100],
}
search = GridSearchCV(
    estimator=pipeline, n_jobs=-1, param_grid=grid, scoring="accuracy", cv=5
)
search.fit(text_train, y_train)
print(f"Best parameters: {search.best_params_}")
pred = search.predict(text_test)
# The helper prints the report itself and returns None, so no print() here
short_classification_report(y_test, pred)

Best parameters: {'classifier__C': 1, 'vectorizer__max_df': 0.5, 'vectorizer__min_df': 5, 'vectorizer__ngram_range': (1, 2)}
    	Precision	Recall
pos:	0.89		0.90
neg:	0.90		0.89


We see that we could further improve our model to precision and recall
values of about 0.90 by excluding extremely infrequent and extremely
frequent words and by including both unigrams and bigrams (which, we may
speculate, helps us to account for the “not good” versus “not”, “good”
problem). For the regularization hyperparameter, the default of $C=1$
turned out to be the best of the three values we tried.

Let us, just for the sake of it, compare the performance of our model
with an off-the-shelf sentiment analysis package, in this case Vader
\[@Hutto2014\]. For any text, it will directly estimate sentiment scores
(more specifically, a positivity score, a negativity score, a neutrality
score, and a compound measure that combines them), without any need to
have training data. However, as Example **?@exm-vader** shows, such a
method is clearly inferior to a supervised machine learning approach.
While Vader was able to make a choice in all but $n=11$ cases (getting
scores of 0 is a notorious problem in very short texts), precision and
recall are clearly worse than even the simple baseline model we started
with, and much worse than those of the final model we finished with. In
fact, we miss almost half (!) of the negative reviews (recall of 0.54
for the negative class).
There are probably very few applications in the analysis of
communication in which we would find this acceptable. It is important to
highlight that this is not because the off-the-shelf package we chose is
a particularly bad one (on the contrary, it is actually comparatively
good), but because of the inherent limitations of dictionary-based
sentiment analysis.
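
To illustrate what such limitations can look like, here is a deliberately simple word-counting scorer. It is not Vader's algorithm (Vader uses a much larger lexicon and additional rules); the mini-lexicon and the function name `toy_sentiment` are hypothetical and serve only to show two typical failure modes: texts without any lexicon word, and words whose meaning depends on context.

```python
# Hypothetical mini-lexicon; real sentiment dictionaries are far larger
LEXICON = {"great": 1, "good": 1, "awful": -1, "boring": -1}


def toy_sentiment(text):
    """Sum the lexicon scores of all words and map the total to a label."""
    words = text.lower().replace(".", " ").replace(",", " ").split()
    score = sum(LEXICON.get(word, 0) for word in words)
    if score > 0:
        return "pos"
    if score < 0:
        return "neg"
    return "dont know"


examples = [
    "A great film with good acting.",               # plainly positive
    "Two hours of my life I will never get back.",  # negative, no lexicon word
    "Great, if you enjoy falling asleep.",          # ironic use of "great"
]
for text in examples:
    print(f"{toy_sentiment(text)!r:12} {text}")
```

The second text receives a score of 0 and the third is labeled positive, even though a human reader would code both as negative. A supervised classifier can, at least in principle, learn such patterns from annotated examples.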

For the sake of comparison, we calculate how an off-the-shelf sentiment
analysis package would have performed in this task

In [24]:
nltk.download("vader_lexicon")
analyzer = SentimentIntensityAnalyzer()
pred = []
for review in text_test:
    sentiment = analyzer.polarity_scores(review)
    if sentiment["compound"] > 0:
        pred.append("pos")
    elif sentiment["compound"] < 0:
        pred.append("neg")
    else:
        pred.append("dont know")

print(metrics.confusion_matrix(y_test, pred))
print(metrics.classification_report(y_test, pred))

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/damian/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


[[    0     0     0]
 [    6  6706  5788]
 [    5  1748 10747]]
              precision    recall  f1-score   support

   dont know       0.00      0.00      0.00         0
         neg       0.79      0.54      0.64     12500
         pos       0.65      0.86      0.74     12500

    accuracy                           0.70     25000
   macro avg       0.48      0.47      0.46     25000
weighted avg       0.72      0.70      0.69     25000





We need to keep in mind, though, that with this dataset, we chose one of
the easiest sentiment analysis tasks: a set of long, rather formal texts
(compared to informal short social media messages), that evaluate
exactly one entity (one film), and that are not ambiguous at all. Many
applications that communication scientists are interested in are much
less straightforward. Therefore, however tempting it may be to use an
off-the-shelf package, doing so requires a thorough test based on at
least some human-annotated data.

### Using the Model

So far, we have focused on training and evaluating models, almost
forgetting why we were doing this in the first place: to use them to
predict the label for new data that we did not annotate.

Of course, we could always re-train the model when we need to use it –
but that has two downsides: first, as you may have seen, it may actually
take considerable time to train it, and second, you need to have the
training data available, which may be a problem both in terms of storage
space and of copyright and/or privacy if you want to share your
classifier with others.

Therefore, it makes sense to save both our classifier and our vectorizer
to a file, so that we can reload them later (Example **?@exm-reuse**).
Keep in mind that you have to re-use *both* – after all, the columns of
your feature matrix will be different (and hence, completely useless for
the classifier) when fitting a new vectorizer. Therefore, as you see,
you do not do any fitting any longer, and only use the `.transform()`
method of the (already fitted) vectorizer and the `.predict()` method of
the (already fitted) classifier.
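
A small sketch with toy corpora (not the review data) shows why a newly fitted vectorizer is useless for an existing classifier: the same word ends up in a different column, and even the number of columns differs.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two vectorizers fitted on different toy corpora
vec_a = CountVectorizer().fit(["good movie", "bad movie"])
vec_b = CountVectorizer().fit(["awful boring plot", "good movie"])

# The word-to-column mapping depends on the corpus the vectorizer was fitted on
print(vec_a.vocabulary_)
print(vec_b.vocabulary_)

# The same text therefore yields two incompatible feature vectors
features_a = vec_a.transform(["good movie"]).toarray()
features_b = vec_b.transform(["good movie"]).toarray()
print(features_a.shape, features_b.shape)
```

A classifier trained on the columns of `vec_a` would interpret the columns of `vec_b` as entirely different words, or refuse to predict because the number of features does not match.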

In R, you have no vectorizer you could save. However, because, in
contrast to Python, both your DTM and your classifier include the
feature names, it suffices to save only the classifier (using
`saveRDS(myclassifier, "myclassifier.rds")`) and to use it on a new DTM
later on. You do need to remember, though, how you constructed the DTM
(e.g., which preprocessing steps you took), to make sure that the
features are comparable.

Saving and loading a vectorizer and a classifier

In [25]:
# Make a vectorizer and train a classifier
vectorizer = TfidfVectorizer(min_df=5, max_df=0.5)
classifier = LogisticRegression(solver="liblinear")
X_train = vectorizer.fit_transform(text_train)
classifier.fit(X_train, y_train)

# Save them to disk
with open("myvectorizer.pkl", mode="wb") as f:
    pickle.dump(vectorizer, f)
with open("myclassifier.pkl", mode="wb") as f:
    joblib.dump(classifier, f)

# Later on, re-load this classifier and apply:
new_texts = ["This is a great movie", "I hated this one.", "What an awful fail"]

with open("myvectorizer.pkl", mode="rb") as f:
    myvectorizer = pickle.load(f)
with open("myclassifier.pkl", mode="rb") as f:
    myclassifier = joblib.load(f)

new_features = myvectorizer.transform(new_texts)
pred = myclassifier.predict(new_features)

for review, label in zip(new_texts, pred):
    print(f"'{review}' is probably '{label}'.")

'This is a great movie' is probably 'pos'.
'I hated this one.' is probably 'neg'.
'What an awful fail' is probably 'neg'.
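
As an alternative to keeping two files in sync, a fitted pipeline can be saved as a single object. The following is a minimal sketch on toy data; the file name `mypipeline.joblib` and the temporary directory are assumptions made only to keep the example self-contained.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data, only to keep the sketch self-contained
toy_texts = ["great film", "wonderful acting", "awful film", "boring acting"]
toy_labels = ["pos", "pos", "neg", "neg"]

toy_pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(solver="liblinear"))
toy_pipe.fit(toy_texts, toy_labels)

# Hypothetical file name; fitted vectorizer and classifier are stored together
path = os.path.join(tempfile.mkdtemp(), "mypipeline.joblib")
joblib.dump(toy_pipe, path)

# Later on: load the pipeline and predict directly on raw texts
mypipe = joblib.load(path)
new_pred = mypipe.predict(["a wonderful film", "a boring film"])
print(new_pred)
```

Because the pipeline applies the stored vectorizer before the classifier, there is no separate `.transform()` call and no risk of pairing a classifier with the wrong vectorizer.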


Another thing that we might want to do is to get a better idea of the
features that the model uses to arrive at its prediction; in our
example, what actually characterizes the best and the worst reviews.
Example **?@exm-eli5** shows how this can be done in one line of code
using *eli5* – a package that aims to “*e*xplain \[the model\] *l*ike
*I*’m *5* years old”. Here, we re-use the `pipe` we constructed earlier
to provide both the vectorizer and the classifier to *eli5* – if we had
only provided the classifier, then the feature names would have been
internal identifiers (which are meaningless to us) rather than
human-readable words.

Using eli5 to get the most predictive features

In [26]:
pipe = make_pipeline(
    TfidfVectorizer(min_df=5, max_df=0.5),
    LogisticRegression(solver="liblinear"),
)
pipe.fit(text_train, y_train)
# print(eli5.format_as_text(eli5.explain_weights(pipe)))

We can also use eli5 to explain how the classifier arrived at a
prediction for a specific document, by using different shades of green
and red to explain how much different features contributed to the
classification, and in which direction (Example **?@exm-eli5b**).

Using eli5 to explain a prediction

## Python code