<a href="https://colab.research.google.com/github/middlebury-csci-0451/CSCI-0451/blob/main/lecture-notes/text-classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


*Major components of this set of lecture notes are based on the [Text Classification](https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html) tutorial from the PyTorch documentation*. 

## Deep Text Classification and Word Embedding

In this set of notes, we'll discuss the problem of *text classification*. Text classification is a common problem in which we aim to classify pieces of text into different categories. These categories might be about:

- **Subject matter**: is this news article about politics, fashion, or finance?
- **Emotional valence**: is this tweet happy or sad? Excited or calm? This particular class of questions is so important that it has its own name: sentiment analysis.
- **Automated content moderation**: is this Facebook comment a possible instance of abuse or harassment? Is this Reddit thread promoting violence? Is this email spam?

We saw text classification previously when we first considered the problem of vectorizing pieces of text. We are now going to look at a somewhat more contemporary approach to text using *word embeddings*. 


In [1]:
import pandas as pd
import torch
import numpy as np

# for embedding visualization later
import plotly.express as px 
import plotly.io as pio
pio.templates.default = "plotly_white"

from sklearn.model_selection import train_test_split

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

For this example, we are going to use a data set containing headlines from a large number of different news articles on the website [HuffPost](https://www.huffpost.com/). I retrieved this data [from Kaggle](https://www.kaggle.com/rmisra/news-category-dataset). 

In [2]:
# access the data
url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/news/News_Category_Dataset_v2.json"
df  = pd.read_json(url, lines=True)
df  = df[["category", "headline"]]

There are over 200,000 headlines listed here, along with the category in which they appeared on the website.


In [3]:
df.head()

| | category | headline |
|---|---|---|
| 0 | CRIME | There Were 2 Mass Shootings In Texas Last Week... |
| 1 | ENTERTAINMENT | Will Smith Joins Diplo And Nicky Jam For The 2... |
| 2 | ENTERTAINMENT | Hugh Grant Marries For The First Time At Age 57 |
| 3 | ENTERTAINMENT | Jim Carrey Blasts 'Castrato' Adam Schiff And D... |
| 4 | ENTERTAINMENT | Julianna Margulies Uses Donald Trump Poop Bags... |


Our task will be to teach an algorithm to classify headlines by predicting the category based on the text of the headline. 

Training a model on this much text data can require a lot of time, so we are going to simplify the problem a little bit, by reducing the number of categories. Let's take a look at which categories we have: 

In [4]:
df.groupby("category").size()

category
ARTS               1509
ARTS & CULTURE     1339
BLACK VOICES       4528
BUSINESS           5937
COLLEGE            1144
COMEDY             5175
CRIME              3405
CULTURE & ARTS     1030
DIVORCE            3426
EDUCATION          1004
ENTERTAINMENT     16058
ENVIRONMENT        1323
FIFTY              1401
FOOD & DRINK       6226
GOOD NEWS          1398
GREEN              2622
HEALTHY LIVING     6694
HOME & LIVING      4195
IMPACT             3459
LATINO VOICES      1129
MEDIA              2815
MONEY              1707
PARENTING          8677
PARENTS            3955
POLITICS          32739
QUEER VOICES       6314
RELIGION           2556
SCIENCE            2178
SPORTS             4884
STYLE              2254
STYLE & BEAUTY     9649
TASTE              2096
TECH               2082
THE WORLDPOST      3664
TRAVEL             9887
WEDDINGS           3651
WEIRD NEWS         2670
WELLNESS          17827
WOMEN              3490
WORLD NEWS         2177
WORLDPOST          2579
dtype: int64

Some of these categories are a little odd:

- "Women"? 
- "Weird News"? 
- What's the difference between "Style," "Style & Beauty," and "Taste"?
- "Parenting" vs. "Parents"? 
- Etc?...

Well, there are definitely some questions here! Let's just choose a few categories, and discard the rest. We're going to give each of the categories an integer that we'll use to encode the category in the target variable. 

In [5]:
categories = {
    "STYLE"   : 0,
    "SCIENCE" : 1, 
    "DIVORCE" : 2
}

df = df[df["category"].apply(lambda x: x in categories.keys())]
df.head()

| | category | headline |
|---|---|---|
| 155 | SCIENCE | Scientists Turn To DNA Technology To Search Fo... |
| 285 | SCIENCE | Unusual Asteroid Could Be An Interstellar Gues... |
| 439 | SCIENCE | China Marks Another Milestone In Quest To Beco... |
| 449 | SCIENCE | Terrifying Clip Shows Why You Should Never Run... |
| 1246 | SCIENCE | U.S. Climate Scientists Flee For France To ‘Ma... |


In [6]:
# replace category names with their integer labels; assign() returns a new
# dataframe, which avoids pandas' SettingWithCopyWarning on the filtered slice
df = df.assign(category = df["category"].map(categories))
df


| | category | headline |
|---|---|---|
| 155 | 1 | Scientists Turn To DNA Technology To Search Fo... |
| 285 | 1 | Unusual Asteroid Could Be An Interstellar Gues... |
| 439 | 1 | China Marks Another Milestone In Quest To Beco... |
| 449 | 1 | Terrifying Clip Shows Why You Should Never Run... |
| 1246 | 1 | U.S. Climate Scientists Flee For France To ‘Ma... |
| ... | ... | ... |
| 200754 | 1 | Treating a World Without Antibiotics? |
| 200815 | 1 | Russian Cargo Ship Docks At International Spac... |
| 200816 | 1 | Robots Play Catch, Starring Agile Justin And R... |
| 200817 | 1 | Thomas Edison Voted Most Iconic Inventor In U.... |


Next we need to wrap this Pandas dataframe as a Torch data set. While we've been using pre-implemented Torch classes for things like directories of images, in this case it's not so hard to just implement our own Dataset. We just need to implement `__getitem__()` to return the appropriate row of the dataframe, along with `__len__()` to report the number of rows. 

In [7]:
from torch.utils.data import Dataset, DataLoader

class TextDataFromDF(Dataset):
    def __init__(self, df):
        self.df = df
    
    def __getitem__(self, index):
        return self.df.iloc[index, 1], self.df.iloc[index, 0]

    def __len__(self):
        return len(self.df)                

Now let's perform a train-validation split and make Datasets from each one. 

In [8]:
df_train, df_val = train_test_split(df,shuffle = True, test_size = 0.2)

In [9]:
train_data = TextDataFromDF(df_train)
val_data   = TextDataFromDF(df_val)

Each element of our data sets is a tuple of text and label: 

In [10]:
train_data[194]

('How Tragic Events Brought These Exes And Their Spouses Closer Together', 2)

## Text Vectorization (Again)

Now we need to vectorize our text. This time, we're not going to use one-hot encodings. Instead, we are going to treat each sentence as a sequence of words, and identify each word via an integer index. First we'll use a *tokenizer* to split each sentence into individual words: 

In [11]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer('basic_english')


tokenized = tokenizer(train_data[194][0])
tokenized

['how',
 'tragic',
 'events',
 'brought',
 'these',
 'exes',
 'and',
 'their',
 'spouses',
 'closer',
 'together']

You might reasonably disagree about whether this is a good tokenization: should punctuation marks be kept as tokens of their own? Should a contraction like "you're" really be split into "you", "'", and "re", as this tokenizer does? These are excellent questions that we won't discuss too much further right now. 
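
To see where splits like these come from, here is a rough sketch of a tokenizer in the same spirit as `basic_english`. It is an approximation written for illustration, not torchtext's actual implementation, and the name `simple_tokenize` is made up for this example. It lowercases the text, pads a handful of punctuation marks with spaces, and splits on whitespace.

```python
import re

def simple_tokenize(text):
    """Lowercase, set common punctuation apart as its own tokens, split on whitespace."""
    text = text.lower()
    # surround each punctuation mark with spaces so that it becomes a separate token
    text = re.sub(r"([',.!?()])", r" \1 ", text)
    return text.split()

print(simple_tokenize("You're Not Going To Believe This!"))
```

Under this rule an apostrophe always becomes its own token, which is why contractions break into three pieces.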

We're now ready to build a *vocabulary*. A vocabulary is a mapping from words to integers. The code below loops through the training data and uses it to build such a mapping. 

In [12]:
def yield_tokens(data_iter):
    for text, _ in data_iter:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(train_data), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

Here are the first ten elements of the vocabulary: 

In [13]:
vocab.get_itos()[0:10]

['<unk>', "'", 'the', 'to', ',', 'a', 's', 'divorce', 'of', 'in']

This vocabulary can be applied to a list of tokens like this: 

In [14]:
vocab(tokenized)

[20, 2941, 2474, 2337, 76, 445, 12, 52, 1100, 821, 491]
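
To make the idea of a vocabulary concrete, here is a minimal plain-Python sketch of the same kind of mapping. The helper names `build_simple_vocab` and `encode` are hypothetical, and the real `build_vocab_from_iterator` may order tied tokens differently and supports further options; the sketch only shows the core idea of frequency-ordered indices with a fallback index for unknown words.

```python
from collections import Counter

def build_simple_vocab(token_lists, unk_token="<unk>"):
    """Map each token to an integer index, most frequent tokens first."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # index 0 is reserved for the unknown token
    itos = [unk_token] + [tok for tok, _ in counts.most_common()]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos

def encode(tokens, stoi, unk_token="<unk>"):
    """Look up each token, falling back to the index of the unknown token."""
    return [stoi.get(tok, stoi[unk_token]) for tok in tokens]

# a made-up, already-tokenized corpus
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "end"]]
stoi, itos = build_simple_vocab(corpus)
print(itos)
print(encode(["the", "zebra", "sat"], stoi))  # "zebra" was never seen, so it maps to 0
```

The fallback index plays the same role as `set_default_index` above: words that never appeared in the training data still receive a valid integer.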

## Batch Collation

Now we're ready to construct the function that is going to actually pass a batch of data to our training loop. Here are the main steps: 

1. We pull some feature data (i.e. a batch of headlines). 
2. We represent each headline as a sequence of integers using the `vocab`. 
3. We concatenate the sequences so that they form a *single* long sequence. 
4. Separately, we keep track of some `offsets` which let us remember where each individual headline begins in the long sequence. 
    - The big sequence of integers and the sequence of offsets *jointly* constitute our feature data. 
5. We return the sequence of integers, the offsets, and the labels as Torch tensors. 
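
The bookkeeping in steps 3 and 4 can be sketched with plain Python lists before bringing in any tensors. The function name `flatten_with_offsets` and the token indices below are made up for illustration.

```python
def flatten_with_offsets(sequences):
    """Concatenate integer sequences and record where each one starts."""
    flat, offsets = [], []
    position = 0
    for seq in sequences:
        offsets.append(position)   # this sequence starts at the current position
        flat.extend(seq)
        position += len(seq)
    return flat, offsets

batch = [[20, 2941, 2474], [76, 445], [12, 52, 1100, 821]]  # made-up token indices
flat, offsets = flatten_with_offsets(batch)
print(flat)     # [20, 2941, 2474, 76, 445, 12, 52, 1100, 821]
print(offsets)  # [0, 3, 5]
```

Each offset is the running total of the lengths of the sequences that came before it, which is the same quantity a cumulative sum over the lengths produces.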

In [15]:
text_pipeline = lambda x: vocab(tokenizer(x)) # turns text into lists of ints
label_pipeline = lambda x: int(x)

def collate_batch(batch):
    label_list, text_list, lengths = [], [], [0]
    for (_text, _label) in batch:

        # add label to list
        label_list.append(label_pipeline(_label))

        # add text (as sequence of integers) to list
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)

        # add info about how long the text was, used for offsets
        lengths.append(processed_text.size(0))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    offsets = torch.tensor(lengths[:-1]).cumsum(dim=0)
    text_list = torch.cat(text_list)
    return label_list.to(device), text_list.to(device), offsets.to(device)

In [16]:
train_loader = DataLoader(train_data, batch_size=8, shuffle=False, collate_fn=collate_batch)
val_loader = DataLoader(val_data, batch_size=8, shuffle=False, collate_fn=collate_batch)

Let's take a look at a batch of data now: 

In [17]:
next(iter(train_loader))

(tensor([1, 2, 1, 0, 2, 2, 0, 1], device='cuda:0'),
 tensor([9306, 1735, 8568,   82,  349,   17,    6,   17, 6515,    4,  596,  638,
          821,    3, 7489, 1719,    2, 8607, 6902, 6213,  127, 1943,   10, 4840,
         7967,  106,  307, 4969, 5072,    4,   55,  105, 1550,   11,  935,    4,
          112,   76,   47,   88,  410,   24,   23,  318,   43,   54,   68,   56,
          461,   90,    3,   84,   30,  168,   68, 1718,   28,    7,   31,  279,
           64, 4077,  153, 4041,   10,   18,   29,   19, 2821, 2579,   70,    5,
         1245, 4943,   11,  905, 2585,   12, 3319,  661, 1467,   74,    1,    6,
         1548, 3895,  143,    1,  993, 4380,    1, 4314, 4239], device='cuda:0'),
 tensor([ 0, 16, 23, 32, 46, 55, 68, 79], device='cuda:0'))

The first element is the tensor of labels. The second is the concatenated sequence of integers representing 8 headlines' worth of text. The final one is the tensor of offsets that tells us where each of the 8 headlines begins. 
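
In the batch above, for instance, the offsets begin `0, 16, ...`, so the first headline occupies positions 0 through 15 of the long sequence. Here is a small sketch of how the individual sequences could be recovered from the flat sequence and its offsets, using plain lists and made-up token indices. The helper name `split_by_offsets` is hypothetical.

```python
def split_by_offsets(flat, offsets):
    """Recover the individual sequences from a flat list and its start offsets."""
    # each sequence ends where the next one starts; the last one ends at len(flat)
    ends = list(offsets[1:]) + [len(flat)]
    return [flat[start:end] for start, end in zip(offsets, ends)]

flat    = [5, 8, 2, 7, 1, 9, 4, 3, 6]   # made-up token indices
offsets = [0, 3, 5]
print(split_by_offsets(flat, offsets))  # [[5, 8, 2], [7, 1], [9, 4, 3, 6]]
```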

## Modeling

### Word Embedding

A *word embedding* refers to a representation of a word in a vector space. Each word is assigned an individual vector. The general aim of a word embedding is to create a representation such that words with related meanings are close to each other in a vector space, while words with different meanings are farther apart. One usually hopes for the *directions* connecting words to be meaningful as well. Here's a nice diagram illustrating some of the general concepts: 

![](https://miro.medium.com/max/1838/1*OEmWDt4eztOcm5pr2QbxfA.png)

*Image credit: [Towards Data Science](https://towardsdatascience.com/creating-word-embeddings-coding-the-word2vec-algorithm-in-python-using-deep-learning-b337d0ba17a8)*

Word embeddings are often produced as intermediate stages in many machine learning algorithms. In our case, we're going to add an embedding layer at the very base of our model. We'll allow the user to flexibly specify the number of dimensions. 

We'll use fairly low-dimensional embeddings in this lecture, but embeddings used in practice typically have many more dimensions. For example, the [Embedding Projector demo](http://projector.tensorflow.org/) supplied by TensorFlow uses a default dimension of 200. 

In [18]:
from torch import nn

class TextClassificationModel(nn.Module):

    def __init__(self, vocab_size, embed_dim, num_class):

        super().__init__()

        # embedding layer
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=False)

        # output layer (not very deep)
        self.fc = nn.Linear(embed_dim, num_class)
    
    def forward(self, text, offsets):
        # this time forward needs both the text integer sequence and the offsets 
        # in order to fully form the predictor data
        embedded = self.embedding(text, offsets)

        # we could add a lot more layers on top of the embedding, but this is 
        # enough for this task
        return self.fc(embedded)
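
In its default `mode="mean"`, `nn.EmbeddingBag` looks up the embedding vector of every token in a headline and averages them, so each headline becomes a single vector of length `embed_dim` no matter how many words it contains. Here is a plain-Python sketch of that computation with a tiny made-up embedding table. The function name `embedding_bag_mean` is hypothetical, the sketch assumes every bag is non-empty, and it ignores everything PyTorch does for speed and gradients.

```python
def embedding_bag_mean(weights, text, offsets):
    """Average the embedding rows of each bag of token indices.

    weights: list of embedding vectors, one per vocabulary index
    text:    flat list of token indices for the whole batch
    offsets: start position of each bag within `text`
    """
    dim = len(weights[0])
    ends = list(offsets[1:]) + [len(text)]
    pooled = []
    for start, end in zip(offsets, ends):
        rows = [weights[i] for i in text[start:end]]
        # component-wise mean of the rows in this bag
        pooled.append([sum(row[d] for row in rows) / len(rows) for d in range(dim)])
    return pooled

weights = [[0.0, 0.0], [1.0, 3.0], [3.0, 5.0], [2.0, -2.0]]  # made-up 2-d embeddings
text    = [1, 2, 3, 1]   # two "headlines": tokens [1, 2] and tokens [3, 1]
offsets = [0, 2]
print(embedding_bag_mean(weights, text, offsets))  # [[2.0, 4.0], [1.5, 0.5]]
```

Because of the averaging, word order within a headline plays no role in this model: it is a "bag of words" classifier built on learned word vectors.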

Let's instantiate and train a model! 

In [19]:
vocab_size = len(vocab)
embedding_dim = 16
model = TextClassificationModel(vocab_size, embedding_dim, len(categories)).to(device)

In [20]:
import time

optimizer = torch.optim.Adam(model.parameters(), lr=1)
loss_fn = torch.nn.CrossEntropyLoss()

def train(dataloader):
    # keep track of some counts for measuring accuracy
    total_acc, total_count = 0, 0
    log_interval = 300
    start_time = time.time()

    for idx, (label, text, offsets) in enumerate(dataloader):

        # zero gradients
        optimizer.zero_grad()
        # form prediction on batch
        predicted_label = model(text, offsets)
        # evaluate loss on prediction
        loss = loss_fn(predicted_label, label)
        # compute gradient
        loss.backward()
        # take an optimization step
        optimizer.step()

        # for printing accuracy
        total_acc   += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)
        if idx % log_interval == 0 and idx > 0:
            elapsed = time.time() - start_time
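            # note: `epoch` is not an argument here; it is read from the global
            # loop variable defined in the training loop below
            print('| epoch {:3d} | {:5d}/{:5d} batches '
                  '| train accuracy {:8.3f}'.format(epoch, idx, len(dataloader),
                                              total_acc/total_count))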
            total_acc, total_count = 0, 0
            start_time = time.time()


def evaluate(dataloader):

    total_acc, total_count = 0, 0

    with torch.no_grad():
        for idx, (label, text, offsets) in enumerate(dataloader):
            predicted_label = model(text, offsets)
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
    return total_acc/total_count

In [21]:
EPOCHS = 20
for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()
    train(train_loader)
    
    print('| end of epoch {:3d} | time: {:5.2f}s | '.format(epoch,
                                           time.time() - epoch_start_time))
    print('-' * 65)


| epoch   1 |   300/  786 batches | train accuracy    0.669
| epoch   1 |   600/  786 batches | train accuracy    0.764
| end of epoch   1 | time:  4.98s | 
-----------------------------------------------------------------
| epoch   2 |   300/  786 batches | train accuracy    0.864
| epoch   2 |   600/  786 batches | train accuracy    0.894
| end of epoch   2 | time:  1.77s | 
-----------------------------------------------------------------
| epoch   3 |   300/  786 batches | train accuracy    0.928
| epoch   3 |   600/  786 batches | train accuracy    0.945
| end of epoch   3 | time:  1.71s | 
-----------------------------------------------------------------
| epoch   4 |   300/  786 batches | train accuracy    0.956
| epoch   4 |   600/  786 batches | train accuracy    0.956
| end of epoch   4 | time:  1.72s | 
-----------------------------------------------------------------
| epoch   5 |   300/  786 batches | train accuracy    0.959
| epoch   5 |   600/  786 batches | train accura

In [22]:
evaluate(val_loader)

0.8575063613231552

Our accuracy on validation data is much lower than what we achieved on the training data, which is a possible sign of overfitting. Regardless, this predictive performance is much better than what we would have achieved by guesswork. The proportion of each category in the training data gives the relevant base rates: 

In [23]:
df_train.groupby("category").size() / len(df_train)

category
0    0.285873
1    0.276010
2    0.438116
dtype: float64
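
The most common category (label 2, divorce) makes up about 44% of the training data, so a "model" that always guesses it would be right roughly 44% of the time on similar data. Here is a small sketch of that majority-class baseline, using made-up labels rather than the real ones. The function name `majority_baseline_accuracy` is hypothetical.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, eval_labels):
    """Accuracy of always predicting the most common training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    return sum(label == majority_label for label in eval_labels) / len(eval_labels)

train_labels = [2, 2, 2, 0, 1, 0, 2, 1]   # made-up labels
eval_labels  = [2, 0, 2, 1]
print(majority_baseline_accuracy(train_labels, eval_labels))  # 0.5
```

Comparing a model's validation accuracy to a baseline like this is a quick check that it has learned something beyond the class frequencies.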

## Inspecting Word Embeddings

Recall from our discussion of image classification that the intermediate layers learned by the model can help us understand the representations that the model uses to construct its final outputs. In the case of word embeddings, we can simply extract this matrix from the corresponding layer of the model: 

In [25]:
# copy the weights to the CPU without moving the layer itself off the GPU
embedding_matrix = model.embedding.weight.detach().cpu().numpy()

Let's also extract the words from our vocabulary: 

In [26]:
tokens = vocab.get_itos()

The weight matrix itself has 16 columns, which is too many for us to conveniently visualize. So, instead we are going to use our friend PCA to extract a 2-dimensional representation that we can plot. 

In [27]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
weights = pca.fit_transform(embedding_matrix)

We'll use the Plotly package to do the plotting. Plotly works best with dataframes: 

In [28]:
embedding_df = pd.DataFrame({
    'word' : tokens, 
    'x0'   : weights[:,0],
    'x1'   : weights[:,1]
})
embedding_df

| | word | x0 | x1 |
|---|---|---|---|
| 0 | `<unk>` | 3.921737 | 4.245463 |
| 1 | `'` | -15.690315 | 30.810335 |
| 2 | `the` | -1.679589 | 4.659822 |
| 3 | `to` | -0.706810 | 67.403107 |
| 4 | `,` | 7.435288 | -13.638614 |
| ... | ... | ... | ... |
| 9639 | `‘tail` | 5.232621 | 3.792657 |
| 9640 | `‘you` | 50.140232 | 21.606318 |
| 9641 | `‘your` | 4.788199 | 4.075893 |
| 9642 | `’90s` | 29.882799 | -11.696157 |


And, let's plot! We've used Plotly for the interactivity: hover over a dot to see the word it corresponds to. 

In [29]:
fig = px.scatter(embedding_df, 
                 x = "x0", 
                 y = "x1", 
                 size = list(np.ones(len(embedding_df))),
                 size_max = 10,
                 hover_name = "word")

fig.show()

We've made an embedding! We might notice that this embedding appears to be a little bit "stretched out" in three main directions. Each one corresponds to one of the three classes in our training data. 

## Bias in Text Embeddings

Whenever we create a machine learning model that might conceivably have impact on the thoughts or actions of human beings, we have a responsibility to understand the limitations and biases of that model. Biases can enter into machine learning models through several routes, including the data used as well as choices made by the modeler along the way. For example, in our case: 

1. **Data**: we used data from a popular news source. 
2. **Modeler choice**: we only used data corresponding to a certain subset of labels. 

With these considerations in mind, let's see what kinds of words our model associates with female and male genders. 

In [30]:
feminine = ["she", "her", "woman"]
masculine = ["he", "him", "man"]

highlight_1 = ["strong", "powerful", "smart",     "thinking", "brave", "muscle"]
highlight_2 = ["hot",    "sexy",     "beautiful", "shopping", "children", "thin"]

def gender_mapper(x):
    if x in feminine:
        return 1
    elif x in masculine:
        return 4
    elif x in highlight_1:
        return 3
    elif x in highlight_2:
        return 2
    else:
        return 0

embedding_df["highlight"] = embedding_df["word"].apply(gender_mapper)
embedding_df["size"]      = np.array(1.0 + 50*(embedding_df["highlight"] > 0))

# keep only the highlighted words for plotting
sub_df = embedding_df[embedding_df["highlight"] > 0]

In [35]:
import plotly.express as px 

fig = px.scatter(sub_df, 
                 x = "x0", 
                 y = "x1", 
                 color = "highlight",
                 size = list(sub_df["size"]),
                 size_max = 10,
                 hover_name = "word", 
                 text = "word")

fig.update_traces(textposition='top center')


fig.show()

Our text classification model's word embedding is unambiguously sexist. 

- Words like "hot", "sexy", and "shopping" are more closely located to feminine words like "she", "her", and "woman".
- Words like "strong", "smart", and "thinking" are more closely located to masculine words like "he", "him", and "man". 
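
A visual impression like this can be made more concrete by measuring distances. The sketch below compares a word's average distance to a group of feminine words with its average distance to a group of masculine words. It uses made-up 2-d coordinates rather than the trained embedding, the helper name `mean_distance` is hypothetical, and the choice of Euclidean distance in the plotted plane is an assumption: cosine similarity in the full 16-dimensional space would be another reasonable choice.

```python
import math

def mean_distance(word, group, vectors):
    """Average Euclidean distance from `word` to each word in `group`."""
    return sum(math.dist(vectors[word], vectors[g]) for g in group) / len(group)

# made-up 2-d coordinates, for illustration only (not the model's embedding)
vectors = {
    "she": (1.0, 0.0), "her": (1.0, 1.0),
    "he": (-1.0, 0.0), "him": (-1.0, 1.0),
    "example": (0.5, 0.5),
}
feminine_words, masculine_words = ["she", "her"], ["he", "him"]

d_f = mean_distance("example", feminine_words, vectors)
d_m = mean_distance("example", masculine_words, vectors)
print(d_f < d_m)  # True: "example" sits closer to the feminine words in this toy data
```

With the real embedding, the same comparison could be run for each highlighted word using the `x0` and `x1` columns of `embedding_df`.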

Where did these biases come from? 

- The primary source is the data itself: HuffPost headlines in certain categories can be highly gendered, and the "Style" category is an example of this. 
- A secondary source is the choices that I made as a modeler. In particular, I intentionally chose categories that would emphasize biases in the data and make them easy to visualize. 

While I could have made different choices and obtained different results, this episode highlights a fundamental set of questions usually underexamined in contemporary machine learning: 

- What biases are built into my data source? 
- How do my choices about which data to use influence the biases present in my model? 

For more on the topic of bias in language models, you may wish to read the now-infamous paper by Emily Bender, Angelina McMillan-Major, Timnit Gebru, and "Shmargret Shmitchell" (Margaret Mitchell), "[On the Dangers of Stochastic Parrots](https://faculty.washington.edu/ebender/papers/Stochastic_Parrots.pdf)." This is the paper that ultimately led to the firing of the final two authors by Google in late 2020 and early 2021. 