# Neural Bag-of-Words Classifiers

Lecture 2 | CMU ANLP Spring 2025 | Instructor: Sean Welleck

This is a notebook for [CMU CS11-711 Advanced NLP](https://cmu-l3.github.io/anlp-spring2025/) that trains neural network classifiers. Specifically, each model uses a bag-of-words variant to encode an input sequence into a continuous vector that is mapped to a probability distribution over the output classes. The model is trained to minimize cross-entropy loss using backpropagation.

*Ackowledgements*: adapted from Graham Neubig's ANLP Fall 2025 [code](https://github.com/neubig/anlp-code/tree/main/02-textclass)

### Tweet classification

We use the [`mteb/tweet_sentiment_extraction`](https://huggingface.co/datasets/mteb/tweet_sentiment_extraction) dataset, which consists of classifying an input tweet as positive, neutral, or negative sentiment.

In [1]:
!head -n 4 train.jsonl

{"id":"cb774db0d1","text":" I`d have responded, if I were going","label":1,"label_text":"neutral"}
{"id":"549e992a42","text":" Sooo SAD I will miss you here in San Diego!!!","label":0,"label_text":"negative"}
{"id":"088c60f138","text":"my boss is bullying me...","label":0,"label_text":"negative"}
{"id":"9642c003ef","text":" what interview! leave me alone","label":0,"label_text":"negative"}


#### Train a tokenizer

Based on the examples above, splitting on whitespace isn't a great idea. Let's learn a BPE vocabulary using `sentencepiece`.

In [None]:
import sentencepiece as spm
import json

with open("bow_tokenizer_txt.txt", "w", encoding="utf-8") as f:
    with open('train.jsonl', "r") as f2:
        for line in f2:
            j = json.loads(line)
            words = j['text']
            f.write(words + "\n")

import os

options = dict(
  input="bow_tokenizer_txt.txt",
  input_format="text",
  model_prefix="bow_tok", 
  model_type="bpe",
  vocab_size=2048,
  byte_fallback=True,
  num_threads=os.cpu_count()
)

spm.SentencePieceTrainer.train(**options);

In [3]:
sp = spm.SentencePieceProcessor()
sp.load('bow_tok.model')
vocab = [[sp.id_to_piece(idx), idx] for idx in range(sp.get_piece_size())]
vocab[1000:1020]

[['get', 1000],
 ['▁gl', 1001],
 ['▁away', 1002],
 ['eeee', 1003],
 ['▁left', 1004],
 ['▁mothers', 1005],
 ['?!', 1006],
 ['ily', 1007],
 ['oke', 1008],
 ['url', 1009],
 ['▁late', 1010],
 ['ire', 1011],
 ['hes', 1012],
 ['ner', 1013],
 ['▁Hope', 1014],
 ['▁Twitter', 1015],
 ['▁sha', 1016],
 ['▁bu', 1017],
 ['▁em', 1018],
 ['inking', 1019]]

#### Data loading

Read in the data, tokenize it, and split it into a training and dev set. There is a separate test set on [HuggingFace](https://huggingface.co/datasets/mteb/tweet_sentiment_extraction).

In [4]:
from collections import defaultdict
import json
import random

random.seed(123)

label_to_text = {}
def read_dataset(filename):
    with open(filename, "r") as f:
        for line in f:
            j = json.loads(line)
            words = j['text']
            label = j['label']
            label_to_text[label] = j['label_text']
            tokens = sp.encode(words)
            yield (tokens, label)

# Read in the data
ds = list(read_dataset("train.jsonl"))
random.shuffle(ds)
train = ds[:-1000]
dev = ds[1000:]

nwords = len(sp)
ntags = 3

### Model 1: Bag-of-Embeddings

Our simplest model simply sums together 3-dimensional word embeddings (3 dimensions since we have three classes).

First, for understanding purposes let's implement our own embedding layer.

To do so, we multiply a one-hot vector representation of a token with a suitably-sized weight matrix.

In [5]:
import torch

print(train[1][0][:5])

torch.nn.functional.one_hot(torch.tensor(train[0][0]), num_classes=nwords)[:5]

[402, 510, 953, 428, 413]


tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]])

In [6]:
import torch.nn as nn

weight = nn.Parameter(torch.randn(nwords, 64))
weight

Parameter containing:
tensor([[-1.4509e+00,  9.2547e-01,  8.0536e-02,  ..., -3.2355e-01,
         -1.0078e-01,  1.9657e+00],
        [-1.1709e+00,  1.7886e+00,  2.0609e-01,  ..., -1.8114e+00,
         -1.1534e+00, -1.8620e-01],
        [ 1.7818e-01,  2.0482e+00,  2.8740e-01,  ..., -1.0381e-01,
         -9.6792e-01, -1.8163e-01],
        ...,
        [-2.8673e-01,  2.4939e-01, -5.8121e-01,  ..., -1.1078e-01,
         -8.1260e-02,  2.9797e-01],
        [ 1.9415e-03, -3.5610e+00, -7.1223e-01,  ..., -2.9136e-01,
          8.4358e-01, -1.5612e-01],
        [-1.9175e+00,  1.5752e+00,  2.1777e-01,  ..., -1.4831e+00,
          1.4779e+00,  1.3776e+00]], requires_grad=True)

In [7]:
xs = torch.nn.functional.one_hot(torch.tensor(train[1][0]), num_classes=nwords)[:5]

torch.matmul(xs.float(), weight).shape

torch.Size([5, 64])

In [8]:
class Embedding(nn.Module):
    def __init__(self, vocab_size, emb_size):
        super(Embedding, self).__init__()
        self.weight = nn.Parameter(torch.randn(vocab_size, emb_size))
        self.vocab_size = vocab_size

        nn.init.xavier_uniform_(self.weight)
        
    def forward(self, x):
        xs = torch.nn.functional.one_hot(x, num_classes=self.vocab_size).float()
        return torch.matmul(xs, self.weight)

Now here is our simple bag-of-words model

In [9]:
class BoW(torch.nn.Module):
    def __init__(self, vocab_size, num_labels):
        super(BoW, self).__init__()
        self.embedding = Embedding(vocab_size, num_labels)
        nn.init.xavier_uniform_(self.embedding.weight)

    def forward(self, tokens):
        emb = self.embedding(tokens)
        out = torch.sum(emb, dim=0) 
        logits = out.view(1, -1) 
        return logits

Let's also implement cross-entropy loss ourselves this time:

In [10]:
def ce_loss(logits, target):
    log_probs = torch.nn.functional.log_softmax(logits, dim=1)
    loss = -log_probs[:, target]
    return loss

Here's a training loop.

We simply do "single batch" training here, i.e. loop over each training example one at a time and perform an update. We'll implement batching later on.

You can use the SGD (Stochastic Gradient Descent) optimizer that was introduced in class, or this typically better optimizer Adam (we'll see it in a later class).

In [11]:
import random
import time

# initialize the model
model = BoW(nwords, ntags)
criterion = ce_loss
# optimizer = torch.optim.SGD(model.parameters(), lr=5e-4)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)

for ITER in range(5):
    # Perform training
    random.shuffle(train)
    train_loss = 0.0
    start = time.time()
    for x, y in train:
        x = torch.tensor(x, dtype=torch.long)
        y = torch.tensor([y])
        logits = model(x)
        loss = criterion(logits, y)
        train_loss += loss.item()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print("iter %r: train loss/sent=%.4f, time=%.2fs" % (
                ITER, train_loss/len(train), time.time()-start))
    # Perform validation
    test_correct = 0.0
    for x, y in dev:
        x = torch.tensor(x, dtype=torch.long)
        logits = model(x)[0].detach()
        predict = logits.argmax().item()
        if predict == y:
            test_correct += 1
    print("iter %r: valid acc=%.4f" % (ITER, test_correct/len(dev)))

iter 0: train loss/sent=0.9043, time=5.95s
iter 0: valid acc=0.6844
iter 1: train loss/sent=0.7722, time=6.14s
iter 1: valid acc=0.7026
iter 2: train loss/sent=0.7376, time=6.22s
iter 2: valid acc=0.7099
iter 3: train loss/sent=0.7222, time=5.74s
iter 3: valid acc=0.7148
iter 4: train loss/sent=0.7142, time=6.09s
iter 4: valid acc=0.7153


### Model 2: Bag-of-embeddings + output layer

This is what we called `CBoW` in the lecture. Take a look at the code to see how it differs from the previous model.

Also, it turns out to be important to initialize the weights well. We'll discuss this in a later class. Try removing the `nn.init` lines and see the performance change.

In [12]:
class CBoW(torch.nn.Module):
    def __init__(self, vocab_size, num_labels, emb_size):
        super(CBoW, self).__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.output_layer = nn.Linear(emb_size, num_labels)

        nn.init.xavier_uniform_(self.embedding.weight)
        nn.init.xavier_uniform_(self.output_layer.weight)

    def forward(self, tokens):
        emb = self.embedding(tokens)    # [len(tokens) x emb_size]
        emb_sum = torch.sum(emb, dim=0) # [emb_size]
        h = emb_sum.view(1, -1)         # [1 x emb_size]
        logits = self.output_layer(h)   # [1 x num_labels]
        return logits

In [13]:
EMB_SIZE=32
model = CBoW(nwords, ntags, EMB_SIZE)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)

for ITER in range(5):
    random.shuffle(train)
    train_loss = 0.0
    start = time.time()
    model.train()
    for x, y in train:
        x = torch.tensor(x, dtype=torch.long)
        y = torch.tensor([y])
        logits = model(x)
        loss = criterion(logits, y)
        train_loss += loss.item()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print("iter %r: train loss/sent=%.4f, time=%.2fs" % (
                ITER, train_loss/len(train), time.time()-start))
    model.eval()
    # Perform testing
    test_correct = 0.0
    for x, y in dev:
        x = torch.tensor(x, dtype=torch.long)
        logits = model(x)[0].detach()
        predict = logits.argmax().item()
        if predict == y:
            test_correct += 1
    print("iter %r: dev acc=%.4f" % (ITER, test_correct/len(dev)))

iter 0: train loss/sent=0.8398, time=10.39s
iter 0: dev acc=0.7103
iter 1: train loss/sent=0.7399, time=9.62s
iter 1: dev acc=0.7204
iter 2: train loss/sent=0.7158, time=10.31s
iter 2: dev acc=0.7286
iter 3: train loss/sent=0.7048, time=10.92s
iter 3: dev acc=0.7341
iter 4: train loss/sent=0.6967, time=10.54s
iter 4: dev acc=0.7237


### Model 3: Deep CBoW

Now we introduce a nonlinear layer involving a tanh activation. 

In [14]:
class DeepCBoW(torch.nn.Module):
    def __init__(self, vocab_size, num_labels, emb_size, hid_size):
        super(DeepCBoW, self).__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.linear1 = nn.Linear(emb_size, hid_size)    
        self.output_layer = nn.Linear(hid_size, num_labels)

        nn.init.xavier_uniform_(self.embedding.weight)
        nn.init.xavier_uniform_(self.linear1.weight)     
        nn.init.xavier_uniform_(self.output_layer.weight)

    def forward(self, tokens):
        emb = self.embedding(tokens)
        emb_sum = torch.sum(emb, dim=0) 
        h = emb_sum.view(1, -1) 
        h = torch.tanh(self.linear1(h))  
        logits = self.output_layer(h)
        return logits

In [15]:
EMB_SIZE=32
model = DeepCBoW(nwords, ntags, EMB_SIZE, 32)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)

for EPOCH in range(10):
    random.shuffle(train)
    train_loss = 0.0
    start = time.time()
    model.train()
    for x, y in train:
        x = torch.tensor(x, dtype=torch.long)
        y = torch.tensor([y])
        logits = model(x)
        loss = criterion(logits, y)
        train_loss += loss.item()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print("epoch %r: train loss/sent=%.4f, time=%.2fs" % (
                EPOCH, train_loss/len(train), time.time()-start))
    model.eval()
    # Perform testing
    test_correct = 0.0
    for x, y in dev:
        x = torch.tensor(x, dtype=torch.long)
        logits = model(x)[0].detach()
        predict = logits.argmax().item()
        if predict == y:
            test_correct += 1
    print("iter %r: dev acc=%.4f" % (EPOCH, test_correct/len(dev)))

epoch 0: train loss/sent=0.8199, time=11.63s
iter 0: dev acc=0.7249
epoch 1: train loss/sent=0.6858, time=10.44s
iter 1: dev acc=0.7553
epoch 2: train loss/sent=0.6270, time=10.71s
iter 2: dev acc=0.7830
epoch 3: train loss/sent=0.5716, time=13.66s
iter 3: dev acc=0.8120
epoch 4: train loss/sent=0.5143, time=12.05s
iter 4: dev acc=0.8405
epoch 5: train loss/sent=0.4565, time=12.04s
iter 5: dev acc=0.8695
epoch 6: train loss/sent=0.3966, time=12.47s
iter 6: dev acc=0.8929
epoch 7: train loss/sent=0.3438, time=11.31s
iter 7: dev acc=0.9133
epoch 8: train loss/sent=0.2917, time=12.60s
iter 8: dev acc=0.9288
epoch 9: train loss/sent=0.2439, time=12.54s
iter 9: dev acc=0.9400


Go deep learning!

Classify an example with our trained model

In [16]:
tweet = "I'm learning so much in advanced NLP!"
tokens = torch.tensor(sp.encode(tweet), dtype=torch.long)
logits = model(tokens)[0].detach()
predict = logits.argmax().item()
label_to_text[predict]


'positive'

### Suggested exercises

- Try changing the initialization of weights. Does the loss and/or dev accuracy change?
- Generalize the `DeepCBoW` implementation to take in a `num_layers` parameter. How does performance change as the number of layers is increased?
- Try different hyperparameters (e.g., learning rate, embedding size, hidden size, number of epochs). Can you identify any consistent trends?
- Try out different qualitative examples. Can you find patterns in how the model succeeds / fails?
- Implement batching by introducing a new `[PAD]` token. Make sure to mask out vectors for pad tokens in the model forward pass.