## Text Classification

In the same way that we can train neural networks for image classification, it is possible to train machine learning models capable of assigning a specific label to a piece of text.

## Dataset



In [None]:
# !pip install torchtext==0.10.0

In [None]:
import torch
import torchtext.legacy

torchtext.__version__

'0.10.0'

Torchtext contains a multitude of datasets that we can use, which are ideal when learning to work with neural networks for NLP tasks.

The "torchtext.legacy.data.Field" class contains all the necessary tokenization and text processing logic.

In [None]:
TEXT = torchtext.legacy.data.Field(tokenize = 'spacy')
LABEL = torchtext.legacy.data.LabelField(dtype = torch.long)

train_data, test_data = torchtext.legacy.datasets.IMDB.splits(TEXT, LABEL)



downloading aclImdb_v1.tar.gz


aclImdb_v1.tar.gz: 100%|██████████| 84.1M/84.1M [00:04<00:00, 17.6MB/s]


In [None]:
len(train_data), len(test_data)

(25000, 25000)

The following cell prints one example from the dataset, which consists of the tokenized text and its label.

In [None]:
print(vars(train_data.examples[0]))

{'text': ['Like', 'his', 'earlier', 'film', ',', '"', 'In', 'a', 'Glass', 'Cage', '"', ',', 'Agustí', 'Villaronga', 'achieves', 'an', 'intense', 'and', 'highly', 'poetic', 'canvas', 'that', 'is', 'even', 'more', 'refined', 'visually', 'than', 'its', 'predecessor', '.', 'This', 'is', 'one', 'of', 'the', 'most', 'visually', 'accomplished', 'and', 'haunting', 'pictures', 'one', 'could', 'ever', 'see', '.', 'The', 'heightened', 'drama', ',', 'intensity', 'and', 'undertone', 'of', 'violence', 'threatens', 'on', 'the', 'the', 'melodramatic', 'or', 'farcical', ',', 'yet', 'never', 'steps', 'into', 'it', '.', 'In', 'that', 'way', ',', 'it', 'pulls', 'off', 'an', 'almost', 'impossible', 'feat', ':', 'to', 'be', 'so', 'over', '-', 'the', '-', 'top', 'and', 'yet', 'so', 'painfully', 'restrained', ',', 'to', 'be', 'so', 'charged', 'and', 'yet', 'so', 'understated', ',', 'and', 'even', 'the', 'explosives', 'finales', 'are', 'virtuosic', 'feasts', 'of', 'the', 'eye', '.', 'Unabashed', ',', 'gorgeous

## Tokenization

Next we build a vocabulary limited to a fixed number of words. To do this, the field counts how often each token appears in the training set and keeps only the most frequent ones, up to the size we specify.

In [None]:
MAX_VOCAB_SIZE = 10000

TEXT.build_vocab(train_data, max_size = MAX_VOCAB_SIZE)
LABEL.build_vocab(train_data)

len(TEXT.vocab), len(LABEL.vocab)

(10002, 2)

The vocabulary has the requested size plus two. The extra entries are the "unk" token, which is assigned to unknown words and to words too infrequent to pass the frequency filter, and the "pad" token, which is used to make all the phrases in a batch the same length.
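
The idea behind a size-limited vocabulary can be sketched in plain Python. This is only an illustration under assumptions: `build_vocab`, `encode` and the toy `corpus` are hypothetical names, and the tie-breaking between equally frequent words is not meant to reproduce what torchtext does internally.

```python
from collections import Counter

def build_vocab(tokenized_texts, max_size, specials=("<unk>", "<pad>")):
    """Keep the `max_size` most frequent tokens and prepend the special tokens."""
    freqs = Counter(tok for text in tokenized_texts for tok in text)
    itos = list(specials) + [tok for tok, _ in freqs.most_common(max_size)]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def encode(tokens, stoi, unk="<unk>"):
    """Replace every token by its index, falling back to the unknown token."""
    return [stoi.get(tok, stoi[unk]) for tok in tokens]

# Hypothetical toy corpus, already tokenized.
corpus = [
    ["the", "film", "is", "great"],
    ["the", "film", "is", "terrible"],
    ["the", "plot", "is", "great"],
]
itos, stoi = build_vocab(corpus, max_size=3)
print(len(itos))  # 3 kept tokens + 2 special tokens = 5
print(encode(["the", "plot", "is", "boring"], stoi))
```

Words that did not make the cut ("plot") and words never seen ("boring") both map to the index of the unknown token, which is why the vocabulary ends up with two more entries than the requested size.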

In [None]:
TEXT.vocab.freqs.most_common(10)

[('the', 289838),
 (',', 275296),
 ('.', 236709),
 ('and', 156484),
 ('a', 156282),
 ('of', 144056),
 ('to', 133886),
 ('is', 109095),
 ('in', 87676),
 ('I', 77546)]

In [None]:
TEXT.vocab.itos[:10]

['<unk>', '<pad>', 'the', ',', '.', 'and', 'a', 'of', 'to', 'is']

In [None]:
LABEL.vocab.stoi

defaultdict(None, {'neg': 0, 'pos': 1})

We now build the data loaders in charge of feeding the network with batches of phrases efficiently, using the torchtext.legacy.data.BucketIterator class, which also groups phrases of similar length to minimize the padding needed.

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"

dataloader = {
    'train': torchtext.legacy.data.BucketIterator(train_data, batch_size=64, shuffle=True, sort_within_batch=True, device=device),
    'test': torchtext.legacy.data.BucketIterator(test_data, batch_size=64, device=device)
}
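
To see why grouping phrases of similar length matters, the sketch below counts how many padding positions are needed with and without sorting by length. It is a simplified illustration: the review lengths are randomly generated (hypothetical), and a full sort is used as a stand-in for the bucketing strategy, which is an assumption and not how BucketIterator is implemented.

```python
import random

def make_batches(lengths, batch_size):
    """Split a list of sequence lengths into consecutive batches."""
    return [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]

def padding_needed(batches):
    """Count the pad positions required to make every batch rectangular."""
    return sum(max(batch) * len(batch) - sum(batch) for batch in batches)

random.seed(0)
lengths = [random.randint(20, 400) for _ in range(1024)]  # hypothetical review lengths

naive = make_batches(lengths, batch_size=64)
bucketed = make_batches(sorted(lengths), batch_size=64)

print("padding without bucketing:", padding_needed(naive))
print("padding with bucketing:   ", padding_needed(bucketed))
```

Every batch is padded up to its longest phrase, so batches that mix short and long phrases waste most of their positions on padding.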

## Model

To classify the text we will use a many-to-one recurrent network. It receives the text word by word, and we use the last hidden state (which contains information about the entire sentence) to generate the final prediction.

In [None]:
class RNN(torch.nn.Module):

    def __init__(self, input_dim, embedding_dim=128, hidden_dim=128, output_dim=2, num_layers=2, dropout=0.2, bidirectional=False):
        super().__init__()
        self.embedding = torch.nn.Embedding(input_dim, embedding_dim)
        self.rnn = torch.nn.GRU(
            input_size=embedding_dim, 
            hidden_size=hidden_dim, 
            num_layers=num_layers, 
            dropout=dropout if num_layers > 1 else 0,
            bidirectional=bidirectional
        )
        self.fc = torch.nn.Linear(2*hidden_dim if bidirectional else hidden_dim, output_dim)

    def forward(self, text):
        #text = [sent len, batch size]        
        embedded = self.embedding(text)        
        #embedded = [sent len, batch size, emb dim]        
        output, hidden = self.rnn(embedded)        
        #output = [sent len, batch size, hid dim * num directions]
        # Note that the batch dimension is NOT the first one: this is the default behavior of
        # recurrent layers in PyTorch. You can change it by passing batch_first=True to the
        # recurrent layer (and making sure your dataloader also puts the batch in the first dimension).
        y = self.fc(output[-1,:,:])
        return y
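
The many-to-one pattern can be illustrated without any library. The following is a minimal sketch, assuming a scalar Elman-style recurrence with hypothetical weights `w_x` and `w_h`; the GRU used in the class above adds gates and works with vectors, so this only shows the overall flow of information.

```python
import math

def rnn_many_to_one(inputs, w_x=0.5, w_h=0.8, h0=0.0):
    """Run a scalar recurrence over the sequence and return only the last hidden state."""
    h = h0
    for x in inputs:
        # Each step mixes the current input with the previous hidden state.
        h = math.tanh(w_x * x + w_h * h)
    return h

sequence = [1.0, -0.5, 0.25, 2.0]  # hypothetical inputs, one value per time step
last_hidden = rnn_many_to_one(sequence)
print(last_hidden)
```

Because every step depends on the previous hidden state, the value returned at the end is influenced by all the elements of the sequence, which is what allows a single vector to summarize the whole phrase.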

We check that the network is well defined and that the dimensions fit. It expects tensors with dimensions "sequence length x batch".

In [None]:
batch = next(iter(dataloader['train']))

batch.text.shape

torch.Size([93, 64])

The model outputs two values per phrase. If the first value is greater than the second we assign class 0 (negative opinion); otherwise we assign class 1 (positive opinion).

In [None]:
model = RNN(input_dim=len(TEXT.vocab))
outputs = model(torch.randint(0, len(TEXT.vocab), (100, 64)))
outputs.shape

torch.Size([64, 2])
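
As a rough sketch of how the two output values become a label, the snippet below converts a pair of scores into probabilities and picks the largest one. The helper names and the example scores are hypothetical; the label order follows the LABEL vocabulary shown earlier ('neg': 0, 'pos': 1).

```python
import math

LABELS = ["neg", "pos"]  # same order as LABEL.vocab.stoi

def softmax(logits):
    """Turn raw scores into probabilities that sum to one."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits):
    """Return the label whose score is the largest."""
    index = max(range(len(logits)), key=lambda i: logits[i])
    return LABELS[index]

logits = [1.2, -0.3]  # hypothetical model output for one review
print(softmax(logits), predict_label(logits))
```

The softmax does not change which value is the largest, so taking the argmax of the raw outputs gives the same class.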

## Training

In [None]:
from tqdm import tqdm
import numpy as np

def fit(model, dataloader, epochs=5):
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = torch.nn.CrossEntropyLoss()
    for epoch in range(1, epochs+1):
        model.train()
        train_loss, train_acc = [], []
        bar = tqdm(dataloader['train'])
        for batch in bar:
            X, y = batch.text, batch.label
            X, y = X.to(device), y.to(device)
            optimizer.zero_grad()
            y_hat = model(X)
            loss = criterion(y_hat, y)
            loss.backward()
            optimizer.step()
            train_loss.append(loss.item())
            acc = (y == torch.argmax(y_hat, axis=1)).sum().item() / len(y)
            train_acc.append(acc)
            bar.set_description(f"loss {np.mean(train_loss):.5f} acc {np.mean(train_acc):.5f}")
        bar = tqdm(dataloader['test'])
        val_loss, val_acc = [], []
        model.eval()
        with torch.no_grad():
            for batch in bar:
                X, y = batch.text, batch.label
                X, y = X.to(device), y.to(device)
                y_hat = model(X)
                loss = criterion(y_hat, y)
                val_loss.append(loss.item())
                acc = (y == torch.argmax(y_hat, axis=1)).sum().item() / len(y)
                val_acc.append(acc)
                bar.set_description(f"val_loss {np.mean(val_loss):.5f} val_acc {np.mean(val_acc):.5f}")
        print(f"Epoch {epoch}/{epochs} loss {np.mean(train_loss):.5f} val_loss {np.mean(val_loss):.5f} acc {np.mean(train_acc):.5f} val_acc {np.mean(val_acc):.5f}")
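
To make the loss used in the training loop more concrete, here is a minimal sketch of the cross-entropy of a single example, written in plain Python. The function name and the example scores are illustrative; torch.nn.CrossEntropyLoss computes the same quantity and, by default, averages it over the batch.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

print(cross_entropy([0.0, 0.0], target=1))   # log(2): the model is undecided
print(cross_entropy([3.0, -3.0], target=0))  # small loss: confident and correct
print(cross_entropy([3.0, -3.0], target=1))  # large loss: confident and wrong
```

With two classes, a model that cannot tell them apart has a loss of log(2), about 0.693, which is a useful reference when reading the training logs.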

In [None]:
fit(model, dataloader)

loss 0.59914 acc 0.66571: 100%|██████████| 391/391 [16:07<00:00,  2.48s/it]
val_loss 0.52978 val_acc 0.73648: 100%|██████████| 391/391 [04:48<00:00,  1.36it/s]


Epoch 1/5 loss 0.59914 val_loss 0.52978 acc 0.66571 val_acc 0.73648


loss 0.34038 acc 0.85248: 100%|██████████| 391/391 [05:09<00:00,  1.27it/s]
val_loss 0.32701 val_acc 0.85414: 100%|██████████| 391/391 [04:50<00:00,  1.35it/s]


Epoch 2/5 loss 0.34038 val_loss 0.32701 acc 0.85248 val_acc 0.85414


loss 0.23077 acc 0.90959: 100%|██████████| 391/391 [05:07<00:00,  1.27it/s]
val_loss 0.26565 val_acc 0.88944: 100%|██████████| 391/391 [04:49<00:00,  1.35it/s]


Epoch 3/5 loss 0.23077 val_loss 0.26565 acc 0.90959 val_acc 0.88944


loss 0.16911 acc 0.93539: 100%|██████████| 391/391 [05:06<00:00,  1.27it/s]
val_loss 0.27508 val_acc 0.88435: 100%|██████████| 391/391 [04:47<00:00,  1.36it/s]


Epoch 4/5 loss 0.16911 val_loss 0.27508 acc 0.93539 val_acc 0.88435


loss 0.12250 acc 0.95609: 100%|██████████| 391/391 [05:06<00:00,  1.27it/s]
val_loss 0.30675 val_acc 0.89325: 100%|██████████| 391/391 [04:49<00:00,  1.35it/s]

Epoch 5/5 loss 0.12250 val_loss 0.30675 acc 0.95609 val_acc 0.89325





## Generating Predictions

In this case we only have two possible classes, but it is easy to intuit that if we were able to build a dataset with many more classes that more accurately describe the "sentiment" in a text, we could extract very valuable information.

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

def predict(model, X):
    model.eval() 
    with torch.no_grad():
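        # as_tensor accepts lists and tensors alike, and avoids the warning
        # that torch.tensor raises when X is already a tensor
        X = torch.as_tensor(X).to(device)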
        pred = model(X)
        return pred

In [None]:
sentences = ["this film is terrible", "this film is great", "this film is good", "a waste of time"]
tokenized = [[tok.text for tok in nlp.tokenizer(sentence)] for sentence in sentences]
indexed = [[TEXT.vocab.stoi[_t] for _t in t] for t in tokenized]
tensor = torch.tensor(indexed).permute(1,0)
predictions = torch.argmax(predict(model, tensor), axis=1)
predictions

tensor([0, 1, 1, 0])
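
The four sentences above happen to have the same number of tokens, which is why they can be stacked directly into a tensor. For sentences of different lengths, a sketch along the following lines could be used before building the tensor. The function `encode_batch` and the dictionary `toy_stoi` are hypothetical; indices 0 and 1 for the unknown and padding tokens match the vocabulary shown earlier.

```python
def encode_batch(tokenized, stoi, unk_index=0, pad_index=1):
    """Map tokens to indices, pad to a common length, return [sent len][batch]."""
    indexed = [[stoi.get(tok, unk_index) for tok in sent] for sent in tokenized]
    max_len = max(len(sent) for sent in indexed)
    padded = [sent + [pad_index] * (max_len - len(sent)) for sent in indexed]
    # Transpose so that the first dimension is the position in the sentence.
    return [list(step) for step in zip(*padded)]

# Hypothetical miniature vocabulary.
toy_stoi = {"<unk>": 0, "<pad>": 1, "this": 2, "film": 3, "is": 4, "great": 5}
batch = encode_batch([["this", "film", "is", "great"], ["great", "stuff"]], toy_stoi)
print(batch)
```

This only addresses the shape requirement. The model reads the last time step, so trailing padding is also processed by the network, and predictions for heavily padded sentences may be affected.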

## Bidirectional Recurrent Networks

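Bidirectional recurrent networks, which read the sequence in both directions, generally give better results when we work with sequential data.

This is not possible in applications such as text generation or time series prediction, where future elements are not available at prediction time. For text classification, however, the whole sequence is known in advance, so we can use them.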
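
Conceptually, a bidirectional layer runs one recurrence from left to right and another from right to left, and combines both results. The sketch below assumes a scalar recurrence with hypothetical weights and concatenates the final state of each direction; it is an illustration of the idea and not a line-by-line match of how the RNN class above reads its output.

```python
import math

def final_state(inputs, w_x=0.5, w_h=0.8):
    """Scalar recurrence that returns the hidden state after the last step."""
    h = 0.0
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
    return h

def bidirectional_features(inputs):
    """Summarize the sequence read in both directions."""
    forward = final_state(inputs)         # reads left to right
    backward = final_state(inputs[::-1])  # reads right to left
    return [forward, backward]            # twice as many features as one direction

features = bidirectional_features([1.0, -0.5, 0.25, 2.0])
print(len(features), features)
```

Having one set of features per direction is the reason the linear layer in the model takes 2*hidden_dim inputs when bidirectional=True.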

In [None]:
model = RNN(input_dim=len(TEXT.vocab), bidirectional=True)
fit(model, dataloader)

loss 0.61430 acc 0.63929: 100%|██████████| 391/391 [33:54<00:00,  5.20s/it]
val_loss 0.53135 val_acc 0.73211: 100%|██████████| 391/391 [10:43<00:00,  1.65s/it]


Epoch 1/5 loss 0.61430 val_loss 0.53135 acc 0.63929 val_acc 0.73211


loss 0.32976 acc 0.85543: 100%|██████████| 391/391 [10:56<00:00,  1.68s/it]
val_loss 0.30841 val_acc 0.86598: 100%|██████████| 391/391 [10:43<00:00,  1.64s/it]


Epoch 2/5 loss 0.32976 val_loss 0.30841 acc 0.85543 val_acc 0.86598


loss 0.22088 acc 0.91182: 100%|██████████| 391/391 [11:04<00:00,  1.70s/it]
val_loss 0.36229 val_acc 0.84234: 100%|██████████| 391/391 [10:56<00:00,  1.68s/it]


Epoch 3/5 loss 0.22088 val_loss 0.36229 acc 0.91182 val_acc 0.84234


loss 0.15653 acc 0.94101: 100%|██████████| 391/391 [10:57<00:00,  1.68s/it]
val_loss 0.31984 val_acc 0.87363: 100%|██████████| 391/391 [10:42<00:00,  1.64s/it]


Epoch 4/5 loss 0.15653 val_loss 0.31984 acc 0.94101 val_acc 0.87363


loss 0.10502 acc 0.96264: 100%|██████████| 391/391 [11:05<00:00,  1.70s/it]
val_loss 0.32625 val_acc 0.88268: 100%|██████████| 391/391 [10:53<00:00,  1.67s/it]

Epoch 5/5 loss 0.10502 val_loss 0.32625 acc 0.96264 val_acc 0.88268





In [None]:
sentences = ["this film is terrible", "this film is great", "this film is good", "a waste of time"]
tokenized = [[tok.text for tok in nlp.tokenizer(sentence)] for sentence in sentences]
indexed = [[TEXT.vocab.stoi[_t] for _t in t] for t in tokenized]
tensor = torch.tensor(indexed).permute(1,0)
predictions = torch.argmax(predict(model, tensor), axis=1)
predictions

tensor([0, 1, 1, 0])