## Preparing Data

One of the main concepts of TorchText is the Field. A Field defines how a piece of data should be processed, and its parameters control the details of that processing. In our sentiment classification task, each example consists of the raw string of the review and its sentiment, either "pos" or "neg".

We use the TEXT field to define how the review should be processed, and the LABEL field to process the sentiment.

Our TEXT field has tokenize='spacy' as an argument. This specifies that tokenization (the act of splitting the string into discrete "tokens") should be done using the [spaCy](https://spacy.io/) tokenizer. If no tokenize argument is passed, the default is simply splitting the string on whitespace.
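
To make the difference concrete, here is a minimal sketch contrasting whitespace splitting with a punctuation-aware tokenizer. The simple_tokenize helper is a hypothetical regex stand-in written for this illustration; it is not spaCy and does not reproduce spaCy's rules (for instance, spaCy splits "I'll" into "I" and "'ll", which this sketch does not).

```python
import re

sentence = "I'll tell you a tale, and it wasn't short."

# Default behaviour when no tokenizer is given: split on whitespace only,
# so punctuation stays glued to the neighbouring word.
whitespace_tokens = sentence.split()

def simple_tokenize(text):
    """Hypothetical stand-in tokenizer: words (with apostrophes) or single punctuation marks."""
    return re.findall(r"[\w']+|[^\w\s]", text)

regex_tokens = simple_tokenize(sentence)

print(whitespace_tokens)  # contains 'tale,' and 'short.'
print(regex_tokens)       # contains 'tale', ',' and 'short', '.'
```

Separating punctuation matters because "tale," and "tale" would otherwise be counted as two different vocabulary entries.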

LABEL is defined by a LabelField, a special subclass of the Field class specifically used for handling labels. We will explain the dtype argument later.

For more on TorchText, go [here](https://pytorch.org/text/).

In [0]:
import torch
from torchtext import data

SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

TEXT = data.Field(tokenize = 'spacy')
LABEL = data.LabelField(dtype = torch.float)


Another handy feature of TorchText is that it has support for common datasets used in natural language processing (NLP).

The following code automatically downloads the IMDb dataset and splits it into the canonical train/test splits as torchtext.datasets objects. It processes the data using the Fields we have previously defined. The IMDb dataset consists of 50,000 movie reviews, each marked as being a positive or negative review.

In [0]:
from torchtext import datasets

train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

In [3]:
print(f'Number of training examples: {len(train_data)}')
print(f'Number of testing examples: {len(test_data)}')

Number of training examples: 25000
Number of testing examples: 25000


In [4]:
print(vars(train_data.examples[0]))

{'text': ['I', "'ll", 'tell', 'you', 'a', 'tale', 'of', 'the', 'summer', 'of', '1994', '.', 'A', 'friend', 'and', 'I', 'attended', 'a', 'Canada', 'Day', 'concert', 'in', 'Barrie', ',', 'and', 'it', 'was', 'a', 'who', "'s", 'who', 'of', 'the', 'top', 'Canadian', 'bands', 'of', 'the', 'age', '.', 'We', 'got', 'there', 'about', '4', 'am', ',', 'waited', 'in', 'line', 'most', 'of', 'the', 'morning', ',', 'and', 'when', 'the', 'doors', 'opened', 'at', '9', 'am', ',', 'we', 'were', 'among', 'the', 'first', 'inside', 'the', 'gates', '.', 'We', 'then', 'waited', 'and', 'waited', 'in', 'the', 'hot', 'sun', ',', 'slowly', 'broiling', 'but', 'we', 'did', "n't", 'care', ',', 'because', 'the', 'headliners', 'were', 'among', 'our', 'favourites', '.', 'At', 'one', 'point', ',', 'early', 'in', 'the', 'afternoon', ',', 'I', 'sat', 'down', 'and', 'dozed', 'off', 'with', 'my', 'back', 'to', 'the', 'barrier', '.', 'I', 'was', 'awakened', 'to', 'my', 'shock', 'and', 'dismay', 'by', 'a', 'shrieking', 'girl'

The IMDb dataset only has train/test splits, so we need to create a validation set. We can do this with the .split() method.

By default this splits 70/30. Passing a split_ratio argument changes the ratio of the split, e.g. a split_ratio of 0.8 would mean 80% of the examples make up the training set and 20% make up the validation set.

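We also seed Python's random module and pass its state to the random_state argument, ensuring that we get the same train/validation split each time.

In [0]:
import random

# random.seed() returns None, so seed first and then pass the generator state
random.seed(SEED)

train_data, valid_data = train_data.split(random_state = random.getstate())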

Again, we'll view how many examples are in each split.

In [6]:
print(f'Number of training examples: {len(train_data)}')
print(f'Number of validation examples: {len(valid_data)}')
print(f'Number of testing examples: {len(test_data)}')

Number of training examples: 17500
Number of validation examples: 7500
Number of testing examples: 25000
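
The counts above follow directly from the default 70/30 ratio applied to 25,000 training examples. The sketch below reproduces that arithmetic with a seeded shuffle in plain Python. The split_indices helper is hypothetical and only illustrates the idea; it is not how TorchText implements .split() internally.

```python
import random

def split_indices(n_examples, split_ratio=0.7, seed=1234):
    """Shuffle indices with a seeded RNG, then cut them at split_ratio."""
    rng = random.Random(seed)            # private RNG, global state is untouched
    indices = list(range(n_examples))
    rng.shuffle(indices)
    n_train = int(round(split_ratio * n_examples))
    return indices[:n_train], indices[n_train:]

train_idx, valid_idx = split_indices(25000)
print(len(train_idx), len(valid_idx))    # 17500 7500

train_80, valid_80 = split_indices(25000, split_ratio=0.8)
print(len(train_80), len(valid_80))      # 20000 5000
```

Because the generator is seeded, calling split_indices twice with the same arguments returns the same partition, which is the property the random_state argument is meant to provide.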


Next we build the vocabulary, a lookup table that maps each token to an index, keeping only the max_size most common tokens.

In [0]:
MAX_VOCAB_SIZE = 25000

TEXT.build_vocab(train_data, max_size = MAX_VOCAB_SIZE)
LABEL.build_vocab(train_data)

Why do we only build the vocabulary on the training set? When testing any machine learning system you do not want to look at the test set in any way, since that would leak information into the model. We leave out the validation set for the same reason: it should reflect the test set as much as possible.

In [8]:
print(f"Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}")
print(f"Unique tokens in LABEL vocabulary: {len(LABEL.vocab)}")

Unique tokens in TEXT vocabulary: 25002
Unique tokens in LABEL vocabulary: 2
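
The TEXT vocabulary has 25,002 entries rather than 25,000 because two special tokens, `<unk>` (unknown word) and `<pad>` (padding), are added on top of the max_size most common tokens. The following is a rough sketch of that idea using collections.Counter. The build_vocab function and the toy sentences here are hypothetical and only mimic the behaviour described above.

```python
from collections import Counter

def build_vocab(token_lists, max_size, specials=("<unk>", "<pad>")):
    """Keep the max_size most frequent tokens, placed after the special tokens."""
    freqs = Counter(tok for tokens in token_lists for tok in tokens)
    itos = list(specials) + [tok for tok, _ in freqs.most_common(max_size)]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

# Toy "training set": 'a' appears 3 times, 'film' twice, the rest once.
train_tokens = [["a", "good", "film"], ["a", "bad", "film"], ["a"]]
itos, stoi = build_vocab(train_tokens, max_size=2)

# Words outside the vocabulary fall back to the <unk> index.
unk = stoi["<unk>"]
encoded = [stoi.get(tok, unk) for tok in ["a", "terrible", "film"]]

print(itos)     # ['<unk>', '<pad>', 'a', 'film']
print(encoded)  # [2, 0, 3]
```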


We can also view the most common words in the vocabulary and their frequencies.

In [9]:
print(TEXT.vocab.freqs.most_common(20))

[('the', 202740), (',', 193041), ('.', 165817), ('and', 109865), ('a', 109697), ('of', 100794), ('to', 94173), ('is', 76159), ('in', 61417), ('I', 54936), ('it', 53936), ('that', 49515), ('"', 44192), ("'s", 43416), ('this', 42703), ('-', 37456), ('/><br', 35832), ('was', 35209), ('as', 30609), ('with', 30105)]


In [10]:
print(LABEL.vocab.stoi)

defaultdict(<function _default_unk_index at 0x7f0e1f5c4158>, {'neg': 0, 'pos': 1})


## Word Embedding

In [0]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

In [13]:
embeds = nn.Embedding(len(TEXT.vocab), 10)  # number of words in vocab, 10 dimensional embeddings
lookup_tensor = torch.tensor(TEXT.vocab.stoi['hello'], dtype=torch.long)
hello_embed = embeds(lookup_tensor)
print(hello_embed)

tensor([-0.5849, -0.0740, -1.8753, -0.3271, -0.3832, -0.6796, -0.3518, -0.4548,
         0.4912,  2.0177], grad_fn=<EmbeddingBackward>)
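
An embedding layer is essentially a trainable lookup table: each index selects one row of a weight matrix. The sketch below, which uses toy sizes rather than the real vocabulary, checks that claim and shows how a batch of indices gains a trailing embedding dimension.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, embedding_dim = 6, 4      # toy sizes, not the tutorial's values
embeds = nn.Embedding(vocab_size, embedding_dim)

# Looking up an index returns the matching row of the weight matrix.
idx = torch.tensor(3, dtype=torch.long)
vector = embeds(idx)
same_row = torch.equal(vector, embeds.weight[3])

# A [seq_len, batch] tensor of indices becomes [seq_len, batch, embedding_dim].
batch = torch.tensor([[1, 2], [3, 4], [5, 0]])   # 3 tokens x 2 sentences
batch_shape = tuple(embeds(batch).shape)

print(same_row, batch_shape)   # True (3, 2, 4)
```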


## Logistic Regression

In [0]:
class LR(nn.Module):
    def __init__(self, input_dim, embedding_dim, output_dim):
        
        super().__init__()
        
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        
        # self.flatten = nn.Flatten()
        
        self.fc = nn.Linear(embedding_dim, output_dim)
        
    def forward(self, text):
       
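        # text: [seq_len, batch_size] -> embedded: [seq_len, batch_size, embedding_dim]
        embedded = self.embedding(text)

        # average over the sequence dimension -> hidden: [batch_size, embedding_dim]
        hidden = torch.mean(embedded, dim=0)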
        
        return self.fc(hidden)
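
To see why the forward pass averages over dim=0, it helps to trace the tensor shapes. The sketch below assumes batches are laid out as [seq_len, batch_size], which is the Field default (batch_first=False), and uses toy sizes and random indices in place of real reviews.

```python
import torch
import torch.nn as nn

seq_len, batch_size, vocab_size, embedding_dim = 7, 3, 50, 5   # toy sizes

embedding = nn.Embedding(vocab_size, embedding_dim)
fc = nn.Linear(embedding_dim, 1)

text = torch.randint(0, vocab_size, (seq_len, batch_size))  # [seq_len, batch]
embedded = embedding(text)        # [seq_len, batch, embedding_dim]
hidden = embedded.mean(dim=0)     # [batch, embedding_dim]: one vector per review
logits = fc(hidden)               # [batch, 1]
squeezed = logits.squeeze(1)      # [batch], matching the shape of the labels

shapes = [tuple(t.shape) for t in (text, embedded, hidden, logits, squeezed)]
print(shapes)   # [(7, 3), (7, 3, 5), (3, 5), (3, 1), (3,)]
```

Note that a plain mean also averages over any padding positions in shorter reviews, which is a simplification of this model.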

In [0]:
BATCH_SIZE = 64

# fall back to the CPU when no GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE,
    device = device)

In [0]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
OUTPUT_DIM = 1

model = LR(INPUT_DIM, EMBEDDING_DIM, OUTPUT_DIM)

## Train the Model

In [0]:
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()

model = model.to(device)
criterion = criterion.to(device)

In [0]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """

    #apply the sigmoid to get probabilities, then round to 0 or 1
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division 
    acc = correct.sum() / len(correct)
    return acc
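
As a quick sanity check of the accuracy function, here is a small worked example on hand-picked logits. The function is repeated so the snippet runs on its own, and the logits and labels are made up for illustration.

```python
import torch

def binary_accuracy(preds, y):
    # sigmoid maps logits to probabilities; rounding thresholds them at 0.5
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float()
    return correct.sum() / len(correct)

logits = torch.tensor([2.0, -1.0, 0.5, -3.0, 1.5])   # raw model outputs
labels = torch.tensor([1.0, 0.0, 0.0, 0.0, 1.0])

# Positive logits round to 1 and negative logits to 0, so the predictions are
# [1, 0, 1, 0, 1]; only the third one disagrees with its label.
acc = binary_accuracy(logits, labels)
print(round(acc.item(), 2))   # 0.8
```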

In [0]:
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        
        optimizer.zero_grad()
        
        predictions = model(batch.text).squeeze(1)
        
        loss = criterion(predictions, batch.label)
        
        acc = binary_accuracy(predictions, batch.label)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [0]:
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            predictions = model(batch.text).squeeze(1)
            
            loss = criterion(predictions, batch.label)
            
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [0]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [22]:
N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'lr-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 0m 3s
	Train Loss: 0.691 | Train Acc: 52.23%
	 Val. Loss: 0.666 |  Val. Acc: 70.44%
Epoch: 02 | Epoch Time: 0m 3s
	Train Loss: 0.679 | Train Acc: 61.34%
	 Val. Loss: 0.617 |  Val. Acc: 74.74%
Epoch: 03 | Epoch Time: 0m 3s
	Train Loss: 0.657 | Train Acc: 70.44%
	 Val. Loss: 0.547 |  Val. Acc: 76.30%
Epoch: 04 | Epoch Time: 0m 3s
	Train Loss: 0.621 | Train Acc: 75.02%
	 Val. Loss: 0.493 |  Val. Acc: 77.57%
Epoch: 05 | Epoch Time: 0m 3s
	Train Loss: 0.580 | Train Acc: 78.48%
	 Val. Loss: 0.465 |  Val. Acc: 78.94%


In [23]:
model.load_state_dict(torch.load('lr-model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.484 | Test Acc: 77.21%


## Using Pre-trained Word Embeddings

Next, we use pre-trained word embeddings. Instead of having our word embeddings initialized randomly, they are initialized with pre-trained vectors. We get these vectors simply by specifying which vectors we want and passing them as an argument to build_vocab. TorchText handles downloading the vectors and associating them with the correct words in our vocabulary.

Here, we'll be using the "glove.6B.100d" vectors. GloVe is the algorithm used to calculate the vectors; go [here](https://nlp.stanford.edu/projects/glove/) for more. 6B indicates these vectors were trained on 6 billion tokens and 100d indicates these vectors are 100-dimensional.

You can see the other available vectors [here](https://github.com/pytorch/text/blob/master/torchtext/vocab.py#L113).

The theory is that these pre-trained vectors already have words with similar semantic meaning close together in vector space, e.g. "terrible", "awful", "dreadful" are nearby. This gives our embedding layer a good initialization as it does not have to learn these relations from scratch.
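
"Close together in vector space" is usually measured with cosine similarity. The sketch below computes it in plain Python on made-up 3-dimensional vectors. The numbers are invented purely for illustration and are not real GloVe values.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1 = same direction, -1 = opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors; real GloVe vectors here would be 100-dimensional.
toy_vectors = {
    "terrible": [0.9, 0.1, -0.8],
    "awful":    [0.8, 0.2, -0.7],
    "great":    [-0.7, 0.3, 0.9],
}

sim_close = cosine_similarity(toy_vectors["terrible"], toy_vectors["awful"])
sim_far = cosine_similarity(toy_vectors["terrible"], toy_vectors["great"])

print(f"terrible vs awful: {sim_close:.2f}")
print(f"terrible vs great: {sim_far:.2f}")
```

In this toy setup the two negative words score close to 1 while the positive word scores below 0, which is the kind of structure a pre-trained embedding is expected to provide from the start.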

Note: these vectors are about 862MB, so watch out if you have a limited internet connection.

By default, TorchText will initialize words in your vocabulary but not in your pre-trained embeddings to zero. We don't want this, and instead initialize them randomly by setting unk_init to torch.Tensor.normal_. This will now initialize those words via a Gaussian distribution.
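
The effect of unk_init can be sketched in isolation. The function passed as unk_init is called on a tensor and fills it in place, so the snippet below calls torch.Tensor.zero_ and torch.Tensor.normal_ the same way on a standalone 100-dimensional tensor. The standalone tensors are an assumption made for illustration.

```python
import torch

torch.manual_seed(0)

embedding_dim = 100

# Default behaviour described above: a missing word gets an all-zero vector.
zero_vec = torch.Tensor.zero_(torch.empty(embedding_dim))

# With torch.Tensor.normal_, the vector is filled in place with samples
# from a standard normal distribution (mean 0, standard deviation 1).
random_vec = torch.Tensor.normal_(torch.empty(embedding_dim))

print(zero_vec.abs().sum().item())   # 0.0
print(random_vec[:5])
```

Random initialization gives each out-of-vocabulary word a distinct starting point, whereas zero vectors would make them all identical until training updates them.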

In [0]:
MAX_VOCAB_SIZE = 25000

# TEXT = data.Field(tokenize = 'spacy')
# LABEL = data.LabelField(dtype = torch.float)

TEXT.build_vocab(train_data, 
                 max_size = MAX_VOCAB_SIZE, 
                 vectors = "glove.6B.100d", 
                 unk_init = torch.Tensor.normal_)

LABEL.build_vocab(train_data)

In [25]:
pretrained_embeddings = TEXT.vocab.vectors

print(pretrained_embeddings.shape)

torch.Size([25002, 100])


In [0]:
BATCH_SIZE = 64
# fall back to the CPU when no GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE,
    device = device)


We then replace the initial weights of the embedding layer with the pre-trained embeddings.

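Note: this should always be done on the weight.data and not the weight! Copying in place into weight itself fails outside a torch.no_grad() block, because weight is a leaf tensor that requires gradients.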

In [27]:
model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[ 0.8330,  1.7377,  0.1831,  ..., -0.7654, -0.3746, -0.7757],
        [ 0.9665,  0.0843, -2.0277,  ...,  0.1333, -2.1372, -0.8860],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.5601, -0.2589,  0.7380,  ..., -1.3936,  0.5910,  1.0687],
        [-2.4713, -1.6929,  0.7839,  ...,  1.0108, -0.6083,  1.0693],
        [-0.7064, -0.1308, -0.0430,  ..., -1.0220,  0.7037, -0.3897]],
       device='cuda:0')
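
The in-place copy can be checked on a small scale. The sketch below uses a random 5 x 3 tensor as a stand-in for TEXT.vocab.vectors (an assumption made so the snippet runs without downloading GloVe) and also shows nn.Embedding.from_pretrained as one alternative way to get the same starting weights.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

pretrained = torch.randn(5, 3)        # stand-in for TEXT.vocab.vectors
emb = nn.Embedding(5, 3)
param_before = emb.weight

# Overwrite the values in place; the Parameter object itself is unchanged,
# so an optimizer built from model.parameters() still tracks it.
emb.weight.data.copy_(pretrained)

same_param = emb.weight is param_before
same_values = torch.equal(emb.weight.data, pretrained)
still_trainable = emb.weight.requires_grad

# Alternative: build the layer directly from the vectors.
# freeze=False keeps the embeddings trainable, as in the copy_ approach.
alt = nn.Embedding.from_pretrained(pretrained, freeze=False)

print(same_param, same_values, still_trainable)   # True True True
```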

In [0]:
class LR(nn.Module):
    def __init__(self, vocab, input_dim, embedding_dim, output_dim):
        
        super().__init__()
        
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        self.embedding.weight.data.copy_(vocab.vectors)
        # self.embedding.weight.requires_grad = False
        
        self.fc = nn.Linear(embedding_dim, output_dim)
        
    def forward(self, text):
       
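        # text: [seq_len, batch_size] -> embedded: [seq_len, batch_size, embedding_dim]
        embedded = self.embedding(text)

        # average over the sequence dimension -> hidden: [batch_size, embedding_dim]
        hidden = torch.mean(embedded, dim=0)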
        
        return self.fc(hidden)

In [0]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
OUTPUT_DIM = 1

model = LR(TEXT.vocab, INPUT_DIM, EMBEDDING_DIM, OUTPUT_DIM)

optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()

model = model.to(device)
criterion = criterion.to(device)

In [30]:
N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'lr-model_with_pre_trained_wv.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 0m 3s
	Train Loss: 0.694 | Train Acc: 51.64%
	 Val. Loss: 0.673 |  Val. Acc: 72.08%
Epoch: 02 | Epoch Time: 0m 3s
	Train Loss: 0.683 | Train Acc: 59.66%
	 Val. Loss: 0.631 |  Val. Acc: 75.42%
Epoch: 03 | Epoch Time: 0m 3s
	Train Loss: 0.665 | Train Acc: 68.73%
	 Val. Loss: 0.569 |  Val. Acc: 76.11%
Epoch: 04 | Epoch Time: 0m 3s
	Train Loss: 0.636 | Train Acc: 73.26%
	 Val. Loss: 0.509 |  Val. Acc: 77.18%
Epoch: 05 | Epoch Time: 0m 3s
	Train Loss: 0.599 | Train Acc: 77.67%
	 Val. Loss: 0.470 |  Val. Acc: 78.52%


In [31]:
model.load_state_dict(torch.load('lr-model_with_pre_trained_wv.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.482 | Test Acc: 77.12%
