# Sentiment Analysis with BERT

In this notebook we will be using the transformer model, first introduced in [this](https://arxiv.org/abs/1706.03762) paper. Specifically, we will be using the **BERT** (Bidirectional Encoder Representations from Transformers) model from [this](https://arxiv.org/abs/1810.04805) paper. 

Transformer models are considerably larger than anything else covered in these tutorials. As such we are going to use the [transformers library](https://github.com/huggingface/transformers) to get pre-trained transformers and use them as our embedding layers. We will try two training methods:
1. Freeze (not train) the transformer and train only the remainder of the model, which learns from the representations produced by the transformer.
2. Train the whole model end-to-end.

In this case we will be using a multi-layer bi-directional GRU; however, any model can learn from these representations.

## Introduction to BERT

### History
2018 was a breakthrough year in NLP. Transfer learning, particularly with models like Allen AI’s ELMo, OpenAI’s Open-GPT, and Google’s BERT, allowed researchers to smash multiple benchmarks with minimal task-specific fine-tuning. It also provided the rest of the NLP community with pretrained models that could easily (with less data and less compute time) be fine-tuned to produce state-of-the-art results. Unfortunately, for many starting out in NLP and even for some experienced practitioners, the theory and practical application of these powerful models are still not well understood.

### What is BERT?
**BERT (Bidirectional Encoder Representations from Transformers)**, released in late 2018, is the model we will use in this tutorial to give readers a better understanding of, and practical guidance for, using transfer learning models in NLP. BERT is a method of pretraining language representations that was used to create models that NLP practitioners can then download and use for free. You can either use these models to extract high-quality language features from your text data, or you can fine-tune them on a specific task (classification, entity recognition, question answering, etc.) with your own data to produce state-of-the-art predictions.

### Advantages of Fine-Tuning
In this tutorial, we will use BERT to train a text classifier. Specifically, we will take the pre-trained BERT model, add an untrained layer of neurons on the end, and train the new model for our classification task. Why do this rather than train a specific deep learning model (a CNN, BiLSTM, etc.) that is well suited for the specific NLP task you need?

**Quicker Development**
First, the pre-trained BERT model weights already encode a lot of information about our language. As a result, it takes much less time to train our fine-tuned model: it is as if we have already trained the bottom layers of our network extensively and only need to gently tune them while using their output as features for our classification task. In fact, the authors recommend only 2-4 epochs of training for fine-tuning BERT on a specific NLP task (compared to the hundreds of GPU hours needed to train the original BERT model or an LSTM from scratch!).

**Less Data**
In addition and perhaps just as important, because of the pre-trained weights this method allows us to fine-tune our task on a much smaller dataset than would be required in a model that is built from scratch. A major drawback of NLP models built from scratch is that we often need a prohibitively large dataset in order to train our network to reasonable accuracy, meaning a lot of time and energy had to be put into dataset creation. By fine-tuning BERT, we are now able to get away with training a model to good performance on a much smaller amount of training data.

**Better Results**
Finally, this simple fine-tuning procedure (typically adding one fully-connected layer on top of BERT and training for a few epochs) was shown to achieve state-of-the-art results with minimal task-specific adjustments for a wide variety of tasks: **classification, language inference, semantic similarity, question answering**, etc. Rather than implementing custom and sometimes obscure architectures shown to work well on a specific task, simply fine-tuning BERT is shown to be a better (or at least equal) alternative.

### A Shift in NLP
This shift to transfer learning parallels the shift that took place in computer vision a few years ago. Creating a good deep learning network for computer vision tasks can take millions of parameters and be very expensive to train. Researchers discovered that deep networks learn hierarchical feature representations (simple features like edges at the lowest layers with gradually more complex features at higher layers). Rather than training a new network from scratch each time, the lower layers of a trained network with generalized image features could be copied and transferred for use in another network with a different task. It soon became common practice to download a pre-trained deep network and quickly retrain it for the new task or add additional layers on top, which is vastly preferable to the expensive process of training a network from scratch. For many, the introduction of deep pre-trained language models in 2018 (ELMO, BERT, ULMFIT, Open-GPT, etc.) signals the same shift to transfer learning in NLP that computer vision saw.


## Installing the Hugging Face Library

Next, let’s install the transformers package from **Hugging Face** which will give us a pytorch interface for working with BERT. (This library contains interfaces for other pretrained language models like OpenAI’s GPT and GPT-2.) We’ve selected the pytorch interface because it strikes a nice balance between the high-level APIs (which are easy to use but don’t provide insight into how things work) and tensorflow code (which contains lots of details but often sidetracks us into lessons about tensorflow, when the purpose here is BERT!).

At the moment, the Hugging Face library seems to be the most widely accepted and powerful pytorch interface for working with BERT. In addition to supporting a variety of different pre-trained transformer models, the library also includes pre-built modifications of these models suited to your specific task. For example, in this tutorial we will use `BertForSequenceClassification`. We may also try to build our own classification model upon the base Bert model.

The library also includes task-specific classes for token classification, question answering, next sentence prediction, etc. Using these pre-built classes simplifies the process of modifying BERT for your purposes.

In [1]:
!pip install transformers


## Preparing Data

In [0]:
import torch

import random
import numpy as np

SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

### Tokenizing and Indexing

The transformer has already been trained with a specific vocabulary, which means we need to train with the exact same vocabulary and also tokenize our data in the same way that the transformer did when it was initially trained.

Luckily, the transformers library has tokenizers for each of the transformer models provided. In this case we are using the BERT model which ignores casing (i.e. will lower case every word). We get this by loading the pre-trained `bert-base-uncased` tokenizer.

In [3]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

The `tokenizer` has a `vocab` attribute which contains the actual vocabulary we will be using. We can check how many tokens are in it by checking its length.

In [4]:
len(tokenizer.vocab)

30522

Using the tokenizer is as simple as calling `tokenizer.tokenize` on a string. This will tokenize and lower case the data in a way that is consistent with the pre-trained transformer model.

In [5]:
tokens = tokenizer.tokenize('Hello WORLD how ARE yoU?')

print(tokens)

['hello', 'world', 'how', 'are', 'you', '?']
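Words that are missing from the vocabulary are split into sub-word pieces, with continuation pieces marked by `##`. The following is a minimal sketch of the greedy longest-match-first splitting that WordPiece-style tokenizers apply to a single lowercased word. The function name `wordpiece_split` and the tiny `toy_vocab` are illustrative assumptions, not part of the transformers library; the real tokenizer also handles punctuation and text normalization and uses the full 30,522-entry vocabulary.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercased word (sketch)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"hello", "che", "##es", "##y", "saw", "##tooth"}

print(wordpiece_split("hello", toy_vocab))     # ['hello']
print(wordpiece_split("cheesy", toy_vocab))    # ['che', '##es', '##y']
print(wordpiece_split("sawtooth", toy_vocab))  # ['saw', '##tooth']
```

A word is mapped to `[UNK]` only when no sequence of vocabulary pieces covers it, which is why rare words usually survive as several smaller pieces.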


We can numericalize tokens with our vocabulary using `tokenizer.convert_tokens_to_ids`.

In [6]:
indexes = tokenizer.convert_tokens_to_ids(tokens)

print(indexes)

[7592, 2088, 2129, 2024, 2017, 1029]


### Special formatting
Before we feed the inputs to the BERT model, we are required to:

* Add special tokens to the start and end of each sentence.
* Pad & truncate all sentences to a single constant length.
* Explicitly differentiate real tokens from padding tokens with the “attention mask”.

There are two special tokens we need to add:
* At the end of every sentence, we need to append the special `[SEP]` token. This token is an artifact of two-sentence tasks, where BERT is given two separate sentences and asked to determine something about them (for example, whether the second sentence follows the first).
* For classification tasks, we must prepend the special `[CLS]` token to the beginning of every sentence.

The `[CLS]` token has special significance. The BERT-base model used here consists of 12 Transformer layers. Each layer takes in a list of token embeddings and produces the same number of embeddings on the output (but with the feature values changed, of course!).

![](../figs/bert1.png)

On the output of the final (12th) transformer, only the first embedding (corresponding to the `[CLS]` token) is used by the classifier. Also, because BERT is trained to only use this `[CLS]` token for classification, we know that the model has been motivated to encode everything it needs for the classification step into that single 768-value embedding vector.
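The three formatting steps listed above (special tokens, padding and truncation, attention mask) can be sketched in plain Python. This is an illustration only: the helper name `format_input` is hypothetical, and the ids 101, 102 and 0 are the `[CLS]`, `[SEP]` and `[PAD]` indexes of the `bert-base-uncased` vocabulary used in this notebook.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # [CLS], [SEP], [PAD] in bert-base-uncased


def format_input(token_ids, max_length):
    """Add special tokens, truncate/pad to max_length, and build the mask."""
    body = token_ids[:max_length - 2]   # leave room for [CLS] and [SEP]
    ids = [CLS_ID] + body + [SEP_ID]
    mask = [1] * len(ids)               # 1 marks a real token
    n_pad = max_length - len(ids)
    return ids + [PAD_ID] * n_pad, mask + [0] * n_pad  # 0 marks padding


# Ids of 'hello world how are you ?'
ids, mask = format_input([7592, 2088, 2129, 2024, 2017, 1029], max_length=10)
print(ids)   # [101, 7592, 2088, 2129, 2024, 2017, 1029, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```

Padding to one fixed length is the simplest choice; padding each batch only to its longest sequence gives the same model inputs with less wasted computation.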



In [7]:
init_token = tokenizer.cls_token
eos_token = tokenizer.sep_token
pad_token = tokenizer.pad_token
unk_token = tokenizer.unk_token

print(init_token, eos_token, pad_token, unk_token)

[CLS] [SEP] [PAD] [UNK]


We can get the indexes of the special tokens by converting them using the vocabulary, or by explicitly getting them from the tokenizer.

In [8]:
init_token_idx = tokenizer.convert_tokens_to_ids(init_token)
eos_token_idx = tokenizer.convert_tokens_to_ids(eos_token)
pad_token_idx = tokenizer.convert_tokens_to_ids(pad_token)
unk_token_idx = tokenizer.convert_tokens_to_ids(unk_token)

# init_token_idx = tokenizer.cls_token_id
# eos_token_idx = tokenizer.sep_token_id
# pad_token_idx = tokenizer.pad_token_id
# unk_token_idx = tokenizer.unk_token_id

print(init_token_idx, eos_token_idx, pad_token_idx, unk_token_idx)

101 102 0 100


Another thing we need to handle is that the model was trained on sequences with a **defined maximum length**; it does not know how to handle sequences longer than those it was trained on. We can get this maximum length by checking `max_model_input_sizes` for the version of the transformer we want to use. In this case, it is 512 tokens. (However, the BERT authors recommend a maximum length of 128.)

In [9]:
MAX_INPUT_LENGTH = tokenizer.max_model_input_sizes['bert-base-uncased']

print(MAX_INPUT_LENGTH)

512


In [0]:
MAX_LENGTH = 256

<!-- Fortunately, the transfomer library provides a utility function `tokenizer.encode` to do all above for us: 
* tokenzing
* numericalizing
* adding `[CLS]` and `[SEP]`
* truncating to maximum length. -->

We define the following function to tokenize the original sentence and truncate it to `max_length`, leaving room for the two special tokens.

In [0]:
def tokenize_and_truncate(sentence, max_length):
    tokens = tokenizer.tokenize(sentence) 
    tokens = tokens[:max_length-2]
    return tokens

In [0]:
# tokens = tokenizer.encode('Hello WORLD how ARE yoU?', max_length=MAX_LENGTH)

# print(tokens)

### Generate train/test datasets
Now we define our fields. 
* The transformer expects the batch dimension to be first, so we set `batch_first = True`. 
* As we already have the vocabulary for our text, provided by the transformer we set `use_vocab = False` to tell torchtext that we'll be handling the vocabulary side of things. 
* We pass our `tokenize_and_truncate` function as the tokenizer. 
* The `preprocessing` argument is a function that takes in the example after it has been tokenized; this is where we convert the tokens to their indexes. 
* Set the special tokens to their predefined index values, i.e. 100 instead of `[UNK]`. Otherwise, the default string values would cause an error, because the field no longer uses a vocabulary to convert them. As a result, **there's no need to add the special tokens `[CLS]` and `[SEP]` to the inputs ourselves, since the later batching process will automatically do it for us**.

We define the label field as before.

In [0]:
from torchtext import data

TEXT = data.Field(batch_first = True,
                  use_vocab = False,
                  tokenize = lambda x:tokenize_and_truncate(x, max_length=MAX_LENGTH),
                  preprocessing = tokenizer.convert_tokens_to_ids,
                  #include_lengths = True
                  init_token = init_token_idx,
                  eos_token = eos_token_idx,
                  pad_token = pad_token_idx,
                  unk_token = unk_token_idx
                  )

LABEL = data.LabelField(dtype = torch.float)

We load the data and create the validation splits as before.

In [14]:
from torchtext import datasets

train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)
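random.seed(SEED)  # seed first: random.seed() returns None, so its result cannot serve as random_state
train_data, valid_data = train_data.split(random_state = random.getstate())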

downloading aclImdb_v1.tar.gz


aclImdb_v1.tar.gz: 100%|██████████| 84.1M/84.1M [00:03<00:00, 23.1MB/s]


In [15]:
print(f"Number of training examples: {len(train_data)}")
print(f"Number of validation examples: {len(valid_data)}")
print(f"Number of testing examples: {len(test_data)}")

Number of training examples: 17500
Number of validation examples: 7500
Number of testing examples: 25000


We can check an example and ensure that the text has already been numericalized.

In [16]:
print(vars(train_data.examples[0]))

{'text': [9826, 1010, 2073, 2064, 1045, 4088, 999, 2023, 2001, 1037, 2659, 5166, 1010, 27762, 6051, 2143, 1010, 2009, 2001, 2061, 18178, 2229, 2100, 2009, 2018, 2149, 2035, 21305, 2007, 7239, 2000, 2129, 3294, 2128, 7559, 5732, 2009, 2001, 999, 1996, 4690, 3554, 5019, 4694, 1005, 1056, 2130, 4690, 9590, 1010, 2027, 2020, 2652, 2105, 2007, 2070, 6081, 10689, 2027, 4149, 2012, 24547, 1011, 20481, 1998, 2035, 2027, 2020, 2725, 2001, 2074, 22653, 2000, 3046, 1998, 2191, 2009, 2298, 2066, 2027, 2020, 8084, 999, 999, 2033, 1998, 2026, 2155, 2001, 1999, 1996, 6888, 2005, 1037, 2428, 2204, 2895, 3185, 2028, 2154, 1010, 2061, 2057, 2787, 2000, 2175, 2000, 1996, 3573, 1998, 2298, 2005, 2028, 1010, 1998, 2045, 2009, 2001, 1996, 2387, 19392, 2479, 3185, 1012, 1045, 2812, 2009, 2246, 2061, 2307, 2021, 2043, 2057, 3427, 2009, 2012, 2188, 1045, 8134, 2351, 2044, 1996, 2034, 3496, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 2821, 1998, 1996, 5436, 1997, 1996, 2143, 1010, 1996, 2466, 2604, 10

In [17]:
tokens = tokenizer.convert_ids_to_tokens(vars(train_data.examples[0])['text'])

print(tokens)

['honestly', ',', 'where', 'can', 'i', 'begin', '!', 'this', 'was', 'a', 'low', 'budget', ',', 'horribly', 'acted', 'film', ',', 'it', 'was', 'so', 'che', '##es', '##y', 'it', 'had', 'us', 'all', 'bursting', 'with', 'laughter', 'to', 'how', 'completely', 're', '##tar', '##ded', 'it', 'was', '!', 'the', 'sword', 'fighting', 'scenes', 'weren', "'", 't', 'even', 'sword', 'fights', ',', 'they', 'were', 'playing', 'around', 'with', 'some', 'plastic', 'swords', 'they', 'bought', 'at', 'wal', '-', 'mart', 'and', 'all', 'they', 'were', 'doing', 'was', 'just', 'moaning', 'to', 'try', 'and', 'make', 'it', 'look', 'like', 'they', 'were', 'struggling', '!', '!', 'me', 'and', 'my', 'family', 'was', 'in', 'the', 'mood', 'for', 'a', 'really', 'good', 'action', 'movie', 'one', 'day', ',', 'so', 'we', 'decided', 'to', 'go', 'to', 'the', 'store', 'and', 'look', 'for', 'one', ',', 'and', 'there', 'it', 'was', 'the', 'saw', '##tooth', 'island', 'movie', '.', 'i', 'mean', 'it', 'looked', 'so', 'great', 'bu

Although we've handled the vocabulary for the text, we still need to build the vocabulary for the labels.

In [18]:
LABEL.build_vocab(train_data)
print(LABEL.vocab.stoi)

defaultdict(<function _default_unk_index at 0x7f949e29f1e0>, {'neg': 0, 'pos': 1})


As before, we create the iterators.

In [0]:
BATCH_SIZE = 32  # 32 is recommended by the BERT authors; larger sizes (e.g. 128) may also be tried

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE, 
    device = device)

In [0]:
# for batch in train_iterator:
#     print(batch)
#     break

## Build Model

For this task, we need to modify the pre-trained BERT model to give outputs for classification. There are two ways to do it:
* Build our own model upon the base Bert model implemented by the Library
* Use the library built-in model. 

Thankfully, the huggingface pytorch implementation includes a set of interfaces designed for a variety of NLP tasks. Though these interfaces are all built on top of a trained BERT model, each has different top layers and output types designed to accommodate its specific NLP task, including classification, Q&A, sequence labeling, etc.
Here, we'll follow the second method since it is simpler. The first method is more flexible and we'll leave it for future discussion.

We’ll be using `BertForSequenceClassification`. This is the normal BERT model with a single linear layer added on top for classification, which we will use as a sentence classifier. As we feed input data, the entire pre-trained BERT model and the additional untrained classification layer are trained on our specific task.

OK, let’s load BERT! There are a few different pre-trained BERT models available. `bert-base-uncased` is the version trained on lowercased text (`uncased`) and is the smaller of the two sizes (“base” vs “large”).




In [0]:
from transformers import BertModel, BertForSequenceClassification, AdamW, BertConfig
NB_CLASSES = 2
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", # Use the 12-layer BERT model, with an uncased vocab.
    num_labels = NB_CLASSES, # The number of output labels--2 for binary classification.
                    # You can increase this for multi-class tasks.   
    output_attentions = False, # Whether the model returns attentions weights.
    output_hidden_states = False, # Whether the model returns all hidden-states.
)


We can check how many parameters the model has and where they come from. 

In [22]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 109,483,778 trainable parameters


In [23]:
for name, param in model.named_parameters():
    #if param.requires_grad:
    print (name, param.requires_grad, param.data.shape)

bert.embeddings.word_embeddings.weight True torch.Size([30522, 768])
bert.embeddings.position_embeddings.weight True torch.Size([512, 768])
bert.embeddings.token_type_embeddings.weight True torch.Size([2, 768])
bert.embeddings.LayerNorm.weight True torch.Size([768])
bert.embeddings.LayerNorm.bias True torch.Size([768])
bert.encoder.layer.0.attention.self.query.weight True torch.Size([768, 768])
bert.encoder.layer.0.attention.self.query.bias True torch.Size([768])
bert.encoder.layer.0.attention.self.key.weight True torch.Size([768, 768])
bert.encoder.layer.0.attention.self.key.bias True torch.Size([768])
bert.encoder.layer.0.attention.self.value.weight True torch.Size([768, 768])
bert.encoder.layer.0.attention.self.value.bias True torch.Size([768])
bert.encoder.layer.0.attention.output.dense.weight True torch.Size([768, 768])
bert.encoder.layer.0.attention.output.dense.bias True torch.Size([768])
bert.encoder.layer.0.attention.output.LayerNorm.weight True torch.Size([768])
bert.encoder.

Our standard models have under 5M parameters, but this one has almost 110M! Most of these parameters belong to the transformer, and we can choose whether or not to train them.
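To see where the total comes from, the count can be reproduced by hand from the layer shapes. The sketch below is plain arithmetic, not library code. It assumes the standard BERT-base configuration: hidden size 768, 12 layers, and a feed-forward width of 3072. The feed-forward width and the pooler layer do not appear in the truncated listing above, so treat them as assumptions about the full model.

```python
HIDDEN, FFN, LAYERS = 768, 3072, 12   # FFN width is assumed from the BERT-base config
VOCAB, POSITIONS, SEGMENTS, CLASSES = 30522, 512, 2, 2


def linear(n_in, n_out):
    """Weights plus biases of a fully-connected layer."""
    return n_in * n_out + n_out


layer_norm = 2 * HIDDEN  # one scale vector and one shift vector

# Word, position and token-type embeddings, followed by a LayerNorm.
embeddings = (VOCAB + POSITIONS + SEGMENTS) * HIDDEN + layer_norm

encoder_layer = (
    4 * linear(HIDDEN, HIDDEN)   # query, key, value, attention output
    + layer_norm
    + linear(HIDDEN, FFN)        # feed-forward expansion
    + linear(FFN, HIDDEN)        # feed-forward projection
    + layer_norm
)

pooler = linear(HIDDEN, HIDDEN)        # dense layer applied to the [CLS] output
classifier = linear(HIDDEN, CLASSES)   # the new classification head

total = embeddings + LAYERS * encoder_layer + pooler + classifier
print(f"{total:,}")       # 109,483,778
print(f"{classifier:,}")  # 1,538
```

Only the last 1,538 parameters are new; everything else is loaded from the pre-trained checkpoint.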

In [0]:
FREEZE_BERT = False
for name, param in model.named_parameters():                
    if name.startswith('bert'):
        param.requires_grad = not FREEZE_BERT


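Now let's check the trainable parameters again. If we freeze the BERT parameters (`FREEZE_BERT = True`), only the 1,538 parameters of the classification layer (768 × 2 weights plus 2 biases) remain trainable. With `FREEZE_BERT = False`, as set above, the whole model is fine-tuned and the count is unchanged. It is worth trying both settings to see whether there is any improvement.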

In [25]:
print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 109,483,778 trainable parameters
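The effect of the freeze flag on this count can be sketched without loading the model. The snippet below mimics the `name.startswith('bert')` test from the freezing cell on a dictionary of parameter counts grouped by top-level module. The grouping and the helper name `trainable_parameters` are assumptions made for illustration; the per-group totals are derived from the layer shapes of `bert-base-uncased` with two labels.

```python
# Parameter counts grouped by module prefix (illustrative grouping).
param_counts = {
    "bert.embeddings": 23_837_184,
    "bert.encoder": 85_054_464,
    "bert.pooler": 590_592,
    "classifier": 1_538,
}


def trainable_parameters(counts, freeze_bert):
    """Sum the parameters that would keep requires_grad=True."""
    return sum(
        n for name, n in counts.items()
        if not (freeze_bert and name.startswith("bert"))
    )


print(f"{trainable_parameters(param_counts, freeze_bert=False):,}")  # 109,483,778
print(f"{trainable_parameters(param_counts, freeze_bert=True):,}")   # 1,538
```

With the encoder frozen, the optimizer updates only the classification head, although every batch still passes through all of BERT in the forward pass.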


## Train and test


### Optimizer and learning rate scheduler
For the purposes of fine-tuning, the authors recommend choosing from the following values:

* Batch size: 16, 32 (We chose 32 when creating our iterators).
* Learning rate (Adam): 5e-5, 3e-5, 2e-5 (We’ll use 2e-5).
* Number of epochs: 2, 3, 4 (We’ll use 4).
* The epsilon parameter `eps = 1e-8` is “a very small number to prevent any division by zero in the implementation” (from [here](https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/)).

You can find the creation of the AdamW optimizer in run_glue.py [here](https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L109).

In [0]:
# Note: AdamW is a class from the huggingface library (as opposed to pytorch).
# The 'W' refers to the weight decay fix (decoupled weight decay).
optimizer = AdamW(model.parameters(),
                  lr = 2e-5, # args.learning_rate - default is 5e-5, our notebook had 2e-5
                  eps = 1e-8 # args.adam_epsilon  - default is 1e-8.
                )
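To make the roles of `lr` and `eps` concrete, here is a minimal sketch of one AdamW update for a single scalar parameter. It follows the usual textbook form (moment estimates, bias correction, then decoupled weight decay). The function `adamw_step` is hypothetical, and library implementations may differ in details such as the order of the decay step or whether bias correction is applied.

```python
import math


def adamw_step(p, grad, m, v, t, lr=2e-5, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0):
    """One AdamW update for a scalar parameter at step t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)  # eps keeps the division finite
    p = p - lr * weight_decay * p               # decay acts on the weight, not the gradient
    return p, m, v


p, m, v = adamw_step(p=1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(p)
```

On the first step the bias-corrected ratio is close to 1, so the parameter moves by roughly `lr` regardless of the gradient's scale; this is why such small learning rates are workable for fine-tuning.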

In [0]:
from transformers import get_linear_schedule_with_warmup

# Number of training epochs (authors recommend between 2 and 4)
N_EPOCHS = 4

# Total number of training steps is number of batches * number of epochs.
total_steps = len(train_iterator) * N_EPOCHS

# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 0, # Default value in run_glue.py
                                            num_training_steps = total_steps)
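The shape of this schedule is simple enough to sketch by hand: the learning rate rises linearly during warmup and then decays linearly to zero. The helper `linear_warmup_factor` below is a hypothetical stand-in written for illustration, not the library function itself. The step count assumes the 17,500 training examples and batch size of 32 used in this notebook.

```python
import math


def linear_warmup_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


steps_per_epoch = math.ceil(17500 / 32)   # 547 batches per epoch
total_steps = steps_per_epoch * 4         # 2188 steps over 4 epochs

for step in (0, total_steps // 2, total_steps):
    print(step, 2e-5 * linear_warmup_factor(step, 0, total_steps))
```

With `num_warmup_steps = 0` the rate starts at its full value of 2e-5, is halved at the midpoint, and reaches zero at the final step.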

In [0]:
#Place the model onto the GPU (if available)
model = model.to(device)

### Training loop
Next, we'll define functions for: calculating accuracy, performing a training epoch, performing an evaluation epoch and calculating how long a training/evaluation epoch takes.

**Note**: There is no need to define a *loss* function here, because the `BertForSequenceClassification` model computes the loss for you when labels are passed. If `num_labels == 1`, it uses `MSELoss` (regression); otherwise, it uses `CrossEntropyLoss` (classification).
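For reference, the cross-entropy loss for one example can be written in a few lines. This is a sketch of the standard formula (log-sum-exp of the logits minus the logit of the true class), not the PyTorch implementation. The function name `cross_entropy` is chosen here for illustration, and the model's reported loss is the average of this quantity over the batch.

```python
import math


def cross_entropy(logits, label):
    """Negative log-probability of the true class, computed from raw logits."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]


print(round(cross_entropy([0.0, 0.0], label=1), 3))   # 0.693
print(round(cross_entropy([-2.0, 3.0], label=1), 3))  # 0.007
```

An undecided model (equal logits) scores ln 2 ≈ 0.693 on a two-class problem, which is a useful baseline when reading the training log.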

In [0]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """
    # pick the class with the highest score for each example
    pred_flat = torch.argmax(preds, dim=1).flatten()
    labels_flat = y.flatten()
    correct = (pred_flat == labels_flat).float()
    acc = correct.sum() / len(correct)
    return acc

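We need to compute an attention mask as input to the `BertForSequenceClassification` model. The mask is 1 for real tokens and 0 for padding; since the `[PAD]` index is 0, we can obtain it with `inputs.ne(0)`.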

In [0]:
def train(model, iterator, optimizer):
    epoch_loss = 0
    epoch_acc = 0

    model.train()
    for batch in iterator:
        optimizer.zero_grad()

        inputs = batch.text
        masks = inputs.ne(0).float()
        labels = batch.label.long() ### required for the loss computation
        
        outputs = model(input_ids=inputs, 
                    token_type_ids=None, 
                    attention_mask=masks, 
                    labels=labels)
        
        loss = outputs[0]
        predictions = outputs[1]
        #print(loss)
        #print(predictions)
        acc = binary_accuracy(predictions, labels)
        #print(acc)
        
        loss.backward()
        optimizer.step()
        # Update the learning rate.
        scheduler.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)



In [0]:
def evaluate(model, iterator):
    epoch_loss = 0
    epoch_acc = 0

    model.eval()

    with torch.no_grad():
        for batch in iterator:
            inputs = batch.text
            masks = inputs.ne(0).float()
            labels = batch.label.long() ### required for the loss computation
            
            outputs = model(input_ids=inputs, 
                        token_type_ids=None, 
                        attention_mask=masks, 
                        labels=labels)
            
            loss = outputs[0]
            predictions = outputs[1]
            #print(loss)
            #print(predictions)
            acc = binary_accuracy(predictions, labels)
            #print(acc)
            
            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [0]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

Finally, we'll train our model. This takes considerably longer than any of the previous models due to the size of the transformer. Even if we freeze the transformer's parameters, we still need to pass the data through the model, which takes a considerable amount of time on a standard GPU.

In [33]:
best_valid_loss = float('inf')
MODEL_PARAS_OBJ = 'bert-plain.pt'

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer)
    valid_loss, valid_acc = evaluate(model, valid_iterator)
        
    end_time = time.time()
        
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
        
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), MODEL_PARAS_OBJ)
    
    print(f'Epoch: {epoch+1} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 1 | Epoch Time: 8m 20s
	Train Loss: 0.277 | Train Acc: 88.29%
	 Val. Loss: 0.223 |  Val. Acc: 90.98%
Epoch: 2 | Epoch Time: 8m 21s
	Train Loss: 0.146 | Train Acc: 94.78%
	 Val. Loss: 0.232 |  Val. Acc: 91.48%
Epoch: 3 | Epoch Time: 8m 21s
	Train Loss: 0.069 | Train Acc: 97.87%
	 Val. Loss: 0.269 |  Val. Acc: 91.59%
Epoch: 4 | Epoch Time: 8m 20s
	Train Loss: 0.032 | Train Acc: 99.23%
	 Val. Loss: 0.321 |  Val. Acc: 91.56%
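Note that the validation loss is lowest after the first epoch and rises afterwards while the training loss keeps falling, a typical sign of overfitting. The small replay below applies the same `valid_loss < best_valid_loss` test as the training loop to the validation losses printed above, to show which epoch's weights end up in the saved checkpoint. The variable `best_epoch` is introduced here for illustration.

```python
valid_losses = [0.223, 0.232, 0.269, 0.321]  # Val. Loss for epochs 1-4 from the log

best_valid_loss = float("inf")
best_epoch = None
for epoch, valid_loss in enumerate(valid_losses, start=1):
    if valid_loss < best_valid_loss:   # same test as in the training loop
        best_valid_loss = valid_loss
        best_epoch = epoch             # the loop would call torch.save here

print(best_epoch, best_valid_loss)  # 1 0.223
```

So the checkpoint holds the epoch-1 weights, even though validation accuracy continued to creep up slightly in later epochs.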


### Evaluation

With training finished, we load the checkpoint with the lowest validation loss and evaluate it on the test set.

In [34]:
model.load_state_dict(torch.load(MODEL_PARAS_OBJ))
test_loss, test_acc = evaluate(model, test_iterator)
print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.214 | Test Acc: 91.26%


## Inference
We'll then use the model to test the sentiment of some sequences. We tokenize the input sequence, trim it down to the maximum length, add the special tokens to either side, convert it to a tensor, add a fake batch dimension and then pass it through our model.

In [0]:
def predict_sentiment(model, tokenizer, sentence):
    model.eval()
    tokens = tokenizer.tokenize(sentence)
    tokens = tokens[:MAX_LENGTH-2]
    indexed = [init_token_idx] + tokenizer.convert_tokens_to_ids(tokens) + [eos_token_idx]
    tensor = torch.LongTensor(indexed).to(device)
    mask = torch.ones_like(tensor).to(device)
    tensor = tensor.unsqueeze(0)
    mask = mask.unsqueeze(0)
    with torch.no_grad():  # gradients are not needed at inference time
        prediction = model(tensor, attention_mask=mask)
    output = torch.nn.functional.softmax(prediction[0], dim=1).squeeze()[1]
    return output.item()

In [36]:
predict_sentiment(model, tokenizer, "This film is terrible")

0.03746431693434715
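The returned value is the softmax probability of the `pos` class (index 1 in the label vocabulary). For two classes this is the same as a sigmoid applied to the difference of the two logits, as the sketch below shows. The helpers `positive_probability` and `to_label` and the 0.5 threshold are illustrative assumptions, not part of the notebook's code.

```python
import math


def positive_probability(logit_neg, logit_pos):
    """Softmax probability of the 'pos' class for a two-class model."""
    return 1.0 / (1.0 + math.exp(logit_neg - logit_pos))


def to_label(probability, threshold=0.5):
    """Map a 'pos' probability to a label name."""
    return "pos" if probability >= threshold else "neg"


print(to_label(0.03746431693434715))  # neg
```

A score of about 0.037 therefore means the model assigns roughly a 96% probability to the review being negative.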