## Contents
1. Training a Mini-GPT from Scratch

	1) Data Preprocess

	2) Building the Causal Language Model
2. BERT

	1) What is BERT? - Masked Language Modeling, Next Sentence Prediction

	2) Sentiment Analysis with BERT

	3) Sentiment Classification with BERT


# Training a Mini-GPT from Scratch

In [None]:
!pip3 install datasets
!pip3 install easydict
!pip3 install transformers
# !pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

In [None]:
from datasets import load_dataset
from typing import List
import torch

### 1. Data Preprocess
We'll use the OpenWebText-10k Dataset to train and test our model.

In [None]:
# Load the dataset (only a "train" split is requested here; we split it into train/valid/test below)
data = load_dataset("stas/openwebtext-10k", split="train", trust_remote_code=True)
print(data)

In [None]:
train_split = 0.8
valid_split = 0.1
test_split = 0.1

# Split the dataset into train, valid, and test
split_data = data.train_test_split(test_size=valid_split + test_split, seed=42)
valid_test_split = split_data['test'].train_test_split(test_size=test_split / (valid_split + test_split), seed=42)

# Combine data into dictionary
split = {
    'train': split_data['train'],
    'valid': valid_test_split['train'],
    'test': valid_test_split['test']
}
print(len(split['train']), len(split['valid']), len(split['test']))
print(split['train'])
split['train']['text'][0]
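
The two-stage split above first holds out `valid_split + test_split` of the data and then divides that hold-out set again. Here is a small arithmetic sketch of the resulting sizes. It assumes the dataset has 10,000 examples (check `len(data)`), the helper name `split_sizes` is ours, and the exact rounding inside `datasets` may differ by one example.

```python
def split_sizes(n_total, valid_frac, test_frac):
    """Approximate (train, valid, test) sizes of the two-stage split (hypothetical helper)."""
    holdout_frac = valid_frac + test_frac                  # first call: test_size = 0.2
    n_holdout = round(n_total * holdout_frac)
    n_test = round(n_holdout * test_frac / holdout_frac)   # second call: test_size = 0.5
    return n_total - n_holdout, n_holdout - n_test, n_test


print(split_sizes(10_000, valid_frac=0.1, test_frac=0.1))
```

Note that the second `test_size` is a fraction of the hold-out set, not of the full dataset, which is why it is computed as `test_split / (valid_split + test_split)`.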

In [26]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")

def preprocess_function(sample):
  return tokenizer([" ".join(x.split("\n\n")) for x in sample['text']])

In [27]:
tokenized_split = dict()

# apply map function to each split[key] element with multiple processes.
for key in split:
	tokenized_split[key] = split[key].map(
		preprocess_function,
		batched=True, # operate on batches
		num_proc=4,
		remove_columns=split["train"].column_names,
	)

In [None]:
print(tokenized_split['train'])
print(tokenized_split['train']['input_ids'][0][:30])
print(len(tokenized_split['train']['input_ids'][0]))

In [29]:
block_size = 128

def group_texts(samples):
    # Concatenate all tokens.
    concatenated_samples = {k: sum(samples[k], []) for k in samples.keys()}
    total_length = len(concatenated_samples[list(samples.keys())[0]]) # total length of input_ids
    if total_length >= (block_size-2):
        total_length = (total_length // (block_size-2)) * (block_size-2)
    # Split by chunks of block_size, and add bos/eos token on each end
    result = {
        k: [[tokenizer.bos_token_id]+t[i : i + block_size-2]+[tokenizer.eos_token_id] for i in range(0, total_length, block_size-2)]
        for k, t in concatenated_samples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
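
To see what `group_texts` does without running the tokenizer, here is a plain-Python sketch of the same chunking logic on made-up token ids. `BOS` and `EOS` are placeholder values (GPT-2 style tokenizers typically use the same `<|endoftext|>` id for both), and `chunk_with_specials` is a hypothetical helper name.

```python
BOS, EOS = -1, -2  # placeholder ids, chosen only so they stand out in the output


def chunk_with_specials(token_lists, block_size):
    body = block_size - 2                        # room left after adding BOS and EOS
    flat = [tok for toks in token_lists for tok in toks]  # concatenate all documents
    total = len(flat)
    if total >= body:
        total = (total // body) * body           # drop the incomplete tail
    return [[BOS] + flat[i:i + body] + [EOS] for i in range(0, total, body)]


chunks = chunk_with_specials([[0, 1, 2], [3, 4, 5, 6], [7, 8, 9]], block_size=6)
print(chunks)
```

Document boundaries are ignored: tokens from neighbouring texts end up in the same chunk, and the leftover tokens at the end of a batch are discarded.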

In [None]:
lm_dataset = dict()
for key in tokenized_split:
	lm_dataset[key] = tokenized_split[key].map(group_texts, batched=True, num_proc=4)

print(lm_dataset)
print(lm_dataset['train'][4]['input_ids'])
print(len(lm_dataset['train'][3]['input_ids']))

In [None]:
webtext_dataset = dict()
for key in lm_dataset:
	result = []
	for i in lm_dataset[key]:
		result.append(i['input_ids']) # create a single nested list of all input_ids
	webtext_dataset[key] = torch.tensor(result).to(torch.int64) # modify to torch.tensor

print(webtext_dataset["train"])
print(webtext_dataset["train"].shape)

In [None]:
print(tokenizer.decode(webtext_dataset["train"][0]))

In [None]:
print(tokenizer.bos_token_id)
print(tokenizer.bos_token)
print(tokenizer.eos_token)

### 2. Building a Causal Language Model

Text generation is best addressed with auto-regressive (causal) language models such as GPT. GPT is a Transformer-based, decoder-only model: it has no encoder. Each decoder layer of a GPT-style model therefore consists of masked self-attention and a feed-forward network, without the encoder-decoder attention.

The structure of a decoder layer is shown below:
<div>
<img src="https://production-media.paperswithcode.com/methods/Screen_Shot_2020-05-27_at_12.41.44_PM.png" width="1000"/>
</div>

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

import math
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(DEVICE)

We will build the Mini-GPT model on top of the Transformer model we implemented last week. We'll reuse the MultiHeadAttention, PositionWiseFeedForward, SinusoidalPositionalEmbedding, and Decoder modules. The only module we need to modify is the DecoderLayer, so that it has no enc-dec attention.
#### 1) Basic Building Blocks
##### a. Multi-Head Attention
__Multi-head attention:__
<img src="http://jalammar.github.io/images/t/transformer_multi-headed_self-attention-recap.png" width="650"/>
* Equation:
$$\begin{align} \text{MultiHead}(Q, K, V) &= \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O \\
\text{where head}_i &= \text{Attention} \left( QW^Q_i, K W^K_i, VW^V_i \right)
\end{align}$$

In [35]:
class MultiHeadAttention(nn.Module):
    """Multi-headed attention from 'Attention Is All You Need' paper"""

    def __init__(
        self,
        emb_dim,
        num_heads,
        dropout=0.0,
        bias=False,
        encoder_decoder_attention=False,  # otherwise self_attention
        causal = False
    ):
        super().__init__()
        self.emb_dim = emb_dim
        self.num_heads = num_heads
        self.dropout = dropout
        self.head_dim = emb_dim // num_heads
        assert self.head_dim * num_heads == self.emb_dim, "emb_dim must be divisible by num_heads"

        self.encoder_decoder_attention = encoder_decoder_attention
        self.causal = causal
        self.q_proj = nn.Linear(emb_dim, emb_dim, bias=bias)
        self.k_proj = nn.Linear(emb_dim, emb_dim, bias=bias)
        self.v_proj = nn.Linear(emb_dim, emb_dim, bias=bias)
        self.out_proj = nn.Linear(emb_dim, emb_dim, bias=bias)


    def transpose_for_scores(self, x):
        new_x_shape = x.size()[:-1] + (
            self.num_heads,
            self.head_dim,
        )
        x = x.view(*new_x_shape)
        return x.permute(0, 2, 1, 3)
        # This is equivalent to
        # return x.transpose(1,2)


    def scaled_dot_product(self,
                           query: torch.Tensor,
                           key: torch.Tensor,
                           value: torch.Tensor,
                           attention_mask: torch.BoolTensor):

        attn_weights = torch.matmul(query, key.transpose(-1, -2)) / math.sqrt(self.emb_dim) # QK^T/sqrt(d)

        if attention_mask is not None:
            attn_weights = attn_weights.masked_fill(attention_mask.unsqueeze(1), float("-inf"))

        attn_weights = F.softmax(attn_weights, dim=-1)  # softmax(QK^T/sqrt(d))
        attn_probs = F.dropout(attn_weights, p=self.dropout, training=self.training)
        attn_output = torch.matmul(attn_probs, value) # softmax(QK^T/sqrt(d))V

        return attn_output, attn_probs


    def MultiHead_scaled_dot_product(self,
                       query: torch.Tensor,
                       key: torch.Tensor,
                       value: torch.Tensor,
                       attention_mask: torch.BoolTensor):

        attn_weights = torch.matmul(query, key.transpose(-1, -2)) / math.sqrt(self.head_dim) # QK^T/sqrt(d)

        # Attention mask
        if attention_mask is not None:
            if self.causal:
              # (seq_len x seq_len)
                attn_weights = attn_weights.masked_fill(attention_mask.unsqueeze(0).unsqueeze(1), float("-inf"))
            else:
              # (batch_size x seq_len)
                attn_weights = attn_weights.masked_fill(attention_mask.unsqueeze(1).unsqueeze(2), float("-inf"))


        attn_weights = F.softmax(attn_weights, dim=-1)  # softmax(QK^T/sqrt(d))
        attn_probs = F.dropout(attn_weights, p=self.dropout, training=self.training)

        attn_output = torch.matmul(attn_probs, value) # softmax(QK^T/sqrt(d))V
        attn_output = attn_output.permute(0, 2, 1, 3).contiguous()
        concat_attn_output_shape = attn_output.size()[:-2] + (self.emb_dim,)
        attn_output = attn_output.view(*concat_attn_output_shape)
        attn_output = self.out_proj(attn_output)

        return attn_output, attn_weights


    def forward(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        attention_mask: torch.Tensor = None,
        ):

        q = self.q_proj(query)
        # Enc-Dec attention
        if self.encoder_decoder_attention:
            k = self.k_proj(key)
            v = self.v_proj(key)
        # Self attention
        else:
            k = self.k_proj(query)
            v = self.v_proj(query)

        q = self.transpose_for_scores(q)
        k = self.transpose_for_scores(k)
        v = self.transpose_for_scores(v)

        attn_output, attn_weights = self.MultiHead_scaled_dot_product(q,k,v,attention_mask)
        return attn_output, attn_weights
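
To make the core computation concrete, here is a plain-Python sketch of scaled dot-product attention for a single head, using nested lists instead of tensors. The function names and the toy inputs are ours. As in the class above, `True` in the mask means "do not attend to this position".

```python
import math


def softmax(row):
    m = max(row)                                  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]         # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v, mask=None):
    """softmax(QK^T / sqrt(d)) V for one head; q, k, v are lists of vectors."""
    d = len(q[0])
    outputs, weights = [], []
    for i, q_i in enumerate(q):
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d) for k_j in k]
        if mask is not None:
            scores = [float("-inf") if mask[i][j] else s for j, s in enumerate(scores)]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * v[j][c] for j in range(len(v))) for c in range(len(v[0]))])
    return outputs, weights


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]          # toy sequence of three 2-d vectors
mask = [[False, True, True],
        [False, False, True],
        [False, False, False]]                    # hand-written causal mask
out, w = attention(x, x, x, mask)
for row in w:
    print([round(p, 3) for p in row])
```

Every row of the weights sums to 1, and masked positions receive exactly zero weight. The sketch assumes that no row is fully masked; otherwise the softmax would be undefined.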


##### b. Position-wise Feed-Forward Network
<a id='1b'></a>
In this section, we will implement the position-wise feed-forward network:

$$\text{FFN}(x) = \max \left(0, x W_1 + b_1 \right) W_2 + b_2$$

In [36]:
class PositionWiseFeedForward(nn.Module):

    def __init__(self, emb_dim: int, d_ff: int, dropout: float = 0.1):
        super(PositionWiseFeedForward, self).__init__()

        self.activation = nn.ReLU()
        self.w_1 = nn.Linear(emb_dim, d_ff)
        self.w_2 = nn.Linear(d_ff, emb_dim)
        self.dropout = dropout

    def forward(self, x):
        residual = x
        x = self.activation(self.w_1(x))
        x = F.dropout(x, p=self.dropout, training=self.training)

        x = self.w_2(x)
        x = F.dropout(x, p=self.dropout, training=self.training)
        return x + residual # residual connection for preventing gradient vanishing

##### c. Sinusoidal Positional Encoding
<a id='1c'></a>
In this section, we will implement the sinusoidal positional encoding:

$$\begin{align}
PE(pos, 2i) &= \sin \left( pos / 10000^{2i / d_{model}} \right)  \\
PE(pos, 2i+1) &= \cos \left( pos / 10000^{2i / d_{model}} \right)  
\end{align}$$

In [37]:
import numpy as np

# Since Transformer contains no recurrence and no convolution,
# in order for the model to make use of the order of the sequence,
# we must inject some information about the relative or absolute position of the tokens in the sequence.
# To this end, we add “positional encodings” to the input embeddings at the bottoms of the encoder and decoder stacks.
# There are many choices of positional encodings, learned and fixed

class SinusoidalPositionalEmbedding(nn.Embedding):

    def __init__(self, num_positions, embedding_dim, padding_idx=None):
        super().__init__(num_positions, embedding_dim) # torch.nn.Embedding(num_embeddings, embedding_dim)
        self.weight = self._init_weight(self.weight) # self.weight => nn.Embedding(num_positions, embedding_dim).weight

    @staticmethod
    def _init_weight(out: nn.Parameter):
        n_pos, embed_dim = out.shape
        pe = nn.Parameter(torch.zeros(out.shape))
        for pos in range(n_pos):
            for i in range(0, embed_dim, 2):
                pe[pos, i].data.copy_( torch.tensor( np.sin(pos / (10000 ** ( i / embed_dim)))) )
                # index i + 1 uses the same frequency as index i, matching PE(pos, 2i+1) in the formula
                pe[pos, i + 1].data.copy_( torch.tensor( np.cos(pos / (10000 ** ( i / embed_dim)))) )
        pe.detach_()

        return pe

    @torch.no_grad()
    def forward(self, input_ids):
      bsz, seq_len = input_ids.shape[:2]
      positions = torch.arange(seq_len, dtype=torch.long, device=self.weight.device)
      return super().forward(positions)
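
The formula can also be checked by hand with a few lines of plain Python. This is a minimal sketch, independent of the `nn.Embedding` subclass above; `sinusoidal_table` is a hypothetical helper name. Each even/odd pair of dimensions shares one frequency and differs only in using sine or cosine.

```python
import math


def sinusoidal_table(num_positions, dim):
    table = [[0.0] * dim for _ in range(num_positions)]
    for pos in range(num_positions):
        for i in range(0, dim, 2):                 # i plays the role of 2i in the formula
            angle = pos / (10000 ** (i / dim))
            table[pos][i] = math.sin(angle)
            if i + 1 < dim:                        # guard for odd embedding sizes
                table[pos][i + 1] = math.cos(angle)
    return table


table = sinusoidal_table(num_positions=4, dim=6)
print([round(v, 3) for v in table[1]])
```

Position 0 always encodes to alternating zeros and ones, and within each pair the squared sine and cosine add up to 1.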


#### 2) Transformer Decoder
##### a. Decoder Layer
In this section, we will build a Transformer Decoder layer without encoder-decoder attention.
<div>
<img src="https://production-media.paperswithcode.com/methods/Screen_Shot_2020-05-27_at_12.41.44_PM.png" width="1000"/>
</div>

In [38]:
class DecoderLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.emb_dim = config.emb_dim
        self.ffn_dim = config.ffn_dim
        self.self_attn = MultiHeadAttention(
            emb_dim=self.emb_dim,
            num_heads=config.attention_heads,
            dropout=config.attention_dropout,
            causal=True,
        )
        self.dropout = config.dropout
        self.self_attn_layer_norm = nn.LayerNorm(self.emb_dim)
        self.PositionWiseFeedForward = PositionWiseFeedForward(self.emb_dim, self.ffn_dim, config.dropout)
        self.final_layer_norm = nn.LayerNorm(self.emb_dim)


    def forward( # we only need the self-attention and fully connected layers
        self,
        x,
        causal_mask=None,
    ):
        residual = x
        # Self Attention
        x, self_attn_weights = self.self_attn(
            query=x,
            key=x, # adds keys to layer state
            attention_mask=causal_mask,
        )
        x = F.dropout(x, p=self.dropout, training=self.training)
        x = residual + x
        x = self.self_attn_layer_norm(x)

        # Fully Connected
        # PositionWiseFeedForward already applies dropout and adds its own
        # residual connection, so we only apply layer normalization here.
        x = self.PositionWiseFeedForward(x)
        x = self.final_layer_norm(x)

        return (
            x,
            self_attn_weights,
        )

In [39]:
class Decoder(nn.Module):

    def __init__(self, config, embed_tokens: nn.Embedding):
        super().__init__()
        self.dropout = config.dropout
        self.padding_idx = embed_tokens.padding_idx
        self.max_target_positions = config.max_position_embeddings
        self.embed_tokens = embed_tokens
        self.embed_positions = SinusoidalPositionalEmbedding(
            config.max_position_embeddings, config.emb_dim, self.padding_idx
        )
        self.layers = nn.ModuleList([DecoderLayer(config) for _ in range(config.decoder_layers)])  # type: List[DecoderLayer]

    def forward(
        self,
        input_ids,
        decoder_causal_mask,
    ):

        # embed positions
        positions = self.embed_positions(input_ids)
        x = self.embed_tokens(input_ids)
        x += positions

        x = F.dropout(x, p=self.dropout, training=self.training)

        # decoder layers
        for idx, decoder_layer in enumerate(self.layers):
            x, layer_self_attn = decoder_layer(
                x,
                causal_mask=decoder_causal_mask,
            )

        return x # we don't need to return cross-attention-scores

#### 3) Mini-GPT Model
Let's build our mini-GPT model with the basic building blocks, and modified transformer decoder blocks.

In [40]:
class GPT(nn.Module):
    def __init__(self, tokenizer, config):
        super().__init__()
        ##############TODO###############
        # create decoder embedding, decoder, prediction_head
        ##################################

        self.init_weights()

    def generate_mask(self, trg): # we don't need the mask encoder attention part
        # Mask decoder attention for causality
        tmp = torch.ones(trg.size(1), trg.size(1), dtype=torch.bool)
        mask = torch.arange(tmp.size(-1))
        dec_attention_mask = tmp.masked_fill_(mask < (mask + 1).view(tmp.size(-1), 1), False).to(DEVICE)

        return dec_attention_mask

    def init_weights(self):
        for name, param in self.named_parameters():
            if param.requires_grad:
                if 'weight' in name:
                    nn.init.normal_(param.data, mean=0, std=0.01) # weight initialization by normal dist
                else:
                    nn.init.constant_(param.data, 0)

    def forward(
        self,
        trg,
    ):
        ##############TODO###############
        # 1. Generate decoder causal mask
        # 2. Put input in decoder layers
        # 3. Put decoder output through prediction head
        ################################

        return decoder_output
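
The mask convention used by `generate_mask` is easy to get backwards, so here is a plain-Python sketch of the same pattern. `causal_mask` is a hypothetical helper; `True` marks positions that must not be attended to, i.e. every column `j` to the right of row `i`.

```python
def causal_mask(seq_len):
    """True at (i, j) means: query position i may NOT look at key position j."""
    return [[j > i for j in range(seq_len)] for i in range(seq_len)]


mask = causal_mask(4)
for row in mask:
    print("".join("x" if blocked else "." for blocked in row))
```

The first row can only see itself, while the last row can see the whole sequence. This is what prevents the model from looking at the tokens it is supposed to predict.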

### 3. Training our Model

In [41]:
torch.manual_seed(0)
import easydict
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader

config = easydict.EasyDict({
    "emb_dim":128,
    "ffn_dim":512,
    "attention_heads":8,
    "attention_dropout":0.0,
    "dropout":0.2,
    "max_position_embeddings":128,
    "decoder_layers":6,
})

learning_rate = 5e-4
BATCH_SIZE = 64 # 128

model = GPT(tokenizer, config)
model.to(DEVICE)
optimizer = optim.Adam(model.parameters(),lr=learning_rate, betas=(0.9, 0.98), eps=1e-9)
criterion = nn.CrossEntropyLoss()

best_valid_loss = float('inf')

In [None]:
train_dataloader = DataLoader(webtext_dataset["train"], batch_size=BATCH_SIZE)
valid_dataloader = DataLoader(webtext_dataset["valid"], batch_size=BATCH_SIZE)
test_iter = webtext_dataset["test"]

print(next(iter(train_dataloader)).shape)

In [43]:
from tqdm import tqdm

def train_epoch(model, optimizer):
    model.train()
    losses = 0

    for tgt in tqdm(train_dataloader):
        tgt = tgt.to(DEVICE)

        logits = model(tgt)

        optimizer.zero_grad()

        tgt_out = tgt[:, 1:]
        logits = logits[:, :-1]
        loss = criterion(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        loss.backward()

        optimizer.step()
        losses += loss.item()

        del tgt

    return losses / len(train_dataloader)  # number of batches; no need to iterate the loader again


def evaluate(model):
    model.eval()
    losses = 0

    with torch.no_grad():
        for tgt in valid_dataloader:
            tgt = tgt.to(DEVICE)

            logits = model(tgt)

            tgt_out = tgt[:, 1:]
            logits = logits[:, :-1]
            loss = criterion(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
            losses += loss.item()

            del tgt

    return losses / len(valid_dataloader)  # number of batches; no need to iterate the loader again
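
Both loops slice the tensors with `tgt[:, 1:]` and `logits[:, :-1]`. The sketch below shows, on a toy sequence of strings, which prediction is compared against which token. `next_token_pairs` is a hypothetical helper used only for illustration.

```python
def next_token_pairs(token_ids):
    """Pair each input position with the token the model should predict there."""
    inputs = token_ids[:-1]    # predictions made at these positions are scored
    targets = token_ids[1:]    # each target is the token one step ahead
    return list(zip(inputs, targets))


seq = ["<bos>", "the", "cat", "sat", "<eos>"]
for seen, expected in next_token_pairs(seq):
    print(f"after {seen!r:8} predict {expected!r}")
```

The prediction at the last position is dropped because there is no next token to compare it with, and the first token is never a target because nothing precedes it.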

In [None]:
from timeit import default_timer as timer
NUM_EPOCHS = 10 # 50
best_loss = float("inf")

for epoch in range(1, NUM_EPOCHS+1):
    start_time = timer()
    train_loss = train_epoch(model, optimizer)
    end_time = timer()
    val_loss = evaluate(model)
    # torch.save(model.state_dict(), f'model_{epoch}.pt')
    # if val_loss < best_loss:
    #     torch.save(model.state_dict(), f'best_model.pt')
    #     best_loss = val_loss
    print((f"Epoch: {epoch}, Train loss: {train_loss:.3f}, Val loss: {val_loss:.3f}, "f"Epoch time = {(end_time - start_time):.3f}s"))


### 4. Greedy Decoding for Test Evaluation

In [45]:
# function to generate output sequence using greedy algorithm
def greedy_decode(model, input, max_len):
    ##############TODO###############
    # 1. get model output
    # 2. find maximum value for each predicted token
    # 3. get last prediction (our target word)
    # 4. concat greedy decoded token to sequence of words for next input
    #################################
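
Before implementing the function for our GPT, it can help to see the greedy loop in isolation. The sketch below uses a hypothetical `toy_model` that returns scores for the next token only; our GPT instead returns logits for every position, so there you would take the scores at the last position. This is an illustration of the algorithm, not the solution to the exercise.

```python
def toy_model(prefix):
    """Hypothetical stand-in for a language model over a 4-token vocabulary."""
    vocab_size = 4
    preferred = (prefix[-1] + 1) % vocab_size      # always prefers "last token + 1"
    return [1.0 if tok == preferred else 0.0 for tok in range(vocab_size)]


def greedy_decode_sketch(model_fn, prefix, max_new_tokens, eos_id=None):
    seq = list(prefix)
    for _ in range(max_new_tokens):
        scores = model_fn(seq)                                   # scores for the next token
        best = max(range(len(scores)), key=scores.__getitem__)   # argmax
        seq.append(best)                                         # feed it back as input
        if eos_id is not None and best == eos_id:
            break
    return seq


print(greedy_decode_sketch(toy_model, [0], max_new_tokens=5))
```

Greedy decoding always picks the single most likely token, which makes it deterministic but prone to repetitive output.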

Let's print our greedy-decoded test generated results.

In [46]:
import random

random.seed(1234)

In [None]:
idx = random.choice(range(len(test_iter)))
test_data = torch.unsqueeze(test_iter[idx][:20], 0).to(DEVICE)
print(tokenizer.decode(test_data[0]))
print("Generating...")
generated = greedy_decode(model, test_data, 30)
print(tokenizer.decode(generated[0][1:]))

## What is BERT?

BERT (introduced in [this paper](https://arxiv.org/abs/1810.04805)) stands for Bidirectional Encoder Representations from Transformers.

- Bidirectional - to understand the text you're looking at, you have to look back (at the previous words) and forward (at the next words)
- Transformers - The [Attention Is All You Need](https://arxiv.org/abs/1706.03762) paper presented the Transformer model. The Transformer reads entire sequences of tokens at once. The attention mechanism allows for learning contextual relations between words (e.g. `his` in a sentence refers to Jim).
- (Pre-trained) contextualized word embeddings - [The ELMo paper](https://arxiv.org/abs/1802.05365v2) introduced a way to encode words based on their context. For example, "nails" has multiple meanings: fingernails and metal nails.

BERT was trained by masking 15% of the tokens with the goal to guess them. An additional objective was to predict the next sentence. Let's look at examples of these tasks:

### Masked Language Modeling (Masked LM)

The objective of this task is to guess the masked tokens. Let's look at an example, and try to not make it harder than it has to be:

That's `[MASK]` she `[MASK]` -> That's what she said
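
Here is a minimal sketch of how such training examples could be produced. It is a simplification: it masks a fixed number of positions (about 15%) and always inserts `[MASK]`, whereas the BERT paper samples positions at random and sometimes keeps the original token or substitutes a random one. `mask_tokens` is a hypothetical helper.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = ["[MASK]" if i in positions else tok for i, tok in enumerate(tokens)]
    # labels hold the original token at masked positions and None elsewhere
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels


tokens = "That 's what she said and everyone in the office laughed out loud".split()
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```

Only the masked positions contribute to the loss; the model sees the surrounding words on both sides when it guesses them.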

### Next Sentence Prediction (NSP)

Given a pair of two sentences, the task is to say whether or not the second follows the first (binary classification). Let's continue with the example:

*Input* = `[CLS]` That's `[MASK]` she `[MASK]`. `[SEP]` Hahaha, nice! `[SEP]`

*Label* = *IsNext*

*Input* = `[CLS]` That's `[MASK]` she `[MASK]`. `[SEP]` Dwight, you ignorant `[MASK]`! `[SEP]`

*Label* = *NotNext*
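
The layout of such a sentence pair can be sketched in a few lines. The helper `build_pair` is hypothetical; the segment ids (0 for the first sentence, 1 for the second) correspond to what the Transformers library calls `token_type_ids`.

```python
def build_pair(tokens_a, tokens_b):
    """Lay out two tokenized sentences as [CLS] A [SEP] B [SEP] with segment ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


pair_tokens, pair_segments = build_pair(["That's", "[MASK]", "she", "[MASK]", "."],
                                        ["Hahaha", ",", "nice", "!"])
print(pair_tokens)
print(pair_segments)
```

The `[CLS]` position is the one whose final representation is used for the IsNext / NotNext decision.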

The training corpus consisted of two datasets: the [Toronto Book Corpus](https://arxiv.org/abs/1506.06724) (800M words) and English Wikipedia (2,500M words). While the original Transformer has an encoder (for reading the input) and a decoder (for making the prediction), BERT uses only the encoder.

BERT is simply a pre-trained stack of Transformer encoder layers. How many layers? There are two versions: BERT Base with 12 and BERT Large with 24.

### Is This Thing Useful in Practice?

The BERT paper was released along with [the source code](https://github.com/google-research/bert) and pre-trained models.

The best part is that you can do Transfer Learning (thanks to the ideas from OpenAI Transformer) with BERT for many NLP tasks - Classification, Question Answering, Entity Recognition, etc. You can train with small amounts of data and achieve great performance!

## Contents

1. Sentiment Analysis with BERT: I'll show you the fine-tuning process using the Huggingface package.

# 1. Sentiment Analysis with BERT

> In this section, you'll learn how to fine-tune BERT for sentiment analysis. You'll do the required text preprocessing (special tokens, padding, and attention masks) and build a Sentiment Classifier using the Transformers library by Hugging Face!

You'll learn how to:

- Intuitively understand what BERT is
- Preprocess text data for BERT and build PyTorch Dataset (tokenization, attention masks, and padding)
- Use Transfer Learning to build Sentiment Classifier using the Transformers library by Hugging Face
- Evaluate the model on test data
- Predict sentiment on raw text

## Setup

In [None]:
!pip3 install seaborn
!pip3 install scikit-learn

In [None]:
#Setup & Config
from transformers import BertModel, BertTokenizer, AdamW, get_linear_schedule_with_warmup
import torch

import numpy as np
import pandas as pd
import seaborn as sns
from pylab import rcParams
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from collections import defaultdict

from torch import nn
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F

import warnings
warnings.filterwarnings(action='ignore')

%matplotlib inline
%config InlineBackend.figure_format='retina'

sns.set(style='whitegrid', palette='muted', font_scale=1.2)

rcParams['figure.figsize'] = 12, 8

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device

## Data Exploration

We'll load the Google Play app reviews dataset that we put together in the previous part:

In [None]:
!wget https://www.dropbox.com/s/kt16vthpeyddscz/reviews.csv
df = pd.read_csv("reviews.csv")
df.head()

In [None]:
df.shape

In [None]:
df.info()

In [None]:
print(df.score.value_counts())
sns.countplot(df, x="score")
plt.xlabel('review score');

We're going to convert the dataset into negative, neutral and positive sentiment:

In [59]:
def to_sentiment(rating):
  rating = int(rating)
  if rating <= 2:
    return 0
  elif rating == 3:
    return 1
  else:
    return 2

df['sentiment'] = df.score.apply(to_sentiment)

In [60]:
class_names = ['negative', 'neutral', 'positive']

In [None]:
ax = sns.countplot(df, x='sentiment')
plt.xlabel('review sentiment')
ax.set_xticklabels(class_names);

## Data Preprocessing

You might already know that Machine Learning models don't work with raw text. You need to convert text to numbers (of some sort). BERT requires even more attention. Here are the requirements:

- Add special tokens to separate sentences and do classification
- Pass sequences of constant length (introduce padding)
- Create array of 0s (pad token) and 1s (real token) called *attention mask*
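
The last two requirements can be sketched in plain Python. The ids below are made up and `pad_and_mask` is a hypothetical helper; in practice the tokenizer does all of this for us, and it also keeps the closing `[SEP]` token when it truncates, which this sketch does not.

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    ids = token_ids[:max_len]                 # naive truncation to max_len
    mask = [1] * len(ids)                     # 1 = real token
    n_pad = max_len - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad   # 0 = padding


ids, mask = pad_and_mask([101, 2307, 1204, 102], max_len=6)
print(ids)
print(mask)
```

The attention mask lets the model ignore the padded positions, so padding does not change the representation of the real tokens.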

The Transformers library provides a wide variety of Transformer models (including BERT).

In [62]:
PRE_TRAINED_MODEL_NAME = 'bert-base-cased'

You can use either a cased or an uncased version of BERT and its tokenizer. In this task, the cased version works better. Intuitively, that makes sense, since "BAD" might convey more sentiment than "bad".

Let's load a pre-trained [BertTokenizer](https://huggingface.co/transformers/model_doc/bert.html#berttokenizer):

In [63]:
tokenizer = BertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)

We'll use this text to understand the tokenization process:

In [64]:
sample_txt = 'Nice to meet you. How are you?'

Some basic operations can convert the text to tokens and tokens to unique integers (ids):

In [None]:
tokens = tokenizer.tokenize(sample_txt)
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print(f' Sentence: {sample_txt}')
print(f'   Tokens: {tokens}')
print(f'Token IDs: {token_ids}')

### Special Tokens

`[SEP]` - marker for ending of a sentence

In [None]:
tokenizer.sep_token, tokenizer.sep_token_id

`[CLS]` - we must add this token to the start of each sentence, so BERT knows we're doing classification

In [None]:
tokenizer.cls_token, tokenizer.cls_token_id

There is also a special token for padding:

In [None]:
tokenizer.pad_token, tokenizer.pad_token_id

BERT understands tokens that were in the training set. Everything else can be encoded using the `[UNK]` (unknown) token:

In [None]:
tokenizer.unk_token, tokenizer.unk_token_id

All of that work can be done using the [`encode_plus()`](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer.encode_plus) method:

In [None]:
encoding = tokenizer.encode_plus(
  sample_txt,
  truncation = True,
  max_length=32,
  add_special_tokens=True, # Add '[CLS]' and '[SEP]'
  return_token_type_ids=False,
  padding='max_length',  # replaces the deprecated pad_to_max_length=True
  return_attention_mask=True,
  return_tensors='pt',  # Return PyTorch tensors
)

encoding.keys()

The token ids are now stored in a Tensor and padded to a length of 32:

In [None]:
print(len(encoding['input_ids'][0]))
encoding['input_ids'][0]

The attention mask has the same length:

In [None]:
print(len(encoding['attention_mask'][0]))
encoding['attention_mask'][0]

We can invert the tokenization to have a look at the special tokens:

In [None]:
tokenizer.convert_ids_to_tokens(encoding['input_ids'][0])

### Choosing Sequence Length

BERT works with fixed-length sequences. We'll use a simple strategy to choose the max length. Let's store the token length of each review:

In [74]:
token_lens = []

for txt in df.content:
  tokens = tokenizer.encode(txt, truncation = True, max_length=512)
  token_lens.append(len(tokens))

In [75]:
tokens;

and plot the distribution:

In [None]:
sns.histplot(token_lens, kde=True)  # histplot replaces the deprecated distplot
plt.xlim([0, 256]);
plt.xlabel('Token count');

Most of the reviews seem to contain fewer than 128 tokens, but we'll be on the safe side and choose a maximum length of 160.
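
Instead of reading the threshold off the plot, we can also compute how many reviews fit into a given length. The sketch below uses made-up lengths; in the notebook you would pass `token_lens` instead. `coverage` is a hypothetical helper.

```python
def coverage(lengths, max_len):
    """Fraction of sequences that fit into max_len tokens without truncation."""
    return sum(n <= max_len for n in lengths) / len(lengths)


toy_lengths = [12, 30, 45, 60, 90, 120, 150, 200]   # made-up token counts
for limit in (128, 160):
    print(limit, coverage(toy_lengths, limit))
```

A longer maximum length covers more reviews but makes every batch larger, so the choice is a trade-off between truncation and compute.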

In [77]:
MAX_LEN = 160

We have all building blocks required to create a PyTorch dataset. Let's do it:

In [78]:
class GPReviewDataset(Dataset):

  def __init__(self, reviews, targets, tokenizer, max_len):
    self.reviews = reviews
    self.targets = targets
    self.tokenizer = tokenizer
    self.max_len = max_len

  def __len__(self):
    return len(self.reviews)

  def __getitem__(self, item):
    review = str(self.reviews[item])
    target = self.targets[item]

    encoding = self.tokenizer.encode_plus(
      review,
      add_special_tokens=True,
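      max_length=self.max_len,
      truncation=True,               # cut longer reviews so every item has exactly max_len tokens
      return_token_type_ids=False,
      padding='max_length',          # replaces the deprecated pad_to_max_length=True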
      return_attention_mask=True,
      return_tensors='pt',
    )

    return {
      'review_text': review,
      'input_ids': encoding['input_ids'].flatten(),               # flatten(): [1, 160] -> [160]
      'attention_mask': encoding['attention_mask'].flatten(),
      'targets': torch.tensor(target, dtype=torch.long)
    }

In [79]:
### For your understanding
ds = GPReviewDataset(
reviews=df.content.to_numpy(),        # df.content.to_numpy().shape: (12495,)
targets=df.sentiment.to_numpy(),      # df.sentiment.to_numpy().shape: (12495,)
tokenizer=tokenizer,
max_len=160
)

In [None]:
ds.__len__()

In [None]:
ds.__getitem__(0)['review_text']

In [None]:
ds.__getitem__(0)['input_ids']    # size(): [160]

In [None]:
ds.__getitem__(0)['attention_mask']    # size(): [160]

Let's split the dataset:

In [84]:
df_train, df_test = train_test_split(df, test_size=0.1, random_state=RANDOM_SEED)
df_val, df_test = train_test_split(df_test, test_size=0.5, random_state=RANDOM_SEED)

In [None]:
df_train.shape, df_val.shape, df_test.shape

We also need to create a couple of data loaders. Here's a helper function to do it:

In [86]:
def create_data_loader(df, tokenizer, max_len, batch_size):
  ds = GPReviewDataset(
    reviews=df.content.to_numpy(),        # df.content.to_numpy().shape: (15746,)
    targets=df.sentiment.to_numpy(),      # df.sentiment.to_numpy().shape: (15746,)
    tokenizer=tokenizer,
    max_len=max_len
  )

  return DataLoader(ds, batch_size=batch_size, num_workers=4)

In [87]:
BATCH_SIZE = 16

train_data_loader = create_data_loader(df_train, tokenizer, MAX_LEN, BATCH_SIZE)
val_data_loader = create_data_loader(df_val, tokenizer, MAX_LEN, BATCH_SIZE)
test_data_loader = create_data_loader(df_test, tokenizer, MAX_LEN, BATCH_SIZE)

Let's have a look at an example batch from our training data loader:

In [None]:
sample_batched = next(iter(train_data_loader))
sample_batched.keys()

In [None]:
print(sample_batched['input_ids'].shape)
print(sample_batched['attention_mask'].shape)
print(sample_batched['targets'].shape)

## Sentiment Classification with BERT and Hugging Face

There are a lot of helpers that make using BERT easy with the Transformers library. Depending on the task you might want to use [BertForQuestionAnswering](https://huggingface.co/transformers/model_doc/bert.html#bertforquestionanswering) or something else.

We'll use the basic [BertModel](https://huggingface.co/transformers/model_doc/bert.html#bertmodel) and build our sentiment classifier on top of it. Let's load the model:

In [90]:
bert_model = BertModel.from_pretrained(PRE_TRAINED_MODEL_NAME)

And try to use it on the encoding of our sample text:

In [91]:
last_hidden_state, pooled_output = bert_model(               # pooled_output: last_hidden_state's first token([CLS]) -> nn.Linear(config.hidden_size, config.hidden_size) -> nn.Tanh()
  input_ids=encoding['input_ids'],
  attention_mask=encoding['attention_mask'],
  return_dict=False
)

The `last_hidden_state` is a sequence of hidden states of the last layer of the model. Obtaining the `pooled_output` is done by applying the [BertPooler](https://github.com/huggingface/transformers/blob/edf0582c0be87b60f94f41c659ea779876efc7be/src/transformers/modeling_bert.py#L426) on `last_hidden_state`:

In [None]:
last_hidden_state.size()

In [None]:
pooled_output.size()

We have the hidden state for each of our 32 tokens (the length of our example sequence). But why 768? This is the hidden size of the model, i.e. the dimensionality of each token's representation (the feed-forward sublayers use a larger inner dimension). We can verify that by checking the config:

In [None]:
bert_model.config.hidden_size

You can think of the `pooled_output` as a summary of the content, according to BERT.
So, we'll use `pooled_output` for the sentiment analysis.
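
As the code comment above notes, the pooler takes the final hidden state of the `[CLS]` token and passes it through a linear layer followed by a tanh. Written as a formula, where the symbols $W$, $b$ and $h_{\text{[CLS]}}$ are our own notation rather than names from the library:

$$\text{pooled} = \tanh \left( W \, h_{\text{[CLS]}} + b \right), \qquad W \in \mathbb{R}^{768 \times 768}, \; b \in \mathbb{R}^{768}$$

The pooled output therefore has the same size as a single hidden state, which is why the classifier's linear layer takes `hidden_size` inputs.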

We can use all of this knowledge to create a classifier that uses the BERT model:

In [95]:
class SentimentClassifier(nn.Module):

  def __init__(self, n_classes):
    super(SentimentClassifier, self).__init__()
    self.bert = BertModel.from_pretrained(PRE_TRAINED_MODEL_NAME)
    self.drop = nn.Dropout(p=0.3)
    self.out = nn.Linear(self.bert.config.hidden_size, n_classes)

  def forward(self, input_ids, attention_mask):
    _, pooled_output = self.bert(
      input_ids=input_ids,
      attention_mask=attention_mask,
      return_dict=False
    )
    output = self.drop(pooled_output)
    return self.out(output)

This should work like any other PyTorch model. Let's create an instance and move it to the GPU:

In [96]:
model = SentimentClassifier(len(class_names))       # len(class_names): 3
model = model.to(device)

We'll move the example batch of our training data to the GPU:

In [None]:
input_ids = sample_batched['input_ids'].to(device)
attention_mask = sample_batched['attention_mask'].to(device)

print(input_ids.size()) # (batch size, seq length)
print(attention_mask.size()) # (batch size, seq length)

To get predicted probabilities from our model, we'll apply the softmax function to the outputs. The classifier head has not been trained yet, so the values are not meaningful at this point:

In [None]:
F.softmax(model(input_ids, attention_mask), dim=1)
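For intuition, softmax exponentiates each raw score (logit) and normalizes the results so that every row sums to 1. The following pure-Python sketch uses made-up logits for three hypothetical classes; the numbers are illustrative and are not model output.

```python
import math


def softmax(logits):
    """Convert one row of raw scores (logits) into probabilities."""
    highest = max(logits)
    # Subtracting the max does not change the result but avoids overflow.
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three classes
toy_logits = [0.2, -0.1, 0.5]
toy_probs = softmax(toy_logits)
print(toy_probs)
print(sum(toy_probs))
```

The largest logit always receives the largest probability, so taking the argmax of the logits and of the probabilities yields the same predicted class.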

################################# **Do it yourself from here** #################################

### Training

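To reproduce the training procedure from the BERT paper, we'll use the [AdamW](https://huggingface.co/transformers/main_classes/optimizer_schedules.html#adamw) optimizer provided by Hugging Face. It implements decoupled weight decay, and passing `correct_bias=False` turns off Adam's bias correction, as in the original BERT implementation. Note that this class is deprecated in later Transformers releases in favor of `torch.optim.AdamW`, which does not accept `correct_bias`. We'll also use a linear scheduler with no warmup steps: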

In [99]:
EPOCHS = 3

optimizer = AdamW(model.parameters(), lr=2e-5, correct_bias=False)
total_steps = len(train_data_loader) * EPOCHS

scheduler = get_linear_schedule_with_warmup(
  optimizer,
  num_warmup_steps=0,
  num_training_steps=total_steps
)

loss_fn = nn.CrossEntropyLoss().to(device)
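As a rough illustration of what a linear schedule with warmup does to the learning rate, the sketch below computes the multiplier applied to the base learning rate at a given step. It reflects my understanding of the behavior of `get_linear_schedule_with_warmup`, not the library's source; the function name `linear_lr_factor` and the step count are assumptions made for the example.

```python
def linear_lr_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier for the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warmup phase: ramp linearly from 0 up to 1.
        return step / max(1, num_warmup_steps)
    # Decay phase: fall linearly from 1 down to 0 at the last step.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 2e-5
toy_total_steps = 1000  # hypothetical number of optimizer steps
for step in (0, 250, 500, 1000):
    print(step, base_lr * linear_lr_factor(step, 0, toy_total_steps))
```

With zero warmup steps, as in the cell above, the learning rate starts at its full value and decays linearly to zero by the final step.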

How do we come up with all hyperparameters? The BERT authors have some recommendations for fine-tuning:

- Learning rate (Adam): 5e-5, 3e-5, 2e-5
- Number of epochs: 2, 3, 4

With `EPOCHS = 3` and a learning rate of 2e-5, our setup stays within these recommendations.

Let's continue with writing a helper function for training our model for one epoch:

In [100]:
def train_epoch(model, data_loader, loss_fn, optimizer, device, scheduler, n_examples):

  model = model.train()

  losses = []
  correct_predictions = 0

  for d in data_loader:
    input_ids = d["input_ids"].to(device)
    attention_mask = d["attention_mask"].to(device)
    targets = d["targets"].to(device)

    outputs = model(
      input_ids=input_ids,
      attention_mask=attention_mask
    )

    _, preds = torch.max(outputs, dim=1)
    loss = loss_fn(outputs, targets)

    correct_predictions += torch.sum(preds == targets)
    losses.append(loss.item())

    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()

  return correct_predictions.double() / n_examples, np.mean(losses)

We're avoiding exploding gradients by clipping the gradients of the model using [clip_grad_norm_](https://pytorch.org/docs/stable/nn.html#clip-grad-norm).
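Clipping by norm treats all gradients as one long vector: if its L2 norm exceeds `max_norm`, every gradient is scaled by the same factor, which preserves the direction of the update. Here is a minimal pure-Python sketch of that idea on a flat list of numbers; the helper name `clip_by_global_norm` and the gradient values are hypothetical, and the real function operates in place on parameter tensors.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale all gradients by one shared factor so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Never scale up: the factor is capped at 1.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


toy_grads = [3.0, 4.0]  # hypothetical gradient values with L2 norm 5
clipped, norm_before = clip_by_global_norm(toy_grads)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)
```

Gradients whose norm is already below `max_norm` pass through unchanged.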

Let's write another one that helps us evaluate the model on a given data loader:

In [101]:
def eval_model(model, data_loader, loss_fn, device, n_examples):
  model = model.eval()

  losses = []
  correct_predictions = 0

  with torch.no_grad():
    for d in data_loader:
      input_ids = d["input_ids"].to(device)
      attention_mask = d["attention_mask"].to(device)
      targets = d["targets"].to(device)

      outputs = model(
        input_ids=input_ids,
        attention_mask=attention_mask
      )
      _, preds = torch.max(outputs, dim=1)

      loss = loss_fn(outputs, targets)

      correct_predictions += torch.sum(preds == targets)
      losses.append(loss.item())

  return correct_predictions.double() / n_examples, np.mean(losses)

Using those two, we can write our training loop. We'll also store the training history:

In [None]:
history = defaultdict(list)
best_accuracy = 0

for epoch in range(EPOCHS):

  print(f'Epoch {epoch + 1}/{EPOCHS}')
  print('-' * 10)

  train_acc, train_loss = train_epoch(
    model,
    train_data_loader,
    loss_fn,
    optimizer,
    device,
    scheduler,
    len(df_train)
  )

  print(f'Train loss {train_loss} accuracy {train_acc}')

  val_acc, val_loss = eval_model(
    model,
    val_data_loader,
    loss_fn,
    device,
    len(df_val)
  )

  print(f'Val   loss {val_loss} accuracy {val_acc}')
  print()

  history['train_acc'].append(train_acc)
  history['train_loss'].append(train_loss)
  history['val_acc'].append(val_acc)
  history['val_loss'].append(val_loss)

  if val_acc > best_accuracy:
    torch.save(model.state_dict(), 'best_model_state.bin')
    best_accuracy = val_acc

Note that we're storing the state of the best model, indicated by the highest validation accuracy.
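After the loop finishes, `model` holds the weights from the last epoch, which are not necessarily the best ones. A saved state dict can be loaded back into a freshly constructed model of the same architecture. The sketch below demonstrates the round trip with a tiny `nn.Linear` layer as a stand-in for the classifier and a temporary directory as an assumed save location.

```python
import os
import tempfile

import torch
from torch import nn

# Tiny stand-in for the fine-tuned classifier; keeps the example fast.
toy_model = nn.Linear(4, 3)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "best_model_state.bin")
    torch.save(toy_model.state_dict(), path)

    # Recreate the same architecture, then load the saved weights into it.
    restored = nn.Linear(4, 3)
    restored.load_state_dict(torch.load(path, map_location="cpu"))

restored.eval()  # switch to inference mode before evaluating
weights_match = all(
    torch.equal(a, b)
    for a, b in zip(toy_model.state_dict().values(), restored.state_dict().values())
)
print(weights_match)
```

In this notebook, the equivalent step would be `model.load_state_dict(torch.load('best_model_state.bin'))` before evaluating on the test set.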

We can look at the training vs validation accuracy:

In [None]:
plt.plot([x.cpu() for x in history['train_acc']], label='train accuracy')
plt.plot([x.cpu() for x in history['val_acc']], label='validation accuracy')

plt.title('Training history')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend()
plt.ylim([0, 1]);

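With `EPOCHS = 3`, each curve has only three points. Training for more epochs (around 10) typically pushes the training accuracy toward 100%, and the gap to the validation curve then indicates how much the model is overfitting. You might try to fine-tune the hyperparameters a bit more, but this will be good enough for now.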

### Evaluation

Let's start by calculating the accuracy on the test data:

In [None]:
test_acc, _ = eval_model(
  model,
  test_data_loader,
  loss_fn,
  device,
  len(df_test)
)

test_acc.item()

Our model seems to generalize well.

We'll define a helper function to get the predictions from our model:

In [106]:
def get_predictions(model, data_loader):
  model = model.eval()

  review_texts = []
  predictions = []
  prediction_probs = []
  real_values = []

  with torch.no_grad():
    for d in data_loader:

      texts = d["review_text"]
      input_ids = d["input_ids"].to(device)
      attention_mask = d["attention_mask"].to(device)
      targets = d["targets"].to(device)

      outputs = model(
        input_ids=input_ids,
        attention_mask=attention_mask
      )
      _, preds = torch.max(outputs, dim=1)

      probs = F.softmax(outputs, dim=1)

      review_texts.extend(texts)
      predictions.extend(preds)
      prediction_probs.extend(probs)
      real_values.extend(targets)

  predictions = torch.stack(predictions).cpu()
  prediction_probs = torch.stack(prediction_probs).cpu()
  real_values = torch.stack(real_values).cpu()
  return review_texts, predictions, prediction_probs, real_values

This is similar to the evaluation function, except that we're storing the text of the reviews and the predicted probabilities (by applying the softmax on the model outputs):

In [107]:
y_review_texts, y_pred, y_pred_probs, y_test = get_predictions(model, test_data_loader)

In [None]:
print(classification_report(y_test, y_pred, target_names=class_names))

It looks like neutral (3-star) reviews are the hardest to classify.
We'll continue with the confusion matrix:

In [None]:
def show_confusion_matrix(confusion_matrix):
  hmap = sns.heatmap(confusion_matrix, annot=True, fmt="d", cmap="Blues")
  hmap.yaxis.set_ticklabels(hmap.yaxis.get_ticklabels(), rotation=0, ha='right')
  hmap.xaxis.set_ticklabels(hmap.xaxis.get_ticklabels(), rotation=30, ha='right')
  plt.ylabel('True sentiment')
  plt.xlabel('Predicted sentiment');

cm = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(cm, index=class_names, columns=class_names)
show_confusion_matrix(df_cm)

This confirms that our model is having difficulty classifying neutral reviews. It mistakes those for negative and positive at a roughly equal frequency.
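To read the matrix, recall that each row corresponds to a true class and each column to a predicted class, so off-diagonal cells count mistakes. The sketch below builds such a matrix by hand and derives per-class recall from it. The labels are invented for illustration and the helper names are hypothetical; they are not the notebook's results.

```python
def confusion_counts(y_true, y_pred, n_classes):
    """Count (true class, predicted class) pairs; rows are true, columns predicted."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


# Invented labels: 0 = negative, 1 = neutral, 2 = positive
toy_true = [0, 0, 1, 1, 1, 2, 2, 2]
toy_pred = [0, 0, 0, 1, 2, 2, 2, 1]
toy_cm = confusion_counts(toy_true, toy_pred, 3)
toy_recall = per_class_recall(toy_cm)
print(toy_cm)
print(toy_recall)
```

In this toy data, the middle row spreads its counts across all three columns, which is the pattern to look for when a class is being confused with its neighbors.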

### Predicting on Raw Text

Let's use our model to predict the sentiment of some raw text:

In [110]:
review_text = "well, not bad"

We have to use the tokenizer to encode the text:

In [111]:
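encoded_review = tokenizer.encode_plus(
  review_text,
  max_length=MAX_LEN,
  add_special_tokens=True,
  return_token_type_ids=False,
  padding='max_length',  # replaces the deprecated pad_to_max_length=True
  truncation=True,       # cut off inputs longer than MAX_LEN
  return_attention_mask=True,
  return_tensors='pt',
)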

Let's get the predictions from our model:

In [None]:
input_ids = encoded_review['input_ids'].to(device)
attention_mask = encoded_review['attention_mask'].to(device)

model.eval()  # make sure dropout is disabled
with torch.no_grad():  # gradients are not needed for inference
  output = model(input_ids, attention_mask)
_, prediction = torch.max(output, dim=1)

print(f'Review text: {review_text}')
print(f'Sentiment  : {class_names[prediction.item()]}')

# References
- [Training a causal language model from scratch](https://huggingface.co/learn/nlp-course/chapter7/6)
- [Causal language modeling](https://huggingface.co/docs/transformers/tasks/language_modeling)
- [Sentiment Analysis with BERT and Transformers by Hugging Face using PyTorch and Python](https://curiousily.com/posts/sentiment-analysis-with-bert-and-hugging-face-using-pytorch-and-python/)
- [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)
- [L11 Language Models - Alec Radford (OpenAI)](https://www.youtube.com/watch?v=BnpB3GrpsfM)
- [The Illustrated BERT, ELMo, and co.](https://jalammar.github.io/illustrated-bert/)
- [BERT Fine-Tuning Tutorial with PyTorch](https://mccormickml.com/2019/07/22/BERT-fine-tuning/)
- [How to Fine-Tune BERT for Text Classification?](https://arxiv.org/pdf/1905.05583.pdf)
- [Huggingface Transformers](https://huggingface.co/transformers/)
- [BERT Explained: State of the art language model for NLP](https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270)
- [BERT implementation](https://github.com/codertimo/BERT-pytorch)