<a href="https://colab.research.google.com/github/ezgimez/dl-demos/blob/main/demo8_modified.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# An Introduction to BERT and Language Models

This is a gentle introduction to BERT-type language models, based on [this tutorial](https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/).

We discussed BERT-type models [in class](https://chinmayhegde.github.io/dl-notes/notes/lecture08/). In this lecture, we will
* see what BERT models look like, and
* use pre-trained BERT embeddings to train a simple classifier that predicts movie sentiments.

Check this [link](https://jalammar.github.io/illustrated-bert/) for a quick introduction to BERT.

# Setup

Excellent implementations of several language models (including BERT) are available via the `transformers` library in [HuggingFace](https://huggingface.co/models).

In [1]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.17.0-py3-none-any.whl (3.8 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting tokenizers!=0.11.3,>=0.11.1
  Downloading tokenizers-0.12.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.49-py3-none-any.whl (895 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Fo

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import torch
import transformers as ppb  # pytorch transformers
import warnings
warnings.filterwarnings('ignore')

# Model preparation

Let us first load `DistilBERT`, a pre-trained, lightweight BERT model.

Details are in the original [paper](https://arxiv.org/pdf/1910.01108.pdf), but the high-level idea is to apply a compression technique called *knowledge distillation* to the original BERT model in order to reduce the number of parameters by approximately 40\%.
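
To make the idea of knowledge distillation a little more concrete, here is a minimal sketch of a soft-target loss, in which a small "student" model is rewarded for matching the temperature-softened output distribution of a large "teacher" model. The logits, the temperature value, and the function names are illustrative assumptions; the actual DistilBERT training objective combines a term of this kind with other losses described in the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between the softened teacher and student distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))

# Hypothetical logits over a 3-word vocabulary (made-up numbers).
teacher = [4.0, 1.0, 0.5]
good_student = [3.5, 1.2, 0.4]  # roughly agrees with the teacher
bad_student = [0.5, 1.0, 4.0]   # disagrees with the teacher

loss_good = distillation_loss(teacher, good_student)
loss_bad = distillation_loss(teacher, bad_student)
print(loss_good, loss_bad)  # the student that mimics the teacher gets the lower loss
```

Minimizing a loss like this pushes the student toward the teacher's full output distribution rather than only its top prediction.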

In [3]:
# For DistilBERT:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

## Want BERT instead of distilBERT? Uncomment the following line:
#model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Notice that the model seems massive. What does it look like?

In [4]:
model

DistilBertModel(
  (embeddings): Embeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer): Transformer(
    (layer): ModuleList(
      (0): TransformerBlock(
        (attention): MultiHeadSelfAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.1, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): Linear(i

In [5]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(count_parameters(model))

66362880


(That's a lot of parameters, about 66 million! But par for the course in language models.)
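
As a sanity check, the parameter count can be reproduced by hand from the layer sizes in the printed model. The sketch below rests on two assumptions that are cut off in the printout above: that the model stacks 6 identical transformer blocks, and that each block ends with a second LayerNorm after the feed-forward network.

```python
# Sizes read off the printed model.
vocab_size, max_positions, hidden, ffn_hidden = 30522, 512, 768, 3072

# Assumption (not visible in the truncated printout): 6 identical blocks,
# each with a second LayerNorm after the feed-forward network.
num_layers = 6

def linear(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # one scale and one shift per hidden unit

embeddings = vocab_size * hidden + max_positions * hidden + layer_norm
attention = 4 * linear(hidden, hidden)  # q_lin, k_lin, v_lin, out_lin
ffn = linear(hidden, ffn_hidden) + linear(ffn_hidden, hidden)  # lin1, lin2
block = attention + layer_norm + ffn + layer_norm

total = embeddings + num_layers * block
print(f"embeddings: {embeddings:,}")
print(f"per block:  {block:,}")
print(f"total:      {total:,}")
```

Under these assumptions the total comes to 66,362,880, matching the count printed above; the word-embedding table alone accounts for roughly a third of it.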

# Dataset preparation

Let us now download a small movie review dataset called the Stanford Sentiment Treebank [SST2](https://www.kaggle.com/atulanandjha/stanford-sentiment-treebank-v2-sst2). We will only keep the first 2000 reviews for training purposes.

In [39]:
path = 'https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv'
df = pd.read_csv(path, delimiter='\t', header=None)  # documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
# reads the tab-separated file into a DataFrame
batch_1 = df[:2000]

In [7]:
batch_1

Unnamed: 0,0,1
0,"a stirring , funny and finally transporting re...",1
1,apparently reassembled from the cutting room f...,0
2,they presume their audience wo n't sit still f...,0
3,this is a visually stunning rumination on love...,1
4,jonathan parker 's bartleby should have been t...,1
...,...,...
1995,too bland and fustily tasteful to be truly pru...,0
1996,it does n't work as either,0
1997,this one aims for the toilet and scores a dire...,0
1998,in the name of an allegedly inspiring and easi...,0


Column `0` contains the review itself, while column `1` contains the label (1 if the review is positive and 0 if it is negative).

We can see that this set of reviews is balanced; approximately half the reviews are positive and the other half are negative.

In [8]:
batch_1[1].value_counts() # value_counts() counts how many times each label occurs.

1    1041
0     959
Name: 1, dtype: int64

Before feeding the reviews into our model, we first have to *tokenize* them; this splits each sentence at (approximately) the word level and assigns each resulting piece an integer token index.

(Each language model is associated with a specific tokenizer; be careful to use that tokenizer, and not a different one, when computing features.)
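
To illustrate what "approximately the word level" means, here is a toy sketch of greedy longest-match sub-word splitting in the style of the WordPiece scheme used by BERT tokenizers. The tiny vocabulary, its ids, and the helper names are hypothetical; the real tokenizer has a vocabulary of 30522 entries and handles punctuation and normalization that this sketch ignores.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: treat the whole word as unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary mapping pieces to ids.
toy_vocab = {"[CLS]": 0, "[SEP]": 1, "[UNK]": 2, "a": 3, "stir": 4,
             "##ring": 5, "funny": 6, "film": 7}

def encode(sentence, vocab):
    """Lowercase, split on whitespace, split words into pieces, add special tokens."""
    tokens = ["[CLS]"]
    for word in sentence.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens, [vocab[t] for t in tokens]

tokens, ids = encode("A stirring funny film", toy_vocab)
print(tokens)  # ['[CLS]', 'a', 'stir', '##ring', 'funny', 'film', '[SEP]']
print(ids)     # [0, 3, 4, 5, 6, 7, 1]
```

A word missing from the vocabulary ("stirring") is broken into known pieces instead of being mapped to a single unknown token, which is why a sentence can produce more tokens than it has words.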

In [16]:
tokenized = batch_1[0].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))
# This turns every sentence into a list of token ids.
# documentation: https://huggingface.co/docs/transformers/main_classes/tokenizer

In [40]:
tokenized

0       [101, 1037, 18385, 1010, 6057, 1998, 2633, 182...
1       [101, 4593, 2128, 27241, 23931, 2013, 1996, 62...
2       [101, 2027, 3653, 23545, 2037, 4378, 24185, 10...
3       [101, 2023, 2003, 1037, 17453, 14726, 19379, 1...
4       [101, 5655, 6262, 1005, 1055, 12075, 2571, 376...
                              ...                        
1995    [101, 2205, 20857, 1998, 11865, 16643, 2135, 5...
1996    [101, 2009, 2515, 1050, 1005, 1056, 2147, 2004...
1997    [101, 2023, 2028, 8704, 2005, 1996, 11848, 199...
1998    [101, 1999, 1996, 2171, 1997, 2019, 9382, 1898...
1999    [101, 1996, 3185, 2003, 25757, 2011, 1037, 244...
Name: 0, Length: 2000, dtype: object

BERT has several special tokens: the `[CLS]` token marks the start of the sentence, while the `[SEP]` token marks the end of the sentence. This is why the first token is always `101` (the id of `[CLS]`); similarly, the last token is always `102` (the id of `[SEP]`).

We now have a list of sentences broken down into token ids. To process them together as a single batch, we need to pad all token sequences to the same length.

In [41]:
# Find the length of the longest token sequence.
max_len = max(len(i) for i in tokenized.values)
print("maximum length is : ", max_len)

# Right-pad every sequence with zeros up to max_len.
padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])
padded, padded.shape

maximum length is :  59


(array([[  101,  1037, 18385, ...,     0,     0,     0],
        [  101,  4593,  2128, ...,     0,     0,     0],
        [  101,  2027,  3653, ...,     0,     0,     0],
        ...,
        [  101,  2023,  2028, ...,     0,     0,     0],
        [  101,  1999,  1996, ...,     0,     0,     0],
        [  101,  1996,  3185, ...,     0,     0,     0]]), (2000, 59))

To keep the padding from distorting the attention weights, we tell our self-attention mechanism to ignore the zeros introduced by padding.

In [26]:
attention_mask = np.where(padded != 0, 1, 0) # documentation: https://note.nkmk.me/en/python-numpy-where/
# where the condition `padded != 0` holds, the result is 1 for that element;
# otherwise, it is 0.
attention_mask.shape

(2000, 59)
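
To see what "ignoring" padded positions means, here is a small sketch of a masked softmax, a common way to apply an attention mask: masked positions receive a score of negative infinity before normalization, so their attention weight is exactly zero. The token ids and raw scores are made up, and this illustrates the general technique rather than the library's internal code.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over scores, giving zero weight to positions where mask == 0."""
    # Masked positions get a score of -inf, so exp() maps them to exactly 0.
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    top = max(masked)
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# One hypothetical padded sequence: three real tokens followed by two pads.
padded_row = [101, 2023, 102, 0, 0]
mask_row = [1 if token_id != 0 else 0 for token_id in padded_row]

# Made-up raw attention scores from one query token to every position.
scores = [1.0, 2.0, 0.5, 3.0, 3.0]

weights = masked_softmax(scores, mask_row)
print(mask_row)  # [1, 1, 1, 0, 0]
print([round(w, 3) for w in weights])
```

Even though the padded positions have the largest raw scores in this example, they end up with zero weight, and the remaining weights still sum to one.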

Let's create feature embeddings by passing the tokens through the `DistilBERT` model. This will take a few minutes.

In [28]:
# Build input tensors from the padded token matrix and the attention mask.
input_ids = torch.tensor(padded)  # torch.tensor copies the data
attention_mask = torch.tensor(attention_mask)

# Send the inputs through DistilBERT; no gradients are needed for feature extraction.
with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

# After this step, last_hidden_states holds the outputs of DistilBERT.
# Its first element, last_hidden_states[0], is a tensor of shape
# (number of examples, max number of tokens in the sequence, number of hidden units in the DistilBERT model).
# In our case this is 2000 (we limited ourselves to 2000 examples), 59 (the number of tokens in
# the longest of the 2000 sequences), and 768 (the number of hidden units in the DistilBERT model).

The output contains one vector for each input token, and each vector is made up of 768 numbers (floats).
Because this is a sentence classification task, we ignore all vectors except the first one (the one associated with the `[CLS]` token) and pass that single vector as the input to the logistic regression model.
Let's extract the corresponding slice.

In [58]:
features = last_hidden_states[0][:,0,:].numpy()
# For sentence classification, we are only interested in BERT's output for the [CLS] token,
# so we select that slice of the cube and discard everything else.
# It is no coincidence that the [CLS] token stands for "classification".
# print(last_hidden_states[0].shape) # prints torch.Size([2000, 59, 768])
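
The slicing step is easier to picture on a tiny example. The sketch below uses nested Python lists with made-up values and toy dimensions (2 sentences, 3 token positions, 4 hidden units) in place of the real 2000 x 59 x 768 tensor.

```python
# Toy "last hidden state": 2 sentences, 3 token positions, 4 hidden units.
hidden = [
    [[0.1, 0.2, 0.3, 0.4], [1.1, 1.2, 1.3, 1.4], [2.1, 2.2, 2.3, 2.4]],
    [[0.5, 0.6, 0.7, 0.8], [1.5, 1.6, 1.7, 1.8], [2.5, 2.6, 2.7, 2.8]],
]

# Keep position 0 (the [CLS] slot) of every sentence;
# this is the nested-list analogue of the tensor slice [:, 0, :].
cls_vectors = [sentence[0] for sentence in hidden]

print(cls_vectors)
print(len(cls_vectors), len(cls_vectors[0]))  # 2 4
```

The token dimension disappears: each sentence is reduced to a single vector with one entry per hidden unit.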

In [30]:
features, features.shape

(array([[-0.21593425, -0.14028914,  0.00831067, ..., -0.13694833,
          0.58670044,  0.20112702],
        [-0.17262712, -0.1447617 ,  0.00223441, ..., -0.17442559,
          0.21386437,  0.37197483],
        [-0.05063363,  0.07203963, -0.02959726, ..., -0.07148931,
          0.71852386,  0.26225471],
        ...,
        [-0.27829778, -0.24803594,  0.135858  , ..., -0.19039164,
          0.13099569,  0.34978375],
        [-0.03667713,  0.10638557, -0.01111022, ..., -0.11206637,
          0.41619495,  0.5033802 ],
        [ 0.12402624,  0.01425158,  0.01038425, ..., -0.11606552,
          0.53459144,  0.2749533 ]], dtype=float32), (2000, 768))

OK! We now have feature embeddings for our input sentences; each sentence is now encoded as a feature vector of size 768. Lastly, let us also extract the corresponding labels.

In [35]:
labels = batch_1[1] # recall that column 1 contains the positive/negative label of each review.

# Training our model

Let us train a simple logistic regression model to classify our reviews as positive or negative.
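
For readers who have not seen logistic regression before, here is a minimal from-scratch sketch of what such a model computes: a weighted sum of the features passed through a sigmoid, with the weights fitted by gradient descent on the log loss. The 2-dimensional toy features, the learning rate, and the number of epochs are made-up assumptions; scikit-learn's `LogisticRegression` uses a different solver and applies regularization by default, so this is an illustration of the idea rather than a re-implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, weights, bias):
    """Probability of the positive class for one feature vector."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)

def train(features, labels, lr=0.5, epochs=200):
    """Plain gradient descent on the average log loss (no regularization)."""
    weights, bias = [0.0] * len(features[0]), 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(features, labels):
            error = predict_proba(x, weights, bias) - y
            grad_w = [g + error * xi for g, xi in zip(grad_w, x)]
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

# Made-up 2-dimensional "embeddings" standing in for the 768-dimensional ones.
toy_features = [[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]]
toy_labels = [1, 1, 0, 0]

w, b = train(toy_features, toy_labels)
predictions = [int(predict_proba(x, w, b) >= 0.5) for x in toy_features]
print(predictions)  # [1, 1, 0, 0]
```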

In [36]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)
# documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
# by default, train/test size will be 0.75/0.25.

In [37]:
lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)

LogisticRegression()

In [38]:
lr_clf.score(test_features, test_labels)

0.832

Note that this already reaches ~83\% accuracy. Not too bad, considering that we didn't really have to do any work except load existing models. For reference, the best models on this dataset currently have ~96\% accuracy: you can track this [here](https://paperswithcode.com/sota/sentiment-analysis-on-sst-2-binary).
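
To put the score in context, it helps to compare it with a trivial baseline that always predicts the majority class. The sketch below computes that baseline from the label counts reported by `value_counts()` earlier; note that it uses the whole 2000-review subset rather than the random test split, whose exact label proportions are not shown, so it should be read as an approximation.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

# Label counts for the full 2000-review subset, as reported by value_counts().
num_positive, num_negative = 1041, 959

# Baseline: always predict the majority class.
labels = [1] * num_positive + [0] * num_negative
majority_class = 1 if num_positive >= num_negative else 0
baseline = accuracy([majority_class] * len(labels), labels)

print(f"majority-class baseline: {baseline:.4f}")  # 0.5205
```

Against a baseline of about 0.52, the 0.832 score obtained above shows that the `[CLS]` embeddings carry a substantial amount of sentiment information.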