### BERT language model

#### Installing required libraries

In [14]:
!pip install tensorflow
!pip install nltk
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


#### Importing required packages

In [15]:
#NLP libraries
import nltk  
nltk.download('stopwords')  
nltk.download('punkt')
nltk.download('gutenberg')
from nltk.corpus import stopwords
from nltk.corpus import gutenberg

#DL libraries
from transformers import BertTokenizer, TFBertForMaskedLM
import tensorflow as tf

#Data manipulation
import numpy as np
import re

#Other libraries
import string

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


#### Setting data path

We will use the text of "Moby Dick" by Herman Melville. The book is in the public domain and is part of the Project Gutenberg corpus that NLTK distributes. The `nltk.download('gutenberg')` call above places it under `/root/nltk_data/` when this notebook is executed.

In [16]:
!cp /root/nltk_data/corpora/gutenberg/melville-moby_dick.txt moby_dick.txt

#### Loading and pre-processing data

Pre-processing the data involves removing stopwords and punctuation and then converting the text into tokens, which is the input format that BERT expects. For the tokenization step we use the tokenizer that Hugging Face provides for `bert-base-uncased`.

In [17]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Load the list of English stop words from NLTK
sw = stopwords.words('english')
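
To make the cleaning steps described above concrete, here is a minimal standalone sketch that works at the word level. `DEMO_STOPWORDS` and `clean_line` are hypothetical names used only for illustration; the notebook itself relies on the full NLTK list stored in `sw`.

```python
import string

# Hypothetical, tiny stop-word set for illustration only;
# the notebook uses the full NLTK English list stored in `sw`.
DEMO_STOPWORDS = {"the", "a", "of", "and", "in", "me"}

def clean_line(line, stopwords=DEMO_STOPWORDS):
    """Lowercase, strip punctuation and drop stop words from one line."""
    line = line.rstrip("\n").lower()
    line = line.translate(str.maketrans("", "", string.punctuation))
    # Split into words so that whole words, not characters, are compared
    return [word for word in line.split() if word not in stopwords]

print(clean_line("Call me Ishmael.\n"))  # ['call', 'ishmael']
```

Splitting the line before filtering matters: iterating over the string itself would compare single characters against the stop-word list.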

#### Helper functions

In [18]:
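# This function removes stopwords from a single line.
# The line is split into words first; iterating over the string directly
# would compare individual characters against the stop-word list.
def rem_stops_line(line, words):
    return [w for w in line.split() if w not in words]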

# This function finds the positions of the masks in a sentence.
# 103 is the id of the [MASK] token in the bert-base-uncased vocabulary.
def find_masks(inp):
    return np.where(inp.input_ids.numpy()[0] == 103)[0].tolist()

# This function removes stop words for an entire text. 
# Separate functions make it easier to parallelize if required.
def remove_stops(text, words = sw):
    return [rem_stops_line(line, words) for line in text]

# This function returns the logits (one score per vocabulary entry)
# at every mask position of the first sentence in the batch
def single_prediction(model, inp, mask_loc):
    return model(inp).logits[0].numpy()[mask_loc]

def return_prediction(model, query):
    # Return a prediction for a single sentence
    inp = tokenizer(query,return_tensors='tf')
    mask_loc = find_masks(inp)
    # Find the Prediction with the highest confidence
    predicted_tokens = np.argmax(single_prediction(model, inp, mask_loc),axis=1).tolist()
    # Decode the numerical value of the returned ID back to the word 
    return tokenizer.decode(predicted_tokens)

def multiple_preds(model, query_list):
    # Print the predictions for a list of queries
    preds = [f"{x} -> {return_prediction(model, x).split(' ')}" for x in query_list]
    for pred in preds:
        print(pred)
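
The prediction helpers above boil down to two steps: locate the [MASK] positions, then take the highest-scoring vocabulary entry at each of them. The sketch below illustrates this in plain Python with made-up numbers. The five-entry "vocabulary", the logits, and the names `find_mask_positions` and `argmax` are assumptions for illustration, not output from the real model.

```python
MASK_ID = 103  # id of [MASK] in the bert-base-uncased vocabulary

def find_mask_positions(input_ids, mask_id=MASK_ID):
    """Return the indices at which the mask id occurs."""
    return [i for i, tok in enumerate(input_ids) if tok == mask_id]

def argmax(scores):
    """Index of the largest score in a list."""
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical sequence: [CLS], an arbitrary token id, [MASK], [SEP]
toy_input_ids = [101, 2057, 103, 102]

# Made-up logits over a 5-entry toy vocabulary, one row per position
toy_logits = [
    [0.1, 0.2, 0.3, 0.1, 0.0],
    [0.0, 1.5, 0.2, 0.1, 0.3],
    [0.2, 0.1, 0.4, 2.7, 0.5],
    [0.9, 0.0, 0.1, 0.2, 0.3],
]

positions = find_mask_positions(toy_input_ids)
predicted = [argmax(toy_logits[p]) for p in positions]
print(positions, predicted)  # [2] [3]
```

In the notebook, the predicted ids are passed to `tokenizer.decode` to turn them back into words.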

In [19]:
# Take the first 1000 lines of the file for demo purposes
with open('/content/moby_dick.txt', 'r') as f:
    lines = f.readlines()[:1000]

In [20]:
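# Strip newlines, convert to lowercase and remove punctuation, then use
# the helper functions defined above to remove stop words.
# Note: `filtered_lines` is not used further in this cell; the tokenizer
# receives `lines`, so stop words remain in the model input.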

lines = [line.rstrip('\n').lower() for line in lines]
lines = [line.translate(str.maketrans('', '', string.punctuation)) for line in lines]
filtered_lines = remove_stops(text = lines, words = sw)

inputs = tokenizer(lines,
                   max_length=100,
                   truncation=True,
                   padding='max_length',
                   return_tensors='tf')
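
The `max_length`, `truncation` and `padding` arguments give every line the same length. The sketch below mimics that behaviour on plain lists so the resulting shapes are easier to reason about. `encode_fixed_length` is a hypothetical helper, not part of the Transformers API, and it assumes the `bert-base-uncased` special-token ids (101 for [CLS], 102 for [SEP], 0 for [PAD]).

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # bert-base-uncased special-token ids

def encode_fixed_length(token_ids, max_length):
    """Add [CLS]/[SEP], truncate to max_length and pad with [PAD]."""
    # Leave room for the two special tokens before truncating
    body = token_ids[: max_length - 2]
    ids = [CLS_ID] + body + [SEP_ID]
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [PAD_ID] * n_pad,
        # 1 marks a real token, 0 marks padding
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }

short = encode_fixed_length([2061, 2357], max_length=6)
empty = encode_fixed_length([], max_length=6)
long = encode_fixed_length(list(range(1000, 1010)), max_length=6)

print(short["input_ids"])  # [101, 2061, 2357, 102, 0, 0]
print(empty["input_ids"])  # [101, 102, 0, 0, 0, 0]
print(long["input_ids"])   # [101, 1000, 1001, 1002, 1003, 102]
```

The `empty` case corresponds to blank lines in the text file, which is why some rows of `inputs` contain only 101 and 102 followed by zeros.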

#### Loading The Masked Language Model

We load the pre-trained model directly from the Transformers library. Because we work with the uncased checkpoint, its tokenizer converts all text to lowercase before tokenizing.

In [21]:
model = TFBertForMaskedLM.from_pretrained('bert-base-uncased')

All model checkpoint layers were used when initializing TFBertForMaskedLM.

All the layers of TFBertForMaskedLM were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForMaskedLM for predictions without further training.


#### Inspecting the tokenized inputs

In [22]:
inputs

{'input_ids': <tf.Tensor: shape=(1000, 100), dtype=int32, numpy=
array([[  101, 11240,  2100, ...,     0,     0,     0],
       [  101,   102,     0, ...,     0,     0,     0],
       [  101,   102,     0, ...,     0,     0,     0],
       ...,
       [  101, 28838,  2098, ...,     0,     0,     0],
       [  101,  2061,  2357, ...,     0,     0,     0],
       [  101,  2046,  2793, ...,     0,     0,     0]], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(1000, 100), dtype=int32, numpy=
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1000, 100), dtype=int32, numpy=
array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 0, ..., 0, 0, 0],
       [1, 1, 0, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0]], dtype=in

In [23]:
print('GPU name: ', tf.config.experimental.list_physical_devices('GPU'))

GPU name:  []


#### Creating a Mask

In a masked language model such as BERT, creating a mask means replacing some of the tokens in the `inputs` defined earlier with the special [MASK] token, whose id in the `bert-base-uncased` vocabulary is 103.
During training, a certain percentage of the input tokens (15% in the cell below) is selected at random and replaced with the [MASK] token. The model is then trained to predict the original token from the surrounding context.

In [24]:
# Mask 15% of the real tokens in every line
inp_ids = []
for inp in inputs.input_ids.numpy():
    # Candidate positions: everything except [CLS] (101), [SEP] (102)
    # and padding (0)
    tokens = np.where((inp != 101) & (inp != 102) & (inp != 0))[0]

    # Number of tokens to mask (rounded down, so very short lines get none)
    n_masked = int(0.15 * len(tokens))
    masks = np.random.choice(tokens, size=n_masked, replace=False)

    # Overwrite the chosen positions with the [MASK] id (103)
    inp[masks] = 103
    inp_ids.append(inp)

# Convert the list of masked rows back to a tensor
inp_ids = tf.convert_to_tensor(inp_ids)
inputs['input_ids'] = inp_ids
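
The masking cell above overwrites the original ids, so the tokens that were hidden are no longer available afterwards. If the masked batch were later used for fine-tuning, the original ids would be needed as labels. The sketch below shows one way to keep them, in plain Python on a single sequence. `mask_sequence` is a hypothetical helper, and the use of -100 as the "ignore this position" label is an assumption based on the convention documented for the Transformers masked-LM models.

```python
import random

CLS_ID, SEP_ID, PAD_ID, MASK_ID = 101, 102, 0, 103
IGNORE_INDEX = -100  # assumption: label value that the loss skips

def mask_sequence(input_ids, mask_fraction=0.15, rng=random):
    """Return (masked_ids, labels) for one sequence of token ids."""
    special = {CLS_ID, SEP_ID, PAD_ID}
    # Only real tokens are candidates for masking
    candidates = [i for i, tok in enumerate(input_ids) if tok not in special]
    n_to_mask = int(mask_fraction * len(candidates))
    chosen = set(rng.sample(candidates, n_to_mask))

    masked = [MASK_ID if i in chosen else tok
              for i, tok in enumerate(input_ids)]
    # Keep the original id only where a mask was placed
    labels = [tok if i in chosen else IGNORE_INDEX
              for i, tok in enumerate(input_ids)]
    return masked, labels

# Toy sequence: [CLS], 20 arbitrary token ids, [SEP], two [PAD]
ids = [CLS_ID] + list(range(2000, 2020)) + [SEP_ID, PAD_ID, PAD_ID]
masked, labels = mask_sequence(ids, rng=random.Random(0))
print(masked)
print(labels)
```

Passing a seeded `random.Random` instance makes the choice of masked positions reproducible, which can help when comparing runs.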

In [25]:
inp_ids

<tf.Tensor: shape=(1000, 100), dtype=int32, numpy=
array([[  101, 11240,  2100, ...,     0,     0,     0],
       [  101,   102,     0, ...,     0,     0,     0],
       [  101,   102,     0, ...,     0,     0,     0],
       ...,
       [  101, 28838,  2098, ...,     0,     0,     0],
       [  101,  2061,   103, ...,     0,     0,     0],
       [  101,  2046,  2793, ...,     0,     0,     0]], dtype=int32)>

In [26]:
query_list = ["And what thing soever [MASK] cometh within the chaos", 
              "Scarcely had we [MASK] two days on the sea", 
              "He visited this [MASK] also with a view of [MASK] horse-whales"]
multiple_preds(model, query_list)

And what thing soever [MASK] cometh within the chaos -> ['knows']
Scarcely had we [MASK] two days on the sea -> ['spent']
He visited this [MASK] also with a view of [MASK] horse-whales -> ['island', 'the']
