# This notebook analyses the pretrained model and the `strings` training data

## Basic checks on BERT's finetuned model

In [34]:
!wc -l lm_finetuning/finetuned_lm/vocab.txt

30522 lm_finetuning/finetuned_lm/vocab.txt


In [35]:
!shuf -n20 lm_finetuning/finetuned_lm/vocab.txt

[unused150]
alfred
robson
##ately
lori
premise
[unused679]
accordingly
আ
paintings
[unused986]
##ν
library
sip
##ahu
warmly
generals
completing
besides
subtly


## Checking the training data of `strings.ai`

In [36]:
bert_vocab = set()
with open('lm_finetuning/finetuned_lm/vocab.txt', 'r') as inp:
    for l in inp.readlines():
        bert_vocab.add(l.strip())

#### Use BERT's tokenizer to build the vocabulary of the training data

In [37]:
import pytorch_pretrained_bert as ppb
tokenizer = ppb.tokenization.BertTokenizer.from_pretrained('lm_finetuning/finetuned_lm/')

In [39]:
from tqdm import tqdm
strings_vocab = set()
with open('/home/strings/datasets/transcripts/all_calls.data', 'r') as inp:
    for l in tqdm(inp.readlines()):
        tokens = tokenizer.tokenize(l.strip())
        strings_vocab.update(tokens)

100%|██████████| 878173/878173 [03:16<00:00, 4463.83it/s]


In [42]:
print("Strings Vocab Size: {}\nBert vocab Size: {}".format(len(strings_vocab), len(bert_vocab)))

Strings Vocab Size: 22446
Bert vocab Size: 30522
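
The two sizes can also be read as a coverage figure: how much of BERT's vocabulary the training data actually exercises. Below is a minimal sketch of that calculation. The helper `vocab_coverage` and the toy sets are hypothetical and only stand in for `strings_vocab` and `bert_vocab`.

```python
def vocab_coverage(corpus_vocab, model_vocab):
    """Return the fraction of model_vocab seen in corpus_vocab, plus the unused entries."""
    used = corpus_vocab & model_vocab
    unused = model_vocab - corpus_vocab
    return len(used) / len(model_vocab), unused


# Toy sets standing in for strings_vocab and bert_vocab (hypothetical values).
toy_model_vocab = {"the", "cost", "##ing", "[unused150]", "library"}
toy_corpus_vocab = {"the", "cost", "##ing"}

coverage, unused = vocab_coverage(toy_corpus_vocab, toy_model_vocab)
print("Coverage: {:.1%}".format(coverage))
print("Unused entries: {}".format(sorted(unused)))
```

With the sizes reported in this notebook, 22446 distinct tokens against the 30522 lines of `vocab.txt`, the same ratio comes out at roughly 73.5%.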


In [49]:
print(list(strings_vocab)[2000:2010])

['sunset', 'coroner', 'fisheries', 'oppose', 'triggering', 'contemporaries', 'xx', '##anna', 'particles', 'relate']


In [50]:
print("Checking for NEW tokens in the strings vocab...")
count = 0
for t in strings_vocab:
    if t not in bert_vocab:
        count += 1

print("NEW tokens in the strings finetuning data: {}".format(count))

Checking for NEW tokens in the strings vocab...
NEW tokens in the strings finetuning data: 0
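
A count of 0 is what one would expect by construction: a WordPiece tokenizer can only emit pieces that already exist in its vocabulary, and it falls back to `[UNK]` otherwise. The sketch below is a simplified greedy longest-match-first splitter, not the library's own implementation (which also handles casing, punctuation and a maximum word length). The function `wordpiece` and the toy vocabulary are assumptions for illustration.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes [UNK]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (hypothetical); the real one is vocab.txt.
toy_vocab = {"[UNK]", "ah", "##old", "get", "market", "##ing"}
for word in ["ahold", "marketing", "zzz"]:
    print(word, "->", wordpiece(word, toy_vocab))
```

Because every returned piece is either a vocabulary entry or `[UNK]`, comparing the tokenized data against `bert_vocab` cannot surface new tokens. Counting how often `[UNK]` appears, or how many words are split into several pieces, would likely say more about the domain mismatch.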


# Text Generation from BERT:

Text generation from BERT is not straightforward [1]. BERT is a bidirectional model: each prediction conditions on context to both the left and the right of a position, so it cannot simply generate text left to right the way an autoregressive language model does. BERT is trained with a masked LM objective, in which some random tokens of a sentence are replaced with `[MASK]` and the model is asked to predict them (a kind of fill-in-the-blanks task).

Notes:
[1]: https://arxiv.org/abs/1902.04094
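
One way to turn fill-in-the-blanks into generation, loosely in the spirit of [1], is to start from a sequence of `[MASK]` tokens and repeatedly re-predict one position at a time given the rest. The sketch below only shows that control loop. `toy_predict` is a hypothetical stand-in for the masked LM so the example runs without model weights, and `generate` is an illustrative name, not an API of any library.

```python
import random

MASK = "[MASK]"


def toy_predict(tokens, position):
    """Hypothetical stand-in for a masked LM: return a token for `position`.

    A real version would run BertForMaskedLM on `tokens` and sample from
    the distribution predicted at `position`.
    """
    toy_vocab = ["the", "cost", "is", "free", "today"]
    return toy_vocab[position % len(toy_vocab)]  # deterministic toy rule


def generate(predict_fn, length, n_sweeps=2, seed=0):
    """Start fully masked, then re-predict every position once per sweep."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for _ in range(n_sweeps):
        positions = list(range(length))
        rng.shuffle(positions)  # visit positions in random order
        for pos in positions:
            tokens[pos] = MASK  # hide the current token
            tokens[pos] = predict_fn(tokens, pos)  # refill it from context
    return tokens


generated = generate(toy_predict, length=5)
print(" ".join(generated))
```

With a real model, `predict_fn` would sample rather than take a fixed choice, and the number of sweeps becomes a quality versus cost trade-off.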

In [59]:
import torch
from pytorch_pretrained_bert import BertForMaskedLM
# Load pre-trained model (weights)
model = BertForMaskedLM.from_pretrained('lm_finetuning/finetuned_lm/')
model.eval()


BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): FusedLayerNorm(torch.Size([768]), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): FusedLayerNorm(torch.Size([768]), eps=1e-12, elementwise_affine=

In [64]:
from random import randint

In [68]:
with open('debug.data', 'r') as inp:
    for l in inp.readlines():
        text = "[CLS] " + l.strip() + " [SEP]"
        tokenized_text = tokenizer.tokenize(text)

        # Mask a token that we will try to predict back with `BertForMaskedLM`
        masked_index = randint(1, len(tokenized_text) - 2)
        tokenized_text[masked_index] = '[MASK]'

        # Convert tokens to vocabulary indices
        indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
        # Segment ids: each input is a single sentence, so every token gets id 0 (sentence A)
        segments_ids = [0]*len(tokenized_text)


        # Convert inputs to PyTorch tensors
        tokens_tensor = torch.tensor([indexed_tokens])
        segments_tensors = torch.tensor([segments_ids])


        # If you have a GPU, put everything on cuda
        # tokens_tensor = tokens_tensor.to('cuda')
        # segments_tensors = segments_tensors.to('cuda')
        # model.to('cuda')

        # Predict all tokens
        with torch.no_grad():
            predictions = model(tokens_tensor, segments_tensors)

        predicted_index = torch.argmax(predictions[0, masked_index]).item()
        predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]

        predicted_tokens = list(tokenized_text)
        predicted_tokens[masked_index] = predicted_token
        print("Original: {}".format(text))
        print("Input: {}".format(" ".join(tokenized_text)))
        print("Prediction: {}".format(" ".join(predicted_tokens)))
        print("\n\n")


Original: [CLS] the cost? [SEP]
Input: [CLS] the [MASK] ? [SEP]
Prediction: [CLS] the what ? [SEP]



Original: [CLS] it's it's all manual today. And we're paying people, [SEP]
Input: [CLS] it ' s it ' s all [MASK] today . and we ' re paying people , [SEP]
Prediction: [CLS] it ' s it ' s all free today . and we ' re paying people , [SEP]



Original: [CLS] some some of them you have to process your own data and host it on your own server computer [SEP]
Input: [CLS] some some [MASK] them you have to process your own data and host it on your own server computer [SEP]
Prediction: [CLS] some some of them you have to process your own data and host it on your own server computer [SEP]



Original: [CLS] Yes, I'm trying to get ahold of somebody in the marketing department. [SEP]
Input: [CLS] yes , i ' m trying to get [MASK] ##old of somebody in the marketing department . [SEP]
Prediction: [CLS] yes , i ' m trying to get ah ##old of somebody in the marketing department . [SEP]



Original: [CL