## Global modules import

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import json
import numpy as np
from operator import itemgetter
import random as rnd
import sys
import torch

## Local modules import

In [3]:
sys.path.append('..')

In [4]:
from data_loading import create_word_lists, tidy_sentence_length

In [5]:
from sklearn.model_selection import train_test_split

## Loading data

In [6]:
sys.path.append('../data')

In [7]:
with open('../data/corpus_data.json') as json_file:
    data = json.load(json_file)
data = data['records']

In [8]:
human_transcripts = [entry['human_transcript'] for entry in data]
stt_transcripts   = [entry['stt_transcript'] for entry in data]

In [9]:
human_words, stt_words, word_labels, word_grams, word_sems = \
    create_word_lists(data)

Some of the sentences are too long, so we need to shorten them.

In [10]:
stt_transcripts, stt_words, word_labels, word_grams, word_sems = \
    tidy_sentence_length(stt_transcripts, stt_words, word_labels, word_grams, word_sems)

## Train-test split

We need to extract which sentences contain German words in order to stratify the data split:

In [11]:
max_length = max(map(len, word_labels))
padded_labels = [row + [False] * (max_length - len(row)) for row in word_labels]
padded_labels = np.array(padded_labels)
stat_labels = np.any(padded_labels, axis=1)
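
The padding is only needed to get a rectangular array for `np.any`. As a minimal sketch, assuming each row of `word_labels` is a plain list of booleans (the toy `word_labels_demo` below is made up for illustration), the same per-sentence flag can be computed directly on the ragged rows:

```python
import numpy as np

# Made-up labels: one list of booleans per sentence, True marks a German word
word_labels_demo = [[False, True, False], [False, False], [True]]

# Padded route, as in the cell above
max_length_demo = max(map(len, word_labels_demo))
padded_demo = np.array(
    [row + [False] * (max_length_demo - len(row)) for row in word_labels_demo]
)
flags_padded = np.any(padded_demo, axis=1)

# Direct route: any() works on rows of different lengths
flags_direct = np.array([any(row) for row in word_labels_demo])

print(flags_padded.tolist(), flags_direct.tolist())
```

Both routes should give one boolean per sentence, which is what `stratify` expects.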

Here we split only the indices and not the data itself, because the data contains sequences of variable length, which do not work with `train_test_split`:

In [12]:
indices = list(range(len(stt_transcripts)))
tr_indices, te_indices = train_test_split(indices, test_size=0.2, random_state=0, shuffle=True, stratify=stat_labels)

These are helper functions that extract the data selected by the indices:

In [13]:
extract_train = itemgetter(*tr_indices)
extract_test  = itemgetter(*te_indices)
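
Two properties of `itemgetter` are worth keeping in mind here: called with several indices it returns a tuple (not a list), and called with exactly one index it returns the bare item. A small sketch on made-up data (`sentences_demo` and `train_idx_demo` are illustrative names only):

```python
from operator import itemgetter

sentences_demo = ['a b', 'c d e', 'f', 'g h']   # made-up data
train_idx_demo = [2, 0, 3]

# Several indices: a tuple, ordered like the indices
picked = itemgetter(*train_idx_demo)(sentences_demo)
print(picked)

# Edge case: a single index yields the item itself, not a 1-tuple
single = itemgetter(*[1])(sentences_demo)

# A list comprehension always returns a list, whatever the number of indices
picked_list = [sentences_demo[i] for i in train_idx_demo]
```

If a later step needs lists, or the split could ever contain a single sentence, the list comprehension may be the safer choice.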

Finally, split the data:

In [14]:
tr_stt_transcripts   = extract_train(stt_transcripts)
tr_stt_words         = extract_train(stt_words)

tr_word_labels       = extract_train(word_labels)
tr_word_grams        = extract_train(word_grams)
tr_word_sems         = extract_train(word_sems)

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

te_stt_transcripts   = extract_test(stt_transcripts)
te_stt_words         = extract_test(stt_words)

te_word_labels       = extract_test(word_labels)
te_word_grams        = extract_test(word_grams)
te_word_sems         = extract_test(word_sems)

# BERT part

Imports:

In [87]:
import torch
from transformers import BertTokenizer, BertModel, BertForMaskedLM

Load the tokenizer and the model. Make sure that the model outputs hidden states, as they will be used for word decoding. Also, set the model to evaluation mode, as we will not train it. TODO: test the option that prevents words from being split into multiple tokens.

In [88]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model_bert = BertForMaskedLM.from_pretrained('bert-base-uncased', output_hidden_states=True)
model_bert.eval();

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [121]:
tr_stt_transcripts[0]

'my favorite candy is   he i do not love very much kinde but the lowly pup is my favorite from far'

In [130]:
test_sentence = '[CLS]' + tr_stt_transcripts[0] + '[SEP]'  # Add special characters to denote start and end of sentence
tokens = tokenizer.tokenize(test_sentence)                  # Tokenize the sentence
indexed_tokens = tokenizer.convert_tokens_to_ids(tokens)    # Get dictionary indices for tokens
segments_ids = [1] * len(tokens)                            # The whole utterance is just one segment

# Next, convert the outputs to tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])


In [131]:
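with torch.no_grad():   # We aren't doing backprop
    # The second positional argument of forward() is attention_mask, not token_type_ids,
    # so this all-ones tensor acts as the attention mask and token_type_ids default to zeros
    outputs = model_bert(tokens_tensor, segments_tensors)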

In [132]:
outputs.hidden_states[0].shape

torch.Size([1, 25, 768])

In [133]:
logits = torch.squeeze(outputs.logits, dim=0)

In [134]:
probabilities = torch.nn.functional.softmax(logits, dim=-1)

In [135]:
probabilities.shape

torch.Size([25, 30522])

In [136]:
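def perplexity(probabilities):
    # Computes 2**(sum of p * ln p), i.e. 2 to the power of the negative entropy in nats,
    # so the result lies in (0, 1]. The usual perplexity would be
    # torch.exp(-torch.sum(probabilities * torch.log(probabilities), dim=-1)).
    return 2**torch.sum(probabilities*torch.log(probabilities), dim=-1)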

In [137]:
perplexity(probabilities)[1:-1]

tensor([0.9290, 0.9757, 0.9632, 0.9958, 0.8860, 0.9964, 0.9869, 0.9986, 0.5944,
        0.9511, 0.9919, 0.9724, 0.9973, 0.9656, 0.9971, 0.5210, 0.2393, 0.9799,
        0.9937, 0.9953, 0.9794, 0.9563, 0.9145])

There are three things that the model outputs:

In [23]:
outputs.keys()

odict_keys(['last_hidden_state', 'pooler_output', 'hidden_states'])

We are interested in the hidden states, of which there are:

In [24]:
len(outputs.hidden_states)

13

`outputs.hidden_states` is a tuple of 13 tensors. Each tensor is three-dimensional: the first dimension indexes batches (sentences), the second tokens, and the third the individual values of a token's embedding vector.

The first tensor holds the input embeddings, and the remaining 12 are the outputs of BERT's layers.

In [25]:
outputs.hidden_states[1].shape

torch.Size([1, 11, 768])

In [26]:
outputs.hidden_states[0].shape[1] == len(tokens)

True

`hidden_states` is a tuple, so we stack it into a single tensor. Since our input contains just one sentence, we also drop the batch dimension and then reorder the axes so that tokens come first.

In [27]:
token_embeddings = torch.stack(outputs.hidden_states, dim=0)
token_embeddings = torch.squeeze(token_embeddings, dim=1)
token_embeddings = token_embeddings.permute(1,0,2)
token_embeddings = token_embeddings[1:-1]                   # We don't need [CLS] and [SEP]
token_embeddings.shape

torch.Size([9, 13, 768])
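
To follow the stack, squeeze and permute steps without loading BERT, here is a sketch on random tensors. The sizes (13 layers, 11 tokens, width 768) mirror the shapes printed above; `fake_hidden_states` is a stand-in, not real model output.

```python
import torch

n_layers, n_tokens, width = 13, 11, 768

# Stand-in for outputs.hidden_states: a tuple of tensors shaped [batch=1, tokens, width]
fake_hidden_states = tuple(torch.randn(1, n_tokens, width) for _ in range(n_layers))

stacked = torch.stack(fake_hidden_states, dim=0)   # [13, 1, 11, 768]
no_batch = torch.squeeze(stacked, dim=1)           # [13, 11, 768]
per_token = no_batch.permute(1, 0, 2)              # [11, 13, 768]
inner = per_token[1:-1]                            # drop first and last token -> [9, 13, 768]

print(stacked.shape, no_batch.shape, per_token.shape, inner.shape)
```

After the permute, `inner[t, l]` is the vector of token `t + 1` at layer `l`, which makes per-token pooling over layers a simple slice.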

In [28]:
ts = torch.cat((token_embeddings[:, -4, :], token_embeddings[:, -3, :], token_embeddings[:, -2, :], token_embeddings[:, -1, :]), dim=1)
ts.shape

torch.Size([9, 3072])

The `token_embeddings` tensor now contains all of the embeddings for each of our tokens. The first dimension indexes the tokens, the second the layers, and the third the embedding values.

Next we need a single vector to represent each token. This can be done in multiple ways, the most important of which are:

- concatenation of last four layers
- sum of last four layers
- extraction of the second-to-last layer

Here, for the first pass, I will sum the last four layers.
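
The three strategies listed above can be sketched as follows on a random tensor with the same layout as `token_embeddings` (tokens, layers, width). `demo_embeddings` is a stand-in, not real BERT output.

```python
import torch

# Stand-in for token_embeddings: [tokens, layers, width]
demo_embeddings = torch.randn(9, 13, 768)

# 1. Concatenate the last four layers -> [tokens, 4 * width]
concat_last4 = demo_embeddings[:, -4:, :].reshape(demo_embeddings.shape[0], -1)

# 2. Sum the last four layers -> [tokens, width]
sum_last4 = demo_embeddings[:, -4:, :].sum(dim=1)

# 3. Take the second-to-last layer -> [tokens, width]
second_to_last = demo_embeddings[:, -2, :]

print(concat_last4.shape, sum_last4.shape, second_to_last.shape)
```

Concatenation keeps the most information at four times the width, while the sum and the single-layer options keep the original width of 768.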

In [29]:
token_vectors = []
for emb in token_embeddings:
    token_vectors.append(torch.sum(emb[-4:], dim=0))
token_vectors = torch.stack(token_vectors)
token_vectors.shape

torch.Size([9, 768])

In [30]:
token_vectors2 = torch.sum(token_embeddings[:,-4:,:], dim=1)
token_vectors2.shape

torch.Size([9, 768])

In [31]:
torch.equal(token_vectors, token_vectors2)

True

The vectorized version gives the same result and is more efficient!

Next, we need to go through the tokens and combine those that belong to the same word.

In [32]:
word_token_lengths = []
for word in tr_stt_words[23]:
    word_token_lengths.append(len(tokenizer.encode(word, add_special_tokens=False)))

In [33]:
start = 0                                                   # Index of the word's first token (avoids shadowing the builtin `id`)
word_vectors = []
for wl in word_token_lengths:
    word_vectors.append(torch.mean(token_vectors[start:start+wl], dim=0))
    start = start + wl
word_vectors = torch.stack(word_vectors)

In [34]:
word_vectors.shape

torch.Size([8, 768])

In [35]:
len(tr_stt_words[23])

8

In [36]:
type(word_vectors)

torch.Tensor

I strongly believe this works. If anyone has any idea why it might not, feel free to share.
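
One case where this pooling can silently go wrong is when the per-word token counts do not add up to the number of token vectors, for example when the word list and the transcript differ. Below is a hedged sketch of the same averaging with that check built in. `pool_word_vectors` is a hypothetical helper, not the implementation in `bert_encoder`, and the inputs are dummy values.

```python
import torch

def pool_word_vectors(token_vectors, word_token_lengths):
    """Average consecutive token vectors so that each word gets one vector."""
    if sum(word_token_lengths) != token_vectors.shape[0]:
        raise ValueError('token counts per word do not match the number of token vectors')
    # torch.split with a list of sizes cuts the tensor into one chunk per word
    chunks = torch.split(token_vectors, word_token_lengths, dim=0)
    return torch.stack([chunk.mean(dim=0) for chunk in chunks])

demo_tokens = torch.arange(12, dtype=torch.float32).reshape(6, 2)   # 6 tokens, width 2
demo_lengths = [1, 2, 3]                                            # 3 words
demo_word_vectors = pool_word_vectors(demo_tokens, demo_lengths)
print(demo_word_vectors)
```

With the slicing loop, a mismatch would only show up as wrong or empty slices, so raising early makes such sentences easy to find.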

Let's test the implementation:

In [37]:
from bert_encoder import encode_sentence

In [38]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model_bert = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
model_bert.eval();

Let's do the whole corpus:

Some transcripts are empty, and we should deal with that!

Also, deal with situations when the sentences are too long.
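
One simple way to handle empty transcripts is to skip them before encoding. This is a minimal sketch on made-up data, assuming that "empty" means a string with no visible characters or an empty word list:

```python
# Made-up stand-ins for tr_stt_transcripts and tr_stt_words
transcripts_demo = ['hello world', '', '   ', 'guten tag']
words_demo = [['hello', 'world'], [], [], ['guten', 'tag']]

# Keep only pairs whose transcript has visible content and at least one word
kept = [(s, w) for s, w in zip(transcripts_demo, words_demo) if s.strip() and w]
kept_transcripts = [s for s, _ in kept]
kept_words = [w for _, w in kept]

print(len(transcripts_demo) - len(kept), 'empty transcripts skipped')
```

The label lists would need to be filtered with the same condition so that everything stays aligned.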

In [40]:
for sentence, words in zip(tr_stt_transcripts, tr_stt_words):
    encode_sentence(sentence, words, model_bert, tokenizer)

There are 47 really long sentences. We should perhaps split them into multiple. I will see about the strategies tomorrow.
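
`bert-base-uncased` accepts at most 512 positions, so with `[CLS]` and `[SEP]` a sentence can carry up to 510 word pieces. One possible strategy, sketched below under the assumption that per-word token counts are already available (as in `word_token_lengths`), is to split the word list greedily into chunks that fit. `chunk_words` is a hypothetical helper and the numbers are made up.

```python
def chunk_words(words, token_lengths, max_tokens=510):
    """Greedily group words into chunks whose token counts stay within max_tokens."""
    chunks, current, current_len = [], [], 0
    for word, n_tok in zip(words, token_lengths):
        # Start a new chunk when adding this word would exceed the budget
        if current and current_len + n_tok > max_tokens:
            chunks.append(current)
            current, current_len = [], 0
        current.append(word)
        current_len += n_tok
    if current:
        chunks.append(current)
    return chunks

long_words_demo = ['a', 'bb', 'ccc', 'dd', 'e']
long_lengths_demo = [1, 2, 3, 2, 1]   # made-up number of word pieces per word
demo_chunks = chunk_words(long_words_demo, long_lengths_demo, max_tokens=4)
print(demo_chunks)
```

Each chunk would then be encoded separately, at the cost of losing context across chunk boundaries. A single word with more pieces than `max_tokens` still ends up in its own oversized chunk.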

IDEA: Don't use the original STT transcripts, but rather reconstruct them from a list of words.
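
A sketch of that idea on made-up data: joining the words with single spaces gives a sentence that is guaranteed to line up with the word list, and comparing it with the whitespace-normalized transcript shows whether the two sources agree.

```python
transcript_demo = 'my favorite candy is   he i do not'   # note the run of spaces
words_demo = ['my', 'favorite', 'candy', 'is', 'he', 'i', 'do', 'not']   # made-up word list

# Rebuild the sentence from the words
rebuilt = ' '.join(words_demo)

# Compare against the transcript with whitespace runs collapsed
matches_transcript = rebuilt == ' '.join(transcript_demo.split())
print(rebuilt, matches_transcript)
```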