## Global modules import

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
import json
import numpy as np
from operator import itemgetter
import random as rnd
import sys
import torch

## Local modules import

In [4]:
sys.path.append('..')

In [5]:
from data_loading import create_word_lists

In [6]:
from sklearn.model_selection import train_test_split

## Loading data

In [7]:
sys.path.append('../data')

In [8]:
with open('../data/corpus_data.json') as json_file:
    data = json.load(json_file)
data = data['records']

In [9]:
human_transcripts = [entry['human_transcript'] for entry in data]
stt_transcripts   = [entry['stt_transcript'] for entry in data]

In [10]:
human_words, stt_words, word_labels, word_grams, word_sems = \
    create_word_lists(data)

# PIPELINE START
---

## Train-test split

We need to identify which sentences contain German words so that the data split can be stratified:

In [11]:
max_length = max(map(len, word_labels))
padded_labels = [row + [False] * (max_length - len(row)) for row in word_labels]
padded_labels = np.array(padded_labels)
stat_labels = np.any(padded_labels, axis=1)
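Since only one flag per sentence is needed, the same result can be obtained without padding. A minimal sketch, using a made-up `toy_word_labels` list in place of the real `word_labels`:

```python
# Toy stand-in for word_labels: one list of booleans per sentence,
# True marking a German word (values are made up for illustration).
toy_word_labels = [
    [False, False, True],
    [False],
    [False, False, False, False],
    [True, True],
]

# any() over each row gives the same per-sentence flag as padding the rows
# with False and calling np.any(..., axis=1), without a rectangular array.
toy_stat_labels = [any(row) for row in toy_word_labels]
print(toy_stat_labels)  # [True, False, False, True]
```

Padding with `False` never changes the outcome of `any`, which is why the two approaches agree.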

Here, we split only the indices rather than the data itself, because the data contains arrays of variable length, which do not work with `train_test_split`:

In [12]:
indices = list(range(len(human_transcripts)))
tr_indices, te_indices = train_test_split(indices, test_size=0.2, random_state=0, shuffle=True, stratify=stat_labels)

These are helper functions that extract the data selected by the indices:

In [13]:
extract_train = itemgetter(*tr_indices)
extract_test  = itemgetter(*te_indices)
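As a small illustration on toy data (the lists and indices below are made up), `itemgetter` picks the same positions out of any list, regardless of how long the individual elements are. It returns a tuple, except when given exactly one index, in which case it returns the bare item:

```python
from operator import itemgetter

# Toy variable-length "sentences" and a hypothetical set of training indices.
toy_words = [["a"], ["b", "c"], ["d", "e", "f"], ["g"]]
toy_tr_indices = [2, 0]

extract_toy = itemgetter(*toy_tr_indices)
selected = extract_toy(toy_words)
print(selected)  # (['d', 'e', 'f'], ['a'])

# With exactly one index, itemgetter returns the item itself, not a tuple.
single = itemgetter(1)(toy_words)
print(single)  # ['b', 'c']
```

The single-index case only matters if a split ever ends up with one element, which should not happen with a dataset of this kind, but it is worth keeping in mind.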

Finally, split the data:

In [14]:
tr_human_transcripts = extract_train(human_transcripts) 
tr_stt_transcripts   = extract_train(stt_transcripts)
tr_human_words       = extract_train(human_words)
tr_stt_words         = extract_train(stt_words)

tr_word_labels       = extract_train(word_labels)
tr_word_grams        = extract_train(word_grams)
tr_word_sems         = extract_train(word_sems)

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

te_human_transcripts = extract_test(human_transcripts) 
te_stt_transcripts   = extract_test(stt_transcripts)
te_human_words       = extract_test(human_words)
te_stt_words         = extract_test(stt_words)

te_word_labels       = extract_test(word_labels)
te_word_grams        = extract_test(word_grams)
te_word_sems         = extract_test(word_sems)

## BERT part

Imports:

In [127]:
import torch
from transformers import BertTokenizer, BertModel

Load the tokenizer and the model. Make sure that the model outputs hidden states, as they will be used for word decoding. Also, set the model to evaluation mode, as we will not train it. There is an option, still to be tested, that prevents words from being split into multiple tokens. TODO.

In [128]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model_bert = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
model_bert.eval();

In [140]:
test_sentence = '[CLS]' + tr_stt_transcripts[23] + '[SEP]'  # Add special tokens to denote start and end of sentence
tokens = tokenizer.tokenize(test_sentence)                  # Tokenize the sentence
indexed_tokens = tokenizer.convert_tokens_to_ids(tokens)    # Get dictionary indices for tokens
segments_ids = [1] * len(tokens)                            # The whole utterance is just one segment

# Next, convert the outputs to tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])


In [141]:
with torch.no_grad():   # We aren't doing backprop
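    # In transformers, the second positional argument of BertModel is attention_mask,
    # so the all-ones tensor is passed by keyword to make its role explicit.
    outputs = model_bert(tokens_tensor, attention_mask=segments_tensors)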

There are three things that the model outputs:

In [142]:
outputs.keys()

odict_keys(['last_hidden_state', 'pooler_output', 'hidden_states'])

We are interested in the hidden states. Their number is:

In [143]:
len(outputs.hidden_states)

13

`outputs.hidden_states` is a tuple of 13 tensors. Each tensor is three-dimensional: the first dimension indexes the batch (sentences), the second the tokens, and the third the values of each token's embedding vector.

The first tensor holds the input embeddings, and the remaining 12 are the outputs of BERT's layers.

In [144]:
outputs.hidden_states[1].shape

torch.Size([1, 39, 768])

In [139]:
outputs.hidden_states[0].shape[1] == len(tokens)

True

`outputs.hidden_states` is a tuple, so we stack it into a tensor. Furthermore, as our input contains just one sentence, we drop the batch dimension.

In [147]:
token_embeddings = torch.stack(outputs.hidden_states, dim=0)
token_embeddings = torch.squeeze(token_embeddings, dim=1)
token_embeddings = token_embeddings.permute(1,0,2)
token_embeddings = token_embeddings[1:-1]                   # We don't need [CLS] and [SEP]
token_embeddings.shape

torch.Size([37, 13, 768])

Now, the `token_embeddings` tensor contains the embeddings from every layer for each of our tokens. The first dimension indexes the tokens, the second the layers, and the third the embedding values.

Next, we need a single vector to represent each token. This can be done in multiple ways, the most important of which are

- concatenation of last four layers
- sum of last four layers
- extraction of the second-to-last layer

Here, for the first pass, I will sum the last four layers.

In [148]:
token_vectors = []
for emb in token_embeddings:
    token_vectors.append(torch.sum(emb[-4:], dim=0))
token_vectors = torch.stack(token_vectors)
token_vectors.shape

torch.Size([37, 768])

In [155]:
token_vectors2 = torch.sum(token_embeddings[:,-4:,:], dim=1)
token_vectors2.shape

torch.Size([37, 768])

In [156]:
torch.equal(token_vectors, token_vectors2)

True

The vectorized version gives the same result and is more efficient!
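For comparison, the three pooling strategies listed earlier can be sketched side by side. This is only an illustration on a random tensor standing in for `token_embeddings` (the names prefixed with `toy_` and the token count of 5 are assumptions, not values from the data):

```python
import torch

# Random stand-in for token_embeddings: 5 tokens, 13 layers, 768 values each.
torch.manual_seed(0)
toy_embeddings = torch.randn(5, 13, 768)

# Concatenate the last four layers: one 4 * 768 = 3072-dimensional vector per token.
cat_vectors = toy_embeddings[:, -4:, :].reshape(toy_embeddings.shape[0], -1)

# Sum the last four layers: one 768-dimensional vector per token.
sum_vectors = toy_embeddings[:, -4:, :].sum(dim=1)

# Take the second-to-last layer only: also 768-dimensional.
penultimate_vectors = toy_embeddings[:, -2, :]

print(cat_vectors.shape, sum_vectors.shape, penultimate_vectors.shape)
```

Concatenation keeps the most information but quadruples the vector size, which may matter for whatever model consumes these vectors downstream.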

Next, we need to go through the tokens and combine those that belong to the same word.

In [157]:
word_token_lengths = []
for word in tr_stt_words[23]:
    word_token_lengths.append(len(tokenizer.encode(word, add_special_tokens=False)))

In [159]:
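start = 0                                                   # Index of the first token of the current word
word_vectors = []
for wl in word_token_lengths:
    # Average the vectors of all tokens that belong to the same word
    word_vectors.append(torch.mean(token_vectors[start:start+wl], dim=0))
    start = start + wl
word_vectors = torch.stack(word_vectors)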

In [160]:
word_vectors.shape

torch.Size([36, 768])

In [169]:
len(tr_stt_words[23])

36

I strongly believe this works. If anyone has any idea why it might not, feel free to share.
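One cheap way to probe the word-to-token alignment is to check that the per-word token counts add up to the number of sentence-level tokens before pooling. A minimal sketch; `word_spans` is a hypothetical helper, not part of this notebook, and the counts in the example are made up:

```python
def word_spans(word_token_lengths, n_tokens):
    """Turn per-word token counts into (start, end) slices over the token list.

    Raises ValueError if the counts do not cover the tokens exactly, which
    would mean word-by-word and sentence-level tokenization disagree.
    """
    total = sum(word_token_lengths)
    if total != n_tokens:
        raise ValueError(
            f"{total} word-level tokens vs {n_tokens} sentence-level tokens"
        )
    spans, start = [], 0
    for length in word_token_lengths:
        spans.append((start, start + length))
        start += length
    return spans


# Made-up example: three words split into 1, 3 and 2 tokens.
spans = word_spans([1, 3, 2], n_tokens=6)
print(spans)  # [(0, 1), (1, 4), (4, 6)]
```

In this notebook the call would presumably be `word_spans(word_token_lengths, token_vectors.shape[0])`. A matching total is a necessary condition for correct alignment, not a proof of it, so it is best treated as a sanity check.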