## Global modules import

In [41]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [42]:
import json
import numpy as np
from operator import itemgetter
import random as rnd
import sys
import torch

## Local modules import

In [43]:
sys.path.append('..')

In [44]:
from data_loading import create_word_lists, tidy_sentence_length

In [45]:
from sklearn.model_selection import train_test_split

## Loading data

In [46]:
sys.path.append('../data')

In [47]:
with open('../data/corpus_data.json') as json_file:
    data = json.load(json_file)
data = data['records']

In [48]:
human_transcripts = [entry['human_transcript'] for entry in data]
stt_transcripts   = [entry['stt_transcript'] for entry in data]

In [49]:
human_words, stt_words, word_labels, word_grams, word_sems = \
    create_word_lists(data)

Some of the sentences are too long, so we need to shorten them.

In [50]:
stt_transcripts, stt_words, word_labels, word_grams, word_sems = \
    tidy_sentence_length(stt_transcripts, stt_words, word_labels, word_grams, word_sems)

# PIPELINE START
---

## Train-test split

We need to flag which sentences contain at least one German word, so that the data split can be stratified on that flag:

In [51]:
max_length = max(map(len, word_labels))
padded_labels = [row + [False] * (max_length - len(row)) for row in word_labels]
padded_labels = np.array(padded_labels)
stat_labels = np.any(padded_labels, axis=1)
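As a quick sanity check of the padding step, here is a minimal sketch on toy data. The `toy_*` names and values are made up for illustration and are not part of the corpus. It also shows that, as far as I can tell, the same sentence-level flag can be obtained without padding by calling `any` on each row.

```python
import numpy as np

# Hypothetical per-word labels: True marks a German word.
toy_word_labels = [
    [False, True, False],
    [False, False],
    [True],
    [False, False, False, False],
]

# Pad the ragged rows with False so they form a rectangular array.
toy_max_length = max(map(len, toy_word_labels))
toy_padded = np.array(
    [row + [False] * (toy_max_length - len(row)) for row in toy_word_labels]
)

# One flag per sentence: does it contain at least one German word?
toy_stat_labels = np.any(toy_padded, axis=1)

# Shortcut that skips the padding and works on the ragged lists directly.
toy_shortcut = np.array([any(row) for row in toy_word_labels])
```

Padding with `False` is safe here because it can never turn a sentence without German words into a flagged one.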

Here, we split only the indices and not the data itself, because the data consists of per-sentence lists of variable length, which does not work with `train_test_split`:

In [52]:
indices = list(range(len(stt_transcripts)))
tr_indices, te_indices = train_test_split(indices, test_size=0.2, random_state=0, shuffle=True, stratify=stat_labels)
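To see what stratification buys us, the sketch below runs the same call on a toy set of flags. The `toy_*` names and the 5-in-25 ratio are assumptions chosen only for illustration. The idea is that the share of flagged sentences should come out roughly the same in both parts.

```python
from sklearn.model_selection import train_test_split

# Hypothetical sentence-level flags: 5 of 25 sentences contain German words.
toy_flags = [True] * 5 + [False] * 20
toy_indices = list(range(len(toy_flags)))

# Same arguments as the real split, applied to the toy indices.
toy_tr, toy_te = train_test_split(
    toy_indices, test_size=0.2, random_state=0, shuffle=True, stratify=toy_flags
)

# Share of flagged sentences in each part of the split.
toy_tr_share = sum(toy_flags[i] for i in toy_tr) / len(toy_tr)
toy_te_share = sum(toy_flags[i] for i in toy_te) / len(toy_te)
```

Comparing `toy_tr_share` and `toy_te_share` on the real `stat_labels` would be a cheap check that the stratified split behaves as intended.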

These are helper functions that extract the data selected by the indices:

In [53]:
extract_train = itemgetter(*tr_indices)
extract_test  = itemgetter(*te_indices)
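For reference, a small sketch of how `itemgetter` behaves when given several indices, using a made-up list (`toy_sentences` is hypothetical). One thing to keep in mind is that it returns a tuple in index order, but with a single index it returns the bare item instead of a 1-tuple.

```python
from operator import itemgetter

# Hypothetical stand-in for one of the per-sentence lists.
toy_sentences = ["a b", "c d e", "f", "g h"]

# Several indices: the result is a tuple, ordered like the indices.
toy_pick = itemgetter(*[2, 0])
toy_picked = toy_pick(toy_sentences)

# A single index: the result is the item itself, not a 1-tuple.
toy_single = itemgetter(*[1])(toy_sentences)
```

This means the extracted training and test collections are tuples rather than lists, which is fine for iteration but would matter if they were modified in place later.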

Finally, split the data:

In [54]:
tr_stt_transcripts   = extract_train(stt_transcripts)
tr_stt_words         = extract_train(stt_words)

tr_word_labels       = extract_train(word_labels)
tr_word_grams        = extract_train(word_grams)
tr_word_sems         = extract_train(word_sems)

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

te_stt_transcripts   = extract_test(stt_transcripts)
te_stt_words         = extract_test(stt_words)

te_word_labels       = extract_test(word_labels)
te_word_grams        = extract_test(word_grams)
te_word_sems         = extract_test(word_sems)

## BERT part

In [55]:
import torch
from transformers import BertTokenizer, BertModel

In [56]:
from bert_encoder import encode_sentence

In [57]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model_bert = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
model_bert.eval();

In [58]:
tr_stt_vectors = []
te_stt_vectors = []

Encode the corpus:

In [60]:
for sentence, words in zip(tr_stt_transcripts, tr_stt_words):
    tr_stt_vectors.append(
        encode_sentence(sentence, words, model_bert, tokenizer)
    )

In [61]:
for sentence, words in zip(te_stt_transcripts, te_stt_words):
    te_stt_vectors.append(
        encode_sentence(sentence, words, model_bert, tokenizer)
    )

There are 47 really long sentences. We should perhaps split each of them into several shorter ones. I will look into strategies for this tomorrow.

IDEA: Don't use the original STT transcripts, but rather reconstruct them from a list of words.
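One possible way to combine both notes, sketched under assumptions: cut each word list into fixed-size chunks and rebuild the sentence text for each chunk by joining its words with spaces. The helper `chunk_words`, the `max_words` threshold and the toy data are all hypothetical. Whether `encode_sentence` accepts text rebuilt this way has not been checked. Note also that the limit here is in words, while the 512-position limit of `bert-base-uncased` is in WordPiece tokens, so `max_words` would need some safety margin.

```python
def chunk_words(words, labels, max_words):
    """Split parallel word/label lists into chunks of at most max_words items.

    Returns a list of (sentence, words, labels) triples, where the sentence
    is rebuilt from the chunk's words instead of the original transcript.
    """
    chunks = []
    for start in range(0, len(words), max_words):
        chunk = words[start:start + max_words]
        chunk_labels = labels[start:start + max_words]
        chunks.append((" ".join(chunk), chunk, chunk_labels))
    return chunks


# Hypothetical sentence with per-word labels (True marks a German word).
toy_words = ["ich", "think", "this", "is", "gut"]
toy_labels = [True, False, False, False, True]

toy_chunks = chunk_words(toy_words, toy_labels, max_words=2)
```

Fixed-size chunking cuts sentences at arbitrary points, so it is only a baseline. The other per-word lists (`word_grams`, `word_sems`) would have to be sliced the same way to stay aligned.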