# BERT-MLP MODEL

This notebook contains a sample pipeline that shows the BERT-MLP model used in the project. The model given here deviates from the baseline as it has weights. As multiple parameter configurations were tested in the paper, the places in which code should be changed to implement another variant of this model are denoted and instructions are given on how to obtain all of the variations that were tested. This configuration was chosen as it was most performant.

## Global modules import

In [25]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [26]:
import json
import numpy as np
import os
import random as rnd
import sys
import torch

from sklearn.model_selection import train_test_split
from operator import itemgetter

## Local modules import

## Loading data

In [27]:
from data_loading import create_word_lists, tidy_sentence_length

In [28]:
with open("data/corpus_data.json") as json_file:
    data = json.load(json_file)
data = data["records"]

Extract whole trasncripts:

In [29]:
human_transcripts = [entry["human_transcript"] for entry in data]
stt_transcripts = [entry["stt_transcript"] for entry in data]

Extract individual words and their labels:

In [30]:
human_words, stt_words, word_labels, word_grams, word_sems = create_word_lists(data)

Some of the sentences are too long, so we need to shorten them. The sentences are basically concatenations of individual words with spaces in between, without any interpuction, so they are reconstructed from word lists when necessary.

In [31]:
stt_transcripts, stt_words, word_labels, word_grams, word_sems = tidy_sentence_length(
    stt_transcripts, stt_words, word_labels, word_grams, word_sems
)

Here, instead of loading the original corpus, augmented data can also be used. File `translation.py` contains a function that can be used to produce more German words in the dataset by usage of Google Translate API calls.

# PIPELINE START
---

## Train-test split

We need to extract which sentences contain German words in order to stratify the data split:

In [32]:
max_length = max(map(len, word_labels))
padded_labels = [row + [False] * (max_length - len(row)) for row in word_labels]
padded_labels = np.array(padded_labels)
stat_labels = np.any(padded_labels, axis=1)

Here, we split only indices and not data itself, because the data contains arrays of variable length, which does not work with `train_test_split`:

In [33]:
indices = list(range(len(stt_transcripts)))
tr_indices, te_indices = train_test_split(
    indices, test_size=0.2, random_state=0, shuffle=True, stratify=stat_labels
)

These are hepler functions that will extract data selected by indices:

In [34]:
extract_train = itemgetter(*tr_indices)
extract_test = itemgetter(*te_indices)

Finally, do data splitting:

In [35]:
tr_stt_transcripts = extract_train(stt_transcripts)
tr_stt_words = extract_train(stt_words)

tr_word_labels = extract_train(word_labels)
tr_word_grams = extract_train(word_grams)
tr_word_sems = extract_train(word_sems)

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

te_stt_transcripts = extract_test(stt_transcripts)
te_stt_words = extract_test(stt_words)

te_word_labels = extract_test(word_labels)
te_word_grams = extract_test(word_grams)
te_word_sems = extract_test(word_sems)

## BERT part

In [36]:
import torch
from transformers import BertTokenizer, BertModel

  from .autonotebook import tqdm as notebook_tqdm


In [37]:
from bert_encoder import encode_sentence

We are here using pretrained base models, but other ones can be chosen, too. Using the multilingual model did not yield any benefits.

In [38]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model_bert = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model_bert.eval();

In [39]:
tr_stt_vectors = []
te_stt_vectors = []

Encode the corpus:

In [40]:
for sentence, words in zip(tr_stt_transcripts, tr_stt_words):
    tr_stt_vectors.append(encode_sentence(sentence, words, model_bert, tokenizer))

In [41]:
for sentence, words in zip(te_stt_transcripts, te_stt_words):
    te_stt_vectors.append(encode_sentence(sentence, words, model_bert, tokenizer))

The encodnigs in the previous step can be changed by passing `vectorization` parameter to the `encode_sentence` function. By default, encoding is done by summing the last four hidden layers of the BERT model. If `vectorization` is set to `stl`, second-to-last layer is used as representation, and if `concat`, the last four layers are concatenated to obtain the final representation. 

Convert the corpus to tensors:

In [42]:
tr_tensor = torch.vstack(tr_stt_vectors)
tr_label_tensor = torch.tensor(
    [int(element) for sublist in tr_word_labels for element in sublist]
)
tr_grams_tensor = torch.tensor(
    [int(element) for sublist in tr_word_grams for element in sublist]
)
tr_sems_tensor = torch.tensor(
    [int(element) for sublist in tr_word_sems for element in sublist]
)


te_tensor = torch.vstack(te_stt_vectors)
te_label_tensor = torch.tensor(
    [int(element) for sublist in te_word_labels for element in sublist]
)
te_grams_tensor = torch.tensor(
    [int(element) for sublist in te_word_grams for element in sublist]
)
te_sems_tensor = torch.tensor(
    [int(element) for sublist in te_word_sems for element in sublist]
)

## MLP part

For quicker experimenting, load saved data:

In [44]:
import itertools
import pandas as pd

from sklearn.model_selection import StratifiedKFold

from tqdm import tqdm

import torch.nn as nn
import torch.optim as optim

In [45]:
from mlp import MLP, cross_validate_model, train_model, calc_stats

Use CUDA accelleration if possible:

In [46]:
torch_device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)

Here we explore the inpact of number of neurons per layer on our result. The rest of the parameters were chosen based on previous grid searched. That can be changed to explore the full hyperparameter space if necessary.

In [47]:
epochs = 20
hidden_layers = 1
neurons_per_layer_options = [32, 64, 128, 256, 512, 700, 1024, 2048]
learning_rate = 1e-4

Define global variables:

In [48]:
best_loss = float("inf")
best_param = None

features = tr_tensor
labels = tr_label_tensor

criterion = nn.BCELoss()
splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

german_proportion = tr_label_tensor.to(torch.float).mean()
weights = torch.tensor([1 / (1 - german_proportion), 1 / german_proportion])

The class weights in the previous cell were calculated so that a less prominent class has more weight on the result of loss calculation.

Features can be augmented to add a column representing grammar or semantic errors, or model perplexity score.

Create a temporary array to store intermediate data:

In [49]:
grid_search_data = []

Do a grid search for best number of neurons per layer:

In [50]:
for neurons_per_layer in tqdm(neurons_per_layer_options):
    model = MLP(features.shape[1], hidden_layers, neurons_per_layer).to(torch_device)
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    tr_loss, tr_loss_std, te_loss, te_loss_std = cross_validate_model(
        model,
        features,
        labels,
        criterion,
        optimizer,
        splitter,
        n_epochs=epochs,
        num_workers=0,
        device=torch_device,
        class_weights=weights,
    )

    values_to_add = [neurons_per_layer, tr_loss, tr_loss_std, te_loss, te_loss_std]

    # Add preliminary data to dataframe
    grid_search_data.append(values_to_add)

    if te_loss < best_loss:
        best_loss = te_loss
        best_param = neurons_per_layer

  0%|          | 0/8 [00:00<?, ?it/s]

100%|██████████| 8/8 [06:23<00:00, 47.93s/it]


The above loop can be redesigned to perform bias-variance tradeoff testing, by splitting the dataset into multiple subsets and using their growing unions as data for cross-validation.

Create dataframe to store hyperparameter data and save it:

In [51]:
out_path = "."
columns = ["neurons_per_layer", "tr_loss", "tr_loss_std", "te_loss", "te_loss_std"]
gs_frame = pd.DataFrame(grid_search_data, columns=columns)
gs_frame.to_csv(os.path.join(out_path, "gs_data_test.csv"), index=False)

## Test the best model

In [52]:
from mlp import STTDataset
from torch.utils.data import DataLoader

Train the model on the whole dataset with the best parameters:

In [53]:
train_data = STTDataset(tr_tensor, tr_label_tensor)
train_loader = DataLoader(train_data, batch_size=128, shuffle=True, num_workers=0)

german_proportion = tr_label_tensor.to(torch.float).mean()
weights = torch.tensor([1 / (1 - german_proportion), 1 / german_proportion])
neurons_per_layer = best_param

criterion = nn.BCELoss(reduction="none")
model = MLP(train_data.embeddings.shape[1], hidden_layers, neurons_per_layer).to(
    torch_device
)
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

In [54]:
train_model(
    model,
    criterion,
    optimizer,
    train_loader,
    n_epochs=epochs,
    device=torch_device,
    class_weights=weights,
)

0.10087998536274866

Test the model on the test set:

In [55]:
test_data = STTDataset(te_tensor, te_label_tensor)
test_loader = DataLoader(
    test_data, batch_size=len(test_data), shuffle=True, num_workers=0
)
criterion = nn.BCELoss()

In [56]:
model.eval()
with torch.no_grad():
    for inputs, labels in test_loader:
        inputs, labels = inputs.to(torch_device), labels.to(torch_device)
        pred = model(inputs)
        pred = torch.squeeze(pred, dim=1)
        loss = criterion(pred, labels.to(torch.float)).item()

In [57]:
accuracy, precision, recall, f1 = calc_stats(pred, te_label_tensor)

In [58]:
f1

0.03375527426160338

Save the results for later use:

In [59]:
results = pd.DataFrame(
    [[loss, accuracy, precision, recall, f1]],
    columns=["loss", "accuracy", "precision", "recall", "f1"],
)
results.to_csv(os.path.join(out_path, "results.csv"), index=False)

Check how well the model performs on german words by extracting them and their assigned labels.

In [60]:
all_te_words = [element for sublist in te_stt_words for element in sublist]
all_te_labels = [element for sublist in te_word_labels for element in sublist]
all_te_predictions = (pred.to("cpu").numpy().flatten() > 0.5).astype(int)

In [61]:
german_words = []
german_predictions = []
for i in range(len(all_te_words)):
    if all_te_labels[i]:
        german_words.append(all_te_words[i])
        german_predictions.append(all_te_predictions[i])

predicted_labels = pd.DataFrame(
    {"word": german_words, "prediction": german_predictions}
)
predicted_labels.to_csv(os.path.join(out_path, "word_labels.csv"))

In [62]:
predicted_labels.sort_values(by="prediction", ascending=False).head(20)

Unnamed: 0,word,prediction
156,ma,1
66,i,1
331,on,1
257,to,1
35,spear,1
128,salton,1
320,site,1
41,as,1
386,s,1
122,he,1
