# CORPUS AUGMENTATION

This notebook contains the code used to generate an augmented corpus with a higher proportion of German words. The Google Translate API was used to translate from English to German.

## Global modules import

In [3]:
%load_ext autoreload
%autoreload 2

In [4]:
import json
import numpy as np
import random as rnd
import sys
import torch

from sklearn.model_selection import train_test_split
from operator import itemgetter

## Loading data

In [4]:
from data_loading import create_word_lists, tidy_sentence_length

In [5]:
with open("data/corpus_data.json") as json_file:
    data = json.load(json_file)
data = data["records"]

In [6]:
human_transcripts = [entry["human_transcript"] for entry in data]
stt_transcripts = [entry["stt_transcript"] for entry in data]

In [7]:
human_words, stt_words, word_labels, word_grams, word_sems = create_word_lists(data)

Some of the sentences are too long, so we need to shorten them. The sentences are simply concatenations of individual words separated by spaces, without any punctuation, so they are reconstructed from the word lists when necessary.

In [8]:
stt_transcripts, stt_words, word_labels, word_grams, word_sems = tidy_sentence_length(
    stt_transcripts, stt_words, word_labels, word_grams, word_sems
)

In [9]:
max_length = max(map(len, word_labels))
padded_labels = [row + [False] * (max_length - len(row)) for row in word_labels]
padded_labels = np.array(padded_labels)
stat_labels = np.any(padded_labels, axis=1)
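The padding above serves only to compute one flag per sentence: whether the sentence contains at least one German-labelled word. As a minimal sketch on made-up labels (`toy_labels` is hypothetical data, not part of the corpus), the same flags can be obtained without padding, because padding a row with `False` never changes the result of `any`:

```python
# Toy word-level labels (hypothetical data): True marks a German word.
toy_labels = [
    [False, False, True],
    [False],
    [False, False, False, False],
    [True, True],
]

# One flag per sentence: does it contain at least one German word?
toy_stat_labels = [any(row) for row in toy_labels]
print(toy_stat_labels)  # [True, False, False, True]
```

These sentence-level flags are what the stratified split below relies on to keep the share of German-containing sentences similar in both parts.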

Here we split only the indices, not the data itself, because the data contains arrays of variable length, which do not work with `train_test_split`:

In [10]:
indices = list(range(len(stt_transcripts)))
tr_indices, te_indices = train_test_split(
    indices, test_size=0.2, random_state=0, shuffle=True, stratify=stat_labels
)

These are helper functions that extract the data selected by the indices:

In [11]:
extract_train = itemgetter(*tr_indices)
extract_test = itemgetter(*te_indices)
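One detail of `itemgetter` is worth keeping in mind when reusing these helpers on other index lists. The sketch below uses made-up data (`toy_sentences` and the `extract` helper are hypothetical, not part of the notebook) to show that `itemgetter` returns a tuple for two or more indices but the bare item for a single index; a list comprehension is one possible alternative that always returns a list.

```python
from operator import itemgetter

toy_sentences = ["a b", "c d e", "f", "g h"]  # hypothetical data

# With two or more indices, itemgetter returns a tuple in index order.
pick_many = itemgetter(*[2, 0])
print(pick_many(toy_sentences))  # ('f', 'a b')

# With exactly one index, it returns the bare item, not a 1-tuple.
pick_one = itemgetter(*[3])
print(pick_one(toy_sentences))  # g h  (a str, not a tuple)


def extract(seq, indices):
    """Index-based selection that always returns a list."""
    return [seq[i] for i in indices]


print(extract(toy_sentences, [3]))  # ['g h']
```

With the split used in this notebook both index lists contain many entries, so the tuple behaviour applies.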

Finally, split the data:

In [12]:
tr_stt_transcripts = extract_train(stt_transcripts)
tr_stt_words = extract_train(stt_words)

tr_word_labels = extract_train(word_labels)
tr_word_grams = extract_train(word_grams)
tr_word_sems = extract_train(word_sems)

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

te_stt_transcripts = extract_test(stt_transcripts)
te_stt_words = extract_test(stt_words)

te_word_labels = extract_test(word_labels)
te_word_grams = extract_test(word_grams)
te_word_sems = extract_test(word_sems)

## Translate English Words

In [13]:
from googletrans import Translator

Here we choose which part of the corpus to translate. Instead of using `tr_` variables, `te_` variables can be chosen to translate test words:

In [15]:
new_tr_words, new_tr_word_labels = ([], [])
to_translate_words = []

for sentence, labels in zip(tr_stt_words, tr_word_labels):
    if any(labels):
        new_tr_words.append(sentence)
        new_tr_word_labels.append(labels)
    else:
        to_translate_words.append(sentence)
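Since the same selection may be run on either the `tr_` or the `te_` variables, it can be convenient to wrap it in a function. The following is a minimal sketch under that assumption; the name `split_for_translation` and the toy data are hypothetical and do not appear elsewhere in the project.

```python
def split_for_translation(word_lists, label_lists):
    """Separate sentences that already contain German words from the rest.

    Returns (kept_words, kept_labels, to_translate), where to_translate
    holds the sentences in which no word is labelled as German.
    """
    kept_words, kept_labels, to_translate = [], [], []
    for sentence, labels in zip(word_lists, label_lists):
        if any(labels):
            kept_words.append(sentence)
            kept_labels.append(labels)
        else:
            to_translate.append(sentence)
    return kept_words, kept_labels, to_translate


# Toy example (hypothetical data): only the second sentence is all-English.
toy_words = [["guten", "morning"], ["hello", "there"]]
toy_labels = [[True, False], [False, False]]

kept_words, kept_labels, to_translate = split_for_translation(toy_words, toy_labels)
print(kept_words)    # [['guten', 'morning']]
print(to_translate)  # [['hello', 'there']]
```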

In [6]:
from translation import translate_sentences

Do the translation:

In [18]:
translations, tr_labels, n = translate_sentences(to_translate_words)

100%|██████████| 4676/4676 [29:44<00:00,  2.62it/s]  


Check the proportion of failed translations:

In [19]:
print(n / len(to_translate_words))

0.5241659538066724


If necessary, `translate_sentences` can be applied again to the newly created dataset to further increase the proportion of German words. Failed API calls seem unavoidable, but translating the same dataset repeatedly improves the results.
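One generic way to cope with transient API failures is to retry each call a few times before giving up. The sketch below is not how `translate_sentences` works internally (its implementation is not shown here); `translate_with_retries`, `translate_one`, and `flaky_translate` are hypothetical names, and the stand-in translator only simulates a failing call.

```python
import time


def translate_with_retries(sentence, translate_one, max_attempts=3, base_delay=1.0):
    """Call translate_one(sentence) until it succeeds or attempts run out.

    Returns the translation, or None if every attempt raised an exception.
    """
    for attempt in range(max_attempts):
        try:
            return translate_one(sentence)
        except Exception:
            # Back off exponentially, but do not sleep after the last attempt.
            if attempt < max_attempts - 1:
                time.sleep(base_delay * (2 ** attempt))
    return None


# Stand-in translator (hypothetical): fails on the first call, then succeeds.
calls = {"count": 0}


def flaky_translate(sentence):
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("simulated API failure")
    return sentence.upper()


result = translate_with_retries("hello there", flaky_translate, base_delay=0.0)
print(result)  # HELLO THERE
```

Returning `None` on failure lets the caller count failed sentences and decide whether to keep the untranslated original.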

Merge the original German sentences and the augmented part of dataset:

In [20]:
new_tr_words = new_tr_words + translations
new_tr_word_labels = new_tr_word_labels + tr_labels

Finally, save the dataset:

In [25]:
import os
import pickle

In [27]:
out_path = "intermediate_data/translator_basic"

In [28]:
with open(os.path.join(out_path, "words_higher_perc.pkl"), "wb") as file:
    pickle.dump(new_tr_words, file)
with open(os.path.join(out_path, "labels_higher_perc.pkl"), "wb") as file:
    pickle.dump(new_tr_word_labels, file)
with open(os.path.join(out_path, "test_words_higher_perc.pkl"), "wb") as file:
    pickle.dump(te_stt_words, file)
with open(os.path.join(out_path, "test_labels_higher_perc.pkl"), "wb") as file:
    pickle.dump(te_word_labels, file)
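Two practical points about this step: `open(..., "wb")` raises an error if the output directory does not exist, and it is worth confirming that the files load back unchanged. The following sketch illustrates both on made-up data in a temporary directory; the paths and `toy_words` are hypothetical and do not touch the real `intermediate_data` folder.

```python
import os
import pickle
import tempfile

toy_words = [["guten", "morning"], ["hello", "there"]]  # hypothetical data

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_out_path = os.path.join(tmp_dir, "intermediate_data", "toy")
    # Create the output directory first; exist_ok avoids an error on reruns.
    os.makedirs(toy_out_path, exist_ok=True)

    toy_file = os.path.join(toy_out_path, "words.pkl")
    with open(toy_file, "wb") as file:
        pickle.dump(toy_words, file)

    # Load the file back to check the round trip.
    with open(toy_file, "rb") as file:
        restored = pickle.load(file)

print(restored == toy_words)  # True
```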

We can also check the proportion of German words in the augmented set:

In [30]:
num_words = 0
num_germans = 0
for l in new_tr_word_labels:
    num_words += len(l)
    num_germans += sum(l)
print(num_germans / num_words)

0.11717936393495153


And the same proportion in the original corpus:

In [34]:
num_words = 0
num_germans = 0
for l in word_labels:
    num_words += len(l)
    num_germans += sum(l)
print(num_germans / num_words)

0.029981542412326458
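Because the same computation is applied to two different label lists, it could be factored into a small helper. This is a sketch only; the name `german_word_ratio` and the toy labels are hypothetical.

```python
def german_word_ratio(label_lists):
    """Fraction of words labelled True (German) across all sentences."""
    num_words = sum(len(labels) for labels in label_lists)
    num_germans = sum(sum(labels) for labels in label_lists)
    # Guard against an empty corpus to avoid division by zero.
    return num_germans / num_words if num_words else 0.0


# Toy labels (hypothetical data): 3 German words out of 6.
toy_labels = [[True, False, False], [False], [True, True]]
print(german_word_ratio(toy_labels))  # 0.5
```

Calling it on `new_tr_word_labels` and on `word_labels` would reproduce the two checks above.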
