## WMT OOD and Glossary Processing
This notebook will be used for devising a script to parse WMT OOD data and our MeSpEn glossary.

### Start State
The WMT OOD data is in a big .tsv file about 110MB in size, while the MeSpEn glossary is in a small .tsv file containing around 6.5k rows.

### Desired Outcome
For each, we want to generate a single, **separate** .txt file containing aligned pairs of source and target sentences, one pair per line, in the format `source_sentence (TAB) target_sentence (newline)`. This is more convenient for evaluation sets, for which we will use the HuggingFace `Trainer` API. We will concatenate them later. Note that we haven't started preprocessing our data yet; we will use the method of Wu et al. (2022) for this later.
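
To make the target format concrete, here is a minimal sketch of how such a file could be read back into pairs. The helper name `read_pairs` and the demo sentences are made up for illustration; they are not part of our pipeline.

```python
import os
import tempfile


def read_pairs(path):
    """Yield (source, target) tuples from a file with one `source<TAB>target` pair per line."""
    with open(path, "r", encoding="utf8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # Skip blank lines
            # Split on the first tab only; a line without any tab raises ValueError here.
            source, target = line.split("\t", 1)
            yield source, target


# Tiny demonstration file (hypothetical content, not taken from our data).
demo_path = os.path.join(tempfile.mkdtemp(), "demo_pairs.txt")
with open(demo_path, "w", encoding="utf8") as handle:
    handle.write("The heart\tLe cœur\n")
    handle.write("Good morning\tBonjour\n")

pairs = list(read_pairs(demo_path))
print(pairs)
```

Because the function is a generator, it never holds the whole file in memory, which matters for the larger OOD file.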

### Considerations
- HuggingFace's programmatic Dataset upload API is broken; this issue is still unresolved and cannot be overcome by switching to earlier versions. We therefore have to preprocess and upload via the web interface.
- We cannot use a CSV, because the `datasets` library's `to_csv` function adds strange characters to the data, even with the correct UTF encoding applied. In essence, converting to CSV and uploading _that_ leads to data corruption. We therefore use a .txt file and upload it via the web interface; end-users will load our dataset as a `Dataset` or `DatasetDict` object, and the .txt file source will be transparent to them. I'll write up a script to process these .txt files, since they will all be in the same format.
- We won't be using pandas here, as our sentences are not split into multiple tab-separated columns (which pandas is good for). We'll just read the .txt files and process them as we go.

### Unpacking
We've already unzipped the WMT22 OOD data using `gunzip`. This yields a tab-separated file. Let's begin with the OOD data; it seems more convenient.

In [1]:
#Get current working directory
import os
dir_path = os.getcwd()
PATHS = {"OOD":"raw_training_data/wmt22_enfr_news_train.tsv", "glossary":"raw_training_data/mespen_enfr_glossary.tsv"}

In [13]:
# Stream the file line by line; it is about 110MB, so we avoid loading it all with readlines().
with open(os.path.join(dir_path, PATHS["OOD"]), "r", encoding="utf8") as wmt_ood, \
     open(os.path.join(dir_path, "ood_train.txt"), "w", encoding="utf8") as output:  # Other encodings don't seem to work.
    for line in wmt_ood:
        if line == "\t\n":  # There are blank lines between articles - we must remove them.
            continue
        # U+FFFD (the Unicode replacement character) occurs at seemingly random positions in a few sentences,
        # replacing seemingly random characters (or even no characters). Unsure what these mean.
        # If sentences don't make sense after doing this, we can eliminate such sentences
        # (e.g., mismatched quotation marks) during pre-processing.
        line = line.replace("\ufffd", "")
        output.write(line)  # Apart from that, it's pretty much in our ideal format! Yay!
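
Before dropping these characters, it may be worth measuring how many lines they touch. The sketch below assumes the character being stripped is U+FFFD, the Unicode replacement character, which usually indicates that some bytes could not be decoded at an earlier stage. The helper name `count_replacement_chars` and the sample lines are hypothetical.

```python
def count_replacement_chars(lines):
    """Return (affected_line_count, total_replacement_chars) for an iterable of strings."""
    affected = 0
    total = 0
    for line in lines:
        n = line.count("\ufffd")
        if n:
            affected += 1
            total += n
    return affected, total


# Hypothetical sample: one clean line and one line with three replacement characters.
sample = ["clean line\tligne propre\n", "bro\ufffdken\tcass\ufffd\ufffd\n"]
result = count_replacement_chars(sample)
print(result)
```

On the real data, an open file handle can be passed in directly, since the function only iterates over lines.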

That wasn't too bad, although we may have a bit more work to do during preprocessing. Let's move on to the MeSpEn glossary. At first glance, this glossary is quite dirty: there are English phrases scattered across the target side, abbreviation expansions rather than translations, partial translations, comments from the translator, and so on.

In [61]:
dirty_list = [",", ";", "(", ")", "!", "\\", "/", "#", "*", "[", "]", "="] #All tokens which deviate from one-source-term-to-one-target-term translation
removed_sentences = ["Abbreviation	Entire Word", "Acute respiratory insufficiency	Possibly no EN abbrev.", "Ao	Aorta", "chest pain	retro-sternal pain", "Cx	Circumflex", "Day 2	Day 2",
                     "English	French", "French	English", "Hearing Aids	ãÚíäÇÊ ÓãÚíÉ", "hr	Hour", "nv	Normal value", "Possibly no EN abbrev.	Acute respiratory insufficiency", "wind-sock design	ref.",
                    ]

In [62]:
with open(os.path.join(dir_path, PATHS["glossary"]), "r", encoding="utf8") as glossary, \
     open(os.path.join(dir_path, "clean_glossary.txt"), "w", encoding="utf8") as output1, \
     open(os.path.join(dir_path, "dirty_glossary.txt"), "w", encoding="utf8") as output2, \
     open(os.path.join(dir_path, "banned.txt"), "w", encoding="utf8") as banned_list:
    for line in glossary:
        line = line[3:]  # Skip space, number, space preceding source terminology
        line = line.strip()  # Remove all leading and trailing whitespace
        if any(sentence in line for sentence in removed_sentences):  # Filter out removed sentences
            continue
        line = line.replace(".v.", "")  # These represent verbs; no such annotations will be present at inference time
        line = line.replace("=>", "")  # These markings may be translator-specific
        line = line.replace("->", "")  # These markings may be translator-specific
        terms = line.split("\t")
        if len(terms) < 2:  # No tab left after stripping, so there is no source/target pair
            continue
        source, target = terms[0].strip(), terms[1].strip()
        if source == "" or target == "":  # Some lines are apparently empty
            continue
        line = source + "\t" + target  # Some lines contain extra spaces around the \t symbols
        if source.isupper():  # Flag for removal due to abbreviation possibility - I will manually append later
            banned_list.write(line + "\n")
            continue
        if not any(substring in line for substring in dirty_list):  # Only pairs free of problematic tokens count as clean
            output1.write(line + "\n")
        output2.write(line + "\n")  # The dirty glossary keeps every remaining pair, clean ones included

It is very difficult to filter this automatically, mainly because `a, b (TAB) c` may mean that both source terms a and b translate into c with equal fidelity, or that b is a contextualisation of a. We also cannot use langdetect or langid to filter by language, because we only have terms rather than sentences. Yet we may lose important information if we only use the clean glossary (although 5k terms is still quite substantial; HW-TSC only had 6k). Thus, I will manually go through both the clean and dirty glossaries to remove obvious English terminology on the target side, which is feasible because they are small. We also know that the English terminology comes from the mistaken inclusion of an English abbreviation-to-full-word list, so we can filter based on that. Choi et al. managed to get good results by appending the glossary directly to the training corpus, and we can do that with the dirty glossary. What's important is that we only use one-to-one terms for our soft constraints.
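
As an aid to the manual pass, a rough heuristic can surface pairs that look like an abbreviation followed by its expansion. This is only a sketch for flagging review candidates, not the filter actually used here: the function names, the length cutoff of 5, and the subsequence test are all my own assumptions. It will also flag some genuine translations, so it cannot replace manual inspection.

```python
def is_subsequence(short, long):
    """True if the characters of `short` appear in `long` in the same order."""
    remaining = iter(long)
    return all(ch in remaining for ch in short)


def looks_like_expansion(source, target, max_abbrev_len=5):
    """Heuristic: a short, single-word source whose letters appear in order in the target.

    Both sides must also start with the same letter. The cutoff `max_abbrev_len`
    is an arbitrary assumption, not a value derived from the glossary.
    """
    src = source.strip().lower()
    tgt = target.strip().lower()
    if not src or len(src) > max_abbrev_len or " " in src:
        return False
    return tgt[:1] == src[:1] and is_subsequence(src, tgt)


# Pairs taken from the removed_sentences list, plus one ordinary translation pair.
candidates = [("hr", "Hour"), ("nv", "Normal value"), ("Cx", "Circumflex"), ("heart", "cœur")]
flags = [looks_like_expansion(source, target) for source, target in candidates]
print(flags)
```

The first three pairs are flagged and the last is not, which matches what we would want to review by hand.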

After manually filtering all lists, we add the acceptable pairs from the filtered banned list (i.e., pairs that are not English-abbreviation-to-English-full-form pairs) back to the clean and dirty glossaries as appropriate.

In [63]:
with open(os.path.join(dir_path, "banned_filtered.txt"), "r", encoding="utf8") as filtered_banned, \
     open(os.path.join(dir_path, "clean_glossary.txt"), "a", encoding="utf8") as clean_glossary, \
     open(os.path.join(dir_path, "dirty_glossary.txt"), "a", encoding="utf8") as dirty_glossary:
    for line in filtered_banned:
        line = line.strip()
        if not line:  # Skip blank lines so we don't append empty rows
            continue
        if not any(substring in line for substring in dirty_list):  # Same cleanliness test as before
            clean_glossary.write(line + "\n")
        dirty_glossary.write(line + "\n")  # Every retained pair goes into the dirty glossary
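
Since the output files were partly edited by hand, a quick format check before uploading seems worthwhile. The following is a minimal sketch, assuming every line should contain exactly one tab with non-empty text on both sides; the helper name `find_malformed` and the sample lines are hypothetical.

```python
def find_malformed(lines):
    """Return 1-based line numbers that are not `source<TAB>target` with both sides non-empty."""
    bad = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2 or not all(field.strip() for field in fields):
            bad.append(number)
    return bad


# Hypothetical sample: one valid line, then a line with no tab,
# a line with two tabs, and a line with an empty source side.
sample = ["heart\tcœur\n", "no tab here\n", "a\tb\tc\n", "\tmissing source\n"]
malformed = find_malformed(sample)
print(malformed)
```

To check a real file, open it with `encoding="utf8"` and pass the file handle to `find_malformed`; an empty list means every line has the expected shape.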