## WMT Training Data Processing
This notebook develops a script to parse the WMT BTT training sets.

### Start State
We have three WMT BTT training sets - 2016, 2019, and 2022. They are in three slightly different formats, requiring bespoke parsing.
- The 2016 data is in a single .txt file of about 100MB. Each line is formatted as `PMID|[Source sentence].|Target sentence`. A quick inspection of the dataset shows that some source sentences are marked as "Not Available" and others as "In Process Citation"; we must discard these. Quite a few sentences also have double quotation marks encoded as `&quot;`, which we must substitute back.
- The 2019 data is in a folder comprising aligned parallel abstracts in files of the format PMID_en.txt and PMID_fr.txt. Within each abstract, sentences beginning with # are not translated, and can be ignored.
- The 2022 data is in a folder comprising unaligned parallel abstracts in files of the format PMID_en.txt and PMID_fr.txt. Following Choi et al. (2022), we will split each abstract into sentences using a sentence splitter and keep only those abstract pairs in which the source and target have the same number of sentences.

### Desired Outcome
For every training set, we want to generate a single, **separate** .txt file containing aligned pairs of source and target sentences in the format `source_sentence (TAB) target_sentence (newline)`. This is more convenient for evaluation sets, for which we will use the HuggingFace `Trainer` API. We will concatenate the training files later. Note that we haven't started preprocessing our data yet - we will use the method of Wu et al. (2022) for this later.
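To make the target format concrete, here is a minimal sketch of reading such a file back into sentence pairs. The helper name `read_pairs` and the demo file are hypothetical and only stand in for the files generated later in this notebook.

```python
import os
import tempfile


def read_pairs(path):
    """Yield (source, target) tuples from a tab-separated sentence-pair file."""
    with open(path, "r", encoding="utf8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # Skip blank lines
            # Split on the first tab only, in case the target itself contains a tab
            source, _, target = line.partition("\t")
            yield source, target


# Hypothetical demo file standing in for one of the generated training files
demo_path = os.path.join(tempfile.mkdtemp(), "demo_train.txt")
with open(demo_path, "w", encoding="utf8") as handle:
    handle.write("The patient recovered.\tLe patient s'est rétabli.\n")
    handle.write("No adverse events occurred.\tAucun événement indésirable n'est survenu.\n")

pairs = list(read_pairs(demo_path))
print(len(pairs))
```

Splitting on the first tab only means a line always yields exactly one source and one target, which is the invariant the rest of the pipeline relies on.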

### Considerations
- HuggingFace's programmatic Dataset upload API is broken; this issue is still unresolved and cannot be overcome by switching to earlier versions. We therefore have to preprocess and upload via the web interface.
- We cannot use a CSV, because the `datasets` library's `to_csv` function adds strange characters to the data, even with the correct UTF encoding applied. In essence, converting to CSV and uploading _that_ leads to data corruption. We will therefore use a .txt file and upload it via the web interface - end-users will load our dataset as a `Dataset` or `DatasetDict` object, so the .txt source will be transparent to them. I'll write up a script to process these .txt files, since they will all be in the same format.
- We won't be using pandas here, as our sentences are not split into multiple tab-separated columns (which pandas is good for). We'll just read the .txt files and process them as we go.

### Unpacking
We've already unpacked the tar.gz files using `tar -xzf`; I've renamed the WMT19 training data directory to `wmt19_enfr_train` to be more consistent.

Let's begin with the 2016 data.

In [1]:
#Get current working directory
import os
dir_path = os.getcwd()

In [2]:
PATHS = {16:"raw_training_data/wmt16_enfr_train.txt", 19:"raw_training_data/wmt19_enfr_train", 22: "raw_training_data/wmt22_enfr_train"}

In [3]:
#We will preprocess line by line into a fresh .txt file. The with block closes both files even if an error occurs.
with open(os.path.join(dir_path, PATHS[16]), "r", encoding = "utf8") as wmt_2016, \
     open(os.path.join(dir_path, "wmt16_train.txt"), "w", encoding = "utf8") as output:
    #print(repr(wmt_2016.readline().strip())) #No hidden characters
    for line in wmt_2016: #Iterate lazily instead of loading the whole ~100MB file with readlines()
        line = line.strip() #Remove trailing \n
        content = line.split("|", 2) #Split into PMID, source, and target; maxsplit keeps any extra | inside the target
        if (len(content) < 3): #Skip malformed lines rather than raising an IndexError
            continue
        source = content[1]
        if ((source == "[Not Available].") or ("in process citation" in source.lower())): #No source translation
            continue
        #Remove delimiters and weird characters
        source = source.replace("[", "")
        source = source.replace("]", "")
        source = source.replace("&quot;", '"')
        #Full stops are added to source sentences regardless of ending punctuation, but we cannot do a block replacement without affecting !... and ?..., so this is the best we can do.
        if (source.endswith(("?.", "!."))):
            source = source[:-1]
        target = content[2]
        if ("in process citation" in target.lower()): #No target translation
            continue
        target = target.replace("&quot;", '"')
        if (target.endswith(("?.", "!."))):
            target = target[:-1]

        output.write(source + "\t" + target + "\n")

Looking through the resultant .txt file, it's not too bad - we have some misalignments, but this is to be expected, and we can rectify these later during preprocessing. Right now, we just want to convert our file into a consistent format, and we have done this successfully. Let's move on to the 2019 data.
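The per-line rules for the 2016 data can also be isolated into a small function, which makes them easy to test on individual lines. This is a sketch and not the cell that produced our file: the name `parse_wmt16_line` is hypothetical, and it swaps the `&quot;` replacement for `html.unescape`, which decodes every HTML entity rather than only double quotes.

```python
import html


def parse_wmt16_line(line):
    """Return (source, target) for one 'PMID|[Source].|Target' line, or None to skip it."""
    parts = line.strip().split("|", 2)  # At most three fields, so extra pipes stay in the target
    if len(parts) != 3:
        return None  # Malformed line
    _, source, target = parts
    if source == "[Not Available]." or "in process citation" in source.lower():
        return None  # No source sentence
    if "in process citation" in target.lower():
        return None  # No target sentence
    # Assumption: decoding all entities is acceptable, not just &quot;
    source = html.unescape(source.replace("[", "").replace("]", ""))
    target = html.unescape(target)
    # Drop the extra full stop appended after a question or exclamation mark
    if source.endswith(("?.", "!.")):
        source = source[:-1]
    if target.endswith(("?.", "!.")):
        target = target[:-1]
    return source, target


# Made-up example lines in the 2016 format
example = parse_wmt16_line('123|[Is this &quot;safe&quot;?].|Est-ce &quot;sûr&quot;?.\n')
skipped = parse_wmt16_line("456|[Not Available].|Texte.")
print(example)
print(skipped)
```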

In [4]:
files_2019 = sorted(os.listdir(os.path.join(dir_path, PATHS[19]))) #Gives us all the names of the files in the directory. os.listdir does not guarantee any order, so we sort explicitly; each PMID's _en file then sits directly before its _fr file.

In [5]:
print(files_2019[1])

20847962_fr.txt


In [6]:
#Let's first iterate through the list to check whether we have unmatched files (always a possibility).
for fileNum in range(0, len(files_2019), 2):
    if(files_2019[fileNum][:-7] != files_2019[fileNum + 1][:-7]):
        print("oops!") #Wow, there really aren't any unmatched files!

In [7]:
path_2019 = os.path.join(dir_path, PATHS[19])
with open(os.path.join(dir_path, "wmt19_train.txt"), "w", encoding = "utf8") as output:
    for i in range(0, len(files_2019), 2):
        #The with block closes both abstract files as soon as their lines have been read
        with open(os.path.join(path_2019, files_2019[i]), "r", encoding = "utf8") as f1, \
             open(os.path.join(path_2019, files_2019[i + 1]), "r", encoding = "utf8") as f2:
            sourceSentences = f1.readlines()
            targetSentences = f2.readlines()
        for j in range(len(sourceSentences)):
            if(sourceSentences[j][0] == "#"): #Ignore untranslated sentences
                continue
            #We need to remove untranslated article names; these are found in the first sentence after the # marks.
            #Generally, we can get around this with not much loss by starting the source sentence from BACKGROUND, which skips the title.
            #Subsequent preprocessing will filter out poorly-aligned sentences - this is just an early labour-saving step.
            startOfTranslated = sourceSentences[j].find("BACKGROUND")
            if(startOfTranslated != -1):
                sourceSentences[j] = sourceSentences[j][startOfTranslated:]
            if(sourceSentences[j][0] == "["): #Some other sentences have the untranslated article name in front of the translated sentences; the name is enclosed in square brackets and ends with a full stop.
                endOfTitle = sourceSentences[j].find("]") + 2 #We want the first char after the full stop
                sourceSentences[j] = sourceSentences[j][endOfTitle:]
                if (sourceSentences[j].strip() == ""):
                    continue
            translate_msg = targetSentences[j].find("[Traduction par l’éditeur].") #A few target sentences have this message appended to them - Translation by the editor. We omit this.
            if(translate_msg != -1):
                targetSentences[j] = targetSentences[j][:translate_msg]
                if (targetSentences[j].strip() == ""):
                    continue
            output.write(sourceSentences[j].strip() + "\t" + targetSentences[j].strip() + "\n") #Get rid of whitespace

The output looks okay! Let's move on to the 2022 data now. We found an unpaired file, 32479674_en.txt, and a non-EN/FR file, 32514214_no.txt; we removed both from the directory.
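Problem files like these two can be found without relying on the order of the directory listing, by grouping file names on their PMID. The following is a sketch only: `find_unpaired` is a hypothetical helper, and PMID `111` in the demo listing is made up.

```python
from collections import defaultdict


def find_unpaired(filenames, languages=("en", "fr")):
    """Return sorted file names that lack a complete set of language counterparts."""
    by_pmid = defaultdict(dict)
    problems = []
    for name in filenames:
        stem, _, ext = name.rpartition(".")
        pmid, _, lang = stem.rpartition("_")
        if ext != "txt" or lang not in languages or not pmid:
            problems.append(name)  # Unexpected extension or language code
            continue
        by_pmid[pmid][lang] = name
    for found in by_pmid.values():
        if set(found) != set(languages):
            problems.extend(found.values())  # One side of the pair is missing
    return sorted(problems)


# Demo listing: one complete (made-up) pair plus the two problem cases described above
listing = ["111_en.txt", "111_fr.txt", "32479674_en.txt", "32514214_no.txt"]
print(find_unpaired(listing))
```

Unlike a pairwise comparison of neighbouring list entries, this reports every problem file in one pass and does not shift out of step after the first unpaired entry.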

In [8]:
files_2022 = sorted(os.listdir(os.path.join(dir_path, PATHS[22]))) #Gives us all the names of the files in the directory; sorted explicitly, because os.listdir does not guarantee any order

In [9]:
#Let's first iterate through the list to check whether we have unmatched files (always a possibility).
for fileNum in range(0, len(files_2022), 2):
    if(files_2022[fileNum][:-7] != files_2022[fileNum + 1][:-7]):
        print(files_2022[fileNum]) #All is well.

In [10]:
#Following Choi et al.(2022), we will use MosesSentenceSplitter to split our many-sentence-in-one-line abstracts.
#from mosestokenizer import MosesSentenceSplitter #Very, very slow, likely due to its unwrapping ability. We won't need that here - we will use a sentence aligner to check later on.
from sentence_splitter import SentenceSplitter #Source: https://libraries.io/pypi/sentence-splitter
sourceSplitter = SentenceSplitter(language='en')
targetSplitter = SentenceSplitter(language='fr')

In [13]:
path_2022 = os.path.join(dir_path, PATHS[22])
with open(os.path.join(dir_path, "wmt22_train.txt"), "w", encoding = "utf8") as output:
    for i in range(0, len(files_2022), 2):
        #Use with so that both files are closed even when we skip the abstract below
        with open(os.path.join(path_2022, files_2022[i]), "r", encoding = "utf8") as f1, \
             open(os.path.join(path_2022, files_2022[i + 1]), "r", encoding = "utf8") as f2:
            sourceSentences = sourceSplitter.split(text=f1.readline()) #Each abstract sits on a single line
            targetSentences = targetSplitter.split(text=f2.readline())
        if (len(sourceSentences) != len(targetSentences)): #If there are more sentences in either, we don't know which should be aligned with which, so ignore (per Choi et al.)
            continue
        for j in range(len(sourceSentences)):
            output.write(sourceSentences[j].strip() + "\t" + targetSentences[j].strip() + "\n") #Get rid of whitespace
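The equal-sentence-count filter used in the cell above can be sketched independently of any particular splitter. In this sketch `align_by_count` is a hypothetical helper that takes the two splitters as arguments, and `naive_split` is a regex stand-in for demonstration only, not the `sentence_splitter` package used in this notebook.

```python
import re


def align_by_count(source_text, target_text, split_source, split_target):
    """Pair sentences positionally, or return [] when the sentence counts differ."""
    source_sentences = split_source(source_text)
    target_sentences = split_target(target_text)
    if len(source_sentences) != len(target_sentences):
        return []  # Ambiguous alignment, so the whole abstract is skipped
    return [(s.strip(), t.strip()) for s, t in zip(source_sentences, target_sentences)]


def naive_split(text):
    """Demo-only splitter: break after ., ! or ? when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


kept = align_by_count("First point. Second point.", "Premier point. Second point.",
                      naive_split, naive_split)
dropped = align_by_count("One. Two. Three.", "Un. Deux.", naive_split, naive_split)
print(kept)
print(dropped)
```

Passing the splitters in as callables keeps the filter itself testable without installing a sentence-splitting library.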

This was adequate for a first pass, but we note that the output contains several HTML tags, such as \<i>, \<sub>, \<sup>, and so on. This is fine, though - these appear in the test set, too! I think we can leave them in, because we need to keep the test set the same as the WMT22 BTT one anyway. Now, it's time to concatenate all the training data. Concatenating first doesn't create extra work for preprocessing, because the first step in preprocessing is to remove duplicates, and those must be detected across all our training data. Later, when we add more data to our training corpus by various means, we'll repeat this preprocessing step over the new training data.

In [14]:
with open(os.path.join(dir_path, "wmt_parallel_train.txt"), "w", encoding = "utf8") as final_train_output:
    #Append each per-year file in turn; the inner with block closes every input file after copying
    for name in ["wmt16_train.txt", "wmt19_train.txt", "wmt22_train.txt"]:
        with open(os.path.join(dir_path, name), "r", encoding = "utf8") as train_file:
            for line in train_file:
                final_train_output.write(line)

And we've generated our initial in-domain parallel corpus! Our next step will be preprocessing.
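Since duplicate removal across the whole corpus is the first preprocessing step, here is a minimal sketch of what exact-match deduplication could look like. This is an illustration under our own assumptions, not the method of Wu et al. (2022); `dedupe_lines` is a hypothetical helper and the sample lines are made up.

```python
def dedupe_lines(lines):
    """Return lines with exact duplicates removed, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for line in lines:
        key = line.rstrip("\n")  # Compare without the trailing newline
        if key in seen:
            continue
        seen.add(key)
        unique.append(line)
    return unique


# Made-up sentence pairs; the third line repeats the first
sample = ["Hello.\tBonjour.\n", "Thanks.\tMerci.\n", "Hello.\tBonjour.\n"]
print(dedupe_lines(sample))
```

Keeping the first occurrence preserves the original ordering of the corpus, which makes it easier to trace a surviving pair back to its source file.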