## WMT Test Set Processing
This notebook develops a script to parse the WMT BTT test sets.

### Start State
The WMT test sets from 2020, 2021 and 2022 each comprise four files: test (source) sentences, gold (target) sentences, a mapping from document ID to document number, and sentence alignments.

### Desired Outcome
We want two different outputs, each in the format most convenient for the utility that will consume it.
- For our **test set**, which is the WMT 2022 BTT test set, we want to generate two .txt files of aligned sentences. The first .txt file contains source sentences s<sub>1</sub>, s<sub>2</sub> ... s<sub>N</sub>, while the second .txt file contains target sentences t<sub>1</sub>, t<sub>2</sub> ... t<sub>N</sub>, where t<sub>i</sub> is the gold-standard translation of s<sub>i</sub>. This format is most convenient for the Transformers `pipeline` utility.
- For our **validation set**, which comprises the concatenated 2020 and 2021 test sets, we want to generate a single .txt file containing aligned pairs of source and target sentences in the format `source_sentence (TAB) target_sentence (newline)`. This is more convenient for evaluation sets, for which we will use the HuggingFace `Trainer` API.
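
The two layouts can be sketched on a toy example. The helper names `to_parallel` and `to_tsv` and the sentence pairs are illustrative assumptions, not part of the processing done later in this notebook.

```python
# Hypothetical helpers illustrating the two output layouts described above.

def to_parallel(pairs):
    """Return (source_text, target_text), one sentence per line, with line i of each aligned."""
    sources = "".join(src + "\n" for src, _ in pairs)
    targets = "".join(tgt + "\n" for _, tgt in pairs)
    return sources, targets

def to_tsv(pairs):
    """Return a single text with one 'source<TAB>target' pair per line."""
    return "".join(src + "\t" + tgt + "\n" for src, tgt in pairs)

# Toy data, invented for illustration only.
pairs = [
    ("The patient recovered.", "Le patient s'est rétabli."),
    ("No adverse events occurred.", "Aucun événement indésirable n'est survenu."),
]

source_text, target_text = to_parallel(pairs)  # test-set layout: two aligned texts
tsv_text = to_tsv(pairs)                       # validation-set layout: one tab-separated text
```

Both layouts assume that no sentence contains a newline, and the tab-separated layout additionally assumes that no sentence contains a tab.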

### Considerations
- At the time of writing, HuggingFace's programmatic Dataset upload API is broken; the issue is still unresolved and cannot be worked around by switching to earlier versions. We therefore have to preprocess locally and upload via the web interface.
- We cannot use a CSV, because the `datasets` library's `to_csv` function adds strange characters to the data, even with the correct UTF encoding applied. In essence, converting to CSV and uploading _that_ corrupts the data. We therefore use .txt files and upload them via the web interface. End users will load our dataset as a `Dataset` or `DatasetDict` object, so the .txt source will be invisible to them. I'll write a script to process these .txt files, since they are all in the same format.
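
One way to gain some confidence that the plain-text route leaves the data unchanged is a round-trip check: write a few sentences to a .txt file and compare what is read back. This is only a sketch; the function name `roundtrip_ok` and the sample sentences are assumptions made for illustration.

```python
import os
import tempfile

def roundtrip_ok(sentences, encoding="utf8"):
    """Write sentences one per line, read them back, and report whether they are unchanged."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "roundtrip.txt")
        with open(path, "w", encoding=encoding) as f:
            for sentence in sentences:
                f.write(sentence + "\n")
        with open(path, encoding=encoding) as f:
            read_back = [line.rstrip("\n") for line in f]
    return read_back == sentences

# Invented sample containing accented characters, which are the usual casualties of encoding problems.
sample = ["L'étude a été menée à l'hôpital.", "Les résultats sont présentés ci-dessous."]
result = roundtrip_ok(sample)
```

The check only covers sentences without embedded newlines, which is also what the one-sentence-per-line format requires.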

In [1]:
#Get current working directory
import os, pandas as pd
dir_path = os.getcwd()

In [2]:
#Organise file path names first
YEARS = [20, 21, 22]
FILE_NAMES = ["alignment.tsv", "gold.txt", "mapping.txt", "test.txt"]
FILE_PATHS = {year : [f"../data/wmt{year}_enfr_test/wmt{year}_enfr_{name}" for name in FILE_NAMES] for year in YEARS}

In [3]:
#Start with 2022
files_2022 = FILE_PATHS[22]

In [4]:
#Inspect and read content into a dictionary of pandas dataframes
content = {"alignment":"", "gold":"", "mapping":"", "test":""} #Helps us identify which is which
numMap = {0 : "alignment", 1: "gold", 2 : "mapping", 3 : "test"}
col_names = {0 : ["alignmentQuality", "ID", "targetSentenceNum", "sourceSentenceNum"], 1 : ["docNum", "targetSentenceNum", "sentence"], 2 : ["ID", "docNum"], 3 : ["docNum", "sourceSentenceNum", "sentence"]}
for i in range(len(files_2022)):
    #Open explicitly as UTF-8 so that reading does not depend on the platform's default encoding
    with open(os.path.join(dir_path, files_2022[i]), encoding = "utf8") as f:
        #print(repr(f.readline())) #All our files are tab-separated! This allows us to use pandas as follows:
        content[numMap[i]] = pd.read_csv(f, delimiter = "\t", header=None, names = col_names[i])

In [5]:
#Test our ability to write output without data corruption - seems okay from what I can see
"""
f = open("testoutput.txt", "w")
for i in range(len(content["gold"])):
    f.write(content["gold"]["sentence"][i])
    f.write("\n")
f.close()
"""

Our first step is to modify the alignments by substituting document numbers for IDs. We'll add an extra column for this.
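
As a minimal sketch of this step on toy data (the IDs, document numbers, and variable names below are invented for illustration), the lookup can be built from the mapping table and applied with `Series.map`:

```python
import pandas as pd

# Toy stand-ins for the alignment and mapping tables (values invented for illustration).
toy_alignment = pd.DataFrame({
    "alignmentQuality": ["OK", "OK", "NO_ALIGNMENT"],
    "ID": [111, 111, 222],
    "targetSentenceNum": ["1", "2", "omitted"],
    "sourceSentenceNum": ["1", "2", "5"],
})
toy_mapping = pd.DataFrame({"ID": [111, 222], "docNum": ["doc1", "doc2"]})

# Build an ID -> docNum lookup and attach it as a new column.
id_to_doc = dict(zip(toy_mapping["ID"], toy_mapping["docNum"]))
toy_alignment["docNum"] = toy_alignment["ID"].map(id_to_doc)

# IDs without a mapping would become NaN, so it is worth counting them.
unmapped = toy_alignment["docNum"].isna().sum()
```

Counting unmapped rows is an extra check that the notebook itself does not perform; it would flag alignment IDs that are absent from the mapping file.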

In [6]:
#Convert our mappings dataframe into a dictionary
mappings = dict(content["mapping"].values)

In [7]:
content["alignment"]

Unnamed: 0,alignmentQuality,ID,targetSentenceNum,sourceSentenceNum
0,OK,35096693,1,1
1,OK,35096693,2,2
2,OK,35096693,3,3
3,OK,35096693,4,4
4,OK,35096693,5,5
...,...,...,...,...
808,NO_ALIGNMENT,19144122,omitted,26
809,NO_ALIGNMENT,19144122,omitted,27
810,NO_ALIGNMENT,19144122,omitted,28
811,NO_ALIGNMENT,19144122,omitted,29


In [8]:
#Create new column with corresponding document numbers
alignments = content["alignment"]
alignments["docNum"] = alignments["ID"].map(mappings)

In [9]:
alignments

Unnamed: 0,alignmentQuality,ID,targetSentenceNum,sourceSentenceNum,docNum
0,OK,35096693,1,1,doc19
1,OK,35096693,2,2,doc19
2,OK,35096693,3,3,doc19
3,OK,35096693,4,4,doc19
4,OK,35096693,5,5,doc19
...,...,...,...,...,...
808,NO_ALIGNMENT,19144122,omitted,26,doc11
809,NO_ALIGNMENT,19144122,omitted,27,doc11
810,NO_ALIGNMENT,19144122,omitted,28,doc11
811,NO_ALIGNMENT,19144122,omitted,29,doc11


Then, we keep only the OK sentence alignments and use them to query the gold and test dataframes. Oddly, some OK alignments have their target or source sentence numbers omitted, perhaps due to truncated abstracts, so we exclude those as well. This is still in line with WMT BTT protocol, because the organisers only used the OK-aligned test sets.
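
The filter can be sketched on a toy table as three boolean masks combined with `&`. The rows below are invented, and the mask names are assumptions chosen for readability.

```python
import pandas as pd

# Toy alignment table (values invented for illustration).
toy = pd.DataFrame({
    "alignmentQuality": ["OK", "OK", "NO_ALIGNMENT", "OK"],
    "targetSentenceNum": ["1", "omitted", "omitted", "2,3"],
    "sourceSentenceNum": ["1", "2", "3", "4"],
})

# Keep rows rated OK whose sentence numbers are present on both sides.
is_ok = toy["alignmentQuality"] == "OK"
has_target = ~toy["targetSentenceNum"].str.contains("omitted")
has_source = ~toy["sourceSentenceNum"].str.contains("omitted")
kept = toy.loc[is_ok & has_target & has_source]
```

This sketch assumes the sentence-number columns hold strings with no missing values; with missing values, `str.contains` would need its `na` argument set explicitly.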

In [10]:
ok_alignments = alignments.loc[(alignments["alignmentQuality"] == "OK") & 
(alignments["targetSentenceNum"].str.contains("omitted") == False) & 
(alignments["sourceSentenceNum"].str.contains("omitted") == False)]

In [11]:
print(ok_alignments.targetSentenceNum.unique())
print(ok_alignments.sourceSentenceNum.unique())

['1' '2' '3' '4' '5' '6' '7' '8' '9' '10' '11' '12' '13' '14' '15' '16'
 '17' '18' '19' '20' '21' '22' '23' '24' '25' '26' '14,15' '11,12' '10,11'
 '27' '9,10' '4,5' '2,3']
['1' '2' '3' '4' '5' '6' '7' '8,9' '10' '11' '12' '13' '14' '15' '16' '8'
 '9' '17' '18' '1,2' '19' '20' '21,22' '23' '24' '25' '26' '27' '28' '21'
 '22' '4,5']


We also notice that some alignments span multiple sentences on one side (e.g., `8,9`), perhaps because of semicolons. This shouldn't be a big problem: we look up each sentence listed in the comma-separated entry and concatenate them, separated by a space.
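
A minimal sketch of that concatenation, using a plain dictionary in place of the dataframes (the lookup contents and the helper name `join_aligned` are assumptions for illustration):

```python
# Toy lookup keyed by (docNum, sentence number); contents are invented for illustration.
sentences = {
    ("doc1", "8"): "First clause;",
    ("doc1", "9"): "second clause.",
    ("doc1", "10"): "A separate sentence.",
}

def join_aligned(doc_num, sentence_nums, lookup):
    """Join the sentences named in a comma-separated alignment field with single spaces."""
    return " ".join(lookup[(doc_num, num)] for num in sentence_nums.split(","))

merged = join_aligned("doc1", "8,9", sentences)  # two sentences become one output line
single = join_aligned("doc1", "10", sentences)   # a single sentence is returned unchanged
```

Using `" ".join(...)` avoids having to special-case the last sentence in the list.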

In [12]:
ok_alignments = ok_alignments.astype({"sourceSentenceNum": str, "targetSentenceNum": str, "docNum" : str})
gold = content["gold"].astype({"targetSentenceNum": str, "docNum" : str, "sentence" : str})
test = content["test"].astype({"sourceSentenceNum": str, "docNum" : str, "sentence" : str})

In [13]:
testSentences = ok_alignments[["sourceSentenceNum", "docNum"]]
#Write one source segment per line; multi-sentence alignments (e.g. "8,9") are joined with a space
#Write explicitly as UTF-8 so that the output does not depend on the platform's default encoding
with open("wmt22test.txt", "w", encoding = "utf8") as f:
    for index, row in testSentences.iterrows():
        queries = row["sourceSentenceNum"].split(",")
        parts = [test.loc[(test["sourceSentenceNum"] == query) & (test["docNum"] == row["docNum"])]["sentence"].values[0] for query in queries]
        f.write(" ".join(parts) + "\n")

In [14]:
goldSentences = ok_alignments[["targetSentenceNum", "docNum"]]
#Write one target segment per line, in the same row order as wmt22test.txt
with open("wmt22gold.txt", "w", encoding = "utf8") as f:
    for index, row in goldSentences.iterrows():
        queries = row["targetSentenceNum"].split(",")
        parts = [gold.loc[(gold["targetSentenceNum"] == query) & (gold["docNum"] == row["docNum"])]["sentence"].values[0] for query in queries]
        f.write(" ".join(parts) + "\n")

So far, so good. A quick inspection reveals some misalignments (e.g., doc22), but because our output matches the provided alignment file, we won't modify it: correcting the alignments might artificially inflate our scores relative to the WMT SOTA. Let's move on to the WMT 2021 and WMT 2020 datasets, beginning with 2021. We repeat exactly the same steps, then read both resulting text files back and write each pair into a third text file, with source and target separated by `\t`.
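
The merging step can be sketched as a small function. The name `merge_parallel_files`, its parameters, and the demonstration files are assumptions for illustration; the length check is an extra safeguard that the cells in this notebook do not include.

```python
import os
import tempfile

def merge_parallel_files(source_path, target_path, output_path, mode="w"):
    """Write 'source<TAB>target' lines to output_path; mode='a' appends instead of overwriting."""
    with open(source_path, encoding="utf8") as src, open(target_path, encoding="utf8") as tgt:
        source_lines = [line.rstrip("\n") for line in src]
        target_lines = [line.rstrip("\n") for line in tgt]
    # Refuse to merge if the two files are not aligned line for line.
    if len(source_lines) != len(target_lines):
        raise ValueError("source and target files have different numbers of lines")
    with open(output_path, mode, encoding="utf8") as out:
        for source, target in zip(source_lines, target_lines):
            out.write(source + "\t" + target + "\n")
    return len(source_lines)

# Small demonstration on temporary files with invented content.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "src.txt")
    tgt_path = os.path.join(tmp, "tgt.txt")
    out_path = os.path.join(tmp, "pairs.txt")
    with open(src_path, "w", encoding="utf8") as f:
        f.write("Hello.\nGoodbye.\n")
    with open(tgt_path, "w", encoding="utf8") as f:
        f.write("Bonjour.\nAu revoir.\n")
    written = merge_parallel_files(src_path, tgt_path, out_path)
    with open(out_path, encoding="utf8") as f:
        merged_lines = f.read().splitlines()
```

Calling the same function with `mode="a"` for a second year would append to an existing file rather than replace it.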

In [15]:
files_2021 = FILE_PATHS[21]
files_2020 = FILE_PATHS[20]

In [16]:
for i in range(len(files_2021)):
    f = open(os.path.join(dir_path, files_2021[i]), encoding = "utf8")
    #print(repr(f.readline())) #All our files are still tab-separated! This allows us to use pandas as follows:
    content[numMap[i]] = pd.read_csv(f, delimiter = "\t", header=None, encoding = "utf8", names = col_names[i])
    f.close()
mappings = dict(content["mapping"].values)
alignments = content["alignment"]
alignments["docNum"] = alignments["ID"].map(mappings)
ok_alignments = alignments.loc[(alignments["alignmentQuality"] == "OK") & 
(alignments["targetSentenceNum"].str.contains("omitted") == False) & 
(alignments["sourceSentenceNum"].str.contains("omitted") == False)]
ok_alignments
ok_alignments = ok_alignments.astype({"sourceSentenceNum": str, "targetSentenceNum": str, "docNum" : str})
gold = content["gold"].astype({"targetSentenceNum": str, "docNum" : str, "sentence" : str})
test = content["test"].astype({"sourceSentenceNum": str, "docNum" : str, "sentence" : str})
testSentences = ok_alignments[["sourceSentenceNum", "docNum"]]
#At this point we had an error - the organisers omitted quite a few documents entirely from both gold and test sets.
#print(testSentences.docNum.unique() == gold.docNum.unique())
#print(gold.docNum.unique() == test.docNum.unique()) #Evidently, both gold and test cohere.
testSentences = testSentences.loc[testSentences["docNum"].isin(gold.docNum.unique())]
f = open("wmt21test.txt", "w", encoding = "utf8")
for index, row in testSentences.iterrows():
    queries = row["sourceSentenceNum"].split(",")
    buffer = ""
    for query in queries:
        if query != queries[-1]:
            buffer += test.loc[(test["sourceSentenceNum"] == query) & (test["docNum"] == row["docNum"])]["sentence"].values[0] + " "
        else:
            buffer += test.loc[(test["sourceSentenceNum"] == query) & (test["docNum"] == row["docNum"])]["sentence"].values[0]
    f.write(buffer + "\n")
f.close()
goldSentences = ok_alignments[["targetSentenceNum", "docNum"]]
goldSentences = goldSentences.loc[goldSentences["docNum"].isin(gold.docNum.unique())]
f = open("wmt21gold.txt", "w", encoding = "utf8")
for index, row in goldSentences.iterrows():
    queries = row["targetSentenceNum"].split(",")
    buffer = ""
    for query in queries:
        if query != queries[-1]:
            buffer += gold.loc[(gold["targetSentenceNum"] == query) & (gold["docNum"] == row["docNum"])]["sentence"].values[0] + " "
        else:
            buffer += gold.loc[(gold["targetSentenceNum"] == query) & (gold["docNum"] == row["docNum"])]["sentence"].values[0]
    f.write(buffer + "\n")
f.close()
with open("wmt21test.txt", encoding = "utf8") as f1, open("wmt21gold.txt", encoding = "utf8") as f2:
    test_list = [line.rstrip('\n') for line in f1]
    gold_list = [line.rstrip('\n') for line in f2]
#The with block closes both files, so no explicit close() calls are needed
with open("validation.txt", "w", encoding = "utf8") as output:
    for i in range(len(test_list)):
        output.write(test_list[i] + "\t" + gold_list[i] + "\n")
#Inspecting the text file, it looks alright.

Having settled 2021, let's append 2020's content to validation.txt. We encountered issues because the source set is missing sentence 2 of doc80 and sentence 6 of doc54 entirely. Because we can't tell which target sentences these should map to, we remove the corresponding rows from the alignment mapping. We also encountered issues because the CSV reader was unable to parse sentence 9 of doc96, owing to double quotation marks in sentence 8, so we remove the row containing source sentence 8 too.
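
The notebook handles the quotation-mark problem by dropping the affected row. A possible alternative, assuming the failure comes from the reader's default quote handling, is to ask pandas to treat quotation marks as ordinary characters via `quoting=csv.QUOTE_NONE`. The two-line sample below is invented and is not the actual content of doc96.

```python
import csv
import io

import pandas as pd

# Invented sample in which a sentence opens a quotation that it never closes.
raw = 'doc96\t8\t"An opening quote without a partner\ndoc96\t9\tThe following sentence.\n'

# QUOTE_NONE makes the parser treat quotation marks as ordinary characters.
frame = pd.read_csv(
    io.StringIO(raw),
    delimiter="\t",
    header=None,
    names=["docNum", "sourceSentenceNum", "sentence"],
    quoting=csv.QUOTE_NONE,
)
```

With this setting each physical line becomes one row and the quotation mark stays in the sentence text. Whether it resolves the issue for the real 2020 files has not been verified here.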

In [17]:
for i in range(len(files_2020)):
    f = open(os.path.join(dir_path, files_2020[i]), encoding = "utf8")
    #print(repr(f.readline())) #All our files are still tab-separated! This allows us to use pandas as follows:
    content[numMap[i]] = pd.read_csv(f, delimiter = "\t", header=None, encoding = "utf8", names = col_names[i])
    f.close()
mappings = dict(content["mapping"].values)
alignments = content["alignment"]
alignments["docNum"] = alignments["ID"].map(mappings)
ok_alignments = alignments.loc[(alignments["alignmentQuality"] == "OK") & 
(alignments["targetSentenceNum"].str.contains("omitted") == False) & 
(alignments["sourceSentenceNum"].str.contains("omitted") == False)]
ok_alignments
ok_alignments = ok_alignments.astype({"sourceSentenceNum": str, "targetSentenceNum": str, "docNum" : str})
gold = content["gold"].astype({"targetSentenceNum": str, "docNum" : str, "sentence" : str})
test = content["test"].astype({"sourceSentenceNum": str, "docNum" : str, "sentence" : str})

#Drop the offending rows: any alignment row in the given document whose source sentence list includes the given number
#Matching on the comma-separated entries avoids accidentally matching e.g. "12" or "22" when looking for "2"
to_drop = (("2", "doc80"), ("6", "doc54"), ("8", "doc96"))
for item in to_drop:
    has_sentence = ok_alignments["sourceSentenceNum"].apply(lambda nums: item[0] in nums.split(","))
    dropped_index = ok_alignments.loc[has_sentence & (ok_alignments.docNum == item[1])].index
    ok_alignments = ok_alignments.drop(dropped_index)
testSentences = ok_alignments[["sourceSentenceNum", "docNum"]]

#At this point we had the same error again - the organisers omitted quite a few documents entirely from both gold and test sets.
#print(gold.docNum.unique() == test.docNum.unique()) #Evidently, both gold and test cohere.
testSentences = testSentences.loc[testSentences["docNum"].isin(gold.docNum.unique())]
f = open("wmt20test.txt", "w", encoding = "utf8")
for index, row in testSentences.iterrows():
    queries = row["sourceSentenceNum"].split(",")
    buffer = ""
    for query in queries:
        try:
            if query != queries[-1]:
                buffer += test.loc[(test["sourceSentenceNum"] == query) & (test["docNum"] == row["docNum"])]["sentence"].values[0] + " "
            else:
                buffer += test.loc[(test["sourceSentenceNum"] == query) & (test["docNum"] == row["docNum"])]["sentence"].values[0]
        except IndexError: #Raised by .values[0] when no matching sentence is found
            print("Issue with source query: " + query + " and document: " + row["docNum"])
    f.write(buffer + "\n")
f.close()
goldSentences = ok_alignments[["targetSentenceNum", "docNum"]]
goldSentences = goldSentences.loc[goldSentences["docNum"].isin(gold.docNum.unique())]
f = open("wmt20gold.txt", "w", encoding = "utf8")
for index, row in goldSentences.iterrows():
    queries = row["targetSentenceNum"].split(",")
    buffer = ""
    for query in queries:
        try:
            if query != queries[-1]:
                buffer += gold.loc[(gold["targetSentenceNum"] == query) & (gold["docNum"] == row["docNum"])]["sentence"].values[0] + " "
            else:
                buffer += gold.loc[(gold["targetSentenceNum"] == query) & (gold["docNum"] == row["docNum"])]["sentence"].values[0]
        except IndexError: #Raised by .values[0] when no matching sentence is found
            print("Issue with target query: " + query + " and document: " + row["docNum"])
    f.write(buffer + "\n")
f.close()
with open("wmt20test.txt", encoding = "utf8") as f1, open("wmt20gold.txt", encoding = "utf8") as f2:
    test_list = [line.rstrip('\n') for line in f1]
    gold_list = [line.rstrip('\n') for line in f2]
#The with block closes both files, so no explicit close() calls are needed
#Open in append mode so that the 2021 pairs already in validation.txt are kept
with open("validation.txt", "a", encoding = "utf8") as output:
    for i in range(len(test_list)):
        output.write(test_list[i] + "\t" + gold_list[i] + "\n")
#Inspecting the text file, it looks alright.