# Generating an English - Hungarian translation test set
First of all **WHY**? Why do I want to have a dataset of translation pairs?

Since I intend to use it to evaluate LLM translation capabilities, it makes sense to have a few thousand translation examples that the LLM is instructed to complete and that its output is evaluated against. Deciding whether two sentences are translations of each other is easy even for models that do not "speak" Hungarian well. Producing those translations, however, is not.

**HOW MUCH** of it do we need?

We need a set that can be evaluated quickly (on the order of a few thousand pairs), drawn from the broadest possible sources. Diversity here really helps.
For the first tests I decided to go with 2000 sentence pairs, split across the sources in the same proportions as in the original corpus:
- 182 from classic literature
- 1508 from modern literature
- 310 from subtitles

For this, we will use a publicly available bilingual corpus, the Hunglish corpus collection by BME. All the texts are already broken down into English <-> Hungarian sentence pairs.

It consists of several sources:
- Classical literature book pairs (202744) 9.1%
- Modern literature book pairs (scrambled text lines) (1670129) 75.4%
- Movie subtitles paired up (343331) 15.5%
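
The per-source quotas for the 2000-pair test set follow from these counts. Below is a minimal sketch of that arithmetic, assuming the shares are rounded to a tenth of a percent as in the list above. The names `corpus_counts` and `quotas` are illustrative and not part of the notebook.

```python
# Sentence-pair counts per source, as listed above.
corpus_counts = {
    "classic": 202744,
    "modern": 1670129,
    "subtitles": 343331,
}
target_count = 2000
total = sum(corpus_counts.values())

quotas = {}
for name, count in corpus_counts.items():
    # Share in per-mille (tenths of a percent), rounded like the percentages above.
    permille = round(count * 1000 / total)
    # Integer arithmetic avoids floating-point surprises when truncating.
    quotas[name] = target_count * permille // 1000

print(quotas)  # {'classic': 182, 'modern': 1508, 'subtitles': 310}
```

Rounding the shares first is what makes the three quotas add up to exactly 2000. Truncating the unrounded proportions would leave the total slightly short.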

## Observations on the source data
Just peeking into the data randomly here and there, I realized that:
- **There are misspellings** all around the data. We can mitigate this by spell checking the data before importing it into our set.
- **There are numerous mispairings** / mistranslations of sentences all around the source dataset. To solve this, we will use Gemma2, the open source LLM with the best HuLU score, just to decide whether two sentences are translations of each other. This will most probably also filter out a lot of good translations that Gemma2 doesn't understand correctly, but it's still better to have a simpler but mostly correct dataset to work from.
- **Meaningless pairs** here and there, like numbers translated only to numbers. There is no language-specific translation involved in them. We will put a minimum word count on the sentences on both sides to mitigate this.
- **Incorrect characters** are used for őŐ and űŰ. They will need to be replaced (õ->ő, û->ű).
- **Duplicates** are there, especially among the short sentences and in the subtitles. We will need to deduplicate the finished dataset.
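
The deduplication step could look roughly like the sketch below. It assumes that two pairs count as duplicates when both sides match after lowercasing and collapsing whitespace. That rule, and the helper name `dedupe_pairs`, are assumptions for illustration rather than something the notebook defines.

```python
def dedupe_pairs(pairs):
    """Keep the first occurrence of each (english, hungarian) pair.

    Pairs are compared case-insensitively with whitespace collapsed,
    which is an assumption about what counts as a duplicate.
    """
    seen = set()
    unique = []
    for english, hungarian in pairs:
        key = (
            " ".join(english.lower().split()),
            " ".join(hungarian.lower().split()),
        )
        if key not in seen:
            seen.add(key)
            unique.append((english, hungarian))
    return unique


sample = [
    ("Thank you very much.", "Nagyon szépen köszönöm."),
    ("thank you  very much.", "Nagyon szépen köszönöm."),
    ("What time is it?", "Mennyi az idő?"),
]
print(len(dedupe_pairs(sample)))  # 2
```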

## Program settings

In [63]:
import json

target_count = 2000
sources = [
    ("classic_utf8.bi", int(target_count * 0.091)),
    ("modern_utf8.bi", int(target_count * 0.754)),
    ("subtitles_utf8.bi", int(target_count * 0.155))
]

class BiLingualData:
    def __init__(self, id = "", english = "", hungarian = ""):
        self.id = id
        self.english = english
        self.hungarian = hungarian

    def to_json(self):
        return json.dumps(self.__dict__)


## Helper method definitions

In [64]:
# Removes excess whitespace chars, replaces incorrect chars, and trims
def TrimSentence(sentence):
    while "  " in sentence:
        sentence = sentence.replace("  ", " ")
    # Fix mis-encoded Hungarian letters, both lower and upper case
    sentence = sentence.replace("õ", "ő").replace("û", "ű").replace("Õ", "Ő").replace("Û", "Ű")
    return sentence.strip()

In [65]:
import subprocess
# Checks for spelling errors in the sentence. Returns true if no errors were found
def SpellCheckHU(sentence):
    # Run the bash command with the provided input string
    result = subprocess.run("hunspell -d hu-HU -l", input=sentence, capture_output=True, text=True, shell=True)
    # Capture the output
    outputlines = result.stdout.splitlines()
    return len(outputlines) == 0

In [66]:
import subprocess

# Checks for spelling errors in the English sentence. Returns True if no errors were found
def SpellCheckEN(sentence):
    # Run the bash command with the provided input string
    result = subprocess.run("hunspell -d en-US -l", input=sentence, capture_output=True, text=True, shell=True)
    # Capture the output
    outputlines = result.stdout.splitlines()
    return len(outputlines) == 0

In [67]:
def WordCount(sentence):
    return len(sentence.split())

In [68]:
import requests
import re

def are_pairs_correct(english, hungarian):
    api_url = "http://localhost:5001/api/v1"
    stop_words = ["###","</s>","<|"]
    headers = {
        "Content-Type": "application/json"
    }

    data = {
        "prompt": f"""<|system|>Determine if the English and Hungarian sentence pair is translations of each other (1), or not (0).<|end|>
<|english|>What time is it?<|end|>
<|hungarian|>Mennyi az idő?<|end|>
<|assistant|>1<|end|>
<|english|>The surface is the darkest among Uranian moons, and appears to have been shaped primarily by impacts.<|end|>
<|hungarian|>A felszíne a legsötétebb az uránuszi holdak közül, és úgy tűnik, leginkább becsapódások alakították.<|end|>
<|assistant|>1<|end|>
<|english|>Umbriel, along with another Uranian satellite, Ariel, was discovered by William Lassell on October 24, 1851.<|end|>
<|hungarian|>Umbrielt a többi uránuszi holddal együtt a Voyager 2 űrszonda vizsgálta, 1986 januárjában.<|end|>
<|assistant|>0<|end|>
<|english|>He created numerous programs to provide relief to the unemployed and farmers while seeking economic recovery with the National Recovery Administration and other programs.<|end|>
<|hungarian|>Számos programot hozott létre a munkanélküliek és gazdálkodók megsegítésére, miközben az Országos Helyreállítási Igazgatósággal és más programokkal kereste a gazdasági fellendülést.<|end|>
<|assistant|>1<|end|>
<|english|>{english}<|end|>
<|hungarian|>{hungarian}<|end|>
<|assistant|>
""",
        "max_tokens": 10,
        "temperature": 0,
        "top_p": 1.0,
        "n": 20,
        "stop": stop_words
    }
    
    response = requests.post(f"{api_url}/completion", headers=headers, json=data)
    result = response.json()["choices"][0]["text"]
    for sw in stop_words:
        result = result.replace(sw, "")
    match = re.search(r'[01]', result)
    return int(match.group()) == 1 if match else False
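
The reply parsing at the end of `are_pairs_correct` can be tried without a running model server. The sketch below pulls that logic into a hypothetical helper, `parse_verdict`, which is not part of the notebook and only mirrors the last few lines of the function above.

```python
import re


def parse_verdict(text, stop_words=("###", "</s>", "<|")):
    """Return True only if the first 0/1 digit in the reply is 1."""
    for sw in stop_words:
        text = text.replace(sw, "")
    match = re.search(r"[01]", text)
    return match is not None and match.group() == "1"


print(parse_verdict("1<|end|>"))       # True
print(parse_verdict(" 0"))             # False
print(parse_verdict("no digit here"))  # False
```

As in the original, a reply without any 0 or 1 is treated as a rejection, so an unexpected model answer can only shrink the dataset and never adds an unverified pair.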

In [69]:
# Small simple test harness

line = "00:03:25.80,00:03:28.48 Ezt a halála napján írta be.	And that was dated the day he died."
(hun, eng) = line.split("\t")
hun = TrimSentence(hun)
print(hun)
eng = TrimSentence(eng)
print(f"spell hu: {SpellCheckHU(hun)}")
print(f"word cnt: {WordCount(hun)}")
print(f"are translations: {are_pairs_correct(eng, hun)}")

00:03:25.80,00:03:28.48 Ezt a halála napján írta be.
spell hu: True
word cnt: 7
are translations: True
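
The output shows that the subtitle timestamp stays in the Hungarian sentence and is counted as a word. If that is not wanted, one option is to strip it before the other checks. This is a minimal sketch, assuming the prefix always has the shape seen in the test line. The helper `strip_timestamp` is hypothetical and not part of the notebook.

```python
import re

# Assumed prefix shape: "HH:MM:SS.cc,HH:MM:SS.cc " as in the test line above.
TIMESTAMP_PREFIX = re.compile(
    r"^\d{2}:\d{2}:\d{2}\.\d{2},\d{2}:\d{2}:\d{2}\.\d{2}\s+"
)


def strip_timestamp(sentence):
    """Remove a leading subtitle time range, if one is present."""
    return TIMESTAMP_PREFIX.sub("", sentence)


hun = strip_timestamp("00:03:25.80,00:03:28.48 Ezt a halála napján írta be.")
print(hun)               # Ezt a halála napján írta be.
print(len(hun.split()))  # 6
```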


## Process input data
We iterate through the files defined in the ```sources``` variable and pick random lines from each one until its target number of accepted lines is reached.

In [70]:
import random

all_data = []
master_count = 0
skipped = 0
for (file, target_count_per_file) in sources:
    with open(file, 'r', encoding='utf-8') as content:
        all_lines = content.read()
    lines = all_lines.splitlines()

    picks = set()  # indices already tried; a set keeps the membership test fast
    filedata = []
    while len(filedata) < target_count_per_file:
        # Pick a line randomly (which wasn't picked before)
        pick = random.randint(0, len(lines)-1)
        while pick in picks:
            pick = random.randint(0, len(lines)-1)
        picks.add(pick)

        # Check if line is OK for us
        line = lines[pick]
        (hun, eng) = line.split("\t")
        hun = TrimSentence(hun)
        eng = TrimSentence(eng)
        if WordCount(hun) > 4 and WordCount(eng) > 4 and SpellCheckHU(hun) and are_pairs_correct(eng, hun):
            master_count += 1
            data = BiLingualData(id=master_count, english=eng, hungarian=hun)
            filedata.append(data)
            all_data.append(data)

            if master_count % 10 == 0:
                print(f"Processing {master_count/target_count*100:5.1f}%", end="\r")
        else:
            skipped += 1

first = True
with open("hunglish-BLEU.json", 'w', encoding='utf-8') as writer:
    writer.write("[")
    for data in all_data:
        if not first:
            writer.write(",")
        first = False
        writer.write(data.to_json())
    writer.write("]")

print("Ready.                    ")
print(f"Skipped lines: {skipped}")

Ready.                    
Skipped lines: 4597
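
The loop above never terminates if a file runs out of unpicked lines before its quota is met, and it raises an error on any line that does not contain exactly one tab. A possible variant is sketched below. It visits the lines in a shuffled order without repeats and stops when either the target is reached or the lines are exhausted. The function `sample_accepted` and its `accept` callback are hypothetical stand-ins for the filtering done in the notebook.

```python
import random


def sample_accepted(lines, target, accept, rng=random):
    """Visit lines in random order without repeats.

    Stops after `target` accepted pairs or when the lines run out.
    Returns the accepted (english, hungarian) pairs and the skip count.
    """
    accepted = []
    skipped = 0
    for index in rng.sample(range(len(lines)), len(lines)):
        parts = lines[index].split("\t")
        # Lines are "hungarian<TAB>english"; anything else is skipped.
        if len(parts) == 2 and accept(parts[1], parts[0]):
            accepted.append((parts[1], parts[0]))
            if len(accepted) == target:
                break
        else:
            skipped += 1
    return accepted, skipped


lines = [
    "Mennyi az idő?\tWhat time is it?",
    "broken line without a tab",
    "Jó reggelt!\tGood morning!",
]
pairs, skipped = sample_accepted(
    lines, target=2, accept=lambda eng, hun: True, rng=random.Random(0)
)
print(len(pairs), skipped)
```

In the notebook, `accept` would wrap the word count, spell check and `are_pairs_correct` tests. Here it accepts everything so the sketch runs without hunspell or a model server.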
