# Neural Machine Translation for Cantonese-English Language Pair
Florence Yuen
- Load Cantonese-English parallel data from the Tatoeba and OpenSubtitles datasets
- Preprocess character-based and Jyutping-romanized Cantonese data by tokenizing and cleaning the text
- Apply the pre-trained multilingual mBART-50 NMT model
- Compare greedy and beam search decoding strategies
- Save the translation outputs to a CSV file
- Compute BLEU scores to evaluate the two decoding strategies


In [6]:
# Install dependencies
!pip install transformers datasets sacrebleu pandas sentencepiece pypinyin epitran hf_xet protobuf



## Load Pretrained mBART-50 NMT Model

In [None]:
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

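# Load the tokenizer and model, then set the source and target languages
model_name = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)
# Note: "yue_Hant" is not among mBART-50's documented language codes
# (the closest supported code is "zh_CN"), so this source tag may map to an unknown token
tokenizer.src_lang = "yue_Hant"
tokenizer.tgt_lang = "en_XX"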


## Load Dataset from Tatoeba

In [None]:
from datasets import load_dataset
import pandas as pd
import torch

# OpenSubtitles Cantonese-English dataset
# dataset = load_dataset("opensubtitles", lang1="zh_yue", lang2="en")
# dataset = load_dataset("Helsinki-NLP/open_subtitles", lang1="zh_yue", lang2="en", streaming=True)
# dataset = load_dataset("facebook/flores", "yue_Hant-eng")

# small_data = dataset['train'].select(range(100))
# df = pd.DataFrame(small_data['translation'])
# df = df.rename(columns={'zh_yue': 'cantonese', 'en': 'english'})
# df.head()

# Tatoeba Cantonese-English dataset
# dataset = load_dataset("Helsinki-NLP/tatoeba_mt", "yue-eng")
# small_data = dataset['test'].select(range(100))  # Small subset
# df = small_data.to_pandas()[["sourceString", "targetString"]]
# df.columns = ["cantonese", "english"]
# df.dropna(inplace=True)
# df.head()

RuntimeError: Dataset scripts are no longer supported, but found flores.py

In [None]:
import re

# Manually load downloaded .en and .yue files, apply pre-processing to clean the text
def load_parallel_corpus(cantonese_file, english_file, max_lines=None):
    # Open the cantonese and english files
    with open(cantonese_file, encoding='utf-8') as f_yue, open(english_file, encoding='utf-8') as f_en:
        yue_lines = f_yue.readlines()
        en_lines = f_en.readlines()

    # Optionally truncate both files to the first max_lines lines
    if max_lines:
        yue_lines = yue_lines[:max_lines]
        en_lines = en_lines[:max_lines]

    # Ensure both files have the same number of lines
    assert len(yue_lines) == len(en_lines), "Line count mismatch!"

    # Clean a single line of text
    def clean_text(text):
        # Remove text enclosed in square or round brackets
        text = re.sub(r'\[[^\]]*\]', '', text)
        text = re.sub(r'\([^\)]*\)', '', text)
        
        # Normalize whitespace
        text = re.sub(r'\s+', ' ', text)
        return text.strip()

    # Clean each line of text for both cantonese and english files
    yue_lines = [clean_text(line) for line in yue_lines]
    en_lines = [clean_text(line) for line in en_lines]

    # Build the DataFrame and drop empty lines
    df = pd.DataFrame({'cantonese': yue_lines, 'english': en_lines})
    df = df[(df['cantonese'] != '') & (df['english'] != '')].reset_index(drop=True)
    return df

# Load Tatoeba dataset
df = load_parallel_corpus('en-yue.txt/Tatoeba.en-yue.yue', 'en-yue.txt/Tatoeba.en-yue.en', max_lines=1000)
print(df.head())

             cantonese                                        english
0              我要去瞓覺喇。                         I have to go to sleep.
1  我話唔定做一陣就會放棄，走去瞓晏覺算。       I may give up soon and just nap instead.
2       我不嬲都鍾意啲神秘啲嘅人物。     I always liked mysterious characters more.
3   雖然佢講咗對唔住，但係我都仲係好嬲。  Even though he apologized, I'm still furious.
4               我唯有係等。                               I can only wait.
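
As a quick check of the cleaning rules in `clean_text`, the sketch below applies the same three regular expressions to a made-up subtitle-style line. The sample string is hypothetical and is not taken from the Tatoeba files.

```python
import re

def clean_text(text):
    # Same rules as in load_parallel_corpus: drop [...] and (...) spans,
    # then collapse runs of whitespace and trim both ends
    text = re.sub(r'\[[^\]]*\]', '', text)
    text = re.sub(r'\([^\)]*\)', '', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

# Hypothetical input line, for illustration only
sample = "[music]  I can only   wait. (sighs)\n"
cleaned = clean_text(sample)
print(cleaned)

# Full-width brackets are not matched by the ASCII patterns above
fullwidth = clean_text("（註）你好")
print(fullwidth)
```

Note that these patterns only match ASCII brackets, so full-width brackets such as `（ ）`, which are common in Chinese text, pass through unchanged. Whether that matters depends on the corpus.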


## Define Translation Function

In [None]:
import torch
from tqdm import tqdm  # for the progress bar

# Run on the GPU when available and move the model to the same device as the inputs
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Translation function; tqdm adds a progress bar over the batches
def translate(texts, beam=1, batch_size=16):
    translations = []
    # Put the model in evaluation mode (disables dropout)
    model.eval()

    # Divide the input into batches so that progress can be shown per batch
    for i in tqdm(range(0, len(texts), batch_size), desc="Translating"):
        batch = texts[i:i+batch_size]
        inputs = tokenizer(batch, padding=True, truncation=True, max_length=128, return_tensors="pt").to(device)
        with torch.no_grad():
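            generated_tokens = model.generate(
                **inputs,
                max_length=128,
                num_beams=beam,
                no_repeat_ngram_size=2,
                # Force English as the first generated token; tokenizer.tgt_lang alone
                # does not constrain generation for the many-to-many checkpoint
                forced_bos_token_id=tokenizer.convert_tokens_to_ids("en_XX")
            )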
        batch_translations = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
        translations.extend(batch_translations)
    return translations
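
The `beam` argument above is passed to `num_beams`: a value of 1 gives greedy decoding, while larger values keep several partial hypotheses alive at each step. The toy sketch below illustrates the difference on a hand-written probability table. `TOY_MODEL` and `beam_search` are hypothetical names invented for this illustration, and the procedure is a simplified version of beam search (fixed length, no end-of-sequence handling or length normalization), not the implementation used by `model.generate`.

```python
import math

# Hypothetical toy "model": maps a prefix (tuple of tokens) to next-token probabilities
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"x": 0.9, "y": 0.1},
}

def beam_search(model, num_beams, steps):
    # Each beam is a (prefix, cumulative log-probability) pair
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for token, prob in model[prefix].items():
                candidates.append((prefix + (token,), score + math.log(prob)))
        # Keep only the num_beams highest-scoring extensions
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0]

# num_beams=1 is greedy decoding: it commits to "a" because 0.6 > 0.4
greedy_seq, greedy_score = beam_search(TOY_MODEL, num_beams=1, steps=2)
# num_beams=2 also keeps "b" alive and finds the more probable full sequence
beam_seq, beam_score = beam_search(TOY_MODEL, num_beams=2, steps=2)

print(greedy_seq, round(math.exp(greedy_score), 2))
print(beam_seq, round(math.exp(beam_score), 2))
```

In this table greedy decoding ends with a sequence of probability 0.6 × 0.55 = 0.33, whereas the two-beam search recovers the sequence with probability 0.4 × 0.9 = 0.36. A wider beam costs more computation per step, which is the trade-off being compared in this notebook.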


## Translate & Compare Decoding Methods

In [None]:
# Run translations with greedy decoding (beam=1) and beam search (beam=5)
df['greedy'] = translate(df['cantonese'].tolist(), beam=1, batch_size=16)
df['beam'] = translate(df['cantonese'].tolist(), beam=5, batch_size=16)
df.head()


|   | cantonese | english | greedy | beam |
|---|---|---|---|---|
| 0 | 我要去瞓覺喇。 | I have to go to sleep. | I'm going to go to the temple. | I'm going to go to the monastery. |
| 1 | 我話唔定做一陣就會放棄，走去瞓晏覺算。 | I may give up soon and just nap instead. | I'm going to say I will give up a fight and go... | I said I was going to do a series I would give... |
| 2 | 我不嬲都鍾意啲神秘啲嘅人物。 | I always liked mysterious characters more. | I'm not a big fan of mysterious characters. | I don't think I've ever heard of a mysterious ... |
| 3 | 雖然佢講咗對唔住，但係我都仲係好嬲。 | Even though he apologized, I'm still furious. | I'm not sure if I can do it, but I have a good... | I'm not sure if we're going to live together, ... |
| 4 | 我唯有係等。 | I can only wait. | I'm only a single one. | I'm the only one who can wait. |


## Evaluate Decoding Strategies Using BLEU Scores

In [None]:
from sacrebleu import corpus_bleu

# Define hypotheses for greedy and beam search (as string lists)
greedy_hypotheses = df['greedy'].astype(str).tolist()
beam_hypotheses = df['beam'].astype(str).tolist()

# sacreBLEU expects a list of reference streams, so wrap the single reference list in an outer list
references = [df['english'].astype(str).tolist()]

# Calculate the corpus-level BLEU scores
greedy_bleu = corpus_bleu(greedy_hypotheses, references).score
beam_bleu = corpus_bleu(beam_hypotheses, references).score

# Display the BLEU scores
print(f"Greedy BLEU: {greedy_bleu:.2f}")
print(f"Beam BLEU: {beam_bleu:.2f}")


Greedy BLEU: 7.94
Beam BLEU: 9.21
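
For intuition about what these scores measure, BLEU combines clipped n-gram precisions with a brevity penalty. The sketch below computes those two ingredients for one sentence pair from the translation table (row 4, beam output), assuming plain whitespace tokenization. `sacrebleu` applies its own tokenization and corpus-level aggregation, so these numbers are illustrative only and will not reproduce the scores above. The helper names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Count every contiguous n-gram in the token list
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(hypothesis, reference, n):
    # Clip each hypothesis n-gram count by its count in the reference
    hyp_counts = ngrams(hypothesis, n)
    ref_counts = ngrams(reference, n)
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return clipped / total if total else 0.0

def brevity_penalty(hyp_len, ref_len):
    # Penalize hypotheses that are shorter than the reference
    if hyp_len == 0:
        return 0.0
    return 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)

# Whitespace tokenization is an assumption made for this illustration
reference = "I can only wait.".split()
hypothesis = "I'm the only one who can wait.".split()

p1 = modified_precision(hypothesis, reference, 1)  # "only", "can", "wait." match
p2 = modified_precision(hypothesis, reference, 2)  # no bigram is shared
bp = brevity_penalty(len(hypothesis), len(reference))

print(f"unigram precision: {p1:.3f}")
print(f"bigram precision:  {p2:.3f}")
print(f"brevity penalty:   {bp:.3f}")
```

Three of the seven hypothesis words appear in the reference, but no two-word sequence does, which is why a translation can share vocabulary with the reference and still contribute little to BLEU.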


## Save Translation Comparison Results to CSV

In [40]:
df.to_csv("translation_comparison_results.csv", index=False)
df.head()

|   | cantonese | english | greedy | beam |
|---|---|---|---|---|
| 0 | 我要去瞓覺喇。 | I have to go to sleep. | I'm going to go to the temple. | I'm going to go to the monastery. |
| 1 | 我話唔定做一陣就會放棄，走去瞓晏覺算。 | I may give up soon and just nap instead. | I'm going to say I will give up a fight and go... | I said I was going to do a series I would give... |
| 2 | 我不嬲都鍾意啲神秘啲嘅人物。 | I always liked mysterious characters more. | I'm not a big fan of mysterious characters. | I don't think I've ever heard of a mysterious ... |
| 3 | 雖然佢講咗對唔住，但係我都仲係好嬲。 | Even though he apologized, I'm still furious. | I'm not sure if I can do it, but I have a good... | I'm not sure if we're going to live together, ... |
| 4 | 我唯有係等。 | I can only wait. | I'm only a single one. | I'm the only one who can wait. |
