# Evaluation of machine translation (MT) by large language models (LLMs) on Chichewa

---



**Project**: Chichewa NLP Project    |        **Date**: October 20, 2023.

**Background**

With the recent advancements in natural language processing, large language models (LLMs) like ChatGPT have emerged. These LLMs have demonstrated impressive performance in various natural language processing (NLP) tasks, such as machine translation (MT), text generation, sentence classification, and more. As these LLMs achieve state-of-the-art (SOTA) performance for high-resource languages such as English, Chinese, German, and French, attention has shifted towards advancing NLP in low-resource languages. This shift matters because most of the world's more than 7,000 languages are considered low-resource due to their limited online text data.

In this study, we conduct machine translation (MT) evaluations for Large Language Models (LLMs) and commonly used translation platforms, including Google Translate, Bing Microsoft Translator, ChatGPT, and NLLB, specifically focusing on Chichewa. We utilize two datasets for this evaluation.

**Objective:**

* Perform machine translation (MT) evaluation on Chichewa using two distinct benchmark datasets.
* Compare the performance of Google Translate, Bing Microsoft Translator, ChatGPT, and NLLB on Chichewa using both datasets.
* Introduce a new, high-quality machine translation benchmark dataset named `chichewa_nlp`.

**Datasets**

We use two datasets for this task:
* [FLORES-200](https://github.com/facebookresearch/flores/blob/main/flores200/README.md): This dataset comprises translations from 842 distinct web articles, totaling 3,001 sentences. We use the devtest split, which consists of 1,012 parallel sentences: `nya_Latn.devtest` for Chichewa and `eng_Latn.devtest` for English.

* **Chichewa NLP Benchmark Dataset for MT** `chichewa_nlp`: We introduce a novel benchmark dataset for Chichewa machine translation named `chichewa_nlp`. This dataset is human-annotated and consists of 809 parallel sentence pairs translated from English to Chichewa. The data is sourced from various text types, including news articles, WhatsApp posts, online translation datasets, political texts, and public health data.


**LLMs and Translation Platforms**

[Google Translate API](https://www.google.com/aclk?sa=l&ai=DChcSEwiviraSvP-BAxVkOwYAHWuOAesYABAAGgJ3cw&gclid=Cj0KCQjwhL6pBhDjARIsAGx8D5_eDZ-HCtFRBn4hijfoC5zZT0BHk_z0sL0Z-ILsMqsgevJcvh_GoIgaAtgIEALw_wcB&sig=AOD64_3Zp9mMPw30Trz91C1msIVI1JJ1Yg&q&adurl&ved=2ahUKEwi-7K6SvP-BAxWh0wIHHd8QBPAQ0Qx6BAgIEAE): Utilizing Google's neural machine translation technology, the Google Translate API enables instant text translation into more than one hundred languages, similar to the Google Translate service used in web and mobile applications.

[Microsoft (MS) Azure](https://azure.microsoft.com/en-us/free/search/?ef_id=_k_Cj0KCQjwhL6pBhDjARIsAGx8D5_WBwjXDtl54jndhnypDnQ02rUhHIee46ULhIqC0wQOhrZsdt68v2MaArrJEALw_wcB_k_&OCID=AIDcmm4z26duq7_SEM__k_Cj0KCQjwhL6pBhDjARIsAGx8D5_WBwjXDtl54jndhnypDnQ02rUhHIee46ULhIqC0wQOhrZsdt68v2MaArrJEALw_wcB_k_&gclid=Cj0KCQjwhL6pBhDjARIsAGx8D5_WBwjXDtl54jndhnypDnQ02rUhHIee46ULhIqC0wQOhrZsdt68v2MaArrJEALw_wcB): Similar to the Google Translate API, Microsoft (MS) Azure supports text translation in over a hundred languages and can be accessed as Bing Translator for web and mobile applications.

[OpenAI's GPT](https://platform.openai.com/docs/guides/gpt): The generative pre-trained transformer (GPT) powers the popular ChatGPT chatbot, known for generating human-like text. These GPT models are accessible through the OpenAI API. For this project, we use the GPT-3.5 Turbo family (the API cell in the ChatGPT section calls `gpt-3.5-turbo-instruct`).

[NLLB](https://arxiv.org/abs/2207.04672): Introduced in the paper titled "No Language Left Behind: Scaling Human-Centered Machine Translation" by Meta AI in 2022, NLLB is a multilingual model capable of translating between more than 200 languages, including low-resource languages like Chichewa.

**Evaluation**

We use [SacreBLEU](http://aclweb.org/anthology/W18-6319) to compute [BLEU](https://cloud.google.com/translate/automl/docs/evaluate#:~:text=BLEU%20(BiLingual%20Evaluation%20Understudy)%20is,of%20high%20quality%20reference%20translations.) scores. We also report [chrF](https://aclanthology.org/W15-3049/), a character-level metric, given the morphologically rich nature of Chichewa.
The translation outputs are also evaluated by native speakers to compare translation quality.
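
To illustrate why a character-level metric suits a morphologically rich language, here is a simplified sketch of the idea behind chrF: an F-score over character n-grams with β = 2. It is not the SacreBLEU implementation used in the evaluation cells. The names `char_ngrams` and `simple_chrf` are hypothetical, and the English word pair is only a toy example.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of length n, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified sentence-level chrF: F-beta over averaged n-gram precision/recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # the strings are too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


# Two inflected forms share no whole word, yet they share most characters,
# so the character-level score gives partial credit.
print(simple_chrf("translated", "translating"))
```

A word-level metric treats the two forms above as a complete mismatch, which is why BLEU alone can understate quality for languages that pack a lot of morphology into each word.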



In [None]:
# important libraries
!pip install --upgrade google-cloud-translate
!pip install googletrans
!pip install transformers[torch]
!pip install datasets
!pip install evaluate
!pip install sacrebleu
!pip install sentencepiece
!pip install accelerate -U
!pip install requests  # uuid is part of the Python standard library
!pip install "openai<1.0"  # the ChatGPT cell uses the legacy openai.Completion interface

In [None]:
# transformers
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import evaluate
import openai

In [None]:
import warnings
import os
import json
import html
import csv
import pandas as pd
import numpy as np
warnings.filterwarnings('ignore')
from pathlib import Path
import nltk
import requests, uuid, json
from six import text_type
from google.cloud import translate
from nltk.translate.bleu_score import sentence_bleu
from nltk.tokenize import word_tokenize, sent_tokenize
from googletrans import Translator, LANGUAGES
from IPython.display import Image, display
pd.set_option('display.float_format', lambda x: '%.3f' % x)

**Chichewa NLP data**

In [None]:
# ensure the path is correct
data_en_chichewa=pd.read_csv('/content/chichewa_en_nlp.csv',index_col=None)# load csv data
data_ny_chichewa=pd.read_csv('/content/chichewa_ny_nlp.csv', index_col=None)
chichewa_en=data_en_chichewa['0'].tolist() # convert pd.series into list
chichewa_ny=data_ny_chichewa['0'].tolist()
print(f' the number of english sentences in chichewa nlp data set: {len(chichewa_en)}')
print(f' the number of chichewa sentences in chichewa nlp data set: {len(chichewa_ny)}')

 the number of english sentences in chichewa nlp data set: 809
 the number of chichewa sentences in chichewa nlp data set: 809


**Flores data**

In [None]:
# ensure the path is correct
data_en_flores='/content/eng_Latn.devtest'
data_ny_flores='/content/nya_Latn.devtest'
with open(data_en_flores, "r", encoding="utf-8") as f:
  files = f.read()
with open(data_ny_flores, "r", encoding="utf-8") as f:
  files1 = f.read()

# convert text into lists of non-empty, stripped sentences (one sentence per line)
flores_en = [sentence.strip() for sentence in files.split('\n') if sentence.strip()]
flores_ny = [sentence.strip() for sentence in files1.split('\n') if sentence.strip()]

print(f' the number of chichewa sentences in flores data set: {len(flores_ny)}')
print(f' the number of english sentences in flores data set: {len(flores_en)}')

 the number of chichewa sentences in flores data set: 1012
 the number of english sentences in flores data set: 1012


# Google Translation

**Translating with Google translate**

In [None]:
en_ny_translations_google=[]
ny_translations_google=[]

# Set the environment variable to your Google Cloud credentials JSON file
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/content/sincere-pen-401719-37daac6dd7e0.json'

# Initialize the translation client
translate_client = translate.TranslationServiceClient()

# Define your source and target languages
source_language_code = "EN"  # English
target_language_code = "NY"  # Chichewa

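# Translate the sentences. The evaluation and saving cells below use the
# chichewa_nlp data; to run FLORES-200 instead, iterate over flores_en here
# and use flores_ny / flores_en in those cells.
for sentence in chichewa_en: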
    translation = translate_client.translate_text(
        parent=f"projects/sincere-pen-401719/locations/global",
        contents=[sentence],
        source_language_code=source_language_code,
        target_language_code=target_language_code,
    )
    translated_sentence = html.unescape(translation.translations[0].translated_text)
    en_ny_translations_google.append((f'en: {sentence}',f'ny: {translated_sentence}'))
    ny_translations_google.append(translated_sentence)


**Automatic Evaluation**

In [None]:
# evaluation
bleu_metric = evaluate.load("sacrebleu")
chrf_metric = evaluate.load("chrf")

bleu_result = bleu_metric.compute(predictions=ny_translations_google, references=chichewa_ny,tokenize='flores200')
chrf_result = chrf_metric.compute(predictions=ny_translations_google, references=chichewa_ny)

result = {"bleu": bleu_result["score"], "chrf": chrf_result["score"]}
results = {k: round(v, 4) for k, v in result.items()}
print(f'the BLEU score for english to chichewa Google translation: {results["bleu"]}')
print(f'the ChrF score for english to chichewa Google translation: {results["chrf"]}')
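
The same evaluation steps recur for every system, so they could be wrapped in one helper that also checks that predictions and references come from the same dataset. The following is a sketch only: `score_translations` is a hypothetical name, and the metric objects are assumed to be the ones returned by `evaluate.load("sacrebleu")` and `evaluate.load("chrf")`.

```python
def score_translations(predictions, references, bleu_metric, chrf_metric,
                       tokenize="flores200"):
    """Return rounded corpus-level BLEU and chrF for one system."""
    if len(predictions) != len(references):
        raise ValueError(
            f"{len(predictions)} predictions but {len(references)} references; "
            "both lists must come from the same dataset"
        )
    bleu = bleu_metric.compute(
        predictions=predictions, references=references, tokenize=tokenize
    )
    chrf = chrf_metric.compute(predictions=predictions, references=references)
    return {"bleu": round(bleu["score"], 4), "chrf": round(chrf["score"], 4)}


# Intended use in this notebook (metrics loaded as in the cell above):
# scores = score_translations(ny_translations_google, chichewa_ny,
#                             bleu_metric, chrf_metric)
```

The explicit length check surfaces a dataset mix-up (for example 1,012 FLORES predictions scored against 809 `chichewa_nlp` references) with a clear message before any metric is computed.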

**Saving Google translation results**

In [None]:
# Combine the sentences into a list of dictionaries
en_ny_translations_google = [{"en": en, "ny": ny} for en, ny in zip(chichewa_en, ny_translations_google)]

# Specify the filename for the JSON file
filename = "en_ny_translations_google.json"

# Write the data to a JSON file
with open(filename, "w") as json_file:
    json.dump(en_ny_translations_google, json_file, indent=4)

print(f"JSON data has been written to {filename}")

JSON data has been written to en_ny_translations_google.json


# Bing Translation

**Setting the Azure account**


---


location: westus

key: 1bee95c1ea8147c9bd25d6d17b171***

endpoint: https://api.cognitive.microsofttranslator.com/





---



In [None]:
#---------------------
# Limit of 2 million characters per hour
# translate request is limited to 50,000 characters
#-------------------------

# Calculate the total number of characters in all text
total_char_chichewa = sum(len(string) for string in chichewa_en)
total_char_flores= sum(len(string) for string in flores_en)

# Print the total
print("Total number of characters for chichewa nlp data:", total_char_chichewa)
print("Total number of characters for flores data:", total_char_flores)

Total number of characters for chichewa nlp data: 98911
Total number of characters for flores data: 131966


**Create batches**

In [None]:
# Calculate the batch size
batch_size = 300

# Split the list into  batches
batches = [chichewa_en[i:i + batch_size] for i in range(0, len(chichewa_en), batch_size)]
print(len(batches ))

3
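
A fixed batch of 300 sentences stays under the 50,000-character request limit noted above only because these sentences are short. As a sketch of a safer alternative, the batches can be built against a character budget. The function name `batch_by_char_budget` is hypothetical, and the default cap of 1,000 items per request is an assumption that should be checked against the Azure Translator documentation.

```python
def batch_by_char_budget(sentences, max_chars=50_000, max_items=1_000):
    """Group sentences into batches that respect a character and item budget."""
    batches, current, current_chars = [], [], 0
    for sentence in sentences:
        n_chars = len(sentence)
        if n_chars > max_chars:
            raise ValueError("a single sentence exceeds the per-request limit")
        # Start a new batch when adding this sentence would break a limit.
        if current and (current_chars + n_chars > max_chars
                        or len(current) >= max_items):
            batches.append(current)
            current, current_chars = [], 0
        current.append(sentence)
        current_chars += n_chars
    if current:
        batches.append(current)
    return batches


# Toy example with a tiny budget so the split is visible.
print(batch_by_char_budget(["aaaa", "bbbb", "cc"], max_chars=8))
# → [['aaaa', 'bbbb'], ['cc']]
```

Sentence order is preserved, so the translated batches can be concatenated and scored against the references directly.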


**Translation with MS Azure (Bing)**

In [None]:
# en_ny_translations=[]
ny_translations_batch3=[]


body = [{'text': text} for text in batches[2] ]

# Add your key and endpoint
key = "1bee95c1ea8147c9bd25d6d17b171***"
endpoint = "https://api.cognitive.microsofttranslator.com"
location = "westus"

path = '/translate'
constructed_url = endpoint + path

params = {
    'api-version': '3.0',
    'from': 'en',
    'to': ['nya']
}

headers = {
    'Ocp-Apim-Subscription-Key': key,
    'Ocp-Apim-Subscription-Region': location,
    'Content-type': 'application/json',
    'X-ClientTraceId': str(uuid.uuid4())
}
request = requests.post(constructed_url, params=params, headers=headers, json=body)
request.raise_for_status()  # stop early on HTTP errors (invalid key, quota exceeded, ...)
response = request.json()

for i in range(len(batches[2])):
    original_text =batches[2][i]
    translation = response[i]["translations"][0]["text"]
    # en_ny_translations.append((original_text, translation))
    ny_translations_batch3.append( translation )


with open('ny_translations_batch3.csv', 'w', newline='') as csv_file:
    csv_writer = csv.writer(csv_file)

    # Write each element in the list as a separate row in the CSV file
    for item in ny_translations_batch3:
        csv_writer.writerow([item])


**Putting translations together**

In [None]:
file_paths = ['/content/ny_translations_batch1.csv','/content/ny_translations_batch2.csv','/content/ny_translations_batch3.csv']
combined_data = pd.DataFrame()

# Read and concatenate the CSV files into one DataFrame
for file_path in file_paths:
    df = pd.read_csv(file_path,header=None)
    combined_data = pd.concat([combined_data, df], ignore_index=True)

# Save the combined data to a new CSV file
combined_data.to_csv('ny_translations_bing.csv', index=False)

#read csv file
files=pd.read_csv('/content/ny_translations_bing.csv')['0']
ny_translations_bing=files.tolist()

In [None]:
# evaluation
bleu_metric = evaluate.load("sacrebleu")
chrf_metric = evaluate.load("chrf")

bleu_result = bleu_metric.compute(predictions= ny_translations_bing, references=chichewa_ny,tokenize='flores200')
chrf_result = chrf_metric.compute(predictions= ny_translations_bing, references=chichewa_ny)

result = {"bleu": bleu_result["score"], "chrf": chrf_result["score"]}
results = {k: round(v, 4) for k, v in result.items()}
print(f'the BLEU score for english to chichewa Bing translation: {results["bleu"]}')
print(f'the ChrF score for english to chichewa Bing translation: {results["chrf"]}')

**Saving Bing translate results**

In [None]:
# Combine the sentences into a list of dictionaries
# (chichewa_en matches the batches translated above; use flores_en when translating FLORES-200)
en_ny_translations_bing = [{"en": en, "ny": ny} for en, ny in zip(chichewa_en, ny_translations_bing)]

# Specify the filename for the JSON file
filename = "en_ny_translations_bing.json"

# Write the data to a JSON file
with open(filename, "w") as json_file:
    json.dump(en_ny_translations_bing, json_file, indent=4)

print(f"JSON data has been written to {filename}")

JSON data has been written to en_ny_translations_bing.json



# ChatGPT

**1. ChatGPT chatbot**

We followed the prompting approach described in [(Jiao et al., 2023)](https://arxiv.org/abs/2301.08745), using the translation prompt
```
Translate the following list of  sentences from  English to Chichewa (Nyanja).

```

Challenges:


*   Hallucinations
*   Mixed-up sentence ids
*   Missing translations
*   Slow responses
*   Inconsistencies in translations
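
Mixed-up ids and missing translations can at least be detected automatically before scoring. The sketch below assumes the chatbot was asked to return one numbered translation per line (for example `1. ...`). That output format and the function name `parse_numbered_translations` are assumptions, not part of the original setup.

```python
import re


def parse_numbered_translations(response_text, expected_count):
    """Map numbered lines back to positions and report which ids are missing."""
    pattern = re.compile(r"^\s*(\d+)[.)]\s*(.+?)\s*$")
    found = {}
    for line in response_text.splitlines():
        match = pattern.match(line)
        if not match:
            continue  # skip chatter such as "Here are the translations:"
        idx = int(match.group(1))
        # Keep only ids in range, and keep the first occurrence of each id.
        if 1 <= idx <= expected_count and idx not in found:
            found[idx] = match.group(2)
    translations = [found.get(i) for i in range(1, expected_count + 1)]
    missing = [i for i, text in enumerate(translations, start=1) if text is None]
    return translations, missing


reply = "Here are the translations:\n1. first sentence\n3) third sentence"
print(parse_numbered_translations(reply, expected_count=3))
# → (['first sentence', None, 'third sentence'], [2])
```

Sentences listed in `missing` can then be re-submitted individually, which keeps predictions aligned with references.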





**2. ChatGPT API**

In [None]:
import openai

api_key = 'sk-uH3UeJEQZRcPZ33bFOPbT3BlbkFJDTfLPKct59JtDEOl5***'
openai.api_key = api_key

def translate_sentences(input_list, src_lang, tgt_lang):
  """
  Generate translation with chatgpt
  input_list: source sentences
  src_lang: source language
  tgt_lang: target language

  """
  translations = [] # store the list of translations

  for sentence in input_list:
      prompt = f"Translate the following sentence from {src_lang} to {tgt_lang}:\n {sentence}" # translation prompt
      prompt+="\nTranslation:"

      # Call the OpenAI completions endpoint (legacy openai<1.0 interface)
      response = openai.Completion.create(
          engine="gpt-3.5-turbo-instruct", # instruction-tuned GPT-3.5 model
          prompt=prompt,
          max_tokens=256,
          temperature=0.3 # controls output randomness
      )

      translation = response.choices[0].text.strip()
      translations.append(translation)

  return translations

source_language = "English"
target_language = "Chichewa"

ny_translations = translate_sentences(flores_en, source_language, target_language)


In [None]:
len(ny_translations)

1012

**Automatic Evaluation**

In [None]:
# evaluation
bleu_metric = evaluate.load("sacrebleu")
chrf_metric = evaluate.load("chrf")

bleu_result = bleu_metric.compute(predictions=ny_translations, references=flores_ny,tokenize='flores200')
chrf_result = chrf_metric.compute(predictions=ny_translations, references=flores_ny)

result = {"bleu": bleu_result["score"], "chrf": chrf_result["score"]}
results = {k: round(v, 4) for k, v in result.items()}
print(f'the BLEU score for english to chichewa ChatGPT translation: {results["bleu"]}')
print(f'the ChrF score for english to chichewa ChatGPT translation: {results["chrf"]}')



the BLEU score for english to chichewa ChatGPT translation: 5.6283
the ChrF score for english to chichewa ChatGPT translation: 30.4527


**Saving ChatGPT results**

In [None]:
# Combine the sentences into a list of dictionaries
en_ny_translations_ChatGPT = [{"en": en, "ny": ny} for en, ny in zip(flores_en, ny_translations)]

# Specify the filename for the JSON file
filename = "en_ny_translations_ChatGPT_flores.json"

# Write the data to a JSON file
with open(filename, "w") as json_file:
    json.dump(en_ny_translations_ChatGPT, json_file, indent=4, ensure_ascii=False)

print(f"JSON data has been written to {filename}")

JSON data has been written to en_ny_translations_ChatGPT_flores.json


# NLLB

**Model**

In [None]:
# load pretrained model
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

**NLLB**

In [None]:
def batch_translate_direct(input_list: list, src_lang: str, tgt_lang: str):
  """
  Generate translations using NLLB
  input_list: list of source sentences
  src_lang: source language code --> check NLLB list of lang codes
  tgt_lang: target language code

  Returns:
  (list of (source, translation) tuples, list of translations)

  """

  tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang=src_lang) # initialize tokenizer from NLLB
  # convert_tokens_to_ids also works on transformers versions without lang_code_to_id
  tgt_lang_id = tokenizer.convert_tokens_to_ids(tgt_lang)
  en_ny_translations=[]
  ny_translations=[]
  for source in input_list:
    inputs = tokenizer(source, return_tensors="pt") # tokenization (avoid shadowing the built-in input)
    translated_tokens = model.generate(
        **inputs, forced_bos_token_id=tgt_lang_id, max_length=1000) # generate translations
    output = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0] # decode the translated tokens
    en_ny_translations.append((f'en: {source}', f'ny: {output}'))
    ny_translations.append(output)

  return (en_ny_translations, ny_translations)

en_ny_translations, ny_translations = batch_translate_direct(flores_en, src_lang="eng_Latn", tgt_lang="nya_Latn")

**Automatic Evaluation**

In [None]:
len(en_ny_translations)

1012

In [None]:
# evaluation
bleu_metric = evaluate.load("sacrebleu")
chrf_metric = evaluate.load("chrf")

bleu_result = bleu_metric.compute(predictions=ny_translations, references=flores_ny,tokenize='flores200')
chrf_result = chrf_metric.compute(predictions=ny_translations, references=flores_ny)

result = {"bleu": bleu_result["score"], "chrf": chrf_result["score"]}
results = {k: round(v, 4) for k, v in result.items()}
print(f'the BLEU score for english to chichewa NLLB translation: {results["bleu"]}')
print(f'the ChrF score for english to chichewa NLLB translation: {results["chrf"]}')



the BLEU score for english to chichewa NLLB translation: 15.8566
the ChrF score for english to chichewa NLLB translation: 46.7084


**Saving NLLB translation results**

In [None]:
# Combine the sentences into a list of dictionaries
en_ny_translations_NLLB = [{"en": en, "ny": ny} for en, ny in zip(flores_en, ny_translations)]

# Specify the filename for the JSON file
filename = "en_ny_translations_NLLB_flores.json"

# Write the data to a JSON file
with open(filename, "w") as json_file:
    json.dump(en_ny_translations_NLLB, json_file, indent=4, ensure_ascii=False)

print(f"JSON data has been written to {filename}")

JSON data has been written to en_ny_translations_NLLB_flores.json
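
Because each save cell is a copy of the previous one, it is easy to write out the wrong variable or pair translations with the wrong source list. A small helper that takes both lists explicitly is one way to reduce that risk. This is a sketch, and `save_translation_pairs` is a hypothetical name.

```python
import json


def save_translation_pairs(sources, translations, filename):
    """Write aligned source/translation pairs to a UTF-8 JSON file."""
    if len(sources) != len(translations):
        raise ValueError(
            f"{len(sources)} sources but {len(translations)} translations"
        )
    pairs = [{"en": en, "ny": ny} for en, ny in zip(sources, translations)]
    with open(filename, "w", encoding="utf-8") as json_file:
        # ensure_ascii=False keeps non-ASCII characters readable in the file.
        json.dump(pairs, json_file, indent=4, ensure_ascii=False)
    return pairs


# Intended use in this notebook:
# save_translation_pairs(flores_en, ny_translations,
#                        "en_ny_translations_NLLB_flores.json")
```

The length check matters because `zip` silently truncates to the shorter list, so a dataset mismatch would otherwise produce a file of misaligned pairs without any error.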


**Results**

**Summary**

We present the results in **Table 1** below. All translation outputs are also saved in .json format and can be found in the results folder.

| System           | Chichewa NLP BLEU | Chichewa NLP chrF | FLORES-200 BLEU | FLORES-200 chrF |
|------------------|-------------------|-------------------|-----------------|-----------------|
| Google Translate | 17.1838           | 49.8466           | 21.2157         | 52.8351         |
| MS Bing          | 15.0289           | 48.7437           | 19.5783         | 51.3756         |
| ChatGPT (3.5)    | 4.5473            | 30.8839           | 5.6283          | 30.4527         |
| NLLB             | 12.8351           | 45.1251           | 15.8566         | 46.7084         |

Table 1: BLEU and chrF scores of the four translation systems on the Chichewa NLP and FLORES-200 datasets (English to Chichewa).


**Which system is faster?**

1. Google Translate - ~2 minutes for each of the FLORES and Chichewa NLP datasets.
2. MS Bing Translate - ~1 second per batch of 300 sentences.
3. ChatGPT - ~12 minutes for FLORES and ~16 minutes for the Chichewa NLP dataset.
4. NLLB - ~2 hours for FLORES and ~1 hour 30 minutes for the Chichewa NLP dataset.
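
The notebook does not record how these timings were taken. One simple way to measure them consistently is a wall-clock wrapper around each translation call. The helper name `timed` is hypothetical.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Intended use in this notebook:
# ny_translations, seconds = timed(translate_sentences, flores_en,
#                                  "English", "Chichewa")
# print(f"ChatGPT on FLORES took {seconds / 60:.1f} minutes")
```

Wall-clock time for the hosted services includes network latency and rate limiting, while the NLLB figure depends on the local hardware, so the numbers describe this setup rather than the systems themselves.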










---


**Discussion and Conclusion**

We present the results in **Table 1**. On both datasets, Google Translate achieved the highest BLEU scores (17.18 on `chichewa_nlp` and 21.22 on `FLORES-200`), while ChatGPT yielded substantially lower BLEU scores (4.55 and 5.63, respectively). The chrF scores give the same ranking on both datasets: Google Translate, MS Bing, NLLB, then ChatGPT. In terms of translation speed, Google Translate and MS Bing were the fastest, completing translations in about two minutes or less for each dataset. In contrast, ChatGPT and NLLB took up to 16 minutes and 2 hours, respectively.

This study introduced a new, high-quality, human-annotated MT benchmark dataset, `chichewa_nlp`. We also evaluated the MT performance of Google Translate, Bing Microsoft Translator, ChatGPT, and NLLB. Google Translate outperformed the other systems in terms of both BLEU and chrF scores for Chichewa. In future work, we plan to extend the evaluation to additional languages and to compare different model settings.

**End 😊**