
Benchmarking Large Language Models (GPT) for Machine Translation

Overview

In this work, we investigate the translation capabilities of GPT models across 203 diverse languages from the FLORES-200 dataset.

Read more about it in our paper (accepted to WMT 2023).

Also see our Zeno browser, with interactive visualizations of our results.

We have outputs for 5 systems:

  • ChatGPT (0-shot prompts) (GPT-3.5-turbo): 203 target languages
  • ChatGPT (5-shot prompts) (GPT-3.5-turbo): 203 target languages
  • GPT-4 (5-shot prompts): 20 target languages
  • NLLB-MOE: 201 target languages
  • Google Translate: 115 target languages

All model outputs can be found on Zenodo.

Reproducing the work

We used gpt-3.5-turbo-0613 and gpt-4-0613 in July and August 2023. Find instructions below on how to use our codebase.

Outputs and inputs

The outputs and inputs from this work can be found here (will be updated). We will release the outputs in a folder called system_outputs. Each TSV contains three columns:

  • messages : the prompts used to query the model
  • label : the reference translation
  • predictions : the predictions returned by OpenAI
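
To load one of these TSVs, something like the following works; this is a minimal sketch assuming pandas, and the file path under system_outputs is hypothetical (substitute the actual system and language pair).

    # Minimal sketch: load one output TSV and inspect its three columns.
    # The path below is hypothetical; use an actual file from system_outputs.
    import pandas as pd

    df = pd.read_csv("system_outputs/gpt-3.5-turbo/eng_Latn-fra_Latn.tsv", sep="\t")

    for _, row in df.head(3).iterrows():
        print("PROMPT:    ", row["messages"])
        print("REFERENCE: ", row["label"])
        print("PREDICTION:", row["predictions"])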

Querying OpenAI

This section has instructions on how to use our codebase to run the experiments.

  • You will need the Zeno and OpenAI libraries. Install them, along with the other requirements, by running pip install -r requirements.txt
  • config.py contains the configuration for the models: GPT-3.5-turbo and GPT-4.
  • modelling.py: This script contains utilities, the most important being the call to the generate_from_chat_prompt function. You may want to reduce the requests_per_minute parameter value, especially for n-shot prompts and non-Latin scripts, so as not to hit the API's rate limit and get empty responses.
  • flores200_utils.py : Contains data processing utilities.
  • Have a file called langs.txt that contains the languages you want to generate translations for.
  • Your source folders should be named [prompt]/[lang]/
  • Within each language folder, have a file with the prompt (see the sketch after this list).
  • run.sh: This is the bash script that launches main.py
  • Finally, run bash run.sh
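
To make the expected inputs concrete, here is a minimal sketch that writes langs.txt and creates the [prompt]/[lang]/ folder structure. The prompt folder name, language codes, and prompt file name are illustrative assumptions; check main.py for the exact names the code expects.

    # Illustrative setup of the inputs described above. The prompt folder name,
    # language codes, and prompt file name are assumptions, not necessarily the
    # exact names main.py expects.
    from pathlib import Path

    langs = ["fra_Latn", "deu_Latn", "swh_Latn"]  # FLORES-200-style codes (assumed)
    Path("langs.txt").write_text("\n".join(langs) + "\n")

    prompt = "zero_shot"  # hypothetical prompt folder name
    for lang in langs:
        lang_dir = Path(prompt) / lang
        lang_dir.mkdir(parents=True, exist_ok=True)
        # Hypothetical prompt file; the instruction text for the model goes here.
        (lang_dir / "prompt.txt").write_text(f"Translate the following text into {lang}.\n")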

Evaluation

We have a script, eval_runs.py, that handles evaluation for BLEU, chrF, SLR, and TER. Run it with python eval_runs.py --results_dir [folder] --langs_file [a file with line-separated languages to be evaluated] --tokenizer [tokenizer; optional]
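
As a quick sanity check outside of eval_runs.py, the corpus-level metrics can be approximated directly with sacrebleu; this rough sketch is not the script's implementation (tokenizer handling and SLR are done there), and the file path is an assumption.

    # Rough sanity check of BLEU, chrF, and TER on one output TSV using sacrebleu.
    # This is not eval_runs.py; options such as --tokenizer are handled by that script.
    import pandas as pd
    import sacrebleu

    df = pd.read_csv("system_outputs/gpt-3.5-turbo/eng_Latn-fra_Latn.tsv", sep="\t")  # assumed path
    hyps = df["predictions"].fillna("").tolist()
    refs = [df["label"].tolist()]  # sacrebleu expects a list of reference streams

    print(sacrebleu.corpus_bleu(hyps, refs))
    print(sacrebleu.corpus_chrf(hyps, refs))
    print(sacrebleu.corpus_ter(hyps, refs))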

Notebooks

langid_classifier.ipynb - for classifying the language of the predictions
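
The notebook contains the full workflow; as a rough illustration of the idea, an off-the-shelf identifier such as fastText's lid.176 model can be run over the predictions (an assumption about tooling, not necessarily what the notebook uses).

    # Illustrative language-ID pass over predictions with fastText's lid.176 model.
    # Download lid.176.bin from the fastText website first; the notebook itself may
    # use a different classifier.
    import fasttext
    import pandas as pd

    model = fasttext.load_model("lid.176.bin")
    df = pd.read_csv("system_outputs/gpt-3.5-turbo/eng_Latn-fra_Latn.tsv", sep="\t")  # assumed path

    texts = df["predictions"].fillna("").str.replace("\n", " ").tolist()
    labels, _ = model.predict(texts)
    df["pred_lang"] = [l[0].replace("__label__", "") for l in labels]
    print(df["pred_lang"].value_counts())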

zeno_browser.ipynb - This notebook shows how to use the Zeno library to analyze the results from our experiments.

License

MIT
