# Machine translation basics

| Authors | Last update |
|:------ |:----------- |
| Hauke Licht (https://github.com/haukelicht) | 2023-12-05 |

<br>

<a target="_blank" href="https://colab.research.google.com/github/fabiennelind/Going-Cross-Lingual_Course/blob/main/code/translation_basics.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

If you run notebooks on **Colab**, you can **enable GPU** computing by

1. clicking on "Runtime" in the menu,
2. selecting "Change runtime type", and
3. choosing "GPU" in the "Hardware accelerator" section of the pop-up.

Running on a GPU substantially speeds up computations with deep neural networks.

## Background


**_Translation_** means representing an input text written in a **source** language in a **target** language. The source and target languages can be any pair of languages. For example, the source language can be English and the target language can be French. In this case, the translation task is to represent an input text written in English in French.

**_Machine translation_** (MT) is a subfield of artificial intelligence that focuses on developing systems capable of automatically translating text or speech from one language to another.
The goal of MT is to bridge language barriers and facilitate communication between individuals who speak different languages.
In computational text analysis in the political and social sciences, machine translation makes large amounts of text accessible across linguistic boundaries.

**How Machine Translation Works**

At its core, machine translation relies on advanced algorithms and models to analyze and understand the structure and meaning of a given text in one language, and then generate an equivalent text in another language. 
Until the mid-2010s, the dominant approach was [statistical machine translation](https://en.wikipedia.org/wiki/Statistical_machine_translation).
With the rise of deep learning, however, neural machine translation (NMT) has largely replaced it.

NMT models use deep learning techniques to capture complex patterns and dependencies in language, allowing them to produce more contextually accurate translations.
These models are trained on large datasets containing parallel text in multiple languages, learning to map input sequences to output sequences in a way that preserves semantic meaning.
Beginning with the introduction of the [encoder-decoder architecture](https://arxiv.org/abs/1409.0473) in 2014, NMT has evolved into more sophisticated architectures.
Today, models mostly rely on [transformer architectures](https://arxiv.org/abs/1706.03762).


## Setup

In [None]:
try:
    import google.colab
    COLAB = True
except ImportError:
    COLAB = False
print('on colab:', COLAB)

In [None]:
%%capture
# need to install libraries if on Colab (note: `%%capture` must be the first line of the cell)
if COLAB:
    !pip install iso639==0.1.4 easynmt==2.0.0 deepl==1.16.1 google-cloud-translate==3.12.1

We load the commonly used libraries here and import the rest as needed.

In [1]:
import os # for data import/export
import pandas as pd # for data frames
import iso639 # for standardized language codes

## "free" translation with `easyNMT`

`easyNMT` is a Python package that provides an interface to NMT models that are publicly available through the [Hugging Face model hub](https://huggingface.co/models).

To use these pre-trained models, we import the `EasyNMT` class from the package (see below) and specify the name of the model we want to use (see [here](https://github.com/UKPLab/EasyNMT#available-models) for an overview of available models).

### Setup

In [2]:
import torch
import easynmt
print(easynmt.__version__)

2.0.0


Let's determine which device you can use:

- with a GPU &rarr; "cuda"
- with a macOS M1/M2 chip &rarr; "mps"
- otherwise &rarr; "cpu"

We do so like this:

In [3]:
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu' # note: the macOS 'mps' backend is not supported by easyNMT
print(device) # prints 'cuda' if a GPU is available, otherwise 'cpu'

cpu


### with Facebook Research's M2M model

In [4]:
# download the model
model = easynmt.EasyNMT('m2m_100_418M', device=device)
# note: there is also the m2m_100_1.2B model, but it is too large for the T4 GPU you get on Google Colab with a free account

#### Simple example

In [5]:
# one text
model.translate('Hello text-as-data friends!', target_lang='de')
# note: to disable progress bar, set argument `show_progress_bar = False`

'Hallo Text-as-Data Freunde'

In [6]:
# multiple texts
model.translate(['Hello text-as-data friends!', 'Let\'s translate!'], target_lang='de', show_progress_bar=False)

['Hallo Text-as-Data Freunde', 'Lass uns übersetzen!']
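
Text corpora often contain exact duplicates (e.g., recurring boilerplate sentences), and translating them repeatedly wastes compute. The following is a minimal sketch of translating each distinct text only once and mapping the results back to the original order. The helper `translate_unique` and its argument `translate_fn` are hypothetical names introduced here for illustration; `translate_fn` stands in for a call such as `lambda x: model.translate(x, target_lang='de', show_progress_bar=False)`.

```python
from typing import Callable, List

def translate_unique(texts: List[str], translate_fn: Callable[[List[str]], List[str]]) -> List[str]:
    """Translate each distinct text once and map the results back to the input order."""
    # dict.fromkeys() drops duplicates while preserving the first-seen order
    unique = list(dict.fromkeys(texts))
    translated = translate_fn(unique)
    lookup = dict(zip(unique, translated))
    return [lookup[t] for t in texts]

# toy stand-in for a real translation call (it just upper-cases the input)
fake_translate = lambda batch: [t.upper() for t in batch]
print(translate_unique(['a', 'b', 'a'], fake_translate))  # ['A', 'B', 'A']
```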

#### Advanced usage

##### Translation stream

In [7]:
# define function to print time stamp
import time
ts = lambda: time.strftime("%H:%M:%S", time.localtime())

for t in model.translate_stream(["Hello text-as-data friends!", "Let's translate!"], target_lang = "es", show_progress_bar = False):
  print(ts(), t)
  time.sleep(2) # pause 2 seconds (to verify that translations are yielded iteratively)

17:13:57 ¡Buenos amigos text-as-data!
17:13:59 ¡Vamos a traducir!
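
To illustrate the idea behind streaming, here is a simplified sketch of a generator that translates texts chunk by chunk and yields each translation as soon as its chunk is done. This is not how `translate_stream()` is implemented internally; `stream_translations`, `translate_fn`, and `chunk_size` are hypothetical names used only for this illustration.

```python
from typing import Callable, Iterable, Iterator, List

def stream_translations(
    texts: Iterable[str],
    translate_fn: Callable[[List[str]], List[str]],
    chunk_size: int = 2,
) -> Iterator[str]:
    """Lazily translate `texts` in chunks and yield the translations one by one."""
    chunk: List[str] = []
    for text in texts:
        chunk.append(text)
        if len(chunk) == chunk_size:
            # translate the full chunk and hand over its results immediately
            yield from translate_fn(chunk)
            chunk = []
    if chunk:
        # translate whatever is left over at the end
        yield from translate_fn(chunk)

# toy stand-in for a real translation call (it just upper-cases the input)
for t in stream_translations(['a', 'b', 'c'], lambda batch: [x.upper() for x in batch]):
    print(t)
```

Because the function is a generator, nothing is translated until you start iterating, which is useful when the input is too large to process in one go.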


##### Get translation directions

Assuming you have decided on a target language (here, English), the code below shows how to check which source languages your NMT model supports.

In [11]:
# from ??? to English
langs = model.get_languages(target_lang = "en")
langs[:4] # print first 4

['af', 'am', 'ar', 'ast']

In [12]:

# look up language names
langs = {l: iso639.to_name(l) if iso639.is_valid639_1(l) or iso639.is_valid639_2(l) else None for l in langs}

print(f"can translate {len(langs)} to English ('en')")

# show first 10
dict(list(langs.items())[:10])

can translate 99 to English ('en')


{'af': 'Afrikaans',
 'am': 'Amharic',
 'ar': 'Arabic',
 'ast': 'Asturian; Bable; Leonese; Asturleonese',
 'az': 'Azerbaijani',
 'ba': 'Bashkir',
 'be': 'Belarusian',
 'bg': 'Bulgarian',
 'bn': 'Bengali; Bangla',
 'br': 'Breton'}

The `get_languages()` method can also be used the other way around, i.e., to find the target languages into which a given source language can be translated:

In [13]:
# from Zulu to ??? (only showing first 10)
{l: iso639.to_name(l) for l in model.get_languages(source_lang = "zu")[:10]}

{'af': 'Afrikaans',
 'am': 'Amharic',
 'ar': 'Arabic',
 'ast': 'Asturian; Bable; Leonese; Asturleonese',
 'az': 'Azerbaijani',
 'ba': 'Bashkir',
 'be': 'Belarusian',
 'bg': 'Bulgarian',
 'bn': 'Bengali; Bangla',
 'br': 'Breton'}

## with *Google Translate*

### Setup

To use Google Cloud Translation, we need to create a Google Cloud Project, enable the Translation API and generate credentials for API access.
Here is how you should proceed:

1. Create a Google Cloud Project:
    1. Go to the [Google Cloud Console](https://console.cloud.google.com/).
    2. Click on the project drop-down at the top of the page and select *or* create a project.
2. Enable the Translation API for the project:
    1. In the Google Cloud Console, navigate to the "APIs & Services" > "Library" page.
    2. Search for "Cloud Translation API" and enable it for your project.
    3. If prompted to set up billing, do so by enabling billing or creating a billing account.
3. Create access credentials:
    1. In the Google Cloud Console, navigate to the "APIs & Services" > "Credentials" page.
    2. In the "Service Accounts" section, click "Manage service accounts".
    3. At the top of the "Service accounts" page, click "CREATE SERVICE ACCOUNT".
    4. Give the service account a name and ID (e.g., 'translate') and click "CREATE AND CONTINUE".
    5. Give the service account the "Editor" role and click "CREATE AND CONTINUE" and then "CREATE".
    6. Back at the "Service accounts" page, click the "Actions" button next to the new service account and select "Manage keys".
    7. Click "ADD KEY" > "Create new key".
    8. Select the JSON key type and click "CREATE".
    9. A JSON file with your credentials should be downloaded to your computer (remember where you saved it!).
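
Before using the downloaded file, it can help to verify that it really is a service account key. The sketch below checks for a few fields that such key files contain. The function name `check_service_account_key` and the choice of which fields to check are assumptions made for this illustration, not part of any Google library.

```python
import json

# a subset of the fields found in a service account key file
REQUIRED_KEYS = {'type', 'project_id', 'private_key', 'client_email'}

def check_service_account_key(path: str) -> str:
    """Return the project ID if the file looks like a service account key, else raise ValueError."""
    with open(path, encoding='utf-8') as f:
        info = json.load(f)
    missing = REQUIRED_KEYS - info.keys()
    if missing:
        raise ValueError(f'key file is missing fields: {sorted(missing)}')
    if info['type'] != 'service_account':
        raise ValueError(f"unexpected credential type: {info['type']!r}")
    return info['project_id']

# usage (the path is a placeholder for wherever you saved your key file):
# print(check_service_account_key('path/to/your-key.json'))
```

Keep in mind that the key file grants access to your project, so it should not be committed to a public repository.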

In [26]:
# next, authenticate with Google (the environment variable SPATH points to the folder that holds the JSON key file)
from google.oauth2 import service_account
fp = os.path.join(os.environ['SPATH'], 'multilingual-gesis-translate.json')
credentials = service_account.Credentials.from_service_account_file(fp)

In [27]:
# use the credentials to create a translation client
from google.cloud import translate_v2 as google_translate

client = google_translate.Client(credentials=credentials)

### Simple example

In [28]:
# just one text string
result = client.translate('Hallo Welt!', target_language='en')
result

{'translatedText': 'Hello World!',
 'detectedSourceLanguage': 'de',
 'input': 'Hallo Welt!'}

In [29]:
# a list of texts 
texts = ['Hello, world!', 'This is a test.']
result = client.translate(texts, target_language='de')
result

[{'translatedText': 'Hallo Welt!',
  'detectedSourceLanguage': 'en',
  'input': 'Hello, world!'},
 {'translatedText': 'Das ist ein Test.',
  'detectedSourceLanguage': 'en',
  'input': 'This is a test.'}]
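
In practice, you usually only need the translated strings. Note that this API treats inputs as HTML by default, so special characters such as apostrophes may come back as HTML entities (e.g., `&#39;`). The sketch below extracts the translations from a list of result dictionaries and unescapes such entities. The helper name `extract_translations` and the example dictionary are made up for this illustration; alternatively, passing `format_='text'` to `client.translate()` should avoid the escaping in the first place.

```python
import html

def extract_translations(results):
    """Pull the translated strings out of a list of result dicts and unescape HTML entities."""
    return [html.unescape(r['translatedText']) for r in results]

# made-up result in the same shape as the output above (not real API output)
example = [
    {'translatedText': 'How&#39;s it going?', 'detectedSourceLanguage': 'de', 'input': "Wie geht's?"}
]
print(extract_translations(example))  # ["How's it going?"]
```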

In [30]:
# if your target language is the same for all texts, you can also just set the `target_language` attribute
print('before:', client.target_language)
client.target_language = 'en'
client.translate("Wie geht's dir?")

before: en


{'translatedText': 'How are you doing?',
 'detectedSourceLanguage': 'de',
 'input': "Wie geht's dir?"}

### Advanced usage

#### Specify the source and target languages

In [31]:
# if the source and target are the same for all texts, just set the `source_language` and `target_language` arguments
client.translate(texts, source_language='en', target_language='de')

[{'translatedText': 'Hallo Welt!', 'input': 'Hello, world!'},
 {'translatedText': 'Das ist ein Test.', 'input': 'This is a test.'}]

#### Translate texts from different source languages

In [32]:
import pandas as pd

df = pd.DataFrame(
    [
        {'text': '¿Cómo estás?', 'lang': 'es'},
        {'text': 'Wie geht es dir?', 'lang': 'de'},
        {'text': 'How are you?', 'lang': 'en'},
    ]
)
print(df)

# doesn't work!
try:
    client.translate(df['text'].tolist(), source_language=df['lang'].tolist(), target_language='fr')
except Exception as e:
    print('ERROR:', str(e))

               text lang
0      ¿Cómo estás?   es
1  Wie geht es dir?   de
2      How are you?   en
ERROR: 400 POST https://translation.googleapis.com/language/translate/v2?prettyPrint=false: Invalid JSON payload received. Unknown name "source": Proto field is not repeating, cannot start list. [{'@type': 'type.googleapis.com/google.rpc.BadRequest', 'fieldViolations': [{'description': 'Invalid JSON payload received. Unknown name "source": Proto field is not repeating, cannot start list.'}]}]


In [33]:
# OPTION 1: just let Google guess the source language
client.translate(df['text'].tolist(), target_language='fr')

[{'translatedText': 'Comment ça va?',
  'detectedSourceLanguage': 'es',
  'input': '¿Cómo estás?'},
 {'translatedText': 'Comment allez-vous?',
  'detectedSourceLanguage': 'de',
  'input': 'Wie geht es dir?'},
 {'translatedText': 'Comment vas-tu?',
  'detectedSourceLanguage': 'en',
  'input': 'How are you?'}]

In [34]:
# OPTION 2 (if you want to set the source language explicitly): split by language and 
#     translate each language-specific subset separately

# 1) initialize a new column for the translations with empty strings
df['translation'] = ['']*len(df)
# 2) iterate over language-specific subsets
for l, d in df.groupby('lang'):
    print(f'translating {len(d)} text(s) from "{l}"')
    # a) translate the texts in the current subset
    tmp = client.translate(d['text'].tolist(), source_language=l, target_language='fr')
    # b) assign the translations to the relevant rows in the original dataframe
    df.loc[d.index, 'translation'] = [t['translatedText'] for t in tmp]
df

translating 1 text(s) from "de"
translating 1 text(s) from "en"
translating 1 text(s) from "es"


               text lang          translation
0      ¿Cómo estás?   es       Comment ça va?
1  Wie geht es dir?   de  Comment allez-vous?
2      How are you?   en      Comment vas-tu?


#### Get the list of supported languages

In [35]:
langs = {l['language']: l['name']  for l in client.get_languages()}
langs

{'af': 'Afrikaans',
 'sq': 'Albanian',
 'am': 'Amharic',
 'ar': 'Arabic',
 'hy': 'Armenian',
 'as': 'Assamese',
 'ay': 'Aymara',
 'az': 'Azerbaijani',
 'bm': 'Bambara',
 'eu': 'Basque',
 'be': 'Belarusian',
 'bn': 'Bengali',
 'bho': 'Bhojpuri',
 'bs': 'Bosnian',
 'bg': 'Bulgarian',
 'ca': 'Catalan',
 'ceb': 'Cebuano',
 'ny': 'Chichewa',
 'zh': 'Chinese (Simplified)',
 'zh-TW': 'Chinese (Traditional)',
 'co': 'Corsican',
 'hr': 'Croatian',
 'cs': 'Czech',
 'da': 'Danish',
 'dv': 'Divehi',
 'doi': 'Dogri',
 'nl': 'Dutch',
 'en': 'English',
 'eo': 'Esperanto',
 'et': 'Estonian',
 'ee': 'Ewe',
 'tl': 'Filipino',
 'fi': 'Finnish',
 'fr': 'French',
 'fy': 'Frisian',
 'gl': 'Galician',
 'lg': 'Ganda',
 'ka': 'Georgian',
 'de': 'German',
 'el': 'Greek',
 'gn': 'Guarani',
 'gu': 'Gujarati',
 'ht': 'Haitian Creole',
 'ha': 'Hausa',
 'haw': 'Hawaiian',
 'iw': 'Hebrew',
 'hi': 'Hindi',
 'hmn': 'Hmong',
 'hu': 'Hungarian',
 'is': 'Icelandic',
 'ig': 'Igbo',
 'ilo': 'Iloko',
 'id': 'Indonesian',
 ... (output truncated)
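
Some of the codes in this list differ from current ISO 639-1 codes. For example, Hebrew is listed under the legacy code `'iw'` (the current ISO 639-1 code is `'he'`), and Traditional Chinese carries a region suffix (`'zh-TW'`). If you want to match these codes against language codes from other sources, a small normalization step can help. The helper `normalize_code` and the mapping `LEGACY_CODES` below are hypothetical names for this sketch, and the mapping only covers the one legacy code visible in the output above.

```python
# legacy code -> current ISO 639-1 code (only the case seen in the output above)
LEGACY_CODES = {'iw': 'he'}

def normalize_code(code: str) -> str:
    """Strip region suffixes (e.g., 'zh-TW' -> 'zh') and replace known legacy codes."""
    base = code.split('-')[0].lower()
    return LEGACY_CODES.get(base, base)

for c in ['iw', 'zh-TW', 'de', 'bho']:
    print(c, '->', normalize_code(c))
```

Three-letter codes such as `'bho'` pass through unchanged because they have no two-letter equivalent in this mapping.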

## with DeepL

In [36]:
import deepl # see https://github.com/DeepLcom/deepl-python
print('Using `deepl` version', deepl.__version__)

Using `deepl` version 1.16.1


### Setup

In [37]:
# read your API key
with open(os.path.join(os.environ['SPATH'], 'deepl')) as f:
    api_key = f.read().strip()

In [38]:
# initialize a `Translator` instance
translator = deepl.Translator(api_key)

### Simple example

In [39]:
# translate example texts
texts = ['Hello, world!', 'This is a test.']
result = translator.translate_text(texts, target_lang='de')

In [40]:
# inspect the `result` object
result
# note: a list of 'deepl.api_data.TextResult' objects

[<deepl.api_data.TextResult at 0x2f4edba30>,
 <deepl.api_data.TextResult at 0x2f4edba00>]

In [41]:
# inspect the 'deepl.api_data.TextResult' object in `result`
r = result[0]
# 'deepl.api_data.TextResult' objects have only two attributes: 'text' and 'detected_source_lang'
[attr for attr in dir(r) if not callable(getattr(r, attr)) and not attr.startswith("__")]

['detected_source_lang', 'text']

In [42]:
print(r.detected_source_lang)
print(r.text)

EN
Hallo, Welt!


In [43]:
# just get text of each result
[r.text for r in result]

['Hallo, Welt!', 'Dies ist ein Test.']
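
For further analysis, it is often convenient to keep the input, the translation, and the detected source language together. The sketch below converts result objects into a list of plain dictionaries, which could then be passed to `pd.DataFrame()`. The helper `results_to_records` is a hypothetical name, and the `SimpleNamespace` object merely mimics the two attributes shown above so that the example runs without an API key.

```python
from types import SimpleNamespace

def results_to_records(texts, results):
    """Combine input texts and result objects into a list of plain dicts."""
    return [
        {'input': t, 'translation': r.text, 'detected_source_lang': r.detected_source_lang}
        for t, r in zip(texts, results)
    ]

# stand-in object with the same two attributes as a result object (not a real API response)
fake_results = [SimpleNamespace(text='Hallo, Welt!', detected_source_lang='EN')]
print(results_to_records(['Hello, world!'], fake_results))
```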

### Advanced usage

#### Specifying the "source" language

In [44]:
# we could also specify the source language when translating
result = translator.translate_text(texts, source_lang='en', target_lang='de')
[r.text for r in result]

['Hallo, Welt!', 'Dies ist ein Test.']

#### Listing available source and target languages

- "source" language = the language you want to translate *from*
- "target" language = the language you want to translate *to*


In [45]:
src_langs = {l.code: l.name for l in translator.get_source_languages()}
src_langs

{'BG': 'Bulgarian',
 'CS': 'Czech',
 'DA': 'Danish',
 'DE': 'German',
 'EL': 'Greek',
 'EN': 'English',
 'ES': 'Spanish',
 'ET': 'Estonian',
 'FI': 'Finnish',
 'FR': 'French',
 'HU': 'Hungarian',
 'ID': 'Indonesian',
 'IT': 'Italian',
 'JA': 'Japanese',
 'KO': 'Korean',
 'LT': 'Lithuanian',
 'LV': 'Latvian',
 'NB': 'Norwegian',
 'NL': 'Dutch',
 'PL': 'Polish',
 'PT': 'Portuguese',
 'RO': 'Romanian',
 'RU': 'Russian',
 'SK': 'Slovak',
 'SL': 'Slovenian',
 'SV': 'Swedish',
 'TR': 'Turkish',
 'UK': 'Ukrainian',
 'ZH': 'Chinese'}

In [46]:
# DeepL's source language codes are upper-cased ISO 639-1 (two-letter) codes; here we look up the corresponding ISO 639-2 (three-letter) codes
src_langs_iso1 = {
    c: iso639.to_iso639_2(c.lower()) if iso639.is_valid639_1(c) else None
    for c in src_langs.keys()
}
src_langs_iso1


{'BG': 'bul',
 'CS': 'cze',
 'DA': 'dan',
 'DE': 'ger',
 'EL': 'gre',
 'EN': 'eng',
 'ES': 'spa',
 'ET': 'est',
 'FI': 'fin',
 'FR': 'fre',
 'HU': 'hun',
 'ID': 'ind',
 'IT': 'ita',
 'JA': 'jpn',
 'KO': 'kor',
 'LT': 'lit',
 'LV': 'lav',
 'NB': 'nob',
 'NL': 'dut',
 'PL': 'pol',
 'PT': 'por',
 'RO': 'rum',
 'RU': 'rus',
 'SK': 'slo',
 'SL': 'slv',
 'SV': 'swe',
 'TR': 'tur',
 'UK': 'ukr',
 'ZH': 'chi'}

In [47]:
tgt_langs = {l.code: l.name for l in translator.get_target_languages()}
tgt_langs

{'BG': 'Bulgarian',
 'CS': 'Czech',
 'DA': 'Danish',
 'DE': 'German',
 'EL': 'Greek',
 'EN-GB': 'English (British)',
 'EN-US': 'English (American)',
 'ES': 'Spanish',
 'ET': 'Estonian',
 'FI': 'Finnish',
 'FR': 'French',
 'HU': 'Hungarian',
 'ID': 'Indonesian',
 'IT': 'Italian',
 'JA': 'Japanese',
 'KO': 'Korean',
 'LT': 'Lithuanian',
 'LV': 'Latvian',
 'NB': 'Norwegian',
 'NL': 'Dutch',
 'PL': 'Polish',
 'PT-BR': 'Portuguese (Brazilian)',
 'PT-PT': 'Portuguese (European)',
 'RO': 'Romanian',
 'RU': 'Russian',
 'SK': 'Slovak',
 'SL': 'Slovenian',
 'SV': 'Swedish',
 'TR': 'Turkish',
 'UK': 'Ukrainian',
 'ZH': 'Chinese (simplified)'}

In [48]:
# DeepL's target language codes are also based on ISO 639-1 (two-letter) codes, but some carry a regional suffix
tgt_langs_iso1 = {
    c: iso639.to_iso639_2(c.lower()) if iso639.is_valid639_1(c) else None
    for c in tgt_langs.keys()
}
tgt_langs_iso1
# exceptions: 
#  - EN-GB and EN-US are both 'en'
#  - PT-BR and PT-PT are both 'pt'

{'BG': 'bul',
 'CS': 'cze',
 'DA': 'dan',
 'DE': 'ger',
 'EL': 'gre',
 'EN-GB': None,
 'EN-US': None,
 'ES': 'spa',
 'ET': 'est',
 'FI': 'fin',
 'FR': 'fre',
 'HU': 'hun',
 'ID': 'ind',
 'IT': 'ita',
 'JA': 'jpn',
 'KO': 'kor',
 'LT': 'lit',
 'LV': 'lav',
 'NB': 'nob',
 'NL': 'dut',
 'PL': 'pol',
 'PT-BR': None,
 'PT-PT': None,
 'RO': 'rum',
 'RU': 'rus',
 'SK': 'slo',
 'SL': 'slv',
 'SV': 'swe',
 'TR': 'tur',
 'UK': 'ukr',
 'ZH': 'chi'}
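
The `None` values arise because regional variants such as `'EN-GB'` are not plain ISO 639-1 codes. One way to handle this is to group the target codes by their base language, which shows at a glance for which languages you need to pick a variant. The helper `group_variants` is a hypothetical name introduced for this sketch.

```python
from collections import defaultdict

def group_variants(target_codes):
    """Group DeepL-style target codes by their lower-cased base language code."""
    groups = defaultdict(list)
    for code in target_codes:
        base = code.split('-')[0].lower()
        groups[base].append(code)
    return dict(groups)

variants = group_variants(['DE', 'EN-GB', 'EN-US', 'PT-BR', 'PT-PT'])
# keep only languages with more than one target variant
print({k: v for k, v in variants.items() if len(v) > 1})
# {'en': ['EN-GB', 'EN-US'], 'pt': ['PT-BR', 'PT-PT']}
```

With the real data, you could call `group_variants(tgt_langs.keys())`.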

#### Check usage and rate limits

In [49]:
usage = translator.get_usage()

In [50]:
print('Characters translated:', usage.character.count)
print('Remaining characters:', usage.character.limit-usage.character.count)
print('% quota used:', usage.character.count/usage.character.limit*100)

Characters translated: 1757
Remaining characters: 2498243
% quota used: 0.07028
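
Before sending a large batch of texts, you may want to check whether it fits into the remaining quota. The sketch below does this under the simplifying assumption that the number of billed characters roughly equals the length of each input string. The helper `fits_quota` is a hypothetical name, not part of the `deepl` package.

```python
def fits_quota(texts, used: int, limit: int) -> bool:
    """Check whether translating `texts` would stay within the remaining character quota."""
    # assumption: one billed character per character in the input strings
    needed = sum(len(t) for t in texts)
    return needed <= limit - used

print(fits_quota(['Hello, world!', 'This is a test.'], used=1757, limit=2500000))  # True
```

With the usage object from above, the call would be `fits_quota(texts, usage.character.count, usage.character.limit)`.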


#### More translation options

There are many more options for the `translator.translate_text` method ([source](https://github.com/DeepLcom/deepl-python#text-translation-options)), for example:

- `split_sentences`: specify how input text should be split into sentences,
  default: `'on'`.
    - `'on'`: input text will be split into sentences using both newlines and punctuation.
    - `'off'`: input text will not be split into sentences. Use this for applications where each input text contains only one sentence.
    - `'nonewlines'`: input text will be split into sentences using punctuation but not newlines.
- `preserve_formatting`: controls automatic-formatting-correction. Set to `True` to prevent automatic-correction of formatting, default: `False`.
- `formality`: controls whether translations should lean toward informal or
  formal language. 
    - `'less'`: use informal language.
    - `'more'`: use formal, more polite language.
  *Note:* This option is only available for some target languages, see "Listing available source and target languages" above.
- `glossary`: specifies a glossary to use with translation, either as a string
  containing the glossary ID, or a `GlossaryInfo` as returned by
  `get_glossary()`.
- `context`: specifies additional context to influence translations, that is not
  translated itself. Note this is an **alpha feature**: it may be deprecated at
  any time, or incur charges if it becomes generally available.
  See the DeepL API documentation for more information and example usage.
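
To show how some of these options could be combined, here is a small sketch that validates option values against the ones listed above and collects them in a dictionary of keyword arguments. The helper `build_options` is a hypothetical name; it only covers the values documented in this section, while the package may accept further values.

```python
VALID_SPLIT_SENTENCES = {'on', 'off', 'nonewlines'}
VALID_FORMALITY = {'less', 'more'}

def build_options(split_sentences='on', preserve_formatting=False, formality=None):
    """Validate a few translation options and return them as keyword arguments."""
    if split_sentences not in VALID_SPLIT_SENTENCES:
        raise ValueError(f'invalid split_sentences value: {split_sentences!r}')
    if formality is not None and formality not in VALID_FORMALITY:
        raise ValueError(f'invalid formality value: {formality!r}')
    options = {'split_sentences': split_sentences, 'preserve_formatting': preserve_formatting}
    if formality is not None:
        # only pass `formality` if requested, since not all target languages support it
        options['formality'] = formality
    return options

print(build_options(split_sentences='off', formality='more'))

# usage with the translator from above (requires a valid API key):
# translator.translate_text(texts, target_lang='DE', **build_options(formality='more'))
```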