# Machine translation (advanced)

| Authors | Last update |
|:------ |:----------- |
| Hauke Licht (https://github.com/haukelicht) | 2023-12-05 |

<br>

<a target="_blank" href="https://colab.research.google.com/github/fabiennelind/Going-Cross-Lingual_Course/blob/main/code/translation_advanced.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

If you run notebooks on **Colab**, you can **enable GPU** computing by

1. clicking on "Runtime" in the menu,
2. selecting "Change runtime type", and
3. choosing "GPU" in the "Hardware accelerator" section of the pop-up.

This speeds up computations with deep neural networks.

## Setup

In [1]:
try:
    import google.colab
    COLAB = True
except ImportError:
    COLAB = False
print('on colab:', COLAB)

on colab: False


In [2]:
%%capture
# need to install libraries if on Colab (note: the `%%capture` cell magic must be the first line of the cell)
if COLAB:
    !pip install iso639==0.1.4 torch==2.1.0 easynmt==2.0.0 deepl==1.16.1 google-cloud-translate==3.12.1 sentence-transformers==2.2.2


In [2]:
import os
import pandas as pd

import iso639

import torch

import easynmt
import deepl
from google.oauth2 import service_account
from google.cloud import translate_v2 as gt

base_path = os.path.join('..')
data_path = os.path.join(base_path, 'data')

In [3]:
device = 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'

## Load data

We use the [PimPo](https://manifesto-project.wzb.eu/information/documents/pimpo) dataset that records party manifesto quasi-sentences.

In [5]:
if COLAB:
    fp = "https://raw.githubusercontent.com/fabiennelind/Going-Cross-Lingual_Course/main/data/lehmann%2Bzobel_2018_labeled_manifestos.tsv"
else:
    fp = os.path.join(data_path, 'lehmann+zobel_2018_labeled_manifestos.tsv')
df = pd.read_csv(fp, sep='\t', encoding='utf-8')

In [6]:
# number of manifestos
len(df.manifesto_id.unique())

242

In [7]:
# big dataset with broad language coverage
print('N =', len(df))
df.lang.value_counts()

N = 234111


lang
deu    66255
spa    40074
eng    36621
nor    33544
nld    30931
swe     8210
fin     7888
dan     6453
fra     4135
Name: count, dtype: int64

So, the dataset is quite big.
Let's see what it'd cost us to translate it with a commercial service (Google or DeepL):

In [8]:
# count number of characters
print('# characters', df.text.apply(len).sum() / 1_000_000) # in millions
# translating 1 mio characters costs about 20 EUR with DeepL and 20 U.S. Dollars with Google
print('approx. costs (in EUR or $) =', (df.text.apply(len).sum() / 1_000_000) * 20)

# characters 26.510575
approx. costs (in EUR or $) = 530.2115
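
To see which languages drive such an estimate, the same arithmetic can be applied per language. The following is a minimal sketch in plain Python with toy data; the function name `estimate_costs` and the flat price of 20 per million characters are illustrative assumptions, and actual prices depend on the provider and plan.

```python
from collections import defaultdict

def estimate_costs(rows, price_per_million=20.0):
    """Sum characters per language and convert them to approximate costs.

    rows: iterable of (lang, text) pairs
    price_per_million: assumed price per 1 million characters
    """
    chars = defaultdict(int)
    for lang, text in rows:
        chars[lang] += len(text)
    return {lang: n / 1_000_000 * price_per_million for lang, n in chars.items()}

# toy data (hypothetical), not taken from the dataset
rows = [('deu', 'a' * 500_000), ('deu', 'b' * 250_000), ('spa', 'c' * 100_000)]
costs = estimate_costs(rows)
for lang, cost in costs.items():
    print(f'{lang}: approx. {cost:.2f}')
```

With a data frame, the `(lang, text)` pairs could come from `zip(df.lang, df.text)`.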


So for practical purposes, we subset the dataset to sentences about immigration or integration (these sentences come with stance codings).


In [9]:
print(df.issue.value_counts())
df = df[df.issue.isin(['integration', 'immigration'])]

issue
other          225166
integration      5052
immigration      3893
Name: count, dtype: int64


In [10]:
print(f'# characters =~ {df.text.apply(len).sum() / 1_000_000:.03f} mio')
print(f'approximate costs =~ {(df.text.apply(len).sum() / 1_000_000)*20:.02f} EUR')

# characters =~ 1.040 mio
approximate costs =~ 20.80 EUR


This looks better.
We have prepared this subset in a separate data file, which we load instead:

In [11]:
if COLAB:
    fp = "https://raw.githubusercontent.com/fabiennelind/Going-Cross-Lingual_Course/main/data/lehmann%2Bzobel_2018_pimpo_positions.tsv"
else:
    fp = os.path.join(data_path, 'lehmann+zobel_2018_pimpo_positions.tsv')
df = pd.read_csv(fp, sep='\t', encoding='utf-8')

In [12]:
df.position.value_counts()

position
supportive    5841
sceptical     2136
neutral        968
Name: count, dtype: int64

## Building an advanced translation function: illustration with easyNMT's M2M model

In [13]:
import easynmt
model = easynmt.EasyNMT('m2m_100_418M', device=device)

We will first create a helper function that lets us split a list into smaller chunks.
We will use this to safely translate a large list of sentences in smaller portions.

In [14]:
# chunk list of sentences into smaller chunks
def chunk(lst, size):
    for i in range(0, len(lst), size):
        yield lst[i:i + size]

# example: split list with first 10 letters in alphabet into chunks of 3
alphabet = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
for i, c in enumerate(chunk(alphabet, 3)):
    print(f'chunk nr. {i} has {len(c)} element(s): {c}')

chunk nr. 0 has 3 element(s): ['a', 'b', 'c']
chunk nr. 1 has 3 element(s): ['d', 'e', 'f']
chunk nr. 2 has 3 element(s): ['g', 'h', 'i']
chunk nr. 3 has 1 element(s): ['j']
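
Note that the number of chunks is the ceiling of `len(lst) / size`, not the floor: the last chunk may be incomplete but still counts. This matters whenever the number of chunks is computed up front, for example as the total of a progress bar. The sketch below checks this; `n_chunks` is a hypothetical helper name.

```python
import math

def chunk(lst, size):
    for i in range(0, len(lst), size):
        yield lst[i:i + size]

def n_chunks(n_items, size):
    """Number of chunks that `chunk` yields for a list with `n_items` elements."""
    return math.ceil(n_items / size)

for n, size in [(10, 3), (16, 8), (4, 1280)]:
    actual = len(list(chunk(list(range(n)), size)))
    # floor division (n // size) undercounts whenever the last chunk is incomplete
    print(f'n={n}, size={size}: ceil={n_chunks(n, size)}, floor={n // size}, actual={actual}')
```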


Next, we will create a function for freeing up GPU memory if we run into overflow issues.

In [15]:
import torch
import gc

def clean_memory(device):
    if 'cuda' in str(device):
        torch.cuda.empty_cache()
        torch.cuda.ipc_collect()
    elif 'mps' in str(device):
        # release cached memory held by the MPS allocator
        torch.mps.empty_cache()
    gc.collect()

Now we are ready to define the first translation function.
It takes a list of texts and applies a translation function, which is passed to the `translation_fun` argument.
(This should be a function that takes a list of texts and returns a list of translations.)

The cool thing about `translate_batch_safely` is that if it runs into GPU memory issues, it tries to translate the texts one by one (instead of all in one batch).
This avoids breaking the process due to a single overflow error.
Hence the "safely" in the function name.

In [16]:
import torch
from typing import Callable, Union

def translate_batch_safely(texts: list, translation_fun: Callable, device: Union[str, torch.device], **kwargs) -> list:
    """
    Translates a batch of texts using the model, handling potential errors.

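    Parameters:
        texts (list): A list of texts to be translated.
        translation_fun (Callable): The translation function to be used.
        device (Union[str, torch.device]): The device whose memory cache is cleared after an out-of-memory error.
        **kwargs: Additional keyword arguments to be passed to `translation_fun`.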

    Returns:
        list: A list of translated texts.
    """
    try:
        # Attempt to translate the batch of texts using the model
        res = translation_fun(texts, **kwargs)
    except Exception as e:
        # If the exception is _not_ related to running out of memory, ...
        if 'out of memory' not in str(e):
            # ... re-raise the exception
            raise
        # but if the error was due to running out of memory, ...
        else:
            # get the key-value arguments
            args = dict(**kwargs)
            # drop the batch size (will be set to 1 below)
            if 'batch_size' in args:
                del args['batch_size']

            clean_memory(device)
            res = [None] * len(texts)
            # ... try translating each text individually
            for i, text in enumerate(texts):
                try:
                    res[i] = translation_fun(text, batch_size=1, **args)
                except Exception as e:
                    # If unable to translate a text, print a warning message
                    print(f'WARNING: couldn\'t translate text "{text[:min(30, len(text))]}". Reason: {str(e)}')
    return res
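
To see the fallback in action without a GPU, the pattern can be tried with a stand-in translation function. This is a simplified sketch, not the function above: `fake_translate` is a hypothetical function that simulates an out-of-memory error for batches with more than two texts, and `translate_with_fallback` omits the memory cleanup and the per-text warning.

```python
def fake_translate(texts, batch_size=8, **kwargs):
    """Hypothetical stand-in for a model's translate method ("translates" by upper-casing)."""
    if isinstance(texts, str):
        return texts.upper()
    if len(texts) > 2:
        raise RuntimeError('CUDA out of memory (simulated)')
    return [t.upper() for t in texts]

def translate_with_fallback(texts, translation_fun, **kwargs):
    """Simplified version of the fallback logic: batch first, then one text at a time."""
    try:
        return translation_fun(texts, **kwargs)
    except Exception as e:
        if 'out of memory' not in str(e):
            raise  # unrelated errors are passed on
        kwargs.pop('batch_size', None)  # will be set to 1 below
        return [translation_fun(t, batch_size=1, **kwargs) for t in texts]

# the batch of three triggers the simulated error, so texts are translated one by one
result = translate_with_fallback(['eins', 'zwei', 'drei'], fake_translate, batch_size=3)
print(result)
```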

Next, we define a function that translates a list of texts by internally splitting it into smaller chunks and calling `translate_batch_safely` on each chunk.
Note that the `translation_fun` argument required by `translate_batch_safely` is just passed onwards as a keyword argument (captured by the `**kwargs`).

In [17]:
from tqdm.auto import tqdm

def translate_in_batches(texts: list, batch_size, verbose=False, pbar_desc=None, **kwargs) -> list:
    """
    Translates a list of texts in batches using the `translate_batch_safely` function.

    Parameters:
        texts (list): A list of texts to be translated.
        batch_size (int): The size of each translation batch.
        verbose (bool): Whether to print messages and a progress bar
        pbar_desc (str): The description of the progress bar
        **kwargs: Additional keyword arguments to be passed to the `translate_batch_safely` function.

    Returns:
        list: A list of translated texts.
    """
    # Initialize an empty list to store the translations
    translations = []
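    # ceiling division, so that the last (possibly incomplete) batch is counted, too
    n_batches = (len(texts) + batch_size - 1) // batch_size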
    if verbose:
        pbar = tqdm(total=n_batches, desc=pbar_desc)
    # Iterate over the batches of texts
    for batch in chunk(texts, batch_size):
        # Translate the batch of texts
        translations += translate_batch_safely(batch, **kwargs)
        if verbose: 
            pbar.update(1)
    if verbose: 
        pbar.close()
    return translations

#### Example with one source language

Let's take a sample of sentences from the German-language subset of our dataset and translate them to English.

In [18]:
b = 8 # <== batch size
expls = df.text[df.lang == 'deu'].tolist()
expls = expls[:b*2]
print('No. sentences =', len(expls))

No. sentences = 16


In [19]:
translate_in_batches(
    texts=expls, 
    translation_fun=model.translate, # <== use easyNMT model's translate method
    device=model.device, 
    batch_size=b, # <== batch size
    # arguments forwarded to `translation_fun`
    source_lang='de', 
    target_lang='en', 
    beam_size=5,
    perform_sentence_splitting=False, 
    show_progress_bar=False,
)

['Instead of democratically integrating the interests of all stakeholders, it sets on division: workers against unemployed, west against east, healthy against sick, boys against old, men against women, Germans against foreigners, singles against families.',
 'Where the law of the stronger applies, hatred and violence are not far away.',
 'Five years after the fires in Solingen and Lübeck, extreme-right violence continues to be bitter everyday.',
 'Five years after the abolition of the asylum law, people in Germany must seek asylum before the police in churches',
 'Millions of citizens are still denied citizenship rights',
 'A disgraceful foreign domestic policy pushes especially young migrants to the edge of society',
 'Green wants to oppose a policy in which young people can participate actively and which takes their experiences and conditions of life seriously.',
 'Chancellor words such as “Free-Time Park” and campaigns against the “abuse” of the social security system or against for

*Note:* This takes some time if you run it without a GPU.

#### Example with multiple source languages

Next, we take samples of sentences from the German-, Finnish-, and Spanish-language subsets of our dataset and translate them to English.

In [20]:
b = 4 # <== batch size
# subset data to selected languages (for illustration) and columns
expls = df.loc[df.lang.isin(['deu', 'fin', 'spa']), ['lang', 'text']]
# sample 4 sentences per language subset
expls = expls.groupby('lang').sample(b, random_state=42)
# reshuffle rows and reset index
expls = expls.sample(frac=1.0, random_state=42).reset_index(drop=True)

expls

|   | lang | text |
|:- |:---- |:---- |
| 0 | spa | Los inmigrantes han rejuvenecido nuestra pobla... |
| 1 | spa | así como proyectar la acción del Gobierno en c... |
| 2 | deu | § Integration fördern |
| 3 | spa | Ello requiere una vigilancia mayor y más solid... |
| 4 | fin | Kun keskustelu kiehuu, on vaarana, että huudam... |
| 5 | deu | Junge Deutsche, die auch noch den Pass eines a... |
| 6 | deu | Die von den Kantonen und Gemeinden eingerichte... |
| 7 | spa | Apoyaremos de manera decidida las iniciativas ... |
| 8 | fin | On tärkeää ymmärtää maahanmuuton syitä ja muut... |
| 9 | fin | Maahanmuuttaja- ja romaniperheiden lapsia tule... |


In [21]:
# initialize new column for storing translations
expls['text_mt_m2m'] = [None]*len(expls)

# iterate over languages subsets
for l, d in expls.groupby('lang'):
    # get the ISO 639-1 language code for the language
    lang_code = iso639.to_iso639_1(l)
    # note: usually, we want to check beforehand whether the language is supported by the model

    print(f'translating {len(d)} texts from {l} ({lang_code})')
    # translate texts in batches and assign translations directly into relevant rows of target column (using input data frame's index values)
    expls.loc[d.index, 'text_mt_m2m'] = translate_in_batches(
        texts=d.text.tolist(), 
        translation_fun=model.translate,
        device=model.device,
        batch_size=b, 
        # arguments forwarded to `translation_fun`
        source_lang=lang_code, # <== use current iteration's language
        target_lang='en', 
        beam_size=5,
        perform_sentence_splitting=False,
        show_progress_bar=False, 
    )

translating 4 texts from deu (de)
translating 4 texts from fin (fi)
translating 4 texts from spa (es)
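
As the comment in the loop notes, it is usually worth checking language support before translating. Below is a minimal sketch of such a check in plain Python. The mapping `ISO3_TO_ISO1` is a small hand-written stand-in for the lookup done with `iso639`, the set `supported` stands in for the list returned by `model.get_languages()`, and `'xxx'` is a made-up code for a language without a match.

```python
# hypothetical, hand-written mapping from three-letter to two-letter codes
ISO3_TO_ISO1 = {'deu': 'de', 'fin': 'fi', 'spa': 'es'}

def split_by_support(langs, supported):
    """Map each language indicator to a supported code or list it as missing."""
    ok, missing = {}, []
    for l in langs:
        # use the indicator itself if the model knows it, otherwise try the two-letter code
        code = l if l in supported else ISO3_TO_ISO1.get(l)
        if code in supported:
            ok[l] = code
        else:
            missing.append(l)
    return ok, missing

supported = {'de', 'en', 'es', 'fi'}  # stand-in for `model.get_languages()`
ok, missing = split_by_support(['deu', 'fin', 'spa', 'xxx'], supported)
print('supported:', ok)
print('not supported:', missing)
```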


In [22]:
# the final data frame has the translations
expls

|   | lang | text | text_mt_m2m |
|:- |:---- |:---- |:----------- |
| 0 | spa | Los inmigrantes han rejuvenecido nuestra pobla... | Immigrants have rejuvenated our population and... |
| 1 | spa | así como proyectar la acción del Gobierno en c... | and project the Government’s action in coopera... |
| 2 | deu | § Integration fördern | Promote integration |
| 3 | spa | Ello requiere una vigilancia mayor y más solid... | This requires greater surveillance and more so... |
| 4 | fin | Kun keskustelu kiehuu, on vaarana, että huudam... | When the discussion is boiling, there is a dan... |
| 5 | deu | Junge Deutsche, die auch noch den Pass eines a... | Young Germans who still have the passport of a... |
| 6 | deu | Die von den Kantonen und Gemeinden eingerichte... | The existing offers of courses and employment ... |
| 7 | spa | Apoyaremos de manera decidida las iniciativas ... | We will strongly support co-development initia... |
| 8 | fin | On tärkeää ymmärtää maahanmuuton syitä ja muut... | It is important to understand the causes of mi... |
| 9 | fin | Maahanmuuttaja- ja romaniperheiden lapsia tule... | Children from immigrants and Romans should be ... |


#### Wrap the entire logic into a custom function

In many use cases, we want to translate a whole dataset in the form of a data frame.
So let's wrap the logic above into a custom function that takes a data frame as input and returns the data frame with the sentence translations added.

In [23]:
import numpy as np
from typing import Callable, Union, List

# helpers
def is_string_series(s: pd.Series):
    """
    Test if pandas series is a string series/series of strings
    
    source: https://stackoverflow.com/a/67001213
    """
    if isinstance(s.dtype, pd.StringDtype):
        # The series was explicitly created as a string series (Pandas>=1.0.0)
        return True
    elif s.dtype == 'object':
        # Object series, check each value
        return all(isinstance(v, str) or (v is None) or (np.isnan(v)) for v in s)
    else:
        return False

def is_nonempty_string(s: pd.Series):
    return np.array([isinstance(v, str) and len(v) > 0 for v in s], dtype=bool)

# main function
def translate_df(
        df: pd.DataFrame, 
        translation_function: Callable,
        supported_languages: List[str],
        text_col: str = 'text', 
        lang_col: str = 'lang',
        target_language: str = 'en',
        target_col: str = 'translation',
        device: Union[str, torch.device] = 'cpu',
        batch_size: int = 16,
        verbose: bool = False,
        **kwargs
    ):
    """
    Translates the texts in a data frame from the source languages specified in a column to a target language and adds the translations to the data frame.

    Parameters:
        df (pd.DataFrame): The DataFrame containing the texts to be translated.
        translation_function (Callable): The translation function to be used.
        supported_languages (List[str]): A list of ISO 639-1 or ISO 639-2 language codes supported by the translation model.
        text_col (str): The name of the column in the DataFrame that contains the texts to be translated. Default is 'text'.
        lang_col (str): The name of the column in the DataFrame that contains the language codes. Default is 'lang'.
        target_language (str): The target language to translate the texts to. Can be either an ISO 639-1 or ISO 639-2 language code. Default is 'en'.
        target_col (str): The name of the column in the DataFrame to store the translations. Default is 'translation'.
        device (Union[str, torch.device]): The device to use for translation. Default is 'cpu' but should be compatible with device used by translation model.
        batch_size (int): The size of each translation batch. Default is 16.
        verbose (bool): Whether to print a progress bar. Default is False.
        **kwargs: Additional keyword arguments to be passed to the `translate_in_batches` function which, in turn, passes them to the translation function.

    Returns:
        pd.DataFrame: The DataFrame with the translated texts in column `target_col` in the target language `target_language`.
    """
    # validate the inputs
    assert text_col in df.columns, f'Column "{text_col}" not found in data frame.'
    assert is_string_series(df[text_col]), f'Column "{text_col}" is not a series of string values.'
    assert lang_col in df.columns, f'Column "{lang_col}" not found in data frame.'
    assert is_string_series(df[lang_col]), f'Column "{lang_col}" is not a series of string values.'
    assert target_language is not None, 'Target language must be specified.'
    assert target_col not in df.columns, f'Column "{target_col}" already exists in data frame.'
    assert translation_function is not None, 'Translation function must be specified.'
    assert batch_size > 0, 'Batch size must be greater than 0.'
    assert supported_languages is not None, 'Supported languages must be specified.'
    assert isinstance(supported_languages, list), 'Supported languages must be a list.'
    assert len(supported_languages) > 0, 'Supported languages must not be empty.'
    assert all([isinstance(l, str) for l in supported_languages]), 'Supported languages must be a list of strings.'
    
    # collect the language indicators used in the data frame
    langs = df[lang_col].unique().tolist()
    # try to map each language indicator to its ISO 639-1 language code
    langs_map = {
        l: l if iso639.is_valid639_1(l) else iso639.to_iso639_1(l) if iso639.is_valid639_2(l) else None 
        for l in langs
    }
    # check whether there are unsupported languages
    not_supported = [
        l 
        for l, c in langs_map.items() 
        if l not in supported_languages and c not in supported_languages and l != target_language and c != target_language
    ]
    # print warning message if there are unsupported languages
    if len(not_supported) > 0:
        print(
            f'WARNING: values {not_supported} in column "{lang_col}" are not supported by NMT model.',
            'Texts with these values will not be translated.'
        )
    # now update language mapping with "correct" language codes (use ISO code if available, otherwise use original indicator from the data frame)
    langs_map = {
        l: c if c in supported_languages else l if l in supported_languages else None 
        for l, c in langs_map.items()
    }

    # create new column for translation
    df[target_col] = [None]*len(df)

    # iterate over languages
    for l, d in df.groupby(lang_col):
        lang_code = langs_map[l]
        # just copy texts if source language is the target language
        if lang_code == target_language or l == target_language:
            df.loc[d.index, target_col] = d[text_col].tolist()
            continue
        # skip unsupported languages
        if l in not_supported or lang_code is None:
            continue
        # test for each text value if non-empty string
        flag = is_nonempty_string(d[text_col])
        if any(~flag):
            print(f'WARNING: {sum(~flag)} empty or non-string text(s) in "{l}"')
        df.loc[d.index[flag], target_col] = translate_in_batches(
            texts=d[text_col][flag].tolist(), # <== only translate non-empty texts
            translation_fun=translation_function,
            device=device,
            batch_size=batch_size, 
            source_lang=lang_code, 
            target_lang=target_language,
            verbose=verbose, 
            pbar_desc=f'translating {len(d)} text(s) from "{l}"',
            **kwargs
        )
    
    return df

In [24]:
# take the example sentences from above again (but delete previous translations)
if 'text_mt_m2m' in expls.columns:
    del expls['text_mt_m2m']

# translate
expls = translate_df(
    df=expls, 
    # data frame arguments
    text_col='text',
    target_language='en', 
    target_col='text_mt_m2m',
    # translation model arguments
    translation_function=model.translate,
    supported_languages=model.get_languages(),
    batch_size=4, # <== use small batch size for illustration 
    device=model.device,
    # arguments forwarded to model.translate()
    beam_size=5,
    perform_sentence_splitting=False,
    show_progress_bar=False, 
    # print progress bar
    verbose=True,
)

expls

translating 4 text(s) from "deu":   0%|          | 0/1 [00:00<?, ?it/s]

translating 4 text(s) from "fin":   0%|          | 0/1 [00:00<?, ?it/s]

translating 4 text(s) from "spa":   0%|          | 0/1 [00:00<?, ?it/s]

|   | lang | text | text_mt_m2m |
|:- |:---- |:---- |:----------- |
| 0 | spa | Los inmigrantes han rejuvenecido nuestra pobla... | Immigrants have rejuvenated our population and... |
| 1 | spa | así como proyectar la acción del Gobierno en c... | and project the Government’s action in coopera... |
| 2 | deu | § Integration fördern | Promote integration |
| 3 | spa | Ello requiere una vigilancia mayor y más solid... | This requires greater surveillance and more so... |
| 4 | fin | Kun keskustelu kiehuu, on vaarana, että huudam... | When the discussion is boiling, there is a dan... |
| 5 | deu | Junge Deutsche, die auch noch den Pass eines a... | Young Germans who still have the passport of a... |
| 6 | deu | Die von den Kantonen und Gemeinden eingerichte... | The existing offers of courses and employment ... |
| 7 | spa | Apoyaremos de manera decidida las iniciativas ... | We will strongly support co-development initia... |
| 8 | fin | On tärkeää ymmärtää maahanmuuton syitä ja muut... | It is important to understand the causes of mi... |
| 9 | fin | Maahanmuuttaja- ja romaniperheiden lapsia tule... | Children from immigrants and Romans should be ... |


### Leverage the flexibility of `translate_df`

We can leverage the flexibility of `translate_df` to create three specialized translation functions: one each for easyNMT models, the DeepL API, and the Google Cloud Translation API.



#### translate data frame with easyNMT model

In [25]:
from types import SimpleNamespace

def translate_df_with_easynmt(df, **kwargs):
    """
    Translates a DataFrame using the EasyNMT model.

    Args:
        df (pandas.DataFrame): The DataFrame to be translated.
        **kwargs: Additional arguments for the translation process (e.g., `model_name`, `text_col`, `lang_col`, `target_language`, `verbose`).

    Returns:
        pandas.DataFrame: The translated DataFrame.
    """
    args = SimpleNamespace(**kwargs)

    try:
        device = 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'
        model = easynmt.EasyNMT(args.model_name, device=device)
        print(f'Using device "{model.device}"')
    except Exception as e:
        print(f'WARNING: could not load model "{args.model_name}"')
        raise e
    
    tgt_lang = [l.lower() for l in model.get_languages() if args.target_language.lower() == l.lower() or args.target_language.lower() in l.lower()]
    if len(tgt_lang) == 0:
        raise ValueError(f'Target language "{args.target_language}" not supported by model "{args.model_name}".')
    if len(tgt_lang) > 1:
        raise ValueError(f'Target language "{args.target_language}" ambiguous. Please specify one of {tgt_lang}.')
    tgt_lang = tgt_lang[0]
    src_langs = model.get_languages(target_lang = tgt_lang)

    df = df.copy(deep=True)
    
    try:
        df = translate_df(
            df=df, 
            # data frame arguments
            text_col=args.text_col,
            lang_col=args.lang_col,
            target_language=tgt_lang, 
            target_col=args.target_col if hasattr(args, 'target_col') else f'{args.text_col}_mt_{args.model_name.lower()}',
            # translation model arguments
            translation_function=model.translate,
            supported_languages=src_langs,
            batch_size=args.batch_size if hasattr(args, 'batch_size') else 16,
            device=model.device,
            # arguments forwarded to model.translate()
            beam_size=5,
            perform_sentence_splitting=False,
            show_progress_bar=False, 
            # print progress bar
            verbose=args.verbose,
        )
    except Exception as e:
        print(f'WARNING: Error during translation "{str(e)}". Returning data frame with translations so far.')
    
    return df


In [26]:
# take the example sentences from above again (but delete previous translations)
if 'text_mt_m2m' in expls.columns:
    del expls['text_mt_m2m']

# translate
expls = translate_df_with_easynmt(
    df=expls, 
    model_name='m2m_100_418M',
    # data frame arguments
    text_col='text',
    lang_col='lang',
    target_language='en', 
    target_col='text_mt_m2m',
    batch_size=4, # <== use small batch size for illustration 
    # arguments forwarded to model.translate()
    beam_size=5,
    perform_sentence_splitting=False,
    show_progress_bar=False, 
    # print progress bar
    verbose=True,
)

Using device "mps"


translating 4 text(s) from "deu":   0%|          | 0/1 [00:00<?, ?it/s]

translating 4 text(s) from "fin":   0%|          | 0/1 [00:00<?, ?it/s]

translating 4 text(s) from "spa":   0%|          | 0/1 [00:00<?, ?it/s]

In [27]:
# take the example sentences from above again (but delete previous translations)
if 'text_mt_opus' in expls.columns:
    del expls['text_mt_opus']

# translate
expls = translate_df_with_easynmt(
    df=expls, 
    model_name='opus-mt',
    # data frame arguments
    text_col='text',
    lang_col='lang',
    target_language='en', 
    target_col='text_mt_opus',
    batch_size=4, # <== use small batch size for illustration 
    # arguments forwarded to model.translate()
    beam_size=5,
    perform_sentence_splitting=False,
    show_progress_bar=False, 
    # print progress bar
    verbose=True,
)


Using device "mps"


translating 4 text(s) from "deu":   0%|          | 0/1 [00:00<?, ?it/s]

translating 4 text(s) from "fin":   0%|          | 0/1 [00:00<?, ?it/s]

translating 4 text(s) from "spa":   0%|          | 0/1 [00:00<?, ?it/s]

#### translate data frame with DeepL

In [28]:
import deepl

def translate_df_with_deepl(df, **kwargs):
    args = SimpleNamespace(**kwargs)

    # get API key
    try:
        with open(args.api_key_file) as f:
            api_key = f.read().strip()
    except Exception as e:
        raise ValueError(f'Could not load API key file "{args.api_key_file}". Reason: {str(e)}')
        
    # initialize a `Translator` instance
    try:
        translator = deepl.Translator(api_key)
    except Exception as e:
        raise ValueError(f'Could not connect to DeepL API. Reason: {str(e)}')

    # get source and target languages
    src_langs = [l.code.lower() for l in translator.get_source_languages()]
    tgt_lang = [l.code.lower() for l in translator.get_target_languages() if args.target_language.lower() == l.code.lower() or args.target_language.lower() in l.code.lower()]
    if len(tgt_lang) == 0:
        raise ValueError(f'Target language "{args.target_language}" not supported by DeepL.')
    if len(tgt_lang) > 1:
        raise ValueError(f'Target language "{args.target_language}" ambiguous. Please specify one of {tgt_lang}.')
    tgt_lang = tgt_lang[0]
    
    df = df.copy(deep=True)
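    # use the user-supplied target column if given, so that the post-processing below finds it
    tgt_col = getattr(args, 'target_col', f'{args.text_col}_mt_deepl')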
    
    # translate
    try:
        df = translate_df(
            df=df, 
            # data frame arguments
            text_col=args.text_col,
            lang_col=args.lang_col,
            target_language=tgt_lang,
            target_col=args.target_col if hasattr(args, 'target_col') else tgt_col,
            # translation model arguments
            translation_function=translator.translate_text,
            supported_languages=src_langs,
            batch_size=args.batch_size if hasattr(args, 'batch_size') else 1280,
            # arguments forwarded to translator.translate_text()
            split_sentences='off' if not hasattr(args, 'split_sentences') else 'on' if args.split_sentences else 'off',
            # print progress bar
            verbose=args.verbose,
        )
    except Exception as e:
        print(f'WARNING: Error during translation "{str(e)}". Returning data frame with translations so far.')
    
    try:
        # post-process translation result
        df[tgt_col] = df[tgt_col].apply(lambda x: x if isinstance(x, str) else x.text if x is not None else None)
    except Exception as e:
        print(f'WARNING: Error during post-processing "{str(e)}". Returning data frame with translations so far.')
    
    return df
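
The target-language lookup in these functions accepts a requested code if exactly one available code contains it, so a bare `en` is rejected as ambiguous when several English variants are offered. The sketch below isolates that matching rule. The function name `resolve_target_language` and the list `codes` are illustrative assumptions; the list is not queried from any API.

```python
def resolve_target_language(requested, available):
    """Return the single code in `available` that contains `requested` (case-insensitive)."""
    requested = requested.lower()
    matches = [code.lower() for code in available if requested in code.lower()]
    if len(matches) == 0:
        raise ValueError(f'Target language "{requested}" not supported.')
    if len(matches) > 1:
        raise ValueError(f'Target language "{requested}" ambiguous. Please specify one of {matches}.')
    return matches[0]

# illustrative list of target codes (an assumption, not taken from a real API response)
codes = ['DE', 'EN-GB', 'EN-US', 'ES']

print(resolve_target_language('en-gb', codes))
try:
    resolve_target_language('en', codes)
except ValueError as e:
    print(e)
```

Because the rule is substring-based, short requests can also match unrelated longer codes, so passing the full code is the safer choice.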

In [29]:
# translate
expls = translate_df_with_deepl(
    df=expls, 
    api_key_file=os.path.join(os.environ['SPATH'], 'deepl'),
    # data frame arguments
    text_col='text',
    lang_col='lang',
    target_language='en-gb', 
    target_col='text_mt_deepl',
    # print progress bar
    verbose=True,
)

translating 4 text(s) from "deu": 0it [00:00, ?it/s]

translating 4 text(s) from "fin": 0it [00:00, ?it/s]

translating 4 text(s) from "spa": 0it [00:00, ?it/s]

#### translate data frame with Google Cloud Translation

In [30]:
from google.oauth2 import service_account
from google.cloud import translate_v2 as gt

def translate_df_with_google(df, **kwargs):
    args = SimpleNamespace(**kwargs)

    if hasattr(args, 'batch_size'):
        assert args.batch_size <= 128, 'Google Cloud Translation API only supports batch sizes up to 128.'
    
    # get API key
    try:
        credentials = service_account.Credentials.from_service_account_file(args.api_key_file)
    except Exception as e:
        raise ValueError(f'Could not load API key file "{args.api_key_file}". Reason: {str(e)}')
    
    # initialize a `translator` instance
    try:
        translator = gt.Client(credentials=credentials)
    except Exception as e:
        raise ValueError(f'Could not connect to Google Cloud Translation API. Reason: {str(e)}')
    
    # get source and target languages
    src_langs = [l['language']  for l in translator.get_languages()]
    tgt_lang = [l.lower() for l in src_langs if args.target_language.lower() == l.lower() or args.target_language.lower() in l.lower()]
    if len(tgt_lang) == 0:
        raise ValueError(f'Target language "{args.target_language}" not supported by Google Cloud Translation API.')
    if len(tgt_lang) > 1:
        raise ValueError(f'Target language "{args.target_language}" ambiguous. Please specify one of {tgt_lang}.')
    tgt_lang = tgt_lang[0]
    
    df = df.copy(deep=True)
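    # use the user-supplied target column if given, so that the post-processing below finds it
    tgt_col = getattr(args, 'target_col', f'{args.text_col}_mt_google')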
    
    def translate_util(values, target_lang, source_lang, **kwargs):
        return translator.translate(values=values, target_language=target_lang, source_language=source_lang, **kwargs)

    # translate
    try:
        df = translate_df(
            df=df, 
            # data frame arguments
            text_col=args.text_col,
            lang_col=args.lang_col,
            target_language=tgt_lang,
            target_col=args.target_col if hasattr(args, 'target_col') else tgt_col,
            # translation model arguments
            translation_function=translate_util,
            supported_languages=src_langs,
            batch_size=args.batch_size if hasattr(args, 'batch_size') else 128,
            # print progress bar
            verbose=args.verbose,
        )
    except Exception as e:
        print(f'WARNING: Error during translation "{str(e)}". Returning data frame with translations so far.')
    
    try:
        # post-process translation result
        df[tgt_col] = df[tgt_col].apply(lambda x: x if isinstance(x, str) else x['translatedText'] if x is not None else None)
    except Exception as e:
        print(f'WARNING: Error during post-processing "{str(e)}". Returning data frame with translations so far.')
    
    return df

In [31]:
# translate
expls = translate_df_with_google(
    df=expls, 
    api_key_file=os.path.join(os.environ['SPATH'], 'multilingual-gesis-translate.json'),
    # data frame arguments
    text_col='text',
    lang_col='lang',
    target_language='en', 
    target_col='text_mt_google',
    # print progress bar
    verbose=True,
)

translating 4 text(s) from "deu": 0it [00:00, ?it/s]

translating 4 text(s) from "fin": 0it [00:00, ?it/s]

translating 4 text(s) from "spa": 0it [00:00, ?it/s]

## Compare translations

Ideally, we would want to (i) compare machine translations against human translations and (ii) crowd-source bilingual speakers' assessments of translation quality.
But this is beyond our means and outside the scope of this course.
Instead, we compare the translations produced by the different models with each other, first qualitatively and then quantitatively.

### Qualitative comparison

In [32]:
expls

|  | lang | text | text_mt_m2m | text_mt_opus | text_mt_deepl | text_mt_google |
|:-- |:--- |:--- |:--- |:--- |:--- |:--- |
| 0 | spa | Los inmigrantes han rejuvenecido nuestra pobla... | Immigrants have rejuvenated our population and... | Immigrants have rejuvenated our population and... | Immigrants have rejuvenated our population and... | Immigrants have rejuvenated our population and... |
| 1 | spa | así como proyectar la acción del Gobierno en c... | and project the Government’s action in coopera... | as well as project the Government's action in ... | as well as projecting the Government's action ... | as well as projecting the Government&#39;s act... |
| 2 | deu | § Integration fördern | Promote integration | § Promote integration | § Promoting integration | § Promote integration |
| 3 | spa | Ello requiere una vigilancia mayor y más solid... | This requires greater surveillance and more so... | This requires greater vigilance and greater so... | This requires greater vigilance and more solid... | This requires greater vigilance and more solid... |
| 4 | fin | Kun keskustelu kiehuu, on vaarana, että huudam... | When the discussion is boiling, there is a dan... | When the debate boils, there is a danger that ... | When the debate boils over, we run the risk of... | When the discussion is heated, there is a dang... |
| 5 | deu | Junge Deutsche, die auch noch den Pass eines a... | Young Germans who still have the passport of a... | Young Germans, who also have the passport of a... | Young Germans who also have a passport from an... | Young Germans who also have a passport from an... |
| 6 | deu | Die von den Kantonen und Gemeinden eingerichte... | The existing offers of courses and employment ... | The existing courses and employment programmes... | The existing courses and employment programmes... | The existing courses and employment programs f... |
| 7 | spa | Apoyaremos de manera decidida las iniciativas ... | We will strongly support co-development initia... | We will strongly support co-development initia... | We will strongly support co-development initia... | We will decisively support co-development init... |
| 8 | fin | On tärkeää ymmärtää maahanmuuton syitä ja muut... | It is important to understand the causes of mi... | It is important to understand the causes of im... | It is important to understand the causes of mi... | It is important to understand the reasons for ... |
| 9 | fin | Maahanmuuttaja- ja romaniperheiden lapsia tule... | Children from immigrants and Romans should be ... | Children of immigrant and Roma families should... | Children from immigrant and Roma families shou... | Children from immigrant and Roma families shou... |
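
Note that the Google translation in row 1 contains the HTML entity `&#39;` where the other models output an apostrophe. A minimal sketch of how such entities could be decoded with `html.unescape` from Python's standard library. The example string is a shortened excerpt of that row, and applying the same call to the whole `text_mt_google` column (for example via `.apply(html.unescape)`) is a suggestion, not a step this notebook performs:

```python
import html

# shortened excerpt of a Google translation that contains an HTML entity
raw = "as well as projecting the Government&#39;s"

# html.unescape converts named and numeric character references to characters
clean = html.unescape(raw)
print(clean)
```

Decoding the entities before comparing translations avoids penalizing Google for tokens that differ only in their encoding.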


### Quantitative comparison

Let's use the Jaccard similarity measure, defined as 

$$
Jac(A,B) = \frac{|A \cap B|}{|A \cup B|}
$$

where 
- $A$ is the set of words in sentence $a$ and $B$ is the set of words in sentence $b$,
- $|A \cap B|$ is the size of the intersection (i.e., the number of words that are in both sentences), and
- $|A \cup B|$ is the size of the union (i.e., the number of words that are in either sentence).

**_Note:_** We could also use more specialized metrics like [BLEU](https://en.wikipedia.org/wiki/BLEU) or [METEOR](https://en.wikipedia.org/wiki/METEOR) that are specifically designed for machine translation evaluation (read more [here](https://www.machinetranslation.com/blog/machine-translation-evaluation-ultimate-guide)). But these metrics compare a machine translation against a reference translation, and we have no gold-standard translations here, so we stick to the Jaccard similarity.
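
To make the definition concrete, here is a minimal sketch that applies it to two of the translations shown in this notebook. The helper names `jaccard_strict` and `jaccard_list_lengths` are hypothetical and used only for illustration. The first implements the formula above on sets. The second uses token-list lengths in the denominator, as the `jaccard_set` helper in the next cell does, so repeated words are counted more than once:

```python
def jaccard_strict(tokens_a, tokens_b):
    """Strict set-based Jaccard index |A ∩ B| / |A ∪ B| (hypothetical helper)."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty inputs
    return len(a & b) / len(a | b)


def jaccard_list_lengths(tokens_a, tokens_b):
    """Variant whose denominator uses list lengths, so duplicate tokens inflate it."""
    inter = len(set(tokens_a) & set(tokens_b))
    return inter / (len(tokens_a) + len(tokens_b) - inter)


m2m = ("Immigrants have rejuvenated our population and have contributed to the most "
       "important period of economic prosperity of this beginning of the century.")
deepl = ("Immigrants have rejuvenated our population and contributed to the most "
         "important period of economic prosperity at the beginning of this century.")

strict = jaccard_strict(m2m.split(' '), deepl.split(' '))
by_length = jaccard_list_lengths(m2m.split(' '), deepl.split(' '))
print(f'strict: {strict:.3f}, list-length denominator: {by_length:.3f}')
```

For this pair the strict variant yields the higher score, because repeated words such as "of" and "the" enter the union only once.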

In [33]:
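# compare how similar two translations are at the token level

def jaccard_set(list1, list2):
    """
    Jaccard-style similarity of two token lists.

    NOTE: the denominator is built from the list lengths, not the set sizes,
    so repeated tokens inflate the union and scores are lower than the strict
    set-based Jaccard index defined above.

    source: https://www.learndatasci.com/glossary/jaccard-similarity/
    """
    # number of distinct tokens that occur in both lists
    intersection = len(set(list1).intersection(list2))
    # approximates |A ∪ B|; exact only if neither list repeats a token
    union = (len(list1) + len(list2)) - intersection
    return float(intersection) / union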

expls['jsim'] = expls[['text_mt_m2m', 'text_mt_deepl']].apply(lambda x: jaccard_set(x.iloc[0].split(' '), x.iloc[1].split(' ')), axis=1)
for i, d in expls.sort_values('jsim', ascending=False).iterrows():
    print(f'jaccard similarity = {d.jsim:.03f}\n')
    print(f'{d.lang}: {d.text}')
    print(f'm2m: {d.text_mt_m2m}')
    print(f'deepl: {d.text_mt_deepl}')
    print('---\n')

jaccard similarity = 0.720

spa: Los inmigrantes han rejuvenecido nuestra población y han contribuido al periodo de prosperidad económica más importante de este comienzo de siglo.
m2m: Immigrants have rejuvenated our population and have contributed to the most important period of economic prosperity of this beginning of the century.
deepl: Immigrants have rejuvenated our population and contributed to the most important period of economic prosperity at the beginning of this century.
---

jaccard similarity = 0.611

fin: Kristillisdemokraatit haastavat suomalaiset kasvamaan muualta muuttaneiden ihmisten hyväksymisessä
m2m: Christian Democrats challenge the Finnish to grow in acceptance of people who have moved elsewhere
deepl: Christian Democrats challenge Finns to grow in accepting people who have moved from elsewhere
---

jaccard similarity = 0.531

fin: On tärkeää ymmärtää maahanmuuton syitä ja muuttajien haasteita, etteivät kielteiset ilmiöt saa kasvualustaa.
m2m: It is important to 

Summary of all pairwise comparisons:

In [34]:
from itertools import combinations

cols = ['text_mt_m2m', 'text_mt_opus', 'text_mt_google', 'text_mt_deepl']
# compute all pairs of values in `cols`
pairs = combinations(cols, 2)

jsims = {}
for a, b in pairs:
    vals = expls[[a, b]].apply(lambda x: jaccard_set(x.iloc[0].split(' '), x.iloc[1].split(' ')), axis=1)
    key = f'jsim_{a.split("_")[-1]}_{b.split("_")[-1]}'
    jsims[key] = {}
    for (l, m), s in zip(vals.groupby(expls.lang).mean().to_dict().items(), vals.groupby(expls.lang).std().to_list()):
        jsims[key][l] = {'mean': m, 'std': s}

In [35]:
pd.DataFrame(jsims).apply(lambda x: x.apply(lambda y: f"{y['mean']:.03f}±{y['std']:.03f}")).T

| pair | deu | fin | spa |
|:---- |:--- |:--- |:--- |
| jsim_m2m_opus | 0.554±0.103 | 0.546±0.179 | 0.559±0.143 |
| jsim_m2m_google | 0.575±0.094 | 0.548±0.099 | 0.545±0.108 |
| jsim_m2m_deepl | 0.407±0.124 | 0.467±0.185 | 0.537±0.132 |
| jsim_opus_google | 0.793±0.200 | 0.629±0.213 | 0.599±0.081 |
| jsim_opus_deepl | 0.641±0.123 | 0.585±0.286 | 0.634±0.087 |
| jsim_google_deepl | 0.671±0.160 | 0.560±0.286 | 0.678±0.042 |
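
Each cell reports the mean of the per-sentence similarities within one language, plus or minus their standard deviation. pandas' `.std()` computes the sample standard deviation (`ddof=1`) by default, as does `statistics.stdev`. A minimal standard-library sketch of that aggregation, using made-up scores (the values are hypothetical and not taken from this notebook):

```python
from collections import defaultdict
from statistics import mean, stdev

# hypothetical (language, similarity) pairs, for illustration only
scores = [('deu', 0.5), ('deu', 0.7), ('fin', 0.4), ('fin', 0.6), ('fin', 0.8)]

# collect the scores of each language
by_lang = defaultdict(list)
for lang, score in scores:
    by_lang[lang].append(score)

# sample standard deviation (n - 1 in the denominator), like pandas' default
summary = {lang: f'{mean(vals):.3f}±{stdev(vals):.3f}' for lang, vals in by_lang.items()}
print(summary)
```

With only a handful of sentences per language, such standard deviations are themselves very imprecise, so small differences between rows should not be over-interpreted.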


According to this metric (and given our very small sample), Google's translations are only slightly more similar to DeepL's than OPUS-MT's are, and the ranking depends on the language: for Finnish, OPUS-MT is closer to DeepL than Google is.

The problem with token-based metrics like Jaccard is that they do not account for the semantic similarity of sentences.
So let's use sentence embeddings to evaluate the translations' semantic similarity.
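
Embedding-based comparison rests on cosine similarity: the dot product of two vectors divided by the product of their lengths, which is 1 for vectors pointing in the same direction and 0 for orthogonal ones. A minimal pure-Python sketch with toy vectors standing in for sentence embeddings; `cosine` is a hypothetical helper written for illustration, whereas the cells below use `sklearn.metrics.pairwise.cosine_similarity`:

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# toy 3-dimensional vectors standing in for sentence embeddings
same_direction = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

Because the measure depends only on the direction of the vectors, two sentences with different wording can still score close to 1 if the model embeds them similarly.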

In [36]:
# compare the translations' similarity using sentence embeddings
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

sentence_transformer = SentenceTransformer('all-MiniLM-L6-v2', device=device)

In [37]:
sent1 = 'The existing offers of courses and employment programs for asylum seekers established by the cantons and municipalities must be assessed by the Federation in order to expand the “good practices”.'
sent2 = 'The existing courses and employment programmes for asylum seekers set up by the cantons and municipalities must be assessed by the Confederation in order to expand "good practices".'

embeddings = sentence_transformer.encode([sent1, sent2])

# compute the similarity between the two sentences

cosine_similarity(embeddings)[0,1]

0.9302455

In [38]:

def embedding_similarity(sent1, sent2):
    embeddings = sentence_transformer.encode([sent1, sent2])
    return cosine_similarity(embeddings)[0,1]

expls['esim'] = expls[['text_mt_m2m', 'text_mt_deepl']].apply(lambda x: embedding_similarity(x.iloc[0], x.iloc[1]), axis=1)
for i, d in expls.sort_values('esim', ascending=False).iterrows():
    print(f'embedding similarity = {d.esim:.03f}\n')
    print(f'{d.lang}: {d.text}')
    print(f'm2m: {d.text_mt_m2m}')
    print(f'deepl: {d.text_mt_deepl}')
    print('---\n')

embedding similarity = 0.997

spa: Los inmigrantes han rejuvenecido nuestra población y han contribuido al periodo de prosperidad económica más importante de este comienzo de siglo.
m2m: Immigrants have rejuvenated our population and have contributed to the most important period of economic prosperity of this beginning of the century.
deepl: Immigrants have rejuvenated our population and contributed to the most important period of economic prosperity at the beginning of this century.
---

embedding similarity = 0.975

spa: Apoyaremos de manera decidida las iniciativas de codesarrollo, contribuyendo a que los inmigrantes que voluntariamente lo deseen puedan destinar parte de sus ingresos a inversiones productivas y creación de empresas en sus países de origen.
m2m: We will strongly support co-development initiatives, contributing to the fact that immigrants who voluntarily wish to do so can allocate part of their income to productive investments and the creation of in their country of origi

In [39]:
from itertools import combinations

cols = ['text_mt_m2m', 'text_mt_opus', 'text_mt_google', 'text_mt_deepl']
pairs = combinations(cols, 2)

esims = {}
for a, b in pairs:
    vals = expls[[a, b]].apply(lambda x: embedding_similarity(x.iloc[0], x.iloc[1]), axis=1)
    key = f'esim_{a.split("_")[-1]}_{b.split("_")[-1]}'
    esims[key] = {}
    for (l, m), s in zip(vals.groupby(expls.lang).mean().to_dict().items(), vals.groupby(expls.lang).std().to_list()):
        esims[key][l] = {'mean': m, 'std': s}

In [40]:
pd.DataFrame(esims).apply(lambda x: x.apply(lambda y: f"{y['mean']:.03f}±{y['std']:.03f}")).T

| pair | deu | fin | spa |
|:---- |:--- |:--- |:--- |
| esim_m2m_opus | 0.910±0.083 | 0.867±0.033 | 0.935±0.069 |
| esim_m2m_google | 0.914±0.084 | 0.834±0.077 | 0.912±0.084 |
| esim_m2m_deepl | 0.892±0.079 | 0.878±0.058 | 0.920±0.077 |
| esim_opus_google | 0.990±0.008 | 0.915±0.077 | 0.947±0.059 |
| esim_opus_deepl | 0.966±0.014 | 0.936±0.067 | 0.960±0.058 |
| esim_google_deepl | 0.966±0.020 | 0.897±0.070 | 0.978±0.019 |


According to this metric, OPUS-MT's translations are about as similar to DeepL's as Google's are.
But keep our small sample size in mind.