# Data Translation

The following code automatically translates our German text data into English. We use three different translators for this task: the pretrained models MarianMT and MBart, and Google Translator.

Note: ChatGPT was consulted to a limited extent while developing the code, mainly for troubleshooting.

### Housekeeping

Here, we import the data and packages that we need for the translation.

In [1]:
# install packages if necessary
!pip install pandas
!pip install transformers
!pip install sentencepiece
!pip install googletrans  # provides the Translator class imported below
!pip install sacremoses

In [2]:
# import packages
import pandas as pd
from transformers import MarianMTModel, MarianTokenizer
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
from googletrans import Translator

In [None]:
# import data 
data = pd.read_csv("../data/raw/data.csv") 
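
`pd.read_csv` reads empty cells as `NaN` (a float), which the tokenizers used below cannot process. The following is a minimal sketch of a guard that could wrap any of the translation functions. The name `skip_missing` and the stand-in `fake_translate` are hypothetical and not part of the original notebook.

```python
import math

def skip_missing(translate_fn):
    """Wrap a translation function so that missing values pass through untouched."""
    def wrapper(text):
        # None and NaN (a float) mark missing cells; return them unchanged
        if text is None or (isinstance(text, float) and math.isnan(text)):
            return text
        # empty or whitespace-only strings need no translation
        if not str(text).strip():
            return text
        return translate_fn(str(text))
    return wrapper

# stand-in for a real translation function, for illustration only
def fake_translate(text):
    return text.upper()

safe_translate = skip_missing(fake_translate)
results = [safe_translate(v) for v in ["guten Tag", "", None, float("nan")]]
```

In the notebook, this could be used as `data['text'].apply(skip_missing(translate_marian))`, assuming the translation function is defined as in the sections below.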

### MarianMT

Translation with the model MarianMT:

In [4]:
# define the model that is used for translation
marian = "Helsinki-NLP/opus-mt-de-en"
model = MarianMTModel.from_pretrained(marian)
tokenizer = MarianTokenizer.from_pretrained(marian)

In [5]:
# define function to translate text data
def translate_marian(text):
    # tokenize the text input 
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    # generate translation 
    translated_tokens = model.generate(**inputs)
    # decode the translated tokens into string
    translated_text = tokenizer.decode(translated_tokens[0], skip_special_tokens=True)
    return translated_text
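
Because `truncation=True` cuts off any input that exceeds the model's maximum input length, the end of a long text would be dropped without warning. A possible workaround, sketched here under the assumption that sentences can be split on end-of-sentence punctuation, is to translate long texts in smaller chunks. The helpers `split_into_chunks` and `translate_long`, and the character limit, are illustrative choices and not part of the original code.

```python
import re

def split_into_chunks(text, max_chars=400):
    """Group sentences into chunks of at most max_chars characters (assumed limit)."""
    # naive sentence split on ., ! or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        # a single sentence longer than max_chars is kept whole
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks

def translate_long(text, translate_fn, max_chars=400):
    """Translate each chunk separately and join the results with spaces."""
    return " ".join(translate_fn(chunk) for chunk in split_into_chunks(text, max_chars))

example = "Das ist ein Satz. Das ist noch ein Satz! Und ein dritter?"
chunks = split_into_chunks(example, max_chars=45)
```

With the function defined above, a call might look like `translate_long(text, translate_marian)`. A character limit is only a rough proxy for the token limit, so the value would need to be checked against the data.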

In [None]:
# apply the translation function to the "text" column
data['text_eng'] = data['text'].apply(translate_marian)

# write data into a file 
data.to_csv("../data/translated/data_marian.csv", index=False)
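
Applying `translate_marian` row by row calls `model.generate` once per text. Passing several texts to the tokenizer at once is usually faster; the following sketch shows one way this might look. The helper names `batched`, `translate_batch`, and `translate_all`, as well as the batch size of 16, are assumptions for illustration.

```python
def batched(items, batch_size=16):
    """Yield consecutive slices of items with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def translate_batch(texts, model, tokenizer):
    """Translate a list of strings in a single generate call."""
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    translated_tokens = model.generate(**inputs)
    # batch_decode turns every generated sequence back into a string
    return tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)

def translate_all(texts, model, tokenizer, batch_size=16):
    """Translate all texts batch by batch, preserving their order."""
    results = []
    for batch in batched(list(texts), batch_size):
        results.extend(translate_batch(batch, model, tokenizer))
    return results

# quick check of the batching logic without loading a model
batch_sizes = [len(b) for b in batched(list(range(40)), batch_size=16)]
```

In the notebook, the column could then be filled with `data['text_eng'] = translate_all(data['text'].tolist(), model, tokenizer)`. A suitable batch size depends on the available memory and the length of the texts.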

### MBart

Translation with the model MBart: 

In [7]:
# define the model that is used for translation
mbart = "facebook/mbart-large-50-many-to-many-mmt"
model = MBartForConditionalGeneration.from_pretrained(mbart)
tokenizer = MBart50TokenizerFast.from_pretrained(mbart)

In [8]:
# define function to translate text data
def translate_mbart(text):
    # set the source language to German
    tokenizer.src_lang = "de_DE"
    # tokenize the text input
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    # generate translation, forcing English as the target language
    translated_tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
    # decode the translated tokens into string
    translated_text = tokenizer.decode(translated_tokens[0], skip_special_tokens=True)
    return translated_text

In [None]:
# apply the translation function 
data['text_eng'] = data['text'].apply(translate_mbart)

# write data into a file
data.to_csv("../data/translated/data_mbart.csv", index=False)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


### Google Translator

Translation with Google Translator:

In [10]:
# define function to translate text data
def translate_google(text):
    translator = Translator()
    translated = translator.translate(text, src='de', dest='en')
    return translated.text

In [None]:
# apply the translation function 
data['text_eng'] = data['text'].apply(translate_google)

# write data into a file
data.to_csv("../data/translated/data_google.csv", index=False)
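
Unlike the two local models, this approach sends every text to a web service, so individual requests may fail, for example because of network problems or rate limiting. A minimal sketch of a retry wrapper follows. `with_retries`, the number of attempts, and the waiting time are hypothetical choices and not part of the original notebook.

```python
import time

def with_retries(translate_fn, attempts=3, wait_seconds=1.0):
    """Return a version of translate_fn that retries after an exception."""
    def wrapper(text):
        for attempt in range(1, attempts + 1):
            try:
                return translate_fn(text)
            except Exception:
                # give up after the last attempt and re-raise the error
                if attempt == attempts:
                    raise
                # wait a little longer after each failed attempt
                time.sleep(wait_seconds * attempt)
    return wrapper

# stand-in that fails twice before succeeding, for illustration only
calls = {"count": 0}

def flaky_translate(text):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return f"translated: {text}"

result = with_retries(flaky_translate, attempts=3, wait_seconds=0.01)("Hallo")
```

Applied to the notebook, this might read `data['text'].apply(with_retries(translate_google))`.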