# BATCH MACHINE TRANSLATION WITH HUGGING FACE AND MARIANMT (1/3)

Machine translation doesn't generate as much excitement as other emerging areas in NLP, in part because consumer-facing services like Google Translate have been around since [April 2006](https://en.wikipedia.org/wiki/Google_Translate). But recent advances in NLP, particularly the work by Hugging Face in making transformer models more easily accessible, have opened up interesting new possibilities in this space.

For one, small-batch translation in multiple languages can now be run from a desktop or laptop without having to subscribe to an [expensive service](https://azure.microsoft.com/en-us/pricing/details/cognitive-services/translator/). No doubt the translations produced by [neural machine translation](https://www.microsoft.com/en-us/translator/business/machine-translation/#nnt) models are not (yet) as artful or precise as those by a skilled human translator. But they get 60% or more of the job done, in my view. Depending on your use case, that could be a huge time saver, to say nothing of the fact that skilled human translators are a rarity in most workplaces.

Over three short notebooks, I'll demo a simple workflow for using [Hugging Face's version of MarianMT](https://huggingface.co/transformers/model_doc/marian.html) to:
* translate 3 English speeches of varying lengths to Chinese
* translate 5 English news stories on Covid-19 (under 500 words) to Chinese
* translate 3 Chinese speeches to English

These notebooks took just minutes to run on my late-2015 iMac (32GB), and could run faster or slower depending on your hardware set-up. You can scale up the number of speeches/articles as you wish, though running a `.py` file on the command line is a better idea if you are going for large numbers. I expect notebooks to crash for CSV files with large numbers of speeches/articles.
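
For larger jobs outside a notebook, one option is to stream the CSV one row at a time and write each translated row out as soon as it is ready, so a crash part-way through does not lose earlier work. The sketch below is only an illustration under assumptions: `translate_csv` is a hypothetical helper, the `Text` and `Machine_Translation` column names mirror the ones used later in this notebook, and `str.upper` is a placeholder for a real translation function.

```python
import csv
import io


def translate_csv(src, dst, translate, text_col="Text", out_col="Machine_Translation"):
    """Read rows from src, translate one column, write each row to dst immediately."""
    reader = csv.DictReader(src)
    fieldnames = list(reader.fieldnames or []) + [out_col]
    writer = csv.DictWriter(dst, fieldnames=fieldnames)
    writer.writeheader()
    count = 0
    for row in reader:
        # Only one row is held in memory at a time
        row[out_col] = translate(row[text_col])
        writer.writerow(row)
        count += 1
    return count


# Demo with in-memory files and a placeholder "translator" (str.upper)
src = io.StringIO("Title,Text\nDemo,hello world\n")
dst = io.StringIO()
rows_done = translate_csv(src, dst, translate=str.upper)
print(rows_done, "row(s) written")
```

In a real script, `src` and `dst` would be files opened with `open(path, newline="", encoding="utf-8")`, and `translate` would be the translation function defined later in this notebook.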

## DATASET

The first dataset comprises 11 speeches in 4 languages (English, Malay, Chinese, and Tamil) taken from the website of the [Singapore Prime Minister's Office](https://www.pmo.gov.sg/).

The second dataset consists of 5 English news stories on Covid-19 published on the website of Singapore news outlet [CNA](https://www.channelnewsasia.com/) in March 2020.

## RESULTS

The output CSV files with the translated text and original copy can be downloaded [here](https://www.dropbox.com/sh/q4vy8recuib4rs9/AACagocH9mkrdj0yKKqrO0vaa).

## CURRENT LIMITS

At the time of writing, you can tap into more than 1,300 open-source transformer models for machine translation on [Hugging Face's model hub](https://huggingface.co/Helsinki-NLP?utm_campaign=Hugging%2BFace&utm_medium=email&utm_source=Hugging_Face_1). Finding a model with the right language pairing is the first step towards deciding whether this approach suits your use case.

As no MarianMT models for English-Malay and English-Tamil (and vice versa) have been released to date, this series of notebooks will not deal with these two languages for now. I'll revisit them as and when the models are available.

## MEDIUM POST

A little more context and discussion here in my Medium post: https://bit.ly/31ldeyj

In [1]:
import nltk
import pandas as pd
import re

from nltk.tokenize import sent_tokenize
from transformers import MarianMTModel, MarianTokenizer



# ENGLISH-TO-CHINESE MACHINE TRANSLATION: 3 SPEECHES

In this notebook, we'll focus on an efficient workflow to translate 3 mid-length English speeches (ranging from 1,352 to 1,750 words) to Chinese.

For best results, the sentences should be translated one at a time. This requires a minor workaround using `sent_tokenize` from nltk to split each speech into sentences first.

# 1. LOAD DATA, SELECT ENGLISH SPEECHES

In [2]:
raw = pd.read_csv('../data/translation_speeches.csv')

In [3]:
eng = raw[raw['Language'] == 'English'].copy()

eng.head()

Unnamed: 0,Date,Speaker,Title,Language,Text,URL
0,2020-04-30,Lee Hsien Loong,May Day Message 2020,English,"This year, we mark May Day amidst difficult ci...",https://www.pmo.gov.sg/Newsroom/PM-Lee-Hsien-L...
4,2020-07-27,Lee Hsien Loong,Cabinet Swearing-in Ceremony,English,Singaporeans have just gone through a crucial ...,https://www.pmo.gov.sg/Newsroom/Speech-by-PM-L...
7,2020-08-09,Lee Hsien Loong,National Day Message 2020,English,"Every year, rain or shine, Singaporeans come t...",https://www.pmo.gov.sg/Newsroom/National-Day-M...


# 2. GENERATE TRANSLATION

## 2.1 DEFINE FUNCTIONS FOR TEXT CLEANING, TRANSLATION

We'll use 2 simple functions: one to lightly clean the text, and a second to tokenize the cleaned text at the sentence level and translate the batched input.

In [4]:
def clean_text(text):
    # Replace newlines with spaces (this also covers double newlines)
    text = re.sub(r"\n", " ", text)
    # Collapse runs of spaces into a single space, then trim both ends
    text = re.sub(r" +", " ", text)
    return text.strip()
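
As a quick check of what the cleaning step does, the sketch below applies a condensed but equivalent version of `clean_text` to a made-up string. The sample text is invented for illustration and is not taken from the dataset.

```python
import re


def clean_text(text):
    # Condensed equivalent of the notebook's function:
    # newlines become spaces, runs of spaces collapse, ends are trimmed
    text = re.sub(r"\n", " ", text)
    text = re.sub(r" +", " ", text)
    return text.strip()


sample = "This year,\n\nwe mark   May Day.\n"
cleaned = clean_text(sample)
print(repr(cleaned))
```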

In [5]:
# function adapted from https://github.com/plotly/dash-sample-apps/blob/master/apps/dash-translate/ColabDemo.ipynb

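def translate(text):
    # Return a plain string (not a one-element tuple) for empty input
    if text is None or text == "":
        return "Error"

    # Split the text into sentences, then tokenize them as one padded batch.
    # Note: prepare_translation_batch() belongs to the transformers release used here;
    # it was deprecated in later releases, where the tokenizer is called directly instead.
    batch = tokenizer.prepare_translation_batch(sent_tokenize(text))

    # Run the model and decode each translated sentence
    translated = model.generate(**batch)
    tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]

    # Re-join the translated sentences into a single string
    return " ".join(tgt_text)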
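
The `translate()` function sends every sentence of a speech to the model as a single batch. For much longer documents, that batch could become large, so one possible variation is to feed the sentences through in fixed-size chunks. The sketch below shows only the chunking logic. `chunked`, `translate_in_chunks` and the chunk size are hypothetical, and `fake_translate_batch` is a placeholder for a function that would wrap the tokenizer and model.

```python
from typing import Callable, Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_in_chunks(
    sentences: List[str],
    translate_batch: Callable[[List[str]], List[str]],
    size: int = 8,
) -> str:
    """Translate sentences chunk by chunk and join the results in order."""
    out: List[str] = []
    for chunk in chunked(sentences, size):
        out.extend(translate_batch(chunk))
    return " ".join(out)


def fake_translate_batch(batch: List[str]) -> List[str]:
    # Placeholder: a real version would tokenize the batch and call model.generate()
    return [sentence.upper() for sentence in batch]


sentences = ["one.", "two.", "three.", "four.", "five."]
chunks = list(chunked(sentences, 2))
result = translate_in_chunks(sentences, fake_translate_batch, size=2)
print(result)
```

Smaller chunks should keep memory use lower at the cost of more calls to the model; the right size depends on your hardware and sentence lengths.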

## 2.2 PICK THE RIGHT MODEL FOR YOUR USE CASE

The full list is [here](https://huggingface.co/Helsinki-NLP?utm_campaign=Hugging%2BFace&utm_medium=email&utm_source=Hugging_Face_1). The names of the models can change without notice, so do check before use.

In [6]:
# An older version of the model was listed as Helsinki-NLP/opus-mt-eng-zho
# The version used here was updated on Aug 19 2020

model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-zh")

tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-zh")
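
The model name used above follows an `opus-mt-<source>-<target>` pattern. As a small convenience, the name could be built from the language codes, as in the sketch below. `marian_model_name` is a hypothetical helper, and the pattern is an assumption based on the name used in this notebook: not every language pair has a model, and names can change, so check the model hub before relying on the result.

```python
def marian_model_name(src: str, tgt: str, org: str = "Helsinki-NLP") -> str:
    """Build a hub model name from source and target language codes."""
    return f"{org}/opus-mt-{src}-{tgt}"


en_to_zh = marian_model_name("en", "zh")
print(en_to_zh)
```

The resulting string could then be passed to both `MarianMTModel.from_pretrained()` and `MarianTokenizer.from_pretrained()`, which keeps the two in sync.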

## 2.3 TIME THE RUN AND RE-ORG THE OUTPUT CSV FOR CLARITY

The 3 speeches here are not highly technical. But they cover quite a wide range of issues, from Covid-19 to electoral politics, and highly localised concerns like Singapore's annual National Day celebrations. To be able to get the translations of such speeches within minutes is a huge deal, in my view.

Careful checking of the machine translation results would take far more time, of course. For those who read Chinese, a quick glance at the [results](https://www.dropbox.com/sh/q4vy8recuib4rs9/AACagocH9mkrdj0yKKqrO0vaa) will show some obvious errors. But overall, the machine translations via HF and MarianMT hold up pretty well against Google Translate.

In [7]:
%%time

eng["Clean_Text"] = eng["Text"].map(clean_text)

eng["Machine_Translation"] = eng["Clean_Text"].map(translate)

CPU times: user 6min 32s, sys: 30.9 s, total: 7min 3s
Wall time: 2min 33s
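
The CPU time above is larger than the wall time, which suggests the work was spread across several CPU cores. The `%%time` magic only works in a notebook; in a plain `.py` script, a similar measurement could be sketched with the standard library as below. `timed` is a hypothetical helper and the summing workload is a placeholder for the translation step.

```python
import time


def timed(func, *args, **kwargs):
    """Run func and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = func(*args, **kwargs)
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return result, wall, cpu


# Placeholder workload standing in for the translation step
result, wall, cpu = timed(sum, range(1_000_000))
print(f"Wall time: {wall:.2f}s, CPU time: {cpu:.2f}s")
```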


In [8]:
cols = ["Date", "Speaker", "Title", "Text", "Machine_Translation", "URL"]

speeches_translated = eng[cols]

In [9]:
speeches_translated.head()

Unnamed: 0,Date,Speaker,Title,Text,Machine_Translation,URL
0,2020-04-30,Lee Hsien Loong,May Day Message 2020,"This year, we mark May Day amidst difficult ci...","今年,我们在困难的条件下纪念“五月节”。 COVID-19最新消息,COVID-19流行病仍...",https://www.pmo.gov.sg/Newsroom/PM-Lee-Hsien-L...
4,2020-07-27,Lee Hsien Loong,Cabinet Swearing-in Ceremony,Singaporeans have just gone through a crucial ...,"新加坡人民刚刚经历了一场在一场巨大危机中举行的至关重要的大选。 我们经历了6个困难的月,处理...",https://www.pmo.gov.sg/Newsroom/Speech-by-PM-L...
7,2020-08-09,Lee Hsien Loong,National Day Message 2020,"Every year, rain or shine, Singaporeans come t...","新加坡人每年8月9日聚集一堂, 参加全国日游行, 庆祝我们国家的形成, 并重申我们对新加坡的...",https://www.pmo.gov.sg/Newsroom/National-Day-M...


In [10]:
#uncomment to generate a separate file if you need it
#speeches_translated.to_csv('speeches_en_to_cn.csv', index=False)

# 3. DASH APP

Plotly has released a good number of sample apps for transformers-based NLP tasks, including one that works with HF's version of MarianMT. Try it out on [Colab](https://github.com/plotly/dash-sample-apps/blob/master/apps/dash-translate/ColabDemo.ipynb), or via [GitHub](https://github.com/plotly/dash-sample-apps).

The GIF below shows how I used it on one of the speeches in this dataset (with the app's headline edited).

![](https://miro.medium.com/max/1400/1*TNRk5NqqdJp9J9idLZTQNA.gif)