# BATCH MACHINE TRANSLATION WITH HUGGING FACE AND MARIANMT (2/3)

# ENGLISH-TO-CHINESE MACHINE TRANSLATION: 5 NEWS ARTICLES

In this notebook, we'll test the same English-to-Chinese model used in notebook 1/3 on a different but common use case: translating English news articles into Chinese.

The 5 articles were randomly picked from a batch published by [CNA](https://www.channelnewsasia.com/) in March 2020. Feel free to swap out different English news articles for your own trial.

The workflow is exactly the same as in 1/3, with minor changes to the text cleaning function. Cleaning is the one area that will need some additional work, depending on the source of the articles.

In [1]:
import re

import pandas as pd

# sent_tokenize relies on NLTK's Punkt sentence tokenizer data being installed
from nltk.tokenize import sent_tokenize
from transformers import MarianMTModel, MarianTokenizer



# 1. LOAD DATA: ENGLISH NEWS ARTICLES

In [2]:
raw = pd.read_csv('../data/cna_sample.csv')

In [3]:
raw.head()

| | Source | Date | Story_Headline | Story_Text | Word_Count | URL |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | CNA | 2020-03-19 | Wuhan's new COVID-19 cases could cease by mid-... | SHANGHAI: Wuhan is expected to see new coronav... | 315 | https://www.channelnewsasia.com/news/asia/coro... |
| 1 | CNA | 2020-03-04 | Researchers identify two novel coronavirus typ... | SHANGHAI: Scientists in China studying the vir... | 481 | https://www.channelnewsasia.com/news/asia/covi... |
| 2 | CNA | 2020-03-05 | Bethlehem's Church of the Nativity ordered clo... | BETHLEHEM: The Church of the Nativity was orde... | 436 | https://www.channelnewsasia.com/news/world/cor... |
| 3 | CNA | 2020-03-19 | NATAS travel fair in May cancelled as COVID-19... | SINGAPORE: Amid a COVID-19 outbreak that has p... | 301 | https://www.channelnewsasia.com/news/singapore... |
| 4 | CNA | 2020-03-12 | China's coronavirus epicentre Hubei sees singl... | BEIJING: China confirmed only eight new corona... | 425 | https://www.channelnewsasia.com/news/asia/coro... |


# 2. GENERATE TRANSLATION

## 2.1 DEFINE FUNCTIONS FOR TEXT CLEANING, TRANSLATION

We'll use two simple functions: one to lightly clean the text, and another to split the cleaned text into sentences and translate them as a single batch.

Articles scraped from news sites tend to come with assorted ads and promo messages, so additional cleaning rules will be needed if you add articles from different sources to the mix.

In [4]:
def clean_text(text):
    # Turn newlines into spaces first so the promo pattern below matches on one line
    text = re.sub(r"\n", " ", text)
    text = re.sub(r"Advertisement", " ", text)
    # CNA promo line; it appears with either "coronavirus" or "COVID-19"
    text = re.sub(
        r"Download our app or subscribe to our Telegram channel for the latest "
        r"updates on the (?:coronavirus|COVID-19) outbreak: https://cna\.asia/telegram",
        " ",
        text,
    )
    # Replace runs of spaces with a single space and trim both ends
    text = re.sub(" +", " ", text).strip()
    return text
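
If articles from other sites are added to the mix, one way to keep the cleaning rules manageable is to store them per source. The following is a minimal sketch, not part of the original workflow: the `BOILERPLATE` dictionary and the `clean_text_for` name are hypothetical, and only the CNA patterns come from the cleaning function above.

```python
import re

# Hypothetical per-source boilerplate patterns. Only the CNA entries are taken
# from this notebook; patterns for other sources would need to be added by hand.
BOILERPLATE = {
    "CNA": [
        r"Advertisement",
        r"Download our app or subscribe to our Telegram channel for the latest "
        r"updates on the (?:coronavirus|COVID-19) outbreak: "
        r"https://cna\.asia/telegram",
    ],
}


def clean_text_for(source, text):
    """Strip known boilerplate for `source`, then normalise whitespace."""
    # Collapse all whitespace first so multi-line promo text matches its pattern
    text = re.sub(r"\s+", " ", text)
    for pattern in BOILERPLATE.get(source, []):
        text = re.sub(pattern, " ", text)
    # Removing boilerplate can leave runs of spaces behind
    return re.sub(r" +", " ", text).strip()


sample = "SINGAPORE: Hello.\n\nAdvertisement\n\nMore text."
print(clean_text_for("CNA", sample))
```

An unknown source simply falls through to the whitespace clean-up, so new sources can be added one at a time.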

In [5]:
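def translate(text):
    if text is None or text == "":
        return "Error"

    # Split the text into sentences and tokenize them as one padded batch.
    # prepare_translation_batch belongs to the transformers release used here;
    # later releases call the tokenizer directly instead, e.g.
    # tokenizer(sentences, return_tensors="pt", padding=True).
    batch = tokenizer.prepare_translation_batch(sent_tokenize(text))

    # Run the model and decode the output, one string per sentence
    translated = model.generate(**batch)
    tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]

    return " ".join(tgt_text)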
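
The function above sends every sentence of an article to the model in a single batch, which may become heavy on memory for long articles. A possible variation is to translate a few sentences at a time. This is only a sketch under assumptions: `chunked`, `translate_in_chunks`, and `translate_batch` are hypothetical names, and `translate_batch` stands for any callable that wraps the tokenize, generate, and decode steps for a list of sentences.

```python
def chunked(items, size):
    """Yield successive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_in_chunks(sentences, translate_batch, chunk_size=8):
    """Translate `sentences` a few at a time and join the results.

    `translate_batch` is a hypothetical callable that takes a list of source
    sentences and returns a list of translated strings of the same length.
    """
    out = []
    for chunk in chunked(sentences, chunk_size):
        out.extend(translate_batch(chunk))
    return " ".join(out)


# Stand-in translator so the sketch runs without downloading a model
def fake_translate(batch):
    return [s.upper() for s in batch]


print(translate_in_chunks(["a b.", "c d.", "e f."], fake_translate, chunk_size=2))
```

The chunk size of 8 is an arbitrary default; a suitable value depends on sentence length and the available memory.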

## 2.2 PICK THE RIGHT MODEL FOR YOUR USE CASE

The full list of Helsinki-NLP models is [here](https://huggingface.co/Helsinki-NLP?utm_campaign=Hugging%2BFace&utm_medium=email&utm_source=Hugging_Face_1). Model names can change without notice, so check the list before use.
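
Since a model may be listed under more than one name over time, one defensive option is to try a short list of candidate names in order. The sketch below is hypothetical and not part of the original workflow: `first_loadable` is an illustrative helper, and it assumes the loader raises `OSError` for an unknown name, which should be verified for `MarianMTModel.from_pretrained` in your transformers version. The two names are the older and current listings of the English-to-Chinese model used in this notebook.

```python
def first_loadable(candidates, loader):
    """Return (name, loaded_object) for the first name that `loader` accepts.

    `loader` is any callable that raises OSError for an unknown name. That
    MarianMTModel.from_pretrained behaves this way is an assumption here.
    """
    errors = {}
    for name in candidates:
        try:
            return name, loader(name)
        except OSError as exc:
            errors[name] = str(exc)
    raise LookupError(f"none of the candidate names could be loaded: {errors}")


# Stand-in loader so the sketch runs offline; swap in
# MarianMTModel.from_pretrained for real use.
def fake_loader(name):
    if name != "Helsinki-NLP/opus-mt-en-zh":
        raise OSError(f"{name} not found")
    return f"<model {name}>"


names = ["Helsinki-NLP/opus-mt-eng-zho", "Helsinki-NLP/opus-mt-en-zh"]
print(first_loadable(names, fake_loader)[0])
```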

In [6]:
# An older version of the model was listed as Helsinki-NLP/opus-mt-eng-zho
# The version used here was updated on Aug 19 2020

model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-zh")

tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-zh")

## 2.3 TIME THE RUN AND RE-ORG THE OUTPUT CSV FOR CLARITY

A slight change from 1/3: the headlines are translated separately from the story text.

In [7]:
%%time
raw["Headline_Translation"] = raw["Story_Headline"].map(translate)

CPU times: user 10.8 s, sys: 93.1 ms, total: 10.9 s
Wall time: 2.93 s


In [8]:
%%time

raw["Clean_Text"] = raw["Story_Text"].map(clean_text)

raw["Story_Translation"] = raw["Clean_Text"].map(translate)

CPU times: user 2min 25s, sys: 3.84 s, total: 2min 28s
Wall time: 38.2 s
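
The `%%time` magic only works inside IPython or Jupyter. If the same steps are moved into a plain Python script, a small helper built on `time.perf_counter` gives a comparable wall-time reading. This is a minimal sketch; `timed` is a hypothetical helper name, and `sum` stands in for the translation call.

```python
import time


def timed(fn, *args, **kwargs):
    """Call `fn` and return its result with the elapsed wall time in seconds."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# `sum` is a stand-in for a slow call such as translating one article
result, seconds = timed(sum, range(1_000_000))
print(f"Wall time: {seconds:.3f} s")
```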


In [9]:
cols = [
    "Source",
    "Date",
    "Story_Headline",
    "Headline_Translation",
    "Story_Text",
    "Story_Translation",
    "URL",
]

cna = raw[cols]


In [10]:
cna.head()

| | Source | Date | Story_Headline | Headline_Translation | Story_Text | Story_Translation | URL |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | CNA | 2020-03-19 | Wuhan's new COVID-19 cases could cease by mid-... | 武汉新的COVID-19案件可能在3月中叶停止:报告 | SHANGHAI: Wuhan is expected to see new coronav... | 上海:预计武汉将看到新的冠状病毒感染在三月中至末会干涸, 中华市(疫情中心)的封锁一旦14天... | https://www.channelnewsasia.com/news/asia/coro... |
| 1 | CNA | 2020-03-04 | Researchers identify two novel coronavirus typ... | 研究者发现两种新型的冠状病毒 是中国的病例呈下降趋势 | SHANGHAI: Scientists in China studying the vir... | 研究病毒疾病爆发起源的中国科学家说,他们发现两种主要类型的新冠状病毒可能造成感染。 来自北京... | https://www.channelnewsasia.com/news/asia/covi... |
| 2 | CNA | 2020-03-05 | Bethlehem's Church of the Nativity ordered clo... | 伯利恒的圣诞教堂下令关闭COVID-19恐惧 | BETHLEHEM: The Church of the Nativity was orde... | BETHLEHEM:耶稣降生堂于星期四(Mar 5)被下令关闭,在巴勒斯坦伯利恒镇发现4起疑... | https://www.channelnewsasia.com/news/world/cor... |
| 3 | CNA | 2020-03-19 | NATAS travel fair in May cancelled as COVID-19... | 由于COVID-19疫情继续爆发,5月取消了NATAS旅行展览会 | SINGAPORE: Amid a COVID-19 outbreak that has p... | 新加坡:新加坡全国旅行社协会(NATAS)取消了定于5月举办的展览会, 并将“在适当时候决定... | https://www.channelnewsasia.com/news/singapore... |
| 4 | CNA | 2020-03-12 | China's coronavirus epicentre Hubei sees singl... | 中国的冠状病毒中心湖北首次看到一位数的病例 | BEIJING: China confirmed only eight new corona... | 北京:中国证实湖北省仅有8个新的冠状病毒感染病例, 疫情中心首次记录了一位数的每日统计, 因... | https://www.channelnewsasia.com/news/asia/coro... |


In [11]:
# Uncomment to generate a separate file if you need it
# cna.to_csv('cna_translated.csv', index=False)

## NOTE:

The model seems to do better with natural-sounding text, such as speeches. Translation quality seems to suffer a bit on news writing, which is a peculiar written format unto itself. I suspect that fine-tuning the machine translation model for news articles, or for any specific genre of writing, won't be trivial. For one, assembling the training dataset for such a task would be a huge challenge.
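
One visible symptom in the output above is source text left untranslated, such as "BETHLEHEM" and "Mar 5" in the third story. A crude way to flag such rows for manual review is to look for leftover Latin-script tokens in the translation. This is a rough heuristic rather than a quality metric; `untranslated_tokens` and the `KEEP` allowlist are hypothetical names introduced for illustration.

```python
import re

# Hypothetical allowlist of tokens that are fine to leave in Latin script
KEEP = {"COVID-19", "NATAS"}


def untranslated_tokens(translation, keep=KEEP):
    """Return Latin-script tokens in `translation` that are not in `keep`."""
    # A token starts with a letter and may continue with letters, digits, hyphens
    tokens = re.findall(r"[A-Za-z][A-Za-z0-9\-]*", translation)
    return [t for t in tokens if t not in keep]


print(untranslated_tokens("BETHLEHEM:耶稣降生堂于星期四(Mar 5)被下令关闭"))
```

Rows that return a non-empty list are candidates for a closer look; an empty list says nothing about whether the translation is actually good.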

Will zero-shot/few-shot models be a game changer for machine translation? We'll see.