# BATCH MACHINE TRANSLATION WITH HUGGING FACE AND MARIANMT (3/3)

# CHINESE-TO-ENGLISH MACHINE TRANSLATION: 3 SPEECHES

In this notebook, we'll try Chinese-to-English translation instead, using the official Chinese versions of the same 3 speeches from notebook 1/3. Do note that the Chinese speeches differ slightly from their English versions, and aren't direct phrase-for-phrase or word-for-word translations. Also, I'm **not** feeding the translated results from 1/3 back in for a "reverse translation". Bear this in mind if you are comparing the results here with those from 1/3.

Machine translation of Chinese text brings a new set of challenges. For one, nltk's sentence tokenizer won't work. I didn't manage to get good results with [jieba](https://github.com/fxsjy/jieba) either.

My workaround, a function that splits the text on the Chinese full stop (。) and puts each sentence in its own row, is a bit clumsy, but it gets the job done until I find a more elegant solution. If you have one, please share.
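For comparison, here is a minimal sketch of another way to split Chinese text into sentences: a regex that breaks on sentence-ending punctuation and keeps the punctuation attached. The function name `split_zh_sentences` and the set of end marks (。！？ plus the ASCII `!` and `?`) are my own assumptions, not part of any library, and the sketch makes no attempt to handle closing quotation marks that follow a full stop.

```python
import re

# Assumed sentence-ending marks: Chinese full stop, exclamation and question
# marks, plus their ASCII counterparts. Adjust to suit your corpus.
_SENTENCE = re.compile(r"[^。！？!?]+[。！？!?]*")


def split_zh_sentences(text):
    """Split text into sentences, keeping the end punctuation on each one."""
    pieces = _SENTENCE.findall(text)
    # Drop fragments that are only whitespace and trim the rest
    return [p.strip() for p in pieces if p.strip()]


sentences = split_zh_sentences("今年很难。你好吗？谢谢")
print(sentences)
```

Unlike a plain `str.split("。")`, this keeps the punctuation with each sentence, which may give the translation model a little more context about where a sentence ends.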


## MEDIUM POST

A little more context and discussion here in my Medium post: https://bit.ly/31ldeyj

In [1]:
import pandas as pd
import re

from transformers import MarianMTModel, MarianTokenizer



# 1. LOAD DATA, SELECT CHINESE SPEECHES

In [2]:
raw = pd.read_csv('../data/translation_speeches.csv')

In [3]:
ch = raw[raw['Language'] == 'Chinese'].copy()

ch.head()

|  | Date | Speaker | Title | Language | Text | URL |
|---|---|---|---|---|---|---|
| 1 | 2020-04-30 | Lee Hsien Loong | May Day Message 2020 | Chinese | 今年，我们在艰难的环境中庆祝五一劳动节。\n冠病疫情最新情况\n2019冠状病毒疾病（COV... | https://www.pmo.gov.sg/Newsroom/PM-Lee-Hsien-L... |
| 5 | 2020-07-27 | Lee Hsien Loong | Cabinet Swearing-in Ceremony | Chinese | 过去六个月，大家生活不易，全民一心对抗冠病疫情和应付经济危机。整体上，我国疫情已经受到控制，... | https://www.pmo.gov.sg/Newsroom/Speech-by-PM-L... |
| 8 | 2020-08-09 | Heng Swee Keat | National Day Message 2020 | Chinese | 每年的 8 月 9 号,新加坡人都会风雨不改,一同观赏国庆庆典,欢庆我国 独立建国,并再次誓... | https://www.pmo.gov.sg/Newsroom/National-Day-M... |


# 2. GENERATE TRANSLATION

## 2.1 DEFINE FUNCTIONS FOR TEXT CLEANING, TEXT SPLIT

We'll use 2 simple functions: one to lightly clean the text, and a second to split each speech into individual sentences, one per row of a dataframe. There's probably a better way to tokenize Chinese sentences, but I haven't found it.

In [4]:
# source: https://github.com/cognoma/genes/blob/721204091a96e55de6dcad165d6d8265e67e2a48/2.process.py#L61-L95
def split_text(df, column, sep="。", keep=False):
    indexes = list()
    new_values = list()
    df = df.dropna(subset=[column])
    for i, presplit in enumerate(df[column].astype(str)):
        values = presplit.split(sep)
        if keep and len(values) > 1:
            indexes.append(i)
            new_values.append(presplit)
        for value in values:
            indexes.append(i)
            new_values.append(value)
    new_df = df.iloc[indexes, :].copy()
    new_df[column] = new_values
    return new_df

In [5]:
def clean_text(text):
    # Turn every newline (single or repeated) into a space
    text = text.replace("\n", " ")
    # Collapse runs of spaces into a single space
    text = re.sub(" +", " ", text)
    # Trim leading and trailing whitespace
    return text.strip()

In [6]:
ch["Clean_Text"] = ch["Text"].map(clean_text)

new_ch = split_text(ch, "Clean_Text")

new_ch = new_ch[new_ch["Clean_Text"] != ""].copy()
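As a possible alternative to the custom `split_text` function, pandas can do the same split-into-rows step with `str.split` followed by `DataFrame.explode` (available from pandas 0.25). The sketch below runs on a made-up toy dataframe, not the speeches, and assumes the same "。" separator.

```python
import pandas as pd

# Toy data standing in for the speeches (hypothetical content)
toy = pd.DataFrame(
    {
        "Title": ["A", "B"],
        "Clean_Text": ["第一句。第二句。", "只有一句。"],
    }
)

# Split each text into a list of sentences, then give each sentence its own row
exploded = (
    toy.assign(Clean_Text=toy["Clean_Text"].str.split("。"))
    .explode("Clean_Text")
    .reset_index(drop=True)
)

# A trailing "。" leaves an empty string behind, so drop the blanks
exploded = exploded[exploded["Clean_Text"].str.strip() != ""].reset_index(drop=True)

print(exploded)
```

The other columns (here just `Title`) are repeated for every sentence, which is what the later grouping step relies on.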

## 2.2 PICK THE RIGHT MODEL FOR YOUR USE CASE

The full list is [here](https://huggingface.co/Helsinki-NLP?utm_campaign=Hugging%2BFace&utm_medium=email&utm_source=Hugging_Face_1). The names of the models can change without notice, so do check before use.

In [7]:
# An earlier version was listed as "Helsinki-NLP/opus-mt-zho-eng"
# Version used below was updated on Aug 19 2020

mt_name = "Helsinki-NLP/opus-mt-zh-en"

model = MarianMTModel.from_pretrained(mt_name)
tokenizer = MarianTokenizer.from_pretrained(mt_name)
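The model IDs in the Helsinki-NLP collection appear to follow the pattern `Helsinki-NLP/opus-mt-{source}-{target}`, as the name above does. A tiny sketch of building the name from two language codes follows; the helper `marian_model_name` is hypothetical, and not every language pair has a published model, so check the list before relying on a generated name.

```python
def marian_model_name(src, tgt):
    """Build a model ID from language codes, assuming the opus-mt naming pattern."""
    return f"Helsinki-NLP/opus-mt-{src}-{tgt}"


name = marian_model_name("zh", "en")
print(name)
```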

## 2.3 TIME THE RUN AND RE-ORG THE OUTPUT CSV FOR CLARITY

The original speeches were split into individual sentences and then translated. But I want the results for each speech re-joined into a single row of text, so a few additional pandas steps are needed to knock the dataframe into the required format.

In [8]:
%%time
corpus = list(new_ch["Clean_Text"].values)

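# prepare_translation_batch() was the tokenizer API when this notebook was first run;
# later transformers releases removed it, so the tokenizer is called directly instead
batch = tokenizer(corpus, return_tensors="pt", padding=True, truncation=True)
translated = model.generate(**batch)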

new_ch["Machine_Translation"] = [
    tokenizer.decode(t, skip_special_tokens=True) for t in translated
]


CPU times: user 10min 8s, sys: 1min 18s, total: 11min 26s
Wall time: 4min 56s
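The cell above pushes every sentence through the model in one go, which can use a lot of memory on a longer corpus. Here's a minimal sketch of translating in smaller chunks instead. The helper names (`batched`, `translate_in_batches`) and the batch size of 16 are my assumptions, and `fake_translate` is only a stand-in so the example runs without downloading a model; in the notebook it would be replaced by a function that tokenizes a chunk, calls `model.generate`, and decodes the result.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def translate_in_batches(sentences, translate_fn, batch_size=16):
    """Apply `translate_fn` to one chunk at a time and collect results in order."""
    results = []
    for chunk in batched(sentences, batch_size):
        results.extend(translate_fn(chunk))
    return results


def fake_translate(chunk):
    # Stand-in for the real model call (hypothetical): just upper-cases the text
    return [s.upper() for s in chunk]


out = translate_in_batches(["a", "b", "c", "d", "e"], fake_translate, batch_size=2)
print(out)
```

Because the results are collected in input order, the output list can be assigned straight back to the dataframe column, the same way as in the single-batch version.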


In [9]:
%%time 
new_ch['English_Translation'] = new_ch.groupby(["Title"])["Machine_Translation"].transform(lambda x: ','.join(x))

new_ch = new_ch.drop_duplicates(subset=["English_Translation"]).copy()

CPU times: user 9.98 ms, sys: 2.93 ms, total: 12.9 ms
Wall time: 48.1 ms
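The re-joining step can also be written as a single `groupby` plus `agg`, which returns one row per speech directly, with no need to drop duplicates afterwards. This is a sketch on made-up toy data, and it joins sentences with a space rather than the comma used in the cell above; both choices are my assumptions.

```python
import pandas as pd

# Toy sentence-level translations (hypothetical content)
toy = pd.DataFrame(
    {
        "Title": ["Speech A", "Speech A", "Speech B"],
        "Machine_Translation": [
            "First sentence.",
            "Second sentence.",
            "Only sentence.",
        ],
    }
)

# One row per title, with that title's sentences joined in their original order
joined = (
    toy.groupby("Title", sort=False)["Machine_Translation"]
    .agg(" ".join)
    .reset_index(name="English_Translation")
)

print(joined)
```

The result only has the `Title` and `English_Translation` columns, so it would need to be merged back onto the speech-level dataframe to recover `Date`, `Speaker`, `Text` and `URL`.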


In [10]:
cols = ["Date", "Speaker", "Title", "Text", "English_Translation", "URL"]

speeches_translated = new_ch[cols]


In [11]:
speeches_translated.head()

|  | Date | Speaker | Title | Text | English_Translation | URL |
|---|---|---|---|---|---|---|
| 1 | 2020-04-30 | Lee Hsien Loong | May Day Message 2020 | 今年，我们在艰难的环境中庆祝五一劳动节。\n冠病疫情最新情况\n2019冠状病毒疾病（COV... | This year, we celebrate Labor Day 51 in a diff... | https://www.pmo.gov.sg/Newsroom/PM-Lee-Hsien-L... |
| 5 | 2020-07-27 | Lee Hsien Loong | Cabinet Swearing-in Ceremony | 过去六个月，大家生活不易，全民一心对抗冠病疫情和应付经济危机。整体上，我国疫情已经受到控制，... | Over the past six months, we've had a hard tim... | https://www.pmo.gov.sg/Newsroom/Speech-by-PM-L... |
| 8 | 2020-08-09 | Heng Swee Keat | National Day Message 2020 | 每年的 8 月 9 号,新加坡人都会风雨不改,一同观赏国庆庆典,欢庆我国 独立建国,并再次誓... | On the 9th of August of every year, Singaporea... | https://www.pmo.gov.sg/Newsroom/National-Day-M... |


In [12]:
#uncomment to generate a separate file if you need it
#speeches_translated.to_csv('speeches_cn_to_en.csv', index=False)