## Overview

In this notebook we will train two models to translate text from Greek to English and from English to Greek, using [Google's AutoML Translation](https://cloud.google.com/translate/automl/docs?hl=en). This notebook can be easily modified to demo any of the 21 languages provided in the public dataset, namely: Romance (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavic (Bulgarian, Czech, Polish, Slovak, Slovene), Finno-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek.

The dataset used is the [European Parliament Proceedings](https://www.statmt.org/europarl/).

## Download and pre-process language files

In [2]:
# Modify the language code to demo different languages
LANGUAGE_CODE = 'el' #Greek
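
Since the language code is interpolated directly into the download URL, a typo only shows up as a failed download. The sketch below checks the code first and builds the URL. The set `EUROPARL_CODES` and the helper `europarl_url` are hypothetical names introduced here, and the list of codes is an assumption based on the language pairs listed on the Europarl v7 download page.

```python
# Language codes paired with English in Europarl v7.
# Assumption: taken from the dataset's download page; verify before relying on it.
EUROPARL_CODES = {
    'bg', 'cs', 'da', 'de', 'el', 'es', 'et', 'fi', 'fr', 'hu',
    'it', 'lt', 'lv', 'nl', 'pl', 'pt', 'ro', 'sk', 'sl', 'sv',
}


def europarl_url(language_code):
    """Return the Europarl v7 archive URL for <language_code>-en."""
    if language_code not in EUROPARL_CODES:
        raise ValueError('Unknown Europarl language code: {!r}'.format(language_code))
    return 'https://www.statmt.org/europarl/v7/{}-en.tgz'.format(language_code)


print(europarl_url('el'))
```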

In [3]:
!wget https://www.statmt.org/europarl/v7/$LANGUAGE_CODE-en.tgz

--2020-08-04 07:56:11--  https://www.statmt.org/europarl/v7/el-en.tgz
Resolving www.statmt.org (www.statmt.org)... 129.215.197.184
Connecting to www.statmt.org (www.statmt.org)|129.215.197.184|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 151392701 (144M) [application/x-gzip]
Saving to: ‘el-en.tgz.1’

el-en.tgz.1          27%[====>               ]  40.04M  4.20MB/s    eta 26s    ^C


In [7]:
!tar -xzf $LANGUAGE_CODE-en.tgz

## Create TSV files supported by AutoML Translate

AutoML Translation supports TSV and TMX files. For more information, see [Preparing training data](https://cloud.google.com/translate/automl/docs/prepare?hl=en#comma-separated_values_csv).

In [25]:
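# Output TSV files, one per translation direction
to_en_min = open('./{}_en.tsv'.format(LANGUAGE_CODE), 'w', encoding='utf-8')
from_en_min = open('./en_{}.tsv'.format(LANGUAGE_CODE), 'w', encoding='utf-8')
# Sentence-aligned inputs: pt_f holds the non-English side, en_f the English side
with open('./europarl-v7.{}-en.{}'.format(LANGUAGE_CODE, LANGUAGE_CODE), 'r', encoding='utf-8') as f:
  pt_f = f.readlines()
with open('./europarl-v7.{}-en.en'.format(LANGUAGE_CODE), 'r', encoding='utf-8') as f:
  en_f = f.readlines()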

In [27]:
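# Write up to the first 100,000 aligned sentence pairs as tab-separated rows
for i in range(0, min(100000, len(pt_f), len(en_f))):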
  to_en_min.write(pt_f[i].strip() + '\t' + en_f[i].strip() + '\n')
  from_en_min.write(en_f[i].strip() + '\t' + pt_f[i].strip() + '\n')

to_en_min.close()
from_en_min.close()
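
Before uploading, it may be worth checking that every row really has a source and a target segment, since aligned corpora can contain empty lines on one side. The following is a minimal sketch, not part of the original notebook; `check_tsv_lines` is a hypothetical helper that takes any iterable of lines.

```python
def check_tsv_lines(lines):
    """Count well-formed rows and collect the line numbers of malformed ones.

    A row is considered well-formed when it has exactly two
    tab-separated columns and neither column is blank.
    """
    good = 0
    bad_line_numbers = []
    for line_number, line in enumerate(lines, start=1):
        columns = line.rstrip('\n').split('\t')
        if len(columns) == 2 and all(column.strip() for column in columns):
            good += 1
        else:
            bad_line_numbers.append(line_number)
    return good, bad_line_numbers


sample = ['Γεια\tHello\n', '\tmissing source\n', 'ok\tfine\n']
print(check_tsv_lines(sample))
```

To check a generated file, pass an open file handle, for example `check_tsv_lines(open('./el_en.tsv', encoding='utf-8'))`.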

## Store your training files in Google Cloud Storage

In [33]:
BUCKET = 'gs://<BUCKET_NAME>' #Replace with name of your GCS bucket

In [32]:
# Run this only if your bucket does not exist
!gsutil mb $BUCKET

Creating gs://camus-translate/...


In [35]:
!gsutil cp *.tsv $BUCKET

Copying file://el_en.tsv [Content-Type=text/tab-separated-values]...
Copying file://en_el.tsv [Content-Type=text/tab-separated-values]...            
\ [2 files][ 88.8 MiB/ 88.8 MiB]                                                
Operation completed over 2 objects/88.8 MiB.                                     


## Train your models on AutoML Translate UI

Now that you have your data prepared, the hard part is over ;) To train your model, you can use the APIs or the UI. Let's give the UI a try!

1. Create your dataset by pointing to the TSV files stored in GCS, as described [here](https://cloud.google.com/translate/automl/docs/datasets?hl=en).
2. Start training as described [here](https://cloud.google.com/translate/automl/docs/models?hl=en).

This can take several hours and you will be notified via email when each stage is complete. So sit back and practice your Greek :)

## Evaluate Model

The model performance is given in terms of the [BLEU score](https://cloud.google.com/translate/automl/docs/evaluate?hl=en#bleu). This measures how closely the generated text matches the provided reference translations. The higher the BLEU score, the better.
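
To give a feel for what the metric rewards, here is a simplified, sentence-level sketch of the idea: clipped n-gram precision combined with a brevity penalty. It is an illustration only, with whitespace tokenization and no smoothing, and it is not the exact computation AutoML Translation performs (BLEU is normally reported over a whole test set). `simple_bleu` is a hypothetical helper name.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU against a single reference."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        if total == 0 or overlap == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    # Penalize candidates that are shorter than the reference.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


print(simple_bleu('the cat sat on the mat', 'the cat sat on the mat'))
```

An identical candidate and reference score 1.0, while a candidate sharing no words with the reference scores 0.0.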

## Use model for predictions

As with all other AutoML functionality, you can use the API or the UI. Here's an example of using the API to translate text.

In [45]:
project_id = ''
model_id = '' # ex. TRL8577776192420052992
file_path = 'test.txt' # Create your own test.txt file
region = '' # ex. us-central1

In [None]:
!pip install google-cloud-automl

In [47]:
from google.cloud import automl
prediction_client = automl.PredictionServiceClient()

# Get the full path of the model.
model_full_id = prediction_client.model_path(
    project_id, region, model_id
)

# Read the file content for translation.
with open(file_path, "rb") as content_file:
    content = content_file.read()
# Decode the raw bytes; the original call discarded the decoded result.
content = content.decode("utf-8")

text_snippet = automl.types.TextSnippet(content=content)
payload = automl.types.ExamplePayload(text_snippet=text_snippet)

response = prediction_client.predict(model_full_id, payload)
translated_content = response.payload[0].translation.translated_content

print(u"Translated content: {}".format(translated_content.content))

Translated content: I declare the session of the European Parliament adjourned.
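
If `test.txt` contains several sentences, one option is to send them one line at a time. The sketch below keeps the API call out of the loop by accepting any callable. `translate_lines` and `translate_fn` are hypothetical names, and in practice `translate_fn` would wrap the `predict` call from the cell above.

```python
def translate_lines(lines, translate_fn):
    """Translate each non-empty line with translate_fn.

    Returns a list of (source, translation) pairs in input order.
    """
    results = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines instead of sending empty requests
        results.append((text, translate_fn(text)))
    return results


# Stand-in translator for demonstration; replace with a real API wrapper.
print(translate_lines(['hello\n', '\n', 'world\n'], str.upper))
```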


## Now let's compare that with basic Translation API

First, enable the Translation API for your project by following the instructions [here](https://cloud.google.com/translate/docs/setup).

In [48]:
!pip install --upgrade google-cloud-translate

Collecting google-cloud-translate
  Downloading google_cloud_translate-2.0.2-py2.py3-none-any.whl (91 kB)
     |████████████████████████████████| 91 kB 3.8 MB/s
Installing collected packages: google-cloud-translate
  Attempting uninstall: google-cloud-translate
    Found existing installation: google-cloud-translate 2.0.1
    Uninstalling google-cloud-translate-2.0.1:
      Successfully uninstalled google-cloud-translate-2.0.1
Successfully installed google-cloud-translate-2.0.2


In [51]:
"""Translates text into the target language.

Target must be an ISO 639-1 language code.
See https://g.co/cloud/translate/v2/translate-reference#supported_languages
"""
from google.cloud import translate_v2 as translate
translate_client = translate.Client()

# Read the first line of the test file, closing the file afterwards
with open('test.txt', 'r', encoding='utf-8') as f:
    text = f.readline()

# Text can also be a sequence of strings, in which case this method
# will return a sequence of results for each text.
result = translate_client.translate(
    text, target_language='en')

print(u'Text: {}'.format(result['input']))
print(u'Translation: {}'.format(result['translatedText']))
print(u'Detected source language: {}'.format(
    result['detectedSourceLanguage']))

Text: Κηρύσσω τη διακοπή της συνόδου του Ευρωπαϊκού Κοινοβουλίου.
Translation: I declare the suspension of the sitting of the European Parliament.
Detected source language: el
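
The two outputs above differ in wording. A rough way to quantify how much they differ is a word-level similarity ratio from Python's standard `difflib` module. This is only a sketch: it compares the two system outputs with each other, not with a reference translation, so it says nothing about which one is better. `word_similarity` is a hypothetical helper name.

```python
import difflib


def word_similarity(a, b):
    """Return a 0..1 similarity ratio between two sentences, compared word by word."""
    return difflib.SequenceMatcher(None, a.split(), b.split()).ratio()


automl_output = 'I declare the session of the European Parliament adjourned.'
api_output = 'I declare the suspension of the sitting of the European Parliament.'

print('Word-level similarity: {:.2f}'.format(word_similarity(automl_output, api_output)))
```

To judge quality rather than similarity, each output would need to be compared against the English side of the held-out Europarl sentence.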
