In [1]:
# Importing main libraries
#importing google translation service
from deep_translator import GoogleTranslator 
#importing a package that performs automatic scoring of translations
#using the TER metric
import pyter
import pandas as pd
import pyeuropeana.apis as apis
import pyeuropeana.utils as utils
import os
pd.options.mode.chained_assignment = None

# Introduction

The [Europeana Foundation](https://www.europeana.eu/en) currently collects more than 60 million digital Cultural Heritage (CH) records. These records are described by a series of metadata fields that capture the available information about the objects. For example, the title, a text that describes the object, and the type of the object (video, text, etc.) are all relevant metadata.

One of the goals of Europeana is to improve the multilinguality of its resources, meaning that as many objects as possible should have information available in as many languages as possible, ideally at least in all 24 official languages of the European Union.
While many records are already available in several languages, there are records that currently do not yet hold relevant information in the language preferred by the users of the Europeana platform.
To tackle this problem, we could use an automatic translation service to achieve fuller language coverage of the metadata.


This notebook contains a brief demo on using the [Europeana Search API](https://pro.europeana.eu/page/search) in combination with [PyEuropeana](https://github.com/europeana/rd-europeana-python-api), a Python client library for Europeana APIs, to perform translations of metadata and evaluate their quality. Read more about how the PyEuropeana package works in the [Documentation](https://rd-europeana-python-api.readthedocs.io/en/stable/).

# PyEuropeana package installation

Install PyEuropeana with pip:

In [17]:
#%%capture
#!pip install pyeuropeana
!pip install https://github.com/europeana/rd-europeana-python-api/archive/master.zip


In [18]:
#setting the environment variable that holds the Europeana API key
os.environ['EUROPEANA_API_KEY'] = 'api2demo'

# Definition of the translation function

In this section we define the function that will perform language translation of a piece of text.

In [4]:
def translate(txt, target):
    """Translate `txt` into the `target` language; non-string input is not translated."""
    if isinstance(txt, str):
        #Here we are using the GoogleTranslator library, defining a source language that is detected
        #automatically and a target language we want the text to be translated to
        translated = GoogleTranslator(source='auto', target=target).translate(txt)
    else:
        translated = 'Provided text is not a string and cannot be translated'
    return translated

Let us check whether this function works on a simple piece of Dutch text to be translated to English.

In [5]:
text= "Hoe gaat het ?"
translation=translate(text,'en')
translation

'How are you ?'

It looks like it is working!

In the following we will be using the PyEuropeana module and the Search API to query the Europeana database.

# Querying the Europeana database

Let us specify the query we want to make and the number of CH records we would like to retrieve. The following query looks for the records that have a description in Italian and asks to retrieve 10 of them.

In [6]:
#Here we define the query and the number of record parameters
query= 'proxy_dc_description.it:*'
n_CH_records=10
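The same query pattern can be reused for other languages. The sketch below is a minimal illustration: `description_query` is a hypothetical helper introduced here, and it assumes that the `proxy_dc_description.<language code>` field naming seen in the Italian query carries over to other two-letter language codes.

```python
def description_query(lang_code):
    """Build a query for records that have a description in `lang_code`.

    Assumption: the field name follows the `proxy_dc_description.<code>`
    pattern used for Italian in this notebook.
    """
    if not (len(lang_code) == 2 and lang_code.isalpha()):
        raise ValueError(f"expected a two-letter language code, got {lang_code!r}")
    return f"proxy_dc_description.{lang_code.lower()}:*"


# Reproduces the Italian query used in this notebook
print(description_query("it"))
# Same pattern for Dutch (assumed to behave analogously)
print(description_query("nl"))
```

The returned string can be passed as the `query` argument in the same way as the hard-coded Italian query.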

Once we have defined the parameters we can perform the API call using the apis module of the PyEuropeana package

In [7]:
response = apis.search(
    query = query,
    rows = n_CH_records,
    )

Let us take a look at the call response

In [None]:
response

The response is a rich and complex JSON document, which Python represents as nested dictionaries and lists. It holds many different metadata fields, for example `itemsCount` and `totalResults`. In many cases we are not interested in all the metadata fields but only in a subset, depending on the problem at hand.
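To give a feeling for what navigating such a nested structure looks like, here is a small sketch that works on a hand-made stand-in for a response. The `mock_response` dictionary, its `items` list, the per-record field names, and the `get_nested` helper are all assumptions made for illustration; a real response contains many more fields.

```python
# Hand-made stand-in for a search response (NOT real API output)
mock_response = {
    "totalResults": 25,
    "items": [
        {"id": "/000/record_a", "type": "IMAGE",
         "description": {"it": "Manifesto che mostra una carta geografica"}},
        {"id": "/000/record_b", "type": "TEXT"},  # this record has no description
    ],
}


def get_nested(data, path, default=None):
    """Follow a sequence of keys and indexes through nested dicts and lists."""
    current = data
    for key in path:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return default
    return current


total = get_nested(mock_response, ["totalResults"])
first_description = get_nested(mock_response, ["items", 0, "description", "it"])
missing_description = get_nested(
    mock_response, ["items", 1, "description", "it"], default="n/a"
)

print(total)
print(first_description)
print(missing_description)
```

Returning a default instead of raising keeps the lookup usable when some records lack a field, which is common for optional metadata.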

It would then be useful if we could focus on a selection of the fields and access them in a format that is easier to read than JSON, for example a table. The PyEuropeana module offers just that!

# Selection of a subset of metadata fields

Here we use the function `search2df` within the utils module to select a subset of the fields and cast them in a tabular form

In [9]:
df_search=utils.search2df(response)
df_search.head(2) #visualizing 2 of 10 requested results in tabular form

Unnamed: 0,europeana_id,uri,type,image_url,country,description,title,creator,language,rights,provider,dataset_name,concept,concept_lang,description_lang,title_lang
0,/9200314/BibliographicResource_3000093755040_s...,http://data.europeana.eu/item/9200314/Bibliogr...,IMAGE,http://www.14-18.it/img/mappa/RML0358106_01/full,Italy,Manifesto che riporta due carte geografiche de...,L'insegnamento della carta geografica della gu...,,it,http://rightsstatements.org/vocab/InC/1.0/,Central Institute for the Union Catalogue of I...,9200314_Ag_EU_TEL_a1192b_Collections_1914-1918,http://data.europeana.eu/concept/loc/sh85148236,"{'de': 'Karte (Kartografie)', 'hi': 'मानचित्र'...",{'it': 'Manifesto che riporta due carte geogra...,{'it': 'L'insegnamento della carta geografica ...
1,/9200314/BibliographicResource_3000093755038_s...,http://data.europeana.eu/item/9200314/Bibliogr...,IMAGE,http://www.14-18.it/img/mappa/RML0195860_01/full,Italy,Manifesto che mostra al centro la carta geogra...,Croce rossa americana,Croce Rossa Americana,it,http://rightsstatements.org/vocab/InC/1.0/,Central Institute for the Union Catalogue of I...,9200314_Ag_EU_TEL_a1192b_Collections_1914-1918,http://data.europeana.eu/concept/loc/sh85148236,"{'de': 'Karte (Kartografie)', 'hi': 'मानचित्र'...",{'it': 'Manifesto che mostra al centro la cart...,{'it': 'Croce rossa americana'}


Comparing the headings of the table above with the original JSON response, we can see that the `search2df` function has selected a subset of the fields.
In the following section we will translate the text in the `description` field, one of the most important metadata fields.
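The idea of keeping only a subset of fields per record can be sketched in a few lines of plain Python. This is not how `search2df` is implemented; `mock_items` and `select_fields` are hypothetical names, and the records are hand-made examples.

```python
# Hand-made records with more fields than we want to keep (illustrative only)
mock_items = [
    {"id": "/000/record_a", "type": "IMAGE", "country": "Italy", "score": 1.0},
    {"id": "/000/record_b", "type": "TEXT", "score": 0.8},  # no country field
]


def select_fields(items, fields):
    """Return one flat row per record, keeping only `fields`.

    Fields that a record does not have are filled with None, so every
    row ends up with the same columns.
    """
    return [{field: item.get(field) for field in fields} for item in items]


rows = select_fields(mock_items, ["id", "type", "country"])
for row in rows:
    print(row)
```

Because every row has the same keys, a list like `rows` can be handed directly to `pd.DataFrame(rows)` to obtain a table.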

# Translations of the `description` field

In this tutorial, the information we are interested in translating is the description of the record, held in the `description` column. Let us see if we can apply the function defined at the beginning of the notebook to translate the description column from its original language, Italian, to English.

We make a new column `description_en` and apply the function `translate` to the `description` column to translate it to English.

In [10]:
df_search['description_en']=df_search['description'].apply(translate,target='en')
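One detail of this step is worth spelling out. When a record has no description, pandas typically stores the missing value as `NaN`, which is a float rather than a string. The type check inside `translate` is what keeps such values away from the translation service. The sketch below illustrates this offline; `translate_with` and `fake_translator` are hypothetical names introduced here and are not part of `deep_translator`.

```python
import math

NOT_A_STRING = 'Provided text is not a string and cannot be translated'


def translate_with(translator, txt):
    """Same type check as `translate`, with the translator passed in as a function."""
    if isinstance(txt, str):
        return translator(txt)
    return NOT_A_STRING


def fake_translator(text):
    """Hypothetical stand-in for a real translation call, so the sketch runs offline."""
    return f"[en] {text}"


# A string, a NaN (as pandas would use for a missing value), and None
descriptions = ["Croce rossa americana", math.nan, None]
results = [translate_with(fake_translator, d) for d in descriptions]

for original, result in zip(descriptions, results):
    print(repr(original), "->", result)
```

Only the first value reaches the translator; the other two receive the fallback message.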

Let us visualize only the original text and the English translation.

In [11]:
#We select only the original description in Italian and its automatic translation to English
df_translation=df_search[['description','description_en',]]
df_translation

| | description | description_en |
| --- | --- | --- |
| 0 | Manifesto che riporta due carte geografiche de... | Poster showing two geographical maps of Europe... |
| 1 | Manifesto che mostra al centro la carta geogra... | Poster showing in the center the geographical ... |
| 2 | Manifesto che mostra una carta geografica dell... | Poster showing a map of north-eastern Italy an... |
| 3 | Manifesto che mostra al centro una carta geogr... | Poster showing in the center a geographical ma... |
| 4 | Manifesto che mostra una carta geografica dell... | Poster showing a map of north-eastern Italy an... |
| 5 | Manifesto che mostra una carta geografica dell... | Poster showing a map of Italy and the Balkans ... |
| 6 | Manifesto che mostra la carta geografica del m... | Poster showing the geographical map of the wor... |
| 7 | Manifesto che mostra una rappresentazione geog... | Poster showing a geographic representation of ... |
| 8 | Manifesto che raffigura in azzurro la catena d... | Poster depicting in blue the mountain range an... |
| 9 | Manifesto che mostra la carta geograficha dell... | Poster showing the geographical map of Venice ... |


We get an idea by scanning the table above, and we can zoom in, for example on the second row, to fully visualize the original text and its translation.

In [12]:
list(df_translation.loc[1])

["Manifesto che mostra al centro la carta geografica dell'Italia in cui sono indicati i luoghi dove la Croce rossa americana è presente sul territorio,  intorno fanno da cornice alcune fotografie che documentano il lavoro svolto dalla Croce rossa americana, in alto sono presenti i ritratti fotografici di Woodrow Wilson, Robert Perkins ed Henry P. Davison.",
 'Poster showing in the center the geographical map of Italy showing the places where the American Red Cross is present in the area, around it are some photographs documenting the work done by the American Red Cross, at the top there are photographic portraits by Woodrow Wilson, Robert Perkins and Henry P. Davison.']

To a reader who understands both Italian and English, the translation looks fine, except for the last part of the sentence, where there is an interesting mistake. Can you spot it?

# Quality of translations

The next question we may ask is: can we measure the quality of these metadata translations? <br>
The standard automatic way to measure the quality of translations is to compare them to reference translations and measure how close they are. Over time, many metrics have been developed to do so; two of the most popular are bilingual evaluation understudy ([BLEU](https://en.wikipedia.org/wiki/BLEU)) and translation error rate ([TER](https://kantanmtblog.com/2015/07/28/what-is-translation-error-rate-ter/)). <br>
In our case we do not have reference translations at hand, so we opt for the following: we translate the English text back into Italian and measure how close the back-translated Italian text is to the original Italian text. In essence, the original Italian text serves as the reference. We can then use the resulting score as an estimate of the quality of the initial translation from Italian to English. This method, which relies on the back translation (to Italian in this case), is called round-trip translation ([RTT](https://en.wikipedia.org/wiki/Round-trip_translation)). Although the RTT technique has some downsides, for example that errors introduced by the back translation cannot be told apart from errors in the forward translation, it allows us to form an idea of the quality of the translations when reference translations are not available.
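The round-trip idea can be illustrated offline with a toy example. In the sketch below, the word-for-word dictionaries `IT_EN` and `EN_IT` and the helpers `word_lookup` and `round_trip` are hypothetical stand-ins for a real translation service; they only serve to show how differences between the original and the back translation arise.

```python
def round_trip(text, forward, backward):
    """Translate `text` forward, then back; return both intermediate results."""
    translated = forward(text)
    back_translated = backward(translated)
    return translated, back_translated


# Toy word-for-word dictionaries standing in for a real translation service
IT_EN = {"manifesto": "poster", "che": "that", "mostra": "shows",
         "una": "a", "carta": "map"}
EN_IT = {"poster": "poster", "that": "che", "shows": "mostra",
         "a": "una", "map": "mappa"}


def word_lookup(table):
    """Build a 'translator' that replaces each word using `table`."""
    def translate_text(text):
        return " ".join(table.get(word, word) for word in text.split())
    return translate_text


original = "manifesto che mostra una carta"
english, back = round_trip(original, word_lookup(IT_EN), word_lookup(EN_IT))

# Words that did not survive the round trip unchanged
changed = [(o, b) for o, b in zip(original.split(), back.split()) if o != b]

print(english)
print(back)
print(changed)
```

Here two words come back differently even though the English text is a reasonable translation, which shows why RTT scores are only an estimate of the forward translation quality.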

Let us thus add a new column to the dataframe, `description_en_it`, to hold the back translation of the `description_en` column from English to Italian, and perform the translation.

In [13]:
df_search['description_en_it']=df_search['description_en'].apply(translate, target= 'it')
df_search.head(2) #visualize the first two rows of the result

Unnamed: 0,europeana_id,uri,type,image_url,country,description,title,creator,language,rights,provider,dataset_name,concept,concept_lang,description_lang,title_lang,description_en,description_en_it
0,/9200314/BibliographicResource_3000093755040_s...,http://data.europeana.eu/item/9200314/Bibliogr...,IMAGE,http://www.14-18.it/img/mappa/RML0358106_01/full,Italy,Manifesto che riporta due carte geografiche de...,L'insegnamento della carta geografica della gu...,,it,http://rightsstatements.org/vocab/InC/1.0/,Central Institute for the Union Catalogue of I...,9200314_Ag_EU_TEL_a1192b_Collections_1914-1918,http://data.europeana.eu/concept/loc/sh85148236,"{'de': 'Karte (Kartografie)', 'hi': 'मानचित्र'...",{'it': 'Manifesto che riporta due carte geogra...,{'it': 'L'insegnamento della carta geografica ...,Poster showing two geographical maps of Europe...,Manifesto raffigurante due carte geografiche d...
1,/9200314/BibliographicResource_3000093755038_s...,http://data.europeana.eu/item/9200314/Bibliogr...,IMAGE,http://www.14-18.it/img/mappa/RML0195860_01/full,Italy,Manifesto che mostra al centro la carta geogra...,Croce rossa americana,Croce Rossa Americana,it,http://rightsstatements.org/vocab/InC/1.0/,Central Institute for the Union Catalogue of I...,9200314_Ag_EU_TEL_a1192b_Collections_1914-1918,http://data.europeana.eu/concept/loc/sh85148236,"{'de': 'Karte (Kartografie)', 'hi': 'मानचित्र'...",{'it': 'Manifesto che mostra al centro la cart...,{'it': 'Croce rossa americana'},Poster showing in the center the geographical ...,Manifesto che mostra al centro la carta geogra...


Now, let us visualize the original text in Italian and the back translation to Italian

In [14]:
#We take a copy so that a score column can be added later without modifying df_search
df_translation_test=df_search[['description','description_en_it']].copy()
df_translation_test

| | description | description_en_it |
| --- | --- | --- |
| 0 | Manifesto che riporta due carte geografiche de... | Manifesto raffigurante due carte geografiche d... |
| 1 | Manifesto che mostra al centro la carta geogra... | Manifesto che mostra al centro la carta geogra... |
| 2 | Manifesto che mostra una carta geografica dell... | Poster raffigurante una mappa dell'Italia nord... |
| 3 | Manifesto che mostra al centro una carta geogr... | Manifesto che mostra al centro una mappa geogr... |
| 4 | Manifesto che mostra una carta geografica dell... | Poster raffigurante una mappa dell'Italia nord... |
| 5 | Manifesto che mostra una carta geografica dell... | Poster raffigurante una mappa dell'Italia e de... |
| 6 | Manifesto che mostra la carta geografica del m... | Il manifesto raffigurante la carta geografica ... |
| 7 | Manifesto che mostra una rappresentazione geog... | Poster raffigurante una rappresentazione geogr... |
| 8 | Manifesto che raffigura in azzurro la catena d... | Poster raffigurante in blu la catena montuosa ... |
| 9 | Manifesto che mostra la carta geograficha dell... | Poster raffigurante la carta geografica di Ven... |


They look pretty similar, but let us quantify our impressions by applying the TER metric.

In [15]:
#We split both texts into words on whitespace and compute the TER score row by row
df_translation_test['Ter_Score']=df_translation_test.apply(lambda x: pyter.ter(x['description'].split(), x['description_en_it'].split()), axis=1)

In [16]:
df_translation_test

| | description | description_en_it | Ter_Score |
| --- | --- | --- | --- |
| 0 | Manifesto che riporta due carte geografiche de... | Manifesto raffigurante due carte geografiche d... | 0.34375 |
| 1 | Manifesto che mostra al centro la carta geogra... | Manifesto che mostra al centro la carta geogra... | 0.3125 |
| 2 | Manifesto che mostra una carta geografica dell... | Poster raffigurante una mappa dell'Italia nord... | 0.2 |
| 3 | Manifesto che mostra al centro una carta geogr... | Manifesto che mostra al centro una mappa geogr... | 0.357143 |
| 4 | Manifesto che mostra una carta geografica dell... | Poster raffigurante una mappa dell'Italia nord... | 0.347826 |
| 5 | Manifesto che mostra una carta geografica dell... | Poster raffigurante una mappa dell'Italia e de... | 0.416667 |
| 6 | Manifesto che mostra la carta geografica del m... | Il manifesto raffigurante la carta geografica ... | 0.358974 |
| 7 | Manifesto che mostra una rappresentazione geog... | Poster raffigurante una rappresentazione geogr... | 0.25 |
| 8 | Manifesto che raffigura in azzurro la catena d... | Poster raffigurante in blu la catena montuosa ... | 0.3 |
| 9 | Manifesto che mostra la carta geograficha dell... | Poster raffigurante la carta geografica di Ven... | 0.307692 |


The TER metric measures the amount of editing needed to bring a translation in line with a reference, relative to the length of the reference: the lower the TER score, the closer the back translation is to the original text. In the table above the scores range from 0.2 to about 0.42. As anticipated above, we can then use this score as an estimate of the quality of the translation from Italian to English, which was our goal.
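To make the idea of "amount of editing" concrete, here is a simplified sketch of such a score: the word-level edit distance (insertions, deletions, substitutions) divided by the number of words in the reference. This is a deliberate simplification and not the `pyter` implementation; in particular, full TER also counts shifts of word sequences as edits, which this sketch ignores. The names `word_edit_distance` and `simple_ter` are introduced here for illustration.

```python
def word_edit_distance(hyp, ref):
    """Minimum number of word insertions, deletions and substitutions
    needed to turn the list `hyp` into the list `ref`."""
    previous = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        current = [i]
        for j, ref_word in enumerate(ref, start=1):
            cost = 0 if hyp_word == ref_word else 1
            current.append(min(
                previous[j] + 1,         # delete a word from the hypothesis
                current[j - 1] + 1,      # insert a word from the reference
                previous[j - 1] + cost,  # substitute (or keep) a word
            ))
        previous = current
    return previous[-1]


def simple_ter(hypothesis, reference):
    """Simplified TER: word edits divided by reference length (no shifts)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(hyp, ref) / len(ref)


reference = "manifesto che mostra una carta"
hypothesis = "poster che mostra una mappa"

# Two substituted words out of five reference words
score = simple_ter(hypothesis, reference)
print(score)  # 0.4
```

An identical hypothesis and reference give a score of 0, and the score grows with every word that has to be changed.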

# Conclusions

In this tutorial we briefly covered the following topics:
- Introduction to metadata fields describing a CH object
- Importance of having relevant metadata fields available in many languages
- Use of the PyEuropeana module in combination with the Search API to retrieve CH objects with a description in Italian
- Automatic translation from Italian to English of the retrieved metadata describing the CH object
- Use of the RTT method in combination with the TER score to estimate the quality of the obtained translations