<a href="https://colab.research.google.com/github/d-triana/MEPs/blob/main/tweets_translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
"""
Created on Wed Jul 27 11:50:55 2022
@author: Daniel Triana
"""

In [None]:
!pip install pyreadr
!pip install deep_translator

# Automated translation using Python
This is the .ipynb version of the Python script for automated translation. <p>
This script is built upon the Deep-Translator tool created by Nidhal Baccouri: 
https://pypi.org/project/deep-translator/

In [3]:
# %% Import relevant packages
from typing import Union, Any
import pandas as pd
from pandas import DataFrame, Series
from pandas.core.generic import NDFrame
from pandas.io.parsers import TextFileReader
import numpy as np
import matplotlib.pyplot as plt
import pyreadr
import deep_translator
import deep_translator.base
import deep_translator.exceptions
from deep_translator import GoogleTranslator, single_detection, batch_detection
import requests
import time

In [4]:
#%%
# Load your Google Drive in case you have your files stored there 
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Loading the Data Base
To work with the database, I've created a simplified .csv version of the full .Rda original file. Beware that while working in Colab, you'll need to upload the .csv file every time you start the Colab Notebook.

In [6]:
# %%
# Load the DataBase
tweets_text: Union[Union[TextFileReader, DataFrame], Any
                   ] = pd.read_csv(r'/content/drive/MyDrive/tweets_text.csv', low_memory=False)
tweets_text[['true_author_id', 'id', 'conversation_id', 'commission_dummy',
             'party_id', 'in_reply_to_user_id'
             ]] = tweets_text[['true_author_id', 'id', 'conversation_id',
                               'commission_dummy', 'party_id',
                               'in_reply_to_user_id'
                               ]].astype(str)

# Visualize the first 5 rows of the data base.
tweets_text.head()

The 'language_codes.csv' file is included as a reference for the international language codes. Although uncommon, the Twitter language codes within the database can differ from this reference. It is advised to check the correct database code for the language we want to translate.<p>
Don't forget to load-up the file everytime you open the Notebook.

In [10]:
# File with the international language codes for reference
lang_codes = pd.read_csv(r'/content/drive/MyDrive/language_codes.csv')
lang_codes

Unnamed: 0,ISO language name,639-1,639-2/T,639-2/B,639-3,Notes
0,Abkhazian,ab,abk,abk,abk,also known as Abkhaz
1,Afar,aa,aar,aar,aar,
2,Afrikaans,af,afr,afr,afr,
3,Akan,ak,aka,aka,aka + 2,"macrolanguage, Twi is tw/twi, Fanti is fat"
4,Albanian,sq,sqi,alb,sqi + 4,"macrolanguage, called ""Albanian Phylozone"" in ..."
...,...,...,...,...,...,...
178,Xhosa,xh,xho,xho,xho,
179,Yiddish,yi,yid,yid,yid + 2,macrolanguage. Changed in 1989 from original I...
180,Yoruba,yo,yor,yor,yor,
181,"Zhuang, Chuang",za,zha,zha,zha + 16,macrolanguage


Since we are going to translate slices of specific languages, we´ll need to know how many tweets per language are in the database.

In [12]:
# Create object to know how many tweets per language are in the DataBase
tweets_per_language = (tweets_text['lang'].value_counts())
tweets_per_language

en     121116
fr      83312
es      79794
de      49074
pl      47817
it      34417
nl      18367
cs      12968
el      10446
sv       9322
pt       9027
fi       8649
und      7941
da       6159
sl       5676
ca       5031
zxx      3541
qme      2567
hu       1449
art      1070
lv        891
qht       825
et        660
qam       625
eu        549
bg        549
ro        386
in        237
tr        212
uk        205
no        203
qst       177
ru        172
ht        105
tl         99
cy         92
lt         81
is         56
ar         40
ka         13
hi         10
ja          6
sr          5
fa          5
zh          4
ur          2
hy          1
vi          1
ckb         1
Name: lang, dtype: int64

## Filter the Language Subset
- Filter or slice the relevant set of tweets to be translated. <p>
- The .iloc method is used to slice the translation subset within the language subset. <p>
- The recommended batch length is between 2000 and 3000. <p>
Important note: Google server will shut down your IP Address if you try to translate a massive number of tweets at the same time.

In [15]:
# %%
# 'de' is the language code for German (source language).
# german = tweets_text.query('lang =="de"')
# Create a new data frame for every batch or subset.
# Identify every subset by source language and batch number.
german_2 = tweets_text.query('lang =="de"').iloc[4240:6000]
german_3 = tweets_text.query('lang =="de"').iloc[6001:8001]
german_4 = tweets_text.query('lang =="de"').iloc[8002:10002]
# ...

The next step is to create an empty data frame to be populated with the original text and the translated text. <p>
Although this is not a necessary condition for the translation process, it will help us with version control.

In [16]:
# %%
# Generate empty dataframe with the columns "text_original" & "text_translated"
# Create a new data frame for every translation batch.
# Associate every data frame with its corresponding batch number.
df_Transl_2 = pd.DataFrame(columns=['text', 'text_translated'])
df_Transl_3 = pd.DataFrame(columns=['text', 'text_translated'])
df_Transl_4 = pd.DataFrame(columns=['text', 'text_translated'])
# ...

## Translation Process

In [None]:
# %%
# for loop, translation process
print('Beginning translation...')
start = time.time()

# Use the right batch to translate.
for i, tweet in enumerate(german_3.text):
    if str(tweet) == 'nan':
        print('Reading task completed')
        break
    translation = GoogleTranslator().translate(text=tweet)
    a = 1

    # In case of no success, retries up to six times
    while tweet == translation:
        print('Could not translate the row ' + str(i) + ', retry ' + str(a))
        translator = GoogleTranslator(service_urls=[
            'translate.google.com',
            'translate.google.de',
            'translate.google.co.uk',
            'translate.google.co.kr',
            'translate.google.com.ec',
            'translate.google.com.mx',
            'translate.google.com.uy',
            'translate.google.cn'
        ])
        translation = GoogleTranslator().translate(text=tweet)
        a += 1
        if a > 6: break

    # Check if the text was translated
    if tweet == translation:
        print('Translation attempted on: ' + str(tweet) + ' Returned: ' + str(translation))
    print(i)
    # Populate Data Frame with the original text and the translation
    df_Transl_3.loc[i] = [tweet, translation]
print('... Task completed.')
end = time.time()
print("The time of execution is: ", end-start)


Beginning translation...
0
1
2
3
4
5
6
Could not translate the row 7, retry 1
Could not translate the row 7, retry 2
Could not translate the row 7, retry 3
Could not translate the row 7, retry 4
Could not translate the row 7, retry 5
Could not translate the row 7, retry 6
Translation attempted on: RT @dziedzic_ewa: Proteste in #Russland ein Update #UkraineRussia Returned: RT @dziedzic_ewa: Proteste in #Russland ein Update #UkraineRussia
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
Could not translate the row 88, retry 1
Could not translate the row 88, retry 2
Could not translate the row 88, retry 3
Could not translate the row 88, retry 4
Could not translate the row 88, retry 5
Could not translate the row 88, retry 6
Translation attempted on: Anton Panchev - Asst. Prof. Department of Balkan Stud

## Saving the translations
We need to merge the translations with the subset data frame for every batch. <p>
Suggestion: Check and double-check that nothing funky is going on with the translations and that everything is in the right place. e.g. Duplicated tweets, missing tweets, translations not matching the original text, etc.

In [None]:
# Merge the DataFrames in order to have the translations in the same DataFrame
german_2 = pd.merge(german_2, df_Transl_2, on='text')
german_3 = pd.merge(german_3, df_Transl_3, on='text')
german_4 = pd.merge(german_4, df_Transl_4, on='text')
# ...

Optional: Change the order of the columns to get a better visualization of the data.
Just write the column names in the order you want them to appear in the data frame.

In [None]:
#%%
# Change the order of the DF columns for ease of comparison.
german_3 = german_3[['true_author_id', 'name', 'username', 'day', 'month',
                     'year', 'dob', 'full_name', 'sex', 'country', 'nat_party',
                     'nat_party_abb', 'eu_party_group', 'eu_party_abbr',
                     'commission_dummy', 'party_id', 'engparty', 'party',
                     'eu_position', 'lrgen', 'lrecon', 'galtan',
                     'eu_eu_position', 'eu_lrgen', 'eu_lrecon', 'eu_galtan',
                     'lang', 'text', 'text_translated', 'id',
                     'public_metrics.retweet_count',
                     'public_metrics.reply_count', 'public_metrics.like_count',
                     'public_metrics.quote_count', 'conversation_id', 'source',
                     'in_reply_to_user_id', 'geo.place_id',
                     'geo.coordinates.type','created_at'
                     ]]

Save the data into a .csv file for storage purposes.

In [None]:
#%%
# Save file
german_3.to_csv('german_3_transl.csv', index=False, encoding='utf-8-sig')