<a href="https://colab.research.google.com/github/d-triana/MEPs/blob/main/tweets_translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
"""
Created on Wed Jul 27 11:50:55 2022
@author: Daniel Triana
"""

# Automated translation using Python
This is the .ipynb version of the Python script for automated translation. <p>
This script builds on the Deep-Translator tool created by Nidhal Baccouri:
https://pypi.org/project/deep-translator/

In [None]:
# %% Import relevant packages
from typing import Union, Any
import pandas as pd
from pandas import DataFrame, Series
from pandas.core.generic import NDFrame
from pandas.io.parsers import TextFileReader
import numpy as np
import matplotlib.pyplot as plt
import pyreadr
import deep_translator
import deep_translator.base
import deep_translator.exceptions
from deep_translator import GoogleTranslator, single_detection, batch_detection
import requests
import time

## Loading the Database
To work with the database, I've created a simplified .csv version of the full .Rda original file. Beware that while working in Colab, you'll need to upload the .csv file every time you start the Colab Notebook.
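
Identifier columns such as tweet and user IDs are long digit strings, and pandas may load a column as floating point when it contains empty cells, which can alter the digits. A minimal sketch of one way to avoid that, assuming the file has columns like the ones below (the inline sample data is made up for illustration): pass `dtype=str` for those columns to `pd.read_csv`.

```python
import io
import pandas as pd

# Made-up sample standing in for tweets_text.csv (illustrative values only)
sample_csv = io.StringIO(
    "id,in_reply_to_user_id,text\n"
    "1552345678901234567,,Guten Morgen\n"
    "1552345678901234568,1234567890123456789,Danke\n"
)

# Read the identifier columns as text so their digits are kept exactly
id_columns = ['id', 'in_reply_to_user_id']
sample = pd.read_csv(sample_csv, dtype={col: str for col in id_columns})

print(sample['in_reply_to_user_id'].tolist())
```

With this approach an empty cell stays a missing value instead of becoming the string 'nan', so later checks should use `pd.isna`.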

In [None]:
# %%
# Load the database
tweets_text: DataFrame = pd.read_csv('tweets_text.csv', low_memory=False)
# Cast the identifier columns to strings so they are treated as labels
id_columns = ['true_author_id', 'id', 'conversation_id', 'commission_dummy',
              'party_id', 'in_reply_to_user_id']
tweets_text[id_columns] = tweets_text[id_columns].astype(str)

The 'language_codes.csv' file is included as a reference for the international language codes. Although uncommon, the Twitter language codes in the database can differ from this reference, so check which code the database uses for the language you want to translate.<p>
Don't forget to upload the file every time you open the Notebook.

In [None]:
# File with the international language codes for reference
lang_codes = pd.read_csv(r'language_codes.csv')

Since we are going to translate slices of specific languages, we'll need to know how many tweets per language are in the database.

In [None]:
# Count how many tweets per language are in the database
tweets_per_language = tweets_text['lang'].value_counts()

## Filter the Language Subset
- Filter or slice the relevant set of tweets to be translated. <p>
- The .iloc method is used to slice the translation subset within the language subset. <p>
- The recommended batch length is between 2000 and 3000. <p>
Important note: Google may block your IP address if you try to translate a massive number of tweets at once.
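
The batch boundaries can also be computed instead of typed by hand. A minimal sketch, assuming a hypothetical helper `batch_bounds` (not part of pandas or deep-translator) and example numbers chosen only for illustration:

```python
def batch_bounds(n_rows, batch_size, start=0):
    """Yield (lower, upper) positions covering start..n_rows-1 without gaps."""
    for lower in range(start, n_rows, batch_size):
        # The upper bound is exclusive, matching how .iloc slicing works
        yield lower, min(lower + batch_size, n_rows)


# Example: 10,000 tweets in one language, resuming at position 4,000
bounds = list(batch_bounds(n_rows=10_000, batch_size=2_000, start=4_000))
for lower, upper in bounds:
    print(lower, upper)
```

Each pair can be passed to `.iloc`, for example `tweets_text.query('lang == "de"').iloc[lower:upper]`. Because `.iloc` excludes the end position, consecutive batches neither overlap nor skip a row.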

In [None]:
# %%
# 'de' is the language code for German (source language).
# german = tweets_text.query('lang =="de"')
# Create a new data frame for every batch or subset.
# Identify every subset by source language and batch number.
# .iloc excludes the end position, so each batch starts where the previous one stops.
german_2 = tweets_text.query('lang =="de"').iloc[4240:6001]
german_3 = tweets_text.query('lang =="de"').iloc[6001:8001]
german_4 = tweets_text.query('lang =="de"').iloc[8001:10001]
# ...

The next step is to create an empty data frame to be populated with the original text and the translated text. <p>
Although this is not a necessary condition for the translation process, it will help us with version control.
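
An alternative way to fill such a data frame is to collect the rows in a plain list and build the data frame once the loop has finished, which avoids growing it row by row. A minimal sketch, where `fake_translate` is a hypothetical stand-in for the real translator call and the sample texts are made up:

```python
import pandas as pd


def fake_translate(text):
    """Hypothetical stand-in for a real translator call."""
    return text.upper()


# Made-up batch of tweets
tweets = pd.Series(['guten morgen', 'danke'])

# Collect one dictionary per tweet, then build the data frame in one step
rows = []
for tweet in tweets:
    rows.append({'text': tweet, 'text_translated': fake_translate(tweet)})

df_transl = pd.DataFrame(rows, columns=['text', 'text_translated'])
print(df_transl.shape)
```

The list can also be written to disk at intervals, so that an interrupted run does not lose the translations already made.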

In [None]:
# %%
# Generate empty dataframe with the columns "text" & "text_translated"
# Create a new data frame for every translation batch.
# Associate every data frame with its corresponding batch number.
df_Transl_2 = pd.DataFrame(columns=['text', 'text_translated'])
df_Transl_3 = pd.DataFrame(columns=['text', 'text_translated'])
df_Transl_4 = pd.DataFrame(columns=['text', 'text_translated'])
# ...

## Translation Process
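The loop in this section retries a tweet whose translation comes back identical to the original. A minimal sketch of that retry step as a reusable function with a growing pause between attempts; the helper name `translate_with_retry`, its parameters, and the stand-in translator are assumptions for illustration, not part of deep-translator:

```python
import time


def translate_with_retry(translate, text, max_retries=6, base_delay=1.0,
                         sleep=time.sleep):
    """Call translate(text), retrying while the result is unchanged or fails.

    Returns the translation, or the original text if every attempt fails.
    """
    for attempt in range(max_retries + 1):
        try:
            result = translate(text)
        except Exception as error:  # e.g. network problems or rate limiting
            print('Attempt ' + str(attempt + 1) + ' failed: ' + str(error))
            result = text
        if result is not None and result != text:
            return result
        if attempt < max_retries:
            # Wait 1, 2, 4, ... times base_delay before the next attempt
            sleep(base_delay * 2 ** attempt)
    return text


# Stand-in translator for demonstration only (no network access needed)
demo_dictionary = {'Guten Morgen': 'Good morning'}
print(translate_with_retry(lambda t: demo_dictionary.get(t, t), 'Guten Morgen'))
```

With deep-translator, the first argument could be `GoogleTranslator(source='auto', target='en').translate`. Texts that legitimately stay the same, such as a tweet containing only a URL, still go through every retry, so a smaller `max_retries` may be preferable.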

In [None]:
# %%
# for loop, translation process
print('Beginning translation...')
start = time.time()

# Reuse one translator: detect the source language and translate to English
# (these are deep-translator's defaults, written out for clarity).
translator = GoogleTranslator(source='auto', target='en')

# Use the right batch to translate.
for i, tweet in enumerate(german_2.text):
    # Stop at the first missing text
    if pd.isna(tweet):
        print('Reading task completed')
        break
    translation = translator.translate(text=tweet)
    a = 1

    # If the text comes back unchanged, retry up to six times
    while tweet == translation:
        print('Could not translate the row ' + str(i) + ', retry ' + str(a))
        time.sleep(1)  # Short pause before retrying
        translation = translator.translate(text=tweet)
        a += 1
        if a > 6:
            break

    # Check if the text was translated
    if tweet == translation:
        print('Translation attempted on: ' + str(tweet) + ' Returned: ' + str(translation))
    print(i)
    # Populate Data Frame with the original text and the translation
    df_Transl_2.loc[i] = [tweet, translation]
print('... Task completed.')
end = time.time()
print("The time of execution is: ", end-start)


## Saving the translations
We need to merge the translations with the subset data frame for every batch. <p>
Suggestion: Check and double-check that nothing funky is going on with the translations and that everything is in the right place, e.g. duplicated tweets, missing tweets, or translations not matching the original text.
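
Some of these checks can be automated. A minimal sketch with made-up data, assuming the translation data frame has the columns `text` and `text_translated` as in this notebook: it shows how repeated tweet texts multiply rows in a plain merge, and one way to guard against it.

```python
import pandas as pd

# Made-up batch in which the same text appears twice
batch = pd.DataFrame({'id': ['1', '2', '3'],
                      'text': ['Guten Morgen', 'Danke', 'Guten Morgen']})
translations = pd.DataFrame({
    'text': ['Guten Morgen', 'Danke', 'Guten Morgen'],
    'text_translated': ['Good morning', 'Thanks', 'Good morning'],
})

# A plain merge pairs every repeated text with every matching translation
naive = pd.merge(batch, translations, on='text')
print(len(batch), len(naive))

# Keep one translation per distinct text, then merge and let pandas verify
# that each text matches at most one translation row
unique_translations = translations.drop_duplicates(subset='text')
merged = pd.merge(batch, unique_translations, on='text', how='left',
                  validate='many_to_one')

# Checks: no rows gained or lost, no missing or unchanged translations
rows_match = len(merged) == len(batch)
n_missing = int(merged['text_translated'].isna().sum())
n_unchanged = int((merged['text'] == merged['text_translated']).sum())
print(rows_match, n_missing, n_unchanged)
```

The left merge keeps the batch rows in their original order, and `validate='many_to_one'` raises an error if a text still maps to more than one translation.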

In [None]:
# Merge the DataFrames in order to have the translations in the same DataFrame
# Each batch is merged with its own translation data frame.
german_2 = pd.merge(german_2, df_Transl_2, on='text')
german_3 = pd.merge(german_3, df_Transl_3, on='text')
german_4 = pd.merge(german_4, df_Transl_4, on='text')
# ...

Optional: Change the order of the columns to get a better visualization of the data.
Just write the column names in the order you want them to appear in the data frame.

In [None]:
#%%
# Change the order of the DF columns for ease of comparison.
german_2 = german_2[['true_author_id', 'name', 'username', 'day', 'month',
                     'year', 'dob', 'full_name', 'sex', 'country', 'nat_party',
                     'nat_party_abb', 'eu_party_group', 'eu_party_abbr',
                     'commission_dummy', 'party_id', 'engparty', 'party',
                     'eu_position', 'lrgen', 'lrecon', 'galtan',
                     'eu_eu_position', 'eu_lrgen', 'eu_lrecon', 'eu_galtan',
                     'lang', 'text', 'text_translated', 'id',
                     'public_metrics.retweet_count',
                     'public_metrics.reply_count', 'public_metrics.like_count',
                     'public_metrics.quote_count', 'conversation_id', 'source',
                     'in_reply_to_user_id', 'geo.place_id',
                     'geo.coordinates.type', 'created_at'
                     ]]

Save the data into a .csv file for storage purposes.

In [None]:
#%%
# Save file
german_2.to_csv('german_2_translated.csv', index=False, encoding='utf-8-sig')