## Step 0: Defining parameters
Please modify the parameters as follows:
1. `country` is the country you are working on.

2. `mode` can take one of two values: "9to5" or "overnight". Don't forget to use double quotes.

Most of you will only use the "9to5" value, which tells the code to translate only 500 articles. A batch of this size takes between 3 and 4 hours to run, so it is ideal when you are in the office. Some of us (I'm looking at you, DAU) leave the code running overnight to translate 2,000 articles each night. You are welcome to use the "overnight" mode if you want to finish your batch quickly (although we are not expecting you to finish it... unless you are an active member of the DAU).

3. `counter_day` and `counter_night` count how many times you have SUCCESSFULLY executed the code in the "9to5" or the "overnight" mode, respectively. If this is the first time you are running the code, both counters should be zero. If you have already SUCCESSFULLY run the code once in the "9to5" mode, then `counter_day` should be equal to one. If the code stopped or you had an issue and the code did not finish running, then it DOES NOT count as a SUCCESSFUL execution.

4. `folow_up` can take two values: True or False. Set it to True if you have already finished a translation batch and your session is still running (and your master data is already uploaded). This way you only need to execute the required cells instead of all of them. If this is the first time you are running this script during your current session, set this value to False. No double quotes needed.

In [64]:
country       = "Cyprus"
mode          = "overnight" # One of two values: "9to5" OR "overnight"
counter_day   = 49
counter_night = 0
suffix        = "tp" # One of two values: "tp" OR "retrans"
path2SP       = "/Users/ctoruno/OneDrive - World Justice Project/EU Subnational"
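
A typo in `mode` or a wrong counter changes which rows get translated, so a quick sanity check right after this cell can catch mistakes early. The sketch below is not part of the original notebook: `validate_parameters` is a hypothetical helper, and the allowed values are taken from the comments in the cell above.

```python
# Hypothetical helper (not in the original notebook) to sanity-check Step 0.
VALID_MODES = ("9to5", "overnight")
VALID_SUFFIXES = ("tp", "retrans")

def validate_parameters(mode, counter_day, counter_night, suffix):
    """Raise ValueError if any Step 0 parameter has an unexpected value."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    if suffix not in VALID_SUFFIXES:
        raise ValueError(f"suffix must be one of {VALID_SUFFIXES}, got {suffix!r}")
    for name, value in (("counter_day", counter_day), ("counter_night", counter_night)):
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError(f"{name} must be a non-negative integer, got {value!r}")
    return True

# Values from the cell above pass the check
validate_parameters("overnight", 49, 0, "tp")
```

Calling it with a misspelled mode such as `"9-5"` raises a `ValueError` before any data is read or translated.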

## Step 1: Installing and importing required libraries

In [65]:
import os
import pandas as pd
import nltk
from nltk.tokenize import sent_tokenize
from deep_translator import GoogleTranslator
nltk.download("punkt")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ctoruno\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Step 2: Defining functions

In [66]:
def trans2english(text, sourcelang):
    """
    Takes a text in a given language and returns its English equivalent
    using the Google translation engine.

    Parameters:
        text:       String. Text to translate.
        sourcelang: String. Code of the source language you want to translate the text from.

    Returns:
        String. The translated text, or a message explaining why no translation was produced.
    """
    if text:
        try:
            # Translate sentence by sentence, then stitch the pieces back together
            sentences = sent_tokenize(text)
            batch  = GoogleTranslator(source = sourcelang, target = "en").translate_batch(sentences)
            result = " ".join(batch)
            return result
        except Exception as e:
            # Return the failure reason as text so a single bad article does not stop the run
            out = f"Translation through API failed. Reason: {e}"
            return out
    else:
        return "No information available. No translation performed"
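
To see the split, translate, and rejoin pattern of `trans2english` without calling the translation service, the following offline sketch may help. It rests on loud assumptions: `split_sentences` is a naive regex splitter standing in for `sent_tokenize`, `fake_translate_batch` is a stand-in that upper-cases text instead of translating it, and `translate_by_sentence` is a hypothetical name. Only the control flow and the two fallback messages mirror the function above.

```python
import re

def split_sentences(text):
    """Naive stand-in for sent_tokenize: split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def translate_by_sentence(text, translate_batch):
    """Split text into sentences, pass them to translate_batch, and rejoin the results."""
    if not text:
        return "No information available. No translation performed"
    try:
        return " ".join(translate_batch(split_sentences(text)))
    except Exception as e:
        # Same idea as trans2english: report the failure as text instead of raising
        return f"Translation through API failed. Reason: {e}"

def fake_translate_batch(sentences):
    """Stand-in translator (assumption): upper-cases each sentence."""
    return [s.upper() for s in sentences]

print(translate_by_sentence("Hola mundo. Esto es una prueba.", fake_translate_batch))
# HOLA MUNDO. ESTO ES UNA PRUEBA.
```

Because failures come back as ordinary strings, a failed article still produces a value in the output column and can be located afterwards by its message.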

## Step 3: Reading the data

In [67]:
master_data = pd.read_parquet(f"{path2SP}/EU-S Data/Automated Qualitative Checks/Data/data-extraction-1/data4translation/{country}_{suffix}.parquet.gzip")

## Step 4: Subsetting the data

In [68]:
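# Number of rows translated per successful run in each mode
batch_sizes = {"9to5": 500, "overnight": 2000}
if mode not in batch_sizes:
    raise ValueError(f'mode must be "9to5" or "overnight", got {mode!r}')

batch        = counter_day + counter_night + 1
batch_size   = batch_sizes[mode]
mode_counter = counter_day if mode == "9to5" else counter_night

# Skip every row already covered by previous successful runs
starting_row = (counter_day * batch_sizes["9to5"]) + (counter_night * batch_sizes["overnight"])
final_row    = starting_row + batch_size
# Slice first, then copy, so only the batch is duplicated in memory
batch_subset = master_data.iloc[starting_row:final_row].copy()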

In [69]:
print(starting_row)
print(final_row)
print(master_data.shape)

36500
37000
(38118, 10)
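
The row arithmetic in Step 4 can be checked in isolation. The sketch below reuses the batch sizes from the cell above (500 rows per "9to5" run, 2000 per "overnight" run); `batch_bounds` is a hypothetical helper name, not something the notebook defines.

```python
# Batch sizes as used in Step 4
BATCH_SIZES = {"9to5": 500, "overnight": 2000}

def batch_bounds(mode, counter_day, counter_night):
    """Return (starting_row, final_row) for the next run in the given mode."""
    # Rows already translated by all previous successful runs
    start = counter_day * BATCH_SIZES["9to5"] + counter_night * BATCH_SIZES["overnight"]
    return start, start + BATCH_SIZES[mode]

print(batch_bounds("9to5", 0, 0))       # (0, 500)
print(batch_bounds("9to5", 3, 1))       # (3500, 4000)
print(batch_bounds("overnight", 3, 1))  # (3500, 5500)
```

If `final_row` is larger than the number of rows in `master_data`, the `.iloc` slice simply stops at the last row, so the final batch is shorter than the others.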


In [70]:
batch_subset.head(3)

|  | id | link | domain_url | published_date | title | description | content | language | is_opinion | country |
|---|---|---|---|---|---|---|---|---|---|---|
| 819 | af4cd57ba658c1b1efbb648cbb0e7754 | https://www.dnevnik.bg/evropa/2024/02/01/45808... | dnevnik.bg | 2024-02-01 00:00:00 | "Съвсем сам": три сценария за помощта на ЕС за... | Лидерът на Унгария вече ще трябва добре да доз... | Унгарският премиер Виктор Орбан ще е главният ... | bg | False | Bulgaria |
| 820 | 579e93b5c9f97ee7a94211c79d9be42a | https://www.dnevnik.bg/knigi/2024/02/01/458170... | dnevnik.bg | 2024-02-01 00:00:00 | "За евреите и други демони" ще бъде представен... | Книгата на Еми Барух трудно се побира в познат... | Книгата на Еми Барух трудно се побира в познат... | bg | False | Bulgaria |
| 821 | 6d495c7fa52310d625f867e061c8db7a | https://www.dnevnik.bg/sviat/2024/01/31/458289... | dnevnik.bg | 2024-01-31 19:51:00 | Силите на САЩ са елиминирали "непосредствена з... | Силите на САЩ са ударили и унищожили йеменска ... | Новоназначени бойци хути по време на церемония... | bg | False | Bulgaria |


## Step 5: Translating headline, description, and content

In [71]:
batch_subset[["title_trans", "description_trans", "content_trans"]] = batch_subset.apply(
    lambda row: row[["title", "description", "content"]].apply(lambda x: trans2english(text = x, sourcelang = row["language"])),
    axis = 1
)

## Step 6: Saving batch data

In [72]:
batch_subset.to_parquet(f"/users/ctoruno/Downloads/{country}_batch_{batch}_{mode}_{mode_counter}.parquet.gzip", compression = "gzip")

In [73]:
batch_subset["content_trans"].str.contains("API").value_counts()

content_trans
False    485
True      15
Name: count, dtype: int64
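
The check above flags any translated article that contains the substring "API", which could also match articles that legitimately mention an API. A stricter test, sketched here as an assumption rather than as part of the original notebook, matches only the failure message that `trans2england`'s counterpart in this notebook, `trans2english`, returns; `is_failed_translation` is a hypothetical helper.

```python
# Prefix of the message trans2english returns when the API call fails
FAILURE_PREFIX = "Translation through API failed"

def is_failed_translation(text):
    """True only for the failure message produced by trans2english."""
    return isinstance(text, str) and text.startswith(FAILURE_PREFIX)

samples = [
    "The API was redesigned last year.",                # genuine article text
    "Translation through API failed. Reason: timeout",  # failed translation
    None,                                               # missing value
]
print([is_failed_translation(s) for s in samples])
# [False, True, False]
```

In the notebook this could be applied with `batch_subset["content_trans"].map(is_failed_translation).value_counts()`, and the flagged rows are natural candidates for a later retranslation run.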