## Step 0: Defining parameters
Please modify the parameters as follows:
1. `country` is the country that you are working on.

2. `mode` can take one of two values: "9to5" or "overnight". Don't forget to use double quotes.

Most of you will only use the "9to5" value, which tells the code to translate only 1,000 articles. That many articles take between 3 and 4 hours to run, so this mode is ideal when you are in the office. Some of us (I'm looking at you, DAU) will leave the code running the whole night to translate 4,000 articles every night. You are welcome to use the "overnight" mode if you want to finish your batch quickly (although we are not expecting you to finish it... unless you are an active member of the DAU).

3. `counter_day` and `counter_night` count how many times you have SUCCESSFULLY executed the code using the "9to5" or the "overnight" mode, respectively. If this is the first time you are running the code, both counters should be zero. If you have already SUCCESSFULLY run the code once using the "9to5" mode, then `counter_day` should be equal to one. If the code stopped or you had an issue and the code did not finish running, then it DOES NOT count as a SUCCESSFUL execution.

4. `follow_up` can take two values: True or False. Set it to True if you have already finished a translation batch and your session is still running (and your master data is already loaded). This way you don't need to execute all cells again, only the ones required. If this is the first time you are running this script during your current session, then set this value to False. No double quotes needed.

5. `suffix` can take one of two values: "tp" or "retrans". It selects which input file is read in Step 3 (`{country}_{suffix}.parquet.gzip`). Don't forget to use double quotes.
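Because a typo in any of these parameters only surfaces several cells later, a quick sanity check right after setting them can save a wasted run. The following is a minimal sketch, not part of the original notebook: `check_parameters` is a hypothetical helper, and the allowed values for `suffix` are taken from the comment in the parameter cell.

```python
def check_parameters(mode, counter_day, counter_night, follow_up, suffix):
    """Raise ValueError if a Step 0 parameter has an unexpected value (hypothetical helper)."""
    if mode not in ("9to5", "overnight"):
        raise ValueError(f'mode must be "9to5" or "overnight", got {mode!r}')

    for name, value in (("counter_day", counter_day), ("counter_night", counter_night)):
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError(f"{name} must be a non-negative integer, got {value!r}")

    if not isinstance(follow_up, bool):
        raise ValueError(f"follow_up must be True or False (no quotes), got {follow_up!r}")

    if suffix not in ("tp", "retrans"):
        raise ValueError(f'suffix must be "tp" or "retrans", got {suffix!r}')


# Passes silently with the values used in this notebook
check_parameters(mode="9to5", counter_day=1, counter_night=0,
                 follow_up=True, suffix="retrans")
```

Calling it with `follow_up="True"` (a string instead of a boolean) would raise a `ValueError` instead of silently skipping cells later on.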

In [115]:
country       = "Luxembourg"
mode          = "9to5" # One of two values: "9to5" OR "overnight"
counter_day   = 1
counter_night = 0
follow_up     = True
suffix        = "retrans" # One of two values: "tp" OR "retrans"


## Step 1: Installing and importing required libraries

In [105]:
if not follow_up:
  !pip install nltk
  !pip install deep_translator



In [106]:
if not follow_up:
  import os
  import pandas as pd
  import nltk
  from nltk.tokenize import sent_tokenize
  from deep_translator import GoogleTranslator
  nltk.download("punkt")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ctoruno\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Step 2: Defining functions

In [107]:
def trans2english(text, sourcelang):
    """
    Takes a text in a given language and returns its English equivalent
    using the Google translation engine.

    Parameters:
        text:       String. Text to translate.
        sourcelang: String. Code of the source language you want to translate the text from.

    Returns:
        String. The translated text, or a message explaining why no translation was produced.
    """
    if not text:
        return "No information available. No translation performed"
    try:
        # Translate sentence by sentence, then stitch the results back together
        sentences = sent_tokenize(text)
        batch = GoogleTranslator(source = sourcelang, target = "en").translate_batch(sentences)
        return " ".join(batch)
    except Exception as e:
        # Return the error as text so that one failed article does not stop the whole run
        return f"Translation through API failed. Reason: {e}"
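Since `trans2english` catches every exception and returns the error as text, a temporary network problem ends up stored in the translated column. One possible mitigation is to retry the failing call a few times before giving up. The sketch below is an assumption, not something the notebook does: `with_retries` is a hypothetical helper, and `flaky_translate` is a stand-in that simulates a translator failing twice so the example runs without network access. In the notebook, the wrapper would have to go around the `translate_batch` call inside the `try` block, because `trans2english` itself never raises.

```python
import time


def with_retries(func, attempts=3, wait_seconds=2.0):
    """Wrap func so it is called up to `attempts` times before the error is re-raised."""
    def wrapper(*args, **kwargs):
        for attempt in range(1, attempts + 1):
            try:
                return func(*args, **kwargs)
            except Exception:
                if attempt == attempts:
                    raise  # out of attempts: let the caller handle the error
                time.sleep(wait_seconds * attempt)  # wait a bit longer after each failure
    return wrapper


# Stand-in for a translator call: fails on the first two calls, then succeeds
calls = {"n": 0}

def flaky_translate(sentences):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network failure")
    return [s.upper() for s in sentences]


safe_translate = with_retries(flaky_translate, attempts=3, wait_seconds=0)
print(safe_translate(["bonjour", "merci"]))  # ['BONJOUR', 'MERCI']
```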

## Step 3: Reading the data

In [108]:
if not follow_up:
  master_data = pd.read_parquet(f"../data/data-extraction-1/data4translation/{country}_{suffix}.parquet.gzip")

## Step 4: Subsetting the data

In [109]:
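# Number of articles translated per run in each mode
batch_sizes = {"9to5": 500, "overnight": 2000}
if mode not in batch_sizes:
    raise ValueError('mode must be "9to5" or "overnight"')

batch        = counter_day + counter_night + 1
batch_size   = batch_sizes[mode]
mode_counter = counter_day if mode == "9to5" else counter_night

# Skip the rows already covered by previous successful runs
starting_row = (counter_day * batch_sizes["9to5"]) + (counter_night * batch_sizes["overnight"])
final_row    = starting_row + batch_size
batch_subset = master_data.iloc[starting_row:final_row].copy()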
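To see which rows a given run will pick up, the arithmetic in the cell above can be isolated in a small function. This is only an illustrative sketch: `batch_bounds` is a hypothetical helper that reproduces the same formula with the batch sizes used in the cell (500 and 2000).

```python
BATCH_SIZES = {"9to5": 500, "overnight": 2000}


def batch_bounds(mode, counter_day, counter_night):
    """Return (batch number, first row, one-past-last row) for the next run."""
    if mode not in BATCH_SIZES:
        raise ValueError(f"unknown mode: {mode!r}")
    # Rows already consumed by previous successful runs
    start = counter_day * BATCH_SIZES["9to5"] + counter_night * BATCH_SIZES["overnight"]
    end = start + BATCH_SIZES[mode]
    batch = counter_day + counter_night + 1
    return batch, start, end


# One successful "9to5" run so far, next run is also "9to5"
print(batch_bounds("9to5", counter_day=1, counter_night=0))       # (2, 500, 1000)

# Two "9to5" runs and one "overnight" run so far, next run is "overnight"
print(batch_bounds("overnight", counter_day=2, counter_night=1))  # (4, 3000, 5000)
```

The end row is exclusive, matching how `iloc[starting_row:final_row]` slices, so consecutive batches never overlap as long as the counters are updated only after successful runs.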

In [110]:
batch_subset.head(3)

Unnamed: 0,id,link,domain_url,published_date,title,description,content,language,is_opinion,country,title_trans,description_trans,content_trans
384,8ef63439d2d22bac9de591e10f68edc8,https://lequotidien.lu/a-la-une/le-luxembourg-...,lequotidien.lu,2024-02-07 16:46:43,Le Luxembourg au Laos : «Ce projet a changé no...,Ils ne sont situés qu'à une trentaine de kilom...,Ils ne sont situés qu'à une trentaine de kilom...,fr,False,Luxembourg,Luxembourg in Laos: “This project has changed ...,They are only located about thirty kilometers ...,Translation through API failed. Reason: HTTPSC...
385,9e79ce3944616c2f1a2981a4c2c780f7,https://lequotidien.lu/economie/totalenergies-...,lequotidien.lu,2024-02-07 11:30:03,TotalEnergies dégage un nouveau bénéfice recor...,"Après une année 2022 historique, le français T...",La 4e major mondiale a amélioré son bénéfice n...,fr,False,Luxembourg,Translation through API failed. Reason: HTTPSC...,Translation through API failed. Reason: HTTPSC...,Translation through API failed. Reason: HTTPSC...
386,b7a023517bf6cfd8b2c51ee4f3a551a0,https://lequotidien.lu/sport-national/basket-d...,lequotidien.lu,2024-02-07 05:00:57,[Basket] Delgado n'a plus de temps à perdre,Tous les observateurs disent de lui que c'est ...,Tous les observateurs disent de lui que c'est ...,fr,False,Luxembourg,Translation through API failed. Reason: HTTPSC...,Translation through API failed. Reason: HTTPSC...,Translation through API failed. Reason: HTTPSC...


## Step 5: Translating headline, description, and content

In [111]:
batch_subset[["title_trans", "description_trans", "content_trans"]] = batch_subset.apply(
    lambda row: row[["title", "description", "content"]].apply(lambda x: trans2english(text = x, sourcelang = row["language"])),
    axis = 1
)
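Failed translations are stored as text starting with "Translation through API failed", so it can be worth checking how many slipped into a batch before saving it. A minimal sketch, assuming that prefix stays exactly as written in `trans2english`; `is_failed_translation` is a hypothetical helper and the "timeout" reason below is made up for the example.

```python
FAILURE_PREFIX = "Translation through API failed"


def is_failed_translation(value):
    """True if value is the error message that trans2english returns on failure."""
    return isinstance(value, str) and value.startswith(FAILURE_PREFIX)


examples = [
    "TotalEnergies posts new record profit in 2023",
    "Translation through API failed. Reason: timeout",  # made-up failure reason
    None,
]
print([is_failed_translation(v) for v in examples])  # [False, True, False]
```

In the notebook, something like `batch_subset["content_trans"].map(is_failed_translation).sum()` would then give the number of articles whose content was not translated.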

## Step 6: Saving batch data

In [112]:
batch_subset.to_parquet(f"/users/ctoruno/Downloads/{country}_batch_{batch}_{mode}_{mode_counter}.parquet.gzip", compression = "gzip")
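The output path above is tied to one user's Downloads folder. If the notebook is shared, the file name could be built separately from the folder, as in this sketch. `batch_output_path` is a hypothetical helper, the folder is an assumption to adapt to your own machine, and the file name follows the same pattern as the cell above.

```python
from pathlib import Path


def batch_output_path(output_dir, country, batch, mode, mode_counter):
    """Build the path of a batch file inside output_dir, using the notebook's naming pattern."""
    filename = f"{country}_batch_{batch}_{mode}_{mode_counter}.parquet.gzip"
    return Path(output_dir) / filename


# Assumption: batches are saved in the current user's Downloads folder
path = batch_output_path(Path.home() / "Downloads", "Luxembourg", 2, "9to5", 1)
print(path.name)  # Luxembourg_batch_2_9to5_1.parquet.gzip
```

The resulting path can be passed directly to `batch_subset.to_parquet(path, compression = "gzip")`.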

In [113]:
batch_subset.head(5)

Unnamed: 0,id,link,domain_url,published_date,title,description,content,language,is_opinion,country,title_trans,description_trans,content_trans
384,8ef63439d2d22bac9de591e10f68edc8,https://lequotidien.lu/a-la-une/le-luxembourg-...,lequotidien.lu,2024-02-07 16:46:43,Le Luxembourg au Laos : «Ce projet a changé no...,Ils ne sont situés qu'à une trentaine de kilom...,Ils ne sont situés qu'à une trentaine de kilom...,fr,False,Luxembourg,Luxembourg in Laos: “This project has changed ...,They are only located about thirty kilometers ...,They are only located about thirty kilometers ...
385,9e79ce3944616c2f1a2981a4c2c780f7,https://lequotidien.lu/economie/totalenergies-...,lequotidien.lu,2024-02-07 11:30:03,TotalEnergies dégage un nouveau bénéfice recor...,"Après une année 2022 historique, le français T...",La 4e major mondiale a amélioré son bénéfice n...,fr,False,Luxembourg,TotalEnergies posts new record profit in 2023,"After a historic year 2022, French company Tot...",The 4th largest global major improved its net ...
386,b7a023517bf6cfd8b2c51ee4f3a551a0,https://lequotidien.lu/sport-national/basket-d...,lequotidien.lu,2024-02-07 05:00:57,[Basket] Delgado n'a plus de temps à perdre,Tous les observateurs disent de lui que c'est ...,Tous les observateurs disent de lui que c'est ...,fr,False,Luxembourg,[Basketball] Delgado has no more time to waste,All observers say that he is undoubtedly one o...,All observers say that he is undoubtedly one o...
387,7ca16dd748dae9faf097fc6d46cdbd4c,https://lequotidien.lu/a-la-une/omnisports-geo...,lequotidien.lu,2024-02-07 05:00:49,[Omnisports] Georges Mischo : «On est sur le b...,"C'est à l'INS, hier en fin de matinée, que Geo...","C'est à l'INS, hier en fin de matinée, que Geo...",fr,False,Luxembourg,[Omnisports] Georges Mischo: “We are on the ri...,"It was at the INS, late yesterday morning, tha...","It was at the INS, late yesterday morning, tha..."
388,2653f429c9f8eae273fb0a0bdd7be309,https://lequotidien.lu/monde/surete-nucleaire-...,lequotidien.lu,2024-02-07 00:00:00,Sûreté nucléaire en France : un projet de réfo...,"Fusionner l'ASN, gendarme du nucléaire, avec l...",Décidé dans le huis clos d'un conseil de polit...,fr,False,Luxembourg,Nuclear safety in France: a contested reform p...,"Merge ASN, nuclear policeman, with IRSN, safet...",Decided behind the closed doors of a nuclear p...
