## Step 0: Defining parameters
Please set the following parameters:
1. `country`: the country you are working on.

2. `mode`: one of two values, "9to5" or "overnight". Don't forget the double quotes.

Most of you will only use the "9to5" value, which tells the code to translate only 1,000 articles. A batch of this size takes between 3 and 4 hours to run, so it is ideal when you are in the office. Some of us (I'm looking at you, DAU) leave the code running overnight to translate 4,000 articles every night. You are welcome to use the "overnight" mode if you want to finish your batch quickly (although we are not expecting you to finish it... unless you are an active member of the DAU).

3. `counter_day` and `counter_night` count how many times you have SUCCESSFULLY executed the code in the "9to5" and "overnight" modes, respectively. If this is the first time you are running the code, both counters should be zero. If you have already SUCCESSFULLY run the code once in the "9to5" mode, then `counter_day` should be equal to one. If the code stopped or you had an issue and the code did not finish running, then it DOES NOT count as a SUCCESSFUL execution.

4. `follow_up` can take two values: True or False. Set it to True if you have already finished a translation batch and your session is still running (and your master data is already uploaded). This way you don't need to execute all cells again, only the ones required. If this is the first time you are running this script during your current session, set this value to False. No double quotes needed.

In [None]:
country       = "Luxembourg"
mode          = "9to5" # One of two values: "9to5" OR "overnight"
counter_day   = 0
counter_night = 0
follow_up     = False
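
Since a typo in these parameters only shows up later in the run, it may help to sanity-check them first. The sketch below is not part of the original notebook: `check_parameters` is a hypothetical helper that simply restates the rules described above and returns a list of problems (an empty list means everything looks fine).

```python
def check_parameters(country, mode, counter_day, counter_night, follow_up):
    """Hypothetical helper: return a list of problems found in the Step 0 parameters."""
    problems = []
    if not isinstance(country, str) or not country.strip():
        problems.append("country must be a non-empty string")
    if mode not in ("9to5", "overnight"):
        problems.append('mode must be "9to5" or "overnight"')
    for name, value in (("counter_day", counter_day), ("counter_night", counter_night)):
        # bool is a subclass of int in Python, so it is excluded explicitly
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            problems.append(f"{name} must be a non-negative integer")
    if not isinstance(follow_up, bool):
        problems.append("follow_up must be True or False (no quotes)")
    return problems


print(check_parameters("Luxembourg", "9to5", 0, 0, False))  # prints []
```

If the list is not empty, fix the parameter cell before running anything else.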


## Step 1: Installing and importing required libraries

In [None]:
if not follow_up:
  !pip install nltk
  !pip install deep_translator

Collecting deep_translator
  Downloading deep_translator-1.11.4-py3-none-any.whl (42 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.3/42.3 kB 1.6 MB/s eta 0:00:00
Installing collected packages: deep_translator
Successfully installed deep_translator-1.11.4


In [None]:
if not follow_up:
  import os
  import pandas as pd
  import nltk
  from google.colab import files
  from nltk.tokenize import sent_tokenize
  from deep_translator import GoogleTranslator
  nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


## Step 2: Defining functions

In [None]:
def trans2english(text, sourcelang):
    """
    Translate a text from a given source language into English using the
    Google translation engine.

    Parameters:
        text:       String. Text to translate.
        sourcelang: String. Code of the source language you want to translate the text from.

    Returns:
        String. The English translation, or a message explaining why no translation was produced.
    """
    if text:
        try:
            # Translate sentence by sentence, then join the results back into one string
            sentences = sent_tokenize(text)
            batch  = GoogleTranslator(source = sourcelang, target = "en").translate_batch(sentences)
            result = " ".join(batch)
            return result
        except Exception as e:
            # Return the failure reason as text so one bad article does not stop the whole batch
            out = f"Translation through API failed. Reason: {e}"
            return out
    else:
        return "No information available. No translation performed"
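
The split, translate, and join pattern used in `trans2english` can be tried offline with the sketch below. It makes two assumptions that are not in the notebook: a naive regular-expression splitter stands in for `sent_tokenize`, and `toy_translator` is a hypothetical lookup table standing in for the Google translation engine.

```python
import re


def translate_by_sentence(text, translate_sentence):
    """Split text into sentences, translate each one, and join the results with spaces."""
    if not text:
        return "No information available. No translation performed"
    # Naive splitter used only for this sketch; the notebook uses nltk's sent_tokenize
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(translate_sentence(s) for s in sentences)


# Hypothetical stand-in for the translation engine: a tiny lookup table
toy_dictionary = {"Bonjour.": "Hello.", "Merci beaucoup.": "Thank you very much."}


def toy_translator(sentence):
    # Sentences missing from the table are returned unchanged
    return toy_dictionary.get(sentence, sentence)


print(translate_by_sentence("Bonjour. Merci beaucoup.", toy_translator))
# prints: Hello. Thank you very much.
```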

## Step 3: Reading the data

In [None]:
if not follow_up:
  master_data = pd.read_parquet(f"{country}_tp.parquet.gzip")

## Step 4: Subsetting the data

In [None]:
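batch = counter_day + counter_night + 1
if mode == "9to5":
    batch_size   = 500
    mode_counter = counter_day
elif mode == "overnight":
    batch_size   = 2000
    mode_counter = counter_night
else:
    # Fail early instead of hitting an undefined batch_size further down
    raise ValueError('mode must be "9to5" or "overnight"')

# Skip the rows already covered by previous successful runs
starting_row = (counter_day*500)+(counter_night*2000)
final_row    = starting_row+batch_size
# Slice first, then copy, so only the batch rows are duplicated in memory
batch_subset = master_data.iloc[starting_row:final_row].copy()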
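
To see which rows a given combination of counters selects, the arithmetic in the cell above can be reproduced on its own. This is only a sketch: `batch_rows` and `BATCH_SIZES` are hypothetical names, and the sizes are the ones hard-coded in the cell (500 and 2,000 rows).

```python
BATCH_SIZES = {"9to5": 500, "overnight": 2000}  # values hard-coded in the cell above


def batch_rows(mode, counter_day, counter_night):
    """Return (starting_row, final_row) for the next batch; final_row is exclusive."""
    starting_row = counter_day * BATCH_SIZES["9to5"] + counter_night * BATCH_SIZES["overnight"]
    return starting_row, starting_row + BATCH_SIZES[mode]


print(batch_rows("9to5", 0, 0))       # prints (0, 500)
print(batch_rows("9to5", 1, 0))       # prints (500, 1000)
print(batch_rows("overnight", 2, 1))  # prints (3000, 5000)
```

This is why the counters must only include SUCCESSFUL runs: counting a failed run would skip rows that were never translated.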

In [None]:
batch_subset.head(3)

Unnamed: 0,id,link,domain_url,published_date,title,description,content,language,is_opinion,country
0,3357c7b08e7dfa3270a8c025e0fd3bd5,https://lequotidien.lu/sport-national/basket-f...,lequotidien.lu,2024-01-08 12:24:53,[Basket] Flammang jette l'éponge,Christophe Flammang n'est plus l'entraîneur du...,Christophe Flammang n'est plus l'entraîneur du...,fr,False,Luxembourg
1,653f009e5e3a03f7af7bef9585979219,https://lequotidien.lu/monde/decollage-dune-no...,lequotidien.lu,2024-01-08 08:00:18,Décollage d'une nouvelle fusée transportant un...,Une toute nouvelle fusée a décollé de Floride ...,La fusée Vulcan Centaur du groupe industriel U...,fr,False,Luxembourg
2,55c9c40a92d9401a837abc79d14422e3,https://lequotidien.lu/politique-societe/fried...,lequotidien.lu,2024-01-08 08:00:00,Frieden relaie Bettel à Berlin,"Vendredi, l'ancien Premier ministre Xavier Bet...","Vendredi, l'ancien Premier ministre Xavier Bet...",fr,False,Luxembourg


## Step 5: Translating headline, description, and content

In [None]:
batch_subset[["title_trans", "description_trans", "content_trans"]] = batch_subset.apply(
    lambda row: row[["title", "description", "content"]].apply(lambda x: trans2english(text = x, sourcelang = row["language"])),
    axis = 1
)
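
The nested `apply` above is compact but hard to read. The plain-Python sketch below shows what it does for a single row, assuming the row is represented as a dictionary. `translate_row` and `fake_translate` are hypothetical names; `fake_translate` only stands in for `trans2english` so the example runs without network access.

```python
def translate_row(row, translate):
    """Return the three *_trans values for one article, using the row's own language code."""
    return {
        f"{column}_trans": translate(text=row[column], sourcelang=row["language"])
        for column in ("title", "description", "content")
    }


# Hypothetical stand-in for trans2english so the sketch runs offline
def fake_translate(text, sourcelang):
    return f"[{sourcelang}->en] {text}"


article = {"title": "Titre", "description": "Résumé", "content": "Texte", "language": "fr"}
print(translate_row(article, fake_translate))
```

Each article is translated from the language recorded in its own `language` column, so a batch can mix source languages.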

## Step 6: Saving batch data

In [None]:
# Build the file name once so the saved file and the downloaded file always match
output_file = f"{country}_batch_{batch}_{mode}_{mode_counter}.parquet.gzip"
batch_subset.to_parquet(output_file, compression = "gzip")
files.download(output_file)
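
As a quick reference for how the output file is named, the sketch below rebuilds the file name from the same pieces used in the cell above. `batch_filename` is a hypothetical helper, not part of the notebook.

```python
def batch_filename(country, batch, mode, mode_counter):
    """Rebuild the output file name pattern used in Step 6."""
    return f"{country}_batch_{batch}_{mode}_{mode_counter}.parquet.gzip"


# First run ever, in "9to5" mode: batch = 0 + 0 + 1, mode_counter = counter_day = 0
print(batch_filename("Luxembourg", 1, "9to5", 0))
# prints: Luxembourg_batch_1_9to5_0.parquet.gzip

# After 2 day runs and 1 night run, in "overnight" mode: batch = 4, mode_counter = 1
print(batch_filename("Luxembourg", 4, "overnight", 1))
# prints: Luxembourg_batch_4_overnight_1.parquet.gzip
```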