# Translate RAIS Datasets
The RAIS datasets come in Portuguese. This code translates all variable and value labels to English using Google Translate. The document first helps you set up your local development environment (and optionally Quest) to use Google Cloud APIs, then walks through the workflow.

## Setting Up Google Cloud
Before you begin, you will need to set up an account and project in Google Cloud. You can use a personal Google account for this step.

### 1: Create a Google Cloud account and project
1. Navigate to https://cloud.google.com/ to set up an account with Google Cloud (skip if you already use Google Cloud).
2. In the console, create a new project: https://developers.google.com/workspace/guides/create-project.
3. Enable the Cloud Translation API (Google Translate) on that project: https://cloud.google.com/translate/docs/setup

### 2: Set up your Local Development Environment
1. Install Google Cloud CLI on your machine: https://cloud.google.com/sdk/docs/install 
2. On your local development environment, initialize Google Cloud for the first time: `gcloud init`.
3. Once you have authenticated, choose the default project. This makes it easier for you to interact with APIs enabled for that project.
4. Set the quota project so that API calls made with your local credentials are billed to your Google Cloud project: `gcloud auth application-default set-quota-project <PROJECT ID>`
5. Ensure Google Cloud core (`conda install -c conda-forge google-cloud-core`) and Google Cloud Translate (`conda install -c conda-forge google-cloud-translate`) are installed in your anaconda environment. (Optionally use pip).

### 3: [Optional] Set up gcloud on Quest
1. Open a new Quest SSH terminal.
2. [Optional] Activate the conda environment for your project: `conda activate ENV_NAME`
3. Load the gcloud CLI: `module load gcloud/379.0.0`.
4. Initialize Google Cloud for the first time: `gcloud init`
5. The terminal will give you instructions. You must copy the code it gives you into a command prompt terminal on your local machine. That will open a browser, where you will authenticate. Once that is done, your local terminal will provide you some code. Copy this back into your Quest terminal to complete authentication.
   Note: If this fails for some reason, run `gcloud auth application-default set-quota-project <PROJECT ID>` to try again. You will know it worked when the last line in the terminal reads "Quota project "<PROJECT ID>" was added to ADC which can be used by Google client libraries for billing and quota. Note that some services may still bill the project owning the resource."
6. Select the project ID you wish to work with. This will make that project your default project. From now on, any Google Cloud API commands you use will use the account and project that you authenticated with. You can change this for any given script, but you will need to do so manually. I will not detail that here.
7. Ensure Google Cloud core (`conda install -c conda-forge google-cloud-core`) and Google Cloud Translate (`conda install -c conda-forge google-cloud-translate`) are installed in your anaconda environment. (Optionally use pip).

That's it! To quickly check whether it has worked (on your local development environment or on Quest), open an IPython session (or notebook) and try the following:

In [2]:
from google.cloud import translate_v2 as translate
translate_client = translate.Client()
translate_client.translate("Hello!",target_language='fr',source_language='en')

{'translatedText': 'Bonjour!', 'input': 'Hello!'}

In [4]:
translate_client.translate("Hello!",target_language='hi',source_language='en')

{'translatedText': 'नमस्ते!', 'input': 'Hello!'}
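
If the check above fails on the `import` line rather than on the API call, the packages are probably missing from the active environment. The following is a minimal sketch for telling the two cases apart without contacting Google Cloud. The helper name `is_installed` is hypothetical and not part of any Google library.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be located in the active environment.

    Parent packages are imported to resolve dotted names; the module itself is not.
    """
    try:
        return importlib.util.find_spec(module_name) is not None
    except ImportError:
        # Raised when a parent package (e.g. `google.cloud`) is itself missing.
        return False


# This is the module path imported in the quick check above.
for name in ["google.cloud.translate_v2"]:
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If the module is reported as missing, revisit the install step for the environment you are actually running in. If it is found but the translate call still fails, the problem is more likely authentication or the quota project.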

## Automatically Translate .dta Labels
Now I will move on to automatically translating labels from Stata .dta files. In this section, I will create a series of functions:
- `translate_dict`: This function takes a dictionary of text strings and translates each value in the dictionary. It returns a new dictionary with the translated values.
- `labTrans`: This function reads in a Stata .dta file, extracts the variable and value labels, and uses `translate_dict` to translate them. It returns two dictionaries: one with the translated variable labels and one with the translated value labels.
- `doLabs`: This function calls `labTrans` on a .dta file and writes the translated variable and value labels to two .do files (ending in `-varlabs.do` and `-vallabs.do`).
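
To make the dictionary-translation pattern concrete without spending API calls, here is a minimal offline sketch. `FakeClient` and `translate_values` are hypothetical names used only for illustration. The fake client mimics the response shape seen in the quick check earlier (a dict with `translatedText` and `input`), and its tiny lookup table is an assumption.

```python
import html


class FakeClient:
    """Hypothetical stand-in for the Google Translate client (no network calls)."""

    LOOKUP = {"idade": "age", "município": "municipality"}

    def translate(self, value, target_language=None, source_language=None):
        # Mimic the real response shape: a dict with 'translatedText' and 'input'.
        return {"translatedText": self.LOOKUP.get(value, value), "input": value}


def translate_values(labels, client, target="en", source="pt"):
    """Return a new dict with the same keys and translated values."""
    translated = {}
    for key, value in labels.items():
        result = client.translate(value, target_language=target, source_language=source)
        # Undo HTML escaping such as &#39; in the returned text.
        translated[key] = html.unescape(result["translatedText"])
    return translated


labels = {"age_var": "idade", "muni": "município"}
print(translate_values(labels, FakeClient()))
print(labels)  # the input dictionary is left unchanged
```

Because the fake client exposes the same `translate` method, the same loop can be exercised in tests before pointing it at the real, billable API.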

In [2]:
import pandas as pd
from google.cloud import translate_v2 as translate
import os

maindir = '<path-to-main-directory>'
os.chdir(maindir)

In [3]:
import html

# Define a function to translate each value in a given dictionary
def translate_dict(dictionary, client, target='en', source='pt'):
    """
    Translates each value in a given dictionary using the Google Translate API.

    Args:
        dictionary (dict): The dictionary containing text strings to be translated.
        client (google.cloud.translate_v2.Client): The Google Translate client.
        target (str, optional): The target language code. Defaults to 'en'.
        source (str, optional): The source language code. Defaults to 'pt'.

    Returns:
        dict: A new dictionary with the same keys and translated values.
    """
    translated = {}
    for key, value in dictionary.items():
        if not value:
            # Nothing to translate; keep empty labels as they are
            translated[key] = value
            continue
        result = client.translate(value, target_language=target, source_language=source)
        # The returned text can contain HTML entities (e.g. &#39; for '), so unescape them
        translated[key] = html.unescape(result['translatedText'])
    return translated


# Define the final function to read in the data and translate the variable and value labels
def labTrans(file, client, target='en', source='pt'):
    """
    Reads in a Stata .dta file, extracts the variable and value labels, and translates them using the Google Translate API.

    Args:
        file (str): The path to the Stata .dta file.
        client (google.cloud.translate_v2.Client): The Google Translate client.
        target (str, optional): The target language code. Defaults to 'en'.
        source (str, optional): The source language code. Defaults to 'pt'.

    Returns:
        tuple: A tuple containing the translated variable labels dictionary and the translated value labels dictionary.
    """
    # read in file and get the variable and value labels
    # (the context manager closes the file once the labels are read)
    with pd.read_stata(file, iterator=True) as reader:
        varlabels_source = reader.variable_labels()
        vallabs_source = reader.value_labels()

    # translate variable labels
    varlabels_target = translate_dict(varlabels_source, client, target, source)

    # For each key in vallabs_source, translate the value labels
    vallabs_target = {}
    for key in vallabs_source.keys():
        vallabs_target[key] = translate_dict(vallabs_source[key], client, target, source)

    return varlabels_target, vallabs_target


# Now take the output of the labTrans function and write the labels to do files
def doLabs(infile, outfile_stub, client, target='en', source='pt'):
    """
    Takes the output of the labTrans function and writes the translated variable and value labels to .do files.

    Args:
        infile (str): The path to the Stata .dta file.
        outfile_stub (str): The path and filename stub for the output .do files.
        client (google.cloud.translate_v2.Client): The Google Translate client.
        target (str, optional): The target language code. Defaults to 'en'.
        source (str, optional): The source language code. Defaults to 'pt'.
    """
    # Ensure outfile ends in .do before spending any API calls
    if not outfile_stub.endswith('.do'):
        raise ValueError('outfile_stub must end in .do')

    # Create file names: -varlabs.do and -vallabs.do
    varlab_file = outfile_stub[:-3] + '-varlabs.do'
    vallab_file = outfile_stub[:-3] + '-vallabs.do'

    # Make sure the output directory exists
    os.makedirs(os.path.dirname(varlab_file) or '.', exist_ok=True)

    # Translate the variable and value labels
    varlabels_target, vallabs_target = labTrans(infile, client, target, source)

    # Get the name of the file without the path
    fname = os.path.basename(infile)

    # Write the variable labels to a do file with each line being "label variable varname "varlabel""
    with open(varlab_file, 'w', encoding='utf-8') as f:
        f.write(f'* Translated Variable Labels for {fname} \n')
        f.write(f'* Author: Aaron Wolf (aaron.wolf@u.northwestern.edu) \n')
        for key, value in varlabels_target.items():
            f.write('label variable ' + key + ' "' + value + '"\n')

    # Do the same for vallabels (if vallabs_target is not empty)
    if vallabs_target:
        with open(vallab_file, 'w', encoding='utf-8') as f:
            f.write(f'* Translated Value Labels for {fname} \n')
            f.write(f'* Author: Aaron Wolf (aaron.wolf@u.northwestern.edu) \n')
            for key, value in vallabs_target.items():
                f.write(f'* {key} \n')
                for k, v in value.items():
                    f.write(f'label define {key} {k} "{v}", modify \n')
    else:
        print(f'No value labels in {fname}')
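
One caveat with writing labels between plain double quotes: a translated label that itself contains a `"` character would produce an invalid Stata command. Stata's compound double quotes (an opening backtick plus `"`, and a closing `"` plus `'`) allow embedded quotes. The sketch below shows one possible way to build such lines. The helper names `stata_quote` and `label_variable_line`, and the variable name `var1`, are hypothetical and not part of the functions above.

```python
def stata_quote(text):
    """Wrap `text` in Stata compound double quotes so embedded " characters are safe."""
    return '`"' + text + '"\''


def label_variable_line(varname, label):
    """Build one `label variable` command for a .do file."""
    return f"label variable {varname} {stata_quote(label)}"


# A label with embedded double quotes, which plain "..." quoting could not hold.
print(label_variable_line("var1", 'Education level ("grau")'))
```

This is only an option to consider if quoted text shows up in your labels; the functions above keep the simpler plain-quote format.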

In [4]:
# Now, test!
reiswd = '<relative-path-to-data>'
subdir = 'samples'  # named subdir to avoid shadowing the built-in dir()
file = 'RAIS_sample5_1992.dta'
infile = os.path.join(reiswd, subdir, file)
targetLanguage = 'en'
os.makedirs(f'labels-{targetLanguage}', exist_ok=True)  # output folder must exist before writing
outfile_stub = f'labels-{targetLanguage}/' + file.replace('.dta', '.do')
client = translate.Client()  # This is the Google Translate client

doLabs(infile, outfile_stub, client, target=targetLanguage, source='pt')
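
The test above should leave two files next to the stub: one for variable labels and one for value labels (the latter only if the dataset has value labels). As a rough sketch of the naming rule, the hypothetical helper `label_file_names` below derives both paths from a stub. It uses `pathlib` instead of the string slicing in `doLabs`, which is an illustrative choice only.

```python
from pathlib import Path


def label_file_names(outfile_stub):
    """Return the (variable-label, value-label) .do paths derived from a stub ending in .do."""
    stub = Path(outfile_stub)
    if stub.suffix != ".do":
        raise ValueError("outfile_stub must end in .do")
    base = stub.with_suffix("").as_posix()  # drop the trailing .do
    return f"{base}-varlabs.do", f"{base}-vallabs.do"


print(label_file_names("labels-en/RAIS_sample5_1992.do"))
```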

## Translate Labels for All Files

In [13]:
# Now loop through every subdirectory of reiswd and translate the labels of each .dta file.
# The .do files are saved under labels-<targetLanguage>/ in the same relative directory,
# using the same base name as the dataset.
reiswd = '<relative-path-to-data>'
targetLanguage = 'en'
labdir = f'labels-{targetLanguage}'  # Main directory to save the do files (in relative folders)
os.makedirs(labdir, exist_ok=True)
client = translate.Client()          # Google Translate client
# Keep only the entries that are directories (folders full of data)
dirlist = [d for d in sorted(os.listdir(reiswd)) if os.path.isdir(os.path.join(reiswd, d))]
# Start loop
for subdir in dirlist:
    print(subdir)
    # Make sure {labdir}/{subdir} exists
    os.makedirs(f'{labdir}/{subdir}', exist_ok=True)
    # Translate labels from all .dta files in this folder and save to {labdir}/{subdir}
    for file in sorted(os.listdir(os.path.join(reiswd, subdir))):
        if file.endswith('.dta'):
            print(file)
            infile = os.path.join(reiswd, subdir, file)
            outfile_stub = f'{labdir}/{subdir}/' + file.replace('.dta', '.do')
            # Translate!
            doLabs(infile, outfile_stub, client, target=targetLanguage, source='pt')
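
When many files share the same labels, the loop above sends the same text to the API repeatedly, and every call counts toward billing and quota. One possible mitigation is a thin caching wrapper around the client. The sketch below assumes only that the client is called as `client.translate(value, target_language=..., source_language=...)`, as in the functions above. `CachingTranslator` and `EchoClient` are hypothetical names, and `EchoClient` is a stand-in that upper-cases text instead of translating it.

```python
class CachingTranslator:
    """Hypothetical wrapper that remembers translations so repeated labels cost one call."""

    def __init__(self, client):
        self.client = client
        self.cache = {}  # (text, target, source) -> response dict
        self.calls = 0   # number of requests actually forwarded to the client

    def translate(self, value, target_language=None, source_language=None):
        key = (value, target_language, source_language)
        if key not in self.cache:
            self.calls += 1
            self.cache[key] = self.client.translate(
                value, target_language=target_language, source_language=source_language
            )
        return self.cache[key]


class EchoClient:
    """Stand-in client for demonstration only; upper-cases instead of translating."""

    def translate(self, value, target_language=None, source_language=None):
        return {"translatedText": value.upper(), "input": value}


translator = CachingTranslator(EchoClient())
for text in ["sexo", "idade", "sexo"]:
    translator.translate(text, target_language="en", source_language="pt")
print(translator.calls)  # 2: the repeated label is served from the cache
```

Since the wrapper exposes the same `translate` method, it could be passed to `doLabs` in place of the raw client, for example `doLabs(infile, outfile_stub, CachingTranslator(client), ...)`. The cache lives only in memory, so it helps within a single run.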