# Translating textual attributes of the reviews
-------------------

> <i>Description: In this notebook, we translate the relevant text fields of the scraped reviews into English and save the results as Excel files.</i>

These files are used later to classify the texts into categories, determine sentiment, and conduct our analysis.

Input Files: 
1) Glassdoor_reviews_gathered.csv
2) Kununu_reviews_gathered.csv

Output:
1) Glassdoor_reviews_translated.xlsx
2) Kununu_reviews_translated.xlsx

Model:  

We used the **GoogleTranslator** class from the **deep-translator** library, which sends each text to Google Translate. In our experience the translations capture context and linguistic nuances well. With the source language set to `'auto'`, the language of each text is detected automatically, so reviews written in different languages can be translated to English in a single pass.

In [1]:
import numpy as np 
import pandas as pd
from deep_translator import GoogleTranslator
from deep_translator.exceptions import NotValidPayload
import ast

* Glassdoor_reviews_gathered.csv is the result of scraping reviews from the Glassdoor website.

In [None]:
# Loading Glassdoor df for translation
df = pd.read_csv('../Initial_data/Glassdoor_reviews_gathered.csv')
print(df.head(5))
# To see which columns can contain text:
print(df.columns)

In [None]:
# Function to detect and translate text to English
def translate_text(text):
    """
    Detects the language of the input text and translates it to English.

    Parameters:
    - text (str): The input text to be translated, typically from a text column in a DataFrame.

    Returns:
    - str: The translated text in English if successful.
    - None: If the input is not a valid payload (e.g., a NaN cell or another non-string value).

    Exceptions:
    - NotValidPayload: Caught and converted to None. Other errors (e.g., network failures) are not caught and will propagate.
    """
    try:
        # Using GoogleTranslator to auto-detect the language and translate to English
        translated_text = GoogleTranslator(source='auto', target='en').translate(text)
        return translated_text
    except NotValidPayload:
        return None
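
The function above only catches `NotValidPayload`. As far as we know, deep-translator also rejects inputs above a length limit (about 5,000 characters for the Google backend) with a different exception, so a very long review could interrupt the `.apply` call. One possible workaround, sketched below under that assumption, is to split long texts at whitespace and translate the pieces separately. `split_into_chunks`, `translate_long_text`, and the `max_chars=4500` default are hypothetical names and values, not part of the original notebook or the library; `translate_fn` stands for any translation function such as `translate_text`.

```python
def split_into_chunks(text, max_chars=4500):
    """Split text into pieces of at most max_chars characters, breaking at whitespace."""
    chunks = []
    current = ""
    for word in text.split():
        # A single word longer than the limit is cut into fixed-size slices
        while len(word) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_chars])
            word = word[max_chars:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


def translate_long_text(text, translate_fn, max_chars=4500):
    """Translate text chunk by chunk and join the translated pieces with spaces."""
    pieces = [translate_fn(chunk) for chunk in split_into_chunks(text, max_chars)]
    # Skip chunks for which the translation function returned None
    return " ".join(piece for piece in pieces if piece is not None)


# Offline demonstration: str.upper stands in for a real translation function
print(split_into_chunks("aa bb cc dd", max_chars=5))
print(translate_long_text("aa bb cc dd", str.upper, max_chars=5))
```

Splitting on whitespace collapses line breaks and repeated spaces, and a chunk boundary can fall in the middle of a sentence, which may reduce translation quality near the boundary.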

In [None]:
# Applying to text columns in Glassdoor df
df['advice_translated'] = df['advice'].apply(translate_text)
df['cons_translated'] = df['cons'].apply(translate_text)
df['pros_translated'] = df['pros'].apply(translate_text)
df['summary_translated'] = df['summary'].apply(translate_text)
# To see the new columns:
print(df.head(5))
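
Each call to `translate_text` sends a separate request, so columns with many repeated or missing values spend time on redundant calls. The sketch below shows one way to reduce that: translate each distinct non-null value once and map the results back onto the column. `translate_unique` is a hypothetical helper, not part of the original notebook, and `fake_translate` is a stand-in so that the example runs offline.

```python
import pandas as pd


def translate_unique(series, translate_fn):
    """Translate each distinct non-null value once and map the results back onto the Series."""
    unique_values = series.dropna().unique()
    mapping = {value: translate_fn(value) for value in unique_values}
    # Values without an entry in the mapping (i.e. missing values) stay missing
    return series.map(mapping)


calls = []


def fake_translate(text):
    """Stand-in for translate_text that records how often it is called."""
    calls.append(text)
    return text.upper()


sample = pd.Series(["gut", "gut", None, "bien"])
translated = translate_unique(sample, fake_translate)
print(translated)
print(len(calls))  # number of translation calls actually made
```

With the real function, the call would look like `df['pros_translated'] = translate_unique(df['pros'], translate_text)`.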

In [7]:
# Get rid of unnecessary columns
df = df.drop(columns=['Unnamed: 0'])
# Saving the new df as an Excel file
df.to_excel('Glassdoor_reviews_translated.xlsx')

* Kununu_reviews_gathered.csv is the result of scraping reviews from the Kununu website.
* Please replace '../Initial_data/' with the directory where your input files are stored.

In [None]:
# Loading Kununu df for translation
df = pd.read_csv('../Initial_data/Kununu_reviews_gathered.csv')
print(df.head())
# To see which columns can contain text:
print(df.columns)

In [None]:
# Function to translate text fields in JSON-like structures (uses translate_text function)
def translate_nested_text(row, text_key='text'):
    """
    Translates text fields within JSON-like structures in a DataFrame column.

    Parameters:
    - row (str): A JSON-like string representing either a list of dictionaries or a single dictionary. 
                 Each dictionary contains a text field to be translated.
    - text_key (str): The key associated with the text field to be translated. Defaults to 'text'.

    Returns:
    - list or dict: The translated JSON-like structure, with specified text fields translated to English.
    - str: Returns the original row if it's NaN or if the translation process fails due to syntax errors.

    Exceptions:
    - ValueError: Catches this exception if the input string cannot be evaluated to a list or dictionary.
    - SyntaxError: Catches this exception for syntax issues within the JSON-like string structure.
    """
    if pd.isna(row):  # If the row is NaN, return the row as is
        return row
    try:
        # Parse the JSON-like string to a Python list of dictionaries or a single dictionary
        nested_list = ast.literal_eval(row)
        # Check if it's a list of dictionaries or a single dictionary
        if isinstance(nested_list, list):
            for item in nested_list:
                if text_key in item:
                    item[text_key] = translate_text(item[text_key])  # Translate the specified text field
        elif isinstance(nested_list, dict):
            if text_key in nested_list:
                nested_list[text_key] = translate_text(nested_list[text_key])  # Translate the specified text field
        return nested_list
    except (ValueError, SyntaxError):
        return row
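
To make the parsing step concrete, here is a small offline example. The cell value and its keys (`id`, `score`) are invented for illustration and may not match the real Kununu columns, and `fake_translate` stands in for `translate_text`. The example also shows why `ast.literal_eval` is used rather than `json.loads`: strings that use single quotes are valid Python literals but not valid JSON.

```python
import ast
import json

# Hypothetical cell value; the real Kununu columns may use different keys
raw = "[{'id': 'atmosphere', 'text': 'Sehr gutes Team'}, {'id': 'salary', 'score': 4}]"

# json.loads rejects single-quoted strings
try:
    json.loads(raw)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

# ast.literal_eval parses Python literals without executing arbitrary code
parsed = ast.literal_eval(raw)


def fake_translate(text):
    """Stand-in for translate_text so the example runs without network access."""
    return f"[en] {text}"


# Only dictionaries that contain the 'text' key are modified
for item in parsed:
    if isinstance(item, dict) and "text" in item:
        item["text"] = fake_translate(item["text"])

print(json_ok)
print(parsed)
```

The `isinstance(item, dict)` check is an extra guard added in this sketch; it skips list elements that are not dictionaries instead of failing on them.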

In [None]:
# Applying to the Kununu df
df['ratings_translated'] = df['ratings'].apply(lambda x: translate_nested_text(x, text_key='text'))
df['texts_translated'] = df['texts'].apply(lambda x: translate_nested_text(x, text_key='text'))
df['responses_translated'] = df['responses'].apply(lambda x: translate_nested_text(x, text_key='response'))
# To explore new columns:
print(df.head())

In [6]:
# Get rid of unnecessary columns if applicable
# df = df.drop(columns=['...'])
# Saving the new df as an Excel file
df.to_excel('Kununu_reviews_translated.xlsx')
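
The translated Kununu columns hold Python lists and dictionaries, which a spreadsheet cell cannot store natively; to our understanding, pandas writes them as their string representation. If the file is read back in a later notebook, serializing these values to JSON before saving may make them easier to parse. The sketch below illustrates that idea; `to_json_cell` is a hypothetical helper, not part of the original notebook.

```python
import json


def to_json_cell(value):
    """Serialize lists and dicts to a JSON string; leave all other values untouched."""
    if isinstance(value, (list, dict)):
        # ensure_ascii=False keeps umlauts and other non-ASCII characters readable
        return json.dumps(value, ensure_ascii=False)
    return value


# Offline demonstration with an invented cell value
cell = [{"text": "Great team", "score": 5}]
stored = to_json_cell(cell)
restored = json.loads(stored)
print(stored)
print(restored == cell)
```

In the notebook, this could be applied with `df['texts_translated'] = df['texts_translated'].apply(to_json_cell)` before calling `to_excel`. Rows that could not be parsed remain strings and pass through unchanged.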

### End of the notebook