# Classifying textual attributes of the reviews into categories and defining sentiment where necessary
-------------------

> <i>Description: In this notebook, we classify text observations from reviews into one or more categories aligned with Hugo Boss's values and benefits. We also define the sentiment for columns where sentiment is not readily apparent.</i>

The output file is used for the subsequent analysis.

Input Files: 
1) reviews_merged.csv
2) Kununu_rating_ids_mapped.xlsx

Output:
1) text_data_classes_sentiment.xlsx

Model:  

We used **GPT-4o-mini**, a transformer-based large language model from OpenAI. We chose it because it captures nuanced language and context, which lets it classify text in line with Hugo Boss's values and benefits and assess sentiment.

Categories: 

- **Diversity & Equity & Inclusion**: Keywords: Diversity, Inclusion, Equity, LGBTQIA+, Equal opportunities, Gender equality, homophobia, biphobia, transphobia, inequality, sexism, racism, inclusivity.
  
- **Authenticity**: Keywords: Authenticity, True self, Personal expression, Be yourself, Unique identities, independence.

- **Collaboration & Teamwork & Social Culture**: Keywords: Collaboration, Teamwork, team spirit, Togetherness, Working together, Shared goals, social culture, team, Colleague, environment, atmosphere.

- **Creativity and Innovation**: Keywords: Creativity, Innovation, New ideas, Originality, Forward-thinking, problem-solving.

- **Professional Development and Continuous Learning**: Keywords: Development, Learning, Skills, Training, Courses, Self-improvement, Growth opportunities, promotion, Growth, Mentorship, Ownership, Entrepreneurial spirit, task.

- **Youthful Spirit**: Keywords: Intern, working student, young, internship, hazing, youthful.

- **Digital Transformation & Process Management**: Keywords: Digital, Transformation, Technology, Process improvement, Automation, Efficiency.

- **Leadership & Communication**: Keywords: Leadership, Communication, management, Guidance, inspirational leadership, leader, manager, supervisor, feedback, Conflict, support.

- **Fashion and Lifestyle Benefits**: Keywords: Clothing allowance, discount, ArtPass, Museum access, Hugo Boss stores, vip shops, Outlets.

- **Health and Well-being**: Keywords: Mental health program, Life coaching, hotline, Gym, Fitness, insurance, Health events, Personal trainers, Stress management.

- **Work-Life Balance & Flexibility**: Keywords: Mobile work, Remote work, 30 vacation days, Special leave, flexibility, Daycare, Kindergarten, Family services, Childcare support, Domestic help.

- **Mobility & Accessibility**: Keywords: Job bike, Bike leasing, Mobility allowance, Public transport subsidies, E-charging stations, Shuttle service, Campus transportation.

- **Financial Compensation & Benefits**: Keywords: Share investment, Company pension, salary, wage, Pension subsidies, Vacation bonuses, Christmas bonuses.

- **Device Leasing and Corporate Benefits**: Keywords: Smartleasing, Apple devices, Samsung devices, Travel, Mobile phones.

- **Social and Recreational Benefits**: Keywords: Celebration events, After-work events, Summer party, Christmas party, Food, Drinks, canteen, Live entertainment, On-campus restaurant, Coffee bar, Subsidized meals, Barista services.

In [1]:
import pandas as pd 
import numpy as np
import re
import ast

* `reviews_merged.csv` is the result of merging the translated Glassdoor and Kununu files.

In [3]:
df = pd.read_csv('reviews_merged.csv')
print(df.head())

In [8]:
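import os
from openai import OpenAI

# Read the OpenAI API key from the environment; "your_key" is only a placeholder fallback
api_key = os.environ.get("OPENAI_API_KEY", "your_key")
client = OpenAI(api_key=api_key)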


### Step 1. Classification

In [9]:
# dictionary with categories and their keywords
categories_keywords = {
    "Diversity & Equity & Inclusion": "Keywords: Diversity, Inclusion, Equity, LGBTQIA+, Equal opportunities, Gender equality, homophobia, biphobia, transphobia, inequality, sexism, racism, inclusivity",
    "Authenticity": "Keywords: Authenticity, True self, Personal expression, Be yourself, Unique identities, independence.",
    "Collaboration & Teamwork & Social Culture": "Keywords: Collaboration, Teamwork, team spirit, Togetherness, Working together, Shared goals, social culture, team, Colleague, environment, atmosphere",
    "Creativity and Innovation": "Keywords: Creativity, Innovation, New ideas, Originality, Forward-thinking, problem-solving.",
    "Professional Development and Continuous Learning": "Keywords: Development, Learning, Skills, Training, Courses, Self-improvement, Growth opportunities, promotion, Growth, Mentorship, Ownership, Entrepreneurial spirit, task.",
    "Youthful Spirit": "Keywords: Intern, working student, young, internship, hazing, youthful.",
    "Digital Transformation & Process Management": "Keywords: Digital, Transformation, Technology, Process improvement, Automation, Efficiency.",
    "Leadership & Communication": "Keywords: Leadership, Communication, management, Guidance, inspirational leadership, leader, manager, supervisor, feedback, Conflict, support",
    "Fashion and Lifestyle Benefits": "Keywords: Clothing allowance, discount, ArtPass, Museum access, Hugo Boss stores, vip shops, Outlets.",
    "Health and Well-being": "Keywords: Mental health program, Life coaching, hotline, Gym, Fitness, insurance, Health events, Personal trainers, Stress management.",
    "Work-Life Balance & Flexibility": "Keywords: Mobile work, Remote work, 30 vacation days, Special leave, flexibility, Daycare, Kindergarten, Family services, Childcare support, Domestic help.",
    "Mobility & Accessibility": "Keywords: Job bike, Bike leasing, Mobility allowance, Public transport subsidies, E-charging stations, Shuttle service, Campus transportation.",
    "Financial Compensation & Benefits": "Keywords: Share investment, Company pension, salary, wage, Pension subsidies, Vacation bonuses, Christmas bonuses.",
    "Device Leasing and Corporate Benefits": "Keywords: Smartleasing, Apple devices, Samsung devices, Travel, Mobile phones.",
    "Social and Recreational Benefits": "Keywords: Celebration events, After-work events, Summer party, Christmas party, Food, Drinks, canteen, Live entertainment, On-campus restaurant, Coffee bar, Subsidized meals, Barista services."
}
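
The numbered category list in the classification prompt can also be generated from this dictionary instead of being typed out by hand. A minimal sketch, assuming a hypothetical helper `build_category_block` (not part of the original notebook) and a two-entry toy dictionary:

```python
def build_category_block(categories):
    """Render 'N. Name: keywords' lines, one per category (hypothetical helper)."""
    return "\n".join(
        f"{i}. {name}: {keywords}"
        for i, (name, keywords) in enumerate(categories.items(), start=1)
    )

# Toy dictionary for illustration only; the full categories_keywords dict works the same way
toy_categories = {
    "Authenticity": "Keywords: Authenticity, True self.",
    "Youthful Spirit": "Keywords: Intern, young.",
}
category_block = build_category_block(toy_categories)
print(category_block)
```

Generating the block this way keeps the prompt in sync with the dictionary when categories are added or renamed.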

In [10]:
def classify_text_with_keywords(text):
    """
    Classifies the given text into one or more predefined categories based on associated keywords.

    Parameters:
    - text (str): The input text to be classified.

    Returns:
    - str: A comma-separated string of categories to which the text belongs. 
           If the text does not match any category, returns "NA".

    Error Handling:
    If an error occurs during classification, the function prints the error message and returns "NA".
    """
    try:
        # Build the prompt (sent as a user message), including category names and their associated keywords
        prompt = f"""Classify the following text into one or more categories based on the keywords provided. If the text doesn't fit any category, return "NA". 
        Return multiple categories if applicable, separated by commas.

Categories and Keywords:
1. Diversity & Equity & Inclusion: {categories_keywords['Diversity & Equity & Inclusion']}
2. Authenticity: {categories_keywords['Authenticity']}
3. Collaboration & Teamwork & Social Culture: {categories_keywords['Collaboration & Teamwork & Social Culture']}
4. Creativity and Innovation: {categories_keywords['Creativity and Innovation']}
5. Professional Development and Continuous Learning: {categories_keywords['Professional Development and Continuous Learning']}
6. Youthful Spirit: {categories_keywords['Youthful Spirit']}
7. Digital Transformation & Process Management: {categories_keywords['Digital Transformation & Process Management']}
8. Leadership & Communication: {categories_keywords['Leadership & Communication']}
9. Fashion and Lifestyle Benefits: {categories_keywords['Fashion and Lifestyle Benefits']}
10. Health and Well-being: {categories_keywords['Health and Well-being']}
11. Work-Life Balance & Flexibility: {categories_keywords['Work-Life Balance & Flexibility']}
12. Mobility & Accessibility: {categories_keywords['Mobility & Accessibility']}
13. Financial Compensation & Benefits: {categories_keywords['Financial Compensation & Benefits']}
14. Device Leasing and Corporate Benefits: {categories_keywords['Device Leasing and Corporate Benefits']}
15. Social and Recreational Benefits: {categories_keywords['Social and Recreational Benefits']}

If the text doesn't belong to any of these categories, return "NA".

Text: {text}
Categories (separated by commas):"""

        # Use the provided OpenAI client to send the request
        response = client.chat.completions.create(
            model="gpt-4o-mini",  
            messages=[{"role": "user", "content": prompt}]
        )
        
        # Extract the model's reply (a comma-separated list of categories, or "NA")
        classification = response.choices[0].message.content.strip()

        # Fall back to "NA" if the reply is empty
        return classification if classification else "NA"

    except Exception as e:
        print(f"Error classifying text: {e}")
        return "NA"
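
Because the model returns free text, it can help to validate the reply against the known category names before using it. The sketch below is an assumption about how this could be done, not a step in the original notebook; `parse_categories` and the sample reply are hypothetical.

```python
def parse_categories(response_text, valid_categories):
    """Split a comma-separated model reply and keep only known category names."""
    if not response_text or response_text.strip().upper() == "NA":
        return []
    # Category names contain '&' but no commas, so a plain comma split is safe
    candidates = [part.strip() for part in response_text.split(",")]
    return [c for c in candidates if c in valid_categories]

valid = {"Leadership & Communication", "Youthful Spirit", "Authenticity"}
parsed = parse_categories("Leadership & Communication, Youthful Spirit, Team Fun", valid)
print(parsed)  # 'Team Fun' is dropped because it is not a known category
```

With the notebook's data, `valid_categories` would be `set(categories_keywords)`. Replies that include numbering or extra wording would need additional cleanup, which this sketch does not attempt.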

In [None]:
# Summary column
from tqdm import tqdm
tqdm.pandas()
df['summary_class'] = df['summary_translated'].progress_apply(classify_text_with_keywords)
# Check that the classification worked
print(df.head())

In [None]:
# concatenated_ratings column
tqdm.pandas()
df['concatenated_ratings_class'] = df['concatenated_ratings'].progress_apply(classify_text_with_keywords)
# Check that the classification worked
print(df.head())

In [None]:
# Suggestion column
tqdm.pandas()
df['suggestion_class'] = df['suggestion'].progress_apply(classify_text_with_keywords)
# Check that the classification worked
print(df.head())

In [None]:
# Cons column
tqdm.pandas()
df['cons_class'] = df['cons'].progress_apply(classify_text_with_keywords)
# Check that the classification worked
print(df.head())

In [None]:
# Pros column
tqdm.pandas()
df['pros_class'] = df['pros'].progress_apply(classify_text_with_keywords)
# Check that the classification worked
print(df.head())
# Use the line below to save intermediate results:
# df.to_excel('classes_df.xlsx')

* We observed that the **Financial Compensation & Benefits** category was not consistently assigned to texts mentioning salary or pay. To address this, we manually added the category using the `add_class_based_on_keyword` function, as shown below.

In [None]:
# General function to add a class label (here 'Financial Compensation & Benefits') based on keywords in a specified column
def add_class_based_on_keyword(df, ratings_col, class_col, keywords, new_class):
    """
    Adds a specified class label to rows in a DataFrame based on the presence of keywords within a given column.

    Parameters:
    - df (pandas.DataFrame): The DataFrame containing the text to analyze.
    - ratings_col (str): The column in the DataFrame where keywords will be searched for.
    - class_col (str): The column where the new class will be added if a keyword is found.
    - keywords (list): A list of keywords to look for within the specified ratings column.
    - new_class (str): The class label to be added if any of the keywords are found.

    Returns:
    - pandas.DataFrame: The updated DataFrame with the new class labels added where applicable.
    """
    def apply_logic(row):
        # Convert NaN to empty string for both ratings_col and class_col
        ratings_text = str(row[ratings_col]) if pd.notna(row[ratings_col]) else ''
        class_text = str(row[class_col]) if pd.notna(row[class_col]) else ''
        
        # Check for any of the keywords in the ratings column
        if any(keyword.lower() in ratings_text.lower() for keyword in keywords):
            if new_class not in class_text:
                return new_class + ', ' + class_text
        return class_text

    # Apply the logic to the specified columns
    df[class_col] = df.apply(apply_logic, axis=1)
    return df
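
The function above uses plain substring matching, so a keyword such as `pay` also matches inside longer words. A possible alternative, shown only as a sketch and not as the notebook's method, is a whole-word match with a regular expression; `keyword_mask` and the toy data are hypothetical.

```python
import re
import pandas as pd

def keyword_mask(series, keywords):
    """Boolean mask: True where any keyword occurs as a whole word (case-insensitive)."""
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    return series.fillna("").astype(str).str.contains(pattern, case=False, regex=True)

toy = pd.DataFrame({"cons": ["Low salary", "Repayment rules unclear", None, "Pay is fine"]})
mask = keyword_mask(toy["cons"], ["salary", "pay"])
print(mask.tolist())
```

Whole-word matching is stricter in both directions: it skips "Repayment" but would also skip "paid" or "payment", so the keyword list may need to be extended accordingly.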

In [None]:
keywords = ['salary', 'pay']
new_class = 'Financial Compensation & Benefits'

# Applying the function for each pair of columns
df = add_class_based_on_keyword(df, 'cons', 'cons_class', keywords, new_class)
df = add_class_based_on_keyword(df, 'pros', 'pros_class', keywords, new_class)
df = add_class_based_on_keyword(df, 'concatenated_ratings', 'concatenated_ratings_class', keywords, new_class)
df = add_class_based_on_keyword(df, 'suggestion', 'suggestion_class', keywords, new_class)
df = add_class_based_on_keyword(df, 'summary_translated', 'summary_class', keywords, new_class)

# Use line below to save intermediate results
# df.to_excel('salarypay.xlsx')

### Step 2.1. Defining Sentiment using GPT-4o-mini

Applied to the columns for which sentiment was not clear: Summary and Suggestion.

In [25]:
def get_sentiment(text):
    """
    Determines the sentiment of the given text as either 'positive', 'negative', or 'neutral'.

    Parameters:
    - text (str): The input text for which sentiment is to be analyzed.

    Returns:
    - str: The sentiment of the text, which will be one of 'positive', 'negative', or 'neutral'. 
           Returns "NA" if the sentiment analysis fails.

    Error Handling:
    If an error occurs during the sentiment analysis process, the function prints the error message and returns "NA".
    """

    try:
        # Build the sentiment prompt (sent as a user message)
        prompt = f"""Analyze the sentiment of the following text and return the sentiment as 'positive', 'negative', or 'neutral'. 

Text: "{text}"

Sentiment (positive, negative, neutral):"""

        # Use the provided OpenAI client to send the request
        response = client.chat.completions.create(
            model="gpt-4o-mini",  
            messages=[{"role": "user", "content": prompt}]
        )
        
        # Extract the model's reply (expected: positive, negative, or neutral)
        classification = response.choices[0].message.content.strip()

        # Fall back to "NA" if the reply is empty
        return classification if classification else "NA"

    except Exception as e:
        print(f"Error analyzing sentiment: {e}")
        return "NA"
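
One way to keep missing values from being sent to the API at all is to wrap the call in a guard. This is a sketch under assumptions: `apply_to_text` is a hypothetical helper, and `fake_sentiment` stands in for `get_sentiment` so the example runs without an API key.

```python
import numpy as np
import pandas as pd

def apply_to_text(series, func):
    """Apply func only to non-empty strings; leave missing or blank entries as NaN."""
    def guarded(value):
        if pd.isna(value) or not str(value).strip():
            return np.nan
        return func(value)
    return series.apply(guarded)

# Stand-in for get_sentiment so the sketch runs without an API call
def fake_sentiment(text):
    return "positive" if "good" in text.lower() else "neutral"

toy = pd.Series(["Good team", None, "   ", "Open-plan office"])
result = apply_to_text(toy, fake_sentiment)
print(result.tolist())
```

In the notebook, the same guard could wrap `get_sentiment` or `classify_text_with_keywords`, which would also save API calls on empty rows.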

In [None]:
# Applying get_sentiment to summary column
tqdm.pandas()
df['summary_sentiment'] = df['summary_translated'].progress_apply(get_sentiment)
print(df.head())

In [None]:
# Applying get_sentiment to suggestion column
tqdm.pandas()
df['suggestion_sentiment'] = df['suggestion'].progress_apply(get_sentiment)
print(df.head())
# Use the line below to save intermediate results:
# df.to_excel('sentiment_summary_suggestion.xlsx')

* We noticed that NaN values were not handled properly and that the returned sentiment text needed some cleaning. We fixed both in post-processing.

In [32]:
def clean_sentiment(sentiment):
    """
    Cleans and standardizes the sentiment value by extracting only the relevant sentiment word 
    ('positive', 'negative', or 'neutral') from the input text.

    Parameters:
    - sentiment (str): The sentiment string that may contain additional text or formatting.

    Returns:
    - str: The cleaned sentiment word in lowercase ('positive', 'negative', or 'neutral'). 
           If no valid sentiment word is found, returns the original sentiment input.

    """
    if pd.isna(sentiment):
        return sentiment
    # Use regex to extract only the sentiment word (positive, negative, neutral)
    cleaned_sentiment = re.search(r'(positive|negative|neutral)', sentiment, re.IGNORECASE)
    return cleaned_sentiment.group(0).lower() if cleaned_sentiment else sentiment
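
To illustrate what the regex extraction does, the sketch below restates the same idea in a standalone function (`extract_sentiment`, a hypothetical name) and runs it on a few made-up model replies.

```python
import re

SENTIMENT_PATTERN = re.compile(r"(positive|negative|neutral)", re.IGNORECASE)

def extract_sentiment(raw):
    """Same regex idea as clean_sentiment, restated so this sketch runs on its own."""
    match = SENTIMENT_PATTERN.search(raw)
    return match.group(0).lower() if match else raw

# Made-up replies of the kind a chat model might return
samples = ["Positive", "Sentiment: negative.", "The sentiment is **Neutral**", "unclear"]
cleaned = [extract_sentiment(s) for s in samples]
print(cleaned)
```

Note that the first match wins, so a reply such as "not positive, rather negative" would be cleaned to `positive`; replies like that would need manual review.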

In [None]:
# Set the sentiment to NaN for rows where the original column has no text
df.loc[df['summary_translated'].isna(), 'summary_sentiment'] = np.nan
# Apply the cleaning function to the 'summary_sentiment' column
df['summary_sentiment'] = df['summary_sentiment'].apply(clean_sentiment)

In [37]:
# Set the sentiment to NaN for rows where the original column has no text
df.loc[df['suggestion'].isna(), 'suggestion_sentiment'] = np.nan
# Apply the cleaning function to the 'suggestion_sentiment' column
df['suggestion_sentiment'] = df['suggestion_sentiment'].apply(clean_sentiment)

### Step 2.2. Defining Sentiment using scores for each id

Applied to the column for which sentiment was not clear and whose text was too complex, covering several categories at once: Concatenated ratings.

In [4]:
# Kununu_rating_ids_mapped.xlsx was created by mapping id names to categories
df_mapping = pd.read_excel('Kununu_rating_ids_mapped.xlsx')

**Processing Class Scores for Ratings**

This process iterates over each row in the DataFrame to calculate average scores for classified categories based on associated IDs.

Steps:
1. **Initialize a Storage List**: An empty list, `class_avg_list`, is created to store the new column values.

2. **Iterate Through Rows**: Each row in the DataFrame is examined, starting with the column 'concatenated_ratings_class' to fetch class names.
   - If the class information is NaN, NaN is appended to the results list, and processing moves to the next row.
   
3. **Split and Extract Classes**: Class names are split into individual categories, and a list of these classes is prepared for score calculation.

4. **Parse Ratings Data**: The 'ratings_translated' column, containing JSON-like strings, is safely converted into a list of dictionaries.
   - If parsing fails, an error message is logged, NaN is appended to `class_avg_list`, and the row is skipped.
   
5. **Match and Calculate Scores**:
   - For each class, corresponding IDs are looked up in `df_mapping`.
   - Scores associated with these IDs in 'ratings_translated' are extracted, and if multiple scores are found, their average is calculated.
   - The results are formatted as "ClassName Score" or "ClassName AvgScore".

6. **Store and Return Results**: The final class-score pairs are concatenated and appended to `class_avg_list`, which stores results in a structured "ClassName Score" format for further analysis.
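
The steps above can be traced on a single made-up row. Everything in this sketch is hypothetical: the ids, the mapping, and the scores are invented for illustration, and it assumes each id appears at most once per row.

```python
import ast
import numpy as np
import pandas as pd

# Hypothetical mapping of rating ids to categories
toy_mapping = pd.DataFrame({
    "id": ["teamwork", "colleagues", "salary"],
    "category": ["Collaboration & Teamwork & Social Culture",
                 "Collaboration & Teamwork & Social Culture",
                 "Financial Compensation & Benefits"],
})
# Hypothetical row content
ratings_str = "[{'id': 'teamwork', 'score': 4}, {'id': 'colleagues', 'score': 5}, {'id': 'salary', 'score': 2}]"
classes_str = "Collaboration & Teamwork & Social Culture, Financial Compensation & Benefits"

entries = ast.literal_eval(ratings_str)  # string -> list of dicts
score_by_id = {e["id"]: e["score"] for e in entries}

pairs = []
for cls in (c.strip() for c in classes_str.split(",")):
    ids = toy_mapping.loc[toy_mapping["category"] == cls, "id"]
    scores = [score_by_id[i] for i in ids if i in score_by_id]
    if len(scores) == 1:
        pairs.append(f"{cls} {scores[0]}")  # single score: keep as is
    elif scores:
        pairs.append(f"{cls} {np.mean(scores):.2f}")  # several scores: average

result = ", ".join(pairs)
print(result)
```

Here the first class maps to two ids and is averaged, while the second maps to one id and keeps its single score.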

In [None]:
# Initialize an empty list to store the new column values
class_avg_list = []

# Iterate through each row in df['concatenated_ratings_class']
for index, row in df.iterrows():
    concatenated_classes = row['concatenated_ratings_class']
    
    # If the concatenated classes are NaN, append NaN and move to the next row
    if pd.isna(concatenated_classes):
        class_avg_list.append(np.nan)
        continue

    # Split the concatenated classes into a list of individual class names
    classes = [cls.strip() for cls in concatenated_classes.split(',')]
    
    class_score_pairs = []  # This will store each class and its score

    # Safely convert the 'ratings_translated' string into a list of dictionaries
    try:
        score_entries = ast.literal_eval(row['ratings_translated'])  # Converting the string safely
    except (SyntaxError, ValueError) as e:
        # Log or handle parsing errors
        print(f"Error parsing ratings_translated in row {index}: {e}")
        class_avg_list.append(np.nan)
        continue  # Skip if conversion fails
    
    # Loop through the classes for this particular observation
    for cls in classes:
        # Look up corresponding ID(s) in df_mapping
        id_names = df_mapping[df_mapping['category'] == cls]['id'].tolist()
        
        # Handle cases where we find ID mappings
        if id_names:
            scores = []  # Collect all matching scores for the current class
            
            # Go through each id_name for the current class
            for id_name in id_names:
                # Find the score for the current id_name in the ratings_translated
                matching_scores = [entry['score'] for entry in score_entries if entry['id'] == id_name]
                
                # If we find a score, append it to the scores list
                if matching_scores:
                    scores.extend(matching_scores)
            
            # If scores were found for the class, calculate the average or use the single score
            if scores:
                if len(scores) == 1:
                    class_score_pairs.append(f"{cls} {scores[0]}")
                else:
                    avg_score = np.mean(scores)
                    class_score_pairs.append(f"{cls} {avg_score:.2f}")
    
    # After processing all classes in this row, store the result in the format "ClassName Score, ClassName Score"
    class_avg_list.append(', '.join(class_score_pairs))

# Add the new column to df
df['kununu_ratings_class_avg'] = class_avg_list

# Check the first few results of the processed list
print(class_avg_list[:10])


* The average ratings can then serve as a proxy for sentiment. For example, a category score below 3 indicates negative sentiment.
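
A minimal sketch of how the "ClassName Score" strings could be turned into sentiment labels. Only the "below 3 is negative" rule comes from this notebook; treating scores above 3 as positive and exactly 3 as neutral is an assumption, and both function names are hypothetical.

```python
def parse_class_scores(pairs_str):
    """Turn 'Class A 4.50, Class B 2' into {'Class A': 4.5, 'Class B': 2.0}."""
    scores = {}
    for pair in pairs_str.split(","):
        pair = pair.strip()
        if not pair:
            continue
        # The score is the last space-separated token; the rest is the class name
        name, _, score = pair.rpartition(" ")
        scores[name] = float(score)
    return scores

def score_to_sentiment(score, threshold=3.0):
    """Below threshold: negative (from the notebook). Above/equal: assumed positive/neutral."""
    if score < threshold:
        return "negative"
    if score > threshold:
        return "positive"
    return "neutral"

row_value = ("Collaboration & Teamwork & Social Culture 4.50, "
             "Financial Compensation & Benefits 2, Leadership & Communication 3")
sentiments = {name: score_to_sentiment(s) for name, s in parse_class_scores(row_value).items()}
print(sentiments)
```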

In [5]:
# If applicable, delete unnecessary columns
# df = df.drop(columns=['...'])
# Save the new df as an Excel file
df.to_excel('text_data_classes_sentiment.xlsx')

### End of the notebook