# Assigning Topics to Positive and Negative Reviews Using the BERTopic Model
-------------------

> <i>Description: In this notebook, our goal is to assign topics to the reviews and compare those topics with the GPT-based classification.</i>

Input Files: 
1) reviews_merged.csv

Output:
1) reviews_positive_topics.csv
2) reviews_negative_topics.csv


In [9]:
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer
import pandas as pd
import openpyxl
import re 
import nltk
from nltk.corpus import stopwords, wordnet
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize
import umap

import ast
import matplotlib.pyplot as plt
import numpy as np

### File Reading and Formatting

In [10]:
reviews_merged = pd.read_csv('reviews_merged.csv')

reviews_positive = reviews_merged[['uuid', 'date', 'year', 'rating', 'position'
                                   , 'position_code', 'department', 'pros','country', 'file']]
reviews_positive = reviews_positive[reviews_positive['pros'].notna()]

reviews_negative = reviews_merged[['uuid', 'date', 'year', 'rating', 'position'
                                   , 'position_code', 'department', 'cons','country', 'file']]
reviews_negative = reviews_negative[reviews_negative['cons'].notna()]
print(reviews_merged.shape)
print(reviews_positive.shape)
print(reviews_negative.shape)

## BERTopic Model for Positive Reviews

### Data Pre-Processing

Preparing the data for embedding and topic modeling:

* Break the pros down into sentences so that a single review can contribute to more than one topic.
* Lemmatize words, i.e. reduce each word to its base form; for example, "working" and "worked" both become "work".
* Finally, normalize the text: convert everything to lower case, remove empty reviews, remove stopwords, and strip anything non-alphabetical.
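
The steps above can be sketched without NLTK as follows. This is a minimal illustration under stated assumptions, not the notebook's implementation: `split_sentences` is a naive regex stand-in for `sent_tokenize`, `STOP_WORDS` is a made-up, deliberately tiny list, and lemmatization is left out. The sketch also discards sentences that become empty after cleaning, a step the notebook cells do not perform explicitly.

```python
import re

# Hypothetical, deliberately tiny stopword list; the notebook uses NLTK's English list.
STOP_WORDS = {"the", "is", "and", "a", "are", "of", "to"}

def split_sentences(review):
    """Naive splitter on '.', '!' and '?' (a stand-in for nltk's sent_tokenize)."""
    return [s.strip() for s in re.split(r"[.!?]+", review) if s.strip()]

def clean_sentence(sentence):
    """Keep letters only, lower-case, and drop stopwords (lemmatization omitted)."""
    letters_only = re.sub(r"[^A-Za-z\s]", "", sentence).lower()
    return " ".join(w for w in letters_only.split() if w not in STOP_WORDS)

def review_to_sentences(review):
    """Split a review into cleaned sentences, discarding any that end up empty."""
    cleaned = (clean_sentence(s) for s in split_sentences(review))
    return [s for s in cleaned if s]

example = "The team is great! Discounts are generous. :)"
print(review_to_sentences(example))
```

Splitting before stopword removal matters because sentence boundaries depend on the original punctuation; cleaning first would erase the periods the splitter relies on.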

In [20]:
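# punkt is needed by sent_tokenize; the stopwords, wordnet, and POS-tagger resources
# used below must also be installed (via nltk.download) for this notebook to run
nltk.download('punkt')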
lemmatizer = WordNetLemmatizer()

# Function to convert an NLTK (Penn Treebank) POS tag to a WordNet POS tag
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # Default to noun if POS tag is not recognized

# Function to clean and process text without removing stopwords initially
def clean_text(text):
    if not isinstance(text, str):
        return ''

    # Remove non-alphabetical characters
    text = re.sub(r'[^A-Za-z\s]', '', text)
    text = text.lower()

    # Tokenize and POS tagging
    words = text.split()
    pos_tags = pos_tag(words)

    # Lemmatize words based on their POS tags (without removing stopwords initially)
    lemmatized_words = [lemmatizer.lemmatize(word, get_wordnet_pos(pos)) for word, pos in pos_tags]

    return ' '.join(lemmatized_words)

def clean_sentence(sentence):
    cleaned_sentence = clean_text(sentence)
    
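    # Remove stopwords after lemmatization; stop_words is a global set that must be
    # defined before this function is called (it is created in the next cell)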
    return ' '.join([word for word in cleaned_sentence.split() if word not in stop_words])


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\baner\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


2707


In [None]:
# Step 1: Split each review into sentences; stopwords are kept until the sentences are cleaned
reviews_positive['sentences'] = reviews_positive['pros'].apply(sent_tokenize)

# Step 2: Clean sentences after splitting
stop_words = set(stopwords.words('english'))

# Clean each sentence individually after splitting
reviews_positive['cleaned_sentences'] = reviews_positive['sentences'].apply(lambda sents: [clean_sentence(sent) for sent in sents])

# Explode sentences into individual rows for topic modeling
df_sentences = reviews_positive.explode('cleaned_sentences').reset_index(drop=True)
filtered_sentences = df_sentences['cleaned_sentences'].tolist()
print(len(filtered_sentences))

### Embedding, Dimensionality Reduction, BERTopic

* **Embedding:** This step represents each sentence as a dense, continuous vector in a high-dimensional space. The vectors capture the semantic meaning of the text in a form that machine learning models can work with, so sentences with similar meaning end up close together.

* **Dimensionality Reduction:** Because the embeddings are high-dimensional, we reduce them before clustering so that the model can pick up patterns more easily. UMAP reduces the number of dimensions while preserving much of the local and global structure.

* **BERTopic:** A topic modeling technique that combines transformer-based embeddings with clustering to discover topics in large collections of text. Each resulting cluster is described by its most characteristic words, which gives coherent, readable topics.
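
To make the embedding idea concrete, the sketch below compares vectors with cosine similarity, a common way to measure how close two embeddings are. The 4-dimensional vectors and the phrases attached to them are invented for illustration only; real `all-MiniLM-L6-v2` embeddings have 384 dimensions and come from `model.encode`.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors (NOT real model output): the first two are meant to
# mimic sentences about pay, the third a sentence about uniforms.
toy_embeddings = {
    "good salary": [0.9, 0.1, 0.0, 0.2],
    "decent pay": [0.8, 0.2, 0.1, 0.3],
    "free uniform": [0.1, 0.9, 0.3, 0.0],
}

sim_close = cosine_similarity(toy_embeddings["good salary"], toy_embeddings["decent pay"])
sim_far = cosine_similarity(toy_embeddings["good salary"], toy_embeddings["free uniform"])

print(f"salary vs pay:     {sim_close:.2f}")
print(f"salary vs uniform: {sim_far:.2f}")
```

Clustering works on the same principle: sentences whose vectors point in similar directions are grouped together, and each group becomes a candidate topic.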

In [21]:
# Step 3: Encode each cleaned sentence as a dense vector with a pretrained sentence-transformer
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(filtered_sentences, show_progress_bar=True)

# Step 4: Apply BERTopic for Topic Modeling at the sentence level
# Use UMAP for dimensionality reduction before topic modeling
reducer = umap.UMAP(n_neighbors=10, n_components=5, min_dist=0.01, random_state=42)
umap_embeddings = reducer.fit_transform(embeddings)

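# Initialize BERTopic. Note: no umap_model is passed, so BERTopic also applies its own
# default UMAP step to the already reduced vectors supplied below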
topic_model = BERTopic(language="english", n_gram_range=(1, 2), calculate_probabilities=True)

# Step 5: Fit the BERTopic model to the sentence data
topics, probs = topic_model.fit_transform(filtered_sentences, umap_embeddings)

# Assign topics and probabilities back to the sentence-level DataFrame
df_sentences['topic'] = topics

# Store the highest topic probability for each sentence
df_sentences['topic_probability'] = np.max(probs, axis=1)
print(topic_model.get_topic_info())

     Topic  Count                                               Name  \
0       -1    228      -1_autonomy_lot_lot responsibility_atmosphere   
1        0     68                      0_hugo_hugo bos_bos_work hugo   
2        1     55            1_fashion_get_industry_fashion industry   
3        2     49  2_uniform_free uniform_uniform allowance_commi...   
4        3     49  3_culture_work culture_company culture_great c...   
..     ...    ...                                                ...   
105    104     11  104_international environment_great culture_cu...   
106    105     11                  105_medical_insurance_health_food   
107    106     11        106_salary_turkey_salary salary_high salary   
108    107     11  107_among_cohesion among_cohesion_among colleague   
109    108     11  108_benefit discount_discount benefit_benefit_...   

                                        Representation  \
0    [autonomy, lot, lot responsibility, atmosphere...   
1    [hugo, hugo bo

In [22]:
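# Merge similar topics so that 10 remain, counting the outlier topic -1.
# df_sentences['topic'] is not updated by this call and still holds the pre-reduction labels.
topic_model.reduce_topics(filtered_sentences, nr_topics=10)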
print(topic_model.get_topic_info())

   Topic  Count                                        Name  \
0     -1    228                   -1_good_team_work_company   
1      0    745               0_good_great_environment_team   
2      1    376                  1_salary_work_good_balance   
3      2    372           2_uniform_campus_discount_canteen   
4      3    296            3_commission_colleague_good_nice   
5      4    286  4_discount_voucher_employee discount_staff   
6      5    142                    5_people_work_place_good   
7      6    141        6_brand_intern_product_international   
8      7     68               7_hugo_bos_hugo bos_work hugo   
9      8     53              8_pro_nothing_everything_think   

                                      Representation  \
0  [good, team, work, company, lot, great, atmosp...   
1  [good, great, environment, team, work, company...   
2  [salary, work, good, balance, benefit, hour, f...   
3  [uniform, campus, discount, canteen, clothing,...   
4  [commission, colleague,

In [23]:
model_topics = topic_model.get_topic_info()
model_topics.to_csv('reviews_positive_topics.csv')

## BERTopic Model for Negative Reviews

In [24]:
# Step 1: Split each review into sentences; stopwords are kept until the sentences are cleaned
reviews_negative['sentences'] = reviews_negative['cons'].apply(sent_tokenize)

# Step 2: Clean sentences after splitting
stop_words = set(stopwords.words('english'))

# Clean each sentence individually after splitting
reviews_negative['cleaned_sentences'] = reviews_negative['sentences'].apply(lambda sents: [clean_sentence(sent) for sent in sents])

# Explode sentences into individual rows for topic modeling
df_sentences = reviews_negative.explode('cleaned_sentences').reset_index(drop=True)
filtered_sentences = df_sentences['cleaned_sentences'].tolist()
print(len(filtered_sentences))

3066


### Embedding, Dimensionality Reduction, BERTopic

In [25]:
# Step 3: Encode each cleaned sentence as a dense vector with a pretrained sentence-transformer
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(filtered_sentences, show_progress_bar=True)

# Step 4: Apply BERTopic for Topic Modeling at the sentence level
# Use UMAP for dimensionality reduction before topic modeling
reducer = umap.UMAP(n_neighbors=10, n_components=5, min_dist=0.01, random_state=42)
umap_embeddings = reducer.fit_transform(embeddings)

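# Initialize BERTopic. Note: no umap_model is passed, so BERTopic also applies its own
# default UMAP step to the already reduced vectors supplied below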
topic_model = BERTopic(language="english", n_gram_range=(1, 2), calculate_probabilities=True)

# Step 5: Fit the BERTopic model to the sentence data
topics, probs = topic_model.fit_transform(filtered_sentences, umap_embeddings)

# Assign topics and probabilities back to the sentence-level DataFrame
df_sentences['topic'] = topics

# Store the highest topic probability for each sentence
df_sentences['topic_probability'] = np.max(probs, axis=1)
print(topic_model.get_topic_info())

     Topic  Count                                               Name  \
0       -1    322                         -1_tire_promise_store_time   
1        0     61         0_management_senior_poor_senior management   
2        1     57           1_manager_management_handle_area manager   
3        2     52  2_opportunity_opportunity grow_grow_opportunit...   
4        3     43   3_balance_worklife_worklife balance_life balance   
..     ...    ...                                                ...   
121    120     11                   120_gossip_hr_response_encourage   
122    121     11  121_stuttgart_metzingen_drive stuttgart_stuttg...   
123    122     11  122_drawback_drawback drawback_find drawback_n...   
124    123     11               123_none_come mind_none nothing_mind   
125    124     10                   124_gift_david_david jones_jones   

                                        Representation  \
0    [tire, promise, store, time, hour, long, exper...   
1    [management, s

In [26]:
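# Merge similar topics so that 10 remain, counting the outlier topic -1.
# df_sentences['topic'] is not updated by this call and still holds the pre-reduction labels.
topic_model.reduce_topics(filtered_sentences, nr_topics=10)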
print(topic_model.get_topic_info())

   Topic  Count                                         Name  \
0     -1    322                      -1_work_hour_store_time   
1      0   1011        0_management_company_employee_manager   
2      1    726                        1_salary_sale_pay_low   
3      2    370                 2_work_balance_location_life   
4      3    240                    3_hour_work_long_overtime   
5      4    164                     4_con_nothing_none_think   
6      5    140          5_metzingen_process_german_decision   
7      6     38    6_disadvantage_drawback_negative_downside   
8      7     35                7_hugo_hugo bos_bos_work hugo   
9      8     20  8_nice unacceptable_na nice_na_unacceptable   

                                      Representation  \
0  [work, hour, store, time, management, pay, lon...   
1  [management, company, employee, manager, staff...   
2  [salary, sale, pay, low, commission, pressure,...   
3  [work, balance, location, life, life balance, ...   
4  [hour, work,

In [None]:
model_topics = topic_model.get_topic_info()
model_topics.to_csv('reviews_negative_topics.csv')