# **Topic project 4.1 Applying NLP for topic modelling in a real-life context**

# Foreword: For anonymity and confidentiality reasons, the data and outputs have been cleared, with only the code and general summaries remaining

In this project, I bridge the gap between theory and practical application by developing automated topic modelling tools tailored to a specific industry context.

Applying NLP for topic modelling is crucial for data analysis in business because it enables companies to identify and understand key themes and patterns within large volumes of text data. This efficiency allows businesses to extract essential insights and trends without manually sifting through extensive documents. Automated topic modelling helps businesses make informed decisions faster, which helps to improve productivity and gain a competitive edge. Additionally, it supports better information management by uncovering underlying topics in reports, emails, customer feedback, and market research, which enhances overall business intelligence and strategic planning.

This project took approximately **19 hours** to complete.

<br>

## **Business context**

This project focuses on an unnamed gym company that offers customers high-quality, low-cost, and flexible fitness facilities. The company’s customer-centric proposition – affordable membership fees, no fixed-term contracts, and 24/7 access to high-quality gyms – differentiates it from more traditional gyms and elevates it as a market leader within the space.

This focus on the customer is centred on wanting to understand what motivates members to join and what factors influence their behaviours once they have joined.

Understanding how to leverage innovative technology to influence, improve, and simplify their experience allows this gym to foster an open, welcoming, and diverse environment for its members while maintaining their value proposition.

With the shift in focus to value-for-money memberships across the gym industry,
this gym aims to provide members with affordable access to the benefits that being healthy can offer.

<br>

## **Objective**

To review the gym's review data and uncover key drivers that provide actionable insights for enhancing customer experience.

In this project I:

- Use two data sets containing customer reviews from Google and Trustpilot.
- Perform basic level analysis by finding the frequently used words in both data sets.
- Generate a wordcloud to visualise the most frequently used words in the reviews.
- Apply BERTopic for topic modelling, keeping track of gym locations, to identify common topics and words in the negative reviews.
- Identify the locations that have the most negative reviews.
- Use the built-in visualisation functions in BERTopic to cluster and visually represent the topics and words in these reviews, thereby helping to identify specific themes from the reviews.
- Conduct a comparison with Gensim’s LDA model to validate the topic modelling results.
- Perform emotion analysis to identify the emotions associated with customer reviews.
- Filter out angry reviews and apply BERTopic to discover prevalent topics and words being discussed in these negative reviews.
- Leverage the multi-purpose capability of the state-of-the-art Falcon-7b-instruct model, with the help of prompts, to identify top topics in each review.
- Use a different prompt with the Falcon-7b-instruct model to further generate suggestions for improvements, based on the top topics identified from the negative reviews.

<br>

## **This project demonstrates that I can:**

- Investigate real-world data to find potential trends for deeper investigation.
- Preprocess and refine textual data for visualisations.
- Apply topic modelling using various techniques.
- Apply emotion analysis using BERT.
- Evaluate the outcomes of my investigation.
- Communicate actionable insights.

# 1. Import packages and data:

* Along with some simple data cleaning: removing any rows with missing values in the Comment column (Google reviews) and the Review Content column (Trustpilot)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
import random
import torch

# Set seed for reproducibility
np.random.seed(42)
random.seed(42)
torch.manual_seed(42)

<torch._C.Generator at 0x78a8443c71b0>

In [None]:
!pip install nltk
import nltk
nltk.download("all")
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk import FreqDist

In [None]:
import gdown
destination_TP = 'Trustpilot.xlsx'

# Construct the download URL
download_url = f''

# Download the file using gdown
gdown.download(download_url, destination_TP, quiet=False)

destination_G = 'Google.xlsx'

# Construct the download URL
download_url = f''

# Download the file using gdown
gdown.download(download_url, destination_G, quiet=False)

In [5]:
df_trustpilot = pd.read_excel('Trustpilot.xlsx', sheet_name='65bcde732b27bf001a58fef2_821ec1')

#df_google = pd.read_excel('Google.xlsx', sheet_name='Sheet0')

In [None]:
try:
    # Read all tables in the HTML file without specifying headers
    df_list = pd.read_html('Google.xlsx')
    df_google = df_list[0]  # Assuming the table you want is the first one

    # Set the first row as header and reset index
    df_google.columns = df_google.iloc[0]
    df_google = df_google[1:].reset_index(drop=True)

    # Print the first few rows of the DataFrame to verify the column headers
    print(df_google.head())
except Exception as e:
    print(f"Error reading HTML: {e}")

In [None]:
display(df_trustpilot.head(3))

display(df_google.head(3))

I create a copy of each DataFrame here, as it will be useful in part 6, where I tokenise using an AutoTokenizer from my transformer model rather than NLTK.

In [8]:
df_trustpilot_unprocessed = df_trustpilot.copy()
df_google_unprocessed = df_google.copy()

## Data cleaning - dropping of rows with missing values in the comments column

In [None]:
# Check for missing values in the 'Comment' column
missing_values_count = df_google['Comment'].isna().sum()
print(f"Number of missing values in 'Comment' column: {missing_values_count}")

# Remove rows where 'Comment' is missing
df_google = df_google.dropna(subset=['Comment'])

# Drop the column with integer name 1
df_google = df_google.drop(columns=[1])

In [None]:
display(df_google.head(3))

In [None]:
# Check for missing values in the 'Review Content' column
missing_values_count = df_trustpilot['Review Content'].isna().sum()
print(f"Number of missing values in 'Review Content' column: {missing_values_count}")

# Remove rows where 'Review Content' is missing
df_trustpilot = df_trustpilot.dropna(subset=['Review Content'])

In [None]:
missing_values_count = df_trustpilot.isna().sum()
print(f"Number of missing values in 'df_trustpilot': {missing_values_count}")

# 2. Conducting initial data investigation:


## 2.1 Review of data sets locations

Finding the number of unique locations in each data set.

Finding the number of common locations across the two data sets.

In [None]:
display(df_google.head(5))

display(df_trustpilot.head(1))

In [None]:
# Define the column name
google_column_name = "Club's Name"
trustpilot_column_name = "Location Name"

# Print the number of unique locations
print(f"Number of unique locations in Google data set: {df_google[google_column_name].nunique()}")

# Print the number of unique locations
print(f"Number of unique locations in Trustpilot data set: {df_trustpilot[trustpilot_column_name].nunique()}")

In [None]:
df_google["Club's Name"].unique()

Gyms not in the UK:

Redacted


In [None]:
print(df_trustpilot["Review Language"].unique())

# Assuming df_trustpilot is your DataFrame
language_counts = df_trustpilot["Review Language"].value_counts()
print(language_counts)

In [None]:
# Assuming df_trustpilot is your DataFrame
language_counts = df_trustpilot["Review Language"].value_counts()

# Sum the counts of all non-English languages
non_en_count = language_counts.drop('en').sum()

print(f"Total count of non-English reviews: {non_en_count}")

These non-English reviews could be removed, as so far the transformer models are only applied to English text.

That said, when I create a DataFrame of the common locations they are removed automatically, because their locations are not present in the Google reviews and so fall outside the intersection.
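If the non-English reviews ever did need to be removed explicitly, a minimal sketch could look like the following. It uses a small made-up DataFrame (`toy_reviews` and its contents are hypothetical); only the `Review Language` and `Review Content` column names come from the Trustpilot data above.

```python
import pandas as pd

# Hypothetical toy data standing in for df_trustpilot
toy_reviews = pd.DataFrame({
    "Review Content": ["Great gym", "Sehr gut", "Too crowded"],
    "Review Language": ["en", "de", "en"],
})

# Keep only the rows whose language code is English
toy_reviews_en = toy_reviews[toy_reviews["Review Language"] == "en"].reset_index(drop=True)

print(len(toy_reviews_en))
```

The same boolean mask could be applied to `df_trustpilot` before any further processing.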

In [None]:
df_trustpilot["Location Name"].unique()

In [None]:
# List of locations to check
locations_to_check = [
    '',
    '',
    '',
    '',
    ''
]

# Check if the locations are present in df_trustpilot["Location Name"]
locations_present = df_trustpilot["Location Name"].isin(locations_to_check)

# Get the actual locations found in the DataFrame
found_locations = df_trustpilot.loc[locations_present, "Location Name"].unique()

# Print the results
for location in locations_to_check:
    if location in found_locations:
        print(f"'{location}' is present in df_trustpilot['Location Name']")
    else:
        print(f"'{location}' is NOT present in df_trustpilot['Location Name']")


In [None]:
# Get unique location names from each data frame
google_locations = set(df_google[google_column_name].dropna().unique())
trustpilot_locations = set(df_trustpilot[trustpilot_column_name].dropna().unique())

# Find common locations
common_locations = google_locations.intersection(trustpilot_locations)

# Print the number of common locations and their names
print(f"Number of common locations: {len(common_locations)}")
print("Common locations:")
print(common_locations)
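Exact string matching will miss a location if its name differs between the two sources only by case or surrounding whitespace. Whether that happens in this data is not known, so the following is only a sketch of a more forgiving comparison; the helper `normalise_name` and the example names are hypothetical.

```python
def normalise_name(name):
    """Case-fold a location name and collapse surrounding/internal whitespace."""
    return " ".join(str(name).split()).casefold()

# Hypothetical location names from the two sources
google_names = ["Club A", "Club B ", "Club C"]
trustpilot_names = ["club a", "Club B", "Club D"]

google_norm = {normalise_name(n) for n in google_names}
trustpilot_norm = {normalise_name(n) for n in trustpilot_names}

# Intersection on the normalised names
common_norm = google_norm & trustpilot_norm
print(sorted(common_norm))
```

If the normalised intersection turned out larger than the exact one, the mismatched names would be worth inspecting by hand before merging.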

## 2.2 Preprocessing of the data

Change to lower case, tokenise the data using word_tokenize from NLTK, remove stopwords, and remove numbers.



Note to self: delete the cell below the next time this notebook is run, and uncomment the lowercase-conversion code further down.

In [21]:
display(df_google.head(2))

display(df_trustpilot.head(2))

| Unnamed: 0 | Customer Name | SurveyID for external use (e.g. tech support) | Club's Name | Social Media Source | Creation Date | Comment | Overall Score |
|---|---|---|---|---|---|---|---|
| 1 | ** | e9b62vyxtkwrrrfyzc5hz6rk | Cambridge Leisure Park | Google Reviews | 5/9/24 22:48 | Too many students from two local colleges go h... | 1 |
| 2 | ** | e2dkxvyxtkwrrrfyzc5hz6rk | London Holborn | Google Reviews | 5/9/24 22:08 | Best range of equipment, cheaper than regular ... | 5 |

| Unnamed: 0 | Review ID | Review Created (UTC) | Review Consumer User ID | Review Title | Review Content | Review Stars | Source Of Review | Review Language | Domain URL | Webshop Name | Business Unit ID | Tags | Company Reply Date (UTC) | Location Name | Location ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 663d40378de0a14c26c2f63c | 2024-05-09 23:29:00 | 663d4036d5fa24c223106005 | A very good environment | A very good environment | 5 | AFSv2 | en | http://www.puregym.com | PureGym UK | 508df4ea00006400051dd7b1 |  | 2024-05-10 08:12:00 | Solihull Sears Retail Park | 7b03ccad-4a9d-4a33-9377-ea5bba442dfc |
| 1 | 663d3c101ccfcc36fb28eb8c | 2024-05-09 23:11:00 | 5f5e3434d53200fa6ac57238 | I love to be part of this gym | I love to be part of this gym. Superb value fo... | 5 | AFSv2 | en | http://www.puregym.com | PureGym UK | 508df4ea00006400051dd7b1 |  | 2024-05-10 08:13:00 | Aylesbury | 612d3f7e-18f9-492b-a36f-4a7b86fa5647 |


In [22]:
"""
def check_and_convert_to_lowercase(df, column_name):
    # Check if there are any uppercase characters before conversion
    contains_uppercase_before = df[column_name].apply(lambda x: any(char.isupper() for char in x))

    # Print the rows that contain uppercase characters before conversion
    print("Rows with uppercase characters before conversion:")
    print(df[contains_uppercase_before])

    # Convert the specified column to lowercase
    df[column_name] = df[column_name].str.lower()

    # Check if there are any uppercase characters after conversion
    contains_uppercase_after = df[column_name].apply(lambda x: any(char.isupper() for char in x))

    # Print the rows that still contain uppercase characters
    print("Rows with uppercase characters after conversion:")
    print(df[contains_uppercase_after])
"""


In [23]:
#check_and_convert_to_lowercase(df_google, column_name='Comment')

In [24]:
#check_and_convert_to_lowercase(df_trustpilot, column_name='Review Content')

In [25]:
#display(df_google.head(2))

#display(df_trustpilot.head(2))

In [26]:
"""
# Define the column names
google_column_name = "Comment"
trustpilot_column_name = "Review Content"

# Print 3 rows from the 'Comment' column in df_google
print("Sample rows from Google DataFrame:")
print(df_google[[google_column_name]].head(3))

# Print 3 rows from the 'Review Content' column in df_trustpilot
print(" \n Sample rows from Trustpilot DataFrame:")
print(df_trustpilot[[trustpilot_column_name]].head(3))
"""


In [27]:
import string

In [28]:
def preprocess_nlp(text):
    """
    Preprocess the text by tokenizing, converting to lowercase,
    removing stopwords (including some custom ones), and filtering out numbers.

    Parameters:
        text (str): The input text to preprocess.

    Returns:
        list: The processed tokens.
    """
    # Tokenization
    tokens = word_tokenize(text.lower())

    # Define stopwords
    stop_words = set(stopwords.words('english'))

    # Add custom stopwords inside the function
    #custom_stopwords = {',', '"', "'", '.', ':', ';', '-', '(', ')', '[', ']', '{', '}'}
    #stop_words.update(custom_stopwords)

    # Remove stopwords
    filtered_tokens = [token for token in tokens if token not in stop_words]

    # Remove numbers
    filtered_numbers = [token for token in filtered_tokens if not token.isdigit()]

    return filtered_numbers
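Because the custom punctuation stopwords above are commented out, punctuation tokens such as `.` or `,` remain in the output of `preprocess_nlp` and may show up in the frequency plots. One possible way to drop them is sketched here using the standard-library `string.punctuation` instead of a hand-written set; the helper name `remove_punctuation_tokens` and the sample tokens are hypothetical.

```python
import string

def remove_punctuation_tokens(tokens):
    """Drop tokens made up entirely of ASCII punctuation characters."""
    punctuation = set(string.punctuation)
    return [t for t in tokens if not all(ch in punctuation for ch in t)]

# Hypothetical token list, similar to what word_tokenize might return
sample_tokens = ["gym", ",", "dirty", "...", "n't", "!"]
cleaned_tokens = remove_punctuation_tokens(sample_tokens)
print(cleaned_tokens)
```

Tokens that mix letters and punctuation, such as `n't`, are kept, so contractions are not silently lost.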

In [None]:
display(df_google.head(2))
display(df_trustpilot.head(2))

In [30]:
df_google['Comment'] = df_google['Comment'].apply(preprocess_nlp)
df_trustpilot['Review Content'] = df_trustpilot['Review Content'].apply(preprocess_nlp)

In [None]:
display(df_google.head(2))
display(df_trustpilot.head(2))

## 2.3 EDA/ Visualisation

- Frequency distribution of the words from each data set's reviews.
- Histogram/bar plot showing the top 10 words from each data set.

In [32]:
def plot_top_words(df, column_name, top_n=10, title=None):
    """
    Plot a histogram of the top N words by frequency from a specified column in a DataFrame.

    Parameters:
    - df: The DataFrame containing the text data.
    - column_name: The name of the column containing the tokenized text.
    - top_n: The number of top words to display in the plot (default is 10).
    - title: Optional custom title for the plot.
    """
    # Flatten all tokens into a single list
    all_tokens = [token for sublist in df[column_name] for token in sublist]

    # Create a frequency distribution object
    freq_dist = FreqDist(all_tokens)

    # Get the top N most common tokens
    top_n_tokens = freq_dist.most_common(top_n)

    # Separate the tokens and their counts
    words, counts = zip(*top_n_tokens)

    # Determine the title
    if title is None:
        title = f'Top {top_n} Words by Frequency in DataFrame'
    else:
        title = f'Top {top_n} Words by Frequency in DataFrame - {title}'

    # Plotting
    plt.figure(figsize=(10, 6))
    plt.bar(words, counts, color='skyblue')
    plt.xlabel('Words')
    plt.ylabel('Frequency')
    plt.title(title)
    plt.xticks(rotation=45)
    plt.show()

In [None]:
plot_top_words(df_google, column_name='Comment', top_n=10, title='Google')

plot_top_words(df_trustpilot, column_name='Review Content', top_n=10, title='Trustpilot')

In [None]:
pip install wordcloud

In [35]:
from wordcloud import WordCloud

In [36]:
def plot_wordcloud(df, column_name, title=None):
    """
    Generate and plot a word cloud from a DataFrame column.

    Parameters:
    - df: The DataFrame containing the text data.
    - column_name: The name of the column containing the tokenized text.
    - title: Optional title for the word cloud plot.
    """
    # Join all text data into a single string
    text = ' '.join([' '.join(sublist) for sublist in df[column_name]])
    # combines all lists of tokens into a single string suitable for generating the word cloud.

    # Generate the word cloud
    wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)

    # Plot the word cloud
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    if title:
        plt.title(title)
    plt.show()

In [None]:
plot_wordcloud(df_google, 'Comment', title='Word Cloud for Google Reviews')
plot_wordcloud(df_trustpilot, 'Review Content', title='Word Cloud for Trustpilot Reviews')

## 2.4 Wordcloud for negative only reviews

Create a new dataframe by filtering out the data to extract only the negative reviews from both data sets.

  • For Google reviews, overall scores < 3 can be considered negative scores.

  • For Trustpilot reviews, stars < 3 can be considered negative scores.

  Repeat the frequency distribution and wordcloud steps on the filtered data consisting of only negative reviews.

In [None]:
df_google.head(2)

In [None]:
# Check current data type of 'Overall Score'
print("Current data type of 'Overall Score':")
print(df_google['Overall Score'].dtype)

print(df_google['Overall Score'].unique())
# The scores are all whole numbers, so they can be converted straight to int (decimal values would have needed float instead)

# Check for missing values in 'Overall Score'
print("Missing values in 'Overall Score':")
print(df_google['Overall Score'].isna().sum())
# No missing values so did not need to handle that

# Convert 'Overall Score' directly to integer
df_google['Overall Score'] = df_google['Overall Score'].astype(int)

# Verify the conversion
print("\nDataFrame after conversion to integer:")
print(df_google)
print("\nData type of 'Overall Score' after conversion:")
print(df_google['Overall Score'].dtype)

In [40]:
df_google_negative = df_google[df_google['Overall Score'] < 3]
df_trustpilot_negative = df_trustpilot[df_trustpilot['Review Stars'] < 3]

In [None]:
plot_top_words(df_google_negative, column_name='Comment', top_n=10, title='Google - Negative')
plot_top_words(df_trustpilot_negative, column_name='Review Content', top_n=10, title='Trustpilot - Negative')

In [None]:
plot_wordcloud(df_google_negative, 'Comment', title='Word Cloud for Google Reviews - Negative')
plot_wordcloud(df_trustpilot_negative, 'Review Content', title='Word Cloud for Trustpilot Reviews - Negative')

# 3. Conducting initial topic modelling:

## 3.1

- Filtering for the reviews that are from the locations common to both data sets.
- Merging the reviews to form a new list

- Preprocessing
- Applying BERTopic
- Top topics, words and frequencies
- Visualising of the topics to identify clusters of topics and to understand the intertopic distance map
- Barchart of the topics
- Heatmap of similarity matrix.

In [43]:
# from transformers import pipeline

In [None]:
display(df_google_negative.head(2))
display(df_trustpilot_negative.head(2))

In [45]:
# Convert the set to a list
common_locations_list = list(common_locations)

# Filter the DataFrame
df_google_neg_common = df_google_negative[df_google_negative["Club's Name"].isin(common_locations_list)]

# Filter the DataFrame
df_trustpilot_neg_common = df_trustpilot_negative[df_trustpilot_negative["Location Name"].isin(common_locations_list)]

# Filter the DataFrame for rows where 'Club\'s Name' is not in the common_locations_list
df_google_neg_uncommon = df_google_negative[~df_google_negative["Club's Name"].isin(common_locations_list)]

# Filter the DataFrame for rows where 'Location Name' is not in the common_locations_list
df_trustpilot_neg_uncommon = df_trustpilot_negative[~df_trustpilot_negative["Location Name"].isin(common_locations_list)]

In [None]:
display(df_google_neg_common.head(2))
display(df_trustpilot_neg_common.head(2))

display(df_google_neg_uncommon.head(2))
display(df_trustpilot_neg_uncommon.head(2))

In [47]:
def concatenate_dataframes(df1, df2, df1_cols, df2_cols, rename_dict=None):
    """
    Concatenates two DataFrames along rows, retaining specified columns from each DataFrame.

    Parameters:
    - df1: The first DataFrame.
    - df2: The second DataFrame.
    - df1_cols: List of columns to retain from df1.
    - df2_cols: List of columns to retain from df2.
    - rename_dict: Optional dictionary to rename columns in df2 to match df1.

    Returns:
    - Concatenated DataFrame.
    """
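    # Select the relevant columns from df1 (copied so the 'Source' column is not written to a view)
    df1_subset = df1[df1_cols].copy()

    # Select the relevant columns from df2 (copied for the same reason)
    df2_subset = df2[df2_cols].copy()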

    # Rename columns in df2 if rename_dict is provided
    if rename_dict:
        df2_subset = df2_subset.rename(columns=rename_dict)

    # Add a column to identify the source DataFrame
    df1_subset['Source'] = 'df1'
    df2_subset['Source'] = 'df2'

    # Concatenate the DataFrames
    concatenated_df = pd.concat([df1_subset, df2_subset], ignore_index=True)

    return concatenated_df

In [None]:
df_google_cols = ["Club's Name", "Creation Date", "Overall Score", "Comment"]
df_trustpilot_cols = ["Review Created (UTC)", "Review Title", "Review Content", "Review Stars", "Company Reply Date (UTC)", "Location Name"]
rename_dict = {"Location Name": "Club's Name"}

merged_common = concatenate_dataframes(df_google_neg_common, df_trustpilot_neg_common, df_google_cols, df_trustpilot_cols, rename_dict)
merged_uncommon = concatenate_dataframes(df_google_neg_uncommon, df_trustpilot_neg_uncommon, df_google_cols, df_trustpilot_cols, rename_dict)

In [None]:
display(merged_common.head(2))
display(merged_uncommon.head(2))

In [None]:
merged_common

In [None]:
!pip install bertopic
!pip install transformers
!pip install sentence-transformers

In [52]:
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

In [None]:
merged_common['Comment']

In [None]:
merged_common['Review Content']

In [None]:
# List of columns expected to contain lists that need NaN values filled with empty lists
#list_columns = ['Comment', 'Review Content']

# Fill NaN values with empty lists for specific columns
#for col in list_columns:
#    merged_common[col] = merged_common[col].apply(lambda x: x if isinstance(x, list) else [])

documents = []

# Fill NaN values with empty lists
merged_common['Comment'] = merged_common['Comment'].apply(lambda x: x if isinstance(x, list) else [])
merged_common['Review Content'] = merged_common['Review Content'].apply(lambda x: x if isinstance(x, list) else [])

# Combine the tokens from both columns row-wise into single documents
documents = [
    " ".join((comment if isinstance(comment, list) else []) +
             (review_content if isinstance(review_content, list) else []))
    for comment, review_content in zip(merged_common['Comment'], merged_common['Review Content'])
]

# Check the first few documents
print(f"First few documents:\n{documents[:10]}")

# Check the total number of documents
print(f"Total Documents: {len(documents)}")

In [None]:
documents

In [None]:
def check_invalid_entries(documents):
    empty_entries = [i for i, doc in enumerate(documents) if not doc.strip()]
    nan_entries = [i for i, doc in enumerate(documents) if pd.isna(doc)]
    valid_entries = [i for i, doc in enumerate(documents) if doc.strip() and not pd.isna(doc)]

    print(f"Total Documents: {len(documents)}")
    print(f"Empty Entries: {len(empty_entries)}")
    print(f"NaN Entries: {len(nan_entries)}")
    print(f"Valid Entries: {len(valid_entries)}")

    return empty_entries, nan_entries, valid_entries

check_invalid_entries(documents)

In [None]:
# Initialize the BERTopic model
model = BERTopic(embedding_model=SentenceTransformer("paraphrase-MiniLM-L6-v2"))

# Fit the model to your documents
topics, probabilities = model.fit_transform(documents)

# Display topics
model.visualize_topics()

In [None]:
model.visualize_topics()

In [None]:
model.get_topic_info()

There is a sizeable number of outliers (topic -1), which should typically be ignored. Let's take a look at the most frequent topic that was generated, topic 0.

In [None]:
model.get_topic(0)

In [None]:
model.get_topic(1)
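The outlier topic -1 noted above can be excluded before picking the most frequent topics. The sketch below uses a made-up DataFrame (`toy_topic_info` and its counts are hypothetical) with the `Topic` and `Count` columns that `get_topic_info()` returns.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame returned by model.get_topic_info()
toy_topic_info = pd.DataFrame({
    "Topic": [-1, 0, 1, 2],
    "Count": [120, 80, 45, 30],
})

# Drop the outlier row (-1), then keep the two most frequent remaining topics
real_topics = toy_topic_info[toy_topic_info["Topic"] != -1]
top_two = real_topics.sort_values("Count", ascending=False).head(2)["Topic"].tolist()
print(top_two)
```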

In [None]:
# Extract topic-document mappings
topic_docs = model.get_representative_docs()

# Create a list of dictionaries to convert to a DataFrame
data = []
for topic, docs in topic_docs.items():
    for doc in docs:
        data.append({'Topic': topic, 'Document': doc, 'Probability': 1.0})  # Assuming probability is not available, use a placeholder

# Convert to DataFrame
topics_df = pd.DataFrame(data)

# Display the DataFrame to understand its structure
print(topics_df.head())

# For each topic, get the top 10 documents
top_docs_per_topic = {}
for topic in topics_df['Topic'].unique():
    topic_docs_df = topics_df[topics_df['Topic'] == topic]
    # Sort documents based on their probabilities
    sorted_docs = topic_docs_df.sort_values(by='Probability', ascending=False)
    top_docs_per_topic[topic] = sorted_docs.head(10)  # Get top 10 documents

# Print the top 10 documents for each topic
for topic, top_docs in top_docs_per_topic.items():
    print(f"Topic {topic}:")
    for _, row in top_docs.iterrows():
        print(f"Document: {row['Document']}\nProbability: {row['Probability']}\n")
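The placeholder probability of 1.0 means the sort above does not actually rank anything. Assuming the `topics` and `probabilities` returned by `fit_transform` are aligned one-to-one with `documents`, with one probability per document (my understanding of BERTopic's default behaviour, not verified here), a ranking could be sketched as follows with made-up values:

```python
import pandas as pd

# Hypothetical stand-ins for documents, topics and probabilities
toy_documents = ["gym dirty", "staff rude", "showers cold", "staff unhelpful"]
toy_topics = [0, 1, 0, 1]
toy_probabilities = [0.91, 0.55, 0.62, 0.87]

doc_topics = pd.DataFrame({
    "Document": toy_documents,
    "Topic": toy_topics,
    "Probability": toy_probabilities,
})

# For each topic, keep the document with the highest assignment probability
top_docs = (
    doc_topics.sort_values("Probability", ascending=False)
    .groupby("Topic")
    .head(1)
    .sort_values("Topic")
    .reset_index(drop=True)
)
print(top_docs)
```

Raising the `head(1)` argument would return more documents per topic.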

In [None]:
# Get topic information
topic_info = model.get_topic_info()

# Take the first 3 rows of topic_info: the outlier topic -1 (if present) plus the most frequent topics
top_topics = topic_info.head(3)['Topic'].tolist()

# Get top words for these topics
for topic in top_topics:
    top_words = model.get_topic(topic)
    print(f"Top words for Topic {topic}:")
    for word, weight in top_words:
        print(f"{word}: {weight}")
    print("\n")

In [None]:
# Check the structure of topic_docs
print(topic_docs)

In [None]:
model.visualize_barchart()

In [None]:
model.visualize_heatmap()

## 3.2. Exploring the data further:

- Top 20 locations with the highest number of negative reviews across the data sets

- Merging the 2 data sets using Location Name and Club's Name

    Sorting based on the total number of reviews.

    Reviewing the locations and the total number of reviews for each location on Trustpilot and Google, as well as the combined total of reviews from both platforms

- Redoing the word frequency and word cloud for the top 30 locations

- For the top 30 locations, combining the reviews from both platforms and running them through BERTopic.

Recall that merged_common only captures the negative reviews, so counting how often each location appears in "Club's Name" gives the top 20 locations with the highest number of negative reviews, sorted in descending order, separately for the two review sites.

Recall that df1 refers to the Google reviews and df2 to the Trustpilot reviews.

May be worth renaming merged_common to merged_common_neg

In [None]:
merged_common

In [None]:
# Create counts for Google reviews
google_location_counts = (
    merged_common[merged_common['Source'] == 'df1']["Club's Name"]
    .value_counts()
    .reset_index()
)
google_location_counts.columns = ["Club's Name", "Negative Review Count"]
google_top_20_locations = google_location_counts.sort_values(by="Negative Review Count", ascending=False).head(20)

# Create counts for Trustpilot reviews
trustpilot_location_counts = (
    merged_common[merged_common['Source'] == 'df2']["Club's Name"]
    .value_counts()
    .reset_index()
)
trustpilot_location_counts.columns = ["Club's Name", "Negative Review Count"]
trustpilot_top_20_locations = trustpilot_location_counts.sort_values(by="Negative Review Count", ascending=False).head(20)

# Display the results
print("Top 20 locations with highest number of negative reviews for Google:")
print(google_top_20_locations)

print("\nTop 20 locations with highest number of negative reviews for Trustpilot:")
print(trustpilot_top_20_locations)

In [None]:
# Get the top 5 locations for Google and Trustpilot reviews
google_top_5_locations = google_top_20_locations.head(5)
trustpilot_top_5_locations = trustpilot_top_20_locations.head(5)

# Find the intersection of the top 5 locations
common_top_5_locations = pd.merge(google_top_5_locations, trustpilot_top_5_locations, on="Club's Name", suffixes=('_google', '_trustpilot'))

# Display the common top 5 locations
print("Common clubs in the top 5 of both Google and Trustpilot negative reviews: \n")

common_top_5_locations

In [None]:
merged_common.head(3)

• Locations

• Number of Trustpilot reviews for this location

• Number of Google reviews for this location

• Total number of reviews for this location (sum of Google reviews and Trustpilot reviews)

Sorting based on the total number of reviews.

In [None]:
# Count the negative reviews by location for Google and Trustpilot separately
google_negative_review_counts = merged_common[merged_common['Source'] == 'df1']['Club\'s Name'].value_counts().reset_index()
trustpilot_negative_review_counts = merged_common[merged_common['Source'] == 'df2']['Club\'s Name'].value_counts().reset_index()

# Rename the columns for clarity
google_negative_review_counts.columns = ['Club\'s Name', 'negative_review_count_google']
trustpilot_negative_review_counts.columns = ['Club\'s Name', 'negative_review_count_trustpilot']

# Merge the counts DataFrames
merged_negative_review_counts = pd.merge(
    google_negative_review_counts,
    trustpilot_negative_review_counts,
    on='Club\'s Name',
    how='outer'
)

# Fill NaN values with 0 for counts
merged_negative_review_counts['negative_review_count_google'] = merged_negative_review_counts['negative_review_count_google'].fillna(0).astype(int)
merged_negative_review_counts['negative_review_count_trustpilot'] = merged_negative_review_counts['negative_review_count_trustpilot'].fillna(0).astype(int)

# Calculate the total number of reviews for each location
merged_negative_review_counts['total_negative_reviews'] = merged_negative_review_counts['negative_review_count_google'] + merged_negative_review_counts['negative_review_count_trustpilot']

# Sort the DataFrame in descending order based on the total number of reviews
sorted_negative_review_counts = merged_negative_review_counts.sort_values(by='total_negative_reviews', ascending=False)

# Display the final DataFrame
sorted_negative_review_counts

Worth renaming to top 30 neg reviews

In [None]:
top_30_locations = sorted_negative_review_counts.head(30)['Club\'s Name']
top_30_reviews = merged_common[merged_common['Club\'s Name'].isin(top_30_locations)]
top_30_reviews

Worth renaming combined_reviews_df to combined_reviews_top_30_neg

In [None]:
# Create DataFrames from each column
review_content_df = top_30_reviews[['Review Content']].copy() # This one is Trustpilot
comment_df = top_30_reviews[['Comment']].copy() # This one is Google

# Rename columns to facilitate concatenation
review_content_df.columns = ['Review']
comment_df.columns = ['Review']

# Concatenate the DataFrames
combined_reviews_df = pd.concat([review_content_df, comment_df], ignore_index=True)

# Convert the combined reviews into individual documents
documents_top_30_neg_locations = combined_reviews_df['Review'].apply(lambda x: ' '.join(x) if isinstance(x, list) else str(x)).tolist()

# Assuming documents_top_30_neg_locations is a list of strings
# Remove empty strings from the list
cleaned_documents_top_30_neg_locations = [doc for doc in documents_top_30_neg_locations if doc.strip() != '']

# Check the cleaned list
print(f"First few cleaned documents:\n{cleaned_documents_top_30_neg_locations[:10]}")
print(f"Total Cleaned Documents: {len(cleaned_documents_top_30_neg_locations)}")

# Optionally, you can convert the cleaned list back into a single text if needed
all_reviews_text_cleaned = ' '.join(cleaned_documents_top_30_neg_locations)

In [None]:
cleaned_documents_top_30_neg_locations

In [None]:
combined_reviews_df.head(20)

In [None]:
# Ensure no empty arrays or NaNs are present
combined_reviews_df = combined_reviews_df[combined_reviews_df['Review'].apply(lambda x: isinstance(x, list) and len(x) > 0)]

# Plot top words
plot_top_words(combined_reviews_df, column_name='Review', top_n=10, title='Top 30 highest negative location reviews')

In [None]:
plot_wordcloud(combined_reviews_df, 'Review', title='Word Cloud for top 30 negative location reviews')

In [None]:
documents

In [None]:
cleaned_documents_top_30_neg_locations

In [None]:
# Filter out empty or short documents
cleaned_documents_top_30_neg_locations = [
    doc for doc in cleaned_documents_top_30_neg_locations
    if doc.strip() != '' and len(doc.split()) > 1
]

# Check the cleaned list
print(f"First few cleaned documents:\n{cleaned_documents_top_30_neg_locations[:10]}")
print(f"Total Cleaned Documents: {len(cleaned_documents_top_30_neg_locations)}")

In [83]:
# Initialize the BERTopic model
model = BERTopic(embedding_model=SentenceTransformer("paraphrase-MiniLM-L6-v2"))

# Fit the model to your documents
topics, probabilities = model.fit_transform(cleaned_documents_top_30_neg_locations)

# Display topics
#model.visualize_topics()

In [84]:
# Print number of topics
print(f"Number of topics found: {len(set(topics))}")

Number of topics found: 6


In [None]:
# Print topics and their distribution
from collections import Counter

topic_counts = Counter(topics)
print(f"Topic distribution: {topic_counts}")

# Display topics
for i in set(topics):
    print(f"Topic {i}: {model.get_topic(i)}")

In [None]:
# Get and print the top words for each topic
for i in set(topics):
    print(f"Topic {i}:")
    print(model.get_topic(i))
    print()

In [None]:
model.get_topic_info()

In [None]:
# Get topic information
topic_info = model.get_topic_info()

# Take the first three rows of topic_info (the outlier topic -1, when present, followed by the most frequent topics)
top_topics = topic_info.head(3)['Topic'].tolist()

# Get top words for these topics
for topic in top_topics:
    top_words = model.get_topic(topic)
    print(f"Top words for Topic {topic}:")
    for word, weight in top_words:
        print(f"{word}: {weight}")
    print("\n")

In [None]:
# Extract topic-document mappings
topic_docs = model.get_representative_docs()

# Create a list of dictionaries to convert to a DataFrame
data = []
for topic, docs in topic_docs.items():
    for doc in docs:
        data.append({'Topic': topic, 'Document': doc, 'Probability': 1.0})  # Assuming probability is not available, use a placeholder

# Convert to DataFrame
topics_df = pd.DataFrame(data)

# Display the DataFrame to understand its structure
print(topics_df.head())

# For each topic, get the top 10 documents
top_docs_per_topic = {}
for topic in topics_df['Topic'].unique():
    topic_docs_df = topics_df[topics_df['Topic'] == topic]
    # Sort documents based on their probabilities
    sorted_docs = topic_docs_df.sort_values(by='Probability', ascending=False)
    top_docs_per_topic[topic] = sorted_docs.head(10)  # Get top 10 documents

# Print the top 10 documents for each topic
for topic, top_docs in top_docs_per_topic.items():
    print(f"Topic {topic}:")
    for _, row in top_docs.iterrows():
        print(f"Document: {row['Document']}\nProbability: {row['Probability']}\n")

In [None]:
model.visualize_barchart()

In [None]:
model.visualize_heatmap()

# 4. Conducting emotion analysis:

- Setting up a text-classification pipeline with the BERT model bhadresh-savani/bert-base-uncased-emotion
- Testing the model on a sample sentence and displaying the emotion classifications it outputs
- Capturing the top emotion for each review
- Visualising the top emotion distribution for all negative reviews in both data sets
- Extracting all the negative reviews (from both data sets) where anger is the top emotion
- Running BERTopic on the output of the previous step
- Visualising the clusters from this run to see whether the primary issues behind angry reviews can be narrowed down
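
The middle steps in the list above (top emotion per review, distribution, anger filter) can be sketched without loading the model. The scores below are invented values in the same shape the classifier returns for one input (a list of `{'label', 'score'}` dicts); `top_emotion` and `fake_scores` are illustrative names, not part of this notebook.

```python
from collections import Counter


def top_emotion(scores):
    """Return the label with the highest score, or None for an empty list."""
    if not scores:
        return None
    return max(scores, key=lambda item: item["score"])["label"]


# Hypothetical classifier outputs for three reviews (values invented for illustration)
fake_scores = {
    "review A": [{"label": "anger", "score": 0.81}, {"label": "sadness", "score": 0.12}],
    "review B": [{"label": "sadness", "score": 0.55}, {"label": "anger", "score": 0.40}],
    "review C": [{"label": "anger", "score": 0.67}, {"label": "fear", "score": 0.20}],
}

# Top emotion per review, the overall distribution, and the anger subset
top_by_review = {review: top_emotion(scores) for review, scores in fake_scores.items()}
emotion_distribution = Counter(top_by_review.values())
angry_reviews = [review for review, label in top_by_review.items() if label == "anger"]

print(emotion_distribution)
print(angry_reviews)
```

In the notebook itself the same idea is applied column-wise to the Google and Trustpilot DataFrames, and the anger subset is what gets passed to BERTopic.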

Note: the negative-review DataFrames used below have only been filtered to reviews with a score below 3.

In [92]:
# Convert to datetime
df_google_negative['Creation Date'] = pd.to_datetime(df_google_negative['Creation Date'], format='%m/%d/%y %H:%M')
df_trustpilot_negative['Review Created (UTC)'] = pd.to_datetime(df_trustpilot_negative['Review Created (UTC)'])

In [93]:
from datetime import timedelta

In [None]:
# Function to find overlaps within a given time window
def find_temporal_overlap(df1, df2, time_col1, time_col2, time_window=timedelta(minutes=10)):
    overlaps = []
    for idx, row in df1.iterrows():
        time1 = row[time_col1]
        mask = (df2[time_col2] >= time1 - time_window) & (df2[time_col2] <= time1 + time_window)
        overlapping_rows = df2[mask]
        for _, overlap_row in overlapping_rows.iterrows():
            overlaps.append((row, overlap_row))
    return overlaps

# Find overlapping rows
overlaps = find_temporal_overlap(df_google_negative, df_trustpilot_negative, 'Creation Date', 'Review Created (UTC)')

# Print overlapping pairs
for google_row, trustpilot_row in overlaps:
    print("Google Row:\n", google_row)
    print("Trustpilot Row:\n", trustpilot_row)
    print("\n---\n")

In [None]:
# Extract indices of overlapping rows
google_indices = [google_row.name for google_row, _ in overlaps]
trustpilot_indices = [trustpilot_row.name for _, trustpilot_row in overlaps]

# Create new DataFrames excluding overlapping rows
df_google_reduced = df_google_negative.drop(google_indices)
df_trustpilot_reduced = df_trustpilot_negative.drop(trustpilot_indices)

print("Reduced Google DataFrame:\n", df_google_reduced)
print("Reduced Trustpilot DataFrame:\n", df_trustpilot_reduced)

In [None]:
df_google_reduced

In [None]:
df_trustpilot_reduced

In [98]:
from transformers import pipeline

In [None]:
classifier = pipeline("text-classification", model='bhadresh-savani/bert-base-uncased-emotion', return_all_scores=True)

In [100]:
example_sentence = "I found NLP boring when studying the material but now putting things into practice via this topic project I found it's actually quite interesting and I'm mclovin it!"

In [101]:
emotion_labels = classifier(example_sentence)

In [102]:
emotion_labels_sorted = sorted(emotion_labels[0], key=lambda x: x["score"], reverse=True)

In [None]:
print(emotion_labels_sorted)

In [None]:
display(df_google.head(2))
display(df_trustpilot.head(2))

In [105]:
def get_top_emotion(text):
    results = classifier(text)
    # results is a list of dictionaries containing labels and scores
    if results:
        # Extracting the label with the highest score
        return max(results[0], key=lambda x: x['score'])['label']
    return None

In [None]:
# Apply the classifier to each review
#df_google['Top_Emotion'] = df_google["Comment"].apply(get_top_emotion)
#df_trustpilot['Top_Emotion'] = df_trustpilot["Review Content"].apply(get_top_emotion)

# Apply the classifier to each review
#df_google_negative['Top_Emotion'] = df_google_negative["Comment"].apply(get_top_emotion)
#df_trustpilot_negative['Top_Emotion'] = df_trustpilot_negative["Review Content"].apply(get_top_emotion)

In [None]:
df_google_negative

In [None]:
df_trustpilot_negative

In [None]:
# Apply the classifier to each review
df_google_reduced['Top_Emotion'] = df_google_reduced["Comment"].apply(get_top_emotion)
df_trustpilot_reduced['Top_Emotion'] = df_trustpilot_reduced["Review Content"].apply(get_top_emotion)

Use batch processing next time
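
One way to batch the classification, sketched under assumptions: the helper below calls a classifier once per chunk of reviews instead of once per row. `classify_in_batches` and `stub_classifier` are hypothetical names; the stub only mimics the nested-list output shape expected from the `classifier` above when it is given a list of strings, so the sketch runs without downloading the model.

```python
def classify_in_batches(texts, classifier, batch_size=32):
    """Return the top label for each text, calling the classifier once per batch."""
    top_labels = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # One call per batch; each result is a list of {'label', 'score'} dicts
        for scores in classifier(batch):
            top_labels.append(max(scores, key=lambda item: item["score"])["label"])
    return top_labels


def stub_classifier(batch):
    """Stand-in for the emotion pipeline: scores texts mentioning 'refund' as anger."""
    results = []
    for text in batch:
        anger = 0.9 if "refund" in text.lower() else 0.1
        results.append([
            {"label": "anger", "score": anger},
            {"label": "joy", "score": 1.0 - anger},
        ])
    return results


sample_reviews = ["Still waiting for my refund", "Great gym", "No refund after cancelling"]
batched_labels = classify_in_batches(sample_reviews, stub_classifier, batch_size=2)
print(batched_labels)
```

With the real model, the equivalent call would pass the review column as a list of plain strings together with `classifier`; whether this is faster in practice depends on the hardware and on review length.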

In [109]:
# Save the df_google DataFrame to a CSV file
df_google_reduced.to_csv('/content/df_google_with_emotions.csv', index=False)

# Save the df_trustpilot DataFrame to a CSV file
df_trustpilot_reduced.to_csv('/content/df_trustpilot_with_emotions.csv', index=False)

In [110]:
from google.colab import files

# Download the df_google CSV file
files.download('/content/df_google_with_emotions.csv')

# Download the df_trustpilot CSV file
files.download('/content/df_trustpilot_with_emotions.csv')

## Continuing 4.

In [113]:
# Load the df_google DataFrame from the CSV file
df_google_with_emotions = pd.read_csv('/content/df_google_with_emotions.csv')

# Load the df_trustpilot DataFrame from the CSV file
df_trustpilot_with_emotions = pd.read_csv('/content/df_trustpilot_with_emotions.csv')

In [111]:
import ast

In [114]:
# Convert string representations of lists back to actual lists
df_google_with_emotions['Comment'] = df_google_with_emotions['Comment'].apply(lambda x: ast.literal_eval(x) if pd.notna(x) else [])
df_trustpilot_with_emotions['Review Content'] = df_trustpilot_with_emotions['Review Content'].apply(lambda x: ast.literal_eval(x) if pd.notna(x) else [])

In [None]:
df_google_with_emotions

In [None]:
df_trustpilot_with_emotions

In [117]:
# Step 1: Add a 'Source' column to each DataFrame
df_google_with_emotions['Source'] = 'Google Reviews'
df_trustpilot_with_emotions['Source'] = 'Trustpilot'

# Step 2: Combine the DataFrames
df_combined_neg = pd.concat([
    df_google_with_emotions[['Top_Emotion', 'Source']],
    df_trustpilot_with_emotions[['Top_Emotion', 'Source']]
])

In [None]:
# Step 3: Plot with Seaborn
plt.figure(figsize=(12, 6))
sns.countplot(data=df_combined_neg, x='Top_Emotion', hue='Source', palette='viridis')
plt.title('Top Emotion Distribution for Negative Reviews')
plt.xlabel('Top Emotion')
plt.ylabel('Count')
plt.legend(title='Source')
plt.show()

In [None]:
# Plot for Google Reviews
plt.figure(figsize=(14, 6))
sns.countplot(data=df_google_with_emotions, x='Top_Emotion', palette='viridis')
plt.title('Top Emotion Distribution for Negative Google Reviews')
plt.xlabel('Top Emotion')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

# Plot for Trustpilot Reviews
plt.figure(figsize=(14, 6))
sns.countplot(data=df_trustpilot_with_emotions, x='Top_Emotion', palette='viridis')
plt.title('Top Emotion Distribution for Negative Trustpilot Reviews')
plt.xlabel('Top Emotion')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

In [None]:
# Create a FacetGrid for side-by-side plots
g = sns.FacetGrid(df_combined_neg, col='Source', col_wrap=2, height=6, aspect=1.5, sharey=False)

# Map the countplot onto the grid
g.map(sns.countplot, 'Top_Emotion', palette='viridis')

# Adjust the titles and labels
g.set_titles(col_template="{col_name}")
g.set_axis_labels('Top Emotion', 'Count')
g.set_xticklabels(rotation=45)

# Add a main title
plt.subplots_adjust(top=0.9)
g.fig.suptitle('Top Emotion Distribution for Negative Reviews by Source', fontsize=16)

# Show the plot
plt.show()

In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Create a subplot with 2 columns
fig = make_subplots(rows=1, cols=2, subplot_titles=('Google Reviews', 'Trustpilot Reviews'))

# Create bar plots for Google Reviews
google_emotion_counts = df_google_with_emotions['Top_Emotion'].value_counts().reset_index()
google_emotion_counts.columns = ['Top_Emotion', 'Count']
fig.add_trace(
    go.Bar(x=google_emotion_counts['Top_Emotion'], y=google_emotion_counts['Count'], name='Google Reviews', marker_color='blue'),
    row=1, col=1
)

# Create bar plots for Trustpilot Reviews
trustpilot_emotion_counts = df_trustpilot_with_emotions['Top_Emotion'].value_counts().reset_index()
trustpilot_emotion_counts.columns = ['Top_Emotion', 'Count']
fig.add_trace(
    go.Bar(x=trustpilot_emotion_counts['Top_Emotion'], y=trustpilot_emotion_counts['Count'], name='Trustpilot Reviews', marker_color='orange'),
    row=1, col=2
)

# Update layout for the subplots
fig.update_layout(
    title_text='Top Emotion Distribution for Negative Reviews by Source',
    xaxis_title='Top Emotion',
    yaxis_title='Count',
    xaxis2_title='Top Emotion',
    yaxis2_title='Count',
    showlegend=False
)

# Show the figure
fig.show()

In [None]:
df_google_anger = df_google_with_emotions[df_google_with_emotions['Top_Emotion'] == 'anger']

df_trustpilot_anger = df_trustpilot_with_emotions[df_trustpilot_with_emotions['Top_Emotion'] == 'anger']

df_trustpilot_anger.head(5)

In [None]:
df_trustpilot_anger['Review Language'].unique()

In [124]:
# Define the list of languages to drop
languages_to_drop = ['pl', 'da', 'de', 'ro']

# Remove rows where 'Review Language' is in the list and update df_trustpilot_anger
df_trustpilot_anger = df_trustpilot_anger[~df_trustpilot_anger['Review Language'].isin(languages_to_drop)]

In [None]:
df_google_cols = ["Club's Name", "Creation Date", "Overall Score", "Comment"]
df_trustpilot_cols = ["Review Created (UTC)", "Review Title", "Review Content", "Review Stars", "Company Reply Date (UTC)", "Location Name"]
rename_dict = {"Location Name": "Club's Name"}

merged_common_anger = concatenate_dataframes(df_google_anger, df_trustpilot_anger, df_google_cols, df_trustpilot_cols, rename_dict)

merged_common_anger.head(8)

In [126]:
# Drop rows where 'Club\'s Name' is '209 - Slagelse, Jernbanegade'
merged_common_anger = merged_common_anger[merged_common_anger["Club's Name"] != "209 - Slagelse, Jernbanegade"]

In [None]:
merged_common_anger["Club's Name"].unique()

In [None]:
merged_common_anger.head(8)

In [None]:
documents_anger = []

# Fill NaN values with empty lists
merged_common_anger['Comment'] = merged_common_anger['Comment'].apply(lambda x: x if isinstance(x, list) else [])
merged_common_anger['Review Content'] = merged_common_anger['Review Content'].apply(lambda x: x if isinstance(x, list) else [])

# Combine the tokens from both columns row-wise into single documents
documents_anger = [
    " ".join((comment if isinstance(comment, list) else []) +
             (review_content if isinstance(review_content, list) else []))
    for comment, review_content in zip(merged_common_anger['Comment'], merged_common_anger['Review Content'])
]

# Check the first few documents
print(f"First few documents:\n{documents_anger[:10]}")

# Check the total number of documents
print(f"Total Documents: {len(documents_anger)}")

In [None]:
# Check the data type of each element in 'Comment' column
print(merged_common_anger['Comment'].apply(type).value_counts())

# Check the data type of each element in 'Review Content' column
print(merged_common_anger['Review Content'].apply(type).value_counts())

In [None]:
documents_anger

In [None]:
check_invalid_entries(documents_anger)

In [None]:
# Check whether the existing BERTopic model has been fitted
# (BERTopic leaves `topics_` as None until fit/fit_transform has been called)
is_fitted = getattr(model, 'topics_', None) is not None
print(is_fitted)
print(f"Number of topics: {len(model.get_topic_info()) if is_fitted else 'N/A'}")

In [134]:
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# Initialize the BERTopic model
model = BERTopic(embedding_model=SentenceTransformer("paraphrase-MiniLM-L6-v2"))

# Fit the model to your documents
topics, probabilities = model.fit_transform(documents_anger)

In [None]:
# Check if the model has topics and other attributes
print(f"Topics: {model.get_topic_info()}")
print(f"Number of topics: {len(model.get_topic_info())}")

In [None]:
# Check if any document is empty or has unexpected content
print(f"Number of documents: {len(documents_anger)}")
print(f"Number of empty documents: {sum(not doc.strip() for doc in documents_anger)}")

print(f"Topics found: {model.get_topic_info()}")
print(f"Number of topics: {len(model.get_topic_info())}")

# Check for valid topic assignments
print(f"Topic assignments: {set(topics)}")

In [None]:
# Display topics
model.visualize_topics()

In [None]:
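# display() is needed here because only the last expression in a cell is shown automatically
display(model.get_topic_info())
model.visualize_barchart()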

# 5. Using a large language model from Hugging Face

- Loading tiiuae/falcon-7b-instruct and using a pipeline for text generation, with a max length of 1,000 for each review for efficiency.

- Each review is prefixed with this prompt before being passed to the model:

**In the following customer review, pick out the main 3 topics. Return them in a numbered list format, with each one on a new line.**

The output should be the top 3 topics from each review.
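
A minimal sketch of the prompt-then-parse idea, assuming the model answers with lines such as `1. ...`. `build_prompt`, `parse_numbered_topics`, and the example response are hypothetical and only illustrate the format; they are not the functions used later in this notebook.

```python
import re

PROMPT = (
    "In the following customer review, pick out the main 3 topics. "
    "Return them in a numbered list format, with each one on a new line."
)


def build_prompt(review_text):
    """Put the instruction first, then the review in its own paragraph."""
    return f"{PROMPT}\n\n{review_text.strip()}"


def parse_numbered_topics(generated_text, limit=3):
    """Extract up to `limit` items from lines that look like '1. topic'."""
    topics = []
    for line in generated_text.splitlines():
        match = re.match(r"^\s*\d+[.)]\s*(.+)$", line)
        if match:
            topics.append(match.group(1).strip())
    return topics[:limit]


# Invented model response, for illustration only
example_response = "1. Broken equipment\n2. Rude staff\n3. Overcrowding at peak times"
print(build_prompt("The showers were cold and the machines were broken."))
print(parse_numbered_topics(example_response))
```

Matching only lines that begin with a number means an echoed prompt or review text is skipped, which matters because text-generation pipelines commonly return the input along with the completion.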

In [None]:
!pip install accelerate
!pip install -U safetensors

In [14]:
from transformers import AutoTokenizer
import transformers
import torch
import accelerate

In [15]:
from transformers import pipeline, AutoTokenizer

In [None]:
# Load the model and set up the text generation pipeline
#model_name = "distilgpt2"
model_name = "tiiuae/falcon-7b-instruct"
text_generator = pipeline("text-generation", model=model_name, max_length=1000, truncation=True) #max_length=1000, max_new_tokens=1000
#tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
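model_name = "tiiuae/falcon-7b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# Note: this rebinds the name `pipeline` (imported from transformers above) to the
# Falcon text-generation pipeline; later cells call `pipeline(...)` to generate text
pipeline = transformers.pipeline(
    "text-generation",
    model=model_name,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)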

In [None]:
df_trustpilot_unprocessed

In [None]:
df_google_unprocessed

In [None]:
# Convert 'Review Stars' directly to integer
df_trustpilot_unprocessed['Review Stars'] = df_trustpilot_unprocessed['Review Stars'].astype(int)

# Convert 'Overall Score' directly to integer
df_google_unprocessed['Overall Score'] = df_google_unprocessed['Overall Score'].astype(int)

In [None]:
df_google_negative_unprocessed = df_google_unprocessed[df_google_unprocessed['Overall Score'] < 3]
df_trustpilot_negative_unprocessed = df_trustpilot_unprocessed[df_trustpilot_unprocessed['Review Stars'] < 3]

In [None]:
df_google_negative_unprocessed.info()
print("\n")
df_trustpilot_negative_unprocessed.info()

In [None]:
df_subset_google_negative = df_google_negative_unprocessed.sample(frac=0.1, random_state=1)

df_subset_trustpilot_negative = df_trustpilot_negative_unprocessed.sample(frac=0.1, random_state=1)
df_subset_trustpilot_negative.head(5)

In [None]:
# Uncomment to run on only the first 3 reviews for a quick test
#df_subset_trustpilot_negative = df_subset_trustpilot_negative[:3]

In [None]:
df_subset_google_negative.head(5)

In [None]:
# The trailing blank line keeps the prompt separate from the review text appended to it
prefix = "In the following customer review, pick out the main 3 topics. Return them in a numbered list format, with each one on a new line.\n\n"

Batch processing caused issues, so the reviews are processed one at a time instead.

In [None]:
# Function to generate topics for each review
def generate_topics_for_df(df, review_column, prefix="", max_length=1000):
    # Initialize a list to store the results
    results = []

    # Loop through each review in the DataFrame
    for idx, review_text in enumerate(df[review_column]):
        # Add the prefix to the review text
        input_text = prefix + review_text

        # Generate the text using the model
        sequences = pipeline(
            input_text,
            max_length=max_length,
            do_sample=True,
            top_k=10,
            temperature=0.5,
            num_return_sequences=1,
            eos_token_id=tokenizer.eos_token_id,
        )

        # Extract the generated text
        generated_text = sequences[0]['generated_text']
        results.append(generated_text)

        print(f"Processed review {idx + 1}/{len(df)}: {generated_text}")

    # Add the results as a new column to the DataFrame
    df['Generated Topics'] = results
    return df

The function below exposes several parameters because the final output format is still undecided. The simpler version above also works; update it once the format is decided.

In [None]:
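# store_entire_text defaults to True so the 'Generated Topics' column used in later cells is always created
def generate_topics_for_df(df, review_column, prefix="", max_length=200, clean_topics=False, store_individually=False, store_as_list=False, store_entire_text=True):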
    # Initialize lists to store results
    results = []
    individual_topics = []
    topics_as_list = []

    # Loop through each review in the DataFrame
    for idx, review_text in enumerate(df[review_column]):
        # Add the prefix to the review text
        input_text = prefix + review_text

        # Generate the text using the model
        sequences = pipeline(
            input_text,
            max_length=max_length,
            do_sample=True,
            top_k=10,
            temperature=0.5,
            num_return_sequences=1,
            eos_token_id=tokenizer.eos_token_id,
        )

        # Extract the generated text
        generated_text = sequences[0]['generated_text']

        if clean_topics:
            # Clean the generated topics
            topic_splits = generated_text.splitlines()
            if len(topic_splits) > 1:
                topic_splits.pop(0)  # Remove the first line if it's not useful

            cleaned_topics = []
            for topic_split in topic_splits:
                # Remove list numbering, hyphens, and extra spaces
                parts = topic_split.split('.', 1)
                if len(parts) > 1:
                    cleaned_topic = parts[1].strip()
                else:
                    cleaned_topic = topic_split.strip()

                # Additional cleaning
                cleaned_topic = cleaned_topic.replace('-', '').strip()

                if cleaned_topic:
                    cleaned_topics.append(cleaned_topic)

            if store_individually:
                # Store topics in separate columns
                individual_topics.append(cleaned_topics)
            if store_as_list:
                # Store topics as a list
                topics_as_list.append(cleaned_topics)
            if store_entire_text:
                # Store entire cleaned/generated text if requested
                results.append("\n".join(cleaned_topics))
        else:
            # Handle uncleaned text
            topic_splits = generated_text.splitlines()
            cleaned_topics = [line.strip() for line in topic_splits if line.strip()]

            if store_individually:
                # Store topics in separate columns
                individual_topics.append(cleaned_topics)
            if store_as_list:
                # Store topics as a list
                topics_as_list.append(cleaned_topics)
            if store_entire_text:
                # Store entire generated text if requested
                results.append(generated_text)

        print(f"Processed review {idx + 1}/{len(df)}: {generated_text}")

    # Add the results to the DataFrame
    if store_individually:
        # Create columns for each topic
        max_topics = max(len(topics) for topics in individual_topics)
        for i in range(max_topics):
            df[f'Topic {i + 1}'] = [topics[i] if i < len(topics) else '' for topics in individual_topics]

    if store_as_list:
        df['Topics List'] = topics_as_list

    if store_entire_text:
        df['Generated Topics'] = results

    return df

In [None]:
# Apply the function to your DataFrame
df_subset_google_negative = generate_topics_for_df(df_subset_google_negative, review_column='Comment', prefix=prefix)

# Display the updated DataFrame
print(df_subset_google_negative[['Comment', 'Generated Topics']])

In [None]:
# Apply the function to your DataFrame
df_subset_trustpilot_negative = generate_topics_for_df(df_subset_trustpilot_negative, review_column='Review Content', prefix=prefix)

# Display the updated DataFrame
print(df_subset_trustpilot_negative[['Review Content', 'Generated Topics']])

In [None]:
# Save the DataFrame to a CSV file
df_subset_trustpilot_negative.to_csv('df_subset_trustpilot_negative.csv', index=False)



"""

# Convert column to a NumPy array and save
np.save('np_topics_llm.npy', df_subset_trustpilot_negative['Generated Topics'].to_numpy())


"""

In [None]:
# Load the DataFrame from the CSV file
df_loaded = pd.read_csv('df_subset_trustpilot_negative.csv')

"""
# Load the NumPy array and convert back to list
loaded_arr = np.load('np_topics_llm.npy', allow_pickle=True)  # allow_pickle=True is needed for objects like lists
topics_llm = loaded_arr.tolist()
"""

In [None]:
from google.colab import files

# Download the CSV file
files.download('df_subset_trustpilot_negative.csv')

In [None]:
# Load the DataFrame from the uploaded file
df_subset_trustpilot_negative = pd.read_csv('df_subset_trustpilot_negative.csv')

In [None]:
df_subset_trustpilot_negative['Generated Topics']

In [None]:
print(df_subset_trustpilot_negative.loc[349, 'Generated Topics'])

In [None]:
def extract_and_clean_topics(df, column_name):
    all_topics = []

    for idx, row in df.iterrows():
        generated_text = row[column_name]

        if pd.notna(generated_text):
            # Split the text by lines
            lines = generated_text.splitlines()

            # Identify where the topics start (None means no numbered list was found)
            topic_start_index = None
            for i, line in enumerate(lines):
                if line.strip().startswith("1."):
                    topic_start_index = i
                    break

            # Extract topics
            if topic_start_index is not None:
                for line in lines[topic_start_index:]:
                    # Remove list numbering and extra spaces
                    if line.strip():
                        # Remove list numbers, clean up extra spaces and other characters
                        parts = line.split('.', 1)
                        if len(parts) > 1:
                            cleaned_topic = parts[1].strip()
                        else:
                            cleaned_topic = line.strip()

                        # Additional cleaning for any unexpected formatting
                        cleaned_topic = cleaned_topic.replace('-', '').strip()

                        if cleaned_topic:
                            all_topics.append(cleaned_topic)

    return all_topics

# Extract topics from DataFrame
topics_llm = extract_and_clean_topics(df_subset_trustpilot_negative, 'Generated Topics')

# Create a comprehensive list of topics
topic_string = ', '.join(topics_llm)

# Display the first 2000 characters (or adjust as needed)
print(topic_string[:2000])  # Display a snippet if the list is very long

# Optional: Save the topics to a file for later use
with open('topics_list.txt', 'w') as file:
    file.write(topic_string)


### Redoing BERTopic with output from the previous step

In [None]:
# Initialize the BERTopic model
model = BERTopic(embedding_model=SentenceTransformer("paraphrase-MiniLM-L6-v2"))

# Fit the model to the LLM-generated topics from the previous step
topics, probabilities = model.fit_transform(topics_llm)

# Display topics
model.visualize_topics()

In [None]:
model.get_topic_info()

There is a sizeable number of outliers (topic -1), which should typically be ignored. Let's take a look at the most frequent topic that was generated, topic 0.
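
As a small illustration of setting the outliers aside, the sketch below counts topic assignments and ranks the remaining topics. The `example_topics` list is invented; it only mirrors the form of the `topics` list returned by `fit_transform`.

```python
from collections import Counter

# Invented topic assignments, one label per document (-1 marks an outlier)
example_topics = [-1, 0, 0, 1, -1, 2, 0, 1, -1]

counts = Counter(example_topics)
outlier_share = counts[-1] / len(example_topics)

# Drop the outlier label, then rank the remaining topics by frequency
ranked_topics = [(topic, n) for topic, n in counts.most_common() if topic != -1]

print(f"Outlier share: {outlier_share:.0%}")
print(ranked_topics)
```

Checking the outlier share first is a quick way to judge how much of the data the interpretable topics actually cover.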

In [None]:
model.get_topic(0)

In [None]:
model.get_topic(1)

In [None]:
# Extract topic-document mappings
topic_docs = model.get_representative_docs()

# Create a list of dictionaries to convert to a DataFrame
data = []
for topic, docs in topic_docs.items():
    for doc in docs:
        data.append({'Topic': topic, 'Document': doc, 'Probability': 1.0})  # Assuming probability is not available, use a placeholder

# Convert to DataFrame
topics_df = pd.DataFrame(data)

# Display the DataFrame to understand its structure
print(topics_df.head())

# For each topic, get the top 10 documents
top_docs_per_topic = {}
for topic in topics_df['Topic'].unique():
    topic_docs_df = topics_df[topics_df['Topic'] == topic]
    # Sort documents based on their probabilities
    sorted_docs = topic_docs_df.sort_values(by='Probability', ascending=False)
    top_docs_per_topic[topic] = sorted_docs.head(10)  # Get top 10 documents

# Print the top 10 documents for each topic
for topic, top_docs in top_docs_per_topic.items():
    print(f"Topic {topic}:")
    for _, row in top_docs.iterrows():
        print(f"Document: {row['Document']}\nProbability: {row['Probability']}\n")

In [None]:
# Get topic information
topic_info = model.get_topic_info()

# Take the first three rows of topic_info (the outlier topic -1, when present, followed by the most frequent topics)
top_topics = topic_info.head(3)['Topic'].tolist()

# Get top words for these topics
for topic in top_topics:
    top_words = model.get_topic(topic)
    print(f"Top words for Topic {topic}:")
    for word, weight in top_words:
        print(f"{word}: {weight}")
    print("\n")

In [None]:
# Check the structure of topic_docs
print(topic_docs)

In [None]:
model.visualize_barchart()

In [None]:
model.visualize_heatmap()

# Hide next step for now

Using the topic list from the previous step, running the Falcon-7b-Instruct model again.

This time using this prefix as the prompt for the topic list: **For the following text topics obtained from negative customer reviews, can you give some actionable insights that would help this gym company?**

Listing the output in the form of suggestions that the company can employ to address customer concerns.
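
Because the full topic list can be long, it may not fit in a single prompt. The sketch below splits a comma-separated topic string into smaller chunks; it is an assumption-laden illustration, not the approach used in this notebook. `chunk_topics` is a hypothetical helper, and word counts are used as a rough stand-in for token counts, which would really come from the tokenizer.

```python
def chunk_topics(topic_string, max_words=50):
    """Split a comma-separated topic string into chunks of at most `max_words` words."""
    chunks, current, current_words = [], [], 0
    for topic in (t.strip() for t in topic_string.split(",")):
        if not topic:
            continue
        n_words = len(topic.split())
        # Start a new chunk when adding this topic would exceed the budget
        if current and current_words + n_words > max_words:
            chunks.append(", ".join(current))
            current, current_words = [], 0
        current.append(topic)
        current_words += n_words
    if current:
        chunks.append(", ".join(current))
    return chunks


example_topic_string = "Broken equipment, Rude staff, Overcrowding, Dirty changing rooms, Billing errors"
for chunk in chunk_topics(example_topic_string, max_words=5):
    print(chunk)
```

Each chunk could then be sent to the model with the same prompt, and the resulting suggestions combined afterwards.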

In [None]:
# Path to the file
file_path = 'topics_list.txt'

# Read the contents of the file
try:
    with open(file_path, 'r') as file:
        topic_string = file.read()
        print("Data loaded from file successfully.")
except FileNotFoundError:
    print(f"Error: The file {file_path} does not exist.")
except IOError as e:
    print(f"Error reading the file: {e}")

# Display the contents or process them as needed
print(topic_string[:2000])  # Display a snippet if the data is very long

In [None]:
topic_string

In [None]:
BERT_prefix = "For the following text topics obtained from negative customer reviews, can you give some actionable insights that would help this gym company?"

Apply Falcon again with the topic list string. Make sure this function is correct before running the code below.

In [None]:
# Tokenize the input text
tokens = tokenizer(topic_string, return_tensors='pt')

# Count the number of tokens
num_tokens = tokens['input_ids'].size(1)
print(f"Number of tokens: {num_tokens}")


In [None]:
# Define the function to generate actionable insights
def generate_actionable_insights(topic_string, prefix="", max_length=6500): #max_new_tokens=500
    # Combine prefix with topic string
    input_text = f"{prefix}\n\n{topic_string}"

    # Generate actionable insights using the Falcon model
    try:
        # Generate text with Falcon
        sequences = pipeline(
            input_text,
            #max_new_tokens=max_new_tokens,
            max_length=max_length,
            do_sample=True,
            top_k=10,
            temperature=0.7,
            num_return_sequences=1,
            eos_token_id=tokenizer.eos_token_id,
        )

        # Extract the generated text
        generated_text = sequences[0]['generated_text']

        # Optionally: Process generated text to improve readability
        actionable_insights = [line.strip() for line in generated_text.split('\n') if line.strip()]

        return actionable_insights

    except Exception as e:
        print(f"Error generating insights: {e}")
        return ["Error generating insights"]

In [None]:
# Call the function with the provided prefix and topic string
actionable_insights = generate_actionable_insights(topic_string, BERT_prefix)
print("Actionable Insights:")
for insight in actionable_insights:
    print(insight)

In [None]:
from transformers import AutoModelForCausalLM

In [None]:
model = AutoModelForCausalLM.from_pretrained(model_name)

In [None]:
# Define the text generation function
def generate_actionable_insights(topic_string, prefix="", max_new_tokens=300, max_length=2048):
    input_text = f"{prefix}\n\n{topic_string}"

    # Tokenize the input
    inputs = tokenizer(input_text, return_tensors='pt', truncation=True, max_length=max_length)

    # Generate text
    try:
        outputs = model.generate(
            input_ids=inputs['input_ids'],
            attention_mask=inputs['attention_mask'],
            max_new_tokens=max_new_tokens,
            do_sample=True,
            top_k=10,
            temperature=0.7
        )

        # Decode and process the generated text
        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        actionable_insights = [line.strip() for line in generated_text.split('\n') if line.strip()]

        return actionable_insights

    except RuntimeError as e:
        if 'CUDA out of memory' in str(e):
            print("CUDA out of memory. Consider reducing the number of new tokens.")
        else:
            print(f"Error generating insights: {e}")
        return ["Error generating insights"]

In [None]:
# Call the function with the provided prefix and topic string
actionable_insights = generate_actionable_insights(topic_string, BERT_prefix)
print("Actionable Insights:")
for insight in actionable_insights:
    print(insight)