# **Identifying Challenges Faced by Developers in Scientific Workflow Management Systems using BERTopic**

This notebook applies BERTopic to Stack Overflow posts and GitHub issues to identify the challenges developers face when working with scientific workflow management systems. The dataset is available at https://figshare.com/projects/SWsChallengesbySOandGitHub/172476.

## Installing Dependencies

In [None]:
%pip install -r requirements.txt

## Define functions to preprocess data

In [None]:
import pandas as pd
import re
import nltk
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from bertopic import BERTopic

# Download the NLTK stopword list and WordNet data (skipped if already present)
nltk.download('stopwords')
nltk.download('wordnet')

# Custom stopwords list (add domain-specific stopwords if needed)
custom_stopwords = set()

def clean_text(text):
    # Remove HTML tags using BeautifulSoup
    soup = BeautifulSoup(text, "html.parser")
    text = soup.get_text()

    # Lowercase, then replace everything except letters and whitespace (digits included) with a space
    text = re.sub(r'[^A-Za-z\s]', ' ', text.lower())

    # Remove extra whitespaces
    text = re.sub(r'\s+', ' ', text).strip()

    return text

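# Build the stopword set and the lemmatizer once rather than on every call
STOP_WORDS = set(stopwords.words('english')) | custom_stopwords
LEMMATIZER = WordNetLemmatizer()

def remove_stopwords_and_lemmatize(text):
    # Drop stopwords, then lemmatize the remaining tokens
    return " ".join(LEMMATIZER.lemmatize(word) for word in text.split() if word not in STOP_WORDS)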

def train_bertopic(data):
    model = BERTopic(language="english", calculate_probabilities=True)
    topics, probabilities = model.fit_transform(data)
    return model, topics, probabilities
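
To see what the two regular-expression steps in `clean_text` do, the sketch below applies them to a made-up issue title. It leaves out the BeautifulSoup step so that it runs with the standard library only; `normalize` and the sample string are illustrative and not part of the notebook.

```python
import re

def normalize(text):
    """Apply the same two regex steps as clean_text (without HTML stripping)."""
    # Lowercase, then replace every character that is not a letter or whitespace
    text = re.sub(r'[^A-Za-z\s]', ' ', text.lower())
    # Collapse runs of whitespace into single spaces
    return re.sub(r'\s+', ' ', text).strip()

sample = "Snakemake 7.3 fails: rule 'align' not found!"
print(normalize(sample))  # snakemake fails rule align not found
```

Digits are removed along with punctuation, so version numbers such as `7.3` do not survive preprocessing.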

## Import and Preprocess Data

In [None]:
# Import the data
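try:
    new_df = pd.read_csv('Dataset/StackOverflowPostsDataset.csv')
except FileNotFoundError as err:
    # Raise instead of calling exit(), which would shut down the notebook kernel
    raise FileNotFoundError("Dataset file not found. Please provide the correct file path.") from err

# Merge body, title, and tags into one text field; missing values become empty strings
new_df["merged"] = new_df[["Body", "Title", "Tags"]].fillna("").astype(str).apply("-".join, axis=1)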

# Preprocess the data
new_df["merged"] = new_df["merged"].apply(clean_text)
new_df["processed"] = new_df["merged"].apply(remove_stopwords_and_lemmatize)

# Save the preprocessed data (create the output directory if it does not exist)
import os
os.makedirs('Results', exist_ok=True)
new_df.to_csv('Results/ConcatenatedDatasetSO.csv', index=False)

In [None]:
# View first 5 rows of the processed data
print(new_df.head()["processed"])

## Building and training the model

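Instantiate the model and train it on the data. BERTopic embeds the documents, reduces the dimensionality of the embeddings, and clusters them. The number of topics follows from the clustering and is not chosen by a coherence score, so it does not have to be set in advance.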

In [None]:
# Train BERTopic on processed data
data = new_df["processed"].values.tolist()
model, topics, probabilities = train_bertopic(data)

# Get the number of documents assigned to each topic
topics_df = model.get_topic_freq()
print(topics_df.head())

# Save the BERTopic model
model.save("Results/BERTopicModelSO")

## Extracting Topics
After fitting the model, we can extract the topics. `get_topic_info()` returns one row per topic with its ID, the number of documents assigned to it, and a name built from its top words.

In [None]:
freq = model.get_topic_info()
freq.head(5)

Topic -1 collects the outliers, that is, the documents BERTopic could not assign to any topic. Next, look at the most frequent topics and their top words to determine what each topic is about.

In [None]:
model.get_topic(0)  # Top words of topic 0, the most frequent topic after the outliers

## Assess trained model

In [None]:
# assess predicted topics for first 10 posts
model.topics_[:10]
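
Beyond inspecting the first few assignments, it can help to check how many topics were found and how large the outlier topic is. The sketch below does this for a plain list of topic IDs; `summarize_assignments` is a hypothetical helper and `example_topics` is a made-up stand-in for `model.topics_`.

```python
from collections import Counter

def summarize_assignments(topics):
    """Return (number of non-outlier topics, share of documents labelled -1)."""
    counts = Counter(topics)
    # Every distinct ID except -1 counts as a topic
    n_topics = len([t for t in counts if t != -1])
    # Fraction of documents that ended up in the outlier topic
    outlier_share = counts.get(-1, 0) / len(topics) if topics else 0.0
    return n_topics, outlier_share

example_topics = [0, 1, -1, 0, 2, -1, 0, 1]  # stand-in for model.topics_
n_topics, outlier_share = summarize_assignments(example_topics)
print(n_topics, outlier_share)  # 3 0.25
```

In the notebook the same call would be `summarize_assignments(model.topics_)`. A large outlier share suggests that many posts were left unassigned.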

## Visualize Topics
With the model trained, we can visualize the generated topics in an intertopic distance map, much like LDAvis.

In [None]:
model.visualize_topics()


In [None]:
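# visualize the topic probability distribution of the post at index 70; topics below min_probability are hidden
model.visualize_distribution(probabilities[70], min_probability=0.015)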
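
The effect of `min_probability` can be sketched without the plotting code: only topics whose probability for the chosen post reaches the threshold are kept. The helper `topics_above` and the probability values are made up for illustration, and the sketch assumes that position `i` in a row of `probabilities` corresponds to topic `i`.

```python
def topics_above(probabilities, min_probability=0.015):
    """Return (topic_id, probability) pairs at or above the threshold, highest first."""
    kept = [(i, p) for i, p in enumerate(probabilities) if p >= min_probability]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

doc_probs = [0.010, 0.620, 0.005, 0.200, 0.015]  # made-up row of `probabilities`
print(topics_above(doc_probs))  # [(1, 0.62), (3, 0.2), (4, 0.015)]
```

Raising the threshold shortens the chart to the few topics that dominate a post.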

In [None]:
# visualize topic hierarchy
model.visualize_hierarchy(top_n_topics=30)

In [None]:
# visualize terms
model.visualize_barchart(top_n_topics=8)

In [None]:
# visualize topic similarity
model.visualize_heatmap(n_clusters=20, width=1000, height=1000)

In [None]:
# visualize term score decline
model.visualize_term_rank()

## Refine Model

Our model has identified over 70 different topics, which is too many to interpret comfortably. We can refine it without retraining from scratch: first by updating the topic representations, then by reducing the number of topics.

In [None]:
# update topics to include bigrams and trigrams
model.update_topics(data, n_gram_range=(1, 3))
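
With `n_gram_range=(1, 3)`, topic representations may contain two- and three-word phrases in addition to single words. The sketch below lists the candidate terms for one short text. It is a simplification: BERTopic delegates the extraction to a vectorizer whose tokenization differs in detail, and `word_ngrams` is a hypothetical helper.

```python
def word_ngrams(text, n_gram_range=(1, 3)):
    """List all word n-grams of `text` whose length falls in n_gram_range (inclusive)."""
    tokens = text.split()
    low, high = n_gram_range
    grams = []
    for n in range(low, high + 1):
        # Slide a window of n tokens across the text
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

print(word_ngrams("workflow engine crash"))
# ['workflow', 'engine', 'crash', 'workflow engine', 'engine crash', 'workflow engine crash']
```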

In [None]:
# get top words and their c-TF-IDF scores
model.get_topic(0)
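
The scores returned by `get_topic` come from class-based TF-IDF (c-TF-IDF). A simplified reading of that weighting is $W_{t,c} = tf_{t,c} \cdot \log(1 + A / f_t)$, where $tf_{t,c}$ is the relative frequency of term $t$ in topic $c$, $f_t$ is the frequency of $t$ across all topics, and $A$ is the average number of words per topic. The sketch below computes it in plain Python for two made-up topics; it is not the library's implementation, which works on sparse matrices and offers further options.

```python
import math
from collections import Counter

def c_tf_idf(topic_docs):
    """Score terms per topic with tf * log(1 + A / f_t).

    topic_docs maps a topic id to the list of preprocessed documents in it.
    """
    # Treat all documents of a topic as one long document
    term_counts = {t: Counter(" ".join(docs).split()) for t, docs in topic_docs.items()}
    # f_t: frequency of each term across all topics
    total = Counter()
    for counts in term_counts.values():
        total.update(counts)
    # A: average number of words per topic
    avg_words = sum(total.values()) / len(term_counts)
    scores = {}
    for topic, counts in term_counts.items():
        n_words = sum(counts.values())
        scores[topic] = {
            term: (freq / n_words) * math.log(1 + avg_words / total[term])
            for term, freq in counts.items()
        }
    return scores

toy = {  # made-up documents, already preprocessed
    0: ["docker image error", "docker container error"],
    1: ["slurm job error", "slurm queue timeout"],
}
scores = c_tf_idf(toy)
top_term = max(scores[0], key=scores[0].get)
print(top_term)  # docker
```

The word `error` is as frequent as `docker` in topic 0, but it also occurs in topic 1, so it receives a lower score there.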

In [None]:
# get topic frequencies
model.get_topic_freq()

In [None]:
# reload the saved model; this discards the n-gram update applied above
model = BERTopic.load("Results/BERTopicModelSO")

# reduce topics
model.reduce_topics(data, nr_topics=32)
print(model.topics_)

# visualize reduced topics
model.visualize_topics()