# **Identifying Challenges Faced by Developers in Scientific Workflow Management Systems using BERTopic**

This notebook uses BERTopic to identify the challenges developers face with scientific workflow management systems, based on Stack Overflow posts and GitHub issues. The dataset is available at https://figshare.com/projects/SWsChallengesbySOandGitHub/172476.

## Installing Dependencies

In [1]:
%pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


## Define functions to preprocess data

In [2]:
import pandas as pd
import re
import nltk
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from bertopic import BERTopic

# Download the NLTK stopword list and WordNet data (skipped if already up to date)
nltk.download('stopwords')
nltk.download('wordnet')

# Custom stopwords list (add domain-specific stopwords if needed)
custom_stopwords = set([])

def clean_text(text):
    # Remove HTML tags using BeautifulSoup
    soup = BeautifulSoup(text, "html.parser")
    text = soup.get_text()

    # Lowercase, then replace everything except ASCII letters and whitespace (digits included) with a space
    text = re.sub(r'[^A-Za-z\s]', ' ', text.lower())

    # Remove extra whitespaces
    text = re.sub(r'\s+', ' ', text).strip()

    return text

def remove_stopwords_and_lemmatize(text):
    stop_words = set(stopwords.words('english')) | custom_stopwords
    lemmatizer = WordNetLemmatizer()
    return " ".join(lemmatizer.lemmatize(word) for word in text.split() if word not in stop_words)

def train_bertopic(data):
    model = BERTopic(language="english", calculate_probabilities=True)
    topics, probabilities = model.fit_transform(data)
    return model, topics, probabilities

[nltk_data] Downloading package stopwords to /home/dev/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/dev/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
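
To see what the two regex steps in `clean_text` do in isolation, here is a minimal sketch that uses only the standard library. The helper name `normalize` and the sample string are illustrative assumptions, not part of the notebook; HTML stripping, stopword removal, and lemmatization are left out.

```python
import re

def normalize(text):
    # Hypothetical helper: mirrors only the two regex steps of clean_text.
    # Everything that is not an ASCII letter or whitespace becomes a space,
    # so digits and punctuation are dropped as well.
    text = re.sub(r'[^A-Za-z\s]', ' ', text.lower())
    # Collapse runs of whitespace into single spaces.
    return re.sub(r'\s+', ' ', text).strip()

# Made-up sample text, not taken from the dataset.
sample = "Snakemake v7.3 fails: rule 'align' (exit code 1)!"
print(normalize(sample))  # snakemake v fails rule align exit code
```

Note that version numbers and exit codes disappear, and fragments such as `v` survive as stray tokens, which may be worth adding to `custom_stopwords` if they show up in the topics.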


## Import and Preprocess Data

In [3]:
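# Import the data
try:
    new_df = pd.read_csv('Dataset/StackOverflowPostsDataset.csv')
except FileNotFoundError:
    print("Dataset file not found. Please provide the correct file path.")
    raise  # stop this cell without killing the notebook kernel

# Merge the text fields; fill missing values first so that str.join does not fail on NaN
new_df["merged"] = new_df[["Body", "Title", "Tags"]].fillna("").astype(str).apply("-".join, axis=1)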

# Preprocess the data
new_df["merged"] = new_df["merged"].apply(clean_text)
new_df["processed"] = new_df["merged"].apply(remove_stopwords_and_lemmatize)

# Save the preprocessed data
new_df.to_csv('Results/ConcatenatedDatasetSO.csv', index=False)

In [4]:
# View first 5 rows of the processed data
print(new_df.head()["processed"])

0    dealing small project feel break even point st...
1    way know table locked kind lock currently tabl...
2    anyone used rapidminer sentiment analysis righ...
3    able fill table data excel file text file usin...
4    im trying import data table file using bteq im...
Name: processed, dtype: object


## Building and training the model

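Instantiate the model and train it on the preprocessed posts. BERTopic does not need the number of topics in advance: by default it embeds each document, reduces the embeddings with UMAP, clusters them with HDBSCAN, and describes each cluster by its highest-scoring c-TF-IDF terms. The number of topics therefore follows from the clustering, not from a search over topic coherence.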

In [5]:
# Train BERTopic on processed data
data = new_df["processed"].values.tolist()
model, topics, probabilities = train_bertopic(data)

# Get the frequency (document count) of each topic
topics_df = model.get_topic_freq()
topics_df.head()

# Save the BERTopic model
model.save("Results/BERTopicModelSO")

## Extracting Topics

After fitting the model, we can inspect the topics it found. `get_topic_info()` returns one row per topic with its ID, the number of documents assigned to it, a generated name, its top words, and representative documents.

In [6]:
freq = model.get_topic_info(); freq.head(5)

|   | Topic | Count | Name | Representation | Representative_Docs |
|---|-------|-------|------|----------------|---------------------|
| 0 | -1 | 2233 | -1_sample_rule_input_output | [sample, rule, input, output, snakemake, fastq... | [pretty new using snakemake looked around see ... |
| 1 | 0 | 444 | 0_eclipse_org_java_internal | [eclipse, org, java, internal, kepler, core, u... | [rcp application based compatibility layer one... |
| 2 | 1 | 416 | 1_monetdb_database_table_query | [monetdb, database, table, query, monetdblite,... | [tried using unixodbc version monetdb odbc cli... |
| 3 | 2 | 287 | 2_luigi_task_self_def | [luigi, task, self, def, worker, bdx, return, ... | [using luigi workflow workflow divided three g... |
| 4 | 3 | 237 | 3_nextflow_process_channel_debug | [nextflow, process, channel, debug, nf, main, ... | [working colleague pipeline nextflow weird beh... |


Topic -1 collects the outliers: posts that BERTopic could not assign to any topic (2,233 here). Next, look at the most frequent topics and their top words to determine what each topic is about.

In [7]:
model.get_topic(0)  # Top words of topic 0, the most frequent topic after the outlier topic -1

[('eclipse', 0.06435535455251368),
 ('org', 0.0476609627512649),
 ('java', 0.04221754885207246),
 ('internal', 0.030445007146727706),
 ('kepler', 0.022938658999402162),
 ('core', 0.02276118695951159),
 ('ui', 0.01984208091053176),
 ('maven', 0.01778293577241705),
 ('project', 0.016919831666311256),
 ('plugin', 0.015455870928870906)]
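
The number next to each word is its c-TF-IDF weight. BERTopic's documentation describes this as the term's frequency within a topic multiplied by log(1 + A / f_t), where A is the average number of words per topic and f_t is the term's frequency across all topics. The sketch below computes that weighting by hand on two made-up topics. The toy documents and the function name `c_tf_idf` are assumptions for illustration, and BERTopic's own implementation differs in details such as normalization, so the values will not match the output above.

```python
import math
from collections import Counter

def c_tf_idf(topic_docs):
    """Score each word per topic as tf * log(1 + A / f_t) (illustrative sketch)."""
    # Treat all documents of a topic as one long bag of words.
    tf = {topic: Counter(" ".join(docs).split()) for topic, docs in topic_docs.items()}
    # f_t: frequency of each word across all topics.
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    # A: average number of words per topic.
    avg_words = sum(total.values()) / len(tf)
    return {
        topic: {w: n * math.log(1 + avg_words / total[w]) for w, n in counts.items()}
        for topic, counts in tf.items()
    }

# Toy topics, made up for illustration (not taken from the dataset).
toy = {
    0: ["eclipse plugin error", "eclipse java project"],
    1: ["luigi task error", "luigi worker task"],
}
scores = c_tf_idf(toy)
top = sorted(scores[0].items(), key=lambda kv: kv[1], reverse=True)[:3]
print(top)
```

In this toy data, `eclipse` ranks first for topic 0 because it is frequent there and absent elsewhere, while `error` is penalized for appearing in both topics.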

## Assess trained model

In [8]:
# assess predicted topics for first 10 posts
model.topics_[:10]

[1, -1, 13, 4, 4, 4, 18, 1, 29, -1]
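
One quick check on these assignments is how many posts land in the outlier topic. A minimal sketch, using the ten assignments printed above as a hard-coded list; in the notebook the same code would run on `model.topics_`. The variable names are illustrative.

```python
from collections import Counter

# Assignments copied from the output above; in the notebook, use model.topics_.
assigned = [1, -1, 13, 4, 4, 4, 18, 1, 29, -1]

counts = Counter(assigned)
# Share of posts that BERTopic left unassigned (topic -1).
outlier_share = counts[-1] / len(assigned)

print(counts.most_common(3))
print(f"outlier share: {outlier_share:.0%}")  # outlier share: 20%
```

A high outlier share on the full dataset can be a sign that the clustering is too strict for short or noisy posts.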

## Visualize Topics

With the model trained, we can plot the generated topics on an interactive intertopic distance map, similar to LDAvis.

In [9]:
model.visualize_topics()