# **Identifying Challenges Faced by Developers in Scientific Workflow Management Systems using BERTopic**

This notebook applies BERTopic to Stack Overflow posts and GitHub issues to identify the challenges developers face with scientific workflow management systems. The dataset used in this notebook is available at https://figshare.com/projects/SWsChallengesbySOandGitHub/172476.

## Installing Dependencies

In [1]:
%pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


## Define functions to preprocess data

In [2]:
import pandas as pd
import re
import nltk
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from bertopic import BERTopic

# Download the NLTK stopword list and WordNet data if not already available
nltk.download('stopwords')
nltk.download('wordnet')

# Custom stopwords list (add domain-specific stopwords if needed)
custom_stopwords = set([])

def clean_text(text):
    # Remove HTML tags using BeautifulSoup
    soup = BeautifulSoup(text, "html.parser")
    text = soup.get_text()

    # Lowercase, then replace every character that is not a letter or whitespace (digits included) with a space
    text = re.sub(r'[^A-Za-z\s]', ' ', text.lower())

    # Remove extra whitespaces
    text = re.sub(r'\s+', ' ', text).strip()

    return text

def remove_stopwords_and_lemmatize(text):
    stop_words = set(stopwords.words('english')) | custom_stopwords
    lemmatizer = WordNetLemmatizer()
    return " ".join(lemmatizer.lemmatize(word) for word in text.split() if word not in stop_words)

def train_bertopic(data):
    model = BERTopic(language="english", calculate_probabilities=True)
    topics, probabilities = model.fit_transform(data)
    return model, topics, probabilities

[nltk_data] Downloading package stopwords to /home/dev/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/dev/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
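
To see what the two regex steps in `clean_text` do to a post, the sketch below applies them to a made-up sentence. It leaves out the BeautifulSoup step so that it runs with the standard library alone; `normalize` and the sample string are illustrative and not part of the notebook.

```python
import re

def normalize(text):
    # Same two regex steps as clean_text, without the HTML-stripping step
    text = re.sub(r'[^A-Za-z\s]', ' ', text.lower())
    return re.sub(r'\s+', ' ', text).strip()

# Hypothetical post text, invented for this example
sample = "Snakemake v7.3 fails:  rule 'align' can't find sample_01.fastq.gz"
print(normalize(sample))
# snakemake v fails rule align can t find sample fastq gz
```

Note that digits and punctuation are dropped, so version numbers and file names lose part of their content before the text reaches the model.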


## Import and Preprocess Data

In [3]:
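# Import the data
try:
    new_df = pd.read_csv('Dataset/StackOverflowPostsDataset.csv')
except FileNotFoundError as err:
    # Raise instead of calling exit(), which would shut down the notebook kernel
    raise FileNotFoundError("Dataset file not found. Please provide the correct file path.") from err

# Concatenate body, title, and tags; fill missing values so that str.join does not fail on NaN
new_df["merged"] = new_df[["Body", "Title", "Tags"]].fillna("").astype(str).apply("-".join, axis=1)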

# Preprocess the data
new_df["merged"] = new_df["merged"].apply(clean_text)
new_df["processed"] = new_df["merged"].apply(remove_stopwords_and_lemmatize)

# Save the preprocessed data
new_df.to_csv('Results/ConcatenatedDatasetSO.csv', index=False)

In [4]:
# View first 5 rows of the processed data
print(new_df.head()["processed"])

0    dealing small project feel break even point st...
1    way know table locked kind lock currently tabl...
2    anyone used rapidminer sentiment analysis righ...
3    able fill table data excel file text file usin...
4    im trying import data table file using bteq im...
Name: processed, dtype: object


## Building and training the model

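Instantiate the model and train it on the data. BERTopic embeds each document, reduces the embeddings with UMAP, and clusters them with HDBSCAN, so the number of topics follows from the clustering rather than from a topic-coherence score. Each cluster is then described by its highest-scoring c-TF-IDF terms.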

In [5]:
# Train BERTopic on processed data
data = new_df["processed"].values.tolist()
model, topics, probabilities = train_bertopic(data)

# Get the number of documents assigned to each topic
topics_df = model.get_topic_freq()
topics_df.head()

# Save the BERTopic model
model.save("Results/BERTopicModelSO")

## Extracting Topics
After fitting the model, we can extract the topics. `get_topic_info()` returns one row per topic with its ID, the number of documents assigned to it, a generated name, its top terms, and its representative documents.

In [6]:
freq = model.get_topic_info()
freq.head(5)

|   | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 2126 | -1_sample_input_rule_output | [sample, input, rule, output, snakemake, file,... | [currently working project iam struggling issu... |
| 1 | 0 | 441 | 0_eclipse_org_java_internal | [eclipse, org, java, internal, kepler, core, u... | [rcp application based compatibility layer one... |
| 2 | 1 | 411 | 1_monetdb_table_database_query | [monetdb, table, database, query, monetdblite,... | [try run simple dplyr command monetdb sql back... |
| 3 | 2 | 288 | 2_luigi_task_self_def | [luigi, task, self, def, worker, bdx, return, ... | [luigi task read sql file output bigquery ques... |
| 4 | 3 | 233 | 3_nextflow_process_channel_debug | [nextflow, process, channel, debug, nf, main, ... | [nextflow pipeline four process last process i... |


Topic -1 collects the outliers, i.e. the documents that BERTopic could not assign to any topic. Next, look at the most frequent topics and their words to determine what each topic is about.

In [7]:
model.get_topic(0)  # Topic 0 is the most frequent topic after the outlier topic -1

[('eclipse', 0.066029961726494),
 ('org', 0.049205673470792806),
 ('java', 0.043673502550706574),
 ('internal', 0.030956036786204),
 ('kepler', 0.023310045281065596),
 ('core', 0.023282987493646354),
 ('ui', 0.02006351483523414),
 ('maven', 0.01800152271643605),
 ('project', 0.017194127974160756),
 ('plugin', 0.016016481658954196)]

## Assess trained model

In [8]:
# inspect the topics assigned to the first 10 posts
model.topics_[:10]

[1, -1, 10, 5, 5, 5, -1, 1, 4, -1]
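
A quick tally of these assignments shows how many of the sampled posts ended up in the outlier topic. The sketch below uses only the ten labels printed above; with the fitted model one would presumably pass `model.topics_` instead of the hard-coded list.

```python
from collections import Counter

# Topic assignments for the first 10 posts, copied from the cell output
first_topics = [1, -1, 10, 5, 5, 5, -1, 1, 4, -1]

# Count how often each topic ID occurs
counts = Counter(first_topics)

# Share of posts labelled -1 (outliers)
outlier_share = counts[-1] / len(first_topics)
print(f"outlier share: {outlier_share:.0%}")
# outlier share: 30%
```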

## Visualize Topics
After training the model, we can visualize the generated topics in an intertopic distance map, similar to LDAvis.

In [9]:
model.visualize_topics()


In [10]:
# visualize the topic probability distribution for the document at index 70
model.visualize_distribution(probabilities[70], min_probability=0.015)

In [11]:
# visualize topic hierarchy
model.visualize_hierarchy(top_n_topics=30)

In [12]:
# visualize terms
model.visualize_barchart(top_n_topics=8)

In [13]:
# visualize topic similarity
model.visualize_heatmap(n_clusters=20, width=1000, height=1000)

In [14]:
# visualize term score decline
model.visualize_term_rank()

## Refine Model

Our model has identified over 70 different topics, which is too many to interpret easily. We can refine the result without changing the preprocessing: first by updating the topic representations to include n-grams, then by reducing the number of topics.

In [15]:
# update topics to include bigrams and trigrams
model.update_topics(data, n_gram_range=(1, 3))
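
With `n_gram_range=(1, 3)`, the topic vocabulary is built from single words plus sequences of two and three consecutive words. The helper below, `ngrams`, is a hypothetical stand-in written for illustration; BERTopic itself delegates this step to scikit-learn's `CountVectorizer`, whose ordering and filtering differ.

```python
def ngrams(tokens, n_min=1, n_max=3):
    """Return all contiguous n-grams of length n_min..n_max as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

# Three tokens chosen for illustration
grams = ngrams(["java", "org", "eclipse"])
print(grams)
# ['java', 'org', 'eclipse', 'java org', 'org eclipse', 'java org eclipse']
```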

In [16]:
# get top words and their c-TF-IDF scores
model.get_topic(0)

[('eclipse', 0.0346487670534546),
 ('org', 0.026851934881386187),
 ('java', 0.02521887960600546),
 ('org eclipse', 0.02237543031848014),
 ('java org', 0.017833324285112008),
 ('java org eclipse', 0.017443779542907815),
 ('internal', 0.01452429387085522),
 ('core', 0.011206570118054378),
 ('kepler', 0.010826193982955628),
 ('ui', 0.008917641804775054)]
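
The scores in this list are c-TF-IDF weights. As a rough sketch of the idea, the code below pools the documents of each topic into one bag of words and weights each term by its in-topic frequency times `log(1 + A / f_t)`, where `A` is the average number of words per topic and `f_t` is the term's frequency across all topics. This is a simplified illustration on a made-up corpus, not the library's implementation, which works on sparse matrices and may differ in details such as normalization; `c_tf_idf` and `toy` are names invented here.

```python
import math
from collections import Counter

def c_tf_idf(docs_per_topic):
    """Score terms per topic; docs_per_topic maps topic id -> list of token lists."""
    # Pool all documents of a topic into one bag of words
    class_counts = {
        topic: Counter(token for doc in docs for token in doc)
        for topic, docs in docs_per_topic.items()
    }
    # Frequency of each term across all topics
    term_totals = Counter()
    for counts in class_counts.values():
        term_totals.update(counts)
    # Average number of words per topic
    avg_words = sum(term_totals.values()) / len(class_counts)

    scores = {}
    for topic, counts in class_counts.items():
        n_words = sum(counts.values())
        scores[topic] = {
            term: (freq / n_words) * math.log(1 + avg_words / term_totals[term])
            for term, freq in counts.items()
        }
    return scores

# Toy corpus (hypothetical): two topics that share the word "error"
toy = {
    0: [["eclipse", "plugin", "error"]],
    1: [["luigi", "task", "error"]],
}
scores = c_tf_idf(toy)
print(scores[0]["eclipse"] > scores[0]["error"])  # True
```

A term that occurs in only one topic ("eclipse") outranks a term shared by both ("error"), which is why topic-specific words rise to the top of each list.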

In [17]:
# get topic frequencies
model.get_topic_freq()

|   | Topic | Count |
|---|---|---|
| 1 | -1 | 2126 |
| 13 | 0 | 441 |
| 0 | 1 | 411 |
| 19 | 2 | 288 |
| 53 | 3 | 233 |
| ... | ... | ... |
| 70 | 71 | 12 |
| 68 | 72 | 12 |
| 66 | 73 | 11 |
| 58 | 74 | 11 |


In [18]:
# reload the model as it was saved after training
model = BERTopic.load("Results/BERTopicModelSO")

# reduce the number of topics to 32 by merging similar topics
model.reduce_topics(data, nr_topics=32)
print(model.topics_)

# visualize reduced topics
model.visualize_topics()

[3, -1, 8, 1, 1, 1, -1, 3, 1, -1, -1, 1, 1, 5, -1, 10, 12, 1, 1, -1, -1, 6, -1, 7, -1, -1, -1, -1, 3, 6, 1, 1, -1, -1, -1, 20, 3, 2, 13, 1, -1, -1, 17, 1, 1, 2, 7, 8, 16, 1, -1, -1, -1, 1, 5, -1, -1, -1, 4, 8, -1, -1, 3, 5, 6, -1, 17, -1, 3, 5, 18, 2, -1, 20, -1, 7, 1, 1, -1, -1, 7, -1, 1, 3, -1, 26, 3, 3, 6, 7, 21, 3, -1, 8, 2, 8, 17, 12, 3, -1, -1, -1, 3, 5, 20, -1, -1, 8, -1, 6, 1, 3, -1, 3, 15, 3, 1, 3, 17, 1, -1, -1, 7, -1, -1, 1, 1, 26, 19, 3, 7, 30, 3, -1, -1, 7, 7, 7, -1, 27, 1, -1, 8, -1, 1, -1, -1, 8, -1, 1, -1, 5, 23, -1, -1, -1, -1, 5, 5, -1, 8, 2, -1, 12, 6, 8, 3, 3, -1, -1, 17, -1, 27, 2, 8, -1, 5, 12, 1, 21, 8, -1, 6, -1, -1, -1, 8, 6, 6, 6, -1, 23, -1, 1, -1, 1, 15, -1, -1, 18, 8, 8, 8, 1, -1, -1, 8, 8, 5, 1, 15, -1, 1, -1, -1, -1, 21, 1, 6, 8, 5, 12, 1, -1, -1, 1, -1, 1, 17, 1, -1, -1, -1, 3, -1, -1, 8, 26, -1, 2, 8, 19, 8, 3, 1, -1, 1, -1, 1, -1, 1, 1, 8, 8, 2, -1, -1, -1, -1, -1, -1, 27, -1, 1, 5, 3, 15, 1, 5, -1, -1, 5, -1, -1, -1, 15, 2, 2, 1, 8, 17, 2, 6, -1, 2, 1

In [19]:
# find the five topics most similar to a search term and list their top words
similar_topics, similarity = model.find_topics("rule", top_n=5)

print("Similar topics to 'rule':")

for topic in similar_topics:
    words = [word for word, _ in model.get_topic(topic)]
    words_str = ", ".join(words)
    print(f"Topic {topic}: {words_str}")

Similar topics to 'rule':
Topic 12: regexp, string, character, regex, column, substr, like, teradata, replace, search
Topic 0: snakemake, rule, output, sample, input, file, fastq, nextflow, gz, config
Topic 13: knime, node, workflow, column, col, awayteam, hometeam, like, want, data
Topic -1: sample, output, input, rule, snakemake, file, table, fastq, data, teradata
Topic 24: log, rule, snakemake, script, save, file, output, shell, logging, flaky
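
`find_topics` ranks topics by how close they are to the search term in embedding space. As a rough illustration of that ranking step, the sketch below scores three-dimensional topic vectors against a query vector with cosine similarity. The vectors and the `rank_topics` helper are invented for the example and are not BERTopic output; real embeddings have hundreds of dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_topics(query_vec, topic_vecs, top_n=2):
    """Return the top_n (topic, similarity) pairs, most similar first."""
    scored = sorted(
        ((cosine(query_vec, vec), topic) for topic, vec in topic_vecs.items()),
        reverse=True,
    )
    return [(topic, round(score, 3)) for score, topic in scored[:top_n]]

# Hypothetical topic and query embeddings
topic_vecs = {0: [0.9, 0.1, 0.0], 12: [0.1, 0.9, 0.1], 24: [0.7, 0.3, 0.2]}
query = [1.0, 0.0, 0.0]

ranked = rank_topics(query, topic_vecs)
print(ranked)
```

Here topic 0 ranks first and topic 24 second, because their vectors point most nearly in the same direction as the query.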
