# **Identifying Challenges Faced by Developers in Scientific Workflow Management Systems using BERTopic**

This notebook uses BERTopic to identify challenges faced by developers in scientific workflow management systems using Stack Overflow posts and GitHub issues. The dataset used in this notebook is available at https://figshare.com/projects/SWsChallengesbySOandGitHub/172476.

## Installing Dependencies

In [1]:
%pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


## Define functions to preprocess data

In [2]:
import pandas as pd
import re
import nltk
from bs4 import BeautifulSoup
from bertopic import BERTopic

# Download stopwords if not already available
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

def clean_text(text):
    # Remove HTML tags
    soup = BeautifulSoup(text, "html.parser")
    text = soup.get_text()

    # Lowercase the text and replace every non-letter character (digits and punctuation included) with a space
    text = re.sub(r'[^A-Za-z]', ' ', text.lower())

    # Remove extra whitespaces
    text = re.sub(r'\s+', ' ', text).strip()

    return text

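# Build the stopword set and stemmer once instead of on every call
STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def remove_stopwords_and_stem(text):
    # Drop English stopwords, then reduce each remaining word to its Porter stem
    return " ".join(STEMMER.stem(word) for word in text.split() if word not in STOP_WORDS)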

def train_bertopic(data):
    model = BERTopic(language="english", calculate_probabilities=True)
    topics, probabilities = model.fit_transform(data)
    return model, topics, probabilities

[nltk_data] Downloading package stopwords to /home/dev/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
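
To make the effect of the cleaning step concrete, the following minimal sketch applies the same two regular expressions as `clean_text` to a made-up post title. The sample string and the helper name `normalize` are illustrative assumptions, not part of the dataset or of this notebook's functions, and the HTML-stripping step is left out. Note that digits are dropped along with punctuation, so version numbers do not survive preprocessing.

```python
import re

def normalize(text):
    # Same two regex steps as clean_text, without the HTML stripping.
    text = re.sub(r'[^A-Za-z]', ' ', text.lower())
    return re.sub(r'\s+', ' ', text).strip()

# Hypothetical post title, used only for illustration.
sample = "Snakemake 7.3: rule 'align' fails on sample_01.fastq.gz!"
normalized = normalize(sample)
print(normalized)  # snakemake rule align fails on sample fastq gz
```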


## Import and Preprocess Data

In [3]:
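# Import the data
try:
    new_df = pd.read_csv('Dataset/StackOverflowPostsDataset.csv')
except FileNotFoundError:
    print("Dataset file not found. Please provide the correct file path.")
    raise  # Stop the cell without shutting down the kernel

# Merge body, title, and tags into one text field; fill missing values so the join never receives NaN
new_df["merged"] = new_df[["Body", "Title", "Tags"]].fillna("").astype(str).apply("-".join, axis=1)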

# Preprocess the data
new_df["merged"] = new_df["merged"].apply(clean_text)
new_df["processed"] = new_df["merged"].apply(remove_stopwords_and_stem)

# Save the preprocessed data
new_df.to_csv('Dataset/ConcatenatedDatasetSO.csv', index=False)

## Building and training the model

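Instantiate the model and train it on the data. BERTopic does not need the number of topics up front: by default it embeds each document, reduces the embeddings with UMAP, and clusters them with HDBSCAN, so the number of topics follows from the clustering rather than from a coherence score.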
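
The number of topics is never passed in explicitly, but it can be influenced through the model's constructor. The sketch below is one possible variation on `train_bertopic`, not the configuration used for the results in this notebook; the helper name `build_topic_model` and the value `20` are assumptions chosen for illustration.

```python
def build_topic_model(min_topic_size=20):
    """Return an untrained BERTopic model with a custom minimum topic size."""
    # Imported lazily so the function can be defined before the heavy dependencies load.
    from bertopic import BERTopic

    return BERTopic(
        language="english",
        calculate_probabilities=True,
        min_topic_size=min_topic_size,  # illustrative value; larger values tend to yield fewer, broader topics
    )
```

Calling `build_topic_model().fit_transform(data)` would then replace the call inside `train_bertopic`.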

In [4]:
# Train BERTopic on processed data
data = new_df["processed"].values.tolist()
model, topics, probabilities = train_bertopic(data)

# Get topic sizes (number of documents assigned to each topic)
topics_df = model.get_topic_freq()
topics_df.head()

# Save the BERTopic model
model.save("./Results/BERTopicModelSO")

## Extracting Topics
After fitting the model, we can extract the topics from it. `get_topic_info()` returns one row per topic with its ID, the number of documents assigned to it, a generated name, its top terms, and a few representative documents.

In [5]:
freq = model.get_topic_info(); freq.head(5)

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 1884 | -1_tabl_teradata_valu_data | [tabl, teradata, valu, data, use, id, select, ... | [would like translat follow oracl sql queri te... |
| 1 | 0 | 947 | 0_sampl_rule_input_output | [sampl, rule, input, output, fastq, snakemak, ... | [current work project iam struggl issu current... |
| 2 | 1 | 416 | 1_eclips_org_java_intern | [eclips, org, java, intern, core, kepler, ui, ... | [rcp applic base compat layer one part stack t... |
| 3 | 2 | 374 | 2_monetdb_databas_tabl_monetdblit | [monetdb, databas, tabl, monetdblit, queri, da... | [work get monetdb hook jdbc issu even basic tu... |
| 4 | 3 | 313 | 3_date_month_day_dt | [date, month, day, dt, end, sale, select, sum,... | [tabl everi row transact column client id date... |


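Topic -1 collects the outliers, i.e. the documents that BERTopic could not assign to any topic. Next, look at the most frequent topics and their top terms to determine what each topic is about. The terms appear in stemmed form (for example `sampl`, `snakemak`) because the input was stemmed during preprocessing.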

In [6]:
model.get_topic(0)  # Select the most frequent topic (topic -1, the outliers, is excluded)

[('sampl', 0.029217438910670673),
 ('rule', 0.026246475362948114),
 ('input', 0.025044950510734826),
 ('output', 0.022352397745616807),
 ('fastq', 0.021711838714991362),
 ('snakemak', 0.021416948570340426),
 ('wildcard', 0.018091689231034117),
 ('file', 0.01802723556284566),
 ('gz', 0.017382453130048204),
 ('bam', 0.01596821260328908)]
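
Each tuple pairs a stemmed term with its weight for the topic. Judging from the outputs shown here, the `Name` column in the topic table appears to combine the topic ID with the four highest-weighted terms. The following sketch reproduces that label from the first five tuples above; `topic_label` is a hypothetical helper written for this illustration, not a BERTopic function.

```python
# First five (term, weight) pairs copied from the get_topic(0) output.
top_terms = [
    ('sampl', 0.029217438910670673),
    ('rule', 0.026246475362948114),
    ('input', 0.025044950510734826),
    ('output', 0.022352397745616807),
    ('fastq', 0.021711838714991362),
]

def topic_label(topic_id, terms, n_words=4):
    # Rank terms by weight (highest first) and join the leading ones with underscores.
    ranked = sorted(terms, key=lambda pair: pair[1], reverse=True)
    return "_".join([str(topic_id)] + [word for word, _ in ranked[:n_words]])

label = topic_label(0, top_terms)
print(label)  # 0_sampl_rule_input_output
```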

## Assess trained model

In [7]:
# Assess the predicted topics for the first 10 posts
model.topics_[:10]

[-1, -1, 55, -1, 43, 38, 3, -1, 66, -1]
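
A quick way to gauge how much of the data ends up unassigned is to count the `-1` entries. The sketch below does this for the ten assignments shown above; running the same calculation on the full `model.topics_` list would give the overall outlier share. The variable names are illustrative.

```python
from collections import Counter

# The first ten topic assignments, copied from the output above.
assigned = [-1, -1, 55, -1, 43, 38, 3, -1, 66, -1]

counts = Counter(assigned)
outlier_share = counts[-1] / len(assigned)
print(f"{counts[-1]} of {len(assigned)} posts are outliers ({outlier_share:.0%})")
```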

## Visualize Topics
After training the model, we can visualize the generated topics in a way very similar to LDAvis.

In [8]:
model.visualize_topics()