# **Project: Topic Modelling**

- Karyl Abigail Grasparil

## Executive Summary:
In this project, I performed topic modeling on a dataset of research papers categorized by subject, sourced from Kaggle. The objective was to identify hidden topics within the text data using two popular topic modeling techniques: Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF).

The preprocessing phase involved cleaning and preparing the text data: standardizing case, removing non-letter characters, tokenizing, removing stop words, and lemmatizing. Following this, I converted the text into numerical format with two vectorizers: CountVectorizer (term counts) for LDA and TfidfVectorizer (TF-IDF weights) for NMF.

I trained four LDA and four NMF models, varying the number of topics (15, 20, 25, and 30), to identify the best model of each type. The LDA models were evaluated with perplexity, the NMF models with reconstruction error and sparsity, and both with their topic-word and document-topic tables. I attempted to visualize the best LDA model using pyLDAvis for interactive exploration of topics, but the call raised a `TypeError` in this environment.

The results indicate that the NMF model with 30 topics performed better at extracting meaningful topics relevant to the use case. The findings from the best model are discussed in detail, with recommendations for further improvements. The outcomes are expected to benefit stakeholders by providing deeper insights into research trends and themes within academic publications.


## Preprocessing:

Importing the libraries that will be used for the project

In [13]:
# Suppress warnings and import libraries
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import re
import nltk
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF
from sklearn.metrics import mean_squared_error
import pyLDAvis
import pyLDAvis.sklearn
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\abiga\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\abiga\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\abiga\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Load the Dataset

In [2]:
# Load the dataset
df = pd.read_csv("arXiv-DataFrame.csv")
df.head()



| Unnamed: 0.1 | Unnamed: 0 | id | Title | Summary | Author | Link | Publish Date | Update Date | Primary Category | Category |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | cs/9308101v1 | Dynamic Backtracking | Because of their occasional need to return to ... | M. L. Ginsberg | http://arxiv.org/pdf/cs/9308101v1 | 1993-08-01T00:00:00Z | 1993-08-01T00:00:00Z | cs.AI | ['cs.AI'] |
| 1 | 1 | cs/9308102v1 | A Market-Oriented Programming Environment and ... | Market price systems constitute a well-underst... | M. P. Wellman | http://arxiv.org/pdf/cs/9308102v1 | 1993-08-01T00:00:00Z | 1993-08-01T00:00:00Z | cs.AI | ['cs.AI'] |
| 2 | 2 | cs/9309101v1 | An Empirical Analysis of Search in GSAT | We describe an extensive study of search in GS... | I. P. Gent | http://arxiv.org/pdf/cs/9309101v1 | 1993-09-01T00:00:00Z | 1993-09-01T00:00:00Z | cs.AI | ['cs.AI'] |
| 3 | 3 | cs/9311101v1 | The Difficulties of Learning Logic Programs wi... | As real logic programmers normally use cut (!)... | F. Bergadano | http://arxiv.org/pdf/cs/9311101v1 | 1993-11-01T00:00:00Z | 1993-11-01T00:00:00Z | cs.AI | ['cs.AI'] |
| 4 | 4 | cs/9311102v1 | Software Agents: Completing Patterns and Const... | To support the goal of allowing users to recor... | J. C. Schlimmer | http://arxiv.org/pdf/cs/9311102v1 | 1993-11-01T00:00:00Z | 1993-11-01T00:00:00Z | cs.AI | ['cs.AI'] |


Data Preprocessing

In [3]:
df = df["Summary"].tolist()
print("Summary:\n{lines}\n".format(lines=df[:5]))
print("LENGTH:\n{length}\n".format(length=len(df)))

Summary:
['Because of their occasional need to return to shallow points in a search tree, existing backtracking methods can sometimes erase meaningful progress toward solving a search problem. In this paper, we present a method by which backtrack points can be moved deeper in the search space, thereby avoiding this difficulty. The technique developed is a variant of dependency-directed backtracking that uses only polynomial space while still providing useful control information and retaining the completeness guarantees provided by earlier approaches.', 'Market price systems constitute a well-understood class of mechanisms that under certain conditions provide effective decentralization of decision making with minimal communication overhead. In a market-oriented programming approach to distributed problem solving, we derive the activities and resource allocations for a set of computational agents by computing the competitive equilibrium of an artificial economy. WALRAS provides basic co

In [4]:
# Convert all text to lowercase
df = [text.lower() for text in df]

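# Keep only lowercase letters and whitespace. Digits and punctuation are deleted
# outright, so hyphenated words are merged (e.g. "dependency-directed" -> "dependencydirected")
df = [re.sub(r'[^a-z\s]', '', text) for text in df]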

# Tokenization
df = [word_tokenize(text) for text in df]

# Initialize lemmatizer and stop words
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

# Define a preprocessing function
def preprocess(text):
    return [lemmatizer.lemmatize(word) for word in text if word not in stop_words and len(word) > 2]

# Apply preprocessing function to each summary
df = [preprocess(tokens) for tokens in df]

# Join tokens back into a single string for each summary
df = [' '.join(tokens) for tokens in df]

# Display the cleaned summaries
print("Processed Summary:\n{lines}\n".format(lines=df[:5]))
print("LENGTH:\n{length}\n".format(length=len(df)))

Processed Summary:
['occasional need return shallow point search tree existing backtracking method sometimes erase meaningful progress toward solving search problem paper present method backtrack point moved deeper search space thereby avoiding difficulty technique developed variant dependencydirected backtracking us polynomial space still providing useful control information retaining completeness guarantee provided earlier approach', 'market price system constitute wellunderstood class mechanism certain condition provide effective decentralization decision making minimal communication overhead marketoriented programming approach distributed problem solving derive activity resource allocation set computational agent computing competitive equilibrium artificial economy walras provides basic construct defining computational market structure protocol deriving corresponding price equilibrium particular realization approach form multicommodity flow problem see careful construction decision
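
The processed output above contains merged tokens such as `dependencydirected` and `wellunderstood`, because deleting punctuation joins the two halves of a hyphenated word. A minimal sketch of one alternative, assuming it is acceptable to split hyphenated terms into separate words, replaces every non-letter with a space instead of an empty string. The helper name `clean_text` is hypothetical and not part of the notebook.

```python
import re

def clean_text(text: str) -> str:
    """Lowercase, turn every non-letter into a space, and collapse runs of whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)      # punctuation and digits become spaces
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace

sample = "A variant of dependency-directed backtracking (1993)."
print(clean_text(sample))
```

Whether splitting or merging is preferable depends on the vocabulary: splitting keeps `dependency` and `directed` as ordinary words, while merging creates rare tokens that the `min_df` filter is likely to drop.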

## Models

Vectorization

In [5]:
# Vectorize using CountVectorizer for LDA
count_vectorizer = CountVectorizer(max_df=0.9, min_df=10)
count_data = count_vectorizer.fit_transform(df) 

# Vectorize using TF-IDF for NMF
tfidf_vectorizer = TfidfVectorizer(max_df=0.9, min_df=10)
tfidf_data = tfidf_vectorizer.fit_transform(df)
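
To make the effect of `min_df` and `max_df` concrete, here is a small sketch on a hypothetical four-document corpus. The thresholds are lowered (`min_df=2` instead of 10) only so that the toy vocabulary is not emptied; the corpus and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus: "paper" occurs in every document, some words occur only once.
toy_docs = [
    "paper graph bound",
    "paper graph model",
    "paper model energy star",
    "paper energy galaxy",
]

# min_df=2 drops words found in fewer than 2 documents;
# max_df=0.9 drops words found in more than 90% of the documents.
toy_count = CountVectorizer(min_df=2, max_df=0.9)
toy_counts = toy_count.fit_transform(toy_docs)

toy_tfidf = TfidfVectorizer(min_df=2, max_df=0.9)
toy_weights = toy_tfidf.fit_transform(toy_docs)

kept_terms = sorted(toy_count.vocabulary_)
row_norms = np.linalg.norm(toy_weights.toarray(), axis=1)

print("kept terms:", kept_terms)
print("counts:\n", toy_counts.toarray())
print("TF-IDF row norms:", row_norms)
```

The ubiquitous word and the one-off words are both filtered out, and each TF-IDF row is scaled to unit length by default, which is why the two matrices share a vocabulary but not a scale.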

In [6]:
# Define the function to reformat output matrices into readable tables
def get_topics(mod, vec, names, docs, ndocs, nwords):
    # Topic-word matrix, shape (n_topics, n_words); each row is normalised to sum to 1
    W = mod.components_
    W_norm = W / W.sum(axis=1)[:, np.newaxis]
    # Document-topic matrix, shape (n_docs, n_topics)
    H = mod.transform(vec)
    
    W_dict = {}
    H_dict = {}
    
    for tpc_idx, tpc_val in enumerate(W_norm):
        topic = f"Topic{tpc_idx + 1}"
        
        # Formatting W (Word-Topic)
        W_indices = tpc_val.argsort()[::-1][:nwords]
        W_names_values = [(round(tpc_val[j], 4), names[j]) for j in W_indices]
        W_dict[topic] = W_names_values
        
        # Formatting H (Document-Topic)
        H_indices = H[:, tpc_idx].argsort()[::-1][:ndocs]
        H_names_values = [(round(H[:, tpc_idx][j], 4), docs[j]) for j in H_indices]
        H_dict[topic] = H_names_values
        
    # Convert to DataFrames
    W_df = pd.DataFrame(W_dict, index=[f"Word{i+1}" for i in range(nwords)])
    H_df = pd.DataFrame(H_dict, index=[f"Doc{i+1}" for i in range(ndocs)])
        
    return W_df, H_df
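
The core of `get_topics` is a row normalisation followed by an `argsort` to pick the highest-weighted entries. A minimal sketch of that step on a hypothetical 2-topic, 5-word matrix (the values and the helper `top_words` are made up for illustration):

```python
import numpy as np

# Hypothetical topic-word weights (rows are topics), standing in for model.components_.
vocab = np.array(["energy", "graph", "mass", "bound", "star"])
components = np.array([
    [4.0, 0.5, 3.0, 0.5, 2.0],
    [0.2, 5.0, 0.3, 4.0, 0.5],
])

# Normalise each row so that the weights of a topic sum to 1.
normalised = components / components.sum(axis=1, keepdims=True)

def top_words(topic_row, names, n):
    """Return the n highest-weighted (weight, word) pairs of one topic."""
    order = topic_row.argsort()[::-1][:n]   # indices sorted by descending weight
    return [(round(float(topic_row[j]), 4), str(names[j])) for j in order]

for idx, row in enumerate(normalised, start=1):
    print(f"Topic{idx}:", top_words(row, vocab, n=3))
```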

Train and Evaluate LDA Models with Different Hyperparameters

In [39]:
# Function to train and evaluate multiple LDA models with alpha and beta parameters
def train_lda_models(data, vectorizer, docs, num_topics_list, alpha=None, beta=None, ndocs=5, nwords=10):
    """
    Train and evaluate LDA models with specified parameters, displaying word-topic and document-topic tables for all models.
    """
    results = []
    feature_names = vectorizer.get_feature_names_out()
    
    for num_topics in num_topics_list:
        print(f"\nTraining LDA with {num_topics} topics, alpha={alpha}, beta={beta}...")
        lda = LatentDirichletAllocation(
            n_components=num_topics,
            random_state=42,
            max_iter=10,
            doc_topic_prior=alpha,  # Alpha parameter
            topic_word_prior=beta   # Beta parameter
        )
        lda.fit(data)
        
        # Calculate perplexity on the training matrix (lower is better; not a held-out estimate)
        perplexity = lda.perplexity(data)
        print(f"Perplexity for LDA with {num_topics} topics: {perplexity:.4f}")
        
        # Generate word-topic and document-topic tables
        W_df, H_df = get_topics(lda, data, feature_names, docs, ndocs, nwords)
        
        # Display the tables for the current model
        print("\nWord-Topic Table:")
        print(W_df)
        print("\nDocument-Topic Table:")
        print(H_df)
        
        # Store the model's evaluation results
        results.append({
            "num_topics": num_topics,
            "model": lda,
            "perplexity": perplexity,
            "word_topic_table": W_df,
            "doc_topic_table": H_df
        })
    
    return results

# Train and evaluate LDA models with alpha and beta
num_topics_list_lda = [15, 20, 25, 30]
docs = df

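# Alpha and Beta Parameters (symmetric Dirichlet priors)
alpha = 0.5  # Document-topic prior: smaller values favour fewer topics per document
beta = 0.1   # Topic-word prior: smaller values favour fewer dominant words per topic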

# Train LDA models and evaluate them
lda_results = train_lda_models(count_data, count_vectorizer, docs, num_topics_list_lda, alpha=alpha, beta=beta)

# Select the best model based on perplexity
best_lda_model = min(lda_results, key=lambda x: x["perplexity"])

# Display the best model's results
print("\nBest LDA Model:")
print(f"Number of Topics: {best_lda_model['num_topics']}")
print(f"Perplexity: {best_lda_model['perplexity']:.4f}")
print("\nWord-Topic Table for Best Model:")
print(best_lda_model["word_topic_table"])
print("\nDocument-Topic Table for Best Model:")
print(best_lda_model["doc_topic_table"])


Training LDA with 15 topics, alpha=0.5, beta=0.1...
Perplexity for LDA with 15 topics: 2324.4677

Word-Topic Table:
                    Topic1              Topic2              Topic3  \
Word1     (0.0153, energy)    (0.0667, system)     (0.0179, graph)   
Word2       (0.0118, mass)   (0.0258, control)  (0.0168, function)   
Word3      (0.0098, model)     (0.0221, power)    (0.0136, number)   
Word4   (0.0092, spectrum)     (0.017, design)     (0.0132, bound)   
Word5       (0.0082, star)     (0.0102, paper)   (0.0115, lattice)   
Word6     (0.0061, galaxy)  (0.0097, proposed)      (0.011, model)   
Word7   (0.0057, emission)     (0.009, energy)    (0.0109, result)   
Word8    (0.0056, cluster)     (0.0083, state)      (0.0081, show)   
Word9     (0.0055, result)   (0.0073, circuit)    (0.0081, random)   
Word10      (0.0055, data)     (0.0071, based)      (0.0078, tree)   

                    Topic4                  Topic5                 Topic6  \
Word1     (0.0298, method)     (0.0
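
The `alpha` and `beta` values used above are passed to scikit-learn as symmetric Dirichlet priors (`doc_topic_prior` and `topic_word_prior`). A small NumPy sketch, independent of the notebook's data, of what the concentration value does: smaller values tend to produce mixtures dominated by a few topics, larger values produce flatter mixtures. The sample size and the three values compared are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_topics = 15

def mean_largest_share(concentration, n_samples=2000):
    """Average weight of the single largest topic in Dirichlet-sampled topic mixtures."""
    samples = rng.dirichlet(np.full(n_topics, concentration), size=n_samples)
    return float(samples.max(axis=1).mean())

shares = {value: mean_largest_share(value) for value in (0.1, 0.5, 5.0)}
for value, share in shares.items():
    print(f"concentration={value}: mean largest topic share = {share:.3f}")
```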

Train and Evaluate NMF Models with Different Hyperparameters

In [40]:
def evaluate_nmf_model(nmf, data, vectorizer, docs, ndocs=5, nwords=10):
    feature_names = vectorizer.get_feature_names_out()
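    # Document-topic matrix (W) and topic-word matrix (H)
    W = nmf.transform(data)
    H = nmf.components_

    # Reconstruction error: mean squared error between the TF-IDF matrix and W @ H
    reconstruction_error = mean_squared_error(
        data.toarray(),
        nmf.inverse_transform(W)
    )

    # Proportion of non-zero elements in W and H
    # (despite the variable names, a lower value means a sparser matrix)
    sparsity_W = np.mean(W != 0)
    sparsity_H = np.mean(H != 0)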
    
    # Generate word-topic and document-topic tables for interpretability
    W_df, H_df = get_topics(nmf, data, feature_names, docs, ndocs, nwords)
    
    return reconstruction_error, sparsity_W, sparsity_H, W_df, H_df


def train_and_evaluate_nmf_models(data, vectorizer, docs, num_topics_list, ndocs=5, nwords=10):
    """
    Train and evaluate NMF models with specified numbers of topics.
    """
    results = []
    feature_names = vectorizer.get_feature_names_out()
    
    for num_topics in num_topics_list:
        print(f"\nTraining NMF with {num_topics} topics...")
        # max_iter=10 is far below scikit-learn's default of 200, so the solver may stop before converging
        nmf = NMF(n_components=num_topics, random_state=42, max_iter=10, init="nndsvd")
        nmf.fit(data)
        
        # Evaluate the model
        reconstruction_error, sparsity_W, sparsity_H, W_df, H_df = evaluate_nmf_model(
            nmf, data, vectorizer, docs, ndocs, nwords
        )
        
        # Store results
        results.append({
            "num_topics": num_topics,
            "reconstruction_error": reconstruction_error,
            "sparsity_W": sparsity_W,
            "sparsity_H": sparsity_H,
            "word_topic_table": W_df,
            "doc_topic_table": H_df
        })
        
        print(f"\nReconstruction Error: {reconstruction_error:.4f}")
        print(f"Sparsity (W): {sparsity_W:.4f}")
        print(f"Sparsity (H): {sparsity_H:.4f}")
        print("\nWord-Topic Table:")
        print(W_df)
        print("\nDocument-Topic Table:")
        print(H_df)
    
    return results


# Define the list of numbers of topics
num_topics_list_nmf = [15, 20, 25, 30]
docs = df  

# Train and evaluate NMF models
nmf_results = train_and_evaluate_nmf_models(tfidf_data, tfidf_vectorizer, docs, num_topics_list_nmf)

# Select the best model based on reconstruction error
best_nmf_model = min(nmf_results, key=lambda x: x["reconstruction_error"])

print("\nBest NMF Model:")
print(f"Number of Topics: {best_nmf_model['num_topics']}")
print(f"Reconstruction Error: {best_nmf_model['reconstruction_error']:.4f}")
print(f"Sparsity (W): {best_nmf_model['sparsity_W']:.4f}")
print(f"Sparsity (H): {best_nmf_model['sparsity_H']:.4f}")
print("\nWord-Topic Table for Best Model:")
print(best_nmf_model["word_topic_table"])
print("\nDocument-Topic Table for Best Model:")
print(best_nmf_model["doc_topic_table"])


Training NMF with 15 topics...

Reconstruction Error: 0.0001
Sparsity (W): 0.5074
Sparsity (H): 0.3895

Word-Topic Table:
                       Topic1                Topic2                 Topic3  \
Word1          (0.0101, data)         (0.0102, set)       (0.0077, energy)   
Word2   (0.0047, information)       (0.01, theorem)        (0.0061, phase)   
Word3         (0.0043, paper)       (0.0086, prove)        (0.0054, field)   
Word4          (0.0043, user)    (0.0081, manifold)  (0.0053, temperature)   
Word5   (0.0038, application)        (0.008, proof)        (0.0051, state)   
Word6      (0.0036, analysis)       (0.0065, point)     (0.0051, magnetic)   
Word7      (0.0035, approach)        (0.0064, give)     (0.0042, electron)   
Word8        (0.0035, design)         (0.0064, map)   (0.0041, transition)   
Word9   (0.0033, performance)         (0.0063, let)       (0.0038, effect)   
Word10     (0.0031, research)  (0.0063, polynomial)         (0.0036, wave)   
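
Selecting the model with the lowest reconstruction error tends to favour the largest number of components, because extra components give the factorization more freedom to fit the matrix. The sketch below illustrates this with a truncated SVD on a random non-negative matrix. This is a stand-in chosen for illustration, not NMF itself: for the best rank-k approximation the error provably never increases with k, whereas an NMF solver stopped after a few iterations carries no such guarantee.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((40, 25))   # hypothetical non-negative "document-term" matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)

def rank_k_mse(k):
    """Mean squared error of the best rank-k approximation of X (truncated SVD)."""
    X_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return float(np.mean((X - X_k) ** 2))

errors = {k: rank_k_mse(k) for k in (5, 10, 15, 20)}
for k, err in errors.items():
    print(f"rank {k:2d}: MSE = {err:.5f}")
```

For this reason, reconstruction error is usually read together with topic interpretability rather than used as the only selection criterion.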

                  

## Discussion

### Best LDA Model

In [41]:
print(f"Best LDA model:\n{best_lda_model}")

Best LDA model:
{'num_topics': 15, 'model': LatentDirichletAllocation(doc_topic_prior=0.5, n_components=15, random_state=42,
                          topic_word_prior=0.1), 'perplexity': 2324.4676980942677, 'word_topic_table':                     Topic1              Topic2              Topic3  \
Word1     (0.0153, energy)    (0.0667, system)     (0.0179, graph)   
Word2       (0.0118, mass)   (0.0258, control)  (0.0168, function)   
Word3      (0.0098, model)     (0.0221, power)    (0.0136, number)   
Word4   (0.0092, spectrum)     (0.017, design)     (0.0132, bound)   
Word5       (0.0082, star)     (0.0102, paper)   (0.0115, lattice)   
Word6     (0.0061, galaxy)  (0.0097, proposed)      (0.011, model)   
Word7   (0.0057, emission)     (0.009, energy)    (0.0109, result)   
Word8    (0.0056, cluster)     (0.0083, state)      (0.0081, show)   
Word9     (0.0055, result)   (0.0073, circuit)    (0.0081, random)   
Word10      (0.0055, data)     (0.0071, based)      (0.0078, tree)   

 

Here are the features of the best LDA model:
- `Number of Topics`: 15
- `Perplexity`: 2324.47

Among the LDA models trained, this model had the lowest perplexity. However, perplexity is not the only basis for deciding whether a model is the best one. The word-topic and document-topic tables show that the words and documents associated with each topic form interpretable themes.
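
The perplexity reported above is computed on the same matrix the model was fitted on. A minimal sketch of a held-out variant follows, assuming a random count matrix as a stand-in for the notebook's document-term matrix. The split ratio, topic counts, and variable names are illustrative assumptions, not the notebook's settings.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the document-term count matrix (200 docs x 50 terms).
rng = np.random.default_rng(42)
toy_counts = rng.poisson(lam=1.0, size=(200, 50))

# Hold out a quarter of the documents for evaluation.
train_counts, test_counts = train_test_split(toy_counts, test_size=0.25, random_state=42)

scores = {}
for n_topics in (3, 6):
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=42, max_iter=10)
    lda.fit(train_counts)
    scores[n_topics] = (lda.perplexity(train_counts), lda.perplexity(test_counts))
    print(
        f"{n_topics} topics: "
        f"train perplexity = {scores[n_topics][0]:.2f}, "
        f"held-out perplexity = {scores[n_topics][1]:.2f}"
    )
```

Comparing models on the held-out column gives a fairer picture of generalization than the training column alone.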

Abstract Topics for the LDA Model:
1. Topic 1:

- **Dominant Words:** energy, mass, spectrum, star, galaxy, emission
- **Possible Topic:** Astrophysics and Cosmology
- **Description:** Focuses on physical and astronomical phenomena, such as energy, celestial masses, and the electromagnetic spectrum, often relating to the study of stars, galaxies, and cosmic emission.

2. Topic 2:

- **Dominant Words:** system, control, power, design, circuit
- **Possible Topic:** Electrical Engineering and Control Systems
- **Description:** Covers systems engineering, including circuit design, power management, and control mechanisms in various applications.

3. Topic 3:

- **Dominant Words:** graph, function, number, bound, lattice
- **Possible Topic:** Mathematics and Graph Theory
- **Description:** This topic deals with mathematical structures, such as graphs and lattices, as well as functions and numerical analysis.

4. Topic 4:

- **Dominant Words:** method, model, image, based, using
- **Possible Topic:** Image Processing and Machine Learning
- **Description:** Focuses on developing and applying computational methods for image-based learning, leveraging models, approaches, and features for achieving results in areas like computer vision and pattern recognition.

5. Topic 5:

- **Dominant Words:** algorithm, problem, matrix, time, performance
- **Possible Topic:** Algorithms and Optimization
- **Description:** Covers the design and analysis of algorithms for computational problems, including matrix computations, with attention to running time and performance.

In [52]:
lda_model = best_lda_model['model']
lda_model

LatentDirichletAllocation(doc_topic_prior=0.5, n_components=15, random_state=42,
                          topic_word_prior=0.1)

In [53]:
lda_plot = pyLDAvis.sklearn.prepare(lda_model, count_data, count_vectorizer)
pyLDAvis.display(lda_plot)

TypeError: Cannot interpret '<attribute 'dtype' of 'numpy.generic' objects>' as a data type
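
The `TypeError` above is raised inside the plotting call. One plausible cause, stated here as an assumption rather than a confirmed diagnosis, is a version mismatch between NumPy and an older pandas or pyLDAvis release. The sketch below only inspects the environment: it lists the installed versions and tries the scikit-learn adapter under both of its module names, since newer pyLDAvis releases renamed `pyLDAvis.sklearn` to `pyLDAvis.lda_model`. The helper names are hypothetical.

```python
from importlib import import_module
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages=("numpy", "pandas", "scikit-learn", "pyLDAvis")):
    """Map each package name to its installed version, or None if it is not installed."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

def load_sklearn_adapter():
    """Return pyLDAvis's scikit-learn adapter module, or None if it cannot be imported."""
    for module_name in ("pyLDAvis.lda_model", "pyLDAvis.sklearn"):
        try:
            return import_module(module_name)
        except ImportError:
            continue
    return None

print(installed_versions())
adapter = load_sklearn_adapter()
print("adapter module:", adapter.__name__ if adapter else "not available")
```

If an adapter is found, its `prepare` function takes the same three arguments used in the cell above: the fitted model, the document-term matrix, and the vectorizer.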

### Best NMF Model

In [54]:
print(f"Best NMF model:\n{best_nmf_model}")

Best NMF model:
{'num_topics': 30, 'reconstruction_error': 6.623565181562852e-05, 'sparsity_W': 0.43985675281445186, 'sparsity_H': 0.31946327885233033, 'word_topic_table':                        Topic1                Topic2                 Topic3  \
Word1          (0.0128, data)         (0.0171, set)        (0.0167, phase)   
Word2           (0.006, user)     (0.0151, theorem)  (0.0122, temperature)   
Word3   (0.0052, information)       (0.0125, proof)   (0.0106, transition)   
Word4   (0.0048, application)       (0.0111, prove)        (0.0093, state)   
Word5      (0.0043, research)         (0.0097, let)       (0.007, lattice)   
Word6       (0.004, analysis)  (0.0095, polynomial)        (0.006, effect)   
Word7      (0.0039, software)        (0.0087, ring)      (0.0056, density)   
Word8      (0.0038, approach)      (0.0086, finite)         (0.0055, wave)   
Word9          (0.0036, tool)      (0.0086, number)         (0.0053, spin)   
Word10       (0.0035, design)      (0.0084, resu

Here are the features of the best NMF model:
- `Number of Topics`: 30
- `Reconstruction Error`: 6.62e-05 (mean squared error between the TF-IDF matrix and its reconstruction). This is the lowest error among the NMF models trained. Because most TF-IDF entries are zero, the absolute value is small by construction, so it is most useful for comparing NMF models with each other.
- `Sparsity (W)`: 0.44 (document-topic matrix). As computed in the code, this is the share of non-zero entries, so a lower value means a sparser matrix. On average, a document has non-zero weight on about 13 of the 30 topics.
- `Sparsity (H)`: 0.32 (topic-word matrix). About a third of the topic-word weights are non-zero, so each topic is defined by a subset of the vocabulary, potentially making the topics more interpretable.
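
In the evaluation code, the values labelled sparsity are computed as `np.mean(W != 0)`, which is the share of non-zero entries. A minimal sketch on a hypothetical 4-document, 5-topic matrix shows how that quantity relates to the share of zeros and to the average number of active topics per document:

```python
import numpy as np

# Hypothetical document-topic matrix: 4 documents x 5 topics.
W = np.array([
    [0.8, 0.0, 0.0, 0.2, 0.0],
    [0.0, 0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
    [0.3, 0.3, 0.0, 0.4, 0.0],
])

density = np.mean(W != 0)                      # share of non-zero entries (what the notebook reports)
sparsity = np.mean(W == 0)                     # share of zero entries
topics_per_doc = (W != 0).sum(axis=1).mean()   # average number of active topics per document

print(f"density = {density:.2f}, sparsity = {sparsity:.2f}")
print(f"average non-zero topics per document = {topics_per_doc:.2f}")
```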

Abstract Topics for the NMF Model:
1. Topic 1:

- **Dominant Words:** data, user, information, application, research
- **Possible Topic:** Data Science and Applications
- **Description:** Focuses on the use of data and software tools to analyze, design, and apply methods in information-driven research and development.

2. Topic 2:

- **Dominant Words:** set, theorem, proof, prove, let
- **Possible Topic:** Mathematics and Theoretical Proofs
- **Description:** Explores mathematical structures and theories, including set theory, polynomial analysis, and number theory, with an emphasis on theorems and proofs.

3. Topic 3:

- **Dominant Words:** phase, temperature, transition, state, lattice
- **Possible Topic:** Thermodynamics and Material Science
- **Description:** Examines physical phenomena related to phase transitions, thermal properties, and lattice structures in materials.

4. Topic 4:

- **Dominant Words:** equation, solution, differential, nonlinear, wave
- **Possible Topic:** Differential Equations and Mathematical Modeling
- **Description:** Centers around solving mathematical equations (e.g., Schrodinger equation) with applications in physics, engineering, and numerical modeling.

5. Topic 5:

- **Dominant Words:** model, market, price, volatility, option
- **Possible Topic:** Financial Modeling and Risk Analysis
- **Description:** Focuses on mathematical and statistical models for financial markets, including pricing, volatility analysis, and risk management.

### Comparison Between LDA and NMF:
1. Interpretability:
- LDA topics are well-separated and interpretable, focusing on high-level themes like energy, systems, and algorithms.
- NMF provides fine-grained topics, but their interpretability might require domain-specific knowledge.

2. Performance:

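- LDA's perplexity (2324.47) was the lowest among the LDA models trained. It was computed on the training data, so it measures fit rather than generalization to unseen documents.
- NMF's reconstruction error (6.62e-05) was the lowest among the NMF models trained. Perplexity and reconstruction error are on different scales, so the two numbers cannot be compared with each other directly.

3. Sparsity:

- NMF produces exact zeros in its factor matrices: in the best model, about 56% of the document-topic weights and 68% of the topic-word weights are zero. This can make topics easier to interpret but may miss overlapping themes.
- LDA assigns a non-zero probability to every word in every topic, so it captures overlapping word usage more naturally.

4. Flexibility:

- LDA provides overlapping word distributions across topics, making it better for capturing nuanced relationships.
- NMF enforces non-negativity, so topics combine additively. This tends to keep topics concise, although it does not guarantee that they are non-overlapping.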
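
A note on reading the reconstruction error: a mean squared error taken over a mostly-zero matrix is small in absolute terms even when nothing useful has been reconstructed. The sketch below, on a hypothetical sparse matrix, compares the MSE with a relative Frobenius-norm error for a deliberately useless reconstruction that predicts zero everywhere. The matrix size and the 2% fill rate are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sparse non-negative matrix: about 2% of the entries are non-zero.
X = rng.random((200, 500)) * (rng.random((200, 500)) < 0.02)

# A deliberately useless "reconstruction" that predicts zero everywhere.
X_hat = np.zeros_like(X)

mse = np.mean((X - X_hat) ** 2)
relative_error = np.linalg.norm(X - X_hat) / np.linalg.norm(X)

print(f"MSE = {mse:.5f}")                        # small, because most entries are zero
print(f"relative error = {relative_error:.2f}")  # 1.00, i.e. nothing was reconstructed
```

A relative error of this kind is easier to interpret across matrices of different size and sparsity than the raw MSE.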

## Conclusion

After evaluating both the LDA and NMF models, the **NMF model** is chosen as the better fit for this use case. The dataset's structure, which includes subtopics within general topics, aligns more effectively with the interpretability and coherence of the topics generated by the **NMF model**. Here’s why the NMF model was selected:

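1. Quantitative Reasons:
- Reconstruction Error: The selected NMF model had the lowest reconstruction error (6.62e-05) of the NMF models trained, indicating a close fit to the TF-IDF matrix. This value is not directly comparable to the LDA model's perplexity score, so it supports the NMF model's internal fit rather than a numerical ranking against LDA.
- Sparsity: About 56% of the document-topic weights and 68% of the topic-word weights in the NMF model are zero, which gives compact and interpretable representations of topics, helps limit overlap, and improves clarity.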
2. Qualitative Reasons:
- Coherence of Topics: The NMF model produced topics with well-defined dominant words that clearly align with the dataset's general themes and subtopics, such as "Data Science and Applications" and "Thermodynamics and Material Science."
- Alignment with Dataset Structure: The subtopic-focused nature of the dataset favors NMF's deterministic approach to topic generation, resulting in more granular and interpretable topic clusters compared to the probabilistic LDA model.
- Stakeholder Usability: The topics derived from the NMF model provide actionable insights and categorizations that align with the stakeholders' goals of extracting well-defined subtopics for further analysis or application.
3. Limitations of the LDA Model:
While the LDA model is effective in probabilistically identifying broader themes, its topics lacked the granularity and coherence required for this dataset's subtopic-focused structure. This reduced its usefulness for stakeholders who rely on precise topic delineations.

The NMF model effectively addresses the use case by providing granular, coherent, and interpretable topics that align with the dataset’s structure. This ensures that stakeholders can utilize the topics meaningfully for further categorization, analysis, or reporting.