# **Project 4: LLM Project Activity - Topic Modeling**
### **Week 23** 3-Pre-Trained-Model

Select a pre-trained model for your project and perform data preprocessing. (9.4)

- The project task is topic modeling, which is typically an unsupervised learning task and does not require labeled data
- Commonly used methods include LDA (the classical approach) and BERTopic (which uses transformer embeddings)
- Research indicated that BERTopic is the recommended approach for topic modeling with transformers
- BERTopic cannot be used through the Hugging Face pipeline() API
- Although topic modeling is not a supported task type in Hugging Face's pipeline() function, a pre-trained model hosted on Hugging Face can still be used in the workflow
- The NLP task performed is topic modeling, covering text preprocessing, text embedding, topic extraction, and inference
- The Hugging Face pre-trained model for this task is: guibvieira/topic_modelling


In [None]:
#Load pre-trained model requirements: BERTopic
!pip install -U bertopic



In [None]:
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

#Load the embedding model (same one used when training)
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

#Load BERTopic model with embedding model passed in
topic_model = BERTopic.load("guibvieira/topic_modelling", embedding_model=embedding_model)

#Extract tokenized documents
docs_tokenized = ds_test["text"].tolist()

#Join tokens into strings
docs = [" ".join(tokens) for tokens in docs_tokenized]

#Transform
topics, probs = topic_model.transform(docs)

#Show topic info
print(topic_model.get_topic_info())

#Print first few documents’ topics
for i in range(5):
    print(f"Document {i} Topic: {topics[i]}, Probability: {probs[i]}")

Batches:   0%|          | 0/236 [00:00<?, ?it/s]

2025-07-04 07:11:32,931 - BERTopic - Predicting topic assignments through cosine similarity of topic and document embeddings.


      Topic   Count                                     Name  \
0        -1  241723             -1_liquidity_coins_bro_sorry   
1         0    3993              0_token_tokens_supply_value   
2         1    2185                   1_main_problem_bro_tho   
3         2    2122           2_twitter_comment_space_social   
4         3    2100  3_project_projects_interesting_research   
...     ...     ...                                      ...   
6281   6280      15         6280_listing_project_list_simple   
6282   6281      15          6281_code_phone_number_withdraw   
6283   6282      15                     6282_fun_details_dm_   
6284   6283      15         6283_different_okay_wrong_people   
6285   6284      15     6284_thought_bitcoin_members_service   

                                         Representation  Representative_Docs  
0     [liquidity, coins, bro, sorry, morning, answer...                  NaN  
1     [token, tokens, supply, value, price, holders,...                  

**Summary of Outputs**

Number of topics found:
- 6,286 topics: 6,285 numbered topics (0 to 6284) plus topic -1, which is usually the outlier or "no topic" group

Topic Distribution:
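- Topic -1 has the largest count, with 241,723 documents, indicating these documents didn't fit well into any coherent topic. Note that get_topic_info() reports the counts stored in the pre-trained model, and transform() does not update them, so these figures most likely describe the data the model was fitted on rather than ds_test (the progress bar shows only 236 embedding batches for this run)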
- Other topics have much smaller document counts (e.g., topic 0 has 3,993 docs, topic 1 has 2,185 docs, etc.)
- Distribution shows many very specific topics, some with as few as 15 documents
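
To see how the documents passed to transform() are distributed, one option is to count the values in the returned `topics` list directly. The sketch below uses a small made-up list in place of the real `topics` output, and the helper name `summarize_assignments` is hypothetical.

```python
from collections import Counter


def summarize_assignments(topics, outlier_id=-1):
    """Summarize a list of per-document topic IDs (hypothetical helper)."""
    counts = Counter(topics)
    total = len(topics)
    return {
        "n_docs": total,
        # Number of distinct topics used, not counting the outlier group
        "n_topics_used": len([t for t in counts if t != outlier_id]),
        # Share of documents that landed in the outlier group
        "outlier_share": counts[outlier_id] / total if total else 0.0,
        # The two most frequent topic IDs with their document counts
        "largest": counts.most_common(2),
    }


# Made-up assignments for illustration only; replace with the real `topics`
toy_topics = [347, -1, 2453, -1, 2543, 347, 0, -1]
summary = summarize_assignments(toy_topics)
print(summary)
```

With the real output, `summarize_assignments(topics)` would describe the test documents themselves rather than the topic table stored in the model.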

Topic Names and Keywords:
- Each topic is labeled with a numeric ID and a name constructed from its top representative words (e.g., 0_token_tokens_supply_value, 1_main_problem_bro_tho)
- Keywords provide insight into the main theme or concept captured by each topic
- Example keywords for topic 0 include: token, tokens, supply, value, price, holders. These are likely related to cryptocurrency tokens
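
The naming pattern visible in the output (topic ID followed by the first four keywords, joined with underscores) can be reproduced with a short sketch. The function `build_topic_name` is a hypothetical helper written to match the printed names, not a BERTopic API.

```python
def build_topic_name(topic_id, keywords, n_words=4):
    """Join the topic ID and its first n_words keywords with underscores."""
    return "_".join([str(topic_id)] + list(keywords[:n_words]))


# Keywords for topic 0 as listed in the Representation column of the output
topic_0_keywords = ["token", "tokens", "supply", "value", "price", "holders"]
name = build_topic_name(0, topic_0_keywords)
print(name)
```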

Topic Representations:
- Each topic has a list of keywords (e.g., topic 2: twitter, comment, space, social, media, post), which summarize the topic's content.

Document Topic Assignments for the first 5 documents:
- Document 0 assigned to topic 347 with probability ~0.38
- Document 1 assigned to topic 3907 with probability ~0.43
- Document 2 assigned to topic 2453 with probability ~0.53
- Document 3 assigned to topic 1482 with probability ~0.54
- Document 4 assigned to topic 2543 with probability ~0.55

Interpretation of Probabilities:
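- Indicate the strength of the topic assignment for each document. The log line in the output shows that this run predicts topics through cosine similarity of topic and document embeddings, so the score is best read as a similarity rather than a calibrated probability.
- A higher value means the document fits that topic better.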
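
A minimal sketch of assignment by cosine similarity, the method named in the BERTopic log line in the output. The toy 3-dimensional vectors and the helper names are made up for illustration; real embeddings come from the embedding model and have many more dimensions, and this is not BERTopic's internal code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def assign_topic(doc_embedding, topic_embeddings):
    """Return the topic ID with the highest similarity, plus that score."""
    scores = {
        topic_id: cosine_similarity(doc_embedding, embedding)
        for topic_id, embedding in topic_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy topic embeddings (assumption: stand-ins for the model's topic vectors)
toy_topic_embeddings = {
    0: [1.0, 0.0, 0.0],
    1: [0.0, 1.0, 0.0],
    2: [0.0, 0.0, 1.0],
}
toy_doc_embedding = [0.9, 0.1, 0.0]

best_topic, score = assign_topic(toy_doc_embedding, toy_topic_embeddings)
print(best_topic, round(score, 3))
```

Under this reading, a score of about 0.38 (as for Document 0) would mean the document is only moderately close to its best-matching topic.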

Overall Interpretation:
- The BERTopic model contains a very large number of fine-grained topics.
- In the model's topic table, most documents fall into outlier topic -1, possibly indicating that many documents don't fit the discovered topics well or that the topics could be pruned for better clarity.
- Named topics provide useful keywords summarizing major themes, which helps in understanding the corpus's thematic structure.
- The model assigns each document to a topic with a score indicating how strongly the document matches that topic.
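
The next cell reruns inference after lowercasing, stripping punctuation, and dropping stopwords. As a small illustration of what that cleaning does to one token list, here is a sketch that uses a tiny hardcoded stopword set (an assumption, standing in for the NLTK English list so the example runs without a download). The name `clean_tokens_demo` is hypothetical.

```python
import re

# Assumption: a tiny stand-in for NLTK's English stopword list
DEMO_STOP_WORDS = {"the", "is", "a", "to", "of", "and"}


def clean_tokens_demo(tokens, stop_words=DEMO_STOP_WORDS):
    """Lowercase tokens, strip non-word characters, drop empties and stopwords."""
    cleaned = [re.sub(r"\W+", "", word.lower()) for word in tokens]
    return [word for word in cleaned if word and word not in stop_words]


sample = ["The", "token", "supply", "is", "fixed", "!", "Check", "the", "price."]
result = clean_tokens_demo(sample)
print(result)
```

The cleaned text is shorter and less like natural sentences, which can shift its embedding relative to the uncleaned version.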

In [None]:
#Rerun with additional preprocessing: lowercase, remove punctuation and stopwords
import re
from nltk.corpus import stopwords
from nltk import download
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

#Download NLTK stopwords (only run once)
download("stopwords")
stop_words = set(stopwords.words("english"))

#Define text cleaning function
def clean_tokens(tokens):
    # Lowercase, remove punctuation and stopwords
    cleaned = [re.sub(r"\W+", "", word.lower()) for word in tokens]
    return [word for word in cleaned if word and word not in stop_words]

#Load embedding model and BERTopic model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
topic_model = BERTopic.load("guibvieira/topic_modelling", embedding_model=embedding_model)

#Extract and preprocess your tokenized dataset
docs_tokenized = ds_test["text"].tolist()  # Assuming each entry is a list of tokens
docs_cleaned = [" ".join(clean_tokens(tokens)) for tokens in docs_tokenized]

#Transform using BERTopic
topics, probs = topic_model.transform(docs_cleaned)

#View topic summary
print(topic_model.get_topic_info())

#Show topic assignments for a few example docs
for i in range(5):
    print(f"Document {i} → Topic: {topics[i]} | Probability: {probs[i]}")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Batches:   0%|          | 0/236 [00:00<?, ?it/s]

2025-07-04 07:54:23,767 - BERTopic - Predicting topic assignments through cosine similarity of topic and document embeddings.


      Topic   Count                                     Name  \
0        -1  241723             -1_liquidity_coins_bro_sorry   
1         0    3993              0_token_tokens_supply_value   
2         1    2185                   1_main_problem_bro_tho   
3         2    2122           2_twitter_comment_space_social   
4         3    2100  3_project_projects_interesting_research   
...     ...     ...                                      ...   
6281   6280      15         6280_listing_project_list_simple   
6282   6281      15          6281_code_phone_number_withdraw   
6283   6282      15                     6282_fun_details_dm_   
6284   6283      15         6283_different_okay_wrong_people   
6285   6284      15     6284_thought_bitcoin_members_service   

                                         Representation  Representative_Docs  
0     [liquidity, coins, bro, sorry, morning, answer...                  NaN  
1     [token, tokens, supply, value, price, holders,...                  

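**Final Observation:** The topic table printed after the additional preprocessing is unchanged from the initial run. This is likely because inference uses the fixed topics of the pre-trained model: transform() only assigns the input documents to those existing topics, and get_topic_info() reports the stored topics and counts rather than anything computed from the new input. Preprocessing can still shift the SentenceTransformers embeddings, so the per-document topics and probabilities are the values to compare between the two runs.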
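
One way to check whether preprocessing changed anything at the document level is to compare the two `topics` lists position by position. The sketch below is a hypothetical helper; the "before" list reuses the five assignments reported in the summary, while the "after" list is made up purely for illustration.

```python
def compare_assignments(topics_before, topics_after):
    """Report which documents changed topic between two inference runs."""
    if len(topics_before) != len(topics_after):
        raise ValueError("Both runs must cover the same documents")
    changed = [
        i for i, (before, after) in enumerate(zip(topics_before, topics_after))
        if before != after
    ]
    total = len(topics_before)
    return {
        "n_docs": total,
        "n_changed": len(changed),
        "share_changed": len(changed) / total if total else 0.0,
        "changed_indices": changed,
    }


# First five assignments from the initial run, as listed in the summary
topics_initial = [347, 3907, 2453, 1482, 2543]
# Made-up values standing in for the rerun; replace with the real output
topics_cleaned = [347, 3907, 12, 1482, -1]

report = compare_assignments(topics_initial, topics_cleaned)
print(report)
```

In the notebook, this would require keeping the first run's result under a separate name (for example `topics_initial`) before the second cell overwrites `topics`.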