# **Project 4: LLM Project Activity - Topic Modeling**




### **Week 23** Project 4: 3-Pre-Trained-Model
* Select a pre-trained model for your project and perform data preprocessing. (9.4)
* Apply transfer learning concepts to enhance the model, followed by evaluating and optimizing the project model and creating an LLM Model Report. (9.5)

**Observation:**
After research, the conclusion is that Hugging Face's `pipeline()` has no built-in task for topic modeling, as it does for sentiment analysis or summarization. Because topic modeling is an unsupervised NLP task, it requires a different library. In response, this project uses BERTopic, with a pretrained transformer model supplying the embeddings.
Note: the course requests that inference be run in this Hugging Face pipeline format:

Hugging Face pipeline()
data = ['list of the text for inference']
preds = pipe(data)

Because this project uses BERTopic, the following calls are used instead; they mirror the pipeline format and pass the text data in the required way:
docs = ds_test_df['text'].tolist()
topics, probs = topic_model.transform(docs)

Additionally, the Hugging Face pipeline expects:
- A list of strings, and returns predictions (classification, generation, etc.)

BERTopic expects the same input format:
- A list of strings (documents), and returns topics and their probabilities.
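
The parallel between the two call styles can be sketched as a small wrapper. This is only an illustration: `topic_pipe` and `StubTopicModel` are hypothetical names, the stub stands in for a fitted BERTopic model so the snippet runs without downloads, and it assumes `transform` returns one score per document.

```python
from typing import Any, Dict, List, Sequence


def topic_pipe(model: Any, data: Sequence[str]) -> List[Dict[str, Any]]:
    """Hypothetical helper: call model.transform and return pipeline-style dicts."""
    topics, probs = model.transform(list(data))
    return [
        {"topic": int(topic), "score": float(prob)}
        for topic, prob in zip(topics, probs)
    ]


class StubTopicModel:
    """Stand-in for a fitted topic model (assumption, not the real BERTopic class)."""

    def transform(self, docs):
        # Toy rule: short documents go to topic 0, longer ones to topic 1.
        topics = [0 if len(doc) < 20 else 1 for doc in docs]
        probs = [0.9 for _ in docs]
        return topics, probs


data = ["short text", "a noticeably longer piece of text"]
preds = topic_pipe(StubTopicModel(), data)
print(preds)
```

With the real model, `topic_pipe(topic_model, docs)` would be the analogue of `pipe(data)`, provided the assumption about one score per document holds.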

Select a pre-trained model for your project and perform data preprocessing. (9.4)

In [None]:
#install required packages
!pip install bertopic



In [None]:
#selected and loaded pre-trained BERTopic model from Hugging Face
from bertopic import BERTopic
topic_model = BERTopic.load("guibvieira/topic_modelling")

topic_model.get_topic_info()



Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,241723,-1_liquidity_coins_bro_sorry,"[liquidity, coins, bro, sorry, morning, answer...",
1,0,3993,0_token_tokens_supply_value,"[token, tokens, supply, value, price, holders,...",
2,1,2185,1_main_problem_bro_tho,"[main, problem, bro, tho, brother, form, solut...",
3,2,2122,2_twitter_comment_space_social,"[twitter, comment, space, social, media, post,...",
4,3,2100,3_project_projects_interesting_research,"[project, projects, interesting, research, mat...",
...,...,...,...,...,...
6281,6280,15,6280_listing_project_list_simple,"[listing, project, list, simple, fine, team, a...",
6282,6281,15,6281_code_phone_number_withdraw,"[code, phone, number, withdraw, wrong, correct...",
6283,6282,15,6282_fun_details_dm_,"[fun, details, dm, , , , , , , ]",
6284,6283,15,6283_different_okay_wrong_people,"[different, okay, wrong, people, lol, sure, da...",


In [None]:
#upgrade fsspec and datasets first, then import
!pip install -U fsspec datasets
from datasets import load_dataset

Collecting fsspec
  Using cached fsspec-2025.5.1-py3-none-any.whl.metadata (11 kB)


In [None]:
ds = load_dataset('SetFit/20_newsgroups')
import pandas as pd
ds_train = pd.DataFrame(ds['train'])
ds_test = pd.DataFrame(ds['test'])

Repo card metadata block was not found. Setting CardData to empty.


In [None]:
from datasets import Dataset, DatasetDict

train = Dataset.from_pandas(ds_train)
test = Dataset.from_pandas(ds_test)

new_ds = DatasetDict({
    'train': train,
    'test': test
})

In [None]:
#pre-processing to improve embeddings and clustering quality in BERTopic
#text cleaning
import pandas as pd # Import pandas if not already imported
from datasets import load_dataset # Import load_dataset if not already imported
import re

# Load the dataset and convert to DataFrame if not already available
if 'ds_test' not in globals():
    ds = load_dataset('SetFit/20_newsgroups')
    ds_test = pd.DataFrame(ds['test'])

# Define 'docs' from the 'text' column of ds_test
docs = ds_test['text'].tolist()


def clean_text(text):
    text = text.lower() #lowercase
    text = re.sub(r"http\S+", "", text) #remove URLs
    text = re.sub(r"\n|\r", " ", text) #remove newlines
    text = re.sub(r"[^a-zA-Z\s]", "", text) #remove punctuation/symbols
    text = re.sub(r"\s+", " ", text).strip() #remove extra spaces
    return text

docs = [clean_text(doc) for doc in docs]

#pre-processing: stopword removal
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    return " ".join([word for word in text.split() if word not in stop_words])

docs = [remove_stopwords(doc) for doc in docs]

#lemmatization (reduce words to their dictionary form)
import spacy
# Download the spacy model if not already downloaded
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    !python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")


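def lemmatize_text(text):
    doc = nlp(text)
    # "-PRON-" is the pronoun placeholder lemma from spaCy v2; spaCy v3 no longer emits it, so this filter is a no-op there
    return " ".join([token.lemma_ for token in doc if token.lemma_ != "-PRON-"])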

docs = [lemmatize_text(doc) for doc in docs]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
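
To see what the cleaning step does to a single document, the regex stage can be run on a made-up sentence. This is a minimal sketch that repeats the same `clean_text` logic on its own (no NLTK or spaCy needed); the sample string is invented for illustration.

```python
import re


def clean_text(text: str) -> str:
    text = text.lower()                      # lowercase
    text = re.sub(r"http\S+", "", text)      # remove URLs
    text = re.sub(r"\n|\r", " ", text)       # replace newlines with spaces
    text = re.sub(r"[^a-zA-Z\s]", "", text)  # drop digits, punctuation, symbols
    text = re.sub(r"\s+", " ", text).strip() # collapse repeated whitespace
    return text


sample = "Visit http://example.com NOW!\nIt's 50% off."
cleaned = clean_text(sample)
print(cleaned)  # visit now its off
```

Note the side effects: numbers disappear entirely and contractions such as "It's" collapse into "its", which may or may not be desirable for a given corpus.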


Because this is a BERTopic model, transfer learning applies to the embedding model rather than to the topic model itself. Transfer learning here means using a pre-trained sentence transformer and fine-tuning the embedding model.

In [None]:
#topic model requires embedding model specified explicitly, added code to address error
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

#load sentence transformer embedding model; note: its tokenizer is bundled with the model and handled internally, so no separate tokenizer is needed for BERTopic
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

#load BERTopic with pretrained topic model AND embedding model
topic_model = BERTopic.load("guibvieira/topic_modelling", embedding_model=embedding_model)

In [None]:
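# Convert Hugging Face test set back to pandas DataFrame
ds_test_df = new_ds['test'].to_pandas()
# Note: this reassigns `docs` to the raw text, so the cleaned and lemmatized `docs` built in the pre-processing cell are not what is passed to transform()
docs = ds_test_df['text'].tolist()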

# Run BERTopic
topics, probs = topic_model.transform(docs)

Batches:   0%|          | 0/236 [00:00<?, ?it/s]

2025-07-01 03:35:48,408 - BERTopic - Predicting topic assignments through cosine similarity of topic and document embeddings.


In [None]:
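# See an overview of the topics (topic IDs, counts, names)
# Note: the counts come from the corpus the pre-trained model was fitted on, not from the documents passed to transform()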
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,241723,-1_liquidity_coins_bro_sorry,"[liquidity, coins, bro, sorry, morning, answer...",
1,0,3993,0_token_tokens_supply_value,"[token, tokens, supply, value, price, holders,...",
2,1,2185,1_main_problem_bro_tho,"[main, problem, bro, tho, brother, form, solut...",
3,2,2122,2_twitter_comment_space_social,"[twitter, comment, space, social, media, post,...",
4,3,2100,3_project_projects_interesting_research,"[project, projects, interesting, research, mat...",
...,...,...,...,...,...
6281,6280,15,6280_listing_project_list_simple,"[listing, project, list, simple, fine, team, a...",
6282,6281,15,6281_code_phone_number_withdraw,"[code, phone, number, withdraw, wrong, correct...",
6283,6282,15,6282_fun_details_dm_,"[fun, details, dm, , , , , , , ]",
6284,6283,15,6283_different_okay_wrong_people,"[different, okay, wrong, people, lol, sure, da...",


In [None]:
#show the top 25 topics by count (descending) with their top words
topic_info = topic_model.get_topic_info()
top_topics = topic_info[topic_info['Topic'] != -1].head(25)  # Exclude outliers (-1)

# Print top 25 topics with keywords
for topic_id in top_topics['Topic']:
    print(f"\nTopic {topic_id}:")
    print(topic_model.get_topic(topic_id))


Topic 0:
[['token', 0.004001408178575309], ['tokens', 0.0029945150278735655], ['supply', 0.0008455264937135059], ['value', 0.0006919489090980801], ['price', 0.000566843502246489], ['holders', 0.0005436769527987886], ['listing', 0.0004734404445939499], ['term', 0.0004562376632669302], ['benefits', 0.0003750672722392172], ['moment', 0.0003700421193462019]]

Topic 1:
[['main', 0.007226464143878195], ['problem', 0.001421015891857106], ['bro', 0.0013290071860660487], ['tho', 0.0012461907091677457], ['brother', 0.0012082180945614982], ['form', 0.0008479626147150821], ['solution', 0.0008349817812795357], ['words', 0.0007285316380101317], ['sense', 0.0007094432671324394], ['reason', 0.0006550526772855643]]

Topic 2:
[['twitter', 0.007901630402460707], ['comment', 0.002576973305001958], ['space', 0.0021819795307875013], ['social', 0.0019043336537359773], ['media', 0.0016370816135478658], ['post', 0.00114604692352579], ['announcement', 0.000678050862015953], ['updates', 0.0006433470881594879], 

In [None]:
#inspect predictions (topics per document)
import pandas as pd

results_df = pd.DataFrame({
    "document": docs,
    "topic": topics,
    "probability": probs
})
print(results_df.head())

                                            document  topic  probability
0  I am a little confused on all of the models of...    666     0.322807
1  I'm not familiar at all with the format of the...   2319     0.420362
2                                \nIn a word, yes.\n   2453     0.503035
3  \nThey were attacking the Iraqis to drive them...   1482     0.502123
4  \nI've just spent two solid months arguing tha...   2543     0.560197


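Notes on predictions output:

Per the BERTopic log above, this loaded model assigns topics by cosine similarity between document and topic embeddings, so the `probability` column is a similarity score rather than a calibrated probability.

A higher value (closer to 1.0) means the model is more confident in the topic assignment.

A lower value (closer to 0.0) means the document is less clearly matched to a topic.

If the topic were -1, it would mean the model considers the document an outlier that doesn't fit any topic well.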
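
The log line about cosine similarity can be illustrated with a toy calculation. The sketch below is not BERTopic's internal code; `assign_by_cosine` is a hypothetical helper and the 2-dimensional embeddings are made up, but it shows the general idea of picking the topic whose embedding is most similar to each document embedding.

```python
import numpy as np


def assign_by_cosine(doc_embeddings: np.ndarray, topic_embeddings: np.ndarray):
    """Return (best topic index, similarity score) for every document."""
    # Normalize rows so that a dot product equals cosine similarity.
    docs = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    topics = topic_embeddings / np.linalg.norm(topic_embeddings, axis=1, keepdims=True)
    similarity = docs @ topics.T                      # shape: (n_docs, n_topics)
    best = similarity.argmax(axis=1)                  # most similar topic per doc
    scores = similarity[np.arange(len(best)), best]   # similarity to that topic
    return best, scores


topic_embeddings = np.array([[1.0, 0.0], [0.0, 1.0]])              # two toy topics
doc_embeddings = np.array([[2.0, 0.0], [1.0, 2.0], [0.0, 3.0]])    # three toy docs

best, scores = assign_by_cosine(doc_embeddings, topic_embeddings)
print(best, scores)
```

The second toy document sits between the two topics, so its score is below 1.0 even though it is still assigned to its nearest topic.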


In [None]:
#how many labels (topics) predicted; unique topics (excluding -1 which means 'no topic assigned')
unique_topics = set(topics)
n_topics = len([t for t in unique_topics if t != -1])
print(f"Number of topics found: {n_topics}")

Number of topics found: 1564
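
Beyond the raw count, it can help to see how the assignments are distributed. The following is a minimal sketch using only the standard library; `summarize_assignments` is a hypothetical helper and `sample_topics` is a made-up list standing in for the `topics` returned by `transform`.

```python
from collections import Counter


def summarize_assignments(topics, top_n=3):
    """Count documents per topic and report the share of outliers (-1)."""
    counts = Counter(topics)
    total = len(topics)
    outlier_share = counts.get(-1, 0) / total if total else 0.0
    # Most frequent topics, skipping the outlier bucket.
    top = [(t, c) for t, c in counts.most_common() if t != -1][:top_n]
    n_topics = len([t for t in counts if t != -1])
    return {"n_topics": n_topics, "outlier_share": outlier_share, "top": top}


sample_topics = [666, 2319, 666, -1, 2453, 666, 2319, -1]
summary = summarize_assignments(sample_topics)
print(summary)
```

On the real output, `summarize_assignments(topics)` would show whether a few topics absorb most of the test documents or whether they are spread thinly across many topics.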


In [None]:
#summary of all topics
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,241723,-1_liquidity_coins_bro_sorry,"[liquidity, coins, bro, sorry, morning, answer...",
1,0,3993,0_token_tokens_supply_value,"[token, tokens, supply, value, price, holders,...",
2,1,2185,1_main_problem_bro_tho,"[main, problem, bro, tho, brother, form, solut...",
3,2,2122,2_twitter_comment_space_social,"[twitter, comment, space, social, media, post,...",
4,3,2100,3_project_projects_interesting_research,"[project, projects, interesting, research, mat...",
...,...,...,...,...,...
6281,6280,15,6280_listing_project_list_simple,"[listing, project, list, simple, fine, team, a...",
6282,6281,15,6281_code_phone_number_withdraw,"[code, phone, number, withdraw, wrong, correct...",
6283,6282,15,6282_fun_details_dm_,"[fun, details, dm, , , , , , , ]",
6284,6283,15,6283_different_okay_wrong_people,"[different, okay, wrong, people, lol, sure, da...",


Apply transfer learning concepts to enhance the model, followed by evaluating and optimizing the project model and creating an LLM Model Report. (9.5)

From the natural language processing task guides in the Hugging Face Transformers documentation, text classification was selected, as it may be the best fit for applying transfer learning to topic modeling.

In [None]:
#text classification - install necessary libraries from Hugging Face
!pip install transformers datasets evaluate accelerate



In [None]:
#load data set
from datasets import load_dataset
ds = load_dataset('SetFit/20_newsgroups')

Repo card metadata block was not found. Setting CardData to empty.


In [None]:
print(ds["test"][0])

{'text': 'I am a little confused on all of the models of the 88-89 bonnevilles.\nI have heard of the LE SE LSE SSE SSEI. Could someone tell me the\ndifferences are far as features or performance. I am also curious to\nknow what the book value is for prefereably the 89 model. And how much\nless than book value can you usually get them for. In other words how\nmuch are they in demand this time of year. I have heard that the mid-spring\nearly summer is the best time to buy.', 'label': 7, 'label_text': 'rec.autos'}


In [None]:
#preprocess: load the DistilBERT tokenizer to preprocess the text field
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

In [None]:
#create preprocessing function to tokenize text and truncate sequences to no longer than DistilBERT's max input length
# Define preprocessing function
def preprocess_function(examples):
    # Truncate only; padding is left to DataCollatorWithPadding, which pads each batch dynamically
    return tokenizer(examples["text"], truncation=True)

# Tokenize all splits
tokenized_ds = ds.map(preprocess_function, batched=True)

Map:   0%|          | 0/11314 [00:00<?, ? examples/s]

Map:   0%|          | 0/7532 [00:00<?, ? examples/s]

In [None]:
train_ds = tokenized_ds["train"]
eval_ds = tokenized_ds["test"]

In [None]:
#create batch examples with DataCollatorWithPadding
#Pytorch
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="pt")


In [None]:
#Evaluate model performance
import evaluate
accuracy = evaluate.load("accuracy")

Downloading builder script: 0.00B [00:00, ?B/s]

In [None]:
#function that passes predictions and labels to compute() to calculate accuracy
import numpy as np
from evaluate import load  #Hugging Face's metric loading

accuracy = load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

In [None]:
#check dataset labels
print(set(ds["train"]["label"]))

{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19}


In [None]:
#before training, map expected ids to labels
#manually define label names
label_names = [
    'alt.atheism',
    'comp.graphics',
    'comp.os.ms-windows.misc',
    'comp.sys.ibm.pc.hardware',
    'comp.sys.mac.hardware',
    'comp.windows.x',
    'misc.forsale',
    'rec.autos',
    'rec.motorcycles',
    'rec.sport.baseball',
    'rec.sport.hockey',
    'sci.crypt',
    'sci.electronics',
    'sci.med',
    'sci.space',
    'soc.religion.christian',
    'talk.politics.guns',
    'talk.politics.mideast',
    'talk.politics.misc',
    'talk.religion.misc'
]

# Create mappings
id2label = {i: label for i, label in enumerate(label_names)}
label2id = {label: i for i, label in enumerate(label_names)}
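
Since each record in the dataset carries both `label` and `label_text` (as in the printed test example), the manual list can be cross-checked by deriving the mapping from the data. This is a sketch under that assumption; `build_label_maps` is a hypothetical helper and the sample records are a tiny hand-written stand-in for `ds["train"]`.

```python
def build_label_maps(records):
    """Derive id2label/label2id from records holding 'label' and 'label_text'."""
    id2label = {}
    for record in records:
        label_id, label_text = record["label"], record["label_text"]
        # setdefault stores the first name seen; any later mismatch is an error.
        if id2label.setdefault(label_id, label_text) != label_text:
            raise ValueError(f"Conflicting names for label id {label_id}")
    id2label = dict(sorted(id2label.items()))
    label2id = {name: label_id for label_id, name in id2label.items()}
    return id2label, label2id


sample_records = [
    {"label": 7, "label_text": "rec.autos"},
    {"label": 0, "label_text": "alt.atheism"},
    {"label": 7, "label_text": "rec.autos"},
]
derived_id2label, derived_label2id = build_label_maps(sample_records)
print(derived_id2label)
```

Running the same helper over the full training split and comparing the result with the manually defined `id2label` would confirm that the hand-typed order matches the dataset.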

In [None]:
#train model, load DistilBert
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-uncased",
    num_labels=len(label_names),
    id2label=id2label,
    label2id=label2id
)

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
#define compute_metrics for the Trainer; the metrics are calculated whenever the Trainer runs evaluation. Macro F1 is the metric intended for choosing the best model
#metrics selected: accuracy (overall percentage of correctly classified documents) and macro F1 (F1 computed per class from precision/recall, then averaged across classes)
import numpy as np
from evaluate import load

# Load all required metrics
accuracy = load("accuracy")
f1 = load("f1")
precision = load("precision")
recall = load("recall")
roc_auc = load("roc_auc", "multiclass")  # the "multiclass" config is needed for multi-class AUC

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)

    # Multi-class ROC AUC expects per-class probabilities that sum to 1, so apply a softmax to the logits
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    class_probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

    return {
        "accuracy": accuracy.compute(predictions=preds, references=labels)["accuracy"],
        "f1_macro": f1.compute(predictions=preds, references=labels, average="macro")["f1"],
        "precision": precision.compute(predictions=preds, references=labels, average="macro")["precision"],
        "recall": recall.compute(predictions=preds, references=labels, average="macro")["recall"],
        "roc_auc_ovr": roc_auc.compute(prediction_scores=class_probs, references=labels, average="macro", multi_class="ovr")["roc_auc"]
    }

Downloading builder script: 0.00B [00:00, ?B/s]

Downloading builder script: 0.00B [00:00, ?B/s]

Downloading builder script: 0.00B [00:00, ?B/s]

Downloading builder script: 0.00B [00:00, ?B/s]
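
To make the difference between accuracy and macro F1 concrete, here is a small worked example in plain NumPy. It is a sketch of the standard definitions, not the `evaluate` implementation; `macro_f1` is a hypothetical helper and the labels are toy values.

```python
import numpy as np


def macro_f1(labels, preds) -> float:
    """Unweighted mean of per-class F1 scores."""
    labels, preds = np.asarray(labels), np.asarray(preds)
    scores = []
    for cls in np.unique(np.concatenate([labels, preds])):
        tp = np.sum((preds == cls) & (labels == cls))
        fp = np.sum((preds == cls) & (labels != cls))
        fn = np.sum((preds != cls) & (labels == cls))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); treat an empty class as 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))


toy_labels = [0, 0, 0, 0, 1, 2]
toy_preds = [0, 0, 0, 0, 0, 2]   # the single class-1 document is missed

toy_accuracy = float(np.mean(np.asarray(toy_labels) == np.asarray(toy_preds)))
toy_macro_f1 = macro_f1(toy_labels, toy_preds)
print(round(toy_accuracy, 3), round(toy_macro_f1, 3))  # 0.833 0.63
```

Accuracy stays high because the majority class is predicted well, while macro F1 drops because the missed class contributes an F1 of 0 with equal weight.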

In [None]:
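#no evaluation strategy is set here, so compute_metrics runs only when trainer.evaluate() is called, not during training
training_args = TrainingArguments(
    output_dir="newsgroups_model",             #where checkpoints & model are saved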
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    push_to_hub=False,
    save_strategy="steps",             # save checkpoints every N steps
    save_steps=25,
    save_total_limit=3,                # keep last 3 checkpoints to save disk space
)
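
As a rough check on what these settings imply, the step and checkpoint counts can be worked out by hand. The arithmetic below assumes a single device, no gradient accumulation, and that the last partial batch is kept; the training split size is taken from the `Map` progress output.

```python
import math

train_examples = 11314    # training split size shown in the Map progress output
batch_size = 16           # per_device_train_batch_size
num_epochs = 1            # num_train_epochs
save_steps = 25
save_total_limit = 3

# One optimizer step per batch under the stated assumptions.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Checkpoints triggered by save_steps, and how many survive the limit.
checkpoints_written = total_steps // save_steps
checkpoints_kept = min(checkpoints_written, save_total_limit)

print(steps_per_epoch, total_steps, checkpoints_written, checkpoints_kept)
```

Under these assumptions one epoch is 708 steps, which triggers 28 step-based saves of which only the last 3 remain on disk.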

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    tokenizer=tokenizer,
    data_collator=data_collator,       # pad each batch dynamically with the collator defined earlier
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()
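
Once training finishes, predictions come back as logits (for example from `trainer.predict(eval_ds).predictions`), which still need to be mapped to newsgroup names. The snippet below is a minimal sketch of that last step on toy values; `logits_to_labels` is a hypothetical helper and the three-class mapping is a shortened stand-in for the full `id2label`.

```python
import numpy as np


def logits_to_labels(logits, id2label):
    """Pick the highest-scoring class per row and translate it to its name."""
    pred_ids = np.asarray(logits).argmax(axis=1)
    return [id2label[int(i)] for i in pred_ids]


toy_id2label = {0: "alt.atheism", 1: "comp.graphics", 2: "comp.os.ms-windows.misc"}
toy_logits = [[0.1, 2.0, -1.0], [3.0, 0.2, 0.5]]

predicted_names = logits_to_labels(toy_logits, toy_id2label)
print(predicted_names)
```

The resulting names play the same role for the classifier that the topic IDs and top words play for BERTopic, which makes the two approaches easier to compare in the model report.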

In [7]:
import nbformat

input_path = "Project_4_3_pre-trained_model.ipynb"
output_path = "Project_4_3_pre-trained_model_cleaned.ipynb"

# Load the notebook
with open(input_path, 'r', encoding='utf-8') as f:
    nb = nbformat.read(f, as_version=4)

# Clear notebook-level metadata
nb.metadata = {}

# Clean each cell (only code cells have outputs and execution counts)
for cell in nb.cells:
    if cell.cell_type == "code":
        cell.outputs = []
        cell.execution_count = None
    cell.metadata = {}

# Save cleaned notebook
with open(output_path, 'w', encoding='utf-8') as f:
    nbformat.write(nb, f)

print(f"Cleaned notebook saved to {output_path}")

FileNotFoundError: [Errno 2] No such file or directory: 'Project_4_3_pre-trained_model.ipynb'

In [3]:
import os
os.listdir()

['.config', 'sample_data']