# **Project 4: LLM Project Activity - Topic Modeling**
### **Week 23** 4-Optimization
Apply transfer learning concepts to enhance the model, followed by evaluating and optimizing the project model and creating an LLM Model Report. (9.5)

**First Step:** Transfer Learning Concepts
- Continuing with the same dataset, the next step is to fine-tune a model. This starts with loading the packages needed to load the model, prepare the dataset, train, and evaluate performance.
- From the Hugging Face task guides for Natural Language Processing (NLP), "Text Classification" was selected because it aligns with the next goal: a predictive model that performs supervised prediction of topics on unseen documents, using the BERTopic topic assignments as labels.
- Output of this process will be a classifier that assigns the most likely topic to new text (instead of re-running BERTopic)

Additional information about tokenizer, classification model, data, and task:
- Tokenizer: sentence-transformers/all-MiniLM-L6-v2
- Classification model: microsoft/MiniLM-L12-H384-uncased (MiniLM architecture with classification head). This is a lightweight transformer-based model pre-trained for general language understanding. "With classification head" means that loading it through `AutoModelForSequenceClassification` adds a newly initialized classification layer on top, which must be fine-tuned before the model can perform text classification.
- Data: Preprocessed and tokenized dataset from BERTopic labels
- Task: Text classification (with accuracy and F1 metrics)



In [None]:
#save model; mount to Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
#Install required packages
!pip install transformers datasets torch accelerate evaluate



In [None]:
#set up and imports
import os
from datasets import Dataset, DatasetDict, load_dataset, ClassLabel
import pandas as pd
from sklearn.model_selection import train_test_split
import evaluate
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer
)

In [None]:
#Prepare data
df = pd.DataFrame({
    "text": docs_cleaned,            # Preprocessed text strings
    "topic": topics                  # BERTopic-generated labels
})

# Remove rare topics
counts = df["topic"].value_counts()
df = df[df["topic"].isin(counts[counts>=2].index)]

# Map topic IDs to 0…N-1
topic_to_label = {t:i for i,t in enumerate(sorted(df["topic"].unique()))}
df["label"] = df["topic"].map(topic_to_label)

# Stratified train/validation split
df_train, df_val = train_test_split(df, stratify=df["label"], test_size=0.2, random_state=42)

dataset = DatasetDict({
    "train": Dataset.from_pandas(df_train[["text","label"]].reset_index(drop=True)),
    "validation": Dataset.from_pandas(df_val[["text","label"]].reset_index(drop=True))
})

In [None]:
#Tokenize and format
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def preprocess(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

encoded = dataset.map(preprocess, batched=True)
encoded = encoded.rename_column("label", "labels")

# Specify label feature for correct dtype
num_labels = len(topic_to_label)
encoded = encoded.cast_column("labels", ClassLabel(num_classes=num_labels))
encoded.set_format("torch")

Map:   0%|          | 0/5519 [00:00<?, ? examples/s]

Map:   0%|          | 0/1380 [00:00<?, ? examples/s]

Casting the dataset:   0%|          | 0/5519 [00:00<?, ? examples/s]

Casting the dataset:   0%|          | 0/1380 [00:00<?, ? examples/s]

In [None]:
#Define metrics
accuracy = evaluate.load("accuracy")
f1w = evaluate.load("f1")

def compute_metrics(pred):
    logits, labels = pred
    preds = logits.argmax(axis=-1)
    return {
        "accuracy": accuracy.compute(predictions=preds, references=labels)["accuracy"],
        "f1_weighted": f1w.compute(predictions=preds, references=labels, average="weighted")["f1"]
    }

In [None]:
#Load and fine tune model
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/MiniLM-L12-H384-uncased",
    num_labels=num_labels
)

training_args = TrainingArguments(
    output_dir="my_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    push_to_hub=False
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at microsoft/MiniLM-L12-H384-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[34m[1mwandb[0m: Currently logged in as: [33malia-locken[0m ([33malia-locken-lighthouse-labs[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Step,Training Loss
10,6.7089
20,6.711
30,6.7017
40,6.6944
50,6.6868
60,6.6607
70,6.6364
80,6.6221
90,6.6359
100,6.6254


TrainOutput(global_step=1035, training_loss=6.267840712542695, metrics={'train_runtime': 208.8339, 'train_samples_per_second': 79.283, 'train_steps_per_second': 4.956, 'total_flos': 276678068599296.0, 'train_loss': 6.267840712542695, 'epoch': 3.0})

In [None]:
trainer.evaluate()

{'eval_loss': 6.021093845367432,
 'eval_accuracy': 0.08333333333333333,
 'eval_f1_weighted': 0.031218083104875557,
 'eval_runtime': 3.1587,
 'eval_samples_per_second': 436.894,
 'eval_steps_per_second': 27.543,
 'epoch': 3.0}

**Training and Evaluation Summary**

The model was trained for 3 epochs, completing a total of 1,035 steps. The training loss was approximately 6.27, which is relatively high and indicates that the model struggled to learn meaningful representations from the training data. Training processed about 79 samples per second (approximately 5 steps per second); this throughput reflects hardware capabilities rather than learning quality.

On evaluation, the model produced a loss of ~6.02, very close to the training loss. This suggests the model did not overfit, but also that it failed to learn effectively. Accuracy is extremely low at 8.3%. With several hundred topic classes (as often produced by BERTopic), uniform random guessing would score well under 1%, so 8.3% most likely reflects the model defaulting to one or a few frequent classes rather than learning topic distinctions. The weighted F1 score of ~0.03 supports this reading: the model cannot meaningfully differentiate between classes, especially in the presence of class imbalance.
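To put these numbers in context, a chance-level baseline can be computed. A classifier that spreads probability uniformly over K classes has cross-entropy loss ln(K) and expected accuracy 1/K. The sketch below assumes K = 800 purely for illustration (the actual class count is whatever `num_labels` evaluated to in the run), and the helper name `chance_baseline` is hypothetical.

```python
import math

def chance_baseline(num_classes: int) -> dict:
    """Loss and accuracy of a classifier that predicts all classes uniformly."""
    return {
        "loss": math.log(num_classes),   # cross-entropy of a uniform prediction
        "accuracy": 1.0 / num_classes,   # expected accuracy of uniform guessing
    }

# Assumption: 800 classes, used only as an illustrative value.
baseline = chance_baseline(800)
observed = {"loss": 6.02, "accuracy": 0.0833}  # rounded values reported above

print(f"uniform loss     = {baseline['loss']:.2f} (observed {observed['loss']:.2f})")
print(f"uniform accuracy = {baseline['accuracy']:.4f} (observed {observed['accuracy']:.4f})")
```

Under this assumption the observed loss sits only somewhat below the uniform value, while the observed accuracy is well above 1/K. That pattern would fit a model that favors a handful of frequent classes, although the notebook does not confirm this directly.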

Several contributing factors may explain the poor performance. First, a large number of topic classes (e.g., 800+) creates a challenging classification task with limited training examples per class; reducing or merging infrequent topics may help. Second, label imbalance can lead to poor generalization, and addressing it with better stratification or class weighting may be necessary. Third, suboptimal hyperparameters or model limitations (MiniLM being a lightweight model) may hinder learning, in which case adjusting the learning rate or switching to a more expressive architecture such as BERT-base could improve results. Finally, the unsupervised nature of BERTopic label generation may introduce label noise, so some manual curation or semi-supervised refinement could improve label quality. Overall, while the training pipeline runs end to end, the model is not learning effectively and will require targeted improvements to become viable.
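The class-weighting idea mentioned above can be sketched as follows. This is a minimal illustration, not something that was run in the notebook: the helper name `inverse_frequency_weights` and the toy labels are hypothetical, and the scaling follows the common "balanced" heuristic of n_samples / (n_classes * class_count).

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Return one weight per class index, proportional to 1 / class frequency.

    Weights are scaled as n_samples / (n_classes * count), so a perfectly
    balanced dataset would give every class a weight of 1.0.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return [n_samples / (n_classes * counts[c]) for c in sorted(counts)]

# Toy labels (hypothetical): class 0 is frequent, class 2 is rare.
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
weights = inverse_frequency_weights(toy_labels)
print(weights)  # rare classes receive larger weights
```

In PyTorch, weights like these could be supplied as `torch.nn.CrossEntropyLoss(weight=torch.tensor(weights))` inside a custom `Trainer.compute_loss` override. That wiring is not shown here and was not part of the original runs.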

In [None]:
#Save and inference
trainer.save_model("my_model")
tokenizer.save_pretrained("my_model")

#Pipeline usage for inference
from transformers import pipeline
classifier = pipeline("text-classification", model="my_model", tokenizer="my_model")
print(classifier("Here is a new unseen document..."))

Device set to use cuda:0


[{'label': 'LABEL_23', 'score': 0.02989175170660019}]


In [None]:
#save final trained model to Google Drive
model_save_path = "/content/drive/MyDrive/topic_modeling_project4"

#Save the model, tokenizer, and config
trainer.save_model(model_save_path)        #Save model, config, and training state
tokenizer.save_pretrained(model_save_path) #Save tokenizer files too

('/content/drive/MyDrive/topic_modeling_project4/tokenizer_config.json',
 '/content/drive/MyDrive/topic_modeling_project4/special_tokens_map.json',
 '/content/drive/MyDrive/topic_modeling_project4/vocab.txt',
 '/content/drive/MyDrive/topic_modeling_project4/added_tokens.json',
 '/content/drive/MyDrive/topic_modeling_project4/tokenizer.json')

**Next Step: Last Project Task**

Evaluate and Optimize model:
- Optimize performance
- Implement findings from the optimization task (Optimizing LLM Performance)
- Retrain model using changes to hyperparameters
- Bonus: fine-tune multiple models for a point of comparison

In [None]:
#Authenticate notebook and send model to hub
from huggingface_hub import notebook_login
notebook_login()


In [None]:
!apt-get install git-lfs #Size of model and assets
!pip install -q transformers datasets evaluate

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git-lfs is already the newest version (3.0.2-1ubuntu0.3).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.


In [None]:
repo_name = "topic_modeling_project4"  #model repo name

In [None]:
!pip install --upgrade transformers



In [None]:
#Model #1: adjusted hyperparameters to attempt to improve model performance
#Repeat all steps to train model, implement changes to Training Arguments to pass to Trainer object
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=repo_name,
    push_to_hub=True,
    learning_rate=1e-5,  #Lowered to prevent overshooting minima
    per_device_train_batch_size=8,  #Reduced batch for better generalization
    per_device_eval_batch_size=8,
    num_train_epochs=6,  # Added more epochs to allow learning with a harder task
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    metric_for_best_model="f1_weighted",
    greater_is_better=True
)

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [None]:
#Review model performance
#Note: trainer.train() has not been called on this new Trainer,
#so this evaluates the weights left over from the first run
metrics = trainer.evaluate()
print(metrics)

{'eval_loss': 6.021093845367432, 'eval_model_preparation_time': 0.0051, 'eval_accuracy': 0.08333333333333333, 'eval_f1_weighted': 0.031218083104875557, 'eval_runtime': 5.8309, 'eval_samples_per_second': 236.671, 'eval_steps_per_second': 29.67}


**Model #1 Outputs Summary:**
The evaluation of Model #1 is identical to the previous run: an evaluation loss of 6.02, accuracy of 8.3%, and a weighted F1 score of approximately 0.0312, all matching the earlier values digit for digit. This is expected, because the Model #1 cells build a new `Trainer` around the already-trained `model` object and call `trainer.evaluate()` without ever calling `trainer.train()`. The adjusted hyperparameters (lower learning rate, smaller batch size, 6 epochs) were therefore never applied, and this result says nothing about their effect. The only difference is evaluation throughput (around 237 samples per second), which is tied to hardware and batch size rather than learning. A valid comparison would require reloading the pre-trained checkpoint and training with the new arguments. The challenges identified earlier still apply: a high number of classes, label imbalance, noisy topic labels from BERTopic, and the limitations of a lightweight architecture like MiniLM for a complex classification task. Hyperparameters and label quality will likely require significant refinement for any meaningful performance gains.
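When comparing runs, it helps to confirm that each configuration starts from the pre-trained checkpoint and calls `trainer.train()` before evaluation. A minimal sketch of that pattern, reusing the hyperparameters listed for Model #1, might look like the following. The function name `retrain_from_checkpoint` and the default `output_dir` are hypothetical, and this code was not run as part of the notebook.

```python
def retrain_from_checkpoint(checkpoint, num_labels, train_ds, eval_ds,
                            compute_metrics, output_dir="model1_retrain"):
    """Reload pre-trained weights, train with new arguments, then evaluate.

    Imports are kept inside the function so the sketch can be defined
    without transformers installed.
    """
    from transformers import (
        AutoModelForSequenceClassification,
        Trainer,
        TrainingArguments,
    )

    # Fresh copy of the pre-trained checkpoint, so earlier fine-tuning
    # does not carry over into this run.
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=num_labels
    )
    args = TrainingArguments(
        output_dir=output_dir,
        learning_rate=1e-5,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        num_train_epochs=6,
        weight_decay=0.01,
        logging_steps=10,
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_ds,
        eval_dataset=eval_ds,
        compute_metrics=compute_metrics,
    )
    trainer.train()            # the step that applies the new hyperparameters
    return trainer.evaluate()  # metrics for the retrained weights
```

With the notebook's variables, a call could look like `retrain_from_checkpoint("microsoft/MiniLM-L12-H384-uncased", num_labels, encoded["train"], encoded["validation"], compute_metrics)`. Because the datasets are already padded to a fixed length, no tokenizer or data collator is passed to the `Trainer` in this sketch.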

In [None]:
#Model #2: because of the significant underperformance observed in Model #1, add class reduction to the pipeline along with adjusted hyperparameters to determine whether performance improves
from collections import Counter
import pandas as pd
from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
import evaluate

#Build DataFrame from preprocessed docs and BERTopic labels
df = pd.DataFrame({
    "text": docs_cleaned,     #list of cleaned text strings
    "topic": topics           #BERTopic-generated topic IDs
})

#Remove rare topics (class reduction)
topic_counts = Counter(df["topic"])
allowed = {t for t, c in topic_counts.items() if c >= 10}
df = df[df["topic"].isin(allowed)]

#Remap topics to contiguous label indices (0..N-1)
label_map = {t: i for i, t in enumerate(sorted(df["topic"].unique()))}
df["label"] = df["topic"].map(label_map)
num_labels = len(label_map)

#Stratified train/validation split
df_train, df_val = train_test_split(df, test_size=0.2, stratify=df["label"], random_state=42)

#Convert to Hugging Face Dataset format
dataset = DatasetDict({
    "train": Dataset.from_pandas(df_train[["text", "label"]].reset_index(drop=True)),
    "validation": Dataset.from_pandas(df_val[["text", "label"]].reset_index(drop=True)),
})

#Tokenization
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
def preprocess(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)
encoded = dataset.map(preprocess, batched=True).rename_column("label", "labels")
encoded.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

#Metrics
accuracy = evaluate.load("accuracy")
f1w = evaluate.load("f1")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(axis=-1)
    return {
        "accuracy": accuracy.compute(predictions=preds, references=labels)["accuracy"],
        "f1_weighted": f1w.compute(predictions=preds, references=labels, average="weighted")["f1"]
    }

#Load classification model aligned with tokenizer
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",  #larger model, trained on the reduced class set
    num_labels=num_labels
)

#TrainingArguments with improved hyperparameters
training_args = TrainingArguments(
    output_dir=repo_name,
    push_to_hub=True,
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=6,
    weight_decay=0.01,
    warmup_steps=100,
    max_grad_norm=1.0,
    metric_for_best_model="f1_weighted",
    greater_is_better=True,
    logging_dir="./logs",
    logging_steps=10,
)

#Train and evaluate
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()
metrics = trainer.evaluate()
print(metrics)

Map:   0%|          | 0/3531 [00:00<?, ? examples/s]

Map:   0%|          | 0/883 [00:00<?, ? examples/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
10,5.1609
20,5.1159
30,5.057
40,5.0779
50,5.0075
60,5.0357
70,5.0735
80,4.9929
90,5.0253
100,4.949


{'eval_loss': 2.913451910018921, 'eval_accuracy': 0.42242355605889015, 'eval_f1_weighted': 0.313321903378084, 'eval_runtime': 6.379, 'eval_samples_per_second': 138.423, 'eval_steps_per_second': 17.401, 'epoch': 6.0}


In [None]:
#Review model performance
trainer.evaluate()

{'eval_loss': 2.913451910018921,
 'eval_accuracy': 0.42242355605889015,
 'eval_f1_weighted': 0.313321903378084,
 'eval_runtime': 6.5197,
 'eval_samples_per_second': 135.435,
 'eval_steps_per_second': 17.025,
 'epoch': 6.0}

**Model #2 Outputs Summary:**
Building on the hyperparameter settings from Model #1, Model #2 introduced key changes to address the poor performance observed earlier. The pipeline was updated to reduce the number of classes by filtering out rare topics with fewer than 10 examples, reducing noise and data sparsity. Topic IDs were remapped to compact, contiguous label indices (0 to N-1), as the classification head requires. The train/validation split was stratified so that each split preserves the overall class proportions, which keeps the validation metrics representative of the training distribution.
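The filtering and remapping steps described above can be illustrated on a toy list of topic IDs. This is a minimal sketch in plain Python rather than the pandas code used in the notebook; the function name `filter_and_remap` and the toy data are hypothetical.

```python
from collections import Counter

def filter_and_remap(topics, min_count=10):
    """Keep topics with at least `min_count` documents and remap them to 0..N-1.

    Returns (kept_indices, labels, label_map), where kept_indices are the
    positions of the surviving documents in the original list.
    """
    counts = Counter(topics)
    allowed = {t for t, c in counts.items() if c >= min_count}
    label_map = {t: i for i, t in enumerate(sorted(allowed))}
    kept_indices = [i for i, t in enumerate(topics) if t in allowed]
    labels = [label_map[topics[i]] for i in kept_indices]
    return kept_indices, labels, label_map

# Toy topic IDs (hypothetical): topic 7 appears 3 times, topic 2 twice, topic 5 once.
toy_topics = [7, 2, 7, 5, 2, 7]
kept, labels, label_map = filter_and_remap(toy_topics, min_count=2)
print(label_map)  # {2: 0, 7: 1}
print(kept)       # [0, 1, 2, 4, 5]
print(labels)     # [1, 0, 1, 0, 1]
```

Note that BERTopic uses topic ID -1 for outlier documents. Whether to keep that ID as a class or drop it before training is a separate decision that this sketch does not address.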

For tokenization, the MiniLM tokenizer was used for efficient input encoding, while the model architecture was upgraded to a stronger bert-base-uncased to better handle the complexity of the reduced label set. Hyperparameters were adjusted with a lower learning rate, increased number of epochs (6 total), warmup steps, gradient clipping, and weight decay to promote more stable and effective training.

Evaluation metrics, including accuracy and weighted F1 score, were integrated and computed at the end of training to monitor progress.

Performance Improvements:
- Eval Loss dropped substantially from ~6.02 to ~2.91, indicating better learning and reduced validation error
- Eval Accuracy improved dramatically from ~8.3% to ~42.2%, showing the model now classifies many more examples correctly.
- Eval F1 Weighted increased from ~0.03 to ~0.31, reflecting improved handling of class imbalance and class differentiation
- Training for 6 epochs helped convergence, while evaluation speed remained reasonable and hardware-dependent

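Overall, the class reduction, the larger bert-base-uncased model, and the tuned hyperparameters together enabled the model to learn meaningful representations and improve significantly over previous runs. Note that Model #2 is scored on a smaller label set and a different validation split (883 examples rather than 1,380), so part of the gain reflects an easier task and the figures are not a like-for-like comparison.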
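The gap between accuracy (~42%) and weighted F1 (~0.31) is easier to interpret with a small worked example. The sketch below computes both metrics in plain Python for a toy case where a model always predicts the majority class. The function name `accuracy_and_weighted_f1` and the toy labels are hypothetical, and the weighting follows the usual definition (per-class F1 averaged by true-label support).

```python
from collections import Counter

def accuracy_and_weighted_f1(y_true, y_pred):
    """Compute accuracy and support-weighted F1 without external libraries."""
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    support = Counter(y_true)
    weighted_f1 = 0.0
    for cls, count in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = count - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        weighted_f1 += (count / n) * f1  # weight each class by its support
    return accuracy, weighted_f1

# Toy case (hypothetical): the model always predicts the majority class 0.
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0, 0]
acc, f1_weighted = accuracy_and_weighted_f1(y_true, y_pred)
print(f"accuracy={acc:.3f}, weighted F1={f1_weighted:.3f}")
```

In this toy case accuracy is 4/7 while weighted F1 is 32/77, because the classes that are never predicted contribute an F1 of zero. A weighted F1 that trails accuracy in this way can indicate that some classes are rarely or never predicted, which a per-class report would confirm.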

In [None]:
#Push Model #2 results to hub for later use
trainer.push_to_hub()

Uploading...:   0%|          | 0.00/438M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/alocken/topic_modeling_project4/commit/4af6ed2c8a434c49d24da312d8fe7052e14d0042', commit_message='End of training', commit_description='', oid='4af6ed2c8a434c49d24da312d8fe7052e14d0042', pr_url=None, repo_url=RepoUrl('https://huggingface.co/alocken/topic_modeling_project4', endpoint='https://huggingface.co', repo_type='model', repo_id='alocken/topic_modeling_project4'), pr_revision=None, pr_num=None)

In [None]:
from transformers import pipeline

#New unseen texts to classify
data = [
    "The Hubble Space Telescope captured a new image of a galaxy.",
    "I need help setting up my Linux dual boot system.",
    "The new graphics card performs exceptionally in gaming benchmarks.",
    "The Lakers won their game last night against the Bulls."
]

#Load the text-classification pipeline with your Model #2 from the Hub
classifier = pipeline("text-classification", model="alocken/topic_modeling_project4")

#Run predictions
predictions = classifier(data)

#Print results
for text, pred in zip(data, predictions):
    print(f"Text: {text}")
    print(f"Predicted Label: {pred['label']}, Confidence: {pred['score']:.2f}")
    print()

Device set to use cuda:0


Text: The Hubble Space Telescope captured a new image of a galaxy.
Predicted Label: LABEL_78, Confidence: 0.25

Text: I need help setting up my Linux dual boot system.
Predicted Label: LABEL_144, Confidence: 0.08

Text: The new graphics card performs exceptionally in gaming benchmarks.
Predicted Label: LABEL_144, Confidence: 0.23

Text: The Lakers won their game last night against the Bulls.
Predicted Label: LABEL_119, Confidence: 0.19



**Evaluation of model performance on unseen data:**

The model's predictions on unseen data have low confidence scores, ranging from roughly 8% to 25%. With at least 145 classes (labels up to LABEL_144 appear), a uniform score would be under 1%, so these scores are above chance but far from decisive. Three distinct labels were assigned across the four inputs: the Linux dual-boot text and the graphics-card text both received LABEL_144, which is plausible if that topic covers computer hardware and setup, although the label names alone do not confirm this. The low scores may be due to the complexity of the topic space, overlap between classes, or limited training data for certain topics. While the model produces plausible categorical outputs, the confidence values show there is room for improvement in handling unseen, diverse texts.
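The pipeline reports generic names such as LABEL_78 because the saved model config carries no mapping from class indices to topic names. A minimal sketch of building such a mapping from the notebook's `label_map` is shown below. The function name `build_label_names`, the toy mapping, and the readable topic names are all hypothetical.

```python
def build_label_names(label_map, topic_names=None):
    """Build id2label / label2id dicts for a model config.

    label_map maps original topic IDs to contiguous class indices
    (as in the notebook). topic_names optionally maps topic IDs to
    readable strings; when absent, "topic_<id>" is used.
    """
    topic_names = topic_names or {}
    id2label = {
        idx: topic_names.get(topic, f"topic_{topic}")
        for topic, idx in label_map.items()
    }
    label2id = {name: idx for idx, name in id2label.items()}
    return id2label, label2id

# Toy values (hypothetical), standing in for the notebook's label_map.
toy_label_map = {-1: 0, 3: 1, 12: 2}
toy_names = {3: "space_telescope", 12: "linux_boot"}
id2label, label2id = build_label_names(toy_label_map, toy_names)
print(id2label)  # {0: 'topic_-1', 1: 'space_telescope', 2: 'linux_boot'}
```

Passing these dicts when loading the model, for example `AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=num_labels, id2label=id2label, label2id=label2id)`, should store them in the config so that later pipeline output shows topic names instead of LABEL_n. This would need to be done before training and pushing the model.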

Ideas to further improve performance:
- refining and consolidating topic labels to reduce ambiguity and improve label quality
- gathering more balanced training data or using data augmentation to help model learn better across all classes
- experimenting with more powerful transformer models or fine-tuning hyperparameters to capture complex patterns and improve overall accuracy