<a href="https://colab.research.google.com/github/chenwh0/Natural-Language-Processing-work/blob/main/module6/TextClassificationModelsComparison.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Assignment: Advanced Text Classification and Model Interpretability**

This assignment implements multiple approaches to text classification, compares traditional machine learning with modern transformer models, and critically analyzes model behavior, interpretability, and real-world deployment considerations. The focus is on **understanding the complete pipeline**, from data preprocessing to model interpretation, rather than just achieving high accuracy.

# *Sources/References*

* https://www.geeksforgeeks.org/nlp/text-augmentation-techniques-in-nlp/

# *Installs & Imports*

In [None]:
!pip install transformers sentence-transformers datasets vaderSentiment gensim -U -q

In [None]:
# Data source library
from datasets import load_dataset

# Data preprocessing libraries
import pandas
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# Libraries for embedding
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt_tab')
from gensim.models import Word2Vec


from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# LLM libraries
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset
from tqdm import tqdm

# Get classification summary
from sklearn.metrics import classification_report

# Get TF-IDF & cosine similarity
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

import numpy

# Visualize data
import matplotlib.pyplot as pyplot

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# **1. Multi-Modal Text Classification Setup**

## Creating additional classification tasks
*To create fine-grained (rating-range) sentiment labels:* VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon- and rule-based sentiment analysis tool. It assigns each word a polarity score based on its emotional valence, then combines the word scores into an overall `compound` score for the whole text, which is mapped here to a 1-5 star rating. VADER is fast and simple to run on large datasets compared with generative LLMs or manual labeling.
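
A minimal sketch of the aggregation step, assuming a tiny hypothetical lexicon (`TOY_LEXICON`) in place of VADER's real one and ignoring its rule-based adjustments for negation, intensifiers, and punctuation. The `x / sqrt(x^2 + 15)` squashing follows the normalization used by the vaderSentiment package as far as I understand it; everything else is illustrative only.

```python
import math

# Hypothetical toy lexicon: word -> valence. VADER's real lexicon is far larger.
TOY_LEXICON = {"superb": 3.0, "good": 1.9, "slow": -1.0, "boring": -1.3, "awful": -2.0}

def toy_compound(text: str, alpha: float = 15.0) -> float:
    """Sum per-word valences, then squash the total into the open interval (-1, 1)."""
    words = text.lower().split()
    total = sum(TOY_LEXICON.get(word.strip(".,!?"), 0.0) for word in words)
    return total / math.sqrt(total * total + alpha)

print(toy_compound("The acting was superb, but the plot felt very slow and boring."))
```

Mixed reviews like this one land near zero because positive and negative words cancel, which is why the middle star ratings are the hardest for a lexicon approach to separate.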

*To create subject-based (aspect) classification labels:* A function counts mentions of words related to each subject (or the subject name itself) and assigns the review to the most-mentioned subject. This method was simple and quick to implement, but it misses many reviews that discuss a subject without using the listed keywords, so most samples fall into the "other" category.

## Fine-grained sentiment is more challenging than binary classification
In binary classification with balanced classes, a random guess is correct 50% of the time. In fine-grained sentiment, the larger number of classes lowers the chance that a random guess is correct (20% for five classes). Furthermore, the classification task gains nuance: a given input could plausibly fit more than one category (for example, a 3-star or a 4-star rating), yet the model can pick only one.
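
The arithmetic behind those baselines can be sketched as follows. The class counts are copied from this notebook's training-split outputs. The two helper functions are illustrative names, not part of the notebook.

```python
# Class counts copied from the notebook's outputs for the training split.
binary_counts = {0: 4265, 1: 4265}
star_counts = {1: 643, 2: 1329, 3: 2297, 4: 2343, 5: 1918}

def uniform_guess_accuracy(counts: dict) -> float:
    """Expected accuracy when every class is guessed with equal probability."""
    return 1 / len(counts)

def majority_class_accuracy(counts: dict) -> float:
    """Accuracy of always predicting the most frequent class."""
    return max(counts.values()) / sum(counts.values())

for name, counts in [("binary", binary_counts), ("five-star", star_counts)]:
    print(
        name,
        round(uniform_guess_accuracy(counts), 3),
        round(majority_class_accuracy(counts), 3),
    )
```

Because the star labels are imbalanced, the majority-class baseline is a more honest floor than the uniform guess when judging a five-class model.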

## Single-task learning and multi-task learning trade-offs
**Single-task learning**: Each model is tasked with one problem. Ex: Break a multi-class classification problem down into several individual binary classification models.
* Pros - Simple to develop and configure. Easier to optimize each model for its one specific classification task.
* Cons - Multiple models must be trained, so training costs are higher. More data is needed. No shared context/understanding between models.

**Multi-task learning**: Train one model to do multiple tasks (such as classifying into multiple classes).
* Pros - Cohesive context understanding. Efficient use of data.
* Cons - More complex to develop and configure. Easier tasks are learned well while the model still struggles with difficult ones. Ex: It classifies dogs vs. octopuses very well but not dogs vs. cats.
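
The single-task example of splitting a multi-class problem into binary problems can be sketched as a one-vs-rest relabelling. This is only one possible decomposition (one-vs-one is another), and `one_vs_rest_labels` and the toy labels are hypothetical.

```python
def one_vs_rest_labels(labels, classes):
    """Return {class: binary label list}, i.e. one binary task per class."""
    return {c: [1 if label == c else 0 for label in labels] for c in classes}

# Toy star ratings (1-5), standing in for the VADER-derived labels.
toy_labels = [5, 3, 1, 4, 3]
binary_tasks = one_vs_rest_labels(toy_labels, classes=[1, 2, 3, 4, 5])

for star, binary in binary_tasks.items():
    print(star, binary)
```

Each binary label list would train its own classifier, which makes the cost trade-off concrete: five classes mean five models to train and tune.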

In [None]:
# Load our data by specifying splits
splits = ["train", "validation", "test"]
data = load_dataset("rotten_tomatoes", split=splits)

In [None]:
dataframes = {}
for split_index, split in enumerate(splits):
    dataframe = pandas.DataFrame(data[split_index])
    dataframes[split] = dataframe
    print(f"\n\n\nSamples from {split} dataset ({len(dataframe)} total samples)")
    display(dataframe.sample(5, random_state=42))
    display(dataframe.value_counts("label"))





Samples from train dataset (8530 total samples)


index,text,label
2499,it would take a complete moron to foul up a sc...,1
8033,so we got ten little indians meets friday the ...,0
6808,"unfortunately , heartbreak hospital wants to c...",0
6057,sustains its dreamlike glide through a success...,0
6631,a modest and messy metaphysical thriller offer...,0


label,count
0,4265
1,4265





Samples from validation dataset (1066 total samples)


index,text,label
31,where janice beard falters in its recycled asp...,1
807,"some body is a shaky , uncertain film that nev...",0
413,". . . a solid , unassuming drama .",1
1025,it's a documentary that says that the alternat...,0
798,shafer's feature doesn't offer much in terms o...,0


label,count
0,533
1,533





Samples from test dataset (1066 total samples)


index,text,label
31,"the main story . . . is compelling enough , bu...",1
807,some motion pictures portray ultimate passion ...,0
413,there is a general air of exuberance in all ab...,1
1025,though there are many tense scenes in trapped ...,0
798,what we get in feardotcom is more like somethi...,0


label,count
0,533
1,533


In [None]:
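# Class counts for the VADER-derived star ratings (1-5).
# Requires multi_label_dataframes, so run the vader_star_rating cell further down first.
print("Multi-label train data")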
display(multi_label_dataframes[0].value_counts("label"))
print("Multi-label validation data")
display(multi_label_dataframes[1].value_counts("label"))
print("Multi-label test data")
display(multi_label_dataframes[2].value_counts("label"))

Multi-label train data


label,count
4,2343
3,2297
5,1918
2,1329
1,643


Multi-label validation data


label,count
3,281
4,275
5,271
2,154
1,85


Multi-label test data


label,count
3,271
4,263
5,257
2,186
1,89


In [None]:
def classify_aspects(review):
    # str.count matches substrings, so "acting" is also counted inside words such as "reacting"
    acting_mentions = review.count("acting") + review.count("actress") + review.count("actor")
    plot_mentions = review.count("plot")
    cinematography_mentions = review.count("cinematography") + review.count("lighting") + review.count("visual") + review.count("style")
    mentions = [acting_mentions, plot_mentions, cinematography_mentions]
    if max(mentions) == 0: # No mentions of the 3 aspects above
        return 3 # other=3
    return int(numpy.argmax(mentions)) # Ties go to the lowest label index

# Example review: acting and plot are each mentioned once, so the tie resolves to acting (0)
review = "The acting was superb, but the plot felt very slow and boring."

# Label every split with its most-mentioned aspect
aspects = ["acting", "plot", "cinematography", "other"] # In labels, acting=0, plot=1, cinematography=2, other=3
multi_aspect_dataframes = []
for split in data:
    dataframe = pandas.DataFrame(split)
    dataframe["label"] = dataframe["text"].apply(classify_aspects)
    multi_aspect_dataframes.append(dataframe)

print("Multi-aspect train data")
display(multi_aspect_dataframes[0].value_counts("label"))
print("Multi-aspect validation data")
display(multi_aspect_dataframes[1].value_counts("label"))
print("Multi-aspect test data")
display(multi_aspect_dataframes[2].value_counts("label"))

Multi-aspect train data


label,count
3,7903
0,269
2,183
1,175


Multi-aspect validation data


label,count
3,983
0,40
1,22
2,21


Multi-aspect test data


label,count
3,991
0,26
1,26
2,23


In [None]:
# Create multi-class (1-5 star) versions of each split from VADER compound scores
analyzer = SentimentIntensityAnalyzer()

def vader_star_rating(text):
    # compound is VADER's normalized overall score in [-1, 1]
    score = analyzer.polarity_scores(text)['compound']
    if score <= -0.6:
        return 1
    elif score <= -0.2:
        return 2
    elif score <= 0.2:
        return 3
    elif score <= 0.6:
        return 4
    else:
        return 5

# Despite the variable name, these are multi-class labels: each review gets exactly one star rating
multi_label_dataframes = []
for split in data:
    dataframe = pandas.DataFrame(split)
    dataframe["label"] = dataframe["text"].apply(vader_star_rating)
    multi_label_dataframes.append(dataframe)

# **2. Comparative Algorithm Implementation**

- Implemented **3 different approaches** to text classification:
  
  **Traditional ML Pipeline:**
  - TF-IDF or Count Vectorization + Logistic Regression/SVM
  - Include proper preprocessing (tokenization, stopword removal, etc.)
  
  **Embedding-Based Approach:**
  - Sentence transformers or pre-trained word embeddings
  - Simple neural network classifier on top of embeddings
  
  **Transformer Fine-tuning:**
  - Fine-tune a pre-trained model (BERT, RoBERTa, or similar)
  - Use proper train/validation/test splits

## Computational & memory trade-offs between the 3 approaches
* **Traditional ML Pipeline** (TF-IDF + Logistic Regression)
> * Cheaper computation and less complexity.
> * Memory-inefficient if stored densely: every vector has one dimension per vocabulary word, and most entries are 0s kept only to maintain a common dimensionality across all vectors. (scikit-learn's `TfidfVectorizer` returns a sparse matrix, which avoids storing those 0s.)
> * As the vocabulary grows with dataset size, the vectors grow with it, and performance decreases.
* **Embedding-Based** (Sentence Transformers)
> * More expensive in computation time and resources, and more complex.
  - How does performance scale with dataset size for each method?
  - Which approach generalizes better to out-of-distribution data?
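
To make the memory point about the traditional pipeline concrete, here is a small sketch that builds a bag-of-words count matrix for three toy reviews and compares the cells a dense layout stores with the cells that are actually non-zero. It uses plain Python rather than `TfidfVectorizer`, and the reviews are shortened, hypothetical examples.

```python
toy_reviews = [
    "a solid unassuming drama",
    "a shaky uncertain film",
    "the main story is compelling enough",
]

# Vocabulary: one column per distinct word across all documents.
vocabulary = sorted({word for review in toy_reviews for word in review.split()})

# Dense layout: every document stores a value for every vocabulary word.
dense_rows = [
    [review.split().count(word) for word in vocabulary] for review in toy_reviews
]

# Sparse layout: keep only the non-zero {column: count} entries.
sparse_rows = [{col: n for col, n in enumerate(row) if n} for row in dense_rows]

dense_cells = len(toy_reviews) * len(vocabulary)
stored_cells = sum(len(row) for row in sparse_rows)
print(f"dense cells: {dense_cells}, non-zero cells: {stored_cells}")
```

The gap widens as the corpus grows, because each review uses only a handful of words while the vocabulary keeps expanding.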

## Word level Tokenizer (TF-IDF with Logistic Regression)

In [None]:
ENGLISH_STOPWORDS = set(stopwords.words("english")) # Build the stopword set once, not on every call

def preprocess_text(text: str) -> str:
    text = text.lower() # Lowercase all text
    text = re.sub(r"[^\w\s]", "", text) # Remove punctuation
    text = re.sub(r"\d", "", text) # Remove digits
    text = re.sub(r"\s+", " ", text) # Collapse extra whitespace
    text = " ".join(word for word in text.split() if word not in ENGLISH_STOPWORDS) # Remove stopwords
    return text


def word_level_tokenizer(train_data, validate_data):
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(train_data["text"])
    y_train = train_data["label"].values.ravel().astype(int)

    X_validate = vectorizer.transform(validate_data["text"])
    y_validate = validate_data["label"].values.ravel().astype(int)

    return X_train, y_train, X_validate, y_validate


# Preprocess all dataframes
for dataframe_name, dataframe in dataframes.items():
    preprocessed_dataframe = dataframe.copy() # Make a copy of original dataframe
    preprocessed_dataframe["text"] = dataframe["text"].map(preprocess_text) # Preprocess the copy's data
    dataframes[dataframe_name] = preprocessed_dataframe

X_train, y_train, X_validate, y_validate = word_level_tokenizer(dataframes["train"], dataframes["validation"])
logistic_regression_model = LogisticRegression(max_iter=1000)
logistic_regression_model.fit(X_train, y_train)

# Predict on validation
y_validate_predict = logistic_regression_model.predict(X_validate)
print(classification_report(y_validate, y_validate_predict))

              precision    recall  f1-score   support

           0       0.74      0.78      0.76       533
           1       0.77      0.73      0.75       533

    accuracy                           0.75      1066
   macro avg       0.75      0.75      0.75      1066
weighted avg       0.75      0.75      0.75      1066



In [None]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2') # Load the embedding model

# Convert text to text embeddings
vector = model.encode("The Data Science and Analytics Program at the University of Missouri in Columbia!")

In [None]:
from transformers import Trainer, TrainingArguments, DistilBertTokenizer, DistilBertForSequenceClassification
from datasets import load_dataset
import torch

# Step 1: Load the dataset (Rotten Tomatoes sentiment analysis dataset)
dataset = load_dataset("rotten_tomatoes")

# Step 2: Load the pre-trained DistilBERT tokenizer and model
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

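# Step 3: Tokenize the input data (convert text into tokens for the model)
# Pad to a fixed length so every example has the same shape when the Trainer batches them;
# max_length=128 is an assumed cap that comfortably fits these short reviews
def preprocess_function(examples):
    return tokenizer(examples['text'], truncation=True, padding="max_length", max_length=128)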

encoded_dataset = dataset.map(preprocess_function, batched=True)

# Step 4: Split the dataset into training and evaluation sets
train_dataset = encoded_dataset["train"]
test_dataset = encoded_dataset["test"]

# Step 5: Define the training arguments (parameters like batch size, epochs, etc.)
training_args = TrainingArguments(
    output_dir="./results",           # output directory for saving model checkpoints
    eval_strategy="epoch",      # evaluate at the end of each epoch
    learning_rate=2e-5,               # learning rate for optimization
    per_device_train_batch_size=8,    # batch size for training
    per_device_eval_batch_size=16,    # batch size for evaluation
    num_train_epochs=3,               # number of epochs for training
    weight_decay=0.01,                # weight decay for regularization
    logging_dir="./logs",             # directory for storing logs
)

# Step 6: Define the Trainer
trainer = Trainer(
    model=model,                                  # the model to train
    args=training_args,                           # training arguments
    train_dataset=train_dataset,                  # training dataset
    eval_dataset=encoded_dataset["validation"]    # validation split, evaluated after each epoch
)

# Step 7: Train the model
trainer.train()

# Step 8: Evaluate the model once on the held-out test split
results = trainer.evaluate(eval_dataset=test_dataset)
print("Evaluation results:", results)

# Step 9: Use the trained model to predict sentiment for new sentences
def predict_sentiment(sentences):
    model.eval() # Disable dropout for inference
    inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    inputs = inputs.to(model.device) # Move inputs to the model's device (GPU after training on Colab)
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.argmax(outputs.logits, dim=-1)
    return predictions

# Example sentences to classify
test_sentences = ["I love this movie!", "I hated the movie, it was awful."]
predictions = predict_sentiment(test_sentences)
print(f"Predictions: {predictions}")  # Label ids: 1 = positive, 0 = negative


# **3. Model Interpretability and Error Analysis**

- **Interpretability Investigation:**
  - For the traditional ML model: Analyze top features/words for each class
  - For the transformer model: Extract attention weights and analyze what the model focuses on
  - Compare interpretability between approaches

- **Systematic Error Analysis:**
  - Identify classes of examples where each model fails
  - Analyze length bias, domain bias, and linguistic complexity effects
  - Create confusion matrices and analyze misclassification patterns

- **Critical Questions:**
  - Do the models learn semantically meaningful patterns or exploit spurious correlations?
  - How do attention patterns relate to human understanding of sentiment indicators?
  - What are the implications of model opacity for real-world deployment?
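
For the traditional ML model, the "top features per class" analysis can be sketched as below. In scikit-learn, a binary `LogisticRegression` exposes one weight per feature in `coef_[0]`, and `TfidfVectorizer.get_feature_names_out()` returns the matching words. In this notebook the vectorizer is local to `word_level_tokenizer`, so it would have to be returned from that function first (an assumption). The words and weights below are hypothetical, not results from this notebook.

```python
def top_features(feature_names, coefficients, k=3):
    """Return the k most negative and the k most positive (word, weight) pairs."""
    ranked = sorted(zip(feature_names, coefficients), key=lambda pair: pair[1])
    return ranked[:k], ranked[-k:][::-1]

# Hypothetical words and weights, for illustration only.
toy_names = ["dull", "moving", "mess", "charming", "film", "bad"]
toy_weights = [-1.8, 1.5, -1.2, 2.1, 0.05, -2.4]

negative, positive = top_features(toy_names, toy_weights, k=2)
print("pushes toward label 0:", negative)
print("pushes toward label 1:", positive)
```

Words with weights near zero (such as "film" here) carry little signal. A large weight on a word with no obvious sentiment is a hint that the model may be exploiting a spurious correlation.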

In [None]:
print("\n\n\n... Phi-3-mini-4k-instruct ...")
outputs_phi3_mini4K_metrics = create_metric_calculations(outputs_phi3_mini4K, 0, 0)
augmented_outputs_phi3_mini4K_metrics = create_metric_calculations(augmented_outputs_phi3_mini4K, 0, 0, "augmented")

# **4. Robustness and Adversarial Testing**

## How different preprocessing choices affect model robustness
Preprocessing aims to let the model learn from cleaner data. Potential issues such as emojis, special characters, and/or unusual spacing are removed so that the model can concentrate on the actual words in the text. However, in a real-world setting, input text is full of the noise that preprocessing tries to remove. Therefore, over-aggressive preprocessing choices can lead to less robust models, while milder preprocessing may lead to more robust models that can handle noisy, real-world data.
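
A small sketch of that contrast, assuming two hypothetical cleaning functions. `aggressive_clean` is a simplified variant of this notebook's `preprocess_text` (without stopword removal). `mild_clean` only lowercases and normalizes spacing.

```python
import re

def aggressive_clean(text: str) -> str:
    """Lowercase, then drop punctuation and digits before collapsing whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # Remove punctuation and emoticons
    text = re.sub(r"\d", "", text)       # Remove digits
    return re.sub(r"\s+", " ", text).strip()

def mild_clean(text: str) -> str:
    """Only lowercase and collapse whitespace; punctuation and digits survive."""
    return re.sub(r"\s+", " ", text.lower()).strip()

noisy_review = "LOVED it!!!   10/10 :)"
print(repr(aggressive_clean(noisy_review)))
print(repr(mild_clean(noisy_review)))
```

The aggressive version discards the rating and the emoticon, both of which carry sentiment. A model trained only on such text never learns to use those cues when they appear in production input.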

## Ethical implications of sentiment analysis models
Sentiment analysis models are trained on human-labeled data. Every human labeler has their own biases, and those biases can carry over into what the model categorizes under a specific class, making it impossible for the model to be truly impartial.

## How to detect & mitigate bias in a production system?
One could use a model trained on a completely different set of similar data and compare both models' results to understand where a given model falls short. Bias can never be 100% mitigated because sentiment analysis models are built from data that, at its core, is generated by biased humans. In production, it is better practice to state to users upfront that the model is biased, list the reasons why it is biased, and discuss introducing more diverse datasets to mitigate this shortcoming. Transparency is necessary in a production system.
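
The model-comparison idea can be sketched as a simple disagreement report. The function name, texts, and predictions below are all hypothetical. In practice the two prediction lists would come from two independently trained models scoring the same inputs.

```python
def disagreement_report(texts, predictions_a, predictions_b):
    """Return the disagreement rate and the cases where the two models differ."""
    differing = [
        (text, a, b)
        for text, a, b in zip(texts, predictions_a, predictions_b)
        if a != b
    ]
    return len(differing) / len(texts), differing

# Hypothetical inputs and predictions, for illustration only.
toy_texts = ["great cast", "not bad at all", "a waste of time", "sick soundtrack"]
model_a = [1, 0, 0, 0]
model_b = [1, 1, 0, 1]

rate, cases = disagreement_report(toy_texts, model_a, model_b)
print(f"disagreement rate: {rate:.2f}")
for text, a, b in cases:
    print(f"{text!r}: model A={a}, model B={b}")
```

Reading the disagreeing cases by hand (here, negation and slang) is what surfaces the kinds of language one model handles worse than the other.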

In [None]:
noisy_prompts = add_extreme_noise(prompts)
noisy_outputs_tinyLlama = model_generate(tinyLlama_tokenizer, tinyLlama_model, noisy_prompts[:4]) # Try first 4 prompts

print("... TinyLlama ...")
for i in range(0, 4):
    print()
    print("Noisy prompt:", noisy_prompts[i])
    print("Output:", noisy_outputs_tinyLlama[i])

... TinyLlama ...

Noisy prompt: What are quint chief characteristic typically require in a geospatial package cock?<|assistant|>
Output: What are quint chief characteristic typically require in a geospatial package cock?<|assistant|>

Quint is a geospatial package that is commonly used in cartography, GIS, and remote sensing. It is known for its high-quality vector data, precise geographic coordinates, and advanced features such as vector and raster editing, vector and raster mapping, and vector and raster analysis. Here are some of its typical characteristics:

1. Vector data: Quint provides vector data in various formats such as shapefiles, geojson, and raster data. These data types are ideal for creating maps, charts, and other visualizations.

2. Precise geographic coordinates: Quint's vector data is accurate to within a few meters, making it ideal for creating detailed maps and charts.

3. Advanced features: Quint's vector and raster editing, vector and raster mapping, and vector

# **Technical Reflection**

## Real-World Deployment Considerations (inference speed, memory usage, and scalability)
The methods with relatively fast inference were the traditional ML approach, followed by the embedding-based approach. Memory usage was high in the traditional ML and transformer approaches. Traditional approaches cannot scale well, while both the embedding and transformer approaches can, but at the expense of a higher computational cost.

## Handling class imbalance and concept drift
* Class imbalance can be handled by obtaining more data, using data augmentation to artificially create more data, or dropping samples from the majority class so that its size matches that of the minority class.
* Concept drift can be handled by fine-tuning the model on a new set of data, providing methods for the model to search the Internet for more timely information (and adding new findings as context in the prompt window), or, if necessary, retraining the model with the new data added to the training data.
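
The majority-class dropping option can be sketched as random undersampling. The helper name, the seed, and the toy data are hypothetical. The toy labels mimic the skew of the aspect labels, where "other" (3) dominates.

```python
import random
from collections import Counter

def undersample(samples, labels, seed=42):
    """Randomly drop samples so every class matches the smallest class size."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = min(len(group) for group in by_class.values())
    balanced = [
        (sample, label)
        for label, group in by_class.items()
        for sample in rng.sample(group, target)
    ]
    rng.shuffle(balanced)  # Avoid leaving the data grouped by class
    return balanced

# Toy data with a dominant "other" class (label 3).
toy_labels = [3] * 8 + [0] * 2 + [1] * 2
toy_samples = [f"review {i}" for i in range(len(toy_labels))]

balanced = undersample(toy_samples, toy_labels)
print(Counter(label for _, label in balanced))
```

Undersampling discards data, so it suits cases where the majority class is large enough to spare. With very small minority classes, augmentation or collecting more data may preserve more signal.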

## Methods for continuous model monitoring and updating
* Continuous model monitoring can be handled with scheduling algorithms or event triggers, so that internal calls are made to the model to alert it to update itself.
* For continuous model updating, a user prompt about more timely information can trigger AI agents to extract information from the more timely source and provide it to the model as context.
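
One way an event trigger for monitoring could look is sketched below. It compares the mix of predicted labels in a reference window against a recent window using total variation distance and raises a flag past a threshold. The distance measure, the `0.2` threshold, and the prediction logs are all assumptions chosen for illustration, not something this notebook implements.

```python
from collections import Counter

def label_distribution(predictions):
    """Map each predicted label to its share of all predictions."""
    counts = Counter(predictions)
    return {label: n / len(predictions) for label, n in counts.items()}

def total_variation(reference, live):
    """Half the summed absolute difference between two label distributions."""
    labels = set(reference) | set(live)
    return 0.5 * sum(abs(reference.get(l, 0.0) - live.get(l, 0.0)) for l in labels)

def needs_update(reference_preds, live_preds, threshold=0.2):
    """Event trigger: flag the model when the prediction mix shifts past the threshold."""
    drift = total_variation(
        label_distribution(reference_preds), label_distribution(live_preds)
    )
    return drift > threshold, drift

# Hypothetical prediction logs: validation-time predictions vs. a recent production window.
reference_window = [0, 1] * 50
live_window = [0] * 80 + [1] * 20

flag, drift = needs_update(reference_window, live_window)
print(f"drift={drift:.2f}, trigger update: {flag}")
```

A scheduled job could run this check on each new window of predictions and start fine-tuning or a data review only when the flag is raised.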