<a href="https://colab.research.google.com/github/ariesslin/ie7500-g1-tweet-sentiment-nlp/blob/main/scripts/3c-BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


<div style="background-color:#e6f2ff; border-left:8px solid #0059b3; padding:20px; margin:20px 0;">
  <h2 style="color:#003366;"><strong>3.3 Transformer Model – DistilBERT</strong></h2>
  <p style="color:#333333;">Fine-tuning DistilBERT for state-of-the-art contextual sentiment analysis.</p>
</div>


## Transformer Sentiment Classifier: DistilBERT Fine-tuning

The **DistilBERT** model is the strongest model in our tweet sentiment comparison. It fine-tunes a pre-trained transformer so that each token's representation depends on the context of the whole tweet.

### Key Features of the DistilBERT Model:

**Pre-trained Transformer Architecture:**
- Distilled version of BERT that retains about 97% of BERT's language-understanding performance while being 40% smaller and about 60% faster
- Pre-trained on 16GB of text data, providing rich contextual representations
- Bidirectional attention mechanism for complete sentence understanding
- Fine-tuned specifically for binary sentiment classification

**Advanced NLP Capabilities:**
- Better suited than bag-of-words models to negation, sarcasm, and context-dependent sentiment, although these remain hard cases
- Understands word relationships across entire tweet sequences simultaneously
- Processes subword tokens for better handling of informal social media language
- Maximum sequence length of 140 tokens optimized for tweet analysis
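
To make the subword point concrete, here is a minimal sketch of WordPiece-style greedy longest-match splitting. The vocabulary `toy_vocab` and the function `wordpiece_split` are hypothetical and exist only for illustration; the real `distilbert-base-uncased` vocabulary has roughly 30,000 entries and may split these words differently.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (assumption, not the real DistilBERT vocabulary)
toy_vocab = {"hung", "##ry", "##y", "so", "##o", "good"}

elongated = wordpiece_split("hungryy", toy_vocab)
stretched = wordpiece_split("sooo", toy_vocab)
unknown = wordpiece_split("xyz", toy_vocab)
print(elongated, stretched, unknown)
```

An elongated spelling such as "hungryy" still maps onto known pieces instead of a single unknown token, which is why subword models cope better with informal tweet text.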

**Training Configuration:**
- 2 epochs with learning rate of 1e-4 for effective fine-tuning
- Batch size of 32 for efficient GPU utilization
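- Weight decay of 0.01 for regularization; `EarlyStoppingCallback` is imported but not used, because evaluation during training is disabled
- Standard `distilbert-base-uncased` WordPiece tokenizer, which splits unfamiliar Twitter spellings into subword pieces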

### Performance Summary:
- **Validation Accuracy**: ~81.49%
- **Precision**: ~83.18% (highest among all models)
- **Recall**: ~86.54%
- **F1 Score**: ~84.83%
- **Strengths**: Superior contextual understanding, handles complex linguistic patterns
- **Limitations**: Computationally expensive, requires significant GPU resources
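
As a quick consistency check, the F1 score can be recomputed from the reported precision and recall, since F1 is their harmonic mean. This is a minimal sketch that uses only the rounded figures listed above, so the result is approximate.

```python
# Reported validation metrics (rounded values from the summary above)
precision = 0.8318
recall = 0.8654

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 recomputed from precision and recall: {f1:.4f}")
```

The recomputed value agrees with the reported ~84.83% to four decimal places.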

### Implementation Details:

The code cells below implement the DistilBERT training and evaluation pipeline.

In [1]:
# Mount Google Drive and Setup Project Environment
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
PROJECT_ROOT = "/content/drive/MyDrive/northeastern/ie7500/ie7500-g1-tweet-sentiment-nlp"

In [3]:
train_path = f"{PROJECT_ROOT}/processed_data/train_dataset.csv"
val_path = f"{PROJECT_ROOT}/processed_data/val_dataset.csv"

In [4]:
import sys
!{sys.executable} -m pip install -r "{PROJECT_ROOT}/requirements.txt"


In [8]:
# Import required libraries for DistilBERT model development
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

# Transformers and datasets
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback
from datasets import Dataset

# Scikit-learn metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    precision_recall_fscore_support, confusion_matrix,
    ConfusionMatrixDisplay, roc_curve, auc
)

# Disable tokenizer parallelism warnings
# os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [5]:
# Import helper functions and load data
sys.path.append(f"{PROJECT_ROOT}/utils")
from helper import load_clean_train_val_datasets

train_df, val_df = load_clean_train_val_datasets(train_path, val_path)

In [6]:
train_df.head()

Unnamed: 0,text,target
0,doesnt know hahahahahaha hi world twitter,4
1,gahh im hungryy shouldve something teadinner s...,0
2,last day,0
3,sunburn forget put sunblock shnatzi,0
4,usermention want go home contact hurt,0


In [7]:
val_df.head()

Unnamed: 0,text,target
0,lng fn day mah head killin im tire den bih bt ...,0
1,usermention nah manthat fit lmao run mix oh ma...,4
2,usermention kno right thermostat war almost al...,0
3,usermention awww well dont worry youre miss mu...,0
4,use little girls room soo bad soon leave bos c...,0


In [9]:
# Ensure 'target' is int and remap 4 → 1
train_df['labels'] = train_df['target'].astype(int).replace({4: 1})
val_df['labels'] = val_df['target'].astype(int).replace({4: 1})

# Ensure text is string
train_df['text'] = train_df['text'].astype(str)
val_df['text'] = val_df['text'].astype(str)

# Final check
print("Train shape:", train_df.shape)
print("Validation shape:", val_df.shape)
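# Label distribution after remapping (X_train / y_train are not defined in this notebook)
print("Train label counts:", train_df['labels'].value_counts().to_dict())
print("Validation label counts:", val_df['labels'].value_counts().to_dict())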

In [10]:
# Next, we convert to Hugging Face Datasets format

train_dataset = Dataset.from_pandas(train_df[['text', 'labels']])
val_dataset = Dataset.from_pandas(val_df[['text', 'labels']])

In [11]:
# Next, we perform tokenization with DistilBERT

# Load pretrained tokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

# Tokenize datasets
def tokenize_function(batch):
    # max_length=140 is intentional: it follows the maximum tweet length shown earlier.
    # Note that this limit counts subword tokens, not characters.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=140)

train_tokenized = train_dataset.map(tokenize_function, batched=True)
val_tokenized = val_dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/1119609 [00:00<?, ? examples/s]

Map:   0%|          | 0/239917 [00:00<?, ? examples/s]

In [12]:
# Next, we load DistilBERT with classification head
BERTmodel = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [13]:
# After that, we define evaluation metrics

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='binary')
    acc = accuracy_score(labels, predictions)
    return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall}
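
To illustrate what `compute_metrics` returns, the sketch below reproduces the same quantities by hand on five made-up examples. The logits and labels are invented for illustration. With `average='binary'`, precision, recall, and F1 refer to the positive class (label 1).

```python
# Toy logits for five examples: column 0 = negative, column 1 = positive (invented values)
toy_logits = [[2.0, -1.0], [0.2, 1.5], [-0.3, 0.4], [1.1, 0.9], [-2.0, 3.0]]
toy_labels = [0, 1, 0, 1, 1]

# argmax over the two logits gives the predicted class (ties go to class 0, as in np.argmax)
toy_preds = [0 if neg >= pos else 1 for neg, pos in toy_logits]

# Confusion counts for the positive class
tp = sum(p == 1 and y == 1 for p, y in zip(toy_preds, toy_labels))
fp = sum(p == 1 and y == 0 for p, y in zip(toy_preds, toy_labels))
fn = sum(p == 0 and y == 1 for p, y in zip(toy_preds, toy_labels))

accuracy = sum(p == y for p, y in zip(toy_preds, toy_labels)) / len(toy_labels)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print({"accuracy": accuracy, "f1": f1, "precision": precision, "recall": recall})
```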

In [14]:
# Next, we define training configuration

training_args = TrainingArguments(
    output_dir="./distilbert_fast_dev",   # Model output directory
    do_train=True,
    do_eval=False,                        # Skip evaluation for speed
    per_device_train_batch_size=32,       # Speed up with larger batch size
    num_train_epochs=2,
    learning_rate=1e-4,                   # Increased LR for faster convergence
    weight_decay=0.01,
    logging_steps=5000,                   # Less logging = less overhead
    save_steps=1_000_000,                 # Effectively disables mid-training saves
    save_total_limit=1,
    report_to=[]                          # Disable logging integrations
)

# Setup Trainer
BERTtrainer = Trainer(
    model=BERTmodel,
    args=training_args,
    train_dataset=train_tokenized,        # Optionally, we can use a subset to test faster
    tokenizer=tokenizer
)
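
The step-based settings above are easier to judge once the total number of optimizer steps is known. The sketch below estimates it from the training set size shown in the tokenization progress output (1,119,609 examples). It assumes a single GPU, no gradient accumulation, and that the last partial batch is kept; these are the library defaults as far as we know, but they are assumptions here.

```python
import math

# Values taken from the notebook
num_train_examples = 1_119_609   # from the tokenization progress output
batch_size = 32                  # per_device_train_batch_size
num_epochs = 2                   # num_train_epochs
logging_steps = 5_000
save_steps = 1_000_000

# Assumption: one device, gradient_accumulation_steps=1, last partial batch kept
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

num_log_entries = total_steps // logging_steps
saves_mid_training = total_steps >= save_steps

print(f"Steps per epoch:       {steps_per_epoch:,}")
print(f"Total training steps:  {total_steps:,}")
print(f"Loss log entries:      {num_log_entries}")
print(f"Mid-training save hit: {saves_mid_training}")
```

Under these assumptions, training runs for just under 70,000 steps, so `save_steps=1_000_000` is never reached and the loss is logged about a dozen times.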


In [None]:
# Now, we train our model

# 1. Train the model (2 epochs, no eval during training for speed)
history = BERTtrainer.train()

# 2. Evaluate on validation set
predictions = BERTtrainer.predict(val_tokenized)
y_true = predictions.label_ids
y_logits = predictions.predictions
y_pred = np.argmax(y_logits, axis=1)
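# Softmax turns raw logits into class probabilities; keep P(positive) for the ROC curve
y_probs = torch.softmax(torch.from_numpy(y_logits), dim=-1).numpy()[:, 1]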

# 3. Print validation metrics
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print("\n--- Validation Performance ---")
print(f"Accuracy:  {acc:.4f}")
print(f"Precision: {prec:.4f}")
print(f"Recall:    {rec:.4f}")
print(f"F1 Score:  {f1:.4f}")

# 4. Confusion Matrix (only plot retained for speed)
cm = confusion_matrix(y_true, y_pred)
ConfusionMatrixDisplay(cm, display_labels=["Negative", "Positive"]).plot(cmap="Blues")
plt.title("Validation Confusion Matrix")
plt.grid(False)
plt.show()

# 5. Save final model weights only
final_model_path = "final_distilbert_sentiment_model.pt"
torch.save(BERTmodel.state_dict(), final_model_path)
print(f"Final model weights saved to: {final_model_path}")



## Error Analysis: Understanding Model Limitations

Similar to our baseline and LSTM analysis, examining the most confidently misclassified tweets reveals patterns in where the DistilBERT model struggles, providing insights into its capabilities and remaining limitations.

**Confidence Calculation:** For DistilBERT, confidence is calculated from the softmax output probabilities. High confidence means the model assigns a probability close to 1.0 to its predicted class.
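
The distinction between raw logits and softmax probabilities matters here: the model outputs logits, which are unbounded, so a confidence score has to be computed after applying softmax. The following is a minimal sketch on invented logits for a single tweet; `softmax` is a small helper written for this example.

```python
import math


def softmax(logits):
    """Convert a list of raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Invented logits for one tweet: [negative, positive]
example_logits = [-1.2, 3.4]

probs = softmax(example_logits)
confidence = max(probs)

print(f"Largest raw logit:  {max(example_logits):.2f}  (not a probability)")
print(f"Softmax confidence: {confidence:.4f}")
```

The largest logit here is 3.4, which cannot be read as a probability, while the softmax confidence lies between 0 and 1 as the description above requires.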


In [None]:
### Error Analysis: Most Confident Misclassifications

# Convert raw logits to probabilities with softmax, then use the max probability as confidence
probs = torch.softmax(torch.from_numpy(y_logits), dim=-1).numpy()
confidence_scores = probs.max(axis=1)

# Build comprehensive error analysis DataFrame
errors_df = pd.DataFrame({
    'text': val_df['text'].values,
    'true_label': y_true,
    'predicted_label': y_pred,
    'predicted_prob_negative': probs[:, 0],
    'predicted_prob_positive': probs[:, 1],
    'confidence': confidence_scores
})

# Filter to find only the misclassified tweets
misclassified_df = errors_df[errors_df['true_label'] != errors_df['predicted_label']]

# Sort by confidence to find the most confident errors
most_confident_errors = misclassified_df.sort_values(by='confidence', ascending=False)

print("DistilBERT Model - Top 10 Most Confident Misclassifications:")
print("=" * 80)

# Display the analysis
for i, (idx, row) in enumerate(most_confident_errors.head(10).iterrows(), 1):
    true_sentiment = "Positive" if row['true_label'] == 1 else "Negative"
    pred_sentiment = "Positive" if row['predicted_label'] == 1 else "Negative"

    print(f"\n{i}. Text: '{row['text'][:100]}{'...' if len(row['text']) > 100 else ''}'")
    print(f"   True: {true_sentiment} | Predicted: {pred_sentiment}")
    print(f"   Confidence: {row['confidence']:.3f}")
    print(f"   Prob(Neg): {row['predicted_prob_negative']:.3f} | Prob(Pos): {row['predicted_prob_positive']:.3f}")

print(f"\nTotal misclassifications: {len(misclassified_df):,}")
print(f"Average confidence on errors: {most_confident_errors['confidence'].mean():.3f}")


In [None]:
# Test DistilBERT on specific examples where baseline failed with high confidence
# These examples come from the baseline model's error analysis in 3a-Logistic-Regression.ipynb

test_examples = [
    "usermention dont sad doesnt make sad",           # Baseline: Negative (99.97% confidence)
    "usermention filthy mcnasty cant hate",          # Baseline: Negative (99.93% confidence)
    "usermention yeah flu suck hate fever couldnt anything"  # Baseline: Negative (99.93% confidence)
]

print("Testing DistilBERT on Baseline's Most Confident Errors:")
print("=" * 60)

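BERTmodel.eval()  # Disable dropout so predictions are deterministic

for i, text in enumerate(test_examples, 1):
    # Tokenize the text and move the tensors to the model's device (the GPU after training)
    inputs = tokenizer(text, truncation=True, padding="max_length", max_length=140, return_tensors="pt").to(BERTmodel.device)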

    # Get DistilBERT prediction
    with torch.no_grad():
        outputs = BERTmodel(**inputs)
        logits = outputs.logits
        probabilities = torch.softmax(logits, dim=-1)
        prediction = torch.argmax(logits, dim=-1)

    bert_pred = "Positive" if prediction.item() == 1 else "Negative"
    bert_prob_pos = probabilities[0][1].item()
    bert_prob_neg = probabilities[0][0].item()
    bert_confidence = max(bert_prob_pos, bert_prob_neg)

    print(f"\n{i}. Text: '{text}'")
    print(f"   Baseline: Negative (>99% confidence)")
    print(f"   DistilBERT: {bert_pred} ({bert_confidence:.3f} confidence)")
    print(f"   Prob(Neg): {bert_prob_neg:.3f} | Prob(Pos): {bert_prob_pos:.3f}")

    if bert_pred == "Positive":
        print(f"   ✓ DistilBERT correctly identified positive sentiment")
    else:
        print(f"   ✗ DistilBERT also predicted negative")

print(f"\nNote: These examples were the baseline's most confident errors.")
print(f"Testing shows whether DistilBERT's transformer architecture improves on these specific failures.")


### Error Analysis: Key Insights and Transformer-Specific Behavior

The error analysis reveals important insights into how the DistilBERT model processes sentiment. By examining the most confidently misclassified tweets, we can understand the capabilities and limitations of transformer-based approaches.

#### 1. Advanced Contextual Understanding

DistilBERT's attention mechanism provides sophisticated understanding of linguistic patterns. Key areas to analyze from the error output above:

- **Negation and Sarcasm**: How well the self-attention mechanism captures negation patterns compared to LSTM
- **Mixed Sentiment**: Whether bidirectional attention helps with context-dependent sentiment shifts
- **Confidence Calibration**: If transformer architecture reduces overconfidence on difficult examples
- **Long-Range Dependencies**: Ability to connect sentiment cues across longer tweet sequences

---

#### 2. Transformer-Specific Error Patterns

Areas revealed by the error analysis above:

- **Subword Tokenization Effects**: How breaking words into subwords affects handling of informal social media language (hashtags, elongated words, typos)
- **Attention Mechanism Limitations**: Specific linguistic constructions that still confuse the model despite self-attention
- **Pre-training Bias**: Cases where pre-training on formal text conflicts with informal tweet language

---

#### 3. Comparison with Previous Models

**Evidence-Based Comparison:**
The testing code above evaluates specific examples where the baseline model failed with high confidence (>99%). This provides concrete evidence of whether DistilBERT's transformer architecture improves on the baseline's and LSTM's specific failure modes.

**Analysis Framework (to be completed with actual results above):**
- **Baseline**: Wrong with >99% confidence (dangerously overconfident)
- **LSTM**: Wrong with low confidence (appropriately uncertain)  
- **DistilBERT**: Results from testing code above will show transformer performance

---

#### 4. Practical Implications for Real-World Use

Based on the error analysis and model comparison:

- **Cost-Benefit Analysis**: Whether DistilBERT's 3.33-percentage-point accuracy improvement over the baseline (81.49% vs 78.16%) justifies the computational overhead
- **Confidence Calibration**: How well DistilBERT expresses uncertainty compared to simpler models
- **Use Case Optimization**: Specific scenarios where transformer advantages (contextual understanding, subword tokenization) provide maximum benefit
- **Production Considerations**: Memory requirements, inference latency, and scalability factors for deployment

---

#### Quantitative Improvements:

**Performance Comparison:**
- **Accuracy**: DistilBERT (~81.49%) vs LSTM (80.14%) vs Baseline (78.16%)
- **Precision**: DistilBERT (~83.18%) vs LSTM (81.51%) vs Baseline (77.00%)
- **F1 Score**: DistilBERT (~84.83%) vs LSTM (80.15%) vs Baseline (78.62%)
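
One way to read the accuracy figures is as a reduction in error rate rather than a raw difference. The sketch below derives both from the accuracies listed above. These are simple arithmetic on the reported numbers, not new measurements, and the helper `compare` is written only for this example.

```python
# Reported validation accuracies in percent (from the comparison above)
accuracy = {"Baseline": 78.16, "LSTM": 80.14, "DistilBERT": 81.49}


def compare(model, reference):
    """Return (gain in percentage points, relative error reduction in percent)."""
    gain_pp = accuracy[model] - accuracy[reference]
    reference_error = 100.0 - accuracy[reference]
    relative_error_reduction = 100.0 * gain_pp / reference_error
    return gain_pp, relative_error_reduction


gain_vs_baseline, rer_vs_baseline = compare("DistilBERT", "Baseline")
gain_vs_lstm, rer_vs_lstm = compare("DistilBERT", "LSTM")

print(f"vs Baseline: +{gain_vs_baseline:.2f} pp, {rer_vs_baseline:.1f}% fewer errors")
print(f"vs LSTM:     +{gain_vs_lstm:.2f} pp, {rer_vs_lstm:.1f}% fewer errors")
```

Framed this way, a gain of a few percentage points corresponds to a noticeably larger share of the remaining errors, which is useful context for the trade-offs that follow.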

**Computational Trade-offs:**
- **Training Time**: Significantly longer than LSTM and baseline
- **Inference Speed**: Slower than previous models
- **Resource Requirements**: Higher GPU memory and compute needs


## Wrap-Up: Strengths and Limitations of DistilBERT

Our DistilBERT model is the most sophisticated approach in our sentiment analysis pipeline. It builds on a pre-trained transformer architecture, and distillation keeps its computational cost well below that of full BERT.

#### Strengths
- **Superior Performance**: Achieves the highest accuracy, precision, and F1-score among all models tested
- **Advanced Context Understanding**: Self-attention mechanism captures complex relationships across entire tweet sequences
- **Subword Tokenization**: Better handling of informal language, typos, and out-of-vocabulary words common in social media
- **Pre-trained Knowledge**: Leverages extensive pre-training on diverse text corpora for rich semantic understanding
- **Bidirectional Context**: Processes text in both directions simultaneously for complete contextual awareness
- **Fine-tuning Efficiency**: DistilBERT requires fewer computational resources than full BERT while retaining most of its performance

#### Limitations
- **Computational Complexity**: Requires significantly more computational resources than LSTM and baseline models
- **Training Time**: Much longer training time compared to simpler approaches
- **Inference Latency**: Slower prediction speed may limit real-time applications
- **Resource Requirements**: Demands substantial GPU memory and processing power
- **Interpretability**: More difficult to interpret than linear models; attention weights provide some insight but are complex
- **Diminishing Returns**: Performance gains may not justify computational costs for all use cases

#### Comparison Across All Models

| **Metric**          | **Baseline (TF-IDF)** | **LSTM**           | **DistilBERT**     |
|---------------------|------------------------|--------------------|-------------------|
| Accuracy            | 78.16%                 | 80.14%             | **81.49%**        |
| Precision           | 77.00%                 | 81.51%             | **83.18%**        |
| Recall              | 80.32%                 | **88.60%**         | 86.54%            |
| F1 Score            | 78.62%                 | 80.15%             | **84.83%**        |
| Training Time       | **Fast**               | Moderate           | Slow              |
| Inference Speed     | **Fast**               | Moderate           | Slow              |
| Interpretability    | **High**               | Low                | Low               |
| Resource Needs      | **Low**                | Moderate           | High              |

#### When to Use DistilBERT
- **High-accuracy requirements**: When small performance improvements justify additional computational cost
- **Complex linguistic patterns**: For applications requiring sophisticated understanding of context, negation, and sarcasm
- **Production systems with adequate resources**: When computational budget allows for transformer models
- **Research and benchmarking**: For establishing a strong transformer baseline

#### When to Prefer Simpler Models
- **Real-time applications**: When low latency is critical
- **Resource-constrained environments**: Limited GPU/compute availability
- **Large-scale deployment**: When processing millions of tweets requires efficiency
- **Interpretability requirements**: When understanding model decisions is crucial

This analysis shows that model selection should weigh both performance metrics and practical constraints. DistilBERT is the most accurate of the three models and the natural choice when computational resources allow, while the simpler models remain preferable under tight latency or hardware budgets.
