**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part II: BERT

Please see the description of the assignment in the README file (section 2) <br>
**Guide notebook**: [guides/bert_guide.ipynb](guides/bert_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: how do they compare with the results from Part I, BoW? Are there any hyperparameters that are particularly important?

* You should follow the steps given in the `bert_guide` notebook

* **Optionally**, you can fine-tune a pre-trained BERT model to classify news articles (the same task as in Part I), as is done in [guides/bert_guide_finetuning.ipynb](guides/bert_guide_finetuning.ipynb). Fine-tuning requires more computational resources, which is why this part is optional. If you decide to complete it, you will need a GPU: for reference, training on a 2020 MacBook Pro with 16 GB RAM and an M1 chip results in an out-of-memory error. We therefore suggest Google Colab or another cloud-based service with a GPU. You can upload the `bert_guide_finetuning.ipynb` notebook to Google Colab and run it there.

<br>

***

In [2]:
# imports for the project

from datasets import load_dataset, DatasetDict

### 1. Load the data

We can load this data directly from the Hugging Face Hub with [Hugging Face Datasets](https://huggingface.co/docs/datasets/). The result is a `Dataset` object rather than a Pandas DataFrame. Pretty neat!

**Note**: This cell downloads the dataset the first time it runs. The `datasets` library caches downloads on disk by default, so re-running the cell normally reuses the cached copy instead of downloading it again.

You are welcome to load more data by increasing the percentage in the `split` argument (for example, `train[:50%]` instead of `train[:20%]`).

In [3]:
# Load a subset of the training data (20%) and the full test set.
ag_news_train = load_dataset("fancyzhx/ag_news", split="train[:20%]")
ag_news_test = load_dataset("fancyzhx/ag_news", split="test")

# Create a DatasetDict to hold the train and test splits.
ag_news = DatasetDict({
    "train": ag_news_train,
    "test": ag_news_test
})

# Print dataset information
print(ag_news)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 24000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})


In [4]:
from transformers import pipeline

# Use the standard BERT model for feature extraction.
embedder = pipeline(
    model="bert-base-uncased",      # Standard pre-trained BERT model
    tokenizer="bert-base-uncased",  # Corresponding tokenizer
    task="feature-extraction",      # Returns embeddings for each token
    device=-1                       # Use CPU; change to 0 if you have a GPU
)


Device set to use cpu
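
The `device` argument above is hard-coded to the CPU. A minimal sketch of choosing it automatically, assuming PyTorch is the installed backend; `pick_device` is a hypothetical helper written for illustration, not part of `transformers`:

```python
def pick_device() -> int:
    """Return 0 (first CUDA GPU) if PyTorch can see one, otherwise -1 (CPU)."""
    try:
        import torch  # assumption: PyTorch is the backend used by the pipeline
    except ImportError:
        # Without PyTorch we cannot probe for a GPU, so fall back to the CPU index.
        return -1
    return 0 if torch.cuda.is_available() else -1


device = pick_device()
print(f"pipeline device index: {device}")
```

The returned index could then be passed as `device=pick_device()` when building the pipeline. This sketch only checks for CUDA GPUs; other accelerators are not covered.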


In [5]:
def get_embeddings(data):
    """
    Extracts the [CLS] embedding for each text entry.
    
    BERT (and ModernBERT) return one embedding per token.
    We use the embedding corresponding to the first token ([CLS])
    as a representation for the whole text.
    """
    # Process a batch of texts to obtain token embeddings.
    embeddings = embedder(data["text"])
    # Extract the first token's embedding ([CLS]) for each example.
    cls_embeddings = [e[0][0] for e in embeddings]
    return {"embeddings": cls_embeddings}

# Map the get_embeddings function over the dataset (batched processing for speed).
# Adjust batch_size if you run into memory issues.
ag_news = ag_news.map(get_embeddings, batched=True, batch_size=8)

# Inspect the updated dataset structure.
print(ag_news)


DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'embeddings'],
        num_rows: 24000
    })
    test: Dataset({
        features: ['text', 'label', 'embeddings'],
        num_rows: 7600
    })
})
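
To make the `e[0][0]` indexing in `get_embeddings` concrete, here is a small sketch that runs without downloading a model. It assumes the pipeline returns, for each text, a nested list shaped `[1][num_tokens][hidden_size]`; `mock_output` and `token_counts` are made-up stand-ins for real pipeline output.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size = 768  # hidden size of bert-base-uncased

# Mock pipeline output: one entry per text, each shaped [1][num_tokens][hidden_size].
# Assumption: this mirrors the nesting that makes `e[0][0]` select the first token.
token_counts = [5, 9, 3]  # texts of different lengths
mock_output = [rng.normal(size=(1, n, hidden_size)).tolist() for n in token_counts]

# Keep only the first token's vector (the [CLS] position) for each text.
cls_embeddings = np.array([e[0][0] for e in mock_output])

print(cls_embeddings.shape)  # (3, 768)
```

Although the texts have different token counts, taking only the first token yields one fixed-length vector per text, which is why the result stacks into a regular 2-D array.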


In [6]:
import numpy as np

# Convert the embeddings and labels from the dataset into NumPy arrays.
X_train = np.array(ag_news["train"]["embeddings"])  # BERT embeddings as features
y_train = np.array(ag_news["train"]["label"])         # Labels

X_test = np.array(ag_news["test"]["embeddings"])
y_test = np.array(ag_news["test"]["label"])

# Check the shapes of the resulting arrays.
print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")


X_train shape: (24000, 768), y_train shape: (24000,)
X_test shape: (7600, 768), y_test shape: (7600,)


In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Train a baseline Logistic Regression classifier on the BERT embeddings.
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)

# Predict on the training set and evaluate performance.
y_pred_train = lr.predict(X_train)
print("Performance on the Training Set:")
print(classification_report(y_train, y_pred_train))

# Predict on the test set and evaluate performance.
y_pred = lr.predict(X_test)
print("Performance on the Test Set:")
print(classification_report(y_test, y_pred))


Performance on the Training Set:
              precision    recall  f1-score   support

           0       0.93      0.92      0.93      6195
           1       0.97      0.99      0.98      5856
           2       0.88      0.87      0.88      5601
           3       0.90      0.91      0.90      6348

    accuracy                           0.92     24000
   macro avg       0.92      0.92      0.92     24000
weighted avg       0.92      0.92      0.92     24000

Performance on the Test Set:
              precision    recall  f1-score   support

           0       0.90      0.88      0.89      1900
           1       0.95      0.95      0.95      1900
           2       0.83      0.82      0.83      1900
           3       0.83      0.86      0.84      1900

    accuracy                           0.88      7600
   macro avg       0.88      0.88      0.88      7600
weighted avg       0.88      0.88      0.88      7600



In [8]:
# Experiment with a different regularization parameter (C).
# A lower C value applies stronger regularization.
lr_exp = LogisticRegression(max_iter=1000, C=0.5)
lr_exp.fit(X_train, y_train)

# Evaluate the experimental model on the test set.
y_pred_exp = lr_exp.predict(X_test)
print("Test Set Performance with Logistic Regression (C=0.5):")
print(classification_report(y_test, y_pred_exp))


Test Set Performance with Logistic Regression (C=0.5):
              precision    recall  f1-score   support

           0       0.91      0.88      0.89      1900
           1       0.96      0.96      0.96      1900
           2       0.83      0.82      0.83      1900
           3       0.83      0.87      0.85      1900

    accuracy                           0.88      7600
   macro avg       0.88      0.88      0.88      7600
weighted avg       0.88      0.88      0.88      7600
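
The cell above tries a single alternative value of `C`. A broader sweep could be sketched as follows. This is an illustration under stated assumptions, not the notebook's method: it uses synthetic data from `make_classification` as a stand-in for the BERT embeddings, and it scores each value on a held-out validation split so that the test set is not used for model selection.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic 4-class data standing in for the embeddings (assumption).
X, y = make_classification(
    n_samples=2000, n_features=50, n_informative=20,
    n_classes=4, random_state=0,
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Smaller C means stronger regularization; 1.0 is the scikit-learn default.
results = {}
for C in [0.01, 0.1, 0.5, 1.0, 10.0]:
    clf = LogisticRegression(max_iter=1000, C=C)
    clf.fit(X_tr, y_tr)
    results[C] = accuracy_score(y_val, clf.predict(X_val))

for C, acc in results.items():
    print(f"C={C:<5} validation accuracy={acc:.3f}")

best_C = max(results, key=results.get)
print(f"best C on the validation split: {best_C}")
```

With the real data, `X_tr` and `y_tr` would come from splitting `X_train` and `y_train`, leaving `X_test` for the final report only.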



# Reflection on BERT Embeddings and Classifier Performance

In this experiment, I used a pre-trained BERT model (via the Hugging Face pipeline) to embed the AG News articles. The resulting embeddings are 768-dimensional vectors, and the processed dataset contains 24,000 training examples and 7,600 test examples.

### Key Observations

- Feature representation: The BERT embeddings provide a dense, semantic representation of the text. Each article is converted into a fixed-length 768-dimensional vector that captures contextual information, which count-based methods such as BoW do not encode.

- Dataset and embedding extraction: After embedding, the training set has a shape of (24,000, 768) and the test set (7,600, 768). This consistency in dimensions makes the embeddings directly usable by standard classifiers like Logistic Regression.

- Classifier performance on training data: The Logistic Regression classifier achieved an accuracy of 92% on the training set with high precision and recall across all categories:
  - Class 0 (World): Precision = 0.93, Recall = 0.92, F1 = 0.93
  - Class 1 (Sports): Precision = 0.97, Recall = 0.99, F1 = 0.98
  - Class 2 (Business): Precision = 0.88, Recall = 0.87, F1 = 0.88
  - Class 3 (Sci/Tech): Precision = 0.90, Recall = 0.91, F1 = 0.90

- Generalization to test data: On the test set, the classifier achieves an overall accuracy of 88%. The per-class performance is slightly lower than the training metrics, indicating that the model generalizes reasonably well:
  - Class 0: F1 ≈ 0.89
  - Class 1: F1 ≈ 0.95
  - Class 2: F1 ≈ 0.83
  - Class 3: F1 ≈ 0.84
  
  This moderate drop from training to test performance (92% to 88% accuracy) suggests limited overfitting. Whether the gap is smaller than with sparser representations such as BoW would need to be checked against the results from Part I.

- Hyperparameter tuning:
  Strengthening the regularization by lowering `C` from the default of 1.0 to 0.5 resulted in nearly identical test performance: accuracy stays at 88%, and the F1 scores for classes 1 and 3 each rise by 0.01. This suggests that the classifier is not very sensitive to `C` in this range, although only one alternative value was tried.
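
As a quick check on the numbers quoted in these observations, the macro-averaged F1 and the train-test gap can be recomputed from the rounded values in the classification reports. The inputs are rounded to two decimals, so the results are approximate.

```python
# Per-class F1 scores copied from the test-set classification report.
test_f1 = {"World": 0.89, "Sports": 0.95, "Business": 0.83, "Sci/Tech": 0.84}

# Macro average: the unweighted mean over classes.
macro_f1 = sum(test_f1.values()) / len(test_f1)

# Accuracies copied from the training and test reports.
train_acc, test_acc = 0.92, 0.88
gap = train_acc - test_acc

print(f"macro F1 = {macro_f1:.4f}")            # macro F1 = 0.8775
print(f"train-test accuracy gap = {gap:.2f}")  # train-test accuracy gap = 0.04
```

The value 0.8775 rounds to the 0.88 shown in the report's `macro avg` row. Because every test class has the same support (1,900), the weighted average coincides with the macro average.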

### Conclusion

Overall, BERT embeddings proved effective for the AG News classification task:
- Feature extraction: The 768-dimensional embeddings capture enough semantic content for a simple linear classifier to reach high accuracy.
- Performance: Training accuracy is 92% and test accuracy is 88%. The moderate gap between the two indicates good generalization.
- Hyperparameter sensitivity: Changing the regularization parameter from the default to C=0.5 did not noticeably alter performance, which suggests that the embeddings provide a stable basis for classification.

