**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part II: BERT

Please see the description of the assignment in the README file (section 2) <br>
**Guide notebook**: [guides/bert_guide.ipynb](guides/bert_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: how do they compare with the results from Part I, BoW? Are there any hyperparameters that are particularly important?

* You should follow the steps given in the `bert_guide` notebook.

* **Optionally**, you can fine-tune a pre-trained BERT model to classify news articles (the same task as in Part I), as is done in [guides/bert_guide_finetuning.ipynb](guides/bert_guide_finetuning.ipynb). Fine-tuning requires more computational resources, so if you decide to complete this part you will need a GPU. For reference, training on a 2020 MacBook Pro with an M1 chip and 16 GB of RAM results in an out-of-memory error. We therefore suggest that you use Google Colab or another cloud-based service with a GPU. You can easily upload the `bert_guide_finetuning.ipynb` notebook to Google Colab and run it there.

<br>

***

In [1]:
# imports for the project

from datasets import load_dataset, DatasetDict

## 1. Load the data

We can load this data directly from the Hugging Face Hub with [Hugging Face Datasets](https://huggingface.co/docs/datasets/) into a `DatasetDict`. Pretty neat!

**Note**: The first run of this cell downloads the dataset. By default, the `datasets` library caches downloaded files on disk, so re-running the cell should reuse the cached copy rather than download the data again.

You are welcome to increase the percentage in the `split` argument (for example `train[:50%]`) to load more data.
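As a minimal sketch of how the amount of training data could be varied, the helper below builds the slice string that `load_dataset` accepts in its `split` argument. The names `train_split` and `load_ag_news_train` are hypothetical and not part of the guide notebook; the second function needs the `datasets` package and a network connection when called.

```python
def train_split(percent: int) -> str:
    """Build a split string such as 'train[:20%]' for the first `percent` percent."""
    if not isinstance(percent, int) or not 1 <= percent <= 100:
        raise ValueError("percent must be an integer between 1 and 100")
    return f"train[:{percent}%]"


def load_ag_news_train(percent: int):
    """Load the first `percent` percent of the AG News training split."""
    from datasets import load_dataset  # imported lazily; requires `datasets`
    return load_dataset("fancyzhx/ag_news", split=train_split(percent))


print(train_split(50))  # train[:50%]
```

Calling `load_ag_news_train(20)` would request the same slice as the cell below, `train[:20%]`.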

In [2]:
ag_news_train = load_dataset("fancyzhx/ag_news", split="train[:20%]")  # 20% of the training data
ag_news_test = load_dataset("fancyzhx/ag_news", split="test")  # full test data

ag_news = DatasetDict({
    "train": ag_news_train,
    "test": ag_news_test
})

ag_news

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 24000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})

## 2. Setup BERT Pipeline

In [3]:
from transformers import pipeline

embedder = pipeline(
    model="answerdotai/ModernBERT-base",      # model used for embedding
    tokenizer="answerdotai/ModernBERT-base",  # tokenizer used for embedding
    task="feature-extraction",                # feature extraction task (returns embeddings)
    device=0                                  # use GPU 0 if available
)

Device set to use mps:0


## 3. Encode the data

In [4]:
import numpy as np

def get_embeddings(data):
    """ Extract the [CLS] embedding for each text. """
    embeddings = embedder(data["text"])  # Full token embeddings
    cls_embeddings = [e[0][0] for e in embeddings]  # Extract first token ([CLS])
    return {"embeddings": cls_embeddings}

ag_news = ag_news.map(get_embeddings, batched=True, batch_size=8)

X_train = np.array(ag_news["train"]["embeddings"])  # Feature embeddings
y_train = np.array(ag_news["train"]["label"])       # Labels

X_test = np.array(ag_news["test"]["embeddings"])
y_test = np.array(ag_news["test"]["label"])

# Check shapes
print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")


X_train shape: (24000, 768), y_train shape: (24000,)
X_test shape: (7600, 768), y_test shape: (7600,)
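To make the indexing `e[0][0]` in `get_embeddings` concrete, here is a small sketch on toy data. It assumes, as the shapes above suggest, that the feature-extraction pipeline returns one nested list per text shaped `[1][num_tokens][hidden_size]`. The toy values and the mean-pooling alternative are illustrative assumptions; the notebook itself only uses the first-token vector.

```python
# Toy stand-in for the pipeline output of ONE text:
# a nested list shaped [1][num_tokens][hidden_size].
toy_output = [[
    [1.0, 2.0, 3.0],  # token 0 (the [CLS] position)
    [4.0, 5.0, 6.0],  # token 1
    [7.0, 8.0, 9.0],  # token 2
]]


def cls_embedding(output):
    """Return the vector of the first token, as `e[0][0]` does in the notebook."""
    return output[0][0]


def mean_embedding(output):
    """Average all token vectors (a common alternative, not used in the notebook)."""
    tokens = output[0]
    hidden_size = len(tokens[0])
    return [sum(tok[i] for tok in tokens) / len(tokens) for i in range(hidden_size)]


print(cls_embedding(toy_output))   # [1.0, 2.0, 3.0]
print(mean_embedding(toy_output))  # [4.0, 5.0, 6.0]
```

Either function yields one fixed-length vector per text, which is what allows the embeddings to be stacked into the `(num_texts, 768)` arrays shown above.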


## 4. Train a classifier
We went overboard with models in Part I (BoW), so we will keep it simple here. We use a Logistic Regression model as a baseline and then adjust two of its hyperparameters, `C` and `max_iter`.
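One way to choose `C` more systematically than by hand would be a cross-validated grid search on the training set, so that the test set is only used for the final report. The sketch below is an assumption about how this could look, not what this notebook does: it runs on a small synthetic data set from `make_classification` instead of the BERT embeddings, and the grid values are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic 4-class stand-in for (X_train, y_train); NOT the AG News embeddings.
X_toy, y_toy = make_classification(
    n_samples=400, n_features=20, n_informative=10,
    n_classes=4, random_state=0,
)

# Hypothetical grid of inverse regularisation strengths to compare.
param_grid = {"C": [0.01, 0.1, 1.0]}

search = GridSearchCV(
    LogisticRegression(max_iter=2000),
    param_grid,
    cv=3,                # 3-fold cross-validation on the training data only
    scoring="accuracy",
)
search.fit(X_toy, y_toy)

print("Best C:", search.best_params_["C"])
print(f"Mean CV accuracy: {search.best_score_:.3f}")
```

With the real data, `X_toy` and `y_toy` would be replaced by `X_train` and `y_train`, and `search.best_estimator_` could then be evaluated once on the test set.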

In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

# Baseline Logistic Regression
lr_baseline = LogisticRegression(max_iter=2000) # Increase max_iter to avoid convergence warnings
lr_baseline.fit(X_train, y_train)

# Tuned Logistic Regression (smaller C means stronger regularisation)
lr_tuned = LogisticRegression(C=0.1, max_iter=5000)
lr_tuned.fit(X_train, y_train)

# Predict on test set
y_pred_baseline = lr_baseline.predict(X_test)
y_pred_tuned = lr_tuned.predict(X_test)

# Evaluate
print("=== Baseline Model ===")
print(f"Accuracy: {accuracy_score(y_test, y_pred_baseline):.3f}")
print(classification_report(y_test, y_pred_baseline))

print("\n=== Tuned Model ===")
print(f"Accuracy: {accuracy_score(y_test, y_pred_tuned):.3f}")
print(classification_report(y_test, y_pred_tuned))

=== Baseline Model ===
Accuracy: 0.876
              precision    recall  f1-score   support

           0       0.90      0.87      0.88      1900
           1       0.96      0.96      0.96      1900
           2       0.84      0.81      0.82      1900
           3       0.82      0.87      0.84      1900

    accuracy                           0.88      7600
   macro avg       0.88      0.88      0.88      7600
weighted avg       0.88      0.88      0.88      7600


=== Tuned Model ===
Accuracy: 0.878
              precision    recall  f1-score   support

           0       0.90      0.87      0.89      1900
           1       0.96      0.95      0.96      1900
           2       0.84      0.81      0.83      1900
           3       0.82      0.88      0.85      1900

    accuracy                           0.88      7600
   macro avg       0.88      0.88      0.88      7600
weighted avg       0.88      0.88      0.88      7600



## 5. Reflections

Using BERT embeddings with Logistic Regression as the classifier, trained on 20% of the training data, gave an accuracy of 0.876 for the baseline model and 0.878 for the tuned model. Setting `C` to 0.1 and `max_iter` to 5000 improved accuracy only slightly, which may reflect somewhat better generalisation, although a difference of 0.002 is small and the per-class precision, recall and F1 scores are nearly identical. We assume the effect of the tuning would be larger if we used more than 20% of the data set, but we have not tested this.

Regrettably, we finished Part I first without considering how long the code for Part II would take to run, so the results are hard to compare directly: the model in Part I was trained on all of the data, while the model in Part II was trained on only 20% of it. In terms of speed, the BERT approach is significantly slower than BoW: about 11 minutes for BoW on an Apple M1 Pro chip versus over 20 minutes for BERT with just 20% of the data. This is expected, since every text has to be passed through a large transformer model to obtain its embedding. The embeddings capture richer semantic information than BoW features, but at a higher computational cost.
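To put the gap between 0.876 and 0.878 in perspective, the following sketch estimates the sampling uncertainty of an accuracy measured on 7,600 test articles. It uses the simple binomial standard error, which assumes independent test examples; the function name is hypothetical and this is a rough check, not a formal significance test.

```python
import math


def accuracy_standard_error(accuracy: float, n_samples: int) -> float:
    """Binomial standard error of an accuracy estimated from n_samples examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


n_test = 7600  # size of the AG News test split used above
for name, acc in [("baseline", 0.876), ("tuned", 0.878)]:
    se = accuracy_standard_error(acc, n_test)
    print(f"{name}: accuracy {acc:.3f}, standard error {se:.4f}")
```

Both standard errors come out just under 0.004, so the 0.002 difference is smaller than one standard error. Because both models are scored on the same test articles, a paired comparison such as McNemar's test would be the more rigorous check; the sketch above does not do that.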