**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part II: BERT

Please see the description of the assignment in the README file (section 2) <br>
**Guide notebook**: [guides/bert_guide.ipynb](guides/bert_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: how do they compare with the results from Part I, BoW? Are there any hyperparameters that are particularly important?

* You should follow the steps given in the `bert_guide` notebook

* **Optionally**, you can fine-tune a pre-trained BERT model to classify news articles, as is done in [guides/bert_guide_finetuning.ipynb](guides/bert_guide_finetuning.ipynb). This is the same task as in Part I. Because fine-tuning requires more computational resources, this part is optional. If you do decide to complete it, you will need a GPU to train the model. (For reference, training on a 2020 MacBook Pro with 16GB RAM and an M1 chip results in an out-of-memory error.) We therefore suggest that you use Google Colab or another cloud-based service with a GPU. You can easily upload the `bert_guide_finetuning.ipynb` notebook to Google Colab and run it there.

<br>

***

In [2]:
# imports for the project
import pandas as pd
import numpy as np
import torch
from transformers import BertTokenizer, BertModel, pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
from datasets import load_dataset, DatasetDict




### 1. Load the data

We can load this data directly from the Hugging Face Hub with [Hugging Face Datasets](https://huggingface.co/docs/datasets/). Pretty neat!

**Note**: This cell will download the dataset and keep it in memory. If you run this cell multiple times, it will download the dataset multiple times.

You are welcome to increase the `frac` parameter to load more data.

The splits are reduced to 10% of the data because extracting the embeddings takes a very long time.

In [3]:

# Use a smaller subset of the data because of the long processing time
ag_news_train = load_dataset("fancyzhx/ag_news", split="train[:10%]", keep_in_memory=True)  # first 10% of the training data
ag_news_test = load_dataset("fancyzhx/ag_news", split="test[:10%]", keep_in_memory=True)  # first 10% of the test data

ag_news = DatasetDict({
    "train": ag_news_train,
    "test": ag_news_test
})

ag_news

Using the latest cached version of the dataset since fancyzhx/ag_news couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at C:\Users\vald0\.cache\huggingface\datasets\fancyzhx___ag_news\default\0.0.0\eb185aade064a813bc0b7f42de02595523103ca4 (last modified on Sun Mar 30 13:41:37 2025).


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 12000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 760
    })
})

In [4]:

print(torch.cuda.is_available())

# Load the tokenizer and model
embedder = pipeline(
    model="answerdotai/ModernBERT-base",      # model used for embedding
    tokenizer="answerdotai/ModernBERT-base",  # tokenizer used for embedding
    task="feature-extraction",                # feature extraction task (returns embeddings)
    device=0 if torch.cuda.is_available() else -1  # use GPU 0 if available, otherwise CPU
)

False


Device set to use cpu


Extracting the embeddings

In [6]:
def get_embeddings(data):
    """ Extract the [CLS] embedding for each text. """
    embeddings = embedder(data["text"])  # Full token embeddings
    cls_embeddings = [e[0][0] for e in embeddings]  # Extract first token ([CLS])
    return {"embeddings": cls_embeddings}

ag_news = ag_news.map(get_embeddings, batched=True, batch_size=8)

Map: 100%|██████████| 12000/12000 [16:53<00:00, 11.84 examples/s]
Map: 100%|██████████| 760/760 [01:06<00:00, 11.37 examples/s]
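
The indexing `e[0][0]` in `get_embeddings` can be illustrated on a small mock. This is a minimal sketch that assumes the feature-extraction output for each text is a nested list shaped `[1][num_tokens][hidden_dim]`; the mock values and the helper names `cls_embedding` and `mean_embedding` are made up for illustration. Mean pooling is shown only as a common alternative and is not what this notebook uses.

```python
# Mock of the nested-list output for two texts: [text][batch=1][token][hidden_dim].
# Real embeddings have hidden_dim=768; 3 is used here to keep the example readable.
mock_output = [
    [[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]],  # text with 3 tokens
    [[[1.0, 1.1, 1.2], [1.3, 1.4, 1.5]]],                   # text with 2 tokens
]


def cls_embedding(e):
    """Return the vector of the first token, as e[0][0] does in the notebook."""
    return e[0][0]


def mean_embedding(e):
    """Alternative (not used in the notebook): average over all token vectors."""
    tokens = e[0]
    dim = len(tokens[0])
    return [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]


cls_vectors = [cls_embedding(e) for e in mock_output]
mean_vectors = [mean_embedding(e) for e in mock_output]

print(cls_vectors)
print(mean_vectors)
```

Either way, each text ends up as one fixed-length vector, which is what the classifier below needs as input.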


Show embeddings as a feature

In [7]:
ag_news

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'embeddings'],
        num_rows: 12000
    })
    test: Dataset({
        features: ['text', 'label', 'embeddings'],
        num_rows: 760
    })
})

Extract the features and labels for our training and test splits

In [10]:
X_train = np.array(ag_news["train"]["embeddings"])  # Feature embeddings
y_train = np.array(ag_news["train"]["label"])       # Labels

X_test = np.array(ag_news["test"]["embeddings"])
y_test = np.array(ag_news["test"]["label"])

# Check shapes
print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")


X_train shape: (12000, 768), y_train shape: (12000,)
X_test shape: (760, 768), y_test shape: (760,)


Train a classifier


In [None]:
lr = LogisticRegression(max_iter=5000)

lr.fit(X_train, y_train)

y_pred_train = lr.predict(X_train)

print(classification_report(y_train, y_pred_train))

              precision    recall  f1-score   support

           0       0.94      0.92      0.93      2976
           1       0.97      0.99      0.98      2789
           2       0.89      0.88      0.88      3039
           3       0.90      0.90      0.90      3196

    accuracy                           0.92     12000
   macro avg       0.92      0.92      0.92     12000
weighted avg       0.92      0.92      0.92     12000



STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
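
The convergence warning above suggests scaling the features. A minimal sketch of that suggestion follows, assuming synthetic data from `make_classification` in place of the real embeddings. The names `X_demo`, `y_demo`, and `scaled_lr` are illustrative and not part of the notebook, and whether scaling removes the warning on the actual embeddings was not tested here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 4-class data standing in for the embeddings (assumption, not the real data).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=10, n_classes=4, random_state=0
)
X_demo[:, :5] *= 1000.0  # exaggerate the scale differences between features

# Standardize each feature to zero mean and unit variance before the classifier.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_demo, y_demo)

train_accuracy = scaled_lr.score(X_demo, y_demo)
print(f"Training accuracy on the synthetic data: {train_accuracy:.2f}")
```

Putting the scaler inside the pipeline means it is fitted on the training data only, so the same object can later be passed to `predict` on the test split or to a cross-validated search without leaking test statistics.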


Predict

In [14]:
y_pred = lr.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.87      0.88      0.87       197
           1       0.95      0.94      0.95       199
           2       0.81      0.84      0.83       158
           3       0.88      0.85      0.86       206

    accuracy                           0.88       760
   macro avg       0.88      0.88      0.88       760
weighted avg       0.88      0.88      0.88       760
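
The `macro avg` and `weighted avg` rows of the report can be reproduced by hand from the per-class rows. The sketch below uses the rounded F1 scores and supports printed above, so the results are approximate; the variable names are illustrative.

```python
# (f1-score, support) per class, copied from the test-set report above.
per_class = {0: (0.87, 197), 1: (0.95, 199), 2: (0.83, 158), 3: (0.86, 206)}

# Macro average: every class counts equally.
macro_f1 = sum(f1 for f1, _ in per_class.values()) / len(per_class)

# Weighted average: every class counts in proportion to its support.
total_support = sum(support for _, support in per_class.values())
weighted_f1 = sum(f1 * support for f1, support in per_class.values()) / total_support

print(f"macro F1    ~ {macro_f1:.4f}")
print(f"weighted F1 ~ {weighted_f1:.4f}")
print(f"total support = {total_support}")
```

Both values come out at roughly 0.88, matching the report. The two averages are close here because the four classes have similar support.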



### Tuning

In [None]:

##hyperparameter tuning with GridSearchCV
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],  # Regularization type
    'solver': ['liblinear', 'saga'],  # Solver for logistic regression
    'max_iter': [1000, 5000, 7000],  # Number of iterations for convergence
}

grid = GridSearchCV(lr, param_grid, cv=5, scoring='f1_macro')  # plain 'f1' only supports binary targets
grid.fit(X_train, y_train)
# Print the best hyperparameters and score
print("Best hyperparameters:", grid.best_params_)
print("Best cross-validation macro F1: {:.2f}".format(grid.best_score_))

# Use the best estimator to predict and evaluate on the test set
best_lr = grid.best_estimator_
y_pred_grid = best_lr.predict(X_test)
print(classification_report(y_test, y_pred_grid))

Second classifier prediction

In [None]:
# Import the random forest classifier
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

y_pred_rf = rf.predict(X_test)
print(classification_report(y_test, y_pred_rf))

              precision    recall  f1-score   support

           0       0.79      0.80      0.80       197
           1       0.88      0.91      0.90       199
           2       0.68      0.73      0.70       158
           3       0.80      0.72      0.76       206

    accuracy                           0.79       760
   macro avg       0.79      0.79      0.79       760
weighted avg       0.80      0.79      0.79       760



In [None]:
# parameter grid for RandomForestClassifier
param_grid_rf = {
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'n_estimators': [50, 100, 200]
}

grid_rf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid_rf, cv=5, scoring='f1_macro')  # plain 'f1' only supports binary targets
grid_rf.fit(X_train, y_train)

print("Best hyperparameters:", grid_rf.best_params_)
print("Best cross-validation macro F1: {:.2f}".format(grid_rf.best_score_))

# Evaluate the best estimator on the test set
best_rf = grid_rf.best_estimator_
y_pred_rf_grid = best_rf.predict(X_test)
print(classification_report(y_test, y_pred_rf_grid))
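
The cost of the two grid searches can be estimated before running them by counting the model fits they imply. The sketch below copies the two parameter grids from the cells above; the helper `count_fits` is hypothetical and only does the arithmetic.

```python
from itertools import product

# Grids copied from the tuning cells above.
lr_grid = {
    "C": [0.01, 0.1, 1, 10, 100],
    "penalty": ["l1", "l2"],
    "solver": ["liblinear", "saga"],
    "max_iter": [1000, 5000, 7000],
}
rf_grid = {
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "n_estimators": [50, 100, 200],
}


def count_fits(grid, cv):
    """Number of candidate combinations times the number of CV folds."""
    n_candidates = len(list(product(*grid.values())))
    return n_candidates * cv


lr_fits = count_fits(lr_grid, cv=5)  # 5 * 2 * 2 * 3 = 60 candidates, 5 folds each
rf_fits = count_fits(rf_grid, cv=5)  # 4 * 3 * 3 = 36 candidates, 5 folds each

print(f"Logistic regression: {lr_fits} fits")
print(f"Random forest: {rf_fits} fits")
```

With its default `refit=True`, `GridSearchCV` adds one more fit of the best candidate on the full training set. Dropping `max_iter` from the logistic regression grid alone would cut the number of candidates by a factor of three.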

## Reflections
Only different `C` values were tried for the logistic regression, because the grid search did not finish when it covered too many hyperparameters.

Logistic regression:
              precision    recall  f1-score   support

           0       0.87      0.88      0.87       197
           1       0.95      0.94      0.95       199
           2       0.81      0.84      0.83       158
           3       0.88      0.85      0.86       206

    accuracy                           0.88       760
   macro avg       0.88      0.88      0.88       760
weighted avg       0.88      0.88      0.88       760



Random forest:
              precision    recall  f1-score   support

           0       0.79      0.80      0.80       197
           1       0.88      0.91      0.90       199
           2       0.68      0.73      0.70       158
           3       0.80      0.72      0.76       206

    accuracy                           0.79       760
   macro avg       0.79      0.79      0.79       760
weighted avg       0.80      0.79      0.79       760

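Logistic regression scored better than the random forest on the test set (0.88 vs. 0.79 accuracy), and that was without scaling the embeddings.

Extracting the BERT embeddings requires a lot of processing time and compute: on CPU, the 12,000 training examples took about 17 minutes.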