# 1. Activating GPU and installing dependencies

Because Colab does not have the required libraries preinstalled, we install them manually:

- Transformers: A library for state-of-the-art natural language processing (NLP) with pretrained models.
- Accelerate: Facilitates easy multi-GPU and multi-TPU training for deep learning models.
- Datasets: A library for easily accessing and managing large datasets, particularly for NLP tasks.
- Git-lfs: Git extension for versioning large files.

In [None]:
!pip install -U transformers accelerate
!pip install datasets huggingface_hub
!apt-get install -y git-lfs

Collecting accelerate
  Downloading accelerate-0.31.0-py3-none-any.whl (309 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m309.4/309.4 kB[0m [31m7.9 MB/s[0m eta [36m0:00:00[0m
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.w

In [None]:
# Activate GPU for faster training by clicking on 'Runtime' > 'Change runtime type' and then selecting GPU as the Hardware accelerator
# Then check if GPU is available
from datasets import load_dataset
import torch
import accelerate
import transformers
from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification
import numpy as np
from datasets import load_metric
from transformers import TrainingArguments, Trainer, TrainerCallback
from scipy.special import softmax
from sklearn.preprocessing import label_binarize
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, confusion_matrix
import pandas as pd
from collections import Counter

print("accelerate: ", accelerate.__version__)
print("transformers: ", transformers.__version__)
print("cuda is available: ", torch.cuda.is_available())

accelerate:  0.31.0
transformers:  4.41.2
cuda is available:  True


We stored a HuggingFace API token in the Colab secret 'HF_TOKEN' so that we can save models to our HuggingFace account and load them back from it. The token mechanism is described in detail here: https://huggingface.co/docs/hub/en/security-tokens

In [None]:
from google.colab import userdata
import os
from huggingface_hub import notebook_login
os.environ['HUGGINGFACE_HUB_TOKEN'] = userdata.get('HF_TOKEN')
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…
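
Outside Colab, `userdata` is not available, so the token has to come from somewhere else. The sketch below is one possible fallback that reads it from an environment variable and fails with a clear message when it is missing. The helper `get_hf_token` and the variable name `HF_TOKEN_DEMO` are hypothetical and only serve the illustration.

```python
import os

def get_hf_token(env_var="HF_TOKEN"):
    """Return the token stored in `env_var`, or raise a clear error if it is missing."""
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to a HuggingFace access token."
        )
    return token

# Placeholder value for the demonstration only, not a real token
os.environ["HF_TOKEN_DEMO"] = "hf_dummy_token_for_illustration"
token = get_hf_token("HF_TOKEN_DEMO")
print(token[:3])
```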

In the next cell we connect our Google Drive to Google Colab. We will store the results of the BERT fine-tuning on Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# 2. Preprocessing data

First, we download the dataset from HuggingFace.

Note that our final report gives this link for the dataset: https://zenodo.org/records/10231028 . The two datasets are the same; we used HuggingFace's download mechanism for convenience.

In [None]:
# Load data
news_media_bias = load_dataset("newsmediabias/news-bias-full-data")

Downloading readme:   0%|          | 0.00/4.98k [00:00<?, ?B/s]

Repo card metadata block was not found. Setting CardData to empty.


Downloading data:   0%|          | 0.00/1.03G [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/7.40M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/3674927 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/50001 [00:00<?, ? examples/s]

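Second, we choose the data for training. We keep only the rows whose aspect is 'Racial' or 'Geographical', drop rows without text, remove the columns we do not need, map the three labels to integers, and sample 80,000 training examples.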

In [None]:
def filter_none_text(example):
    return example['text'] is not None

# chosen_aspects = ['Racial', 'Xenophobia', 'Nation stereotype', 'Religious', 'Geographical']
chosen_aspects = ['Racial', 'Geographical']
unused_columns = ['dimension', 'biased_words', 'aspect', 'sentiment', 'toxic', 'identity_mention']

def prepare_split(split):
    # Keep rows with a chosen aspect and non-empty text, then drop the columns we do not need
    return (
        split
        .filter(lambda row: row['aspect'] in chosen_aspects)
        .filter(filter_none_text)
        .remove_columns(unused_columns)
    )

train_dataset = prepare_split(news_media_bias['train'])
test_dataset = prepare_split(news_media_bias['test'])

label_mapping = {
    'Neutral': 0,
    'Slightly Biased': 1,
    'Highly Biased': 2
}

def map_labels(example):
    example['label'] = label_mapping[example['label']]
    return example

# Apply the function to the dataset
train_dataset = train_dataset.map(map_labels)
test_dataset = test_dataset.map(map_labels)

# small_train_dataset = train_dataset.shuffle(seed = 42).select([i for i in list(range(3000))])
# small_test_dataset = test_dataset.shuffle(seed = 42)

samples_num = 80000
sampled_train_dataset = train_dataset.shuffle(seed = 42).select(range(samples_num))

# Set DistilBERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
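
To make the filtering and label mapping concrete, here is a minimal sketch on a handful of in-memory rows, using plain Python instead of the `datasets` API. The rows (including the 'Gender' aspect) are made up for illustration; only `label_mapping` and `chosen_aspects` come from the cell above. Counting the mapped labels with `Counter` is also a quick way to check class balance before training.

```python
from collections import Counter

label_mapping = {"Neutral": 0, "Slightly Biased": 1, "Highly Biased": 2}
chosen_aspects = ["Racial", "Geographical"]

# Hypothetical rows that mimic the columns used in the notebook
rows = [
    {"text": "example a", "aspect": "Racial", "label": "Neutral"},
    {"text": None, "aspect": "Racial", "label": "Highly Biased"},
    {"text": "example b", "aspect": "Gender", "label": "Slightly Biased"},
    {"text": "example c", "aspect": "Geographical", "label": "Highly Biased"},
    {"text": "example d", "aspect": "Racial", "label": "Neutral"},
]

# Keep rows with a chosen aspect and non-empty text, and map labels to integers
kept = [
    {"text": r["text"], "label": label_mapping[r["label"]]}
    for r in rows
    if r["aspect"] in chosen_aspects and r["text"] is not None
]

# Class balance of the rows that survive the filters
label_counts = Counter(r["label"] for r in kept)
print(len(kept), dict(label_counts))
```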

In [None]:
# Prepare the text inputs for the model
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation = True, padding = True)

tokenized_dataset = sampled_train_dataset.map(preprocess_function, batched = True)

# Use a data collator to convert our samples to PyTorch tensors and pad each batch to the length of its longest sample
data_collator = DataCollatorWithPadding(tokenizer = tokenizer)

Map:   0%|          | 0/80000 [00:00<?, ? examples/s]
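
The idea behind padding a batch to its longest sample can be sketched in a few lines of plain Python. This is an illustration of the concept only, not the implementation of `DataCollatorWithPadding`; the helper `pad_batch`, the token ids, and the default `pad_id = 0` are assumptions made for the example.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in `batch` to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two made-up token id sequences of different lengths
batch = [[101, 2023, 102], [101, 2023, 2003, 1037, 102]]
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Because padding depends on the longest sample in each batch, short batches stay short, which saves computation compared with padding everything to one global length.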

# 3. Evaluating the model

In [None]:
# Define the evaluation metrics
def compute_metrics(eval_pred):
    load_accuracy = load_metric("accuracy")
    load_f1 = load_metric("f1")

    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = load_accuracy.compute(predictions = predictions, references = labels)["accuracy"]
    f1 = load_f1.compute(predictions = predictions, references = labels, average = 'weighted')["f1"]
    precision = precision_score(labels, predictions, average = 'weighted')
    recall = recall_score(labels, predictions, average = 'weighted')

    return {"accuracy": accuracy, "f1": f1, "precision": precision, "recall": recall}
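
The scores above use support-weighted averaging. As a rough illustration of what that means, the sketch below recomputes the four metrics from a confusion matrix using NumPy only. The names `weighted_metrics` and `compute_metrics_np` and the toy logits are hypothetical and not part of the notebook; the results are expected to match scikit-learn's `average = 'weighted'` scores, but that is an assumption rather than something checked here.

```python
import numpy as np

def weighted_metrics(labels, predictions, num_classes=3):
    """Accuracy and support-weighted precision/recall/F1 from a confusion matrix."""
    labels = np.asarray(labels)
    predictions = np.asarray(predictions)
    # cm[i, j] counts samples with true class i that were predicted as class j
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (labels, predictions), 1)

    tp = np.diag(cm).astype(float)
    support = cm.sum(axis=1).astype(float)    # true samples per class
    predicted = cm.sum(axis=0).astype(float)  # predicted samples per class

    # Per-class scores; a class with a zero denominator gets a score of 0
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, support, out=np.zeros_like(tp), where=support > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)

    weights = support / support.sum()
    return {
        "accuracy": float(tp.sum() / cm.sum()),
        "precision": float(np.sum(weights * precision)),
        "recall": float(np.sum(weights * recall)),
        "f1": float(np.sum(weights * f1)),
    }

def compute_metrics_np(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return weighted_metrics(labels, predictions)

# Toy logits for four samples and three classes
toy_logits = np.array([[2.0, 0.1, 0.0], [0.2, 1.5, 0.1], [0.1, 0.2, 3.0], [1.0, 0.9, 0.0]])
toy_labels = np.array([0, 1, 2, 1])
scores = compute_metrics_np((toy_logits, toy_labels))
print(scores)
```

Weighting each class's recall by its support makes the supports cancel, so weighted recall reduces to the number of correct predictions divided by the number of samples, which is accuracy. This is consistent with the identical Accuracy and Recall columns in the training logs.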

In [None]:
# Custom callback to log metrics after each epoch
class LoggingCallback(TrainerCallback):
    def __init__(self, fold, results_list):
        self.fold = fold
        self.results_list = results_list

    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        if metrics:
            self.results_list.append({
                "fold": self.fold + 1,
                "epoch": state.epoch,
                "accuracy": metrics.get("eval_accuracy"),
                "precision": metrics.get("eval_precision"),
                "recall": metrics.get("eval_recall"),
                "f1": metrics.get("eval_f1")
            })

In [None]:
# Prepare for cross-validation
fold_num = 3
kf = KFold(n_splits = fold_num, shuffle = True, random_state = 42)
fold_results = []

# Set hyperparameters
lr = 2e-5
batch_size = 64
num_epochs = 5

# Cross-validation
for fold, (train_index, val_index) in enumerate(kf.split(tokenized_dataset)):
    print(f"Training fold {fold + 1}")

    repo_name = f"BERT-racial_bias_model_{samples_num/1000}K_samples_fold_{fold}"

    train_fold = tokenized_dataset.select(train_index)
    val_fold = tokenized_dataset.select(val_index)

    model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels = 3)

    training_args = TrainingArguments(
        output_dir = repo_name,
        learning_rate = lr,
        per_device_train_batch_size = batch_size,
        per_device_eval_batch_size = batch_size,
        num_train_epochs = num_epochs,
        weight_decay = 0.01,
        evaluation_strategy = "epoch",
        save_strategy = "epoch"
    )

    trainer = Trainer(
        model = model,
        args = training_args,
        train_dataset = train_fold,
        eval_dataset = val_fold,
        tokenizer = tokenizer,
        data_collator = data_collator,
        compute_metrics = compute_metrics,
        callbacks = [LoggingCallback(fold, fold_results)]
    )

    trainer.train()

    trainer.push_to_hub()

# Create a DataFrame with the results
df_results = pd.DataFrame(fold_results)

Training fold 1


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.5892,0.429513,0.815502,0.815885,0.816644,0.815502
2,0.3645,0.379531,0.845277,0.845204,0.845261,0.845277
3,0.2725,0.373976,0.856339,0.856409,0.856608,0.856339
4,0.2123,0.381028,0.858214,0.858652,0.859297,0.858214
5,0.176,0.412025,0.858102,0.858878,0.859866,0.858102


  load_accuracy = load_metric("accuracy")
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


Downloading builder script:   0%|          | 0.00/1.65k [00:00<?, ?B/s]



Downloading builder script:   0%|          | 0.00/2.32k [00:00<?, ?B/s]


events.out.tfevents.1718005879.6843fc8259a2.4589.2:   0%|          | 0.00/9.33k [00:00<?, ?B/s]

Upload 3 LFS files:   0%|          | 0/3 [00:00<?, ?it/s]

training_args.bin:   0%|          | 0.00/5.18k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Training fold 2


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.5993,0.434354,0.819702,0.817849,0.816452,0.819702
2,0.3654,0.377836,0.844339,0.843117,0.842234,0.844339
3,0.2747,0.388429,0.848689,0.849921,0.851785,0.848689
4,0.2137,0.401987,0.855177,0.855355,0.855793,0.855177
5,0.1757,0.424132,0.854577,0.855041,0.85565,0.854577



Upload 3 LFS files:   0%|          | 0/3 [00:00<?, ?it/s]

events.out.tfevents.1718009407.6843fc8259a2.4589.3:   0%|          | 0.00/9.33k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

training_args.bin:   0%|          | 0.00/5.18k [00:00<?, ?B/s]

Training fold 3


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.5959,0.432599,0.814408,0.812913,0.811712,0.814408
2,0.3665,0.383183,0.841409,0.841826,0.842372,0.841409
3,0.2709,0.393084,0.847784,0.848314,0.849825,0.847784
4,0.2125,0.418038,0.849621,0.850034,0.851529,0.849621
5,0.1731,0.414479,0.856671,0.857563,0.858635,0.856671



Upload 3 LFS files:   0%|          | 0/3 [00:00<?, ?it/s]

training_args.bin:   0%|          | 0.00/5.18k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

events.out.tfevents.1718012941.6843fc8259a2.4589.4:   0%|          | 0.00/9.33k [00:00<?, ?B/s]
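
For a sense of scale, the fold sizes and the number of optimizer steps can be worked out from the settings above. The sketch below assumes a single GPU, no gradient accumulation, and that the last incomplete batch is kept; `kfold_sizes` is a hypothetical helper that mirrors how scikit-learn's `KFold` distributes the remainder across folds.

```python
import math

def kfold_sizes(n_samples, n_splits):
    """Validation-fold sizes: the first n_samples % n_splits folds get one extra sample."""
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]

samples_num, fold_num, batch_size, num_epochs = 80000, 3, 64, 5

for fold, val_size in enumerate(kfold_sizes(samples_num, fold_num)):
    train_size = samples_num - val_size
    steps_per_epoch = math.ceil(train_size / batch_size)
    total_steps = steps_per_epoch * num_epochs
    print(f"fold {fold + 1}: train={train_size}, val={val_size}, "
          f"steps/epoch={steps_per_epoch}, total steps={total_steps}")
```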

In [None]:
df_results

Unnamed: 0,fold,epoch,accuracy,precision,recall,f1
0,1,1.0,0.815502,0.816644,0.815502,0.815885
1,1,2.0,0.845277,0.845261,0.845277,0.845204
2,1,3.0,0.856339,0.856608,0.856339,0.856409
3,1,4.0,0.858214,0.859297,0.858214,0.858652
4,1,5.0,0.858102,0.859866,0.858102,0.858878
5,2,1.0,0.819702,0.816452,0.819702,0.817849
6,2,2.0,0.844339,0.842234,0.844339,0.843117
7,2,3.0,0.848689,0.851785,0.848689,0.849921
8,2,4.0,0.855177,0.855793,0.855177,0.855355
9,2,5.0,0.854577,0.85565,0.854577,0.855041
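
A common way to summarise cross-validation is the mean and standard deviation of the final-epoch scores across folds. The sketch below does this with the standard library only; the numbers are copied from the epoch-5 rows of the three training logs, and the dictionary layout is just an assumption for the example.

```python
from statistics import mean, stdev

# Final-epoch (epoch 5) validation scores copied from the training logs
final_epoch = {
    1: {"accuracy": 0.858102, "f1": 0.858878},
    2: {"accuracy": 0.854577, "f1": 0.855041},
    3: {"accuracy": 0.856671, "f1": 0.857563},
}

summary = {}
for metric in ("accuracy", "f1"):
    values = [scores[metric] for scores in final_epoch.values()]
    # Sample standard deviation across the three folds
    summary[metric] = (mean(values), stdev(values))
    print(f"{metric}: mean={summary[metric][0]:.4f}, std={summary[metric][1]:.4f}")
```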


In [None]:
df_results.to_csv(f"/content/drive/MyDrive/Colab Notebooks/M2/data/results/results_bert/lr_{lr}&batch_{batch_size}&data_{samples_num/1000}K.csv", index = False)
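
Writing the CSV fails if the target folder does not exist yet. The sketch below shows one way to create the folder first and build the same file name; it writes into a temporary directory instead of Drive, and `results_path` is a hypothetical helper, not part of the notebook.

```python
import tempfile
from pathlib import Path

def results_path(base_dir, lr, batch_size, samples_num):
    """Create `base_dir` if needed and return the CSV path for this run."""
    base = Path(base_dir)
    base.mkdir(parents=True, exist_ok=True)
    return base / f"lr_{lr}&batch_{batch_size}&data_{samples_num/1000}K.csv"

with tempfile.TemporaryDirectory() as tmp:
    path = results_path(Path(tmp) / "results_bert", 2e-5, 64, 80000)
    dir_exists = path.parent.is_dir()
    print(path.name)  # lr_2e-05&batch_64&data_80.0K.csv
```

Note that `samples_num/1000` is a float division, so the name contains `80.0K` rather than `80K`.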