# Insert Title Here
**DATA102 S11 Group 3**
- Banzon, Beatrice Elaine B.
- Buitre, Cameron
- Marcelo, Andrea Jean C.
- Navarro, Alyssa Riantha R.
- Vicente, Francheska Josefa

## Requirements and Imports

Before starting, the libraries and files needed to build and train the models must be loaded into the notebook.

### Imports
Several libraries are required to perform a thorough analysis of the dataset. Each of these libraries will be imported and described below:

**Basic Libraries**

Import `numpy`, `pandas`, and `datasets`.

* `numpy` contains a large collection of mathematical functions
* `pandas` contains functions that are designed for data manipulation and data analysis
* `datasets` contains functions that allow easier pre-processing for datasets and smart caching for easier loading of data

In [1]:
import numpy as np
import pandas as pd
import datasets

**Machine Learning Libraries**

`train_test_split` is a function that randomly splits a dataset into two subsets.

In [2]:
from sklearn.model_selection import train_test_split

Meanwhile, the following imports are used to create the dataset:
* `torch` is an open-source ML library for building deep neural networks
* `Dataset` and `DataLoader` are two data primitives that make loading and using datasets easier
* `RandomSampler` and `SequentialSampler` are samplers that are used by the `DataLoader`
* `ProgressBarBase` and `RichProgressBar` are components that show the progress bar while the models are training

In [3]:
import torch
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
from pytorch_lightning.callbacks import ProgressBarBase, RichProgressBar

The next imports are from `transformers`, which contains pre-trained models and tokenizers that can be fine-tuned.
* `AutoTokenizer` automatically creates the tokenizer based on the architecture passed
* `AutoModelForSequenceClassification` automatically instantiates a sequence classification model based on the type of model passed
* `TrainerCallback` is an object that determines how the training loop will behave
* `TrainingArguments` is a dataclass that allows the customization of the arguments in training
* `Trainer` is a class that has a complete training and validation loop

In [4]:
from transformers import (AutoTokenizer, AutoModelForSequenceClassification, 
                          TrainerCallback, TrainingArguments, Trainer)

On the other hand, the following imports are used to compute the scores that measure how well a model performs.
* `f1_score` computes the balanced F-score by comparing the actual classes and the predicted classes
* `hamming_loss` computes the fraction of labels that were incorrectly labeled by the model
* `accuracy_score` computes the accuracy by determining how many classes were correctly predicted
* `EvalPrediction` is an object in transformers that holds the prediction of the model and the target output
* `evaluate` is a library that is used to evaluate and compare metrics
* `load_metric` is a function in the datasets library that allows different metrics to be loaded

In [5]:
from sklearn.metrics import f1_score, hamming_loss, accuracy_score
from transformers import EvalPrediction
import evaluate
from datasets import load_metric

Next, `optuna` is used to tune the hyperparameters of machine learning models.

In [6]:
import optuna

Last, `pickle` is a module that can serialize and deserialize objects. In this notebook, it is used to save and load models.

In [7]:
import pickle

### Datasets and Files
To train the BERT and RoBERTa models, let us load the cleaned dataset using the [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function.

In [8]:
df = pd.read_csv ('cleaned_data.csv')
df

Unnamed: 0,label,text
0,0,"Ayon sa TheWrap.com, naghain ng kaso si Krupa,..."
1,0,Kilala rin ang singer sa pagkumpas ng kanyang ...
2,0,"BLANTYRE, Malawi (AP) -- Bumiyahe patungong Ma..."
3,0,"Kasama sa programa ang pananalangin, bulaklak ..."
4,0,Linisin ang Friendship Department dahil dadala...
...,...,...
23130,0,The winner of the special election in Cavite t...
23131,0,The remains of four people inside the Cessna p...
23132,0,A Kabataan Party-list representative visited t...
23133,0,The Philippine Coast Guard is expected to have...


Before working with the data directly, we will set the **device** on which the model will run. If a CUDA-enabled device is available, CUDA is selected automatically. Otherwise, the model will run on the CPU.

In [9]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

## Preparing data for Feature Engineering

Before creating the features that the BERT and RoBERTa models will use for training, there are two steps that we must first do: (1) splitting the dataset into the train, val, and test sets, and (2) transforming our [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) into a [`Dataset`](https://pypi.org/project/datasets/). This would allow us to utilize the data for the training more easily.

### Splitting the Dataset into Train, Val, and Test Split
Let us first define the **X** (input) and **y** (target/output) of our model. This is done to allow the stratifying of the data when it is split into the train, val and test.

The **X** (input) can be retrieved by getting the `text` column in the original dataset.

In [10]:
X = df ['text']
X

0        Ayon sa TheWrap.com, naghain ng kaso si Krupa,...
1        Kilala rin ang singer sa pagkumpas ng kanyang ...
2        BLANTYRE, Malawi (AP) -- Bumiyahe patungong Ma...
3        Kasama sa programa ang pananalangin, bulaklak ...
4        Linisin ang Friendship Department dahil dadala...
                               ...                        
23130    The winner of the special election in Cavite t...
23131    The remains of four people inside the Cessna p...
23132    A Kabataan Party-list representative visited t...
23133    The Philippine Coast Guard is expected to have...
23134    National Bureau of Investigation-National Capi...
Name: text, Length: 23135, dtype: object

Meanwhile, the **y** value (i.e., the target that our models will learn to predict) is the `label` column.

In [11]:
y = df ['label']
y

0        0
1        0
2        0
3        0
4        0
        ..
23130    0
23131    0
23132    0
23133    0
23134    0
Name: label, Length: 23135, dtype: int64

Now that we have declared the input and the target output of our models, we can use the [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function to divide the dataset into two splits. Some things to note are: (1) the split is stratified based on the **y values**, (2) the value of the random state was set to 42 for reproducibility, and (3) the dataset is shuffled.

First, let us create the train and test sets. The test set is made up of 20% of the original dataset, which means that the second split is 80% of the original.

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,
                                                    stratify = y,
                                                    random_state = 42, 
                                                    shuffle = True)

Second, we will be splitting the remaining 80% of the original dataset into two: the train and val sets. The train set will be 90% of the second split, while the val set will be 10% of it. 

In [13]:
X_train, X_val, y_train, y_val = train_test_split(X_train, 
                                                  y_train, 
                                                  test_size = 0.1,
                                                  stratify = y_train,
                                                  random_state = 42, 
                                                  shuffle = True)
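
To make the effect of `stratify` more concrete, here is a minimal pure-Python sketch of stratified sampling. It is not scikit-learn's implementation: `stratified_split` and the toy labels are hypothetical names introduced only for illustration, and the per-class rounding is an assumption of this sketch.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction, seed=42):
    """Return (train_idx, test_idx) so that each class keeps about the same share in both parts."""
    rng = random.Random(seed)

    # Group the row indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                              # shuffle within each class
        n_test = round(len(indices) * test_fraction)      # same fraction taken from every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Hypothetical toy data with an 80/20 class ratio.
toy_labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(toy_labels, test_fraction=0.2)
train_counts = Counter(toy_labels[i] for i in train_idx)
test_counts = Counter(toy_labels[i] for i in test_idx)
print(train_counts, test_counts)
```

In this toy case, both parts keep the 80/20 ratio between the two labels, which is the property that stratifying on **y** is meant to preserve.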

To check that the inputs and outputs of each set have matching lengths, let us look at the shapes of the resulting Series.

In [14]:
print('Train')
print('Input  shape: ', X_train.shape)
print('Output shape: ', y_train.shape, '\n')

print('Val')
print('Input  shape: ', X_val.shape)
print('Output shape: ', y_val.shape, '\n')

print('Test')
print('Input  shape: ', X_test.shape)
print('Output shape: ', y_test.shape, '\n')

Train
Input  shape:  (16657,)
Output shape:  (16657,) 

Val
Input  shape:  (1851,)
Output shape:  (1851,) 

Test
Input  shape:  (4627,)
Output shape:  (4627,) 
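
The sizes printed above can be reproduced by hand. The sketch below assumes that the held-out part of each split is rounded up to a whole number of rows, which is consistent with the shapes shown; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_fraction):
    """Return (n_remaining, n_held_out), rounding the held-out part up."""
    n_held_out = math.ceil(n_samples * test_fraction)
    return n_samples - n_held_out, n_held_out

n_total = 23135
n_rest, n_test = split_sizes(n_total, 0.2)   # first split: 80% / 20%
n_train, n_val = split_sizes(n_rest, 0.1)    # second split: 90% / 10% of the remaining 80%
print(n_train, n_val, n_test)                # prints 16657 1851 4627
```

Relative to the full dataset, this corresponds to roughly 72% for training, 8% for validation, and 20% for testing.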



As we have already split the data into three (i.e., train, val, test) sets, we can now combine the **X** and **y** values per set through the use of [`concat`](https://pandas.pydata.org/docs/reference/api/pandas.concat.html). This is done for easier tokenizing of the dataset when using BERT and RoBERTa. In addition, using the [`reset_index`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html) function, we would also be resetting the index to make it sequential starting from 0. 

First, we would concatenate the **X** and **y** values of the train set.

In [15]:
train_df = pd.concat([X_train, y_train], axis = 1).reset_index(drop = True)
train_df

Unnamed: 0,text,label
0,Kahit nagluwag na noong Pebrero sa COVID-19 re...,0
1,Former Interior Undersecretary Jonathan Malaya...,0
2,Saker Message: No current Saker messages. Russ...,1
3,Japan’s Lost Black Hole Satellite Took This LA...,1
4,Home › SCIENCE & TECHNOLOGY › MOBILE PASSES DE...,1
...,...,...
16652,"TRUTH: No Apartheid in Israel, Says Black Sout...",1
16653,"Mark Warner, Virginia Ron Wyden, Oregon In b...",1
16654,WASHINGTON -- President Barack Obama joined ne...,0
16655,"Ayon sa TheWrap.com, naghain ng kaso si Krupa,...",0


Next, let us combine for the val (i.e., validation) set. 

In [16]:
val_df = pd.concat([X_val, y_val], axis = 1).reset_index(drop = True)
val_df

Unnamed: 0,text,label
0,"LABUAN BAJO — President Ferdinand ""Bongbong"" M...",0
1,"CorbettReport.com November 8, 2016 In Dougla...",1
2,WASHINGTON - The United States scrambled F-16 ...,0
3,Miami (CNN) There were few softballs Wednesday...,0
4,Trump and Sanders get all the attention for th...,0
...,...,...
1846,"The owner of a convenience store in Malvar, Ba...",0
1847,While the full field of Republican presidentia...,0
1848,WASHINGTON - US President Joe Biden called on ...,0
1849,Baltimore leaders say the first night of the c...,0


Last, we would also be doing these same steps to the test set. 

In [17]:
test_df = pd.concat([X_test, y_test], axis = 1).reset_index(drop = True)
test_df

Unnamed: 0,text,label
0,"OTTAWA, Canada - Canada's government struck a ...",0
1,Editor’s note: This press release is sponsored...,0
2,(CNN) Donald Trump and Bernie Sanders are conf...,0
3,The Alliance of Concerned Teachers (ACT) calle...,0
4,Manila's City Engineering Office has received ...,0
...,...,...
4622,The House Committee on Human Rights has approv...,0
4623,The Metro Rail Transit Line 3 (MRT-3) has impl...,0
4624,"October 28, 2016 at 9:00 PM Why would Putin ...",1
4625,PHNOM PENH — An eleven-year-old girl in Cambod...,0


Next, to ensure that there would be no **NaN** data when we train our models, we drop the rows that have at least one missing (**NA**) value using the [`dropna`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) function.

In [18]:
train_df = train_df.dropna (how = 'any')
train_df

Unnamed: 0,text,label
0,Kahit nagluwag na noong Pebrero sa COVID-19 re...,0
1,Former Interior Undersecretary Jonathan Malaya...,0
2,Saker Message: No current Saker messages. Russ...,1
3,Japan’s Lost Black Hole Satellite Took This LA...,1
4,Home › SCIENCE & TECHNOLOGY › MOBILE PASSES DE...,1
...,...,...
16652,"TRUTH: No Apartheid in Israel, Says Black Sout...",1
16653,"Mark Warner, Virginia Ron Wyden, Oregon In b...",1
16654,WASHINGTON -- President Barack Obama joined ne...,0
16655,"Ayon sa TheWrap.com, naghain ng kaso si Krupa,...",0


In [19]:
val_df = val_df.dropna (how = 'any').reset_index (drop = True)
val_df

Unnamed: 0,text,label
0,"LABUAN BAJO — President Ferdinand ""Bongbong"" M...",0
1,"CorbettReport.com November 8, 2016 In Dougla...",1
2,WASHINGTON - The United States scrambled F-16 ...,0
3,Miami (CNN) There were few softballs Wednesday...,0
4,Trump and Sanders get all the attention for th...,0
...,...,...
1845,"The owner of a convenience store in Malvar, Ba...",0
1846,While the full field of Republican presidentia...,0
1847,WASHINGTON - US President Joe Biden called on ...,0
1848,Baltimore leaders say the first night of the c...,0


In [20]:
test_df = test_df.dropna (how = 'any').reset_index (drop = True)
test_df

Unnamed: 0,text,label
0,"OTTAWA, Canada - Canada's government struck a ...",0
1,Editor’s note: This press release is sponsored...,0
2,(CNN) Donald Trump and Bernie Sanders are conf...,0
3,The Alliance of Concerned Teachers (ACT) calle...,0
4,Manila's City Engineering Office has received ...,0
...,...,...
4622,The House Committee on Human Rights has approv...,0
4623,The Metro Rail Transit Line 3 (MRT-3) has impl...,0
4624,"October 28, 2016 at 9:00 PM Why would Putin ...",1
4625,PHNOM PENH — An eleven-year-old girl in Cambod...,0


Next, using [`value_counts`](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html), let's check if the dataset is balanced (i.e., has roughly equal numbers of instances for each label).

In [21]:
train_df ['label'].value_counts ()

0    13359
1     3298
Name: label, dtype: int64

In [22]:
val_df ['label'].value_counts ()

0    1484
1     366
Name: label, dtype: int64

In [23]:
test_df ['label'].value_counts ()

0    3711
1     916
Name: label, dtype: int64

From this, we can see that real news and fake news are not balanced in our training, validation, and test data: in each set, roughly 80% of the rows have label 0 and roughly 20% have label 1.
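
As a quick check of the imbalance, the share of label 1 in each set can be computed directly from the counts printed above. The dictionary below simply copies those counts; the variable names are illustrative.

```python
# Counts copied from the value_counts outputs of the three sets.
label_counts = {
    "train": {0: 13359, 1: 3298},
    "val": {0: 1484, 1: 366},
    "test": {0: 3711, 1: 916},
}

minority_share = {}
for split_name, counts in label_counts.items():
    total = sum(counts.values())
    minority_share[split_name] = counts[1] / total
    print(f"{split_name}: {total} rows, share of label 1 = {minority_share[split_name]:.3f}")
```

The share is nearly identical across the three sets, which is what we would expect from the stratified splits.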

### Creation of Dataset
Since we have already created three different sets, we can now combine our DataFrames into a single dataset object. To do this, we first convert each set into its own dataset before combining them into one.

First, we would be converting our train DataFrame into a dataset. In this, it can be seen that there are **16,657** rows in our train dataset.

In [24]:
train_dataset = datasets.Dataset.from_pandas(train_df)
train_dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 16657
})

Next, we transform the val DataFrame as well. This results in a dataset with **1,850** rows, one fewer than the original val split because `dropna` removed a row.

In [25]:
val_dataset = datasets.Dataset.from_pandas(val_df)
val_dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 1850
})

Last is the test DataFrame, which would become a dataset with **4,627** rows.

In [26]:
test_dataset = datasets.Dataset.from_pandas(test_df)
test_dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 4627
})

As we now have a dataset form for all of our sets, we can now merge them together into one dataset.

In [27]:
dataset = datasets.DatasetDict({
    "train" : train_dataset, 
    "val" : val_dataset, 
    "test" : test_dataset
})

dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16657
    })
    val: Dataset({
        features: ['text', 'label'],
        num_rows: 1850
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 4627
    })
})

## Feature Engineering

Because we are done preparing our data, we can now start with transforming it into a form that the machine learning algorithms can understand through feature engineering. For this notebook, we will be utilizing tokenization, specifically through the use of BERT and RoBERTa tokenizers.

### Defining of Functions and Values
Before starting with the tokenizing itself, we will first have to define the needed functions and values. 

One of these values is the **MAX_LENGTH**, which determines the maximum input length that will be allowed by the model. The tokenizer uses it in two ways: (1) inputs that are longer than this length will be truncated to it, and (2) inputs that are shorter than this length will be padded until they reach it. For this notebook, **512** is set as the maximum length.

In [28]:
MAX_LENGTH = 512
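
The two behaviors described above can be sketched in plain Python. This is a simplified illustration, not the tokenizer's actual logic: `pad_or_truncate`, the token IDs, and `pad_id = 0` are assumptions made for the example, and a real tokenizer also takes care of special tokens when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]           # (1) truncate inputs that are too long
    mask = [1] * len(ids)                  # 1 marks a real token
    n_pad = max_length - len(ids)          # (2) pad inputs that are too short
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=5)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_length=5)
print(short_ids, short_mask)  # [101, 7592, 102, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [0, 1, 2, 3, 4] [1, 1, 1, 1, 1]
```

Every encoded example ends up with the same length, which is what allows the examples to be stacked into batches.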

In addition, the preprocessing function for an instance is created. In this function, a text is tokenized by the tokenizer (i.e., padded and truncated to the maximum length) and its corresponding label is transformed into a tensor. 

In [29]:
def preprocess_function(examples, tokenizer):
    encoding = tokenizer(examples["text"], add_special_tokens = True,
                         padding = "max_length", truncation = True, max_length = MAX_LENGTH)
    
    encoding["labels"] = torch.tensor(examples ['label'])
    return encoding

Last, the function that would call the preprocessing function on the dataset is defined. In this function, the dataset is also set into a **torch** format. 

In [30]:
def create_encoded_dataset (tokenizer):
    encoded_dataset = dataset.map(preprocess_function, 
                                  batched=True, 
                                  remove_columns=dataset['train'].column_names, 
                                  fn_kwargs = {"tokenizer": tokenizer})
    
    encoded_dataset.set_format("torch")
    
    return encoded_dataset

### Tokenizing with BERT
As our functions and values are ready, the tokenizer can be instantiated. Since we would be utilizing a BERT model, specifically the **bert-base-cased** model, we would be creating a tokenizer that can prepare the text data into the input accepted by the model. 

This can be done through the [`AutoTokenizer`](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoTokenizer) class and the `from_pretrained` function, since the model and the tokenizer that we want to use have already been pretrained.

In [31]:
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-cased', use_fast = False)

With this tokenizer, we will be encoding the dataset into the correct form that is needed by the BERT model.

In [32]:
bert_encoded_dataset = create_encoded_dataset (bert_tokenizer)


### Tokenizing with RoBERTa
Next, as we also want to use a pretrained RoBERTa model (i.e., **roberta-base**), we also have to do the same steps.

To start with, we need to create an instance of the tokenizer for the specific RoBERTa model.

In [33]:
roberta_tokenizer = AutoTokenizer.from_pretrained('roberta-base')

Since we already have an instance of the tokenizer, we can now use this tokenizer and the pre-processing function we defined previously to transform the dataset.

In [34]:
roberta_encoded_dataset = create_encoded_dataset (roberta_tokenizer)


## Modeling and Evaluation

As we have already created the features that we would be using for our models, we can now proceed with the modeling proper. For this project, we would be fine-tuning two pre-trained models: **BERT** and **RoBERTa**. 

### Defining of Functions and Values

Before we start with the training proper, we would need to define the functions that will be used for training and evaluating. 

First, we would be creating the function that would be used to compute the scores of the model. In this, we would be using four metrics to evaluate our models: (1) **F1 Macro Score**, (2) **Accuracy**, (3) **Precision**, and (4) **Recall**.

In [35]:
def compute_metrics(p: EvalPrediction):
    logits, labels = p
    predictions = np.argmax(logits, axis=-1)
    
    precision_metric = load_metric("precision")
    recall_metric = load_metric("recall")
    accuracy_metric = load_metric("accuracy")
    f1_metric = load_metric("f1")
    
    f1_macro_score = f1_metric.compute(predictions=predictions, references=labels, average="macro")
    accuracy_score = accuracy_metric.compute(predictions=predictions, references=labels)
    precision_score = precision_metric.compute(predictions=predictions, references=labels)
    recall_score = recall_metric.compute(predictions=predictions, references=labels)
    
    results = {
        'Accuracy' : accuracy_score ['accuracy'],
        'F1 Macro Score' : f1_macro_score ['f1'], 
        'Precision' : precision_score["precision"],
        'Recall' : recall_score["recall"]
    }
    
    return results
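
To clarify what these four scores measure, here is a small pure-Python sketch computed on hypothetical toy predictions. It does not use the metric objects loaded above; the helper names are illustrative. It assumes, as we understand the defaults, that precision and recall are reported for the positive class (label 1) when no averaging is specified, while the macro F1 score is the unweighted mean of the per-class F1 scores.

```python
def binary_counts(y_true, y_pred, positive):
    """Count true positives, false positives, and false negatives for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, fp, fn

def precision_recall_f1(y_true, y_pred, positive=1):
    tp, fp, fn = binary_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def macro_f1(y_true, y_pred):
    """Unweighted mean of the F1 score of every class."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(precision_recall_f1(y_true, y_pred, c)[2] for c in classes) / len(classes)

# Hypothetical toy labels and predictions.
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, _ = precision_recall_f1(y_true, y_pred, positive=1)
macro = macro_f1(y_true, y_pred)
print(accuracy, precision, recall, round(macro, 4))  # 0.6 0.5 0.5 0.5833
```

Because the macro average gives both classes the same weight, it is less dominated by the majority class than accuracy, which is useful for an imbalanced dataset such as this one.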

Second, we would be specifying the hyperparameter space that determines the possible hyperparameter values to be tuned. In this, only three hyperparameters would be considered for tuning: (1) the **learning rate**, (2) the **train batch size**, and (3) the **number of training epochs**.

Note that each trial draws one value per hyperparameter from these sets of values, and only three trials (i.e., three combinations) would be run for the tuning.

In [36]:
def optuna_hp_space(trial):
    return {
        "learning_rate": trial.suggest_categorical("learning_rate", [0.1, 0.01, 0.001]),
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [8, 16]),
        "num_train_epochs": trial.suggest_categorical("num_train_epochs", [2, 3, 4])
    }
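
To put the three trials in context, the full grid implied by the search space above can be enumerated with the standard library. This sketch only counts the combinations; it does not involve `optuna`, and the variable names are illustrative.

```python
from itertools import product

# The candidate values, copied from optuna_hp_space.
learning_rates = [0.1, 0.01, 0.001]
batch_sizes = [8, 16]
epoch_counts = [2, 3, 4]

search_space = list(product(learning_rates, batch_sizes, epoch_counts))
print(len(search_space))  # 18
print(search_space[0])    # (0.1, 8, 2)
```

With 18 possible combinations and only three trials, the tuning explores a small part of the space, so the best run found is the best among those tried rather than the best of the whole grid.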

### BERT Model
Now, we are ready to move on to training the BERT model. 

#### Model Training 

To start with, let us define the pre-trained model that we would be using. For BERT, [**bert-base-cased**](https://huggingface.co/bert-base-cased), a model that was pre-trained on a case-sensitive English corpus for masked language modeling (MLM), would be utilized.

In [37]:
model_checkpoint = 'bert-base-cased'

Let us create an instance of a BERT model using this pretrained model. 

As we would be fine-tuning this model to classify text (i.e., whether it is fake news or not), an instance of [`AutoModelForSequenceClassification`](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html#automodelforsequenceclassification) would be created specifically. It is also important to note that the input that it would accept is based on the **MAX_LENGTH** variable that we have previously declared, which has the value of **512**.

In [38]:
bert_model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint,
    num_labels = 2, 
    max_length = MAX_LENGTH
).to(device)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

Next, we would be defining the [`TrainingArguments`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) that the [`Trainer`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer) would be using. The parameters for the [`TrainingArguments`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) that are used for the training loop are as follows:
* `output_dir` indicates that the model predictions and checkpoints will be saved in the **bert_trainer** folder
* `save_steps` means that the checkpoint will be saved every **20,000** steps
* `save_strategy` specifies that the saving of checkpoint will be based on the number of steps that the model has done 
* `fp16` stipulates that the **16-bit floating point precision** will be used (since its value is True) to save memory
* `evaluation_strategy` designates that the **evaluation** should be done **at the end of every epoch**
* `resume_from_checkpoint` indicates that the training could be **restarted from a previous checkpoint**

In [39]:
training_args = TrainingArguments(output_dir = "bert_trainer", 
                                  save_steps = 20000,
                                  save_strategy = 'steps',
                                  fp16 = True,
                                  evaluation_strategy = "epoch", 
                                  resume_from_checkpoint = True)

As we have now declared the pre-trained model and the training arguments that we would be using, we can now instantiate a [`Trainer`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer) that can do training and evaluation using the following parameters:
* `model` is the BERT model that we would be using for sequence classification
* `args` holds the training arguments that we have previously defined
* `train_dataset` is the tokenized dataset that we would be using for training
* `eval_dataset` is the tokenized dataset that we would be using for evaluating (i.e., the val set)
* `tokenizer` is the tokenizer that we used to prepare our data for the BERT model
* `compute_metrics` is the function that the evaluation loop would use to score the model
* `callbacks` holds the list of callbacks that can customize the training loop; here, a base `TrainerCallback` instance is passed

In [40]:
trainer = Trainer(
    model = bert_model,
    args = training_args,
    train_dataset = bert_encoded_dataset ['train'],
    eval_dataset = bert_encoded_dataset ['val'],
    tokenizer = bert_tokenizer,
    compute_metrics = compute_metrics,
    callbacks = [TrainerCallback()]
)

Using cuda_amp half precision backend


Using the instance of [`Trainer`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer) that we have created, we can now fine-tune the pre-trained BERT model through the use of the [`train`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer.train) function.

In [41]:
trainer.train()

***** Running training *****
  Num examples = 16657
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 6249
  Number of trainable parameters = 108311810
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
[34m[1mwandb[0m: Currently logged in as: [33mfrancheska_vicente[0m ([33mtonely[0m). Use [1m`wandb login --relogin`[0m to force relogin


Epoch,Training Loss,Validation Loss,Accuracy,F1 macro score,Precision,Recall
1,0.1372,0.093312,0.976216,0.962762,0.932796,0.948087
2,0.0769,0.150093,0.970811,0.954847,0.90625,0.95082
3,0.0325,0.108422,0.982162,0.971871,0.956164,0.953552


***** Running Evaluation *****
  Num examples = 1850
  Batch size = 8
***** Running Evaluation *****
  Num examples = 1850
  Batch size = 8
***** Running Evaluation *****
  Num examples = 1850
  Batch size = 8


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=6249, training_loss=0.096801379486587, metrics={'train_runtime': 1962.4095, 'train_samples_per_second': 25.464, 'train_steps_per_second': 3.184, 'total_flos': 1.314792254739456e+16, 'train_loss': 0.096801379486587, 'epoch': 3.0})

From the result above, we can see that the model received its highest accuracy, F1 macro score, precision, and recall on the validation set on the third epoch, although the validation loss was lowest on the first epoch.
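
The number of optimization steps reported in the training log can be checked with a short calculation. The sketch assumes a single device, no gradient accumulation (the log reports 1 accumulation step), and that the last, smaller batch of each epoch is kept.

```python
import math

num_examples = 16657   # "Num examples" in the training log
batch_size = 8         # "Instantaneous batch size per device"
num_epochs = 3         # "Num Epochs"

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 2083 6249
```

The result matches the **6,249** total optimization steps shown in the log. Since this is below the `save_steps` value of 20,000, no step-based checkpoint is expected during this run.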

#### Saving BERT base model
To use this model outside the notebook, we would be saving the model. First, let us define the folder where we would be saving the model.

In [45]:
path_for_models ='./saved_models/BERTv1'

Now, let us save the [`Trainer`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer) (i.e., with the weights, the configurations, and the model) and the [`BertTokenizer`](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertTokenizer) in the specified folder. 

In [46]:
trainer.save_model(path_for_models)
bert_tokenizer.save_pretrained(path_for_models)

Saving model checkpoint to ./saved_models/BERTv1
Configuration saved in ./saved_models/BERTv1\config.json
Model weights saved in ./saved_models/BERTv1\pytorch_model.bin
tokenizer config file saved in ./saved_models/BERTv1\tokenizer_config.json
Special tokens file saved in ./saved_models/BERTv1\special_tokens_map.json
tokenizer config file saved in ./saved_models/BERTv1\tokenizer_config.json
Special tokens file saved in ./saved_models/BERTv1\special_tokens_map.json


('./saved_models/BERTv1\\tokenizer_config.json',
 './saved_models/BERTv1\\special_tokens_map.json',
 './saved_models/BERTv1\\vocab.txt',
 './saved_models/BERTv1\\added_tokens.json')

Using the [`Trainer`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer) we have trained, we can now [`evaluate`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer.evaluate) the model using the test set to determine its test score.

In [47]:
trainer.evaluate(eval_dataset=bert_encoded_dataset['test'])

***** Running Evaluation *****
  Num examples = 4627
  Batch size = 8


{'eval_loss': 0.07904903590679169,
 'eval_Accuracy': 0.9846552842014265,
 'eval_F1 Macro Score': 0.9756279882453602,
 'eval_Precision': 0.9720670391061452,
 'eval_Recall': 0.9497816593886463,
 'eval_runtime': 56.1269,
 'eval_samples_per_second': 82.438,
 'eval_steps_per_second': 10.316,
 'epoch': 3.0}

From the result above, it can be seen that the model was trained successfully. On the test set, it achieved the following scores: 98.47% Accuracy, 97.56% F1 Macro Score, 97.21% Precision, and 94.98% Recall.

#### Hyperparameter Tuning
Now, let us try to tune the hyperparameters of the model (i.e., the learning rate, the number of training epochs, and the training batch size), which means that we would try to find the values that give us the highest score. We would be trying three combinations of these hyperparameters and comparing their scores to the score of the base model.

To do this, we will first create a function that would return a base model of a BERT [`AutoModelForSequenceClassification`](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html#automodelforsequenceclassification) for initialization.

In [40]:
def model_init():
    return AutoModelForSequenceClassification.from_pretrained(model_checkpoint,
                                                              num_labels = 2, 
                                                              max_length = MAX_LENGTH)

Like in training the base model, we would be creating the [`TrainingArguments`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) that we would be using for training. We would be using the same parameters for the [`TrainingArguments`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) as before, except for the **fp16**. 

In the tuning, **bf16** (bfloat16) will be used. This change was made because tuning with **fp16** resulted in scores of 0.0, which we attribute to the loss of numerical precision in fp16.

In [41]:
training_args_tuning = TrainingArguments(output_dir = "bert_trainer", 
                                         save_steps = 20000, 
                                         bf16 = True,
                                         save_strategy = 'steps',
                                         evaluation_strategy = "epoch", 
                                         resume_from_checkpoint = True)
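To illustrate the difference between the two 16-bit formats, the sketch below uses only the standard library. The `e` format of `struct` gives IEEE half precision (fp16), while bf16 is emulated by keeping the top 16 bits of a float32 (truncation rather than rounding, which is a simplification). The helper names are ours, and the sketch only demonstrates the difference in numeric range; it does not establish what caused the 0.0 scores.

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip x through IEEE half precision (fp16)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        # Values beyond the fp16 range (largest finite value: 65504) cannot be packed
        return float("inf")

def to_bf16(x: float) -> float:
    """Emulate bfloat16 by truncating a float32 to its top 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

for value in (70000.0, 1e-8, 0.1):
    print(value, to_fp16(value), to_bf16(value))
```

In this sketch, 70000.0 overflows in fp16 but stays finite in the bf16 emulation, and 1e-8 underflows to 0.0 in fp16 but not in bf16. In exchange, bf16 keeps fewer significant digits, as the value printed for 0.1 shows.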

Next, we create an instance of the [`Trainer`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer) class. Since this [`Trainer`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer) is used for tuning, we pass a **model initialization function** (`model_init`) instead of a model. The function is called at the start of every run, so the model is reinitialized for each new set of hyperparameter values. This means that every run starts from the same base model, and only the values of the tuned hyperparameters change.

In [42]:
trainer_tuning = Trainer(
    model_init = model_init,
    args = training_args_tuning,
    train_dataset = bert_encoded_dataset ['train'],
    eval_dataset = bert_encoded_dataset ['val'],
    tokenizer = bert_tokenizer,
    compute_metrics = compute_metrics,
    callbacks = [TrainerCallback()]
)

loading configuration file config.json from cache at C:\Users\admin/.cache\huggingface\hub\models--bert-base-cased\snapshots\5532cc56f74641d4bb33641f5c76a55d11f846e0\config.json
Model config BertConfig {
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_length": 512,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.26.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}

loading weights file pytorch_model.bin from cache at C:\Users\admin/.cache\huggingface\hub\models--bert-base-cased\snapshots\5532cc56f74641d4b
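The reason for passing a function rather than a model instance to the `Trainer` above can be shown with a toy stand-in. In this minimal sketch, `ToyModel`, `make_model`, and `run_trial` are hypothetical names that mimic the pattern, not the `Trainer` internals: each trial calls the factory and therefore starts from the same initial weights.

```python
class ToyModel:
    """Hypothetical stand-in for a pre-trained model with a single weight."""

    def __init__(self):
        self.weight = 1.0  # plays the role of the pre-trained weights


def make_model():
    # Same role as model_init: build a fresh model on every call
    return ToyModel()


def run_trial(model_factory, learning_rate, steps=3):
    model = model_factory()  # re-initialized for this trial
    start = model.weight
    for _ in range(steps):
        model.weight -= learning_rate * 0.5  # toy "gradient" update
    return start, model.weight


for lr in (0.01, 0.001):
    start, end = run_trial(make_model, lr)
    print(f"lr={lr}: started at {start}, ended at {end:.4f}")
```

Had a single model instance been shared instead, the second trial would have started from the weights left behind by the first one.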

Using the [`hyperparameter_search`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer.hyperparameter_search) function, we can now search for the best hyperparameter values. Note that this function returns information about the best run (i.e., the trial that received the best score).

In [43]:
best_trial = trainer_tuning.hyperparameter_search(
    direction = "maximize",
    backend = "optuna",
    hp_space = optuna_hp_space,
    n_trials = 3
)

[I 2023-07-22 21:22:36,884] A new study created in memory with name: no-name-1603b342-70a4-4d1b-8959-69a98b43ca31
Trial: {'learning_rate': 0.01, 'per_device_train_batch_size': 16, 'num_train_epochs': 3}
loading configuration file config.json from cache at C:\Users\admin/.cache\huggingface\hub\models--bert-base-cased\snapshots\5532cc56f74641d4bb33641f5c76a55d11f846e0\config.json
Model config BertConfig {
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_length": 512,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.26.

| Epoch | Training Loss | Validation Loss | Accuracy | F1 macro score | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.8049 | 0.758189 | 0.802162 | 0.445111 | 0.0 | 0.0 |
| 2 | 0.6975 | 0.525683 | 0.802162 | 0.445111 | 0.0 | 0.0 |
| 3 | 0.5587 | 0.497785 | 0.802162 | 0.445111 | 0.0 | 0.0 |


***** Running Evaluation *****
  Num examples = 1850
  Batch size = 8
  precision_metric = load_metric("precision")
  _warn_prf(average, modifier, msg_start, len(result))
***** Running Evaluation *****
  Num examples = 1850
  Batch size = 8
  _warn_prf(average, modifier, msg_start, len(result))
***** Running Evaluation *****
  Num examples = 1850
  Batch size = 8
  _warn_prf(average, modifier, msg_start, len(result))


Training completed. Do not forget to share your model on huggingface.co/models =)


[I 2023-07-22 21:51:21,700] Trial 0 finished with value: 1.2472731399666013 and parameters: {'learning_rate': 0.01, 'per_device_train_batch_size': 16, 'num_train_epochs': 3}. Best is trial 0 with value: 1.2472731399666013.
Trial: {'learning_rate': 0.001, 'per_device_train_batch_size': 16, 'num_train_epochs': 3}
loading configuration file config.json from cache at C:\Users\admin/.cache\huggingface\hub\models--bert-base-cased\snapshots\5532cc56f74641d4bb33641f5c76a55d11f846e0\c

Run summary (Weights & Biases):

| Metric | Value |
|---|---|
| eval/Accuracy | 0.80216 |
| eval/F1 Macro Score | 0.44511 |
| eval/Precision | 0.0 |
| eval/Recall | 0.0 |
| eval/loss | 0.49779 |
| eval/runtime | 25.7737 |
| eval/samples_per_second | 71.779 |
| eval/steps_per_second | 9.001 |
| train/epoch | 3.0 |
| train/global_step | 3126.0 |


Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


| Epoch | Training Loss | Validation Loss | Accuracy | F1 macro score | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.5342 | 0.504901 | 0.802162 | 0.445111 | 0.0 | 0.0 |
| 2 | 0.5187 | 0.501915 | 0.802162 | 0.445111 | 0.0 | 0.0 |
| 3 | 0.5059 | 0.497616 | 0.802162 | 0.445111 | 0.0 | 0.0 |


***** Running Evaluation *****
  Num examples = 1850
  Batch size = 8
  _warn_prf(average, modifier, msg_start, len(result))
***** Running Evaluation *****
  Num examples = 1850
  Batch size = 8
  _warn_prf(average, modifier, msg_start, len(result))
***** Running Evaluation *****
  Num examples = 1850
  Batch size = 8
  _warn_prf(average, modifier, msg_start, len(result))


Training completed. Do not forget to share your model on huggingface.co/models =)


[I 2023-07-22 22:19:48,145] Trial 1 finished with value: 1.2472731399666013 and parameters: {'learning_rate': 0.001, 'per_device_train_batch_size': 16, 'num_train_epochs': 3}. Best is trial 0 with value: 1.2472731399666013.
Trial: {'learning_rate': 0.001, 'per_device_train_batch_size': 8, 'num_train_epochs': 3}
loading configuration file config.json from cache at C:\Users\admin/.cache\huggingface\hub\models--bert-base-cased\snapshots\5532cc56f74641d4bb33641f5c76a55d11f846e0\config.json
Model config BertConfig {
  "_name_

Run summary (Weights & Biases):

| Metric | Value |
|---|---|
| eval/Accuracy | 0.80216 |
| eval/F1 Macro Score | 0.44511 |
| eval/Precision | 0.0 |
| eval/Recall | 0.0 |
| eval/loss | 0.49762 |
| eval/runtime | 25.8073 |
| eval/samples_per_second | 71.685 |
| eval/steps_per_second | 8.99 |
| train/epoch | 3.0 |
| train/global_step | 3126.0 |


Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


| Epoch | Training Loss | Validation Loss | Accuracy | F1 macro score | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.5524 | 0.505642 | 0.802162 | 0.445111 | 0.0 | 0.0 |
| 2 | 0.5241 | 0.498315 | 0.802162 | 0.445111 | 0.0 | 0.0 |
| 3 | 0.5014 | 0.498992 | 0.802162 | 0.445111 | 0.0 | 0.0 |


***** Running Evaluation *****
  Num examples = 1850
  Batch size = 8
  _warn_prf(average, modifier, msg_start, len(result))
***** Running Evaluation *****
  Num examples = 1850
  Batch size = 8
  _warn_prf(average, modifier, msg_start, len(result))
***** Running Evaluation *****
  Num examples = 1850
  Batch size = 8
  _warn_prf(average, modifier, msg_start, len(result))


Training completed. Do not forget to share your model on huggingface.co/models =)


[I 2023-07-22 22:52:30,000] Trial 2 finished with value: 1.2472731399666013 and parameters: {'learning_rate': 0.001, 'per_device_train_batch_size': 8, 'num_train_epochs': 3}. Best is trial 0 with value: 1.2472731399666013.


In [44]:
best_trial

BestRun(run_id='0', objective=1.2472731399666013, hyperparameters={'learning_rate': 0.01, 'per_device_train_batch_size': 16, 'num_train_epochs': 3})
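The objective value of 1.2472731399666013 can be reproduced from the validation metrics. When no `compute_objective` is passed, our understanding is that `hyperparameter_search` sums the evaluation metrics after dropping the loss, the epoch, and the speed statistics. The sketch below re-implements that rule in plain Python (the function name `summed_objective` is ours) and applies it to the last-epoch values logged for trial 0.

```python
def summed_objective(metrics: dict) -> float:
    """Sum evaluation metrics, ignoring loss, epoch, and speed statistics."""
    kept = {
        name: value
        for name, value in metrics.items()
        if name not in ("eval_loss", "epoch")
        and not name.endswith("_runtime")
        and not name.endswith("_per_second")
    }
    # Fall back to the loss when no other metric is reported
    return sum(kept.values()) if kept else metrics["eval_loss"]


# Last-epoch values of trial 0, as displayed in the logs above
trial_0 = {
    "eval_loss": 0.497785,
    "eval_Accuracy": 0.802162,
    "eval_F1 Macro Score": 0.445111,
    "eval_Precision": 0.0,
    "eval_Recall": 0.0,
    "eval_runtime": 25.7737,
    "eval_samples_per_second": 71.779,
    "eval_steps_per_second": 9.001,
    "epoch": 3.0,
}
print(round(summed_objective(trial_0), 6))
```

The sum of Accuracy and F1 Macro Score (Precision and Recall are both 0.0) matches the reported objective up to the rounding of the displayed metrics. Since these four metrics are identical in all three trials, every trial receives the same objective value.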

The output shows that three BERT models were trained during tuning, with the following hyperparameters:
* **Learning Rate** = 0.01, **Train Batch Size** = 16, **Number of Train Epochs** = 3
* **Learning Rate** = 0.001, **Train Batch Size** = 16, **Number of Train Epochs** = 3
* **Learning Rate** = 0.001, **Train Batch Size** = 8, **Number of Train Epochs** = 3

These values were randomly generated based on the hyperparameter space that we have declared. All three trials ended with the same validation scores and the same objective value (1.2473), so the first trial is reported as the best run.
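For reference, a search space consistent with the values seen in the logs could look like the sketch below. The actual `optuna_hp_space` is defined earlier in the notebook and may differ: the candidate lists here are assumptions inferred from the trials logged in this notebook, and `FakeTrial` is a hypothetical stand-in so that the sketch runs without Optuna installed. With Optuna, the `trial` argument would be an `optuna.Trial`, which provides `suggest_categorical`.

```python
import random


def example_hp_space(trial):
    """Hypothetical search space; the choices are inferred from the logs."""
    return {
        "learning_rate": trial.suggest_categorical(
            "learning_rate", [0.1, 0.01, 0.001]
        ),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16]
        ),
        "num_train_epochs": trial.suggest_categorical("num_train_epochs", [2, 3]),
    }


class FakeTrial:
    """Minimal stand-in that mimics Trial.suggest_categorical."""

    def __init__(self, seed):
        self._rng = random.Random(seed)

    def suggest_categorical(self, name, choices):
        return self._rng.choice(choices)


for seed in range(3):
    print(example_hp_space(FakeTrial(seed)))
```

The keys of the returned dictionary are the names of the training arguments that each trial overrides.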

##### Saving BERT tuned model

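As with the base model, we also save the files of the tuned model. Note that `save_model` writes the model that `trainer_tuning` currently holds, which, to our understanding, is the one from the most recent trial rather than necessarily the best run.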

In [45]:
path_for_models ='./saved_models/BERTv1_tuned'
trainer_tuning.save_model(path_for_models)

Saving model checkpoint to ./saved_models/BERTv1_tuned
Configuration saved in ./saved_models/BERTv1_tuned\config.json
Model weights saved in ./saved_models/BERTv1_tuned\pytorch_model.bin
tokenizer config file saved in ./saved_models/BERTv1_tuned\tokenizer_config.json
Special tokens file saved in ./saved_models/BERTv1_tuned\special_tokens_map.json
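`hyperparameter_search` returns the best hyperparameters rather than a trained model, so a common follow-up is to copy them into the training arguments and train once more before saving. The sketch below shows only the copying step. It uses `SimpleNamespace` as a hypothetical stand-in for `trainer_tuning.args` and the hyperparameters from the `BestRun` shown above; with the real objects, the loop would be followed by `trainer_tuning.train()`.

```python
from types import SimpleNamespace

# Stand-in for trainer_tuning.args (default-like values, for illustration)
args = SimpleNamespace(
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
)

# Hyperparameters reported in the BestRun above
best_hyperparameters = {
    "learning_rate": 0.01,
    "per_device_train_batch_size": 16,
    "num_train_epochs": 3,
}

# Copy every tuned value onto the arguments object
for name, value in best_hyperparameters.items():
    setattr(args, name, value)

print(args)
```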


#### Evaluation

To test how the tuned BERT model fared on the test dataset, we use the [`evaluate`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer.evaluate) function.

In [46]:
trainer_tuning.evaluate(eval_dataset=bert_encoded_dataset['test'])

***** Running Evaluation *****
  Num examples = 4627
  Batch size = 8


  _warn_prf(average, modifier, msg_start, len(result))


{'eval_loss': 0.49919524788856506,
 'eval_Accuracy': 0.802031553922628,
 'eval_F1 Macro Score': 0.4450707603741904,
 'eval_Precision': 0.0,
 'eval_Recall': 0.0,
 'eval_runtime': 53.8581,
 'eval_samples_per_second': 85.911,
 'eval_steps_per_second': 10.75,
 'epoch': 3.0}

In this result, it can be seen that the BERT model resulting from the tuning (the reported best run used a learning rate of 0.01 and a train batch size of 16) was accurate on only 80.20% of the test samples. However, the samples are heavily imbalanced, so a model can reach this accuracy by labeling everything as 0 (i.e., not fake news). The Precision and Recall of 0.0 indicate that this is what happened here.

Comparing the validation scores of the models from tuning with those of the base model, the base model still scored better, especially in Precision and Recall. Thus, for the BERT model, we will consider the base model as our best model.
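The effect of the class imbalance on the tuned model's test scores can be checked by hand. The sketch below computes the scores of a classifier that predicts class 0 for every sample. The class counts are not printed in this section; they are inferred from the reported accuracy (3711 of the 4627 test samples being class 0), and the function name is ours.

```python
def all_negative_scores(n_negative: int, n_positive: int) -> dict:
    """Scores of a classifier that predicts class 0 for every sample."""
    total = n_negative + n_positive
    accuracy = n_negative / total
    # For class 0: every real negative is a correct prediction, and
    # every real positive is a wrong prediction of class 0.
    f1_class_0 = 2 * n_negative / (2 * n_negative + n_positive)
    # Class 1 is never predicted, so its precision, recall, and F1 count as 0.
    f1_class_1 = 0.0
    return {
        "accuracy": accuracy,
        "f1_macro": (f1_class_0 + f1_class_1) / 2,
        "precision": 0.0,
        "recall": 0.0,
    }


# Inferred counts for the test set: 3711 negatives, 916 positives
print(all_negative_scores(n_negative=3711, n_positive=916))
```

The resulting accuracy and macro F1 agree with the `eval_Accuracy` and `eval_F1 Macro Score` reported above, which supports the reading that the tuned model labels every sample as 0. The `_warn_prf` lines in the output are consistent with this, since Precision is undefined when class 1 is never predicted.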

### RoBERTa Model
Now, we can move on to training the RoBERTa model.

#### Model Training 
As with the BERT model, we need to define the pre-trained model that we will fine-tune. For this, we use [**roberta-base**](https://huggingface.co/roberta-base). This model, which is case-sensitive, was also pre-trained with masked language modeling (MLM) on an English corpus. However, it uses the RoBERTa architecture instead of the BERT architecture.

In [37]:
model_checkpoint_roberta = 'roberta-base'

Using this pre-trained model, we can instantiate an [`AutoModelForSequenceClassification`](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html#automodelforsequenceclassification) object, which will create a RoBERTa model. In addition, we set the **MAX_LENGTH** of the model to the previously defined **MAX_LENGTH** (i.e., 512).

In [48]:
roberta_model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint_roberta,
    num_labels = 2, 
    max_length = MAX_LENGTH
).to(device)

loading configuration file config.json from cache at C:\Users\admin/.cache\huggingface\hub\models--roberta-base\snapshots\bc2764f8af2e92b6eb5679868df33e224075ca68\config.json
Model config RobertaConfig {
  "_name_or_path": "roberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_length": 512,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.26.1",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}

loading weights file pytorch_model.bin from cache at C:\Users\admin/.cache\huggingface\hub\models--roberta-base\snapshots\bc2764f8af

We also need to create an instance of [`TrainingArguments`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments). This has the same values as the previous [`TrainingArguments`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) of the BERT model, except for the `output_dir`, as we want to save the checkpoints in another folder.

In [49]:
training_args = TrainingArguments(output_dir = "roberta_trainer", 
                                  save_steps = 20000,
                                  save_strategy = 'steps',
                                  fp16 = True,
                                  evaluation_strategy = "epoch", 
                                  resume_from_checkpoint = True)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


Using this RoBERTa model and the previously created [`TrainingArguments`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) object, we can now create a [`Trainer`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer). Its parameters are the same as those of the [`Trainer`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer) for BERT, but the `model`, `train_dataset`, and `eval_dataset` are changed to their RoBERTa counterparts.

In [50]:
trainer = Trainer(
    model = roberta_model,
    args = training_args,
    train_dataset = roberta_encoded_dataset ['train'],
    eval_dataset = roberta_encoded_dataset ['val'],
    compute_metrics = compute_metrics,
    callbacks = [TrainerCallback()]
)

Using cuda_amp half precision backend


Now, we can train the RoBERTa model.

In [51]:
trainer.train()

***** Running training *****
  Num examples = 16657
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 6249
  Number of trainable parameters = 124647170
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


| Epoch | Training Loss | Validation Loss | Accuracy | F1 macro score | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.2746 | 0.303108 | 0.918919 | 0.860363 | 0.877622 | 0.685792 |
| 2 | 0.2438 | 0.171242 | 0.938378 | 0.906341 | 0.813433 | 0.893443 |
| 3 | 0.0902 | 0.059516 | 0.983784 | 0.974348 | 0.964088 | 0.953552 |


***** Running Evaluation *****
  Num examples = 1850
  Batch size = 8
***** Running Evaluation *****
  Num examples = 1850
  Batch size = 8
***** Running Evaluation *****
  Num examples = 1850
  Batch size = 8


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=6249, training_loss=0.20192054520227143, metrics={'train_runtime': 1965.7021, 'train_samples_per_second': 25.421, 'train_steps_per_second': 3.179, 'total_flos': 1.314792254739456e+16, 'train_loss': 0.20192054520227143, 'epoch': 3.0})
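The step counts in the training logs follow from the dataset size, the batch size, and the number of epochs. The sketch below assumes a single device and no gradient accumulation, as reported in the log above; the function name is ours.

```python
import math


def total_optimization_steps(num_examples: int, batch_size: int, num_epochs: int) -> int:
    """Number of optimizer steps when the last partial batch is kept."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# Base RoBERTa run: 16657 examples, batch size 8, 3 epochs
print(total_optimization_steps(16657, 8, 3))
```

This gives 6249 for the run above, matching `Total optimization steps`. The same formula gives 3126 for a batch size of 16 with three epochs and 4166 for a batch size of 8 with two epochs, which are the `train/global_step` values reported for the tuning trials in this notebook.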

From this, it can be seen that, in the third epoch, the RoBERTa base model achieved an **Accuracy** of **98.38%**, an **F1 Macro Score** of **97.43%**, a **Precision** of **96.41%**, and a **Recall** of **95.36%** on the validation set.

#### Saving RoBERTa base model
Since we are done training the model, we save the RoBERTa model, together with its configuration and tokenizer.

In [52]:
path_for_models ='./saved_models/RoBERTav1'
trainer.save_model(path_for_models)
roberta_tokenizer.save_pretrained(path_for_models)

Saving model checkpoint to ./saved_models/RoBERTav1
Configuration saved in ./saved_models/RoBERTav1\config.json
Model weights saved in ./saved_models/RoBERTav1\pytorch_model.bin
tokenizer config file saved in ./saved_models/RoBERTav1\tokenizer_config.json
Special tokens file saved in ./saved_models/RoBERTav1\special_tokens_map.json


('./saved_models/RoBERTav1\\tokenizer_config.json',
 './saved_models/RoBERTav1\\special_tokens_map.json',
 './saved_models/RoBERTav1\\vocab.json',
 './saved_models/RoBERTav1\\merges.txt',
 './saved_models/RoBERTav1\\added_tokens.json',
 './saved_models/RoBERTav1\\tokenizer.json')

We can now [`evaluate`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer.evaluate) this RoBERTa model on the test set.

In [53]:
trainer.evaluate(eval_dataset=roberta_encoded_dataset['test'])

***** Running Evaluation *****
  Num examples = 4627
  Batch size = 8


{'eval_loss': 0.08169636875391006,
 'eval_Accuracy': 0.9807650745623514,
 'eval_F1 Macro Score': 0.9690856993273599,
 'eval_Precision': 0.9769319492502884,
 'eval_Recall': 0.9246724890829694,
 'eval_runtime': 53.9361,
 'eval_samples_per_second': 85.787,
 'eval_steps_per_second': 10.735,
 'epoch': 3.0}
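The test metrics above are consistent with one particular confusion matrix. The counts below are not printed anywhere in the notebook; they are inferred from the reported Precision, Recall, and Accuracy on the 4627 test samples, and the function name is ours. The sketch recomputes the four scores from those counts.

```python
def scores_from_counts(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Binary-classification scores, with class 1 as the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Per-class F1 scores, averaged without weighting for the macro score
    f1_positive = 2 * tp / (2 * tp + fp + fn)
    f1_negative = 2 * tn / (2 * tn + fn + fp)
    return {
        "accuracy": accuracy,
        "f1_macro": (f1_positive + f1_negative) / 2,
        "precision": precision,
        "recall": recall,
    }


# Inferred counts: 847 true positives, 20 false positives,
# 69 false negatives, 3691 true negatives (4627 samples in total)
print(scores_from_counts(tp=847, fp=20, fn=69, tn=3691))
```

Read this way, the RoBERTa base model misses 69 of the 916 fake-news samples and wrongly flags 20 of the 3711 real ones.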

Comparing the scores received by the RoBERTa base model and the best BERT model, it is apparent that the **BERT model received higher scores in every metric except for Precision**. 

#### Hyperparameter Tuning
To further see if we can improve the current RoBERTa model, we can tune the model's hyperparameters. 

Like in the BERT model, we would first need to create a function that would return the initial state of the model that would be tuned. 

In [38]:
def model_init_roberta ():
    return AutoModelForSequenceClassification.from_pretrained(model_checkpoint_roberta,
                                                              num_labels = 2, 
                                                              max_length = MAX_LENGTH)

Next, we would have to create the [`TrainingArguments`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) that we would be using for the training loop.

In [39]:
training_args_tuning = TrainingArguments(output_dir = "roberta_trainer", 
                                         save_steps = 20000, 
                                         bf16 = True,
                                         save_strategy = 'steps',
                                         evaluation_strategy = "epoch", 
                                         resume_from_checkpoint = True)

With this, we can now proceed with creating an instance of the [`Trainer`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer) object.

In [40]:
trainer_tuning = Trainer(
    model_init = model_init_roberta,
    args = training_args_tuning,
    train_dataset = roberta_encoded_dataset ['train'],
    eval_dataset = roberta_encoded_dataset ['val'],
    compute_metrics = compute_metrics,
    callbacks = [TrainerCallback()]
)

loading configuration file config.json from cache at C:\Users\admin/.cache\huggingface\hub\models--roberta-base\snapshots\bc2764f8af2e92b6eb5679868df33e224075ca68\config.json
Model config RobertaConfig {
  "_name_or_path": "roberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_length": 512,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.26.1",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}

loading weights file pytorch_model.bin from cache at C:\Users\admin/.cache\huggingface\hub\models--roberta-base\snapshots\bc2764f8af

We can now use the [`hyperparameter_search`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer.hyperparameter_search) function to: (1) sample values for the three hyperparameters that we want to tune from the search space, (2) train three models using these values, and (3) pick the best of the three trained models.

In [41]:
best_trial_roberta = trainer_tuning.hyperparameter_search(
    direction = "maximize",
    backend = "optuna",
    hp_space = optuna_hp_space,
    n_trials = 3
)

[I 2023-07-22 23:33:08,759] A new study created in memory with name: no-name-af6798e5-2b0b-4bb6-8aca-b4461caa04c9
Trial: {'learning_rate': 0.1, 'per_device_train_batch_size': 8, 'num_train_epochs': 2}
loading configuration file config.json from cache at C:\Users\admin/.cache\huggingface\hub\models--roberta-base\snapshots\bc2764f8af2e92b6eb5679868df33e224075ca68\config.json
Model config RobertaConfig {
  "_name_or_path": "roberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_length": 512,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version"

| Epoch | Training Loss | Validation Loss | Accuracy | F1 macro score | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 6.5164 | 0.500437 | 0.802162 | 0.445111 | 0.0 | 0.0 |
| 2 | 2.1407 | 0.817817 | 0.802162 | 0.445111 | 0.0 | 0.0 |


***** Running Evaluation *****
  Num examples = 1850
  Batch size = 8
  precision_metric = load_metric("precision")
  _warn_prf(average, modifier, msg_start, len(result))
***** Running Evaluation *****
  Num examples = 1850
  Batch size = 8
  _warn_prf(average, modifier, msg_start, len(result))


Training completed. Do not forget to share your model on huggingface.co/models =)


[I 2023-07-22 23:55:13,114] Trial 0 finished with value: 1.2472731399666013 and parameters: {'learning_rate': 0.1, 'per_device_train_batch_size': 8, 'num_train_epochs': 2}. Best is trial 0 with value: 1.2472731399666013.
Trial: {'learning_rate': 0.1, 'per_device_train_batch_size': 16, 'num_train_epochs': 3}
loading configuration file config.json from cache at C:\Users\admin/.cache\huggingface\hub\models--roberta-base\snapshots\bc2764f8af2e92b6eb5679868df33e224075ca68\config.json
Model config RobertaConfig {
  "_name_or_path": "roberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "att

Run summary (Weights & Biases):

| Metric | Value |
|---|---|
| eval/Accuracy | 0.80216 |
| eval/F1 Macro Score | 0.44511 |
| eval/Precision | 0.0 |
| eval/Recall | 0.0 |
| eval/loss | 0.81782 |
| eval/runtime | 25.6617 |
| eval/samples_per_second | 72.092 |
| eval/steps_per_second | 9.041 |
| train/epoch | 2.0 |
| train/global_step | 4166.0 |


Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


| Epoch | Training Loss | Validation Loss | Accuracy | F1 macro score | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 5.6646 | 1.941329 | 0.802162 | 0.445111 | 0.0 | 0.0 |
| 2 | 3.4128 | 3.758919 | 0.802162 | 0.445111 | 0.0 | 0.0 |
| 3 | 1.3948 | 0.496893 | 0.802162 | 0.445111 | 0.0 | 0.0 |


***** Running Evaluation *****
  Num examples = 1850
  Batch size = 8
  _warn_prf(average, modifier, msg_start, len(result))
***** Running Evaluation *****
  Num examples = 1850
  Batch size = 8
  _warn_prf(average, modifier, msg_start, len(result))
***** Running Evaluation *****
  Num examples = 1850
  Batch size = 8
  _warn_prf(average, modifier, msg_start, len(result))


Training completed. Do not forget to share your model on huggingface.co/models =)


[I 2023-07-23 00:24:04,955] Trial 1 finished with value: 1.2472731399666013 and parameters: {'learning_rate': 0.1, 'per_device_train_batch_size': 16, 'num_train_epochs': 3}. Best is trial 0 with value: 1.2472731399666013.
Trial: {'learning_rate': 0.01, 'per_device_train_batch_size': 16, 'num_train_epochs': 3}
loading configuration file config.json from cache at C:\Users\admin/.cache\huggingface\hub\models--roberta-base\snapshots\bc2764f8af2e92b6eb5679868df33e224075ca68\config.json
Model config RobertaConfig {
  "_name_or

Run summary (Weights & Biases):

| Metric | Value |
|---|---|
| eval/Accuracy | 0.80216 |
| eval/F1 Macro Score | 0.44511 |
| eval/Precision | 0.0 |
| eval/Recall | 0.0 |
| eval/loss | 0.49689 |
| eval/runtime | 25.5622 |
| eval/samples_per_second | 72.373 |
| eval/steps_per_second | 9.076 |
| train/epoch | 3.0 |
| train/global_step | 3126.0 |


Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


| Epoch | Training Loss | Validation Loss | Accuracy | F1 macro score | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.8379 | 2.268301 | 0.197838 | 0.165162 | 0.197838 | 1.0 |
| 2 | 0.6763 | 0.557782 | 0.802162 | 0.445111 | 0.0 | 0.0 |
| 3 | 0.5792 | 0.501915 | 0.802162 | 0.445111 | 0.0 | 0.0 |


***** Running Evaluation *****
  Num examples = 1850
  Batch size = 8
***** Running Evaluation *****
  Num examples = 1850
  Batch size = 8
  _warn_prf(average, modifier, msg_start, len(result))
***** Running Evaluation *****
  Num examples = 1850
  Batch size = 8
  _warn_prf(average, modifier, msg_start, len(result))


Training completed. Do not forget to share your model on huggingface.co/models =)


[I 2023-07-23 00:52:55,520] Trial 2 finished with value: 1.2472731399666013 and parameters: {'learning_rate': 0.01, 'per_device_train_batch_size': 16, 'num_train_epochs': 3}. Best is trial 0 with value: 1.2472731399666013.


In [42]:
best_trial_roberta

BestRun(run_id='0', objective=1.2472731399666013, hyperparameters={'learning_rate': 0.1, 'per_device_train_batch_size': 8, 'num_train_epochs': 2})

In the tuning, three RoBERTa models were created and compared, with the following hyperparameter values:
* **Learning Rate** = 0.1, **Train Batch Size** = 8, **Number of Train Epochs** = 2
* **Learning Rate** = 0.1, **Train Batch Size** = 16, **Number of Train Epochs** = 3
* **Learning Rate** = 0.01, **Train Batch Size** = 16, **Number of Train Epochs** = 3

Out of these three, the run reported as best for the RoBERTa model was the first one, although all three trials reached the same objective value (1.2473). Based on the performance on the validation set, the BERT base model still performed better.
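The search procedure used above (sample values, train, keep the best) can be sketched in a few lines. This is a conceptual illustration, not the Optuna implementation: plain random sampling is used as a simplification, the search space and the function names are assumptions, and `constant_score` is a hypothetical scorer that returns the same value for every configuration, mimicking the tie observed in the logs.

```python
import random


def search(space, train_and_score, n_trials, seed=0):
    """Sample n_trials configurations and keep the first best-scoring one."""
    rng = random.Random(seed)
    best = None  # (trial_id, config, score)
    for trial_id in range(n_trials):
        # (1) sample one value per hyperparameter
        config = {name: rng.choice(choices) for name, choices in space.items()}
        # (2) train a model with these values and score it
        score = train_and_score(config)
        # (3) keep the best run; a tie keeps the earlier trial
        if best is None or score > best[2]:
            best = (trial_id, config, score)
    return best


space = {
    "learning_rate": [0.1, 0.01, 0.001],
    "per_device_train_batch_size": [8, 16],
    "num_train_epochs": [2, 3],
}


def constant_score(config):
    # Every trial in the logs ended with this objective value
    return 1.2472731399666013


best_id, best_config, best_score = search(space, constant_score, n_trials=3)
print(best_id, best_config, best_score)
```

With identical scores, the first trial stays the best one, which mirrors the `Best is trial 0` messages in the log.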

##### Saving RoBERTa tuned model

To use this model outside of this notebook, we save the model held by the RoBERTa [`Trainer`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer) and the [`RoBERTa Tokenizer`](https://huggingface.co/docs/transformers/model_doc/roberta#transformers.RobertaTokenizer).

In [43]:
path_for_models ='./saved_models/RoBERTav1_tuned'
trainer_tuning.save_model(path_for_models)
roberta_tokenizer.save_pretrained(path_for_models)

Saving model checkpoint to ./saved_models/RoBERTav1_tuned
Configuration saved in ./saved_models/RoBERTav1_tuned\config.json
Model weights saved in ./saved_models/RoBERTav1_tuned\pytorch_model.bin
tokenizer config file saved in ./saved_models/RoBERTav1_tuned\tokenizer_config.json
Special tokens file saved in ./saved_models/RoBERTav1_tuned\special_tokens_map.json


('./saved_models/RoBERTav1_tuned\\tokenizer_config.json',
 './saved_models/RoBERTav1_tuned\\special_tokens_map.json',
 './saved_models/RoBERTav1_tuned\\vocab.json',
 './saved_models/RoBERTav1_tuned\\merges.txt',
 './saved_models/RoBERTav1_tuned\\added_tokens.json',
 './saved_models/RoBERTav1_tuned\\tokenizer.json')

#### Evaluation

Lastly, let us see how the model from the RoBERTa tuning fared on the test dataset.

In [44]:
trainer_tuning.evaluate(eval_dataset = roberta_encoded_dataset['test'])

***** Running Evaluation *****
  Num examples = 4627
  Batch size = 8


  _warn_prf(average, modifier, msg_start, len(result))


{'eval_loss': 0.5021297931671143,
 'eval_Accuracy': 0.802031553922628,
 'eval_F1 Macro Score': 0.4450707603741904,
 'eval_Precision': 0.0,
 'eval_Recall': 0.0,
 'eval_runtime': 54.5782,
 'eval_samples_per_second': 84.778,
 'eval_steps_per_second': 10.609,
 'epoch': 3.0}

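From this, it is evident that the BERT base model also performed better on the test set than the model returned by the tuning. Note that the reported `'epoch': 3.0` suggests that the evaluated model comes from the last trial (three epochs) rather than from the best run (two epochs). Since all three trials reached the same validation scores, this does not change the comparison.

In conclusion, comparing the final BERT and RoBERTa models (i.e., those that use the default hyperparameter values and a MAX_LENGTH of 512), the BERT model received a higher score in all of the metrics except for Precision.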