# **Insert Title Here**
**DATA103 S11 Group 4**
- GOZON, Jean Pauline D.
- JAMIAS, Gillian Nicole A.
- MARCELO, Andrea Jean C.
- REYES, Anton Gabriel G.
- VICENTE, Francheska Josefa

## Requirements and Imports

Before starting, the libraries and files needed to build and train the model should be loaded into the notebook.

### Imports
Several libraries are required to perform a thorough analysis of the dataset. Each of these libraries will be imported and described below:

**Basic Libraries**

Import `numpy`, `pandas`, and `datasets`.

* `numpy` contains a large collection of mathematical functions
* `pandas` contains functions that are designed for data manipulation and data analysis
* `datasets` contains functions that allow easier pre-processing for datasets and smart caching for easier loading of data

In [1]:
import numpy as np
import pandas as pd
import datasets

**Machine Learning Libraries**

`train_test_split` is a function that randomly splits a dataset into two subsets.

In [2]:
from sklearn.model_selection import train_test_split

Meanwhile, the following imports are used to create the dataset:
* `torch` is an open-source ML library for building deep neural networks
* `Dataset` and `DataLoader` are two data primitives that make loading and using datasets easier
* `RandomSampler` and `SequentialSampler` are samplers that are used by the `DataLoader`
* `ProgressBarBase` and `RichProgressBar` are components that show a progress bar while the models are training

In [3]:
import torch
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
from pytorch_lightning.callbacks import ProgressBarBase, RichProgressBar

The next imports are from `transformers`, which contains pre-trained models and tokenizers that can be fine-tuned.
* `AutoTokenizer` automatically creates the tokenizer based on the architecture passed
* `AutoModelForSequenceClassification` automatically instantiates a sequence classification model based on the type of model passed
* `TrainerCallback` is an object that determines how the training loop will behave
* `TrainingArguments` is a dataclass that allows the customization of the arguments in training
* `Trainer` is a class that has a complete training and validation loop

In [4]:
from transformers import (AutoTokenizer, AutoModelForSequenceClassification, 
                          TrainerCallback, TrainingArguments, Trainer)

On the other hand, the following imports compute the different scores that measure how well a model works.
* `f1_score` computes the balanced F-score by comparing the actual classes and the predicted classes
* `hamming_loss` computes the fraction of labels that were incorrectly labeled by the model
* `accuracy_score` computes the accuracy by determining how many classes were correctly predicted
* `EvalPrediction` is an object in transformers that holds the prediction of the model and the target output
* `evaluate` is a library that is used to evaluate and compare metrics
* `load_metric` is a function in the datasets library that allows different metrics to be loaded

In [5]:
from sklearn.metrics import f1_score, hamming_loss, accuracy_score
from transformers import EvalPrediction
import evaluate
from datasets import load_metric

Next, `optuna` is used to tune the hyperparameters of machine learning models.

In [6]:
import optuna

Last, `pickle` is a module that can serialize and deserialize objects. In this notebook, it is used to save and load models.

In [7]:
import pickle

### Datasets and Files
To train the BERT and RoBERTa models, let us load the cleaned dataset, which has had minimal pre-processing, using the [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function.

In [8]:
df = pd.read_csv ('cleaned_data.csv')
df

Unnamed: 0,class,text
0,0,"['Its not a viable option, and youll be leavin..."
1,1,['It can be hard to appreciate the notion that...
2,1,"['Hi, so last night i was sitting on the ledge..."
3,1,['I tried to kill my self once and failed badl...
4,1,['Hi NEM3030. What sorts of things do you enjo...
...,...,...
242155,0,If you don't like rock then your not going to ...
242156,0,You how you can tell i have so many friends an...
242157,0,pee probably tastes like salty tea😏💦‼️ can som...
242158,1,The usual stuff you find hereI'm not posting t...


Before we start working with the data, we will set the **device** on which the model will run. If a CUDA-enabled device is available, CUDA is picked automatically. Otherwise, the model will run on the CPU.

In [9]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

## Preparing data for Feature Engineering

Before creating the features that the BERT and RoBERTa models will use for training, there are two steps that we must first do: (1) splitting the dataset into the train, val, and test sets, and (2) transforming our [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) into a [`Dataset`](https://pypi.org/project/datasets/). This would allow us to utilize the data for the training more easily.

### Splitting the Dataset into Train, Val, and Test Split
Let us first define the **X** (input) and **y** (target/output) of our model. This is done to allow the stratifying of the data when it is split into the train, val and test.

The **X** (input) can be retrieved by getting the `text` column in the original dataset.

In [10]:
X = df ['text']
X

0         ['Its not a viable option, and youll be leavin...
1         ['It can be hard to appreciate the notion that...
2         ['Hi, so last night i was sitting on the ledge...
3         ['I tried to kill my self once and failed badl...
4         ['Hi NEM3030. What sorts of things do you enjo...
                                ...                        
242155    If you don't like rock then your not going to ...
242156    You how you can tell i have so many friends an...
242157    pee probably tastes like salty tea😏💦‼️ can som...
242158    The usual stuff you find hereI'm not posting t...
242159    I still haven't beaten the first boss in Hollo...
Name: text, Length: 242160, dtype: object

Meanwhile, the **y** value (i.e., the target that our models will learn to predict) is the `class` column.

In [11]:
y = df ['class']
y

0         0
1         1
2         1
3         1
4         1
         ..
242155    0
242156    0
242157    0
242158    1
242159    0
Name: class, Length: 242160, dtype: int64

Now that we have declared the input and the target output of our models, we can use the [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function to divide the dataset into two splits. Some things to note are: (1) the split is stratified based on the **y values**, (2) the value of the random state was set to 42 for reproducibility, and (3) the dataset is shuffled.
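To make the idea of a stratified, seeded split concrete, here is a minimal pure-Python sketch. It is not scikit-learn's implementation: the helper `stratified_split` and the toy labels are hypothetical and only illustrate that each class keeps the same share in both parts and that a fixed seed reproduces the same split.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction, seed):
    """Hypothetical helper: return (train_idx, test_idx) with equal class shares."""
    rng = random.Random(seed)  # fixed seed -> the same shuffle every run
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # shuffle within each class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Toy, balanced labels standing in for the `class` column
labels = [0] * 50 + [1] * 50
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=42)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts, test_counts)
```

Both parts keep the 50/50 class balance of the toy labels, which is the property that `stratify = y` is meant to provide.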

First, let us create the train and test sets. The test set is made up of 20% of the original dataset, which implies that the remaining split is 80% of the original.

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,
                                                    stratify = y,
                                                    random_state = 42, 
                                                    shuffle = True)

Second, we will split the remaining 80% of the original dataset into two: the train and val sets. The train set will be 90% of this remaining split (72% of the original), while the val set will be 10% of it (8% of the original).

In [13]:
X_train, X_val, y_train, y_val = train_test_split(X_train, 
                                                  y_train, 
                                                  test_size = 0.1,
                                                  stratify = y_train,
                                                  random_state = 42, 
                                                  shuffle = True)

To check that the input and output of each set have the same number of rows, we will look at the shapes of the resulting Series.

In [14]:
print('Train')
print('Input  shape: ', X_train.shape)
print('Output shape: ', y_train.shape, '\n')

print('Val')
print('Input  shape: ', X_val.shape)
print('Output shape: ', y_val.shape, '\n')

print('Test')
print('Input  shape: ', X_test.shape)
print('Output shape: ', y_test.shape, '\n')

Train
Input  shape:  (174355,)
Output shape:  (174355,) 

Val
Input  shape:  (19373,)
Output shape:  (19373,) 

Test
Input  shape:  (48432,)
Output shape:  (48432,) 
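The sizes printed above can be reproduced with a little arithmetic. The sketch below assumes that the test portion is rounded up to a whole row and the rest goes to the other part; the helper `split_sizes` is hypothetical and only mirrors that assumption.

```python
import math


def split_sizes(n_samples, test_fraction):
    """Hypothetical helper: return (n_rest, n_test), rounding the test part up."""
    n_test = math.ceil(n_samples * test_fraction)
    return n_samples - n_test, n_test


n_total = 242160                                # rows in the original dataset
n_rest, n_test = split_sizes(n_total, 0.2)      # first split: 80% / 20%
n_train, n_val = split_sizes(n_rest, 0.1)       # second split: 90% / 10% of the rest

print(n_train, n_val, n_test)                   # 174355 19373 48432
print(round(n_train / n_total, 2), round(n_val / n_total, 2))  # 0.72 0.08
```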



As we have already split the data into three (i.e., train, val, test) sets, we can now combine the **X** and **y** values per set through the use of [`concat`](https://pandas.pydata.org/docs/reference/api/pandas.concat.html). This is done for easier tokenizing of the dataset when using BERT and RoBERTa. In addition, using the [`reset_index`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html) function, we would also be resetting the index to make it sequential starting from 0. 

First, we would concatenate the **X** and **y** values of the train set.

In [20]:
train_df = pd.concat([X_train, y_train], axis = 1).reset_index(drop = True)
train_df

Unnamed: 0,text,class
0,How do you explain to your family that you wer...,0
1,I DONT UNDERSTAND THE US DEBT WHO DO THEY OWE ...,0
2,FireIt’s been a bit but I still think of her a...,1
3,AITA for telling my wife (34F) that reddit agr...,0
4,Join among us SGGFIF Jesjeuejjejejeeieieijdjdj...,0
...,...,...
174350,"Fellow teenagers, I have been influenced by th...",0
174351,I felt like talkingSo I was just outside at 01...,1
174352,i am trying to but i just cant i have everythi...,1
174353,I just want my suffering to endAll I have hear...,1


Next, let us combine for the val (i.e., validation) set. 

In [21]:
val_df = pd.concat([X_val, y_val], axis = 1).reset_index(drop = True)
val_df

Unnamed: 0,text,class
0,Really down........just need some words of enc...,1
1,I’m not gonna buy a carThe day gets closer. I’...,1
2,Help me kill myself. Please. Please. Please.I’...,1
3,The only thing keeping me alive is the fact th...,1
4,"I'm not.I'm not the sweet, determined girl eve...",1
...,...,...
19368,when she says Hi! This post seems to be relate...,0
19369,I gotta go to school tmmr for orientation at 9...,0
19370,Hey lads! Can I get some help from y'all? So.....,0
19371,My birthday is this coming month and it will b...,1


Last, we would also be doing these same steps to the test set. 

In [22]:
test_df = pd.concat([X_test, y_test], axis = 1).reset_index(drop = True)
test_df

Unnamed: 0,text,class
0,I just felt myself snapI have to pretend to be...,1
1,Are you envious of something about the opposit...,0
2,"We get it. Men have problems, too. We never sa...",0
3,Happy Birthday to everyone having Birthday on ...,0
4,i cant deal with life any longer but ive tried...,1
...,...,...
48427,I just need to go for everyone's sakeI can't e...,1
48428,Hope is now goneI'm 17m and I'm considering ta...,1
48429,18f needs someone to talk toI understand if th...,1
48430,"Help mePlease someone help me, just pm me.\nI'...",1


### Creation of Dataset
Since we have already created three different sets, we can now transform our DataFrames into one single dataset object. To do this, we first have to convert each set into its own `Dataset` before combining them into one `DatasetDict`.

First, we would be converting our train DataFrame into a dataset. Here, it can be seen that there are **174,355** rows in our train dataset.

In [23]:
train_dataset = datasets.Dataset.from_pandas(train_df)
train_dataset

Dataset({
    features: ['text', 'class'],
    num_rows: 174355
})

Next, the val DataFrame is transformed in the same way. This would result in a dataset with **19,373** rows.

In [24]:
val_dataset = datasets.Dataset.from_pandas(val_df)
val_dataset

Dataset({
    features: ['text', 'class'],
    num_rows: 19373
})

Last is the test DataFrame, which would become a dataset with **48,432** rows.

In [25]:
test_dataset = datasets.Dataset.from_pandas(test_df)
test_dataset

Dataset({
    features: ['text', 'class'],
    num_rows: 48432
})

As we now have a dataset form for all of our sets, we can now merge them together into one dataset.

In [26]:
dataset = datasets.DatasetDict({
    "train" : train_dataset, 
    "val" : val_dataset, 
    "test" : test_dataset
})

dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'class'],
        num_rows: 174355
    })
    val: Dataset({
        features: ['text', 'class'],
        num_rows: 19373
    })
    test: Dataset({
        features: ['text', 'class'],
        num_rows: 48432
    })
})

## Feature Engineering

Because we are done preparing our data, we can now start with transforming it into a form that the machine learning algorithms can understand through feature engineering. For this notebook, we will be utilizing tokenization, specifically through the use of BERT and RoBERTa tokenizers.

### Defining of Functions and Values
Before starting with the tokenizing itself, we will first have to define the needed functions and values. 

One of these values is the **MAX_LENGTH**, which determines the maximum number of tokens that the model will accept per input. The tokenizer uses it in two ways: (1) inputs that are longer than this length are truncated to it, and (2) inputs that are shorter than this length are padded until they reach it. For this notebook, **512** is set as the maximum length.

In [27]:
MAX_LENGTH = 512
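The two effects of a maximum length can be sketched on a plain list of token ids. This is only an illustration, not the tokenizer's implementation: the helper `pad_or_truncate`, the small `DEMO_MAX_LENGTH`, and the padding id `0` are assumptions, and real tokenizers also add special tokens that this sketch ignores.

```python
DEMO_MAX_LENGTH = 8   # small stand-in for the notebook's 512, easier to read
PAD_ID = 0            # hypothetical padding id


def pad_or_truncate(token_ids, max_length=DEMO_MAX_LENGTH, pad_id=PAD_ID):
    """Hypothetical helper: clip long inputs, pad short ones, and build the mask."""
    clipped = token_ids[:max_length]                    # (1) truncate
    n_pad = max_length - len(clipped)
    input_ids = clipped + [pad_id] * n_pad              # (2) pad
    attention_mask = [1] * len(clipped) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([5, 6, 7])
long_ids, long_mask = pad_or_truncate(list(range(1, 11)))

print(short_ids, short_mask)  # [5, 6, 7, 0, 0, 0, 0, 0] [1, 1, 1, 0, 0, 0, 0, 0]
print(long_ids, long_mask)    # [1, 2, 3, 4, 5, 6, 7, 8] [1, 1, 1, 1, 1, 1, 1, 1]
```

Every output has exactly the same length, which is what allows the encoded texts to be stacked into fixed-size tensors.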

In addition, the preprocessing function for an instance is created. In this function, a text is tokenized by the tokenizer (i.e., padded and truncated to the maximum length) and its corresponding label is transformed into a tensor. 

In [28]:
def preprocess_function(examples, tokenizer):
    encoding = tokenizer(examples["text"], padding = "max_length", truncation = True, max_length = MAX_LENGTH)
    encoding["labels"] = torch.tensor(examples ['class'])
    return encoding

Last, the function that would call the preprocessing function on the dataset is defined. In this function, the dataset is also set into a **torch** format. 

In [29]:
def create_encoded_dataset (tokenizer):
    encoded_dataset = dataset.map(preprocess_function, 
                                  batched=True, 
                                  remove_columns=dataset['train'].column_names, 
                                  fn_kwargs = {"tokenizer": tokenizer})
    
    encoded_dataset.set_format("torch")
    
    return encoded_dataset

### Tokenizing with BERT
As our functions and values are ready, the tokenizer can be instantiated. Since we would be utilizing a BERT model, specifically the **bert-base-cased** model, we would be creating a tokenizer that can prepare the text data into the input accepted by the model. 

This can be done through the [`AutoTokenizer`](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoTokenizer) class and the `from_pretrained` function, since the model and the tokenizer that we want to use have already been pretrained.

In [30]:
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-cased', use_fast = False)

With this tokenizer, we will be encoding the dataset into the correct form that is needed by the BERT model.

In [31]:
bert_encoded_dataset = create_encoded_dataset (bert_tokenizer)

  0%|          | 0/175 [00:00<?, ?ba/s]

  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/49 [00:00<?, ?ba/s]

### Tokenizing with RoBERTa
Next, as we also want to use a pretrained RoBERTa model (i.e., **roberta-base**), we also have to do the same steps.

To start with, we need to create an instance of the tokenizer for this specific RoBERTa model.

In [28]:
roberta_tokenizer = AutoTokenizer.from_pretrained('roberta-base')

Since we already have an instance of the tokenizer, we can now use this tokenizer and the pre-processing function we defined previously to transform the dataset.

In [29]:
roberta_encoded_dataset = create_encoded_dataset (roberta_tokenizer)

  0%|          | 0/175 [00:00<?, ?ba/s]

  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/49 [00:00<?, ?ba/s]

## Modeling and Evaluation

As we have already created the features that we would be using for our models, we can now proceed with the modeling proper. For this project, we would be fine-tuning two pre-trained models: **BERT** and **RoBERTa**. 

### Defining of Functions and Values

Before we start with the training proper, we would need to define the functions that will be used for training and evaluating. 

First, we would be creating the function that would be used to compute the scores of the model. In this, we would be using four metrics to evaluate our models: (1) **F1 Macro Score**, (2) **Accuracy**, (3) **Precision**, and (4) **Recall**.

In [33]:
def compute_metrics(p: EvalPrediction):
    logits, labels = p
    # The predicted class is the one with the highest logit
    predictions = np.argmax(logits, axis=-1)
    
    precision_metric = load_metric("precision")
    recall_metric = load_metric("recall")
    accuracy_metric = load_metric("accuracy")
    f1_metric = load_metric("f1")
    
    # Results are named *_result so they do not shadow the imported sklearn functions
    f1_macro_result = f1_metric.compute(predictions=predictions, references=labels, average="macro")
    accuracy_result = accuracy_metric.compute(predictions=predictions, references=labels)
    precision_result = precision_metric.compute(predictions=predictions, references=labels)
    recall_result = recall_metric.compute(predictions=predictions, references=labels)
    
    results = {
        'Accuracy' : accuracy_result['accuracy'],
        'F1 Macro Score' : f1_macro_result['f1'], 
        'Precision' : precision_result['precision'],
        'Recall' : recall_result['recall']
    }
    
    return results
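To see how the four scores relate to each other, they can also be computed by hand. The sketch below is a pure-Python illustration, not the code used by the metric libraries: the helper `binary_scores` and the toy labels are hypothetical, and it assumes that Precision and Recall are reported for the positive class (1) while the F1 score is averaged over both classes.

```python
def binary_scores(predictions, labels):
    """Hypothetical helper: accuracy, macro F1, and positive-class precision/recall."""
    pairs = list(zip(predictions, labels))
    accuracy = sum(p == y for p, y in pairs) / len(pairs)

    def scores_for(cls):
        tp = sum(p == cls and y == cls for p, y in pairs)
        fp = sum(p == cls and y != cls for p, y in pairs)
        fn = sum(p != cls and y == cls for p, y in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        return precision, recall, f1

    precision_1, recall_1, f1_1 = scores_for(1)
    _, _, f1_0 = scores_for(0)
    return {
        "Accuracy": accuracy,
        "F1 Macro Score": (f1_0 + f1_1) / 2,  # unweighted mean over both classes
        "Precision": precision_1,
        "Recall": recall_1,
    }


labels = [0, 0, 1, 1]          # toy, balanced targets
all_negative = [0, 0, 0, 0]    # a degenerate model that always predicts class 0
scores = binary_scores(all_negative, labels)
print(scores)
```

On balanced data, a model that only ever predicts class 0 gets an accuracy of 0.5, a precision and recall of 0.0, and an F1 macro score of about 0.33, so this pattern in an evaluation table is a sign that the model has collapsed to a single class.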

Second, we would be specifying the hyperparameter space that determines the possible hyperparameter values to be tuned. In this, only three hyperparameters would be considered for tuning: (1) the **learning rate**, (2) the **train batch size**, and (3) the **number of training epochs**.

Note that each combination of values is sampled from these sets of values, and only three combinations (i.e., trials) would be tried during the tuning.

In [30]:
def optuna_hp_space(trial):
    return {
        "learning_rate": trial.suggest_categorical("learning_rate", [0.1, 0.01, 0.001]),
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [8, 16]),
        "num_train_epochs": trial.suggest_categorical("num_train_epochs", [2, 3, 4])
    }
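To put the three trials into perspective, the full grid defined by this search space can be enumerated with the standard library. This is only a counting sketch; the dictionary `search_space` simply repeats the values above and is not how Optuna stores them.

```python
from itertools import product

# The same candidate values as in the hyperparameter space above
search_space = {
    "learning_rate": [0.1, 0.01, 0.001],
    "per_device_train_batch_size": [8, 16],
    "num_train_epochs": [2, 3, 4],
}

# Every possible combination of one value per hyperparameter
all_combinations = [
    dict(zip(search_space, values))
    for values in product(*search_space.values())
]

n_trials = 3
print(len(all_combinations))              # 18
print(n_trials / len(all_combinations))   # share of the grid covered by three trials
```

With 18 possible combinations, three trials cover only one sixth of the grid, so the best trial is the best of a small sample rather than of the whole space.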

### BERT Model
Now, we are ready to move on to training the BERT model. 

#### Model Training 

To start with, let us define the pre-trained model that we would be using. For the BERT, [**bert-base-cased**](https://huggingface.co/bert-base-cased)—a model that was pre-trained on a case-sensitive English corpus for masked language modeling (MLM)—would be utilized.

In [34]:
model_checkpoint = 'bert-base-cased'

Let us create an instance of a BERT model using this pretrained model. 

As we would be fine-tuning this model to classify text (i.e., if it is a suicidal or non-suicidal text), an instance of [`AutoModelForSequenceClassification`](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html#automodelforsequenceclassification) would be created specifically. It is also important to note that the input that it would accept is based on the **MAX_LENGTH** variable that we have previously declared, which has the value of **512**.

In [32]:
bert_model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint,
    num_labels = 2, 
    max_length = MAX_LENGTH
).to(device)

Next, we would be defining the [`TrainingArguments`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) that the [`Trainer`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer) would be using. The parameters for the [`TrainingArguments`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) that are used for the training loop are as follows:
* `output_dir` indicates that the model predictions and checkpoints will be saved in the **bert_trainer** folder
* `save_steps` means that the checkpoint will be saved every **20,000** steps
* `save_strategy` specifies that the saving of checkpoint will be based on the number of steps that the model has done 
* `fp16` stipulates that the **16-bit floating point precision** will be used (since its value is True) to save memory
* `evaluation_strategy` designates that the **evaluation** should be done **at the end of every epoch**
* `resume_from_checkpoint` indicates that the training could be **restarted from a previous checkpoint**

In [None]:
training_args = TrainingArguments(output_dir = "bert_trainer", 
                                  save_steps = 20000,
                                  save_strategy = 'steps',
                                  fp16 = True,
                                  evaluation_strategy = "epoch", 
                                  resume_from_checkpoint = True)

As we have now declared the pre-trained model and the training arguments that we would be using, we can now instantiate a [`Trainer`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer) that can do training and evaluation using the following parameters:
* `model` is the BERT model that we would be using for sequence classification
* `args` holds the training arguments that we have previously defined
* `train_dataset` is the tokenized dataset that we would be using for training
* `eval_dataset` is the tokenized dataset that we would be using for evaluating (i.e., the val set)
* `tokenizer` is the tokenizer that we used to prepare our data for the BERT model
* `compute_metrics` is the function that the evaluation loop would use to score the model
* `callbacks` holds the callbacks that hook into the training and evaluation loops; here, a base `TrainerCallback` instance is passed

In [34]:
trainer = Trainer(
    model = bert_model,
    args = training_args,
    train_dataset = bert_encoded_dataset ['train'],
    eval_dataset = bert_encoded_dataset ['val'],
    tokenizer = bert_tokenizer,
    compute_metrics = compute_metrics,
    callbacks = [TrainerCallback()]
)

Using the instance of [`Trainer`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer) that we have created, we can now fine-tune the pre-trained BERT model through the use of the [`train`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer.train) function.

In [38]:
trainer.train()

***** Running training *****
  Num examples = 174355
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 65385
  Number of trainable parameters = 108311810
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
[34m[1mwandb[0m: Currently logged in as: [33mfrancheska_vicente[0m ([33mtonely[0m). Use [1m`wandb login --relogin`[0m to force relogin


Epoch,Training Loss,Validation Loss,Accuracy,F1 macro score,Precision,Recall
1,0.3674,0.413518,0.88293,0.882769,0.855347,0.921319
2,0.561,1.071023,0.500748,0.333666,0.0,0.0
3,0.1946,0.162171,0.952924,0.952922,0.957012,0.948304


Saving model checkpoint to bert_trainer\checkpoint-20000
Configuration saved in bert_trainer\checkpoint-20000\config.json
Model weights saved in bert_trainer\checkpoint-20000\pytorch_model.bin
tokenizer config file saved in bert_trainer\checkpoint-20000\tokenizer_config.json
Special tokens file saved in bert_trainer\checkpoint-20000\special_tokens_map.json
***** Running Evaluation *****
  Num examples = 19373
  Batch size = 8
  precision_metric = load_metric("precision")
Saving model checkpoint to bert_trainer\checkpoint-40000
Configuration saved in bert_trainer\checkpoint-40000\config.json
Model weights saved in bert_trainer\checkpoint-40000\pytorch_model.bin
tokenizer config file saved in bert_trainer\checkpoint-40000\tokenizer_config.json
Special tokens file saved in bert_trainer\checkpoint-40000\special_tokens_map.json
***** Running Evaluation *****
  Num examples = 19373
  Batch size = 8
  _warn_prf(average, modifier, msg_start, len(result))
Saving model checkpoint to bert_trainer

TrainOutput(global_step=65385, training_loss=0.3582221678567332, metrics={'train_runtime': 20003.5531, 'train_samples_per_second': 26.149, 'train_steps_per_second': 3.269, 'total_flos': 1.376241841718784e+17, 'train_loss': 0.3582221678567332, 'epoch': 3.0})

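From the results above, we can see that the model received its highest evaluation scores on the validation set in the third epoch (95.29% for both Accuracy and F1 Macro Score). In the second epoch, the Accuracy fell to 50.07% with a Precision and Recall of 0.0, which indicates that the model predicted the negative class (0) for every validation example at that point before recovering in the third epoch.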
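The step counts reported in the training log can be checked with simple arithmetic. The sketch below assumes one optimization step per batch (gradient accumulation of 1, as the log states) and that the last, smaller batch of each epoch is kept; all numbers are copied from the log above.

```python
import math

num_examples = 174355      # "Num examples" in the training log
batch_size = 8             # "Total train batch size"
num_epochs = 3             # "Num Epochs"
train_runtime = 20003.5531 # seconds, from TrainOutput

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)            # 21795 65385
print(round(total_steps / train_runtime, 3))   # steps per second
print(round(train_runtime / 3600, 2))          # training time in hours
```

The total of 65,385 steps matches the "Total optimization steps" line, and the runtime works out to roughly five and a half hours for three epochs.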

#### Saving BERT base model
To use this model outside the notebook, we would be saving the model. First, let us define the folder where we would be saving the model.

In [39]:
path_for_models ='./saved_models/BERTv4'

Now, let us save the [`Trainer`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer) (i.e., with the weights, the configurations, and the model) and the [`BertTokenizer`](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertTokenizer) in the specified folder. 

In [40]:
trainer.save_model(path_for_models)
bert_tokenizer.save_pretrained(path_for_models)

Saving model checkpoint to ./saved_models/BERTv4
Configuration saved in ./saved_models/BERTv4\config.json
Model weights saved in ./saved_models/BERTv4\pytorch_model.bin
tokenizer config file saved in ./saved_models/BERTv4\tokenizer_config.json
Special tokens file saved in ./saved_models/BERTv4\special_tokens_map.json
tokenizer config file saved in ./saved_models/BERTv4\tokenizer_config.json
Special tokens file saved in ./saved_models/BERTv4\special_tokens_map.json


('./saved_models/BERTv4\\tokenizer_config.json',
 './saved_models/BERTv4\\special_tokens_map.json',
 './saved_models/BERTv4\\vocab.txt',
 './saved_models/BERTv4\\added_tokens.json')

Using the [`Trainer`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer) we have trained, we can now [`evaluate`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer.evaluate) the model using the test set to determine its test score.

In [34]:
trainer.evaluate(eval_dataset=bert_encoded_dataset['test'])

***** Running Evaluation *****
  Num examples = 48432
  Batch size = 8


  precision_metric = load_metric("precision")
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
[34m[1mwandb[0m: Currently logged in as: [33mfrancheska_vicente[0m ([33mtonely[0m). Use [1m`wandb login --relogin`[0m to force relogin


{'eval_loss': 0.16548407077789307,
 'eval_Accuracy': 0.9506111661711265,
 'eval_F1 Macro Score': 0.9506109295912639,
 'eval_Precision': 0.9511326458773347,
 'eval_Recall': 0.949873857479631,
 'eval_runtime': 545.5507,
 'eval_samples_per_second': 88.776,
 'eval_steps_per_second': 11.097}

From the result above, it can be seen that the model was trained successfully. On the test set, it achieved the following scores: 95.06% for Accuracy and F1 Macro Score, 95.11% for Precision, and 94.99% for Recall.

#### Hyperparameter Tuning
Now, let us try to tune the hyperparameters (i.e., the learning rate, the number of training epochs, and the training batch size) of the model, which means that we would try to find the values that give us the highest score. In this, we would be trying three combinations of these hyperparameters, and we would compare the scores received by the three combinations to the score of the base model.

To do this, we will first create a function that would return a base model of a BERT [`AutoModelForSequenceClassification`](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html#automodelforsequenceclassification) for initialization.

In [36]:
def model_init():
    return AutoModelForSequenceClassification.from_pretrained(model_checkpoint,
                                                              num_labels = 2, 
                                                              max_length = MAX_LENGTH)

Like in training the base model, we would be creating the [`TrainingArguments`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) that we would be using for training. We would be using the same parameters for the [`TrainingArguments`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) as before, except for the **fp16**. 

In the tuning, **bf16** (bfloat16) will be used instead. This was done because using **fp16** resulted in scores of 0.0, which we attribute to the reduced numerical precision of fp16.

In [37]:
training_args_tuning = TrainingArguments(output_dir = "bert_trainer", 
                                         save_steps = 20000, 
                                         bf16 = True,
                                         save_strategy = 'steps',
                                         evaluation_strategy = "epoch", 
                                         resume_from_checkpoint = True)
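The numerical difference between the two 16-bit formats can be illustrated with the standard library alone. This is a rough sketch and makes no claim about what happened inside the training run: `to_fp16` round-trips a value through IEEE half precision, while the hypothetical `to_bf16` approximates bfloat16 by keeping only the top 16 bits of a 32-bit float (real conversions usually round instead of truncating).

```python
import struct


def to_fp16(value):
    """Round-trip a Python float through IEEE half precision (fp16)."""
    try:
        return struct.unpack("<e", struct.pack("<e", value))[0]
    except OverflowError:
        return float("inf")  # the value is outside the fp16 range


def to_bf16(value):
    """Approximate bfloat16 by truncating a float32 to its top 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


for value in (1e-8, 1e5):
    print(value, to_fp16(value), to_bf16(value))
```

A very small value such as `1e-8` becomes `0.0` in fp16 and a large one such as `1e5` overflows, whereas the bfloat16 approximation keeps both in range at the cost of fewer significant digits.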

Next, we can create an instance of the [`Trainer`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer) class. Since we would be using the [`Trainer`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer) for tuning, we pass a **model initialization function** (`model_init`) instead of a model instance. This function returns the base model, which is reinitialized at the start of every run with new hyperparameter values. This means that every run starts from the same pre-trained weights and only the values of the tuned hyperparameters change.

In [38]:
trainer_tuning = Trainer(
    model_init = model_init,
    args = training_args_tuning,
    train_dataset = bert_encoded_dataset ['train'],
    eval_dataset = bert_encoded_dataset ['val'],
    tokenizer = bert_tokenizer,
    compute_metrics = compute_metrics,
    callbacks = [TrainerCallback()]
)

loading configuration file config.json from cache at C:\Users\admin/.cache\huggingface\hub\models--bert-base-cased\snapshots\5532cc56f74641d4bb33641f5c76a55d11f846e0\config.json
Model config BertConfig {
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.26.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}

loading weights file pytorch_model.bin from cache at C:\Users\admin/.cache\huggingface\hub\models--bert-base-cased\snapshots\5532cc56f74641d4bb33641f5c76a55d11f846

Using the [`hyperparameter_search`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer.hyperparameter_search) function, we can now start finding the best values of the hyperparameters to use. Note that this function will return the information about the best run (i.e., the model that received the best score).

In [39]:
best_trial = trainer_tuning.hyperparameter_search(
    direction = "maximize",
    backend = "optuna",
    hp_space = optuna_hp_space,
    n_trials = 3
)

[I 2023-04-07 06:36:09,875] A new study created in memory with name: no-name-8501078e-7df7-41fd-b022-6dc74c71cc6e
Trial: {'learning_rate': 0.001, 'per_device_train_batch_size': 8, 'num_train_epochs': 3}
loading configuration file config.json from cache at C:\Users\admin/.cache\huggingface\hub\models--bert-base-cased\snapshots\5532cc56f74641d4bb33641f5c76a55d11f846e0\config.json
Model config BertConfig {
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.26.1",
  "type_vocab_siz

| Epoch | Training Loss | Validation Loss | Accuracy | F1 macro score | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.7517 | 0.710669 | 0.500748 | 0.333666 | 0.0 | 0.0 |
| 2 | 0.7202 | 0.695365 | 0.499252 | 0.333001 | 0.499252 | 1.0 |
| 3 | 0.6972 | 0.691395 | 0.500748 | 0.333666 | 0.0 | 0.0 |


Saving model checkpoint to bert_trainer\run-0\checkpoint-20000
Configuration saved in bert_trainer\run-0\checkpoint-20000\config.json
Model weights saved in bert_trainer\run-0\checkpoint-20000\pytorch_model.bin
tokenizer config file saved in bert_trainer\run-0\checkpoint-20000\tokenizer_config.json
Special tokens file saved in bert_trainer\run-0\checkpoint-20000\special_tokens_map.json
***** Running Evaluation *****
  Num examples = 19373
  Batch size = 8
  precision_metric = load_metric("precision")
  _warn_prf(average, modifier, msg_start, len(result))
Saving model checkpoint to bert_trainer\run-0\checkpoint-40000
Configuration saved in bert_trainer\run-0\checkpoint-40000\config.json
Model weights saved in bert_trainer\run-0\checkpoint-40000\pytorch_model.bin
tokenizer config file saved in bert_trainer\run-0\checkpoint-40000\tokenizer_config.json
Special tokens file saved in bert_trainer\run-0\checkpoint-40000\special_tokens_map.json
***** Running Evaluation *****
  Num examples = 19

0,1
eval/Accuracy,█▁█
eval/F1 Macro Score,█▁█
eval/Precision,▁█▁
eval/Recall,▁█▁
eval/loss,█▂▁
eval/runtime,▁▅█
eval/samples_per_second,█▄▁
eval/steps_per_second,█▄▁
train/epoch,▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇███
train/global_step,▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇███

0,1
eval/Accuracy,0.50075
eval/F1 Macro Score,0.33367
eval/Precision,0.0
eval/Recall,0.0
eval/loss,0.69139
eval/runtime,221.2597
eval/samples_per_second,87.558
eval/steps_per_second,10.946
train/epoch,3.0
train/global_step,65385.0


Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


| Epoch | Training Loss | Validation Loss | Accuracy | F1 macro score | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 7.0334 | 11.454621 | 0.499252 | 0.333001 | 0.499252 | 1.0 |
| 2 | 3.3215 | 4.41292 | 0.499252 | 0.333001 | 0.499252 | 1.0 |
| 3 | 0.8058 | 0.695219 | 0.500748 | 0.333666 | 0.0 | 0.0 |


Saving model checkpoint to bert_trainer\run-1\checkpoint-20000
Configuration saved in bert_trainer\run-1\checkpoint-20000\config.json
Model weights saved in bert_trainer\run-1\checkpoint-20000\pytorch_model.bin
tokenizer config file saved in bert_trainer\run-1\checkpoint-20000\tokenizer_config.json
Special tokens file saved in bert_trainer\run-1\checkpoint-20000\special_tokens_map.json
***** Running Evaluation *****
  Num examples = 19373
  Batch size = 8
Saving model checkpoint to bert_trainer\run-1\checkpoint-40000
Configuration saved in bert_trainer\run-1\checkpoint-40000\config.json
Model weights saved in bert_trainer\run-1\checkpoint-40000\pytorch_model.bin
tokenizer config file saved in bert_trainer\run-1\checkpoint-40000\tokenizer_config.json
Special tokens file saved in bert_trainer\run-1\checkpoint-40000\special_tokens_map.json
***** Running Evaluation *****
  Num examples = 19373
  Batch size = 8
Saving model checkpoint to bert_trainer\run-1\checkpoint-60000
Configuration sav

0,1
eval/Accuracy,▁▁█
eval/F1 Macro Score,▁▁█
eval/Precision,██▁
eval/Recall,██▁
eval/loss,█▃▁
eval/runtime,▅▁█
eval/samples_per_second,▄█▁
eval/steps_per_second,▄█▁
train/epoch,▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇███
train/global_step,▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇███

0,1
eval/Accuracy,0.50075
eval/F1 Macro Score,0.33367
eval/Precision,0.0
eval/Recall,0.0
eval/loss,0.69522
eval/runtime,219.0251
eval/samples_per_second,88.451
eval/steps_per_second,11.058
train/epoch,3.0
train/global_step,65385.0


Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


| Epoch | Training Loss | Validation Loss | Accuracy | F1 macro score | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 7.0334 | 11.454621 | 0.499252 | 0.333001 | 0.499252 | 1.0 |
| 2 | 3.3215 | 4.41292 | 0.499252 | 0.333001 | 0.499252 | 1.0 |
| 3 | 0.8058 | 0.695219 | 0.500748 | 0.333666 | 0.0 | 0.0 |


Saving model checkpoint to bert_trainer\run-2\checkpoint-20000
Configuration saved in bert_trainer\run-2\checkpoint-20000\config.json
Model weights saved in bert_trainer\run-2\checkpoint-20000\pytorch_model.bin
tokenizer config file saved in bert_trainer\run-2\checkpoint-20000\tokenizer_config.json
Special tokens file saved in bert_trainer\run-2\checkpoint-20000\special_tokens_map.json
***** Running Evaluation *****
  Num examples = 19373
  Batch size = 8
Saving model checkpoint to bert_trainer\run-2\checkpoint-40000
Configuration saved in bert_trainer\run-2\checkpoint-40000\config.json
Model weights saved in bert_trainer\run-2\checkpoint-40000\pytorch_model.bin
tokenizer config file saved in bert_trainer\run-2\checkpoint-40000\tokenizer_config.json
Special tokens file saved in bert_trainer\run-2\checkpoint-40000\special_tokens_map.json
***** Running Evaluation *****
  Num examples = 19373
  Batch size = 8
Saving model checkpoint to bert_trainer\run-2\checkpoint-60000
Configuration sav

In [40]:
best_trial

BestRun(run_id='0', objective=0.8344142826144729, hyperparameters={'learning_rate': 0.001, 'per_device_train_batch_size': 8, 'num_train_epochs': 3})

From this output, it can be seen that only two distinct BERT configurations were created during tuning, with the following hyperparameters:
* **Learning Rate** = 0.001, **Train Batch Size** = 8, **Number of Train Epochs** = 3
* **Learning Rate** = 0.1, **Train Batch Size** = 8, **Number of Train Epochs** = 3

These values were randomly generated based on the hyperparameter space that we have declared.
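The `objective` reported in `best_trial` (0.8344...) is not a single metric. When no `compute_objective` is passed, `hyperparameter_search` appears to fall back on a default that sums every evaluation metric except the loss, the epoch, and the speed metrics. The sketch below re-implements that rule in plain Python and reproduces the reported value from the final epoch of the first trial. The helper name `summed_objective` is hypothetical, and the rule itself is an assumption that has only been checked against the logged values.

```python
def summed_objective(metrics: dict) -> float:
    """Sum all evaluation metrics except loss, epoch, and speed metrics.

    Assumption: this mirrors the default objective used by
    `hyperparameter_search` when `compute_objective` is not given.
    """
    ignored = {"eval_loss", "epoch"}
    kept = {
        name: value
        for name, value in metrics.items()
        if name not in ignored
        and not name.endswith("_runtime")
        and not name.endswith("_per_second")
    }
    # With no other metrics available, fall back to the loss itself.
    if not kept:
        return metrics["eval_loss"]
    return sum(kept.values())


# Final-epoch validation metrics of the first BERT trial, copied from the log above.
first_trial = {
    "eval_loss": 0.69139,
    "eval_Accuracy": 0.500748,
    "eval_F1 Macro Score": 0.333666,
    "eval_Precision": 0.0,
    "eval_Recall": 0.0,
    "eval_runtime": 221.2597,
    "eval_samples_per_second": 87.558,
    "eval_steps_per_second": 10.946,
    "epoch": 3.0,
}

objective = summed_objective(first_trial)
print(round(objective, 4))  # 0.8344
```

Because the score is a sum, a trial whose model predicts the positive class for every sample (Precision of about 0.5, Recall of 1.0) scores higher than one that never predicts it, even though both sit at roughly 50% accuracy.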

##### Saving BERT tuned model

As with the base model, we will also save the model files produced by the tuning.

In [43]:
path_for_models ='./saved_models/BERTv2_tuned'
trainer_tuning.save_model(path_for_models)

Saving model checkpoint to ./saved_models/BERTv2_tuned
Configuration saved in ./saved_models/BERTv2_tuned\config.json
Model weights saved in ./saved_models/BERTv2_tuned\pytorch_model.bin
tokenizer config file saved in ./saved_models/BERTv2_tuned\tokenizer_config.json
Special tokens file saved in ./saved_models/BERTv2_tuned\special_tokens_map.json


#### Evaluation

To test how the best trial of the BERT tuning fared on the test dataset, we will use the [`evaluate`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer.evaluate) function.

In [34]:
trainer_tuning.evaluate(eval_dataset=bert_encoded_dataset['test'])

***** Running Evaluation *****
  Num examples = 48432
  Batch size = 8


  precision_metric = load_metric("precision")
  _warn_prf(average, modifier, msg_start, len(result))
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
[34m[1mwandb[0m: Currently logged in as: [33mfrancheska_vicente[0m ([33mtonely[0m). Use [1m`wandb login --relogin`[0m to force relogin


{'eval_loss': 0.6942650675773621,
 'eval_Accuracy': 0.5007639577139081,
 'eval_F1 Macro Score': 0.3336726972552796,
 'eval_Precision': 0.0,
 'eval_Recall': 0.0,
 'eval_runtime': 1057.7462,
 'eval_samples_per_second': 45.788,
 'eval_steps_per_second': 5.723}

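In this result, it can be seen that the BERT model (with the learning rate of 0.001) was accurate on only about 50% of the test samples. Because the **Recall** is 0, none of the **Suicidal** texts were detected, and the `_warn_prf` warning in the output suggests that the **Precision** is reported as 0 only because the model never predicted **Suicidal** at all. In effect, the model assigns every text to the other class.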
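These scores are what a classifier that outputs the same label for every sample would receive. The sketch below computes the metrics of such a constant predictor from class counts alone. The helper `constant_prediction_metrics` is hypothetical, and the class counts (24,179 positive and 24,253 negative out of 48,432) are an assumption inferred from the reported accuracy and the number of test examples, not values printed by the notebook.

```python
def constant_prediction_metrics(n_positive: int, n_negative: int, predict_positive: bool) -> dict:
    """Metrics of a classifier that outputs the same label for every sample."""
    total = n_positive + n_negative
    # Confusion-matrix cells with respect to the positive class.
    tp = n_positive if predict_positive else 0
    fp = n_negative if predict_positive else 0
    fn = 0 if predict_positive else n_positive
    tn = 0 if predict_positive else n_negative

    def f1(true_pos: int, false_pos: int, false_neg: int) -> float:
        denominator = 2 * true_pos + false_pos + false_neg
        return 2 * true_pos / denominator if denominator else 0.0

    # An undefined ratio (no predicted or no actual positives) is reported as 0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Macro F1 averages the F1 of the positive class and of the negative class.
    f1_macro = (f1(tp, fp, fn) + f1(tn, fn, fp)) / 2
    return {
        "accuracy": (tp + tn) / total,
        "f1_macro": f1_macro,
        "precision": precision,
        "recall": recall,
    }


# Assumed class counts for the test set (see the note above).
always_negative = constant_prediction_metrics(24179, 24253, predict_positive=False)
print({name: round(value, 4) for name, value in always_negative.items()})
```

Under these assumed counts, the always-negative predictor gives an accuracy of about 0.5008 and a macro F1 of about 0.3337, with Precision and Recall both 0, which lines up with the evaluation output above.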

Comparing the validation scores of these two tuned models with those of the base model, the scores received by the base model were still better. Note that the only difference between these three models is the **learning rate** (i.e., the BERT base model has a learning rate of **0.0001**). Thus, for the BERT model, we will consider the base model as our best model.

### RoBERTa Model
Now, we can move on to training the RoBERTa model.

#### Model Training 
Like in the BERT model, we would need to define the pre-trained model that we would be fine-tuning. For this, we would be using [**roberta-base**](https://huggingface.co/roberta-base). This model, which is case-sensitive, was also pre-trained for the purpose of masked language modeling (MLM) on an English corpus. However, it uses the RoBERTa architecture instead of the BERT architecture.

In [34]:
model_checkpoint_roberta = 'roberta-base'

Using this pre-trained model, we can instantiate an [`AutoModelForSequenceClassification`](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html#automodelforsequenceclassification) object, which will create a RoBERTa model. In addition, we would also be defining the **MAX_LENGTH** of the model to be the same as the previously defined **MAX_LENGTH** (i.e., 512).

In [35]:
roberta_model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint_roberta,
    num_labels = 2, 
    max_length = MAX_LENGTH
).to(device)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.dense.bias', 'roberta.pooler.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'lm_head.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.weight', 'classi

We would also need to create an instance of [`TrainingArguments`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments). This would have the same values as the previous [`TrainingArguments`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) of the BERT model, except for the `output_dir`, as we want to save the checkpoints in another folder.

In [36]:
training_args = TrainingArguments(output_dir = "roberta_trainer", 
                                  save_steps = 20000,
                                  save_strategy = 'steps',
                                  fp16 = True,
                                  evaluation_strategy = "epoch", 
                                  resume_from_checkpoint = True)

Using this RoBERTa model and the previously created [`TrainingArguments`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) object, we can now create a [`Trainer`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer). Its parameters are the same as those of the [`Trainer`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer) for BERT, except that the `model`, `train_dataset`, and `eval_dataset` are changed to their RoBERTa counterparts.

In [38]:
trainer = Trainer(
    model = roberta_model,
    args = training_args,
    train_dataset = roberta_encoded_dataset['train'],
    eval_dataset = roberta_encoded_dataset['val'],
    compute_metrics = compute_metrics,
    callbacks = [TrainerCallback()]
)

Using cuda_amp half precision backend


Now, we can train the RoBERTa model.

In [39]:
trainer.train()

***** Running training *****
  Num examples = 174355
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 65385
  Number of trainable parameters = 124647170
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
[34m[1mwandb[0m: Currently logged in as: [33mfrancheska_vicente[0m ([33mtonely[0m). Use [1m`wandb login --relogin`[0m to force relogin


| Epoch | Training Loss | Validation Loss | Accuracy | F1 macro score | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.6226 | 1.136392 | 0.500748 | 0.333666 | 0.0 | 0.0 |
| 2 | 0.6935 | 0.934038 | 0.500748 | 0.333666 | 0.0 | 0.0 |
| 3 | 0.1714 | 0.176044 | 0.954731 | 0.954715 | 0.970673 | 0.937655 |


Saving model checkpoint to roberta_trainer\checkpoint-20000
Configuration saved in roberta_trainer\checkpoint-20000\config.json
Model weights saved in roberta_trainer\checkpoint-20000\pytorch_model.bin
***** Running Evaluation *****
  Num examples = 19373
  Batch size = 8
  precision_metric = load_metric("precision")
  _warn_prf(average, modifier, msg_start, len(result))
Saving model checkpoint to roberta_trainer\checkpoint-40000
Configuration saved in roberta_trainer\checkpoint-40000\config.json
Model weights saved in roberta_trainer\checkpoint-40000\pytorch_model.bin
***** Running Evaluation *****
  Num examples = 19373
  Batch size = 8
  _warn_prf(average, modifier, msg_start, len(result))
Saving model checkpoint to roberta_trainer\checkpoint-60000
Configuration saved in roberta_trainer\checkpoint-60000\config.json
Model weights saved in roberta_trainer\checkpoint-60000\pytorch_model.bin
***** Running Evaluation *****
  Num examples = 19373
  Batch size = 8


Training completed. Do 

TrainOutput(global_step=65385, training_loss=0.5497573370633435, metrics={'train_runtime': 20653.9299, 'train_samples_per_second': 25.325, 'train_steps_per_second': 3.166, 'total_flos': 1.376241841718784e+17, 'train_loss': 0.5497573370633435, 'epoch': 3.0})

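From this, it can be seen that, in the third epoch, the RoBERTa base model was able to achieve an **Accuracy and F1 Macro Score** of **95.47%**, a **Precision** of **97.07%**, and a **Recall** of **93.77%**. In the first two epochs, by contrast, the model was still at about 50% accuracy with a Precision and Recall of 0.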
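The `Total optimization steps = 65385` in the training log follows from the number of training examples, the batch size, and the number of epochs. Below is a minimal sketch of that arithmetic, assuming single-device training, no gradient accumulation (as the log reports), and a partial final batch that is kept. `total_optimization_steps` is a hypothetical helper, not a library function.

```python
import math


def total_optimization_steps(num_examples: int, batch_size: int, num_epochs: int) -> int:
    """Optimizer steps for single-device training without gradient accumulation."""
    # The last batch of an epoch may be partial, hence the ceiling.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


total_steps = total_optimization_steps(num_examples=174355, batch_size=8, num_epochs=3)

# With save_steps = 20000, a checkpoint is written every 20000 optimizer steps.
checkpoint_steps = list(range(20000, total_steps + 1, 20000))

print(total_steps)       # 65385
print(checkpoint_steps)  # [20000, 40000, 60000]
```

This also accounts for the three checkpoint folders (`checkpoint-20000`, `checkpoint-40000`, `checkpoint-60000`) that appear in the log.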

#### Saving RoBERTa base model
Since we are done training the model, we will save the RoBERTa model along with its configuration and tokenizer.

In [40]:
path_for_models ='./saved_models/RoBERTav2'
trainer.save_model(path_for_models)
roberta_tokenizer.save_pretrained(path_for_models)

Saving model checkpoint to ./saved_models/RoBERTav2
Configuration saved in ./saved_models/RoBERTav2\config.json
Model weights saved in ./saved_models/RoBERTav2\pytorch_model.bin
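The saved folder can later be loaded back for inference. The sketch below is one possible way to do this, assuming the folder contains both the model files and the tokenizer files written by the cell above. The helper names `load_saved_classifier` and `predict_label_ids` are hypothetical, and the returned values are raw class indices, since the mapping from index to label depends on how the dataset was encoded.

```python
def load_saved_classifier(model_dir: str = "./saved_models/RoBERTav2"):
    """Reload the fine-tuned model and its tokenizer from a saved folder."""
    # Imported inside the function so the helper can be defined before
    # transformers is actually needed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    model.eval()  # disable dropout for inference
    return tokenizer, model


def predict_label_ids(texts, tokenizer, model, max_length: int = 512):
    """Return the predicted class index for each text in `texts`."""
    import torch

    inputs = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.argmax(dim=-1).tolist()
```

A call such as `predict_label_ids(["some text"], *load_saved_classifier())` would return a list with one class index per input text.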


We can now [`evaluate`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer.evaluate) this RoBERTa model on the test set.

In [41]:
trainer.evaluate(eval_dataset=roberta_encoded_dataset['test'])

***** Running Evaluation *****
  Num examples = 48432
  Batch size = 8


{'eval_loss': 0.16810336709022522,
 'eval_Accuracy': 0.95701189296333,
 'eval_F1 Macro Score': 0.9569997463552207,
 'eval_Precision': 0.9713724988267417,
 'eval_Recall': 0.9416435750031019,
 'eval_runtime': 524.6124,
 'eval_samples_per_second': 92.32,
 'eval_steps_per_second': 11.54,
 'epoch': 3.0}

Comparing the scores received by the RoBERTa base model and the best BERT model, it is apparent that the **RoBERTa model received higher scores in every metric except for Recall**. 

#### Hyperparameter Tuning
To further see if we can improve the current RoBERTa model, we can tune the model's hyperparameters. 

Like in the BERT model, we would first need to create a function that would return the initial state of the model that would be tuned. 

In [30]:
def model_init_roberta():
    # Called at the start of every trial so that each one starts from the pre-trained checkpoint
    return AutoModelForSequenceClassification.from_pretrained(
        model_checkpoint_roberta,
        num_labels = 2,
        max_length = MAX_LENGTH
    )

Next, we would have to create the [`TrainingArguments`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) that we would be using for the training loop.

In [31]:
training_args_tuning = TrainingArguments(output_dir = "roberta_trainer", 
                                         save_steps = 20000, 
                                         bf16 = True,
                                         save_strategy = 'steps',
                                         evaluation_strategy = "epoch", 
                                         resume_from_checkpoint = True)

With this, we can now proceed with creating an instance of the [`Trainer`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer) object.

In [35]:
trainer_tuning = Trainer(
    model_init = model_init_roberta,
    args = training_args_tuning,
    train_dataset = roberta_encoded_dataset['train'],
    eval_dataset = roberta_encoded_dataset['val'],
    compute_metrics = compute_metrics,
    callbacks = [TrainerCallback()]
)

loading configuration file config.json from cache at C:\Users\admin/.cache\huggingface\hub\models--roberta-base\snapshots\bc2764f8af2e92b6eb5679868df33e224075ca68\config.json
Model config RobertaConfig {
  "_name_or_path": "roberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_length": 512,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.26.1",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}

loading weights file pytorch_model.bin from cache at C:\Users\admin/.cache\huggingface\hub\models--roberta-base\snapshots\bc2764f8af

We can now proceed with utilizing the [`hyperparameter_search`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer.hyperparameter_search) function to: (1) randomize values for the three hyperparameters that we want to tune based on the search space, (2) train three models using the values, and (3) pick the best model from the three trained models. 

In [36]:
best_trial_roberta = trainer_tuning.hyperparameter_search(
    direction = "maximize",
    backend = "optuna",
    hp_space = optuna_hp_space,
    n_trials = 3
)

[I 2023-04-09 22:02:53,389] A new study created in memory with name: no-name-8d7ea55e-5a1a-4654-9a1c-e2933a9a9274
Trial: {'learning_rate': 0.01, 'per_device_train_batch_size': 16, 'num_train_epochs': 4}
loading configuration file config.json from cache at C:\Users\admin/.cache\huggingface\hub\models--roberta-base\snapshots\bc2764f8af2e92b6eb5679868df33e224075ca68\config.json
Model config RobertaConfig {
  "_name_or_path": "roberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_length": 512,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_versio

| Epoch | Training Loss | Validation Loss | Accuracy | F1 macro score | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 1.1348 | 1.785331 | 0.500748 | 0.333666 | 0.0 | 0.0 |
| 2 | 0.9914 | 0.825983 | 0.499252 | 0.333001 | 0.499252 | 1.0 |
| 3 | 0.8305 | 0.776718 | 0.500748 | 0.333666 | 0.0 | 0.0 |
| 4 | 0.703 | 0.695388 | 0.499252 | 0.333001 | 0.499252 | 1.0 |


***** Running Evaluation *****
  Num examples = 19373
  Batch size = 8
  precision_metric = load_metric("precision")
  _warn_prf(average, modifier, msg_start, len(result))
Saving model checkpoint to roberta_trainer\run-0\checkpoint-20000
Configuration saved in roberta_trainer\run-0\checkpoint-20000\config.json
Model weights saved in roberta_trainer\run-0\checkpoint-20000\pytorch_model.bin
***** Running Evaluation *****
  Num examples = 19373
  Batch size = 8
***** Running Evaluation *****
  Num examples = 19373
  Batch size = 8
  _warn_prf(average, modifier, msg_start, len(result))
Saving model checkpoint to roberta_trainer\run-0\checkpoint-40000
Configuration saved in roberta_trainer\run-0\checkpoint-40000\config.json
Model weights saved in roberta_trainer\run-0\checkpoint-40000\pytorch_model.bin
***** Running Evaluation *****
  Num examples = 19373
  Batch size = 8


Training completed. Do not forget to share your model on huggingface.co/models =)


[32m[I 2023-04-10 04:32:41,916][

0,1
eval/Accuracy,█▁█▁
eval/F1 Macro Score,█▁█▁
eval/Precision,▁█▁█
eval/Recall,▁█▁█
eval/loss,█▂▂▁
eval/runtime,█▇▅▁
eval/samples_per_second,▁▂▄█
eval/steps_per_second,▁▂▄█
train/epoch,▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇████
train/global_step,▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇████

0,1
eval/Accuracy,0.49925
eval/F1 Macro Score,0.333
eval/Precision,0.49925
eval/Recall,1.0
eval/loss,0.69539
eval/runtime,213.3167
eval/samples_per_second,90.818
eval/steps_per_second,11.354
train/epoch,4.0
train/global_step,43592.0


Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


| Epoch | Training Loss | Validation Loss | Accuracy | F1 macro score | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.6952 | 0.693392 | 0.499252 | 0.333001 | 0.499252 | 1.0 |
| 2 | 0.5836 | 1.026052 | 0.500748 | 0.333666 | 0.0 | 0.0 |


Saving model checkpoint to roberta_trainer\run-1\checkpoint-20000
Configuration saved in roberta_trainer\run-1\checkpoint-20000\config.json
Model weights saved in roberta_trainer\run-1\checkpoint-20000\pytorch_model.bin
***** Running Evaluation *****
  Num examples = 19373
  Batch size = 8
Saving model checkpoint to roberta_trainer\run-1\checkpoint-40000
Configuration saved in roberta_trainer\run-1\checkpoint-40000\config.json
Model weights saved in roberta_trainer\run-1\checkpoint-40000\pytorch_model.bin
***** Running Evaluation *****
  Num examples = 19373
  Batch size = 8
  _warn_prf(average, modifier, msg_start, len(result))


Training completed. Do not forget to share your model on huggingface.co/models =)


[I 2023-04-10 08:18:07,732] Trial 1 finished with value: 0.8344142826144729 and parameters: {'learning_rate': 0.001, 'per_device_train_batch_size': 8, 'num_train_epochs': 2}. Best is trial 0 with value: 2.3315035877247845.
Trial: {'learning_rate': 0.01, 'per_devic

0,1
eval/Accuracy,▁█
eval/F1 Macro Score,▁█
eval/Precision,█▁
eval/Recall,█▁
eval/loss,▁█
eval/runtime,▁█
eval/samples_per_second,█▁
eval/steps_per_second,█▁
train/epoch,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇████
train/global_step,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇████

0,1
eval/Accuracy,0.50075
eval/F1 Macro Score,0.33367
eval/Precision,0.0
eval/Recall,0.0
eval/loss,1.02605
eval/runtime,221.8741
eval/samples_per_second,87.315
eval/steps_per_second,10.916
train/epoch,2.0
train/global_step,43590.0


Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


| Epoch | Training Loss | Validation Loss | Accuracy | F1 macro score | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 1.0126 | 0.735783 | 0.499252 | 0.333001 | 0.499252 | 1.0 |
| 2 | 0.7181 | 0.693415 | 0.499252 | 0.333001 | 0.499252 | 1.0 |


***** Running Evaluation *****
  Num examples = 19373
  Batch size = 8
Saving model checkpoint to roberta_trainer\run-2\checkpoint-20000
Configuration saved in roberta_trainer\run-2\checkpoint-20000\config.json
Model weights saved in roberta_trainer\run-2\checkpoint-20000\pytorch_model.bin
***** Running Evaluation *****
  Num examples = 19373
  Batch size = 8


Training completed. Do not forget to share your model on huggingface.co/models =)


[I 2023-04-10 11:35:47,593] Trial 2 finished with value: 2.3315035877247845 and parameters: {'learning_rate': 0.01, 'per_device_train_batch_size': 16, 'num_train_epochs': 2}. Best is trial 0 with value: 2.3315035877247845.


In [37]:
best_trial_roberta

BestRun(run_id='0', objective=2.3315035877247845, hyperparameters={'learning_rate': 0.01, 'per_device_train_batch_size': 16, 'num_train_epochs': 4})

In the tuning, three RoBERTa models were created and compared with the following hyperparameter values:
* **Learning Rate** = 0.01, **Train Batch Size** = 16, **Number of Train Epochs** = 4
* **Learning Rate** = 0.001, **Train Batch Size** = 8, **Number of Train Epochs** = 2
* **Learning Rate** = 0.01, **Train Batch Size** = 16, **Number of Train Epochs** = 2

Out of these three, the best run for the RoBERTa model was the first one, which **trained for four (4) epochs with a learning rate of 0.01 and a train batch size of 16**. However, its validation accuracy after the final epoch was only about 49.93%, compared with 95.47% for the RoBERTa base model, so the RoBERTa base model still performed better.
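The objective values reported for these trials (e.g., 2.3315) exceed 1, which is consistent with the search scoring each trial by the sum of its Accuracy, F1 Macro Score, Precision, and Recall. The sketch below compares that summed score with a hypothetical alternative, `f1_macro_objective`, that ranks trials by macro F1 alone. The metric values are the final-epoch validation scores copied from the logs above, and the `eval_` prefix on the keys is assumed from the `evaluate` outputs in this notebook.

```python
def f1_macro_objective(metrics: dict) -> float:
    """Score a trial by its macro F1 alone (hypothetical alternative objective)."""
    return metrics["eval_F1 Macro Score"]


# Final-epoch validation metrics of the three RoBERTa trials.
trials = {
    0: {"eval_Accuracy": 0.499252, "eval_F1 Macro Score": 0.333001,
        "eval_Precision": 0.499252, "eval_Recall": 1.0},
    1: {"eval_Accuracy": 0.500748, "eval_F1 Macro Score": 0.333666,
        "eval_Precision": 0.0, "eval_Recall": 0.0},
    2: {"eval_Accuracy": 0.499252, "eval_F1 Macro Score": 0.333001,
        "eval_Precision": 0.499252, "eval_Recall": 1.0},
}

summed = {run: sum(scores.values()) for run, scores in trials.items()}
by_f1 = {run: f1_macro_objective(scores) for run, scores in trials.items()}

print(max(summed, key=summed.get))  # 0
print(max(by_f1, key=by_f1.get))    # 1
```

The summed score favors the trials that predict the positive class for every sample, because a Recall of 1.0 inflates the total. A function like this could be passed to `hyperparameter_search` through its `compute_objective` argument, although here all three trials have a macro F1 near 0.333, so a different objective would not have produced a usable tuned model.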

##### Saving RoBERTa tuned model

To use this model outside of this notebook, we will save the model held by the RoBERTa [`Trainer`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer) object and the [`RoBERTa Tokenizer`](https://huggingface.co/docs/transformers/model_doc/roberta#transformers.RobertaTokenizer).

In [38]:
path_for_models ='./saved_models/RoBERTav2_tuned'
trainer_tuning.save_model(path_for_models)
roberta_tokenizer.save_pretrained(path_for_models)

Saving model checkpoint to ./saved_models/RoBERTav2_tuned
Configuration saved in ./saved_models/RoBERTav2_tuned\config.json
Model weights saved in ./saved_models/RoBERTav2_tuned\pytorch_model.bin
tokenizer config file saved in ./saved_models/RoBERTav2_tuned\tokenizer_config.json
Special tokens file saved in ./saved_models/RoBERTav2_tuned\special_tokens_map.json


('./saved_models/RoBERTav2_tuned\\tokenizer_config.json',
 './saved_models/RoBERTav2_tuned\\special_tokens_map.json',
 './saved_models/RoBERTav2_tuned\\vocab.json',
 './saved_models/RoBERTav2_tuned\\merges.txt',
 './saved_models/RoBERTav2_tuned\\added_tokens.json',
 './saved_models/RoBERTav2_tuned\\tokenizer.json')

#### Evaluation

Lastly, let us see how the model obtained from the RoBERTa tuning fared on the test dataset.

In [39]:
trainer_tuning.evaluate(eval_dataset = roberta_encoded_dataset['test'])

***** Running Evaluation *****
  Num examples = 48432
  Batch size = 8


{'eval_loss': 0.6934160590171814,
 'eval_Accuracy': 0.49923604228609186,
 'eval_F1 Macro Score': 0.33299362355565965,
 'eval_Precision': 0.49923604228609186,
 'eval_Recall': 1.0,
 'eval_runtime': 535.8659,
 'eval_samples_per_second': 90.381,
 'eval_steps_per_second': 11.298,
 'epoch': 2.0}

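From this, it is evident that the RoBERTa base model also performed better on the test set than the model returned by the tuning. Note that the output reports `'epoch': 2.0`, which suggests that the evaluated weights come from a two-epoch trial rather than from the four-epoch best run; either way, all three tuned trials ended at about 50% accuracy on the validation set.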

In conclusion, comparing the final BERT and RoBERTa models (i.e., those that used the default values for their hyperparameters and a MAX_LENGTH of 512), the RoBERTa model received higher scores on all of the metrics except for Recall.