# **Insert Title Here**
**DATA103 S11 Group 4**
- GOZON, Jean Pauline D.
- JAMIAS, Gillian Nicole A.
- MARCELO, Andrea Jean C.
- REYES, Anton Gabriel G.
- VICENTE, Francheska Josefa

## Requirements and Imports

### Imports

**Basic Libraries**

* `numpy` contains a large collection of mathematical functions
* `pandas` contains functions that are designed for data manipulation and data analysis



In [1]:
import numpy as np
import pandas as pd
import datasets

**Machine Learning Libraries**

* `torch` is an open-source machine learning library for building and training deep neural networks
* `transformers` provides pre-trained models, their tokenizers, and the `Trainer` training API

In [2]:
from sklearn.model_selection import train_test_split

In [3]:
import torch
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
from pytorch_lightning.callbacks import ProgressBarBase, RichProgressBar

In [4]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainerCallback, TrainingArguments, Trainer, DataCollatorWithPadding

In [5]:
from sklearn.metrics import f1_score, roc_auc_score, hamming_loss, accuracy_score
from transformers import EvalPrediction
import evaluate

from datasets import load_metric

In [6]:
import optuna

In [7]:
import pickle

In [8]:
df = pd.read_csv ('cleaned_data.csv')
df

Unnamed: 0,class,text
0,0,"['Its not a viable option, and youll be leavin..."
1,1,['It can be hard to appreciate the notion that...
2,1,"['Hi, so last night i was sitting on the ledge..."
3,1,['I tried to kill my self once and failed badl...
4,1,['Hi NEM3030. What sorts of things do you enjo...
...,...,...
242155,0,If you don't like rock then your not going to ...
242156,0,You how you can tell i have so many friends an...
242157,0,pee probably tastes like salty tea😏💦‼️ can som...
242158,1,The usual stuff you find hereI'm not posting t...


In [9]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

## Preparing Data for Feature Engineering

### Splitting the Dataset into Train, Validation, and Test Sets

In [10]:
X = df ['text']
X

0         ['Its not a viable option, and youll be leavin...
1         ['It can be hard to appreciate the notion that...
2         ['Hi, so last night i was sitting on the ledge...
3         ['I tried to kill my self once and failed badl...
4         ['Hi NEM3030. What sorts of things do you enjo...
                                ...                        
242155    If you don't like rock then your not going to ...
242156    You how you can tell i have so many friends an...
242157    pee probably tastes like salty tea😏💦‼️ can som...
242158    The usual stuff you find hereI'm not posting t...
242159    I still haven't beaten the first boss in Hollo...
Name: text, Length: 242160, dtype: object

In [11]:
y = df ['class']
y

0         0
1         1
2         1
3         1
4         1
         ..
242155    0
242156    0
242157    0
242158    1
242159    0
Name: class, Length: 242160, dtype: int64

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,
                                                    stratify = y,
                                                    random_state = 42, 
                                                    shuffle = True)

In [13]:
X_train, X_val, y_train, y_val = train_test_split(X_train, 
                                                  y_train, 
                                                  test_size = 0.1,
                                                  stratify = y_train,
                                                  random_state = 42, 
                                                  shuffle = True)

In [14]:
print('Train input  shape: ', X_train.shape)
print('Train output shape: ', y_train.shape)

Train input  shape:  (174355,)
Train output shape:  (174355,)


In [15]:
print('Val input  shape: ', X_val.shape)
print('Val output shape: ', y_val.shape)

Val input  shape:  (19373,)
Val output shape:  (19373,)


In [16]:
print('Test input  shape: ', X_test.shape)
print('Test output shape: ', y_test.shape)

Test input  shape:  (48432,)
Test output shape:  (48432,)
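
The two `train_test_split` calls compose: 20% of the data is held out for testing, and 10% of the remaining 80% is held out for validation. The sketch below checks the resulting proportions against the shapes printed above; the variable names are illustrative and not part of the notebook.

```python
# Sizes reported by the shape printouts above.
n_train, n_val, n_test = 174355, 19373, 48432
n_total = n_train + n_val + n_test  # rows in the full dataset

test_frac = 0.2              # test_size of the first split
val_frac_of_remainder = 0.1  # test_size of the second split

# Expected share of the full dataset for each subset.
expected = {
    "train": (1 - test_frac) * (1 - val_frac_of_remainder),  # 0.72
    "val": (1 - test_frac) * val_frac_of_remainder,          # 0.08
    "test": test_frac,                                       # 0.20
}
observed = {
    "train": n_train / n_total,
    "val": n_val / n_total,
    "test": n_test / n_total,
}

for name in expected:
    print(f"{name}: expected {expected[name]:.2f}, observed {observed[name]:.4f}")
```

The effective split is therefore roughly 72% train, 8% validation, and 20% test.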


In [17]:
train_df = pd.concat([X_train, y_train], axis = 1).reset_index(drop = True)
train_df

Unnamed: 0,text,class
0,How do you explain to your family that you wer...,0
1,I DONT UNDERSTAND THE US DEBT WHO DO THEY OWE ...,0
2,FireIt’s been a bit but I still think of her a...,1
3,AITA for telling my wife (34F) that reddit agr...,0
4,Join among us SGGFIF Jesjeuejjejejeeieieijdjdj...,0
...,...,...
174350,"Fellow teenagers, I have been influenced by th...",0
174351,I felt like talkingSo I was just outside at 01...,1
174352,i am trying to but i just cant i have everythi...,1
174353,I just want my suffering to endAll I have hear...,1


In [18]:
val_df = pd.concat([X_val, y_val], axis = 1).reset_index(drop = True)
val_df

Unnamed: 0,text,class
0,Really down........just need some words of enc...,1
1,I’m not gonna buy a carThe day gets closer. I’...,1
2,Help me kill myself. Please. Please. Please.I’...,1
3,The only thing keeping me alive is the fact th...,1
4,"I'm not.I'm not the sweet, determined girl eve...",1
...,...,...
19368,when she says Hi! This post seems to be relate...,0
19369,I gotta go to school tmmr for orientation at 9...,0
19370,Hey lads! Can I get some help from y'all? So.....,0
19371,My birthday is this coming month and it will b...,1


In [19]:
test_df = pd.concat([X_test, y_test], axis = 1).reset_index(drop = True)
test_df

Unnamed: 0,text,class
0,I just felt myself snapI have to pretend to be...,1
1,Are you envious of something about the opposit...,0
2,"We get it. Men have problems, too. We never sa...",0
3,Happy Birthday to everyone having Birthday on ...,0
4,i cant deal with life any longer but ive tried...,1
...,...,...
48427,I just need to go for everyone's sakeI can't e...,1
48428,Hope is now goneI'm 17m and I'm considering ta...,1
48429,18f needs someone to talk toI understand if th...,1
48430,"Help mePlease someone help me, just pm me.\nI'...",1


### Creating the Dataset

In [20]:
train_dataset = datasets.Dataset.from_pandas(train_df)
train_dataset

Dataset({
    features: ['text', 'class'],
    num_rows: 174355
})

In [21]:
val_dataset = datasets.Dataset.from_pandas(val_df)
val_dataset

Dataset({
    features: ['text', 'class'],
    num_rows: 19373
})

In [22]:
test_dataset = datasets.Dataset.from_pandas(test_df)
test_dataset

Dataset({
    features: ['text', 'class'],
    num_rows: 48432
})

In [23]:
dataset = datasets.DatasetDict({
    "train" : train_dataset, 
    "val" : val_dataset, 
    "test" : test_dataset
})

dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'class'],
        num_rows: 174355
    })
    val: Dataset({
        features: ['text', 'class'],
        num_rows: 19373
    })
    test: Dataset({
        features: ['text', 'class'],
        num_rows: 48432
    })
})

## Feature Engineering

### Defining Functions

In [24]:
MAX_LENGTH = 512

In [25]:
def preprocess_function(examples, tokenizer):
    encoding = tokenizer(examples["text"], padding = "max_length", truncation = True, max_length = MAX_LENGTH)
    encoding["labels"] = torch.tensor(examples ['class'])
    return encoding

In [26]:
def create_encoded_dataset (tokenizer):
    encoded_dataset = dataset.map(preprocess_function, 
                                  batched=True, 
                                  remove_columns=dataset['train'].column_names, 
                                  fn_kwargs = {"tokenizer": tokenizer})
    
    encoded_dataset.set_format("torch")
    
    return encoded_dataset
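
`preprocess_function` pads every example to `MAX_LENGTH` (512 tokens), so a short post costs as much compute as a long one. `DataCollatorWithPadding` is imported above but not used; it pads each batch only to its longest sequence. The sketch below illustrates that idea in plain Python. `pad_batch` and the token ids are hypothetical and not part of the notebook.

```python
def pad_batch(batch, pad_id=0, max_length=512):
    """Pad (and truncate) sequences to the longest length in the batch, capped at max_length."""
    target = min(max(len(seq) for seq in batch), max_length)
    return [seq[:target] + [pad_id] * (target - min(len(seq), target)) for seq in batch]

# Two toy sequences of token ids (illustrative values only).
batch = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]
padded = pad_batch(batch)

tokens_dynamic = sum(len(seq) for seq in padded)  # tokens processed with per-batch padding
tokens_fixed = len(batch) * 512                   # tokens processed with max_length padding
print(padded, tokens_dynamic, tokens_fixed)
```

In the notebook, the corresponding change would presumably be to drop `padding = "max_length"` from the tokenizer call and pass `data_collator = DataCollatorWithPadding(tokenizer)` to `Trainer`; that variant has not been run here.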

### Tokenizing with BERT

In [27]:
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-cased', use_fast = False)

In [28]:
bert_encoded_dataset = create_encoded_dataset (bert_tokenizer)

  0%|          | 0/175 [00:00<?, ?ba/s]

  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/49 [00:00<?, ?ba/s]

### Tokenizing with RoBERTa

In [27]:
model_checkpoint_roberta = 'roberta-base'

In [28]:
roberta_tokenizer = AutoTokenizer.from_pretrained(model_checkpoint_roberta)

In [29]:
roberta_encoded_dataset = create_encoded_dataset (roberta_tokenizer)

  0%|          | 0/175 [00:00<?, ?ba/s]

  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/49 [00:00<?, ?ba/s]

## Modeling and Evaluation

### Defining Functions

In [36]:
def compute_metrics(p: EvalPrediction):
    # EvalPrediction unpacks into (predictions, label_ids); predictions are raw logits here.
    logits, labels = p
    # Predicted class = index of the largest logit.
    predictions = np.argmax(logits, axis=-1)
    
    precision_metric = load_metric("precision")
    recall_metric = load_metric("recall")
    accuracy_metric = load_metric("accuracy")
    f1_metric = load_metric("f1")
    
    # F1 is macro-averaged over both classes; precision and recall use the
    # metric defaults (binary average, positive class = 1).
    f1_macro_score = f1_metric.compute(predictions=predictions, references=labels, average="macro")
    accuracy_score = accuracy_metric.compute(predictions=predictions, references=labels)
    precision_score = precision_metric.compute(predictions=predictions, references=labels)
    recall_score = recall_metric.compute(predictions=predictions, references=labels)
    
    # Trainer prefixes each key with "eval_" when reporting results.
    results = {
        'Accuracy' : accuracy_score['accuracy'],
        'F1 Macro Score' : f1_macro_score['f1'], 
        'Precision' : precision_score["precision"],
        'Recall' : recall_score["recall"]
    }
    
    return results

### Defining the Hyperparameter Space

In [37]:
def optuna_hp_space(trial):
    return {
        "learning_rate": trial.suggest_categorical("learning_rate", [0.1, 0.01, 0.001]),
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [8, 16]),
        "num_train_epochs": trial.suggest_categorical("num_train_epochs", [2, 3, 4])
    }
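
The learning rates in this space (0.1, 0.01, 0.001) are far larger than the values usually used to fine-tune BERT-style models (around 2e-5 to 5e-5; the `TrainingArguments` default is 5e-5). That may contribute to the tuning trials in this notebook settling near 50% accuracy. A hedged sketch of an alternative space follows; the function name and the exact bounds are assumptions, not values tested in this notebook.

```python
def optuna_hp_space_small_lr(trial):
    """Hypothetical alternative search space with fine-tuning-scale learning rates."""
    return {
        # Sample the learning rate on a log scale between 1e-5 and 5e-5 (assumed bounds).
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16]
        ),
        "num_train_epochs": trial.suggest_categorical("num_train_epochs", [2, 3, 4]),
    }
```

It would be passed to `hyperparameter_search` through the same `hp_space` argument as `optuna_hp_space`.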

### BERT Model

#### Model Training 

In [34]:
model_checkpoint = 'bert-base-cased'

In [35]:
bert_model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint,
    num_labels = 2, 
    max_length = MAX_LENGTH
).to(device)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

In [36]:
training_args = TrainingArguments(output_dir = "bert_trainer", 
                                  save_steps = 20000,
                                  save_strategy = 'steps',
                                  fp16 = True,
                                  evaluation_strategy = "epoch", 
                                  resume_from_checkpoint = True)
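
These arguments leave the batch size and epoch count at the `TrainingArguments` defaults (8 per device and 3 epochs). Assuming a single GPU and no gradient accumulation, the number of optimization steps and the checkpoint steps implied by `save_steps = 20000` can be worked out as follows; the variable names are illustrative.

```python
import math

n_train = 174355   # training examples (from the split above)
batch_size = 8     # default per_device_train_batch_size
num_epochs = 3     # default num_train_epochs
save_steps = 20000

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * num_epochs

# A checkpoint is written every save_steps optimization steps.
checkpoints = list(range(save_steps, total_steps + 1, save_steps))

print(steps_per_epoch, total_steps, checkpoints)
# prints: 21795 65385 [20000, 40000, 60000]
```

The total agrees with the `Total optimization steps = 65385` line in the training log.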

In [37]:
trainer = Trainer(
    model = bert_model,
    args = training_args,
    train_dataset = bert_encoded_dataset ['train'],
    eval_dataset = bert_encoded_dataset ['val'],
    tokenizer = bert_tokenizer,
    compute_metrics = compute_metrics,
    callbacks = [TrainerCallback()]
)

Using cuda_amp half precision backend


In [38]:
trainer.train()

***** Running training *****
  Num examples = 174355
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 65385
  Number of trainable parameters = 108311810
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
[34m[1mwandb[0m: Currently logged in as: [33mfrancheska_vicente[0m ([33mtonely[0m). Use [1m`wandb login --relogin`[0m to force relogin


Epoch,Training Loss,Validation Loss,Accuracy,F1 macro score,Precision,Recall
1,0.3674,0.413518,0.88293,0.882769,0.855347,0.921319
2,0.561,1.071023,0.500748,0.333666,0.0,0.0
3,0.1946,0.162171,0.952924,0.952922,0.957012,0.948304


Saving model checkpoint to bert_trainer\checkpoint-20000
Configuration saved in bert_trainer\checkpoint-20000\config.json
Model weights saved in bert_trainer\checkpoint-20000\pytorch_model.bin
tokenizer config file saved in bert_trainer\checkpoint-20000\tokenizer_config.json
Special tokens file saved in bert_trainer\checkpoint-20000\special_tokens_map.json
***** Running Evaluation *****
  Num examples = 19373
  Batch size = 8
  precision_metric = load_metric("precision")
Saving model checkpoint to bert_trainer\checkpoint-40000
Configuration saved in bert_trainer\checkpoint-40000\config.json
Model weights saved in bert_trainer\checkpoint-40000\pytorch_model.bin
tokenizer config file saved in bert_trainer\checkpoint-40000\tokenizer_config.json
Special tokens file saved in bert_trainer\checkpoint-40000\special_tokens_map.json
***** Running Evaluation *****
  Num examples = 19373
  Batch size = 8
  _warn_prf(average, modifier, msg_start, len(result))
Saving model checkpoint to bert_trainer

TrainOutput(global_step=65385, training_loss=0.3582221678567332, metrics={'train_runtime': 20003.5531, 'train_samples_per_second': 26.149, 'train_steps_per_second': 3.269, 'total_flos': 1.376241841718784e+17, 'train_loss': 0.3582221678567332, 'epoch': 3.0})
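
The epoch 2 row above (accuracy 0.500748, macro F1 0.333666, precision 0.0, recall 0.0) is the signature of a model that predicts class 0 for every validation example. The sketch below reproduces those numbers in plain Python. The class counts (9701 negatives, 9672 positives) are an assumption inferred from the reported accuracy and the validation size of 19373, not values printed by the notebook.

```python
# Validation class counts inferred from the reported accuracy (assumption).
n_neg, n_pos = 9701, 9672

labels = [0] * n_neg + [1] * n_pos
preds = [0] * len(labels)  # a model that always predicts class 0

def f1_for_class(cls):
    """F1 score treating `cls` as the positive class."""
    tp = sum(p == cls and y == cls for p, y in zip(preds, labels))
    fp = sum(p == cls and y != cls for p, y in zip(preds, labels))
    fn = sum(p != cls and y == cls for p, y in zip(preds, labels))
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
f1_macro = (f1_for_class(0) + f1_for_class(1)) / 2

print(round(accuracy, 6), round(f1_macro, 6))
```

Precision and recall for class 1 are both zero here because the model never predicts class 1, which also explains the `_warn_prf` warning in the log.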

#### Saving BERT base model

In [39]:
path_for_models ='./saved_models/BERTv4'

In [40]:
trainer.save_model(path_for_models)
bert_tokenizer.save_pretrained(path_for_models)

Saving model checkpoint to ./saved_models/BERTv4
Configuration saved in ./saved_models/BERTv4\config.json
Model weights saved in ./saved_models/BERTv4\pytorch_model.bin
tokenizer config file saved in ./saved_models/BERTv4\tokenizer_config.json
Special tokens file saved in ./saved_models/BERTv4\special_tokens_map.json
tokenizer config file saved in ./saved_models/BERTv4\tokenizer_config.json
Special tokens file saved in ./saved_models/BERTv4\special_tokens_map.json


('./saved_models/BERTv4\\tokenizer_config.json',
 './saved_models/BERTv4\\special_tokens_map.json',
 './saved_models/BERTv4\\vocab.txt',
 './saved_models/BERTv4\\added_tokens.json')

In [None]:
trainer.evaluate(eval_dataset=bert_encoded_dataset['test'])

#### Hyperparameter Tuning

In [36]:
def model_init():
    return AutoModelForSequenceClassification.from_pretrained(model_checkpoint,
                                                              num_labels = 2, 
                                                              max_length = MAX_LENGTH)

In [37]:
training_args_tuning = TrainingArguments(output_dir = "bert_trainer", 
                                         save_steps = 20000, 
                                         bf16 = True,
                                         save_strategy = 'steps',
                                         evaluation_strategy = "epoch", 
                                         resume_from_checkpoint = True)

In [38]:
trainer_tuning = Trainer(
    model_init = model_init,
    args = training_args_tuning,
    train_dataset = bert_encoded_dataset ['train'],
    eval_dataset = bert_encoded_dataset ['val'],
    tokenizer = bert_tokenizer,
    compute_metrics = compute_metrics,
    callbacks = [TrainerCallback()]
)

loading configuration file config.json from cache at C:\Users\admin/.cache\huggingface\hub\models--bert-base-cased\snapshots\5532cc56f74641d4bb33641f5c76a55d11f846e0\config.json
Model config BertConfig {
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.26.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}

loading weights file pytorch_model.bin from cache at C:\Users\admin/.cache\huggingface\hub\models--bert-base-cased\snapshots\5532cc56f74641d4bb33641f5c76a55d11f846

In [39]:
best_trial = trainer_tuning.hyperparameter_search(
    direction = "maximize",
    backend = "optuna",
    hp_space = optuna_hp_space,
    n_trials = 3
)

[I 2023-04-07 06:36:09,875] A new study created in memory with name: no-name-8501078e-7df7-41fd-b022-6dc74c71cc6e
Trial: {'learning_rate': 0.001, 'per_device_train_batch_size': 8, 'num_train_epochs': 3}
loading configuration file config.json from cache at C:\Users\admin/.cache\huggingface\hub\models--bert-base-cased\snapshots\5532cc56f74641d4bb33641f5c76a55d11f846e0\config.json
Model config BertConfig {
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.26.1",
  "type_vocab_siz

Epoch,Training Loss,Validation Loss,Accuracy,F1 macro score,Precision,Recall
1,0.7517,0.710669,0.500748,0.333666,0.0,0.0
2,0.7202,0.695365,0.499252,0.333001,0.499252,1.0
3,0.6972,0.691395,0.500748,0.333666,0.0,0.0


Saving model checkpoint to bert_trainer\run-0\checkpoint-20000
Configuration saved in bert_trainer\run-0\checkpoint-20000\config.json
Model weights saved in bert_trainer\run-0\checkpoint-20000\pytorch_model.bin
tokenizer config file saved in bert_trainer\run-0\checkpoint-20000\tokenizer_config.json
Special tokens file saved in bert_trainer\run-0\checkpoint-20000\special_tokens_map.json
***** Running Evaluation *****
  Num examples = 19373
  Batch size = 8
  precision_metric = load_metric("precision")
  _warn_prf(average, modifier, msg_start, len(result))
Saving model checkpoint to bert_trainer\run-0\checkpoint-40000
Configuration saved in bert_trainer\run-0\checkpoint-40000\config.json
Model weights saved in bert_trainer\run-0\checkpoint-40000\pytorch_model.bin
tokenizer config file saved in bert_trainer\run-0\checkpoint-40000\tokenizer_config.json
Special tokens file saved in bert_trainer\run-0\checkpoint-40000\special_tokens_map.json
***** Running Evaluation *****
  Num examples = 19

wandb run summary:

Metric,Value
eval/Accuracy,0.50075
eval/F1 Macro Score,0.33367
eval/Precision,0.0
eval/Recall,0.0
eval/loss,0.69139
eval/runtime,221.2597
eval/samples_per_second,87.558
eval/steps_per_second,10.946
train/epoch,3.0
train/global_step,65385.0


Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"



Epoch,Training Loss,Validation Loss,Accuracy,F1 macro score,Precision,Recall
1,7.0334,11.454621,0.499252,0.333001,0.499252,1.0
2,3.3215,4.41292,0.499252,0.333001,0.499252,1.0
3,0.8058,0.695219,0.500748,0.333666,0.0,0.0


Saving model checkpoint to bert_trainer\run-1\checkpoint-20000
Configuration saved in bert_trainer\run-1\checkpoint-20000\config.json
Model weights saved in bert_trainer\run-1\checkpoint-20000\pytorch_model.bin
tokenizer config file saved in bert_trainer\run-1\checkpoint-20000\tokenizer_config.json
Special tokens file saved in bert_trainer\run-1\checkpoint-20000\special_tokens_map.json
***** Running Evaluation *****
  Num examples = 19373
  Batch size = 8
Saving model checkpoint to bert_trainer\run-1\checkpoint-40000
Configuration saved in bert_trainer\run-1\checkpoint-40000\config.json
Model weights saved in bert_trainer\run-1\checkpoint-40000\pytorch_model.bin
tokenizer config file saved in bert_trainer\run-1\checkpoint-40000\tokenizer_config.json
Special tokens file saved in bert_trainer\run-1\checkpoint-40000\special_tokens_map.json
***** Running Evaluation *****
  Num examples = 19373
  Batch size = 8
Saving model checkpoint to bert_trainer\run-1\checkpoint-60000
Configuration sav

wandb run summary:

Metric,Value
eval/Accuracy,0.50075
eval/F1 Macro Score,0.33367
eval/Precision,0.0
eval/Recall,0.0
eval/loss,0.69522
eval/runtime,219.0251
eval/samples_per_second,88.451
eval/steps_per_second,11.058
train/epoch,3.0
train/global_step,65385.0


Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"



Epoch,Training Loss,Validation Loss,Accuracy,F1 macro score,Precision,Recall
1,7.0334,11.454621,0.499252,0.333001,0.499252,1.0
2,3.3215,4.41292,0.499252,0.333001,0.499252,1.0
3,0.8058,0.695219,0.500748,0.333666,0.0,0.0


Saving model checkpoint to bert_trainer\run-2\checkpoint-20000
Configuration saved in bert_trainer\run-2\checkpoint-20000\config.json
Model weights saved in bert_trainer\run-2\checkpoint-20000\pytorch_model.bin
tokenizer config file saved in bert_trainer\run-2\checkpoint-20000\tokenizer_config.json
Special tokens file saved in bert_trainer\run-2\checkpoint-20000\special_tokens_map.json
***** Running Evaluation *****
  Num examples = 19373
  Batch size = 8
Saving model checkpoint to bert_trainer\run-2\checkpoint-40000
Configuration saved in bert_trainer\run-2\checkpoint-40000\config.json
Model weights saved in bert_trainer\run-2\checkpoint-40000\pytorch_model.bin
tokenizer config file saved in bert_trainer\run-2\checkpoint-40000\tokenizer_config.json
Special tokens file saved in bert_trainer\run-2\checkpoint-40000\special_tokens_map.json
***** Running Evaluation *****
  Num examples = 19373
  Batch size = 8
Saving model checkpoint to bert_trainer\run-2\checkpoint-60000
Configuration sav

In [40]:
best_trial

BestRun(run_id='0', objective=0.8344142826144729, hyperparameters={'learning_rate': 0.001, 'per_device_train_batch_size': 8, 'num_train_epochs': 3})
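
No `compute_objective` is passed to `hyperparameter_search`, so the library's default objective applies. As far as we understand it, that default sums every evaluation metric other than the loss, the epoch, and the speed metrics. The sketch below is a simplified re-implementation for illustration only (it omits the fallback to the loss when no other metrics exist), together with a hypothetical `f1_objective` that ranks trials by macro F1 alone.

```python
def default_style_objective(metrics):
    """Sum of all evaluation metrics except loss, epoch, and speed metrics."""
    skipped_suffixes = ("_runtime", "_per_second")
    kept = {
        k: v for k, v in metrics.items()
        if k not in ("eval_loss", "epoch") and not k.endswith(skipped_suffixes)
    }
    return sum(kept.values())

def f1_objective(metrics):
    """Hypothetical objective: rank trials by the macro F1 score only."""
    return metrics["eval_F1 Macro Score"]

# Final-epoch validation metrics of the first trial, copied from the output above.
final_epoch_run0 = {
    "eval_loss": 0.691395,
    "eval_Accuracy": 0.500748,
    "eval_F1 Macro Score": 0.333666,
    "eval_Precision": 0.0,
    "eval_Recall": 0.0,
    "eval_runtime": 221.2597,
    "eval_samples_per_second": 87.558,
    "eval_steps_per_second": 10.946,
    "epoch": 3.0,
}

print(round(default_style_objective(final_epoch_run0), 4))
print(f1_objective(final_epoch_run0))
```

The summed value matches the `objective=0.8344...` reported in `BestRun`, which suggests the "best" trial was selected on accuracy + F1 + precision + recall rather than on a single metric. A single-metric objective could be supplied with `compute_objective = f1_objective` in the `hyperparameter_search` call.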

##### Saving BERT tuned model

In [43]:
path_for_models ='./saved_models/BERTv2_tuned'
trainer_tuning.save_model(path_for_models)

Saving model checkpoint to ./saved_models/BERTv2_tuned
Configuration saved in ./saved_models/BERTv2_tuned\config.json
Model weights saved in ./saved_models/BERTv2_tuned\pytorch_model.bin
tokenizer config file saved in ./saved_models/BERTv2_tuned\tokenizer_config.json
Special tokens file saved in ./saved_models/BERTv2_tuned\special_tokens_map.json
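
`hyperparameter_search` returns the best run's hyperparameters, but to our understanding it does not retrain or reload a model with them, so the `save_model` call above likely stores whichever model the trainer held after the last trial. A common pattern is to copy the best values onto the training arguments and train once more before saving. The helper below is a hypothetical sketch of that step, demonstrated on a stand-in object rather than real `TrainingArguments`.

```python
from types import SimpleNamespace

def apply_hyperparameters(args, hyperparameters):
    """Copy each tuned value onto a TrainingArguments-like object."""
    for name, value in hyperparameters.items():
        setattr(args, name, value)
    return args

# Stand-in for trainer_tuning.args (assumption: attribute names match the hp_space keys).
demo_args = SimpleNamespace(
    learning_rate=5e-5, per_device_train_batch_size=8, num_train_epochs=3
)
# Values copied from the BestRun output above.
best_hyperparameters = {
    "learning_rate": 0.001,
    "per_device_train_batch_size": 8,
    "num_train_epochs": 3,
}
apply_hyperparameters(demo_args, best_hyperparameters)
print(demo_args)
```

In the notebook this might look like `apply_hyperparameters(trainer_tuning.args, best_trial.hyperparameters)` followed by `trainer_tuning.train()` and then `trainer_tuning.save_model(path_for_models)`; this sequence has not been run here.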


#### Evaluation

In [None]:
trainer_tuning.evaluate(eval_dataset=bert_encoded_dataset['test'])

#### Feature Importance

### RoBERTa Model

#### Model Training 

In [34]:
model_checkpoint_roberta = 'roberta-base'

In [35]:
roberta_model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint_roberta,
    num_labels = 2, 
    max_length = MAX_LENGTH
).to(device)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.dense.bias', 'roberta.pooler.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'lm_head.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.weight', 'classi

In [36]:
training_args = TrainingArguments(output_dir = "roberta_trainer", 
                                  save_steps = 20000,
                                  save_strategy = 'steps',
                                  fp16 = True,
                                  evaluation_strategy = "epoch", 
                                  resume_from_checkpoint = True)

In [37]:
roberta_model.config.max_length

512

In [38]:
trainer = Trainer(
    model = roberta_model,
    args = training_args,
    train_dataset = roberta_encoded_dataset ['train'],
    eval_dataset = roberta_encoded_dataset ['val'],
    compute_metrics = compute_metrics,
    callbacks = [TrainerCallback()]
)

Using cuda_amp half precision backend


In [39]:
trainer.train()

***** Running training *****
  Num examples = 174355
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 65385
  Number of trainable parameters = 124647170
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
[34m[1mwandb[0m: Currently logged in as: [33mfrancheska_vicente[0m ([33mtonely[0m). Use [1m`wandb login --relogin`[0m to force relogin


Epoch,Training Loss,Validation Loss,Accuracy,F1 macro score,Precision,Recall
1,0.6226,1.136392,0.500748,0.333666,0.0,0.0
2,0.6935,0.934038,0.500748,0.333666,0.0,0.0
3,0.1714,0.176044,0.954731,0.954715,0.970673,0.937655


Saving model checkpoint to roberta_trainer\checkpoint-20000
Configuration saved in roberta_trainer\checkpoint-20000\config.json
Model weights saved in roberta_trainer\checkpoint-20000\pytorch_model.bin
***** Running Evaluation *****
  Num examples = 19373
  Batch size = 8
  precision_metric = load_metric("precision")
  _warn_prf(average, modifier, msg_start, len(result))
Saving model checkpoint to roberta_trainer\checkpoint-40000
Configuration saved in roberta_trainer\checkpoint-40000\config.json
Model weights saved in roberta_trainer\checkpoint-40000\pytorch_model.bin
***** Running Evaluation *****
  Num examples = 19373
  Batch size = 8
  _warn_prf(average, modifier, msg_start, len(result))
Saving model checkpoint to roberta_trainer\checkpoint-60000
Configuration saved in roberta_trainer\checkpoint-60000\config.json
Model weights saved in roberta_trainer\checkpoint-60000\pytorch_model.bin
***** Running Evaluation *****
  Num examples = 19373
  Batch size = 8


Training completed. Do 

TrainOutput(global_step=65385, training_loss=0.5497573370633435, metrics={'train_runtime': 20653.9299, 'train_samples_per_second': 25.325, 'train_steps_per_second': 3.166, 'total_flos': 1.376241841718784e+17, 'train_loss': 0.5497573370633435, 'epoch': 3.0})

#### Saving RoBERTa base model

In [40]:
path_for_models ='./saved_models/RoBERTav2'
trainer.save_model(path_for_models)
roberta_tokenizer.save_pretrained(path_for_models)

Saving model checkpoint to ./saved_models/RoBERTav2
Configuration saved in ./saved_models/RoBERTav2\config.json
Model weights saved in ./saved_models/RoBERTav2\pytorch_model.bin


In [41]:
trainer.evaluate(eval_dataset=roberta_encoded_dataset['test'])

***** Running Evaluation *****
  Num examples = 48432
  Batch size = 8


{'eval_loss': 0.16810336709022522,
 'eval_Accuracy': 0.95701189296333,
 'eval_F1 Macro Score': 0.9569997463552207,
 'eval_Precision': 0.9713724988267417,
 'eval_Recall': 0.9416435750031019,
 'eval_runtime': 524.6124,
 'eval_samples_per_second': 92.32,
 'eval_steps_per_second': 11.54,
 'epoch': 3.0}
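
Note that `eval_Precision` and `eval_Recall` are computed for the positive class only (the metric defaults), while `eval_F1 Macro Score` averages the F1 of both classes. The sketch below derives the positive-class F1 from the reported precision and recall so the two can be compared; the variable names are illustrative.

```python
# Test-set values copied from the evaluation output above.
precision = 0.9713724988267417
recall = 0.9416435750031019
f1_macro_reported = 0.9569997463552207

# F1 for the positive class is the harmonic mean of its precision and recall.
f1_positive = 2 * precision * recall / (precision + recall)

print(f"F1 (positive class only): {f1_positive:.4f}")
print(f"F1 (macro, as reported):  {f1_macro_reported:.4f}")
```

The two values differ slightly because the macro score also includes the F1 of the negative class.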

#### Hyperparameter Tuning

In [33]:
def model_init_roberta ():
    return AutoModelForSequenceClassification.from_pretrained(model_checkpoint_roberta,
                                                              num_labels = 2, 
                                                              max_length = MAX_LENGTH)

In [34]:
training_args_tuning = TrainingArguments(output_dir = "roberta_trainer", 
                                         save_steps = 20000, 
                                         bf16 = True,
                                         save_strategy = 'steps',
                                         evaluation_strategy = "epoch", 
                                         resume_from_checkpoint = True)

In [38]:
trainer_tuning = Trainer(
    model_init = model_init_roberta,
    args = training_args_tuning,
    train_dataset = roberta_encoded_dataset ['train'],
    eval_dataset = roberta_encoded_dataset ['val'],
    tokenizer = roberta_tokenizer,
    compute_metrics = compute_metrics,
    callbacks = [TrainerCallback()]
)

loading configuration file config.json from cache at C:\Users\admin/.cache\huggingface\hub\models--roberta-base\snapshots\bc2764f8af2e92b6eb5679868df33e224075ca68\config.json
Model config RobertaConfig {
  "_name_or_path": "roberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_length": 512,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.26.1",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}

loading weights file pytorch_model.bin from cache at C:\Users\admin/.cache\huggingface\hub\models--roberta-base\snapshots\bc2764f8af

In [None]:
best_trial_roberta = trainer_tuning.hyperparameter_search(
    direction = "maximize",
    backend = "optuna",
    hp_space = optuna_hp_space,
    n_trials = 3
)

[I 2023-04-09 05:35:48,793] A new study created in memory with name: no-name-0beb4556-dcee-4bc2-9c78-0e427a331125
Trial: {'learning_rate': 0.01, 'per_device_train_batch_size': 16, 'num_train_epochs': 4}
loading configuration file config.json from cache at C:\Users\admin/.cache\huggingface\hub\models--roberta-base\snapshots\bc2764f8af2e92b6eb5679868df33e224075ca68\config.json
Model config RobertaConfig {
  "_name_or_path": "roberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_length": 512,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_versio

In [None]:
best_trial_roberta

##### Saving RoBERTa tuned model

In [None]:
path_for_models ='./saved_models/RoBERTav1_tuned'
trainer_tuning.save_model(path_for_models)
roberta_tokenizer.save_pretrained(path_for_models)

#### Evaluation

In [None]:
trainer_tuning.evaluate(eval_dataset = roberta_encoded_dataset['test'])

#### Feature Importance