# ABOUT:
- In this notebook I:
    1. fine-tune **distilroberta-base** on a training set whose text has been **appended with categorical features**
    2. evaluate on the validation set
- Note:
    - **model performance appears to improve when predictive features are appended as strings to the text feature**
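
The notebook loads an already tokenized dataset, so the step that builds `amended_text` is not shown here. The idea can be sketched roughly as follows; the feature names (`has_url`, `sender_type`), the `name: value` format and the `append_features` helper are all hypothetical and only illustrate the approach:

```python
def append_features(text, features):
    """Append categorical features to the text as plain strings.

    `features` maps a (hypothetical) feature name to its categorical value.
    """
    # Render each feature as "name: value"; sort so the order is deterministic.
    suffix = " | ".join(f"{name}: {value}" for name, value in sorted(features.items()))
    return f"{text} | {suffix}" if suffix else text


example = append_features(
    "WIN a free prize now!",
    {"has_url": "yes", "sender_type": "shortcode"},  # hypothetical features
)
print(example)
```

The tokenizer then sees the features as ordinary tokens, so no change to the model architecture is needed.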

## main variables

In [2]:
model_checkpoint = "distilroberta-base"
num_labels = 2
batch_size = 30

## load_from_disk

In [1]:
from datasets import load_from_disk
tokenized_Dataset = load_from_disk(r"C:\Users\tanch\Documents\GitHub\Spam Detection (local)\data\tokenized_amended_Dataset")

In [3]:
tokenized_Dataset

DatasetDict({
    train: Dataset({
        features: ['amended_text', 'attention_mask', 'input_ids', 'label'],
        num_rows: 3900
    })
    test: Dataset({
        features: ['amended_text', 'attention_mask', 'input_ids', 'label'],
        num_rows: 1672
    })
})

## metrics

In [4]:
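import numpy as np
from sklearn.metrics import f1_score, confusion_matrix

def compute_f1(output):
    # output.predictions holds the logits; argmax over axis 1 gives the predicted class
    f1 = f1_score(output.label_ids, np.argmax(output.predictions, axis=1))
    return {"f1": f1}

def return_confusion_matrix(output):
    # rows are true labels, columns are predicted labels
    return confusion_matrix(output.label_ids, np.argmax(output.predictions, axis=1))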

## AutoModelForSequenceClassification, TrainingArguments, Trainer

In [9]:
eval_steps = int(len(tokenized_Dataset['train'])/batch_size/5)
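
With the values used in this notebook, the expression above evaluates five times over the single training epoch. A quick check of the arithmetic, with the 3900 training rows hard-coded from the dataset summary and `evals_per_epoch` introduced here only as a name for the constant 5:

```python
import math

num_train_rows = 3900  # from the DatasetDict summary above
batch_size = 30
evals_per_epoch = 5    # the constant 5 in the eval_steps expression

steps_per_epoch = math.ceil(num_train_rows / batch_size)  # optimizer steps in one epoch
eval_steps = int(num_train_rows / batch_size / evals_per_epoch)

print(steps_per_epoch, eval_steps)  # 130 26
```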

In [10]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint,
                                                           num_labels=num_labels)
model.to("cuda")

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight', 'roberta.pooler.dense.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['classifier.out_proj.bias

RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerN

In [11]:
args = TrainingArguments(
    "test-glue",
    evaluation_strategy="steps",
    logging_strategy="steps",
    save_strategy="no",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=1,
    weight_decay=0.01,
    # the last checkpoint is not necessarily the best one, so reload the
    # checkpoint with the best `metric_for_best_model` once training ends
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    eval_steps=eval_steps,
)

In [12]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_Dataset['train'],           
    eval_dataset=tokenized_Dataset['test'],
    compute_metrics = compute_f1
)

## fine tuning and evaluation

In [13]:
import mlflow
mlflow.end_run()
trainer.train()

The following columns in the training set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: amended_text.
***** Running training *****
  Num examples = 3900
  Num Epochs = 1
  Instantaneous batch size per device = 30
  Total train batch size (w. parallel, distributed & accumulation) = 30
  Gradient Accumulation steps = 1
  Total optimization steps = 130


| Step | Training Loss | Validation Loss | F1 |
|---|---|---|---|
| 26 | No log | 0.136448 | 0.39576 |
| 52 | No log | 0.053453 | 0.95614 |
| 78 | No log | 0.044983 | 0.95614 |
| 104 | No log | 0.042142 | 0.970655 |
| 130 | No log | 0.037545 | 0.970787 |


The following columns in the evaluation set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: amended_text.
***** Running Evaluation *****
  Num examples = 1672
  Batch size = 30


Saving model checkpoint to test-glue\checkpoint-26
Configuration saved in test-glue\checkpoint-26\config.json
Model weights saved in test-glue\checkpoint-26\pytorch_model.bin
Saving model checkpoint to test-glue\checkpoint-52
Configuration saved in test-glue\checkpoint-52\config.json
Model weights saved in test-glue\checkpoint-52\pytorch_model.bin
Saving model checkpoint to test-glue\checkpoint-78
Configuration saved in test-glue\checkpoint-78\config.json
Model weights saved in test-glue\checkpoint-78\pytorch_model.bin
Saving model checkpoint to test-glue\checkpoint-104
Configuration saved in test-glue\checkpoint-104\config.json
Model weights saved in test-glue\checkpoint-104\pytorch_model.bin
Saving model checkpoint to test-glue\checkpoint-130
Configuration saved in test-glue\checkpoint-130\config.json
Model weights saved in test-glue\checkpoint-130\pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from test-glue\checkpoint-130 (score: 0.9707865168539326).


TrainOutput(global_step=130, training_loss=0.12349849114051233, metrics={'train_runtime': 124.6472, 'train_samples_per_second': 31.288, 'train_steps_per_second': 1.043, 'total_flos': 192160654920000.0, 'train_loss': 0.12349849114051233, 'epoch': 1.0})

## confusion_matrix

In [14]:
output = trainer.predict(tokenized_Dataset['test'])
return_confusion_matrix(output)

The following columns in the test set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: amended_text.
***** Running Prediction *****
  Num examples = 1672
  Batch size = 30



array([[1443,    2],
       [  11,  216]], dtype=int64)
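
Reading the matrix with scikit-learn's convention (rows are true labels, columns are predicted labels), and assuming label 1 is the positive (spam) class, precision, recall and F1 can be recomputed by hand as a sanity check:

```python
# Counts copied from the confusion matrix above.
tn, fp = 1443, 2
fn, tp = 11, 216

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of true positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

The resulting F1 agrees with the score logged for `checkpoint-130`; the 11 false negatives, rather than the 2 false positives, account for most of the remaining error.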