# ABOUT:
- In this notebook I:
    - fine-tuned **distilroberta** on the original training set **together with the paraphrased spam emails**
- Insight:
    - fine-tuning on the **paraphrased spam emails appears to hurt performance**


### main variables

In [7]:
model_checkpoint = "distilroberta-base"
num_labels = 2
batch_size = 30

### import data

In [3]:
from datasets import load_from_disk
tokenized_positive_examples = load_from_disk(r"C:\Users\tanch\Documents\GitHub\Spam Detection (local)\data\tokenized_paraphrased_positive_examples")
tokenized_Dataset = load_from_disk(r"C:\Users\tanch\Documents\GitHub\Spam Detection (local)\data\tokenized_spam_Dataset")

In [31]:
import pandas as pd
from datasets import Dataset
tokenized_Dataset['train'] = Dataset.from_pandas(pd.concat([pd.DataFrame(tokenized_positive_examples),pd.DataFrame(tokenized_Dataset['train'])]))

In [41]:
tokenized_Dataset['train'] = tokenized_Dataset['train'].shuffle()

In [42]:
tokenized_Dataset

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'input_ids', 'label', '__index_level_0__'],
        num_rows: 4647
    })
    test: Dataset({
        features: ['attention_mask', 'input_ids', 'label'],
        num_rows: 1672
    })
})
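
The `__index_level_0__` column in the train split above is a side effect of `pd.concat`, which keeps each frame's original row labels; the resulting non-range index is then stored as an extra column by `Dataset.from_pandas`. A minimal sketch of the effect, assuming only pandas and using made-up toy frames in place of the real tokenized data:

```python
import pandas as pd

# Toy stand-ins for the two tokenized splits (values are illustrative only).
paraphrased = pd.DataFrame({"input_ids": [[0, 5, 2], [0, 7, 2]], "label": [1, 1]})
original = pd.DataFrame({"input_ids": [[0, 9, 2], [0, 3, 2], [0, 4, 2]], "label": [0, 1, 0]})

# Default concat keeps the per-frame row labels, so the index repeats.
kept = pd.concat([paraphrased, original])
print(list(kept.index))      # [0, 1, 0, 1, 2]

# ignore_index=True rebuilds a clean 0..n-1 index instead.
merged = pd.concat([paraphrased, original], ignore_index=True)
print(list(merged.index))    # [0, 1, 2, 3, 4]
```

Passing the `ignore_index=True` result to `Dataset.from_pandas` (or calling it with `preserve_index=False`) should avoid the extra column, and with it the "ignored column" message the Trainer prints during training.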

## metrics

In [33]:
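import numpy as np
from sklearn.metrics import f1_score, confusion_matrix

def compute_f1(output):
    # output.predictions holds logits; argmax picks the predicted class id
    f1 = f1_score(output.label_ids, np.argmax(output.predictions, axis=1))
    return {"f1": f1}

def return_confusion_matrix(output):
    # rows are true labels, columns are predicted labels
    return confusion_matrix(output.label_ids, np.argmax(output.predictions, axis=1))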
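
To make the metric concrete, here is a small sketch that runs the same F1 computation on toy logits. The `SimpleNamespace` object is a hypothetical stand-in for the prediction object the Trainer passes to `compute_metrics`, and the logits and labels are invented for illustration.

```python
from types import SimpleNamespace

import numpy as np
from sklearn.metrics import f1_score

def compute_f1(output):
    # Logits -> predicted class ids, then F1 of the positive class (label 1).
    preds = np.argmax(output.predictions, axis=1)
    return {"f1": f1_score(output.label_ids, preds)}

# Hypothetical stand-in for the Trainer's prediction object (toy values).
toy = SimpleNamespace(
    predictions=np.array([[2.0, -1.0], [0.1, 1.5], [0.9, 0.2], [1.2, 0.3], [-0.5, 0.8]]),
    label_ids=np.array([0, 1, 1, 0, 1]),
)
result = compute_f1(toy)
print(result)
```

Here two of the three positive rows are caught with no false positives, so F1 is 2·2 / (2·2 + 0 + 1) = 0.8.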

## AutoModelForSequenceClassification, TrainingArguments, Trainer

In [39]:
eval_steps = int(len(tokenized_Dataset['train'])/batch_size/5)
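
As a quick sanity check of this expression, the arithmetic can be worked through with the train split size shown earlier (4647 rows). This is a sketch with the numbers hard-coded; it does not touch the real dataset.

```python
import math

num_train_rows = 4647   # train split size shown in the DatasetDict output
batch_size = 30

# One optimizer step per batch; the last partial batch still counts.
steps_per_epoch = math.ceil(num_train_rows / batch_size)

# Same expression as the notebook: evaluate roughly five times per epoch.
eval_steps = int(num_train_rows / batch_size / 5)

print(steps_per_epoch, eval_steps)   # 155 30
```

The 155 steps match the "Total optimization steps" line in the training log. The log also shows an evaluation every 6 steps rather than every 30, so `eval_steps` was presumably computed from a different state of the notebook than the one shown here; that is an inference from the log, not something the notebook states.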

In [40]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
model  = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, 
                                                            num_labels=num_labels)
model.to("cuda")

loading configuration file https://huggingface.co/distilroberta-base/resolve/main/config.json from cache at C:\Users\tanch/.cache\huggingface\transformers\42d6b7c87cbac84fcdf35aa69504a5ccfca878fcee2a1a9b9ff7a3d1297f9094.aa95727ac70adfa1aaf5c88bea30a4f5e50869c68e68bce96ef1ec41b5facf46
Model config RobertaConfig {
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.8.2",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}

loading weights file https://huggingface.co/distilroberta-base/resolve/main

RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerN

In [44]:
args = TrainingArguments(
    "test-glue",
    evaluation_strategy="steps",
    logging_strategy="steps",
    save_strategy="no",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=1,
    weight_decay=0.01,
    # The final model is not necessarily the best one, so reload the checkpoint
    # with the highest eval F1 once training finishes.
    # Newer transformers releases reject this flag together with save_strategy="no";
    # the 4.8.2 run logged here still saved a checkpoint at every evaluation.
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    eval_steps=eval_steps,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [43]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_Dataset['train'],           
    eval_dataset=tokenized_Dataset['test'],
    compute_metrics = compute_f1
)

## fine tuning and evaluation

In [45]:
import mlflow
mlflow.end_run()
trainer.train()

The following columns in the training set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: __index_level_0__.
***** Running training *****
  Num examples = 4647
  Num Epochs = 1
  Instantaneous batch size per device = 30
  Total train batch size (w. parallel, distributed & accumulation) = 30
  Gradient Accumulation steps = 1
  Total optimization steps = 155


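| Step | Training Loss | Validation Loss | F1 |
|------|---------------|-----------------|----------|
| 6 | No log | 0.498119 | 0.0 |
| 12 | No log | 0.352188 | 0.0 |
| 18 | No log | 0.189204 | 0.593333 |
| 24 | No log | 0.114986 | 0.893424 |
| 30 | No log | 0.066955 | 0.92823 |
| 36 | No log | 0.072799 | 0.924485 |
| 42 | No log | 0.171097 | 0.823062 |
| 48 | No log | 0.066243 | 0.932715 |
| 54 | No log | 0.089128 | 0.890625 |
| 60 | No log | 0.054191 | 0.936893 |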

***** Running Evaluation *****
  Num examples = 1672
  Batch size = 30


Saving model checkpoint to test-glue\checkpoint-6
Configuration saved in test-glue\checkpoint-6\config.json
Model weights saved in test-glue\checkpoint-6\pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from test-glue\checkpoint-60 (score: 0.9368932038834952).


TrainOutput(global_step=155, training_loss=0.1792130131875315, metrics={'train_runtime': 298.7964, 'train_samples_per_second': 15.552, 'train_steps_per_second': 0.519, 'total_flos': 228966811131600.0, 'train_loss': 0.1792130131875315, 'epoch': 1.0})

## confusion_matrix

In [46]:
output = trainer.predict(tokenized_Dataset['test'])
return_confusion_matrix(output)

***** Running Prediction *****
  Num examples = 1672
  Batch size = 30


array([[1453,    8],
       [  18,  193]], dtype=int64)
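
The confusion matrix can be turned back into precision, recall, and F1 to cross-check the score reported for the best checkpoint. A minimal sketch with the matrix values copied in by hand, assuming scikit-learn's convention that rows are true labels and columns are predictions:

```python
import numpy as np

# Confusion matrix printed above: rows are true labels, columns are predictions.
cm = np.array([[1453, 8],
               [18, 193]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of true positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
# precision=0.9602 recall=0.9147 f1=0.9369
```

The resulting F1 agrees with the 0.9368932 score logged for `checkpoint-60`. The model misses 18 positive examples while raising only 8 false alarms, so recall is the weaker of the two components.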