# Primary sarcasm model

In [2]:
import torch
import numpy as np
import evaluate
from datasets import Dataset
from transformers import AutoTokenizer, BertForSequenceClassification
from transformers import TrainingArguments, Trainer



Following the architecture from *Context-Aware Sarcasm Detection Using BERT*:
> In our study, we used the uncased large version of BERT. This version has 24 layers and 16 attention heads. This model generates 1024 dimensional vector for each word. We used 1024 dimensional vector of the Extract layer as the representation of the text. Our classification layer consisted of a single Dense layer. This layer used the sigmoid activation layer. The classifier was trained using the Adam optimizer with a learning rate of 2e-5. The binary crossentropy loss function was used.

## Load model

We're using `BertForSequenceClassification`, which is a BERT transformer with a dense (linear) layer for classification. The transformer is not frozen, so training this model both finetunes BERT for this task and trains the classification layer. Running this cell will show the model architecture.

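The loss reported by the model shows up as negative log-likelihood (`NllLossBackward0`), but this is the same thing as cross entropy. In PyTorch, cross-entropy loss takes raw logits and applies log-softmax followed by NLL, while NLL on its own expects log-probabilities. With two labels, this is mathematically equivalent to binary cross entropy on a sigmoid over the difference of the two logits, so it stays close to the paper's setup and is fine to use. If needed, we can override the default loss function.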
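
To make the relationship between these losses concrete, here is a small dependency-free sketch (not the library's implementation). It checks that cross entropy on two raw logits is log-softmax followed by NLL, and that it matches binary cross entropy on a single sigmoid-activated logit equal to the difference of the two. The logit values and helper names are made up for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]

def nll(log_probs, label):
    """Negative log-likelihood: expects log-probabilities."""
    return -log_probs[label]

def cross_entropy(logits, label):
    """Cross entropy on raw logits: log-softmax followed by NLL."""
    return nll(log_softmax(logits), label)

def sigmoid_bce(logit, label):
    """Binary cross entropy on one sigmoid-activated logit."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

# Hypothetical scores for (not_sarcastic, sarcastic)
logits = [0.4, -1.1]
label = 1

two_logit_loss = cross_entropy(logits, label)
one_logit_loss = sigmoid_bce(logits[1] - logits[0], label)
print(two_logit_loss, one_logit_loss)   # the two values should agree
```

The two-logit head has one more output unit than the paper's single sigmoid unit, but the loss it optimizes has the same form.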

The warning when you run this cell appears only because the classification layer hasn't been trained yet, so the model shouldn't be used out of the box; it needs to be trained first.

In [3]:
pretrained_checkpoint = "google-bert/bert-base-uncased"     # switch to large later
id2label = {0: "not_sarcastic", 1: "sarcastic"} 

tokenizer = AutoTokenizer.from_pretrained(pretrained_checkpoint, use_fast=True)
model = BertForSequenceClassification.from_pretrained(pretrained_checkpoint, id2label=id2label)

model

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

## Load data

Data should be a HuggingFace dataset in order to play nicely with the trainer.

In [8]:
import json

data_dir = "../../data/sarc"
train_filename = f"{data_dir}/toy_comments-train-balanced.json"
eval_filename = f"{data_dir}/dev-comments-balanced.json"

In [5]:
def preprocess_func(data, text_key="text"):
    return tokenizer(data[text_key], truncation=True)

In [6]:
def preprocess_data(raw_data):
    data = [{"text": d["response"], "label": int(d["label"])} for d in raw_data]
    dataset = Dataset.from_list(data)

    encoded_dataset = dataset.map(preprocess_func)  

    encoded_dataset = encoded_dataset.remove_columns('text')    # training doesn't work if there are text columns
    return encoded_dataset.with_format("torch")

In [9]:
with open(train_filename) as f:
    train_data_raw = json.load(f)
    
train_dataset = preprocess_data(train_data_raw)
train_dataset[0]

Map: 100%|██████████| 20/20 [00:00<00:00, 2994.54 examples/s]


{'label': tensor(1),
 'input_ids': tensor([ 101, 4676, 2442, 2031, 1996, 3437,  102]),
 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0]),
 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1])}

In [10]:
with open(eval_filename) as f:
    eval_data_raw = json.load(f)
    
eval_dataset = preprocess_data(eval_data_raw[:10])   # use full data when ready to run
len(eval_dataset)

Map: 100%|██████████| 10/10 [00:00<00:00, 2441.25 examples/s]


10

## Train model

We're using the [HuggingFace Trainer](https://huggingface.co/docs/transformers/en/main_classes/trainer), which is optimized to work with pretrained models.

In [11]:
training_args = TrainingArguments(
    output_dir="sarc_bert",         # can do custom names
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    # push_to_hub=True,             # can push to hub instead of saving locally
    learning_rate=2e-5,             # Trainer's default optimizer is AdamW
    logging_steps=1,                # to log loss from the first epoch
    load_best_model_at_end=True,
    # metric_for_best_model="f1"    # default is loss
)   

In [12]:
f1_metric = evaluate.load("f1")
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return f1_metric.compute(predictions=predictions, references=labels)
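
As a rough illustration of what `compute_metrics` does, the sketch below reproduces the two steps without any libraries: take the argmax over each row of logits, then compute F1 for the positive class. It assumes the metric's default binary averaging with label 1 (`sarcastic`) as the positive class; the logits, labels, and helper names are hypothetical.

```python
def argmax_rows(logits):
    """Index of the largest score in each row, like np.argmax(logits, axis=1)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

def binary_f1(predictions, references, positive=1):
    """F1 for the positive class: 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical model outputs: one row of (not_sarcastic, sarcastic) logits per example
logits = [[0.2, 0.9], [1.1, -0.3], [0.1, 0.4], [-0.5, 0.6]]
labels = [1, 0, 0, 1]

predictions = argmax_rows(logits)   # third example is a false positive
f1 = binary_f1(predictions, labels)
print(predictions, f1)
```

With only a handful of evaluation examples, a single misclassification moves F1 a lot, so scores on the 10-example subset should be read loosely.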

In [13]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [14]:
trainer.train()

Epoch,Training Loss,Validation Loss,F1
1,0.6189,0.622401,0.769231
2,0.609,0.581977,0.833333
3,0.7821,0.559033,0.909091


TrainOutput(global_step=9, training_loss=0.6646246843867831, metrics={'train_runtime': 43.4319, 'train_samples_per_second': 1.381, 'train_steps_per_second': 0.207, 'total_flos': 1079166438000.0, 'train_loss': 0.6646246843867831, 'epoch': 3.0})
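
The `global_step=9` above can be sanity-checked with a little arithmetic. This sketch assumes the `TrainingArguments` defaults that were left untouched here (batch size 8 per device, 3 epochs), a single device, no gradient accumulation, and that the last partial batch is kept; the helper name is made up.

```python
import math

def optimizer_steps(num_examples, batch_size, num_epochs):
    """Total optimizer steps when the last partial batch of each epoch is kept."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs

# 20 toy training examples, assumed default batch size 8 and 3 epochs
train_steps = optimizer_steps(num_examples=20, batch_size=8, num_epochs=3)
print(train_steps)  # 9
```

When the batch size is tuned later, the step count (and how often `logging_steps=1` logs) changes accordingly.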

## Evaluation

In [15]:
trainer.evaluate()  # using provided evaluation set (dev)

{'eval_loss': 0.5590333342552185,
 'eval_f1': 0.9090909090909091,
 'eval_runtime': 0.1514,
 'eval_samples_per_second': 66.034,
 'eval_steps_per_second': 13.207,
 'epoch': 3.0}

In [16]:
trainer.evaluate(train_dataset)     # evaluated on the training set here; the test set will be passed in the same way

{'eval_loss': 0.5771743655204773,
 'eval_f1': 0.8695652173913043,
 'eval_runtime': 0.2821,
 'eval_samples_per_second': 70.894,
 'eval_steps_per_second': 10.634,
 'epoch': 3.0}

to do:
* hyperparameter tuning
* config (using huggingface)
* batching

### misc code, just trying out model

In [31]:
inputs = tokenizer("I love ML \\s", return_tensors="pt")   # backslash escaped; "\s" is an invalid escape sequence
inputs

{'input_ids': tensor([[  101,  1045,  2293, 19875,  1032,  1055,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}

In [60]:
with torch.no_grad():
    logits = model(**inputs).logits

predicted_class_id = logits.argmax().item()
print(predicted_class_id, model.config.id2label[predicted_class_id])

labels = torch.tensor([1])
output = model(**inputs, labels=labels)
print("raw output", output)
loss = output.loss
print(loss)
print(round(loss.item(), 2))

0 not_sarcastic
raw output SequenceClassifierOutput(loss=tensor(0.8168, grad_fn=<NllLossBackward0>), logits=tensor([[-0.0873, -0.3209]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
tensor(0.8168, grad_fn=<NllLossBackward0>)
0.82
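
As a quick check on the numbers above, the loss can be recomputed by hand from the printed logits. This is a plain-Python sketch rather than what the model runs internally; the logits are copied from the output above and are already rounded, so the result should only approximately match the printed 0.8168.

```python
import math

logits = [-0.0873, -0.3209]   # copied from the printed output (rounded)
label = 1                     # "sarcastic", as in the cell above

# Softmax turns the two logits into class probabilities
log_norm = math.log(sum(math.exp(z) for z in logits))
probs = [math.exp(z - log_norm) for z in logits]

# Cross-entropy loss is the negative log-probability of the true label
loss = -(logits[label] - log_norm)

print([round(p, 3) for p in probs])
print(round(loss, 2))
```

The untrained head puts slightly more probability on `not_sarcastic`, which is why the loss for the `sarcastic` label is a bit above log 2.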


### Resources
* [BERT model documentation (HuggingFace)](https://huggingface.co/docs/transformers/en/model_doc/bert)
* [Fine-tune a pretrained model (HuggingFace)](https://huggingface.co/docs/transformers/training)
* [Text classification on GLUE (Colab tutorial by HuggingFace)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb)