# Main Notebook for A4

This notebook is adjusted from https://github.com/huggingface/notebooks/blob/main/examples/token_classification.ipynb

Modified contents are:
- Removed remote contents (logging in huggingface, etc.)
- tokenize_and_align_labels(): modified for the dataset and resolved weird bug
- model parameter: num_labels=2 (True or False, in or not inside a negation scope)
- metric: used self-made metric loader script ('../scripts/span_metric.py') for this task
- compute_metrics(): adjusted for this task and datasets

In [1]:
import transformers
import pandas as pd

In [42]:
task = "negation_scope"
model_checkpoint = "bert-base-uncased" # bert-base-uncased for better percision, distilbert-base-uncased for faster run
batch_size = 16

## Loading the dataset
The datasets are pre-generated huggingface dataset classes.

In [3]:
import datasets

In [4]:
trainds = datasets.load_from_disk('../data/hf_dataset/trainds')
devds = datasets.load_from_disk('../data/hf_dataset/devds')
testds = datasets.load_from_disk('../data/hf_dataset/testFds')

In [5]:
trainds[1]

{'id': 1,
 'negation_scope_tags': [0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0],
 'tokens': ['Mr.',
  'Sherlock',
  'Holmes',
  ',',
  'who',
  'was',
  'usually',
  'very',
  'late',
  'in',
  'the',
  'mornings',
  ',',
  'save',
  'upon',
  'those',
  'not',
  '[NEG] infrequent',
  'occasions',
  'when',
  'he',
  'was',
  'up',
  'all',
  'night',
  ',',
  'was',
  'seated',
  'at',
  'the',
  'breakfast',
  'table',
  '.']}

## Preprocess
Using the pre-trained AutoTokenizer with the given model to tokenize. Added special marks (-100) to the beginning and ending of sentences.

In [6]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

In [7]:
def tokenize_and_align_labels(examples, label_all_tokens=True):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    all_word_ids = []
    for i, label in enumerate(examples[f"{task}_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # Special tokens have a word id that is None. We set the label to -100 so they are automatically
            # ignored in the loss function.
            if word_idx is None:
                label_ids.append(-100)
            # We set the label for the first token of each word.
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            # For the other tokens in a word, we set the label to either the current label or -100, depending on
            # the label_all_tokens flag.
            else:
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx

        labels.append(label_ids)
        all_word_ids.append(word_ids)
    
    
    tokenized_inputs['word_ids'] = all_word_ids
    tokenized_inputs["labels"] = labels
    return tokenized_inputs

In [10]:
label_all_tokens = True
tokenized_train = trainds.map(tokenize_and_align_labels, batched=True, batch_size=1)
tokenized_dev = devds.map(tokenize_and_align_labels, batched=True)
tokenized_test = testds.map(tokenize_and_align_labels, batched=True)

Map:   0%|          | 0/3779 [00:00<?, ? examples/s]

Map:   0%|          | 0/815 [00:00<?, ? examples/s]

Map:   0%|          | 0/1116 [00:00<?, ? examples/s]

In [11]:

tokenized_train = [{k : v for k,v in x.items() if k !=f'{task}_tags'} for x in tokenized_train ]
tokenized_dev = [{k : v for k,v in x.items() if k !=f'{task}_tags'} for x in tokenized_dev ]


In [12]:
print(tokenized_train[1]['word_ids'])

[None, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 17, 17, 17, 17, 17, 17, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, None]


## Load model and metric

In [13]:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

2024-02-04 13:48:10.121638: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-02-04 13:48:10.144275: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-04 13:48:10.144296: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-04 13:48:10.144976: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-02-04 13:48:10.149430: I tensorflow/core/platform/cpu_feature_guar

In [14]:
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=2) # 2 labels are True/False for in negation scope. due to the conversion above they are 0/1.

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [43]:
model_name = model_checkpoint.split("/")[-1]
args = TrainingArguments(
    f"{model_name}-finetuned-{task}",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    push_to_hub=False,
)

In [16]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer)

In [17]:
from datasets import load_metric
# metric = load_metric("seqeval")
metric = load_metric('../scripts/span_metric.py',trust_remote_code=True) # A self-defined metric class calculating both token overlap and span agreement

  metric = load_metric('../scripts/span_metric.py',trust_remote_code=True) # A self-defined metric class calculating both token overlap and span agreement


In [18]:
label_list = [0,1] # IS IN NEGATION SCOPE OR NOT

## Train and evaluate

In [19]:
import numpy as np

In [20]:
def remove_ignored_index(predictions,labels):
    actual_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    actual_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    return actual_predictions, actual_labels

In [21]:
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2) # Most possible label

    # Remove ignored index (special tokens)
    actual_predictions, actual_labels = remove_ignored_index(predictions,labels)
    
    results = metric.compute(predictions=actual_predictions, references=actual_labels)
    return {
        #"accuracy": results["overall_accuracy"],
        "token_precision":results["token_precision"], "token_recall":results["token_recall"], "token_f1":results["token_f1"],
        "span_precision":results["span_precision"], "span_recall":results["span_recall"], "span_f1":results["span_f1"]
    }

In [44]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_dev,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [45]:
trainer.train()

Epoch,Training Loss,Validation Loss,Token Precision,Token Recall,Token F1,Span Precision,Span Recall,Span F1
1,No log,0.073146,0.928907,0.866166,0.89644,0.60119,0.60119,0.60119
2,No log,0.093075,0.932432,0.863039,0.896395,0.654762,0.654762,0.654762
3,0.008200,0.074949,0.935142,0.883677,0.908682,0.668639,0.668639,0.668639
4,0.008200,0.094302,0.961672,0.863039,0.90969,0.690476,0.690476,0.690476
5,0.003400,0.091456,0.946441,0.873046,0.908263,0.684524,0.684524,0.684524


Checkpoint destination directory bert-base-uncased-finetuned-negation_scope/checkpoint-500 already exists and is non-empty.Saving will proceed but saved results may be invalid.


TrainOutput(global_step=1185, training_loss=0.005261393413262025, metrics={'train_runtime': 25.691, 'train_samples_per_second': 735.471, 'train_steps_per_second': 46.125, 'total_flos': 245123843934060.0, 'train_loss': 0.005261393413262025, 'epoch': 5.0})

In [46]:
predictions, labels, _ = trainer.predict(tokenized_test)
predictions = np.argmax(predictions, axis=2)

In [47]:
# Remove ignored index (special tokens)

def detokenize(predictions, tokenized_test):
    actual_predictions, actual_labels = [], []
    for p, t in zip(predictions, tokenized_test):
        preds = []
        trues = []
        pred = []
        #print(len(p), len(t['word_ids']))
        word_idx = 0
        for i, (token_pred, id) in enumerate(zip(p, t['word_ids'])):
            if id is None:
                continue
            if id != word_idx:
                preds.append(int(any(pred)))
                pred = [token_pred]
                trues.append(t['negation_scope_tags'][word_idx])
                word_idx = id
            else:
                pred.append(token_pred)

        #print(len(trues), len(preds))
        actual_labels.append(trues)
        actual_predictions.append(preds)
    return actual_labels, actual_predictions

In [48]:
actual_labels, actual_predictions = detokenize(predictions, tokenized_test)
results = metric.compute(predictions=actual_predictions, references=actual_labels)
results

{'token_precision': 0.9580801944106926,
 'token_recall': 0.8636363636363636,
 'token_f1': 0.908410138248848,
 'span_precision': 0.704,
 'span_recall': 0.704,
 'span_f1': 0.704}

## Model comparison

In [37]:
model_checkpoint = "distilbert-base-uncased" # bert-base-uncased for better percision, distilbert-base-uncased for faster run

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

label_all_tokens = True
tokenized_train = trainds.map(tokenize_and_align_labels, batched=True)
tokenized_dev = devds.map(tokenize_and_align_labels, batched=True)
tokenized_test = testds.map(tokenize_and_align_labels, batched=True)

model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=2) # 2 labels are True/False for in negation scope. due to the conversion above they are 0/1.

model_name = model_checkpoint.split("/")[-1]

data_collator = DataCollatorForTokenClassification(tokenizer)

trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_dev,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()

Map:   0%|          | 0/815 [00:00<?, ? examples/s]

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Token Precision,Token Recall,Token F1,Span Precision,Span Recall,Span F1
1,No log,0.066309,0.917533,0.828018,0.87048,0.538462,0.538462,0.538462
2,No log,0.062737,0.932111,0.858662,0.89388,0.619048,0.619048,0.619048
3,0.082900,0.055281,0.924901,0.878049,0.900866,0.656805,0.656805,0.656805
4,0.082900,0.065736,0.943128,0.871169,0.905722,0.672619,0.672619,0.672619
5,0.015300,0.069275,0.944595,0.874296,0.908087,0.678571,0.678571,0.678571


Checkpoint destination directory distilbert-base-uncased-finetuned-negation_scope/checkpoint-500 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory distilbert-base-uncased-finetuned-negation_scope/checkpoint-1000 already exists and is non-empty.Saving will proceed but saved results may be invalid.


TrainOutput(global_step=1185, training_loss=0.04255565289203628, metrics={'train_runtime': 25.9581, 'train_samples_per_second': 727.903, 'train_steps_per_second': 45.65, 'total_flos': 245123843934060.0, 'train_loss': 0.04255565289203628, 'epoch': 5.0})

In [38]:
trainer.evaluate()

{'eval_loss': 0.06927543878555298,
 'eval_token_precision': 0.9445945945945946,
 'eval_token_recall': 0.874296435272045,
 'eval_token_f1': 0.9080870412471582,
 'eval_span_precision': 0.6785714285714286,
 'eval_span_recall': 0.6785714285714286,
 'eval_span_f1': 0.6785714285714286,
 'eval_runtime': 0.2839,
 'eval_samples_per_second': 2870.77,
 'eval_steps_per_second': 179.643,
 'epoch': 5.0}

In [41]:
# trainer.save_model('../model/distilbert')
# It's too large (17 GB)

In [39]:
predictions, labels, _ = trainer.predict(tokenized_test)
predictions = np.argmax(predictions, axis=2)

# Remove ignored index (special tokens)
actual_labels, actual_predictions = detokenize(predictions, tokenized_test)

results = metric.compute(predictions=actual_predictions, references=actual_labels)
results

{'token_precision': 0.9546279491833031,
 'token_recall': 0.864184008762322,
 'token_f1': 0.9071572290888186,
 'span_precision': 0.688,
 'span_recall': 0.688,
 'span_f1': 0.688}

In [30]:
result_df = pd.DataFrame(columns=['Sentence', 'Labels', 'Prediction'])
for p, t, ds in zip(actual_predictions, actual_labels, testds):
    if p != t:
        result_df.loc[len(result_df)] = [ds['tokens'], t, p]

result_df.to_csv('../results/errors.csv')

## Error Analysis

In [57]:
old_test = datasets.load_from_disk('../data/hf_dataset/old_test')

In [69]:
result_df = pd.DataFrame(columns=['Sentence_retokenized', 'Labels', 'Prediction', 'Sentence', 'Cue'])
for p, t, ds, ds2 in zip(actual_predictions, actual_labels, testds, old_test):
    if p != t:
        result_df.loc[len(result_df)] = [ds['tokens'], t, p, 
                                         ds2['tokens'], ds2['is_neg']] # For finding the negation cue

In [49]:
import sys
from termcolor import colored, cprint

In [50]:
def color_sentences(df):
    '''
    Color the mispredicted sentences.
    The negation cue cannot be directly obtained due to the re-tokenization and thus is provided by the context.
    '''
    print("Labels:")
    cprint("The Cue", "black", end=" ",attrs=["underline","reverse"])
    cprint("True Scope", "black", "on_green", end=" ")
    cprint("Correct Prediction", "black", "on_yellow", end=" ")
    cprint("False Positive", "black", "on_red", end=" ",attrs=["blink"])
    cprint("False Negative", "white", "on_light_red", end=" ",attrs=["blink"])
    print("\nNegation cue is the middle word of Cue context.")
    print("\n\n")
    
    for row in range(len(df)):
        print(row)
        
        # Coloring the negation cue and print the context
        c = df['Cue'][row]
        so = df['Sentence'][row]
        cue_phrase, cue_len, ending = '', 0, 0
        for i in range(len(c)): 
            if c[i] == True:
                cue_phrase = cue_phrase + so[i] + ' '
                cue_len +=1
                ending = i
        print("Cue context: ...", so[ending-(cue_len)], end=" ")
        cprint(cue_phrase[:-1], "black", end=" ",attrs=["underline","reverse"])
        print(so[ending+1], "...")
        
        # Coloring the scope
        s = df['Sentence_retokenized'][row]
        t = df['Labels'][row]
        p = df['Prediction'][row]
        for i in range(len(s)-2): 
            if (t[i] == False):
                cprint(s[i+1], "black", "on_green", end=" ")
            if (t[i] == True):
                cprint(s[i+1], "black", end=" ")
        print()
        for i in range(len(s)-2):
            if (p[i] == False) & (p[i]==t[i]):
                cprint(s[i+1], "black", "on_yellow", end=" ")
            if (p[i] == True) & (p[i]==t[i]):
                cprint(s[i+1], "black", end=" ")
            if (p[i] == False) & (p[i] != t[i]):
                cprint(s[i+1], "black", "on_red", end=" ",attrs=["blink"])
            if (p[i] == True) & (p[i] != t[i]):
                cprint(s[i+1], "white", "on_light_red", end=" ",attrs=["blink"])
        print("\n")

In [71]:
color_sentences(result_df)

Labels:
[7m[4m[30mThe Cue[0m [42m[30mTrue Scope[0m [43m[30mCorrect Prediction[0m [5m[41m[30mFalse Positive[0m [5m[101m[97mFalse Negative[0m 
Negation cue is the middle word of Cue context.



0
Cue context: ... can [7m[4m[30mnot[0m see ...
[30m`[0m [30m`[0m [30mwell[0m [30m,[0m [30mmrs[0m [30m.[0m [30mwarren[0m [30m,[0m [42m[30mi[0m [42m[30mcan[0m [30mnot[0m [42m[30msee[0m [42m[30mthat[0m [42m[30myou[0m [42m[30mhave[0m [42m[30many[0m [42m[30mparticular[0m [42m[30mcause[0m [42m[30mfor[0m [42m[30mune[0m [42m[30m##asi[0m [42m[30m##ness[0m [30m,[0m [30mnor[0m [30mdo[0m [30mi[0m [30munderstand[0m [30mwhy[0m [30mi[0m [30m,[0m [30mwhose[0m [30mtime[0m [30mis[0m [30mof[0m [30msome[0m [30mvalue[0m [30m,[0m [30mshould[0m [30minterfere[0m [30min[0m [30mthe[0m [30mmatter[0m [30m.[0m 
[30m`[0m [30m`[0m [30mwell[0m [30m,[0m [30mmrs[0m [30m.[0m [30mwarren[0m [30m,[0

### Manual error classification for first half:
- Me: Sentence have multiple negations and the prediction exceeded to other scopes.
- S: Subjunctive ("if not", "why not", etc.).
- I: Imperative ("can you not?").
- P: Phrase negation cue ("no more", "neither nor", ...).
- C: Clauses. "if", "and", "that" are usually FN while transitional conjunctions ("but", "even if") are usually mixed
- Punc: Punctuation divides the scope.
- Pron: Pronouns.
- RT: Problems caused by re-tokenization.
- W: Word with negation meaning. Can be prefix (un-) or suffix (-less).
- O: other.

0: O"should"
1: Me
2: Me
3: C_and
4: C_if
5: S
6: P(no more), C_if
7: C_that, Punc_,
8: Punc_, Punc_.
9: C_expect
10: ? "nothing more"
11: W_pre(unusual)
12: I(why not)
13: I, Pron
14: Punc__
15: Pron
16: W_pre(unusual), Me
17: C_if
18: O"reason"
19: C_T(but)
20: W_pre(absence)
21: W_pre(unusual)
22: C_and
23: Me, W_suf(without)
24: Punc_'
25: Me, W_pre(irrelevant)
26: RT
27: Punc_, C_T(even if)
28: C_if
29: P(no more)
30: Punc_, C_and
31: S(would have)
32: Me, C_and, Punc_,
33: C_what
34: Me
35: Me
36: Punc_,
37: P(neither nor nor)
38: RT
39: Punc_, *
40: O,*
41: W_suf(breathless)
42: O
43: O
44: Me
45: Me
46: C_that
47: W_pre(unoccupied)
48: Punc_, C_but
49: W_suf(carpetless)
50: O
51: S
52: Me
53: Me
54: RT
55: W_pre(unconventional)
56: P(neither nor)
57: RT
58: Me, RT
59: Me, C_which
60: W_pre(dislike)
61: W_pre(dislike)
62: O
63: C_once
64: Me, W_suf(senseless)
65: P(never more)
66: RT
67: C_but, C_and, Punc_,
68: RT

Me: 15
C: 19 (5and, 4if, 3but (even if, except, what, which, that, once
S: 3
I: 1
RT: 7
Punc: 12
Pron: 2
W: 13 (9pre+4suf)
O: 7
P: 5