# Main Notebook for A4

This notebook is adapted from https://github.com/huggingface/notebooks/blob/main/examples/token_classification.ipynb

Changes from the original:
- Removed the parts that need remote access (logging in to Hugging Face, etc.)
- tokenize_and_align_labels(): adapted to this dataset, and worked around a label conversion error by mapping the string labels to 0/1 explicitly
- model parameter: num_labels=2 (True or False: whether a token is inside a negation scope)
- metric: uses a self-written metric loader script ('../scripts/span_metric.py') for this task
- compute_metrics(): adjusted for this task and these datasets

In [1]:
import transformers
import pandas as pd

In [2]:
task = "scope"
model_checkpoint = "bert-base-uncased" # bert-base-uncased for better precision, distilbert-base-uncased for a faster run
batch_size = 16

## Loading the dataset
The train, dev, and test splits are pre-generated Hugging Face datasets that were saved to disk.

In [3]:
import datasets

In [4]:
trainds = datasets.load_from_disk('../data/hf_dataset/trainds')
devds = datasets.load_from_disk('../data/hf_dataset/devds')
testds = datasets.load_from_disk('../data/hf_dataset/testFds')

### Datasets statistics

In [82]:
def calculate_dataset_statistics(ds):
    affix, word, phrase, none = 0,0,0,0
    sentence_num = len(ds['neg_type'])
    for i in range(sentence_num):
        if ds['neg_type'][i][0] == '':
            none += 1
        if ds['neg_type'][i][0] == 'NEG':
            word += 1
        if ds['neg_type'][i][0] == 'AFFIX':
            affix += 1
        if ds['neg_type'][i][0] == 'MULTI':
            phrase += 1
    neg_num = sentence_num-none
    print("Num_Sentence:", sentence_num, "; Num_Negation:", neg_num, "(", '{:04.2f}'.format((sentence_num-none)/sentence_num*100), "%);\n",
          "Num_Word:", word, "(", '{:04.2f}'.format(word/neg_num*100), "%)",
          "; Num_Affix:", affix, "(", '{:04.2f}'.format(affix/neg_num*100), "%)",
          "; Num_Phrase:", phrase, "(", '{:04.2f}'.format(phrase/neg_num*100), "%)",)

In [84]:
calculate_dataset_statistics(trainds[:])
calculate_dataset_statistics(devds[:])
calculate_dataset_statistics(testds[:])

Num_Sentence: 3779 ; Num_Negation: 983 ( 26.01 %);
 Num_Word: 813 ( 82.71 %) ; Num_Affix: 159 ( 16.17 %) ; Num_Phrase: 11 ( 1.12 %)
Num_Sentence: 815 ; Num_Negation: 173 ( 21.23 %);
 Num_Word: 135 ( 78.03 %) ; Num_Affix: 33 ( 19.08 %) ; Num_Phrase: 5 ( 2.89 %)
Num_Sentence: 1116 ; Num_Negation: 264 ( 23.66 %);
 Num_Word: 219 ( 82.95 %) ; Num_Affix: 36 ( 13.64 %) ; Num_Phrase: 9 ( 3.41 %)
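The same tally can be written more compactly with `collections.Counter`. The following is a minimal sketch on toy data, assuming (as in the function above) that each sentence's `neg_type` is a non-empty list whose first element is `''`, `'NEG'`, `'AFFIX'`, or `'MULTI'`. The helper name `neg_type_shares` and the toy input are hypothetical.

```python
from collections import Counter

def neg_type_shares(neg_types):
    """Return (number of negated sentences, {type: percentage of negated sentences})."""
    # Count sentences by the type of their first negation cue ('' means no negation).
    counts = Counter(types[0] for types in neg_types)
    negated = sum(n for t, n in counts.items() if t != '')
    if negated == 0:
        # Avoid dividing by zero when a split has no negated sentence.
        return 0, {}
    shares = {t: round(100 * n / negated, 2) for t, n in counts.items() if t != ''}
    return negated, shares

# Toy input: six sentences, four of them negated.
toy = [['NEG'], [''], ['AFFIX'], ['NEG', 'NEG'], [''], ['MULTI']]
negated, shares = neg_type_shares(toy)
print(negated, shares)
```

Unlike the original function, this version returns the values instead of printing them and guards against a split without any negation.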


## Preprocess
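Tokenize with the pre-trained AutoTokenizer that matches the chosen checkpoint. The special tokens added at the beginning and end of each sentence get the label -100, so that they are ignored by the loss and can be filtered out before computing metrics.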

In [6]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

In [7]:
def tokenize_and_align_labels(inds):
    tokenized_inputs = tokenizer(inds["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, label in enumerate(inds['scope']):

        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                #label_ids.append(label[word_idx])
                x = label[word_idx]
                x = 0 if x=='False' else 1 # scope labels arrive as the strings 'True'/'False'; map them to 1/0
                label_ids.append(x)
            else:
                x = label[word_idx]
                x = 0 if x=='False' else 1
                label_ids.append(x if label_all_tokens else -100)

            previous_word_idx = word_idx
                
        labels.append(label_ids)
        
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
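The alignment logic can be tried out without loading a tokenizer. The following is a minimal sketch, assuming a `word_ids` list of the kind a fast tokenizer returns (`None` for special tokens, otherwise the index of the source word). The function name `align_labels` and the toy inputs are hypothetical.

```python
def align_labels(word_ids, word_labels, label_all_tokens=True):
    """Map word-level 'True'/'False' strings to one label id per sub-token."""
    label_ids = []
    previous = None
    for word_idx in word_ids:
        if word_idx is None:
            # Special tokens such as [CLS] and [SEP] are masked out with -100.
            label_ids.append(-100)
        else:
            value = 0 if word_labels[word_idx] == 'False' else 1
            if word_idx != previous or label_all_tokens:
                # First sub-token of a word, or every sub-token when label_all_tokens is set.
                label_ids.append(value)
            else:
                # Continuation sub-tokens are masked out when label_all_tokens is False.
                label_ids.append(-100)
        previous = word_idx
    return label_ids

# Toy sentence of three words; the second word is split into three sub-tokens.
word_ids = [None, 0, 1, 1, 1, 2, None]
word_labels = ['False', 'True', 'False']
print(align_labels(word_ids, word_labels))
print(align_labels(word_ids, word_labels, label_all_tokens=False))
```

With `label_all_tokens=True`, as used in this notebook, every sub-token of a word inherits the word's label, so a word split into several pieces contributes several tokens to the token-level metrics.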

In [8]:
label_all_tokens = True
tokenized_train = trainds.map(tokenize_and_align_labels, batched=True)
tokenized_dev = devds.map(tokenize_and_align_labels, batched=True)
tokenized_test = testds.map(tokenize_and_align_labels, batched=True)

## Load model and metric

In [9]:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer


In [10]:
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=2) # 2 labels: outside/inside a negation scope, encoded as 0/1 by tokenize_and_align_labels

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [11]:
model_name = model_checkpoint.split("/")[-1]
args = TrainingArguments(
    f"{model_name}-finetuned-{task}",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=False,
)

In [12]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer)

In [13]:
from datasets import load_metric
# metric = load_metric("seqeval")
metric = load_metric('../scripts/span_metric.py',trust_remote_code=True) # A self-defined metric class calculating both token overlap and span agreement



In [21]:
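# Maps label ids back to booleans: id 0 -> True, id 1 -> False.
# tokenize_and_align_labels encodes in-scope tokens as 1, so from here on False marks a token inside the negation scope.
label_list = [True,False]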

## Train and evaluate

In [15]:
import numpy as np

In [25]:
def remove_ignored_index(predictions,labels):
    actual_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    actual_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    return actual_predictions, actual_labels
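A small worked example may make the filtering and the id-to-boolean mapping easier to follow. This is a sketch with toy inputs; `drop_ignored` is a hypothetical stand-in that mirrors `remove_ignored_index` above, and `label_list` is repeated so the block runs on its own.

```python
label_list = [True, False]  # same mapping as in the notebook: id 0 -> True, id 1 -> False

def drop_ignored(predictions, labels):
    """Drop positions labelled -100 and convert the remaining ids with label_list."""
    kept_predictions, kept_labels = [], []
    for prediction, label in zip(predictions, labels):
        # Keep only positions whose gold label is not the ignore index.
        pairs = [(p, l) for p, l in zip(prediction, label) if l != -100]
        kept_predictions.append([label_list[p] for p, _ in pairs])
        kept_labels.append([label_list[l] for _, l in pairs])
    return kept_predictions, kept_labels

# One toy sentence: the first and last positions are special tokens.
toy_predictions = [[1, 0, 0, 0]]
toy_labels = [[-100, 0, 1, -100]]
print(drop_ignored(toy_predictions, toy_labels))
```

The prediction at a masked position is discarded regardless of its value, and label id 1 comes out as `False` because of the order of `label_list`.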

In [17]:
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2) # most likely label for each token

    # Remove ignored index (special tokens)
    actual_predictions, actual_labels = remove_ignored_index(predictions,labels)
    
    results = metric.compute(predictions=actual_predictions, references=actual_labels)
    return {
        #"accuracy": results["overall_accuracy"],
        "token_precision":results["token_precision"], "token_recall":results["token_recall"], "token_f1":results["token_f1"],
        "span_precision":results["span_precision"], "span_recall":results["span_recall"], "span_f1":results["span_f1"]
    }
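The metric script `../scripts/span_metric.py` is not shown in this notebook, so its exact definitions are unknown. As an illustration only, the following sketch assumes that the token scores are precision, recall, and F1 over one positive class, and that span agreement is the share of sentences whose predicted labels match the gold labels exactly. Both function names, the `positive` parameter, and the toy data are hypothetical. The identical span precision, recall, and F1 values reported in the outputs would be consistent with a single per-sentence score of this kind, but that is an inference and not something the notebook states.

```python
def token_prf(predictions, references, positive=False):
    """Token-level precision, recall and F1 for the `positive` class."""
    tp = fp = fn = 0
    for pred, ref in zip(predictions, references):
        for p, r in zip(pred, ref):
            if p == positive and r == positive:
                tp += 1
            elif p == positive:
                fp += 1
            elif r == positive:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def span_exact_match(predictions, references):
    """Share of sentences whose predicted label sequence equals the gold sequence."""
    if not references:
        return 0.0
    matches = sum(1 for pred, ref in zip(predictions, references) if pred == ref)
    return matches / len(references)

# Toy data in the notebook's convention, where False marks an in-scope token.
toy_predictions = [[True, False, False], [True, True]]
toy_references = [[True, False, True], [True, True]]
print(token_prf(toy_predictions, toy_references))
print(span_exact_match(toy_predictions, toy_references))
```

Under these assumptions a single wrong token lowers the token scores only slightly but makes the whole sentence count as a span error, which is one way a token F1 near 0.98 could coexist with a much lower span score.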

In [23]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_dev,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [26]:
trainer.train()

Epoch,Training Loss,Validation Loss,Token Precision,Token Recall,Token F1,Span Precision,Span Recall,Span F1
1,No log,0.152031,0.978075,0.978212,0.978144,0.878528,0.878528,0.878528
2,No log,0.144276,0.983014,0.973028,0.977995,0.880982,0.880982,0.880982
3,0.046000,0.162832,0.979092,0.977652,0.978371,0.879755,0.879755,0.879755


Checkpoint destination directory bert-base-uncased-finetuned-scope/checkpoint-500 already exists and is non-empty.Saving will proceed but saved results may be invalid.


TrainOutput(global_step=711, training_loss=0.042505420545317786, metrics={'train_runtime': 26.3382, 'train_samples_per_second': 430.44, 'train_steps_per_second': 26.995, 'total_flos': 282571007910408.0, 'train_loss': 0.042505420545317786, 'epoch': 3.0})
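The step count and the `No log` entries in the table above can be checked by hand. This is a minimal sketch, assuming a single device, no gradient accumulation, and the Trainer's default `logging_steps` of 500; none of these are set explicitly in the notebook.

```python
import math

train_sentences = 3779   # Num_Sentence reported for trainds
batch_size = 16          # per_device_train_batch_size
num_train_epochs = 3
logging_steps = 500      # assumed default, not set in TrainingArguments above

# The last batch of an epoch may be smaller, so round up.
steps_per_epoch = math.ceil(train_sentences / batch_size)
total_steps = steps_per_epoch * num_train_epochs
# The training loss is first logged in the epoch that contains step 500.
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)

print(steps_per_epoch, total_steps, first_logged_epoch)
```

Under these assumptions the total matches `global_step=711`, and the first training-loss value only appears in epoch 3, which is why epochs 1 and 2 show `No log`.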

In [27]:
trainer.evaluate()

{'eval_loss': 0.16283178329467773,
 'eval_token_precision': 0.9790921209569915,
 'eval_token_recall': 0.9776516743729858,
 'eval_token_f1': 0.9783713674764258,
 'eval_span_precision': 0.8797546012269939,
 'eval_span_recall': 0.8797546012269939,
 'eval_span_f1': 0.8797546012269939,
 'eval_runtime': 0.4437,
 'eval_samples_per_second': 1837.018,
 'eval_steps_per_second': 114.954,
 'epoch': 3.0}

In [28]:
predictions, labels, _ = trainer.predict(tokenized_test)
predictions = np.argmax(predictions, axis=2)

# Remove ignored index (special tokens)
actual_predictions, actual_labels = remove_ignored_index(predictions,labels)

results = metric.compute(predictions=actual_predictions, references=actual_labels)
results

{'token_precision': 0.9841899217024694,
 'token_recall': 0.9840911372076684,
 'token_f1': 0.9841405269761606,
 'span_precision': 0.8835125448028673,
 'span_recall': 0.8835125448028673,
 'span_f1': 0.8835125448028673}

## Model comparison
Re-run the whole pipeline with distilbert-base-uncased instead of bert-base-uncased.

In [29]:
model_checkpoint = "distilbert-base-uncased" # bert-base-uncased for better precision, distilbert-base-uncased for a faster run

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

label_all_tokens = True
tokenized_train = trainds.map(tokenize_and_align_labels, batched=True)
tokenized_dev = devds.map(tokenize_and_align_labels, batched=True)
tokenized_test = testds.map(tokenize_and_align_labels, batched=True)

model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=2) # 2 labels: outside/inside a negation scope, encoded as 0/1 by tokenize_and_align_labels

model_name = model_checkpoint.split("/")[-1]
args = TrainingArguments(
    f"{model_name}-finetuned-{task}",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=False,
)

data_collator = DataCollatorForTokenClassification(tokenizer)

trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_dev,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Token Precision,Token Recall,Token F1,Span Precision,Span Recall,Span F1
1,No log,0.133527,0.970554,0.981365,0.975929,0.853988,0.853988,0.853988
2,No log,0.130087,0.96998,0.982416,0.976158,0.86135,0.86135,0.86135
3,0.133900,0.123357,0.9778,0.97506,0.976428,0.867485,0.867485,0.867485


TrainOutput(global_step=711, training_loss=0.11660883295888136, metrics={'train_runtime': 15.531, 'train_samples_per_second': 729.961, 'train_steps_per_second': 45.779, 'total_flos': 141634595162424.0, 'train_loss': 0.11660883295888136, 'epoch': 3.0})

In [30]:
trainer.evaluate()

{'eval_loss': 0.12335667759180069,
 'eval_token_precision': 0.9777996346775326,
 'eval_token_recall': 0.9750595488300406,
 'eval_token_f1': 0.9764276694261259,
 'eval_span_precision': 0.8674846625766871,
 'eval_span_recall': 0.8674846625766871,
 'eval_span_f1': 0.8674846625766871,
 'eval_runtime': 0.2882,
 'eval_samples_per_second': 2828.048,
 'eval_steps_per_second': 176.97,
 'epoch': 3.0}

In [31]:
predictions, labels, _ = trainer.predict(tokenized_test)
predictions = np.argmax(predictions, axis=2)

# Remove ignored index (special tokens)
actual_predictions, actual_labels = remove_ignored_index(predictions,labels)

results = metric.compute(predictions=actual_predictions, references=actual_labels)
results

{'token_precision': 0.9842298227110642,
 'token_recall': 0.9834889089631637,
 'token_f1': 0.9838592263473654,
 'span_precision': 0.8763440860215054,
 'span_recall': 0.8763440860215054,
 'span_f1': 0.8763440860215054}
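The two runs can be compared directly from the numbers printed in this notebook. The following is a small worked calculation that uses only the reported test-set metrics and `train_runtime` values; the variable names are illustrative.

```python
# Test-set metrics and training runtimes (seconds) copied from the outputs above.
bert = {"token_f1": 0.9841405269761606, "span_f1": 0.8835125448028673, "train_runtime": 26.3382}
distil = {"token_f1": 0.9838592263473654, "span_f1": 0.8763440860215054, "train_runtime": 15.531}

token_f1_gap = bert["token_f1"] - distil["token_f1"]
span_f1_gap = bert["span_f1"] - distil["span_f1"]
runtime_ratio = bert["train_runtime"] / distil["train_runtime"]

print(f"token F1 gap: {token_f1_gap:.4f}")
print(f"span F1 gap:  {span_f1_gap:.4f}")
print(f"BERT training time / DistilBERT training time: {runtime_ratio:.2f}")
```

On this single run the token-level gap is far smaller than the span-level gap, while DistilBERT trains noticeably faster; a single run per model is not enough to say how stable these differences are.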

## Error Analysis

In [34]:
result_df = pd.DataFrame(columns=['Sentence_retokenized', 'Labels', 'Prediction', 'Sentence', 'Cue'])
for p, t, ds in zip(actual_predictions, actual_labels, tokenized_test):
    if p != t:
        result_df.loc[len(result_df)] = [tokenizer.convert_ids_to_tokens(ds["input_ids"]), # Perform re-tokenization to match labels
                                         t, p, ds['tokens'], ds['is_neg']] # For finding the negation cue

result_df.to_csv('../results/errors.csv')

In [35]:
import sys
from termcolor import colored, cprint

In [36]:
def color_sentences(df):
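    '''
    Print each mispredicted sentence twice: first colored by the gold scope, then colored by the prediction.
    The negation cue cannot be located directly in the re-tokenized sentence,
    so it is shown with its neighbouring words from the original tokens instead.
    '''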
    print("Labels:")
    cprint("The Cue", "black", end=" ",attrs=["underline","reverse"])
    cprint("True Scope", "black", "on_green", end=" ")
    cprint("Correct Prediction", "black", "on_yellow", end=" ")
    cprint("False Positive", "black", "on_red", end=" ",attrs=["blink"])
    cprint("False Negative", "white", "on_light_red", end=" ",attrs=["blink"])
    print("\nNegation cue is the middle word of Cue context.")
    print("\n\n")
    
    for row in range(len(df)):
        print(row)
        
        # Coloring the negation cue and print the context
        c = df['Cue'][row]
        so = df['Sentence'][row]
        cue_phrase, cue_len, ending = '', 0, 0
        for i in range(len(c)): 
            if c[i] == True:
                cue_phrase = cue_phrase + so[i] + ' '
                cue_len +=1
                ending = i
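        # Guard the neighbours: a sentence-initial cue would otherwise wrap around to the last word,
        # and a sentence-final cue would raise an IndexError.
        left = so[ending-cue_len] if ending-cue_len >= 0 else ''
        right = so[ending+1] if ending+1 < len(so) else ''
        print("Cue context: ...", left, end=" ")
        cprint(cue_phrase[:-1], "black", end=" ",attrs=["underline","reverse"])
        print(right, "...")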
        
        # Coloring the scope
        s = df['Sentence_retokenized'][row]
        t = df['Labels'][row]
        p = df['Prediction'][row]
        for i in range(len(s)-2): 
            if (t[i] == False):
                cprint(s[i+1], "black", "on_green", end=" ")
            if (t[i] == True):
                cprint(s[i+1], "black", end=" ")
        print()
        for i in range(len(s)-2):
            if (p[i] == False) & (p[i]==t[i]):
                cprint(s[i+1], "black", "on_yellow", end=" ")
            if (p[i] == True) & (p[i]==t[i]):
                cprint(s[i+1], "black", end=" ")
            if (p[i] == False) & (p[i] != t[i]):
                cprint(s[i+1], "black", "on_red", end=" ",attrs=["blink"])
            if (p[i] == True) & (p[i] != t[i]):
                cprint(s[i+1], "white", "on_light_red", end=" ",attrs=["blink"])
        print("\n")
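`cprint` colors its output with ANSI escape sequences, which show up as raw codes such as `[42m` when the cell output is saved as plain text. The following is a minimal sketch for producing a plain-text copy, assuming only the color and style sequences that end in `m` need to be removed. The names `ANSI_SGR` and `strip_ansi` and the sample string are hypothetical.

```python
import re

# Matches color/style sequences such as ESC[42m, ESC[30m and ESC[0m.
ANSI_SGR = re.compile(r'\x1b\[[0-9;]*m')

def strip_ansi(text):
    """Remove ANSI color/style escape sequences from a string."""
    return ANSI_SGR.sub('', text)

# Three tokens colored the way color_sentences colors them.
colored_line = '\x1b[42m\x1b[30mi\x1b[0m \x1b[42m\x1b[30mcan\x1b[0m \x1b[30mnot\x1b[0m'
print(strip_ansi(colored_line))
```

Stripping the codes also removes the scope information they carry, so a plain-text export would need another way to mark in-scope tokens.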

In [37]:
color_sentences(result_df)

(Terminal colors are not preserved here; tokens in the true scope are marked with [brackets].)

Labels:
The Cue True Scope Correct Prediction False Positive False Negative
Negation cue is the middle word of Cue context.



0
Cue context: ... can not see ...
` ` well , mrs . warren , [i can] not [see that you have any particular cause for une ##asi ##ness] , nor do i understand why i , whose time is of some value , should interfere in the matter .
` ` well , mrs . warren , ...

### Manual error classification for the first half:
- Me: The sentence has multiple negations and the predicted scope spills over into the scope of another negation.
- S: Subjunctive ("if not", "why not", etc.).
- I: Imperative ("can you not?").
- P: Phrase negation cue ("no more", "neither nor", ...).
- C: Clauses. "if", "and", "that" are usually false negatives, while transitional conjunctions ("but", "even if") usually give mixed errors.
- Punc: Punctuation divides the scope.
- Pron: Pronouns.
- RT: Problems caused by re-tokenization.
- W: Word with a negative meaning carried by an affix. Can be a prefix (un-) or a suffix (-less).
- O: Other.

0: O"should"
1: Me
2: Me
3: C_and
4: C_if
5: S
6: P(no more), C_if
7: C_that, Punc_,
8: Punc_, Punc_.
9: C_expect
10: ? "nothing more"
11: W_pre(unusual)
12: I(why not)
13: I, Pron
14: Punc__
15: Pron
16: W_pre(unusual), Me
17: C_if
18: O"reason"
19: C_T(but)
20: W_pre(absence)
21: W_pre(unusual)
22: C_and
23: Me, W_suf(without)
24: Punc_'
25: Me, W_pre(irrelevant)
26: RT
27: Punc_, C_T(even if)
28: C_if
29: P(no more)
30: Punc_, C_and
31: S(would have)
32: Me, C_and, Punc_,
33: C_what
34: Me
35: Me
36: Punc_,
37: P(neither nor nor)
38: RT
39: Punc_, *
40: O,*
41: W_suf(breathless)
42: O
43: O
44: Me
45: Me
46: C_that
47: W_pre(unoccupied)
48: Punc_, C_but
49: W_suf(carpetless)
50: O
51: S
52: Me
53: Me
54: RT
55: W_pre(unconventional)
56: P(neither nor)
57: RT
58: Me, RT
59: Me, C_which
60: W_pre(dislike)
61: W_pre(dislike)
62: O
63: C_once
64: Me, W_suf(senseless)
65: P(never more)
66: RT
67: C_but, C_and, Punc_,
68: RT

Me: 15
C: 19 (5 and, 4 if, 3 but; others: even if, except, what, which, that, once)
S: 3
I: 1
RT: 7
Punc: 12
Pron: 2
W: 13 (9pre+4suf)
O: 7
P: 5
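Tallies like the ones above can be computed from the annotation strings instead of being counted by hand. The following is a minimal sketch, assuming the category codes from the legend and a handful of annotations copied from the list. The regular expression and the variable names are hypothetical, and ambiguous notes would still need a manual check.

```python
import re
from collections import Counter

# A category code at a word start, not followed by another letter.
# Longer codes come first so that "Punc" and "Pron" are not read as "P".
CATEGORY = re.compile(r'\b(Punc|Pron|Me|RT|[CSIPWO])(?![A-Za-z])')

# A few annotations copied from the list above (sentence index -> note).
annotations = {
    1: 'Me',
    6: 'P(no more), C_if',
    8: 'Punc_, Punc_.',
    16: 'W_pre(unusual), Me',
    19: 'C_T(but)',
    58: 'Me, RT',
}

tally = Counter()
for note in annotations.values():
    # Each occurrence of a code counts once, so 'Punc_, Punc_.' adds two to Punc.
    tally.update(CATEGORY.findall(note))

print(dict(tally))
```

Counting every occurrence matches how the Punc total above treats sentence 8, where two punctuation marks are counted separately.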