### When your Language Model cannot even do Determiners right: Probing for Anti-Presuppositions and the Maximize Presupposition! Principle | @BlackboxNLP 2023

- In this notebook, we first preprocess corpora from the SuperGLUE benchmark and then fine-tune the language models on this data with masked language modeling (MLM).

- Cf. the Hugging Face tutorial: https://huggingface.co/learn/nlp-course/chapter7/3?fw=pt

---- 

In [1]:
import collections
import numpy as np
import torch
import math
from torch import nn

from transformers import AutoModelForMaskedLM
from transformers import AutoTokenizer

from datasets import load_dataset
from datasets import DatasetDict, concatenate_datasets

from transformers import DataCollatorForLanguageModeling
from transformers import default_data_collator
from transformers import TrainingArguments
from transformers import Trainer

from huggingface_hub import notebook_login

### Load model(s)

- Uncomment the checkpoint that should be used.

In [3]:
## BERT base
model_checkpoint = "bert-base-cased" 

## Multilingual BERT
#model_checkpoint = "bert-base-multilingual-cased" 

## XLM-RoBERTa
#model_checkpoint = "xlm-roberta-base"

model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Load and preprocess SuperGLUE data

- For the question-answering datasets (BoolQ, COPA, MultiRC), we merge the fields so that the context paragraph comes first, followed by the corresponding question and then the correct answer.
- For the natural language inference datasets (CB, RTE), we follow the approach of Raffel et al. (2019): we start with the hypothesis, followed by a colon, and then the premise. We include only entailment pairs and exclude contradictions.

- We excluded the following SuperGLUE corpora because they were not "beneficial" for our task and/or could not be preprocessed in a way that allows MLM to be applied:
    - ReCoRD (Reading Comprehension with Commonsense Reasoning Dataset, Zhang et al., 2018) (QA)
    - WiC (Word-in-Context, Pilehvar and Camacho-Collados, 2019) (WSD, word sense disambiguation)
    - WSC (Winograd Schema Challenge, Levesque et al., 2012) (coreference resolution)


In [4]:
def print_row(datasample):
    for row in datasample:
        for key, value in row.items():
            #print(f"Key: {key}, Value: {value}")
            print(key,"--", value)

####  QA datasets
- **BOOLQ** Boolean Questions, Clark et al., 2019
- **COPA** Choice of Plausible Alternatives, Roemmele et al., 2011
- **MULTIRC** Multi-Sentence Reading Comprehension, Khashabi et al., 2018


- generally, for these QA datasets, the input is processed in this way: paragraph, question, right answer 
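
As a minimal sketch of this format, the snippet below builds one training string from a made-up BoolQ-style row. The row contents and the helper name `format_qa_example` are illustrative assumptions and are not taken from the dataset or from the cells in this notebook.

```python
def format_qa_example(passage: str, question: str, answer: str) -> str:
    """Join passage, question, and answer into one string, in that order."""
    return f"{passage} {question} {answer}"


# Hypothetical row, invented for illustration only.
toy_row = {
    "passage": "The tower began to tilt during construction.",
    "question": "was the tower built leaning",
    "label": 0,
}

# BoolQ-style post-processing: capitalize the question, append "?",
# and verbalize the binary label as "Yes." or "No.".
question = toy_row["question"].capitalize() + "?"
answer = "Yes." if toy_row["label"] == 1 else "No."

formatted = format_qa_example(toy_row["passage"], question, answer)
print(formatted)
```

Placing the answer last means that it survives when long passages are later truncated from the left.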

In [5]:
##  ---- superglue_boolq ----
#(Boolean Questions, Clark et al., 2019) (QA)
#QA task where each example consists of a short passage and a yes/no question about the passage.  

superglue_boolq = load_dataset("super_glue", "boolq") 
sample_boolq = superglue_boolq["train"].shuffle().select(range(1))


## Passage, Question and then the right answer
def create_new_value(row):
    label_value = "Yes." if row['label'] == 1 else "No."
    question = row['question'].capitalize() + "?"
    new_value = f"{row['passage']} {question} {label_value}"
    return new_value

# try out with sample 
#sample_boolq = sample_boolq.map(lambda row: {'new_column': create_new_value(row)})
#print_row(sample_boolq)

# now with the whole corpus (and also exclude the other columns because I don't need them anymore)
superglue_boolq = superglue_boolq.map(lambda row: {'new_column': create_new_value(row)}, remove_columns=['question', 'passage', 'idx', 'label'])
#superglue_boolq["train"]["new_column"] #looks good


In [7]:
##  ---- superglue_copa ---- 
# (Choice of Plausible Alternatives, Roemmele et al., 2011) (QA)

superglue_copa = load_dataset("super_glue", "copa") 
sample_copa = superglue_copa["train"].shuffle().select(range(1))

## question:
# cause: What was the cause for this?
# effect: What happened as a result?
## label:
# 0 = choice1 is the right answer
# 1 = choice2 is the right answer

def create_new_value(row):
    # Verbalize the question type
    if row['question'] == "cause":
        question = "What was the cause for this?"
    else:
        question = "What happened as a result?"
    # label 0 selects choice1; any other label value selects choice2
    answer = row['choice1'] if row["label"] == 0 else row['choice2']
    return f"{row['premise']} {question} {answer}"

### try out with sample 
sample_copa = sample_copa.map(lambda row: {'new_column': create_new_value(row)})
#print_row(sample_copa)

# now with the whole corpus (and exclude the columns that we don't need anymore)
superglue_copa = superglue_copa.map(lambda row: {'new_column': create_new_value(row)}, remove_columns=['premise', 'choice1', 'choice2', 'question', 'idx', 'label'])
#superglue_copa["train"]["new_column"] #looks good

### Because the test split is now larger than the training split, we swap the training and test datasets
new_train_dataset = superglue_copa["test"]
new_test_dataset = superglue_copa["train"]
validation_dataset = superglue_copa["validation"]

superglue_copa = DatasetDict({
    "train": new_train_dataset,
    "validation": validation_dataset,
    "test": new_test_dataset
})


In [8]:
##  ---- superglue_multirc ---- 
# (Multi-Sentence Reading Comprehension, Khashabi et al., 2018) (QA)
# QA task where each example consists of a context paragraph, a question about that paragraph, 
# and a list of possible answers.

superglue_multirc = load_dataset("super_glue", "multirc") 
sample_multirc = superglue_multirc["train"].shuffle().select(range(1))

# here: paragraph, question, then answer
# (note: rows are not filtered by label in this cell, so every candidate answer is used)
def create_new_value(row):
    new_value = f"{row['paragraph']} {row['question']} {row['answer']}"
    return new_value

### try out with sample 
#sample_multirc = sample_multirc.map(lambda row: {'new_column': create_new_value(row)})
#print_row(sample_multirc)

# now with the whole corpus (and exclude the columns that we don't need anymore)
superglue_multirc = superglue_multirc.map(lambda row: {'new_column': create_new_value(row)}, remove_columns=['paragraph', 'question', 'answer', 'idx', 'label'])
#superglue_multirc["train"]["new_column"] #looks good


####  NLI datasets

- **CB** CommitmentBank, De Marneffe et al., 2019
- **RTE** Recognizing Textual Entailment 

- Generally, we preprocess these datasets like the T5 authors: first the hypothesis, then ":", then the premise (only for entailment pairs, i.e., no contradictions etc. are included).
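
The following is a minimal sketch of this "hypothesis: premise" format on two made-up pairs. The helper name `format_nli_example`, the toy sentences, and the assumption that label id 0 means entailment are illustrative; the sketch implements the strict entailment-only reading of the description above.

```python
from typing import Optional

ENTAILMENT = 0  # assumed label id for "entailment"


def format_nli_example(hypothesis: str, premise: str, label: int) -> Optional[str]:
    """Return 'hypothesis: premise' for entailment pairs, otherwise None."""
    if label != ENTAILMENT:
        return None
    # Drop a trailing period so that the hypothesis ends in a single colon.
    return f"{hypothesis.rstrip('.')}: {premise}"


# Hypothetical pairs, invented for illustration only.
toy_pairs = [
    {"hypothesis": "A dog is outside.", "premise": "A dog runs across the yard.", "label": 0},
    {"hypothesis": "A cat is asleep.", "premise": "A cat chases a mouse.", "label": 1},
]

formatted = [format_nli_example(**pair) for pair in toy_pairs]
# Keep only the pairs that were formatted, i.e. the entailment pairs.
kept = [text for text in formatted if text is not None]
print(kept)
```

Returning `None` for excluded pairs and filtering afterwards mirrors the map-then-filter pattern that works well with `datasets`.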


In [9]:
##  ---- superglue_cb ---- 
#(CommitmentBank, De Marneffe et al., 2019) (NLI)
# Each example consists of a premise containing an embedded clause 
# and the corresponding hypothesis is the extraction of that clause.

superglue_cb = load_dataset("super_glue", "cb") 
sample_cb = superglue_cb["train"].shuffle().select(range(1))

## I do it like the T5 authors: first the hypothesis, then ":", then the premise.
## Rows with label 1 (contradiction) are dropped; note that the check `label != 1`
## keeps every other label value, not only label 0 (entailment).
def create_new_value(row):
    if row['label'] != 1:
        hypothesis = row['hypothesis'].capitalize() + ":"
        new_value = f"{hypothesis} {row['premise']}"
        return new_value
    else:
        return None

### try out with sample 
#sample_cb = sample_cb.map(lambda row: {'new_column': create_new_value(row)})
# Filter out None values, i.e. drop the rows with label 1 (otherwise we include contradictions)
#sample_cb = sample_cb.filter(lambda row: row['new_column'] is not None)
# Print the updated dataset
#print_row(sample_cb)

### now with the whole corpus (and remove the columns that we don't need anymore)
superglue_cb = superglue_cb.map(lambda row: {'new_column': create_new_value(row)}, remove_columns=['premise', 'hypothesis', 'idx', 'label'])
# Filter out None values, i.e. drop the rows with label 1 (otherwise we include contradictions)
superglue_cb = superglue_cb.filter(lambda row: row['new_column'] is not None)
#superglue_cb["train"]["new_column"] #looks good

### Because the test split is now larger than the training split, we swap the training and test datasets
new_train_dataset = superglue_cb["test"]
new_test_dataset = superglue_cb["train"]
validation_dataset = superglue_cb["validation"]

superglue_cb = DatasetDict({
    "train": new_train_dataset,
    "validation": validation_dataset,
    "test": new_test_dataset
})


In [10]:
##  ---- superglue_rte ---- 
# Recognizing Textual Entailment (NLI)
# two-class classification: entailment and not_entailment.

superglue_rte = load_dataset("super_glue", "rte") 
sample_rte = superglue_rte["train"].shuffle().select(range(3))

# label 0 = entailment; label 1 = not_entailment
## I do it like I did for CB (thus, like the T5 authors: first the hypothesis, then ":", then the premise)
## and I drop the rows with label 1, since those are cases of no entailment

def create_new_value(row):
    if row['label'] != 1:
        hypothesis = row['hypothesis'].replace(".",":")
        new_value = f"{hypothesis} {row['premise']}"
        return new_value
    else:
        return None

### try out with sample 
#sample_rte = sample_rte.map(lambda row: {'new_column': create_new_value(row)})
# Filter out None values, i.e. drop the rows with label 1 (otherwise we include cases with no entailment)
#sample_rte = sample_rte.filter(lambda row: row['new_column'] is not None)
# Print the updated dataset
#print_row(sample_rte)

### now with the whole corpus (and remove the columns that I don't need anymore)
superglue_rte = superglue_rte.map(lambda row: {'new_column': create_new_value(row)}, remove_columns=['premise', 'hypothesis', 'idx', 'label'])
# Filter out None values, i.e. drop the rows with label 1 (otherwise we include cases with no entailment)
superglue_rte = superglue_rte.filter(lambda row: row['new_column'] is not None)
#superglue_rte["train"]["new_column"] #looks good

### Because the test split is now larger than the training split, I swap the training and test datasets
new_train_dataset = superglue_rte["test"]
new_test_dataset = superglue_rte["train"]
validation_dataset = superglue_rte["validation"]

superglue_rte = DatasetDict({
    "train": new_train_dataset,
    "validation": validation_dataset,
    "test": new_test_dataset
})


#### Merge these corpora to one

- First, we merge only the three QA corpora with each other and the two NLI corpora with each other, because the two groups need to be truncated differently:
    - QA from the left (so that the question, which is at the end, remains in the data)
    - NLI from the right (so that the hypothesis, which is at the beginning, remains in the data)
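
The effect of the truncation side can be sketched on plain token lists. This is a simplified illustration under stated assumptions: the helper `truncate` and the toy tokens are hypothetical, and a real tokenizer additionally reserves room for special tokens such as `[CLS]` and `[SEP]`.

```python
from typing import List


def truncate(tokens: List[str], max_len: int, side: str = "right") -> List[str]:
    """Keep at most max_len tokens, dropping tokens from the given side."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    if max_len <= 0:
        return []
    if len(tokens) <= max_len:
        return list(tokens)
    if side == "left":
        return tokens[-max_len:]  # drop from the start, keep the end
    return tokens[:max_len]       # drop from the end, keep the start


# QA items end with the question and answer, so we cut from the left.
qa_tokens = "p1 p2 p3 p4 question ? answer".split()
print(truncate(qa_tokens, 3, side="left"))

# NLI items start with the hypothesis, so we cut from the right.
nli_tokens = "hypothesis : p1 p2 p3 p4".split()
print(truncate(nli_tokens, 3, side="right"))
```

In both cases the part of the item that carries the task signal (question and answer, or hypothesis) is the part that is kept.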

In [11]:
### QA 
# Create an empty datasetDict
superglue_qa = DatasetDict()

# List of datasets I need to merge
dataset_dicts_to_merge_qa = [superglue_boolq, superglue_copa, superglue_multirc]

# Iterate through each split and merge datasets
for split in ['train', 'validation', 'test']:
    merged_datasets = [dataset_dict[split] for dataset_dict in dataset_dicts_to_merge_qa]
    merged_dataset = concatenate_datasets(merged_datasets)
    
    # Add the merged dataset to the superglue_data
    superglue_qa[split] = merged_dataset


### NLI
# Create an empty datasetDict
superglue_nli = DatasetDict()

# List of datasets I need to merge
dataset_dicts_to_merge_nli = [superglue_cb, superglue_rte]

# Iterate through each split and merge datasets
for split in ['train', 'validation', 'test']:
    merged_datasets = [dataset_dict[split] for dataset_dict in dataset_dicts_to_merge_nli]
    merged_dataset = concatenate_datasets(merged_datasets)
    
    # Add the merged dataset to the superglue_data
    superglue_nli[split] = merged_dataset

#### Now tokenize the corpus

In [17]:
max_length = 0
# Iterate through each split and track the maximum text length (in characters)
for split in ['train', 'validation', 'test']:
    dataset = superglue_qa[split]
    max_split_length = max(len(item) for item in dataset['new_column'])
    # Update the running maximum so that all splits are taken into account
    max_length = max(max_length, max_split_length)

max_length_data = max_length
print("Maximum length:", max_length_data) 

Maximum length: 3891


In [18]:
def tokenize_function(examples):
    result = tokenizer(examples["new_column"], max_length=128, padding=True, truncation=True)#, truncation_side="left")
    if tokenizer.is_fast:
        #grab the word IDs as we will need them later on to do whole word masking
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result

#### Truncate the data
- To fit the data to the models' maximum context size, we truncate the QA datasets from the left (since the questions appear at the end of the items) and the NLI datasets from the right (since the hypotheses appear at the beginning of the items).

In [16]:
# QA
tokenizer.truncation_side='left'
tokenized_superglue_qa = superglue_qa.map(tokenize_function, batched=True) 

# NLI
tokenizer.truncation_side='right'
tokenized_superglue_nli = superglue_nli.map(tokenize_function, batched=True) 


In [22]:
tokenizer.decode(tokenized_superglue_nli["train"][32]["input_ids"])

"[CLS] Zosie's wishes had been consulted : But first Zosie had come. Rufus, driving back from London with the hashish his dealer swore was genuine Indian charas and a package of best Colombian, picked her off the street - ` ` a piece of property that is found ownerless''. And she had slept with Rufus in the Centaur Room it being taken for granted she would share his bed though Adam did not think her wishes had been consulted. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]"

In [23]:
tokenizer.decode(tokenized_superglue_qa["train"][32]["input_ids"])

"[CLS] Leaning Tower of Pisa - - The tower's tilt began during construction in the 12th century, caused by an inadequate foundation on ground too soft on one side to properly support the structure's weight. The tilt increased in the decades before the structure was completed in the 14th century. It gradually increased until the structure was stabilized ( and the tilt partially corrected ) by efforts in the late 20th and early 21st centuries. Was the leaning tower of pisa built leaning? No. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]"

In [141]:
### shuffle each of these 
tokenized_superglue_nli = tokenized_superglue_nli.shuffle(seed=42)
tokenized_superglue_qa = tokenized_superglue_qa.shuffle(seed=42)


#### Data balancing
- We balance our corpus by downsampling the QA datasets so that the QA and NLI datasets are equally represented, leading to a total of 9608 data points.
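
A minimal sketch of this balancing step on toy lists, assuming a seeded shuffle followed by taking the first `target_size` items. The helper `downsample` and the toy items are hypothetical. The split sizes at the end are the ones reported in this notebook's final `DatasetDict` summary (6500 / 348 / 2760) and add up to the total stated above.

```python
import random
from typing import List, Sequence


def downsample(items: Sequence[str], target_size: int, seed: int = 42) -> List[str]:
    """Shuffle indices with a fixed seed and keep the first target_size items."""
    rng = random.Random(seed)
    indices = list(range(len(items)))
    rng.shuffle(indices)
    return [items[i] for i in indices[:target_size]]


# Toy corpora: the QA side is larger than the NLI side.
qa_items = [f"qa_{i}" for i in range(10)]
nli_items = [f"nli_{i}" for i in range(4)]

# Downsample QA to the NLI size, then merge both halves.
balanced_qa = downsample(qa_items, len(nli_items))
merged = nli_items + balanced_qa
print(len(merged))

# Sizes of the merged splits as reported in this notebook.
merged_split_sizes = {"train": 6500, "validation": 348, "test": 2760}
total = sum(merged_split_sizes.values())
print(total)
```

Fixing the seed keeps the selected subset reproducible across runs.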
 

In [24]:
### Downsample the QA dataset because it is much bigger
# (for now; maybe later I can try with a bigger dataset) we make it as big as the NLI dataset

# Sizes from the nli dataset
train_size_nli = tokenized_superglue_nli["train"].num_rows
test_size_nli = tokenized_superglue_nli["test"].num_rows
validation_size_nli = tokenized_superglue_nli["validation"].num_rows

# Downsample each QA split to the size of the corresponding NLI split
train_qa = tokenized_superglue_qa["train"].shuffle(seed=42).select(range(train_size_nli))
validation_qa = tokenized_superglue_qa["validation"].shuffle(seed=42).select(range(validation_size_nli))
test_qa = tokenized_superglue_qa["test"].shuffle(seed=42).select(range(test_size_nli))

# Create a new DatasetDict with the adjusted splits
tokenized_superglue_qa = DatasetDict({
    "train": train_qa,
    "validation": validation_qa,
    "test": test_qa
})


In [28]:
### Now merge these two into one big dataset:

# Create an empty super_glue datasetDict
tokenized_superglue = DatasetDict()

# List of datasets I need to merge
dataset_dicts_to_merge = [tokenized_superglue_nli, tokenized_superglue_qa]

# Iterate through each split and merge datasets
for split in ['train', 'validation', 'test']:
    merged_datasets = [dataset_dict[split] for dataset_dict in dataset_dicts_to_merge]
    merged_dataset = concatenate_datasets(merged_datasets)    
    # Add the merged dataset to the superglue_data
    tokenized_superglue[split] = merged_dataset

In [29]:
## add labels column (to have the ground truth for MLM)
for split in ['train', 'validation', 'test']:
    dataset = tokenized_superglue[split]
    dataset = dataset.map(lambda row: {'labels': row['input_ids'], **row})
    tokenized_superglue[split] = dataset

  0%|          | 0/6500 [00:00<?, ?ex/s]

  0%|          | 0/348 [00:00<?, ?ex/s]

  0%|          | 0/2760 [00:00<?, ?ex/s]

In [30]:
tokenizer.decode(tokenized_superglue["train"][3294]["input_ids"]) 

"[CLS] Laden family was forced to find a buyer for Usama's share of the family company in 1994. The Saudi government subsequently froze the proceeds of the sale. This action had the effect of divesting Bin Laden of what otherwise might indeed have been a large fortune. Nor were Bin Laden's assets in Sudan a source of money for al Qaeda. When Bin Laden lived in Sudan from 1991 to 1996, he owned a number of businesses and other assets. How did the tradecraft of each of the 9 / 11 plotters go to fund the terrorist activities of 9 / 11? Ordinary expenditures that defeated detection [SEP]"

In [31]:
tokenizer.decode(tokenized_superglue["train"][3294]["labels"])

"[CLS] Laden family was forced to find a buyer for Usama's share of the family company in 1994. The Saudi government subsequently froze the proceeds of the sale. This action had the effect of divesting Bin Laden of what otherwise might indeed have been a large fortune. Nor were Bin Laden's assets in Sudan a source of money for al Qaeda. When Bin Laden lived in Sudan from 1991 to 1996, he owned a number of businesses and other assets. How did the tradecraft of each of the 9 / 11 plotters go to fund the terrorist activities of 9 / 11? Ordinary expenditures that defeated detection [SEP]"

In [32]:
# remove "new_column"
for split in tokenized_superglue.keys():
    dataset = tokenized_superglue[split]
    dataset = dataset.remove_columns("new_column")
    tokenized_superglue[split] = dataset

In [33]:
tokenized_superglue

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 6500
    })
    validation: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 348
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 2760
    })
})

#### Insert the masked tokens

- To fine-tune the models on these datasets for mask filling, we mask the specific minimal pairs of interest, that is, the words "the," "a," "all," and "both".

- The Hugging Face data collator masks tokens, not words. Because we want to mask whole words, we need to build our own data collator.
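
To illustrate the difference between masking tokens and masking whole words, here is a small sketch that groups subword tokens by word index and then masks every token of a chosen word. The tokenization shown is hypothetical, and the helper names `group_tokens_by_word` and `mask_words` are assumptions made for this example.

```python
import collections
from typing import Dict, List, Optional


def group_tokens_by_word(word_ids: List[Optional[int]]) -> Dict[int, List[int]]:
    """Map each word index to the positions of its subword tokens."""
    mapping = collections.defaultdict(list)
    for token_idx, word_id in enumerate(word_ids):
        if word_id is not None:  # None marks special tokens
            mapping[word_id].append(token_idx)
    return dict(mapping)


def mask_words(tokens: List[str], word_ids: List[Optional[int]],
               words_to_mask: List[int], mask_token: str = "[MASK]") -> List[str]:
    """Replace all subword tokens of the selected words with the mask token."""
    mapping = group_tokens_by_word(word_ids)
    masked = list(tokens)
    for word in words_to_mask:
        for token_idx in mapping.get(word, []):
            masked[token_idx] = mask_token
    return masked


# Hypothetical subword split: one word spread over three tokens.
tokens = ["[CLS]", "un", "##believ", "##able", "story", "[SEP]"]
word_ids = [None, 0, 0, 0, 1, None]

masked = mask_words(tokens, word_ids, words_to_mask=[0])
print(masked)
```

Token-level masking could hide only `##believ` and leave the rest of the word visible, which makes the prediction much easier than masking the whole word.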

#### --> Random masking (not in the paper)

In [34]:
### Whole word masking

wwm_probability = 0.1  # fraction of words to be masked (here 10%; 15% is used for BERT and is a common choice in the literature)

def whole_word_masking_data_collator(features):
    for feature in features:
        word_ids = feature.pop("word_ids")
        
        # Create a map between words and corresponding token indices
        mapping = collections.defaultdict(list)
        current_word_index = -1
        current_word = None
        for idx, word_id in enumerate(word_ids):
            if word_id is not None:
                if word_id != current_word:
                    current_word = word_id
                    current_word_index += 1
                mapping[current_word_index].append(idx)

        # Randomly mask words
        mask = np.random.binomial(1, wwm_probability, (len(mapping),))
        input_ids = feature["input_ids"]
        labels = feature["labels"]
        new_labels = [-100] * len(labels)
        for word_id in np.where(mask)[0]:
            word_id = word_id.item()
            for idx in mapping[word_id]:
                new_labels[idx] = labels[idx]
                input_ids[idx] = tokenizer.mask_token_id
        feature["labels"] = new_labels

    return default_data_collator(features)


In [35]:
### Testing
samples = [tokenized_superglue["train"][i] for i in range(4001,4002)]
batch1 = whole_word_masking_data_collator(samples)

for chunk in batch1["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")


'>>> [CLS] was purchased solely with federal legal aid dollars, should be used to provide legal services for poor [MASK] in South Carolina [MASK] " Kleiman said. LSC wants the title to go to the equal justice center in Charleston or [MASK] we want 100 [MASK] of the proceeds from [MASK] sale of [MASK] [MASK] to stay [MASK] Charleston. We are not contemplating taking that money out of South [MASK], " he said. Kleiman said if [MASK] neighborhood legal program in Charleston " had honored their [MASK], this would [MASK] be an issue [MASK] " A local bar in [MASK] County paid how much for the Charelston building? $ 50, 000 [SEP]'


#### --> Specific word masking: Masking the determiners "a" and "the" as well as  "all" and "both" (in the paper)
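
A minimal sketch of this masking scheme on a toy vocabulary: target words are replaced by the mask id in the inputs and keep their original id in the labels, while every other position gets the label -100, the default `ignore_index` of PyTorch's cross-entropy loss. The vocabulary, the ids, and the helper `mask_targets` are made up for illustration.

```python
from typing import Dict, List, Set, Tuple

IGNORE_INDEX = -100  # positions with this label do not contribute to the loss

# Toy vocabulary, invented for illustration only.
TOY_VOCAB = {"[MASK]": 0, "the": 1, "a": 2, "both": 3, "cat": 4, "saw": 5, "dogs": 6}


def mask_targets(token_ids: List[int], targets: Set[str],
                 vocab: Dict[str, int]) -> Tuple[List[int], List[int]]:
    """Mask target words in the inputs and build matching MLM labels."""
    id_to_token = {idx: tok for tok, idx in vocab.items()}
    mask_id = vocab["[MASK]"]
    input_ids, labels = [], []
    for token_id in token_ids:
        if id_to_token[token_id] in targets:
            input_ids.append(mask_id)    # hide the target word
            labels.append(token_id)      # the model must predict the original id
        else:
            input_ids.append(token_id)   # leave other tokens untouched
            labels.append(IGNORE_INDEX)  # and exclude them from the loss
    return input_ids, labels


# "the cat saw both dogs"
token_ids = [1, 4, 5, 3, 6]
masked_ids, labels = mask_targets(token_ids, {"the", "a", "all", "both"}, TOY_VOCAB)
print(masked_ids)
print(labels)
```

Because only the masked positions carry labels, the loss is computed exclusively on the determiners and quantifiers of interest.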

In [36]:
# Define the words we want to mask
words_to_mask = ["the", "a", "all", "both", "The", "A", "All", "Both"]

# Function to mask specific words in a whitespace-split text
def mask_specific_words(input_text, words_to_mask, tokenizer):
    words = input_text.split()
    masked_words = []
    
    for word in words:
        if word in words_to_mask:
            masked_words.append(tokenizer.mask_token)
        else:
            masked_words.append(word)
            
    masked_text = " ".join(masked_words)
    return masked_text

# Test the specific word masking function
#input_text = "Of these, Jan received the banana and a pear. All bananas and both pears..."
#words_to_mask = ["the", "a", "both", "all"]
#print(mask_specific_words(input_text, words_to_mask, tokenizer)) # works

# create own data collator
def specific_word_masking_data_collator(features, words_to_mask=words_to_mask, tokenizer=tokenizer):
    new_features = []
    
    for feature in features:
        input_ids = feature["input_ids"]
        labels = feature["labels"]
        new_input_ids = []
        new_labels = []
        
        for idx, token_id in enumerate(input_ids):
            token = tokenizer.decode([token_id])
            if token in words_to_mask:
                new_input_ids.append(tokenizer.mask_token_id)
                new_labels.append(labels[idx])
            else:
                new_input_ids.append(token_id)
                new_labels.append(-100)
        
        new_feature = {
            "input_ids": new_input_ids,
            "attention_mask": feature["attention_mask"],
            "token_type_ids": feature["token_type_ids"],
            "labels": new_labels
        }
        new_features.append(new_feature)

    return default_data_collator(new_features)


In [37]:
# for RoBERTA: no token_type_ids
def specific_word_masking_data_collator_roberta(features, words_to_mask=words_to_mask, tokenizer=tokenizer):
    new_features = []
    
    for feature in features:
        input_ids = feature["input_ids"]
        labels = feature["labels"]
        new_input_ids = []
        new_labels = []
        
        for idx, token_id in enumerate(input_ids):
            token = tokenizer.decode([token_id])
            if token in words_to_mask:
                new_input_ids.append(tokenizer.mask_token_id)
                new_labels.append(labels[idx])
            else:
                new_input_ids.append(token_id)
                new_labels.append(-100)
        
        new_feature = {
            "input_ids": new_input_ids,
            "attention_mask": feature["attention_mask"],
            #"token_type_ids": feature["token_type_ids"],
            "labels": new_labels
        }
        new_features.append(new_feature)

    return default_data_collator(new_features)
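
The two collators above differ only in whether `token_type_ids` is passed through. A minimal sketch of how they could be folded into one function is shown below. The helper name `mask_target_words` and the toy vocabulary are hypothetical; they stand in for the real tokenizer so that the example runs without downloading a model. It illustrates the masking rule and is not the notebook's actual collator.

```python
def mask_target_words(feature, decode, mask_token_id, words_to_mask):
    """Mask only the target words; all other positions are ignored by the loss.

    `decode` is assumed to map a list of token ids to a string,
    like `tokenizer.decode` in the cells above.
    """
    targets = set(words_to_mask)
    new_input_ids, new_labels = [], []
    for token_id, label in zip(feature["input_ids"], feature["labels"]):
        if decode([token_id]) in targets:
            new_input_ids.append(mask_token_id)  # hide the target word
            new_labels.append(label)             # and keep its label for the loss
        else:
            new_input_ids.append(token_id)       # leave other tokens visible
            new_labels.append(-100)              # -100 is skipped by the cross-entropy loss
    new_feature = {
        "input_ids": new_input_ids,
        "attention_mask": feature["attention_mask"],
        "labels": new_labels,
    }
    # BERT-style tokenizers provide token_type_ids; XLM-RoBERTa does not.
    if "token_type_ids" in feature:
        new_feature["token_type_ids"] = feature["token_type_ids"]
    return new_feature


# Toy stand-ins for the tokenizer (hypothetical ids, for illustration only)
toy_vocab = {0: "[CLS]", 1: "the", 2: "cat", 3: "saw", 4: "both", 5: "dogs", 6: "[SEP]"}
TOY_MASK_ID = 99

def toy_decode(ids):
    return " ".join(toy_vocab[i] for i in ids)

ids = [0, 1, 2, 3, 4, 5, 6]
toy_feature = {"input_ids": ids, "attention_mask": [1] * len(ids), "labels": list(ids)}
masked = mask_target_words(toy_feature, toy_decode, TOY_MASK_ID, ["the", "both"])
print(masked["input_ids"])  # [0, 99, 2, 3, 99, 5, 6]
print(masked["labels"])     # [-100, 1, -100, -100, 4, -100, -100]
```

With the real tokenizer, each masked feature could then be collected in a list and passed to `default_data_collator`, as in the cells above.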


In [38]:
### Testing

# Apply the specific_word_masking_data_collator function
samples = [tokenized_superglue["train"][i] for i in range(4001,4002)]

#batch2 = specific_word_masking_data_collator(samples, words_to_mask, tokenizer)
batch2 = specific_word_masking_data_collator_roberta(samples, words_to_mask, tokenizer)

for chunk in batch2["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")


'>>> [CLS] was purchased solely with federal legal aid dollars, should be used to provide legal services for poor people in South Carolina, " Kleiman said. LSC wants [MASK] title to go to [MASK] equal justice center in Charleston or " we want 100 percent of [MASK] proceeds from [MASK] sale of [MASK] building to stay in Charleston. We are not contemplating taking that money out of South Carolina, " he said. Kleiman said if [MASK] neighborhood legal program in Charleston " had honored their obligation, this would not be an issue. " [MASK] local bar in Charleston County paid how much for [MASK] Charelston building? $ 50, 000 [SEP]'


### Finetune the models

In [40]:
## Shuffle.. :)
tokenized_superglue = tokenized_superglue.shuffle(seed=42)

In [41]:
## Log in to the Hugging Face Hub
notebook_login()

Token is valid.
Your token has been saved in your configured git credential helpers (osxkeychain).
Your token has been saved to /Users/judith/.cache/huggingface/token
Login successful


In [44]:
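# Move the model to the Apple-silicon GPU (PyTorch "mps" backend); use "cuda" or "cpu" on other hardware
model = model.to('mps')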

In [45]:
## Specify the arguments for the Trainer:

batch_size = 64

# Log the training loss roughly once per epoch (number of training examples // batch size)
logging_steps = len(tokenized_superglue["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]

training_args = TrainingArguments(
    output_dir=f"{model_name}-finetuned-SuperGlue",
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    #push_to_hub=True,
    #fp16=True,
    logging_steps=logging_steps,
    remove_unused_columns=False,  # so we don't lose the word ids
)
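
`logging_steps` above is computed with floor division, so it only approximates the number of optimizer steps per epoch. The sketch below works through the arithmetic for a hypothetical training-set size. The figure of 6,500 examples is an assumption chosen for illustration, not a number reported in this notebook, and the sketch also assumes a single device and no gradient accumulation.

```python
import math

batch_size = 64
num_train_examples = 6500  # hypothetical size, for illustration only

# Value passed to TrainingArguments(logging_steps=...) in the cell above
logging_steps = num_train_examples // batch_size

# Steps per epoch, assuming the last, smaller batch is kept (no drop_last)
steps_per_epoch = math.ceil(num_train_examples / batch_size)

print(logging_steps, steps_per_epoch)  # 101 102
```

When the dataset size is not a multiple of the batch size, the two values differ by one, so the training loss is logged slightly before each epoch boundary.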

In [46]:
## For quick test runs, create a small subset:

small_train_subset = tokenized_superglue["train"].select(range(400)) 
small_validation_subset = tokenized_superglue["validation"].select(range(100))  
small_test_subset = tokenized_superglue["test"].select(range(200))  

small_dataset = DatasetDict({
    "train": small_train_subset,
    "validation": small_validation_subset,
    "test": small_test_subset
})

#small_dataset

In [37]:
# Instantiate the trainer for whole word masking (randomly masked words)
trainer_random_masking = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_superglue["train"],
    eval_dataset=tokenized_superglue["test"],
    data_collator=whole_word_masking_data_collator,
    tokenizer=tokenizer)

In [53]:
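words_to_mask = ["the", "a", "all", "both", "The", "A", "All", "Both"]
# Note: the collators above bound words_to_mask as a default argument when they were defined,
# so re-run their cells after changing this list for the change to take effect.

# Instantiate the trainer for specific word masking (masking only our target words)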
trainer_specific_masking = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_superglue["train"],
    eval_dataset=tokenized_superglue["test"],
    #train_dataset=small_dataset["train"],  # small subset for test runs
    #eval_dataset=small_dataset["test"],  # small subset for test runs
    data_collator=specific_word_masking_data_collator,
    #data_collator=specific_word_masking_data_collator_roberta,  # use this instead when a RoBERTa model is selected
    tokenizer=tokenizer)

#### Training... 

In [56]:
trainer_specific_masking.train()



Epoch,Training Loss,Validation Loss
1,0.1863,0.234393
2,0.0978,0.253583
3,0.0675,0.284446


TrainOutput(global_step=306, training_loss=0.11653236104966769, metrics={'train_runtime': 24542.6477, 'train_samples_per_second': 0.795, 'train_steps_per_second': 0.012, 'total_flos': 1286410394880000.0, 'train_loss': 0.11653236104966769, 'epoch': 3.0})
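
Following the Hugging Face tutorial linked at the top of the notebook, a loss value can be read more easily as a perplexity, i.e. the exponential of the mean cross-entropy loss. The sketch below applies this to the validation losses reported in the table above. Because the specific-word collator sets every other label to -100, these perplexities would cover only the masked determiner positions. The helper name `perplexity` is introduced here for illustration.

```python
import math

# Validation losses per epoch, copied from the training log above
validation_loss = {1: 0.234393, 2: 0.253583, 3: 0.284446}

def perplexity(loss):
    """Perplexity is the exponential of the mean cross-entropy loss."""
    return math.exp(loss)

for epoch, loss in validation_loss.items():
    print(f"epoch {epoch}: validation loss {loss:.4f} -> perplexity {perplexity(loss):.3f}")
```

The validation loss rises from epoch 1 to epoch 3 while the training loss falls, which may indicate overfitting. Note that the trainer evaluates on the `test` split here.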

In [59]:
# Save the fine-tuned model to the output_dir set in training_args
trainer_specific_masking.save_model()