Author: Zhengyong Chen

### Baseline
This notebook fine-tunes base BERT on the phrase/passage spoiler data.
Most of the models and APIs come from Hugging Face. The following tutorial was a helpful reference:
https://huggingface.co/docs/transformers/tasks/question_answering

### Environment

Install the required libraries.

In [None]:
!pip install transformers
!pip install bert-score
!pip install datasets
!pip install accelerate
!pip install evaluate


Mount Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

Set the random seeds for reproducibility.

In [None]:
import torch
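# Print the GPU name only when a GPU is available (the call raises an error on CPU-only machines)
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))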
SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
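
The cell above seeds PyTorch only. As a minimal sketch, assuming you also want Python's `random` module and NumPy seeded, a helper along the following lines could be used. The `set_seed` name here is a hypothetical local helper, not something defined elsewhere in this notebook.

```python
import random

def set_seed(seed):
    """Seed every random number generator that is available in this environment."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is not installed, so there is nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
    except ImportError:
        pass  # PyTorch is not installed, so there is nothing to seed

# Seeding twice with the same value reproduces the same draw
set_seed(1234)
first = random.random()
set_seed(1234)
second = random.random()
print(first == second)
```

Even with all generators seeded, some GPU operations can remain non-deterministic, so small run-to-run differences are still possible.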

Set the device

In [None]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

Set your model. You can use:

*   bert-base-cased
*   bert-base-uncased
*   bert-large-uncased
*   bert-large-cased


In [None]:
model_name="bert-base-cased"

Set which type of data you want to train on:
*   'phrase'
*   'passage'



In [None]:
type_of_data='phrase'

### Data Preprocessing
Get the training and validation files from the local directory. Please check that the paths are correct.

In [None]:
# Set the paths to your training and validation JSONL files
training_file = "/content/gdrive/MyDrive/cse_635/group_project/data/train.jsonl"
validation_file="/content/gdrive/MyDrive/cse_635/group_project/data/validation.jsonl"

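A function that concatenates the array of paragraphs into one string and returns it together with the start index of the answer in that string. The start index is computed from the paragraph index and the character offset stored in the entry's spoiler position.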

In [None]:
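def getanswers(strs, position):
    # position[0] is the spoiler start: (paragraph index, character offset in that paragraph)
    nums = position[0][0]
    start = 0
    for i in range(len(strs)):
        if nums == i:
            # Reached the paragraph that contains the spoiler: add the in-paragraph offset
            start += position[0][1]
            break
        # Otherwise skip over the whole paragraph
        start += len(strs[i])

    # Paragraphs are joined without a separator, so the offsets stay aligned
    return ''.join(strs), start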
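
As a quick check of the offset arithmetic, the sketch below applies the same idea to a made-up two-paragraph context. The sample paragraphs and the `flat_start` helper are illustrative assumptions, not part of the dataset or of this notebook.

```python
def flat_start(paragraphs, position):
    """Return the joined context and the spoiler's start offset inside it.

    position[0] is assumed to be (paragraph index, character offset).
    """
    paragraph_idx, char_offset = position[0]
    # Characters contributed by every paragraph before the spoiler's paragraph
    preceding = sum(len(p) for p in paragraphs[:paragraph_idx])
    return "".join(paragraphs), preceding + char_offset

# Hypothetical data: the spoiler "blue" starts at character 11 of paragraph 1
paragraphs = ["The sky. ", "The sea is blue."]
position = [[1, 11], [1, 15]]
context, start = flat_start(paragraphs, position)
print(start)
print(context[start:start + 4])
```

The first paragraph is 9 characters long, so the flattened start index is 9 + 11 = 20, and slicing the joined context at that index recovers the spoiler text.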

A dataset class that reads the file at the given path and extracts the data. It stores only the question, context, answer, and answer start index for each entry. The 'type' parameter tells the class which type of data to load: in this case, passage or phrase.

In [None]:
import json
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, filepath, type):
        self.data = []
        self.max_length=0
        self.id=0
        with open(filepath, 'r') as f:
            for line in f:
                json_obj = json.loads(line)
                if json_obj['tags'][0]==type:
                    question = json_obj['postText'][0]
                    context = json_obj['targetParagraphs']
                    answer = json_obj['spoiler']
                    strs,start=getanswers(context,json_obj['spoilerPositions'][0])
                    context=strs
                    extracted_data = {
                        'answers': {'answer_start': [start], 'text': answer},
                        'context': context,
                        'question': question,
                        'id':str(self.id),
                    }
                    self.data.append(extracted_data)
                    # question is already a string, so measure its full length (in characters)
                    self.max_length=max(self.max_length,len(context)+len(question))
                self.id+=1

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]
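
Because the start index is derived from paragraph offsets, it can be worth confirming that it really points at the answer text. The following is a minimal sketch of such a check; `misaligned_ids` is a hypothetical helper and the two entries are made up, although they follow the layout that MyDataset stores.

```python
def misaligned_ids(entries):
    """Return ids of entries whose answer_start does not point at the answer text."""
    bad = []
    for entry in entries:
        start = entry["answers"]["answer_start"][0]
        text = entry["answers"]["text"][0]
        if entry["context"][start:start + len(text)] != text:
            bad.append(entry["id"])
    return bad

# Hypothetical entries: the first is aligned, the second is not
entries = [
    {"id": "0", "context": "The sea is blue.", "question": "q",
     "answers": {"answer_start": [11], "text": ["blue"]}},
    {"id": "1", "context": "The sea is blue.", "question": "q",
     "answers": {"answer_start": [4], "text": ["blue"]}},
]
bad_ids = misaligned_ids(entries)
print(bad_ids)
```

On the real data, the same check could be run as `misaligned_ids(training.data)` once the dataset has been loaded.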

Load the datasets using the MyDataset class just defined.

In [None]:
training = MyDataset(training_file,type_of_data)
validation = MyDataset(validation_file,type_of_data)

Check how the data is stored.

In [None]:
print(training[0])
print(validation[0])

{'answers': {'answer_start': [0], 'text': ['2070']}, 'context': '2070 is shaping up to be a great year for Mother Earth.That\'s when NASA scientists are predicting the hole in the ozone layer might finally make a full recovery. Researchers announced their conclusion, in addition to other findings, in a presentation Wednesday during the annual American Geophysical Union meeting in San Francisco.The team of scientists specifically looked at the chemical composition of the ozone hole, which has shifted in both size and depth since the passing of the Montreal Protocol in 1987. The agreement banned its 197 signatory countries from using chemicals, like chlorofluorocarbons (CFCs), that break down into chlorine in the upper atmosphere and harm the ozone layer.They found that, while levels of chlorine in the atmosphere have indeed decreased as a result of the protocol, it\'s too soon to tie them to a healthier ozone layer."Ozone holes with smaller areas and a larger total amount of ozone are n

### Tokenization

Get the tokenizer from Hugging Face. You can use bert-base-cased, bert-base-uncased, bert-large-uncased, or bert-large-cased.

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

Convert our datasets to Hugging Face datasets so that the preprocess functions can use them.

In [None]:
from datasets import Dataset
dset_training = Dataset.from_list(training)
dset_training=dset_training.with_format("torch")

dset_validation = Dataset.from_list(validation)
dset_validation=dset_validation.with_format("torch")

Check how the tokenizer tokenizes the data.

In [None]:
context = dset_training[0]["context"]
question = dset_training[0]["question"]

inputs = tokenizer(question, context)
tokenizer.decode(inputs["input_ids"])

'[CLS] NASA sets date for full recovery of ozone hole [SEP] 2070 is shaping up to be a great year for Mother Earth. That\'s when NASA scientists are predicting the hole in the ozone layer might finally make a full recovery. Researchers announced their conclusion, in addition to other findings, in a presentation Wednesday during the annual American Geophysical Union meeting in San Francisco. The team of scientists specifically looked at the chemical composition of the ozone hole, which has shifted in both size and depth since the passing of the Montreal Protocol in 1987. The agreement banned its 197 signatory countries from using chemicals, like chlorofluorocarbons ( CFCs ), that break down into chlorine in the upper atmosphere and harm the ozone layer. They found that, while levels of chlorine in the atmosphere have indeed decreased as a result of the protocol, it\'s too soon to tie them to a healthier ozone layer. " Ozone holes with smaller areas and a larger total amount of ozone are

See how the return_overflowing_tokens parameter splits a long input entry into overlapping chunks.

In [None]:
inputs = tokenizer(
        question,
        context,
        max_length=384,
        truncation="only_second",
        stride=128,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",      
    )
for ids in inputs["input_ids"]:
    print(tokenizer.decode(ids))

[CLS] NASA sets date for full recovery of ozone hole [SEP] 2070 is shaping up to be a great year for Mother Earth. That's when NASA scientists are predicting the hole in the ozone layer might finally make a full recovery. Researchers announced their conclusion, in addition to other findings, in a presentation Wednesday during the annual American Geophysical Union meeting in San Francisco. The team of scientists specifically looked at the chemical composition of the ozone hole, which has shifted in both size and depth since the passing of the Montreal Protocol in 1987. The agreement banned its 197 signatory countries from using chemicals, like chlorofluorocarbons ( CFCs ), that break down into chlorine in the upper atmosphere and harm the ozone layer. They found that, while levels of chlorine in the atmosphere have indeed decreased as a result of the protocol, it's too soon to tie them to a healthier ozone layer. " Ozone holes with smaller areas and a larger total amount of ozone are no
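
The chunking above can be pictured as a sliding window in which consecutive chunks share `stride` tokens. The sketch below illustrates that idea on a plain list. It is a simplified illustration, not the tokenizer's implementation, which also places the question and the special tokens in every chunk. The `chunk_with_overlap` helper and its parameter names are assumptions made for this example.

```python
def chunk_with_overlap(tokens, window, overlap):
    """Split tokens into windows of at most `window` items that share `overlap` items."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be in [0, window)")
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the current window already reaches the end
        start += window - overlap  # step forward while keeping `overlap` tokens
    return chunks

tokens = list(range(10))
chunks = chunk_with_overlap(tokens, window=4, overlap=2)
for chunk in chunks:
    print(chunk)
```

With a window of 4 and an overlap of 2, the ten tokens are split into four chunks, each repeating the last two tokens of the previous one. The overlap is what lets an answer that straddles a chunk boundary still appear whole in at least one chunk.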

A preprocess function for training. It tokenizes each entry first, then computes the start and end token indices of the answer in the context for the loss function. If a chunk does not fully contain the answer, both indices are set to zero.



In [None]:
max_length = 384
stride = 128

def preprocess_function_training(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
        
    )

    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer = answers[sample_idx]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label is (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
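
The core of the labelling step is the mapping from a character span to token indices through the offset mapping. The following is a minimal sketch of that mapping in isolation. It assumes the offsets list covers only the context tokens of one chunk, so the returned indices are relative to that list, whereas the function above works on positions in the full input, where index 0 is the [CLS] token. The helper name and the sample offsets are made up.

```python
def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to (start_token, end_token), or (0, 0) if not fully covered.

    offsets is a list of (char_start, char_end) pairs, one per context token.
    """
    if not offsets or offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return 0, 0
    # Last token that starts at or before the answer start
    idx = 0
    while idx < len(offsets) and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1
    # First token that ends at or after the answer end
    idx = len(offsets) - 1
    while idx >= 0 and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return start_token, end_token

# Hypothetical offsets for the context "The sea is blue."
offsets = [(0, 3), (4, 7), (8, 10), (11, 15), (15, 16)]
inside = char_span_to_token_span(offsets, 11, 15)   # the answer "blue"
outside = char_span_to_token_span(offsets, 11, 40)  # the answer runs past this chunk
print(inside, outside)
```

The first call lands on the single token covering characters 11 to 15, while the second falls back to (0, 0) because the span extends beyond the last offset.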

Use the map function with preprocess_function_training to tokenize our training data for BERT.

In [None]:
tokenized_training = dset_training.map(preprocess_function_training, batched=True,remove_columns=dset_training.column_names)

Map:   0%|          | 0/1367 [00:00<?, ? examples/s]

A preprocess function for validation. It is similar to preprocess_function_training, but it also adds an example_id that is used for evaluation.

In [None]:
def preprocess_function_validation(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_map = inputs.pop("overflow_to_sample_mapping")
    example_ids = []

    for i in range(len(inputs["input_ids"])):
        sample_idx = sample_map[i]
        example_ids.append(examples["id"][sample_idx])

        sequence_ids = inputs.sequence_ids(i)
        offset = inputs["offset_mapping"][i]
        inputs["offset_mapping"][i] = [
            o if sequence_ids[k] == 1 else None for k, o in enumerate(offset)
        ]

    inputs["example_id"] = example_ids

    offset_mapping = inputs["offset_mapping"]
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer = answers[sample_idx]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label is (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions

    return inputs

Use the map function with preprocess_function_validation() to tokenize our validation data for BERT.

In [None]:
tokenized_validation = dset_validation.map(preprocess_function_validation, batched=True,remove_columns=dset_validation.column_names)

Map:   0%|          | 0/335 [00:00<?, ? examples/s]

### Training

Load the model onto the selected device and set the output directory. You can save your trained model to your own path by setting the variable output_dir.

In [None]:
from transformers import BertForQuestionAnswering

# Load the checkpoint selected in model_name above, so the model matches the tokenizer
model = BertForQuestionAnswering.from_pretrained(model_name).to(device)
output_dir="/content/gdrive/MyDrive/cse_635/individual_project/code/baseline/bert_phrase"

Downloading pytorch_model.bin:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForQuestionAnswering: ['cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-cased and a

Set some hyperparameters for fine-tuning.
Decrease per_device_train_batch_size and per_device_eval_batch_size if you run out of memory during training.

In [None]:
from transformers import TrainingArguments,Trainer

args = TrainingArguments(
    output_dir=output_dir,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="epoch",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    learning_rate=2e-5,
    num_train_epochs=5,
    weight_decay=0.01,
    fp16=True,
)

Start training

As we can see, the validation loss increases at epoch 3. Therefore, we will use the model checkpoint from epoch 2 for our evaluation.

In [None]:
best_checkpoint = BertForQuestionAnswering.from_pretrained("/content/gdrive/MyDrive/cse_635/individual_project/code/baseline/bert_phrase/checkpoint-414")
checkpoint_tokenizer=AutoTokenizer.from_pretrained("/content/gdrive/MyDrive/cse_635/individual_project/code/baseline/bert_phrase/checkpoint-414")
trainer_best_checkpoint = Trainer(
    model=best_checkpoint,
    args=args,
    train_dataset=tokenized_training,
    eval_dataset=tokenized_validation,
    tokenizer=checkpoint_tokenizer,
)
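
The checkpoint choice described above amounts to picking the epoch with the lowest validation loss. A minimal sketch of that rule follows. The loss values are made up for illustration, since the training log is not preserved in this notebook, and `best_epoch` is a hypothetical helper.

```python
def best_epoch(eval_losses):
    """Return the 1-based epoch with the lowest validation loss (earliest on ties)."""
    if not eval_losses:
        raise ValueError("eval_losses is empty")
    best_index = min(range(len(eval_losses)), key=eval_losses.__getitem__)
    return best_index + 1

# Made-up per-epoch validation losses, for illustration only
eval_losses = [2.10, 1.85, 1.97, 2.30, 2.65]
chosen = best_epoch(eval_losses)
print(chosen)
```

TrainingArguments also offers a load_best_model_at_end option that can automate this kind of selection, which may be worth trying instead of choosing the checkpoint by hand.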

### Evaluation

Create the BLEU, METEOR, and BERTScore metrics using the Hugging Face evaluate interface.

In [None]:
import evaluate

bleu = evaluate.load('bleu')
meteor = evaluate.load('meteor')
bertscore=evaluate.load('bertscore')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Set model_dir to the path of your trained model.

Because of the storage limit of Google Drive, we lost all the trained models, so model_dir below falls back to the pretrained model_name checkpoint.

In [None]:
from transformers import BertForQuestionAnswering,Trainer
model_dir=model_name #The path of the trained model.
best_checkpoint = BertForQuestionAnswering.from_pretrained(model_dir).to(device)
checkpoint_tokenizer=AutoTokenizer.from_pretrained(model_dir)
trainer_best_checkpoint = Trainer(
    model=best_checkpoint,
    train_dataset=tokenized_training,
    eval_dataset=tokenized_validation,
    tokenizer=checkpoint_tokenizer,
)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForQuestionAnswering: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-cased and a

A function used for evaluation. For each entry, it takes the 20 best start and end positions according to the logits, then chooses the valid span with the highest combined logit score.
It then uses the metrics loaded above to calculate the scores.

In [None]:
from tqdm.auto import tqdm
import collections
import numpy as np
n_best = 20
if type_of_data=='phrase':
  max_answer_length=8
else:
  max_answer_length=50
print("The maximum length of the predicted answer is "+str(max_answer_length))

def compute_metrics(start_logits, end_logits, features, examples):
    example_to_features = collections.defaultdict(list)
    for idx, feature in enumerate(features):
        example_to_features[feature["example_id"]].append(idx)
    predicted_answers = []
    for example in tqdm(examples):
        example_id = example["id"]
        context = example["context"]
        answers = []
        # Loop through all features associated with that example
        for feature_index in example_to_features[example_id]:
            start_logit = start_logits[feature_index]
            end_logit = end_logits[feature_index]
            offsets = features[feature_index]["offset_mapping"]

            start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()
            end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Skip answers that are not fully in the context
                    if offsets[start_index] is None or offsets[end_index] is None:
                        continue
                    # Skip answers with a length that is either < 0 or > max_answer_length
                    if (
                        end_index < start_index
                        or end_index - start_index + 1 > max_answer_length
                    ):
                        continue
                    answer = {
                        "text": context[offsets[start_index][0] : offsets[end_index][1]],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                    }
                    answers.append(answer)
        # Select the answer with the best score
        if len(answers) > 0:
            best_answer = max(answers, key=lambda x: x["logit_score"])
            predicted_answers.append(best_answer["text"])
        else:
            predicted_answers.append("")
    
    print(predicted_answers[0:20])
    theoretical_answers = [ex["answers"]["text"][0] for ex in examples]
    print(theoretical_answers[0:20])
    bleu_score=bleu.compute(predictions=predicted_answers, references=theoretical_answers)
    meteor_score=meteor.compute(predictions=predicted_answers, references=theoretical_answers)
    bertscore_score=bertscore.compute(predictions=predicted_answers, references=theoretical_answers,lang="en",model_type="distilbert-base-uncased")
    return bleu_score,meteor_score,bertscore_score

The maximum length of the predicted answer is 8
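
The span search inside compute_metrics can be illustrated on its own with plain lists. The sketch below keeps only the scoring logic: take the top start and end candidates, discard spans that are reversed or too long, and keep the highest sum of logits. The logits are made up and `best_span` is a hypothetical helper. The offset-mapping check that skips tokens outside the context is left out here.

```python
def top_indices(logits, n_best):
    """Indices of the n_best largest logits, best first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:n_best]

def best_span(start_logits, end_logits, n_best, max_answer_length):
    """Return (start, end, score) of the best-scoring valid span, or None."""
    best = None
    for s in top_indices(start_logits, n_best):
        for e in top_indices(end_logits, n_best):
            # Skip spans that end before they start or exceed the length limit
            if e < s or e - s + 1 > max_answer_length:
                continue
            score = start_logits[s] + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best

# Made-up logits for a six-token input
start_logits = [0.1, 0.2, 3.0, 0.3, 0.1, 2.5]
end_logits = [0.0, 4.0, 0.2, 2.0, 1.5, 0.1]
result = best_span(start_logits, end_logits, n_best=3, max_answer_length=2)
print(result)
```

In this example the highest end logit (index 1) cannot be paired with the highest start logit (index 2) because the span would be reversed, so the search settles on tokens 2 to 3 with a combined score of 5.0.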


In [None]:
predictions, _, _ = trainer_best_checkpoint.predict(tokenized_validation)
start_logits, end_logits = predictions
bl, me, be= compute_metrics(start_logits, end_logits, tokenized_validation, dset_validation)
print(bl)
print(me)
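# BERTScore returns one value per example, so report the mean over the validation set
print("bert score: {precision: "+ str(np.mean(be['precision']))+" recall: "+str(np.mean(be['recall']))+" f1: "+str(np.mean(be['f1']))+"}")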

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


  0%|          | 0/335 [00:00<?, ?it/s]

['and ask them', 'the', 'redients1 t', 'Shima and Mr.', 'Mosquito Control Section', 'Wei', 'who started earlier, supervisors', 'firms iOS 9.3', 'Round', 'ala', 'placed illustrations of cats throughout the', 'Steg', 'to mention Beyon', 'from', 'hosted a sale of Bob and', 'the', 'in the image you see here', 'was part and parcel', 'to endorse Clinton this year', 'ous heap.']
['20%', 'Sprite', 'Smoky Paprika-Baked Garbanzo Beans', 'Anthony Bourdain', 'Dibrom', 'They don’t fart', 'starts later', 'bricking iPad Pros', 'Stace Nelson', 'reduced fat sour cream', 'Edward Gorey', 'Rag & Bone', 'pixie cut', 'southern flying squirrel', "Hope's antique cabinet", 'August 6th', 'perfectly average', '"but"', 'The Arizona Republic', 'apocalyptic omen']
{'bleu': 0.0, 'precisions': [0.025919732441471572, 0.003484320557491289, 0.0016286644951140066, 0.0], 'brevity_penalty': 1.0, 'length_ratio': 1.4004683840749415, 'translation_length': 1196, 'reference_length': 854}
{'meteor': 0.016875759719428323}
bert sc
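
The BLEU value of 0.0 in the output above follows directly from the reported precisions: BLEU combines the 1-gram to 4-gram precisions with a geometric mean, so a single zero precision drives the whole score to zero. The sketch below reproduces that arithmetic, assuming uniform weights and no smoothing. `bleu_from_precisions` is a hypothetical helper, not the evaluate library's API.

```python
import math

def bleu_from_precisions(precisions, brevity_penalty):
    """Combine n-gram precisions into BLEU with a uniform-weight geometric mean."""
    if min(precisions) <= 0:
        return 0.0  # log(0) is undefined; without smoothing the score is 0
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return brevity_penalty * math.exp(log_mean)

# Precisions and brevity penalty copied from the output above
precisions = [0.025919732441471572, 0.003484320557491289, 0.0016286644951140066, 0.0]
reported = bleu_from_precisions(precisions, 1.0)
print(reported)

# Same 1-gram to 3-gram precisions with a hypothetical non-zero 4-gram precision
hypothetical = bleu_from_precisions(precisions[:3] + [0.001], 1.0)
print(hypothetical > 0)
```

Since the gold phrase spoilers are short, 4-gram matches are rare, which makes unsmoothed BLEU a harsh measure for this type of data.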