<a href="https://colab.research.google.com/github/feiyu0214/ColossalAI/blob/main/squad.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install transformers datasets evaluate

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)

In [None]:
import collections
import datasets
import datetime
from evaluate import load
import numpy as np
import os
import transformers
import random
import torch
torch.backends.cudnn.benchmark = True

We load the SQuAD dataset from the Hugging Face Hub.

In [None]:
raw_datasets = datasets.load_dataset("squad")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/7.62k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/14.5M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/1.82M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/87599 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10570 [00:00<?, ? examples/s]
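
Each SQuAD record pairs a question with a context paragraph and stores its answers as parallel lists of answer strings and character offsets into the context. As a minimal sketch, the hand-written record below mimics that schema (its content is invented for illustration and is not taken from the dataset) and shows how `answer_start` relates to `context`:

```python
# Hypothetical record that follows the SQuAD field layout; the text is made up.
context = "The notebook fine-tunes BERT on SQuAD for three epochs."
answer_text = "three epochs"

example = {
    "id": "toy-0001",
    "title": "Toy",
    "context": context,
    "question": "How long is the model trained?",
    "answers": {
        "text": [answer_text],
        # Character offset of the answer inside the context.
        "answer_start": [context.index(answer_text)],
    },
}

# The preprocessing code later rebuilds the character span in the same way.
start_char = example["answers"]["answer_start"][0]
end_char = start_char + len(example["answers"]["text"][0])
recovered = example["context"][start_char:end_char]
print(recovered)
```

The training preprocessing relies on exactly this relationship: `end_char` is derived from `answer_start` plus the answer length.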

`bert-base-cased` is selected as the model to fine-tune. Another example available in the tutorial is `distilbert-base-cased-distilled-squad`.

In [None]:
model_checkpoint = "bert-base-cased"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_checkpoint)
model = transformers.AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


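The preprocessing for the training dataset is defined here. `span_masking` replaces one random contiguous span of tokens in each feature with `[MASK]`, and `preprocess_training_examples` tokenizes the examples, applies the masking, and computes the answer token positions.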

In [None]:
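def span_masking(inputs, max_span_length=10):
    # Features are padded to max_length, so input_ids forms a rectangular array.
    tokenized_inputs = np.array(inputs["input_ids"])

    for input_ids in tokenized_inputs:
        num_tokens = len(input_ids)
        # randint(0, num_tokens - 2) needs at least 3 tokens to have a valid range.
        if num_tokens < 3:
            continue
        # One span per feature; randint's upper bound is exclusive, so the span
        # covers at most max_span_length - 1 tokens.
        start_idx = np.random.randint(0, num_tokens - 2)
        span_length = np.random.randint(2, min(max_span_length, num_tokens - start_idx))

        # Mask in place, leaving [PAD], [CLS], and [SEP] untouched.
        for idx in range(start_idx, start_idx + span_length):
            if input_ids[idx] not in [tokenizer.pad_token_id, tokenizer.cls_token_id, tokenizer.sep_token_id]:
                input_ids[idx] = tokenizer.mask_token_id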

    inputs["input_ids"] = tokenized_inputs.tolist()
    return inputs
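
To see what the masking step does without loading a tokenizer, here is a minimal sketch of the same idea on a plain list of integers. The special-token ids and the helper name `mask_one_span` are placeholders chosen for illustration; in the notebook the ids come from `tokenizer`.

```python
import numpy as np

# Placeholder ids for illustration only; real values come from the tokenizer.
PAD, CLS, SEP, MASK = 0, 101, 102, 103


def mask_one_span(ids, rng, max_span_length=10):
    """Replace one random contiguous span with MASK, skipping special tokens."""
    ids = list(ids)
    n = len(ids)
    if n < 3:
        return ids
    start = int(rng.integers(0, n - 2))  # upper bound is exclusive
    length = int(rng.integers(2, min(max_span_length, n - start)))
    for i in range(start, start + length):
        if ids[i] not in (PAD, CLS, SEP):
            ids[i] = MASK
    return ids


# Toy feature: [CLS] question [SEP] context [SEP] [PAD] [PAD]
toy_ids = [CLS, 7, 8, 9, SEP, 11, 12, 13, 14, SEP, PAD, PAD]
rng = np.random.default_rng(0)
masked_ids = mask_one_span(toy_ids, rng)
print(masked_ids)
```

Two properties of this scheme are worth keeping in mind: the span is sampled over the whole padded sequence, so it can fall partly or entirely on padding and leave a feature unchanged, and it can cover question or answer tokens as well as other context tokens.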


In [None]:
max_length = 384
stride = 128


def preprocess_training_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        truncation="only_second",
        max_length=max_length,
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Apply span masking
    inputs = span_masking(inputs)

    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer = answers[sample_idx]
        start_char = answer["answer_start"][0]
        end_char = start_char + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        context_start = next(idx for idx, sid in enumerate(sequence_ids) if sid == 1)
        context_end = next(idx for idx, sid in reversed(list(enumerate(sequence_ids))) if sid == 1)

        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)
            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
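
The label computation above converts a character span into token indices by scanning the offset mapping. The sketch below isolates that step on hand-written offsets for a toy sentence; the function name `char_span_to_token_span` and the toy offsets are illustrative assumptions, not tokenizer output.

```python
def char_span_to_token_span(offsets, context_start, context_end, start_char, end_char):
    """Map a character span to (start_token, end_token); (0, 0) if not fully inside."""
    # Answer not fully contained in this window: label with the [CLS] position.
    if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
        return 0, 0
    # Last token whose start is <= start_char.
    idx = context_start
    while idx <= context_end and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1
    # First token (from the right) whose end is >= end_char.
    idx = context_end
    while idx >= context_start and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return start_token, end_token


# Toy layout: [CLS] Where ? [SEP] Paris is the capital of France [SEP]
toy_context = "Paris is the capital of France"
toy_offsets = [
    (0, 0), (0, 5), (5, 6), (0, 0),          # [CLS], question tokens, [SEP]
    (0, 5), (6, 8), (9, 12), (13, 20), (21, 23), (24, 30),  # context tokens
    (0, 0),                                   # [SEP]
]

# Answer "the capital" spans characters 9..20 of the context.
inside = char_span_to_token_span(toy_offsets, 4, 9, 9, 20)
# If the window ended at "of" (token 8), the answer "France" would fall outside it.
outside = char_span_to_token_span(toy_offsets, 4, 8, 24, 30)
print(inside, outside)
```

Features whose window does not contain the full answer get `(0, 0)`, which is how the training function labels overflow windows that miss the answer.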


In [None]:
# max_length = 384
# stride = 128


# def preprocess_training_examples(examples):
#     questions = [q.strip() for q in examples["question"]]
#     inputs = tokenizer(
#         questions,
#         examples["context"],
#         truncation="only_second",
#         max_length=max_length,
#         stride=stride,
#         return_overflowing_tokens=True,
#         return_offsets_mapping=True,
#         padding="max_length",
#     )
#     offset_mapping = inputs.pop("offset_mapping")
#     sample_map = inputs.pop("overflow_to_sample_mapping")
#     answers = examples["answers"]
#     start_positions = []
#     end_positions = []
#     for i, offset in enumerate(offset_mapping):
#         sample_idx = sample_map[i]
#         answer = answers[sample_idx]
#         start_char = answer["answer_start"][0]
#         end_char = answer["answer_start"][0] + len(answer["text"][0])
#         sequence_ids = inputs.sequence_ids(i)
#         idx = 0
#         while sequence_ids[idx] != 1:
#             idx += 1
#         context_start = idx
#         while sequence_ids[idx] == 1:
#             idx += 1
#         context_end = idx - 1
#         if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
#             start_positions.append(0)
#             end_positions.append(0)
#         else:
#             idx = context_start
#             while idx <= context_end and offset[idx][0] <= start_char:
#                 idx += 1
#             start_positions.append(idx - 1)
#             idx = context_end
#             while idx >= context_start and offset[idx][1] >= end_char:
#                 idx -= 1
#             end_positions.append(idx + 1)
#     inputs["start_positions"] = start_positions
#     inputs["end_positions"] = end_positions
#     return inputs

In [None]:
train_dataset = raw_datasets["train"].map(
    preprocess_training_examples,
    batched=True,
    num_proc=os.cpu_count(),
    remove_columns=raw_datasets["train"].column_names,
)

Map (num_proc=12):   0%|          | 0/87599 [00:00<?, ? examples/s]
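
Because long contexts are split into overlapping windows (`max_length = 384`, `stride = 128`), one example can produce several features. The sketch below approximates where those windows start, assuming that `stride` is the number of context tokens shared between consecutive windows and that each window has a fixed context budget. The helper `window_starts` and the token counts are hypothetical, and the real tokenizer may differ in edge cases.

```python
def window_starts(n_context_tokens, budget, stride):
    """Approximate start offsets of overlapping context windows."""
    if stride >= budget:
        raise ValueError("stride must be smaller than the per-window budget")
    starts = [0]
    # Keep adding windows until the last one reaches the end of the context.
    while starts[-1] + budget < n_context_tokens:
        starts.append(starts[-1] + budget - stride)
    return starts


max_length = 384
stride = 128
question_tokens = 12        # assumed question length for this illustration
special_tokens = 3          # [CLS] and two [SEP] for a BERT-style pair
budget = max_length - question_tokens - special_tokens

long_context = window_starts(800, budget, stride)
short_context = window_starts(100, budget, stride)
print(budget, long_context, short_context)
```

Under these assumptions an 800-token context yields three features, while a context that fits in one window yields a single feature.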

The preprocessing function for the validation dataset is defined here.

In [None]:
def preprocess_validation_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )
    sample_map = inputs.pop("overflow_to_sample_mapping")
    example_ids = []
    for i in range(len(inputs["input_ids"])):
        sample_idx = sample_map[i]
        example_ids.append(examples["id"][sample_idx])
        sequence_ids = inputs.sequence_ids(i)
        offset = inputs["offset_mapping"][i]
        inputs["offset_mapping"][i] = [
            o if sequence_ids[k] == 1 else None for k, o in enumerate(offset)
        ]
    inputs["example_id"] = example_ids
    return inputs
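
The validation function sets the offsets of every non-context token to `None`. A minimal sketch with hand-written values (not tokenizer output) shows the effect and why it matters: question tokens have character offsets too, but they index into the question string, so leaving them in place could map a predicted span onto the wrong text.

```python
# Toy feature: [CLS] q q [SEP] c c c [SEP]
toy_sequence_ids = [None, 0, 0, None, 1, 1, 1, None]
toy_offsets = [(0, 0), (0, 5), (5, 6), (0, 0), (0, 5), (6, 8), (9, 12), (0, 0)]

# Keep offsets only where the token belongs to the context (sequence id 1).
masked_offsets = [
    offset if sequence_id == 1 else None
    for sequence_id, offset in zip(toy_sequence_ids, toy_offsets)
]
print(masked_offsets)
```

During post-processing, any candidate span whose start or end offset is `None` can then be skipped with a simple check.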

In [None]:
validation_dataset = raw_datasets["validation"].map(
    preprocess_validation_examples,
    batched=True,
    num_proc=os.cpu_count(),
    remove_columns=raw_datasets["validation"].column_names,
)

Map (num_proc=12):   0%|          | 0/10570 [00:00<?, ? examples/s]

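The next cell defines the metric function for the validation dataset. Note that it cannot be used directly with the Trainer API, because computing a validation loss is not meaningful for the SQuAD task: the validation features carry no start and end positions, and the metric needs the offset mappings and the original examples to turn logits into answer text.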

In [None]:
n_best = 20
max_answer_length = 30
metric = load("squad")


def compute_metrics(start_logits, end_logits, features, examples):
    examples_to_features = collections.defaultdict(list)
    for idx, feature in enumerate(features):
        examples_to_features[feature["example_id"]].append(idx)
    predicted_answers = []
    for example in examples:
        example_id = example["id"]
        context = example["context"]
        answers = []
        for feature_idx in examples_to_features[example_id]:
            start_logit = start_logits[feature_idx]
            end_logit = end_logits[feature_idx]
            offsets = features[feature_idx]["offset_mapping"]
            start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()
            end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    if offsets[start_index] is None or offsets[end_index] is None:
                        continue
                    if (
                        end_index < start_index
                        or end_index - start_index + 1 > max_answer_length
                    ):
                        continue
                    answer = {
                        "text": context[
                            offsets[start_index][0] : offsets[end_index][1]
                        ],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                    }
                    answers.append(answer)
        if len(answers) > 0:
            best_answer = max(answers, key=lambda x: x["logit_score"])
            predicted_answers.append(
                {"id": example_id, "prediction_text": best_answer["text"]}
            )
        else:
            predicted_answers.append({"id": example_id, "prediction_text": ""})
    theoretical_answers = [
        {"id": ex["id"], "answers": ex["answers"]} for ex in examples
    ]
    return metric.compute(predictions=predicted_answers, references=theoretical_answers)

Downloading builder script:   0%|          | 0.00/4.53k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/3.32k [00:00<?, ?B/s]
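
The core of `compute_metrics` is the n-best search over start and end positions. Here is a minimal sketch of that search for a single feature with made-up logits; the helper name `best_span` and all numbers are illustrative assumptions.

```python
import numpy as np


def best_span(start_logit, end_logit, offsets, context, n_best=20, max_answer_length=30):
    """Return the highest-scoring valid answer span for one feature."""
    start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()
    end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()
    best = {"text": "", "logit_score": float("-inf")}
    for s in start_indexes:
        for e in end_indexes:
            # Skip positions outside the context (offsets set to None).
            if offsets[s] is None or offsets[e] is None:
                continue
            # Skip reversed or overly long spans.
            if e < s or e - s + 1 > max_answer_length:
                continue
            score = float(start_logit[s] + end_logit[e])
            if score > best["logit_score"]:
                best = {"text": context[offsets[s][0] : offsets[e][1]], "logit_score": score}
    return best


toy_context = "Paris is the capital of France"
toy_offsets = [None, None, None, None,
               (0, 5), (6, 8), (9, 12), (13, 20), (21, 23), (24, 30), None]
# The largest start logit sits on a question token (index 1) and must be ignored.
toy_start = np.array([0.1, 5.0, 0.0, 0.0, 1.0, 0.5, 4.0, 0.2, 0.0, 0.3, 0.0])
toy_end = np.array([0.0, 0.2, 0.1, 0.0, 0.5, 3.0, 0.4, 4.5, 0.0, 1.0, 0.0])

toy_best = best_span(toy_start, toy_end, toy_offsets, toy_context, n_best=3)
print(toy_best)
```

In the full function the candidates from all features of an example are pooled before the maximum is taken, so the best answer can come from any of its overlapping windows.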

Define the hyperparameters.

In [None]:
args = transformers.TrainingArguments(
    "bert-finetuned-squad",
    per_device_train_batch_size=16,
    eval_strategy="no",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    fp16=True,
    dataloader_num_workers=4,
)
trainer = transformers.Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    tokenizer=tokenizer,
)



Start training. The training metrics are:
```
{'train_runtime': 3494.2459, 'train_samples_per_second': 76.179, 'train_steps_per_second': 4.762, 'total_flos': 5.216534983896422e+16, 'train_loss': 0.8796716472938924, 'epoch': 3.0}
```

In [None]:
trainer.train()

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.



| Step | Training Loss |
|------|---------------|
| 500 | 2.6522 |
| 1000 | 1.7288 |
| 1500 | 1.5383 |
| 2000 | 1.4371 |
| 2500 | 1.4082 |
| 3000 | 1.3378 |
| 3500 | 1.2806 |
| 4000 | 1.2572 |
| 4500 | 1.1959 |
| 5000 | 1.2305 |


TrainOutput(global_step=33276, training_loss=0.8807765796990159, metrics={'train_runtime': 3788.6216, 'train_samples_per_second': 70.26, 'train_steps_per_second': 8.783, 'total_flos': 5.216534983896422e+16, 'train_loss': 0.8807765796990159, 'epoch': 3.0})
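
The logged metrics can be cross-checked with a little arithmetic. The sketch below uses only numbers printed in this notebook. The variable names are made up for illustration, and reading the throughput ratio as "samples per optimizer step" is an assumption about how the Trainer reports these values.

```python
# Values copied from the TrainOutput above.
logged_run = {
    "global_step": 33276,
    "train_samples_per_second": 70.26,
    "train_steps_per_second": 8.783,
    "epoch": 3.0,
}
# Values copied from the summary quoted before the training cell.
quoted_summary = {
    "train_samples_per_second": 76.179,
    "train_steps_per_second": 4.762,
}

steps_per_epoch = logged_run["global_step"] / logged_run["epoch"]
logged_samples_per_step = (
    logged_run["train_samples_per_second"] / logged_run["train_steps_per_second"]
)
quoted_samples_per_step = (
    quoted_summary["train_samples_per_second"] / quoted_summary["train_steps_per_second"]
)

print(steps_per_epoch)
print(round(logged_samples_per_step), round(quoted_samples_per_step))
```

The two summaries imply different samples-per-step ratios (about 8 for the logged output and about 16 for the quoted summary), so they appear to come from runs with different effective batch sizes. The notebook does not state this, so treat it as an inference.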

Save the model to the local disk.

In [None]:
trainer.save_model(f"bert-finetuned-squad-{datetime.datetime.now().timestamp()}")

Evaluate the trained model on the validation dataset.

In [None]:
predictions, _, _ = trainer.predict(validation_dataset)
start_logits, end_logits = predictions
metrics = compute_metrics(
    start_logits, end_logits, validation_dataset, raw_datasets["validation"]
)
metrics

{'exact_match': 81.22043519394512, 'f1': 88.7018767233627}

The evaluation results are:
```
BERT: {'exact_match': 81.0406811731315, 'f1': 88.46436588113946}
BERT+SpanMask: {'exact_match': 81.22043519394512, 'f1': 88.7018767233627}
```
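
For readers unfamiliar with the two scores, the sketch below shows how exact match and token-level F1 are commonly computed for SQuAD v1.1, for a single prediction and a single reference. It is a simplified illustration with made-up strings, not the code of the `squad` metric loaded above, which also takes the maximum over all reference answers and reports averages scaled to 0-100.

```python
import collections
import re
import string


def normalize(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction, reference):
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((collections.Counter(pred_tokens) & collections.Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


em_same = exact_match("The Capital!", "capital")
em_partial = exact_match("the capital of France", "capital")
f1_partial = f1_score("the capital of France", "capital")
print(em_same, em_partial, f1_partial)
```

A prediction that contains the reference plus extra words scores zero on exact match but still earns partial F1 credit, which is why F1 is consistently higher than EM in the results above.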

Compared with the models on the ranking list, it ranks 49th on EM and 36th on F1.