<a href="https://colab.research.google.com/github/akbism/COVID-QA/blob/main/FineTuning/5_TPU_Second_Stage_Finetuning_roberta_base_squad2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook implements the second stage of fine-tuning on a TPU.

# Setting up Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')
%cd /content/gdrive/My\ Drive/Colab\ Notebooks/LJMU/covidqa/experiments/case_12_1_3/TPU

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
/content/gdrive/My Drive/Colab Notebooks/LJMU/covidqa/experiments/case_12_1_3/TPU


# Setting up Weights & Biases for tracking the training process
(Only works for GPU implementation)

In [None]:
# !pip install wandb
# # Flexible integration for any Python script
# import wandb
# import os

# # 1. Start a W&B run
# wandb.init(project='bertqa', entity='akbism')

# # 2. Save model inputs and hyperparameters
# config = wandb.config
# config.learning_rate = 0.01
# os.environ["WANDB_DISABLED"] = "false"

In [None]:
!grep MemTotal /proc/meminfo

MemTotal:       13302920 kB


In [None]:
!pip install tqdm==4.41.1
!pip install transformers datasets
!pip install cloud-tpu-client==0.10 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.7-cp37-cp37m-linux_x86_64.whl
VERSION = "1.8.1"
!curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py --version $VERSION

Collecting transformers
  Downloading transformers-4.9.1-py3-none-any.whl (2.6 MB)
Collecting datasets
  Downloading datasets-1.11.0-py3-none-any.whl (264 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting huggingface-hub==0.0.12
  Downloading huggingface_hub-0.0.12-py3-none-any.whl (37 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
Collecting tqdm>=4.27
  Downloading tqdm-4.62.0-py2.py3-none-any.whl (76 kB)

# Setting random seed

In [None]:
import random
import numpy as np
import torch
import os
def set_seed(seed):
    """Set seed"""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    os.environ["PYTHONHASHSEED"] = str(seed)

set_seed(1)

# Setting the parameters of the experiment

In [None]:
from datasets import load_dataset, load_metric, concatenate_datasets
import os

class Args:
    model_type = 'bert'
    model_name = 'deepset/roberta-base-squad2' #'dmis-lab/biobert-base-cased-v1.1-squad'
    tokenizer_name='roberta-base'
    output_dir = './result_a'

    data_dir = '/content/gdrive/My Drive/Colab Notebooks/LJMU/covidqa/biobert-pytorch/datasets/QA/SQuAD'
    train_file = 'train-v2.0_modified1.json'
    predict_file ='dev-v2.0_modified1.json'
    
    data_dir1 = '/content/gdrive/My Drive/Colab Notebooks/LJMU/covidqa/biobert-pytorch/datasets/QA/BioASQ/BioASQ-678b/'
    train_file1 = 'BioASQ-train-factoid-6_8b-full-annotated_modified1.json'
    predict_file1 ='BioASQ-train-factoid-7b-full-annotated_modified1.json'

    data_dir2 = '/content/gdrive/MyDrive/Colab Notebooks/LJMU/covidqa/biobert-pytorch/datasets/QA'
    predict_file2 = 'COVID-QA-Modified-Modified1.json'

    train_path = "./datasets/train/"
    dev_path = "./datasets/dev/"
    val_path = "./datasets/val/"

    save_model_path = "./qa_model/"

    max_length = 324 # The maximum length of a feature (question and context)
    # batch_size = 16
    per_device_train_batch_size = 32
    per_device_eval_batch_size = 32
    doc_stride = 128 # The allowed overlap (in tokens) between two consecutive parts of the context when it has to be split.
    squad_v2 = True
    epochs= 3
    warmup_steps = 200
    warmup_proportion  = 0.1
    save_steps = 200
    evaluation_strategy="epoch"
    weight_decay=0.01
    logging_dir='logs'
    load_best_model_at_end=True
    metric_for_best_model="f1"
    do_eval = True
    learning_rate=3e-5
    n_best_size = 20
    max_answer_length = 50
    gradient_accumulation_steps = 4
    report_to = 'wandb'

args=Args()

metric = load_metric("squad_v2" if args.squad_v2 else "squad")


In [None]:
print(args.learning_rate)

3e-05


# Read Datasets

In [None]:
qa_data = load_dataset("squad_v2")


Downloading and preparing dataset squad_v2/squad_v2 (download: 44.34 MiB, generated: 122.41 MiB, post-processed: Unknown size, total: 166.75 MiB) to /root/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d...



Dataset squad_v2 downloaded and prepared to /root/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d. Subsequent calls will reuse this data.


In [None]:
### Following is required only once
train_file = os.path.join(args.data_dir1, args.train_file1)
predict_file = os.path.join(args.data_dir1, args.predict_file1)
qa_data_bioasq = load_dataset("json", data_files={"train": train_file, "validation": predict_file} )
# qa_data['train'] = qa_data['train'].select(range(2_000)) 
# qa_data['validation'] = qa_data['validation'].select(range(1_000))
# qa_data_squad['train'] = qa_data_squad['train'].filter(lambda example, indice: indice != 107709, with_indices=True)

Using custom data configuration default-1d600a62d23165f6


Downloading and preparing dataset json/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/json/default-1d600a62d23165f6/0.0.0/45636811569ec4a6630521c18235dfbbab83b7ab572e3393c5ba68ccabe98264...



Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-1d600a62d23165f6/0.0.0/45636811569ec4a6630521c18235dfbbab83b7ab572e3393c5ba68ccabe98264. Subsequent calls will reuse this data.


In [None]:
### Following is required only once
predict_file = os.path.join(args.data_dir2, args.predict_file2)
qa_data_covid= load_dataset("json", data_files={ "validation": predict_file} )
# qa_data['train'] = qa_data['train'].select(range(2_000)) 
# qa_data['validation'] = qa_data['validation'].select(range(1_000))

Using custom data configuration default-2c021e95388b6faf


Downloading and preparing dataset json/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/json/default-2c021e95388b6faf/0.0.0/45636811569ec4a6630521c18235dfbbab83b7ab572e3393c5ba68ccabe98264...



Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-2c021e95388b6faf/0.0.0/45636811569ec4a6630521c18235dfbbab83b7ab572e3393c5ba68ccabe98264. Subsequent calls will reuse this data.


In [None]:
qa_data_covid

DatasetDict({
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers', 'is_impossible'],
        num_rows: 2019
    })
})

In [None]:
qa_data_bioasq

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers', 'is_impossible'],
        num_rows: 14919
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers', 'is_impossible'],
        num_rows: 5537
    })
})

# Split and merge the datasets

In [None]:
temp = qa_data_covid['validation'].shuffle(seed=1).train_test_split(train_size=0.4)

In [None]:
temp['validation']=temp['test']

In [None]:
qa_data['train']=concatenate_datasets([qa_data_bioasq['train'], temp['train']])#.shuffle(seed=1)
qa_data['validation']=concatenate_datasets([temp['validation']])

In [None]:
qa_data

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers', 'is_impossible'],
        num_rows: 15726
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers', 'is_impossible'],
        num_rows: 1212
    })
})
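The row counts above can be checked by hand. The sketch below assumes that `train_test_split(train_size=0.4)` rounds the training share down and puts every remaining row in the test split; that assumption reproduces the counts printed above. `split_sizes` is a helper written for this illustration, not part of the `datasets` API.

```python
import math
from typing import Tuple

def split_sizes(n_rows: int, train_size: float) -> Tuple[int, int]:
    """Return (n_train, n_test), assuming the train share is rounded down."""
    n_train = math.floor(n_rows * train_size)
    return n_train, n_rows - n_train

n_covid = 2019    # rows in qa_data_covid['validation']
n_bioasq = 14919  # rows in qa_data_bioasq['train']

# 40% of COVID-QA goes to training, the remaining 60% is held out.
n_covid_train, n_covid_val = split_sizes(n_covid, 0.4)

# The merged training set is BioASQ train plus the COVID-QA training share.
merged_train = n_bioasq + n_covid_train

print(n_covid_train, n_covid_val, merged_train)
```

So the second-stage training set mixes all BioASQ training rows with 40% of COVID-QA, and the validation set consists only of the held-out COVID-QA rows.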

# Important functions: training features, validation features and post-processing

In [None]:
def prepare_train_features(examples, args=args):
    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. One example
    # may therefore yield several features when its context is long, each of those features having a
    # context that slightly overlaps the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=args.max_length,
        stride=args.doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context. This will
    # help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples

def prepare_validation_features(examples, args=args):
    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. One example
    # may therefore yield several features when its context is long, each of those features having a
    # context that slightly overlaps the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=args.max_length,
        stride=args.doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the example_id that gave us this feature and we will store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1 if pad_on_right else 0

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token
        # position is part of the context or not.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples

from tqdm.auto import tqdm
import collections

def postprocess_qa_predictions(args, examples, features, raw_predictions):
    n_best_size = args.n_best_size
    max_answer_length = args.max_answer_length
    all_start_logits, all_end_logits = raw_predictions
    # Build a map example to its corresponding features.
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    # The dictionaries we have to fill.
    predictions = collections.OrderedDict()

    # Logging.
    print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")

    # Let's loop over all the examples!
    for example_index, example in enumerate(tqdm(examples)):
        # Those are the indices of the features associated to the current example.
        feature_indices = features_per_example[example_index]

        min_null_score = 0 #None # Only used if squad_v2 is True.
        valid_answers = []
        
        context = example["context"]
        # Looping through all the features associated to the current example.
        for feature_index in feature_indices:
            # We grab the predictions of the model for this feature.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            # This is what allows us to map positions in our logits to spans of text in the original
            # context.
            offset_mapping = features[feature_index]["offset_mapping"]

            # Update minimum null prediction.
            cls_index = features[feature_index]["input_ids"].index(tokenizer.cls_token_id)
            feature_null_score = start_logits[cls_index] + end_logits[cls_index]
            if min_null_score is None or min_null_score < feature_null_score:
                min_null_score = feature_null_score

            # Go through all possibilities for the `n_best_size` greater start and end logits.
            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
                    # to part of the input_ids that are not in the context.
                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or offset_mapping[start_index] is None
                        or offset_mapping[end_index] is None
                    ):
                        continue
                    # Don't consider answers with a length that is either < 0 or > max_answer_length.
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue

                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char: end_char]
                        }
                    )
        
        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            # In the very rare edge case we have not a single non-null prediction, we create a fake prediction to avoid
            # failure.
            best_answer = {"text": "", "score": 0.0}
        
        # Let's pick our final answer: the best one or the null answer (only for squad_v2)
        if not args.squad_v2:
            predictions[example["id"]] = best_answer["text"]
        else:
            answer = best_answer["text"] if best_answer["score"] > min_null_score else ""
            predictions[example["id"]] = answer

    return predictions
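To make the span search in `postprocess_qa_predictions` easier to follow, here is a minimal sketch on toy logits for a single feature. It is a simplification, not the notebook's function: plain Python lists replace NumPy arrays, and the hypothetical `is_context` mask stands in for the `offset_mapping` entries that are set to `None` outside the context. `top_indices` and `best_span` are names invented for this illustration.

```python
def top_indices(scores, n_best):
    """Indices of the n_best largest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n_best]

def best_span(start_logits, end_logits, is_context, n_best_size=20, max_answer_length=50):
    """Return (score, start, end) for the best valid span, or None if there is none."""
    best = None
    for s in top_indices(start_logits, n_best_size):
        for e in top_indices(end_logits, n_best_size):
            # Skip tokens that belong to the question or to special tokens.
            if not (is_context[s] and is_context[e]):
                continue
            # Skip spans that end before they start or that are too long.
            if e < s or e - s + 1 > max_answer_length:
                continue
            score = start_logits[s] + end_logits[e]
            if best is None or score > best[0]:
                best = (score, s, e)
    return best

# Toy feature with six tokens; index 0 plays the role of the CLS token.
start_logits = [0.5, 0.1, 3.0, 0.2, 0.1, 0.0]
end_logits = [0.4, 0.0, 0.3, 2.5, 0.1, 0.2]
is_context = [False, False, True, True, True, True]

# The null ("no answer") score is read off the CLS position.
null_score = start_logits[0] + end_logits[0]

span = best_span(start_logits, end_logits, is_context)
# Keep the span only if it beats the null score, as in the squad_v2 branch.
prediction = (span[1], span[2]) if span is not None and span[0] > null_score else None
print(prediction)
```

Here the span from token 2 to token 3 scores 5.5, which beats the null score of 0.9, so it is kept as the answer.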

# Read and save dataset to disk

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer 

tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name)
pad_on_right = tokenizer.padding_side == "right"

train_path = args.train_path
dev_path = args.dev_path
val_path = args.val_path

# if args.read_data:
tokenized_datasets = qa_data.map(prepare_train_features, batched=True, remove_columns=qa_data["train"].column_names)
train_dataset= tokenized_datasets['train']
val_dataset= tokenized_datasets['validation']

validation_features = qa_data["validation"].map(
  prepare_validation_features,
  batched=True,
  remove_columns=qa_data["validation"].column_names)

train_dataset.save_to_disk(train_path)
val_dataset.save_to_disk(dev_path)
validation_features.save_to_disk(val_path)
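Because long contexts are split with a stride, the mapped datasets contain many more features than examples: the logs further down report 60652 training features for 15726 training examples, and 49698 validation features for 1212 validation examples. The sketch below illustrates the windowing on token counts alone. It is a simplified model rather than the tokenizer's actual implementation: `window_spans`, the question length of 20 tokens and the 4 special tokens per question/context pair are assumptions made for this illustration.

```python
def window_spans(n_context_tokens, max_length=324, doc_stride=128,
                 n_question_tokens=20, n_special_tokens=4):
    """List the (start, end) context-token ranges covered by each feature."""
    # Room left for context tokens once the question and special tokens are placed.
    window = max_length - n_question_tokens - n_special_tokens
    # Consecutive windows share doc_stride tokens, so each one advances by this step.
    step = window - doc_stride
    if window <= 0 or step <= 0:
        raise ValueError("max_length is too small for this question and stride")

    spans = []
    start = 0
    while True:
        end = min(start + window, n_context_tokens)
        spans.append((start, end))
        if end == n_context_tokens:
            break
        start += step
    return spans

# A 1000-token context with the notebook's max_length and doc_stride.
spans = window_spans(1000)
for span in spans:
    print(span)
```

Under these assumptions a 1000-token context yields six features, each overlapping the previous one by 128 tokens. This is why the article-length COVID-QA contexts produce so many features per example.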


# Read Pre-trained model

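Now load the pre-trained model. We wrap it in `xmp.MpModelWrapper` so that the model is created once at global scope and then moved to each TPU core's device, rather than being instantiated separately in every process.


See https://pytorch.org/xla/release/1.8/index.html#torch_xla.distributed.xla_multiprocessing.MpModelWrapper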



In [None]:
import torch_xla.distributed.xla_multiprocessing as xmp

model = AutoModelForQuestionAnswering.from_pretrained(args.model_name)
model.train()

WRAPPED_MODEL = xmp.MpModelWrapper(model)




# Define functions for training, computing metrics and formatting the predictions

In [None]:
from transformers import Trainer, TrainingArguments, EvalPrediction
from datasets import load_from_disk, load_metric
# from sklearn.metrics import precision_recall_fscore_support, accuracy_score
# from transformers import default_data_collator
import numpy as np


validation_features = load_from_disk(val_path)
validation_features.set_format(type=validation_features.format["type"], columns=list(validation_features.features.keys()))

def my_compute_metrics(p: EvalPrediction, validation_features=validation_features):
  predictions = p.predictions
  # validation_features = qa_data["validation"].map(
  #   prepare_validation_features,
  #   batched=True,
  #   remove_columns=qa_data["validation"].column_names)
  # import pdb
  # pdb.set_trace()
  final_predictions = postprocess_qa_predictions(args,qa_data["validation"], validation_features, predictions)
  formatted_predictions, references = format_predictions(final_predictions)
  return metric.compute(predictions=formatted_predictions, references=references)

def format_predictions(postprocess_predictions):
  if args.squad_v2:
      formatted_predictions = [{"id": k, "prediction_text": v, "no_answer_probability": 0.0} for k, v in postprocess_predictions.items()]
  else:
      formatted_predictions = [{"id": k, "prediction_text": v} for k, v in postprocess_predictions.items()]
  references = [{"id": ex["id"], "answers": ex["answers"]} for ex in qa_data["validation"]]
  return formatted_predictions, references

def train_qa(model, tokenizer= tokenizer, args=args):
    """
    This contains everything that must be done to train our models
    """
    print("Loading datasets... ", end="")

    # data_collator = default_data_collator
    train_dataset = load_from_disk(train_path)
    val_dataset = load_from_disk(dev_path)
    # import pdb
    # pdb.set_trace()
    training_args = TrainingArguments(
      # report_to = args.report_to, #'wandb',   
      output_dir=args.output_dir,
      num_train_epochs=args.epochs,
      warmup_steps=args.warmup_steps,
      save_steps = args.save_steps,
      evaluation_strategy=args.evaluation_strategy,
      weight_decay=args.weight_decay,
      logging_dir=args.logging_dir,
      # load_best_model_at_end=args.load_best_model_at_end,
      # metric_for_best_model=args.metric_for_best_model,
      do_eval = args.do_eval,
      learning_rate=args.learning_rate,
      # run_name = 'case-1.1.2: bert + squad + None' 
      per_device_train_batch_size = args.per_device_train_batch_size,
      per_device_eval_batch_size = args.per_device_eval_batch_size,
      gradient_accumulation_steps = args.gradient_accumulation_steps
    )

    results = []

    trainer = Trainer(
      model=model,
      args=training_args,
      compute_metrics=my_compute_metrics,
      train_dataset=train_dataset,
      eval_dataset=val_dataset,
      tokenizer=tokenizer,)

    trainer.place_model_on_device = False
    trainer.train()
    trainer.save_model(args.save_model_path)
    # tokenizer.save_pretrained("qa_model/")
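The `Exact` and `F1` columns in the evaluation output come from the `squad_v2` metric. As a rough illustration of what these scores measure for a single prediction, here is a sketch of SQuAD-style exact match and token-level F1. It is a simplified re-implementation for illustration only, not the metric's own code: as far as I understand it, the real metric additionally takes the best score over all reference answers, averages over examples and reports percentages.

```python
import collections
import re
import string

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, truth):
    """1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(truth))

def f1_score(prediction, truth):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    if not pred_tokens or not true_tokens:
        # An empty answer only matches another empty answer.
        return float(pred_tokens == true_tokens)
    common = collections.Counter(pred_tokens) & collections.Counter(true_tokens)
    n_same = sum(common.values())
    if n_same == 0:
        return 0.0
    precision = n_same / len(pred_tokens)
    recall = n_same / len(true_tokens)
    return 2 * precision * recall / (precision + recall)

# A prediction that contains the reference answer plus extra tokens.
em = exact_match("the spike protein of SARS-CoV-2", "spike protein")
f1 = f1_score("the spike protein of SARS-CoV-2", "spike protein")
print(em, round(f1, 4))
```

The example shows why F1 is consistently higher than exact match in the results: a prediction that includes the reference answer but is longer than it earns partial F1 credit and no exact-match credit.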

# Final training and evaluation

We have to define an `_mp_fn` function, which is called with the index of the TPU core it runs on.

To reduce host memory usage, we use the `WRAPPED_MODEL` defined earlier: the model is created once and is only moved to each core's device inside `_mp_fn`.

Then we call `xmp.spawn` with `start_method='fork'`.

In [None]:
# os.environ["WANDB_DISABLED"] = "true"

In [None]:
args.epochs

3

In [None]:
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    device = xm.xla_device()
    # Move the shared wrapped model onto this core's XLA device
    model = WRAPPED_MODEL.to(device)

    train_qa(model,tokenizer= tokenizer, args=args)

xmp.spawn(_mp_fn, start_method="fork")



***** Running training *****
  Num examples = 60652
  Num Epochs = 3
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 1024
  Gradient Accumulation steps = 4
  Total optimization steps = 177


Loading datasets... 

Epoch,Training Loss,Validation Loss,Exact,F1,Total,Hasans Exact,Hasans F1,Hasans Total,Best Exact,Best Exact Thresh,Best F1,Best F1 Thresh
0,No log,0.196466,13.118812,18.282824,1212,13.118812,18.282824,1212,13.118812,0.0,18.282824,0.0
1,No log,0.15314,17.491749,24.208134,1212,17.491749,24.208134,1212,17.491749,0.0,24.208134,0.0
2,No log,0.162815,27.970297,42.387201,1212,27.970297,42.387201,1212,27.970297,0.0,42.387201,0.0


***** Running Evaluation *****
  Num examples = 49698
  Batch size = 32


Post-processing 1212 example predictions split into 49698 features.


Training completed. Do not forget to share your model on huggingface.co/models =)


Saving model checkpoint to ./qa_model/
Configuration saved in ./qa_model/config.json
Model weights saved in ./qa_model/pytorch_model.bin
tokenizer config file saved in ./qa_model/tokenizer_config.json
Special tokens file saved in ./qa_model/special_tokens_map.json
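The numbers in the training log can be reproduced from the settings in `Args`. The sketch below is a worked calculation under two assumptions: 8 TPU cores (inferred from the total batch size of 1024 = 32 × 4 × 8, not stated explicitly in the notebook) and the default linear learning-rate schedule with warmup. `optimization_steps` and `linear_warmup_lr` are helpers written for this illustration, not `Trainer` internals.

```python
import math

def optimization_steps(n_features, per_device_batch, n_cores, grad_accum, epochs):
    """Total optimizer updates for data-parallel training with gradient accumulation."""
    # Batches each core sees per epoch (features are sharded across the cores).
    batches_per_core = math.ceil(n_features / (per_device_batch * n_cores))
    # One optimizer update happens every grad_accum batches.
    updates_per_epoch = max(batches_per_core // grad_accum, 1)
    return updates_per_epoch * epochs

def linear_warmup_lr(step, peak_lr, warmup_steps, total_steps):
    """Learning rate under linear warmup followed by linear decay."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

n_cores = 8  # assumption: inferred from the logged total batch size

effective_batch = 32 * n_cores * 4
total_steps = optimization_steps(60652, per_device_batch=32, n_cores=n_cores,
                                 grad_accum=4, epochs=3)
# Learning rate at the final step with warmup_steps=200 and learning_rate=3e-5.
final_lr = linear_warmup_lr(total_steps, peak_lr=3e-5, warmup_steps=200,
                            total_steps=total_steps)

print(effective_batch, total_steps, final_lr)
```

This matches the logged total batch size of 1024 and the 177 optimization steps. It also suggests that, under these assumptions, `warmup_steps = 200` is larger than the whole run, so the learning rate is still ramping up when training ends and never reaches the configured `3e-5`. A smaller warmup may be worth trying.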
