In this work, the “Simple Transformers” library is used: https://simpletransformers.ai/.

This library is built on top of huggingface/transformers and offers models for specific NLP tasks. Each model wraps a pre-trained model from Hugging Face together with the additional configuration needed to fine-tune it for a specific NLP task. We only have to train it.

---
Steps:
1. Initialize the model from the “Simple Transformers” library.

2. Train the model.

3. Test/evaluate the model.

4. Answer questions with the model.
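
For step 4, Simple Transformers question-answering models take a list of dicts as prediction input, each holding a `context` and a `qas` list of questions with ids. A minimal sketch of building that input follows. The helper name `build_prediction_input` and the example text are hypothetical, and the `model.predict` call is shown only as a comment because it needs the trained model from step 2.

```python
def build_prediction_input(context, questions):
    """Wrap one context and its questions in the list-of-dicts layout
    used as prediction input for question-answering models."""
    return [
        {
            "context": context,
            "qas": [
                {"id": str(i), "question": question}
                for i, question in enumerate(questions)
            ],
        }
    ]


# Hypothetical example text, for illustration only.
to_predict = build_prediction_input(
    "RoBERTa is a pre-trained language model available on Hugging Face.",
    ["Where is RoBERTa available?"],
)

# With a trained model (assumed to exist as `model`):
# answers, probabilities = model.predict(to_predict)
```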



In [None]:
# Work inside a Google Drive folder so the trained model is saved there (the drive must already be mounted)
%cd /content/drive/MyDrive/IIP2/Assignments/Final/TrainModel

/content/drive/MyDrive/IIP2/Assignments/Final/TrainModel


In [None]:
# Installing the "Simple Transformer" library
!pip install simpletransformers

Collecting simpletransformers
  Downloading simpletransformers-0.64.5-py3-none-any.whl (250 kB)
Collecting datasets (from simpletransformers)
  Downloading datasets-2.16.1-py3-none-any.whl (507 kB)
Collecting seqeval (from simpletransformers)
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Collecting tensorboardx (from simpletransformers)
  Downloading tensorboardX-2.6.2.2-py2.py3-none-any.whl (101 kB)
Collecting wandb>=0.10.32 (from simpletransformers)
 

In [None]:
# Importing the necessary libraries
import logging
import sklearn.metrics  # import the submodule explicitly; accuracy_score is used later as the exact-match metric
import os
import requests
import json

from simpletransformers.question_answering import QuestionAnsweringModel, QuestionAnsweringArgs

In [None]:
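# Getting training and testing datasets

os.makedirs('squad', exist_ok=True)  # do not fail if the directory already exists
url = 'https://rajpurkar.github.io/SQuAD-explorer/dataset/'

for file in ['train-v2.0.json', 'dev-v2.0.json']:
  res = requests.get(f'{url}{file}', stream=True)  # stream the response body
  res.raise_for_status()  # abort on HTTP errors instead of saving an error page
  with open(f'squad/{file}', 'wb') as f:
    # write in 8 KiB chunks; 4-byte chunks are needlessly slow
    for chunk in res.iter_content(chunk_size=8192):
      f.write(chunk)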

In [None]:
# Preparing a training dataset (130319 samples)

with open("squad/train-v2.0.json", "rb") as f:
  squad_dict = json.load(f)

train = []
for group in squad_dict['data']:
  train.extend(group['paragraphs'])  # flatten the articles into one list of paragraphs

In [None]:
# Preparing an evaluation dataset (used during the training phase) (5448 samples)
with open("squad/dev-v2.0.json", "rb") as f:
  squad_dict = json.load(f)

eval = []  # note: this name shadows Python's built-in eval()
for i in range(17):  # use only the first 17 articles of the dev set
  eval.extend(squad_dict['data'][i]['paragraphs'])
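
The sample counts quoted in the comments refer to questions, not paragraphs: each paragraph dict carries a `qas` list. Below is a minimal sketch for checking such counts, assuming the SQuAD v2.0 layout (`context`, `qas`, `is_impossible`). The helper names and the toy paragraph are made up for illustration.

```python
def count_questions(paragraphs):
    """Count question entries across a list of SQuAD-style paragraph dicts."""
    return sum(len(paragraph["qas"]) for paragraph in paragraphs)


def count_unanswerable(paragraphs):
    """Count questions flagged as having no answer in the context."""
    return sum(
        1
        for paragraph in paragraphs
        for qa in paragraph["qas"]
        if qa.get("is_impossible", False)
    )


# Toy data in the SQuAD v2.0 layout (hypothetical content).
toy_paragraphs = [
    {
        "context": "The sky is blue.",
        "qas": [
            {"id": "q1", "question": "What color is the sky?",
             "answers": [{"text": "blue", "answer_start": 11}],
             "is_impossible": False},
            {"id": "q2", "question": "What color is the sea?",
             "answers": [], "is_impossible": True},
        ],
    }
]

print(count_questions(toy_paragraphs), count_unanswerable(toy_paragraphs))
```

Applied to the real lists, `count_questions(train)` and `count_questions(eval)` should reproduce the sample counts given in the cell comments.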

---

---
1. **Initialize the model from the “Simple Transformers” library.**

    First, we choose the model type and the pre-trained checkpoint. Then we set the training arguments for the model.

In [None]:
# Here, the RoBERTa model is used for the "Question Answering" task.
# The pre-trained model is 'roberta-base' from Hugging Face: https://huggingface.co/roberta-base

model_type = 'roberta'
model_name = 'roberta-base'

In [None]:
# https://simpletransformers.ai/docs/usage/#configuring-a-simple-transformers-model

train_args = {
    #general arguments
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "output_dir": f"outputs/{model_type}",
    "best_model_dir": f"outputs/{model_type}/best_model",
    "save_model_every_epoch": False,
    "save_eval_checkpoints": False,
    # Maximum sequence length the model will support.
    "max_seq_length": 128,
    # The number of predictions given per question.
    "n_best_size": 3,
    ### From Hyperparameter optimization ###
    "train_batch_size": 128,
    "learning_rate": 3.9598e-4,
    ########################################
    "num_train_epochs": 50,
    "evaluate_during_training": True,
    "use_cached_eval_features": True,
    "eval_batch_size": 64,
    # Evaluate the model every 800 optimization steps
    "evaluate_during_training_steps": 800,

    # Arguments for early stopping
    "use_early_stopping": True,
    "early_stopping_metric": "em_score",
    "early_stopping_metric_minimize": False,  # maximize "em_score"
    "early_stopping_delta": 0.01,  # stop if "em_score" has not improved by more than 0.01
    "early_stopping_patience": 2,  # over 3 consecutive evaluations (first detection + 2 more)


    # Tracking the training process with Weights & Biases: https://wandb.ai/
    "wandb_project": "Question Answer Models",
    "wandb_kwargs": {"name": "RoBERTa Question Answering"},
}
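
The early-stopping arguments above combine a minimum improvement (`early_stopping_delta`) with a patience counter. The sketch below is a simplified model of that idea, written for illustration. It is not the library's implementation and ignores details such as end-of-epoch evaluations. The function name `early_stop_index` and the example scores are hypothetical.

```python
def early_stop_index(scores, delta=0.01, patience=2):
    """Return the index of the evaluation at which training would stop,
    or None if the patience is never exhausted.

    A score counts as an improvement only if it beats the best score
    so far by more than `delta`. Training stops once more than
    `patience` consecutive evaluations fail to improve.
    """
    best = scores[0]
    evaluations_without_improvement = 0
    for index, score in enumerate(scores[1:], start=1):
        if score - best > delta:
            best = score
            evaluations_without_improvement = 0
        else:
            evaluations_without_improvement += 1
            if evaluations_without_improvement > patience:
                return index
    return None


# Hypothetical metric values, for illustration only.
print(early_stop_index([0.50, 0.52, 0.51, 0.50, 0.49]))
```

With `patience=2`, the third consecutive evaluation without a sufficient improvement ends training, which is the "detect + 2" reading given in the comment.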

In [None]:
# Initializing the model
model = QuestionAnsweringModel(
    model_type, model_name, args=train_args
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/481 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/499M [00:00<?, ?B/s]

Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-base and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

---





---
2. **Train the model.**

    Unless early stopping triggers, the model is trained for 50 epochs with 1027 steps per epoch (one step = one forward and backward pass over a batch).
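
The step counts can be checked with a little arithmetic. Below is a minimal sketch, assuming one optimization step per batch (no gradient accumulation). The function names are made up for illustration. The 1027 steps per epoch come from the progress bars in the training output, and 800 is the `evaluate_during_training_steps` value.

```python
import math


def steps_per_epoch(num_features, batch_size):
    """Number of optimization steps needed to see every feature once."""
    return math.ceil(num_features / batch_size)


def max_training_steps(steps_in_epoch, num_epochs):
    """Upper bound on steps when early stopping never triggers."""
    return steps_in_epoch * num_epochs


# 1027 steps per epoch, as reported by the training progress bars.
total_steps = max_training_steps(1027, 50)
# Step-based evaluations happen every 800 steps.
step_evaluations = total_steps // 800
print(total_steps, step_evaluations)
```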



---------------------------


In [None]:
# Train the model
model.train_model(train, eval_data=eval, em_score=sklearn.metrics.accuracy_score)

convert squad examples to features: 100%|██████████| 130319/130319 [08:48<00:00, 246.50it/s]
add example index and unique id: 100%|██████████| 130319/130319 [00:00<00:00, 628101.17it/s]


Epoch:   0%|          | 0/50 [00:00<?, ?it/s]

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Running Epoch 0 of 50:   0%|          | 0/1027 [00:00<?, ?it/s]



convert squad examples to features: 100%|██████████| 5448/5448 [00:21<00:00, 254.29it/s]


add example index and unique id: 100%|██████████| 5448/5448 [00:00<00:00, 255619.21it/s]


Running Evaluation:   0%|          | 0/88 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/88 [00:00<?, ?it/s]

Running Epoch 1 of 50:   0%|          | 0/1027 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/88 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/88 [00:00<?, ?it/s]

Running Epoch 2 of 50:   0%|          | 0/1027 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/88 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/88 [00:00<?, ?it/s]

Running Epoch 3 of 50:   0%|          | 0/1027 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/88 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/88 [00:00<?, ?it/s]

(4000,
 {'global_step': [800, 1027, 1600, 2054, 2400, 3081, 3200, 4000],
  'correct': [2747, 2767, 2720, 2965, 2507, 2718, 2583, 2647],
  'similar': [2420, 2390, 2421, 2344, 2567, 2480, 2537, 2512],
  'incorrect': [281, 291, 307, 139, 374, 250, 328, 289],
  'train_loss': [0.923905611038208,
   1.081533670425415,
   1.0131828784942627,
   1.023398995399475,
   0.6702353954315186,
   0.7532515525817871,
   1.1036791801452637,
   1.1928532123565674],
  'eval_loss': [-6.947931463068182,
   -7.412553267045454,
   -5.795987215909091,
   -7.060857599431818,
   -6.904430042613637,
   -5.787242542613637,
   -6.466885653409091,
   -7.180264559659091],
  'em_score': [0.5042217327459618,
   0.5078928046989721,
   0.49926578560939794,
   0.5442364170337739,
   0.46016886930983847,
   0.4988986784140969,
   0.4741189427312775,
   0.4858663729809104]})
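
The returned history can be read as follows: at every evaluation, `correct`, `similar` and `incorrect` sum to the 5448 evaluation questions, and the reported `em_score` equals `correct` divided by that total. The sketch below recomputes the ratio from the values printed above and picks the best evaluation. The helper name `exact_match_by_step` is made up for illustration.

```python
# Values copied from the training output above.
history = {
    "global_step": [800, 1027, 1600, 2054, 2400, 3081, 3200, 4000],
    "correct": [2747, 2767, 2720, 2965, 2507, 2718, 2583, 2647],
    "similar": [2420, 2390, 2421, 2344, 2567, 2480, 2537, 2512],
    "incorrect": [281, 291, 307, 139, 374, 250, 328, 289],
}


def exact_match_by_step(history):
    """Return (global_step, exact-match ratio) pairs for each evaluation."""
    rows = []
    for step, correct, similar, incorrect in zip(
        history["global_step"],
        history["correct"],
        history["similar"],
        history["incorrect"],
    ):
        total = correct + similar + incorrect
        rows.append((step, correct / total))
    return rows


rows = exact_match_by_step(history)
best_step, best_em = max(rows, key=lambda row: row[1])
print(f"best exact match {best_em:.4f} at global step {best_step}")
```

The best exact-match ratio falls at global step 2054, the end of the second epoch. Every later evaluation scores lower.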

---

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
