# Custom LLM Training - Q&A   

# by ---  "AD ACADEMY" - AI for Aam Janta

Mentor - Dr Ayan Debnath, IIT Delhi and Harvard University alumnus

LinkedIn: [dr_ayan_debnath](https://www.linkedin.com/in/ayan-debnath/)

YouTube: [AD ACADEMY AI](https://www.youtube.com/@ad_academy)

Topic: Custom LLM Training - Q&A

Class on 15th September 2024

### Question Answer Application
The goal of question answering is to find the answer to a question within an accompanying context. The predicted answer is either a span of text copied from the context or an empty string, which indicates that the question cannot be answered from the context.
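
To make the idea of an answer "span" concrete, here is a minimal sketch of how a SQuAD-style answer (the `text` and `answer_start` fields used in the training data of this notebook) maps onto the context string. The helper name `extract_answer` is hypothetical and used only for illustration.

```python
def extract_answer(context: str, answer_start: int, answer_text: str) -> str:
    """Return the slice of `context` that a SQuAD-style answer points at."""
    return context[answer_start:answer_start + len(answer_text)]


context = (
    "Mistborn is a series of epic fantasy novels written by "
    "American author Brandon Sanderson."
)

# An answerable question: the answer is a character span inside the context.
span = extract_answer(context, 71, "Brandon Sanderson")
print(span)

# An unanswerable question is represented by an empty answer.
unanswerable = ""
```

The character offset (`answer_start`) plus the length of the answer text is all that is needed to locate the span.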

In [1]:
!pip install simpletransformers
# simpletransformers is built on top of the Hugging Face Transformers library

Collecting simpletransformers
  Downloading simpletransformers-0.70.1-py3-none-any.whl.metadata (42 kB)
Collecting datasets (from simpletransformers)
  Downloading datasets-3.0.0-py3-none-any.whl.metadata (19 kB)
Collecting seqeval (from simpletransformers)
  Downloading seqeval-1.2.2.tar.gz (43 kB)
Collecting tensorboardx (from simpletransformers)
  Downloading tensorboardX-2.6.2.2-py2.py3-none-any.whl.metadata (5.8 kB)
Collecting wandb>=0.10.32 (from simpletransformers)
  Downloading wandb-0.18.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.6 kB)
Collectin

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import os
main_dir = "/content/drive/MyDrive/Colab Notebooks/Cutom Question Answering"
os.chdir(main_dir)
os.listdir()

['predictions.json',
 'test.json',
 'train.json',
 'cache_dir',
 'outputs',
 'runs',
 'overview_spacy_word_vectors.ipynb',
 'Question_Answer_Application.ipynb']

In [4]:
import json
with open(r"train.json", "r") as read_file:
    train = json.load(read_file)

In [None]:
train

[{'context': 'Mistborn is a series of epic fantasy novels written by American author Brandon Sanderson.',
  'qas': [{'id': '00001',
    'is_impossible': False,
    'question': 'Who is the author of the Mistborn series?',
    'answers': [{'text': 'Brandon Sanderson', 'answer_start': 71}]}]},
 {'context': 'The first series, published between 2006 and 2008, consists of The Final Empire,The Well of Ascension, and The Hero of Ages.',
  'qas': [{'id': '00002',
    'is_impossible': False,
    'question': 'When was the series published?',
    'answers': [{'text': 'between 2006 and 2008', 'answer_start': 28}]},
   {'id': '00003',
    'is_impossible': False,
    'question': 'What are the three books in the series?',
    'answers': [{'text': 'The Final Empire, The Well of Ascension, and The Hero of Ages',
      'answer_start': 63}]},
   {'id': '00004',
    'is_impossible': True,
    'question': 'Who is the main character in the series?',
    'answers': []}]}]
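
Each `answer_start` must point at the exact position of `text` inside `context`, otherwise the example cannot be aligned during feature conversion. The sketch below checks that alignment; the function name `find_misaligned_answers` is hypothetical, and the sample reproduces the second training entry shown above, whose context has no space after the first comma while the answer text for question `00003` does.

```python
def find_misaligned_answers(dataset):
    """Return ids of questions whose answer text differs from the context slice."""
    bad_ids = []
    for entry in dataset:
        context = entry["context"]
        for qa in entry["qas"]:
            for answer in qa["answers"]:
                start = answer["answer_start"]
                end = start + len(answer["text"])
                if context[start:end] != answer["text"]:
                    bad_ids.append(qa["id"])
    return bad_ids


sample = [
    {
        "context": "The first series, published between 2006 and 2008, consists of "
                   "The Final Empire,The Well of Ascension, and The Hero of Ages.",
        "qas": [
            {
                "id": "00002",
                "is_impossible": False,
                "question": "When was the series published?",
                "answers": [{"text": "between 2006 and 2008", "answer_start": 28}],
            },
            {
                "id": "00003",
                "is_impossible": False,
                "question": "What are the three books in the series?",
                "answers": [
                    {
                        "text": "The Final Empire, The Well of Ascension, and The Hero of Ages",
                        "answer_start": 63,
                    }
                ],
            },
        ],
    }
]

misaligned = find_misaligned_answers(sample)
print(misaligned)
```

Running a check like this on `train` and `test` before training makes such mismatches visible early; fixing the context or the answer text in `train.json` resolves them.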

In [5]:
with open(r"test.json", "r") as read_file:
    test = json.load(read_file)

In [6]:
test

[{'context': 'The series primarily takes place in a region called the Final Empire on a world called Scadrial, where the sun and sky are red, vegetation is brown, and the ground is constantly being covered under black volcanic ashfalls.',
  'qas': [{'id': '00001',
    'is_impossible': False,
    'question': 'Where does the series take place?',
    'answers': [{'text': 'region called the Final Empire', 'answer_start': 38},
     {'text': 'world called Scadrial', 'answer_start': 74}]}]},
 {'context': '"Mistings" have only one of the many Allomantic powers, while "Mistborns" have all the powers.',
  'qas': [{'id': '00002',
    'is_impossible': False,
    'question': 'How many powers does a Misting possess?',
    'answers': [{'text': 'one', 'answer_start': 21}]},
   {'id': '00003',
    'is_impossible': True,
    'question': 'What are Allomantic powers?',
    'answers': []}]}]

In [7]:
import logging

from simpletransformers.question_answering import QuestionAnsweringModel, QuestionAnsweringArgs

# simpletransformers.question_answering is the module;
# QuestionAnsweringModel and QuestionAnsweringArgs are classes defined in it.
# QuestionAnsweringModel builds a question-answering model on top of a pre-trained transformer such as BERT, RoBERTa, or DistilBERT.
# The model learns to predict the answer span for a question, given a context passage.

In [8]:
# Choose one of the keys of model_choices below.
model_type = "bert"

# shorthand -> (simpletransformers model type, Hugging Face checkpoint name)
model_choices = {
    "bert": ("bert", "bert-base-cased"),
    "roberta": ("roberta", "roberta-base"),
    "distilbert": ("distilbert", "distilbert-base-cased"),
    "distilroberta": ("roberta", "distilroberta-base"),
    "electra-base": ("electra", "google/electra-base-discriminator"),
    "electra-small": ("electra", "google/electra-small-discriminator"),
    "xlnet": ("xlnet", "xlnet-base-cased"),
}

# Unknown keys leave model_type unchanged and use the bert-base-cased checkpoint.
model_type, model_name = model_choices.get(model_type, (model_type, "bert-base-cased"))

In [9]:
### Advanced Methodology
train_args = {
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "use_cached_eval_features": True,
    "output_dir": f"outputs/{model_type}",
    "best_model_dir": f"outputs/{model_type}/best_model",
    "evaluate_during_training": True,
    "max_seq_length": 128,
    "num_train_epochs": 5,
    "evaluate_during_training_steps": 1000,
    "wandb_project": "Question Answer Application",
    "wandb_kwargs": {"name": model_name},
    "save_model_every_epoch": False,
    "save_eval_checkpoints": False,
    "n_best_size": 3,
    # "use_early_stopping": True,
    # "early_stopping_metric": "mcc",
    # "n_gpu": 2,
    # "manual_seed": 4,
    # "use_multiprocessing": False,
    "train_batch_size": 128,
    "eval_batch_size": 64,
    # "config": {
    #     "output_hidden_states": True
    # }
}
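
The relationship between batch size, epochs, and `evaluate_during_training_steps` is easy to check with a little arithmetic. The sketch below uses the values from `train_args` and the feature count reported in the training log of this notebook (4 features), and assumes no gradient accumulation, so one batch equals one optimizer step.

```python
import math

num_train_features = 4   # "convert squad examples to features: 4/4" in the training log
train_batch_size = 128   # train_args["train_batch_size"]
num_train_epochs = 5     # train_args["num_train_epochs"]
evaluate_every = 1000    # train_args["evaluate_during_training_steps"]

# One step processes one batch; the last batch of an epoch may be partial.
steps_per_epoch = math.ceil(num_train_features / train_batch_size)
total_steps = steps_per_epoch * num_train_epochs
step_based_eval_triggers = total_steps >= evaluate_every

print(steps_per_epoch, total_steps, step_based_eval_triggers)
```

With a dataset this small, the whole run takes only a handful of steps, so a step interval of 1000 is never reached; a smaller interval would be needed for step-based evaluation to happen at all.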

In [10]:
# Download the pre-trained model; train_args (defined above) configures training and evaluation
model = QuestionAnsweringModel(
    model_type, model_name, args=train_args
)

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]



In [11]:
# Alternative way to configure the model: a QuestionAnsweringArgs instance instead of a dict.
# Note: the model above was created with args=train_args, so model_args takes effect only if
# it is passed as args=model_args when QuestionAnsweringModel is constructed.
model_args = QuestionAnsweringArgs()
model_args.train_batch_size = 16  # Number of examples in each training batch.
model_args.evaluate_during_training = True  # Evaluate the model periodically during training.
# Periodic evaluation helps monitor performance and spot overfitting early.
model_args.n_best_size = 3  # Number of candidate answers returned per question.
model_args.num_train_epochs = 5  # Number of training epochs (full passes over the training data).
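
To illustrate what `n_best_size` means, here is a simplified sketch of n-best span selection: every candidate span gets the sum of a start score and an end score, and the top `n_best_size` spans are kept. This is not the library's actual implementation (which works on subword tokens and applies further filtering); the function `n_best_spans`, the `max_answer_len` parameter, and all scores are hypothetical.

```python
def n_best_spans(start_scores, end_scores, n_best_size=3, max_answer_len=10):
    """Rank (start, end) token spans by start score + end score, best first."""
    candidates = []
    for s, s_score in enumerate(start_scores):
        # A span must end at or after its start and stay within the length limit.
        for e in range(s, min(s + max_answer_len, len(end_scores))):
            candidates.append((s_score + end_scores[e], s, e))
    candidates.sort(reverse=True)
    return [(s, e) for _, s, e in candidates[:n_best_size]]


tokens = ["Vin", "is", "a", "Mistborn", "of", "great", "power"]
start_scores = [0.1, 2.0, 0.3, 1.5, 0.0, 0.2, 0.1]  # hypothetical values
end_scores = [0.0, 0.1, 0.4, 2.5, 0.2, 0.3, 1.0]    # hypothetical values

best = n_best_spans(start_scores, end_scores, n_best_size=3)
for s, e in best:
    print(" ".join(tokens[s:e + 1]))
```

A larger `n_best_size` simply keeps more of the ranked candidates for each question.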


In [12]:
### Remove output folder
!rm -rf outputs

In [13]:
!pip install wandb
# Install Weights and Biases (wandb) for tracking and visualizing training in a web browser.
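
Because `wandb_project` is set in `train_args`, training logs to Weights & Biases and, as the training output in this notebook shows, waits for an API key. If no W&B account is available, one option is to switch wandb to offline mode through its `WANDB_MODE` environment variable before training starts. This is a sketch of that workaround, not part of the original class material.

```python
import os

# "offline" keeps wandb logging locally and avoids the interactive login prompt.
os.environ["WANDB_MODE"] = "offline"

print(os.environ["WANDB_MODE"])
```

Run this cell before `model.train_model(...)`; the locally stored runs can be uploaded later with the `wandb sync` command if desired.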



In [None]:
# Train the model
model.train_model(train, eval_data=test)

Could not find answer: 'The Final Empire,The Well of Ascension, and The Hero of Ages.' vs. 'The Final Empire, The Well of Ascension, and The Hero of Ages'
convert squad examples to features: 100%|██████████| 4/4 [00:00<00:00, 274.01it/s]
add example index and unique id: 100%|██████████| 4/4 [00:00<00:00, 56679.78it/s]


Epoch:   0%|          | 0/5 [00:00<?, ?it/s]

[34m[1mwandb[0m: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

In [None]:
# Evaluate the model
result, texts = model.eval_model(test)

Running Evaluation:   0%|          | 0/1 [00:00<?, ?it/s]
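
According to the Simple Transformers documentation, `eval_model` returns `result` with counts of `correct`, `similar`, and `incorrect` predictions, and `texts` with the corresponding question texts. The sketch below shows one plausible way such a three-way classification can be made; the exact matching rule used by the library is an assumption here, and `classify_prediction` is a hypothetical helper.

```python
def classify_prediction(predicted: str, truths: list) -> str:
    """Label a predicted answer as 'correct', 'similar', or 'incorrect'."""
    predicted = predicted.strip()
    truths = [t.strip() for t in truths]
    if not truths:
        # Unanswerable question: only an empty prediction is right.
        return "correct" if predicted == "" else "incorrect"
    if predicted in truths:
        return "correct"
    # Partial overlap: one string contains the other.
    if predicted and any(predicted in t or t in predicted for t in truths):
        return "similar"
    return "incorrect"


print(classify_prediction("one", ["one"]))
print(classify_prediction("only one", ["one"]))
print(classify_prediction("Vin", ["one"]))
print(classify_prediction("", []))
```

Inspecting `result` and `texts` after evaluation shows which test questions fall into each group.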

In [None]:
# Make predictions with the model
to_predict = [
    {
        "context": "Vin is a Mistborn of great power and skill.",
        "qas": [
            {
                "question": "What is Vin's speciality?",
                "id": "0",
            }
        ],
    }
]

In [None]:
answers, probabilities = model.predict(to_predict)

print(answers)


convert squad examples to features: 100%|██████████| 1/1 [00:00<00:00, 4744.69it/s]

add example index and unique id: 100%|██████████| 1/1 [00:00<00:00, 658.45it/s]


Running Prediction:   0%|          | 0/1 [00:00<?, ?it/s]

[{'id': '0', 'answer': ['is a Mistborn', 'is a Mi', 'born']}]
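
Since `n_best_size` is 3, each entry in `answers` holds up to three candidate answers. A minimal sketch for keeping only the first candidate per question follows; it assumes the candidates are ordered best first, and `top_answers` is a hypothetical helper. The input reproduces the output printed above.

```python
def top_answers(answers):
    """Map each question id to its first candidate answer ('' if there is none)."""
    best = {}
    for item in answers:
        candidates = item["answer"]
        best[item["id"]] = candidates[0] if candidates else ""
    return best


answers = [{"id": "0", "answer": ["is a Mistborn", "is a Mi", "born"]}]
best_by_id = top_answers(answers)
print(best_by_id)
```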
