**Objective: Generate Question Answer Dataset**

Task:


1. Pre-process the dataset crawled from the web in assignment 1.
2. Pre-processing should only remove special characters, URLs, etc., if not already done. DO NOT apply **stemming**, **lemmatization**, or **stop-word removal**. Text should remain in natural language after pre-processing.
5. In Python, write code that reads one paragraph at a time from the crawled dataset.
6. Pass this paragraph to a function that generates multiple (3 – 5) question-answer pairs from it using an LLM.
7. If your data already contains question-answer pairs, use the answer text to generate more questions.
8. Save all the question-answer pairs generated from all the paragraphs in a CSV file, with questions in column A and answers in column B.

**Notebook Author**: Faizan Khan

MS Robotics and AI at school of Mechanical & Manufacturing Engineering, NUST, Islamabad.

**Subject**: Generative AI & Applications


In [54]:
# mounting with G drive
'''
First, mount Google Drive in the notebook to access the data stored there.
Google Drive is used to save progress and to keep files that may otherwise be lost
when the Colab notebook crashes.
'''
from google.colab import drive
drive.mount('/gdrive')

Drive already mounted at /gdrive; to attempt to forcibly remount, call drive.mount("/gdrive", force_remount=True).


Create the directory in Google Drive if it does not exist:

```text
gdrive
└── MyDrive
    ├── Colab Notebooks
    └── genAI_assignment2
        ├── dawn_pakistan.json
        ├── processed_file.json
        └── ...
```

The cell below navigates to the MyDrive directory and creates a directory for assignment 2. If the directory already exists, it prints a message instead.

Installation

In [55]:
# navigate to the MyDrive directory and create a directory "genAI_assignment2"
import os

my_drive_path = "/gdrive/MyDrive"

if os.path.exists(my_drive_path):
  # navigate to my drive
  os.chdir(my_drive_path)
  destination_dir = os.path.join(my_drive_path, "genAI_assignment2")
  if not os.path.exists(destination_dir): # create it if it does not exist
    os.makedirs(destination_dir)
  else:
    print("The directory already exists!!!")

The directory already exists!!!


In [56]:
# uploading the dataset file scraped in the assignment1 to the created dir
# navigate to the genAI_assignment2 dir
os.chdir(destination_dir)

In [57]:
# checking the directory contents and printing the current working directory using linux commands
%pwd

'/gdrive/MyDrive/genAI_assignment2'

In [58]:
%ls

cleaned_file.csv    dawn_pakistan_processed.json      flagged/                qa_pairs_new.csv
dawn_pakistan.json  dawn_pakistan_processed_t5_small  qag_end2end_logger.log


Remove some extra files and folders if they exist.

In [59]:
# removing some extra stuff from the directory if exists
# file removal code
#---------------------------------------------------------


# %rm -rf t5_ans_gen_model/
# %rm -rf t5_que_gen_model/
# %rm -rf t5_que_gen.zip
%rm -rf flagged
# %rm -rf dawn_pakistan_processed_recent.json
# %rm -rf dawn_pakistan_processed.json


In [60]:
%ls

cleaned_file.csv    dawn_pakistan_processed.json      qag_end2end_logger.log
dawn_pakistan.json  dawn_pakistan_processed_t5_small  qa_pairs_new.csv


now upload your dataset files to the directory.

In [61]:
# reading the json file in python to load the dataset
import json
data_file_name = "dawn_pakistan.json"
with open(data_file_name) as f:
  dataset = json.load(f)

print(dataset)



dataset information

In [62]:
# inspect the content of the file
print(f'type of the file: \n{type(dataset)}\n')
print(f'Number of entries in the file: \n{len(dataset)}\n')
print(f'Keys in the first sample: \n{dataset[0].keys()}\n')
print(f'first sample: \n{dataset[0]}')


type of the file: 
<class 'list'>

Number of entries in the file: 
15

Keys in the first sample: 
dict_keys(['title', 'url', 'text'])

first sample: 
{'title': '2 officers among 7 soldiers martyred in terrorist attack on post in North Waziristan: ISPR', 'url': 'https://www.dawn.com/news/1821887/2-officers-among-7-soldiers-martyred-in-terrorist-attack-on-post-in-north-waziristan-ispr', 'text': 'Seven soldiers — including two officers — were martyred in a terrorist attack on a security forces’ post in the Mir Ali area of North Waziristan district, the Inter-Services Public Relations (ISPR) said on Saturday.Six terrorists were also neutralised in the subsequent clearance operation.According to the military’s media affairs wing, in the early hours of March 16, 2024, a group of six terrorists attacked a security forces’ post in the general area of Mir Ali of North Waziristan District.“As own troops foiled the initial attempt of intrusion, the terrorists rammed an explosives-laden vehicle in

In [63]:
# inspect the text of a sample, as we are interested in the text field of each entry
print(f"Text for the first sample: \n{dataset[0]['text']}")

Text for the first sample: 
Seven soldiers — including two officers — were martyred in a terrorist attack on a security forces’ post in the Mir Ali area of North Waziristan district, the Inter-Services Public Relations (ISPR) said on Saturday.Six terrorists were also neutralised in the subsequent clearance operation.According to the military’s media affairs wing, in the early hours of March 16, 2024, a group of six terrorists attacked a security forces’ post in the general area of Mir Ali of North Waziristan District.“As own troops foiled the initial attempt of intrusion, the terrorists rammed an explosives-laden vehicle into the post, followed by multiple suicide bombing attacks, which led to the collapse of a portion of a building, resulting into  (martyrdom) of five brave sons of the soil,” ISPR added.The martyrs of the initial attack include Havildar Sabir, a resident of District Khyber, Naik Khurshid  (resident of District Lakki Marwat), Sepoy Nasir (resident of District Peshawar

Preprocess the text using Python's regular expression module,
i.e. remove special characters and URLs from the text and collapse extra whitespace.

In [64]:
# clean (preprocess) the text of the article
# using the regular expressions module
import re

def preprocess_text(text):
  """
  Preprocess the text: remove URLs and special characters, if there are any.

  Parameters
  ----------
  text: str

  Return
  ------
  cleaned_text: str
  """
  # remove urls first, while "://" and "/" are still intact
  cleaned_text = re.sub(r"http\S+", "", text)

  # remove special chars but keep basic punctuation (. , ! ?) so the text stays natural language
  cleaned_text = re.sub(r'[^\w\s\.,!?]', '', cleaned_text)
  # collapse extra white spaces into a single space
  cleaned_text = re.sub(r'\s+', ' ', cleaned_text)
  return cleaned_text.strip()


preprocessing code testing

In [65]:
# testing the preprocess_text function
example1 = "Hello!,  what's up? Visit us at https://example.com. #Excited!!"
example2 = "What's the subscription cost of Google colab pro? it's $50"
example3 = "$50!! @50 you can access to A100 NVIDIA GPUS, //newline, \n\n new paragraph"

print(preprocess_text(example1))
print(preprocess_text(example2))
print(preprocess_text(example3))
# everything seems to be working well! Punctuation is kept on purpose, since it is part of
# natural language; that's why some punctuation characters remain

Hello!, whats up? Visit us at Excited!!
Whats the subscription cost of Google colab pro? its 50
50!! 50 you can access to A100 NVIDIA GPUS, newline, new paragraph


In [66]:
# now loop over all the text to process it and overwrite it in the original json file
def process_all_data(dataset):
  """
  process all the data in the file that we just read in a json file

  parameters
  -----------
  dataset: list (raw)

  return
  ------
  dataset: list (processed)
  """
  for data in dataset:
    text = data["text"]
    processed_text = preprocess_text(text)
    data["text"] = processed_text
  return dataset

In [67]:
# dataset before preprocessing

# print(dataset[0]['text'])
print(f"text length: {len(dataset[0]['text'])}\n")
dataset[0]['text']

text length: 5261



'Seven soldiers — including two officers — were martyred in a terrorist attack on a security forces’ post in the Mir Ali area of North Waziristan district, the Inter-Services Public Relations (ISPR) said on Saturday.Six terrorists were also neutralised in the subsequent clearance operation.According to the military’s media affairs wing, in the early hours of March 16, 2024, a group of six terrorists attacked a security forces’ post in the general area of Mir Ali of North Waziristan District.“As own troops foiled the initial attempt of intrusion, the terrorists rammed an explosives-laden vehicle into the post, followed by multiple suicide bombing attacks, which led to the collapse of a portion of a building, resulting into  (martyrdom) of five brave sons of the soil,” ISPR added.The martyrs of the initial attack include Havildar Sabir, a resident of District Khyber, Naik Khurshid  (resident of District Lakki Marwat), Sepoy Nasir (resident of District Peshawar), Sepoy Raja (resident of D

In [68]:
# preprocess all data
dataset = process_all_data(dataset)

In [69]:
# dataset after preprocessing

dataset[0]['text']

'Seven soldiers including two officers were martyred in a terrorist attack on a security forces post in the Mir Ali area of North Waziristan district, the InterServices Public Relations ISPR said on Saturday.Six terrorists were also neutralised in the subsequent clearance operation.According to the militarys media affairs wing, in the early hours of March 16, 2024, a group of six terrorists attacked a security forces post in the general area of Mir Ali of North Waziristan District.As own troops foiled the initial attempt of intrusion, the terrorists rammed an explosivesladen vehicle into the post, followed by multiple suicide bombing attacks, which led to the collapse of a portion of a building, resulting into martyrdom of five brave sons of the soil, ISPR added.The martyrs of the initial attack include Havildar Sabir, a resident of District Khyber, Naik Khurshid resident of District Lakki Marwat, Sepoy Nasir resident of District Peshawar, Sepoy Raja resident of District Kohat and Sepo

Enable `save_file` when running for the first time, so that the preprocessed data is stored in a separate JSON file. Then disable the flag so the file is not written again.

In [None]:
# saving the processed dataset in the json file so we can access it later for LLM

# Write the updated JSON back to the file
save_file = False # once saved, switch this back to False so the file is not overwritten if this cell is run again by mistake
if save_file:
  with open('dawn_pakistan_processed.json', 'w') as file:
    json.dump(dataset, file, indent=4)  # indent for pretty printing
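
An alternative to toggling `save_file` by hand is to write the file only when it is missing. The following is a minimal sketch under that assumption; `save_json_once` is a hypothetical helper name, not part of the original notebook.

```python
import json
import os

def save_json_once(data, path):
    """Write `data` to `path` as JSON only if the file does not exist yet.

    Returns True if the file was written, False if it was left untouched.
    """
    if os.path.exists(path):
        return False
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4, ensure_ascii=False)
    return True
```

Calling `save_json_once(dataset, 'dawn_pakistan_processed.json')` a second time would then be a no-op, so re-running the cell by mistake cannot overwrite earlier progress.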

In [70]:
%ls

cleaned_file.csv    dawn_pakistan_processed.json      qag_end2end_logger.log
dawn_pakistan.json  dawn_pakistan_processed_t5_small  qa_pairs_new.csv


**Llama2**

Llama 2 is a collection of pretrained and fine-tuned generative text models, ranging from 7 billion to 70 billion parameters. The fine-tuned chat variants are optimized for dialogue use cases.

According to its model card, the chat models outperform open-source chat models on most benchmarks tested and are on par with some popular closed-source models in human evaluations for helpfulness and safety.

**Quantized Models from the Hugging Face Community**

The Hugging Face community provides quantized models, which allow us to run the model efficiently on a T4 GPU. It is important to consult reliable sources before using any model.

There are several variations available, but the ones that interest us are based on the GGML library.
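
To give a rough sense of what quantization means, here is a toy sketch that stores a block of float weights as small integers plus a scale and an offset. It is an illustration only and makes no claim about the actual GGML storage format; `quantize_block` and `dequantize_block` are hypothetical names, and the 5-bit default is an assumption chosen for the example.

```python
def quantize_block(weights, bits=5):
    """Affine-quantize a list of floats to integer codes in [0, 2**bits - 1]."""
    levels = 2 ** bits - 1
    w_min, w_max = min(weights), max(weights)
    # fall back to a scale of 1.0 for constant blocks to avoid dividing by zero
    scale = (w_max - w_min) / levels or 1.0
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min

def dequantize_block(codes, scale, w_min):
    """Map integer codes back to approximate float weights."""
    return [c * scale + w_min for c in codes]

weights = [0.12, -0.53, 0.97, 0.0, -1.0, 0.45]
codes, scale, w_min = quantize_block(weights)
restored = dequantize_block(codes, scale, w_min)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes)
print(f"max reconstruction error: {max_error:.4f}")
```

Each weight now needs only a few bits instead of 32, at the cost of a small rounding error bounded by half the scale. That trade-off is what makes a 13B model fit on a T4.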

In [None]:
# if below installation does not work, run this cell.
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [None]:
# install all the required packages for llama2
# GPU llama-cpp-python
# the installation may take some time.
# ignore the warnings
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.78 numpy==1.23.4 --force-reinstall --upgrade --no-cache-dir --verbose
!pip install huggingface_hub
!pip install llama-cpp-python==0.1.78
!pip install numpy==1.23.4

In [104]:
# model details
model_name_or_path = "TheBloke/Llama-2-13B-chat-GGML"
model_basename = "llama-2-13b-chat.ggmlv3.q5_1.bin"

In [105]:
# import all the required packages for the llama2 generator
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

In [106]:
# download the model
# this may take some time as it is fetching weights and model config from huggingface
model_path = hf_hub_download(repo_id=model_name_or_path, filename=model_basename)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


llama-2-13b-chat.ggmlv3.q5_1.bin:   0%|          | 0.00/9.76G [00:00<?, ?B/s]

In [None]:
# loading the model now
# may take 30-40 secs on tesla t4 colab gpu
lcpp_llm = None
lcpp_llm = Llama(
    model_path=model_path,
    n_threads=2, # CPU cores
    n_batch=512,
    n_gpu_layers=32
    )

In [108]:
# See the number of layers in GPU
lcpp_llm.params.n_gpu_layers

32

In [109]:
# prompt creation; the prompt can be optimized for a more accurate response if needed.
####################################################################################################
text1 = """Pakistan, officially the Islamic Republic of Pakistan, is a country in South Asia.
It is the fifth-most populous country, with a population of over 241.5 million, having the second-largest Muslim population as of 2023.
Islamabad is the nation's capital, while Karachi is its largest city and financial centre."""

text2 = '''
Karachi is the capital city of the Pakistani province of Sindh.
It is the largest city in Pakistan and the 12th largest in the world, with a population of over 20 million.
It is situated at the southern tip of the country along the Arabian Sea coast and formerly served as the capital of Pakistan
'''
text3 = '''
LangChain is a framework designed to simplify the creation of applications using large language models.
As a language model integration framework,
LangChain's use-cases largely overlap with those of language models in general,
including document analysis and summarization, chatbots, and code analysis.
'''


###################################################################################################
prompt_template=f'''SYSTEM: You are a helpful, respectful and honest assistant for QA data acquisition.
    Generate question and answer pairs using the information from the USER text below.
    Generate your own questions from the context below and then generate answers from the text for each question
    you generated.


USER: {text1}

ASSISTANT:
'''

In [110]:
# print prompt template
print(prompt_template)

SYSTEM: You are a helpful, respectful and honest assistant for QA data acquisition.
    Generate question and answer pairs using the information from the USER text below.
    Generate your own questions from the context below and then generate answers from the text for each question
    you generated.


USER: Pakistan, officially the Islamic Republic of Pakistan, is a country in South Asia.
It is the fifth-most populous country, with a population of over 241.5 million, having the second-largest Muslim population as of 2023.
Islamabad is the nation's capital, while Karachi is its largest city and financial centre.

ASSISTANT:



In [111]:
# now generate the response for the given prompt
# changing some parameters may improve the performance
# first time the generation may take time and then it will be a little faster.
response=lcpp_llm(prompt=prompt_template, max_tokens=256, temperature=0.5, top_p=0.95,
                  repeat_penalty=1.2, top_k=150,
                  echo=False)

In [112]:
# show the model completion
print(response["choices"][0]["text"])

What is Pakistan's official name?
Answer: Islamic Republic of Pakistan.

How many people live in Pakistan?
Answer: Over 241.5 million as of 2023.

What is the capital of Pakistan?
Answer: Islamabad.

What is the largest city and financial center of Pakistan?
Answer: Karachi.


In [113]:
# llama2 llm
def llama2(prompt):
    # language model processing logic
    response = lcpp_llm(prompt=prompt, max_tokens=256, temperature=0.4, top_p=0.95,
                  repeat_penalty=1.2, top_k=150,
                  echo=False)
    llm_completion = response["choices"][0]["text"]  # Retrieving generated text only

    return llm_completion


In [None]:
# install the langchain text splitter, as we want to split long articles into more manageable parts
!pip install -qU langchain-text-splitters

In [115]:
import csv


def find_questions_indices(data):
  '''
  find the indices of the first and last items that contain a question mark "?"

  parameters
  ----------
  data: list

  return:
  ------
  first_question_index: int (None if no question is found)
  last_question_index: int (None if no question is found)
  '''
  first_question_index = None
  last_question_index = None

  # Iterate through the list
  for i, item in enumerate(data):
      if '?' in item:
          # If first_question_index is None, set it to the current index
          if first_question_index is None:
              first_question_index = i
          # Update last_question_index to the current index
          last_question_index = i
  return (first_question_index, last_question_index)


def has_consecutive_ones(lst):
  '''
  find if the completion contains all the questions when no answer generated.

  parameters
  ----------
  lst: list

  return
  ------
  var: bool
  '''
  for i in range(len(lst) - 1):
      if lst[i] == 1 and lst[i + 1] == 1:
          return True
  return False

def post_process_text(llm_completion):
  '''
  process the llm completion into question answer pairs and append them to a csv file

  parameters
  ----------
  llm_completion: str

  return
  ------
  None
  '''
  qa_list = llm_completion.split('\n') # separate each line of the completion

  # remove the first entry, as it is just the model's opening line, not an actual qa
  qa_list.pop(0)
  # remove empty items
  qa_list = [item for item in qa_list if item]
  # find first and last question indices
  first_question_ind, last_question_ind = find_questions_indices(qa_list)
  # nothing to store if the completion contains no question at all
  if first_question_ind is None:
    return
  # get desired qa pairs: from the first question up to the answer after the last question
  qa_data = qa_list[first_question_ind: last_question_ind+2]
  # keep only questions and answers and strip any label or numbering in front of them
  data = [item[item.rfind(":") + 1:].strip() if ":" in  item
        else item[item.rfind(".") + 1:].strip() if "." in item
        else item[item.rfind(")") + 1:].strip() if ")" in item
        else item.strip() for item in qa_data]

  # two consecutive questions mean the model generated questions only: don't save those in the csv
  data_items = [1 if '?' in item else 0 for item in data]
  if has_consecutive_ones(data_items):
    return

  # create qa pairs
  print('writing output to csv..')
  # there should be an even number of items (question, answer, question, answer, ...)
  if len(data) % 2 != 0:
    data.pop(-1)
  qa_pairs = [(data[i], data[i+1]) for i in range(0, len(data), 2)]

  # write to csv file
  file_path = 'qa_pairs.csv'

  if not os.path.exists(file_path):
      with open(file_path, 'w', newline='', encoding='utf-8') as file:
          writer = csv.writer(file)
          # Write header row
          writer.writerow(['Question', 'Answer'])

  # append data
  with open(file_path, 'a', newline='', encoding='utf-8') as file:
      writer = csv.writer(file)
      writer.writerows(qa_pairs)
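
The index-based parsing in `post_process_text` depends on the completion keeping a strict question/answer alternation. As a hedged alternative, the sketch below pairs each line ending in `?` with the line that follows it and strips common labels with regular expressions. `parse_qa_pairs` and the two label patterns are assumptions made for illustration; they cover formats such as `Answer: ...`, `Q1: ...` and `1. ...`, not every format the model may produce.

```python
import re

# hypothetical label patterns: "Q:", "Q1)", "Question:", "1.", "2)" for questions
QUESTION_LABEL = re.compile(r"^(?:(?:Q|Question)\d*\s*[:)]|\d+[.)])\s*")
# "A:", "A1)", "Answer:" for answers (no bare numbering, so "241.5 million" stays intact)
ANSWER_LABEL = re.compile(r"^(?:A|Answer)\d*\s*[:)]\s*")

def parse_qa_pairs(completion):
    """Pair every line ending in '?' with the next non-empty line, if that line is not a question."""
    lines = [line.strip() for line in completion.split("\n") if line.strip()]
    pairs = []
    i = 0
    while i < len(lines) - 1:
        question, answer = lines[i], lines[i + 1]
        if question.endswith("?") and not answer.endswith("?"):
            pairs.append((QUESTION_LABEL.sub("", question), ANSWER_LABEL.sub("", answer)))
            i += 2  # skip past the answer that was just consumed
        else:
            i += 1  # not a question/answer pair, move on by one line
    return pairs

sample = """
What is Pakistan's official name?
Answer: Islamic Republic of Pakistan.

How many people live in Pakistan?
Answer: Over 241.5 million as of 2023.
"""
print(parse_qa_pairs(sample))
```

A completion that contains only questions yields an empty list, so nothing would be written to the CSV in that case.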

In [116]:
# use a logger to write the llm output to a file so we can inspect
# its completions. Sometimes it returns qa pairs and sometimes only questions
# this log file helps to understand the model behaviour

import logging
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

file_handler = logging.FileHandler(filename= "qag_end2end_logger.log")
formatter = logging.Formatter(fmt= "%(asctime)s: %(message)s", datefmt= '%Y-%m-%d %H:%M:%S')
file_handler.setFormatter(formatter)

logger.addHandler(file_handler)

In [117]:
# split the long passage into manageable chunks
# in case the context window size is limited.
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=500, # do change this to see the effect
    chunk_overlap=0,
    length_function=len,
    is_separator_regex=False,
)

def qa_generator(text):
  '''
  takes the paragraph text, generates the model completion and then post-processes the completion
  to generate QA data

  parameters
  ----------
  text: str

  return
  ------
  None
  '''

  # prompt
  prompt='''SYSTEM: You are a helpful, respectful and honest assistant for QA data acquisition.
    Generate question and answer pairs using the information from the USER text below.
    Generate your own questions from the context below and then generate answers from the text for each question
    you generated.
  '''
  # complete prompt: the text is inserted into the prompt
  prompt_template = prompt + f"\n\nUSER: {text} \n\nASSISTANT:"
  # invoke llm (llama2)
  generated_text = llama2(prompt_template)
  # log the model completion for now
  logger.info(generated_text)
  # process the completion and write the qa pairs in a csv file
  post_process_text(generated_text)



def create_qa(dataset, text_splitter):
  '''
  run the llm over the json entries and store the responses in a csv file.
  articles that have been processed are flagged so they are not processed again

  parameters
  ----------
  dataset: list[dict]
  text_splitter: text splitter object from langchain

  return:
  ------
  None
  '''
  # iterate over all the json entries
  for data in dataset:
    # don't process articles that have already been processed by the llm
    if data.get('processed_article'):
      print('News article already processed!')
      continue
    # retrieve text
    text = data['text']
    # chunk the text, as the text of each article is fairly long
    # longer text takes more time, and some llms can't process it all at once because of their limited context size
    # split into 500 chars with 0 overlap
    chunks_list = text_splitter.split_text(text)

    # iterate over each chunk
    for text_chunk in chunks_list:
      # generate qa
      qa_generator(text_chunk)
    # this flag helps identify the articles already processed by an llm
    # so we don't process them again.
    data['processed_article'] = True
    # overwrite the updated json file, so it can be read again to resume
    # processing after a notebook crash
    with open('dawn_pakistan_processed', 'w') as file:
      json.dump(dataset, file, indent = 4)
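
To make the chunking step concrete without LangChain, here is a simplified sketch that greedily packs whole sentences into chunks of at most `chunk_size` characters. It is not the algorithm used by `RecursiveCharacterTextSplitter`; `split_into_chunks` is a hypothetical helper, and it assumes sentences are separated by whitespace after `.`, `!` or `?`.

```python
import re

def split_into_chunks(text, chunk_size=500):
    """Greedily pack sentences into chunks no longer than `chunk_size` characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # hard-split any single sentence that is longer than the limit
        while len(sentence) > chunk_size:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:chunk_size])
            sentence = sentence[chunk_size:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= chunk_size:
            current = candidate      # the sentence still fits in the current chunk
        else:
            chunks.append(current)   # close the current chunk and start a new one
            current = sentence
    if current:
        chunks.append(current)
    return chunks

example = "Alpha beta. " * 30
for chunk in split_into_chunks(example, chunk_size=50):
    print(len(chunk), chunk)
```

Keeping sentences whole tends to give the QA model more self-contained context than cutting at a fixed character offset, at the cost of slightly uneven chunk lengths.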


In [118]:
# create a prompt template to be used later in the generation stage
# feel free to improve it.
# prompt test
prompt='''SYSTEM: You are a helpful, respectful and honest assistant for QA data acquisition.
Generate question and answer pairs using the information from the USER text below.
Generate your own questions from the context below and then generate answers from the text for each question
you generated.
'''
prompt_template = prompt + f"\n\nUSER: {text1} \n\nASSISTANT:"
print(prompt_template)

SYSTEM: You are a helpful, respectful and honest assistant for QA data acquisition.
Generate question and answer pairs using the information from the USER text below.
Generate your own questions from the context below and then generate answers from the text for each question
you generated.


USER: Pakistan, officially the Islamic Republic of Pakistan, is a country in South Asia.
It is the fifth-most populous country, with a population of over 241.5 million, having the second-largest Muslim population as of 2023.
Islamabad is the nation's capital, while Karachi is its largest city and financial centre. 

ASSISTANT:


In [119]:
# read the processed json file (the preprocessed dataset, i.e. no special chars and urls)

# must run this cell to use dataset later
preprocessed_data = 'dawn_pakistan_processed.json'
with open(preprocessed_data) as f:
  dataset = json.load(f)
dataset[0]['text']

'Seven soldiers including two officers were martyred in a terrorist attack on a security forces post in the Mir Ali area of North Waziristan district, the InterServices Public Relations ISPR said on Saturday.Six terrorists were also neutralised in the subsequent clearance operation.According to the militarys media affairs wing, in the early hours of March 16, 2024, a group of six terrorists attacked a security forces post in the general area of Mir Ali of North Waziristan District.As own troops foiled the initial attempt of intrusion, the terrorists rammed an explosivesladen vehicle into the post, followed by multiple suicide bombing attacks, which led to the collapse of a portion of a building, resulting into martyrdom of five brave sons of the soil, ISPR added.The martyrs of the initial attack include Havildar Sabir, a resident of District Khyber, Naik Khurshid resident of District Lakki Marwat, Sepoy Nasir resident of District Peshawar, Sepoy Raja resident of District Kohat and Sepo

In [120]:
# run the pipeline
# llama2 is too slow here, so it is not practical for qa generation
run_llama2 = False
if run_llama2:
  # use the preprocessed data and the text splitter to generate qa
  create_qa(dataset, text_splitter)

In [None]:
# some of the data generated by llama2
import pandas as pd
df = pd.read_csv('qa_pairs.csv')
df.head()

|   | Question | Answer |
|--:|:---------|:-------|
| 0 | How many soldiers were martyred in the terrori... | Seven soldiers including two officers were mar... |
| 1 | How many terrorists were neutralized in the cl... | Six terrorists were neutralized in the subsequ... |
| 2 | Where did the terrorist attack take place? | The terrorist attack took place in the Mir Ali... |
| 3 | What was the date of the terrorist attack? | According to the military's media affairs wing... |
| 4 | How many terrorists attacked the security forc... | A group of six terrorists attacked a security ... |


**Key takeaways for using Llama2 for QA generation:**
- Llama2 needs a carefully tuned prompt for good QA generation
- Llama2 is slow: a completion for a 500-character text takes 1-2 mins (70-80 secs on average per paragraph)
- The completions generated by Llama2 are not uniform in format
- A lot of post-processing is required to extract clean QA pairs from a completion
- Not scalable to large datasets (maybe good for small datasets)
- Sometimes it generates only questions and no answers
- Responses are rarely inaccurate
- Not an ideal model for QA generation
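
Timings like the ones above can be measured with a small helper. This is a minimal sketch; `timed` is a hypothetical name, and the built-in `sum` stands in for the real call so the block runs on its own.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in example; in this notebook the call could be `timed(llama2, prompt_template)`.
result, seconds = timed(sum, [1, 2, 3])
print(f"result={result}, took {seconds:.6f} s")
```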

**QAG with End2end**

- The end2end QAG models are fine-tuned to generate a list of QA pairs with a single inference, which makes them the fastest class among QAG models.
- No need to fine-tune a prompt for the model
- Faster than Llama2: takes 2-4 secs for each paragraph



| Model                                                                                                                                                        | Data                                                               | Type          | Language Model                                                                  | Language | QAAlignedF1Score (BERTScore) | QAAlignedF1Score (MoverScore) | QAAlignedPrecision (BERTScore) | QAAlignedPrecision (MoverScore) | QAAlignedRecall (BERTScore) | QAAlignedRecall (MoverScore) |
|:-------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------|:--------------|:--------------------------------------------------------------------------------|:---------|-----------------------------:|------------------------------:|-------------------------------:|--------------------------------:|----------------------------:|-----------------------------:|
| [`lmqg/t5-small-squad-qag`](https://huggingface.co/lmqg/t5-small-squad-qag)                                                                                  | [`lmqg/qg_squad`](https://huggingface.co/datasets/lmqg/qg_squad)   | End2end QAG   | [`t5-small`](https://huggingface.co/t5-small)                                   | English  |                        92.76 |                         64.59 |                          92.87 |                           65.30 |                       92.68 |                        63.99 |



The model is computationally light at both training and inference times, is generally robust, and outperforms other more convoluted approaches. Finally, the paper shows that QA models fine-tuned solely on generated question-answer pairs can be competitive when compared to supervised QA models trained on human-labeled data.

paper: https://arxiv.org/pdf/2305.17002.pdf


In [79]:
# installation (it may take ~ 1.5 mins)
!pip install lmqg

In [80]:
# import TransformersQG
from lmqg import TransformersQG

In [None]:
# initialize the model for English and select the fine-tuned t5-base-squad-qag model
model = TransformersQG(language='en', model='lmqg/t5-base-squad-qag')

In [82]:
# paragraph to generate pairs of question and answer
# sample
context = '''Karachi is the capital city of the Pakistani province of Sindh.
It is the largest city in Pakistan and the 12th largest in the world, with a population of over 20 million.
It is situated at the southern tip of the country along the Arabian Sea coast and formerly served as the capital of Pakistan.
'''
# model prediction
question_answer = model.generate_qa(context)
# the output is a list of tuple (question, answer)
print(question_answer)

100%|██████████| 1/1 [00:00<00:00, 455.21it/s]


[('What is the capital city of Sindh?', 'Karachi'), ('What is the population of Karachi?', 'over 20 million'), ('Where is Karachi located?', 'southern tip of the country along the Arabian Sea coast')]


Cool! This fine-tuned model works really well and the inference time is
small, so let's use it for QA generation here.

In [None]:
import csv
def qag_generator(text):
  '''
  generate question answer and store them in a csv file

  parameters
  ----------
  text: str
  '''
  question_answer_pairs = model.generate_qa(text)
  logger.info(question_answer_pairs)

  # write to csv file
  file_path = 'qa_pairs_new_t5_small.csv'

  if not os.path.exists(file_path):
      with open(file_path, 'w', newline='', encoding='utf-8') as file:
          writer = csv.writer(file)
          # Write header row
          writer.writerow(['Question', 'Answer'])

  # append data
  with open(file_path, 'a', newline='', encoding='utf-8') as file:
      writer = csv.writer(file)
      writer.writerows(question_answer_pairs)
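
The model can return the same pair more than once for a single chunk, as the logged output of the run shows. A hedged option is to drop repeats before writing them to the CSV; `dedupe_pairs` below is a hypothetical helper, and comparing pairs case-insensitively is an assumption about what should count as a duplicate.

```python
def dedupe_pairs(pairs):
    """Remove repeated (question, answer) tuples while preserving their order."""
    seen = set()
    unique = []
    for question, answer in pairs:
        # normalise whitespace and case so trivially different copies collapse
        key = (question.strip().lower(), answer.strip().lower())
        if key not in seen:
            seen.add(key)
            unique.append((question, answer))
    return unique

pairs = [
    ("How many terrorists attacked a security forces post?", "six"),
    ("What district is Mir Ali located in?", "North Waziristan"),
    ("How many terrorists attacked a security forces post?", "six"),
]
print(dedupe_pairs(pairs))
```

Inside `qag_generator`, this could be applied as `writer.writerows(dedupe_pairs(question_answer_pairs))`. Duplicates across different chunks would still need the pandas clean-up step.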


In [None]:
def QAG(dataset, text_splitter):
  """
  Iterate over the JSON entries, generate QA pairs, and store them in a CSV file.

  parameters
  ----------
  dataset: list of dict
      news articles loaded from the JSON file; each entry has a 'text' key
  text_splitter
      splitter object exposing split_text(text)

  return
  ------
  None
  """
  # iterate over all the JSON entries
  for data in dataset:
    # skip articles that have already been processed by the LLM
    # (entries whose flag is missing or False are still processed)
    if data.get('processed_article'):
      logger.info('News article already processed!')
      continue

    # retrieve text
    text = data['text']
    # chunk the text because each article is fairly long: longer text takes more time,
    # and some LLMs cannot process it all at once because of their limited context window
    # split into 500 chars with 0 overlap
    chunks_list = text_splitter.split_text(text)

    # iterate over each chunk
    for text_chunk in chunks_list:
      # generate QA pairs
      qag_generator(text_chunk)
    # this flag marks the article as already processed by an LLM
    # so we don't repeat it.
    data['processed_article'] = True
    # overwrite the updated JSON file so that, if the notebook crashes,
    # processing can resume from this file
    with open('dawn_pakistan_processed_t5_small', 'w') as file:
      json.dump(dataset, file, indent = 4)
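
The comments in `QAG` describe splitting each article into 500-character chunks with no overlap. The `text_splitter` object itself is defined earlier in the notebook; the sketch below is only a hypothetical stand-in (`SimpleCharSplitter` is not a library class) that shows the idea in plain Python, breaking on whitespace where possible.

```python
class SimpleCharSplitter:
    """Hypothetical stand-in for the notebook's text_splitter (assumption)."""

    def __init__(self, chunk_size=500):
        if chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        self.chunk_size = chunk_size

    def split_text(self, text):
        """Split text into chunks of at most chunk_size characters, no overlap."""
        chunks, current = [], ""
        for word in text.split():
            # a single word longer than chunk_size is cut into fixed-size pieces
            while len(word) > self.chunk_size:
                if current:
                    chunks.append(current)
                    current = ""
                chunks.append(word[:self.chunk_size])
                word = word[self.chunk_size:]
            candidate = f"{current} {word}" if current else word
            if len(candidate) <= self.chunk_size:
                current = candidate
            else:
                # adding this word would overflow the chunk, so start a new one
                chunks.append(current)
                current = word
        if current:
            chunks.append(current)
        return chunks


splitter = SimpleCharSplitter(chunk_size=10)
chunks = splitter.split_text("aaa bbb ccc ddd eee")
print(chunks)
```

Whitespace is normalized in this sketch, so the chunks can be rejoined with single spaces to recover the words in order.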

In [None]:
# now run end2end QAG
run_qag_model = False
if run_qag_model:
  QAG(dataset, text_splitter)

100%|██████████| 1/1 [00:00<00:00, 364.06it/s]
INFO:__main__:[('How many soldiers were martyred in a terrorist attack on a security forces post?', 'Seven'), ('How many terrorists attacked a security forces post?', 'six'), ('What district is Mir Ali located in?', 'North Waziristan'), ('How many soldiers were martyred?', 'seven'), ('How many officers were martyred?', 'two'), ('How many terrorists attacked a security forces post?', 'six')]
100%|██████████| 1/1 [00:00<00:00, 365.10it/s]
INFO:__main__:[('What did the terrorists do after ramming an explosivesladen vehicle into the post?', 'multiple suicide bombing attacks'), ('How many sons of the soil were martyred?', 'five brave sons of the soil'), ('Who were the martyrs of the initial attack?', 'Havildar Sabir, a resident of District Khyber, Naik Khurshid resident of District Lakki Marwat, Sepoy Nasir resident of District Peshawar, Sepoy Raja resident of District Kohat'), ('Who were the martyrs of the initial attack?', 'Havildar Sabir, a 

In [None]:
# remove duplicate entries from the CSV, if any exist
import pandas as pd

save_file = True
if save_file:
  df = pd.read_csv('qa_pairs_new.csv')

  # Remove duplicate rows
  df.drop_duplicates(inplace=True)

  # Write the cleaned data back to a CSV file
  df.to_csv('cleaned_file.csv', index=False)
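
The T5 log output above contains repeated pairs (the question about how many terrorists attacked the post appears twice), which is why duplicates are dropped. `drop_duplicates` only removes rows that match exactly. A minimal sketch of a slightly looser check that also ignores case and surrounding whitespace while preserving order; `dedupe_qa_pairs` is a hypothetical helper and the sample pairs are illustrative only:

```python
def dedupe_qa_pairs(pairs):
    """Drop repeated (question, answer) pairs, keeping the first occurrence."""
    seen = set()
    unique = []
    for question, answer in pairs:
        # normalize case and whitespace only for the comparison key
        key = (question.strip().lower(), answer.strip().lower())
        if key not in seen:
            seen.add(key)
            unique.append((question, answer))
    return unique


# illustrative sample, modeled on the kind of repeats seen in the T5 output
pairs = [
    ("How many soldiers were martyred?", "Seven"),
    ("How many terrorists attacked a security forces post?", "six"),
    ("How many soldiers were martyred?", "seven"),
    ("How many terrorists attacked a security forces post?", "six"),
]
unique_pairs = dedupe_qa_pairs(pairs)
print(unique_pairs)
```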

In [None]:
df = pd.read_csv('cleaned_file.csv')
df.tail()

Unnamed: 0,Question,Answer
448,Who has sent Ibrahim K death threats?,internet trolls
449,What is Ibrahim K's maximum sentence?,four years
450,On what day was the hashtag trending on X in T...,Thursday
451,Who is the star striker at Istanbul giants?,Mauro Icardi
452,Who is the reigning Turkish champions of Galat...,Galatasaray


Streaming the LLM output with a Gradio interface

In [75]:
!pip install gradio

In [83]:
def end2end_qag(context):
    # return qa pairs for the context provided
    question_answer = model.generate_qa(context)
    formatted_text = ""
    for question, answer in question_answer:
        formatted_text += f"question: {question}\nanswer: {answer}\n\n"

    return formatted_text

**Demo**

In [84]:
# enable this cell to see gradio app interface (OPTIONAL)
# run the machine learning application in real time for question answering
import gradio as gr
enable = True
if enable:
  iface = gr.Interface(
      fn=end2end_qag,
      inputs=[
          gr.Textbox(label="Insert a paragraph here", lines=6),
      ],
      outputs=[gr.Textbox(label="QA Generation", lines=3)],
      title="LLM-based question-answer pair generator",
      examples=["generate question answer pairs from the dataset!"]
  )
  iface.launch(share=True)

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://d470cec757b2208507.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


In [None]:
# to close the app if gradio interface is running
iface.close()

Closing server running on port: 7861


**OpenAI's GPT-3.5 Turbo**

GPT-3.5 Turbo models can understand and generate natural language or code and have been optimized for chat using the Chat Completions API but work well for non-chat tasks as well.

**gpt-3.5-turbo-0125**
The latest GPT-3.5 Turbo model, with higher accuracy at responding in requested formats and a fix for a bug that caused a text encoding issue for non-English language function calls. It returns a maximum of 4,096 output tokens.

In [85]:
#installation
!pip install openai

In [86]:
import pandas as pd
from openai import OpenAI

In [87]:
text = """
Pakistan, officially the Islamic Republic of Pakistan, is a country in South Asia.
It is the fifth-most populous country, with a population of over 241.5 million,
having the second-largest Muslim population as of 2023.
Islamabad is the nation's capital, while Karachi is its largest city and financial centre.
"""

In [149]:
# settings
api_key = ""
system_requirements = "You are a helpful and honest assistant for QA pairs generation."
# prompt
prompt = f"""
Please Generate Question Answer pairs from text below. The format of QA pairs should
be question first followed by an answer. An example of the QA format is shown below:

Question: your question generated from text?
Answer: your answer generated from text.

Please generate 3 to 5 QA pairs from each text and keep your answer as concise as possible.

here is text: {text}"""

model = "gpt-3.5-turbo-0125"    #gpt-3.5-turbo-0125, gpt-4-0125-preview, etc

In [89]:

client = OpenAI(api_key=api_key)

def GPT35Turbo(prompt, model, system_requirements):
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_requirements},
                {"role": "user", "content": prompt}
            ]
        )
        return response.choices[0].message.content
    except Exception as e:
        return f"An error occurred: {e}"

response = GPT35Turbo(prompt, model, system_requirements)
print(response)

Question: What is the official name of Pakistan?
Answer: The official name of Pakistan is the Islamic Republic of Pakistan.

Question: Where is Pakistan located?
Answer: Pakistan is located in South Asia.

Question: What is the population of Pakistan?
Answer: Pakistan has a population of over 241.5 million people.

Question: What is the largest city in Pakistan?
Answer: Karachi is the largest city and financial center of Pakistan.

Question: Which city serves as the capital of Pakistan?
Answer: Islamabad is the capital city of Pakistan.


In [90]:
def gpt_turbo(text):
  """
  Use OpenAI's gpt-3.5-turbo-0125 model for QA generation.

  parameters
  ----------
  text: context (str)

  return
  ------
  response: QA pairs as a single string, or an error message starting with
            "An error occurred" if the API call fails
  """
  model =  "gpt-3.5-turbo-0125"

  system_requirements = "You are a helpful and honest assistant for QA pairs generation."

  # prompt
  prompt = f"""
  Please Generate Question Answer pairs from text below. The format of QA pairs should
  be question first followed by an answer. An example of the QA format is shown below:

  Question: your question generated from text?
  Answer: your answer generated from text.

  Please generate 3 to 5 QA pairs from each text and keep your answer as concise as possible.

  here is text: {text}"""

  try:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_requirements},
            {"role": "user", "content": prompt}
        ]
    )
    return response.choices[0].message.content
  except Exception as e:
    return f"An error occurred: {e}"
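
`gpt_turbo` returns an error string when the API call fails, so a transient failure (a rate limit or a network timeout, for example) means that chunk is simply skipped. A minimal sketch of retrying with exponential backoff instead; `with_retries` and its parameters are hypothetical names, and the wrapped callable here is a dummy rather than a real API call:

```python
import time


def with_retries(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on an exception wait base_delay * 2**attempt and try again.

    Re-raises the last exception once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))


# dummy callable that fails twice and then succeeds (illustration only)
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "Question: ...\nAnswer: ..."


result = with_retries(flaky, base_delay=0.01)
print(calls["count"], "attempts")
```

In the notebook this could wrap the API call, for instance `with_retries(lambda: gpt_turbo(text_chunk))`, provided `gpt_turbo` is changed to raise instead of returning an error string.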

In [143]:
def process_save_qa_data(gpt_completion):
  # process the response and save it in a csv file

  # split the completion into lines
  # logger.info(f'completion: {gpt_completion}')
  qa_list = gpt_completion.split('\n')

  # keep only non-empty lines that contain a "label: content" separator;
  # a line without a colon would otherwise raise an IndexError below
  qa_lines = [item for item in qa_list if item.strip() and ":" in item]

  # keep only the question and answer text and drop the "Question:" / "Answer:" labels
  data = [item.split(":", 1)[1].strip() for item in qa_lines]

  # drop a trailing question that has no answer, then create qa pairs
  if len(data) % 2 != 0:
      data.pop(-1)
  qa_pairs = [(data[i], data[i+1]) for i in range(0, len(data), 2)]

  # now save to a csv file
  file_path = 'qa_pairs_gpt35_turbo1.csv'

  if not os.path.exists(file_path):
      with open(file_path, 'w', newline='', encoding='utf-8') as file:
          writer = csv.writer(file)
          # Write header row
          writer.writerow(['Question', 'Answer'])

  # append data
  with open(file_path, 'a', newline='', encoding='utf-8') as file:
      writer = csv.writer(file)
      writer.writerows(qa_pairs)
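
Splitting on the first colon assumes that every non-empty line of the completion is a `Question:` or `Answer:` line. A minimal sketch of a stricter parser that pairs each labeled question with the answer that follows it and ignores anything else; `parse_qa_pairs` is a hypothetical helper, not part of the notebook, and the sample completion is made up from the output format shown earlier:

```python
import re

# matches lines such as "Question: ..." or "Answer: ..." (case-insensitive)
LINE_RE = re.compile(r"^\s*(question|answer)\s*:\s*(.+?)\s*$", re.IGNORECASE)


def parse_qa_pairs(completion):
    """Return a list of (question, answer) tuples found in the completion."""
    pairs = []
    pending_question = None
    for line in completion.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip blank lines and any extra commentary
        label, content = match.group(1).lower(), match.group(2)
        if label == "question":
            pending_question = content
        elif pending_question is not None:
            pairs.append((pending_question, content))
            pending_question = None
    return pairs


sample = """Here are the pairs:

Question: Where is Pakistan located?
Answer: Pakistan is located in South Asia.

Question: Which city serves as the capital of Pakistan?
Answer: Islamabad is the capital city of Pakistan.
Question: An unanswered question?"""
parsed = parse_qa_pairs(sample)
print(parsed)
```

The introductory line and the trailing unanswered question are both dropped, so only complete pairs reach the CSV.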

In [144]:
# now write a function that stores the responses in a csv file.
def generate_qa_gpt(dataset, text_splitter):
  """
  Iterate over the JSON entries, generate QA pairs with GPT-3.5 Turbo, and store them in a CSV file.

  parameters
  ----------
  dataset: list of dict
      news articles loaded from the JSON file; each entry has a 'text' key
  text_splitter
      splitter object exposing split_text(text)

  return
  ------
  None
  """
  # iterate over all the JSON entries
  for data in dataset:
    # skip articles that have already been processed by the LLM
    # (entries whose flag is missing or False are still processed)
    if data.get('processed_article'):
      logger.info('News article already processed!')
      continue

    # retrieve text
    text = data['text']
    # chunk the text because each article is fairly long: longer text takes more time,
    # and some LLMs cannot process it all at once because of their limited context window
    # split into 500 chars with 0 overlap
    chunks_list = text_splitter.split_text(text)

    # iterate over each chunk
    for text_chunk in chunks_list:
      # generate QA pairs; skip chunks for which the API call failed
      completion = gpt_turbo(text_chunk)
      if not completion.startswith("An error occurred"):
        process_save_qa_data(completion)

    # this flag marks the article as already processed by an LLM
    # so we don't repeat it.
    data['processed_article'] = True
    logger.info('An article has been processed!!')
    # overwrite the updated JSON file after every article. If the notebook crashes
    # and you have to restart, use 'dawn_pakistan_processed_recent_gpt35'
    # as the starting point.
    with open('dawn_pakistan_processed_recent_gpt35', 'w') as file:
      json.dump(dataset, file, indent = 4)
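
The `processed_article` flag, together with rewriting the JSON file after every article, is what makes the run resumable after a crash. A minimal, self-contained sketch of that checkpoint pattern; `save_checkpoint`, `process_with_checkpoint`, `handle_article`, and the temporary-file rename are assumptions added for illustration (the rename avoids leaving a half-written file if the notebook dies mid-write):

```python
import json
import os
import tempfile


def save_checkpoint(dataset, path):
    """Write dataset to path via a temporary file, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as file:
        json.dump(dataset, file, indent=4)
    os.replace(tmp_path, path)


def process_with_checkpoint(dataset, path, handle_article):
    """Run handle_article on every unprocessed entry, saving after each one."""
    processed_now = 0
    for entry in dataset:
        if entry.get("processed_article"):
            continue  # finished in an earlier run
        handle_article(entry["text"])
        entry["processed_article"] = True
        save_checkpoint(dataset, path)
        processed_now += 1
    return processed_now
```

On a restart, loading the checkpoint file and calling `process_with_checkpoint` again only touches the entries that were not finished.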


In [145]:
# processing 15 news articles took ~5 minutes
run_gpt = True
if run_gpt:
  generate_qa_gpt(dataset, text_splitter)

INFO:__main__:News article already processed!
INFO:__main__:An article has been processed!!
INFO:__main__:An article has been processed!!
INFO:__main__:An article has been processed!!
INFO:__main__:An article has been processed!!
INFO:__main__:An article has been processed!!
INFO:__main__:An article has been processed!!
INFO:__main__:An article has been processed!!
INFO:__main__:An article has been processed!!
INFO:__main__:An article has been processed!!
INFO:__main__:An article has been processed!!
INFO:__main__:An article has been processed!!
INFO:__main__:An article has been processed!!
INFO:__main__:An article has been processed!!
INFO:__main__:An article has been processed!!


In [147]:
# qa generated by gpt3.5 turbo
df = pd.read_csv('qa_pairs_gpt35_turbo1.csv')
df.head()

Unnamed: 0,Question,Answer
0,Who was the jailed PTI founder whose pre-arres...,Imran Khan.
1,Which court extended Imran Khan's pre-arrest b...,ATCIII.
2,What cases were Imran Khan's bail extended for...,"Attacks on Jinnah House, Askari Tower, and Sha..."
3,When was the hearing for Imran Khan's bail ext...,March 22.
4,Why did the ATCIII judge adjourn the hearing f...,The presiding judge of ATCI had been transferred.


In [148]:
df.tail()

Unnamed: 0,Question,Answer
548,What action is one animal rights group plannin...,One animal rights group is planning to appeal ...
549,What was trending in Turkey related to Ibrahim...,The hashtag related to Ibrahim K was trending ...
550,Who has joined the Justice for Eros appeal?,"Several celebrities, including Argentinian foo..."
551,What team does Mauro Icardi play for?,Mauro Icardi is the star striker at Istanbul g...
552,What is happening with the comments section?,The comments section is undergoing an overhaul...


In [None]:
#the end

**Conclusion:**
- Llama 2 is slow but generates concise responses.
- T5 sometimes generates shorter, repeated QA pairs; however, it has low latency and is fine-tuned for end-to-end QA generation.
- T5 small needs no carefully tuned prompt, because it is fine-tuned specifically to generate QA pairs.
- GPT-3.5 Turbo is faster and accurate; however, it costs $0.0005 / 1K input tokens and $0.0015 / 1K output tokens.
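
To put the GPT-3.5 Turbo prices quoted above in perspective, the cost scales linearly with the token counts. A back-of-the-envelope sketch; the token counts in the example are made-up placeholders, not measurements from this notebook:

```python
INPUT_PRICE_PER_1K = 0.0005   # USD per 1K input tokens (from the conclusion)
OUTPUT_PRICE_PER_1K = 0.0015  # USD per 1K output tokens (from the conclusion)


def estimate_cost(input_tokens, output_tokens):
    """Estimate the USD cost of a run from its total token counts."""
    input_cost = (input_tokens / 1000) * INPUT_PRICE_PER_1K
    output_cost = (output_tokens / 1000) * OUTPUT_PRICE_PER_1K
    return input_cost + output_cost


# hypothetical run: 100 chunks, ~250 prompt tokens and ~150 completion tokens each
cost = estimate_cost(100 * 250, 100 * 150)
print(f"estimated cost: ${cost:.4f}")
```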