<a href="https://colab.research.google.com/github/fabiomatricardi/TheRiseOfTheCuriourAI/blob/main/TheRiseOfTheCuriourAI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Question Generation using 🤗transformers

SOURCE:
https://github.com/patil-suraj/question_generation
<br>


---





This project is an open-source study of question generation with pre-trained transformers (specifically seq2seq models), using straightforward end-to-end methods rather than complicated pipelines. The goal is to provide simplified data-processing and training scripts, along with easy-to-use pipelines for inference.

### Multitask QA-QG
Answer-aware question generation usually needs three models: the first extracts answer-like spans, the second generates a question for each answer, and the third is a QA model that takes the question and produces an answer. Comparing the two answers then shows whether the generated question is correct.

Three models for a single task is a lot of complexity, so the goal is to create one multi-task model that can perform all three tasks:

- extract answer-like spans
- generate a question based on the answer
- QA (question answering)

The T5 model is fine-tuned in a multi-task way, using task prefixes as described in the paper.
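
As a rough illustration of the task-prefix idea, the sketch below builds one input string per task. The prefix strings and the `build_input` helper are assumptions made for this example only; the exact strings used during fine-tuning are defined in the source repository and may differ.

```python
from typing import Optional

# Hypothetical prefixes, one per task (assumed for illustration).
TASK_PREFIXES = {
    "extract": "extract answers: ",
    "generate": "generate question: ",
    "qa": "question: ",
}


def build_input(task: str, context: str, question: Optional[str] = None) -> str:
    """Prepend a task prefix so a single model can tell the tasks apart."""
    if task not in TASK_PREFIXES:
        raise ValueError(f"unknown task: {task}")
    if task == "qa":
        if question is None:
            raise ValueError("the QA task needs a question")
        # QA input carries both the question and the context.
        return f"{TASK_PREFIXES[task]}{question} context: {context}"
    return TASK_PREFIXES[task] + context


context = "Python was created by Guido van Rossum."
print(build_input("extract", context))
print(build_input("qa", context, question="Who created Python?"))
```

The point of the design is that the task is selected by the text of the input, not by swapping model weights.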

<img src="https://camo.githubusercontent.com/c96395d16bf1363c4d6472c346ede0dbd7e9acdfec17250dcc0e1775b9c6d1e1/68747470733a2f2f692e6962622e636f2f544253336e73722f74352d73732d322e706e67" width=600>

### End-to-End question generation (answer agnostic)
In end-to-end question generation, the model is asked to generate questions without being given the answers. This paper discusses these ideas in more detail. Here the T5 model is trained to generate multiple questions simultaneously from the context alone. The questions are separated by the `<sep>` token. Here is how the examples are processed:

input text: `Python is a programming language. Created by Guido van Rossum and first released in 1991.`

target text: `Who created Python ? <sep> When was python released ? <sep>`
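
A minimal sketch of this target format, assuming each question is simply followed by the separator. The helper names `build_target` and `split_questions` are hypothetical and are not taken from the repository.

```python
SEP = "<sep>"


def build_target(questions):
    """Join questions into one training target, each followed by the separator."""
    return " ".join(f"{q} {SEP}" for q in questions)


def split_questions(generated):
    """Split generated text back into individual questions, dropping empty pieces."""
    return [piece.strip() for piece in generated.split(SEP) if piece.strip()]


target = build_target(["Who created Python ?", "When was python released ?"])
print(target)
print(split_questions(target))
```

Splitting on the separator and discarding empty pieces also handles the trailing `<sep>` at the end of the target.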

All the training details can be found in this wandb project.

# Main Repo

## Install all the required libraries

In [1]:
%%capture
!pip install transformers
!pip install nltk
!pip install torch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0
!pip install langchain
!pip install sentencepiece

### Download question generation pipeline file

In [2]:
%%capture
!wget https://github.com/patil-suraj/question_generation/raw/master/pipelines.py

### Download some text files for examples

In [3]:
%%capture
!wget https://github.com/fabiomatricardi/Abstractive-Extractive/raw/main/BERTexplanation.txt
!wget https://github.com/fabiomatricardi/Abstractive-Extractive/raw/main/Text%20Summarization%20with%20NLP-%20TextRank%20vs%20Seq2Seq%20vs%20BART.txt
!wget https://github.com/fabiomatricardi/Abstractive-Extractive/raw/main/AutomaticTextSummarization.txt
!wget https://github.com/fabiomatricardi/Abstractive-Extractive/raw/main/GPT4VsLima.txt
!wget https://github.com/fabiomatricardi/Abstractive-Extractive/raw/main/nlp-basics-abstractive-and-extractive-text-summarization.txt

## Download the model weights (torch version)
### Download the model valhalla/t5-small-e2e-qg locally and move it into the `t5-small-e2e-qg` directory

In [4]:
%%capture
!wget https://huggingface.co/valhalla/t5-small-e2e-qg/resolve/main/added_tokens.json
!wget https://huggingface.co/valhalla/t5-small-e2e-qg/resolve/main/config.json
!wget https://huggingface.co/valhalla/t5-small-e2e-qg/resolve/main/pytorch_model.bin
!wget https://huggingface.co/valhalla/t5-small-e2e-qg/resolve/main/special_tokens_map.json
!wget https://huggingface.co/valhalla/t5-small-e2e-qg/resolve/main/spiece.model
!wget https://huggingface.co/valhalla/t5-small-e2e-qg/resolve/main/tokenizer_config.json
!wget https://huggingface.co/valhalla/t5-small-e2e-qg/resolve/main/training_args.bin

!mkdir -p /content/t5-small-e2e-qg
!mv /content/added_tokens.json /content/t5-small-e2e-qg/added_tokens.json
!mv /content/config.json  /content/t5-small-e2e-qg/config.json
!mv /content/pytorch_model.bin  /content/t5-small-e2e-qg/pytorch_model.bin
!mv /content/special_tokens_map.json  /content/t5-small-e2e-qg/special_tokens_map.json
!mv /content/spiece.model  /content/t5-small-e2e-qg/spiece.model
!mv /content/tokenizer_config.json  /content/t5-small-e2e-qg/tokenizer_config.json
!mv /content/training_args.bin  /content/t5-small-e2e-qg/training_args.bin
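
The download and move steps can also be expressed as a loop. The sketch below is one possible way to do it with the standard library only; `build_plan` and `download_all` are hypothetical helper names, and the URL pattern is copied from the `wget` commands in this cell. The download call itself is left commented out so the block can run without network access.

```python
from pathlib import Path
from urllib.request import urlretrieve

REPO_ID = "valhalla/t5-small-e2e-qg"
FILES = [
    "added_tokens.json",
    "config.json",
    "pytorch_model.bin",
    "special_tokens_map.json",
    "spiece.model",
    "tokenizer_config.json",
    "training_args.bin",
]


def build_plan(repo_id, filenames, target_dir):
    """Pair each file's download URL with its destination path."""
    base = f"https://huggingface.co/{repo_id}/resolve/main"
    return [(f"{base}/{name}", Path(target_dir) / name) for name in filenames]


def download_all(plan):
    """Fetch every file in the plan straight into its destination."""
    for url, dest in plan:
        dest.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(url, str(dest))


plan = build_plan(REPO_ID, FILES, "t5-small-e2e-qg")
for url, dest in plan:
    print(url, "->", dest)
# download_all(plan)  # uncomment in the notebook to actually download
```

Writing each file directly to its final location removes the need for the separate `mv` commands.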

In [5]:
#@title Restart Runtime {display-mode: "form"}
import ipywidgets as widgets
def restart(b):
  exit()

button2 = widgets.Button(
    description='Restart Runtime',
    disabled=False,
    button_style='warning', # 'success', 'info', 'warning', 'danger' or ''
    tooltip='Click me',
    icon='check' # (FontAwesome names without the `fa-` prefix)
)
button2.on_click(restart)
button2




## Test a question generation inference

In [1]:
from pipelines import pipeline
import textwrap
import datetime

nlp = pipeline("e2e-qg", model="/content/t5-small-e2e-qg", tokenizer="/content/t5-small-e2e-qg")

ques = nlp("Python is a programming language. Created by Guido van Rossum and first released in 1991.")

print(ques)
print("---")
text2 =  "By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs."
ques2 = nlp(text2)
print(ques2)
print("---")

You are using the legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565


['What is a programming language?', 'Who created Python?', 'When was Python first released?']
---
['What do diffusion models achieve by decomposing the image formation process into a sequential application of denoising autoencoders?', 'What is a guiding mechanism to control the image generation process without retraining?', 'How long does optimization of powerful DMs typically consume?']
---
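
The cell above imports `textwrap` but prints the raw Python lists. As an optional sketch, the questions could be printed as a wrapped bullet list instead; `format_questions` is a hypothetical helper, not part of `pipelines.py`.

```python
import textwrap


def format_questions(questions, width=80):
    """Return the questions as a bulleted block, wrapping long lines."""
    lines = [
        textwrap.fill(q, width=width, initial_indent="- ", subsequent_indent="  ")
        for q in questions
    ]
    return "\n".join(lines)


print(format_questions(["Who created Python?", "When was Python first released?"]))
```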


##  Inference for LONG text

In [2]:
fname = '/content/BERTexplanation.txt'
# The with-block closes the file automatically, so no explicit close() is needed
with open(fname, encoding='utf-8') as f:
    doc = f.read()

In [4]:
# Number of characters
len(doc)

11032

In [5]:
# Number of words
len(doc.split(' '))

1734

In [3]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

def mysplit(text, chunk, overlap):
    # Split text into chunks of about `chunk` characters, with up to
    # `overlap` characters shared between neighbouring chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk,
        chunk_overlap=overlap,
        length_function=len,
    )
    return text_splitter.split_text(text)
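
To make the roles of the chunk size and the overlap concrete, here is a dependency-free sketch of fixed-width chunking with overlap. It is a deliberate simplification: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and spaces, while this hypothetical `sliding_chunks` helper cuts at exact character offsets.

```python
def sliding_chunks(text, chunk_size, overlap):
    """Cut text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


print(sliding_chunks("abcdefghij", 4, 1))
```

Each window starts `chunk_size - overlap` characters after the previous one, so a larger overlap means more chunks and more repeated text.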

In [6]:
texts = mysplit(doc,6000,150)
for test in texts:
  print("---")
  questions = nlp(test)
  for i in questions:
    print('- '+i)

---
- What is the name of the book written by Pushpam Punjabi author?
- What is a key component of NLP?
- How does a machine understand human language?
- Who developed Understanding BERT BERT?
---
- What is a useful technique for a variety of NLP tasks?
- How can BERT better understand the overall meaning and context of a passage of text?
- What is important for applications such as chatbots or virtual assistants where the ability to understand and interpret human language is crucial for providing accurate and helpful responses?


In [None]:
texts = mysplit(doc,3700,50)
doma = []
for test in texts:
  print("---")
  questions = nlp(test)
  for i in questions:
    doma.append(i)
    print('- '+i)

---
- What is the name of the book written by Pushpam Punjabi author?
- What is a key component of NLP?
- How does a machine understand human language?
- Who developed Understanding BERT BERT?
---
- How many datasets does BERT use?
- What is the name of the dataset that BERT is trained on?
- How many words are masked in BERT?
---
- What is a powerful tool for a variety of NLP applications?
- What does fine-tuning a BERT model enable researchers and developers to achieve higher levels of accuracy and performance on specific tasks?
- BERT uses a unique “transformer” architecture that enables it to better understand the context and meaning of words and phrases in a sentence?
---
- Who is Pushpam Punjabi?
- What is the name of the machine learning engineer who develops solutions for the use cases emerging in the field of NLP/Natural Language Understanding?
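
Because neighbouring chunks overlap, a collected list such as `doma` may contain repeated questions. The sketch below shows one possible clean-up step that removes duplicates while keeping the original order; `unique_questions` is a hypothetical helper, and treating questions as equal when they match after lowercasing and whitespace normalisation is an assumption.

```python
def unique_questions(questions):
    """Drop repeated questions, ignoring case and extra whitespace, keeping order."""
    seen = set()
    result = []
    for q in questions:
        key = " ".join(q.lower().split())  # normalised form used for comparison
        if key and key not in seen:
            seen.add(key)
            result.append(q.strip())
    return result


sample = ["Who created Python?", "who created  python?", "When was it released?"]
print(unique_questions(sample))
```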
