## Pipeline Quiz Generator (Separate Quiz and Distractor Approach)

Description: Quiz Generator with separate pipeline for quiz generation and then distractor generator

### Step 1 : SciQ Loading

In [2]:
from datasets import load_dataset

sciq_dataset = load_dataset("allenai/sciq")
sciq_dataset

DatasetDict({
    train: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support'],
        num_rows: 11679
    })
    validation: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support'],
        num_rows: 1000
    })
})

In [3]:
sciq_dataset["train"][27]

{'question': 'A small scale version of what type of map displays individual rock units?',
 'distractor3': 'polar map',
 'distractor1': 'seismic map',
 'distractor2': 'geographic map',
 'correct_answer': 'geologic map',
 'support': 'Geologic maps display rock units and geologic features. A small scale map displays individual rock units while a large scale map shows geologic provinces.'}

In [14]:
test = sciq_dataset.filter(lambda example: len(example["support"]) > 3000) 
test

DatasetDict({
    train: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support'],
        num_rows: 10
    })
    validation: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support'],
        num_rows: 0
    })
    test: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support'],
        num_rows: 0
    })
})

In [22]:
test['train'][2]

{'question': 'What type of molecules sit within a membrane and contain an aqueous channel that spans the membrane’s hydrophobic region?',
 'distractor3': 'mole',
 'distractor1': 'osmotic fluid',
 'distractor2': 'microorganisms',
 'correct_answer': 'channel',
 'support': 'you could prove that movements are occurring even in the absence of a gradient. In a similar manner, there are analogous carrier systems that move hydrophobic molecules through water. Channel molecules sit within a membrane and contain an aqueous channel that spans the membrane’s hydrophobic region. Hydrophilic molecules of particular sizes and shapes can pass through this aqueous channel and their movement involves a significantly lower activation energy than would be associated with moving through the lipid part of the membrane in the absence of the channel. Channels are generally highly selective in terms of which particles will pass through them. For example, there are channels in which 10,000 potassium ions will p

In [23]:
context = test['train'][1]['support']
answer = test['train'][1]['correct_answer']
text = "context: {} answer: {}".format(context, answer)
text

'context: organism’s life cycle is as subject to the effects of evolutionary pressures as any other (although it is easy to concentrate our attentions on adult forms and behaviors). The study of these processes, known as embryology, is beyond our scope here, but we can outline a few common themes. If fertilized eggs develop outside of the body of the mother and without parental protection, these new organisms are highly vulnerable to predation. In such organisms, early embryonic development generally proceeds rapidly. The eggs are large and contain all of the nutrients required for development to proceed up to the point where the new organism can feed on its own. To facilitate such rapid development, the egg is essentially pre-organized, that is, it is highly asymmetric, with specific factors that can influence gene expression, either directly or indirectly, positioned in various regions of the egg (→). Entry of the sperm (the male gamete), which itself is an inherently asymmetric proc

In [24]:
from summarizer import Summarizer

context = test['train'][1]['support']
model = Summarizer()
result = model(text, min_length=60, max_length = 500 , ratio = 0.4)
summarized_text = ''.join(result)
summarized_text

  super()._check_params_vs_input(X, default_n_init=10)


'context: organism’s life cycle is as subject to the effects of evolutionary pressures as any other (although it is easy to concentrate our attentions on adult forms and behaviors). The study of these processes, known as embryology, is beyond our scope here, but we can outline a few common themes. If fertilized eggs develop outside of the body of the mother and without parental protection, these new organisms are highly vulnerable to predation. In such organisms, early embryonic development generally proceeds rapidly. To facilitate such rapid development, the egg is essentially pre-organized, that is, it is highly asymmetric, with specific factors that can influence gene expression, either directly or indirectly, positioned in various regions of the egg (→). Cells within the interior form the inner cell mass that produces to the embryo proper. It is easy to tell a muscle cell from a neuron from a bone cell from a skin cell by the set of genes they express, the proteins they contain, th

### Step 2 Question Generation

#### Tokenize for Question Generation 

Input : Context and Answer \
Output : Question

In [18]:
import torch
import copy
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
tokenizer.add_special_tokens({"sep_token": "<sep>"})

ex = tokenizer.encode_plus(text, max_length=512, padding='max_length', pad_to_max_length=False, truncation=True, return_tensors="pt")
ex

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


{'input_ids': tensor([[ 2625,    10,  9329,    22,     7,   280,  4005,    19,    38,  1426,
            12,     8,  1951,    13, 27168,  1666,     7,    38,   136,   119,
            41, 18252,    34,    19,   514,    12, 11345,    69,  1388,     7,
            30,  3165,  2807,    11, 15400,   137,    37,   810,    13,   175,
          2842,     6,   801,    38, 24157,  6427,     6,    19,  1909,    69,
          7401,   270,     6,    68,    62,    54, 11052,     3,     9,   360,
          1017,  8334,     5,   156, 20859,  1601,  5875,  1344,  1067,    13,
             8,   643,    13,     8,  2039,    11,   406, 21555,  1711,     6,
           175,   126,  9329,     7,    33,  1385,  9930,    12,   554,    26,
           257,     5,    86,   224,  9329,     7,     6,   778, 24157,  2532,
           606,  2389, 14942,  7313,     5,    37,  5875,    33,   508,    11,
          3480,    66,    13,     8, 12128,   831,    21,   606,    12,  8669,
            95,    12,     8,   500,  

In [19]:
for x in ex["input_ids"][:2]:
    print(tokenizer.decode(x))

context: organism’s life cycle is as subject to the effects of evolutionary pressures as any other (although it is easy to concentrate our attentions on adult forms and behaviors). The study of these processes, known as embryology, is beyond our scope here, but we can outline a few common themes. If fertilized eggs develop outside of the body of the mother and without parental protection, these new organisms are highly vulnerable to predation. In such organisms, early embryonic development generally proceeds rapidly. The eggs are large and contain all of the nutrients required for development to proceed up to the point where the new organism can feed on its own. To facilitate such rapid development, the egg is essentially pre-organized, that is, it is highly asymmetric, with specific factors that can influence gene expression, either directly or indirectly, positioned in various regions of the egg (<unk> ). Entry of the sperm (the male gamete), which itself is an inherently asymmetric 

In [32]:
def preprocess_dataset(example):
    text = "{}<sep>{}".format(example['correct_answer'], example['support'])
    question = example['question']

    max_length = 512
    doc_stride = 128
    
    tokenized_inputs = tokenizer.encode_plus(text, max_length=max_length, padding='max_length', pad_to_max_length=False, truncation=True, return_tensors="pt")
    tokenized_targets = tokenizer.encode_plus(question, max_length=max_length, padding='max_length', pad_to_max_length=False, truncation=True, return_tensors="pt")
    
    input_ids = tokenized_inputs['input_ids']
    input_attention = tokenized_inputs['attention_mask']

    target_ids = tokenized_targets['input_ids']
    target_attention = tokenized_targets['attention_mask']

    labels = copy.deepcopy(target_ids)
    labels[labels == 0] = -100
    
    outputs = {
        'input_ids':input_ids, 
        'attention_mask': input_attention, 
        'labels': labels
    }

    return outputs
    
tokenized_dataset = sciq_dataset.map(preprocess_dataset, batched=True, batch_size=8, remove_columns=sciq_dataset["train"].column_names)
tokenized_dataset

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Map:   0%|          | 0/11679 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 11679
    })
    validation: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1000
    })
})

In [33]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

In [34]:
from transformers import T5ForConditionalGeneration, TrainingArguments, Trainer, default_data_collator

training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch", per_device_train_batch_size=2)

model = T5ForConditionalGeneration.from_pretrained("t5-small")

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

data_collator = default_data_collator

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

In [35]:
trainer.train()

ValueError: too many values to unpack (expected 2)

### Step 3 Distractor Generation

#### Tokenize for Distractor Generation 

Input : Answer, Question, Context \
Output : 4 Distractors

In [31]:
def preprocess_dataset_for_distractor(example):
    text = "{} {} {}".format(example['question'], example['correct_answer'], example['support'])
    distractor = "{} {} {}".format(example['distractor1'], example['distractor2'], example['distractor3'])

    max_length = 512
    doc_stride = 128
    
    tokenized_inputs = tokenizer.encode_plus(text, max_length=max_length, padding='max_length', pad_to_max_length=False, truncation=True, return_tensors="pt")
    tokenized_targets = tokenizer.encode_plus(distractor, max_length=max_length, padding='max_length', pad_to_max_length=False, truncation=True, return_tensors="pt")
    
    input_ids = tokenized_inputs['input_ids']
    input_attention = tokenized_inputs['attention_mask']

    target_ids = tokenized_targets['input_ids']
    target_attention = tokenized_targets['attention_mask']

    labels = copy.deepcopy(target_ids)
    labels[labels == 0] = -100
    
    outputs = {
        'input_ids':input_ids, 
        'attention_mask': input_attention, 
        'labels': labels
    }

    return outputs
    
tokenized_dataset_distractor = sciq_dataset.map(preprocess_dataset_for_distractor)
tokenized_dataset_distractor

Map:   0%|          | 0/11679 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 11679
    })
    validation: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1000
    })
})

In [None]:
training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch", per_device_train_batch_size=2)

model = T5ForConditionalGeneration.from_pretrained("t5-small")

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

data_collator = default_data_collator

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics
)