## Pipeline Quiz Generator (Separate Quiz and Distractor Approach)

Description: Quiz Generator with separate pipeline for quiz generation and then distractor generator

### Step 1 : SciQ Loading

Load dataset

In [4]:
from datasets import load_dataset

sciq_dataset = load_dataset("allenai/sciq")
sciq_dataset

DatasetDict({
    train: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support'],
        num_rows: 11679
    })
    validation: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support'],
        num_rows: 1000
    })
})

Sample Data:

In [5]:
sciq_dataset["train"][27]

{'question': 'A small scale version of what type of map displays individual rock units?',
 'distractor3': 'polar map',
 'distractor1': 'seismic map',
 'distractor2': 'geographic map',
 'correct_answer': 'geologic map',
 'support': 'Geologic maps display rock units and geologic features. A small scale map displays individual rock units while a large scale map shows geologic provinces.'}

Drop every data with empty support. 

In [6]:
filtered_sciq = sciq_dataset.filter(lambda example: example["support"] != '')
filtered_sciq

DatasetDict({
    train: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support'],
        num_rows: 10481
    })
    validation: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support'],
        num_rows: 887
    })
    test: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support'],
        num_rows: 884
    })
})

Check for support with longer than 512 tokens/words (Maximum token of T5).

In [7]:
test = filtered_sciq.filter(lambda example: len(example["support"]) > 2000) 
test

DatasetDict({
    train: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support'],
        num_rows: 165
    })
    validation: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support'],
        num_rows: 12
    })
    test: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support'],
        num_rows: 13
    })
})

In [8]:
test_data = test['train'][150]
test_data

{'question': 'The level of carbon dioxide in the atmosphere is greatly influenced by the reservoir of carbon where?',
 'distractor3': 'after the oceans',
 'distractor1': 'before the oceans',
 'distractor2': 'in the earth',
 'correct_answer': 'in the oceans',
 'support': 'As stated, the atmosphere is a major reservoir of carbon in the form of carbon dioxide that is essential to the process of photosynthesis. The level of carbon dioxide in the atmosphere is greatly influenced by the reservoir of carbon in the oceans. The exchange of carbon between the atmosphere and water reservoirs influences how much carbon is found in each, and each one affects the other reciprocally. Carbon dioxide (CO2) from the atmosphere dissolves in water and, unlike oxygen and nitrogen gas, reacts with water molecules to form ionic compounds. Some of these ions combine with calcium ions in the seawater to form calcium carbonate (CaCO3), a major component of the shells of marine organisms. These organisms eventua

We can see above that support is long but only a few sentences is relevant, we cannot do raw summarization, we have to extract text based on keywords which are answers (distractors and keywords from questions too!). If we left this, support and answer will be truncated. If we summarize it raw, we lose important info of what is asked.

Extractive Summarization based on answer and questions

In [9]:
import yake

def extract_question(question):
    kw_extractor = yake.KeywordExtractor(top=10, stopwords=None)
    keywords = kw_extractor.extract_keywords(question)
    return [keyword for keyword, score in keywords]

In [10]:
import spacy

nlp = spacy.load("en_core_web_sm")

def clean_text(text):
    lemmatizer = nlp.get_pipe("lemmatizer")
    doc = nlp(text)
    tokens = [tok for tok in doc]
    lemmas = [tok.lemma_ for tok in tokens]
    return ' '.join(lemmas)

In [11]:
def score_sentence(sentence, words):
    score = 0
    clean_sentences = clean_text(sentence.lower())
    for word in words:
        if clean_text(word.lower()) in clean_sentences:
            score += 1
    return score

In [12]:
def summarize_support(example, max_length=256):
    text = example["support"]
    words = extract_question(example["question"])
    words.extend([test_data["correct_answer"]])

    scored_sentences = (
        (i, sentence, score_sentence(sentence, words))
        for i, sentence in enumerate(text.split("."))
        if any(clean_text(w.lower()) in clean_text(sentence.lower()) for w in words)
    )
    ranked_sentences = sorted(scored_sentences, key=lambda x: x[2], reverse=True)

    sentence_in_summary = []
    sum_of_sentence = 0
    for order, sentence, _ in ranked_sentences:
        if sum_of_sentence <= max_length:
            sentence_in_summary.append((order, sentence))
            sum_of_sentence += len(sentence)

    summary = sorted(sentence_in_summary, key=lambda x: x[1])
    return ".".join(sent for _, sent in summary)


summarize_support(test_data)


' Carbon dioxide (CO2) from the atmosphere dissolves in water and, unlike oxygen and nitrogen gas, reacts with water molecules to form ionic compounds. The level of carbon dioxide in the atmosphere is greatly influenced by the reservoir of carbon in the oceans.As stated, the atmosphere is a major reservoir of carbon in the form of carbon dioxide that is essential to the process of photosynthesis'

In [13]:
def generate_context(example, sent_size=256):
    sentences = "{}<sep>{}".format(example['correct_answer'], example['support'])
    max_len = sent_size - len("{}<sep>".format(example['correct_answer']))
    context = example['support'] if len(sentences) < sent_size else summarize_support(example, max_length=max_len)
    
    return {
        "context": context
    }

preprocessed_sciq = filtered_sciq.map(generate_context, num_proc=4)

In [14]:
test_x = preprocessed_sciq.filter(lambda example: len(example["support"]) > 1000) 

In [15]:
test_x['train'][0]

{'question': 'Changes from a less-ordered state to a more-ordered state (such as a liquid to a solid) are always what?',
 'distractor3': 'endothermic',
 'distractor1': 'unbalanced',
 'distractor2': 'reactive',
 'correct_answer': 'exothermic',
 'support': 'Summary Changes of state are examples of phase changes, or phase transitions. All phase changes are accompanied by changes in the energy of a system. Changes from a more-ordered state to a less-ordered state (such as a liquid to a gas) areendothermic. Changes from a less-ordered state to a more-ordered state (such as a liquid to a solid) are always exothermic. The conversion of a solid to a liquid is called fusion (or melting). The energy required to melt 1 mol of a substance is its enthalpy of fusion (ΔHfus). The energy change required to vaporize 1 mol of a substance is the enthalpy of vaporization (ΔHvap). The direct conversion of a solid to a gas is sublimation. The amount of energy needed to sublime 1 mol of a substance is its en

In [16]:
preprocessed_sciq.save_to_disk("preprocessed_sciq-qg-256")

Saving the dataset (0/1 shards):   0%|          | 0/10481 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/887 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/884 [00:00<?, ? examples/s]

Now preprocessed has shorter context for long ones, due to map process running slow, uncomment below after getting zip file from me

In [17]:
# from datasets import load_from_disk
# preprocessed_sciq = load_from_disk("preprocessed_sciq")

### Step 2 Question Generation

#### Tokenize for Question Generation 

Input : Context and Answer \
Output : Question

In [18]:
import torch
import copy
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
tokenizer.add_special_tokens({"sep_token": "<sep>"})

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


1

In [19]:
# question = 'The generation of an isolated but open system, which we might call a protocell, was a critical step in the origin of life. Such an isolated system has important properties that are likely to have facilitated the further development of life. For example, because of the membrane boundary, changes that occur within one such structure will not be shared with neighboring systems. Rather, they accumulated in, and favor the survival of, one system over its neighbors. Such systems can also reproduce in a crude way by fragmentation. If changes within one such system improved its stability, its ability to accumulate resources, or its ability to survive and reproduce, that system, and its progeny, would be likely to become more common. As these changes accumulate and are passed from parent to offspring, the organisms will inevitably evolve, as we will see in detail in the next chapter. As in living systems today, the earliest steps in the formation of the first organisms required a source of energy to maintain the non-equilibrium living system. There are really two choices for the source of this energy, either light (electromagnetic radiation from the sun) or thermodynamically unstable chemicals present in the environment. There have been a number of plausible scenarios, based on various observations, for the steps leading to life. For example, a recent study based on the analysis of the genes (and the proteins that they encode) found in modern organisms, suggests that the last universal common ancestor (LUCA) arose in association with hydrothermal vents.60 But whether this reflects LUCA or an ancestor of LUCA that became adapted to living is association with hydrothermal vents is difficult (and perhaps impossible) to resolve unambiguously, particularly since LUCA lived ~3.4-3.8 billion years ago and cannot be studied directly. Mapping the history of life on earth Assuming, as seems likely, that life arose spontaneously, we can now look at what we know about the fossil record to better understand the diversification of life and life’s impact on the Earth. This is probably best done by starting with what we know about where the Universe and Earth came from. The current scientific model for the origin of the universe is known as the “Big Bang” (also known as the “primeval atom” or the “cosmic egg”), an idea originally proposed by the priest, physicist and astronomer Georges Lemaître (1894-1966).61 The Big Bang model arose from efforts to answer the question of whether the fuzzy nebulae identified by astronomers were located within or outside of our galaxy. This required some way to determine how far these nebulae were from Earth. Edwin Hubble (1889-1953) and his co-workers were the first to realize that nebulae were in fact galaxies in their own right, each very much like our own Milky Way and each is composed of many billions of stars. This was a surprising result. It made Earth, sitting on the edge of one (the Milky Way) among many, many galaxies seem less important – a change in cosmological perspective similar to that associated with the idea that the Sun, rather than Earth, was the center of the solar system (and the Universe). To measure the movement of galaxies with respect to Earth, Hubble and colleagues combined to types of observations. The first of these allowed them to estimate the distance from the Earth to.'
# tokenized_targets = tokenizer.encode_plus(question, max_length=512, padding='max_length', truncation=True, return_tensors="pt")
# tokenized_targets

In [20]:
# for x in tokenized_targets['input_ids'][:2]:
#     print(tokenizer.decode(x))

In [21]:
def preprocess_dataset(example):
    text = "{}<sep>{}".format(example['correct_answer'], example['context'])
    question = example['question']

    max_length = 256
    
    tokenized_inputs = tokenizer.encode_plus(text, max_length=max_length, padding='max_length', truncation=True, return_tensors="pt")
    tokenized_targets = tokenizer.encode_plus(question, max_length=max_length, padding='max_length', truncation=True, return_tensors="pt")
    
    input_ids = tokenized_inputs['input_ids'].squeeze()
    input_attention = tokenized_inputs['attention_mask'].squeeze()

    target_ids = tokenized_targets['input_ids'].squeeze()
    target_attention = tokenized_targets['attention_mask'].squeeze()

    labels = copy.deepcopy(target_ids)
    labels[labels == 0] = -100
    
    outputs = {
        'input_ids':input_ids, 
        'attention_mask': input_attention, 
        'labels': labels
    }

    return outputs
    
tokenized_dataset = preprocessed_sciq.map(preprocess_dataset, remove_columns= ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support', 'context'])
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 10481
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 887
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 884
    })
})

In [22]:
import numpy as np
import evaluate

def compute_metrics(eval_pred):
    metric = evaluate.load("bleu")
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

In [26]:
from transformers import T5ForConditionalGeneration, TrainingArguments, Trainer, default_data_collator

training_args = TrainingArguments(
    output_dir="pretrained_question_generation", 
    evaluation_strategy="no", 
    auto_find_batch_size=True,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=2,
    num_train_epochs=3,
    fp16=True
    # gradient_checkpointing=True
)

model = T5ForConditionalGeneration.from_pretrained("t5-small")

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

data_collator = default_data_collator

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

In [27]:
torch.cuda.empty_cache()

There is no validation yet (it's buggy in my comp for now)

In [28]:
trainer.train()

Step,Training Loss
500,1.5456
1000,1.2893
1500,1.2158


Checkpoint destination directory pretrained_question_generation/checkpoint-500 already exists and is non-empty. Saving will proceed but saved results may be invalid.
Checkpoint destination directory pretrained_question_generation/checkpoint-1000 already exists and is non-empty. Saving will proceed but saved results may be invalid.
Checkpoint destination directory pretrained_question_generation/checkpoint-1500 already exists and is non-empty. Saving will proceed but saved results may be invalid.


TrainOutput(global_step=1965, training_loss=1.3112742339078403, metrics={'train_runtime': 6070.7844, 'train_samples_per_second': 5.179, 'train_steps_per_second': 0.324, 'total_flos': 2126625726529536.0, 'train_loss': 1.3112742339078403, 'epoch': 3.0})

In [29]:
trainer.save_model('model-qg')

In [35]:
tokenized_dataset['validation'][137]

{'input_ids': [11499,
  32100,
  37,
  2677,
  5013,
  2107,
  387,
  45,
  3,
  9,
  6957,
  42,
  13064,
  616,
  190,
  6079,
  9243,
  7293,
  7,
  6,
  114,
  273,
  16,
  7996,
  666,
  3,
  5,
  100,
  19,
  2953,
  57,
  579,
  2677,
  11,
  23295,
  24,
  169,
  8,
  387,
  12,
  1633,
  70,
  4096,
  5,
  634,
  52,
  1982,
  10441,
  19,
  10441,
  24,
  3033,
  7,
  8,
  2912,
  13,
  387,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,


In [114]:
model = T5ForConditionalGeneration.from_pretrained("model-qg")

In [115]:
text = "{}<sep>{}".format('minute amounts', "Within a nervous system, a neuron, neurone, or nerve cell is an electrically excitable cell that fires electric signals called action potentials across a neural network. Neurons communicate with other cells via synapses, which are specialized connections that commonly use minute amounts of chemical neurotransmitters to pass the electric signal from the presynaptic neuron to the target cell through the synaptic gap.")
tokenized_inputs = tokenizer.encode_plus(text, max_length=256, padding='max_length', truncation=True, return_tensors="pt")
decoder_input_ids = tokenized_inputs['input_ids']

In [116]:
output = model.generate(
    input_ids=tokenized_inputs['input_ids']
)
output

tensor([[   0,  363,   19,    8, 1657,   21,    3,    9, 6567,   29,    6, 6567,
           29,   15,    6,   42, 9077, 2358,   24, 1472]])

In [117]:
print(tokenizer.decode(output[0]))

<pad>What is the term for a neuron, neurone, or nerve cell that fire


In [36]:
preprocessed_sciq['validation'][137]

{'question': 'When the temperature of water is increased after being used in cooling, it is this form of pollution?',
 'distractor3': 'air',
 'distractor1': 'atmospheric',
 'distractor2': 'cosmic',
 'correct_answer': 'thermal',
 'support': "Thermal pollution is pollution that raises the temperature of water. This is caused by power plants and factories that use the water to cool their machines. The plants pump cold water from a lake or coastal area through giant cooling towers, like those in Figure below . As it flows through the towers, the cold water absorbs heat. This warmed water is returned to the lake or sea. Thermal pollution can kill fish and other water life. It's not just the warm temperature that kills them. Warm water can’t hold as much oxygen as cool water. If the water gets too warm, there may not be enough oxygen for living things.",
 'context': ' The plants pump cold water from a lake or coastal area through giant cooling towers, like those in Figure below . This is cau

In [53]:
# for x in tokenized_dataset['validation'][137]['input_ids']:
#     print(tokenizer.decode(x))

In [30]:
trainer.evaluate()

OutOfMemoryError: CUDA out of memory. Tried to allocate 4.32 GiB (GPU 0; 6.00 GiB total capacity; 5.98 GiB already allocated; 0 bytes free; 8.37 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

### Step 3 Distractor Generation

#### Tokenize for Distractor Generation 

Input : Answer, Question, Context \
Output : 3 Distractors

In [126]:
def preprocess_dataset_for_distractor(example):
    text = "{}<sep>{}<sep>{}".format(example['question'], example['correct_answer'], example['context'])
    distractor = "{}<sep>{}<sep>{}".format(example['distractor1'], example['distractor2'], example['distractor3'])

    max_length = 256
    
    tokenized_inputs = tokenizer.encode_plus(text, max_length=max_length, padding='max_length', truncation=True, return_tensors="pt")
    tokenized_targets = tokenizer.encode_plus(distractor, max_length=max_length, padding='max_length', truncation=True, return_tensors="pt")
    
    input_ids = tokenized_inputs['input_ids'].squeeze()
    input_attention = tokenized_inputs['attention_mask'].squeeze()

    target_ids = tokenized_targets['input_ids'].squeeze()
    target_attention = tokenized_targets['attention_mask'].squeeze()

    labels = copy.deepcopy(target_ids)
    labels[labels == 0] = -100
    
    outputs = {
        'input_ids':input_ids, 
        'attention_mask': input_attention, 
        'labels': labels
    }

    return outputs
    
tokenized_dataset_distractor = preprocessed_sciq.map(preprocess_dataset_for_distractor, remove_columns= ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support', 'context'])
tokenized_dataset_distractor

Map:   0%|          | 0/10481 [00:00<?, ? examples/s]

Map:   0%|          | 0/887 [00:00<?, ? examples/s]

Map:   0%|          | 0/884 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 10481
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 887
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 884
    })
})

In [127]:
training_args_dis = TrainingArguments(
    output_dir="pretrained_distractor_generation", 
    evaluation_strategy="no", 
    logging_strategy="epoch",
    auto_find_batch_size=True,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=2,
    num_train_epochs=3,
    fp16=True
    # gradient_checkpointing=True
)

model_dis = T5ForConditionalGeneration.from_pretrained("t5-small")

device = "cuda" if torch.cuda.is_available() else "cpu"
model_dis = model_dis.to(device)

trainer = Trainer(
    model=model_dis,
    args=training_args_dis,
    train_dataset=tokenized_dataset_distractor["train"],
    eval_dataset=tokenized_dataset_distractor["validation"],
    compute_metrics=compute_metrics
)

In [128]:
trainer.train()

Step,Training Loss
655,1.9005
1311,1.6972
1965,1.6619


TrainOutput(global_step=1965, training_loss=1.7532189910345102, metrics={'train_runtime': 6312.5134, 'train_samples_per_second': 4.981, 'train_steps_per_second': 0.311, 'total_flos': 2126625726529536.0, 'train_loss': 1.7532189910345102, 'epoch': 3.0})

In [129]:
trainer.save_model('model-dg')

In [133]:
model_dis = T5ForConditionalGeneration.from_pretrained("model-dg")

In [138]:
test_data = preprocessed_sciq['test'][137]
test_data

{'question': 'Compound forms when atoms of nonmetals form molecules that are held together by what?',
 'distractor3': 'dissonance bonds',
 'distractor1': 'phenotype bonds',
 'distractor2': 'magnetic bonds',
 'correct_answer': 'covalent bonds',
 'support': 'Compound forms when atoms of nonmetals form molecules that are held together by covalent bonds.',
 'context': 'Compound forms when atoms of nonmetals form molecules that are held together by covalent bonds.'}

In [140]:
text = "{}<sep>{}<sep>{}".format(test_data['question'], test_data['correct_answer'], test_data['context'])
tokenized_inputs = tokenizer.encode_plus(text, max_length=256, padding='max_length', truncation=True, return_tensors="pt")
decoder_input_ids = tokenized_inputs['input_ids']

In [141]:
output = model_dis.generate(
    input_ids=tokenized_inputs['input_ids']
)
output

tensor([[    0,   576, 15592, 13237,     2,     7,    15,   102,  3155,   509,
         15592, 13237,     2,     7,    15,   102,  3155,   509, 15592, 13237]])

In [142]:
print(tokenizer.decode(output[0]))

<pad> covalent bonds<unk> sep>covalent bonds<unk>sep>covalent bonds
