## Pipeline Quiz Generator (Separate Quiz and Distractor Approach)

Description: A quiz generator built as two separate pipelines, one for question generation and one for distractor generation.

### Step 1 : SciQ Loading

Load dataset

In [2]:
from datasets import load_dataset

sciq_dataset = load_dataset("allenai/sciq")
sciq_dataset

DatasetDict({
    train: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support'],
        num_rows: 11679
    })
    validation: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support'],
        num_rows: 1000
    })
})

Sample Data:

In [3]:
sciq_dataset["train"][27]

{'question': 'A small scale version of what type of map displays individual rock units?',
 'distractor3': 'polar map',
 'distractor1': 'seismic map',
 'distractor2': 'geographic map',
 'correct_answer': 'geologic map',
 'support': 'Geologic maps display rock units and geologic features. A small scale map displays individual rock units while a large scale map shows geologic provinces.'}

Drop every example with an empty support.

In [4]:
filtered_sciq = sciq_dataset.filter(lambda example: example["support"] != '')
filtered_sciq

DatasetDict({
    train: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support'],
        num_rows: 10481
    })
    validation: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support'],
        num_rows: 887
    })
    test: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support'],
        num_rows: 884
    })
})

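Check for supports longer than 512 (the maximum input length of T5 is 512 tokens). Note that `len` below counts characters, not tokens, so this check flags more examples than a token-based check would.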

In [5]:
test = filtered_sciq.filter(lambda example: len(example["support"]) > 512) 
test

DatasetDict({
    train: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support'],
        num_rows: 3029
    })
    validation: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support'],
        num_rows: 257
    })
    test: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support'],
        num_rows: 267
    })
})
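The filter above measures `len(example["support"])`, which is a character count, while the T5 limit is expressed in tokens. A minimal sketch of a length check that takes the tokenizer as a parameter follows. The helper name `exceeds_limit` is hypothetical, and whitespace splitting is only a stand-in; with the real model one would presumably pass the `tokenize` method of the T5 tokenizer once it is loaded.

```python
from typing import Callable, List


def exceeds_limit(text: str, tokenize: Callable[[str], List[str]], limit: int = 512) -> bool:
    """Return True when `text` yields more than `limit` tokens under `tokenize`."""
    return len(tokenize(text)) > limit


# Whitespace splitting is only a stand-in here; it undercounts subword tokens.
sample = "Geologic maps display rock units and geologic features. " * 20

print(len(sample))                       # character count, what len() measures
print(len(sample.split()))               # rough word count
print(exceeds_limit(sample, str.split))  # word-level check against the limit
```

The same text can be over the limit in characters and well under it in words, which is why the unit of the check matters when deciding which supports need shortening.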

In [6]:
test_data = test['train'][9]
test_data

{'question': 'Interstitial carbides are produced by the reaction of most transition metals at high temperatures with what element?',
 'distractor3': 'nitrogen',
 'distractor1': 'hydrogen',
 'distractor2': 'oxygen',
 'correct_answer': 'carbon',
 'support': 'temperatures with electropositive metals such as those of groups 1 and 2 and aluminum produces ionic carbides, which contain discrete metal cations and carbon anions. The identity of the anions depends on the size of the second element. For example, smaller elements such as beryllium and aluminum give methides such as Be2C and Al4C3, which formally contain the C4− ion derived from methane (CH4) by losing all four H atoms as protons. In contrast, larger metals such as sodium and calcium give carbides with stoichiometries of Na2C2 and CaC2. Because these carbides contain the C4− ion, which is derived from acetylene (HC≡CH) by losing both H atoms as protons, they are more properly called acetylides. As discussed in Chapter 21 "Periodic 

As shown above, the support is long but only a few sentences are relevant. We cannot summarize it raw, because that would lose important information about what is being asked. Instead, we extract text based on keywords: the answer, plus the distractors and keywords from the question. If we leave the support as it is, the support and answer will be truncated.

Extractive Summarization based on answer and questions

In [7]:
import yake

def extract_question(question):
    kw_extractor = yake.KeywordExtractor(top=10, stopwords=None)
    keywords = kw_extractor.extract_keywords(question)
    return [keyword for keyword, score in keywords]

In [8]:
import spacy

nlp = spacy.load("en_core_web_sm")

def clean_text(text):
    # Lemmatize every token and join the lemmas back into a single string.
    doc = nlp(text)
    return ' '.join(tok.lemma_ for tok in doc)

In [9]:
def score_sentence(sentence, words):
    score = 0
    for word in words:
        if clean_text(word.lower()) in clean_text(sentence.lower()):
            score += 1
    return score
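`score_sentence` calls `clean_text` on the same sentence once per keyword, so each sentence is processed repeatedly. A sketch of one way to avoid the repeated work is to memoize the normalization step with `functools.lru_cache`. Here `normalize` is a hypothetical lowercase-only stand-in for `clean_text` (no lemmatization), and `score_sentence_cached` is likewise an illustrative name.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def normalize(text: str) -> str:
    """Stand-in for clean_text: lowercase and collapse whitespace (no lemmatization)."""
    return " ".join(text.lower().split())


def score_sentence_cached(sentence: str, words: list) -> int:
    """Count how many keywords occur in the sentence, normalizing each string once."""
    normalized_sentence = normalize(sentence)
    return sum(1 for word in words if normalize(word) in normalized_sentence)


sentence = "A small scale map displays individual rock units."
keywords = ["rock units", "geologic map", "small scale"]
print(score_sentence_cached(sentence, keywords))
```

Repeated calls with the same sentence or keyword then hit the cache instead of re-running the normalizer.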

In [10]:
def summarize_support(example, max_length=500):
    text = example['support']
    words = extract_question(example['question'])
    words.append(example['correct_answer'])  # use this example's answer, not the global test_data
    
    scored_sentences = ((sentence, score_sentence(sentence, words)) for sentence in text.split(".") if any(clean_text(w.lower()) in clean_text(sentence.lower()) for w in words))
    ranked_sentences = sorted(scored_sentences, key=lambda x: x[1], reverse=True)
    
    sentence_in_summary = []
    sum_of_sentence = 0
    for sentences, _ in ranked_sentences:
        if sum_of_sentence <= max_length:
            sentence_in_summary.append(sentences)
            sum_of_sentence += len(sentences)
            
    return '.'.join(sentence_in_summary)

summarize_support(test_data)

' The reaction of carbon with most transition metals at high temperatures produces interstitial carbides. Due to the less electropositive nature of the transition metals, these carbides contain covalent metal– carbon interactions, which result in different properties: most interstitial carbides are good conductors of electricity, have high melting points, and are among the hardest substances known. Interstitial carbides exhibit a variety of nominal compositions, and they are often nonstoichiometric compounds whose carbon content can vary over a wide range'
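The loop in `summarize_support` checks the budget before appending, so the last sentence added can push the summary past `max_length`, and the sentences are joined in score order, not reading order. A possible variant, sketched under the assumption that scores are already computed, skips sentences that would overflow and restores document order at the end. The name `select_sentences` is hypothetical.

```python
def select_sentences(scored, max_length=500):
    """Pick the highest-scoring sentences that fit in max_length characters,
    then return them in their original order.

    `scored` is a list of (sentence, score) pairs in document order.
    """
    # Rank indices by score, highest first; ties keep document order.
    ranked = sorted(range(len(scored)), key=lambda i: scored[i][1], reverse=True)

    chosen, used = [], 0
    for i in ranked:
        length = len(scored[i][0])
        if used + length <= max_length:  # skip sentences that would overflow
            chosen.append(i)
            used += length

    return [scored[i][0] for i in sorted(chosen)]


scored = [
    ("Carbon reacts with metals", 2),
    ("Ionic carbides contain discrete cations", 1),
    ("Interstitial carbides are hard", 3),
]
print(select_sentences(scored, max_length=60))
```

Keeping the original order tends to give a more readable context, at the cost of a second sort.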

In [11]:
def generate_context(example, token_size=512):
    sentences = "{}<sep>{}".format(example['correct_answer'], example['support'])
    max_len = token_size - len("{}<sep>".format(example['correct_answer']))
    context = example['support'] if len(sentences) < token_size else summarize_support(example, max_length=max_len)
    
    return {
        "context": context
    }

# preprocessed_sciq = filtered_sciq.map(generate_context)

In [12]:
# test_x = preprocessed_sciq.filter(lambda example: len(example["support"]) > 3000) 

In [13]:
# test_data = test_x['train'][9]
# test_data 

In [14]:
# preprocessed_sciq.save_to_disk("preprocessed_sciq")

The preprocessed dataset now has shorter contexts for the long supports. Because the map step runs slowly, the cells above are left commented out; run the cell below after getting the zip file of the saved dataset from me.

In [15]:
from datasets import load_from_disk
preprocessed_sciq = load_from_disk("preprocessed_sciq")

### Step 2 : Question Generation

#### Tokenize for Question Generation 

Input : Context and Answer \
Output : Question

In [16]:
import torch
import copy
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
tokenizer.add_special_tokens({"sep_token": "<sep>"})

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


1

In [17]:
def preprocess_dataset(example):
    text = "{}<sep>{}".format(example['correct_answer'], example['context'])
    question = example['question']

    max_length = 512
    
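    # truncation=True guarantees every sequence is exactly max_length long after padding
    tokenized_inputs = tokenizer.encode_plus(text, max_length=max_length, padding='max_length', truncation=True, return_tensors="pt")
    tokenized_targets = tokenizer.encode_plus(question, max_length=max_length, padding='max_length', truncation=True, return_tensors="pt")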
    
    input_ids = tokenized_inputs['input_ids'].squeeze()
    input_attention = tokenized_inputs['attention_mask'].squeeze()

    target_ids = tokenized_targets['input_ids'].squeeze()
    target_attention = tokenized_targets['attention_mask'].squeeze()

    labels = copy.deepcopy(target_ids)
    labels[labels == 0] = -100
    
    outputs = {
        'input_ids':input_ids, 
        'attention_mask': input_attention, 
        'labels': labels
    }

    return outputs
    
tokenized_dataset = preprocessed_sciq.map(preprocess_dataset, remove_columns= ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support', 'context'])
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 10481
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 887
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 884
    })
})
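The preprocessing above replaces padding ids in the labels with -100. That value is the default `ignore_index` of PyTorch's cross-entropy loss, so padded positions do not contribute to the training loss. The idea can be sketched on a plain list; `mask_padding` is a hypothetical helper and the token ids are made up for illustration.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_padding(token_ids, pad_id=0):
    """Replace padding ids with IGNORE_INDEX so they do not contribute to the loss."""
    return [IGNORE_INDEX if t == pad_id else t for t in token_ids]


# Made-up ids for a short question padded to length 8; 1 stands for the end marker.
target_ids = [363, 19, 8, 1, 0, 0, 0, 0]
labels = mask_padding(target_ids)
print(labels)
```

Only the labels are masked this way; the input ids keep their padding and rely on the attention mask instead.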

In [18]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")

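def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # T5 can return a tuple of outputs; keep only the logits.
    if isinstance(logits, tuple):
        logits = logits[0]
    predictions = np.argmax(logits, axis=-1)
    # Token-level accuracy over real target tokens; -100 marks padding in the labels.
    mask = labels != -100
    return metric.compute(predictions=predictions[mask], references=labels[mask])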


In [19]:
from transformers import T5ForConditionalGeneration, TrainingArguments, Trainer, default_data_collator

training_args = TrainingArguments(
    output_dir="pretrained_question_gen", 
    evaluation_strategy="epoch", 
    auto_find_batch_size=True,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=2,
    eval_accumulation_steps=1,
    num_train_epochs=3,
    fp16=True,
    gradient_checkpointing=True
)

model = T5ForConditionalGeneration.from_pretrained("t5-small")

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

data_collator = default_data_collator

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics
)
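With gradient accumulation, each optimizer step sees `per_device_train_batch_size` times `gradient_accumulation_steps` examples per device. A small sketch of that arithmetic for the arguments above follows, assuming a single GPU; `effective_batch_size` is a hypothetical helper, and `auto_find_batch_size=True` may lower the per-device value at run time if memory runs out.

```python
import math


def effective_batch_size(per_device, accumulation_steps, num_devices=1):
    """Number of examples contributing to one optimizer step."""
    return per_device * accumulation_steps * num_devices


train_rows = 10481  # size of the filtered training split reported earlier
batch = effective_batch_size(per_device=8, accumulation_steps=2)

print(batch)
# Approximate optimizer steps per epoch; the Trainer's own rounding may differ slightly.
print(math.ceil(train_rows / batch))
```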

In [20]:
torch.cuda.empty_cache()

In [None]:
trainer.train(resume_from_checkpoint = 'pretrained_question_gen/checkpoint-5000')

There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight'].
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


Epoch,Training Loss,Validation Loss


### Step 3 : Distractor Generation

#### Tokenize for Distractor Generation 

Input : Answer, Question, Context \
Output : 3 Distractors

In [31]:
def preprocess_dataset_for_distractor(example):
    text = "{} {} {}".format(example['question'], example['correct_answer'], example['support'])
    distractor = "{} {} {}".format(example['distractor1'], example['distractor2'], example['distractor3'])

    max_length = 512
    
    # Truncate long inputs and pad to a fixed length.
    tokenized_inputs = tokenizer.encode_plus(text, max_length=max_length, padding='max_length', truncation=True, return_tensors="pt")
    tokenized_targets = tokenizer.encode_plus(distractor, max_length=max_length, padding='max_length', truncation=True, return_tensors="pt")
    
    # squeeze() drops the batch dimension added by return_tensors="pt"
    input_ids = tokenized_inputs['input_ids'].squeeze()
    input_attention = tokenized_inputs['attention_mask'].squeeze()

    target_ids = tokenized_targets['input_ids'].squeeze()
    target_attention = tokenized_targets['attention_mask'].squeeze()

    labels = copy.deepcopy(target_ids)
    labels[labels == 0] = -100
    
    outputs = {
        'input_ids':input_ids, 
        'attention_mask': input_attention, 
        'labels': labels
    }

    return outputs
    
tokenized_dataset_distractor = sciq_dataset.map(preprocess_dataset_for_distractor)
tokenized_dataset_distractor

DatasetDict({
    train: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 11679
    })
    validation: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1000
    })
})
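The target string built above joins the three distractors with single spaces, so multi-word distractors such as 'seismic map' cannot be split back unambiguously after generation. One possible alternative, offered only as a sketch and not as the notebook's method, is to reuse the `<sep>` token as an explicit delimiter. `format_distractor_pair` and `parse_distractors` are hypothetical helpers.

```python
SEP = "<sep>"  # the separator token registered on the tokenizer earlier


def format_distractor_pair(example):
    """Build (input_text, target_text) strings with explicit separators."""
    source = SEP.join([example["question"], example["correct_answer"], example["support"]])
    target = SEP.join([example["distractor1"], example["distractor2"], example["distractor3"]])
    return source, target


def parse_distractors(generated):
    """Split a generated target string back into individual distractors."""
    return [part.strip() for part in generated.split(SEP) if part.strip()]


example = {
    "question": "A small scale version of what type of map displays individual rock units?",
    "correct_answer": "geologic map",
    "support": "Geologic maps display rock units and geologic features.",
    "distractor1": "seismic map",
    "distractor2": "geographic map",
    "distractor3": "polar map",
}
source, target = format_distractor_pair(example)
print(target)
print(parse_distractors(target))
```

For this to work at inference time, decoding would need to keep special tokens in the output so that the delimiter survives.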

In [None]:
training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch", per_device_train_batch_size=2)

model = T5ForConditionalGeneration.from_pretrained("t5-small")

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

data_collator = default_data_collator

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset_distractor["train"],
    eval_dataset=tokenized_dataset_distractor["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics
)