## Pipeline Quiz Generator (Separate Quiz and Distractor Approach)

Description: Quiz Generator with separate pipeline for quiz generation and then distractor generator

### Step 1 : SciQ Loading

Load dataset

In [2]:
from datasets import load_dataset

sciq_dataset = load_dataset("allenai/sciq")
sciq_dataset

DatasetDict({
    train: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support'],
        num_rows: 11679
    })
    validation: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support'],
        num_rows: 1000
    })
})

Sample Data:

In [3]:
sciq_dataset["train"][27]

{'question': 'A small scale version of what type of map displays individual rock units?',
 'distractor3': 'polar map',
 'distractor1': 'seismic map',
 'distractor2': 'geographic map',
 'correct_answer': 'geologic map',
 'support': 'Geologic maps display rock units and geologic features. A small scale map displays individual rock units while a large scale map shows geologic provinces.'}

Drop every data with empty support. 

In [4]:
filtered_sciq = sciq_dataset.filter(lambda example: example["support"] != '')
filtered_sciq

DatasetDict({
    train: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support'],
        num_rows: 10481
    })
    validation: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support'],
        num_rows: 887
    })
    test: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support'],
        num_rows: 884
    })
})

Check for support with longer than 512 tokens/words (Maximum token of T5).

In [5]:
import pandas as pd
df_train = pd.DataFrame(sciq_dataset["train"])
df_train.head()

Unnamed: 0,question,distractor3,distractor1,distractor2,correct_answer,support
0,What type of organism is commonly used in prep...,viruses,protozoa,gymnosperms,mesophilic organisms,"Mesophiles grow best in moderate temperature, ..."
1,What phenomenon makes global winds blow northe...,tropical effect,muon effect,centrifugal effect,coriolis effect,Without Coriolis Effect the global winds would...
2,Changes from a less-ordered state to a more-or...,endothermic,unbalanced,reactive,exothermic,Summary Changes of state are examples of phase...
3,What is the least dangerous radioactive decay?,zeta decay,beta decay,gamma decay,alpha decay,All radioactive decay is dangerous to living t...
4,Kilauea in hawaii is the world’s most continuo...,magma,greenhouse gases,carbon and smog,smoke and ash,Example 3.5 Calculating Projectile Motion: Hot...


In [6]:
print(df_train['support'].str.len().max())

3559


In [7]:
test = filtered_sciq.filter(lambda example: len(example["support"]) > 3000) 
test

DatasetDict({
    train: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support'],
        num_rows: 10
    })
    validation: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support'],
        num_rows: 0
    })
    test: Dataset({
        features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support'],
        num_rows: 0
    })
})

In [8]:
test_data = test['train'][7]
test_data['support']

'membrane gradients was known, Mitchell proposed that energy captured through the absorption of light (by phototrophs) or the breakdown of molecules into more stable molecules (by various types of chemotrophs) relied on the same basic (homologous) mechanism, namely the generation of H+ gradients across membranes (the plasma membrane in prokaryotes or the internal membranes of mitochondria or chloroplasts (intracellular organelles, derived from bacteria – see below) in eukaryotes. What makes us think that these processes might have a similar evolutionary root, that they are homologous? Basically, it is the observation that in both light- and chemical-based processes captured energy is transferred through the movement of electrons through a membrane-embedded “electron transport chain”. An electron transport chain involves a series of membrane and associated proteins and a series of reduction-oxidation or redox reactions (see below) during which electrons move from a high energy donor to 

We can see above that support is long but only a few sentences is relevant, we cannot do raw summarization, we have to extract text based on keywords which are answers (distractors and keywords from questions too!). If we left this, support and answer will be truncated. If we summarize it raw, we lose important info of what is asked.

Extractive Summarization based on answer and questions

In [9]:
import yake

def extract_question(question):
    kw_extractor = yake.KeywordExtractor(top=10, stopwords=None)
    keywords = kw_extractor.extract_keywords(question)
    return [keyword for keyword, score in keywords]

In [10]:
import spacy

nlp = spacy.load("en_core_web_sm")

def clean_text(text):
    lemmatizer = nlp.get_pipe("lemmatizer")
    doc = nlp(text)
    tokens = [tok for tok in doc]
    lemmas = [tok.lemma_ for tok in tokens]
    return ' '.join(lemmas)

In [11]:
def score_sentence(sentence, words):
    score = 0
    clean_sentences = clean_text(sentence.lower())
    for word in words:
        if clean_text(word.lower()) in clean_sentences:
            score += 1
    return score

In [12]:
def summarize_support(example, max_words=256):
    text = example["support"]
    words = extract_question(example["question"])
    words.extend([test_data["correct_answer"]])

    scored_sentences = (
        (i, sentence, score_sentence(sentence, words))
        for i, sentence in enumerate(text.split("."))
        if any(clean_text(w.lower()) in clean_text(sentence.lower()) for w in words)
    )
    ranked_sentences = sorted(scored_sentences, key=lambda x: x[2], reverse=True)

    sentence_in_summary = []
    sum_of_words = 0
    for order, sentence, _ in ranked_sentences:
        num_of_words = len(sentence.split())
        if sum_of_words + num_of_words < max_words:
            sentence_in_summary.append((order, sentence))
            sum_of_words += num_of_words 

    summary = sorted(sentence_in_summary, key=lambda x: x[1])
    return ".".join(sent for _, sent in summary)


summarize_support(test_data)

' ) The major pigment in this system, chlorophyll, is based on a complex molecule, a porphyrin (see above) and it is primarily these pigments that give plants their green color. At this point, we consider only one aspect of this photosynthetic system, known as the oxygenic or non-cyclic system (look to more advanced classes for more details. Chlorophyll is synthesized by a conserved biosynthetic pathway that is also used to synthesize heme, which is found in the hemoglobin of animals and in the cytochromes, within the electron transport chain present in both plants and animals (which. For simplicity’s sake we will describe the photosynthetic system of cyanobacterium; the system in eukaryotic algae and plants, while more complex, follows the same basic logic. In all of these organisms, their photosynthetic systems appear to be homologous, that is derived from a common ancestor, a topic we will return to later in this chapter. Oxygenic photosynthesis \u2028 Compared to the salt loving ar

In [13]:
def generate_context(example, max_token_size=256):
    answer_size = len(example['correct_answer'].split())
    support_size = len(example['support'].split())
    words_len = answer_size + support_size + 1
    context = example['support']

    if words_len > max_token_size:
        max_new_token_size = max_token_size - answer_size - 1
        context = summarize_support(example, max_words=max_new_token_size)
    
    return {
        "context": context
    }

# preprocessed_sciq = filtered_sciq.map(generate_context, num_proc=4)

In [14]:
# test_x = preprocessed_sciq.filter(lambda example: len(example["question"].split()) > 64) 
# test_x['train'][0]

In [15]:
# preprocessed_sciq.save_to_disk("preprocessed_sciq-qg-256-new")

Now preprocessed has shorter context for long ones, due to map process running slow, uncomment below after getting zip file from me

In [16]:
from datasets import load_from_disk
preprocessed_sciq = load_from_disk("preprocessed_sciq-qg-256-new")

### Step 2 Question Generation

#### Tokenize for Question Generation 

Input : Context and Answer \
Output : Question

In [17]:
import torch
import copy
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-small")
tokenizer.add_special_tokens({"sep_token": "<sep>"})

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


1

In [18]:
def preprocess_dataset(example):
    text = "{}<sep>{}".format(example['correct_answer'], example['context'])
    question = example['question']

    max_length = 256
    max_length_target=256
    
    tokenized_inputs = tokenizer.encode_plus(text, max_length=max_length, padding='max_length', truncation=True, return_tensors="pt")
    tokenized_targets = tokenizer.encode_plus(question, max_length=max_length_target, padding='max_length', truncation=True, return_tensors="pt")
    
    input_ids = tokenized_inputs['input_ids'].squeeze()
    input_attention = tokenized_inputs['attention_mask'].squeeze()

    target_ids = tokenized_targets['input_ids'].squeeze()
    target_attention = tokenized_targets['attention_mask'].squeeze()

    labels = copy.deepcopy(target_ids)
    labels[labels == 0] = -100
    
    outputs = {
        'input_ids':input_ids, 
        'attention_mask': input_attention, 
        'labels': labels
    }

    return outputs
    
tokenized_dataset = preprocessed_sciq.map(preprocess_dataset, remove_columns= ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support', 'context'])
tokenized_dataset.set_format("torch")

In [19]:
import torch
from torch.utils.data import DataLoader
from torch.optim import AdamW
from transformers import get_scheduler
from transformers import T5ForConditionalGeneration

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-small").to(device)

train_dataloader = DataLoader(tokenized_dataset["train"], shuffle=True, batch_size=8)
eval_dataloader = DataLoader(tokenized_dataset["validation"], batch_size=2)
test_dataloader = DataLoader(tokenized_dataset["test"], batch_size=2)

optimizer = AdamW(model.parameters(), lr=5e-5)

num_epochs = 5
num_training_steps = num_epochs * len(train_dataloader)

lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)

In [20]:
class EarlyStopping:
    def __init__(self, tolerance=3, min_delta=0.5):

        self.tolerance = tolerance
        self.min_delta = min_delta
        self.counter = 0
        self.early_stop = False

    def __call__(self, train_loss, validation_loss):
        if (validation_loss - train_loss) > self.min_delta:
            self.counter +=1
            if self.counter >= self.tolerance:  
                self.early_stop = True

In [21]:
torch.cuda.empty_cache()

In [22]:
import evaluate
from tqdm.auto import tqdm

def validate(model, eval_dataloader):
    metric = evaluate.load("accuracy")
    model.eval()
    progress_bar_val = tqdm(range(len(eval_dataloader)), leave=False)
    
    total_loss = 0
    total_batches = 0
    for batch in eval_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.no_grad():
            outputs = model(**batch)
        loss = outputs.loss
        total_loss += loss.item()
        total_batches += 1
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=-1)
        preds_flatten = torch.flatten(predictions)
        refs_flatten = torch.flatten(batch["labels"])
        metric.add_batch(predictions=preds_flatten, references=refs_flatten)
        progress_bar_val.update(1)
    average_loss = total_loss/total_batches
    val_accuracy = metric.compute()
    progress_bar_val.close()
    return average_loss, val_accuracy['accuracy']

2024-03-27 20:43:34.483299: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-27 20:43:34.484267: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-27 20:43:34.819911: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-03-27 20:43:35.336587: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [23]:
def train(model, train_dataloader, eval_dataloader, num_training_steps, num_epochs):
    progress_bar = tqdm(range(num_training_steps), unit="batch")
    early_stopping = EarlyStopping(tolerance=2, min_delta=5)
    model.train()
    for epoch in range(num_epochs):
        running_loss = 0
        total_batches = 0
        for batch in train_dataloader:
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            loss = outputs.loss
            loss.backward()
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            running_loss += loss.item()
            total_batches += 1
            progress_bar.set_description(f"Epoch {epoch + 1}")
            progress_bar.update(1)
        train_loss = running_loss/total_batches
        val_loss, val_acc = validate(model, eval_dataloader)
        
        torch.save({
                'epoch': epoch + 1,
                'model_state_dict': model.state_dict(),
                'train_loss': train_loss,
                'val_loss': val_loss,
            }, 'flan-T5-finetuned-qg-{}'.format(epoch + 1))
        
        print("Epoch {} : Training Loss {} Val Loss {} Val Acc {}%".format(epoch + 1, train_loss, val_loss, val_acc))
        
        early_stopping(train_loss, val_loss)
        if early_stopping.early_stop:
          print("We are at epoch:", epoch)
          break

# train(model, train_dataloader, eval_dataloader, num_training_steps, num_epochs)

Early stopped manually at epoch 6, best model is at epoch 3 

In [24]:
def evaluate_model(model, test_dataloader):
    metric_bleu = evaluate.load("bleu")
    model.eval()
    progress_bar_test = tqdm(range(len(test_dataloader)), leave=False)

    for batch in test_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.no_grad():
            outputs = model(**batch)
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=-1)

        preds_flatten = torch.flatten(predictions)
        refs_flatten = torch.flatten(batch["labels"])

        refs_flatten[refs_flatten == -100] = 0

        preds_questions = tokenizer.batch_decode(preds_flatten, skip_special_tokens=True, clean_up_tokenization_spaces=True)
        refs_questions = tokenizer.batch_decode(refs_flatten, skip_special_tokens=True, clean_up_tokenization_spaces=True)

        metric_bleu.add_batch(predictions=preds_questions, references=refs_questions)
        progress_bar_test.update(1)
    test_bleu = metric_bleu.compute()
    progress_bar_test.close()
    return test_bleu

model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-small").to(device)
model.load_state_dict(torch.load('flan-T5-finetuned-qg-3')['model_state_dict'])

evaluate_model(model, test_dataloader)

  0%|          | 0/442 [00:00<?, ?it/s]

KeyboardInterrupt: 

In [1]:
def inference(model, answer, context):
    text = "{}<sep>{}".format(answer, context)
    max_length = 256
    tokenized_inputs = tokenizer.encode_plus(text, max_length=max_length, padding='max_length', truncation=True, return_tensors="pt").to(device) 
    decoder_input_ids = tokenized_inputs['input_ids']

    output = model.generate(
        input_ids=tokenized_inputs['input_ids']
    )
    
    question = tokenizer.batch_decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    return question[0]

inference(model, "one-tenth", "Bacteria display a wide diversity of shapes and sizes. Bacterial cells are about one-tenth the size of eukaryotic cells and are typically 0.5–5.0 micrometres in length. However, a few species are visible to the unaided eye—for example, Thiomargarita namibiensis is up to half a millimetre long,[34] Epulopiscium fishelsoni reaches 0.7 mm,[35] and Thiomargarita magnifica can reach even 2 cm in length, which is 50 times larger than other known bacteria.[36][37] Among the smallest bacteria are members of the genus Mycoplasma, which measure only 0.3 micrometres, as small as the largest viruses.[38] Some bacteria may be even smaller, but these ultramicrobacteria are not well-studied.[39]")

NameError: name 'model' is not defined

In [1]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [25]:
model.push_to_hub("question-generation")

model.safetensors:   0%|          | 0.00/308M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/rizkiduwinanto/question-generation/commit/4efab07c7c0a8d4e0210dcbcd25e1f25824e51c6', commit_message='Upload T5ForConditionalGeneration', commit_description='', oid='4efab07c7c0a8d4e0210dcbcd25e1f25824e51c6', pr_url=None, pr_revision=None, pr_num=None)