In [None]:
!pip install pytorch_lightning transformers


In [None]:
from typing import List, Dict
import tqdm.notebook as tq
from tqdm.notebook import tqdm
from random import shuffle

import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet

import pandas as pd
import numpy as np

import gensim
from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

import torch
from pathlib import Path
from torch.utils.data import Dataset, DataLoader
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from sklearn.model_selection import train_test_split
from transformers import AdamW, T5ForConditionalGeneration, T5TokenizerFast as T5Tokenizer
import spacy
import warnings
warnings.filterwarnings("ignore")

[nltk_data] Downloading package wordnet to /root/nltk_data...


# Algorithm description

We start with a text (the context) and want to generate a multiple-choice question (MCQ) about it, together with the correct answer.

To do this, I fine-tuned T5 (`t5-small`) on SQuAD v1, keeping only three fields from the dataset: the context, the question, and the answer to the question.

There are two ways to generate a question:

1) From a context and a keyword: the model generates a question whose answer is the keyword, then the words closest to the keyword in the embedding space are used as the other answer options.

2) From the context alone: NER is first used to find keywords, then a question with answer options is generated for each of them.
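
Both options end with the same assembly step: one correct answer plus a few distractors become lettered options. A minimal sketch of that step follows; `build_mcq` and its `seed` parameter are hypothetical names introduced here for illustration, not part of the notebook.

```python
import random
from string import ascii_uppercase

def build_mcq(question, answer, distractors, seed=None):
    """Combine one correct answer with distractors into lettered options.

    Hypothetical helper for illustration. Returns the printable lines and
    the letter of the correct option.
    """
    options = list(distractors) + [answer]
    random.Random(seed).shuffle(options)  # local RNG, global state untouched
    lines = [question]
    for letter, option in zip(ascii_uppercase, options):
        lines.append(f"{letter}) {option}")
    correct = ascii_uppercase[options.index(answer)]
    return lines, correct

lines, correct = build_mcq(
    "What sector should the product or service be within?",
    "Cleantech",
    ["hi-tech", "incubator", "start-ups"],
    seed=0,
)
print("\n".join(lines))
print("Correct:", correct)
```

Returning the correct letter alongside the lines makes it possible to grade answers later, which printing alone does not allow.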

A frequent problem is that the keyword has no close words in the embedding vocabulary. To mitigate this, you can enlarge the vocabulary used for the embeddings, or build a vocabulary tailored to specific topics.
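
One cheap mitigation for multi-word keywords can be sketched as follows: try the whole keyword first, then fall back to its individual words. This is an assumption-laden illustration, not the notebook's method. `distractors_with_fallback` is a hypothetical helper, and the neighbour table holds made-up values standing in for a real embedding model.

```python
def distractors_with_fallback(keyword, most_similar, count=3):
    """Try the whole keyword first, then its individual words.

    `most_similar(word, count)` is any callable that returns a list of
    neighbour words and raises KeyError for out-of-vocabulary input
    (an assumption about how the embedding lookup behaves).
    """
    lowered = keyword.lower()
    candidates = [lowered] + lowered.split()
    for candidate in candidates:
        try:
            return candidate, most_similar(candidate, count)
        except KeyError:
            continue
    return None, []

# Toy neighbour table (made-up values) standing in for an embedding model.
TOY_NEIGHBOURS = {
    "sustainability": ["conservation", "resilience", "stewardship"],
    "innovation": ["creativity", "entrepreneurship", "technology"],
}

def toy_most_similar(word, count):
    return TOY_NEIGHBOURS[word][:count]  # KeyError if the word is unknown

print(distractors_with_fallback("Digital Innovation", toy_most_similar))
print(distractors_with_fallback("Blockchain", toy_most_similar))
```

The function also returns which candidate matched, so the caller can tell when the distractors belong to only part of the original keyword.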

P.S. Fine-tuning for 5 epochs took 2-3 hours.

# Data loading

In [None]:
squad_train_df = pd.read_csv('/content/drive/MyDrive/MCQ/data/squad-v1/train_df.csv')
squad_dev_df = pd.read_csv('/content/drive/MyDrive/MCQ/data/squad-v1/dev_df.csv')

print('train:', squad_train_df.shape)
print('dev:', squad_dev_df.shape)

train: (87599, 6)
dev: (10570, 6)


In [None]:
squad_train_df.head()

Unnamed: 0,question,context_para,context_sent,answer_text,answer_start,answer_end
0,To whom did the Virgin Mary allegedly appear i...,"Architecturally, the school has a Catholic cha...","It is a replica of the grotto at Lourdes, Fran...",Saint Bernadette Soubirous,515,541
1,What is in front of the Notre Dame Main Building?,"Architecturally, the school has a Catholic cha...",Immediately in front of the Main Building and ...,a copper statue of Christ,188,213
2,The Basilica of the Sacred heart at Notre Dame...,"Architecturally, the school has a Catholic cha...",Next to the Main Building is the Basilica of t...,the Main Building,279,296
3,What is the Grotto at Notre Dame?,"Architecturally, the school has a Catholic cha...","Immediately behind the basilica is the Grotto,...",a Marian place of prayer and reflection,381,420
4,What sits on top of the Main Building at Notre...,"Architecturally, the school has a Catholic cha...",Atop the Main Building's gold dome is a golden...,a golden statue of the Virgin Mary,92,126


# Data cleaning

In [None]:
context_name = 'context_para'
drop_context = 'context_sent' 
df = squad_train_df.copy()

df = df.dropna()
df.rename(columns = {context_name: 'context'}, inplace=True)
df.drop(columns=[drop_context, 'answer_start', 'answer_end'], inplace=True)

test_df = df[:11877]
train_df = df[11877:]

dev_df = squad_dev_df.copy()
dev_df.rename(columns = {context_name: 'context'}, inplace=True)
dev_df.drop(columns=[drop_context, 'answer_start', 'answer_end'], inplace=True)

print(train_df.shape, 'train_df')
print(dev_df.shape, 'dev_df')
print(test_df.shape, 'test_df')

train_df.head()

(75721, 3) train_df
(10570, 3) dev_df
(11877, 3) test_df


Unnamed: 0,question,context,answer_text
11877,What is heresy mainly at odds with?,Heresy is any provocative belief or theory tha...,established beliefs or customs
11878,What is a person called is practicing heresy?,Heresy is any provocative belief or theory tha...,A heretic
11879,What religions and idea of thought is heresy c...,The term is usually used to refer to violation...,"Christianity, Judaism, Islam and Marxism"
11880,What cultures are listed as examples of discip...,"In certain historical Christian, Islamic and J...","Christian, Islamic and Jewish"
11881,What language does the term heresy find its ro...,The term heresy is from Greek αἵρεσις original...,Greek


# Dataset

In [None]:
SEP_TOKEN = '<sep>'
MASKING_CHANCE = 0.3

In [None]:
class MCQDataset(Dataset):
    def __init__(self, data, tokenizer, source_max_token_len, target_max_token_len):
        self.tokenizer = tokenizer
        self.data = data
        self.source_max_token_len = source_max_token_len
        self.target_max_token_len = target_max_token_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index: int):
        data_row = self.data.iloc[index]

        # With probability MASKING_CHANCE, hide the answer from the model input.
        if np.random.rand() > MASKING_CHANCE:
            answer = data_row['answer_text']
        else:
            answer = '[MASK]'

        # Source: "<answer or [MASK]> <sep> <context>"
        source_encoding = self.tokenizer('{} {} {}'.format(answer, SEP_TOKEN, data_row['context']),
                                         max_length=self.source_max_token_len,
                                         padding='max_length',
                                         truncation=True,
                                         return_attention_mask=True,
                                         add_special_tokens=True,
                                         return_tensors='pt')

        # Target: "<answer> <sep> <question>"
        target_encoding = self.tokenizer('{} {} {}'.format(data_row['answer_text'], SEP_TOKEN, data_row['question']),
                                         max_length=self.target_max_token_len,
                                         padding='max_length',
                                         truncation=True,
                                         return_attention_mask=True,
                                         add_special_tokens=True,
                                         return_tensors='pt')

        # Replace padding ids with -100 so the loss ignores padded positions.
        labels = target_encoding['input_ids']
        labels[labels == self.tokenizer.pad_token_id] = -100

        return dict(answer_text = data_row['answer_text'], 
                    context = data_row['context'], 
                    question = data_row['question'], 
                    input_ids = source_encoding['input_ids'].flatten(), 
                    attention_mask = source_encoding['attention_mask'].flatten(),
                    labels=labels.flatten())
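
The source and target strings built in `__getitem__` can be inspected without a tokenizer. The sketch below reproduces only the string formatting and the masking decision. `build_pair` is a hypothetical helper, the example row is a toy one, and the seeded `random.Random` replaces `np.random` purely to keep the demo repeatable.

```python
import random

SEP_TOKEN = '<sep>'
MASKING_CHANCE = 0.3

def build_pair(answer, context, question, rng):
    """Return (source, target) strings in the format used by the dataset."""
    # With probability MASKING_CHANCE the answer is hidden from the input,
    # presumably so the model also learns to propose an answer itself.
    shown = answer if rng.random() > MASKING_CHANCE else '[MASK]'
    source = '{} {} {}'.format(shown, SEP_TOKEN, context)
    # The target always carries the real answer, followed by the question.
    target = '{} {} {}'.format(answer, SEP_TOKEN, question)
    return source, target

rng = random.Random(0)  # seeded only to make the demo repeatable
source, target = build_pair(
    'Paris',                            # toy row, not taken from SQuAD
    'Paris is the capital of France.',
    'What is the capital of France?',
    rng,
)
print(source)
print(target)
```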

In [None]:
class MCQDataModule(pl.LightningDataModule):
    def __init__(self, train_df, val_df, test_df, tokenizer, batch_size, source_max_token_len, target_max_token_len): 
        super().__init__()
        self.batch_size = batch_size
        self.train_df = train_df
        self.val_df = val_df
        self.test_df = test_df
        self.tokenizer = tokenizer
        self.source_max_token_len = source_max_token_len
        self.target_max_token_len = target_max_token_len

    def setup(self, stage=None):
        self.train_dataset = MCQDataset(self.train_df, self.tokenizer, self.source_max_token_len, self.target_max_token_len)
        self.val_dataset = MCQDataset(self.val_df, self.tokenizer, self.source_max_token_len, self.target_max_token_len)
        self.test_dataset = MCQDataset(self.test_df, self.tokenizer, self.source_max_token_len, self.target_max_token_len)

    def train_dataloader(self):
        return DataLoader(self.train_dataset, batch_size = self.batch_size, shuffle=True, num_workers = 2)

    def val_dataloader(self): 
        return DataLoader(self.val_dataset, batch_size=1, num_workers=2)

    def test_dataloader(self):
        return DataLoader(self.test_dataset, batch_size=1, num_workers=2)

# Hyperparameters & data

In [None]:
SOURCE_MAX_TOKEN_LEN = 300
TARGET_MAX_TOKEN_LEN = 80
N_EPOCHS = 5
BATCH_SIZE = 16
LEARNING_RATE = 0.0001
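
As a quick sanity check, the number of optimizer steps follows directly from these values and the training set size. The sketch assumes a single device, no gradient accumulation, and that the last partial batch is kept (the `DataLoader` default).

```python
import math

TRAIN_ROWS = 75721   # rows in train_df after cleaning
BATCH_SIZE = 16
N_EPOCHS = 5

steps_per_epoch = math.ceil(TRAIN_ROWS / BATCH_SIZE)  # last batch is partial
total_steps = steps_per_epoch * N_EPOCHS

print(steps_per_epoch)  # 4733
print(total_steps)      # 23665
```

These figures match the `global step` values reported in the training log (4733 after epoch 0, 23665 after epoch 4).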

In [None]:
tokenizer = T5Tokenizer.from_pretrained('t5-small')
tokenizer.add_tokens(SEP_TOKEN)
TOKENIZER_LEN = len(tokenizer)
data_module = MCQDataModule(train_df, dev_df, test_df, tokenizer, BATCH_SIZE, SOURCE_MAX_TOKEN_LEN, TARGET_MAX_TOKEN_LEN)

data_module.setup()


# Model

In [None]:
class QGModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = T5ForConditionalGeneration.from_pretrained('t5-small', return_dict=True)
        self.model.resize_token_embeddings(TOKENIZER_LEN)

    def forward(self, input_ids, attention_mask, labels=None):
        output = self.model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        return output.loss, output.logits

    def training_step(self, batch, batch_idx):
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        labels = batch['labels']
        loss, output = self(input_ids, attention_mask, labels)
        self.log('train_loss', loss, prog_bar=True, logger=True)
        return loss

    def validation_step(self, batch, batch_idx):
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        labels = batch['labels']
        loss, output = self(input_ids, attention_mask, labels)
        self.log('val_loss', loss, prog_bar=True, logger=True)
        return loss

    def test_step(self, batch, batch_idx):
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        labels = batch['labels']
        loss, output = self(input_ids, attention_mask, labels)
        self.log('test_loss', loss, prog_bar=True, logger=True)
        return loss
  
    def configure_optimizers(self):
        return AdamW(self.parameters(), lr=LEARNING_RATE)

In [None]:
checkpoint_callback = ModelCheckpoint(dirpath='checkpoints', filename='best-checkpoint', save_top_k=-1,
                                      verbose=True, monitor='val_loss', mode='min')
trainer = pl.Trainer(callbacks=checkpoint_callback, max_epochs=N_EPOCHS)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs


In [None]:
model = QGModel()

trainer.fit(model, data_module)


INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type                       | Params
-----------------------------------------------------
0 | model | T5ForConditionalGeneration | 60.5 M
-----------------------------------------------------
60.5 M    Trainable params
0         Non-trainable params
60.5 M    Total params
241.971   Total estimated model params size (MB)



INFO:pytorch_lightning.utilities.rank_zero:Epoch 0, global step 4733: 'val_loss' reached 1.42180 (best 1.42180), saving model to '/content/checkpoints/best-checkpoint.ckpt' as top 1



INFO:pytorch_lightning.utilities.rank_zero:Epoch 1, global step 9466: 'val_loss' reached 1.35112 (best 1.35112), saving model to '/content/checkpoints/best-checkpoint-v1.ckpt' as top 2



INFO:pytorch_lightning.utilities.rank_zero:Epoch 2, global step 14199: 'val_loss' reached 1.33927 (best 1.33927), saving model to '/content/checkpoints/best-checkpoint-v2.ckpt' as top 3



INFO:pytorch_lightning.utilities.rank_zero:Epoch 3, global step 18932: 'val_loss' reached 1.32348 (best 1.32348), saving model to '/content/checkpoints/best-checkpoint-v3.ckpt' as top 4



INFO:pytorch_lightning.utilities.rank_zero:Epoch 4, global step 23665: 'val_loss' reached 1.31512 (best 1.31512), saving model to '/content/checkpoints/best-checkpoint-v4.ckpt' as top 5
INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=5` reached.
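
Because `save_top_k=-1` keeps the checkpoint from every epoch, the file to load has to be chosen by hand. A small sketch of that choice, using the validation losses and file names copied from the log above:

```python
# (epoch, val_loss, checkpoint file) as reported in the training log
logged = [
    (0, 1.42180, 'best-checkpoint.ckpt'),
    (1, 1.35112, 'best-checkpoint-v1.ckpt'),
    (2, 1.33927, 'best-checkpoint-v2.ckpt'),
    (3, 1.32348, 'best-checkpoint-v3.ckpt'),
    (4, 1.31512, 'best-checkpoint-v4.ckpt'),
]

# Pick the row with the lowest validation loss.
best_epoch, best_loss, best_file = min(logged, key=lambda row: row[1])
print(best_epoch, best_loss, best_file)
```

The validation loss was still decreasing at the last epoch, so the final checkpoint is also the best one here.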


In [None]:
checkpoint_path = '/content/drive/MyDrive/MCQ/checkpoints/best-checkpoint-v4.ckpt'

best_model = QGModel.load_from_checkpoint(checkpoint_path)
best_model.freeze()
best_model.eval()

print()





In [None]:
def generate(qgmodel, answer, context):
    source_encoding = tokenizer('{} {} {}'.format(answer, SEP_TOKEN, context),
                                max_length=SOURCE_MAX_TOKEN_LEN,
                                padding='max_length',
                                truncation=True,
                                return_attention_mask=True,
                                add_special_tokens=True,
                                return_tensors='pt')

    generated_ids = qgmodel.model.generate(input_ids=source_encoding['input_ids'],
                                           attention_mask=source_encoding['attention_mask'],
                                           num_beams=1,
                                           max_length=TARGET_MAX_TOKEN_LEN,
                                           repetition_penalty=2.5,
                                           length_penalty=1.0,
                                           early_stopping=True,
                                           use_cache=True)

    # Decode every generated sequence; a list (not a set) keeps the output order deterministic.
    preds = [tokenizer.decode(generated_id, skip_special_tokens=False, clean_up_tokenization_spaces=True) for generated_id in generated_ids]

    return ''.join(preds)

In [None]:
def show_result(generated, answer, context):
    print('Context: ', context)
    print('Generated: ', generated)
    print('Answer: ', answer)

In [None]:
context = 'Which of the following is not a type of muscle tissue?'
answer = 'Adipose'

generated = generate(best_model, answer, context)

show_result(generated, answer, context)

Context:  Which of the following is not a type of muscle tissue?
Generated:  <pad> Adipose<sep> What is the name of the muscle tissue that does not have a type?</s>
Answer:  Adipose
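
The raw output still contains the special tokens and the answer prefix. A possible way to split it into its two parts is sketched below. `parse_generated` is a hypothetical helper, and it assumes the output follows the `answer <sep> question` target format used during fine-tuning.

```python
import re

SEP_TOKEN = '<sep>'

def parse_generated(text):
    """Split raw decoder output into (answer, question)."""
    # Drop T5 special tokens but keep the separator needed for splitting.
    cleaned = re.sub(r'</?s>|<pad>|<unk>', '', text)
    answer, found, question = cleaned.partition(SEP_TOKEN)
    if not found:  # no separator: treat everything as the question
        return '', cleaned.strip()
    return answer.strip(), question.strip()

raw = '<pad> Adipose<sep> What is the name of the muscle tissue that does not have a type?</s>'
print(parse_generated(raw))
```

Keeping the answer separate from the question is useful when the input answer was masked, since the generated answer is then the only one available.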


# "Synonym" generation

In [None]:
glove = '/content/drive/MyDrive/MCQ/data/embeddings/glove.6B.300d.txt'
temp = '/content/drive/MyDrive/MCQ/data/embeddings/word2vec-glove.6B.300d.txt'

In [None]:
glove2word2vec(glove, temp)
model = KeyedVectors.load_word2vec_format(temp)

In [None]:
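def generate_distractors(answer, count):
    # The GloVe 6B vocabulary is lowercase, so normalize the query first.
    answer = answer.lower()
    # most_similar returns (word, similarity) pairs and raises KeyError for unknown words.
    closest_words = model.most_similar(positive=[answer], topn=count)
    return [word for word, _ in closest_words]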

In [None]:
generate_distractors('Adipose', 4)

['adipocytes', 'tissue', 'fatty', 'tissues']
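
Nearest neighbours often include inflected forms of the answer or of each other (`tissue` and `tissues` above), which makes weak distractors. A crude filter is sketched below. The shared-prefix heuristic and the name `filter_distractors` are assumptions made for illustration, and a lemmatizer or stemmer would likely be more robust.

```python
def filter_distractors(answer, candidates, count=3):
    """Drop candidates that look like variants of the answer or of each other.

    Heuristic: two words are treated as variants when their first five
    letters match (case-insensitive).
    """
    def key(word):
        return word.lower()[:5]

    seen = {key(answer)}
    kept = []
    for word in candidates:
        if key(word) in seen:
            continue
        seen.add(key(word))
        kept.append(word)
        if len(kept) == count:
            break
    return kept

print(filter_distractors('Adipose', ['adipocytes', 'tissue', 'fatty', 'tissues']))
# ['tissue', 'fatty']
```

Since filtering shrinks the list, it makes sense to request more neighbours than needed (a larger `count` in `generate_distractors`) before applying it.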

# NER

In [None]:
nlp = spacy.load("en_core_web_sm")
text = "Apple is looking at buying a U.K. startup for $1 billion"
doc = nlp(text)

for ent in doc.ents:
    print(ent.text, ent.label_)

Apple ORG
U.K. GPE
$1 billion MONEY


# Final

In [None]:
import re

def generation(context, right_keyword=None):

    if right_keyword is not None:
        keyword_new = right_keyword.lower()
        generated = generate(best_model, right_keyword, context)
        generated = re.sub("<.*?>", "", generated)
        try:
            synonyms = generate_distractors(keyword_new, 3)
            synonyms.append(right_keyword)
            # Capitalize the options when the keyword itself is all upper-case.
            synonyms = [s.capitalize() if right_keyword.isupper() else s for s in synonyms]
            shuffle(synonyms)

            print(generated)
            print('A)', synonyms[0])
            print('B)', synonyms[1])
            print('C)', synonyms[2])
            print('D)', synonyms[3])
        except KeyError:
            # most_similar raises KeyError for out-of-vocabulary keywords.
            print(generated)
            print('bad keyword (', right_keyword, ') not found in dictionary', sep='')
    
    else:
        keywords = []
        doc = nlp(context)

        for keyword in doc.ents:
            print(keyword.text, keyword.label_)
            keyword_new = str.lower(keyword.text)
            generated = generate(best_model, keyword_new, context)
            generated = re.sub("<.*?>", "", generated)
            try:
                synonyms = generate_distractors(keyword_new, 3)
                synonyms.append(keyword.text)
                synonyms = [synonyms[i].capitalize() if keyword.text.isupper() else synonyms[i] for i in range(len(synonyms))]
                shuffle(synonyms)

                print(generated)
                print('A)', synonyms[0])
                print('B)', synonyms[1])
                print('C)', synonyms[2])
                print('D)', synonyms[3])
            except KeyError:
                print(generated)
                print('bad keyword (', keyword, ') not found in dictionary', sep='')
        
            print()

In [None]:
context = 'Think of a business idea for a product or service you want to develop and to create your own \
        company with it. The product or service should be utilizing the power of Digital Innovation to \
        achieve Sustainability. Thus it should be within the Cleantech sector and should be based in \
        digital technologies, such as A.I., Blockchain, IoT, etc. The aim of the product or service should \
        be to turn domestic homes or businesses, energy efficient. It can be about special devices, \
        sophisticated platforms, marketplaces, or anything else.'

generation(context)
print('context:', context)

Digital Innovation to         achieve Sustainability ORG
 digital innovation to achieve sustainability What should the product or service be utilizing?
bad keyword (Digital Innovation to         achieve Sustainability) not found in dictionary

Cleantech ORG
 cleantech What sector should the product or service be within?
A) Cleantech
B) hi-tech
C) incubator
D) start-ups

A.I. GPE
 a.i. What is the name of Blockchain?
A) Bezzerides
B) Antz
C) Spielberg
D) A.i.

Blockchain GPE
 blockchain What type of technology should the product or service be based in?
bad keyword (Blockchain) not found in dictionary

context: Think of a business idea for a product or service you want to develop and to create your own         company with it. The product or service should be utilizing the power of Digital Innovation to         achieve Sustainability. Thus it should be within the Cleantech sector and should be based in         digital technologies, such as A.I., Blockchain, IoT, etc. The aim of the produ
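
The long runs of spaces in the output come from the indentation that follows each backslash continuation inside the context string, and they end up inside entity text such as the `ORG` span above. One way to avoid this is to collapse whitespace before calling `generation`. The sketch below uses a hypothetical `normalize_whitespace` helper and a shortened excerpt of the context.

```python
def normalize_whitespace(text):
    """Collapse runs of spaces, tabs and newlines into single spaces."""
    return ' '.join(text.split())

# A backslash continuation keeps the next line's indentation in the string.
raw_context = 'the power of Digital Innovation to \
        achieve Sustainability.'

print(repr(raw_context))
print(normalize_whitespace(raw_context))
```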

In [None]:
generation(context, right_keyword='Sustainability')
print('context:', context)

 Sustainability What should the product or service achieve?
bad keyword (Sustainability) not found in dictionary
context: Think of a business idea for a product or service you want to develop and to create your own         company with it. The product or service should be utilizing the power of Digital Innovation to         achieve Sustainability. Thus it should be within the Cleantech sector and should be based in         digital technologies, such as A.I., Blockchain, IoT, etc. The aim of the product or service should         be to turn domestic homes or businesses, energy efficient. It can be about special devices,         sophisticated platforms, marketplaces, or anything else.


Examples of generated MCQs:

----
1) Context: Think of a business idea for a product or service you want to develop and to create your own company with it. The product or service should be utilizing the power of Digital Innovation to achieve Sustainability. Thus it should be within the Cleantech sector and should be based in digital technologies, such as A.I., Blockchain, IoT, etc. The aim of the product or service should be to turn domestic homes or businesses, energy efficient. It can be about special devices, sophisticated platforms, marketplaces, or anything else.

(cleantech) What sector should the product or service be within?

A) Cleantech

B) hi-tech

C) incubator

D) start-ups

----
2) Context: Which of the following is not a type of muscle tissue?

(Adipose) What is the name of the muscle tissue that does not have a type?

A) adipocytes

B) tissue

C) fatty

D) tissues