# Simple text2mcq
## How it works
- First, generate keywords with KeyBERT.
  - Important words are likely to make good answers for questions about the text.
- Find the sentences that contain each keyword and turn them into "question-like sentences" by replacing the keyword with a blank (see Examples).
- Generate false answers with word2vec (w2v), then apply a heuristic to remove false negatives (i.e. words that are just another form of the correct answer and would also answer the question).
- Use Facebook's BART to pick the questions that best represent what was written in the text.
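
The blanking and false-negative steps above can be sketched without any model. This is a minimal illustration, not the notebook's implementation: `make_blank`, `filter_distractors` and the hand-written `edit_distance` are hypothetical helpers (the notebook itself relies on `nltk.edit_distance`), and the threshold of 2 edits is taken from the code further down.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def make_blank(sentence: str, keyword: str, blank: str = "________") -> str:
    """Replace the keyword inside every word that contains it (case-insensitive)."""
    return " ".join(
        w.lower().replace(keyword, blank) if keyword in w.lower() else w
        for w in sentence.split()
    )


def filter_distractors(correct: str, candidates: list, max_dist: int = 2) -> list:
    """Drop candidates containing a word within max_dist edits of the correct answer."""
    return [
        c for c in candidates
        if all(edit_distance(w, correct) > max_dist for w in c.split())
    ]


sentence = "In fishes the heart is a folded tube."
print(make_blank(sentence, "heart"))
print(filter_distractors("heart", ["hearts", "liver", "kidney", "Heart"]))
```

Here "hearts" and "Heart" are rejected because they are one edit away from "heart", while unrelated words pass through. The comparison is case-sensitive, so a capitalized form of a longer answer can still slip past the threshold.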

In [None]:
!pip install keybert transformers


In [None]:
from keybert import KeyBERT
import nltk.data
import nltk
import gensim
import gensim.downloader
from transformers import BartForConditionalGeneration, BartTokenizer
nltk.download('punkt')
w2v = gensim.downloader.load('word2vec-google-news-300')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.




In [None]:
from dataclasses import dataclass
from typing import List, Tuple, Generator
from random import shuffle
import string
from copy import copy
from collections import defaultdict

@dataclass
class Question:
  """ A container to store question and answers. """
  question: str
  answers: List[str]
  original: str

  @property
  def correct_answer(self):
    return self.answers[0]

  @property
  def wrong_answers(self):
    return self.answers[1:]

  def __clear_fake_ans(self, fake):
    """ Drops candidates that are likely another form of the correct answer. """
    fake = [ans.replace('_', ' ') for ans in fake]  # w2v joins phrases with '_'
    cleared = []
    for ans in fake:
      # Skip the candidate if any of its words is within 2 edits of the answer
      if any(nltk.edit_distance(w, self.correct_answer) <= 2 for w in ans.split()):
        continue
      cleared.append(ans)
    return cleared

  def __str__(self) -> str:
    out = [self.question]
    all_ans = copy(self.answers)
    shuffle(all_ans)  # Randomize the position of the correct answer
    correct = ""
    for letter, ans in zip(string.ascii_uppercase, all_ans):
      formatted = f"{letter}) {ans}"
      out.append(formatted)
      if self.correct_answer == ans:
        correct = formatted
    out.append(f"Answer: {correct}")
    return "\n".join(out)

  def add_fake_answers(self, vectors, cnt=3):
    """ Adds fake answers based on gensim w2v. """
    buffer = 20  # Fetch extra neighbours because some are filtered out
    fake = [t[0] for t in vectors.most_similar(self.correct_answer, topn=cnt+buffer)]
    self.answers += self.__clear_fake_ans(fake)[:cnt]

In [None]:
nltk_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
kw_model = KeyBERT()
bart_tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
bart = BartForConditionalGeneration.from_pretrained("facebook/bart-base")


In [None]:
def get_keywords(words_cnt, doc):
  return [t[0] for t in 
          kw_model.extract_keywords(doc, top_n=words_cnt, stop_words='english')]


def to_sentences(doc):
  return nltk_tokenizer.tokenize(doc)


def remove_keyword(kw, words, substitute="________"):
  return " ".join([word if kw not in word.lower() else 
                   word.lower().replace(kw, substitute) for word in words])


def score_question(doc_tokens, qst: Question):
  labels = bart_tokenizer(qst.original, return_tensors="pt", truncation=True).input_ids
  loss = bart(input_ids=doc_tokens, labels=labels).loss
  return loss.item()

def get_best_questions(doc, qst_for_ranking, top_n):
  doc_tokens = bart_tokenizer(doc, return_tensors="pt", truncation=True).input_ids
  out = []
  for qst in qst_for_ranking:
    out.append((score_question(doc_tokens, qst), qst))
  out.sort(key=lambda t: t[0])  # Sort by loss only; Question objects are not orderable
  print(out)
  return [t[1] for t in out[:top_n]]


def remove_used_sentences(to_remove: List[Tuple[str, str]], 
                          sentences: List[str], sentences_lowered: List[str]):
  for tr in to_remove:
    sentences.remove(tr[0])
    sentences_lowered.remove(tr[1])
  return sentences, sentences_lowered

      
def get_questions(doc, keywords, vectors, ans_cnt=4, top_n=5):
  out: List[Question] = []
  sentences: List[str] = to_sentences(doc)
  sentences_lowered: List[str] = [st.lower() for st in sentences]
  to_remove = []
  for kw in keywords:
    if len(to_remove) > 0:
      sentences, sentences_lowered = remove_used_sentences(to_remove, sentences,
                                                           sentences_lowered)
      to_remove = []
    for st, st_l in zip(sentences, sentences_lowered):
      words_l = st_l.split()
      if kw in words_l:
        words = st.split()
        idx = words_l.index(kw)
        qst: Question = Question(remove_keyword(kw, words), [words[idx]], st)
        try:
          qst.add_fake_answers(vectors, cnt=ans_cnt - 1)
          out.append(qst)
          to_remove.append((st, st_l))  # Prevent one sentence from appearing repeatedly
        except KeyError:
          print(f"{kw} not present in vectors vocabulary, skipping...")
  if top_n < len(out):
    out = get_best_questions(doc, out, top_n)
  return out
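
The ranking step keeps the `top_n` candidates with the lowest loss. A minimal sketch of that selection, with made-up numbers standing in for the BART loss; `keep_best` is a hypothetical helper, not part of the notebook:

```python
from heapq import nsmallest
from typing import Callable, List, TypeVar

T = TypeVar("T")


def keep_best(items: List[T], score: Callable[[T], float], top_n: int) -> List[T]:
    """Return the top_n items with the lowest score, best first."""
    scored = [(score(it), it) for it in items]
    # key= means only the score is compared, so ties never compare the items
    return [it for _, it in nsmallest(top_n, scored, key=lambda t: t[0])]


# Hypothetical losses standing in for the BART loss of each candidate question.
losses = {"q1": 1.16, "q2": 0.56, "q3": 2.40}
print(keep_best(list(losses), losses.get, top_n=2))
```

Comparing on the score alone matters because two candidates can receive the same loss, and plain tuple comparison would then fall back to comparing the question objects themselves.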

## Examples

### Short example

In [None]:
heart_doc = """heart, organ that serves as a pump to circulate the blood. It may be a straight tube, as in spiders and annelid worms, or a somewhat more elaborate structure with one or more receiving chambers (atria) and a main pumping chamber (ventricle), as in mollusks. In fishes the heart is a folded tube, with three or four enlarged areas that correspond to the chambers in the mammalian heart. In animals with lungs—amphibians, reptiles, birds, and mammals—the heart shows various stages of evolution from a single to a double pump that circulates blood (1) to the lungs and (2) to the body as a whole."""

In [None]:
kws = get_keywords(8, heart_doc)
for qst in get_questions(heart_doc, kws, w2v, top_n=5):
  print(qst)

heart, ________ that serves as a pump to circulate the blood.
A) organ
B) Aeolian Skinner
C) Wurlitzer pipe
D) Casavant pipe
Answer: A) organ
In animals with ________—amphibians, reptiles, birds, and mammals—the heart shows various stages of evolution from a single to a double pump that circulates blood (1) to the ________ and (2) to the body as a whole.
A) lungs
B) chest cavity
C) intestines
D) bronchial tubes
Answer: A) lungs
In fishes the ________ is a folded tube, with three or four enlarged areas that correspond to the chambers in the mammalian ________.
A) heartbeat
B) myocardial infraction
C) heart
D) cardiac
Answer: C) heart
It may be a straight tube, as in spiders and annelid worms, or a somewhat more elaborate structure with one or more receiving ________s (atria) and a main pumping ________ (ventricle), as in mollusks.
A) commerce Mohammad Nahavandian
B) chamber
C) Mohammad Qurban
D) commerces
Answer: B) chamber


### Long example

In [None]:
fusion_doc = """The release of energy with the fusion of light elements is due to the interplay of two opposing forces: the nuclear force, a manifestation of the strong interaction, which holds protons and neutrons tightly together in the atomic nucleus; and the Coulomb force, which causes positively charged protons in the nucleus to repel each other.[15] Lighter nuclei (nuclei smaller than iron and nickel) are sufficiently small and proton-poor to allow the nuclear force to overcome the Coulomb force. This is because the nucleus is sufficiently small that all nucleons feel the short-range attractive force at least as strongly as they feel the infinite-range Coulomb repulsion. Building up nuclei from lighter nuclei by fusion releases the extra energy from the net attraction of particles. For larger nuclei, however, no energy is released, because the nuclear force is short-range and cannot act across larger nuclei.

Fusion powers stars and produces virtually all elements in a process called nucleosynthesis. The Sun is a main-sequence star, and, as such, generates its energy by nuclear fusion of hydrogen nuclei into helium. In its core, the Sun fuses 620 million metric tons of hydrogen and makes 616 million metric tons of helium each second. The fusion of lighter elements in stars releases energy and the mass that always accompanies it. For example, in the fusion of two hydrogen nuclei to form helium, 0.645% of the mass is carried away in the form of kinetic energy of an alpha particle or other forms of energy, such as electromagnetic radiation."""

In [None]:
kws = get_keywords(15, fusion_doc)
for qst in get_questions(fusion_doc, kws, w2v, top_n=5):
  print(qst)

[(0.5604118704795837, Question(question='The release of energy with the ________ of light elements is due to the interplay of two opposing forces: the nuclear force, a manifestation of the strong interaction, which holds protons and neutrons tightly together in the atomic nucleus; and the Coulomb force, which causes positively charged protons in the nucleus to repel each other.', answers=['fusion', 'Asghar Sediqzadeh', 'TMPRSS2 ERG gene', 'Atrion specializes'], original='The release of energy with the fusion of light elements is due to the interplay of two opposing forces: the nuclear force, a manifestation of the strong interaction, which holds protons and neutrons tightly together in the atomic nucleus; and the Coulomb force, which causes positively charged protons in the nucleus to repel each other.')), (1.1614750623703003, Question(question='[15] Lighter ________ (________ smaller than iron and nickel) are sufficiently small and proton-poor to allow the nuclear force to overcome th

## Next steps
- Create a seq2seq model to reformulate sentences into questions. Right now, questions are formulated as sentences.
- Try fine-tuning a transformer for conditional generation that takes the document and the answer (keyword) and produces a question.
- w2v sometimes produces very bad answers. Try generating false answers with BERT using [MASK]. Again, fine-tuning is needed.

I have already tried the second and third ideas; however, I don't have the computational and time capacity to make the fine-tuning work.
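
As a rough illustration of the third idea, the sketch below turns the blank into a `[MASK]` token and collects the model's predictions that differ from the correct answer. It is only a sketch under assumptions: `mask_distractors` and `fake_fill_mask` are hypothetical names, the scores are made up, and the model is passed in as a plain callable so the example runs offline. Without fine-tuning, a real masked language model tends to propose words that also fit the blank, which is exactly the false-negative problem described above.

```python
from typing import Callable, Dict, List


def mask_distractors(question: str, answer: str,
                     fill_mask: Callable[[str], List[Dict]],
                     cnt: int = 3, blank: str = "________",
                     mask_token: str = "[MASK]") -> List[str]:
    """Return up to cnt predictions for the blank that differ from the answer."""
    # Only the first blank is masked, so the model sees a single mask token.
    masked = question.replace(blank, mask_token, 1)
    out: List[str] = []
    for pred in fill_mask(masked):
        token = pred["token_str"].strip()
        # Skip the correct answer, simple variants containing it, and repeats
        if answer.lower() in token.lower() or token in out:
            continue
        out.append(token)
        if len(out) == cnt:
            break
    return out


# Stand-in for a real model so the sketch runs offline (hypothetical scores).
def fake_fill_mask(text: str) -> List[Dict]:
    assert "[MASK]" in text
    return [{"token_str": t, "score": s} for t, s in
            [("heart", 0.5), ("liver", 0.2), ("hearts", 0.1),
             ("stomach", 0.1), ("brain", 0.05)]]


question = "In fishes the ________ is a folded tube."
print(mask_distractors(question, "heart", fake_fill_mask))
```

With the `transformers` library, the stand-in could be swapped for a real model, for example `fill = pipeline("fill-mask", model="bert-base-uncased")` passed in as `lambda s: fill(s, top_k=20)`; the choice of `bert-base-uncased` and `top_k=20` is an assumption, not something tested here.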

Howgh!