**Generating Distractors for the SQuAD 2.0 Dataset** [Paper Repository](https://github.com/voidful/BDG)

In [1]:
%%capture
!pip install nlp2go transformers git+https://github.com/Maluuba/nlg-eval.git@master

In [2]:
from nlgeval import NLGEval

nlgeval = NLGEval(
    metrics_to_omit=['METEOR', 'EmbeddingAverageCosineSimilairty', 'SkipThoughtCS', 'VectorExtremaCosineSimilarity',
                     'GreedyMatchingScore', 'CIDEr'])

In [3]:
%%capture
!wget https://github.com/voidful/BDG/releases/download/v2.0/BDG.pt
!wget https://github.com/voidful/BDG/releases/download/v2.0/BDG_ANPM.pt
!wget https://github.com/voidful/BDG/releases/download/v2.0/BDG_PM.pt

In [6]:
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [7]:
import itertools as it

import nlp2go
import torch
from torch.distributions import Categorical
from transformers import RobertaForMultipleChoice, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("LIAMF-USP/roberta-large-finetuned-race")
model = RobertaForMultipleChoice.from_pretrained("LIAMF-USP/roberta-large-finetuned-race")
model.eval()
model.to(device)

dg_model = nlp2go.Model('./BDG.pt')
dg_model_pm = nlp2go.Model('./BDG_PM.pt')
dg_model_both = nlp2go.Model('./BDG_ANPM.pt')

===model info===
model_config facebook/bart-base
tags ['seq2seq_0']
type ['seq2seq']
maxlen 1024
epoch 8
loading saved model


===ADD TOKEN===
We have added 0 tokens
finish loading
loaded model predict_parameter {}
===model info===
model_config facebook/bart-base
tags ['seq2seq_0']
type ['seq2seq']
maxlen 1024
epoch 10
loading saved model
===ADD TOKEN===
We have added 0 tokens
finish loading
loaded model predict_parameter {}
===model info===
model_config facebook/bart-base
tags ['seq2seq_0']
type ['seq2seq']
maxlen 1024
epoch 8
loading saved model
===ADD TOKEN===
We have added 0 tokens
finish loading
loaded model predict_parameter {}


In [None]:
# `context`, `question`, and `answer` must already be defined as strings
d_input = context + '</s>' + question + '</s>' + answer
choices = dg_model.predict(d_input, decodenum=3)['result']
choices_pm = dg_model_pm.predict(d_input, decodenum=3)['result']
choices_both = dg_model_both.predict(d_input, decodenum=3)['result']
all_options = choices + choices_pm + choices_both
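
The input format and candidate pooling used above can be sketched without loading any model. This is a minimal illustration, not the authors' code: `build_dg_input` and `pool_candidates` are hypothetical helper names, and dropping exact duplicates while pooling is an added assumption (the cell above simply concatenates the three lists).

```python
import itertools as it

SEP = "</s>"  # separator string used in the cell above


def build_dg_input(context: str, question: str, answer: str) -> str:
    """Join the three fields in the order used above (hypothetical helper)."""
    return context + SEP + question + SEP + answer


def pool_candidates(*candidate_lists):
    """Concatenate candidate lists, keeping order and dropping exact duplicates.

    De-duplication is an assumption added for illustration; the original
    cell keeps every candidate.
    """
    seen, pooled = set(), []
    for candidates in candidate_lists:
        for cand in candidates:
            if cand not in seen:
                seen.add(cand)
                pooled.append(cand)
    return pooled


d_input = build_dg_input("Some passage.", "What is asked?", "the answer")
pooled = pool_candidates(["a", "b", "c"], ["b", "d", "e"], ["f", "g", "a"])
# Number of 3-distractor combinations a later selection step would have to score
n_triples = sum(1 for _ in it.combinations(pooled, 3))
print(d_input)
print(len(pooled), n_triples)
```

Pooling fewer unique candidates reduces the number of triples to score, which matters because each surviving triple costs one forward pass of the reader model.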

In [None]:
def selection(context, question, answer, all_options):
    # Best [entropy, options] pair found so far
    max_combin = [0, []]
    for combin in set(it.combinations(all_options, 3)):
        options = list(combin) + [answer]
        keep = True
        # Reject the combination if any two options overlap too much (Bleu_1 > 0.5)
        for i in set(it.combinations(options, 2)):
            # Surround every non-letter character with spaces so it becomes its own token
            a = "".join([char if char.isalpha() or char == " " else " " + char + " " for char in i[0]])
            b = "".join([char if char.isalpha() or char == " " else " " + char + " " for char in i[1]])
            metrics_dict = nlgeval.compute_individual_metrics([a], b)
            if metrics_dict['Bleu_1'] > 0.5:
                keep = False
                break
        if keep:
            prompt = context + tokenizer.sep_token + question
            encoding_input = []
            for choice in options:
                encoding_input.append([prompt, choice])
            encoding_input.append([prompt, answer])
            labels = torch.tensor(len(options) - 1).unsqueeze(0)
            encoding = tokenizer(encoding_input, return_tensors='pt', padding=True, truncation='only_first')
            # Use the `device` selected earlier instead of hard-coding 'cuda'
            outputs = model(**{k: v.unsqueeze(0).to(device) for k, v in encoding.items()},
                            labels=labels.to(device))  # batch size is 1
            # Higher entropy: the reader model is less certain which option is correct
            entropy = Categorical(probs=torch.softmax(outputs.logits, -1)).entropy().tolist()[0]
            if entropy >= max_combin[0]:
                max_combin = [entropy, options]
    # Drop the answer (last element) and return only the distractors
    return max_combin[1][:-1]
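
The overlap filter in `selection` can be illustrated with a simplified stand-in for `Bleu_1`. The sketch below reuses the same punctuation-spacing expression and then computes clipped unigram precision. It is an approximation for illustration only: nlg-eval's BLEU implementation may differ (for example through a brevity penalty), and `space_punct`, `unigram_precision`, and `too_similar` are hypothetical names.

```python
from collections import Counter


def space_punct(text: str) -> str:
    """Surround every non-letter, non-space character with spaces."""
    return "".join(ch if ch.isalpha() or ch == " " else " " + ch + " " for ch in text)


def unigram_precision(reference: str, hypothesis: str) -> float:
    """Share of hypothesis tokens that also occur in the reference (clipped counts)."""
    ref_counts = Counter(space_punct(reference).split())
    hyp_counts = Counter(space_punct(hypothesis).split())
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[tok]) for tok, count in hyp_counts.items())
    return matched / total


def too_similar(a: str, b: str, threshold: float = 0.5) -> bool:
    """Mirror the `> 0.5` check in `selection`, using the simplified score."""
    return unigram_precision(a, b) > threshold


first = "to make a choice when we shop"
second = "to know how to make a good choice"
score = unigram_precision(first, second)
print(score, too_similar(first, second))
```

Because the comparison is strict, a pair that scores exactly at the threshold is kept rather than rejected.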

In [None]:
selection(context, question, answer, all_options)

['to make a choice when we shop',
 'to know how to use money wisely',
 'to know how to make a good choice']
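
The ranking criterion in `selection` is the entropy of the reader model's softmax distribution over the options. A minimal pure-Python sketch of that quantity follows, assuming natural-log entropy. The logits are made-up numbers chosen only to contrast a confident prediction with an uncertain one; they are not model outputs.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Illustrative logits only: one option dominates vs. all options look alike
confident = entropy(softmax([8.0, 0.0, 0.0, 0.0]))
confused = entropy(softmax([1.0, 0.9, 1.1, 1.0]))
print(confident < confused)
```

Under this criterion, a set of distractors that leaves the reader model close to uniform over the options scores higher than one where the correct answer stands out.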

Running the model on the SQuAD dataset to generate distractors

In [8]:
import pandas as pd


In [9]:
# Load the SQuAD dataset into a DataFrame
squad = pd.read_csv('SQuAD.csv')

In [16]:
# Select a subset of the SQuAD dataset (rows 400 to 499)
subset = squad[400:500]

In [17]:
subset

Unnamed: 0,context,question,id,answer_start,text,Answer_length
400,The Arctic tern holds the long-distance migrat...,What is an example of a shorter migration?,5705e9bf52bb891400689696,429,altitudinal migrations on mountains such as th...,10
401,Within a species not all populations may be mi...,What is leap frog migration?,5705fa3075f01819005e7810,604,birds that nest at higher latitudes spend the ...,12
402,"Nocturnal migrants minimize predation, avoid o...",How do nocturnal migrants compensate for loss ...,570688d052bb891400689a50,141,Migrants may be able to alter their quality of...,10
403,Species that have no long-distance migratory r...,What are the waxwings Bombycilla moving in res...,5706910552bb891400689a67,127,winter weather and the loss of their usual win...,10
404,Sometimes circumstances such as a good breedin...,For what reason would birds mor far beyond the...,5706a3b152bb891400689afe,32,a good breeding season followed by a food sour...,10
...,...,...,...,...,...,...
495,Endemic species can be threatened with extinct...,What causes genetic pollution?,570bd233ec8fbc190045bb2a,250,either a numerical and/or fitness advantage of...,10
496,The fossil record suggests that the last few m...,Why are some scientists uncertain about the fo...,570bd3b26b8089140040fa71,186,how strongly the fossil record is biased by th...,17
497,Biodiversity's relevance to human health is be...,What changes in biodiversity have an effect on...,570bd5e26b8089140040fa7a,344,changes in populations and distribution of dis...,19
498,"Since life began on Earth, five major mass ext...",What happened in the Carboniferous?,570bd6f56b8089140040fa82,454,rainforest collapse led to a great loss of pla...,12


In [19]:
# Generate one distractor per question with the BDG_ANPM model
import time

distractors = []
start = time.time()
for i in range(len(subset)):
    row = subset.iloc[i]
    d_input = row.context + '</s>' + row.question + '</s>' + row.text
    # Decode two candidates so there is a fallback if the first one equals the answer
    choices_both = dg_model_both.predict(d_input, decodenum=2)['result']
    if row.text != choices_both[0]:
        distractors.append(choices_both[0])
    else:
        distractors.append(choices_both[1])
print(f"Total time taken to generate {len(distractors)} distractors: {time.time() - start}")

Total time taken to generate 100 distractors: 853.0828821659088
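
The fallback logic in the loop (take the first candidate unless it equals the answer) can be generalized to any number of candidates. The sketch below is one possible variant, not the notebook's method: `pick_distractor` is a hypothetical helper, and comparing case-insensitively after stripping whitespace is an added assumption (the loop above uses an exact string comparison).

```python
def pick_distractor(candidates, answer):
    """Return the first candidate that differs from the answer, or None.

    Case-insensitive, whitespace-stripped comparison is an assumption made
    for this sketch; the original loop compares strings exactly.
    """
    normalized_answer = answer.strip().lower()
    for cand in candidates:
        if cand.strip().lower() != normalized_answer:
            return cand
    return None


print(pick_distractor(["The answer ", "something else"], "the answer"))
print(pick_distractor(["the answer"], "the answer"))
```

Returning `None` when every candidate matches the answer makes the failure case explicit instead of raising an `IndexError`.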


In [14]:
# Work on an explicit copy so the assignment does not trigger pandas' SettingWithCopyWarning
subset = subset.copy()
subset["Distractor"] = distractors


In [15]:
for i in range(len(subset)):
  print("Question :"+subset.iloc[i].question +"\n"+" Answer : "+ subset.iloc[i].text +"\n"+"Distractor :" +subset.iloc[i].Distractor)

Question :What does a user need to do to transfer a Live account to the new system?
 Answer : users need to link a Windows Live ID to their gamertag on Xbox.com
Distractor :parents needn't change their email account
Question :What was the initial US strategy in the War of 1812?
 Answer : invading British Canada, hoping to use captured territory as a bargaining chip
Distractor : invasion of the New York, too
Question :Why did Coalition nations fear the removal of Hussein from power?
 Answer : it would create a power vacuum and destabilize the region
Distractor :it was unnecessary for Iraq to be removed from power
Question :What advantage did Washington have over the British generals?
 Answer : he had a better idea of how to win the war than they did
Distractor :he was more experienced than the British general
Question :How did the war effect both sides?
 Answer : the material and personnel of the South were used up, while the North prospered
Distractor :the South and the South were not 

In [None]:
subset.to_csv('subset.csv')