# [1] Distractor generation for multiple choice questions:

## [2] Importing important libraries:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # needed for plt.style.use() in the settings cell
import seaborn as sns
from tqdm import tqdm

from collections import defaultdict
from collections import Counter 
from scipy.sparse import hstack
import operator

import re
import pickle

import warnings
warnings.filterwarnings("ignore")

import gc #Garbage collector.

<h4> Settings for Pandas, Seaborn </h4>

In [2]:
tqdm.pandas()
plt.style.use('ggplot')
#%config InlineBackend.figure_format = 'retina'
sns.set(rc={'figure.figsize':(10, 5)})

In [3]:
pd.options.display.max_columns=50
pd.options.display.max_rows=100

<h2> Problem Statement </h2>
<h3> Generating Distractors for questions </h3>

MCQs are a widely used question format for general assessment of candidates' domain knowledge. Most MCQs are created as paragraph-based questions.

A paragraph or code snippet forms the base of such questions. Each question comes with three or four options, of which one is the correct answer. The remaining options are called distractors: they are close to the correct answer but are not correct.

You are provided with a training dataset of questions, answers, and distractors to build and train an **NLP** model. The **test** dataset contains questions and answers. You are required to use your NLP model to create up to **three** distractors for each question-answer combination provided in the test data.

<h3> Data Description: </h3>

- Train.csv 31499 * 3 (excluding headers)
- Test.csv 13500 * 2 (excluding headers)

|Column|Description|
|------|-----------|
|Question|Question of MCQs|
|Answer|Correct answer of the corresponding question|
|Distractor|Options that are closely related to the correct answer but are not the correct answer|

<h3> Submission format: </h3>

- Your task is to create up to **three** distractors for each question-answer combination provided in the **Result.csv** file.
- Each distractor must be placed between quotes (' ' or " ") and distractors must be separated by a comma (,).
- This format can be observed in the training dataset.

**Note:** To avoid evaluation errors, update the values under the **'distractor'** column of the provided **Results.csv** file so that they comply with the expected submission format.
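
The format rules above can be sketched as a small helper. This is only an illustration: `quote` and `format_distractors` are hypothetical names, and the choice to switch to double quotes when a distractor contains an apostrophe is an assumption based on how such rows appear in the training data.

```python
def quote(distractor):
    """Wrap one distractor in quotes, preferring single quotes."""
    text = distractor.strip()
    # Assumption: use double quotes when the text itself contains an apostrophe.
    if "'" in text:
        return '"' + text + '"'
    return "'" + text + "'"


def format_distractors(distractors, limit=3):
    """Join up to `limit` distractors into one comma-separated cell."""
    return ",".join(quote(d) for d in distractors[:limit])


print(format_distractors(["Barcelona", "Athens", "Istanbul"]))
```

A distractor that contains both kinds of quote is not handled by this sketch and would need escaping rules that the problem statement does not specify.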

**Sample_submissions**

|question|answer_text|Distractors|
|--------|-----------|-----------|
|Which city's landmarks include: The Pantheon, The Spanish Steps and Trevi Fountain?<br>- The Pantheon<br>- The Spanish Steps<br>- Trevi Fountain|Rome|'Barcelona','Athens','Istanbul'|
|What is the color of Donald Duck's bowtie?|Red|'Blue','Yellow','green'|
|What is a Web browser?|A software program that allows you to access sites on the World Wide Web|'A kind of spider','A computer that stores WWW files','A person who likes to look at websites'|


<h3> Evaluation criteria </h3>

For each question-answer combination, the distractors are converted to a **counter** vector of words and the **cosine similarity** is calculated between the predicted and the actual value:

`score = 100 * mean([cosine_similarity(text_to_vector(actual_value), text_to_vector(predicted_value))])`
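
A minimal sketch of how this score could be computed locally is given below. The function names `text_to_vector` and `cosine_similarity` come from the formula; their bodies, the `score` wrapper and the `\w+` tokenizer are assumptions, since the official evaluation code is not part of the problem statement.

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")  # assumed tokenizer; the official one is not given


def text_to_vector(text):
    """Turn a string into a bag-of-words Counter."""
    return Counter(WORD.findall(text))


def cosine_similarity(vec1, vec2):
    """Cosine similarity between two Counter vectors."""
    common = set(vec1) & set(vec2)
    numerator = sum(vec1[w] * vec2[w] for w in common)
    norm1 = math.sqrt(sum(c * c for c in vec1.values()))
    norm2 = math.sqrt(sum(c * c for c in vec2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty vector is treated as having no similarity
    return numerator / (norm1 * norm2)


def score(actual_values, predicted_values):
    """100 * mean cosine similarity over all question-answer pairs."""
    sims = [cosine_similarity(text_to_vector(a), text_to_vector(p))
            for a, p in zip(actual_values, predicted_values)]
    return 100 * sum(sims) / len(sims)
```

Because the metric only counts overlapping words, a prediction is rewarded for sharing vocabulary with the actual distractors, regardless of word order.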

======================================================================================================================

<h2> Approach: </h2>

Developing distractors is one of the most challenging and time-consuming tasks for a question-setter.

<h4> Different ways for Generating distractor: </h4>

1. Distractors can be generated by training a generative model that learns the context/concept from the passage (if provided). The model generates text from the passage that is close to the answer but not exactly the same. However, the grammar and POS of the generated text may be badly off, which is undesirable.
2. Text can be generated by conditioning only on the left context of the right-most word, for example by training a GPT-2 model. But without long sequences of text it is difficult to train a model that generates grammatically correct sentences.
3. If we have a collection of potential distractors, or "candidate distractors", we can choose the distractors most similar to the question-answer pair. The idea is to produce semantically meaningful sentence embeddings for the question-answer pairs and for the distractors.

Here, I have chosen the 3rd approach, since we need grammatically correct distractors and the candidates are existing distractors from the training set. If the passages were provided, it would also be possible to generate distractors that are close to the context of the passage without compromising too much on their grammatical correctness, by leveraging pre-trained bidirectional transformer architectures such as BERT.

In [4]:
train = pd.read_csv('Train.csv')
test = pd.read_csv('Test.csv')

print("Number of data points in Train:",train.shape[0])
print("Number of data points in Test:",test.shape[0])

print(train.info())
print(test.info())

Number of data points in Train: 31499
Number of data points in Test: 13500
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31499 entries, 0 to 31498
Data columns (total 3 columns):
question       31499 non-null object
answer_text    31499 non-null object
distractor     31499 non-null object
dtypes: object(3)
memory usage: 738.3+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13500 entries, 0 to 13499
Data columns (total 5 columns):
question       13500 non-null object
answer_text    13500 non-null object
distractor     13500 non-null object
distractor2    13500 non-null object
distractors    13500 non-null object
dtypes: object(5)
memory usage: 527.4+ KB
None


## [4] Exploratory Data Analysis:

In [5]:
#Checking for duplicates.
print(train.duplicated(keep = False).any())
print(test.duplicated(keep = False).any())

False
False


In [10]:
train.tail(20)

Unnamed: 0,question,answer_text,distractor
31479,The trucking company 's lawyer insisted Farmer...,because he believed the answer should be in hi...,'because he knew Farmer Joe was not hurt in th...
31480,The participant of the research,may work in a private office or in a cubicle,'come from the university of Michigan - Flint'...
31481,Which of the following models should you choos...,Sony Ericsson 's W800i and Sumsung 's SPH - V5400,"""Sumsung 's SPH - V5400 and Nokia 's 6620"", ""S..."
31482,Which of the following is WRONG according to t...,There are no secrets in the book Duke of Zhou ...,'Duke of Zhou Interprets Dreams is a book abou...
31483,Daniel Goleman believes that,people not having high EQ may not be successfu...,'one can be just as successful without having ...
31484,"According to the passage , what is the main di...",Ritual has a religious purpose and drama does ...,"'Ritual uses music whereas drama does not .', ..."
31485,"According to Airbus , which of the following i...","Creating less pollution , having less weight","'Having more floor space ', ' creating less po..."
31486,"According to the passage , a dog will be most ...",familiar humans,"'familiar dogs', 'a human the dog had never me..."
31487,Which one is the best title ?,How to be a good reader .,"'How to read quickly .', 'Reading slowly is be..."
31488,The destruction of Agadir is an example of,faulty building construction,"""an earthquake 's strength"", 'widespread panic..."


- We can observe that many questions seem to be taken from a passage, but the passage is not provided.

In [11]:
test.head(20)

Unnamed: 0,question,answer_text
0,What 'S the main idea of the text ?,The lack of career -- based courses in US high...
1,"In the summer high season , Finland does nt se...",the sun is out at night
2,If you want to apply for Chinese Business Inte...,have to get confirmed at least twice
3,"That afternoon , the boy 's clothes were dry b...",nobody made room for him in the water .
4,Which of the following statements is NOT true ?,There are twelve countries in the World Wildli...
5,"The problem of "" lock - in "" can be dangerous ...",it may make it difficult for customers to reco...
6,The passage is mainly about,how Billy made blueberry juice with his uncle
7,Which of the following in not true ?,The snail 's teeth ca n't be worn out ..
8,What should you do at mealtime ?,Eat the food your host family gives you .
9,"The part of "" Only you can make a card like th...",how to make a meaningful DIY card


In [65]:
for row in range(100,200,1):
    print("Question-> ",train.iloc[row,0])
    print("Correct answer-> ",train.iloc[row,1])
    print("Distractors-> ",train.iloc[row,2])
    print("="*100)

Question->  Playing computer games
Correct answer->  makes us more clever
Distractors->  'makes our studies worse'
Question->  By visiting the website you can learn more information about
Correct answer->  Blue Ridge Folklike Festival
Distractors->  'Boones Mill Apple Festival', 'Mountain Magic Fall Festival', 'Vinton Fall Festival'
Question->  Women tend to go up the floor until they reach the top floor because
Correct answer->  they think even better husbands may be upstairs
Distractors->  'they think the husbands downstairs are not suitable', 'they are sure that the best husbands are on the top floor'
Question->  When the moon shines , it is during the
Correct answer->  night
Distractors->  'day'
Question->  The writer 's new level of sleep deprivation began since he
Correct answer->  began to teach in a school
Distractors->  'went 36 hours with no sleep'
Question->  What did the Economic Mobility Project find in its research ?
Correct answer->  The rags to riches story is more fict

Distractors->  'Why paving roads reduces our water', 'how much we depend on water to live'
Question->  From this passage , we can know that
Correct answer->  parents should try their best to keep away from spanking their children
Distractors->  'good children always do the right thing', 'parents have to spank the children when they do something wrong', 'children can have more than one choice not to be spanked'
Question->  Linda Waite 's studies support the idea that
Correct answer->  Marriage can help make up for ill health
Distractors->  'older men should quit smoking to stay healthy', 'Unmarried people are likely to suffer in later life'
Question->  David Trezequet lost the game to Italy in
Correct answer->  the 2006 World Cup in Germany
Distractors->  'the European Cup', 'the French World Cup in 1998'
Question->  The passage is mainly about   
Correct answer->  the changes of fashions
Distractors->  'the money in fashion'
Question->  Mr. Davis left the author with a very deep impres

Distractors->  'Sense ability and food tastes .', 'Olfactory genes and its system .'
Question->  What is the best title of the passage ?
Correct answer->  Love or money ?
Distractors->  'Material wealth can guarantee happiness', 'Studies on sudden wealth'
Question->  According to the text , which of the following has the longest history ?
Correct answer->  San Francisco 's Chinatown .
Distractors->  "London 's Chinatown .", "Bangkok 's Chinatown .", "Mauritius 's Chinatown ."
Question->  The author stopped his work immediately because he
Correct answer->  decided to drive home for fishing tools and take the old man fishing
Distractors->  'needed to go and buy a big bobber and some worms he needed', 'had got to head home to deal with some urgent situations'


In [66]:
for row in range(100,200,1):
    print("Question-> ",test.iloc[row,0])
    print("Correct answer-> ",test.iloc[row,1])
    print("="*100)

Question->  Why did most of Chinese people believe to eat eggplant and gram to be healthy ?
Correct answer->  Because the people think Zhang Wuben is a famous nutritionist and doctor .
Question->  Which of the following is NOT TRUE according to the passage ?
Correct answer->  Mrs. Anderson stole Charlie 's car at the request of her daughter .
Question->  What is the topic of the talk in November ?
Correct answer->  The work of the Thames Ironworks Heritage Trust .
Question->  What do we know about space food from the text ?
Correct answer->  Astronauts could eat apples in space in the past .
Question->  The reading is mainly about
Correct answer->  three friends at a school
Question->  We can learn from the text that business English
Correct answer->  pays more attention to the forms of expressions
Question->  What can we infer from the passage ?
Correct answer->  Friedman has got some supporters for his project .
Question->  Which   is TRUE according to the article   ?
Correct answer-

Correct answer->  The thief was caught two days later .
Question->  Which of the following may be a suitable title for the story ?
Correct answer->  Civilized Dog Raising
Question->  The writer
Correct answer->  is an American
Question->  Which is TRUE about Brazil ?
Correct answer->  Soccer and samba are two symbols of the country .
Question->  What worried the Craigs ?
Correct answer->  There was no cheetah mother to teach the cubs .
Question->  Why was life able to develop on the earth but not on other planets ?
Correct answer->  The water stayed on the earth but not on other planets .
Question->  What is the passage mainly about ?
Correct answer->  It instructs us how to water trees correctly
Question->  What are guests supposed to do from 9:00 a. m. to 10:30 a. m. ?
Correct answer->  To take a tour of the Thomas College campus .
Question->  What did Mr. and Mrs. Gordon do when Sandra was carried out to sea by the wave ?
Correct answer->  Perhaps they were reading magazines .
Quest

In [8]:
train['distractor'][1].split(', ')

["'If some tragedies occur again '",
 "' relevant departments of the State Council should take responsibility completely .'",
 "'Currently '",
 "' the central government has established sound management systems to guarantee school bus safety .'",
 "'Central and local governments will share the costs on building more modern schools in rural areas .'"]

In [6]:
train['num_distractors'] = train['distractor'].map(lambda x: len(x.split(',')))

In [7]:
train.head()

Unnamed: 0,question,answer_text,distractor,num_distractors
0,Meals can be served,in rooms at 9:00 p. m.,"'outside the room at 3:00 p. m.', 'in the dini...",3
1,It can be inferred from the passage that,The local government can deal with the problem...,"'If some tragedies occur again ', ' relevant d...",5
2,The author called Tommy 's parents in order to,help them realize their influence on Tommy,"'blame Tommy for his failing grades', 'blame T...",2
3,It can be inferred from the passage that,the writer is not very willing to use idioms,'idioms are the most important part in a langu...,3
4,How can we deal with snake wounds according to...,Stay calm and do n't move .,'Cut the wound and suck the poison out .',1


- Some questions have more than 3 distractors and some questions have only 1 distractor.
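
The count above relies on `split(',')`, which also cuts inside any distractor whose quoted text contains a comma. A more careful parse is sketched below, assuming each cell is a comma-separated sequence of Python-style quoted strings; `parse_distractors` is a hypothetical helper, and the fallback branch is a guess at how to treat malformed cells.

```python
import ast


def parse_distractors(cell):
    """Split one distractor cell into a list of distractor strings."""
    try:
        parsed = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        # Fallback for cells that are not valid Python literals:
        # naive comma split, then strip whitespace and surrounding quotes.
        return [part.strip().strip("'\"") for part in cell.split(",")]
    if isinstance(parsed, (tuple, list)):
        return [str(item) for item in parsed]
    return [str(parsed)]  # a single quoted distractor


print(parse_distractors("'the European Cup', 'the French World Cup in 1998'"))
```

With the notebook's data loaded, `train['distractor'].map(parse_distractors).map(len).value_counts()` would give the distribution of distractor counts under this parsing.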

<h3> Different ways of producing sentence embeddings: </h3>

1. **Avg-W2V:** Words can be converted to vectors by producing their W2V embeddings. For a sentence, we take the average of all its word embeddings to produce the sentence embedding. Since this is based on individual word embeddings, it does not consider the context, the POS of the words, or the voice of the sentence.
2. **Avg-W2V (GloVe):** Leverages a pretrained GloVe model to produce sentence embeddings in the same way. GloVe vectors are said to capture more of the context of words because they are trained on global co-occurrence statistics, but each word still has a single fixed vector.
3. **Sentence Transformers:** Sentence embeddings using BERT. A model is fine-tuned on Natural Language Inference (NLI) data: given two sentences, it should classify whether they entail, contradict, or are neutral to each other. For this, the two sentences are passed to a transformer model to generate fixed-size sentence embeddings. These sentence embeddings are then passed to a softmax classifier to derive the final label (entail, contradict, neutral). The resulting sentence embeddings are also useful for other tasks such as clustering or semantic textual similarity.
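
To make the averaging idea in options 1 and 2 concrete, here is a minimal sketch. The 3-dimensional `toy_vectors` are made up for illustration and stand in for real W2V or GloVe vectors; `average_embedding` is a hypothetical helper.

```python
import numpy as np

# Toy 3-dimensional word vectors, made up for illustration only.
toy_vectors = {
    "moon": np.array([1.0, 0.0, 2.0]),
    "shines": np.array([0.0, 1.0, 1.0]),
    "night": np.array([2.0, 2.0, 0.0]),
}


def average_embedding(sentence, vectors, dim=3):
    """Average the vectors of the known words in a sentence."""
    known = [vectors[w] for w in sentence.lower().split() if w in vectors]
    if not known:
        return np.zeros(dim)  # no known word: fall back to a zero vector
    return np.mean(known, axis=0)


a = average_embedding("The moon shines", toy_vectors)
b = average_embedding("shines the moon", toy_vectors)
print(a, np.array_equal(a, b))
```

The two sentences get the same embedding even though their word order differs, which is the loss of context that motivates sentence-level models.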

<h4> Constructing sentence embeddings: </h4>

The pretrained 'bert-base-nli-mean-tokens' and 'bert-base-nli-stsb-mean-tokens' models are candidates for constructing sentence embeddings/vectors; the second one is used below.

In [8]:
from sentence_transformers import SentenceTransformer
import scipy.spatial

#Leveraging pretrained models, loading pre-trained models from the server:
#embedder = SentenceTransformer('bert-base-nli-mean-tokens') # STSbenchmark: 77.12
embedder = SentenceTransformer('bert-base-nli-stsb-mean-tokens')# STSbenchmark: 85.14

In [13]:
# Split every distractor cell in Train on commas and flatten the pieces into one list of candidates.
distractors = [cell.split(',') for cell in train['distractor']]
candidate_distractors = [d for cell in distractors for d in cell]

In [10]:
%%time
# Corpus with the candidate_distractors from train
corpus = candidate_distractors
corpus_embeddings = embedder.encode(candidate_distractors) # PyTorch uses GPU acceleration when a GPU is available.

Wall time: 8min 17s


In [10]:
#Saving the embeddings for future use.
np.save("corpus_embeddings.npy",corpus_embeddings)
corpus_embeddings = np.load("corpus_embeddings.npy")
print(np.shape(corpus_embeddings))

(70796, 768)


In [9]:
# Considering combination of Questions and answers as Query sentences: 
q_a_combined = test['question'] + " " + test['answer_text']

In [12]:
%%time
query_embeddings = embedder.encode(q_a_combined)

Wall time: 1min 51s


In [11]:
#Saving the embeddings for future use.
np.save("query_embeddings.npy",query_embeddings)
query_embeddings = np.load("query_embeddings.npy")
print(np.shape(query_embeddings))

(13500, 768)


In [15]:
# Finding the closest 3 distractors from the corpus of distractors for each query sentence, based on cosine similarity.
closest_n = 3
# DataFrame that accommodates the 3 distractors for each question in Test.
dis3 = pd.DataFrame(index=range(test.shape[0]),
                    columns=['distractor_'+str(i) for i in range(closest_n)], dtype=object)
for i, (query, query_embedding) in enumerate(tqdm(zip(q_a_combined, query_embeddings))):
    # Cosine distance of every candidate distractor from this query.
    distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]

    # Indices of the closest_n smallest distances (stable sort keeps corpus order on ties).
    top_idx = np.argsort(distances, kind="stable")[:closest_n]

    #print("Query:", query)
    #print("\nTop", closest_n, "most similar sentences in corpus:")
    for j, idx in enumerate(top_idx):
        # Assigning a potential distractor from Train to the question-answer pair in Test.
        #print(corpus[idx].strip(), "(Score: %.4f)" % (1-distances[idx]))
        dis3.iloc[i, j] = corpus[idx].strip()

13500it [1:32:03,  2.49it/s]
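
The per-query search is the slowest step of the notebook. One possible alternative, sketched here under the assumption that both embedding matrices fit in memory and contain no all-zero rows, is to normalise the embeddings once and use batched matrix products; `top_k_cosine` and its `batch_size` parameter are hypothetical and not part of the original run.

```python
import numpy as np


def top_k_cosine(queries, corpus, k=3, batch_size=256):
    """Return, for each query row, the indices of the k most similar corpus rows."""
    # L2-normalise so that a dot product equals cosine similarity.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    top = np.empty((len(q), k), dtype=int)
    for start in range(0, len(q), batch_size):
        sims = q[start:start + batch_size] @ c.T  # shape: (batch, n_corpus)
        # Sort each row by descending similarity and keep the first k columns.
        order = np.argsort(-sims, axis=1, kind="stable")
        top[start:start + batch_size] = order[:, :k]
    return top


# Tiny check on made-up 2-dimensional vectors.
toy_corpus = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_queries = np.array([[2.0, 0.1]])
print(top_k_cosine(toy_queries, toy_corpus, k=2))
```

Applied to the notebook's arrays, `top_k_cosine(query_embeddings, corpus_embeddings, k=closest_n)` would return a `(13500, 3)` index array whose rows can be mapped back to `corpus`; batching avoids materialising the whole similarity matrix at once.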


In [16]:
dis3.head()

Unnamed: 0,distractor_0,distractor_1,distractor_2
0,"""thought the author was n't fit to be a teacher""",'Students are not interested in their curricul...,'middle school graduates should not choose the...
1,'a late nap may lead to sleep problems during ...,'Sleeping late on weekends can make up for los...,'the ways to get a better sleep on a full moon...
2,'Starting Business in China','Beijing No.4 Middle School has been encouragi...,'learning how to make Chinese dishes is very p...
3,'He thought it was such a big swimming pool th...,'there was no water to wash his hands','Nobody needs water'
4,'12 zoos are involved in the program in the US .','000 to A$ 12',' so it is hard to tell whether a child is a b...


In [17]:
#Combining all 3 distractors
dis3['distractors'] = dis3.iloc[:,0]+","+dis3.iloc[:,1]+","+dis3.iloc[:,2]

dis3.to_csv('dis3_v2.csv')

In [19]:
test['distractor'] = dis3['distractors']

test[['question', 'answer_text','distractor']].to_csv('results_2.csv',index=False)

In [22]:
test[['question', 'answer_text','distractor']].head()

Unnamed: 0,question,answer_text,distractor
0,What 'S the main idea of the text ?,The lack of career -- based courses in US high...,"""thought the author was n't fit to be a teache..."
1,"In the summer high season , Finland does nt se...",the sun is out at night,'a late nap may lead to sleep problems during ...
2,If you want to apply for Chinese Business Inte...,have to get confirmed at least twice,"'Starting Business in China','Beijing No.4 Mid..."
3,"That afternoon , the boy 's clothes were dry b...",nobody made room for him in the water .,'He thought it was such a big swimming pool th...
4,Which of the following statements is NOT true ?,There are twelve countries in the World Wildli...,'12 zoos are involved in the program in the US...
