# Dataset Preparation
CoQA is a large-scale dataset for building Conversational Question Answering systems. The goal of the CoQA challenge is to measure the ability of machines to understand a text passage and answer a series of interconnected questions that appear in a conversation. The dataset is available here: https://stanfordnlp.github.io/coqa/

In [1]:
#Importing Libraries
import pandas as pd

#Reading Coqa Dataset
Total_dataset = pd.read_json('coqa-train-v1.0.json')

# Saving the Questions and Answers in a new DataFrame
Context_num=[]
Qs=[]
As= []

for i in range(len(Total_dataset)):
    
    questions = Total_dataset['data'][i]['questions']
    answers = Total_dataset['data'][i]['answers']

    for j in range(len(questions)):
        Context_num.append(i+1)
        Qs.append(questions[j]['input_text'])
        
    for g in range(len(answers)):
        As.append(answers[g]['input_text'])
        
data = pd.DataFrame({'Context Number':Context_num ,'Questions' :Qs,'Answers' : As})
print("Number of questions is ", len(data))

# Saving the rearranged dataset in csv file
data.to_csv('Rearranged_data.csv',index = False)

Number of questions is  108647
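The loop above relies on the `questions` and `answers` lists being in the same order for every passage. As a minimal sketch, assuming each question and answer entry carries a `turn_id` field as in the CoQA JSON, the pairs can be matched explicitly instead. The `sample` record (with made-up turn ids) and the `flatten` helper are hypothetical and only illustrate the idea.

```python
# Hypothetical miniature record shaped like one entry of the CoQA "data" list.
# The turn ids are made up for illustration.
sample = [
    {
        "questions": [
            {"input_text": "what is the GoI?", "turn_id": 1},
            {"input_text": "how many states are in India?", "turn_id": 2},
        ],
        "answers": [
            {"input_text": "29", "turn_id": 2},
            {"input_text": "Government of India", "turn_id": 1},
        ],
    }
]


def flatten(records):
    """Return (context number, question, answer) rows, matched on turn_id."""
    rows = []
    for context_number, record in enumerate(records, start=1):
        # Index the answers by turn so that their order does not matter.
        answers_by_turn = {a["turn_id"]: a["input_text"] for a in record["answers"]}
        for q in record["questions"]:
            rows.append((context_number, q["input_text"], answers_by_turn[q["turn_id"]]))
    return rows


rows = flatten(sample)
for row in rows:
    print(row)
```

The resulting tuples can be passed to `pd.DataFrame` with the same three column names used above.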


# Building the chatbot 
I used the retrieval chatbot described in this blog post: https://omarito.me/building-a-basic-fatwa-chat-bot/ (it uses TF-IDF for vectorization), with some modifications.
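To make the retrieval step concrete, here is a minimal pure-Python sketch of TF-IDF vectorization followed by cosine similarity. It is intended to mirror the defaults of `TfidfVectorizer` as I understand them (lowercasing, tokens of two or more word characters, smoothed IDF, L2 normalization), but it is an illustration, not a drop-in replacement. The helper names are hypothetical, and the toy corpus contains only three questions, whereas the notebook fits the vectorizer on all questions and answers.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep word tokens of two or more characters.
    return re.findall(r"\b\w\w+\b", text.lower())


def fit_idf(documents):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(documents)
    df = Counter(term for doc in documents for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}


def tfidf_vector(text, idf):
    """Sparse TF-IDF vector (a dict) scaled to unit length; unknown terms are dropped."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def cosine(u, v):
    # Both vectors have unit length, so the dot product is the cosine.
    return sum(w * v.get(t, 0.0) for t, w in u.items())


questions = ["what is the GoI?", "how many states are in India?", "Where is it?"]
idf = fit_idf(questions)
stored = [tfidf_vector(q, idf) for q in questions]

query = tfidf_vector("how many states are in India?", idf)
scores = [cosine(query, s) for s in stored]
best = max(range(len(scores)), key=scores.__getitem__)
print(questions[best], round(scores[best], 3))
```

A query that repeats a stored question word for word gets a similarity of about 1, which is why the conversations below report values such as `1.0` and `1.0000000000000002`.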
 
## Modifications:
1. It uses the CoQA dataset instead of the askfm dataset.
2. It prints only the top answer if the similarity of its question is at least 0.5.
3. If the similarity of the top question is below 0.5, it prints the five most similar questions, with their answers, as suggestions.
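The threshold rule in the last two modifications can be sketched as a small function over a list of similarity scores. This is a simplified stand-in for the notebook's loop, not the code that produced the outputs below; `choose_response` and its parameters are hypothetical names.

```python
def choose_response(scores, threshold=0.5, n_suggestions=5):
    """Return ("answer", [best index]) or ("suggestions", top indices)."""
    # Indices sorted by descending similarity.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    if not order:
        return ("suggestions", [])
    if scores[order[0]] < threshold:
        # No stored question is close enough: offer the nearest ones instead.
        return ("suggestions", order[:n_suggestions])
    return ("answer", order[:1])


print(choose_response([0.1, 0.9, 0.3]))
print(choose_response([0.1, 0.4, 0.3, 0.2, 0.0, 0.35]))
```

The returned indices point back into the question and answer columns, so the caller decides how to print them.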

 

### First conversation 

In [6]:
# Importing libraries
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Reading questions and answers
full_Dataset = pd.read_csv('Rearranged_data.csv')
data_QA = full_Dataset[['Questions', 'Answers']]

# TF-IDF vectorizer, fitted on both the questions and the answers
vectorizer = TfidfVectorizer()
vectorizer.fit(data_QA.values.ravel())

# Vectorizing the stored questions once, outside the loop
question_vectors = vectorizer.transform(data_QA['Questions'].values)

print("Welcome to chatbot")
question = input('Please enter a question: ')

while True:
    # Cosine similarity between the input question and every stored question
    rank = cosine_similarity(vectorizer.transform([question]), question_vectors).ravel()

    print("\nSearching for the best answer..........\n")
    # Indices of the stored questions, most similar first
    order = np.argsort(-rank)
    top_one = order[0]
    maximum = rank[top_one]

    if maximum < 0.5:
        print("Sorry, I can't find the answer\n")
        # Printing the five most similar questions with their answers
        print('Answers of the most similar five Questions\n')
        for item in order[:5]:
            print(data_QA['Questions'].iloc[item], ' : ', data_QA['Answers'].iloc[item])
    else:
        print('The answer is "',
              data_QA['Answers'].iloc[top_one], '" with similarity: ', maximum)
    print("\n ---------------------------------------------------------\n")

    # Asking whether to continue until the reply is Y or N
    while True:
        do = input('Do you have another question (Y/N)? ')
        if do in ('Y', 'N'):
            break
        print("\n I can't understand\n")

    if do == 'N':
        print('\nGood bye')
        break
    question = input('Please enter another question: ')
    

Welcome to chatbot
Please enter a question: When was the Vat formally opened?

Searching for the best answer..........

The answer is " It was formally established in 1475 " with similarity:  1.0

 ---------------------------------------------------------

Do you have another question (Y/N)? Y
Please enter another question:  for what subjects?

Searching for the best answer..........

The answer is " history, and law " with similarity:  1.0

 ---------------------------------------------------------

Do you have another question (Y/N)? Y
Please enter another question: what must be requested in person or by mail?

Searching for the best answer..........

The answer is " Photocopies " with similarity:  1.0

 ---------------------------------------------------------

Do you have another question (Y/N)? Y
Please enter another question: when were the Secret Archives moved from the rest of the library?

Searching for the best answer..........

The answer is " at the beginning of the 17th cen

### Second conversation


Welcome to chatbot
Please enter a question: Is the JPEG format supported by Adobe Flash Player 11.0?

Searching for the best answer..........

The answer is " yes " with similarity:  1.0000000000000002

 ---------------------------------------------------------

Do you have another question (Y/N)? Y
Please enter another question: what is JPEG XR short for?

Searching for the best answer..........

The answer is " JPEG extended range " with similarity:  1.0

 ---------------------------------------------------------

Do you have another question (Y/N)? Y
Please enter another question: when did microsoft put HD Photo up for consideration to be named JPEG XR?

Searching for the best answer..........

The answer is " July 2007 " with similarity:  1.0000000000000002

 ---------------------------------------------------------

Do you have another question (Y/N)? Y
Please enter another question: what did they rename it to?

Searching for the best answer..........

The answer is " HD Photo " w

### Third conversation


Welcome to chatbot
Please enter a question: what is the GoI?

Searching for the best answer..........

The answer is " Government of India " with similarity:  1.0

 ---------------------------------------------------------

Do you have another question (Y/N)? Y
Please enter another question: where did India come from?

Searching for the best answer..........

The answer is " the Indus river " with similarity:  1.0

 ---------------------------------------------------------

Do you have another question (Y/N)? Y
Please enter another question: how many states are in India?

Searching for the best answer..........

The answer is " 29 " with similarity:  1.0000000000000002

 ---------------------------------------------------------

Do you have another question (Y/N)? Y
Please enter another question: how was it created?

Searching for the best answer..........

The answer is " by the constitution of India " with similarity:  1.0

 ---------------------------------------------------------



### A conversation illustrating mistakes caused by questions repeated in different contexts


Welcome to chatbot
Please enter a question: What is the largest island?

Searching for the best answer..........

The answer is " Socotra " with similarity:  1.0000000000000002

 ---------------------------------------------------------

Do you have another question (Y/N)? Y
Please enter another question: Where is it?

Searching for the best answer..........

The answer is " to the west " with similarity:  1.0

 ---------------------------------------------------------

Do you have another question (Y/N)? Y
Please enter another question: How many people live there?

Searching for the best answer..........

The answer is " 1.8 million. " with similarity:  1.0000000000000002

 ---------------------------------------------------------

Do you have another question (Y/N)? N

Good bye


#### In the previous conversation the right answers were:
1. 'Australia'. The printed answer belongs to the same question, 'What is the largest island?', asked in another context about Yemen.
2. 'south of earth'. The printed answer belongs to the same question, 'Where is it?', asked in another context about Alaska.
3. "The population is nearly as large as Shanghai's". The printed answer belongs to the same question, 'How many people live there?', asked in another context about Vienna.

I think that changing a question slightly, so that its wording differs from the dataset, would also produce other mistakes.
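How often this ambiguity occurs can be checked before running the chatbot. The following is a minimal sketch, assuming rows shaped like those in `Rearranged_data.csv` (context number, question, answer). The `rows` list and the `ambiguous_questions` helper are hypothetical; the answers are borrowed from the conversations above and the context numbers are made up.

```python
from collections import defaultdict

# Hypothetical rows (context number, question, answer); context numbers are made up.
rows = [
    (1, "What is the largest island?", "Socotra"),
    (2, "What is the largest island?", "Australia"),
    (2, "Where is it?", "south of earth"),
    (3, "Where is it?", "to the west"),
    (4, "what is the GoI?", "Government of India"),
]


def ambiguous_questions(rows):
    """Map each normalized question asked in several contexts to its answers."""
    seen = defaultdict(dict)
    for context, question, answer in rows:
        key = question.strip().lower()
        # Keep the first answer seen for each (question, context) pair.
        seen[key].setdefault(context, answer)
    return {q: list(by_ctx.values()) for q, by_ctx in seen.items() if len(by_ctx) > 1}


for question, answers in ambiguous_questions(rows).items():
    print(question, "->", answers)
```

Any question listed by such a check can only be answered correctly if the chatbot also knows which context the conversation is about.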