## Dataset Creation

As an initial step, we load the sample data. Here is a sample dataset that [I found](https://raw.githubusercontent.com/bvshyam/make_yourself_a_bot/master/workspace-watson.json). It will be my base for creating a dataset.

In [2]:
# Loading necessary libraries

import json
import pandas as pd

The input dataset is a JSON file in which the data is organized by intent, question and answer. We are not going to tell the bot which part is the intent and which is the response; we are only going to train it to answer these questions.

So we will drop the intent structure of this format and gather only the questions and answers.

In [3]:
# Load the JSON file; the context manager closes the file handle after reading
with open('./data/resume_conversation.json', encoding='utf-8') as f:
    data = json.load(f)

Below is a sample node from the file. Each node has `conditions` (the question type, or intent) and `output` (the answer).

In [4]:
data['dialog_nodes'][0]

{'conditions': '#complaint',
 'context': None,
 'created': '2016-09-10T13:23:23.985Z',
 'description': None,
 'dialog_node': 'node_10_1473513867252',
 'go_to': None,
 'metadata': None,
 'output': {'text': 'All right. I filed a complaint to the management.... done. Sending notification to lawyers...done :)'},
 'parent': None,
 'previous_sibling': 'node_1_1473866820552'}

In [5]:
# Loop through the dialog nodes and collect each condition (intent) and output (answer)

intent_questions = []
answers = []

# The slice skips the first node (shown above) and caps the loop at 999 nodes
for node in data['dialog_nodes'][1:1000]:
    intent_questions.append(str(node['conditions']).replace('#', ''))
    answers.append(node['output']['text'])
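
The loop above assumes that every node carries an `output` dictionary with a `text` key. If some nodes lack one, the lookup raises an error. The following is a minimal sketch of a more defensive extraction, not the notebook's own method: the helper name `extract_pairs` and the `sample_nodes` list are hypothetical and exist only for illustration.

```python
def extract_pairs(dialog_nodes):
    """Return (intent, answer) pairs, skipping nodes without a text answer."""
    pairs = []
    for node in dialog_nodes:
        condition = node.get('conditions')
        output = node.get('output') or {}   # treat a missing or None output as empty
        text = output.get('text')
        if condition is None or text is None:
            continue                        # nothing usable in this node
        pairs.append((str(condition).replace('#', ''), text))
    return pairs


# Hypothetical sample nodes, shaped like the entries in data['dialog_nodes']
sample_nodes = [
    {'conditions': '#greetings', 'output': {'text': 'Greetings!'}},
    {'conditions': '#yes', 'output': None},
    {'conditions': None, 'output': {'text': 'orphan answer'}},
]

pairs = extract_pairs(sample_nodes)
```

With the real file, `extract_pairs(data['dialog_nodes'])` would yield the pairs that feed the dataframe in the next cells.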

In [6]:
# Sample of the extracted intents
intent_questions[:10]

['what_are_your_strengths',
 'what_are_your_weaknesses',
 'yes',
 'greetings',
 'conversation_start',
 'conversation_start_x',
 'how_you_describe_yourself',
 'why_do_you_want_to_work_for_us',
 'anything_else',
 'can_you_work']

In [7]:
# Convert the intents and answers into a pandas dataframe

df_qa = pd.DataFrame(list(zip(intent_questions, answers)), columns=['intent', 'answers'])

In [8]:
df_qa.head()

Unnamed: 0,intent,answers
0,what_are_your_strengths,Cliché question detecting. Waiting for next q...
1,what_are_your_weaknesses,Cliché question detecting. Waiting for next q...
2,yes,Good for you
3,greetings,Greetings!
4,conversation_start,"Hello, I'm Andrei's AI. Think of me as Andrei...."


Next, we need to extract the intents and their question variants. A question variant is one of the different ways a user can ask the same question.

In [9]:
intents = []
different_questions = []

# Loop over the intents and collect the example questions of each one as a list
for question in data['intents']:
    intents.append(question['intent'])
    # 'example' avoids reusing the name of the answers list built earlier
    different_questions.append([example['text'] for example in question['examples']])

In [10]:
# Create a pandas dataframe
df_quest = pd.DataFrame(columns=['intent','question'])

In [11]:
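# Add one row per (intent, question) pair.
# DataFrame.append was removed in pandas 2.0, so the rows are collected first.

rows = [{'intent': intent, 'question': quest}
        for intent, questions in zip(intents, different_questions)
        for quest in questions]
df_quest = pd.DataFrame(rows, columns=['intent', 'question'])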

In [12]:
df_quest.head()

Unnamed: 0,intent,question
0,age_related,Are you an adult ?
1,age_related,are you clever ?
2,age_related,are you there ?
3,age_related,How many years you have?
4,age_related,How old are you ?
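
The cell above produces one row per question. As an alternative, pandas can flatten a column of lists directly with `DataFrame.explode`. The sketch below is only an illustration under assumptions: `intents_demo` and `questions_demo` are small hand-made stand-ins for `intents` and `different_questions`, and the `greetings` questions are made up.

```python
import pandas as pd

# Hand-made stand-ins for the intents and different_questions lists
intents_demo = ['age_related', 'greetings']
questions_demo = [
    ['How old are you ?', 'Are you an adult ?'],
    ['Hello', 'Hi there'],          # hypothetical example questions
]

# One row per intent, with the questions still held as a list ...
df_demo = pd.DataFrame({'intent': intents_demo, 'question': questions_demo})

# ... then one row per question; explode repeats the intent for each list element
df_demo = df_demo.explode('question').reset_index(drop=True)
```

The result has the same two columns as `df_quest`, so it could be merged in the same way.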


Finally, we join the two dataframes on `intent` to get the final result. The outer join keeps intents that appear in only one of the two dataframes.

In [13]:
pd_final = pd.merge(df_qa, df_quest, on='intent', how='outer')

pd_final.head()

Unnamed: 0,intent,answers,question
0,what_are_your_strengths,Cliché question detecting. Waiting for next q...,Tell me 5 positive things about you
1,what_are_your_strengths,Cliché question detecting. Waiting for next q...,Tell me your strengths
2,what_are_your_strengths,Cliché question detecting. Waiting for next q...,Tell us Unique Selling Points
3,what_are_your_strengths,Cliché question detecting. Waiting for next q...,What are you good at ?
4,what_are_your_strengths,Cliché question detecting. Waiting for next q...,What are your professional strengths ?
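
Because the merge uses `how='outer'`, an intent that exists on only one side ends up with a missing answer or a missing question. One way to find such rows is the `indicator` parameter of `pd.merge`. The sketch below uses two tiny hand-made dataframes (`left` and `right` are hypothetical stand-ins for `df_qa` and `df_quest`).

```python
import pandas as pd

# Hypothetical stand-ins for df_qa and df_quest
left = pd.DataFrame({'intent': ['greetings', 'yes'],
                     'answers': ['Greetings!', 'Good for you']})
right = pd.DataFrame({'intent': ['greetings', 'age_related'],
                      'question': ['Hello', 'How old are you ?']})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
merged = pd.merge(left, right, on='intent', how='outer', indicator=True)

# Rows that found no partner on the other side
unmatched = merged[merged['_merge'] != 'both']
```

Rows marked `left_only` have an answer but no question, and rows marked `right_only` have questions but no answer, so both are candidates for manual review.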


In [14]:
# The dataframe above can be exported to CSV for further use.

#pd_final.to_csv('./data/final_qa_data.csv')

The dataset also needs manual cleanup: some of the answers are not relevant or do not reflect my own information, so they have to be updated by hand.

In [15]:
pd_final.to_csv('./data/final_qa_data_tabbed.csv',sep="\t")
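
The manual update of answers could also be scripted instead of editing the exported file by hand. The following is a rough sketch under assumptions, not the approach used in this notebook: `df_demo_final` is a small stand-in for `pd_final`, and the `overrides` dictionary and its replacement text are invented for illustration.

```python
import pandas as pd

# Small stand-in for pd_final
df_demo_final = pd.DataFrame({
    'intent': ['greetings', 'yes'],
    'answers': ['Greetings!', 'Good for you'],
    'question': ['Hello', 'Yes'],
})

# Hypothetical replacement answers, keyed by intent
overrides = {'greetings': 'Hello, thanks for stopping by!'}

# Use the override where one exists, otherwise keep the original answer
df_demo_final['answers'] = (df_demo_final['intent']
                            .map(overrides)
                            .fillna(df_demo_final['answers']))
```

Keeping the replacements in one dictionary makes it easy to rerun the cleanup whenever the source JSON changes.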