## Scope

This notebook shows how to build a training set from a collection of question-and-answer pairs.

## Optional: further research
* 1) BLEU score for evaluation: https://machinelearningmastery.com/calculate-bleu-score-for-text-python/
* 2) Stanford SQuAD (question/answer scoring datasets): https://github.com/obryanlouis/qa
* 3) ParlAI: https://github.com/facebookresearch/ParlAI

## Initialize

In [11]:
# Data
import numpy as np

# Utility
import time
import math
import sys
import os
import glob
import pickle

# NLP
import tensorflow as tf
import re
import spacy


# Custom libraries
# from seq2seq_model import Seq2SeqModel
# from corpora_tools import *

## Import Data

In [62]:
# Option 1: All responses are equal to the original query except for the comment by the author
# Load the pickled conversations; each item is a pair of records
# that carry their text under the 'utterance' key
with open('../data/all_responses_equal.p', 'rb') as f:
    data = pickle.load(f)

# Zipped Q/A
q_and_a = [(pair[0]['utterance'], pair[1]['utterance']) for pair in data]

# Split pairs into question and answer
question, answer = zip(*q_and_a)
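
The loading step above relies on each pickled item being a pair of records with an `utterance` field. A minimal sketch with made-up records (the `toy_data` list and its contents are hypothetical, not taken from the real file) shows how the pairs are built and then split into two parallel tuples with `zip(*...)`:

```python
# Hypothetical stand-in for the unpickled data: each item is a
# (question_record, answer_record) pair of dicts with an 'utterance' key.
toy_data = [
    ({'utterance': 'How do I read a pickle file?'}, {'utterance': 'Use pickle.load.'}),
    ({'utterance': 'What does zip do?'}, {'utterance': 'It pairs up iterables.'}),
]

# Build (question, answer) text pairs
toy_pairs = [(q['utterance'], a['utterance']) for q, a in toy_data]

# zip(*pairs) transposes the list of pairs into two parallel tuples
toy_questions, toy_answers = zip(*toy_pairs)

print(toy_questions)
print(toy_answers)
```

Because both tuples come from the same list of pairs, position `i` in one always lines up with position `i` in the other.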

In [63]:
idx = 2
print("Q:", question[idx])
print("A:", answer[idx])

Q: Husband deteriorating before my eyes, doctors at a loss, no one will help; Reddit docs, I need you.
A: A few weeks ago I was suffering from the same symptoms. Severe groin pain that radiated through my testicles and legs, urgent urination even though I didn't need to pee, muscle twitching/eye twitching. Everything came back normal. When I went to the hospital the sixth time, the doctors found that I had a lot of poop stuck in my colon around the pelvis area so they gave me an enema and some miralax and I was instantly cured. all this pain was also giving me anxiety and the anxiety made me feel like my face was going numb and that my legs and arms were going numb. After a few days I was still having trouble sleeping but I'm getting back to normal now. Poop is no joke and I hope that this is the problem your husband is having! If not.... I'm honestly so sorry because having to go through this pain for sooooooo long is insanely depressing. Maybe check out this subreddit /r/chronicpain


In [66]:
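def tokenize(txt):
    # Naive tokenizer: split on single spaces (punctuation stays attached to words)
    return str(txt).split(' ')

def clean_sentences(txt):
    # Tokenize
    tokenized_txt = tokenize(txt)
    # Truncate long texts: keep only the first 20 tokens
    return tokenized_txt[:20]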
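
To see what these helpers do, the sketch below repeats them on made-up strings (the `sample` and `long_text` values are illustrative only). Note that `split(' ')` keeps punctuation attached to words and produces an empty string wherever two spaces are adjacent, whereas `split()` with no argument collapses runs of whitespace:

```python
def tokenize(txt):
    # Same naive single-space split as in the notebook
    return str(txt).split(' ')

def clean_sentences(txt):
    # Keep only the first 20 tokens
    return tokenize(txt)[:20]

# Illustrative input with a double space after the comma
sample = "Hello,  world! This is a test."
tokens = tokenize(sample)
print(tokens)

# split() with no argument collapses runs of whitespace
print(sample.split())

# A 50-token text is cut down to 20 tokens
long_text = " ".join(str(i) for i in range(50))
print(len(clean_sentences(long_text)))
```

Whether the stray empty tokens matter depends on the downstream model; a proper tokenizer (for example the `spacy` import above) would be one way to avoid them.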

## Create Trainable dataset

In [67]:
%%time
clean_question = [clean_sentences(s) for s in question]
clean_answer = [clean_sentences(s) for s in answer]

print(len(clean_question))
assert len(clean_question) == len(clean_answer)

108825
CPU times: user 1.03 s, sys: 70.2 ms, total: 1.1 s
Wall time: 1.1 s


In [91]:
filt_clean_sen_l1, filt_clean_sen_l2 = filter_sentence_length(clean_question,
                                                              clean_answer)
print("# Filtered Corpora length (i.e. number of sentences)")
print(len(filt_clean_sen_l1))
assert len(filt_clean_sen_l1) == len(filt_clean_sen_l2)

# Filtered Corpora length (i.e. number of sentences)
108825
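
The count is unchanged (108825 before and after) because `clean_sentences` already truncates every text to 20 tokens, so with the defaults `min_len=0` and `max_len=20` every pair passes the filter. A minimal sketch on made-up token lists illustrates this; `filter_pairs` is a hypothetical compact equivalent of `filter_sentence_length`, not the notebook's own function:

```python
def filter_pairs(sents_a, sents_b, min_len=0, max_len=20):
    # Keep a pair only if BOTH sides fall within [min_len, max_len]
    kept = [(a, b) for a, b in zip(sents_a, sents_b)
            if min_len <= len(a) <= max_len and min_len <= len(b) <= max_len]
    if not kept:
        return [], []
    left, right = zip(*kept)
    return list(left), list(right)

# Made-up token lists of various lengths
raw_q = [["w"] * 5, ["w"] * 30, ["w"] * 12]
raw_a = [["w"] * 40, ["w"] * 3, ["w"] * 20]

# Without truncation, pairs with an over-long side are dropped
untruncated_kept = len(filter_pairs(raw_q, raw_a)[0])
print(untruncated_kept)

# After truncating to 20 tokens, nothing can exceed max_len, so all pairs survive
trunc_q = [s[:20] for s in raw_q]
trunc_a = [s[:20] for s in raw_a]
truncated_kept = len(filter_pairs(trunc_q, trunc_a)[0])
print(truncated_kept)
```

If dropping long pairs is the intent rather than truncating them, the filter would need to run on the untruncated tokens.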


Next, we create the dictionaries for the two sets of sentences. In practice they should look nearly the same (the same sentence appears once on the left side and once on the right side), except for differences introduced by the first and last sentences of a conversation, which appear only once. To make the most of our corpora, we build two dictionaries of words and then encode every word in the corpora with its dictionary index.
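
The encoding step described above can be sketched on toy data. This is a minimal illustration, not the notebook's implementation: the helper names (`build_vocab`, `encode`), the toy sentences, and the reserved ID values are all assumptions. It reserves a few low IDs for special tokens, assigns the remaining IDs by word frequency, and maps out-of-vocabulary words to an unknown-word ID:

```python
from collections import Counter

# Assumed reserved IDs (padding, start, end, unknown); the values are illustrative
PAD_ID, GO_ID, EOS_ID, UNK_ID = 0, 1, 2, 3
NUM_RESERVED = 4

def build_vocab(sentences, size):
    # Count every word, then give the `size` most frequent words
    # consecutive IDs starting right after the reserved ones
    counts = Counter(word for sent in sentences for word in sent)
    return {word: idx + NUM_RESERVED
            for idx, (word, _) in enumerate(counts.most_common(size))}

def encode(sentence, vocab):
    # Words missing from the vocabulary fall back to UNK_ID
    return [vocab.get(word, UNK_ID) for word in sentence]

toy_sentences = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "bird"]]
vocab = build_vocab(toy_sentences, size=2)
encoded = encode(["the", "cat", "sat"], vocab)

print(vocab)    # {'the': 4, 'sat': 5}
print(encoded)  # [4, 3, 5]
```

With a vocabulary of only two words, "cat" is out of vocabulary and is encoded as `UNK_ID`.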

In [90]:
from collections import Counter

# Reserved token IDs. Assumption: these mirror the data_utils module of
# TensorFlow's legacy seq2seq tutorial, which this notebook does not import.
PAD_ID, GO_ID, EOS_ID, UNK_ID = 0, 1, 2, 3
NUM_RESERVED_IDS = 4

def create_indexed_dictionary(sentences, dict_size=10000, storage_path=None):
    # Note: storage_path is accepted but currently unused
    count_words = Counter()
    dict_words = {}

    # Count word frequencies over the whole corpus
    for sen in sentences:
        for word in sen:
            count_words[word] += 1

    # Keep the dict_size most frequent words; word IDs start after the reserved IDs
    for idx, item in enumerate(count_words.most_common(dict_size)):
        dict_words[item[0]] = idx + NUM_RESERVED_IDS

    return dict_words

def sentences_to_indexes(sentences, indexed_dictionary):
    indexed_sentences = []
    not_found_counter = 0
    for sent in sentences:
        idx_sent = []
        for word in sent:
            try:
                idx_sent.append(indexed_dictionary[word])
            except KeyError:
                # Out-of-vocabulary words map to the unknown-word ID
                idx_sent.append(UNK_ID)
                not_found_counter += 1
        indexed_sentences.append(idx_sent)

    print('[sentences_to_indexes] Did not find {} words'.format(not_found_counter))
    return indexed_sentences

def filter_sentence_length(sentences_l1, sentences_l2, min_len=0, max_len=20):
    # Keep a pair only if both sides fall within [min_len, max_len] tokens
    filtered_sentences_l1 = []
    filtered_sentences_l2 = []
    for i in range(len(sentences_l1)):
        if min_len <= len(sentences_l1[i]) <= max_len and \
                 min_len <= len(sentences_l2[i]) <= max_len:
            filtered_sentences_l1.append(sentences_l1[i])
            filtered_sentences_l2.append(sentences_l2[i])
    return filtered_sentences_l1, filtered_sentences_l2

In [92]:
dict_l1 = create_indexed_dictionary(clean_question, dict_size=15000)
dict_l2 = create_indexed_dictionary(clean_answer, dict_size=15000)
idx_sentences_l1 = sentences_to_indexes(filt_clean_sen_l1, dict_l1)
idx_sentences_l2 = sentences_to_indexes(filt_clean_sen_l2, dict_l2)
print("# Same sentences as before, with their dictionary ID")
print("Q:", list(zip(filt_clean_sen_l1[0], idx_sentences_l1[0])))
print("A:", list(zip(filt_clean_sen_l2[0], idx_sentences_l2[0])))