# Building a Chatbot

In this project, we will build a chatbot using conversations from Cornell University's [Movie-Dialogs Corpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html). The main features of our model are LSTM cells, a bidirectional dynamic RNN, and decoders with attention.

The conversations will be cleaned rather extensively to help the model produce better responses. As part of the cleaning process, punctuation will be removed, rare words will be replaced with "UNK" (our "unknown" token), longer sentences will not be used, and all letters will be converted to lowercase.

With a larger amount of data, it would be more practical to keep features such as punctuation. However, I am using FloydHub's GPU services and I don't want to get carried away with training for too long.
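The cleaning steps described above can be sketched in a few lines. This is only an illustration under assumptions: the real work happens inside the local `Corpus` class, whose code is not shown here, and the names `clean_line`, `build_dataset`, and `UNK_TOKEN` are hypothetical.

```python
import re
from collections import Counter

UNK_TOKEN = "<UNK>"  # assumed spelling of the unknown token


def clean_line(line):
    """Lowercase a line, drop punctuation, and split it into tokens."""
    line = line.lower()
    # Replace anything that is not a letter, digit, or whitespace with a space
    line = re.sub(r"[^a-z0-9\s]", " ", line)
    return line.split()


def build_dataset(lines, max_vocab, max_line_length):
    """Tokenize, drop empty or overly long lines, and replace rare words with UNK_TOKEN."""
    tokenized = [clean_line(line) for line in lines]
    tokenized = [tokens for tokens in tokenized if 0 < len(tokens) <= max_line_length]
    counts = Counter(token for tokens in tokenized for token in tokens)
    keep = {word for word, _ in counts.most_common(max_vocab)}
    return [[t if t in keep else UNK_TOKEN for t in tokens] for tokens in tokenized]


sample = ["Can we make this quick?", "We can!", "We will."]
cleaned = build_dataset(sample, max_vocab=2, max_line_length=30)
print(cleaned)
```

With `max_vocab=2` only the two most frequent words survive, so every other token becomes `<UNK>`. The notebook uses a much larger vocabulary (`max_vocab=8100`) and a 30-token length limit.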

In [1]:
%reset

Once deleted, variables cannot be recovered. Proceed (y/[n])? y


In [1]:
import numpy as np
import tensorflow as tf
import time

#Local libraries
from corpus import Corpus
import metrics
import loss_functions

tf.__version__

np.random.seed(1)
tf.set_random_seed(1)



Most of the code to load the data is courtesy of https://github.com/suriyadeepan/practical_seq2seq/blob/master/datasets/cornell_corpus/data.py.

## Load and Preprocess Data

In [95]:
cornell_corpus = Corpus("movie_lines.txt", "movie_conversations.txt", max_vocab=8100, max_line_length=30)

questions_text = cornell_corpus.prompts
answers_text = cornell_corpus.answers
questions_int = cornell_corpus.prompts_int
answers_int = cornell_corpus.answers_int

UNK = cornell_corpus.unk
vocab2int = cornell_corpus.vocab2int
int2vocab = cornell_corpus.int2vocab

METATOKEN_INDEX = len(vocab2int)
META = "<META>"
EOS = "<EOS>"
PAD = "<PAD>"
GO = "<GO>"
codes = [META, EOS, PAD, GO]
    

source_vocab_size = len(vocab2int)
dest_vocab_size = len(vocab2int)

vocab_dicts = (vocab2int, int2vocab)

(questions_vocab_to_int, questions_int_to_vocab) = vocab_dicts
(answers_vocab_to_int, answers_int_to_vocab) = vocab_dicts

In [96]:
vocab_dicts

({'you': 0,
  'i': 1,
  'the': 2,
  'not': 3,
  'to': 4,
  'it': 5,
  'a': 6,
  'do': 7,
  'that': 8,
  'what': 9,
  'are': 10,
  'and': 11,
  'of': 12,
  'have': 13,
  'me': 14,
  'is': 15,
  'in': 16,
  'we': 17,
  'he': 18,
  'this': 19,
  'know': 20,
  'no': 21,
  'for': 22,
  'your': 23,
  'will': 24,
  'was': 25,
  'my': 26,
  'be': 27,
  'on': 28,
  'just': 29,
  'did': 30,
  'with': 31,
  'but': 32,
  'would': 33,
  'they': 34,
  'like': 35,
  'about': 36,
  'there': 37,
  'all': 38,
  'get': 39,
  'here': 40,
  'got': 41,
  'so': 42,
  'how': 43,
  'she': 44,
  'out': 45,
  'if': 46,
  'him': 47,
  'want': 48,
  'can': 49,
  'think': 50,
  'up': 51,
  'well': 52,
  'right': 53,
  'why': 54,
  'go': 55,
  'at': 56,
  'one': 57,
  'yes': 58,
  'now': 59,
  'oh': 60,
  'yeah': 61,
  'going': 62,
  'her': 63,
  'who': 64,
  'see': 65,
  'where': 66,
  'good': 67,
  'tell': 68,
  'could': 69,
  'come': 70,
  'ca': 71,
  'were': 72,
  'as': 73,
  'an': 74,
  'u': 75,
  'when': 76,
 

In [4]:
for i in range(50):
    print(questions_text[i])
    print(answers_text[i])
    print()

['can', 'we', 'make', 'this', 'quick', '<UNK>', '<UNK>', 'and', 'andrew', 'barrett', 'are', 'having', 'an', 'incredibly', '<UNK>', 'public', 'break', 'up', 'on', 'the', '<UNK>', 'again']
['well', 'i', 'thought', 'we', 'would', 'start', 'with', '<UNK>', 'if', 'that', 'okay', 'with', 'you']

['well', 'i', 'thought', 'we', 'would', 'start', 'with', '<UNK>', 'if', 'that', 'okay', 'with', 'you']
['not', 'the', 'hacking', 'and', '<UNK>', 'and', '<UNK>', 'part', 'please']

['not', 'the', 'hacking', 'and', '<UNK>', 'and', '<UNK>', 'part', 'please']
['okay', 'then', 'how', 'about', 'we', 'try', 'out', 'some', 'french', '<UNK>', 'saturday', 'night']

['you', 'are', 'asking', 'me', 'out', 'that', 'so', 'cute', 'what', 'your', 'name', 'again']
['forget', 'it']

['no', 'no', 'it', 'my', 'fault', 'we', 'did', 'not', 'have', 'a', 'proper', 'introduction']
['cameron']

['cameron']
['the', 'thing', 'is', 'cameron', 'i', 'at', 'the', 'mercy', 'of', 'a', 'particularly', '<UNK>', 'breed', 'of', 'loser', '

In [5]:
def int_to_text(sequence, int2vocab):
    return [int2vocab[index] for index in sequence if index != METATOKEN_INDEX]

def text_to_int(sequence, vocab2int):
    return [vocab2int.get(token, vocab2int[UNK]) for token in sequence if token not in codes]
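A quick round trip shows how the two helpers behave. The vocabulary below is a made-up toy, and the spelling `"<UNK>"` for `cornell_corpus.unk` is an assumption. The function bodies are copied from the cell above so that the example runs on its own.

```python
# Toy stand-ins for the notebook's globals (values are made up for illustration)
UNK = "<UNK>"
vocab2int = {"you": 0, "i": 1, UNK: 2}
int2vocab = {index: word for word, index in vocab2int.items()}
METATOKEN_INDEX = len(vocab2int)  # one past the last real word id
codes = ["<META>", "<EOS>", "<PAD>", "<GO>"]


def int_to_text(sequence, int2vocab):
    return [int2vocab[index] for index in sequence if index != METATOKEN_INDEX]


def text_to_int(sequence, vocab2int):
    return [vocab2int.get(token, vocab2int[UNK]) for token in sequence if token not in codes]


# "<GO>" is a control code and is dropped; "zebra" is unseen and maps to UNK
ids = text_to_int(["you", "<GO>", "zebra", "i"], vocab2int)
# The metatoken id is filtered out again on the way back to text
words = int_to_text(ids + [METATOKEN_INDEX], int2vocab)
print(ids, words)
```

Control codes never receive an id of their own. They all share `METATOKEN_INDEX`, which `int_to_text` strips when converting back.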

## Word2Vec Embeddings

In [6]:
combined_corpus=[]
combined_corpus.extend(questions_text)
combined_corpus.extend(answers_text)

In [7]:
len(combined_corpus)

393982

In [8]:
combined_corpus[:5]

[['can',
  'we',
  'make',
  'this',
  'quick',
  '<UNK>',
  '<UNK>',
  'and',
  'andrew',
  'barrett',
  'are',
  'having',
  'an',
  'incredibly',
  '<UNK>',
  'public',
  'break',
  'up',
  'on',
  'the',
  '<UNK>',
  'again'],
 ['well',
  'i',
  'thought',
  'we',
  'would',
  'start',
  'with',
  '<UNK>',
  'if',
  'that',
  'okay',
  'with',
  'you'],
 ['not', 'the', 'hacking', 'and', '<UNK>', 'and', '<UNK>', 'part', 'please'],
 ['you',
  'are',
  'asking',
  'me',
  'out',
  'that',
  'so',
  'cute',
  'what',
  'your',
  'name',
  'again'],
 ['no',
  'no',
  'it',
  'my',
  'fault',
  'we',
  'did',
  'not',
  'have',
  'a',
  'proper',
  'introduction']]

In [13]:
from gensim.models import Word2Vec
embedding_size = 1024
model = Word2Vec(sentences=combined_corpus, size=embedding_size, window=5, min_count=1, workers=4, sg=0)

In [84]:
model.wv['well'].shape

(1024,)

In [15]:
wordVecs = model.wv

In [18]:
model.wv.index2word

['you',
 'i',
 '<UNK>',
 'the',
 'not',
 'to',
 'it',
 'a',
 'do',
 'that',
 'what',
 'are',
 'and',
 'of',
 'have',
 'me',
 'is',
 'in',
 'we',
 'he',
 'this',
 'know',
 'no',
 'for',
 'your',
 'will',
 'was',
 'my',
 'be',
 'on',
 'just',
 'did',
 'with',
 'but',
 'would',
 'they',
 'like',
 'about',
 'there',
 'all',
 'get',
 'here',
 'got',
 'so',
 'how',
 'she',
 'out',
 'if',
 'him',
 'want',
 'can',
 'think',
 'up',
 'well',
 'right',
 'why',
 'go',
 'at',
 'one',
 'yes',
 'now',
 'oh',
 'yeah',
 'going',
 'her',
 'who',
 'see',
 'where',
 'good',
 'tell',
 'could',
 'come',
 'ca',
 'were',
 'as',
 'an',
 'u',
 'when',
 'from',
 'say',
 'been',
 'time',
 'his',
 'some',
 'let',
 'back',
 'or',
 'look',
 'then',
 'them',
 'something',
 'mean',
 'take',
 'us',
 'man',
 'never',
 'had',
 'okay',
 'does',
 'too',
 'sure',
 'na',
 'really',
 'way',
 'make',
 'should',
 'said',
 'any',
 'down',
 'more',
 'little',
 'need',
 'maybe',
 'gon',
 'very',
 'over',
 'off',
 'thing',
 'only',

In [101]:
word_vecs = np.zeros((len(model.wv.vocab), model.wv.vector_size))
for word in model.wv.index2word:
    # Rows follow the corpus ids in vocab2int, not gensim's own ordering
    word_vecs[vocab2int[word]] = model.wv[word]



In [102]:
print("Vocabulary lengths")
print(len(word_vecs))
print(len(questions_vocab_to_int))
print(len(answers_vocab_to_int))
print(len(questions_int_to_vocab))
print(len(answers_int_to_vocab))

Vocabulary lengths
8101
8101
8101
8101
8101


In [112]:
len(word_vecs)

8101

In [103]:
np.save('word_Vecs.npy',word_vecs)
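The embedding matrix is filled by corpus id rather than by position in gensim's word list, because the two orderings differ (for example, `'<UNK>'` is third in `index2word` while id 2 in `vocab2int` belongs to `'the'`). A minimal sketch of that alignment follows. It uses plain Python lists in place of NumPy arrays, and every value in it is hypothetical.

```python
# Hypothetical toy data: gensim_order plays the role of model.wv.index2word,
# vectors plays the role of model.wv, and vocab2int is the corpus vocabulary.
gensim_order = ["you", "<UNK>", "the"]
vectors = {"you": [0.1, 0.2], "<UNK>": [0.3, 0.4], "the": [0.5, 0.6]}
vocab2int = {"you": 0, "the": 1, "<UNK>": 2}

dim = 2
word_vecs = [[0.0] * dim for _ in range(len(vocab2int))]
for word in gensim_order:
    # Index by the corpus id, not by the position in gensim's list
    word_vecs[vocab2int[word]] = list(vectors[word])

print(word_vecs)
```

Row `k` of the matrix then holds the vector of the word whose id is `k`, which is what `tf.nn.embedding_lookup` expects when it is fed the integer sequences.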

## Word2Affect Vector - VAD

In [104]:
import pandas as pd
df_vad=pd.read_excel('Warriner, Kuperman, Brysbaert - 2013 BRM-ANEW expanded.xlsx')

In [105]:
df_vad.head(5)

Unnamed: 0,Word,V.Mean.Sum,V.SD.Sum,V.Rat.Sum,A.Mean.Sum,A.SD.Sum,A.Rat.Sum,D.Mean.Sum,D.SD.Sum,D.Rat.Sum,...,A.Rat.L,A.Mean.H,A.SD.H,A.Rat.H,D.Mean.L,D.SD.L,D.Rat.L,D.Mean.H,D.SD.H,D.Rat.H
1,aardvark,6.26,2.21,19,2.41,1.4,22,4.27,1.75,15,...,11,2.55,1.29,11,4.12,1.64,8,4.43,1.99,7
2,abalone,5.3,1.59,20,2.65,1.9,20,4.95,1.79,22,...,12,2.38,1.92,8,5.55,2.21,11,4.36,1.03,11
3,abandon,2.84,1.54,19,3.73,2.43,22,3.32,2.5,22,...,11,3.82,2.14,11,2.77,2.09,13,4.11,2.93,9
4,abandonment,2.63,1.74,19,4.95,2.64,21,2.64,1.81,28,...,14,5.29,2.63,7,2.31,1.45,16,3.08,2.19,12
5,abbey,5.85,1.69,20,2.2,1.7,20,5.0,2.02,25,...,9,2.55,1.92,11,4.83,2.18,18,5.43,1.62,7


In [106]:
import nltk
from nltk.corpus import wordnet

In [107]:
list_wordvecs=[]
for i,word in enumerate(model.wv.index2word):
    list_wordvecs.append(word)

In [108]:
list3 = set(list_wordvecs) & set(df_vad["Word"])
list3

{'principal',
 'shitty',
 'society',
 'championship',
 'whip',
 'contractor',
 'rumor',
 'cancel',
 'loaf',
 'affect',
 'soda',
 'discover',
 'prominent',
 'quarantine',
 'recommend',
 'acid',
 'teeth',
 'adopt',
 'wonder',
 'rich',
 'cargo',
 'hustle',
 'speech',
 'recovery',
 'fuss',
 'bully',
 'corporal',
 'nickel',
 'flare',
 'delay',
 'deaf',
 'reckless',
 'nix',
 'counsel',
 'shower',
 'sphere',
 'outrageous',
 'giant',
 'delicious',
 'fairy',
 'joy',
 'soviet',
 'orange',
 'drum',
 'tooth',
 'perfect',
 'entertain',
 'pub',
 'drain',
 'imagine',
 'helmet',
 'science',
 'crooked',
 'lung',
 'counter',
 'virus',
 'step',
 'apology',
 'agency',
 'pervert',
 'confuse',
 'vow',
 'nun',
 'sleeper',
 'anchor',
 'payment',
 'fresh',
 'explosive',
 'tender',
 'poison',
 'attraction',
 'attract',
 'identify',
 'fuzzy',
 'scared',
 'martini',
 'liberty',
 'statement',
 'refill',
 'prove',
 'scumbag',
 'junior',
 'grid',
 'physical',
 'withdraw',
 'scotch',
 'robbery',
 'dinner',
 'sentimen

In [109]:
len(list3)

4032

In [110]:
len(model.wv.vocab)

8101

In [111]:
list_vad = set(df_vad['Word'])

In [115]:
word_vecs_vad = np.zeros((len(model.wv.vocab), 1027))
count_vad = 0
count_neutral = 0
for word in model.wv.index2word:
    # row is a view, so writing to it updates word_vecs_vad in place
    row = word_vecs_vad[vocab2int[word]]
    row[0:1024] = model.wv[word]
    if word in list_vad:
        # Rated word: append its mean valence, arousal, and dominance scores
        count_vad += 1
        ratings = df_vad.loc[df_vad["Word"] == word].iloc[0]
        row[1024] = ratings['V.Mean.Sum']
        row[1025] = ratings['A.Mean.Sum']
        row[1026] = ratings['D.Mean.Sum']
    else:
        # Unrated word: fall back to the neutral values V=5, A=1, D=5
        count_neutral += 1
        row[1024:1027] = [5, 1, 5]



In [116]:
print(count_vad)
print(count_neutral)

4032
4069


In [117]:
val =df_vad[df_vad['Word'] == "mailbox"]["V.Mean.Sum"]
val

7352    6.05
Name: V.Mean.Sum, dtype: float64

In [118]:
np.save('word_Vecs_VAD.npy',word_vecs_vad)
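The VAD step boils down to "append three numbers to each embedding, with a neutral fallback". The sketch below shows that idea with a dictionary lookup, which avoids scanning the ratings table once per word. The helper `append_vad` and the two-dimensional toy embedding are hypothetical. The ratings for "abandon" are the ones shown in the `df_vad.head(5)` output above.

```python
NEUTRAL_VAD = (5.0, 1.0, 5.0)  # same fallback values as the cell above


def append_vad(embedding, word, vad_lookup):
    """Return the embedding extended with (valence, arousal, dominance)."""
    return list(embedding) + list(vad_lookup.get(word, NEUTRAL_VAD))


# Ratings keyed by word; in the notebook these would be built from df_vad
vad_lookup = {"abandon": (2.84, 3.73, 3.32)}

rated = append_vad([0.1, 0.2], "abandon", vad_lookup)
unrated = append_vad([0.1, 0.2], "you", vad_lookup)
print(rated, unrated)
```

Words missing from the lexicon receive the neutral triple, so every row ends up with the same width (1024 + 3 in the notebook).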

## Word2Vec - counterfitting + affect

In [80]:
import gensim

# Load the pre-trained counter-fitted Word2Vec vectors with affect values appended.
model_counterfit_affect = gensim.models.KeyedVectors.load_word2vec_format('./w2v_counterfit_append_affect.bin', binary=True)

In [89]:
model_counterfit_affect["well"]

array([-0.068298,  0.01397 , -0.02441 ,  0.054046,  0.033629, -0.01007 ,
       -0.031107,  0.040267, -0.024809, -0.023129,  0.035075, -0.061981,
        0.094889,  0.038437, -0.018818,  0.039997,  0.077868,  0.027647,
       -0.0072  , -0.000698, -0.02708 ,  0.056055, -0.053605,  0.001178,
        0.081652,  0.042793, -0.091373,  0.02682 , -0.010707, -0.034362,
       -0.027781, -0.004621,  0.085808, -0.004652,  0.075245,  0.084456,
       -0.050868, -0.044668, -0.008077,  0.009224,  0.096167,  0.001957,
        0.03171 ,  0.026655, -0.043085,  0.038686,  0.014165,  0.06862 ,
        0.050164, -0.005307,  0.145519,  0.081177, -0.056621,  0.014355,
       -0.044101, -0.001805, -0.031732,  0.003166,  0.096613, -0.089282,
       -0.090452,  0.046647, -0.198193, -0.04675 , -0.009436,  0.075172,
       -0.052017, -0.028137, -0.078767,  0.027518,  0.006175, -0.017348,
        0.179998, -0.017671, -0.091829, -0.055601,  0.006504,  0.079339,
       -0.006983, -0.022379, -0.025881,  0.029811, 

In [90]:
len(model.wv.vocab)

8101

In [121]:
word_vecs_counterfit_affect = np.zeros((len(model.wv.vocab), 303))
list_word_not_found = []
for word in model.wv.index2word:
    # model_counterfit_affect is already a KeyedVectors object, so no .wv is needed
    if word in model_counterfit_affect.vocab:
        word_vecs_counterfit_affect[vocab2int[word]] = model_counterfit_affect[word]
    else:
        list_word_not_found.append(vocab2int[word])



In [123]:
len(list_word_not_found)

222

In [125]:
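# Average only the rows that were found; the missing rows are still all zeros at this point
word_unknown = np.delete(word_vecs_counterfit_affect, list_word_not_found, axis=0).mean(axis=0)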

In [126]:
word_unknown.shape

(303,)

In [127]:
for i in list_word_not_found:
    word_vecs_counterfit_affect[i] = word_unknown

In [128]:
word_vecs_counterfit_affect.shape

(8101, 303)

In [129]:
np.save('word_Vecs_counterfit_affect.npy',word_vecs_counterfit_affect)
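Words that the pre-trained vectors do not cover are given an average vector. One way to compute that fallback, so that the still-empty rows do not pull the average toward zero, is sketched below with plain Python lists. The helper name `fill_missing_with_mean` is hypothetical.

```python
def fill_missing_with_mean(rows, missing):
    """Replace the rows listed in `missing` with the mean of all other rows."""
    missing = set(missing)
    found = [row for i, row in enumerate(rows) if i not in missing]
    dim = len(rows[0])
    mean = [sum(row[d] for row in found) / len(found) for d in range(dim)]
    return [list(mean) if i in missing else row for i, row in enumerate(rows)]


rows = [[1.0, 3.0], [0.0, 0.0], [3.0, 5.0]]  # row 1 was never filled in
filled = fill_missing_with_mean(rows, missing=[1])
print(filled)
```

Averaging over all three rows would instead give about `[1.33, 2.67]` for the missing row, because the zero row would be counted as if it were a real embedding.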

## Word2Vec - retrofitting + affect

In [130]:
import gensim

# Load the pre-trained retrofitted Word2Vec vectors with affect values appended.
model_retrofit_affect = gensim.models.KeyedVectors.load_word2vec_format('./w2v_retrofit_append_affect.bin', binary=True)

In [132]:
model_retrofit_affect["well"]


array([-4.23350e-02, -3.88500e-03, -6.49750e-02,  3.56760e-02,
       -3.79590e-02, -2.26900e-03,  8.23700e-03,  4.43840e-02,
        2.83620e-02, -4.37600e-02,  5.67000e-03, -9.95770e-02,
        8.63900e-02,  4.60080e-02, -2.64400e-02,  4.40500e-03,
        1.02670e-02,  5.38950e-02, -4.03780e-02,  9.87700e-03,
       -1.51570e-02,  4.20510e-02, -3.27200e-02, -4.11550e-02,
        1.38780e-01,  2.17640e-02, -8.32230e-02, -5.89670e-02,
       -3.76710e-02, -2.49510e-02, -1.35740e-02,  4.56560e-02,
        6.87190e-02, -4.15760e-02,  1.90190e-02,  7.76660e-02,
       -3.91100e-02,  1.02170e-02, -2.43200e-03, -3.44060e-02,
        1.62309e-01,  9.66700e-03,  4.85220e-02,  1.48780e-02,
       -1.43820e-02,  6.88200e-03, -6.29780e-02,  5.76300e-03,
        5.62570e-02, -1.36200e-02,  1.05286e-01,  4.24430e-02,
       -2.99170e-02,  3.37360e-02,  1.19460e-02,  5.95990e-02,
       -5.28090e-02,  3.39970e-02,  1.04224e-01, -3.99220e-02,
       -8.26270e-02,  8.65720e-02, -1.31603e-01, -9.609

In [133]:
word_vecs_retrofit_affect = np.zeros((len(model.wv.vocab), 303))
list_word_not_found_retro = []
for word in model.wv.index2word:
    # model_retrofit_affect is already a KeyedVectors object, so no .wv is needed
    if word in model_retrofit_affect.vocab:
        word_vecs_retrofit_affect[vocab2int[word]] = model_retrofit_affect[word]
    else:
        list_word_not_found_retro.append(vocab2int[word])



In [134]:
len(list_word_not_found_retro)

222

In [135]:
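# Average only the rows that were found; the missing rows are still all zeros at this point
word_unknown_retro = np.delete(word_vecs_retrofit_affect, list_word_not_found_retro, axis=0).mean(axis=0)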

In [136]:
for i in list_word_not_found_retro:
    word_vecs_retrofit_affect[i] = word_unknown_retro

In [137]:
np.save('word_Vecs_retrofit_affect.npy',word_vecs_retrofit_affect)

## Additional Preprocessing

In [138]:
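#Add EOS tokens to target data now that the embeddings have been trained
for i in range(len(answers_int)):
    # answers_text[i] is a list of tokens, so append EOS as one token;
    # `+= " " + EOS` would extend the list character by character
    answers_text[i].append(EOS)
    answers_int[i].append(METATOKEN_INDEX)
    
    #answers_int[i].append(answers_vocab_to_int[EOS])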

In [139]:
# Sort questions and answers by the length of questions.
# This will reduce the amount of padding during training
# Which should speed up training and help to reduce the loss

max_source_line_length = max( [len(sentence) for sentence in questions_int])
max_targ_line_length = max([len(sentence) for sentence in answers_int])
max_line_length = max(max_source_line_length, max_targ_line_length)

sorted_questions = []
sorted_answers = []

for length in range(1, max_line_length+1):
    for index, sequence in enumerate(questions_int):
        if len(sequence) == length:
            sorted_questions.append(questions_int[index])
            sorted_answers.append(answers_int[index])

print(len(sorted_questions))
print(len(sorted_answers))
print()
indices = [0, 1, 2, len(sorted_questions) - 1]
for i in indices:
    print(int_to_text(sorted_questions[i], questions_int_to_vocab))
    print(int_to_text(sorted_answers[i], answers_int_to_vocab))
    print()

196991
196991

['cameron']
['the', 'thing', 'is', 'cameron', 'i', 'at', 'the', 'mercy', 'of', 'a', 'particularly', '<UNK>', 'breed', 'of', 'loser', 'my', 'sister', 'i', 'ca', 'not', 'date', 'until', 'she', 'does']

['why']
['<UNK>', 'mystery', 'she', 'used', 'to', 'be', 'really', 'popular', 'when', 'she', 'started', 'high', 'school', 'then', 'it', 'was', 'just', 'like', 'she', 'got', 'sick', 'of', 'it', 'or', 'something']

['there']
['where']

['yes', 'i', 'see', 'you', 'have', 'issued', 'each', 'of', 'them', 'with', 'a', 'martini', 'henry', '<UNK>', 'our', '<UNK>', 'for', 'native', '<UNK>', 'one', 'rifle', 'to', 'ten', 'men', 'and', 'only', 'five', 'rounds', 'per', 'rifle']
['but', 'will', 'they', 'make', 'good', 'use', 'of', 'them']



In [140]:
#FIXME: This really should be something like "preprocess_targets"
def process_decoding_input(target_data, batch_size):
    '''Remove the last word id from each sequence in the batch and prepend the start (<GO>) metatoken to each sequence'''
    ending = tf.strided_slice(target_data, [0, 0], [batch_size, -1], [1, 1])
    dec_input = tf.concat( [tf.fill([batch_size, 1], METATOKEN_INDEX), ending], 1)
    return dec_input
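The effect of `process_decoding_input` is easier to see on plain lists. The sketch below mirrors the slice-and-concatenate logic without TensorFlow. The name `shift_targets` is hypothetical, and 8101 stands in for `METATOKEN_INDEX`.

```python
GO_ID = 8101  # stands in for METATOKEN_INDEX


def shift_targets(target_batch, go_id):
    """Drop the last id of every row and prepend go_id."""
    return [[go_id] + row[:-1] for row in target_batch]


# Two padded target rows; 8101 doubles as the end-of-sequence and padding id
targets = [[12, 7, 8101], [3, 8101, 8101]]
dec_input = shift_targets(targets, GO_ID)
print(dec_input)
```

At each time step the decoder therefore sees the previous target token as input and is trained to predict the current one.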


In [141]:
def dropout_cell(rnn_size, keep_prob):
    lstm = tf.contrib.rnn.BasicLSTMCell(rnn_size)
    return tf.contrib.rnn.DropoutWrapper(lstm, input_keep_prob=keep_prob)

def multi_dropout_cell(rnn_size, keep_prob, num_layers):    
    return tf.contrib.rnn.MultiRNNCell( [dropout_cell(rnn_size, keep_prob) for _ in range(num_layers)] )

In [142]:
def encoding_layer(rnn_inputs, rnn_size, num_layers, keep_prob, sequence_lengths):
    """
    Create the encoding layer
    
    Returns a tuple `(outputs, output_states)` where
      outputs is a 2-tuple of tensors of shape [batch_size, max_time, rnn_size] for the forward and backward passes
      output_states is a 2-tuple of the final states of the forward and backward passes
    
    """
    forward_cell = multi_dropout_cell(rnn_size, keep_prob, num_layers)
    backward_cell = multi_dropout_cell(rnn_size, keep_prob, num_layers)
    outputs, states = tf.nn.bidirectional_dynamic_rnn(cell_fw=forward_cell,
                                                      cell_bw=backward_cell,
                                                      sequence_length=sequence_lengths,
                                                      inputs=rnn_inputs,
                                                      dtype=tf.float32)
    return outputs, states

## Decoding

In [143]:
def decoding_layer(enc_state, enc_outputs, dec_embed_input, dec_embeddings, #Inputs
                        attn_size, rnn_size, num_layers, output_layer, #Architecture
                        keep_prob, beam_width, #Hyperparameters
                        source_lengths, target_lengths, batch_size): 
   
    with tf.variable_scope("decoding", reuse=tf.AUTO_REUSE) as decoding_scope:
        dec_cell = multi_dropout_cell(rnn_size, keep_prob, num_layers)
        init_dec_state_size = batch_size
        #TRAINING
        train_attn = tf.contrib.seq2seq.BahdanauAttention(num_units=attn_size, memory=enc_outputs,
                                                         memory_sequence_length=source_lengths)
        
        train_cell = tf.contrib.seq2seq.AttentionWrapper(dec_cell, train_attn,
                                                    attention_layer_size=dec_cell.output_size)
        
        
        helper = tf.contrib.seq2seq.TrainingHelper(dec_embed_input, target_lengths, time_major=False)
        train_decoder = tf.contrib.seq2seq.BasicDecoder(train_cell, helper,
                            train_cell.zero_state(init_dec_state_size, tf.float32)
                                                        .clone(cell_state=enc_state),
                            output_layer = output_layer)
        outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(train_decoder, impute_finished=True, scope=decoding_scope)
        logits = outputs.rnn_output

        #INFERENCE
        #Tile inputs
        enc_state = tf.contrib.seq2seq.tile_batch(enc_state, beam_width)
        enc_outputs = tf.contrib.seq2seq.tile_batch(enc_outputs, beam_width)
        tiled_source_lengths = tf.contrib.seq2seq.tile_batch(source_lengths, beam_width)
        init_dec_state_size *= beam_width
        
        infer_attn = tf.contrib.seq2seq.BahdanauAttention(num_units=attn_size,memory=enc_outputs,
                                                          memory_sequence_length=tiled_source_lengths)
        infer_cell = tf.contrib.seq2seq.AttentionWrapper(dec_cell, infer_attn,
                                                    attention_layer_size=dec_cell.output_size)
        
        
        start_tokens = tf.tile( [METATOKEN_INDEX], [batch_size]) #Not by batch_size*beam_width, strangely
        end_token = METATOKEN_INDEX
        
        decoder = tf.contrib.seq2seq.BeamSearchDecoder(cell = infer_cell,
            embedding = dec_embeddings,
            start_tokens = start_tokens, 
            end_token = end_token,
            beam_width = beam_width,
            initial_state = infer_cell.zero_state(init_dec_state_size, tf.float32).clone(cell_state=enc_state),
            output_layer = output_layer
        )  
        final_decoder_output, _, _ = tf.contrib.seq2seq.dynamic_decode(decoder, scope=decoding_scope,
                                                                      maximum_iterations=50)
        
        ids = final_decoder_output.predicted_ids
        beams = ids
                
    return logits, beams

In [144]:
def seq2seq_model(wordVecs,input_data, target_data, keep_prob, batch_size,
                  source_lengths, target_lengths,
                  vocab_size, enc_embedding_size, dec_embedding_size,
                  attn_size, rnn_size, num_layers, beam_width):
    

    W = tf.Variable(wordVecs,trainable=False,name="W")
    enc_embed_input = tf.nn.embedding_lookup(W, input_data)
    enc_outputs, enc_states = encoding_layer(enc_embed_input, rnn_size, num_layers, keep_prob, source_lengths)    
    concatenated_enc_output = tf.concat(enc_outputs, -1)
    init_dec_state = enc_states[0]    
    
    
    dec_input = process_decoding_input(target_data, batch_size)
    dec_embeddings = W 
    dec_embed_input = tf.nn.embedding_lookup(dec_embeddings, dec_input)
    print(dec_embed_input.shape)
    output_layer = tf.layers.Dense(vocab_size,bias_initializer=tf.zeros_initializer(),activation=tf.nn.relu)
    logits, beams = decoding_layer(init_dec_state,
                            concatenated_enc_output,
                            dec_embed_input,
                            dec_embeddings,
                            attn_size,
                            rnn_size, 
                            num_layers,
                            output_layer,
                            keep_prob,
                            beam_width,
                            source_lengths,
                            target_lengths, 
                            batch_size
                            )
    
    
    return logits, beams

In [145]:
#Settings used by Asghar et al.
rnn_size = 1024
num_layers = 1
# Embedding width: 1024 for plain word2vec, 1027 for the Word2Affect (VAD) vectors,
# 303 for the retrofitted or counter-fitted vectors
embedding_size = 1027
encoding_embedding_size = 1027
decoding_embedding_size = 1027

flag_affect_functions = True # change this flag to false if affect functions are not used
attention_size = 256

#Training
epochs = 50
train_batch_size = 128
learning_rate = 0.001

learning_rate_decay = 0.9
min_learning_rate = 0.00001

keep_probability = 0.75
vocab_size = len(answers_vocab_to_int)
#Decoding
beam_width = 10

#Validation
valid_batch_size = 128

wordVecs = np.load('word_Vecs_VAD.npy').astype(np.float32)
metatoken_embedding = np.zeros((1, embedding_size), dtype=wordVecs.dtype)
wordVecsWithMeta = np.concatenate( (wordVecs, metatoken_embedding), axis=0 )
vocab_size_with_meta = wordVecsWithMeta.shape[0]

print("vocab_size_with_meta =", vocab_size_with_meta)
print("METATOKEN_INDEX =", METATOKEN_INDEX)
print("wordVecsWithMeta.shape =", wordVecsWithMeta.shape)
print("wordVecsWithMeta[METATOKEN_INDEX] =", wordVecsWithMeta[METATOKEN_INDEX])

#print(wordVecsWithMeta.dtype)

vocab_size_with_meta = 8102
METATOKEN_INDEX = 8101
wordVecsWithMeta.shape = (8102, 1027)
wordVecsWithMeta[METATOKEN_INDEX] = [0. 0. 0. ... 0. 0. 0.]


In [146]:
# Reset the graph to ensure that it is ready for training
tf.reset_default_graph()


#                                      batch_size, time
input_data = tf.placeholder(tf.int32, [None,       None], name='input')
targets = tf.placeholder(tf.int32,    [None,       None], name='targets')
lr = tf.placeholder(tf.float32, name='learning_rate')
keep_prob = tf.placeholder(tf.float32, name='keep_prob')



#                                          batch_size
source_lengths = tf.placeholder(tf.int32, [None], name="source_lengths")
target_lengths = tf.placeholder(tf.int32, [None], name="target_lengths")
batch_size = tf.shape(input_data)[0]



# Create the training and inference logits
train_logits, beams = \
seq2seq_model(wordVecsWithMeta,input_data, targets, keep_prob, batch_size,
        source_lengths, target_lengths, 
        vocab_size_with_meta, encoding_embedding_size, decoding_embedding_size, attention_size, rnn_size, num_layers,
        beam_width)

# Build the loss, the evaluation metric, and the training op
with tf.name_scope("optimization"):
    xent = loss_functions.cross_entropy(train_logits, targets, target_lengths)

    if flag_affect_functions:
        # Slice the three VAD dimensions out of the prompt and decoder-input embeddings
        Weight = tf.Variable(wordVecsWithMeta, trainable=False, name="Weight")
        embed_vec = tf.nn.embedding_lookup(Weight, input_data)
        input_embed = embed_vec[:, :, 1024:1027]

        dec_res = process_decoding_input(targets, batch_size)
        dec_embed = tf.nn.embedding_lookup(Weight, dec_res)
        target_embed = dec_embed[:, :, 1024:1027]
        print(target_embed.shape)

        # Parameter values of 0.5, 0.4, and 0.5 for the LDMIN, LDMAX, and LDAC functions
        # follow the hyperparameter tuning mentioned in the paper
        cost = loss_functions.min_affective_dissonance(0.5, train_logits, targets, target_lengths, input_embed, target_embed)
        #cost = loss_functions.max_affective_dissonance(0.4,train_logits,targets,target_lengths,input_embed,target_embed)
        #cost = loss_functions.max_affective_content(0.5,train_logits,targets,target_lengths,target_embed)
    else:
        # Without the affect functions, train on plain cross-entropy
        cost = xent
    
    eval_mask = tf.sequence_mask(target_lengths, dtype=tf.float32)
    perplexity = tf.contrib.seq2seq.sequence_loss(train_logits, targets, eval_mask,
                                                softmax_loss_function=metrics.perplexity)
    
    optimizer = tf.train.AdamOptimizer(lr)
    gradients = optimizer.compute_gradients(cost)
    capped_gradients = [(tf.clip_by_value(grad, -5., 5.), var) for grad, var in gradients if grad is not None]
    train_op = optimizer.apply_gradients(capped_gradients)


(?, ?, 1027)
(?, ?, 3)


In [147]:
def pad_sentence_batch(sentence_batch, vocab_to_int):
    """Pad sentences with the metatoken id so that each sentence of a batch has the same length.

    vocab_to_int is unused and kept only so that existing callers keep working."""
    pad_int = METATOKEN_INDEX
    max_sentence_length = max([len(sentence) for sentence in sentence_batch])
    return [sentence + [pad_int] * (max_sentence_length - len(sentence)) for sentence in sentence_batch]
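A small example makes the padding behaviour concrete. The helper below restates the same logic with the pad id passed in explicitly. The name `pad_batch` is hypothetical, and 8101 stands in for `METATOKEN_INDEX`.

```python
PAD_ID = 8101  # stands in for METATOKEN_INDEX


def pad_batch(batch, pad_id):
    """Right-pad every sentence to the length of the longest one in the batch."""
    longest = max(len(sentence) for sentence in batch)
    return [sentence + [pad_id] * (longest - len(sentence)) for sentence in batch]


batch = [[5, 9], [4], [1, 2, 3]]
padded = pad_batch(batch, PAD_ID)
# The true lengths are kept separately so the model can ignore the padding
lengths = [len(sentence) for sentence in batch]
print(padded, lengths)
```

The unpadded lengths are what `batch_data` yields as `source_lengths` and `target_lengths`.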

In [148]:
def batch_data(questions, answers, batch_size):
    """Batch questions and answers together"""
    for batch_i in range(0, len(questions)//batch_size):
        start_i = batch_i * batch_size
        questions_batch = questions[start_i:start_i + batch_size]
        answers_batch = answers[start_i:start_i + batch_size]
        
        source_lengths = np.array( [len(sentence) for sentence in questions_batch] )
        target_lengths = np.array( [len(sentence) for sentence in answers_batch])
        
        pad_questions_batch = np.array(pad_sentence_batch(questions_batch, questions_vocab_to_int))
        pad_answers_batch = np.array(pad_sentence_batch(answers_batch, answers_vocab_to_int))
        yield source_lengths, target_lengths, pad_questions_batch, pad_answers_batch

In [149]:
def parallel_shuffle(source_sequences, target_sequences):
    if len(source_sequences) != len(target_sequences):
        raise ValueError("Cannot shuffle parallel sets with different numbers of sequences")
    indices = np.random.permutation(len(source_sequences))
    shuffled_source = [source_sequences[indices[i]] for i in range(len(indices))]
    shuffled_target = [target_sequences[indices[i]] for i in range(len(indices))]
    
    return (shuffled_source, shuffled_target)
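The key property of a parallel shuffle is that each prompt stays paired with its answer. The sketch below checks that property on toy data. It is a standard-library variant rather than the author's function: it uses `random.Random` instead of `np.random.permutation`, and the `seed` parameter is an addition for reproducibility.

```python
import random


def parallel_shuffle(source, target, seed=None):
    """Shuffle two equally long lists with the same permutation."""
    if len(source) != len(target):
        raise ValueError("Cannot shuffle parallel sets with different numbers of sequences")
    indices = list(range(len(source)))
    random.Random(seed).shuffle(indices)
    return [source[i] for i in indices], [target[i] for i in indices]


questions = ["q0", "q1", "q2", "q3"]
answers = ["a0", "a1", "a2", "a3"]
shuffled_q, shuffled_a = parallel_shuffle(questions, answers, seed=1)

# Every question must still sit next to the answer with the same number
pairs_intact = all(q[1:] == a[1:] for q, a in zip(shuffled_q, shuffled_a))
print(pairs_intact)
```

The same check is worth applying wherever the two lists are sliced afterwards, since both slices must come from the shuffled lists to keep the pairs aligned.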

### Subroutines for Sampling Output

In [150]:
def show_response(question_int, beams, answer_int = None, best_only=True):
    pad_q = METATOKEN_INDEX
    print("Prompt")
    print("  Word Ids: {}".format([i for i in question_int if i != pad_q]))
    print("      Text: {}".format(int_to_text(question_int, questions_int_to_vocab)))
    
    pad_a = METATOKEN_INDEX
    if answer_int is not None:
        print("Actual Answer")
        print("  Word Ids: {}".format([i for i in answer_int if i != pad_a]))
        print("      Text: {}".format(int_to_text(answer_int, answers_int_to_vocab)))

    limit = 1 if best_only else beam_width
    for beam_index in range(limit):
        beam = beams[:, beam_index]
        print("\nBeam Answer", beam_index)
        print('  Word Ids: {}'.format([token for token in beam if token != pad_a]))
        print('      Text: {}'.format(int_to_text(beam, answers_int_to_vocab)))
        
def check_response(session, question_int, answer_int=None, best_only=True):
    """
    session - the TensorFlow session
    question_int - a list of integers
    answer_int - the actual, correct response as a list of integers (if available)
    best_only - if True, show only the top beam
    """
    
    two_d_question_int = [question_int]
    q_lengths = [len(question_int)]
    
    # input_data is an int32 placeholder, so feed integer ids
    [beam_output] = session.run([beams], feed_dict = {input_data: np.array(two_d_question_int, dtype=np.int32),
                                                      source_lengths: q_lengths,
                                                      keep_prob: 1})
    
    show_response(question_int, beam_output[0], answer_int, best_only=best_only)

In [151]:
def show_metrics_batch(xent, perp, bleu, wer, num_batches=None):
    """
    xent - cross-entropy error summed across batches (unless num_batches is None)
    perp - perplexity          summed across batches (unless num_batches is None)
    bleu - BLEU                summed across batches (unless num_batches is None)
    wer -  Word-Error Rate     summed across batches (unless num_batches is None)
    num_batches - the number of batches (used to average each of the metrics)
    """
    if num_batches:
        xent /= num_batches
        perp /= num_batches
        bleu /= num_batches
        wer /= num_batches

    print("Metrics averaged by sentence")
    print("\t  Cross-Entropy: {:>9.6f}".format(xent))
    print("\t     Perplexity: {}".format(perp))
    print("\t           BLEU: {}".format(bleu))
    print("\tWord-Error Rate: {}".format(wer))
    
def calc_metrics_beams(beams, prompt_int, answer_int ):
    print("Sample output")
    show_response(prompt_int, beams, answer_int, best_only=False)
    targ_text = [int_to_text(answer_int, answers_int_to_vocab)]
    pred_text = [int_to_text(beams[:, 0], answers_int_to_vocab)]
    sing_bleu = metrics.bleu(targ_text, pred_text)
    sing_wer = metrics.batch_word_error_rate(targ_text, pred_text)
    print("Metrics for best beam")
    print("\tBLEU: {}".format(sing_bleu))
    print("\t WER: {}".format(sing_wer))

In [152]:
# Validate the training with 15% of the data
train_valid_split = int(len(sorted_questions)*0.15)


# Split the questions and answers into training and validating data
(shuffled_questions, shuffled_answers) = parallel_shuffle(sorted_questions, sorted_answers)

train_questions = shuffled_questions[train_valid_split:]
train_answers = shuffled_answers[train_valid_split:]

valid_questions = shuffled_questions[:train_valid_split]
valid_answers = shuffled_answers[:train_valid_split]

print(len(train_questions))
print(len(valid_questions))

167443
29548
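
For the split to be meaningful, each question has to stay paired with its answer, which means both lists must be permuted with the same indices before they are sliced. The sketch below shows one way to do that with NumPy. `shuffle_together` and `split_train_valid` are hypothetical names used for illustration; they are not the notebook's own `parallel_shuffle`, whose definition may differ.

```python
import numpy as np


def shuffle_together(first, second, seed=None):
    """Shuffle two equal-length lists using one shared permutation."""
    assert len(first) == len(second)
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(first))
    return [first[i] for i in order], [second[i] for i in order]


def split_train_valid(questions, answers, valid_fraction=0.15):
    """Hold out the first valid_fraction of the already shuffled pairs."""
    cut = int(len(questions) * valid_fraction)
    train = (questions[cut:], answers[cut:])
    valid = (questions[:cut], answers[:cut])
    return train, valid


# Toy data: answer i is always question i plus 100, so alignment is easy to check
toy_questions = [[i] for i in range(20)]
toy_answers = [[i + 100] for i in range(20)]

q, a = shuffle_together(toy_questions, toy_answers, seed=1)
(train_q, train_a), (valid_q, valid_a) = split_train_valid(q, a)

# Every held-out answer still belongs to its question
print(all(ans[0] == qu[0] + 100 for qu, ans in zip(valid_q, valid_a)))  # True
```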


In [153]:
#TRAINING
display_step = 100 # Check training loss after every 100 batches
total_train_loss = 0 # Record the training loss for each display step

#VALIDATION
stop_early = 0 
stop = 5 # If the validation loss does not decrease in 5 consecutive checks, stop training
validation_check = ((len(train_questions))//train_batch_size//2)-1 #Check validation loss every half-epoch
summary_valid_loss = [] # Record the validation loss for saving improvements in the model

#Minimum number of epochs before we start checking sample output with beam search
min_epochs_before_validation = 0

checkpoint = "./checkpoints/best_model.ckpt" 

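# Name printed in the log messages. Note that the training loop compares the
# average validation cross-entropy, not the perplexity value itself.
early_stopping_metric = "perplexity"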

In [None]:
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print("Initialized model parameters")
    print("Using {} for early stopping after {} stalled steps".format(early_stopping_metric, stop))
    for epoch_i in range(1, epochs+1):        
        print("Shuffling training data . . .")
        (train_questions, train_answers) = parallel_shuffle(train_questions, train_answers)
                
        for batch_i, (q_lengths, a_lengths, questions_batch, answers_batch) in enumerate(
                batch_data(train_questions, train_answers, train_batch_size)):
            start_time = time.time()
            _, loss = sess.run([train_op, cost],
                {input_data: questions_batch, targets: answers_batch,
                 source_lengths: q_lengths, target_lengths: a_lengths,
                 lr: learning_rate, keep_prob: keep_probability})
            total_train_loss += loss
            batch_time = time.time() - start_time

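            # At batch 0 only one batch has been accumulated, so the printed average
            # understates the loss. "Seconds" is an estimate: last batch time * display_step.
            if batch_i % display_step == 0: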
                print('Epoch {:>3}/{} Batch {:>4}/{} - Loss: {:>9.6f}, Seconds: {:>4.2f}'
                      .format(epoch_i,
                              epochs, 
                              batch_i, 
                              len(train_questions) // train_batch_size, 
                              total_train_loss / display_step, 
                              batch_time*display_step),
                         flush=True)
                total_train_loss = 0

            if batch_i % validation_check == 0 and epoch_i >= min_epochs_before_validation:
                print("Shuffling validation data . . .")
                (valid_questions, valid_answers) = parallel_shuffle(valid_questions, valid_answers)
                total_xent = 0
                total_perp = 0
                total_bleu = 0
                total_wer = 0
                num_batches = 0
                
                start_time = time.time()        
                for batch_ii, (q_lengths, a_lengths, questions_batch, answers_batch) in \
                        enumerate(batch_data(valid_questions, valid_answers, valid_batch_size)):
                        
                    [valid_xent, valid_perp, beam_output] = sess.run([xent, perplexity, beams],
                        {input_data: questions_batch, targets: answers_batch,
                         source_lengths: q_lengths, target_lengths: a_lengths, keep_prob: 1})
                    total_xent += valid_xent
                    total_perp += valid_perp
                    #Text-based metrics
                    best_beams = beam_output[:, :, 0]
                    beam_text = [int_to_text(best_beams[i], answers_int_to_vocab)
                                     for i in range(len(best_beams))]
                    target_text = [int_to_text(answers_batch[i], answers_int_to_vocab)
                                       for i in range(len(answers_batch))]
                    total_bleu += metrics.bleu(target_text, beam_text)
                    total_wer  += metrics.batch_word_error_rate(target_text, beam_text)
                    num_batches += 1
                batch_time = time.time() - start_time
                
                print("Processed validation set in {:>4.2f} seconds".format(batch_time))
                show_metrics_batch(total_xent, total_perp, total_bleu, total_wer, num_batches)
                calc_metrics_beams(beam_output[-1, :, :], questions_batch[-1], answers_batch[-1])

                # Reduce learning rate, but not below its minimum value
                learning_rate *= learning_rate_decay
                if learning_rate < min_learning_rate:
                    learning_rate = min_learning_rate

                avg_valid_loss = total_xent / num_batches
                print(summary_valid_loss)
                if (len(summary_valid_loss) > 0) and (avg_valid_loss >= min(summary_valid_loss)):
                    print("No improvement for {}.".format(early_stopping_metric))
                    stop_early += 1

                else:
                    print("New record for {}!".format(early_stopping_metric)) 
                    stop_early = 0
                    saver = tf.train.Saver() 
                    saver.save(sess, checkpoint)
                summary_valid_loss.append(avg_valid_loss)

                if stop_early == stop:
                    break

        if stop_early == stop:
            print("Stopping training after {} stalled steps".format(stop))
            break

Initialized model parameters
Using perplexity for early stopping after 5 stalled steps
Shuffling training data . . .
Epoch   1/50 Batch    0/1308 - Loss:  0.005484, Seconds: 1372.15
Shuffling validation data . . .
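
The early-stopping rule in the loop above can be isolated from TensorFlow as a small sketch: keep the history of validation losses, reset a counter whenever a new minimum appears, and stop once the counter reaches the limit. `EarlyStopper` is a hypothetical helper, not part of the notebook, and the loss values in the usage lines are made up for illustration.

```python
class EarlyStopper:
    """Patience-based early stopping on a loss that should decrease."""

    def __init__(self, patience=5):
        self.patience = patience  # plays the role of `stop` in the notebook
        self.history = []         # plays the role of `summary_valid_loss`
        self.stalled = 0          # plays the role of `stop_early`

    def update(self, loss):
        """Record a loss; return True when it is a new best (time to checkpoint)."""
        # A tie with the previous best counts as no improvement, as in the loop above
        improved = not self.history or loss < min(self.history)
        self.stalled = 0 if improved else self.stalled + 1
        self.history.append(loss)
        return improved

    @property
    def should_stop(self):
        return self.stalled >= self.patience


stopper = EarlyStopper(patience=2)
for check, loss in enumerate([3.0, 2.5, 2.6, 2.5, 2.4]):
    if stopper.update(loss):
        print("check {}: new record {}".format(check, loss))
    if stopper.should_stop:
        print("check {}: stopping".format(check))
        break
```

With a patience of 2, the run stops at the fourth check: 2.6 and the repeated 2.5 both fail to beat the earlier best of 2.5, so the final value is never evaluated.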


In [39]:
def question_to_seq(question, vocab_to_int, int_to_vocab):
    '''Prepare the question for the model'''
    cleaned_question = Corpus.clean_sequence(question)
    return [vocab_to_int.get(word, vocab_to_int[UNK]) for word in cleaned_question]
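
The `.get` lookup with a fallback is what keeps out-of-vocabulary words from raising a `KeyError`: any word missing from the dictionary maps to the id of the unknown token. A minimal sketch with a toy vocabulary follows. The vocabulary, the ids, and the lowercase-and-split step standing in for `Corpus.clean_sequence` are all assumptions made for illustration.

```python
TOY_UNK = "UNK"
toy_vocab_to_int = {"hello": 0, "there": 1, TOY_UNK: 2}


def toy_question_to_seq(question, vocab_to_int):
    """Map each word to its id, falling back to the unknown token's id."""
    words = question.lower().split()  # stand-in for Corpus.clean_sequence
    return [vocab_to_int.get(word, vocab_to_int[TOY_UNK]) for word in words]


print(toy_question_to_seq("Hello there stranger", toy_vocab_to_int))  # [0, 1, 2]
```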


In [40]:
# Use a question from the data as your input
random = np.random.choice(len(sorted_questions))
question_int = sorted_questions[random]
answer_int = sorted_answers[random]

saver = tf.train.Saver()
with tf.Session() as sess:
    # Run the model with the input question
    saver.restore(sess, checkpoint)
    check_response(sess, question_int, answer_int, best_only=False)
    


INFO:tensorflow:Restoring parameters from ./checkpoints/best_model.ckpt
Prompt
  Word Ids: [278]
      Text: ['hello']
Actual Answer
  Word Ids: [5, 3004]
      Text: ['it', 'elvis']

Beam Answer 0
  Word Ids: []
      Text: []

Beam Answer 1
  Word Ids: [1]
      Text: ['i']

Beam Answer 2
  Word Ids: [1, 1]
      Text: ['i', 'i']

Beam Answer 3
  Word Ids: [1, 1, 1]
      Text: ['i', 'i', 'i']

Beam Answer 4
  Word Ids: [1, 1, 1, 1, 1, 1, 1, 6, 6, 6, 1, 6, 6, 6, 6]
      Text: ['i', 'i', 'i', 'i', 'i', 'i', 'i', 'a', 'a', 'a', 'i', 'a', 'a', 'a', 'a']

Beam Answer 5
  Word Ids: [1, 1, 1, 1, 1, 1, 1, 6, 6, 6, 6, 6, 6, 6, 6]
      Text: ['i', 'i', 'i', 'i', 'i', 'i', 'i', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a']

Beam Answer 6
  Word Ids: [1, 1, 1, 1, 1, 1, 1, 6, 6, 6, 6, 6, 1, 6, 6]
      Text: ['i', 'i', 'i', 'i', 'i', 'i', 'i', 'a', 'a', 'a', 'a', 'a', 'i', 'a', 'a']

Beam Answer 7
  Word Ids: [1, 1, 1, 1, 1, 1, 6, 6, 6, 1, 6, 6, 6, 6, 6]
      Text: ['i', 'i', 'i', 'i', 'i', 'i', 'a