# Chatbot using RNN
### Northwestern University - Fall 2017
### Student: Danilo Neves Ribeiro
### E-mail: daniloribeiro2021@u.northwestern.edu

# Introduction

The idea of the project was to train a simple chatbot using Recurrent Neural Networks. 

## Chatbots
There are many ways to build a chatbot. For example, many chatbots rely on predefined rules to answer questions. These can work well, but they require intense human effort to write as many rules as possible.

Machine learning greatly simplifies this task by enabling the chatbot to learn from a pre-existing conversation corpus. The two main types of ML chatbots are:

- Retrieval-based: answers questions by choosing one of the answers available in the dataset.
- Generative: generates the response word by word based on the query. The generated sentence is normally not included in the original dataset.
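
To make the retrieval-based idea concrete, here is a minimal sketch of a bot that returns the stored answer whose question shares the most words with the query. The toy corpus, the `jaccard` similarity and the function names are all assumptions made for illustration; they are not part of this project, which uses the generative approach.

```python
# Toy (question, answer) pairs; purely illustrative, not taken from the corpus.
CORPUS = [
    ("how are you", "i am fine"),
    ("what is your name", "my name is bot"),
    ("where do you live", "i live in a computer"),
]

def jaccard(a, b):
    """Word-overlap similarity between two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def retrieve_answer(query):
    """Return the stored answer whose question best matches the query."""
    query_tokens = query.lower().split()
    _, best_answer = max(
        CORPUS, key=lambda pair: jaccard(query_tokens, pair[0].split()))
    return best_answer

print(retrieve_answer("what is your name ?"))  # my name is bot
```

A bot like this can only ever reply with one of the stored answers, which is the limitation the generative approach tries to avoid.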

For this project, I decided to create a chatbot using the generative approach, which normally makes more mistakes (such as grammatical errors) but can respond to a broader set of questions and contexts.

## Dataset
The model was trained using the [Cornell Movie Dialog Corpus](http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html), which contains a collection of fictional conversations extracted from raw movie scripts.

## Implementation Architecture
Here I use an RNN to train on the dataset. More specifically, I use a seq2seq model with bucketing and an attention mechanism, described in more detail below:

### Seq2Seq:
Sequence to Sequence RNN models are composed of two main components: encoder and decoder. The encoder is responsible for reading the input, word by word, and generating a hidden state that "represents" the input. The decoder outputs words according to the hidden states generated by the encoder. The following image gives a general idea of this architecture:
<img src="seq2seq.png" alt="seq2seq" style="width: 700px;"/>
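
The encoder/decoder split can be sketched in a few lines of NumPy. This is only an illustration under simplifying assumptions: it uses a single-layer vanilla RNN with random, untrained weights, and every name in it (`rnn_step`, `encode`, `decode`, the sizes and token ids) is hypothetical. The actual model used in this project is the multi-layer `Seq2SeqModel` imported later.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE, HIDDEN_SIZE = 8, 4
GO_ID, EOS_ID = 1, 2  # assumed ids for the special symbols

# Randomly initialised (untrained) parameters, for illustration only.
embedding = rng.normal(size=(VOCAB_SIZE, HIDDEN_SIZE))
W_enc = rng.normal(size=(2 * HIDDEN_SIZE, HIDDEN_SIZE))
W_dec = rng.normal(size=(2 * HIDDEN_SIZE, HIDDEN_SIZE))
W_out = rng.normal(size=(HIDDEN_SIZE, VOCAB_SIZE))

def rnn_step(weights, token_id, hidden):
    """One vanilla RNN step: mix the token embedding with the previous state."""
    return np.tanh(np.concatenate([embedding[token_id], hidden]) @ weights)

def encode(input_ids):
    """Read the input token by token and return the final hidden state."""
    hidden = np.zeros(HIDDEN_SIZE)
    for token_id in input_ids:
        hidden = rnn_step(W_enc, token_id, hidden)
    return hidden

def decode(hidden, max_len=5):
    """Start from GO and greedily emit tokens until EOS or max_len."""
    output_ids, token_id = [], GO_ID
    for _ in range(max_len):
        hidden = rnn_step(W_dec, token_id, hidden)
        token_id = int(np.argmax(hidden @ W_out))
        if token_id == EOS_ID:
            break
        output_ids.append(token_id)
    return output_ids

state = encode([4, 5, 6])   # the vector that "represents" the input
reply_ids = decode(state)   # meaningless ids here, since nothing is trained
```

The only thing passed from encoder to decoder in this sketch is `state`, a single fixed-size vector.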

### Padding and Bucketing: 
One of the limitations of simple Seq2Seq architectures is that they have fixed-size inputs and outputs. Therefore we need to use padding and special symbols to deal with the fact that input and output sentences can have different lengths (the ones used here are: EOS = "End of sentence", PAD = "Filler", GO = "Start decoding", plus a special symbol for unknown words: UNK).

To efficiently handle sentences with different lengths, we use the bucketing method. Our model uses buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]. This means that if the input is a sentence with 3 tokens and the corresponding output is a sentence with 6 tokens, they will be put in the first bucket and padded to length 5 for encoder inputs and length 10 for decoder inputs.
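
The example above (3 input tokens, 6 output tokens) can be reproduced with a short sketch. The helper names `pick_bucket` and `pad_pair` are hypothetical, and the batching code inside the imported model may differ in details such as the order of the encoder tokens; this is only meant to show how a bucket is chosen and how the special symbols fill the remaining slots.

```python
BUCKETS = [(5, 10), (10, 15), (20, 25), (40, 50)]
PAD, GO, EOS = "_PAD", "_GO", "_EOS"

def pick_bucket(source, target):
    """Index of the smallest bucket that fits the pair, or None if none fits."""
    for bucket_id, (source_size, target_size) in enumerate(BUCKETS):
        # The decoder side also needs room for the GO and EOS symbols.
        if len(source) < source_size and len(target) + 1 < target_size:
            return bucket_id
    return None

def pad_pair(source, target):
    """Pad a (source, target) token pair to the sizes of its bucket."""
    bucket_id = pick_bucket(source, target)
    if bucket_id is None:
        raise ValueError("pair does not fit in any bucket")
    source_size, target_size = BUCKETS[bucket_id]
    encoder_input = source + [PAD] * (source_size - len(source))
    decoder_input = [GO] + target + [EOS]
    decoder_input += [PAD] * (target_size - len(decoder_input))
    return bucket_id, encoder_input, decoder_input

source = "how are you".split()             # 3 tokens
target = "i am fine , thank you".split()   # 6 tokens
bucket_id, encoder_input, decoder_input = pad_pair(source, target)
print(bucket_id, len(encoder_input), len(decoder_input))  # 0 5 10
```

Pairs that fit no bucket are rejected here; the notebook's `read_data` simply skips them.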

### Attention mechanism:
The attention mechanism tries to address the following limitations:
- The decoder is not aware of which parts of the encoding are relevant at each step of the generation.
- The encoder has limited memory and can't "remember" more than a single fixed size vector.

The attention model comes between the encoder and the decoder and helps the decoder pick only the encoded inputs that are important for each step of the decoding process:

<img src="attention.jpg" alt="Attention mechanism" style="width: 400px;"/>
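
As a rough illustration of the idea, the sketch below computes attention weights over a handful of encoder states and combines them into a context vector. It assumes simple dot-product scoring for brevity; the imported model may score the states differently, and the function names and toy vectors are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(decoder_state, encoder_states):
    """Weight each encoder state by its similarity to the decoder state."""
    scores = encoder_states @ decoder_state   # one score per input position
    weights = softmax(scores)                 # normalised so they sum to 1
    context = weights @ encoder_states        # weighted average of the states
    return weights, context

# Three encoder states (one per input token) of size 2, and one decoder state.
encoder_states = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
decoder_state = np.array([2.0, 0.0])
weights, context = attend(decoder_state, encoder_states)
```

Here the first and third inputs receive most of the weight because they align with the decoder state, so the decoder is no longer limited to a single fixed-size summary of the whole input.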

# Code

The code will be split between: 
- Preprocessing data (tokenizing, creating vocabulary, transforming input from words to word ids)
- Training
- Evaluation

##### Software requirements
- Python 3.6.2
- Numpy
- TensorFlow 

**_Note_**: this code is based on the code from the chatbot tutorial by [Suriyadeepan Ram](http://suriyadeepan.github.io/2016-06-28-easy-seq2seq/) and also uses the more general seq2seq model provided by the [Google Tensorflow tutorial on NMT](https://github.com/tensorflow/nmt), which is imported from a separate code file.

### Preprocessing

In [None]:
# IMPORTS

import os
import numpy as np
import re
import tensorflow as tf
from seq2seq_model import Seq2SeqModel

In [2]:
# GLOBAL VARIABLES AND PARAMS

# encoding and decoding paths
TRAIN_END_PATH = os.path.join('data', 'train.enc')
TRAIN_DEC_PATH = os.path.join('data', 'train.dec')
TEST_END_PATH = os.path.join('data', 'test.enc')
TEST_DEC_PATH = os.path.join('data', 'test.dec')

TRAIN_END_ID_PATH = os.path.join('data', 'train.enc.id')
TRAIN_DEC_ID_PATH = os.path.join('data', 'train.dec.id')
TEST_END_ID_PATH = os.path.join('data', 'test.enc.id')
TEST_DEC_ID_PATH = os.path.join('data', 'test.dec.id')

# vocabulary paths
VOCAB_ENC_PATH = os.path.join('data', 'vocab.enc')
VOCAB_DEC_PATH = os.path.join('data', 'vocab.dec')
MAX_VOCAB_SIZE = 20000

# data utils
SPLIT_REGEX = re.compile("([.,!?\"':;)(])")
PAD_TOKEN = "_PAD"
START_TOKEN = "_GO"
END_TOKEN = "_EOS"
UNKNOWEN_TOKEN = "_UNK"
INIT_VOCAB = [PAD_TOKEN, START_TOKEN, END_TOKEN, UNKNOWEN_TOKEN]

# args
BUCKETS = [(5, 10), (10, 15), (20, 25), (40, 50)]
LSTM_LAYES = 3
LAYER_SIZE = 256
BATCH_SIZE = 64
LEARNING_RATE = 0.5
LEARNING_RATE_DECAY_FACTOR = 0.99
MAX_GRADIENT_NORM = 5.0

# pre training
TRAINED_MODEL_PATH = 'pre_trained'
TRAINED_VOCAB_ENC = os.path.join('pre_trained', 'vocab.enc')
TRAINED_VOCAB_DEC = os.path.join('pre_trained', 'vocab.dec')

In [None]:
# SIMPLE TOKENIZER

def tokenize(sentence):
    # split on whitespace, then split punctuation into separate tokens
    tokens = []
    for token in sentence.strip().split():
        tokens.extend([x for x in re.split(SPLIT_REGEX, token) if x])
    return tokens
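
As a quick check of what the tokenizer produces, the self-contained snippet below repeats the regular expression and the function from the cells above and runs them on two example sentences (the sentences themselves are made up for illustration).

```python
import re

# Same pattern as SPLIT_REGEX above: punctuation marks become their own tokens.
SPLIT_REGEX = re.compile("([.,!?\"':;)(])")

def tokenize(sentence):
    tokens = []
    for token in sentence.strip().split():
        # re.split keeps the captured punctuation; drop the empty strings
        tokens.extend([x for x in re.split(SPLIT_REGEX, token) if x])
    return tokens

print(tokenize("Hello, how are you?"))
# ['Hello', ',', 'how', 'are', 'you', '?']
print(tokenize("Don't go."))
# ['Don', "'", 't', 'go', '.']
```

Note that the tokenizer does not lowercase its input, and that apostrophes split contractions into three tokens.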

In [None]:
# CREATING VOCABULARY

def create_vocab(data_path, vocab_path):
    # reuse the vocabulary file if it already exists
    if os.path.exists(vocab_path):
        print("file ", vocab_path, " already exists")
        with open(vocab_path, 'r') as vocab_file:
            vocab_list = [line.strip() for line in vocab_file]
    else:
        # count how often each token appears in the data
        counts = {}
        with open(data_path, 'r') as data_file:
            for line in data_file:
                for token in tokenize(line):
                    counts[token] = counts.get(token, 0) + 1
        # use the default tokens as initial vocabulary words
        vocab_list = INIT_VOCAB + sorted(counts, key=counts.get, reverse=True)
        # trim vocabulary
        vocab_list = vocab_list[:MAX_VOCAB_SIZE]
        print("final vocabulary size for ", data_path, " = ", len(vocab_list))
        # save to file
        with open(vocab_path, 'w') as vocab_file:
            for word in vocab_list:
                vocab_file.write(word + "\n")
    # map each word to its id (its position in the vocabulary list)
    return dict([(y, x) for (x, y) in enumerate(vocab_list)])

In [None]:
# TRANSFORM WORDS IN DATA TO IDS

def from_text_data_to_id_list(data_path, output_path, vocab):
    # only creates new file if file doesn't exist
    if os.path.exists(output_path):
        print("file ", output_path, " already exists")
    else:
        unknown_id = vocab.get(UNKNOWEN_TOKEN)
        with open(data_path, 'r') as data_file:
            with open(output_path, 'w') as output_file:
                for line in data_file:
                    tokens = tokenize(line)
                    # words missing from the vocabulary map to the UNK id
                    id_list = [str(vocab.get(word, unknown_id)) for word in tokens]
                    output_file.write(" ".join(id_list) + "\n")
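
The two preprocessing steps above (building a frequency-sorted vocabulary, then mapping words to ids) can be tried on a tiny in-memory corpus. The sketch below is a simplified stand-in that works on lists instead of files; `build_vocab`, `to_ids` and the toy corpus are hypothetical names and data used only for illustration.

```python
INIT_VOCAB = ["_PAD", "_GO", "_EOS", "_UNK"]

def build_vocab(token_lists, max_size):
    """Special tokens first, then words by decreasing frequency, trimmed to max_size."""
    counts = {}
    for tokens in token_lists:
        for token in tokens:
            counts[token] = counts.get(token, 0) + 1
    vocab_list = INIT_VOCAB + sorted(counts, key=counts.get, reverse=True)
    return {word: idx for idx, word in enumerate(vocab_list[:max_size])}

def to_ids(tokens, vocab):
    """Replace each token by its id, falling back to the _UNK id."""
    return [vocab.get(token, vocab["_UNK"]) for token in tokens]

corpus = [["hi", "there"], ["hi", "you"], ["hi", "there", "friend"]]
vocab = build_vocab(corpus, max_size=6)   # keeps only "hi" and "there"
print(to_ids(["hi", "there", "stranger"], vocab))  # [4, 5, 3]
```

Words that were trimmed from the vocabulary (such as "you" here) and words never seen in training both end up as the `_UNK` id.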

In [None]:
# DATA PREPROCESSING

def preprocess_data():
    encoding_vocab = create_vocab(TRAIN_END_PATH, VOCAB_ENC_PATH)
    decoding_vocab = create_vocab(TRAIN_DEC_PATH, VOCAB_DEC_PATH)
    from_text_data_to_id_list(TRAIN_END_PATH, TRAIN_END_ID_PATH, encoding_vocab)
    from_text_data_to_id_list(TRAIN_DEC_PATH, TRAIN_DEC_ID_PATH, decoding_vocab)
    from_text_data_to_id_list(TEST_END_PATH, TEST_END_ID_PATH, encoding_vocab)
    from_text_data_to_id_list(TEST_DEC_PATH, TEST_DEC_ID_PATH, decoding_vocab)
    print("Data preprocessing complete.")

preprocess_data()

### Training

In [None]:
def read_data(source_path, target_path):
    data_set = [[] for _ in BUCKETS]
    with tf.gfile.GFile(source_path, mode="r") as source_file:
        with tf.gfile.GFile(target_path, mode="r") as target_file:
            source, target = source_file.readline(), target_file.readline()
            while source and target:
                source_ids = [int(x) for x in source.split()]
                target_ids = [int(x) for x in target.split()]
                target_ids.append(INIT_VOCAB.index(END_TOKEN))
                for bucket_id, (source_size, target_size) in enumerate(BUCKETS):
                    if len(source_ids) < source_size and len(target_ids) < target_size:
                        data_set[bucket_id].append([source_ids, target_ids])
                        break
                source, target = source_file.readline(), target_file.readline()
    return data_set

In [3]:
# CREATE MODEL

def create_model(forward_only):
    # TODO: remove
    print(MAX_VOCAB_SIZE, "\n", MAX_VOCAB_SIZE, "\n", BUCKETS, "\n", LAYER_SIZE, "\n", LSTM_LAYES, "\n", MAX_GRADIENT_NORM, "\n", 
        BATCH_SIZE, "\n", LEARNING_RATE, "\n", LEARNING_RATE_DECAY_FACTOR, "\n", forward_only, "\n")
    return Seq2SeqModel(
        MAX_VOCAB_SIZE, MAX_VOCAB_SIZE, BUCKETS, LAYER_SIZE, LSTM_LAYES, MAX_GRADIENT_NORM, 
        BATCH_SIZE, LEARNING_RATE, LEARNING_RATE_DECAY_FACTOR, forward_only)

In [None]:
# TRAIN MODEL

import math
import sys
import time

# Assumed value: the checkpoint interval is not defined elsewhere in this notebook.
STEPS_PER_CHECKPOINT = 300

def train():
    # prepare dataset: id files produced by preprocess_data()
    enc_train, dec_train = TRAIN_END_ID_PATH, TRAIN_DEC_ID_PATH
    enc_dev, dec_dev = TEST_END_ID_PATH, TEST_DEC_ID_PATH

    # setup config to use BFC allocator
    config = tf.ConfigProto()
    config.gpu_options.allocator_type = 'BFC'
    with tf.Session(config=config) as sess:
        # Create model.
        model = create_model(forward_only = False)
        sess.run(tf.global_variables_initializer())

        # Read data into buckets and compute their sizes.
        # (dev_set is loaded but not yet used for evaluation.)
        dev_set = read_data(enc_dev, dec_dev)
        train_set = read_data(enc_train, dec_train)
        train_bucket_sizes = [len(train_set[b]) for b in range(len(BUCKETS))]
        train_total_size = float(sum(train_bucket_sizes))

        # A bucket scale is a list of increasing numbers from 0 to 1 that we'll use
        # to select a bucket. Length of [scale[i], scale[i+1]] is proportional to
        # the size of the i-th training bucket, as used later.
        train_buckets_scale = [sum(train_bucket_sizes[:i + 1]) / train_total_size
                               for i in range(len(train_bucket_sizes))]

        # This is the training loop.
        step_time, loss = 0.0, 0.0
        current_step = 0
        while True:
            # Choose a bucket according to data distribution. We pick a random number
            # in [0, 1) and use the corresponding interval in train_buckets_scale.
            random_number_01 = np.random.random_sample()
            bucket_id = min([i for i in range(len(train_buckets_scale))
                             if train_buckets_scale[i] > random_number_01])

            # Get a batch and make a step.
            start_time = time.time()
            encoder_inputs, decoder_inputs, target_weights = model.get_batch(
                train_set, bucket_id)
            _, step_loss, _ = model.step(sess, encoder_inputs, decoder_inputs,
                                         target_weights, bucket_id, False)
            step_time += (time.time() - start_time) / STEPS_PER_CHECKPOINT
            loss += step_loss / STEPS_PER_CHECKPOINT
            current_step += 1

            # Once in a while, we print statistics and save a checkpoint.
            if current_step % STEPS_PER_CHECKPOINT == 0:
                # Print statistics averaged over the steps since the last checkpoint.
                perplexity = math.exp(loss) if loss < 300 else float('inf')
                print("global step %d learning rate %.4f step-time %.2f perplexity "
                      "%.2f" % (model.global_step.eval(), model.learning_rate.eval(),
                                step_time, perplexity))
                # Save a checkpoint and reset the running averages.
                checkpoint_path = os.path.join(TRAINED_MODEL_PATH, 'seq2seq.ckpt')
                model.saver.save(sess, checkpoint_path, global_step=model.global_step)
                step_time, loss = 0.0, 0.0
                sys.stdout.flush()
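
The bucket-sampling trick used in the training loop is easier to follow on concrete numbers. The sketch below isolates it into two small functions; the function names and the bucket sizes are made up for illustration.

```python
import numpy as np

def bucket_scale(bucket_sizes):
    """Cumulative fraction of the training data up to and including each bucket."""
    total = float(sum(bucket_sizes))
    return [sum(bucket_sizes[:i + 1]) / total for i in range(len(bucket_sizes))]

def sample_bucket(scale, random_number):
    """First bucket whose cumulative fraction exceeds the random number."""
    return min(i for i in range(len(scale)) if scale[i] > random_number)

# Hypothetical bucket sizes: 10%, 30%, 40% and 20% of the data.
scale = bucket_scale([100, 300, 400, 200])
print(scale)  # [0.1, 0.4, 0.8, 1.0]

# A number in [0.4, 0.8) falls in the third interval, so bucket 2 is chosen.
print(sample_bucket(scale, 0.5))  # 2

# In the training loop the number comes from np.random.random_sample(),
# so each bucket is picked with probability proportional to its size.
bucket_id = sample_bucket(scale, np.random.random_sample())
```

Sampling buckets this way means larger buckets contribute proportionally more training steps.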

### Evaluation



In [4]:
# LOAD PRE-TRAINED MODEL

def load_vocabulary_list(vocabulary_path):
    with open(vocabulary_path, mode="r") as vocab_file:
        return [line.strip() for line in vocab_file.readlines()]

def load_pre_trained_model(session):
    print("Loading vocab...")
    enc_vocab_list = load_vocabulary_list(TRAINED_VOCAB_ENC)
    dec_vocab_list = load_vocabulary_list(TRAINED_VOCAB_DEC)
    enc_vocab = dict([(x, y) for (y, x) in enumerate(enc_vocab_list)])
    rev_dec_vocab = dict(enumerate(dec_vocab_list))
    
    print("Creating model...")
    model = create_model(forward_only = True)

    print("Loading saved model...")
    ckpt = tf.train.get_checkpoint_state(TRAINED_MODEL_PATH)
    model.saver.restore(session, ckpt.model_checkpoint_path)
    return (model, enc_vocab, rev_dec_vocab)

In [5]:
# DECODING

import sys

def decode():
    print("Start decoding...")
    eos_id = INIT_VOCAB.index(END_TOKEN)
    unk_id = INIT_VOCAB.index(UNKNOWEN_TOKEN)
    with tf.Session() as sess:
        model, enc_vocab, rev_dec_vocab = load_pre_trained_model(sess)
        model.batch_size = 1  # We decode one sentence at a time.

        # Decode from standard input.
        sys.stdout.write("> ")
        sys.stdout.flush()
        sentence = sys.stdin.readline()
        while sentence:
            # Get token-ids for the input sentence.
            token_ids = [enc_vocab.get(w, unk_id) for w in tokenize(sentence)]
            # Which bucket does it belong to? If the sentence is too long for
            # every bucket, fall back to the largest one and truncate the input.
            fitting = [b for b in range(len(BUCKETS))
                       if BUCKETS[b][0] > len(token_ids)]
            bucket_id = min(fitting) if fitting else len(BUCKETS) - 1
            token_ids = token_ids[:BUCKETS[bucket_id][0]]
            # Get a 1-element batch to feed the sentence to the model.
            encoder_inputs, decoder_inputs, target_weights = model.get_batch(
                {bucket_id: [(token_ids, [])]}, bucket_id)
            # Get output logits for the sentence.
            _, _, output_logits = model.step(sess, encoder_inputs, decoder_inputs,
                                             target_weights, bucket_id, True)
            # This is a greedy decoder - outputs are just argmaxes of output_logits.
            outputs = [int(np.argmax(logit, axis=1)) for logit in output_logits]
            # If there is an EOS symbol in outputs, cut them at that point.
            if eos_id in outputs:
                outputs = outputs[:outputs.index(eos_id)]
            # Print out the response corresponding to outputs.
            print(" ".join([tf.compat.as_str(rev_dec_vocab[output]) for output in outputs]))
            print("> ", end="")
            sys.stdout.flush()
            sentence = sys.stdin.readline()
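
The greedy decoding step at the end of `decode()` can be illustrated without TensorFlow. The sketch below takes one logit vector per output position, keeps the highest-scoring id at each position, and cuts the result at the first EOS. The vocabulary, the logits and the `greedy_decode` name are all invented for the example.

```python
import numpy as np

EOS_ID = 2  # position of "_EOS" in the initial vocabulary

def greedy_decode(output_logits, rev_vocab):
    """Pick the argmax token at each position and stop at the first EOS."""
    outputs = [int(np.argmax(logit)) for logit in output_logits]
    if EOS_ID in outputs:
        outputs = outputs[:outputs.index(EOS_ID)]
    return " ".join(rev_vocab[token_id] for token_id in outputs)

# Hypothetical reverse vocabulary (id -> word) and made-up logits.
rev_vocab = {0: "_PAD", 1: "_GO", 2: "_EOS", 3: "_UNK", 4: "hello", 5: "there"}
output_logits = [
    np.array([0.0, 0.0, 0.1, 0.0, 2.0, 0.3]),  # argmax is 4 ("hello")
    np.array([0.0, 0.0, 0.2, 0.0, 0.1, 1.5]),  # argmax is 5 ("there")
    np.array([0.0, 0.0, 3.0, 0.0, 0.1, 0.2]),  # argmax is 2 (EOS)
    np.array([0.0, 0.0, 0.0, 0.0, 5.0, 0.0]),  # after EOS, so ignored
]
print(greedy_decode(output_logits, rev_vocab))  # hello there
```

Greedy decoding never revisits an earlier choice; alternatives such as beam search keep several candidate sentences, but they are outside the scope of this notebook.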

# Demo

In [6]:
decode()

Start decoding...
Loading vocab...
Creating model...
20000 
 20000 
 [(5, 10), (10, 15), (20, 25), (40, 50)] 
 256 
 3 
 5.0 
 64 
 0.5 
 0.99 
 True 

Loading saved model...
INFO:tensorflow:Restoring parameters from pre_trained\seq2seq.ckpt-21300


InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [1536,256] rhs shape= [768,256]
	 [[Node: save/Assign_7 = Assign[T=DT_FLOAT, _class=["loc:@embedding_attention_seq2seq/embedding_attention_decoder/attention_decoder/Attention_0/kernel"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](embedding_attention_seq2seq/embedding_attention_decoder/attention_decoder/Attention_0/kernel, save/RestoreV2_7)]]

Caused by op 'save/Assign_7', defined at:
  File "C:\Users\dnr2\Anaconda3\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\dnr2\Anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\dnr2\Anaconda3\lib\site-packages\ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "C:\Users\dnr2\Anaconda3\lib\site-packages\traitlets\config\application.py", line 658, in launch_instance
    app.start()
  File "C:\Users\dnr2\Anaconda3\lib\site-packages\ipykernel\kernelapp.py", line 477, in start
    ioloop.IOLoop.instance().start()
  File "C:\Users\dnr2\Anaconda3\lib\site-packages\zmq\eventloop\ioloop.py", line 177, in start
    super(ZMQIOLoop, self).start()
  File "C:\Users\dnr2\Anaconda3\lib\site-packages\tornado\ioloop.py", line 888, in start
    handler_func(fd_obj, events)
  File "C:\Users\dnr2\Anaconda3\lib\site-packages\tornado\stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "C:\Users\dnr2\Anaconda3\lib\site-packages\zmq\eventloop\zmqstream.py", line 440, in _handle_events
    self._handle_recv()
  File "C:\Users\dnr2\Anaconda3\lib\site-packages\zmq\eventloop\zmqstream.py", line 472, in _handle_recv
    self._run_callback(callback, msg)
  File "C:\Users\dnr2\Anaconda3\lib\site-packages\zmq\eventloop\zmqstream.py", line 414, in _run_callback
    callback(*args, **kwargs)
  File "C:\Users\dnr2\Anaconda3\lib\site-packages\tornado\stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "C:\Users\dnr2\Anaconda3\lib\site-packages\ipykernel\kernelbase.py", line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "C:\Users\dnr2\Anaconda3\lib\site-packages\ipykernel\kernelbase.py", line 235, in dispatch_shell
    handler(stream, idents, msg)
  File "C:\Users\dnr2\Anaconda3\lib\site-packages\ipykernel\kernelbase.py", line 399, in execute_request
    user_expressions, allow_stdin)
  File "C:\Users\dnr2\Anaconda3\lib\site-packages\ipykernel\ipkernel.py", line 196, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "C:\Users\dnr2\Anaconda3\lib\site-packages\ipykernel\zmqshell.py", line 533, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "C:\Users\dnr2\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2698, in run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "C:\Users\dnr2\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2808, in run_ast_nodes
    if self.run_code(code, result):
  File "C:\Users\dnr2\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2862, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-6-e2fbaaba3ddd>", line 1, in <module>
    decode()
  File "<ipython-input-5-61b61ffcc40a>", line 6, in decode
    model, enc_vocab, rev_dec_vocab = load_pre_trained_model(sess)
  File "<ipython-input-4-b1c59e9d1fc6>", line 15, in load_pre_trained_model
    model = create_model(forward_only = True)
  File "<ipython-input-3-524b219ee88c>", line 9, in create_model
    BATCH_SIZE, LEARNING_RATE, LEARNING_RATE_DECAY_FACTOR, forward_only)
  File "C:\Users\dnr2\Documents\Northwestern\Fall 2017\EECS 495 - Deep learning foundations from scratch\Project\chatbot\seq2seq_model.py", line 175, in __init__
    self.saver = tf.train.Saver(tf.global_variables())
  File "C:\Users\dnr2\Anaconda3\lib\site-packages\tensorflow\python\training\saver.py", line 1218, in __init__
    self.build()
  File "C:\Users\dnr2\Anaconda3\lib\site-packages\tensorflow\python\training\saver.py", line 1227, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "C:\Users\dnr2\Anaconda3\lib\site-packages\tensorflow\python\training\saver.py", line 1263, in _build
    build_save=build_save, build_restore=build_restore)
  File "C:\Users\dnr2\Anaconda3\lib\site-packages\tensorflow\python\training\saver.py", line 751, in _build_internal
    restore_sequentially, reshape)
  File "C:\Users\dnr2\Anaconda3\lib\site-packages\tensorflow\python\training\saver.py", line 439, in _AddRestoreOps
    assign_ops.append(saveable.restore(tensors, shapes))
  File "C:\Users\dnr2\Anaconda3\lib\site-packages\tensorflow\python\training\saver.py", line 160, in restore
    self.op.get_shape().is_fully_defined())
  File "C:\Users\dnr2\Anaconda3\lib\site-packages\tensorflow\python\ops\state_ops.py", line 276, in assign
    validate_shape=validate_shape)
  File "C:\Users\dnr2\Anaconda3\lib\site-packages\tensorflow\python\ops\gen_state_ops.py", line 56, in assign
    use_locking=use_locking, name=name)
  File "C:\Users\dnr2\Anaconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "C:\Users\dnr2\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 2956, in create_op
    op_def=op_def)
  File "C:\Users\dnr2\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [1536,256] rhs shape= [768,256]
	 [[Node: save/Assign_7 = Assign[T=DT_FLOAT, _class=["loc:@embedding_attention_seq2seq/embedding_attention_decoder/attention_decoder/Attention_0/kernel"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](embedding_attention_seq2seq/embedding_attention_decoder/attention_decoder/Attention_0/kernel, save/RestoreV2_7)]]


# Conclusion


## Project Challenges

Here I present some of the challenges I faced when trying to train the model for this project:

- Initially I tried to use the [Ubuntu Dialogue Corpus](http://dataset.cs.mcgill.ca/ubuntu-corpus-1.0/). The dataset proved to be very large (a few GB), and it took several hours just to preprocess the data. I decided that this corpus would be too complex to train on and switched to the Cornell Movie Dialog Corpus.
- The Cornell Movie Dialog Corpus is a smaller and more manageable dataset, but training the model still took a long time (almost two full days).