# GENERATIVE CHATBOT

# 1. INTRODUCTION

In this exercise we will go step by step through the process of building a **generative chatbot**: preparing the dataset we want to work with, creating a model, and training it to obtain a dialog system able to answer users' questions.

As you know from the slides, a generative chatbot generates an answer from scratch. It receives a question (the user input) and uses it to generate an answer based on the training performed.

## 1.1. Dataset
We are going to work again with the Ubuntu Corpus (https://arxiv.org/pdf/1506.08909) to create a generative chatbot capable of answering technical support questions about the well-known Ubuntu OS. Other options are included in the project, such as the Cornell Movie Corpus (movie dialogs) and a Custom Mode that allows you to use any dialog-structured corpus. Feel free to visit the project's GitHub (https://github.com/Conchylicultor/DeepQA) to learn more about these datasets and experiment with them.



## 1.2. Model
The architecture of the neural network is called Sequence to Sequence or Encoder-Decoder. It receives a sentence and encodes it to get an embedded representation of it (the same way as was done with the retrieval chatbot). The difference is that this representation is now fed to a **decoder**, which reverses the process: it transforms the embedded representation into a sequence of words (the answer to the question). The original model can be found here: https://arxiv.org/pdf/1506.05869.pdf
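To make the decoding step concrete, here is a minimal sketch of greedy decoding, where each generated word is fed back as the next input until an end token appears. The score table is a hypothetical stand-in for a trained decoder (a real one would also depend on the encoder's representation), and the token names are assumptions made for illustration only.

```python
GO, EOS = "<go>", "<eos>"

# Hypothetical stand-in for a trained decoder: scores for the next word
# given the previous one. A real decoder would also use the encoder state.
NEXT_WORD_SCORES = {
    GO: {"try": 0.7, "sudo": 0.3},
    "try": {"sudo": 0.6, EOS: 0.4},
    "sudo": {"apt": 0.9, EOS: 0.1},
    "apt": {"update": 0.8, EOS: 0.2},
    "update": {EOS: 1.0},
}


def greedy_decode(max_length):
    """Emit the best-scoring word at each step and feed it back as the next input."""
    previous, answer = GO, []
    for _ in range(max_length):
        scores = NEXT_WORD_SCORES[previous]
        previous = max(scores, key=scores.get)  # pick the most likely next word
        if previous == EOS:
            break  # the decoder signalled the end of the answer
        answer.append(previous)
    return answer


print(" ".join(greedy_decode(max_length=10)))  # try sudo apt update
```

The *max_length* argument plays the same role as the maximum sentence length of the model: decoding stops there even if no end token has been produced.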

There is a maximum sentence length: longer sentences are discarded and shorter ones are padded up to it. The maximum length shouldn't be too large, as LSTM recurrent neural networks struggle to remember the first steps of a long sequence. In this project the limit is set by the *maxLength* parameter (12 words per sentence in the configuration used below).
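The discard-and-pad rule can be sketched in a few lines. This is only an illustration: the padding token name and the choice of padding on the right are assumptions, and the project's own preprocessing may pad differently.

```python
PAD = "<pad>"  # hypothetical padding token, for illustration only


def filter_and_pad(sentences, max_length):
    """Drop sentences longer than max_length and right-pad the rest."""
    kept = []
    for words in sentences:
        if len(words) > max_length:
            continue  # too long: discarded
        kept.append(words + [PAD] * (max_length - len(words)))
    return kept


samples = [
    "how do i install pip".split(),
    "my wifi stopped working after the last kernel update today".split(),
]
padded = filter_and_pad(samples, max_length=6)
print(padded)  # [['how', 'do', 'i', 'install', 'pip', '<pad>']]
```

The second sample has ten words, so with a limit of six it is removed entirely rather than truncated.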

<img src='basic_seq2seq.png'>

*From the Seq2Seq Tensorflow tutorial (https://www.tensorflow.org/images/basic_seq2seq.png)*

In [None]:
import tensorflow as tf
print(tf.__version__)

# 2. Requirements

First of all, we need to install the libraries required to complete this project. The most important are:

* Python >= 3
* TensorFlow >= 1.0 (a 1.x release: the code relies on *tf.contrib*, which is not available in TensorFlow 2.x)

Once installed, import them into the project and we are ready to start.

In [None]:
import argparse  # Command line parsing
import configparser  # Saving the models parameters
import datetime  # Chronometer
import os  # Files management
import tensorflow as tf
import numpy as np
import math

from tqdm import tqdm  # Progress bar
from tensorflow.python import debug as tf_debug

from chatbot.textdata import TextData

# 3. Setting everything up

For the Tensorflow graph, we have to define a series of variables that will be needed once we create the model. 

In [None]:
args = None

# Task specific object
textData = None  # Dataset
model = None  # Sequence to sequence model

# Tensorflow utilities for convenience saving/logging
writer = None
saver = None 
modelDir = ''  # Where the model is saved
globStep = 0  # Represent the number of iteration for the current model

# TensorFlow main session (we keep track for the daemon)
sess = None

# Filename and directories constants
MODEL_DIR_BASE = 'save/model'
MODEL_NAME_BASE = 'model'
MODEL_EXT = '.ckpt'
CONFIG_FILENAME = 'params.ini'
CONFIG_VERSION = '0.5'
TEST_IN_NAME = 'data/test/samples.txt'
TEST_OUT_SUFFIX = '_predictions.txt'
SENTENCES_PREFIX = ['Q: ', 'A: ']

There are different types of testing, but in this demo we are going to use the interactive one.

In [None]:
class TestMode:
    """ Simple structure representing the different testing modes
    """
    ALL = 'all'
    INTERACTIVE = 'interactive'  # The user can write his own questions
    DAEMON = 'daemon'  # The chatbot runs on background and can regularly be called to predict something

These are all the parameters that are going to be used to define our data preprocessing, model, training and testing steps. 

In [None]:
args = {
    'test': None, # Options: None, TestMode.ALL, TestMode.INTERACTIVE, TestMode.DAEMON
    'rootDir': '/notebooks', # Path where the project is loaded
    'createDataset': False, # Just preprocess the dataset (no training or testing)
    'reset': False, 
    'verbose': False,
    'debug': False,
    'keepAll': False,
    'modelTag': 'generative', # Identifier for the model we are going to train/test
    'watsonMode': False,
    'autoEncode': False,
    'playDataset': False,
    'device': None,
    'seed': 21111993,
    'corpus': TextData.corpusChoices()[3], # Cornell, Ubuntu, Custom...
    'datasetTag': 'course', # Identifier for the dataset we are going to use (None if not wanted)
    'ratioDataset': 1.0, # Ratio of the dataset we want to use
    'maxLength': 12, # Remove all sentences of the dataset with more than maxLength words
    'filterVocab': 50, # Remove all words in the dataset that appear fewer times than filterVocab
    'skipLines': False,
    'vocabularySize': 40000, # Number of words in the vocabulary
    'hiddenSize': 256, # Size of the hidden state of the LSTM cell
    'numLayers': 2, # Number of layers of LSTM
    'softmaxSamples': 2048,
    'initEmbeddings': False, 
    'embeddingSize': 64, # Dimensions of the word embeddings
    'embeddingSource': "GoogleNews-vectors-negative300.bin", # Pretrained word embeddings
    'numEpochs': 30, # Number of epochs
    'saveEvery': 500, # Save every N steps
    'batchSize': 128, # Number of sentences to encode in each batch
    'learningRate': 0.002, 
    'dropout': 0.9 # Keep probability of the LSTM outputs (passed as output_keep_prob), not the drop rate
}

The following methods are used to save and reload the parameters. This will come in handy to reload the model after training.
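The save/reload idea can be tried in isolation with *configparser* alone. The sketch below writes a few values to an in-memory buffer instead of a real file; the parameter values are hypothetical and only illustrate the string round trip.

```python
import configparser
import io

# Hypothetical values, for illustration only
params = {"globStep": 1500, "hiddenSize": 256, "autoEncode": False}

# Saving: configparser only stores strings, hence the str() conversions
config = configparser.ConfigParser()
config["General"] = {key: str(value) for key, value in params.items()}
buffer = io.StringIO()  # stands in for the params.ini file
config.write(buffer)

# Loading: typed getters turn the strings back into Python values
restored = configparser.ConfigParser()
restored.read_string(buffer.getvalue())
glob_step = restored["General"].getint("globStep")
auto_encode = restored["General"].getboolean("autoEncode")
print(glob_step, auto_encode)  # 1500 False
```

This is why every value is wrapped in *str()* when saving and read back with *getint()* or *getboolean()* when loading.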

In [None]:
def loadModelParams():
    """ Load some values associated with the current model, like the current globStep value
    For now, this function does not need to be called before loading the model (no parameters restored). However,
    the modelDir name will be initialized here so it is required to call this function before managePreviousModel(),
    _getModelName() or _getSummaryName()
    Warning: if you modify this function, make sure the changes mirror saveModelParams, also check if the parameters
    should be reset in managePreviousModel
    """
    global args
    global globStep
    global modelDir
    global MODEL_DIR_BASE
    global CONFIG_FILENAME

    # Compute the current model path
    modelDir = os.path.join(args['rootDir'], MODEL_DIR_BASE)

    if args['modelTag']:
        modelDir += '-' + args['modelTag']

    # If there is a previous model, restore some parameters
    configName = os.path.join(modelDir, CONFIG_FILENAME)
    if not args['reset'] and not args['createDataset'] and os.path.exists(configName):
        # Loading
        config = configparser.ConfigParser()
        config.read(configName)

        # Check the version
        currentVersion = config['General'].get('version')
        if currentVersion != CONFIG_VERSION:
            raise UserWarning('Present configuration version {0} does not match {1}. You can try manual changes on \'{2}\''.format(currentVersion, CONFIG_VERSION, configName))

        # Restoring the parameters

        globStep = config['General'].getint('globStep')
        args['watsonMode'] = config['General'].getboolean('watsonMode')
        args['autoEncode'] = config['General'].getboolean('autoEncode')
        args['corpus'] = config['General'].get('corpus')

        #args['datasetTag'] = config['Dataset'].get('datasetTag')
        args['maxLength'] = config['Dataset'].getint('maxLength')  # We need to restore the model length because of the textData associated and the vocabulary size (TODO: Compatibility mode between different maxLength)
        args['filterVocab'] = config['Dataset'].getint('filterVocab')

        args['hiddenSize'] = config['Network'].getint('hiddenSize')
        args['numLayers'] = config['Network'].getint('numLayers')
        args['softmaxSamples'] = config['Network'].getint('softmaxSamples')
        args['initEmbeddings'] = config['Network'].getboolean('initEmbeddings')
        args['embeddingSize'] = config['Network'].getint('embeddingSize')
        args['embeddingSource'] = config['Network'].get('embeddingSource')


        # No restoring for training params, batch size or other non model dependent parameters

        # Show the restored params
        print()
        print('Warning: Restoring parameters:')
        print('globStep: {}'.format(globStep))
        print('watsonMode: {}'.format(args['watsonMode']))
        print('autoEncode: {}'.format(args['autoEncode']))
        print('corpus: {}'.format(args['corpus']))
        print('datasetTag: {}'.format(args['datasetTag']))
        print('maxLength: {}'.format(args['maxLength']))
        print('filterVocab: {}'.format(args['filterVocab']))
        print('hiddenSize: {}'.format(args['hiddenSize']))
        print('numLayers: {}'.format(args['numLayers']))
        print('softmaxSamples: {}'.format(args['softmaxSamples']))
        print('initEmbeddings: {}'.format(args['initEmbeddings']))
        print('embeddingSize: {}'.format(args['embeddingSize']))
        print('embeddingSource: {}'.format(args['embeddingSource']))
        print()

    # For now, not arbitrary  independent maxLength between encoder and decoder
    args['maxLengthEnco'] = args['maxLength']
    args['maxLengthDeco'] = args['maxLength'] + 2
    
def saveModelParams():
    """ Save the params of the model, like the current globStep value
    Warning: if you modify this function, make sure the changes mirror loadModelParams
    """
    global args
    global globStep
    global modelDir
    global CONFIG_FILENAME
    global CONFIG_VERSION
    config = configparser.ConfigParser()
    config['General'] = {}
    config['General']['version']  = CONFIG_VERSION
    config['General']['globStep']  = str(globStep)
    config['General']['watsonMode'] = str(args['watsonMode'])
    config['General']['autoEncode'] = str(args['autoEncode'])
    config['General']['corpus'] = str(args['corpus'])

    config['Dataset'] = {}
    #config['Dataset']['datasetTag'] = str(args['datasetTag'])
    config['Dataset']['maxLength'] = str(args['maxLength'])
    config['Dataset']['filterVocab'] = str(args['filterVocab'])
    config['Dataset']['skipLines'] = str(args['skipLines'])
    config['Dataset']['vocabularySize'] = str(args['vocabularySize'])

    config['Network'] = {}
    config['Network']['hiddenSize'] = str(args['hiddenSize'])
    config['Network']['numLayers'] = str(args['numLayers'])
    config['Network']['softmaxSamples'] = str(args['softmaxSamples'])
    config['Network']['initEmbeddings'] = str(args['initEmbeddings'])
    config['Network']['embeddingSize'] = str(args['embeddingSize'])
    config['Network']['embeddingSource'] = str(args['embeddingSource'])

    # Keep track of the learning params (but without restoring them)
    config['Training (won\'t be restored)'] = {}
    config['Training (won\'t be restored)']['learningRate'] = str(args['learningRate'])
    config['Training (won\'t be restored)']['batchSize'] = str(args['batchSize'])
    config['Training (won\'t be restored)']['dropout'] = str(args['dropout'])

    with open(os.path.join(modelDir, CONFIG_FILENAME), 'w') as configFile:
        config.write(configFile)

def managePreviousModel(sess):
    """ Restore or reset the model, depending of the parameters
    If the destination directory already contains some file, it will handle the conflict as following:
     * If --reset is set, all present files will be removed (warning: no confirmation is asked) and the training
     restart from scratch (globStep & cie reinitialized)
     * Otherwise, it will depend of the directory content. If the directory contains:
       * No model files (only summary logs): works as a reset (restart from scratch)
       * Other model files, but modelName not found (surely keepAll option changed): raise error, the user should
       decide by himself what to do
       * The right model file (eventually some other): no problem, simply resume the training
    In any case, the directory will exist as it has been created by the summary writer
    Args:
        sess: The current running session
    """
    global args
    global saver
    global modelDir
    print('WARNING: ', end='')

    modelName = _getModelName()

    if os.listdir(modelDir):
        if args['reset']:
            print('Reset: Destroying previous model at {}'.format(modelDir))
        # Analysing directory content
        elif os.path.exists(modelName):  # Restore the model
            print('Restoring previous model from {}'.format(modelName))
            saver.restore(sess, modelName)  # Will crash when --reset is not activated and the model has not been saved yet
        elif _getModelList():
            print('Conflict with previous models.')
            raise RuntimeError('Some models are already present in \'{}\'. You should check them first (or re-try with the keepAll flag)'.format(modelDir))
        else:  # No other model to conflict with (probably summary files)
            print('No previous model found, but some files found at {}. Cleaning...'.format(modelDir))  # Warning: No confirmation asked
            args['reset'] = True

        if args['reset']:
            fileList = [os.path.join(modelDir, f) for f in os.listdir(modelDir)]
            for f in fileList:
                print('Removing {}'.format(f))
                os.remove(f)

    else:
        print('No previous model found, starting from clean directory: {}'.format(modelDir))
        

After that, a few more helper functions.
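One of these helpers builds the path of the checkpoint file. The standalone sketch below mirrors that logic with plain arguments instead of the notebook's globals; the function name *model_path* is hypothetical, and *posixpath* is used only so that the printed result is the same on every operating system.

```python
import posixpath  # POSIX-style joins so the result is the same on every OS


def model_path(root_dir, model_tag, glob_step, keep_all):
    """Sketch of how the checkpoint path is assembled from the arguments."""
    model_dir = posixpath.join(root_dir, "save/model")
    if model_tag:
        model_dir += "-" + model_tag
    name = posixpath.join(model_dir, "model")
    if keep_all:
        name += "-" + str(glob_step)  # one file per checkpoint
    return name + ".ckpt"


print(model_path("/notebooks", "generative", 500, keep_all=False))
# /notebooks/save/model-generative/model.ckpt
print(model_path("/notebooks", "generative", 500, keep_all=True))
# /notebooks/save/model-generative/model-500.ckpt
```

With *keepAll* disabled every checkpoint overwrites the same file; with it enabled the step number keeps older checkpoints around.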

In [None]:
def _getModelName():
    """ Parse the argument to decide where to save/load the model
    This function is called at each checkpoint and the first time the model is loaded. If the keepAll option is set, the
    globStep value will be included in the name.
    Return:
        str: The path and name where the model needs to be saved
    """
    global modelDir
    global MODEL_NAME_BASE
    global globStep
    global args
    global MODEL_EXT
    
    modelName = os.path.join(modelDir, MODEL_NAME_BASE)
    if args['keepAll']:  # We do not erase the previously saved model by including the current step on the name
        modelName += '-' + str(globStep)
    return modelName + MODEL_EXT


def _getModelList():
    """ Return the list of the model files inside the model directory
    """
    return [os.path.join(modelDir, f) for f in os.listdir(modelDir) if f.endswith(MODEL_EXT)]


def _getSummaryName():
    """ Parse the argument to decide where to save the summary, in the same place as the model
    The folder could already contain logs if we restore the training, those will be merged
    Return:
        str: The path and name of the summary
    """
    global modelDir
    print("Model Dir:", modelDir)
    return modelDir


def getDevice():
    """ Parse the argument to decide on which device to run the model
    Return:
        str: The name of the device on which run the program
    """
    global args
    if args['device'] == 'cpu':
        return '/cpu:0'
    elif args['device'] == 'gpu':
        return '/gpu:0'
    elif args['device'] is None:  # No specified device (default)
        return None
    else:
        print('Warning: Error in the device name: {}, use the default device'.format(args['device']))
        return None


def _saveSession(sess):
    """ Save the model parameters and the variables
    Args:
        sess: the current session
    """
    tqdm.write('Checkpoint reached: saving model (don\'t stop the run)...')
    saveModelParams()
    model_name = _getModelName()
    with open(model_name, 'w') as f:  # HACK: Simulate the old model existance to avoid rewriting the file parser
        f.write('This file is used internally by DeepQA to check the model existance. Please do not remove.\n')
    saver.save(sess, model_name)  # TODO: Put a limit size (ex: 3GB for the modelDir)
    tqdm.write('Model saved.')

# 4. DATA PREPROCESSING

All the data processing methods are included in the files **textdata.py** and **corpus/ubuntudata.py**. Unfortunately, there is too much code to include in this notebook. We encourage you to open the files and have a look at the code (it is commented). Overall, the most important points are the following:
- **Tokens used**:
    - *padToken*: token of the dictionary that will be used to pad the sentences to the max length allowed
    - *goToken*: token of the dictionary that will be used to stop encoding and start the decoding of the input representation
    - *eosToken*: token of the dictionary that will be used to indicate the end of the decoded sentence
    - *unknownToken*: token of the dictionary that will be used to replace those words of the input sentence that are not in the dictionary


- **Methods**:
    - *_createBatch()*: create a batch from a group of samples
    - *loadCorpus()*: first preprocess of the dataset, storing each dialog as a list of utterances
    - *filterFromFull()*: filter the preprocessed dataset to match the args defined (maxLength, filterVocab...)
    - *sentence2enco(sentence)*: translate the words in a sentence to the dictionary IDs 
    - *deco2sentence(decoderOutputs)*: translate a list of words IDs to a word sentence.
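The role of the special tokens and of the two translation methods can be illustrated with a toy vocabulary. The token names, the IDs and the helper names below are hypothetical; the real implementations live in **textdata.py** and may differ in detail.

```python
# Toy vocabulary; the token names and IDs are hypothetical
word2id = {"<pad>": 0, "<go>": 1, "<eos>": 2, "<unknown>": 3,
           "how": 4, "do": 5, "i": 6, "install": 7}
id2word = {idx: word for word, idx in word2id.items()}


def sentence_to_ids(sentence):
    """Map each word to its ID, falling back to the unknown token."""
    return [word2id.get(w, word2id["<unknown>"]) for w in sentence.lower().split()]


def ids_to_sentence(ids):
    """Map IDs back to words, stopping at <eos> and skipping <pad>/<go>."""
    words = []
    for idx in ids:
        if idx == word2id["<eos>"]:
            break  # everything after the end token is ignored
        if idx in (word2id["<pad>"], word2id["<go>"]):
            continue
        words.append(id2word[idx])
    return " ".join(words)


ids = sentence_to_ids("How do I install pip")
print(ids)  # [4, 5, 6, 7, 3]  ("pip" is not in the vocabulary)
print(ids_to_sentence([1, 4, 5, 6, 7, 2, 0, 0]))  # how do i install
```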

# 4. DEFINING THE MODEL

We are ready to start with the actual neural network! First we are going to define a class skeleton with the methods that are going to be used.

In [None]:

class ProjectionOp:
    """ Single layer perceptron
    Project input tensor on the output dimension
    """
    def __init__(self, shape, scope=None, dtype=None):
        """
        Args:
            shape: a tuple (input dim, output dim)
            scope (str): encapsulate variables
            dtype: the weights type
        """
        assert len(shape) == 2

        self.scope = scope

        # Projection on the keyboard
        with tf.variable_scope('weights_' + self.scope):
            self.W_t = tf.get_variable(
                'weights',
                shape,
                # initializer=tf.truncated_normal_initializer()  # TODO: Tune value (fct of input size: 1/sqrt(input_dim))
                dtype=dtype
            )
            self.b = tf.get_variable(
                'bias',
                shape[0],
                initializer=tf.constant_initializer(),
                dtype=dtype
            )
            self.W = tf.transpose(self.W_t)

    def getWeights(self):
        """ Convenience method for some tf arguments
        """
        return self.W, self.b

    def __call__(self, X):
        """ Project the output of the decoder into the vocabulary space
        Args:
            X (tf.Tensor): input value
        """
        with tf.name_scope(self.scope):
            return tf.matmul(X, self.W) + self.b


In [None]:
class Model:
    """
    Implementation of a seq2seq model.
    Architecture:
        Encoder/decoder
    """

    def __init__(self, args, textData):
        """
        Args:
            args: parameters of the model
            textData: the dataset object
        """
        print("Model creation...")

        self.textData = textData  # Keep a reference on the dataset
        self.args = args  # Keep track of the parameters of the model
        self.dtype = tf.float32

        # Placeholders
        self.encoderInputs  = None
        self.decoderInputs  = None  # Same that decoderTarget plus the <go>
        self.decoderTargets = None
        self.decoderWeights = None  # Adjust the learning to the target sentence size

        # Main operators
        self.lossFct = None
        self.optOp = None
        self.outputs = None  # Outputs of the network, list of probability for each words

        # Construct the graphs
        self.buildNetwork()
    # Function to define the neural network
    def buildNetwork(self):
        return 
    # Function to define the operation and batches used for training/testing
    def step(self, batch):
        return 

Next we are going to define the method that builds the neural network. The function *create_rnn_cell()* deserves special attention. In this model, the selected **cell** is **LSTM**. The network is defined by the arguments *hiddenSize* and *numLayers*. The former defines the size of the hidden state of each LSTM layer (its number of hidden units), while the latter defines the number of stacked LSTM layers.
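To get a feeling for what *hiddenSize* and *numLayers* imply, the sketch below counts the trainable parameters of one stack of LSTM layers. It assumes a standard LSTM cell (four gates, one bias per gate unit, no peepholes) whose first layer reads the word embeddings; embedding tables and the output projection are not counted, and the helper names are hypothetical.

```python
def lstm_layer_params(input_size, hidden_size):
    """Weights plus biases of one standard LSTM layer (4 gates)."""
    return 4 * (input_size + hidden_size + 1) * hidden_size


def stacked_lstm_params(embedding_size, hidden_size, num_layers):
    """Total parameters of num_layers LSTM layers stacked on top of each other."""
    total = 0
    input_size = embedding_size  # the first layer reads word embeddings
    for _ in range(num_layers):
        total += lstm_layer_params(input_size, hidden_size)
        input_size = hidden_size  # deeper layers read the layer below
    return total


print(stacked_lstm_params(embedding_size=64, hidden_size=256, num_layers=2))  # 854016
```

With the values used in this notebook (64, 256 and 2), one stack holds 328,704 + 525,312 = 854,016 parameters; since the cost grows roughly with the square of *hiddenSize*, doubling it is far more expensive than adding a layer.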

In [None]:
def buildNetwork(self):

    """ Create the computational graph
    """

    # TODO: Create name_scopes (for better graph visualisation)
    # TODO: Use buckets (better perfs)

    # Parameters of sampled softmax (needed for attention mechanism and a large vocabulary size)
    outputProjection = None
    # Sampled softmax only makes sense if we sample less than vocabulary size.
    if 0 < self.args['softmaxSamples'] < self.textData.getVocabularySize():
        outputProjection = ProjectionOp(
            (self.textData.getVocabularySize(), self.args['hiddenSize']),
            scope='softmax_projection',
            dtype=self.dtype
        )

        def sampledSoftmax(labels, inputs):
            labels = tf.reshape(labels, [-1, 1])  # Add one dimension (nb of true classes, here 1)

            # We need to compute the sampled_softmax_loss using 32bit floats to
            # avoid numerical instabilities.
            localWt     = tf.cast(outputProjection.W_t,             tf.float32)
            localB      = tf.cast(outputProjection.b,               tf.float32)
            localInputs = tf.cast(inputs,                           tf.float32)

            return tf.cast(
                tf.nn.sampled_softmax_loss(
                    localWt,  # Should have shape [num_classes, dim]
                    localB,
                    labels,
                    localInputs,
                    self.args['softmaxSamples'],  # The number of classes to randomly sample per batch
                    self.textData.getVocabularySize()),  # The number of classes
                self.dtype)

    # Creation of the rnn cell
    def create_rnn_cell():
        encoDecoCell = tf.contrib.rnn.BasicLSTMCell(  # Or GRUCell, LSTMCell(args['hiddenSize'])
            self.args['hiddenSize'],
        )
        if not self.args['test']:  # TODO: Should use a placeholder instead
            encoDecoCell = tf.contrib.rnn.DropoutWrapper(
                encoDecoCell,
                input_keep_prob=1.0,
                output_keep_prob=self.args['dropout']
            )
        return encoDecoCell
    encoDecoCell = tf.contrib.rnn.MultiRNNCell(
        [create_rnn_cell() for _ in range(self.args['numLayers'])],
    )

    # Network input (placeholders)

    with tf.name_scope('placeholder_encoder'):
        self.encoderInputs  = [tf.placeholder(tf.int32,   [None, ]) for _ in range(self.args['maxLengthEnco'])]  # Batch size * sequence length * input dim

    with tf.name_scope('placeholder_decoder'):
        self.decoderInputs  = [tf.placeholder(tf.int32,   [None, ], name='inputs') for _ in range(self.args['maxLengthDeco'])]  # Same sentence length for input and output (Right ?)
        self.decoderTargets = [tf.placeholder(tf.int32,   [None, ], name='targets') for _ in range(self.args['maxLengthDeco'])]
        self.decoderWeights = [tf.placeholder(tf.float32, [None, ], name='weights') for _ in range(self.args['maxLengthDeco'])]

    # Define the network
    # Here we use an embedding model, it takes integer as input and convert them into word vector for
    # better word representation
    decoderOutputs, states = tf.contrib.legacy_seq2seq.embedding_rnn_seq2seq(
        self.encoderInputs,  # List<[batch=?, inputDim=1]>, list of size args['maxLength']
        self.decoderInputs,  # For training, we force the correct output (feed_previous=False)
        encoDecoCell,
        self.textData.getVocabularySize(),
        self.textData.getVocabularySize(),  # Both encoder and decoder have the same number of class
        embedding_size=self.args['embeddingSize'],  # Dimension of each word
        output_projection=outputProjection.getWeights() if outputProjection else None,
        feed_previous=bool(self.args['test'])  # When we test (self.args['test']), we use previous output as next input (feed_previous)
    )

    # TODO: When the LSTM hidden size is too big, we should project the LSTM output into a smaller space (4086 => 2046): Should speed up
    # training and reduce memory usage. Other solution, use sampling softmax

    # For testing only
    if self.args['test']:
        if not outputProjection:
            self.outputs = decoderOutputs
        else:
            self.outputs = [outputProjection(output) for output in decoderOutputs]

        # TODO: Attach a summary to visualize the output

    # For training only
    else:
        # Finally, we define the loss function
        self.lossFct = tf.contrib.legacy_seq2seq.sequence_loss(
            decoderOutputs,
            self.decoderTargets,
            self.decoderWeights,
            self.textData.getVocabularySize(),
            softmax_loss_function= sampledSoftmax if outputProjection else None  # If None, use default SoftMax
        )
        tf.summary.scalar('loss', self.lossFct)  # Keep track of the cost

        # Initialize the optimizer
        opt = tf.train.AdamOptimizer(
            learning_rate=self.args['learningRate'],
            beta1=0.9,
            beta2=0.999,
            epsilon=1e-08
        )
        self.optOp = opt.minimize(self.lossFct)

When running the model, we need to act differently in training than testing. 

- For **training**, we should feed the neural network with the inputs of both the encoder and the decoder, as we want to "force" the decoder with the actual answer. We are interested in retrieving the loss and the optimizer operation (Adam by default) that minimizes it.

- In **testing**, we only need to feed the inputs of the encoder (the user question), and the operation to retrieve is the output of the decoder (the chatbot's answer).
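The "forcing" used in training can be sketched as follows: the decoder is fed the true answer shifted by one position, and padded positions receive a weight of zero so that they do not contribute to the loss. The token IDs and the helper name are hypothetical, the answer is assumed to fit within *max_length_deco* once the extra token is added, and the exact layout produced by **textdata.py** may differ.

```python
PAD, GO, EOS = 0, 1, 2  # hypothetical token IDs


def build_decoder_sequences(answer_ids, max_length_deco):
    """Build decoder inputs, targets and loss weights for one answer."""
    decoder_input = [GO] + answer_ids   # what the decoder is fed at each step
    target = answer_ids + [EOS]         # what it should predict at each step
    weights = [1.0] * len(target)       # real tokens count in the loss
    pad = max_length_deco - len(target)
    return (decoder_input + [PAD] * pad,
            target + [PAD] * pad,
            weights + [0.0] * pad)      # padded steps are ignored


dec_in, tgt, w = build_decoder_sequences([7, 8, 9], max_length_deco=6)
print(dec_in)  # [1, 7, 8, 9, 0, 0]
print(tgt)     # [7, 8, 9, 2, 0, 0]
print(w)       # [1.0, 1.0, 1.0, 1.0, 0.0, 0.0]
```

The extra go and end tokens are the reason the decoder length is set to *maxLength* + 2 in the parameters.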

In [None]:
def step(self, batch):

    """ Forward/training step operation.
    Does not run anything itself but just returns the operators to do so. Those then have to be run
    Args:
        batch (Batch): Input data in testing mode, input and target in training mode
    Return:
        (ops), dict: A tuple of the (training, loss) operators or (outputs,) in testing mode with the associated feed dictionary
    """

    # Feed the dictionary
    feedDict = {}
    ops = None

    if not self.args['test']:  # Training
        for i in range(self.args['maxLengthEnco']):
            feedDict[self.encoderInputs[i]]  = batch.encoderSeqs[i]
        for i in range(self.args['maxLengthDeco']):
            feedDict[self.decoderInputs[i]]  = batch.decoderSeqs[i]
            feedDict[self.decoderTargets[i]] = batch.targetSeqs[i]
            feedDict[self.decoderWeights[i]] = batch.weights[i]

        ops = (self.optOp, self.lossFct)
    else:  # Testing (batchSize == 1)
        for i in range(self.args['maxLengthEnco']):
            feedDict[self.encoderInputs[i]]  = batch.encoderSeqs[i]
        feedDict[self.decoderInputs[0]]  = [self.textData.goToken]
        ops = (self.outputs,)

    # Return one pass operator
    return ops, feedDict

Finally, we update the class methods with them, and our model is ready to be used.

In [None]:
Model.buildNetwork = buildNetwork
Model.step = step

# 5. TRAINING

Next, we define our training function. We first need to check whether an existing model has to be restored, in case a previous training run was interrupted. After that, we get the training batches and, for each one of them, run the operations with the feed dictionary (as defined in the previous part of the tutorial).

A checkpoint of the model is saved every N steps, where N is defined in the initial args (*saveEvery*, 500 by default).
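Two small pieces of the training loop can be tried on their own: the perplexity reported in the logs, which is the exponential of the loss, and the schedule of periodic checkpoints. The helper names below are hypothetical; the cap at a loss of 300 follows the training code and simply avoids an overflow.

```python
import math


def perplexity(loss):
    """exp(loss), reported as infinity for very large losses to avoid overflow."""
    return math.exp(loss) if loss < 300 else float("inf")


def checkpoint_steps(total_steps, save_every):
    """Steps at which a periodic checkpoint would be written."""
    return [step for step in range(1, total_steps + 1) if step % save_every == 0]


print(round(perplexity(2.0), 2))  # 7.39
print(checkpoint_steps(total_steps=1200, save_every=500))  # [500, 1000]
```

A perplexity of about 7.4 can be read as the model hesitating between roughly seven equally likely words at each step. Note that the training function also saves once more when the loop ends, so progress after the last periodic checkpoint is not lost.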

In [None]:
def mainTrain(sess):
    """ Training loop
    Args:
        sess: The current running session
    """
    global textData
    global args
    global writer
    global model
    global globStep
    # Specific training dependent loading

    textData.makeLighter(args['ratioDataset'])  # Limit the number of training samples

    mergedSummaries = tf.summary.merge_all()  # Define the summary operator (Warning: Won't appear on the tensorboard graph)
    if globStep == 0:  # Not restoring from previous run
        writer.add_graph(sess.graph)  # First time only

    # If restoring a model, restore the progression bar ? and current batch ?

    print('Start training (press Ctrl+C to save and exit)...')

    try:  # If the user exit while training, we still try to save the model
        for e in range(args['numEpochs']):

            print()
            print("----- Epoch {}/{} ; (lr={}) -----".format(e+1, args['numEpochs'], args['learningRate']))

            batches = textData.getBatches()

            # TODO: Also update learning parameters eventually

            tic = datetime.datetime.now()
            for nextBatch in tqdm(batches, desc="Training"):
                # Training pass
                ops, feedDict = model.step(nextBatch)
                assert len(ops) == 2  # training, loss
                _, loss, summary = sess.run(ops + (mergedSummaries,), feedDict)
                writer.add_summary(summary, globStep)
                globStep += 1

                # Output training status
                if globStep % 100 == 0:
                    perplexity = math.exp(float(loss)) if loss < 300 else float("inf")
                    tqdm.write("----- Step %d -- Loss %.2f -- Perplexity %.2f" % (globStep, loss, perplexity))

                # Checkpoint
                if globStep % args['saveEvery'] == 0:
                    _saveSession(sess)

            toc = datetime.datetime.now()

            print("Epoch finished in {}".format(toc-tic))  # Warning: Will overflow if an epoch takes more than 24 hours, and the output isn't really nicer
    except (KeyboardInterrupt, SystemExit):  # If the user presses Ctrl+C during training
        print('Interruption detected, exiting the program...')

    _saveSession(sess)  # Ultimate saving before complete exit
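
The status line printed every 100 steps shows the perplexity next to the loss. Assuming the loss is an average per-word cross-entropy measured in nats, perplexity is simply its exponential, and a perplexity of k roughly means the model is as uncertain as if it were choosing uniformly among k words. The helper name `perplexity_from_loss` and the sample losses are ours; the cap of 300 mirrors the guard used in the training loop.

```python
import math


def perplexity_from_loss(loss, cap=300.0):
    """Return exp(loss), or infinity when the loss is at or above the cap."""
    return math.exp(float(loss)) if loss < cap else float("inf")


# Sample loss values chosen for illustration only
for loss in (0.0, 2.0, 4.6, 350.0):
    print("loss %.2f -> perplexity %.2f" % (loss, perplexity_from_loss(loss)))
```

The cap matters because `math.exp` raises `OverflowError` for very large arguments, which would otherwise stop training just to print a status line.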


In [None]:
print('Welcome to DeepQA v0.1 !')
print('TensorFlow detected: v{}'.format(tf.__version__))

# General initialisation
if not args['rootDir']:
    args['rootDir'] = os.getcwd()  # Use the current working directory

#tf.logging.set_verbosity(tf.logging.INFO) # DEBUG, INFO, WARN (default), ERROR, or FATAL

loadModelParams()  # Update the modelDir and globStep, for now, not used when loading Model (but need to be called before _getSummaryName)

# Data preprocessing function for the given arguments
textData = TextData(args)

In [None]:
# Prepare the model
with tf.device(getDevice()):
    model = Model(args, textData)

# Saver/summaries
writer = tf.summary.FileWriter(_getSummaryName())
saver = tf.train.Saver(max_to_keep=200)

# Running session
sess = tf.Session(config=tf.ConfigProto(
    allow_soft_placement=True,  # Allows backup device for non GPU-available operations (when forcing GPU)
    log_device_placement=False)  # Too verbose ?
)  # TODO: Replace all sess by sess (not necessary a good idea) ?

if args['debug']:
    sess = tf_debug.LocalCLIDebugWrapperSession(sess)
    sess.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)

print('Initialize variables...')
sess.run(tf.global_variables_initializer())

# Reload the model if one exists; in testing mode the models are not loaded here (but in predictTestset)
if args['test'] != TestMode.ALL:
    managePreviousModel(sess)

# Initialize embeddings with pre-trained word2vec vectors
if args['initEmbeddings']:
    loadEmbedding(sess)

**IMPORTANT**: if the kernel dies while testing a checkpoint and you need to run the notebook again, you do not have to train the model again. Just skip the following cell.

In [None]:
# Train!
mainTrain(sess)

# 7. TESTING

At this point we should have a trained model for the corpus. Now we want to run empirical tests to understand how good its answers are, what its weak and strong points are and, finally, how to improve it.

As we did with the training function, we can now define a testing function. Note that, although this project offers several types of testing, in this tutorial we are going to focus on the **interactive** one. This mode lets the user type a question and receive an answer, in a loop that runs until the user decides to stop the program.

In [None]:
def mainTestInteractive(sess):
    """ Try predicting the sentences that the user will enter in the console
    Args:
        sess: The current running session
    """
    # TODO: If verbose mode, also show similar sentences from the training set with the same words (include in mainTest also)
    # TODO: Also show the top 10 most likely predictions for each predicted output (when verbose mode)
    # TODO: Log the questions asked for later re-use (merge with test/samples.txt)
    global SENTENCES_PREFIX
    global textData
    print('Testing: Launch interactive mode:')
    print('')
    print('Welcome to the interactive mode, here you can ask to Deep Q&A the sentence you want. Don\'t have high '
          'expectation. Type \'exit\' or just press ENTER to quit the program. Have fun.')

    while True:
        question = input(SENTENCES_PREFIX[0])
        if question == '' or question == 'exit':
            break

        questionSeq = []  # Will contain the question as seen by the encoder
        answer = singlePredict(question, questionSeq)
        if not answer:
            print('Warning: sentence too long, sorry. Maybe try a simpler sentence.')
            continue  # Back to the beginning, try again

        print('{}{}'.format(SENTENCES_PREFIX[1], textData.sequence2str(answer, clean=True)))

        if args['verbose']:
            print(textData.batchSeq2str(questionSeq, clean=True, reverse=True))
            print(textData.sequence2str(answer))

        print()
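
The structure of this loop (read, stop on an empty line or `exit`, warn when no answer is available, otherwise print) can be exercised without a trained model by injecting the input and output functions. This is a minimal sketch under our own assumptions: `interactive_loop`, `answer_fn`, `read_line`, `write_line` and the `Q: ` / `A: ` prefixes are hypothetical, and the stand-in "model" simply upper-cases short questions.

```python
def interactive_loop(answer_fn, read_line=input, write_line=print):
    """Read questions until an empty line or 'exit'; return how many were answered."""
    answered = 0
    while True:
        question = read_line("Q: ")
        if question in ("", "exit"):
            break
        answer = answer_fn(question)
        if not answer:  # the stand-in model rejected the sentence
            write_line("Warning: sentence too long, sorry.")
            continue
        write_line("A: " + answer)
        answered += 1
    return answered


# Scripted session: one short question, one over-long question, then exit
scripted = iter(["hello", "x " * 20, "exit"])
outputs = []
count = interactive_loop(
    answer_fn=lambda q: q.upper() if len(q.split()) <= 10 else None,
    read_line=lambda prompt: next(scripted),
    write_line=outputs.append,
)
print(count, outputs)
```

Passing the reader and writer as arguments keeps the loop testable; with the defaults it behaves like a normal console session.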

For the prediction function, we wait for a question from the user. Once the input is submitted, we translate its words into the IDs of our dictionary. These IDs are then transformed into embeddings, and the embeddings are fed to the network. Running the model gives us the output of the decoder, which is again a sequence of IDs. Translating those IDs back into words gives us the answer to the user's question.

In [None]:
def singlePredict(question, questionSeq=None):
    """ Predict the sentence
    Args:
        question (str): the raw input sentence
        questionSeq (List<int>): output argument. If given will contain the input batch sequence
    Return:
        list <int>: the word ids corresponding to the answer
    """
    # Create the input batch
    batch = textData.sentence2enco(question)
    if not batch:
        return None
    if questionSeq is not None:  # If the caller wants to have the real input
        questionSeq.extend(batch.encoderSeqs)

    # Run the model
    ops, feedDict = model.step(batch)

    output = sess.run(ops[0], feedDict)  # TODO: Summarize the output too (histogram, ...)
    answer = textData.deco2sentence(output)

    return answer
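
The translation between words and IDs on both sides of the model can be illustrated with a toy vocabulary. Everything in this sketch is an assumption made for illustration: the `word2id` table, the `<pad>` and `<unknown>` tokens, the whitespace tokenizer and the `encode` / `decode` helpers are not the project's own (the real dictionary is built by `TextData` from the corpus). Only the limit of 10 words per sentence comes from the tutorial.

```python
MAX_LENGTH = 10  # maximum words per sentence used in this tutorial

# Hypothetical toy vocabulary, for illustration only
word2id = {"<pad>": 0, "<unknown>": 1, "how": 2, "do": 3, "i": 4, "install": 5, "ubuntu": 6}
id2word = {idx: word for word, idx in word2id.items()}


def encode(sentence):
    """Map a raw sentence to a list of word IDs, or None if it is too long."""
    tokens = sentence.lower().split()  # naive tokenizer standing in for the real one
    if len(tokens) > MAX_LENGTH:
        return None
    return [word2id.get(tok, word2id["<unknown>"]) for tok in tokens]


def decode(ids):
    """Map word IDs back to a sentence, dropping padding."""
    return " ".join(id2word[i] for i in ids if i != word2id["<pad>"])


ids = encode("How do I install Ubuntu")
print(ids)                   # [2, 3, 4, 5, 6]
print(decode(ids + [0, 0]))  # how do i install ubuntu
print(encode("one two three four five six seven eight nine ten eleven"))  # None
```

Returning `None` for an over-long sentence plays the same role as the empty batch check in `singlePredict`, which is what makes the interactive loop print its "sentence too long" warning.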

Once again, we initialize our model, this time for testing. Run this cell only once: if the model is already loaded, you might receive an error.

In [None]:
# General initialisation
args['test'] = TestMode.INTERACTIVE
args['batchSize'] = 1

if not args['rootDir']:
    args['rootDir'] = os.getcwd()  # Use the current working directory

#tf.logging.set_verbosity(tf.logging.INFO) # DEBUG, INFO, WARN (default), ERROR, or FATAL

loadModelParams()  # Update the modelDir and globStep, for now, not used when loading Model (but need to be called before _getSummaryName)

''' # Uncomment this block if you haven't loaded the model previously (not trained/testing performed since you opened this notebook)
# Prepare the model
textData = TextData(args)

with tf.device(getDevice()):
    model = Model(args, textData)
'''
# Saver/summaries
writer = tf.summary.FileWriter(_getSummaryName())
saver = tf.train.Saver(max_to_keep=200)

# Running session
sess = tf.Session(config=tf.ConfigProto(
    allow_soft_placement=True,  # Allows backup device for non GPU-available operations (when forcing GPU)
    log_device_placement=False)  # Too verbose ?
)  # TODO: Replace all sess by sess (not necessary a good idea) ?

if args['debug']:
    sess = tf_debug.LocalCLIDebugWrapperSession(sess)
    sess.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)

print('Initialize variables...')
sess.run(tf.global_variables_initializer())
print("Model dir value:", modelDir)
# Reload the model if one exists; in testing mode the models are not loaded here (but in predictTestset)
if args['test'] != TestMode.ALL:
    managePreviousModel(sess)

# Initialize embeddings with pre-trained word2vec vectors
if args['initEmbeddings']:
    loadEmbedding(sess)

And that's it! We have our model running. Now we just need to call the testing function to activate the interactive mode.

In [None]:
mainTestInteractive(sess)