## Chapter 5: Sequential NLP and memory

* Goal: How can we answer a question on the basis of a number of facts?

## 5.1 Memory and language
* Language is a sequential and contextual phenomenon!
    * Part-of-speech tagging
    * Sentiment and topic analysis
### 5.1.1 The problem: Question Answering
* Question Answering involves matching answers to questions
* This task depends on memory: the model must store all of the facts because it doesn't know the upcoming question in advance.
* Goal: Given a sequence of sentences and a question that can be answered by exactly 1 of the sentences, how can we retrieve the information needed to answer the question?

## 5.2: Data and Data Processing
* bAbI dataset
    * Sequences of facts, linked to questions
    * Created by Facebook AI Research (now part of Meta)
* Example Story:
    * Mary moved to the bathroom
    * John went to the hallway
    * Where is Mary? bathroom
* Storing every fact can be burdensome for a model, so check the model under two conditions:
    * Using a question and the supporting fact
    * Using all facts in a story including non-relevant ones
* Data Preparation:
    * Prepare the following vectors:
        * List for all facts
            * Lumped into one long vector
        * List for all vectorized questions
        * List of labels
    * The vectors will be created via a tokenizer
        * Tokenizer fitted on the words in the training/test data
    * Create a switch to remove/keep irrelevant facts
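
The preparation steps above can be sketched in plain Python. This is a minimal illustration, not the chapter's actual code: the toy `stories` list, the helper names (`tokenize`, `fit_vocab`, `vectorize`) and the `only_supporting` flag are all assumptions made for this example, and real bAbI files would need their own parser.

```python
# Hypothetical toy data: (facts, question, answer, index of the supporting fact).
stories = [
    (["Mary moved to the bathroom", "John went to the hallway"],
     "Where is Mary", "bathroom", 0),
    (["Daniel went to the kitchen", "Sandra moved to the garden"],
     "Where is Sandra", "garden", 1),
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def fit_vocab(stories):
    """Map every word in facts, questions and answers to an index; 0 is left free for padding."""
    vocab = {}
    for facts, question, answer, _ in stories:
        for text in facts + [question, answer]:
            for word in tokenize(text):
                vocab.setdefault(word, len(vocab) + 1)
    return vocab


def vectorize(stories, vocab, only_supporting=False):
    """Return (fact vectors, question vectors, labels) as lists of word indices.

    `only_supporting` is the switch: True keeps just the supporting fact,
    False keeps every fact in the story, relevant or not.
    """
    fact_vectors, question_vectors, labels = [], [], []
    for facts, question, answer, support in stories:
        kept = [facts[support]] if only_supporting else facts
        # All kept facts are lumped into one long vector.
        fact_vectors.append([vocab[w] for fact in kept for w in tokenize(fact)])
        question_vectors.append([vocab[w] for w in tokenize(question)])
        labels.append(vocab[answer])
    return fact_vectors, question_vectors, labels


vocab = fit_vocab(stories)
facts_all, questions, labels = vectorize(stories, vocab)
facts_support, _, _ = vectorize(stories, vocab, only_supporting=True)

print(len(facts_all[0]), len(facts_support[0]))
```

In a real pipeline the vectors would also be padded to a common length before being fed to a model; that step is left out here to keep the sketch short.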

## 5.3: Question Answering w/ Sequential Models
### 5.3.1: RNNs for Question Answering
* RNN -> Recurrent Neural Network
    * Blindly feeds all historical information back in at every step
* Implement a branching model w/ 2 RNNs
    * 1 RNN handles the facts/stories
    * 1 RNN handles the question
* It's important to use one-hot encoding here: the answer label becomes a one-hot vector over the vocabulary, so it can be related to the model's vocab_size-dimensional output
* Flow:
    * Create 2 input branches, each with its own RNN layer
    * Merge the input layers via concatenation
    * Send it through a Dense layer
    * Get an output layer w/ dimensionality vocab_size
    * Compile the model + test
* Model accuracy is super high if we throw out the irrelevant facts
* But when all facts are kept and more context is needed, performance drops significantly: the RNN can't store that much information, and the binary switch offers no middle ground
* Incremental Context:
    * Rather than using a binary switch, specify the number of irrelevant facts we can tolerate
    * Slowly increase the number of irrelevant facts + the number of words the facts cover
    * Even still, RNNs suck for holding lots of information so the performance isn't that great
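
The branching flow above (two RNN branches, concatenation, an output over the vocabulary) can be sketched as a forward pass in plain Python. Everything here is an assumption made for illustration: the `SimpleRNN` class, the sizes, the random untrained weights, and the choice to one-hot encode the input words as well as the answer label. A real model would be built and trained in a deep learning library, and this sketch collapses the Dense and output layers into a single projection.

```python
import math
import random

random.seed(0)


def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def one_hot(index, size):
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class SimpleRNN:
    """h_t = tanh(W x_t + U h_{t-1}): the whole history is folded into one vector h."""

    def __init__(self, input_dim, hidden_dim):
        self.W = rand_matrix(hidden_dim, input_dim)
        self.U = rand_matrix(hidden_dim, hidden_dim)
        self.hidden_dim = hidden_dim

    def encode(self, word_ids, vocab_size):
        h = [0.0] * self.hidden_dim
        for idx in word_ids:
            wx = matvec(self.W, one_hot(idx, vocab_size))
            uh = matvec(self.U, h)
            h = [math.tanh(a + b) for a, b in zip(wx, uh)]
        return h


vocab_size, hidden_dim = 15, 8                      # hypothetical sizes
facts_rnn = SimpleRNN(vocab_size, hidden_dim)       # branch 1: facts/story
question_rnn = SimpleRNN(vocab_size, hidden_dim)    # branch 2: question
output_weights = rand_matrix(vocab_size, 2 * hidden_dim)

facts = [1, 2, 3, 4, 5, 6, 7, 3, 4, 8]              # one long vector of word indices
question = [9, 10, 1]

# Merge the two branches by concatenation, then project to vocab_size scores.
merged = facts_rnn.encode(facts, vocab_size) + question_rnn.encode(question, vocab_size)
probs = softmax(matvec(output_weights, merged))

# The one-hot label lines up with the vocab_size-dimensional output.
target = one_hot(5, vocab_size)
loss = -sum(t * math.log(p) for t, p in zip(target, probs))
print(len(probs), loss)
```

Because the weights are random, the predicted distribution is meaningless; the point is only the shape of the computation and how the one-hot label is compared with the output.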
### 5.3.2: LSTMs for Question Answering
* LSTM -> Long Short-Term Memory
    * Works 1 feature at a time through an input vector
    * Updates its cell state at each step
* LSTMs come in either stateless or stateful mode
    * Stateless:
        * After a sequence has been processed, the weights of the surrounding layers are updated through backpropagation
        * Cell state of the LSTM is reset
    * Stateful:
        * Vectors across batches are synchronized
        * Each vector proceeds from the cell state left by the corresponding vector in the previous batch
        * Batches contain temporally linked vectors
    * In both modes, a batch of labeled training vectors is used
* LSTMs require input data shaped as a triplet (3 dimensions):
    * Number of samples
    * Time steps
    * Features per time step
* Stateful batches are NOT useful here as we're not trying to predict an outcome per fact.
    * Question: Would it be useful then if we needed information from multiple facts to get the answer? Thinking 'Where is A and B?'
* Flow:
    * Almost exactly the same as RNNs but with LSTM layers instead
    * Add an input layer prior to the LSTM to deduce the triplet data structure
* Again, model accuracy is high if we throw out irrelevant information, but it still has poor performance when running on all contexts
* LSTMs outperform RNNs for moderately long contexts, but fail on extensive sequences
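
To make the (samples, time steps, features) triplet and the stateless/stateful distinction concrete, here is a minimal sketch of a single-unit LSTM in plain Python. The weights in `WEIGHTS`, the helper names and the toy batches are hypothetical; a library LSTM layer has many units and learned weights, but the cell-state bookkeeping follows the same idea.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def to_3d(sequences):
    """Reshape equal-length sequences to (samples, time steps, 1 feature per step)."""
    return [[[float(v)] for v in seq] for seq in sequences]


def shape(batch):
    return (len(batch), len(batch[0]), len(batch[0][0]))


# Hypothetical fixed weights per gate: (input weight, recurrent weight, bias).
WEIGHTS = {"f": (0.5, 0.1, 0.0), "i": (0.4, 0.2, 0.0),
           "o": (0.3, 0.1, 0.0), "g": (0.6, 0.2, 0.0)}


def gate(name, x, h):
    w_x, w_h, b = WEIGHTS[name]
    return w_x * x + w_h * h + b


def lstm_step(x, h, c):
    """One time step: update the cell state c, then derive the new hidden state h."""
    f = sigmoid(gate("f", x, h))        # forget gate
    i = sigmoid(gate("i", x, h))        # input gate
    o = sigmoid(gate("o", x, h))        # output gate
    g = math.tanh(gate("g", x, h))      # candidate cell content
    c = f * c + i * g
    h = o * math.tanh(c)
    return h, c


def run_batch(batch, state=None):
    """Run every sample through the LSTM; `state` holds one (h, c) pair per sample slot."""
    if state is None:
        state = [(0.0, 0.0)] * len(batch)   # stateless: start from a reset cell state
    final_state = []
    for sample, (h, c) in zip(batch, state):
        for step in sample:
            h, c = lstm_step(step[0], h, c)
        final_state.append((h, c))
    return final_state


batch_1 = to_3d([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
batch_2 = to_3d([[0.4, 0.5, 0.6], [0.7, 0.8, 0.9]])
print(shape(batch_1))

# Stateless: batch_2 is processed from a zero state, ignoring batch_1.
stateless = run_batch(batch_2)
# Stateful: sample i of batch_2 continues from the final state of sample i of batch_1.
stateful = run_batch(batch_2, state=run_batch(batch_1))
```

In stateful mode the two batches behave like one longer sequence per sample slot, which is why it only helps when the batches are temporally linked.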
### 5.3.3 End-to-end memory networks for Question Answering
* Moving from rote mapping to responsive memory
* Don't just teach a network to predict an answer from the combined story/question vector
    * Instead, produce a memory response and use that to weight the facts vector
    * The question vector is then recombined with the weighted facts vector and used to predict a word index
* Approach:
    * Facts are embedded with an embedding A
    * Question is embedded with an embedding B
    * Facts are also embedded with an embedding C
    * Create the memory bank response by taking the inner product of the A-embedded facts with the B-embedded question, then applying a softmax to get match probabilities
    * Combine these match probabilities with the C-embedded facts using a weighted sum operation
    * The B-embedded question is now concatenated with the above
        * Now the facts are weighted for relevance at answering the question!
    * The resulting concatenation is sent to a dense output layer to get the word index of the answer
* The resulting model gives better performance on super long sequences
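
The approach above can be sketched as a single forward pass in plain Python. This is an illustration under stated assumptions, not the chapter's implementation: the embedding tables `A`, `B`, `C` and the output weights `W_out` are random and untrained, the sizes are hypothetical, and each fact is embedded at sentence level as the sum of its word embeddings (a simplification chosen for this sketch).

```python
import math
import random

random.seed(0)
VOCAB_SIZE, DIM = 15, 4   # hypothetical sizes


def embedding_table():
    return [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(VOCAB_SIZE)]


A, B, C = embedding_table(), embedding_table(), embedding_table()
W_out = [[random.uniform(-0.5, 0.5) for _ in range(2 * DIM)] for _ in range(VOCAB_SIZE)]


def embed(word_ids, table):
    """Bag-of-words embedding: sum the table rows of the given words."""
    return [sum(table[i][d] for i in word_ids) for d in range(DIM)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def answer_distribution(facts, question):
    memory = [embed(f, A) for f in facts]    # facts embedded with A
    outputs = [embed(f, C) for f in facts]   # facts embedded with C
    query = embed(question, B)               # question embedded with B

    # Match probabilities: how relevant each fact is to the question.
    match = softmax([dot(m, query) for m in memory])

    # Weighted sum of the C-embedded facts, using the match probabilities.
    response = [sum(p * out[d] for p, out in zip(match, outputs)) for d in range(DIM)]

    # Concatenate the response with the question and project to the vocabulary.
    logits = [dot(row, response + query) for row in W_out]
    return match, softmax(logits)


facts = [[1, 2, 3, 4, 5], [6, 7, 3, 4, 8]]   # word indices of two facts
question = [9, 10, 1]
match, answer_probs = answer_distribution(facts, question)
print(match)
```

The key difference from the RNN and LSTM models is that `match` is recomputed for every question, so the facts are weighted by relevance instead of being compressed into one fixed state.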


## 5.4: Summary
* Three approaches to Question Answering, with differing memory capacity, are RNNs, LSTMs, and end-to-end memory networks
* For Question-Answering, RNNs perform worse than LSTMs in remembering long sequences of data.
* End-to-end memory networks work by memorizing sample questions, supporting facts and answers to the questions in memory banks.
* End-to-end memory networks outperform LSTMs for Question-Answering