# Chapter 6: Episodic memory for NLP

* Strongly supervised memory networks easily produce above-baseline results
* Semi-supervised memory networks can produce better accuracies though!

## 6.1 Memory Networks for Sequential NLP
* Declarative (long-term) memory consists of a semantic and an episodic component
    * Semantic: More generic/abstract conceptual info
    * Episodic: Specific memories + personal experiences
* Most of NLP is about picking out the contextual info needed to take a step towards interpretation
* Ambiguities in a sentence/phrase can make it difficult to determine what a word that can be either a noun or a verb contributes
* Question: Which part of speech should a word with a noun/verb ambiguity get in this context?
    * We must make sure that the order in which facts are inserted is preserved here!
    * How can we transform arbitrary NLP data such that it can be ingested by a memory network?

## 6.2 Data and Data Processing
* Data:
    * Prepositional phrase (PP) attachment data
    * Dutch diminutive formation data
    * Spanish part-of-speech tagging data
* Let's use a single retrieval pass over the stored facts
* We will also explicitly provide the facts that are relevant for answering a question
### 6.2.1 PP Attachment Data
* Prepositional phrases restrict the meaning of other words in the sentence
* Data Format:
    * eats,pizza,with,anchovies,N
    * dumped,sacks,into,bin,V
        * The final label (N vs. V) says whether the PP modifies the noun or the verb
* Also throw in data that tags the part of speech as well
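
The comma-separated format above can be turned into a bAbI-style story along the following lines. This is a minimal sketch, not the chapter's own conversion script: the function name `pp_to_babi`, the one-word-per-fact layout, and the question text `attachment?` are all assumptions made for illustration. Only the general bAbI line shape (numbered facts, then `question<TAB>answer<TAB>supporting fact ids`) is standard.

```python
def pp_to_babi(line):
    """Turn 'verb,noun1,prep,noun2,label' into bAbI-style lines.

    The fact layout and question wording are illustrative assumptions.
    """
    verb, noun1, prep, noun2, label = line.strip().split(",")
    words = [verb, noun1, prep, noun2]
    # One numbered fact per word, keeping the original word order.
    story = [f"{i} {word}" for i, word in enumerate(words, start=1)]
    # Strong supervision: list every fact as relevant for the answer.
    support = " ".join(str(i) for i in range(1, len(words) + 1))
    story.append(f"{len(words) + 1} attachment?\t{label}\t{support}")
    return story


if __name__ == "__main__":
    for row in pp_to_babi("eats,pizza,with,anchovies,N"):
        print(row)
```

Numbering the facts in word order is what keeps the insertion order visible to the network.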
### 6.2.2 Dutch Diminutive Data
* Features indicate whether each syllable is stressed; a feature value can be missing when the syllable has no onset or the word has fewer than 3 syllables
    * Onset: String of consonants before the vocalic part
    * Coda: String of consonants after the vocalic part
* Converting Dutch diminutive data to bAbI format
    * Define an array for holding substitutions
    * Print the facts/questions
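
The two conversion steps above could look roughly like this. It is a sketch under several assumptions: the input is taken to be a comma-separated feature line ending in the class label, the symbols in `SUBSTITUTIONS` and their replacement tokens are hypothetical, and the `f1`, `f2`, ... fact names and the `suffix?` question are invented for illustration.

```python
# Hypothetical substitution table: symbols that a tokenizer might drop
# are replaced by plain word tokens before the facts are printed.
SUBSTITUTIONS = {"=": "empty", "+": "stressed", "-": "unstressed", "@": "schwa"}


def diminutive_to_babi(line):
    """Turn 'feat1,feat2,...,label' into bAbI-style lines (illustrative)."""
    *features, label = line.strip().split(",")
    tokens = [SUBSTITUTIONS.get(feat, feat) for feat in features]
    # One numbered fact per feature slot, in the original order.
    story = [f"{i} f{i} {tok}" for i, tok in enumerate(tokens, start=1)]
    support = " ".join(str(i) for i in range(1, len(tokens) + 1))
    story.append(f"{len(tokens) + 1} suffix?\t{label}\t{support}")
    return story


if __name__ == "__main__":
    # Illustrative input line, not taken from the chapter.
    for row in diminutive_to_babi("=,=,=,=,+,k,u,=,-,bl,u,m,E"):
        print(row)
```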
### 6.2.3 Spanish Part-of-Speech Data
* Data Format:
    * Abogado NC B-PER
    * General AQ I-PER
        * 2nd column = Part of speech label for the first column
        * Last column = Phrasal information
            * Starting a named entity (B-PER, B for 'beginning')
            * Being part of a named entity (I-PER, I for 'inside')
* Process:
    * Define a dictionary, holding the assignment of parts of speech
    * Store the combination of the words/part of speech
    * Keep track of ambiguities
    * Define an ngram size
    * Define a focus position (word position addressed in the question)
    * Convert the focus position w/ ngram size into a story
    * Check for any ambiguities
    * Print the fact
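
The process above can be sketched as follows. The helper names (`build_lexicon`, `focus_to_babi`), the fact layout (`word candidate_tags`), the `_` separator for ambiguous words, and the sample sentence are assumptions for illustration rather than the chapter's actual script.

```python
from collections import defaultdict


def build_lexicon(tagged_words):
    """Map each word to the set of part-of-speech tags seen with it."""
    lexicon = defaultdict(set)
    for word, pos in tagged_words:
        lexicon[word].add(pos)
    return lexicon


def focus_to_babi(tagged_words, focus, ngram_size, lexicon):
    """Render the n-gram window around `focus` as a bAbI-style story."""
    half = ngram_size // 2
    start = max(0, focus - half)
    end = min(len(tagged_words), focus + half + 1)
    story, focus_line = [], None
    for line_no, idx in enumerate(range(start, end), start=1):
        word, _ = tagged_words[idx]
        # Ambiguous words list all of their candidate tags, joined by "_".
        candidates = "_".join(sorted(lexicon[word]))
        story.append(f"{line_no} {word} {candidates}")
        if idx == focus:
            focus_line = line_no
    word, gold = tagged_words[focus]
    # The question targets the focus word; its own fact is the support.
    story.append(f"{len(story) + 1} pos {word}?\t{gold}\t{focus_line}")
    return story


if __name__ == "__main__":
    # Illustrative sentence in which "general" is noun/adjective ambiguous.
    data = [("el", "DA"), ("general", "NC"), ("habla", "VMI"),
            ("en", "SP"), ("general", "AQ")]
    for row in focus_to_babi(data, focus=1, ngram_size=3,
                             lexicon=build_lexicon(data)):
        print(row)
```

With a window of 3 and the focus on the first "general", the story holds one fact for each of "el", "general", and "habla", and the fact for "general" carries both candidate tags.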

## 6.3 Strongly Supervised Memory Networks: Experiments and Results
### 6.3.1 PP-Attachment
* Baseline results: (Test Loss / Test Accuracy = .4298 / .8162)
* Crank up the results by utilizing external word embeddings
### 6.3.2 Dutch Diminutives
* Baseline results: (Test Loss / Test Accuracy = .18 / .9137)
* Crank up the results by utilizing external word embeddings
### 6.3.3 Spanish Part-of-Speech Tagging
* Baseline results: (Test Loss / Test Accuracy = .3104 / .9006)
* Crank up the results by incorporating linguistically richer facts

* Overall: The memory network yields average performance at almost no processing cost

## 6.4 Semi-Supervised Memory Networks
* Previously we explicitly gave the facts for answering the question
* Can the system go and figure out which facts are important for predicting an outcome?
    * Only needs facts, questions, and answers
* Memory networks estimate a layer of probabilities that express the importance of a certain fact for answering a question.
* Can we make these probabilities better by doing multiple exposures of facts to questions?
    * Multi-hop approach:
        * Run the matching step 3 times in a row rather than once
    * Flow:
        * The question is matched against the vector of facts
        * The match produces the probability vector p, which is combined with the facts as embedded by C
        * The output is computed 3 times
        * The probabilities in p are re-estimated on every hop, and every hop performs the same matching steps
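
The flow above can be sketched in plain Python. This is not the chapter's Keras model: the fixed lists `memory_a` and `memory_c` stand in for the learned fact embeddings, and the rule of adding each hop's output to the question vector follows the usual end-to-end memory network formulation, which is assumed here rather than taken from these notes.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def multi_hop(question, memory_a, memory_c, hops=3):
    """Match the question against the facts `hops` times.

    memory_a: fact vectors used for matching (stand-in for embedding A)
    memory_c: fact vectors used for the output (stand-in for embedding C)
    """
    u = list(question)
    history = []
    for _ in range(hops):
        # p: importance of each fact for the current query vector.
        p = softmax([dot(u, m) for m in memory_a])
        # o: the C-embedded facts weighted by p.
        o = [sum(p_i * c[d] for p_i, c in zip(p, memory_c))
             for d in range(len(u))]
        # Assumed update rule: the next hop queries with u + o.
        u = [u_d + o_d for u_d, o_d in zip(u, o)]
        history.append(p)
    return u, history


if __name__ == "__main__":
    facts = [[1.0, 0.0], [0.0, 1.0]]
    _, probs = multi_hop([1.0, 0.0], facts, facts, hops=3)
    for hop, p in enumerate(probs, start=1):
        print(hop, [round(x, 3) for x in p])
```

In this toy setup the probability of the matching fact grows with every hop, which is the re-estimation effect the notes describe.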

## 6.5 Semi-Supervised Memory Networks: Experiments and Results
* More bAbI datasets!
* Indefinite knowledge: A reasoning task that includes disjunctions ("or") and whose answers can be "maybe"
* Data:
    * Fred is either in the school or the park.
    * Mary went back to the office.
    * Is Fred in the school? Maybe
* Training w/ the 3-hop network can significantly improve accuracy over the 1-hop network, as seen w/ the above data
* Sometimes, though, the multi-hop network degrades accuracy, as seen on the Spanish part-of-speech data
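
For reference, a story like the indefinite-knowledge example above can be read from bAbI-format text with a small parser. This is a sketch: `parse_babi` is a hypothetical helper, and the line numbers and supporting fact id in the sample are added to match the standard bAbI file layout rather than copied from the notes. A semi-supervised network would ignore the supporting ids and use only facts, questions, and answers.

```python
def parse_babi(lines):
    """Split bAbI-style lines into numbered facts and questions."""
    facts, questions = {}, []
    for line in lines:
        number, _, text = line.strip().partition(" ")
        if "\t" in text:
            # Question lines: question <TAB> answer <TAB> supporting ids.
            question, answer, support = text.split("\t")
            ids = [int(s) for s in support.split()]
            questions.append((question.strip(), answer, ids))
        else:
            facts[int(number)] = text
    return facts, questions


if __name__ == "__main__":
    sample = [
        "1 Fred is either in the school or the park.",
        "2 Mary went back to the office.",
        "3 Is Fred in the school?\tmaybe\t1",
    ]
    facts, questions = parse_babi(sample)
    print(facts)
    print(questions)
```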