<img src="assets/microsoft_logo.png" width="30%"/>

# Deep Learning for Language in Azure

**Original work by Adam Atkinson, SDE @ [Microsoft Research Montréal](https://www.microsoft.com/en-us/research/lab/microsoft-research-montreal/)** (formerly Maluuba)

**Edited by @Vicky Fu**

## Goal: 

**To familiarize you with the tools, techniques, and lingo when it comes to applying deep learning to language tasks.**

## Prerequisites:

- Machine learning 101: probabilities, vector & matrix calculus, training & inference, discriminative & generative models (supervised vs. unsupervised learning), evaluation setup.

- Neural networks & deep learning 101: activation functions, loss functions, gradients, backpropagation, (stochastic) gradient descent, multi-layer perceptrons (MLPs), convolutional neural networks (CNNs), recurrent neural networks (RNNs).

- Knowledge of Python and TensorFlow for these code samples.

**Don't be intimidated! You don't need to be a neural engineer or have a PhD. If these topics are new or unfamiliar, there are lots of great resources in print and online!**

# Data Science Virtual Machine

The Data Science Virtual Machine (DSVM) is a customized VM image on Microsoft’s Azure cloud built specifically for doing data science. It has many popular data science and other tools pre-installed and pre-configured to jump-start building intelligent applications for advanced analytics. It is available for [Windows Server 2016](http://aka.ms/dsvm/win2016) and [Ubuntu 16.04 LTS](http://aka.ms/dsvm/ubuntu). We also offer [Windows Server 2012](http://aka.ms/dsvm) and [CentOS](http://aka.ms/dsvm/centos) versions, although Windows 2016 and Ubuntu are the recommended options. 

You can try the Data Science VM for free for 30 days (with $200 in credits) with a free [Azure Trial](http://azure.com/free). The Ubuntu DSVM also provides a free trial through the [Azure Test Drive](http://aka.ms/dsvm/testdrive). The Test Drive provides full access to your own instance of the VM with just a free Microsoft account. No Azure subscription or credit card is needed.

### Deploy a DSVM for Linux (Ubuntu) with password authentication using the portal
<a href="https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2FAzure%2FDataScienceVM%2Fmaster%2FScripts%2FCreateDSVM%2FUbuntu%2Fazuredeploy.json" target="_blank">
<img src="http://azuredeploy.net/deploybutton.png"/>
</a>

### Turn on the Jupyter Notebook

Open `http://vm_ip:8000` in any browser, where `vm_ip` is the IP address of your VM.

## Social Good Applications

Humanity is generating an increasingly enormous amount of unstructured text in the form of language. What does this text tell us about ourselves and how does this understanding help us address real problems?

### Examples

- Identifying health concerns and responding to crises.
    - There is work on assessing mental health and on [suicide prevention](https://onlinelibrary.wiley.com/doi/abs/10.1111/sltb.12312).
    
    
- Filtering toxic content.
    - [Classifying hate speech](http://www.aclweb.org/anthology/W17-1101).


- Identifying fake news.
    - See the ["Liar, Liar, Pants on Fire"](https://arxiv.org/abs/1705.00648) dataset and the paper's citations.
    

- Validating facts.
    - See the [First Workshop on Fact Extraction and Verification (FEVER)](http://fever.ai/).


- Studying social biases with text analysis.
    - See the excellent paper ["Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings"](https://arxiv.org/abs/1607.06520).

<img src="assets/biased_word_embeddings.png" width="100%"/>

- The x-axis is the word embedding projected onto the vector difference between "she" and "he". This shows which gender the word is more closely associated with, i.e. its bias or skew.
- The y-axis is the projection onto the gender-neutrality component of the word embedding. Words above the horizontal line are gender neutral.
- **We need to move the gender-neutral words (above the horizontal line) so they are equidistant from both genders (onto the vertical line).**

[Source: shameless Microsoft Research plug.](https://arxiv.org/abs/1607.06520)

## Tasks

**Natural Language Processing (NLP)** encompasses tasks related to:
1. **speech recognition** (mainly refining predictions from **automatic speech recognition (ASR)** models).
1. the **syntax** of language.
1. the **semantics** of language.
1. language **generation**.
1. analyzing **discourse**.

NLP and natural language understanding (NLU) focus more on language itself than on extracting information, which is the domain of **information retrieval (IR)**, but the two fields intersect.

Generally we want to _understand_ language by labelling it or generating new text.

### Text classification

1. Label a section of text **(N -> 1)**.
1. Label each word or token, i.e. sequence labelling **(N -> N)**.

E.g.
- Sentiment analysis.
- Classifying the intent of a sentence (often paired with slot filling).
- Extracting entities.
- Part-of-speech tagging.
- Coreference resolution.
- Logical entailment.
- Extractive summarization.

### Text generation:

1. Consume some text and produce more; the lengths don't need to match **(N -> M)**.

E.g.
- Machine translation.
- Generating natural language answers to questions.
- Image captioning.
- Language modelling.
- Producing responses in dialogue.
- Abstractive summarization.

\* **Caveat**: natural language generation is challenging, especially for neural networks or without the use of templates.

## Why Deep Learning?

**Natural language:**

- has a high dimensional, sparse feature space because of large vocabularies, and this space becomes combinatorially large as you create features composed of multiple tokens.

- has many nuanced rules, ambiguities, and exceptions that can't easily be captured by rule-based systems.

- depends on order and context.

- is really noisy!

**Deep neural networks:**

- learn dense representations in high dimensional feature spaces.

- learn representations that capture complex relationships in the data.

- learn compositional and contextual relationships.

- perform well on large amounts of data, generalize well, and handle noise.

Data for neural networks boils down to small normalized/standardized floating point values. This means you can combine text representations with representations for other types of data (e.g. images, audio, video) to create **multi-modal** models.

There's lots of language data out there, so robust representations can be learned, borrowed, and easily applied to other tasks using **transfer learning** or **fine tuning**. This makes it easy to bootstrap your own models.

## Features and Representations

### Preprocessing

Raw text is messy.

- Fold case and accents.
- Strip or compress whitespace.
- **Tokenize** the text.
- **Stem** or **lemmatize** tokens to remove affixes.
- Heuristic or regular-expression-based substitution or deletion of tokens/characters, e.g. Unicode whitespace characters and emojis 😰.
- Collect the **vocabulary** of tokens.
- Add special tokens like sentence markers (**beginning/end of sentence** BOS/EOS) and **out-of-vocabulary** (OOV or UNK).
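
The steps above can be sketched in a few lines of standard-library Python. This is only an illustration: the special-token strings (`<unk>`, `<s>`, `</s>`), the regular expression, and the helper names are assumptions made for this sketch, not part of the tutorial's `embedding.py`.

```python
import re
import unicodedata
from collections import Counter

UNK, BOS, EOS = "<unk>", "<s>", "</s>"  # hypothetical special-token names


def normalize(text):
    """Fold case, strip accents, and compress whitespace."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"\s+", " ", no_accents).strip()


def tokenize(text):
    """Split into word and punctuation tokens, wrapped in sentence markers."""
    return [BOS] + re.findall(r"\w+|[^\w\s]", normalize(text)) + [EOS]


def build_vocab(token_lists, min_count=1):
    """Map each sufficiently frequent token to an integer index; index 0 is UNK."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    vocab = {UNK: 0}
    for token, count in counts.most_common():
        if count >= min_count and token not in vocab:
            vocab[token] = len(vocab)
    return vocab


corpus = ["The  café was great!", "The film was not great."]
tokenized = [tokenize(s) for s in corpus]
vocab = build_vocab(tokenized)

# Unseen words ("terrible") fall back to the UNK index.
ids = [vocab.get(t, vocab[UNK]) for t in tokenize("The café was terrible")]
```

Stemming or lemmatization is left out here; a library such as NLTK would normally supply it.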

### Feature Engineering

- Calculate frequency based features likening text to a **bag-of-words**.
    - e.g. **term frequency-inverse document frequency** (tf-idf):
    
    $tfidf(t,d,D) = tf(t,d)*idf(t,D)$
    
    where $t$ = term, $d$ = a specific document, $D$ = the corpus of all documents.
    
    
- Boolean or **one-hot vector** encoded flags for tokens or features.
    - a one-hot encoding vector $V$ has $V[index] = 1$ where $index$ is the index of the word in the vocabulary, and $V[...] = 0$ otherwise
    
    
- Compute features for n-length windows of tokens, called **n-grams**.
    - E.g. "artificial intelligence for social good" has trigrams:
        - ("artificial","intelligence","for")
        - ("intelligence","for","social")
        - ("for","social","good")


- Fill a term by document frequency matrix where $M_{term,document}=f_{term,document}$ and apply dimensionality reduction or matrix factorization techniques to extract features. E.g. **singular value decomposition**.
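
As a rough illustration of two of the features above, here is a minimal sketch of n-gram extraction and tf-idf. There are several tf-idf variants; this one assumes relative term frequency and an unsmoothed logarithmic idf, and the function names are chosen for this example only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return every window of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf(term, doc):
    """Term frequency: share of the document's tokens equal to `term`."""
    return Counter(doc)[term] / len(doc)


def idf(term, corpus):
    """Inverse document frequency: log(N / number of documents containing term)."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing) if containing else 0.0


def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)


docs = [
    "artificial intelligence for social good".split(),
    "deep learning for language".split(),
]
trigrams = ngrams(docs[0], 3)
common_score = tfidf("for", docs[0], docs)    # "for" is in every document
rare_score = tfidf("social", docs[0], docs)   # "social" is in only one
```

A term that appears in every document gets an idf of zero, so it contributes nothing, while a term confined to one document is weighted up.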

### Vector Representations

Neural networks take **tensors** (N-dimensional arrays) as input so we need to represent discrete tokens in this format.

We can use a one-hot vector, but with a million or billion-word vocabulary we're going to run out of memory. Also the signals are sparse for the network. Besides, why do we need to hand-engineer special features?

Instead, we want a denser vector with a lower fixed dimension, independent of the vocabulary size. How? **Learn it!**

The idea here is to train a neural network to map focus keywords to their surrounding context words in a section of text, then use the neural net's weights as the word representations (**word vectors**).

Think of the learned word vectors as columns in a matrix $W = [w_1, ... , w_n]$. To encode a word with a one-hot vector $x$ we multiply it by $W$, to get its embedding $y = Wx$. The neural network learns this weight matrix.
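
To make the $y = Wx$ step concrete, the sketch below multiplies a tiny matrix by a one-hot vector and checks that the result is simply one column of $W$. The matrix values are made up for illustration; real embeddings are learned.

```python
# Toy embedding matrix W: 2 rows (embedding dimensions), 3 columns (one per word).
# The values are invented purely for illustration.
W = [
    [0.1, 0.4, 0.7],
    [0.2, 0.5, 0.8],
]


def one_hot(index, size):
    """Vector of zeros with a single 1 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def column(matrix, index):
    """Pick out one column of the matrix."""
    return [row[index] for row in matrix]


x = one_hot(1, 3)   # the word at vocabulary index 1
y = matvec(W, x)    # its embedding
```

Because the product only ever selects a column, frameworks implement it as a table lookup rather than a real matrix multiplication.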

<img src="assets/word_embeddings.png" width="60%"/>

[Source](https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/)

#### Methods

**[Good] Continuous Bag Of Words (CBOW)**: Given the _context_, predict the _keyword_.

**[Better] Skip-gram**: Given the _keyword_, predict the _context_. Both CBOW and skip-gram are available in **word2vec**.

<img src="assets/cbow_skipgram.png" width="67%"/>

[Source](http://rohanvarma.me/Word2Vec/)

**[Best] Global Vectors for Word Representation [(GloVe)](https://nlp.stanford.edu/projects/glove/)**: Learn word vectors for terms such that their dot product approximates the logarithm of their probability of co-occurrence. There's a good blog post [here](https://blog.acolyer.org/2016/04/22/glove-global-vectors-for-word-representation/).

These learned representations capture semantic relationships, so words can be manipulated semantically using mathematical operators. These representations are learned in an **unsupervised** manner.
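
The classic illustration of this is the analogy "king - man + woman ≈ queen". The sketch below uses hand-made two-dimensional vectors (not learned embeddings) chosen so the arithmetic works out; with real GloVe or word2vec vectors the same nearest-neighbour search is run over the whole vocabulary.

```python
import math

# Hand-made 2-d vectors (axes roughly "royalty" and "gender"); NOT learned embeddings.
vectors = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.1],
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


result = analogy("king", "man", "woman")
```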

## Models

Language is sequential and has contextual dependencies. Additionally, linguistic features appear at different offsets in the sequence, so we need models that can be agnostic to the precise starting position of a feature. **Convolutional** and **recurrent** neural networks can do both of these.

### Input

These are the word vectors or standardized hand-crafted numerical features. 

Word vectors can also be loaded as rows in a weight matrix so sparse or one-hot token representations can be fed directly into the model. This is called an **embedding layer** and the embeddings can be tuned as weights of the network by setting them to be **trainable**.

### CNNs

Apply a mathematical transformation (i.e. a **convolution**) over all inputs and weights in a patch of a volume. These patches are computed over windows of the input, producing local features. These convolutional layers are followed by **max pooling** to be less sensitive to the precise location of features.

See [Stanford's CS231n](http://cs231n.github.io/convolutional-networks/) for a good introduction.

#### For Text

We can treat a sequence of tokens like a one dimensional image where the channels are the components of the token representations (e.g. word vectors). The **filter size** and **stride** of the convolution determine the n-grams for which features are learned.

[Denny Britz's blog](http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/) explains this well.

Convolutions are also highly parallelizable which makes CNNs faster to train on GPUs.
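
A minimal sketch of a single convolutional filter sliding over a token sequence, followed by max-over-time pooling, is shown below. It assumes one filter, no bias, and no non-linearity, and all values are toy numbers; a real model learns many filters of several sizes.

```python
def conv1d(sequence, kernel, stride=1):
    """Slide `kernel` over `sequence` and return one feature per window.

    sequence: list of token vectors, shape [num_tokens][embedding_dim]
    kernel:   list of weight vectors, shape [filter_size][embedding_dim]
    """
    filter_size = len(kernel)
    features = []
    for start in range(0, len(sequence) - filter_size + 1, stride):
        window = sequence[start:start + filter_size]
        features.append(sum(
            w * x
            for weights, token in zip(kernel, window)
            for w, x in zip(weights, token)
        ))
    return features


def max_pool(features):
    """Max-over-time pooling: keep the strongest response, wherever it occurred."""
    return max(features)


# Four tokens with 2-d embeddings and a bigram filter (filter_size = 2); toy values.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
bigram_filter = [[1.0, 0.0], [0.0, 1.0]]

features = conv1d(tokens, bigram_filter)   # one feature per bigram
pooled = max_pool(features)
```

With a filter size of 2 and a stride of 1 every bigram is scored; a stride of 2 would score only every other bigram.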

<img src="assets/cnn_lang.png" width="90%"/>

[Source](http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/)

### RNNs

Learn a representation at each timestep of a sequence using the input at time $t$ and the output and hidden states from the previous timestep $t-1$ (hence recurrence).

Think of them as long or very deep neural nets. Their sequential nature makes them well-suited for text. **GRU** and **LSTM** are the most popular flavours of RNN units.

See [Chris Olah's](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) and [Andrej Karpathy's](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) blogs.

#### For Text

Language tasks often use **bidirectional** RNNs to capture forward and backward dependencies in the data. A bidirectional RNN (or BidiRNN, BiRNN) has an RNN operating forward on the sequence $t_1$ -> $t_n$ and another independent RNN operating backward on the sequence $t_n$ -> $t_1$. The hidden states and outputs at each time step are the concatenation of those produced by the forward and backward RNNs.

RNN layers are stacked by treating each RNN layer as an encoder, i.e. all outputs/hidden states for layer $L$ are computed and then passed as input to the next RNN layer $L+1$.

<img src="assets/birnn.png" width="70%"/>

[Source](http://www.cl.cam.ac.uk/~pv273/slides/LSTMslides.pdf)

The final output and hidden state of an RNN is an encoding that captures information for the whole sequence.
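
The recurrence and the bidirectional concatenation can be sketched as follows. This assumes a plain tanh RNN cell rather than a GRU or LSTM, and the parameters are arbitrary toy values rather than learned weights.

```python
import math


def rnn_step(x, h_prev, w_x, w_h, b):
    """One vanilla-RNN step: h_t = tanh(w_x @ x_t + w_h @ h_{t-1} + b)."""
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(w_x[j], x))
            + sum(w * hi for w, hi in zip(w_h[j], h_prev))
            + b[j]
        )
        for j in range(len(b))
    ]


def run_rnn(inputs, params, hidden_size):
    """Return the hidden state at every timestep, starting from zeros."""
    h = [0.0] * hidden_size
    states = []
    for x in inputs:
        h = rnn_step(x, h, *params)
        states.append(h)
    return states


def run_birnn(inputs, fwd_params, bwd_params, hidden_size):
    """Concatenate forward and backward states at each timestep."""
    fwd = run_rnn(inputs, fwd_params, hidden_size)
    # Run the second RNN on the reversed sequence, then re-align it in time.
    bwd = run_rnn(inputs[::-1], bwd_params, hidden_size)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]


# Toy parameters (w_x, w_h, b): 1-d inputs, 2 hidden units. Values are arbitrary.
fwd_params = ([[0.5], [-0.5]], [[0.1, 0.0], [0.0, 0.1]], [0.0, 0.0])
bwd_params = ([[1.0], [0.3]], [[0.2, 0.0], [0.0, 0.2]], [0.0, 0.0])
sequence = [[1.0], [0.0], [-1.0]]

outputs = run_birnn(sequence, fwd_params, bwd_params, hidden_size=2)
```

Each element of `outputs` has twice the hidden size, because it holds the forward state followed by the backward state for that timestep.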

### Encoder-Decoder

Neural network models for language generally follow an encoder-decoder recipe. 

- An **encoder network** reduces the source sequence to a tensor representation.
- A **decoder network** expands the encoded representation to match the target sequence.

The intermediate encoded representation can be fixed or variable length and can be used as a feature to decoders in other tasks.

<img src="assets/encdec.jpg" width="70%"/>

[Source](https://talbaumel.github.io/blog/attention/)

### Output

The encoded representation of the whole sequence is often fed into a **dense** or **fully connected** layer followed by a softmax. This produces an output probability distribution, the argmax of which is the predicted class or token. This technique can also be applied independently at each time step to produce an output for each element of the sequence.
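
As a small illustration, the sketch below turns a vector of dense-layer outputs (logits) into probabilities with a softmax and picks the argmax. The logit values are toy numbers.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities; subtract the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 0.5, -1.0]  # toy dense-layer outputs for three classes
probs = softmax(logits)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
```

Subtracting the maximum logit before exponentiating does not change the result but avoids overflow for large scores.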

In the case of text generation, the probability distribution has as many buckets as the vocabulary has tokens, so we often have to limit it to the top $K$ most significant components to make the softmax tractable.

## Advanced Neural Tools for Language

### Attention

We learn the weighted contribution of the surrounding tokens (the context) thereby learning to pay **attention** to specific words. Attention is usually applied in the decoder.

Chris Olah has a great [blog](https://distill.pub/2016/augmented-rnns/) explaining this too.
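
A minimal sketch of one attention step is given below. It assumes dot-product scoring between the decoder state and each encoder state, which is only one of several scoring functions in use, and the vectors are toy values.

```python
import math


def softmax(scores):
    """Normalize scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, encoder_states):
    """Return (weights, context) for one decoder step using dot-product scores."""
    scores = [sum(q * k for q, k in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # The context vector is the attention-weighted average of the encoder states.
    context = [
        sum(w * state[d] for w, state in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context


encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy values, one per source token
decoder_state = [0.0, 2.0]
weights, context = attend(decoder_state, encoder_states)
```

Here the second and third source tokens align equally well with the decoder state, so they receive equal and larger weights than the first.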

<img src="assets/attention.png" width="80%"/>

[Source](https://distill.pub/2016/augmented-rnns/)

### Beam Search

In order to compute a likely output sequence tractably, we keep only $K$ candidate paths through the output distributions.
At each time step $t$ we extend every one of the $K$ paths with each possible token at time $t+1$, then keep the $K$ highest-scoring extended paths. Here $K$ is our **beam width**.
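
The sketch below shows the idea on a toy model. The transition table and the `toy_model` function are hypothetical stand-ins for a decoder's softmax output, and end-of-sequence handling and length normalization are omitted to keep it short.

```python
import math


def beam_search(next_log_probs, start, steps, beam_width):
    """Keep the `beam_width` highest-scoring partial sequences at every step.

    next_log_probs(prefix) returns {token: log-probability} for the next token.
    """
    beams = [([start], 0.0)]
    for _ in range(steps):
        candidates = [
            (prefix + [token], score + log_p)
            for prefix, score in beams
            for token, log_p in next_log_probs(prefix).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


# Hypothetical toy model: the next-token distribution depends only on the last token.
TABLE = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"a": 0.55, "b": 0.45},
    "b": {"a": 0.1, "b": 0.9},
}


def toy_model(prefix):
    return {tok: math.log(p) for tok, p in TABLE[prefix[-1]].items()}


greedy = beam_search(toy_model, "<s>", steps=2, beam_width=1)
wide = beam_search(toy_model, "<s>", steps=2, beam_width=2)
```

With a beam width of 1 this reduces to greedy decoding, which commits to "a" first and ends with a sequence of probability 0.33. A beam width of 2 keeps "b" alive and finds the more likely sequence "b b" with probability 0.36.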

<img src="assets/beam_search.jpg" width="90%"/>

[Source](https://talbaumel.github.io/blog/attention/)

### Character-Level Embeddings

We can model text as a sequence of characters instead of word tokens. We can learn character embeddings and use these as features instead. This eliminates out-of-vocabulary tokens and shrinks the output space, but requires lots of data to train and bigger networks to handle longer sequences.

### Hierarchical Softmax

We can avoid the big-softmax problem by modelling each component of the output probability distribution as the product of node probabilities along a path in a binary tree. This reduces the cost of computing one word's probability from the size of the vocabulary to at most the depth of the tree, $\log_{2}(|vocabulary|)$ for a balanced tree. There's a great blog post [here](http://ruder.io/word-embeddings-softmax/index.html#hierarchicalsoftmax).
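
A minimal sketch with a four-word vocabulary follows. The tree layout, node names, and node scores are invented for illustration; in a real model each node's score would come from a dot product between the hidden state and a learned node vector.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Complete binary tree over a 4-word vocabulary. Each word is a path of
# (internal node, go_left) decisions from the root. Node scores are toy values.
node_scores = {"root": 0.3, "left": -1.2, "right": 2.0}
paths = {
    "cat": [("root", True), ("left", True)],
    "dog": [("root", True), ("left", False)],
    "car": [("root", False), ("right", True)],
    "bus": [("root", False), ("right", False)],
}


def word_probability(word):
    """Multiply the branch probabilities along the word's path."""
    prob = 1.0
    for node, go_left in paths[word]:
        p_left = sigmoid(node_scores[node])
        prob *= p_left if go_left else 1.0 - p_left
    return prob


probs = {word: word_probability(word) for word in paths}
```

Each word needs only two sigmoid evaluations (the tree depth, $\log_2 4$) instead of a normalization over all four words, and the leaf probabilities still sum to 1.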

## A Note on Evaluation

How do we make sure our model is decent?

- For classification:
    - Compute accuracy, **precision**, **recall**, **ROC curves**, **false positive** and **false negative** rates, and **F1** score.
    - Plot a **confusion matrix** counting the number of times a ground truth label matches a predicted label for all possible labels.

<img src="assets/confusion_matrix.png" width="50%"/>

[Source](https://stackoverflow.com/questions/31324218/scikit-learn-how-to-obtain-true-positive-true-negative-false-positive-and-fal)
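
For a binary task these classification metrics can be computed directly from the confusion matrix counts. The sketch below uses toy labels and helper names chosen for this example; in practice a library such as scikit-learn provides equivalent functions.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred):
    """Count (true label, predicted label) pairs."""
    return Counter(zip(y_true, y_pred))


def precision_recall_f1(y_true, y_pred, positive=1):
    cm = confusion_matrix(y_true, y_pred)
    tp = cm[(positive, positive)]
    fp = sum(n for (t, p), n in cm.items() if p == positive and t != positive)
    fn = sum(n for (t, p), n in cm.items() if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = [1, 1, 1, 0, 0, 0]  # toy ground-truth labels
y_pred = [1, 1, 0, 1, 0, 0]  # toy predictions
cm = confusion_matrix(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```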
    
- For sequences:
    - **Mean** and **absolute** error over the sequence.
    - **Levenshtein** or **edit distance**: how many edits are needed to transform one sequence to another.


- For generated language:
    - Examine the lengths of the generated phrases compared to those of the training targets.
    - **Word Error Rate (WER)**: edit distance for language.
    - Examine fluency by comparing the output to that of a language model, i.e. is the next word close to what a language model predicts? Fluency can be computed as the **perplexity** of the model output under a language model. (See [MSR Montreal's question-generation paper](https://arxiv.org/abs/1705.02012)).
    - **BLEU score**: correlates with human judgement score on the quality of generated text. Similarly **ROUGE** and **METEOR**.
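
Edit distance and word error rate are simple enough to sketch directly. The version below is the standard dynamic-programming Levenshtein distance with unit costs; the function names are chosen for this example.

```python
def edit_distance(source, target):
    """Levenshtein distance between two sequences (strings or token lists)."""
    previous = list(range(len(target) + 1))
    for i, s in enumerate(source, start=1):
        current = [i]
        for j, t in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (s != t),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)


char_distance = edit_distance("kitten", "sitting")
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

The same function works on characters or on word tokens, which is why WER is described as edit distance for language.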


**Note**: We can't optimize these measures directly in a neural network since they aren't **differentiable**. (You can use Reinforcement Learning / REINFORCE if you're ambitious).

## Finally an Example!

Here we'll classify a document as having a positive or negative sentiment. Our approach is as follows:

1. **Features**: tokenize a dataset of text files and look up the GloVe vector for each token.

1. **Model**: feed the vector representation of each token through a bidirectional recurrent neural network and use the final output of the RNN as input to a densely connected layer.

1. **Training**: minimize the cross entropy between the sigmoid of the model output and the sentiment label, for each document.

Our sentiment data will come from a subset of IMDB reviews, hosted [here](https://www.cs.cornell.edu/people/pabo/movie-review-data/). We include the 50-dimensional [GloVe](https://nlp.stanford.edu/projects/glove/) word vectors from the Wikipedia and Gigaword corpora [here](http://nlp.stanford.edu/data/glove.6B.zip).

See `sentiment_rnn.py` for a standalone implementation.

Note a pretrained model is provided at https://msrmtl-public-store.azureedge.net/ai4good/sentiment_90pct_639ep.tar.gz.

This tutorial works with **TensorFlow 1.8**.

-----

First we read the GloVe vectors and build two lookup tables:
1. One that maps word (string) -> token index (int), exposed by `look_up_word`.
1. `glove`: token index (int) -> GloVe embedding (array of float).

These use an `UNK` token for out-of-vocabulary words internally, and provide a padding token for shorter snippets of text. See `embedding.py` for more details.

In [1]:
import logging
import numpy as np
import os
import shutil
import tensorflow as tf

from embedding import glove, look_up_word, PAD_TOKEN

# Set to True to start from scratch, False to continue training an existing model
FROM_SCRATCH = False

TRAIN_BATCH_SIZE = 16
EPOCHS = 2

MAX_LENGTH = 200
VALIDATION_SPLIT = 0.2
EMBEDDING_DIMS = glove.shape[1]
RNN_UNITS = 64

logging.basicConfig(level=logging.DEBUG)


Next we get the raw dataset, tokenize it, and generate labels.

For each batch we generate the token indices for embedding lookup, sequence lengths, and labels. We don't do any fancy tokenization here since the data is so clean.

We define a generator we can use to create TensorFlow `Dataset` objects that feed our data through the computational graph. In order to evaluate our model on the whole training and validation sets after each epoch we create separate dataset initializers and a dynamic batch size.

In [2]:
cache_dir = '.cache'

if FROM_SCRATCH:
    checkpoint_dir = 'experiment1'
    if os.path.exists(checkpoint_dir):
        shutil.rmtree(checkpoint_dir)
    start_epoch = 0
else:
    # Put downloaded pretrained model here or your own trained one #
    checkpoint_dir = 'sentiment_90pct_639ep'
    start_epoch = 639

for d in [cache_dir, checkpoint_dir]:
    if not os.path.exists(d):
        os.mkdir(d)

RELATIVE_POLARITY_DATASET_SUBDIR = os.path.join('datasets', 'review_polarity')
DATA_DIR = os.path.join(
    cache_dir, RELATIVE_POLARITY_DATASET_SUBDIR, 'txt_sentoken')
do_extract = not os.path.exists(DATA_DIR)

dataset = tf.keras.utils.get_file(
    fname='review_polarity.tar.gz',
    cache_dir=cache_dir,
    cache_subdir=RELATIVE_POLARITY_DATASET_SUBDIR,
    origin='https://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz',
    extract=do_extract)

train_file_sents, val_file_sents = [], []

for d, sent in [(os.path.join(DATA_DIR, sd), score) for sd, score in [('pos', 1.), ('neg', 0.)]]:
    # Sort so the train/validation split is the same on every run
    files = sorted(os.listdir(d))
    split_index = int((1-VALIDATION_SPLIT)*len(files))
    train_file_sents += [(os.path.join(d, f), sent)
                         for f in files[:split_index]]
    val_file_sents += [(os.path.join(d, f), sent)
                       for f in files[split_index:]]


def make_token_generator_for_files(src_file_sents):
    def generator():
        for f, sent in src_file_sents:
            # Put your custom tokenizing code here
            # Use nltk.word_tokenize, but in this case the dataset is processed so we don't need to
            #   import nltk
            #   nltk.download('punkt')
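            # Read the review, split each line on whitespace, and look up each lowercased token
            with tf.gfile.GFile(f, 'r') as review_file:
                line_token_ids = [look_up_word(t.lower())
                                  for line in review_file.readlines()
                                  for t in line.split()][:MAX_LENGTH]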
            token_ids_length = len(line_token_ids)
            # Could also do `padded_batch` here
            line_token_ids += [PAD_TOKEN] * (MAX_LENGTH - token_ids_length)
            yield (line_token_ids, token_ids_length, sent)
    return generator

batch_size = tf.placeholder(tf.int64)

train_set_size, val_set_size = len(train_file_sents), len(val_file_sents)

train_dataset = tf.data.Dataset.from_generator(
    make_token_generator_for_files(train_file_sents), (tf.int32, tf.int32, tf.float32), (tf.TensorShape([None]), tf.TensorShape(None), tf.TensorShape(None)))\
    .shuffle(train_set_size)\
    .batch(batch_size)

val_dataset = tf.data.Dataset.from_generator(
    make_token_generator_for_files(val_file_sents), (tf.int32, tf.int32, tf.float32), (tf.TensorShape([None]), tf.TensorShape(None), tf.TensorShape(None)))\
    .shuffle(val_set_size)\
    .batch(batch_size)

Downloading data from https://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz


We create an iterator that will choose a dataset based on the initializer used and will load batch data into the graph. Token embeddings are looked up through a table parameterized by the `glove` map.

In [3]:
iterator = tf.data.Iterator.from_structure(
    train_dataset.output_types, train_dataset.output_shapes)

train_init_op = iterator.make_initializer(train_dataset)
val_init_op = iterator.make_initializer(val_dataset)

batch_token_ids, batch_seq_lens, batch_labels = iterator.get_next()

embedding_table = tf.get_variable("embedding_table", initializer=glove)

batch_embedding = tf.nn.embedding_lookup(embedding_table, batch_token_ids)

Next we define our model. We feed all the batch sequences through a bidirectional recurrent neural network having forward and backward **gated recurrent unit (GRU)** RNN layers. We chose GRU here because it has fewer parameters making it faster to train, and its hidden state is equal to its output at each time step.

We take the last hidden state (i.e. last output) of each of the forward and backward units and concatenate them, since this is a representation of all the information learned over the whole sequence. These features are then fed as input to a linear layer to create **logits**. We don't apply a non-linearity or squashing function since TensorFlow can incorporate these directly into loss functions for numerical stability and space efficiency.

In [4]:
fwd = tf.contrib.rnn.GRUCell(num_units=RNN_UNITS)
bwd = tf.contrib.rnn.GRUCell(num_units=RNN_UNITS)

_, final_rnn_state = tf.nn.bidirectional_dynamic_rnn(
    fwd,
    bwd,
    batch_embedding,
    sequence_length=batch_seq_lens,
    dtype=tf.float32
)

fwd_state, bwd_state = final_rnn_state

last_rnn_state = tf.concat([fwd_state, bwd_state], axis=1)

sentiment_logits = tf.layers.dense(
    last_rnn_state,
    1,
    use_bias=True
)

Instructions for updating:
seq_dim is deprecated, use seq_axis instead


Instructions for updating:
batch_dim is deprecated, use batch_axis instead



Our loss is the binary cross entropy between the squashed logits (predictions) and labels. One-hot encoding with softmax is mathematically equivalent here. We also define an `accuracy` operator used to evaluate our model's performance, independent of the loss. We use the Adam optimizer with default parameters because this generally works well. Finally we add a `Saver` to save and restore our model.

In [5]:
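# Reshape the labels to [batch, 1] so they line up with the logits instead of broadcasting
loss_op = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
    logits=sentiment_logits, labels=tf.reshape(batch_labels, [-1, 1])))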

accuracy = tf.metrics.accuracy(
    batch_labels,
    tf.greater(sentiment_logits, tf.zeros(tf.shape(sentiment_logits)))
)

optimizer = tf.train.AdamOptimizer()
trainer = optimizer.minimize(loss_op)
saver = tf.train.Saver()
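
To see why the loss works on logits rather than on sigmoid outputs, the sketch below compares a naive binary cross entropy with the rearranged form that TensorFlow documents for `sigmoid_cross_entropy_with_logits`, namely `max(x, 0) - x * z + log(1 + exp(-abs(x)))`. It is plain Python for a single example, not the library's implementation, and the function names are chosen for this illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def naive_bce(logit, label):
    """Cross entropy computed from the sigmoid output directly."""
    p = sigmoid(logit)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))


def stable_bce(logit, label):
    """Equivalent form that never takes the log of a value rounded to 0."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def predict(logit):
    """sigmoid(logit) > 0.5 exactly when logit > 0, so the sign is enough."""
    return 1.0 if logit > 0 else 0.0


moderate = (naive_bce(0.7, 1.0), stable_bce(0.7, 1.0))  # both forms agree here
extreme = stable_bce(100.0, 0.0)                        # stays finite for a large logit
```

For moderate logits the two forms agree. For very large logits the sigmoid rounds to exactly 1, and the naive form can fail when it takes the logarithm of zero, while the rearranged form stays finite. The `predict` helper mirrors the `tf.greater(sentiment_logits, 0)` test used for accuracy.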

Before our training loop we need to initialize global variables (the model weights, unless they are restored from a checkpoint), local variables (used by `tf.metrics.accuracy`), and lookup tables. We reinitialize our dataset iterators on each epoch, and when our training iterator is out of data we evaluate our model on the entire training and validation sets.

In [6]:
session = tf.Session()

if FROM_SCRATCH:
    session.run(tf.global_variables_initializer())
else:
    saver.restore(session, os.path.join(checkpoint_dir, 'model-{0}'.format(start_epoch)))

session.run(tf.local_variables_initializer())
session.run(tf.tables_initializer())

last_epoch = start_epoch + EPOCHS

for i in range(start_epoch + 1, last_epoch + 1, 1):
    session.run(train_init_op, feed_dict={batch_size: TRAIN_BATCH_SIZE})
    logging.info('='*50)
    logging.info('EPOCH %d ' % i + '-'*40)
    # Iterate over batches
    batchn = 0
    while True:
        try:
            loss, acc, bs, _ = session.run(
                [loss_op, accuracy, batch_size, trainer], feed_dict={batch_size: TRAIN_BATCH_SIZE})

            # Print stats for the first and last batch of each epoch for debugging
            if batchn == 0 or batchn == ((train_set_size // bs) - 1):
                logging.info('ep={}, batch={}, loss={:.5f}, acc={:.4f}'.format(
                    i, batchn, loss, acc[0]))

            batchn += 1

        except tf.errors.OutOfRangeError:
            break

    session.run(train_init_op, feed_dict={batch_size: train_set_size})
    loss, acc = session.run([loss_op, accuracy])
    logging.info('-'*50)
    logging.info(
        'TRAIN RESULTS: loss={:.5f}, acc={:.4f}'.format(loss, acc[0]))

    session.run(val_init_op, feed_dict={batch_size: val_set_size})
    loss, acc = session.run([loss_op, accuracy])
    logging.info(
        'VALIDATION RESULTS: loss={:.5f}, acc={:.4f}'.format(loss, acc[0]))

    saver.save(session, os.path.join(checkpoint_dir, 'model'), i)

logging.info('Done training.')
session.close()

INFO:tensorflow:Restoring parameters from sentiment_90pct_639ep/model-639


NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for sentiment_90pct_639ep/model-639
	 [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]


Now let's restore our model and try it out. We define a runtime method that feeds data forward through the graph without running the `trainer` op, so no gradients are backpropagated. Note that there is no sigmoid op in the graph, so we look at the sign of the sentiment logit instead: a logit of 0 corresponds to a sigmoid output of 0.5, which is the decision boundary.

In [None]:
def interact(sess):
    logging.info('='*40)
    logging.info('Interactive runtime')
    logging.info('='*40)
    inp = input('Enter a phrase or `q` to quit: ')
    while inp and inp != 'q':
        logging.info('Query: %s' % inp)
        line_token_ids = [look_up_word(t.lower())
                          for t in inp.split()][:MAX_LENGTH]
        token_ids_length = len(line_token_ids)
        line_token_ids += [PAD_TOKEN] * (MAX_LENGTH - token_ids_length)

        pred = sess.run([sentiment_logits], feed_dict={
            batch_token_ids: [line_token_ids], batch_seq_lens: [token_ids_length], batch_size: 1})

        if pred[0] >= 0:
            logging.info('Result: POSITIVE (+)')
        else:
            logging.info('Result: NEGATIVE (-)')

        inp = input('Enter a phrase or `q` to quit: ')
        

# Create a new session, load the model in, and try it out.
new_session = tf.Session()
print(checkpoint_dir)
saver.restore(new_session, os.path.join(checkpoint_dir, 'model-{0}'.format(last_epoch)))
new_session.run(tf.local_variables_initializer())
new_session.run(tf.tables_initializer())

interact(new_session)

new_session.close()
logging.info('All done.')
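
The sign test in `interact` can be checked outside TensorFlow. The following is a minimal sketch in plain Python, not part of the notebook's graph: `sigmoid`, `label_from_logit`, and `label_from_probability` are hypothetical helper names introduced here. It illustrates that thresholding the logit at 0 gives the same label as thresholding the sigmoid probability at 0.5.

```python
import math


def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def label_from_logit(logit):
    """Classify by the sign of the raw logit (hypothetical helper)."""
    return 'POSITIVE' if logit >= 0 else 'NEGATIVE'


def label_from_probability(logit, threshold=0.5):
    """Classify by thresholding the sigmoid probability (hypothetical helper)."""
    return 'POSITIVE' if sigmoid(logit) >= threshold else 'NEGATIVE'


# Both rules agree on a handful of sample logits.
for logit in (-3.2, -0.1, 0.0, 0.4, 5.0):
    assert label_from_logit(logit) == label_from_probability(logit)
    print('logit=%+.1f  p=%.3f  %s' % (logit, sigmoid(logit), label_from_logit(logit)))
```

Skipping the sigmoid at inference time saves an op and is safe because the sigmoid is monotonic; it only matters if you want to report a confidence score alongside the label.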

## Next Steps

#### Evaluate the model.

- Test the model on different sentiment datasets, e.g. Rotten Tomatoes.
- Compare the different evaluation metrics defined earlier in this notebook.
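
As a starting point for this kind of evaluation, here is a rough sketch in plain Python, assuming the metrics of interest include accuracy, precision, recall, and F1 for the positive class. The function name `binary_metrics` and the toy labels are illustrative and do not come from the notebook.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for 0/1 labels.

    y_true and y_pred are equal-length sequences where 1 means
    positive sentiment and 0 means negative sentiment.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    # Guard every ratio against an empty denominator.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {'accuracy': accuracy, 'precision': precision,
            'recall': recall, 'f1': f1}


# Toy example: gold labels vs. predictions from some model.
gold = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]
print(binary_metrics(gold, predicted))
```

To use it with the model, convert each logit to a 0/1 prediction by its sign and collect the predictions over a held-out set before calling the function.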


#### Augment the data.

- Train on varied sequence lengths. Augment the dataset by randomly taking snippets of different lengths from the documents.
- Train on more data.
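
The snippet idea could be sketched as follows. This is one possible approach in plain Python, not the notebook's data pipeline: `random_snippets` and its parameters are hypothetical, and it assumes each snippet inherits the label of the document it was cut from, which may not hold for very short snippets.

```python
import random


def random_snippets(tokens, num_snippets, min_len, max_len, rng=None):
    """Cut random contiguous snippets of varied length out of one document.

    tokens: list of tokens for a single document.
    min_len / max_len: inclusive bounds on snippet length; both are
    clipped to the document length so short documents still work.
    rng: optional random.Random instance for reproducibility.
    """
    if rng is None:
        rng = random.Random()
    n = len(tokens)
    if n == 0:
        return []
    lo = min(min_len, n)
    hi = min(max_len, n)
    snippets = []
    for _ in range(num_snippets):
        length = rng.randint(lo, hi)        # inclusive on both ends
        start = rng.randint(0, n - length)  # keep the slice inside the document
        snippets.append(tokens[start:start + length])
    return snippets


document = 'this movie was surprisingly good and the cast was great'.split()
for snippet in random_snippets(document, num_snippets=3, min_len=3, max_len=6,
                               rng=random.Random(0)):
    print(' '.join(snippet))
```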


#### Adjust the training setup.

- Use actual sequence lengths so the RNN isn't unrolled over the full padded length, or use [`padded_batch`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#padded_batch).
- Use a dynamic learning rate or [batch size](https://arxiv.org/abs/1711.00489).
- Adjust the batch size depending on the size of the training set.
- Change the weight and bias initialization scheme.
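
To make the first point concrete, here is a small plain-Python sketch of padding each batch only to its own longest sequence while keeping track of the true lengths. It is an illustration of the idea rather than the `tf.data` implementation; `pad_batch` and the default pad value of `0` are assumptions introduced for this example.

```python
def pad_batch(sequences, pad_value=0, max_length=None):
    """Pad token-id sequences to the longest sequence in the batch.

    sequences: list of lists of token ids.
    pad_value: id used for padding (0 is a placeholder assumption).
    max_length: optional hard cap; longer sequences are truncated.
    Returns (padded_batch, true_lengths).
    """
    if max_length is not None:
        sequences = [seq[:max_length] for seq in sequences]
    lengths = [len(seq) for seq in sequences]
    width = max(lengths) if lengths else 0
    padded = [list(seq) + [pad_value] * (width - len(seq))
              for seq in sequences]
    return padded, lengths


batch, seq_lens = pad_batch([[5, 6, 7], [8], [9, 10]])
print(batch)     # every row has the width of the longest sequence
print(seq_lens)  # true lengths, to be fed alongside the padded ids
```

Padding per batch instead of to a global maximum means short batches cost less compute, and the true lengths let the RNN stop at the last real token of each sequence.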
    

#### Augment the model.

- Add another dense layer or two to the output.
- Stack more RNN layers.
- Make the word embeddings trainable.
- Use higher dimensional word embeddings.
- Add dropout layers to prevent overfitting.
- Apply **batch normalization**.
- Add an **attention** mechanism over the RNN outputs / hidden states for each time step.
- Add **convolution** over the input sequence, where the filter outputs are the inputs to an RNN.
- Add **character-level embeddings**.
- Incorporate features learned by other embedding models (e.g. universal sentence encoder, NNLM, character-level embeddings, ELMo) and **fine-tune** them. Check out TensorFlow Hub.
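
Of the ideas above, attention is perhaps the least obvious to picture. The sketch below shows one simple variant, dot-product attention pooling, in plain Python on toy vectors. It is not the notebook's model: `attention_pool` is a hypothetical helper, and the `query` vector stands in for what would be a trainable parameter in a real graph.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query, seq_len=None):
    """Collapse per-time-step RNN outputs into a single context vector.

    hidden_states: list of T vectors (lists of floats), one per time step.
    query: vector of the same dimension (trainable in a real model).
    seq_len: if given, only the first seq_len steps are attended to,
             so padded time steps receive no weight.
    Returns (context_vector, attention_weights).
    """
    if seq_len is not None:
        hidden_states = hidden_states[:seq_len]
    # Score each time step by its dot product with the query.
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query))
              for h in hidden_states]
    weights = softmax(scores)
    # Context vector is the attention-weighted sum of the hidden states.
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states))
               for d in range(len(query))]
    return context, weights


states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context, weights = attention_pool(states, query=[1.0, 1.0])
print(weights)  # the third step scores highest, so it gets the most weight
print(context)
```

The context vector would then replace (or be concatenated with) the final RNN state before the dense output layer, letting the classifier focus on the time steps that carry the most sentiment.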

## Other Great Resources

- [A great collection of NLP tutorials and resources](https://alex-fabbri.github.io/TutorialBank/)
- [Stanford's Deep Learning for NLP course](http://cs224d.stanford.edu/)
- [Sebastian Ruder's blog](http://ruder.io/#open)
- [Maluuba's QGen Workshop](https://github.com/Maluuba/qgen-workshop)
- [Debugging your neural network](http://theorangeduck.com/page/neural-network-not-working)
- [More advanced tutorial on TensorFlow datasets](https://towardsdatascience.com/how-to-use-dataset-in-tensorflow-c758ef9e4428)

### Datasets

- Lots of sources aggregated on GitHub:
    - https://github.com/niderhoff/nlp-datasets
    - https://github.com/karthikncode/nlp-datasets
- [Maluuba Datasets](https://datasets.maluuba.com/)

### Software

- [Fast.ai](https://github.com/fastai/fastai) library and course for deep learning & NLP using PyTorch.
- [AllenNLP](https://github.com/allenai/allennlp) NLP and deep learning library.
- [SpaCy](https://spacy.io/) library for classic NLP and text processing.
- [Gensim](https://radimrehurek.com/gensim/), good for word vectors and topic modelling.
- [NLTK](https://www.nltk.org/) library for classic NLP and text processing.
- [Pretrained models and embeddings in TensorFlow Hub](https://www.tensorflow.org/hub/modules/text)
- [Maluuba's nlg-eval tools](https://github.com/Maluuba/nlg-eval)
- [TensorFlow datasets](https://www.tensorflow.org/programmers_guide/datasets)

**Training data for this example was originally curated for this work**:

Pang, B., & Lee, L. (2004, July). A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd annual meeting on Association for Computational Linguistics (p. 271). Association for Computational Linguistics.