# LSTM-based Variational Autoencoder for Text Encoding
In this notebook I will demo the LSTM-based variational autoencoder (https://arxiv.org/pdf/1312.6114.pdf) I wrote in Keras for encoding text to a latent vector representation. This representation can be used for computing similarity metrics between documents (sentences, in this case) or as a feature vector for other learning tasks.

The VAE is a generative model that maximizes a lower bound on the marginal probability of the input by conditioning it on a latent variable z whose distribution is learned by a parameterized function estimator, such as a neural network. The encoder network maps an input X to the parameters of a distribution Q(z|X), so that z values sampled from it are likely to produce X. This is where the "variational" part of VAEs comes in: we use KL divergence in our loss function to drive Q(z|X) as close as we can to P(z), the prior distribution of z, which is a normal distribution here. The loss function also includes a reconstruction error term. In summary, the VAE learns an encoding distribution Q that produces latent representations z likely to reproduce the input data X, and a decoding function f(z) that is optimized to output data as close to X as possible from the latent representation.
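To make the two loss terms concrete, here is a minimal NumPy sketch of the sampling step and the loss for a diagonal Gaussian Q(z|X) with a standard normal prior. It is not the code from `vae_lstm`; the function names, the use of `log_var` as the encoder output, and the unweighted sum of the two terms are assumptions made for illustration.

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, I)) per example, summed over latent dims."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)

def vae_loss(x, x_reconstructed, mu, log_var):
    """Per-example loss: reconstruction error plus the KL term (assumed unweighted)."""
    # Mean squared error over all non-batch axes (timesteps and embedding dims).
    axes = tuple(range(1, x.ndim))
    reconstruction = np.mean((x - x_reconstructed) ** 2, axis=axes)
    return reconstruction + kl_to_standard_normal(mu, log_var)

rng = np.random.default_rng(0)
mu = np.zeros((2, 4))        # toy batch of 2 examples, 4 latent dims
log_var = np.zeros((2, 4))
z = sample_latent(mu, log_var, rng)
x = rng.standard_normal((2, 5, 3))  # toy (N, S, E) input
print(z.shape)
print(vae_loss(x, x, mu, log_var))
```

When Q(z|X) already equals the prior and the reconstruction is perfect, both terms vanish; moving `mu` away from zero makes the KL term positive.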

In [1]:
from utils import *
from vae_lstm import *
import numpy as np

Using TensorFlow backend.


## Get Data
We build our dataset by converting sentences from various NLTK corpora (Brown, Reuters, Gutenberg) into a word embedding representation. This yields a 3D tensor of shape (N, S, E), where N is the number of sentences, S is the sentence length (zero padded at the beginning), and E is the dimension of the word embedding. Here we use S=20 and E=300. The word embeddings come from wiki-news-300d-1M.vec, available at https://fasttext.cc/docs/en/english-vectors.html.
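The `get_data` helper lives in `utils` and is not shown here, so the following is only a sketch of how a (N, S, E) tensor with padding at the beginning might be assembled. The function name, the toy embedding table, and the choices to drop out-of-vocabulary tokens and truncate long sentences are all assumptions.

```python
import numpy as np

def sentences_to_tensor(sentences, embeddings, max_len=20, embed_dim=300):
    """Convert tokenized sentences to a (N, S, E) array, zero padded at the beginning."""
    tensor = np.zeros((len(sentences), max_len, embed_dim), dtype=np.float32)
    for i, tokens in enumerate(sentences):
        # Keep only tokens that have a vector, truncated to max_len (assumed policy).
        vectors = [embeddings[t] for t in tokens if t in embeddings][:max_len]
        if vectors:
            # Right-align the sentence so the padding sits at the start.
            tensor[i, max_len - len(vectors):] = np.stack(vectors)
    return tensor

# Toy 4-dimensional embeddings standing in for the 300-d fastText vectors.
toy_embeddings = {
    "the": np.ones(4, dtype=np.float32),
    "cat": np.full(4, 2.0, dtype=np.float32),
    "sat": np.full(4, 3.0, dtype=np.float32),
}
toy_sentences = [["the", "cat", "sat"], ["cat", "unknownword"]]
tensor = sentences_to_tensor(toy_sentences, toy_embeddings, max_len=5, embed_dim=4)
print(tensor.shape)
```

Padding at the beginning keeps the real words closest to the final LSTM timestep, which is one common reason to prefer it over padding at the end.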

In [None]:
data, all_text = get_data()

len_train = 50000
len_test = 10000
train = data[:len_train]
train_text = all_text[:len_train]
test = data[len_train:len_train + len_test]

batch_size = 50
epochs = 30
input_dim = train.shape[-1]
timesteps = train.shape[1]
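One thing to watch with the slicing above: NumPy slices that run past the end of an array silently return fewer (or zero) rows. The training log further down reports 37751 training samples and 0 validation samples, which is consistent with `data` holding fewer than `len_train` sentences. A small guard like the following sketch would surface that early; `split_train_test` is a hypothetical helper, not part of `utils`.

```python
import numpy as np

def split_train_test(data, len_train, len_test):
    """Slice off train and test sets, failing loudly if data is too short."""
    needed = len_train + len_test
    if len(data) < needed:
        raise ValueError(
            f"Requested {needed} samples but only {len(data)} are available"
        )
    return data[:len_train], data[len_train:needed]

toy = np.arange(10)  # stand-in for the real (N, S, E) tensor
train_part, test_part = split_train_test(toy, len_train=7, len_test=3)
print(len(train_part), len(test_part))
```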

## Train the Model
We will train the model for 30 epochs with a batch size of 50. The displayed loss is the mean squared error between the generated word vectors in the output sequence and the word vectors in the input sequence.

In [2]:
model = VAE_LSTM(input_dim=input_dim, latent_dim=100, hidden_dims=[32], timesteps=timesteps, batch_size=batch_size)
vae, encoder, generator = model.autoencoder, model.encoder, model.generator

vae.fit(train, train, shuffle=True, epochs=epochs, batch_size=batch_size, validation_data=(test, test))

Train on 37751 samples, validate on 0 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.callbacks.History at 0x24cacfa8c50>

## Check Similar Sentences
We'll do a spot check on our model by printing the text of the most similar sentences in the encoding space for a few training examples.
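The `get_nearest_sentences` helper used below comes from `utils` and is not shown. As a rough idea of what such a lookup could do, here is a sketch that ranks sentences by Euclidean distance in the encoding space. The function name, the distance metric, the `k` parameter, and the assumption that the encodings form a 2D array of shape (N, latent_dim) are all illustrative and may differ from the real helper.

```python
import numpy as np

def nearest_sentences(sent_idx, encodings, texts, k=5):
    """Return the target sentence followed by its k nearest neighbors (Euclidean)."""
    distances = np.linalg.norm(encodings - encodings[sent_idx], axis=1)
    # Exclude the target itself so exact duplicates cannot displace it.
    distances[sent_idx] = np.inf
    neighbors = np.argsort(distances)[:k]
    return [texts[sent_idx]] + [texts[i] for i in neighbors]

# Toy 2-d encodings standing in for the encoder output.
toy_encodings = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [0.1, 0.0]])
toy_texts = ["a", "b", "c", "d"]
print(nearest_sentences(0, toy_encodings, toy_texts, k=2))
```

Cosine similarity would be a reasonable alternative metric if the scale of the latent vectors is not meaningful.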

In [5]:
encoded_sentences = encoder.predict(np.array(train), batch_size=batch_size)

In [16]:
def print_nearest_sentences(sent_idx):
    print("[First sentence is target sentence, following are closest neighbors]")
    for s in get_nearest_sentences(sent_idx, encoded_sentences, train):
        print(s)
    print()
        
print_nearest_sentences(12352)
print_nearest_sentences(5226)
print_nearest_sentences(35233)

[First sentence is target sentence, following are closest neighbors]
it said after nine months following closing may require royal to register the 200 000 shares for sale . 
under terms of the letter of intent would contribute substantially to a three year exploration budget of 4 . 
to fully compensate for devaluation the quota would have to be around 28 dlrs per bag against 7 . 
economists polled by reuters said that m 1 should be anywhere from down four billion dlrs to up 2 . 
like its cousin the refrigerator a conditioner can be expected to last 20 to 25 years or more . 
economists polled by reuters said that m 1 would be anywhere from down two billion dlrs to up 1 . 

[First sentence is target sentence, following are closest neighbors]
1 mln dlr defense logistics agency contract for jet fuel the defense department said . 
12 mln tonnes in 1985 the commodity board for margarine fats and oils said . 
7 mln dlrs manufactures a line of computer output to microfilm hardware and . 
13 ml

## Summary
It looks like our model is learning some notion of sentence structure. The first example has sentences that all discuss a subject with some relationship to a number at the end of the sentence. Similarly, the second example has sentences that all begin with some number of some unit and end with '\[subject\] said'. The final example has sentences that all begin with a simple '\[pronoun\] \[verb\]' structure. While sentence structure is important, the final example suggests that the model captures structure better than it captures topic. A model such as the Dirichlet Variational Autoencoder (https://arxiv.org/pdf/1811.00135.pdf), which explicitly models topics in its latent representation using a Dirichlet distribution, might improve the representations in that regard.