A very loose implementation of "Generating Sentences from a Continuous Space" (Bowman et al., 2016).
The dataset used in this repo is the IMDB Review Dataset. The dataset is preprocessed as follows:
- Each string is split into sentences.
- All characters are lower-cased.
- No words are removed; common punctuation marks (comma, period, question mark) are preserved and each is treated as a word.
- The dataset is split into a training set and a testing set.
- Only the 20000 most frequent words in the training set are kept in the vocabulary; all other words are replaced with an unknown token.
- The training set, testing set, and Keras word Tokenizer are saved as pickle files.
Example of a preprocessed sentence:
Original: Do you think Batman lives under Bruce Wayne's manor?
Preprocessed: do you think batman lives under bruce wayne s manor ?
Check dataset.py for more details.
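The steps above can be sketched as follows; this is a minimal illustration, and the actual logic in dataset.py may differ (e.g. in how sentence splitting is done):

```python
import re

def preprocess(text):
    """Lower-case a sentence and split off common punctuation as separate words.

    A minimal sketch of the preprocessing described above, not the code in
    dataset.py itself.
    """
    text = text.lower()
    # Treat commas, periods, and question marks as standalone words.
    text = re.sub(r"([,.?])", r" \1 ", text)
    # Replace all other punctuation (such as apostrophes) with spaces.
    text = re.sub(r"[^a-z0-9,.? ]", " ", text)
    return " ".join(text.split())

print(preprocess("Do you think Batman lives under Bruce Wayne's manor?"))
# do you think batman lives under bruce wayne s manor ?
```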
The model is described in seq_vae.py. As described in the paper, the Kullback-Leibler divergence term is multiplied by a weight beta, which is annealed from 0 to 1 following a sigmoid schedule. The model is trained for 20 epochs.
Check seq_vae.py and train.py for more details and to tune hyperparameters.
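A sigmoid KL-annealing schedule can be sketched as below; the steepness `k` and the midpoint here are illustrative assumptions, not the values used in train.py:

```python
import math

def kl_weight(step, total_steps, k=10.0):
    """Sigmoid annealing of the KL weight beta from ~0 to ~1.

    `k` (steepness) and the midpoint (half of training) are assumptions;
    see train.py for the actual schedule parameters.
    """
    midpoint = total_steps / 2
    return 1.0 / (1.0 + math.exp(-k * (step - midpoint) / total_steps))

# beta starts near 0, crosses 0.5 at the midpoint, and approaches 1.
print(kl_weight(0, 1000))     # ~0.007
print(kl_weight(500, 1000))   # 0.5
print(kl_weight(1000, 1000))  # ~0.993
```

The annealed beta then scales the KL term in the loss, so early training behaves like a plain autoencoder before the latent code is regularized toward the prior.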
usage: test.py [-h] [-m MODE] [-s SENTENCE] [-s1 SENTENCE_1] [-s2 SENTENCE_2]
[-b BEAM_SIZE] [-r RESTORE_PATH]
optional arguments:
-h, --help show this help message and exit
-m MODE, --mode MODE
-s SENTENCE, --sentence SENTENCE
-s1 SENTENCE_1, --sentence_1 SENTENCE_1
-s2 SENTENCE_2, --sentence_2 SENTENCE_2
-b BEAM_SIZE, --beam_size BEAM_SIZE
-r RESTORE_PATH, --restore_path RESTORE_PATH
python test.py --mode reconstruct --sentence "the actors are too young for their characters" --beam_size 8 --restore_path vae/vae-20
1. the actors are too young for them .
2. the actors are too young for their .
3. the songs are too young for them .
4. the actors are too young for her .
5. the actors are too young for some .
6. the songs are not too young .
7. the actors are not too young .
8. the characters are too young for them .
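The ranked candidates above come from beam search. A simplified, model-agnostic sketch of the ranking procedure is shown below; the real decoder in test.py also conditions each step on the latent code:

```python
import math

def beam_search(step_fn, start_token, end_token, beam_size, max_len=20):
    """Generic beam search over a per-step token distribution.

    `step_fn(prefix)` must return {token: probability} for the next token.
    This is an illustrative sketch, not the decoder in test.py.
    """
    beams = [([start_token], 0.0)]  # (token list, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == end_token:
                candidates.append((tokens, score))  # finished beam carries over
                continue
            for tok, p in step_fn(tokens).items():
                candidates.append((tokens + [tok], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
        if all(tokens[-1] == end_token for tokens, _ in beams):
            break
    return beams
```

With `--beam_size 8`, the eight highest-scoring finished beams are printed in order, which is why the candidates above are near-duplicates differing in a word or two.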
python test.py --mode generate --beam_size 8 --restore_path vae/vae-20
1. that would be the best thing about the previous films today , but this film is packed , and it was a great movie , not a good , dark comedy .
2. that would be the best thing about the previous films today , but this film is packed , and it was a great movie , especially for a good , dark comedy .
3. that would be the best thing about the previous films today , but this film is packed , and it was a great movie , it was a good , classy thriller .
4. that would be the best thing about the previous films today , but this film is packed , and it was a great movie , it was a good , classy movie .
5. that would be the best thing about the previous films today , but this film is packed , and it was a great movie , especially for a good , classy thriller .
6. that would be the best thing about the previous films today , but this film is packed , and it was a great movie , especially for a good , classy comedy .
7. that would be the best thing about the previous films today , but this film is packed , and it was a great movie , it was a good , classy fashion .
8. that would be the best thing about the previous films today , but this film is packed , and it was a great movie , it was a good , classy , inspirational movie .
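In generate mode, sentences are produced by decoding latent codes drawn from the prior. A sketch of the sampling step, with a placeholder latent size (see seq_vae.py for the real dimension):

```python
import numpy as np

def sample_prior(latent_dim, n=8, seed=None):
    """Draw latent codes from the VAE prior N(0, I).

    `latent_dim` here is a placeholder; each sampled row would be fed to the
    decoder (with beam search) to produce one candidate sentence.
    """
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n, latent_dim))

z = sample_prior(latent_dim=64, n=8, seed=0)
print(z.shape)  # (8, 64)
```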
In the interpolation example below, the first and last sentences are reconstructions of the two input sentences.
python test.py --mode interpolate --sentence_1 "the movie was not as good as the prequel" --sentence_2 "the actors of the film are too young for their characters" --restore_path vae/vae-20
1. the movie was not as good as a .
2. the movie was not as good as a .
3. the movie was not as good as <unk> .
4. the acting is great as good as <unk> .
5. the actors in the movie are quite good .
6. the actors of the film are too young for their .
7. the actors of the film are too young for their .
Check out test.py to vary the number of interpolations; 5 interpolations is a good number for seeing significant changes between consecutive steps.
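Interpolation amounts to moving linearly between the two sentences' latent codes and decoding each intermediate point. A minimal sketch, assuming simple linear interpolation with the endpoints included (so 5 interpolations yield the 7 sentences shown above):

```python
import numpy as np

def interpolate(z1, z2, n_steps=5):
    """Linearly interpolate between two latent codes.

    Endpoints are included, so n_steps=5 yields 7 codes total. This is an
    illustrative sketch; test.py may interpolate differently.
    """
    alphas = np.linspace(0.0, 1.0, n_steps + 2)
    return [(1 - a) * z1 + a * z2 for a in alphas]

points = interpolate(np.zeros(4), np.ones(4), n_steps=5)
print(len(points))  # 7
print(points[3])    # [0.5 0.5 0.5 0.5] -- the halfway point
```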
Dependencies:
- TensorFlow/TensorFlow-GPU 1.13.1
- NumPy 1.16.3