# Word Prediction using Recurrent Neural Networks (RNNs)
## Experiment 2016-12-30

Experiment setup

### Table of Contents

1. Initialize
2. Prepare Data
3. Explore Data
4. Train Models
5. Test Models
6. Generate Text

## 1. Initialize
### Import

In [190]:
# import python modules
from __future__ import print_function, division
import os.path
import random
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from nltk import tokenize

In [191]:
# import wp modules (can be slow)
import sys; sys.path.append('../../src')
print('importing wp (and nltk)...')
import wp
print('done')

importing wp (and nltk)...
done


In [192]:
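# reload wp modules in case changed (for development purposes)
try:
    from importlib import reload  # Python 3: reload is no longer a builtin
except ImportError:
    pass  # Python 2: reload is a builtin
reload(wp)
reload(wp.data)
reload(wp.util)
reload(wp.model)
reload(wp.ngram)
reload(wp.rnn)
reload(wp.experiment);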

## 2. Prepare Data

Clean and merge raw text files, split into train, validate, and test sets.

In [193]:
# get wrapper around all data and tokenization
#data = wp.data.Data('gutenbergs')
data = wp.data.Data('animals')

Clean the raw data files: remove Gutenberg headers and footers, and strip non-ASCII characters (NLTK complains otherwise).

In [194]:
data.clean()

The raw files have been cleaned.


Merge the cleaned data files into one.

In [195]:
data.merge()

The cleaned files have already been merged.


Split the merged file by sentences into train, validate, and test sets.

In [196]:
data.split()

The merged file has already been split.
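
As a rough illustration of a sentence-level split, the sketch below shuffles a list of sentences and cuts it into three parts. It is not the `wp.data` implementation: the function name `split_sentences`, the 80/10/10 proportions, and the fixed seed are all assumptions made for the example.

```python
import random

def split_sentences(sentences, ptrain=0.8, pvalidate=0.1, seed=0):
    """Shuffle sentences and split them into train, validate, and test lists.

    The proportions and the seed are illustrative assumptions, not values from wp.
    """
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)  # local RNG, leaves the global state alone
    ntrain = int(len(shuffled) * ptrain)
    nvalidate = int(len(shuffled) * pvalidate)
    train = shuffled[:ntrain]
    validate = shuffled[ntrain:ntrain + nvalidate]
    test = shuffled[ntrain + nvalidate:]  # whatever remains goes to the test set
    return train, validate, test

sentences = ['Dog barked.', 'Cat slept.', 'Dog slept.', 'Cat meowed.'] * 5
train, validate, test = split_sentences(sentences)
print(len(train), len(validate), len(test))
```

Splitting by sentence rather than by character offset keeps every sentence intact within a single set.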


## 3. Explore Data

### Show some statistics

In [197]:
# too slow
#stats = data.analyze()
#stats

### Show some samples of the text

In [198]:
s_merged = data.text('merged')
nsamples = 4
nchars = len(s_merged)
nskip = int(nchars / nsamples)
for i in range(nsamples):
    s = s_merged[i*nskip:i*nskip+200]
    s = s.replace('\n', ' ').strip()
    print(s)
    print()

Dog barked. Cat slept. Dog slept. Cat meowed.

Cat slept. Dog slept. Cat meowed.

og slept. Cat meowed.

t meowed.



### Show some text split into sentences

This shows how the merged text is segmented into sentences, the unit by which it was split into the train, validate, and test sets.

In [199]:
# we'll just look at the first 50k characters, because parsing sentences is slow
sentences = data.sentences('merged', 50000)
random.seed(2)
samples = random.sample(sentences, 4)
print('\n\n'.join(samples))

Cat meowed.    

Dog slept.

Dog barked.

Cat slept.


### Show the text split into tokens

Note that punctuation marks are treated as separate tokens.

In [200]:
tokens = data.tokens('merged', 50000)
print('ntokens',len(tokens))
print(tokens[-50:])

ntokens 16
['Dog', 'barked', '.', 'END', 'Cat', 'slept', '.', 'END', 'Dog', 'slept', '.', 'END', 'Cat', 'meowed', '.', 'END']
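
The token stream above can be approximated with a few lines of standard-library code. This is only a sketch: the notebook relies on `wp.data` and NLTK, whereas the hypothetical `tokenize_with_end` below uses simple regular expressions, which will differ from NLTK on abbreviations, quotes, and similar cases.

```python
import re

def tokenize_with_end(text):
    """Split text into word and punctuation tokens, adding 'END' after each sentence."""
    tokens = []
    # crude sentence split: break at whitespace that follows '.', '!' or '?'
    for sentence in re.split(r'(?<=[.!?])\s+', text.strip()):
        if not sentence:
            continue
        # a token is a run of letters/digits/apostrophes, or one punctuation character
        tokens.extend(re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", sentence))
        tokens.append('END')
    return tokens

tokens = tokenize_with_end('Dog barked. Cat slept. Dog slept. Cat meowed.')
print('ntokens', len(tokens))
print(tokens)
```

The explicit `END` marker lets a model learn where sentences stop, which is what later allows text generation to terminate.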


## 4. Train Models

Train models on the training tokens.

In [201]:
# define models to train and test
model_specs = [
    [wp.ngram.Ngram, {'n':1}],
    [wp.ngram.Ngram, {'n':2}],
    [wp.ngram.Ngram, {'n':3}],
    [wp.ngram.Ngram, {'n':4}],
    [wp.rnn.Rnn, {'nvocabmax':1000,'nhidden':100}],
]
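
For intuition about what the n-gram entries do, here is a minimal count-based predictor. It is a hypothetical stand-in, not `wp.ngram.Ngram`: the class name `SimpleNgram` and its `train`/`predict` methods are assumptions, and it does no smoothing or backoff.

```python
from collections import Counter, defaultdict

class SimpleNgram(object):
    """Count-based n-gram predictor (illustrative only, not wp.ngram.Ngram)."""

    def __init__(self, n):
        self.n = n
        # maps a context tuple of n-1 tokens to a Counter of the tokens that followed it
        self.counts = defaultdict(Counter)

    def train(self, tokens):
        for i in range(len(tokens) - self.n + 1):
            context = tuple(tokens[i:i + self.n - 1])
            self.counts[context][tokens[i + self.n - 1]] += 1

    def predict(self, context, k=3):
        """Return up to k most frequent next tokens given the preceding tokens."""
        key = tuple(context[-(self.n - 1):]) if self.n > 1 else ()
        # .get avoids creating empty entries for unseen contexts
        followers = self.counts.get(key, Counter())
        return [token for token, _ in followers.most_common(k)]

tokens = ['Dog', 'barked', '.', 'END', 'Cat', 'slept', '.', 'END',
          'Dog', 'slept', '.', 'END', 'Cat', 'meowed', '.', 'END']
bigram = SimpleNgram(2)
bigram.train(tokens)
print(bigram.predict(['Dog']))  # tokens seen after 'Dog'
print(bigram.predict(['.']))    # tokens seen after '.'
```

An unseen context returns an empty list here; a real model would typically back off to a shorter context instead.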

In [202]:
# train models on different amounts of training data

train_amounts = [0.0001, 0.001, 0.01, 0.1, 1.0] # fraction of total training data

# number of training characters per run (larger sizes disabled: 10000, 100000, 1000000, 6000000)
nchars_list = [1000]
model_table = wp.analyze.init_model_table(model_specs, data, nchars_list)
print('done')

ntraining_chars 1000
get complete stream of training tokens, nchars=1000
train model n-gram-(nchars-1000-n-1)
get ngrams, n=1
add ngrams to model
save model n-gram-(nchars-1000-n-1)
train model n-gram-(nchars-1000-n-2)
get ngrams, n=2
add ngrams to model
save model n-gram-(nchars-1000-n-2)
train model n-gram-(nchars-1000-n-3)
get ngrams, n=3
add ngrams to model
save model n-gram-(nchars-1000-n-3)
train model n-gram-(nchars-1000-n-4)
get ngrams, n=4
add ngrams to model
save model n-gram-(nchars-1000-n-4)
train model rnn-(nchars-1000-nvocabmax-1000-nhidden-100)
2016-12-30 16:19:21: Loss after nexamples_seen=0 epoch=0: 5.069344
2016-12-30 16:19:21: Loss after nexamples_seen=2 epoch=1: 1.583008
2016-12-30 16:19:21: Loss after nexamples_seen=4 epoch=2: 2.017114
Setting learning rate to 0.002500
2016-12-30 16:19:21: Loss after nexamples_seen=6 epoch=3: 1.519566
2016-12-30 16:19:21: Loss after nexamples_seen=8 epoch=4: 1.710806
Setting learning rate to 0.001250
2016-12-30 16:19:21: Loss after
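
The log above shows the learning rate being halved each time the loss rises from one epoch to the next (to 0.0025 after epoch 2, then to 0.00125 after epoch 4). The sketch below reproduces that pattern on the logged losses. The function name `anneal_learning_rate`, the halving factor, and the starting rate of 0.005 are inferred from the log, not taken from the `wp.rnn` source.

```python
def anneal_learning_rate(losses, learning_rate, factor=0.5):
    """Return a reduced learning rate if the latest loss exceeds the previous one."""
    if len(losses) > 1 and losses[-1] > losses[-2]:
        return learning_rate * factor
    return learning_rate

learning_rate = 0.005  # assumed starting value, inferred from the first halving in the log
losses = []
for loss in [5.069344, 1.583008, 2.017114, 1.519566, 1.710806]:
    losses.append(loss)
    new_rate = anneal_learning_rate(losses, learning_rate)
    if new_rate != learning_rate:
        print('Setting learning rate to %f' % new_rate)
    learning_rate = new_rate
```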

## 5. Test Models

Test all models on held-out test data.

In [203]:
# test all models and save results to a pandas dataframe

ntest_chars = 10000
npredictions_max = 1000
k = 3 # predict top k tokens

df = wp.analyze.test_model_table(model_table, data, ntest_chars, npredictions_max, k)

get complete stream of test tokens, nchars=10000
n-gram-(nchars-1000-n-1): accuracy = nright/total = 2/2 = 1.000000
n-gram-(nchars-1000-n-2): accuracy = nright/total = 0/1 = 0.000000


ZeroDivisionError: division by zero
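
In the log above the number of scored predictions shrinks from 2 to 1 as n grows, which suggests the next model may have had nothing to score on this tiny test set. That is an inference from the output, not something the traceback confirms. A guarded top-k accuracy calculation might look like the sketch below, where `topk_accuracy` is a hypothetical helper and not part of `wp.analyze`.

```python
def topk_accuracy(predictions, actuals):
    """Return (nright, total, accuracy); accuracy is None when there is nothing to score.

    predictions: one list of top-k candidate tokens per test position
    actuals: the token that actually occurred at each position
    """
    nright = sum(1 for preds, actual in zip(predictions, actuals) if actual in preds)
    total = len(actuals)
    # guard against an empty test stream instead of dividing by zero
    accuracy = float(nright) / total if total > 0 else None
    return nright, total, accuracy

print(topk_accuracy([['.', 'END', 'Dog'], ['Cat', 'Dog', '.']], ['.', 'Dog']))
print(topk_accuracy([], []))
```

Returning `None` rather than 0.0 keeps "no data" distinguishable from "always wrong" when the results are collected into a dataframe.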

In [None]:
df

In [None]:
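for i in range(len(df.index)):
    # .ix was removed from newer pandas; .iloc selects the row by position
    # label each line with its row name so the legend has entries
    plt.plot(df.columns, df.iloc[i], label=df.index[i])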
plt.legend(loc=(1.1,0.5))
plt.xscale('log')
plt.xlabel('Training set size (chars)')
plt.ylabel('Accuracy')
plt.show()

## 6. Generate Text

In [1]:
nsentences = 5
models = model_table[-1] # use models with most training data
for model in models[1:]:
    print(model.name)
    print('-'*80)
    for seed in range(nsentences):
        random.seed(seed)
        s = model.generate()
        print(s)
        print()
    print()

NameError: name 'model_table' is not defined
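
Since the cell above failed before producing any text, here is a small self-contained sketch of how sentence generation from a count-based model can work: start from the sentence boundary and repeatedly sample the next token in proportion to its bigram count until `END` appears. The helpers `train_bigrams` and `generate_sentence` are hypothetical and do not reflect how `model.generate()` is implemented in `wp`.

```python
import random
from collections import Counter, defaultdict

def train_bigrams(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def generate_sentence(counts, start='END', end='END', max_tokens=20):
    """Sample tokens in proportion to bigram counts until `end` or max_tokens."""
    output = []
    current = start
    for _ in range(max_tokens):
        followers = counts.get(current)
        if not followers:
            break  # nothing ever followed this token in training
        choices = sorted(followers)  # fixed order so a given seed is reproducible
        # weighted draw by hand (random.choices does not exist in Python 2)
        r = random.random() * sum(followers[c] for c in choices)
        for token in choices:
            r -= followers[token]
            if r < 0:
                break
        current = token
        if current == end:
            break
        output.append(current)
    return ' '.join(output)

tokens = ['Dog', 'barked', '.', 'END', 'Cat', 'slept', '.', 'END',
          'Dog', 'slept', '.', 'END', 'Cat', 'meowed', '.', 'END']
counts = train_bigrams(tokens)
for seed in range(5):
    random.seed(seed)  # same seeding pattern as the cell above
    print(generate_sentence(counts))
```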