# Word Prediction using Recurrent Neural Networks (RNNs)
## Experiment 2016-12-22

Refining test architecture.

### Table of Contents

1. Prepare Data
2. Explore Data
3. Train / Load Models
4. Test Models

## Imports

In [10]:
# import python modules
from __future__ import print_function, division
import os.path
import random
from nltk import tokenize

In [58]:
# import wp modules (can be slow)
import sys; sys.path.append('../../src')
print('importing wp (and nltk)...')
import wp
print('done')

importing wp (and nltk)...
done


In [59]:
# reload wp modules in case changed (for development purposes)
reload(wp)
reload(wp.data)
reload(wp.ngram)
reload(wp.rnn)

<module 'wp.rnn' from '../../src\wp\rnn.pyc'>

## Initialize

In [60]:
random.seed(0)

## 1. Prepare Data

Merge the raw text files, convert them to plain strings, and split them into train, validate, and test sets.

In [61]:
# get wrapper around all data and tokenization
data = wp.data.Data()

Merge the raw data files into one and remove non-ASCII characters (NLTK complains otherwise).

In [15]:
data.merge()

Merged file created.
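
How `data.merge()` removes non-ASCII characters is not shown in this notebook. A minimal sketch of one way to do it follows; `strip_non_ascii` is a hypothetical helper, not part of `wp`.

```python
def strip_non_ascii(text):
    """Return text with every non-ASCII character removed."""
    # Hypothetical helper: keeps only code points below 128.
    return ''.join(ch for ch in text if ord(ch) < 128)

sample = u'caf\u00e9 na\u00efve text'
cleaned = strip_non_ascii(sample)
print(cleaned)
```

Dropping characters outright loses accented letters; transliterating them would be an alternative, at the cost of more code.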


Split the merged file by sentences into train, validate, and test sets.

In [16]:
data.split()

The merged file has been split into train, validate, and test files.
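
The proportions and method used by `data.split()` are not shown here. As a rough sketch of a sentence-level split, assuming an 80/10/10 division and a shuffle before splitting (both assumptions), with `split_sentences` a hypothetical function:

```python
import random

def split_sentences(sentences, ptrain=0.8, pvalidate=0.1, seed=0):
    """Shuffle sentences and split them into (train, validate, test) lists."""
    # The 80/10/10 proportions are assumed, not taken from wp.
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    ntrain = int(len(shuffled) * ptrain)
    nvalidate = int(len(shuffled) * pvalidate)
    train = shuffled[:ntrain]
    validate = shuffled[ntrain:ntrain + nvalidate]
    test = shuffled[ntrain + nvalidate:]  # whatever remains
    return train, validate, test

sentences = ['Sentence %d.' % i for i in range(20)]
train, validate, test = split_sentences(sentences)
print(len(train), len(validate), len(test))
```

Splitting on whole sentences keeps each sentence intact within a single set, so no test sentence is partly seen during training.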


## 2. Explore Data

### Show some samples of the text

In [44]:
s_merged = data.text('merged')
nsamples = 5
nchars = len(s_merged)
nskip = int(nchars / nsamples)
for i in range(nsamples):
    s = s_merged[i*nskip:i*nskip+200]
    s = s.replace('\n', ' ').strip()
    print(s)
    print()

The Project Gutenberg EBook of Phantastes, by George MacDonald  This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever.  You may copy it, give it away or re

to be vanquished, retreated; but Wellington shouted, "Up, Guards, and aim straight!" The red regiment of English guards, lying flat behind the hedges, sprang up, a cloud of grape-shot riddled the tric

xcept that geometrical point, the _I_; bringing everything back to the soul-atom; expanding everything in God, entangling all activity, from summit to base, in the obscurity of a dizzy mechanism, atta

y, or to speak more accurately, that same evening, as Marius left the table, and was on the point of withdrawing to his study, having a case to look over, Basque handed him a letter saying: "The perso

ay evening in January, the lonely valley had been a desirable place to him; he had watched the green battlements in summer and winter weather, had seen the heaped mounds rising dimly amidst th

### Show some text split into sentences

This shows how the text is divided into sentences, the unit used to split it into the train, validate, and test sets.

In [45]:
# we'll just look at the first 50k characters, because parsing sentences is slow
sentences = data.sentences('merged', 50000)
random.seed(0)
samples = random.sample(sentences, 5)
print('\n\n'.join(samples))

And off they set, after some new mischief.

Primrose is gone.

But how then do you come to live here?

He looked up, and lo!

It contained many wondrous tales of Fairy Land, and olden times, and the Knights of King Arthurs table.


### Show the text split into tokens

Note that punctuation marks are treated as separate tokens.

In [46]:
tokens = data.tokens('merged', 50000) # look at first 50k characters
print('ntokens',len(tokens))
print(tokens[8000:8100])

ntokens 10369
[';', 'as', 'if', 'we', 'were', 'not', 'good', 'enough', 'to', 'look', 'at', 'her', ',', 'and', 'she', 'was', ',', 'the', 'proud', 'thing', '!', '--', 'served', 'her', 'right', '!', 'Oh', ',', 'Pocket', ',', 'Pocket', ',', 'said', 'I', ';', 'but', 'by', 'this', 'time', 'the', 'party', 'which', 'had', 'gone', 'towards', 'the', 'house', ',', 'rushed', 'out', 'again', ',', 'shouting', 'and', 'screaming', 'with', 'laughter', '.', 'Half', 'of', 'them', 'were', 'on', 'the', 'cats', 'back', ',', 'and', 'half', 'held', 'on', 'by', 'her', 'fur', 'and', 'tail', ',', 'or', 'ran', 'beside', 'her', ';', 'till', ',', 'more', 'coming', 'to', 'their', 'help', ',', 'the', 'furious', 'cat', 'was', 'held', 'fast', ';', 'and', 'they', 'proceeded']
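
To illustrate what treating punctuation marks as separate tokens means, here is a small regex-based sketch. It is only an approximation of the tokenizer behind `data.tokens()`, and `simple_tokenize` is a hypothetical name. Unlike the output above, it would split `--` into two tokens.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    # A run of word characters is one token; any other non-space
    # character becomes a token on its own.
    return re.findall(r"\w+|[^\w\s]", text)

tokens_demo = simple_tokenize('Oh, Pocket, Pocket, said I; but by this time')
print(tokens_demo)
```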


## 3. Train / Load Models

Load each model from its pickle file if one has been saved; otherwise train it on the training text and save it.

In [50]:
# get sequence of training tokens (slow)
train_tokens = data.tokens('train')

In [54]:
# encode model parameters as a string for use in model filenames
def encode_params(params):
    """
    Encode a dictionary of parameters as a string to be stored in a filename.
    e.g. {'n':3} => '(n-3)'
    With several keys, the order follows str(params), i.e. dict order.
    """
    s = str(params)
    s = s.replace(':','-')
    s = s.replace("'",'')
    s = s.replace('{','')
    s = s.replace('}','')
    s = s.replace(' ','')
    s = '(' + s + ')'
    return s
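
Because `encode_params` builds on `str(params)`, the key order in the filename follows dict order, which is arbitrary in Python 2. A possible variant that sorts the keys, so the same parameters always map to the same filename, might look like this (`encode_params_sorted` is a hypothetical name, not used elsewhere in this notebook):

```python
def encode_params_sorted(params):
    """Encode a dict as '(key-value,key-value)' with keys in sorted order."""
    parts = ['%s-%s' % (key, params[key]) for key in sorted(params)]
    return '(' + ','.join(parts) + ')'

print(encode_params_sorted({'n': 3, 'b': 1.2}))
print(encode_params_sorted({}))
```

For the single-key and empty dictionaries used below, it gives the same strings as `encode_params`, e.g. `(n-2)` and `()`.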

In [64]:
# TODO: put in a pandas table? use it to store results also?

#model_list = [
#    ['n-gram (n=2)', 'ngram-model-basic-n-2', wp.ngram.NgramModel, {'n':2}],
#    ['n-gram (n=3)', 'ngram-model-basic-n-3', wp.ngram.NgramModel, {'n':3}],
#    ['rnn', 'rnn', wp.rnn.RnnModel, {}],
#]
model_list = [
    [wp.ngram.NgramModel, {'n':2}],
    [wp.ngram.NgramModel, {'n':3}],
#    [wp.ngram.NgramModel, {'n':4}],
    [wp.rnn.RnnModel, {}],
]

models = []
for modelclass, modelparams in model_list:

    # load existing model, or create, train, and save one
    sparams = encode_params(modelparams)
    modelfile = 'models/' + modelclass.__name__ + '-' + sparams + '.pickle'
    if os.path.isfile(modelfile):
        print("load model: " + modelfile)
        model = modelclass.load(modelfile) # static method
    else:
        print("create model object")
        model = modelclass(**modelparams)

        print("train model")
        model.train(train_tokens)

        print("save model: " + modelfile)
        model.save(modelfile)

    models.append(model)
    
print("done")

load model: models/NgramModel-(n-2).pickle
create model object
train model
get ngrams
add ngrams to model
save model: models/NgramModel-(n-3).pickle
load model: models/RnnModel-().pickle
done
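
The internals of `wp.ngram.NgramModel` are not shown in this notebook. As a rough illustration of what an n-gram model with n=2 does, here is a toy sketch that counts which token follows each token and predicts the most frequent follower. `BigramSketch` is a hypothetical class, not the `wp` implementation; it only mirrors the `train` and `predict(prompt)` calls used here.

```python
from collections import Counter, defaultdict

class BigramSketch(object):
    """Toy bigram model: predict the most frequent follower of a token."""

    def __init__(self):
        # followers[prev][nxt] = number of times nxt followed prev
        self.followers = defaultdict(Counter)

    def train(self, tokens):
        for prev, nxt in zip(tokens, tokens[1:]):
            self.followers[prev][nxt] += 1

    def predict(self, prompt):
        """prompt is a tuple of tokens; only the last one is used."""
        counts = self.followers.get(prompt[-1])
        if not counts:
            return None  # token never seen in training
        return counts.most_common(1)[0][0]

demo_model = BigramSketch()
demo_model.train('the cat sat on the mat and the cat ran'.split())
print(demo_model.predict(('the',)))
```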


## 4. Test Models

Now that we have some trained models, let's test them against some held-out data.

### Split text into tuples of tokens

In [62]:
# TODO: iterate over ntokens_per_tuple?

ntokens_per_tuple = 2
test_tuples = data.tuples('test', ntokens_per_tuple)
print(test_tuples[100:120])

[('the', 'chamber'), ('chamber', 'where'), ('where', 'the'), ('the', 'secretary'), ('secretary', 'stood'), ('stood', ','), (',', 'the'), ('the', 'first'), ('first', 'lights'), ('lights', 'that'), ('that', 'had'), ('had', 'been'), ('been', 'there'), ('there', 'for'), ('for', 'many'), ('many', 'a'), ('a', 'year'), ('year', ';'), (';', 'for'), ('for', ',')]
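
The output above looks like a sliding window over the token sequence. A minimal sketch of that idea, assuming this is roughly what `data.tuples()` does (`token_tuples` is a hypothetical function):

```python
def token_tuples(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

demo_tokens = ['the', 'chamber', 'where', 'the', 'secretary']
pairs = token_tuples(demo_tokens, 2)
print(pairs)
```

In each tuple, the last token is the one to predict and the preceding tokens form the prompt.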


### Test models

In [65]:
imax = 100
for model in models:
    print(model.name)
    nright = 0
    # test each model on the first imax tuples only
    sample = test_tuples[:imax]
    for tokens_tuple in sample:
        prompt = tokens_tuple[:-1]  # all tokens but the last
        actual = tokens_tuple[-1]   # the token to be predicted
        prediction = model.predict(prompt)
        if actual == prediction:
            nright += 1
    print("nright/total=%d/%d" % (nright, len(sample)))
    print()
    print()

n-gram (n=2)
nright/total=8/100

n-gram (n=3)
nright/total=7/100

rnn
nright/total=0/100
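
Scores this low are easier to interpret next to a trivial baseline. The sketch below wraps the same counting logic in a function and applies it to a predictor that always answers `'the'`; `accuracy`, `always_the`, and the four test pairs are made up for illustration, not results from this experiment.

```python
def accuracy(predict, tuples, nmax=100):
    """Return (nright, ntotal) for next-token prediction over tuples[:nmax]."""
    sample = tuples[:nmax]
    nright = sum(1 for t in sample if predict(t[:-1]) == t[-1])
    return nright, len(sample)

def always_the(prompt):
    # Trivial baseline: ignore the prompt entirely.
    return 'the'

demo_pairs = [('in', 'the'), ('the', 'chamber'), ('where', 'the'), ('the', 'first')]
nright, ntotal = accuracy(always_the, demo_pairs)
print('nright/total=%d/%d' % (nright, ntotal))
```

A model's bound `predict` method could be passed in the same way, e.g. `accuracy(model.predict, test_tuples)`.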

