# Word Prediction using Recurrent Neural Networks (RNNs)
## Experiment 2016-12-23

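This experiment loops over the training size (number of characters of training text), training and testing the n-gram models at each size.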

### Table of Contents

1. Prepare Data
2. Explore Data
3. Train Models
4. Test Models

## Imports

In [5]:
# import python modules
from __future__ import print_function, division
import os.path
import random
from nltk import tokenize

In [6]:
# import wp modules (can be slow)
import sys; sys.path.append('../../src')
print('importing wp (and nltk)...')
import wp
print('done')

importing wp (and nltk)...
done


In [7]:
# reload wp modules in case changed (for development purposes)
reload(wp)
reload(wp.data)
reload(wp.ngram)
reload(wp.rnn)

<module 'wp.rnn' from '../../src\wp\rnn.pyc'>

## Initialize

In [8]:
random.seed(0)

## 1. Prepare Data

Merge raw text files, convert to plain strings, split into train, validate, and test sets.

In [9]:
# get wrapper around all data and tokenization
data = wp.data.Data()

Merge the raw data files into one and remove non-ascii characters (nltk complains otherwise).

In [10]:
data.merge()

The merged file already exists.
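
The implementation of `data.merge()` is not shown in this notebook, so the following is only a minimal sketch of the non-ASCII removal step described above. The helper name `strip_non_ascii` and the sample string are assumptions made for illustration, not part of `wp`.

```python
def strip_non_ascii(text):
    """Drop every character outside the 7-bit ASCII range."""
    # encode with errors='ignore' discards unencodable characters
    return text.encode('ascii', 'ignore').decode('ascii')

# hypothetical sample containing an accented letter and curly quotes
sample = u'caf\u00e9 \u201cquoted\u201d na\u00efve'
cleaned = strip_non_ascii(sample)
print(cleaned)
```

Dropping characters is lossy (the accented letters vanish rather than being transliterated), which may be acceptable for a word-prediction corpus but is worth keeping in mind when reading the samples.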


Split the merged file by sentences into train, validate, and test sets.

In [11]:
data.split()

The merged file has already been split.
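
How `data.split()` divides the sentences is not shown here. As a rough sketch of a sentence-level split, the code below shuffles a list of sentences and cuts it into three parts. The 80/10/10 proportions, the shuffling, and the function name `split_sentences` are assumptions for illustration and may differ from what `wp.data` does.

```python
import random

def split_sentences(sentences, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle sentences and cut them into train/validate/test lists."""
    rng = random.Random(seed)  # private generator, so the global seed is untouched
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    ntrain = int(len(shuffled) * fractions[0])
    nvalidate = int(len(shuffled) * fractions[1])
    train = shuffled[:ntrain]
    validate = shuffled[ntrain:ntrain + nvalidate]
    test = shuffled[ntrain + nvalidate:]  # the remainder goes to the test set
    return train, validate, test

# hypothetical input: ten placeholder sentences
sentences = ['Sentence number %d.' % i for i in range(10)]
train, validate, test = split_sentences(sentences)
print(len(train), len(validate), len(test))
```

Splitting on sentence boundaries rather than at arbitrary character offsets keeps every n-gram inside a single set.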


## 2. Explore Data

### Show some samples of the text

In [12]:
s_merged = data.text('merged')
nsamples = 5
nchars = len(s_merged)
nskip = int(nchars / nsamples)
for i in range(nsamples):
    s = s_merged[i*nskip:i*nskip+200]
    s = s.replace('\n', ' ').strip()
    print(s)
    print()

The Project Gutenberg EBook of Phantastes, by George MacDonald  This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever.  You may copy it, give it away or re

to be vanquished, retreated; but Wellington shouted, "Up, Guards, and aim straight!" The red regiment of English guards, lying flat behind the hedges, sprang up, a cloud of grape-shot riddled the tric

xcept that geometrical point, the _I_; bringing everything back to the soul-atom; expanding everything in God, entangling all activity, from summit to base, in the obscurity of a dizzy mechanism, atta

y, or to speak more accurately, that same evening, as Marius left the table, and was on the point of withdrawing to his study, having a case to look over, Basque handed him a letter saying: "The perso

ay evening in January, the lonely valley had been a desirable place to him; he had watched the green battlements in summer and winter weather, had seen the heaped mounds rising dimly amidst th

### Show some text split into sentences

This shows how the merged text is segmented into sentences, which are the units used to split it into the train, validate, and test sets.

In [13]:
# we'll just look at the first 50k characters, because parsing sentences is slow
sentences = data.sentences('merged', 50000)
random.seed(0)
samples = random.sample(sentences, 5)
print('\n\n'.join(samples))

And off they set, after some new mischief.

Primrose is gone.

But how then do you come to live here?

He looked up, and lo!

It contained many wondrous tales of Fairy Land, and olden times, and the Knights of King Arthurs table.


### Show the text split into tokens

Note that punctuation marks are treated as separate tokens.

In [14]:
tokens = data.tokens('merged', 50000) # look at first 50k characters
print('ntokens',len(tokens))
print(tokens[8000:8100])

ntokens 10369
[';', 'as', 'if', 'we', 'were', 'not', 'good', 'enough', 'to', 'look', 'at', 'her', ',', 'and', 'she', 'was', ',', 'the', 'proud', 'thing', '!', '--', 'served', 'her', 'right', '!', 'Oh', ',', 'Pocket', ',', 'Pocket', ',', 'said', 'I', ';', 'but', 'by', 'this', 'time', 'the', 'party', 'which', 'had', 'gone', 'towards', 'the', 'house', ',', 'rushed', 'out', 'again', ',', 'shouting', 'and', 'screaming', 'with', 'laughter', '.', 'Half', 'of', 'them', 'were', 'on', 'the', 'cats', 'back', ',', 'and', 'half', 'held', 'on', 'by', 'her', 'fur', 'and', 'tail', ',', 'or', 'ran', 'beside', 'her', ';', 'till', ',', 'more', 'coming', 'to', 'their', 'help', ',', 'the', 'furious', 'cat', 'was', 'held', 'fast', ';', 'and', 'they', 'proceeded']
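
To illustrate what "punctuation marks as separate tokens" means, here is a small regular-expression tokenizer. It is only a sketch: the notebook imports `nltk`, and the tokenizer behind `data.tokens` is not shown, so the pattern and the name `simple_tokenize` are assumptions and will not reproduce the real tokenizer in every case.

```python
import re

# a word, or a double hyphen, or any single non-space punctuation character
TOKEN_PATTERN = re.compile(r"\w+|--|[^\w\s]")

def simple_tokenize(text):
    """Split text into word tokens and separate punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

example_tokens = simple_tokenize('served her right!--Oh, Pocket, said I;')
print(example_tokens)
```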


## 3. Train Models

Train models on the training tokens and save them to pickle files. Loading previously saved models is commented out in the code below, so every call retrains from scratch.

In [15]:
# encode model parameters as a string for use in filenames
def encode_params(params):
    """
    Encode a dictionary of parameters as a string to be stored in a filename.
    e.g. {'n':3,'b':1.2} => '(n-3,b-1.2)'
    Note: the keys appear in whatever order str(params) lists them.
    """
    s = str(params)
    s = s.replace(':','-')
    s = s.replace("'",'')
    s = s.replace('{','')
    s = s.replace('}','')
    s = s.replace(' ','')
    s = '(' + s + ')'
    return s

In [16]:
# TODO: put this cell and the next into functions,
#       so they can be called within a loop over nchars (the training size)
# TODO: could include nchars in sparams
# TODO: might need an association list instead of a dict, to preserve key order
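
One possible way to address the ordering concern in the notes above, without switching to an association list, is to sort the keys before encoding. The function name `encode_params_sorted` is hypothetical, and adding `nchars` as a key is only an illustration of the idea mentioned in the notes.

```python
def encode_params_sorted(params):
    """Encode a parameter dict as '(key-value,...)' with keys in sorted order."""
    items = sorted(params.items())  # sorting makes the result independent of dict order
    body = ','.join('%s-%s' % (key, value) for key, value in items)
    return '(' + body + ')'

encoded_a = encode_params_sorted({'n': 3, 'b': 1.2})
encoded_b = encode_params_sorted({'b': 1.2, 'n': 3})
encoded_c = encode_params_sorted({'n': 2, 'nchars': 1000})
print(encoded_a, encoded_b, encoded_c)
```

With sorted keys, the same parameters always map to the same filename, so a saved model can be found again regardless of how the dict was built.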



In [25]:
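def train_models(nchars=None):
    """
    Train each model on the tokens from the first nchars characters of the
    training text, save it to a pickle file, and return the list of models.
    """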
    
    # get sequence of training tokens (slow)
    train_tokens = data.tokens('train', nchars)

    # define models to test
    model_list = [
        [wp.ngram.NgramModel, {'n':2}],
        [wp.ngram.NgramModel, {'n':3}],
        [wp.ngram.NgramModel, {'n':4}],
        #[wp.rnn.RnnModel, {}],
    ]

    # iterate over models
    models = []
    for modelclass, modelparams in model_list:

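        # create, train, and save a model
        # (loading an existing model is commented out: the filename does not
        # include nchars, so a saved model may come from a different training size)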
        sparams = encode_params(modelparams)
        modelfile = 'models/' + modelclass.__name__ + '-' + sparams + '.pickle'
#        if os.path.isfile(modelfile):
#            print("load model: " + modelfile)
#            model = modelclass.load(modelfile) # static method
#        else:
        if 1:
            print("create model object")
            model = modelclass(**modelparams)

            print("train model")
            model.train(train_tokens)

            print("save model: " + modelfile)
            model.save(modelfile)

        models.append(model)

    print("done")
    return models
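
The load-or-train pattern that is commented out above can be sketched as follows, with the training size included in the filename so that models trained on different amounts of text do not overwrite each other. This is a self-contained illustration, not the `wp` API: the "model" is a plain dict of token counts, and `train_counts`, `load_or_train`, and the filename pattern are all hypothetical.

```python
import os
import pickle
import tempfile

def train_counts(tokens):
    """Hypothetical stand-in for model training: count each token."""
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    return counts

def load_or_train(modelfile, tokens):
    """Return (model, was_loaded): reuse the pickle if present, else train and save."""
    if os.path.isfile(modelfile):
        with open(modelfile, 'rb') as f:
            return pickle.load(f), True
    model = train_counts(tokens)
    with open(modelfile, 'wb') as f:
        pickle.dump(model, f)
    return model, False

model_dir = tempfile.mkdtemp()  # temporary directory instead of 'models/'
tokens = ['the', 'cat', 'and', 'the', 'dog']
nchars = 1000
# nchars is part of the filename, so each training size gets its own file
modelfile = os.path.join(model_dir, 'CountModel-(nchars-%d).pickle' % nchars)
first, first_loaded = load_or_train(modelfile, tokens)
second, second_loaded = load_or_train(modelfile, tokens)
print(first_loaded, second_loaded, second['the'])
```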

## 4. Test Models

Now that we have some trained models, let's test them against some held-out data.

In [18]:
# group a token sequence into overlapping n-token windows
def get_tuples(tokens, ntokens_per_tuple):
    """
    Group sequences of tokens together.
    e.g. ['the','dog','barked',...]=>[('the','dog'),('dog','barked'),...]
    """
    tokenlists = [tokens[i:] for i in range(ntokens_per_tuple)]
    tuples = zip(*tokenlists)
    return tuples
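
As a small worked example of the windowing above, the sketch below runs the same logic on five tokens. It wraps `zip` in `list` so that it also behaves the same on Python 3, where `zip` returns an iterator; that wrapper is an addition for illustration, not part of the notebook's function.

```python
def get_tuples(tokens, ntokens_per_tuple):
    """Group a token sequence into overlapping windows of the given length."""
    # tokenlists[i] is the sequence shifted left by i positions
    tokenlists = [tokens[i:] for i in range(ntokens_per_tuple)]
    # zip stops at the shortest list, so incomplete windows at the end are dropped
    return list(zip(*tokenlists))

tokens = ['the', 'dog', 'barked', 'at', 'me']
bigrams = get_tuples(tokens, 2)
trigrams = get_tuples(tokens, 3)
print(bigrams)
print(trigrams)
```

A sequence of N tokens yields N - n + 1 windows of length n, so the five tokens here give four bigrams and three trigrams.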

In [21]:
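def test_models(models, nchars=None):
    """
    Test each model on the tokens from the first nchars characters of the test
    text, and print how often the actual next token is among the top k predictions.
    """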

    # get the test tokens
    test_tokens = data.tokens('test', nchars)

    # run test on the models
    npredictions = 1000
    k = 3 # number of tokens to predict
    for model in models:
        print(model.name)
        n = model.n
        test_tuples = get_tuples(test_tokens, n) # group tokens into sequences
        i = 0
        nright = 0
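        for test_tuple in test_tuples: # renamed to avoid shadowing the builtin tuple
            if i >= npredictions: break # make exactly npredictions predictions
            prompt = test_tuple[:-1]
            actual = test_tuple[-1]
            prediction = model.predict(prompt, k)
            if prediction: # can be None
                predicted_tokens = [pair[0] for pair in prediction]
                if actual in predicted_tokens:
                    nright += 1
            i += 1
        # note: the denominator assumes the test set yields at least npredictions tuples
        print("nright/total=%d/%d = %f" % (nright, npredictions, nright/npredictions))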
        print()
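
The evaluation above is a top-k accuracy: a prediction counts as right when the actual next token appears among the k tokens the model proposes. The sketch below shows the same measure on a toy example. `ToyBigramModel` is a hypothetical stand-in whose `predict` returns `(token, count)` pairs or `None`, which is assumed to mirror the shape used by `model.predict` above. Unlike the cell above, it reports the number of predictions actually made as the denominator.

```python
from collections import Counter, defaultdict

class ToyBigramModel(object):
    """Hypothetical stand-in for an n-gram model with n=2."""
    n = 2

    def __init__(self):
        self.followers = defaultdict(Counter)

    def train(self, tokens):
        # count how often each token follows each other token
        for prev, nxt in zip(tokens, tokens[1:]):
            self.followers[prev][nxt] += 1

    def predict(self, prompt, k):
        """Return up to k (token, count) pairs, or None for an unseen prompt."""
        counts = self.followers.get(prompt[-1])
        if not counts:
            return None
        return counts.most_common(k)

def top_k_accuracy(model, tokens, k=3, npredictions=1000):
    """Return (nright, ntested) over at most npredictions windows."""
    nright = 0
    ntested = 0
    windows = zip(*[tokens[i:] for i in range(model.n)])
    for window in windows:
        if ntested >= npredictions:
            break
        prompt, actual = window[:-1], window[-1]
        prediction = model.predict(prompt, k)
        if prediction and actual in [token for token, _ in prediction]:
            nright += 1
        ntested += 1
    return nright, ntested

model = ToyBigramModel()
model.train('the cat sat on the mat and the cat ran'.split())
nright, ntested = top_k_accuracy(model, 'the cat sat on the rug'.split())
print('nright/total=%d/%d' % (nright, ntested))
```

Here the first four windows are predicted correctly and the last one fails, because `rug` never followed `the` in the toy training text.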

In [27]:
for nchars in (1000,5000,10000,20000):
    print(nchars)
    models = train_models(nchars)
    test_models(models, nchars)
    print()

1000
create model object
train model
get ngrams
add ngrams to model
save model: models/NgramModel-(n-2).pickle
create model object
train model
get ngrams
add ngrams to model
save model: models/NgramModel-(n-3).pickle
create model object
train model
get ngrams
add ngrams to model
save model: models/NgramModel-(n-4).pickle
done
n-gram (n=2)
nright/total=0/1000 = 0.000000

n-gram (n=3)
nright/total=0/1000 = 0.000000

n-gram (n=4)
nright/total=0/1000 = 0.000000


5000
create model object
train model
get ngrams
add ngrams to model
save model: models/NgramModel-(n-2).pickle
create model object
train model
get ngrams
add ngrams to model
save model: models/NgramModel-(n-3).pickle
create model object
train model
get ngrams
add ngrams to model
save model: models/NgramModel-(n-4).pickle
done
n-gram (n=2)
nright/total=36/1000 = 0.036000

n-gram (n=3)
nright/total=3/1000 = 0.003000

n-gram (n=4)
nright/total=0/1000 = 0.000000


10000
create model object
train model
get ngrams
add ngrams to model
sa