# Word Prediction using Recurrent Neural Networks (RNNs)
## Experiment 2016-12-21

Initial experiments comparing n-gram models (n=2 and n=3) with an RNN model on next-word prediction.

### Table of Contents

1. Load Training Data
2. Explore Training Data
3. Train / Load Models
4. Test Models

## Imports

In [1]:
# import python modules
from __future__ import print_function, division
import os.path
import random
from nltk import tokenize

In [2]:
# import wp modules (can be slow)
import sys; sys.path.append('../../src')
print('importing wp (and nltk)...')
import wp
print('done')

importing wp (and nltk)...
done


In [3]:
# reload wp modules in case changed (for development purposes)
reload(wp)
reload(wp.ngram)
reload(wp.rnn)
reload(wp.split)

['__builtins__',
 '__doc__',
 '__file__',
 '__name__',
 '__package__',
 '__path__',
 'ngram',
 'rnn',
 'split']

## Initialize

In [4]:
random.seed(0)

trainfile = 'data/all-train.txt'
testfile = 'data/all-test.txt'

## 1. Load Training Data

Load the text used to train the models

In [5]:
with open(trainfile, 'rb') as f:
    # read the whole file, trim surrounding whitespace, and lowercase it
    # (the with block closes the file, so no explicit close() is needed)
    s_train = f.read().strip().lower()
    
s_train[:100]

'the project gutenberg ebook of phantastes, by george macdonald  this ebook is for the use of anyone '

### Split text into tokens for training the models

Note that punctuation marks are treated as separate tokens.

In [6]:
train_tokens = wp.split.get_tokens(s_train)

In [7]:
print('ntokens',len(train_tokens)) # ~1 million tokens (words and punctuation marks)
print(train_tokens[2000:2040])

ntokens 1012272
['heaven', 'of', 'stars', ',', 'small', 'and', 'sparkling', 'in', 'the', 'moonlight', '.', 'alas', '!', 'it', 'was', 'no', 'sea', ',', 'but', 'a', 'low', 'bog', 'burnished', 'by', 'the', 'moon', '.', 'surely', 'there', 'is', 'such', 'a', 'sea', 'somewhere', '!', 'said', 'i', 'to', 'myself', '.']
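
The source of `wp.split.get_tokens` is not shown here, so the following is only a minimal sketch of how punctuation can be split off as separate tokens. The regular expression and the name `simple_tokens` are assumptions chosen to mimic the token style in the output above (for example, `--` kept as a single token); the real function may behave differently.

```python
import re

# Hypothetical pattern (an assumption, not the wp implementation):
# a double hyphen, a word with an optional internal apostrophe,
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"--|\w+(?:'\w+)?|[^\w\s]")

def simple_tokens(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text.lower())

print(simple_tokens("Alas! It was no sea, but a low bog."))
# ['alas', '!', 'it', 'was', 'no', 'sea', ',', 'but', 'a', 'low', 'bog', '.']
```

Keeping punctuation as tokens lets a model learn, for instance, that a sentence often ends after certain words.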


## 2. Explore Training Data

### Show some samples of the training text

In [8]:
nchars = len(s_train)
nsamples = 5
nskip = int(nchars / nsamples)
for i in range(nsamples):
    s = s_train[i*nskip:i*nskip+200]
    s = s.replace('\n', ' ').strip()
    print(s)
    print()

the project gutenberg ebook of phantastes, by george macdonald  this ebook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever.  yet is it a little window, that loo

o and come in affright as though tossed back and forth between the swords of the uhlans and the fusillade of the brigades of kempt, best, pack, and rylandt; the worst of hand-to-hand conflicts is the

, the sight of birds, tender shadows, agitated branches, and a soul made of sweetness, of faith, of candor, of hope, of aspiration, and of illusion.  cosette had left the convent when she was still al

prepossessing exterior.  coarse paper, coarsely folded--the very sight of certain missives is displeasing.  the letter which basque had brought was of this sort.  it smelled of tobacco.  marius recogn

s.  the ring of the fort drew him with stronger fascination during that hot august weather.  standing, or as his headmaster would have said, "mooning" by the gate, and looking into that enclose

### Split some text into sentences

This illustrates the sentence splitting that was used to divide the text into train, validate, and test sets.

In [9]:
sentences = wp.split.get_sentences(s_train[0:50000]) # use first 50k chars instead of all 6mb
random.seed(0)
samples = random.sample(sentences, 5)
print('\n\n'.join(samples))

do you think so?

but the face, which throbbed with fluctuating and pulsatory visibility--not from changes in the light it reflected, but from changes in its own conditions of reflecting power, the alterations being from within, not from without--it was horrible.

they are very fond of having fun with the thick people, as they call you; for, like most children, they like fun better than anything else.

not a living creature crossed my way.

but my attention was first and chiefly attracted by a group of fairies near the cottage, who were talking together around what seemed a last dying primrose.


### Split some text up into tokens, which is how the models will process it

Note that punctuation marks are treated as separate tokens.

In [10]:
for sample in samples:
    sample = str(sample) # convert unicode to a plain str (just to avoid the u'foo' prefix when printing)
    tokens = wp.split.get_tokens(sample)
    print(tokens)
    print()

['do', 'you', 'think', 'so', '?']

['but', 'the', 'face', ',', 'which', 'throbbed', 'with', 'fluctuating', 'and', 'pulsatory', 'visibility', '--', 'not', 'from', 'changes', 'in', 'the', 'light', 'it', 'reflected', ',', 'but', 'from', 'changes', 'in', 'its', 'own', 'conditions', 'of', 'reflecting', 'power', ',', 'the', 'alterations', 'being', 'from', 'within', ',', 'not', 'from', 'without', '--', 'it', 'was', 'horrible', '.']

['they', 'are', 'very', 'fond', 'of', 'having', 'fun', 'with', 'the', 'thick', 'people', ',', 'as', 'they', 'call', 'you', ';', 'for', ',', 'like', 'most', 'children', ',', 'they', 'like', 'fun', 'better', 'than', 'anything', 'else', '.']

['not', 'a', 'living', 'creature', 'crossed', 'my', 'way', '.']

['but', 'my', 'attention', 'was', 'first', 'and', 'chiefly', 'attracted', 'by', 'a', 'group', 'of', 'fairies', 'near', 'the', 'cottage', ',', 'who', 'were', 'talking', 'together', 'around', 'what', 'seemed', 'a', 'last', 'dying', 'primrose', '.']



## 3. Train / Load Models

Load each model from its pickle file if one exists; otherwise train it on the training tokens and save it.

In [11]:
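# TODO: put these in a pandas table, and use it to store results as well?
# TODO: or keep a list of model objects with these as properties?
#       how should load vs train/save be handled?

# each entry: [display name, pickle file stem, model class, constructor params]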
mlist = [
    ['n-gram (n=2)', 'ngram-model-basic-n-2', wp.ngram.NgramModel, {'n':2}],
    ['n-gram (n=3)', 'ngram-model-basic-n-3', wp.ngram.NgramModel, {'n':3}],
    ['rnn', 'rnn', wp.rnn.RnnModel, {}],
]

models = []
for m in mlist:
    modelname = m[0]
    modelfile = 'models/' + m[1] + '.pickle'
    modelclass = m[2]
    modelparams = m[3]

    # load existing model, or create, train, and save one
    if os.path.isfile(modelfile):
        print("load model: " + modelfile)
        model = modelclass.load(modelfile) # static method
    else:
        print("create model object: " + modelname)
        model = modelclass(**modelparams)

        print("train model")
        model.train(train_tokens)

        print("save model: " + modelfile)
        model.save(modelfile)

    models.append(model)
    
print("done")

create model object: n-gram (n=2)
train model
get ngrams
add ngrams to model
save model: models/ngram-model-basic-n-2.pickle
create model object: n-gram (n=3)
train model
get ngrams
add ngrams to model
save model: models/ngram-model-basic-n-3.pickle
create model object: rnn
train model
save model: models/rnn.pickle
done
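
The load-or-train pattern above relies on `save` and `load` methods whose implementation is not shown. Below is a rough sketch of one way such persistence could work with `pickle`. `TinyModel` and `load_or_train` are hypothetical names introduced for illustration, and `load` is written as a classmethod here rather than the static method the notebook mentions.

```python
import os
import pickle
import tempfile

class TinyModel(object):
    """Hypothetical stand-in for a model class with train/save/load."""

    def __init__(self, n=2):
        self.n = n
        self.ntokens = 0

    def train(self, tokens):
        # A real model would build its statistics here; this one only counts.
        self.ntokens = len(tokens)

    def save(self, filename):
        # Pickle only the instance state, not the class itself.
        with open(filename, 'wb') as f:
            pickle.dump(self.__dict__, f, protocol=2)

    @classmethod
    def load(cls, filename):
        with open(filename, 'rb') as f:
            state = pickle.load(f)
        model = cls()
        model.__dict__.update(state)
        return model

def load_or_train(modelclass, modelfile, params, tokens):
    """Load a saved model if the file exists, else create, train, and save one."""
    if os.path.isfile(modelfile):
        return modelclass.load(modelfile)
    model = modelclass(**params)
    model.train(tokens)
    model.save(modelfile)
    return model

path = os.path.join(tempfile.mkdtemp(), 'tiny.pickle')
first = load_or_train(TinyModel, path, {'n': 3}, ['a', 'b', 'c'])  # trains and saves
second = load_or_train(TinyModel, path, {'n': 3}, [])              # loads from disk
print("n=%d ntokens=%d" % (second.n, second.ntokens))
# n=3 ntokens=3
```

One consequence of this pattern is that a stale pickle file silently wins over retraining, so a saved model has to be deleted by hand after changing the training data or parameters.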


## 4. Test Models

Now that we have some trained models, let's test them against some held-out data.

### Load the test data

In [12]:
with open(testfile, 'rb') as f:
    # read the whole file, trim surrounding whitespace, and lowercase it
    # (the with block closes the file, so no explicit close() is needed)
    s_test = f.read().strip().lower()
    
s_test[5000:5500]

' but could see nothing from which such a shadow should fall.\n\nwhat did i see?\n\ni saw the strangest figure; vague, shadowy, almost transparent, in the central parts, and gradually deepening in substance towards the outside, until it ended in extremities capable of casting such a shadow as fell from the hand, through the awful fingers of which i now saw the moon.\n\novercome with the mingling of terror and joy, i lay for some time almost insensible.\n\ni turned my head, but without moving otherwise, f'

### Split text into tuples of tokens

In [13]:
# tokenize the test text
# (note: training used wp.split.get_tokens; here nltk's word_tokenize is called directly)
test_tokens = tokenize.word_tokenize(s_test) # eg ['the','dog','barked',...]

In [14]:
print(test_tokens[1000:1040])

['of', 'my', 'foe', ';', 'for', 'as', 'yet', 'this', 'vague', 'though', 'powerful', 'fear', 'was', 'all', 'the', 'indication', 'of', 'danger', 'i', 'had', '.', 'i', 'looked', 'hurriedly', 'all', 'around', ',', 'but', 'could', 'see', 'nothing', 'from', 'which', 'such', 'a', 'shadow', 'should', 'fall', '.', 'what']


In [15]:
# now group the tokens into overlapping tuples
# TODO: iterate over several values of ntokens_per_tuple?
ntokens_per_tuple = 2
tokenlists = [test_tokens[i:] for i in range(ntokens_per_tuple)]
test_tuples = zip(*tokenlists) # eg [('the','dog'), ('dog','barked'), ...]
print(test_tuples[100:120])

[('chamber', 'where'), ('where', 'the'), ('the', 'secretary'), ('secretary', 'stood'), ('stood', ','), (',', 'the'), ('the', 'first'), ('first', 'lights'), ('lights', 'that'), ('that', 'had'), ('had', 'been'), ('been', 'there'), ('there', 'for'), ('for', 'many'), ('many', 'a'), ('a', 'year'), ('year', ';'), (';', 'for'), ('for', ','), (',', 'since')]
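
The grouping step can be wrapped in a small helper so that different tuple lengths are easy to try. This is only a sketch: `make_tuples` is a hypothetical name, and `list(...)` is added so the result is a list on both Python 2 and Python 3.

```python
def make_tuples(tokens, n):
    """Return every run of n consecutive tokens as a list of tuples."""
    # Shift the token list by 0, 1, ..., n-1 positions and zip the copies;
    # zip stops at the shortest copy, so no partial tuples are produced.
    shifted = [tokens[i:] for i in range(n)]
    return list(zip(*shifted))

tokens = ['the', 'dog', 'barked', 'loudly', '.']
for n in (2, 3):
    print("n=%d: %s" % (n, make_tuples(tokens, n)))
```

A list of `k` tokens yields `k - n + 1` tuples, so the five tokens above give four pairs and three triples.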


### Test models

In [18]:
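imax = 100 # number of test tuples to evaluate per model
sample_tuples = test_tuples[:imax] # exactly imax tuples, so the reported total is accurate
for model in models:
    print(model.name)
    nright = 0
    for tokens in sample_tuples: # avoid the name `tuple`, which shadows the builtin
        prompt = tokens[:-1] # all tokens but the last form the prompt
        y_actual = tokens[-1] # the last token is the word to predict
        y_predict = model.predict(prompt)
        #print(tokens, y_predict)
        if y_actual == y_predict:
            nright += 1
    print("nright/total=%d/%d" % (nright, len(sample_tuples)))
    print()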

n-gram (n=2)
nright/total=8/100

n-gram (n=3)
nright/total=7/100

rnn
nright/total=0/100
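
To make the scoring procedure concrete, here is a small self-contained sketch with a toy model. `BigramSketch` and `accuracy` are hypothetical names, not part of `wp`; the toy model simply predicts the most frequent follower of the last prompt token, which is one common way to build a bigram predictor but is not necessarily how `wp.ngram.NgramModel` works.

```python
from collections import Counter, defaultdict

class BigramSketch(object):
    """Hypothetical bigram model: predicts the most frequent next token."""

    def __init__(self):
        self.followers = defaultdict(Counter)

    def train(self, tokens):
        # Count how often each token follows each other token.
        for prev, nxt in zip(tokens, tokens[1:]):
            self.followers[prev][nxt] += 1

    def predict(self, prompt):
        counts = self.followers.get(prompt[-1])
        if not counts:
            return None  # last prompt token was never seen in training
        return counts.most_common(1)[0][0]

def accuracy(model, tuples, limit=None):
    """Return (nright, ntotal) over the first `limit` tuples."""
    subset = tuples if limit is None else tuples[:limit]
    nright = 0
    for tokens in subset:
        prompt, y_actual = tokens[:-1], tokens[-1]
        if model.predict(prompt) == y_actual:
            nright += 1
    return nright, len(subset)

train = 'the dog barked . the dog barked . the cat slept .'.split()
test = 'the dog slept .'.split()
model = BigramSketch()
model.train(train)
nright, ntotal = accuracy(model, list(zip(test, test[1:])))
print("nright/total=%d/%d" % (nright, ntotal))
# nright/total=2/3
```

Returning the total alongside the number of correct predictions keeps the reported denominator tied to the number of tuples actually scored.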

