# Word Prediction using Recurrent Neural Networks (RNNs)
## Experiment 2016-12-23

Train each model on increasing amounts of training text and plot the resulting learning curves.

### Table of Contents

1. Prepare Data
2. Explore Data
3. Analyze Models
4. Generate Text

## Imports

In [17]:
# import python modules
from __future__ import print_function, division
import os.path
import random
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from nltk import tokenize

In [18]:
# import wp modules (can be slow)
import sys; sys.path.append('../../src')
print('importing wp (and nltk)...')
import wp
print('done')

importing wp (and nltk)...
done


In [19]:
# reload wp modules in case changed (for development purposes)
reload(wp)
reload(wp.data)
reload(wp.ngram)
reload(wp.rnn)
reload(wp.analyze)

<module 'wp.analyze' from '../../src\wp\analyze.py'>

## Initialize

In [20]:
random.seed(0)

## 1. Prepare Data

Merge raw text files, convert to plain strings, split into train, validate, and test sets.

In [21]:
# get wrapper around all data and tokenization
data = wp.data.Data()

Merge the raw data files into one and remove non-ASCII characters (nltk complains otherwise).

In [22]:
data.merge()

The raw files have already been merged.
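
The non-ASCII cleanup can be sketched as follows. This is a minimal illustration only: `strip_non_ascii` is a hypothetical helper, not the code inside `wp.data.Data.merge`, which may handle encodings differently.

```python
from __future__ import print_function

def strip_non_ascii(text):
    """Return text with every character outside the 7-bit ASCII range removed."""
    # hypothetical helper; assumed to approximate the cleanup done during merging
    return ''.join(ch for ch in text if ord(ch) < 128)

sample = u'caf\u00e9 na\u00efve'
print(strip_non_ascii(sample))
```

Dropping characters is the simplest option; transliterating them (for example mapping accented letters to plain ones) would keep more of the original words.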


Split the merged file by sentences into train, validate, and test sets.

In [23]:
data.split()

The merged file has already been split.
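
A sentence-level split might look like the sketch below. The 80/10/10 proportions, the shuffling, and the name `split_sentences` are assumptions made for illustration; the notebook does not show how `data.split()` chooses its proportions or ordering.

```python
from __future__ import print_function
import random

def split_sentences(sentences, ptrain=0.8, pvalidate=0.1, seed=0):
    """Shuffle sentences and divide them into train, validate, and test lists."""
    # assumed proportions: whatever is left after train and validate goes to test
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    ntrain = int(len(shuffled) * ptrain)
    nvalidate = int(len(shuffled) * pvalidate)
    train = shuffled[:ntrain]
    validate = shuffled[ntrain:ntrain + nvalidate]
    test = shuffled[ntrain + nvalidate:]
    return train, validate, test

sentences = ['Sentence number %d.' % i for i in range(100)]
train, validate, test = split_sentences(sentences)
print(len(train), len(validate), len(test))
```

Splitting on sentence boundaries rather than at arbitrary character offsets keeps every sentence whole within a single set.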


## 2. Explore Data

### Show some samples of the text

In [24]:
s_merged = data.text('merged')
nsamples = 5
nchars = len(s_merged)
nskip = int(nchars / nsamples)
for i in range(nsamples):
    s = s_merged[i*nskip:i*nskip+200]
    s = s.replace('\n', ' ').strip()
    print(s)
    print()

The Project Gutenberg EBook of Phantastes, by George MacDonald  This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever.  You may copy it, give it away or re

to be vanquished, retreated; but Wellington shouted, "Up, Guards, and aim straight!" The red regiment of English guards, lying flat behind the hedges, sprang up, a cloud of grape-shot riddled the tric

xcept that geometrical point, the _I_; bringing everything back to the soul-atom; expanding everything in God, entangling all activity, from summit to base, in the obscurity of a dizzy mechanism, atta

y, or to speak more accurately, that same evening, as Marius left the table, and was on the point of withdrawing to his study, having a case to look over, Basque handed him a letter saying: "The perso

ay evening in January, the lonely valley had been a desirable place to him; he had watched the green battlements in summer and winter weather, had seen the heaped mounds rising dimly amidst th

### Show some text split into sentences

This shows how the text is divided into sentences before being assigned to the train, validate, and test sets.

In [25]:
# we'll just look at the first 50k characters, because parsing sentences is slow
sentences = data.sentences('merged', 50000)
random.seed(0)
samples = random.sample(sentences, 5)
print('\n\n'.join(samples))

And off they set, after some new mischief.

Primrose is gone.

But how then do you come to live here?

He looked up, and lo!

It contained many wondrous tales of Fairy Land, and olden times, and the Knights of King Arthurs table.


### Show the text split into tokens

Note that punctuation marks are treated as separate tokens.

In [26]:
tokens = data.tokens('merged', 50000) # look at first 50k characters
print('ntokens',len(tokens))
print(tokens[8000:8100])

ntokens 10369
[';', 'as', 'if', 'we', 'were', 'not', 'good', 'enough', 'to', 'look', 'at', 'her', ',', 'and', 'she', 'was', ',', 'the', 'proud', 'thing', '!', '--', 'served', 'her', 'right', '!', 'Oh', ',', 'Pocket', ',', 'Pocket', ',', 'said', 'I', ';', 'but', 'by', 'this', 'time', 'the', 'party', 'which', 'had', 'gone', 'towards', 'the', 'house', ',', 'rushed', 'out', 'again', ',', 'shouting', 'and', 'screaming', 'with', 'laughter', '.', 'Half', 'of', 'them', 'were', 'on', 'the', 'cats', 'back', ',', 'and', 'half', 'held', 'on', 'by', 'her', 'fur', 'and', 'tail', ',', 'or', 'ran', 'beside', 'her', ';', 'till', ',', 'more', 'coming', 'to', 'their', 'help', ',', 'the', 'furious', 'cat', 'was', 'held', 'fast', ';', 'and', 'they', 'proceeded']
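
To illustrate what "punctuation as separate tokens" means, here is a rough regex-based sketch. It is not the tokenizer used by `wp.data` (the notebook imports `nltk.tokenize`, which treats contractions and quotation marks differently); `simple_tokenize` is a hypothetical stand-in that only approximates the behaviour on simple text.

```python
from __future__ import print_function
import re

def simple_tokenize(text):
    """Split text into word tokens and standalone punctuation tokens."""
    # a run of word characters, a double hyphen, or any single non-space symbol
    return re.findall(r"\w+|--|[^\w\s]", text)

print(simple_tokenize('Oh, Pocket, Pocket, said I; but by this time'))
```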


## 3. Analyze Models

Train models on the training tokens and test them on the test tokens.

In [27]:
# define models to test
modelspecs = [
    [wp.ngram.NgramModel, {'n':2}],
    [wp.ngram.NgramModel, {'n':3}],
    [wp.ngram.NgramModel, {'n':4}],
    [wp.rnn.RnnModel, {}],
]
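
For readers unfamiliar with n-gram models, the idea can be sketched in a few lines: count which token follows each context of n-1 tokens, then predict the most frequent followers. `TinyNgram` below is a hypothetical illustration and is not the `wp.ngram.NgramModel` implementation, whose interface and smoothing may differ.

```python
from __future__ import print_function
from collections import Counter, defaultdict

class TinyNgram(object):
    """Hypothetical minimal n-gram predictor (requires n >= 2)."""

    def __init__(self, n=2):
        self.n = n
        self.counts = defaultdict(Counter)  # context tuple -> follower counts

    def train(self, tokens):
        n = self.n
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            self.counts[context][tokens[i + n - 1]] += 1

    def predict(self, context, k=3):
        """Return up to k most frequent followers of the last n-1 context tokens."""
        context = tuple(context[-(self.n - 1):])
        followers = self.counts.get(context, Counter())
        return [token for token, _ in followers.most_common(k)]

model = TinyNgram(n=2)
model.train('the cat sat on the mat and the cat ran'.split())
print(model.predict(['the']))
```

Larger n gives more specific contexts but far sparser counts, which is one plausible reason the higher-order models need more training text before they become useful.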

In [28]:
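# always rebuild the results table; the commented-out guard below would
# skip this step when `rows` already exists
#try:
#    rows
#except NameError:
if 1:
    modelfolder = '../../data/models'
    # TODO: should be ntraining_tokens (size is currently given in characters)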
    nchars_list = (1000,10000,100000,1000000,6000000)
    rows = []
    npredictions_max = 1000
    k = 3 # predict top k tokens
    for nchars in nchars_list:
        print('ntraining_chars', nchars)
        models = wp.analyze.init_models(modelspecs, modelfolder, data, nchars)
        results = wp.analyze.test_models(models, data, npredictions_max, k, nchars)
        print()
        row = [nchars] + results
        rows.append(row)

ntraining_chars 1000
create model object
load model
create model object
load model
create model object
load model
n-gram (n=2): accuracy = nright/total = 1/180 = 0.005556
n-gram (n=3): accuracy = nright/total = 0/179 = 0.000000
n-gram (n=4): accuracy = nright/total = 0/178 = 0.000000

ntraining_chars 10000
create model object
load model
create model object
load model
create model object
load model
n-gram (n=2): accuracy = nright/total = 99/1001 = 0.098901
n-gram (n=3): accuracy = nright/total = 13/1001 = 0.012987
n-gram (n=4): accuracy = nright/total = 3/1001 = 0.002997

ntraining_chars 100000
create model object
load model
create model object
load model
create model object
load model
n-gram (n=2): accuracy = nright/total = 190/1001 = 0.189810
n-gram (n=3): accuracy = nright/total = 85/1001 = 0.084915
n-gram (n=4): accuracy = nright/total = 14/1001 = 0.013986

ntraining_chars 1000000
create model object
load model
create model object
load model
create model object
load model
n-gram (n=

In [29]:
cols = ['nchars'] + [model.name for model in models]
df = pd.DataFrame(rows, columns=cols)
df

| | nchars | n-gram (n=2) | n-gram (n=3) | n-gram (n=4) |
|---|---|---|---|---|
| 0 | 1000 | 0.005556 | 0.0 | 0.0 |
| 1 | 10000 | 0.098901 | 0.012987 | 0.002997 |
| 2 | 100000 | 0.18981 | 0.084915 | 0.013986 |
| 3 | 1000000 | 0.233766 | 0.132867 | 0.035964 |
| 4 | 6000000 | 0.238761 | 0.175824 | 0.052947 |
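
The accuracy figures are reported as `nright/total`. One plausible reading, given `k = 3`, is top-k accuracy: a prediction counts as right when the true next token is among the k suggestions. The sketch below implements that reading; it is an assumption about what `wp.analyze.test_models` measures, and `topk_accuracy` and the toy predictor are hypothetical names.

```python
from __future__ import print_function, division

def topk_accuracy(predict, tokens, ncontext, k=3, npredictions_max=1000):
    """Fraction of positions where the true token is among the top-k predictions."""
    nright = 0
    total = 0
    for i in range(ncontext, len(tokens)):
        if total >= npredictions_max:
            break
        context = tokens[i - ncontext:i]
        if tokens[i] in predict(context, k):
            nright += 1
        total += 1
    return nright / total if total else 0.0

# toy predictor: guesses 'b' after 'a', and 'a' after anything else
def toy_predict(context, k):
    return ['b'] if context[-1] == 'a' else ['a']

tokens = ['a', 'b', 'a', 'b', 'a', 'c']
print(topk_accuracy(toy_predict, tokens, ncontext=1))
```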


In [30]:
dft = df.transpose()
dft.columns = nchars_list
dft2 = dft.drop('nchars',axis=0)
dft2

| | 1000 | 10000 | 100000 | 1000000 | 6000000 |
|---|---|---|---|---|---|
| n-gram (n=2) | 0.005556 | 0.098901 | 0.18981 | 0.233766 | 0.238761 |
| n-gram (n=3) | 0.0 | 0.012987 | 0.084915 | 0.132867 | 0.175824 |
| n-gram (n=4) | 0.0 | 0.002997 | 0.013986 | 0.035964 | 0.052947 |


In [None]:
# plot one learning curve per model; label each line so the legend has entries
for name in dft2.index:
    plt.plot(dft2.columns, dft2.loc[name], label=name)
plt.legend(loc='best')
plt.xscale('log')
plt.xlabel('Training set size (chars)')
plt.ylabel('Accuracy')
plt.show()

## 4. Generate Text
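
Text generation from an n-gram model amounts to repeatedly sampling a next token in proportion to how often it followed the current context in training. The bigram sketch below illustrates this under stated assumptions: `build_bigram_counts` and `generate` are hypothetical helpers, and the real `model.generate(k)` may choose its starting token and sample differently.

```python
from __future__ import print_function
import random
from collections import Counter, defaultdict

def build_bigram_counts(tokens):
    """Count how often each token follows each preceding token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, k, rng):
    """Generate up to k tokens, sampling followers in proportion to their counts."""
    token = rng.choice(sorted(counts))  # assumed: start from a random known token
    output = [token]
    while len(output) < k:
        followers = counts.get(token)
        if not followers:
            break  # dead end: this token was never followed by anything
        # elements() repeats each follower by its count, so choice() is weighted
        token = rng.choice(sorted(followers.elements()))
        output.append(token)
    return output

counts = build_bigram_counts('the cat sat on the mat and the cat ran'.split())
print(' '.join(generate(counts, 8, random.Random(0))))
```

Passing an explicit `random.Random` instance keeps each generated sequence reproducible, similar in spirit to the per-sequence `random.seed(seed)` call in the cell below.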

In [32]:
random.seed(0)
k = 20 # number of words to generate per sequence
nsequences = 10
for model in models:
    print(model.name)
    for seed in range(nsequences):
        random.seed(seed)
        tokens = model.generate(k)
        if tokens:
            s = ' '.join(tokens)
            print(s)
            print()
    print()

n-gram (n=2)
Hucheloup . `` and whatever , and the Italian , there was no additional contact information about the discovery

. Cosette and flung a blow that day , for instance , and blouse , and put the mire

And the old stockade . The lip was indubitable -- THE FOUNDATION , your breakfast ? The woman of

Jean Valjean really took up , child ! There are together , and a large crack , and ``

Unless seen by a third . After descending the young Hawkins , slang for a cutlass . To have

Thenceforth , and mock solemnities . It was warming himself on such as with his grandfather and gladly accept

Madeleine arrived there were all the neck was not the sister 's the banks of a bore a few

'' `` God is subject of them . It was , impressions of grunt of the doctor ? ''

Gray with me . His hair nearly as if suddenly withdrawn , autumn ; `` There , then not

She opened the brim and terrible quagmire was trying to my fancy it had n't know what may be


n-gram (n=3)
Zieten putting France to prefer t