# Word Prediction using Recurrent Neural Networks (RNNs)
## Experiment 2016-12-23

Loop over training set sizes and plot a learning curve for each model.

### Table of Contents

1. Prepare Data
2. Explore Data
3. Analyze Models
4. Generate Text

## Imports

In [132]:
# import python modules
from __future__ import print_function, division
import os.path
import random
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from nltk import tokenize

In [133]:
# import wp modules (can be slow)
import sys; sys.path.append('../../src')
print('importing wp (and nltk)...')
import wp
print('done')

importing wp (and nltk)...
done


In [134]:
# reload wp modules in case changed (for development purposes)
reload(wp)
reload(wp.data)
reload(wp.util)
reload(wp.model)
reload(wp.ngram)
reload(wp.rnn)
reload(wp.analyze)

<module 'wp.analyze' from '../../src\wp\analyze.pyc'>

## Initialize

In [135]:
random.seed(0)

## 1. Prepare Data

Merge raw text files, convert to plain strings, split into train, validate, and test sets.

In [136]:
# get wrapper around all data and tokenization
data = wp.data.Data()

Merge the raw data files into one and remove non-ASCII characters (NLTK complains otherwise).

In [137]:
data.merge()

The raw files have been merged.
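
As a rough illustration of the non-ASCII removal step, the filter could look like the sketch below. The function name `strip_non_ascii` is hypothetical and is not taken from `wp.data`; the actual `merge()` implementation may work differently.

```python
def strip_non_ascii(text):
    """Return text with every character outside the 7-bit ASCII range removed.

    Hypothetical helper for illustration; not the wp.data implementation.
    """
    return ''.join(ch for ch in text if ord(ch) < 128)


# accented characters are dropped rather than transliterated
sample = u'Les Mis\u00e9rables, by Victor Hugo'
print(strip_non_ascii(sample))
```

Dropping characters rather than transliterating them is the simplest option, and it would explain spellings such as "Les Misrables" in the text samples further down.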


Split the merged file by sentences into train, validate, and test sets.

In [138]:
data.split()

The merged file has been split into train, validate, and test files.
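
A sentence-level split can be sketched as follows. This is only a sketch under stated assumptions: the 80/10/10 proportions, the shuffling, and the name `split_sentences` are all hypothetical, and `data.split()` may use different proportions or keep the sentences in order.

```python
import random


def split_sentences(sentences, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle sentences and split them into train, validate, and test lists.

    The fractions and the shuffle are assumptions made for this sketch.
    """
    rng = random.Random(seed)      # private generator, leaves global state alone
    shuffled = list(sentences)     # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    ntrain = int(len(shuffled) * fractions[0])
    nvalidate = int(len(shuffled) * fractions[1])
    train = shuffled[:ntrain]
    validate = shuffled[ntrain:ntrain + nvalidate]
    test = shuffled[ntrain + nvalidate:]   # whatever remains goes to test
    return train, validate, test


sentences = ['Sentence number %d.' % i for i in range(100)]
train, validate, test = split_sentences(sentences)
print(len(train), len(validate), len(test))
```

Splitting on sentence boundaries keeps each sentence intact within a single set, so no n-gram straddles two sets.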


## 2. Explore Data

### Show some samples of the text

In [139]:
s_merged = data.text('merged')
nsamples = 5
nchars = len(s_merged)
nskip = int(nchars / nsamples)
for i in range(nsamples):
    s = s_merged[i*nskip:i*nskip+200]
    s = s.replace('\n', ' ').strip()
    print(s)
    print()

The Project Gutenberg EBook of Les Misrables, by Victor Hugo  This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-us

ese names, Droit-Mur and Aumarais, are very ancient; the streets which bear them are very much more ancient still. Aumarais Lane was called Maugout Lane; the Rue Droit-Mur was called the Rue des glant

y uneasy and very suspicious, and that while seeking to ferret out a man like Ppin or Morey, they might very readily discover a man like Jean Valjean.  Jean Valjean had made up his mind to quit Paris,

ooked mistily about him. First he recognized the doctor with an unmistakable frown; then his glance fell upon me, and he looked relieved. But suddenly his colour changed, and he tried to raise himself

e articles to magazines, in pathetic ignorance of the trade. He felt the immense difficulty of the career of literature without clearly understanding it; the battle was happily in a mist, so t

### Show some text split into sentences

This shows the sentence units into which the text was divided before being distributed among the train, validate, and test sets.

In [140]:
# we'll just look at the first 50k characters, because parsing sentences is slow
sentences = data.sentences('merged', 50000)
random.seed(0)
samples = random.sample(sentences, 5)
print('\n\n'.join(samples))

To villages where he found no schoolmaster, he quoted once more the people of Queyras: "Do you know how they manage?"

Those who had and those who lacked knocked at M. Myriel's door,--the latter in search of the alms which the former came to deposit.

.

.

.


### Show the text split into tokens

Note that punctuation marks are treated as separate tokens.

In [141]:
tokens = data.tokens('merged', 50000) # look at first 50k characters
print('ntokens',len(tokens))
print(tokens[8000:8100])

ntokens 10556
['hard', 'bishopric', 'for', 'a', 'good', 'bishop', 'the', 'bishop', 'did', 'not', 'omit', 'his', 'pastoral', 'visits', 'because', 'he', 'had', 'converted', 'his', 'carriage', 'into', 'alms', '.', 'END', 'the', 'diocese', 'of', 'd', '--', '--', 'is', 'a', 'fatiguing', 'one', '.', 'END', 'there', 'are', 'very', 'few', 'plains', 'and', 'a', 'great', 'many', 'mountains', ';', 'hardly', 'any', 'roads', ',', 'as', 'we', 'have', 'just', 'seen', ';', 'thirty-two', 'curacies', ',', 'forty-one', 'vicarships', ',', 'and', 'two', 'hundred', 'and', 'eighty-five', 'auxiliary', 'chapels', '.', 'END', 'to', 'visit', 'all', 'these', 'is', 'quite', 'a', 'task', '.', 'END', 'the', 'bishop', 'managed', 'to', 'do', 'it', '.', 'END', 'he', 'went', 'on', 'foot', 'when', 'it', 'was', 'in', 'the', 'neighborhood']
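
The token stream above (lowercased words, punctuation as separate tokens, and an `END` marker after each sentence) can be approximated with a small regular-expression tokenizer. This is a simplified stand-in, not the NLTK-based tokenizer the notebook uses; `TOKEN_RE` and `tokenize_sentence` are hypothetical names, and edge cases such as contractions and dash runs will not match NLTK's behavior exactly.

```python
import re

# a word (optionally with internal hyphens or apostrophes), a run of dashes,
# or any single non-space, non-alphanumeric character
TOKEN_RE = re.compile(r"[a-z0-9]+(?:[-'][a-z0-9]+)*|-+|[^\sa-z0-9]")


def tokenize_sentence(sentence):
    """Lowercase a sentence, split it into tokens, and append the END marker."""
    tokens = TOKEN_RE.findall(sentence.lower())
    return tokens + ['END']


print(tokenize_sentence('To visit all these is quite a task.'))
```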


## 3. Analyze Models

Train models on the training tokens and test them on the test tokens.

In [142]:
# define models to test
modelspecs = [
    [wp.ngram.NgramModel, {'n':1}],
    [wp.ngram.NgramModel, {'n':2}],
    [wp.ngram.NgramModel, {'n':3}],
    [wp.ngram.NgramModel, {'n':4}],
#    [wp.rnn.RnnModel, {}],
]

In [148]:
# if output table already exists, skip this step
#try:
#    rows
#except:
if 1:
    modelfolder = '../../data/models'
    # TODO: sizes are character counts; ideally these would be token counts (ntraining_tokens)
    nchars_list = (1000,10000,100000,1000000,6000000)
    #nchars_list = (1000,10000,100000)
    rows = []
    npredictions_max = 1000
    k = 3 # predict top k tokens
    for nchars in nchars_list:
        print('ntraining_chars', nchars)
        models = wp.analyze.init_models(modelspecs, modelfolder, data, nchars)
        results = wp.analyze.test_models(models, data, npredictions_max, k, nchars)
        print()
        row = [nchars] + results
        rows.append(row)

ntraining_chars 1000
create model object
load model
create model object
load model
create model object
load model
create model object
load model
get complete stream of test tokens, nchars=1000
get tuples, n=1
n-gram (n=1): accuracy = nright/total = 44/185 = 0.237838
get tuples, n=2
n-gram (n=2): accuracy = nright/total = 23/184 = 0.125000
get tuples, n=3
n-gram (n=3): accuracy = nright/total = 12/183 = 0.065574
get tuples, n=4
n-gram (n=4): accuracy = nright/total = 1/182 = 0.005495

ntraining_chars 10000
create model object
load model
create model object
load model
create model object
load model
create model object
load model
get complete stream of test tokens, nchars=10000
get tuples, n=1
n-gram (n=1): accuracy = nright/total = 187/1001 = 0.186813
get tuples, n=2
n-gram (n=2): accuracy = nright/total = 205/1001 = 0.204795
get tuples, n=3
n-gram (n=3): accuracy = nright/total = 106/1001 = 0.105894
get tuples, n=4
n-gram (n=4): accuracy = nright/total = 33/1001 = 0.032967

ntraining_ch
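
The `accuracy = nright/total` figures in the log can be read as top-k accuracy: a prediction counts as right when the actual next token is among the model's k most likely candidates. The sketch below shows one way to compute this for a count-based n-gram model. `TinyNgram` and `topk_accuracy` are hypothetical names, not the `wp.ngram` or `wp.analyze` API, and the real models may smooth or back off where this sketch does not.

```python
from collections import Counter, defaultdict


class TinyNgram(object):
    """Hypothetical minimal n-gram model: counts the token following each context."""

    def __init__(self, n):
        self.n = n
        self.counts = defaultdict(Counter)   # context tuple -> Counter of next tokens

    def train(self, tokens):
        for i in range(len(tokens) - self.n + 1):
            ntuple = tokens[i:i + self.n]
            self.counts[tuple(ntuple[:-1])][ntuple[-1]] += 1

    def predict(self, context, k):
        """Return up to k of the most frequent next tokens for this context."""
        followers = self.counts.get(tuple(context), Counter())
        return [token for token, _ in followers.most_common(k)]


def topk_accuracy(model, tokens, k):
    """Return (nright, total): one prediction is scored per n-tuple in tokens."""
    nright = 0
    total = 0
    for i in range(len(tokens) - model.n + 1):
        context = tokens[i:i + model.n - 1]
        actual = tokens[i + model.n - 1]
        if actual in model.predict(context, k):
            nright += 1
        total += 1
    return nright, total


train_tokens = 'the cat sat . END the cat ran . END the dog sat . END'.split()
test_tokens = 'the dog ran . END'.split()
model = TinyNgram(n=2)
model.train(train_tokens)
nright, total = topk_accuracy(model, test_tokens, k=1)
print('accuracy = nright/total = %d/%d = %f' % (nright, total, float(nright) / total))
```

Scoring one prediction per n-tuple is consistent with the totals in the first run of the log, which fall by one (185, 184, 183, 182) as n goes from 1 to 4.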

In [149]:
cols = ['nchars'] + [model.name for model in models]
df = pd.DataFrame(rows, columns=cols)
df

|   | nchars | n-gram (n=1) | n-gram (n=2) | n-gram (n=3) | n-gram (n=4) |
|---|---|---|---|---|---|
| 0 | 1000 | 0.237838 | 0.125 | 0.065574 | 0.005495 |
| 1 | 10000 | 0.186813 | 0.204795 | 0.105894 | 0.032967 |
| 2 | 100000 | 0.118881 | 0.298701 | 0.181818 | 0.078921 |
| 3 | 1000000 | 0.118881 | 0.34965 | 0.371628 | 0.32967 |
| 4 | 6000000 | 0.118881 | 0.341658 | 0.381618 | 0.34965 |


In [150]:
dft = df.transpose()
dft.columns = nchars_list
dft2 = dft.drop('nchars',axis=0)
dft2

|   | 1000 | 10000 | 100000 | 1000000 | 6000000 |
|---|---|---|---|---|---|
| n-gram (n=1) | 0.237838 | 0.186813 | 0.118881 | 0.118881 | 0.118881 |
| n-gram (n=2) | 0.125 | 0.204795 | 0.298701 | 0.34965 | 0.341658 |
| n-gram (n=3) | 0.065574 | 0.105894 | 0.181818 | 0.371628 | 0.381618 |
| n-gram (n=4) | 0.005495 | 0.032967 | 0.078921 | 0.32967 | 0.34965 |


In [None]:
# plot one learning curve per model (one row of dft2 each);
# .loc replaces the deprecated .ix, and label= gives the legend its entries
for name in dft2.index:
    plt.plot(dft2.columns, dft2.loc[name], label=name)
#plt.legend(loc='best')
plt.legend(loc=(1.1,0.5))
plt.xscale('log')
plt.xlabel('Training set size (chars)')
plt.ylabel('Accuracy')
plt.show()

## 4. Generate Text
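
As a rough idea of what `model.generate()` might do for an n-gram model, the sketch below samples a sentence from bigram counts one token at a time until it draws `END`. It rests on several assumptions: `train_bigrams` and `generate` are hypothetical names, starting from the `END` context is a guess at how sentence starts are chosen, and the `wp` models may sample differently.

```python
import random
from collections import Counter, defaultdict


def train_bigrams(tokens):
    """Count, for every token, which tokens follow it and how often."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, rng, start='END', max_tokens=50):
    """Sample tokens in proportion to bigram counts until END is drawn.

    Starting from the END context (an assumption) makes the first token
    a word that was seen at the start of a sentence.
    """
    output = []
    prev = start
    for _ in range(max_tokens):
        followers = counts.get(prev)
        if not followers:
            break                      # unseen context: stop early
        # expand counts into a pool so rng.choice samples proportionally;
        # sorting makes the result reproducible for a given seed
        pool = sorted(followers.elements())
        token = rng.choice(pool)
        output.append(token)
        if token == 'END':
            break
        prev = token
    return output


tokens = 'the cat sat . END the dog ran . END'.split()
counts = train_bigrams(tokens)
print(' '.join(generate(counts, random.Random(0))))
```

Passing an explicit `random.Random` instance plays the same role as the `random.seed(seed)` call in the cell below: each seed gives a reproducible sentence.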

In [153]:
nsentences = 5
for model in models:
    print(model.name)
    for seed in range(nsentences):
        random.seed(seed)
        tokens = model.generate()
        if tokens:
            s = ' '.join(tokens)
            print(s)
            print()
    print()

n-gram (n=1)
a and seeking somewhere of in bobbie the END

bad a and it END

occupant good one off a lord me the seemed expressed to case valjean in once safely youth something been de . . END

by something he us was with . a then be will END

by foot in when with in , word and at or the from fairs as , you whenever word for the was which some , plot seemed me on unknown END


n-gram (n=2)
who ought to the matter too much ; and all her pretty nigh hand . END

it was nothing really had seen everywhere parallel rows of rum , and poorly , and sacrificing no longer find to you 're free from the project gutenberg-tm mission of the animal spirits , the foundation of these mingled in courtyards , with them bowed to which are there can get up the white one person conscientiously and let me ! END

he began in debt to grant succor from the middle of the troops , and kissed him . END

the 21st of age . END

the one episode of bread , what ? '' END


n-gram (n=3)
surely that was always closed . EN