# Word Prediction using Recurrent Neural Networks (RNNs)
## Experiment 2016-12-26

This experiment adds an RNN model alongside the n-gram models.

### Table of Contents

1. Initialize
2. Prepare Data
3. Explore Data
4. Train Models
5. Test Models
6. Generate Text

## 1. Initialize
### Import

In [34]:
# import python modules
from __future__ import print_function, division
import os.path
import random
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from nltk import tokenize

In [35]:
# import wp modules (can be slow)
import sys; sys.path.append('../../src')
print('importing wp (and nltk)...')
import wp
print('done')

importing wp (and nltk)...
done


In [36]:
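# reload wp modules in case changed (for development purposes)
try:
    from importlib import reload  # Python 3.4+
except ImportError:
    pass  # Python 2: reload is a builtin
reload(wp)
reload(wp.data)
reload(wp.util)
reload(wp.model)
reload(wp.ngram)
reload(wp.rnn)
reload(wp.analyze)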

<module 'wp.analyze' from '../../src\wp\analyze.pyc'>

### Initialize

In [37]:
random.seed(0)

## 2. Prepare Data

Clean and merge the raw text files, then split the result into train, validate, and test sets.

In [38]:
# get wrapper around all data and tokenization
data = wp.data.Data()

Clean the raw data files: remove Gutenberg headers and footers, and strip non-ASCII characters (NLTK complains otherwise).

In [39]:
data.clean()

The raw files have been cleaned.
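
As a rough illustration of the cleaning step, the sketch below strips a Project Gutenberg header and footer and drops non-ASCII characters. The function name `clean_text` and the `*** START OF` / `*** END OF` marker patterns are assumptions made for this example; the actual `data.clean()` implementation is not shown in this notebook and may work differently.

```python
import re

# Hypothetical marker patterns; real Gutenberg files vary in their exact wording.
START_RE = re.compile(r'\*\*\* ?START OF.*?\*\*\*', re.IGNORECASE | re.DOTALL)
END_RE = re.compile(r'\*\*\* ?END OF.*?\*\*\*', re.IGNORECASE | re.DOTALL)

def clean_text(raw):
    """Return raw text without Gutenberg header/footer or non-ASCII characters."""
    start = START_RE.search(raw)
    if start:
        raw = raw[start.end():]   # drop everything up to and including the start marker
    end = END_RE.search(raw)
    if end:
        raw = raw[:end.start()]   # drop the end marker and everything after it
    # drop any character outside the ASCII range
    raw = raw.encode('ascii', 'ignore').decode('ascii')
    return raw.strip()

sample = (u"Header junk\n*** START OF THIS PROJECT GUTENBERG EBOOK ***\n"
          u"Les Mis\u00e9rables, Volume I.\n"
          u"*** END OF THIS PROJECT GUTENBERG EBOOK ***\nFooter junk")
print(clean_text(sample))
```

Dropping non-ASCII characters removes accented letters entirely rather than transliterating them, which is consistent with the `LES MISRABLES` spelling that shows up in the generated text in section 6.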


Merge the cleaned data files into one.

In [40]:
data.merge()

The cleaned files have already been merged.


Split the merged file by sentences into train, validate, and test sets.

In [41]:
data.split()

The merged file has already been split.
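
A sentence-level split can be sketched as follows. The 80/10/10 proportions, the shuffling, and the name `split_sentences` are assumptions for illustration only; the notebook does not show how `data.split()` chooses its proportions or whether it shuffles.

```python
import random

def split_sentences(sentences, ptrain=0.8, pvalidate=0.1, seed=0):
    """Shuffle sentences and split them into train, validate, and test lists.

    The proportions are illustrative assumptions; the remainder goes to test.
    """
    rng = random.Random(seed)       # local generator, leaves the global seed alone
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    ntrain = int(len(shuffled) * ptrain)
    nvalidate = int(len(shuffled) * pvalidate)
    train = shuffled[:ntrain]
    validate = shuffled[ntrain:ntrain + nvalidate]
    test = shuffled[ntrain + nvalidate:]
    return train, validate, test

sentences = ['Sentence number %d.' % i for i in range(20)]
train, validate, test = split_sentences(sentences)
print(len(train), len(validate), len(test))
```

Splitting on sentence boundaries rather than at arbitrary character offsets keeps every sentence whole within a single set.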


## 3. Explore Data

### Show some samples of the text

In [42]:
s_merged = data.text('merged')
nsamples = 4
nchars = len(s_merged)
nskip = int(nchars / nsamples)
for i in range(nsamples):
    s = s_merged[i*nskip:i*nskip+200]
    s = s.replace('\n', ' ').strip()
    print(s)
    print()

List of Illustrations   Bookshelf  Bookcover  Frontpapers  Frontispiece Volume One  Titlepage Volume One  Titlepage Verso  The Comfortor  The Fall  Awakened  Cossette Sweeping  Candlesticks Into the F

strangement into reconciliation. It was not an affliction, but it was an unpleasant duty.  Marius, in addition to his motives of political antipathy, was convinced that his father, _the slasher_, as M

e Ponceau drain of the old Rue Vieille-du-Temple, vaulted between 1600 and 1650; and the handiwork of the eighteenth in the western section of the collecting canal, walled and vaulted in 1740. These t

found, confident, and trustful. She carried her sorrowful head as though she were proud of that sorrow, as though she would say, I--I alone know how to mourn for him as he deserves. But while we were



### Show some text split into sentences

This shows how the text is broken into sentences, the unit used to split it into the train, validate, and test sets.

In [43]:
# we'll just look at the first 50k characters, because parsing sentences is slow
sentences = data.sentences('merged', 50000)
random.seed(2)
samples = random.sample(sentences, 5)
print('\n\n'.join(samples))

Toward nine o'clock in the evening the two women retired and betook themselves to their chambers on the first floor, leaving him alone until morning on the ground floor.

In another dissertation, he examines the theological works of Hugo, Bishop of Ptolemas, great-grand-uncle to the writer of this book, and establishes the fact, that to this bishop must be attributed the divers little works published during the last century, under the pseudonym of Barleycourt.

She was a soul rather than a virgin.

"The halls are nothing but rooms, and it is with difficulty that the air can be changed in them."

Pray, believe, enter into life: the Father is there."


### Show the text split into tokens

Note that punctuation marks are treated as separate tokens.

In [44]:
tokens = data.tokens('merged', 50000)
print('ntokens',len(tokens))
print(tokens[-50:])

ntokens 11180
['two', 'doors', ',', 'one', 'near', 'the', 'chimney', ',', 'opening', 'into', 'the', 'oratory', ';', 'the', 'other', 'near', 'the', 'bookcase', ',', 'opening', 'into', 'the', 'dining-room', '.', 'END', 'The', 'bookcase', 'was', 'a', 'large', 'cupboard', 'with', 'glass', 'doors', 'filled', 'with', 'books', ';', 'the', 'chimney', 'was', 'of', 'wood', 'painted', 'to', 'represent', 'marble', ',', 'and', 'END']
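
The token stream above can be approximated with a small regular-expression tokenizer. This is a sketch only: the notebook's `data.tokens` appears to rely on NLTK, whereas `simple_tokens` below is a hypothetical stand-in written for this example. Appending an `END` marker after each sentence mirrors what the output above shows.

```python
import re

# a word (keeping internal hyphens and apostrophes), or any single punctuation mark
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

def simple_tokens(sentences, end_token='END'):
    """Flatten sentences into one token list, appending end_token after each."""
    tokens = []
    for sentence in sentences:
        tokens.extend(TOKEN_RE.findall(sentence))
        tokens.append(end_token)
    return tokens

example = simple_tokens(['The other near the bookcase, opening into the dining-room.'])
print(example)
```

Hyphenated words such as `dining-room` stay together as one token, while the comma and the full stop become tokens of their own.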


## 4. Train Models

Train models on the training tokens.

In [45]:
# define models to train and test
model_specs = [
    [wp.ngram.NgramModel, {'n':1}],
    [wp.ngram.NgramModel, {'n':2}],
    [wp.ngram.NgramModel, {'n':3}],
    [wp.ngram.NgramModel, {'n':4}],
    [wp.rnn.RnnModel, {'nvocabmax':1000,'nhidden':10}],
]

In [46]:
# train models on different amounts of training data

nchars_list = (1000, 10000)  # larger sizes disabled: 100000, 1000000, 6000000
model_folder = '../../data/models'
model_table = wp.analyze.init_model_table(model_specs, model_folder, data, nchars_list)

ntraining_chars 1000
get complete stream of training tokens, nchars=1000
train model
get ngrams, n=1
add ngrams to model
save model
train model
get ngrams, n=2
add ngrams to model
save model
train model
get ngrams, n=3
add ngrams to model
save model
train model
get ngrams, n=4
add ngrams to model
save model
train model
2016-12-29 15:51:13: Loss after nexamples_seen=0 epoch=0: 5.440088
2016-12-29 15:51:13: Loss after nexamples_seen=16 epoch=1: 5.274515
2016-12-29 15:51:13: Loss after nexamples_seen=32 epoch=2: 5.134026
2016-12-29 15:51:13: Loss after nexamples_seen=48 epoch=3: 5.000382
2016-12-29 15:51:13: Loss after nexamples_seen=64 epoch=4: 4.880135
2016-12-29 15:51:13: Loss after nexamples_seen=80 epoch=5: 4.787715
2016-12-29 15:51:13: Loss after nexamples_seen=96 epoch=6: 4.713569
2016-12-29 15:51:13: Loss after nexamples_seen=112 epoch=7: 4.648478
2016-12-29 15:51:13: Loss after nexamples_seen=128 epoch=8: 4.587836
2016-12-29 15:51:13: Loss after nexamples_seen=144 epoch=9: 4.5295
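
For reference, the n-gram approach trained above can be sketched as a table of continuation counts keyed by the preceding n-1 tokens. `TinyNgram` below is a hypothetical, minimal stand-in written for this note. It is not the `wp.ngram.NgramModel` class, whose internals are not shown here.

```python
from collections import Counter, defaultdict

class TinyNgram(object):
    """Count-based n-gram predictor (illustrative only)."""

    def __init__(self, n):
        self.n = n
        self.counts = defaultdict(Counter)  # context tuple -> Counter of next tokens

    def train(self, tokens):
        """Count how often each token follows each (n-1)-token context."""
        for i in range(len(tokens) - self.n + 1):
            context = tuple(tokens[i:i + self.n - 1])
            self.counts[context][tokens[i + self.n - 1]] += 1

    def predict(self, context, k=3):
        """Return up to k most frequent tokens seen after the last n-1 context tokens."""
        start = max(0, len(context) - (self.n - 1))
        key = tuple(context[start:])
        return [token for token, _ in self.counts.get(key, Counter()).most_common(k)]

tokens = 'the cat sat on the mat END the cat ran END'.split()
model = TinyNgram(2)
model.train(tokens)
print(model.predict(['the']))
```

A context that never occurred in training yields no predictions in this sketch. With very little training text, longer contexts are rarely seen again at test time, which may help explain the low accuracies for n=3 and n=4 in section 5.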

## 5. Test Models

Test all models on held-out test data.

In [47]:
# test all models and save results to a pandas dataframe

ntest_chars = 10000
npredictions_max = 1000
k = 3 # predict top k tokens

df = wp.analyze.test_model_table(model_table, data, ntest_chars, npredictions_max, k)

get complete stream of test tokens, nchars=10000
n-gram-(nchars-1000-n-1): accuracy = nright/total = 4/1001 = 0.003996
n-gram-(nchars-1000-n-2): accuracy = nright/total = 2/1001 = 0.001998
n-gram-(nchars-1000-n-3): accuracy = nright/total = 0/1001 = 0.000000
n-gram-(nchars-1000-n-4): accuracy = nright/total = 0/1001 = 0.000000
rnn-(nchars-1000-nvocabmax-1000-nhidden-10): accuracy = nright/total = 2/1001 = 0.001998
get complete stream of test tokens, nchars=10000
n-gram-(nchars-10000-n-1): accuracy = nright/total = 229/1001 = 0.228771
n-gram-(nchars-10000-n-2): accuracy = nright/total = 23/1001 = 0.022977
n-gram-(nchars-10000-n-3): accuracy = nright/total = 4/1001 = 0.003996
n-gram-(nchars-10000-n-4): accuracy = nright/total = 2/1001 = 0.001998
rnn-(nchars-10000-nvocabmax-1000-nhidden-10): accuracy = nright/total = 62/1001 = 0.061938
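
The accuracy figures above are reported as `nright/total`. A plausible reading, inferred from the `k = 3 # predict top k tokens` comment, is that a prediction counts as right when the actual next token is among the model's top `k` guesses; `wp.analyze.test_model_table` itself is not shown here. A minimal sketch under that assumption, with `topk_accuracy` and the `predict` callable as hypothetical names:

```python
def topk_accuracy(predict, tokens, k=3, npredictions_max=1000):
    """Fraction of positions where the true next token is in the top-k predictions.

    predict(context, k) must return a list of up to k candidate tokens.
    """
    nright = 0
    total = 0
    for i in range(1, len(tokens)):
        if total >= npredictions_max:
            break
        guesses = predict(tokens[:i], k)   # predict token i from everything before it
        if tokens[i] in guesses:
            nright += 1
        total += 1
    return nright / float(total) if total else 0.0

def always_common(context, k):
    """Toy predictor that ignores context and always guesses the same tokens."""
    return ['the', ',', '.'][:k]

test_tokens = 'the cat sat on the mat .'.split()
print(topk_accuracy(always_common, test_tokens))
```

In the toy example, two of the six predicted positions (`the` and `.`) fall within the fixed guesses, so the accuracy is 2/6.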


In [48]:
df

| | 1000 | 10000 |
|---|---|---|
| n-gram-(nchars-1000-n-1) | 0.003996 | 0.228771 |
| n-gram-(nchars-1000-n-2) | 0.001998 | 0.022977 |
| n-gram-(nchars-1000-n-3) | 0.0 | 0.003996 |
| n-gram-(nchars-1000-n-4) | 0.0 | 0.001998 |
| rnn-(nchars-1000-nvocabmax-1000-nhidden-10) | 0.001998 | 0.061938 |


In [None]:
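for i in range(len(df.index)):
    # df.ix was removed in pandas 1.0; iloc selects the i-th row by position
    row = df.iloc[i]
    # label each line with its model name so the legend has entries
    plt.plot(df.columns, row, label=df.index[i])
plt.legend(loc=(1.1, 0.5))
plt.xscale('log')
plt.xlabel('Training set size (chars)')
plt.ylabel('Accuracy')
plt.show()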

## 6. Generate Text

In [50]:
nsentences = 5
models = model_table[-1] # use models with most training data
for model in models[1:]:
    print(model.name)
    print('-'*80)
    for seed in range(nsentences):
        random.seed(seed)
        tokens = model.generate()
        if tokens:
            s = ' '.join(tokens)
            print(s)
            print()
    print()

n-gram-(nchars-10000-n-1)
--------------------------------------------------------------------------------
of had youth , activity white become Holy salary . was him them had . , was thirty-six Drop was END

Frontpapers of had , do have seven diocese great Henri of and had all Father the , '' was very must Give an language le patients very , and Accident , , , which day to of Verso Into a age of sort END

Glandve '' ten ; of the in Seignor . . . his and There the beds Hot Myriel portraits , -- woman 200 END

, Myriel END

, Three one his Cosette adjoins Frontispiece bustling hall , His , intelligent from own he of be bustling a END


n-gram-(nchars-10000-n-2)
--------------------------------------------------------------------------------
She had to whom he , a little matter enclosing a Paving Stone Frontispiece Volume Five Titlepage Verso The Twilight Decline Darkness LES MISRABLES VOLUME I. -- adjoins the Bishop of the epoch of fifteen thousand francs . END

`` Sire , '' END

One mus