# Alice in Wonderland Hyperdictionary Prediction

This notebook sets up a pipeline for testing next-letter predictions. I compare several strategies on the same task: given the first 20 characters of a sentence from Alice, predict the letter that follows.



In [1]:

import random_idx
import utils
import pickle
import re
import string

from pylab import *

%matplotlib inline



2016-02-23 13:18


In [4]:
fdict = open("raw_texts/texts_english/alice_in_wonderland.txt")
text = fdict.read()

sentences = text.split('.')

In [5]:
len(sentences)

1207

Next is the function that runs the test. It takes a prediction function, gives it the start of a sentence, and asks it to predict the next letter. Right now prediction_func returns a histogram over letters, and test_prediction just takes the maximum. Guy was talking about measuring entropy reduction, which is probably a better metric, but this is just a first pass.

In [None]:
def test_prediction(prediction_func, lookback=20):
    
    # Draw random sentences (up to 100 tries) until one is long enough to
    # supply `lookback` characters of context plus the letter to predict.
    for i in range(100):
        sentence_str = sentences[np.random.randint(len(sentences))].lower()
        
        # strip punctuation (digits are left in place)
        for p in string.punctuation:
            sentence_str = sentence_str.replace(p, '')
        
        # normalise whitespace
        sentence_str = sentence_str.replace('\n',' ')
        sentence_str = sentence_str.replace('\r','')
        sentence_str = sentence_str.replace('\t','')
        sentence_str = sentence_str.strip()
        
        if len(sentence_str) > lookback:
            break
    
    # ok, so ask for the next letter, given the first `lookback` characters
    next_letter_dist = prediction_func(sentence_str[:lookback])
    
    # just take the argmax for now.
    pred_lidx = np.argmax(next_letter_dist)
    
    corr_letter = sentence_str[lookback]
    corr_lidx = random_idx.alphabet.find(corr_letter)
    
    # output to analyze performance: context, correct letter, predicted letter
    print('%s %s %s' % (sentence_str[:lookback], random_idx.alphabet[corr_lidx], random_idx.alphabet[pred_lidx]))
    
    return corr_lidx == pred_lidx
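
The entropy-based metric mentioned above could be sketched roughly as follows. This is only an illustration, not part of the pipeline: the helper names `normalize_hist`, `entropy_bits` and `surprisal_bits` are hypothetical, the 27-letter alphabet is assumed, and clipping negative scores to zero before normalising is my own assumption about how to turn a raw score vector into a distribution.

```python
import numpy as np

def normalize_hist(hist):
    """Turn a score vector into a probability distribution (assumed scheme)."""
    hist = np.clip(np.asarray(hist, dtype=float), 0.0, None)
    total = hist.sum()
    if total == 0:
        # No information at all: fall back to a uniform distribution.
        return np.ones(len(hist)) / len(hist)
    return hist / total

def entropy_bits(p):
    """Shannon entropy of a distribution, in bits."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def surprisal_bits(p, correct_idx, eps=1e-12):
    """Bits needed to encode the correct letter under the prediction."""
    return -np.log2(max(p[correct_idx], eps))

# A uniform guess over 27 symbols versus a guess concentrated on two letters.
uniform = normalize_hist(np.ones(27))
peaked = normalize_hist([0, 0, 0, 0, 3] + [0] * 21 + [1])

print(entropy_bits(uniform))       # entropy before seeing any context
print(entropy_bits(peaked))        # entropy of the sharper prediction
print(surprisal_bits(peaked, 4))   # cost if index 4 turns out to be correct
```

Averaging the surprisal over many trials would reward a predictor for putting weight on the right letter even when that letter is not its top guess, which the argmax comparison cannot do.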
    
    

## Always Guess 'e'

The first thing to try is to just guess 'e' every time. Let's see how that does.

In [46]:
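def always_predict_e(sentence):
    letter_hist = zeros(len(random_idx.alphabet))
    
    # put all the weight on 'e' (index 4 when the alphabet starts with a-z)
    letter_hist[random_idx.alphabet.find('e')] = 1
    
    return letter_hist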
    

In [48]:
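# number of trials (named `trials` so it does not clash with the vector dimension N used later)
trials = 100
iscorrect_prediction = zeros(trials)

for i in range(trials):
    iscorrect_prediction[i] = test_prediction(always_predict_e)
    
print(np.mean(iscorrect_prediction))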

at this moment the k i e
the cook threw a fry i e
hand it over here sa i e
dont grunt said alic e e
when she got back to   e
its enough to drive  o e
they all made a rush   e
if the second copy i s e
they cant have anyth i e
then they all crowde d e
if you are redistrib u e
if you received the  w e
tell her about the r e e
as a duck with its e y e
all the time they we r e
there was no one two   e
now at ours they had   e
well thought alice t o e
it quite makes my fo r e
as if it wasnt troub l e
write that down the  k e
there are a few thin g e
the jury all brighte n e
id rather finish my  t e
so she set to work a n e
how can i have done  t e
the trial cannot pro c e
youre a very poor sp e e
they were just begin n e
the three soldiers w a e
if everybody minded  t e
it exists because of   e
why there they are s a e
dinahll miss me very   e
if youre going to tu r e
in another minute th e e
alice said nothing s h e
they very soon came  u e
the reason is said t h e
unless you have remo v e


So, we can get about 10-15% of the next letter guesses correct by just guessing 'e' every time.
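
As a quick sanity check on why 'e' is the natural constant guess, here is a small sketch that counts letter frequencies. The function name `most_common_letters` and the sample sentence are made up for illustration; running it on the full text would be the real check. Spaces are excluded here, and if they were counted the space itself may well come out on top.

```python
from collections import Counter
import string

def most_common_letters(text, n=3):
    """Return the n most frequent lowercase letters (spaces excluded)."""
    counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    return [letter for letter, _ in counts.most_common(n)]

# Made-up sample sentence, not taken from the book.
sample = "the queen never sees the scene here"
print(most_common_letters(sample, 1))
```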

## External Dictionary

A sensible way to predict the next letter is to use an external dictionary and base the guess on the last word. This external dictionary is built from the '2of12id.txt' word list, but contains only a subset of the full list. It also includes every substring of each word, and each full word is stored with a space at the end so that the predictor can guess a space. The dictionary knows nothing about how often words appear, so it guesses based on what is possible, not on what is most likely.
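
To make the idea concrete, here is a toy sketch of how such a hyperdictionary could be built and queried. It is not the code that produced the `.npz` files: the random ±1 letter vectors, the a-z plus space alphabet, the dimension of 10000, the three-word vocabulary, and the reading of "substring" as "prefix" are all assumptions, and the function names are hypothetical. The shift-and-multiply binding mirrors the prediction code used in this notebook.

```python
import string
import numpy as np

alphabet = string.ascii_lowercase + ' '   # assumed alphabet: a-z plus space
N = 10000                                 # toy dimensionality
rng = np.random.RandomState(0)
# Assumed: one random +1/-1 vector per letter.
letter_vectors = rng.choice([-1.0, 1.0], size=(len(alphabet), N))

def encode_prefix(prefix):
    """Bind letters in order: shift the running vector, multiply in the next letter."""
    vec = np.ones(N)
    for letter in prefix:
        vec = np.roll(vec, 1) * letter_vectors[alphabet.find(letter)]
    return vec

def build_hyperdictionary(words):
    """Sum the encodings of every prefix of every word (word + trailing space)."""
    hyperdict = np.zeros(N)
    for word in words:
        padded = word + ' '
        for k in range(1, len(padded) + 1):
            hyperdict += encode_prefix(padded[:k])
    return hyperdict

def predict_next(prefix, hyperdict):
    """Unbind the prefix from the hyperdictionary and pick the best-matching letter."""
    query = np.roll(encode_prefix(prefix), 1)
    scores = np.dot(letter_vectors / N, query * hyperdict)
    return alphabet[int(np.argmax(scores))]

toy_dict = build_hyperdictionary(['cat', 'car', 'dog'])
print(repr(predict_next('do', toy_dict)))    # letter that continues 'do'
print(repr(predict_next('dog', toy_dict)))   # a complete word should be followed by a space
```

Because the vectors are ±1, multiplying the hyperdictionary by the encoded prefix cancels that prefix out of any stored entry that starts with it, leaving the vector of the following letter plus noise from the other entries.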

In [49]:
h = np.load('data/hyperdictionary_external-s20-d1M-160223.npz')
letter_vectors_substr = h['letter_vectors']
hyperdictionary_substr = h['hyperdictionary']

In [50]:
N = hyperdictionary_substr.shape[0]

def predict_from_last_word(sentence):
    # take the last (possibly partial) word of the sentence
    words = sentence.split()
    last_word = words[-1]
    
    # bind the letters in order: shift the running vector by one position,
    # then multiply in the next letter's vector
    subvec = np.ones(N)
    for letter in last_word:
        letter_idx = random_idx.alphabet.find(letter)
        subvec = np.roll(subvec, 1) * letter_vectors_substr[letter_idx,:]
        
    # shift once more to line up with the position of the letter to predict
    subvec = np.roll(subvec, 1)
    
    # unbind the prefix from the hyperdictionary and score every letter
    val = np.dot(letter_vectors_substr/N, subvec*hyperdictionary_substr)
    return val

        
trials = 100
iscorrect_prediction = zeros(trials)

for i in range(trials):
    iscorrect_prediction[i] = test_prediction(predict_from_last_word)
       
print(mean(iscorrect_prediction))

either the well was  v t
id rather finish my  t s
please check the pro j f
do not copy display  p f
so they sat down and   a
that you wont though t a
next came the guests   k
it tells the day of  t t
there was a general  c i
she cant explain it  s a
and just as id taken   y
the foundations prin c c
ive tried the roots  o b
silence all round if   r
in a minute or two t h a
we wont talk about h e u
creating the works f r i
begin at the beginni n n
not quite right im a f n
with extras asked th e r
i can tell you more  t  
i cant go no lower s a c
there could be no do u w
when we were little  t  
it was no doubt only   r
alice felt that this   g
here the other guine a k
ah then yours wasnt  a m
the miserable hatter   o
come lets try the fi r l
you must remember re m m
just think of what w o e
hearthrug          n e i
its really dreadful  s n
off with her head th e r
will you wont you wi l c
oh dear what nonsens e u
first because im on  t t
the reason is said t h a
why what are your sh o u


This dictionary performs about as well as just guessing 'e'. However, you can see that it does a decent job when the context ends partway through a long word.
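
With only 100 trials per run, two accuracies have to be fairly far apart before the difference means much. A rough sketch of that check, using the binomial standard error, is below. The function names are hypothetical and the accuracy values in the example are illustrative placeholders, not results from these runs.

```python
import math

def accuracy_standard_error(accuracy, trials):
    """Binomial standard error of an accuracy estimated from `trials` samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / trials)

def difference_is_clear(acc_a, acc_b, trials, z=2.0):
    """True if two accuracies differ by more than z combined standard errors."""
    se_diff = math.sqrt(accuracy_standard_error(acc_a, trials) ** 2 +
                        accuracy_standard_error(acc_b, trials) ** 2)
    return abs(acc_a - acc_b) > z * se_diff

# Illustrative values only.
print(accuracy_standard_error(0.12, 100))
print(difference_is_clear(0.12, 0.15, 100))
print(difference_is_clear(0.12, 0.40, 100))
```

At this sample size the standard error is a few percentage points, so a gap of two or three points between predictors is within the noise.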

## Using internal hyperdictionary

Next, I built a substring hyperdictionary like the one above, but this time from the words that actually appear in Alice. This is close to an ideal dictionary to have on hand, so its performance is a useful reference point for other algorithms: it has a good chance of working well. If another algorithm can beat it, that algorithm has learned a lot about English and about predicting Alice, and probably carries an important insight about learning.

In [57]:
h = np.load('data/hyperdictionary_alice-d1M-160223.npz')
letter_vectors_alice = h['letter_vectors']
hyperdictionary_alice = h['hyperdictionary']

In [58]:
N = hyperdictionary_alice.shape[0]

def predict_from_last_word(sentence):
    # take the last (possibly partial) word of the sentence
    words = sentence.split()
    last_word = words[-1]
    
    # bind the letters in order: shift the running vector by one position,
    # then multiply in the next letter's vector
    subvec = np.ones(N)
    for letter in last_word:
        letter_idx = random_idx.alphabet.find(letter)
        subvec = np.roll(subvec, 1) * letter_vectors_alice[letter_idx,:]
        
    # shift once more to line up with the position of the letter to predict
    subvec = np.roll(subvec, 1)
    
    # unbind the prefix from the hyperdictionary and score every letter
    val = np.dot(letter_vectors_alice/N, subvec*hyperdictionary_alice)
    return val

        
trials = 100
iscorrect_prediction = zeros(trials)

for i in range(trials):
    iscorrect_prediction[i] = test_prediction(predict_from_last_word)
       
print(mean(iscorrect_prediction))

come            ill  t  
alice did not much l i a
then the dormouse sh a i
very true said the d u r
on which seven looke d d
you dont know much s a l
would you tell me pl e e
a knot said alice al w l
are youare you fondo f f
alice noticed with s o l
indeed she had quite    
a cat may look at a  k m
anything you like sa i i
on this the white ra b p
im sure im not ada s h l
he unfolded the pape r r
there was no label t h  
the hedgehog was eng a l
as a duck with its e y d
for instance suppose    
so they got their ta i k
collar that dormouse    
org  for additional  c  
he was an old crab h e j
thats nothing to wha t t
we quarrelled last m a o
please maam is this  n  
alice began to feel  v  
i hope theyll rememb e e
it looked goodnature d f
i could tell you my  a s
it was no doubt only    
ten hours the first  d  
the jury all looked  p  
they couldnt have do n i
he took me for his h o j
there are a lot of t h  
its no use speaking  t  
begin at the beginni n n
a knot said alice al w l


In [59]:
trials = 100
iscorrect_prediction = zeros(trials)

for i in range(trials):
    iscorrect_prediction[i] = test_prediction(predict_from_last_word)
       
print(mean(iscorrect_prediction))

if you didnt sign it   s
i should like to hea r p
1 with active links  o  
you insult me by tal k l
oh i beg your pardon   e
there was nothing el s b
he moved on as he sp o i
so she began again o u l
i dont know of any t h  
treacle said a sleep y  
go on with the next  v  
only mustard isnt a  b m
alice replied rather    
but her sister sat s t l
ive seen hatters bef o o
the gryphon replied  v  
dont grunt said alic e e
the jury all brighte n n
if the second copy i s m
so alice got up and  r  
the poor little thin g g
turn a somersault in   j
what trial is it ali c c
for with all her kno w t
i hope theyll rememb e e
you are old said the   m
except for the limit e e
there was a sound of    
a cheap sort of pres e s
royalty payments sho u w
they cant have anyth i i
no i didnt said alic e e
get up said the quee n r
why  it does the boo t k
you may copy it give   n
it was high time to  g r
contact the foundati o o
this piece of rudene s s
i dont know the mean i w
it must be a very pr e a


Ok, I think this can be done even better, because the predictor is dumb when the last character of the context is a space: it ignores the space and tries to continue the previous word.
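
One simple fallback after a space is to guess whichever letter most often starts a word. A small sketch of how that letter could be found is below. The function name `most_common_initial` and the sample sentence are made up for illustration, and the real check would run over the full text.

```python
from collections import Counter

def most_common_initial(text):
    """Return the letter that most often begins a word in `text`."""
    initials = Counter(word[0] for word in text.lower().split()
                       if word[0].isalpha())
    return initials.most_common(1)[0][0]

# Made-up sample sentence, not taken from the book.
sample = "the cat thought that the tea table was too tall"
print(most_common_initial(sample))
```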

In [61]:
N = hyperdictionary_alice.shape[0]

def predict_from_last_word(sentence):
    
    if sentence[-1] == ' ':
        letter_hist = zeros(len(random_idx.alphabet))
        # a new word is starting, so there is no prefix to look up: guess 't'
        t_idx = random_idx.alphabet.find('t')
        letter_hist[t_idx] = 1
        return letter_hist
    else:
        # take the last (partial) word of the sentence
        words = sentence.split()
        last_word = words[-1]

        # bind the letters in order: shift the running vector by one position,
        # then multiply in the next letter's vector
        subvec = np.ones(N)
        for letter in last_word:
            letter_idx = random_idx.alphabet.find(letter)
            subvec = np.roll(subvec, 1) * letter_vectors_alice[letter_idx,:]

        # shift once more to line up with the position of the letter to predict
        subvec = np.roll(subvec, 1)

        # unbind the prefix from the hyperdictionary and score every letter
        val = np.dot(letter_vectors_alice/N, subvec*hyperdictionary_alice)
        return val

        
trials = 100
iscorrect_prediction = zeros(trials)

for i in range(trials):
    iscorrect_prediction[i] = test_prediction(predict_from_last_word)
       
print(mean(iscorrect_prediction))

you dont know much s a l
if i dont take this  c t
limited right of rep l o
it quite makes my fo r r
do not charge a fee  f t
same as if he had a  b t
what is a caucusrace   l
down the rabbithole    t
and the gryphon neve r r
its a friend of mine a  
it was much pleasant e  
i dont know the mean i w
call it what you lik e e
next came an angry v o u
if you do not agree  t t
i havent opened it y e a
have you guessed the   m
theres plenty of roo m t
or would you like th e e
im glad they dont gi v r
give your evidence t h  
itsits a very fine d a r
here was another puz z z
now i give you fair  w t
it quite makes my fo r r
how fond she is of f i e
bythebye what became    
the end      end of  p t
compliance requireme n n
i wonder what i shou l t
do you mean that you   t
how surprised hell b e u
the king laid his ha n i
alice was just begin n  
why said the dodo th e e
she was walking by t h  
to learn more about  t t
turn a somersault in   j
you dont know much s a l
this time alice wait e i


In [None]:
trials = 100
iscorrect_prediction = zeros(trials)

for i in range(trials):
    iscorrect_prediction[i] = test_prediction(predict_from_last_word)
       
print(mean(iscorrect_prediction))

alice was very glad  t t
come my heads free a t m
write that down the  k t
whos making personal    
do you know why its  c t
they had a large can v t
i didnt know it was  y t
the fee is      owed    
by reading or using  a t
so she swallowed one    
my notion was that y o a
to learn more about  t t
she cant explain it  s t
then they both bowed    
who are you talking  t t
come on then roared  t t
the invalidity or un e t
﻿project gutenberg s  
limited right of rep l o
why what are your sh o i
ill be         judge    
creating the works f r e
this time alice wait e i
so she began again o u l
give your evidence t h  
the idea of having t h  
i wonder if i shall  f t
compliance requireme n n
thats different from    
the poor little thin g g
ahem said the mouse  w t
there was a general  c t
her listeners were p e l
in that case said th e e
i wish i hadnt menti o o
there are a lot of t h  
then you should say  w t
you agree that you h a j
come on cried the gr y y
i wish i hadnt cried    
in