# Alice in Wonderland Hyperdictionary Prediction

I am building a pipeline for testing next-letter predictions. The goal is to compare different strategies: each one is given the first 20 characters of a sentence from Alice in Wonderland and has to predict the character that comes next.



In [67]:

import random_idx
import utils
import pickle
import re
import string

import numpy as np  # used explicitly below as np.*
from pylab import *

%matplotlib inline


In [4]:
# read the whole book; the with-block closes the file afterwards
with open("raw_texts/texts_english/alice_in_wonderland.txt") as fdict:
    text = fdict.read()

sentences = text.split('.')

In [5]:
len(sentences)

1207

Next is the function that runs the test. It takes a prediction function, hands it the first `lookback` characters of a randomly chosen sentence, and asks it to predict the next letter. Right now prediction_func returns a histogram over letters, and test_prediction just takes the maximum. Guy was talking about measuring entropy reduction, which is probably a better metric, but this is just a first pass.

In [None]:
def test_prediction(prediction_func, lookback=20):
    
    # Draw random sentences until one is longer than `lookback` characters
    # after cleaning, so that there is a next letter to predict.
    for i in range(100):
        sentence_str = sentences[np.random.randint(len(sentences))].lower()

        # strip punctuation
        for p in string.punctuation:
            sentence_str = sentence_str.replace(p, '')

        # normalize whitespace
        sentence_str = sentence_str.replace('\n',' ')
        sentence_str = sentence_str.replace('\r','')
        sentence_str = sentence_str.replace('\t','')
        sentence_str = sentence_str.strip()

        if len(sentence_str) > lookback:
            break
            

    
    # ok, so ask for the next letter
    next_letter_dist = prediction_func(sentence_str[:lookback])
    
    # just take the argmax for now.
    pred_lidx = np.argmax(next_letter_dist)
    
    corr_letter = sentence_str[lookback]
    corr_lidx = random_idx.alphabet.find(corr_letter)
    
    # output to analyze performance
    print sentence_str[:lookback], random_idx.alphabet[corr_lidx], random_idx.alphabet[pred_lidx]
    
    return corr_lidx == pred_lidx
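
The entropy-reduction idea mentioned above is not implemented in this notebook. As a rough sketch of one possible reading, the histogram returned by a prediction function could be normalized into a distribution and compared with a uniform guess. The function names, the clipping of negative scores, and the 27-symbol alphabet are all assumptions made for illustration, and the snippet is written for Python 3.

```python
import math

def to_distribution(scores):
    """Turn raw scores into probabilities (assumption: negatives count as zero)."""
    clipped = [max(float(s), 0.0) for s in scores]
    total = sum(clipped)
    if total == 0:
        # no evidence at all: fall back to a uniform guess
        return [1.0 / len(clipped)] * len(clipped)
    return [c / total for c in clipped]

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def entropy_reduction(scores):
    """Bits of uncertainty removed relative to a uniform guess over the alphabet."""
    probs = to_distribution(scores)
    return math.log2(len(probs)) - entropy_bits(probs)

# assumed alphabet size: 'a'..'z' plus space
one_hot = [0.0] * 27
one_hot[4] = 1.0
print(entropy_reduction(one_hot))      # fully confident guess: log2(27) bits
print(entropy_reduction([1.0] * 27))   # flat histogram: 0 bits, up to rounding
```

This only measures how peaked a prediction is. A fuller metric would also score the probability assigned to the correct letter.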
    
    

## Always Guess 'e'

The first thing to try is to just guess 'e' every time. Let's see how that does.

In [46]:
def always_predict_e(sentence):
    letter_hist = zeros(len(random_idx.alphabet))
    
    # put all the weight on 'e' (index 4 in the alphabet)
    letter_hist[random_idx.alphabet.find('e')] = 1
    
    return letter_hist
    

In [48]:
N = 100
iscorrect_prediction = zeros(N)

for i in range(N):
    iscorrect_prediction[i] = test_prediction(always_predict_e)
    
print np.mean(iscorrect_prediction)

at this moment the k i e
the cook threw a fry i e
hand it over here sa i e
dont grunt said alic e e
when she got back to   e
its enough to drive  o e
they all made a rush   e
if the second copy i s e
they cant have anyth i e
then they all crowde d e
if you are redistrib u e
if you received the  w e
tell her about the r e e
as a duck with its e y e
all the time they we r e
there was no one two   e
now at ours they had   e
well thought alice t o e
it quite makes my fo r e
as if it wasnt troub l e
write that down the  k e
there are a few thin g e
the jury all brighte n e
id rather finish my  t e
so she set to work a n e
how can i have done  t e
the trial cannot pro c e
youre a very poor sp e e
they were just begin n e
the three soldiers w a e
if everybody minded  t e
it exists because of   e
why there they are s a e
dinahll miss me very   e
if youre going to tu r e
in another minute th e e
alice said nothing s h e
they very soon came  u e
the reason is said t h e
unless you have remo v e


So we get roughly 10-15% of the next-letter guesses correct just by guessing 'e' every time.
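
The accuracy of a constant guess should be close to how often that character occurs, so a quick frequency count is a useful sanity check. The sketch below is Python 3; the function name and the toy string are made up for illustration and are not part of the notebook.

```python
from collections import Counter

def letter_frequencies(text, alphabet="abcdefghijklmnopqrstuvwxyz "):
    """Relative frequency of each alphabet symbol in `text`; other characters are ignored."""
    counts = Counter(ch for ch in text.lower() if ch in alphabet)
    total = sum(counts.values())
    return {ch: counts[ch] / total for ch in alphabet}

freqs = letter_frequencies("the end")   # toy string, not the real corpus
print(freqs["e"])                       # 2 of the 7 characters are 'e'
print(max(freqs, key=freqs.get))        # the best constant guess for this toy string
```

Running the same count on the cleaned sentences would show whether 'e' or some other character, such as space, is the strongest constant baseline.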

## External Dictionary

A sensible way to predict the next letter is to use an external dictionary and base the guess on the last word. This external dictionary is built from the '2of12id.txt' word list, but contains only a subset of the full list. It also includes every substring of each word, and each full word ends with a space, so the dictionary can predict a space. It knows nothing about how often words appear: it guesses based on what is possible, not on what is most likely.
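
The script that built the saved hyperdictionary is not shown here, so the following is only a small sketch of how such a dictionary might be constructed and queried. It assumes random bipolar letter vectors, assumes that "substring" means leading substrings (prefixes), and uses a toy dimension and word list. The helper names are hypothetical; only the roll-and-multiply binding mirrors the query code in this notebook.

```python
import numpy as np

alphabet = "abcdefghijklmnopqrstuvwxyz "   # assumed layout: 26 letters plus space
N = 10000                                   # toy dimension for illustration
rng = np.random.RandomState(0)
letter_vectors = rng.choice([-1, 1], size=(len(alphabet), N))  # random bipolar codes

def encode(prefix):
    """Bind the letters of `prefix`: roll the running product, then multiply."""
    vec = np.ones(N)
    for letter in prefix:
        vec = np.roll(vec, 1) * letter_vectors[alphabet.find(letter)]
    return vec

def build_hyperdictionary(words):
    """Sum the encodings of every distinct prefix; each full word ends in a space."""
    prefixes = {(w + " ")[:k] for w in words for k in range(1, len(w) + 2)}
    return sum(encode(p) for p in prefixes)

def next_letter_scores(prefix, hyperdictionary):
    """Unbind the prefix; the remainder correlates with letters that can follow it."""
    query = np.roll(encode(prefix), 1)
    return np.dot(letter_vectors / float(N), query * hyperdictionary)

hyperdictionary = build_hyperdictionary(["the", "then", "this", "cat"])
scores = next_letter_scores("th", hyperdictionary)
possible = [alphabet[i] for i in np.argsort(scores)[::-1][:2]]
print(sorted(possible))   # the two strongest candidates after 'th'
```

Because each prefix is stored once, 'e' and 'i' get about the same score after 'th', which is the "possible, not likely" behaviour described above.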

In [49]:
h = np.load('data/hyperdictionary_external-s20-d1M-160223.npz')
letter_vectors_substr = h['letter_vectors']
hyperdictionary_substr = h['hyperdictionary']

In [50]:
N = hyperdictionary_substr.shape[0]

def predict_from_last_word(sentence):
    # find the last space in the sentence
    words = sentence.split()
    
    last_word = words[-1]    
    
    subword = ''
    subvec = np.ones(N)
    for i,letter in enumerate(last_word):
        letter_idx = random_idx.alphabet.find(letter)
        subvec = np.roll(subvec, 1) * letter_vectors_substr[letter_idx,:]
        subword += letter
        
    subvec = np.roll(subvec, 1)
    
    val = np.dot(letter_vectors_substr/N, subvec*hyperdictionary_substr)
    return val

        
trials = 100
iscorrect_prediction = zeros(trials)

for i in range(trials):
    iscorrect_prediction[i] = test_prediction(predict_from_last_word)
       
print mean(iscorrect_prediction)

either the well was  v t
id rather finish my  t s
please check the pro j f
do not copy display  p f
so they sat down and   a
that you wont though t a
next came the guests   k
it tells the day of  t t
there was a general  c i
she cant explain it  s a
and just as id taken   y
the foundations prin c c
ive tried the roots  o b
silence all round if   r
in a minute or two t h a
we wont talk about h e u
creating the works f r i
begin at the beginni n n
not quite right im a f n
with extras asked th e r
i can tell you more  t  
i cant go no lower s a c
there could be no do u w
when we were little  t  
it was no doubt only   r
alice felt that this   g
here the other guine a k
ah then yours wasnt  a m
the miserable hatter   o
come lets try the fi r l
you must remember re m m
just think of what w o e
hearthrug          n e i
its really dreadful  s n
off with her head th e r
will you wont you wi l c
oh dear what nonsens e u
first because im on  t t
the reason is said t h a
why what are your sh o u


This dictionary performs about as well as just guessing 'e'. However, you can see it does a decent job when the last word is long.

## Using internal hyperdictionary

Next, I built the same kind of substring hyperdictionary, but this time from the words that actually appear in Alice. This is close to an ideal dictionary to have handy, so its performance is a useful reference point for other algorithms. If another algorithm can beat it, that algorithm has learned a lot about English and about predicting Alice, and probably reflects an important insight about learning.
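
As a rough illustration of where such a word list could come from, the sketch below pulls the distinct words out of a text using the same cleaning as the test function (lowercase, punctuation removed). It is Python 3; the function name and sample string are invented for the example, and the actual build step is not shown in this notebook.

```python
import string

def vocabulary(text):
    """Distinct lowercase words in `text`, with punctuation and digits removed."""
    drop = string.punctuation + string.digits
    cleaned = text.lower().translate(str.maketrans("", "", drop))
    return set(cleaned.split())

# toy string, not the real corpus
words = vocabulary("Alice said nothing. Alice's cat said nothing, too!")
print(sorted(words))
```

Note that removing apostrophes turns "Alice's" into "alices", which matches the cleaned sentences printed in the outputs (for example "dont" and "youre").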

In [57]:
h = np.load('data/hyperdictionary_alice-d1M-160223.npz')
letter_vectors_alice = h['letter_vectors']
hyperdictionary_alice = h['hyperdictionary']

In [58]:
N = hyperdictionary_alice.shape[0]

def predict_from_last_word(sentence):
    
    if sentence[-1] == ' ':
        letter_hist = zeros(len(random_idx.alphabet))
        # guess 't' if it is a space at the end
        t_idx = random_idx.alphabet.find('t')
        letter_hist[t_idx] = 1
        return letter_hist
    else:
        # find the last space in the sentence
        words = sentence.split()

        last_word = words[-1]    

        subword = ''
        subvec = np.ones(N)
        for i,letter in enumerate(last_word):
            letter_idx = random_idx.alphabet.find(letter)
            subvec = np.roll(subvec, 1) * letter_vectors_alice[letter_idx,:]
            subword += letter

        subvec = np.roll(subvec, 1)

        val = np.dot(letter_vectors_alice/N, subvec*hyperdictionary_alice)
        return val

        
trials = 100
iscorrect_prediction = zeros(trials)

for i in range(trials):
    iscorrect_prediction[i] = test_prediction(predict_from_last_word)
       
print mean(iscorrect_prediction)

come            ill  t  
alice did not much l i a
then the dormouse sh a i
very true said the d u r
on which seven looke d d
you dont know much s a l
would you tell me pl e e
a knot said alice al w l
are youare you fondo f f
alice noticed with s o l
indeed she had quite    
a cat may look at a  k m
anything you like sa i i
on this the white ra b p
im sure im not ada s h l
he unfolded the pape r r
there was no label t h  
the hedgehog was eng a l
as a duck with its e y d
for instance suppose    
so they got their ta i k
collar that dormouse    
org  for additional  c  
he was an old crab h e j
thats nothing to wha t t
we quarrelled last m a o
please maam is this  n  
alice began to feel  v  
i hope theyll rememb e e
it looked goodnature d f
i could tell you my  a s
it was no doubt only    
ten hours the first  d  
the jury all looked  p  
they couldnt have do n i
he took me for his h o j
there are a lot of t h  
its no use speaking  t  
begin at the beginni n n
a knot said alice al w l


In [59]:
trials = 100
iscorrect_prediction = zeros(trials)

for i in range(trials):
    iscorrect_prediction[i] = test_prediction(predict_from_last_word)
       
print mean(iscorrect_prediction)

if you didnt sign it   s
i should like to hea r p
1 with active links  o  
you insult me by tal k l
oh i beg your pardon   e
there was nothing el s b
he moved on as he sp o i
so she began again o u l
i dont know of any t h  
treacle said a sleep y  
go on with the next  v  
only mustard isnt a  b m
alice replied rather    
but her sister sat s t l
ive seen hatters bef o o
the gryphon replied  v  
dont grunt said alic e e
the jury all brighte n n
if the second copy i s m
so alice got up and  r  
the poor little thin g g
turn a somersault in   j
what trial is it ali c c
for with all her kno w t
i hope theyll rememb e e
you are old said the   m
except for the limit e e
there was a sound of    
a cheap sort of pres e s
royalty payments sho u w
they cant have anyth i i
no i didnt said alic e e
get up said the quee n r
why  it does the boo t k
you may copy it give   n
it was high time to  g r
contact the foundati o o
this piece of rudene s s
i dont know the mean i w
it must be a very pr e a


In [61]:
N = hyperdictionary_alice.shape[0]

def predict_from_last_word(sentence):
    
    if sentence[-1] == ' ':
        letter_hist = zeros(len(random_idx.alphabet))
        # guess 't' if it is a space at the end
        t_idx = random_idx.alphabet.find('t')
        letter_hist[t_idx] = 1
        return letter_hist
    else:
        # find the last space in the sentence
        words = sentence.split()

        last_word = words[-1]    

        subword = ''
        subvec = np.ones(N)
        for i,letter in enumerate(last_word):
            letter_idx = random_idx.alphabet.find(letter)
            subvec = np.roll(subvec, 1) * letter_vectors_alice[letter_idx,:]
            subword += letter

        subvec = np.roll(subvec, 1)

        val = np.dot(letter_vectors_alice/N, subvec*hyperdictionary_alice)
        return val

        
trials = 100
iscorrect_prediction = zeros(trials)

for i in range(trials):
    iscorrect_prediction[i] = test_prediction(predict_from_last_word)
       
print mean(iscorrect_prediction)

you dont know much s a l
if i dont take this  c t
limited right of rep l o
it quite makes my fo r r
do not charge a fee  f t
same as if he had a  b t
what is a caucusrace   l
down the rabbithole    t
and the gryphon neve r r
its a friend of mine a  
it was much pleasant e  
i dont know the mean i w
call it what you lik e e
next came an angry v o u
if you do not agree  t t
i havent opened it y e a
have you guessed the   m
theres plenty of roo m t
or would you like th e e
im glad they dont gi v r
give your evidence t h  
itsits a very fine d a r
here was another puz z z
now i give you fair  w t
it quite makes my fo r r
how fond she is of f i e
bythebye what became    
the end      end of  p t
compliance requireme n n
i wonder what i shou l t
do you mean that you   t
how surprised hell b e u
the king laid his ha n i
alice was just begin n  
why said the dodo th e e
she was walking by t h  
to learn more about  t t
turn a somersault in   j
you dont know much s a l
this time alice wait e i


In [66]:
trials = 100
iscorrect_prediction = zeros(trials)

for i in range(trials):
    iscorrect_prediction[i] = test_prediction(predict_from_last_word)
       
print mean(iscorrect_prediction)

you ought to be asha m m
it proves nothing of    
what is a caucusrace   l
they lived on treacl e e
how can i have done  t t
a cheap sort of pres e s
so she began o mouse    
no please go on alic e e
is that the reason s o l
alice knew it was th e e
once upon a time the r m
well i never heard i t m
and he added in an u n p
alice did not feel e n d
my notion was that y o a
it was the best butt e o
im a poor man your m a o
she had already hear d t
you must have meant  s t
but perhaps he cant  h t
when the pie was all    
its business office  i t
so she set to work a n m
you may charge a rea s r
it is a long tail ce r r
get up said the quee n r
i do alice hastily r e o
thats none of your b u u
alice looked at the  j t
that i cant remember   e
come back the caterp i i
and she squeezed her s  
thats the most impor t t
the table was a larg e e
wow wow wow  here yo u u
orgdonate   section    t
then the words dont  f t
no theyre not said t h  
dont grunt said alic e e
why what are your sh o i


## Using N-Grams for prediction

Next, I made dictionaries that go through the Alice text and record all n-grams, treating space as a character.
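
For comparison, the same n-gram statistics can be collected with plain counts instead of hypervectors. The sketch below is Python 3, uses a toy string rather than the real text, and its helper names are hypothetical. It is meant as a reference for what the n-gram hyperdictionaries approximate, not as the method used to build them.

```python
from collections import Counter, defaultdict

def ngram_next_counts(text, n):
    """Map each (n-1)-character context to a Counter of the characters that follow it."""
    table = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        context, nxt = text[i:i + n - 1], text[i + n - 1]
        table[context][nxt] += 1
    return table

def predict_next(table, context):
    """Most frequent follower of `context`, or None if the context was never seen."""
    followers = table.get(context)
    return followers.most_common(1)[0][0] if followers else None

toy_text = "the cat sat on the mat"   # toy string, not the real corpus
bigrams = ngram_next_counts(toy_text, 2)
trigrams = ngram_next_counts(toy_text, 3)
print(predict_next(bigrams, "h"))     # 'e'
print(predict_next(trigrams, "th"))   # 'e'
```

An exact table like this makes it easy to check whether a surprising hyperdictionary prediction comes from the text itself or from noise in the vector encoding.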

In [107]:
h = np.load('data/alice-2gram-space-d10K-160223.npz')
letter_vectors_2g = h['letter_vectors']
hyperdictionary_2g = np.squeeze(h['hyperdictionary'].T)

In [108]:
N = hyperdictionary_2g.shape[0]

In [109]:
def predict_from_2grams(sentence):
    letter = sentence[-1]
    subvec = np.ones(N)
    
    letter_idx = random_idx.alphabet.find(letter)
    subvec = np.roll(subvec, 1) * letter_vectors_2g[letter_idx,:]
    subvec = np.roll(subvec, 1)

    val = np.dot(letter_vectors_2g/N, subvec*hyperdictionary_2g)
    
    return val

trials = 100
iscorrect_prediction = zeros(trials)

for i in range(trials):
    iscorrect_prediction[i] = test_prediction(predict_from_2grams)
       
print mean(iscorrect_prediction)


pray dont trouble yo u u
it turned into a pig    
do not copy display  p t
the unfortunate litt l h
alice kept her eyes  a t
you couldnt have wan t  
it quite makes my fo r u
there are a few thin g  
the chief difficulty    
whos making personal   i
alice waited till th e e
i told you butter wo u u
and ever since that  t t
in the very middle o f u
however it was over  a t
3 a full refund of a n n
you may copy it give    
i dont know where di n n
there are a few thin g  
theres more evidence    
however she got up a n n
she is such a dear q u  
3 the project gutenb e e
so they began solemn l  
if you paid a fee fo r u
and she went on plan n  
dont let him know sh e e
we wont talk about h e e
first it marked out  a t
so long as i get som e e
it was this last rem a e
if i dont take this  c t
i thought you did sa i n
the poor little thin g  
presently the rabbit   h
she had just succeed e  
but its volunteers a n n
here was another puz z y
he looked anxiously  o t
do you mean that you   t


Even the 2-gram model works fairly well. Since its prediction depends only on the last letter, we can simply list what it predicts after each letter.

In [112]:
for l in random_idx.alphabet:
    lidx = np.argmax(predict_from_2grams(l))
    print l, random_idx.alphabet[lidx]

a n
b e
c h
d  
e  
f  
g  
h e
i n
j h
k  
l i
m e
n  
o u
p l
q  
r  
s  
t h
u t
v e
w a
x t
y  
z y
  t


So a lot of letters are most often followed by a space. Somehow 'q' is also followed by a space; I figured it would be 'u'. After a space, 't' is the most typical letter.

In [125]:
h = np.load('data/alice-3gram-space-d50K-160223.npz')
letter_vectors_3g = h['letter_vectors']
hyperdictionary_3g = np.squeeze(h['hyperdictionary'].T)

N = hyperdictionary_3g.shape[0]


In [130]:
def predict_from_3grams(sentence):
    letters = sentence[-2:]
    subvec = np.ones(N)
    
    for letter in letters:
        letter_idx = random_idx.alphabet.find(letter)
        subvec = np.roll(subvec, 1) * letter_vectors_3g[letter_idx,:]
        
    subvec = np.roll(subvec, 1)

    val = np.dot(letter_vectors_3g/N, subvec*hyperdictionary_3g)
    
    return val

trials = 100
iscorrect_prediction = zeros(trials)

for i in range(trials):
    iscorrect_prediction[i] = test_prediction(predict_from_3grams)
       
print mean(iscorrect_prediction)

i wasnt asleep he sa i i
nobody asked your op i e
are their heads off  s t
why did they live at    
the king looked anxi o h
a cheap sort of pres e  
ill be         judge   t
once more she found  h t
lets go on with the  g t
why did they live at    
i do hope itll make  m t
you mean you cant ta k n
what do you know abo u u
lets go on with the  g t
whos making personal   i
thats nothing to wha t t
are you content now  s  
we indeed cried the  m t
how fond she is of f i o
you must remember re m  
there was not a mome n  
copyright laws in mo s u
i cant tell you just    
it was the best butt e e
i wonder what i shou l  
it is a long tail ce r  
oh youre sure to do  t i
stupid things alice  b t
except for the limit e  
oh do let me help to    
i must go and get re a  
alice went timidly u p p
she said this last w o a
and she went on plan n d
but if im not the sa m i
you must remember re m  
but her sister sat s t h
you know what to bea u d
if you wish to charg e e
then you know the mo c u


This also works fairly well. It is pretty good at 'the' and other short words.

Now, we can see the whole 3-gram prediction structure as a matrix. The first letter will be on the left, and the second on top.

In [138]:
print
print ' ',
for l in random_idx.alphabet:
    print l,    
print

for l in random_idx.alphabet:
    print l,
    for j in random_idx.alphabet:
        lidx = np.argmax(predict_from_3grams(l+j))
        print random_idx.alphabet[lidx],
    print


  a b c d e f g h i j k l m n o p q r s t u v w x y z  
a r o k   x t e i d p e i e d c p u e     t e z o     t
b c u q o r h x x p f d e x j u h a j d r t h z s d m w
c n   m o   i x   i m   e t   u d j e l   c n k p r p y
d t g h h r u m g z   z y h z   g p t k c c   p   f v t
e d o t     r z a r   n f x   j i w       d e e t   f t
f z i m s n w y i r p n k r   r b v r d e q q u i o o t
g v x t   t o l t v y t a z p a i j e   x     v s r z t
h t b c r   t p q n q i v i x u y c i b   h g v u m r s
i z u e   d   h p w j e l   g n f v d     t l t h l p g
j     p r c b g t x h d t v c f k p a   t s t h p k m x
k q j x j   q k x n u f y   n b e m v m j y e n u h l t
l r j r       z q c k u   s b o u a a w a k k q e   u t
m k l l h   l y h n y a y h r u v n m   s s m t x w d o
n s e e     z   f n a i y u i w y   r     g q m     q a
o t w k g x   r r y p v g e   k e t   b     e   b c t i
p t m b e r g c f j y n a e v x e p o s z w u l u g e c
q u b u c p y o y b b r z z   y e w x d d e j p