# Creating a hyperdictionary

The most basic way to solve the letter prediction problem, given no constraints, would be to keep a dictionary of words and look things up in it. Here I attempt to store such a dictionary in a hypervector, creating a hyperdictionary.

Hypervectors behave much like hashes: the vector for a word tells you nothing about the vectors for its subwords. So in order to store a dictionary in the hypervector, you need to store each word along with all of its leading substrings (prefixes).

Essentially, I am encoding an algorithm in the hypervector that does a tree search through a dictionary. I want to start typing in letters and then have the hyperdictionary list the possible next letters given the words that are stored. This means I want to store not only every word, but the entire tree of substrings that make up each word.
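
As a rough illustration of what "storing the tree of substrings" means, the sketch below builds one hypervector per prefix of a word by shifting the running vector and multiplying in the next letter. The alphabet (lowercase letters plus a space), the small dimension, and the helper name `prefix_vectors` are assumptions made for this example; `random_idx.alphabet` may be defined differently.

```python
import string
import numpy as np

# Assumption: 26 lowercase letters plus a space; random_idx.alphabet may differ.
alphabet = string.ascii_lowercase + ' '
N = 10000  # small dimension, for illustration only

rng = np.random.RandomState(0)
letter_vectors = 2 * (rng.randn(len(alphabet), N) > 0) - 1  # random +/-1 rows

def prefix_vectors(word):
    """Yield (prefix, hypervector) for every prefix of word (hypothetical helper)."""
    vec = np.ones(N)
    for i, letter in enumerate(word):
        # shift the running vector by one position, then bind the new letter
        vec = np.roll(vec, 1) * letter_vectors[alphabet.find(letter)]
        yield word[:i + 1], vec

prefixes = dict(prefix_vectors('apple'))

# A prefix matches itself exactly; two different prefixes are nearly orthogonal.
sim_same = np.dot(prefixes['app'], prefixes['app']) / float(N)
sim_diff = np.dot(prefixes['app'], prefixes['appl']) / float(N)
print(sim_same, sim_diff)
```

Because "app" and "appl" get unrelated vectors, each node of the prefix tree has to be stored explicitly, which is why the number of stored items grows well beyond the number of words.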


In [1]:

import random_idx
import utils
import pickle

import string
from pylab import *


%matplotlib inline



2016-02-24 09:47


## Building the hyper dictionary

I found a text file online (`2of12id.txt`) that contains a list of common English words. My goal is to put this dictionary into a hypervector and then see if I can use a standard word-based algorithm to predict the next letter.

In [2]:
fdict = open("2of12id.txt")
word_list = []

In [3]:
for line in fdict:
    words = line.split()
    
    # take out the noun/verb/adjective
    words.pop(1)
    
    for word in words:
        if word.find('{') > 0:
            continue
            
        w = word.strip('()~-|{}!@/')
        
        if len(w) == 0:
            continue
                
        word_list.append(w)

In [4]:
print len(word_list)

100060


We now have a dictionary of over 100,000 words. I am going to go through each word, substring by substring, and add each substring to the hypervector. This means that far more than 100k elements need to be stored, because I am essentially trying to store the entire tree. Since there are so many words, I am going to use an even larger hypervector. The amount of information a hypervector can hold is limited, and there is already some literature on this.
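
To get a feel for the capacity issue, here is a small experiment. It is only a sketch, and it assumes the stored items behave like independent random ±1 vectors. Under that assumption, the similarity of an unstored item to a memory holding M items has a standard deviation of roughly sqrt(M/N), so the noise grows as more substrings are added.

```python
import numpy as np

rng = np.random.RandomState(0)
N = 10000   # hypervector dimension (illustrative, smaller than in this notebook)
M = 400     # number of stored items

# Assumption: stored items behave like independent random +/-1 vectors.
stored = 2 * (rng.randn(M, N) > 0) - 1
memory = stored.sum(axis=0)

# Similarity of items that were never stored: pure crosstalk noise.
probes = 2 * (rng.randn(200, N) > 0) - 1
noise = probes.dot(memory) / float(N)

# Similarity of items that were stored: signal of about 1 plus the same noise.
signal = stored.dot(memory) / float(N)

print(noise.std(), np.sqrt(M / float(N)))  # measured noise vs. sqrt(M/N)
print(signal.mean())
```

For a fixed decision threshold, the noise has to stay well below the threshold, which ties the number of storable substrings to the dimension N.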

I really want the hypervector to work just like a word dictionary, so I only add a substring if it is not already present.
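
The "add only if absent" rule can be sketched on its own as below. The helper name `add_if_absent` is hypothetical; the 0.4 cutoff is the one used in the cells that follow.

```python
import numpy as np

rng = np.random.RandomState(1)
N = 10000
THRESHOLD = 0.4  # same cutoff as in the notebook cells

def add_if_absent(memory, vec, threshold=THRESHOLD):
    """Add vec to memory in place unless it already looks stored.

    Returns True if vec was added (hypothetical helper, not part of random_idx).
    """
    similarity = np.dot(vec, memory) / float(len(memory))
    if similarity < threshold:
        memory += vec
        return True
    return False

memory = np.zeros(N)
item = 2.0 * (rng.randn(N) > 0) - 1

first = add_if_absent(memory, item)   # memory is empty, so the item is added
second = add_if_absent(memory, item)  # similarity is now 1.0, so it is skipped
print(first, second)
```

Without this check, shared prefixes such as "a" or "ac" would be added once per word and their weight in the hypervector would swamp the rarer substrings.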

In [None]:
N=1000000
letter_vectors = 2 * (np.random.randn(len(random_idx.alphabet), N) > 0) - 1
print letter_vectors

In [None]:
hyperdictionary = np.zeros(N)
count = 0
vals = []
subwords = []
skip = 20  # only every 20th word of word_list is added below

for word in word_list[0::skip]:
#for word in ['accelerate','aardvark', 'accordion', 'accordionists',  'apple', 'betazoid', 'betakeratine']:
#for word in ['a', 'b', 'c', 'd','e', 'f']:
    #print ""
    print word,
    subword = ''
    subvec = np.ones(N)
    for i,letter in enumerate(word):
        letter_idx = random_idx.alphabet.find(letter)
        subvec = np.roll(subvec, 1) * letter_vectors[letter_idx,:]
        subword += letter
        
        # check to see if the subvec is already present in the hyperdictionary
        val = np.dot(subvec.T, hyperdictionary) / N
        
        # If the substring is not present, then val should be near 0
        if val < 0.4:
            # then add the substring
            hyperdictionary += subvec
            count += 1
            #print subword, 
    
    letter_idx = random_idx.alphabet.find(' ')
    subvec = np.roll(subvec, 1) * letter_vectors[letter_idx,:]
    # check to see if the subvec is already present in the hyperdictionary
    val = np.dot(subvec.T, hyperdictionary) / N
        
    # If the substring is not present, then val should be near 0
    if val < 0.4:
        # then add the substring
        hyperdictionary += subvec
        count += 1
    

In [None]:
print count

In [None]:
random_idx.alphabet

In [None]:
np.savez('data/hyperdictionary_external-s20-d1M-160223.npz', hyperdictionary=hyperdictionary, letter_vectors=letter_vectors)
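
Once a hyperdictionary exists, asking for the possible next letters amounts to extending the typed prefix by each letter of the alphabet and checking which extensions are present. The following is a minimal, self-contained sketch on a four-word toy dictionary. The alphabet, the dimension, and the helper names `extend`, `build` and `next_letters` are assumptions for illustration, not part of `random_idx`.

```python
import string
import numpy as np

alphabet = string.ascii_lowercase + ' '  # assumption; random_idx.alphabet may differ
N = 10000
rng = np.random.RandomState(2)
letter_vectors = 2 * (rng.randn(len(alphabet), N) > 0) - 1

def extend(vec, letter):
    """Shift the prefix vector and bind one more letter."""
    return np.roll(vec, 1) * letter_vectors[alphabet.find(letter)]

def build(words, threshold=0.4):
    """Store every prefix of every word, plus a trailing space as end marker."""
    memory = np.zeros(N)
    for word in words:
        vec = np.ones(N)
        for letter in word + ' ':
            vec = extend(vec, letter)
            if np.dot(vec, memory) / float(N) < threshold:
                memory += vec
    return memory

def next_letters(memory, prefix, threshold=0.4):
    """List the letters whose extension of prefix looks stored in memory."""
    vec = np.ones(N)
    for letter in prefix:
        vec = extend(vec, letter)
    return [c for c in alphabet
            if np.dot(extend(vec, c), memory) / float(N) >= threshold]

memory = build(['apple', 'apply', 'ape', 'bat'])
print(next_letters(memory, 'ap'))
print(next_letters(memory, 'appl'))
print(next_letters(memory, 'ape'))
```

With so few stored items the crosstalk is small, so the query for "ap" should return "e" and "p", and a space in the result signals that the prefix is itself a complete word.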

In [None]:
fdict = open("raw_texts/texts_english/alice_in_wonderland.txt")
text = fdict.read().lower()

punct = string.punctuation + string.digits

for i in punct:
    if i == '-':
        text = text.replace(i, ' ')
    else:
        text = text.replace(i, '')
    
text = text.replace('\n', ' ')
text = text.replace('\r','')
text = text.replace('\t','')
short_text = text[504:137330]

word_list = set(text.split()[1:]);
len(word_list)

In [None]:
short_text = text[504:137330]
word_list = set(short_text.split()[1:]);
len(word_list)

In [None]:
N=1000000
letter_vectors = 2 * (np.random.randn(len(random_idx.alphabet), N) > 0) - 1

hyperdictionary = np.zeros(N)
count = 0
vals = []
subwords = []
skip = 20

for word in word_list:
#for word in ['accelerate','aardvark', 'accordion', 'accordionists',  'apple', 'betazoid', 'betakeratine']:
#for word in ['a', 'b', 'c', 'd','e', 'f']:
    #print ""
    print word,
    subword = ''
    subvec = np.ones(N)
    for i,letter in enumerate(word):
        letter_idx = random_idx.alphabet.find(letter)
        subvec = np.roll(subvec, 1) * letter_vectors[letter_idx,:]
        subword += letter
        
        # check to see if the subvec is already present in the hyperdictionary
        val = np.dot(subvec.T, hyperdictionary) / N
        
        # If the substring is not present, then val should be near 0
        if val < 0.4:
            # then add the substring
            hyperdictionary += subvec
            count += 1
            #print subword, 
    
    letter_idx = random_idx.alphabet.find(' ')
    subvec = np.roll(subvec, 1) * letter_vectors[letter_idx,:]
    # check to see if the subvec is already present in the hyperdictionary
    val = np.dot(subvec.T, hyperdictionary) / N
        
    # If the substring is not present, then val should be near 0
    if val < 0.4:
        # then add the substring
        hyperdictionary += subvec
        count += 1

In [None]:
print count

In [None]:
np.savez('data/hyperdictionary_alice-short-d1M-160223.npz', hyperdictionary=hyperdictionary, letter_vectors=letter_vectors)


## N-gram statistics

Next, I am going to make a hypervector that keeps statistics on the 2-grams (and then the 3-grams) of letters in the text, including spaces.
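
How `random_idx.generate_text_vector` works is not shown here. The sketch below is one plausible reading, assuming it sums a roll-and-multiply vector for every n-gram in the text. The names `ngram_vector` and `text_vector` are hypothetical stand-ins, and the alphabet is assumed to be lowercase letters plus a space.

```python
import string
import numpy as np

alphabet = string.ascii_lowercase + ' '  # assumption; random_idx.alphabet may differ
N = 10000
rng = np.random.RandomState(3)
letter_vectors = 2 * (rng.randn(len(alphabet), N) > 0) - 1

def ngram_vector(ngram):
    """Encode one n-gram with the same shift-and-multiply scheme as above."""
    vec = np.ones(N)
    for letter in ngram:
        vec = np.roll(vec, 1) * letter_vectors[alphabet.find(letter)]
    return vec

def text_vector(text, n):
    """Hypothetical stand-in: sum the vectors of every n-gram in text."""
    total = np.zeros(N)
    for i in range(len(text) - n + 1):
        total += ngram_vector(text[i:i + n])
    return total

tv = text_vector('the cat sat on the mat', 2)

# The normalized dot product with an n-gram vector estimates its count.
count_th = np.dot(ngram_vector('th'), tv) / float(N)
count_zq = np.dot(ngram_vector('zq'), tv) / float(N)
print(int(round(float(count_th))), int(round(float(count_zq))))
```

Unlike the hyperdictionary, nothing is deduplicated here, so repeated n-grams add up and the vector holds approximate frequency statistics rather than set membership.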




In [None]:
reload(random_idx)

In [None]:
short_text = text[504:137330]
print short_text

In [None]:
# generate text vector based on each pair of characters

N=20000
letter_vectors = 2 * (np.random.randn(len(random_idx.alphabet), N) > 0) - 1

alice_text_vector2 = random_idx.generate_text_vector(N, letter_vectors, 2, short_text)

In [None]:
alice_text_vector2.shape

In [None]:
np.savez('data/alice-2gram-space-d20K-160223.npz', hyperdictionary=alice_text_vector2, letter_vectors=letter_vectors)

In [None]:

N=20000
letter_vectors = 2 * (np.random.randn(len(random_idx.alphabet), N) > 0) - 1

alice_text_vector3 = random_idx.generate_text_vector(N, letter_vectors, 3, short_text)

In [None]:
np.savez('data/alice-3gram-space-d20K-160223.npz', hyperdictionary=alice_text_vector3, letter_vectors=letter_vectors)

In [None]:
letter_vectors.shape