<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Libraries" data-toc-modified-id="Libraries-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Libraries</a></span></li><li><span><a href="#GloVe-Vector-Importing" data-toc-modified-id="GloVe-Vector-Importing-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>GloVe Vector Importing</a></span></li><li><span><a href="#Word-Vector-Dictionary" data-toc-modified-id="Word-Vector-Dictionary-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Word-Vector Dictionary</a></span></li><li><span><a href="#Target-Vocabulary---Anna-Karenina" data-toc-modified-id="Target-Vocabulary---Anna-Karenina-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Target Vocabulary - Anna Karenina</a></span><ul class="toc-item"><li><span><a href="#Load-Data" data-toc-modified-id="Load-Data-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Load Data</a></span></li><li><span><a href="#Preprocessing" data-toc-modified-id="Preprocessing-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Preprocessing</a></span></li></ul></li><li><span><a href="#PyTorch-Embedding-Layer" data-toc-modified-id="PyTorch-Embedding-Layer-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>PyTorch Embedding Layer</a></span></li></ul></div>

# Libraries

In [19]:
import numpy as np
import pickle
import bcolz

from collections import Counter

# GloVe Vector Importing

In [2]:
# Data file name
datFileName = '6B.50.dat'
print(datFileName)

6B.50.dat


In [3]:
words = []
idx = 0
word2idx = {}
#vectors = bcolz.carray(np.zeros(1), rootdir=f'{glove_path}/6B.50.dat', mode='w')
vectors = bcolz.carray(np.zeros(1), rootdir=f'6B.50.dat', mode='w')

#with open(f'{glove_path}/glove.6B.50d.txt', 'rb') as f:
with open(f'glove.6B.50d.txt', 'rb') as f:
    for l in f:
        line = l.decode().split()
        word = line[0]
        words.append(word)
        word2idx[word] = idx
        idx += 1
        # np.float was removed in NumPy 1.24; the builtin float is equivalent
        vect = np.array(line[1:]).astype(float)
        vectors.append(vect)
    
#vectors = bcolz.carray(vectors[1:].reshape((400000, 50)), rootdir=f'{glove_path}/6B.50.dat', mode='w')
#vectors.flush()
#pickle.dump(words, open(f'{glove_path}/6B.50_words.pkl', 'wb'))
#pickle.dump(word2idx, open(f'{glove_path}/6B.50_idx.pkl', 'wb'))

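# Drop the placeholder zero used to create the carray, then reshape to
# (number of words, embedding size); glove.6B has 400,000 words
vectors = bcolz.carray(vectors[1:].reshape((400000, 50)), rootdir='6B.50.dat', mode='w')
vectors.flush()
# Use context managers so the pickle files are closed after writing
with open('6B.50_words.pkl', 'wb') as f:
    pickle.dump(words, f)
with open('6B.50_idx.pkl', 'wb') as f:
    pickle.dump(word2idx, f)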

The following code works if bcolz is not available
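
A minimal sketch of a bcolz-free path, assuming the same `glove.6B.50d.txt` text file as above. The helper name `load_glove_txt` is hypothetical (it is not part of any library); it reads the file with plain Python and stacks the vectors into a single NumPy array, returning the same three objects the bcolz version produces.

```python
import numpy as np


def load_glove_txt(path):
    """Read a GloVe text file into (words, word2idx, vectors) using NumPy only.

    Hypothetical helper: each line is assumed to be
    '<word> <v1> <v2> ... <vN>' separated by single spaces.
    """
    words = []
    word2idx = {}
    rows = []
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            word2idx[parts[0]] = len(words)
            words.append(parts[0])
            rows.append([float(x) for x in parts[1:]])
    # One row per word, one column per embedding dimension
    vectors = np.array(rows, dtype=np.float64)
    return words, word2idx, vectors
```

With the real file in place, `words, word2idx, vectors = load_glove_txt('glove.6B.50d.txt')` could stand in for the bcolz loading step, and `np.save` can be used if the array should be cached on disk.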

# Word-Vector Dictionary

In [4]:
vectors = bcolz.open(f'{datFileName}')[:]
words = pickle.load(open(f'6B.50_words.pkl', 'rb'))
word2idx = pickle.load(open(f'6B.50_idx.pkl', 'rb'))

In [5]:
glove = {w: vectors[word2idx[w]] for w in words}

In [6]:
glove['the']

array([ 4.1800e-01,  2.4968e-01, -4.1242e-01,  1.2170e-01,  3.4527e-01,
       -4.4457e-02, -4.9688e-01, -1.7862e-01, -6.6023e-04, -6.5660e-01,
        2.7843e-01, -1.4767e-01, -5.5677e-01,  1.4658e-01, -9.5095e-03,
        1.1658e-02,  1.0204e-01, -1.2792e-01, -8.4430e-01, -1.2181e-01,
       -1.6801e-02, -3.3279e-01, -1.5520e-01, -2.3131e-01, -1.9181e-01,
       -1.8823e+00, -7.6746e-01,  9.9051e-02, -4.2125e-01, -1.9526e-01,
        4.0071e+00, -1.8594e-01, -5.2287e-01, -3.1681e-01,  5.9213e-04,
        7.4449e-03,  1.7778e-01, -1.5897e-01,  1.2041e-02, -5.4223e-02,
       -2.9871e-01, -1.5749e-01, -3.4758e-01, -4.5637e-02, -4.4251e-01,
        1.8785e-01,  2.7849e-03, -1.8411e-01, -1.1514e-01, -7.8581e-01])

# Target Vocabulary - Anna Karenina

## Load Data

In [7]:
with open('anna.txt', 'r') as f:
    anna_text = f.read()

In [8]:
anna_text[:200]

"Chapter 1\n\n\nHappy families are all alike; every unhappy family is unhappy in its own\nway.\n\nEverything was in confusion in the Oblonskys' house. The wife had\ndiscovered that the husband was carrying on"

## Preprocessing

In [13]:
# Create a dict to turn punctuation into a token.
# '--' must come before '-' so double hyphens are replaced first.
punct2token = {'.': '<PERIOD>',
               ',': '<COMMA>',
               '"': '<QUOTATION_MARK>',
               ';': '<SEMICOLON>',
               '!': '<EXCLAMATION_MARK>',
               '?': '<QUESTION_MARK>',
               '(': '<LEFT_PAREN>',
               ')': '<RIGHT_PAREN>',
               '--': '<HYPHENS>',
               '-': '<DASH>',
               '\n': '<NEW_LINE>',
               ':': '<COLON>'}

# Tokenize the punctuation
for punct, token in punct2token.items():
    anna_text = anna_text.replace(punct, ' {} '.format(token))
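
The replacement loop above depends on dictionary order: Python dicts (3.7+) iterate in insertion order, so `'--'` is handled before `'-'`. A small sketch on a made-up sample string shows the difference; the helper name `tokenize_punct` and the sample text are illustrative assumptions, not part of the notebook.

```python
def tokenize_punct(text, mapping):
    """Replace each punctuation mark with a space-padded token, then split."""
    for punct, token in mapping.items():
        text = text.replace(punct, ' {} '.format(token))
    return text.split()


sample = "well--known half-truth"

# '--' first: the double hyphen survives as a single token
good_order = {'--': '<HYPHENS>', '-': '<DASH>'}
good = tokenize_punct(sample, good_order)

# '-' first: every hyphen becomes <DASH>, so '--' never matches
bad_order = {'-': '<DASH>', '--': '<HYPHENS>'}
bad = tokenize_punct(sample, bad_order)

print(good)
print(bad)
```
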

In [14]:
anna_text[:200]

'Chapter 1 <NEW_LINE>  <NEW_LINE>  <NEW_LINE> Happy families are all alike <SEMICOLON>  every unhappy family is unhappy in its own <NEW_LINE> way <PERIOD>  <NEW_LINE>  <NEW_LINE> Everything was in conf'

In [15]:
# convert all text to lower case, then split on whitespace
anna_text = anna_text.lower()
anna_text = anna_text.split()

In [16]:
anna_text[:200]

['chapter',
 '1',
 '<new_line>',
 '<new_line>',
 '<new_line>',
 'happy',
 'families',
 'are',
 'all',
 'alike',
 '<semicolon>',
 'every',
 'unhappy',
 'family',
 'is',
 'unhappy',
 'in',
 'its',
 'own',
 '<new_line>',
 'way',
 '<period>',
 '<new_line>',
 '<new_line>',
 'everything',
 'was',
 'in',
 'confusion',
 'in',
 'the',
 "oblonskys'",
 'house',
 '<period>',
 'the',
 'wife',
 'had',
 '<new_line>',
 'discovered',
 'that',
 'the',
 'husband',
 'was',
 'carrying',
 'on',
 'an',
 'intrigue',
 'with',
 'a',
 'french',
 '<new_line>',
 'girl',
 '<comma>',
 'who',
 'had',
 'been',
 'a',
 'governess',
 'in',
 'their',
 'family',
 '<comma>',
 'and',
 'she',
 'had',
 'announced',
 'to',
 '<new_line>',
 'her',
 'husband',
 'that',
 'she',
 'could',
 'not',
 'go',
 'on',
 'living',
 'in',
 'the',
 'same',
 'house',
 'with',
 'him',
 '<period>',
 '<new_line>',
 'this',
 'position',
 'of',
 'affairs',
 'had',
 'now',
 'lasted',
 'three',
 'days',
 '<comma>',
 'and',
 'not',
 'only',
 'the',
 '<n

In [20]:
# Create vocab dictionaries

# get the word counts
word_counts = Counter(anna_text)
    
# sort from most to least frequent
sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)

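# define the dictionaries
# (word2idx now refers to the Anna Karenina vocabulary and replaces the GloVe
# word2idx loaded earlier; the glove dict built from it is unaffected)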
idx2word = {ii: word for ii, word in enumerate(sorted_vocab)}
word2idx = {word: ii for ii, word in idx2word.items()}

In [21]:
# apply dictionaries to text
target_vocab = [word2idx[word] for word in anna_text]

In [22]:
# save to a pickle file (the context manager closes the file after writing)
with open('preprocess.p', 'wb') as f:
    pickle.dump((target_vocab, idx2word, word2idx, punct2token), f)

# PyTorch Embedding Layer

Now build the embedding layer for the target dataset's vocabulary, which may differ from GloVe's. For each word in the target vocabulary, load the pre-trained GloVe vector if the word is in GloVe's vocabulary. Otherwise, initialize a random vector.
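
One way this step could look is sketched below. It is a minimal example under stated assumptions, not a definitive implementation: the helper names `build_weights_matrix` and `create_emb_layer`, the random-initialization scale of 0.6, and the toy `glove` and `idx2word` dictionaries are all illustrative. In the notebook, the real `glove` and `idx2word` objects built earlier would be passed in instead. The only PyTorch call relied on is `nn.Embedding.from_pretrained`.

```python
import numpy as np
import torch
import torch.nn as nn


def build_weights_matrix(idx2word, glove, emb_dim=50, scale=0.6, seed=0):
    """Return (weights, n_found): one row per target word.

    Rows come from GloVe when the word is known; otherwise they are drawn
    from a normal distribution (the scale value is an arbitrary assumption).
    """
    rng = np.random.default_rng(seed)
    weights = np.zeros((len(idx2word), emb_dim), dtype=np.float32)
    n_found = 0
    for i, word in idx2word.items():
        vec = glove.get(word)
        if vec is not None:
            weights[i] = vec
            n_found += 1
        else:
            weights[i] = rng.normal(scale=scale, size=emb_dim)
    return weights, n_found


def create_emb_layer(weights, trainable=True):
    """Wrap the weight matrix in an nn.Embedding layer."""
    return nn.Embedding.from_pretrained(torch.from_numpy(weights),
                                        freeze=not trainable)


# Toy stand-ins for the notebook's glove and idx2word dictionaries
glove = {'the': np.ones(50), 'happy': np.full(50, 0.5)}
idx2word = {0: 'the', 1: '<new_line>', 2: 'happy'}

weights, n_found = build_weights_matrix(idx2word, glove)
emb = create_emb_layer(weights)
out = emb(torch.tensor([0, 2]))  # look up 'the' and 'happy'
print(n_found, tuple(out.shape))
```

Tokens introduced during preprocessing, such as `<new_line>`, are not GloVe words, so they receive random vectors. Leaving the layer trainable lets those rows be learned during training.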