# Distributed Representations of Words and Phrases and Their Compositionality

This notebook contains an implementation of the neural network described in the paper "Distributed Representations of Words and Phrases and Their Compositionality". A copy of the paper, along with a summary, is available in this directory. This implementation uses PyTorch.

Note that the purpose of this notebook is to explore the data and get a better grounding in some of the implementation details and algorithms needed for pre-processing the data, training on the final dataset, and interpreting the results. Please look at *torch_final* for a condensed implementation of these methods, minus some of the exploration and discussion.

## Corpus

The corpus used for training the neural network is not the corpus mentioned in the paper (the Google News corpus described in the paper is not publicly available). An alternative corpus is used instead (the files under `language-modeling-benchmark-r13`, loaded below), which may affect the quality of the word vectors.

## Analogies Datasets

The paper mentions two analogy datasets that the authors used to evaluate the word embeddings generated by their network. Both datasets are publicly available and are included in this repository.

In [1]:
import nltk
from nltk.tokenize import word_tokenize

import os
import torch
import time

nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/brendanmcnamara/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
torch.__version__

'1.3.1'

In [3]:
path_data = '../../data/language-modeling-benchmark-r13/'
path_training_corpus = os.path.join(path_data, 'training-monolingual.tokenized.shuffled')
path_holdout_corpus = os.path.join(path_data, 'heldout-monolingual.tokenized.shuffled')

Here we will be exploring a single file in our training corpus.

In [4]:
path_filename = os.path.join(path_training_corpus, os.listdir(path_training_corpus)[0])

start_time = time.time()

with open(path_filename) as file:
    samples_raw = file.read().split("\n")

total_time = time.time() - start_time

print(f"Total time to load a single file: {total_time:0.2f}s")
print(f"Total samples in the corpus: {len(samples_raw)}")

Total time to load a single file: 0.57s
Total samples in the corpus: 306075


Let's get a sense of the words present in the file:
- How many total words are in the file?
- How many unique words?
- What are the counts for each word?
- What are the most common words?

### Tokenizing Corpus

We will use *nltk* to tokenize our corpus. As additional pre-processing steps, we will force all our text to be lowercase and remove any tokens that contain only punctuation characters. These steps may not be necessary with such a large corpus, but for this example, I do them to reduce our vocabulary size and our overall training time.

In [5]:
from string import punctuation

punc_list = [p for p in punctuation]

def is_punctuation_token(token):
    """
    A token is a punctuation token if the characters consist of
    only punctuation characters
    """
    return len(token) == len([c for c in token if c in punc_list])


In [6]:
# Quick Test of Above Functions

test1 = is_punctuation_token('.') == True
test2 = is_punctuation_token('?)()') == True
test3 = is_punctuation_token('...1') == False
test4 = is_punctuation_token('abc') == False

(test1, test2, test3, test4)

(True, True, True, True)

In [7]:
start_time = time.time()

samples = []

for s in samples_raw:
    tokens = [t for t in word_tokenize(s.lower()) if not is_punctuation_token(t)]
    if tokens:
        samples.append(tokens)


total_time = time.time() - start_time
print(f"Total processing time: {total_time:0.2f}s")

Total processing time: 84.33s


In [8]:
start_time = time.time()

token_freq = {}

for s in samples:
    for t in s:

        if t not in token_freq:
            token_freq[t] = 0
        token_freq[t] = token_freq[t] + 1
        
total_time = time.time() - start_time
print(f"Total processing time: {total_time:0.2f}s")

print(f"Total unique tokens found: {len(token_freq)}")

Total processing time: 2.61s
Total unique tokens found: 163351
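The same kind of frequency table can also be built with `collections.Counter` from the standard library. The snippet below is only a minimal sketch on a hypothetical `toy_samples` list (not the notebook's corpus); the names `toy_samples` and `toy_token_freq` are made up for illustration.

```python
from collections import Counter

# Hypothetical toy input: a list of tokenized samples.
toy_samples = [["the", "cat", "sat"], ["the", "dog", "sat", "down"]]

# Counter.update adds one count for every token in the sample.
toy_token_freq = Counter()
for sample in toy_samples:
    toy_token_freq.update(sample)

print(toy_token_freq["the"], len(toy_token_freq))
```

A `Counter` behaves like a dictionary, so it could be passed to code that expects a plain token-to-count mapping.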


In [9]:
def k_most_frequent(freq, k):
    result = []

    # Keep track of the index and count of the least frequent token in the
    # list. This is the token that is the next candidate to be removed if
    # we find a more frequent token.
    token_to_remove_index = None
    token_to_remove_count = None

    # We will allow the user to enter negative numbers, indicating they
    # want the least frequent words instead of the most frequent words.
    should_append = lambda c: token_to_remove_count < c if k > 0 else token_to_remove_count > c

    k_positive = k if k > 0 else -k

    for (j, (token, count)) in enumerate(freq.items()):
        # If we do not yet have k most frequent tokens, then we can just
        # add this token to the list.
        if len(result) < k_positive:
            result.append(token)
            # "not should_append(c)" indicates that the current token should
            # be the next thing to remove when a better word comes up.
            if token_to_remove_count is None or not should_append(count):
                token_to_remove_count = count
                token_to_remove_index = len(result) - 1
            continue

        # Check if this word is occurring more frequently than the least
        # frequent token in the list.
        if should_append(count):
            result[token_to_remove_index] = token
            
            # Need to search for the next token in the list to remove.
            token_to_remove_count = freq[result[0]]
            token_to_remove_index = 0
            for (i, token) in enumerate(result):
                # "not should_append(c)" indicates that the current token should
                # be the next thing to remove when a better word comes up.
                if not should_append(freq[token]):
                    token_to_remove_count = freq[token]
                    token_to_remove_index = i
    
    return result
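
As a possible alternative to the hand-rolled selection above, the standard library's `heapq` module offers `nlargest` and `nsmallest`. The sketch below assumes the same convention of a negative `k` meaning "least frequent"; the helper name `k_most_frequent_heap` and the `toy_freq` data are hypothetical. Unlike the function above, it returns the tokens sorted by count.

```python
import heapq

def k_most_frequent_heap(freq, k):
    """Hypothetical alternative: k > 0 returns the k most frequent
    keys, k < 0 returns the -k least frequent keys."""
    if k >= 0:
        return heapq.nlargest(k, freq, key=freq.get)
    return heapq.nsmallest(-k, freq, key=freq.get)

# Toy frequency table, made up for illustration.
toy_freq = {"the": 10, "cat": 3, "sat": 2, "zebra": 1}

print(k_most_frequent_heap(toy_freq, 2))   # most frequent first
print(k_most_frequent_heap(toy_freq, -2))  # least frequent first
```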


Let's generate a list of the most common words and the number of times they show up in our file.

In [10]:
[(t, token_freq[t]) for t in k_most_frequent(token_freq, 20)]

[('he', 39685),
 ('was', 47607),
 ('at', 38909),
 ('as', 40322),
 ('that', 70472),
 ('said', 43592),
 ('is', 57582),
 ('for', 67865),
 ('the', 416622),
 ('from', 33096),
 ('of', 176198),
 ('and', 163408),
 ('it', 46179),
 ('on', 59739),
 ('with', 46932),
 ('a', 166195),
 ('by', 35149),
 ('in', 150692),
 ('to', 184921),
 ("'s", 70117)]

This list mostly contains pronouns, articles, prepositions, and common verbs, along with the possessive ending 's.

And also a list of the least frequent words:

In [11]:
[(t, token_freq[t]) for t in k_most_frequent(token_freq, -20)]

[('demsky', 1),
 ('cocoon-like', 1),
 ('5,214', 1),
 ('permasteelisa', 1),
 ('fratelli', 1),
 ('volvos', 1),
 ('seidel', 1),
 ('63,291', 1),
 ('sraya', 1),
 ('germanys', 1),
 ('computer-aided', 1),
 ('amitav', 1),
 ('3-43', 1),
 ('moreh', 1),
 ('2-38', 1),
 ('d10', 1),
 ('spinboldak', 1),
 ('checkpoint-friendly', 1),
 ('yamal-europe', 1),
 ('s10', 1)]

We can see that this list includes long numbers (63,291), names of people (Moreh), and joined words (computer-aided, checkpoint-friendly).

Let's now look for any words that contain non-alphabetical characters.

In [12]:
non_alpha_tokens = []

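for token in token_freq.keys():
    found_non_alpha = False

    for c in token:
        # Anything outside 'a'..'z' (digits, punctuation, accented
        # letters) counts as non-alphabetical. One such character is
        # enough, so stop scanning the token.
        if ord(c) < ord('a') or ord(c) > ord('z'):
            found_non_alpha = True
            break

    if found_non_alpha:
        non_alpha_tokens.append(token)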


In [13]:
ratio = len(non_alpha_tokens) / float(len(token_freq))
print(f"{ratio*100:0.2f}% of our tokens have characters that are non-alphabetical")

29.67% of our tokens have characters that are non-alphabetical
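The character test above treats only the ASCII letters `a` to `z` as alphabetical, so accented letters are counted as non-alphabetical too. A regular expression gives an equivalent check; the sketch below is an illustration on a made-up `toy_tokens` list, and the helper name `has_non_alpha` is hypothetical.

```python
import re

# Matches tokens made up entirely of the ASCII letters a-z.
ALPHA_ONLY = re.compile(r"[a-z]*")

def has_non_alpha(token):
    """Hypothetical helper: True if any character falls outside a-z."""
    return ALPHA_ONLY.fullmatch(token) is None

# Toy tokens, made up for illustration.
toy_tokens = ["mr.", "2008", "computer-aided", "said", "valéry"]

print([t for t in toy_tokens if has_non_alpha(t)])
```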


What are some of the most common tokens with non-alphabetical characters?

In [14]:
non_alpha_freq = { t:token_freq[t] for t in non_alpha_tokens }

[(t, non_alpha_freq[t]) for t in k_most_frequent(non_alpha_freq, 20)]

[('15', 1770),
 ("'re", 2576),
 ('£', 4843),
 ('2008', 2607),
 ("'t", 11929),
 ("'s", 70117),
 ('2', 2133),
 ('u.s.', 6516),
 ('mr.', 3666),
 ('10', 3374),
 ('3', 1707),
 ('1', 2703),
 ('12', 1825),
 ('2007', 1983),
 ('11', 1800),
 ('2009', 2540),
 ('30', 2232),
 ('20', 2258),
 ('5', 1672),
 ("'ve", 1933)]

We can see that this list contains years (2007, 2008, 2009), abbreviations (Mr. and U.S.), endings of contractions ('t and 've), and common numbers (2, 10). While some of these may lead to extra noise in the data, that is fine since we have such a large corpus. To get something basic working, I'll leave most of these tokens in the vocabulary. We will do a bit of additional processing of the text before we can start working on the neural network:

- Marginalize all tokens that occur with a frequency of <= 5 into a single token
- Find and generate common phrases in the text and treat them as their own tokens

Both of these steps were mentioned in the paper. As a side note, we will group our corpus into phrases before we try to remove low-frequency words, since we may end up grouping some of those low-frequency words into phrases.

### Creating Common Phrases

We will use the formula mentioned in the text:

$$score(w_i, w_j) = \frac {count(w_i, w_j) - \delta} {count(w_i) \times count(w_j)}$$

The score between any 2 tokens is computed based on their bigram and unigram counts. We've already calculated the unigram counts of every token when we collected their frequencies. We will also need to compute the bigram counts of every pair of tokens in our vocabulary.

The $\delta$ used is a discounting coefficient to prevent too many phrases consisting of infrequent words from forming.

Note that $\delta$ is a coefficient that will be dependent on the corpus size (not normalized), so a value used on some text examples may not extend well to the general training corpus.
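
To make the effect of $\delta$ concrete, here is a small worked example of the formula. It is only a sketch: the helper `toy_phrase_score` and all counts are made up for illustration and do not come from the corpus.

```python
def toy_phrase_score(bigram_count, count_1, count_2, delta):
    """Hypothetical helper: (count(w_i, w_j) - delta) / (count(w_i) * count(w_j))."""
    return (bigram_count - delta) / (count_1 * count_2)

# A rare pair: both words occur once, and only together.
rare_no_discount = toy_phrase_score(1, 1, 1, delta=0)
rare_discounted = toy_phrase_score(1, 1, 1, delta=5)

# A frequent pair: seen together 50 times; the words occur 60 and 80 times.
frequent_no_discount = toy_phrase_score(50, 60, 80, delta=0)
frequent_discounted = toy_phrase_score(50, 60, 80, delta=5)

print(rare_no_discount, frequent_no_discount)
print(rare_discounted, frequent_discounted)
```

With $\delta = 0$ the rare pair gets the maximum score of 1, far above the frequent pair. With $\delta = 5$ the rare pair's score becomes negative while the frequent pair stays positive, so the ranking flips.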

We also need to choose a score threshold above which 2 tokens are merged into a phrase. After doing a few passes over the entire corpus, we may end up with phrases of 3+ words. Note that after each pass over the corpus, we will need to recompute our unigram and bigram counts since the set of tokens changes.

We will also keep a mapping of which tokens are combined to process our samples after the phrases have been combined.

In [15]:
def calculate_unigram_and_bigram_counts(samples):
    """
    Calculate the unigram and bigram counts of our samples.
    
    samples - A list of lists of tokens. The outer list holds the
              samples. Each sample is a list of tokens.
    """

    unigram_counts = {}
    bigram_counts = {}

    for s in samples:
        # Unigram count calculation
        for i in range(len(s)):
            unigram = s[i]
            if unigram not in unigram_counts:
                unigram_counts[unigram] = 0
            unigram_counts[unigram] = unigram_counts[unigram] + 1
               
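        # Bigram count calculation. Note that stepping by 2 counts only
        # the non-overlapping pairs (s[0], s[1]), (s[2], s[3]), ...; a
        # step of 1 would count every adjacent pair.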
        for i in range(0, len(s), 2):
            # Quit if we do not have enough tokens to form
            # more bigrams.
            if i + 1 >= len(s):
                break
                
            bigram = (s[i], s[i+1])

            if bigram not in bigram_counts:
                bigram_counts[bigram] = 0
            bigram_counts[bigram] = bigram_counts[bigram] + 1
            
    return (unigram_counts, bigram_counts)
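
The bigram loop above advances two tokens at a time. If every adjacent pair should be counted instead, a sliding window is one option. The sketch below is an assumption about what such a variant could look like, not the code that produced the outputs in this notebook; the helper name `count_adjacent_bigrams` and the toy data are hypothetical.

```python
from collections import Counter

def count_adjacent_bigrams(samples):
    """Hypothetical helper: count every adjacent token pair in each sample."""
    bigram_counts = Counter()
    for sample in samples:
        # zip(sample, sample[1:]) yields (s[0], s[1]), (s[1], s[2]), ...
        bigram_counts.update(zip(sample, sample[1:]))
    return bigram_counts

# Toy samples, made up for illustration.
toy_samples = [["new", "york", "city"], ["in", "new", "york"]]
toy_bigram_counts = count_adjacent_bigrams(toy_samples)

print(toy_bigram_counts[("new", "york")], toy_bigram_counts[("york", "city")])
```

On this toy input, a step of 2 would see `("new", "york")` only once (in the first sample) and would never see `("york", "city")`.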

In [16]:
def phrase_score(t1, t2, unigram_count, bigram_count, delta):
    """
    Calculate the score for phrase combining.
    
    unigram_count - A dictionary mapping tokens in our vocabulary
                    to their counts.
                    
    bigram_count - A dictionary mapping pairs of tokens in our vocab
                   to their counts.
                   
    delta - a coefficient used to adjust the score. Higher delta means we
            discount infrequent words.
    """
    
    # Double check that we are not dividing by 0. This should never
    # happen in theory because if a token has 0 unigram count, it
    # should not be in our vocab.
    if t1 not in unigram_count or t2 not in unigram_count:
        return 0
    
    t1u = unigram_count[t1]
    t2u = unigram_count[t2]

    if t1u == 0 or t2u == 0:
        return 0
    
    b = bigram_count[(t1, t2)]
    
    return (b - delta) / (t1u * t2u)
    

In [17]:
def create_score_map(unigram_count, bigram_count, delta):
    """
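    Computes a mapping from each bigram to its phrase score.
    
    unigram_count - A dictionary mapping tokens to their counts.
    bigram_count - A dictionary mapping pairs of tokens to their counts.
    delta - The discounting coefficient passed to phrase_score.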
    """    
    return { (t1, t2): phrase_score(t1, t2, unigram_count, bigram_count, delta) for (t1, t2) in bigram_count.keys() }


In [18]:
start_time = time.time()

uc, bc = calculate_unigram_and_bigram_counts(samples)
score_map = create_score_map(uc, bc, delta=0)

total_time = time.time() - start_time

print(f"This took {total_time:0.2f}s to run")

[(b, score_map[b]) for b in k_most_frequent(score_map, 10)]

This took 8.06s to run


[(('farouq', 'al-qadoumi'), 1.0),
 (('uag', 'tecos'), 1.0),
 (('dhia', 'al-kawaz'), 1.0),
 (('01456', '486358'), 1.0),
 (('akreos', 'sofport'), 1.0),
 (('jennet', 'mallow'), 1.0),
 (('total-immersion', 'how-to-be-a-jack0ff-idiot'), 1.0),
 (('carloforto', 'isola'), 1.0),
 (('danuta', 'budorina'), 1.0),
 (('asawat', 'al-iraq'), 1.0)]

We can see that having a delta of 0 results in a lot of matching for pairs of tokens that only occur once in the text. We will change the delta value to penalize low-frequency words.

Let's try a few different delta values to see which gives better results.

In [19]:
import numpy as np

In [20]:
start_time = time.time()

deltas = np.logspace(-1, 3, 20)

freqs = []

for (i, delta) in enumerate(deltas):
    (uc, bc) = calculate_unigram_and_bigram_counts(samples)
    score_map = create_score_map(uc, bc, delta)
    freq = k_most_frequent(score_map, 40)
    freqs.append([(b, score_map[b]) for b in freq])
    print(f"{i+1:02} / {len(deltas)} completed")

total_time = time.time() - start_time
print(f"This took {total_time/60:0.2f}m")

01 / 20 completed
02 / 20 completed
03 / 20 completed
04 / 20 completed
05 / 20 completed
06 / 20 completed
07 / 20 completed
08 / 20 completed
09 / 20 completed
10 / 20 completed
11 / 20 completed
12 / 20 completed
13 / 20 completed
14 / 20 completed
15 / 20 completed
16 / 20 completed
17 / 20 completed
18 / 20 completed
19 / 20 completed
20 / 20 completed
This took 2.81m


In [21]:
for i in range(len(deltas)):
    print(f"Delta: {deltas[i]}")
    print([b for b, _ in freqs[i]])
    print("\n\n")
    

Delta: 0.1
[('nickname-less', 'whites-only'), ('01456', '486358'), ('anarchism', 'co-exists'), ('poddala', 'jayantha'), ('landrum', 'papered'), ('3888', 'ecocentric.co.uk'), ('dorks', 'jocks'), ('jennet', 'mallow'), ('vujic', 'valjevo'), ('danuta', 'budorina'), ('nicolay', 'bogachev'), ('sural', 'ncv'), ('dhia', 'al-kawaz'), ('valéry', 'lucentini'), ('carloforto', 'isola'), ('4777', 'www.globalgardens.co.uk'), ('wamidh', 'nadhmi'), ('shantanu', 'narayan'), ('total-immersion', 'how-to-be-a-jack0ff-idiot'), ('rías', 'baixas'), ('farouq', 'al-qadoumi'), ('chelyabinsk', 'nizhny'), ('mandu', 'magandi'), ('0532', '793888'), ('162.42', '0.6963'), ('career-long', '57-yarder'), ('aramide', 'olaniyan'), ('violito', 'payla'), ('natasharichardson', 'lindsaylohan'), ('sbcglobal.net', 'www.penfieldhouse.com'), ('rais', 'yatim'), ('mylonitic', 'graphitic'), ('galp', 'energia'), ('t.mccoy', '1-17'), ('bullfighter', 'escamillo'), ('3-36', 'ma.bennett'), ('uag', 'tecos'), ('greater-spotted', 'woodpecker

We can see that low delta values favor very rare words, while very high delta values favor very common words. This is likely because, for high delta values, what matters most is whether the bigram count is larger than delta. For a delta of 1000, any bigram with a count below 1000 gets a negative score and so automatically scores lower than any bigram with a count above 1000. In this way, you can think of delta as a threshold on the bigram counts that we care about.

With this in mind, let's look at some statistics of bigram counts in our example:

In [22]:
import pandas as pd

In [23]:
_, bigram_counts_map = calculate_unigram_and_bigram_counts(samples)

In [24]:
keys = [f + " " + s for f,s in bigram_counts_map.keys()]
vals = list(bigram_counts_map.values())
bigram_series = pd.Series(index=keys, data=vals)

In [25]:
bigram_series.describe()

count    1.164472e+06
mean     2.887178e+00
std      3.837628e+01
min      1.000000e+00
25%      1.000000e+00
50%      1.000000e+00
75%      1.000000e+00
max      2.073000e+04
dtype: float64

We can see from our statistics that the majority of bigrams have a frequency of 1:

In [26]:
len(bigram_series[bigram_series == 1]) / len(bigram_series)

0.7545840518277812

Let's do a quick search over quantiles of our series to get a better understanding of how our data is spread out:

In [27]:
[(q, bigram_series.quantile(q)) for q in 1 - np.logspace(-1, -10, 10)]

[(0.9, 3.0),
 (0.99, 28.0),
 (0.999, 179.0),
 (0.9999, 917.0),
 (0.99999, 3934.423200007528),
 (0.999999, 18025.34677637741),
 (0.9999999, 20636.14363762783),
 (0.99999999, 20720.614363668952),
 (0.999999999, 20729.06143634813),
 (0.9999999999, 20729.906143540982)]

For simplicity, we will pick our $\delta$ to be the 0.99 quantile (99th percentile) of bigram counts, which automatically gives below-zero scores to any bigram that occurs less frequently than the top 1 percent of bigrams in our corpus. This may not be the best way to choose $\delta$, but it gives us a corpus-size-agnostic way of doing so.
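
As a rough illustration of picking $\delta$ from a quantile, the sketch below computes a quantile by linear interpolation between sorted values, which is the default interpolation of pandas' `Series.quantile`. The helper `linear_quantile` and the toy counts are hypothetical and only meant to show the idea.

```python
def linear_quantile(values, q):
    """Hypothetical helper: quantile with linear interpolation
    between neighbouring sorted values."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

# Toy bigram counts: mostly rare bigrams plus one very frequent bigram.
toy_counts = [1] * 90 + [3] * 9 + [500]

toy_delta = linear_quantile(toy_counts, 0.99)
share_below = sum(c < toy_delta for c in toy_counts) / len(toy_counts)

print(toy_delta, share_below)
```

Every bigram whose count is below `toy_delta` would receive a negative score, so here only the single frequent bigram keeps a positive one.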

We then need to choose a valid score threshold, above which are bigrams we consider to be valid phrases. Let's take a look at what scores look like on our current dataset.

In [28]:
delta = bigram_series.quantile(.99)
uc, bc = calculate_unigram_and_bigram_counts(samples)
score_map = create_score_map(uc, bc, delta)
score_series = pd.Series(index=list(score_map.keys()), data=list(score_map.values()))

In [29]:
score_series.describe()

count    1.164472e+06
mean    -1.004987e-01
std      1.268294e+00
min     -2.700000e+01
25%     -5.974870e-04
50%     -5.144288e-05
75%     -6.406394e-06
max      3.155680e-03
dtype: float64

As a sanity check, since we set delta to the 0.99 quantile, we should expect about 99% of our scores to be less than 0.

In [30]:
len(score_series[score_series < 0]) / len(score_series)

0.9897103579991618

A naive approach would be to create phrases from bigram scores greater than 0. In this case, we would only be considering the bigram counts, since, in our equation above, dividing by the product of the unigram counts would never change the sign of our score.

Let's take a look at just our positive scores:

In [31]:
positive_score_series = score_series[score_series > 0]

In [32]:
positive_score_series.describe()

count    1.151100e+04
mean     5.906361e-06
std      6.370966e-05
min      2.192717e-10
25%      6.156941e-08
50%      1.915316e-07
75%      6.312957e-07
max      3.155680e-03
dtype: float64

In [33]:
positive_score_series.nlargest(n=20)

(suu, kyi)             0.003156
(dalai, lama)          0.002205
(notre, dame)          0.002037
(fannie, mae)          0.001528
(las, vegas)           0.001422
(sri, lanka)           0.001368
(abu, dhabi)           0.001323
(hong, kong)           0.001292
(alistair, darling)    0.001256
(goldman, sachs)       0.001128
(saudi, arabia)        0.001113
(hamid, karzai)        0.000996
(bin, laden)           0.000912
(nancy, pelosi)        0.000883
(freddie, mac)         0.000798
(merrill, lynch)       0.000777
(carbon, dioxide)      0.000741
(swine, flu)           0.000699
(osama, bin)           0.000655
(quote, profile)       0.000647
dtype: float64

In [34]:
positive_score_series.nsmallest(n=20)

(this, the)      2.192717e-10
(of, not)        2.363976e-10
(the, may)       3.822674e-10
(her, the)       3.854597e-10
(to, at)         4.169509e-10
(this, to)       4.940131e-10
(that, but)      5.120168e-10
(in, by)         5.663933e-10
(that, that)     6.040711e-10
(which, to)      6.446197e-10
('s, on)         7.162093e-10
(them, of)       7.731145e-10
('s, that)       8.095060e-10
(people, the)    8.722658e-10
(to, and)        9.266132e-10
(to, where)      9.463974e-10
(do, and)        1.004539e-09
(you, in)        1.013679e-09
(last, of)       1.059640e-09
(the, 20)        1.063001e-09
dtype: float64

Our low-scoring examples are penalized because they consist of common words, which have high unigram counts.

Among the positive scores, we will keep only the highest-scoring bigrams as phrases. The threshold is chosen empirically (the function below defaults to the 0.999 quantile of all scores, with $\delta$ set to the 0.95 quantile of bigram counts), and more could be done to search for better score thresholding.

We will consolidate the above procedure for combining common phrases into a single algorithm.

*NOTE: We will use a greedy implementation for combining phrases. If for example, we had a triplet of tokens (a,b,c) such that (b,c) had a higher score than (a,b), we will end up combining (a,b) simply because we will see it first as we enumerate our samples*

In [35]:
def merge_bigrams(samples, bigrams):
    """
    Given a list of tokenized samples, create a new list of samples where bigrams
    have been merged.
    
    samples - A list of samples, each sample being a list of tokens.
    bigrams - A set of bigrams to merge.
    """
    new_samples = []

    for sample in samples:
        
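        # An empty sample has nothing to merge. Keep it as-is so the
        # output stays aligned with the input, and skip the sample[-1]
        # access below, which would raise an IndexError.
        if len(sample) == 0:
            print("WARNING: SAMPLE LEN IS 0")
            new_samples.append([])
            continue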

        new_sample = []

        # Keep track if we merge in the previous iteration so we don't
        # merge overlapping phrases: for (a, b, c), if (a, b) was merged
        # we do not want to merge (b, c).
        merged_during_previous_iter = False

        for i in range(len(sample) - 1):
            if merged_during_previous_iter:
                merged_during_previous_iter = False
                continue
            
            current = (sample[i], sample[i+1])
            if current in bigrams:
                new_sample.append(sample[i] + " " + sample[i + 1])
                merged_during_previous_iter = True
            else:
                new_sample.append(sample[i])
                
        # We do not iterate the last element. So if the last pair was not
        # merged, we need to add back the last token.
        if not merged_during_previous_iter:
            new_sample.append(sample[-1])

        new_samples.append(new_sample)
                
    return new_samples
    

In [36]:
# Testing the above merge operation.

result1 = merge_bigrams([['a', 'b', 'c']], bigrams=set())
result2 = merge_bigrams([['a', 'b', 'c']], bigrams={('b', 'c')})
result3 = merge_bigrams([['a', 'b', 'c']], bigrams={('a', 'b'), ('b', 'c')})

(result1, result2, result3)

([['a', 'b', 'c']], [['a', 'b c']], [['a b', 'c']])
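
Repeating the merge is what allows phrases of 3+ words to form: a phrase merged in one pass is a single token in the next pass and can be merged again. The sketch below shows this on a toy sentence with a compact, hypothetical `merge_once` helper (a simplified stand-in written for this example, with hand-picked phrase sets rather than scored ones).

```python
def merge_once(tokens, phrases):
    """Hypothetical compact version of one greedy left-to-right merge pass."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            merged.append(tokens[i] + " " + tokens[i + 1])
            i += 2  # skip the token consumed by the merge
        else:
            merged.append(tokens[i])
            i += 1
    return merged

toy_tokens = ["in", "new", "york", "city"]

# Pass 1 merges a two-word phrase; pass 2 extends it to three words.
first_pass = merge_once(toy_tokens, {("new", "york")})
second_pass = merge_once(first_pass, {("new york", "city")})

print(first_pass)
print(second_pass)
```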

In [37]:
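def combine_common_phrases(samples, bigram_percentile=0.95, score_percentile=0.999):
    """
    Run one phrase-merging pass over the samples.
    
    samples - A list of samples, each sample being a list of tokens.
    bigram_percentile - Quantile of the bigram counts used as delta.
    score_percentile - Quantile of the phrase scores used as the
                       merge threshold.
                       
    Returns the merged samples and the set of bigrams selected as phrases.
    """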
    uc, bc = calculate_unigram_and_bigram_counts(samples)
    
    # Figure out a good delta value using bigram quantile
    bc_series = pd.Series(data=list(bc.values()))
    delta = bc_series.quantile(bigram_percentile)
    
    # Calculate score map and threshold
    score_map = create_score_map(uc, bc, delta)
    score_series = pd.Series(data=list(score_map.values()))
    score_threshold = score_series.quantile(score_percentile).item()
    
    # Find the phrases that have a high-enough score and generate
    # a new set of samples with those phrases merged into a single
    # token.
    phrases = {b for b, s in score_map.items() if s > score_threshold}
    return merge_bigrams(samples, phrases), phrases
   

We will now do 4 full passes over our samples to generate new common phrases.

In [42]:
start_time = time.time()
iter_time = start_time

token_counts_per_iter = []
phrases_per_iter = []

combined_samples = samples
combined_token_freq = token_freq

for i in range(4):
    combined_samples, phrases = combine_common_phrases(combined_samples)
    uc, _ = calculate_unigram_and_bigram_counts(combined_samples)

    combined_token_freq = uc

    token_counts_per_iter.append(len(uc))
    phrases_per_iter.append(phrases)

    prev_iter_time = iter_time
    iter_time = time.time()

    print(f"Iter: {i+1}\tIter Time: {iter_time-prev_iter_time:.2f}s\tTotal Time: {iter_time - start_time:.2f}s\tTokens: {len(uc)}\tNew Phrases: {len(phrases)}")
    
    

Iter: 1	Iter Time: 17.19s	Total Time: 17.19s	Tokens: 164422	New Phrases: 1165
Iter: 2	Iter Time: 16.49s	Total Time: 33.68s	Tokens: 165519	New Phrases: 1173
Iter: 3	Iter Time: 17.38s	Total Time: 51.05s	Tokens: 166654	New Phrases: 1183
Iter: 4	Iter Time: 17.28s	Total Time: 68.33s	Tokens: 167826	New Phrases: 1195


Next, we want to use a special token to marginalize over any words that occur 5 times or fewer. Let's pick a low count token key that is not being used anywhere else in our corpus:

In [43]:
MIN_TOKEN_COUNT = 6
LOW_COUNT_TOKEN = '__LOW_COUNT_TOKEN__'
LOW_COUNT_TOKEN in combined_token_freq

False

In [46]:
low_count_tokens = {t for t,c in combined_token_freq.items() if c < MIN_TOKEN_COUNT}
len(low_count_tokens)

126937

In [49]:
final_token_freq = {}

for token, count in combined_token_freq.items():
    if count < MIN_TOKEN_COUNT:
        prev_count = final_token_freq[LOW_COUNT_TOKEN] if LOW_COUNT_TOKEN in final_token_freq else 0
        final_token_freq[LOW_COUNT_TOKEN] = prev_count + count
    else:
        final_token_freq[token] = count
    

In [51]:
final_samples = []

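# Iterate over the phrase-merged samples so that the merged phrases are
# kept and the tokens match the keys of combined_token_freq.
for sample in combined_samples: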
    final_sample = []

    for t in sample:
        if t in low_count_tokens:
            final_sample.append(LOW_COUNT_TOKEN)
        else:
            final_sample.append(t)
            
    final_samples.append(final_sample)
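
One way to sanity-check this step is to confirm that the replaced samples and the final frequency table describe the same vocabulary. The sketch below does this on made-up data; all `toy_` names and the threshold of 2 are hypothetical choices for the example.

```python
from collections import Counter

TOY_MIN_COUNT = 2
TOY_LOW_TOKEN = "__LOW_COUNT_TOKEN__"

# Toy samples, made up for illustration.
toy_samples = [["the", "cat", "sat"], ["the", "dog", "sat"]]
toy_freq = Counter(t for s in toy_samples for t in s)

# Replace every token seen fewer than TOY_MIN_COUNT times.
toy_final_samples = [
    [t if toy_freq[t] >= TOY_MIN_COUNT else TOY_LOW_TOKEN for t in s]
    for s in toy_samples
]

# Recounting from the replaced samples keeps counts and samples in sync.
toy_final_freq = Counter(t for s in toy_final_samples for t in s)

print(toy_final_samples)
print(dict(toy_final_freq))
```

The total number of tokens is unchanged by the replacement; only the vocabulary shrinks.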


## Encoding Tokens

Before we can feed our data into a neural network, we need to encode our samples using one-hot encoding. First, we need to map every token in our vocabulary to an integer, which we can then use to help us encode the tokens.

In [55]:
# Encodes token to index
encoder = { t:i for i,t in enumerate(final_token_freq.keys()) }

# Decodes index to token
decoder = { i:t for t,i in encoder.items() }


Now that we have an encoder / decoder, we need to use those for encoding / decoding the tokens to / from one-hot vectors.

In [95]:
def create_one_hot_vectors(tokens, encoder):
    """
    Outputs a set of one-hot vectors. The output is a numpy 2D array
    with dimension (len(tokens), V), where V is the size of the vocab.
    Each row is the one-hot vector for the corresponding token in the list.
    
    tokens - A list of string tokens to encode
    encoder - A token to index mapping for each token in the vocab
    """
    vocab_size = len(encoder)
    encoded = np.zeros((len(tokens), vocab_size))
    token_map = [encoder[t] for t in tokens]

    encoded[np.arange(len(tokens)), token_map] = 1.0
    
    return encoded
    

In [96]:
# Test to make sure above code is correct
test_encoder = {'a': 0, 'b': 1, 'c': 2}
create_one_hot_vectors(['a', 'c', 'a', 'b'], test_encoder)

array([[1., 0., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.]])

In [97]:
def create_tokens(encoded, decoder):
    """
    Converts a set of one-hot encoded vectors into a list of their
    corresponding tokens. This outputs a list of string tokens.
    
    encoded - a numpy array of one-hot encodings, where each row
              is a one-hot encoded vector of a token, and each column
              indexes into the vocabulary.
              
    decoder - a mapping from int to string token, used for decoding
              the indices of one-hot vectors.
    """
    
    return [decoder[i] for i in np.argmax(encoded, axis=1)]

In [101]:
# Test to make sure above code is correct.
test_encoder = {'a': 0, 'b': 1, 'c': 2}
test_decoder = {0: 'a', 1: 'b', 2: 'c'}

test_encoding = create_one_hot_vectors(['a', 'c', 'a', 'b'], test_encoder)
create_tokens(test_encoding, test_decoder)

['a', 'c', 'a', 'b']
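
`create_one_hot_vectors` looks tokens up with `encoder[t]`, which raises a `KeyError` for any token outside the vocabulary (for instance, a word that only appears in the held-out corpus). One possible way to handle this, assuming unseen tokens should be treated like low-count tokens, is to fall back to the low-count token's index. The helper `encode_indices` and the toy mappings below are hypothetical.

```python
def encode_indices(tokens, encoder, fallback_token):
    """Hypothetical helper: map tokens to indices, sending any
    out-of-vocabulary token to the fallback token's index."""
    fallback_index = encoder[fallback_token]
    return [encoder.get(t, fallback_index) for t in tokens]

# Toy vocabulary, made up for illustration.
toy_encoder = {"a": 0, "b": 1, "__LOW_COUNT_TOKEN__": 2}
toy_decoder = {i: t for t, i in toy_encoder.items()}

toy_indices = encode_indices(["a", "zzz", "b"], toy_encoder, "__LOW_COUNT_TOKEN__")
toy_decoded = [toy_decoder[i] for i in toy_indices]

print(toy_indices)
print(toy_decoded)
```

The resulting index list could then be turned into one-hot rows in the same way as above.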