# Distributed Representations of Words and Phrases and Their Compositionality

This notebook contains an implementation of the neural network described in the paper "Distributed Representations of Words and Phrases and Their Compositionality". A copy of the paper and a summary are available in this directory. The implementation uses PyTorch.

## Corpus

The corpus used to train the network is not the one used in the paper, since the Google News corpus described there is not publicly available. An alternative corpus is used instead, which may affect the quality of the word vectors.

## Analogies Datasets

The paper mentions two analogy datasets used to evaluate the word embeddings generated by the network. Both datasets are publicly available and are included in this repository.

## Step 1: Explore the Data

In [1]:
import nltk
from nltk.tokenize import word_tokenize

import os
import torch
import time

nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/brendanmcnamara/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
torch.__version__

'1.3.1'

In [3]:
path_data = '../../data/language-modeling-benchmark-r13/'
path_training_corpus = os.path.join(path_data, 'training-monolingual.tokenized.shuffled')
path_holdout_corpus = os.path.join(path_data, 'heldout-monolingual.tokenized.shuffled')

Here we will be exploring a single file in our training corpus.

In [4]:
path_filename = os.path.join(path_training_corpus, os.listdir(path_training_corpus)[0])

start_time = time.time()

with open(path_filename) as file:
    samples_raw = file.read().split("\n")

total_time = time.time() - start_time

print(f"Total time to load a single file: {total_time:0.2f}s")
print(f"Total samples in the corpus: {len(samples_raw)}")

Total time to load a single file: 0.52s
Total samples in the corpus: 306075


Let's get a sense of the words present in the file:
- How many total words are in the file?
- How many unique words?
- What are the counts for each word?
- What are the most common words?

### Tokenizing Corpus

We will use *nltk* to tokenize our corpus. As additional pre-processing steps, we will lowercase all text and remove any tokens that contain only punctuation characters. These steps may not be necessary with such a large corpus, but for this example, I do them to reduce our vocabulary size and our overall training time.

In [5]:
from string import punctuation

punc_list = [p for p in punctuation]

def is_punctuation_token(token):
    """
    A token is a punctuation token if the characters consist of
    only punctuation characters
    """
    return len(token) == len([c for c in token if c in punc_list])


In [6]:
# Quick Test of Above Functions

test1 = is_punctuation_token('.') == True
test2 = is_punctuation_token('?)()') == True
test3 = is_punctuation_token('...1') == False
test4 = is_punctuation_token('abc') == False

(test1, test2, test3, test4)

(True, True, True, True)

In [7]:
start_time = time.time()

samples = []

for s in samples_raw:
    tokens = [t for t in word_tokenize(s.lower()) if not is_punctuation_token(t)]
    samples.append(tokens)


total_time = time.time() - start_time
print(f"Total processing time: {total_time:0.2f}s")

Total processing time: 84.18s


In [8]:
start_time = time.time()

token_freq = {}

for s in samples:
    for t in s:

        if t not in token_freq:
            token_freq[t] = 0
        token_freq[t] = token_freq[t] + 1
        
total_time = time.time() - start_time
print(f"Total processing time: {total_time:0.2f}s")

print(f"Total unique tokens found: {len(token_freq)}")

Total processing time: 2.37s
Total unique tokens found: 163351
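The frequency loop above can also be written with the standard library's `collections.Counter`. The following is a minimal sketch on a toy input; `toy_samples` and `toy_token_freq` are hypothetical names used only for illustration and are not part of the notebook.

```python
from collections import Counter
from itertools import chain

# Hypothetical toy samples standing in for the tokenized corpus.
toy_samples = [["the", "cat", "sat"], ["the", "dog", "sat", "down"]]

# Counter consumes a flat stream of tokens and tallies each one,
# which is the same bookkeeping as the explicit dictionary loop.
toy_token_freq = Counter(chain.from_iterable(toy_samples))

print(f"Total unique tokens found: {len(toy_token_freq)}")
print(toy_token_freq["the"])
```

A `Counter` is a `dict` subclass, so code that indexes it like `token_freq[t]` should keep working.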


In [9]:
def k_most_frequent(freq, k):
    result = []

    # Keep track of the index and count of the least frequent token in the
    # list. This is the token that is the next candidate to be removed if
    # we find a more frequent token.
    token_to_remove_index = None
    token_to_remove_count = None

    # We will allow the user to enter negative numbers, indicating they
    # want the least frequent words instead of the most frequent words.
    should_append = lambda c: token_to_remove_count < c if k > 0 else token_to_remove_count > c

    k_positive = k if k > 0 else -k

    for (j, (token, count)) in enumerate(freq.items()):
        # If we do not yet have k most frequent tokens, then we can just
        # add this token to the list.
        if len(result) < k_positive:
            result.append(token)
            # "not should_append(c)" indicates that the current token should
            # be the next thing to remove when a better word comes up.
            if token_to_remove_count is None or not should_append(count):
                token_to_remove_count = count
                token_to_remove_index = len(result) - 1
            continue

        # Check if this word is occurring more frequently than the least
        # frequent token in the list.
        if should_append(count):
            result[token_to_remove_index] = token
            
            # Need to search for the next token in the list to remove.
            token_to_remove_count = freq[result[0]]
            token_to_remove_index = 0
            for (i, token) in enumerate(result):
                # "not should_append(c)" indicates that the current token should
                # be the next thing to remove when a better word comes up.
                if not should_append(freq[token]):
                    token_to_remove_count = freq[token]
                    token_to_remove_index = i
    
    return result
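The function above rescans its result list on every replacement. As an alternative sketch, the standard library's `heapq.nlargest` and `heapq.nsmallest` can select the same kind of top-k set. The name `k_extreme_tokens` and the toy dictionary are assumptions for illustration. Unlike `k_most_frequent`, this version returns the keys in sorted order.

```python
import heapq

def k_extreme_tokens(freq, k):
    """
    Return |k| keys of freq: those with the largest values if k > 0,
    or those with the smallest values if k < 0.
    """
    if k > 0:
        return heapq.nlargest(k, freq, key=freq.get)
    return heapq.nsmallest(-k, freq, key=freq.get)

# Hypothetical toy counts.
toy_freq = {"the": 10, "of": 7, "cat": 2, "zebra": 1}

print(k_extreme_tokens(toy_freq, 2))
print(k_extreme_tokens(toy_freq, -2))
```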


Let's generate a list of the most common words and the number of times they show up in our file.

In [11]:
[(t, token_freq[t]) for t in k_most_frequent(token_freq, 20)]

[('he', 39685),
 ('was', 47607),
 ('at', 38909),
 ('as', 40322),
 ('that', 70472),
 ('said', 43592),
 ('is', 57582),
 ('for', 67865),
 ('the', 416622),
 ('from', 33096),
 ('of', 176198),
 ('and', 163408),
 ('it', 46179),
 ('on', 59739),
 ('with', 46932),
 ('a', 166195),
 ('by', 35149),
 ('in', 150692),
 ('to', 184921),
 ("'s", 70117)]

This list mostly contains pronouns, articles, prepositions, and common verbs, along with the possessive ending 's.

And also a list of the least frequent words:

In [12]:
[(t, token_freq[t]) for t in k_most_frequent(token_freq, -20)]

[('demsky', 1),
 ('cocoon-like', 1),
 ('5,214', 1),
 ('permasteelisa', 1),
 ('fratelli', 1),
 ('volvos', 1),
 ('seidel', 1),
 ('63,291', 1),
 ('sraya', 1),
 ('germanys', 1),
 ('computer-aided', 1),
 ('amitav', 1),
 ('3-43', 1),
 ('moreh', 1),
 ('2-38', 1),
 ('d10', 1),
 ('spinboldak', 1),
 ('checkpoint-friendly', 1),
 ('yamal-europe', 1),
 ('s10', 1)]

We can see that this list includes long numbers (63,291), names of people (Moreh), and joined words (computer-aided, checkpoint-friendly).

Let's now look for any tokens that contain non-alphabetical characters.

In [13]:
non_alpha_tokens = []

for token in token_freq.keys():
    found_non_alpha = False

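    for c in token:
        # Anything outside the lowercase ASCII range a-z counts as
        # non-alphabetical. One such character is enough, so stop early.
        if ord(c) < ord('a') or ord(c) > ord('z'):
            found_non_alpha = True
            break

    if found_non_alpha:
        non_alpha_tokens.append(token)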


In [14]:
ratio = len(non_alpha_tokens) / float(len(token_freq))
print(f"{ratio*100:0.2f}% of our tokens have characters that are non-alphabetical")

29.67% of our tokens have characters that are non-alphabetical
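The character-range check used above can also be expressed as a regular expression. This is a sketch under the same assumption as the loop, namely that only the lowercase ASCII letters a-z count as alphabetical, so accented letters are flagged too. `ALPHA_ONLY`, `has_non_alpha`, and `toy_tokens` are hypothetical names.

```python
import re

# Matches tokens made up only of the lowercase ASCII letters a-z.
# The * quantifier lets an empty token pass, as the loop above does.
ALPHA_ONLY = re.compile(r"[a-z]*")

def has_non_alpha(token):
    """Return True if the token has any character outside a-z."""
    return ALPHA_ONLY.fullmatch(token) is None

# Hypothetical toy tokens, some taken from the lists above.
toy_tokens = ["said", "u.s.", "2008", "'s", "valéry"]

print([t for t in toy_tokens if has_non_alpha(t)])
```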


What are some of the most common tokens with non-alphabetical characters?

In [15]:
non_alpha_freq = { t:token_freq[t] for t in non_alpha_tokens }

[(t, non_alpha_freq[t]) for t in k_most_frequent(non_alpha_freq, 20)]

[('15', 1770),
 ("'re", 2576),
 ('£', 4843),
 ('2008', 2607),
 ("'t", 11929),
 ("'s", 70117),
 ('2', 2133),
 ('u.s.', 6516),
 ('mr.', 3666),
 ('10', 3374),
 ('3', 1707),
 ('1', 2703),
 ('12', 1825),
 ('2007', 1983),
 ('11', 1800),
 ('2009', 2540),
 ('30', 2232),
 ('20', 2258),
 ('5', 1672),
 ("'ve", 1933)]

We can see that this list contains years (2007, 2008, 2009), abbreviations (mr. and u.s.), contraction endings ('t and 've), and common numbers (2, 10). Some of these may add noise to the data, but that is acceptable with such a large corpus. To get something basic working, I'll leave most of these tokens in the vocabulary. We will do a bit of additional processing of the text before we start working on the neural network:

- Collapse all tokens that occur 5 or fewer times into a single token
- Find common phrases in the text and treat them as their own tokens

Both of these steps were mentioned in the paper. As a side note, we will group our corpus into phrases before we remove low-frequency words, since some of those low-frequency words may end up grouped into phrases.
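The first step, collapsing low-frequency tokens, could look roughly like the sketch below. The placeholder string `<unk>`, the function name `collapse_rare_tokens`, and the parameter `max_rare_count` are assumptions made for illustration; neither the paper nor this notebook prescribes them.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical placeholder token

def collapse_rare_tokens(samples, max_rare_count=5):
    """
    Replace every token that occurs max_rare_count times or fewer
    (counted over all samples) with a single placeholder token.
    """
    counts = Counter(t for s in samples for t in s)
    return [
        [t if counts[t] > max_rare_count else UNK for t in s]
        for s in samples
    ]

# Toy input with a threshold of 1 so the effect is visible.
toy_samples = [["a", "b", "a"], ["a", "c"]]
print(collapse_rare_tokens(toy_samples, max_rare_count=1))
```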

### Creating Common Phrases

We will use the formula from the paper:

$$score(w_i, w_j) = \frac {count(w_i, w_j) - \delta} {count(w_i) \times count(w_j)}$$

The score between any 2 tokens is computed based on their bigram and unigram counts. We've already calculated the unigram counts of every token when we collected their frequencies. We will also need to compute the bigram counts of every pair of tokens in our vocabulary.

The $\delta$ used is a discounting coefficient to prevent too many phrases consisting of infrequent words from forming.

Note that $\delta$ is a coefficient that will be dependent on the corpus size (not normalized), so a value used on some text examples may not extend well to the general training corpus.
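To make the dependence on corpus size concrete, here is a short worked derivation under a simplifying assumption: the corpus grows by a factor $n$ and every unigram and bigram count grows by that same factor. The score on the larger corpus, written $score_n$ here for illustration, becomes:

$$score_n(w_i, w_j) = \frac {n \cdot count(w_i, w_j) - \delta} {n \cdot count(w_i) \times n \cdot count(w_j)} = \frac {1} {n} \cdot \frac {count(w_i, w_j) - \delta / n} {count(w_i) \times count(w_j)}$$

Under this assumption a fixed $\delta$ behaves like a discount of $\delta / n$ on the original counts, so it penalizes rare pairs less as the corpus grows. The overall score also shrinks by a factor of $1/n$, which suggests that a score threshold tuned on a small sample would need rescaling as well.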

We also need to choose a score threshold above which 2 tokens are merged into a phrase. After a few passes over the entire corpus, we may end up with phrases of 3+ words. Note that after each pass over the corpus, we will need to recompute our unigram and bigram counts since the set of tokens changes.

We will also keep a mapping of which tokens are combined to process our samples after the phrases have been combined.
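A single merging pass might look like the sketch below. This is only one possible reading of the procedure described above: the function name `merge_phrases`, the greedy left-to-right strategy, and the `_` separator are all assumptions, and the toy scores are made up.

```python
def merge_phrases(samples, score_map, threshold, sep="_"):
    """
    One pass over the samples: join adjacent tokens whose score
    exceeds the threshold, scanning greedily from left to right.

    Returns the merged samples and a mapping from each merged
    pair to the new token.
    """
    merged_samples = []
    merges = {}
    for s in samples:
        out = []
        i = 0
        while i < len(s):
            pair = (s[i], s[i + 1]) if i + 1 < len(s) else None
            if pair is not None and score_map.get(pair, 0) > threshold:
                token = sep.join(pair)
                merges[pair] = token
                out.append(token)
                i += 2  # both tokens were consumed by the phrase
            else:
                out.append(s[i])
                i += 1
        merged_samples.append(out)
    return merged_samples, merges

# Hypothetical toy samples and scores.
toy_samples = [["i", "love", "new", "york"], ["new", "york", "city"]]
toy_scores = {("new", "york"): 0.5, ("york", "city"): 0.1}

merged, merges = merge_phrases(toy_samples, toy_scores, threshold=0.2)
print(merged)
print(merges)
```

Running the pass again on `merged`, after recomputing the counts and scores, is what would allow a token like `new_york` to join with `city` into a 3-word phrase.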

In [16]:
def calculate_unigram_and_bigram_counts(samples):
    """
    Calculate the unigram and bigram counts of our samples.
    
    samples - A list of lists of tokens. The outer list holds the
              samples; each sample is a list of tokens.
    """

    unigram_counts = {}
    bigram_counts = {}

    for s in samples:
        # Unigram count calculation
        for i in range(len(s)):
            unigram = s[i]
            if unigram not in unigram_counts:
                unigram_counts[unigram] = 0
            unigram_counts[unigram] = unigram_counts[unigram] + 1
               
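        # Bigram count calculation. The step of 2 pairs tokens as
        # (0, 1), (2, 3), ..., so pairs starting at odd positions are skipped.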
        for i in range(0, len(s), 2):
            # Quit if we do not have enough tokens to form
            # more bigrams.
            if i + 1 >= len(s):
                break
                
            bigram = (s[i], s[i+1])

            if bigram not in bigram_counts:
                bigram_counts[bigram] = 0
            bigram_counts[bigram] = bigram_counts[bigram] + 1
            
    return (unigram_counts, bigram_counts)
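The bigram loop in the function above advances two tokens at a time, so a pair that starts at an odd position is never counted. If every adjacent pair should contribute to $count(w_i, w_j)$, a sliding window is one option. The sketch below is an alternative and not the version that produced the outputs recorded in this notebook; `count_adjacent_bigrams` is a hypothetical name.

```python
from collections import Counter

def count_adjacent_bigrams(samples):
    """Count every pair of neighbouring tokens within each sample."""
    bigram_counts = Counter()
    for s in samples:
        # zip(s, s[1:]) yields (s[0], s[1]), (s[1], s[2]), ...
        bigram_counts.update(zip(s, s[1:]))
    return bigram_counts

# In this toy sample ("new", "york") starts at index 1, so a
# step-of-2 loop would skip it.
toy_samples = [["a", "new", "york", "taxi"]]
toy_bigrams = count_adjacent_bigrams(toy_samples)

print(toy_bigrams[("new", "york")])
print(len(toy_bigrams))
```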

In [17]:
def phrase_score(t1, t2, unigram_count, bigram_count, delta):
    """
    Calculate the score for phrase combining.
    
    unigram_count - A dictionary mapping tokens in our vocabulary
                    to their counts.
                    
    bigram_count - A dictionary mapping pairs of tokens in our vocab
                   to their counts.
                   
    delta - a coefficient used to adjust the score. Higher delta means we
            discount infrequent words.
    """
    
    # Double check that we are not dividing by 0. This should never
    # happen in theory because if a token has 0 unigram count, it
    # should not be in our vocab.
    if t1 not in unigram_count or t2 not in unigram_count:
        return 0
    
    t1u = unigram_count[t1]
    t2u = unigram_count[t2]

    if t1u == 0 or t2u == 0:
        return 0
    
    # A pair that never occurs together has a bigram count of 0.
    b = bigram_count.get((t1, t2), 0)
    
    return (b - delta) / (t1u * t2u)
    

In [18]:
def create_score_map(samples, delta):
    """
    Takes a list of samples and computes a mapping of bigram phrase scoring.
    
    samples - A list of list of tokens.
    """

    (unigram_count, bigram_count) = calculate_unigram_and_bigram_counts(samples)
    
    return { (t1, t2): phrase_score(t1, t2, unigram_count, bigram_count, delta) for (t1, t2) in bigram_count.keys() }


In [19]:
start_time = time.time()

score_map = create_score_map(samples, delta=0)

total_time = time.time() - start_time

print(f"This took {total_time:0.2f}s to run")

[(b, score_map[b]) for b in k_most_frequent(score_map, 10)]

This took 7.61s to run


[(('farouq', 'al-qadoumi'), 1.0),
 (('uag', 'tecos'), 1.0),
 (('dhia', 'al-kawaz'), 1.0),
 (('01456', '486358'), 1.0),
 (('akreos', 'sofport'), 1.0),
 (('jennet', 'mallow'), 1.0),
 (('total-immersion', 'how-to-be-a-jack0ff-idiot'), 1.0),
 (('carloforto', 'isola'), 1.0),
 (('danuta', 'budorina'), 1.0),
 (('asawat', 'al-iraq'), 1.0)]

With a delta of 0, the top scores all go to pairs of tokens that each occur only once in the text: the bigram count and both unigram counts are 1, so the score is 1.0. We will change the delta value to penalize low-frequency words.

Let's try a few different delta values to see which gives better results.

In [21]:
import numpy as np

In [30]:
start_time = time.time()

deltas = np.logspace(-1, 3, 20)

freqs = []

for (i, delta) in enumerate(deltas):
    score_map = create_score_map(samples, delta)
    freq = k_most_frequent(score_map, 40)
    freqs.append([(b, score_map[b]) for b in freq])
    print(f"{i+1:02} / {len(deltas)} completed")

total_time = time.time() - start_time
print(f"This took {total_time/60:0.2f}m")

1 / 20 completed
2 / 20 completed
3 / 20 completed
4 / 20 completed
5 / 20 completed
6 / 20 completed
7 / 20 completed
8 / 20 completed
9 / 20 completed
10 / 20 completed
11 / 20 completed
12 / 20 completed
13 / 20 completed
14 / 20 completed
15 / 20 completed
16 / 20 completed
17 / 20 completed
18 / 20 completed
19 / 20 completed
20 / 20 completed
This took 2.77m


In [46]:
for i in range(len(deltas)):
    print(f"Delta: {deltas[i]}")
    print([b for b, _ in freqs[i]])
    print("\n\n")
    

Delta: 0.1
[('nickname-less', 'whites-only'), ('01456', '486358'), ('anarchism', 'co-exists'), ('poddala', 'jayantha'), ('landrum', 'papered'), ('3888', 'ecocentric.co.uk'), ('dorks', 'jocks'), ('jennet', 'mallow'), ('vujic', 'valjevo'), ('danuta', 'budorina'), ('nicolay', 'bogachev'), ('sural', 'ncv'), ('dhia', 'al-kawaz'), ('valéry', 'lucentini'), ('carloforto', 'isola'), ('4777', 'www.globalgardens.co.uk'), ('wamidh', 'nadhmi'), ('shantanu', 'narayan'), ('total-immersion', 'how-to-be-a-jack0ff-idiot'), ('rías', 'baixas'), ('farouq', 'al-qadoumi'), ('chelyabinsk', 'nizhny'), ('mandu', 'magandi'), ('0532', '793888'), ('162.42', '0.6963'), ('career-long', '57-yarder'), ('aramide', 'olaniyan'), ('violito', 'payla'), ('natasharichardson', 'lindsaylohan'), ('sbcglobal.net', 'www.penfieldhouse.com'), ('rais', 'yatim'), ('mylonitic', 'graphitic'), ('galp', 'energia'), ('t.mccoy', '1-17'), ('bullfighter', 'escamillo'), ('3-36', 'ma.bennett'), ('uag', 'tecos'), ('greater-spotted', 'woodpecker

We can see that low delta values favor very rare words, while very high delta values favor very common words. For high delta values, what matters most is whether the bigram count exceeds delta: with a delta of 1000, any bigram with a count below 1000 gets a negative score and so ranks below every bigram with a count above 1000. In this way, you can think of delta as a threshold on the bigram counts that we care about.
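The effect of delta on the ranking can be checked on a few made-up counts. In the sketch below, the three pairs and their counts are hypothetical and chosen only to show how the top-scoring pair changes with delta. `toy_phrase_score` restates the scoring formula so that the block runs on its own.

```python
def toy_phrase_score(bigram, count_1, count_2, delta):
    """Same formula as the phrase score: (bigram - delta) / (c1 * c2)."""
    return (bigram - delta) / (count_1 * count_2)

# Hypothetical counts: (bigram count, first token count, second token count)
toy_pairs = {
    "rare pair": (1, 1, 1),
    "mid pair": (500, 600, 550),
    "common pair": (20000, 400000, 170000),
}

best_by_delta = {}
for delta in (0, 5, 1000):
    scores = {
        name: toy_phrase_score(b, c1, c2, delta)
        for name, (b, c1, c2) in toy_pairs.items()
    }
    # Any pair whose bigram count is below delta has a negative score.
    best_by_delta[delta] = max(scores, key=scores.get)

print(best_by_delta)
```

With these counts the rare pair ranks first at a delta of 0, the mid pair at 5, and the common pair at 1000, where it is the only pair left with a positive score.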

With this in mind, let's look at some statistics of bigram counts in our example:

In [48]:
import pandas as pd

In [47]:
_, bigram_counts_map = calculate_unigram_and_bigram_counts(samples)

In [69]:
keys = [f + " " + s for f,s in bigram_counts_map.keys()]
vals = list(bigram_counts_map.values())
bigram_series = pd.Series(index=keys, data=vals)

In [72]:
bigram_series.describe()

count    1.164472e+06
mean     2.887178e+00
std      3.837628e+01
min      1.000000e+00
25%      1.000000e+00
50%      1.000000e+00
75%      1.000000e+00
max      2.073000e+04
dtype: float64

We can see from these statistics that most bigrams have a frequency of 1:

In [76]:
len(bigram_series[bigram_series == 1]) / len(bigram_series)

0.7545840518277812

Let's do a quick search over quantiles of our series to get a better understanding of how our data is spread out:

In [88]:
[(q, bigram_series.quantile(q)) for q in 1 - np.logspace(-1, -10, 10)]

[(0.9, 3.0),
 (0.99, 28.0),
 (0.999, 179.0),
 (0.9999, 917.0),
 (0.99999, 3934.423200007528),
 (0.999999, 18025.34677637741),
 (0.9999999, 20636.14363762783),
 (0.99999999, 20720.614363668952),
 (0.999999999, 20729.06143634813),
 (0.9999999999, 20729.906143540982)]