# Distributed Representations of Words and Phrases and Their Compositionality

This notebook contains an implementation of the Neural Net described in the paper "Distributed Representations of Words and Phrases and Their Compositionality". A copy of the paper along with a summary are available in this directory. This implementation is done using Pytorch.

## Corpus

The corpus being used for training the neural network is not the same corpus mentioned in the paper (the Google News corpus described in the paper is not publicly available). Instead, there is an alternative corpus being used, which may affect the quality of the word vectors.

## Analogies Datasets

The paper mentions two datasets of analogies they used to evaluate the word embeddings generated by their network. Both of those analogies datasets are publicly available and available in this repository.

## Step 1: Explore the Data

In [3]:
import nltk
from nltk.tokenize import word_tokenize

import os
import torch
import time

nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/brendanmcnamara/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [4]:
torch.__version__

'1.3.1'

In [5]:
path_data = '../../data/language-modeling-benchmark-r13/'
path_training_corpus = os.path.join(path_data, 'training-monolingual.tokenized.shuffled')
path_holdout_corpus = os.path.join(path_data, 'heldout-monolingual.tokenized.shuffled')

Here we will be exploring a single file in our training corpus.

In [6]:
path_filename = os.path.join(path_training_corpus, os.listdir(path_training_corpus)[0])

start_time = time.time()

with open(path_filename) as file:
    samples_raw = file.read().split("\n")

total_time = time.time() - start_time

print(f"Total time to load a single file: {total_time:0.2f}s")
print(f"Total samples in the corpus: {len(samples_raw)}")

Total time to load a single file: 0.47s
Total samples in the corpus: 306075


Let's get a sense of the words present in the file:
- How many total words are in the file?
- How many unique words?
- What are the counts for each word?
- What are the most common words?

### Tokenizing Corpus

We will use *nltk* to tokenize our corpus. As an additional pre-processing step, we will also force all our text to be lowercase. We will also remove any tokens that contains only punctuation characters. These steps may not be necessary with such a large corpus, but for this example, I do it to reduce our vocabulary size and our overall training time.

In [52]:
def is_punctuation_token(token):
    pass # TODO HERE I AM

In [14]:
start_time = time.time()

samples = [word_tokenize(s.lower()) for s in samples_raw]

total_time = time.time() - start_time
print(f"Total processing time: {total_time:0.2f}s")

Total processing time: 59.51s


In [15]:
start_time = time.time()

token_freq = {}

for s in samples:
    for t in s:

        if t not in token_freq:
            token_freq[t] = 1
        token_freq[t] = token_freq[t] + 1
        
total_time = time.time() - start_time
print(f"Total processing time: {total_time:0.2f}s")

print(f"Total unique tokens found: {len(token_freq)}")

Total processing time: 2.87s
Total unique tokens found: 163385


In [43]:
def k_most_frequent(freq, k):
    result = []

    # Keep track of the index and count of the least frequent token in the
    # list. This is the token that is the next candidate to be removed if
    # we find a more frequent token.
    token_to_remove_index = None
    token_to_remove_count = None

    # We will allow the user to enter negative numbers, indicating they
    # want the least frequent words instead of the most frequent words.
    should_append = lambda c: token_to_remove_count < c if k > 0 else token_to_remove_count > c

    k_positive = k if k > 0 else -k

    for (j, (token, count)) in enumerate(freq.items()):
        # If we do not yet have k most frequent tokens, then we can just
        # add this token to the list.
        if len(result) < k_positive:
            result.append(token)
            # "not should_append(c)" indicates that the current token should
            # be the next thing to remove when a better word comes up.
            if token_to_remove_count is None or not should_append(count):
                token_to_remove_count = count
                token_to_remove_index = len(result) - 1
            continue

        # Check if this word is occurring more frequently than the least
        # frequent token in the list.
        if should_append(count):
            result[token_to_remove_index] = token
            
            # Need to search for the next token in the list to remove.
            token_to_remove_count = freq[result[0]]
            token_to_remove_index = 0
            for (i, token) in enumerate(result):
                # "not should_append(c)" indicates that the current token should
                # be the next thing to remove when a better word comes up.
                if not should_append(freq[token]):
                    token_to_remove_count = freq[token]
                    token_to_remove_index = i
    
    return result


Let's generate a list of the most common words and the number of times they show up in our file.

In [44]:
[(t, token_freq[t]) for t in k_most_frequent(token_freq, 20)]

[('as', 40323),
 ("'s", 70118),
 ('it', 46180),
 ('was', 47608),
 ('that', 70473),
 ('.', 310971),
 ('said', 43593),
 ('to', 184922),
 ('for', 67866),
 ('the', 416623),
 ('is', 57583),
 ('he', 39686),
 ('``', 90259),
 ('with', 46933),
 ('and', 163409),
 (',', 354051),
 ('a', 166196),
 ('of', 176199),
 ('in', 150693),
 ('on', 59740)]

This list mostly contains pronouns, articles, and prepositions, along with some punctuation.

And also a list of the least frequent words:

In [45]:
[(t, token_freq[t]) for t in k_most_frequent(token_freq, -20)]

[('demsky', 2),
 ('cocoon-like', 2),
 ('5,214', 2),
 ('permasteelisa', 2),
 ('fratelli', 2),
 ('sraya', 2),
 ('volvos', 2),
 ('seidel', 2),
 ('63,291', 2),
 ('computer-aided', 2),
 ('germanys', 2),
 ('d10', 2),
 ('amitav', 2),
 ('3-43', 2),
 ('moreh', 2),
 ('2-38', 2),
 ('s10', 2),
 ('spinboldak', 2),
 ('checkpoint-friendly', 2),
 ('yamal-europe', 2)]

We can see that this list includes long numbers (63,291), names of people (Moreh), and joined words (computer-aided, checkpoint-friendly).

Let's now look for any words that contains non-alphabetical characters.

In [46]:
non_alpha_tokens = []

for token in token_freq.keys():
    found_non_alpha = False

    for c in token:

        if ord(c) < ord('a') or ord(c) > ord('z'):
            found_non_alpha = True
            continue
        
    if found_non_alpha:
        non_alpha_tokens.append(token)
        continue


In [47]:
ratio = len(non_alpha_tokens) / float(len(token_freq))
print(f"{ratio*100:0.2f}% of our tokens have characters that are non-alphabetical")

29.69% of our tokens have characters that are non-alphabetical


What are some of the most common tokens with non-alpha numeric characters?

In [49]:
non_alpha_freq = { t:token_freq[t] for t in non_alpha_tokens }

[(t, non_alpha_freq[t]) for t in k_most_frequent(non_alpha_freq, 20)]

[('.', 310971),
 (',', 354051),
 ('?', 8331),
 (';', 5989),
 ('%', 3194),
 ('-', 10053),
 ("'t", 11930),
 ("'s", 70118),
 ('$', 12917),
 ('u.s.', 6517),
 ('``', 90259),
 (':', 13597),
 ('(', 22524),
 ("'", 9576),
 ('mr.', 3667),
 ('/', 5138),
 ('£', 4844),
 ('--', 22635),
 (')', 22698),
 ('10', 3375)]

We can see that this list contains puncuation, years (2006, 2007), word abbreviations (Mr. and U.S.), ending of contractions ('t and 've), and common numbers (2, 10). While some of these may lead to extra noise in the data, that is fine since we have such a large corpus. To get something basic working, I'll leave most of these tokens in the vocabulary. We will do a bit of cleanup on the following:

- Remove any tokens that are entirely made of punctuation
- Marginalize all tokens that occur with a frequency of <= 5 into a single token (this was mentioned as a pre-processing step in the paper).

In [50]:
from string import punctuation

In [51]:
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'