# Distributed Representations of Words and Phrases and Their Compositionality

This notebook contains an implementation of the Neural Net described in the paper "Distributed Representations of Words and Phrases and Their Compositionality". A copy of the paper along with a summary are available in this directory. This implementation is done using Pytorch.

## Corpus

The corpus being used for training the neural network is not the same corpus mentioned in the paper (the Google News corpus described in the paper is not publicly available). Instead, there is an alternative corpus being used, which may affect the quality of the word vectors.

## Analogies Datasets

The paper mentions two datasets of analogies they used to evaluate the word embeddings generated by their network. Both of those analogies datasets are publicly available and available in this repository.

## Step 1: Explore the Data

In [1]:
import nltk
from nltk.tokenize import word_tokenize

import os
import torch
import time

nltk.download('punkt')

In [2]:
torch.__version__

'1.3.1'

In [3]:
path_data = '../../data/language-modeling-benchmark-r13/'
path_training_corpus = os.path.join(path_data, 'training-monolingual.tokenized.shuffled')
path_holdout_corpus = os.path.join(path_data, 'heldout-monolingual.tokenized.shuffled')

Here we will be exploring a single file in our training corpus.

In [11]:
path_filename = os.path.join(path_training_corpus, os.listdir(path_training_corpus)[0])

start_time = time.time()

with open(path_filename) as file:
    samples_raw = file.read().split("\n")

total_time = time.time() - start_time

print(f"Total time to load a single file: {total_time:0.2f}s")
print(f"Total samples in the corpus: {len(samples_raw)}")

Total time to load a single file: 0.46s
Total samples in the corpus: 306075


Let's get a sense of the words present in the file:
- How many total words are in the file?
- How many unique words?
- What are the counts for each word?
- What are the most common words?

In [13]:
start_time = time.time()

samples = [word_tokenize(s) for s in samples_raw]

total_time = time.time() - start_time
print(f"Total processing time: {total_time:0.2f}s")

Total processing time: 59.96s


In [16]:
start_time = time.time()

token_freq = {}

for s in samples:
    for t in s:

        if t not in token_freq:
            token_freq[t] = 1
        token_freq[t] = token_freq[t] + 1
        
total_time = time.time() - start_time
print(f"Total processing time: {total_time:0.2f}s")

print(f"Total unique tokens found: {len(token_freq)}")

Total processing time: 2.57s
Total unique tokens found: 186220


In [29]:
def k_most_frequent(freq, k):
    most_freq = []

    # Keep track of the index and count of the least frequent token in the
    # list. This is the token that is the next candidate to be removed if
    # we find a more frequent token.
    token_to_remove_index = None
    token_to_remove_count = None

    for (token, count) in freq.items():
        # If we do not yet have k most frequent tokens, then we can just
        # add this token to the list.
        if len(most_freq) < k:
            most_freq.append(token)
            if token_to_remove_count is None or count < token_to_remove_count:
                token_to_remove_count = count
                token_to_remove_index = len(most_freq) - 1
            continue

        # Check if this word is occurring more frequently than the least
        # frequent token in the list.
        if token_to_remove_count < count:
            most_freq[token_to_remove_index] = token
            
            # Need to search for the least frequent token in the list.
            token_to_remove_count = freq[most_freq[0]]
            token_to_remove_index = 0
            for (i, token) in enumerate(most_freq):
                if freq[token] < token_to_remove_count:
                    token_to_remove_count = freq[token]
                    token_to_remove_index = i
    
    return most_freq


Let's generate a list of the most common words and the number of times they show up in our file.

In [35]:
[(token_freq[t], t) for t in k_most_frequent(token_freq, 20)]

[(45329, 'with'),
 (43543, 'said'),
 (47491, 'was'),
 (65316, 'for'),
 (67363, 'that'),
 (311503, '.'),
 (57695, 'on'),
 (57011, 'is'),
 (354051, ','),
 (53410, 'The'),
 (70058, "'s"),
 (183646, 'to'),
 (90259, '``'),
 (37629, 'as'),
 (159188, 'and'),
 (36604, 'at'),
 (156645, 'a'),
 (175486, 'of'),
 (141074, 'in'),
 (362986, 'the')]

Let's now look for any words that contains non-alphabetical characters.

In [37]:
non_alpha_tokens = []

for token in token_freq.keys():
    found_non_alpha = False

    for c in token.lower():

        if ord(c) < ord('a') or ord(c) > ord('z'):
            non_alpha_tokens.append(token)
            found_non_alpha = True
            continue
        
        if found_non_alpha:
            continue


In [40]:
len(non_alpha_tokens[:20])

20

What are some of the most common tokens with non-alpha numeric characters?

In [46]:
non_alpha_freq = { t:token_freq[t] for t in non_alpha_tokens }

[(t, non_alpha_freq[t]) for t in k_most_frequent(non_alpha_freq, 30)]

[('.', 311503),
 (',', 354051),
 ('(', 22524),
 ('U.S.', 6320),
 ('30', 2233),
 ('20', 2259),
 ("'t", 11913),
 ("'s", 70058),
 ('$', 12917),
 ('2', 2134),
 ('``', 90259),
 (':', 13597),
 ('2008', 2608),
 ("'", 9576),
 ('Mr.', 3611),
 ('&', 2195),
 ('£', 4844),
 ('--', 22635),
 (')', 22698),
 ('1', 2704),
 ('%', 3194),
 ('...', 2986),
 ('2009', 2541),
 ("'re", 2575),
 ('?', 8331),
 ('10', 3375),
 ('!', 2454),
 ('-', 10053),
 (';', 5989),
 ('/', 5138)]

We can see that this list contains puncuation, years (2006, 2007), word abbreviations (Mr. and U.S.), ending of contractions ('t and 've), and common numbers (2, 10).