# Distributed Representations of Words and Phrases and Their Compositionality

This notebook contains an implementation of the neural network described in the paper "Distributed Representations of Words and Phrases and Their Compositionality". A copy of the paper, along with a summary, is available in this directory. The implementation uses PyTorch.

Note that the purpose of this notebook is to put together a final implementation of the data fetching, pre-processing, training, and evaluation described in the paper. There is little discussion of the choices made for hyper-parameters, configurations, and other implementation details. See *torch_exmplore* for an exploration of some of these implementation details.

## Step 1: Data Fetching and Pre-Processing

1. Convert samples into tokens
2. Remove punctuation-only tokens
3. Sub-sample tokens in the vocabulary
4. Marginalize all tokens occurring fewer than 5 times in the corpus
5. Learn phrases in the vocabulary
6. Create training examples for negative sampling (input word + set of skip-gram output words)

In [1]:
import nltk
from nltk.tokenize import word_tokenize

import numpy as np
import pandas as pd

import os
import torch
import time

nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/brendanmcnamara/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
torch.__version__

'1.3.1'

In [3]:
PATH_DATA = '../../data/language-modeling-benchmark-r13/'
PATH_TRAINING_CORPUS = os.path.join(PATH_DATA, 'training-monolingual.tokenized.shuffled')
PATH_HOLDOUT_CORPUS = os.path.join(PATH_DATA, 'heldout-monolingual.tokenized.shuffled')

CONTEXT_SIZE = 5
CORPUS_FILE_COUNT = 10

LOW_COUNT_TOKEN = '__LOW_COUNT_TOKEN__'

### Defining Helper Functions

Helper functions for reading sample files from disk

In [4]:
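def get_sample_filenames():
    """
    Get the paths of the files to be sampled from.
    """
    # os.path.join cannot take a list, so join each file name separately.
    filenames = os.listdir(PATH_TRAINING_CORPUS)[:CORPUS_FILE_COUNT]
    return [os.path.join(PATH_TRAINING_CORPUS, f) for f in filenames]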


def get_raw_samples(filepaths):
    """
    Fetch all raw samples from a particular list of file paths. This
    will pull the data from the files and tokenize the samples. No
    pre-processing is done here.
    
    filepaths - A list of file paths to fetch the raw samples from.
    """
    raw_samples = []

    for filepath in filepaths:
        with open(filepath) as file:
            data = file.read().split("\n")

            for sentence in data:
                # word_tokenize already returns a list of tokens.
                raw_samples.append(word_tokenize(sentence.lower()))

    return raw_samples



Helper functions for calculating unigram and bigram counts of the corpus.

In [5]:
def create_unigram_counts(samples):
    """
    Given a set of samples, generate a set of unigram counts
    within those samples.
    """
    uc = {}
    for sample in samples:
        for token in sample:
            if token not in uc:
                uc[token] = 0
            uc[token] = uc[token] + 1
    
    return uc
   

def create_unigram_and_bigram_counts(samples):
    """
    Given a set of samples, generate a set of unigram and
    bigram counts within those samples.
    """
    uc = {}
    bc = {}
    for sample in samples:
        for i in range(len(sample)):
            token = sample[i]
            if token not in uc:
                uc[token] = 0
            uc[token] = uc[token] + 1
            
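        # Count every adjacent pair of tokens, including overlapping pairs.
        # Stepping by 2 would miss every pair that starts at an odd index.
        for i in range(len(sample) - 1):
            bigram = (sample[i], sample[i+1])
            if bigram not in bc:
                bc[bigram] = 0
            bc[bigram] = bc[bigram] + 1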
        
    return (uc, bc)
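
As a quick sanity check on toy data, counts of this kind can be reproduced with `collections.Counter`. This is only a sketch and not the notebook's implementation: `count_unigrams_and_bigrams` and the toy samples are hypothetical names used for illustration, and a bigram is assumed to be any adjacent pair of tokens.

```python
from collections import Counter


def count_unigrams_and_bigrams(samples):
    """Count tokens and adjacent token pairs across tokenized samples."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for sample in samples:
        unigram_counts.update(sample)
        # zip(sample, sample[1:]) yields every adjacent pair, including
        # overlapping ones such as (a, b) and (b, c).
        bigram_counts.update(zip(sample, sample[1:]))
    return unigram_counts, bigram_counts


# Hypothetical toy corpus, for illustration only.
toy_samples = [["new", "york", "is", "big"], ["new", "york", "times"]]
toy_uc, toy_bc = count_unigrams_and_bigrams(toy_samples)
print(toy_uc["new"], toy_bc[("new", "york")], toy_bc[("york", "is")])  # 2 2 1
```

Note that `("york", "is")` starts at an odd index of its sample, so it is only counted when every adjacent pair is visited.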



Helper functions for sub-sampling tokens.

In [6]:
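import string

# punc_list was not defined in the original cell. Using the ASCII
# punctuation characters from the standard library is an assumption.
punc_list = set(string.punctuation)


def is_punctuation_token(token):
    """
    A token is a punctuation token if it consists of
    only punctuation characters.
    
    token - The token to inspect.
    """
    return len(token) == len([c for c in token if c in punc_list])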


def remove_pred(samples, pred):
    """
    Given a predicate, remove a particular token from the
    sample being processed. If there is at most 1 token in
    the sample after removing the tokens, then the entire sample
    is also removed.
    
    samples - The samples to remove from. No changes will be
              made to this set of samples.
              
    pred - A predicate that returns True if the token should be
           kept, False otherwise.
    """
    new_samples = []
    for sample in samples:
        new_sample = []

        for token in sample:
            if pred(token):
                new_sample.append(token)
                
        if len(new_sample) > 1:
            new_samples.append(new_sample)
            
    return new_samples
    

def remove_tokens(samples, tokens):
    """
    Given a set of tokens, remove those tokens from the list of samples.
    If a token exists in the set of tokens, it is removed from the set
    of samples.
    
    samples - The samples to remove from. No changes will be
              made to this set of samples.
              
    tokens - Tokens to remove from all samples.
    """
    return remove_pred(samples, lambda t: t not in tokens)


def perform_subsampling(samples, uc, t=1e-5):
    """
    Perform sub-sampling according to the procedure outlined in the paper.
    
    samples - The samples to sub-sample from. No changes will be
              made to this set of samples.
              
    uc - The unigram counts of words.
    
    t - This is the t-value defined and used in the paper. Note that 1e-5
        (that is, 10^-5; the literal 10e-5 would be 10^-4) is chosen as the
        value from the paper and was calibrated to the size of the corpus.
        This parameter is NOT independent of corpus size.
    """
    tokens = list(uc.keys())
    counts = np.array([uc[token] for token in tokens])
    total_count = np.sum(counts)
    frequencies = counts / total_count
    
    # p values indicate the likelihood that a particular token will
    # be discarded. The decision to discard a token is evaluated on
    # every instance of that token in the dataset.
    p_values = np.maximum(0, 1 - np.sqrt(t / frequencies))
    token_p_map = { token:p for (token, p) in zip(tokens, p_values)}

    # Enumerate every token in the sample and remove it using the p values as
    # likelihood of discarding.
    return remove_pred(samples, lambda token: np.random.random() > token_p_map[token])



def perform_low_count_marginalization(samples, uc, min_count=5):
    """
    Marginalize any tokens that occur fewer than min_count times into a
    special low count token.
    
    samples - The samples to marginalize over.
    
    uc - The unigram counts of tokens represented from the sample
    
    min_count - The lowest count of tokens allowed to remain in the dataset.
    """
    new_samples = []

    for sample in samples:
        new_sample = []

        for token in sample:
            if uc[token] < min_count:
                new_sample.append(LOW_COUNT_TOKEN)
            else:
                new_sample.append(token)
                
        new_samples.append(new_sample)
        
    return new_samples
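
To make the sub-sampling rule concrete, the discard probability max(0, 1 - sqrt(t / f(w))) can be evaluated for a few tokens. This is a minimal sketch using only the standard library: `discard_probability` and the relative frequencies are hypothetical, and `t = 1e-5` is assumed as the threshold.

```python
import math


def discard_probability(frequency, t=1e-5):
    """P(discard) = max(0, 1 - sqrt(t / frequency)) for a single token."""
    return max(0.0, 1.0 - math.sqrt(t / frequency))


# Hypothetical relative frequencies, chosen only for illustration.
toy_frequencies = {"the": 0.05, "model": 1e-4, "compositionality": 1e-6}

for token, frequency in toy_frequencies.items():
    print(f"{token}: {discard_probability(frequency):.3f}")
```

With these numbers, a token that makes up 5% of the corpus is dropped on roughly 98.6% of its occurrences, while a token whose frequency is below `t` is never dropped.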



Helper functions for combining common phrases.

In [9]:
def phrase_score(t1, t2, unigram_count, bigram_count, delta):
    """
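    Calculate the score for phrase combining:
    (count(t1, t2) - delta) / (count(t1) * count(t2)).
    
    t1, t2 - The first and second token of the candidate phrase.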
    
    unigram_count - A dictionary mapping tokens in our vocabulary
                    to their counts.
                    
    bigram_count - A dictionary mapping pairs of tokens in our vocab
                   to their counts.
                   
    delta - a coefficient used to adjust the score. Higher delta means we
            discount infrequent words.
    """
    
    # Double check that we are not dividing by 0. This should never
    # happen in theory because if a token has 0 unigram count, it
    # should not be in our vocab.
    if t1 not in unigram_count or t2 not in unigram_count:
        return 0
    
    t1u = unigram_count[t1]
    t2u = unigram_count[t2]

    if t1u == 0 or t2u == 0:
        return 0
    
    b = bigram_count[(t1, t2)]
    
    return (b - delta) / (t1u * t2u)


def create_phrase_score_map(unigram_count, bigram_count, delta):
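    """
    Computes a mapping from each bigram to its phrase score.
    
    unigram_count - A dictionary mapping tokens to their counts.
    
    bigram_count - A dictionary mapping pairs of tokens to their counts.
    
    delta - The discounting coefficient passed on to phrase_score.
    """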
    return { (t1, t2): phrase_score(t1, t2, unigram_count, bigram_count, delta) for (t1, t2) in bigram_count.keys() }


def merge_bigrams(samples, bigrams):
    """
    Given a list of tokenized samples, create a new list of samples where bigrams
    have been merged.
    
    samples - A list of samples, each sample being a list of tokens.
    bigrams - A set of bigrams to merge.
    """
    new_samples = []

    for sample in samples:
        
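        if len(sample) == 0:
            # Nothing to merge. Keep the empty sample and move on so that
            # sample[-1] below does not raise an IndexError.
            print("WARNING SAMPLE LEN IS 0")
            new_samples.append([])
            continue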

        new_sample = []

        # Keep track if we merge in the previous iteration so we don't
        # merge overlapping phrases: for (a, b, c), if (a, b) was merged
        # we do not want to merge (b, c).
        merged_during_previous_iter = False

        for i in range(len(sample) - 1):
            if merged_during_previous_iter:
                merged_during_previous_iter = False
                continue
            
            current = (sample[i], sample[i+1])
            if current in bigrams:
                new_sample.append(sample[i] + " " + sample[i + 1])
                merged_during_previous_iter = True
            else:
                new_sample.append(sample[i])
                
        # We do not iterate the last element. So if the last pair was not
        # merged, we need to add back the last token.
        if not merged_during_previous_iter:
            new_sample.append(sample[-1])

        new_samples.append(new_sample)
                
    return new_samples


def perform_combine_common_phrases(samples, uc, bc, bigram_percentile=0.95, score_percentile=0.999, passes=4):
    """
    Given a set of samples and their unigram + bigram counts, we choose to combine phrases
    found in our corpus. bigram_percentile and score_percentile are parameters used to
    configure the phrase combination algorithm.
    """

    phrases = set()

    for i in range(passes):
        # After the first pass the samples contain merged tokens, so the
        # counts are recomputed for the scores to reflect them.
        if i > 0:
            uc, bc = create_unigram_and_bigram_counts(samples)

        # Figure out a good delta value using bigram quantile
        bc_series = pd.Series(data=list(bc.values()))
        delta = bc_series.quantile(bigram_percentile)
    
        # Calculate score map and threshold
        score_map = create_phrase_score_map(uc, bc, delta)
        score_series = pd.Series(data=list(score_map.values()))
        score_threshold = float(score_series.quantile(score_percentile))
    
        # Find the phrases that have a high-enough score and generate
        # a new set of samples with those phrases merged into a single
        # token. Phrases from every pass are accumulated.
        new_phrases = {b for b, s in score_map.items() if s > score_threshold}
        samples = merge_bigrams(samples, new_phrases)
        phrases |= new_phrases
        
    return samples, phrases
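
The effect of the phrase score can be seen on a pair of made-up bigrams. The sketch below restates the formula (count(a b) - delta) / (count(a) * count(b)) in a standalone helper; `toy_phrase_score`, the counts, and `delta = 5` are all hypothetical values chosen for illustration.

```python
def toy_phrase_score(bigram_count, first_count, second_count, delta):
    """score = (count(a b) - delta) / (count(a) * count(b))."""
    return (bigram_count - delta) / (first_count * second_count)


# Hypothetical counts: "new york" co-occurs often relative to its parts,
# while "of the" is frequent only because both words are very common.
delta = 5
new_york = toy_phrase_score(bigram_count=80, first_count=100, second_count=90, delta=delta)
of_the = toy_phrase_score(bigram_count=500, first_count=20000, second_count=40000, delta=delta)

print(new_york > of_the)  # True
```

Even though "of the" has the larger raw bigram count here, dividing by the product of the unigram counts ranks "new york" far higher, which is the behaviour the score threshold relies on.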



Helper functions for encoding / decoding the tokens.

In [36]:
def create_encoder_and_decoder(uc):
    """
    Given the unigram counts of the dataset, define an encoding
    and decoding map for our data.
    """

    # Encoder is map from token to index.
    encoder = { t:i for (i,t) in enumerate(uc.keys()) }
    
    # Decoder is map from index to token.
    decoder = { i:t for (t,i) in encoder.items() }
    
    return (encoder, decoder)


def encode_samples(samples, encoder):
    encoded = []
    for sample in samples:
        encoded.append([encoder[t] for t in sample])
    return encoded


def decode_samples(encoded, decoder):
    samples = []
    for e in encoded:
        samples.append([decoder[i] for i in e])
    return samples
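
A round trip through an encoder and decoder of this shape should return the original tokens. The following sketch checks that on toy data; the `toy_` names and counts are hypothetical, and the index order simply follows dictionary insertion order.

```python
# Hypothetical unigram counts, for illustration only.
toy_counts = {"the": 3, "cat": 2, "sat": 1}

# Token -> index, in dictionary insertion order.
toy_encoder = {token: index for index, token in enumerate(toy_counts)}
# Index -> token, the inverse mapping.
toy_decoder = {index: token for token, index in toy_encoder.items()}

toy_sample = ["the", "cat", "sat", "the"]
toy_encoded = [toy_encoder[token] for token in toy_sample]
toy_decoded = [toy_decoder[index] for index in toy_encoded]

print(toy_encoded)  # [0, 1, 2, 0]
print(toy_decoded == toy_sample)  # True
```

A token that is missing from the counts would raise a `KeyError` during encoding, so the counts used to build the encoder need to cover every token in the samples.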



Helper functions for creating the training examples:

In [19]:
def create_skipgram_training_examples(samples, context=2):
    """
    Generates a set of training examples for the skip-gram model with a given
    context size.
    
    samples - The samples to generate training examples.

    context - The size of the skip-gram learning context.
    """
    window_size = context * 2 + 1
    window_center = context

    training_examples = []

    for sample in samples:
        if len(sample) < window_size:
            # There are plenty of samples, so skipping some should be fine.
            # Also, it would be awkward to train on non-uniform training sizes,
            # though there may be better ways to do this.
            continue
            
        for i in range(len(sample) - window_size + 1):
            wi = sample[i + window_center]
            wo = sample[i:i+window_center] + sample[i+window_center+1:i+window_size]
            training_examples.append((wi, wo))
    
    return training_examples
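
The sliding-window logic is easiest to follow on a very short sample. The sketch below is a standalone restatement with a hypothetical helper, `toy_skipgram_pairs`, and a context of 1 so that the output stays small.

```python
def toy_skipgram_pairs(sample, context=1):
    """Return (center, context_words) pairs for every full window."""
    window_size = 2 * context + 1
    pairs = []
    for start in range(len(sample) - window_size + 1):
        center = sample[start + context]
        left = sample[start:start + context]
        right = sample[start + context + 1:start + window_size]
        pairs.append((center, left + right))
    return pairs


print(toy_skipgram_pairs(["a", "b", "c", "d"], context=1))
# [('b', ['a', 'c']), ('c', ['b', 'd'])]
```

Because only full windows are used, the first and last `context` tokens of a sample never appear as the input word, and samples shorter than the window produce no examples at all.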



In [None]:
# TODO: Test the training examples.

### Performing Pre-Processing

In [None]:
time_points = [time.time()]

# Step 1: Convert data into tokenized samples.
filepaths = get_sample_filenames()
samples = get_raw_samples(filepaths)

time_points.append(time.time())
print(f"Loading data from files took {(time_points[-1] - time_points[-2]) / 60:.02f}m")


# Step 2: Removing punctuation only tokens.
samples = remove_pred(samples, lambda t: not is_punctuation_token(t))

time_points.append(time.time())
print(f"Removing punctuation tokens took {(time_points[-1] - time_points[-2]) / 60:.02f}m")


# Step 3. Sub-sample tokens in the vocabulary.
uc = create_unigram_counts(samples)
samples = perform_subsampling(samples, uc)

time_points.append(time.time())
print(f"Sub-sampling took {(time_points[-1] - time_points[-2]) / 60:.02f}m")


# Step 4. Marginalize all tokens occurring fewer than 5 times
#         in the corpus
uc = create_unigram_counts(samples)
samples = perform_low_count_marginalization(samples, uc)

time_points.append(time.time())
print(f"Marginalizing low-count tokens took {(time_points[-1] - time_points[-2]) / 60:.02f}m")


# Step 5. Learn phrases in the vocabulary
uc, bc = create_unigram_and_bigram_counts(samples)
samples, _ = perform_combine_common_phrases(samples, uc, bc)

time_points.append(time.time())
print(f"Learning common phrases took {(time_points[-1] - time_points[-2]) / 60:.02f}m")


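# Step 6: Encode the tokens in the vocabulary. The counts are recomputed
# first so that the vocabulary includes the merged phrase tokens.
uc = create_unigram_counts(samples)
encoder, decoder = create_encoder_and_decoder(uc)
uc_encoded = { encoder[t]:c for t,c in uc.items() }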

time_points.append(time.time())
print(f"Encoding vocabulary took {(time_points[-1] - time_points[-2]) / 60:.02f}m")

# Step 7. Create training examples for skip-gram model (input word + words in context).
training_examples = create_skipgram_training_examples(samples, context=CONTEXT_SIZE)

time_points.append(time.time())
print(f"Creating skip-gram training examples took {(time_points[-1] - time_points[-2]) / 60:.02f}m")

print(f"Pre-processing took {(time_points[-1] - time_points[0]) / 60:.02f}m")


In [90]:
# os.path.join cannot take a list, so join each file name separately.
FILE_NAMES = [
    os.path.join(PATH_TRAINING_CORPUS, f)
    for f in os.listdir(PATH_TRAINING_CORPUS)[:CORPUS_FILE_COUNT]
]

samples_raw = []
start_time = time.time()

for filename in FILE_NAMES:
    with open(filename) as file:
        # The body of this block was missing in the original cell. Reading one
        # raw sample per line is an assumption based on the prints below.
        samples_raw.extend(file.read().split("\n"))

total_time = time.time() - start_time

print(f"Total time to load {len(FILE_NAMES)} files: {total_time:0.2f}s")
print(f"Total samples in the corpus: {len(samples_raw)}")

## Step 2: Build the Network

1. Create a PyTorch model for processing the data.
2. Define the negative sampling criterion.
3. Create a training procedure for running samples through the network
4. Train the model
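
For reference, the negative sampling criterion for one (input, output) pair can be written without any framework. The sketch below uses plain Python lists as stand-ins for embedding vectors; in this notebook the same computation would presumably be expressed with PyTorch tensors. All function names and the toy vectors are hypothetical, and the 3/4 power on the noise distribution is the value reported in the paper.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def negative_sampling_loss(v_input, v_output, v_negatives):
    """
    Loss for one (input, output) pair:
    -log sigmoid(v_output . v_input) - sum_k log sigmoid(-v_negative_k . v_input)
    """
    loss = -log_sigmoid(dot(v_output, v_input))
    for v_negative in v_negatives:
        loss -= log_sigmoid(-dot(v_negative, v_input))
    return loss


def noise_distribution(unigram_counts, power=0.75):
    """Unigram counts raised to the given power, then normalized to sum to 1."""
    weights = {t: c ** power for t, c in unigram_counts.items()}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}


# Hypothetical 2-dimensional embeddings, for illustration only.
v_in = [1.0, 0.0]
aligned = negative_sampling_loss(v_in, [2.0, 0.0], [[-2.0, 0.0]])
misaligned = negative_sampling_loss(v_in, [-2.0, 0.0], [[2.0, 0.0]])
print(aligned < misaligned)  # True
```

The loss is small when the true output vector points the same way as the input vector and the negative samples point away from it, and large in the opposite case. Negative samples would be drawn from `noise_distribution`, which flattens the unigram distribution so that rare tokens are sampled somewhat more often than their raw frequency suggests.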

Helper functions for encoding / decoding the dataset.

In [92]:
def create_encoder_and_decoder(uc):
    """
    Given the unigram counts in the database, define an encoding
    and decoding map for our data.
    """

    # Encoder is map from token to index.
    encoder = { t:i for (i,t) in enumerate(uc.keys()) }
    
    # Decoder is map from index to token.
    decoder = { i:t for (t,i) in encoder.items() }
    
    return (encoder, decoder)



## Step 3: Evaluating the Embeddings

1. Load the sets of analogies from the datasets
2. Find matching tokens for the analogies and discard analogies without analogous tokens
3. Evaluate the model against the analogies
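
The evaluation steps above can be sketched on toy data. This assumes the common vector-offset approach, where "a is to b as c is to ?" is answered by the token whose embedding is closest in cosine similarity to b - a + c. The helper names, the 2-dimensional embeddings, and the analogy tuples are all hypothetical and do not come from a real dataset.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def solve_analogy(embeddings, a, b, c):
    """Answer "a is to b as c is to ?" with the nearest neighbour of b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])]
    # The three query tokens are excluded from the candidates.
    candidates = (t for t in embeddings if t not in (a, b, c))
    return max(candidates, key=lambda t: cosine(embeddings[t], target))


def analogy_accuracy(embeddings, analogies):
    """Score only the analogies whose four tokens are all in the vocabulary."""
    usable = [q for q in analogies if all(t in embeddings for t in q)]
    correct = sum(solve_analogy(embeddings, a, b, c) == d for a, b, c, d in usable)
    accuracy = correct / len(usable) if usable else 0.0
    return accuracy, len(usable)


# Hypothetical embeddings and analogies, for illustration only.
toy_embeddings = {
    "man": [1.0, 0.0], "woman": [1.0, 1.0],
    "king": [3.0, 0.0], "queen": [3.0, 1.0],
    "apple": [-1.0, 2.0],
}
toy_analogies = [
    ("man", "woman", "king", "queen"),
    ("man", "woman", "prince", "princess"),  # discarded: tokens not in vocabulary
]
print(analogy_accuracy(toy_embeddings, toy_analogies))  # (1.0, 1)
```

Returning the number of usable analogies alongside the accuracy makes it clear how much of the analogy set was actually covered by the vocabulary.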