# Distributed Representations of Words and Phrases and Their Compositionality

This notebook contains an implementation of the neural network described in the paper "Distributed Representations of Words and Phrases and Their Compositionality". A copy of the paper and a summary are available in this directory. The implementation uses PyTorch.

Note that the purpose of this notebook is to put together a final implementation of the data fetching, pre-processing, training, and evaluation described in the paper. It does not discuss the choice of hyper-parameters, configurations, and other implementation details in depth. See *torch_exmplore* for an exploration of some of these implementation details.

## Step 1: Data Fetching and Pre-Processing

1. Convert samples into tokens
2. Remove punctuation-only tokens
3. Sub-sample tokens in the vocabulary
4. Marginalize all tokens occurring fewer than 5 times in the corpus
5. Learn phrases in the vocabulary
6. Create training examples for negative sampling (input word + set of skip-gram output words)
7. One-hot encode the training examples (create encoder / decoder pair)
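
Step 6 can be sketched as follows. This is a minimal illustration rather than the notebook's final implementation: the helper name `make_skipgram_examples` is hypothetical, and the sketch assumes a fixed window of `context_size` tokens on each side of the input word.

```python
def make_skipgram_examples(sample, context_size):
    """
    Pair each token with the tokens at most `context_size` positions
    away on either side. (Hypothetical helper, for illustration only.)
    """
    examples = []
    for i, center in enumerate(sample):
        lo = max(0, i - context_size)
        hi = min(len(sample), i + context_size + 1)
        # Context is everything in the window except the center token.
        context = sample[lo:i] + sample[i + 1:hi]
        if context:
            examples.append((center, context))
    return examples


examples = make_skipgram_examples(["the", "quick", "brown", "fox"], context_size=1)
for center, context in examples:
    print(center, context)
```

Each tuple holds one input word and its set of skip-gram output words, which is the shape described in step 6.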

In [2]:
import nltk
from nltk.tokenize import word_tokenize

import os
import time

import numpy as np
import torch

nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/brendanmcnamara/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
torch.__version__

'1.3.1'

In [6]:
PATH_DATA = '../../data/language-modeling-benchmark-r13/'
PATH_TRAINING_CORPUS = os.path.join(PATH_DATA, 'training-monolingual.tokenized.shuffled')
PATH_HOLDOUT_CORPUS = os.path.join(PATH_DATA, 'heldout-monolingual.tokenized.shuffled')

CONTEXT_SIZE = 5
CORPUS_FILE_COUNT = 10                              

In [9]:
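import string

# Characters treated as punctuation. Assumption: ASCII punctuation from
# the standard library, since the original punc_list is not defined here.
punc_list = set(string.punctuation)


def is_punctuation_token(token):
    """
    A token is a punctuation token if it consists of only
    punctuation characters.
    """
    return all(c in punc_list for c in token)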


def get_sample_filenames():
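    """
    Get the paths of the corpus files to be sampled from.
    """
    # os.path.join cannot take a list, so join each file name separately.
    # Sorting makes the selection reproducible, since os.listdir returns
    # names in arbitrary order.
    filenames = sorted(os.listdir(PATH_TRAINING_CORPUS))[:CORPUS_FILE_COUNT]
    return [os.path.join(PATH_TRAINING_CORPUS, name) for name in filenames]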


def get_raw_samples(filepath):
    """
    Fetch all raw samples from a particular file path. This will pull
    the data from the file and tokenize each line as one sample. Apart
    from lower-casing, no pre-processing is done here.
    
    filepath - Path to the corpus file to read.
    """
    raw_samples = []

    with open(filepath) as file:
        for sentence in file:
            sentence = sentence.strip()
            # Skip blank lines, such as the one left by a trailing newline.
            if sentence:
                raw_samples.append(word_tokenize(sentence.lower()))

    return raw_samples


def create_unigram_counts(samples):
    """
    Given a set of samples, generate a set of unigram counts
    within those samples.
    """
    uc = {}
    for sample in samples:
        for token in sample:
            if token not in uc:
                uc[token] = 0
            uc[token] = uc[token] + 1
    
    return uc
   

def create_unigram_and_bigram_counts(samples):
    """
    Given a set of samples, generate a set of unigram and
    bigram counts within those samples.
    """
    uc = {}
    bc = {}
    for sample in samples:
        for i in range(len(sample)):
            token = sample[i]
            if token not in uc:
                uc[token] = 0
            uc[token] = uc[token] + 1
            
        # Count every adjacent pair of tokens. Stepping by 2 would skip
        # the bigrams that start at odd positions.
        for i in range(len(sample) - 1):
            bigram = (sample[i], sample[i+1])
            if bigram not in bc:
                bc[bigram] = 0
            bc[bigram] = bc[bigram] + 1
        
    return (uc, bc)


def remove_pred(samples, pred):
    """
    Remove every token for which the predicate returns False.
    If at most 1 token remains in a sample after removal, the
    entire sample is also removed.
    
    samples - The samples to remove from. No changes will be
              made to this set of samples.
              
    pred - A predicate that returns True if the token should be
           kept, False otherwise.
    """
    new_samples = []
    for sample in samples:
        new_sample = []

        for token in sample:
            if pred(token):
                new_sample.append(token)
                
        if len(new_sample) > 1:
            new_samples.append(new_sample)

    return new_samples


def remove_tokens(samples, tokens):
    """
    Given a set of tokens, remove those tokens from the list of samples.
    If a token exists in the set of tokens, it is removed from the set
    of samples.
    
    samples - The samples to remove from. No changes will be
              made to this set of samples.
              
    tokens - Tokens to remove from all samples.
    """
    return remove_pred(samples, lambda t: t not in tokens)


def perform_subsampling(samples, uc):
    """
    Perform sub-sampling according to the procedure outlined in the paper.
    
    samples - The samples to sub-sample from. No changes will be
              made to this set of samples.
              
    uc - The unigram counts of words.
    """
    # Threshold t from the paper, which suggests a value around 1e-5
    # (10e-5 would be 1e-4). Note that a suitable value is not
    # independent of corpus size.
    t = 1e-5

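    tokens = list(uc.keys())
    counts = np.array(list(uc.values()))
    total_count = np.sum(counts)
    frequencies = counts / total_count

    # Probability of discarding each occurrence of a token:
    # P(w) = 1 - sqrt(t / f(w)), clipped at 0 for rare tokens.
    p_values = np.maximum(0, 1 - np.sqrt(t / frequencies))
    discard_prob = dict(zip(tokens, p_values))

    # Assumed completion of the original TODO: drop each token occurrence
    # independently with its discard probability, then drop samples left
    # with at most 1 token (the same rule remove_pred applies).
    new_samples = []
    for sample in samples:
        kept = [tok for tok in sample
                if np.random.random() >= discard_prob.get(tok, 0.0)]
        if len(kept) > 1:
            new_samples.append(kept)

    return new_samples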
    
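The counts returned by `create_unigram_and_bigram_counts` are the inputs to phrase learning (step 5). A minimal sketch of the scoring rule from the paper, score(a, b) = (count(a b) - delta) / (count(a) * count(b)), is shown below. The helper names `score_bigrams` and `find_phrases` are hypothetical, and the `delta` and `threshold` values are illustrative, not tuned.

```python
def score_bigrams(uc, bc, delta):
    """
    Score each bigram as (count(a b) - delta) / (count(a) * count(b)).
    `delta` discounts bigrams made of very infrequent words.
    """
    return {
        (a, b): (count - delta) / (uc[a] * uc[b])
        for (a, b), count in bc.items()
    }


def find_phrases(uc, bc, delta, threshold):
    """Return the bigrams whose score exceeds `threshold`."""
    scores = score_bigrams(uc, bc, delta)
    return {bigram for bigram, score in scores.items() if score > threshold}


# Toy counts, made up for illustration only.
uc = {"new": 10, "york": 8, "the": 100, "city": 20}
bc = {("new", "york"): 8, ("the", "city"): 10}

phrases = find_phrases(uc, bc, delta=2, threshold=0.01)
print(phrases)
```

Bigrams that pass the threshold would then be merged into single tokens before training. The paper repeats this pass a few times with a decreasing threshold so that longer phrases can form.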

In [44]:
np.random.choice([False, True], replace=False, p=[0.9, 0.1])

True

In [21]:
# os.path.join cannot take a list, so build each path separately.
FILE_NAMES = [
    os.path.join(PATH_TRAINING_CORPUS, name)
    for name in sorted(os.listdir(PATH_TRAINING_CORPUS))[:CORPUS_FILE_COUNT]
]

start_time = time.time()

samples_raw = []
for filename in FILE_NAMES:
    with open(filename) as file:
        # Tokenize each non-empty line as one lower-cased sample.
        for line in file:
            line = line.strip()
            if line:
                samples_raw.append(word_tokenize(line.lower()))

total_time = time.time() - start_time

print(f"Total time to load {len(FILE_NAMES)} files: {total_time:0.2f}s")
print(f"Total samples in the corpus: {len(samples_raw)}")

## Step 2: Build the Network

1. Create a PyTorch model for processing the data.
2. Define the negative sampling criterion.
3. Create a training procedure for running samples through the network.
4. Train the model
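
The negative sampling criterion in item 2 can be written down for a single training pair as a framework-free reference. This is a sketch under stated assumptions, not the notebook's model: the function names are hypothetical and plain Python lists stand in for embedding vectors. In the PyTorch version the same expression would be computed on tensors, for example with `torch.nn.functional.logsigmoid`.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def negative_sampling_loss(v_input, u_output, u_negatives):
    """
    Loss for one (input, output) pair with k negative samples:
        -log s(u_output . v_input) - sum over k of log s(-u_k . v_input)
    where s is the logistic sigmoid.
    """
    loss = -log_sigmoid(dot(u_output, v_input))
    for u_neg in u_negatives:
        loss -= log_sigmoid(-dot(u_neg, v_input))
    return loss


# Toy vectors, made up for illustration only.
v_input = [1.0, 0.0]
aligned = negative_sampling_loss(v_input, [2.0, 0.0], [[-2.0, 0.0]])
misaligned = negative_sampling_loss(v_input, [-2.0, 0.0], [[2.0, 0.0]])
print(aligned < misaligned)
```

The loss is low when the true output vector points the same way as the input vector and the negative samples point away from it, which is the behavior training should reward.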

## Step 3: Evaluating the Embeddings

1. Load the sets of analogies from the datasets
2. Find matching tokens in the vocabulary for the analogies and discard analogies whose tokens are missing
3. Evaluate the model against the analogies
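
A single analogy query can be sketched as below. This is an illustration, not the notebook's evaluation code: the helper names and the toy 2-d embeddings are assumptions. It uses the vector-offset method from the paper, where "a is to b as c is to ?" is answered by the token nearest to vec(b) - vec(a) + vec(c) under cosine similarity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(embeddings, a, b, c):
    """
    Return the token closest to vec(b) - vec(a) + vec(c),
    excluding the three query tokens themselves.
    """
    target = [
        y - x + z
        for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    candidates = (t for t in embeddings if t not in (a, b, c))
    return max(candidates, key=lambda t: cosine(embeddings[t], target))


# Toy 2-d embeddings, made up for illustration only.
embeddings = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [-1.0, -1.0],
}

answer = solve_analogy(embeddings, "man", "woman", "king")
print(answer)
```

Accuracy over a dataset would be the fraction of analogies for which the returned token matches the expected one, counted only over analogies whose four tokens all appear in the vocabulary.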