In [5]:
from IPython.core.display import HTML

HTML(open("./css/index.css").read())

# Presentation on Word2Vec
Glenn Abastillas | 24 March, 2020

``` ADD VISUAL OF KING - QUEEN = MAN WITH SVG ```

This notebook goes over an example implementation of Word2Vec and some existing packages that perform Word2Vec training.

Contents
  1.  Preliminary Steps
      * [Load Packages](#load_packages)
      * [Preprocess Data](#preprocess_data)
      * [Quick Background](#quick_background)
  2. [Implementation from Scratch](#implementation_from_scratch)
      * Word2Vec Flavors: Continuous Bag of Words (CBOW) / Skip-gram (SG)
      * Training
      * Retrieving the trained matrix
      * Applications
  3. Using an Existing Package

__At each step, we will also cover other packages that can be used to achieve the same thing (e.g., `CountVectorizer`)__
  
---

### Load Packages <a id="load_packages"></a>
First we import packages and clean the data.

In [None]:
import numpy as np
import spacy
import tqdm
from string import punctuation
from nltk.corpus import gutenberg, stopwords
from collections import namedtuple

We will use data from the `gutenberg` corpus and normalize the input data.

In [2]:
sents = gutenberg.sents('melville-moby_dick.txt')

`[insert navigation here]`

---
### Preprocess Data <a id="preprocess_data"></a>
##### Normalize Vocabulary

To improve processing and richness of our lexical items, we normalize our language data. 

Normalizing data can involve a variety of tasks depending on the final application of our language model. These tasks include making all words the same case, removing punctuation, and removing **<a id="stopword" style="text-decoration: none; cursor: help;" title="Words that contribute little semantic information to a text">stopwords</a>**.

For this presentation, we will use **<a id="token" style="text-decoration: none; cursor: help;" title="Combinations of characters separated by spaces (e.g., words, numbers)">tokens</a>** that are neither punctuation nor stopwords.

Let's quickly define some functions we will use to pare our text data down.

In [None]:
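# Use a set for constant-time membership checks
stopwords_ = set(stopwords.words('english'))

def is_stopword(token):
    ''' Check if a specified token is a stopword. '''
    try:
        return token.lower() in stopwords_
    except AttributeError:  # token is not a string
        return False

def is_valid_token(token):
    ''' Check if token is valid, i.e., not a stopword or punctuation '''
    try:
        # Keep alphanumeric tokens that are not stopwords
        return token.isalnum() and not is_stopword(token)
    except AttributeError:  # token is not a string
        return False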
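
To see what this filtering does, here is a minimal stand-alone sketch on a single toy sentence. The tiny `TOY_STOPWORDS` set and the `keep` helper are assumptions made for illustration only; the notebook itself relies on NLTK's English stopword list.

```python
# Hypothetical miniature stopword list; the notebook uses NLTK's English list instead
TOY_STOPWORDS = {'the', 'a', 'of', 'is', 'and'}

def keep(token, stopwords=TOY_STOPWORDS):
    ''' Keep alphanumeric tokens that are not stopwords (case-insensitive) '''
    return token.isalnum() and token.lower() not in stopwords

sentence = ['Call', 'me', 'Ishmael', '.', 'The', 'whale', 'is', 'white', '!']

# Lowercase the surviving tokens, mirroring the normalization step
normalized = [token.lower() for token in sentence if keep(token)]
print(normalized)  # ['call', 'me', 'ishmael', 'whale', 'white']
```

Punctuation fails the `isalnum()` check, and stopwords are dropped regardless of case.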

##### First Normalization Step
Next, we create our `normalized_sents` data using the functions we just defined.

In [4]:
%%time
normalized_sents = [[word.lower() for word in sent if is_valid_token(word)] for sent in sents]

print(f'Number of sentences : {len(normalized_sents)}')

Number of sentences : 10059
CPU times: user 1.76 s, sys: 47.2 ms, total: 1.81 s
Wall time: 1.82 s


With our `normalized_sents`, we can build lookup tables that convert strings into integers for faster processing down the line. We create maps from strings to integers (`VOCAB`) and from integers to strings (`VOCABR`), plus a probability distribution over word frequencies (`PROBS`) for possible negative sampling (if there is time).

In [27]:
flattened_text = [word for sent in normalized_sents for word in sent]

WORDS, COUNTS = np.unique(flattened_text, return_counts=True)

PROBS = COUNTS**(3/4) / (COUNTS**(3/4)).sum()

INDEX = np.arange(WORDS.size)

VOCAB = dict(zip(WORDS, INDEX))
VOCABR = dict(zip(INDEX, WORDS))
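
The 3/4 exponent in `PROBS` follows the smoothing used for negative sampling in the original Word2Vec work: it lowers the share of very frequent words and raises the share of rare ones. The sketch below shows that effect on a toy corpus using only the standard library. The corpus and the names `vocab`, `vocab_r`, `raw`, and `smoothed` are illustrative assumptions, not part of the notebook.

```python
from collections import Counter

# Toy corpus (an assumption for illustration): one very frequent word, two rare ones
tokens = ['whale'] * 8 + ['ship'] + ['sea']
counts = Counter(tokens)

# Sorted vocabulary, analogous to the ordering returned by np.unique
words = sorted(counts)
vocab = {word: i for i, word in enumerate(words)}      # string -> integer
vocab_r = {i: word for word, i in vocab.items()}       # integer -> string

# Raw relative frequencies
raw_total = sum(counts.values())
raw = {w: counts[w] / raw_total for w in words}

# Frequencies raised to the 3/4 power, then renormalized
smoothed_total = sum(counts[w] ** 0.75 for w in words)
smoothed = {w: counts[w] ** 0.75 / smoothed_total for w in words}

for w in words:
    print(f'{w:>5}  raw={raw[w]:.3f}  smoothed={smoothed[w]:.3f}')
```

In this toy case the frequent word loses probability mass and the rare words gain it, while the smoothed values still sum to 1.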

##### Second Normalization Step

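Using the `VOCAB` mapping defined above, we can convert our `normalized_sents` into `data`, which contains only integers that will be used in our Word2Vec example. Note that `data` is a generator, so the timing below reflects only its creation; the conversion itself runs when `data` is consumed.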

In [6]:
%%time
data = (np.array([VOCAB[token] for token in sent]) for sent in normalized_sents)

CPU times: user 5 µs, sys: 1 µs, total: 6 µs
Wall time: 9.06 µs


`[insert navigation here]`

---

Use this reference for later [Dimensions greater than 300 have diminishing returns](https://www.aclweb.org/anthology/D14-1162/)

### Quick Background <a id="quick_background"></a>
#### Background Into Word2Vec

* What is it?
  - Quick definition (implementation of theoretical matrix bit)
  - What it does to text.
  - What the output is.
  
* Why do we need it?
  - Many uses in AI.
  - Usage in NLP
  - Pros
  - Cons
 
* What cool things can it do?
  - Condense text into a lightweight matrix
  - Provide semantic abilities
  - Enable data to have algebraic properties
 
* What are competing models?
  - Other models to represent text
  - GloVe
  - Other vectorization models

`[insert navigation here]`

---
## Implementation from Scratch <a id="implementation_from_scratch"></a>

For this example, we will create a Word2Vec language model using the data we preprocessed above. 

In this section, we will develop a **<a id="skip-gram" style="text-decoration: none; cursor: help;" title="Using a token to predict its surroundings">Skip-gram</a>** flavored Word2Vec model.

We will:
  * Create Skip-gram windows
  * Create preliminary <a id='one-hot' style='text-decoration: none; cursor: help;' title='A vector that is comprised of zeros and ones indicating absence or presence of a value'>one-hot vectors</a>


###### Parameters <a id='parameters'></a>
First we define some hyperparameters that we use for training.

In [7]:
parameters = {'window_size' : 2, 'dimensions' : 100, 'learning_rate' : 0.02, 'epochs' : 3}

This table quickly describes what each parameter does.

Parameters | Data Type | Description
--- | :-: | :--
`window_size` | `int` | The number of context tokens to include before and after a central token
`dimensions` | `int` | The number of dimensions in the hidden layer. Dimensions greater than 300 have diminishing returns `[cite]`.
`learning_rate` | `float` | How quickly our model will correct itself
`epochs` | `int` | The number of rounds the model is trained

##### Creating the Training Data <a id="creating_the_training_data"></a>

We will generate loose <a id='one-hot' style='text-decoration: none; cursor: help;' title='A vector that is comprised of zeros and ones indicating absence or presence of a value'>one-hot vectors</a> that will serve as input and target data when training our model.

First we filter our data to ensure we have sufficient data to window.

In [8]:
%%time
data = [sent for sent in data if sent.size >= parameters['window_size'] + 1]

print(f'Number of sentences in data: {len(data)}')

CPU times: user 91.8 ms, sys: 37.5 ms, total: 129 ms
Wall time: 140 ms


Now we generate our one-hot vectors, using the size of the vocabulary as the length of each vector.

Let's define a few functions to help us generate these data.

In [9]:
Datum = namedtuple('Datum', 'target context'.split())

def one_hot(token, size=WORDS.size):
    ''' Return a (size, 1) one-hot column vector with a 1 at the token's index '''
    vector = np.zeros((size, 1))
    vector[token] = 1
    return vector

def process_sentence(sent, processed_sentence=None, window_size=parameters['window_size']):
    ''' Return a dictionary with token keys and contexts as values
        
        Parameters
        ----------
            sent (np.ndarray) : sentence (integer tokens) to convert into one-hot
                                vectors and group into target and context
            processed_sentence (dict) : Previously processed sentences to add to
            window_size (int) : Number of context tokens on each side of the target
        
        Returns
        -------
            (dict) object with tokens as keys and their corresponding
                   one-hot vector targets and context in the following format:
            
            >> { token : Datum(target, [[context-1, context-2, ...], ...]), ...}
    '''
    # Create a fresh dict per call instead of sharing a mutable default argument
    if processed_sentence is None:
        processed_sentence = {}

    for i, token in enumerate(sent):
        # Take up to window_size tokens on each side of the central token
        a, b, j = max(i - window_size, 0), i + window_size + 1, i + 1
        before, after = sent[a:i], sent[j:b]

        # One-hot encode the surrounding tokens
        context = [one_hot(context_token) for context_token in np.append(before, after)]
        
        if token in processed_sentence:
            processed_sentence[token].context.append(context)
        else:
            processed_sentence[token] = Datum(one_hot(token), [context])

    return processed_sentence
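
To make the windowing idea concrete, here is a minimal stand-alone sketch that works on plain lists of token ids instead of one-hot vectors. It assumes a symmetric window of `window_size` tokens on each side. The function name `skipgram_pairs` and the toy token ids are hypothetical and used only for illustration.

```python
def skipgram_pairs(sent, window_size=2):
    ''' Yield (target, context) pairs with up to window_size tokens on each side '''
    for i, target in enumerate(sent):
        start = max(i - window_size, 0)   # do not run past the start of the sentence
        stop = i + window_size + 1        # slicing past the end is safe in Python
        context = sent[start:i] + sent[i + 1:stop]
        yield target, context

toy_sentence = [10, 11, 12, 13, 14]       # hypothetical integer token ids
pairs = list(skipgram_pairs(toy_sentence))

for target, context in pairs:
    print(target, context)
```

Tokens near the edges of a sentence simply receive a shorter context, which is why very short sentences are filtered out before this step.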

```For debugging purposes, the item at index 10 has duplicates```

In [10]:
test = process_sentence(data[10])

Loop through all the sentences to generate `target` and `context` data for training.

In [11]:
processed_sentences = {}
for sent in tqdm.tqdm(data):
    processed_sentences = process_sentence(sent, processed_sentences)


100%|██████████| 8161/8161 [00:10<00:00, 766.33it/s] 


`[insert navigation here]`

---
##### Create Layers <a id='create_layers'></a>

These matrices will serve as the layers that surround our `word2vec` layer during training.

In [12]:
weights_1 = np.random.random((WORDS.size, parameters['dimensions']))
weights_2 = np.random.random((WORDS.size, parameters['dimensions']))
print(f'Dimensions\nWeights 1 {weights_1.shape}\nWeights 2 {weights_2.shape}')

Dimensions
Weights 1 (16993, 100)
Weights 2 (16993, 100)


`[insert navigation here]`

---
###### Feed Forward Algorithm

The feed forward algorithm is the first part of a two-part algorithm defining a <a id='learning-step' style='text-decoration: none; cursor: help;' title='A phase where training data are learned and errors are adjusted throughout the model'>learning step</a>. It passes a one-hot target vector through our randomly initialized weight matrices, giving the model its first real data to learn from, and returns a predicted probability for every vocabulary item appearing in the target's context.

In [15]:
def softmax(datum):
    ''' Return the input array normalized to a probability distribution '''
    e = np.exp(datum - datum.max())
    return e / e.sum()

def forward(datum, weights_1=weights_1, weights_2=weights_2):
    ''' Return three matrices corresponding to the prediction, hidden layer, and output '''
    hidden = np.dot(weights_1.T, datum)
    output = np.dot(weights_2, hidden)
    prediction = softmax(output)
    return prediction, hidden, output
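
Subtracting the maximum before exponentiating, as `softmax` does above, keeps the exponentials from overflowing without changing the result, because softmax is invariant to adding a constant to every score. The stand-alone sketch below illustrates this with plain Python lists; `softmax_list` is a hypothetical helper written for this illustration.

```python
import math

def softmax_list(scores):
    ''' Numerically stable softmax over a list of floats '''
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]   # largest exponent is exp(0) = 1
    total = sum(exps)
    return [e / total for e in exps]

# math.exp(1000.0) on its own raises OverflowError; the shifted version does not
large = softmax_list([1000.0, 1001.0, 1002.0])

# The same scores shifted down by 1000 give the same probabilities
small = softmax_list([0.0, 1.0, 2.0])

print(large)
print(small)
```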

`[insert navigation here]`

---
###### Backpropagation Algorithm

Backpropagation is the second part of the two-part algorithm defining a <a id='learning-step' style='text-decoration: none; cursor: help;' title='A phase where training data are learned and errors are adjusted throughout the model'>learning step</a>. This algorithm compares the output of the <a id='feed-forward-algorithm' style='text-decoration: none; cursor: help;' title='The algorithm that takes in new data and attempts to make predictions from it'>feed forward algorithm</a> to the input token's actual context, calculates the error, and adjusts the model to correct for it. The size of each adjustment is set by the `learning_rate` value in our `parameters` variable.
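
The notebook does not state its loss function explicitly. Assuming the usual Skip-gram cross-entropy loss over the $C$ context tokens, the updates can be derived as follows. Here $x$ is the one-hot target, $W_1$ and $W_2$ are the two weight matrices, $h = W_1^\top x$ is the hidden layer, $u = W_2 h$ is the output, $y = \mathrm{softmax}(u)$ is the prediction, $t_c$ is the one-hot vector of the $c$-th context token, and $\eta$ is the learning rate. These symbols are introduced for this derivation only.

$$
\begin{aligned}
E &= -\sum_{c=1}^{C} \log y_{j_c}
   = -\sum_{c=1}^{C} u_{j_c} + C \log \sum_{j} e^{u_j} \\
e &= \frac{\partial E}{\partial u} = \sum_{c=1}^{C} \left( y - t_c \right) \\
\frac{\partial E}{\partial W_2} &= e \, h^\top \\
\frac{\partial E}{\partial W_1} &= x \left( W_2^\top e \right)^\top \\
W_1 &\leftarrow W_1 - \eta \, \frac{\partial E}{\partial W_1}, \qquad
W_2 \leftarrow W_2 - \eta \, \frac{\partial E}{\partial W_2}
\end{aligned}
$$

Under this assumption, $e$ corresponds to the summed error over context tokens, $e\,h^\top$ to the update for the second weight matrix, and $x\,(W_2^\top e)^\top$ to the update for the first. Because $x$ is one-hot, only the row of $W_1$ belonging to the target token changes.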

In [122]:
t = processed_sentences[1].context[0]
s = sum(t)

In [129]:
a,*_ = np.where(s.flatten() >= 1)

segments = s[0:a[0]]
for i in range(a.size-1):
    j, k = a[i] + 1, a[i + 1]
    
    print(s.flatten()[j:k].size, j, k)


2178 2493 4671


In [133]:
def calculate_error(prediction, context):
    ''' Return a column vector with the prediction error summed over all context tokens '''
    error = np.zeros((prediction.size, 1))
    for subcontext in context:
        for token in subcontext:
            error += prediction - token
    return error

def backpropagate(prediction, hidden, target, context, weights_1=weights_1, weights_2=weights_2):
    ''' Update weight matrices according to output from forward() '''
    error = calculate_error(prediction, context)
    
    weights_2_delta = np.outer(error, hidden)
    hidden_error = np.dot(weights_2.T, error)
    weights_1_delta = np.outer(hidden_error, target).T
    
    weights_1 -= weights_1_delta * parameters['learning_rate']
    weights_2 -= weights_2_delta * parameters['learning_rate']
    
def negative_sample(context, index=INDEX, p=PROBS):
    ''' Return the candidate negative tokens (absent from the context) and their probabilities '''
    summed = sum(context).flatten()
    
    # Tokens with a zero entry never occur in this context
    absent = summed < 1
    
    # The probabilities are not renormalized here, so they sum to less than 1;
    # divide by their sum before passing them to np.random.choice
    return index[absent], p[absent]

In [75]:
np.random.choice(INDEX, 5, p=PROBS)

array([13513,  6140, 16298, 14506, 14716])

`[insert navigation here]`

---
##### Training Algorithm

Having both the feed forward and backpropagation algorithms defined, we can now define a training algorithm to learn all training examples for a single <a id='epoch' style='text-decoration: none; cursor: help;' title='A complete cycle of learning steps through all training data'>epoch</a>.

In [108]:
def train(training_data, weights_1=weights_1, weights_2=weights_2, parameters=parameters, total=None):
    ''' Train the Word2Vec model on our training data to generate meaningful word vectors '''
    
    data = training_data.items()
    total = total or len(training_data)  # number of distinct target tokens
    
    for epoch in tqdm.tqdm(range(parameters['epochs'])):
        for __, (target, context) in tqdm.tqdm(data, total=total):
            # Pass the weights explicitly so the matrices given to train() are the ones updated
            prediction, hidden, output = forward(target, weights_1, weights_2)
            backpropagate(prediction, hidden, target, context, weights_1, weights_2)

Sandbox test for FF and BP

In [71]:
example = processed_sentences[0]
target, context = example.target, example.context

In [None]:
pred, h, u = forward(target)

In [107]:
backpropagate(pred, h, target, context)

`[insert navigation here]`

---
###### Test Iteration <a id='test_iteration'></a>

This cell loops through all our training data to demonstrate what happens in one training <a id='epoch' style='text-decoration: none; cursor: help;' title='A complete cycle of learning steps through all training data'>epoch</a>.

In [109]:
train(processed_sentences)

  0%|          | 0/3 [00:00<?, ?it/s]


KeyboardInterrupt: 

In [73]:
collection = processed_sentences.items()
collection = enumerate(collection)
for i, (key, (target, context)) in tqdm.tqdm(collection, total=WORDS.size):
    prediction, hidden, output = forward(target)
    backpropagate(prediction, hidden, target, context)

16577it [11:40, 23.65it/s]


Test the current model trained only on a single epoch.

In [103]:
token = np.random.randint(0, WORDS.size)
a, b = weights_1[token - 1], weights_1[token]
distance = np.dot(a,b) / (np.sqrt(np.dot(a, a)) * np.sqrt(np.dot(b,b)))

print(f'Token number {token} is "{VOCABR[token]}" along with "{VOCABR[token-1]}" with cos score as {distance}')

Token number 4205 is "distinct" along with "distilled" with cos score as 0.1674197989756508


`[insert navigation here]`

---
#### Using Our Trained Model

We can use our trained model to `list things we can do with our Word2Vec model from earlier`.

First, we define functions to help do those things.

In [140]:
def similarity(vector1, vector2):
    ''' Return the cosine similarity score for two tokens input as vectors '''
    a, b, c = np.dot(vector1, vector2), np.dot(vector1, vector1), np.dot(vector2, vector2)
    numerator = a
    denominator = (np.sqrt(b) * np.sqrt(c))
    return numerator / denominator

def most_similar(token, model=weights_1):
    ''' Return the index and similarity score of the token most similar to the input token '''
    best_score = -1
    token_vector = word_vector(token, model)
    best_index = None
    for i, current_vector in enumerate(model):
        if i == token:
            continue
            
        score = similarity(token_vector, current_vector)
        if score > best_score:
            best_score = score
            best_index = i
    return best_index, best_score

def word_vector(token, model=weights_1):
    ''' Return the word vector corresponding to a token '''
    return model[token]

def topn(token, n=3, model=weights_1):
    ''' Return the indices and scores of the n tokens most similar to the input token '''
    
    token_vector = word_vector(token, model)
    
    scores = []
    for i, current_vector in enumerate(model):
        if i == token:
            continue  # skip the input token itself
        scores.append((similarity(token_vector, current_vector), i))
    
    # Sort by descending similarity and keep the n best
    best = sorted(scores, reverse=True)[:n]
    return [i for _, i in best], [score for score, _ in best]

Test with algebraic interactions of this 

In [148]:
most_similar(994)

(8066, 0.858180036233178)

In [149]:
VOCABR[994], VOCABR[8066]

('babies', 'kindled')

In [127]:
similarity(word_vector(343), word_vector(344)), VOCABR[343], VOCABR[344]

(0.9954024547658752, 'agonized', 'agonizing')
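
The algebraic property mentioned in the background section (for example, king - man + woman landing near queen) can be sketched with hand-made vectors. The two-dimensional `toy_vectors` below are an assumption chosen so that the arithmetic works out exactly; they are not output from the model trained above, and `cosine` and `analogy` are hypothetical helpers for this illustration.

```python
import math

def cosine(u, v):
    ''' Cosine similarity between two equal-length lists of floats '''
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-made 2-d vectors (axes: royalty, gender); purely illustrative
toy_vectors = {
    'king':  [0.9, 0.9],
    'queen': [0.9, -0.9],
    'man':   [0.1, 0.9],
    'woman': [0.1, -0.9],
    'whale': [-0.8, 0.1],
}

def analogy(a, b, c, vectors=toy_vectors):
    ''' Return the word closest to vector(a) - vector(b) + vector(c), excluding the inputs '''
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

print(analogy('king', 'man', 'woman'))  # queen
```

With real embeddings the match is approximate rather than exact, and it depends heavily on how well the model has been trained.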

---

In [62]:
vector.shape, matrix_1.shape

((16577, 1), (16577, 100))

In [61]:
pred, h, u = forward(vector, matrix_1, matrix_2)
pred.shape, h.shape, u.shape

((16577, 1), (100, 1), (16577, 1))

In [67]:
VOCABR[vector.argmax()]

'bounties'

In [65]:
VOCABR[pred.argmax()]

'copestone'

## Using an Existing Package

Existing packages let you train or load word embeddings, from Word2Vec or from related models, out of the box.

Examples of these include:
  * spaCy
  * gensim
  * ELMo
  * fastText
 
 
### spaCy

In [None]:
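# Note: the small model ships without static word vectors; en_core_web_md and en_core_web_lg include them
nlp = spacy.load('en_core_web_sm')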