# Presentation on Word2Vec
Glenn Abastillas | 24 March, 2020

This notebook goes over an example implementation of Word2Vec and some existing packages that perform Word2Vec training.

Contents
  1.  Preliminary Steps
      * [Load Packages](#load_packages)
      * [Preprocess Data](#preprocess_data)
      * [Quick Background](#quick_background)
  2. [Implementation from Scratch](#implementation_from_scratch)
      * Word2Vec Flavors: Continuous Bag of Words (CBOW) / Skip-gram (SG)
      * Training
      * Retrieving the trained matrix
  3. Using an Existing Package

__At each step, we will also cover other packages that can be used to achieve the same thing (e.g., `CountVectorizer`).__
  
---

### Load Packages <a id="load_packages"></a>
First, we import the packages we need.

In [1]:
import numpy as np
import spacy
import tqdm
from string import punctuation
from nltk.corpus import gutenberg, stopwords
from collections import namedtuple

We will use data from the `gutenberg` corpus and normalize the input data.

In [2]:
sents = gutenberg.sents('melville-moby_dick.txt')

`[insert navigation here]`

---
### Preprocess Data <a id="preprocess_data"></a>
##### Normalize Vocabulary

To improve processing and the quality of our lexical items, we normalize our language data.

Normalizing data can involve a variety of tasks depending on the final application of our language model. These tasks include making all words the same case, removing punctuation, and removing **<a id="stopword" style="text-decoration: none; cursor: help;" title="Words that contribute little semantic information to a text">stopwords</a>**.

For this presentation, we will use **<a id="token" style="text-decoration: none; cursor: help;" title="Combinations of characters separated by spaces (e.g., words, numbers)">tokens</a>** that are neither punctuation nor stopwords.

Let's quickly define some functions we will use to pare our text data down.

In [3]:
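stopwords_ = set(stopwords.words('english'))  # a set gives fast membership checks

def is_stopword(token):
    ''' Check if a specified token is a stopword. '''
    try:
        return token.lower() in stopwords_
    except AttributeError:  # token is not a string
        return False

def is_valid_token(token):
    ''' Check if token is valid, i.e., not a stopword or punctuation '''
    try:
        # Use logical operators so the result is a proper bool
        return token.isalnum() and not is_stopword(token)
    except AttributeError:  # token is not a string
        return False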
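
To see what this filtering does to a single sentence, here is a small self-contained sketch. It assumes a tiny, hypothetical stopword set (`TOY_STOPWORDS`) in place of the NLTK list, and the helper names `toy_is_valid_token` and `toy_normalize` are made up for illustration.

```python
# Hypothetical stand-in for the NLTK stopword list, kept tiny for illustration.
TOY_STOPWORDS = {"the", "a", "of"}

def toy_is_valid_token(token):
    """Keep alphanumeric tokens that are not in the toy stopword set."""
    return token.isalnum() and token.lower() not in TOY_STOPWORDS

def toy_normalize(sentence):
    """Lower-case a tokenized sentence and drop punctuation and stopwords."""
    return [token.lower() for token in sentence if toy_is_valid_token(token)]

toy_sentence = ["Call", "me", "Ishmael", ".", "The", "whale", "of", "legend", "!"]
toy_normalized = toy_normalize(toy_sentence)
print(toy_normalized)
```

Punctuation fails the `isalnum()` check, and stopwords are matched case-insensitively, so both `"The"` and `"of"` are dropped.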

##### First Normalization Step
Next, we create our `normalized_sents` data using the functions we just defined. We also drop sentences that have three or fewer tokens left after filtering.

In [114]:
%%time
normalized_sents = [[word.lower() for word in sent if is_valid_token(word)] for sent in sents]
normalized_sents = [sent for sent in normalized_sents if len(sent) > 3]

CPU times: user 1.85 s, sys: 39.6 ms, total: 1.89 s
Wall time: 1.9 s


With our `normalized_sents`, we can create a `dict` that maps each string to an integer for faster processing down the line.

In [119]:
flattened_text = [word for sent in normalized_sents for word in sent]

WORDS = np.unique(flattened_text)
INDEX = np.arange(WORDS.size)

VOCAB = dict(zip(WORDS, INDEX))
VOCABR = dict(zip(INDEX, WORDS))
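
The same mapping can be tried on a handful of tokens. This is a minimal sketch with made-up data; the `toy_*` names are hypothetical and only mirror `WORDS`, `INDEX`, `VOCAB`, and `VOCABR` above.

```python
import numpy as np

toy_tokens = ["whale", "sea", "ship", "whale", "captain", "sea"]

# np.unique returns the sorted unique tokens, so indices follow alphabetical order.
toy_words = np.unique(toy_tokens)
toy_index = np.arange(toy_words.size)

toy_vocab = dict(zip(toy_words, toy_index))    # token -> index
toy_vocab_r = dict(zip(toy_index, toy_words))  # index -> token

# Round trip: strings to integers and back again.
encoded = [int(toy_vocab[token]) for token in toy_tokens]
decoded = [str(toy_vocab_r[i]) for i in encoded]
print(encoded)
print(decoded)
```

Repeated tokens share one index, and the reverse dictionary recovers the original strings.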

We now define a `to_index` function to convert strings to integers for faster processing.

In [120]:
def to_index(token, reference=VOCAB, indices=INDEX):
    ''' Convert an input token into an integer according to a specified reference
        
        Parameters
        ----------
            token (str) : token to convert to an integer
            reference (dict) : Mapping from each unique vocabulary token to its index
            indices (list, array) : Reference array with vocabulary indices (currently unused)
        
        Returns
        -------
            An integer representing the input token's position in the reference object
    
        Notes
        -----
            Function assumes input token is already in lower case
    '''
    return reference[token]

##### Second Normalization Step

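Using the conversion function defined above, we can convert our `normalized_sents` into `data`, which contains only integers that will be used in our Word2Vec example. Because `data` is a generator expression, the conversion is lazy: the timing below covers only creating the generator, and the actual work happens when `data` is consumed.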

In [121]:
%%time
data = (np.array([to_index(word) for word in sent]) for sent in normalized_sents)

CPU times: user 5.48 ms, sys: 3.31 ms, total: 8.79 ms
Wall time: 11.6 ms


`[insert navigation here]`

---
### Preprocessing Text

Reference for later: [Dimensions greater than 300 have diminishing returns](https://www.aclweb.org/anthology/D14-1162/)

### Quick Background <a id="quick_background"></a>
#### Background on Word2Vec

* What is it?
  - Quick definition (implementation of theoretical matrix bit)
  - What it does to text.
  - What the output is.
  
* Why do we need it?
  - Many uses in AI.
  - Usage in NLP
  - Pros
  - Cons
 
* What cool things can it do?
  - Condense text into a lightweight matrix
  - Provide semantic abilities
  - Enable data to have algebraic properties
 
* What are competing models?
  - Other models to represent text
  - GloVe
  - Other vectorization models
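
The "algebraic properties" point in the outline above can be illustrated with a toy example. The vectors below are hand-made, not trained, and the helper names `cosine_similarity` and `nearest` are hypothetical; the sketch only shows the kind of arithmetic that trained word vectors are often used for.

```python
import numpy as np

# Hand-made 2-D vectors (not trained): axis 0 ~ "royalty", axis 1 ~ "gender".
toy_vectors = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "ship":  np.array([-1.0, 0.2]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest(vector, vectors, exclude=()):
    """Return the word whose vector is most similar to `vector`."""
    candidates = {w: v for w, v in vectors.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine_similarity(vector, candidates[w]))

# king - man + woman should land closest to queen in this toy space.
query = toy_vectors["king"] - toy_vectors["man"] + toy_vectors["woman"]
answer = nearest(query, toy_vectors, exclude=("king", "man", "woman"))
print(answer)
```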

`[insert navigation here]`

---
## Implementation from Scratch <a id="implementation_from_scratch"></a>

For this example, we will create a Word2Vec language model using the data we preprocessed above.

In this section, we will develop a **<a id="cbow" style="text-decoration: none; cursor: help;" title="Continuous Bag of Words">CBOW</a>** flavored Word2Vec model.

We will:
  * Create CBOW windows
  * Create preliminary <a id='one-hot' style='text-decoration: none; cursor: help;' title='A vector that is comprised of zeros and ones indicating absence or presence of a value'>one-hot vectors</a>


###### Parameters <a id='parameters'></a>
First we define some hyperparameters that we use for training.

In [122]:
parameters = {'window_size' : 2, 'dimensions' : 100, 'learning_rate' : 0.02, 'epochs' : 500}

This table quickly describes what each parameter does.

Parameters | Data Type | Description
--- | :-: | :--
`window_size` | `int` | The number of context tokens before and after a central (target) token to include
`dimensions` | `int` | The number of dimensions in the hidden layer. Dimensions greater than 300 have diminishing returns ([reference](https://www.aclweb.org/anthology/D14-1162/)).
`learning_rate` | `float` | The step size of each weight update, i.e., how quickly our model will correct itself
`epochs` | `int` | The number of passes the model makes over the training data

##### Creating the Training Data <a id="creating_the_training_data"></a>

**Windowing** : We will generate loose <a id='one-hot' style='text-decoration: none; cursor: help;' title='A vector that is comprised of zeros and ones indicating absence or presence of a value'>one-hot vectors</a> that will serve as input and target data when training our model.
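
As a rough illustration of windowing, the sketch below pairs each target token with its neighbors in a toy sentence. It assumes the context excludes the target itself and is truncated at sentence boundaries; `context_windows` is a hypothetical helper, not part of this notebook's pipeline.

```python
def context_windows(sentence, window_size=2):
    """Yield (target, context) pairs; the context excludes the target itself."""
    for i, target in enumerate(sentence):
        start = max(i - window_size, 0)
        # Up to `window_size` tokens before and after position i.
        context = sentence[start:i] + sentence[i + 1:i + 1 + window_size]
        yield target, context

toy_sentence = ["call", "me", "ishmael", "some", "years", "ago"]
pairs = list(context_windows(toy_sentence, window_size=2))

for target, context in pairs:
    print(f"{target:>8} <- {context}")
```

Tokens near the start or end of a sentence get shorter contexts, which is why very short sentences are filtered out before windowing.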

First we filter our data to ensure we have sufficient data to window.

In [123]:
%%time
data = [sent for sent in data if sent.size >= parameters['window_size'] + 1]

CPU times: user 96.5 ms, sys: 6.52 ms, total: 103 ms
Wall time: 110 ms


Now, we generate our one-hot vectors using the `VOCAB` as a model for our vector.

Let's define a few functions to help us generate these data.

In [179]:
Datum = namedtuple('Datum', 'target context'.split())

def one_hot(token, size=WORDS.size, key=VOCAB):
    ''' Convert a token index into a one-hot column vector
        
        Parameters
        ----------
            token (int) : vocabulary index of the token to encode
            size (int) : length of one-hot array to return
            key (dict) : key-value mapping for tokens and indices (currently unused)
        
        Returns
        -------
            (np.array) A one-hot vector of shape (size, 1)
    '''
    vector = np.zeros((size, 1))
    vector[token] = 1
    return vector

def process_sentence(sent, processed_sentence=None, window_size=parameters['window_size']):
    ''' Return a dictionary with token keys and contexts as values
        
        Parameters
        ----------
            sent (array) : sentence of token indices to convert into one-hot
                           vectors and group into target and context
            processed_sentence (dict) : Previously processed sentences to add to
            window_size (int) : Number of context tokens on each side of the target
        
        Returns
        -------
            (dict) object with tokens as keys and their corresponding
                   one-hot vector targets and context in the following format:
            
            >> { token : Datum(target, [[context-1, context-2, ...], ...]) }
    '''
    # Avoid a mutable default argument, which would be shared across calls
    if processed_sentence is None:
        processed_sentence = {}

    for i, token in enumerate(sent):
        # Take up to `window_size` tokens on each side, excluding the target itself
        a, b = max(i - window_size, 0), i + window_size + 1
        before, after = sent[a:i], sent[i + 1:b]

        context = [one_hot(context_token) for context_token in np.append(before, after)]

        if token in processed_sentence:
            processed_sentence[token].context.append(context)
        else:
            target = one_hot(token)
            processed_sentence[token] = Datum(target, [context])

    return processed_sentence
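
A quick way to see why one-hot vectors are useful here: multiplying a transposed one-hot vector by a weight matrix simply selects one row of that matrix. The sketch below uses a small random matrix with made-up sizes; `toy_one_hot` and `toy_embeddings` are hypothetical names.

```python
import numpy as np

def toy_one_hot(index, size):
    """Return a (size, 1) column vector with a single 1 at `index`."""
    vector = np.zeros((size, 1))
    vector[index] = 1
    return vector

rng = np.random.default_rng(0)
toy_embeddings = rng.random((5, 3))   # 5 tokens, 3 dimensions

x = toy_one_hot(2, size=5)            # shape (5, 1)
selected = x.T @ toy_embeddings       # shape (1, 3)

# The product is exactly row 2 of the matrix.
print(np.allclose(selected[0], toy_embeddings[2]))
```

Note the transpose: a `(5, 1)` column vector has to become `(1, 5)` before it can be multiplied with a `(5, 3)` matrix.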

`For debugging purposes: the sentence at index 10 contains duplicate tokens.`

In [180]:
test = process_sentence(data[10])

Loop through all the sentences to generate `target` and `context` data for training.

In [181]:
processed_sentences = {}
for sent in tqdm.tqdm(data):
    processed_sentences = process_sentence(sent, processed_sentences)


100%|██████████| 7428/7428 [00:12<00:00, 596.74it/s]


`[insert navigation here]`

---
##### Create Layers <a id='create_layers'></a>

These matrices will serve as the layers that surround our `word2vec` layer during training.

In [182]:
matrix_1 = np.random.random((WORDS.size, parameters['dimensions']))
matrix_2 = np.random.random((WORDS.size, parameters['dimensions']))
hidden_layer = np.random.random((WORDS.size, 1))

In [194]:
# Transpose the (vocabulary, 1) one-hot vector so that it aligns with matrix_1 (vocabulary, dimensions)
processed_sentences[0].context[0][0].T.dot(matrix_1)
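
For orientation, here is a minimal sketch of what one CBOW training step could look like with plain NumPy. It is not this notebook's implementation: it assumes an input matrix of shape `(vocab_size, dimensions)`, an output matrix of shape `(dimensions, vocab_size)`, a full softmax, and integer token indices instead of explicit one-hot vectors. The names `softmax`, `cbow_step`, `w_in`, and `w_out` are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of scores."""
    e = np.exp(z - z.max())
    return e / e.sum()

def cbow_step(context_ids, target_id, w_in, w_out, learning_rate):
    """Run one SGD step of CBOW with a full softmax and return the loss."""
    context_ids = np.asarray(context_ids)
    hidden = w_in[context_ids].mean(axis=0)     # (dimensions,) average context vector
    probs = softmax(hidden @ w_out)             # (vocab_size,) predicted distribution
    loss = -np.log(probs[target_id])            # cross-entropy against the target

    # Gradient of the loss with respect to the output scores.
    error = probs.copy()
    error[target_id] -= 1.0

    grad_out = np.outer(hidden, error)          # (dimensions, vocab_size)
    grad_hidden = w_out @ error                 # (dimensions,), uses pre-update w_out

    w_out -= learning_rate * grad_out
    # Each context row contributed 1/len(context_ids) of the hidden vector.
    np.subtract.at(w_in, context_ids, learning_rate * grad_hidden / len(context_ids))
    return float(loss)

rng = np.random.default_rng(0)
vocab_size, dimensions = 6, 4
w_in = rng.normal(scale=0.1, size=(vocab_size, dimensions))
w_out = rng.normal(scale=0.1, size=(dimensions, vocab_size))

# Repeatedly train on one (context, target) example and watch the loss fall.
context_ids, target_id = [0, 1, 3, 4], 2
losses = [cbow_step(context_ids, target_id, w_in, w_out, learning_rate=0.1)
          for _ in range(50)]
print(losses[0], losses[-1])
```

Under these assumptions, the rows of `w_in` play the role of the word vectors after training, so "retrieving the trained matrix" amounts to reading `w_in`.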

## Using an Existing Package

Several existing implementations let you use Word2Vec out of the box. In this section, we look at two Python packages: spaCy and `gensim`, the latter with an API implementing Word2Vec.

Contents
  * spaCy
  * gensim
 
 
### spaCy

In [None]:
nlp = spacy.load('en_core_web_sm')