# Presentation on Word2Vec
Glenn Abastillas | 24 March, 2020

This notebook goes over an example implementation of Word2Vec and some existing packages that perform Word2Vec training.

Contents
  1.  Preliminary Steps
      * [Load Packages](#load_packages)
      * [Preprocess Data](#preprocess_data)
      * [Quick Background](#quick_background)
  2. [Implementation from Scratch](#implementation_from_scratch)
      * Word2Vec Flavors: Continuous Bag of Words (CBOW) / Skip Grams (SG)
      * Training
      * Retrieving the trained matrix
  3. Using an Existing Package

__At each step, we will also cover other packages that can be used to achieve the same thing (e.g., `CountVectorizer`)__
  
---

### Load Packages <a id="load_packages"></a>
First, we import the packages used throughout this notebook.

In [1]:
import numpy as np
import spacy
import tqdm
from string import punctuation
from nltk.corpus import gutenberg, stopwords

We will use data from the `gutenberg` corpus and normalize the input data.

In [2]:
sents = gutenberg.sents('melville-moby_dick.txt')

`[insert navigation here]`

---
### Preprocess Data <a id="preprocess_data"></a>
##### Normalize Vocabulary

To speed up processing and make our lexical items more informative, we normalize our language data.

Normalizing data can involve a variety of tasks, depending on the final application of our language model. These tasks include converting all words to the same case, removing punctuation, and removing **<a id="stopword" style="text-decoration: none; cursor: help;" title="Words that contribute little semantic information to a text">stopwords</a>**.

For this presentation, we will use **<a id="token" style="text-decoration: none; cursor: help;" title="Combinations of characters separated by spaces (e.g., words, numbers)">tokens</a>** that are neither punctuation nor stopwords.

Let's quickly define some functions we will use to pare our text data down.

In [3]:
stopwords_ = stopwords.words('english')

def is_stopword(token):
    ''' Check if a specified token is a stopword. 
    
        Parameters
        ----------
            token (str) : token to check
        
        Returns
        -------
            (boolean) True if token is a stopword, else False
        
        Notes
        -----
            Inputs that are not strings return False
    '''
    try:
        return token.lower() in stopwords_
    except AttributeError:  # non-string input has no .lower()
        return False

def is_valid_token(token):
    ''' Check if token is valid, i.e., not a stopword or punctuation
    
        Parameters
        ----------
            token (str) : token to check
        
        Returns
        -------
            (boolean) True if token is alphanumeric and not a stopword, else False
        
        Notes
        -----
            Inputs that are not strings return False
    '''
    try:
        return token.isalnum() and not is_stopword(token)
    except AttributeError:  # non-string input has no .isalnum()
        return False

##### First Normalization Step
Next, we create our `normalized_sents` data using the functions we just defined.

In [11]:
%%time
normalized_sents = [[word.lower() for word in sent if is_valid_token(word)] for sent in sents]

Wall time: 2.45 s
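
To make the effect of this step concrete, here is a minimal sketch of the same filtering on a single toy sentence. The tiny `TOY_STOPWORDS` set and the `normalize` helper are assumptions for illustration only; the notebook itself uses NLTK's English stopword list.

```python
# Toy stand-in for NLTK's English stopword list (assumption: a tiny subset).
TOY_STOPWORDS = {'the', 'a', 'of', 'is', 'and'}

def normalize(sentence, stopwords=TOY_STOPWORDS):
    """Lowercase tokens and keep only alphanumeric non-stopwords."""
    return [token.lower() for token in sentence
            if token.isalnum() and token.lower() not in stopwords]

sample = ['The', 'whale', 'is', 'a', 'mammal', '!']
print(normalize(sample))  # ['whale', 'mammal']
```

Punctuation fails the `isalnum()` check and stopwords fail the membership check, so only the content words survive.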


With `normalized_sents` in hand, we build a sorted array of unique tokens, `VOCAB`, so that each string can be converted into a number for faster processing down the line.

In [6]:
flattened_text = [word for sent in normalized_sents for word in sent]

VOCAB, INDEX = np.unique(flattened_text, return_index=True)

We now define a `to_index` function to convert strings to integers for faster processing.

In [8]:
def to_index(token, reference=VOCAB):
    ''' Convert an input token into an integer according to a specified reference
        
        Parameters
        ----------
            token (str) : token to convert to an integer
            reference (list, array) : Reference array with unique vocabulary
        
        Returns
        -------
            An integer representing the input token's position in the reference object
    
        Notes
        -----
            Function assumes input token is already in lower case
        
    '''
    return np.argwhere(reference == token)[0][0]
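
`to_index` scans the whole reference array on every call, which becomes slow for a large vocabulary. As a possible alternative (not the approach used in this notebook), a plain `dict` built once from the unique tokens gives constant-time lookups. The names `toy_tokens`, `vocab`, and `token_to_index` are hypothetical.

```python
import numpy as np

toy_tokens = ['whale', 'sea', 'ship', 'whale', 'captain', 'sea']

# np.unique returns the unique tokens in sorted order.
vocab = np.unique(toy_tokens)

# Build the lookup table once; each later lookup is a hash access.
token_to_index = {token: i for i, token in enumerate(vocab.tolist())}

encoded = [token_to_index[token] for token in toy_tokens]
print(vocab.tolist())  # ['captain', 'sea', 'ship', 'whale']
print(encoded)         # [3, 1, 2, 3, 0, 1]
```

The indices agree with what `np.argwhere(vocab == token)[0][0]` would return, because both rely on the same sorted array.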

##### Second Normalization Step

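Using the conversion function defined above, we convert our `normalized_sents` into `data`, which contains only the integers used in our Word2Vec example. Because `data` is built as a generator expression, nothing is computed yet: the indices are produced only when `data` is consumed in a later cell, which is why this cell reports such a short wall time.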

In [29]:
%%time
data = ([to_index(word) for word in sent] for sent in normalized_sents)

Wall time: 2.96 ms


`[insert navigation here]`

---
### Preprocessing Text

Use this reference for later [Dimensions greater than 300 have diminishing returns](https://www.aclweb.org/anthology/D14-1162/)

### Quick Background <a id="quick_background"></a>
#### Background on Word2Vec

* What is it?
  - Quick definition (implementation of theoretical matrix bit)
  - What it does to text.
  - What the output is.
  
* Why do we need it?
  - Many uses in AI.
  - Usage in NLP
  - Pros
  - Cons
 
* What cool things can it do?
  - Condense text into a lightweight matrix
  - Provide semantic abilities
  - Enable data to have algebraic properties
 
* What are competing models?
  - Other models to represent text
  - GloVe
  - Other vectorization models
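
The "semantic abilities" and "algebraic properties" bullets above can be illustrated with a small sketch. The three-dimensional vectors below are hand-made for illustration and are not the output of any trained model; `cosine_similarity` is a hypothetical helper.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hand-made toy vectors (assumption: chosen so the analogy works out).
toy_vectors = {
    'king':  np.array([0.9, 0.8, 0.1]),
    'queen': np.array([0.9, 0.1, 0.8]),
    'man':   np.array([0.1, 0.9, 0.1]),
    'woman': np.array([0.1, 0.2, 0.8]),
}

# Vector arithmetic: king - man + woman
analogy = toy_vectors['king'] - toy_vectors['man'] + toy_vectors['woman']

# Find the word whose vector points in the most similar direction.
best = max(toy_vectors, key=lambda word: cosine_similarity(analogy, toy_vectors[word]))
print(best)  # queen
```

With real embeddings the match is approximate rather than exact, and the query words are usually excluded from the candidates.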

`[insert navigation here]`

---
## Implementation from Scratch <a id="implementation_from_scratch"></a>

For this example, we will create a Word2Vec language model using the data we preprocessed above. 

In this section, we will develop a **<a id="cbow" style="text-decoration: none; cursor: help;" title="Continuous Bag of Words">CBOW</a>**-flavored Word2Vec model.

We will:
  * Create CBOW windows
  * Create preliminary <a id='one-hot' style='text-decoration: none; cursor: help;' title='A vector that is comprised of zeros and ones indicating absence or presence of a value'>one-hot vectors</a>


###### Parameters <a id='parameters'></a>
First we define some hyperparameters that we use for training.

In [20]:
parameters = {'window_size' : 2, 'dimensions' : 100, 'learning_rate' : 0.02, 'epochs' : 500}

This table quickly describes what each parameter does.

Parameters | Data Type | Description
--- | :-: | :--
`window_size` | `int` | The number of context tokens to include on each side of the central (target) token
`dimensions` | `int` | The number of dimensions in the hidden layer. Dimensions greater than 300 have diminishing returns `[cite]`.
`learning_rate` | `float` | Step size for each weight update; larger values correct the model faster but less stably
`epochs` | `int` | The number of passes over the training data
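
To show where each hyperparameter enters training, here is a minimal sketch of a CBOW update loop under stated assumptions: a toy vocabulary of 10 tokens, a full softmax output layer, and plain gradient descent on a single (context, target) pair. The names `W_in`, `W_out`, `softmax`, and `cbow_step` are hypothetical, and this is not necessarily how the rest of this notebook implements training.

```python
import numpy as np

# Same hyperparameters as above; vocab_size is a toy assumption.
parameters = {'window_size': 2, 'dimensions': 100, 'learning_rate': 0.02, 'epochs': 500}
vocab_size = 10

rng = np.random.default_rng(0)
# Input (embedding) and output weight matrices.
W_in = rng.normal(scale=0.1, size=(vocab_size, parameters['dimensions']))
W_out = rng.normal(scale=0.1, size=(parameters['dimensions'], vocab_size))

def softmax(scores):
    """Numerically stable softmax over a 1-D score vector."""
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

def cbow_step(context, target, W_in, W_out, learning_rate):
    """One gradient-descent update for a single (context, target) pair."""
    hidden = W_in[context].mean(axis=0)    # average of the context embeddings
    probs = softmax(hidden @ W_out)        # predicted distribution over the vocabulary
    loss = -np.log(probs[target])          # cross-entropy for the true target
    error = probs.copy()
    error[target] -= 1.0                   # gradient of the loss w.r.t. the scores
    grad_hidden = W_out @ error            # computed before W_out is updated
    W_out -= learning_rate * np.outer(hidden, error)
    # Each context row gets an equal share; .at handles repeated indices.
    np.subtract.at(W_in, context, learning_rate * grad_hidden / len(context))
    return loss

# One toy pair: window_size tokens on each side of the target index 3.
context, target = [1, 2, 4, 5], 3
losses = [cbow_step(context, target, W_in, W_out, parameters['learning_rate'])
          for _ in range(parameters['epochs'])]
print(f'loss: {losses[0]:.3f} -> {losses[-1]:.3f}')
```

In this sketch, the rows of `W_in` after training play the role of the word vectors, with row `i` belonging to the token at index `i` in the vocabulary.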

###### Creating the Training Data <a id="creating_the_training_data"></a>

**Windowing** : We will generate loose <a id='one-hot' style='text-decoration: none; cursor: help;' title='A vector that is comprised of zeros and ones indicating absence or presence of a value'>one-hot vectors</a> that will serve as input and target data when training our model.

First, we drop sentences with fewer than `window_size + 1` tokens, since they are too short to window.

In [30]:
%%time
data = [sent for sent in data if len(sent) >= parameters['window_size'] + 1]

Wall time: 36.7 s
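
As a sketch of what windowing could look like, the hypothetical helper `cbow_windows` below pairs every token in an index-encoded sentence with up to `window_size` neighbors on each side. The sample sentence is the first index-encoded sentence shown in this notebook's output; truncating the window at sentence boundaries is an assumption, not a confirmed design choice.

```python
def cbow_windows(sentence, window_size):
    """Yield (context, target) pairs for every position in an index-encoded sentence."""
    for i, target in enumerate(sentence):
        left = sentence[max(0, i - window_size):i]     # up to window_size tokens before
        right = sentence[i + 1:i + 1 + window_size]    # up to window_size tokens after
        context = left + right
        if context:                                    # skip one-token sentences
            yield context, target

sentence = [9521, 4196, 7041, 9312, 82]
pairs = list(cbow_windows(sentence, window_size=2))

for context, target in pairs:
    print(context, '->', target)
```

Tokens near the edges of a sentence get shorter contexts; only the middle token here has a full context of four neighbors.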


Now we create a zero vector with one slot per `VOCAB` entry to serve as the template for our one-hot vectors, and inspect the first 200 index-encoded sentences.

In [31]:
vector = np.zeros(VOCAB.size)

for datum in data[:200]:
    print(datum)

[9521, 4196, 7041, 9312, 82]
[14536, 8513, 3264, 16057, 6540, 12817]
[10464, 16057, 15016, 2852, 6941, 1739, 1870, 12963]
[5183, 4684, 10185, 8669, 6541, 11748, 6790, 9528, 4876, 6286, 5758, 8398, 9831, 16856]
[8907, 4683, 10185, 6541, 13728, 9412, 12137, 9629]
[14746, 6782, 12817, 10298, 14832, 9795, 16587, 5724, 2193, 15197, 8598, 7428, 8652, 6718, 574, 577, 9053, 13348, 16843, 3977, 15446]
[682, 9796, 12555, 12513, 3780]
[7365, 846, 16131]
[16397, 16409, 7383, 12510, 16400]
[16686, 676, 12769]
[10695, 10027, 10027, 5594]
[10695, 10027, 10027, 5108]
[5382, 14536, 14357, 14357, 8683]
[12976, 9343, 10451, 2113, 6671, 16859, 11162, 4162, 14357, 14357, 788, 6497, 8850, 16128, 14263, 14042, 4715, 10886, 16607, 11861, 570, 16597, 3443, 754, 5693, 1772, 16608, 12642, 11477]
[14948, 9751, 5187, 2339, 8593, 14746, 7075, 10908, 16587, 14088, 7280, 1083, 5382, 16176, 6514, 2452]
[15254, 663, 1091, 6306, 16571, 11104, 787, 5382, 13703, 16103, 5037, 436, 6402, 1578, 5395, 16234, 11519, 12664, 150

[2206, 1510, 9591, 5723, 8906, 8331, 16587, 10419, 10134, 8644, 6019, 16946, 478]
[13405, 15144, 12428, 16587, 717, 15185, 12765, 13981, 15028, 10459, 11387, 11836, 2630, 16875, 16769, 8861]
[11797, 10178, 2137, 5603]
[10514, 2009, 12765, 1495, 6294, 16597, 7997, 14023]
[4752, 3345, 6487]
[9693, 2455, 9232]
[717, 14219, 16587]
[9819, 13211, 16587, 13204, 5134, 9801, 1035, 5692, 4132, 8491, 13861, 16587, 10419, 10134]
[10403, 2455, 9801, 5723, 9218, 12664, 16196]
[9914, 16969, 72]
[9144, 12733, 13290, 10206, 9931, 16736, 10947, 6067, 1959, 4242, 9600, 10464, 10863, 6413, 16381, 16587, 5840, 12912]
[4846, 10048, 13591]
[11732, 8746, 16780, 1728, 4972, 2280, 10206, 16587, 625, 598, 2, 121, 16932, 9860, 13421, 4981, 9415]
[13731, 16587, 13119, 15381, 14742, 505, 3511, 8720, 16647, 12278, 4402, 15024, 6039, 9415]
[8990, 479, 4961, 6090, 1037, 7776, 13861, 16587, 12515, 11935, 5013, 6909, 16699, 5302, 8147, 13618, 5189, 893, 12624, 1728, 6909, 11537, 16125, 14670, 13731, 16072, 4132]
[9232, 
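
Starting from a zero vector like the one created above, index-encoded tokens can be turned into one-hot vectors. The sketch below assumes that the "loose" one-hot vectors mentioned earlier mean a vector with several ones for the context tokens; that reading, the helper name `one_hot`, and the toy vocabulary size are all assumptions.

```python
import numpy as np

def one_hot(indices, vocab_size):
    """Return a vector of zeros with ones at the given index or indices."""
    vector = np.zeros(vocab_size)
    vector[indices] = 1
    return vector

vocab_size = 6   # toy size; the notebook would use VOCAB.size

target_vec = one_hot(2, vocab_size)               # strict one-hot for the target token
context_vec = one_hot([0, 1, 3, 4], vocab_size)   # several ones for the context tokens

print(target_vec)   # [0. 0. 1. 0. 0. 0.]
print(context_vec)  # [1. 1. 0. 1. 1. 0.]
```

With a vocabulary of roughly 17,000 tokens these vectors are almost entirely zeros, which is why implementations often index into the weight matrix directly instead of multiplying by a one-hot vector.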

## Using an Existing Package

Existing packages let you use Word2Vec out of the box. In this section, we look at two Python packages: spaCy and `gensim`, whose API includes a Word2Vec implementation.

Contents
  * spaCy
  * gensim
 
 
### spaCy

In [None]:
nlp = spacy.load('en_core_web_sm')