# Presentation on Word2Vec
Glenn Abastillas | 24 March, 2020

This notebook goes over an example implementation of Word2Vec and some existing packages that perform Word2Vec training.

Contents
  1. [Load Packages](#load_packages), [Preprocess Data](#preprocess_data), and some [Quick Background](#quick_background)
  2. [Implementation from Scratch](#implementation_from_scratch)
      * Word2Vec Flavors: Continuous Bag of Words (CBOW) / Skip-gram (SG)
      * Training
      * Retrieving the trained matrix
  3. Using an Existing Package

__At each step, we will also cover other packages that can be used to achieve the same thing (e.g., `CountVectorizer`).__
  
---

### Load Packages <a id="load_packages"></a>
First we import packages and clean the data.

In [51]:
import numpy as np
import spacy
from string import punctuation
from nltk.corpus import gutenberg, stopwords

We will use data from the `gutenberg` corpus and normalize the input data.

In [42]:
sents = gutenberg.sents('melville-moby_dick.txt')

`[insert navigation here]`

---
### Preprocess Data <a id="preprocess_data"></a>
##### Normalize Vocabulary

To improve processing and richness of our lexical items, we normalize our language data. 

Normalizing data can involve a variety of tasks depending on the final application of our language model. These tasks include making all words the same case, removing punctuation, and removing **<a id="stopword" style="text-decoration: none; cursor: help;" title="Words that contribute little semantic information to a text">stopwords</a>**.

For this presentation, we will use **<a id="token" style="text-decoration: none; cursor: help;" title="Combinations of characters separated by spaces (e.g., words, numbers)">tokens</a>** that are neither punctuation nor stopwords.

Let's quickly define some functions we will use to pare our text data down.

In [52]:
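# Bind the stopword list to a new name so the imported `stopwords` module is not shadowed,
# and store it as a set for constant-time membership checks.
STOPWORDS = set(stopwords.words('english'))

def is_stopword(token):
    ''' Check if a specified token is a stopword. '''
    return token.lower() in STOPWORDS

def is_valid_token(token):
    ''' Check if token is valid, i.e., alphanumeric and not a stopword '''
    # Use boolean `and`/`not`: bitwise `~` on a bool yields an integer (-1 or -2), not a bool.
    return token.isalnum() and not is_stopword(token)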

Next, we create our `normalized_text` data using the functions we just defined.

In [54]:
normalized_text = [[word.lower() for word in sent if is_valid_token(word)] for sent in sents]
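
To see what this filtering does, here is a minimal sketch on two toy sentences. It assumes a tiny hand-written stopword set (`TOY_STOPWORDS`, a hypothetical stand-in for the NLTK list) so that it runs without downloading any corpora; the helper `is_valid_toy_token` is likewise only for illustration.

```python
# Hypothetical stand-in for the NLTK English stopword list (the real list is much longer).
TOY_STOPWORDS = {'the', 'a', 'of', 'me', 'is'}

def is_valid_toy_token(token):
    ''' Keep alphanumeric tokens that are not in the toy stopword set. '''
    return token.isalnum() and token.lower() not in TOY_STOPWORDS

# Toy input shaped like `gutenberg.sents(...)`: a list of tokenized sentences.
toy_sents = [['Call', 'me', 'Ishmael', '.'],
             ['The', 'whale', 'is', 'white', '!']]

# Same comprehension as above: lowercase every token that passes the filter.
toy_normalized = [[word.lower() for word in sent if is_valid_toy_token(word)]
                  for sent in toy_sents]

print(toy_normalized)  # [['call', 'ishmael'], ['whale', 'white']]
```

Punctuation fails the `isalnum()` check and stopwords fail the set lookup, so only lowercase content words remain.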

With our `normalized_text`, we can create an array of unique vocabulary items to convert the strings into numbers for faster processing down the line.

In [78]:
flattened_text = [word for sent in normalized_text for word in sent]

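# WORDS holds the sorted unique vocabulary; INDEX holds each word's first position in flattened_text
WORDS, INDEX = np.unique(flattened_text, return_index=True)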

We now define a `string_to_index` function to convert strings to integers for faster processing.

In [57]:
def string_to_index(string, reference=WORDS):
    ''' Convert an input string into an integer according to a specified reference
        
        Parameters
        ----------
            string (str) : String to convert to an integer
            reference (list, array) : Reference array with unique vocabulary
        
        Returns
        -------
            An integer representing the input string's position in the reference object
        
    '''
    return np.argwhere(reference == string.lower())[0][0]

In [60]:
np.random.choice(WORDS, 10)

array(['overflowing', 'quiescence', 'cankerous', 'appals', 'attuned',
       'admirers', 'creates', 'convey', '1652', 'pedestrian'],
      dtype='<U20')

We do not need to define a function to convert integers back to strings: `numpy`'s indexing does that for us using the `WORDS` array we created previously.
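
The round trip between strings and integers can be sketched on a toy token list. This is an illustration under stated assumptions: the names `toy_tokens`, `toy_words`, and `toy_ids` are hypothetical, and `np.searchsorted` is swapped in as an alternative to the `np.argwhere` lookup above (it relies on `np.unique` returning a sorted array).

```python
import numpy as np

toy_tokens = ['whale', 'sea', 'ship', 'whale', 'captain', 'sea']

# np.unique returns the sorted unique tokens; return_inverse gives, for each
# original token, its position in that sorted array.
toy_words, toy_ids = np.unique(toy_tokens, return_inverse=True)

# String -> integer: binary search in the sorted vocabulary.
# Caveat: for a word that is not in the vocabulary, searchsorted still returns
# an insertion position, so membership must be checked separately.
whale_id = int(np.searchsorted(toy_words, 'whale'))

# Integer -> string: plain NumPy indexing with the integer ids.
decoded = toy_words[toy_ids]

print(toy_words)         # ['captain' 'sea' 'ship' 'whale']
print(toy_ids.tolist())  # [3, 1, 2, 3, 0, 1]
print(whale_id)          # 3
```

Indexing `toy_words` with the whole `toy_ids` array recovers the original token sequence in one step.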

### Preprocessing Text

Use this reference for later [Dimensions greater than 300 have diminishing returns](https://www.aclweb.org/anthology/D14-1162/)

In [41]:
text_ = [[string_to_index(word) for word in sent] for sent in normalized_text]

In [None]:
data = []

### Quick Background <a id="quick_background"></a>
#### Background Into Word2Vec

* What is it?
  - Quick definition (implementation of theoretical matrix bit)
  - What it does to text.
  - What the output is.
  
* Why do we need it?
  - Many uses in AI.
  - Usage in NLP
  - Pros
  - Cons
 
* What cool things can it do?
  - Condense text into a lightweight matrix
  - Provide semantic abilities
  - Enable data to have algebraic properties
 
* What are competing models?
  - Other models to represent text
  - GloVe
  - Other vectorization models
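
The "algebraic properties" point above can be illustrated with a minimal sketch. The vectors below are hand-picked 3-dimensional toy values, not trained embeddings, and `toy_vectors` and `cosine_similarity` are hypothetical names used only for this example.

```python
import numpy as np

# Hand-picked toy vectors (NOT trained embeddings), chosen so the analogy works.
toy_vectors = {
    'king':  np.array([0.9, 0.8, 0.1]),
    'queen': np.array([0.9, 0.1, 0.8]),
    'man':   np.array([0.1, 0.9, 0.1]),
    'woman': np.array([0.1, 0.1, 0.9]),
    'apple': np.array([0.2, 0.5, 0.5]),
}

def cosine_similarity(a, b):
    ''' Cosine of the angle between two vectors. '''
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Vector arithmetic: king - man + woman
target = toy_vectors['king'] - toy_vectors['man'] + toy_vectors['woman']

# Rank the remaining words by similarity to the target, excluding the query words.
query_words = {'king', 'man', 'woman'}
candidates = {word: cosine_similarity(target, vec)
              for word, vec in toy_vectors.items() if word not in query_words}
best = max(candidates, key=candidates.get)

print(best)  # queen
```

With real trained vectors the same arithmetic is applied to much higher-dimensional rows of the embedding matrix, and the result is only approximate.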

## Implementation from Scratch <a id="implementation_from_scratch"></a>

Word2Vec 

Subsets
  * Bullet
  
First we define some hyperparameters that we use for training.

In [None]:
settings = {'window_size' : 2, 'dimensions' : 100, 'learning_rate' : 0.02, 'epochs' : 500}
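
To make the role of these hyperparameters concrete, here is a minimal skip-gram sketch, not a definitive implementation. It assumes a hypothetical toy corpus of integer ids, uses smaller hypothetical values (`toy_settings`) than the `settings` above so it runs instantly, and trains with a full softmax rather than the negative sampling or hierarchical softmax used by the original Word2Vec. CBOW would instead average the context vectors and predict the center word.

```python
import numpy as np

def skipgram_pairs(sentence, window_size):
    ''' Return (center, context) pairs for each token within the window. '''
    pairs = []
    for i, center in enumerate(sentence):
        start = max(0, i - window_size)
        stop = min(len(sentence), i + window_size + 1)
        pairs.extend((center, sentence[j]) for j in range(start, stop) if j != i)
    return pairs

def softmax(scores):
    ''' Numerically stable softmax over a 1-D array of scores. '''
    exps = np.exp(scores - scores.max())
    return exps / exps.sum()

# Hypothetical toy corpus, already converted to integer ids (5 distinct words).
toy_corpus = [[0, 1, 2, 3], [2, 3, 4, 0]]
toy_vocab_size = 5

# Hypothetical, smaller values than the `settings` above.
toy_settings = {'window_size': 1, 'dimensions': 8, 'learning_rate': 0.1, 'epochs': 200}

pairs = [pair for sent in toy_corpus
         for pair in skipgram_pairs(sent, toy_settings['window_size'])]

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(toy_vocab_size, toy_settings['dimensions']))
W_out = rng.normal(scale=0.1, size=(toy_settings['dimensions'], toy_vocab_size))

losses = []
for epoch in range(toy_settings['epochs']):
    total_loss = 0.0
    for center, context in pairs:
        hidden = W_in[center].copy()          # embedding of the center word
        probs = softmax(hidden @ W_out)       # predicted distribution over context words
        total_loss -= np.log(probs[context])  # cross-entropy for the true context word

        grad_scores = probs.copy()
        grad_scores[context] -= 1.0           # d(loss)/d(scores)
        grad_hidden = W_out @ grad_scores     # d(loss)/d(hidden), computed before W_out changes

        W_out -= toy_settings['learning_rate'] * np.outer(hidden, grad_scores)
        W_in[center] -= toy_settings['learning_rate'] * grad_hidden
    losses.append(total_loss / len(pairs))

# Retrieving the trained matrix: each row of W_in is one word's vector.
embeddings = W_in
print(embeddings.shape, round(losses[0], 3), round(losses[-1], 3))
```

`window_size` controls how many neighbors on each side become context words, `dimensions` sets the width of each row in `W_in`, and `learning_rate` and `epochs` control the size and number of gradient updates.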

## Using an Existing Package

Several existing packages let you use Word2Vec technology out of the box. In this section, we look at two Python packages, spaCy and `gensim`; the latter provides an API implementing Word2Vec.

Contents
  * spaCy
  * gensim
 
 
### spaCy

In [None]:
nlp = spacy.load('en_core_web_sm')