## Skip-Gram Word2Vec

In this notebook, I'll lead you through using PyTorch to implement the [Word2Vec algorithm](https://en.wikipedia.org/wiki/Word2vec) using the skip-gram architecture. By implementing this, you will learn about embedding words for use in natural language processing. This will come in handy when dealing with things like machine translation.

## Readings
Here are the resources used to build this notebook.
* A really good [conceptual overview](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/) of Word2Vec from Chris McCormick
* [First Word2Vec paper](https://arxiv.org/pdf/1301.3781.pdf) from Mikolov et al.
* [Neural Information Processing Systems paper](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) with improvements for Word2Vec, also from Mikolov et al.
---
## Word embeddings

When you're dealing with words in text, you end up with tens of thousands of word classes to analyze, one for each word in a vocabulary. Trying to one-hot encode these words is massively inefficient because most values in a one-hot vector will be set to zero. As a result, the matrix multiplication between a one-hot input vector and the first hidden layer consists almost entirely of multiplications by zero, which is wasted computation.

To solve this problem and greatly increase the efficiency of our network, we use what are called **embeddings**. Embeddings are just a fully connected layer like you've seen before. We call this layer the embedding layer and the weights are the embedding weights. We skip the multiplication into the embedding layer by instead directly grabbing the hidden layer values from the weight matrix. We can do this because the multiplication of a one-hot encoded vector with a matrix returns the row of the matrix corresponding to the index of the "on" input unit.

<img src="https://github.com/udacity/deep-learning-v2-pytorch/raw/3a95d118f9df5a86826e1791c5c100817f0fd924/word2vec-embeddings/assets/lookup_matrix.png" width=300px>

Instead of doing the matrix multiplication, we use the weight matrix as a lookup table. We encode the words as integers, for example, "heart" is encoded as 958, "mind" as 18094. Then to get hidden layer values for "heart", you take the 958th row of the embedding matrix. This process is called an **embedding lookup** and the number of hidden units is the **embedding dimension**.

To summarize: the embedding lookup table is just a weight matrix, and the embedding layer is just a hidden layer. The lookup is a shortcut for the matrix multiplication, and the lookup table is trained just like any other weight matrix.
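To make the shortcut concrete, here is a minimal sketch in plain Python (no PyTorch needed) showing that multiplying a one-hot vector by a weight matrix gives the same result as reading out one row. The matrix values, the `one_hot` and `vector_matrix_product` helpers, and the chosen word index are all made up for illustration.

```python
# Toy embedding matrix: 5 words in the vocabulary, embedding dimension 3.
# The values are arbitrary and chosen only for illustration.
embedding_weights = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
    [1.3, 1.4, 1.5],
]


def one_hot(index, size):
    """Return a one-hot vector of length `size` with a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def vector_matrix_product(vector, matrix):
    """Multiply a row vector by a matrix: (1 x V) times (V x D) gives (1 x D)."""
    n_cols = len(matrix[0])
    return [
        sum(vector[row] * matrix[row][col] for row in range(len(matrix)))
        for col in range(n_cols)
    ]


word_index = 3  # hypothetical integer code for some word (rows are zero-indexed)

# Full matrix multiplication: almost every term is a multiplication by zero.
via_matmul = vector_matrix_product(
    one_hot(word_index, len(embedding_weights)), embedding_weights
)

# Embedding lookup: just read out the row.
via_lookup = embedding_weights[word_index]

print(via_matmul)
print(via_lookup)
```

Both print statements show the same row of the matrix. The lookup gets there without touching the other rows, which is what makes it so much cheaper for a large vocabulary.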

Embeddings aren't only used for words. They can be used for any model where there is a massive number of classes. A particular type of model called **Word2Vec** uses the embedding layer to find vector representations of words that contain semantic meaning.

---
## Word2Vec

The Word2Vec algorithm finds more efficient representations by finding vectors that represent the words. These vectors also contain semantic information about the words.

<img src="https://github.com/udacity/deep-learning-v2-pytorch/raw/3a95d118f9df5a86826e1791c5c100817f0fd924/word2vec-embeddings/assets/context_drink.png" width=300px>

Words that show up in similar **contexts**, such as "coffee", "tea", and "water", will have vectors near each other. Dissimilar words will be farther away from one another, and relationships can be represented by a distance in vector space.
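One common way to measure how "near" two word vectors are is cosine similarity. The sketch below uses it as an assumed choice of metric, and the three-dimensional vectors are hand-picked for illustration rather than learned by any model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors `a` and `b` (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical vectors, NOT the output of a trained Word2Vec model.
toy_vectors = {
    "coffee": [0.9, 0.8, 0.1],
    "tea": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_drinks = cosine_similarity(toy_vectors["coffee"], toy_vectors["tea"])
sim_mixed = cosine_similarity(toy_vectors["coffee"], toy_vectors["car"])

print(f"coffee vs tea: {sim_drinks:.3f}")
print(f"coffee vs car: {sim_mixed:.3f}")
```

With these toy values, "coffee" and "tea" score much higher than "coffee" and "car". That is the kind of pattern we hope trained embeddings will show for words used in similar contexts.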

There are two architectures for implementing Word2Vec:
>* CBOW (Continuous Bag-Of-Words)
>* Skip-gram

<img src="https://github.com/udacity/deep-learning-v2-pytorch/raw/3a95d118f9df5a86826e1791c5c100817f0fd924/word2vec-embeddings/assets/word2vec_architectures.png" width=500px>

In this implementation, we will be using the **skip-gram architecture** with **negative sampling**: skip-gram is generally reported to perform better than CBOW, and negative sampling makes training faster. Here, we pass in a word and try to predict the words surrounding it in the text. In this way, we can train the network to learn representations for words that show up in similar contexts.

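As a rough illustration of "pass in a word and predict the words surrounding it", the sketch below builds (input word, target word) training pairs from a short token list. The `skip_gram_pairs` helper and the fixed `window_size` are assumptions made for this example. Implementations often vary the window size at random, and negative sampling is not shown here.

```python
def skip_gram_pairs(tokens, window_size=2):
    """Return (center, context) pairs for each token and every neighbour
    that lies within `window_size` positions of it."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window_size)
        stop = min(len(tokens), i + window_size + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


sentence = ["i", "drink", "hot", "coffee", "daily"]
pairs = skip_gram_pairs(sentence, window_size=1)

for center, context in pairs:
    print(f"input: {center:<7} target: {context}")
```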
---
## Loading Data
Next, we will load in data and place it in a directory.
1. Download the text8 dataset, a file of cleaned-up *Wikipedia article text* from Matt Mahoney.
2. Place the data in a `data` folder in the notebook's working directory.
3. Extract it, then delete the zip archive to save storage space.

After following these steps, you should only have one file in your data directory: `data/text8`
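If you would rather do the extract-and-delete step from Python than from the shell, a minimal sketch using only the standard library could look like this. The helper name `extract_and_remove` is made up for this example, and it assumes the archive has already been downloaded.

```python
import os
import zipfile


def extract_and_remove(archive_path, dest_dir):
    """Extract `archive_path` into `dest_dir`, then delete the archive.

    Returns the list of file names that were stored in the archive.
    """
    os.makedirs(dest_dir, exist_ok=True)  # no error if the folder already exists
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(dest_dir)
        extracted = archive.namelist()
    os.remove(archive_path)  # free the space used by the zip file
    return extracted


# Example call, assuming text8.zip is already in the working directory:
# extract_and_remove("text8.zip", "data")
```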



In [1]:
!wget https://s3.amazonaws.com/video.udacity-data.com/topher/2018/October/5bbe6499_text8/text8.zip
!mkdir -p ./data
!unzip text8.zip -d ./data

--2021-01-29 21:14:43--  https://s3.amazonaws.com/video.udacity-data.com/topher/2018/October/5bbe6499_text8/text8.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.99.173
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.99.173|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31344016 (30M) [application/zip]
Saving to: ‘text8.zip’


2021-01-29 21:14:44 (33.4 MB/s) - ‘text8.zip’ saved [31344016/31344016]

Archive:  text8.zip
  inflating: ./data/text8            


In [2]:
# read in the extracted text file
with open('data/text8') as f:
    text = f.read()

# Print out the first 100 characters
print(text[:100])

 anarchism originated as a term of abuse first used against early working class radicals including t


## Pre-processing

Here, we will clean up the text to make training easier. The code for this lives in the `utils.py` file. The `preprocess` function does a few things:

>* It converts any punctuation into tokens, so a period is changed to `<PERIOD>`. In this dataset, there aren't any periods, but it will help in other NLP problems.
>* It removes all words that show up five or *fewer* times in the dataset. This will greatly reduce issues due to noise in the data and improve the quality of the vector representations.
>* It returns a list of words in the text.

This may take a few seconds to run, since the text file is quite large.
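The real implementation lives in `utils.py` and is not shown here. As a rough sketch of the same idea, under the assumption that splitting on whitespace is enough, a function like the hypothetical `preprocess_sketch` below does the same two steps. Only the `<PERIOD>` token comes from the description above; the `<COMMA>` token is an assumed example of how other punctuation might be handled.

```python
from collections import Counter


def preprocess_sketch(text, min_count=5):
    """Tokenize `text` and drop words that occur `min_count` or fewer times.

    This is an illustrative stand-in, not the code from utils.py.
    """
    # Turn punctuation into standalone tokens.
    text = text.replace(".", " <PERIOD> ")
    text = text.replace(",", " <COMMA> ")  # assumed token name
    words = text.split()

    # Keep only the words that appear more than `min_count` times.
    counts = Counter(words)
    return [word for word in words if counts[word] > min_count]


sample = "the cat sat. " * 6 + "a rare word appears once."
cleaned = preprocess_sketch(sample)

print(cleaned[:4])
print(len(cleaned))
```

In this toy sample, "the", "cat", "sat", and the period token each occur more than five times and are kept. The words in the final sentence occur only once each, so they are dropped.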

In [3]:
import utils

# get list of words
words = utils.preprocess(text)
print(words[:30])

['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution', 'and', 'the', 'sans', 'culottes', 'of', 'the', 'french', 'revolution', 'whilst']


In [4]:
# print some stats about this word data
print(f"Total words in text: {len(words)}")
print(f"Unique words: {len(set(words))}")

Total words in text: 16680599
Unique words: 63641


### Dictionaries

Next, we create two dictionaries to convert words to integers and back again (integers to words). This is again done with a function in the `utils.py` file. `create_lookup_tables` takes in a list of words in a text and returns two dictionaries:
>* `vocab_to_int`, which maps each word to an integer.
>* `int_to_vocab`, which maps each integer back to its word.

The integers are assigned in descending frequency order, so the most frequent word ("the") is given the integer 0, the next most frequent is 1, and so on.

Once we have our dictionaries, the words are converted to integers and stored in the list `int_words`.
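For intuition, here is one way such a function could be written. This is a sketch under the assumption that ties in frequency may be broken in first-seen order; the actual `utils.py` code may differ, and the name `create_lookup_tables_sketch` is made up for this example.

```python
from collections import Counter


def create_lookup_tables_sketch(words):
    """Build word-to-integer and integer-to-word dictionaries.

    Integers are assigned in descending frequency order, so the most
    frequent word gets 0. Illustrative stand-in for the utils.py function.
    """
    word_counts = Counter(words)
    # sorted() is stable, so words with equal counts keep first-seen order.
    sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
    int_to_vocab = dict(enumerate(sorted_vocab))
    vocab_to_int = {word: idx for idx, word in int_to_vocab.items()}
    return vocab_to_int, int_to_vocab


toy_words = ["the", "cat", "the", "dog", "the", "cat"]
toy_vocab_to_int, toy_int_to_vocab = create_lookup_tables_sketch(toy_words)
toy_int_words = [toy_vocab_to_int[word] for word in toy_words]

print(toy_vocab_to_int)
print(toy_int_words)
```

In the toy list, "the" is the most frequent word and gets 0, followed by "cat" at 1 and "dog" at 2, mirroring the ordering described above.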

In [5]:
vocab_to_int, int_to_vocab = utils.create_lookup_tables(words)
int_words = [vocab_to_int[word] for word in words]

print(int_words[:30])

[5233, 3080, 11, 5, 194, 1, 3133, 45, 58, 155, 127, 741, 476, 10571, 133, 0, 27349, 1, 0, 102, 854, 2, 0, 15067, 58112, 1, 0, 150, 854, 3580]


## Subsampling [next]