# Skip-gram Word2Vec

In this notebook, I'll lead you through using PyTorch to implement the [Word2Vec algorithm](https://en.wikipedia.org/wiki/Word2vec) using the skip-gram architecture. By implementing this, you'll learn about embedding words for use in natural language processing. This will come in handy when dealing with things like machine translation.

## Readings

Here are the resources I used to build this notebook. I suggest reading these either beforehand or while you're working on this material.

* A really good [conceptual overview](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/) of Word2Vec from Chris McCormick 
* [First Word2Vec paper](https://arxiv.org/pdf/1301.3781.pdf) from Mikolov et al.
* [NIPS paper](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) with improvements for Word2Vec, also from Mikolov et al.

---
## Word embeddings

When you're dealing with words in text, you end up with tens of thousands of word classes to analyze, one for each word in a vocabulary. Trying to one-hot encode these words is massively inefficient because all but one of the values in a one-hot vector are zero. So, the matrix multiplication between a one-hot input vector and the first hidden layer consists almost entirely of multiplications by zero, which is wasted computation.

<img src='assets/one_hot_encoding.png' width=50%>

To solve this problem and greatly increase the efficiency of our networks, we use what are called **embeddings**. Embeddings are just a fully connected layer like you've seen before. We call this layer the embedding layer and the weights are embedding weights. We skip the multiplication into the embedding layer by instead directly grabbing the hidden layer values from the weight matrix. We can do this because the multiplication of a one-hot encoded vector with a matrix returns the row of the matrix corresponding to the index of the "on" input unit.

<img src='assets/lookup_matrix.png' width=50%>

Instead of doing the matrix multiplication, we use the weight matrix as a lookup table. We encode the words as integers, for example "heart" is encoded as 958, "mind" as 18094. Then to get hidden layer values for "heart", you just take the 958th row of the embedding matrix. This process is called an **embedding lookup** and the number of hidden units is the **embedding dimension**.

<img src='assets/tokenize_lookup.png' width=50%>
 
There is nothing magical going on here. The embedding lookup table is just a weight matrix. The embedding layer is just a hidden layer. The lookup is just a shortcut for the matrix multiplication. The lookup table is trained just like any weight matrix.
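
To make the shortcut concrete, here's a minimal sketch in plain Python (no PyTorch) with a made-up vocabulary of 4 words and an embedding dimension of 3. The weight values and the helper names `one_hot` and `vector_times_matrix` are invented for this illustration; the point is only that the full multiplication and the row lookup give the same vector.

```python
# Toy embedding matrix: 4 words in the vocabulary, embedding dimension 3.
# The values are arbitrary and only serve the illustration.
weights = [
    [0.1, 0.2, 0.3],   # row 0
    [0.4, 0.5, 0.6],   # row 1
    [0.7, 0.8, 0.9],   # row 2
    [1.0, 1.1, 1.2],   # row 3
]

def one_hot(index, size):
    """Return a one-hot list with a 1.0 at `index` and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vector_times_matrix(vec, matrix):
    """Multiply a row vector by a matrix the long way."""
    n_cols = len(matrix[0])
    return [sum(vec[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(n_cols)]

word_idx = 2
via_matmul = vector_times_matrix(one_hot(word_idx, len(weights)), weights)
via_lookup = weights[word_idx]   # the embedding lookup: just grab the row

print(via_matmul)
print(via_lookup)
print(via_matmul == via_lookup)
```

The multiplication touches every entry of the matrix, while the lookup touches a single row. With tens of thousands of words, that difference is the whole reason for the shortcut.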

Embeddings aren't only used for words of course. You can use them for any model where you have a massive number of classes. A particular type of model called **Word2Vec** uses the embedding layer to find vector representations of words that contain semantic meaning.

---
## Word2Vec

The Word2Vec algorithm finds much more efficient representations than one-hot encoding by learning a dense vector for each word. These vectors also contain semantic information about the words.

<img src="assets/context_drink.png" width=40%>

Words that show up in similar **contexts**, such as "coffee", "tea", and "water", will have vectors near each other. Words that appear in different contexts will be farther away from one another, and relationships can be represented by distance in vector space.

<img src="assets/vector_distance.png" width=40%>


There are two architectures for implementing Word2Vec:
>* CBOW (Continuous Bag-Of-Words) and 
* Skip-gram

<img src="assets/word2vec_architectures.png" width=60%>

In this implementation, we'll be using the **skip-gram architecture** because it often performs better than CBOW, particularly for infrequent words. Here, we pass in a word and try to predict the words surrounding it in the text. In this way, we can train the network to learn representations for words that show up in similar contexts.
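
As a rough sketch of what "pass in a word and predict the words surrounding it" means for the training data, the snippet below turns a short made-up sentence into (input word, context word) pairs. The function name `skipgram_pairs` and the fixed window are assumptions for this illustration only; later in the notebook the window size is chosen randomly for each word.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token, using a fixed window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:                      # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = "i drink coffee every morning".split()
pairs = skipgram_pairs(sentence, window=1)
for center, context in pairs:
    print(center, "->", context)
```

Each pair is one training example: the network sees the center word as input and is asked to predict the context word.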

---
## Loading Data

Next, we'll ask you to load in data and place it in the `data` directory.

1. Load the [text8 dataset](https://s3.amazonaws.com/video.udacity-data.com/topher/2018/October/5bbe6499_text8/text8.zip), a file of cleaned up *Wikipedia article text* from Matt Mahoney.
2. Place that data in the `data` folder in the home directory.
3. Then you can extract it and delete the zip archive to save storage space.

After following these steps, you should have one file in your data directory: `data/text8`.

In [1]:
# read in the extracted text file      
with open('data/text8') as f:
    text = f.read()

# print out the first 100 characters
print(text[:100])

 anarchism originated as a term of abuse first used against early working class radicals including t


## Pre-processing

Here I'm fixing up the text to make training easier. This comes from the `utils.py` file. The `preprocess` function does a few things:
>* It converts any punctuation into tokens, so a period is changed to ` <PERIOD> `. In this data set, there aren't any periods, but it will help in other NLP problems. 
* It removes all words that show up five or *fewer* times in the dataset. This will greatly reduce issues due to noise in the data and improve the quality of the vector representations. 
* It returns a list of words in the text.

This may take a few seconds to run, since our text file is quite large. If you want to write your own functions for this stuff, go for it!
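
If you do want to write your own, here's a rough sketch of the idea. The function `preprocess_sketch` is a hypothetical stand-in, not the code in `utils.py` (which isn't shown here and may handle more punctuation and edge cases); it only covers the period token and the rare-word filter described above.

```python
from collections import Counter

def preprocess_sketch(text, min_count=5):
    """Hypothetical stand-in for utils.preprocess; the real function may differ."""
    # Turn periods into a token of their own (other punctuation is ignored here).
    text = text.lower()
    text = text.replace('.', ' <PERIOD> ')
    words = text.split()

    # Keep only words that appear more than `min_count` times.
    counts = Counter(words)
    return [word for word in words if counts[word] > min_count]

# Tiny made-up text: three words repeated six times, plus three rare words.
sample = "the cat sat. " * 6 + "a dog ran."
cleaned = preprocess_sketch(sample)
print(cleaned[:8])
print(len(cleaned))
```

In the sample, "a", "dog", and "ran" each appear once, so they are dropped, while the repeated words and the `<PERIOD>` token survive.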

In [2]:
import utils

# get list of words
words = utils.preprocess(text)
print(words[:30])

['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution', 'and', 'the', 'sans', 'culottes', 'of', 'the', 'french', 'revolution', 'whilst']


In [3]:
# print some stats about this word data
print("Total words in text: {}".format(len(words)))
print("Unique words: {}".format(len(set(words)))) # `set` removes any duplicate words

Total words in text: 16680599
Unique words: 63641


### Dictionaries

Next, I'm creating two dictionaries to convert words to integers and back again (integers to words). This is again done with a function in the `utils.py` file. `create_lookup_tables` takes in a list of words in a text and returns two dictionaries.
>* The integers are assigned in descending frequency order, so the most frequent word ("the") is given the integer 0 and the next most frequent is 1, and so on. 

Once we have our dictionaries, the words are converted to integers and stored in the list `int_words`.
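
For reference, a function with this behavior could look something like the sketch below. `create_lookup_tables_sketch` is a hypothetical name and this is my guess at a reasonable implementation, not necessarily what `utils.py` does; the toy word list is made up.

```python
from collections import Counter

def create_lookup_tables_sketch(words):
    """Hypothetical stand-in for utils.create_lookup_tables."""
    word_counts = Counter(words)
    # Most frequent word first; sorted() is stable, so ties keep first-seen order.
    sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
    int_to_vocab = {ii: word for ii, word in enumerate(sorted_vocab)}
    vocab_to_int = {word: ii for ii, word in int_to_vocab.items()}
    return vocab_to_int, int_to_vocab

toy_words = "the cat and the dog and the bird".split()
toy_vocab_to_int, toy_int_to_vocab = create_lookup_tables_sketch(toy_words)

print(toy_vocab_to_int)
print([toy_vocab_to_int[w] for w in toy_words])
```

Here "the" appears three times and gets integer 0, "and" appears twice and gets 1, and the remaining words follow.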

In [4]:
vocab_to_int, int_to_vocab = utils.create_lookup_tables(words)
int_words = [vocab_to_int[word] for word in words]

print(int_words[:30])

[5233, 3080, 11, 5, 194, 1, 3133, 45, 58, 155, 127, 741, 476, 10571, 133, 0, 27349, 1, 0, 102, 854, 2, 0, 15067, 58112, 1, 0, 150, 854, 3580]


## Subsampling

Words that show up often such as "the", "of", and "for" don't provide much context to the nearby words. If we discard some of them, we can remove some of the noise from our data and in return get faster training and better representations. This process is called subsampling by Mikolov. For each word $w_i$ in the training set, we'll discard it with probability given by 

$$ P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}} $$

where $t$ is a threshold parameter and $f(w_i)$ is the frequency of word $w_i$ in the total dataset.

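For example, if the most frequent word (integer 0) appears about 1 million times in a dataset of about 16 million words and $t = 10^{-5}$, it is discarded with probability

$$ P(0) = 1 - \sqrt{\frac{1 \times 10^{-5}}{(1 \times 10^{6}) / (16 \times 10^{6})}} = 0.98735 $$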
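
The arithmetic above is easy to check with a few lines of Python. The helper name `discard_probability` is made up for this check, and clipping negative values to zero is my own addition: for rare words the formula goes below zero, which simply means the word is never discarded.

```python
import math

def discard_probability(count, total_count, threshold=1e-5):
    """P(w_i) = 1 - sqrt(t / f(w_i)), clipped at 0 for rare words."""
    freq = count / total_count
    return max(0.0, 1 - math.sqrt(threshold / freq))

# A very frequent word: about 1 million occurrences in 16 million words.
p_frequent = discard_probability(1_000_000, 16_000_000)
print(round(p_frequent, 5))   # 0.98735

# A rare word: 100 occurrences. The raw formula is negative, so it is never dropped.
p_rare = discard_probability(100, 16_000_000)
print(p_rare)                 # 0.0
```

So a word as common as this one is kept only about 1.3% of the time, while rare words pass through untouched.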

I'm going to leave this up to you as an exercise. Check out my solution to see how I did it.

> **Exercise:** Implement subsampling for the words in `int_words`. That is, go through `int_words` and discard each word given the probability $P(w_i)$ shown above. Note that $P(w_i)$ is the probability that a word is discarded. Assign the subsampled data to `train_words`.

In [5]:
from collections import Counter
import random
import numpy as np

np.random.seed(7) # helpful for debug

threshold = 1e-5
word_counts = Counter(int_words)
#print(list(word_counts.items())[0])  # dictionary of int_words, how many times they appear

# discard some frequent words, according to the subsampling equation
# create a new list of words for training

train_words = []
total_count = len(int_words)

# need a dict of frequencies and p_drops for each word. iterate through word_counts
freqs       = {}
p_drops     = {}
for word, count in word_counts.items():
    freqs[word]   = count/total_count
    p_drops[word] = 1 - np.sqrt(threshold/freqs[word])

# build the new training list by iterating through the original one (int_words),
# keeping each word with probability 1 - p_drop
for word in int_words:    
    rand = np.random.random()
    if rand < 1-p_drops[word]:
        train_words.append(word)

print(train_words[:30])

[5233, 3133, 45, 10571, 27349, 102, 15067, 58112, 3580, 10712, 3672, 36, 2757, 686, 5233, 1052, 44611, 2877, 5233, 8983, 4147, 6437, 4186, 5233, 6, 1818, 4860, 7573, 566, 11064]


## Making batches

Now that our data is in good shape, we need to get it into the proper form to pass it into our network. With the skip-gram architecture, for each word in the text, we want to define a surrounding _context_ and grab all the words in a window around that word, with size $C$. 

From [Mikolov et al.](https://arxiv.org/pdf/1301.3781.pdf): 

"Since the more distant words are usually less related to the current word than those close to it, we give less weight to the distant words by sampling less from those words in our training examples... If we choose $C = 5$, for each training word we will select randomly a number $R$ in range $[ 1: C ]$, and then use $R$ words from history and $R$ words from the future of the current word as correct labels."

> **Exercise:** Implement a function `get_target` that receives a list of words, an index, and a window size, then returns a list of words in the window around the index. Make sure to use the algorithm described above, where you choose a random number of words from the window.

Say, we have an input and we're interested in the idx=2 token, `741`: 
```
[5233, 58, 741, 10571, 27349, 0, 15067, 58112, 3580, 58, 10712]
```

For `R=2`, `get_target` should return a list of four values:
```
[5233, 58, 10571, 27349]
```

In [6]:
def get_target(words, idx, window_size=5):
    ''' Get a list of words in a window around an index. '''
    
    # implement this function
    rand = np.random.randint(1, window_size+1)
    #print(rand, idx, window_size)
    start = idx-rand if idx-rand >= 0 else 0
    end   = idx+rand
    return words[start:idx] + words[idx+1:end+1]

In [7]:
# test your code!

# run this cell multiple times to check for random window selection
int_text = [i for i in range(10)]
print('Input: ', int_text)
idx=8 # word index of interest

target = get_target(int_text, idx=idx, window_size=5)
print('Target: ', target)  # you should get some indices around the idx

Input:  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Target:  [6, 7, 9]


### Generating Batches 

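Here's a generator function that returns batches of input and target data for our model, using the `get_target` function from above. The idea is that it grabs `batch_size` words from a words list. Then, for each word in that batch, it gets the target words in a window around it. Because each input word is repeated once per target, the `x` and `y` lists it yields are usually longer than `batch_size`.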

In [8]:
def get_batches(words, batch_size, window_size=5):
    ''' Create a generator of word batches as a tuple (inputs, targets) '''
    
    n_batches = len(words)//batch_size
    
    # only full batches
    words = words[:n_batches*batch_size]
    
    for idx in range(0, len(words), batch_size):
        x, y = [], []
        batch = words[idx:idx+batch_size]
        for ii in range(len(batch)):
            batch_x = batch[ii]
            batch_y = get_target(batch, ii, window_size)
            y.extend(batch_y)
            x.extend([batch_x]*len(batch_y))
        yield x, y
    

In [9]:
int_text = [i for i in range(20)]
print(int_text)
x,y = next(get_batches(int_text, batch_size=4, window_size=5))

print('x\n', x)
print('y\n', y)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
x
 [0, 1, 1, 2, 2, 3]
y
 [1, 0, 2, 1, 3, 2]


## Building the graph

Below is an approximate diagram of the general structure of our network.
<img src="assets/skip_gram_arch.png" width=60%>

>* The input words are passed in as batches of input word tokens. 
* This will go into a hidden layer of linear units (our embedding layer). 
* Then, finally into a softmax output layer. 

We'll use the softmax layer to make a prediction about the context words by sampling, as usual.

The idea here is to train the embedding layer weight matrix to find efficient representations for our words. We can discard the softmax layer because we don't really care about making predictions with this network. We just want the embedding matrix so we can use it in _other_ networks we build using this dataset.

---
## Validation

Here, I'm creating a function that will help us observe our model as it learns. We're going to choose a few common words and a few uncommon words. Then, we'll print out the closest words to them using the cosine similarity:

<img src="assets/two_vectors.png" width=30%>

$$
\mathrm{similarity} = \cos(\theta) = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}||\vec{b}|}
$$


We can encode the validation words as vectors $\vec{a}$ using the embedding table, then calculate the similarity with each word vector $\vec{b}$ in the embedding table. With the similarities, we can print out the validation words and words in our embedding table semantically similar to those words. It's a nice way to check that our embedding table is grouping together words with similar semantic meanings.
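
Before the PyTorch version, here's a small plain-Python sketch of the same idea. The 2-D vectors are made up for illustration (real embeddings here have hundreds of dimensions), and `cosine` and `toy_vectors` are names I'm introducing just for this example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 2-D "embeddings", purely for illustration.
toy_vectors = {
    "coffee": [1.0, 0.9],
    "tea":    [0.9, 1.0],
    "road":   [-1.0, 0.2],
}

# Rank every other word by its similarity to the query word.
query = "coffee"
ranked = sorted((w for w in toy_vectors if w != query),
                key=lambda w: cosine(toy_vectors[query], toy_vectors[w]),
                reverse=True)
print(ranked)   # ['tea', 'road']
```

Vectors pointing in nearly the same direction score close to 1, and vectors pointing in opposite directions score below 0, regardless of their lengths.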

In [10]:
def cosine_similarity(embedding, valid_size=16, valid_window=100, device='cpu'):
    """ Returns the cosine similarity of validation words with words in the embedding matrix.
        Here, embedding should be a PyTorch embedding module.
    """
    
    # Here we're calculating the cosine similarity between some random words and 
    # our embedding vectors. With the similarities, we can look at what words are
    # close to our random words.
    
    # sim = (a . b) / |a||b|
    
    embed_vectors = embedding.weight
    
    # magnitude of embedding vectors, |b|
    magnitudes = embed_vectors.pow(2).sum(dim=1).sqrt().unsqueeze(0)
    
    # pick N words from our ranges (0,window) and (1000,1000+window). lower id implies more frequent 
    valid_examples = np.array(random.sample(range(valid_window), valid_size//2))
    valid_examples = np.append(valid_examples,
                               random.sample(range(1000,1000+valid_window), valid_size//2))
    valid_examples = torch.LongTensor(valid_examples).to(device)
    
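    valid_vectors = embedding(valid_examples)
    # note: we divide only by |b|. |a| is constant for a given validation word, so leaving
    # it out rescales that word's row of similarities without changing the ranking
    similarities = torch.mm(valid_vectors, embed_vectors.t())/magnitudes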
        
    return valid_examples, similarities

## SkipGram model

Define and train the SkipGram model. 
> You'll need to define an [embedding layer](https://pytorch.org/docs/stable/nn.html#embedding) and a final, softmax output layer.

An Embedding layer takes in a number of inputs, importantly:
* **num_embeddings** – the size of the dictionary of embeddings, or how many rows you'll want in the embedding weight matrix
* **embedding_dim** – the size of each embedding vector; the embedding dimension

In [11]:
import torch
from torch import nn
import torch.optim as optim

In [12]:
class SkipGram(nn.Module):
    def __init__(self, n_vocab, n_embed):
        super().__init__()
        
        # complete this SkipGram model
        self.embed      = nn.Embedding(n_vocab, n_embed)
        self.output     = nn.Linear(n_embed, n_vocab)
        self.logsoftmax = nn.LogSoftmax(dim=1)               # consider LogSoftmax(dim=1) instead of Softmax
        
    def forward(self, x):
        
        # define the forward behavior
        x = self.embed(x)
        x = self.output(x)
        x = self.logsoftmax(x)
        
        return x

### Training

Below is our training loop, and I recommend that you train on GPU, if available.

**Note that, because we applied a log-softmax function to our model output, we are using NLLLoss** as opposed to cross entropy. This is because LogSoftmax in combination with NLLLoss is equivalent to CrossEntropyLoss.
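
If you'd like to convince yourself of that equivalence, here's a small check in plain Python rather than PyTorch. The scores are made up, and the three helper functions are simplified single-example versions written for this illustration; they are not the PyTorch implementations. In PyTorch, `nn.CrossEntropyLoss` applies the log-softmax and the negative log-likelihood in a single step.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of scores."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]

def nll_loss(log_probs, target):
    """Negative log-likelihood of the target class."""
    return -log_probs[target]

def cross_entropy(logits, target):
    """Cross entropy computed directly from raw scores."""
    exps = [math.exp(z) for z in logits]
    return -math.log(exps[target] / sum(exps))

logits = [2.0, 0.5, -1.0]   # made-up scores for a 3-word vocabulary
target = 0                  # index of the correct context word

two_step = nll_loss(log_softmax(logits), target)
one_step = cross_entropy(logits, target)
print(two_step, one_step)
print(math.isclose(two_step, one_step))
```

Both routes give the same loss, so the choice is mostly about whether the model's forward pass should return log-probabilities or raw scores.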

In [None]:
# this ran for nearly 48 hours and did not complete before the laptop rebooted itself.

In [13]:
import sys
print('__Python VERSION:', sys.version)
print('__pyTorch VERSION:', torch.__version__)
print('__CUDA VERSION', )
from subprocess import call
# call(["nvcc", "--version"]) does not work
! nvcc --version
print('__CUDNN VERSION:', torch.backends.cudnn.version())
print('__Number CUDA Devices:', torch.cuda.device_count())
print('__Devices')
# call(["nvidia-smi", "--format=csv", "--query-gpu=index,name,driver_version,memory.total,memory.used,memory.free"])
print('Active CUDA Device: GPU', torch.cuda.current_device())
print('Available devices ', torch.cuda.device_count())
print('Current cuda device: ', torch.cuda.current_device())
print('CUDA Device Name:', torch.cuda.get_device_name(0))

__Python VERSION: 3.8.11 (default, Aug  6 2021, 09:57:55) [MSC v.1916 64 bit (AMD64)]
__pyTorch VERSION: 1.10.1
__CUDA VERSION
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Wed_Jul_14_19:47:52_Pacific_Daylight_Time_2021
Cuda compilation tools, release 11.4, V11.4.100
Build cuda_11.4.r11.4/compiler.30188945_0
__CUDNN VERSION: 8200
__Number CUDA Devices: 1
__Devices
Active CUDA Device: GPU 0
Available devices  1
Current cuda device  0


In [14]:
t = torch.cuda.get_device_properties(0).total_memory
r = torch.cuda.memory_reserved(0)
a = torch.cuda.memory_allocated(0)
f = r-a  # free inside reserved 
print(t, r, a, f)
# want r a and f to be close to zero before training

4294967296 0 0 0


In [15]:
torch.cuda.empty_cache()  # call this (note the parentheses) to help manage memory issues

In [None]:
# check if GPU is available
import timeit
start_time = timeit.default_timer() 
    
device = 'cuda' if torch.cuda.is_available() else 'cpu'
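# note: a plain Python variable like this has no effect on CUDA; for synchronous kernel launches
# while debugging, CUDA_LAUNCH_BLOCKING must be set as an environment variable before CUDA initializes
CUDA_LAUNCH_BLOCKING=1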
embedding_dim=300 # you can change, if you want

model = SkipGram(len(vocab_to_int), embedding_dim).to(device)
criterion = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=0.003)

print_every = 1500 #500
steps = 0
epochs = 5

batch_size = 128 #512 512 caused memory errors, lower to 128
# train for some number of epochs
for e in range(epochs):
    
    # get input and target batches
    for inputs, targets in get_batches(train_words, batch_size):
        steps += 1
        inputs, targets = torch.LongTensor(inputs), torch.LongTensor(targets)
        inputs, targets = inputs.to(device), targets.to(device)
        
        log_ps = model(inputs)
        loss = criterion(log_ps, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if steps % print_every == 0:                  
            print("Epoch: {}/{}".format(e+1, epochs)) 
            print("Loss: ", loss.item()) # avg batch loss at this point in training
            # getting examples and similarities      
            valid_examples, valid_similarities = cosine_similarity(model.embed, device=device)
            _, closest_idxs = valid_similarities.topk(6) # topk highest similarities
            
            valid_examples, closest_idxs = valid_examples.to('cpu'), closest_idxs.to('cpu')
            for ii, valid_idx in enumerate(valid_examples):
                closest_words = [int_to_vocab[idx.item()] for idx in closest_idxs[ii]][1:]
                print(int_to_vocab[valid_idx.item()] + " | " + ', '.join(closest_words))
            print("...")
            
end_time = timeit.default_timer()  
print()
print("total time: ", end_time - start_time)

Epoch: 1/5
Loss:  11.28200912475586
during | lactase, clouded, pears, intra, negotiator
known | arrivals, hughie, tridactyla, nurseries, plowing
has | telephus, peseta, sleeper, hemoglobin, amigaos
after | breakage, biodiversity, declarations, backpacker, proportionally
his | schwarzwald, xenophon, dymaxion, boogie, mendelevium
these | wow, mbox, frag, dependability, laptop
the | awkward, rquez, baptiste, eudes, suffrage
other | reassembly, europium, pillage, realize, functors
older | deanna, creationists, blacklisted, gann, wry
numerous | haughey, researches, reinforces, mart, alexandrian
road | grazing, interpolated, apocalypticism, jia, reductio
hold | morpheus, manich, bitstrings, decstation, jehovah
account | phthalocyanine, meccans, mensch, camillo, curve
heavy | mechanistic, mathbb, despatched, intersex, lowland
channel | dampier, proclamations, lilia, mondays, ravaged
applied | parenthesis, alicante, tippett, conversation, carelessly
...
Epoch: 1/5
Loss:  11.583023071289062
the

Epoch: 1/5
Loss:  9.60251235961914
who | besieges, exams, beloved, blessing, love
see | miltiades, journal, bernanke, norwegian, socialized
american | nine, book, baseball, lionheart, wedded
war | surrender, during, army, annexation, soviet
history | page, easton, org, contradicts, ca
known | panini, compartments, ambience, facade, crassus
i | overzealous, we, you, ask, e
also | rectify, herr, from, writers, stringbean
discovered | planets, planet, geologic, nitride, atmosphere
defense | defence, simulation, committee, agreement, doraemons
notes | akhdar, underside, awg, superfluous, plethora
except | maisonneuve, penniless, homeomorphic, eftpos, tro
award | rachel, rides, finals, celibate, cavalli
active | sort, subsidiaries, counter, ownership, endemol
engine | engines, empties, projectors, relena, drill
road | airport, tuff, reconquered, miami, merdeka
...
Epoch: 1/5
Loss:  11.432023048400879
six | zero, four, two, three, five
system | netware, constant, contributions, cim, doses
ma

Epoch: 1/5
Loss:  10.44304370880127
i | t, you, me, your, h
it | old, who, ntgen, my, form
an | the, country, in, history, of
six | three, seven, four, one, two
in | the, and, of, a, state
so | produce, make, bracelets, spinosus, light
has | and, economy, a, the, of
who | himself, wives, his, living, tribe
instance | cryptids, amusement, adjectives, transl, dweezil
pre | earliest, history, celtic, civilization, evolved
articles | org, topics, overview, archive, htm
mainly | most, spoken, largest, early, addition
something | anyone, you, whether, nothing, does
freedom | theology, rights, political, viv, declare
award | awards, best, nominated, awarded, nominations
orthodox | judaism, jews, christian, christians, church
...
Epoch: 1/5
Loss:  10.582915306091309
an | phosphatase, c, is, number, dispersed
state | states, counties, territory, washington, bureau
the | in, six, of, st, first
some | are, hamburgers, different, non, rearrangements
no | sheryl, araki, not, track, only
during | un

Epoch: 2/5
Loss:  9.940659523010254
new | york, boston, press, jersey, british
some | many, there, particular, found, contrast
state | states, governor, union, virginia, institutions
s | seven, zero, one, six, eight
states | united, federal, state, u, act
it | is, than, whether, did, reveals
these | researchers, effects, determine, fluoride, less
over | years, seven, spymaster, five, highest
question | questions, think, arguments, scientific, existence
hit | album, hits, albums, billboard, hitting
construction | electricity, steel, wooden, tower, buildings
grand | brabant, prix, rapids, champions, west
institute | university, research, engineering, technology, college
gold | silver, bronze, precious, coins, ore
recorded | album, songs, albums, dancing, solo
quite | stitching, playstation, quake, mycelium, talked
...
Epoch: 2/5
Loss:  10.215202331542969
during | summer, role, biographer, th, after
of | an, external, the, in, eight
years | six, birth, three, age, fell
first | the, featur

Epoch: 2/5
Loss:  10.56949234008789
on | a, this, usually, the, in
that | thus, a, do, completely, or
zero | three, two, four, seven, nine
was | in, a, nine, the, name
is | the, uses, are, in, a
but | side, only, tend, kaczynski, dressed
their | including, to, turns, practice, closer
united | states, zealand, presidents, federal, u
dr | novel, professor, douglas, doctor, jane
test | tests, testing, nuclear, tested, plutonium
ice | water, cooling, glacier, surface, temperature
police | agencies, officers, intelligence, crime, guard
operations | operation, targeting, weapons, security, squadron
numerous | most, practices, were, included, including
additional | full, page, faiz, information, links
road | roads, freight, transportation, construction, vehicles
...
Epoch: 2/5
Loss:  10.665101051330566
war | forces, fought, soldiers, soviet, civil
are | is, or, defined, each, such
united | states, presidents, federal, admitted, washington
is | defined, n, definition, follows, are
by | and, ca

Epoch: 3/5
Loss:  10.133749008178711
into | called, is, which, to, component
between | occurs, is, called, separated, formed
had | was, his, augustus, later, he
or | called, may, any, can, usually
is | if, in, y, of, x
most | abundant, forms, these, found, constituent
by | the, of, that, benjamin, described
it | do, that, to, in, a
assembly | legislative, elections, vote, parliament, representatives
creation | scientific, reconciliation, principles, iaido, speculative
animals | animal, humans, predators, mammals, habits
institute | research, science, university, sciences, graduated
file | files, windows, executable, boot, user
troops | fighting, army, forces, soldiers, withdraw
alternative | herbal, review, forums, genres, yehovah
question | questions, whether, answers, think, yes
...
Epoch: 3/5
Loss:  10.571260452270508
see | external, list, in, links, and
four | five, eight, one, three, six
a | with, and, to, the, like
war | guerrilla, army, terror, civil, nazis
which | and, into, kn

Epoch: 3/5
Loss:  10.039072036743164
between | neutral, demonstrated, same, central, gabab
see | list, links, www, external, homepage
than | fewer, less, meaningful, larger, low
with | are, the, addition, to, form
in | the, of, and, as, one
so | cannot, if, then, aren, nothing
this | is, that, example, basis, for
history | overview, list, links, timeline, library
except | like, individually, beluga, present, muharram
lived | brother, maria, peter, stayed, younger
report | reports, reported, news, agency, retrieved
event | extinction, events, potencies, resurrection, olympics
gold | silver, platinum, nickel, metals, precious
additional | faiz, methods, cluster, called, homepage
taking | valuable, deprived, judged, apulia, throw
writers | novelists, poets, fantasy, authors, playwrights
...
Epoch: 3/5
Loss:  10.594212532043457
state | states, united, constitution, senators, federal
however | been, difficult, opposed, not, some
most | many, influenced, a, although, these
may | accordance, 

Epoch: 4/5
Loss:  9.668204307556152
or | is, seen, must, there, occurs
other | as, or, and, include, usually
had | was, th, becoming, his, he
two | one, five, three, seven, zero
of | the, and, is, as, a
be | to, a, an, is, not
see | external, references, links, list, article
over | three, next, at, two, after
applications | user, windows, interface, portable, networking
stage | vaudeville, theatre, scala, actors, broadway
behind | ever, opposition, pittsburgh, third, out
something | know, we, our, impossible, ve
gold | silver, platinum, tin, copper, purple
event | olympic, policemen, prison, sporting, rides
question | questions, theological, becomes, answered, asks
magazine | interview, magazines, news, npr, newsgroup
...
Epoch: 4/5
Loss:  8.854816436767578
years | males, age, million, thirteen, male
may | not, should, criteria, requires, rejects
of | and, the, in, by, capital
american | musician, actress, footballer, americans, nine
a | or, can, be, and, combination
who | father, he, 

Epoch: 4/5
Loss:  10.023306846618652
more | less, larger, majority, varied, than
only | should, likely, describe, exist, non
also | see, for, and, the, of
after | lasted, during, war, first, the
had | returned, later, was, conquered, sicily
their | have, malay, towards, some, themselves
to | the, for, and, this, was
used | non, use, using, purposes, a
instance | singular, nouns, stylistic, verbs, distinction
professional | arts, teams, basketball, football, sports
universe | cosmic, cosmological, bang, worlds, cosmology
versions | version, feature, apple, software, originally
scale | temperature, geologic, artefact, mechanical, diatonic
award | awards, film, oscar, nominated, best
question | questions, cf, whether, observance, told
pressure | flow, cooling, gases, temperature, pump
...
Epoch: 4/5
Loss:  9.284676551818848
and | of, the, as, in, also
during | was, after, the, middle, last
were | th, descendants, empire, borrowed, century
b | d, c, p, v, e
on | and, a, s, in, with
about |

Epoch: 5/5
Loss:  10.388358116149902
see | references, www, topics, external, links
after | battle, imprisoned, he, defeated, thereafter
or | may, these, is, are, the
seven | one, five, eight, six, nine
s | one, nine, seven, j, r
while | presence, became, when, the, come
states | u, presidents, territory, northern, territories
which | in, of, an, a, called
san | francisco, angeles, puerto, diego, seattle
except | some, waived, it, therefore, certain
rise | nationalism, rising, alps, pontic, mountains
operations | operation, multiplication, fighter, xor, tomcat
applied | philosophy, or, practice, etymology, philosophers
ice | glacier, glaciers, frozen, lake, antarctica
behind | quarterback, defensive, evans, mike, jake
additional | inventory, timing, voices, operations, product
...
Epoch: 5/5
Loss:  9.53900146484375
this | the, between, of, its, by
there | parts, have, originated, western, now
the | of, in, and, a, as
b | d, writer, painter, politician, actor
d | politician, writer, b, 

Epoch: 5/5
Loss:  9.295401573181152
s | nine, eight, born, six, d
use | types, modifications, used, this, uses
been | questioned, evidence, claims, scholars, surviving
world | championships, olympic, championship, nine, shattered
used | form, using, unit, commonly, use
can | properties, function, functions, are, commonly
if | cannot, generalization, polynomial, dimensional, denoted
nine | one, five, six, eight, zero
defense | agency, surveillance, missile, intelligence, danish
woman | she, her, actress, child, love
mainly | groups, lebanese, germany, especially, east
existence | aristotle, argument, proposition, human, question
bill | footballer, actor, jennifer, actress, mike
file | files, unix, users, software, format
http | www, com, org, edu, htm
magazine | interviews, magazines, reviews, publisher, fiction
...
Epoch: 5/5
Loss:  10.756816864013672
by | and, s, from, of, was
have | accepted, been, them, few, regarded
been | attributed, of, by, from, remains
into | of, formed, by, an

## Visualizing the word vectors

Below we'll use T-SNE to visualize how our high-dimensional word vectors cluster together. T-SNE is used to project these vectors into two dimensions while preserving local structure. Check out [this post from Christopher Olah](http://colah.github.io/posts/2014-10-Visualizing-MNIST/) to learn more about T-SNE and other ways to visualize high-dimensional data.

In [1]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

In [2]:
# getting embeddings from the embedding layer of our model, by name
# (this needs the trained `model` from the training cell to exist in the current session)
embeddings = model.embed.weight.to('cpu').data.numpy()

NameError: name 'model' is not defined

In [None]:
viz_words = 600
tsne = TSNE()
embed_tsne = tsne.fit_transform(embeddings[:viz_words, :])

In [None]:
fig, ax = plt.subplots(figsize=(16, 16))
for idx in range(viz_words):
    plt.scatter(*embed_tsne[idx, :], color='steelblue')
    plt.annotate(int_to_vocab[idx], (embed_tsne[idx, 0], embed_tsne[idx, 1]), alpha=0.7)