# Why word vectors?

Working with words means dealing with the huge but sparse domain of language. Even for a small corpus, a neural network (or any other type of model) needs to support many thousands of discrete inputs and outputs.

Besides the raw number of words, the standard technique of representing words as one-hot vectors (e.g. "the" = `[0 0 0 1 0 0 0 0 ...]`) does not capture any information about relationships between words.
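
To make that limitation concrete, here is a minimal sketch using a made-up five-word vocabulary. The words and the `one_hot`, `dot`, and `euclidean` helpers are illustrative assumptions, not part of any library. Every pair of distinct one-hot vectors has a dot product of 0 and the same Euclidean distance, so "cat" is no closer to "kitten" than it is to "bank".

```python
import math

# Hypothetical toy vocabulary, for illustration only
vocab_words = ['the', 'cat', 'kitten', 'dog', 'bank']

def one_hot(word):
    """Return a list with a 1 at the word's vocabulary index and 0 elsewhere."""
    vec = [0] * len(vocab_words)
    vec[vocab_words.index(word)] = 1
    return vec

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

cat, kitten, bank = one_hot('cat'), one_hot('kitten'), one_hot('bank')

# Distinct words are always orthogonal...
print(dot(cat, kitten), dot(cat, bank))                 # 0 0
# ...and always the same distance apart, whatever their meaning
print(euclidean(cat, kitten) == euclidean(cat, bank))   # True
```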

There are a few techniques for creating word vectors. The word2vec algorithm predicts words in a context (e.g. what is the most likely word to appear in "the cat ? the mouse"), while GloVe vectors are based on global counts across the corpus.
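
As a rough illustration of what "global counts" means, the sketch below tallies how often pairs of words appear within a fixed window of each other in a tiny made-up corpus. It shows only the counting step; GloVe goes on to fit vectors to statistics of this kind, which is not shown here. The corpus, the window size, and the `cooccurrence_counts` helper are assumptions for illustration.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count ordered (word, context) pairs that occur within `window` tokens of each other."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # skip the word itself
                    counts[(word, tokens[j])] += 1
    return counts

# Invented two-sentence corpus, for illustration only
corpus = [
    'the cat chased the mouse',
    'the dog chased the cat',
]
counts = cooccurrence_counts(corpus)

print(counts[('cat', 'chased')])  # "cat" and "chased" fall in the same window in both sentences
print(counts[('cat', 'dog')])     # "cat" and "dog" are never within two tokens of each other
```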

# GloVe: Global Vectors for Word Representation
Created at the Stanford NLP group by Jeffrey Pennington, Richard Socher, and Christopher D. Manning.

Word vectors address the size and sparsity problems of one-hot representations by placing words in a dense, multi-dimensional vector space. This can bring the dimensionality of the problem from hundreds of thousands down to just hundreds. The vector space can also capture semantic relationships between words in terms of distance and vector arithmetic.

![](https://i.imgur.com/y4hG1ak.png)






## Loading word vectors

In [3]:
'''Torchtext includes functions to download GloVe (and other) embeddings'''
import torch
import torchtext.vocab as vocab

In [4]:
glove = vocab.GloVe(name='6B', dim=100)

print('Loaded {} words'.format(len(glove.itos)))

.vector_cache/glove.6B.zip: 862MB [02:09, 6.64MB/s]                               
100%|██████████| 400000/400000 [00:14<00:00, 27400.22it/s]


Loaded 400000 words


The returned `GloVe` object includes these attributes:
- `stoi` (_string-to-index_): a dictionary mapping words to indexes
- `itos` (_index-to-string_): a list of words, ordered by index
- `vectors`: the actual vectors. To get a word's vector, look up the word's index in `stoi` and use it to index into `vectors`.
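
The relationship between the three attributes can be sketched with a toy stand-in. The `ToyVectors` class below is hypothetical and uses plain Python lists with invented numbers rather than the real GloVe data; it mirrors only the `stoi` / `itos` / `vectors` layout described above.

```python
class ToyVectors:
    """Hypothetical stand-in that mimics the stoi / itos / vectors layout."""
    def __init__(self, table):
        self.itos = list(table)                              # index -> word
        self.stoi = {w: i for i, w in enumerate(self.itos)}  # word -> index
        self.vectors = [table[w] for w in self.itos]         # index -> vector

# Invented 3-dimensional vectors, for illustration only
toy = ToyVectors({
    'cat': [0.9, 0.1, 0.0],
    'dog': [0.8, 0.2, 0.1],
    'car': [0.0, 0.9, 0.7],
})

index = toy.stoi['dog']      # word -> index
vector = toy.vectors[index]  # index -> vector
print(index, toy.itos[index], vector)
```

With the real object, the same two-step lookup is `glove.vectors[glove.stoi['dog']]`.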

## Finding closest vectors

Going from word to vector is easy enough, but going from vector to word takes more work. Here I'm calculating the distance from the given vector to every word vector in the vocabulary, then sorting by that distance:

In [5]:
def get_word(word):
    return glove.vectors[glove.stoi[word]]

def closest_readable_implementation(vec, n=10):
    all_dists=[]
    for w in glove.itos:
        distance = torch.dist(vec, get_word(w))
        all_dists.append([w, distance])
    return sorted(all_dists, key=lambda t: t[1])[:n]


def closest(vec, n=10):
    all_dists = [(w, torch.dist(vec, get_word(w))) for w in glove.itos]
    return sorted(all_dists, key=lambda t: t[1])[:n]

# A helper function to print that list
def print_tuples(tuples):
    # Unpack each (word, distance) pair; avoids shadowing the built-in `tuple`
    for word, distance in tuples:
        print('(%.4f) %s' % (distance, word))

In [28]:
'''Now using a known word vector we can see which other vectors are closest:'''
print_tuples(closest(get_word('c++')))

(0.0000) c++
(4.3938) compiler
(4.6451) fortran
(4.6805) compilers
(4.9288) objective-c
(4.9339) javascript
(5.0873) php
(5.1683) object-oriented
(5.2860) perl
(5.3822) cobol


## Word analogies with vector arithmetic

The most interesting feature of a well-trained word vector space is that certain semantic relationships can be captured with regular vector arithmetic. 

![](https://i.imgur.com/d0KuM5x.png)

(image borrowed from [a slide from Omer Levy and Yoav Goldberg](https://levyomer.wordpress.com/2014/04/25/linguistic-regularities-in-sparse-and-explicit-word-representations/))
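
The arithmetic can be worked through by hand on a toy example. The 2-dimensional vectors below are invented so that one axis loosely tracks "royalty" and the other "gender"; real GloVe dimensions are not interpretable like this, and the `toy_vectors` table and `nearest` helper are assumptions for illustration. The sketch computes `man - king + queen` and picks the nearest remaining word by Euclidean distance.

```python
import math

# Invented 2-d vectors: (royalty, gender). For illustration only.
toy_vectors = {
    'king':  [1.0, 1.0],
    'queen': [1.0, -1.0],
    'man':   [0.0, 1.0],
    'woman': [0.0, -1.0],
    'apple': [-1.0, 0.0],
}

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(vec, exclude=()):
    """Return the word whose vector has the smallest Euclidean distance to vec."""
    candidates = [w for w in toy_vectors if w not in exclude]
    return min(candidates, key=lambda w: euclidean(vec, toy_vectors[w]))

# king : man :: queen : ?   ->   man - king + queen
target = add(sub(toy_vectors['man'], toy_vectors['king']), toy_vectors['queen'])
print(target)                                               # [0.0, -1.0]
print(nearest(target, exclude=('king', 'man', 'queen')))    # woman
```

Subtracting `king` from `man` removes the "royalty" component while keeping "gender", and adding `queen` puts the royalty back with the opposite gender, which lands exactly on the toy vector for "woman". In a real embedding the result only lands near the expected word, which is why a nearest-neighbour search is needed.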

In [25]:
# In the form w1 : w2 :: w3 : ?
def analogy(w1, w2, w3):
    print('\n[%s : %s :: %s : ?]' % (w1, w2, w3))
    # w2 - w1 + w3 = w4
    closest_words = closest(get_word(w2) - get_word(w1) + get_word(w3))
    # filter out input words
    closest_words = [t for t in closest_words if t[0] not in [w1, w2, w3]]
    print_tuples(closest_words[:4])

In [8]:
'''The classic example:'''
analogy('king', 'man', 'queen')


[king : man :: queen : ?]
(4.0811) woman
(4.6916) girl
(5.2703) she
(5.2788) teenager


In [9]:
'''Now let's explore the word space and see what stereotypes we can uncover'''
analogy('man', 'actor', 'woman')
analogy('cat', 'kitten', 'dog')
analogy('russia', 'moscow', 'canada')
analogy('rich', 'mansion', 'poor')
analogy('paper', 'newspaper', 'screen')
analogy('earth', 'moon', 'sun') # Interesting failure mode
analogy('house', 'roof', 'castle')
analogy('building', 'architect', 'software')
analogy('virginia', 'richmond', 'california')
analogy('good', 'heaven', 'bad')
analogy('jordan', 'basketball', 'messi')


[man : actor :: woman : ?]
(2.8133) actress
(5.0039) comedian
(5.1399) actresses
(5.2773) starred

[cat : kitten :: dog : ?]
(3.8146) puppy
(4.2944) rottweiler
(4.5888) puppies
(4.6086) pooch

[russia : moscow :: canada : ?]
(4.3024) montreal
(4.3245) toronto
(4.4689) winnipeg
(4.5301) ontario

[rich : mansion :: poor : ?]
(5.8262) residence
(5.9444) riverside
(6.0283) hillside
(6.0328) abandoned

[paper : newspaper :: screen : ?]
(4.7810) tv
(5.1049) television
(5.3818) cinema
(5.5524) feature

[earth : moon :: sun : ?]
(6.2294) lee
(6.4125) kang
(6.4644) tan
(6.4757) yang

[house : roof :: castle : ?]
(6.2919) stonework
(6.3779) masonry
(6.4773) canopy
(6.4954) fortress

[building : architect :: software : ?]
(5.8369) programmer
(6.8881) entrepreneur
(6.9240) inventor
(6.9730) developer

[virginia : richmond :: california : ?]
(4.3444) pasadena
(4.3696) francisco
(4.3829) angeles
(4.5840) mesa

[good : heaven :: bad : ?]
(4.3959) hell
(5.2864) ghosts
(5.2898) hades
(5.3414) madness


In [24]:
analogy('ibm', 'software', 'citigroup')


[ibm : software :: citigroup : ?]
(5.1651) banking
(5.2357) credit
(5.4298) merrill
(5.4445) banks


In [22]:
analogy('ibm', 'computer', 'chrysler')


[ibm : computer :: chrysler : ?]
(4.4520) auto
(5.0944) car
(5.2303) vehicle
(5.3724) utility
