# Word embeddings

The idea that a word’s meaning can be understood from its context, that is, from the words that surround it, is the basis for word embeddings. A word embedding is a representation of a word as a numeric vector, enabling us to compare and contrast how words are used and identify words that occur in similar contexts.

Working with word embeddings requires words to be represented as vectors. In Python, vectors are typically represented as NumPy arrays.

An array holds elements of a single data type. Arrays are used in place of lists when mathematical operations need to be applied across all of the elements.

In [1]:
import numpy as np

scores_xavier = np.array([88, 92])

scores_niko = np.array([94, 87])

scores_alena = np.array([90, 48])
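To show why arrays suit this kind of work, here is a minimal sketch that reuses the score arrays above and averages them element by element. The name `mean_scores` is introduced only for illustration.

```python
import numpy as np

scores_xavier = np.array([88, 92])
scores_niko = np.array([94, 87])
scores_alena = np.array([90, 48])

# Arithmetic on arrays is applied element by element,
# so this averages the first scores together and the second scores together.
mean_scores = (scores_xavier + scores_niko + scores_alena) / 3
print(mean_scores)
```

With plain Python lists, `+` would concatenate the lists instead of adding the scores.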


A word embedding is a vector representation of a word.

Embeddings let us take the information carried by a word, such as its meaning and its part of speech, and convert it into a numeric form that a computer can work with.

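We can load a basic English model using spaCy. To get the vector representation of a word, we call the model with the desired word as an argument and read the .vector attribute.

Note that en_core_web_sm does not ship with static word vectors, so the 96-dimensional .vector shown here comes from the pipeline's context-sensitive tensors. The larger en_core_web_md and en_core_web_lg models include true word vectors.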

In [2]:
import spacy

nlp = spacy.load("en_core_web_sm")

love_vector = nlp('love').vector

print(len(love_vector))

96


## Distance

The key idea at the heart of word embeddings is distance.

Manhattan distance, also known as city block distance, is defined as the sum of the absolute differences across each individual dimension of the vectors.

Euclidean distance, also known as straight line distance, is the square root of the sum of the squared differences in each dimension.

Cosine distance is concerned with the angle between two vectors rather than the distance between the points, or ends, of the vectors. It is defined as 1 minus the cosine of that angle. Two vectors that point in the same direction have no angle between them and a cosine distance of 0. Two perpendicular vectors have a cosine distance of 1, and two vectors that point in opposite directions have a cosine distance of 2.

In [3]:
from scipy.spatial.distance import cityblock, euclidean, cosine

vector_a = np.array([1,2,3])
vector_b = np.array([2,4,6])

# Manhattan distance:
manhattan_d = cityblock(vector_a,vector_b) # 6
print(manhattan_d)

# Euclidean distance:
euclidean_d = euclidean(vector_a,vector_b) # 3.74
print(euclidean_d)

# Cosine distance:
cosine_d = cosine(vector_a,vector_b) # 0.0
print(cosine_d)

6
3.7416573867739413
0
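The SciPy functions hide the arithmetic, so here is a minimal sketch of the same three metrics written directly with NumPy. The function names are chosen for illustration and are not part of any library.

```python
import numpy as np

def manhattan_distance(u, v):
    # Sum of the absolute differences in each dimension
    return np.sum(np.abs(u - v))

def euclidean_distance(u, v):
    # Square root of the sum of squared differences in each dimension
    return np.sqrt(np.sum((u - v) ** 2))

def cosine_distance(u, v):
    # 1 minus the cosine of the angle between the two vectors
    return 1 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

vector_a = np.array([1, 2, 3])
vector_b = np.array([2, 4, 6])

print(manhattan_distance(vector_a, vector_b))  # 6
print(euclidean_distance(vector_a, vector_b))  # approximately 3.74
print(cosine_distance(vector_a, vector_b))     # approximately 0 (same direction)

# Perpendicular vectors: the cosine of the angle is 0, so the distance is 1
print(cosine_distance(np.array([1, 0]), np.array([0, 1])))  # 1.0
```

Because vector_b is vector_a scaled by 2, the two point in the same direction. Their cosine distance is therefore zero (up to floating-point rounding), even though the Manhattan and Euclidean distances are not.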


The idea behind word embeddings is a theory known as the distributional hypothesis. This hypothesis states that words that occur in the same contexts tend to have similar meanings. With word embeddings, we map words that appear in the same contexts to similar places in our vector space (math-speak for the area in which our vectors exist).

The numeric values in a word’s vector are not meaningful in their own right. Their value comes from comparison: encoded in the vectors is latent information about how the words are used, and we recover it by measuring how similar or different two word vectors are.
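As a rough illustration of the distributional hypothesis, the sketch below builds vectors from raw co-occurrence counts over a made-up four-sentence corpus. This is not how word2vec or spaCy learn embeddings. It only shows that words used in the same contexts end up with similar vectors. The corpus, the window size and the helper names are all assumptions chosen for the example.

```python
from collections import Counter, defaultdict

import numpy as np

# Made-up toy corpus, already tokenized
sentences = [
    ["the", "cat", "drinks", "milk"],
    ["the", "dog", "drinks", "water"],
    ["the", "cat", "chases", "mice"],
    ["the", "dog", "chases", "cars"],
]

window = 1  # count only the immediate neighbours of each word
contexts = defaultdict(Counter)
for sentence in sentences:
    for i, word in enumerate(sentence):
        start, stop = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(start, stop):
            if j != i:
                contexts[word][sentence[j]] += 1

vocab = sorted({word for sentence in sentences for word in sentence})

def count_vector(word):
    # One dimension per vocabulary word, holding the co-occurrence count
    return np.array([contexts[word][c] for c in vocab], dtype=float)

def cosine_distance(u, v):
    return 1 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

cat_dog = cosine_distance(count_vector("cat"), count_vector("dog"))
cat_milk = cosine_distance(count_vector("cat"), count_vector("milk"))
print("cat vs dog: ", cat_dog)
print("cat vs milk:", cat_milk)
```

In this corpus "cat" and "dog" have the same neighbours, so their count vectors are closer to each other than "cat" is to "milk".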

## Word2vec

Word2vec is a statistical learning algorithm that develops word embeddings from a corpus of text. Word2vec uses one of two different model architectures to come up with the values that define a collection of word embeddings.

One method is to use the continuous bag-of-words (CBOW) representation of a piece of text. The word2vec model goes through each word in the training corpus, in order, and tries to predict what word comes at each position based on applying bag-of-words to the words that surround the word in question. In this approach, the order of the words does not matter!
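A simplified sketch of how continuous bag-of-words training examples might be framed: for each position, the surrounding words within a window are collected as an unordered bag, and the word at that position is the prediction target. The function name, the window size of 2 and the example sentence are assumptions for illustration. The real word2vec training procedure involves more than this.

```python
def cbow_examples(tokens, window=2):
    """Return (context_words, target_word) pairs for each position in tokens."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        # Sorting discards the original order, mimicking a bag of words
        examples.append((sorted(context), target))
    return examples

tokens = ["the", "quick", "brown", "fox", "jumps"]
for context, target in cbow_examples(tokens):
    print(context, "->", target)
```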

The other method word2vec can use to create word embeddings is continuous skip-grams. Skip-grams function similarly to n-grams, except instead of looking at groupings of n-consecutive words in a text, we can look at sequences of words that are separated by some specified distance between them.

When using continuous skip-grams, the order of context is taken into consideration! Because of this, training the word embeddings takes longer than when using continuous bag-of-words. The results, however, are often better!
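The skip-gram architecture in word2vec is commonly described as predicting each nearby context word from the current word. The sketch below frames training examples that way, as (target, context) pairs within a window. That framing, the function name, the window size of 2 and the example sentence are assumptions for illustration, not a reproduction of the actual training code.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target_word, context_word) pairs within the given window."""
    pairs = []
    for i, target in enumerate(tokens):
        start, stop = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["the", "quick", "brown", "fox"]
for target, context in skipgram_pairs(tokens):
    print(target, "->", context)
```

Each word produces one training pair per neighbour rather than a single example per position, which is one reason this approach takes longer to train.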

With either the continuous bag-of-words or continuous skip-grams representations as training data, word2vec then uses a shallow, 2-layer neural network to come up with the values that place words with a similar context in vectors near each other and words with different contexts in vectors far apart from each other.

## Gensim

spaCy's pretrained English models are trained on general-purpose corpora such as OntoNotes 5, which is distributed by the Linguistic Data Consortium.

Gensim allows us to train our own word embedding model on our own corpus of text.

In [4]:
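import gensim

# Toy corpus for illustration only: a list of tokenized documents
corpus = [["i", "love", "nlp"], ["i", "love", "word", "embeddings"]]
# gensim 4.x names this parameter vector_size; gensim 3.x called it size
model = gensim.models.Word2Vec(corpus, vector_size=100, window=5, min_count=1, workers=2, sg=1)

- corpus is a list of lists, where each inner list is a document in the corpus and each element in the inner lists is a word token
- vector_size (named size in gensim versions before 4.0) determines how many dimensions our word embeddings will include. Word embeddings often have hundreds of dimensions! Here we will create vectors of 100 dimensions to keep things simple.
- sg=1 selects the skip-gram architecture (sg=0 selects continuous bag-of-words); window is the maximum distance between a word and its context words, min_count drops words that appear fewer times than this, and workers is the number of training threads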
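The list-of-lists shape described above can be produced from raw text in many ways. Here is a minimal sketch, assuming lowercase word tokenization with a simple regular expression. The two example sentences are made up, and a real project would more likely use a proper tokenizer.

```python
import re

# Made-up raw documents for illustration
raw_documents = [
    "Ross and Rachel were on a break.",
    "Rachel got off the plane!",
]

# One inner list of lowercase word tokens per document
corpus = [re.findall(r"[a-z']+", document.lower()) for document in raw_documents]
print(corpus)
```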

To view the entire vocabulary used to train the word embedding model, we can use .wv.key_to_index.items() in gensim 4.x (earlier versions exposed the vocabulary through .wv.vocab.items()).

In [None]:
vocabulary_of_model = list(model.wv.key_to_index.items())

When we train a word2vec model on a smaller corpus of text, we pick up on the unique ways in which words of the text are used.

For example, if we were using scripts from the television show Friends as a training corpus, the model would pick up on the unique ways in which words are used in the show. The generalized vectors in a spaCy model might not place the vectors for “Ross” and “Rachel” close together. A gensim word embedding model trained on Friends’ scripts, however, would place the vectors for “Ross” and “Rachel”, two characters who have an on-again, off-again relationship throughout the show, very close together!

To easily find which vectors gensim placed close together in its word embedding model, we can use the .most_similar() method.

In [None]:
# The similarity methods live on the model's word vectors (model.wv)
model.wv.most_similar("my_word_here", topn=100)

- "my_word_here" is the target word token we want to find most similar words to
- topn is a keyword argument that indicates how many similar word vectors we want returned
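To make the idea behind .most_similar() concrete, here is a minimal sketch that ranks words by cosine similarity over a handful of hypothetical 3-dimensional vectors. The vectors and the helper function are made up for illustration, and gensim's own implementation is more elaborate.

```python
import numpy as np

# Hypothetical embeddings, invented for this example
embeddings = {
    "ross": np.array([0.9, 0.1, 0.0]),
    "rachel": np.array([0.8, 0.2, 0.1]),
    "coffee": np.array([0.1, 0.9, 0.2]),
    "couch": np.array([0.0, 0.7, 0.6]),
}

def most_similar(word, topn=2):
    """Return the topn (word, cosine similarity) pairs closest to word."""
    target = embeddings[word]
    scores = []
    for other, vector in embeddings.items():
        if other == word:
            continue
        similarity = np.dot(target, vector) / (
            np.linalg.norm(target) * np.linalg.norm(vector)
        )
        scores.append((other, float(similarity)))
    # Highest similarity first
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(most_similar("ross", topn=2))
```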

One last gensim method we will explore is a rather fun one: .doesnt_match().

In [None]:
model.wv.doesnt_match(["asia", "mars", "pluto"])

When given a list of terms in the vocabulary as an argument, .doesnt_match() returns the term that is furthest from the others.
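One way to read "furthest from the others" is to compare each word's unit-length vector with the average of all of them and return the least similar word. The sketch below does that over hypothetical 2-dimensional vectors. Both the vectors and the assumption that this matches gensim's internal logic are for illustration only.

```python
import numpy as np

# Hypothetical embeddings, invented for this example
embeddings = {
    "asia": np.array([0.9, 0.1]),
    "mars": np.array([0.1, 0.9]),
    "pluto": np.array([0.2, 0.8]),
}

def doesnt_match(words):
    """Return the word whose vector is least similar to the group average."""
    unit_vectors = np.array(
        [embeddings[w] / np.linalg.norm(embeddings[w]) for w in words]
    )
    mean_vector = unit_vectors.mean(axis=0)
    # Dot product of each unit vector with the mean; lowest is the odd one out
    similarities = unit_vectors @ mean_vector
    return words[int(np.argmin(similarities))]

print(doesnt_match(["asia", "mars", "pluto"]))  # asia
```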
