# Introduction to Word Embeddings

Thus far, we've focused on bag-of-words approaches to text analysis, where the text is represented as a vector of word frequencies. This generally works pretty well - we can do a decent job of topic modeling and supervised classification with this approach. However, word frequencies alone don't tell the whole story. The ordering of words, for example, provides additional context that word frequencies don't capture. Furthermore, words can be used in a variety of ways, with different meanings that get lost in a word frequency representation.

An alternative formalization of text consists of representing the words (or bi-grams, phrases, etc.) as vectors. They're also called word embeddings, because we embed each word in a multi-dimensional vector space. A word vector has no inherent meaning to humans - ultimately, it's just a bunch of floating point numbers. But word vectors are useful because they're a numerical representation of text that captures its semantic meaning, and can easily be used in downstream tasks, such as dictionary methods, classification, topic modeling, etc. Furthermore, the vector representation can be used to perform semantic tasks, such as finding synonyms, testing analogies, and others. The million dollar question, however, is: how do we create the word vector in the first place?

The answer to the million dollar question is to pick the right task. Specifically, we're going to calculate the word vectors so that they can be successfully used in one of two tasks: predicting surrounding words, or predicting words within a context. In this workshop, we're going to both use pre-trained word embeddings, and construct our own word embeddings.
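To make the "predicting surrounding words" task a little more concrete, here is a minimal sketch of how a sentence can be turned into (center word, context word) pairs. In the `word2vec` literature, predicting the surrounding words from a center word is usually called skip-gram, and predicting a word from its surrounding context is called continuous bag-of-words (CBOW). The helper name `context_pairs`, the example sentence, and the window size are illustrative assumptions, not part of `gensim`.

```python
# Toy sketch: build (center, context) pairs from one tokenized sentence.
# `context_pairs` is a hypothetical helper, not a gensim function.
def context_pairs(tokens, window=2):
    """Return (center_word, context_word) pairs within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the quick brown fox jumps".split()
pairs = context_pairs(tokens, window=1)
print(pairs)
```

A model trained on pairs like these has to give words that appear in similar contexts similar vectors, which is where the semantic structure of the embedding comes from.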

# The Word Embedding Model

# Install and Import

In [None]:
import numpy as np
import pandas as pd

We'll be using a package called `gensim` to conduct our word embedding experiments. `gensim` is one of the major Python packages for natural language processing, largely aimed at using different kinds of embeddings.

If you don't have `gensim` installed, you can install it directly within this notebook:

In [None]:
# Run if you do not have gensim installed
!pip install gensim

In [None]:
import gensim.downloader as api

# Using Pre-trained Word Embeddings

The first thing we'll do is use a pre-trained word embedding. This means that we're downloading a word embedding model that has already been trained on a large corpus. Researchers have trained a variety of models in different contexts that are freely available on `gensim`. We can take a look at a few of them by looking in the `gensim` downloader:

In [None]:
gensim_models = list(api.info()['models'].keys())
print(gensim_models)

We are going to use the `word2vec-google-news-300` model: this is a word embedding model that is trained on Google News, where the embedding is 300 dimensions. Downloading this might take a while! The word embedding model is nearly 2 GB. 

In [None]:
wv = api.load('word2vec-google-news-300')

How many word vectors are available in this word embedding model? We can access the `index_to_key` member variable to find out:

In [None]:
n_words = len(wv.index_to_key)
print(f"Number of words: {n_words}")
print(wv.index_to_key[:20])

The model is trained using a vocabulary of size 3 million! This is a huge model, which takes hours to train. This is why we used a pre-trained model - we likely don't have the resources to train this on our local machines.

Accessing the actual word vectors can be done by treating the word vector model as a dictionary. For example, let's take a look at the word vector for `"banana"`:

In [None]:
print(wv["banana"])
print(wv["banana"].size)

As promised, the word vector is 300-dimensional. Looking at the actual values of the vector is pretty uninformative - the values appear to be random floats. However, now that the word has been transformed into a vector, we can more easily perform computations on it that correspond to semantic operations. Let's take a look at a few examples.

## Word Similarity

A semantic question we can ask is which words are similar to "banana". How does word similarity look in vector operations? We'd expect similar words to have vectors that are closer to each other in vector space.

There are many metrics of vector similarity - one of the most useful ones is the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity). It ranges from -1 to 1: vectors pointing in the same direction have a cosine similarity of 1, orthogonal vectors have a cosine similarity of 0, and vectors pointing in opposite directions have a cosine similarity of -1. `gensim` provides a function that lets us find the most similar vectors to a queried vector - let's give it a shot!
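Before querying the model, it can help to see cosine similarity computed by hand. The snippet below is a small sketch using made-up 2-dimensional vectors (not real word vectors); the function name `cosine_similarity` is our own.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up toy vectors, chosen so the answers are easy to check
a = np.array([1.0, 0.0])
b = np.array([0.0, 2.0])   # orthogonal to a
c = np.array([3.0, 0.0])   # same direction as a, different length
d = np.array([-1.0, 0.0])  # opposite direction to a

print(cosine_similarity(a, b))  # 0.0
print(cosine_similarity(a, c))  # 1.0
print(cosine_similarity(a, d))  # -1.0
```

Notice that `a` and `c` have different lengths but a cosine similarity of 1: the metric only depends on the direction of the vectors, not their magnitude.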

In [None]:
wv.most_similar('banana')

The most similar vectors to "banana" are other fruits and foods! These are conceptual relationships that are reflected in the word embedding that we did not explicitly train in the model. Let's try another, more abstract word:

In [None]:
wv.most_similar('happy')

We see synonyms of "happy", and even an antonym ("disappointed"). 

In [None]:
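# Same kind of query using the multiplicative similarity objective;
# the function needs at least one positive word ('happy' is reused here as the example)
wv.most_similar_cosmul(positive=['happy'])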

### Challenge 1

Look up the `doesnt_match` function in `gensim`'s documentation. Use this function to identify which word doesn't match in the following group:

banana, apple, strawberry, happy

Then, try it on groups of words that you choose. Here are some suggestions:

1. A group of fruits, and a vegetable. Can it identify that the vegetable doesn't match?
2. A group of vehicles that travel by land, and a vehicle that travels by air (e.g., a plane or helicopter). Can it identify the vehicle that flies?
3. A group of scientists (e.g., biologist, physicist, chemist, etc.) and a person who does not study science (e.g., an artist). Can it identify the occupation that is not science based?

To be clear, `word2vec` does not learn the precise nature of the differences between these groups. However, the semantic differences correspond to similar words appearing near each other in large corpora.

## Word Analogies

One of the most famous usages of `word2vec` is via word analogies. For example:

Paris is to France as Berlin is to Germany. 

Here, the analogy is between (Paris, France) and (Berlin, Germany), with "capital city" being the concept that connects them. We can translate the "analogy" relationship into vector operations. Suppose we replace each word with its word vector. Then, the analogy is

$\mathbf{v}_{\text{France}} - \mathbf{v}_{\text{Paris}} \approx \mathbf{v}_{\text{Germany}} - \mathbf{v}_{\text{Berlin}}.$

The vector difference here represents the notion of "capital city". Presumably, going from the Paris vector to the France vector (i.e., the vector difference) will be the same as going from the Berlin vector to the Germany vector, if that difference carries similar semantic meaning.
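As a rough illustration of this idea, consider a toy example with hand-made 2-dimensional vectors, where each country vector is built as its capital's vector plus one shared offset. All of the numbers here are invented for illustration; in a real embedding the relationship only holds approximately.

```python
import numpy as np

# Hand-made toy vectors (hypothetical values, not from the real model).
# Each country is its capital plus the same "capital -> country" offset.
offset = np.array([0.0, 1.0])
toy = {
    "Paris": np.array([1.0, 0.0]),
    "Berlin": np.array([2.0, 0.5]),
}
toy["France"] = toy["Paris"] + offset
toy["Germany"] = toy["Berlin"] + offset

# The two differences are identical by construction
same_offset = np.allclose(toy["France"] - toy["Paris"],
                          toy["Germany"] - toy["Berlin"])
print(same_offset)  # True

# Start at Berlin and apply the Paris -> France offset
query = toy["France"] - toy["Paris"] + toy["Berlin"]

# Find the toy word whose vector is closest to the query (Euclidean distance)
nearest = min(toy, key=lambda w: np.linalg.norm(toy[w] - query))
print(nearest)  # Germany
```

The toy lookup uses Euclidean distance for simplicity; with real word vectors, we'll use cosine similarity instead.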

Let's test this directly. We'll do so by rewriting the above expression:

$\mathbf{v}_{\text{France}} - \mathbf{v}_{\text{Paris}} + \mathbf{v}_{\text{Berlin}} \approx \mathbf{v}_{\text{Germany}}.$

We'll calculate the difference between France and Paris, add on Berlin, and find the closest vector to that quantity. Notice that, in all these operations, we set `norm=True`, and renormalize. That's because different vectors might have different lengths, so the normalization puts everything on a common scale.

In [None]:
# Calculate "capital city" vector difference
difference = wv.get_vector('France', norm=True) - wv.get_vector('Paris', norm=True) 
# Add on Berlin
difference += wv.get_vector('Berlin', norm=True)
# Renormalize vector
difference /= np.linalg.norm(difference)

In [None]:
# What is the most similar vector?
wv.most_similar(difference)

Germany is the most similar! So, word analogies seem possible with `word2vec`.

Carrying out these operations can be done in one fell swoop with the `most_similar` function. Check the documentation for this function. What do the `positive` and `negative` arguments mean?

### Challenge 2

Carry out the following word analogies:

1. Mouse : Mice :: Goose : ?
2. Kangaroo : Joey :: Cat : ?
3. Mexico : Peso :: Dollar : ?
4. Happy : Sad :: Up : ?
5. California : Sacramento :: Canada : ?
6. California : Sacramento :: Washington : ?

Some work well, and others don't work as well. Try to come up with your own analogies!

# Creating Custom Word Embeddings

# Document Embeddings