# Python Text Analysis: Word Embeddings

Thus far, we've focused on bag-of-words approaches to text analysis, where the text is represented as a vector of word frequencies. This generally works pretty well - we can do a decent job of supervised classification with this approach. However, word frequencies alone don't tell the whole story. The ordering of words, for example, provides additional context that word frequencies don't capture. Furthermore, words can be used in a variety of ways, with different meanings that get lost in a word frequency representation.

An alternative formalization of text consists of representing the words (or bi-grams, phrases, etc.) as vectors. They're also called word embeddings, because we embed the word in a higher dimensional space. A word vector has no inherent meaning to humans - ultimately, it's just a bunch of floating point numbers. But word vectors are useful because they're a numerical representation of text that captures its semantic meaning, and can easily be used in downstream tasks, such as dictionary methods, classification, topic modeling etc. Furthermore, the vector representation can be used to perform semantic tasks, such as finding synonyms, testing analogies, and others. The big question, however, is: how do we create the word vector in the first place?

The answer is to pick the right task. Specifically, we're going to calculate the word vectors so that they can be successfully used in one of two tasks: predicting surrounding words, or predicting words within a context.

# The Word Embedding Model: `word2vec`

The word embedding model, generally referred to as `word2vec`, was developed by [Mikolov et al.](https://arxiv.org/abs/1310.4546) in 2013. The basic premise is to find vector representations of tokens that have semantic meaning. How do we go about learning a "good" vector representation from the data?

Mikolov et al. proposed two approaches: the **skip-gram (SG)** and the **continuous bag-of-words (CBOW)**. Both approaches are similar in that we use the vector representation of a token to try and predict what the nearby tokens are with a shallow neural network.

![word2vec](../images/word2vec.png)

In the continuous bag-of-words model, our goal is to predict a word $w(t)$, given the words that surround it - e.g., $w(t-2), w(t-1), w(t+1), w(t+2)$, etc. So, in an example text such as `I went to the store to get some apples`, we may try to use the word vectors for `I`, `went`, `to`, `the`, `to`, `get`, `some`, `apples` to predict the word `store`. This would correspond to a *window size* of 4: 4 words on either side of the target word.

In the skip-gram model, we construct a word vector that can be used to predict the words surrounding a specific word $w(t)$. This is the reverse of the continuous bag-of-words, and is a harder task, since we have to predict more from less information. In the above example, we'd aim to predict the remaining words in the sentence from the word vector for `store`. 
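To make the two prediction tasks concrete, here is a minimal sketch that lists the training examples each one would draw from the sentence above. The helper names `cbow_examples` and `skipgram_examples` are made up for illustration, and real implementations such as `gensim` add refinements on top of this, so treat it as a picture of the idea rather than the actual training code.

```python
def cbow_examples(tokens, window):
    """Yield (context_words, target_word) pairs for the CBOW task."""
    for i, target in enumerate(tokens):
        # Up to `window` words on each side of position i
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, target


def skipgram_examples(tokens, window):
    """Yield (center_word, context_word) pairs for the skip-gram task."""
    for context, center in cbow_examples(tokens, window):
        for word in context:
            yield center, word


tokens = "I went to the store to get some apples".split()

# CBOW: the surrounding words are the input, the middle word is the target
cbow = list(cbow_examples(tokens, window=4))
print(cbow[4])

# Skip-gram: the center word is the input, and each neighbor is a separate target
skipgram = [pair for pair in skipgram_examples(tokens, window=4) if pair[0] == "store"]
print(skipgram[:3])
```

With a window of 4, CBOW gets a single example for `store`, while the skip-gram gets eight, one per neighboring word.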

You can use either approach to build a set of word embeddings. Mikolov et al. demonstrated that the skip-gram works well on larger corpora. They also introduced training tricks, such as negative sampling and subsampling of frequent words, that make the skip-gram efficient to train.

The mechanics of how the training is actually done revolve around a **shallow neural network**. An **objective function** is specified - a mathematical expression that quantifies how well we predicted a word - which allows the values of the word vectors to be optimized using **backpropagation**. We won't go into these details for this workshop, but check out the Python Deep Learning workshop if you'd like to learn more about neural networks!

Let's jump into it!

# Installing `gensim`

In [None]:
import numpy as np
import pandas as pd
import re

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegressionCV

We'll be using a package called `gensim` to conduct our word embedding experiments. `gensim` is one of the major Python packages for natural language processing, largely aimed at using different kinds of embeddings.

If you don't have `gensim` installed, you can install it directly within this notebook:

In [None]:
# Run if you do not have gensim installed
!pip install gensim

In [None]:
import gensim
import gensim.downloader as api

# Using Pre-trained Word Embeddings

The first thing we'll do is use a pre-trained word embedding. This means that we're downloading a word embedding model that has already been trained on a large corpus. Researchers have trained a variety of models in different contexts that are freely available on `gensim`. We can take a look at a few of them by looking in the `gensim` downloader:

In [None]:
gensim_models = list(api.info()['models'].keys())
print(gensim_models)

We are going to use the `word2vec-google-news-300` model: this is a word embedding model that is trained on Google News, where the embedding is 300 dimensions. Downloading this might take a while! The word embedding model is nearly 2 GB. 

In [None]:
wv = api.load('word2vec-google-news-300')

How many word vectors are available in this word embedding model? We can access the `index_to_key` member variable to find out:

In [None]:
n_words = len(wv.index_to_key)
print(f"Number of words: {n_words}")
print(wv.index_to_key[:20])

The model is trained using a vocabulary of size 3 million! This is a huge model, which takes hours to train. This is why we used a pre-trained model - we likely don't have the resources to train this on our local machines.

Accessing the actual word vectors can be done by treating the word vector model as a dictionary. For example, let's take a look at the word vector for `"banana"`:

In [None]:
print(wv["banana"])
print(wv["banana"].size)

As promised, the word vector is 300-dimensional. Looking at the actual values of the vector is pretty uninformative - the values appear to be random floats. However, now that the word has been transformed into a vector, we can more easily perform computations on it that correspond to semantic operations. Let's take a look at a few examples.

## Word Similarity

A semantic question we can ask is: which words are similar to "banana"? How does word similarity look in vector operations? We'd expect similar words to have vectors that are closer to each other in vector space.

There are many metrics of vector similarity - one of the most useful is the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity). It ranges from -1 to 1: vectors pointing in opposite directions have a cosine similarity of -1, orthogonal vectors have a cosine similarity of 0, and parallel vectors have a cosine similarity of 1. `gensim` provides a function that lets us find the most similar vectors to a queried vector - let's give it a shot!

In [None]:
wv.most_similar('banana')

The most similar vectors to "banana" are other fruits and foods! These are conceptual relationships that are reflected in the word embedding that we did not explicitly train in the model. Let's try another, more abstract word:

In [None]:
wv.most_similar('happy')

We see synonyms of "happy", and even an antonym ("disappointed"). 
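To see what the cosine similarity behind these rankings actually computes, here is a small sketch on made-up two-dimensional vectors. The function `cosine_similarity` below is written by hand for illustration and is not `gensim`'s implementation; it divides the dot product by the product of the two vector lengths.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Same direction, different lengths: similarity is 1 (up to rounding)
parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# At right angles: similarity is 0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
# 45 degrees apart: somewhere in between
in_between = cosine_similarity([1.0, 0.0], [1.0, 1.0])

print(parallel, orthogonal, in_between)
```

Because the lengths are divided out, only the direction of each vector matters, not how long it is.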

## Challenge 1

Look up the `doesnt_match` function in `gensim`'s documentation. Use this function to identify which word doesn't match in the following group:

banana, apple, strawberry, happy

Then, try it on groups of words that you choose. Here are some suggestions:

1. A group of fruits, and a vegetable. Can it identify that the vegetable doesn't match?
2. A group of vehicles that travel by land, and a vehicle that travels by air (e.g., a plane or helicopter). Can it identify the vehicle that flies?
3. A group of scientists (e.g., biologist, physicist, chemist, etc.) and a person who does not study an empirical science (e.g., an artist). Can it identify the occupation that is not science based?

To be clear, `word2vec` does not learn the precise nature of the differences between these groups. However, the semantic differences correspond to similar words appearing near each other in large corpora.

## Word Analogies

One of the most famous usages of `word2vec` is via word analogies. For example:

`Paris : France :: Berlin : Germany`

Here, the analogy is between (Paris, France) and (Berlin, Germany), with "capital city" being the concept that connects them. We can abstract the "analogy" relationship to vector modeling. Let's pretend we're working with each of the vectors. Then, the analogy is

$\mathbf{v}_{\text{France}} - \mathbf{v}_{\text{Paris}} \approx \mathbf{v}_{\text{Germany}} - \mathbf{v}_{\text{Berlin}}.$

The vector difference here represents the notion of "capital city". Presumably, going from the Paris vector to the France vector (i.e., the vector difference) will be the same as going from the Berlin vector to the Germany vector, if that difference carries similar semantic meaning.

Let's test this directly. We'll do so by rewriting the above expression:

$\mathbf{v}_{\text{France}} - \mathbf{v}_{\text{Paris}} + \mathbf{v}_{\text{Berlin}} \approx \mathbf{v}_{\text{Germany}}.$

We'll calculate the difference between France and Paris, add on Berlin, and find the closest vector to that quantity. Notice that, in all these operations, we set `norm=True`, and renormalize. That's because different vectors might be of different lengths, so the normalization puts everything on a common scale.

In [None]:
# Calculate "capital city" vector difference
difference = wv.get_vector('France', norm=True) - wv.get_vector('Paris', norm=True) 
# Add on Berlin
difference += wv.get_vector('Berlin', norm=True)
# Renormalize vector
difference /= np.linalg.norm(difference)

In [None]:
# What is the most similar vector?
wv.most_similar(difference)

Germany is the most similar! So, word analogies seem possible with `word2vec`.
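The same arithmetic can be checked by hand on a toy example. The sketch below uses hypothetical four-dimensional vectors whose dimensions we chose ourselves (real embedding dimensions are not interpretable like this), and a made-up helper `analogy` that leaves the query words out of the candidates.

```python
import math

# Hypothetical vectors with dimensions [is_country, french, german, spanish]
toy = {
    "Paris":   [0.0, 1.0, 0.0, 0.0],
    "France":  [1.0, 1.0, 0.0, 0.0],
    "Berlin":  [0.0, 0.0, 1.0, 0.0],
    "Germany": [1.0, 0.0, 1.0, 0.0],
    "Madrid":  [0.0, 0.0, 0.0, 1.0],
    "Spain":   [1.0, 0.0, 0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(a, b, c, vectors):
    """Answer `a : b :: c : ?` with the word closest to b - a + c."""
    query = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Leave the three query words out of the candidates
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(query, vectors[word]))


answer = analogy("Paris", "France", "Berlin", toy)
print(answer)
```

Here `France - Paris` isolates the "is a country" dimension, and adding `Berlin` lands exactly on the toy `Germany` vector. With real embeddings the match is only approximate, which is why we look for the nearest vector.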

Carrying out these operations can be done in one fell swoop with the `most_similar` function. Check the documentation for this function. What do the `positive` and `negative` arguments mean?

## Challenge 2

Carry out the following word analogies:

1. Mouse : Mice :: Goose : ?
2. Kangaroo : Joey :: Cat : ?
3. United States : Dollar :: Mexico : ?
4. Happy : Sad :: Up : ?
5. California : Sacramento :: Canada : ?
6. California : Sacramento :: Washington : ?

What about something more abstract, such as:

7. United States : hamburger :: Canada : ?

Some work well, and others don't work as well. Try to come up with your own analogies!

# Creating Custom Word Embeddings

In the previous example, we used a *pretrained* word embedding. That is, the word embedding was already trained using a very large corpus from Google News. What about when we want to train our own word embeddings from a new corpus?

We can do that using `gensim` as well. However, if the corpus is large, training becomes very computationally taxing. So, we'll try training our own word embeddings, but on a much smaller corpus. Specifically, we'll return to one you should recognize: the airline tweets corpus!

Let's go ahead and get set up by importing the dataset and preprocessing, as we did in Part 2.

In [None]:
tweets_path = '../data/airline_tweets.csv'
tweets = pd.read_csv(tweets_path, sep=',')

In [None]:
def preprocess(text):
    """Preprocesses a string."""
    # Lowercase
    text = text.lower()
    # Replace URLs
    url_pattern = r'https?:\/\/.*[\r\n]*'
    url_repl = ' URL '
    text = re.sub(url_pattern, url_repl, text)
    # Replace digits
    digit_pattern = r'\d+'
    digit_repl = ' DIGIT '
    text = re.sub(digit_pattern, digit_repl, text)
    # Replace hashtags
    hashtag_pattern = r'(?:^|\s)[＃#]{1}(\w+)'
    hashtag_repl = ' HASHTAG '
    text = re.sub(hashtag_pattern, hashtag_repl, text)
    # Replace users
    user_pattern = r'@(\w+)'
    user_repl = ' USER '
    text = re.sub(user_pattern, user_repl, text)
    # Remove blank spaces
    blankspace_pattern = r'\s+'
    blankspace_repl = ' '
    text = re.sub(blankspace_pattern, blankspace_repl, text).strip()
    return text

In [None]:
tweets['text_processed'] = tweets['text'].apply(preprocess)
tweets['text_processed'].head()
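To see what the placeholder substitutions do, here is a short sketch that applies the digit, hashtag, and user patterns from `preprocess` to a made-up tweet. The example text is invented for illustration, and the URL step is left out for brevity.

```python
import re

# A made-up tweet, not taken from the dataset
example = "@united flight 123 was delayed 2 hours #fail"

text = example.lower()
# Digits become a DIGIT placeholder
text = re.sub(r'\d+', ' DIGIT ', text)
# Hashtags (with their leading whitespace) become a HASHTAG placeholder
text = re.sub(r'(?:^|\s)[＃#]{1}(\w+)', ' HASHTAG ', text)
# User mentions become a USER placeholder
text = re.sub(r'@(\w+)', ' USER ', text)
# Collapse the extra whitespace introduced by the placeholders
text = re.sub(r'\s+', ' ', text).strip()

print(text)
```

Replacing specific handles and numbers with shared placeholders means the model learns one vector for "a user mention" or "a number", rather than thousands of rarely seen tokens.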

To create our own model, we need to import the `Word2Vec` class from `gensim`:

In [None]:
from gensim.models import Word2Vec

You can check out the documentation for this class [here](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec). The main input to `Word2Vec` is a `sentences` argument, which consists of a list of lists: the outer list enumerates the documents, and the inner list enumerates the tokens within each document. So, we need to run a word tokenizer on each of the tweets. Let's use `nltk`'s word tokenizer:

In [None]:
# Requires NLTK's tokenizer data: nltk.download('punkt') ('punkt_tab' on newer NLTK releases)
from nltk.tokenize import word_tokenize

In [None]:
sentences = [word_tokenize(tweet) for tweet in tweets['text_processed']]

In [None]:
sentences[0]
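If the list-of-lists shape is unfamiliar, this tiny sketch builds the same structure from two made-up documents. It uses `str.split` as a stand-in for the tokenizer, which is an assumption made only to keep the example dependency-free; `word_tokenize` handles punctuation more carefully.

```python
# Two made-up, already preprocessed documents
toy_docs = [
    "USER thanks for the great flight",
    "USER my bag is lost",
]

# Outer list: one entry per document. Inner list: that document's tokens.
toy_sentences = [doc.split() for doc in toy_docs]

print(toy_sentences)
```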

Now, we train the model. We are going to use CBOW (`sg=0`), which is `gensim`'s default and is quick to train on a corpus of this size. Take note of what other arguments we set:

In [None]:
model = Word2Vec(
    sentences=sentences,
    vector_size=30,
    window=5,
    min_count=1,
    sg=0)

The model is now trained! Let's take a look at some word vectors. We can access them using the `wv` attribute:

In [None]:
len(model.wv)

In [None]:
model.wv['worst']

Let's try running a `most_similar` query to see what we end up with:

In [None]:
model.wv.most_similar('worst')

In [None]:
model.wv.most_similar('great')

In [None]:
model.wv.most_similar('united')

The `word2vec` model learned these relationships from the roughly 11,000 tweets in the corpus. Some of these relationships look similar to those in the Google News word embeddings, while others differ because of the particular nature of the corpus and its smaller number of documents.

## Challenge 3

Try experimenting with different vector sizes, window sizes, and other parameters available in the `Word2Vec` class. Additionally, try training using skip-grams rather than CBOW.

# Classifying with Trained Embeddings

In the previous module, we used the airline tweets dataset to perform sentiment classification: we tried to classify the sentiment of a text given the bag-of-words representation. Can we do something similar with a word embedding representation?

In the word embedding representation, we have an $N$-dimensional vector for each word in a tweet. How can we come up with a representation for the entire tweet?

The simplest approach is to average the word vectors together to come up with a "tweet representation". Let's see how well this works for predicting sentiment.
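Here is a minimal sketch of the averaging step on made-up two-dimensional vectors. The dictionary `toy_vectors` and the helper `average_vector` are hypothetical stand-ins for a trained model, used only to show the arithmetic.

```python
# Made-up word vectors standing in for a trained embedding
toy_vectors = {
    "great": [1.0, 0.0],
    "flight": [0.0, 1.0],
}


def average_vector(tokens, vectors):
    """Average the vectors of `tokens`, one dimension at a time."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for token in tokens:
        for i, value in enumerate(vectors[token]):
            total[i] += value
    return [value / len(tokens) for value in total]


tweet_vector = average_vector(["great", "flight"], toy_vectors)
reordered = average_vector(["flight", "great"], toy_vectors)

print(tweet_vector, reordered)
```

Both orderings give the same average, so this representation, like bag-of-words, ignores word order.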

First, we need to subset the dataset to only the tweets with positive or negative sentiment:

In [None]:
tweets_binary = tweets[tweets['airline_sentiment'] != 'neutral']
y = tweets_binary['airline_sentiment']
print(y.value_counts(normalize=True))

Now, we need to compute the feature matrix. We will look up the word vector for each token in a tweet, and average those vectors to get the features for that sample:

In [None]:
vector_size = 30
X = np.zeros((len(y), vector_size))

# Enumerate over tweets
for idx, tweet in enumerate(tweets_binary['text_processed']):
    # Tokenize the current tweet
    tokens = word_tokenize(tweet)
    n_tokens = len(tokens)
    # Enumerate over tokens, obtaining word vectors
    for token in tokens:
        X[idx] += model.wv.get_vector(token)
    # Take the average (an empty tweet keeps a zero vector instead of dividing by zero)
    if n_tokens > 0:
        X[idx] /= n_tokens

As before, we'll proceed with splitting the data into train/test examples. We'll bring back the logistic fitter function from before, with some small changes.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
def fit_logistic_regression(X, y):
    """Fits a logistic regression model to provided data."""
    model = LogisticRegressionCV(
        Cs=10,
        penalty='l2',
        max_iter=1000,
        cv=5,
        refit=True).fit(X, y)
    return model

We then run the fit, and evaluate it!

In [None]:
# Fit the logistic regression model
fitter = fit_logistic_regression(X_train, y_train)

In [None]:
print(f"Training accuracy: {fitter.score(X_train, y_train)}")
print(f"Test accuracy: {fitter.score(X_test, y_test)}")

While this performance is pretty good, it's definitely not as good as the bag-of-words representation we used in the previous module. There are a few reasons this might be the case:

1. We used a word embedding trained on a relatively small corpus. A word embedding obtained from a very large corpus would likely perform better. The tricky part in doing this is that our smaller corpus may have some niche tokens that are not in the larger model, so we'd have to work around that.
2. We simply averaged word embeddings across tokens. When doing this, we lose meaning in the ordering of words. Other methods, such as `doc2vec`, have been proposed to address these concerns.
3. Word embeddings might be an overly complicated approach for the task at hand. In a tweet aimed at an airline, a person needs to convey their sentiment in only 140 characters. So they are more likely to use relatively simple words that easily convey sentiment, making a bag-of-words a natural approach.
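One possible workaround for the missing tokens mentioned in item 1 is to skip any token the embedding doesn't know and average the rest. The sketch below uses a plain dictionary with made-up vectors as a stand-in for a pre-trained model; `average_known_vectors` is a hypothetical helper. `gensim`'s `KeyedVectors` objects also support the `in` operator, so the same membership check should carry over.

```python
def average_known_vectors(tokens, vectors, dim):
    """Average the vectors of tokens found in `vectors`, skipping the rest."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        # No token was recognized: fall back to a zero vector
        return [0.0] * dim
    # zip(*known) groups the values of each dimension together
    return [sum(values) / len(known) for values in zip(*known)]


# Made-up stand-in for a pre-trained embedding that lacks our "USER" placeholder
pretrained = {"late": [1.0, 3.0], "flight": [3.0, 1.0]}

vec = average_known_vectors(["late", "flight", "USER"], pretrained, dim=2)
empty = average_known_vectors(["USER"], pretrained, dim=2)

print(vec, empty)
```

Skipping unknown tokens is only one choice. It quietly drops information, so it's worth checking how many tokens end up being skipped.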

It's important to note that we also lose out on the interpretability of the logistic regression model, because the actual dimensions of each word vector do not themselves have any meaning. 

Moral of the story: word embeddings are great, but always start with the simpler model! This is a good way to baseline other approaches, and it might actually work pretty well!
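The class proportions we printed earlier suggest an even simpler baseline: always predict the most common label. The sketch below computes the accuracy of that strategy on made-up labels; `majority_baseline_accuracy` is a hypothetical helper, and the numbers are not results from the airline tweets.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy from always predicting the most common training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(label == most_common_label for label in test_labels)
    return correct / len(test_labels)


# Made-up labels for illustration only
train = ["negative", "negative", "negative", "positive"]
test = ["negative", "positive", "negative", "negative"]

baseline = majority_baseline_accuracy(train, test)
print(f"Majority-class baseline accuracy: {baseline}")
```

Any model worth keeping should beat this number, so it's a useful reference point when the classes are imbalanced.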

## Challenge 4

Write a function that performs the pipeline of building a `word2vec` model and constructing a design matrix. Use this function to try and see if you can change the performance of the model with other parameters (vector sizes, window sizes, etc.).