# Word Embeddings with Word2Vec

What do linguistic computing and spatial analysis have in common? Both need to incorporate the proximity of their data points into a model so that all available information is used. In spatial analysis, proximity comes from well-defined longitude and latitude coordinates. In linguistic computing, however, the proximities and similarities between words aren't as well-defined. If we ignored context entirely and simply one-hot encoded our textual data, we'd miss out on syntax, synonyms, and other characteristics of language that could reveal more about a text. To incorporate "spatial correlation" in a linguistic context, a model needs to recognize and group semantically similar words.
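To make the one-hot limitation concrete, here is a minimal sketch using a made-up three-word vocabulary (the words and helper names are illustrative assumptions, not part of the analysis below). It shows that any two distinct one-hot vectors have a dot product of zero, so a synonym looks exactly as unrelated as an antonym.

```python
def one_hot(word, vocabulary):
    """Return a one-hot vector: 1.0 at the word's position, 0.0 elsewhere."""
    return [1.0 if entry == word else 0.0 for entry in vocabulary]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Toy vocabulary, chosen only for illustration.
vocabulary = ["happy", "joyful", "terrible"]
happy = one_hot("happy", vocabulary)
joyful = one_hot("joyful", vocabulary)
terrible = one_hot("terrible", vocabulary)

# Distinct one-hot vectors never overlap, so both comparisons give 0.0.
print(dot(happy, joyful))    # 0.0
print(dot(happy, terrible))  # 0.0
```

Dense word embeddings avoid this by letting related words share nonzero values in the same dimensions.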

## Background
Word embeddings, of which Word2Vec (word to vector) is one implementation, seek to numerically capture synonymous (happy - joyful) or associated (man - king - queen - woman) words through textual proximity. In other words, they turn each word into a vector of numbers that, when compared to the vectors of other words, establishes similarity or dissimilarity (a "spatial" grouping of text). The numbers themselves represent latent factors learned by the Word2Vec model and are difficult to interpret. Word2Vec was introduced by Tomas Mikolov and his colleagues in 2013. They proposed a relatively shallow neural network that could provide powerful feature engineering for use in more complicated models.

## Theory
At its core, Word2Vec is a shallow neural network. It takes words as input and outputs word vectors that can be used as features in deeper networks. To preprocess the data, the sentences need to be tokenized, or separated into their individual words, which is easily accomplished with `nltk`. Word2Vec then organizes these words into a vocabulary for the text and iteratively constructs word vectors. There are two architectures commonly used to build these vectors: continuous bag of words (CBOW) and continuous skip-gram.
 
**CBOW:** CBOW builds a model that predicts one word based on its surrounding words. It takes a "bag of words" as its input, say "This is a ___ presentation," and uses the weights (word vectors) associated with each surrounding word to predict whether the missing word will be "great," "funny," "terrible," etc. The predicted word is the one judged most likely given all of the surrounding words together.
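The CBOW prediction step can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: the weights are hand-picked rather than learned, the vectors have only 2 dimensions, and a full softmax is used, whereas practical implementations typically rely on approximations such as hierarchical softmax or negative sampling. All names here (`input_vectors`, `output_vectors`, `cbow_probabilities`) are hypothetical.

```python
import math

vocabulary = ["this", "is", "a", "great", "presentation"]
word_index = {word: i for i, word in enumerate(vocabulary)}

# Hypothetical weights with 2 latent dimensions per word. A real model starts
# from random values and learns them during training.
input_vectors = [[0.2, 0.1], [0.1, 0.3], [0.0, 0.2], [0.9, 0.8], [0.4, 0.5]]
output_vectors = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.7, 0.9], [0.3, 0.2]]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cbow_probabilities(context_words):
    """Average the context vectors, then score every vocabulary word."""
    rows = [input_vectors[word_index[w]] for w in context_words]
    hidden = [sum(column) / len(rows) for column in zip(*rows)]
    scores = [sum(h * o for h, o in zip(hidden, out)) for out in output_vectors]
    return softmax(scores)

# Context for the blank in "This is a ___ presentation".
probabilities = cbow_probabilities(["this", "is", "a", "presentation"])
for word, p in zip(vocabulary, probabilities):
    print(f"{word}: {p:.3f}")
```

During training, the gap between these probabilities and the actual missing word is what drives the weight updates.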

**Continuous Skip-Gram:** Skip-gram builds a model that predicts surrounding words given one input word, roughly the opposite of CBOW. Given the word "great," the weights associated with that word are used to predict the words before and after it, for example "a great presentation." The predicted words are again the most likely ones, but this time the probabilities come from a single word rather than from all surrounding words.
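One way to picture skip-gram is through the (input word, surrounding word) training pairs it learns from. The sketch below generates those pairs for the example phrase; the function name `skipgram_pairs` and the choice of a one-word window are assumptions made for illustration.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs within `window` positions of each center word."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["a", "great", "presentation"], window=1)
print(pairs)
# [('a', 'great'), ('great', 'a'), ('great', 'presentation'), ('presentation', 'great')]
```

Each pair is one prediction task: given the first word, the model should assign high probability to the second.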

In the model, the text is first broken into N-grams, known as the window of the text. For the previous sentence and a window of 4, the first two N-grams would be "In the model the" and "the model the text." Word vectors are randomly initialized with weights that determine which word is predicted in which context. The model, using either the continuous bag of words or continuous skip-gram architecture, then predicts the neighboring word of each window and compares it to the actual word. Backpropagation then adjusts the weights. This process continues until the model converges.
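The windowing described above can be reproduced with a short sketch. It follows this paragraph's description of a window as a run of 4 consecutive words; the helper name `sliding_windows` and the simple regular-expression tokenizer are assumptions for illustration, not what `gensim` does internally.

```python
import re

def sliding_windows(text, size):
    """Split text into words and return every run of `size` consecutive words."""
    words = re.findall(r"[A-Za-z'-]+", text)  # keeps hyphenated words like "N-grams" whole
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]

sentence = ("In the model, the text is first broken into N-grams, "
            "known as the window of the text.")
windows = sliding_windows(sentence, 4)
print(windows[:2])  # ['In the model the', 'the model the text']
```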

After training the model, word embeddings can be used as features or compared to one another to create measurements of similarity. The resulting word vector values aren't interpretable by themselves, but can be used to show how similar two words are or predict what the next word will be given a sentence fragment.

![CBOW vs. Skip-gram](architectures.png)

Image from https://wiki.pathmind.com/word2vec.

## Implementation with `nltk` and `gensim`
### The Importance of Being Earnest
The main output of a Word2Vec model by itself is a list of similarity indexes for a given word. To illustrate this, we'll apply a model from each approach to *The Importance of Being Earnest* to see if Word2Vec can accurately describe each of the main characters using the words most similar to or most associated with each of the first names. The .txt file was downloaded from [Project Gutenberg](https://www.gutenberg.org/). For ease of cleaning, I went into the .txt file itself and deleted any content before/after the play (Gutenberg information).

First, we import all necessary libraries. `nltk` will preprocess the text into tokenized data; `itertools` and `collections` will help us explore the resulting cleaned text; and `gensim` will fit the Word2Vec model.

In [1]:
import numpy as np
import pandas as pd
from nltk.corpus import stopwords # filter out stopwords
from nltk.tokenize import sent_tokenize, word_tokenize # tokenize words
from itertools import chain # unlist nested list 
from collections import Counter # frequency list
from gensim.models import Word2Vec
from IPython.display import display_html # display tables

After reading in the data and removing newline indicators (\n), we need to separate the play into sentences and separate each sentence into words, a process known as tokenizing. Stopwords, or words that contribute little contextual meaning to a sentence ("the", "a", etc.), are filtered out so that only the words that provide information are associated with each character. While `nltk` has a list of stopwords, I also added punctuation and additional words that I considered uninformative. Since string comparison in Python is case-sensitive ("Jack" and "jack" would count as different words), all words are standardized to lowercase.

In [2]:
## PREPROCESSING
# read in text
with open("earnest.txt", encoding="utf8") as text:
    t = text.read()

# strip the byte-order mark (\ufeff) that can appear at the start of a UTF-8 file, then remove newlines
t2 = t.replace("\ufeff", "")
earnest = t2.replace("\n", " ")

# put together a list of stopwords
sw = stopwords.words('english')
punct = [',', ':', '.', ';', '!', '-', '‘', '[', '’', ']', '?', '*', '']
sw.extend(punct)
user_sw = ['like', 'much', 'say', 'get', 'may', 'must', 'mr.', 'mr', 'indeed', 
           'quite', 'would', 'could', 'us', 'ever', 'really', 'one', 'well']
sw.extend(user_sw)

# nested list of words in each sentence in the play
tokens = []
for i in sent_tokenize(earnest): # tokenize sentences
    temp = []

    for j in word_tokenize(i): # tokenize words
        word = j.lower() # standardize to lowercase
        if word not in sw: # filter out stopwords
            temp.append(word)

    tokens.append(temp)


To help choose the hyperparameters (discussed next), we explore the data by looking at the total number of words, the number of unique words, and the average sentence length of the text (with and without sentences that are one word long).

In [3]:
## DATA EXPLORATION
# number of words
print(sum([len(i) for i in tokens]))

# number of unique words
words = list(chain(*tokens))
print(len(Counter(words).keys()))

# average sentence length
print(np.mean([len(i) for i in tokens]))
print(np.mean([len(i) for i in tokens if len(i) > 1]))

9726
2508
3.0336868371802868
4.317196167423097


For our analysis, we will use these hyperparameters:
- vector_size: number of latent variables for each word (length of the vector)
    - I arbitrarily chose 100. Without another model to put the resulting word embeddings through, it is difficult to assess the goodness-of-fit of our model
- window: the maximum distance between the current word and a predicted word within a sentence
    - Although this isn't a rule of thumb, I chose 4 because that's about the average sentence length for this text
- min_count: if a word appears fewer times than specified, ignore that word
    - I didn't want words that only appeared once, so I set this to 2
- sg: training algorithm (1 = skip-gram, 0 = CBOW)
    - Both algorithms will be used for comparison
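To illustrate the effect of `min_count`, here is a small sketch that drops rare words from a made-up corpus. It only mimics the outcome of the setting (words seen fewer than `min_count` times are ignored); the function `apply_min_count` and the toy sentences are hypothetical and do not reflect how `gensim` implements the filter internally.

```python
from collections import Counter

def apply_min_count(sentences, min_count):
    """Drop every word that appears fewer than `min_count` times in the corpus."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return [[w for w in sentence if counts[w] >= min_count] for sentence in sentences]

# Toy corpus, for illustration only.
toy_sentences = [["jack", "ernest", "brother"], ["ernest", "cecily"], ["jack", "town"]]
filtered = apply_min_count(toy_sentences, min_count=2)
print(filtered)  # [['jack', 'ernest'], ['ernest'], ['jack']]
```

With `min_count = 2`, only "jack" and "ernest" survive because every other word appears once.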

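To fit the model, we simply call the `Word2Vec` constructor with the data and specified hyperparameters. Since there is randomness involved in this model, we also set a seed to keep results consistent. (Per the `gensim` documentation, a fully reproducible run also requires limiting training to a single worker thread with `workers=1` and fixing the `PYTHONHASHSEED` environment variable.)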

In [4]:
## FIT MODEL
# CBOW
model1 = Word2Vec(tokens, min_count = 2, vector_size = 100, 
                  window = 4, sg = 0, seed=486)

# Skip Gram
model2 = Word2Vec(tokens, min_count = 2, vector_size = 100, 
                  window = 4, sg = 1, seed=486)

Finally, we look at the resulting words that are most similar to each name: Jack, Ernest, Algernon, Gwendolen, and Cecily. To get the similarity scores, we use the `most_similar` method from the `wv` (word vector) attribute of the model, which returns the ten closest words by cosine similarity by default. Higher numbers mean the words are more similar to each other or, in the case of the names, more strongly associated with each name. While there isn't a way to quantitatively assess the goodness-of-fit of our model without applying it to a deeper model like a neural network (which also makes it difficult to tune the parameters), we can compare the similar words to the SparkNotes description of each character. The skip-gram model seems to capture these characters better than CBOW.
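For intuition about what a similarity ranking does, the sketch below ranks a handful of hand-made 2-dimensional vectors by cosine similarity to a target word. The vectors, the words, and the function `rank_by_similarity` are all invented for illustration; they are not taken from the fitted models and this is not the `gensim` implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(word, vectors, topn=3):
    """Return the `topn` other words with the highest cosine similarity to `word`."""
    target = vectors[word]
    scored = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical vectors, not output from the models above.
toy_vectors = {
    "jack": [1.0, 0.0],
    "ernest": [0.9, 0.1],
    "uncle": [0.6, 0.8],
    "muffin": [0.0, 1.0],
}
ranking = rank_by_similarity("jack", toy_vectors)
for other, score in ranking:
    print(other, round(score, 3))
```

Words whose vectors point in nearly the same direction as the target rank first, regardless of vector length.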

### Jack Worthing

Jack is seen as responsible and respectable, even as an illegitimate child who was adopted. He lives a double life and is known as Jack at his country estate in Hertfordshire, where he is also Cecily's guardian.

The most similar words from both models portray Jack as pragmatic and responsible (and honestly kind of boring). His association with the word "uncle" comes from his relationship with Cecily. Unsurprisingly, he is also associated with Ernest.

In [5]:
## JACK WORTHING
j1 = pd.DataFrame(model1.wv.most_similar("jack"))
j2 = pd.DataFrame(model2.wv.most_similar("jack"))

j1_styler = j1.style.set_table_attributes("style='display:inline'").set_caption('Jack: CBOW')
j2_styler = j2.style.set_table_attributes("style='display:inline'").set_caption('Jack: Skip Gram')

display_html(j1_styler._repr_html_()+j2_styler._repr_html_(), raw=True)

|   | Jack: CBOW word | CBOW similarity | Jack: Skip Gram word | Skip Gram similarity |
|---|---|---|---|---|
| 0 | first | 0.369038 | first | 0.926982 |
| 1 | publication | 0.357285 | name | 0.923992 |
| 2 | continues | 0.341649 | ernest | 0.919789 |
| 3 | name | 0.334546 | seems | 0.918567 |
| 4 | evening | 0.329508 | never | 0.918149 |
| 5 | friends | 0.324209 | always | 0.916575 |
| 6 | knows | 0.316379 | bunbury | 0.916192 |
| 7 | presents | 0.288889 | fact | 0.915344 |
| 8 | severe | 0.286122 | kind | 0.914766 |
| 9 | suppose | 0.284374 | suppose | 0.914391 |


### Ernest
Ernest is the other half of Jack's double life, in the guise of Jack's black-sheep brother; Jack goes by Ernest when he is in London. No one but Jack knows of this at the beginning of the play.

Perhaps due to Cecily's obsession with Ernest, he is highly associated with her last name. Interestingly, Bunbury is associated with Ernest, probably because they are both fictional personas (Bunbury being an imaginary friend of Algernon). "Brother" and "young" also appear due to his guise as Jack's black-sheep brother.

In [6]:
## ERNEST
e1 = pd.DataFrame(model1.wv.most_similar("ernest"))
e2 = pd.DataFrame(model2.wv.most_similar("ernest"))

e1_styler = e1.style.set_table_attributes("style='display:inline'").set_caption('Ernest: CBOW')
e2_styler = e2.style.set_table_attributes("style='display:inline'").set_caption('Ernest: Skip Gram')

display_html(e1_styler._repr_html_()+e2_styler._repr_html_(), raw=True)

|   | Ernest: CBOW word | CBOW similarity | Ernest: Skip Gram word | Skip Gram similarity |
|---|---|---|---|---|
| 0 | never | 0.563651 | never | 0.981993 |
| 1 | worthing | 0.511773 | always | 0.981425 |
| 2 | seems | 0.504239 | young | 0.980574 |
| 3 | young | 0.500894 | time | 0.979928 |
| 4 | always | 0.500313 | seems | 0.979911 |
| 5 | time | 0.499698 | worthing | 0.979177 |
| 6 | think | 0.492464 | life | 0.978959 |
| 7 | married | 0.478103 | cardew | 0.978706 |
| 8 | bunbury | 0.477365 | bunbury | 0.978341 |
| 9 | tell | 0.473771 | brother | 0.9781 |


### Algernon Moncrieff

Algernon knows Jack as Ernest. He is a bachelor and is described as brilliant, witty, selfish, and amoral. He also has a fictional friend named "Bunbury" who gives him an excuse to escape social obligations.

Some of Algernon's words are similar to Ernest's, and Ernest/Jack's last name of Worthing is associated with him. This could be because Algernon poses as Ernest during the play or because he is close friends with Jack. These words also paint him as a bit more impulsive, which matches his creation of Bunbury (who, oddly, is not associated with him in either model).

In [7]:
## ALGERNON MONCRIEFF
a1 = pd.DataFrame(model1.wv.most_similar("algernon"))
a2 = pd.DataFrame(model2.wv.most_similar("algernon"))

a1_styler = a1.style.set_table_attributes("style='display:inline'").set_caption('Algernon: CBOW')
a2_styler = a2.style.set_table_attributes("style='display:inline'").set_caption('Algernon: Skip Gram')

display_html(a1_styler._repr_html_()+a2_styler._repr_html_(), raw=True)

|   | Algernon: CBOW word | CBOW similarity | Algernon: Skip Gram word | Skip Gram similarity |
|---|---|---|---|---|
| 0 | always | 0.546839 | always | 0.939588 |
| 1 | worthing | 0.5293 | worthing | 0.938941 |
| 2 | course | 0.505571 | course | 0.934334 |
| 3 | little | 0.473174 | little | 0.933379 |
| 4 | make | 0.44419 | young | 0.931614 |
| 5 | go | 0.442553 | go | 0.931404 |
| 6 | young | 0.435717 | make | 0.929424 |
| 7 | suppose | 0.431337 | never | 0.929199 |
| 8 | thousand | 0.42657 | suppose | 0.928784 |
| 9 | women | 0.412364 | seems | 0.927466 |


### Gwendolen Fairfax

Gwendolen knows Jack as Ernest and is in love with him. She is described as sophisticated, intellectual, and pretentious; she keeps up well with high fashion and society.

Most of her associated words have to do with other people (either other characters or people/society in general), which could represent her love of society.

In [8]:
## GWENDOLEN FAIRFAX
g1 = pd.DataFrame(model1.wv.most_similar("gwendolen"))
g2 = pd.DataFrame(model2.wv.most_similar("gwendolen"))

g1_styler = g1.style.set_table_attributes("style='display:inline'").set_caption('Gwendolen: CBOW')
g2_styler = g2.style.set_table_attributes("style='display:inline'").set_caption('Gwendolen: Skip Gram')

display_html(g1_styler._repr_html_()+g2_styler._repr_html_(), raw=True)

|   | Gwendolen: CBOW word | CBOW similarity | Gwendolen: Skip Gram word | Skip Gram similarity |
|---|---|---|---|---|
| 0 | way | 0.49569 | way | 0.952647 |
| 1 | people | 0.41186 | people | 0.949683 |
| 2 | life | 0.385409 | life | 0.94854 |
| 3 | brother | 0.364765 | brother | 0.94635 |
| 4 | guardian | 0.363424 | always | 0.946128 |
| 5 | incomparable | 0.350194 | cardew | 0.945254 |
| 6 | come | 0.350112 | dear | 0.944569 |
| 7 | might | 0.346158 | think | 0.943896 |
| 8 | parents | 0.338225 | young | 0.943563 |
| 9 | careful | 0.336225 | come | 0.941612 |


### Cecily Cardew

Cecily is young and romantic, and she is obsessed with the idea of wickedness despite her own innocence. She has also invented a romance between herself and Ernest, whom she knows only as Jack's brother.

A lot of character names are associated with her, and with words like "man", "age" (there is talk about her coming of age in the play), "dear", and "brother", the model seems to capture her imagined romance with Ernest.

In [9]:
## CECILY CARDEW
c1 = pd.DataFrame(model1.wv.most_similar("cecily"))
c2 = pd.DataFrame(model2.wv.most_similar("cecily"))

c1_styler = c1.style.set_table_attributes("style='display:inline'").set_caption('Cecily: CBOW')
c2_styler = c2.style.set_table_attributes("style='display:inline'").set_caption('Cecily: Skip Gram')

display_html(c1_styler._repr_html_()+c2_styler._repr_html_(), raw=True)

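|   | Cecily: CBOW word | CBOW similarity | Cecily: Skip Gram word | Skip Gram similarity |
|---|---|---|---|---|
| 0 | age | 0.445093 | man | 0.959906 |
| 1 | man | 0.391717 | dear | 0.95969 |
| 2 | last | 0.388287 | age | 0.958258 |
| 3 | dear | 0.37753 | brother | 0.957014 |
| 4 | augusta | 0.366802 | cardew | 0.956991 |
| 5 | people | 0.361997 | seems | 0.956848 |
| 6 | suddenly | 0.355639 | think | 0.956688 |
| 7 | john | 0.351888 | people | 0.956625 |
| 8 | week | 0.351829 | life | 0.956161 |
| 9 | place | 0.34462 | last | 0.955432 |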


## Conclusion
This was a small example of Word2Vec without applying the resulting word embeddings to another model. As such, it is difficult to assess goodness-of-fit, but it is fun to explore. Word2Vec has broader applications when used for feature engineering in deeper neural networks. By capturing which words are commonly associated with each other, the results of Word2Vec can then be used to generate text (see which word is most likely based on prior words), correct spelling mistakes (or identify the intended word) that would otherwise be difficult to find using regular expressions, and find analogies. The examples above deal only with positive associations (Ernest = brother), but another common use combines positive and negative semantic associations to solve analogies (king (positive) - man (negative) + woman (positive) ≈ queen). Because this method of text processing is still relatively new, there is much to which it can still be applied.
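The positive/negative analogy idea can be illustrated with hand-constructed vectors. In this sketch the two dimensions are assumed to mean "royalty" and "masculine versus feminine", which is an invention for clarity, since real embedding dimensions are not interpretable like this. The helper `solve_analogy` is hypothetical; with a fitted `gensim` model, the comparable call is `model.wv.most_similar(positive=[...], negative=[...])`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-made toy vectors: (royalty, masculine-vs-feminine). Purely illustrative.
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "duke": [0.8, 1.0],
    "girl": [-0.5, -1.0],
}

def solve_analogy(positive, negative, vectors):
    """Add the positive vectors, subtract the negative ones, return the closest other word."""
    size = len(next(iter(vectors.values())))
    query = [0.0] * size
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    excluded = set(positive) | set(negative)  # never return one of the input words
    candidates = [w for w in vectors if w not in excluded]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

answer = solve_analogy(positive=["king", "woman"], negative=["man"], vectors=toy_vectors)
print(answer)  # queen
```

Here king - man + woman lands exactly on the toy "queen" vector; with learned embeddings the match is only approximate.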

## Sources
Seminal Papers
- https://arxiv.org/pdf/1301.3781.pdf
- https://arxiv.org/pdf/1310.4546.pdf

Explanations of Background and Theory
- https://code.google.com/archive/p/word2vec/
- https://jalammar.github.io/illustrated-word2vec/
- https://towardsdatascience.com/word2vec-explained-49c52b4ccb71
- https://wiki.pathmind.com/word2vec
- https://www.analyticsvidhya.com/blog/2021/07/word2vec-for-word-embeddings-a-beginners-guide/

Code Help
- https://www.geeksforgeeks.org/python-word-embedding-using-word2vec/
- https://towardsdatascience.com/word2vec-explained-49c52b4ccb71
- https://www.analyticsvidhya.com/blog/2021/07/word2vec-for-word-embeddings-a-beginners-guide/
- https://radimrehurek.com/gensim/models/word2vec.html