[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Humboldt-WI/adams/blob/master/demos/nlp/word-2-vec.ipynb)

# Word Embeddings and Word-to-Vec (W2V)
This demo notebook revisits the lecture on word embeddings and Google's word-to-vec algorithm. W2V, like backpropagation, is a very popular algorithm that enjoys much coverage in various blogs, YouTube channels, etc. In case you would like some additional material to read up on W2V, here are some useful resources:
- [the original W2V paper](https://proceedings.neurips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf)
- the beautiful ["Illustrated Word2vec" by Jay Alammar](https://jalammar.github.io/illustrated-word2vec/)
- the [W2V Tensorflow tutorial](https://www.tensorflow.org/tutorials/text/word2vec)

Last but not least, our main textbook features excellent chapters on word embeddings, W2V, and related algorithms including GloVe and Fasttext. You can find those parts in [Section 14 of Dive into Deep Learning](http://d2l.ai/chapter_natural-language-processing-pretraining/index.html).

Let's get started with our ADAMS demo.

## Training word-to-vec embeddings
When it comes to embeddings, the most common use case is to **download pre-trained embeddings** and employ these for some downstream tasks (with or without fine-tuning). The Keras *embedding layer* supports that use case very well, as we will see in a future demo on sentiment analysis. Since this demo aims at deepening our understanding of W2V, we focus on a different use case and demonstrate the training of **custom word embeddings** using our IMDB data.

You could argue that the IMDB forum exhibits a specific type of speech or jargon, and that this justifies training word embeddings for this specific corpus. In practice, using pre-trained embeddings will almost surely give better results than training embeddings from zero. However, without going into too much detail of the pros and cons of pre-training your own embeddings versus employing pre-trained embeddings, perhaps with some finetuning, the point of this section is simply to showcase how you could train from scratch if you want to. To that end, we will use a library called `Gensim`. 

`Gensim` is a popular library for text processing. Although maybe even more geared toward topic modeling, it offers, among others, implementations of several algorithms to learn word embeddings including *W2V* and *Fasttext* (pre-trained *GloVe* vectors can be loaded but, to our knowledge, not trained with Gensim). We demonstrate training W2V embeddings using our cleaned IMDB movie review data set. Before moving on, make sure to have installed `Gensim`.

**Credits and disclaimers**: many of the examples you are going to see in this section have been inspired by this very nice [Kaggle post](https://www.kaggle.com/code/pierremegret/gensim-word2vec-tutorial/notebook).

In [98]:
# Create a global variable to indicate whether the notebook is run in Colab
import sys
import numpy as np
import pandas as pd

IN_COLAB = 'google.colab' in sys.modules

# Configure variables pointing to directories and stored files 
if IN_COLAB:
    # Mount Google-Drive
    from google.colab import drive
    drive.mount('/content/drive')
    DATA_DIR = '/content/drive/My Drive/'  # adjust to Google drive folder with the data if applicable
else:
    DATA_DIR = './' # adjust to the directory where data is stored on your machine (if running the notebook locally)

sys.path.append(DATA_DIR)

CLEAN_REVIEW = DATA_DIR + 'imdb_clean_full_v2.pkl'   # List with tokenized reviews after standard NLP preparation
IMBD_EMBEDDINGS = DATA_DIR + 'w2v_imdb_full_d100_e500.model'

### Recap W2V
Let's quickly revisit the principles of W2V. Please consult the paper of [Mikolov et al. (2013)](https://proceedings.neurips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf) for a detailed description.

W2V establishes a word's meaning by the words that frequently appear close-by (distributional semantics). More specifically, the context of a word consists of the words that appear next to it within a pre-defined window (let's say 5 words).

 - the quality of *air* in mainland China has been decreasing since..
 - doctors claim the *air* you breathe defines the overall wellbeing...
 - the currents of hot *air* have been bursting from underground
 - the mountain *air* was crystal clean and filled with ..
 - in case of *air* supply shortages, the submarine will..

Taking the word *air* as our **target word**, the words around *air*, called context words, define the **meaning** of the word *air* in W2V.

![w2vprocess](w2v.jpg)
<br>
inspired by https://www.youtube.com/watch?v=BD8wPsr_DAI
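
To make the notion of a context window concrete, here is a minimal sketch that enumerates the (target, context) pairs W2V would see for one of the example sentences. The function name `context_pairs` and the window size of 2 are our own choices for illustration; the actual W2V implementation additionally subsamples frequent words and varies the effective window size.

```python
def context_pairs(tokens, window=2):
    """Yield (target, context) pairs for every token within `window` positions."""
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                yield target, tokens[j]

sentence = "the mountain air was crystal clean".split()
pairs = list(context_pairs(sentence, window=2))

# Context words observed for the target word "air"
air_context = [context for target, context in pairs if target == "air"]
print(air_context)  # ['the', 'mountain', 'was', 'crystal']
```

Collecting such pairs over the whole corpus gives the training examples from which the embeddings of target and context words are learned.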

### Loading the data
We load the data frame with the original and cleaned reviews. The original version does not matter for this session, so we will delete it to save memory.

In [47]:
import pickle
with open(CLEAN_REVIEW,'rb') as path_name:
    df = pickle.load(path_name)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   review        50000 non-null  object
 1   sentiment     50000 non-null  object
 2   review_clean  50000 non-null  object
dtypes: object(3)
memory usage: 1.1+ MB


In [48]:
df.drop(labels="review", axis=1, inplace=True)
df.head()

| | sentiment | review_clean |
|---|---|---|
| 0 | positive | one reviewer mention watch oz episode hooked r... |
| 1 | positive | wonderful little production film technique una... |
| 2 | positive | thought wonderful way spend time hot summer we... |
| 3 | negative | basically family little boy jake think zombie ... |
| 4 | positive | petter love time money visually stun film watc... |


### The Gensim W2V model
Training word embeddings using `Gensim` is very easy and just a matter of calling a function. The reason it takes so little code is that we have already cleaned our data and have it available as an array of texts, a format that `Gensim` supports. However, note that, depending on your data, the code may take quite a while to run. Word embeddings trained on the full 50K data set for 500 epochs are available in our course folder.

Gensim is built for scalability. If we used a large corpus, it would not be practical to first load all data from disk into the computer's main memory and then process it document by document. Instead, it would be much more scalable to stream the data from disk. Long story short, we need a bit of infrastructure to feed our review data set, which, for simplicity, we keep in a data frame, into Gensim in the format it expects. To that end, we build a little helper class that facilitates streaming reviews from our data frame. It would be easy to extend the helper class so as to facilitate streaming reviews from disk, or support both options. The [Gensim documentation](https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#training-your-own-model) provides an example.

In [169]:
# Helper class to input reviews from our data frame into Gensim
from gensim import utils

class CleanReviews:
    """An iterator that yields sentences (lists of str)."""
    
    def __init__(self, reviews):
        self.reviews = reviews
        
    def __iter__(self):
        for line in self.reviews:
            yield utils.simple_preprocess(line)
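
The helper above streams from a data frame that already sits in memory. As a rough sketch of the disk-based variant mentioned earlier, the class below reads one review per line from a text file. The class name `ReviewsFromFile` and the one-review-per-line file layout are assumptions made for this illustration; they are not part of the course material.

```python
import os
import tempfile

class ReviewsFromFile:
    """Re-iterable corpus that streams one whitespace-tokenized review per line."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Opening the file inside __iter__ allows several passes over the corpus
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()
                if tokens:  # skip empty lines
                    yield tokens

# Tiny demo file standing in for a large corpus on disk
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("wonderful little production\nbad movie waste time\n")
    demo_path = tmp.name

corpus = ReviewsFromFile(demo_path)
first_pass = list(corpus)
second_pass = list(corpus)  # a second pass yields the same reviews again
os.remove(demo_path)
print(first_pass)
```

The corpus has to be re-iterable, not a one-shot generator, because Gensim scans it once to build the vocabulary and then once more per training epoch.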

And here is the simple call to the function `Word2Vec` that trains our custom word embeddings.

In [50]:
# CAUTION: Running the code might take a while
from gensim.models import Word2Vec    

emb_dim = 10  # embedding dimension, we use 10 for a quick demo of the code
reviews = CleanReviews(df.review_clean)
# Train a Word2Vec model
model = Word2Vec(sentences=reviews, 
                 min_count=5,  # ignore words that occur fewer than min_count times in the corpus
                 window=5,     # size of the context window
                 epochs=5,     # kept at 5 to reduce runtime; would be much larger in practice
                 vector_size=emb_dim,  # size of embedding
                 workers=2)    # for parallel computing

Make sure to check out the docstring of the `Word2Vec` function to discover how word vectors are trained by default. Importantly, the argument `sg` lets you choose between *skip-gram* and *cbow*. Other concepts we discussed in the lecture include accelerating computations using *hierarchical softmax* and *negative sampling*. Gensim features these through its arguments `hs` and `negative`, respectively. Obviously, tons of other functionality is available, so make sure to study the [documentation](https://radimrehurek.com/gensim/models/word2vec.html?highlight=word2vec) if you plan to use the Gensim library for serious projects. Also, just to remind you, the [Kaggle post](https://www.kaggle.com/code/pierremegret/gensim-word2vec-tutorial/notebook), which inspired this notebook, has a slightly more elaborate demo of how to set up training and, specifically, how you can break down the individual steps of W2V training into smaller pieces.

The trained word vectors are accessible through the field `wv` of the model class.

In [56]:
# what is the word vector of the words good and bad?
print(model.wv['good'])

[ 1.9756973  1.1006564  2.1407259 -2.7641182  4.1969585 -3.8379216
 -1.5564842  1.7839793 -1.1708127  1.1585494]


In [55]:
print(model.wv['bad'])

[ 0.03799623  1.7121277   3.5084138  -3.0625455   2.6999998  -5.3605113
 -1.6722724   3.313875   -0.6442533  -1.1794131 ]


In [51]:
len(model.wv.key_to_index)  # how many word vectors have been trained

30201
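
The size of the vocabulary depends directly on `min_count`. The following minimal sketch mimics that filtering step on a made-up toy corpus; the function `build_vocab` is a hypothetical helper for illustration and not how Gensim implements vocabulary building internally.

```python
from collections import Counter

def build_vocab(sentences, min_count):
    """Keep only the tokens that occur at least `min_count` times in the corpus."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, freq in counts.items() if freq >= min_count}

toy_sentences = [
    ["great", "movie", "great", "cast"],
    ["bad", "movie", "weak", "plot"],
    ["great", "plot"],
]

print(sorted(build_vocab(toy_sentences, min_count=1)))
# ['bad', 'cast', 'great', 'movie', 'plot', 'weak']
print(sorted(build_vocab(toy_sentences, min_count=2)))
# ['great', 'movie', 'plot']
```

Raising `min_count` removes rare words, for which there is too little context to learn a reliable vector anyway, and shrinks the model accordingly.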

We will continue playing with word vectors shortly, but let us first discuss input and output handling with Gensim.

### Input / output handling
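Gensim supports saving and loading of trained embeddings in different formats. This makes a lot of sense since training can take a long time. For example, you could train for a couple of epochs, then store your results on disk, and then continue training. Note that continuing training requires the full model, which you can store with `model.save()` and restore with `Word2Vec.load()`; the word2vec format used in the next cell keeps only the word vectors. Here is how we can store our trained word vectors.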

In [57]:
# Save trained word vectors to disk
file="w2v_tmp.model"
save_as_bin = False
model.wv.save_word2vec_format(file, binary=save_as_bin)  # set binary to True to save disk space; false facilitates inspecting the embeddings in a text editor
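
With `binary=False`, the file is plain text: a header line with the vocabulary size and the embedding dimension, followed by one line per word with the word and its vector components. The sketch below parses that layout by hand to show what is inside such a file. The function `read_word2vec_text` and the two-word sample are made up for illustration; in practice you would simply use `KeyedVectors.load_word2vec_format`.

```python
import io

def read_word2vec_text(handle):
    """Parse the plain-text word2vec format into a dict of word -> list of floats."""
    vocab_size, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(v) for v in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = values
    if len(vectors) != vocab_size:
        raise ValueError("header and body disagree on the vocabulary size")
    return vectors

# Made-up two-word file with three dimensions, standing in for w2v_tmp.model
sample = io.StringIO("2 3\ngood 0.1 0.2 0.3\nbad -0.1 0.0 0.4\n")
vectors = read_word2vec_text(sample)
print(vectors["bad"])  # [-0.1, 0.0, 0.4]
```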

For ADAMS, you can obtain word vectors trained on the IMDB corpus for 500 epochs from our [GitHub repository](https://github.com/Humboldt-WI/adams/tree/master/demos/nlp). These vectors are far from comparable to real pre-trained W2V embeddings. On the other hand, their training took a couple of hours, so the vectors should carry a bit more information compared to just running the above training code with a small embedding dimension of ten and training for only five epochs. Let's showcase how we can load these word vectors.

In [58]:
# Load model from disk
from gensim.models import KeyedVectors
w2v = KeyedVectors.load_word2vec_format(IMBD_EMBEDDINGS, binary=False)

Remember that you can also access the `KeyedVectors`, which we load with the previous statement, directly from a trained model object via the field `wv`. Thus, if you would like to run the following demos with the word vectors you trained yourself, simply uncomment and run the following command. One would expect that the demos give nicer results with the pre-trained embeddings from our repo, but you are welcome to try this out yourself.

In [65]:
# w2v = model.wv  # continue with the W2V embeddings trained above 

### Playing with embeddings
Again, the embeddings loaded above are far from solid but should give us somewhat meaningful results in algebraic comparisons. Let's see whether this works out.

#### Which word is most similar to another word?

In [60]:
w2v.most_similar(positive=['movie'])

[('least', 0.9374186396598816),
 ('probably', 0.9304987192153931),
 ('still', 0.9181867837905884),
 ('ever', 0.9113300442695618),
 ('definitely', 0.9088698625564575),
 ('even', 0.9075060486793518),
 ('honestly', 0.9061371088027954),
 ('lately', 0.8953308463096619),
 ('actually', 0.8867319226264954),
 ('watch', 0.8728312253952026)]

#### How similar are two words?

In [61]:
w2v.similarity('good', 'great')

0.86766267

In [62]:
print('How similar is Tarantino to Spielberg: {}'.format(w2v.similarity('tarantino', 'spielberg')))
print('How similar is Lucas to Spielberg: {}'.format(w2v.similarity('lucas', 'spielberg')))

print('How similar is Paltrow to Bullock: {}'.format(w2v.similarity('paltrow', 'bullock')))
print('How similar is Paltrow to Alba: {}'.format(w2v.similarity('paltrow', 'alba')))

print('How similar is Cruise to Depp: {}'.format(w2v.similarity('cruise', 'depp')))
print('How similar is Cruise to Willis: {}'.format(w2v.similarity('cruise', 'willis')))


How similar is Tarantino to Spielberg: 0.8217860460281372
How similar is Lucas to Spielberg: 0.7440744042396545
How similar is Paltrow to Bullock: 0.7634718418121338
How similar is Paltrow to Alba: 0.8613905310630798
How similar is Cruise to Depp: 0.5666610598564148
How similar is Cruise to Willis: 0.6183159947395325
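
The scores returned by `similarity` are cosine similarities between the two word vectors. Here is a minimal NumPy sketch of that computation; the three toy vectors are made up and only serve to show the two extreme cases.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]   # same direction as a, different length
c = [0.0, 0.0, 3.0]   # orthogonal to a

print(cosine_similarity(a, b))  # close to 1: same direction
print(cosine_similarity(a, c))  # 0: no overlap
```

Because the measure ignores vector length, two words count as similar when their vectors point in a similar direction, regardless of their magnitudes.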


#### Which word does not fit in?

In [63]:
print(w2v.doesnt_match(['cool', 'great', 'lovely', 'weak']))
print(w2v.doesnt_match(['movie', 'film', 'good']))

weak
good
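
Roughly speaking, `doesnt_match` averages the length-normalized vectors of all given words and returns the word whose vector is least similar to that average. The sketch below reproduces this idea with NumPy. The function name `odd_one_out` and the two-dimensional toy vectors are made up for illustration, so treat it as an approximation of the logic and not as Gensim's exact implementation.

```python
import numpy as np

def odd_one_out(words, vectors):
    """Return the word that is least similar to the mean of all unit-length vectors."""
    unit = np.array([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    mean = unit.mean(axis=0)
    mean /= np.linalg.norm(mean)
    sims = unit @ mean          # cosine similarity of each word to the mean direction
    return words[int(np.argmin(sims))]

# Made-up 2-d vectors: three words point in a similar direction, one does not
toy_vectors = {
    "cool":   np.array([0.9, 0.1]),
    "great":  np.array([1.0, 0.2]),
    "lovely": np.array([0.8, 0.3]),
    "weak":   np.array([-0.7, 0.6]),
}
print(odd_one_out(["cool", "great", "lovely", "weak"], toy_vectors))  # weak
```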


#### A is to B as C is to ? 

In [64]:
w2v.most_similar(positive=['spielberg', 'woman'], negative=['man'], topn=5)

[('deserves', 0.9519431591033936),
 ('deserve', 0.8851733803749084),
 ('thanks', 0.8781426548957825),
 ('surpass', 0.867881178855896),
 ('qualify', 0.8677225708961487)]
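
The analogy query boils down to vector arithmetic: add the vectors of the positive words, subtract those of the negative words, and look for the nearest neighbors of the result, excluding the query words themselves. The NumPy sketch below illustrates the idea with the classic king/queen example. The function `analogy` and the two-dimensional toy vectors (first axis "royalty", second axis "gender") are made up, and the code only approximates what `most_similar` does.

```python
import numpy as np

def analogy(positive, negative, vectors, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    unit = {w: v / np.linalg.norm(v) for w, v in vectors.items()}
    query = sum(unit[w] for w in positive) - sum(unit[w] for w in negative)
    query = query / np.linalg.norm(query)
    candidates = [(w, float(u @ query)) for w, u in unit.items()
                  if w not in positive and w not in negative]  # skip the query words
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors: axis 0 encodes "royalty", axis 1 encodes "gender"
toy_vectors = {
    "king":   np.array([1.0, 1.0]),
    "queen":  np.array([1.0, -1.0]),
    "prince": np.array([0.9, 0.8]),
    "man":    np.array([0.1, 1.0]),
    "woman":  np.array([0.1, -1.0]),
}

# king is to man as ? is to woman
best_word, best_score = analogy(["king", "woman"], ["man"], toy_vectors)[0]
print(best_word)  # queen
```

With real embeddings, such clean results require a large and well-prepared training corpus, which explains why the query on our IMDB vectors returns rather arbitrary words.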

### Phrase detection
W2V trains one embedding per word. The model is agnostic of common phrases such as 'New York'. It would train one embedding for *new* and another for *york*, provided both words are part of the vocabulary. You can get better embeddings by adding common phrases to the vocabulary. W2V will then train individual embeddings for these phrases. Gensim also comes with a phrase detection model, which allows you to handle bigrams, trigrams, and the like. We will not retrain our W2V model but sketch how you can use Gensim to get these common phrases. You could then consider adding (some of) them to your vocabulary to enhance the model.

In [30]:
from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS
# Train a bigram model
bigram_model = Phrases(sentences=reviews, min_count=10, threshold=1, connector_words=ENGLISH_CONNECTOR_WORDS)

After training, we can take text and put it through the bigram model. The model will then alter the text so as to introduce bigrams. Here is an example:

In [38]:
# to process text and replace phrases, we use our phrase detector as follows
bigram_model['I', 'like', 'this', 'movie']  # no phrases to be detected here

['I', 'like', 'this', 'movie']

In [45]:
bigram_model['sex', 'and', 'the', 'city', 'is', 'all', 'about', 'new', 'york']  # but we would expect city names to be detected

['sex', 'and', 'the', 'city', 'is', 'all', 'about', 'new_york']
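
Whether two adjacent words are merged depends on a score that is compared against `threshold`. To our understanding, Gensim's default scorer follows the formula from the original W2V paper, which rewards pairs that co-occur more often than their individual frequencies suggest. The sketch below implements this formula; the function name `phrase_score` and all counts are made up for illustration.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score of the bigram 'a b': (count(ab) - min_count) / (count(a) * count(b)) * vocab_size."""
    return (count_ab - min_count) / (count_a * count_b) * vocab_size

threshold = 1
vocab_size = 30000  # hypothetical vocabulary size

# 'new' and 'york' are fairly rare but almost always occur together
new_york = phrase_score(count_a=500, count_b=200, count_ab=180,
                        vocab_size=vocab_size, min_count=10)

# 'good' and 'movie' are both very frequent, so their co-occurrence is less surprising
good_movie = phrase_score(count_a=20000, count_b=40000, count_ab=1634,
                          vocab_size=vocab_size, min_count=10)

print(new_york > threshold, good_movie > threshold)  # True False
```

The example shows that raw co-occurrence frequency alone does not make a phrase: a frequent pair of frequent words can still receive a low score.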

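We can also make use of Python's `Counter` class to examine the most common bigram candidates in the corpus. Note that the `vocab` attribute of the phrase model stores counts for single words and for adjacent word pairs, not only for those pairs that pass the threshold: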

In [42]:
import collections
bigram_counter = collections.Counter()
for key in bigram_model.vocab.keys():
    if '_' in key:  # bigram candidates are stored as strings joined by an underscore
        bigram_counter[key] += bigram_model.vocab[key]

In [43]:
bigram_counter.most_common(25)

[('look_like', 3715),
 ('watch_movie', 3121),
 ('ever_see', 2973),
 ('see_movie', 2752),
 ('bad_movie', 2727),
 ('make_movie', 2392),
 ('year_old', 2389),
 ('film_make', 2369),
 ('special_effect', 2308),
 ('movie_make', 2134),
 ('one_best', 2030),
 ('even_though', 1999),
 ('movie_ever', 1987),
 ('movie_like', 1921),
 ('low_budget', 1892),
 ('make_film', 1882),
 ('see_film', 1859),
 ('main_character', 1838),
 ('waste_time', 1793),
 ('watch_film', 1664),
 ('good_movie', 1634),
 ('horror_movie', 1611),
 ('much_well', 1532),
 ('want_see', 1494),
 ('seem_like', 1473)]

The above bigrams might be frequent. However, you would not consider training individual embeddings for phrases such as *look_like* or *waste_time*. This shows how proper phrase detection in the scope of W2V is nontrivial and would require more work before we can hope to get good results.     

### Plotting word vectors
It is fairly easy to create a visualization of the trained word vectors. You can find an example of how to do this in the [Kaggle kernel](https://www.kaggle.com/pierremegret/gensim-word2vec-tutorial) mentioned above. Needless to say, many alternative demos are available online; here is just [one example](https://www.kaggle.com/jeffd23/visualizing-word-vectors-with-t-sne). However, to get meaningful results we would need to prepare the data more carefully by, for example, removing too frequent words and too infrequent words. We would also finetune the training, and, overall, invest a lot more work to craft our word embeddings. In practice, we would typically not train our own embeddings from scratch. Instead, we would download pre-trained embeddings, which are available in many flavors (multiple languages, trained on different corpora with different jargon, etc.), and use these in our NLP application. We could also finetune the pre-trained embeddings using our own text data. We will showcase a corresponding approach in a later notebook on sentiment analysis. 
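
As a minimal sketch of the first step of such a visualization, the code below projects word vectors onto two dimensions using PCA, implemented with NumPy's SVD. Note that the linked examples use t-SNE; PCA is swapped in here only because it is short and deterministic. The function `pca_2d` and the random stand-in vectors are made up for illustration. With real embeddings you could pass a matrix of selected word vectors, e.g. stacked from `w2v[word]`.

```python
import numpy as np

def pca_2d(matrix):
    """Project the rows of `matrix` onto their first two principal components."""
    centered = matrix - matrix.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(6, 10))   # six made-up 10-dimensional "word vectors"
coords = pca_2d(toy_vectors)
print(coords.shape)  # (6, 2)
```

The resulting two columns can be handed to any plotting library as x and y coordinates, with the corresponding words as point labels.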