# Word embeddings: Word2Vec, GloVe, and FastText

Practical course material for the ASDM Class 09 (Text Mining) by Florian Leitner.

© 2017 Florian Leitner. All rights reserved.

## Introduction

This lab will look into the three most common algorithms to generate word embeddings.

First, we need to make sure we have all the tools on board. It might not be possible for you to install all three requirements, but you should at least be able to get Word2Vec to work, as it is based on `gensim`, which we used yesterday.

In [1]:
import fasttext # requires cython
import gensim # for Word2Vec
import glove # package name: glove_python; only easy to install on Linux

from IPython.display import HTML # show text as HTML

Installation notes for [**`fasttext`**](https://github.com/salestock/fastText.py): You will need to have [Cython](http://cython.org/) installed on your machine (e.g., with Homebrew: `brew install cython`). Attention: you will need to **`unset`** the `$CXX` and `$CC` environment variables (see below) used for `glove_python` to install `fasttext` (or first install `fasttext`).

Installation notes for [**`glove_python`**](https://github.com/maciejkula/glove-python) on OSX: You will need to have [Homebrew](https://brew.sh/) installed, and will need a new C/C++ compiler, as Apple's `clang` does not support the required features:

```sh
brew ls gcc
# ONLY if no gcc is installed:
brew install gcc --without-multilib
# Replace VERSION with the gcc version Homebrew installed (e.g., 7.1.0):
export CXX=/usr/local/Cellar/gcc/VERSION/bin/g++-7
export CC=/usr/local/Cellar/gcc/VERSION/bin/gcc-7
$YOUR_PY_PKG_MANAGER install glove_python
```

## Data preparation

We will be using `text8`, a standard text file for evaluating text compression algorithms. It turns out to be sufficiently large (or small...) to be very practical for a mini-evaluation of text embeddings. Do note, though, that "real" text corpora for learning embeddings should be in the GB size range.

First, make sure you have the `text8` file available or, if not, download and unzip it into the directory this notebook is in:

In [2]:
import os.path
if not os.path.isfile('text8'):
    !wget -c http://mattmahoney.net/dc/text8.zip
    !unzip text8.zip

### Collocation detection

This is a step you might not see in every word embedding tutorial. However, if you closely read papers by Mikolov and other "pioneers" of word embeddings, you will see that they almost always **make sure to use this preprocessing step**. As homework, try building word embeddings with and without collocations. If the text corpus is big enough (it possibly needs to be larger than `text8`), you will be able to observe significant performance differences. More obviously, you cannot find an embedding for a collocation (e.g., "New York") without feeding the actual collocation (instead of the individual tokens) to the model in the first place.
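To give a feeling for what "detecting collocations" means, here is a minimal sketch of the bigram score proposed by Mikolov et al., `(count(a b) - delta) / (count(a) * count(b))`. The function name `score_bigrams`, the toy sentence, and the choice `delta=1` are assumptions made for illustration only; this is not `gensim`'s implementation, which differs in its details.

```python
from collections import Counter

def score_bigrams(tokens, delta=1):
    """Score adjacent token pairs: (count(a b) - delta) / (count(a) * count(b))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {
        pair: (n - delta) / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, n in bigrams.items()
    }

# Invented toy text; "new york" is the only pair that occurs more than once.
toy_tokens = "we love new york and new york loves the new idea of the day".split()
scores = score_bigrams(toy_tokens)
best_pair = max(scores, key=scores.get)
print(best_pair, scores[best_pair])
```

Pairs scoring above some threshold would then be joined into a single token. The discount `delta` keeps pairs that were seen only once from scoring highly.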

Only if not yet done: open the file, read the tokens, detect all phrases (a.k.a. collocations, idioms), and write the text back out, joining the tokens of each collocation with an underscore and joining the resulting tokens with spaces (again):

In [3]:
with open('text8') as f:
    tokens = f.read().split()

print("raw n. tokens =", len(tokens))

from gensim.models import Phrases
from gensim.models.phrases import Phraser

# NB: Phrases.__init__ expects sentences
collocations = Phraser(Phrases([tokens]))
phrases = collocations[tokens]

print("collocated n. tokens =", len(phrases))

with open('text8_collocations', 'wt') as f:
    f.write(" ".join(phrases))

print()
HTML(" ".join(phrases[:100]))

raw n. tokens = 17005207
collocated n. tokens = 15682363

anarchism originated as a term of abuse first used against early working_class radicals including the diggers of the english revolution and the sans_culottes of the french_revolution whilst the term is still used in a pejorative way to describe any act that used violent means to destroy the organization of society it has also been taken_up as a positive label by self defined anarchists the word anarchism is derived_from the greek without archons ruler chief king anarchism as a political_philosophy is the belief_that rulers are unnecessary and should_be abolished although there_are differing interpretations of what this means anarchism also refers_to related


<IPython.core.display.HTML object>

If you've done the above step already, you can simply reload the collocation corpus:

In [4]:
with open('text8_collocations') as f:
    phrases = f.read().split()
    
HTML(" ".join(phrases[:100]))

In [5]:
def text8_to_sentences(tokens):
    """The models insist on sentences; Let's build some."""
    index = 0
    inc = 200
    
    while index + inc < len(tokens):
        yield tokens[index:index+inc]
        index += inc
        
    yield tokens[index:]

sentences = list(text8_to_sentences(phrases))
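To see what the chunking above produces, here is a small sketch on a toy token list. The function `chunk_tokens` is a copy of `text8_to_sentences` with the chunk size exposed as a hypothetical `inc` parameter, so that it can be run on its own.

```python
def chunk_tokens(tokens, inc=200):
    """Split a flat token list into pseudo-sentences of at most `inc` tokens."""
    index = 0

    while index + inc < len(tokens):
        yield tokens[index:index + inc]
        index += inc

    # The remainder (at most `inc` tokens) forms the last pseudo-sentence.
    yield tokens[index:]

toy = ["tok%d" % i for i in range(450)]
chunk_lengths = [len(chunk) for chunk in chunk_tokens(toy)]
print(chunk_lengths)  # [200, 200, 50]
```

Keep in mind that these "sentences" are arbitrary 200-token windows, because `text8` has no sentence boundaries; context windows near a chunk border are simply cut off.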

## Model training

Now that we have detected the collocations in the corpus, it is a good time to train the actual embeddings.

### [FastText](https://github.com/salestock/fastText.py#api-documentation)

Training a FastText model against the text8 text file (around 3 minutes on my MBP with four cores; also note that FastText has by a **long** shot the most professional C[++] implementation, so with fewer cores, training times might increase substantially):

In [6]:
%time ft_sg = fasttext.skipgram('text8_collocations', 'ft-sg') # input/output files

CPU times: user 20min 24s, sys: 9.93 s, total: 20min 33s
Wall time: 3min 14s


In [7]:
print("vocabulary size =", len(ft_sg.words))

vocabulary size = 111321


In [8]:
vec = ft_sg['french_revolution']
print('french_revolution', len(vec), vec)

french_revolution 100 [-0.41188904643058777, 0.0858849585056305, -0.3490246534347534, -0.1721481829881668, 0.06440900266170502, 0.23033082485198975, -0.21849319338798523, 0.4249664843082428, 0.5965721607208252, -0.3434455990791321, 0.16860421001911163, 0.7628859877586365, -0.024204956367611885, 0.3768453896045685, -0.28089678287506104, -0.14310553669929504, 0.164326474070549, -0.3946954309940338, 0.29132193326950073, -0.5946333408355713, 0.615812361240387, -0.952389657497406, 0.6845676898956299, 0.9103122353553772, 0.08443176746368408, 0.6040365695953369, -0.6752661466598511, -0.014642216265201569, -0.5868124961853027, -0.2651601731777191, -0.012251163832843304, 0.18524271249771118, -0.17612957954406738, 0.3155304789543152, 0.279428631067276, -0.2875252366065979, -0.22796036303043365, 0.17090444266796112, -0.3238986134529114, -0.4470534026622772, 0.6921630501747131, 0.20638443529605865, -0.2862219214439392, -0.2007114440202713, 0.12906454503536224, -0.42697200179100037, -0.033054064959

Wrapping the fasttext model with the `gensim` API.

Note that `gensim` provides API facilities to [train directly with the FastText binary](http://radimrehurek.com/gensim/models/wrappers/fasttext.html), too. Also, the Python `fasttext` module provides a slightly different (document classification-oriented) API than the `gensim` and `glove_python` (word similarity-oriented) API. As homework, you can figure out how to build the FT model with the `gensim` API.

In [9]:
from gensim.models import KeyedVectors

fasttext_model = KeyedVectors.load_word2vec_format('ft-sg.vec')
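The `.vec` file loaded above is in the plain-text word2vec format: a header line with the vocabulary size and the number of dimensions, followed by one line per word with its vector components. The reader below is a minimal sketch of that layout; `read_word2vec_text` and the two-word toy file are made up for illustration, and for real files you should keep using `KeyedVectors.load_word2vec_format`.

```python
import io

def read_word2vec_text(handle):
    """Parse the word2vec text format: a 'count dims' header, then one word per line."""
    n_words, dims = (int(x) for x in handle.readline().split())
    vectors = {}

    for line in handle:
        word, *values = line.rstrip().split(" ")
        vectors[word] = [float(v) for v in values]
        assert len(vectors[word]) == dims, word

    assert len(vectors) == n_words
    return vectors

# Invented toy content standing in for a real .vec file.
toy_file = io.StringIO("2 3\nking 0.1 0.2 0.3\nqueen 0.2 0.1 0.4\n")
toy_vectors = read_word2vec_text(toy_file)
print(toy_vectors["queen"])
```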

### [GloVe](https://github.com/maciejkula/glove-python)

In [10]:
from glove import Glove
from glove import Corpus

Training a GloVe model against the text8 corpus (around 2 minutes on my MBP with four cores; note that GloVe has, by an equally long shot, probably the worst C[++] implementation of the three algorithms):

In [26]:
%%time

glove_corpus = Corpus()
# NB: Corpus.fit expects sentences
glove_corpus.fit(sentences, window=5) # ft & w2v use 5, but GloVe defaults to 10
glove_model = Glove(no_components=100) # default is 30, but ft & w2v use 100 dims
glove_model.fit(glove_corpus.matrix, no_threads=4) # as many threads as you have cores
glove_model.add_dictionary(glove_corpus.dictionary)
glove_model.save('glove.bin')

CPU times: user 4min 20s, sys: 6 s, total: 4min 26s
Wall time: 2min 6s
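Unlike the other two models, GloVe is trained on a word-word co-occurrence matrix, which is what `Corpus.fit` builds above. The sketch below counts co-occurrences in a symmetric window and lets a pair that is `d` tokens apart contribute `1/d`, as described in the GloVe paper. Whether `glove_python` uses exactly this weighting is an assumption here, and `cooccurrences` is a hypothetical helper, not part of its API.

```python
from collections import defaultdict

def cooccurrences(sentences, window=5):
    """Symmetric co-occurrence counts; a pair d tokens apart contributes 1/d."""
    counts = defaultdict(float)

    for sentence in sentences:
        for i, word in enumerate(sentence):
            for j in range(i + 1, min(i + window + 1, len(sentence))):
                # Sort the pair so that (a, b) and (b, a) share one entry.
                pair = tuple(sorted((word, sentence[j])))
                counts[pair] += 1.0 / (j - i)

    return dict(counts)

toy_counts = cooccurrences([["the", "french", "revolution", "began"]], window=2)
print(toy_counts[("french", "revolution")])  # 1.0
print(toy_counts[("revolution", "the")])     # 0.5
```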


Note that the vocabulary size is (much) larger, because the GloVe API gives you no means of filtering infrequent words.

In [27]:
print("vocabulary size =", len(glove_corpus.dictionary))

vocabulary size = 297728
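If you want a vocabulary comparable to the other two models, one option is to drop rare tokens yourself before calling `Corpus.fit`. The sketch below assumes a cut-off of five occurrences, which is the default minimum count in both `gensim`'s Word2Vec and fastText. `drop_rare_tokens` is a hypothetical helper, and whether it reproduces the other tools' handling exactly has not been checked here.

```python
from collections import Counter

def drop_rare_tokens(sentences, min_count=5):
    """Remove tokens that occur fewer than min_count times in the whole corpus."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return [[t for t in sentence if counts[t] >= min_count] for sentence in sentences]

toy_sentences = [["a", "b", "a"], ["a", "c", "b"]]
filtered = drop_rare_tokens(toy_sentences, min_count=2)
print(filtered)  # [['a', 'b', 'a'], ['a', 'b']]
```

With the real data, the call would look like `glove_corpus.fit(drop_rare_tokens(sentences), window=5)`.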


In [28]:
word_id = glove_corpus.dictionary['french_revolution']
vec = glove_model.word_vectors[word_id]
print('french_revolution:', len(vec), vec)

french_revolution: 100 [-0.01457327  0.07042467 -0.00738236  0.01869197 -0.10396932  0.07660629
 -0.00847388 -0.04022833 -0.08409861 -0.06840466  0.02895287  0.04013232
  0.03428275 -0.06931875  0.00244692 -0.09012463  0.07294783 -0.02184141
 -0.00240382 -0.09255762 -0.03243057  0.12117265 -0.09894675 -0.05349839
 -0.10930869  0.07230865  0.01685076 -0.0130918   0.01368825  0.08105766
 -0.01775627  0.08379842  0.09672956  0.04751685  0.11381786  0.08950021
  0.08859762  0.0207619   0.10595165  0.11534001  0.08888922  0.06794028
 -0.04743968  0.0028419  -0.07757577 -0.0749322  -0.11629591 -0.10451072
 -0.08805415  0.10000175 -0.06882985 -0.00331876  0.03981137  0.01464841
 -0.04958857 -0.07383062 -0.14829421 -0.06328444 -0.00266095 -0.02554267
  0.07958243  0.12405662 -0.02635902 -0.04983661  0.06857486 -0.08566265
  0.0715739  -0.01834986 -0.00941419  0.07897407  0.02931605 -0.10223707
  0.10103703 -0.06886297  0.03940952 -0.05310901 -0.09632344 -0.04183905
  0.08647341  0.03180886  0.

### [Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec)

In [14]:
from gensim.models import Word2Vec

Training a Word2Vec model against the text8 corpus (around 3 minutes on my MBP with four cores). Note that skip-grams perform much better than CBOW, but `gensim`'s Word2Vec API defaults to CBOW.

In [15]:
%time word2vec_model = gensim.models.Word2Vec(list(text8_to_sentences(phrases)), sg=1)

CPU times: user 9min 27s, sys: 4.32 s, total: 9min 32s
Wall time: 3min 19s


In [16]:
print(word2vec_model)

Word2Vec(vocab=111321, size=100, alpha=0.025)
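To make the difference between the two Word2Vec training modes (`sg=1` vs. the CBOW default) more concrete, here is a sketch of the training examples each mode derives from a sentence. Both functions are illustrative helpers with a made-up toy sentence; real implementations add subsampling, randomly shrunk windows, and negative sampling or hierarchical softmax on top, none of which is shown.

```python
def skipgram_pairs(sentence, window=2):
    """Yield (center, context) pairs: skip-gram predicts each context word from the center."""
    for i, center in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, sentence[j]

def cbow_examples(sentence, window=2):
    """Yield (context_words, center): CBOW predicts the center from its pooled context."""
    for i, center in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        yield [sentence[j] for j in range(lo, hi) if j != i], center

toy = ["the", "french", "revolution", "began"]
sg_pairs = list(skipgram_pairs(toy, window=1))
cbow = list(cbow_examples(toy, window=1))
print(sg_pairs)
print(cbow)
```

Skip-gram produces one training example per (center, context) pair, while CBOW produces only one example per position.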


In conclusion, Word2Vec and FastText are about equally fast; GloVe, however, is much slower than the other two. This outcome would be very similar even if you were to use the respective "raw" C/C++ implementations by each algorithm's authors.

## Evaluation

Let's do a very unprofessional evaluation: looking at the 50 entries each model considers most similar to one phrase from the corpus.

Normally, you would evaluate against your objective for using the embeddings. For example, evaluate the classifier (or parser, tagger, ...) you are building with each embedding vector collection we generated above, possibly also evaluate it without embeddings, and then choose the setup that maximizes your classifier's performance (i.e., your final objective).

Another evaluation is to use a long list of similar concept pairs (i.e., including collocations/idioms/phrases). You then compare that list to random pairings of the concepts and choose the model that maximizes the distance between the similar pairs and the random pairs. Another option (which is used by most word embedding papers) is to compute each model's predicted similarity (rank) for the pairs and compare the average similarity (rank) across models.
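The first of these pair-based evaluations can be sketched as follows. The helper `pair_margin`, the toy vectors, and the two "similar" pairs are all invented for illustration; in practice, `vectors` would be filled from one of the trained models and the pair list would be much longer. Note that random pairs can occasionally coincide with a similar pair, which this sketch does not filter out.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def pair_margin(vectors, similar_pairs, n_random=100, seed=0):
    """Mean similarity of the curated pairs minus mean similarity of random pairs."""
    rng = random.Random(seed)
    words = sorted(vectors)
    similar = [cosine(vectors[a], vectors[b]) for a, b in similar_pairs]
    rand = [cosine(*(vectors[w] for w in rng.sample(words, 2))) for _ in range(n_random)]
    return sum(similar) / len(similar) - sum(rand) / len(rand)

# Invented toy embeddings: two "cities" and two "fruits".
toy_vectors = {
    "paris": [1.0, 0.1], "london": [0.9, 0.2],
    "apple": [0.1, 1.0], "pear": [0.2, 0.9],
}
margin = pair_margin(toy_vectors, [("paris", "london"), ("apple", "pear")])
print(margin)
```

The model with the largest margin separates the similar pairs best from chance pairings.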

In [17]:
fasttext_model.most_similar('french_revolution', topn=50)

[('french_revolutionary', 0.9696083068847656),
 ('pre_revolutionary', 0.8272403478622437),
 ('quiet_revolution', 0.8206069469451904),
 ('copernican_revolution', 0.8189491629600525),
 ('glorious_revolution', 0.8126076459884644),
 ('haitian_revolution', 0.8108875751495361),
 ('revolutionary_france', 0.800675630569458),
 ('velvet_revolution', 0.7999899387359619),
 ('october_revolution', 0.7929908037185669),
 ('cedar_revolution', 0.7885279655456543),
 ('cuban_revolution', 0.7873218059539795),
 ('counter_revolution', 0.7808640003204346),
 ('violent_revolution', 0.777005672454834),
 ('mexican_revolution', 0.7757418155670166),
 ('counterrevolutionary', 0.7704391479492188),
 ('carnation_revolution', 0.7697211503982544),
 ('counter_revolutionary', 0.7678598165512085),
 ('socialist_revolutionary', 0.7639342546463013),
 ('russian_revolution', 0.7569039463996887),
 ('revolutionary_war', 0.7538503408432007),
 ('communist_revolution', 0.7497499585151672),
 ('bolshevik_revolution', 0.7488869428634644

In [29]:
glove_model.most_similar('french_revolution', 50)

[('american_revolution', 0.98545719719560509),
 ('byzantine_empire', 0.98537239752199768),
 ('immediately_after', 0.98526824795005341),
 ('julian_calendar', 0.98293441887801969),
 ('braves', 0.9827548943822858),
 ('white_sox', 0.98201817386180201),
 ('crusades', 0.98047823611917062),
 ('beatles', 0.98038678138099378),
 ('red_army', 0.97926562174246679),
 ('fastest', 0.97899027848733544),
 ('just_before', 0.97854986070970051),
 ('taliban', 0.97491795114221913),
 ('browns', 0.97445546919162207),
 ('royal_navy', 0.97375007741488073),
 ('holocaust', 0.97346146917545107),
 ('luftwaffe', 0.97324417882319947),
 ('khmer_rouge', 0.97291143095959398),
 ('joins', 0.97289841957820655),
 ('napoleonic_wars', 0.97282942944594974),
 ('astros', 0.9726109924660693),
 ('vikings', 0.97257847329751768),
 ('us_army', 0.97074014905486095),
 ('cpr', 0.97036798876359753),
 ('reformation', 0.96958038092561372),
 ('ottoman_empire', 0.96940425590643275),
 ('franks', 0.96936113689026537),
 ('beach_boys', 0.9693366

In [19]:
word2vec_model.most_similar('french_revolution', topn=50)

[('napoleonic', 0.7552270889282227),
 ('english_civil', 0.7475066184997559),
 ('american_revolution', 0.746488094329834),
 ('weimar_republic', 0.7448495030403137),
 ('glorious_revolution', 0.7405481338500977),
 ('jacobin', 0.7388259172439575),
 ('crusading', 0.7386302947998047),
 ('franco_prussian', 0.7379936575889587),
 ('napoleonic_wars', 0.7379871010780334),
 ('inter_war', 0.7285125255584717),
 ('scottish_independence', 0.7227034568786621),
 ('french_revolutionary', 0.7222280502319336),
 ('its_aftermath', 0.7210865020751953),
 ('emancipation', 0.7207050323486328),
 ('interwar', 0.7204379439353943),
 ('pre_revolutionary', 0.7199528813362122),
 ('jacobins', 0.7196218967437744),
 ('clausewitz', 0.7180099487304688),
 ('effectively_ended', 0.7172418832778931),
 ('paris_commune', 0.7117035388946533),
 ('brezhnev', 0.704556941986084),
 ('spanish_civil', 0.7038901448249817),
 ('nazi_era', 0.7018135786056519),
 ('were_suppressed', 0.7017765045166016),
 ('german_empire', 0.7012790441513062),
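The scores in these lists are similarity values between the query vector and every other vector in the model. For the `gensim` models, that is the cosine similarity. A minimal sketch of such a ranking, using invented three-dimensional toy vectors and a hypothetical `rank_by_cosine` helper, might look like this:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def rank_by_cosine(vectors, query, topn=3):
    """Rank all other entries by cosine similarity to the query entry."""
    scored = [(w, cosine(vectors[query], v)) for w, v in vectors.items() if w != query]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:topn]

# Invented toy vectors, not taken from any of the trained models.
toy_vectors = {
    "french_revolution": [0.9, 0.1, 0.0],
    "american_revolution": [0.8, 0.2, 0.1],
    "napoleonic_wars": [0.7, 0.1, 0.3],
    "white_sox": [0.0, 0.1, 0.9],
}
neighbours = rank_by_cosine(toy_vectors, "french_revolution")
print([word for word, score in neighbours])
```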


## Conclusion

FastText is certainly more character-content oriented due to its reliance on character n-grams, and both it and Word2Vec have the advantage that they are much faster than GloVe (particularly noticeable on very large corpora). Also note that GloVe's similarity scores decay much more slowly. But at the end of the day, the right choice is more a question of personal taste than anything else. One noteworthy advantage of `gensim`'s Word2Vec implementation is that the model can be updated with new documents at a later date ("online training"). Yours Truly prefers FastText, due to the above reasons and because the character n-gram embeddings provide a good backup strategy for unseen words, but the best rule as always is: Test all approaches against your objective!
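The "backup strategy for unseen words" rests on the character n-grams FastText extracts from each word. The sketch below wraps a word in `<` and `>` boundary markers and collects all n-grams of three to six characters, which are fastText's default lengths. `char_ngrams` is an illustrative helper, not the library's code, and the hashing of n-grams into buckets is left out.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = "<" + word + ">"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]

print(char_ngrams("where", 3, 3))  # ['<wh', 'whe', 'her', 'ere', 're>']

# An unseen word shares many n-grams with a word seen during training.
known = set(char_ngrams("revolution"))
unseen = set(char_ngrams("revolutions"))
shared = known & unseen
print(len(shared), "of", len(unseen), "n-grams of the unseen word are already known")
```

A vector for the unseen word can then be composed from the vectors of the n-grams it shares with the training vocabulary.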

## Post scriptum

Yes, with the `gensim` embedding model API, you can do *that*, too:

In [20]:
word2vec_model.most_similar(positive=['woman', 'king'], negative=['man'])

[('queen', 0.6721546649932861),
 ('crown_prince', 0.6700811386108398),
 ('empress', 0.6549503803253174),
 ('wife', 0.654847264289856),
 ('matilda', 0.6330344080924988),
 ('throne', 0.6304331421852112),
 ('jadwiga', 0.6300833225250244),
 ('emperor', 0.622090220451355),
 ('anjou', 0.6076939105987549),
 ('queen_consort', 0.6069934368133545)]

Hurray!
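For the curious, the query above boils down to vector arithmetic: add the (unit-length) vectors of the positive words, subtract those of the negative words, and rank all remaining entries by cosine similarity to the result. The sketch below shows that idea on invented two-dimensional toy vectors; `analogy` is a hypothetical helper written along these lines, not `gensim`'s actual code.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def analogy(vectors, positive, negative, topn=1):
    """Rank entries by cosine similarity to sum(positive) - sum(negative) of unit vectors."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims

    for sign, words in ((1.0, positive), (-1.0, negative)):
        for w in words:
            target = [t + sign * a for t, a in zip(target, unit(vectors[w]))]

    target = unit(target)
    scored = [
        (w, sum(a * b for a, b in zip(target, unit(v))))
        for w, v in vectors.items()
        if w not in positive and w not in negative  # skip the query words
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:topn]

# Invented 2-d vectors: first axis ~ "royalty", second axis ~ "gender".
toy_vectors = {
    "king": [0.9, 0.8], "queen": [0.9, -0.8],
    "man": [0.1, 0.9], "woman": [0.1, -0.9],
    "apple": [-0.9, 0.05],
}
result = analogy(toy_vectors, positive=["woman", "king"], negative=["man"])
print(result)
```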