![Banner](img/AI_Special_Program_Banner.jpg)

## Text Mining - Material 2: Embeddings
------
This notebook is based on the work of [nlptown](https://github.com/nlptown/nlp-notebooks) and [Matthew Mayo from KDnuggets](https://www.kdnuggets.com/2017/11/building-wikipedia-text-corpus-nlp.html).

One of the breakthroughs of neural networks in Natural Language Processing is the use of word embeddings. Rather than using the words themselves as features, neural network methods typically take as input dense, relatively low-dimensional vectors that model the meaning and usage of a word. Word embeddings were first popularized through the [Word2Vec](https://arxiv.org/abs/1301.3781) model, developed by Tomas Mikolov and colleagues at Google. Since then, scores of alternative approaches have been developed, such as [GloVe](https://nlp.stanford.edu/projects/glove/) and [FastText](https://fasttext.cc/) embeddings. In this notebook, we'll explore word embeddings with the original Word2Vec approach, as implemented in the [Gensim](https://radimrehurek.com/gensim/) library.

You already used an "Embedding" layer in PyTorch, which was directly trained within the network. However, we now take a closer look at embeddings and especially focus on the more common libraries.

## Overview
- [Training word embeddings](#Training-word-embeddings)
- [Using word embeddings](#Using-word-embeddings)
- [Plotting embeddings](#Plotting-embeddings)
- [Clustering embeddings](#Clustering-embeddings)
- [Conclusion](#Conclusion)

First, we need to download the corpus. Wikipedia is a good choice for training generic embeddings. Typically you would download the full Wikipedia ([here](https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2)), but since this is over 21GB, just use a [fragment of about 275MB](https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles1.xml-p1p41242.bz2) and store it in your `data/Wikipedia` directory, which is the path the code below expects. We are using the English Wikipedia here.

In [None]:
import sys
from gensim.corpora import WikiCorpus # don't forget to install gensim ...

wikifile = "data/Wikipedia/enwiki-latest-pages-articles1.xml-p1p41242.bz2"
corpusfile = "data/corpus-en.txt"

def make_corpus(in_f, out_f):
    """Convert a Wikipedia XML dump file to a plain-text corpus, one article per line."""
    wiki = WikiCorpus(in_f)

    i = 0
    # The context manager closes the output file even if processing fails.
    with open(out_f, 'w', encoding="utf-8") as output:
        for text in wiki.get_texts():
            # Each article arrives as a list of string tokens.
            output.write(' '.join(text) + '\n')
            i += 1
            if i % 5000 == 0:
                print(f'Processed {i} articles')
    print(f'Processing of {i} articles complete!')


#if __name__ == '__main__':
#    make_corpus("data/Wikipedia/enwiki-latest-pages-articles1.xml-p1p41242.bz2", "data/corpus-en.txt")
make_corpus(wikifile, corpusfile)

Have a look at the result in the `corpus-en.txt` file.

## Training word embeddings

Training word embeddings with Gensim couldn't be easier. The only thing we need is a corpus of tokenized text in the language under investigation (here English). Word2Vec expects an iterable of token lists, which we can produce by reading the lines of our Wikipedia corpus file (one article per line) and splitting them on spaces (the data has already been preprocessed by `WikiCorpus` above).

In [None]:
import os

class SentenceCorpus(object):

    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        with open(self.filename, "r", encoding="utf-8") as i:
            for line in i:
                tokens = line.strip().split()
                yield tokens
               
sentences = SentenceCorpus(corpusfile)
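
A class with an `__iter__` method is used here rather than a plain generator, presumably because training needs to pass over the corpus more than once (a vocabulary scan plus the training epochs). The following minimal sketch illustrates the difference on a temporary toy file; `LineCorpus` and `token_generator` are hypothetical names introduced only for this illustration.

```python
import os
import tempfile


class LineCorpus:
    """Hypothetical stand-in for SentenceCorpus: re-opens the file on every pass."""

    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        with open(self.filename, "r", encoding="utf-8") as handle:
            for line in handle:
                yield line.split()


def token_generator(filename):
    """One-shot generator: it is exhausted after a single pass."""
    with open(filename, "r", encoding="utf-8") as handle:
        for line in handle:
            yield line.split()


# Write a tiny two-line corpus to a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("the king rules\nthe queen rules\n")
    tmp_path = tmp.name

corpus = LineCorpus(tmp_path)
first_pass = sum(1 for _ in corpus)
second_pass = sum(1 for _ in corpus)  # works again: __iter__ restarts at the top of the file

gen = token_generator(tmp_path)
gen_first = sum(1 for _ in gen)
gen_second = sum(1 for _ in gen)      # nothing left to read

os.remove(tmp_path)
print(first_pass, second_pass, gen_first, gen_second)  # prints: 2 2 2 0
```

The restartable iterable also keeps memory use low, since only one line is held in memory at a time.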

When we train our word embeddings, Gensim allows us to set a number of parameters. The most important of these are `min_count`, `window`, `vector_size` and `sg`:

- `min_count` is the minimum frequency of the words in our corpus. For infrequent words, we just don't have enough information to train reliable word embeddings. It therefore makes sense to set this minimum frequency to at least 10. In these experiments, we'll set it to 100 to limit the size of our model even more.
- `window` is the number of words to the left and to the right that make up the context that word2vec will take into account.
- `vector_size` is the dimensionality of the word vectors. This is generally between 100 and 1000. You often have to make a trade-off: embeddings with a higher dimensionality are able to model more information, but also need more data to train.
- `sg`: there are two algorithms to train word2vec: skip-gram and CBOW. Skip-gram tries to predict the context on the basis of the target word; CBOW tries to find the target on the basis of the context. By default, Gensim uses CBOW (`sg=0`).
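
To make the roles of `window` and `sg` more concrete, here is a simplified sketch of the training examples each algorithm derives from one sentence. The helper names (`context_windows`, `skipgram_pairs`, `cbow_examples`) are hypothetical and not part of Gensim; real implementations add refinements such as subsampling of frequent words and randomly shrinking the window, which are left out here.

```python
def context_windows(tokens, window):
    """Yield (target, context_words) for every position in the sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right


def skipgram_pairs(tokens, window):
    """Skip-gram: one (target, context_word) example per context word."""
    return [(target, ctx_word)
            for target, ctx in context_windows(tokens, window)
            for ctx_word in ctx]


def cbow_examples(tokens, window):
    """CBOW: one (context_words, target) example per position."""
    return [(ctx, target) for target, ctx in context_windows(tokens, window)]


sentence = "the king rules the land".split()
sg_pairs = skipgram_pairs(sentence, window=1)
cbow = cbow_examples(sentence, window=1)

print(sg_pairs[:3])  # [('the', 'king'), ('king', 'the'), ('king', 'rules')]
print(cbow[1])       # (['the', 'rules'], 'king')
```

Skip-gram produces one example per (target, context word) pair, while CBOW produces one example per position. That is one reason skip-gram is usually slower to train on the same corpus.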

We'll investigate the impact of some of these parameters later.

In [None]:
import gensim

model = gensim.models.Word2Vec(sentences, min_count=100, window=5, vector_size=100, sg=0)
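
The effect of `min_count` can be pictured as a frequency filter applied before training. The sketch below is an illustration on toy data, not Gensim's actual implementation; `prune_vocabulary` is a hypothetical helper.

```python
from collections import Counter


def prune_vocabulary(sentences, min_count):
    """Count every token and keep only those seen at least min_count times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}


toy_sentences = [
    ["the", "king", "rules"],
    ["the", "queen", "rules"],
    ["the", "king", "sleeps"],
]
kept = prune_vocabulary(toy_sentences, min_count=2)
print(kept)  # {'the': 3, 'king': 2, 'rules': 2}
```

Words below the threshold (*queen* and *sleeps* here) get no embedding at all, so looking them up in the trained model fails.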

## Using word embeddings

Let's take a look at the model. The word embeddings are stored in its word-vector (`wv`) attribute, and we can access them by using the token as key. For example, here is the embedding for *king*, with the requested 100 dimensions.

In [None]:
model.wv["king"]

We can also easily find the similarity between two words. Similarity is measured as the **cosine** between the two word embeddings, and ranges between -1 and +1. The higher the cosine, the more similar two words are. As expected, the figures below show that *king* is closer to *queen* than to *champion*.

In [None]:
print(model.wv.similarity("king", "queen"))
print(model.wv.similarity("king", "champion"))
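
For reference, the cosine of two vectors is their dot product divided by the product of their lengths. A dependency-free sketch on toy vectors (not on the trained model) shows the three boundary cases; `cosine_similarity` is a helper written for this illustration.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # parallel vectors: close to +1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])       # perpendicular vectors: 0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # opposite vectors: close to -1

print(same_direction, orthogonal, opposite)
```

Because the cosine ignores vector length, a vector and a scaled copy of it are treated as maximally similar.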

In a similar vein, we can find the words that are most similar to a target word. The words with the most similar embedding to *king* are all similar titles (such as *emperor* or *prince*) or are names of kings (*valdemar*).

In [None]:
model.wv.similar_by_word("king", topn=10)

Interestingly, we can look for words that are similar to a set of words and dissimilar to another set of words at the same time. This allows us to look for analogies of the type *king* is to *man* as ... is to *woman*.
This example gives *queen* as the top result - as it should be. However, there are other promising candidates, as you can see. If you want, you can try out a few more examples by yourself -- but please remember that our training data is very small and results may vary.

In [None]:
model.wv.most_similar(positive=['king', 'woman'], negative=["man"], topn=10)
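
Conceptually, such an analogy query combines the unit-length vectors of the positive words, subtracts those of the negative words, and ranks the remaining vocabulary by cosine similarity to the result. The following sketch reproduces that idea on hand-made two-dimensional vectors (roughly "royalty" and "maleness"). The vectors and the `analogy` helper are invented for illustration and only approximate what Gensim does internally.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def analogy(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    exclude = set(positive) | set(negative)  # input words are never returned
    scores = [(w, dot(query, unit(vec)))
              for w, vec in vectors.items() if w not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors (assumed values): first axis ~ royalty, second axis ~ maleness.
toy_vectors = {
    "king": [1.0, 1.0], "queen": [1.0, -1.0], "prince": [0.8, 1.0],
    "man": [0.0, 1.0], "woman": [0.0, -1.0], "apple": [-1.0, 0.1],
}
ranking = analogy(toy_vectors, positive=["king", "woman"], negative=["man"])
print(ranking[0][0])  # queen
```

Excluding the input words matters: without that step, one of the query words itself often ends up as the top match.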


Similarly, we can also zoom in on one of the meanings of ambiguous words. For example, the term *tie* in English can refer to many things: an undecided match, something to wear, or the verb meaning "to join" (for further examples, see, e.g., [here](https://www.yourdictionary.com/articles/words-multiple-meanings)).

In [None]:
model.wv.most_similar(positive=["tie"], topn=20)


However, if we specify that we're looking for words that are similar to *tie*, but dissimilar to *win*, suddenly the best matches are almost all related to dressing up.

In [None]:
model.wv.most_similar(positive=["tie"], negative=["win"], topn=10)

Finally, we can present the word2vec model with a list of words and ask it to identify the odd one out. It then uses the word embeddings to identify the word that is least similar to the other ones. For example, in the list *seminar university king study*, it correctly identifies *king* as the odd one out.

In [None]:
print(model.wv.doesnt_match("seminar university king study".split()))
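
One way to picture the odd-one-out computation: average the unit-length vectors of all the words, then pick the word whose vector has the lowest cosine similarity to that average. The sketch below applies this idea to invented two-dimensional toy vectors. `odd_one_out` is a hypothetical helper, and the real Gensim method may differ in details.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def odd_one_out(vectors, words):
    """Return the word least similar to the mean of all unit vectors."""
    units = {w: unit(vectors[w]) for w in words}
    dims = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dims)])
    return min(words, key=lambda w: dot(units[w], mean))


# Toy vectors (assumed values): first axis ~ academia, second axis ~ royalty.
toy_vectors = {
    "seminar": [1.0, 0.1],
    "university": [0.9, 0.2],
    "study": [1.0, 0.0],
    "king": [0.1, 1.0],
}
outlier = odd_one_out(toy_vectors, ["seminar", "university", "king", "study"])
print(outlier)  # king
```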

## Plotting embeddings

Let's now visualize some of our embeddings. To plot embeddings with a dimensionality of 100 or more, we first need to map them to a dimensionality of 2. We do this with **t-SNE**, short for t-distributed Stochastic Neighbor Embedding, which helps us visualize high-dimensional data by mapping similar data to nearby points and dissimilar data to distant points in the low-dimensional space. Principal Component Analysis (PCA) [(more info)](https://en.wikipedia.org/wiki/Principal_component_analysis) serves only to initialize the t-SNE mapping in the code that follows.

t-SNE is available in [Scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html). To run it, we just have to specify the number of dimensions we'd like to map the data to (`n_components`), and the distance metric that t-SNE should use to compare two data points (`metric`). We're going to map to 2 dimensions and use the cosine distance as our metric. Additionally, we initialize the mapping with PCA (`init='pca'`), which is usually more stable than a random starting layout. The [Scikit-learn user guide](https://scikit-learn.org/stable/modules/manifold.html#t-sne) contains some additional tips for optimizing performance.

Plotting all the embeddings in our vector space would result in a very crowded figure where the labels are hardly legible. Therefore we'll focus on a subset of embeddings by selecting the 200 most similar words to a target word. 

In [None]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

from sklearn.manifold import TSNE

target_word = "king"
selected_words = [w[0] for w in model.wv.most_similar(positive=[target_word], topn=200)]
embeddings = np.array([model.wv[w] for w in selected_words])

mapped_embeddings = TSNE(
    n_components=2,
    metric='cosine',
    init='pca',
    learning_rate='auto'  # square_distances was dropped: recent scikit-learn releases no longer accept it
).fit_transform(embeddings)

If we take *king* as our target word, the figure shows some interesting patterns (the exact positions depend on your trained model, so your plot may be arranged differently). Notice how
* on the top to middle right, the "words" represented by Roman numerals are clustered together
* on the top left, we have a "north-eastern" connotation
* on the bottom left, we have terms for noblemen

In [None]:
plt.figure(figsize=(30,30))
x = mapped_embeddings[:,0]
y = mapped_embeddings[:,1]
plt.scatter(x, y)

for i, txt in enumerate(selected_words):
    plt.annotate(txt, (x[i], y[i]), size=20)

## Clustering embeddings

Finally, we're going to cluster our embeddings. This can be useful to model semantic information. We'll use agglomerative clustering, a bottom-up clustering method that iteratively merges the two most similar clusters (or embeddings) in the data.
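
The bottom-up merging idea can be sketched in a few lines of plain Python. This is a deliberately naive illustration with average linkage on toy 2D points; scikit-learn's `AgglomerativeClustering` uses Ward linkage by default and is far more efficient. `mean_distance` and `agglomerate` are hypothetical helper names.

```python
import math


def mean_distance(cluster_a, cluster_b, points):
    """Average-linkage distance: mean pairwise Euclidean distance between clusters."""
    total = sum(math.dist(points[i], points[j]) for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerate(points, n_clusters):
    """Start with one cluster per point and merge the closest pair until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        candidate_pairs = ((a, b)
                           for a in range(len(clusters))
                           for b in range(a + 1, len(clusters)))
        a, b = min(candidate_pairs,
                   key=lambda pair: mean_distance(clusters[pair[0]], clusters[pair[1]], points))
        clusters[a] = clusters[a] + clusters[b]  # merge cluster b into cluster a
        del clusters[b]
    return clusters


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (5.0, 5.2)]
toy_clusters = agglomerate(points, n_clusters=2)
print(toy_clusters)  # [[0, 1], [2, 3, 4]]
```

Each cluster is a list of point indices, so the result says that the first two points end up together and the remaining three form the second cluster.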

In [None]:
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import normalize

vocab = list(model.wv.key_to_index)
vectors = [model.wv[w] for w in vocab]
vectors_norm = normalize(vectors)

clusterer = AgglomerativeClustering(n_clusters=500)
clusters = clusterer.fit_predict(vectors_norm)
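
A plausible reason for normalizing the vectors first: `AgglomerativeClustering` works with Euclidean distances by default, and for unit-length vectors the squared Euclidean distance equals `2 - 2 * cosine`. Clustering normalized vectors therefore respects the same ordering as cosine similarity. The small check below verifies that identity on toy vectors; `l2_normalize` is a helper written for this sketch.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


a = l2_normalize([3.0, 4.0])   # becomes [0.6, 0.8]
b = l2_normalize([1.0, 0.0])   # already unit length

squared_euclidean = sum((x - y) ** 2 for x, y in zip(a, b))
cosine = sum(x * y for x, y in zip(a, b))

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(math.isclose(squared_euclidean, 2 - 2 * cosine))  # True
```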

Let's inspect some of the clusters. By focusing on some of the clusters that contain the names of countries, we can see how these clusters can be useful.

In [None]:
cluster_dictionary = {}
for cluster, word in zip(clusters, vocab): 
    if cluster not in cluster_dictionary:
        cluster_dictionary[cluster] = []
    cluster_dictionary[cluster].append(word)

In [None]:
for x in cluster_dictionary:
    if "korea" in cluster_dictionary[x]:
        print(cluster_dictionary[x])
print("\nAnother cluster:")
for x in cluster_dictionary:
    if "germany" in cluster_dictionary[x]:
        print(cluster_dictionary[x])

In [None]:
with open("data/clusters_nl.tsv", "w", encoding="utf-8") as o:
    for c in cluster_dictionary:
        for w in cluster_dictionary[c]:
            o.write(f"{w}\t{c}\n")

## Conclusion

Word embeddings are one of the most exciting trends in Natural Language Processing since the 2000s. They allow us to model the meaning and usage of a word, and discover words that behave similarly. This is crucial for the generalization capacity of many machine learning models. Moving from raw strings to embeddings allows them to generalize across words that have a similar meaning, and discover patterns that had previously escaped them.