# Word Representations (Embeddings)

In natural language processing, a word embedding is a representation of a word as a vector. Models are trained on numerical data, so we must find a way to represent text that keeps as much of the information relevant to our current task as possible. Depending on the task, semantic relations may matter most, or lexical information, and so on.

By using word embeddings (vectorization) we can represent each word as a number or, more commonly, a list of numbers, chosen so that similar words end up closer to each other in the vector space than dissimilar ones.

[A nice illustration](https://jalammar.github.io/illustrated-word2vec/)
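To make "closer in the vector space" concrete, here is a minimal sketch that compares vectors with cosine similarity, a common closeness measure for embeddings. The three vectors are hand-made toy values chosen for illustration, not real embeddings, and the helper name `cosine_similarity` is our own.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hand-made 3-dimensional toy vectors (NOT real embeddings).
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

# "cat" should come out closer to "dog" than to "car".
print(cosine_similarity(toy_vectors["cat"], toy_vectors["dog"]))
print(cosine_similarity(toy_vectors["cat"], toy_vectors["car"]))
```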

# Bag of Words (BoW)

Imagine a situation where the context of the words is not relevant, only how often they appear. This is where we use bag of words. This approach just throws all words in a bag, maybe shuffles it a bit, then counts how many times each word appears (or only whether it appears, in the case of a binary BoW). It is the easiest vectorization method that we will discuss.

<center><img src='https://drive.google.com/uc?export=view&id=1v6McR199QkVXvuQmC3FWJ80rSXGTbZUS' width=500></center>

Let's take the text from the example. We can either write our own BoW implementation, or we can use the one preimplemented in [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html):

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

text = ['Did you see the fly?', 'The fly will fly with you.']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(text)
print(vectorizer.get_feature_names_out())
print(X.toarray())

['did' 'fly' 'see' 'the' 'will' 'with' 'you']
[[1 1 1 1 0 0 1]
 [0 2 0 1 1 1 1]]


CountVectorizer comes with default parameters that you can override. For example, you can choose to have a binary representation of bigrams:

In [None]:
text = ['I am not happy.', 'He is very happy']
vectorizer = CountVectorizer(binary=True, ngram_range=(2, 2))
X = vectorizer.fit_transform(text)
print(vectorizer.get_feature_names_out())
print(X.toarray())

['am not' 'he is' 'is very' 'not happy' 'very happy']
[[1 0 0 1 0]
 [0 1 1 0 1]]


N-grams are sequences of n words. They help us get some context about the text, letting us know the difference between _not happy_ and _very happy_ for example. This can be used as a feature for another representation, or on its own to make assumptions about the dataset.
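As a rough illustration of what an n-gram is, the sketch below extracts word n-grams with plain Python. The function name `word_ngrams` is hypothetical, and the whitespace tokenization is a simplification: CountVectorizer also strips punctuation and, by default, drops one-letter tokens.

```python
def word_ngrams(text, n):
    """Return the list of word n-grams in `text` (lowercased, split on whitespace)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(word_ngrams("He is very happy", 2))
# ['he is', 'is very', 'very happy']
```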

# Term Frequency - Inverse Document Frequency (Tf-idf)

Just because a word appears often does not mean that it is relevant (think about stopwords). If we want to write a search engine, for example, it is more useful to know how often a word appears in a document relative to how common that word is in general. Tf-idf is a weighting scheme that takes this into account: a word is important for a given document if it appears many times in that document and rarely in the others.

For a given document, we compute the following for each word in the dataset:
$$TFIDF = TF \times IDF$$
where:
$$TF(word, document) = \frac{How\ many\ times\ the\ word\ appears\ in\ the\ document}{Number\ of\ words\ in\ the\ document}$$
and:
$$IDF(word, Documents) = \log(\frac{Number\ of\ documents\ in\ the\ corpus}{How\ many\ documents\ contain\ the\ current\ word} + 1)$$
We use **log** to dampen the ratio, so that very rare words do not get disproportionately large weights.

For the implementation you can use [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). Its defaults use a smoothed variant of the IDF formula and normalize each row to unit length, so the values will not match the formula above exactly. The output is a matrix where each row corresponds to a datapoint and each column to a word from the full dataset:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

text = ['I am not happy.', 'He is very happy']

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(text)
print(vectorizer.get_feature_names_out())
print(X.toarray())

['am' 'happy' 'he' 'is' 'not' 'very']
[[0.6316672  0.44943642 0.         0.         0.6316672  0.        ]
 [0.         0.37997836 0.53404633 0.53404633 0.         0.53404633]]
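The values printed above can be reproduced by hand. The sketch below assumes scikit-learn's documented defaults: raw counts as TF, a smoothed IDF of ln((1 + N) / (1 + df)) + 1, and L2 normalization of each row. The count matrix is typed in manually from the two example sentences.

```python
import numpy as np

# Raw counts for the two example sentences; columns follow the vocabulary
# printed above: ['am', 'happy', 'he', 'is', 'not', 'very'].
# ('I' is missing because the default tokenizer drops one-letter tokens.)
counts = np.array([
    [1, 1, 0, 0, 1, 0],   # 'I am not happy.'
    [0, 1, 1, 1, 0, 1],   # 'He is very happy'
], dtype=float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)              # in how many documents each word appears
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed IDF (assumed scikit-learn default)

tfidf = counts * idf                                    # TF is the raw count here
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)   # L2-normalize each row

print(tfidf.round(8))
```

Because _happy_ appears in both documents, its IDF is the smallest, which is why it gets the lowest non-zero weight in each row.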


# Word2vec

Word2vec was one of the most popular embedding techniques before the rise of the Transformer in [2017](https://arxiv.org/pdf/1706.03762.pdf). It was originally published in 2013 ([\[1\]](https://arxiv.org/pdf/1301.3781.pdf), [\[2\]](https://arxiv.org/pdf/1310.4546.pdf)) and consists of a shallow neural network (with only one hidden layer) trained on the words of a text together with their contexts, such that similar words end up closer to each other in the vector space (and unrelated words further apart).

It all starts from the quote (by the linguist J. R. Firth): _You shall know a word by the company it keeps_. The idea is that we can use the context of a word to compute the similarity between different words in our text, use this as a training dataset and create a prediction model that works as an embedding for the words we have in our corpus. The model can use one of the following algorithms:
- Continuous Bag of Words (CBoW): use the context window around a word to predict the word; faster to train, and tends to do slightly better on frequent words
- Skip-Gram: use a target word to predict the context around it; tends to work well with smaller datasets and to generalize better to rare words

<img src= "https://wiki.pathmind.com/images/wiki/word2vec_diagrams.png" width="500" height="300">



[A more in depth explanation with code](https://www.tensorflow.org/text/tutorials/word2vec)

## Continuous Bag-of-Words (CBoW)

Unlike the BoW model, CBoW takes into account the context around a certain word by using a context window.


For example, if you choose the text _The fly will fly with you._ and the window size 1, it will look at exactly 1 word before and after each word in the text (the first and last words are skipped here, since they lack a neighbour on one side), generating the following sequence of (_context_, _target_) pairs:

$$([the, will], fly), ([fly, fly], will), ([will, with], fly), ([fly, you], with)$$

This is the information on which we will train our model to predict the most probable word in a given context.
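The pairs listed above can be generated with a few lines of Python. This is only a sketch of the pair construction, not of the training itself; the helper name `cbow_pairs` is our own, and positions without a full window on both sides are skipped, as in the example.

```python
def cbow_pairs(tokens, window=1):
    """Build (context, target) pairs, keeping only positions with a full window on both sides."""
    pairs = []
    for i in range(window, len(tokens) - window):
        context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
        pairs.append((context, tokens[i]))
    return pairs

tokens = "the fly will fly with you".split()
for context, target in cbow_pairs(tokens, window=1):
    print(context, "->", target)
# ['the', 'will'] -> fly
# ['fly', 'fly'] -> will
# ['will', 'with'] -> fly
# ['fly', 'you'] -> with
```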

## Skip-Gram

The Skip-Gram model works the other way around: given a target word, it aims to predict the context around it. To do this, you can train a neural network with one hidden layer on a simple task: predicting how likely word _y_ is to appear close to word _x_ in a text. After training, the hidden-layer weights for a word are used as its vector representation, so the distance between two vectors is smaller the more similar the words are and larger otherwise.
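For comparison with CBoW, here is a minimal sketch of how skip-gram training examples could be built: every word is paired with each neighbour inside its window. The helper name `skipgram_pairs` is hypothetical, and real implementations add refinements (such as negative sampling) that are left out here.

```python
def skipgram_pairs(tokens, window=1):
    """Build (target, context_word) pairs for every word within `window` positions of the target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:   # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the fly will fly with you".split()
print(skipgram_pairs(tokens, window=1)[:4])
# [('the', 'fly'), ('fly', 'the'), ('fly', 'will'), ('will', 'fly')]
```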

## Training a model

In [None]:
from gensim.test.utils import common_texts
from gensim.models import Word2Vec

embedding = Word2Vec(
    sentences=common_texts,   # the list of sentences, where each sentence is given as a list of words (processed or not processed)
    vector_size=100,          # the number of features in the vectorized representation
    window=7,                 # the context window
    min_count=3,              # the minimum number of times a word should appear in our dataset in order to be counted
    sg=1                      # sg=1 means skip-gram is used, sg=0 means CBOW is used
)

In [None]:
embedding.wv.key_to_index

{'system': 0, 'graph': 1, 'trees': 2, 'user': 3}

In [None]:
import pandas as pd

df = pd.DataFrame(
    [embedding.wv.get_vector(word) for word in embedding.wv.key_to_index.keys()],
    index=embedding.wv.key_to_index
  )

df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
system,-0.000536,0.000236,0.005103,0.009009,-0.009303,-0.007117,0.006459,0.008973,-0.005015,-0.003763,...,0.001631,0.00019,0.003474,0.000218,0.009619,0.005061,-0.008917,-0.007042,0.000901,0.006393
graph,-0.00862,0.003666,0.00519,0.005742,0.007467,-0.006168,0.001106,0.006047,-0.00284,-0.006174,...,0.001088,-0.001576,0.002197,-0.007882,-0.002717,0.002663,0.005347,-0.002392,-0.00951,0.004506
trees,9.5e-05,0.003077,-0.006813,-0.001375,0.007669,0.007346,-0.003673,0.002643,-0.008317,0.006205,...,-0.004509,0.005702,0.00918,-0.0041,0.007965,0.005375,0.005879,0.000513,0.008213,-0.007019
user,-0.008243,0.009299,-0.000198,-0.001967,0.004604,-0.004095,0.002743,0.00694,0.006065,-0.007511,...,-0.007426,-0.001064,-0.000795,-0.002563,0.009683,-0.000459,0.005874,-0.007448,-0.002506,-0.00555


In [None]:
embedding.wv.most_similar('system')

[('graph', -0.01083916611969471),
 ('trees', -0.05234673246741295),
 ('user', -0.111670583486557)]

## Loading a pretrained model

[Info about data and models](https://github.com/piskvorky/gensim-data)

[Examples on how to use](https://radimrehurek.com/gensim/models/word2vec.html)

In [None]:
import gensim.downloader as api

api.info()

In [None]:
model = api.load("word2vec-google-news-300")



In [None]:
model.most_similar('system')

[('systems', 0.7227916717529297),
 ('sytem', 0.7129376530647278),
 ('sys_tem', 0.5871982574462891),
 ('System', 0.5275423526763916),
 ('mechanism', 0.5058810114860535),
 ('sysem', 0.5027822852134705),
 ('systen', 0.49969804286956787),
 ('system.The', 0.49599188566207886),
 ('sytems', 0.4949610233306885),
 ('computerized', 0.47604817152023315)]

In [None]:
model.similarity('system', 'graph')

0.09396098

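## Fine-tuning our model

The pretrained vectors loaded above are a KeyedVectors object, which only stores the vectors and cannot be trained further. To continue training we need a full Word2Vec model, such as the embedding model that we trained earlier:

In [None]:
embedding.train(common_texts, total_examples=len(common_texts), epochs=1)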

Other cool stuff:

In [None]:
model.most_similar(positive=["king", "woman"], negative=["man"])

[('queen', 0.7118193507194519),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321839332581),
 ('kings', 0.5236844420433044),
 ('Queen_Consort', 0.5235945582389832),
 ('queens', 0.5181134343147278),
 ('sultan', 0.5098593831062317),
 ('monarchy', 0.5087411999702454)]

[And less cool stuff:](https://arxiv.org/pdf/1607.06520.pdf)

In [None]:
model.most_similar(positive=["computer_programmer", "woman"], negative=["man"])

[('homemaker', 0.5627118945121765),
 ('housewife', 0.5105047225952148),
 ('graphic_designer', 0.505180299282074),
 ('schoolteacher', 0.497949481010437),
 ('businesswoman', 0.493489146232605),
 ('paralegal', 0.49255111813545227),
 ('registered_nurse', 0.4907974898815155),
 ('saleswoman', 0.4881627559661865),
 ('electrical_engineer', 0.4797725975513458),
 ('mechanical_engineer', 0.4755399227142334)]

Bias is still an unsolved problem in Machine Learning. Do you know any other popular examples of bias?

# Global Vectors (GloVe)

While Word2Vec is based only on local statistics (the occurrence of words at a single-sentence level), [GloVe](https://nlp.stanford.edu/projects/glove/) also incorporates global co-occurrence statistics computed over the whole corpus. This can make it better suited for smaller datasets, as it may not need as much training data.

The model counts how often each pair of words occurs together (for a context window of x, we consider words that are at most x positions apart) and keeps the counts in a co-occurrence matrix:

<center><img src='https://drive.google.com/uc?export=view&id=1pnX1lPdQItUauHp9W8xJlx8q2lgTe4cJ' width=500></center>

Afterwards, it uses this matrix to compute the probability that word j appears in the context of word i:
$$P(j | i) = \frac{X_{ij}}{X_i}$$
where:
$$P(j | i) = the\ probability\ of\ word\ j\ given\ i$$
$$X_{ij} = how\ many\ times\ word\ j\ appears\ in\ the\ context\ of\ i$$
$$X_i = \sum_k X_{ik} = sum\ of\ how\ many\ times\ words\ appear\ in\ the\ context\ of\ i$$
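The counting step and the conditional probability can be sketched in plain Python. This is a simplified illustration on a toy corpus: the function names are our own, and every co-occurrence counts as 1 regardless of distance, whereas the GloVe paper weights pairs that are further apart less.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=1):
    """X[i][j] = number of times word j appears within `window` positions of word i."""
    X = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for pos, word in enumerate(tokens):
            lo, hi = max(0, pos - window), min(len(tokens), pos + window + 1)
            for other in range(lo, hi):
                if other != pos:
                    X[word][tokens[other]] += 1
    return X

def conditional_probability(X, i, j):
    """P(j | i) = X_ij / X_i, where X_i is the sum of row i."""
    X_i = sum(X[i].values())
    return X[i][j] / X_i if X_i else 0.0

sentences = [s.split() for s in ["did you see the fly", "the fly will fly with you"]]
X = cooccurrence_counts(sentences, window=1)
print(dict(X["fly"]))                             # {'the': 2, 'will': 2, 'with': 1}
print(conditional_probability(X, "fly", "the"))   # 0.4
```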

Based on this we should be able to infer relations between words:

<center><img src='https://nlp.stanford.edu/projects/glove/images/table.png' width=500></center>

Notice how _solid_ is related to _ice_ but not _steam_, while _gas_ is related to _steam_ but not _ice_ (the ratio between the two conditional probabilities is much larger than 1 in the first case and much smaller than 1 in the second). _Water_ and _fashion_, on the other hand, are either related to both words or to neither, so their ratio stays close to 1.

Some more computation leads to the regression objective that GloVe is trained on. If you want to learn more you can check [the paper](https://aclanthology.org/D14-1162.pdf).

## Using GloVe

We can load a pretrained GloVe model using the gensim library (or other resources):

In [None]:
import gensim.downloader as api

model = api.load("glove-twitter-100")



And use it to look up word embeddings (or to call any of the similarity functions that we saw for Word2Vec):

In [None]:
model['system']

array([ 0.43887 ,  0.32601 , -0.28524 , -0.08248 ,  0.43643 ,  0.75065 ,
        0.093945, -0.72626 ,  0.32297 , -0.37128 , -0.23306 ,  0.35499 ,
       -3.1764  ,  0.015004,  0.69725 , -0.15256 ,  0.025449, -0.058944,
        0.20002 , -0.61298 , -0.79661 ,  0.53051 ,  0.64765 ,  0.90153 ,
       -0.27407 ,  0.52871 ,  0.39344 ,  0.56076 ,  0.31942 ,  0.83347 ,
       -0.53268 , -1.0166  , -0.25328 , -0.17347 ,  0.68794 ,  0.25902 ,
        0.42864 ,  0.3844  , -0.071415, -0.026013, -0.42733 ,  0.58874 ,
       -0.30061 , -0.18357 ,  0.21158 , -0.72648 , -0.48477 ,  0.43527 ,
       -0.37412 , -0.48493 ,  0.26264 ,  0.21684 , -0.8822  ,  0.57925 ,
       -0.54    ,  0.7147  , -0.33133 , -0.44715 , -0.40713 , -0.014364,
       -0.083808,  0.45569 , -0.094374,  0.56057 ,  0.65446 , -0.45768 ,
        0.2522  ,  0.34328 , -0.061001, -0.4899  ,  0.3342  ,  0.41277 ,
       -0.55403 ,  0.30807 ,  0.22867 , -0.53921 ,  0.16439 ,  0.021561,
        0.15131 , -0.70287 ,  1.4152  ,  0.83387 , 

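Or you can train your own model from scratch (this reuses common_texts from the Word2Vec example and needs the third-party glove package to be installed):

In [None]:
from glove import Corpus, Glove

# Build the co-occurrence matrix with a context window of 4 words
corpus = Corpus()
corpus.fit(common_texts, window=4)

# Learn 4-dimensional word vectors from the co-occurrence matrix
glove = Glove(no_components=4, learning_rate=0.1)
glove.fit(corpus.matrix, epochs=10, no_threads=8, verbose=True)
glove.add_dictionary(corpus.dictionary)   # attach the word <-> id mapping
glove.save('glove.model.txt')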

# FastText

The last embedding technique that we will talk about is FastText. FastText, which comes with really nice documentation, also uses Skip-Gram and CBoW (like Word2Vec), but instead of learning each word as a whole, it splits words into character n-grams. This helps the model generalize better, especially with rare words, as it learns prefixes and suffixes along with other short sequences that convey information.

If we choose to split the word _artificial_ in n-grams of size 3 with a padding of 1 (the boundary markers _<_ and _>_ added around the word), the representation will be: _<ar_, _art_, _rti_, _tif_, _ifi_, _fic_, _ici_, _cia_, _ial_, _al>_. We then proceed as with word2vec, except that the vector of a word is built from the vectors of its n-grams. The full explanation is in [the paper](https://aclanthology.org/E17-2068.pdf) and code snippets are in the [documentation](https://fasttext.cc).
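The splitting step can be sketched as follows. The function name `char_ngrams` is our own; FastText itself typically extracts n-grams over a range of sizes rather than a single one, which this sketch does not do.

```python
def char_ngrams(word, n=3):
    """Character n-grams of `word` after wrapping it in '<' and '>' boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("artificial"))
# ['<ar', 'art', 'rti', 'tif', 'ifi', 'fic', 'ici', 'cia', 'ial', 'al>']
```

The boundary markers let the model tell a prefix such as `<ar` apart from the same letters appearing in the middle of a word.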

# Principal Component Analysis (PCA)

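PCA is a dimensionality reduction algorithm: it projects high-dimensional vectors onto the few directions along which the data varies the most, which means that we can use it to visualise our data in 2D or 3D. Here is an example of how you can use it to see the distance between embeddings in 2D: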

In [None]:
from sklearn.decomposition import PCA

text = ['system', 'graph', 'trees', 'user']
embeddings = [model[word] for word in text]

pca = PCA(n_components=2)
pca.fit(embeddings)
vectors_2d = pca.transform(embeddings)

PCA is fitted the same way as any other scikit-learn model (fit, then transform). We can then visualize the results using a plotting library such as matplotlib:

In [None]:
import matplotlib.pyplot as plt

x = [v[0] for v in vectors_2d]
y = [v[1] for v in vectors_2d]

fig, ax = plt.subplots()
ax.scatter(x, y)

for i, txt in enumerate(text):
    ax.annotate(txt, (x[i], y[i]))

plt.show()
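To make the projection step less of a black box, here is a minimal sketch of PCA built on NumPy's singular value decomposition. The function name `pca_project` is hypothetical and the input is random toy data, not real embeddings. The result should agree with scikit-learn's PCA up to the sign of each component.

```python
import numpy as np

def pca_project(vectors, n_components=2):
    """Project row vectors onto their first `n_components` principal components."""
    X = np.asarray(vectors, dtype=float)
    X_centered = X - X.mean(axis=0)   # PCA works on mean-centered data
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(4, 10))   # 4 random "words" with 10 features each
projected = pca_project(toy_embeddings)
print(projected.shape)   # (4, 2)
```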

# Exercises

1. Write your own implementation for Bag of Words from scratch. You should be able to set whether the representation will be binary or frequency-based.
2. Implement your own Tf-idf from scratch. You can use as many helper functions as you want.
3. Create the (context, target) pairs and train a neural network for either skip-gram or continuous bag of words. You should map each word to a unique id and use padding at the beginning and end of the text so that the words at the edges can also be used for training.
4. Visualise the distance between a few words in 2D using PCA (or another dimensionality reduction technique).
5. Compare these embeddings using any means (e.g.: train time, most similar word to X, distances in a 2D space, accuracy with an SVM etc.). Also compare the library versions with your own implementations.
