# What do word embeddings represent?
In this exercise, you will explore what is represented in word embeddings. We will use the Python package gensim and two sets of pre-trained embeddings, which can be downloaded from:

* www.robvandergoot.com/data/embeds/twitter.bin.gz
* www.robvandergoot.com/data/embeds/googlenews.bin.gz

The first set consists of skip-gram embeddings trained with the default settings of word2vec on a collection of 2 billion words from English tweets collected during 2012 and 2018. The second set is trained on 100 billion words from Google News. Both have been truncated to the 500,000 most frequent words. Note that loading each of these embeddings requires approximately 2GB of RAM.
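
The downloads are gzip-compressed, whereas the loading code below reads `twitter.bin`. If you prefer to decompress from Python rather than the command line, the following is a minimal sketch using only the standard library. The helper name `gunzip` is made up for illustration, and the file names assume the archives were saved to the working directory.

```python
import gzip
import shutil

def gunzip(src_path, dst_path):
    """Decompress a .gz file to dst_path, streaming so the file never sits fully in memory."""
    with gzip.open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)

# Example (assumes the archive is in the working directory):
# gunzip('twitter.bin.gz', 'twitter.bin')
```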

The embeddings can be loaded in gensim as follows:

In [2]:
import gensim.models

twitEmbs = gensim.models.KeyedVectors.load_word2vec_format(
                                'twitter.bin', binary=True)
print('loading finished')

loading finished


You can now use the index operator ``[]`` or the function ``get_vector()`` to access the individual word embeddings.

In [2]:
twitEmbs['cat']

array([ 4.64285821e-01,  2.37979457e-01, -4.24226150e-02, -4.35831666e-01,
       -4.06450212e-01, -1.43117514e-02,  1.22334510e-01, -5.59092343e-01,
        1.23332568e-01,  2.36625358e-01,  3.58797014e-02, -9.40739065e-02,
       -2.04128489e-01, -1.81295779e-02, -1.08792759e-01, -2.70818472e-01,
        1.05479717e-01,  1.37095019e-01,  1.79271579e-01,  2.91243941e-01,
       -5.87746739e-01,  2.90462654e-02,  6.89281642e-01, -1.80917114e-01,
       -2.57750720e-01, -2.01395631e-01, -5.16403615e-01,  5.85804135e-03,
       -1.67768478e-01,  2.17095211e-01,  2.22494245e-01,  1.56742647e-01,
       -3.60864878e-01,  3.94283593e-01,  8.04448500e-03,  1.11518592e-01,
       -1.85592070e-01, -1.16088443e-01,  3.24357510e-01,  4.00876179e-02,
        9.14092362e-02, -1.04118213e-01, -6.89513862e-01,  1.54412836e-01,
        4.57625002e-01,  2.55037360e-02, -3.84058757e-03,  7.12698698e-02,
       -2.25590184e-01, -1.96693689e-01, -3.88458431e-01, -2.27625713e-01,
        6.94357634e-01, -

## Word similarities
Cosine similarity can be used to measure how close two word embeddings are; it is defined as:
\begin{equation}
\cos(\vec{a},\vec{b}) = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}| |\vec{b}|} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \sqrt{\sum_{i=1}^{n} b_i^2}}
\end{equation}
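
As a small worked example (the vectors are made up for illustration), take $\vec{a} = (1, 2, 2)$ and $\vec{b} = (2, 1, 2)$:

\begin{equation}
\frac{\vec{a} \cdot \vec{b}}{|\vec{a}| |\vec{b}|} = \frac{1 \cdot 2 + 2 \cdot 1 + 2 \cdot 2}{\sqrt{1^2 + 2^2 + 2^2} \sqrt{2^2 + 1^2 + 2^2}} = \frac{8}{3 \cdot 3} = \frac{8}{9} \approx 0.889
\end{equation}

The corresponding distance is $1 - 8/9 \approx 0.111$. These two vectors make a convenient test case for your own implementation.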

**1. Implement the cosine similarity using pure Python (only the ``math`` package is allowed).** Note that similarity is ``1 - distance``.

You can compare your scores to the gensim implementation to check whether your function is correct. The following two lines should print the same value (up to floating-point rounding):

```
print(1 - twitEmbs.distance('cat', 'dog'))  # gensim reports a distance; convert it to a similarity
print(cosine(twitEmbs['cat'], twitEmbs['dog']))
```


In WordNet, the similarity between two senses can be based on their distance in the taxonomy. The most common metric for this is:

Wu-Palmer Similarity: denotes how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node).

It can be computed in Python like this:

In [6]:
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet

first_word = wordnet.synsets('cat')[0] #0 means: most common sense
second_word = wordnet.synsets('dog')[0]
print('WordNet similarity: ' + str(first_word.wup_similarity(second_word)))

print('Twitter similarity: ' + str(twitEmbs.similarity('cat', 'dog')))


WordNet similarity: 0.8571428571428571
Twitter similarity: 0.8955348
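
To make the Wu-Palmer definition more concrete, here is a minimal sketch on a tiny hand-made taxonomy. It uses the usual formulation, 2 × depth(LCS) / (depth(s1) + depth(s2)), with the root at depth 1. The taxonomy and all function names are hypothetical; real WordNet is much larger and a sense can have several paths to the root, so the numbers will not match the NLTK output.

```python
# Toy taxonomy (hypothetical, not WordNet): child -> parent; the root has parent None.
PARENT = {
    'entity': None,
    'animal': 'entity',
    'carnivore': 'animal',
    'feline': 'carnivore',
    'canine': 'carnivore',
    'cat': 'feline',
    'dog': 'canine',
}

def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))

def least_common_subsumer(a, b):
    """Most specific node that is an ancestor of (or equal to) both a and b."""
    ancestors_of_a = set(path_to_root(a))
    for node in path_to_root(b):  # walks upward, so the first hit is the deepest
        if node in ancestors_of_a:
            return node

def wu_palmer(a, b):
    lcs = least_common_subsumer(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(wu_palmer('cat', 'dog'))  # 2 * 3 / (5 + 5) = 0.6
```

Here the least common subsumer of `cat` and `dog` is `carnivore` (depth 3), and both words sit at depth 5.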




**2. Think of 5 word pairs that are highly similar according to you. Measure the similarity of these pairs in WordNet as well as in the Twitter embeddings and the Google News embeddings. Which method is closest to your own intuition?** (You are allowed to use the gensim implementation of cosine similarity here.)


## Analogies

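Analogies have often been used to demonstrate the power of word embeddings. Analogies have the form ``A :: B : C :: D``. In this setting, A, B and C are usually given, and the fourth term D is retrieved from the embeddings using ``3cosadd``, which picks the vocabulary word d that maximizes the following score (a, b and c denote the embeddings of A, B and C):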

\begin{equation}
\underset{d}{\mathrm{argmax}} (\cos (d, c) - \cos (d, a) + \cos (d, b))
\label{equ:cosadd}
\end{equation}

You can query analogies with gensim:

In [5]:
twitEmbs.most_similar(positive=['woman', 'king'], negative=['man'], 
                                                         topn=10)

[('queen', 0.8401797413825989),
 ('goddess', 0.7309160828590393),
 ('king…', 0.7233694195747375),
 ('princess', 0.715788722038269),
 ('kings', 0.707615852355957),
 ('godess', 0.6952610015869141),
 ('Queen', 0.6902579069137573),
 ('queen,', 0.6876209378242493),
 ('quee…', 0.6856900453567505),
 ('queens', 0.6832401156425476)]
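
The ``3cosadd`` score can also be sketched by hand on a toy vocabulary. The vectors below are invented for illustration and have unit length, so the dot product equals the cosine similarity; with real embeddings you would pass your own cosine function as `sim`. The function name `three_cos_add` is hypothetical, and the query words are skipped as candidates, which is common practice for analogy queries.

```python
def dot(u, v):
    """Dot product; equals cosine similarity when both vectors have unit length."""
    return sum(x * y for x, y in zip(u, v))

def three_cos_add(a, b, c, vectors, sim=dot, topn=3):
    """Rank candidates d by sim(d, c) - sim(d, a) + sim(d, b), skipping a, b and c."""
    scores = []
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        score = sim(vec, vectors[c]) - sim(vec, vectors[a]) + sim(vec, vectors[b])
        scores.append((word, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hand-made unit-length toy vectors (not real embeddings).
TOY = {
    'man':    (0.0, 1.0, 0.0),
    'woman':  (0.0, 0.0, 1.0),
    'king':   (0.6, 0.8, 0.0),
    'queen':  (0.6, 0.0, 0.8),
    'prince': (0.8, 0.6, 0.0),
    'apple':  (1.0, 0.0, 0.0),
}

# man :: king : woman :: ?
print(three_cos_add('man', 'king', 'woman', TOY))
```

On this toy vocabulary the top-ranked candidate is `queen`, mirroring the gensim query above.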

``3cosadd`` can be used to solve semantic as well as syntactic analogies:

| Semantic            |                                      |
|---------------------|--------------------------------------|
| Country-capital     | Denmark :: Copenhagen : England :: X |
| Family-relations    | boy :: girl : he :: X                |
| Object-color        | sky :: blue : grass :: X             |

| Syntactic           |                                      |
|---------------------|--------------------------------------|
| Comparatives        | nice :: nicer : good :: X            |
| Present-past tense  | work :: worked : drink :: X          |
| Country-nationality | Brazil :: Brazilian : Denmark :: X   |


Try the analogies from the tables. Is the correct answer returned for all queries? If not, is the correct answer at least ranked high?
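
To check how high the expected answer is ranked, a small helper can scan the list of `(word, score)` pairs returned by an analogy query. This is only a sketch: the name `rank_of` is hypothetical, and the example list is copied from the first entries of the king/woman/man output shown earlier.

```python
def rank_of(expected, results):
    """1-based rank of `expected` in a list of (word, score) pairs, or None if absent."""
    for rank, (word, _score) in enumerate(results, start=1):
        if word == expected:
            return rank
    return None

# First entries of the output of the king/woman/man query shown earlier.
results = [
    ('queen', 0.8401797413825989),
    ('goddess', 0.7309160828590393),
    ('king…', 0.7233694195747375),
    ('princess', 0.715788722038269),
]

print(rank_of('queen', results))     # 1
print(rank_of('princess', results))  # 4
print(rank_of('empress', results))   # None
```

Keep in mind that the match is exact, so variants such as `Queen` or `queen,` count as different words.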

**1. Think of another category of semantic analogies that might be encoded in the embeddings and test this empirically by thinking of 5 example analogies. Which embeddings are better at predicting your category (Twitter versus news)?**

**2. Think of another category of syntactic analogies that might be encoded in the embeddings and test this empirically by thinking of 5 example analogies. Which embeddings are better at predicting your category (Twitter versus news)?**
