# LAB2.2: Word embeddings for different languages

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

In this notebook, we introduce you to word embeddings. Word embeddings are vector representations for words learned by a neural network to predict words that occur in their context. The weights applied to make predictions eventually form the vector that acts as a representation of a word. Typically, vectors have between 100 and 500 dimensions (weights).

For learning these representations a large corpus of text is needed with sufficient occurrences of words in different contexts. Each context word counts as a correct or positive example of a word that needs to be predicted. As negative examples of words that do not occur in the context, random words are chosen.

The learning consists of making the weights of the target word more similar to the weights of the correct context words and less similar to the weights of the incorrect (random) context words. The learning starts with a random initialisation of the weights, but the weights are gradually updated as more and more training cases are seen.

For example, given the following texts, the words ```dog```, ```cat```, and ```mouse``` will get rather similar weights because they partially occur in each other's contexts and partially share other context words such as ```chased``` and ```ate```:

```
The dog chased the cat.
The cat chased the mouse.
The mouse ate the cheese.
The dog ate a bone.
The cat ate the fish.
```

They also differ a little bit because they eat different things. Imagine doing this for hundreds of thousands of documents and tens of thousands of words. It will create a large semantic space, as we have seen for WordNet, in which related words are positioned close to each other. The difference from WordNet is that this space is derived empirically from texts and not from human judgements.
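To make the idea of shared contexts concrete, here is a minimal counting sketch over the five toy sentences. It is not the neural training procedure itself: it only collects which words appear near each target word. The window size of 2 and the helper name `context_words` are assumptions made for this illustration.

```python
from collections import defaultdict

sentences = [
    "the dog chased the cat",
    "the cat chased the mouse",
    "the mouse ate the cheese",
    "the dog ate a bone",
    "the cat ate the fish",
]

def context_words(sentences, window=2):
    """Map each word to the set of words seen within `window` positions of it."""
    contexts = defaultdict(set)
    for sentence in sentences:
        tokens = sentence.split()
        for i, target in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    contexts[target].add(tokens[j])
    return contexts

contexts = context_words(sentences)
# Context words that "dog" and "cat" have in common.
shared = contexts["dog"] & contexts["cat"]
print(sorted(shared))
```

Words with a large overlap in context words are the ones that a trained model would be expected to place close together.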

![word-embedding](./images/word-embedding.png)

Screen dump taken from: https://projector.tensorflow.org

Two major disadvantages of word-embeddings are: 1) different meanings of words get conflated (e.g. mouse) and 2) the vectors represent ```relatedness``` and not the precise semantic relations from WordNet.

One of the technical advantages of representing word meanings as vectors is that any two word vectors can be compared and the comparison always yields a score, because the vectors are dense: every word has a value for every dimension. Vectors match most strongly when words occur in similar contexts. However, words that do not have a representation in the model, such as domain-specific terminology, cannot be compared at all.

**Reference**: Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality." Advances in neural information processing systems 26 (2013).


In this notebook, we are going to load prebuilt word embeddings and learn how to explore them. Although there are many packages and data sets with embeddings, we focus on publicly available and trainable embeddings, especially for multiple languages. Concretely, we will use embeddings from:

* [Wikipedia2Vec](https://wikipedia2vec.github.io/wikipedia2vec/pretrained/)
* [Fasttext](https://github.com/facebookresearch/fastText/blob/master/docs/crawl-vectors.md)

Wikipedia2Vec provides word embeddings for 9 languages built from the Wikipedia pages in these languages. Fasttext, developed by Facebook, provides embeddings for 157 languages built from web data collected by the [Common Crawl project](https://commoncrawl.org).

The embeddings are created with different Python packages, but they are also available in a common Word2Vec text format. This makes it possible to load the embeddings regardless of how they were created. We will use the *Gensim* package to load the data.

## 1. Gensim: a package for handling embedding models

Gensim is a free, open-source Python package that is often used for building and using word embeddings:

* https://radimrehurek.com/gensim/#
* https://radimrehurek.com/gensim/auto_examples/index.html#core-tutorials-new-users-start-here

You need to install Gensim version 4 or higher.

To install Gensim on your local machine from the command line do:

    * conda install -c conda-forge gensim
    
    OR
    
    * pip install gensim


In [2]:
#!conda install -c conda-forge gensim
#!pip install gensim

If gensim is successfully installed, you can use existing models in text (txt) format that are compatible with gensim. To do so, we first import gensim and the package *numpy*, which is often used to represent vectors and arrays.

In [1]:
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import numpy as np

## 2. Word embeddings built from Wikipedia

Wikipedia is often used to build and test language models. The reasons are obvious: it is large, freely available and there are Wikipedias in many languages, often also linked to each other and partly covering similar content.

You can download pre-trained models in various languages from: https://wikipedia2vec.github.io/wikipedia2vec/pretrained/

There are different variants trained for 100, 300, and even 500 dimensions. If your computer has limited capacity, it is better to start with the 100 dimensions. For this notebook, we will download enwiki_20180420_100d.txt.bz2, which is a compressed version of the 100 dimensions embeddings model built from the English Wikipedia. You need to decompress the "txt.bz2" file to a file with the extension ".txt", using a  file (de)compression utility that can handle "bz2" format.

It is interesting to peek into the file that you have downloaded. It is too big to load it in a text editor so we will try to print part of the file to the screen of the terminal.

We can use the command line commands **cat** (use **type** on Windows) and **more** to inspect the beginning of the text file to see what it contains. Make sure it is decompressed.

Specify the path to your own download after the command and do not forget to add " | more" to the end, 
otherwise it will list a gigabyte large file in your notebook. We only want to inspect the beginning.

In [3]:
%cat /Users/piek/Desktop/t-ONDERWIJS/data/word-embeddings/wiki2vec/enwiki_20180420_100d.txt | more

4530030 100
the -0.0811 0.5145 0.0368 0.0544 -0.0662 0.2121 0.1636 -0.1011 0.0607 -0.1858 -0.1965 -0.0595 0.0703 -0.1013 0.3130 -0.1717 0.0847 -0.1222 0.1024 0.0753 -0.1384 0.0435 -0.0371 0.1932 -0.1226 -0.2227 -0.1530 -0.2890 0.2371 0.2699 0.2693 -0.1666 0.0240 0.1053 -0.1475 -0.3232 0.0236 -0.2056 0.2847 -0.2817 0.1197 0.0314 -0.1215 0.0782 -0.2850 -0.1316 -0.0844 0.1483 -0.2192 -0.0462 -0.2151 0.0582 0.0372 0.0127 -0.3074 -0.1582 -0.1393 0.0361 -0.2519 -0.0305 -0.1532 -0.0286 -0.0955 0.3037 0.5632 -0.1120 -0.0319 -0.2223 -0.2612 -0.2254 -0.1593 0.1807 0.1205 0.3695 -0.2652 -0.0490 -0.2556 0.0130 -0.0898 0.0322 0.0021 -0.2692 0.3129 0.0179 0.3913 0.5415 -0.0049 0.0884 0.1605 0.0878 0.0004 0.1465 0.1872 0.0521 -0.1492 -0.0882 0.1696 0.1894 -0.0866 0.1184
in 0.1245 0.4200 0.2936 0.0924 -0.0669 0.0252 0.1407 -0.0729 0.0680 -0.2951 -0.2720 0.0785 0.0780 0.0248 0.0427 -0.1497 0.1013 -0.0257 0.0364 0.2647 0.0330 0.1047 0.0382 0.0138 -0.0162 -0.0733 0.0960 -0.2090 0.0561 0.1030 0.2898 -0.19

The *cat* command remains active until you stop the cell in the notebook (see the ```[*]``` before the cell). Please stop running the cell manually (the square next to the play symbol in the top menu of the notebook) to proceed.

If things worked out, you see the beginning of the embedding text file. It starts with two numbers on the first line.
In my case these are "4530030" and "100". The first is the size of the vocabulary and the second is the number of dimensions.

Each of the following lines shows a word from the learned vocabulary followed by a list of 100 numbers, which are the weights for the dimensions that form the embedding. The first of these lines shows the word "the" followed by its 100 numbers. The rest of the file consists of one line per word with its embedding. This is the so-called word2vec format for representing word embeddings.
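If you prefer not to leave a cell running, the same peek can be done in plain Python by reading only the first few lines. The sketch below is one possible way to do this; the helper name `peek` and the tiny stand-in file (with made-up 3-dimensional values) are assumptions for illustration, so replace `sample_path` with the path to your own download.

```python
import itertools
import os
import tempfile

def peek(path, n_lines=3, max_chars=60):
    """Return the first `n_lines` lines of a text file, each cut to `max_chars`."""
    with open(path, encoding="utf-8") as handle:
        first_lines = itertools.islice(handle, n_lines)
        return [line.rstrip("\n")[:max_chars] for line in first_lines]

# Tiny stand-in file in word2vec text format (hypothetical values, 3 dimensions).
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(sample_path, "w", encoding="utf-8") as handle:
    handle.write("2 3\nthe -0.0811 0.5145 0.0368\nin 0.1245 0.4200 0.2936\n")

for line in peek(sample_path):
    print(line)
```

Because only the requested lines are read, this returns immediately even for a file of several gigabytes.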

Depending on the package used, embeddings may be represented differently in memory but it always boils down to three components:

* vocabulary: a list of observed words
* matrix of learned weights as word vectors
* mapping from the words in the vocabulary to the vectors in the matrix

A vector is a list of numbers, such that each number is a value for a dimension in an n-dimensional space. If you have a list of m such vectors, you have a matrix of n columns and m rows. Each row corresponds to the vector of a word in the vocabulary.

A matrix of 3 rows and 3 columns:
```
[[.34, .56, .12],
 [.12, .39, .05],
 [.78, .37, .01]]
```

The vocabulary with the word as a key and the matrix (list of lists) index that points to the row with the embedding for the word:

```{"dog": 0, "cat" : 1, "car" : 2}```

For this data, a simple lookup function for *dog* will give the embedding *[.34, .56, .12]*.
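The three components can be written down directly in plain Python. This is only a sketch of the idea, not how Gensim stores its data; the function name `lookup` is made up for this example.

```python
# Mapping from each word to the index of its row in the matrix.
vocabulary = {"dog": 0, "cat": 1, "car": 2}

# One row per word, one column per dimension.
matrix = [
    [0.34, 0.56, 0.12],
    [0.12, 0.39, 0.05],
    [0.78, 0.37, 0.01],
]

def lookup(word):
    """Return the embedding row for `word`, or None if the word is unknown."""
    index = vocabulary.get(word)
    return None if index is None else matrix[index]

print(lookup("dog"))
print(lookup("tjalk"))  # not in this toy vocabulary
```

Returning `None` for unknown words is a design choice of this sketch; a library may instead raise an error for a word without a representation.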

Now let's see how this is implemented in GenSim for the Wikipedia derived word embedding model.

## 2.1 Loading a model in Gensim

We will now load the downloaded embedding file using the gensim class "KeyedVectors", which has a function "load_word2vec_format" that can load such a text file. As a parameter, you give it the path to your local file. As an additional parameter you can set a limit for how many words should be read. If you have limited memory on your computer it is wise to set a limit. Note that with a limit only the first words in the file are loaded into memory, so words further down the file will be missing from the model.

In [2]:
# Path to the local copy of a model built from wikipedia
MODEL_FILE='/Users/piek/Desktop/t-ONDERWIJS/data/word-embeddings/wiki2vec/enwiki_20180420_100d.txt'

### If you have a small computer you may want to limit the number of embeddings loaded as shown below:
## wiki2vec = KeyedVectors.load_word2vec_format(MODEL_FILE, limit=5000) 

### To load the full model you should drop the limit.
# Loading the full model can take a while.
wiki2vec = KeyedVectors.load_word2vec_format(MODEL_FILE)
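
To illustrate what a limit does, the following sketch reads a word2vec text file by hand and stops after the first `limit` words. It is a simplified stand-in written for this explanation, not Gensim's implementation; the function name `read_word2vec_text` and the tiny sample file are assumptions.

```python
import os
import tempfile

def read_word2vec_text(path, limit=None):
    """Read a word2vec-format text file into (key_to_index, vectors)."""
    key_to_index, vectors = {}, []
    with open(path, encoding="utf-8") as handle:
        # Header line: vocabulary size and number of dimensions.
        vocab_size, dimensions = (int(x) for x in handle.readline().split())
        n_rows = vocab_size if limit is None else min(limit, vocab_size)
        for _ in range(n_rows):
            parts = handle.readline().rstrip().split(" ")
            word, values = parts[0], [float(x) for x in parts[1:]]
            if len(values) != dimensions:
                raise ValueError(f"expected {dimensions} values for {word!r}")
            key_to_index[word] = len(vectors)
            vectors.append(values)
    return key_to_index, vectors

# Tiny stand-in file with made-up 2-dimensional values.
toy_path = os.path.join(tempfile.mkdtemp(), "toy.txt")
with open(toy_path, "w", encoding="utf-8") as handle:
    handle.write("3 2\nthe 0.1 0.2\nin 0.3 0.4\nof 0.5 0.6\n")

key_to_index, vectors = read_word2vec_text(toy_path, limit=2)
print(key_to_index)
print(vectors)
```

With `limit=2` the third word ("of") is never read, which is why a limited model can miss words that the full file does contain.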

Let's see what type of object **gensim** created by loading the data and what the properties and functions are:

In [10]:
print(type(wiki2vec))
print(dir(wiki2vec))

<class 'gensim.models.keyedvectors.KeyedVectors'>
['__class__', '__contains__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_adapt_by_suffix', '_load_specials', '_log_evaluate_word_analogies', '_save_specials', '_smart_save', '_upconvert_old_d2vkv', '_upconvert_old_vocab', 'add_lifecycle_event', 'add_vector', 'add_vectors', 'allocate_vecattrs', 'closer_than', 'cosine_similarities', 'distance', 'distances', 'doesnt_match', 'evaluate_word_analogies', 'evaluate_word_pairs', 'expandos', 'fill_norms', 'get_index', 'get_mean_vector', 'get_normed_vectors', 'get_vecattr', 'get_vector', 'has_index_for', 'index2entity', 'index2word', 'index_to_key', 'init_sims', 'intersect

So "wiki2vec" is an instance of a **KeyedVectors** object.  A KeyedVectors object is essentially a mapping from keys to vectors. Each vector is identified by its lookup key, most often a short string token.

One of the functions is the "load_word2vec_format" function we used to load the data from a file, but we also see interesting functions and attributes such as: 'get_vector', 'cosine_similarities', 'distance', 'most_similar', 'key_to_index'. We will look into a few of them below.

### 2.2 The vocabulary

The model has a dictionary that contains words. Let's check how big the vocabulary is of the model derived from English Wikipedia. Note that you will have a different result if you loaded only part of the file:

In [11]:
# Show some properties of our model. Notice these are also in the text file.
print('Vector size =', wiki2vec.vector_size)
print('Vocabulary size =', len(wiki2vec.key_to_index), len(wiki2vec))

Vector size = 100
Vocabulary size = 4530030 4530030


Four and a half million words are present in this model. That is a lot more than in the English WordNet. Let's check if the word "man" is in our model:

In [12]:
print("Is there a man?")
"man" in wiki2vec.key_to_index

Is there a man?


True

It seems unlikely that the model contains a culturally specific Dutch word such as "tjalk":

In [13]:
print("Is there a tjalk")
"tjalk" in wiki2vec.key_to_index

Is there a tjalk


True

Wikipedia is rich, so this one is in there too. Let's go wild:

In [14]:
print("Is there a tjalk")
"Is there a tjalk" in wiki2vec.key_to_index

Is there a tjalk


False

Sure, whole sentences are not in there. We can also get the vocabulary as a list:

In [15]:
vocabulary = wiki2vec.key_to_index
print('The model vocabulary is represented internally as a...')
print(type(vocabulary))
print((list(vocabulary))[:50])

The model vocabulary is represented internally as a...
<class 'dict'>
['the', 'in', 'of', 'a', 'and', 'is', 'to', 'was', 'by', 'for', 'on', 'as', 'at', 'from', 'with', 'an', 'it', 'that', 'also', 'which', 'first', 'this', 'has', 'he', 'one', 'his', 'are', 'after', 'who', 'were', 'two', 'its', 'new', 'be', 'or', 'but', 'had', 'their', 'been', 'born', 'not', 'other', 'all', 'have', 'during', 'time', 'when', 'may', 'they', 'into']


So here are the first 50 items in the list. As you can see these are very common and frequent English words.

## 2.3  The embeddings

We can now get the embedding representation for a specific word that is in the vocabulary using the **get_vector** function:

In [3]:
print('Information stored in the vocabulary for the word "man":')
man_vector = wiki2vec.get_vector('man')
print(type(man_vector))
print('Distributional meaning of "man" in Wikipedia:', man_vector)

Information stored in the vocabulary for the word "man":
<class 'numpy.ndarray'>
Distributional meaning of "man" in Wikipedia: [-2.377e-01  2.686e-01 -9.620e-02  2.707e-01 -2.241e-01 -2.489e-01
  1.065e-01  4.120e-02 -5.349e-01 -1.445e-01 -8.700e-02 -1.877e-01
  1.985e-01 -1.643e-01  1.021e-01 -1.783e-01 -5.520e-02  2.190e-02
 -2.180e-01  1.569e-01 -2.835e-01 -3.299e-01 -6.780e-02  3.505e-01
 -3.241e-01 -9.000e-04 -1.234e-01 -3.452e-01 -4.523e-01  7.449e-01
  1.470e-01 -1.258e-01 -1.073e-01  4.019e-01  1.120e-01  2.230e-02
 -3.720e-01  2.026e-01  3.160e-02  2.910e-02 -2.406e-01  1.368e-01
 -1.750e-02  1.020e-01  8.340e-02  5.012e-01 -3.973e-01  4.010e-02
 -1.653e-01 -1.892e-01 -1.441e-01  6.290e-02 -5.185e-01 -2.638e-01
  3.170e-02 -6.030e-02  1.012e-01 -5.408e-01 -3.528e-01 -1.281e-01
 -2.617e-01 -2.607e-01 -9.150e-02  3.094e-01  4.468e-01 -2.526e-01
 -2.842e-01 -9.303e-01 -3.270e-02 -4.669e-01  4.064e-01  2.045e-01
  2.223e-01  2.501e-01 -4.577e-01  4.089e-01  1.261e-01 -2.000e-01

We get exotic numbers for 100 dimensions. What else did you expect? 

A vector is an ordered list of numbers, each representing a dimension. These numbers are the weights learned by the neural network when predicting the context words of 'man'.

In [None]:
print('Information stored in the vocabulary for the word "planet":')
planet_vector = wiki2vec.get_vector('planet')
print(type(planet_vector))
print('Distributional meaning of "planet" in Wikipedia:', planet_vector)

In [4]:
print('Information stored in the vocabulary for the word "moon":')
moon_vector = wiki2vec.get_vector('moon')
print(type(moon_vector))
print('Distributional meaning of "moon" in Wikipedia:', moon_vector)

Information stored in the vocabulary for the word "moon":
<class 'numpy.ndarray'>
Distributional meaning of "moon" in Wikipedia: [-0.2441  0.5476  0.0562  0.2461 -0.443  -0.6451 -0.6465  0.4212 -0.2495
 -0.2584  0.006  -0.5034  0.3762  0.1306  0.567   0.0217 -0.1584 -0.267
  0.0808 -0.075   0.2084  0.1155 -0.2967  0.683  -0.2611 -0.2676 -0.151
 -0.4972 -0.2294  0.6013  0.1498  0.2122 -0.251   0.231   0.1455  0.1899
 -0.3329 -0.3867  0.3709  0.5036 -0.1328 -0.1343 -0.3898  0.3202 -0.561
  0.5059 -0.1499 -0.0366 -0.1718 -0.2711  0.4232  0.3581 -0.3477 -0.6105
  0.1944 -0.072  -0.218  -0.1459  0.0701 -0.4258  0.2116 -0.3633  0.067
  0.497   0.455  -0.2285 -0.3548 -0.4014 -0.1933 -0.161  -0.0965 -0.4369
  0.1491  0.3831 -0.5052 -0.0037 -0.0218 -0.2246  0.4748  0.0144 -0.0223
  0.1255  0.3319  0.0997  0.6083  0.2705 -0.0103 -0.1101  0.3426 -0.1815
 -0.0199 -0.0199  0.0554  0.2709 -0.5822 -0.2304 -0.0056  0.2128 -0.6797
  0.5848]


In [None]:
print('Information stored in the vocabulary for the word "star":')
star_vector = wiki2vec.get_vector('star')
print(type(star_vector))
print('Distributional meaning of "star" in Wikipedia:', star_vector)

In [5]:
planet_sim = wiki2vec.similar_by_word("planet", topn=10)
print(planet_sim)

[('planets', 0.8699600100517273), ('planetoid', 0.8053288459777832), ('moons', 0.7875239849090576), ('4546b', 0.7842745780944824), ('earth', 0.7779384851455688), ('mamango', 0.7761229872703552), ('g889', 0.7746922969818115), ('orbiting', 0.774646520614624), ('comporellon', 0.7712626457214355), ('terraforms', 0.7709923982620239)]


In [6]:
moon_sim = wiki2vec.similar_by_word("moon", topn=10)
print(moon_sim)

[('moons', 0.7733571529388428), ('alofmethbin', 0.7676382660865784), ('달이', 0.7436352372169495), ('gomrath', 0.7427719235420227), ('earth', 0.7382783889770508), ('sun', 0.7344821691513062), ('nightspirit', 0.7337092757225037), ('dubwitch', 0.7302441000938416), ('youngme', 0.7280471324920654), ('piagrace', 0.7265596985816956)]


In [7]:
star_sim = wiki2vec.similar_by_word("star", topn=10)
print(star_sim)

[('stars', 0.8460984230041504), ('tvpool', 0.7771487236022949), ('fanfilms', 0.740755558013916), ('myfiba', 0.724956750869751), ('60pxsilver', 0.7237366437911987), ('droidworks', 0.714786946773529), ('brighest', 0.7104032635688782), ('tsunkatse', 0.7095942497253418), ('mbongwana', 0.7007470726966858), ('zvjezdane', 0.6999683976173401)]


Note that the data object is of the type ```numpy.ndarray```. This is a special array defined in the **numpy** package. Numpy is a package for efficiently processing numerical data and is used a lot in machine learning. For those interested, here is the description:

https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html


How many dimensions do we have in this vector?

In [17]:
print('Numpy data shape', man_vector.shape)
print('Number of vector dimensions:', len(man_vector))

Numpy data shape (100,)
Number of vector dimensions: 100


Not a surprise: we loaded a model with 100 dimensions based on a layer with 100 weights. This is true for all words so also for 'dog'.

In [18]:
vector4dog = wiki2vec['dog']

In [19]:
len(vector4dog)

100

In [20]:
print(vector4dog)

[-0.013   0.6476  0.1045 -0.3117  0.1475  0.0809 -0.1523  0.265  -0.6462
 -0.2058  0.1564 -0.2072  0.4179  0.0386 -0.0194 -0.2241  0.2227 -0.3452
 -0.426   0.1028 -0.2136 -0.0267  0.1946  0.3652 -0.2265 -0.2736  0.0326
 -0.0279 -0.2359  0.5077  0.3759 -0.2207 -0.0506  0.7909  0.1344 -0.079
 -0.4099  0.1559 -0.0066  0.1236 -0.5474 -0.0877 -0.3738 -0.253  -0.4688
 -0.1184 -0.0501  0.3267 -0.1799 -0.2662  0.0968  0.2891 -0.4816 -0.3374
  0.2488  0.1744  0.0889 -0.1873 -0.3312 -0.1903  0.0547 -0.6149 -0.427
 -0.1079  0.137  -0.1445  0.0521 -0.5711 -0.3859 -0.6626  0.2417 -0.0141
  0.3974  0.1331 -0.6726 -0.227   0.1793  0.2454  0.1545 -0.0923 -0.0247
 -0.4611 -0.1317 -0.2194  0.2143  0.491  -0.2186  0.2463  0.0843  0.1324
 -0.4565  0.008   0.6242 -0.0217 -0.084  -0.4722  0.1191  0.3299 -0.9191
  0.0963]


### Similarity

Because the representations are compatible across the words, we can compare two vector representations through the cosine similarity function:

![Cosine similarity](./images/cosine-full.png "Cosine similarity formula")

So suppose we have two vectors A and B, each with 100 slots. This formula (taken from the Wikipedia page) tells you to sum the results of multiplying each slot across A and B:

A[0]\*B[0] + A[1]\*B[1] + ... + A[99]\*B[99]

We divide this sum by the length of A (the square root of the sum of the squared slots of A) multiplied by the length of B (the square root of the sum of the squared slots of B). Dividing it that way normalises the value to the range from -1 to 1, so that the score reflects the angle between the two vectors and not their lengths.
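The cosine computation can be written out in a few lines of plain Python. This is a sketch for illustration (the function name `cosine_similarity` is chosen here, and libraries use optimised numerical routines instead). Note that the denominator uses the square root of the sum of the squared values of each vector, and that the result lies between -1 and 1.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Two vectors pointing in the same direction give the highest score regardless of their length, which is why cosine similarity is a convenient measure for comparing embeddings.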

Embedding software uses such measures to obtain the most similar words. We can now use the *similar_by_word* function from Gensim to ask for the words that are most similar:

In [22]:
dog_sim = wiki2vec.similar_by_word("dog", topn=50)
print(dog_sim)

[('dogs', 0.8637314438819885), ('cat', 0.8286387324333191), ('puppy', 0.8150733709335327), ('rabbit', 0.8042253851890564), ('montarges', 0.7981182932853699), ('poodle', 0.7949738502502441), ('barfy', 0.7915432453155518), ('cockapoo', 0.7834540009498596), ('pekapoos', 0.7828510999679565), ('pollicle', 0.7826345562934875), ('hound', 0.782395601272583), ('cow', 0.7820497155189514), ('potcake', 0.7794346809387207), ('dweeble', 0.7784105539321899), ('animohš', 0.7776063084602356), ('qimugta', 0.7706390023231506), ('purries', 0.7704073786735535), ('fluppy', 0.7676477432250977), ('rottweiler', 0.7663619518280029), ('retriever', 0.765354573726654), ('puppies', 0.7636122107505798), ('otterhound', 0.7632703185081482), ('labradoodle', 0.7611411213874817), ('dachshund', 0.7609491944313049), ('pig', 0.7596056461334229), ('fuchsl', 0.7588311433792114), ('yorkipoo', 0.7583247423171997), ('maltipoo', 0.7578954100608826), ('löwchen', 0.7569964528083801), ('sheepdog', 0.7556124329566956), ('malamute', 0

Have a look at this list and compare it with the *dogs* we got from the WordNet hierarchy. Which is richer, which is more precise?

The **similarity** function directly gives the cosine similarity score across the vectors of two words:

In [28]:
print(wiki2vec.similarity("dog", "cat"))
print(wiki2vec.similarity("dog", "coffee"))

0.8286388
0.4113313


We can also combine sets of vectors by providing a list of positive words:

In [30]:
print(wiki2vec.most_similar(positive=['dog', 'song'], topn=50))

[('dogggz', 0.825171947479248), ('souljers', 0.8060130476951599), ('hampsterdance', 0.8026869297027588), ('tweetit', 0.7962952852249146), ('pollicle', 0.7949056029319763), ('crazy', 0.7939842939376831), ('skumfuk', 0.7907962203025818), ('trashville', 0.7886988520622253), ('montarges', 0.7879376411437988), ('sugalumps', 0.7846031785011292), ('muppaphones', 0.7845421433448792), ('미친', 0.7824394702911377), ('swampblood', 0.7788825631141663), ('chooo', 0.7775618433952332), ('popdance', 0.7752609848976135), ('spacewalkin', 0.7737489342689514), ('squaredance', 0.773613691329956), ('unshockable', 0.7721720933914185), ('megachic', 0.771091639995575), ('octopad', 0.7708858251571655), ('wheresthelove', 0.7708709239959717), ('rabbittland', 0.7700871825218201), ('breaktek', 0.76993328332901), ('woggy', 0.7697234749794006), ('phidelity', 0.7695780992507935), ('부르는', 0.7692602872848511), ('fucky', 0.7685251235961914), ('donque', 0.7681090235710144), ('sinequanon', 0.7679200768470764), ('poopie', 0.7

By combining the vectors of **dog** and **song**, we created a new derived vector, a cocktail of both. Next, getting the most similar to this cocktail, we get a very different list of rap-related concepts.

Another operation is to get the word that matches the **least** with others, as in an odd-one-out task:

In [31]:
print(wiki2vec.doesnt_match(['fire', 'water', 'land', 'sea', 'air', 'dog']))

dog
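
One plausible way to find the odd one out is to average the length-normalised vectors of all words and pick the word whose vector is least similar to that average. The sketch below follows this idea on made-up 2-dimensional vectors; it is an assumption about the general approach and not a copy of Gensim's code, and the names `unit` and `odd_one_out` are invented for this example.

```python
import math

def unit(vector):
    """Scale a vector to length 1."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector]

def odd_one_out(vectors_by_word):
    """Return the word whose unit vector is least similar to the mean vector."""
    units = {word: unit(vec) for word, vec in vectors_by_word.items()}
    dims = len(next(iter(units.values())))
    mean = [sum(u[d] for u in units.values()) / len(units) for d in range(dims)]
    # Dot product with the mean: low score means far from the group.
    scores = {word: sum(x * m for x, m in zip(u, mean)) for word, u in units.items()}
    return min(scores, key=scores.get)

# Hypothetical toy vectors: three "element" words and one animal.
toy = {
    "fire": [0.9, 0.1],
    "water": [0.8, 0.2],
    "sea": [0.85, 0.15],
    "dog": [0.1, 0.9],
}
print(odd_one_out(toy))
```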


Finally, we can add and subtract vectors to obtain a target vector in space, assuming that there is parallelism in the semantic space.

In [34]:
print(wiki2vec.most_similar(positive=["dog"], negative=["song"], topn=10))

[('canine', 0.5311753153800964), ('ENTITY/Guard_dog', 0.48210182785987854), ('dogs', 0.4805601239204407), ('ENTITY/Dog_bite_prevention', 0.4798755943775177), ('tollers', 0.47770798206329346), ('ENTITY/Kangaroo', 0.466862291097641), ('ENTITY/Greyhound_adoption', 0.46052291989326477), ('ENTITY/Housebreaking', 0.45895230770111084), ('ENTITY/Hunting_dog', 0.45722559094429016), ('ENTITY/Fox_Terrier', 0.4570063352584839)]


The most famous example in this respect is not about dogs and cats but about kings and queens (see the original Word2Vec paper Mikolov et al 2013). By adding **woman** to **king** and subtracting **man**, you should end up somewhere in semantic space close to **queen**.

In [36]:
print(wiki2vec.most_similar(positive=["king", "woman"], negative=["man"], topn=10))

[('queen', 0.830649197101593), ('monarch', 0.7416260838508606), ('ENTITY/Queen_consort', 0.7348716855049133), ('laungshe', 0.7347308993339539), ('regnant', 0.7243735194206238), ('chelna', 0.7236213088035583), ('consort', 0.720160722732544), ('indlovukati', 0.7181541919708252), ('kamamalu', 0.7178552150726318), ('indlovukazi', 0.714848518371582)]


Note that Wikipedia2Vec uses the special prefix ```ENTITY/``` for items that are actual hyperlinked names. This is an idiosyncrasy of Wikipedia2Vec.
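The king/queen arithmetic can be sketched on hand-made vectors. The two dimensions below (roughly "royalty" and "maleness") and all values are invented for illustration, and the helper `analogy` works on raw vectors, whereas a library such as Gensim may normalise the vectors before combining them.

```python
import math

# Hypothetical 2-dimensional toy vectors: (royalty, maleness).
toy_vectors = {
    "king": [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man": [0.1, 0.9],
    "woman": [0.1, 0.1],
    "apple": [0.1, 0.5],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def analogy(positive, negative, vectors):
    """Add the positive vectors, subtract the negative ones, return the closest other word."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    # The input words themselves are excluded from the candidates.
    excluded = set(positive) | set(negative)
    candidates = [word for word in vectors if word not in excluded]
    return max(candidates, key=lambda word: cosine(target, vectors[word]))

best = analogy(positive=["king", "woman"], negative=["man"], vectors=toy_vectors)
print(best)
```

In this toy space the arithmetic lands on the vector for "queen"; in a real model the result is only approximately there, which is why we ask for the nearest neighbours.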

The Word2Vec analogy example has been criticized because **king** and **queen** are already very close without doing anything. Nevertheless, adding and subtracting did move **queen** closer: it moved from position 12 to 1 and scored 0.83 instead of 0.76 due to our vector manipulations:

In [3]:
print(wiki2vec.similarity("king", "queen"))
print(wiki2vec.most_similar("king", topn=50))

0.7570435
[('meldryn', 0.7807800769805908), ('domnal', 0.7773880362510681), ('buwanekabahu', 0.771612286567688), ('boromakot', 0.7671722173690796), ('borommatrailokanat', 0.764298677444458), ('silamegha', 0.763579249382019), ('parâkramabâhu', 0.7631061673164368), ('kings', 0.7624024152755737), ('monarch', 0.7603241205215454), ('haraald', 0.7582810521125793), ('saysethathirath', 0.7575438618659973), ('queen', 0.7570434808731079), ('13281350', 0.754815399646759), ('15721610', 0.7541301846504211), ('14061454', 0.7522280812263489), ('19101925', 0.7507657408714294), ('panduvasudeva', 0.7499051094055176), ('bhadrasena', 0.7497343420982361), ('13801422', 0.7493346929550171), ('13901406', 0.749112606048584), ('malikum', 0.7482700347900391), ('borommarachathirat', 0.7469587326049805), ('perekule', 0.7465206384658813), ('duttabaung', 0.7459855079650879), ('pālaka', 0.7454712390899658), ('hiranyakasyapa', 0.7443112730979919), ('suliyavongsa', 0.7442653775215149), ('15471553', 0.743290901184082), 

#### Disambiguating word meanings in word embeddings

We can also use the addition and subtraction to separate different meanings of words:

In [44]:
print(wiki2vec.similar_by_word("mouse", topn=10))

[('mice', 0.7914096117019653), ('merized', 0.7884306907653809), ('rabbit', 0.7779645919799805), ('desmarestianus', 0.760723888874054), ('batmouse', 0.7604084610939026), ('cuppedius', 0.7602810263633728), ('hamster', 0.7569741606712341), ('rat', 0.7562689185142517), ('urartensis', 0.7485920190811157), ('ochrotomys', 0.7427321672439575)]


Apparently the animal meaning of mouse is dominant in Wikipedia. We can now add **computer** and subtract **rat** to manipulate the similarity:

In [45]:
print(wiki2vec.most_similar(positive=["mouse", "computer"], negative=["rat"], topn=10))

[('computers', 0.7617183327674866), ('hardware', 0.7281700968742371), ('software', 0.7272316217422485), ('mscape', 0.7164640426635742), ('probeware', 0.716342031955719), ('systemsthe', 0.7032573819160461), ('graphics', 0.702555239200592), ('programmers', 0.7021493315696716), ('programmable', 0.7009457945823669), ('pdp11', 0.6991949677467346)]


In [46]:
print(wiki2vec.most_similar(positive=["mouse", "rat"], negative=["computer"], topn=10))

[('rabbit', 0.6957606673240662), ('mystromys', 0.690538227558136), ('lophiomys', 0.6903327107429504), ('ochrotomys', 0.6902164220809937), ('conilurus', 0.6892696619033813), ('imhausi', 0.6819950938224792), ('culturatus', 0.6803641319274902), ('rufodorsalis', 0.6792685985565186), ('shrew', 0.6781168580055237), ('orangiae', 0.676617443561554)]


## 3. Loading Word Embeddings from Facebook's Fasttext

The research lab of Facebook created embeddings for 157 languages from Common Crawl. Common Crawl is a project that crawls the web and collects text in all the languages it finds. Check out the GitHub page of Fasttext and the Common Crawl website for more details:

* https://commoncrawl.org/the-data/get-started/
* https://github.com/facebookresearch/fastText/blob/master/docs/crawl-vectors.md

You should download the text version of the embeddings, which follows the same word2vec format as the Wikipedia2Vec models.
I downloaded the embeddings for Frisian and Limburgish, which are two regional languages. We can inspect these two text files in the same way.

In [47]:
%cat  "/Users/piek/Desktop/t-ONDERWIJS/data/word-embeddings/fasttext_we_157/cc.fy.300.vec" | more

526167 300
. 0.0139 -0.0210 -0.0328 -0.0229 0.2410 -0.0260 0.0071 -0.0160 0.0308 0.1494 0.0172 0.0276 -0.0017 0.0205 -0.0072 -0.0089 -0.0154 -0.0288 0.0217 -0.0085 0.0114 0.0119 -0.0284 -0.0170 0.0035 -0.0065 0.0191 -0.0085 0.0190 0.0197 0.0497 -0.0014 0.0060 -0.0069 0.0119 0.0168 0.0049 -0.0049 -0.0081 0.0194 0.0187 -0.0013 -0.0073 0.0139 0.0211 0.0138 -0.0148 0.0042 0.0004 0.0085 -0.0077 -0.0274 -0.0114 -0.0073 -0.0042 0.0038 0.0009 0.1392 0.0047 -0.0014 -0.0010 0.0031 -0.0005 -0.0238 -0.0040 0.0103 0.0276 0.0052 -0.0244 0.0189 0.0208 0.0128 0.0100 -0.0016 0.0019 0.0015 0.0308 -0.0299 -0.0125 0.0094 0.0671 0.0361 -0.0079 -0.0272 0.0130 -0.0070 -0.0004 0.0052 -0.0046 -0.0036 -0.0083 0.0083 0.0192 -0.0097 0.0201 -0.0088 -0.0143 0.0176 -0.0161 -0.0087 0.0108 0.0016 -0.0036 0.0272 0.0440 -0.0102 0.0194 0.0037 0.0198 -0.0054 -0.0010 -0.0030 0.0221 -0.0367 0.0004 0.0175 0.0520 -0.0069 -0.0036 0.0293 -0.0043 -0.0152 -0.0093 0.0255 0.0003 -0.0035 0.0348 0.0176 -0.0236 0.0035 0.0583 -0.0857 -

In [48]:
%cat  "/Users/piek/Desktop/t-ONDERWIJS/data/word-embeddings/fasttext_we_157/cc.li.300.vec" | more

306222 300
' -0.0150 -0.0400 0.0501 -0.0717 -0.0137 0.0211 0.0416 0.0129 0.0141 0.0216 0.0250 0.0593 -0.0445 0.0193 -0.0003 -0.0169 0.0217 0.0115 -0.0477 -0.0079 -0.0130 0.0283 -0.0274 0.0024 -0.0774 -0.0215 -0.0345 -0.0077 0.0146 -0.0027 -0.0087 0.0011 -0.0105 0.1293 -0.0385 0.0089 -0.0080 -0.0460 0.0075 0.0241 0.0128 0.0006 0.0730 -0.0214 0.1288 0.0327 -0.0051 -0.0403 0.0103 0.0162 -0.0266 -0.0041 0.0075 -0.0028 0.0516 -0.0068 0.0191 -0.0159 0.0393 0.0053 -0.0347 0.0049 0.0063 -0.0582 0.0186 -0.0151 -0.0245 0.0054 0.0352 0.0268 0.0313 0.0166 -0.0037 0.0444 -0.0140 0.0595 0.0168 0.0309 -0.0075 -0.0360 0.0133 -0.0097 0.0309 -0.0339 -0.0262 -0.0637 -0.0074 -0.0029 0.0032 0.0123 -0.0102 -0.0100 -0.0129 -0.0051 -0.0047 -0.0834 -0.0273 -0.0322 -0.0355 -0.0005 -0.0230 0.0571 -0.0147 0.0053 -0.0271 0.0189 -0.0605 0.0250 0.0149 0.0120 -0.0225 -0.0752 -0.0751 0.0189 -0.0105 0.0276 -0.0156 -0.0079 -0.0502 -0.0214 0.0635 0.0127 -0.0148 0.0152 0.0049 -0.0183 0.0351 0.0391 -0.0167 0.0224 0.0248 -0

The Frisian model has 526167 words and the Limburgish model has 306222 words. Let's load the models and inspect them.

In [49]:
path_to_my_fasttext_model = "/Users/piek/Desktop/t-ONDERWIJS/data/word-embeddings/fasttext_we_157/cc.fy.300.vec"
fasttext_fy = KeyedVectors.load_word2vec_format(path_to_my_fasttext_model) 

In [50]:
# Show some properties of our model. Notice these are also in the text file.
print('Vector size =', fasttext_fy.vector_size)
print('Vocabulary size =', len(fasttext_fy.key_to_index), len(fasttext_fy))
word_idx = fasttext_fy.key_to_index["tjalk"]
print('Word index for \"tjalk\"', word_idx)
vector = fasttext_fy.get_vector("tjalk")
print('Vector for \"tjalk\"', vector)

wordA="tjalk"
wordB="sneek"
wordlistA=["sneek", "joure", "tjalk"]
wordlistB=["aap", "noot", "mies"]
print("Most similar", wordA, fasttext_fy.most_similar(wordA))  # 👍
print()
print("similar_by_word", wordA, fasttext_fy.similar_by_word(wordA))  # 👍
print()
print("similar_by_vector", wordA, fasttext_fy.similar_by_vector(wordA))  # 👍
print()
print("doesnt_match", wordlistA, fasttext_fy.doesnt_match(wordlistA))  # 👍
print()
print("similarity", wordA, wordB, fasttext_fy.similarity(wordA, wordB))  # 👍
print()
print("n_similarity", wordlistA, wordlistB, fasttext_fy.n_similarity(wordlistA, wordlistB))  # 👍
print()

Vector size = 300
Vocabulary size = 526167 526167
Word index for "tjalk" 141358
Vector for "tjalk" [-2.80e-02 -6.10e-03 -3.38e-02 -3.78e-02 -1.51e-02 -4.40e-03 -1.70e-03
 -3.31e-02  5.80e-03 -2.07e-02 -8.60e-03  2.24e-02  2.50e-03  1.31e-02
 -1.28e-02 -1.37e-02 -2.20e-02 -2.49e-02 -3.70e-03 -3.46e-02 -9.40e-03
  3.30e-02  2.13e-02 -9.80e-03  3.20e-03 -1.81e-02  3.38e-02 -3.87e-02
 -1.57e-02 -1.38e-02  2.29e-02  5.40e-03  1.81e-02  9.30e-03 -2.04e-02
  2.55e-02 -1.91e-02  6.40e-03 -8.00e-04  5.83e-02  1.73e-02  1.26e-02
  1.97e-02  4.30e-03 -1.90e-03  1.56e-02  1.08e-02 -3.00e-03  8.40e-03
 -4.40e-03  2.30e-02  8.10e-03 -9.00e-04  8.80e-03  7.50e-03  3.79e-02
 -2.03e-02 -1.14e-02 -4.22e-02  5.50e-03  2.41e-02 -2.16e-02 -3.00e-03
  1.38e-02  2.00e-03 -2.12e-02  2.50e-03 -1.04e-02 -1.48e-02 -3.29e-02
  1.56e-02 -1.04e-02 -1.52e-02  7.40e-03  5.05e-02  5.80e-03  1.09e-02
  1.77e-02 -2.80e-03 -5.40e-03 -1.15e-02  1.88e-02 -7.40e-03  5.10e-03
 -7.50e-03 -1.08e-02  1.90e-03 -1.69e-02 -2.59e-0
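
The `similarity` and `n_similarity` calls in the cell above both come down to cosine similarity. The sketch below shows the idea in plain Python on made-up 2-dimensional vectors (not real embeddings). It assumes, to the best of my understanding of gensim, that `n_similarity` compares the mean vectors of the two word lists; the helper names `cosine` and `mean_vector` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_vector(vectors):
    """Element-wise average of a list of equally long vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Made-up toy vectors, chosen only to make the arithmetic easy to follow.
toy = {
    "tjalk": [1.0, 0.0],
    "sneek": [0.8, 0.6],
    "aap": [0.0, 1.0],
}

# Word-to-word similarity, as in similarity(wordA, wordB).
sim_tjalk_sneek = cosine(toy["tjalk"], toy["sneek"])
sim_tjalk_aap = cosine(toy["tjalk"], toy["aap"])

# List-to-list similarity, as assumed for n_similarity(listA, listB).
set_sim = cosine(
    mean_vector([toy["tjalk"], toy["sneek"]]),
    mean_vector([toy["aap"]]),
)

print("similarity tjalk/sneek:", sim_tjalk_sneek)
print("similarity tjalk/aap:", sim_tjalk_aap)
print("n_similarity [tjalk, sneek] / [aap]:", set_sim)
```

Vectors pointing in the same direction score close to 1, orthogonal vectors score 0, so higher values indicate more closely related words.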

In [51]:
path_to_my_fasttext_model = "/Users/piek/Desktop/t-ONDERWIJS/data/word-embeddings/fasttext_we_157/cc.li.300.vec"
fasttext_li = KeyedVectors.load_word2vec_format(path_to_my_fasttext_model) 

In [52]:
# Show some properties of our model. Notice these are also in the text file.
print('Vector size =', fasttext_li.vector_size)
print('Vocabulary size =', len(fasttext_li.key_to_index), len(fasttext_li))
word_idx = fasttext_li.key_to_index["stroat"]
print('Word index for \"stroat\"', word_idx)
vector = fasttext_li.get_vector("stroat")
print('Vector for \"stroat\":')
print(vector)

Vector size = 300
Vocabulary size = 306222 306222
Word index for "stroat" 240260
Vector for "stroat":
[ 0.0136 -0.0423  0.0041  0.0326 -0.0135 -0.0034 -0.0137 -0.0132  0.0041
 -0.0108 -0.0007 -0.0127 -0.0194 -0.0115 -0.0003 -0.001  -0.0127 -0.0061
 -0.0295  0.0258  0.0062 -0.0114  0.0376 -0.0237  0.0093 -0.009   0.0071
 -0.015   0.0196 -0.0079 -0.0127 -0.0194  0.0105 -0.009   0.0105 -0.035
  0.0254  0.0177 -0.0182 -0.0077 -0.0028  0.0133 -0.0063 -0.0225  0.0073
  0.0186 -0.0089  0.0334  0.015  -0.0091 -0.0188  0.0133  0.0074  0.0185
  0.009   0.0018  0.0021 -0.0166 -0.0023 -0.0148 -0.01   -0.0066  0.0169
 -0.0052 -0.0095 -0.0342 -0.0152  0.0082  0.0168  0.0042  0.0101  0.0535
  0.0059 -0.0028 -0.0021  0.0243  0.0035 -0.0022  0.006  -0.0381 -0.0024
  0.0085  0.0036 -0.003  -0.0064 -0.0116  0.0019  0.0009  0.0049  0.0268
  0.0118  0.0047 -0.025  -0.0103 -0.0034  0.0168 -0.0158  0.0035 -0.0132
 -0.0107 -0.0058 -0.0013 -0.0123 -0.0176  0.0092  0.0178  0.0097 -0.0012
  0.0371  0.0325 

In [54]:
wordA="stroat"
wordB="poet"
wordC="loemel"
wordlistA=["stroat", "poet", "loemel"]
wordlistB=["aap", "noot", "mies"]
print("Most similar", wordA, fasttext_li.most_similar(wordA))  # 👍
print()

print("Most similar", wordC, fasttext_li.most_similar(wordC))  # 👍
print()

print("Most similar", wordB, fasttext_li.most_similar(wordB))  # 👍
print()


print("similar_by_word", wordA, fasttext_li.similar_by_word(wordA))  # 👍
print()
print("similar_by_vector", wordA, fasttext_li.similar_by_vector(wordA))  # 👍
print()
print("doesnt_match", wordlistA, fasttext_li.doesnt_match(wordlistA))  # 👍
print()
print("similarity", wordA, wordB, fasttext_li.similarity(wordA, wordB))  # 👍
print()
print("n_similarity", wordlistA, wordlistB, fasttext_li.n_similarity(wordlistA, wordlistB))  # 👍
print()

Most similar stroat [('dörpssjtroat', 0.6635913848876953), ('Heisjtroat', 0.6460541486740112), ('Nuisjtroat', 0.6408706307411194), ('stroate', 0.6325363516807556), ('Pierresjtroat', 0.6303679347038269), ('proat', 0.6282641291618347), ('stroabbwoar', 0.6275120377540588), ('Laurasjtroat', 0.6274198889732361), ('Sjifferheisjtroat', 0.6225829720497131), ('Egsjtroat', 0.611824631690979)]

Most similar loemel [('Hoemel', 0.863980233669281), ('foemel', 0.8294392228126526), ('loemelekriemèrsj', 0.8049797415733337), ('Sjtoemel', 0.7726760506629944), ('dremel', 0.7404616475105286), ('Bieemel', 0.7326872944831848), ('loemele', 0.727192759513855), ('kloemele', 0.7036440968513489), ('tasemel', 0.6926780939102173), ('Loemel', 0.683782696723938)]

Most similar poet [('fine-boned', 0.5196540355682373), ('laureate', 0.4606165885925293), ('kölsje', 0.4573061466217041), ('face', 0.4414014220237732), ('pipe', 0.3875519335269928), ('Nut.', 0.38502728939056396), ('Jenniches', 0.3829471170902252), ('gepoet',
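
The `doesnt_match` call used in the cells above picks the word that fits least well with the rest of a list. A minimal sketch of one way to do this follows, assuming the approach I understand gensim to take: normalise each vector to unit length, average them into a centre, and return the word whose vector is least similar to that centre. The 2-dimensional vectors are made up for illustration and the helper names `unit` and `odd_one_out` are hypothetical.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def odd_one_out(words, vectors):
    """Return the word least similar to the centre of the list, plus all scores."""
    units = [unit(vectors[w]) for w in words]
    centre = unit([sum(column) / len(units) for column in zip(*units)])
    scores = {
        w: sum(a * b for a, b in zip(u, centre))
        for w, u in zip(words, units)
    }
    return min(scores, key=scores.get), scores

# Made-up toy vectors: two town names close together, one boat type further away.
toy = {
    "sneek": [1.0, 0.1],
    "joure": [0.9, 0.2],
    "tjalk": [0.1, 1.0],
}

odd, scores = odd_one_out(["sneek", "joure", "tjalk"], toy)
print("doesnt_match:", odd)
print("scores:", scores)
```

With only a few words in the list, a single outlier pulls the centre towards itself, so the result is more reliable for longer lists with one clear exception.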

## 4. Links to existing models available for download

Follow the links to browse available models. The sources listed below contain English models trained using different algorithms, data with different degrees of preprocessing and varying hyperparameter settings. Some resources also include models in other languages.

### Large and commonly used models (English):

* Google word2vec: download via the link in the instructions at http://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/

* GloVe (trained on various corpora): https://nlp.stanford.edu/projects/glove/

* FastText embeddings (Facebook): https://fasttext.cc/docs/en/english-vectors.html

* Models with different algorithms, hyperparameters, dimensions and degrees of preprocessing (e.g. dependency parsing windows): https://vecto.readthedocs.io/en/docs/tutorial/getting_vectors.html


### Various models in English & other languages:

* Various algorithms and parameters for English and other languages: http://vectors.nlpl.eu/repository/#

### Cross-lingual embeddings:
* https://ruder.io/cross-lingual-embeddings/


# End of this notebook