# LAB2.2b: Word embeddings from Wikipedia using Gensim

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

In case you cannot install Wikipedia2Vec, you can install the *gensim* package instead and load the models as described below.
This notebook explains how to load a Wikipedia2Vec model stored in the word2vec text format with the Gensim package.


You need to install Gensim on your local machine from the command line:

    conda install -c conda-forge gensim

If Gensim is successfully installed, you can use existing Wikipedia2Vec models in a text (txt) format that is compatible with Gensim.

You can download pre-trained models in various languages from: https://wikipedia2vec.github.io/wikipedia2vec/pretrained/

There are different variants, trained with 100 and with 300 dimensions. If your computer has limited capacity, it is better to start with the 100-dimensional variant. For this notebook, we will download enwiki_20180420_100d.txt.bz2, which is a compressed version of the 100-dimensional embedding model built from the English Wikipedia. You need to decompress the ".txt.bz2" file to a file with the extension ".txt".
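
If you prefer to decompress the file from Python rather than with a desktop tool, the following is a minimal sketch using only the standard library. The helper name `decompress_bz2` is hypothetical, and the example call assumes the download sits in the current working directory.

```python
import bz2
import shutil
from pathlib import Path


def decompress_bz2(src_path, dst_path=None, chunk_size=1024 * 1024):
    """Decompress a .bz2 file to dst_path and return the output path."""
    src = Path(src_path)
    # By default, drop the trailing ".bz2": model.txt.bz2 -> model.txt
    dst = Path(dst_path) if dst_path is not None else src.with_suffix("")
    with bz2.open(src, "rb") as fin, open(dst, "wb") as fout:
        # Copy in chunks so the whole model never has to fit in memory.
        shutil.copyfileobj(fin, fout, chunk_size)
    return dst


# Example call (assumes the download is in the current directory):
# decompress_bz2("enwiki_20180420_100d.txt.bz2")
```

The decompressed file is considerably larger than the download, so make sure there is enough free disk space first.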

In [1]:
%conda install -c conda-forge gensim

Collecting package metadata (current_repodata.json): done
Solving environment: \ ^C
failed with initial frozen solve. Retrying with flexible solve.

Note: you may need to restart the kernel to use updated packages.


In [None]:
from gensim.models import Word2Vec
from gensim.models import KeyedVectors

In [None]:
# Fill in the path to your local copy of an embedding model.
# Here we specify an example of such a path. Adapt the path to where you have stored the download.
# Make sure it is decompressed. The *.bz2 file will not load.
# Parameters of load_word2vec_format:
#   fname (str): The file path to the saved word2vec-format file.
#   fvocab (str, optional): File path to the vocabulary. Word counts are read from the fvocab
#       file, if set (this is the file generated by the -save-vocab flag of the original C tool).
#   binary (bool, optional): If True, the data is in binary word2vec format.
#   encoding (str, optional): If you trained the C model using a non-utf8 encoding for words,
#       specify that encoding here.
#   unicode_errors (str, optional): Default 'strict'. A string suitable to be passed as the
#       errors argument to the unicode() (Python 2.x) or str() (Python 3.x) function. If your
#       source file may include word tokens truncated in the middle of a multibyte unicode
#       character (as is common from the original word2vec.c tool), 'ignore' or 'replace' may help.
#   limit (int, optional): Sets a maximum number of word vectors to read from the file.
#       The default, None, means read all.
#   datatype (type, optional): (Experimental) Can coerce dimensions to a non-default float type
#       (such as np.float16) to save memory. Such types may result in much slower bulk
#       operations or incompatibility with optimized routines.
#
# Returns: the loaded model.
# Return type: Word2VecKeyedVectors


# Loading the model can take a while.

MODEL_FILE='/Users/piek/Desktop/t-ONDERWIJS/data/word-embeddings/wiki2vec/enwiki_20180420_100d.txt'

wiki2vec = KeyedVectors.load_word2vec_format(MODEL_FILE) 

By loading the model, we created an object with the name "wiki2vec" through which we can call functions and attributes.
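
The word2vec text format loaded above is plain text: the first line holds the vocabulary size and the vector size, and every following line holds a word followed by its vector values, separated by spaces. As a rough illustration, the sketch below parses a tiny made-up file in that layout. The function `read_word2vec_text` and the toy data are hypothetical; they only show the layout and are not how Gensim reads the file.

```python
import io


def read_word2vec_text(handle):
    """Parse word2vec text format from an open text handle.

    Returns (vocab_size, vector_size, vectors), where vectors maps each
    word to a list of floats.
    """
    header = handle.readline().split()
    vocab_size, vector_size = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != vector_size + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vocab_size, vector_size, vectors


# A made-up three-word, three-dimensional file, held in memory for illustration.
toy_file = io.StringIO(
    "3 3\n"
    "dog 0.34 0.56 0.12\n"
    "cat 0.12 0.39 0.05\n"
    "car 0.78 0.37 0.01\n"
)
vocab_size, vector_size, vectors = read_word2vec_text(toy_file)
print(vocab_size, vector_size)   # 3 3
print(vectors["cat"])            # [0.12, 0.39, 0.05]
```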

In [None]:
dir(wiki2vec)

Models may be stored in different (and sometimes confusing) formats, but they all boil down to these components:

* a matrix of word vectors 
* a vocabulary
* a mapping between vectors in the matrix to the words in the vocabulary (often via indices)

Think about what a matrix is (no, not the movie). A vector is a list of numbers, in which each number is the value for one dimension in an n-dimensional space. A list of m such vectors forms a matrix with m rows and n columns. Each row corresponds to the vector of a word in the vocabulary.

A matrix of 3 rows and 3 columns:
```
[[.34, .56, .12],
 [.12, .39, .05],
 [.78, .37, .01]]
```

The vocabulary with the word as a key and the matrix list index that points to the row with the embedding for the word:

```{"dog": 0, "cat" : 1, "car" : 2}```

For this data, a simple lookup function for *dog* will give the embedding *[.34, .56, .12]*.
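
Such a lookup can be sketched in a few lines of plain Python. The names `matrix`, `vocabulary` and `lookup` are chosen for illustration only, and the third value of each row is read as a decimal (.12, .05, .01).

```python
# Toy data from the example above: one row per word.
matrix = [
    [0.34, 0.56, 0.12],
    [0.12, 0.39, 0.05],
    [0.78, 0.37, 0.01],
]
vocabulary = {"dog": 0, "cat": 1, "car": 2}


def lookup(word):
    """Return the embedding for word, or None if it is not in the vocabulary."""
    index = vocabulary.get(word)
    return None if index is None else matrix[index]


print(lookup("dog"))    # [0.34, 0.56, 0.12]
print(lookup("horse"))  # None
```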

Now let's see how this is implemented in Gensim for the Wikipedia-derived word embedding model.

In [None]:
# Explore the wiki2vec model as a python object:
print('The model is represented internally as a...')
print(type(wiki2vec))

The model has a vocabulary that contains, among other things, words. Let's check how big the vocabulary of the model derived from English Wikipedia is:

In [None]:
# Note: `vocab` is the Gensim 3.x attribute; in Gensim 4.x it was replaced by `key_to_index`.
print('The model vocabulary is represented internally as a...')
print(type(wiki2vec.vocab))

# Show some properties of our model. Notice these are also in the text file.
print('Vector size =', wiki2vec.vector_size)
print('Vocabulary size =', len(wiki2vec.vocab))

About four and a half million words are present in this model. That is far more than in the English WordNet. Let's look at some of them:

In [6]:
print('Some words from the model vocabulary:')
# [:20] selects the first 20 items of the list; list(wiki2vec.vocab)[-1] would give the last word.
print(list(wiki2vec.vocab)[:20])
print()

Some words from the model vocabulary:
['the', 'in', 'of', 'a', 'and', 'is', 'to', 'was', 'by', 'for', 'on', 'as', 'at', 'from', 'with', 'an', 'it', 'that', 'also', 'which']



For each word in the vocabulary, we can now get the vector. We assume that 'man' is in the vocabulary:

In [7]:
print('Information stored in the vocabulary for the word "man":')
man_vector=wiki2vec['man']
print(type(man_vector))
print('Embedding of "man" in Wikipedia:', man_vector)

Information stored in the vocabulary for the word "man":
<class 'numpy.ndarray'>
Embedding of "man" in Wikipedia: [-2.377e-01  2.686e-01 -9.620e-02  2.707e-01 -2.241e-01 -2.489e-01
  1.065e-01  4.120e-02 -5.349e-01 -1.445e-01 -8.700e-02 -1.877e-01
  1.985e-01 -1.643e-01  1.021e-01 -1.783e-01 -5.520e-02  2.190e-02
 -2.180e-01  1.569e-01 -2.835e-01 -3.299e-01 -6.780e-02  3.505e-01
 -3.241e-01 -9.000e-04 -1.234e-01 -3.452e-01 -4.523e-01  7.449e-01
  1.470e-01 -1.258e-01 -1.073e-01  4.019e-01  1.120e-01  2.230e-02
 -3.720e-01  2.026e-01  3.160e-02  2.910e-02 -2.406e-01  1.368e-01
 -1.750e-02  1.020e-01  8.340e-02  5.012e-01 -3.973e-01  4.010e-02
 -1.653e-01 -1.892e-01 -1.441e-01  6.290e-02 -5.185e-01 -2.638e-01
  3.170e-02 -6.030e-02  1.012e-01 -5.408e-01 -3.528e-01 -1.281e-01
 -2.617e-01 -2.607e-01 -9.150e-02  3.094e-01  4.468e-01 -2.526e-01
 -2.842e-01 -9.303e-01 -3.270e-02 -4.669e-01  4.064e-01  2.045e-01
  2.223e-01  2.501e-01 -4.577e-01  4.089e-01  1.261e-01 -2.000e-01
 -8.700e-02 -2.

As expected, a vector is an ordered sequence of numbers, each representing a dimension. These numbers are the weights that the neural network learned between the input word 'man' and its hidden layer while learning to predict the context words of 'man'.

Note that the data object is of the type 'numpy.ndarray'. Numpy is a package for dealing with numerical data that is used a lot in machine learning. For those interested, here is the description of what it is:

https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html


How many dimensions do we have in this vector?

In [8]:
print('Numpy data shape', man_vector.shape)
print('Number of vector dimensions:', len(man_vector))

Numpy data shape (100,)
Number of vector dimensions: 100


Not a surprise: we loaded a model with 100 dimensions, which comes from a hidden layer with 100 neurons. This holds for all words, so also for 'dog'.

In [9]:
vector4dog = wiki2vec['dog']

In [10]:
len(vector4dog)

100

In [11]:
print(vector4dog)

[-0.013   0.6476  0.1045 -0.3117  0.1475  0.0809 -0.1523  0.265  -0.6462
 -0.2058  0.1564 -0.2072  0.4179  0.0386 -0.0194 -0.2241  0.2227 -0.3452
 -0.426   0.1028 -0.2136 -0.0267  0.1946  0.3652 -0.2265 -0.2736  0.0326
 -0.0279 -0.2359  0.5077  0.3759 -0.2207 -0.0506  0.7909  0.1344 -0.079
 -0.4099  0.1559 -0.0066  0.1236 -0.5474 -0.0877 -0.3738 -0.253  -0.4688
 -0.1184 -0.0501  0.3267 -0.1799 -0.2662  0.0968  0.2891 -0.4816 -0.3374
  0.2488  0.1744  0.0889 -0.1873 -0.3312 -0.1903  0.0547 -0.6149 -0.427
 -0.1079  0.137  -0.1445  0.0521 -0.5711 -0.3859 -0.6626  0.2417 -0.0141
  0.3974  0.1331 -0.6726 -0.227   0.1793  0.2454  0.1545 -0.0923 -0.0247
 -0.4611 -0.1317 -0.2194  0.2143  0.491  -0.2186  0.2463  0.0843  0.1324
 -0.4565  0.008   0.6242 -0.0217 -0.084  -0.4722  0.1191  0.3299 -0.9191
  0.0963]


Because the representations are compatible across the words, we can compare two vector representations through the cosine similarity function:

![Cosine similarity](./images/cosine-full.png "Cosine similarity formula")

Suppose we have two vectors A and B, each with 100 slots. This formula (taken from the Wikipedia page) tells you to sum the results of multiplying each slot across A and B:

A[0]\*B[0] + A[1]\*B[1] + ... + A[99]\*B[99]

We divide this sum by the square root of the sum of the squared slots of A, multiplied by the square root of the sum of the squared slots of B. These two square roots are the lengths of the vectors, so dividing by their product normalises the value to the range from -1 to 1 and makes the result independent of the lengths of the individual vectors.
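
To make the formula concrete, here is a minimal sketch in plain Python: the numerator is the sum of the slot-wise products, and the denominator is the product of the two vector lengths, each being the square root of the sum of the squared slots. The function name `cosine_similarity` and the three-dimensional toy vectors are illustrative assumptions, not values from the Wikipedia model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy vectors for illustration.
dog = [0.34, 0.56, 0.12]
cat = [0.12, 0.39, 0.05]
car = [0.78, 0.37, 0.01]
print(cosine_similarity(dog, cat))
print(cosine_similarity(dog, car))
```

Vectors that point in the same direction score 1, vectors at a right angle score 0, and vectors that point in opposite directions score -1.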

Embedding software uses such measures to obtain the most similar words. We can now use the *similar_by_word* function from Gensim to ask for the words that are most similar:

In [12]:
dog_sim = wiki2vec.similar_by_word("dog", topn=10)

It is important to note that Gensim is not the same package as Wikipedia2Vec. Gensim is more generic and does not have the specific data objects of Wikipedia2Vec for the frequency of words or the entities in the Wikipedia articles. However, the other notebook, which uses Wikipedia2Vec, showed the same results for 'dog' (its output is repeated here). The way the model is loaded and treated is therefore the same.

```
[(<Word dog>, 0.99999994),
 (<Word dogs>, 0.8637307),
 (<Word cat>, 0.8286426),
 (<Word puppy>, 0.81508684),
 (<Word rabbit>, 0.8042291),
 (<Word montarges>, 0.798108),
 (<Word poodle>, 0.79497886),
 (<Word barfy>, 0.7915491),
 (<Word cockapoo>, 0.783462),
 (<Word pekapoos>, 0.78286505)]
 ```

In [13]:
print(wiki2vec.similarity("king", "queen"))
print(wiki2vec.similarity("king", "coffee"))

0.7570435
0.25811055
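
To see what a "most similar words" query amounts to, the sketch below ranks a toy vocabulary by cosine similarity to a query word. It first scales every vector to length 1, after which the cosine is simply the dot product. The embeddings are made-up values, the names `unit` and `most_similar` are hypothetical, and leaving the query word out of the ranking is a choice made for this sketch; it is not a description of Gensim's internals.

```python
import math

# Toy embeddings (made-up values, not taken from the Wikipedia model).
embeddings = {
    "dog": [0.34, 0.56, 0.12],
    "cat": [0.12, 0.39, 0.05],
    "car": [0.78, 0.37, 0.01],
}


def unit(vector):
    """Scale a vector to length 1."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector]


def most_similar(word, topn=10):
    """Rank all other words by cosine similarity to word, highest first."""
    normalised = {w: unit(v) for w, v in embeddings.items()}
    query = normalised[word]
    scores = [
        (other, sum(q * o for q, o in zip(query, vec)))
        for other, vec in normalised.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


for neighbour, score in most_similar("dog", topn=2):
    print(neighbour, round(score, 3))
```

On a real model with millions of words, this brute-force comparison is done with optimized matrix operations rather than Python loops.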


# End of this notebook