# `word2vec` example with `gensim`

In this notebook we are going to use the `gensim` library to operate on word embeddings in Spanish.

We will download the pre-trained embeddings with Wikipedia text for Spanish from https://github.com/uchile-nlp/spanish-word-embeddings. We select 

In [0]:
import gensim, logging, os

We can load the pre-trained word2vec vectors as follows:

In [0]:
embed_home = 'drive/My Drive/kschool-nlp/data/word-embeddings/'
model_file = embed_home + 'SBW-vectors-300-min5.txt.bz2'

# The file holds plain vectors in word2vec text format (bz2-compressed), not a
# full trained model, so we load it as KeyedVectors instead of Word2Vec.load.
model = gensim.models.KeyedVectors.load_word2vec_format(model_file)
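To make clearer what `load_word2vec_format` is reading, here is a minimal sketch of the word2vec text format: a header line with the vocabulary size and the number of dimensions, followed by one line per term. The miniature file contents and the helper `read_word2vec_text` are made up for illustration; this is not how `gensim` implements the loader.

```python
import io

# Made-up miniature file in word2vec text format: a header line with
# "<vocabulary size> <dimensions>", then one "<term> <v1> ... <vn>" line per term.
toy_file = io.StringIO(
    "3 4\n"
    "azul 0.1 0.2 0.3 0.4\n"
    "verde 0.2 0.1 0.4 0.3\n"
    "microsoft -0.5 0.9 0.0 0.1\n"
)


def read_word2vec_text(handle):
    """Parse the text format into a {term: [floats]} dictionary (hypothetical helper)."""
    vocab_size, dims = (int(n) for n in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        term, values = parts[0], [float(v) for v in parts[1:]]
        if len(values) != dims:
            raise ValueError(f"expected {dims} values for {term!r}")
        vectors[term] = values
    if len(vectors) != vocab_size:
        raise ValueError("header and body disagree on vocabulary size")
    return vectors


toy_vectors = read_word2vec_text(toy_file)
print(len(toy_vectors), len(toy_vectors["azul"]))
```

The real file is large, so loading it takes a while. If memory is tight, `load_word2vec_format` also accepts a `limit` argument that reads only the first N vectors of the file.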

## Testing the model

The object `model` holds a large numeric matrix: a table in which each row represents a term in the vocabulary and each column is one of the features that together encode the meaning of that term.

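In our model the vocabulary contains about one million terms (`len(model.vocab)` in gensim 3.x, or `len(model.key_to_index)` in gensim 4, gives the exact count).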

Each term in the vocabulary is represented by a vector of 300 dimensions, as the `300` in the file name indicates. We can inspect the vectors of a few specific terms.

In [0]:
print(model['azul'], '\n')

print(model['verde'], '\n')

print(model['microsoft'])

These vectors don't tell us much on their own, except that their components are small numbers...

The `model` object exposes a set of built-in methods that let us evaluate the model informally.

We can compute the semantic similarity between two terms with the `similarity` method, which returns their cosine similarity: a value between -1 and 1, where higher means more similar:

In [0]:
print('hombre - mujer', model.similarity('hombre', 'mujer'))

print('perro - gato', model.similarity('perro', 'gato'))

print('gato - periódico', model.similarity('gato', 'periódico'))

print('febrero - azul', model.similarity('febrero', 'azul'))
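The score returned by `similarity` is the cosine of the angle between the two term vectors. As a minimal sketch with made-up two-dimensional vectors (the helper `cosine` is ours, not part of `gensim`), the computation looks like this:

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors chosen to show the three characteristic cases.
same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # parallel vectors, close to 1
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])       # unrelated directions, 0
opposite = cosine([1.0, 2.0], [-1.0, -2.0])       # opposite vectors, close to -1
print(same_direction, orthogonal, opposite)
```

Because only the angle matters, the length of the vectors does not affect the score.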

We can also pick out the term that does not fit in a list, using the `doesnt_match` method:

In [0]:
lista1 = 'madrid barcelona gonzález washington'.split()
print('en la lista', ' '.join(lista1), 'sobra:', model.doesnt_match(lista1))

lista2 = 'psoe pp ciu juan'.split()
print('en la lista', ' '.join(lista2), 'sobra:', model.doesnt_match(lista2))

lista3 = 'publicaron declararon soy negaron'.split()
print('en la lista', ' '.join(lista3), 'sobra:', model.doesnt_match(lista3))

lista4 = 'homero saturno cervantes shakespeare cela'.split()
print('en la lista', ' '.join(lista4), 'sobra:', model.doesnt_match(lista4))

lista5 = 'madrid barcelona valencia marsella'.split()
print('en la lista', ' '.join(lista5), 'sobra:', model.doesnt_match(lista5))
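One way to understand `doesnt_match` is as "the term furthest from the group's average". The sketch below follows that idea, which to our understanding is close to what `gensim` does internally: normalise each vector, average them, and return the term least similar to that mean. The three-dimensional vectors and the helper names are made up for illustration.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(vectors):
    """Return the term whose vector is least similar to the group mean."""
    units = {term: unit(vec) for term, vec in vectors.items()}
    dims = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dims)])
    similarity_to_mean = {
        term: sum(a * b for a, b in zip(u, mean)) for term, u in units.items()
    }
    return min(similarity_to_mean, key=similarity_to_mean.get)


# Made-up vectors: three "city-like" terms and one that points elsewhere.
toy = {
    "madrid": [0.9, 0.1, 0.0],
    "barcelona": [0.8, 0.2, 0.1],
    "valencia": [0.85, 0.15, 0.05],
    "saturno": [0.0, 0.2, 0.9],
}
outlier = odd_one_out(toy)
print(outlier)
```

This also explains why the method always returns something, even for a list in which every term fits: some term is always the furthest from the mean.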

We can also find the most similar words by using the method `most_similar` in the model:

In [0]:
terminos = 'psoe chicago rajoy enero amarillo microsoft iberia messi atlético'.split()

for t in terminos:
    print(t, '==>', model.most_similar(t), '\n')
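Conceptually, `most_similar` ranks every other term in the vocabulary by cosine similarity to the query and keeps the top results as `(term, score)` pairs. A brute-force sketch over a made-up toy vocabulary (the helpers `cosine` and `nearest` are hypothetical, and `gensim` does the same work with vectorised matrix operations) could look like this:

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def nearest(term, vectors, topn=2):
    """Rank every other term by cosine similarity to `term`, highest first."""
    scores = [
        (other, cosine(vectors[term], vec))
        for other, vec in vectors.items()
        if other != term  # the query term itself is never returned
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up vectors: two "month-like" terms and two "colour-like" terms.
toy = {
    "enero": [0.9, 0.1, 0.0],
    "febrero": [0.85, 0.15, 0.0],
    "amarillo": [0.1, 0.9, 0.1],
    "verde": [0.15, 0.85, 0.1],
}
neighbours = nearest("enero", toy)
for item in neighbours:
    print(item)
```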

Furthermore, we can use the same `most_similar` method to perform arithmetic with vectors: the vectors of the terms in `positive` are added and those of the terms in `negative` are subtracted.

In [0]:
print('mujer que ejerce la autoridad en una alcaldía ==> alcalde + mujer - hombre')
most_similar = model.most_similar(positive=['alcalde', 'mujer'], negative=['hombre'], topn=3)
for item in most_similar:
    print(item)

print('monarca soberano ==> reina + hombre - mujer')    
most_similar = model.most_similar(positive=['reina', 'hombre'], negative=['mujer'], topn=3)
for item in most_similar:
    print(item)
    
print('capital de Alemania ==> moscú + alemania - rusia')
most_similar = model.most_similar(positive=['moscú', 'alemania'], negative=['rusia'], topn=3)
for item in most_similar:
    print(item)

print('presidente de Francia ==> mariano + francia - españa')
most_similar = model.most_similar(positive=['mariano', 'francia'], negative=['españa'], topn=3)
for item in most_similar:
    print(item)

In [0]:
most_similar = model.most_similar(positive=['obama', 'francia'], negative=['eeuu'], topn=3)
for item in most_similar:
    print(item)
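The analogy queries above can be read as plain vector arithmetic followed by a nearest-neighbour search. The following is a minimal sketch under stated assumptions: the vectors are made up, with three hand-picked dimensions, and the helper `analogy` is hypothetical. It adds the normalised `positive` vectors, subtracts the normalised `negative` ones, and ranks the remaining terms by cosine similarity to the result, leaving out the query terms themselves.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def analogy(positive, negative, vectors, topn=1):
    """Rank terms by similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for term in positive:
        query = [q + x for q, x in zip(query, unit(vectors[term]))]
    for term in negative:
        query = [q - x for q, x in zip(query, unit(vectors[term]))]
    query = unit(query)
    excluded = set(positive) | set(negative)  # query terms are not candidates
    scores = [
        (term, sum(a * b for a, b in zip(query, unit(vec))))
        for term, vec in vectors.items()
        if term not in excluded
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up dimensions for illustration: [royalty, male, female].
toy = {
    "rey": [1.0, 1.0, 0.0],
    "reina": [1.0, 0.0, 1.0],
    "hombre": [0.0, 1.0, 0.0],
    "mujer": [0.0, 0.0, 1.0],
    "alcalde": [0.2, 1.0, 0.0],
}
result = analogy(["reina", "hombre"], ["mujer"], toy)
print(result)
```

Excluding the query terms matters: without that step, the closest vector to `reina + hombre - mujer` would often be one of the inputs themselves. In real embeddings the dimensions are not interpretable like the toy ones here, so results such as those printed above are suggestive rather than guaranteed.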