The goal is to check that the vector resulting from *king - man + woman* is close to the *queen* vector.

## Try with a spaCy pretrained embedding

In [1]:
import spacy
import spacy.cli
from scipy import spatial
# Download an English spaCy model (it ships with pre-trained 300-dimensional word vectors)
spacy.cli.download("en_core_web_md")
nlp = spacy.load('en_core_web_md')

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


spaCy can directly return a pre-trained 300-dimensional embedding for every word in its vocabulary.


In [2]:
king = nlp.vocab['king']
king.vector

array([-1.1296e-01, -4.1865e+00, -1.8453e+00,  3.0781e-01,  2.4956e+00,
        9.6267e-01, -1.8161e+00,  4.4655e+00, -2.8210e+00,  9.7090e-01,
        1.3542e+01,  4.3195e-01, -5.3098e+00,  4.7098e+00,  2.9030e+00,
        1.5588e+00,  6.0064e+00, -3.0345e+00,  1.0626e+00, -7.7197e-01,
       -5.4771e+00, -9.7380e-01, -4.4345e+00,  5.8367e+00,  2.4302e+00,
       -3.9408e+00, -9.1862e-01, -4.9124e+00,  1.4591e+00, -7.2772e-01,
        3.4957e+00, -4.0077e+00, -1.8354e+00, -4.1052e+00,  4.9211e+00,
       -9.7053e-01,  1.9223e+00,  5.2605e+00,  1.6086e+00,  7.1328e-01,
       -1.2146e+00, -1.9869e+00,  8.0265e-01,  2.9298e+00,  7.2985e-01,
       -6.2892e-01, -1.7082e+00,  1.9893e+00,  4.7529e-01,  3.2264e+00,
       -3.9215e+00,  4.6556e+00,  1.3475e+00, -1.0979e+00, -3.0365e+00,
        1.5815e+00,  2.2835e+00, -4.0616e+00,  2.5730e+00,  4.0618e+00,
        9.5438e-01, -6.2563e+00,  5.6463e+00, -3.8933e+00,  4.4076e+00,
        2.0517e+00, -6.6906e+00, -6.9448e+00,  6.0371e+00,  9.30

In [3]:
king.vector.shape

(300,)

In [9]:
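# `king` from In [2] is a Lexeme, so take its vector before doing arithmetic
king = nlp.vocab['king'].vector
queen = nlp.vocab['queen'].vector
man = nlp.vocab['man'].vector
woman = nlp.vocab['woman'].vector
result = king - man + woman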

In [4]:
# Question 1: Compute the vector "king - man + woman" and show that the result is close to the vector representation of the word "queen".
# A good way to do this is, for example, to find the 10 closest words (among the nlp.vocab words) to the result of "king - man + woman" and to show
# that "queen" is one of them (if not the closest).

# The measure we need for that is cosine similarity; it can be defined from the spatial.distance.cosine function imported from the scipy library
cosine_similarity = lambda x, y: 1 - spatial.distance.cosine(x, y)

# Start the exercise here
# Hint: loop over nlp.vocab (all the words defined in the spaCy vocabulary); for each "word" in the vocabulary you can check whether it has an embedding vector ("word.has_vector"), whether it is in
# lower case ("word.is_lower") and whether it is alphabetic ("word.is_alpha"). Try to consider only the relevant words for the exercise
# ??????
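
To make the similarity measure concrete, here is a minimal plain-Python sketch of cosine similarity on small hand-made vectors. The function name `plain_cosine_similarity` and the toy vectors are hypothetical and only serve the illustration; they are not real embeddings.

```python
import math

def plain_cosine_similarity(x, y):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Hypothetical toy vectors (assumption: chosen by hand for the example)
same_direction = plain_cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = plain_cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = plain_cosine_similarity([1.0, 2.0], [-1.0, -2.0])

print(same_direction, orthogonal, opposite)
```

Vectors pointing in the same direction score close to 1, orthogonal vectors score 0, and opposite vectors score close to -1, regardless of their lengths.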

In [14]:
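# Use the raw vector for "king" (nlp.vocab['king'] itself is a Lexeme, not an array)
new_vector = nlp.vocab['king'].vector - man + woman
computed_similarities = []
for word in nlp.vocab:
    # Keep only lowercase alphabetic words that have an embedding
    if word.has_vector and word.is_lower and word.is_alpha:
        similarity = cosine_similarity(new_vector, word.vector)
        computed_similarities.append((word.text, similarity))
# Sort by similarity, most similar first
computed_similarities = sorted(computed_similarities, key=lambda item: item[1], reverse=True)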

<spacy.vocab.Vocab object at 0x00000219722611B0> has similarity of 0.8489541411399841
<spacy.vocab.Vocab object at 0x00000219722611B0> has similarity of 0.07003621011972427
<spacy.vocab.Vocab object at 0x00000219722611B0> has similarity of 0.30994713306427
0.6178014278411865
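
The ranking step can be illustrated without spaCy. The sketch below scores every word of a tiny vocabulary against *king - man + woman* and sorts by score, highest first. The three-dimensional vectors and the name `toy_vectors` are hypothetical and hand-made, so the numbers say nothing about real embeddings.

```python
import math

def plain_cosine_similarity(x, y):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Hypothetical toy embeddings (assumption): [royalty, maleness, femaleness]
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [-0.5, 0.3, 0.3],
}

# Element-wise king - man + woman
target = [
    k - m + w
    for k, m, w in zip(toy_vectors["king"], toy_vectors["man"], toy_vectors["woman"])
]

# Score every word, then sort on the score (second tuple element), descending
scored = [(word, plain_cosine_similarity(target, vec)) for word, vec in toy_vectors.items()]
ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)

for word, score in ranked[:3]:
    print(f"{word}: {score:.3f}")
```

Sorting on the score rather than on the whole tuple is what puts the nearest words at the start of the list.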


## Try with a pretrained GloVe embedding model (loaded with gensim)

**Important** To prevent a RAM crash in the execution environment, please restart the runtime at this point (Execution -> Restart the running environment)

In [None]:
import gensim
# KeyedVectors is the gensim class that holds pretrained word vectors
from gensim.models import KeyedVectors

We load the pre-trained GloVe vectors `glove-wiki-gigaword-100`, trained on Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 100-dimensional embeddings).

In [None]:
import gensim.downloader as api
word_vectors = api.load("glove-wiki-gigaword-100")

In [None]:
king = word_vectors['king']

print(king)

In [None]:
king.shape

In [None]:
# Question 2: This time with the GloVe embedding model loaded above, try to show once again that "king - man + woman" is close to the vector representation of the word "queen".
# Hint: There is a pre-defined function in the gensim "word_vectors" object (defined just above) that gives this result quite easily

# ??????????????
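
For orientation, gensim's `KeyedVectors` offers `most_similar`, which can be called as `word_vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=10)`. The plain-Python sketch below mimics the general idea on hypothetical toy vectors: unit-normalize the inputs, add the positive words and subtract the negative ones, then rank the remaining words by cosine similarity while skipping the query words. The helper names `unit` and `analogy` and all vectors are assumptions made for the illustration; this is not gensim's implementation.

```python
import math

def unit(vec):
    """Scale a non-zero vector to length 1."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def analogy(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + u for t, u in zip(target, unit(vectors[word]))]
    for word in negative:
        target = [t - u for t, u in zip(target, unit(vectors[word]))]
    target = unit(target)
    # Skip the query words: an input word such as "king" often ranks near the top
    query_words = set(positive) | set(negative)
    scored = [
        (word, sum(t * u for t, u in zip(target, unit(vec))))
        for word, vec in vectors.items()
        if word not in query_words
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical toy embeddings (assumption), not taken from any real model
toy_vectors = {
    "king": [1.0, 0.5, 0.1],
    "queen": [1.0, 0.1, 0.5],
    "man": [0.2, 1.0, 0.6],
    "woman": [0.2, 0.6, 1.0],
    "prince": [0.9, 0.6, 0.2],
    "apple": [-0.3, 0.2, 0.1],
}

results = analogy(toy_vectors, positive=["king", "woman"], negative=["man"])
for word, score in results:
    print(f"{word}: {score:.3f}")
```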

## Try with fastText embedding

**Important** To prevent a RAM crash in the execution environment, please restart the runtime at this point (Execution -> Restart the running environment)

In [None]:
# Download and extract the fastText word embedding model, then install the fasttext package
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz
!gunzip /content/cc.en.300.bin.gz
!pip install fasttext

Load the English fastText model

In [None]:
import fasttext 

model = fasttext.load_model("/content/cc.en.300.bin")

In [None]:
model.get_word_vector("king")

The nearest neighbors of a given word (or even of a character n-gram) can be retrieved directly

In [None]:
model.get_nearest_neighbors("king")
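
fastText can handle strings that are not full vocabulary words because it represents a word through its character n-grams. The sketch below shows only the n-gram extraction, with the word wrapped in `<` and `>` boundary markers. The function name `char_ngrams` and the default n-gram length of 3 are assumptions chosen for the illustration; the sketch does not reproduce how fastText hashes or stores the n-grams.

```python
def char_ngrams(word, min_n=3, max_n=3):
    """Return the character n-grams of `word`, including boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

grams = char_ngrams("where")
print(grams)
```

Because related words share many of these n-grams, their vectors can end up close to each other even when one of the words was rare or absent in the training data.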

In [None]:
# Question 3: This time with the fastText embedding model, try to show once again that "king - man + woman" is close to the vector representation of the word "queen" ;
# Hint: There is a pre-defined function in the fastText model, 'get_analogies', that allows to get this result quite easily