## Semantic and Word Vectors

The word2vec model creates word vectors using one of two approaches: Continuous Bag of Words (CBOW) or skip-gram.

In CBOW, the words before and after the target word are taken as input, and the model predicts the target word. So there are multiple inputs and a single output.
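To make the "multiple inputs, single output" idea concrete, here is a minimal sketch of how CBOW-style training examples could be built from a sentence. The helper name `cbow_pairs` and the `window` parameter are illustrative assumptions, not part of word2vec or spaCy, and the sketch covers only data preparation, not the model itself.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for a CBOW-style model.

    Hypothetical helper for illustration only.
    """
    pairs = []
    for i, target in enumerate(tokens):
        # Words up to `window` positions before and after the target
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            pairs.append((context, target))
    return pairs


sentence = "the quick brown fox jumps".split()
for context, target in cbow_pairs(sentence, window=1):
    print(context, "->", target)
```

Each position in the sentence yields one example: several context words in, one target word out.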

In skip-gram, the direction is reversed: given a single input word, a shallow neural network estimates the probabilities of the words that are likely to appear around it. This process takes longer.
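As a rough illustration, skip-gram training data can be thought of as (input word, nearby word) pairs. The sketch below uses a hypothetical helper `skipgram_pairs` with an assumed `window` parameter; it shows only how the pairs are formed, not how the probabilities are learned.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center_word, context_word) pairs for a skip-gram-style model.

    Hypothetical helper for illustration only.
    """
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs


sentence = "the quick brown fox jumps".split()
for center, context in skipgram_pairs(sentence, window=1):
    print(center, "->", context)
```

Every input word produces several training pairs, one per neighbour, which is one plausible reason this approach takes longer to train.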

In spaCy's en_core_web_lg model (used below), each vector has 300 dimensions.

Since each word is assigned a unique vector in a multi-dimensional space, we can measure how similar two words are using cosine similarity. Cosine similarity is the cosine of the angle between two vectors: values close to 1 mean the vectors point in nearly the same direction.
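For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following is a minimal pure-Python sketch of that formula. Returning 0.0 for a zero-length vector is an assumption made here to avoid division by zero, not a standard convention.

```python
import math


def cosine_similarity(v1, v2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumption: treat zero vectors as "no similarity"
    return dot / (norm1 * norm2)


print(cosine_similarity([1, 0], [2, 0]))   # same direction  -> 1.0
print(cosine_similarity([1, 0], [0, 3]))   # orthogonal      -> 0.0
print(cosine_similarity([1, 0], [-1, 0]))  # opposite        -> -1.0
```

Note that the result depends only on direction, not on vector length.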

So now you can perform mathematics like this:

    new_vector = king - man + woman (intuitively we can tell the answer should be queen)

We then search for the word whose vector is closest to this new vector, which ideally gives the answer queen.


In [2]:
import spacy
nlp = spacy.load('en_core_web_lg') # Loading large model

In [3]:
# Vector for a word
nlp(u'lion').vector

array([ 1.8963e-01, -4.0309e-01,  3.5350e-01, -4.7907e-01, -4.3311e-01,
        2.3857e-01,  2.6962e-01,  6.4332e-02,  3.0767e-01,  1.3712e+00,
       -3.7582e-01, -2.2713e-01, -3.5657e-01, -2.5355e-01,  1.7543e-02,
        3.3962e-01,  7.4723e-02,  5.1226e-01, -3.9759e-01,  5.1333e-03,
       -3.0929e-01,  4.8911e-02, -1.8610e-01, -4.1702e-01, -8.1639e-01,
       -1.6908e-01, -2.6246e-01, -1.5983e-02,  1.2479e-01, -3.7276e-02,
       -5.7125e-01, -1.6296e-01,  1.2376e-01, -5.5464e-02,  1.3244e-01,
        2.7519e-02,  1.2592e-01, -3.2722e-01, -4.9165e-01, -3.5559e-01,
       -3.0630e-01,  6.1185e-02, -1.6932e-01, -6.2405e-02,  6.5763e-01,
       -2.7925e-01, -3.0450e-03, -2.2400e-02, -2.8015e-01, -2.1975e-01,
       -4.3188e-01,  3.9864e-02, -2.2102e-01, -4.2693e-02,  5.2748e-02,
        2.8726e-01,  1.2315e-01, -2.8662e-02,  7.8294e-02,  4.6754e-01,
       -2.4589e-01, -1.1064e-01,  7.2250e-02, -9.4980e-02, -2.7548e-01,
       -5.4097e-01,  1.2823e-01, -8.2408e-02,  3.1035e-01, -6.33

In [4]:
#Vector for a sentence
nlp(u'The quick brown fox jumps over the lazy dog.').vector

array([-2.00720906e-01,  4.20156009e-02, -9.31793973e-02, -8.15414935e-02,
        2.50049680e-03,  1.67080835e-01, -1.08621001e-01,  7.71560054e-03,
        1.44465894e-01,  1.79850388e+00, -2.77828038e-01, -4.84851822e-02,
       -7.94451088e-02, -1.16215110e-01, -1.56261414e-01,  5.96945994e-02,
        5.02396040e-02,  1.05396795e+00, -8.51650070e-03, -2.77383685e-01,
       -1.63225681e-01,  2.29413994e-02, -1.59576014e-02, -2.49975801e-01,
        1.51895449e-01, -5.72511964e-02, -1.80625603e-01, -1.26084194e-01,
        1.05212606e-01, -1.70930997e-01, -1.81344628e-01,  2.11258501e-01,
        2.26855017e-02, -1.54004004e-02,  1.76830694e-01, -4.40463014e-02,
        7.02560022e-02, -4.22505140e-02, -1.25777842e-02, -1.30870016e-02,
        1.51386812e-01,  1.72551982e-02, -6.04443662e-02, -1.54603601e-01,
        9.66587141e-02,  5.71484976e-02, -1.51370004e-01,  1.12425432e-01,
        7.27925971e-02,  2.20516007e-02, -7.03053921e-02,  1.02219798e-01,
       -3.19202256e-04, -

In [5]:
# Both vectors have the same shape, since the vector of a sentence is the average of the vectors of all its tokens
print(nlp(u'lion').vector.shape)
print(nlp(u'The quick brown fox jumps over the lazy dog.').vector.shape)

(300,)
(300,)
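The averaging can be sketched with a few toy vectors. The 3-dimensional values below are made up for illustration (real spaCy vectors here have 300 dimensions), and `mean_vector` is a hypothetical helper, not a spaCy function.

```python
def mean_vector(vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    dims = len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dims)]


# Toy "token vectors" for a three-word sentence (made-up values)
token_vectors = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
    [5.0, 6.0, 7.0],
]

doc_vector = mean_vector(token_vectors)
print(doc_vector)       # [3.0, 4.0, 5.0]
print(len(doc_vector))  # 3, the same length as each token vector
```

However many tokens are averaged, the result keeps the dimensionality of a single token vector.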


In [7]:
#Similarity between tokens
tokens = nlp(u'lion cat pet')
for t in tokens:
    for t2 in tokens:
        print(t.text,t2.text,t.similarity(t2))

# Similarity here is cosine similarity.
# Cosine similarity ranges from -1 to 1; a token compared with itself gives 1.

lion lion 1.0
lion cat 0.5265438
lion pet 0.39923766
cat lion 0.5265438
cat cat 1.0
cat pet 0.7505457
pet lion 0.39923766
pet cat 0.7505457
pet pet 1.0


In [8]:
# Similarity between tokens with opposite meanings that are used in similar contexts
# e.g. you can like, love or hate a book

tokens = nlp(u'like love hate')
for t in tokens:
    for t2 in tokens:
        print(t.text,t2.text,t.similarity(t2))

like like 1.0
like love 0.657904
like hate 0.65746516
love like 0.657904
love love 1.0
love hate 0.63930994
hate like 0.65746516
hate love 0.63930994
hate hate 1.0


Here we can see that like-love and like-hate have almost the same similarity, since these words are often used in the same contexts.

In [10]:
len(nlp.vocab.vectors) # Number of vectors in the model's vector table

684830

In [11]:
tokens = nlp(u'dog cat nargle')
for t in tokens:
    print(t.text,t.has_vector,t.vector_norm,t.is_oov) ## is_oov checks if the word is out of vocabulary

dog True 7.0336733 False
cat True 6.6808186 False
nargle False 0.0 True


In [19]:
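from scipy import spatial

# Cosine similarity = 1 - cosine distance
def cosine_similarity(v1, v2):
    return 1 - spatial.distance.cosine(v1, v2)

king = nlp.vocab['king'].vector
man = nlp.vocab['man'].vector
woman = nlp.vocab['woman'].vector

# Vector arithmetic: king - man + woman
new_vec = king - man + woman
comp_similar = []

# Score every lowercase, alphabetic word in the vocab that has a vector.
# Note: iterating nlp.vocab only visits the lexemes currently loaded, which
# in some spaCy versions may be just the words seen so far.
for w in nlp.vocab:
    if w.has_vector and w.is_lower and w.is_alpha:
        similarity = cosine_similarity(new_vec, w.vector)
        comp_similar.append((w, similarity))

# Sort by similarity, highest first, and print the 10 closest words
comp_similar = sorted(comp_similar, key=lambda item: -item[1])
print([t[0].text for t in comp_similar[:10]])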

['king', 'woman', 'she', 'lion', 'who', 'fox', 'brown', 'when', 'dare', 'cat']
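In the output above, the input words king and woman rank first. A common refinement in analogy searches is to exclude the query words from the candidates. The sketch below shows the idea on made-up 2-dimensional toy vectors. The helper `most_similar`, its parameters, and the vector values are all illustrative assumptions, and results on real vectors will depend on the model and vocabulary.

```python
import math


def cosine_similarity(v1, v2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0


def most_similar(query, vectors, exclude=(), topn=3):
    """Rank words by cosine similarity to `query`, skipping excluded words.

    Hypothetical helper for illustration only.
    """
    scored = [
        (word, cosine_similarity(query, vec))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [word for word, _ in scored[:topn]]


# Toy vectors (made up): axis 0 is loosely "royalty", axis 1 loosely "gender"
toy = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.5, 0.1],
}

# king - man + woman, computed element by element
query = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]

print(most_similar(query, toy, exclude={"king", "man", "woman"}, topn=1))  # ['queen']
```

With the three query words filtered out, the remaining candidates are ranked purely on how close they are to the combined vector.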
