[View in Colaboratory](https://colab.research.google.com/github/avalerie/nlp/blob/master/Word_vectors.ipynb)

In [0]:
import pandas as pd
import numpy as np
import warnings; warnings.simplefilter('ignore')

Download the pre-trained GloVe word vectors (glove.6B.50d.txt) and save them to Google Colab.

There are [various methods](https://colab.research.google.com/notebooks/io.ipynb) of loading data into Colab.

-> From local drive. The file is stored in the Colab session storage and is removed when the session ends. Uploaded files can be viewed in the "Files" menu. To manage files, use these commands:

`!rm file_name # remove a file`

`!mv file_name_A file_name_B # rename file_name_A to file_name_B`

In [0]:
# 1. Upload files from the local drive to the Colab session. files.upload() opens a file picker and shows the upload progress (%)
from google.colab import files
uploaded = files.upload()

-> From Google Drive. To access files, we need to mount Google Drive on the virtual machine.

In [0]:
# 2. Mount Google Drive locally
from google.colab import drive
drive.mount("/content/gdrive/")

The GloVe txt file contains 50-dimensional vectors for 400K uncased words.

In [0]:
import csv  # provides QUOTE_NONE, so quote characters in tokens are read literally

word_vec = pd.read_table('/content/gdrive/My Drive/Glove/glove.6B.50d.txt', sep=" ", index_col=0, header=None, quoting=csv.QUOTE_NONE)
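
Each line of the GloVe text file holds one token followed by its space-separated vector components, which is why the call above uses `sep=" "` and `index_col=0`. A minimal sketch of parsing that layout without pandas, run on a two-line in-memory sample with made-up 3-dimensional values (the helper name `parse_glove_lines` is hypothetical, not part of any library):

```python
import io
import numpy as np

def parse_glove_lines(handle):
    """Hypothetical helper: map each token to its vector (NumPy array)."""
    vectors = {}
    for line in handle:
        parts = line.rstrip("\n").split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        # first field is the token, the remaining fields are the components
        vectors[parts[0]] = np.array([float(x) for x in parts[1:]], dtype=np.float32)
    return vectors

# made-up sample values, for illustration only
sample = io.StringIO("the 0.1 0.2 -0.3\n, 0.4 -0.5 0.6\n")
toy_vectors = parse_glove_lines(sample)
print(sorted(toy_vectors), toy_vectors["the"].shape)
```

For the real file, the same function could presumably be applied to `open(path, encoding="utf-8")`, at the cost of holding 400K small arrays in a dictionary.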

In [15]:
# check the shape of word vectors
print("Glove word embedding file shape: ", word_vec.shape)

# view the first vectors
word_vec.head(3)

Glove word embedding file shape:  (400000, 50)


| word | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| the | 0.418 | 0.24968 | -0.41242 | 0.1217 | 0.34527 | -0.044457 | -0.49688 | -0.17862 | -0.00066 | -0.6566 | ... | -0.29871 | -0.15749 | -0.34758 | -0.045637 | -0.44251 | 0.18785 | 0.002785 | -0.18411 | -0.11514 | -0.78581 |
| , | 0.013441 | 0.23682 | -0.16899 | 0.40951 | 0.63812 | 0.47709 | -0.42852 | -0.55641 | -0.364 | -0.23938 | ... | -0.080262 | 0.63003 | 0.32111 | -0.46765 | 0.22786 | 0.36034 | -0.37818 | -0.56657 | 0.044691 | 0.30392 |
| . | 0.15164 | 0.30177 | -0.16763 | 0.17684 | 0.31719 | 0.33973 | -0.43478 | -0.31086 | -0.44999 | -0.29486 | ... | -6.4e-05 | 0.068987 | 0.087939 | -0.10285 | -0.13931 | 0.22314 | -0.080803 | -0.35652 | 0.016413 | 0.10216 |


###  Cosine similarity

Cosine similarity measures the degree of similarity between two vectors.

$$\text{CosineSimilarity}(u, v) = \frac {u \cdot v} {||u||_2 \, ||v||_2} = \cos(\theta) $$

where $u \cdot v$ is the dot product (or inner product) of the two vectors, $||u||_2$ is the norm (or length) of the vector $u$, and $\theta$ is the angle between $u$ and $v$. The similarity depends only on this angle, not on the lengths of the vectors. If $u$ and $v$ point in nearly the same direction, their cosine similarity is close to 1; if they are orthogonal it is 0; and if they point in opposite directions it is -1.

The norm of $u$ is defined as:

$$ ||u||_2 = \sqrt{\sum_{i=1}^{n} u_i^2}$$
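
As a quick sanity check of the two formulas, here is a worked example with made-up vectors $u = (3, 4)$ and $v = (4, 3)$ (illustrative values, not taken from the GloVe file):

$$ u \cdot v = 3 \cdot 4 + 4 \cdot 3 = 24, \qquad ||u||_2 = \sqrt{3^2 + 4^2} = 5, \qquad ||v||_2 = \sqrt{4^2 + 3^2} = 5 $$

$$ \cos(\theta) = \frac{24}{5 \cdot 5} = 0.96 $$

The two vectors point in similar directions, so the value is close to 1.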


In [0]:
def cosine_similarity(u, v):
    # L2 norms of u and v
    norm_u = np.sqrt(np.sum(u**2))
    norm_v = np.sqrt(np.sum(v**2))
    # Cosine similarity: dot product divided by the product of the norms
    similarity = np.dot(u, v) / (norm_u * norm_v)

    return similarity

Calculate the similarity between word pairs using the cosine_similarity function.

In [34]:
print("cosine_similarity:")
pairs = [('father', 'mother'), ('king', 'queen'), ('tree', 'woods'), ('apple', 'orange'), ('water', 'earth')]
for w1, w2 in pairs:
  print('{:10} : {:10} = {}'.format(w1, w2, cosine_similarity(word_vec.loc[w1], word_vec.loc[w2])))

cosine_similarity:
father     : mother     = 0.8909038442893616
king       : queen      = 0.783904301096412
tree       : woods      = 0.47122993590723555
apple      : orange     = 0.5388040721946523
water      : earth      = 0.6150415301704791


Find the word most similar to a given word.

In [0]:
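def most_similar(word, word_vec):
  # loop over all rows of word_vec to find the word with the highest cosine similarity

  max_cos_sim = -100  # initialise below the minimum possible value (-1)
  sim_word = None
  target = word_vec.loc[word]  # look up the query vector once
  for w in word_vec.index:
    if w == word:
      continue  # skip the query word itself, which always scores 1
    cos_sim = cosine_similarity(target, word_vec.loc[w])
    if cos_sim > max_cos_sim:
      max_cos_sim = cos_sim
      sim_word = w

  return sim_word, max_cos_sim  # best score found, not the last one computed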
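
Looping over 400K rows in Python is slow, and the loop finds only a single neighbour. A minimal vectorised sketch that returns the top `n` neighbours is shown here on a toy vocabulary with made-up 2-D vectors; the helper name `top_n_similar` and its arguments are assumptions for illustration:

```python
import numpy as np

def top_n_similar(word, words, matrix, n=3):
    """Hypothetical helper: the n nearest words to `word` by cosine similarity."""
    # normalise every row to unit length so a dot product equals the cosine
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = unit @ unit[words.index(word)]
    # sort from most to least similar and drop the query word itself
    order = [i for i in np.argsort(-sims) if words[i] != word]
    return [(words[i], float(sims[i])) for i in order[:n]]

# toy vectors with made-up values, for illustration only
toy_words = ["river", "lake", "stone", "cloud"]
toy_matrix = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.5, 0.5]])
print(top_n_similar("river", toy_words, toy_matrix, n=2))
```

With the DataFrame loaded earlier, `words` would presumably be `list(word_vec.index)` and `matrix` would be `word_vec.values`.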

### Operations on GloVe vectors with [Gensim](https://radimrehurek.com/gensim/models/keyedvectors.html)

Gensim converts GloVe vectors into the word2vec format with its *glove2word2vec* script.

In [0]:
# install Gensim in the Colab session
!pip install gensim

In [0]:
from gensim.models import KeyedVectors

# absolute paths of the GloVe input file and of the converted word2vec-format output file
glove_file = '/content/gdrive/My Drive/Glove/glove.6B.50d.txt'
tmp_file = '/content/gdrive/My Drive/Glove/word2vec.txt'

In [0]:
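# default way (through CLI): python -m gensim.scripts.glove2word2vec --input <glove_file> --output <w2v_file>
from gensim.scripts.glove2word2vec import glove2word2vec
# convert the GloVe file to word2vec text format (adds the "<vocab_size> <dimensions>" header line)
glove2word2vec(glove_file, tmp_file)
# load the converted vectors as KeyedVectors
# (gensim 4.x can also read the GloVe file directly via load_word2vec_format(glove_file, no_header=True))
model = KeyedVectors.load_word2vec_format(tmp_file)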

In [69]:
# list of most similar words
result=model.similar_by_word("water")
for i in result:
  print("{:10}: {:.4f}".format(i[0],i[1]))

dry       : 0.8274
natural   : 0.7858
sand      : 0.7737
waste     : 0.7724
drinking  : 0.7562
clean     : 0.7492
ocean     : 0.7453
soil      : 0.7451
sewage    : 0.7430
seawater  : 0.7394


In [86]:
model.similarity('apple', 'banana') # cosine similarity of two words

0.5607928

In [98]:
model.distance('apple', 'banana') # = 1-similarity

0.43920719623565674

In [67]:
model.most_similar(positive=['woman', 'king'], negative=['man'])[0]

('queen', 0.8523603677749634)
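
The analogy query above can be read as vector arithmetic: add the "positive" vectors, subtract the "negative" ones, and look for the nearest remaining word by cosine similarity. The sketch below approximates that idea on toy 4-D vectors with made-up values; it is not Gensim's implementation, and the helper name `analogy` is hypothetical:

```python
import numpy as np

def analogy(positive, negative, vectors):
    """Hypothetical helper: nearest word to sum(positive) - sum(negative)."""
    # unit-normalise so that every word contributes equally to the target
    unit = {w: v / np.linalg.norm(v) for w, v in vectors.items()}
    target = sum(unit[w] for w in positive) - sum(unit[w] for w in negative)
    target = target / np.linalg.norm(target)
    # exclude the query words, then rank the rest by cosine similarity
    candidates = [w for w in unit if w not in positive and w not in negative]
    return max(candidates, key=lambda w: float(unit[w] @ target))

# toy axes (assumed for illustration): royalty, male, female, fruit
toy_vectors = {
    "king":  np.array([1.0, 1.0, 0.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0, 0.0]),
    "man":   np.array([0.0, 1.0, 0.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0, 0.0]),
    "apple": np.array([0.0, 0.0, 0.0, 1.0]),
}
print(analogy(["woman", "king"], ["man"], toy_vectors))
```

Excluding the query words matters: without that step the nearest neighbour of the combined vector is often one of the inputs.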

In [90]:
model.doesnt_match("breakfast cereal dinner lunch".split())

'cereal'

In [91]:
model.doesnt_match("Paris France Africa London".lower().split())

'africa'

In [93]:
model.n_similarity(['sushi', 'shop'], ['japanese', 'restaurant'])

0.74835527