[View in Colaboratory](https://colab.research.google.com/github/avalerie/misc/blob/master/Word_vectors.ipynb)

In [0]:
import pandas as pd
import numpy as np
import csv

Download the GloVe pre-trained word vectors (glove.6B.50d.txt) and save them in Google Colab.

There are [various methods](https://colab.research.google.com/notebooks/io.ipynb) of loading data into Colab.

1.   From local drive. The file is stored in the Colab session storage and is removed when the session ends. Uploaded files can be viewed in the "Files" menu. To manage files, use shell commands such as:

> > `!rm file_name # to remove file` \
> > `!mv file_name_A file_name_B # to rename file_A to file_B` 

In [0]:
# 1. Upload files from the local drive to the Colab session storage.
# files.upload() opens a file picker and shows the upload progress (%) while the files are uploading
from google.colab import files
uploaded = files.upload() 

2.   From Google Drive. To view the files, we need to mount Google Drive on the virtual machine.

In [0]:
# 2. Mount Google Drive locally
from google.colab import drive
drive.mount("/content/gdrive/")

The GloVe txt file contains 50-dimensional vectors trained on a corpus of 6B tokens (400K vocab, uncased). Convert the GloVe txt file into a pandas dataframe of shape (400000, 50).

In [0]:
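# keep_default_na=False keeps tokens such as "nan" or "null" as strings in the index
word_vec = pd.read_table('glove.6B.50d.txt', sep=" ", index_col=0, header=None, quoting=csv.QUOTE_NONE, keep_default_na=False)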

In [0]:
# check the shape of word vectors
word_vec.shape[0]==400000
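
To see what the `read_table` call does without downloading the full file, here is a minimal sketch on a made-up three-line sample in the same text format (one token followed by space-separated floats). The name `toy_vec` and the values are hypothetical and only for illustration. The sample includes a `"` token to show why `quoting=csv.QUOTE_NONE` is passed: without it, the parser would treat the quote character as the start of a quoted field.

```python
import csv
import io

import pandas as pd

# Hypothetical sample in GloVe text format; the numbers are invented.
sample = io.StringIO(
    'the 0.1 0.2 0.3\n'
    '" 0.4 0.5 0.6\n'
    'cat 0.7 0.8 0.9\n'
)

# Same arguments as the call above: the first column becomes the index (the token),
# the remaining columns hold the vector components.
toy_vec = pd.read_table(sample, sep=" ", index_col=0, header=None,
                        quoting=csv.QUOTE_NONE)

print(toy_vec.shape)               # (rows, vector dimension)
print(list(toy_vec.index))         # the tokens, including the quote character
print(toy_vec.loc["cat"].values)   # the vector for one token as a numpy array
```

Each row is looked up by its token with `.loc`, which is how individual word vectors can be pulled out of the full dataframe.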

###  Cosine similarity

Cosine similarity reflects the degree of similarity between vectors.

$$\text{CosineSimilarity}(u, v) = \frac{u \cdot v}{\|u\|_2 \, \|v\|_2} = \cos(\theta)$$

where $u \cdot v$ is the dot product (or inner product) of the two vectors, $\|u\|_2$ is the L2 norm (or length) of the vector $u$, and $\theta$ is the angle between $u$ and $v$. The similarity depends only on this angle, not on the lengths of the vectors, and ranges from $-1$ to $1$. It is close to 1 when $u$ and $v$ point in nearly the same direction, near 0 when they are roughly orthogonal, and close to $-1$ when they point in opposite directions.

The norm of $u$ is defined as:

$$\|u\|_2 = \sqrt{\sum_{i=1}^{n} u_i^2}$$
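
As a quick sanity check of the two formulas, the following sketch works through them on two small hand-picked vectors. The vectors and variable names are assumptions chosen so the arithmetic is easy to follow by hand; they are not taken from the GloVe data.

```python
import numpy as np

u = np.array([3.0, 4.0])
v = np.array([4.0, 3.0])

# L2 norms: sqrt(3^2 + 4^2) = 5 for both vectors
norm_u = np.sqrt(np.sum(u ** 2))
norm_v = np.sqrt(np.sum(v ** 2))

# Dot product: 3*4 + 4*3 = 24, so the cosine is 24 / (5 * 5) = 0.96
cos_uv = np.dot(u, v) / (norm_u * norm_v)
print(norm_u, norm_v, cos_uv)

# Scaling a vector leaves the cosine unchanged; flipping its sign negates it
cos_scaled = np.dot(u, 10 * v) / (norm_u * np.sqrt(np.sum((10 * v) ** 2)))
cos_flipped = np.dot(u, -v) / (norm_u * norm_v)
print(cos_scaled, cos_flipped)
```

The scaled case illustrates that only the angle matters: multiplying $v$ by 10 changes its norm but not the similarity.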


In [0]:
def cosine_similarity(u, v):
    """
    Cosine similarity reflects the degree of similarity between u and v.

    Arguments:
        u -- a word vector of shape (n,)
        v -- a word vector of shape (n,)

    Returns:
        similarity -- the cosine similarity between u and v, as defined by the formula above.
    """
    # Dot product of u and v
    dot = np.dot(u, v)

    # L2 norms of u and v
    norm_u = np.sqrt(np.sum(u**2))
    norm_v = np.sqrt(np.sum(v**2))

    # Cosine similarity: dot product divided by the product of the norms
    similarity = dot / (norm_u * norm_v)

    return similarity

In [0]:
# Look up each word's row in the word_vec dataframe loaded above
father = word_vec.loc["father"].values
mother = word_vec.loc["mother"].values
print("cosine_similarity(father, mother) = ", cosine_similarity(father, mother))