See the lab from the [course](https://www.coursera.org/learn/classification-vector-spaces-in-nlp) for more details.

# Manipulating Word Vectors

The goal of this lab is to use pre-trained word vectors to predict relationships between words, using cosine similarity and Euclidean distance as similarity metrics.

## Loading pretrained vectors

In [1]:
import pandas as pd 
import numpy as np 
import pickle
import pprint as pp

# Load pre-trained embeddings
# (the with-block closes the file once the pickle has been read)
with open("word_embeddings_subset.p", "rb") as f:
    word_embeddings = pickle.load(f)
len(word_embeddings) # there should be 243 words

243

In [2]:
# see which words are in the vocab
for key in sorted(word_embeddings.keys()):
    print(key)

Abuja
Accra
Afghanistan
Albania
Algeria
Algiers
Amman
Angola
Ankara
Antananarivo
Apia
Armenia
Ashgabat
Asmara
Astana
Athens
Australia
Austria
Azerbaijan
Baghdad
Bahamas
Bahrain
Baku
Bamako
Bangkok
Bangladesh
Banjul
Beijing
Beirut
Belarus
Belgium
Belgrade
Belize
Belmopan
Berlin
Bern
Bishkek
Botswana
Bratislava
Brussels
Bucharest
Budapest
Bujumbura
Bulgaria
Burundi
Cairo
Canada
Canberra
Caracas
Chile
China
Chisinau
Conakry
Copenhagen
Croatia
Cuba
Cyprus
Dakar
Damascus
Denmark
Dhaka
Doha
Dominica
Dublin
Dushanbe
Ecuador
Egypt
England
Eritrea
Estonia
Fiji
Finland
France
Funafuti
Gabon
Gaborone
Gambia
Georgetown
Georgia
Germany
Ghana
Greece
Greenland
Guinea
Guyana
Hanoi
Harare
Havana
Helsinki
Honduras
Hungary
Indonesia
Iran
Iraq
Ireland
Islamabad
Italy
Jakarta
Jamaica
Japan
Jordan
Kabul
Kampala
Kathmandu
Kazakhstan
Kenya
Khartoum
Kiev
Kigali
Kingston
Kyrgyzstan
Laos
Latvia
Lebanon
Liberia
Libreville
Libya
Liechtenstein
Lilongwe
Lima
Lisbon
Lithuania
Ljubljana
London
Luanda
Lusaka
Macedonia


In [3]:
countryVector = word_embeddings['country'] # Get the vector representation for the word 'country'
# Print the type of the vector. Note it is a numpy array
print(type(countryVector)) 
# Print the values of the vector.  
print(countryVector) 

<class 'numpy.ndarray'>
[-0.08007812  0.13378906  0.14355469  0.09472656 -0.04736328 -0.02355957
 -0.00854492 -0.18652344  0.04589844 -0.08154297 -0.03442383 -0.11621094
  0.21777344 -0.10351562 -0.06689453  0.15332031 -0.19335938  0.26367188
 -0.13671875 -0.05566406  0.07470703 -0.00070953  0.09375    -0.14453125
  0.04296875 -0.01916504 -0.22558594 -0.12695312 -0.0168457   0.05224609
  0.0625     -0.1484375  -0.01965332  0.17578125  0.10644531 -0.04760742
 -0.10253906 -0.28515625  0.10351562  0.20800781 -0.07617188 -0.04345703
  0.08642578  0.08740234  0.11767578  0.20996094 -0.07275391  0.1640625
 -0.01135254  0.0025177   0.05810547 -0.03222656  0.06884766  0.046875
  0.10107422  0.02148438 -0.16210938  0.07128906 -0.16210938  0.05981445
  0.05102539 -0.05566406  0.06787109 -0.03759766  0.04345703 -0.03173828
 -0.03417969 -0.01116943  0.06201172 -0.08007812 -0.14941406  0.11914062
  0.02575684  0.00302124  0.04711914 -0.17773438  0.04101562  0.05541992
  0.00598145  0.03027344 -0.07

In [4]:
# Get the number of dimensions
dimensions = countryVector.shape[0]
print(f'{dimensions=}')

dimensions=300


In [6]:
# dataframes can be more practical for viewing and manipulating data
df_embeds = pd.DataFrame(word_embeddings)
df_embeds.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,290,291,292,293,294,295,296,297,298,299
country,-0.080078,0.133789,0.143555,0.094727,-0.047363,-0.023560,-0.008545,-0.186523,0.045898,-0.081543,...,-0.145508,0.067383,-0.244141,-0.077148,0.047607,-0.075195,-0.149414,-0.044189,0.097168,0.067383
city,-0.010071,0.057373,0.183594,-0.040039,-0.029785,-0.079102,0.071777,0.013306,-0.143555,0.011292,...,0.024292,-0.168945,-0.062988,0.117188,-0.020508,0.030273,-0.247070,-0.122559,0.076172,-0.234375
China,-0.073242,0.135742,0.108887,0.083008,-0.127930,-0.227539,0.151367,-0.045654,-0.065430,0.034424,...,0.140625,0.087402,0.152344,0.079590,0.006348,-0.037842,-0.183594,0.137695,0.093750,-0.079590
Iraq,0.191406,0.125000,-0.065430,0.060059,-0.285156,-0.102539,0.117188,-0.351562,-0.095215,0.200195,...,-0.100586,-0.077148,-0.123047,0.193359,-0.153320,0.089355,-0.173828,-0.054688,0.302734,0.105957
oil,-0.139648,0.062256,-0.279297,0.063965,0.044434,-0.154297,-0.184570,-0.498047,0.047363,0.110840,...,-0.195312,-0.345703,0.217773,-0.091797,0.051025,0.061279,0.194336,0.204102,0.235352,-0.051025
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Belmopan,-0.265625,-0.380859,-0.049072,0.155273,-0.044922,-0.248047,-0.010376,-0.105957,-0.328125,0.119629,...,0.047119,-0.034424,-0.005219,-0.265625,0.094727,0.170898,-0.353516,0.072754,-0.042969,0.229492
Vaduz,0.324219,-0.056885,0.031494,-0.045898,0.042969,-0.062256,0.004089,-0.328125,-0.151367,0.242188,...,-0.134766,-0.226562,-0.083496,-0.152344,-0.179688,0.205078,-0.016968,0.156250,0.152344,0.027710
Paramaribo,-0.235352,-0.063477,0.154297,0.081055,-0.002716,-0.126953,-0.443359,-0.218750,0.038574,-0.063965,...,-0.318359,-0.187500,0.304688,-0.192383,0.050781,0.234375,-0.341797,-0.100098,0.183594,-0.128906
Nuuk,-0.318359,-0.546875,0.085449,-0.167969,-0.376953,-0.453125,-0.332031,-0.447266,-0.105469,-0.024780,...,0.094727,-0.021484,0.009766,-0.294922,-0.226562,0.084473,-0.104980,0.114746,0.163086,-0.225586
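
A one-row-per-word table like the one above can also be built without a transpose. The following is a minimal sketch, assuming a tiny hypothetical dictionary `toy_embeddings` with made-up 3-dimensional vectors standing in for the real 300-dimensional ones:

```python
import numpy as np
import pandas as pd

# Hypothetical toy embeddings (made-up values, 3 dimensions instead of 300)
toy_embeddings = {
    "country": np.array([0.1, 0.2, 0.3]),
    "city": np.array([0.0, 0.5, -0.1]),
    "oil": np.array([-0.4, 0.1, 0.2]),
}

# orient="index" places each word on its own row, so no transpose is needed
df_toy = pd.DataFrame.from_dict(toy_embeddings, orient="index")

print(df_toy.shape)        # (number of words, number of dimensions)
print(df_toy.loc["city"])  # the vector of a single word
```

With words as the index, `df_toy.loc[word]` gives direct access to a word's vector.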


## Predicting relationships between words

You will now write a function that predicts a word given a relationship between two initial words:

* The function will take three words as input.
* The first two are related to each other.
* It will predict a fourth word that is related to the third word, in a manner similar to the relationship we can infer from the first two words.
* As an example, "Italy is to Rome what France is to __"?
* You will write a program that is capable of finding the fourth word.
* This will be applied to finding the countries of capitals.
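
The analogy in the list above can be illustrated with plain vector arithmetic. The sketch below assumes made-up 2-dimensional vectors (the names `toy`, `predicted` and `nearest` are hypothetical and not part of the lab), arranged so that every country sits at the same offset from its capital:

```python
import numpy as np

# Made-up 2-D "embeddings": each country is its capital shifted by (0, 2)
toy = {
    "Rome": np.array([1.0, 1.0]),
    "Italy": np.array([1.0, 3.0]),
    "Paris": np.array([4.0, 1.0]),
    "France": np.array([4.0, 3.0]),
    "Berlin": np.array([7.0, 1.0]),  # distractor
}

# "Italy is to Rome what France is to __": apply the Italy -> Rome offset to France
predicted = toy["Rome"] - toy["Italy"] + toy["France"]

# Pick the closest remaining word, excluding the three input words
candidates = [w for w in toy if w not in ("Italy", "Rome", "France")]
nearest = min(candidates, key=lambda w: np.linalg.norm(toy[w] - predicted))

print(predicted, nearest)
```

Real embeddings do not line up this neatly, which is why the closest word is searched for with a similarity metric rather than matched exactly.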



To do this, first write functions to compute the cosine similarity metric or the Euclidean distance.

### 1. Cosine Similarity

Cosine similarity is defined as:

$$ \operatorname{simil}(\mathbf a, \mathbf b) = \cos \theta = \frac{\mathbf a \cdot \mathbf b}{\Vert \mathbf a \Vert \, \Vert \mathbf b \Vert} $$

And the Euclidean norm of a vector is defined as   

$$ \Vert \mathbf a \Vert = \sqrt{\sum_{i=1}^{n}{a_i^2}} $$


**Implement a function that takes in 2 word vectors and computes their cosine similarity**  
You may use `np.dot` to calculate the dot product.

In [7]:
def cosine_similarity(a, b):
    """ 
    Input :
    a : a numpy array representing word a as a vector
    b : a numpy array representing word b as a vector
    
    Output:
    cos_ab : a scalar, the cosine of the angle between a and b (closer to 1 means more similar)
    
    """
    
    dot_ab = np.dot(a,b)
    norm_a = np.sqrt(np.sum(a**2))
    norm_b = np.sqrt(np.sum(b**2))
    
    # compute cos
    cos_ab = dot_ab/(norm_a*norm_b)
    
    # output
    return cos_ab

In [8]:
# test the function
king = word_embeddings['king']
queen = word_embeddings['queen']

cosine_similarity(king, queen)

0.6510956

**Expected Out**  
0.6510956
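
A quick sanity check on hand-picked vectors can help confirm an implementation of the formula. The sketch below is self-contained: `cosine_similarity_check` is a hypothetical re-implementation that uses `np.linalg.norm` for the norms, and the input vectors are made up for illustration.

```python
import numpy as np

def cosine_similarity_check(a, b):
    """Cosine of the angle between a and b (hypothetical helper for this check)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

x = np.array([1.0, 0.0])

# Same direction, different length: the similarity should be 1
same_direction = cosine_similarity_check(x, np.array([3.0, 0.0]))
# Perpendicular vectors: the similarity should be 0
orthogonal = cosine_similarity_check(x, np.array([0.0, 2.0]))
# Opposite direction: the similarity should be -1
opposite = cosine_similarity_check(x, np.array([-5.0, 0.0]))
# 45 degrees apart: the similarity should be cos(45 degrees) = 1/sqrt(2)
diagonal = cosine_similarity_check(x, np.array([1.0, 1.0]))

print(same_direction, orthogonal, opposite, diagonal)
```

The first case shows that cosine similarity depends only on direction, not on vector length.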

### 2. Euclidean Distance

Using the norm formula above, write a function which computes the Euclidean distance between 2 word vectors.  
**Hint**: you are looking for the norm of the vector which separates the tips of the 2 vectors...

In [15]:
def euclidean(a, b):
    """
    Input:
    a : a numpy array representing word a as a vector
    b : a numpy array representing word b as a vector
    
    Output:
    d: scalar representing the Euclidean distance between a and b
    """
    
    # the vector going from a to b
    vec_a2b = b-a
    
    # norm of that vector / Euclidean distance between a and b
    d = np.linalg.norm(vec_a2b)

    return d

In [16]:
# Test your function
euclidean(king, queen)

2.4796925

**Expected Out**  
2.4796925
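
As a small worked example, the sketch below uses made-up 2-dimensional points that form a 3-4-5 right triangle, and compares the explicit square-root-of-sum-of-squares formula with `np.linalg.norm`. The name `euclidean_check` is hypothetical.

```python
import numpy as np

def euclidean_check(a, b):
    """Distance between a and b, written out from the norm formula."""
    diff = b - a
    return np.sqrt(np.sum(diff ** 2))

p = np.array([0.0, 0.0])
q = np.array([3.0, 4.0])

d_formula = euclidean_check(p, q)   # sqrt(3**2 + 4**2)
d_numpy = np.linalg.norm(q - p)     # same quantity via NumPy
d_scaled = euclidean_check(p, 2 * q)

print(d_formula, d_numpy, d_scaled)
```

Doubling `q` doubles the distance: unlike cosine similarity, the Euclidean distance is sensitive to vector length.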

### 3. Finding the country of each capital

Now, you can use the previous functions to compute similarities between vectors, and use these to find the countries of capital cities.  
You will write a function that takes three words and the embeddings dictionary as input. Your task is to find the correct country. For example, given the following words:

* 1: Athens, 2: Greece, 3: Baghdad

your task is to predict the country 4: Iraq.

In [9]:
def get_country(city1, country1, city2, embeddings):
    """
    Input:
        city1: a string (the capital city of country1)
        country1: a string (the country of city1)
        city2: a string (the capital city of country2)
        embeddings: a dictionary where the keys are words and values are their embeddings
        
    Output:
        country: a tuple (country_name, similarity_score) with the most likely country and its similarity score
    """
    
    
    # store city1, country1 and city2 in a set called group
    group = {city1, country1, city2}
    
    # get their embeddings
    city1_emb = embeddings[city1]
    country1_emb = embeddings[country1]
    city2_emb = embeddings[city2]
    
    # calculate the embedding of country2 using simple linear algebra
    # Remember : king - man + woman = queen
    vec = country1_emb - city1_emb + city2_emb
    
    
    # Finding the closest word embedding :
    
    # loop through all the words in the embeddings dict, skipping any word that is in the group defined above,
    # and calculate the similarity to vec (using whichever metric you prefer). If the similarity is higher, overwrite
    # the stored best_similarity and update country, which stores a tuple (country_name, similarity_score).
    # Finally, return the country tuple.
    
    
    # Initialize the similarity to -1 (it will be replaced by similarities that are closer to +1)
    best_similarity = -1
    
    # initialize country to an empty string
    country = ''
    
    # loop through all words in the embeddings dictionary
    for word in embeddings.keys():
        
        # check that the word is not in group
        if word not in group:
            # get the embedding for the word
            word_emb = embeddings[word]
            # compute the similarity
            similarity = cosine_similarity(vec, word_emb)
            # if the similarity is higher
            if similarity > best_similarity:
                # update the best_similarity variable
                best_similarity = similarity
                # update the country variable with a tuple which contains the country and the similarity
                country = (word, similarity)
        
    return country

In [10]:
# Test your function
get_country('Paris', 'France', 'Cairo', word_embeddings)

('Egypt', 0.72618926)
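
One possible variant ranks all candidate words instead of keeping only the best one. The sketch below is an illustration under stated assumptions, not the lab's solution: `nearest_words` and the toy 2-dimensional embeddings are hypothetical, and the vectors are stacked into a matrix so that all cosine similarities are computed in one vectorized step.

```python
import numpy as np

def nearest_words(vec, embeddings, exclude=(), k=3):
    """Return the k words most similar to vec as (word, cosine similarity) pairs."""
    words = [w for w in embeddings if w not in exclude]
    matrix = np.stack([embeddings[w] for w in words])  # shape: (n_words, dim)
    # cosine similarity of every row of matrix with vec
    sims = matrix @ vec / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(vec))
    order = np.argsort(-sims)[:k]                      # highest similarity first
    return [(words[i], float(sims[i])) for i in order]

# Made-up 2-D embeddings: each country is its capital shifted by (0, 2)
toy = {
    "Athens": np.array([1.0, 1.0]),
    "Greece": np.array([1.0, 3.0]),
    "Baghdad": np.array([4.0, 1.0]),
    "Iraq": np.array([4.0, 3.0]),
    "oil": np.array([3.0, -2.0]),
}

query = toy["Greece"] - toy["Athens"] + toy["Baghdad"]
ranking = nearest_words(query, toy, exclude=("Athens", "Greece", "Baghdad"), k=2)
print(ranking)
```

Returning several candidates makes it easier to inspect near misses when the top prediction is wrong.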

### 4. Model Accuracy

Now you can test your new function on a dataset of capital/country pairs and check the accuracy of the model:

$$\text{Accuracy}=\frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$$

In [11]:
# load data file
data = pd.read_csv('capitals.txt', delimiter=' ')
# name columns
data.columns = ['city1', 'country1', 'city2', 'country2']

data.head() # print first 5 lines

Unnamed: 0,city1,country1,city2,country2
0,Athens,Greece,Bangkok,Thailand
1,Athens,Greece,Beijing,China
2,Athens,Greece,Berlin,Germany
3,Athens,Greece,Bern,Switzerland
4,Athens,Greece,Cairo,Egypt


Write a function that computes the accuracy on the dataset. You have to iterate over every row to get the corresponding words and feed them into your `get_country` function.

In [12]:
def get_accuracy(word_embeddings, data):
    '''
    Input:
        word_embeddings: a dictionary where the key is a word and the value is its embedding
        data: a pandas dataframe containing all the country and capital city pairs
    
    Output:
        accuracy: the accuracy of the model
    '''

    # initialize num correct to zero
    num_correct = 0

    # loop through the rows of the dataframe
    for i, row in data.iterrows():

        # get city1
        city1 = row['city1']

        # get country1
        country1 = row['country1']

        # get city2
        city2 =  row['city2']

        # get country2
        country2 = row['country2']

        # use get_country to find the predicted country2
        predicted_country2, _ = get_country(city1, country1, city2, word_embeddings)

        # if the predicted country2 is the same as the actual country2...
        if predicted_country2 == country2:
            # increment the number of correct by 1
            num_correct += 1

    # get the number of rows in the data dataframe (length of dataframe)
    m = len(data)

    # calculate the accuracy by dividing the number correct by m
    accuracy = num_correct/m

    return accuracy


In [13]:
# compute accuracy
accuracy = get_accuracy(word_embeddings, data)
print(f"Accuracy is {accuracy:.3f}")

Accuracy is 0.919


**Expected Out**  
0.92
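
The accuracy formula can also be expressed as the mean of a boolean column. The following is a minimal sketch, assuming a hypothetical stand-in predictor `toy_predict` (a plain lookup table, not the embedding-based `get_country`) and a made-up three-row dataframe:

```python
import pandas as pd

# Made-up rows in the same column layout as the capitals dataset
toy_data = pd.DataFrame({
    "city1": ["Athens", "Athens", "Athens"],
    "country1": ["Greece", "Greece", "Greece"],
    "city2": ["Bangkok", "Beijing", "Berlin"],
    "country2": ["Thailand", "China", "Germany"],
})

# Hypothetical predictor: a lookup table with one deliberately wrong answer
lookup = {"Bangkok": "Thailand", "Beijing": "China", "Berlin": "Austria"}

def toy_predict(city1, country1, city2):
    return lookup[city2]

# Predict country2 for each row, then compare with the true label
predicted = toy_data.apply(
    lambda row: toy_predict(row["city1"], row["country1"], row["city2"]), axis=1
)
toy_accuracy = (predicted == toy_data["country2"]).mean()

print(f"Accuracy is {toy_accuracy:.3f}")
```

Two of the three toy predictions match, so the mean of the boolean comparison is 2/3, the same value the counting loop would produce.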