<a href="https://colab.research.google.com/github/andrewpkitchin/Word-Embeddings/blob/master/word_embeddings_contempary_models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook we demonstrate how to use the contemporary word embedding models supported by the Gensim library. We calculate the cosine similarity between each entity word and each moral standing word, as well as the cosine similarity between each entity and a group (average) vector of the moral standing words.

**Key tips**

We suggest mounting a Google Drive to store the CSV files of cosines produced for each model. See https://github.com/RaRe-Technologies/gensim-data for a list of models and documentation.

In [0]:
# Dependencies

from google.colab import drive
import csv, time
import numpy as np
import gensim.downloader as api


We first mount our Google Drive and navigate to the desired folder.

In [0]:
drive.mount('/content/drive')

%cd drive/My\ Drive/word2vecProject/csvFiles

Mounted at /content/drive
/content/drive/My Drive/word2vecProject/csvFiles


In [0]:
list_of_models = ['word2vec-google-news-300']

#'glove-twitter-100', 'glove-twitter-200', 'glove-twitter-25', 'glove-twitter-50', 
#'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-wiki-gigaword-50',
#'fasttext-wiki-news-subwords-300'

Here we create a function that returns the normalized (unit-length) vector representation of a word from a model. Note that word_vector is a global variable, defined as follows:

word_vector = api.load(model_name)

We assign it when we load and run the models below.

We also create a function to compute the cosine of two vectors, a function to compute the average vector of a list of words, and a function to compute the relative norm difference between a word vector and two group vectors.

In [0]:
def norm(word):
  return word_vector[word]/np.linalg.norm(word_vector[word])
 

def cosine_similarity(vec1,vec2):
  return np.dot(vec1, vec2)/(np.linalg.norm(vec1)* np.linalg.norm(vec2))


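def average_vector(listOfWords, average_vec):
  # Start from a float copy so the caller's array is not modified in place.
  total = np.array(average_vec, dtype=float)
  count = 0
  for word in listOfWords:
    try:
      total += norm(word)
      count += 1
    except KeyError:
      # Skip words that are missing from the model's vocabulary.
      continue
  
  # Divide by the number of words actually found in the model.
  return total/count if count else total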


def relativeNormDifference(groupVec1, groupVec2, word):
  return np.linalg.norm(word - groupVec1)-np.linalg.norm(word - groupVec2)

#  output = 0
#  for word in listOfWords:
#    output += np.linalg.norm(word - groupVec1)-np.linalg.norm(word - groupVec2)
#  return output
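
To make the helper functions above concrete, the following sketch applies the same normalization, cosine, and averaging steps to a toy vocabulary. The dictionary toy_vectors, its two-dimensional values, and the helper names unit, cosine, and mean_unit_vector are made up for illustration; the dictionary stands in for the object returned by api.load, and real models have far more dimensions.

```python
import numpy as np

# Hypothetical stand-in for word_vector = api.load(model_name).
toy_vectors = {
  'care': np.array([3.0, 4.0]),
  'help': np.array([1.0, 0.0]),
  'dog': np.array([0.0, 2.0]),
}

def unit(word):
  # Scale the word's vector to length 1, mirroring norm() above.
  vec = toy_vectors[word]
  return vec / np.linalg.norm(vec)

def cosine(vec1, vec2):
  # Cosine of the angle between two vectors.
  return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

def mean_unit_vector(words):
  # Average the unit vectors of the words that exist in the vocabulary.
  found = [unit(w) for w in words if w in toy_vectors]
  return np.mean(found, axis=0)

care = unit('care')
# 'missing' is not in the toy vocabulary, so it is skipped.
group = mean_unit_vector(['care', 'help', 'missing'])
dog_vs_group = cosine(unit('dog'), group)

print(care, group, dog_vs_group)
```

Because the cosine ignores vector length, rescaling the group vector does not change the cosine scores; it does change distance-based measures such as the relative norm difference.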


Here we define three functions. The first cycles through our lists of words and computes the cosine similarity between each pair. The second is for the average vector approach. The third uses a relative norm difference approach similar to the one published in PNAS.

In [0]:
def cosines_to_csv(csv_name, list1, list2, model_name):
  with open(csv_name, 'w', newline='') as file:
    writer = csv.writer(file)

    # Write the headings to the csv without modifying list2.
    writer.writerow([model_name] + list2)

    for word in list1:
      listOfCosines = []
      listOfCosines.append(word)
      
      for entity in list2:
        try:
          listOfCosines.append(cosine_similarity(norm(word),norm(entity)))
        except KeyError:
          listOfCosines.append('NA')

      # Writing the cosine scores to the csv.
      writer.writerow(listOfCosines)


def average_vector_cosines_to_csv(csv_name,average_vec,list2, model_name):
  with open(csv_name, 'a', newline='') as file:
    writer = csv.writer(file)

    # The first column holds the model name.
    listOfCosines = [model_name]

    for word in list2:
      try:
        listOfCosines.append(cosine_similarity(average_vec,norm(word)))
      except KeyError:
        listOfCosines.append('NA')
      
    # Writing the cosine scores to the csv.
    writer.writerow(listOfCosines)


def relative_norm_difference_to_csv(csv_name, average_vec1, average_vec2, list2, model_name):
  with open(csv_name, 'a', newline='') as file:
    writer = csv.writer(file)

    # The first column holds the model name.
    listOfCosines = [model_name]

    for word in list2:
      try:
        listOfCosines.append(relativeNormDifference(average_vec1, average_vec2, norm(word)))
      except KeyError:
        listOfCosines.append('NA')
      
    # Writing the relative norm differences to the csv.
    writer.writerow(listOfCosines)
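
The three writers above share one pattern: build a row that starts with a label, append one score per word, and fall back to 'NA' when a word is missing from the model. Below is a minimal sketch of that pattern, writing to an in-memory buffer instead of a file on Drive. The names toy_vectors, reference, and score_row, and the example words and values, are hypothetical.

```python
import csv, io
import numpy as np

# Hypothetical stand-in for a loaded model; 'interloper' is deliberately absent.
toy_vectors = {'friend': np.array([1.0, 0.0]), 'enemy': np.array([-1.0, 0.0])}
reference = np.array([1.0, 0.0])

def score_row(label, words):
  # First column is the label, then one cosine per word ('NA' if missing).
  row = [label]
  for word in words:
    try:
      vec = toy_vectors[word]
      cos = np.dot(vec, reference) / (np.linalg.norm(vec) * np.linalg.norm(reference))
      row.append(float(cos))
    except KeyError:
      row.append('NA')
  return row

words = ['friend', 'enemy', 'interloper']
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow([' '] + words)                  # header row
writer.writerow(score_row('toy-model', words))  # one row of scores

# Read the buffer back to inspect what was written.
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows)
```

When reading such a file back for analysis, the 'NA' strings need to be treated as missing values rather than parsed as numbers.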

Lists of words

In [0]:
moral_standing = ['care', 'cares', 'cared', 'caring', 'help', 'helps', 'helping', 'helped', 'donate', 'donates', 'donating', 'donated', 'aid', 'aiding', 'aided',  'empathy', 'empathetic', 'empathizing', 'empathizes', 'sympathy', 'sympathetic', 'sympathizing', 'sympathizes', 'compassion', 'compassionate']

In [0]:
entities = ['husband','wife','father','mother','son','daughter','brother','sister','uncle','aunt','niece','nephew','cousin','grandmother','grandfather','acquaintance','ally','associate','colleague','comrade','counterpart','fellow','neighbour','patriot','confidant','friend','companion','partner','supporter','member','follower','emigrant','foreigner','intruder','settler','stranger','visitor','vagrant','opposition','rival','opponent','adversary','competitor','invader','trespasser','interloper','occupier','arab','beggar','blacks','crippled','disabled','jew','mexican','unemployed','vagabond','addict','native','elderly','indian','woman','chinese','pauper','enemy','villain','crook','delinquent','murderer','robber','thief','deserter','traitor','liar','convict','criminal','felon','offender','pickpocket','scoundrel','animal','ape','bird','elephant','chicken','cow','dog','fish','pig','shark','bear','snake','cat','fox','monkey','horse','lion',	'nature','forest','lake','mountain','ocean','reef','river','tree','sea','beach','island','air','water','coast','jungle','earth','planet']

In [0]:
moral_standing_positive = ['care', 'cares', 'cared', 'caring', 'help', 'helps', 'helping', 'helped', 'aid', 'aiding', 'aided']

In [0]:
moral_standing_negative = ['harm', 'kill', 'kills', 'killing', 'killed', 'annihilate', 'annihilates', 'annihilated', 'exterminate', 'exterminated']

Average vector approach 

In [0]:
with open('cosines_enitities_and_moral_standings_average_vec_contempary.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    
    # Header row: a blank first cell followed by the entity words.
    writer.writerow([" "] + entities)


for i in list_of_models:
  word_vector = api.load(i)
  
  j = word_vector.vector_size  # dimensionality of the embedding
  
  average_vec = np.zeros([j, ])

  moral_standings_aver_vec = average_vector(moral_standing_positive, average_vec)

  average_vector_cosines_to_csv('cosines_enitities_and_moral_standings_average_vec_contempary.csv', moral_standings_aver_vec, entities, i)







Individual approach

In [0]:
for i in list_of_models:
  word_vector = api.load(i)

  cosines_to_csv('cosines_enitities_and_moral_standings_{}.csv'.format(i), moral_standing, entities, i)
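
For larger word lists, the pairwise cosines computed one at a time above can also be obtained in a single matrix product. This is an alternative sketch, not the method used in this notebook: it assumes every word is in the vocabulary, so its vectors can be stacked into an array. The function name cosine_matrix and the toy arrays are hypothetical.

```python
import numpy as np

def cosine_matrix(rows, cols):
  # Normalize each row vector to unit length, then take all dot products at once.
  rows_unit = rows / np.linalg.norm(rows, axis=1, keepdims=True)
  cols_unit = cols / np.linalg.norm(cols, axis=1, keepdims=True)
  return rows_unit @ cols_unit.T

# Made-up vectors: two "moral standing" words and three "entity" words.
moral_vecs = np.array([[1.0, 0.0], [1.0, 1.0]])
entity_vecs = np.array([[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]])

# table[i, j] is the cosine between moral word i and entity word j.
table = cosine_matrix(moral_vecs, entity_vecs)
print(table)
```

Out-of-vocabulary words would have to be filtered out before stacking, and written as 'NA' separately.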





Relative norm difference approach 

In [0]:
with open('rnd_enitities_and_moral_standings_contempary.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    
    # Header row: a blank first cell followed by the entity words.
    writer.writerow([" "] + entities)

for i in list_of_models:
  word_vector = api.load(i)

  j = word_vector.vector_size  # dimensionality of the embedding
  
  average_vec = np.zeros([j, ])

  positive_aver_vec = average_vector(moral_standing_positive, average_vec)

  average_vec = np.zeros([j, ])

  negative_aver_vec = average_vector(moral_standing_negative, average_vec)

  relative_norm_difference_to_csv('rnd_enitities_and_moral_standings_contempary.csv', positive_aver_vec, negative_aver_vec, entities, i)
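
As a reading aid for the output of this approach: the relative norm difference is the distance to the first group vector minus the distance to the second. With the positive group passed first, a negative value means the word lies closer to the positive group, and a positive value means it lies closer to the negative group. The sketch below checks this on made-up two-dimensional vectors; the names and values are illustrative only, not output from any model.

```python
import numpy as np

def relative_norm_difference(group_vec1, group_vec2, word_vec):
  # Distance to the first group minus distance to the second group.
  return np.linalg.norm(word_vec - group_vec1) - np.linalg.norm(word_vec - group_vec2)

# Hypothetical group vectors.
positive_group = np.array([1.0, 0.0])
negative_group = np.array([0.0, 1.0])

# A unit vector leaning toward the positive group, and one exactly in between.
near_positive = np.array([0.8, 0.6])
equidistant = np.array([1.0, 1.0]) / np.sqrt(2)

rnd_near = relative_norm_difference(positive_group, negative_group, near_positive)
rnd_mid = relative_norm_difference(positive_group, negative_group, equidistant)

print(rnd_near, rnd_mid)
```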








Archived

In [0]:
list_of_words_to_remove = []

for i in list_of_models:
  list_of_words_to_remove.append(i)
  word_vector = api.load(i)
  list_of_words_to_remove.append('ENTITIES')
  for entity in entities:
    try:
      # Indexing raises KeyError when the word is not in the model's vocabulary.
      word_vector[entity]
    except KeyError:
      list_of_words_to_remove.append(entity)
  list_of_words_to_remove.append('MORALS')
  for moral in moral_standing:
    try: 
      word_vector[moral]
    except KeyError:
      list_of_words_to_remove.append(moral)
  print(list_of_words_to_remove)
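
The same out-of-vocabulary check can be written with membership tests instead of try/except. This is a sketch, assuming the model object supports the in operator (Gensim's KeyedVectors does). A plain dictionary stands in for the loaded model here, and missing_words is a hypothetical helper name.

```python
def missing_words(model, words):
  # Keep only the words that the model has no vector for.
  return [word for word in words if word not in model]

# Hypothetical stand-in for word_vector = api.load(model_name).
toy_model = {'friend': [1.0, 0.0], 'care': [0.0, 1.0]}

report = {
  'ENTITIES': missing_words(toy_model, ['friend', 'interloper']),
  'MORALS': missing_words(toy_model, ['care', 'empathizes']),
}
print(report)
```

Keeping the results in a dictionary keyed by list name avoids mixing marker strings such as 'ENTITIES' into the same list as the missing words.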

  

