# Part 2: Word Representations and Lexical Similarities

This part has 20 points in total.

Here we compare two measures of semantic similarity between words: (1) WordNet depth-based similarity and (2) cosine similarity of word vectors from a given GloVe model.

For more reading on vector semantics, go to Chapter 6, sections 6.4 and 6.8:
https://web.stanford.edu/~jurafsky/slp3/6.pdf

To learn about WordNet: https://www.nltk.org/howto/wordnet.html

For additional WordNet discussions see Chapter 19: https://web.stanford.edu/~jurafsky/slp3/19.pdf

The GloVe word embeddings are described in [this paper](https://nlp.stanford.edu/projects/glove/)


## Part 2.1: Semantic similarity with WordNet

In [1]:
# load wordnet
import nltk
from nltk.corpus import wordnet as wn

# load GloVe word vectors
import gensim
import gensim.downloader as gensim_api
glove_model = gensim_api.load("glove-wiki-gigaword-50")


from itertools import combinations, product
from scipy.stats import spearmanr
import numpy as np
from tqdm import tqdm

In [2]:
some_words = ['car', 'dog', 'banana', 'delicious', 'baguette', 'jumping', 'hugging', 'election']

### Explore Word Representations in English WordNet 

In [3]:
# For each word above print their synsets
# for each synset print all lemmas, hypernyms, hyponyms
word_synsets = []
for word in some_words:
    word_synsets.append(wn.synsets(word))
print(word_synsets)

lemmas = []
hypernyms = []
hyponyms = []
for word_synset in word_synsets:
    for each_synset in word_synset:
        lemmas.append(each_synset.lemmas())
        hypernyms.append(each_synset.hypernyms())
        hyponyms.append(each_synset.hyponyms())
print(lemmas)
print(hypernyms)
print(hyponyms)



[[Synset('car.n.01'), Synset('car.n.02'), Synset('car.n.03'), Synset('car.n.04'), Synset('cable_car.n.01')], [Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')], [Synset('banana.n.01'), Synset('banana.n.02')], [Synset('delicious.n.01'), Synset('delightful.s.01'), Synset('delectable.s.01')], [Synset('baguet.n.01')], [Synset('jumping.n.01'), Synset('jump.n.06'), Synset('jump.v.01'), Synset('startle.v.02'), Synset('jump.v.03'), Synset('jump.v.04'), Synset('leap_out.v.01'), Synset('jump.v.06'), Synset('rise.v.11'), Synset('jump.v.08'), Synset('derail.v.02'), Synset('chute.v.01'), Synset('jump.v.11'), Synset('jumpstart.v.01'), Synset('jump.v.13'), Synset('leap.v.02'), Synset('alternate.v.01')], [Synset('caressing.n.01'), Synset('embrace.v.02'), Synset('hug.v.02')], [Synset('election.n.01'), Synset('election.n.02'), Synset('election.n.03'), Synset('election.n.04')]]
[[Lemma

#### Measure The Lexical Similarity 

In [4]:
# Wu-Palmer Similarity is a measure of similarity between two senses based on their depth distance.
#
# For each pair of words, find their closest senses based on Wu-Palmer Similarity.
# List all word pairs and their highest possible wup_similarity.
# Use wn.wup_similarity(s1, s2) and itertools (combinations and product).
# If there is no connection between two words, put 0.

wn_sims = []
for word1, word2 in combinations(some_words, 2):
    synsets_w1 = wn.synsets(word1)
    synsets_w2 = wn.synsets(word2)
    max_sim = 0
    for synset_w1 in synsets_w1:
        for synset_w2 in synsets_w2:
            similarity = wn.wup_similarity(synset_w1, synset_w2)
            # wup_similarity may return None when no common ancestor is found
            if similarity is not None and similarity > max_sim:
                max_sim = similarity
    wn_sims.append(max_sim)
      
    print(f"{word1:9} {word2:9} {max_sim:6.3f}")

# Which word pair is the most similar?
most_similar_pair = None
max_simil = 0

for wn_sim, pair in zip(wn_sims, combinations(some_words, 2)):
    if wn_sim > max_simil:
        most_similar_pair = pair
        max_simil = wn_sim


print(f"{most_similar_pair} {max_simil}")


car       dog        0.667
car       banana     0.421
car       delicious  0.364
car       baguette   0.211
car       jumping    0.167
car       hugging    0.235
car       election   0.133
dog       banana     0.632
dog       delicious  0.556
dog       baguette   0.556
dog       jumping    0.333
dog       hugging    0.286
dog       election   0.182
banana    delicious  0.750
banana    baguette   0.556
banana    jumping    0.167
banana    hugging    0.250
banana    election   0.143
delicious baguette   0.500
delicious jumping    0.500
delicious hugging    0.400
delicious election   0.222
baguette  jumping    0.154
baguette  hugging    0.222
baguette  election   0.125
jumping   hugging    0.400
jumping   election   0.667
hugging   election   0.200
('banana', 'delicious') 0.75
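
To make the scores above easier to read, here is a minimal sketch of the Wu-Palmer formula, 2 * depth(LCS) / (depth(s1) + depth(s2)), on a tiny hand-made taxonomy. The taxonomy, the `SENSES` inventory and all function names are hypothetical and are not taken from WordNet or NLTK; depth is counted so that the root has depth 1. The last function shows the max-over-sense-pairs step written with `itertools.product`.

```python
from itertools import product

# Hypothetical toy taxonomy (NOT real WordNet): child -> parent
PARENT = {
    "animal": "entity",
    "artifact": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "artifact",
    "andiron": "artifact",
}

# Hypothetical sense inventory: word -> list of sense nodes in the taxonomy
SENSES = {"dog": ["dog", "andiron"], "cat": ["cat"], "car": ["car"]}


def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(node):
    """Number of nodes from the root down to node (the root has depth 1)."""
    return len(path_to_root(node))


def lowest_common_subsumer(a, b):
    """Deepest node that is an ancestor of (or equal to) both a and b."""
    ancestors_of_a = set(path_to_root(a))
    # Walking up from b, the first shared node is the deepest one.
    for node in path_to_root(b):
        if node in ancestors_of_a:
            return node
    return None


def wup(a, b):
    """Wu-Palmer similarity between two sense nodes of the toy taxonomy."""
    lcs = lowest_common_subsumer(a, b)
    if lcs is None:
        return 0.0
    return 2 * depth(lcs) / (depth(a) + depth(b))


def word_similarity(word1, word2):
    """Highest score over all sense pairs; 0.0 if either word has no senses."""
    pairs = product(SENSES.get(word1, []), SENSES.get(word2, []))
    return max((wup(s1, s2) for s1, s2 in pairs), default=0.0)


for w1, w2 in [("dog", "cat"), ("dog", "car")]:
    print(f"{w1:5} {w2:5} {word_similarity(w1, w2):6.3f}")
```

In this toy setup `dog` and `car` only score well through the `andiron` sense of `dog`, which illustrates how taking the maximum over senses can link two words through a sense the reader may not have in mind.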


## Part 2.2: Semantic similarity with GloVe and comparison with WordNet

### Measure the similarities on GloVe Word Vectors

In [5]:
glov_sims = []
for word1, word2 in combinations(some_words, 2):
    max_sim = glove_model.similarity(word1, word2)
    glov_sims.append(max_sim)
    print(f"{word1:9} {word2:9} {max_sim:6.3f}")



car       dog        0.464
car       banana     0.219
car       delicious  0.068
car       baguette   0.046
car       jumping    0.516
car       hugging    0.278
car       election   0.333
dog       banana     0.333
dog       delicious  0.404
dog       baguette   0.018
dog       jumping    0.539
dog       hugging    0.410
dog       election   0.181
banana    delicious  0.487
banana    baguette   0.450
banana    jumping    0.108
banana    hugging    0.127
banana    election   0.164
delicious baguette   0.421
delicious jumping    0.042
delicious hugging    0.142
delicious election   0.028
baguette  jumping   -0.075
baguette  hugging    0.161
baguette  election  -0.091
jumping   hugging    0.447
jumping   election   0.206
hugging   election  -0.076


#### Examine if two measures correlate

In [6]:
# a correlation coefficient of two lists
print("Spearman's rho", spearmanr(glov_sims, wn_sims))

# Higher correlation (closer to 1.0) means the two measures agree with each other.

Spearman's rho SignificanceResult(statistic=0.4222499442309076, pvalue=0.02519986065189366)
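
As a rough illustration of what the statistic above measures, Spearman's rho can be sketched as the Pearson correlation of the ranks of the two lists. The version below is plain Python, assumes tied values share the average of their rank positions, and does not compute a p-value; the helper names are hypothetical and this is not SciPy's implementation.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values get the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Only the ordering matters, not the actual values:
print(spearman_rho([0.1, 0.2, 0.3, 0.4], [10, 200, 3000, 40000]))
```

Because only ranks enter the computation, the statistic is suited to comparing two similarity measures that live on different scales, as WordNet and GloVe scores do.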


How do the two similarities compare? 

Spearman's rho is approximately 0.422, a moderate positive correlation between the two lists (glov_sims and wn_sims): word pairs that score higher under one measure tend to score higher under the other, but the agreement is far from perfect. The associated p-value is 0.0252. Since this is below the conventional 0.05 threshold, we can reject the null hypothesis of no correlation, although with only 28 word pairs the result should be read with caution.

### Word Vector Representations in GloVe

In [None]:
# Each word is represented as a vector:
print('dog =', glove_model['dog'])

# matrix of all word vectors is trained as parameters of a language model:
# P( target_word | context_word ) = f(word, context ; params)
#
# Words in the same sentence and in close proximity are in each other's context.

dog = [ 0.11008   -0.38781   -0.57615   -0.27714    0.70521    0.53994
 -1.0786    -0.40146    1.1504    -0.5678     0.0038977  0.52878
  0.64561    0.47262    0.48549   -0.18407    0.1801     0.91397
 -1.1979    -0.5778    -0.37985    0.33606    0.772      0.75555
  0.45506   -1.7671    -1.0503     0.42566    0.41893   -0.68327
  1.5673     0.27685   -0.61708    0.64638   -0.076996   0.37118
  0.1308    -0.45137    0.25398   -0.74392   -0.086199   0.24068
 -0.64819    0.83549    1.2502    -0.51379    0.04224   -0.88118
  0.7158     0.38519  ]


### Implement Cosine Similarity 

In [None]:
# based on equation 6.10 J&M (2019)
# https://web.stanford.edu/~jurafsky/slp3/6.pdf
#
def cosine_sim(v1, v2):
    dot_product = np.dot(v1,v2)
    norm_v1 = np.linalg.norm(v1)
    norm_v2 = np.linalg.norm(v2)
    cos_sim = dot_product / (norm_v1 * norm_v2)
    return cos_sim


cosine_sim(glove_model['car'], glove_model['automobile'])

0.6956218
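
A few properties of cosine similarity are easy to check on small vectors. The sketch below is a plain-Python variant of the same formula (dot product divided by the product of the norms); the function name `cosine` and the zero-vector check are additions for illustration and are not part of the notebook's `cosine_sim`.

```python
import math


def cosine(v1, v2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        # The angle to a zero vector is undefined, so fail loudly.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm1 * norm2)


print(cosine([1, 2, 3], [2, 4, 6]))  # same direction, different length
print(cosine([1, 0], [0, 1]))        # orthogonal vectors
print(cosine([1, 0], [-3, 0]))       # opposite directions
```

Vector length does not affect the score, only direction does, which is why values range from -1 to 1 and why some GloVe pairs above come out slightly negative.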

### Implement top-n most similar words 

In [None]:
# search in glove_model:
def top_n(word, n):
    # example: top_n('dog', 3) =  
    #[('cat', 0.9218005537986755),
    # ('dogs', 0.8513159155845642),
    # ('horse', 0.7907583713531494)]
    # similar to glove_model.most_similar('dog', topn=3)
    
    
    word_vec = glove_model[word]
    similarities = []
    for other_word in glove_model.index_to_key:
        if other_word != word:
            other_vec = glove_model[other_word]
            similarity = cosine_sim(word_vec,other_vec)
            similarities.append((other_word,similarity))
    sorted_similarities = sorted(similarities, key=lambda x: x[1], reverse=True)
    # keep only the words; sorted_similarities[:n] would also return the scores,
    # as glove_model.most_similar does
    top_n_words = [other_word for other_word, _ in sorted_similarities[:n]]

    return top_n_words
    
top_n('dog',3)


['cat', 'dogs', 'horse']
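
Sorting every word in the vocabulary just to keep the first few is more work than needed. One possible alternative, sketched here on a hypothetical five-word toy embedding table rather than on the GloVe model, is `heapq.nlargest`, which keeps only the `n` best candidates while scanning. The vectors, `TOY_EMBEDDINGS` and the extra `embeddings` parameter are assumptions made for the example.

```python
import heapq
import math

# Hypothetical 2-d vectors, made up for illustration (not GloVe values)
TOY_EMBEDDINGS = {
    "dog": [1.0, 0.0],
    "puppy": [0.9, 0.1],
    "cat": [0.8, 0.4],
    "car": [0.1, 1.0],
    "banana": [-0.5, 0.5],
}


def cosine(v1, v2):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)


def top_n(word, n, embeddings):
    """Return the n (word, similarity) pairs closest to `word`, best first."""
    query = embeddings[word]
    scored = (
        (other, cosine(query, vec))
        for other, vec in embeddings.items()
        if other != word  # a word is trivially most similar to itself
    )
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])


print(top_n("dog", 2, TOY_EMBEDDINGS))
```

This variant returns the scores together with the words, in the same shape as the `most_similar` example quoted in the cell above.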

## VG Part: Examine Fairness In Data Driven Word Vectors

Caliskan et al. (2017) argue that word vectors learn human biases from data.

Try to replicate one of the tests of the paper:

Caliskan, Aylin, Joanna J. Bryson, and Arvind Narayanan. “Semantics derived automatically from language corpora contain human-like biases.” Science 356.6334 (2017): 183-186. http://opus.bath.ac.uk/55288/


For example on gender bias:
- Male names: John, Paul, Mike, Kevin, Steve, Greg, Jeff, Bill.
- Female names: Amy, Joan, Lisa, Sarah, Diana, Kate, Ann, Donna.
- Career words : executive, management, professional, corporation, salary, office, business, career.
- Family words : home, parents, children, family, cousins, marriage, wedding, relatives.


Report the average cosine similarity of male names to career words, and compare it with the average similarity of female names to career words. Repeat for family words.

Tokens in the GloVe model are all lowercase.

Write at least one sentence to describe your observation.

In [8]:
male_names = ['John','Paul','Mike','Kevin','Steve','Greg','Jeff','Bill']
female_names = ['Amy','Joan','Lisa','Sarah','Diana','Kate','Ann','Donna']
career_words = ['executive','management','professional','corporation','salary','office','business','career']
family_words = ['home','parents','children','family','cousins','marriage','wedding','relatives']

In [None]:
# GloVe tokens are lowercase, so lowercase the names before lookup
male_names_edited = [name.lower() for name in male_names]
cosine_similarities_male_career = []
cosine_similarities_male_family = []
# index the model with the loop variables, not the literal strings 'name' and 'word'
for name, word in zip(male_names_edited, career_words):
    cosine_similarities_male_career.append(cosine_sim(glove_model[name], glove_model[word]))
average_cos_sims_male_career = np.mean(cosine_similarities_male_career)
for name, word in zip(male_names_edited, family_words):
    cosine_similarities_male_family.append(cosine_sim(glove_model[name], glove_model[word]))
average_cos_sims_male_family = np.mean(cosine_similarities_male_family)
print(average_cos_sims_male_career)
print(average_cos_sims_male_family)


In [None]:
# GloVe tokens are lowercase, so lowercase the names before lookup
female_names_edited = [name.lower() for name in female_names]
cosine_similarities_female_career = []
cosine_similarities_female_family = []
# index the model with the loop variables, not the literal strings 'name' and 'word'
for name, word in zip(female_names_edited, career_words):
    cosine_similarities_female_career.append(cosine_sim(glove_model[name], glove_model[word]))
average_cos_sims_female_career = np.mean(cosine_similarities_female_career)
for name, word in zip(female_names_edited, family_words):
    cosine_similarities_female_family.append(cosine_sim(glove_model[name], glove_model[word]))
average_cos_sims_female_family = np.mean(cosine_similarities_female_family)
print(average_cos_sims_female_career)
print(average_cos_sims_female_family)


In [None]:
glov_sims_male_career = []
glov_sims_male_family = []
male_names_edited =[]
for name in male_names:
    name= name.lower()
    male_names_edited.append(name)
for name, word in zip(male_names_edited, career_words):
    max_sim = glove_model.similarity(name, word)
    glov_sims_male_career.append(max_sim)
average_glov_sim_male_career = np.mean(glov_sims_male_career)
for name, word in zip(male_names_edited, family_words):
    max_sim = glove_model.similarity(name, word)
    glov_sims_male_family.append(max_sim)
average_glov_sim_male_family = np.mean(glov_sims_male_family)
print(average_glov_sim_male_career)
print(average_glov_sim_male_family)



0.31931674
0.2551345


In [None]:
glov_sims_female_career = []
glov_sims_female_family = []
female_names_edited =[]
for name in female_names:
    name = name.lower()
    female_names_edited.append(name)
for name, word in zip(female_names_edited, career_words):
    max_sim = glove_model.similarity(name, word)
    glov_sims_female_career.append(max_sim)
average_glov_sim_female_career = np.mean(glov_sims_female_career)
for name, word in zip(female_names_edited, family_words):
    max_sim = glove_model.similarity(name, word)
    glov_sims_female_family.append(max_sim)
average_glov_sim_female_family = np.mean(glov_sims_female_family)
print(average_glov_sim_female_career)
print(average_glov_sim_female_family)

0.15457344
0.34622508


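My own cosine similarity function and glove_model.similarity compute the same quantity, so they should give the same averages; any disagreement between them points to a bug in how the vectors are looked up (for example, indexing the model with the literal strings 'name' and 'word' instead of the loop variables), not to a property of the embeddings. With the lookups done correctly, male names are on average closer to career words (0.319) than to family words (0.255), while female names are closer to family words (0.346) than to career words (0.155). This pattern is consistent with the gender bias reported by Caliskan et al. (2017), although it rests on only eight name-word pairs per list and no significance test, so it is suggestive rather than conclusive.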
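
The averages above pair each name with a single word through `zip`. Averaging over every name-word combination, which is closer to "the average similarity of male names to career words", could be sketched with `itertools.product` as below. The 2-d vectors are made up and deliberately constructed so that a difference is visible; they say nothing about GloVe, and `mean_association` and `TOY_VECTORS` are hypothetical names.

```python
import math
from itertools import product
from statistics import mean

# Hypothetical toy vectors, hand-picked for illustration only (NOT GloVe values)
TOY_VECTORS = {
    "john": [0.9, 0.1], "paul": [0.8, 0.3],
    "amy": [0.2, 0.9], "lisa": [0.1, 0.8],
    "salary": [1.0, 0.0], "office": [0.9, 0.2],
    "home": [0.0, 1.0], "family": [0.2, 0.9],
}


def cosine(v1, v2):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)


def mean_association(names, attribute_words, vectors):
    """Average cosine similarity over ALL (name, attribute word) pairs."""
    return mean(
        cosine(vectors[name.lower()], vectors[word])  # tokens are lowercase
        for name, word in product(names, attribute_words)
    )


male, female = ["John", "Paul"], ["Amy", "Lisa"]
career, family = ["salary", "office"], ["home", "family"]

for label, names in [("male", male), ("female", female)]:
    c = mean_association(names, career, TOY_VECTORS)
    f = mean_association(names, family, TOY_VECTORS)
    print(f"{label:6} career={c:.3f} family={f:.3f}")
```

With the real model, the same function could be called with `glove_model` in place of `TOY_VECTORS` and the four eight-word lists from the task, giving 64 pairs per average instead of 8.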