# Word Representations and Lexical Similarities (12 + 10 pt)

For more reading on vector semantics, go to Chapter 6, sections 6.4 and 6.8:
https://web.stanford.edu/~jurafsky/slp3/6.pdf

For WordNet exploration, use this manual: https://www.nltk.org/howto/wordnet.html

For additional discussion of WordNet, go to Chapter 19: https://web.stanford.edu/~jurafsky/slp3/19.pdf

In [1]:
# load wordnet
from nltk.corpus import wordnet as wn

# load GloVe word vectors
import gensim.downloader as gensim_api
glove_model = gensim_api.load("glove-wiki-gigaword-50")

from itertools import combinations, product
from scipy.stats import spearmanr
import numpy as np

In [2]:
some_words = ['car', 'dog', 'banana', 'delicious', 'baguette', 'jumping', 'hugging', 'election']

### Explore Word Representations in English WordNet (+3pt)

In [3]:
# For each word above, print its synsets.
# For each synset, print all lemmas, hypernyms, and hyponyms.

for word in some_words:
    print(f"\n\n---{word}---")
    synsets = wn.synsets(word)
    for s in synsets:
        print()
        print(s.name(), "-", s.definition())
        print("  lemmas:", [x.name() for x in s.lemmas()])
        print("  hypernyms:", [x.name() for x in s.hypernyms()])
        print("  hyponyms:", [x.name() for x in s.hyponyms()])



---car---

car.n.01 - a motor vehicle with four wheels; usually propelled by an internal combustion engine
  lemmas: ['car', 'auto', 'automobile', 'machine', 'motorcar']
  hypernyms: ['motor_vehicle.n.01']
  hyponyms: ['ambulance.n.01', 'beach_wagon.n.01', 'bus.n.04', 'cab.n.03', 'compact.n.03', 'convertible.n.01', 'coupe.n.01', 'cruiser.n.01', 'electric.n.01', 'gas_guzzler.n.01', 'hardtop.n.01', 'hatchback.n.01', 'horseless_carriage.n.01', 'hot_rod.n.01', 'jeep.n.01', 'limousine.n.01', 'loaner.n.02', 'minicar.n.01', 'minivan.n.01', 'model_t.n.01', 'pace_car.n.01', 'racer.n.02', 'roadster.n.01', 'sedan.n.01', 'sport_utility.n.01', 'sports_car.n.01', 'stanley_steamer.n.01', 'stock_car.n.01', 'subcompact.n.01', 'touring_car.n.01', 'used-car.n.01']

car.n.02 - a wheeled vehicle adapted to the rails of railroad
  lemmas: ['car', 'railcar', 'railway_car', 'railroad_car']
  hypernyms: ['wheeled_vehicle.n.01']
  hyponyms: ['baggage_car.n.01', 'cabin_car.n.01', 'club_car.n.01', 'freight_ca

#### Measure The Lexical Similarity (+3pt)

In [4]:
# Wu-Palmer similarity measures how similar two senses are, based on their depths
# in the taxonomy and the depth of their most specific common ancestor.
#
# For each pair of words, find their closest senses based on Wu-Palmer similarity.
# List all word pairs and their highest possible wup_similarity.
# Use wn.wup_similarity(s1, s2) and itertools (combinations and product).
# If there is no connection between two words, use 0.

wn_sims = []
for word1, word2 in combinations(some_words, 2):    
    # check similarities of all senses for words
    similarities = []
    for s1 in wn.synsets(word1):
        for s2 in wn.synsets(word2):
            sim = wn.wup_similarity(s1, s2)
            sim = sim if sim else 0
            similarities.append((s1, s2, sim))
            
    # best score over all sense pairs; 0 if either word has no synsets
    max_sim = max((sim for _, _, sim in similarities), default=0)

    wn_sims.append(max_sim)
    print(f"{word1:9} {word2:9} {max_sim:6.3f}")

car       dog        0.667
car       banana     0.421
car       delicious  0.364
car       baguette   0.211
car       jumping    0.125
car       hugging    0.235
car       election   0.133
dog       banana     0.632
dog       delicious  0.556
dog       baguette   0.556
dog       jumping    0.333
dog       hugging    0.286
dog       election   0.182
banana    delicious  0.750
banana    baguette   0.556
banana    jumping    0.133
banana    hugging    0.250
banana    election   0.143
delicious baguette   0.500
delicious jumping    0.118
delicious hugging    0.222
delicious election   0.125
baguette  jumping    0.118
baguette  hugging    0.222
baguette  election   0.125
jumping   hugging    0.400
jumping   election   0.667
hugging   election   0.200


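> **Conclusion:** The most similar pair is `banana` & `delicious`, with a Wu-Palmer score of 0.750. The score is the maximum over all sense pairs, and WordNet also lists `delicious` as a noun (a variety of apple), which likely explains why it lands so close to another fruit.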
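
To make the score easier to interpret, here is a minimal sketch of the Wu-Palmer formula, 2 * depth(LCS) / (depth(s1) + depth(s2)), on a tiny hand-made taxonomy. The taxonomy, `PARENT`, and `toy_wup` are hypothetical and only illustrate the idea; root depth is assumed to be 1, and each node has a single parent, whereas WordNet synsets can have several hypernyms.

```python
# Toy taxonomy (hypothetical, not WordNet): child -> parent.
PARENT = {
    "animal": "entity",
    "vehicle": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
}

def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def depth(node):
    """Depth counted in nodes, so the root has depth 1 (an assumption)."""
    return len(path_to_root(node))

def toy_wup(a, b):
    """Wu-Palmer: 2 * depth(LCS) / (depth(a) + depth(b))."""
    ancestors_of_a = set(path_to_root(a))
    # Walking up from b, the first node shared with a is the deepest common ancestor.
    lcs = next(n for n in path_to_root(b) if n in ancestors_of_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(round(toy_wup("dog", "cat"), 3))  # 0.667 (common ancestor: animal)
print(round(toy_wup("dog", "car"), 3))  # 0.333 (common ancestor: entity)
```

Two senses that share only the root still get a non-zero score, which is one reason unrelated pairs in the table above rarely reach 0.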

### Measure the similarities on GloVe Word Vectors

In [5]:
glov_sims = []
for word1, word2 in combinations(some_words, 2):
    sim = glove_model.similarity(word1, word2)
    glov_sims.append(sim)
    print(f"{word1:9} {word2:9} {sim:6.3f}")


car       dog        0.464
car       banana     0.219
car       delicious  0.068
car       baguette   0.046
car       jumping    0.516
car       hugging    0.278
car       election   0.333
dog       banana     0.333
dog       delicious  0.404
dog       baguette   0.018
dog       jumping    0.539
dog       hugging    0.410
dog       election   0.181
banana    delicious  0.487
banana    baguette   0.450
banana    jumping    0.108
banana    hugging    0.127
banana    election   0.164
delicious baguette   0.421
delicious jumping    0.042
delicious hugging    0.142
delicious election   0.028
baguette  jumping   -0.075
baguette  hugging    0.161
baguette  election  -0.091
jumping   hugging    0.447
jumping   election   0.206
hugging   election  -0.076


#### Examine if two measures correlate

In [6]:
# Spearman's rank correlation coefficient of the two lists
print("Spearman's rho", spearmanr(glov_sims, wn_sims))

# A higher correlation (closer to 1.0) means the two measures rank the word pairs more similarly.

Spearman's rho SpearmanrResult(correlation=0.5364590589374248, pvalue=0.0032515659964184227)
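
Spearman's rho is the Pearson correlation computed on ranks rather than raw values, so it only asks whether the two measures order the 28 word pairs similarly. The following is a dependency-free sketch of that idea; `average_ranks`, `pearson`, and `spearman_rho` are illustrative helper names, not scipy functions, and the sketch does not compute a p-value.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Made-up scores on very different scales but with the same ordering.
a = [0.1, 0.4, 0.35, 0.8]
b = [1.0, 3.0, 2.0, 50.0]
print(spearman_rho(a, b))
```

Because only the ordering matters, it is reasonable to compare Wu-Palmer scores and cosine similarities this way even though they live on different scales.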


### Word Vector Representations in GloVe

In [7]:
# Each word is represented as a vector:
print('dog =', glove_model['dog'])

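# The matrix of all word vectors is learned as the parameters of a model.
# In a word2vec-style language model this is written as:
# P( target_word | context_word ) = f(word, context ; params)
# GloVe instead fits the vectors to global word co-occurrence counts.
#
# Words in the same sentence and in close proximity are in each other's context.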

dog = [ 0.11008   -0.38781   -0.57615   -0.27714    0.70521    0.53994
 -1.0786    -0.40146    1.1504    -0.5678     0.0038977  0.52878
  0.64561    0.47262    0.48549   -0.18407    0.1801     0.91397
 -1.1979    -0.5778    -0.37985    0.33606    0.772      0.75555
  0.45506   -1.7671    -1.0503     0.42566    0.41893   -0.68327
  1.5673     0.27685   -0.61708    0.64638   -0.076996   0.37118
  0.1308    -0.45137    0.25398   -0.74392   -0.086199   0.24068
 -0.64819    0.83549    1.2502    -0.51379    0.04224   -0.88118
  0.7158     0.38519  ]


### Implement Cosine Similarity (+3pt)

$\mathrm{cosine}(v, w) = \frac{v \cdot w}{|v|\,|w|}$


In [8]:
# based on equation 6.10 J&M (2019)
# https://web.stanford.edu/~jurafsky/slp3/6.pdf
#
def cosine_sim(v1, v2):
    numerator = np.dot(v1, v2)
    denominator = np.linalg.norm(v1) * np.linalg.norm(v2)
    out = numerator / denominator
    return out

cosine_sim(glove_model['car'], glove_model['automobile'])

0.6956217
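
As a quick sanity check on the formula, here is a dependency-free sketch that exercises three easy cases: orthogonal, parallel, and opposite vectors. `cosine_plain` is a hypothetical helper name, and raising an error for zero vectors is a design choice of this sketch, not something the assignment specifies.

```python
import math

def cosine_plain(v, w):
    """Cosine similarity of two equal-length sequences of numbers."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    if norm_v == 0 or norm_w == 0:
        # The formula divides by the norms, so a zero vector has no defined cosine.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_v * norm_w)

print(cosine_plain([1, 0], [0, 1]))     # orthogonal vectors
print(cosine_plain([3, 4], [6, 8]))     # same direction, different length
print(cosine_plain([3, 4], [-3, -4]))   # opposite direction
```

The second case shows that cosine ignores vector length: scaling a vector does not change its similarity to anything else.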

### Implement top-n most similar words (+3pt)

In [9]:
# search in glove_model:
def top_n(word, n):
    # example: top_n('dog', 3) =  
    #[('cat', 0.9218005537986755), 
    # ('dogs', 0.8513159155845642),
    # ('horse', 0.7907583713531494)]
    # similar to glove_model.most_similar('dog', topn=3)
    
    # compute similarity to all other vectors
    arr = glove_model.cosine_similarities(
        glove_model[word], glove_model.vectors)
    
    # get indices sorted by ascending similarity
    # take the last n (excluding the word itself, which ranks highest), best first
    res = arr.argsort()[-n-1:-1][::-1]
    
    # return the word and value at found indices
    # (index2word is the gensim 3.x name; gensim 4 renamed it to index_to_key)
    return [
        (glove_model.index2word[x], arr[x])
        for x in res
    ]

In [10]:
top_n('dog', 3)

[('cat', 0.92180055), ('dogs', 0.85131586), ('horse', 0.7907584)]

In [11]:
top_n('fun', 4)

[('stuff', 0.86726743),
 ('crazy', 0.8648681),
 ('wonderful', 0.84710056),
 ('really', 0.8386063)]
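
The slice in `top_n` relies on the query word itself being the single highest-ranked entry. A sketch of an alternative that excludes the query word explicitly is shown below, using a toy vocabulary so that it runs without downloading GloVe. `TOY_VECTORS` and `toy_top_n` are hypothetical, and the 2-d vectors are made up for illustration; they are not GloVe values.

```python
import heapq
import math

# Hypothetical 2-d vectors, for illustration only (not GloVe values).
TOY_VECTORS = {
    "dog": (1.0, 0.1),
    "cat": (0.9, 0.2),
    "horse": (0.7, 0.5),
    "car": (0.1, 1.0),
}

def cosine(v, w):
    """Cosine similarity of two 2-d vectors."""
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (math.hypot(*v) * math.hypot(*w))

def toy_top_n(word, n, vectors=TOY_VECTORS):
    """Return the n (word, similarity) pairs most similar to `word`."""
    query = vectors[word]
    # Skip the query word by name instead of assuming it ranks first.
    scored = (
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    )
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])

print([w for w, _ in toy_top_n("dog", 2)])  # ['cat', 'horse']
```

`heapq.nlargest` avoids sorting the whole vocabulary when `n` is small, which can matter for a large vocabulary, although the vectorized `argsort` approach above is usually fast enough in practice.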

## Optional: Examine Fairness In Data Driven Word Vectors (+10 pt)

Caliskan et al. (2017) argue that word vectors learn human biases from data.

Try to replicate one of the tests of the paper:

Caliskan, Aylin, Joanna J. Bryson, and Arvind Narayanan. “Semantics derived automatically from language corpora contain human-like biases.” Science 356.6334 (2017): 183-186. http://opus.bath.ac.uk/55288/


For example on gender bias:
- Male names: John, Paul, Mike, Kevin, Steve, Greg, Jeff, Bill.
- Female names: Amy, Joan, Lisa, Sarah, Diana, Kate, Ann, Donna.
- Career words: executive, management, professional, corporation, salary, office, business, career.
- Family words: home, parents, children, family, cousins, marriage, wedding, relatives.


Report the average cosine similarity of male names to career words, and compare it with the average similarity of female names to career words. Repeat for family words.

Tokens in the GloVe model are all lowercase, so lowercase the names before looking them up.

Write at least one sentence to describe your observation.

In [12]:
male_names = ['John', 'Paul', 'Mike', 'Kevin', 'Steve', 'Greg', 'Jeff', 'Bill']
female_names = ['Amy', 'Joan', 'Lisa', 'Sarah', 'Diana', 'Kate', 'Ann', 'Donna']
career_words = ['executive', 'management', 'professional', 'corporation', 'salary', 'office', 'business', 'career']
family_words = ['home', 'parents', 'children', 'family', 'cousins', 'marriage', 'wedding', 'relatives']

In [13]:
def compare(word_list1, word_list2):
    sims = []
    for w1 in word_list1:
        for w2 in word_list2:
            v1 = glove_model[w1.lower()]
            v2 = glove_model[w2.lower()]
            sim = cosine_sim(v1, v2)
            sims.append(sim)
    return np.mean(sims)

In [14]:
# male vs. career
print("male vs. career")
print(compare(male_names, career_words))

# female vs. career
print("\nfemale vs. career")
print(compare(female_names, career_words))

# male vs. family
print("\nmale vs. family")
print(compare(male_names, family_words))

# female vs. family
print("\nfemale vs. family")
print(compare(female_names, family_words))

male vs. career
0.35287738

female vs. career
0.16353184

male vs. family
0.2753636

female vs. family
0.37566096
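
One way to summarize the four averages is the gap between career and family similarity for each name group. The sketch below only re-uses the means printed above; `reported` and `career_minus_family` are illustrative names, and this simple gap is not the full test statistic from Caliskan et al. (2017).

```python
# Mean cosine similarities copied from the output above.
reported = {
    ("male", "career"): 0.35287738,
    ("female", "career"): 0.16353184,
    ("male", "family"): 0.2753636,
    ("female", "family"): 0.37566096,
}

def career_minus_family(group):
    """Positive: the group's names sit closer to career words than to family words."""
    return reported[(group, "career")] - reported[(group, "family")]

for group in ("male", "female"):
    print(f"{group:6} career - family = {career_minus_family(group):+.3f}")
```

In this run the gap is positive for male names (about +0.078) and negative for female names (about -0.212): male names are on average closer to career words and female names closer to family words, which is the direction of bias the paper reports.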
