# Part 2: Word Representations and Lexical Similarities

This part has 20 points in total.

Here we will compare two measures of semantic similarity between words: (1) WordNet depth-based similarity and (2) cosine similarity between word vectors from a given GloVe model.

For more reading on vector semantics, go to Chapter 6, sections 6.4 and 6.8:
https://web.stanford.edu/~jurafsky/slp3/6.pdf

To learn about WordNet: https://www.nltk.org/howto/wordnet.html

For additional WordNet discussions, see Chapter 19: https://web.stanford.edu/~jurafsky/slp3/19.pdf

The GloVe word embeddings are described on [the GloVe project page](https://nlp.stanford.edu/projects/glove/), which also links to the paper.


## Part 2.1: Semantic similarity with WordNet

In [None]:
# load wordnet
from nltk.corpus import wordnet as wn

# load GloVe word vectors
import gensim.downloader as gensim_api
glove_model = gensim_api.load("glove-wiki-gigaword-50")

from itertools import combinations, product
from scipy.stats import spearmanr
import numpy as np

In [None]:
some_words = ['car', 'dog', 'banana', 'delicious', 'baguette', 'jumping', 'hugging', 'election']

### Explore Word Representations in English WordNet 

In [None]:
# For each word above, print its synsets.
# For each synset, print all lemmas, hypernyms, and hyponyms.

# 
# Write your code here
#

#### Measure The Lexical Similarity 

In [None]:
# Wu-Palmer similarity is a measure of similarity between two senses based on their depth distance.
#
# For each pair of words, find their closest senses based on Wu-Palmer similarity.
# List all word pairs and their highest possible wup_similarity.
# Use wn.wup_similarity(s1, s2) and itertools (combinations and product).
# If there is no connection between two words, put 0.

wn_sims = []
for word1, word2 in combinations(some_words, 2):
    ### Your code here ###
    max_sim = 0
    ######################
    wn_sims.append(max_sim)
    print(f"{word1:9} {word2:9} {max_sim:6.3f}")

# Which word pair is the most similar?
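
The search over sense pairs described in the cell above can be sketched on toy data, without loading WordNet. The helper name `best_pair_score`, the sense labels, and the scores are all made up for illustration; only the pattern (take the maximum over `product`, fall back to 0 when no pair has a score) comes from the task description.

```python
from itertools import product

def best_pair_score(items1, items2, sim):
    """Return the highest sim(a, b) over all pairs, or 0 if no pair has a score."""
    best = 0
    for a, b in product(items1, items2):
        score = sim(a, b)
        # A missing score (None) is treated as "no connection".
        if score is not None and score > best:
            best = score
    return best

# Hypothetical sense labels and scores, not taken from WordNet.
toy_scores = {("s1", "t1"): 0.4, ("s2", "t1"): 0.6}

def toy_sim(a, b):
    # Returns None for pairs that have no entry in the table.
    return toy_scores.get((a, b))

print(best_pair_score(["s1", "s2"], ["t1", "t2"], toy_sim))
```

With WordNet, `items1` and `items2` would presumably be `wn.synsets(word1)` and `wn.synsets(word2)`, and `sim` would be `wn.wup_similarity`. The `None` check is a precaution, on the assumption that some NLTK versions return `None` when two senses share no path.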

## Part 2.2: Semantic similarity with GloVe and comparison with WordNet

### Measure the similarities on GloVe Word Vectors

In [None]:
glov_sims = []
for word1, word2 in combinations(some_words, 2):
    max_sim = glove_model.similarity(word1, word2)
    glov_sims.append(max_sim)
    print(f"{word1:9} {word2:9} {max_sim:6.3f}")


#### Examine if two measures correlate

In [None]:
# a correlation coefficient of two lists
print("Spearman's rho", spearmanr(glov_sims, wn_sims))

# Higher correlation (closer to 1.0) means the two measures rank the word pairs in a similar order.
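
To make the statistic concrete, here is a small sketch that computes Spearman's rho by hand on toy lists. It uses the rank-difference formula, which is only valid when neither list contains tied values; `scipy.stats.spearmanr` handles ties as well. The function names and the numbers are illustrative assumptions, not results from the notebook.

```python
def ranks(values):
    """Rank positions (1 = smallest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        out[idx] = rank
    return out

def spearman_no_ties(xs, ys):
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)); no-ties case only."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Made-up similarity scores for four word pairs.
a = [0.10, 0.50, 0.90, 0.30]
b = [0.20, 0.70, 0.80, 0.40]

print(spearman_no_ties(a, b))                         # same ordering in both lists
print(spearman_no_ties([1, 2, 3, 4], [4, 3, 2, 1]))   # exactly reversed ordering
```

Because only the ranks matter, the two measures can correlate perfectly even when their raw values live on different scales.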

How do the two similarities compare? 

In [None]:
# Write your answer here

### Word Vector Representations in GloVe

In [None]:
# Each word is represented as a vector:
print('dog =', glove_model['dog'])

# The matrix of all word vectors is learned as the parameters of a model
# that relates each word to the words in its context, schematically:
# P( target_word | context_word ) = f(word, context ; params)
#
# Words in the same sentence and in close proximity are in each other's context.

### Implement Cosine Similarity 

In [None]:
# based on equation 6.10 J&M (2019)
# https://web.stanford.edu/~jurafsky/slp3/6.pdf
#
def cosine_sim(v1, v2):
    out = 0
    # code here
    return out


cosine_sim(glove_model['car'], glove_model['automobile'])
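
As a reference point for the exercise above, the following is one possible sketch of cosine similarity: the dot product of two vectors divided by the product of their lengths. It is written in plain Python so it runs without the GloVe model, and the name `cosine_sketch` is an assumption chosen to avoid clashing with the `cosine_sim` stub. It does not guard against zero-length vectors, which would cause a division by zero.

```python
import math

def cosine_sketch(v1, v2):
    """Cosine similarity: dot(v1, v2) / (|v1| * |v2|)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

print(cosine_sketch([1, 0], [0, 1]))         # orthogonal vectors
print(cosine_sketch([1, 2, 3], [2, 4, 6]))   # same direction, different length
print(cosine_sketch([1, 0], [-1, 0]))        # opposite directions
```

The same function accepts NumPy arrays such as `glove_model['car']`, since it only iterates over the elements; a vectorized version could use `np.dot` and `np.linalg.norm` instead.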

### Implement top-n most similar words 

In [None]:
# search in glove_model:
def top_n(word, n):
    # example: top_n('dog', 3) =  
    #[('cat', 0.9218005537986755),
    # ('dogs', 0.8513159155845642),
    # ('horse', 0.7907583713531494)]
    # similar to glove_model.most_similar('dog', topn=3)

    out = []
    #
    # code here
    # 
    return out
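
One way to approach the search is to score every other word in the vocabulary against the query and keep the n best. The sketch below does this on a tiny hand-made vocabulary; the vectors are invented for illustration and are not GloVe values, and `top_n_sketch` takes the vocabulary as an explicit `vectors` argument, which is an assumption that differs from the `top_n(word, n)` signature above.

```python
import heapq
import math

def _cos(v1, v2):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2)))

def top_n_sketch(word, n, vectors):
    """Return the n (word, similarity) pairs most similar to `word`, best first."""
    query = vectors[word]
    # Score every word except the query itself.
    scored = ((other, _cos(query, vec)) for other, vec in vectors.items() if other != word)
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])

# Toy two-dimensional vectors, made up for this example.
toy_vectors = {
    "dog": [1.0, 0.0],
    "cat": [0.9, 0.1],
    "horse": [0.5, 0.5],
    "banana": [0.0, 1.0],
}

result = top_n_sketch("dog", 2, toy_vectors)
print(result)
```

For the real model, the vocabulary can be iterated through `glove_model.index_to_key`, assuming gensim 4.x. Scoring the full vocabulary in a Python loop is slow, so a matrix-based version is worth considering once the logic is clear.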


## VG Part: Examine Fairness In Data Driven Word Vectors


Caliskan et al. (2017) argue that word vectors learn human biases from data.

Try to replicate one of the tests of the paper:

Caliskan, Aylin, Joanna J. Bryson, and Arvind Narayanan. “Semantics derived automatically from language corpora contain human-like biases.” Science 356.6334 (2017): 183-186. http://opus.bath.ac.uk/55288/


For example on gender bias:
- Male names: John, Paul, Mike, Kevin, Steve, Greg, Jeff, Bill.
- Female names: Amy, Joan, Lisa, Sarah, Diana, Kate, Ann, Donna.
- Career words : executive, management, professional, corporation, salary, office, business, career.
- Family words : home, parents, children, family, cousins, marriage, wedding, relatives.


Report the average cosine similarity of male names to career words, and compare it with the average similarity of female names to career words. Then repeat the comparison for family words.

Tokens in the GloVe model are all lowercase, so convert the names to lowercase before looking them up.
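
The averaging step can be sketched as follows. This is the plain mean of pairwise cosine similarities that the task asks for, not the effect-size statistic used in the paper. The function name `mean_association`, the `vectors` argument, and the placeholder tokens and vectors are all assumptions made for illustration; they are deliberately neutral and say nothing about what the GloVe model contains.

```python
import math
from statistics import mean

def _cos(v1, v2):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2)))

def mean_association(targets, attributes, vectors):
    """Average cosine similarity over every (target, attribute) pair."""
    return mean(
        _cos(vectors[t.lower()], vectors[a.lower()])  # lowercase before lookup
        for t in targets
        for a in attributes
    )

# Placeholder tokens with made-up vectors, not GloVe values.
toy = {
    "a1": [1.0, 0.0],
    "a2": [0.6, 0.8],
    "p": [1.0, 0.0],
    "q": [0.0, 1.0],
}

print(mean_association(["A1", "A2"], ["p"], toy))
print(mean_association(["A1", "A2"], ["q"], toy))
```

With the real data, `targets` would be one of the name lists, `attributes` one of the word lists, and `vectors` the loaded `glove_model`, which supports the same `vectors[token]` lookup.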

Write at least one sentence to describe your observation.