# Word embeddings

In this first homework, we'll introduce some basic operations for working with static word embeddings: calculating the cosine similarity between two vectors, and finding the nearest neighbors in vector space among a set of embeddings. We'll then use those operations to learn something interesting about the datasets those vectors were estimated from, by measuring the orientation of words along different semantic axes (e.g., the degree to which a word like "doctor" is associated with men or women in the underlying dataset) and by measuring the *change* in meaning of a word across domains, including time (e.g., "cabinet", "awesome").

In this notebook, feel free to use [numpy](https://numpy.org/doc/stable/user/quickstart.html) for vector operations, but **do not import any other libraries** outside of those already provided (e.g., do not import pandas).  Many of the homeworks that we'll have later in this course will use vector and matrix operations in libraries like pytorch that are very similar to numpy, so it's worth getting some exposure to numpy now.

## Deliverables:

There are two different deliverables, each to be submitted to a different Gradescope assignment.

1. Submit to Gradescope **HW1 california_nearest_neighbors_50.txt**: california_nearest_neighbors_50.txt
2. Submit to Gradescope **HW1 code**: HW1.ipynb (this notebook)

(Please don't alter either of these file names.)

In [1]:
import numpy as np
import math
import operator

In [2]:
def read_vectors(filename):
    vocab=[]
    vocab_map={}
    embeddings=[]
    with open(filename, encoding="utf-8") as file:
        for idx, line in enumerate(file):
            cols=line.rstrip().split(" ")
            word=cols[0]
            embedding=cols[1:]

            embeddings.append(embedding)
            vocab.append(word)
            vocab_map[word]=idx
    
    return vocab, vocab_map, np.array(embeddings, dtype="float")

In [3]:
glove_vocab, glove_vocab_map, glove_embeddings=read_vectors("glove.6B.100d.100K.txt")

**Q1.** As we saw in class, one of the most common ways of measuring the similarity of two vectors in NLP is cosine similarity. Write a function to calculate the cosine similarity between any two **numpy** vectors (as with the word embeddings above); this function should return a single real number.
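
As a quick sanity check on the definition, here is a minimal sketch on toy vectors. The values and the helper name `toy_cosine` are made up for illustration and are not part of the assignment. Vectors pointing in the same direction should score 1, orthogonal vectors 0, and opposite vectors -1, regardless of their lengths.

```python
import numpy as np

# Toy 3-dimensional vectors (illustrative values, not real embeddings).
a = np.array([1.0, 2.0, 2.0])
b = np.array([2.0, 4.0, 4.0])    # same direction as a, twice the length
c = np.array([2.0, -1.0, 0.0])   # orthogonal to a (their dot product is 0)

def toy_cosine(u, v):
    """Cosine of the angle between u and v: (u . v) / (||u|| * ||v||)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_ab = toy_cosine(a, b)     # parallel vectors
sim_ac = toy_cosine(a, c)     # orthogonal vectors
sim_neg = toy_cosine(a, -a)   # opposite directions
print(sim_ab, sim_ac, sim_neg)
```

Because the norms appear in the denominator, the score depends only on direction; this sketch assumes neither vector is all zeros.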

In [4]:
def cosine_similarity(vec1, vec2):
    return (vec1.dot(vec2)) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
    

**Q2.** A core operation on word embeddings is to find the $k$-nearest neighbors to a word: e.g., given a target word like "california" and $k=10$, finding the 10 words in your vocabulary whose embeddings have the highest cosine similarity to the embedding for "california".  Write a function that does just that, given an input set of embeddings, `vocab_map` from the `read_vectors` function, and a query term.  Your function should return a list of the $k$-nearest neighbors in order from most similar to least, and only those $k$ words (e.g., if $k=2$ and `query_term` = "california", you should return `["texas", "florida"]`).  Do not include the query term itself among the nearest neighbors.
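
One possible way to organize this computation is sketched below on a toy vocabulary. The words, the 2-dimensional vectors, and the function name `toy_nearest_neighbors` are all hypothetical. The sketch normalizes every row once so that a single matrix-vector product yields all cosine similarities, and it assumes no embedding is the zero vector.

```python
import numpy as np

# Hypothetical toy vocabulary and 2-d embeddings (made up for illustration).
toy_vocab = ["north", "northeast", "east", "south"]
toy_vocab_map = {word: idx for idx, word in enumerate(toy_vocab)}
toy_embeddings = np.array([
    [0.0, 1.0],    # north
    [1.0, 1.0],    # northeast
    [1.0, 0.0],    # east
    [0.0, -1.0],   # south
])

def toy_nearest_neighbors(embeddings, vocab_map, query_term, k=2):
    # Scale each row to unit length so a dot product equals cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    query_idx = vocab_map[query_term]
    sims = unit @ unit[query_idx]
    # Exclude the query by index instead of assuming it sorts first.
    sims[query_idx] = -np.inf
    index_to_word = {idx: word for word, idx in vocab_map.items()}
    top = np.argsort(-sims)[:k]   # indices from most to least similar
    return [index_to_word[int(i)] for i in top]

toy_result = toy_nearest_neighbors(toy_embeddings, toy_vocab_map, "north", k=2)
print(toy_result)
```

For "north", the remaining similarities are roughly 0.707 (northeast), 0 (east), and -1 (south), so the two nearest neighbors come back in that order.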

In [5]:
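def find_nearest_neighbors(embeddings, vocab_map, query_term, k=10):
    q = embeddings[vocab_map[query_term]]
    # Score every word except the query itself by its cosine similarity to the query.
    similarities = ((cosine_similarity(embeddings[idx], q), word) for word, idx in vocab_map.items() if word != query_term)
    # Sort from most to least similar and keep only the top k words.
    return [word for _, word in sorted(similarities, reverse=True)[:k]]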

In [6]:
nearest_neighbors=find_nearest_neighbors(glove_embeddings, glove_vocab_map, "car", k=10)
for idx, k in enumerate(nearest_neighbors):
    print("%s\t%s" % (idx, k))

print()

nearest_neighbors=find_nearest_neighbors(glove_embeddings, glove_vocab_map, "california", k=2)
for idx, k in enumerate(nearest_neighbors):
    print("%s\t%s" % (idx, k))

0	vehicle
1	truck
2	cars
3	driver
4	driving
5	motorcycle
6	vehicles
7	parked
8	bus
9	taxi

0	texas
1	florida


In [7]:
# Execute this cell to use your find_nearest_neighbors function above to find the 50 nearest neighbors
# to the word "california".  This cell writes that output to the file california_nearest_neighbors_50.txt,
# which you will upload to GradeScope as a deliverable.

# DO NOT CHANGE THIS BLOCK
nearest_neighbors=find_nearest_neighbors(glove_embeddings, glove_vocab_map, "california", k=50)
with open("california_nearest_neighbors_50.txt", "w") as out:
    for idx, k in enumerate(nearest_neighbors):
        out.write("%s\t%s\n" % (idx, k))

# SemAxis

Word embeddings are the foundation of many of the NLP models we'll use in this class, and provide a representation of input words for a range of different tasks.  But we can also interrogate the representations themselves to examine a number of different questions.  First, let's consider the question of how to measure the orientation of word representations along specific semantic axes: e.g., for an axis defined by the endpoints "happy" and "sad", where do word embeddings estimated from a specific dataset situate a word like "smile"?  As we've seen in class, this gives us a mechanism for interrogating bias: if we define an axis by the endpoints "man" and "woman", for example, where do we see words like "doctor" and "nurse" appearing along this spectrum? (For similar work in this vein, see [Bolukbasi et al. 2016](https://arxiv.org/pdf/1607.06520.pdf), [Blodgett et al. 2020](https://aclanthology.org/2020.acl-main.485.pdf).)

[SemAxis](https://arxiv.org/pdf/1806.05521.pdf) is one such method, where the axis endpoints are defined not by single words, but by sets of words (e.g., "happy", "cheerful", "ecstatic"). Given a set of word embeddings for one category $S^+ = \{v_1^+, \ldots, v_n^+\}$ and embeddings for a contrasting category $S^- = \{v_1^-, \ldots, v_m^-\}$ that together define the endpoints of the axis, SemAxis outputs a single real-valued score for an input word $w$ with word representation $v_w$:

$$
\textrm{score}(w)_{\mathbf{V}_\textrm{axis}} = \cos(v_w, \mathbf{V}_\textrm{axis})
$$

where:
$$
\mathbf{V}^+ = {1 \over n} \sum_{i=1}^n v_i^+
$$

$$
\mathbf{V}^- = {1 \over m} \sum_{i=1}^m v_i^-
$$

$$
\mathbf{V}_{\textrm{axis}} = \mathbf{V}^+ - \mathbf{V}^-
$$

Let's investigate how we can use the methods above to situate words along axes you define.
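
Before applying this to real embeddings, it can help to trace the formulas on a toy case. The sketch below uses invented 2-dimensional vectors (the words and values are hypothetical, chosen so the arithmetic is easy to follow): it averages each endpoint set, subtracts to get the axis, and scores three targets by their cosine with that axis.

```python
import numpy as np

# Hypothetical 2-d "embeddings", invented so the arithmetic is easy to check.
toy_vecs = {
    "hot":   np.array([2.0, 0.0]),
    "warm":  np.array([2.0, 2.0]),
    "cold":  np.array([-2.0, 0.0]),
    "cool":  np.array([-2.0, 2.0]),
    "fire":  np.array([3.0, 1.0]),
    "ice":   np.array([-1.0, 0.0]),
    "table": np.array([0.0, 1.0]),
}

# V+ and V- are the means of the two endpoint sets (here n = m = 2).
v_plus = np.mean([toy_vecs["hot"], toy_vecs["warm"]], axis=0)    # [ 2, 1]
v_minus = np.mean([toy_vecs["cold"], toy_vecs["cool"]], axis=0)  # [-2, 1]
v_axis = v_plus - v_minus                                        # [ 4, 0]

def axis_score(word):
    """Cosine similarity between the word's vector and the axis vector."""
    v = toy_vecs[word]
    return float(v @ v_axis / (np.linalg.norm(v) * np.linalg.norm(v_axis)))

toy_scores = {word: axis_score(word) for word in ["fire", "ice", "table"]}
print(toy_scores)
```

In this toy case the shared second coordinate cancels when the means are subtracted, so the axis keeps only the dimension along which the two sets differ: "fire" scores close to 1, "ice" scores -1, and "table", which is orthogonal to the axis, scores 0.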


In [8]:
def get_semaxis_score(vectors, vocab_map, positive_terms=None, negative_terms=None, target_word=None):
    
    positive_vecs=[]
    negative_vecs=[]
    
    for term in positive_terms:
        positive_vecs.append(vectors[vocab_map[term]])
    
    for term in negative_terms:
        negative_vecs.append(vectors[vocab_map[term]])
        
    v_plus=np.mean(positive_vecs, axis=0)
    v_neg=np.mean(negative_vecs, axis=0)
    
    v_axis=v_plus-v_neg
    
    target_vec=vectors[vocab_map[target_word]]
    
    score=cosine_similarity(target_vec, v_axis)

    return score

In [9]:
def score_list_of_targets(vectors, vocab_map, positive_terms=None, negative_terms=None, target_words=None):
    scores=[]
    for target in target_words:
        scores.append((get_semaxis_score(vectors, vocab_map, positive_terms, negative_terms, target), target))

    for k,v in reversed(sorted(scores)):
        print("%.3f\t%s" % (k,v))

In [10]:
targets=["doctor", "nurse", "actor", "actress", "mechanic", "librarian", "architect", "magician", "cook", "chef"]

In [11]:
score_list_of_targets(glove_embeddings, glove_vocab_map, positive_terms=["woman", "women"], negative_terms=["man", "men"], target_words=targets)

0.342	actress
0.294	nurse
0.219	librarian
0.106	doctor
0.024	actor
0.003	chef
-0.019	cook
-0.075	architect
-0.153	magician
-0.194	mechanic


**Q3:** Define your own concept axis by selecting a set of positive and negative terms (as we did for {woman, women}, {man, men} above) and illustrate its utility by scoring a set of 10 target terms.  You may use any axis and target terms that you think can yield an interesting insight; for examples of axes that other related work has explored, see [Kozlowski et al. 2019](https://journals.sagepub.com/doi/pdf/10.1177/0003122419877135).

In [12]:
"""
We investigate how well this model can differentiate between parts of speech (pronouns vs. nouns).

Pronouns taken from https://www.thefreedictionary.com/List-of-pronouns.htm#:~:text=Pronouns%20are%20classified%20as%20personal,%2C%20yours%2C%20his%2C%20hers%2C
Nouns taken from https://www.englishclub.com/vocabulary/common-nouns-25.htm
"""
positive_terms = open("pronouns.txt").read().splitlines()
negative_terms = open("nouns.txt").read().splitlines()
targets = ["her", "him", "they", "shoe", "lamp", "builder"]

# remove terms not in glove
def remove_terms(vocab, terms):
    return [term for term in terms if term in vocab]


positive_terms = remove_terms(glove_vocab_map, positive_terms)
negative_terms = remove_terms(glove_vocab_map, negative_terms)


score_list_of_targets(glove_embeddings, glove_vocab_map, positive_terms=positive_terms, negative_terms=negative_terms, target_words=targets)

# The scores below interleave pronouns and nouns, so this axis does not separate the two classes.

-0.043	lamp
-0.122	him
-0.186	builder
-0.210	they
-0.235	shoe
-0.241	her


# Word sense change

---

Lots of work in NLP has used word embeddings to examine how word meanings have changed over time (e.g., [Hamilton et al. 2016](https://arxiv.org/pdf/1606.02821.pdf), [Garg et al. 2018](https://www.pnas.org/content/115/16/E3635.short), [Kulkarni et al. 2014](https://arxiv.org/pdf/1411.3315.pdf)).  We can examine this here by looking at word embeddings trained on datasets written at different times: GloVe vectors trained on contemporary text (including Wikipedia and the general web), and vectors trained on literary texts from Project Gutenberg (mainly written before 1925).  We can't directly compare two vectors estimated in separate training procedures (since the embedding spaces are not equivalent), but we can compare the overlap in their nearest neighbors to get a sense of the degree of their change across these different domains.

(**There is no deliverable here**, but feel free to play around to see how words have changed their meaning over time using nearest neighbor associations as the means of doing so. What words do you think have changed in meaning over the past 100 years?)
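
The neighbor-overlap comparison described above can be sketched independently of any particular embeddings. In the minimal example below, the function name `neighbor_overlap` and both neighbor lists are hypothetical, invented only to show the arithmetic; it assumes the first list is non-empty.

```python
def neighbor_overlap(neighbors_a, neighbors_b):
    """Fraction of the words in neighbors_a that also appear in neighbors_b."""
    shared = set(neighbors_a) & set(neighbors_b)
    return len(shared) / len(neighbors_a)

# Hypothetical neighbor lists for one word in two corpora (not real model output).
modern_neighbors = ["phone", "tablet", "laptop", "screen"]
older_neighbors = ["slate", "tablet", "stone", "screen"]

toy_overlap = neighbor_overlap(modern_neighbors, older_neighbors)
print(toy_overlap)  # 2 of the 4 words are shared
```

A value near 1 suggests the word keeps similar company in both corpora, while a value near 0 suggests its usage differs. Keep in mind that the two vocabularies are not identical, so some neighbors in one space may simply be absent from the other.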

In [13]:
gutenberg_vocab, gutenberg_vocab_map, gutenberg_embeddings=read_vectors("gutenberg.200.vectors.50K.txt")

In [19]:
def calculate_nn_overlap(term):
    glove_nearest_neighbors=find_nearest_neighbors(glove_embeddings, glove_vocab_map, term, k=10)
    print("GloVe:", glove_nearest_neighbors)
    gutenberg_nearest_neighbors=find_nearest_neighbors(gutenberg_embeddings, gutenberg_vocab_map, term, k=10)
    print("Gutenberg:", gutenberg_nearest_neighbors)
    overlap=set(glove_nearest_neighbors).intersection(set(gutenberg_nearest_neighbors))
    print(overlap)
    print(len(overlap)/len(glove_nearest_neighbors))

In [20]:
calculate_nn_overlap("cabinet")

GloVe: ['ministers', 'prime', 'minister', 'parliament', 'reshuffle', 'parliamentary', 'resignation', 'party', 'resign', 'government']
Gutenberg: ['bureau', 'closet', 'bookcase', 'cabinets', 'chamber', 'cupboard', 'coffer', 'court', 'reading-room', 'council-chamber']
set()
0.0
