# Lab 6: (Sub)Word embeddings
### COSC 426: Fall 2025, Colgate University

Use this notebook to answer the questions in `Lab6.md`. Make sure to include in this notebook all the tests and experiments you run. Make sure to also cite any external resources you use. 

## Part 1: Computing similarity between `GLoVe` embeddings

### Part 1.1

In [90]:
import pickle
import numpy as np

with open("glove_dolma_300_10k.pkl", "rb") as f:
    glove_embeddings = pickle.load(f)

In [91]:
print(len(glove_embeddings.keys()))

10000


In [92]:
words = ["hello", "cat", "chomsky", "supercalifragilisticexpialidocious"]

for word in words:
    if word in glove_embeddings:
        vec = glove_embeddings[word]
        print(f'"{word}": Dimensions', vec.shape, "Mean:", np.mean(vec))
    else:
        print(f'"{word}": Missing embedding')

"hello": Dimensions (300,) Mean: -5.767882e-05
"cat": Dimensions (300,) Mean: -0.00032740872
"chomsky": Missing embedding
"supercalifragilisticexpialidocious": Missing embedding


### Part 1.2

Implement and test the following functions

In [93]:
def get_word_vector_glove(word: str, embeddings: dict):
    """
    Return embedding of word if it exists, if not the mean embedding of all words
    """
    if word not in embeddings:
        return np.mean(list(embeddings.values()), axis=0)
    return embeddings[word]

In [94]:
## Tests
print(str(get_word_vector_glove("hello", glove_embeddings).shape) == "(300,)")
print(str(np.mean(get_word_vector_glove("hello", glove_embeddings))) == "-5.767882e-05")
print(str(get_word_vector_glove("hello", glove_embeddings)[0]) == "0.200562")

print(
    str(
        get_word_vector_glove(
            "supercalifragilisticexpialidocious", glove_embeddings
        ).shape
    )
    == "(300,)"
)
print(
    str(
        np.mean(
            get_word_vector_glove(
                "supercalifragilisticexpialidocious", glove_embeddings
            )
        )
    )
    == "-0.0065580728"
)
print(
    str(
        get_word_vector_glove("supercalifragilisticexpialidocious", glove_embeddings)[0]
    )
    == "-0.08470377"
)

print(str(get_word_vector_glove("chomsky", glove_embeddings).shape) == "(300,)")
print(
    str(np.mean(get_word_vector_glove("chomsky", glove_embeddings))) == "-0.0065580728"
)
print(str(get_word_vector_glove("chomsky", glove_embeddings)[0]) == "-0.08470377")

True
True
True
True
True
True
True
True
True


In [95]:
from numpy.linalg import norm


def cosine_similarity(vec1: np.array, vec2: np.array):
    """
    Returns cosine similarity between two vectors
    """

    return np.dot(vec1, vec2) / (norm(vec1) * norm(vec2))

In [96]:
print(
    str(cosine_similarity(np.array([1, 2, 3]), np.array([1, 2, -3])))
    == "-0.2857142857142857"
)
print(
    str(cosine_similarity(np.array([1, 2, 3]), np.array([1, -2, 3])))
    == "0.42857142857142855"
)
print(str(cosine_similarity(np.array([1, 2, 3]), np.array([1, 2, 3]))) == "1.0")

sims = {
    "hello hello": "1.0000001",
    "hello hey": "0.93034637",
    "hello hi": "0.9133941",
    "hello supercalifragilisticexpialidocious": "0.73929405",
    "hello chomsky": "0.73929405",
    "hello cat": "0.6564517",
}

for key, val in sims.items():
    word1, word2 = key.split()
    res = str(
        cosine_similarity(
            get_word_vector_glove(word1, glove_embeddings),
            get_word_vector_glove(word2, glove_embeddings),
        )
    )

    print(res == val)

True
True
True
False
False
True
True
True
False


In [97]:
def find_similar(word_vec: np.array, n: int, embeddings: dict, exclude: list):
    """
    Params:
        word_vec: a word embedding
        n: number of similar words to return
        embeddings: key word, value embedding
        exclude: words to be excluded from the output

    Returns:
        n words most similar to word; This does not include words in the exclude list
    """
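    excluded = set(exclude)
    result = {}

    # One cosine similarity per vocabulary word: v iterations, each O(m)
    for word, vector in embeddings.items():
        if word in excluded:  # also safe when an excluded word is not in the vocab
            continue
        result[word] = cosine_similarity(vector, word_vec)
    # Sort all v scores (O(v log v)) and keep the n highest
    return sorted(result.items(), key=lambda x: x[1], reverse=True)[:n]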

In [98]:
find_similar(
    get_word_vector_glove("hello", glove_embeddings),
    10,
    glove_embeddings,
    exclude=["hello"],
)

## Expected output
# [('hey', 0.93034637),
#  ('hi', 0.9133941),
#  ('thank', 0.8759045),
#  ('!', 0.8671343),
#  ('thanks', 0.86197555),
#  ('dear', 0.85105646),
#  ('happy', 0.8466327),
#  ('welcome', 0.838622),
#  ('here', 0.82807016),
#  ('sorry', 0.8268465)]

[('hey', np.float32(0.9303464)),
 ('hi', np.float32(0.9133941)),
 ('thank', np.float32(0.8759045)),
 ('!', np.float32(0.8671342)),
 ('thanks', np.float32(0.86197543)),
 ('dear', np.float32(0.85105634)),
 ('happy', np.float32(0.8466327)),
 ('welcome', np.float32(0.8386218)),
 ('here', np.float32(0.82807004)),
 ('sorry', np.float32(0.82684636))]

In [99]:
find_similar(
    get_word_vector_glove("hello", glove_embeddings), 10, glove_embeddings, exclude=[]
)

## Expected output
# [('hello', 1.0000001),
#  ('hey', 0.93034637),
#  ('hi', 0.9133941),
#  ('thank', 0.8759045),
#  ('!', 0.8671343),
#  ('thanks', 0.86197555),
#  ('dear', 0.85105646),
#  ('happy', 0.8466327),
#  ('welcome', 0.838622),
#  ('here', 0.82807016)]

[('hello', np.float32(0.9999999)),
 ('hey', np.float32(0.9303464)),
 ('hi', np.float32(0.9133941)),
 ('thank', np.float32(0.8759045)),
 ('!', np.float32(0.8671342)),
 ('thanks', np.float32(0.86197543)),
 ('dear', np.float32(0.85105634)),
 ('happy', np.float32(0.8466327)),
 ('welcome', np.float32(0.8386218)),
 ('here', np.float32(0.82807004))]

### Part 1.3

Answer the following questions: 

1. What is the time complexity of `find_similar` if your vocab has `v` words, your embedding size is `m`, and you want to find `n` most similar words to the inputted word? 

We cannot assume that cosine similarity is a constant-time operation. `cosine_similarity` is O(m), because the dot product and the two norms each touch all m dimensions, and `find_similar` calls it once per vocabulary word, so v times in total. Sorting the v scores costs O(v*log(v)), so `find_similar` is O(m*v + v*log(v)) overall. n does not affect the bound, because taking the top n elements of the sorted list is only a slice.
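
If only the top `n` words are needed, the full sort can be avoided. The sketch below is one possible variant, not the implementation used in this notebook: it assumes the embeddings fit in a single NumPy matrix and uses `np.argpartition` to select the `n` best scores in linear time before sorting just those `n`, which gives O(m*v + n*log(n)) overall. The name `find_similar_topn` and the toy vocabulary are hypothetical.

```python
import numpy as np


def find_similar_topn(word_vec, n, embeddings, exclude=()):
    """Hypothetical variant of find_similar that avoids sorting all v scores."""
    excluded = set(exclude)
    words = [w for w in embeddings if w not in excluded]
    n = min(n, len(words))
    if n == 0:
        return []
    matrix = np.stack([embeddings[w] for w in words])  # shape (v, m)

    # Cosine similarity of every row against word_vec: O(m * v).
    sims = matrix @ word_vec / (
        np.linalg.norm(matrix, axis=1) * np.linalg.norm(word_vec)
    )

    # argpartition moves the n largest scores into the last n slots ...
    top = np.argpartition(sims, -n)[-n:]
    # ... and only those n entries are fully sorted: O(n log n).
    top = top[np.argsort(sims[top])[::-1]]
    return [(words[i], float(sims[i])) for i in top]


# Toy 2-dimensional embeddings, made up for illustration only.
toy = {
    "hello": np.array([1.0, 0.0]),
    "hey": np.array([0.9, 0.1]),
    "cat": np.array([0.0, 1.0]),
    "dog": np.array([0.1, 0.9]),
}
result = find_similar_topn(toy["hello"], 2, toy, exclude=["hello"])
print(result)
```

For a 10,000-word vocabulary the difference between sorting everything and partitioning is small in practice; the vectorized matrix product is likely the larger win.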


1. Consider a scenario (e.g., web application that displays similar words) where you might have to repeatedly run `find_similar`, say for `x` times. What are the benefits and challenges of pre-computing the similarity between all words? How might you overcome the challenges? 

Pre-computing is beneficial because each query becomes a lookup of a stored result, so finding similar words takes constant time instead of O(m*v + v*log(v)) per call. However, the initial computation runs `find_similar` for all v words, which costs O(v*(m*v + v*log(v))), and part of that work may be wasted: similarities for less frequent words may never be requested. Instead, the similarities could be pre-computed only for the most common words. Storage is also O(v^2) if every pairwise similarity is kept. To reduce that, only the 100 most similar words could be stored for each word, since it is unlikely that anyone needs more than about the first 100.
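
The pre-computation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not code from the lab: `precompute_neighbors` is a hypothetical helper, the toy embeddings are made up, and the full v-by-v similarity matrix is still built temporarily before being reduced to `k` neighbors per word.

```python
import numpy as np


def precompute_neighbors(embeddings, k):
    """Hypothetical helper: map each word to its k nearest neighbors."""
    words = list(embeddings)
    matrix = np.stack([embeddings[w] for w in words]).astype(np.float64)
    # Normalizing rows once turns cosine similarity into a plain dot product.
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = unit @ unit.T  # all pairs: O(v^2 * m) time, O(v^2) memory
    np.fill_diagonal(sims, -np.inf)  # a word is not its own neighbor

    k = min(k, len(words) - 1)
    table = {}
    for i, word in enumerate(words):
        order = np.argsort(sims[i])[::-1][:k]
        table[word] = [(words[j], float(sims[i, j])) for j in order]
    return table  # stored result is O(v * k)


# Toy 2-dimensional embeddings, made up for illustration only.
toy = {
    "hello": np.array([1.0, 0.0]),
    "hey": np.array([0.9, 0.1]),
    "cat": np.array([0.0, 1.0]),
    "dog": np.array([0.1, 0.9]),
}
table = precompute_neighbors(toy, k=2)
print(table["hello"])
```

After this runs once, each query is a dictionary lookup. For a much larger vocabulary the temporary matrix could be computed in row chunks so that only a few rows of similarities are in memory at a time.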

## Part 2: Computing similarity between `distilgpt` embeddings

### Part 2.1

In [100]:
from transformers import AutoModel, AutoTokenizer
import torch

model_name = "distilgpt2"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

embedding_matrix = model.get_input_embeddings().weight

In [101]:
words = ["hello", "cat", "chomsky", "supercalifragilisticexpialidocious"]

for word in words:
    tokens = tokenizer.tokenize(word, add_special_tokens=False, return_tensors="pt")
    print(tokens)
    ids = tokenizer.convert_tokens_to_ids(tokens)
    print(ids)
    embeds = embedding_matrix[ids]
    print(word, embeds.shape)

['hello']
[31373]
hello torch.Size([1, 768])
['cat']
[9246]
cat torch.Size([1, 768])
['ch', 'omsky']
[354, 37093]
chomsky torch.Size([2, 768])
['super', 'cal', 'if', 'rag', 'il', 'ist', 'ice', 'xp', 'ial', 'id', 'ocious']
[16668, 9948, 361, 22562, 346, 396, 501, 42372, 498, 312, 32346]
supercalifragilisticexpialidocious torch.Size([11, 768])


#### Answer the following question
How does `distilgpt` handle words that are not in its vocab? 


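It divides the word into sub-word tokens that are in its vocab (for example, `chomsky` becomes `ch` and `omsky`), and each of those tokens has its own embedding.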

### Part 2.2


In [102]:
def get_hf_wordvec(word, tokenizer, embedding_matrix):
    """
    Returns an embedding that is the average over all the sub-word tokens
    """
    tokens = tokenizer.tokenize(word, add_special_tokens=False, return_tensors="pt")
    ids = tokenizer.convert_tokens_to_ids(tokens)
    # Index all token rows at once so that repeated sub-word tokens are each
    # counted; a dict keyed by token would silently collapse duplicates
    token_vecs = embedding_matrix[ids].detach().numpy()
    return np.mean(token_vecs, axis=0)

In [103]:
print(str(get_hf_wordvec("hello", tokenizer, embedding_matrix).shape) == "(768,)")
print(
    str(np.mean(get_hf_wordvec("hello", tokenizer, embedding_matrix)))
    == "-0.0013018699"
)
print(str(get_hf_wordvec("hello", tokenizer, embedding_matrix)[0]) == "-0.029814368")

print(
    str(
        get_hf_wordvec(
            "supercalifragilisticexpialidocious", tokenizer, embedding_matrix
        ).shape
    )
    == "(768,)"
)
print(
    str(
        np.mean(
            get_hf_wordvec(
                "supercalifragilisticexpialidocious", tokenizer, embedding_matrix
            )
        )
    )
    == "-0.0007369095"
)
print(
    str(
        get_hf_wordvec(
            "supercalifragilisticexpialidocious", tokenizer, embedding_matrix
        )[0]
    )
    == "-0.023058223"
)

print(str(get_hf_wordvec("chomsky", tokenizer, embedding_matrix).shape) == "(768,)")
print(
    str(np.mean(get_hf_wordvec("chomsky", tokenizer, embedding_matrix)))
    == "0.0015589814"
)
print(str(get_hf_wordvec("chomsky", tokenizer, embedding_matrix)[0]) == "0.026135625")

True
True
True
True
True
True
True
True
True


### Part 2.3

What are the limitations of using `embedding_matrix` in `find_similar`? 

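Since `embedding_matrix` has one row per sub-word token rather than one per word, `find_similar` would return similar sub-words rather than complete words. It would likely return the sub-words within the word as the first few most similar results, which is not necessarily useful. It is also a tensor indexed by token id rather than a dictionary keyed by word, so it cannot be passed to `find_similar` without conversion.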

### Part 2.4

In [104]:
from transformers import AutoTokenizer


def create_hf_embeddings(hf_model_name: str, vocab: set):
    """
    Returns dictionary. Key: words in the vocab; Value: word embeddings for the hf_model for the word
    """

    model_name = hf_model_name
    model = AutoModel.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    embedding_matrix = model.get_input_embeddings().weight

    d = {}
    for word in vocab:
        d[word] = get_hf_wordvec(word, tokenizer, embedding_matrix)

    return d

In [105]:
gpt_embeddings = create_hf_embeddings(model_name, glove_embeddings.keys())
print(len(glove_embeddings.keys()))
print(len(gpt_embeddings.keys()))

10000
10000


## Part 3: Using analogies to study encoding of gender

### Part 3.1

In [106]:
def compute_analogy(analogy: str, embeddings: dict, n: int):
    """
    Params:
        analogy: String of the format: a - b + c
        embeddings:  key word, value embedding
        n: number of closest words to return
    Returns:
        n closest words to the resulting analogy embedding
    """

    words = analogy.split()

    a = words[0]
    b = words[2]
    c = words[-1]

    av = get_word_vector_glove(a, embeddings)
    bv = get_word_vector_glove(b, embeddings)
    cv = get_word_vector_glove(c, embeddings)

    d = av - bv + cv

    r = find_similar(d, n, embeddings, [a, b, c])
    return r

### Part 3.2
Come up with at least 10 analogies that you think are important for testing how robustly gender is encoded in some embeddings. Justify why you picked the examples you did. 

In [111]:
alist = [
    "actress - woman + man",
    "actor - man + woman",
    "grandfather - man + woman",
    "grandmother - woman + man",
    "husband - man + woman",
    "wife - woman + man",
    "daughter - woman + man",
    "son - man + woman",
    "doctor - woman + man",
    "doctor - man + woman",
]

n = 5

for a in alist:
    print(a)
    print(compute_analogy(a, gpt_embeddings, n))
    print("-" * 100)

actress - woman + man
[('congressman', np.float32(0.5581259)), ('act', np.float32(0.50505763)), ('mattress', np.float32(0.4937233)), ('intact', np.float32(0.46369284)), ('compact', np.float32(0.46041718))]
----------------------------------------------------------------------------------------------------
actor - man + woman
[('act', np.float32(0.47990918)), ('acting', np.float32(0.46729234)), ('actress', np.float32(0.46015057)), ('female', np.float32(0.44012728)), ('interact', np.float32(0.4395723))]
----------------------------------------------------------------------------------------------------
grandfather - man + woman
[('grandmother', np.float32(0.72981995)), ('grandparents', np.float32(0.6882695)), ('grandchildren', np.float32(0.6657016)), ('grand', np.float32(0.6172141)), ('mother', np.float32(0.60765404))]
----------------------------------------------------------------------------------------------------
grandmother - woman + man
[('grandfather', np.float32(0.63001597)), ('

These examples test how each model encodes gender in both directions by using minimal pairs such as actor / actress and husband / wife. Running every analogy both ways (swapping `man` and `woman`) checks that the embeddings do not simply prefer one gender and that gender is encoded consistently. We also included a non-family pair, actor / actress. Finally, doctor is a gender-neutral example that tests whether the embeddings carry gender bias.

### Part 3.3
Using analogies from the previous part, test how robustly gender is encoded in `GLoVe` and `distilgpt` embeddings. Add as many code and markdown chunks as you would like. 

In [None]:
for a in alist:
    print(a)
    print(compute_analogy(a, glove_embeddings, n))
    print("-" * 100)

actress - woman + man
[('actor', np.float32(0.90144706)), ('starring', np.float32(0.90102714)), ('star', np.float32(0.8071807)), ('stars', np.float32(0.79611397)), ('movie', np.float32(0.7727861))]
----------------------------------------------------------------------------------------------------
actor - man + woman
[('actress', np.float32(0.9177098)), ('actors', np.float32(0.79851186)), ('starring', np.float32(0.78691894)), ('celebrity', np.float32(0.73182064)), ('portrayed', np.float32(0.7258368))]
----------------------------------------------------------------------------------------------------
grandfather - man + woman
[('grandmother', np.float32(0.90214145)), ('mother', np.float32(0.8688457)), ('daughter', np.float32(0.8610478)), ('wife', np.float32(0.84839374)), ('sister', np.float32(0.8347685))]
----------------------------------------------------------------------------------------------------
grandmother - woman + man
[('grandfather', np.float32(0.88643765)), ('uncle', np.f

Gender is robustly encoded in both sets of embeddings: when `man` or `woman` was subtracted, both returned only correctly gendered words. However, GloVe was significantly more accurate at returning the expected term among its top results, while distilgpt rarely returned the expected term and instead returned other similarly gendered terms.
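
One way to make this comparison less dependent on reading the printed lists is to record the rank of the expected word for each analogy. The sketch below is a hypothetical metric, not part of the lab: `analogy_rank` and the toy 2-dimensional embeddings (first dimension standing in for gender, second for profession) are assumptions made for illustration.

```python
import numpy as np


def analogy_rank(a, b, c, expected, embeddings):
    """Hypothetical metric: 1-based rank of `expected` for a - b + c.

    Returns None if any word is missing or `expected` is one of the inputs.
    """
    if any(w not in embeddings for w in (a, b, c, expected)):
        return None
    if expected in (a, b, c):
        return None
    target = embeddings[a] - embeddings[b] + embeddings[c]
    scores = {}
    for word, vec in embeddings.items():
        if word in (a, b, c):  # same exclusion as compute_analogy
            continue
        scores[word] = np.dot(vec, target) / (
            np.linalg.norm(vec) * np.linalg.norm(target)
        )
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked.index(expected) + 1


# Toy embeddings, made up for illustration only.
toy = {
    "man": np.array([1.0, 0.0]),
    "woman": np.array([-1.0, 0.0]),
    "actor": np.array([1.0, 1.0]),
    "actress": np.array([-1.0, 1.0]),
    "movie": np.array([0.0, 1.0]),
}
print(analogy_rank("actress", "woman", "man", "actor", toy))
print(analogy_rank("actor", "man", "woman", "movie", toy))
```

Averaging this rank (or counting how often it is 1) over the analogy list would give a single number per embedding set to compare.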

## Part 4 (optional): Using analogies to study encoding of other features

In [127]:
blist = [
    "runs - run + eat",
    "flies - fly + walk",
    "goes - go + walk",
    "does - do + say",
    "has - have + need",
    "plays - play + work",
    "writes - write + read",
    "sings - sing + ride",
    "moves - move + jump",
    "comes - come + leave",
]

Justify why you think this is a useful feature:


In [128]:
for a in blist:
    print(a)
    print(compute_analogy(a, gpt_embeddings, n))
    print("-" * 100)

runs - run + eat
[('defeat', np.float32(0.68148243)), ('eaten', np.float32(0.6510309)), ('eating', np.float32(0.554748)), ('breaks', np.float32(0.40725547)), ('carbohydrates', np.float32(0.39687207))]
----------------------------------------------------------------------------------------------------
flies - fly + walk
[('walked', np.float32(0.6367054)), ('walking', np.float32(0.57056844)), ('walnuts', np.float32(0.50213677)), ('walker', np.float32(0.47031176)), ('walks', np.float32(0.46179622))]
----------------------------------------------------------------------------------------------------
goes - go + walk
[('walked', np.float32(0.7552455)), ('walking', np.float32(0.57059413)), ('walker', np.float32(0.51984304)), ('crosses', np.float32(0.50737846)), ('writes', np.float32(0.5015175))]
----------------------------------------------------------------------------------------------------
does - do + say
[('said', np.float32(0.47283748)), ('would', np.float32(0.4654507)), ('makes', np.

In [129]:
for a in blist:
    print(a)
    print(compute_analogy(a, glove_embeddings, n))
    print("-" * 100)

runs - run + eat
[('eating', np.float32(0.9161597)), ('eats', np.float32(0.8856606)), ('eaten', np.float32(0.8730644)), ('ate', np.float32(0.86588645)), ('meal', np.float32(0.84295464))]
----------------------------------------------------------------------------------------------------
flies - fly + walk
[('walking', np.float32(0.8480631)), ('walks', np.float32(0.8360235)), ('walked', np.float32(0.7872538)), ('beside', np.float32(0.7817013)), ('away', np.float32(0.77839863))]
----------------------------------------------------------------------------------------------------
goes - go + walk
[('walking', np.float32(0.89258105)), ('walks', np.float32(0.88117766)), ('sits', np.float32(0.85227436)), ('close', np.float32(0.83825624)), ('turns', np.float32(0.83417255))]
----------------------------------------------------------------------------------------------------
does - do + say
[("'s", np.float32(0.9365444)), ('seems', np.float32(0.932347)), ('fact', np.float32(0.91629213)), ('why',

Plurality is a useful feature to test because a language model must track grammatical number in order to generate well-formed text, and verb pairs such as `run` / `runs` are a simple way to test whether the embeddings capture it.

Plurality was encoded in both the distilgpt and GloVe embeddings, but both often returned words that were similar in meaning to the verbs rather than only the inflected form. GloVe was also far more effective: it returned the singular form as the top result for jump, ride, work, and need.

How useful do you think similarity or analogies are in evaluating word embeddings? Are there any limits on the kind of features you can probe?

Analogies are a relatively simple way to evaluate embeddings, but the results are noisy: they often return words outside the scope of the experiment that are merely similar in usage or meaning. This type of experiment also requires a binary contrast between words so that one meaning can be subtracted cleanly from the other. Each word needs a specific counterpart that captures exactly that difference, which may not always exist.