# Programming Assignment 4: Embeddings! (Winter 2026)

-----------------------------------------

## Installing new packages

--------------------------------------------------
**READ/SKIM THIS WHOLE CELL BEFORE RUNNING COMMANDS** :) If you already installed these packages by following our `README.md` directions, then you are good to go.

We need to install PyTorch in our current environment. We also need to download the [BERT](https://huggingface.co/google-bert/bert-base-uncased) model, and we will do so from [HuggingFace](https://huggingface.co/models/).

First, **please make sure that you are using the most current version of conda**! If you followed the installation instructions on PA0, this should already be the case. To confirm, you can run the following command:

```
conda -V
```

Your output should be something like ```conda 25.11.1```. If you have an older version of conda, please go back to the documentation from PA0 and update it.

### **Recommended**: Install packages into existing environment

To install these packages into our existing python environment, you can run the following commands:

```
conda activate cs124
conda install -c pytorch pytorch
conda install -c huggingface transformers
```
To use these new packages, you may need to restart your kernel (Kernel > Restart).

- On M1+ Macs, PyTorch may fail to detect your hardware if your existing cs124 environment runs through the Rosetta 2 translation layer. If you run into this issue, use the following method to create a new conda environment just for this assignment.

### **Alternative**: Create a new environment for this assignment

Run the following command in the terminal, then restart your notebook: 

```
conda env create -f environment_pa4.yml
conda activate cs124_pa4
```

To use these new packages, switch your kernel to this new environment (Kernel > Change Kernel).

Both of these environments now contain PyTorch and HuggingFace Transformers in addition to the existing packages we had. To verify that they installed correctly, run the following cell:

-------------------------------------------------------

In [None]:
import os
ALLOWED_ENVIRONMENTS = ["cs124_pa4", "cs124"]
assert os.environ['CONDA_DEFAULT_ENV'] in ALLOWED_ENVIRONMENTS
# This BERT model uses about 440MB of storage, but is deletable after the assignment
from transformers import BertTokenizer, BertModel, file_utils

In [None]:
import numpy as np
try:
    import torch
except ImportError:
    print("Error occurred. Did pytorch install correctly? Reach out to us on Ed for help.")

In [None]:
# Do not modify this cell, please just run it!
import quizlet

# Your Mission
The goal of this assignment is for you to build a deeper intuition about embeddings. We want you to understand how to compute them, what they represent, and how to use them!

In the first half, you will work with static embeddings, while in the latter half, you will use contextual embeddings. You don't have to worry if you haven't learned about transformers or BERT just yet; this assignment will walk you through the basics of how to use these models.

## The Static Embeddings

----------------------------------

You’ll be using a subset of ~4k 50-dimensional GloVe embeddings trained on Wikipedia articles. The GloVe (Global Vectors) model learns vector representations for words by looking at global word-word co-occurrence statistics in a body of text and learning vectors such that their dot product approximates the logarithm of the probability of the corresponding words co-occurring in a piece of text. The GloVe model was developed right here at Stanford, and if you’re curious you can read more about it [here](https://nlp.stanford.edu/projects/glove/)!
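
For the curious, the training objective from the GloVe paper can be written as a weighted least-squares fit to log co-occurrence counts. This is included only as background; you do not need it for the assignment. Here $X_{ij}$ is the number of times word $j$ appears in the context of word $i$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, $V$ is the vocabulary size, and $f$ is a weighting function that down-weights rare pairs:

$$
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
$$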

In [None]:
%%bash

if [[ ! -d "./data" ]]
then
    echo "Missing extra files (this probably means you're running on Google Colab). Downloading..."
    git clone https://github.com/cs124/pa4-embeddings.git
    cp -r ./pa4-embeddings/{data,quizlet.py} .
fi

## Part 1a: Synonyms
For this section, your goal is to answer questions of the form:

- What is a synonym for `warrior`?  
  - soldier
  - sailor
  - pirate
  - spy  

You are given as input a word and a list of candidate choices. Your goal is to return the choice you think is the synonym. You’ll first implement three similarity metrics - euclidean distance, dot product, and cosine similarity - then leverage them to answer the multiple choice questions!

Specifically, you will implement the following 5 functions:

* **cosine_similarity()**: calculate the cosine similarity between two vectors. You’ll be using this helper function throughout the other parts of the assignment as well, so you’ll want to get it right!
* **dot_product()**: calculate the dot product between two vectors. 
* **euclidean_distance()**: calculate the euclidean distance between two vectors. 
* **find_synonym()**: given a word, a list of 4 candidate choices, and which similarity metric to use, return which word you think is the synonym! The function takes in `comparison_metric` as a parameter: 

  * if its value is `euc_dist`, you'll use Euclidean distance as the similarity metric. 
  * if its value is `dot_product`, you'll use dot product as the similarity metric. 
  * if its value is `cosine_sim`, you'll use cosine similarity as the metric.
* **part1_written()**: you’ll find that finding synonyms with word embeddings works quite well, especially when using cosine similarity as the metric. However, it’s not perfect. In this function, you’ll look at a question that your `find_synonym()` function (using cosine similarity) gets wrong, and answer why you think this might be the case. Please return your answer as a string in this function.

Note: for the rest of the assignment, you'll only use cosine similarity as the comparison metric. You won't use the euclidean distance or dot product functions anymore.
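
For reference, these are the standard textbook definitions of the three metrics for two vectors $u, v \in \mathbb{R}^n$. The symbols $u$, $v$, and $n$ are our own notation here, not names used in the starter code:

$$
u \cdot v = \sum_{i=1}^{n} u_i v_i
$$

$$
d(u, v) = \lVert u - v \rVert_2 = \sqrt{\sum_{i=1}^{n} (u_i - v_i)^2}
$$

$$
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert_2 \, \lVert v \rVert_2}
$$

Note that cosine similarity is undefined when either vector is the zero vector, since the denominator is then 0.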



In [None]:
def cosine_similarity(v1, v2):
    '''
    Calculates and returns the cosine similarity between vectors v1 and v2
    Arguments:
        v1 (np.array), v2 (np.array): vectors
    Returns:
        cosine_sim (float): the cosine similarity between v1, v2
    '''
    cosine_sim = 0
    #########################################################
    ## TODO: calculate cosine similarity between v1, v2    ##
    #########################################################

    #########################################################
    ## End TODO                                            ##
    #########################################################
    return cosine_sim

def dot_product(v1, v2):
    '''
    Calculates and returns the dot product between vectors v1 and v2
    Arguments:
        v1 (np.array), v2 (np.array): vectors
    Returns:
        dot_product (float): the dot product between v1, v2
    '''
    dot_product = 0
    #########################################################
    ## TODO: calculate dot product between v1, v2          ##
    #########################################################

    #########################################################
    ## End TODO                                            ##
    #########################################################
    return float(dot_product)     

def euclidean_distance(v1, v2):
    '''
    Calculates and returns the euclidean distance between v1 and v2

    Arguments:
        v1 (np.array), v2 (np.array): vectors

    Returns:
        euclidean_dist (float): the euclidean distance between v1, v2
    '''
    euclidean_dist = 0
    #########################################################
    ## TODO: calculate euclidean distance between v1, v2   ##
    #########################################################
   
    #########################################################
    ## End TODO                                           ##
    #########################################################
    return euclidean_dist                 

def find_synonym(word, choices, embeddings, comparison_metric):
    '''
    Answer a multiple choice synonym question! Namely, given a word w 
    and list of candidate answers, find the word that is most similar to w.
    Similarity will be determined by what is passed in as the comparison_metric.

    Arguments:
        word (str): word
        choices (List[str]): list of candidate answers
        embeddings (Dict[str, np.array]): map of words to their embeddings
        comparison_metric (str): 'euc_dist', 'dot_product' or 'cosine_sim'. 
            This indicates which metric to use.
            With euclidean distance, we want the word with the lowest euclidean distance.
            With dot product, we want the word with the highest dot product.
            With cosine similarity, we want the word with the highest cosine similarity.

    Returns:
        answer (str): the word in choices most similar to the given word
    '''
    answer = None
    #########################################################
    ## TODO: find synonym                                  ##
    #########################################################
 
    #########################################################
    ## End TODO                                            ##
    ######################################################### 
    return answer

def part1_written():
    '''
    Finding synonyms using cosine similarity on word embeddings does fairly well!
    However, it's not perfect. In particular, you should see that it gets the last
    synonym quiz question wrong (the true answer would be positive):

    30. What is a synonym for sanguine?
        a) pessimistic
        b) unsure
        c) sad
        d) positive

    What word does it choose instead? In 1-2 sentences, explain why you think 
    it got the question wrong.
    
    See the cell below for the code to run for this part
    '''
    #########################################################
    ## TODO: replace string with your answer               ##
    ######################################################### 
    answer = ""
    #########################################################
    ## End TODO                                            ##
    ######################################################### 
    return answer

In [None]:
"""This will create a class to test the functions you implemented above. If you are curious, 
you can see the code for this in quizlet.py but it is not required. If you run this cell,
we will load the test data for you and run it on your functions to test your implementation.

You should get an accuracy of 66% with euclidean distance and 83% with cosine similarity
"""

part1 = quizlet.Part1_Runner(find_synonym, part1_written)
part1.evaluate(True)  # To only print the scores, pass in False as an argument

## Part 1b: Testing Understanding of Comparison Metrics

In this section, we want you to exercise your understanding of the concepts you implemented in the functions above to give us answers that satisfy the following questions. Please read the questions carefully.

You do NOT need to write any code to find these answers; we expect you to calculate them yourself by hand using what you have learned in class. ONLY return the appropriate vector: your answer for each function should be one line only (the vector).

For all of the questions, we are asking for a vector with dimensionality $4$. It is given that $A = [2, 1, -3, 0]$. Your answer should be the return value for each of the following functions. **Please ensure that you are returning a numpy array**!

In [None]:
A = np.array([2, 1, -3, 0])

def zero_dot_product():
    '''
    For this function, return a non-zero vector B of dimensionality 4
    such that the dot product between A and B is 0. Ensure that your
    answer is a numpy array.
    '''
    B = None
    #########################################################
    ## TODO: find B whose dot product with A is 0          ##
    #########################################################

    #########################################################
    ## End TODO                                            ##
    ######################################################### 
    return B

def minimise_euc_dist():
    '''
    For this function, return a vector C, of dimensionality 4 that 
    MINIMISES the euclidean distance between A and C. 
    Ensure that your answer is a numpy array.
    '''
    C = None
    #########################################################
    ## TODO: find C that minimises euclidean distance      ##
    #########################################################
  
    #########################################################
    ## End TODO                                            ##
    ######################################################### 
    return C

def maximise_cosine_sim():
    '''
    For this function, return a vector D, of dimensionality 4 that 
    MAXIMISES the cosine similarity between A and D. 
    Ensure that your answer is a numpy array.
    '''
    D = None
    #########################################################
    ## TODO: find D that maximises cosine similarity       ##
    #########################################################
 
    #########################################################
    ## End TODO                                            ##
    ######################################################### 
    return D

def get_vector_E():
    '''
    For this function, return a vector E, of dimensionality 4 
    such that: 
     * The cosine similarity between A and E is < 0.5
     * The Euclidean distance between A and E is > 2

    Any vector that satisfies these constraints is acceptable.
    Ensure that your answer is a numpy array.
    '''
    E = None
    #########################################################
    ## TODO: find E that satisfies constraints             ##
    #########################################################
    
    #########################################################
    ## End TODO                                            ##
    ######################################################### 
    
    # Spot check!
    # If your answer is incorrect, you will get an error here.
    assert(cosine_similarity(A, E) < 0.5)
    assert(euclidean_distance(A, E) > 2)
    
    return E

def minimise_cosine_sim():
    '''
    For this function, return a vector F, of dimensionality 4 that 
    MINIMISES the cosine similarity between A and F. 
    Ensure that your answer is a numpy array.
    '''
    F = None
    #########################################################
    ## TODO: find F that minimises cosine similarity       ##
    #########################################################
  
    #########################################################
    ## End TODO                                            ##
    ######################################################### 
    return F

def get_vector_G():
    '''
    For this function, return a vector G, of dimensionality 4 
    such that the cosine similarity between A and G is > 0.75.

    G CANNOT be equal to A.

    Any vector that satisfies these constraints is acceptable.
    Ensure that your answer is a numpy array.
    '''
    G = None
    #########################################################
    ## TODO: find G with cosine similarity > 0.75          ##
    #########################################################
  
    #########################################################
    ## End TODO                                            ##
    ######################################################### 
    return G

Run the following cell to print your results for each of these functions.

In [None]:
print('Vector with zero dot product with A:', zero_dot_product())
print('Vector minimizing euclidean distance from A:', minimise_euc_dist())
print('Vector maximizing cosine similarity with A:', maximise_cosine_sim())
print('A vector that has cosine similarity < 0.5 with A but Euclidean distance > 2 with A:', get_vector_E())
print('Vector minimizing cosine similarity with A:', minimise_cosine_sim())
print('A vector that has cosine similarity > 0.75 with A:', get_vector_G())

Do you notice some parallels between the vectors that maximise and minimise dot product and cosine similarity? When working with word embeddings, we care about the direction of the embeddings relative to each other and NOT their magnitude. This is why we use cosine similarity!
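
To make the direction-versus-magnitude point concrete, here is a minimal sketch using made-up 2-D vectors (not taken from the GloVe data). The helper name `cos_sim` is hypothetical and separate from the `cosine_similarity()` you implemented above. Scaling a vector changes its dot product and Euclidean distance with another vector, but leaves the cosine similarity unchanged:

```python
import numpy as np

# Toy vectors, made up for illustration only.
a = np.array([1.0, 2.0])
b = np.array([2.0, 1.0])
b_scaled = 10 * b  # same direction as b, ten times the length


def cos_sim(u, v):
    # Hypothetical helper: dot product divided by the product of the L2 norms.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


dot_ab = float(np.dot(a, b))
dot_scaled = float(np.dot(a, b_scaled))
dist_ab = float(np.linalg.norm(a - b))
dist_scaled = float(np.linalg.norm(a - b_scaled))
cos_ab = cos_sim(a, b)
cos_scaled = cos_sim(a, b_scaled)

print("dot product:       ", dot_ab, "->", dot_scaled)
print("euclidean distance:", dist_ab, "->", dist_scaled)
print("cosine similarity: ", cos_ab, "->", cos_scaled)
```

The first two rows change when `b` is scaled, while the cosine similarity stays the same (up to floating-point rounding).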

## Part 1c: Antonyms

Whereas synonyms are words with identical or similar meanings, antonyms are words with an opposite meaning, like: 
* long / short 
* big / little 
* fast / slow
* cold / hot 
* rise / fall 
* up / down 
* in / out 

Two senses can be antonyms if they define a binary opposition or are at opposite ends of some scale. This is the case for long/short, fast/slow, or big/little, which are at opposite ends of the length or size scale. Another group of antonyms, reversives, describe change or movement in opposite directions, such as rise/fall or up/down. Antonyms thus differ completely with respect to one aspect of their meaning— their position on a scale or their direction—but are otherwise very similar, sharing almost all other aspects of meaning. Thus, automatically distinguishing synonyms from antonyms can be difficult.

In this section, we explore antonyms in the embedding space.

First, complete the function ```antonym_light()``` to return an antonym of the word "light" that has a *higher cosine similarity* with it than its synonym, "bright" (~0.7481). **Please make sure that your answer is in lowercase!** You can verify that it has a higher cosine similarity to the word light than 0.7481 by running the three cells below.

In [None]:
def antonym_light():
    antonym = ""
    #########################################################
    ## TODO: return an antonym of 'light'                  ##
    #########################################################

    #########################################################
    ## End TODO                                            ##
    ######################################################### 
    assert(antonym.isalpha() and antonym.islower())
    return antonym

Now, find two other words that are **antonyms of each other with high similarity**. Complete the function ```get_antonyms()``` below to return this pair of words. **Please make sure that your answer is in lowercase!**

In [None]:
def get_antonyms():
    word1 = ""
    word2 = ""

    #########################################################
    ## TODO: return a pair of antonyms                     ##
    #########################################################

    #########################################################
    ## End TODO                                            ##
    ######################################################### 
    assert(word1.isalpha() and word1.islower())
    assert(word2.isalpha() and word2.islower())
    return word1, word2

Run the following cell to see the cosine similarity between (1) 'light' and its antonym; and (2) the antonym pair returned by ```get_antonyms()```. You should get an error if any of the words you enter do not have a corresponding embedding in our data. If that happens, simply choose another word / pair of antonyms.

In [None]:
# Do not change this cell
part1 = quizlet.Part1_Runner(find_synonym, part1_written)
part1.evaluate_antonyms(antonym_light, get_antonyms, cosine_similarity)

Are these results consistent with what you would expect? Why do you think antonyms are so high in similarity with each other despite having opposite meanings? Answer in 2-3 sentences in the function ```part1_antonyms_written()``` below. 

In [None]:
def part1_antonyms_written():
    #########################################################
    ## TODO: replace string with your answer               ##
    ######################################################### 
    answer = ""
    #########################################################
    ## End TODO                                            ##
    ######################################################### 
    return answer

## Part 2: Exploration
In this section, you'll do an exploration question. Specifically, you'll implement the following 2 functions:

* **occupation_exploration()**: given a list of occupations, find the top 5 occupations with the highest cosine similarity to the word "man", and the top 5 occupations with the highest cosine similarity to the word "woman".
* **part2_written()**: look at your results from the previous exploration task. What do you observe, and why do you think this might be the case? Write your answer within the function by returning a string.


In [None]:
def occupation_exploration(occupations, embeddings):
    '''
    Given a list of occupations, return the 5 occupations that are closest
    to 'man', and the 5 closest to 'woman', using cosine similarity between
    corresponding word embeddings as a measure of similarity.

    Arguments:
        occupations (List[str]): list of occupations
        embeddings (Dict[str, np.array]): map of words (strings) to their embeddings (np.array)

    Returns:
        top_man_occs (List[str]): list of 5 occupations closest to 'man'
        top_woman_occs (List[str]): list of 5 occupations closest to 'woman'
            note: both lists should be sorted, with the occupation with highest
                  cosine similarity first in the list
    '''
    top_man_occs = []
    top_woman_occs = []
    #########################################################
    ## TODO: get 5 occupations closest to 'man' & 'woman'  ##
    #########################################################

    #########################################################
    ## End TODO                                            ##
    #########################################################
    return top_man_occs, top_woman_occs

def part2_written():
    '''
    Take a look at what occupations you found are closest to 'man' and
    closest to 'woman'. Do you notice anything curious? In 1-2 sentences,
    describe what you find, and why you think this occurs.
    '''
    #########################################################
    ## TODO: replace string with your answer               ##
    ######################################################### 
    answer = ""
    #########################################################
    ## End TODO                                            ##
    ######################################################### 
    return answer

In [None]:
part2 = quizlet.Part2_Runner(occupation_exploration, part2_written)
part2.evaluate() 

## Part 3: Contextual embeddings

For this section, your goal is to understand contextual embeddings, which are more powerful than static embeddings. In a static embedding, we just have one vector for each word. In a contextual embedding, such as those produced by the BERT model, the vector for a word is influenced by all its neighbors. That means the embedding for the same word is different when it appears in different sentences! We won't study the transformer that is the core mechanism of BERT until later in the quarter, so in this assignment you are just exploring BERT as a [black box](https://en.wikipedia.org/wiki/Black_box).

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased') # About 440MB large

# Feel free to ignore deprecation/unused weight warnings

In [None]:
# We want to run the model on our GPU if possible, but if not, we can use a CPU
if torch.backends.mps.is_available(): # Available on Macs with Apple silicon or AMD GPUs
    device = torch.device("mps")
    model.to(device)
elif torch.cuda.is_available(): # Available on computers with NVIDIA GPUs
    device = torch.device("cuda")
    model.to(device)
else:
    device = torch.device("cpu")
print("Model is on device: ", device)

### Part 3.1: Contextual embedding with BERT

In this section, you will complete the following function ```get_bert_word_embedding```. In doing so, you will learn how to preprocess text for BERT by tokenizing a sentence and extracting embeddings for a specific word.

You will find the PyTorch section of the ["How to use" the BERT base model](https://huggingface.co/google-bert/bert-base-uncased#how-to-use) helpful. 


In this function, we pass in a single sentence. We want to find the position of ```target_word``` in ```sentence``` (already implemented for you), and use this to extract the embedding of the target word using ``last_hidden_state``. You might also want to refer to the example given in this [BERTModel](https://huggingface.co/transformers/v3.0.2/model_doc/bert.html#bertmodel) documentation.

Please use the variable names provided for you! Simply write your code in place of ```None```.

In [None]:
def get_bert_word_embedding(sentence, target_word):
    '''
    This function runs a sentence through BERT and
    returns the embedding for the target word (shape (768,)).
    '''
    
    # We need to convert the input sentence into tokens that BERT can understand. 
    #########################################################
    ### TODO: Tokenize the sentence. Use return_tensors     #
    #         to get the PyTorch format.                    #
    #########################################################
    ### BEGIN CODE HERE (~1 line) ###
    inputs = None
    ### END CODE HERE ###
    
    #########################################################
    ### TODO: Obtain the ID of the target word from the     #
    #         tokenizer, ensuring no special tokens are     #
    #         added.                                        #
    #########################################################
    # Hint: Use the encode() function of the tokenizer to do this!
    # You might find this documentation helpful: https://huggingface.co/docs/transformers/en/main_classes/tokenizer#transformers.PythonBackend.encode
    ### BEGIN CODE HERE (~1 line) ###
    word_id = None
    ### END CODE HERE ###
    
    # Ensures that a word is NOT split into multiple tokens by BERT.
    if len(word_id) != 1:
        raise ValueError(f"'{target_word}' is split into multiple tokens by BERT. Please choose a simpler (~1 syllable) word.") 
    word_id = word_id[0]  # Get the actual token ID of the target word.

    # Extracts the word position of target word in the sentence
    word_position = torch.where(inputs['input_ids'][0] == word_id)[0]  
    
    # Ensures that the target word is found in the sentence.
    if len(word_position) == 0:
        raise ValueError(f"'{target_word}' not found in the sentence.")

    # Pass inputs through the model to get the word embeddings
    with torch.no_grad():
        if device.type != "cpu":  # compare by value; `is not` on a fresh torch.device is always True
            inputs = {key: val.to(device) for key, val in inputs.items()}
        outputs = model(**inputs)
    
    #########################################################
    ### TODO: Extract the embedding using the position      #
    #         obtained earlier.                             #
    #########################################################
    # Hint: Use outputs.last_hidden_state, which is a 
    # tensor of shape [batch_size, seq_len, hidden_size].
    ### BEGIN CODE HERE (~1 line) ###
    embedding = None
    ### END CODE HERE ###

    return embedding.cpu().numpy()  # Numpy does not support GPU tensors, so we move it to the CPU


- Your task is to use BERT to study word polysemy (the fact that words can have multiple senses that differ in meaning, like "bat", which can mean both the flying mammal and the piece of baseball equipment). Your job is to find a maximally ambiguous word. We have provided an example in the code below.

In [None]:
example_word = "bank"
example_sentence1 = f"I went to the {example_word} to deposit my money."
example_sentence2 = f"I went down by the river {example_word} to see the ducks."

In [None]:
def get_polyseme_similarity(word, sentence1, sentence2, return_score=False):
    embedding1 = get_bert_word_embedding(sentence1, word)
    embedding2 = get_bert_word_embedding(sentence2, word)
    similarity = cosine_similarity(embedding1, embedding2)
    if return_score:
        return similarity
    else:
        print(f"This word is {similarity*100:.2f}% similar in the two sentences.")

get_polyseme_similarity(example_word, example_sentence1, example_sentence2)

- Now it's your turn! Try to find a ~1 syllable [polyseme](https://prepedu.com/en/blog/polysemy-in-english) that can be used in very different contexts. You will get full points for getting it under 54% similarity. We'll have a leaderboard on Gradescope for the lowest similarity score achieved (in our testing, we achieved approx. 35%).


In [None]:
def part3():
    '''
    Returns
        word (str): the word used in both sentences
        sentence1 (str): the first sentence
        sentence2 (str): the second sentence

    HINT: This word should be a polyseme, meaning it has
    multiple meanings, and each sentence should use a different definition.
    '''
    #########################################################
    ## TODO: replace strings with your answers             ##
    ######################################################### 
    word = ""
    sentence1 = ""
    sentence2 = ""
    #########################################################
    ## End TODO                                            ##
    ######################################################### 
    return word, sentence1, sentence2

In [None]:
part3 = quizlet.Part3_Runner(part3, get_bert_word_embedding, cosine_similarity)
part3.evaluate()

## Part 4: Sentence Similarity with BERT

For this section, your goal is to answer questions of the form:

- How semantically similar are the following two sentences?:

    - he later learned that the incident was caused by the concorde's sonic boom

    - he later found out the alarming incident had been caused by concorde's powerful sonic boom

### Part 4.1: Sentence-level embeddings with BERT

In this section, we will be leveraging the BERT model for a sentence classification task. In the real world, many applications of semantic understanding are done with fine-tuned transformer models, and we will be using a simple BERT model that was trained by Google on [BookCorpus](https://en.wikipedia.org/wiki/BookCorpus) and English Wikipedia. To efficiently get the embeddings for multiple sentences, we will implement `get_bert_sentence_embeddings()`.

Our `get_bert_sentence_embeddings()` function takes in two parameters besides our inputs. The `batch_size` parameter exists to limit memory usage, which is necessary if you want to use this function to compute embeddings on an even larger dataset. (Feel free to try it yourself!) The boolean `use_CLS` selects which of the following two methods we use to represent each sentence:
* **Use the final [CLS] token embedding**: The first token represents the combined context of the full sentence, so we will simply compare this one token across sentences
* **Mean pooling over all sentence tokens**: We will average the token embeddings in the last hidden layer of our BERT outputs. Calculating this is a bit complex, so we've done a lot of the steps for you already. Each step is explained with comments, but for each sentence, we are summing the outputs for each token, but only where the token is not a padding token.

Some (hopefully) helpful hints!
- For extracting the [CLS] token embeddings:
  - We want an output of shape (n_sentences, 768), and the shape of `outputs.last_hidden_state` is (n_sentences, sequence_length, 768). The last dimension is 768 because, at the last hidden layer of BERT, each token is represented by a vector of this length. If you do not know how multi-dimensional slicing works in NumPy/PyTorch, this guide may be helpful: [Python Slice Indexing](https://www.geeksforgeeks.org/python-slicing-multi-dimensional-arrays/)
- For getting the mean of all tokens in the output
  - We have implemented the hard part of mean pooling already, and we documented it in the comments. Since we are passing in sentences of varying lenghts, all sentences are padded with [PAD] tokens that we wish to ignore. We use a mask to zero-out the embeddings for the [PAD] tokens, and we also ignore them in our total count by summing the mask. 
  - What is left to do is take the mean of these filtered embeddings. To do so, we sum along the sequence axis, then divide by the provided sum_mask variable.
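The two ideas in these hints, slicing one position out of a 3-D array and averaging only the unmasked positions, can be tried on a toy example first. The sketch below uses NumPy with made-up values and a hidden size of 4 instead of 768; the array `hidden` and the mask are assumptions standing in for real BERT outputs, and the PyTorch version in the assignment uses `dim=` where NumPy uses `axis=`.

```python
import numpy as np

# Toy stand-in for a last hidden state: 2 sentences, 3 token positions,
# hidden size 4 (real BERT base uses 768). All values are made up.
hidden = np.arange(24, dtype=float).reshape(2, 3, 4)

# Slicing: ":" keeps a whole axis, while an integer picks one index and drops that axis.
first_position = hidden[:, 0, :]
print(first_position.shape)  # (2, 4)

# Attention mask: 1 = real token, 0 = padding. The second sentence has one padded position.
mask = np.array([[1, 1, 1],
                 [1, 1, 0]], dtype=float)
mask_expanded = np.broadcast_to(mask[:, :, None], hidden.shape)

# Zero out padded positions, then divide the per-sentence sum by the real-token count.
masked = hidden * mask_expanded
token_counts = np.clip(mask_expanded.sum(axis=1), 1e-9, None)
mean_pooled = masked.sum(axis=1) / token_counts
print(mean_pooled)
```

For the second toy sentence, the result is the average of its first two token vectors only, so the padded position has no effect on the mean.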


In [None]:
def get_bert_sentence_embeddings(sentences, use_CLS=True, batch_size=16):
    '''
    Generate sentence embeddings with BERT, using either the [CLS] token or mean pooling.
    
    Arguments:
        sentences (List[str]): Input sentences.
        use_CLS (bool): Whether to use the CLS token as the sentence embedding.
                        If it is false, we use mean pooling over sentence tokens.
        batch_size (int): Batch size for processing the sentences.
    
    Returns:
        np.ndarray: Sentence embeddings of shape (n_sentences, 768).
    '''
    all_embeddings = []
    # We process the sentences in batches to avoid running out of memory. 
    # Feel free to experiment with the batch size; 8 or 16 are likely best.
    for i in range(0, len(sentences), batch_size):
        batch_sentences = sentences[i:i+batch_size]
        inputs = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors='pt')
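        # Move inputs to `device`; .to() is a no-op when the tensors are already there (e.g. on CPU).
        # (The old check `device is not torch.device("cpu")` compared object identity, so it was always True.)
        inputs = {k: v.to(device) for k, v in inputs.items()}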
        with torch.no_grad(): # Runs the model without calculating gradients
            outputs = model(**inputs) # outputs.last_hidden_state shape: (batch_size, max_sentence_length, 768)
        embeddings = None
        if use_CLS:
            #########################################################
            ### TODO: Extract each CLS token embedding from the     #
            #         output of each sentence                       #
            #########################################################
            ### BEGIN CODE HERE (~1 line) ###
            
            ### END CODE HERE ###
            pass # Remove this line once you have completed your implementation!!
        else:
            # We first create a mask to distinguish real tokens from padding tokens
            attention_mask = inputs['attention_mask']
            # We then expand the mask to the same shape as the embeddings
            mask_expanded = attention_mask.unsqueeze(-1).expand(outputs.last_hidden_state.size()).float()
            # Sum the 1s in the mask to get the number of non-padding tokens for each sentence
            sum_mask = mask_expanded.sum(dim=1)
            # Clamp the sum to a minimum of 1e-9 to avoid division by zero
            sum_mask = torch.clamp(sum_mask, min=1e-9)
            # Apply the mask to the embeddings, so that the padding tokens are ignored
            masked_embeddings = outputs.last_hidden_state * mask_expanded
            #########################################################
            ### TODO: Extract the mean of all token embeddings      #
            #         by summing the embeddings along the sequence  #
            #         dimension and dividing by the sum_mask        #
            #########################################################
            ### BEGIN CODE HERE (~1-2 lines) ###
            
            ### END CODE HERE ###
        all_embeddings.append(embeddings)
    
    embeddings = torch.cat(all_embeddings, dim=0)
    return embeddings.cpu().numpy() # Numpy does not support GPU tensors, so we move it to the CPU

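### Part 4.2: Spot check
Test out your implementations!
You should find that sentences 2 and 3 are quite similar to each other (cosine similarity above 97% with the [CLS] embedding, above 90% with mean pooling). The cell below uses `use_CLS=True`; set `use_CLS=False` to check mean pooling.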

In [None]:
sentences = ["To be or not to be, that is the question.", 
             "The feline sat on the rug.", "The cat sat on the mat."]

embeddings = get_bert_sentence_embeddings(sentences, use_CLS=True)

cos_sim_12 = cosine_similarity(embeddings[0], embeddings[1])
cos_sim_13 = cosine_similarity(embeddings[0], embeddings[2])
cos_sim_23 = cosine_similarity(embeddings[1], embeddings[2])
print(f"Similarity between s1 and s2: {cos_sim_12*100:.2f}%")
print(f"Similarity between s1 and s3: {cos_sim_13*100:.2f}%")
print(f"Similarity between s2 and s3: {cos_sim_23*100:.2f}%")

## Part 5: Ethical Considerations in Embedding Spaces

In Parts 1 through 4, we used embeddings as mathematical tools to measure word similarity and sentence context. However, because these vectors are trained on massive datasets of human-generated text, they are not neutral mirrors of reality; they often encode the historical biases, emotional weights, and social structures of the eras in which they were written. This final section explores how these mathematical distances manifest as real-world ethical challenges in the systems we use every day.

### Part 5.1: Narrative Weight and Media Connotation

In Part 4, you compared sentence similarity to see how BERT handles semantic meaning. However, word embeddings also carry "connotations" that go beyond a dictionary definition.

**A. Sensationalism in the Feed**: Think about how you consume information through platforms like Google News or Apple News. These systems act as news aggregators, using embeddings to "cluster" similar stories together. Consider how different outlets describe the same economic event: one headline says stocks "dipped" while another says they "plunged."

Given that these words might be mathematically close in an embedding space, how might a search engine or news aggregator inadvertently change the "feel" of a topic based on which words it clusters together? Provide your own example of a pair of words (one neutral, one sensationalized) that describe the same type of event. ***Please return your answer as a string in the ``part5_1_a()`` function in the cell below***.

**B. Framing through Word Choice**: Consider word pairs that frequently appear in similar sentence structures, such as "protester" and "rioter." While an embedding model might see them as similar because they appear in similar contexts, they represent different framings of the same event and ultimately, different ways of interpreting what's happening.

How might a system that treats these framings as interchangeable impact public perception of events? Provide your own example of another pair of words that are semantically similar but frame a situation differently. ***Please return your answer as a string in the ``part5_1_b()`` function in the cell below***.


In [None]:
def part5_1_a():
    #########################################################
    ## TODO: replace string with your answer               ##
    ######################################################### 
    answer = ""
    #########################################################
    ## End TODO                                            ##
    ######################################################### 
    return answer

def part5_1_b():
    #########################################################
    ## TODO: replace string with your answer               ##
    ######################################################### 
    answer = ""
    #########################################################
    ## End TODO                                            ##
    ######################################################### 
    return answer

### Part 5.2: Algorithmic Career Funneling
Think back to your time in middle or high school. You likely sat through at least one "Interest Inventory" or "Career Aptitude Test" – those surveys where you clicked "strongly agree" or "disagree" to statements about your hobbies and strengths.

Many schools use platforms like Naviance to take those answers and map them to career clusters. Mathematically, these systems often work just like the embeddings you’ve built: they turn your interests into a student vector and find the closest career vectors using similarity metrics. Later this quarter (PA7), you'll implement recommendation engines using collaborative filtering, which relies on these same kinds of vector similarities.
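As a rough illustration of the "student vector, closest career vectors" idea, here is a minimal sketch in plain Python. Everything in it is hypothetical: the three-dimensional interest space, the career vectors, the student vector, and the helper `toy_cosine` are invented for this example and do not describe how any real platform works.

```python
import math


def toy_cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical interest dimensions: (building things, helping people, writing).
career_vectors = {
    "engineer": [0.9, 0.2, 0.1],
    "nurse": [0.2, 0.9, 0.2],
    "journalist": [0.1, 0.3, 0.9],
}
student = [0.8, 0.4, 0.1]

# Rank careers from most to least similar to the student's survey answers.
ranked = sorted(
    career_vectors,
    key=lambda career: toy_cosine(student, career_vectors[career]),
    reverse=True,
)
print(ranked)  # ['engineer', 'nurse', 'journalist']
```

Note that the ranking depends entirely on how the career vectors were built. If they come from the profiles of people who historically held those jobs, the recommendations inherit whatever patterns are in that history.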

**A. Mirror vs. Aspirational Systems**: In last week's lab, we discussed the distinction between systems that mirror existing patterns (reflecting "what is") versus aspirational systems that guide toward desired outcomes (pursuing "what could be").

How does a career recommendation system that matches students to careers based on historical success profiles fit into this mirror/aspirational distinction? In what ways does it resemble each kind of system, and in what ways does it differ? What are the implications of each approach for a student exploring their future?

***Please return your answer as a string in the ``part5_2_a()`` function in the cell below***.

**B. Background Factors**: These systems sometimes integrate data beyond just interests, such as a student's zip code or their school's historical performance metrics.

What concerns might arise from including this background information in career recommendations? How might this affect students differently depending on their circumstances?

***Please return your answer as a string in the ``part5_2_b()`` function in the cell below***.

**C. Individual vs. Societal Aspirations**: Even when we build systems that try to guide students, we have to ask: whose goals are we optimizing for?

Imagine a student from an underrepresented background exploring career options. They might genuinely want to see careers where people like them are already successful, as having role models and community can be important. At the same time, some argue there’s value in increasing representation in fields that have historically been less diverse, while others might prioritize different values or outcomes.

When an individual’s preferences don’t align with broader social goals, what tensions arise? How might you think about designing a system that navigates this complexity?

***Please return your answer as a string in the ``part5_2_c()`` function in the cell below***.


In [None]:
def part5_2_a():
    #########################################################
    ## TODO: replace string with your answer               ##
    ######################################################### 
    answer = ""
    #########################################################
    ## End TODO                                            ##
    ######################################################### 
    return answer

def part5_2_b():
    #########################################################
    ## TODO: replace string with your answer               ##
    ######################################################### 
    answer = ""
    #########################################################
    ## End TODO                                            ##
    ######################################################### 
    return answer

def part5_2_c():
    #########################################################
    ## TODO: replace string with your answer               ##
    ######################################################### 
    answer = ""
    #########################################################
    ## End TODO                                            ##
    ######################################################### 
    return answer

## Congrats on finishing!

As a parting thought, we hope that these past several assignments have gotten you thinking about how large-scale text algorithms work and how we can improve them at scale. We only implemented small parts at a time, but hopefully these foundations are helpful in thinking about how language modeling algorithms shape how information is stored and retrieved online.

### If you collaborated with a partner, describe below.

In [None]:
def collaboration():
    '''
    Returns:
        answer (str): what you and your partner did each / together
    '''
    return ""

## Submission

Once you're ready to submit, you can run the cell below to prepare and zip up your solution.

If you're running on Google Colab, see the README for instructions on how to submit.


In [None]:
%%bash

if [[ ! -f "./pa4.ipynb" ]]
then
    echo "WARNING: Did not find notebook in Jupyter working directory. This probably means you're running on Google Colab. You'll need to go to File->Download .ipynb to download your notebook and other files, then zip them locally. See the README for more information."
else
    echo "Found notebook file, creating submission zip..."
    zip -r submission.zip pa4.ipynb deps/
fi

__Some reminders for submission:__
* Make sure you didn't accidentally change the name of your notebook file (it should be `pa4.ipynb`), as that is required for the autograder to work.
* Go to Gradescope (gradescope.com), find the PA4 Quizlet assignment, and upload your zip file (`submission.zip`) as your solution.
* Wait for the autograder to run and check that your submission was graded successfully! If the autograder fails, or you get an unexpected score, it may be a sign that your zip file was incorrect.