<a href="https://colab.research.google.com/github/aratrikpaul2024/aratrikpaul-site/blob/main/8.clustering/HW8_Word_Sense_Induction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dbamman/anlp25/blob/main/8.clustering/HW8_Word_Sense_Induction.ipynb)


# HW8: Unsupervised Word Sense Induction

The same _word type_ can have different _senses_, or meanings. For example, "class" could refer to a category ("I have never flown first class.") or a course ("ANLP is my favorite class."). Because language changes quickly, a new word sense can come into common usage before dictionaries are updated to record it. In this setting, we might be interested in _inducing_ word senses: how can we surface different senses of a word without knowing its definitions _a priori_?

In this homework, you will be working on this classic NLP task by clustering BERT token embeddings.

In [1]:
from transformers import AutoTokenizer, BertModel
import pandas as pd
import torch
from torch import nn

In [2]:
# Make sure the GPU is available
torch.cuda.is_available()

True

In [3]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")


In [4]:
import xml.etree.ElementTree as ET

def parse_semeval_file(filepath):
    data = ET.parse(filepath)
    root = data.getroot()
    tag = root.tag
    word = tag.strip().split(".")[0]

    data = []

    for sentence in root:
        split = sentence.text.lower().split(" ")
        if word not in split:
            continue
        data.append({"word": word, "sentence": split})

    return data
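
To make the parser's assumptions concrete, here is a minimal sketch that applies the same logic to an in-memory XML string. The child tag names (`address.n.1`, and so on), the toy sentences, and the helper name `parse_semeval_string` are illustrative assumptions, not taken from the real data file. Note that a context is kept only when the exact token `address` appears after splitting on spaces, so a context containing only "addresses" is skipped.

```python
import xml.etree.ElementTree as ET

# Toy XML in the shape the parser expects: the root tag is "<word>.<pos>"
# and each child element holds one context. Tag names are assumptions.
TOY_XML = """<address.n>
  <address.n.1>Please send it to my home address .</address.n.1>
  <address.n.2>The two addresses were swapped .</address.n.2>
  <address.n.3>Type the web address into the browser .</address.n.3>
</address.n>"""

def parse_semeval_string(xml_text):
    # Hypothetical twin of parse_semeval_file that reads from a string
    root = ET.fromstring(xml_text)
    word = root.tag.strip().split(".")[0]
    rows = []
    for sentence in root:
        tokens = sentence.text.lower().split(" ")
        # keep only contexts where the exact target token appears
        if word not in tokens:
            continue
        rows.append({"word": word, "sentence": tokens})
    return rows

toy_data = parse_semeval_string(TOY_XML)
print(len(toy_data), [" ".join(row["sentence"]) for row in toy_data])
```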

In [5]:
# download data file
!wget https://raw.githubusercontent.com/dbamman/anlp25/main/data/wsi/address.n.xml

--2025-10-17 01:31:01--  https://raw.githubusercontent.com/dbamman/anlp25/main/data/wsi/address.n.xml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3389340 (3.2M) [text/plain]
Saving to: ‘address.n.xml’


2025-10-17 01:31:01 (55.4 MB/s) - ‘address.n.xml’ saved [3389340/3389340]



## Exploring the data

### Question 1

Load the data for "address" (`address.n.xml`). Look through the examples and **identify two sentences that use "address" with different senses**. Report each sentence and what definition of "address" is being used.

In [6]:
data = parse_semeval_file("address.n.xml")

In [7]:
for i in range(10):
    print(i, " ".join(data[i]['sentence']))


0 [ 0014 ] in a preferred embodiment , the local address is modularly increasing so as to form cycles of addresses , as claimed in claims 2 and 6. for example , the address may be incremented by one unit at each subsequent block , from zero to the maximum possible value , after which the value zero is used again , and so forth. in this way it is achieved that blocks with the same local address are located within the physical space at the maximum distance one from another . 
1 in order to ensure that the correct addresses are placed on the direct mail-outs , companies may want to consider placing an incentive within the mailing to get the customer to mail them back. this may be in the form of a discount code that is placed within the direct mailing that when consumers refer to it , will indicate that one piece of mail was effective in the marketing campaign. after receiving some of these direct mailings back , the company will have a good indication as to whether or not they are reachin

Sentence: Number 7 - “the address of a web-page is called a url or uniform resource locator.”
Sense: Here, “address” means a web link (URL) used to locate a resource on the internet.

Sentence: Number 3 - “you will also need to provide change of address information to the post office , and i recommend a note in the mailbox informing the mailman of your arrival.”
Sense: In this case, “address” refers to a physical or mailing location where someone lives or receives mail.
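
Reading through thousands of contexts by hand is slow, so one possible shortcut is to filter for contexts that contain a cue word and print a short window around the target. The sketch below assumes rows shaped like the output of `parse_semeval_file`; the helper `contexts_with`, the cue words, and the toy rows are all hypothetical.

```python
def contexts_with(rows, cue, target="address", window=3):
    """Return (row index, short window around `target`) for rows that contain `cue`."""
    hits = []
    for i, row in enumerate(rows):
        tokens = row["sentence"]
        if cue not in tokens or target not in tokens:
            continue
        pos = tokens.index(target)
        start = max(0, pos - window)
        hits.append((i, " ".join(tokens[start:pos + window + 1])))
    return hits

# Toy rows (made up for illustration, not from address.n.xml)
toy_rows = [
    {"word": "address", "sentence": "type the web address into the browser".split()},
    {"word": "address", "sentence": "send the parcel to my mailing address today".split()},
]
web_hits = contexts_with(toy_rows, "web")
mail_hits = contexts_with(toy_rows, "mailing")
```

On the real data, a call such as `contexts_with(data, "mailing")` would be the analogous usage, under the assumption that the cue appears as its own space-separated token.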

## Setting up data helpers

In [8]:
def tokenize(batch):
    # tokenize the words
    output = tokenizer(batch["sentence"], is_split_into_words=True)

    # find index of first subword token belonging to word of interest
    token_indices = []
    for i, (sentence, word) in enumerate(zip(batch["sentence"], batch["word"])):
        target_id = sentence.index(word)
        token_index = None
        for token_id, word_id in enumerate(output.word_ids(batch_index=i)):
            if word_id == target_id:
                token_index = token_id
                break
        token_indices.append(token_index)

    assert not any(x is None for x in token_indices), "Target token not found in sentence!"
    assert len(token_indices) == len(batch["sentence"]), "Token indices is the wrong length!"

    output["token_indices"] = token_indices
    return output
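
The inner loop in `tokenize` relies on the tokenizer's word-to-subword alignment: each subword position carries the index of the word it came from, and special tokens carry `None`. The following sketch replays that lookup on a hand-written alignment so it runs without downloading a tokenizer. The subword split shown in the comment is a hypothetical example, not actual BERT output, and `first_subword_index` is an illustrative helper.

```python
def first_subword_index(word_ids, target_word_index):
    """Return the position of the first subword belonging to the target word."""
    for token_pos, word_id in enumerate(word_ids):
        if word_id == target_word_index:
            return token_pos
    return None  # target word not present in the alignment

# Hypothetical alignment for the words ["the", "apipa", "address"], imagined as
# [CLS] the ap ##ip ##a address [SEP]
toy_word_ids = [None, 0, 1, 1, 1, 2, None]

address_pos = first_subword_index(toy_word_ids, 2)
apipa_pos = first_subword_index(toy_word_ids, 1)
```

Taking only the first subword is a design choice; averaging over all subwords of the target word would be another reasonable option.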

In [9]:
from torch.nn.utils.rnn import pad_sequence

def collate(items):
    # Converts from a list of dicts to a dict of lists
    batch = {
        k: [item[k] for item in items] for k in items[0]
    }
    # Tokenizes and pads each batch
    outputs = tokenize(batch)
    outputs = {
        k: pad_sequence([torch.tensor(l) for l in v], batch_first=True, padding_value=0) if k != "token_indices" else torch.tensor(v)
        for k, v in outputs.items()
    }
    return outputs
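
As a rough illustration of what the padding step produces, the sketch below pads variable-length id lists to the longest sequence in the batch using plain NumPy. It is a stand-in for `pad_sequence`, not a replacement; the helper `pad_to_longest` and the toy ids are assumptions. Padding with 0 is consistent for all three fields here because 0 is the `[PAD]` id in `bert-base-uncased` and 0 in the attention mask marks positions to ignore.

```python
import numpy as np

def pad_to_longest(sequences, padding_value=0):
    """Right-pad each sequence so that all rows share the longest length."""
    max_len = max(len(seq) for seq in sequences)
    out = np.full((len(sequences), max_len), padding_value, dtype=np.int64)
    for row, seq in enumerate(sequences):
        out[row, :len(seq)] = seq
    return out

# Toy ids (placeholders, not real vocabulary ids)
toy_ids = [[7, 8, 9], [7, 3, 8, 4, 9]]
toy_mask = [[1] * len(seq) for seq in toy_ids]

padded_ids = pad_to_longest(toy_ids)
padded_mask = pad_to_longest(toy_mask)
```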

## Running BERT and extracting token representations

In [10]:
bert_model = BertModel.from_pretrained("bert-base-uncased").to("cuda")


In [11]:
from torch.utils.data import DataLoader
from tqdm import tqdm

In [12]:
def get_token_embeddings(data, model, batch_size=128):
    embeddings = []
    model.eval()
    with torch.no_grad():
        inference_loader = DataLoader(data, batch_size=batch_size, shuffle=False, collate_fn=collate)
        for batch in tqdm(inference_loader):
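            # use the `model` argument (not the global) and follow its device
            output = model(
                input_ids=batch["input_ids"].to(model.device),
                attention_mask=batch["attention_mask"].to(model.device),
                token_type_ids=batch["token_type_ids"].to(model.device),
            )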
            # extract the token representation from the last hidden state
            batch_reps = output.last_hidden_state[range(len(batch["token_indices"])), batch["token_indices"], :]
            embeddings.append(batch_reps.detach().cpu())
    embeddings = torch.concat(embeddings, dim=0)
    return embeddings

In [13]:
embeddings = get_token_embeddings(data, bert_model)

100%|██████████| 55/55 [01:26<00:00,  1.58s/it]
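
The extraction line in `get_token_embeddings` uses paired integer indexing: for each item in the batch, it picks the hidden vector at that item's target position. A small NumPy sketch of the same indexing pattern is shown here, with made-up shapes and values, checked against an explicit loop.

```python
import numpy as np

batch_size, seq_len, hidden = 3, 4, 2
# Toy "last hidden state": values are just a running counter
toy_hidden = np.arange(batch_size * seq_len * hidden).reshape(batch_size, seq_len, hidden)
# One target position per batch item
toy_token_indices = np.array([1, 3, 0])

# Pair row b with position toy_token_indices[b] and keep the full hidden vector
picked = toy_hidden[np.arange(batch_size), toy_token_indices, :]

# Equivalent explicit loop, for comparison
looped = np.stack([toy_hidden[b, t] for b, t in enumerate(toy_token_indices)])
```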


## Clustering with K-Means

### Question 2
We will begin by clustering with K-Means. **Write the code to cluster the embeddings with $k=5$.** Use `random_state=0` to ensure consistency. Use `diagnose_clustering` to examine the cluster outputs.

In [14]:
import numpy as np

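def cosine_similarity(one, two):
    # convert to NumPy first so that torch tensors are not passed to np.dot
    one, two = np.asarray(one), np.asarray(two)
    return np.dot(one, two) / (np.sqrt(np.dot(one, one)) * np.sqrt(np.dot(two, two)))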


def diagnose_clustering(clustering):
    # For each cluster, print out the n documents closest to the cluster center
    # To support agglomerative clustering, we calculate the cluster center post-hoc
    clusters = {}
    for idx, label in enumerate(clustering.labels_):
        if label not in clusters:
            clusters[label] = []
        clusters[label].append(idx)

    for label in clusters:
        sims = {}
        cluster_vecs = embeddings[clusters[label]]
        normalized = cluster_vecs / torch.linalg.norm(cluster_vecs, dim=1, keepdims=True)
        cluster_center = normalized.mean(dim=0)
        for idx in clusters[label]:
            sim = cosine_similarity(cluster_center, embeddings[idx])
            sims[idx] = sim
        for k, v in sorted(sims.items(), key=lambda item: item[1], reverse=True)[:5]:
            print(k,"%.3f" % v, " ".join(data[k]["sentence"]))

        print()
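
The ranking inside `diagnose_clustering` can be summarized as: normalize the cluster's vectors, average them to get a post-hoc center, then sort members by cosine similarity to that center. Below is a compact NumPy sketch of that idea on toy vectors. The helper `rank_by_centroid_similarity` is hypothetical and only approximates the helper above.

```python
import numpy as np

def rank_by_centroid_similarity(vectors):
    """Return member indices sorted by cosine similarity to the mean unit vector."""
    vectors = np.asarray(vectors, dtype=float)
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    center = unit.mean(axis=0)
    # cosine similarity is scale-invariant, so unit vectors give the same ranking
    sims = unit @ center / np.linalg.norm(center)
    order = np.argsort(-sims, kind="stable")
    return order, sims[order]

# Two vectors pointing the same way and one orthogonal outlier
toy_cluster = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]])
order, sims = rank_by_centroid_similarity(toy_cluster)
```

In this toy cluster the outlier ranks last, which is the behavior that makes the top five sentences per cluster a reasonable summary of what the cluster captured.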


In [15]:
from sklearn.cluster import KMeans, AgglomerativeClustering

# Cluster the BERT embeddings using K-Means
kmeans_clusters = KMeans(n_clusters=5, random_state=0)
kmeans_clusters.fit(embeddings)

# Inspect the sentences closest to each cluster center
diagnose_clustering(kmeans_clusters)




2013 0.927 you will also be able to see an apipa set ip address by opening a command prompt and typing ipconfig /all. this time again you will see that the dhcp setting is enabled even though the address has not been issued by a dhcp server. if you look at the example below , which shows the ipconfig /all output for an apipa set ip address , you can see that there is no line which states which host is the dhcp server. as the address is not manually set , which we can tell because autoconfiguration is enabled , we know that the address has been set by apipa . 
6475 0.925 make sure a static ip address has been configured for the primary server and data vault computer ( s ) . a computer using dhcp changes ip addresses every time it connects to the network , and the client may not be able to find it . 
4815 0.925 [ 0073 ] the client-side communicating section 211 is a given communicating means connected to a network 260 and to be used for communication with a client. a unique fixed address

## Clustering with Agglomerative

### Question 3

K-Means operates in the Euclidean metric space, but we generally use cosine similarity when assessing the similarity of token embeddings. **Use agglomerative clustering with the `cosine` metric, `average` linkage, and $k=5$ clusters.** Again, examine the outputs with `diagnose_clustering`.

**In a few sentences,** make some qualitative comparisons between how well these methods work at inducing word senses. Do the clusters surface coherent senses? Are they distinct? Do you find overlaps in senses between clusters?
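
As background for the Euclidean versus cosine distinction: for unit-length vectors, squared Euclidean distance and cosine similarity are tied by $\|a - b\|^2 = 2 - 2\cos(a, b)$. The sketch below checks this identity numerically on random vectors. One consequence, offered as an aside rather than as part of the assignment, is that running K-Means on L2-normalized embeddings would be one way to bring it closer to a cosine-based notion of similarity.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=8), rng.normal(size=8)

# L2-normalize both vectors
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

cos = float(a_unit @ b_unit)
sq_dist = float(np.sum((a_unit - b_unit) ** 2))

# For unit vectors: ||a - b||^2 = 2 - 2 cos(a, b)
print(sq_dist, 2 - 2 * cos)
```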

In [20]:
from sklearn.cluster import AgglomerativeClustering

agglom = AgglomerativeClustering(n_clusters=5, metric="cosine", linkage="average")
agglom.fit(embeddings)


In [21]:
diagnose_clustering(agglom)



6379 0.908 every computer on the internet has its own ip address that uniquely identifies it ( much like a street address for a residence ) . these ip addresses are formatted as x.x.x.x , where each x is a number between 0 and 255 , and connecting to another computer involves providing its address to your own computer , so that it knows the exact destination it 's attempting to reach. this means that if you 're trying to connect to another computer or server over the internet , you need to make sure you have the correct ip address . 
2340 0.907 the domain name server recognized the network address or name , but no answer was returned. this can happen if the network host has only ipv4 addresses and a request has been made for ipv6 information only , or vice versa . 
3902 0.907 addressing office. this notice will contain instructions as to when your new address takes affect and how to display your new address on 
1975 0.907 your ip address is assigned to you by your internet service prov

Agglomerative clustering gives slightly clearer sense groups than K-Means because it uses cosine similarity, which is a better way to compare BERT embeddings.

I noticed clearer separation between digital senses (such as email address and web address) and physical ones (mailing address). However, some clusters still overlap: for example, web and email addresses sometimes appear together.

Overall, agglomerative clustering captures semantic similarity better than K-Means but is slower and still not perfectly distinct.

## Evaluating

In the absence of any ground truth, we will use the silhouette score to perform intrinsic evaluation of clusters. You can read more about the silhouette score in the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html).
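
To show what the score measures, here is a minimal NumPy sketch of the silhouette computation under cosine distance. For each point, $a$ is the mean distance to the other members of its own cluster, $b$ is the smallest mean distance to any other cluster, and the point's score is $(b - a) / \max(a, b)$. The function `cosine_silhouette` is an illustrative reimplementation and assumes at least two clusters; the scikit-learn function remains the one to use in practice.

```python
import numpy as np

def cosine_silhouette(X, labels):
    """Mean silhouette coefficient using cosine distance (1 - cosine similarity)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    dist = 1.0 - unit @ unit.T
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # singleton clusters get a score of 0
        # the self-distance is ~0, so divide by n_same - 1 to exclude the point itself
        a = dist[i, same].sum() / (n_same - 1)
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

# Toy data: two tight directions in 2D
toy_X = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]])
good = cosine_silhouette(toy_X, [0, 0, 1, 1])  # labels follow the two directions
bad = cosine_silhouette(toy_X, [0, 1, 0, 1])   # labels cut across them
```

A labeling that follows the two directions scores close to 1, while one that cuts across them scores below 0.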

In [22]:
from sklearn.metrics import silhouette_score

### Question 4

**Compute the silhouette scores** for the agglomerative clustering output and the K-Means output from before, with the `cosine` metric. Compare them; do they align with your qualitative judgment?

In [23]:
from sklearn.metrics import silhouette_score

# Compute silhouette scores using cosine metric
kmeans_score = silhouette_score(embeddings, kmeans_clusters.labels_, metric="cosine")
agglom_score = silhouette_score(embeddings, agglom.labels_, metric="cosine")

print("K-Means silhouette score:", kmeans_score)
print("Agglomerative silhouette score:", agglom_score)

K-Means silhouette score: 0.13237737
Agglomerative silhouette score: 0.36092055


The agglomerative clustering had a silhouette score of 0.36, which is much higher than the K-Means score of 0.13.

This matches what I saw earlier: the agglomerative clusters looked more meaningful and better separated. K-Means mixed different senses together, while agglomerative clustering grouped clearer ones such as email, web, and postal addresses.

### Question 5

For both K-Means and agglomerative clustering, **plot the silhouette scores** for a range of cluster numbers $k \in \{2, \ldots, 10\}$. **In a few sentences,** do these plots align with your expectations? What optimal number of clusters do they suggest, and does that align with your existing understanding of the word "address"? Finally, what might be some reasons they do or don't align with your expectations, and what does that tell us about evaluating unsupervised models more generally?

In [None]:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score

k_values = range(2, 11)
kmeans_scores = []
agglom_scores = []

for k in k_values:
    # K-Means clustering
    kmeans = KMeans(n_clusters=k, random_state=0)
    kmeans.fit(embeddings)
    kmeans_scores.append(silhouette_score(embeddings, kmeans.labels_, metric="cosine"))

    # Agglomerative clustering
    agglom = AgglomerativeClustering(n_clusters=k, metric="cosine", linkage="average")
    agglom.fit(embeddings)
    agglom_scores.append(silhouette_score(embeddings, agglom.labels_, metric="cosine"))

# Plot
plt.figure(figsize=(8,5))
plt.plot(k_values, kmeans_scores, marker='o', label='K-Means')
plt.plot(k_values, agglom_scores, marker='s', label='Agglomerative')
plt.title("Silhouette Scores for Different Numbers of Clusters (k)")
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Silhouette Score (cosine)")
plt.legend()
plt.grid(True)
plt.show()
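
When interpreting these curves, it may help to look at cluster sizes alongside the scores. A partition with one very large cluster and a few tiny ones can receive a high silhouette score without separating senses in a useful way. Whether that happens for this data is not established by the outputs above; the sketch below is only a hypothetical check, with `cluster_size_report` and the toy labels made up for illustration.

```python
import numpy as np

def cluster_size_report(labels):
    """Return per-cluster counts and the share of points in the largest cluster."""
    labels = np.asarray(labels)
    ids, counts = np.unique(labels, return_counts=True)
    largest_share = counts.max() / counts.sum()
    return dict(zip(ids.tolist(), counts.tolist())), float(largest_share)

# Toy labels: one dominant cluster and two singletons
toy_labels = [0, 0, 0, 0, 0, 0, 0, 1, 2, 0]
sizes, share = cluster_size_report(toy_labels)
print(sizes, share)
```

Applied to a fitted model, the analogous call would be `cluster_size_report(agglom.labels_)`.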
