[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dbamman/anlp25/blob/main/8.clustering/HW8_Word_Sense_Induction.ipynb)


# HW8: Unsupervised Word Sense Induction

The same _word type_ can have different _senses_, or meanings. For example, "class" could refer to a category ("I have never flown first class.") or a course ("ANLP is my favorite class."). Because language changes quickly, a new word sense can come into common usage before dictionaries are updated to include it. In this setting, we might be interested in _inducing_ word senses: how can we surface the different senses of a word without knowing its definitions _a priori_?

In this homework, you will be working on this classic NLP task by clustering BERT token embeddings.

In [None]:
from transformers import AutoTokenizer, BertModel
import pandas as pd
import torch
from torch import nn

In [None]:
# Make sure a GPU is available (this should return True)
torch.cuda.is_available()

In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

In [None]:
import xml.etree.ElementTree as ET

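def parse_semeval_file(filepath):
    """Parse a SemEval WSI file into a list of {"word", "sentence"} records."""
    tree = ET.parse(filepath)
    root = tree.getroot()
    # The target word is the part of the root tag before the first "."
    word = root.tag.strip().split(".")[0]

    data = []

    for sentence in root:
        # Lowercase and split on spaces so tokens can be compared to the target word
        split = sentence.text.lower().split(" ")
        # Skip sentences that lack the exact target form (e.g. only "addresses")
        if word not in split:
            continue
        data.append({"word": word, "sentence": split})

    return data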

In [None]:
# download data file
!wget https://raw.githubusercontent.com/dbamman/anlp25/main/data/wsi/address.n.xml

## Exploring the data

### Question 1

Load the data for "address" (`address.n.xml`). Look through the examples and **identify two sentences that use "address" with different senses**. Report each sentence and what definition of "address" is being used.
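
One way to look through the examples is to print a handful of sentences with the target word marked. The sketch below is only an illustration: `show_examples` is a hypothetical helper, not part of the assignment code, and `toy_data` holds invented records (built from the "class" sentences in the introduction) in the same `{"word", "sentence"}` shape that `parse_semeval_file` returns.

```python
def show_examples(records, limit=5):
    """Return up to `limit` sentences with the target word wrapped in ** **."""
    lines = []
    for i, record in enumerate(records[:limit]):
        marked = [
            f"**{tok}**" if tok == record["word"] else tok
            for tok in record["sentence"]
        ]
        lines.append(f"{i}: {' '.join(marked)}")
    return lines


# Invented records for illustration only (not taken from address.n.xml)
toy_data = [
    {"word": "class", "sentence": "i have never flown first class".split(" ")},
    {"word": "class", "sentence": "anlp is my favorite class".split(" ")},
]

for line in show_examples(toy_data):
    print(line)
```

Calling the same helper on the parsed records, once they are loaded, would show the real sentences instead.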

In [None]:
data = parse_semeval_file("address.n.xml")

## Setting up data helpers

In [None]:
def tokenize(batch):
    # tokenize the words
    output = tokenizer(batch["sentence"], is_split_into_words=True)

    # find index of first subword token belonging to word of interest
    token_indices = []
    for i, (sentence, word) in enumerate(zip(batch["sentence"], batch["word"])):
        target_id = sentence.index(word)
        token_index = None
        for token_id, word_id in enumerate(output.word_ids(batch_index=i)):
            if word_id == target_id:
                token_index = token_id
                break
        token_indices.append(token_index)

    assert not any(x is None for x in token_indices), "Target token not found in sentence!"
    assert len(token_indices) == len(batch["sentence"]), "Token indices is the wrong length!"

    output["token_indices"] = token_indices
    return output

In [None]:
from torch.nn.utils.rnn import pad_sequence

def collate(items):
    # Converts from a list of dicts to a dict of lists
    batch = {
        k: [item[k] for item in items] for k in items[0]
    }
    # Tokenizes and pads each batch
    outputs = tokenize(batch)
    outputs = {
        k: pad_sequence([torch.tensor(l) for l in v], batch_first=True, padding_value=0) if k != "token_indices" else torch.tensor(v)
        for k, v in outputs.items()
    }
    return outputs

## Running BERT and extracting token representations
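
The extraction step in this section selects, for each sentence in a batch, the hidden vector at that sentence's target-token position. Here is a minimal NumPy sketch of that indexing pattern, using an invented array in place of BERT's `last_hidden_state`; the shapes and values are assumptions chosen for readability, and PyTorch tensors support the same integer-array indexing.

```python
import numpy as np

# Stand-in for last_hidden_state: (batch=3, sequence length=4, hidden size=2)
hidden = np.arange(3 * 4 * 2).reshape(3, 4, 2)
# Position of the target token within each of the 3 sentences
token_indices = np.array([1, 0, 3])

# Pair row i with token_indices[i] to pick exactly one vector per sentence
rows = np.arange(hidden.shape[0])
picked = hidden[rows, token_indices, :]

print(picked.shape)  # (3, 2)
print(picked)
```

Indexing with two aligned index arrays returns one row per sentence, whereas slicing with `hidden[:, token_indices, :]` would return every listed position for every sentence.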

In [None]:
bert_model = BertModel.from_pretrained("bert-base-uncased").to("cuda")

In [None]:
from torch.utils.data import DataLoader
from tqdm import tqdm

In [None]:
def get_token_embeddings(data, model, batch_size=128):
    embeddings = []
    model.eval()
    with torch.no_grad():
        inference_loader = DataLoader(data, batch_size=batch_size, shuffle=False, collate_fn=collate)
        for batch in tqdm(inference_loader):
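            # Use the `model` argument (not the global) and move inputs to its device
            output = model(
                input_ids=batch["input_ids"].to(model.device),
                attention_mask=batch["attention_mask"].to(model.device),
                token_type_ids=batch["token_type_ids"].to(model.device)
            )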
            # extract the token representation from the last hidden state
            batch_reps = output.last_hidden_state[range(len(batch["token_indices"])), batch["token_indices"], :]
            embeddings.append(batch_reps.detach().cpu())
    embeddings = torch.concat(embeddings, dim=0)
    return embeddings

In [None]:
embeddings = get_token_embeddings(data, bert_model)

## Clustering with K-Means

### Question 2
We will begin by clustering with K-Means. **Write the code to cluster the embeddings with $k=5$.** Use `random_state=0` to ensure consistency. Use `diagnose_clustering` to examine the cluster outputs.

In [None]:
import numpy as np

def cosine_similarity(one, two):
    return np.dot(one, two) / (np.sqrt(np.dot(one, one)) * np.sqrt(np.dot(two, two)))


def diagnose_clustering(clustering):
    # For each cluster, print the 5 sentences closest to the cluster center.
    # To support agglomerative clustering, we calculate the cluster center post-hoc.
    # Relies on the global `embeddings` and `data` defined above.
    clusters = {}
    for idx, label in enumerate(clustering.labels_):
        if label not in clusters:
            clusters[label] = []
        clusters[label].append(idx)
    
    for label in clusters:
        sims = {}
        cluster_vecs = embeddings[clusters[label]]
        normalized = cluster_vecs / torch.linalg.norm(cluster_vecs, dim=1, keepdims=True)
        cluster_center = normalized.mean(dim=0)
        for idx in clusters[label]:
            sim = cosine_similarity(cluster_center, embeddings[idx])
            sims[idx] = sim
        for k, v in sorted(sims.items(), key=lambda item: item[1], reverse=True)[:5]:
            print(k,"%.3f" % v, " ".join(data[k]["sentence"]))
        
        print()


In [None]:
from sklearn.cluster import KMeans, AgglomerativeClustering

kmeans_clusters =  # FILL ME IN. Should be an instance of sklearn.cluster.KMeans
diagnose_clustering(kmeans_clusters)

## Clustering with Agglomerative Clustering

### Question 3

K-Means operates in the Euclidean metric space, but we generally use cosine similarity when assessing the similarity of token embeddings. **Use agglomerative clustering with the `cosine` metric, `average` linkage, and $k=5$ clusters.** Again, examine the outputs with `diagnose_clustering`.

**In a few sentences,** make some qualitative comparisons between how well these methods work at inducing word senses. Do the clusters surface coherent senses? Are they distinct? Do you find overlaps in senses between clusters?
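
To see why the choice of metric matters, note that Euclidean distance is sensitive to vector length while cosine similarity is not; for unit-length vectors the two are tied by $\lVert a - b \rVert^2 = 2 - 2\cos(a, b)$. The sketch below checks this on invented vectors. The names `cosine`, `unit`, `a`, `b`, and `c` are illustrative and not part of the assignment code.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def unit(u):
    """Scale a vector to unit L2 norm."""
    return u / np.linalg.norm(u)

a = np.array([1.0, 2.0, 3.0])
b = 10 * a                      # same direction as a, but much longer
c = np.array([3.0, -1.0, 0.5])

# Same direction: cosine similarity is 1, yet the Euclidean distance is large
print(cosine(a, b), np.linalg.norm(a - b))

# After L2-normalization, squared Euclidean distance equals 2 - 2 * cosine
lhs = np.linalg.norm(unit(a) - unit(c)) ** 2
rhs = 2 - 2 * cosine(a, c)
print(np.isclose(lhs, rhs))
```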

In [None]:
agglom = # FILL ME IN; this will take longer to run than K-Means

In [None]:
diagnose_clustering(agglom)

## Evaluating

In the absence of any ground truth, we will use the silhouette score to perform intrinsic evaluation of clusters. You can read more about the silhouette score in the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html).
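
For reference, the standard definition can be written as follows, where the symbols are notation introduced for this summary: for a point $i$, let $a(i)$ be its mean distance to the other points in its own cluster and $b(i)$ its mean distance to the points in the nearest other cluster. Then

$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}
$$

The score of a clustering is the mean of $s(i)$ over all points. It ranges from $-1$ to $1$, with higher values indicating tighter, better-separated clusters.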

In [None]:
from sklearn.metrics import silhouette_score

### Question 4

**Compute the silhouette scores** for the agglomerative clustering output and the K-Means output from before, with the `cosine` metric. Compare them; do they align with your qualitative judgment?

### Question 5

For both K-Means and agglomerative clustering, **plot the silhouette scores** for a range of cluster numbers $k \in \{2, \ldots, 10\}$. **In a few sentences,** discuss whether these plots align with your expectations. What optimal number of clusters do they suggest, and does that align with your existing understanding of the word "address"? Finally, what might be some reasons they do or don't align with your expectations, and what does that tell us about evaluating unsupervised models more generally?