<a href="https://colab.research.google.com/github/d-noe/NLP_DH_PSL_Fall2025/blob/main/code/1_bert_training/Tutorial_1_WSD.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial Session 1: Word Sense Disambiguation (WSD) with BERT

![](https://external-preview.redd.it/3On3o8P2JG1enhtJKsbEidkaP_YeHxuRhXR32QHFhkA.png?width=1080&crop=smart&auto=webp&s=78ee696f5971df43fa1cfc925de9293b60d879e3)


The goal of this notebook is to use the representational power of pre-trained (encoder-only) models to explore Word Sense Disambiguation (WSD).

Note that this tutorial does not cover the whole WSD pipeline (there is no classification or sense-retrieval step). Instead, it illustrates one part of it using BERT-like models.


In [None]:
#@title What we are aiming for:


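# `display` and `HTML` live in IPython.display (importing them from IPython.core.display is deprecated)
from IPython.display import display, HTML
import urllib.request

# Fetch the preview HTML from the course repository
url = "https://raw.githubusercontent.com/d-noe/NLP_DH_PSL_Fall2025/refs/heads/main/props/digit.html"
with urllib.request.urlopen(url) as fp:
    mystr = fp.read().decode("utf8")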

display(HTML(mystr))

## Set-up

Install and import necessary Python libraries and modules.

This notebook relies mainly on the [`transformers`](https://huggingface.co/docs/transformers/installation) Python library. We will also use [`scikit-learn`](https://scikit-learn.org/stable/install.html) for dimensionality reduction and [`altair`](https://altair-viz.github.io/) for visualisation.

In [None]:
! pip install transformers

In [None]:
# For BERT: we use DistilBERT, a smaller distilled model that retains most of BERT's performance
from transformers import DistilBertTokenizerFast, DistilBertModel
import torch

# For data manipulation and analysis
import pandas as pd
pd.options.display.max_colwidth = 200
import numpy as np
from sklearn.decomposition import PCA

# For interactive data visualization
import altair as alt

In [None]:
# Is a GPU available?
# On Colab, you can change the runtime type to access a GPU,
# but it is not really needed here: the computations are fast enough on CPU.
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Load Model

In [None]:
model = DistilBertModel.from_pretrained(
    "distilbert-base-uncased",
    output_attentions=True,
).to(DEVICE)  # plain .to() works without `accelerate`, which `device_map` may require
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

In [None]:
model

## Data Loading

The dataset used for this tutorial is based on the [CoarseWSD-20 dataset](https://github.com/danlou/bert-disambiguation/tree/master) introduced in *Analysis and Evaluation of Language Models for Word Sense Disambiguation* [(Loureiro et al., 2021)](https://arxiv.org/abs/2008.11608).

The original dataset consists of excerpts from Wikipedia articles. It was developed to train and evaluate word sense disambiguation models on 20 expert-selected ambiguous words (apple, club, bow, bank, crane, square, chair, java, bass, seal, pitcher, arm, mole, pound, deck, trunk, spring, hood, yard, digit).

For this tutorial, it is provided as a `.csv` file, which can be conveniently loaded as a `pandas.DataFrame`. The file contains the following columns:
- `text` [str]: uncased sentence containing an instance of a polysemous word (example: *'some displays can show only digit or alphanumeric characters .'*)
- `target` [str]: the polysemous word to be considered in `text` (example: *'digit'*)
- `label` [int]: the label of the sense associated with the target (example: *0*)
- `label_sense` [str]: a human-readable label that disambiguates the target (example: *'digit_numerical_digit'*)
- `split` [str]: original data split, i.e. train or test (example: *'test'*).

In [None]:
!wget https://raw.githubusercontent.com/d-noe/NLP_DH_PSL_Fall2025/refs/heads/main/code/scripts/helpers.py
from helpers import load_csv_from_github

# Load the data
wsd_df = load_csv_from_github("data/word_sense/CoarseWSD-20.csv")
print(f"The DataFrame contains {len(wsd_df)} rows.")

# Display 5 random rows
wsd_df.sample(5)

In [None]:
# Number of examples ('count') and number of distinct senses ('nunique') per target
wsd_df[["target", "label"]].groupby(["target"]).agg(['count', 'nunique'])

For this experiment, let's focus on a sample of the original dataset. As we are interested in how context shapes the vector representations produced by the model, we will select a single `focus_word` and randomly sample `n_samples` sentences containing it (or fewer if fewer are available).

In [None]:
focus_word = "mole"
n_samples = 1000

# Select rows associated with the focus word
df_focus = wsd_df[wsd_df["target"]==focus_word]
# Randomly sample from the DataFrame (if enough data)
if len(df_focus) > n_samples:
  df_focus = df_focus.sample(n_samples)
else:
  print(f"n_samples={n_samples} is larger than the number of examples available for '{focus_word}' ({len(df_focus)}): using all of them.")

# Print some information about our dataset
unique_senses = df_focus["label_sense"].unique()
print(f"Total number of senses associated with '{focus_word}': {len(unique_senses)}")
for sense in unique_senses:
  print(f"\t{sense}")
  print(f"\t\tProportion of the data: {100*len(df_focus[df_focus['label_sense']==sense])/len(df_focus):.0f}%.")
  print(f"\t\tExample: {df_focus[df_focus['label_sense']==sense].iloc[0]['text']}")

## Contextualized word-embeddings

We will process the examples in batches: this lets the model handle several examples at once and speeds up the computation.

Then, in a loop over the batches, we will:
- tokenize the texts
- run the tokenized texts through the model
- store:
  - the token IDs of the tokenized texts (this will allow us to retrieve the position of the word of interest)
  - the embeddings produced by the model
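
The batching step on its own can be sketched without the model. The minimal example below uses a hypothetical helper, `iter_batches`, and made-up sentences (neither is part of the tutorial code) to show how stepping through a list by `batch_size` splits it, and that the last batch may be smaller than the others.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Made-up data, for illustration only
toy_texts = [f"sentence {i}" for i in range(5)]

batches = list(iter_batches(toy_texts, batch_size=2))
print([len(b) for b in batches])  # [2, 2, 1]
```

The loop used in this notebook follows the same slicing pattern, with the tokenizer and model call placed inside it.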

In [None]:
texts = list(df_focus['text'])


In [None]:
from tqdm import tqdm

batch_size = 16 # adjust for memory usage
max_length = 256  # adjust as needed

tokenized_texts_ids = []
all_last_hidden_states = []

for i in tqdm(range(0, len(texts), batch_size)):
    batch_texts = texts[i:i+batch_size]
    tokenized = tokenizer(
        batch_texts,
        truncation=True,
        padding="max_length",  # pads every text to max_length, so all batches share the same shape
        max_length=max_length,
        return_tensors="pt"
    ).to(DEVICE)

    with torch.no_grad():
        outputs = model(**tokenized)

    all_last_hidden_states.append(outputs.last_hidden_state.cpu())
    tokenized_texts_ids.append(tokenized["input_ids"].cpu())

# All batches have the same sequence length (max_length), so they can be concatenated
last_hidden_state = torch.cat(all_last_hidden_states, dim=0)
tokenized_texts_ids = torch.cat(tokenized_texts_ids, dim=0)
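
Padding to a fixed length matters for the concatenation step: arrays can only be concatenated along the first axis if all their other dimensions match. The sketch below illustrates this with NumPy arrays standing in for the per-batch model outputs. The shapes and the `pad_to_length` helper are made up for the example, and zero-padding outputs after the fact is shown only to illustrate shapes, not as a substitute for padding the inputs.

```python
import numpy as np

def pad_to_length(batch, length):
    """Zero-pad a (n_texts, seq_len, dim) array along the sequence axis."""
    extra = length - batch.shape[1]
    return np.pad(batch, ((0, 0), (0, extra), (0, 0)))

# Two stand-in batches whose sequence lengths differ (5 vs 7 tokens)
batch_a = np.ones((2, 5, 4))
batch_b = np.ones((3, 7, 4))

try:
    np.concatenate([batch_a, batch_b], axis=0)
    mismatch_error = False
except ValueError:
    mismatch_error = True
print(mismatch_error)  # True: the sequence lengths do not match

# Once both batches share the same sequence length, concatenation works
stacked = np.concatenate([pad_to_length(batch_a, 7), batch_b], axis=0)
print(stacked.shape)  # (5, 7, 4)
```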

In [None]:
last_hidden_state.shape # N_samples x Sequence Length x Embedding Dimension

### Retrieving word vector representations

Then, we will retrieve the vectors associated with the `focus_word` we chose in each of the samples.

In [None]:
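# Quick heuristic to retrieve the vector of the <focus_word> token in each text.
# Limitations: it only matches the exact token id of <focus_word>, and only its
# first word piece if the tokenizer splits <focus_word> into several pieces.

# 0. Isolate the token id from the model vocabulary (index 0 is the [CLS] token)
word_id = tokenizer(focus_word)["input_ids"][1]

# 1. Retrieve the first position of the focus token in each tokenized text
is_focus = (tokenized_texts_ids == word_id).numpy()
word_token_pids = is_focus.argmax(axis=1)

# argmax returns 0 (the [CLS] position) when the token is absent: count those cases
n_missing = int((~is_focus.any(axis=1)).sum())
print(f"Texts where the token '{focus_word}' was not found: {n_missing}")

# 2. Extract vectors at these positions
word_vectors = np.array([
    o[p_id].numpy()
    for o, p_id in zip(last_hidden_state, word_token_pids)
])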

In [None]:
word_vectors.shape # N_samples x Embedding Dimensions
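
One caveat of an `argmax`-based lookup is worth spelling out: on a boolean array that contains no `True` value, `np.argmax` returns 0, which here is the position of the `[CLS]` token rather than of the focus word. This can happen if the word is cut off by truncation or appears in a form that maps to a different token. The sketch below uses made-up token ids and a hypothetical helper, `first_position`, to show the behaviour and one possible way to make a missing token explicit.

```python
import numpy as np

def first_position(token_ids, token_id):
    """Return the index of the first occurrence of `token_id`, or -1 if absent."""
    matches = np.asarray(token_ids) == token_id
    return int(matches.argmax()) if matches.any() else -1

# Made-up token ids: 101 and 102 stand in for [CLS] and [SEP], 0 for padding
with_target = [101, 7, 42, 9, 102, 0]
without_target = [101, 7, 8, 9, 102, 0]

# Plain argmax silently points at position 0 when the token is absent
print(int(np.argmax(np.array(without_target) == 42)))  # 0

print(first_position(with_target, 42))     # 2
print(first_position(without_target, 42))  # -1
```

Examples that return -1 could then be inspected or dropped before plotting, so that `[CLS]` vectors are not mistaken for word vectors.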

We now have the **embedding** of the focus word in each of the examples!

Let's visualise them (in 2D after dimensionality reduction, rather than in the original 768 dimensions) to explore the semantic space produced by the model!

### PCA Visualisation

In [None]:
pca = PCA(n_components=2)

# Just for convenience (+altair integration): store the data in a DataFrame
df_plot = df_focus.copy()

reduced_vectors = pca.fit_transform(word_vectors)

df_plot["PCA_1"] = reduced_vectors[:,0]
df_plot["PCA_2"] = reduced_vectors[:,1]
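
A 2D projection discards most of the original dimensions, so it can help to check how much variance the two components retain. scikit-learn exposes this as `explained_variance_ratio_` on the fitted `PCA` object. The sketch below computes the same quantity by hand with NumPy on random stand-in data (not the actual embeddings), purely to illustrate what PCA does under the hood.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for `word_vectors`: 100 samples x 8 dimensions (random, illustrative only).
# The first two dimensions are given a larger spread than the others.
scales = np.array([5.0, 3.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])
X = rng.normal(size=(100, 8)) * scales

# PCA by hand: centre the data, then take its singular value decomposition
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Variance along each principal component, and its share of the total
explained_variance = S**2 / (len(X) - 1)
explained_ratio = explained_variance / explained_variance.sum()

# 2D coordinates (equal to a PCA projection up to the sign of each axis)
projected = X_centered @ Vt[:2].T
print(projected.shape)  # (100, 2)
print(f"Variance kept by 2 components: {explained_ratio[:2].sum():.1%}")
```

If the first two components keep only a small share of the variance, senses that overlap in the 2D plot may still be well separated in the full embedding space.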


In [None]:
chart = (
    alt.Chart(df_plot, title=f"Word PCA Distribution: {focus_word}")
    .mark_circle(size=200)
    .encode(
        x=alt.X("PCA_1", scale=alt.Scale(zero=False)),
        y="PCA_2",
        color="label_sense",
        tooltip=["target", "label_sense", "text"],
    )
    .interactive()
    .properties(width=500, height=500)
)
chart