<a href="https://colab.research.google.com/github/dterlage/Language-and-AI/blob/main/compling26_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Do LLMs know the difference between a pet chicken and a roast chicken?

## Word sense disambiguation in computational models and humans


In human language, words do not always have a fixed meaning. The most striking example is homonymous words: words that have the same form but very different meanings. For instance, the word "bank" means different things in "I went to the bank to get some money" and "At the river bank, I met my old friend". Polysemous words have different, yet related, meanings: for example, "chicken" refers to the same 'entity' in "My pet chicken is lovely" and "I am having roast chicken for dinner", but it has very different meanings in these two contexts. In general, context can modulate almost any word's meaning. This poses a challenge in computational linguistics, as we need to find a way to differentiate among meanings like humans do. Many studies, resources, and models have been put forward to help with this challenge.

In this assignment, you are going to focus on [Trott and Bergen's (2021)](https://aclanthology.org/2021.acl-long.550/) RAW-C dataset: you are going to conduct a number of explorations with this dataset and partially replicate their research by the end of the assignment. In short, the authors explore how good LLMs are at capturing same/different meanings of words across contexts by comparing model similarities to human judgements. To better understand the idea and the research, start by reading the paper.

This assignment entails a series of (interconnected) tasks (altogether worth 85 points):

* **Task 1**. Compute contextual word embeddings at different layers from Trott & Bergen's dataset. Here, each word is found in 4 sentences: 2 with one meaning, 2 with another meaning.
* **Task 2**. Compute sense embeddings for words in Trott & Bergen's dataset using WordNet, so you have an embedding for each definition of the word.
* **Task 3**. Compute the similarity between the contextual word embeddings of the homonyms at different layers and their sense embeddings; explore the relationship between homonyms and dominant senses quantitatively and qualitatively.
* **Task 4**. Replicate part of Trott & Bergen's work by computing similarities across sentences with same/different meanings at the different layers and correlating them with human similarities; visualise the results and reflect on them.

In order to better understand the assignment, we recommend going through it all before starting so that it is clear how each part is connected to the next (which will help you make decisions about data structures, for instance).

# Task 1: Compute contextual word embeddings for homonyms [20 points]

## Task 1.1: read, explore and extract the necessary data [5 points]

First, you will have to (fork and) clone the GitHub repository that stores the data you'll need. This can be found here: https://github.com/sashakenjeeva/raw-c . The repo also includes a README with a description of the original files in the repository, as well as some notes relevant for this assignment specifically.

In [None]:
#your code here (you can use as many cells as necessary/you prefer)

Make sure you mount the drive now so that you have access to the folder (think about setting the working directory in a way that is convenient).

In [1]:
# mount the drive here
!git clone https://github.com/DanielleGroenewegen/ChickenHotWings.git


Cloning into 'ChickenHotWings'...
remote: Enumerating objects: 112, done.
remote: Counting objects: 100% (112/112), done.
remote: Compressing objects: 100% (76/76), done.
remote: Total 112 (delta 31), reused 89 (delta 22), pack-reused 0 (from 0)
Receiving objects: 100% (112/112), 7.65 MiB | 19.74 MiB/s, done.
Resolving deltas: 100% (31/31), done.


In [2]:
import pandas as pd
a = pd.read_csv("/content/ChickenHotWings/data/processed/normed_critical.csv")

In [3]:
print(a.head())

      subject responses       word   same version_with_order  item Class  \
0  p7cnykpv2k  {"Q0":4}  breakfast  False          M1_a_M2_a     2     N   
1  p7cnykpv2k  {"Q0":0}       clip  False          M2_a_M1_a    98     N   
2  p7cnykpv2k  {"Q0":4}      scene   True          M1_b_M1_a    34     N   
3  p7cnykpv2k  {"Q0":3}      cross   True          M1_b_M1_a    82     V   
4  p7cnykpv2k  {"Q0":3}       file   True          M2_a_M2_b    66     N   

      string ambiguity_type  relatedness    version  
0  breakfast       Polysemy            4  M1_a_M2_a  
1       clip       Homonymy            0  M1_a_M2_a  
2      scene       Polysemy            4  M1_a_M1_b  
3    crossed       Polysemy            3  M1_a_M1_b  
4       file       Homonymy            3  M2_a_M2_b  


Now, you will have to read the data and organise it in a structure that works for the next parts of the assignment.

Read and explore the dataframe to see its structure (print part of it). What we need from it are the homonyms, both in the form in which they appear in the sentence (the word form, stored in the `string` column, e.g. "crossed") and in their base form (the lemma, stored in the `word` column, e.g. "cross"), together with their corresponding sentences with different meanings (M1_a and M1_b share one meaning; M2_a and M2_b share the other). We will only need the stimuli that are in the final RAW-C dataset, as this is what we'll replicate at the end.

You can decide which data structure to use, but make sure that all these pieces of information are there (the word, the string, the meaning id, and the corresponding sentences) and easy to retrieve. Show your data at the end, as well as how many stimuli you end up with.
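One possible layout is a nested dictionary keyed by word and then by meaning id. The sketch below builds it from a tiny hand-made dataframe; the column names `word`, `string`, `version` and `sentence`, the helper `build_stimuli`, and all but the first and third example sentences are assumptions made for illustration and may not match the files in the repository. It also assumes one `string` per word.

```python
import pandas as pd

# Toy stand-in for the stimuli file; column names and sentences are hypothetical.
toy = pd.DataFrame({
    "word": ["bank"] * 4,
    "string": ["bank"] * 4,
    "version": ["M1_a", "M1_b", "M2_a", "M2_b"],
    "sentence": [
        "I went to the bank to get some money.",
        "The bank approved my loan.",
        "At the river bank, I met my old friend.",
        "We sat on the bank of the stream.",
    ],
})


def build_stimuli(df):
    """Return {word: {"string": ..., "sentences": {meaning_id: sentence}}}."""
    stimuli = {}
    for word, group in df.groupby("word"):
        stimuli[word] = {
            # Assumes the inflected form is the same in all four sentences.
            "string": group["string"].iloc[0],
            "sentences": dict(zip(group["version"], group["sentence"])),
        }
    return stimuli


stimuli = build_stimuli(toy)
n_sentences = sum(len(entry["sentences"]) for entry in stimuli.values())
print(stimuli["bank"]["sentences"]["M2_a"])
print(len(stimuli), "word(s),", n_sentences, "sentence(s)")
```

A dictionary like this makes lookups such as `stimuli["bank"]` cheap in later tasks, but a long-format dataframe would work just as well.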

In [None]:
# read the data here
# remember to print the data structure you produce and the number of stimuli it contains!

## Task 1.2: Compute the contextualised word embeddings [15 points]


Now that you have the homonyms and their corresponding sentences, we will need to compute word embeddings for each of them. For this we will use the BERT base model, in its uncased version.

That is, for each homonym, you will have to compute four embeddings: one for the homonym in M1_a, one in M1_b, one in M2_a, one in M2_b. However, we also want to look into different layers of the BERT model to see which one captures the homonym's meaning best: you want to calculate embeddings at the static layer and at layers 4, 8, 12.

We will use the package psycho-embeddings (you will use it in class), which allows us to specify the target words we want embeddings for, the sentences they occur in, and the layers to extract, among other things. Make sure to read the documentation of the package so that you know the meaning of the arguments and which ones will be useful to you.

First of all, install the psycho-embeddings package below.

In [None]:
# install the psycho-embeddings package here

Now, import the relevant module/function from psycho-embeddings and load the required BERT model.

In [None]:
#your code here

Now, test that everything works correctly by computing an embedding for the word "assignment" in the sentence "I am having so much fun with this assignment!", at the static layer and at layers 4, 8 and 12 (hint: think of tokenisation and how the embedder deals with that).

In [None]:
#your code here

The next step is to calculate embeddings for the homonyms and their sentences that we got from the RAW-C dataset.

Make sure that your final output includes the word, the meaning id (M1_a, etc), the corresponding sentence and the embeddings at static layer and layers 4, 8, 12. You should maximally optimise this process by calculating in batches (again, check psycho-embeddings documentation), but keep in mind this might still take a while. First test your pipeline with a small number of inputs, and only run the full scale embedding extraction once you're positive the code works as expected.
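The "small pilot first, then batches" workflow can be sketched independently of the embedding library. In the sketch below, `batched` and `fake_embed` are hypothetical helpers: `fake_embed` only stands in for the real embedder call, and the package may well offer its own batching argument, in which case that should be preferred.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def fake_embed(words, sentences):
    # Placeholder for the real embedder call; returns one dummy value per input.
    return [len(sentence) for sentence in sentences]


words = ["bank", "bank", "bat", "bat", "lamb"]
sentences = ["s1", "s2 longer", "s3", "s4", "s5"]
pairs = list(zip(words, sentences))

# Pilot run on the first few inputs before launching the full extraction.
pilot = fake_embed(*zip(*pairs[:2]))

results = []
for batch in batched(pairs, batch_size=2):
    batch_words, batch_sentences = zip(*batch)
    results.extend(fake_embed(batch_words, batch_sentences))

print(len(pilot), len(results))
```

Keeping words and sentences zipped together inside each batch avoids the two lists drifting out of alignment.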

When done, save the output in [pickle](https://docs.python.org/3/library/pickle.html) format (this is similar to json, but it can also handle np.arrays), so that you can easily load it later when needed and do not have to run it again. After pickle dumping (that's the word for saving it in pickle format), print it so that you are sure everything was saved correctly.
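A minimal round trip with `pickle` looks roughly like this. The nested structure and the file name `cwe_demo.pkl` are placeholders, and the vectors are dummy arrays of length 768 (the hidden size of BERT base).

```python
import os
import pickle
import tempfile

import numpy as np

# Hypothetical structure: word -> meaning id -> layer -> vector.
embeddings = {"bank": {"M1_a": {"static": np.zeros(768), 12: np.ones(768)}}}

path = os.path.join(tempfile.gettempdir(), "cwe_demo.pkl")  # placeholder path

with open(path, "wb") as f:  # pickle needs binary mode
    pickle.dump(embeddings, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

# Loading the file back is what actually confirms the dump worked.
print(restored["bank"]["M1_a"][12].shape)
```

Printing the object that was loaded back, rather than the one still in memory, is the more convincing check that everything was saved correctly.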

Then, check that your final data includes everything that you need by checking the entry "bank" and print the data pertaining to "bank".

In [None]:
#your code here

processed word: lamb
processed word: book
processed word: breakfast
processed word: chicken
  4%|▎         | 4/112 [00:19<08:51,  4.92s/it]
KeyboardInterrupt

# Task 2: Compute sense embeddings for the homonym dataset using WordNet [20 points]

Your next task is to fetch the definitions (glosses) of the homonyms, and compute an embedding for each gloss (each gloss is associated with a specific sense). We do that so we can later see whether the contextualised embeddings computed above represent the meaning of the homonym in context well (by comparing them to the sense embeddings). Figure 18.9 in [chapter 18 of Jurafsky and Martin (2021)](https://web.stanford.edu/~jurafsky/slp3/old_sep21/18.pdf) graphically illustrates this idea. Use this chapter for this part of the assignment, as it will be useful both theoretically and practically.

## Task 2.1: Fetch senses and glosses for a word [5 points]

First of all, you will have to figure out how [WordNet](https://www.nltk.org/howto/wordnet.html) works within the nltk package (hint: pay attention to what a synset is).

Install and import all the necessary components and define a function to extract the glosses of a word and create a dictionary with senses and glosses.

Then use the word "bat" to test that everything is working correctly: i.e., for "bat", you should be able to get its senses and the gloss for each of the senses (you will see that synsets might contain related words, but you only need the senses that contain the word of interest or derivatives thereof; this should be specified in the function). Print the output for "bat".


In [None]:
#your code here

## Task 2.2: Function to compute sense embeddings [10 points]

Now that you have a function to extract senses and glosses for a given word, write a function that takes a word and computes embeddings for each of the senses following the method explained in Jurafsky and Martin's chapter. In this case, there is no need to calculate embeddings at different layers: you should use the last layer only. You should maximally optimise this function like before.

The output should include the sense, the gloss, and the embedding. Print the function's output when using the word "bank".


In [None]:
#your code here

## Task 2.3: Compute sense embeddings for the RAW-C stimuli [4 points]

Now, use the function you defined above to compute sense embeddings for the RAW-C stimuli and pickle dump it too.

As above, the information that should be there for each word is: the sense, the gloss, the embedding at the last layer. Again, you can think of which structure to use best, but keep in mind that we will have to compare these to the CWE calculated in task 1, so it is good to think of a similar structure that is easily comparable.

Make sure that the number of stimuli matches the number of stimuli in the final RAW-C dataset.

In [None]:
#your code here

# Task 3: Compute and explore similarity between homonym CWEs and sense embeddings [25 points]

You now have the homonym CWEs computed in task 1, and the sense embeddings computed in task 2. The next step is to calculate cosine similarities between each CWE for each homonym (at the selected layer!) and each sense embedding for that homonym.

For instance, say that for the word "bat" with meaning M1_a you have its CWE at the static layer and at layers 4, 8 and 12, and 7 senses: here, you will end up with 4 × 7 = 28 cosine similarities (take each CWE and compute its similarity to each of the sense embeddings). We then want to see which sense is the closest to each CWE, and do some qualitative explorations with that.

## Task 3.1: Compute the cosine similarity between all the CWEs and the sense embeddings [8 points]

This task is not trivial with regard to how much information you have and how to structure the data (this is why it's also important to think of data structures in the earlier parts of the assignment), so take some time to think about how best to break down this task. Test each step/function if you have multiple. Pickle dump your final output so that it is easily retrievable for later. At the end, print an example of the entry "bank".

For cosine similarity, the cdist function from scipy.spatial.distance seems the most efficient, but you are free to use any of your liking (hint: pay attention to the shape of your embeddings and to similarity vs distance. You will need the similarity).
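The shape and distance-versus-similarity hints can be illustrated with random vectors. In this sketch, the four rows of `cwes` stand in for one word's embeddings at the static layer and layers 4, 8 and 12, and the seven rows of `senses` stand in for seven sense embeddings; all values are random, not real embeddings.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
cwes = rng.normal(size=(4, 768))    # stand-ins: static, 4, 8, 12
senses = rng.normal(size=(7, 768))  # stand-ins: seven sense embeddings

# cdist expects 2-D inputs of shape (n_vectors, dim) and returns cosine *distance*.
distances = cdist(cwes, senses, metric="cosine")
similarities = 1.0 - distances

# A single 1-D vector has to be reshaped to (1, dim) before calling cdist.
single = cdist(cwes[0].reshape(1, -1), senses, metric="cosine")

best_sense = similarities.argmax(axis=1)  # index of the closest sense per CWE
print(similarities.shape, single.shape, best_sense.shape)
```

One call per word yields the whole CWE-by-sense matrix, which is easier to store and inspect than a collection of pairwise results.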

In [None]:
#your code here

## Task 3.2: Quantitative and qualitative explorations of the relationship between homonym embeddings and dominant senses [20 points]

Now, we can look into how the CWEs in different meanings and layers relate to the different senses of a homonym. We'll focus on the dominant sense in WordNet, see below for more details. This section includes both code blocks and reflection questions.

### Dominant senses in WordNet and top senses across layers (focus on static layer) [7 points]

Embeddings at the static layer do not take context into account, so intuitively they should capture the 'average' meaning, perhaps the most common/dominant one. We can test this by looking at the most similar sense and seeing whether it matches the most common/dominant sense in the synset.

Keep in mind that synsets mark more common/dominant senses with numbering: so n.01 will be the most common noun; v.01 the most common verb, etc. If that is not available, the most common meaning will be the next number (e.g., n.02). You have to take that into account when you extract the top sense, so first extract information about which are the most dominant senses for each word across all the parts of speech: for example, "bat" might have as its two most common senses bat.n.01 and bat.v.02 (because v.01 might not be available; this is just an example). Some words might only have one part of speech in their synset, some more. Print your results.
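The "lowest available number per part of speech" rule can be sketched with plain string handling on synset names. The function `dominant_senses` and the example list are hypothetical; the real names should come from your own Task 2 output, and the sketch assumes names of the form `lemma.pos.nn`.

```python
import re


def dominant_senses(word, synset_names):
    """Map each part of speech to the lowest-numbered synset name headed by `word`."""
    best = {}
    for name in synset_names:
        match = re.fullmatch(r"(.+)\.([a-z])\.(\d+)", name)
        if match is None or match.group(1) != word:
            continue  # skip malformed names and synsets headed by another lemma
        pos, number = match.group(2), int(match.group(3))
        if pos not in best or number < best[pos][0]:
            best[pos] = (number, name)
    return {pos: name for pos, (number, name) in best.items()}


# Illustrative input only, mirroring the "bat" example in the text.
example = ["bat.n.01", "bat.n.05", "cricket_bat.n.01", "bat.v.02", "bat.v.03"]
print(dominant_senses("bat", example))
```

Here `bat.v.02` is selected for verbs because no `bat.v.01` is present in the input list.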

In [None]:
#your code here

Then, extract the top similarity of homonyms to the senses at all the layers you have available. While we are interested in the static layer for checking dominant senses, it is also interesting to look into other layers to see whether adding context will refine the captured meaning.


In [None]:
#your code here

Let's check an example from our results.

Out of all the similarities of 'bank' to all its senses at all the layers, which one is the highest? Print your results for that entry and reflect below.

In [None]:
#your code here

### Does the static layer capture the most dominant meaning, according to WordNet (and according to you)? [1 point]

%your answer here

### Across other layers and meanings, which layer seems to capture the meaning of bank across meanings best, and why do you make this conclusion? [2 points]

%your answer here

### Checking matches and mismatches with the dominant sense [5 points]

Now, let's quantitatively check if the static layer actually captures the most dominant sense (any POS). You should end up with two data structures: matches (when the most similar sense is one of the dominant senses) and mismatches (when the most similar sense is not one of the dominant senses). Do that also for the other layers to compare. Print the percentage of matches and mismatches per layer.
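The bookkeeping for matches and mismatches can be sketched as follows. The function `match_rates`, both input dictionaries and all values are made up for illustration; in your own data the most similar sense will probably be stored per word and per context rather than per word only.

```python
def match_rates(top_senses, dominant):
    """Return {layer: (pct_match, pct_mismatch)}.

    top_senses: {layer: {word: most_similar_sense}}
    dominant:   {word: set of dominant senses across parts of speech}
    """
    rates = {}
    for layer, per_word in top_senses.items():
        matches = [w for w, sense in per_word.items() if sense in dominant[w]]
        pct = 100.0 * len(matches) / len(per_word)
        rates[layer] = (pct, 100.0 - pct)
    return rates


# Made-up values purely to show the computation.
dominant = {"bat": {"bat.n.01", "bat.v.02"}, "bank": {"bank.n.01", "bank.v.01"}}
top_senses = {
    "static": {"bat": "bat.n.01", "bank": "bank.n.03"},
    12: {"bat": "bat.v.02", "bank": "bank.n.01"},
}
rates = match_rates(top_senses, dominant)
for layer, (pct_match, pct_mismatch) in rates.items():
    print(layer, pct_match, pct_mismatch)
```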



In [None]:
#your code here

Now, print the matches and mismatches for the static layer only.

In [None]:
#your code here

### Do BERT's static embeddings capture the most dominant sense in WordNet? [2 points]

%your answer here

### Do the percentages of matches and mismatches throughout the layers make sense to you, or are they different from what you expected? [2 points]

%your answer here

### For the **static layer**, are there any words that seem to particularly deviate from the dominant meaning? If so, which and why could that be? [3 points]

%your answer here

### Do you think the corpus on which BERT is trained might reflect different meaning dominance than for WordNet's senses? If so/not, why? [3 points]

%your answer here

# Task 4: Partially replicate Trott & Bergen's experiment [20 points]

Now comes the time to partially replicate the RAW-C experiment, by seeing whether different layers of BERT capture meanings more or less similarly to humans. At the end you will have to wrap up with a brief comment on which layer seems to capture meanings best and how that connects to explorations in the previous section.

## Task 4.1: Create a dataframe with cosine similarities between sentences at different layers [7 points]

You should now use the embeddings at the different layers that you computed to calculate similarities between each context: M1a, M1b, M2a, M2b. You will have to have all combinations, so for each string in the RAW-C dataframe, you'll have: M1a vs M1b, M1a vs M2a, M1a vs M2b, M1b vs M2a, M1b vs M2b, M2a vs M2b.
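The six pairings are exactly the 2-element combinations of the four contexts, so `itertools.combinations` can generate them. In this sketch the vectors are random stand-ins, and the `same` flag and the underscore-joined `version` label are assumptions about how you may want to name the rows.

```python
from itertools import combinations

import numpy as np


def cosine_similarity(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


contexts = ["M1_a", "M1_b", "M2_a", "M2_b"]
rng = np.random.default_rng(0)
vectors = {c: rng.normal(size=768) for c in contexts}  # random stand-ins

rows = []
for first, second in combinations(contexts, 2):
    rows.append({
        "version": f"{first}_{second}",
        "same": first[:2] == second[:2],  # both M1 or both M2
        "cosine": cosine_similarity(vectors[first], vectors[second]),
    })

print([row["version"] for row in rows])
```

Repeating this per word and per layer, then passing the list of row dictionaries to `pd.DataFrame`, gives a long-format table that is straightforward to merge with the human ratings.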

Bear in mind that your final dataframe should include: the word, the string as it appears in the sentence, the cosine similarity at layers 4, 8 and 12, the version being compared (is it M1a vs M1b or M1a vs M2a?) and the mean relatedness given by humans (hint: the repo you cloned will be useful here, both in terms of code and data). Print the head of the dataframe to check everything is in order, and also check that the number of stimuli matches your number across the assignment (starting from task 1).

In [None]:
#your code here

## Task 4.2: Correlate with human judgements and visualise [8 points]

First, correlate the cosine similarities at the different layers to the mean human relatedness judgements. Use the same correlation metric used by Trott & Bergen.
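As a rough sketch, and assuming a rank correlation such as Spearman's rho is the appropriate metric (check the paper to confirm which one the authors report), the computation for a single layer could look like this. The arrays are made-up numbers, not real similarities or ratings.

```python
import numpy as np
from scipy.stats import spearmanr

# Made-up values: one cosine similarity and one mean human rating per sentence pair.
cosine_layer12 = np.array([0.91, 0.55, 0.62, 0.48, 0.87, 0.70])
mean_relatedness = np.array([3.8, 0.9, 1.4, 0.6, 3.5, 1.2])

# spearmanr returns the correlation coefficient and a two-sided p-value.
rho, p_value = spearmanr(cosine_layer12, mean_relatedness)
print(f"rho = {rho:.3f}, p = {p_value:.3f}")
```

Running the same call once per layer column gives the per-layer coefficients to compare.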

In [None]:
#your code here

Next, visualise your results. You want to see the correlation between BERT embeddings and human judgements per layer, but what would also be interesting is to include the meaning contrasts (such as M1_a_M1_b, etc), so that we can see how those play out per layer.

In [None]:
#your code here

### Reflect on the correlations and on the visualisations. What can you observe and infer in terms of which layer(s) might be capturing meaning best? Is there one way to determine that (i.e., what does 'capturing meanings' mean?)? Contrast and compare the layers. [5 points]

%your answer here



