# Lesson notebook 9 - Coreference Resolution



### Resolution with HuggingFace Online NeuralCoref

HuggingFace provides an online demonstration of coreference resolution.  We can use that to see how candidate links are scored.

The demo is built on SpaCy, a pretrained open-source language processing pipeline that we have used before.  SpaCy can process text in a number of ways without any fine-tuning or training, and it can also be trained or fine-tuned.

The demo shows coreference resolution out of the box, with no training on our part.  Take a look at the coreference clusters that it finds.  How well do you think it performs?

### Resolution Experiment with BERT Embeddings

We'll combine SpaCy with BERT to try some coreference resolution.  SpaCy supplies part-of-speech tags without any fine-tuning or training on our part, and BERT supplies contextualized word embeddings that we use to pick the most likely antecedent for a pronoun.



<a id = 'returnToTop'></a>

## Notebook Contents

  * 1. [Online Demo](#onlineDemo)
  * 2. [Setup](#spacySetup)
  * 3. [Coref Resolution via Contextualized BERT Embeddings](#corefBERT)
  * 4. [Answers](#answers)      










[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2023-spring-main/blob/master/materials/lesson_notebooks/lesson_9_CoreferenceResolution.ipynb)

[Return to Top](#returnToTop)  
<a id = 'onlineDemo'></a>

## 1. Online Demo


Run a visual example of coreference resolution [here](https://huggingface.co/coref/) without using this notebook.  If you check the debug box in the upper right corner, the display includes the scores for each of the possible coreference links.  This nicely illustrates the approach of performing a **pairwise comparison of all spans or mentions** and then keeping only the high-scoring pairs to aggregate into clusters.
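As a rough sketch of the "score every pair, keep the high-scoring links, then aggregate" idea, the snippet below clusters the mentions from the Blofeld example in the list of sentences that follows. The scores, the `threshold` value, and the `cluster_mentions` helper are all made up for illustration; they are not taken from the demo, and a real system may pick antecedents differently (for example, keeping only the best-scoring antecedent per mention).

```python
from itertools import combinations


def cluster_mentions(mentions, pair_scores, threshold=0.5):
    """Group mentions whose pairwise link score exceeds `threshold`.

    Hypothetical helper: a simplified union-find over high-scoring pairs.
    """
    parent = {m: m for m in mentions}

    def find(m):
        # Follow parent pointers to the cluster representative.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    # Compare every pair of mentions and link the confident ones.
    for a, b in combinations(mentions, 2):
        score = pair_scores.get((a, b), pair_scores.get((b, a), 0.0))
        if score > threshold:
            parent[find(b)] = find(a)

    # Collect mentions by representative; drop singletons.
    clusters = {}
    for m in mentions:
        clusters.setdefault(find(m), []).append(m)
    return [c for c in clusters.values() if len(c) > 1]


mentions = ["Ernst Blofeld", "a cat", "He", "her", "The villain"]

# Invented scores, for illustration only.
pair_scores = {
    ("Ernst Blofeld", "He"): 0.9,
    ("Ernst Blofeld", "The villain"): 0.7,
    ("a cat", "her"): 0.8,
    ("Ernst Blofeld", "her"): 0.1,
    ("a cat", "He"): 0.2,
}

clusters = cluster_mentions(mentions, pair_scores)
print(clusters)
# [['Ernst Blofeld', 'He', 'The villain'], ['a cat', 'her']]
```

Lowering the threshold merges more mentions into each cluster, and raising it leaves more mentions unresolved, which is the trade-off the debug scores in the demo make visible.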

Try the following sentences and see if the model can correctly resolve all the coreferences.  First, here are the sentences about Abraham Lincoln used in the live session slides:

`On the afternoon of November 19, 1863, Lincoln went to Gettysburg. He gave his famous speech there. It has been recognized as one of the great speeches of American history.`

**OR**

These sentences have two characters -- the Bond villain Blofeld and his cat. If the system works perfectly, it should generate two clusters -- one for Blofeld and one for the cat.  The Blofeld cluster should contain *Blofeld*, *he*, and *the villain*. The cat cluster should contain *cat* and *her*.

`Ernst Blofeld has a cat. He loves her. The villain has always been fond of animals.` 

**OR**

Another set of sentences with two characters -- my sister and her dog -- each referred to by pronouns.

`My sister has a dog. She loves him. He worships the ground she walks upon.`



You could also try your own sentences.


[Return to Top](#returnToTop)  
<a id = 'spacySetup'></a>

## 2. Setup

We're going to use a combination of SpaCy and BERT to demonstrate coreference resolution.  We'll use SpaCy to identify parts of speech and then contextualized BERT embeddings to identify coreference resolutions. 

In [1]:
!pip install -q transformers


In [2]:
!pip install -q spacy

In [3]:
import spacy
print(spacy.__version__)

3.4.4


In [4]:
# Load SpaCy's small English model
nlp = spacy.load('en_core_web_sm')

In [5]:
import numpy as np
from scipy.spatial.distance import cosine

In [6]:
from transformers import TFBertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')


Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


[Return to Top](#returnToTop) 
<a id = 'corefBERT'></a>

## 3. Coref Resolution via Contextualized BERT Embeddings

Let's try some more experiments with coreference resolution.  The [Winograd schema challenge](https://en.wikipedia.org/wiki/Winograd_schema_challenge) is a test built from sentences in which changing a single word changes which noun a pronoun refers to.  For example, in this sentence:


> The city councilmen refused the demonstrators a permit because *they* **feared/advocated** violence.



if we use the verb **feared** then *they* refers back to councilmen. However, if we use the verb **advocated** then *they* refers back to demonstrators.

What if we tried to use contextualized embeddings more directly to solve this problem? Take the pair *The lion saw the fish and it pounced.* and *The lion saw the fish and it was swimming.* Would the contextualized embedding for "it" be more similar to the contextualized embedding for "lion" in the first sentence, and to the one for "fish" in the second?

We could try using the embeddings that come out of a pre-trained BERT model. We aren't fine-tuning them for this task, so they probably won't work super well. But we might be able to see a bigger difference in predicted corefs based on meaningful changes in the sentence context.

Let's create a function that uses contextualized embeddings from a pre-trained BERT model and picks the noun closest to the pronoun by cosine similarity. It's very simple and there's no fine-tuning for the coref task, so it doesn't work for all cases, but it illustrates another approach to the task. (It gets some things wrong that NeuralCoref gets right, because we only use SpaCy to check for nouns, not for other properties like person, gender, and number, though those could be added as rules too.)

In [7]:
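def find_pronoun_coref(text, pronoun):
    """Return (noun, cosine distance) for the noun closest to `pronoun`.

    `pronoun` must match a BERT token exactly, so pass it in lowercase
    (bert-base-uncased lowercases its input).
    """
    # Tokenize the full text into BERT wordpieces and locate the pronoun.
    bert_tokens = tokenizer.tokenize(text)
    pronoun_loc = bert_tokens.index(pronoun)

    # Tag each SpaCy token, then copy its part of speech onto every
    # wordpiece it splits into. This assumes the per-token wordpieces
    # match the wordpieces of the full text, position by position.
    spacy_doc = nlp(text)
    bert_tokens_pos = []
    for spacy_tok in spacy_doc:
        bert_toks = tokenizer.tokenize(spacy_tok.text)
        bert_tokens_pos.extend([spacy_tok.pos_] * len(bert_toks))

    # Contextualized embeddings with shape (1, num_tokens, hidden_size).
    # Note that no [CLS]/[SEP] special tokens are added here.
    input_ids = tokenizer.convert_tokens_to_ids(bert_tokens)
    bert_context_embeds = model.predict(np.array([input_ids]))[0]

    # Cosine distance from the pronoun to every noun or proper noun.
    nouns_dist_to_pronoun = [(bert_tokens[i],
                              cosine(bert_context_embeds[0, i, :],
                                     bert_context_embeds[0, pronoun_loc, :]))
                             for i in range(len(bert_tokens))
                             if i != pronoun_loc and bert_tokens_pos[i] in {'NOUN', 'PROPN'}]
    # The smallest distance is the most similar noun.
    closest_noun, closest_dist = min(nouns_dist_to_pronoun, key=lambda x: x[1])
    return closest_noun, closest_dist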
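
To see the selection step in isolation, here is a minimal sketch that picks the closest noun by cosine distance using tiny hand-written vectors instead of BERT output. The three-dimensional `toy_embeddings` and `pronoun_vec` values are invented for illustration; real BERT-base embeddings have 768 dimensions and come from the model, not from a lookup table.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity, the same quantity scipy's `cosine` returns."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Invented vectors standing in for contextualized embeddings.
toy_embeddings = {
    "lion": [0.9, 0.1, 0.3],
    "fish": [0.1, 0.9, 0.2],
}
pronoun_vec = [0.8, 0.2, 0.4]  # stand-in for the embedding of "it"

# Distance from the pronoun to each candidate noun.
distances = {
    noun: cosine_distance(vec, pronoun_vec)
    for noun, vec in toy_embeddings.items()
}

# The candidate with the smallest distance is the predicted antecedent.
closest = min(distances, key=distances.get)
print(closest, round(distances[closest], 3))
```

Because cosine distance ignores vector length and only compares direction, the pronoun is matched to whichever noun points the same way in embedding space, here "lion".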

Now we can run some Winograd schema challenge examples through the function and see how well it works.

In [8]:
find_pronoun_coref('The lion saw the fish and it pounced.', 'it')



('lion', 0.2446913719177246)

In [9]:
find_pronoun_coref('The lion saw the fish and it was swimming.', 'it')



('fish', 0.2321985363960266)

In [10]:
find_pronoun_coref('The fisherman hooked a big fish but he lost it.', 'he')



('fisherman', 0.18205279111862183)

In [11]:
find_pronoun_coref('The fisherman hooked a big fish but he swam away.', 'he')



('fish', 0.18597930669784546)

In [12]:
find_pronoun_coref('The girls ate the apples because they were hungry.', 'they')



('girls', 0.20435720682144165)

In [13]:
find_pronoun_coref('The girls ate the apples because they were ripe.', 'they')



('apples', 0.23722761869430542)

### 3.1 Classroom Exercise

Try to come up with other examples that involve an ambiguous pronoun and that BERT contextualized embeddings get right.
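
One way to keep track of your examples is a small scoring loop like the sketch below. `score_resolver` and `first_noun_baseline` are hypothetical helpers written for this exercise, not part of any library. The baseline simply guesses the second word of the sentence so that the snippet runs without loading BERT; in the notebook you could pass `find_pronoun_coref` as the resolver instead. Expected answers are lowercase because the uncased tokenizer returns lowercase tokens.

```python
def score_resolver(resolver, examples):
    """Run `resolver(text, pronoun)` on each example and report accuracy.

    `resolver` must return a (noun, distance) pair, like find_pronoun_coref.
    """
    results = []
    for text, pronoun, expected in examples:
        predicted, _distance = resolver(text, pronoun)
        results.append((text, predicted, predicted == expected))
    accuracy = sum(ok for _, _, ok in results) / len(results)
    return accuracy, results


# (sentence, pronoun, expected antecedent)
examples = [
    ('The lion saw the fish and it pounced.', 'it', 'lion'),
    ('The lion saw the fish and it was swimming.', 'it', 'fish'),
    ('The girls ate the apples because they were hungry.', 'they', 'girls'),
    ('The girls ate the apples because they were ripe.', 'they', 'apples'),
]


def first_noun_baseline(text, pronoun):
    # Hypothetical stand-in resolver: always guess the second word.
    return text.split()[1].lower(), 0.0


accuracy, results = score_resolver(first_noun_baseline, examples)
print(accuracy)
for text, predicted, ok in results:
    print('OK  ' if ok else 'MISS', predicted, '<-', text)
```

A baseline that ignores context gets exactly one sentence of each pair right, so any resolver that scores above it on paired examples is making some use of the changed word.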

[Return to Top](#returnToTop)  
<a id = 'answers'></a>

## 4. Answers