# Understanding language requires linking events and referents together

<img src="https://cdn-images-1.medium.com/max/1024/1*d9lDwTfR8SWiDIcXNza2hQ.png" width=600 />

Language allows us to reuse familiar units to express complex meanings.

Unlike computers, speakers **vary** how they refer to the same thing. In other words, they vary the **forms** of their **referring expressions.**

## Types of referring expressions

English
* Pronouns (they, we, I, he)
* Names (Vector, Liz, Timothée Chalamet)
* Descriptions, which vary in complexity (syntactically, these are **noun phrases**)
  * the cat sleeping on the couch
  * the way Ember discussed his favorite audiobook

Spanish
* Null pronouns (-)
* Overt pronouns (él, yo)

## What is coreference?

Coreference is a potential **relation** that exists between two referring expressions. When two expressions are **coreferent**, they are assumed to _refer to the same thing_. When they are not coreferent, they refer to different things.

## What types of coreference are there?

* Singleton mentions (a referring expression that is the only mention of its referent, so it is not linked to any other mention)
* Anaphora (a referring expression that "points back" to an earlier referent)
  * In English, clearest cases are pronouns (they, she, I, etc.)
  * <u>Ana</u> ... <u>she</u> loves working ... at the institute.
  * My cat <u>Vector</u> is 15 years old. I adopted <u>him</u> when <u>he</u> was a year old. My mom always tells me she loves her "<u>grandkitten</u>".
* Cataphora (a referring expression that "looks forward" to a future referent)
  * In English, often poetic
  * "If you want <u>some</u>, here's <u>some parmesan cheese</u>." (from [Wikipedia, retrieved 10/31/21 @ 10:07pm](https://en.wikipedia.org/wiki/Cataphora))

## Coreference relations in standard datasets

Guidelines from OntoNotes: https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/english-coreference-guidelines.pdf

* **Identical** - Two referring expressions point to exactly the same entity
  * [Ember]x enjoys playing Stardew Valley.
  * In [his]x spare time, [he]x is also designing an emoji language.
* **Appositive** - Two referring expressions that appear next to each other, and refer to the same entity
  * Either part should make complete sense without the other
  * Heads (the "main" mention)
  * Attributes (the "secondary" part that is attributed to the Head)

<u>Example appositives</u>
  * Ember, a linguist I know, enjoys playing Stardew Valley
    * [Ember]x-head
    * [a linguist I know]x-attrib
  * My cat Vector
    * [My cat]x-attrib
    * [Vector]x-head
    
## Representing coreference

Subscripts are arbitrary, categorical labels rather than ordered numbers. They are only used to keep track of different entities.

* **Chains** of coreferring expressions
  * Links between entity mentions
  * _Ordered_ representation ($m_1$, $m_2$, ..., $m_n$) within a document or chunk of text
* **Clusters** of coreferring expressions
  * All the ways a referent has been referred to
  * Unordered -- may occur across documents
* **Nested mentions**
  * A mention may reference another entity within it
  * e.g., _The president of the United States_ could be treated like 
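
To make the difference between chains and clusters concrete, here is a minimal sketch built on the Ember example from the OntoNotes section. The `Mention` class, its field names, and the extra entity label `"y"` for _Stardew Valley_ are assumptions made for illustration; they do not come from any particular toolkit.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    sentence: int  # sentence index within the document
    start: int     # index of the first token in that sentence
    end: int       # index one past the last token
    text: str
    entity: str    # arbitrary categorical label, e.g. "x"


# Deliberately listed out of document order.
mentions = [
    Mention(1, 1, 2, "his", "x"),
    Mention(0, 0, 1, "Ember", "x"),
    Mention(0, 3, 5, "Stardew Valley", "y"),  # hypothetical second entity
    Mention(1, 5, 6, "he", "x"),
]


def clusters(mentions):
    """Unordered view: every way each entity has been referred to."""
    out = defaultdict(set)
    for m in mentions:
        out[m.entity].add(m.text)
    return dict(out)


def chain(mentions, entity):
    """Ordered view: mentions of one entity in document order."""
    same = [m for m in mentions if m.entity == entity]
    return [m.text for m in sorted(same, key=lambda m: (m.sentence, m.start))]


print(clusters(mentions))
print(chain(mentions, "x"))
```

The cluster is a set, so it would still make sense if the mentions came from several documents. The chain depends on positions within a single text.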



## Solving coreference requires getting a good handle on your data

Let's work through a few concrete examples of coreference:

Army Corps of Engineers
* But [the Army Corps of Engineers] expects the river level to continue falling this month.
* "The flow of the Missouri River is slowed," an [Army Corps] spokesman said.
- Coreferent because the shortened form is a stand-in for the longer form

And of non-coreference:

<img src="https://cdn.vox-cdn.com/thumbor/CtzDFj9Rp2QWQ0cUKoTLTaHmebE=/1400x1400/filters:format(jpeg)/cdn.vox-cdn.com/uploads/chorus_asset/file/19735670/AdobeStock_278395497.jpeg" width=400 />

Wheat
* [Wheat] is an important part of the economy in the Midwest.
* In Kansas, [wheat] fields stretch as far as the eye can see.
- Not coreferent because it is a **modifier** in the second case

<img src="https://upload.wikimedia.org/wikipedia/commons/a/aa/Dickens_Gurney_head.jpg" height=400 /> <img src="https://m.media-amazon.com/images/M/MV5BMTQ2Mzc5ODQ4MV5BMl5BanBnXkFtZTgwNjk4NjI4NzE@._V1_.jpg" height=400 />

Dickens/Dickensian
* [Charles Dickens] was famous for his memorable characters.  
* The [Dickensian] character has since become a literary archetype.
- Not coreferent because it is a **modifier**

## Coreference can apply to almost anything
1. Event coreference
  * Whether two descriptions refer to the same event -- not necessarily an "entity"
2. Entity coreference
  * Whether two referring expressions refer to the same person, organization, etc.

## Why is coreference challenging?

Language is highly ambiguous. Doing coreference resolution requires making several decisions:

1. Is something a new entity or does it belong to an existing entity?
2. If it belongs to an existing chain or cluster, which one does it belong to?
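
The two decisions above can be sketched as a greedy, left-to-right loop. This is only an illustration under strong assumptions: mentions are plain strings, and the hypothetical `same_last_word` test stands in for whatever compatibility model a real system would use.

```python
def resolve(mentions, compatible):
    """Assign each mention to an existing cluster or start a new one."""
    clusters = []
    for mention in mentions:
        # Decision 2: which existing cluster? Try the most recent first.
        for cluster in reversed(clusters):
            if compatible(cluster, mention):
                cluster.append(mention)
                break
        else:
            # Decision 1: nothing matched, so this is a new entity.
            clusters.append([mention])
    return clusters


def same_last_word(cluster, mention):
    """Toy compatibility test: the last word matches a clustered mention."""
    last = mention.lower().split()[-1]
    return any(m.lower().split()[-1] == last for m in cluster)


toy_mentions = ["Charles Dickens", "the Missouri River", "Dickens", "the river"]
print(resolve(toy_mentions, same_last_word))
```

A heuristic this crude would also link modifiers such as _wheat_ in "wheat fields", which is one reason real systems need much richer evidence.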

### Winograd Schema

* One common "test" of a model's linguistic reasoning.
* Builds in tough test cases that are pretty easy for people but highly ambiguous for a system without world knowledge.

Components:
1. A sentence or brief discourse that contains the following
  * Two noun phrases of the same semantic class (male, female, inanimate, or group of objects or people)
  * An ambiguous pronoun that may refer to either of the above noun phrases
  * A special word and alternate word, such that if the special word is replaced with the alternate word, the natural resolution of the pronoun changes.
2. A question asking the identity of the ambiguous pronoun
3. Two answer choices corresponding to the noun phrases in question.

[Example Winograd schema](https://cs.nyu.edu/~davise/papers/WinogradSchemas/WSCExample.xml)

> Dan had to stop Bill from toying with the injured bird. He is very compassionate.
> Snippet: He is very compassionate. \
>  A. Dan \
>  B. Bill 
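
One way to hold the components listed above in code is a small record type. The class and field names below are hypothetical. The alternate word "cruel" and the answer labels are also assumptions added for illustration, since the snippet above shows only the "compassionate" variant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WinogradSchema:
    template: str      # discourse with a {word} slot for the special word
    pronoun: str       # the ambiguous pronoun
    candidates: tuple  # the two noun phrases
    special: str
    alternate: str
    answers: dict      # maps each word variant to the intended referent

    def instances(self):
        """Yield (text, answer) for the special and alternate variants."""
        for word in (self.special, self.alternate):
            yield self.template.format(word=word), self.answers[word]


schema = WinogradSchema(
    template="Dan had to stop Bill from toying with the injured bird. "
             "He is very {word}.",
    pronoun="He",
    candidates=("Dan", "Bill"),
    special="compassionate",
    alternate="cruel",  # assumed alternate word
    answers={"compassionate": "Dan", "cruel": "Bill"},  # assumed labels
)

for text, answer in schema.instances():
    print(text, "->", answer)
```

Swapping a single word flips the expected referent, so a model cannot succeed by relying on surface cues such as word order.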

## Resources for training models to solve coreference

* PropBank (where we also got semantic role labels)
* OntoNotes (entities linked together by XML tags with an "id" property, along with a _type_ of coreference)
  * Detailed guidelines https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/english-coreference-guidelines.pdf

Example OntoNotes annotation (XML)

> If you look back at \<COREF ID="19" TYPE="IDENT"\>this historic agreement that *T*-1 was signed *-2 in \<COREF ID="300" TYPE="IDENT"\>the six party frameworks\</COREF> on \<COREF ID="MSNBC_MEETPRESS_ENG_20060709_215800.part-3-E10" TYPE="IDENT"\>September nineteenth of two thousand five\</COREF>\</COREF> \<COREF ID="16" TYPE="IDENT">those types of assurances\</COREF> are in \<COREF ID="19" TYPE="IDENT">that agreement\</COREF>.
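
As a rough sketch of how such an annotation could be read, the snippet below groups mention text by `ID`. It assumes the markup has been unescaped and that `COREF` tags are properly nested. The regular expression and the function name are illustrative and are not part of any OntoNotes tooling.

```python
import re
from collections import defaultdict

# Matches an opening COREF tag (capturing ID and TYPE) or a closing tag.
TAG = re.compile(r'<COREF ID="([^"]+)" TYPE="([^"]+)">|</COREF>')


def extract_clusters(annotated):
    """Map each COREF ID to the list of mention strings tagged with it."""
    clusters = defaultdict(list)
    stack = []  # currently open mentions: (id, type, text pieces)
    pos = 0
    for match in TAG.finditer(annotated):
        text = annotated[pos:match.start()]
        # Text between tags belongs to every mention that is still open.
        for _, _, pieces in stack:
            pieces.append(text)
        if match.group(1) is not None:   # opening tag
            stack.append((match.group(1), match.group(2), []))
        else:                            # closing tag
            coref_id, _, pieces = stack.pop()
            clusters[coref_id].append("".join(pieces))
        pos = match.end()
    return dict(clusters)


sample = (
    'If you look back at <COREF ID="19" TYPE="IDENT">this historic agreement '
    'that *T*-1 was signed *-2 in <COREF ID="300" TYPE="IDENT">the six party '
    'frameworks</COREF> on <COREF ID="MSNBC_MEETPRESS_ENG_20060709_215800.part-3-E10" '
    'TYPE="IDENT">September nineteenth of two thousand five</COREF></COREF> '
    '<COREF ID="16" TYPE="IDENT">those types of assurances</COREF> are in '
    '<COREF ID="19" TYPE="IDENT">that agreement</COREF>.'
)

for coref_id, texts in extract_clusters(sample).items():
    print(coref_id, texts)
```

Entity 19 ends up with two mentions, and the first one contains two nested mentions of other entities.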


## Rule-based systems for coreference resolution

There are many approaches to coreference resolution! Some of the most successful systems stack many rules into ordered passes, an architecture called a **sieve**, such as the one described in [Raghunathan, Lee, Rangarajan, Chambers, Surdeanu, Jurafsky, and Manning (2010), _Proceedings of EMNLP_](https://aclanthology.org/D10-1048/).

Their algorithm describes a seven-layer system:

1. Check whether two descriptions say exactly the same thing (e.g., the Army Corps of Engineers)
2. Certain kinds of relationships hold between the two mentions:
  * appositive relation ([my cat]x-attrib [Vector]x-head)
  * equivalence (...is [a Balinese cat]x)
  * relative pronoun (...who I have had for 14 years)
  * acronym (Department of Motor Vehicles / the DMV)
  * demonym (e.g., America / American)
3. Matching (head) subwords
4. ...
5. ... Relaxations of 3
6. ...
7. Pronouns
  * Match on number (it vs. they)
  * Animacy (people vs. object)
  * Grammatical gender (they vs. she)
  * Person (we / us)
  * Type of entity (e.g., government vs. real world)
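
The layered idea can be sketched in a few lines. This is not the system from the paper: it keeps only three illustrative passes (exact match, head match, pronoun agreement), and the `Mention` fields, the hand-assigned heads, and the pronoun "it" in the toy input are all assumptions. Each pass links a mention to the nearest earlier mention that the pass accepts, with high-precision passes running first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    text: str
    head: str
    is_pronoun: bool = False
    number: str = "singular"
    animate: bool = False


def exact_match(a, b):
    return not (a.is_pronoun or b.is_pronoun) and a.text.lower() == b.text.lower()


def head_match(a, b):
    return not (a.is_pronoun or b.is_pronoun) and a.head.lower() == b.head.lower()


def pronoun_agreement(a, b):
    # b is a pronoun that agrees with the non-pronoun a in number and animacy.
    return (b.is_pronoun and not a.is_pronoun
            and a.number == b.number and a.animate == b.animate)


SIEVES = [exact_match, head_match, pronoun_agreement]  # most precise first


def run_sieve(mentions, sieves=SIEVES):
    """Return a cluster id per mention (index of the cluster's first mention)."""
    parent = list(range(len(mentions)))

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    for sieve in sieves:
        for j in range(len(mentions)):
            # Search backwards for the nearest antecedent this pass accepts.
            for i in range(j - 1, -1, -1):
                if find(i) != find(j) and sieve(mentions[i], mentions[j]):
                    parent[find(j)] = find(i)
                    break
    return [find(i) for i in range(len(mentions))]


mentions = [
    Mention("the Army Corps of Engineers", head="Corps"),
    Mention("the river level", head="level"),
    Mention("Army Corps", head="Corps"),
    Mention("it", head="it", is_pronoun=True),  # hypothetical pronoun
]
print(run_sieve(mentions))
```

Because later passes see the links made by earlier ones, the pronoun joins the cluster that head matching already built instead of starting a new one.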



## What are some tricky cases of (non-)coreference?

Lots of exceptions and difficulties for annotating!

* Organization and members 
* Gender and Number
* Indefinite uses of proper nouns
* GPEs and governments
* Quantifying Expressions
* Quantifiers
* Partitives
* Linking quantifying expressions to other mentions
* Possessive extents
* Formulaic mentions
* Sentence fragments
* Metonyms 



### The Stanford CoreNLP system

Comes with a deterministic coreference resolution system (dcoref). https://nlp.stanford.edu/software/dcoref.html



In [1]:
!pip install stanza # for https://stanfordnlp.github.io/CoreNLP/coref.html

Collecting stanza
  Downloading stanza-1.3.0-py3-none-any.whl (432 kB)

In [2]:
import stanza
import os

corenlp_dir = './CoreNLP/'
stanza.install_corenlp(dir=corenlp_dir)

os.environ["CORENLP_HOME"] = corenlp_dir

2021-11-01 13:56:12 INFO: Installing CoreNLP package into ./CoreNLP/...





Use Stanford's sieve system described above

In [5]:
from stanza.server import CoreNLPClient

text = "My cat Vector is 15 years old. I adopted him when he was a year old. " \
       "My mom always tells me she loves her \"grandkitten\"."
client = CoreNLPClient(
    # deterministic coreference
    annotators=['tokenize','ssplit', 'pos', 'lemma', 'ner', 'dcoref'], 
    memory='4G', 
    endpoint='http://localhost:9004',
    be_quiet=True)
annotation = client.annotate(text)
print(annotation.corefChain)
client.stop()

2021-11-01 13:57:15 INFO: Writing properties to tmp file: corenlp_server-70236d36545a482a.props
2021-11-01 13:57:15 INFO: Starting server with command: java -Xmx4G -cp ./CoreNLP/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9004 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-70236d36545a482a.props -annotators tokenize,ssplit,pos,lemma,ner,dcoref -preload -outputFormat serialized


[chainID: 1
mention {
  mentionID: 1
  mentionType: "PROPER"
  number: "SINGULAR"
  gender: "NEUTRAL"
  animacy: "INANIMATE"
  beginIndex: 0
  endIndex: 3
  headIndex: 2
  sentenceIndex: 0
  position: 1
}
representative: 0
, chainID: 2
mention {
  mentionID: 2
  mentionType: "PRONOMINAL"
  number: "SINGULAR"
  gender: "UNKNOWN"
  animacy: "ANIMATE"
  beginIndex: 0
  endIndex: 1
  headIndex: 0
  sentenceIndex: 0
  position: 2
}
mention {
  mentionID: 4
  mentionType: "PRONOMINAL"
  number: "SINGULAR"
  gender: "UNKNOWN"
  animacy: "ANIMATE"
  beginIndex: 0
  endIndex: 1
  headIndex: 0
  sentenceIndex: 1
  position: 1
}
mention {
  mentionID: 9
  mentionType: "PRONOMINAL"
  number: "SINGULAR"
  gender: "UNKNOWN"
  animacy: "ANIMATE"
  beginIndex: 0
  endIndex: 1
  headIndex: 0
  sentenceIndex: 2
  position: 2
}
mention {
  mentionID: 10
  mentionType: "PRONOMINAL"
  number: "SINGULAR"
  gender: "UNKNOWN"
  animacy: "ANIMATE"
  beginIndex: 4
  endIndex: 5
  headIndex: 4
  sentenceIndex: 2

In [6]:
annotation.corefMentionToEntityMentionMappings

[0, -1, -1, 1, 2, 3, -1, 4, 5, -1, -1, -1, -1]
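
The raw chain dump is hard to read, so a small helper that turns each chain into the words it covers may help. The fields `corefChain`, `mention`, `sentenceIndex`, `beginIndex`, and `endIndex` appear in the output above. The `sentence`, `token`, and `word` fields are assumed from the CoreNLP protobuf schema. So that the sketch runs without a Java server, it is demonstrated on a hand-built stand-in object rather than a real annotation.

```python
from types import SimpleNamespace as NS


def chains_to_text(annotation):
    """Return, for each coreference chain, the text of its mentions."""
    chains = []
    for chain in annotation.corefChain:
        texts = []
        for mention in chain.mention:
            tokens = annotation.sentence[mention.sentenceIndex].token
            words = [t.word for t in tokens[mention.beginIndex:mention.endIndex]]
            texts.append(" ".join(words))
        chains.append(texts)
    return chains


def _sentence(words):
    # Hypothetical stand-in for a protobuf Sentence message.
    return NS(token=[NS(word=w) for w in words])


fake_annotation = NS(
    sentence=[
        _sentence("My cat Vector is 15 years old .".split()),
        _sentence("I adopted him when he was a year old .".split()),
    ],
    corefChain=[
        NS(mention=[
            NS(sentenceIndex=0, beginIndex=0, endIndex=3),
            NS(sentenceIndex=1, beginIndex=2, endIndex=3),
            NS(sentenceIndex=1, beginIndex=4, endIndex=5),
        ])
    ],
)

print(chains_to_text(fake_annotation))
```

With a live client, passing the object returned by `client.annotate(text)` in place of `fake_annotation` should work the same way, provided the field names match the installed CoreNLP version.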

Use Stanford's statistical system instead

In [7]:
client = CoreNLPClient(
    # statistical coreference system
    annotators=['tokenize','ssplit', 'pos', 'lemma', 'ner', 'coref'], 
    memory='4G', 
    endpoint='http://localhost:9002',
    be_quiet=True)
annotation = client.annotate(text)
print(annotation.corefChain)
client.stop()

2021-11-01 13:59:17 INFO: Writing properties to tmp file: corenlp_server-594a74d0ba344033.props
2021-11-01 13:59:17 INFO: Starting server with command: java -Xmx4G -cp ./CoreNLP/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9002 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-594a74d0ba344033.props -annotators tokenize,ssplit,pos,lemma,ner,coref -preload -outputFormat serialized


[chainID: 9
mention {
  mentionID: 2
  mentionType: "NOMINAL"
  number: "SINGULAR"
  gender: "UNKNOWN"
  animacy: "ANIMATE"
  beginIndex: 0
  endIndex: 3
  headIndex: 1
  sentenceIndex: 0
  position: 3
}
mention {
  mentionID: 5
  mentionType: "PRONOMINAL"
  number: "SINGULAR"
  gender: "MALE"
  animacy: "ANIMATE"
  beginIndex: 2
  endIndex: 3
  headIndex: 2
  sentenceIndex: 1
  position: 2
}
mention {
  mentionID: 12
  mentionType: "NOMINAL"
  number: "SINGULAR"
  gender: "FEMALE"
  animacy: "ANIMATE"
  beginIndex: 0
  endIndex: 2
  headIndex: 1
  sentenceIndex: 2
  position: 5
}
mention {
  mentionID: 6
  mentionType: "PRONOMINAL"
  number: "SINGULAR"
  gender: "MALE"
  animacy: "ANIMATE"
  beginIndex: 4
  endIndex: 5
  headIndex: 4
  sentenceIndex: 1
  position: 3
}
mention {
  mentionID: 8
  mentionType: "PRONOMINAL"
  number: "SINGULAR"
  gender: "FEMALE"
  animacy: "ANIMATE"
  beginIndex: 5
  endIndex: 6
  headIndex: 5
  sentenceIndex: 2
  position: 1
}
mention {
  mentionID: 9
 