<img src="data/images/lecture-notebook-header.png" />

# Entity Resolution / Coreference Resolution

**Entity resolution**, also known as record linkage or identity resolution, is the process of identifying and linking different records or data entries that refer to the same entity or individual across different data sources or within a single dataset. This process is crucial in situations where multiple records might pertain to the same real-world entity but are represented differently or contain errors, such as variations in spelling, abbreviations, missing information, or discrepancies. By employing various algorithms, statistical models, or machine learning techniques, entity resolution aims to accurately match and merge these disparate records, reducing redundancy and ensuring a more comprehensive and accurate view of the underlying entities or individuals within the data. It's commonly used in databases, data integration, fraud detection, customer relationship management, and other fields where data consolidation and accuracy are vital.
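
To make the idea of matching differently spelled records concrete, the following minimal sketch links records using a simple string-similarity threshold. It relies only on `difflib` from Python's standard library. The sample records, the helper names `similarity()` and `link_records()`, and the threshold of 0.7 are assumptions chosen for illustration; practical entity resolution systems use considerably more robust matching.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a similarity score in [0, 1] for two strings (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_records(records, threshold=0.7):
    """Greedily group records that are similar to the first record of a group."""
    groups = []
    for record in records:
        for group in groups:
            # Compare against the group's first record (its representative)
            if similarity(record, group[0]) >= threshold:
                group.append(record)
                break
        else:
            # No sufficiently similar group found: start a new one
            groups.append([record])
    return groups


# Hypothetical records with spelling and formatting variations
records = ["Neil Armstrong", "Neil A. Armstrong", "neil armstrong", "Buzz Aldrin", "Buz Aldrin"]

for group in link_records(records):
    print(group)
```

Comparing only against a group's first record keeps the sketch short, but it makes the result depend on the order of the records; this is one reason why real systems prefer clustering or probabilistic matching.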

Closely related, **coreference resolution** is the task of identifying all the expressions in a text that refer to the same entity. It is important for understanding the meaning of a text and is a key component of applications such as question answering, summarization, sentiment analysis, information extraction, and machine translation. The task is challenging because it involves a wide range of linguistic phenomena, such as anaphora, bridging references, and appositives. Techniques developed for coreference resolution include rule-based approaches, statistical models, and deep learning models.

## Setting up the Notebook

This notebook also requires the [`coreferee`](https://spacy.io/universe/project/coreferee) package to extend the capabilities of `spacy` to perform coreference resolution.

In [None]:
import spacy
import coreferee

from spacy import displacy

In [None]:
nlp = spacy.load("en_core_web_lg")

nlp.add_pipe('coreferee')

---

## Performing Coreference Resolution with spaCy

Once you have installed the required package and language model, you can use spaCy as usual to analyze your input text, such as the example we use in the lecture:


In [None]:
text = "When Neil Armstrong stepped on the moon, he said: \"That's one small step for a man, one giant leap for mankind.\" Then he jumped from the last step onto the moon's surface."

# This example contains pronouns that refer to multiple entities; the code below should be able to handle this as well
#text = "When Alice and Bob felt hungry, they went to the restaurant which was nearest to them."

# Analyze text using spaCy including coreference resolution
doc = nlp(text)

# Print coreference chains (i.e., sets of tokens referring to the same entity)
doc._.coref_chains.print()

The code cell below shows how we can get the resolution (i.e., the head of a coreference chain) for a word identified by its position. From the output of the previous code cell, we already know that the pronoun *"he"* at position 8 refers to *"Armstrong"*.

In [None]:
print(doc._.coref_chains.resolve(doc[8]))
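
Conceptually, a coreference chain is a group of mentions (token positions) that refer to the same entity, and resolving a pronoun means looking up its chain and returning the chain's most specific mention. The sketch below mimics this idea with plain Python data structures. It is not how `coreferee` is implemented; the hand-written chain, the `head` field, and the function `resolve_index()` are assumptions made purely for illustration.

```python
# Hand-tokenized beginning of the example sentence (positions match the text above)
tokens = ["When", "Neil", "Armstrong", "stepped", "on", "the", "moon", ",", "he", "said"]

# Each chain lists the token positions of its mentions; "head" marks the mention
# we treat as the most specific one (an assumption of this sketch)
chains = [
    {"mentions": [2, 8], "head": 2},
]


def resolve_index(index, chains):
    """Return the head position for a token position.

    Returns None if the position is in no chain or is itself the head
    (a design choice of this sketch).
    """
    for chain in chains:
        if index in chain["mentions"] and index != chain["head"]:
            return chain["head"]
    return None


head = resolve_index(8, chains)
print(tokens[8], "->", tokens[head])

# "stepped" is not part of any chain, so there is nothing to resolve
print(resolve_index(3, chains))
```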

---

## Replace all Singular Pronouns

As a concrete application, let's look at the task of replacing all pronouns with the respective mention. This can be seen as a kind of text simplification or text clarification task. Although a statement like "Alice likes pizza. She had one for dinner." is easy for most readers to understand, it can be tricky for non-native speakers or people with reading difficulties. For them, the alternative "Alice likes pizza. Alice had a pizza for dinner." can be a great help.

### Identify all Singular Pronouns

Although coreference resolution aims to find all mentions in a text that are coreferent, here we focus only on replacing pronouns. This means that we first have to identify all pronouns in the text. For each pronoun, we then also need the mention(s) it refers to, since these will serve as the replacement.

In [None]:
pronoun_to_resolution = {}

for idx, token in enumerate(doc):
    
    # If the current token is not a pronoun, ignore this token and continue
    if token.pos_ != 'PRON':
        continue
       
    # Get the resolution (i.e., the head of a coreference chain) for the current pronoun.
    resolution = doc._.coref_chains.resolve(token)
    
    # If this was successful, add resolution to the mapping for later use
    if resolution is not None:
        pronoun_to_resolution[idx] = resolution
    
    
print(pronoun_to_resolution)

### Expand Resolutions to Compound Nouns

So far, the resolution is only a single token (here *"Armstrong"*). However, we know that the full resolution is the compound *"Neil Armstrong"*. We therefore first need to extract all complete compounds before we can meaningfully replace a pronoun. Luckily, we already looked into this problem in the notebook "Parsing (Dependency Parsing)", where we saw how we can use the dependencies between words to find all compounds. The code cell below contains the same method `get_compound()` we already implemented in that notebook.


In [None]:
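def get_compound(token, compound_parts=None):

    # Avoid a mutable default argument: create a fresh list unless the caller provides one
    if compound_parts is None:
        compound_parts = []

    # Loop over all children of the token
    for child in token.children:
        # We are only interested in the "compound" relationship
        if child.dep_ == "compound":
            # Call method recursively on the child token
            get_compound(child, compound_parts=compound_parts)
    
    # Add the token itself to the list
    compound_parts.append(token)

    # Return the list so the method can also be used without passing a list
    return compound_parts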
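
To see what this traversal produces without loading a language model, the sketch below runs the same recursion on a hand-built dependency fragment. The class `FakeToken` is a hypothetical stand-in for a spaCy token that provides only the three attributes the method relies on (`text`, `dep_`, `children`), and `collect_compound()` is a renamed copy of the recursion that returns the collected list. The extra determiner child is invented solely to show that non-compound children are skipped.

```python
class FakeToken:
    """Hypothetical stand-in for a spaCy token with only the attributes we need."""

    def __init__(self, text, dep_, children=()):
        self.text = text
        self.dep_ = dep_
        self.children = list(children)


def collect_compound(token, compound_parts=None):
    """Collect the token and all tokens attached to it via "compound" relations."""
    if compound_parts is None:
        compound_parts = []
    for child in token.children:
        # Only follow children attached with the "compound" relation
        if child.dep_ == "compound":
            collect_compound(child, compound_parts)
    # The token itself is added after its compound children
    compound_parts.append(token)
    return compound_parts


# "Neil" is a compound modifier of "Armstrong"; the determiner is skipped
neil = FakeToken("Neil", "compound")
the = FakeToken("the", "det")
armstrong = FakeToken("Armstrong", "nsubj", children=[neil, the])

print(" ".join(t.text for t in collect_compound(armstrong)))
```

Because compound children are visited before the token itself is appended, the parts come out in their natural reading order for head-final compounds such as this one.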

With this method, we can now go through all resolutions and extract all compounds; the code cell below accomplishes this.

In [None]:
pronoun_to_resolution_strings = {}

for pronoun_idx, resolutions in pronoun_to_resolution.items():
    resolution_strings = []
    for res in resolutions:
        # Identify compounds
        compound_parts = []
        get_compound(doc[res.i], compound_parts=compound_parts)
        compound_string = ' '.join([t.text for t in compound_parts])
        # Add compound to list of resolution strings
        resolution_strings.append(compound_string)
    # Add new compound as resolution for current pronoun
    pronoun_to_resolution_strings[pronoun_idx] = resolution_strings
    
print(pronoun_to_resolution_strings)

### Replace Pronouns with a "Suitable" Alternative

Now that we have found all pronouns (including their positions in the text) and their resolutions, we can perform the actual replacement. The only thing we need to consider is that a replacement may contain more than one word. If multiple pronouns are replaced, each replacement therefore shifts the positions of all subsequent pronouns. The variable `offset` keeps track of this shift.

In [None]:
# If we need to replace multiple pronouns (which often means replacing a single word with a multi-word phrase),
# we need to keep track of an offset to get the indices right; otherwise we replace the wrong bits in the sentence
offset = 0

# Let's initialize the sentence as the list of tokens/words from the original sentence
words = [t.text for t in doc]

# Loop over each pronoun we found and make the replacement
for pronoun_idx, resolution_strings in pronoun_to_resolution_strings.items():
    
    # If we have a plural pronoun, we concatenate all resolutions with "and"
    replacement = ' and '.join(resolution_strings)
    # Instead of the string, it's easier to work with the list of words
    replacement = replacement.split(' ')

    # Words before the pronoun + replacement + words after the pronoun
    words = words[:pronoun_idx+offset] + replacement + words[pronoun_idx+1+offset:]
    
    # Update offset
    offset += len(replacement) - 1
    
    
print(' '.join(words))    

And now it sounds like Dobby the house elf :).
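
As an alternative to tracking an `offset`, the replacements can be applied from right to left, so that a replacement never shifts positions that still need to be processed. The sketch below shows this on a plain list of words. The function name `replace_at_positions()` and the hand-written inputs are assumptions for illustration; they stand in for `words` and `pronoun_to_resolution_strings` from the cells above.

```python
def replace_at_positions(words, replacements):
    """Replace single words with (possibly multi-word) phrases.

    `replacements` maps a word position to a list of resolution strings,
    which are joined with "and" in the same way as in the cell above.
    """
    result = list(words)  # work on a copy so the input stays unchanged
    # Process positions from right to left so earlier indices stay valid
    for position in sorted(replacements, reverse=True):
        phrase = " and ".join(replacements[position]).split(" ")
        # Slice assignment swaps one word for the whole phrase
        result[position:position + 1] = phrase
    return result


# Hand-tokenized example with two plural pronouns at positions 7 and 11
words = ["When", "Alice", "and", "Bob", "felt", "hungry", ",", "they",
         "went", "home", "because", "they", "were", "tired", "."]
replacements = {7: ["Alice", "Bob"], 11: ["Alice", "Bob"]}

print(" ".join(replace_at_positions(words, replacements)))
```

Sorting the positions explicitly also removes the implicit requirement that the mapping was filled in ascending order of position.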

---

## Summary

Coreference resolution is an essential natural language processing (NLP) task aimed at identifying and linking words or expressions within a text that refer to the same entity or concept. The primary goal is to resolve references like pronouns (he, she, it) or noun phrases (names, titles) to their corresponding entities, enabling a deeper understanding of textual context.

Uses and applications of coreference resolution encompass several critical areas:

* **Document Understanding:** Coreference resolution enhances document comprehension by disambiguating references, enabling systems to discern relationships between entities across sentences or paragraphs. It aids in creating cohesive representations of text, improving information extraction and summarization.

* **Question Answering and Information Retrieval:** In question-answering systems, resolving coreferences assists in finding relevant information by linking pronouns or references in questions to their appropriate entities in the text. Similarly, in information retrieval tasks, coreference resolution enhances search accuracy by ensuring retrieved documents contain relevant information despite varying references to entities.

* **Machine Translation and Text Generation:** Coreference resolution contributes to improving machine translation systems by maintaining consistency in translations, especially when dealing with pronouns or ambiguous references. In text generation tasks like summarization or dialogue systems, coreference resolution helps in generating coherent and natural-sounding text by ensuring consistency in references throughout the generated content.

* **Entity Linking and Knowledge Graph Construction:** Resolving coreferences aids entity linking, that is, connecting mentions of entities in text to entries in knowledge bases or ontologies. This is crucial for constructing accurate knowledge graphs that represent relationships between entities.

In essence, coreference resolution is pivotal in numerous NLP applications, playing a critical role in document understanding, information retrieval, machine translation, text generation, entity linking, and knowledge representation. By disambiguating references and establishing connections between entities, it significantly enhances the depth and accuracy of language understanding systems across various domains and applications.