### Introduction to Coreference Resolution

This notebook demonstrates how to use the `neuralcoref` library, which integrates with the popular NLP library `spaCy`, to perform coreference resolution. **Coreference resolution** is the task of finding all expressions in a text that refer to the same real-world entity. For example, in "My sister has a dog. She loves him," "My sister" and "She" refer to the same person, while "a dog" and "him" refer to the same animal.

### Cell 1: Environment Setup

This markdown cell provides the shell commands needed to install `neuralcoref` and the small English language model for `spaCy`. The `--no-binary neuralcoref` option tells pip to build the package from source instead of using a prebuilt wheel, which can help resolve binary incompatibilities with the installed `spaCy` version.

Install `neuralcoref` (which, at the time of writing, upgrades spaCy to 2.1), then re-install the spaCy models.

```sh
pip install neuralcoref --no-binary neuralcoref
python -m spacy download en_core_web_sm
```
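Because the install step changes the spaCy version, it can be worth checking what actually ended up installed. The sketch below is a minimal, standard-library-only check. The version bounds (`MIN_SPACY`, `MAX_SPACY_EXCLUSIVE`) are an assumption about which spaCy releases `neuralcoref` targets, not something stated in this notebook, and the helper names are hypothetical.

```python
import re

# ASSUMPTION: neuralcoref is believed to target spaCy >= 2.1 and < 3.0.
# Adjust these bounds to match the release notes of the version you install.
MIN_SPACY = (2, 1)
MAX_SPACY_EXCLUSIVE = (3, 0)


def parse_major_minor(version):
    """Turn a version string such as '2.1.3' into a (major, minor) tuple."""
    numbers = []
    for piece in version.split(".")[:2]:
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 2:
        numbers.append(0)
    return tuple(numbers)


def spacy_version_ok(version):
    """Return True if the version falls inside the assumed supported range."""
    return MIN_SPACY <= parse_major_minor(version) < MAX_SPACY_EXCLUSIVE


def installed_spacy_version():
    """Return the installed spaCy version string, or None if it cannot be found."""
    try:
        from importlib.metadata import PackageNotFoundError, version
    except ImportError:  # Python older than 3.8
        return None
    try:
        return version("spacy")
    except PackageNotFoundError:
        return None


found = installed_spacy_version()
if found is None:
    print("spaCy does not appear to be installed")
else:
    print("spaCy", found, "within assumed range:", spacy_version_ok(found))
```

If the check reports a version outside the range, re-running the install commands in a fresh virtual environment is usually the simplest fix.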

### Cell 2: Importing Libraries

Here, we import the necessary Python libraries. We need `spacy` for the core NLP processing and `neuralcoref` for the coreference resolution functionality.

```python
# Import the main NLP library, spaCy
import spacy
# Import the neuralcoref library for coreference resolution
import neuralcoref
```

### Cell 3: Initializing the NLP Pipeline

This is a crucial step. We first load a `spaCy` English language model. Then, we initialize `neuralcoref` and add it to the `spaCy` processing pipeline. A pipeline is a sequence of components (like a tokenizer, a part-of-speech tagger, a named entity recognizer, etc.) that process a text. By adding `neuralcoref` to this pipeline, we ensure that it will run whenever we process a text with our `nlp` object.

```python
# Load the small English model installed in Cell 1.
# The resulting 'nlp' object holds the entire processing pipeline.
nlp = spacy.load('en_core_web_sm')

# NOTE: In spaCy 2.x, the 'en' shortcut below only works if a shortcut link
# has been created (for example by `python -m spacy download en`).
# nlp = spacy.load('en')

# Initialize the NeuralCoref model, passing it the vocabulary from our loaded spaCy model.
coref = neuralcoref.NeuralCoref(nlp.vocab)

# Add the coreference resolution component to the end of the spaCy pipeline.
# We give it a name, 'neuralcoref', so we can identify it later.
nlp.add_pipe(coref, name='neuralcoref')
```
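To illustrate the pipeline idea without depending on spaCy, here is a minimal pure-Python sketch of a pipeline as an ordered list of named components. `ToyPipeline`, `toy_tokenizer`, and `toy_coref` are hypothetical names invented for this illustration; this is not how spaCy is implemented, only the same pattern in miniature.

```python
class ToyPipeline:
    """A toy stand-in for an NLP pipeline: named components run in order."""

    def __init__(self):
        self.components = []  # list of (name, callable) pairs

    def add_pipe(self, component, name):
        # Append the component so it runs after everything added earlier.
        self.components.append((name, component))

    @property
    def pipe_names(self):
        return [name for name, _ in self.components]

    def __call__(self, text):
        # Start from a bare document and let each component annotate it.
        doc = {"text": text}
        for _, component in self.components:
            doc = component(doc)
        return doc


def toy_tokenizer(doc):
    # Whitespace splitting only; a real tokenizer is far more careful.
    doc["tokens"] = doc["text"].split()
    return doc


def toy_coref(doc):
    # Placeholder: a real component would fill this with mention clusters.
    doc["coref_clusters"] = []
    return doc


toy_nlp = ToyPipeline()
toy_nlp.add_pipe(toy_tokenizer, name="tokenizer")
toy_nlp.add_pipe(toy_coref, name="toy_coref")

toy_doc = toy_nlp("My sister has a dog.")
print(toy_nlp.pipe_names)
print(sorted(toy_doc.keys()))
```

The point of the pattern is that once a component is registered, every later call to the pipeline object runs it automatically, which is why `neuralcoref` only has to be added once.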

### Cell 4: Processing Text

Now that the pipeline is set up, we can process text. We pass a Unicode string to our `nlp` object. This runs the text through all the components in the pipeline (tokenization, tagging, parsing, and our new coreference resolution). The result is a `Doc` object, which contains the processed text and all its annotations.

```python
# Process the example sentence using our nlp pipeline.
# The 'doc' object now contains the text and all the linguistic annotations.
doc = nlp(u'My sister has a dog. She loves him.')
```

### Cell 5: Understanding Coreference Clusters

This markdown cell explains how to access the results from `neuralcoref`. The results are stored in a custom attribute of the `Doc` object called `doc._.coref_clusters`. This attribute holds a list of clusters. Each cluster represents a single real-world entity and contains all the "mentions" (spans of text) that refer to it.

Coreference clusters can be found in the `_.coref_clusters` attribute of `doc`. `_.coref_clusters` is a list of mention *clusters*: each *mention* is a span of tokens in the text, and each cluster groups the mentions that co-refer to the same unique *entity*.

Each mention is a spaCy [Span](https://spacy.io/api/span) object and has all of the methods and attributes of that class.

### Cell 6: Inspecting the Coreference Chains

This code iterates through the coreference clusters found in our processed document. For each cluster (or "chain"), it prints the index of the chain and then iterates through each mention in that chain, printing the mention's text and its start and end token positions.

```python
# Loop through each coreference cluster found in the document.
# 'enumerate' provides an index for each chain.
for idx, chain in enumerate(doc._.coref_clusters):
    # Print a header for the current coreference chain.
    print("Coreference chain %s:" % idx)
    # Loop through each 'mention' (e.g., "My sister", "She") in the current chain.
    for mention in chain.mentions:
        # Print the text of the mention, its starting token index, and its ending token index.
        print(mention.text, mention.start, mention.end)
    # Print a newline for better readability between chains.
    print()
```
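The start and end positions are token indices, and the end index is exclusive, as with Python slices. The sketch below makes that concrete with a hand-tokenized copy of the example sentence and hand-annotated clusters. Both are assumptions written out for illustration, not output captured from the model, and the helper names are hypothetical.

```python
# Hand-tokenized example sentence (assumed tokenization, for illustration only).
tokens = ["My", "sister", "has", "a", "dog", ".", "She", "loves", "him", "."]

# Hand-annotated clusters as (start, end) token offsets, end-exclusive.
hand_annotated_clusters = [
    [(0, 2), (6, 7)],  # "My sister", "She"
    [(3, 5), (8, 9)],  # "a dog", "him"
]


def mention_text(tokens, start, end):
    """Recover a mention's text by slicing the token list."""
    return " ".join(tokens[start:end])


def describe_clusters(tokens, clusters):
    """Build the same kind of lines the loop above prints."""
    lines = []
    for idx, cluster in enumerate(clusters):
        lines.append("Coreference chain %s:" % idx)
        for start, end in cluster:
            lines.append("%s %d %d" % (mention_text(tokens, start, end), start, end))
    return lines


for line in describe_clusters(tokens, hand_annotated_clusters):
    print(line)
```

Because the end index is exclusive, a one-token mention such as "She" has `end == start + 1`.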

### Cell 7: Understanding Syntactic Heads

This markdown cell explains the `span.root` attribute. For any span of text (like a mention), the root is the single token that is the syntactic head of that span in the sentence's dependency parse tree. This is useful for understanding the grammatical role of the mention in the sentence.

The head of a spaCy span can be approximated by the `span.root` attribute, which is "the token with the shortest path to the root of the sentence." This root best captures the syntactic relation of the entire mention to the rest of the sentence.

### Cell 8: Extracting Syntactic Information

Building on the previous example, this code not only prints the mention's text and position but also extracts more detailed syntactic information about the mention's `root` token: its text, its dependency label (`.dep_`), and its syntactic head (`.head`). This shows, for example, that "sister" is the subject (`nsubj`) of "has" and "him" is the direct object (`dobj`) of "loves".

```python
# Loop through each coreference cluster in the document.
for chain in doc._.coref_clusters:
    # Loop through each mention within the cluster.
    for mention in chain.mentions:
        # The following comments describe the attributes we are about to print.
        # mention.text = full text of the mention
        # mention.start = index of the mention's first token
        # mention.end = index one past the mention's last token (exclusive)
        # mention.root = spaCy Token object that is the syntactic head of the mention (in a dependency tree)

        # Print the mention's text, start/end positions, its root token, the root's dependency relationship, and the root's syntactic head token.
        print(mention.text, mention.start, mention.end, mention.root, mention.root.dep_, mention.root.head)
    # Print a newline for better readability.
    print()
```
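To make the "shortest path to the root of the sentence" definition concrete, here is a small pure-Python sketch that picks a span's root from a list of head indices. The `HEADS` array is a hand-written dependency parse of the example sentence (an assumption for illustration, not spaCy output), and `depth` and `span_root` are hypothetical helper names.

```python
TOKENS = ["My", "sister", "has", "a", "dog", ".", "She", "loves", "him", "."]

# HEADS[i] is the index of token i's syntactic head.
# A sentence root points to itself ("has" at 2, "loves" at 7).
HEADS = [1, 2, 2, 4, 2, 2, 7, 7, 7, 7]


def depth(i, heads):
    """Count the steps from token i up to the root of its sentence."""
    steps = 0
    while heads[i] != i:
        i = heads[i]
        steps += 1
    return steps


def span_root(start, end, heads):
    """Return the index of the token in [start, end) closest to the sentence root."""
    return min(range(start, end), key=lambda i: depth(i, heads))


for start, end in [(0, 2), (3, 5), (6, 7), (8, 9)]:
    root = span_root(start, end, HEADS)
    print(" ".join(TOKENS[start:end]), "| root:", TOKENS[root], "| head:", TOKENS[HEADS[root]])
```

Under this hand-written parse, the root of "My sister" is "sister" with head "has", and the root of "him" is "him" with head "loves", matching the relations described above.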

### Cell 9: Testing the Limits

This markdown cell introduces the next section of the notebook, where we will test the model's performance on more challenging sentences, such as those from the Winograd Schema Challenge, which are designed to be difficult for machines to understand due to ambiguity.

Now test the limits of `neuralcoref`. How does it fare on:

- Winograd schema challenge questions?
- long documents?
- near-identity?

### Cell 10: Creating a Helper Function

To make testing easier, we define a simple helper function `print_coref_chains`. This function takes a string of text, runs it through our `nlp` pipeline, and prints the coreference chains in the same way we did before. This saves us from rewriting the loops for each new test case.

```python
# Define a function that takes a text string as input.
def print_coref_chains(text):
    # Process the text with our spaCy pipeline.
    doc = nlp(text)
    # Loop through the resulting coreference clusters.
    for chain in doc._.coref_clusters:
        # Loop through each mention in the cluster.
        for mention in chain.mentions:
            # Print the mention's text and its start/end token positions.
            print(mention.text, mention.start, mention.end)
        # Print a newline to separate the chains.
        print()
```
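A variant that returns the chains instead of printing them can be easier to test and reuse. The sketch below is one possible shape for such a helper; `collect_coref_chains` is a hypothetical name, and it relies only on the `doc._.coref_clusters`, `chain.mentions`, `mention.text`, `mention.start`, and `mention.end` attributes already used above. So that it runs without spaCy installed, it is demonstrated on a hand-built stand-in object rather than on a real `Doc`.

```python
from types import SimpleNamespace


def collect_coref_chains(doc):
    """Return each chain as a list of (text, start, end) tuples."""
    return [
        [(mention.text, mention.start, mention.end) for mention in chain.mentions]
        for chain in doc._.coref_clusters
    ]


def fake_mention(text, start, end):
    # Stand-in for a spaCy Span, carrying only the attributes the helper reads.
    return SimpleNamespace(text=text, start=start, end=end)


# Hand-built stand-in for a processed Doc (illustrative, not model output).
fake_doc = SimpleNamespace(
    _=SimpleNamespace(
        coref_clusters=[
            SimpleNamespace(mentions=[fake_mention("My sister", 0, 2), fake_mention("She", 6, 7)]),
            SimpleNamespace(mentions=[fake_mention("a dog", 3, 5), fake_mention("him", 8, 9)]),
        ]
    )
)

chains = collect_coref_chains(fake_doc)
for chain in chains:
    print(chain)
```

With the real pipeline, the same function could be called as `collect_coref_chains(nlp(text))`.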

### Cell 11: Winograd Schema Test 1

This is our first test using the helper function. The sentence is, "The trophy would not fit in the brown suitcase because **it** was too big." The pronoun "it" is ambiguous: it could refer to the trophy or to the suitcase. A human reader resolves it to the trophy, since the object that is too big is the one that fails to fit. The model correctly places "The trophy" and "it" in the same coreference chain.

```python
# Call our helper function with a sentence from the Winograd Schema Challenge.
# The model needs to resolve what "it" refers to.
print_coref_chains("The trophy would not fit in the brown suitcase because it was too big")
```

### Cell 12: Winograd Schema Test 2

Here, the sentence is, "The town councilors refused to give the demonstrators a permit because **they** feared violence." The pronoun "they" could refer to the councilors or the demonstrators. In this context, it refers to the councilors. The model correctly links "The town councilors" and "they".

```python
# Test another ambiguous sentence. Here, "they" is the ambiguous pronoun.
print_coref_chains("The town councilors refused to give the demonstrators a permit because they feared violence.")
```

### Cell 13: Winograd Schema Test 3

This is a variation of the previous sentence: "The town councilors refused to give the demonstrators a permit because **they** advocated violence." By changing one word ("feared" to "advocated"), the meaning of "they" flips to refer to the demonstrators. However, the model **incorrectly** still links "they" to "The town councilors," showing the limits of its understanding.

```python
# Test a variation where the meaning of "they" should change.
# The model makes an error here, demonstrating a common failure point for these systems.
print_coref_chains("The town councilors refused to give the demonstrators a permit because they advocated violence.")
```
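The three tests can be summarized as a tiny evaluation harness. The sketch below is illustrative only: `WINOGRAD_CASES`, `REPORTED_LINKS`, `accuracy`, and `reported_predictor` are hypothetical names, and the links in `REPORTED_LINKS` are hard-coded from the descriptions in the cells above rather than taken from a fresh model run. To score the real pipeline, `reported_predictor` would be swapped for a function that reads `doc._.coref_clusters`.

```python
TROPHY = "The trophy would not fit in the brown suitcase because it was too big"
FEARED = "The town councilors refused to give the demonstrators a permit because they feared violence."
ADVOCATED = "The town councilors refused to give the demonstrators a permit because they advocated violence."

# Each case: (sentence, ambiguous pronoun, antecedent a human reader would choose).
WINOGRAD_CASES = [
    (TROPHY, "it", "The trophy"),
    (FEARED, "they", "The town councilors"),
    (ADVOCATED, "they", "the demonstrators"),
]

# Antecedents the walkthrough above says the model chose (hard-coded, not re-run).
REPORTED_LINKS = {
    TROPHY: "The trophy",
    FEARED: "The town councilors",
    ADVOCATED: "The town councilors",
}


def reported_predictor(sentence, pronoun):
    """Stand-in predictor that replays the links described in the text."""
    return REPORTED_LINKS[sentence]


def accuracy(cases, predict):
    """Fraction of cases where the predicted antecedent matches the expected one."""
    correct = sum(
        1
        for sentence, pronoun, expected in cases
        if predict(sentence, pronoun).lower() == expected.lower()
    )
    return correct / len(cases)


print("Accuracy on the three examples: %.2f" % accuracy(WINOGRAD_CASES, reported_predictor))
```

Three sentences are far too few to measure anything meaningful, but the same structure scales to a larger set of Winograd pairs.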