This notebook explores the `neuralcoref` package for spaCy. `neuralcoref` only works with spaCy 2 (not the spaCy 3 we have installed), so to use it you'll have to create a new environment and install that version of spaCy.

```
conda create --name anlp_spacy2 python=3.7
conda activate anlp_spacy2
pip install spacy==2.1.0
python -m spacy download en_core_web_sm
pip install neuralcoref --no-binary neuralcoref
conda install nb_conda=2.2.1
```

In [1]:
import spacy
import neuralcoref

In [2]:
nlp = spacy.load('en_core_web_sm')
neuralcoref.add_to_pipe(nlp)

<spacy.lang.en.English at 0x7f7a80987450>

In [3]:
doc = nlp(u'My sister has a dog. She loves him.')

Coreference clusters are stored in the `_.coref_clusters` attribute of `doc`. `_.coref_clusters` is a list of mention *clusters*: each *mention* is a span of tokens in the text, and a cluster groups the mentions that co-refer to the same unique *entity*.

Each mention is a spaCy [Span](https://spacy.io/api/span) object and has all of the methods and attributes of that class.

In [4]:
for idx, chain in enumerate(doc._.coref_clusters):
    print("Coreference chain %s:" % idx)
    for mention in chain.mentions:
        print(mention.text, mention.start, mention.end)
    print()

Coreference chain 0:
My sister 0 2
She 6 7

Coreference chain 1:
a dog 3 5
him 8 9
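
To make the cluster structure concrete, here is a minimal sketch in plain Python (no spaCy required) that substitutes each later mention with the first mention of its chain. The token list and `(start, end)` offsets are copied by hand from the output above; the helper name `resolve_mentions` and the choice of the first mention as the representative are assumptions of this sketch, not part of `neuralcoref`.

```python
def resolve_mentions(tokens, chains):
    """Replace every non-first mention with the tokens of its chain's first mention.

    tokens: list of token strings
    chains: list of chains, each a list of (start, end) token offsets (end exclusive)
    """
    # Map the start offset of each later mention to (end offset, replacement tokens).
    replacements = {}
    for chain in chains:
        first_start, first_end = chain[0]
        antecedent = tokens[first_start:first_end]
        for start, end in chain[1:]:
            replacements[start] = (end, antecedent)

    resolved = []
    i = 0
    while i < len(tokens):
        if i in replacements:
            end, antecedent = replacements[i]
            resolved.extend(antecedent)
            i = end  # skip over the replaced mention
        else:
            resolved.append(tokens[i])
            i += 1
    return resolved


# Tokens and offsets hand-copied from the example above.
tokens = ["My", "sister", "has", "a", "dog", ".", "She", "loves", "him", "."]
chains = [[(0, 2), (6, 7)], [(3, 5), (8, 9)]]

print(" ".join(resolve_mentions(tokens, chains)))
# My sister has a dog . My sister loves a dog .
```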



The head of a spaCy span can be approximated by the `span.root` attribute, which is "the token with the shortest path to the root of the sentence." This root best captures the syntactic relation of the entire mention to the rest of the sentence.

In [5]:
for chain in doc._.coref_clusters:
    for mention in chain.mentions:
        # mention.text = full text of the mention span
        # mention.start = index of the mention's first token
        # mention.end = index one past the mention's last token (exclusive)
        # mention.root = spaCy Token object that is the syntactic head of the mention (in a dependency tree)
        print(mention.text, mention.start, mention.end, mention.root, mention.root.dep_, mention.root.head)
    print()

My sister 0 2 sister nsubj has
She 6 7 She nsubj loves

a dog 3 5 dog dobj has
him 8 9 him dobj loves
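
The "shortest path to the root" definition can be illustrated without spaCy. The sketch below is only an illustration of that definition, not spaCy's implementation: the `heads` list is written by hand (an assumption consistent with the output above, where `sister` and `dog` both attach to `has`), and a token that is its own head is treated as the sentence root.

```python
def depth(i, heads):
    """Count the arcs from token i up to the sentence root (the root is its own head)."""
    steps = 0
    while heads[i] != i:
        i = heads[i]
        steps += 1
    return steps


def span_root(start, end, heads):
    """Return the index of the token in [start, end) that is closest to the sentence root."""
    return min(range(start, end), key=lambda i: depth(i, heads))


# Hand-written dependency heads for the first sentence (hypothetical but plausible parse):
# My -> sister, sister -> has, has -> has (root), a -> dog, dog -> has, . -> has
tokens = ["My", "sister", "has", "a", "dog", "."]
heads = [1, 2, 2, 4, 2, 2]

for start, end in [(0, 2), (3, 5)]:
    root = span_root(start, end, heads)
    print(" ".join(tokens[start:end]), "->", tokens[root])
# My sister -> sister
# a dog -> dog
```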



Now test the limits of `neuralcoref`. How does it fare on:

- Winograd schema challenge questions?
- long documents?
- near-identity?

Importantly, note that `neuralcoref` only marks coref chains that involve **two or more** mentions.  Singleton chains (involving only one mention) won't appear at all.

In [6]:
def print_coref_chains(text):
    doc = nlp(text)
    for chain in doc._.coref_clusters:
        for mention in chain.mentions:
            print(mention.text, mention.start, mention.end)
        print()

In [7]:
print_coref_chains("The trophy would not fit in the brown suitcase because it was too big")

The trophy 0 2
it 10 11
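
Note that "the brown suitcase" is missing from this output, which is the singleton behavior described earlier. As a rough sketch, one way to list such unchained mentions is to compare a set of candidate mentions against the chains. The candidate offsets below are listed by hand (an assumption; in practice they might come from something like noun chunks), and `find_singletons` is a hypothetical helper, not part of `neuralcoref`.

```python
def find_singletons(candidates, chains):
    """Return the candidate (start, end) mentions that appear in no coreference chain."""
    chained = {mention for chain in chains for mention in chain}
    return [c for c in candidates if c not in chained]


tokens = "The trophy would not fit in the brown suitcase because it was too big".split()

# Hand-listed candidate mentions: "The trophy", "the brown suitcase", "it".
candidates = [(0, 2), (6, 9), (10, 11)]
# The single chain reported in the output above.
chains = [[(0, 2), (10, 11)]]

for start, end in find_singletons(candidates, chains):
    print(" ".join(tokens[start:end]), start, end)
# the brown suitcase 6 9
```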



In [8]:
print_coref_chains("The town councilors refused to give the man a permit because they feared violence.")

The town councilors 0 3
they 11 12



In [9]:
print_coref_chains("The town councilors refused to give the demonstrators a permit because they advocated violence.")

The town councilors 0 3
they 11 12
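
To keep track of how the system does on these Winograd-style sentences, the three results above can be tallied against the intended antecedents. This is a small sketch: the `predicted` values are copied from the outputs above, while the `gold` values are the standard answers for these schema sentences, entered by hand. The function name `score` is hypothetical.

```python
# Predicted antecedents come from the notebook outputs above;
# gold antecedents are hand-entered answers for each sentence.
examples = [
    {"pronoun": "it", "predicted": "The trophy", "gold": "The trophy"},
    {"pronoun": "they", "predicted": "The town councilors", "gold": "The town councilors"},
    {"pronoun": "they", "predicted": "The town councilors", "gold": "the demonstrators"},
]


def score(examples):
    """Return (number correct, total), comparing antecedents case-insensitively."""
    correct = sum(ex["predicted"].lower() == ex["gold"].lower() for ex in examples)
    return correct, len(examples)


correct, total = score(examples)
print("%d/%d correct" % (correct, total))
# 2/3 correct
```

The one miss is the "advocated violence" sentence, where the pronoun is linked to the same antecedent as in the "feared violence" variant.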

