## <span style="color:purple"> Information extraction: pronominal coreference resolution</span>

Pronominal coreference resolution aims to automatically find the correct referents of pronouns.
EstNLTK includes the Estonian Coreference Resolution System, which was introduced by [Barbu et al. (2020)](https://ebooks.iospress.nl/pdf/doi/10.3233/FAIA200595) and which detects coreference of personal pronouns ("mina", "sina", "tema"), relative pronouns "kes" and "mis", and the demonstrative pronoun "see".

The source code of the original coreference resolution system, along with the training/testing setup, is available in the [EstonianCoreferenceSystem repository](https://github.com/SoimulPatriei/EstonianCoreferenceSystem).

### Running as a web tagger

The easiest way to use the coreference tagger is via EstNLTK's web service:

In [1]:
from estnltk import Text
from estnltk.web_taggers import CoreferenceV1WebTagger

coref_web_tagger = CoreferenceV1WebTagger(url='http://127.0.0.1:5000/estnltk/tagger/coreference_v1')
coref_web_tagger

| name | output layer | output span names | output attributes | input layers |
|---|---|---|---|---|
| CoreferenceV1WebTagger | coreference_v1 | ('pronoun', 'mention') | () | () |

| parameter | value |
|---|---|
| url | http://127.0.0.1:5000/estnltk/tagger/coreference_v1 |
| batch_layer | |
| batch_max_size | 175000 |
| batch_enveloping_layer | |


Usage example:

In [2]:
text = Text('Ahto ütles, et tema ei tegele rahadega. Jah, ta tegeleb hoopis suurte plaanidega. Proovib vähendada.')
coref_web_tagger.tag( text )
text['coreference_v1']

| layer name | span_names | attributes | ambiguous | relation count |
|---|---|---|---|---|
| coreference_v1 | pronoun, mention | | False | 2 |

| pronoun | mention |
|---|---|
| tema | Ahto |
| ta | Ahto |


The output layer contains two types of named spans:
* _pronoun_ -- a pronoun from the set {"mina", "sina", "tema", "kes", "mis", "see"};
* _mention_ -- the antecedent: another pronoun, a noun or a proper noun.

### Local installation

*To use the coreference resolver locally, you need to install the following additional packages: [estnltk_neural](https://github.com/estnltk/estnltk/tree/main/estnltk_neural), [scikit-learn](https://scikit-learn.org/stable/install.html), [gensim](https://radimrehurek.com/gensim/), [xgboost](https://pypi.org/project/xgboost). You also need stanza and stanza's Estonian models, but these will be installed automatically once the other requirements are fulfilled.*

`estnltk_neural` provides CoreferenceTagger for detecting pronoun-mention coreference pairs in text. The model and configuration files required by the tagger need to be downloaded separately. There are two ways to download the required resources:

   * If you create a new instance of CoreferenceTagger and the resources have not been downloaded yet, you will be asked for permission to download them;
   * Alternatively, you can pre-download the resources manually with the `download` function:

```python
from estnltk import download
download('coreference_v1')
```

In [3]:
from estnltk import Text
from estnltk_neural.taggers import CoreferenceTagger

In [4]:
coref_tagger = CoreferenceTagger()

INFO:coreference_api.py:67: test::Initializing resources
INFO:coreference_api.py:69: test::Read Resource Catalog from=>C:\Programmid\Miniconda3\envs\py38_est_coref\lib\site-packages\estnltk-1.7.2-py3.8-win-amd64.egg\estnltk\estnltk_resources\coreference\model_2021-01-04\estonian_configuration_files\estonian_catalog.xml
INFO:coreference_api.py:72: test::Read the global mention scores from=>C:\Programmid\Miniconda3\envs\py38_est_coref\lib\site-packages\estnltk-1.7.2-py3.8-win-amd64.egg\estnltk\estnltk_resources\coreference\model_2021-01-04\estonian_resources/estonian_global_mention_scores/estonian_mentions_score.txt
INFO:coreference_api.py:74: test::Read Eleri Aedmaa abstractness scores from=> C:\Programmid\Miniconda3\envs\py38_est_coref\lib\site-packages\estnltk-1.7.2-py3.8-win-amd64.egg\estnltk\estnltk_resources\coreference\model_2021-01-04\estonian_resources/estonian_abstractness_lexicon/abstractness_ET.txt
INFO:keyedvectors.py:2047: loading projection weights from C:\Programmid\Minic

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json:   0%|   …

INFO:core.py:259: Loading these models for language: et (Estonian):
| Processor | Package |
-----------------------
| tokenize  | edt     |
| pos       | edt     |
| lemma     | edt     |
| depparse  | edt     |

INFO:core.py:278: Using device: cpu
INFO:core.py:284: Loading: tokenize
INFO:core.py:284: Loading: pos
INFO:core.py:284: Loading: lemma
INFO:core.py:284: Loading: depparse
INFO:core.py:336: Done loading processors!
INFO:coreference_api.py:107: test::Initialized stanza nlp pipeline


CoreferenceTagger relies on stanza's Estonian models to preprocess the input text and does not depend on any EstNLTK layers:

In [5]:
coref_tagger

| name | output layer | output span names | output attributes | input layers |
|---|---|---|---|---|
| CoreferenceTagger | coreference | ('pronoun', 'mention') | ('chain_id',) | () |

| parameter | value |
|---|---|
| add_chain_ids | True |
| stanza_nlp | `<Pipeline: tokenize=TokenizeProcessor(C:\Users\soras\stanza_resources\et\tokeniz ..., type: <class 'stanza.pipeline.core.Pipeline'>` |
| coref_model | `Pipeline(steps=[('t',\n ColumnTransformer(remainder='passthrough' ..., type: <class 'sklearn.pipeline.Pipeline'>, length: 2` |


If you have downloaded _stanza's_ Estonian models manually and placed them in a non-default location, you can pass the path to the models directory via the constructor parameter `stanza_models_dir`:

```python
coref_tagger = CoreferenceTagger(stanza_models_dir = ...)
```
This avoids re-downloading _stanza's_ models.

Usage example:

In [6]:
text = Text('''Mina ei tagane sammugi, põrutas kapten Silver Üksjalg meestele. Aga teda ei kuulatud.''')
coref_tagger.tag( text )
text['coreference']

| layer name | span_names | attributes | ambiguous | relation count |
|---|---|---|---|---|
| coreference | pronoun, mention | chain_id | False | 2 |

| pronoun | mention | chain_id |
|---|---|---|
| Mina | Silver | 0 |
| teda | Silver | 0 |


In the output layer:
* _pronoun_ -- a pronoun from the set {"mina", "sina", "tema", "kes", "mis", "see"};
* _mention_ -- the antecedent: another pronoun, a noun or a proper noun;
* _chain_id_ -- identifier of the coreference chain; pairs that share a member belong to the same chain.
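
To illustrate how `chain_id` ties pairs together, here is a minimal sketch that groups the rows of the output table by chain. It works on plain tuples copied from the table rather than on the EstNLTK layer itself, and the helper name `group_by_chain` is hypothetical, not part of EstNLTK. It also compares members by surface text only, which is a simplification: the real layer distinguishes spans by their position in the text.

```python
from collections import defaultdict

# Rows copied from the output table: (pronoun, mention, chain_id).
rows = [
    ("Mina", "Silver", 0),
    ("teda", "Silver", 0),
]

def group_by_chain(rows):
    """Collect the distinct members of every chain, in order of first appearance.

    Hypothetical helper for illustration; members are compared by text only.
    """
    chains = defaultdict(list)
    for pronoun, mention, chain_id in rows:
        for member in (mention, pronoun):
            if member not in chains[chain_id]:
                chains[chain_id].append(member)
    return dict(chains)

print(group_by_chain(rows))  # {0: ['Silver', 'Mina', 'teda']}
```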

#### Assigning coreference chain IDs

By default, CoreferenceTagger assigns a `chain_id` to every pronoun-mention pair, so that all coreference relations that share a common mention receive the same `chain_id`.
You can switch off the `chain_id` mark-up via the flag `add_chain_ids`:
```python
coref_tagger = CoreferenceTagger(add_chain_ids = False)
```
Note that the coreference chain mark-up is very basic and does not go beyond the relations detected by the tagger.
So, even if two mentions are identical, they end up in different chains unless they are connected via a chain of pronoun-mention relations.
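
The grouping rule described above can be sketched with a small union-find over the members of the pairs. This is only an illustration under stated assumptions, not CoreferenceTagger's actual implementation: the function `assign_chain_ids` is hypothetical, and each member is represented as a `(text, token_index)` tuple with made-up token indices so that two occurrences of the same word stay distinct.

```python
def assign_chain_ids(pairs):
    """Give each (pronoun, mention) pair a chain id; pairs sharing a member share an id.

    Hypothetical sketch, not EstNLTK's implementation.
    """
    parent = {}

    def find(x):
        # Return the representative of x's group, registering x on first sight.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Merge the groups of the two members of every pair.
    for pronoun, mention in pairs:
        parent[find(pronoun)] = find(mention)

    # Number the groups in order of first appearance.
    chain_ids = {}
    result = []
    for pronoun, mention in pairs:
        root = find(mention)
        chain_id = chain_ids.setdefault(root, len(chain_ids))
        result.append((pronoun, mention, chain_id))
    return result

# Members are (text, token_index); the indices are illustrative only.
pairs = [
    (("Mina", 0), ("Silver", 6)),
    (("teda", 10), ("Silver", 6)),
    (("ta", 20), ("Silver", 15)),  # a second, unconnected occurrence of "Silver"
]
for pronoun, mention, chain_id in assign_chain_ids(pairs):
    print(pronoun[0], mention[0], chain_id)
```

The last pair ends up in its own chain even though its mention has the same text as the others, which mirrors the limitation noted above.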

#### Extending mentions with named entity information

You can pass a layer of named entity annotations to CoreferenceTagger as an input layer via the parameter `ner_layer`.
CoreferenceTagger then expands each detected mention to the full extent of a named entity phrase whenever the mention overlaps with one.
Example:


In [7]:
# Prepare text with ner layer
text = Text('''Mina ei tagane sammugi, põrutas kapten Silver Üksjalg meestele. Aga teda ei kuulatud.''')
text.tag_layer('ner')
text['ner']

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| ner | nertag | | words | False | 1 |

| text | nertag |
|---|---|
| ['Silver', 'Üksjalg'] | PER |


In [8]:
# Disable CoreferenceTagger's init logging
import logging
logging.disable(logging.INFO)

# Make CoreferenceTagger aware of ner layer
coref_tagger = CoreferenceTagger(ner_layer='ner', logger=logging)

# Detect coreference
coref_tagger.tag( text )
text['coreference']

| layer name | span_names | attributes | ambiguous | relation count |
|---|---|---|---|---|
| coreference | pronoun, mention | chain_id | False | 2 |

| pronoun | mention | chain_id |
|---|---|---|
| Mina | Silver Üksjalg | 0 |
| teda | Silver Üksjalg | 0 |
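
The effect seen in this example, where the mention "Silver" grows into "Silver Üksjalg", can be approximated with a simple overlap test on character spans. The sketch below is an assumption about the general idea, not the tagger's actual code: `expand_mention` is a hypothetical helper, spans are half-open `(start, end)` character offsets, and a mention is replaced by the first named entity span it overlaps.

```python
def expand_mention(mention, ner_spans):
    """Return the first named entity span overlapping `mention`, else `mention` itself.

    Hypothetical sketch; spans are half-open (start, end) character offsets.
    """
    m_start, m_end = mention
    for ne_start, ne_end in ner_spans:
        # Two half-open intervals overlap when each starts before the other ends.
        if m_start < ne_end and ne_start < m_end:
            return (ne_start, ne_end)
    return mention

text = 'Mina ei tagane sammugi, põrutas kapten Silver Üksjalg meestele. Aga teda ei kuulatud.'

# Locate the spans by searching the text, so that no offsets are hard-coded.
start = text.index('Silver')
mention = (start, start + len('Silver'))
ner_spans = [(start, start + len('Silver Üksjalg'))]

new_start, new_end = expand_mention(mention, ner_spans)
print(text[new_start:new_end])  # Silver Üksjalg
```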


---