<img align="right" src="images/tf-small.png" width="90"/>
<img align="right" src="images/etcbc.png" width="100"/>


# Analyse Coreference Annotations and Participants

## 1. Introduction 

The *brat* coreference annotations that have been converted with the NB [`corefMake.ipynb`](corefMake.ipynb) can be analysed with the code in this NB. Several functions from `parse_ann` support the analysis of coreference annotations with a view to participant analysis. First, a recap of some definitions that have already been given in the NB [`iaa-analysis.ipynb`](iaa-analysis.ipynb):

Coreference resolution is the task of finding all expressions that refer to the same entity in a text. 
* A referring expression is called a **mention**. 
* An entity can be called a coreference **class**, or **C**. A **class** is a set that contains two or more mentions that refer to the same entity. A **class** can therefore be considered a discourse entity. In the code the **class** is often referred to as `coref`.
* A mention that refers to an entity that no other mention refers to is called a **singleton**, or **S**. A **singleton** is a set that contains one **mention**. The term **singleton** set is also used for the set of all singletons of one text.
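
The definitions above can be illustrated with a minimal sketch. The input pairs and the helper `split_classes_and_singletons` are hypothetical and are not part of `parse_ann`; the entity labels merely stand in for an annotator's coreference links.

```python
from collections import defaultdict

# Hypothetical toy input: (mention, entity label) pairs in text order.
annotated = [
    ("God", "deity"),
    ("he", "deity"),
    ("him", "deity"),
    ("the wicked", "wicked"),
    ("they", "wicked"),
    ("a cup", "cup"),
]


def split_classes_and_singletons(pairs):
    """Group mentions by entity: two or more mentions form a class, one a singleton."""
    by_entity = defaultdict(list)
    for mention, entity in pairs:
        by_entity[entity].append(mention)
    classes = {e: ms for e, ms in by_entity.items() if len(ms) >= 2}
    singletons = [ms[0] for ms in by_entity.values() if len(ms) == 1]
    return classes, singletons


classes, singletons = split_classes_and_singletons(annotated)
print(classes)     # entities with two or more mentions
print(singletons)  # mentions that stand alone
```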

Analysing coreference consists of different tasks: one can analyse mentions, classes and singletons. Since annotation is error-prone, it is also possible to retrieve and inspect possible annotation mistakes. Lastly, descriptive statistics can be generated on the mention types, the pronouns, and the pronouns broken down by mention type that are contained in classes and singletons. The following functions and data types enable the described analysis tasks:

* The function `TexFabricParse()` takes three arguments: `my_book_name`, `from_chapter` and `to_chapter`. For `my_book_name` the Bible book is specified; for `from_chapter` the desired starting chapter; for `to_chapter` the desired end chapter. 
* `TexFabricParse()` returns: 
    - a list of `Mention` objects, in variable `mentions`;
    - a dictionary of `Coref` objects, in variable `corefs`, whose keys are the class numbers in the form `text number:class number` (e.g. `1:2` is Psalm 1, class number 2); the singletons are stored under key `0`;
    - possible suffix errors in variable `suffix_errors` that may have been created during the annotation process. 
* The function `EnrichMentions()` reconstructs the phrase type of each mention, based on the phrase atoms that come with the ETCBC data. A mention can be anything from a suffix to a full phrase. Since the coreference and mention data are new to the ETCBC data, although derived from it, the phrase type of a mention needs to be reconstructed. This new data type is called `rpt`. As noted, annotation errors are inevitable, and such errors can cause `EnrichMentions()` to reconstruct an unusual, and possibly wrong, phrase type. The variable `reconsider_rpt` points to these possibly erroneous annotations, so that the annotator can reconsider the `rpt`.
* Coreference classes are not yet explicitly identified entities. The function `IdentifyEntities()` does that. It uses a predefined order of `rpt`s (1. 'PrNP', 2. 'NP', 3. 'PtcP', 4. 'VP', 5. 'PPrP', 6. 'DPrP') and identifies each coreference class according to the `rpt` that ranks highest in that order.
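
The `rpt` order used for identification can be sketched as follows. This is not the code of `IdentifyEntities()`: the helper `identify_entity`, the `(text, rpt)` tuple shape, the tie-breaking rule (earliest mention wins) and the `rpt` values assigned to the example mentions are all assumptions made for illustration.

```python
# Precedence of reconstructed phrase types (rpt), as listed in the text.
RPT_ORDER = ["PrNP", "NP", "PtcP", "VP", "PPrP", "DPrP"]


def identify_entity(mentions):
    """Return the text of the mention whose rpt ranks highest, or None.

    `mentions` is a list of (text, rpt) tuples in chain order (assumed shape).
    Mentions with an rpt outside RPT_ORDER are ignored; ties go to the
    earliest mention (an assumption, not confirmed by the source).
    """
    rank = {rpt: i for i, rpt in enumerate(RPT_ORDER)}
    candidates = [
        (rank[rpt], position, text)
        for position, (text, rpt) in enumerate(mentions)
        if rpt in rank
    ]
    if not candidates:
        return None
    return min(candidates)[2]


# Hypothetical chain; the rpt values are assumed for illustration.
chain = [("NMGJM", "VP"), (">RY", "NP"), ("H", "suffix")]
print(identify_entity(chain))
```

In this sketch the 'NP' mention outranks the 'VP' mention that starts the chain, so the class would be identified by the second mention.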

Accordingly, three groups of functions do the following for any specified text and chapter number(s):

1. Print functions
    * To use any (or all) of the three functions, uncomment the corresponding lines in the cell below.
    * `PrintSurvey(corefs)`: prints an overview of the text and class (e.g. `C119:1`), who or what the class is identified as, the mention that starts the chain and the type of that mention. Below the demarcation line the mentions that are linked to that class are printed. The singletons are printed in one list at the end of the survey.
    * `PrintPatternsAndNotes(corefs, suffix_errors)`: prints a more detailed overview than `PrintSurvey()`. It first prints the text and class (e.g. `C119:1`), who or what the class is identified as, the mention that starts the chain and the type of that mention. Below the demarcation line the mentions that are linked to that class are printed. For each mention, whether in a class or among the singletons, its type is printed, together with its person, gender and number information if it carries any. If an annotator note has been made on a mention, the note is printed after the mention. To complete the overview, potentially erroneous annotations are flagged after the mention with: `!CORRUPT ANN'`. The singletons are printed in one list at the end of the survey.
    * `PrintCorefID(corefs)`: prints the same information as `PrintSurvey()` but sorts the classes in alefbetical order. 


2. Correct annotation errors
    * `PrintPossibleCorrections(suffix_errors, reconsider_rpt)`: prints only the potentially erroneous suffixes and reconstructed phrase types. For the `suffix_errors` the print order is: node, text, lexeme, brat identifier. For `rpt` the print order is: text position, start word node, end word node, lexeme, pdp, (phrase atom nodes), lexeme(s) and type.


3. Generate statistics
    * `MakePandasTables(corefs, mentions)`: generates Pandas tables with descriptive statistics on which and how many mention types, pronouns, and pronouns broken down by mention type are contained in classes and singletons.
    * The overall dataframe `overall_df` gives a general overview. From there the tables give more specific information. The part-of-speech dataframe `pos_df` counts the mention types in classes and singletons. It also counts which mention types start a coreference chain. The pronoun dataframe `pronoun_df` counts the pronouns in classes and singletons. The dataframe `pronoun_pos_class_df` breaks the pronouns down by part of speech, for classes only. `pronoun_pos_sing_df` gives the same kind of counts, but only for singletons.
    * All these Pandas dataframes are printed with the function `PrintThisTable()` which takes as argument a specific dataframe. 
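
The kind of counting behind such tables can be sketched with the standard library alone. The records and the helper `tally` are hypothetical and do not reproduce `MakePandasTables()`; rounding percentages to whole numbers is an assumption.

```python
from collections import Counter

# Hypothetical records: (mention type, is in a class, is first in its chain).
records = [
    ("NP", True, True),
    ("VP", True, False),
    ("suffix", True, False),
    ("NP", False, False),
    ("VP", True, True),
    ("PPrP", True, False),
]


def tally(records):
    """Count mention types per row label of a pos_df-like table."""
    in_class = Counter(t for t, c, _ in records if c)
    singleton = Counter(t for t, c, _ in records if not c)
    first_in_chain = Counter(t for t, c, f in records if c and f)
    total = in_class + singleton
    grand_total = sum(total.values())
    # Percentages rounded to whole numbers (assumed convention).
    percent = {t: round(100 * n / grand_total) for t, n in total.items()}
    return {
        "in class": in_class,
        "singleton": singleton,
        "total": total,
        "% total": percent,
        "first in chain": first_in_chain,
    }


table = tally(records)
for row, counts in table.items():
    print(row, dict(counts))
```

Each value in `table` maps a mention type to a count, so the dictionary could be handed to a dataframe constructor if a tabular view is wanted.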
    
The functions described above can be called repeatedly, which makes it possible to study multiple texts at the same time. To do this, assign the returned values to text-specific variables, for example:

* `mentions_psalms, corefs_psalms, suffix_errors_psalms = TexFabricParse(my_book_name, from_chapter, to_chapter)`
* `reconsider_rpt_psalms = EnrichMentions(mentions_psalms)`
* `IdentifyEntities(corefs_psalms)`

## 2. Load parser

In [2]:
%load_ext autoreload
%autoreload 2

from parse_ann import *

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## 3. Parse and enrich coreference data

In [11]:
my_book_name = 'Psalms'
from_chapter = 75
to_chapter = 75

In [12]:
mentions, corefs, suffix_errors = TexFabricParse(my_book_name, from_chapter, to_chapter)
reconsider_rpt = EnrichMentions(mentions)
IdentifyEntities(corefs)
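
As described in the introduction, `corefs` keeps the classes under `text number:class number` keys and the singletons under key `0`. The following sketch separates the two and sorts the class keys numerically. The dictionary `corefs_demo` is a stand-in with plain, shortened lists instead of `Coref` objects, and the helper `split_corefs` is hypothetical; that key `0` is an integer is also an assumption.

```python
# Stand-in for `corefs`: plain lists instead of Coref objects (assumed shape).
corefs_demo = {
    "75:1": ["HWDJNW", "HWDJNW"],
    "75:5": ["K", ">LHJM"],
    "75:2": ["NMGJM", ">RY"],
    0: ["placeholder singleton"],
}


def split_corefs(corefs):
    """Return the class keys in numeric order and the singletons stored under key 0."""
    singletons = corefs.get(0, [])

    def sort_key(key):
        text_number, class_number = key.split(":")
        return int(text_number), int(class_number)

    class_keys = sorted((k for k in corefs if k != 0), key=sort_key)
    return class_keys, singletons


class_keys, singletons = split_corefs(corefs_demo)
print(class_keys)
print(singletons)
```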

## 4. Print Functions

In [13]:
PrintSurvey(corefs)
#PrintPatternsAndNotes(corefs, suffix_errors)
#PrintCorefID(corefs) 

C75:1 Who/what: HWDJNW / first: HWDJNW, type: VP
----------------------------------------------------------------------
[HWDJNW, HWDJNW] 

C75:5 Who/what: >LHJM / first: K, type: PPrP
----------------------------------------------------------------------
[K, >LHJM, CM, K, NPL>WTJ, K, >QX, >NJ, >CPV, >NKJ, TKNTJ, >MRTJ, >LHJM, CPV, JCPJL, JRJM, JD&JHWH, JMYW, >LHJ J<QB, >GD<] 

C75:2 Who/what: >RY / first: NMGJM, type: VP
----------------------------------------------------------------------
[NMGJM, >RY, KL&JCBJ, H, <MWDJ, H] 

C75:3 Who/what: HWLLJM / first: HWLLJM, type: NP
----------------------------------------------------------------------
[HWLLJM, THLW] 

C75:8 Who/what: RC<JM / first: RC<JM, type: NP
----------------------------------------------------------------------
[RC<JM, TRJMW, JCTW, KL RC<J&>RY, KL&QRNJ RC<JM] 

C75:4 Who/what: QRNK / first: TRJMW, type: VP
----------------------------------------------------------------------
[TRJMW, QRNK, M, TDBRW] 

C75:6 Who/what: KW
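
If the survey needs to be processed further, its header lines can be parsed with a regular expression. The pattern below is inferred only from the output shown above and may not cover every line that `PrintSurvey()` can produce; the helper `parse_header` is hypothetical.

```python
import re

# Pattern inferred from header lines such as
# "C75:5 Who/what: >LHJM / first: K, type: PPrP".
HEADER = re.compile(
    r"^C(?P<text>\d+):(?P<cls>\d+) Who/what: (?P<entity>.+?)"
    r" / first: (?P<first>.+?), type: (?P<rpt>\S+)$"
)


def parse_header(line):
    """Return the fields of a survey header line, or None if it does not match."""
    match = HEADER.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["text"] = int(fields["text"])
    fields["cls"] = int(fields["cls"])
    return fields


print(parse_header("C75:5 Who/what: >LHJM / first: K, type: PPrP"))
print(parse_header("C75:6 Who/what: KW"))  # incomplete line: no match
```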

## 5. Print Possible Corrections and Errors

In [None]:
PrintPossibleCorrections(suffix_errors, reconsider_rpt)

## 6. Generate Statistics

In [None]:
overall_df, pos_df, \
pronoun_df, pronoun_pos_class_df, \
pronoun_pos_sing_df = MakePandasTables(corefs, mentions)

In [None]:
PrintThisTable(overall_df)

In [None]:
PrintThisTable(pos_df)

In [None]:
PrintThisTable(pronoun_df)

In [None]:
PrintThisTable(pronoun_pos_class_df)

In [None]:
PrintThisTable(pronoun_pos_sing_df)

## 7. Analyse other text(s)

In [3]:
mentions_nu, corefs_nu, errors_nu = TexFabricParse('Numbers', 1, 36)
reconsider_nu = EnrichMentions(mentions_nu)
IdentifyEntities(corefs_nu)

In [None]:
PrintSurvey(corefs_nu)
#PrintPatternsAndNotes(corefs_nu, errors_nu)
#PrintCorefID(corefs_nu) 

In [None]:
PrintPossibleCorrections(errors_nu, reconsider_nu)

In [4]:
overall_df_nu, pos_df_nu, \
pronoun_df_nu, pronoun_pos_class_df_nu, \
pronoun_pos_sing_df_nu = MakePandasTables(corefs_nu, mentions_nu)

In [5]:
PrintThisTable(overall_df_nu)

| | total |
|---|---:|
| mentions | 11647 |
| singletons | 3265 |
| notes | 2995 |
| classes | 1548 |


In [6]:
PrintThisTable(pos_df_nu)

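| | NP | VP | PrNP | DPrP | PPrP | AdvP | IPrP | CP | prep | PP | conj | suffix | InrP | AdjP | art | PtcP | total_type |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| in class | 2633 | 2751 | 973 | 93 | 152 | 12 | 5 | 9 | 7 | 3 | 1 | 1738 | 2 | 1 | 1 | 1 | 8382 |
| singleton | 2627 | 189 | 233 | 29 | 3 | 53 | 0 | 3 | 14 | 4 | 1 | 105 | 0 | 2 | 2 | 0 | 3265 |
| total | 5260 | 2940 | 1206 | 122 | 155 | 65 | 5 | 12 | 21 | 7 | 2 | 1843 | 2 | 3 | 3 | 1 | 11647 |
| % total | 45 | 25 | 10 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 16 | 0 | 0 | 0 | 0 | 100 |
| first in chain | 863 | 328 | 210 | 66 | 62 | 5 | 5 | 4 | 3 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1548 |
| % chain | 56 | 21 | 14 | 4 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 100 |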
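
The derived rows of the table above can be recomputed from its count rows as a consistency check. The sketch copies the figures from the table; that percentages are rounded to whole numbers is an assumption, which the check below bears out for this table.

```python
# Figures copied from the part-of-speech table for Numbers above.
columns = ["NP", "VP", "PrNP", "DPrP", "PPrP", "AdvP", "IPrP", "CP",
           "prep", "PP", "conj", "suffix", "InrP", "AdjP", "art", "PtcP"]
in_class = [2633, 2751, 973, 93, 152, 12, 5, 9, 7, 3, 1, 1738, 2, 1, 1, 1]
singleton = [2627, 189, 233, 29, 3, 53, 0, 3, 14, 4, 1, 105, 0, 2, 2, 0]
first_in_chain = [863, 328, 210, 66, 62, 5, 5, 4, 3, 1, 1, 0, 0, 0, 0, 0]
reported_percent = [45, 25, 10, 1, 1, 1, 0, 0, 0, 0, 0, 16, 0, 0, 0, 0]

# "total" row: mentions in classes plus singletons, per mention type.
total = [c + s for c, s in zip(in_class, singleton)]
grand_total = sum(total)

# "% total" row: share of each mention type, rounded to whole percent.
percent = [round(100 * n / grand_total) for n in total]

print(grand_total)                  # number of mentions
print(sum(first_in_chain))          # number of classes
print(percent == reported_percent)  # recomputed row agrees with the table
```

The two sums correspond to the `mentions` and `classes` rows of the overall table: every mention is either in a class or a singleton, and every class has exactly one first mention.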


In [7]:
PrintThisTable(pronoun_df_nu)

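| | p1upl | p1usg | p2fpl | p2fsg | p2mpl | p2msg | p3fpl | p3fsg | p3mpl | p3msg | p3upl | ufpl | ufsg | umpl | umsg | usg | uuu | total_pgn |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| in class | 164 | 267 | 0 | 7 | 397 | 410 | 28 | 337 | 777 | 1537 | 182 | 2 | 44 | 59 | 82 | 0 | 245 | 4538 |
| singleton | 5 | 4 | 2 | 0 | 11 | 11 | 0 | 7 | 50 | 65 | 7 | 0 | 6 | 9 | 11 | 3 | 67 | 258 |
| total | 169 | 271 | 2 | 7 | 408 | 421 | 28 | 344 | 827 | 1602 | 189 | 2 | 50 | 68 | 93 | 3 | 312 | 4796 |
| % total | 4 | 6 | 0 | 0 | 9 | 9 | 1 | 7 | 17 | 33 | 4 | 0 | 1 | 1 | 2 | 0 | 7 | 100 |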


In [9]:
PrintThisTable(pronoun_pos_class_df_nu)

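| | p1upl | p1usg | p2fsg | p2mpl | p2msg | p3fpl | p3fsg | p3mpl | p3msg | p3upl | ufpl | ufsg | umpl | umsg | uuu | total_pgn |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| VP | 62 | 122 | 6 | 206 | 232 | 12 | 129 | 376 | 991 | 182 | 2 | 44 | 59 | 82 | 245 | 2750 |
| suffix | 96 | 117 | 0 | 173 | 159 | 14 | 188 | 392 | 497 | 0 | 0 | 0 | 0 | 0 | 0 | 1636 |
| PPrP | 6 | 28 | 1 | 18 | 19 | 2 | 20 | 9 | 49 | 0 | 0 | 0 | 0 | 0 | 0 | 152 |
| total | 164 | 267 | 7 | 397 | 410 | 28 | 337 | 777 | 1537 | 182 | 2 | 44 | 59 | 82 | 245 | 4538 |
| % total | 4 | 6 | 0 | 9 | 9 | 1 | 7 | 17 | 34 | 4 | 0 | 1 | 1 | 2 | 5 | 100 |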


In [10]:
PrintThisTable(pronoun_pos_sing_df_nu)

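| | p1upl | p1usg | p2fpl | p2mpl | p2msg | p3fsg | p3mpl | p3msg | p3upl | ufsg | umpl | umsg | usg | uuu | total_pgn |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| VP | 1 | 2 | 0 | 10 | 6 | 2 | 10 | 53 | 7 | 6 | 9 | 11 | 3 | 67 | 187 |
| suffix | 4 | 2 | 2 | 1 | 2 | 5 | 40 | 12 | 0 | 0 | 0 | 0 | 0 | 0 | 68 |
| PPrP | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| total | 5 | 4 | 2 | 11 | 11 | 7 | 50 | 65 | 7 | 6 | 9 | 11 | 3 | 67 | 258 |
| % total | 2 | 2 | 1 | 4 | 4 | 3 | 19 | 25 | 3 | 2 | 3 | 4 | 1 | 26 | 100 |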
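
The column labels of the pronoun tables combine person, gender and number (`total_pgn`). A label such as `p3msg` can be split into its parts as sketched below. The pattern and the helper `decode_pgn` are hypothetical, and reading `u` as "unknown" is an assumption; labels that do not fit the three-part pattern, such as `usg`, are left undecoded rather than guessed at.

```python
import re

# Assumed label layout: person (p1, p2, p3 or u), gender (m, f or u),
# number (sg, pl or u). Only values seen in the tables above are covered.
PGN = re.compile(r"^(p[123]|u)([mfu])(sg|pl|u)$")


def decode_pgn(label):
    """Split a person-gender-number label into its parts, or return None."""
    match = PGN.match(label)
    if match is None:
        return None
    person, gender, number = match.groups()
    return {"person": person, "gender": gender, "number": number}


for label in ["p3msg", "p1upl", "ufpl", "uuu", "usg"]:
    print(label, decode_pgn(label))
```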
