<img align="right" src="images/tf-small.png" width="90"/>
<img align="right" src="images/etcbc.png" width="100"/>


# Analyse Coreference Annotations and Participants

## 1. Introduction 

The *brat* coreference annotations that have been converted with the notebook [`corefMake.ipynb`](corefMake.ipynb) can be analysed with the code in this notebook. The functions in `analyse.py` and `search.py` help to analyse coreference annotations as participant data. The way participant data is produced and analysed in the 'Who is Who' project is unprecedented. The functions, and the algorithms behind them, are therefore experimental and intended for further development and enhancement.

To recap some definitions that have already been given in the notebook [`iaa-analysis.ipynb`](iaa-analysis.ipynb):

Coreference resolution is the task of finding all expressions that refer to the same entity in a text. 
* A referring expression is called a **mention**. 
* An entity can be called a coreference **class**, or **C**. A **class** is a set that contains two or more mentions that refer to the same entity. A **class** can therefore be considered a discourse entity. In the code the **class** is often referred to as `coref`.
* A mention that refers to an entity that no other mention refers to is called a **singleton**, or **S**. A **singleton** is a set that contains one **mention**. A **singleton** set can also contain all singletons from one text. 
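
The definitions above can be made concrete with a small sketch. It is not the data model used in `analyse.py`: the `(mention, entity_id)` pairs, the toy input and the function name `group_mentions` are hypothetical and only illustrate how mentions fall into classes and singletons.

```python
from collections import defaultdict


def group_mentions(annotated):
    """Split (mention, entity_id) pairs into classes and singletons.

    A class holds two or more mentions of one entity; a mention whose
    entity has no other mention is a singleton.
    """
    by_entity = defaultdict(list)
    for mention, entity_id in annotated:
        by_entity[entity_id].append(mention)

    classes = {e: ms for e, ms in by_entity.items() if len(ms) >= 2}
    singletons = [ms[0] for ms in by_entity.values() if len(ms) == 1]
    return classes, singletons


# Toy input (assumption): transliterated mentions with made-up entity ids.
annotated = [(">JC", 1), ("HLK", 1), ("<MD", 1), ("MY", 2), ("W", 2), ("JWMM", 3)]
classes, singletons = group_mentions(annotated)
print(classes)
print(singletons)
```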

### Analysis 
Analysing coreference consists of different tasks: one can analyse mentions, classes and singletons. Since any annotation task involves mistakes, it is also possible to retrieve and inspect possible annotation mistakes. Lastly, descriptive statistics can be generated on the mention types, the pronouns, and the pronouns broken down by mention type that occur in classes and singletons. The functions contained in `analyse.py` enable this first data exploration of the coreference annotations. The following functions and the various data they produce enable the described analysis tasks: 

* The main function `ParseAnnotations()` takes three arguments, `my_book_name`, `from_chapter` and `to_chapter`, which specify the coreference-annotated text to load: `my_book_name` is the Hebrew Bible book, `from_chapter` the starting chapter and `to_chapter` the end chapter. `ParseAnnotations()` returns parsed mentions, coreference classes, suffix errors and reconstructed phrase types that may need reconsideration. It calls the functions `TexFabricParse()`, `EnrichMentions()` and `AssignIdentity()`.

* Though `TexFabricParse()` does the actual work, `ParseAnnotations()` returns: 
    - a list of `Mention` objects, in variable `mentions`;
    - a dictionary of `Coref` objects, in variable `corefs`, whose keys are the class numbers in the form `text number:class number`, e.g. `1:2` for Psalm 1, class number 2; the singletons are stored on key `0`;
    - possible suffix errors, in variable `suffix_errors`, that may have been created during the annotation process;
    - possibly erroneous annotations of other mentions, in variable `reconsider_rpt`.
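
The key convention of `corefs` can be illustrated as follows. This is a sketch under assumptions: plain lists of strings stand in for `Coref` and `Mention` objects, the key types (strings for classes, integer `0` for singletons) are guessed from the description, and `split_key` is a hypothetical helper that is not part of `analyse.py`.

```python
def split_key(key):
    """Split a 'text number:class number' key into two integers."""
    text_no, class_no = key.split(":")
    return int(text_no), int(class_no)


# Toy stand-in for `corefs` (assumption): class keys are strings,
# the singletons live on key 0.
corefs = {
    "1:10": ["DRK", "W"],
    "1:2": [">JC", "HLK", "<MD"],
    0: ["JWMM"],
}

singletons = corefs[0]
classes = {k: v for k, v in corefs.items() if k != 0}

# Sorting on the parsed key orders classes numerically, so 1:2 precedes 1:10.
ordered = sorted(classes, key=split_key)
for key in ordered:
    text_no, class_no = split_key(key)
    print(f"text {text_no}, class {class_no}: {classes[key]}")
```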
    
`EnrichMentions()` and `AssignIdentity()` are in need of some additional explanation. 
* The function `EnrichMentions(mentions)` reconstructs the phrase type of each mention from the phrase atoms that come with the ETCBC data. A mention can range in size from a suffix to a phrase. Because the coreference and mention data are new to the ETCBC data, although derived from it, the phrase type of a mention has to be reconstructed. This mention phrase type is called `rpt`. As noted, annotation errors are inevitable; an example is a single adverb annotated as a mention of its own (instead of as a quality of a mention). Such mistakes can cause `EnrichMentions()` to reconstruct an unusual, and possibly wrong, phrase type. The variable `reconsider_rpt` points to these possibly erroneous annotations so that the annotator can reconsider the `rpt`. 
* Coreference classes are not yet explicitly identified entities. The function `AssignIdentity(corefs)` identifies them. The coreference chain is searched for the mention whose type ranks highest in the predetermined hierarchy of mention types (`rpt`s): 1. `PrNP`; 2. `NP`; 3. `PtcP`; 4. `VP`; 5. `PPrP`; 6. `DPrP`. The mention with the highest-ranked type is used to identify the entire class. In the print functions explained below, the class identification is indicated with 'Who/what'. 
* An additional aid to analysis is to determine which mention initiates the coreference class. Knowing which mention, and which mention type, starts the coreference chain shows how the participant is introduced in the discourse. `AssignIdentity()` stores the mention that starts the chain, and its type, on the coreference Python object; they are available as `c.first()` and `c.first().typ`. The print functions below indicate this mention and its type with 'first' and 'type'.
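
The identification step can be sketched as follows. This is not the code of `AssignIdentity()`: the `Mention` dataclass, the function `identify`, the toy chain and the tie-breaking rule (the earliest mention wins among equally ranked types) are assumptions made for illustration. Only the hierarchy itself is taken from the description above.

```python
from dataclasses import dataclass

# Hierarchy of mention types as listed above, highest rank first.
HIERARCHY = ["PrNP", "NP", "PtcP", "VP", "PPrP", "DPrP"]


@dataclass
class Mention:  # hypothetical stand-in for the real Mention object
    text: str
    typ: str


def identify(chain):
    """Return (identifying mention, first mention) for one chain.

    The identifying mention has the highest-ranked type in HIERARCHY;
    types outside the hierarchy rank last.
    """
    def rank(m):
        return HIERARCHY.index(m.typ) if m.typ in HIERARCHY else len(HIERARCHY)

    # min() returns the first of equally ranked items, i.e. the earliest mention.
    return min(chain, key=rank), chain[0]


# Toy chain (assumption): a verb starts the chain, a noun phrase follows.
chain = [Mention("TDPN", "VP"), Mention("RWX", "NP"), Mention("W", "Sffx")]
who, first = identify(chain)
print(f"Who/what: {who.text} / first: {first.text}, type: {first.typ}")
```

In this toy chain the class is identified by the noun phrase even though a verb introduces the participant, which is the distinction between 'Who/what' and 'first'.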

Three groups of functions do the following for any specified text and chapter number(s):

1. Print functions
    * To use one or more of the three functions, uncomment them below. 
    * `PrintSurvey(corefs)`: prints an overview of the text and class (e.g. `C119:1`), as what or whom the class is identified, the mention that starts the chain and what type it has. Below the demarcation line the mentions that are linked to that class are printed. The singletons are printed in one list at the end of the survey. 
    * `PrintPatternsAndNotes(corefs, suffix_errors)`: prints a more detailed overview than `PrintSurvey()`. It first prints an overview of the text and class (e.g. `C119:1`), as what or whom the class is identified, the mention that starts the chain and what type it has. Below the demarcation line the mentions that are linked to that class are printed. For each mention in a class or in the singletons, its type is printed, together with its person, gender and number information if it carries any. If an annotator note has been made on a mention, the note is printed after the mention. To complete the overview, potentially erroneous annotations are flagged after the mention with `!CORRUPT ANN'`. The singletons are printed in one list at the end of the survey. 
    * `PrintCorefID(corefs)`: prints the same information as `PrintSurvey()` but sorts the classes in alefbetical order. 


2. Correct annotation errors
    * `PrintPossibleCorrections(suffix_errors, reconsider_rpt)`: prints only the potentially erroneous suffixes and reconstructed phrase types. For the `suffix_errors` the print order is: node, text, lexeme, brat identifier. For `rpt` the print order is: text position, start word node, end word node, lexeme, pdp, (phrase atom nodes), lexeme(s) and type. 


3. Generate statistics
    * `MakePandasTables(corefs, mentions)`: generates Pandas tables for descriptive statistics of what kind of and how many mention types, pronouns and mention types specified for pronouns are contained in classes and singletons.
    * The overall dataframe `overall_df` gives a general overview. From there the tables give more specific information. The part-of-speech dataframe `pos_df` counts the mention types in classes and singletons. It also gives counts of what mention types start a coreference chain. The pronoun dataframe `pronoun_df` counts the pronouns in classes and singletons. The pronouns specified for part-of-speech dataframe `pronoun_pos_class_df` gives counts only for classes. `pronoun_pos_sing_df` gives the same kind of counts, but only for singletons. 
    * All these Pandas dataframes are printed with the function `PrintThisTable()` which takes as argument a specific dataframe. 
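
The counting behind a table such as `pos_df` can be sketched without Pandas. The function `type_counts` is hypothetical, uses `collections.Counter` instead of a dataframe, and assumes that percentages are rounded to whole numbers; the input counts are copied from the `pos_df1` output in section 6.

```python
from collections import Counter


def type_counts(class_types, singleton_types):
    """Count mention types in classes and singletons and add percentages."""
    in_class = Counter(class_types)
    singleton = Counter(singleton_types)
    total = in_class + singleton
    grand_total = sum(total.values())
    # Assumption: percentages are rounded to the nearest integer.
    percent = {t: round(100 * n / grand_total) for t, n in total.items()}
    return in_class, singleton, total, percent


# One list entry per mention, labelled with its mention type.
class_types = ["NP"] * 14 + ["VP"] * 14 + ["Sffx"] * 6 + ["PrNP"]
singleton_types = ["NP"] * 7 + ["advb"] * 2 + ["CP"]

in_class, singleton, total, percent = type_counts(class_types, singleton_types)
for typ, n in total.most_common():
    print(f"{typ}\t{in_class[typ]}\t{singleton[typ]}\t{n}\t{percent[typ]}%")
```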
    
The functions described above can be called repeatedly, which makes it possible to study multiple texts at the same time. To do this, assign the returned values to distinct variables, for example: 

* `mentions_psalms, corefs_psalms, suffix_errors_psalms, reconsider_rpt_psalms = ParseAnnotations(my_book_name, from_chapter, to_chapter)`
* Or when studying one text, for example Psalm 1, give the variable the number of the Psalm: `corefs1`, etc. 

### Search
The functionality of `AssignIdentity()` has been further integrated in a number of entity search functions with which entities in the entire coreference annotated corpus can be analysed. The entity search functions are meant to facilitate in-depth discussion between the data and the findings of other exegetical literature, for example commentaries. 


* `FindWho(cd, suffix_errors, who)` Finds classes that are identified as an entity. The argument `cd` is a coreference dictionary ('cd') in which the annotated classes have been stored. The annotations have been parsed first by the function `ParseAnnotations()` in `analyse.py`. The annotations have been produced by me, hence there can still be annotation errors in the data. The argument `suffix_errors` loads those potential errors in the function to ensure that the annotations are displayed completely. The argument `who` is a list that can take multiple strings, for example: `'>LHJM', '>L'` (>L means God). `FindWho()` then finds and prints all classes identified as '>LHJM' and '>L'. The actual searching and printing is done with a helper function `GetPatterns()`. 
* `FindFirst(cd, suffix_errors, mention_type, pgn)` Finds classes for a chosen `mention_type` and/or `pgn`. The arguments `cd` and `suffix_errors` are the same as in `FindWho()`. The argument `mention_type` is a string, for example 'VP', and finds and prints all classes that start with that `mention_type`, a verb in this case. The argument `pgn` is a list that can take multiple strings of transliterated person, gender and/or number forms, for example P1sg suffixes:  `'NJ', 'J'`. Or a combination of suffixes: `'NJ', 'K'`, P1sg and P2Msg respectively. The actual searching and printing is done with the function `Pattern()`.
* `FindClassMention(cd, suffix_errors, what)` Most mentions do not initiate a coreference class or identify it. This function looks up a mention in a class and prints the entire class. The mention is specified in `what`. The arguments `cd` and `suffix_errors` are the same as in previous functions. The actual searching and printing is done with the helper function `SearchClassMention()`. 
* `FindMention(cd, suffix_errors, what)` Finds and prints all occurrences -- singletons and in classes -- of a specified mention in `what`. This function can serve as a first exploration of the available coreference data for the desired mention. Leaving the arguments `cd` and `suffix_errors` out of consideration, `what` is a mention string, for example `'>SP'`, Asaph. The actual searching and printing is done with the function `Search()`.
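
The kind of lookup these functions perform can be sketched on a toy coreference dictionary. Everything here is an assumption for illustration: the `Coref` dataclass, the lower-case function names and exact string matching do not come from `search.py`, and the mentions listed for class `1:5` are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Coref:  # hypothetical stand-in, not the real class
    who: str          # identification of the class
    first_type: str   # mention type that starts the chain
    mentions: list = field(default_factory=list)


def find_who(cd, who):
    """Keys of classes identified as one of the strings in `who`."""
    return [key for key, c in cd.items() if c.who in who]


def find_first(cd, mention_type):
    """Keys of classes whose chain starts with `mention_type`."""
    return [key for key, c in cd.items() if c.first_type == mention_type]


def find_class_mention(cd, what):
    """Keys of classes that contain the mention `what` anywhere in the chain."""
    return [key for key, c in cd.items() if what in c.mentions]


cd = {
    "82:4": Coref(">LHJM", "NP", [">LHJM", "NYB", "JCPV", ">NJ"]),
    "82:2": Coref("QRB >LHJM", "NP", ["QRB >LHJM", "TCPVW", "TF>W"]),
    "1:5": Coref("RWX", "VP", ["TDPN", "RWX"]),
}

print(find_who(cd, [">LHJM", ">L"]))
print(find_first(cd, "VP"))
print(find_class_mention(cd, "TCPVW"))
```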

All described functions, except `FindMention()`, print the results in a similar way. The first line gives the class identifier, the identification of the class, the mention that initiates the coreference chain, the mention type of that first mention and the number of the class in the entire annotated corpus. The functions then print 6 columns, one mention per row, with the coreference and mention information that is relevant for participant analysis: the verse in which the mention occurs, the mention type, the pgn of the mention, the annotation of the mention itself, a gloss translation, and the mention note if one was stored on the mention. `FindMention()` prints an overview in 7 columns, with one row per class in which the mention occurs or per singleton: chapter:verse, class/singleton, mention type, pgn of the mention, the annotation itself, its gloss, and note. 
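
The six-column layout can be sketched as a small formatter. The function `format_class`, the dictionary rows and the width of the rule are assumptions; the sketch only mirrors the column order named above and is not the printing code of `search.py`.

```python
def format_class(header, rows):
    """Render one class as a header, a rule and tab-separated mention rows."""
    columns = ("verse", "type", "pgn", "ann", "gloss", "note")
    lines = [header, "-" * 70, "\t".join(columns)]
    for row in rows:
        # Missing values (e.g. no pgn on a noun phrase) become empty cells.
        lines.append("\t".join(str(row.get(col, "")) for col in columns))
    return "\n".join(lines)


rows = [
    {"verse": 1, "type": "NP", "ann": ">LHJM", "gloss": "god(s)"},
    {"verse": 1, "type": "VP", "pgn": "umsg", "ann": "NYB", "gloss": "stand"},
]
report = format_class("C82:4 Who/what: >LHJM / first: >LHJM, type: NP", rows)
print(report)
```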

## 2. Load Analyse and Search

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from analyse import (ParseAnnotations, 
                     MakePandasTables, 
                     PrintThisTable, 
                     PrintSurvey,
                     PrintPatternsAndNotes, 
                     PrintCorefID,
                     PrintPossibleCorrections
                    )

   |     0.00s Dataset without structure sections in otext:no structure functions in the T-API


In [3]:
from search import (GetAnnotations,
                    FindWho,
                    FindFirst,
                    FindMention,
                    FindClassMention
                    )

   |     0.00s Dataset without structure sections in otext:no structure functions in the T-API


## 3. Parse and enrich coreference data

In [4]:
my_book_name = 'Psalms'
from_chapter = 82
to_chapter = 82

In [5]:
mentions1, corefs1, suffix_errors1, reconsider_rpt1 = ParseAnnotations(my_book_name, from_chapter, to_chapter)

## 4. Print Functions

In [6]:
PrintSurvey(corefs1)
#PrintPatternsAndNotes(corefs1, suffix_errors1)
#PrintCorefID(corefs1)

C1:2 Who/what: >JC / first: >JC, type: NP
----------------------------------------------------------------------
[>JC, HLK, <MD, JCB, W, JHGH, HJH, W, JTN, W, W, JBWL, J<FH, JYLJX] 

C1:4 Who/what: <YT RC<JM / first: <YT RC<JM, type: NP
----------------------------------------------------------------------
[<YT RC<JM, RC<JM, JQMW, RC<JM, DRK RC<JM, T>BD] 

C1:7 Who/what: DRK XV>JM / first: DRK XV>JM, type: NP
----------------------------------------------------------------------
[DRK XV>JM, XV>JM] 

C1:3 Who/what: TWRT JHWH / first: TWRT JHWH, type: NP
----------------------------------------------------------------------
[TWRT JHWH, TWRT, W] 

C1:9 Who/what: <Y / first: <Y, type: NP
----------------------------------------------------------------------
[<Y, CTWL] 

C1:6 Who/what: MY / first: MY, type: NP
----------------------------------------------------------------------
[MY, W] 

C1:5 Who/what: RWX / first: TDPN, type: VP
-----------------------------------------------------------

In [7]:
GetAnnotations(corefs1, suffix_errors1)

C82:4 Who/what: >LHJM / first: >LHJM, type: NP
----------------------------------------------------------------------
verse	id	type	pgn	ann		gloss	note

1	T3	NP		>LHJM     	god(s)	
1	T4	VP	umsg	NYB     	stand	
1	T7	VP	p3msg	JCPV     	judge	
6	T29	PPrP	p1usg	>NJ     	i	
6	T30	VP	p1usg	>MRTJ     	say	
8	T40	VP	p2msg	QWMH     	arise	
8	T41	NP		>LHJM     	god(s)	
8	T42	VP	p2msg	CPVH     	judge	
8	T44	PPrP	p2msg	>TH     	you	
8	T45	VP	p2msg	TNXL     	take possession	


C82:2 Who/what: QRB >LHJM / first: QRB >LHJM, type: NP
----------------------------------------------------------------------
verse	id	type	pgn	ann		gloss	note

1	T6	NP		QRB >LHJM     	interior	
2	T8	VP	p2mpl	TCPVW     	judge	Does 2Mpl refer to "the Gods" in v.1?
2	T11	VP	p2mpl	TF>W     	lift	
3	T12	VP	p2mpl	CPVW     	judge	
3	T17	VP	p2mpl	HYDJQW     	be just	
4	T18	VP	p2mpl	PLVW     	escape	
4	T22	VP	p2mpl	HYJLW     	deliver	
5	T23	VP	p3upl	JD<W     	know	Does 3Cpl refer to the 2Mpl in v.5? Is this the same entity?
5	T24	VP	

## 5. Print Possible Corrections and Errors

In [8]:
PrintPossibleCorrections(suffix_errors1, reconsider_rpt1)

There are 0 annotation errors for the specified corpus 

There are 2 possible erroneous reconstructed phrase types you may need to reconsider: 

The order is: text, start node, end node, lexeme, pdp, (phrase atom nodes), lexeme(s), type
('1:2', 310681, 310681, 'JWMM', 'advb', (310681, 310682, 310683), 'JWMM WLJLH00 ', 'AdvP') 

('1:2', 310683, 310683, 'LJLH', 'advb', (310681, 310682, 310683), 'JWMM WLJLH00 ', 'AdvP') 
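
Because the printed tuples rely on position alone, it can help to give the fields names while inspecting them. The sketch below wraps one tuple of each kind in a `namedtuple`; the type names and field names are hypothetical labels derived from the stated print order, and `analyse.py` itself returns plain tuples as shown in the output. The suffix-error tuple is the first one printed for Numbers in section 7.

```python
from collections import namedtuple

# Field names follow the print order; the named tuples are an assumption
# made for readability, not something analyse.py provides.
SuffixError = namedtuple("SuffixError", "node text lexeme brat_id")
RptRecord = namedtuple(
    "RptRecord", "text start_node end_node lexeme pdp atom_nodes lexemes rpt"
)

raw_suffix = (69635, "1:1", "Y>", "T9")
raw_rpt = ("1:2", 310681, 310681, "JWMM", "advb",
           (310681, 310682, 310683), "JWMM WLJLH00 ", "AdvP")

err = SuffixError(*raw_suffix)
rec = RptRecord(*raw_rpt)

print(err.text, err.lexeme, err.brat_id)
print(rec.text, rec.lexeme, rec.pdp, rec.rpt)
```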



## 6. Generate Statistics

In [9]:
overall_df1, pos_df1, \
pronoun_df1, pronoun_pos_class_df1, \
pronoun_pos_sing_df1 = MakePandasTables(corefs1, mentions1)

In [10]:
PrintThisTable(overall_df1)

|  | total |
|---|---|
| mentions | 45 |
| singletons | 10 |
| classes | 9 |


In [11]:
PrintThisTable(pos_df1)

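|  | NP | VP | Sffx | PrNP | advb | CP | total_type |
|---|---|---|---|---|---|---|---|
| in class | 14 | 14 | 6 | 1 | 0 | 0 | 35 |
| singleton | 7 | 0 | 0 | 0 | 2 | 1 | 10 |
| total | 21 | 14 | 6 | 1 | 2 | 1 | 45 |
| % total | 47 | 31 | 13 | 2 | 4 | 2 | 100 |
| first in chain | 7 | 2 | 0 | 0 | 0 | 0 | 9 |
| % chain | 78 | 22 | 0 | 0 | 0 | 0 | 100 |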


In [12]:
PrintThisTable(pronoun_df1)

|  | p3fsg | p3mpl | p3msg | umsg | total_pgn |
|---|---|---|---|---|---|
| in class | 2 | 1 | 15 | 2 | 20 |
| total | 2 | 1 | 15 | 2 | 20 |
| % total | 10 | 5 | 75 | 10 | 100 |


In [13]:
PrintThisTable(pronoun_pos_class_df1)

|  | p3fsg | p3mpl | p3msg | umsg | total_pgn |
|---|---|---|---|---|---|
| VP | 2 | 1 | 9 | 2 | 14 |
| Sffx | 0 | 0 | 6 | 0 | 6 |
| total | 2 | 1 | 15 | 2 | 20 |
| % total | 10 | 5 | 75 | 10 | 100 |


In [14]:
PrintThisTable(pronoun_pos_sing_df1)

|  | total_pgn |
|---|---|
| total | 0 |
| % total | 0 |


## 7. Analyse other text(s)

In [15]:
mentions_nu, corefs_nu, errors_nu, reconsider_nu = ParseAnnotations('Numbers', 1, 36)

In [16]:
#PrintSurvey(corefs_nu)
#PrintPatternsAndNotes(corefs_nu, errors_nu)
#PrintCorefID(corefs_nu)

In [17]:
PrintPossibleCorrections(errors_nu, reconsider_nu)

There are 139 possible annotation errors you may need to reconsider: 

The order is: node, text, lexeme, brat id
(69635, '1:1', 'Y>', 'T9') 

(69649, '1:2', 'TM', 'T15') 

(69659, '1:2', 'TM', 'T21') 

(69674, '1:3', 'YB>', 'T29') 

(69804, '1:18', 'TM', 'T102') 

(69818, '1:18', 'TM', 'T109') 

(69836, '1:20', 'TWLD', 'T119') 

(69838, '1:20', 'TM', 'T122') 

(69846, '1:20', 'TM', 'T127') 

(69872, '1:22', 'TWLD', 'T139') 

(69874, '1:22', 'TM', 'T142') 

(69883, '1:22', 'TM', 'T149') 

(69909, '1:24', 'TWLD', 'T161') 

(69911, '1:24', 'TM', 'T164') 

(69943, '1:26', 'TWLD', 'T179') 

(69945, '1:26', 'TM', 'T182') 

(69975, '1:28', 'TWLD', 'T196') 

(69977, '1:28', 'TM', 'T199') 

(70007, '1:30', 'TWLD', 'T213') 

(70009, '1:30', 'TM', 'T216') 

(70042, '1:32', 'TWLD', 'T231') 

(70044, '1:32', 'TM', 'T234') 

(70072, '1:34', 'TWLD', 'T247') 

(70074, '1:34', 'TM', 'T250') 

(70103, '1:36', 'TWLD', 'T264') 

(70105, '1:36', 'TM', 'T267') 

(70135, '1:38', 'TWLD', 'T281') 

(70137, '1:

In [18]:
overall_df_nu, pos_df_nu, \
pronoun_df_nu, pronoun_pos_class_df_nu, \
pronoun_pos_sing_df_nu = MakePandasTables(corefs_nu, mentions_nu)

In [19]:
PrintThisTable(overall_df_nu)

|  | total |
|---|---|
| mentions | 11647 |
| singletons | 3265 |
| notes | 2995 |
| classes | 1548 |


In [20]:
PrintThisTable(pos_df_nu)

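|  | NP | VP | PrNP | DPrP | Sffx | PPrP | AdvP | IPrP | CP | prep | PP | conj | InrP | AdjP | art | PtcP | total_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| in class | 2633 | 2751 | 973 | 93 | 1738 | 152 | 12 | 5 | 9 | 7 | 3 | 1 | 2 | 1 | 1 | 1 | 8382 |
| singleton | 2627 | 189 | 233 | 29 | 105 | 3 | 53 | 0 | 3 | 14 | 4 | 1 | 0 | 2 | 2 | 0 | 3265 |
| total | 5260 | 2940 | 1206 | 122 | 1843 | 155 | 65 | 5 | 12 | 21 | 7 | 2 | 2 | 3 | 3 | 1 | 11647 |
| % total | 45 | 25 | 10 | 1 | 16 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 100 |
| first in chain | 863 | 328 | 210 | 66 | 55 | 7 | 5 | 5 | 4 | 3 | 1 | 1 | 0 | 0 | 0 | 0 | 1548 |
| % chain | 56 | 21 | 14 | 4 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 100 |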


In [21]:
PrintThisTable(pronoun_df_nu)

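|  | p1upl | p1usg | p2fpl | p2fsg | p2mpl | p2msg | p3fpl | p3fsg | p3mpl | p3msg | p3upl | ufpl | ufsg | umpl | umsg | usg | uuu | total_pgn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| in class | 164 | 267 | 0 | 7 | 397 | 410 | 28 | 337 | 777 | 1537 | 182 | 2 | 44 | 59 | 82 | 0 | 245 | 4538 |
| singleton | 5 | 4 | 2 | 0 | 11 | 11 | 0 | 7 | 50 | 65 | 7 | 0 | 6 | 9 | 11 | 3 | 67 | 258 |
| total | 169 | 271 | 2 | 7 | 408 | 421 | 28 | 344 | 827 | 1602 | 189 | 2 | 50 | 68 | 93 | 3 | 312 | 4796 |
| % total | 4 | 6 | 0 | 0 | 9 | 9 | 1 | 7 | 17 | 33 | 4 | 0 | 1 | 1 | 2 | 0 | 7 | 100 |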


In [22]:
PrintThisTable(pronoun_pos_class_df_nu)

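|  | p1upl | p1usg | p2fsg | p2mpl | p2msg | p3fpl | p3fsg | p3mpl | p3msg | p3upl | ufpl | ufsg | umpl | umsg | uuu | total_pgn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VP | 62 | 122 | 6 | 206 | 232 | 12 | 129 | 376 | 991 | 182 | 2 | 44 | 59 | 82 | 245 | 2750 |
| Sffx | 96 | 117 | 0 | 173 | 159 | 14 | 188 | 392 | 497 | 0 | 0 | 0 | 0 | 0 | 0 | 1636 |
| PPrP | 6 | 28 | 1 | 18 | 19 | 2 | 20 | 9 | 49 | 0 | 0 | 0 | 0 | 0 | 0 | 152 |
| total | 164 | 267 | 7 | 397 | 410 | 28 | 337 | 777 | 1537 | 182 | 2 | 44 | 59 | 82 | 245 | 4538 |
| % total | 4 | 6 | 0 | 9 | 9 | 1 | 7 | 17 | 34 | 4 | 0 | 1 | 1 | 2 | 5 | 100 |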


In [23]:
PrintThisTable(pronoun_pos_sing_df_nu)

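|  | p1upl | p1usg | p2fpl | p2mpl | p2msg | p3fsg | p3mpl | p3msg | p3upl | ufsg | umpl | umsg | usg | uuu | total_pgn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VP | 1 | 2 | 0 | 10 | 6 | 2 | 10 | 53 | 7 | 6 | 9 | 11 | 3 | 67 | 187 |
| Sffx | 4 | 2 | 2 | 1 | 2 | 5 | 40 | 12 | 0 | 0 | 0 | 0 | 0 | 0 | 68 |
| PPrP | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| total | 5 | 4 | 2 | 11 | 11 | 7 | 50 | 65 | 7 | 6 | 9 | 11 | 3 | 67 | 258 |
| % total | 2 | 2 | 1 | 4 | 4 | 3 | 19 | 25 | 3 | 2 | 3 | 4 | 1 | 26 | 100 |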
