<img align="right" src="images/tf-small.png" width="90"/>
<img align="right" src="images/etcbc.png" width="100"/>


# Creating a Coreference-annotated Corpus for Biblical Hebrew

#### An Analysis of Inter-annotator Agreement for Coreference Resolution Annotations in the Psalms and Beyond

## 1. Introduction

This notebook demonstrates and analyses the calculation of inter-annotator agreement, abbreviated to IAA, for the annotation of coreference information in the Hebrew Bible, specifically the Psalms. The Psalms, consisting of 150 poems in Ancient Hebrew, have been chosen as the annotation corpus because they are the focus of my PhD research: "Who is who in the Psalms?" 

The Psalms and some comparison texts from Genesis, Numbers and Isaiah have been annotated according to certain rules: an annotation scheme. A brief explanation of what data is annotated, of the process, and of the accompanying annotation tools and resources can be found in the [coreference annotation](https://github.com/cmerwich/participant-analysis/tree/master/annotation) notebooks. Why comparison texts have been annotated is explained below.

A vital part of the annotation process is checking the annotations for inter-annotator agreement. A website on [corpus linguistics](https://corpuslinguisticmethods.wordpress.com/2014/01/15/what-is-inter-annotator-agreement) puts it nicely: 

> "Inter-annotator agreement is a measure of how well two (or more) annotators can make the same annotation decision for a certain category."

From an IAA measure at least three things can be derived: 

* the *reliability* of the annotation scheme or guidelines that describe the category that is being annotated. Do the annotators understand the scheme and can they apply it *independently* and *consistently*? 
* the *reliability* of the annotation process, which is a necessary condition for 
* the *correctness* of the resulting annotations. 

The goal is to create a coreference annotation method for the Hebrew Bible that can be used, adapted and corrected by others. In order to do that, we need to know to what extent the annotated information is correct. Producing reliable, consistent and correct coreference annotations is key to developing solid computer-assisted analyses of participants in the Psalms and other Hebrew Bible books. 


### 1.1 Setting up IAA

The scope and financial means of the PhD project do not allow for setting up a large scale IAA process with a team of annotators ranging in level of expertise. Two annotators were available for the current project. In the calculations I therefore function as annotator **A** and a fellow PhD candidate, Gyusang Jin, as annotator **B**. 

Ten of the 150 Psalms were chosen at random for annotator **B** with Python's `random` module, since **A** had already annotated the whole Psalms corpus:

```python
import random

# randint() includes both endpoints, so 150 is the upper bound for 150 Psalms
for i in range(1, 11):
    print(random.randint(1, 150))
```

The output was 11, 88, 101, 17, 70, 138, 20, 32, 67, 129. After training in the annotation guidelines, **B** started annotating. The 10 resulting annotation files of **B**, together with **A**'s annotation files, form the input for the IAA algorithm. The annotation files are found under [A](chris_A) and [B](gyus_B). 
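The loop above draws with replacement, so the same psalm could in principle come up twice. As a minimal sketch of an alternative (not the procedure that produced the list above), `random.sample` draws distinct numbers in one call. The seed value is a hypothetical choice that only serves to make the draw repeatable.

```python
import random

# Hypothetical seed, used only so that the draw can be reproduced.
rng = random.Random(2019)

# Draw 10 distinct psalm numbers from 1..150 (range() excludes its upper bound).
selected = sorted(rng.sample(range(1, 151), 10))
print(selected)
```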

The Psalms corpus consists of poetry that is written in Biblical Hebrew. The annotator therefore not only has to deal with the difficulties of understanding texts written in a different genre and language, but also with texts that originated in a different place and time. A question that comes up is whether it is harder to annotate poetic texts for coreference than, for example, narrative texts.

To enable some comparison between the annotation of different genres, Numbers 8-10, which are narrative texts, were also annotated for coreference by **A** and **B**. The processing of the Numbers annotations, however, is a work in progress and will be included as soon as possible. 


### 1.2 Some Theory and the IAA Algorithm

There are all kinds of algorithms for IAA measures. [NLTK](https://www.nltk.org/api/nltk.metrics.html) has implemented some of these agreement metrics. [Artstein and Poesio](https://dl.acm.org/citation.cfm?id=1479206) wrote an informative article on the mathematics of IAA in 2008. I have decided to implement my own algorithm that uses SciPy's [linear sum assignment function](https://docs.scipy.org/doc/scipy-0.18.1/reference/generated/scipy.optimize.linear_sum_assignment.html). This gives me more control over the algorithm's output. 

Calculating IAA for coreference annotations in the Hebrew Bible is cast as an assignment problem. A classic assignment problem is e.g. to assign jobs *{p, q, r, s}* to workers *{a, b, c, d}* so as to minimize the total cost. [Here](http://csclab.murraystate.edu/~bob.pilgrim/445/munkres.html) you can find a nice explanation of this 'Kuhn-Munkres algorithm'. [Wikipedia](https://en.wikipedia.org/wiki/Hungarian_algorithm) is also helpful. 
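To make the classic assignment problem concrete, here is a minimal sketch with a made-up cost matrix for the workers and jobs named above. It simply tries every permutation, which is not the Kuhn-Munkres algorithm but finds the same minimum for small inputs; the costs are hypothetical.

```python
from itertools import permutations

workers = ["a", "b", "c", "d"]
jobs = ["p", "q", "r", "s"]

# cost[i][j]: hypothetical cost of letting worker i do job j
cost = [
    [4, 1, 3, 2],
    [2, 0, 5, 3],
    [3, 2, 2, 1],
    [4, 3, 1, 2],
]


def best_assignment(cost):
    """Return (total, perm) where worker i gets job perm[i] at minimal total cost."""
    n = len(cost)
    best_total, best_perm = None, None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if best_total is None or total < best_total:
            best_total, best_perm = total, perm
    return best_total, best_perm


total, perm = best_assignment(cost)
for i, j in enumerate(perm):
    print(workers[i], "->", jobs[j])
print("total cost:", total)
```

Trying all permutations grows factorially with the number of workers, which is why a dedicated solver is the practical choice for anything beyond toy examples.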

The implementation of IAA for our case is as follows. Coreference resolution is the task of finding all expressions that refer to the same entity in a text. 

* A referring expression is called a **mention**. 
* An entity can be called a **class**, or **C**. A **class** is a set that contains two or more mentions that refer to the same entity. 
* A mention that refers to an entity that no other mention refers to is called a **singleton**, or **S**. A **singleton** is a set that contains one **mention**. A **singleton** set can also contain all singletons from one text. 

Comparing annotations of `annotator A` and `annotator B` then is done with set calculations. Given the Venn diagram (Source: [wikipedia](https://en.wikipedia.org/wiki/Set_theory)) below, for IAA we are interested in:

<img align="center" src="images/venn_a_intersect_b.png" width="220"/>

* $A \setminus B$ = Left, or blue
* $A \cap B$ = Middle, or purple
* $B \setminus A$ = Right, or pink
* $A \Delta B$ = Symmetric difference of A and B
* $\delta(A, B)$ = delta of A and B

To illustrate, in the following example

| $\setminus$ | A  | $\cap$ | B | $\setminus$ |
| ---- | ---- | ---- | ---- | ---- |
| 0 | C1 | 4 | C1 | 1 |  
| 1 | C2 | 7 | C2 | 1 |
| 8 | C3 | 0 | C3 | 0 |
| 3 | S  | 17 | S  | 2 |  


Annotator **A** and annotator **B** have 4 mentions in common in *C1*; **B** has 1 mention more in *C1* than **A**. **A** and **B** have 7 mentions in common in *C2*, and each has annotated 1 mention that the other has not. For *S*, **A** and **B** have an intersection of 17. For **A** the relative complement, or difference, is 3; for **B** it is 2. 

It is also possible that **A** or **B** has formed a class the other has not: the *C3* class is an example of that.  

The symmetric difference is the calculation of the annotations that belong to **A** or **B** but not to their intersection (Middle). Symmetric difference is defined for sets **A**, **B** as

$$A \Delta B = (A \setminus B) \cup (B \setminus A)$$

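So the size of the symmetric difference between **A** and **B** for *C1* is $0+1 = 1$. For *S*, $|A \Delta B|$ is $3+2 = 5$. 

The $\delta$ (delta) of two sets is the size of their symmetric difference divided by the size of their union:

$$ \delta(A, B) = \frac{|A \Delta B|}{|A \cup B|} $$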

The delta of *S* for example is: $(3+2)/(3+17+2) \approx 0.227 $.
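These set calculations map directly onto Python's built-in `set` operators. The sketch below reproduces the *C1* and *S* rows of the example table; the mention identifiers are hypothetical placeholders, and `compare` is an illustrative helper, not a function from `iaa.py`.

```python
def compare(a, b):
    """Return (L, M, R, D, d) for two sets of mentions."""
    left = a - b        # A \ B
    middle = a & b      # A intersect B
    right = b - a       # B \ A
    sym_diff = a ^ b    # A delta B
    union = a | b
    delta = len(sym_diff) / len(union) if union else 0.0
    return len(left), len(middle), len(right), len(sym_diff), round(delta, 4)


# Hypothetical mention ids for class C1: 4 shared, 1 extra for B.
ann_a = {"m1", "m2", "m3", "m4"}
ann_b = {"m1", "m2", "m3", "m4", "m5"}
print(compare(ann_a, ann_b))  # (0, 4, 1, 1, 0.2)

# Hypothetical singleton sets: 3 only in A, 17 shared, 2 only in B.
s_a = {f"s{i}" for i in range(20)}
s_b = {f"s{i}" for i in range(3, 22)}
print(compare(s_a, s_b))  # (3, 17, 2, 5, 0.2273)
```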

To return to the assignment problem, the annotations of **A** and **B** can be represented as $n$ and $k$ nodes respectively, divided over two disjoint and independent sets $U$ and $V$ in a bipartite graph (source: [wikipedia](https://en.wikipedia.org/wiki/Bipartite_graph)). In this bipartite graph every edge connects a node in $U$ to one in $V$:

<img align="center" src="images/bipartite-graph.png" width="220"/> 

In the [IAA algorithm](iaa.py) then, **A** and **B** (or $U$ and $V$) are matched with function `match()` using a distance function `distance()` that calculates the symmetric distance. The minimum-cost matching is computed with the aforementioned SciPy function `linear_sum_assignment`. The results of the matching are stored in array $r$. The rest of the functions are helper functions that print the calculations to plain-text files with extension `.iaa`. These `.iaa` files are found [here](iaa-files). 
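The matching step can be sketched as follows. This is not the code of `iaa.py`: the function names echo `match()` and `distance()`, but their bodies are assumptions, the cost is assumed to be the size of the symmetric difference, and a brute-force search over permutations stands in for `linear_sum_assignment`. The mention ids are hypothetical.

```python
from itertools import permutations


def distance(a, b):
    """Assumed cost: the number of mentions that only one annotator has."""
    return len(a ^ b)


def match(classes_a, classes_b):
    """Pair the classes of A and B so that the summed distance is minimal."""
    names_a, names_b = list(classes_a), list(classes_b)
    n = max(len(names_a), len(names_b))
    # Pad the shorter side with empty sets, labelled '-', for unmatched classes.
    sets_a = [classes_a[k] for k in names_a] + [set()] * (n - len(names_a))
    sets_b = [classes_b[k] for k in names_b] + [set()] * (n - len(names_b))
    names_a += ["-"] * (n - len(names_a))
    names_b += ["-"] * (n - len(names_b))

    best = min(
        permutations(range(n)),
        key=lambda p: sum(distance(sets_a[i], sets_b[p[i]]) for i in range(n)),
    )
    return [(names_a[i], names_b[j], distance(sets_a[i], sets_b[j]))
            for i, j in enumerate(best)]


# Hypothetical mention ids, not taken from the real .ann files.
A = {"C1": {1, 2, 3}, "C2": {4, 5}, "C3": {6, 7, 8}}
B = {"C1": {4, 5, 9}, "C2": {1, 2, 3}}

result = match(A, B)
for row in result:
    print(row)
```

In this toy input **A**'s *C1* is paired with **B**'s *C2* because the sets are identical, regardless of the class labels, and **A**'s *C3* is left without a partner.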

## 2. Executing the Code

To enable researchers to develop their own annotation process, with different annotators working in different file locations, I have decided to work with a [`Makefile`](Makefile). A `Makefile` does all of the file handling in the terminal and prevents dependency hell when the annotators decide to change their files after the IAA calculations have been done: with the command `make` the calculations are easily redone. Some instructions:

* make an "iaa"-folder (or give it some other name) somewhere on your computer and place the `iaa.py`, `acc.py` and `Makefile` in that same folder;
* change the file locations for the `.ann` files under NU_A, PS_A etc. in the `Makefile`. Make sure there are **A** and **B** locations for the files of the two different annotators; 
* go to the "iaa"-folder in your terminal and give the command `make`;
* `iaa.py` will do its work: the IAA measures are printed per Hebrew Bible book to separate txt files, together with one total IAA measure for all compared texts. All files are stored in the "iaa"-folder. 

TO DO: Fix Makefile --

Jupyter notebooks allow lots of cool magic. Another possibility is therefore to run the shell command directly in this notebook by prefixing it with `!`:

```ipython
! make
```
TO DO: give total_psalms etc. .iaa extension 

In [106]:
! make 
# give total_psalms etc. .iaa extension 

python3 iaa.py /Users/Christiaan/Sites/brat/data/coref/Numbers/annotate/Numbers_001.ann /Users/Christiaan/Sites/brat/data/gyusang/coref/Numbers/annotate/Numbers_001.ann > Numbers_001.iaa
python3 iaa.py /Users/Christiaan/Sites/brat/data/coref/Numbers/annotate/Numbers_008.ann /Users/Christiaan/Sites/brat/data/gyusang/coref/Numbers/annotate/Numbers_008.ann > Numbers_008.iaa
python3 iaa.py /Users/Christiaan/Sites/brat/data/coref/Numbers/annotate/Numbers_012.ann /Users/Christiaan/Sites/brat/data/gyusang/coref/Numbers/annotate/Numbers_012.ann > Numbers_012.iaa
python3 acc.py Numbers_001.iaa Numbers_008.iaa Numbers_012.iaa > total_numbers
python3 iaa.py /Users/Christiaan/Sites/brat/data/coref/Psalms/annotate/Psalms_011.ann /Users/Christiaan/Sites/brat/data/gyusang/coref/Psalms/annotate/Psalms_011.ann > Psalms_011.iaa
python3 iaa.py /Users/Christiaan/Sites/brat/data/coref/Psalms/annotate/Psalms_017.ann /Users/Christiaan/Sites/brat/data/gyusang/coref/Psalms/annotate/Psalms_017.ann > Psalms_017.

## 3. IAA Analysis
Here follows the analysis of the calculated IAA measures. 


### 3.1 Totals
Let's pull in the IAA measures for `total_psalms.iaa` with a shell command and sort them by the seventh column ($\delta$) in ascending order.

In [7]:
! sort -k 7n,7 iaa-files/total_psalms.iaa

Psalms_138.iaa	-	9	62	10	19	0.2346
Psalms_088.iaa	-	25	121	25	50	0.2924
Psalms_011.iaa	-	12	49	12	24	0.3288
Psalms_129.iaa	-	9	36	9	18	0.3333
Psalms_070.iaa	-	11	34	10	21	0.3818
Psalms_032.iaa	-	21	71	26	47	0.3983
Psalms_020.iaa	-	18	55	19	37	0.4022
Psalms_017.iaa	-	38	107	42	80	0.4278
Psalms_101.iaa	-	19	45	20	39	0.4643
Psalms_067.iaa	-	20	42	21	41	0.494


We can also accumulate the IAA measures of all texts in `total_psalms.iaa` with `acc.py`. 

In [11]:
from acc import print_total

name, Lt, Mt, Rt, Dt, dt = print_total('iaa-files/total_psalms.iaa')

iaa-files/total_psalms.iaa	-	182	622	194	376	0.3768


The IAA $\delta$ of all Psalms annotated by **A** and **B** is $0.3768$. We'll get back to that value in a bit. 
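As a sanity check on how the total relates to the individual texts, the sketch below sums the L, M and R counts from the sorted listing above and recomputes $\delta$ from those sums. It is an illustration under the assumption that $\delta = (L + R) / (L + M + R)$, not the code of `acc.py`; the helper name `delta` is hypothetical.

```python
# (L, M, R) per text, copied from the total_psalms.iaa listing.
rows = {
    "Psalms_138": (9, 62, 10),
    "Psalms_088": (25, 121, 25),
    "Psalms_011": (12, 49, 12),
    "Psalms_129": (9, 36, 9),
    "Psalms_070": (11, 34, 10),
    "Psalms_032": (21, 71, 26),
    "Psalms_020": (18, 55, 19),
    "Psalms_017": (38, 107, 42),
    "Psalms_101": (19, 45, 20),
    "Psalms_067": (20, 42, 21),
}


def delta(left, middle, right):
    """Symmetric difference divided by union, rounded as in the .iaa files."""
    return round((left + right) / (left + middle + right), 4)


# Sum each column over all texts.
Lt = sum(l for l, m, r in rows.values())
Mt = sum(m for l, m, r in rows.values())
Rt = sum(r for l, m, r in rows.values())

print(Lt, Mt, Rt, Lt + Rt, delta(Lt, Mt, Rt))  # 182 622 194 376 0.3768
```

The total is thus computed over the pooled counts, so longer texts weigh more heavily than they would in a plain average of the ten per-text values.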


Great, now we make it neat. Let's put all the values per text and the total in a pandas dataframe in which the '-' is dropped and sort the values in the seventh column ($\delta$) by ascending order:

In [86]:
import pandas as pd

tot_column_names=['-','L', 'M', 'R', 'D', 'd']
tot_data_types={'-': str, 'L': int, 'M': int, 'R': int, 'D': int, 'd': float}

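# The file has one column more than `names`, so the first column
# (the file name) becomes the index.
ps_df = pd.read_table('iaa-files/total_psalms.iaa', 
                           sep=r'\s+',  # any run of whitespace separates columns
                           names=tot_column_names,
                           dtype=tot_data_types
                          ).drop(columns='-').sort_values(by='d')

df = pd.DataFrame([[Lt, Mt, Rt, Dt, dt]],
                  index=['total_psalms'],
                  columns=['L', 'M', 'R', 'D', 'd']
                 )

# DataFrame.append was removed in pandas 2.0; concat does the same job
tot_ps_df = pd.concat([ps_df, df])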

tot_ps_df

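| | L | M | R | D | d |
| ---- | ---- | ---- | ---- | ---- | ---- |
| Psalms_138.iaa | 9 | 62 | 10 | 19 | 0.2346 |
| Psalms_088.iaa | 25 | 121 | 25 | 50 | 0.2924 |
| Psalms_011.iaa | 12 | 49 | 12 | 24 | 0.3288 |
| Psalms_129.iaa | 9 | 36 | 9 | 18 | 0.3333 |
| Psalms_070.iaa | 11 | 34 | 10 | 21 | 0.3818 |
| Psalms_032.iaa | 21 | 71 | 26 | 47 | 0.3983 |
| Psalms_020.iaa | 18 | 55 | 19 | 37 | 0.4022 |
| Psalms_017.iaa | 38 | 107 | 42 | 80 | 0.4278 |
| Psalms_101.iaa | 19 | 45 | 20 | 39 | 0.4643 |
| Psalms_067.iaa | 20 | 42 | 21 | 41 | 0.494 |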


The psalms dataframe (`ps_df`) is built up of the IAA calculations per text. The columns are named L(eft) and R(ight) for the relative complements, M(iddle) for the intersection, D for the symmetric difference $\Delta$, and d for $\delta$.

The $\delta$ is a value $0 \leq \delta \leq 1 $ where $0$ denotes total inter-annotator agreement and $1$ total inter-annotator *dis*agreement. Considering the values of $\delta$ in the total_psalms dataframe, the question arises which $\delta$ threshold should be maintained to be able to speak of inter-annotator agreement. In other words, here we encounter the problem of interpreting the resulting values. To quote Artstein and Poesio:

> "Unfortunately, deciding what counts as an adequate level of agreement for a specific purpose is still little more than a black art: [a]s we will see, different levels of agreement may be appropriate for resource building and for more linguistic purposes." 

[Artstein and Poesio, 2008](https://dl.acm.org/citation.cfm?id=1479206), p.576

To put it another way, though IAA is important, it is hard to analyse. This nevertheless does not keep them from concluding the following in relation to [Krippendorff's](https://en.wikipedia.org/wiki/Krippendorff%27s_alpha) $\alpha$, an agreement coefficient for multiple coders where 1 signifies perfect agreement:

> "only values above 0.8 ensured an annotation of reasonable quality. We therefore feel that if a threshold needs to be set, 0.8 is a good value." 

[Artstein and Poesio, 2008](https://dl.acm.org/citation.cfm?id=1479206), p.591

Taking into account some of the difficulties that the corpus poses, however (see the remarks under §1.1 'Setting up IAA'), I suggest we start doing some black art ourselves. I set a 'study' threshold for individual texts at $\delta \leq 0.333$. Texts whose $\delta$ exceeds this threshold, i.e. the texts with low agreement, can then be selected for annotation analysis. 

In [96]:
ps_df.loc[(ps_df['d'] >= 1/3)]

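| | L | M | R | D | d |
| ---- | ---- | ---- | ---- | ---- | ---- |
| Psalms_070.iaa | 11 | 34 | 10 | 21 | 0.3818 |
| Psalms_032.iaa | 21 | 71 | 26 | 47 | 0.3983 |
| Psalms_020.iaa | 18 | 55 | 19 | 37 | 0.4022 |
| Psalms_017.iaa | 38 | 107 | 42 | 80 | 0.4278 |
| Psalms_101.iaa | 19 | 45 | 20 | 39 | 0.4643 |
| Psalms_067.iaa | 20 | 42 | 21 | 41 | 0.494 |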


### 3.2 Psalms_067.iaa 

Starting with `Psalms_067.iaa` with $\delta = 0.4940$, the file can be loaded in a pandas dataframe for clarity. 

In [15]:
column_names=('ann_A','ann_B', 'L', 'M', 'R', 'D', 'd')
data_types={'ann_A': str, 'ann_B': str ,'L': int, 'M': int, 'R': int, 'D': int, 'd': float}

# pandas was imported as `pd` above
ps067_df = pd.read_table('iaa-files/Psalms_067.iaa', 
                           sep=r'\s+',  # any run of whitespace separates columns
                           names=column_names,
                           dtype=data_types
                          )
ps067_df

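| | ann_A | ann_B | L | M | R | D | d |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| 0 | C1 | C5 | 0 | 6 | 0 | 0 | 0.0 |
| 1 | C3 | C3 | 2 | 3 | 0 | 2 | 0.4 |
| 2 | C4 | C2 | 1 | 12 | 7 | 8 | 0.4 |
| 3 | C5 | C4 | 9 | 12 | 0 | 9 | 0.4286 |
| 4 | C6 | C6 | 0 | 2 | 0 | 0 | 0.0 |
| 5 | C7 | C1 | 2 | 0 | 9 | 11 | 1.0 |
| 6 | S | S | 1 | 7 | 5 | 6 | 0.4615 |
| 7 | C2 | - | 5 | 0 | 0 | 5 | 1.0 |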


Recall that the IAA algorithm matches the *C* and *S* sets of annotations from **A** and **B** optimally. That is why a class of **A** can be matched with a differently numbered class of **B**: the class numbers are only labels, so a difference in numbering does not mean that different entities have been identified. For indices 0 and 4, the result

In [93]:
ps067_df.loc[[0,4]]

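| | ann_A | ann_B | L | M | R | D | d |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| 0 | C1 | C5 | 0 | 6 | 0 | 0 | 0.0 |
| 4 | C6 | C6 | 0 | 2 | 0 | 0 | 0.0 |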


means that there is complete agreement between **A** and **B** on which mentions refer to a certain entity. Indices 5 and 7

In [94]:
ps067_df.loc[[5,7]]

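| | ann_A | ann_B | L | M | R | D | d |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| 5 | C7 | C1 | 2 | 0 | 9 | 11 | 1.0 |
| 7 | C2 | - | 5 | 0 | 0 | 5 | 1.0 |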


indicate that there is complete disagreement between **A** and **B** on which mentions refer to a certain entity.

* *C7* and *C1* are matched, but they have no mention in common ($M = 0$): **A** and **B** have assigned entirely different mentions to these classes. 
* **A** has formed one class more than **B**, which can be concluded from the non-match (i.e. '-') at index 7. Since *C7* shares no mentions with its match either, the extra class of **A** can be taken to be *C7* or *C2*. 

For the remaining sets, all with $\delta \geq 1/4 $: 

In [101]:
ps067_df.loc[[1,2,3,6]]

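| | ann_A | ann_B | L | M | R | D | d |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| 1 | C3 | C3 | 2 | 3 | 0 | 2 | 0.4 |
| 2 | C4 | C2 | 1 | 12 | 7 | 8 | 0.4 |
| 3 | C5 | C4 | 9 | 12 | 0 | 9 | 0.4286 |
| 6 | S | S | 1 | 7 | 5 | 6 | 0.4615 |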


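* on index 2, **A** and **B** have 12 mentions in common, but **B** connects 7 more mentions to his *C2*, against 1 more that **A** connects to his *C4*.
* on index 3, **A** and **B** again have 12 mentions in common; **A** connects 9 more mentions to his *C5*, while **B** connects no further mentions to his *C4*.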
* in *S* **A** and **B** have 7 mentions in common, but **B** finds 5 more mentions as singletons.  

In [103]:
from iaa import compare_ann

compare_ann('chris_A/Psalms_067.ann', 'gyus_B/Psalms_067.ann')

C1	C5	0	6	0	0	0.0
C3	C3	2	3	0	2	0.4
C4	C2	1	12	7	8	0.4
C5	C4	9	12	0	9	0.4286
C6	C6	0	2	0	0	0.0
C7	C1	2	0	9	11	1.0
S	S	1	7	5	6	0.4615
C2	-	5	0	0	5	1.0
