<img align="right" src="images/tf-small.png" width="90"/>
<img align="right" src="images/etcbc.png" width="100"/>

# Validating Manual Annotations and MiMi's Results

## 1. Introduction

MiMi's results can be cross-validated against the author's manual coreference annotations, which are available through [analyseParticipants](https://github.com/cmerwich/participant-analysis/blob/master/tf_conversion/analyseParticipants.ipynb), because both are based on a similar ontology, namely the [annotation model](https://github.com/cmerwich/participant-analysis/blob/master/annotation/annotation_model.ipynb). This comparison gives an impression of the quality of both the annotations and the results that MiMi has produced. Although the manual annotations were made on the `2017` version of the BHSA data and MiMi takes the `C` version as input, both data sets are stable and similar enough for a valid comparison, since Biblical Hebrew (BH) as a language does not evolve.

The annotation process yielded 18571 mentions and 2001 classes; MiMi has produced 18484 mentions and 2252 classes. One could say: the better MiMi performs, the more mentions are resolved into a class, which results in fewer classes.

The annotations and MiMi's results are compared in two ways. First, the results of both methods are compared for the mention detection stage by parsing the mentions from the manual annotation and MiMi `.ann` files into sets. This is done in § 3. Second, the results of both methods are compared with the inter-annotator agreement (IAA) algorithm that was developed in [iaa](https://github.com/cmerwich/participant-analysis/tree/master/iaa). This is done in § 4.

## 2. Load Modules and Data

In [28]:
import os
import pandas as pd
from parse import Parse
from acc import print_total
from utils import ExportToLatex

In [23]:
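# Directory that receives the exported LaTeX tables; `ExportToLatex` below needs it.
# Adjust the path to your own setup.
OUTPUT = os.path.expanduser('~/Documents/PhD/1-dissertation/DISSERTATIONlatex/Tables/')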

In [None]:
path_manual = os.path.expanduser('~/Sites/brat/data/coref/Psalms/annotate')
# Path to corrected MiMi ann files. The mention indices in these files have been reindexed so that 
# they fit the mention indices for the manual annotations. 
path_mimi_trans = os.path.expanduser('~/github/cmerwich/participant-analysis/iaa-ann-vs-mimi')

## 3. Mention Detection: Manual vs MiMi

Table `df_mentions` below compares the mentions annotated by the author with those produced by MiMi. The comparison is made by performing set calculations on the mention boundaries, i.e. the start and end index of each mention. MiMi produces textual data that is easier to read than that of the annotation method: it inserts '-' between concatenated words and '+' to indicate suffixes. As a consequence, the mention boundaries produced by the two methods differ. To enable a comparison, the algorithm `translate` aligns the text indices of the mentions of both methods. After alignment some mention boundaries are still slightly off, because MiMi's mention detection grammar has been designed to include articles with the mentions. The resulting differences are minor, so the mention comparison is considered representative.
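
The set calculations on mention boundaries can be sketched on toy data. The boundaries below are invented for illustration and do not come from the corpus; `manual` and `mimi` are hypothetical stand-ins for the parsed and aligned mention sets. The pair `(12, 18)` versus `(10, 18)` mimics a boundary that is slightly off because one method includes an article.

```python
# Hypothetical mention boundaries as (start, end) text indices.
manual = {(0, 4), (5, 9), (12, 18), (20, 25)}
mimi = {(0, 4), (5, 9), (10, 18), (20, 25), (30, 33)}

common = manual & mimi        # boundaries found by both methods
only_manual = manual - mimi   # boundaries found only in the manual annotations
only_mimi = mimi - manual     # boundaries found only by MiMi
symmetric = manual ^ mimi     # all boundaries on which the methods disagree

print(sorted(common))
print(sorted(only_manual))
print(sorted(only_mimi))
print(len(symmetric))
```

Because the comparison is on exact `(start, end)` pairs, a boundary that differs by a single index counts as a full mismatch on both sides.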

The columns *manual* and *mimi* give the number of mentions produced by the manual annotations and by MiMi respectively. Let $A$ be the set of manually annotated mentions and $B$ the set of MiMi's mentions. Column $L$ gives the size of the set difference $B \setminus A$ (mentions missing from the manual annotations, `man_diff` in the code) and column $R$ the size of $A \setminus B$ (mentions missing from MiMi's results, `mimi_diff` in the code). $M$ is the size of the intersection $A \cap B$, $D$ the size of the symmetric difference, and $d_{j}$ the Jaccard distance, listed in column *d*. Note that $d_{j}$ is not the combined metric $d_{c}$ from the IAA algorithm, since $d_{c}$ also calculates (dis)similarity for coreference classes. $d_{j}$ takes a value $0 \leq d_{j} \leq 1$, where $0$ denotes total similarity and $1$ total dissimilarity. Column *%common* gives the percentage that the two sets have in common: the size of the intersection of the manual annotations and MiMi's results divided by the number of MiMi's mentions:

\begin{equation}
   \%common = \frac{|A \cap B|}{|B|} \cdot 100 = 
    \frac{8240}{8978} \cdot 100 \approx 91.8
\end{equation}

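Since the mentions that MiMi has detected are nearly flawless, MiMi's results have been taken as the baseline for this calculation. The denominator 8978 is the number of unique elements in MiMi's mention set (`len(mentions_mimi)` after conversion to a set), not the 18484 mentions that were parsed.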

Considering both the mistakes that inevitably occur in an annotation process and MiMi's fully systematic mention detection, a distance of $d_{j} \approx 0.1213$ (conversely, a similarity of $j \approx 0.8787$) and an overlap of 91.8% can be qualified as an achievement. The dissimilarity of 0.1213 is probably due to errors made during annotation, such as wrong demarcation of suffixes or wrong annotation of mention types.
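
As a quick arithmetic check, the three figures quoted above can be reproduced from the set sizes reported in the cell outputs of this notebook (intersection 8240, union 9378, MiMi's set 8978). The variable names are chosen for this sketch only.

```python
# Set sizes taken from the cell outputs in this notebook.
n_common = 8240  # size of the intersection of manual and MiMi mentions
n_union = 9378   # size of the union of manual and MiMi mentions
n_mimi = 8978    # number of unique mentions in MiMi's set

jaccard_index = n_common / n_union
jaccard_distance = (n_union - n_common) / n_union
percent_common = n_common / n_mimi * 100

print(round(jaccard_distance, 4))  # 0.1213
print(round(jaccard_index, 4))     # 0.8787
print(round(percent_common, 1))    # 91.8
```

The Jaccard index and the Jaccard distance sum to 1, which is why either of them can be reported.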

Though not unexpected, the relatively high consistency between the annotations and MiMi can be explained by the use of the [annotation aid](https://github.com/cmerwich/participant-analysis/blob/master/annotation/2.annotation_aid.ipynb) that was developed for the annotation process. The annotation aid visualises potential mention data in a way that is similar to the structure of the phrase atoms that MiMi takes as input.

In [8]:
mentions_manual = Parse(path_manual)
tot_manual = mentions_manual
mentions_manual = set(mentions_manual)

18571


In [9]:
mentions_mimi = Parse(path_mimi_trans)
tot_mimi = mentions_mimi
mentions_mimi = set(mentions_mimi)

18484


In [10]:
# Both variables are already sets, so no further conversion is needed.
len(mentions_manual) / len(mentions_mimi) * 100

96.23524170193807

In [11]:
percent_common = len(mentions_manual & mentions_mimi) / len(mentions_mimi) * 100

round_percent = round(percent_common, 1)
round_percent

91.8

In [12]:
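# MiMi's mentions serve as the reference set, so precision is the share of manual
# mentions that MiMi confirms and recall the share of MiMi's mentions that were annotated.
precision = len(mentions_manual & mentions_mimi) / len(mentions_manual) * 100
recall = len(mentions_manual & mentions_mimi) / len(mentions_mimi) * 100

# Rounding is disabled here: both values are reported as 0 in `df_mentions`.
#round_precison = round(precision, 1)
round_precison = 0
#round_recall = round(recall, 1)
round_recall = 0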
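
The precision and recall computation can be illustrated on toy data. This is a minimal sketch: `precision_recall` is a hypothetical helper that is not part of the repository, and the sets are invented. It assumes, as in the cell above, that the second set is treated as the reference.

```python
def precision_recall(candidate, reference):
    """Return (precision, recall) in percent for two sets of mentions."""
    if not candidate or not reference:
        raise ValueError("both sets must be non-empty")
    hits = len(candidate & reference)
    return hits / len(candidate) * 100, hits / len(reference) * 100


# Hypothetical mention boundaries; the second set plays the role of MiMi.
candidate = {(0, 4), (5, 9), (12, 18), (20, 25)}
reference = {(0, 4), (5, 9), (10, 18), (20, 25), (30, 33)}

precision, recall = precision_recall(candidate, reference)
print(round(precision, 1), round(recall, 1))
```

With the reference set in the denominator, recall is the same quantity as *%common*.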

In [13]:
len(mentions_mimi)

8978

In [14]:
intersection = len(mentions_manual & mentions_mimi)
intersection

8240

In [15]:
union = len(mentions_manual | mentions_mimi)
union

9378

In [16]:
man_diff = len(mentions_mimi - mentions_manual)
man_diff

738

In [17]:
mimi_diff = len(mentions_manual - mentions_mimi)
mimi_diff

400

In [18]:
symm_diff = man_diff + mimi_diff
symm_diff

1138

In [19]:
D = len(mentions_manual | mentions_mimi) - len(mentions_manual & mentions_mimi)
D

1138

In [20]:
# Jaccard distance 
dj = (len(mentions_manual | mentions_mimi) - len(mentions_manual & mentions_mimi)) / \
len(mentions_manual | mentions_mimi)

round_dj = round(dj, 4)

round_dj

0.1213

In [21]:
# Jaccard index
j = len(mentions_manual & mentions_mimi) / ((len(mentions_manual) + len(mentions_mimi)) - len(mentions_manual & mentions_mimi))

round_j = round(j, 4)
round_j

0.8787

In [26]:
cols = ['manual', 'L', 'M', 'R', 
        'mimi', 'D', 'd', '%common', 'precision', 'recall']

df_mentions = pd.DataFrame([[len(tot_manual), man_diff, intersection, mimi_diff, 
        len(tot_mimi), D, round_dj, round_percent, round_precison, round_recall]],
                  index=['mentions'],
                  columns=cols
                 )
df_mentions

| | manual | L | M | R | mimi | D | d | %common | precision | recall |
|---|---|---|---|---|---|---|---|---|---|---|
| mentions | 18571 | 738 | 8240 | 400 | 18484 | 1138 | 0.1213 | 91.8 | 0 | 0 |


In [27]:
ExportToLatex(OUTPUT, 'manual_mimi_mentions', df_mentions, indx = True)

Run the cell below to inspect the count of differences per Psalm. The makefile `Translate` produces the `diff` files.

In [1]:
! wc -l *.diff | sort -n

       0 Psalms_047.diff
       0 Psalms_093.diff
       0 Psalms_117.diff
       0 Psalms_120.diff
       0 Psalms_128.diff
       2 Psalms_058.diff
       2 Psalms_067.diff
       2 Psalms_070.diff
       2 Psalms_126.diff
       4 Psalms_053.diff
       4 Psalms_075.diff
       4 Psalms_125.diff
       5 Psalms_108.diff
       5 Psalms_134.diff
       7 Psalms_110.diff
       7 Psalms_112.diff
       7 Psalms_149.diff
       8 Psalms_147.diff
      10 Psalms_029.diff
      10 Psalms_062.diff
      10 Psalms_127.diff
      10 Psalms_138.diff
      11 Psalms_046.diff
      11 Psalms_100.diff
      12 Psalms_020.diff
      13 Psalms_015.diff
      13 Psalms_048.diff
      14 Psalms_002.diff
      14 Psalms_021.diff
      14 Psalms_099.diff
      14 Psalms_113.diff
      14 Psalms_123.diff
      14 Psalms_130.diff
      14 Psalms_131.diff
      15 Psalms_004.diff
      15 Psalms_150.diff
      16 Psalms_122.diff
      16 Psalms_124.diff
      16 Psa
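
Where a shell is not available, the same overview can be produced in Python. This is a sketch that assumes the `.diff` files are in the current working directory; `diff_line_counts` is a hypothetical helper name. Unlike `wc -l`, it also counts a final line that lacks a trailing newline.

```python
from pathlib import Path


def diff_line_counts(directory="."):
    """Return (line_count, file_name) pairs for all *.diff files, sorted ascending."""
    counts = []
    for path in Path(directory).glob("*.diff"):
        with path.open(encoding="utf-8") as handle:
            counts.append((sum(1 for _ in handle), path.name))
    return sorted(counts)


for count, name in diff_line_counts():
    print(f"{count:8d} {name}")
```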

Run the cell below to run the IAA algorithm. Make sure you have the right files in the right place.

In [None]:
#! make

## 4. IAA for Coreference Resolution: Manual vs MiMi

Table `manual_mimi_tot` below presents the total IAA measures for the manual coreference annotations and MiMi's coreference data. The IAA measures for each Psalm are presented in table `manual_mimi_df`. Both tables have the same column names. Only the results in `manual_mimi_tot` are discussed here.

The columns $L$ and $R$ denote the set difference for the manual coreference annotations and MiMi's coreference results respectively. $M$ indicates the set intersection of the annotations: both methods have 8702 coreference annotations in common. $D$ indicates the symmetric difference: there is a total difference of 19651 mentions between both methods. The total combined distance measure is $d_{c_{L, R}} \approx 0.6931$. 

Subtracting the total distance from 1 gives the agreement measure for the annotations and MiMi's results: $1 - 0.6931 \approx 0.3069$, or 30.7% when expressed as a percentage. These measures are not unexpected. Though MiMi detects mentions nearly perfectly for the Psalms, it harvests only the most explicit coreference relations. MiMi is thus more successful at detecting mentions than the manual annotation method, even when the mentions lost through mention errors are taken into account, but it is clearly less successful at resolving coreference relations; for that task the annotation method is the most successful. The IAA algorithm matches the detected coreference classes as optimally as possible. Since MiMi cannot detect all coreference relations, data are missing, which results in a less optimal matching and thus a higher $d_{c}$ measure.
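
The totals reported for `total_psalms` (9869, 8702, 9782, 19651 and 0.6931) are numerically consistent with $D = L + R$ and $d = D / (M + D)$. The sketch below checks this arithmetic; it is an observation on the reported figures, not a description of how the IAA algorithm computes $d_{c}$ internally.

```python
# Totals as reported for `total_psalms` in this notebook.
L, M, R = 9869, 8702, 9782

D = L + R                 # symmetric difference
distance = D / (M + D)    # assumed relation, consistent with the reported value
agreement = 1 - distance

print(D)                          # 19651
print(round(distance, 4))         # 0.6931
print(round(agreement * 100, 1))  # 30.7
```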

MiMi has not fully made use of the potential for resolving coreference in the BHSA data. There are more features and syntactical relations that can be harvested for coreference. The $d_{c}$ measure is therefore only an indication of what is possible in future research.

The IAA calculations for the annotations and MiMi also allow for a qualitative, comparative analysis of the results with [`retrieve_ann`](https://github.com/cmerwich/participant-analysis/blob/master/iaa/retrieve_iaa.py). This method will be used in the notebook [`confrontation-ps75`](https://github.com/cmerwich/participant-analysis/blob/master/confrontation/confrontation-ps75.ipynb) where the case study of Psalm 75 will be revisited. 

In [5]:
tot_column_names=['-','L', 'M', 'R', 'D', 'd']
tot_data_types={'-': str, 'L': int, 'M': int, 'R': int, 'D': int, 'd': float}
cols=['L', 'M', 'R', 'D', 'd']

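# `sep=r'\s+'` splits on runs of whitespace; it replaces `delim_whitespace=True`,
# which is deprecated in recent pandas versions.
manual_mimi_df = pd.read_table('total_psalms',
                               sep=r'\s+',
                               names=tot_column_names,
                               dtype=tot_data_types
                              ).drop(columns='-').sort_values(by='d')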
manual_mimi_df

| | L | M | R | D | d |
|---|---|---|---|---|---|
| Psalms_150.iaa | 13 | 31 | 10 | 23 | 0.4259 |
| Psalms_093.iaa | 11 | 29 | 11 | 22 | 0.4314 |
| Psalms_110.iaa | 18 | 41 | 17 | 35 | 0.4605 |
| Psalms_128.iaa | 12 | 27 | 12 | 24 | 0.4706 |
| Psalms_020.iaa | 24 | 49 | 24 | 48 | 0.4948 |
| Psalms_120.iaa | 12 | 24 | 12 | 24 | 0.5000 |
| Psalms_141.iaa | 31 | 62 | 31 | 62 | 0.5000 |
| Psalms_047.iaa | 18 | 35 | 18 | 36 | 0.5070 |
| Psalms_063.iaa | 32 | 61 | 33 | 65 | 0.5159 |
| Psalms_112.iaa | 26 | 45 | 24 | 50 | 0.5263 |


In [32]:
manual_mimi_df_index = manual_mimi_df.reset_index()

In [33]:
ExportToLatex(OUTPUT, 'manual_mimi_all_index', manual_mimi_df_index, indx = True)

In [6]:
name_ps, Lt_ps, Mt_ps, Rt_ps, Dt_ps, dt_ps = print_total('total_psalms')

manual_mimi_tot = pd.DataFrame([[Lt_ps, Mt_ps, Rt_ps, Dt_ps, dt_ps]],
                  index=['iaa total'],
                  columns=cols
                 )
manual_mimi_tot

total_psalms	-	9869	8702	9782	19651	0.6931


| | L | M | R | D | d |
|---|---|---|---|---|---|
| iaa total | 9869 | 8702 | 9782 | 19651 | 0.6931 |


In [25]:
ExportToLatex(OUTPUT, 'manual_mimi_tot', manual_mimi_tot, indx = True)