<img align="right" src="images/tf-small.png" width="90"/>
<img align="right" src="images/etcbc.png" width="100"/>

# MiMi
## A Deterministic Coreference Resolver for Biblical Hebrew

MiMi is a deterministic, that is rule-driven, coreference resolver for Biblical Hebrew. The name concatenates the Biblical Hebrew *Mi Mi*, which means 'Who? Who?'. MiMi is a tool for identifying and analysing participants, or entities, in a text. It has two phases: mention detection and coreference resolution. The aim was to build a near-perfect mention detection phase and a modular coreference resolver that can be enhanced in future research. Most coreference resolvers, whether machine learning models or deterministic models, remove singletons in an additional third stage. We have chosen to retain singletons, since they also matter for participant analysis in Biblical Hebrew. As demonstrated in this notebook, MiMi can be used for any book of the Hebrew Bible, regardless of genre.

#### Phase One: Mention Detection
In the first phase the input text is tokenised and parsed for mentions. The API of [Text-Fabric](https://annotation.github.io/text-fabric/) is used to process the data of the BHSA [Hebrew Bible Database](https://etcbc.github.io/bhsa/), which contains the text of the Hebrew Bible augmented with linguistic annotations compiled by the Eep Talstra Centre for Bible and Computer ([ETCBC](http://etcbc.nl)). The mention parser takes [phrase atoms](https://etcbc.github.io/bhsa/features/otype/) as input. Because the ETCBC has already determined the boundaries of these phrase atoms, parsing mentions is much easier. Both the mention enrichment and the coreference resolution phase use many BHSA features; their documentation can be found under the phrase atom hyperlink above. The mention parsing is done with [SLY](https://sly.readthedocs.io/en/latest/index.html), a Python implementation of the lex and yacc tools.

#### Phase Two: Coreference Resolution
In the second phase the mentions are stored as a coreference list of singleton sets. These singleton sets are merged in a sequence of five sieves. MiMi resolves, in order: predicates, pronouns, vocatives, appositions and fronted elements. MiMi resolves easy-first: the easiest resolution choices are made first, based on explicit information that is already available in the database.
1. Predicates: MiMi's first sieve searches for explicit subject-predicate relations. This is done with the so-called mother-daughter relations between clauses and clause atoms that the ETCBC has analysed.
2. Pronouns: 1st and 2nd person mentions (verbs and pronouns) are resolved within the same paragraph, or domain. In the BHSA data one of the criteria for determining a domain is that the same subject is active, which makes 1st and 2nd person mentions relatively easy to resolve. 3rd person references are much harder and need more heuristic rules, which are time-consuming to program. 3rd person mentions are therefore left out of consideration.
3. Vocatives: vocative relations are characterised by 2nd person mentions. These are merged with the 2nd person predicate coreference classes.
4. Appositions: apposition relations are coded in the BHSA data. They are therefore easy to resolve.
5. Fronted elements: these are also coded in the BHSA data, but the coding is still largely incomplete. The sieve is nevertheless included for two reasons: it does resolve some extra relations, and the ETCBC aims to enrich the fronted element data in the near future.
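
The merging of singleton sets through an ordered sequence of sieves can be sketched as follows. This is a minimal illustration, not MiMi's actual implementation: the mention ids, the link pairs per sieve and the helper names `merge` and `run_sieves` are all hypothetical, and counting merges per sieve is an illustrative choice that may differ from how MiMi counts resolved corefs.

```python
# Hypothetical sketch: mention ids, link pairs and function names are
# invented for illustration and are not taken from mimi_hb.

def merge(classes, a, b):
    """Merge the classes containing mentions a and b.

    Returns True if two distinct classes were merged, False if a and b
    already belonged to the same class.
    """
    class_a = next(c for c in classes if a in c)
    class_b = next(c for c in classes if b in c)
    if class_a is class_b:
        return False
    classes.remove(class_b)
    class_a |= class_b  # grow class_a in place
    return True


def run_sieves(mentions, sieves):
    """Start from singleton sets and apply each sieve's links in order."""
    classes = [{m} for m in mentions]
    merges_per_sieve = {}
    for name, links in sieves:
        merges_per_sieve[name] = sum(merge(classes, a, b) for a, b in links)
    return classes, merges_per_sieve


mentions = ["m1", "m2", "m3", "m4", "m5", "m6"]
# Easy-first order: the sieves run in the sequence listed above.
sieves = [
    ("predicate", [("m1", "m2")]),
    ("pronoun", [("m2", "m3")]),
    ("vocative", []),
    ("apposition", [("m4", "m5"), ("m1", "m3")]),
    ("fronted element", []),
]

classes, merges = run_sieves(mentions, sieves)
print(classes)
print(merges)
```

In this toy run `m6` is never linked, so it remains a singleton set in the result, which mirrors the choice to retain singletons rather than remove them.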

#### Statistics 
For both the mention detection phase and the coreference resolution phase, statistics are generated per book of the Hebrew Bible:
* Mention detection statistics: show how often the mention parser succeeds and fails. The average success rate for the whole Hebrew Bible is 99.5%.
* Coreference resolution statistics: show how many coreference singleton sets have been resolved. MiMi resolves about 29.6% of the sets on average. Note that the unresolved sets contain both truly unresolved sets and genuine singletons, so the actual resolution percentage may be higher.
* Sieve statistics: show how many singleton sets are resolved per sieve.
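
As a worked example, the percentages can be recomputed from raw counts. The sketch below uses the Genesis counts printed in the output further down in this notebook and assumes that the parse percentages are taken relative to the number of input phrase atoms and the resolution percentages relative to the number of input corefs. The helper `percentage` is hypothetical and not part of `mimi_hb`.

```python
def percentage(part, whole):
    """Return part as a percentage of whole, rounded to one decimal."""
    return round(part / whole * 100, 1)


# Genesis counts as printed in the notebook output.
pa_input, pa_parsed, pa_errors = 15633, 15552, 81
input_corefs, resolved, unresolved = 15596, 4629, 10967

parse_success = percentage(pa_parsed, pa_input)
parse_error = percentage(pa_errors, pa_input)
coref_resolved = percentage(resolved, input_corefs)
coref_unresolved = percentage(unresolved, input_corefs)

print(parse_success, parse_error)
print(coref_resolved, coref_unresolved)
```

Under these assumptions the results agree with the Genesis figures reported in the output (99.5% and 0.5% for parsing, 29.7% and 70.3% for resolution).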

#### Files
For each book a plain text `.out` file is generated in the directory from which the coreference command is run. The files are called `mention_errors_BIBLEBOOKNAME`. Each `.out` file lists the phrase atoms that SLY's parser could not parse. For each of them it gives the token, the text of the token, the start index of the word, and the word node.

In [1]:
__author__ = 'erwich/sikkel'

In [2]:
import os
from pprint import pprint
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

import matplotlib
import seaborn as sns

In [3]:
from utils import ExportToLatex
from mimi_hb import GoMiMi

Parser debugging for MyParser written to parser.out


In [None]:
OUTPUT_LOC = os.path.expanduser('~/Documents/PhD/1-dissertation/DISSERTATIONlatex/Tables/')

In [4]:
mention_stats_df, coref_stats_df, sieve_stats_df = GoMiMi()


 Mention Parse Statistics Genesis: 
 15633 phrase atoms INPUT 
 15552 phrase atoms SUCCESFULLY parsed 
 +151 extra mentions SUCCESFULLY parsed from phrase atoms 
 -108 phrase atoms without mentions 
 81 phrase atom parse ERRORS 
 +0 extra mentions SUCCESFULLY parsed from phrase atom errors 
 -80 phrase atoms without mentions from phrase atom ERRORS 
 15596 mention coreference input 
 99.5% parsing succes 
 0.5% parsing error

 Coreference Resolution Statistics Genesis: 
 15596 total input corefs 
 4629 corefs RESOLVED 
 10967 corefs UNRESOLVED 
 29.7% corefs RESOLVED 
 70.3% corefs UNRESOLVED 
 2523 classes

 Sieve Statistics Genesis: 
 Predicate Sieve: 3246 
 Pronoun Sieve: 1007 
 Vocative Sieve: 63 
 Apposition Sieve: 300 
 Fronted Element Sieve: 13
 Total Sieves: 4629
 Total Classes: 2523
Counter({1: 15583, 4: 15552, 19: 10703, 36: 7597, 35: 7596, 55: 7433, 6: 6271, 13: 4883, 8: 4883, 16: 4802, 30: 4352, 28: 4352, 27: 4352, 22: 4265, 9: 4264, 46: 3904, 34: 3904, 42: 3227, 41: 2364,


 Mention Parse Statistics 1_Samuel: 
 10018 phrase atoms INPUT 
 9958 phrase atoms SUCCESFULLY parsed 
 +91 extra mentions SUCCESFULLY parsed from phrase atoms 
 -87 phrase atoms without mentions 
 60 phrase atom parse ERRORS 
 +1 extra mentions SUCCESFULLY parsed from phrase atom errors 
 -58 phrase atoms without mentions from phrase atom ERRORS 
 9965 mention coreference input 
 99.4% parsing succes 
 0.6% parsing error

 Coreference Resolution Statistics 1_Samuel: 
 9965 total input corefs 
 2950 corefs RESOLVED 
 7015 corefs UNRESOLVED 
 29.6% corefs RESOLVED 
 70.4% corefs UNRESOLVED 
 1799 classes

 Sieve Statistics 1_Samuel: 
 Predicate Sieve: 2225 
 Pronoun Sieve: 536 
 Vocative Sieve: 33 
 Apposition Sieve: 155 
 Fronted Element Sieve: 1
 Total Sieves: 2950
 Total Classes: 1799
Counter({1: 9987, 4: 9958, 19: 6580, 36: 4913, 35: 4912, 55: 4780, 6: 3627, 13: 3390, 8: 3390, 16: 3349, 30: 2894, 28: 2894, 27: 2894, 22: 2836, 9: 2834, 46: 1939, 34: 1939, 41: 1789, 42: 1742, 43: 114


 Mention Parse Statistics Hosea: 
 2067 phrase atoms INPUT 
 2066 phrase atoms SUCCESFULLY parsed 
 +14 extra mentions SUCCESFULLY parsed from phrase atoms 
 -14 phrase atoms without mentions 
 1 phrase atom parse ERRORS 
 +0 extra mentions SUCCESFULLY parsed from phrase atom errors 
 -1 phrase atoms without mentions from phrase atom ERRORS 
 2066 mention coreference input 
 100.0% parsing succes 
 0.0% parsing error

 Coreference Resolution Statistics Hosea: 
 2066 total input corefs 
 545 corefs RESOLVED 
 1521 corefs UNRESOLVED 
 26.4% corefs RESOLVED 
 73.6% corefs UNRESOLVED 
 317 classes

 Sieve Statistics Hosea: 
 Predicate Sieve: 388 
 Pronoun Sieve: 122 
 Vocative Sieve: 18 
 Apposition Sieve: 16 
 Fronted Element Sieve: 1
 Total Sieves: 545
 Total Classes: 317
Counter({4: 2066, 1: 2053, 19: 1415, 6: 887, 36: 857, 35: 857, 55: 852, 13: 650, 8: 650, 16: 643, 46: 590, 34: 590, 30: 519, 28: 519, 27: 519, 22: 512, 9: 512, 42: 459, 50: 312, 49: 264, 41: 223, 43: 111, 45: 101, 37: 

 98.9% parsing succes 
 1.1% parsing error

 Coreference Resolution Statistics Zephaniah: 
 553 total input corefs 
 128 corefs RESOLVED 
 425 corefs UNRESOLVED 
 23.1% corefs RESOLVED 
 76.9% corefs UNRESOLVED 
 71 classes

 Sieve Statistics Zephaniah: 
 Predicate Sieve: 84 
 Pronoun Sieve: 28 
 Vocative Sieve: 10 
 Apposition Sieve: 6 
 Fronted Element Sieve: 0
 Total Sieves: 128
 Total Classes: 71
Counter({1: 550, 4: 547, 19: 386, 6: 236, 36: 233, 35: 233, 55: 219, 46: 169, 34: 169, 13: 167, 8: 167, 16: 164, 30: 146, 28: 146, 27: 146, 22: 140, 9: 140, 50: 94, 42: 94, 37: 76, 41: 72, 49: 70, 45: 60, 43: 60, 47: 38, 17: 12, 54: 9, 14: 8, 12: 8, 23: 6, 3: 6, 48: 5, 53: 5, 40: 5, 18: 4, 2: 3, 32: 3, 10: 3, 15: 3, 31: 2, 39: 2, 58: 1, 7: 1})

 Mention Parse Statistics Haggai: 
 382 phrase atoms INPUT 
 378 phrase atoms SUCCESFULLY parsed 
 +16 extra mentions SUCCESFULLY parsed from phrase atoms 
 -4 phrase atoms without mentions 
 4 phrase atom parse ERRORS 
 +0 extra mentions SUCCESFULL


 Mention Parse Statistics Song_of_songs: 
 1162 phrase atoms INPUT 
 1158 phrase atoms SUCCESFULLY parsed 
 +5 extra mentions SUCCESFULLY parsed from phrase atoms 
 -7 phrase atoms without mentions 
 4 phrase atom parse ERRORS 
 +0 extra mentions SUCCESFULLY parsed from phrase atom errors 
 -4 phrase atoms without mentions from phrase atom ERRORS 
 1156 mention coreference input 
 99.7% parsing succes 
 0.3% parsing error

 Coreference Resolution Statistics Song_of_songs: 
 1156 total input corefs 
 391 corefs RESOLVED 
 765 corefs UNRESOLVED 
 33.8% corefs RESOLVED 
 66.2% corefs UNRESOLVED 
 138 classes

 Sieve Statistics Song_of_songs: 
 Predicate Sieve: 141 
 Pronoun Sieve: 199 
 Vocative Sieve: 46 
 Apposition Sieve: 4 
 Fronted Element Sieve: 1
 Total Sieves: 391
 Total Classes: 138
Counter({4: 1158, 1: 1154, 19: 887, 6: 595, 36: 524, 55: 524, 35: 524, 46: 377, 34: 377, 42: 325, 30: 291, 28: 291, 27: 291, 22: 287, 9: 287, 16: 262, 13: 262, 8: 262, 49: 211, 50: 158, 43: 141, 45: 


 Mention Parse Statistics 1_Chronicles: 
 7353 phrase atoms INPUT 
 7326 phrase atoms SUCCESFULLY parsed 
 +191 extra mentions SUCCESFULLY parsed from phrase atoms 
 -21 phrase atoms without mentions 
 27 phrase atom parse ERRORS 
 +0 extra mentions SUCCESFULLY parsed from phrase atom errors 
 -26 phrase atoms without mentions from phrase atom ERRORS 
 7497 mention coreference input 
 99.6% parsing succes 
 0.4% parsing error

 Coreference Resolution Statistics 1_Chronicles: 
 7497 total input corefs 
 1544 corefs RESOLVED 
 5953 corefs UNRESOLVED 
 20.6% corefs RESOLVED 
 79.4% corefs UNRESOLVED 
 1013 classes

 Sieve Statistics 1_Chronicles: 
 Predicate Sieve: 894 
 Pronoun Sieve: 204 
 Vocative Sieve: 48 
 Apposition Sieve: 388 
 Fronted Element Sieve: 10
 Total Sieves: 1544
 Total Classes: 1013
Counter({4: 7326, 1: 7324, 19: 6102, 36: 5199, 35: 5199, 55: 5144, 6: 3927, 41: 2991, 30: 2164, 28: 2164, 27: 2164, 22: 1984, 9: 1983, 46: 1915, 34: 1915, 13: 1394, 8: 1394, 16: 1386, 37: 1

In [None]:
# average mention detection success HB
avg_mention_success = round((1 - (mention_stats_df['pa errors'].sum() / mention_stats_df['pa parsed'].sum())) * 100, 1)

# average mention detection failure HB
avg_mention_failure = round((mention_stats_df['pa errors'].sum() / mention_stats_df['pa parsed'].sum()) * 100, 1)

print('Mention detection for the Hebrew Bible: \n',\
    f'{avg_mention_success}% average success \n',\
    f'{avg_mention_failure}% average failure')

In [None]:
# average coreference resolution success HB
avg_coreference_resolved = round((1 - coref_stats_df['unresolved'].sum() / coref_stats_df['input corefs'].sum()) * 100, 1)

# average coreference resolution failure (unresolved) HB
avg_coreference_unresolved = round((coref_stats_df['unresolved'].sum() / coref_stats_df['input corefs'].sum()) * 100, 1)

print('Coreference Resolution for the Hebrew Bible: \n',\
    f'{avg_coreference_resolved}% average resolved \n',\
    f'{avg_coreference_unresolved}% average unresolved')
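
The cell above pools the counts of all books (sum of unresolved divided by sum of input corefs), which weights long books more heavily than a plain mean of the per-book percentages would. The sketch below contrasts the two using only the six books whose coreference counts are visible in the truncated output above, so its results are not the whole-Hebrew-Bible figures. The dictionary `counts` and the variable names are hypothetical.

```python
# (input corefs, resolved) per book, copied from the output above.
# Only books whose counts are visible in the truncated output are included.
counts = {
    "Genesis": (15596, 4629),
    "1_Samuel": (9965, 2950),
    "Hosea": (2066, 545),
    "Zephaniah": (553, 128),
    "Song_of_songs": (1156, 391),
    "1_Chronicles": (7497, 1544),
}

total_input = sum(inp for inp, _ in counts.values())
total_resolved = sum(res for _, res in counts.values())

# Pooled average: every mention counts equally, so long books dominate.
pooled = round(total_resolved / total_input * 100, 1)

# Mean of per-book percentages: every book counts equally.
per_book = [res / inp * 100 for inp, res in counts.values()]
mean_of_books = round(sum(per_book) / len(per_book), 1)

print(f"pooled: {pooled}%  mean of books: {mean_of_books}%")
```

Which of the two is the more useful summary depends on whether the question is about mentions in the corpus as a whole or about typical behaviour per book.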

In [None]:
def PlotMentionDf(mention_stats_df):

    mention_stats_sort = mention_stats_df.sort_values(by=['%error'], ascending=False)
    
    mention_stats_sort.plot(x='book', y='%error', kind='bar', 
                            color='xkcd:deep blue', 
                            figsize=(30,12),
                            fontsize=30
                           )
    
    plt.title(('MiMi Mention Detection Parsing Errors in %'), fontsize=30)
    plt.xlabel('Book', fontsize=30)
    plt.ylabel('Parsing Errors in %', fontsize=30)
    plt.box(False)
    plt.legend(frameon=False, fontsize=30)
    plt.show()
    
PlotMentionDf(mention_stats_df)

In [None]:
mention_stats_df

In [None]:
def PlotCoreferenceDf(coref_stats_df):

    coref_stats_sort = coref_stats_df.sort_values(by=['%resolved'], ascending=False)
    
    coref_stats_sort.plot(x='book', y='%resolved', kind='bar', color='xkcd:deep blue', 
                  figsize=(30,12),
                  fontsize=30, 
                 )

    plt.title(('MiMi Coreference Resolution Success in %'), fontsize=30)
    plt.xlabel('Book', fontsize=30)
    plt.ylabel('Resolved in %', fontsize=30)
    plt.box(False)
    plt.legend(frameon=False, fontsize=30)
    plt.ylim(ymin=15)
    plt.show()
    
PlotCoreferenceDf(coref_stats_df)

In [None]:
coref_stats_df

In [None]:
ExportToLatex(OUTPUT_LOC, 'coref_stats_hb', coref_stats_df, indx = False)

In [None]:
sieve_stats_df

In [None]:
#sns.choose_colorbrewer_palette('sequential')
sieves = ('predicate sieve', 'pronoun sieve', 'vocative sieve',
          'apposition sieve', 'fronted element sieve')
colours = sns.color_palette("Blues", n_colors=len(sieves)) #"RdGy"

#books = sieve_stats_df['book']

df = sieve_stats_df.drop(['total sieves', 'classes'], axis=1)
df = df.set_index('book')

res = df.div(df.sum(axis=1), axis=0)*100
res = res.reset_index()

matplotlib.rc('xtick', labelsize=18) 
matplotlib.rc('ytick', labelsize=18) 
res.plot(kind='bar', x='book', stacked=True, figsize=(10,9), color=colours)
plt.title(('Coreference Sieves'), fontsize=35)
plt.xlabel('Book', fontsize=30)
plt.ylabel('Resolved in %', fontsize=30)
plt.box(False)
plt.legend(frameon=False, fontsize=12, loc='best', bbox_to_anchor=(.9, .5, 0.5, 0.5))
plt.show()

In [None]:
res

In [None]:
ExportToLatex(OUTPUT_LOC, 'sieve_stats_hb', sieve_stats_df, indx = False)