<img align="right" src="images/tf-small.png" width="90"/>
<img align="right" src="images/etcbc.png" width="100"/>


# Statistics of a Coreference-annotated Corpus for Biblical Hebrew

## 1. Introduction

This notebook contains several functions that offer descriptive statistics of the corpus of texts that have been annotated for coreference:

* Genesis 1
* Numbers 
* Isaiah 42
* Psalms 

Genesis 1, Isaiah 42 and the whole book of Psalms have been annotated by me (Christiaan), and the whole book of Numbers has been annotated by Gyusang Jin. 

The statistics for the Psalms are part of my dissertation. They are generated with the code in `parse_ann` and shown as `Pandas` data frames. Each data frame can be exported as a LaTeX table with the function `ExportToLatex()`. The function takes four arguments: the output location on your PC (`OUTPUT_LOC`); the name of the LaTeX table in string form, e.g. `'overall_coref_ann'`; the data frame as generated in this notebook, e.g. `overall_df`; and `indx`, which is `True` if the index is needed and `False` otherwise.

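The implementation of `ExportToLatex()` lives in `parse_ann` and is not shown in this notebook. As a rough illustration of the four arguments described above, a function with the same argument order could look like the sketch below. The name `export_to_latex_sketch`, the `.tex` file extension and the choice to return the written path are assumptions made for this example; the only pandas call relied on is `DataFrame.to_latex(index=...)`.

```python
import os


def export_to_latex_sketch(output_loc, table_name, df, indx=False):
    """Hypothetical stand-in for ExportToLatex(); not the code from parse_ann.

    output_loc -- directory in which the table file is written
    table_name -- file name without extension, e.g. 'overall_coref_ann'
    df         -- a pandas DataFrame (any object with a to_latex(index=...) method)
    indx       -- True to include the row index in the LaTeX table
    """
    # Create the output directory if it does not exist yet.
    os.makedirs(output_loc, exist_ok=True)

    # Assumption: one file per table, named after the table, with a .tex extension.
    path = os.path.join(output_loc, f"{table_name}.tex")

    # Let pandas render the tabular environment, then write it to disk.
    latex = df.to_latex(index=indx)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(latex)
    return path
```

With a data frame from this notebook, a call would then read `export_to_latex_sketch(OUTPUT_LOC, 'overall_coref_ann', overall_df, indx=True)`.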


## 2. Load modules

In [13]:
%load_ext autoreload
%autoreload 2

import os
from parse_ann import *

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## 3. Specify output location

In [14]:
OUTPUT_LOC = os.path.expanduser('~/Documents/PhD/1-dissertation/DISSERTATIONlatex/Tables/')

## 4. Specify corpus

In [15]:
my_book_name = 'Psalms'
from_chapter = 1
to_chapter = 150

## 5. Run code

In [16]:
mentions, corefs, suffix_errors = TexFabricParse(my_book_name, from_chapter, to_chapter)
EnrichMentions(mentions)
IdentifyEntities(corefs)

In [17]:
(overall_df, pos_df,
 pronoun_df, pronoun_pos_class_df,
 pronoun_pos_sing_df) = MakePandasTables(corefs, mentions)

## 6. Print tables 

In [18]:
PrintThisTable(overall_df)

| | total |
|---|---:|
| mentions | 18571 |
| singletons | 4788 |
| classes | 2001 |
| notes | 715 |

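A few derived figures follow directly from these totals. The sketch below assumes that every mention is either a singleton or a member of exactly one coreference class, which agrees with the `total_type` column of the part-of-speech table (13783 in class, 4788 singletons). The variable names are illustrative and do not come from `parse_ann`.

```python
# Totals copied from the overall table for Psalms 1-150.
overall = {"mentions": 18571, "singletons": 4788, "classes": 2001, "notes": 715}

# Assumption: a mention that is not a singleton belongs to a coreference class.
in_class = overall["mentions"] - overall["singletons"]

# Share of all mentions that take part in a coreference class, in percent.
in_class_share = 100 * in_class / overall["mentions"]

# Mean number of mentions per coreference class.
mean_class_size = in_class / overall["classes"]

print(in_class)                   # 13783
print(round(in_class_share, 1))   # 74.2
print(round(mean_class_size, 2))  # 6.89
```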

In [None]:
#print(overall_df.to_latex(index=True))

ExportToLatex(OUTPUT_LOC, 'overall_coref_ann', overall_df, indx = True)

In [19]:
PrintThisTable(pos_df)

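| | NP | VP | PPrP | PrNP | DPrP | PtcP | AdjP | CP | AdvP | suffix | PP | prep | advb | art | total_type |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| in class | 3090 | 4981 | 287 | 795 | 31 | 16 | 2 | 7 | 3 | 4569 | 2 | 0 | 0 | 0 | 13783 |
| singleton | 4403 | 98 | 2 | 164 | 15 | 1 | 24 | 3 | 18 | 40 | 0 | 14 | 5 | 1 | 4788 |
| total | 7493 | 5079 | 289 | 959 | 46 | 17 | 26 | 10 | 21 | 4609 | 2 | 14 | 5 | 1 | 18571 |
| % total | 40 | 27 | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 25 | 0 | 0 | 0 | 0 | 100 |
| first in chain | 1001 | 702 | 155 | 122 | 13 | 4 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 2001 |
| % chain | 50 | 35 | 8 | 6 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 100 |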

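The `% total` and `% chain` rows are consistent with each count being divided by its row total and rounded to a whole percentage. The sketch below reproduces both rows from the counts in the table. It is one plausible derivation, not the computation used in `parse_ann`, and the helper name `percent_row` is made up for this example.

```python
columns = ["NP", "VP", "PPrP", "PrNP", "DPrP", "PtcP", "AdjP",
           "CP", "AdvP", "suffix", "PP", "prep", "advb", "art"]

# Counts copied from the "total" and "first in chain" rows of the table.
total = [7493, 5079, 289, 959, 46, 17, 26, 10, 21, 4609, 2, 14, 5, 1]
first_in_chain = [1001, 702, 155, 122, 13, 4, 2, 1, 1, 0, 0, 0, 0, 0]


def percent_row(counts):
    """Return each count as a whole-number percentage of the row sum."""
    row_sum = sum(counts)
    return [round(100 * count / row_sum) for count in counts]


pct_total = dict(zip(columns, percent_row(total)))
pct_chain = dict(zip(columns, percent_row(first_in_chain)))

print(pct_total["NP"], pct_total["VP"], pct_total["suffix"])  # 40 27 25
print(pct_chain["NP"], pct_chain["VP"], pct_chain["PPrP"])    # 50 35 8
```

Because each cell is rounded separately, the rounded percentages in a row need not add up to exactly 100; the 100 in the `total_type` column is the row sum expressed as a percentage of itself.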

In [12]:
#print(pos_df.to_latex(index=True))

ExportToLatex(OUTPUT_LOC, 'pos_coref_ann', pos_df, indx = True)

In [9]:
PrintThisTable(pronoun_df)

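| | p1upl | p1usg | p2fsg | p2mpl | p2msg | p3fpl | p3fsg | p3mpl | p3msg | p3upl | ufpl | ufsg | umpl | umsg | uuu | total_pgn |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| in class | 332 | 2415 | 29 | 281 | 2282 | 21 | 344 | 1173 | 2089 | 385 | 5 | 21 | 91 | 284 | 80 | 9832 |
| singleton | 11 | 13 | 0 | 10 | 16 | 0 | 3 | 8 | 34 | 7 | 0 | 1 | 9 | 8 | 13 | 133 |
| total | 343 | 2428 | 29 | 291 | 2298 | 21 | 347 | 1181 | 2123 | 392 | 5 | 22 | 100 | 292 | 93 | 9965 |
| % total | 3 | 24 | 0 | 3 | 23 | 0 | 3 | 12 | 21 | 4 | 0 | 0 | 1 | 3 | 1 | 100 |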

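The column labels appear to pack person, gender and number into one string, for example `p1usg` as first person, unknown gender, singular, and `umsg` as unknown person, masculine, singular. The notebook does not define the labels, so this reading is an assumption. Under that assumption, the sketch below splits a label into its three parts and sums the `total` row per person; `split_pgn` is a hypothetical helper, not a function from `parse_ann`.

```python
from collections import defaultdict


def split_pgn(label):
    """Split a label such as 'p3msg' into (person, gender, number).

    Assumed layout: person is 'p1', 'p2', 'p3' or 'u' (unknown),
    gender is a single character, and number is whatever remains.
    """
    if label[0] == "p" and label[1:2].isdigit():
        person, rest = label[:2], label[2:]
    else:
        person, rest = label[:1], label[1:]
    gender, number = rest[:1], rest[1:]
    return person, gender, number


# Counts copied from the "total" row of the pronoun table.
total_row = {
    "p1upl": 343, "p1usg": 2428, "p2fsg": 29, "p2mpl": 291, "p2msg": 2298,
    "p3fpl": 21, "p3fsg": 347, "p3mpl": 1181, "p3msg": 2123, "p3upl": 392,
    "ufpl": 5, "ufsg": 22, "umpl": 100, "umsg": 292, "uuu": 93,
}

# Aggregate the counts per person category.
by_person = defaultdict(int)
for label, count in total_row.items():
    person, _gender, _number = split_pgn(label)
    by_person[person] += count

print(dict(by_person))  # {'p1': 2771, 'p2': 2618, 'p3': 4064, 'u': 512}
```

The four sums add up to 9965, the `total_pgn` value of the `total` row.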

In [None]:
#print(pronoun_df.to_latex(index=True))

ExportToLatex(OUTPUT_LOC, 'pronoun_coref_ann', pronoun_df, indx = True)

In [10]:
PrintThisTable(pronoun_pos_class_df)

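| | p1upl | p1usg | p2fsg | p2mpl | p2msg | p3fpl | p3fsg | p3mpl | p3msg | p3upl | ufpl | ufsg | umpl | umsg | uuu | total_pgn |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| VP | 91 | 768 | 29 | 254 | 998 | 20 | 234 | 589 | 1131 | 385 | 5 | 21 | 91 | 284 | 80 | 4980 |
| suffix | 233 | 1565 | 0 | 27 | 1167 | 0 | 104 | 560 | 909 | 0 | 0 | 0 | 0 | 0 | 0 | 4565 |
| PPrP | 8 | 82 | 0 | 0 | 117 | 1 | 6 | 24 | 49 | 0 | 0 | 0 | 0 | 0 | 0 | 287 |
| total | 332 | 2415 | 29 | 281 | 2282 | 21 | 344 | 1173 | 2089 | 385 | 5 | 21 | 91 | 284 | 80 | 9832 |
| % total | 3 | 25 | 0 | 3 | 23 | 0 | 3 | 12 | 21 | 4 | 0 | 0 | 1 | 3 | 1 | 100 |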


In [None]:
#print(pronoun_pos_class_df.to_latex(index=True))

ExportToLatex(OUTPUT_LOC, 'pronoun_pos_class_ann', pronoun_pos_class_df, indx = True)

In [11]:
PrintThisTable(pronoun_pos_sing_df)

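| | p1upl | p1usg | p2mpl | p2msg | p3fsg | p3mpl | p3msg | p3upl | ufsg | umpl | umsg | uuu | total_pgn |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| VP | 0 | 5 | 9 | 12 | 1 | 4 | 28 | 7 | 1 | 9 | 8 | 13 | 97 |
| suffix | 11 | 7 | 1 | 4 | 2 | 3 | 6 | 0 | 0 | 0 | 0 | 0 | 34 |
| PPrP | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| total | 11 | 13 | 10 | 16 | 3 | 8 | 34 | 7 | 1 | 9 | 8 | 13 | 133 |
| % total | 8 | 10 | 8 | 12 | 2 | 6 | 26 | 5 | 1 | 7 | 6 | 10 | 100 |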


In [None]:
#print(pronoun_pos_sing_df.to_latex(index=True))

ExportToLatex(OUTPUT_LOC, 'pronoun_pos_sing_ann', pronoun_pos_sing_df, indx = True)