# Counting Coreference Information

## 1. Introduction

As explained in NB `2. annotation_aid`, the following information is important for coreference annotation:

* Noun Phrases (`NP`)
* Appositions (`Appo`) [not covered in this NB]
* Verbs (`verb`)
* Substantives/Nouns (`subs`)
* Articles (`art`)
* Proper Nouns (`nmpr`)
* Independent Personal Pronouns (`prps`)
* Demonstrative Pronouns (`prde`)
* Suffixes that are attached to Prepositional Phrases (`PP`)

This NB gives counts of some of this information to demonstrate how big and complicated the task of coreference resolution in the **Psalms** potentially is. In §4, however, you can fill in any Hebrew Bible book and chapter range to check out other books.

In this NB the tables with the counts are most important, but I have also included the accompanying code. The sections listed below contain the tables:

* §5.1 gives counts of the features/categories `pgn_prps`, `pgn_prde`, `pgn_verb`, `pgn_prs`. These features give the Person, Gender and Number (PGN) of, respectively, independent personal pronouns, demonstrative pronouns, verbs and suffixes in one 'package' of information instead of three separate features (`ps`, `gn`, `nu`).

* §5.2 gives counts of all possible PGN values within the categories `pgn_prps`, `pgn_prde`, `pgn_verb`, `pgn_prs`.

* §6.1 gives counts of phrase types (`typ`) that are determined (`det`), undetermined (`und`) or for which determination is not applicable (`NA`). If a phrase type has no count at all for one of these values, the table shows `NaN`. In NB `2. annotation_aid` I stated that for feature `typ` `PP` or `NP` the determination (`det`) is important; in this count NB I have included all phrase types that can contain coreference information (see the enumeration above). These phrase types are:
    * `VP`: Verbal phrase
    * `NP`: Nominal phrase
    * `PrNP`: Proper-noun phrase
    * `AdvP`: Adverbial phrase
    * `PP`: Prepositional phrase
    * `PPrP`: Personal pronoun phrase
    * `DPrP`: Demonstrative pronoun phrase

* §6.3 gives counts of Phrase Dependent Part of Speech (`pdp`) within phrase types. The `subs` within `NP`s and `PP`s are relevant for annotation. The same goes for the verbs in `VP`s.

* §7.1 gives counts of the status (`st`) of `subs` within the phrase types that are relevant for this study. The status can be: `a` (absolute), `c` (construct) or `e` (emphatic). Not all words have `st` (e.g. adverbs); these are marked as `NA`.
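
To make the idea of one PGN 'package' concrete, here is a minimal sketch that joins three separate values into a single label. The helper name `pack_pgn` and the label layout are assumptions made for illustration only; the actual `pgn_*` features may encode their values differently.

```python
def pack_pgn(ps, gn, nu):
    """Join person, gender and number into one label (hypothetical layout)."""
    # Leave out parts that carry no information for this word
    parts = [value for value in (ps, gn, nu) if value not in (None, 'NA', 'unknown')]
    return '-'.join(parts) if parts else None


# One label instead of three separate feature values
label = pack_pgn('p3', 'm', 'sg')
print(label)
```

The point of the package is that a single comparison on one label replaces three comparisons on `ps`, `gn` and `nu` when two candidate mentions are checked for agreement.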

## 2. Import modules and utils

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys, os, re, pickle, csv, collections
from collections import defaultdict  # used for all count dictionaries below
from IPython.display import display, HTML
from pprint import pprint

import pandas as pd

from utils import * 
from print_datetime import *

print_datetime()


Notebook last updated by Christiaan at 2019-04-05 09:18:40.280617


## 3. Import Text-Fabric data 

For the coreference data I used the frozen 2017 version, taken from the ETCBC on 2017-10-06. The 2017 version is archived in Zenodo, DOI: [doi.org/10.5281/zenodo.1007624](https://zenodo.org/record/1302798#.W5ocuC1g3pI).

I use 2017 because I don't want the data to change while I'm annotating. The annotated data will be made available through TF later on. 

In [3]:
from tf.app import use

In [4]:
A = use(
    'bhsa', version='2017',
    mod=(
        'etcbc/lingo/heads/tf,'
        'cmerwich/bh-reference-system/tf'
    ),
    hoist=globals(),
)

TF app is up-to-date.
Using annotation/app-bhsa commit d3cf8f0c2ab5d690a0fda14ea31c33da5c5c8483 (=latest)
  in /Users/Christiaan/text-fabric-data/__apps__/bhsa.
Using etcbc/bhsa/tf - 2017 r1.5 in /Users/Christiaan/text-fabric-data
Using etcbc/phono/tf - 2017 r1.2 in /Users/Christiaan/text-fabric-data
Using etcbc/parallels/tf - 2017 r1.2 in /Users/Christiaan/text-fabric-data
Using etcbc/lingo/heads/tf - 2017 r0.1 in /Users/Christiaan/text-fabric-data
Using cmerwich/bh-reference-system/tf - 2017 rv1.0 in /Users/Christiaan/text-fabric-data


## 4. Fill in Hebrew Bible Book and Chapter

In [4]:
# Set any Hebrew Bible Book
MY_BOOK = {'Psalms'}

# Indicate a range of chapters
MY_CHAPTERS = set(range(1,151))
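
The selection works as a plain membership test on these two sets. The sketch below shows that test in isolation; the helper `in_selection` is a hypothetical name and is not part of this NB or of `utils`.

```python
MY_BOOK = {'Psalms'}
# range() excludes its end value, so this covers chapters 1 up to and including 150
MY_CHAPTERS = set(range(1, 151))


def in_selection(book_name, chapter, books=MY_BOOK, chapters=MY_CHAPTERS):
    """Return True if the chapter of this book belongs to the selection."""
    return book_name in books and chapter in chapters


print(in_selection('Psalms', 23))
print(in_selection('Genesis', 1))
```

To look at a smaller portion, something like `MY_CHAPTERS = set(range(1, 42))` would restrict the selection to the first 41 chapters.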

## 5. Person, Number, Gender Categories in the Psalms

In [5]:
pgn_count_dict = {}
all_pgn_count = {}

# The four PGN categories that are counted
pgn_features = {
    'pgn_prps': F.pgn_prps,
    'pgn_prde': F.pgn_prde,
    'pgn_verb': F.pgn_verb,
    'pgn_prs': F.pgn_prs,
}

for book in F.otype.s('book'):
    book_name = T.bookName(book)
    if book_name not in MY_BOOK:
        continue

    # Initialise once per book, so that the counts are not reset for every chapter
    pgn_count_dict[book_name] = defaultdict(int)
    # Start every PGN value at 0, so that values that do not occur still get a row.
    # all_pgn_set is not defined in this NB; it is assumed to come from utils.
    all_pgn_count[book_name] = {pgn: 0 for pgn in all_pgn_set}

    for chn in L.d(book, 'chapter'):
        if F.chapter.v(chn) not in MY_CHAPTERS:
            continue
        # Only visit the words of the selected chapters
        for w in L.d(chn, 'word'):
            for category, feature in pgn_features.items():
                pgn = feature.v(w)
                if not pgn:
                    continue
                # Count all categories of PGN
                pgn_count_dict[book_name][category] += 1
                # Count all sorts of PGN within all PGN categories
                if pgn in all_pgn_set:
                    all_pgn_count[book_name][pgn] += 1

#pprint(pgn_count_dict)
#pprint(all_pgn_count)
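
The counting pattern in this cell can be tried without Text-Fabric. The sketch below uses a few made-up word records in place of the `F.pgn_*` features; the records and the values `'A'`, `'B'`, `'C'` are placeholders, not real PGN values.

```python
from collections import defaultdict

# Hypothetical stand-ins for words: each maps a PGN category to a value or None
words = [
    {'pgn_verb': 'A', 'pgn_prs': None},
    {'pgn_verb': None, 'pgn_prs': 'B'},
    {'pgn_verb': 'A', 'pgn_prs': 'B'},
]
known_values = {'A', 'B', 'C'}

category_count = defaultdict(int)
# Pre-fill with 0 so that values that never occur are still reported
value_count = {value: 0 for value in known_values}

for word in words:
    for category, value in word.items():
        if value:
            category_count[category] += 1
            if value in known_values:
                value_count[value] += 1

print(dict(category_count))
print(value_count)
```

Pre-filling `value_count` is what makes rows with a count of 0 appear in a table such as the one in §5.2.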

## 5.1 Table: Person, Number, Gender Categories in the Psalms

In [6]:
pgn_count_dict_df = pd.DataFrame.from_dict(pgn_count_dict, orient='index')
pgn_count_dict_df

| Book | pgn_verb | pgn_prs | pgn_prps | pgn_prde |
|---|---|---|---|---|
| Psalms | 5294 | 4651 | 294 | 49 |


## 5.2 Table: sorts of PGN within all PGN categories in the Psalms

In [7]:
all_pgn_count_df = pd.DataFrame.from_dict(all_pgn_count)
all_pgn_count_df

| PGN | Psalms |
|---|---|
| Cpl-1 | 0 |
| Cpl-2 | 8 |
| Csg | 0 |
| Fsg | 0 |
| Fsg-1 | 21 |
| Fsg-2 | 0 |
| Fsg-3 | 1 |
| Msg | 0 |
| Msg-1 | 19 |
| P1Cpl | 343 |


## 6. Phrase types - Determined and Undetermined

In [8]:
### Create dictionaries with Phrase Types of Choice

phr_typ_count = {}
phr_typ_pdp_count = {}
phr_typ_st_count = {}

for t in {'VP','NP','PrNP', 'AdvP', 'PP', 'PPrP', 'DPrP'}:
    phr_typ_count[t] = defaultdict(int)
    phr_typ_pdp_count[t] = defaultdict(int)
    phr_typ_st_count[t] = defaultdict(int)

In [9]:
for book in F.otype.s('book'):
    book_name = T.bookName(book)
    for chn in L.d(book, 'chapter'):
        chapter = F.chapter.v(chn)
        if book_name in MY_BOOK and chapter in MY_CHAPTERS:

            for phrase in L.d(chn, 'phrase'):
                typ = F.typ.v(phrase)
                det = F.det.v(phrase)
                
                if typ in {'VP','NP','PrNP', 'AdvP', 'PP', 'PPrP', 'DPrP'}:
                    # Count only phrases that have a det value; an empty value
                    # would otherwise create a spurious None column in the table
                    if det:
                        phr_typ_count[typ][det] += 1
#pprint(phr_typ_count)

## 6.1 Table: Phrase types - Determined and Undetermined

In [10]:
phr_typ_count_df = pd.DataFrame.from_dict(phr_typ_count, orient='index')
phr_typ_count_df

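| Phrase type | det | und | NA |
|---|---|---|---|
| AdvP | NaN | NaN | 292.0 |
| DPrP | 36.0 | NaN | NaN |
| NP | 2109.0 | 2061.0 | NaN |
| PP | 2560.0 | 1145.0 | 35.0 |
| PPrP | 291.0 | NaN | NaN |
| PrNP | 585.0 | NaN | NaN |
| VP | NaN | NaN | 5294.0 |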
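
The empty (`NaN`) cells come from keys that were never set in the inner dictionaries: pandas fills them with `NaN` when it builds the DataFrame, which also turns the affected columns into floats (hence `292.0`). The sketch below mimics this with the standard library only, using `None` where pandas would show `NaN`; the observations are made up.

```python
from collections import defaultdict

counts = {'NP': defaultdict(int), 'VP': defaultdict(int)}

# Made-up (phrase type, determination) observations
observations = [('NP', 'det'), ('NP', 'und'), ('NP', 'det'), ('VP', 'NA')]
for typ, det in observations:
    counts[typ][det] += 1

columns = ['det', 'und', 'NA']
# .get() does not trigger the defaultdict factory, so unseen keys stay None
rows = {typ: [inner.get(col) for col in columns] for typ, inner in counts.items()}

for typ, cells in rows.items():
    print(typ, cells)
```

So an empty cell means "this combination was never counted", which is not the same as an explicit count of 0.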


## 6.2 Phrase Dependent Part of Speech within Phrase Types

In [11]:
for book in F.otype.s('book'):
    book_name = T.bookName(book)
    for chn in L.d(book, 'chapter'):
        chapter = F.chapter.v(chn)
        if book_name in MY_BOOK and chapter in MY_CHAPTERS:

            for phrase in L.d(chn, 'phrase'):
                typ = F.typ.v(phrase)

                if typ in {'VP','NP','PrNP', 'AdvP', 'PP', 'PPrP', 'DPrP'}:
                    for w in L.d(phrase, 'word'):
                        pdp = F.pdp.v(w)
                        if pdp:
                            phr_typ_pdp_count[typ][pdp] += 1
#pprint(phr_typ_pdp_count)

## 6.3 Table: Phrase Dependent Part of Speech within Phrase Types

In [12]:
phr_typ_pdp_count_df = pd.DataFrame.from_dict(phr_typ_pdp_count, orient='index')
phr_typ_pdp_count_df

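| Phrase type | subs | art | prep | nmpr | conj | adjv | advb | prps | nega | prde | prin | inrg | verb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AdvP | 5 | NaN | 2.0 | NaN | 5.0 | NaN | 298.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| DPrP | 3 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 36.0 | NaN | NaN | NaN |
| NP | 5382 | 221.0 | 145.0 | 211.0 | 235.0 | 98.0 | 33.0 | 3.0 | 3.0 | 2.0 | NaN | NaN | NaN |
| PP | 3608 | 493.0 | 3910.0 | 430.0 | 122.0 | 69.0 | 35.0 | NaN | 2.0 | 10.0 | 12.0 | 13.0 | NaN |
| PPrP | 16 | NaN | 1.0 | 2.0 | 4.0 | NaN | 5.0 | 291.0 | NaN | NaN | NaN | NaN | NaN |
| PrNP | 76 | 4.0 | 3.0 | 602.0 | 12.0 | 2.0 | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| VP | 9 | NaN | 268.0 | NaN | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 5294.0 |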


## 7. Status of Subs within Phrases of Choice

In [13]:
for book in F.otype.s('book'):
    book_name = T.bookName(book)
    for chn in L.d(book, 'chapter'):
        chapter = F.chapter.v(chn)
        if book_name in MY_BOOK and chapter in MY_CHAPTERS:

            for phrase in L.d(chn, 'phrase'):
                typ = F.typ.v(phrase)

                if typ in {'VP','NP','PrNP', 'AdvP', 'PP', 'PPrP', 'DPrP'}:
                    for w in L.d(phrase, 'word'):
                        pdp = F.pdp.v(w)
                        st = F.st.v(w)
                        if pdp == 'subs':
                            # a = absolute, c = construct, e = emphatic
                            if st == 'a':
                                phr_typ_st_count[typ]['absolute'] += 1
                            elif st == 'c':
                                phr_typ_st_count[typ]['construct'] += 1
                            elif st == 'e':
                                phr_typ_st_count[typ]['emphatic'] += 1
#pprint(phr_typ_st_count)

## 7.1 Table: Status of Subs within Phrases of Choice

In [14]:
phr_typ_st_count_df = pd.DataFrame.from_dict(phr_typ_st_count, orient='index')
phr_typ_st_count_df

| Phrase type | construct | absolute |
|---|---|---|
| AdvP | NaN | 5 |
| DPrP | 1.0 | 2 |
| NP | 1020.0 | 4362 |
| PP | 793.0 | 2815 |
| PPrP | 4.0 | 12 |
| PrNP | 10.0 | 66 |
| VP | 3.0 | 6 |
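
A quick consistency check, independent of Text-Fabric: per phrase type, the construct and absolute counts in this table should add up to the `subs` column of the table in §6.3. The sketch below copies the numbers from both tables by hand; the variable names are only illustrative.

```python
# subs per phrase type, copied from the table in §6.3
subs_per_type = {
    'AdvP': 5, 'DPrP': 3, 'NP': 5382, 'PP': 3608,
    'PPrP': 16, 'PrNP': 76, 'VP': 9,
}

# status counts per phrase type, copied from the table in §7.1
status_per_type = {
    'AdvP': {'absolute': 5},
    'DPrP': {'construct': 1, 'absolute': 2},
    'NP': {'construct': 1020, 'absolute': 4362},
    'PP': {'construct': 793, 'absolute': 2815},
    'PPrP': {'construct': 4, 'absolute': 12},
    'PrNP': {'construct': 10, 'absolute': 66},
    'VP': {'construct': 3, 'absolute': 6},
}

# Collect every phrase type where the two tables disagree
mismatches = {
    typ: (subs_per_type[typ], sum(status.values()))
    for typ, status in status_per_type.items()
    if sum(status.values()) != subs_per_type[typ]
}
print(mismatches)
```

The two tables agree for every phrase type, which also shows that each `subs` in these phrase types is either absolute or construct; none is emphatic or without status.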


## 8. Double checking with TF queries 

In [None]:
# Warning: this query returns many results, which the notebook is slow to handle

query = '''
book book=Psalmi
  verse
    phrase typ=NP
'''
# Search through the app object A; no object named B is defined in this NB
res = A.search(query)
A.table(res, linked=3, end=10)

In [None]:
# Warning: this query returns many results, which the notebook is slow to handle

query = '''
book book=Psalmi
  verse
    phrase typ=PP
'''
# Search through the app object A; no object named B is defined in this NB
res = A.search(query)
A.table(res, linked=3, end=10)

In [None]:
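# Warning: this query returns many results, which the notebook is slow to handle

query = '''
book book=Psalmi
    word pdp=subs
'''
# Search through the app object A; no object named B is defined in this NB
res = A.search(query)
# The results have two columns (book, word), so link the second one
A.table(res, linked=2, end=10)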