<img align="right" src="images/tf-small.png" width="90"/>
<img align="right" src="images/etcbc.png" width="100"/>

# Annotating Coreference

## 1. Introduction 

This notebook contains a simple function, `show_text()`, that helps with annotating coreference information in the Psalms: noun phrases (NPs), named entities, verbs, personal pronouns (independent personal pronouns and suffixes), demonstratives and sometimes adverbs. Below I explain which information `show_text()` displays in which colours, and which additional ETCBC resources I use while annotating.


### 1.1 Annotating with `show_text`
The function `show_text()` in §5 retrieves the information from TF that is important for coreference annotation. The different kinds of coreference information are highlighted with [CSS colour codes](https://developer.mozilla.org/en-US/docs/Web/CSS/color_value): 

* Noun Phrases (`NP`): skyblue
* Appositions (`Appo`): yellow
* Verbs (`verb`): springgreen
* Substantives/Nouns (`subs`): skyblue
* Articles (`art`): skyblue
* Proper Nouns (`nmpr`): tomato
* Independent Personal Pronouns (`prps`): palegoldenrod
* Demonstrative Pronouns (`prde`): royalblue
* Suffixes that are attached to Prepositional Phrases (`PP`): DarkGoldenrod
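
The colour scheme above can also be written down as a plain lookup table. The sketch below is only an illustration: the names `HIGHLIGHT_COLOURS` and `colour_for` are hypothetical and do not occur in the notebook; the labels and colours are copied from the list.

```python
# Hypothetical lookup table for the colour scheme listed above.
HIGHLIGHT_COLOURS = {
    'NP': 'skyblue',
    'Appo': 'yellow',
    'verb': 'springgreen',
    'subs': 'skyblue',
    'art': 'skyblue',
    'nmpr': 'tomato',
    'prps': 'palegoldenrod',
    'prde': 'royalblue',
    'PP': 'DarkGoldenrod',  # suffixes attached to prepositional phrases
}


def colour_for(label):
    """Return the CSS colour for a label, or None if it is not highlighted."""
    return HIGHLIGHT_COLOURS.get(label)


print(colour_for('nmpr'))
print(colour_for('conj'))
```

Labels that are not in the table, such as conjunctions, simply get no highlight.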

Additional features are displayed by passing them as a string to `A.displaySetup()`: 

* The features `pgn_prps`, `pgn_prde`, `pgn_verb` and `pgn_prs` give the Person, Gender and Number (PGN) of independent personal pronouns, demonstrative pronouns, verbs and suffixes respectively, in one 'package' of information instead of three separate features (`ps`, `gn`, `nu`). Especially for large texts these features save considerable space in `A.show()`.
* `pdp` indicates the Phrase Dependent Part of Speech.
* For phrases of type (`typ`) PP or NP the determination (`det`) is important.
* For `subs` within PPs and NPs the status (`st`) is relevant.
* `ls` indicates the Lexical Set of words and/or lexemes. The feature contains some Named Entity information.
* `nametype` contains Named Entity information of the lexeme.
* `gn` and `nu` give the gender and number of words that are not covered by the `pgn_*` features in the first bullet above. 
* For ease of grammatical reference `rela` and `function` have been included. 
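
To illustrate what a packaged PGN feature saves, the sketch below joins three separate values into one label such as `P3Fsg`. It assumes the value conventions `p1`/`p2`/`p3` for `ps`, `m`/`f` for `gn` and `sg`/`du`/`pl` for `nu`. The function `pack_pgn` is hypothetical and is not necessarily how the `pgn_*` features were generated.

```python
def pack_pgn(ps, gn, nu):
    """Combine person, gender and number into one label, e.g. 'P3Fsg'.

    Assumed input conventions: ps like 'p3', gn like 'f', nu like 'sg'.
    Values such as 'NA' or 'unknown' are treated as absent.
    """
    absent = {None, 'NA', 'unknown'}
    if ps in absent and gn in absent and nu in absent:
        return None
    person = '' if ps in absent else ps[1:]      # 'p3' -> '3'
    gender = '' if gn in absent else gn.upper()  # 'f'  -> 'F'
    number = '' if nu in absent else nu          # 'sg' -> 'sg'
    return 'P' + person + gender + number


print(pack_pgn('p3', 'f', 'sg'))  # P3Fsg
print(pack_pgn('p3', 'm', 'sg'))  # P3Msg
```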

In §5.1, fill in the Hebrew Bible book and the range of chapters that you want displayed with the coreference information described above. 

### 1.2 Annotating with resources

Besides this function, the folder **annotation_resources** contains resources from the ETCBC server. I use them as reference works to check how certain Biblical Hebrew words are built up: 

* **alphabet** and **graphemes**: definition of the Latin transliteration alphabet of Biblical Hebrew 
* **anwb**: Analytical Dictionary for Biblical Hebrew 
* **lexicon**: contains all lexemes in Biblical Hebrew
* **named_entities**: contains categories of named entities that need to be annotated
* **word_grammar**: grammar at word level in Biblical Hebrew  

As I try to work in an [Open Science](https://en.wikipedia.org/wiki/Open_science) way, I have included the **annotation resources** here, because only authorised ETCBC personnel are allowed on the ETCBC server. 

To illustrate the use of the **annotation resources**:

In Psalm 1:4 I come across the verb: 

> תִּדְּפֶ֥נּוּ (scatter), or in text-trans-plain: TDPNW (see the txt file of Psalm 1 in the notebook '1. file_preparation_for_annotation')

with PGN (Person, Gender and Number) of the verb P3Fsg and PGN of the attached suffix P3Msg. 

Say that I'm unsure how I should annotate the verb and the suffix separately:

* Click on the word (or load the feature `lex` in `A.displaySetup()`). This leads to all occurrences of that word in [SHEBANQ](https://shebanq.ancient-data.org/hebrew/word?version=2017&id=1NDPv), the website version of the BHSA. 
* Find the disambiguated verb form `NDP[` on SHEBANQ in the box on the left. The `[` indicates that the word is a verb. 
* Open the **anwb** resource in a .txt editor and look (ctrl-F, regex) for the pattern `# NDP[`. This search will bring you to all occurrences of the verb within the **anwb**. 
* The search returns: `T.ID.:PEN.W.			!T=!(NDP[~N+(HW			# NDP[`. 

    1. The first form `T.ID.:PEN.W.` is the written form in the text. It can be deciphered using the **alphabet** and **graphemes** resources. 
    2. The second form `!T=!(NDP[~N+(HW` is ETCBC's analysed form; it can be deciphered with the **word_grammar** resource. The analysed form is the most important one for the annotation process. The `~N` indicates the paragogic nun and `(` a non-realised `H`. The `W` is therefore the suffix and should be annotated as such. Consequently, `TDP` should be annotated as the verb. 
    3. The third form `# NDP[` is the search pattern. 
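
The three forms can also be separated programmatically. The sketch below is a minimal illustration: it assumes that the fields of an **anwb** line are separated by runs of tabs, as in the line quoted above, and the function name `parse_anwb_line` is hypothetical. Note that `[` is a regex metacharacter, so a regex search for `# NDP[` needs the bracket escaped.

```python
import re

# The line returned by the search in the example above (tab-separated fields).
line = "T.ID.:PEN.W.\t\t\t!T=!(NDP[~N+(HW\t\t\t# NDP["


def parse_anwb_line(text):
    """Split one anwb line into (written form, analysed form, lexeme)."""
    written, analysed, pattern = re.split(r'\t+', text.strip())
    lexeme = pattern.lstrip('# ')  # '# NDP[' -> 'NDP['
    return written, analysed, lexeme


# Escape the search pattern so that '[' is matched literally.
search = re.compile(re.escape('# NDP['))

print(parse_anwb_line(line))
print(bool(search.search(line)))
```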

## 2. Import modules and utils

In [81]:
%load_ext autoreload
%autoreload 2

import re, collections
from IPython.display import display, HTML
from pprint import pprint

from utils import * 
from print_datetime import *

from tf.app import use
from tf.fabric import Fabric
print_datetime()

Notebook last updated by Christiaan at 2020-01-30 12:33:35.683749


## 3. Import Text-Fabric Data 

For the coreference data I used the frozen 2017 version of the BHSA, taken from the ETCBC on 2017-10-06. The 2017 version is archived on Zenodo, DOI: [doi.org/10.5281/zenodo.1007624](https://zenodo.org/record/1302798#.W5ocuC1g3pI). 

I use 2017 because I don't want the data to change while I'm annotating. The annotated data will be made available through TF later on. 

In [82]:
A = use(
    'bhsa', version='2017',
    mod=(
        'cmerwich/bh-reference-system/tf'
    ),
    hoist=globals(),
)

	connecting to online GitHub repo annotation/app-bhsa ... connected
Using TF-app in /Users/Christiaan/text-fabric-data/annotation/app-bhsa/code:
	rv1.2=#5fdf1778d51d938bfe80b37b415e36618e50190c (latest release)
	connecting to online GitHub repo etcbc/bhsa ... connected
Using data in /Users/Christiaan/text-fabric-data/etcbc/bhsa/tf/2017:
	rv1.6=#bac4a9f5a2bbdede96ba6caea45e762fe88f88c5 (latest release)
	connecting to online GitHub repo etcbc/phono ... connected
Using data in /Users/Christiaan/text-fabric-data/etcbc/phono/tf/2017:
	r1.2 (latest release)
	connecting to online GitHub repo etcbc/parallels ... connected


	cannot find commits


Using data in /Users/Christiaan/text-fabric-data/etcbc/parallels/tf/2017:
	r1.2 (latest release)
	connecting to online GitHub repo cmerwich/bh-reference-system ... failed
The offline data may not be the latest
Using data in /Users/Christiaan/text-fabric-data/cmerwich/bh-reference-system/tf/2017:
	rv1.0 (latest? release)


## 4. TF query for appositions

Appositional relations are relevant for coreference, as in: "Barack Obama, the President of the United States, is coming to visit the Rijksmuseum." (He actually did.) In this example "the President of the United States" is an apposition to "Barack Obama". 

The TF query below looks for such appositions in one book at a time: set `book=Psalmi` to search the Psalms (the cell as saved queries `Numeri`). The results can be used for reference during the annotation process. The `show_text` function in §5 highlights appositions at the phrase-atom level in yellow.

With `A.show(res, start=1, end=30)` in §4.2 the results can be shown in the context of the text, together with their features. Change `end=30` to set how many results are displayed.

### 4.1 Run TF-query

In [None]:
res = A.search('''
book book=Numeri
  verse
    phrase_atom rela=Appo
''')

A.table(res, linked=3) #Psalmi
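
Because the book name is the only part of the template that changes between runs, the template can be generated by a small helper. This is a sketch only: `apposition_query` is a hypothetical name, and the book is expected in the Latin form used by the `book` feature (for example `Psalmi` or `Numeri`).

```python
def apposition_query(book):
    """Return a search template for apposition phrase atoms in one book."""
    return (
        f"book book={book}\n"
        "  verse\n"
        "    phrase_atom rela=Appo\n"
    )


print(apposition_query('Psalmi'))
# With Text-Fabric loaded as in section 3:
# res = A.search(apposition_query('Psalmi'))
```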

### 4.2 Appositions in context  

In [None]:
A.show(res, start=1, end=30)

### 4.3 Fronted elements

In [None]:
res = A.search('''
book book=Psalmi
  verse
    phrase function=Frnt
''')

A.table(res, linked=3)

### 4.4 Resumption

In [14]:
res = A.search('''
book book=Genesis
  clause rela=Resu
    phrase rela=NA
''')

A.table(res, linked=3) 

  1.12s 234 results


| n | p | book | clause | phrase |
|---|---|---|---|---|
| 1 | Genesis 2:7 | 426585 | וַיִּיצֶר֩ יְהוָ֨ה אֱלֹהִ֜ים אֶת־הָֽאָדָ֗ם עָפָר֙ מִן־הָ֣אֲדָמָ֔ה 427700 | וַ 651942 |
| 2 | Genesis 2:7 | 426585 | וַיִּיצֶר֩ יְהוָ֨ה אֱלֹהִ֜ים אֶת־הָֽאָדָ֗ם עָפָר֙ מִן־הָ֣אֲדָמָ֔ה 427700 | יִּיצֶר֩ 651943 |
| 3 | Genesis 2:7 | 426585 | וַיִּיצֶר֩ יְהוָ֨ה אֱלֹהִ֜ים אֶת־הָֽאָדָ֗ם עָפָר֙ מִן־הָ֣אֲדָמָ֔ה 427700 | יְהוָ֨ה אֱלֹהִ֜ים 651944 |
| 4 | Genesis 2:7 | 426585 | וַיִּיצֶר֩ יְהוָ֨ה אֱלֹהִ֜ים אֶת־הָֽאָדָ֗ם עָפָר֙ מִן־הָ֣אֲדָמָ֔ה 427700 | אֶת־הָֽאָדָ֗ם 651945 |
| 5 | Genesis 2:7 | 426585 | וַיִּיצֶר֩ יְהוָ֨ה אֱלֹהִ֜ים אֶת־הָֽאָדָ֗ם עָפָר֙ מִן־הָ֣אֲדָמָ֔ה 427700 | עָפָר֙ מִן־הָ֣אֲדָמָ֔ה 651946 |
| 6 | Genesis 2:14 | 426585 | ה֥וּא פְרָֽת׃ 427728 | ה֥וּא 652024 |
| 7 | Genesis 2:14 | 426585 | ה֥וּא פְרָֽת׃ 427728 | פְרָֽת׃ 652025 |
| 8 | Genesis 2:17 | 426585 | לֹ֥א תֹאכַ֖ל מִמֶּ֑נּוּ 427737 | לֹ֥א 652046 |
| 9 | Genesis 2:17 | 426585 | לֹ֥א תֹאכַ֖ל מִמֶּ֑נּוּ 427737 | תֹאכַ֖ל 652047 |
| 10 | Genesis 2:17 | 426585 | לֹ֥א תֹאכַ֖ל מִמֶּ֑נּוּ 427737 | מִמֶּ֑נּוּ 652048 |
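
Each result is one (book, clause, phrase) tuple, so a clause is repeated once for every phrase it contains. The sketch below groups the first ten results by clause; the node numbers are copied from the output above, and the variable names are my own. In the notebook the same loop could presumably run over `res` directly.

```python
from collections import defaultdict

# (book, clause, phrase) node tuples copied from the first ten results above.
rows = [
    (426585, 427700, 651942),
    (426585, 427700, 651943),
    (426585, 427700, 651944),
    (426585, 427700, 651945),
    (426585, 427700, 651946),
    (426585, 427728, 652024),
    (426585, 427728, 652025),
    (426585, 427737, 652046),
    (426585, 427737, 652047),
    (426585, 427737, 652048),
]

# Collect the phrase nodes that belong to each clause node.
phrases_per_clause = defaultdict(list)
for book, clause, phrase in rows:
    phrases_per_clause[clause].append(phrase)

for clause, phrases in phrases_per_clause.items():
    print(clause, len(phrases))
```

The ten rows shown therefore cover only three distinct resumptive clauses.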


## 5. Annotation aid: `show_text()`  
Run the cells below first. 

In [83]:
def compute_text(my_book_name, from_chapter, to_chapter):
    """Collect the nodes to display and their highlight colours.

    Returns (results, highlights): one tuple of nodes per selected chapter,
    and a dict that maps each highlighted node to a CSS colour name.
    """
    # Colour per phrase dependent part of speech (pdp) of a word
    pdp_colours = {
        'verb': 'springgreen',
        'subs': 'skyblue',
        'art': 'skyblue',
        'nmpr': 'tomato',
        'prps': 'palegoldenrod',
        'prde': 'royalblue',
    }

    results = []
    highlights = {}

    my_chapters = set(range(from_chapter, to_chapter + 1))

    for book in F.otype.s('book'):
        book_name = T.bookName(book)

        for chn in L.d(book, 'chapter'):
            chapter = F.chapter.v(chn)
            # Skip chapters outside the requested book and chapter range
            if (
                (my_book_name and book_name not in my_book_name)
                or
                (my_chapters and chapter not in my_chapters)
            ):
                continue

            tup = (chn,)

            # Noun phrases
            for phrase in L.d(chn, 'phrase'):
                if F.typ.v(phrase) == 'NP':
                    tup = tup + (phrase,)
                    highlights[phrase] = 'skyblue'

            # Appositions at phrase atom level
            for phr_atom in L.d(chn, 'phrase_atom'):
                if F.rela.v(phr_atom) == 'Appo':
                    tup = tup + (phr_atom,)
                    highlights[phr_atom] = 'yellow'

            # Words, coloured by their pdp
            for w in L.d(chn, 'word'):
                pdp = F.pdp.v(w)
                if pdp in pdp_colours:
                    tup = tup + (w,)
                    highlights[w] = pdp_colours[pdp]
                # Prepositions that carry a suffix. `prs_set` is not defined
                # in this notebook; it presumably comes from `from utils import *`.
                elif pdp == 'prep' and F.pgn_prs.v(w) in prs_set:
                    tup = tup + (w,)
                    highlights[w] = 'DarkGoldenrod'

            results.append(tup)
    return (results, highlights)

In [84]:
def show_text(results, highlights):
    """Display the selected chapters with highlights and extra features."""
    extra = 'pgn_prps pgn_prde pgn_verb pgn_prs pdp typ rela ls function det st lex nametype gn nu'
    A.displaySetup(withNodes=True, extraFeatures=extra)
    A.show(results, condensed=False, highlights=highlights)

### 5.1 Choose book and chapter

In [92]:
# Set any Hebrew Bible Book
MY_BOOK = '1_Chronicles'

# Indicate a range of chapters
FROM_CHAPTER = 4
TO_CHAPTER = 4

# Run
(results, highlights) = compute_text(MY_BOOK, FROM_CHAPTER, TO_CHAPTER)
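
Before displaying a chapter it can be useful to check what was collected. The sketch below counts how many nodes received each colour. It runs on a toy dictionary with made-up node numbers, and `colour_counts` is a hypothetical helper; in the notebook it could be called on the `highlights` returned above.

```python
from collections import Counter

# Toy stand-in for the `highlights` dict returned by compute_text():
# node number -> CSS colour. The node numbers are made up.
toy_highlights = {
    1001: 'springgreen',
    1002: 'skyblue',
    1003: 'skyblue',
    1004: 'tomato',
}


def colour_counts(highlights):
    """Count how many nodes received each highlight colour."""
    return Counter(highlights.values())


print(colour_counts(toy_highlights).most_common())
```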

In [94]:
# Psalms
T.sectionFromNode(312001)
T.sectionFromNode(312498)
T.sectionFromNode(313865)
T.sectionFromNode(313998)
T.sectionFromNode(314043)
T.sectionFromNode(1107464)
T.sectionFromNode(329482)
T.sectionFromNode(335575)

#HB
T.sectionFromNode(391709) # ('1_Chronicles', 1, 43)
T.sectionFromNode(403893) # ('1_Chronicles', 25, 1)

T.sectionFromNode(60555) # ('Leviticus', 14, 14)

T.sectionFromNode(393372)
T.sectionFromNode(393069)

('1_Chronicles', 4, 18)

In [64]:
for w in L.d(1107464, 'word'):
    print(w)

322775


### 5.2 Run `show_text()`

Run the cell below each time you choose a new book and/or chapter. 

In [93]:
show_text(results, highlights)