# File Preparation for Annotation

## Introduction 

For every book and chapter that needs to be annotated for coreference, this notebook outputs three kinds of files:

* [text-trans-plain](https://dans-labs.github.io/text-fabric/Api/General/#text-representation) .txt files for annotation of the Biblical Hebrew text
* [text-orig-full](https://dans-labs.github.io/text-fabric/Api/General/#text-representation) .txt files for easy reading of the Biblical Hebrew text
* stand-off .tsv files for future Text-Fabric administration

The books and chapters can be specified in section 5, `Which book and chapters`, below; the output paths are set in section 4, `File administration`.

The annotation is done in [brat](http://brat.nlplab.org). 

The second notebook, **2. annotation_aid**, explains which information is retrieved from Text-Fabric (TF) to support the coreference annotation.

## 1. Import modules and utils

In [1]:
import sys, os, re, pickle, csv, collections
from collections import *
from IPython.display import HTML
from pprint import pprint
from functools import reduce

from tf.fabric import Fabric
from tf.extra.bhsa import Bhsa

from print_datetime import *
print_datetime()

Notebook last updated by Christiaan at 2018-09-13 17:06:17.376368


## 2. Import Text-Fabric data 

For the coreference data I use the frozen 2017 version of the BHSA, taken from the ETCBC on 2017-10-06. The 2017 version is archived at Zenodo, DOI: [doi.org/10.5281/zenodo.1007624](https://zenodo.org/record/1302798#.W5ocuC1g3pI). 

I use 2017 because I don't want the data to change while I'm annotating. The annotated data will be made available through TF later on. 

In [2]:
VERSION = '2017'
DATABASE = '~/github/etcbc'
BHSA = f'bhsa/tf/{VERSION}'
REFERENCE = f'bh-reference-system/tf/{VERSION}' # Check my GitHub to download extra pgn features
TF = Fabric(locations=[DATABASE], modules=[BHSA, REFERENCE], silent=False )

This is Text-Fabric 5.5.19
Api reference : https://dans-labs.github.io/text-fabric/Api/General/
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

115 features found and 0 ignored


## 3. Import Text-Fabric features 

In [3]:
api = TF.load('''
    otype
    lex book chapter verse
    nu ps gn prs ls lex gloss
    function sp typ pdp language
''', silent=True)

api.makeAvailableIn(globals())

B = Bhsa(api, 'file_preparation_for_annotation', version=VERSION)

**Documentation:** <a target="_blank" href="https://etcbc.github.io/bhsa" title="{provenance of this corpus}">BHSA</a> <a target="_blank" href="https://etcbc.github.io/bhsa/features/hebrew/2017/0_home.html" title="{CORPUS.upper()} feature documentation">Feature docs</a> <a target="_blank" href="https://dans-labs.github.io/text-fabric/Api/Bhsa/" title="BHSA API documentation">BHSA API</a> <a target="_blank" href="https://dans-labs.github.io/text-fabric/Api/General/" title="text-fabric-api">Text-Fabric API 5.5.19</a> <a target="_blank" href="https://dans-labs.github.io/text-fabric/Api/General/#search-templates" title="Search Templates Introduction and Reference">Search Reference</a>

## 4. File administration

In [6]:
PATH_FILE_ANNOTATE = '/Users/Christiaan/Sites/brat/data/psalms_coref/psalms_annotate/' # Path for annotation files

PATH_FILE_STAND_OFF = '/Users/Christiaan/Sites/brat/data/psalms_coref/stand_off/' # Path for stand-off files 

PATH_FILE_FULL = '/Users/Christiaan/Sites/brat/data/psalms_coref/psalms_full/' # Path for full text files 
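The cell in section 5 opens files inside these three directories with mode `'w'`, which fails if a directory does not exist yet. A minimal sketch for creating them up front follows; the helper name `ensure_dirs` and the demo directory names are illustrative assumptions, not part of the original notebook.

```python
import os
import tempfile

def ensure_dirs(paths):
    """Create each directory (including parents) if it is missing.

    Returns the list of paths with a leading '~' expanded.
    """
    expanded = [os.path.expanduser(p) for p in paths]
    for p in expanded:
        os.makedirs(p, exist_ok=True)
    return expanded

# Demo in a throwaway location; the sub-directory names are hypothetical.
demo_root = tempfile.mkdtemp()
demo_dirs = ensure_dirs(
    [os.path.join(demo_root, name)
     for name in ('psalms_annotate', 'stand_off', 'psalms_full')]
)
```

In the notebook itself this could be called as `ensure_dirs([PATH_FILE_ANNOTATE, PATH_FILE_STAND_OFF, PATH_FILE_FULL])`.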

## 5. Which book and chapters

In [None]:
# Set any Hebrew Bible Book
MY_BOOK = {'Psalms'} 

# Set any range in chapters of specified HB book
MY_CHAPTERS = set(range(1,151)) 
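The next cell names every output file after the book and a zero-padded chapter number, using the format spec `'{0:0=3d}'`. The sketch below shows what that produces; the helper name `chapter_label` is an assumption made for illustration.

```python
def chapter_label(book_name, chapter):
    """Return the book name plus a three-digit, zero-padded chapter number."""
    return book_name + '_' + '{0:0=3d}'.format(chapter)

labels = [chapter_label('Psalms', ch) for ch in (1, 23, 150)]
print(labels)
```

Because of the padding, sorting the file names alphabetically also sorts them by chapter number.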

In [7]:
info('Making .txt and .tsv files per book and chapter.')

chap_list = []

for chn in F.otype.s('chapter'):
    chapter = F.chapter.v(chn)

    for bn in L.u(chn, 'book'):
        book_name = T.bookName(bn)
        
        if book_name in MY_BOOK and chapter in MY_CHAPTERS:
            book_chap = book_name+'_'+'{0:0=3d}'.format(chapter)
            chap_list.append(book_chap)
            
            filename_txt = '{}.txt'.format(book_chap)
            filename_txt_full = '{}_full.txt'.format(book_chap)
            
            with open('{}{}'.format(PATH_FILE_ANNOTATE, filename_txt), 'w') as txt_f, \
            open('{}{}'.format(PATH_FILE_FULL, filename_txt_full), 'w') as txt_f_full:
                txt_f.write('{}\n'.format(''.join(book_chap)))
                
                txt_f_full.write('{}\n'.format(''.join(book_chap)))
                
                for vn in L.d(chn, 'verse'):
                    boo, chap, vers = T.sectionFromNode(vn)
                    verse_words = L.d(vn, 'word')
                    
                    txt_infor = [str(vers), T.text(verse_words, fmt='text-trans-plain')]
                    txt_infor_full = [str(vers), T.text(verse_words)]
                    
                    # make 'text-trans-plain' txt files for annotation
                    txt_f.write('{}\n'.format(' '.join(txt_infor)))
                    
                    # make 'text-orig-full' txt files for easy reading
                    txt_f_full.write('{}\n'.format(' '.join(txt_infor_full)))
                    
                # now make the stand-off file, once per chapter
                # (it does not depend on the verse, so it is written after the verse loop)
                chapter_words = L.d(chn, 'word')
                index = 0

                filename_tsv = '{}.tsv'.format(book_chap)

                with open('{}{}'.format(PATH_FILE_STAND_OFF, filename_tsv), 'w') as tsv_f:
                    header = ['start_index', 'end_index', 'word_node']
                    tsv_f.write('{}\n'.format('\t'.join(header)))

                    for w in chapter_words:
                        # start and (inclusive) end position of this word
                        # in the concatenated text-trans-plain chapter text
                        start_index = index
                        length_word = len(T.text([w], fmt='text-trans-plain'))
                        end_index = index+length_word-1
                        index += length_word

                        tsv_infor = [str(start_index), str(end_index), str(w)]
                        tsv_f.write('{}\n'.format('\t'.join(tsv_infor)))

                    # sanity check: the summed word lengths must equal the chapter text length
                    if index != len(T.text(chapter_words, fmt='text-trans-plain')):
                        info('This is not good. Check your indices for the .tsv again!')

info('Done making {} text-trans-plain txt files of specified books and chapters for annotation.'.format(len(chap_list)))
info('Done making {} text-orig-full txt files of specified books and chapters for easy reading.'.format(len(chap_list)))
info('Done making {} tsv stand-off files of specified books and chapters for future Text-Fabric administration.'.format(len(chap_list)))  

#pprint(list(text.items())[0:4])

    32s Making .txt and .tsv files per book and chapter.
    40s Done making 150 text-trans-plain txt files of specified books and chapters for annotation.
    40s Done making 150 text-orig-full txt files of specified books and chapters for easy reading.
    40s Done making 150 tsv stand-off files of specified books and chapters for future Text-Fabric administration.
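The index bookkeeping in the cell above can be tried out without Text-Fabric. The sketch below mirrors it on plain strings: each word contributes its full length (including any trailing separator) and the end index is inclusive. The function name `word_offsets`, the node numbers and the transliterated strings are made up for illustration.

```python
def word_offsets(words):
    """Compute (start_index, end_index, node) rows for a list of (node, text) pairs.

    The end index is inclusive, and each text is counted with its trailing
    separator, as in the cell above. Also returns the total length seen.
    """
    rows = []
    index = 0
    for node, text in words:
        start_index = index
        end_index = index + len(text) - 1
        rows.append((start_index, end_index, node))
        index += len(text)
    return rows, index

# Hypothetical word nodes and transliterated word strings.
demo_words = [(567, 'BR>CJT '), (568, 'BR> '), (569, '>LHJM ')]
rows, total = word_offsets(demo_words)

# Same sanity check as in the cell: summed lengths equal the full text length.
full_text = ''.join(text for _, text in demo_words)
assert total == len(full_text)
```

As far as the cell shows, these indices count characters in the concatenated chapter text only. The .txt files also contain a title line, verse numbers and line breaks, so the indices would need shifting before they line up with positions in those files.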


## Make stand-off file for administration

OK, then I will get started on annotating the Psalms in the following way:

1. write a .conf file in which I specify the entities (person, object, organization = family relation, etc.) and relations (antecedent, anaphor) that I annotate for (if this works, it can also be used for other books of the Bible)
2. make a txt file per psalm containing the full pointed Hebrew text; to be safe, I add the book, chapter, verse number and verse node
3. make an accompanying tsv stand-off file with the columns: start position, end position, TF word node
4. import #1 and #2 into brat, and start annotating

The annotation will take a while, because it is a substantial job. But I am hopeful.

But you can also generate a stand-off file along with it: a table that holds, for each word, the character position of the start and of the end of that word, plus the word node of that word.

Yes. So when you read the file, you have to keep track of the character position in that file at which each word starts and ends.

I would generate the stand-off file at the same moment you generate the .txt file, because during that generation process you have all the required information at hand, including the start and end positions of the words.
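A minimal sketch of that advice: build the text and the stand-off rows in one pass, so that every position counts all characters written so far, including the title line, the verse numbers and the line breaks. The function name, the input shape and the demo data are assumptions for illustration. The end position is exclusive here, so that `text[start:end]` yields the word; subtract 1 for an inclusive end.

```python
def build_text_and_standoff(title, verses):
    """Build the chapter text and its stand-off rows in a single pass.

    verses: list of (verse_number, [(node, word_text), ...]) where word_text
    may carry a trailing space. Returns (text, rows); each row is
    (start, end, node) with offsets into text and an exclusive end.
    """
    parts = []
    rows = []
    pos = 0

    def emit(piece):
        # Append a piece of text and advance the running character position.
        nonlocal pos
        parts.append(piece)
        pos += len(piece)

    emit(title + '\n')
    for verse_number, words in verses:
        emit('{} '.format(verse_number))
        for node, word_text in words:
            bare = word_text.rstrip()  # the span excludes trailing whitespace
            rows.append((pos, pos + len(bare), node))
            emit(word_text)
        emit('\n')
    return ''.join(parts), rows

# Hypothetical nodes and transliterated words.
demo_verses = [
    (1, [(567, '>CRJ '), (568, 'H>JC ')]),
    (2, [(569, 'KJ ')]),
]
text, rows = build_text_and_standoff('Psalms_001', demo_verses)
```

Slicing the text with each row gives back the bare words, which is an easy way to check the offsets before annotation starts.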

Indexing: yes. The stand-off file:

* 0 7 567 
* 9 13 568 
* 14 21 569

Tab-separated, columns: startpos endpos tfnode
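Once annotation is done, the stand-off table is what links a character span back to Text-Fabric word nodes. The sketch below reads such a table and looks up the nodes overlapping a span. It assumes the header row written by the cell in section 5 (`start_index`, `end_index`, `word_node`) and inclusive end positions; the function names are hypothetical and the demo rows reuse the three example lines above.

```python
import csv
import io

def read_standoff(handle):
    """Read a tab-separated stand-off table into (start, end, node) tuples."""
    reader = csv.DictReader(handle, delimiter='\t')
    return [
        (int(row['start_index']), int(row['end_index']), int(row['word_node']))
        for row in reader
    ]

def nodes_in_span(rows, start, end):
    """Return the nodes of all words overlapping the inclusive span [start, end]."""
    return [node for (s, e, node) in rows if s <= end and e >= start]

# In-memory stand-in for a .tsv file.
demo_tsv = (
    'start_index\tend_index\tword_node\n'
    '0\t7\t567\n'
    '9\t13\t568\n'
    '14\t21\t569\n'
)
rows = read_standoff(io.StringIO(demo_tsv))
found = nodes_in_span(rows, 9, 21)
```

With a real file, `read_standoff(open(path, newline=''))` would take the place of the `io.StringIO` stand-in.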

Yes, if you match on every word (\W+) instead of on `laugh`, you get the start position of every word, and you get the end position by adding the length of the word to it. See also the Python documentation for re, in particular Match objects: https://docs.python.org/3/library/re.html#module-re
I have to leave for a while now.
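A minimal sketch of the regex approach, using `re.finditer` and the `start()` and `end()` methods of Match objects. Note that `\W+` as written matches runs of non-word characters; the sketch uses `\S+` (runs of non-whitespace) instead, on the assumption that words in the transliterated text are separated by whitespace and may contain characters such as `>` and `<`. The function name and the demo string are illustrative.

```python
import re

def word_spans(text):
    """Return (start, end, token) for every whitespace-delimited token.

    The end position is exclusive: start plus the length of the token.
    """
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r'\S+', text)]

spans = word_spans('BR>CJT BR> >LHJM')
for start, end, token in spans:
    print(start, end, token)
```

This recovers positions from an existing text file, but it cannot recover the word nodes, which is why generating the stand-off file together with the text remains the safer route.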