<img align="right" src="images/tf-small.png" width="90"/>
<img align="right" src="images/etcbc.png" width="100"/>

# File Preparation for Annotation


In [None]:
__author__ = 'erwich/roorda'

## Introduction 

This notebook outputs three files:

* [text-trans-plain](https://dans-labs.github.io/text-fabric/Api/General/#text-representation) .txt files for annotation of the Biblical Hebrew text
* [text-orig-full](https://dans-labs.github.io/text-fabric/Api/General/#text-representation) .txt files for easy reading of the Biblical Hebrew text
* stand-off .tsv files for future Text-Fabric administration

These files cover the books and chapters that need to be annotated for coreference. The books and chapters can be specified in the `File Administration` cell below.

The annotation is done in [brat](http://brat.nlplab.org). 

The second notebook, [2. annotation_aid](2.annotation_aid.ipynb), explains which information is retrieved from TF to support the coreference annotation.

## 1. Import modules and utils

In [1]:
import sys, os, re, pickle, csv, collections
from shutil import rmtree
from glob import glob
from collections import *
from IPython.display import HTML
from pprint import pprint
from functools import reduce

from tf.app import use

## 2. Import Text-Fabric data 

For the coreference data I use the frozen 2017 version, taken from the ETCBC on 2017-10-06. The 2017 version is archived in Zenodo, DOI: [doi.org/10.5281/zenodo.1007624](https://zenodo.org/record/1302798#.W5ocuC1g3pI).

I use 2017 because I don't want the data to change while I'm annotating. The annotated data will be made available through TF later on. 

In [2]:
VERSION = '2017'
REFERENCE = f'bh-reference-system/tf/{VERSION}' # Check my GitHub to download extra pgn features
OUTPUT_BASE = os.path.expanduser('~/Sites/brat/data/coref')
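ANNOTATE = 'annotate'  # subdirectory for the transcribed text that brat displays
FULL = 'full'          # subdirectory for the full Hebrew text, for reading
STANDOFF = 'standoff'  # subdirectory for the .tsv files that link offsets to word nodes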

In [3]:
A = use('bhsa', version=VERSION, hoist=globals())

	connecting to online GitHub repo annotation/app-bhsa ... connected
Using TF-app in /Users/Christiaan/text-fabric-data/annotation/app-bhsa/code:
	rv1.0=#d3cf8f0c2ab5d690a0fda14ea31c33da5c5c8483 (latest release)
	connecting to online GitHub repo etcbc/bhsa ... connected
Using data in /Users/Christiaan/text-fabric-data/etcbc/bhsa/tf/2017:
	rv1.6=#bac4a9f5a2bbdede96ba6caea45e762fe88f88c5 (latest release)
	connecting to online GitHub repo etcbc/phono ... connected
Using data in /Users/Christiaan/text-fabric-data/etcbc/phono/tf/2017:
	r1.2 (latest release)
	connecting to online GitHub repo etcbc/parallels ... connected
Using data in /Users/Christiaan/text-fabric-data/etcbc/parallels/tf/2017:
	r1.2 (latest release)
   |      |     1.17s C __levels__           from otype, oslots, otext
   |      |       25s C __order__            from otype, oslots, __levels__
   |      |     1.17s C __rank__             from otype, __order__
   |      |       22s C __levUp__            from otype, oslots, __

In [4]:
TF.load('g_cons', add=True)

  0.00s loading features ...
   |     0.00s Not enough info for structure in otext, structure functionality will not work
  0.02s All additional features loaded - for details use loadLog()


## 3. Create files

In [11]:
def writeChapter(chapterNode, bookBase):
    (book, chapter, verse) = T.sectionFromNode(chapterNode, fillup=True)

    filename = f'{book}_{chapter:>03}'
    filename_txt = f'{filename}.txt'
    filename_tsv = f'{filename}.tsv'
    filename_txt_full = f'{filename}_full.txt'

    txtPos = 0
    
    def writeP(text, fh):
        nonlocal txtPos
        fh.write(text)
        txtPos += len(text)
        
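    # open all three files as UTF-8, so the Hebrew text in the full file
    # is written correctly regardless of the platform's default encoding
    with \
        open(f'{bookBase}/{ANNOTATE}/{filename_txt}', 'w', encoding='utf-8') as txt_f, \
        open(f'{bookBase}/{FULL}/{filename_txt_full}', 'w', encoding='utf-8') as txt_f_full, \
        open(f'{bookBase}/{STANDOFF}/{filename_tsv}', 'w', encoding='utf-8') as tsv_f \
    :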
        writeP(f'{filename}\n', txt_f)

        txt_f_full.write(f'{filename}\n')

        header = ['start_index', 'end_index', 'word_node', 'word']
        tsv_f.write('{}\n'.format('\t'.join(header)))

        for vn in L.d(chapterNode, 'verse'):
            verse = T.sectionFromNode(vn)[2]
            verse_words = L.d(vn, 'word')

            # write transcription and .tsv
            writeP(f'{verse} ', txt_f)

            for w in verse_words:
                word = F.g_cons.v(w)
                trailer = F.trailer.v(w)
                tsv_f.write(f'{txtPos}\t{txtPos + len(word)}\t{w}\t{word}\n')

                writeP(f'{word}{trailer}', txt_f)
            
            writeP('\n', txt_f)

            # write full text
            txt_f_full.write(f'{verse} {T.text(vn)}\n')
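
The offset bookkeeping in `writeChapter()` can be tried out without Text-Fabric. The following is a minimal sketch, not the notebook's own code: the function `write_with_offsets`, its input shape, and the node numbers are hypothetical stand-ins for the `L`, `F` and `T` lookups. It shows that every `(start_index, end_index)` pair written to the stand-off file is a character slice of the annotation text.

```python
import io


def write_with_offsets(title, verses):
    """Build the annotation text and the stand-off rows in one pass.

    `verses` is a list of (verse_label, [(node, word, trailer), ...]) pairs.
    This input shape is an assumption for illustration only.
    """
    txt = io.StringIO()
    rows = []
    pos = 0  # running character position, like txtPos in writeChapter()

    def write_p(text):
        nonlocal pos
        txt.write(text)
        pos += len(text)

    write_p(f'{title}\n')
    for label, words in verses:
        write_p(f'{label} ')
        for node, word, trailer in words:
            # record the span of the word itself, excluding its trailer
            rows.append((pos, pos + len(word), node, word))
            write_p(f'{word}{trailer}')
        write_p('\n')
    return txt.getvalue(), rows


# made-up sample data: node numbers and words are illustrative only
sample_verses = [
    (1, [(101, 'B', ''), (102, 'R>CJT', ' '), (103, 'BR>', ' ')]),
    (2, [(104, 'W', ''), (105, 'H', ''), (106, '>RY', ' ')]),
]
text, rows = write_with_offsets('Genesis_001', sample_verses)

# every row must point at exactly its own word in the text
for start, end, node, word in rows:
    assert text[start:end] == word
```

Because the title line and the verse numbers are written through the same counting helper, they shift the offsets of all following words, which is why the annotation file and the stand-off file have to be produced together.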

In [12]:
def writeBook(bookName):
    bookBase = f'{OUTPUT_BASE}/{bookName}'
    for path in (ANNOTATE, FULL, STANDOFF):
        fullPath = f'{bookBase}/{path}'
        if os.path.exists(fullPath):  # test the full path, not the bare subdirectory name
            if fullPath.endswith(ANNOTATE):
                for file in glob(f'{fullPath}/*.txt'):
                    os.unlink(file)
            else:
                rmtree(fullPath)
        if not os.path.exists(fullPath):
            os.makedirs(fullPath, exist_ok=True)
            
    bookNode = T.bookNode(bookName)
    chapters = L.d(bookNode, otype='chapter')
    info(f'Write {bookName}: {len(chapters)} chapters')
    for chapter in chapters:
        writeChapter(chapter, bookBase)
        info('.', tm=False, nl=False)
        
    info('', tm=False)
    info('Done writing files (.txt .tsv _full.txt)')
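
The clean-up step in `writeBook()` treats the `annotate` directory differently from the other two: only its `.txt` files are deleted, while the other directories are removed wholesale. A plausible reason, stated here as an assumption, is that brat stores its annotations in `.ann` files next to the `.txt` files, and those should survive a rerun. The sketch below reproduces that logic in a temporary directory; `prepare_dirs` and `demo` are hypothetical names, not part of the notebook.

```python
import os
import tempfile
from glob import glob
from shutil import rmtree


def prepare_dirs(book_base, subdirs=('annotate', 'full', 'standoff'), keep='annotate'):
    """Reset the output directories, keeping non-.txt files in `keep`."""
    for sub in subdirs:
        full_path = os.path.join(book_base, sub)
        if os.path.isdir(full_path):
            if sub == keep:
                # remove only the generated text; leave other files alone
                for file in glob(os.path.join(full_path, '*.txt')):
                    os.unlink(file)
            else:
                rmtree(full_path)
        os.makedirs(full_path, exist_ok=True)


def demo():
    """Run prepare_dirs twice and report what is left afterwards."""
    with tempfile.TemporaryDirectory() as base:
        prepare_dirs(base)
        annotate = os.path.join(base, 'annotate')
        full = os.path.join(base, 'full')
        for folder, name in (
            (annotate, 'Genesis_001.txt'),
            (annotate, 'Genesis_001.ann'),  # stands in for a brat annotation file
            (full, 'Genesis_001_full.txt'),
        ):
            with open(os.path.join(folder, name), 'w', encoding='utf-8') as fh:
                fh.write('x')
        prepare_dirs(base)  # second run: simulates regenerating a book
        return sorted(os.listdir(annotate)), sorted(os.listdir(full))


remaining_annotate, remaining_full = demo()
```

After the second run only the `.ann` file remains in `annotate`, and `full` is empty again.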

## 4. Execute function 

Call `writeBook()` with the English name of any book of the Hebrew Bible, passed as a string.

In [14]:
writeBook('Genesis')

38m 30s Write Genesis: 50 chapters
..................................................
38m 30s Done writing files (.txt .tsv _full.txt)
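
Once the files exist, the stand-off `.tsv` can be used to translate a character span, such as one selected in brat, back into word nodes. The following is a minimal sketch under stated assumptions: the column names are the ones written by `writeChapter()`, but `load_standoff`, `words_in_span`, and the sample rows are hypothetical and only illustrate the lookup.

```python
import csv
import io


def load_standoff(tsv_file):
    """Read stand-off rows as (start, end, node, word) tuples."""
    # QUOTE_NONE: the file is written with plain tab joins, without quoting
    reader = csv.DictReader(tsv_file, delimiter='\t', quoting=csv.QUOTE_NONE)
    return [
        (int(r['start_index']), int(r['end_index']), int(r['word_node']), r['word'])
        for r in reader
    ]


def words_in_span(rows, start, end):
    """Return the nodes of all words that overlap the span [start, end)."""
    return [node for (s, e, node, word) in rows if s < end and e > start]


# made-up sample content in the layout of a stand-off file
sample = (
    'start_index\tend_index\tword_node\tword\n'
    '14\t15\t101\tB\n'
    '15\t20\t102\tR>CJT\n'
    '21\t24\t103\tBR>\n'
)
rows = load_standoff(io.StringIO(sample))

# a span covering characters 15 up to 24 touches the second and third word
nodes = words_in_span(rows, 15, 24)
```

With a real file, `io.StringIO(sample)` would be replaced by an open file handle on one of the generated `.tsv` files.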
