To fix: 

* When only a few chapters are selected for conversion with `selectChapters = {1, 2}`, all earlier conversions to TF are overwritten.

* When the annotations of several books have been converted and a new push is done for one specific book, all annotations are overwritten except for the updated data of that one book.

Workaround for now:

* Correct the annotations, and push all annotated books again, one by one. It only takes a minute, but a fix is still needed.

<img align="right" src="images/tf-small.png" width="90"/>
<img align="right" src="images/etcbc.png" width="100"/>

# From Coreference Annotations to Text-Fabric Data

The code in this notebook converts the coreference resolution annotations made in brat to Text-Fabric data. Run the cells in order and follow the instructions, which are given either in the text cells above the code or in code comments starting with `#`.

In [None]:
__author__ = 'erwich/roorda'

### Load Modules and Utils

In [None]:
import os
from shutil import rmtree
from glob import glob

from tf.app import use
from tf.fabric import Fabric

### File Administration

The annotated data for the TF conversion is read from a location on your computer of the form:
* `{OUTPUT_BASE}/{bookName}/{ANNOTATE}/*.ann`

For the standoff files the form of the location is: 
* `{OUTPUT_BASE}/{bookName}/{STANDOFF}/{fileName}.tsv`

The location on your PC where the converted TF data is stored has the form: 
* `{GITHUB_BASE}/{ORG}/{REPO}/{PATH}/{VERSION}`

All the CONSTANTS can be specified in the cell below. Coreference annotations can be converted for any Hebrew Bible book for which they exist. The book is specified by the `bookName` argument of the function `getFeatures()` below, e.g. `getFeatures('Psalms')`.

The BHSA text that forms the basis of the coreference annotations was generated from the fixed 2017 version, so do not change this.
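As an illustration of how these location patterns combine, here is a minimal sketch. The helper `bookPaths` and the file name `Numbers_01` are hypothetical and only serve the example; the constants repeat the values used in this notebook.

```python
import os

# Constants mirroring the ones defined in this notebook
OUTPUT_BASE = os.path.expanduser('~/Sites/brat/data/coref')
GITHUB_BASE = os.path.expanduser('~/github')
ORG, REPO, PATH, VERSION = 'cmerwich', 'participant-analysis', 'coreference/tf', '2017'
ANNOTATE, STANDOFF = 'annotate', 'standoff'


def bookPaths(bookName, fileName):
    """Hypothetical helper: build the three locations for one chapter file."""
    bookBase = f'{OUTPUT_BASE}/{bookName}'
    return dict(
        ann=f'{bookBase}/{ANNOTATE}/{fileName}.ann',
        standoff=f'{bookBase}/{STANDOFF}/{fileName}.tsv',
        tf=f'{GITHUB_BASE}/{ORG}/{REPO}/{PATH}/{VERSION}',
    )


# 'Numbers_01' is an assumed file name, not taken from the real data
paths = bookPaths('Numbers', 'Numbers_01')
for (kind, location) in paths.items():
    print(f'{kind:<8} {location}')
```

Printing the paths before running the conversion is a cheap way to check that the constants point to the folders you expect.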

In [None]:
# I have used the fixed 2017 data, do not change this
VERSION = '2017'

# Enter your GitHub user or organization name
ORG = 'cmerwich'

# The locations where you want to store the converted data 
REPO = 'participant-analysis' 
PATH = 'coreference/tf'

# The output base indicates where the programs below can find the annotated data
OUTPUT_BASE = os.path.expanduser('~/Sites/brat/data/coref')

# GitHub location on your computer
GITHUB_BASE = os.path.expanduser('~/github')

# This is the annotation folder
ANNOTATE = 'annotate'

# The standoff folder is important for the TF conversion
STANDOFF = 'standoff'

### Load the TF App

In [None]:
A = use('bhsa', version=VERSION, hoist=globals())

### Load consonantal feature

`g_cons` is the consonantal representation of a word occurrence in BHSA transliteration. It operates on a word object. The feature is needed for the conversion of the annotation of suffixes. 

In [None]:
TF.load('g_cons', add=True)

### Select Chapters for TF Conversion

When only part of the annotations is finished, you can restrict the conversion to specific chapters. Uncomment line 1 in the cell below and comment out line 3. To convert all annotations, run the cell as it is.

In [None]:
#selectChapters = {1, 2}

selectChapters = None

### Node Features for TF

Three features are made:

1. `mention`: contains all referring expressions, which consist of NPs, named entities, suffixes, person/gender/number of verbs, personal pronouns and demonstrative pronouns.
2. `mentionNote`: annotator notes that have been stored on the mentions. 
3. `coref`: the coreference relations between mentions. 
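The values of `mention` and `coref` are compact strings: every annotation on a node becomes five comma-separated fields (label letter, number, word size, suffix flag, word part), and several annotations on the same node are joined with `|`. The sketch below decodes such a value. The function `decodeMention` and the sample value are hypothetical, and the sketch assumes that only the last field (the word part) could contain a comma and that no field contains `|`.

```python
def decodeMention(value):
    """Hypothetical helper: split a `mention` or `coref` value into dicts."""
    result = []
    for item in value.split('|'):
        # At most 4 splits, so a comma inside the word part would survive
        (kind, number, wordSize, isSuffix, wordPart) = item.split(',', 4)
        result.append(dict(
            kind=kind,                                 # 'T' (mention) or 'C' (class)
            number=int(number),                        # number of the mention or class
            wordSize=int(wordSize) if wordSize else 1, # empty means a single word
            isSuffix=isSuffix == 's',                  # 's' marks a suffix
            wordPart=wordPart or None,                 # only set for partial words
        ))
    return result


# Made-up value: a three-word mention T12 and a suffix mention T13
example = 'T,12,3,,|T,13,,s,W'
decoded = decodeMention(example)
for entry in decoded:
    print(entry)
```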

In [None]:
nodeFeatures = dict(
    mention={},
    mentionNote={},
    coref={},
)

In [None]:
def lookupNode(aStart, standoffInfo, standoffInfoNonFirst):
    node = None
    isPart = False
    if aStart in standoffInfo:
        (node, sEnd, sWord) = standoffInfo[aStart]
    else:
        if aStart in standoffInfoNonFirst:
            (node, sEnd, sWord) = standoffInfo[standoffInfoNonFirst[aStart]]
            isPart = aStart != sEnd
    return (node, isPart)
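To see what `lookupNode` returns, the sketch below repeats the function and applies it to a toy standoff table. The node numbers, offsets and words are made up for illustration. `standoffInfo` maps the start offset of each word to `(node, end, word)`, and `standoffInfoNonFirst` maps every other offset to the start offset of the word it belongs to.

```python
def lookupNode(aStart, standoffInfo, standoffInfoNonFirst):
    # Same logic as the cell above, repeated to keep the sketch self-contained
    node = None
    isPart = False
    if aStart in standoffInfo:
        (node, sEnd, sWord) = standoffInfo[aStart]
    else:
        if aStart in standoffInfoNonFirst:
            (node, sEnd, sWord) = standoffInfo[standoffInfoNonFirst[aStart]]
            isPart = aStart != sEnd
    return (node, isPart)


# Toy data: two words at offsets 0-3 and 4-6 (end offsets are exclusive)
standoffInfo = {0: (1001, 3, 'ABC'), 4: (1002, 6, 'DE')}
standoffInfoNonFirst = {1: 0, 2: 0, 3: 0, 5: 4, 6: 4}

wordStart = lookupNode(0, standoffInfo, standoffInfoNonFirst)   # exact word start
wordEnd = lookupNode(3, standoffInfo, standoffInfoNonFirst)     # exact word end
inside = lookupNode(5, standoffInfo, standoffInfoNonFirst)      # inside the word 'DE'
missing = lookupNode(99, standoffInfo, standoffInfoNonFirst)    # outside the text

print(wordStart, wordEnd, inside, missing)
```

A mention that starts at a word start and ends at a word end yields `isPart == False` on both sides, so it covers whole words. An offset strictly inside a word yields `isPart == True`, which is how partial words such as suffixes are detected.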

### Get All Features 

Specify the Hebrew Bible book that has been annotated in `getFeatures()` below. `getFeatures` assembles all the features that are necessary for a sound TF conversion. 

In [None]:
def getFeatures(bookName):
    bookBase = f'{OUTPUT_BASE}/{bookName}'
    files = glob(f'{bookBase}/{ANNOTATE}/*.ann')
    
    nChapters = 0
    
    for annFile in sorted(files):
        (directory, fileNameFull) = os.path.split(annFile)
        (fileName, ext) = os.path.splitext(fileNameFull)
        standoffFile = f'{bookBase}/{STANDOFF}/{fileName}.tsv'
        chapter = int(fileName[len(bookName) + 1:].lstrip('0'))
        if selectChapters is not None and chapter not in selectChapters:
            continue
        
        standoffInfo = {}
        standoffInfoNonFirst = {}
        
        with open(standoffFile) as fh:
            first = True
            errors = 0
            minPos = None
            maxPos = None
            for (i, line) in enumerate(fh):
                epos = f'{fileName}.tsv:{i + 1} - '
                if first:
                    first = False
                    continue  # header
                (start, end, node, word) = line.rstrip('\n').split('\t')
                start = int(start)
                end = int(end)
                node = int(node)
                if start == end:  # empty word
                    continue
                if maxPos is None or end > maxPos:
                    maxPos = end
                if minPos is None or start < minPos:
                    minPos = start
                if start in standoffInfo:
                    error(f'{epos}{start} for multiple items: {standoffInfo[start]}')
                    errors += 1
                standoffInfo[start] = (node, end, word)
        
        currentStart = None
        
        for p in range(minPos, maxPos + 1):
            if p in standoffInfo:
                currentStart = p
            else:
                standoffInfoNonFirst[p] = currentStart
        
        if errors:
            error(f'{bookName} {chapter}: {errors} errors in standoff file')
        
        errors = 0
        
        mention = {}
        mentionNote = {}
        coref = {}
        
        firstChars = {'T', '#', '*'}
        cClass = 0
        
        with open(annFile) as fh:
            for (i, line) in enumerate(fh):
                epos = f'{fileName}.ann:{i + 1} - '
                line = line.rstrip('\n')
                if not line:  # skip blank lines, which would break line[0]
                    continue
                firstChar = line[0]
                
                if firstChar not in firstChars:
                    error(f'{epos}Unrecognized line "{line}"')
                    errors +=1
                    continue
                    
                numFields = 2 if firstChar =='*' else 3
                parts = line.split('\t')
                
                if len(parts) != numFields:
                    error(f'{epos}line does not have exactly {numFields} parts: "{line}"')
                    errors += 1
                    continue
                        
                if firstChar == 'T':
                    (tPart, mentionStr, aWord) = parts
                    mParts = mentionStr.split()
                    if len(mParts) != 3:
                        error(f'{epos}T-line mention does not have exactly 3 parts: "{line}"')
                        errors += 1
                        continue
                    (mm, aStart, aEnd) = mParts
                    aStart = int(aStart)
                    aEnd = int(aEnd)
                    
                    (nodeStart, isPartStart) = lookupNode(aStart, standoffInfo, standoffInfoNonFirst)
                    if nodeStart is None:
                        error(f'{epos}Mention start position not found in standoff file "{line}"')
                        errors += 1
                        continue

                    (nodeEnd, isPartEnd) = lookupNode(aEnd, standoffInfo, standoffInfoNonFirst)
                    if nodeEnd is None:
                        error(f'{epos}Mention end position not found in standoff file "{line}"')
                        errors += 1
                        continue
                    wordSize = nodeEnd - nodeStart + 1
                    wordPart = aWord if isPartStart or isPartEnd else None
                            
                    mention.setdefault(nodeStart, []).append((tPart, wordSize, wordPart))
                        
                elif firstChar == '#':
                    (code, spec, note) = parts
                    sParts = spec.split()
                    if len(sParts) != 2:
                        error(f'{epos}#-line spec does not have exactly 2 parts: "{line}"')
                        errors += 1
                        continue
                    tPart = sParts[1]
                    mentionNote.setdefault(tPart, set()).add(note)
                elif firstChar == '*':
                    (char, data) = parts
                    dataParts = data.split()
                    if len(dataParts) <= 1 or dataParts[0] != 'Coreference':
                        error(f'{epos}*-line spec does not have the right parts: "{line}"')
                        errors += 1
                        continue
                    cClass += 1
                    for tPart in dataParts[1:]:
                        if tPart in coref:
                            error(f'{epos}*-line: "{tPart}" occurs in multiple classes ("{coref[tPart]}") in "{line}"')
                            errors += 1
                            continue
                        coref[tPart] = f'C{cClass}'
                    
        if errors:
            error(f'{bookName} {chapter}: {errors} errors in annotation file')
        else:
            info('.', tm=False, nl=False)
        
        for (node, parts) in mention.items():
            parts = sorted(
                (x for x in parts),
                key=lambda x: ('' if x[2] is None else x[2], x[1], x[0])
            )
            valuesM = []
            valuesC = []
            notes = set()

            for (tPart, wordSize, wordPart) in parts:
                cPart = coref.get(tPart, tPart)

                wordSize = str(wordSize) if wordSize > 1 else ''
                wordPart = wordPart or ''

                isSuffix = 's' if wordPart and F.g_cons.v(node).endswith(wordPart) else ''
                
                valueM = (tPart[0], tPart[1:], wordSize, isSuffix, wordPart)
                valueC = (cPart[0], cPart[1:], wordSize, isSuffix, wordPart)
                
                valuesM.append(','.join(valueM))
                valuesC.append(','.join(valueC))
                
                if tPart in mentionNote:
                    notes |= mentionNote[tPart]
                         
            nodeFeatures['mention'][node] = '|'.join(valuesM)            
            nodeFeatures['coref'][node] = '|'.join(valuesC)
            if notes:
                nodeFeatures['mentionNote'][node] = '|'.join(sorted(notes))
                
        nChapters += 1
                
    info('', tm=False)
    info(f'Done assembling features of {nChapters} chapters')

In [None]:
getFeatures('Numbers')
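`getFeatures` recognizes three kinds of lines in a brat `.ann` file: `T` lines (mentions with character offsets), `#` lines (annotator notes attached to a mention) and `*` lines (coreference classes). The sketch below classifies single lines in the same way, without error reporting. The function `parseAnnLine`, the label `Mention` and the sample lines are hypothetical; `Coreference` is the label that the code above checks for.

```python
def parseAnnLine(line):
    """Hypothetical helper: classify one brat .ann line, or return None."""
    line = line.rstrip('\n')
    parts = line.split('\t')
    firstChar = line[:1]

    if firstChar == 'T' and len(parts) == 3:
        (tPart, mentionStr, word) = parts
        mParts = mentionStr.split()
        if len(mParts) != 3:
            return None
        (label, start, end) = mParts
        return dict(kind='mention', id=tPart, label=label,
                    start=int(start), end=int(end), word=word)

    if firstChar == '#' and len(parts) == 3:
        (code, spec, note) = parts
        sParts = spec.split()
        if len(sParts) != 2:
            return None
        return dict(kind='note', target=sParts[1], note=note)

    if firstChar == '*' and len(parts) == 2:
        dataParts = parts[1].split()
        if len(dataParts) <= 1 or dataParts[0] != 'Coreference':
            return None
        return dict(kind='coref', members=dataParts[1:])

    return None


# Made-up sample lines in brat standoff layout
sampleLines = [
    'T1\tMention 0 3\tABC',
    'T2\tMention 10 12\tDE',
    '#1\tAnnotatorNotes T1\tcheck this',
    '*\tCoreference T1 T2',
    'X\tnot a known line type',
]
parsed = [parseAnnLine(line) for line in sampleLines]
for item in parsed:
    print(item)
```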

### Specify Metadata

Before the actual conversion takes place, the metadata must be specified below. Since I have annotated the Psalms data myself, I am the only author. Of course, it is possible to specify multiple authors or annotators.

In [None]:
metaData = {
    '': dict(
            title='Participant analysis',
            author='Christiaan Erwich',
    ),
    'mention': dict(
        valueType='str',
        description='mention annotations made through Brat',
        explanation='the analysis is per chapter',
    ),
    'mentionNote': dict(
        valueType='str',
        description='comments on mention annotations made through Brat',
        explanation='the analysis is per chapter',
    ),
    'coref': dict(
        valueType='str',
        description='coreference equivalence class of mention annotations made through Brat',
        explanation='the analysis is per chapter',
    ),
}

edgeFeatures = {}

### Convert 

The new features are stored in my GitHub repository, as can be seen below.

In [None]:
TFW = Fabric(locations=f'{GITHUB_BASE}/{ORG}/{REPO}/{PATH}/{VERSION}')
TFW.save(
    nodeFeatures=nodeFeatures,
    edgeFeatures=edgeFeatures,
    metaData=metaData,
)
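The overwriting described in the to-fix list at the top of this notebook follows from the fact that `nodeFeatures` only holds the chapters and books converted in the current session, while the save step writes complete feature files. One possible direction for a fix, sketched below under assumptions, is to merge the new values into the previously saved ones before saving. The function `mergeFeatures` is hypothetical and not part of this notebook or of Text-Fabric. It assumes that the earlier values have already been read back into plain dictionaries, and that you know which nodes belong to the part that was converted again.

```python
def mergeFeatures(existing, new, replacedNodes):
    """Hypothetical helper: combine earlier feature data with a new conversion.

    existing, new: dicts of the form {featureName: {node: value}}
    replacedNodes: set of nodes in the reconverted part; their old values
                   are dropped so that removed annotations do not linger
    """
    merged = {}
    for feature in set(existing) | set(new):
        old = existing.get(feature, {})
        kept = {n: v for (n, v) in old.items() if n not in replacedNodes}
        kept.update(new.get(feature, {}))
        merged[feature] = kept
    return merged


# Made-up data: nodes 1-2 belong to one book, nodes 10-11 to the reconverted book
existing = dict(
    mention={1: 'T,1,,,', 10: 'T,1,,,', 11: 'T,2,,,'},
    coref={1: 'C,1,,,', 10: 'C,1,,,'},
)
new = dict(
    mention={10: 'T,1,2,,'},
    coref={10: 'C,1,2,,'},
)
merged = mergeFeatures(existing, new, replacedNodes={10, 11})
print(merged)
```

In this toy example the data of node 1 survives, node 10 gets its updated value, and the stale annotation on node 11 disappears. The merged dictionaries could then be passed as `nodeFeatures` to the save step.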