<img align="right" src="tf-small.png"/>

# From LAF-Fabric to Text-Fabric

This notebook turns the older ETCBC4 and ETCBC4B datasets from LAF-Fabric into Text-Fabric.

In [21]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [22]:
import collections
import sys

In [23]:
from laf.fabric import LafFabric
from etcbc.lib import Transcription
from etcbc.blang import booklangs, booknames
fabric = LafFabric()

  0.00s This is LAF-Fabric 4.8.3
API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html
Feature doc: https://shebanq.ancient-data.org/static/docs/featuredoc/texts/welcome.html



# ETCBC4 and ETCBC4B

The data of the ETCBC4 and ETCBC4B versions are complete with extra data and multilingual book names.

The feature lists of the two versions are nearly identical: version 4 has `clause_kind` where version 4b has `kind`.
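
A quick way to see how two feature lists differ is to compare them as sets. The sketch below uses shortened stand-ins (`features4` and `features4b` are hypothetical names holding only a few entries of the full lists defined in the next cell), so it illustrates the comparison rather than reproducing the full data.

```python
# Shortened stand-ins for the full feature lists defined in the next cell.
features4 = '''
    is_root
    clause_kind
    label
'''
features4b = '''
    is_root
    kind
    label
'''

# Features that occur in one version but not in the other
only4 = set(features4.split()) - set(features4b.split())
only4b = set(features4b.split()) - set(features4.split())

print(sorted(only4))   # features present in version 4 only
print(sorted(only4b))  # features present in version 4b only
```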

In [24]:
source = 'etcbc'
version = '4b'
ETCBC = 'hebrew/{}{}'.format(source, version)

In [25]:
nodeFeaturesList = {
    '4': '''
        otype
        monads
        book
        chapter
        dist
        dist_unit
        domain
        entry_heb
        entryid
        entry
        freq_lex
        freq_occ
        function
        g_cons
        g_cons_utf8
        g_entry_heb
        g_entry
        g_lex
        g_lex_utf8
        g_nme
        g_nme_utf8
        g_pfm
        g_pfm_utf8
        g_prs
        g_prs_utf8
        g_qere_utf8
        g_uvf
        g_uvf_utf8
        g_vbe
        g_vbe_utf8
        g_vbs
        g_vbs_utf8
        g_word
        g_word_utf8
        gloss
        gn
        id
        instruction
        is_root
        clause_kind
        label
        lan
        language
        lex
        lex_utf8
        ls
        nametype
        mother_object_type
        nme
        nu
        number
        number_in_ch
        pargr
        pdp
        pfm
        phono_sep
        phono
        pos
        prs
        ps
        qtrailer_utf8
        rank_lex
        rank_occ
        rela
        root
        sp
        st
        subpos
        tab
        trailer_utf8
        txt
        typ
        uvf
        vbe
        vbs
        verse
        vs
        vt
    ''',
    '4b': '''
        otype
        monads
        book
        chapter
        dist
        dist_unit
        domain
        entry_heb
        entryid
        entry
        freq_lex
        freq_occ
        function
        g_cons
        g_cons_utf8
        g_entry_heb
        g_entry
        g_lex
        g_lex_utf8
        g_nme
        g_nme_utf8
        g_pfm
        g_pfm_utf8
        g_prs
        g_prs_utf8
        g_qere_utf8
        g_uvf
        g_uvf_utf8
        g_vbe
        g_vbe_utf8
        g_vbs
        g_vbs_utf8
        g_word
        g_word_utf8
        gloss
        gn
        id
        instruction
        is_root
        kind
        label
        lan
        language
        lex
        lex_utf8
        ls
        nametype
        mother_object_type
        nme
        nu
        number
        number_in_ch
        pargr
        pdp
        pfm
        phono_sep
        phono
        pos
        prs
        ps
        qtrailer_utf8
        rank_lex
        rank_occ
        rela
        root
        sp
        st
        subpos
        tab
        trailer_utf8
        txt
        typ
        uvf
        vbe
        vbs
        verse
        vs
        vt
    '''
}
edgeFeaturesList = {
    '4': '''
        mother functional_parent distributional_parent
    ''',
    '4b': '''
        mother functional_parent distributional_parent
    '''
}

intFeatures = set('''
    dist
    number
    number_in_ch
    tab
    chapter
    verse
    freq_lex
    freq_occ
    rank_lex
    rank_occ
'''.strip().split())

In [26]:
API = fabric.load(source+version, 'lexicon,para', 'TF'+version, {
    "xmlids": {"node": False, "edge": False},
    "features": (nodeFeaturesList[version], edgeFeaturesList[version]),
    "primary": False,
})
exec(fabric.localnames.format(var='fabric'))

  0.00s LOADING API: please wait ... 
  0.00s USING main: etcbc4b DATA COMPILED AT: 2015-11-02T15-08-56
  0.01s USING annox: lexicon DATA COMPILED AT: 2016-07-08T14-32-54
  0.02s USING annox: para DATA COMPILED AT: 2016-07-08T14-38-37
    17s LOGFILE=/Users/dirk/laf/laf-fabric-output/etcbc4b/TF4b/__log__TF4b.txt
    17s INFO: DATA LOADED FROM SOURCE etcbc4b AND ANNOX lexicon, para FOR TASK TF4b AT 2017-01-11T13-27-32


# Preparations

In [27]:
nodeFeatures = {}
edgeFeatures = {}
metaData = {}

## Map slot numbers
In TF the slots are numbered consecutively from 1 to `maxSlot`, without holes.
The original monad sequence in LAF-Fabric may contain holes,
so we may have to map the original monad numbers to the node numbers of the words in TF.

We create the following mappings:

* `lfFromMonad`: original monad number => LAF-Fabric node number
* `tfFromMonad`: original monad number => Text-Fabric slot number
* `tfFromLf`: LAF-Fabric node number => Text-Fabric node number (only for monads/slots)
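
To see what these mappings do when the monad sequence has a hole, here is a minimal sketch on made-up data. The names ending in `Toy` are hypothetical and only serve this illustration; the real mappings are built from the loaded data in the cell below.

```python
# Toy data: monad number -> LAF-Fabric node number, with a hole at monad 3.
lfFromMonadToy = {1: 0, 2: 1, 4: 2, 5: 3}

tfFromMonadToy = {}
tfFromLfToy = {}

# Walk the monads in ascending order and hand out slots 1, 2, 3, ...
for (slot, (m, w)) in enumerate(sorted(lfFromMonadToy.items()), start=1):
    tfFromMonadToy[m] = slot
    tfFromLfToy[w] = slot

print(tfFromMonadToy)
print(tfFromLfToy)
```

Monads 4 and 5 end up at slots 3 and 4, so the hole disappears in the TF numbering.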

In [28]:
lfFromMonad = {}
tfFromMonad = {}
tfFromLf = {}

for w in F.otype.s('word'):
    m = int(F.monads.v(w))
    lfFromMonad[m] = w

for (i, (m, w)) in enumerate(sorted(lfFromMonad.items())):
    tfFromLf[w] = i+1
    tfFromMonad[m] = i+1

In [29]:
maxLfNode = max(NN())
maxMonadNode = max(lfFromMonad.values())
maxMonad = max(tfFromMonad)
maxSlot = max(tfFromMonad.values())

## Map all nodes

After the last slot, the other nodes start. We map the remaining LAF-Fabric nodes to Text-Fabric nodes.

In [30]:
tfn = maxSlot
for lfn in range(maxMonadNode+1, maxLfNode+1):
    tfn += 1
    tfFromLf[lfn] = tfn
maxNode = tfn

I suspect that the LF monads are identical to the TF slots,
and that every TF node number is exactly 1 higher than the corresponding LF node number.
Let us check that.

In [31]:
good = True
for ln in tfFromLf:
    if ln+1 != tfFromLf[ln]:
        good = False
for m in tfFromMonad:
    if m != tfFromMonad[m]:
        good = False
print(good)

True


## Final correspondence between LF and TF
With this out of the way, the mappings between monads, LF nodes, slots and TF nodes are simple:
monads equal slots, and a TF node is its LF node plus 1.
We free up some memory and define some simple mapping functions.
These mapping functions must be used to translate LF nodes/monads into TF nodes/slots.

Remember that the following are already defined:

* `maxMonad` : maximum monad in LF
* `maxSlot`  : maximum slot in TF
* `maxLfNode`: maximum node in LF
* `maxNode`  : maximum node in TF

In [32]:
lfFromMonad = None
tfFromMonad = None
tfFromLf = None

# Monads and slots coincide; TF nodes are LF nodes shifted by 1 (verified above)
def tFm(m): return m
def tFn(n): return n+1

def tFms(monadList): return set(monadList)
def tFns(lfNodeList): return {n+1 for n in lfNodeList}

print('''
LAF-Fabric:
  {:<12} = {:>7}
  {:<12} = {:>7}
  {:<12} = {:>7}
Text-Fabric
  {:<12} = {:>7}
  {:<12} = {:>7}
'''.format(
    'maxMonad', maxMonad,
    'maxMonadNode', maxMonadNode,
    'maxLfNode', maxLfNode, 
    'maxSlot', maxSlot, 
    'maxNode', maxNode,
))


LAF-Fabric:
  maxMonad     =  426568
  maxMonadNode =  426567
  maxLfNode    = 1436857
Text-Fabric
  maxSlot      =  426568
  maxNode      = 1436858



## oslots

Here is code to collect the oslots edge information: for every non-slot node, the set of slots it occupies.
TF writes this to a compact text file when we save.
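
The cell below calls `setFromSpec` on the `monads` value of each node. As a rough illustration of what such a conversion involves, the sketch here assumes that a monad specification is a comma-separated list of numbers and ranges, such as `1-3,7`. The function `parseSpec` is a hypothetical stand-in written for this illustration; it is not the implementation of `setFromSpec`.

```python
def parseSpec(spec):
    """Expand a spec like '1-3,7' into the set {1, 2, 3, 7} (illustrative helper)."""
    slots = set()
    for part in spec.split(','):
        part = part.strip()
        if not part:
            continue
        if '-' in part:
            # A range 'a-b' contributes every number from a to b inclusive
            (start, end) = part.split('-', 1)
            slots.update(range(int(start), int(end) + 1))
        else:
            slots.add(int(part))
    return slots

print(sorted(parseSpec('1-3,7')))
```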

In [33]:
from tf.helpers import *

In [34]:
oslotsData = {}

for n in range(maxMonadNode+1, maxLfNode+1):
    oslotsData[tFn(n)] = tFms(setFromSpec(F.monads.v(n)))

 2m 01s END


# Additional data

## Book names international

For each language in which book names have been translated, we add a feature `book@ll`, where
`ll` is the two-letter language code.
The feature gives for each book node the name of that book in that language.
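
The shape of the resulting features can be sketched on made-up data, following the naming used in the code cell below (`book@` followed by the language code). The variables `toyLangs`, `toyNames` and `toyBooks` are hypothetical stand-ins for `booklangs['Hebrew']`, `booknames['Hebrew']` and the book nodes; their contents are invented for the example.

```python
# Made-up stand-ins: language code -> (English name, own name) of the language
toyLangs = {'en': ('English', 'English'), 'fr': ('French', 'Français')}
# Made-up stand-ins: language code -> book names, in book order
toyNames = {'en': ('Genesis', 'Exodus'), 'fr': ('Genèse', 'Exode')}
# Made-up LAF-Fabric node numbers of two book nodes
toyBooks = [100, 101]

toyFeatures = {}
for (code, (langEng, langOwn)) in sorted(toyLangs.items()):
    fName = 'book@{}'.format(code)
    # TF node = LF node + 1, as established earlier in this notebook
    toyFeatures[fName] = {b + 1: toyNames[code][i] for (i, b) in enumerate(toyBooks)}

print(toyFeatures)
```

Each feature is a dictionary keyed by TF book node, so one feature per language gives the name of every book in that language.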

In [35]:
langs = booklangs['Hebrew']
names = booknames['Hebrew']
books = [b for b in F.otype.s('book')]
bookIndex = dict(((b,i) for (i,b) in enumerate(books)))

textConfig = {
    'sectionTypes':              'book,chapter,verse',
    'sectionFeatures':           'book,chapter,verse',
    'fmt:text-orig-full':        '{g_qere_utf8/g_word_utf8}{qtrailer_utf8/trailer_utf8}',
    'fmt:text-orig-full-ketiv':  '{g_word_utf8}{trailer_utf8}',
    'fmt:text-orig-plain':       '{g_cons_utf8}{trailer_utf8}',
    'fmt:text-trans-full':       '{g_word} ',
    'fmt:text-trans-full-ketiv': '{g_word} ',
    'fmt:text-trans-plain':      '{g_cons} ',
    'fmt:lex-orig-full':         '{g_lex_utf8} ',
    'fmt:lex-orig-plain':        '{lex_utf8} ',
    'fmt:lex-trans-full':        '{g_lex} ',
    'fmt:lex-trans-plain':       '{lex0} ',
}

metaData.update(dict(otext=textConfig))

for (code, (langEng, langOwn)) in sorted(booklangs['Hebrew'].items()):
    fName = 'book@{}'.format(code)
    metaData[fName] = dict(
        valueType = 'str',
        source='blang.py in LAF-Fabric',
        languageCode=code,
        languageEnglish=langEng,
        language=langOwn
    )
    nodeFeatures[fName] = dict(((tFn(b), names[code][bookIndex[b]]) for b in books))
# At this point nodeFeatures holds only the book name features, one per language
print('{} book name languages'.format(len(nodeFeatures)))

26 book name languages


# All node and edge features

In [36]:
metaData.update({'': dict(source='ETCBC{} via LAF-Fabric'.format(version))})
for nf in nodeFeaturesList[version].strip().split():
    metaData.setdefault(nf, {})['valueType'] = 'int' if nf in intFeatures else 'str'

metaData.setdefault('oslots', {})['valueType'] = 'str'
for ef in edgeFeaturesList[version].strip().split():
    metaData.setdefault(ef, {})['valueType'] = 'str'
    
for nf in nodeFeaturesList[version].strip().split():
    nodeFeatures[nf] = dict(((tFn(n), v) for (n,v) in F.item[nf].lookup.items()))

edgeFeatures['oslots'] = oslotsData
for ef in edgeFeaturesList[version].strip().split():    
    edgeFeatures[ef] = dict(((tFn(n), tFns(set(nDict.keys()))) for (n, nDict) in C.item[ef].lookup.items()))

## Last minute changes
Some features that come from the ETCBC core have obvious defects, or we need a companion feature that is just a little bit different. This is what we do:

1. `lex` contains the lexeme (in transcription) with disambiguation marks (`[/=`) appended.
   For text transformations we prefer the bare lexeme, which we store in a new feature `lex0`.
1. `language` has values `Hebrew` and `Aramaic`. We prefer ISO language codes: `hbo` and `arc` instead.
   By adding `language` for lexeme nodes we have already switched to ISO codes. Here we do the rest.
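
Both changes are small dictionary transformations. The sketch below applies them to made-up values (`toyLex` and `toyLanguage` are hypothetical, with invented node numbers and partly invented lexemes), just to show the effect of stripping the marks and of mapping the language names.

```python
# Made-up node -> lexeme values; 'XYZ/=' is invented to show multiple marks
toyLex = {1: 'B', 2: 'R>CJT/', 3: 'BR>[', 4: 'XYZ/='}

# Strip any trailing disambiguation marks to get the bare lexeme
toyLex0 = {n: v.rstrip('[/=') for (n, v) in toyLex.items()}

# Made-up node -> language values, replaced by ISO codes
toyLangMap = {'Hebrew': 'hbo', 'Aramaic': 'arc'}
toyLanguage = {1: 'Hebrew', 2: 'Aramaic'}
toyLanguage = {n: toyLangMap[v] for (n, v) in toyLanguage.items()}

print(toyLex0)
print(toyLanguage)
```

Note that `rstrip` removes every trailing character that occurs in the given set, so a lexeme ending in several marks loses all of them.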

In [37]:
lang_map = {
    'Hebrew': 'hbo',
    'Aramaic': 'arc',
}

nodeFeatures['lex0'] = {}

nfl = nodeFeatures['lex0']
lnf = nodeFeatures['language']

for (n, v) in nodeFeatures['lex'].items():
    vv = v.rstrip('[/=')
    nfl[n] = vv

for (n, v) in F.language.lookup.items():
    lnf[tFn(n)] = lang_map[v]

metaData.setdefault('lex0', {})['valueType'] = 'str'
metaData.setdefault('language', {})['valueType'] = 'str'

# Save everything

In [38]:
from tf.fabric import Fabric

In [39]:
TF = Fabric(locations='~/github/text-fabric-data-legacy', modules=ETCBC)

This is Text-Fabric 2.2.1
Api reference : https://github.com/ETCBC/text-fabric/wiki/Api
Tutorial      : https://github.com/ETCBC/text-fabric/blob/master/docs/tutorial.ipynb
Data sources  : https://github.com/ETCBC/text-fabric-data
Data docs     : https://etcbc.github.io/text-fabric-data
Shebanq docs  : https://shebanq.ancient-data.org/text
Slack team    : https://shebanq.slack.com/signup
Questions? Ask shebanq@ancient-data.org for an invite to Slack
110 features found and 0 ignored


In [40]:
TF.save(
    nodeFeatures=nodeFeatures,
    edgeFeatures=edgeFeatures,
    metaData=metaData,
)

  0.00s Exporting 105 node and 4 edge and 1 config features to /Users/dirk/github/text-fabric-data-legacy/hebrew/etcbc4b:
   |     0.05s T book                 to /Users/dirk/github/text-fabric-data-legacy/hebrew/etcbc4b
   |     0.00s T book@am              to /Users/dirk/github/text-fabric-data-legacy/hebrew/etcbc4b
   |     0.00s T book@ar              to /Users/dirk/github/text-fabric-data-legacy/hebrew/etcbc4b
   |     0.00s T book@bn              to /Users/dirk/github/text-fabric-data-legacy/hebrew/etcbc4b
   |     0.00s T book@da              to /Users/dirk/github/text-fabric-data-legacy/hebrew/etcbc4b
   |     0.00s T book@de              to /Users/dirk/github/text-fabric-data-legacy/hebrew/etcbc4b
   |     0.00s T book@el              to /Users/dirk/github/text-fabric-data-legacy/hebrew/etcbc4b
   |     0.00s T book@en              to /Users/dirk/github/text-fabric-data-legacy/hebrew/etcbc4b
   |     0.00s T book@es              to /Users/dirk/github/text-fabric-data-legacy/he