<img align="right" src="tf-small.png"/>

# From LAF-Fabric to Text-Fabric

This notebook converts the ETCBC4c dataset from LAF-Fabric to Text-Fabric.
It also adds data:

* from the ETCBC itself:
  * ketiv qere data
  * lexicon data
  * paragraph data
* from me:
  * book names in multiple languages
  * phonetic transcription: will be done in a separate notebook, completely inside Text-Fabric

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import collections,sys

In [3]:
from laf.fabric import LafFabric
from etcbc.preprocess import prep
from etcbc.lib import Transcription
from etcbc.blang import booklangs, booknames
fabric = LafFabric()

  0.00s This is LAF-Fabric 4.8.3
API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html
Feature doc: https://shebanq.ancient-data.org/static/docs/featuredoc/texts/welcome.html



# Core etcbc data

This is what came out of the live Emdros database running on the jakob server at the ETCBC, on 2016-11-04.

In [4]:
nodeFeaturesStr = '''
otype
code
det
dist
dist_unit
domain
function
g_cons
g_cons_utf8
g_lex
g_lex_utf8
g_nme
g_nme_utf8
g_pfm
g_pfm_utf8
g_prs
g_prs_utf8
g_uvf
g_uvf_utf8
g_vbe
g_vbe_utf8
g_vbs
g_vbs_utf8
g_word
g_word_utf8
gn
is_root
kind
lex
ls
mother_object_type
nme
nu
number
pdp
pfm
prs
prs_gn
prs_nu
prs_ps
ps
rela
sp
st
tab
trailer
trailer_utf8
txt
typ
uvf
vbe
vbs
vs
vt
book
chapter
label
verse
'''

edgeFeaturesStr = '''
oslots
mother
functional_parent
distributional_parent
'''

intFeatures = set('''
dist
number
tab
chapter
verse
'''.strip().split())

nodeFeatureList = nodeFeaturesStr.strip().split()
edgeFeatureList = edgeFeaturesStr.strip().split()

lfNodeFeatures = ' '.join(nodeFeatureList+('monads g_voc_lex g_voc_lex_utf8 language').strip().split())
lfEdgeFeatures = ' '.join(set(edgeFeatureList) - {'oslots'})

In [5]:
source = 'etcbc'
version = '4c'

In [6]:
API = fabric.load(source+version, '--', 'TF', {
    "xmlids": {"node": False, "edge": False},
    "features": (lfNodeFeatures, lfEdgeFeatures),
    "primary": False,
    "prepare": prep(select='L')
}, verbose='DETAIL')
exec(fabric.localnames.format(var='fabric'))

  0.00s LOADING API: please wait ... 
  0.00s DETAIL: COMPILING m: etcbc4c: UP TO DATE
  0.00s USING main: etcbc4c DATA COMPILED AT: 2016-11-09T19-16-37
  0.01s DETAIL: load main: G.node_anchor_min
  0.08s DETAIL: load main: G.node_anchor_max
  0.18s DETAIL: load main: G.node_sort
  0.24s DETAIL: load main: G.node_sort_inv
  0.63s DETAIL: load main: G.edges_from
  0.68s DETAIL: load main: G.edges_to
  0.74s DETAIL: load main: F.etcbc4_db_monads [node] 
  1.40s DETAIL: load main: F.etcbc4_db_otype [node] 
  2.00s DETAIL: load main: F.etcbc4_ft_code [node] 
  2.05s DETAIL: load main: F.etcbc4_ft_det [node] 
  2.25s DETAIL: load main: F.etcbc4_ft_dist [node] 
  2.42s DETAIL: load main: F.etcbc4_ft_dist_unit [node] 
  2.81s DETAIL: load main: F.etcbc4_ft_domain [node] 
  2.85s DETAIL: load main: F.etcbc4_ft_function [node] 
  3.07s DETAIL: load main: F.etcbc4_ft_g_cons [node] 
  3.38s DETAIL: load main: F.etcbc4_ft_g_cons_utf8 [node] 
  3.65s DETAIL: load main: F.etcbc4_ft_g_lex [node] 
  

In [7]:
source = 'etcbc'
version = '4c'
ETCBC = 'hebrew/{}{}'.format(source, version)

# Preparations

In [36]:
nodeFeatures = {}
edgeFeatures = {}
metaData = {}

## Map slot numbers
In TF the slots start at 1 and run consecutively up to `maxSlot`, without holes.
The original monad sequence in LAF-Fabric may contain holes,
so we may have to map the original monad numbers to the node numbers of the words in TF.

We create the following mappings:

* `lfFromMonad`: original monad number => LAF-Fabric node number
* `tfFromMonad`: original monad number => Text-Fabric slot number
* `tfFromLf`: LAF-Fabric node number => Text-Fabric node number (only for monads/slots)
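
As an illustration of these mappings, here is a minimal sketch on toy data. The monad and node numbers are invented, and the dictionary `toy_monads` is a hypothetical stand-in for what `F.monads.v(w)` delivers for word nodes. It deliberately contains a hole in the monad sequence.

```python
# Toy input (invented): LAF-Fabric word node => original monad number.
# Monad 3 is missing, so the monad sequence has a hole.
toy_monads = {0: 1, 1: 2, 2: 4, 3: 5}

# original monad number => LAF-Fabric node number
lf_from_monad = {m: w for (w, m) in toy_monads.items()}

tf_from_monad = {}
tf_from_lf = {}
# Walk through the monads in order and hand out consecutive slots from 1
for (i, (m, w)) in enumerate(sorted(lf_from_monad.items())):
    tf_from_monad[m] = i + 1
    tf_from_lf[w] = i + 1

print(tf_from_monad)
print(tf_from_lf)
```

In this toy case the hole at monad 3 disappears: the slots run from 1 to 4 without gaps.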

In [37]:
lfFromMonad = {}
tfFromMonad = {}
tfFromLf = {}

for w in F.otype.s('word'):
    m = int(F.monads.v(w))
    lfFromMonad[m] = w

for (i, (m, w)) in enumerate(sorted(lfFromMonad.items())):
    tfFromLf[w] = i+1
    tfFromMonad[m] = i+1

In [38]:
maxLfNode = max(lfn for lfn in NN())
maxMonadNode = max(lfFromMonad.values())
maxMonad = max(lfn for lfn in tfFromMonad.keys())
maxSlot = max(tfn for tfn in tfFromMonad.values())

## Map all nodes

After the last slot, the other nodes start. We map the remaining LAF-Fabric nodes to Text-Fabric nodes.

In [39]:
tfn = maxSlot
for lfn in range(maxMonadNode+1, maxLfNode+1):
    tfn += 1
    tfFromLf[lfn] = tfn
maxNode = tfn

I suspect that the LF monads are identical to the TF slots,
and that every TF node number is exactly 1 higher than the corresponding LF node number.
Let us check that.

In [40]:
good = True
for ln in tfFromLf:
    if ln+1 != tfFromLf[ln]:
        good = False
for m in tfFromMonad:
    if m != tfFromMonad[m]:
        good = False
print(good)

True


## Final correspondence between LF and TF
The check succeeds: monads and slots coincide, and every TF node number is the LF node number plus 1.
So we can free up some memory and replace the dictionaries by simple mapping functions.
These mapping functions must be used to translate LF nodes/monads into TF nodes/slots.

Remember that the following are already defined:

* `maxMonad` : maximum monad in LF
* `maxSlot`  : maximum slot in TF
* `maxLfNode`: maximum node in LF
* `maxNode`  : maximum node in TF

In [41]:
lfFromMonad = None
tfFromMonad = None
tfFromLf = None

tFm = lambda x: x
tFn = lambda x: x+1

def tFms(monadList): return set(monadList)
def tFns(lfNodeList): return {n+1 for n in lfNodeList}

print('''
LAF-Fabric:
  {:<12} = {:>7}
  {:<12} = {:>7}
  {:<12} = {:>7}
Text-Fabric
  {:<12} = {:>7}
  {:<12} = {:>7}
'''.format(
    'maxMonad', maxMonad,
    'maxMonadNode', maxMonadNode,
    'maxLfNode', maxLfNode, 
    'maxSlot', maxSlot, 
    'maxNode', maxNode,
))


LAF-Fabric:
  maxMonad     =  426581
  maxMonadNode =  426580
  maxLfNode    = 1436893
Text-Fabric
  maxSlot      =  426581
  maxNode      = 1436894



## oslots

Here is code to write the oslots edge information in a compact text file.
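
The `monads` feature of a non-word node holds a textual specification of the slots that the node occupies, and `setFromSpec` (imported from `tf.helpers` below) turns it into a set of integers. The sketch below shows the idea with a hypothetical re-implementation called `set_from_spec`. It assumes a format of comma-separated numbers and ranges such as `1-3,7`; it is not the library's own code, and the node number is invented.

```python
def set_from_spec(spec):
    """Hypothetical stand-in for setFromSpec: '1-3,7' => {1, 2, 3, 7}."""
    covered = set()
    for part in spec.split(','):
        bounds = part.split('-')
        if len(bounds) == 1:
            # a single number
            covered.add(int(bounds[0]))
        else:
            # a range, both ends included
            (start, end) = (int(bounds[0]), int(bounds[1]))
            covered |= set(range(start, end + 1))
    return covered

# A toy node (invented number) that occupies slots 1-3 and 7
toy_oslots = {426582: set_from_spec('1-3,7')}
print(toy_oslots)
```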

In [42]:
from tf.helpers import *

In [43]:
oslotsData = {}

for n in range(maxMonadNode+1, maxLfNode+1):
    oslotsData[tFn(n)] = tFms(setFromSpec(F.monads.v(n)))

# Additional data

## Book names international

For each language in which book names have been translated, we add a feature `book@ll` where 
`ll` is the two-letter language code.
The feature gives for each book node the name of that book in that language.
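
A minimal sketch of how such per-language features can be assembled. The data is a toy stand-in: `toy_names` and the book node numbers in `toy_books` are invented for illustration, whereas the real names come from `booknames` in the cell below.

```python
# Invented miniature of the book name data:
# language code => names, in the order of the books
toy_names = {
    'en': ['Genesis', 'Exodus'],
    'fr': ['Genèse', 'Exode'],
}
toy_books = [1001, 1002]  # invented book node numbers, in book order

toy_features = {}
for (code, names_in_lang) in sorted(toy_names.items()):
    # one feature per language: book node => name in that language
    toy_features['book@{}'.format(code)] = dict(zip(toy_books, names_in_lang))

print(toy_features['book@fr'][1002])
```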

In [44]:
langs = booklangs['Hebrew']
names = booknames['Hebrew']
books = [b for b in F.otype.s('book')]
bookIndex = dict(((b,i) for (i,b) in enumerate(books)))

textConfig = {
    'sectionTypes':              'book,chapter,verse',
    'sectionFeatures':           'book,chapter,verse',
    'fmt:text-orig-full':        '{qere_utf8/g_word_utf8}{qere_trailer_utf8/trailer_utf8}',
    'fmt:text-orig-full-ketiv':  '{g_word_utf8}{trailer_utf8}',
    'fmt:text-orig-plain':       '{g_cons_utf8}{trailer_utf8}',
    'fmt:text-trans-full':       '{qere/g_word}{qere_trailer/trailer}',
    'fmt:text-trans-full-ketiv': '{g_word}{trailer}',
    'fmt:text-trans-plain':      '{g_cons}{trailer}',
    'fmt:lex-orig-full':         '{g_lex_utf8} ',
    'fmt:lex-orig-plain':        '{lex_utf8} ',
    'fmt:lex-trans-full':        '{g_lex} ',
    'fmt:lex-trans-plain':       '{lex0} ',
}

metaData.update(dict(otext=textConfig))

for (code, (langEng, langOwn)) in sorted(booklangs['Hebrew'].items()):
    fName = 'book@{}'.format(code)
    metaData[fName] = dict(
        valueType = 'str',
        source='blang.py in LAF-Fabric',
        languageCode=code,
        languageEnglish=langEng,
        language=langOwn
    )
    nodeFeatures[fName] = dict(((tFn(b), names[code][bookIndex[b]]) for b in books))
print('{} book name languages'.format(len(nodeFeatures)))

26 book name languages


## Ketiv/qere

In [45]:
inf("Making mappings between verse labels in KQ and verse nodes in LAF")
vlab2vnode = {}
for vs in F.otype.s('verse'):
    lab = F.label.v(vs)
    vlab2vnode[lab] = vs
inf("{} verses".format(len(vlab2vnode)))

20m 41s Making mappings between verse labels in KQ and verse nodes in LAF
20m 42s 23213 verses


In [46]:
def get_kq(kq_file):
    inf("Reading Ketiv-Qere data")

    info = collections.defaultdict(lambda: [])
    not_found = set()
    missing = collections.defaultdict(lambda: [])
    missed = collections.defaultdict(lambda: [])

    error_limit = 10

    kq_handle = open(kq_file)

    ln = 0
    can = 0
    cur_label = None
    for line in kq_handle:
        ln += 1
        can += 1
        vlab = line[0:10]
        fields = line.rstrip('\n')[10:].split()
        (ketiv, qere) = fields[0:2]
        (qtrim, qtrailer) = Transcription.suffix_and_finales(qere)
        vnode = vlab2vnode.get(vlab, None)
        if vnode is None:
            not_found.add(vlab)
            continue
        info[vnode].append((ketiv, qtrim, qtrailer))        
    kq_handle.close()
    inf("Read {} ketiv-qere annotations".format(ln))

    data = []
    for vnode in info:
        wlookup = collections.defaultdict(lambda: [])
        wvisited = collections.defaultdict(lambda: -1)
        wnodes = L.d('word', vnode)
        for w in wnodes:
            gw = F.g_word.v(w)
            if '*' in gw:
                gw = F.g_cons.v(w)
                if gw == '': gw = '.'
                if F.trailer_utf8.v(w) == '': gw += '-'
                wlookup[gw].append(w)
        for (ketiv, qere, qtrailer) in info[vnode]:
            wvisited[ketiv] += 1
            windex = wvisited[ketiv]
            ws = wlookup.get(ketiv, None)
            if ws is None or windex > len(ws) - 1:
                missing[vnode].append((windex, ketiv, qere))
                continue
            w = ws[windex]
            qere_u = Transcription.to_hebrew(qere)
            qtrailer_u = Transcription.to_hebrew(qtrailer)
            data.append((
                w,
                ketiv,
                qere,
                qtrailer.replace('\n', ''),
                qere_u,
                qtrailer_u.replace('\n', ''),
            ))
        for ketiv in wlookup:
            if ketiv not in wvisited or len(wlookup[ketiv]) - 1 > wvisited[ketiv]:
                missed[vnode].append((len(wlookup[ketiv]) - (wvisited.get(ketiv, -1) + 1), ketiv))
    inf("Parsed {} ketiv-qere annotations".format(len(data)))

    if not_found:
        msg("Could not find {} verses: {}".format(len(not_found), sorted(not_found)))
    else:
        inf("All verses entries found in index")
    if missing:
        msg("Could not locate ketivs in the text: {} verses".format(len(missing)))
        e = 0
        for vnode in sorted(missing):
            if e > error_limit: break
            vlab = F.label.v(vnode)
            for (windex, ketiv, qere) in missing[vnode]:
                e += 1
                if e > error_limit: break
                print('NOT IN TEXT: {:<10} {:<20} #{} {}'.format(vlab, ketiv, windex, qere))
    else:
        inf("All ketivs found in the text")
    if missed:
        msg("Could not lookup qeres in the data: {} verses".format(len(missed)))
        e = 0
        for vnode in sorted(missed):
            if e > error_limit: break
            vlab = F.label.v(vnode)
            for (windex, ketiv) in missed[vnode]:
                e += 1
                if e > error_limit: break
                print('NOT IN DATA: {:<10} {:<20} #{}'.format(vlab, ketiv, windex))
    else:
        inf("All ketivs found in the data")
    return [(tFn(x[0]), x[2], x[3], x[4], x[5]) for x in data]

In [47]:
base_dir = '/Users/dirk/laf/laf-fabric-data/'
kq_base = '{}/{}/{}.{}{}'.format(base_dir, 'kq', 'kq', source, version)
kq = get_kq(kq_base)

20m 45s Reading Ketiv-Qere data
20m 45s Read 1892 ketiv-qere annotations
20m 45s Parsed 1892 ketiv-qere annotations
20m 45s All verses entries found in index
20m 45s All ketivs found in the text
20m 45s All ketivs found in the data


In [48]:
nodeFeatures.update(dict(
    qere=dict(((x[0], x[1]) for x in kq)),
    qere_trailer=dict(((x[0], x[2]) for x in kq)),
    qere_utf8=dict(((x[0], x[3]) for x in kq)),
    qere_trailer_utf8=dict(((x[0], x[4]) for x in kq))
))
for ft in 'qere qere_trailer qere_utf8 qere_trailer_utf8'.strip().split():
    metaData.setdefault(ft, {})['valueType'] = 'str'

In [49]:
w1 = 105846
print('"{}" "{}" "{}"'.format(F.g_word.v(w1), F.g_cons.v(w1), nodeFeatures['qere'][w1]))

"*N<R" "N<R" "HA45"


## Lexicon
We add lexical data.
The lexical data will not be added as features of words, but as features of lexemes.
The lexemes will be added as fresh nodes, of a new type `lex`.
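
To make the idea of fresh lexeme nodes concrete, here is a small sketch on invented data. The word nodes, lexemes and the value of `toy_max_node` are assumptions for illustration only. Note that the same lexeme string in two languages yields two distinct lexeme nodes, because lexemes are identified by the pair (language, lexeme).

```python
# Invented toy data: TF word node => (language, lexeme)
toy_words = {
    1: ('hbo', 'B'),
    2: ('hbo', 'R>CJT/'),
    3: ('hbo', 'B'),
    4: ('arc', 'B'),
}
toy_max_node = 10  # pretend that 10 is the highest node number so far

lex_node = toy_max_node
node_from_lex = {}
lex_occs = {}
for (w, lex_id) in sorted(toy_words.items()):
    lex_occs.setdefault(lex_id, []).append(w)
    if lex_id not in node_from_lex:
        # first time we see this lexeme: give it the next free node number
        lex_node += 1
        node_from_lex[lex_id] = lex_node

# Each lexeme node is linked to the slots of its occurrences
toy_lex_oslots = {node_from_lex[lex_id]: occs for (lex_id, occs) in lex_occs.items()}
print(node_from_lex)
print(toy_lex_oslots)
```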

In [50]:
lang_map = {
    'Hebrew': 'hbo',
    'Aramaic': 'arc',
}

lexNode = maxNode
lexOccs = {}
nodeFromLex = {}
lexFromNode = {}
otypeData = {}
for n in F.otype.s('word'):
    lex = F.lex.v(n)
    lan = lang_map[F.language.v(n)]
    lex_id = (lan, lex)
    lexOccs.setdefault(lex_id, []).append(tFn(n))
    if lex_id not in nodeFromLex:
        lexNode += 1
        nodeFromLex[lex_id] = lexNode
        lexFromNode[lexNode] = lex_id
print('added {} lexemes\nmaxNode is now {}'.format(len(nodeFromLex), lexNode)) 
for n in range(maxNode+1, lexNode+1):
    oslotsData[n] = lexOccs[lexFromNode[n]]
    otypeData[n] = 'lex'

added 9236 lexemes
maxNode is now 1446130


### Lexical features
We add extra features, based on the lexicon.
They are stored on the nodes of the extra otype `lex` created above, which hold the lexemes.
These nodes are linked to all the word nodes that contain occurrences of them.

In [51]:
langs = {'hbo', 'arc'}
lex_base = dict((lan, '{}/{}/{}.{}{}'.format(base_dir, 'lexicon', lan, source, version)) for lan in langs)

def read_lex(lan):
    lex_infile = open(lex_base[lan], encoding='utf-8')
    lex_outfile = outfile('{}.txt'.format(lan))
    lex_errfile = outfile('{}.err.txt'.format(lan))

    lex_items = {}
    ln = 0
    e = 0
    for line in lex_infile:
        ln += 1
        line = line.rstrip()
        line = line.split('#')[0]
        if line == '': continue
        (entry, featurestr) = line.split(sep=None, maxsplit=1)
        entry = entry.strip('"')
        if entry in lex_items:
            lex_errfile.write('duplicate lexical entry {} in line {}.\n'.format(entry, ln))
            e += 1
            continue
        featurestr = featurestr.strip(':')
        featurestr = featurestr.replace('\\:', chr(254))
        featurelst = featurestr.split(':')
        features = {}
        for feature in featurelst:
            comps = feature.split('=', maxsplit=1)
            if len(comps) == 1:
                if feature.strip().isnumeric():
                    comps = ('_n', feature.strip())
                else:
                    lex_errfile.write('feature without value for lexical entry {} in line {}: {}\n'.format(
                            entry, ln, feature,
                    ))
                    e += 1
                    continue
            (key, value) = comps
            value = value.replace(chr(254), ':')
            if key in features:
                lex_errfile.write('duplicate feature for lexical entry {} in line {}: {}={}\n'.format(
                        entry, ln, key, value,
                ))
                e += 1
                continue
            features[key] = value.replace('\\', '/')
        if 'sp' in features and features['sp'] == 'verb':
            if 'gl' in features:
                gloss = features['gl']
                if gloss.startswith('to '):
                    features['gl'] = gloss[3:]
        lex_items[entry] = features
        lex_outfile.write('{}\t{}\n'.format(entry, features))
        
    lex_infile.close()
    lex_outfile.close()
    lex_errfile.close()
    msgstr = "Lexicon {}: there w".format(lan) + ('ere {} errors'.format(e) if e != 1 else 'as 1 error')
    print(msgstr)
    return lex_items

inf("Reading lexicon ...")
lex_entries = dict((lan, read_lex(lan)) for lan in sorted(langs))
for lan in sorted(lex_entries):
    print('Lexicon {} has {:>5} entries'.format(lan, len(lex_entries[lan])))
inf("Done")

20m 58s Reading lexicon ...
Lexicon arc: there were 0 errors
Lexicon hbo: there were 0 errors
Lexicon arc has   708 entries
Lexicon hbo has  8528 entries
20m 59s Done


We inspect all word occurrences of the etcbc4 database, inspect their language and lexeme values, and construct sets of lexemes that belong to each of the two languages, ``hbo`` and ``arc``.

In [52]:
lex_text = collections.defaultdict(lambda: collections.defaultdict(lambda: collections.defaultdict(lambda: set())))
do_value_compare = {'sp', 'ls', 'gn', 'ps', 'nu', 'st'}
text_value_set = collections.defaultdict(lambda: set())
node_lex = {}

inf("Reading ETCBC database {}{} ...".format(source, version))
text_langs = set()
for n in F.otype.s('word'):
    lan = lang_map[F.language.v(n)]
    text_langs.add(lan)
    lex = F.lex.v(n)
    node_lex[n] = (lan,lex)
    lex_text[lan][lex]['sp'].add(F.sp.v(n))
    lex_text[lan][lex]['ls'].add(F.ls.v(n))
    lex_text[lan][lex]['gn'].add(F.gn.v(n))
    lex_text[lan][lex]['nu'].add(F.nu.v(n))
    lex_text[lan][lex]['ps'].add(F.ps.v(n))
    lex_text[lan][lex]['vc'].add(F.g_voc_lex.v(n))
    for p in do_value_compare:
        text_value_set[p].add(F.item[p].v(n))        

tf = outfile('text_lexemes.txt')
for lan in sorted(lex_text):
    for lex in sorted(lex_text[lan]):
        tf.write('{} "{}"\n'.format(lan, lex))
tf.close()
inf("Done")
for lan in sorted(lex_text):
    print('Language {} has {:>5} lexemes in the {}{} text'.format(lan, len(lex_text[lan]), source, version))

21m 02s Reading ETCBC database etcbc4c ...
21m 09s Done
Language arc has   708 lexemes in the etcbc4c text
Language hbo has  8528 lexemes in the etcbc4c text


Let us now check whether all lexemes in the text occur in the lexicon and vice versa.
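
Such a two-way check boils down to set differences per language. A minimal sketch with invented lexeme sets (the names `toy_lexicon`, `toy_text` and `toy_diffs` are hypothetical):

```python
# Invented lexeme sets per language
toy_lexicon = {'hbo': {'B', 'BR>[', '>LHJM/'}, 'arc': {'B', 'MLK/'}}
toy_text = {'hbo': {'B', 'BR>['}, 'arc': {'B', 'MLK/', 'CMJN/'}}

toy_diffs = {}
for lan in sorted(toy_text):
    toy_diffs[lan] = (
        sorted(toy_text[lan] - toy_lexicon[lan]),  # in text, not in lexicon
        sorted(toy_lexicon[lan] - toy_text[lan]),  # in lexicon, not in text
    )
    print(lan, toy_diffs[lan])
```

Both differences are empty for a language exactly when text and lexicon contain the same lexemes for that language.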

In [53]:
mql_lan = dict(hbo='Hebrew', arc='Aramaic')

arc_lex = set(lex_entries['arc'])
hbo_lex = set(lex_entries['hbo'])

arc_text = set(lex_text['arc'])
hbo_text = set(lex_text['hbo'])

hbo_and_arc_text = arc_text & hbo_text
hbo_and_arc_lex = arc_lex & hbo_lex

lex_min_text = hbo_and_arc_lex - hbo_and_arc_text
text_min_lex = hbo_and_arc_text - hbo_and_arc_lex


print('The intersection of hbo and arc in the etcbc4 text contains {} lexemes'.format(len(hbo_and_arc_text)))
print('The intersection of hbo and arc in the lexicon     contains {} lexemes'.format(len(hbo_and_arc_lex)))
print("Lexemes in the lexical intersection of hbo and arc but not in the textual intersection: {}x: {}".format(
    len(lex_min_text), lex_min_text)
)
print("Lexemes in the textual intersection of hbo and arc but not in the lexical intersection: {}x: {}".format(
    len(text_min_lex), text_min_lex)
)

The intersection of hbo and arc in the etcbc4 text contains 460 lexemes
The intersection of hbo and arc in the lexicon     contains 460 lexemes
Lexemes in the lexical intersection of hbo and arc but not in the textual intersection: 0x: set()
Lexemes in the textual intersection of hbo and arc but not in the lexical intersection: 0x: set()


In [54]:
arc_text_min_lex = arc_text - arc_lex
arc_lex_min_text = arc_lex - arc_text

hbo_text_min_lex = hbo_text - hbo_lex
hbo_lex_min_text = hbo_lex - hbo_text

for (myset, mymsg) in (
    (arc_text_min_lex, 'arc: lexemes in text but not in lexicon'),
    (arc_lex_min_text, 'arc: lexemes in lexicon but not in text'),
    (hbo_text_min_lex, 'hbo: lexemes in text but not in lexicon'),
    (hbo_lex_min_text, 'hbo: lexemes in lexicon but not in text'),
):
    print('{}: {}x{}'.format(mymsg, len(myset), '' if not myset else '\n\t{}'.format(', '.join(sorted(myset)))))

arc: lexemes in text but not in lexicon: 0x
arc: lexemes in lexicon but not in text: 0x
hbo: lexemes in text but not in lexicon: 0x
hbo: lexemes in lexicon but not in text: 0x


### Vocalized lexeme

The lexicon file provides an attribute `vc` for each lexeme, which is the vocalized lexeme.
The ETCBC core data also has features `g_voc_lex` and `g_voc_lex_utf8` for each occurrence.

We investigate whether the latter features are *consistent*, i.e. a property of the lexeme and lexeme only.
If they are somehow dependent on the word occurrence, they are not consistent.
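
The consistency test can be sketched as follows: collect the set of values per lexeme and flag the lexemes that have more than one value. The occurrences and their vocalized forms below are invented for illustration.

```python
import collections

# Invented occurrences: (lexeme, vocalized lexeme as found on the occurrence)
toy_occs = [
    ('BR>[', 'BR@>'),
    ('BR>[', 'BR@>'),
    ('MLK/', 'MELEK:'),
    ('MLK/', 'MAL:K.@'),
]

values_per_lex = collections.defaultdict(set)
for (lex, voc) in toy_occs:
    values_per_lex[lex].add(voc)

# A lexeme is inconsistent if its occurrences disagree on the value
inconsistent = sorted(lex for (lex, vals) in values_per_lex.items() if len(vals) > 1)
print(inconsistent)
```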

In [55]:
consistent_props = {'vc'}

exceptions = collections.defaultdict(lambda: collections.defaultdict(lambda: set()))

incons = outfile('inconsistent.csv')
for lan in sorted(lex_text):
    lexemes = lex_text[lan]
    for lexeme in sorted(lexemes):
        properties = lexemes[lexeme]
        for prop in consistent_props:
            if prop in properties:
                values = properties[prop]
                if len(values) > 1:
                    exceptions[lan][prop].add(lexeme)
                    incons.write('"{}";"{}";"{}";{};"{}"\n'.format(lan, prop, lexeme, len(values), '";"'.join(values)))
incons.close()
for lan in sorted(text_langs):
    print(lan)
    for prop in sorted(consistent_props):
        extra = ''
        print("{}{:<8}: {:>4} inconsistent lexemes{}".format(
            ' ' * 4, prop, len(exceptions.get(lan, {}).get(prop, set())), extra,
        ))        

arc
    vc      :    0 inconsistent lexemes
hbo
    vc      :    0 inconsistent lexemes


So we can omit the occurrence-based features `g_voc_lex` and `g_voc_lex_utf8`
and replace them by lexeme-based features.
We will add the lexeme-based features `voc` and `voc_utf8` (see the field specification in `lex_fields` below).

In [56]:
g_voc_lex = {}
g_voc_lex_utf8 = {}

for w in F.otype.s('word'):
    lan = lang_map[F.language.v(w)]
    lex = F.lex.v(w)
    voc = F.g_voc_lex.v(w)
    voc_utf8 = F.g_voc_lex_utf8.v(w)
    g_voc_lex[(lan, lex)] = voc
    g_voc_lex_utf8[(lan, lex)] = voc_utf8

We are going to compute the features
``freq_lex``, ``rank_lex``, ``freq_occ``, ``rank_occ``.
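
The frequency is a plain count per language. The rank, as computed in the cell below, starts at 0, gives equal frequencies the same rank, and skips as many ranks after a tie as there were tied items. The following sketch mirrors that ranking logic on invented counts.

```python
import collections

# Invented frequencies
toy_freqs = collections.Counter({'a': 5, 'b': 3, 'c': 3, 'd': 1})

toy_ranks = {}
rank = -1
prev_n = -1
amount = 1  # number of items that share the current rank
# Most frequent first; ties are ordered by the item itself
for (x, n) in sorted(toy_freqs.items(), key=lambda y: (-y[1], y[0])):
    if n == prev_n:
        amount += 1
    else:
        rank += amount
        amount = 1
    prev_n = n
    toy_ranks[x] = rank

print(toy_ranks)
```

Here `b` and `c` share rank 1, so `d` gets rank 3.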

In [57]:
inf('Computing statistics')
wstats = {
    'freqs': {
        'lex': collections.defaultdict(lambda: collections.Counter()),
        'occ': collections.defaultdict(lambda: collections.Counter()),
    },
    'ranks': {
        'lex': collections.defaultdict(lambda: {}),
        'occ': collections.defaultdict(lambda: {}),
    },
}
langs = set()

for w in F.otype.s('word'):
    lan = lang_map[F.language.v(w)]
    occ = F.g_cons.v(w)
    lex = F.lex.v(w)
    wstats['freqs']['lex'][lan][lex] += 1
    wstats['freqs']['occ'][lan][occ] += 1
    langs.add(lan)
for lan in langs:
    for tp in ['lex', 'occ']:
        rank = -1
        prev_n = -1
        amount = 1
        for (x, n) in sorted(wstats['freqs'][tp][lan].items(), key=lambda y: (-y[1], y[0])):
            if n == prev_n:
                amount += 1
            else:
                rank += amount
                amount = 1
            prev_n = n
            wstats['ranks'][tp][lan][x] = rank
inf('Done')

21m 29s Computing statistics
21m 32s Done


# Composing a lexical data file

The specification in ``lex_fields`` below lists the lexicon fields in the intended order.
It contains instructions on how to construct the field values from the lexical information obtained from the lexicon files.

    (source, method, name, transformation table)

## source 
May contain one of the following:

* the name of a lexical feature as shown in the lexicon files, such as ``sp``, ``vc``.
* None. 
  In this case, **method** is a code that triggers special actions, such as getting an id or something that is available to the program that fills the lexicon table.
* the name of another field as shown in the **name** part of the specification. 
  In this case, **method** must be a function, defined elsewhere, that takes the value of that other field as argument. 
  The function is typically a transliteration, or a stripping action.

## method
May contain one of the following:

* a code (string), indicating:
    * ``lex``: take the value of a feature (indicated in **source**) for this entry from the lexicon file
    * ``entry``: take the value of the entry itself as found in the lexicon file
    * ``id``: take the id for this entry as generated by the program
    * ``lan``: take the language of this entry
* a function, called in the code below with the language and the entry as arguments
    * *get_voc*: get the earlier compiled `g_voc_lex` value
    * *get_voc_utf8*: get the earlier compiled `g_voc_lex_utf8` value

## name
The name of the field in the annotation file that is to be constructed.
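
A minimal sketch of how such a specification can drive the computation of field values. The miniature `toy_fields`, the function `toy_compute`, the field `bare` and the sample entry with its gloss are all hypothetical; the transformation table is left out for brevity.

```python
# Hypothetical miniature of the specification: (source, method, name)
toy_fields = (
    (None, 'lan', 'language'),
    (None, 'entry', 'lex'),
    ('gl', 'lex', 'gloss'),
    (None, lambda lan, entry: entry.rstrip('[/='), 'bare'),
)

def toy_compute(lan, entry, lexfeats):
    """Compute the field values of one lexicon entry, in specification order."""
    values = []
    for (source, method, name) in toy_fields:
        if method == 'lan':
            val = lan
        elif method == 'entry':
            val = entry
        elif method == 'lex':
            # look up the lexical feature named in source
            val = lexfeats.get(source, '')
        else:
            # method is a function of language and entry
            val = method(lan, entry)
        values.append(val)
    return tuple(values)

print(toy_compute('hbo', 'BR>[', {'gl': 'create'}))
```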

In [58]:
def get_voc(lan, lex): return g_voc_lex[(lan, lex)]
def get_voc_utf8(lan, lex): return g_voc_lex_utf8[(lan, lex)]

lex_fields = (
    (None, 'lan', 'language', None),
    (None, 'entry', 'lex', None),
    (None, get_voc, 'voc', None),
    (None, get_voc_utf8, 'voc_utf8', None),
    ('rt', 'lex', 'root', None),
    ('sp', 'lex', 'sp', None),
    ('sm', 'lex', 'nametype', None),
    ('ls', 'lex', 'ls', None),
    ('gl', 'lex', 'gloss', None),
)

cur_lex_values = {}

def compute_fields(lan, entry, lexfeats):
    cur_lex_values.clear()
    return tuple(compute_field(lan, entry, lexfeats, f) for f in lex_fields)

def compute_field(lan, entry, lexfeats, f):
    (source, method, name, transform) = f
    val = None
    if method == 'lan': val = lan
    elif method == 'entry': val = entry
    elif method =='lex':
        val = lexfeats.get(f[0], '')
    else: val = method(lan, entry)
    cur_lex_values[f[2]] = val
    return val

lex_index = {}
for lan in sorted(lex_entries):
    for entry in sorted(lex_entries[lan]):
        entry_info = compute_fields(lan, entry, lex_entries[lan][entry])
        lex_index[(lan, entry)] = entry_info

In [59]:
lexFeatures = dict(((f[2], {}) for f in lex_fields))
for ft in lexFeatures:
    metaData.setdefault(ft, {})['valueType'] = 'str'
for ft in ('freq_lex', 'rank_lex'):
    lexFeatures[ft] = {}
    metaData.setdefault(ft, {})['valueType'] = 'int'

for (lan, lex) in lex_index:
    tfln = nodeFromLex[(lan, lex)]
    lexInfo = lex_index[(lan, lex)]
    for (i, f) in enumerate(lex_fields):
        lexFeatures[f[2]][tfln] = lexInfo[i]
    lexFeatures['freq_lex'][tfln] = str(wstats['freqs']['lex'][lan][lex])
    lexFeatures['rank_lex'][tfln] = str(wstats['ranks']['lex'][lan][lex])

occFeatures = {}
for ft in ('freq_occ', 'rank_occ'):
    occFeatures[ft] = {}
    metaData.setdefault(ft, {})['valueType'] = 'int'

for w in F.otype.s('word'):
    lan = lang_map[F.language.v(w)]
    occ = F.g_cons.v(w)
    occFeatures['freq_occ'][tFn(w)] = str(wstats['freqs']['occ'][lan][occ])
    occFeatures['rank_occ'][tFn(w)] = str(wstats['ranks']['occ'][lan][occ])

nodeFeatures.update(lexFeatures)
nodeFeatures.update(occFeatures)

# Core node and edge features

In [60]:
metaData.update({'': dict(source='ETCBC4c via LAF-Fabric')})
for nf in nodeFeatureList:
    metaData.setdefault(nf, {})['valueType'] = 'int' if nf in intFeatures else 'str'
for ef in edgeFeatureList:
    metaData.setdefault(ef, {})['valueType'] = 'str'
    
print('maxNode={}'.format(lexNode))

for nf in nodeFeatureList:
    if nf == 'otype':
        data = dict(((
            tFn(n),
            F.otype.lookup[n],
        ) for n in range(maxLfNode+1)))
        data.update(dict(((
            n,
            otypeData[n],
        ) for n in range(maxNode+1, lexNode+1))))
    else:
        data = dict(((tFn(n), v) for (n,v) in F.item[nf].lookup.items()))
    nodeFeatures.setdefault(nf, {}).update(data)

maxNode=1446130


## Last minute changes
Some features that come from the ETCBC core have obvious defects, or we need a feature next to them that is just a little bit different. This is what we do:

1. `lex` contains the lexeme (in transcription) with disambiguation marks (`[/=`) appended.
   For text transformations we prefer the bare lexeme, which we store in `lex0`.
1. `lex_utf8` has frills at the end of many values. Probably they have arisen by transforming the lexeme plus
   disambiguation marks into unicode. We overwrite this feature with the transform of the bare lexeme.
1. `language` has values `Hebrew` and `Aramaic`. We prefer ISO language codes: `hbo` and `arc` instead.
   By adding `language` for lexeme nodes we already have switched to ISO codes. Here we do the rest.
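
The first and third change can be sketched on invented values. The node numbers and the sample lexemes below are assumptions for illustration, and the conversion to Hebrew script is omitted because it depends on `Transcription`.

```python
# Invented toy values: node => lexeme with disambiguation marks
toy_lex = {1: 'BR>[', 2: 'R>CJT/', 3: 'B', 4: '<FH=['}
# Invented toy values: node => language name as in the ETCBC core
toy_language = {1: 'Hebrew', 2: 'Hebrew', 3: 'Aramaic', 4: 'Hebrew'}

toy_lang_map = {'Hebrew': 'hbo', 'Aramaic': 'arc'}

# rstrip removes any trailing run of the characters [ / =
toy_lex0 = {n: v.rstrip('[/=') for (n, v) in toy_lex.items()}
toy_iso = {n: toy_lang_map[v] for (n, v) in toy_language.items()}

print(toy_lex0)
print(toy_iso)
```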

In [61]:
nodeFeatures['lex0'] = {}
nodeFeatures['lex_utf8'] = {}
# node feature language already exists: for lexemes

nfl = nodeFeatures['lex0']
nfu = nodeFeatures['lex_utf8']
lnf = nodeFeatures['language']

for (n, v) in nodeFeatures['lex'].items():
    vv = v.rstrip('[/=')
    nfl[n] = vv
    nfu[n] = Transcription.to_hebrew(vv)

for (n, v) in F.language.lookup.items():
    lnf[tFn(n)] = lang_map[v]

metaData.setdefault('lex0', {})['valueType'] = 'str'
metaData.setdefault('lex_utf8', {})['valueType'] = 'str'
metaData.setdefault('language', {})['valueType'] = 'str'

for ef in edgeFeatureList:    
    if ef == 'oslots':
        data = oslotsData
    else:
        data = dict(((tFn(n), tFns(set(nDict.keys()))) for (n, nDict) in C.item[ef].lookup.items()))
    edgeFeatures.setdefault(ef, {}).update(data)

# Save everything

In [62]:
from tf.fabric import Fabric

In [63]:
TF = Fabric(modules=ETCBC)

This is Text-Fabric 1.2.4
Api reference : https://github.com/ETCBC/text-fabric/wiki/Api
Tutorial      : https://github.com/ETCBC/text-fabric/blob/master/docs/tutorial.ipynb
Data sources  : https://github.com/ETCBC/text-fabric-data
Data docs     : https://etcbc.github.io/text-fabric-data/
Shebanq docs  : https://shebanq.ancient-data.org/static/docs/featuredoc/texts/welcome.html
Questions? Ask shebanq@ancient-data.org for an invite to Slack
106 features found and 0 ignored


In [64]:
TF.save(
    nodeFeatures=nodeFeatures,
    edgeFeatures=edgeFeatures,
    metaData=metaData,
)

  0.00s Exporting 100 node and 4 edge and 1 config features to /Users/dirk/github/text-fabric-data/hebrew/etcbc4c:
   |     0.08s T book                 to /Users/dirk/github/text-fabric-data/hebrew/etcbc4c
   |     0.00s T book@am              to /Users/dirk/github/text-fabric-data/hebrew/etcbc4c
   |     0.00s T book@ar              to /Users/dirk/github/text-fabric-data/hebrew/etcbc4c
   |     0.00s T book@bn              to /Users/dirk/github/text-fabric-data/hebrew/etcbc4c
   |     0.00s T book@da              to /Users/dirk/github/text-fabric-data/hebrew/etcbc4c
   |     0.00s T book@de              to /Users/dirk/github/text-fabric-data/hebrew/etcbc4c
   |     0.00s T book@el              to /Users/dirk/github/text-fabric-data/hebrew/etcbc4c
   |     0.00s T book@en              to /Users/dirk/github/text-fabric-data/hebrew/etcbc4c
   |     0.00s T book@es              to /Users/dirk/github/text-fabric-data/hebrew/etcbc4c
   |     0.00s T book@fa              to /Users/dirk/gith