<img align="right" src="tf-small.png"/>

# ETCBC nodes

In this notebook we try to map the **non-slot** nodes between versions 4, 4b and 4c of the ETCBC dataset.
Another notebook has produced a mapping between the **slots** of those versions, and we want to extend that
mapping to nodes in general.

If we succeed, text-fabric notebooks that are based on an older version of the data can be used unmodified on newer versions of the data.

In general, node mappings between versions cannot be perfect. We try it and see how far we get.

# Basic idea

We start out with a very simple idea: nodes are linked to slots. In order to map a node in version x, we look at its slots in version x, map those to slots in version y, and see which nodes in version y are linked to those slots.
They are good candidates for the mapping.
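
As a minimal sketch of this idea, independent of Text-Fabric, the toy example below uses plain dictionaries. All names and data in it (`slots_x`, `slot_map`, `slots_y`, `candidates`) are hypothetical stand-ins for what the real dataset provides.

```python
# Toy data, invented for illustration only.
# Node -> slots it is linked to in version x.
slots_x = {'phrase1': [1, 2], 'phrase2': [3, 4, 5]}
# Slot in version x -> corresponding slots in version y.
slot_map = {1: [1], 2: [2], 3: [3], 4: [4, 5], 5: [6]}
# Node -> slots it is linked to in version y.
slots_y = {'phraseA': [1, 2], 'phraseB': [3, 4], 'phraseC': [5, 6]}


def candidates(node):
    """Return the version-y nodes that share a slot with the image of `node`."""
    # Map every slot of the node to version y and pool the results.
    mapped = {t for s in slots_x[node] for t in slot_map.get(s, [])}
    # Keep the version-y nodes that contain at least one mapped slot.
    return {q for (q, qslots) in slots_y.items() if mapped & set(qslots)}


result = {p: sorted(candidates(p)) for p in slots_x}
print(result)
```

Here `phrase1` has a single candidate, while `phrase2` has two, because its slots end up in two different nodes of version y.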

In [1]:
import os
import collections
from functools import reduce
from tf.fabric import Fabric

In [2]:
locations = {
    '4': '~/github/text-fabric-data-legacy',
    '4b': '~/github/text-fabric-data-legacy',
    '4c': '~/github/text-fabric-data', 
}
versions = ['4', '4b', '4c']
TF = {}
api = {}
for v in versions:
    TF[v] = Fabric(locations=locations[v], modules=f'hebrew/etcbc{v}')
    api[v] = TF[v].load('''
        g_word lex
    ''')
A4 = api['4']
A4b = api['4b']
A4c = api['4c']

This is Text-Fabric 2.3.1
Api reference : https://github.com/ETCBC/text-fabric/wiki/Api
Tutorial      : https://github.com/ETCBC/text-fabric/blob/master/docs/tutorial.ipynb
Data sources  : https://github.com/ETCBC/text-fabric-data
Data docs     : https://etcbc.github.io/text-fabric-data
Shebanq docs  : https://shebanq.ancient-data.org/text
Slack team    : https://shebanq.slack.com/signup
Questions? Ask shebanq@ancient-data.org for an invite to Slack
110 features found and 0 ignored
  0.00s loading features ...
   |     0.15s B g_word               from /Users/dirk/github/text-fabric-data-legacy/hebrew/etcbc4
   |     0.13s B lex                  from /Users/dirk/github/text-fabric-data-legacy/hebrew/etcbc4
   |     0.00s Feature overview: 105 nodes; 4 edges; 1 configs; 7 computeds
  4.99s All features loaded/computed - for details use loadLog()

# Exploration

Let us see what has happened to the phrases between 4 and 4b.

We load the slot mappings 4 => 4b and 4b => 4c. Each mapping is loaded through the API of its target version.

In [20]:
TF['4b'].load('omap@4-4b', add=True)
TF['4c'].load('omap@4b-4c', add=True)

  0.00s loading features ...
  0.01s All additional features loaded - for details use loadLog()
  0.00s loading features ...
   |     1.00s B omap@4b-4c           from /Users/dirk/github/text-fabric-data/hebrew/etcbc4c
  1.78s All additional features loaded - for details use loadLog()


We compute a mapping from 4-nodes of a given type to 4b-nodes of the same type.

A 4-node $p$ is mapped to all 4b-nodes $q$ of the same type that have a slot in the 4 => 4b image of the slots of $p$.

We compute the mappings and store them,
after which we can explore them by counting how many target nodes each source node maps to.
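
The counting step can be sketched on its own with a small, invented mapping. The name `toy_mapping` and its contents are assumptions for illustration, not data from the ETCBC versions.

```python
import collections

# Hypothetical mapping: source node -> set of target nodes.
toy_mapping = {10: {100}, 11: {101}, 12: {102, 103}, 13: set()}

# spread[k] counts the source nodes that map to exactly k target nodes.
spread = collections.Counter(len(targets) for targets in toy_mapping.values())

total = len(toy_mapping)
# "Straight" nodes are those with exactly one target node.
straight = spread[1]
print(f'{total} total, {100 * straight / total:.2f}% straight')
for (k, n) in sorted(spread.items()):
    print(f'{n} => {k} each')
```

A source node that maps to zero or to several target nodes counts as an issue in this tally.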

# Subphrases
Subphrases are different because they overlap. That means we cannot expect
their mapping between versions to be largely 1-1.
We'll see that in a moment.
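
To illustrate why overlap breaks a 1-1 mapping, here is a small sketch with invented data. It assumes two nested subphrases that are identical in both versions and a slot numbering that does not change. The names `subphrases` and `mapping` are hypothetical.

```python
# Toy data: two nested subphrases, assumed identical in source and target version.
subphrases = {'outer': {1, 2, 3}, 'inner': {2, 3}}

# With unchanged slot numbers, each node maps to every node that shares a slot with it.
mapping = {
    p: {q for (q, qslots) in subphrases.items() if pslots & qslots}
    for (p, pslots) in subphrases.items()
}
print(mapping)
```

Even though nothing changed between the versions in this example, both nodes map to two target nodes, because sharing a single slot is enough to become a candidate.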

In [49]:
nodeMapping = {}

In [52]:
def exploreMapping(ntype, sourceV, targetV, force=False):
    # Name of the slot mapping edge feature, e.g. omap@4-4b.
    slotMap = f'omap@{sourceV}-{targetV}'

    # Reuse a previously computed mapping unless force=True.
    if not force and slotMap in nodeMapping and ntype in nodeMapping[slotMap]:
        mapping = nodeMapping[slotMap][ntype]
    else:
        mapping = {}
        api[sourceV].indent(reset=True)
        api[sourceV].info(f'Computing {slotMap} for {ntype} nodes')
        for p in api[sourceV].F.otype.s(ntype):
            # Slots of the source node p.
            slots = api[sourceV].E.oslots.s(p)
            # Concatenate the images of those slots under the slot mapping.
            mappedSlots = reduce(
                lambda x, y: x + y,
                [api[targetV].Es(slotMap).f(s) for s in slots],
                (),
            )
            # Collect all target nodes of type ntype that contain a mapped slot.
            mappedNodes = reduce(
                set.union,
                [set(api[targetV].L.u(s, ntype)) for s in mappedSlots],
                set(),
            )
            mapping[p] = mappedNodes
        nodeMapping.setdefault(slotMap, {})[ntype] = mapping
        api[sourceV].info('Done')

    # spread[k] = number of source nodes that map to exactly k target nodes.
    spread = collections.Counter()
    for (p, qs) in mapping.items():
        spread[len(qs)] += 1
    lmap = len(mapping)
    # Nodes with exactly one target are "straight"; all others are issues.
    straight = spread[1]
    issues = lmap - straight
    straightF = 100 * straight / lmap
    issuesF = 100 * issues / lmap
    print(f'{slotMap} {ntype}\n\t{lmap:>6} total          {straightF:5.2f}% straight, {issuesF:5.2f}% issues')
    # Most frequent target counts first.
    for (ln, n) in sorted(spread.items(), key=lambda x: (-x[1], x[0])):
        print(f'\t{n:>6} => {ln:>3} each')

In [53]:
ntypes = A4.F.otype.all
for (sourceV, targetV) in (('4', '4b'), ('4b', '4c')):
    # ntypes[0:-1] leaves out the last node type in the list.
    for ntype in ntypes[0:-1]:
        exploreMapping(ntype, sourceV, targetV)

omap@4-4b book
	    39 total          100.00% straight,  0.00% issues
	    39 =>   1 each
omap@4-4b chapter
	   929 total          100.00% straight,  0.00% issues
	   929 =>   1 each
omap@4-4b verse
	 23213 total          100.00% straight,  0.00% issues
	 23213 =>   1 each
omap@4-4b half_verse
	 44682 total          98.89% straight,  1.11% issues
	 44185 =>   1 each
	   496 =>   2 each
	     1 =>   3 each
omap@4-4b sentence
	 66045 total          99.51% straight,  0.49% issues
	 65720 =>   1 each
	   279 =>   2 each
	    34 =>   3 each
	     7 =>   4 each
	     3 =>   5 each
	     1 =>   6 each
	     1 =>   8 each
omap@4-4b sentence_atom
	 66701 total          99.53% straight,  0.47% issues
	 66390 =>   1 each
	   257 =>   2 each
	    37 =>   3 each
	    11 =>   4 each
	     3 =>   5 each
	     1 =>   6 each
	     1 =>   7 each
	     1 =>   8 each
omap@4-4b clause
	 87978 total          99.01% straight,  0.99% issues
	 87111 =>   1 each
	   825 =>   2 each
	    31 =>   3 each
	    10 =

# Next steps

1. We have to find a way to deal with subphrases.
2. In those cases where there are issues, we will probably see that one of the mapped nodes has the bulk
   of the mapped slots, and the others just a marginal fraction. In those cases we can also confidently map one
   node to a single other node.
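
The second step could look roughly like the sketch below. It is not part of the notebook: the function `dominant_target`, its parameters and the `threshold` value of 0.8 are assumptions chosen for illustration, and the data is invented.

```python
def dominant_target(mapped_slots, target_slots, threshold=0.8):
    """Pick the target node that holds at least `threshold` of the mapped slots.

    mapped_slots: set of target-version slots that the source node maps to.
    target_slots: dict of candidate target node -> set of its slots.
    threshold: hypothetical cut-off, not taken from the notebook.
    Returns the dominant target node, or None if no candidate qualifies.
    """
    if not mapped_slots:
        return None
    best, share = None, 0.0
    for (q, qslots) in target_slots.items():
        # Fraction of the mapped slots that fall inside candidate q.
        s = len(mapped_slots & qslots) / len(mapped_slots)
        if s > share:
            best, share = q, s
    return best if share >= threshold else None


# Toy example: candidate 'A' covers 9 of the 10 mapped slots, 'B' only 1.
mapped = set(range(1, 11))
cands = {'A': set(range(1, 10)), 'B': {10, 11}}
choice = dominant_target(mapped, cands)
print(choice)
```

With overlapping node types such as subphrases, several candidates may each hold most of the mapped slots, so a slot share alone would not settle those cases.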