<img align="right" src="tf-small.png"/>

# ETCBC versions

In this notebook we try to map the nodes between the versions 4, 4b and 4c of the ETCBC dataset.

If we succeed, text-fabric notebooks that are based on an older version of the data can be used unmodified on newer versions of the data.

In general, node mappings between versions cannot be perfect. We try and see how far we get.

Let us start with *slot* mappings.
We map the slots of a version to the slots of the next version.
Mappings go from old to new, and they are between successive versions.
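
Because the mappings only link successive versions, a mapping between non-adjacent versions has to be obtained by chaining. The following is a minimal sketch of that idea, assuming each mapping is a plain dict from old slots to new slots. The function `compose` and the toy maps are hypothetical and only serve as illustration.

```python
def compose(map_ab, map_bc):
    """Chain two slot maps: first a -> b, then b -> c.

    Slots whose intermediate target is missing from map_bc are left out.
    """
    return {a: map_bc[b] for (a, b) in map_ab.items() if b in map_bc}


# toy maps (made up): old slot 2 is split in the first step,
# and slot 1 of the middle version is split in the second step
map_4_4b = {1: 1, 2: 2, 3: 4}
map_4b_4c = {1: 1, 2: 3, 3: 4, 4: 5}

map_4_4c = compose(map_4_4b, map_4b_4c)
print(map_4_4c)
```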

We have data in text-fabric format for the ETCBC Hebrew Bible Database, versions 4, 4b, and 4c.

Stephen Ku has prepared a Strong number mapping for version 4, based on 
[OpenScriptures Bible Lexicon](https://github.com/openscriptures/HebrewLexicon).

This provides us with a nice use case: can we apply the Strong number mapping for version 4 to versions 4b and 4c
as well?
See notebook
[strong](https://github.com/ETCBC/text-fabric/blob/master/Versions/strong.ipynb)
for how we add Strong numbers to the ETCBC dataset.
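
As a rough sketch of this use case, assuming a slot mapping is available as a plain dict from old slots to new slots, a feature defined on the slots of the old version can be carried over by following the map. The names `transfer_feature`, `values4` and `slot_map` are hypothetical, and the feature values are placeholders, not real Strong numbers.

```python
def transfer_feature(old_values, slot_map):
    """Carry a slot feature from an old version to a new one via a slot map.

    old_values: dict old slot -> feature value
    slot_map:   dict old slot -> new slot
    Returns a dict new slot -> value. New slots that no old slot maps to
    get no value.
    """
    new_values = {}
    for (old_slot, value) in old_values.items():
        new_slot = slot_map.get(old_slot)
        if new_slot is None:
            continue
        # if several old slots map to the same new slot, keep the first value
        new_values.setdefault(new_slot, value)
    return new_values


# toy data (made up): old slot 2 has been split into new slots 2 and 3
slot_map = {1: 1, 2: 2, 3: 4}
values4 = {1: 'x1', 2: 'x2', 3: 'x3'}

values4b = transfer_feature(values4, slot_map)
print(values4b)
```

In this sketch the new slot 3 ends up without a value, which is exactly the kind of gap that has to be inspected by hand after a transfer.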

Below we will get a pretty good view of the differences between the versions.
We use the
[ETCBC transcription](https://shebanq.ancient-data.org/shebanq/static/docs/ETCBC4-transcription.pdf)
to write down the diffs.

In [1]:
import os,collections
from tf.fabric import Fabric

Load all versions in one go!

In [2]:
locations = {
    '4': '~/github/text-fabric-data-legacy',
    '4b': '~/github/text-fabric-data-legacy',
    '4c': '~/github/text-fabric-data', 
}
versions = ['4', '4b', '4c']
TF = {}
api = {}
for v in versions:
    TF[v] = Fabric(locations=locations[v], modules='hebrew/etcbc{}'.format(v))
    api[v] = TF[v].load('''
        g_word lex
    ''')
A4 = api['4']
A4b = api['4b']
A4c = api['4c']

This is Text-Fabric 2.2.1
Api reference : https://github.com/ETCBC/text-fabric/wiki/Api
Tutorial      : https://github.com/ETCBC/text-fabric/blob/master/docs/tutorial.ipynb
Data sources  : https://github.com/ETCBC/text-fabric-data
Data docs     : https://etcbc.github.io/text-fabric-data
Shebanq docs  : https://shebanq.ancient-data.org/text
Slack team    : https://shebanq.slack.com/signup
Questions? Ask shebanq@ancient-data.org for an invite to Slack
110 features found and 0 ignored
  0.00s loading features ...
   |     0.16s B g_word               from /Users/dirk/github/text-fabric-data-legacy/hebrew/etcbc4
   |     0.13s B lex                  from /Users/dirk/github/text-fabric-data-legacy/hebrew/etcbc4
   |     0.00s Feature overview: 105 nodes; 4 edges; 1 configs; 7 computeds
  5.19s All features loaded/computed - for details use loadLog()

Inspect the number of slots in all versions.

In [3]:
nSlots = {}
for v in versions:
    nSlots[v] = api[v].F.otype.maxSlot
nSlots

{'4': 426555, '4b': 426568, '4c': 426581}

# Method

When we compare two versions, we inspect the lexemes found at corresponding positions in the versions.
We start at the beginning, and when the lexemes do not match, we have a closer look.

However, in order not to be disturbed by minor discrepancies in the lexemes, we mask the lexemes: we
apply a few transformations to them, such as removing alefs and waws, and finally even turning them into
sorted sets of letters, thereby losing the order and multiplicity of the letters.
We also strip the disambiguation marks.
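
To get a feeling for what masking does, here is a simplified sketch. The function `toy_mask` is hypothetical and applies only a subset of the transformations: it strips the disambiguation marks, removes `J`, `>` and `W`, and reduces the result to its sorted set of letters. The lexemes are taken from the differences that show up later in this notebook.

```python
def toy_mask(lex):
    """Simplified masking: a subset of the transformations, for illustration only."""
    lex = lex.rstrip('[/=')           # strip disambiguation marks
    for letter in 'J>W':              # remove yod, alef and waw
        lex = lex.replace(letter, '')
    return ''.join(sorted(set(lex)))  # ignore order and multiplicity


# spelling variants become equal after masking
print(toy_mask('XRWN/'), toy_mask('XWRWN/'))
print(toy_mask('DJBWN/'), toy_mask('DJBN/'))

# genuinely different lexemes stay different
print(toy_mask('>CR'), toy_mask('H'))
```

The price of such aggressive masking is that unrelated lexemes with the same letters are identified as well, which is acceptable here because we only compare lexemes at corresponding positions.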

We maintain a current mapping between the slots of the two versions, and we update it if we encounter
disturbances. 
Initially, this map is the identity map.

What we encounter as remaining differences boils down to the following:

* a lexeme is split into two lexemes with the same total material, typically involving `H`, `MN`, or `B`
* the lexeme is part of a special case, listed in the `cases` table (which has been found by repeatedly
  chasing the first remaining difference)
* the two lexemes differ, but that's it: no map updates have to be done.
  
The first two types of cases can be solved by splitting a lexeme into `k` parts or combining `k` lexemes into one.
After that the mapping has to be shifted to the right or to the left from a certain point onwards.
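
The shifting can be sketched on a toy mapping as follows. The helper names `apply_split` and `apply_collapse` are hypothetical. They assume that the mapping is a dict from old slots to new slots over the slots `1 .. max_slot`, and that everything after slot `n` has not been touched yet.

```python
def apply_split(mp, n, k, max_slot):
    """Old slot n corresponds to k new slots: shift all later slots right by k - 1."""
    for m in range(n + 1, max_slot + 1):
        mp[m] += k - 1


def apply_collapse(mp, n, k, max_slot):
    """Old slots n .. n+k-1 correspond to one new slot: shift later slots left by k - 1."""
    # go from high to low, so that we only read entries that are still unchanged
    for m in range(max_slot, n + k - 1, -1):
        mp[m] = mp[m - k + 1]
    # the collapsed old slots all point to the same new slot
    for m in range(n + 1, n + k):
        mp[m] = mp[n]


split_map = {n: n for n in range(1, 6)}
apply_split(split_map, 2, 2, 5)
print(split_map)

collapse_map = {n: n for n in range(1, 7)}
apply_collapse(collapse_map, 2, 3, 6)
print(collapse_map)
```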

The loop then is as follows:

* find the first slot with a lexeme in the first version that is different from the lexeme at the mapped slot
  in the second version
* analyse what is the case:
  * if the disturbance is recognized on the basis of existing patterns and cases, update the map and
    consider this case solved
  * if the disturbance is not recognized, the case is unsolved, and we break out of the loop.
    More analysis is needed, and the outcome of that has to be coded as an extra pattern or case.
* if the status is solved, go back to the first step

We end up with a mapping from the slots of the first version to those of the other version that links
slots with approximately equal lexemes together.
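
Before turning to the real data, here is a toy version of the loop on two short lists of lexemes. It is a sketch under simplifying assumptions: there is no masking, no `cases` table, and only one pattern is recognized, namely an old lexeme that equals the concatenation of two consecutive new lexemes. The names `lex_at` and `align` are hypothetical.

```python
def lex_at(seq, pos):
    """Lexeme at 1-based position pos, or None if pos is out of range."""
    return seq[pos - 1] if 1 <= pos <= len(seq) else None


def align(old, new):
    """Map the 1-based positions of old to positions of new.

    Returns (mapping, blocked_at); blocked_at is None if every
    difference has been resolved.
    """
    last = len(old)
    mp = {n: n for n in range(1, last + 1)}
    start = 1
    while True:
        # first old slot whose lexeme differs from the lexeme at the mapped slot
        n = next(
            (n for n in range(start, last + 1) if old[n - 1] != lex_at(new, mp[n])),
            None,
        )
        if n is None:
            return (mp, None)
        (lx2, lx2n) = (lex_at(new, mp[n]), lex_at(new, mp[n] + 1))
        if lx2 is not None and lx2n is not None and old[n - 1] == lx2 + lx2n:
            # recognized: split into two slots, shift the rest of the map
            for m in range(n + 1, last + 1):
                mp[m] += 1
            start = n + 1
        else:
            # not recognized: more analysis is needed
            return (mp, n)


old = ['>JC', 'HLZH', 'H', 'CDH']
new = ['>JC', 'H', 'LZH', 'H', 'CDH']
(toy_map, blocked_at) = align(old, new)
print(toy_map, blocked_at)
```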

We start by defining our masking function, and compile lists of all lexemes and masked lexemes for all versions.

In [4]:
masks = [
    (lambda lex: lex.rstrip('[/='),                         'strip disambiguation'),
    (lambda lex: lex[0:-2] if lex.endswith('JM') else lex,  'remove JM'),
    (lambda lex: lex[0:-2] if lex.endswith('WT') else lex,  'remove WT'),
    (lambda lex: lex.rstrip('HT'),                          'strip HT'),
    (lambda lex: lex.replace('J', ''),                      'remove J'),
    (lambda lex: lex.replace('>', ''),                      'remove Alef'),
    (lambda lex: lex.replace('W', ''),                      'remove W'),
    (lambda lex: lex.replace('Z', 'N'),                     'identify Z and N'),
    (lambda lex: ''.join(sorted(set(lex))) + '_' * lex.count('_'), 'ignore order and multiplicity'),
]

def mask(lex, trans=None):
    if trans is not None:
        return masks[trans][0](lex)
    for (fun, desc) in masks:
        lex = fun(lex)
    return lex

lexemes = {}

A4.indent(level=0, reset=True)
A4.info('Masking lexemes')
for v in versions:
    A4.indent(level=1, reset=True)
    lexemes[v] = collections.OrderedDict()
    for n in api[v].F.otype.s('word'):
        lex = api[v].F.lex.v(n)
        lexemes[v][n] = (lex, mask(lex, trans=0), mask(lex))
    A4.info('version {} done'.format(v))
A4.indent(level=0)
A4.info('Done')

  0.00s Masking lexemes
   |     2.72s version 4 done
   |     2.67s version 4b done
   |     2.73s version 4c done
  8.12s Done


In `cases` we store special cases that we stumbled upon.
Every time we encountered a disturbance which did not follow a recognized pattern,
we turned it into a case.
The number is the slot number in the first version where the case will be applied.
Cases will only be applied at these exact slot numbers and nowhere else.

In [5]:
cases = {}
mappings = {}

# Algorithm

Here is the code that directly implements the method.
Every pair of distinct versions can be mapped.
We store the mappings in a dictionary, keyed by tuples like `('4', '4b')`,
for the mapping from version `4` to `4b`, for instance.

The loop is in `doDiffs` below.

In [None]:
def inspect(v1, v2, start, end):
    mapKey = (v1, v2)
    mp = mappings[mapKey]
    for n in range(start, end):
        print('{:>6}: {:<8} {:<8}'.format(
            n, 
            api[v1].F.lex.v(n),
            api[v2].F.lex.v(mp[n]),

        ))

def firstDiff(v1, v2, start):
    mapKey = (v1, v2)
    mp = mappings[mapKey]

    fDiff = None
    for (n, (lx1, sx1, mx1)) in lexemes[v1].items():
        if n < start: continue
        if mx1 != lexemes[v2][mp[n]][2]:
            fDiff = n
            break
    return fDiff

def printDiff(v1, v2, n):
    mapKey = (v1, v2)
    mp = mappings[mapKey]

    (lx1, sx1, mx1) = lexemes[v1][n]
    (lx2, sx2, mx2) = lexemes[v2][mp[n]]
    if n < api[v1].F.otype.maxSlot:
        (lx1n, sx1n, mx1n) = lexemes[v1][n+1]
    else:
        (lx1n, sx1n, mx1n) = ('max', 'max', 'max')
    if mp[n] < api[v2].F.otype.maxSlot:
        (lx2n, sx2n, mx2n) = lexemes[v2][mp[n+1]]
    else:
        (lx2n, sx2n, mx2n) = ('max', 'max', 'max')
    if n > 1:
        (lx1p, sx1p, mx1p) = lexemes[v1][n-1]
    else:
        (lx1p, sx1p, mx1p) = ('min', 'min', 'min')
    if mp[n] > 1:
        (lx2p, sx2p, mx2p) = lexemes[v2][mp[n-1]]
    else:
        (lx2p, sx2p, mx2p) = ('min', 'min', 'min')

    #print('''{} {}:{} ==> slot {} ==> {}
    #{:<2}: {:<6} ~ |{:<6}| ~ {:<6}   {:<6} ~ |{:<6}| ~ {:<6}   {:<6} ~ |{:<6}| {:<6}
    #{:<2}: {:<6} ~ |{:<6}| ~ {:<6}   {:<6} ~ |{:<6}| ~ {:<6}   {:<6} ~ |{:<6}| {:<6}'''.format(
    #    *api[v1].T.sectionFromNode(n),
    #    n, mp[n],
    #    v1, lx1p, lx1, lx1n, sx1p, sx1, sx1n, mx1p, mx1, mx1n,
    #    v2, lx2p, lx2, lx2n, sx2p, sx2, sx2n, mx2p, mx2, mx2n,
    #)) 
    print('''{} {}:{} ==> slot {} ==> {}
    {:<2}: {:<8} ~ |{:<8}| ~ {:<8}
    {:<2}: {:<8} ~ |{:<8}| ~ {:<8}'''.format(
        *api[v1].T.sectionFromNode(n),
        n, mp[n],
        v1, lx1p, lx1, lx1n, 
        v2, lx2p, lx2, lx2n,
    )) 


MAX_ITER = 100

def doDiffs(v1, v2):
    mapKey = (v1, v2)
    mappings[mapKey] = dict(((n, n) for n in api[v1].F.otype.s('word')))
    mp = mappings[mapKey]
    cs = cases.get(mapKey, {})
    it = 0
    start = 1
    solved = True  # stays true if there are no differences at all
    while True:
        n = firstDiff(v1, v2, start)

        if n is None:
            print('No more differences.\nFound {} points of disturbance'.format(it))
            break

        if it > MAX_ITER: 
            print('There might be more disturbances: increase MAX_ITER')
            break
            
        it += 1

        printDiff(v1, v2, n)

        (lx1, sx1, mx1) = lexemes[v1][n]
        (lx2, sx2, mx2) = lexemes[v2][mp[n]]
        (lx1n, sx1n, mx1n) = lexemes[v1][n+1]
        (lx2n, sx2n, mx2n) = lexemes[v2][mp[n+1]]

        solved = None
        skip = 0
        if n in cs:
            (action, param) = cs[n]
            if action == 'collapse':
                solved = '{} {} slots'.format(action, param)
                skip = param
                for m in range(api[v1].F.otype.maxSlot, n + param -1, -1):
                    mp[m] = mp[m-param+1]
                for m in range(n+1, n+param):
                    mp[m] = mp[n]
            elif action == 'split':
                solved = '{} into {} slots'.format(action, param)
                for m in range(n+1, api[v1].F.otype.maxSlot+1):
                    mp[m] = mp[m] + param -1
            elif action == 'ok':
                solved = 'innocent variation in lexeme'
        elif lx1 == lx2 + lx2n:
            if lx2 == 'H':
                solved = 'split article off'
                for m in range(n+1, api[v1].F.otype.maxSlot+1):
                    mp[m] = mp[m] + 1
        elif set(mx1) == set(mx2) | set(mx2n):
            if lx2 == 'B' or lx2 == 'MN':
                solved = 'split preposition off'
                for m in range(n+1, api[v1].F.otype.maxSlot+1):
                    mp[m] = mp[m] + 1
        print('Action: {}\n'.format(solved if solved else 'BLOCKED'))

        if not solved: break
        
        start = n + 1 + skip

    if not solved:
        print('Blocking difference in {} iterations'.format(it))

The mappings themselves are needed elsewhere in Text-Fabric, so let us write them to file.
We write them into the dataset corresponding to the target version.
So the map `4-4b` ends up in dataset `etcbc4b`.

In [7]:
def writeMaps():
    for ((v1, v2), mp) in mappings.items():
        module = 'hebrew/etcbc{}'.format(v2)
        fName = 'omap@{}-{}'.format(v1, v2)
        edgeFeatures = {
            fName: dict(((n, (mp[n],)) for n in range(1, api[v1].F.otype.maxSlot + 1)))
        }
        metaData = {
            fName: {
                'about': 'Mapping from the slots of ETCBC version {} to version {}'.format(v1, v2),
                'see': 'https://github.com/ETCBC/text-fabric/blob/master/Versions/etcbc-versions.ipynb',
                'valueType': 'str',
            }
        }
        TF[v2].save(
            nodeFeatures={},
            edgeFeatures=edgeFeatures,
            metaData=metaData,
        )

# Running

Here we run the mapping between `4` and `4b`.
The points of disturbance will be written into the output cell.

## 4 => 4b

Here are the special cases for this conversion.

In [8]:
cases.update({
    ('4', '4b'): {
        214730: ('collapse', 4),
        260028: ('split', 2),
        289948: ('ok', None),
        307578: ('split', 2),
        323067: ('ok', None),
        407543: ('split', 2),
        408429: ('split', 2),
    },
})

In [9]:
doDiffs('4', '4b')

Genesis 24:65 ==> slot 12369 ==> 12369
    4 : >JC/     ~ |HLZH    | ~ H       
    4b: >JC/     ~ |H       | ~ LZH     
Action: split article off

Genesis 37:19 ==> slot 20514 ==> 20515
    4 : XLWM/    ~ |HLZH    | ~ BW>[    
    4b: XLWM/    ~ |H       | ~ LZH     
Action: split article off

Judges 6:20 ==> slot 130846 ==> 130848
    4 : SL</     ~ |HLZ     | ~ W       
    4b: SL</     ~ |H       | ~ LZ      
Action: split article off

1_Samuel 14:1 ==> slot 148319 ==> 148322
    4 : <BR/     ~ |HLZ     | ~ W       
    4b: <BR/     ~ |H       | ~ LZ      
Action: split article off

1_Samuel 17:26 ==> slot 151331 ==> 151335
    4 : PLCTJ/   ~ |HLZ     | ~ W       
    4b: PLCTJ/   ~ |H       | ~ LZ      
Action: split article off

1_Samuel 20:19 ==> slot 153816 ==> 153821
    4 : >BN/     ~ |H>ZL/   | ~ W       
    4b: >BN/     ~ |H       | ~ >ZL/    
Action: split article off

2_Kings 4:25 ==> slot 196975 ==> 196981
    4 : CWNMJ/   ~ |HLZ     | ~ <TH     
    4b: CWNMJ/   ~ |H  

Just have a look at the first point of disturbance:

In [10]:
(v1, v2) = ('4', '4b')
(n, m) = [x for x in mappings[(v1, v2)].items() if x[0] != x[1]][0]
print('{} {}:{} node {}: {} versus {} becomes {}'.format(
    *api[v1].T.sectionFromNode(n),
    n,
    api[v1].F.lex.v(n),
    api[v2].F.lex.v(n),
    api[v2].F.lex.v(m),
))

Genesis 24:65 node 12370: H versus LZH becomes H


## 4b => 4c

We need other cases.

In [11]:
cases.update({
    ('4b', '4c'): {
         28423: ('split', 3),
         28455: ('split', 3),
         91193: ('split', 2),
         91197: ('split', 2),
        122218: ('split', 2),
        122247: ('split', 2),
        123160: ('split', 2),
        184086: ('split', 2),
        394186: ('collapse', 2),
        395150: ('ok', None),
        395190: ('ok', None),
        401036: ('split', 3),
        404503: ('ok', None),
        419138: ('split', 3),
    },    
})

In [12]:
doDiffs('4b', '4c')

Genesis 50:10 ==> slot 28423 ==> 28423
    4b: <D       ~ |GRN_>VD/| ~ >CR     
    4c: <D       ~ |GRN/    | ~ H       
Action: split into 3 slots

Genesis 50:11 ==> slot 28455 ==> 28457
    4b: B        ~ |GRN_>VD/| ~ W       
    4c: B        ~ |GRN/    | ~ H       
Action: split into 3 slots

Numbers 33:45 ==> slot 91193 ==> 91197
    4b: B        ~ |DJBWN_GD/| ~ W       
    4c: B        ~ |DJBN/   | ~ GD==/   
Action: split into 2 slots

Numbers 33:46 ==> slot 91197 ==> 91202
    4b: MN       ~ |DJBWN_GD/| ~ W       
    4c: MN       ~ |DJBN/   | ~ GD==/   
Action: split into 2 slots

Joshua 16:3 ==> slot 122218 ==> 122224
    4b: GBWL/    ~ |BJT_XRWN_TXTWN/| ~ W       
    4c: GBWL/    ~ |BJT_XWRWN/| ~ TXTWN/  
Action: split into 2 slots

Joshua 16:5 ==> slot 122247 ==> 122254
    4b: <D       ~ |BJT_XRWN_<LJWN/| ~ W       
    4c: <D       ~ |BJT_XWRWN/| ~ <LJWN/  
Action: split into 2 slots

Joshua 18:13 ==> slot 123160 ==> 123168
    4b: L        ~ |BJT_XRWN_TXTWN/| ~ W      

The bit below is very handy if you need a closer look at what is going on in some range of slots.

In [13]:
inspect('4b', '4c', 419135, 419145)

419135: <M/      <M/     
419136: W        W       
419137: HLK[     HLK[    
419138: GJ>_MLX/ GJ>/    
419139: W        W       
419140: NKH[     NKH[    
419141: >T       >T      
419142: BN/      BN/     
419143: F<JR====/ F<JR====/
419144: <FRH=/   <FRH=/  


Just have a look at the first point of disturbance:

In [14]:
(v1, v2) = ('4b', '4c')
(n, m) = [x for x in mappings[(v1, v2)].items() if x[0] != x[1]][0]
print('{} {}:{} node {}: {} versus {} becomes {}'.format(
    *api[v1].T.sectionFromNode(n),
    n,
    api[v1].F.lex.v(n),
    api[v2].F.lex.v(n),
    api[v2].F.lex.v(m),
))

Genesis 50:10 node 28424: >CR versus H becomes >CR


In [15]:
writeMaps()

  0.00s Exporting 0 node and 1 edge and 0 config features to /Users/dirk/github/text-fabric-data-legacy/hebrew/etcbc4b:
   |     1.33s T omap@4-4b            to /Users/dirk/github/text-fabric-data-legacy/hebrew/etcbc4b
  1.33s Exported 0 node features and 1 edge features and 0 config features to /Users/dirk/github/text-fabric-data-legacy/hebrew/etcbc4b
  0.00s Exporting 0 node and 1 edge and 0 config features to /Users/dirk/github/text-fabric-data/hebrew/etcbc4c:
   |     1.31s T omap@4b-4c           to /Users/dirk/github/text-fabric-data/hebrew/etcbc4c
  1.31s Exported 0 node features and 1 edge features and 0 config features to /Users/dirk/github/text-fabric-data/hebrew/etcbc4c


In [28]:
sorted(mappings.keys())

[]