<img align="right" src="images/tf-small.png" width="128"/>
<img align="right" src="images/etcbc.png"/>
<img align="right" src="images/dans-small.png"/>

You might want to consider the [start](start.ipynb) of this tutorial.

Short introductions to other TF datasets:

* [Dead Sea Scrolls](https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/lorentz2020/dss.ipynb),
* [Old Babylonian Letters](https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/lorentz2020/oldbabylonian.ipynb), or the
* [Quran](https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/lorentz2020/quran.ipynb)


# Upgrade features along a node mapping

Consider the semantic actor features in 
[ch-jensen/participants/actor/tf](https://github.com/ch-jensen/participants/tree/master/actor/tf).

We see only features for version `c` of the BHSA, but we prefer to work with version `2021` of the BHSA.

When we try to load the features by simply saying

```
A = use("bhsa", mod="ch-jensen/participants/actor/tf")
```

we have no luck, because there is no `ch-jensen/participants/actor/tf/2021` on GitHub.

But, one of the features in the BHSA is `omap@c-2021.tf` and this contains the information to map
all nodes in version `c` to the nodes of version `2021`, as faithfully as is reasonably possible.
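
To make the idea of a node mapping concrete, here is a minimal sketch in plain Python.
It assumes that a mapping sends each old node to a tuple of `(new node, label)` pairs.
All node numbers and labels are invented for illustration, and the helper `counterparts` is hypothetical, not part of Text-Fabric.

```python
# Hypothetical node mapping: old node -> tuple of (new node, label) pairs.
# All node numbers and labels are invented for illustration.
node_map = {
    10: ((11, None),),                       # exactly one counterpart
    20: ((21, "changed"), (22, "changed")),  # spread over two new nodes
    30: (),                                  # no counterpart in the new version
}


def counterparts(old_node, mapping):
    """Return the new nodes that an old node maps to (possibly none)."""
    return [new_node for (new_node, _label) in mapping.get(old_node, ())]


for old in (10, 20, 30, 40):
    print(old, "->", counterparts(old, node_map))
```

Nodes with an empty list of counterparts are the ones whose annotations cannot be carried over.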

My homework as the Text-Fabric developer is to make the statement above work, by steering Text-Fabric
to download version `c` and to use the mapping feature to produce upgraded data in the right place.
But I have not gotten round to that yet.

So, here is what *you* can do about it 😎.

1. File an [issue](https://github.com/ch-jensen/participants/issues) and ask Christian whether he is inclined to
   use his software to build the features against BHSA version 2021.
   *But he might be too busy to do that right now.*
2. Fork [ch-jensen/participants](https://github.com/ch-jensen/participants) and try to run his software yourself.
   *That might not be easy. It seems that the code to run is in another repository.
   Is all the input data publicly available? Are special settings needed for version 2021?
   Is the software still executable?*
3. Do fork the repo by all means, and then use a Text-Fabric tool to *upgrade* the features of the older version
   to the features of the newer version.
   
We take you through the last option and evaluate how well the upgrade process fares.

In [1]:
%load_ext autoreload
%autoreload 2

# Incantation

The ins and outs of installing Text-Fabric, getting the corpus, and initializing a notebook are
explained in the [start tutorial](start.ipynb).

In [2]:
import collections

from tf.app import use
from tf.fabric import Fabric
from tf.dataset.nodemaps import Versions

## Load the current version of the BHSA

We need the current version (`2021`) of the BHSA anyway, so we are going to load it.

In [3]:
A = use("bhsa", hoist=globals())

This is Text-Fabric 9.1.10
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

122 features found and 0 ignored


## Load the available version of the participant features

We have forked Christian's repo to `etcbc/participants`, so make sure to clone it to your computer:

```
cd ~/github/etcbc
git clone https://github.com/ETCBC/participants
```

Now we can load the actor features for version `c`.
We do this in a very low-level way.

First we set up TF to look into a specific directory of features.

In [4]:
LOCATION = "~/github/etcbc/participants/actor/tf"

TF = Fabric(locations=f"{LOCATION}/c")

This is Text-Fabric 9.1.10
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

3 features found and 0 ignored
  0.00s Not all of the warp features otype and oslots are present in
~/github/etcbc/participants/actor/tf/c/
  0.00s Only the Feature and Edge APIs will be enabled
  0.00s Warp feature "otext" not found. Working without Text-API



Then we load all available features in that directory.

Note that this is not a complete TF dataset, because the standard features `oslots` and `otype` are missing.
These are called *warp* features, in the lingo of the fabric metaphor of text.
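
If you want to check up front whether a directory holds the warp features, a small sketch like the following might do.
It assumes that every feature is stored in a file named after the feature with extension `.tf`; the helper `missing_warp` is hypothetical and not part of Text-Fabric.

```python
from pathlib import Path

# The two warp features named above.
WARP = ("otype", "oslots")


def missing_warp(directory):
    """Return the warp features that have no .tf file in the directory."""
    folder = Path(directory).expanduser()
    present = {path.stem for path in folder.glob("*.tf")}
    return [name for name in WARP if name not in present]
```

Calling `missing_warp("~/github/etcbc/participants/actor/tf/c")` on your local clone should then name the features that the log message above complains about.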

In [5]:
B = TF.loadAll()

  0.00s All features loaded/computed - for details use TF.isLoaded()
   |     0.00s Feature overview: 2 for nodes; 1 for edges; 1 configs; 0 computed
  0.05s All additional features loaded - for details use TF.isLoaded()


We just ask for an overview of loaded features.

In [6]:
B.isLoaded()

actor                node (str) Participant references for words, subphrases and phrases. The references are
                                adapted from Eep Talstra's work on participant
                                tracking. http://doi.org/10.5281/zenodo.1479491
coref                edge       Edges to co-referring actors on chapter-level. The references are adapted from
                                Eep Talstra's work on participant tracking.
                                http://doi.org/10.5281/zenodo.1479491
otext                config    
prs_actor            node (str) Participant references for pronominal suffixes. The references are adapted from
                                Eep Talstra's work on participant tracking.
                                http://doi.org/10.5281/zenodo.1479491


We can get a bit more information:

In [7]:
B.isLoaded(meta=True)

actor                node (str)
	coreData             = BHSA
	coreVersion          = c
	dateWritten          = 2020-05-11T13:34:09Z
	description          = Participant references for words, subphrases and phrases. The references are
	                       adapted from Eep Talstra's work on participant tracking.
	                       http://doi.org/10.5281/zenodo.1479491
	writtenBy            = Text-Fabric
coref                edge      
	coreData             = BHSA
	coreVersion          = c
	dateWritten          = 2020-05-11T13:34:16Z
	description          = Edges to co-referring actors on chapter-level. The references are adapted from
	                       Eep Talstra's work on participant tracking.
	                       http://doi.org/10.5281/zenodo.1479491
	writtenBy            = Text-Fabric
otext                config    
	sectionFeatures      = 
	sectionTypes         = 
prs_actor            node (str)
	coreData             = BHSA
	coreVersion          = c
	dateWritten      

## Upgrade the participant features

We are going to upgrade the participant features from version `c` to version `2021`.

For that, we use [tf.dataset.nodemaps.Versions](https://annotation.github.io/text-fabric/tf/dataset/nodemaps.html#tf.dataset.nodemaps.Versions).

We initialize the `Versions` object with two Text-Fabric API objects:

* the one holding the old version of the features: `B`. This is a TF-API object, since it is the result of `TF.loadAll()`.
* the one holding the new version of the features: `A.api`.
  If we load a dataset by means of `use()`, the result is a richer object, of which the `.api` member is the classical TF-API.


In [8]:
apis = {"2021": A.api, "c": B}

V = Versions(apis, "c", "2021")

Finally we migrate the features from "c" to "2021" and save them in the correct location.

We skip the `otext` feature, since it is a special config feature, not a data feature made by Christian.

In [9]:
V.migrateFeatures(("actor", "coref", "prs_actor"), location=LOCATION)

  0.00s Exporting 2 node and 1 edge and 0 config features to ~/github/etcbc/participants/actor/tf/2021:
   |     0.00s T actor                to ~/github/etcbc/participants/actor/tf/2021
   |     0.00s T prs_actor            to ~/github/etcbc/participants/actor/tf/2021
   |     0.06s T coref                to ~/github/etcbc/participants/actor/tf/2021
  0.06s Exported 2 node features and 1 edge features and 0 config features to ~/github/etcbc/participants/actor/tf/2021
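
To give an idea of what migrating features along a node mapping amounts to, here is a simplified sketch in plain Python.
It is not the implementation of `migrateFeatures`: it ignores any quality information in the mapping, and the function names and the data are made up for illustration.

```python
def migrate_node_feature(values, mapping):
    """Carry node feature values over to the new version.

    values:  dict old node -> value
    mapping: dict old node -> list of new nodes
    """
    migrated = {}
    for (old_node, value) in values.items():
        for new_node in mapping.get(old_node, ()):
            # if several old nodes land on one new node, the first value wins
            migrated.setdefault(new_node, value)
    return migrated


def migrate_edge_feature(edges, mapping):
    """Carry edges over by mapping both end points."""
    migrated = {}
    for (old_from, old_tos) in edges.items():
        for new_from in mapping.get(old_from, ()):
            for old_to in old_tos:
                for new_to in mapping.get(old_to, ()):
                    migrated.setdefault(new_from, set()).add(new_to)
    return migrated


# Invented data: old nodes 1 and 2 have merged into new node 101,
# old node 3 has no counterpart, old node 4 maps to new node 104.
mapping = {1: [101], 2: [101], 3: [], 4: [104]}
actor = {1: "X", 3: "Y", 4: "Z"}
coref = {1: {2}, 2: {1}, 4: {3}}

print(migrate_node_feature(actor, mapping))
print(migrate_edge_feature(coref, mapping))
```

In this toy example the value on node 3 is dropped, the edge from 4 to 3 is dropped, and the edge between the merged nodes 1 and 2 turns into a link from node 101 to itself.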


## Load the upgraded module

Now we are in a position to load the BHSA together with the migrated module of participant features.
Note that we point Text-Fabric to the forked repo (`etcbc` instead of `ch-jensen`) and then to
our local clone (`:clone`).

In [10]:
A = use("bhsa", mod="etcbc/participants/actor/tf:clone")

This is Text-Fabric 9.1.10
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

125 features found and 0 ignored
   |     0.01s T actor                from ~/github/etcbc/participants/actor/tf/2021
   |     0.10s T coref                from ~/github/etcbc/participants/actor/tf/2021
   |     0.00s T prs_actor            from ~/github/etcbc/participants/actor/tf/2021


If you click the triangles and navigate to the full metadata of the participants features,
you see a line

```
upgraded: from version c to 2021
```

## Checks

Let's do a few checks to see how well the upgrade process has worked.

First we load the `c` version of the BHSA and Christian's original features.

In [11]:
B = use("bhsa", mod="ch-jensen/participants/actor/tf", version="c")

This is Text-Fabric 9.1.10
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

123 features found and 0 ignored


Below we are going to peek into the corpus by means of pretty displays.
Here we set up what is displayed and in what style.

In this case we use the phonological transcription, instead of fully pointed Hebrew,
so that non-Hebraists can see what is happening here.

In [12]:
hiddenTypes="half_verse,sentence_atom,clause,clause_atom"
A.displaySetup(hiddenTypes=hiddenTypes, condenseType="sentence", withNodes=True, fmt="text-phono-full")
B.displaySetup(hiddenTypes=hiddenTypes, condenseType="sentence", withNodes=True, fmt="text-phono-full")

### Node feature "actor"

What are the node types that have an actor value?

In [13]:
{B.api.F.otype.v(n) for n in B.api.N.walk() if B.api.F.actor.v(n) is not None}

{'phrase_atom', 'subphrase'}

In [14]:
{A.api.F.otype.v(n) for n in A.api.N.walk() if A.api.F.actor.v(n) is not None}

{'phrase_atom', 'subphrase'}

Let's inspect the frequency lists of actor, per node type.

In [15]:
for otype in ("phrase_atom", "subphrase"):
    frequenciesA = A.api.F.actor.freqList(nodeTypes={otype})
    frequenciesB = B.api.F.actor.freqList(nodeTypes={otype})
    freqDictA = {v: f for (v, f) in frequenciesA}
    freqDictB = {v: f for (v, f) in frequenciesB}
    goodOnes = []
    badOnes = []
    for v in sorted(set(freqDictA) | set(freqDictB)):
        fA = freqDictA.get(v, 0)
        fB = freqDictB.get(v, 0)
        if fA == fB:
            goodOnes.append(v)
        else:
            badOnes.append((v, fA, fB))
            
    print(f"\nComparing frequencies on {otype}s: {len(goodOnes)} OK; {len(badOnes)} discrepancies")
    for (v, fA, fB) in badOnes[0:100]:
        print(f"{fA:>3} {fB:>3} {v}")


Comparing frequencies on phrase_atoms: 361 OK; 2 discrepancies
 91  94 >JC
  7   9 CNH

Comparing frequencies on subphrases: 135 OK; 0 discrepancies


### Closer inspection

Most actors on phrase atoms carry over well. But e.g. `CNH` has discrepancies.
Let's get a feel for why we get the discrepancies.

In [16]:
actorCNH = """
phrase_atom
  actor=CNH
"""

resultsA = A.search(actorCNH)
resultsB = B.search(actorCNH)

  0.18s 7 results
  0.16s 9 results


In [17]:
A.table(resultsA)
B.table(resultsB)

n,p,phrase_atom
1,Leviticus 25:11,945873tihyˈeh
2,Leviticus 25:12,945886yôvˈēl
3,Leviticus 25:12,945887hˈiw
4,Leviticus 25:12,945888qˌōḏeš
5,Leviticus 25:12,945889tihyˈeh
6,Leviticus 25:51,946353baššānˈîm
7,Leviticus 25:52,946362baššānˈîm


n,p,phrase_atom
1,Leviticus 25:10,945830šānˈā
2,Leviticus 25:11,945851šānˌā
3,Leviticus 25:11,945852tihyˈeh
4,Leviticus 25:12,945865yôvˈēl
5,Leviticus 25:12,945866hˈiw
6,Leviticus 25:12,945867qˌōḏeš
7,Leviticus 25:12,945868tihyˈeh
8,Leviticus 25:51,946332baššānˈîm
9,Leviticus 25:52,946341baššānˈîm


Clearly, there is something interesting in Leviticus 25 verses 10 and 11.

We want to compare verse 10 in both versions.
Here are the original actors in version `c`:

In [18]:
B.show(resultsB, start=1, end=1, condensed=True)

Let's find the same sentence in version `2021`:

In [19]:
sB = 1181939
mappedSb = A.api.Es("omap@c-2021").f(sB)
mappedSb

((1181957, None),)

In [20]:
A.pretty(mappedSb[0][0])

Aha: in version 2021 there is no counterpart of the phrase atom 945830, the one which carried `actor=CNH`.

This phrase atom has morphed into a subphrase, and hence we lose the connection and this particular annotation.
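
Such lost annotations can also be listed systematically instead of spotted by eye.
The following sketch works on plain dictionaries with invented node numbers; the helper `lost_annotations` is hypothetical, and it assumes a mapping from old nodes to lists of new nodes.

```python
def lost_annotations(old_values, new_values, mapping):
    """Return (old node, value) pairs whose value is on none of the mapped nodes."""
    lost = []
    for (old_node, value) in sorted(old_values.items()):
        new_nodes = mapping.get(old_node, ())
        if not any(new_values.get(n) == value for n in new_nodes):
            lost.append((old_node, value))
    return lost


# Invented data: old node 1 has no counterpart, so its annotation is lost.
old_values = {1: "CNH", 2: "CNH", 3: ">JC"}
mapping = {1: [], 2: [12], 3: [13]}
new_values = {12: "CNH", 13: ">JC"}

print(lost_annotations(old_values, new_values, mapping))
```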

### Edge feature "coref"

We also have an edge feature in the module. Let's test that as well.

First we explore the edge feature a little bit.
From which node type to which node type do they go?

We constrain our displays to phrases from now on.

In [21]:
A.displaySetup(condenseType="phrase")
B.displaySetup(condenseType="phrase")

In [22]:
nodeTypes = collections.Counter()

for (f, ts) in B.api.E.coref.items():
    fromType = B.api.F.otype.v(f)
    for t in ts:
        toType = B.api.F.otype.v(t)
        nodeTypes[(fromType, toType)] += 1

In [23]:
nodeTypes

Counter({('word', 'subphrase'): 471,
         ('word', 'phrase_atom'): 20254,
         ('word', 'word'): 19884,
         ('phrase_atom', 'phrase_atom'): 34404,
         ('phrase_atom', 'subphrase'): 1621,
         ('phrase_atom', 'word'): 20254,
         ('subphrase', 'word'): 471,
         ('subphrase', 'subphrase'): 1086,
         ('subphrase', 'phrase_atom'): 1621})

The *coref* relation seems to be symmetrical, so when we check cases, we can skip a number
of pairs.
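
The counts above are mirrored per pair of node types, which suggests symmetry but does not prove it for individual nodes.
A stricter check could look like the sketch below. It works on a plain dictionary from node to set of target nodes, with invented data; the helper `asymmetric_pairs` is hypothetical. Presumably the pairs delivered by `B.api.E.coref.items()` can be collected into such a dictionary.

```python
def asymmetric_pairs(edges):
    """Return the (from, to) pairs whose reverse (to, from) is missing."""
    return sorted(
        (f, t)
        for (f, targets) in edges.items()
        for t in targets
        if f not in edges.get(t, ())
    )


symmetric = {1: {2}, 2: {1}}
lopsided = {1: {2, 3}, 2: {1}}  # the edge from 3 back to 1 is missing

print(asymmetric_pairs(symmetric))
print(asymmetric_pairs(lopsided))
```

An empty result means that every edge has its reverse.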

In [24]:
done = set()

for (fromType, toType) in nodeTypes:
    if (fromType, toType) in done:
        continue
    done.add((fromType, toType))
    done.add((toType, fromType))
    print(f"{fromType:<15} - {toType:<15}")
    template = f"""
{fromType}
-coref> {toType}
"""
    resultsA = A.search(template)
    resultsB = B.search(template)
    
    goodOnes = []
    badOnes = []

    phonoA = lambda n: A.api.T.text(n, fmt="text-phono-full")
    phonoB = lambda n: B.api.T.text(n, fmt="text-phono-full")

    for ((fA, tA), (fB, tB)) in zip(resultsA, resultsB):
        fAp = phonoA(fA)
        fBp = phonoB(fB)
        tAp = phonoA(tA)
        tBp = phonoB(tB)
        if fAp == fBp and tAp == tBp:
            goodOnes.append(f"{fAp} => {tAp}")
        else:
            fDif = fAp if fAp == fBp else f"{fAp} != {fBp}"
            tDif = tAp if tAp == tBp else f"{tAp} != {tBp}"
            badOnes.append((f"{fDif} => {tDif}", fA, fB, tA, tB))
    print(f"good: {len(goodOnes):>5}\nbad : {len(badOnes):>5}")
    if len(goodOnes):
        print("Good:")
        for rep in goodOnes[0:3]:
            print(f"\t{rep}")
    if len(badOnes):
        print("Bad:")
        for (rep, fA, fB, tA, tB) in badOnes[0:3]:
            print(f"\t{rep} {fA} {fB} => {tA} {tB}")
    print("-" * 40)
    print("")

word            - subphrase      
  0.19s 471 results
  0.16s 471 results
good:   471
bad :     0
Good:
	bānˈāʸw  => ʔˈel-ʔahᵃrˈōn 
	zivḥêhem  => bᵊnˈê yiśrāʔˈēl 
	zivḥêhem  => mibbᵊnˈê yiśrāʔˈēl 
----------------------------------------

word            - phrase_atom    
  0.35s 20188 results
  0.24s 20254 results
good:  3785
bad : 16403
Good:
	ʔᵃlêhˈem  => ʔˈel-ʔahᵃrˈōn wᵊʔel-bānˈāʸw wᵊʔˌel kol-bᵊnˈê yiśrāʔˈēl 
	hᵉvîʔˌô  => šˌôr ʔô-ḵˈeśev ʔô-ʕˌēz 
	ʕammˈô .  => ʔˌîš ʔîš 
Bad:
	zzarʕˈô  => ʔˈîš ʔîš  != ʔˈîš  64423 64422 => 944121 944096
	zzarʕˈô  => yittˈēn  != ʔîš  64423 64422 => 944127 944097
	zzarʕˈô  => yûmˈāṯ  != yittˈēn  64423 64422 => 944131 944103
----------------------------------------

word            - word           
  0.56s 19884 results
  0.48s 19884 results
good: 19884
bad :     0
Good:
	zivḥêhem  => zivḥêhˈem 
	zivḥêhem  => lāhˌem 
	zivḥêhem  => ḏōrōṯˈām . 
----------------------------------------

phrase_atom     - phrase_atom    
  0.29s 34215 results
  0.38s 34404 

Observations:

All coref links between words and subphrases match perfectly.

But where phrase atoms are involved, we get bad ones, sometimes more bad ones than good ones.
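
Keep in mind that the comparison above pairs the two result lists by position with `zip`.
When one list lacks an item that the other has, all later pairs are shifted against each other, and they may be counted as bad although the links themselves are fine.
The sketch below, on invented data, contrasts the positional count with a comparison of the pairs as multisets; the helper `compare_pairs` is hypothetical.

```python
from collections import Counter


def compare_pairs(pairs_a, pairs_b):
    """Compare two lists of (from text, to text) pairs as multisets."""
    count_a = Counter(pairs_a)
    count_b = Counter(pairs_b)
    shared = count_a & count_b
    only_a = sorted((count_a - count_b).elements())
    only_b = sorted((count_b - count_a).elements())
    return (sum(shared.values()), only_a, only_b)


# Invented data: two targets "x" and "y" in the old version
# have merged into one target "x y" in the new version.
pairs_b = [("w", "x"), ("w", "y"), ("w", "z"), ("v", "q")]
pairs_a = [("w", "x y"), ("w", "z"), ("v", "q")]

positional = sum(a == b for (a, b) in zip(pairs_a, pairs_b))
(shared, only_a, only_b) = compare_pairs(pairs_a, pairs_b)

print(f"positional matches: {positional}")
print(f"shared pairs: {shared}; only in A: {only_a}; only in B: {only_b}")
```

In this toy example the positional comparison finds no matches at all, while the multiset comparison shows that only the merged target differs.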

We inspect a few bad cases.

##### between word and phrase atom:

```
zzarʕˈô  => ʔˈîš ʔîš  != ʔˈîš  64423 64422 => 944121 944096
```

In [25]:
fB = 64422
tB = 944096
pfB = B.api.L.u(fB, otype="phrase")[0]
ptB = B.api.L.u(tB, otype="phrase")[0]
highlightsB = {fB: "orange", tB: "cyan"}

In [26]:
fA = 64423
tA = 944121
pfA = A.api.L.u(fA, otype="phrase")[0]
ptA = A.api.L.u(tA, otype="phrase")[0]
highlightsA = {fA: "orange", tA: "cyan"}

In [27]:
# original coref link
B.pretty(pfB, highlights=highlightsB)
if pfB != ptB:
    B.pretty(ptB, highlights=highlightsB)

In [28]:
# mapped coref link
A.pretty(pfA, highlights=highlightsA)
if pfA != ptA:
    A.pretty(ptA, highlights=highlightsA)

Force majeure! The phrase atom in the original has changed. In the new version it is combined with its neighbour,
and the two constituting parts are now subphrases.

##### between phrase atoms:

```
ʔˌîš ʔˈîš  != ʔˌîš  => ʔˌîš ʔˈîš  != ʔˈîš  943311 943285 => 943311 943286
```

In [29]:
fB = 943285
tB = 943286
pfB = B.api.L.u(fB, otype="phrase")[0]
ptB = B.api.L.u(tB, otype="phrase")[0]
highlightsB = {fB: "orange", tB: "cyan"}

In [30]:
fA = 943311
tA = 943311
pfA = A.api.L.u(fA, otype="phrase")[0]
ptA = A.api.L.u(tA, otype="phrase")[0]
highlightsA = {fA: "orange", tA: "cyan"}

In [31]:
B.pretty(pfB, highlights=highlightsB)
if pfB != ptB:
    B.pretty(ptB, highlights=highlightsB)

In [32]:
A.pretty(pfA, highlights=highlightsA)
if pfA != ptA:
    A.pretty(ptA, highlights=highlightsA)

The same kind of force majeure. 
In this case the link was between the two original phrase atoms.
In the new version these have merged into one phrase atom, and now there is 
a *coref* self-link!
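
Self-links like this one are easy to detect after a migration.
Here is a minimal sketch on a plain dictionary from node to set of target nodes; the node numbers are invented and the helper `self_links` is hypothetical.

```python
def self_links(edges):
    """Return the nodes that have an edge to themselves."""
    return sorted(n for (n, targets) in edges.items() if n in targets)


# Invented data: node 7 links to itself and to node 8.
edges = {7: {7, 8}, 8: {7}, 9: set()}

print(self_links(edges))
```

Whether such self-links should be kept or removed is a decision for the author of the feature.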

##### between phrase atom and subphrase:

```
ʔˌîš ʔˈîš  != ʔˌîš  => ʔîš  943311 943285 => 1317262 1317261
```

In [33]:
fB = 943285
tB = 1317261
pfB = B.api.L.u(fB, otype="phrase")[0]
ptB = B.api.L.u(tB, otype="phrase")[0]
highlightsB = {fB: "orange", tB: "cyan"}

In [34]:
fA = 943311
tA = 1317262
pfA = A.api.L.u(fA, otype="phrase")[0]
ptA = A.api.L.u(tA, otype="phrase")[0]
highlightsA = {fA: "orange", tA: "cyan"}

In [35]:
# original coref link
B.pretty(pfB, highlights=highlightsB)
if pfB != ptB:
    B.pretty(ptB, highlights=highlightsB)

In [36]:
# mapped coref link
A.pretty(pfA, highlights=highlightsA)
if pfA != ptA:
    A.pretty(ptA, highlights=highlightsA)

The same kind of force majeure. 

Clearly, there is a massive reorganization of phrase atoms in version `2021` as compared to version `c`.

## Conclusion

It is great to be able to upgrade features from a version against which they have been created to a newer
version.
But the corpus may have changed in unforeseen ways, and not every node in the old corpus can necessarily
be matched with a unique node in the new corpus.
If there are annotations on such nodes, then they either do not carry over to the new version, or they may carry
over to unintended extra nodes in the new version.

We saw a lot of "bad" cases. Yet all these discrepancies are really not that bad.
The mapping has always picked the closest node in the new version that corresponds with the original node in the old version.

There are ways to detect such discrepancies, and the node mapping already has relevant information about the quality of the mapping.
In fact, the `migrateFeatures` method of Text-Fabric already uses the quality information
when it assigns feature values to nodes.

But nothing beats generating the features against the new version by the same code that generated them against
the old version.
If there are issues due to important version differences, the author of the generated feature knows best
how to handle that.

# All steps

* **[start](start.ipynb)** your first step in mastering the bible computationally
* **[display](display.ipynb)** become an expert in creating pretty displays of your text structures
* **[search](search.ipynb)** turbo charge your hand-coding with search templates
* **[exportExcel](exportExcel.ipynb)** make tailor-made spreadsheets out of your results
* **[share](share.ipynb)** draw in other people's data and let them use yours
* **[export](export.ipynb)** export your dataset as an Emdros database
* **[annotate](annotate.ipynb)** annotate plain text by means of other tools and import the annotations as TF features
* **map** map somebody else's annotations to a new version of the corpus
* **[volumes](volumes.ipynb)** work with selected books only
* **[trees](trees.ipynb)** work with the BHSA data as syntax trees

CC-BY Dirk Roorda