In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import re
import json
from tf.app import use
from watm import WATM

In [3]:
ORG = "annotation"
REPO = "mondriaan"

# WATM

We make an export of the Mondriaan letters to WATM, the suite of systems developed by Team Text for
serving text plus annotations over the web.

WATM stands for Web Annotation Model for Text; it currently consists of TextRepo, AnnoRepo, Broccoli, and TextAnnoViz.

The input data is given by a Text-Fabric dataset, which is in turn derived from TEI files managed by the
Huygens Institute.

# Generate plain text

We generate a plain text of the whole corpus.
This produces a text file and an annotation file:

* `text.json`: with the text segments in an array;
* `anno.json`: all generated annotations

## Format of the text file

A json file with the following structure:

```
{
  "_ordered_segments": [
    "word1 ",
    "word2 ",
    ...
  ]
}
```
* each item in `_ordered_segments` corresponds to one word,
* the item contains the text of the word plus the subsequent interword material;
* we skip all material inside the TEI header when `skipMeta` is set.
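
As a minimal sketch of how a consumer might read this format: joining all segments gives the full plain text, and joining a slice gives the text of a run of words. The sample segments below are made up for illustration; they are not read from the generated `text.json`.

```python
import json

# Hypothetical miniature text file; a real file has one segment per word.
sample = '{"_ordered_segments": ["Brief ", "aan ", "Aletta ", "de ", "Iongh. "]}'

segments = json.loads(sample)["_ordered_segments"]

# Each segment carries its own trailing interword material,
# so plain concatenation restores the text without adding separators.
fullText = "".join(segments)

# The text of a run of words is the concatenation of a slice of segments.
firstThree = "".join(segments[0:3])

print(repr(fullText))
print(repr(firstThree))
```

Because the interword material is stored inside the segments, no knowledge of whitespace or punctuation rules is needed to rebuild the text.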

## Format of the annotation file

A json file with the following structure:

```
{
 "a000nnn": [
  "kind",
  "namespace",
  "body",
  "bbb-eee"
 ],
 ...
}
```
* it is a big dict, keyed by annotation ids;
* each value holds the annotation data, divided into the following fields:

* `kind`: the kind of annotation
  * `element`: targets the text location where an element occurs, the body is the element name;
  * `pi`: targets the text location where a processing instruction occurs, the body is the pi's target;
  * `attribute`: targets an attribute (of an element or pi), the body has the shape *name*`=`*value*,
    the name and value of the attribute in question;
  * `node`: targets an individual word or element or pi, the body is the TF node of that word/element/pi;
  * `edge`: targets two node annotations, the body has the shape
    *name* or *name*`=`*value*,
    where *name* is the name of the edge and *value* is the label of the edge if the edge has a label;
  * `format`: targets an individual word, the body is a formatting property for that word,
    all words in notes get a `format` annotation with body `note`;
  * `anno`: targets an arbitrary annotation or text range, body has an arbitrary value;
    can be used for extra annotations, e.g. the url to an artwork derived from an `<rs>` element.
    
* `namespace`: the namespace of the annotation; an indicator where the information comes from. Possible values:
  * `tei`: attribute comes
    [literally](https://annotation.github.io/text-fabric/tf/convert/helpers.html#tf.convert.helpers.CM_LIT)
    from the TEI, or is
    [processed](https://annotation.github.io/text-fabric/tf/convert/helpers.html#tf.convert.helpers.CM_LITP)
    straightforwardly from it;
  * `tf`: attribute is
    [composed](https://annotation.github.io/text-fabric/tf/convert/helpers.html#tf.convert.helpers.CM_LITC)
    in a more intricate way from the TEI or even
    [added](https://annotation.github.io/text-fabric/tf/convert/helpers.html#tf.convert.helpers.CM_PROV)
    to it;
  * `nlp`: attribute is generated as a result of
    [NLP processing](https://annotation.github.io/text-fabric/tf/convert/helpers.html#tf.convert.helpers.CM_NLP);
  * `tt`: attribute is derived from other material in the TEI source for the benefit
    of the Team Text infrastructure. Defined in the `watm.yaml` file next to this program.
    Currently used for the annotations that are derived from the specs of the Mondriaan project.
      
* `body`: the body of an annotation (probably the *kind* and *body* fields together will make up the body
  of the resulting web annotation);
  
* `target`: a string, of the following kinds:

  * **single** this is a target pointing to a single thing, either:
  
    * `bbb-eee` a range of text segments in the `_ordered_segments`;
    * an annotation id
    
  * **double** this is a target pointing to two things:
    * `fff->ttt` where `fff` is a "from" target and `ttt` is a "to" target;
      both targets can vary independently between a range and an annotation id.
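
The target and edge-body shapes described above can be decoded with a few string splits. The sketch below is illustrative only: the function names `parseTarget` and `parseEdgeBody` are hypothetical, and it assumes that annotation ids never contain a `-`, which holds for ids of the shape `a000nnn`.

```python
def parseTarget(target):
    """Turn a target string into ranges (tuples of ints) and annotation ids (str).

    A single target yields one range or one id;
    a double target `fff->ttt` yields a pair of those.
    """

    def parsePart(part):
        # Assumption: only ranges contain "-", annotation ids do not.
        if "-" in part:
            (start, end) = part.split("-", 1)
            return (int(start), int(end))
        return part

    if "->" in target:
        (fro, to) = target.split("->", 1)
        return (parsePart(fro), parsePart(to))
    return parsePart(target)


def parseEdgeBody(body):
    """Split an edge body into (name, value); value is None for unlabeled edges."""
    parts = body.split("=", 1)
    return (body, None) if len(parts) == 1 else (parts[0], parts[1])


print(parseTarget("10-12"))
print(parseTarget("a000001->a000002"))
print(parseEdgeBody("sibling=2"))
```

Note that a single range and a double target both come back as 2-tuples; they can be told apart by the types of their members (ints for a range, strings or nested tuples for a double target).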

# Remarks

## Tokens

The base type is `t`, the *atomic* token. Atomic tokens are tokens as they come from the NLP, except when a token contains
an element boundary. In those cases the token is split into fragments at the element boundaries.

It is guaranteed that a text segment that corresponds to a `t` does not contain element boundaries.

The original, unsplit tokens are also present in the annotations; they have type `token`.
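
To make the splitting rule concrete, here is a small sketch of cutting one token at element boundaries. It is not the code used by the conversion: the function `splitAtBoundaries` and its representation of boundaries as character offsets inside the token are assumptions made for this example.

```python
def splitAtBoundaries(token, boundaries):
    """Split a token string at the given character offsets.

    `boundaries` holds offsets inside the token where an element starts or ends.
    Offsets at the very start or end of the token do not cause a split.
    """
    cuts = sorted({b for b in boundaries if 0 < b < len(token)})
    fragments = []
    start = 0
    for cut in cuts:
        fragments.append(token[start:cut])
        start = cut
    fragments.append(token[start:])
    return fragments


# Hypothetical source: <hi>Mond</hi>riaan, one NLP token with a boundary after 4 characters
print(splitAtBoundaries("Mondriaan", [4]))
```

Each fragment then plays the role of a `t`, while the unsplit string corresponds to the `token`; joining the fragments always gives back the original token.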

# Generate

In [4]:
A = use(f"{ORG}/{REPO}:clone", checkout="clone", hoist=globals())

**Locating corpus resources ...**

Name,# of nodes,# slots/node,% coverage
folder,1,13417.0,100
letter,14,958.36,100
body,14,500.36,52
text,14,500.36,52
chunk,100,134.04,100
div,30,232.57,52
standOff,14,314.0,33
teiHeader,14,132.57,14
page,51,125.96,48
listAnnotation,46,95.57,33


We want to create special annotations for urls to artwork on the RKD site.

We collect a mapping from TF nodes to such urls in order to pass that to the function that generates
annotations.

In [5]:
skipMeta = False
WA = WATM(A, skipMeta=skipMeta)
WA.makeText()
WA.makeAnno()
WA.writeAll()

  0.00s 12 results
Text file:   13417 segments to ~/github/annotation/mondriaan/watm/0.8.10/text.json
Anno file:   60165 annotations to ~/github/annotation/mondriaan/watm/0.8.10/anno.json


# Test

Let's do a cross-check on our number manipulation.

First we define a few compare functions.

In [6]:
def compare(nTF, nWA):
    print(f"TF: {nTF:>6}\nWA: {nWA:>6}")
    return nTF == nWA


def strEqual(wa=None, tf=None):
    different = False
    for (i, cTF) in enumerate(tf):
        if i >= len(wa):
            contextI = max((0, i - 10))
            print(f"WA {i}: {wa[contextI:i]} <END>")
            print(f"TF {i}: {tf[contextI:i]} <> {tf[i:i + 10]}")
            different = True
            break
        elif tf[i] != wa[i]:
            contextI = max((0, i - 10))
            print(f"WA {i}: {wa[contextI:i]} <{wa[i]}> {wa[i+1:i + 11]}")
            print(f"TF {i}: {tf[contextI:i]} <{tf[i]}> {tf[i+1:i + 11]}")
            different = True
            break
    if not different and len(wa) > len(tf):
        i = len(tf)
        contextI = max((0, i - 10))
        print(f"WA {i}: {wa[contextI:i]} <> {wa[i:i + 10]}")
        print(f"TF {i}: {tf[contextI:i]} <END>")
        different = True
    return not different

We read the text and annotation files.

In [7]:
with open(WA.textFile) as fh:
    text = json.load(fh)
    tokens = text["_ordered_segments"]

with open(WA.annoFile) as fh:
    annotationById = {}
    annotations = []
    annos = json.load(fh)
    
    for (aId, (kind, ns, body, target)) in annos.items():
        if "->" in target:
            parts = target.split("->", 1)
        else:
            parts = [target]
        newParts = []
        for part in parts:
            if "-" in part:
                (start, end) = part.split("-", 1)
                part = (int(start), int(end))
            newParts.append(part)
            
        target = newParts[0] if len(newParts) == 1 else tuple(newParts)
            
        annotationById[aId] = (kind, body, target)
        annotations.append((aId, kind, body, target))
        
    annotations = sorted(annotations)
    
if False:
    with open(f"{WA.annoFile}.txt", "w") as fh:
        for anno in annotations:
            fh.write(f"{anno=}\n")

## Does the number of tokens match?

We compare the number of tokens in the textFile with the non-meta slots in the TF dataset.

In [8]:
nTokensTF = sum(0 if skipMeta and F.is_meta.v(s) else 1 for s in range(1, F.otype.maxSlot + 1))
nTokensWA = len(tokens)
compare(nTokensTF, nTokensWA)

TF:  13417
WA:  13417


True

## Do both representations have the same text?

We extract the text from the text file and from TF.

In [9]:
textWA = "".join(tokens)
print(textWA[0:100])

Brief aan Aletta de Iongh. Amsterdam, dinsdag 16 februari, dinsdag 2 maart of dinsdag 9 maart 1909.



In [10]:
slotType = F.otype.slotType

textTF = "".join(
    f"{F.str.v(s)}{F.after.v(s) or ''}"
    for s in F.otype.s(slotType)
    if not ((skipMeta and F.is_meta.v(s)) or F.empty.v(s))
)
print(textTF[0:100])

Brief aan Aletta de Iongh. Amsterdam, dinsdag 16 februari, dinsdag 2 maart of dinsdag 9 maart 1909.



In [11]:
strEqual(wa=textWA, tf=textTF)

True

Good!

## Do the numbers of element nodes and processing instructions match between TF and WATM?

We compare the number of element annotations in the file with the non-meta elements in the TF dataset.

We compare the number of processing instructions in the file with the processing instructions in the TF dataset.

In [12]:
nElementsTF = 0
nPisTF = 0

for n in range(F.otype.maxSlot + 1, F.otype.maxNode + 1):
    nType = F.otype.v(n)
    isPi = nType.startswith("?")
    
    if isPi:
        nPisTF += 1
        
    slots = E.oslots.s(n)
    b = slots[0]
    e = slots[-1]
    
    if skipMeta and (F.is_meta.v(b) or F.is_meta.v(e)):
        continue
    else:
        if not isPi:
            nElementsTF +=1
    
nElementsWA = sum(1 if kind == "element" else 0 for (aId, kind, body, target) in annotations)
nPisWA = sum(1 if kind == "pi" else 0 for (aId, kind, body, target) in annotations)

print("Elements:")
compare(nElementsTF, nElementsWA)
print("Processing instructions:")
compare(nPisTF, nPisWA)

Elements:
TF:  15927
WA:  15927
Processing instructions:
TF:     14
WA:     14


True

## Can we map the element annotations back to TF?

We make a mapping from annotation ids of element/pi annotations to TF nodes.

Then for each element/pi annotation, we check whether the body of the annotation equals the
type of the corresponding TF node (for processing instructions: the type without its leading `?`).

In [13]:
tfFromAid = {}

for (aId, kind, body, target) in annotations:
    if kind != "node":
        continue
    tfFromAid[target] = body
        
print(f"Annotations mapped: {len(tfFromAid)}")

Annotations mapped: 29358


In [14]:
element = 0
pi = 0
other = 0
good = 0
wrong = 0
unmapped = 0

for (aId, kind, body, target) in annotations:
    isElem = kind == "element"
    isPi = kind == "pi"
    
    if not isElem and not isPi:
        other += 1
        continue
        
    if isElem:
        element += 1
    else:
        pi += 1
        
    tag = body
    node = tfFromAid.get(aId, None)
    if node is None:
        unmapped += 1
        continue
        
    otype = F.otype.v(node)
    if isPi and tag == otype[1:] or not isPi and tag == otype:
        good +=1
    else:
        wrong += 1
        
print(f"Element : {element:>5} x")
print(f"Pi      : {pi:>5} x")
print(f"Other   : {other:>5} x")
print(f"Good    : {good:>5} x")
print(f"Wrong   : {wrong:>5} x")
print(f"Unmapped: {unmapped:>5} x")

Element : 15927 x
Pi      :    14 x
Other   : 44224 x
Good    : 15941 x
Wrong   :     0 x
Unmapped:     0 x


No errors or irregularities in the correspondence.

## Are the attributes preserved?

Using the correspondence between element annotations and TF nodes, we test whether the
annotations encode the same attributes and values as the TF does with its features.

Note that in the generation of TF we may have added extra features, based on the elements and attributes.
We also check, in one go, that these features have been transferred faithfully to
corresponding annotations.

We make a list of entries for all attribute values encoded in WATM.
Each entry consists of:

* node ID
* attribute name
* attribute value

First the WATM side:

In [15]:
attWA = []

for (aId, kind, body, target) in annotations:
    if kind != "attribute":
        continue
    node = tfFromAid[target]
    (att, value) = body.split("=", 1)
    attWA.append((node, att, value))
    
attWA = sorted(attWA)

print(f"{len(attWA)} attribute values")

6098 attribute values


Let's check whether this data is consistent with TF.

Later we must also check that this data is complete, i.e. that it covers all feature data of TF.

### Consistency

In [16]:
good = 0
wrong = []

for (node, att, valWA) in attWA:
    valTF = str(Fs(att).v(node))
    if valWA == valTF:
        good += 1
    else:
        wrong.append((node, att, valWA, valTF))
        
print(f"Good:     {good:>5} x")
print(f"Wrong:    {len(wrong):>5} x")

Good:      6098 x
Wrong:        0 x


### Completeness

Are there features in TF not covered by WATM?

We ignore the `rend_` and `is_` features, except `is_meta`, which is only ignored when `skipMeta` is set.

We collect the TF features in a list like `attWA`.

In [17]:
attTF = []

for feat in Fall():
    if feat in {"otype", "str", "after"}:
        continue
        
    if skipMeta and feat == "is_meta":
        continue
        
    if (feat != "is_meta" and feat.startswith("is_")) or feat.startswith("rend_"):
        continue
    
    for (node, valTF) in Fs(feat).items():
        slots = E.oslots.s(node)
        b = slots[0]
        e = slots[-1]
        if skipMeta and (F.is_meta.v(b) or F.is_meta.v(e)):
            continue
        attTF.append((node, feat, str(valTF)))

attTF = sorted(attTF)

Now the big question is: are `attWA` and `attTF` the same data?

In [18]:
print(f"WA attributes: {len(attWA)}")
print(f"TF attributes: {len(attTF)}")

WA attributes: 6098
TF attributes: 6098


In [19]:
attWA == attTF

True

## Are the formatting attributes preserved?

This is about the `rend` attributes in the TEI and the `format` annotations in the anno file.

In [20]:
fmtWA = []

for (aId, kind, body, target) in annotations:
    if kind != "format":
        continue
    if body == "note":
        continue
    node = tfFromAid[target]
    fmtWA.append((node, body))
    
fmtWA = sorted(fmtWA)

print(f"{len(fmtWA)} format values")
{f[1] for f in fmtWA}

153 format values


{'above',
 'blockletter',
 'center',
 'italics',
 'overwritten',
 'right',
 'spaced',
 'super',
 'underline',
 'underline2',
 'upsidedown'}

Let's check whether this data is consistent with TF.

Later we must also check that this data is complete, i.e. that it covers all `rend_` feature data of TF.

### Consistency

In [21]:
good = 0
wrong = []

for (node, valWA) in fmtWA:
    feat = f"rend_{valWA}"
    # a missing value is None; wrapping it in str() would give the truthy string "None"
    valTF = valWA if Fs(feat).v(node) else None
    if valWA == valTF:
        good += 1
    else:
        wrong.append((node, feat, valWA, valTF))
        
print(f"Good:     {good:>5} x")
print(f"Wrong:    {len(wrong):>5} x")

Good:       153 x
Wrong:        0 x


### Completeness

Are there formats in TF not covered by WA?

We collect the TF *rend* features in a list like `fmtWA`.

In [22]:
fmtTF = []

for feat in Fall():
    if not feat.startswith("rend_"):
        continue
    
    value = feat.split("_", 2)[1]
    if value == "note":
        continue
        
    for (node, valTF) in Fs(feat).items():
        slots = E.oslots.s(node)
        b = slots[0]
        e = slots[-1]
        if skipMeta and (F.is_meta.v(b) or F.is_meta.v(e)):
            continue
        fmtTF.append((node, value))

fmtTF = sorted(fmtTF)

Now the big question is: are `fmtWA` and `fmtTF` the same data?

In [23]:
print(f"WA formats: {len(fmtWA)}")
print(f"TF formats: {len(fmtTF)}")

WA formats: 153
TF formats: 153


In [24]:
fmtWA == fmtTF

True

## Are the edges preserved?

In [25]:
for (aId, kind, body, target) in annotations:
    if kind != "node":
        continue

    if type(target) is not str:
        print(f"{aId=} {kind=} {body=} {target=}")
        break

aId='a031882' kind='node' body=1 target=(0, 1)


In [26]:
tfFromAidNodes = {}
tfFromAidEdges = {}

for (aId, kind, body, target) in annotations:
    if kind != "node":
        continue
    if type(target) is tuple:
        (start, end) = target
        if start + 1 != end:
            print(target)
            break
        target = end
    tfFromAidNodes[target] = body
        
for (aId, kind, body, target) in annotations:
    if kind != "edge":
        continue
        
    (fro, to) = target
    fromNode = tfFromAidNodes[fro]
    toNode = tfFromAidNodes[to]
    parts = body.split("=", 1)
    (name, val) = (body, None) if len(parts) == 1 else parts
    tfFromAidEdges.setdefault(name, {}).setdefault(fromNode, {})[toNode] = val
        
print(f"Found: {len(tfFromAidNodes)} nodes")

for (edge, edgeData) in sorted(tfFromAidEdges.items()):
    print(f"Found edge {edge} with {len(edgeData)} starting nodes")

Found: 29358 nodes
Found edge parent with 1641 starting nodes
Found edge sibling with 949 starting nodes


We check whether the edge data found in the annotations is equivalent to the edge data in TF.

In [27]:
allGood = True

for edge in set(Eall()) | set(tfFromAidEdges):
    if edge == "oslots":
        continue

    print(f"Checking edge {edge}")

    good = True

    if edge not in set(Eall()):
        print("\tmissing in TF data")
        good = False

    if edge not in tfFromAidEdges:
        print("\tmissing in annotation data")
        good = False

    if not good:
        continue

    dataTF = dict(Es(edge).items())
    dataAid = tfFromAidEdges[edge]

    fromNodesTF = set(dataTF)
    fromNodesAid = set(dataAid)

    nFromTF = len(fromNodesTF)
    nFromAid = len(fromNodesAid)

    if fromNodesTF == fromNodesAid:
        print(f"\tsame {nFromTF} fromNodes")
    else:
        print(
            f"\tfrom nodes differ: {len(fromNodesTF)} in TF, {len(fromNodesAid)} in Aid"
        )
        good = False

    diffs = []

    nToChecked = 0

    for (f, toNodeInfoTF) in dataTF.items():
        toNodeInfoAid = dataAid[f]
        if type(toNodeInfoTF) is dict:
            toNodeInfoTF = {k: str(v) for (k, v) in toNodeInfoTF.items()}
        else:
            toNodeInfoTF = {x: None for x in toNodeInfoTF}

        if toNodeInfoTF != toNodeInfoAid:
            diffs.append((f, toNodeInfoTF, toNodeInfoAid))

        nToChecked += len(toNodeInfoTF)

    if len(diffs):
        good = False
        print(f"\tdifferences in toNodes for {len(diffs)} fromNodes")

        for (f, toNodeInfoTF, toNodeInfoAid) in sorted(diffs)[0:10]:
            print(f"\t\tfromNode {f}")

            toNodesTF = set(toNodeInfoTF)
            toNodesAid = set(toNodeInfoAid)

            nToTF = len(toNodesTF)
            nToAid = len(toNodesAid)

            if toNodesTF == toNodesAid:
                print(f"\t\t\tsame {nToTF} toNodes")
            else:
                print(
                    f"\t\t\ttoNodes differ: {len(toNodesTF)} in TF, {len(toNodesAid)} in Aid"
                )
            for t in toNodesTF | toNodesAid:
                doCompare = True
                if t not in toNodesTF:
                    print(f"\t\t\t\ttoNode {t} not in TF")
                    doCompare = False
                else:
                    valTF = toNodeInfoTF[t]

                if t not in toNodesAid:
                    print(f"\t\t\t\ttoNode {t} not in Aid")
                    doCompare = False
                else:
                    valAid = toNodeInfoAid[t]

                if doCompare:
                    if valTF == valAid:
                        print(f"\t\t\t\ttoNode {t} values agree: {repr(valTF)}")
                    else:
                        print(
                            f"\t\t\t\ttoNode {t} values differ: TF: {repr(valTF)} Aid: {repr(valAid)}"
                        )

    print(f"\t{nToChecked} toNodes checked")
    print("\tOK" if good else "\tWRONG")
    if not good:
        allGood = False
        
print(f"{'All OK' if allGood else 'some WRONG'}")

Checking edge sibling
	same 949 fromNodes
	2580 toNodes checked
	OK
Checking edge parent
	same 1641 fromNodes
	1641 toNodes checked
	OK
All OK


# Conclusion and caveat

The WATM representation of the corpus is a faithful and complete representation of the TF source and
hence of the TEI source.

This should not be taken too literally: there are probably aspects in which the different
representations differ.

I am aware of the following:

* The TEI to TF conversion has lost the exact embedding of elements in the following case:
  Suppose element A contains the same words as element B. Then the TF data does not know whether A is
  a child of B or the other way round.
  
  This is repairable by adding parenthood edges between nodes when constructing the TF data.
  We should then also convert these TF edges to WATM annotations, for which we need
  structured targets: if `n` is the parent of `m`, we must make an annotation with
  body `"parent"` and target `[n, m]`.

* The TF to WATM conversion forgets the types of feature values: it does not make a distinction
  between the integer `1` and the string `"1"`.
  
  This is repairable by creating annotations with structured bodies like `{"att": value}` instead
  of strings like `att=value` as we do now.
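
The loss of type information, and the proposed repair, can be illustrated with a round trip. This is only a sketch: the four helper functions are hypothetical and not part of the `watm` module, and the structured variant serializes the body as a JSON string purely for the sake of the comparison.

```python
import json


def encodeFlat(att, value):
    """Current style: the body is the string att=value."""
    return f"{att}={value}"


def decodeFlat(body):
    (att, value) = body.split("=", 1)
    return (att, value)


def encodeStructured(att, value):
    """Proposed style: the body is a structure {att: value}, here serialized as JSON."""
    return json.dumps({att: value})


def decodeStructured(body):
    ((att, value),) = json.loads(body).items()
    return (att, value)


# Flat bodies: the integer 1 and the string "1" become indistinguishable
print(decodeFlat(encodeFlat("n", 1)), decodeFlat(encodeFlat("n", "1")))

# Structured bodies: the type survives the round trip
print(decodeStructured(encodeStructured("n", 1)), decodeStructured(encodeStructured("n", "1")))
```

With flat bodies both values decode to the string `"1"`; with structured bodies the integer and the string remain distinct.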