# WATM

We export the Mondriaan letters to WATM, the suite of systems developed by Team Text for
serving text plus annotations over the web.

WATM stands for Web Annotation Model for Text; it currently consists of TextRepo, AnnoRepo, Broccoli, and TextAnnoViz.

The input data is a Text-Fabric dataset, which is in turn derived from TEI files managed by the
Huygens Institute.

# Generate plain text

We generate the plain text of the whole corpus.
This produces a text file and an annotation file:

* `text.json`: the text segments in an array;
* `anno.json`: all generated annotations.

## Format of the text file

A json file with the following structure:

```
{
  "_ordered_segments": [
    "word1 ",
    "word2 ",
    ...
  ]
}
```
* each item in `_ordered_segments` corresponds to one word;
* the item contains the text of the word plus the subsequent interword material;
* we skip all material inside the TEI header.
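
As a minimal sketch of how such a segment list relates to the running text, assume each word comes as a pair of its text and the material that follows it. The pairs below are made-up sample data, and `make_segments` is a hypothetical helper, not part of the `watm` module.

```python
import json


def make_segments(words):
    """Turn (word, after) pairs into segments: the word plus its trailing material."""
    return [f"{word}{after or ''}" for (word, after) in words]


# Made-up sample data, not taken from the corpus
sample = [("Beste", " "), ("Zus", ",\n"), ("kom", " "), ("je", None)]

text = {"_ordered_segments": make_segments(sample)}
serialized = json.dumps(text, ensure_ascii=False)

# Concatenating the segments gives back the running text
restored = "".join(json.loads(serialized)["_ordered_segments"])
print(repr(restored))
```

Because every segment carries its own trailing material, plain concatenation is enough to restore the text; no separator has to be guessed.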

## Format of the annotation file

A json file with the following structure:

```
{
 "a000nnn": [
  "kind",
  "body",
  "bbb-eee"
 ],
 ...
}
```
* it is a big dict, keyed by annotation ids
* the values consist of the annotation data, divided in the following fields:

* `kind`: the kind of annotation
  * `element`: targets the text location where an element occurs, the body is the element name;
  * `attribute`: targets an element attribute, the body has the shape *name*`=`*value*,
    the name and value of the attribute in question;
  * `node`: targets an individual word or element, the body is the TF node of that word or element;
  * `format`: targets an individual word, the body is a formatting property for that word; all words in notes
    get a `format` annotation with body `note`;
  * `anno`: targets an arbitrary annotation or text range, body has an arbitrary value;
    can be used for extra annotations, e.g. the url to an artwork derived from an `<rs>` element.
    
* `body`: the body of an annotation (probably the *kind* and *body* fields together will make up the body
  of the resulting web annotation);
  
* `target`: a string.
  If the string has the form of a range `bbb-eee`, it points to that range of text segments
  in `_ordered_segments`.
  
  Otherwise, it is an annotation id.
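
The two kinds of target can be told apart and followed back to a range of segments, roughly as sketched below. The helpers `parse_target` and `resolve_range` are hypothetical, the annotations are made up, and the sketch assumes that annotation ids never contain a hyphen.

```python
def parse_target(target):
    """Return (start, end) as ints for a range target, else the annotation id."""
    # Assumption: annotation ids never contain "-"
    if "-" in target:
        (start, end) = target.split("-", 1)
        return (int(start), int(end))
    return target


def resolve_range(annos, a_id):
    """Follow annotation-id targets until a range of segments is reached."""
    seen = set()
    while True:
        if a_id in seen:
            raise ValueError(f"cyclic target chain at {a_id}")
        seen.add(a_id)
        (kind, body, target) = annos[a_id]
        parsed = parse_target(target)
        if isinstance(parsed, tuple):
            return parsed
        a_id = parsed


# Made-up annotations in the shape described above
annos = {
    "a0000001": ["element", "p", "10-25"],
    "a0000002": ["attribute", "rend=center", "a0000001"],
    "a0000003": ["anno", "https://example.org/artwork", "a0000002"],
}
print(resolve_range(annos, "a0000003"))  # (10, 25)
```

The sketch deliberately returns the two numbers as they are and does not slice `_ordered_segments`, since the description above does not say whether the end of a range is inclusive.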

# Generate

In [1]:
%load_ext autoreload
%autoreload 2

In [1]:
import json
from tf.app import use
from watm import WATM

In [2]:
A = use("mondriaan/letters:clone", checkout="clone", backend="git.diginfra.net", hoist=globals())

**Locating corpus resources ...**

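| Name | # of nodes | # slots/node | % coverage |
|---|---|---|---|
| folder | 1 | 11235.0 | 100 |
| letter | 14 | 802.5 | 100 |
| body | 14 | 695.21 | 87 |
| text | 14 | 695.21 | 87 |
| div | 93 | 180.35 | 149 |
| chunk | 86 | 130.63 | 100 |
| p | 95 | 62.63 | 53 |
| revisionDesc | 14 | 59.36 | 7 |
| postscript | 6 | 53.0 | 3 |
| note | 86 | 39.1 | 30 |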


We want to create special annotations for urls to artwork on the RKD site.

We collect a mapping from TF nodes to such urls in order to pass that to the function that generates
annotations.

In [3]:
RKD_URL_BASE = "https://rkd.nl/explore/images/"

artworksWithRef = A.search("""
rs type=artwork-m key~[1-9]
""")

artworkUrls = {r[0]: f"{RKD_URL_BASE}{F.key.v(r[0])}" for r in artworksWithRef}

  0.00s 8 results


In [4]:
WA = WATM(A)
WA.makeText()
WA.makeAnno(artworkUrls)
WA.writeAll()

Text file:    9785 segments to ~/git.diginfra.net/mondriaan/letters/watm/text.json
Anno file:   16936 annotations to ~/git.diginfra.net/mondriaan/letters/watm/anno.json


# Test

Let's cross-check the generated files against the TF dataset.

First we define a few compare functions.

In [5]:
def compare(nTF, nWA):
    print(f"TF: {nTF:>6}\nWA: {nWA:>6}")
    return nTF == nWA


def strEqual(wa=None, tf=None):
    different = False
    for (i, cTF) in enumerate(tf):
        if i >= len(wa):
            contextI = max((0, i - 10))
            print(f"WA {i}: {wa[contextI:i]} <END>")
            print(f"TF {i}: {tf[contextI:i]} <> {tf[i:i + 10]}")
            different = True
            break
        elif tf[i] != wa[i]:
            contextI = max((0, i - 10))
            print(f"WA {i}: {wa[contextI:i]} <{wa[i]}> {wa[i+1:i + 11]}")
            print(f"TF {i}: {tf[contextI:i]} <{tf[i]}> {tf[i+1:i + 11]}")
            different = True
            break
    if not different and len(wa) > len(tf):
        i = len(tf)
        contextI = max((0, i - 10))
        print(f"WA {i}: {wa[contextI:i]} <> {wa[i:i + 10]}")
        print(f"TF {i}: {tf[contextI:i]} <END>")
        different = True
    return not different

We read the text and annotation files.

In [6]:
with open(WA.textFile) as fh:
    text = json.load(fh)
    words = text["_ordered_segments"]

with open(WA.annoFile) as fh:
    annotationById = {}
    annotations = []
    annos = json.load(fh)
    for (aId, (kind, body, target)) in annos.items():
        if "-" in target:
            (start, end) = target.split("-", 1)
            target = (int(start), int(end))
        annotationById[aId] = (kind, body, target)
        annotations.append((aId, kind, body, target))
        
    annotations = sorted(annotations)

## Does the number of words match?

We compare the number of words in the text file with the number of non-meta slots in the TF dataset.

In [7]:
nWordsTF = sum(0 if F.is_meta.v(s) else 1 for s in range(1, F.otype.maxSlot + 1))
nWordsWA = len(words)
compare(nWordsTF, nWordsWA)

TF:   9785
WA:   9785


True

## Do both representations have the same text?

We extract the text from the text file and from TF.

In [8]:
textWA = "".join(words)
print(textWA[0:100])

Beste Zus,
kom je morgenavond (Woensdag) om kwart voor acht ingang kleine zaal ConcertgebouwConcert-


In [9]:
textTF = "".join(
    f"{F.str.v(s)}{F.after.v(s) or ''}"
    for s in F.otype.s("word")
    if not (F.is_meta.v(s) or F.empty.v(s))
)
print(textTF[0:100])

Beste Zus,
kom je morgenavond (Woensdag) om kwart voor acht ingang kleine zaal ConcertgebouwConcert-


In [10]:
strEqual(wa=textWA, tf=textTF)

True

Good!

## Does the number of element nodes match?

We compare the number of element annotations in the annotation file with the number of non-meta elements in the TF dataset.

In [11]:
nElementsTF = 0

for n in range(F.otype.maxSlot + 1, F.otype.maxNode + 1):
    slots = E.oslots.s(n)
    b = slots[0]
    e = slots[-1]
    if F.is_meta.v(b) or F.is_meta.v(e):
        continue
    nElementsTF +=1
    
nElementsWA = sum(1 if kind == "element" else 0 for (aId, kind, body, target) in annotations)
compare(nElementsTF, nElementsWA)

TF:   1114
WA:   1114


True

## Can we map the element annotations back to TF?

We make a mapping from annotation ids of element annotations to TF nodes.

Then for each element annotation, we check whether the body of the annotation contains the 
type of the corresponding TF node.

In [12]:
tfFromAid = {}

for (aId, kind, body, target) in annotations:
    if kind != "node":
        continue
    tfFromAid[target] = body
        
print(f"Annotations mapped: {len(tfFromAid)}")

Annotations mapped: 10899


In [13]:
element = 0
other = 0
good = 0
wrong = 0
unmapped = 0

for (aId, kind, body, target) in annotations:
    if kind != "element":
        other += 1
        continue
    element += 1
    tag = body
    node = tfFromAid.get(aId, None)
    if node is None:
        unmapped += 1
        continue
    otype = F.otype.v(node)
    if tag == otype:
        good +=1
    else:
        wrong += 1
        
print(f"Element:  {element:>5} x")
print(f"Other  :  {other:>5} x")
print(f"Good:     {good:>5} x")
print(f"Wrong:    {wrong:>5} x")
print(f"Unmapped: {unmapped:>5} x")

Element:   1114 x
Other  :  15822 x
Good:      1114 x
Wrong:        0 x
Unmapped:     0 x


No errors or irregularities in the correspondence.

## Are the attributes preserved?

Using the correspondence between element annotations and TF nodes, we test whether the
annotations encode the same attributes and values as the TF does with its features.

We make a list of entries for all attribute values encoded in WATM.
Each entry consists of:

* node ID
* attribute name
* attribute value

First the WATM side:

In [14]:
attWA = []

for (aId, kind, body, target) in annotations:
    if kind != "attribute":
        continue
    node = tfFromAid[target]
    (att, value) = body.split("=", 1)
    attWA.append((node, att, value))
    
attWA = sorted(attWA)

print(f"{len(attWA)} attribute values")

1382 attribute values


Let's check whether this data is consistent with TF.

Later we must also check that this data is complete, i.e. that it covers all feature data of TF.

### Consistency

In [15]:
good = 0
wrong = []

for (node, att, valWA) in attWA:
    valTF = str(Fs(att).v(node))
    if valWA == valTF:
        good += 1
    else:
        wrong.append((node, att, valWA, valTF))
        
print(f"Good:     {good:>5} x")
print(f"Wrong:    {len(wrong):>5} x")

Good:      1382 x
Wrong:        0 x


### Completeness

Are there features in TF not covered by WATM?

We ignore the `rend_` and `is_` features.

We collect the TF features in a list like `attWA`.

In [16]:
attTF = []

for feat in Fall():
    if feat in {"otype", "str", "after"}:
        continue
        
    if feat.startswith("is_") or feat.startswith("rend_"):
        continue
    
    for (node, valTF) in Fs(feat).items():
        slots = E.oslots.s(node)
        b = slots[0]
        e = slots[-1]
        if F.is_meta.v(b) or F.is_meta.v(e):
            continue
        attTF.append((node, feat, str(valTF)))

attTF = sorted(attTF)

Now the big question is: are `attWA` and `attTF` the same data?

In [17]:
print(f"WA attributes: {len(attWA)}")
print(f"TF attributes: {len(attTF)}")

WA attributes: 1382
TF attributes: 1382


In [18]:
attWA == attTF

True
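
The equality test only yields a yes or no. If it were to fail, set differences would show which entries are missing on either side. A minimal sketch with made-up entries follows; `diff_entries` is a hypothetical helper, not part of this notebook.

```python
def diff_entries(wa, tf):
    """Return the entries that occur only in wa and only in tf, both sorted."""
    (wa_set, tf_set) = (set(wa), set(tf))
    return (sorted(wa_set - tf_set), sorted(tf_set - wa_set))


# Made-up (node, attribute, value) entries
att_wa = [(101, "type", "artwork-m"), (102, "n", "1")]
att_tf = [(101, "type", "artwork-m"), (102, "n", "2"), (103, "key", "5")]

(only_wa, only_tf) = diff_entries(att_wa, att_tf)
print(only_wa)
print(only_tf)
```

Sets ignore duplicate entries, so comparing the list lengths as well remains useful.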

## Are the formatting attributes preserved?

This is about the `rend` attributes in the TEI and the `format` annotations in the anno file.

In [19]:
fmtWA = []

for (aId, kind, body, target) in annotations:
    if kind != "format":
        continue
    if body == "note":
        continue
    node = tfFromAid[target]
    fmtWA.append((node, body))
    
fmtWA = sorted(fmtWA)

print(f"{len(fmtWA)} format values")
{f[1] for f in fmtWA}

170 format values


{'above',
 'below',
 'blockletter',
 'center',
 'inline',
 'italics',
 'norend',
 'right',
 'spaced',
 'super',
 'underline',
 'underline2',
 'upsidedown'}

Let's check whether this data is consistent with TF.

Later we must also check that this data is complete, i.e. that it covers all feature data of TF.

### Consistency

In [20]:
good = 0
wrong = []

for (node, valWA) in fmtWA:
    feat = f"rend_{valWA}"
    # .v() returns None when the feature is absent; str(None) would be truthy
    valTF = valWA if Fs(feat).v(node) is not None else None
    if valWA == valTF:
        good += 1
    else:
        wrong.append((node, feat, valWA, valTF))
        
print(f"Good:     {good:>5} x")
print(f"Wrong:    {len(wrong):>5} x")

Good:       170 x
Wrong:        0 x


### Completeness

Are there formats in TF not covered by WA?

We collect the TF *rend* features in a list like `fmtWA`.

In [21]:
fmtTF = []

for feat in Fall():
    if not feat.startswith("rend_"):
        continue
    
    value = feat.split("_", 1)[1]
    if value == "note":
        continue
        
    for (node, valTF) in Fs(feat).items():
        slots = E.oslots.s(node)
        b = slots[0]
        e = slots[-1]
        if F.is_meta.v(b) or F.is_meta.v(e):
            continue
        fmtTF.append((node, value))

fmtTF = sorted(fmtTF)

Now the big question is: are `fmtWA` and `fmtTF` the same data?

In [22]:
print(f"WA formats: {len(fmtWA)}")
print(f"TF formats: {len(fmtTF)}")

WA formats: 170
TF formats: 170


In [23]:
fmtWA == fmtTF

True

# Conclusion and caveat

For the aspects tested above, the WATM representation of the corpus is a faithful and complete
representation of the TF source, and hence of the TEI source.

This should not be taken too literally: there are probably aspects in which the representations
differ.

I am aware of the following:

* The TEI to TF conversion has lost the exact embedding of elements in the following case:
  Suppose element A contains the same words as element B. Then the TF data does not know whether A is
  a child of B or the other way round.
  
  This is repairable by adding parenthood edges between nodes when constructing the TF data.
  We should then also convert these TF edges to WATM annotations, for which we need
  structured targets: if `n` is the parent of `m`, we must make an annotation with
  body `"parent"` and target `[n, m]`.

* The TF to WATM conversion forgets the types of feature values: it does not make a distinction
  between the integer `1` and the string `"1"`.
  
  This is repairable by creating annotations with structured bodies like `{"att": value}` instead
  of strings like `att=value` as we do now.
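
Both repairs can be sketched on made-up data. The snippet below assumes a hypothetical annotation shape with structured bodies and list-valued targets, and a hypothetical kind `edge`; none of these is produced by the current export.

```python
import json

# Current encoding: the type of the value is lost in the string
as_string_int = "n={}".format(1)
as_string_str = "n={}".format("1")

# Hypothetical structured encoding: JSON keeps integer and string apart
as_struct_int = json.loads(json.dumps({"n": 1}))
as_struct_str = json.loads(json.dumps({"n": "1"}))

print(as_string_int == as_string_str)  # True: indistinguishable
print(as_struct_int == as_struct_str)  # False: types preserved

# Hypothetical parent annotation: node n is the parent of node m
(n, m) = (12001, 12002)  # made-up node numbers
parent_anno = ["edge", "parent", [n, m]]
print(json.dumps(parent_anno))
```

With a list as target, the order of the two nodes carries the direction of the relation, so the embedding of elements with identical word content could be recorded explicitly.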