In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import re
import json
from tf.app import use
from tf.convert.watm import WATM

In [3]:
ORG = "annotation"
REPO = "mondriaan"

# WATM

We export the Mondriaan letters to WATM, the input format for the suite of systems developed by Team Text for
serving text plus annotations over the web.

WATM stands for Web Annotation Model for Text; the suite currently consists of TextRepo, AnnoRepo, Broccoli, and TextAnnoViz.

The input data is given by a Text-Fabric dataset, which is in turn derived from TEI files managed by the
Huygens institute.

# Generate plain text

We generate a plain-text version of the whole corpus.
This produces a text file and an annotation file:

* `text.json`: the text segments in an array;
* `anno.json`: all generated annotations (the run below writes them to numbered files such as `anno-1.json`).

## Format of the text file

A JSON file with the following structure:

```
{
  "_ordered_segments": [
    "word1 ",
    "word2 ",
    ...
  ]
}
```

* each item in `_ordered_segments` corresponds to one word;
* the item contains the text of the word plus the subsequent interword material;
* we skip all material inside the TEI header.
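
Because each segment carries its own trailing interword material, the running text can be recovered by plain concatenation. The following is a minimal sketch under that assumption; the sample data is a hypothetical miniature, not the content of the real `text.json`.

```python
import json

# Hypothetical miniature of text.json; the real file is written by WA.writeAll().
sample = '{"_ordered_segments": ["Brief ", "aan ", "Aletta ", "de "]}'

data = json.loads(sample)
segments = data["_ordered_segments"]

# Each segment already ends with its interword material,
# so no separator is needed when joining.
full_text = "".join(segments)

print(len(segments), "segments")
print(repr(full_text))
```

To try this on real output, replace `json.loads(sample)` with `json.load(fh)` on an opened `text.json`.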

## Format of the annotation file

A JSON file with the following structure:

```
{
 "a000nnn": [
  "kind",
  "namespace",
  "body",
  "bbb-eee"
 ],
 ...
}
```

* it is a big dict, keyed by annotation ids;
* each value holds the annotation data, divided into the following fields:

* `kind`: the kind of annotation
  * `element`: targets the text location where an element occurs, the body is the element name;
  * `pi`: targets the text location where a processing instruction occurs, the body is the `pi`'s target;
  * `attribute`: targets an attribute (of an element or `pi`), the body has the shape *name*`=`*value*,
    the name and value of the attribute in question;
  * `node`: targets an individual word or element or `pi`, the body is the TF node of that word/element/pi;
  * `edge`: targets two node annotations, the body has the shape
    *name* or *name*`=`*value*,
    where *name* is the name of the edge and *value* is the label of the edge if the edge has a label;
  * `format`: targets an individual word, the body is a formatting property for that word,
    all words in notes get a `format` annotation with body `note`;
  * `anno`: targets an arbitrary annotation or text range, body has an arbitrary value;
    can be used for extra annotations, e.g. the URL to an artwork derived from an `<rs>` element.
    
* `namespace`: the namespace of the annotation; an indicator where the information comes from. Possible values:
  * `tei`: attribute comes
    [literally](https://annotation.github.io/text-fabric/tf/convert/helpers.html#tf.convert.helpers.CM_LIT)
    from the TEI, or is
    [processed](https://annotation.github.io/text-fabric/tf/convert/helpers.html#tf.convert.helpers.CM_LITP)
    straightforwardly from it;
  * `tf`: attribute is
    [composed](https://annotation.github.io/text-fabric/tf/convert/helpers.html#tf.convert.helpers.CM_LITC)
    in a more intricate way from the TEI or even
    [added](https://annotation.github.io/text-fabric/tf/convert/helpers.html#tf.convert.helpers.CM_PROV)
    to it;
  * `nlp`: attribute is generated as a result of
    [NLP processing](https://annotation.github.io/text-fabric/tf/convert/helpers.html#tf.convert.helpers.CM_NLP);
  * `tt`: attribute is derived from other material in the TEI source for the benefit
    of the Team Text infrastructure. Defined in the `watm.yaml` file next to this program.
    Currently used for the annotations that are derived from the specs of the Mondriaan project.
      
* `body`: the body of an annotation (probably the *kind* and *body* fields together will make up the body
  of the resulting web annotation);
  
* `target`: a string, of the following kinds:

  * **single**: a target pointing to a single thing, either:

    * `bbb-eee`: a range of text segments in `_ordered_segments`;
    * an annotation id.

  * **double**: a target pointing to two things:

    * `fff->ttt`, where `fff` is a "from" target and `ttt` is a "to" target;
      each of the two can independently be a range or an annotation id.
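
The target grammar above can be sketched as a small parser. This is an illustration only, not code from `tf.convert.watm`: the class and function names are hypothetical, and it assumes that annotation ids never have the shape *digits*`-`*digits*. It deliberately does not decide whether `eee` is inclusive or exclusive, since the text above does not say.

```python
import re
from typing import NamedTuple, Union

# A range target: two non-negative integers joined by a hyphen.
RANGE_RE = re.compile(r"^(\d+)-(\d+)$")


class SegmentRange(NamedTuple):
    begin: int
    end: int


class AnnoRef(NamedTuple):
    anno_id: str


class Edge(NamedTuple):
    source: Union[SegmentRange, AnnoRef]
    dest: Union[SegmentRange, AnnoRef]


def parse_single(target):
    """Parse a single target: a segment range or an annotation id."""
    match = RANGE_RE.match(target)
    if match:
        return SegmentRange(int(match.group(1)), int(match.group(2)))
    # Assumption: anything that is not a range is an annotation id.
    return AnnoRef(target)


def parse_target(target):
    """Parse a single or double (`fff->ttt`) target string."""
    if "->" in target:
        source, dest = target.split("->", 1)
        return Edge(parse_single(source), parse_single(dest))
    return parse_single(target)


print(parse_target("12-15"))
print(parse_target("a0000042"))
print(parse_target("a0000001->3-4"))
```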

# Remarks

## Tokens

The base type is `t`, the *atomic* token. Atomic tokens are tokens as they come from the NLP, except when a token contains
an element boundary. In that case the token is split into fragments at the element boundaries.

It is guaranteed that a text segment that corresponds to a `t` does not contain element boundaries.

The original, unsplit tokens are also present in the annotations; they have type `token`.
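
To make the splitting concrete, here is a toy sketch of cutting one token into atomic fragments at given character offsets. It is not the code Text-Fabric uses; `split_at_boundaries` and the example offsets are hypothetical.

```python
def split_at_boundaries(token, boundaries):
    """Split `token` at the given character offsets (hypothetical helper).

    Offsets at the very start or end of the token produce no extra fragment.
    """
    cuts = sorted({b for b in boundaries if 0 < b < len(token)})
    starts = [0] + cuts
    ends = cuts + [len(token)]
    return [token[s:e] for s, e in zip(starts, ends)]


# Imagine TEI like Mon<hi>dri</hi>aan: element boundaries fall at offsets 3 and 6.
fragments = split_at_boundaries("Mondriaan", [3, 6])
print(fragments)

# The fragments together still spell the original token.
print("".join(fragments) == "Mondriaan")
```

In this picture each fragment would be a `t`, while the unsplit word would be the `token`.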

# Generate

In [4]:
A = use(f"{ORG}/{REPO}:clone", checkout="clone", hoist=globals())

**Locating corpus resources ...**

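| Name | # of nodes | # slots / node | % coverage |
|---|---|---|---|
| folder | 2 | 8873.5 | 100 |
| bibliolist | 1 | 3144.0 | 18 |
| listBibl | 2 | 1546.5 | 17 |
| file | 16 | 1109.19 | 100 |
| letter | 14 | 1004.07 | 79 |
| body | 16 | 666.94 | 60 |
| text | 16 | 666.94 | 60 |
| artworklist | 1 | 546.0 | 3 |
| listObject | 1 | 501.0 | 3 |
| standOff | 14 | 358.71 | 28 |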


We want to create special annotations for URLs to artworks on the RKD site.

We collect a mapping from TF nodes to such URLs and pass it to the function that generates
the annotations.

In [5]:
RKD_URL_BASE = "https://rkd.nl/explore/images/"


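def getArtWorksUrl():
    """Map each artwork `rs` node with a usable key to its RKD URL."""
    # Select `rs` elements of type artwork-m whose key matches the regex [1-9].
    query = """rs type=artwork-m key~[1-9]"""
    artworksWithRef = A.search(query)
    # r[0] is the `rs` node; F is available through hoist=globals() above.
    return {r[0]: f"{RKD_URL_BASE}{F.key.v(r[0])}" for r in artworksWithRef}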

In [50]:
skipMeta = False
WA = WATM(A, "tei", skipMeta=skipMeta, extra=getArtWorksUrl())
WA.makeText()
WA.makeAnno()
WA.writeAll()

  0.00s 0 results
Text file:   17747 segments to ~/github/annotation/mondriaan/watm/0.8.13/text.json
 85372 annotations written to ~/github/annotation/mondriaan/watm/0.8.13/anno-1.json
Anno files:   85372 annotations to 1 files


In [54]:
WA.testAll()

Testing the text ...
	TF:  17747
	WA:  17747
OK - Same number of tokens
	TF: Brief aan Aletta de  ... 0-1940. Baarn 1980. 
	WA: Brief aan Aletta de  ... 0-1940. Baarn 1980. 
OK - Same text
Testing the elements ...
	TF:  21684
	WA:  21684
OK - Same number of elements as nodes
	TF:      0
	WA:      0
OK - Same number of processing instructions
	39431 element annotations
	Element : 21684 x
	Pi      :     0 x
	Other   : 63688 x
	Good    : 21684 x
	Wrong   :     0 x
	Unmapped:     0 x
OK - All element annotations OK
Testing the attributes ...
	7890 attribute values
	Good:      7890 x
	Wrong:        0 x
OK - annotations consistent with features
	WA attributes: 7890
	TF attributes: 7890
OK - annotations complete w.r.t. features
Testing the format attributes ...
	252 format values
	formatting attributes: 
		   106 x italics
		    59 x underline
		    36 x indent
		    18 x upsidedown
		     8 x super
		     6 x blockletter
		     4 x right
		     4 x spaced
		     3 x center
		     2 x above
	

True
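
The tallies in this output can be imitated on a small scale. The sketch below counts annotations per kind and unpacks attribute bodies of the shape *name*`=`*value*. It assumes the four-field record layout described earlier; all ids, namespaces, bodies, and targets in the sample are invented for illustration.

```python
from collections import Counter

# Invented sample in the shape of the annotation file:
# annotation id -> [kind, namespace, body, target]
annos = {
    "a0000001": ["element", "tei", "p", "0-3"],
    "a0000002": ["element", "tei", "hi", "1-2"],
    "a0000003": ["attribute", "tei", "rend=italics", "a0000002"],
    "a0000004": ["format", "tf", "italics", "1-2"],
}

# Count how many annotations there are of each kind.
kind_counts = Counter(kind for kind, _ns, _body, _target in annos.values())

# Collect attribute values, keyed by (target, attribute name).
attributes = {}
for anno_id, (kind, _ns, body, target) in annos.items():
    if kind == "attribute":
        # Split on the first "=" only, so values may themselves contain "=".
        name, value = body.split("=", 1)
        attributes[(target, name)] = value

print(dict(kind_counts))
print(attributes)
```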

# Conclusion and caveat

The tests above indicate that the WATM representation of the corpus is a faithful and complete
representation of the TF source, and hence of the TEI source.

Do not take this too literally: there are probably aspects in which the different representations
differ.

I am aware of the following:

* The TEI to TF conversion has lost the exact embedding of elements in the following case:
  Suppose element A contains the same words as element B. Then the TF data does not know whether A is
  a child of B or the other way round.
  
  This is repairable by adding parenthood edges between nodes when constructing the TF data.
  We should then also convert these TF edges to WATM annotations, for which we need
  structured targets: if `n` is the parent of `m`, we must make an annotation with
  body `"parent"` and target `[n, m]`.

* The TF to WATM conversion forgets the types of feature values: it does not make a distinction
  between the integer `1` and the string `"1"`.
  
  This is repairable by creating annotations with structured bodies like `{"att": value}` instead
  of strings like `att=value` as we do now.
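
The loss of type information in the second point is easy to demonstrate. The sketch below contrasts the string body with a structured body serialized as JSON; the helper names are hypothetical and this is not how `tf.convert.watm` is implemented.

```python
import json


def string_body(name, value):
    """Body as a flat string, as in the current export (att=value)."""
    return f"{name}={value}"


def structured_body(name, value):
    """Body as a small dict, which JSON can serialize with its type intact."""
    return {name: value}


# With string bodies, the integer 1 and the string "1" become indistinguishable.
same_as_string = string_body("n", 1) == string_body("n", "1")

# With structured bodies, a JSON round trip keeps them apart.
int_back = json.loads(json.dumps(structured_body("n", 1)))["n"]
str_back = json.loads(json.dumps(structured_body("n", "1")))["n"]

print(same_as_string)
print(type(int_back).__name__, type(str_back).__name__)
```

JSON only distinguishes a handful of types (numbers, strings, booleans, null, arrays, objects), so this repairs the integer versus string case but would not by itself preserve richer Python types.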