# Convert from TEI to TF to WATM

First we convert Mondriaan TEI to TF and then the TF to WATM.

This notebook is bare: no explanations, no illustrations, no checks.
For more documentation, try any of the following variants:

* `convertExpress`: as few commands, feedback and interaction as possible
* [convertSteps](convertSteps.ipynb): broken down into a few command-line commands, with more feedback
* [convertDetails](convertDetails.ipynb): run from Python with full control

# One shot

Here is the express way to convert the corpus: a single call that runs every task.

If something goes wrong, work through the step-by-step part of this notebook to see where it fails.

In [2]:
from make import run

In [7]:
%%time

run("all 0.9.0 --silent")

Using TF version: 0.8.15
Checking TEI ...
	folder proeftuin:
	folder backmatter:
Converting TEI to TF ...
	folder proeftuin:
	folder backmatter:
Loading TF ...
App updated
Add tokens and sentences ...


  0.21s Using NLP pipeline Spacy (it) ...
NLP with language model it True
This language supports tagging
This language supports morphologizing
This language supports lemmatizing
  4.95s NLP done
  0.00s Feature overview: 65 for nodes; 5 for edges; 1 configs; 9 computed
App updated with NLP output 
Producing WATM


	16 x of type sentence
	4 x of type ent
	6 x of type token
OK - whether all tests passed
CPU times: user 12.1 s, sys: 1.16 s, total: 13.3 s
Wall time: 14.1 s


# Only WATM

In [6]:
%%time

run("watm 0.9.0 --silent")

Using TF version: 0.8.14
Producing WATM


	16 x of type sentence
	4 x of type ent
	6 x of type token
OK - whether all tests passed
CPU times: user 755 ms, sys: 32 ms, total: 787 ms
Wall time: 787 ms


# Step by step

Below you can inspect all the steps of the conversion:

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from tf.app import use
from tf.convert.tei import TEI
from tf.convert.addnlp import NLPipeline
from tf.convert.watm import WATM
from tf.advanced.helpers import dm

In [3]:
ORG = "annotation"
REPO = "mondriaan"

# Step 1: Check

In [73]:
Tei = TEI(verbose=-1, tei=0, tf="0.9.0")
# Tei = TEI(verbose=-1, tei="2000-01-01", tf="0.9.0")



In [74]:
Tei.task(check=True, verbose=1, validate=True)

TEI to TF checking: ~/github/annotation/mondriaan/tei/2023-06-06 => ~/github/annotation/mondriaan/report/2023-06-06
Processing instructions are treated
XML validation will be performed
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
INFO: Needs af.xsd (exists)
INFO: Needs ns2.xsd (exists)
INFO: Needs ns1.xsd (exists)
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/annotation/mondriaan/schema/MD.xsd
	round   1:  49 changes
180 identical override(s)
  6 changing override(s)
	address pure ==> mixed
	postmark complex mixed (added)
	rewrite complex mixed (added)
	sepLine complex pure (added)
	transpose pure ==> mixed
	wbh complex pure (added)
INFO: Needs artwork1.xsd (exists)
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/annotation/mondriaan/schema/artwork.xsd
	round   1:  15 changes
	round   2:   2 changes
 24 identical ove

True

# Step 2: Convert

In [75]:
Tei.good = True
Tei.task(convert=True, verbose=0)

Processing instructions are treated
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/annotation/mondriaan/schema/MD.xsd
	round   1:  49 changes
180 identical override(s)
  6 changing override(s)
	address pure ==> mixed
	postmark complex mixed (added)
	rewrite complex mixed (added)
	sepLine complex pure (added)
	transpose pure ==> mixed
	wbh complex pure (added)
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/annotation/mondriaan/schema/artwork.xsd
	round   1:  15 changes
	round   2:   2 changes
 24 identical override(s)
  3 changing override(s)
	artwork complex pure (added)
	bibl mixed ==> pure
	catRef pure ==> mixed
  0.00s Importing data from walking through the source ...
   |     0.00s Preparing metadata... 
   |     0.00s OK
   |     0.00s Following director... 
Start fo

True

# Step 3: Configure a TF app

The TF app has configuration settings, a bit of custom code, and documentation.
Most of it will be generated now, but there are ways to keep custom additions intact.

In [76]:
Tei.task(app=True)

App updated


True

# Step 4: Use the new dataset

The final proof that the conversion has worked is to load the data.
On the first load, several checks and pre-computations are performed.
Subsequent loads are much quicker.

In [77]:
A = use(f"{ORG}/{REPO}:clone", checkout="clone", silent="verbose", hoist=globals())

**Locating corpus resources ...**

This is Text-Fabric 12.4.0
67 features found and 0 ignored
   |     0.00s T otype                from ~/github/annotation/mondriaan/tf/0.9.0
   |     0.05s T oslots               from ~/github/annotation/mondriaan/tf/0.9.0
  0.06s Dataset without structure sections in otext:no structure functions in the T-API
   |     0.03s T after                from ~/github/annotation/mondriaan/tf/0.9.0
   |     0.00s T chunk                from ~/github/annotation/mondriaan/tf/0.9.0
   |     0.00s T file                 from ~/github/annotation/mondriaan/tf/0.9.0
   |     0.04s T str                  from ~/github/annotation/mondriaan/tf/0.9.0
   |     0.00s T folder               from ~/github/annotation/mondriaan/tf/0.9.0
   |      |     0.00s C __levels__           from otype, oslots, otext
   |      |     0.14s C __order__            from otype, oslots, __levels__
   |      |     0.00s C __rank__             from otype, __order__
   |      |     0.11s C __levUp__            from otype, oslots, 

Name,# of nodes,# slots / node,% coverage
folder,2,8849.5,100
bibliolist,1,3409.0,19
listBibl,2,1679.0,19
file,16,1106.19,100
letter,14,984.79,78
body,16,694.5,63
text,16,694.5,63
artworklist,1,503.0,3
listObject,1,464.0,3
div,30,354.93,60


In [78]:
T.sectionTypes

['folder', 'file', 'chunk']

# Step 5: Convert to WATM

We add some annotations on the fly: links from nodes which refer to artworks to their
representation on the RKD website.

In [79]:
RKD_URL_BASE = "https://rkd.nl/explore/images/"


def getArtWorksUrl():
    query = """rs type=artwork-m ref~[1-9]"""
    artworksWithRef = A.search(query)
    return {r[0]: f"{RKD_URL_BASE}{F.ref.v(r[0])[13:]}" for r in artworksWithRef}

artworks = getArtWorksUrl()
artworks

  0.00s 14 results


{19721: 'https://rkd.nl/explore/images/277201',
 19722: 'https://rkd.nl/explore/images/68554',
 19728: 'https://rkd.nl/explore/images/277201',
 19729: 'https://rkd.nl/explore/images/68554',
 19785: 'https://rkd.nl/explore/images/62319',
 19786: 'https://rkd.nl/explore/images/68733',
 19792: 'https://rkd.nl/explore/images/277201',
 19818: 'https://rkd.nl/explore/images/194515',
 19819: 'https://rkd.nl/explore/images/268821',
 19820: 'https://rkd.nl/explore/images/268881',
 19824: 'https://rkd.nl/explore/images/62324',
 19871: 'https://rkd.nl/explore/images/68728',
 19884: 'https://rkd.nl/explore/images/268864',
 19891: 'https://rkd.nl/explore/images/268864'}
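
The slice `[13:]` in `getArtWorksUrl()` drops the first 13 characters of each `ref` value and keeps the rest as the RKD image id. Here is a minimal standalone sketch of that step. It assumes the dropped characters are a fixed prefix such as `artworks.xml#`; that prefix is a guess that merely has the right length, and `rkd_url` and the sample `ref` values are hypothetical. Checking the prefix explicitly, rather than slicing blindly, guards against `ref` values of another shape.

```python
RKD_URL_BASE = "https://rkd.nl/explore/images/"

# Assumption: the 13 characters skipped by `[13:]` form a fixed prefix like this one.
# The real prefix in the corpus may differ.
ASSUMED_PREFIX = "artworks.xml#"


def rkd_url(ref, prefix=ASSUMED_PREFIX):
    """Build an RKD image URL from a ref value; return None if the prefix is absent."""
    if not ref.startswith(prefix):
        return None
    return f"{RKD_URL_BASE}{ref[len(prefix):]}"


# Hypothetical ref values, for illustration only.
print(rkd_url("artworks.xml#277201"))
print(rkd_url("bio.xml#someone"))
```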

N.B. For documentation, follow the WATM link in the output cell.

In [80]:
WA = WATM(A, "tei", skipMeta=False, extra=dict(artworkref=artworks))
WA.makeText()

textRepoLevel is section level 'file'


[WATM exporter documentation](https://annotation.github.io/text-fabric/tf/convert/watm.html)

In [81]:
WA.makeAnno()

              folder [   0:     0] - [  13:  2090]
              folder [  14:     0] - [  15:  3407]
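
Each line of this output appears to give a section type followed by a start and an end position, each written as `[file: offset]`. That reading is an assumption based on the message that `textRepoLevel` is the section level `file`; it is not taken from the WATM documentation. Under that assumption, a small sketch that turns such lines into tuples (`parse_range` is a hypothetical helper, not part of Text-Fabric):

```python
import re

# Assumed layout of one output line:
#   <section type> [<file>:<offset>] - [<file>:<offset>]
RANGE_RE = re.compile(
    r"^\s*(\w+)\s*\[\s*(\d+):\s*(\d+)\]\s*-\s*\[\s*(\d+):\s*(\d+)\]\s*$"
)


def parse_range(line):
    """Return (type, (start_file, start_offset), (end_file, end_offset)) or None."""
    match = RANGE_RE.match(line)
    if match is None:
        return None
    kind = match.group(1)
    start_file, start_pos, end_file, end_pos = (int(g) for g in match.groups()[1:])
    return kind, (start_file, start_pos), (end_file, end_pos)


lines = [
    "              folder [   0:     0] - [  13:  2090]",
    "              folder [  14:     0] - [  15:  3407]",
]
for line in lines:
    print(parse_range(line))
```

Read this way, the first folder spans text files 0 to 13 and the second spans files 14 and 15.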


In [82]:
WA.writeAll()

Text file    0:      436 segments to ~/github/annotation/mondriaan/watm/0.9.0/text-0.json
Text file    1:      745 segments to ~/github/annotation/mondriaan/watm/0.9.0/text-1.json
Text file    2:      990 segments to ~/github/annotation/mondriaan/watm/0.9.0/text-2.json
Text file    3:      544 segments to ~/github/annotation/mondriaan/watm/0.9.0/text-3.json
Text file    4:      613 segments to ~/github/annotation/mondriaan/watm/0.9.0/text-4.json
Text file    5:      478 segments to ~/github/annotation/mondriaan/watm/0.9.0/text-5.json
Text file    6:      521 segments to ~/github/annotation/mondriaan/watm/0.9.0/text-6.json
Text file    7:      432 segments to ~/github/annotation/mondriaan/watm/0.9.0/text-7.json
Text file    8:     1099 segments to ~/github/annotation/mondriaan/watm/0.9.0/text-8.json
Text file    9:     3374 segments to ~/github/annotation/mondriaan/watm/0.9.0/text-9.json
Text file   10:      560 segments to ~/github/annotation/mondriaan/watm/0.9.0/text-10.json
Text file
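
To inspect one of the written text files, it can be read back with the standard `json` module. The sketch below assumes that each file holds a JSON object with the segments in a list under the key `_ordered_segments`; check that key against the WATM documentation before relying on it. So that the example runs anywhere, it writes a tiny made-up stand-in file to a temporary directory instead of reading the real output.

```python
import json
import tempfile
from pathlib import Path

# Assumption: the segments of a text file sit under this key.
SEGMENTS_KEY = "_ordered_segments"


def count_segments(path):
    """Load a WATM-style text file and return its number of segments."""
    with open(path, encoding="utf8") as fh:
        data = json.load(fh)
    return len(data[SEGMENTS_KEY])


with tempfile.TemporaryDirectory() as tmp:
    # Made-up stand-in for a real text-N.json file.
    sample = Path(tmp) / "text-0.json"
    sample.write_text(
        json.dumps({SEGMENTS_KEY: ["Brief ", "aan ", "Aletta "]}),
        encoding="utf8",
    )
    n_segments = count_segments(sample)

print(n_segments)
```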

# Step 6: Test the WATM against the TF

In [83]:
WA.testAll()

Testing the text ...
	TF:  17699
	WA:  17699
OK - whether the amounts of tokens agree
	TF: Brief aan Aletta de  ... -1940. Baarn 1980.  
	WA: Brief aan Aletta de  ... -1940. Baarn 1980.  
OK - whether the text is the same
Testing the elements ...
	TF:   2621
	WA:   2621
OK - whether the amounts of elements and nodes agree
Testing the processing instructions ...
	TF:      0
	WA:      0
OK - whether the amounts of processing instructions agree
Testing the element/pi annotations ...
	20320 element/pi annotations
	Element      :   2621 x
	Pi           :      0 x
	Other        :  21975 x
	Good name    :   2621 x
	Wrong name   :      0 x
	Good target  :   2621 x
	Wrong target :      0 x
	Unmapped     :      0 x
OK - whether all element/pi annotations have good bodies
OK - whether all element/pi annotations have good targets
Testing the attributes ...
	5678 attribute values
	Good:      5678 x
	Wrong:        0 x
OK - whether annotations are consistent with features
	WA attributes: 5678
	TF att
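
The report above pairs a TF-side value with a WATM-side value and states whether they agree. The following is a minimal sketch of that kind of comparison, using made-up token lists. It is not the code behind `WA.testAll()`; the helper `check` and the `XX` failure marker are hypothetical.

```python
def check(label, tf_value, wa_value):
    """Report whether a TF-side value equals its WATM-side counterpart."""
    ok = tf_value == wa_value
    print(f"{'OK' if ok else 'XX'} - whether {label}")
    return ok


# Made-up stand-ins for the tokens on both sides of the conversion.
tf_tokens = ["Brief ", "aan ", "Aletta "]
wa_tokens = ["Brief ", "aan ", "Aletta "]

results = [
    check("the amounts of tokens agree", len(tf_tokens), len(wa_tokens)),
    check("the text is the same", "".join(tf_tokens), "".join(wa_tokens)),
]
all_good = all(results)
print(f"{'OK' if all_good else 'XX'} - whether all tests passed")
```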