# Convert from TEI to TF to WATM

First we convert Mondriaan TEI to TF and then the TF to WATM.

This notebook is bare: no explanations, no illustrations, no checks.
For more documentation, try one of the following variants:

* `convertExpress`: as few commands, feedback, and interaction as possible
* [convertSteps](convertSteps.ipynb): broken down into a few command-line commands, with more feedback
* [convertDetails](convertDetails.ipynb): run from Python with full control

In [1]:
%load_ext autoreload
%autoreload 2

In [18]:
from tf.app import use
from tf.convert.tei import TEI
from tf.convert.addnlp import NLPipeline
from tf.convert.watm import WATM
from tf.advanced.helpers import dm

In [3]:
ORG = "annotation"
REPO = "mondriaan"
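
The output in the steps below mentions several directories under `~/github/annotation/mondriaan`. As a rough sketch, assuming the `~/github/ORG/REPO` layout that is visible in that output, these locations can be derived from `ORG` and `REPO`. The helper `repoDir` is hypothetical and not part of Text-Fabric.

```python
from pathlib import PurePosixPath

ORG = "annotation"
REPO = "mondriaan"

# Assumption: local clones live under ~/github/ORG/REPO, as in the log output.
GITHUB_BASE = PurePosixPath("~/github")


def repoDir(*parts):
    """Hypothetical helper: a path inside the local clone of ORG/REPO."""
    return GITHUB_BASE.joinpath(ORG, REPO, *parts)


# Directory names and versions are copied from the log output in this notebook.
teiDir = repoDir("tei", "2023-06-06")
tfDir = repoDir("tf", "0.8.13")
watmDir = repoDir("watm", "0.8.13")

print(teiDir)
print(tfDir)
print(watmDir)
```

The `~` is kept literally here, as in the logs; it is not expanded to a home directory.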

# Step 1: Check

In [4]:
Tei = TEI(verbose=-1, tei=0, tf="0.8.13pre")



In [5]:
Tei.task(check=True, verbose=1, validate=True)

TEI to TF checking: ~/github/annotation/mondriaan/tei/2023-06-06 => ~/github/annotation/mondriaan/report/2023-06-06
Processing instructions are treated
XML validation will be performed
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
INFO: Needs af.xsd (exists)
INFO: Needs ns2.xsd (exists)
INFO: Needs ns1.xsd (exists)
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/annotation/mondriaan/schema/MD.xsd
	round   1:  49 changes
180 identical override(s)
  6 changing override(s)
	address pure ==> mixed
	postmark complex mixed (added)
	rewrite complex mixed (added)
	sepLine complex pure (added)
	transpose pure ==> mixed
	wbh complex pure (added)
INFO: Needs artwork1.xsd (exists)
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/annotation/mondriaan/schema/artwork.xsd
	round   1:  15 changes
	round   2:   2 changes
 24 identical ove

True

# Step 2: Convert

In [6]:
Tei.good = True
Tei.task(convert=True, verbose=0)

Processing instructions are treated
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/annotation/mondriaan/schema/MD.xsd
	round   1:  49 changes
180 identical override(s)
  6 changing override(s)
	address pure ==> mixed
	postmark complex mixed (added)
	rewrite complex mixed (added)
	sepLine complex pure (added)
	transpose pure ==> mixed
	wbh complex pure (added)
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/annotation/mondriaan/schema/artwork.xsd
	round   1:  15 changes
	round   2:   2 changes
 24 identical override(s)
  3 changing override(s)
	artwork complex pure (added)
	bibl mixed ==> pure
	catRef pure ==> mixed
  0.00s Importing data from walking through the source ...
   |     0.00s Preparing metadata... 
   |     0.00s OK
   |     0.00s Following director... 
Start fo

True

# Step 3: Load the TF data

The final proof that the conversion worked is that the data loads.
On the first load, several checks and pre-computations are performed.
Subsequent loads are much quicker.

In [7]:
Tei.task(load=True)

True

# Step 4: Configure a TF app

The TF app has configuration settings, a bit of custom code, and documentation.
Most of it will be generated now, but there are ways to keep custom additions intact.

In [8]:
Tei.task(app=True)

App updated


True

# Step 5: Add tokens and sentences

In [9]:
Apre = use(f"{ORG}/{REPO}:clone", checkout="clone", hoist=globals())

**Locating corpus resources ...**

Name,# of nodes,# slots / node,% coverage
folder,2,40736.5,100
bibliolist,1,13650.0,17
listBibl,2,6722.0,17
file,16,5092.06,100
letter,14,4684.14,80
body,16,2878.19,57
text,16,2878.19,57
artworklist,1,2245.0,3
listObject,1,2035.0,2
standOff,14,1798.36,31


In [10]:
NLP = NLPipeline(lang="en", verbose=0, write=True)
NLP.loadApp(Apre)

Input data has version 0.8.13pre
Compute element boundaries
  1644 start postions
  1907 end postions


In [11]:
(text, positions) = NLP.task(plaintext=True)

Input data has version 0.8.13pre
Compute element boundaries
  1644 start postions
  1907 end postions
  0.00s Generating a plain text with positions ...
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
   |   Found 296 empty slots
   |   recorded flow main       with 168863 items
   |   recorded flow del        with    153 items
   |   recorded flow note       with  77857 items
   |   recorded flow orig       with     95 items
  0.22s Done. Generated text and positions written to ~/github/annotation/mondriaan/_temp/txt/plain.txt


In [12]:
(tokens, sentences) = NLP.task(lingo=True, text=text)

Input data has version 0.8.13pre
Compute element boundaries
  1644 start postions
  1907 end postions
  0.00s Using NLP pipeline Spacy (en) ...
NLP with language model en False
  4.19s Tokens written to ~/github/annotation/mondriaan/_temp/txt/tokens.tsv
  4.19s Sentences written to ~/github/annotation/mondriaan/_temp/txt/sentences.tsv
  4.19s NLP done


In [13]:
newVersion = NLP.task(
    ingest=True,
    positions=positions,
    tokens=tokens,
    sentences=sentences,
)

Input data has version 0.8.13pre
Compute element boundaries
  1644 start postions
  1907 end postions
  0.00s Ingesting NLP output into the dataset ...
   |       12s Mapping NLP data to nodes and features ...
   |      |     0.00s generating t-nodes with features str, after, empty
   |      |      |    -0.00s 17747 t nodes have values assigned for str, after, pos, morph, lemma
   |      |      |     0.00s 0 empty slots are properly contained in a token
   |      |      |     0.00s 101 space slots have split into chars
   |      |      |     0.00s 2121 slots have split around an element boundary
   |      |      |     0.00s 11479x Items contained in extra generated text
   |      |     0.11s 17518 tokens
   |      |     0.11s 17747 ts
   |      |     0.11s generating sentence-nodes with features nsent
   |      |      |     0.03s 1521 sentence nodes have values assigned for nsent
   |      |      |     0.03s  1789x Items contained in extra generated text
   |      |      |     0.03s   

In [14]:
Tei.task(apptoken=True)

App updated with NLP output 


True

# Step 6: Use the new dataset

In [15]:
A = use(f"{ORG}/{REPO}:clone", checkout="clone", silent="verbose", hoist=globals())

**Locating corpus resources ...**

This is Text-Fabric 12.2.10
71 features found and 0 ignored
   |     0.02s T otype                from ~/github/annotation/mondriaan/tf/0.8.13
   |     0.23s T oslots               from ~/github/annotation/mondriaan/tf/0.8.13
  0.25s Dataset without structure sections in otext:no structure functions in the T-API
   |     0.00s T file                 from ~/github/annotation/mondriaan/tf/0.8.13
   |     0.00s T folder               from ~/github/annotation/mondriaan/tf/0.8.13
   |     0.08s T str                  from ~/github/annotation/mondriaan/tf/0.8.13
   |     0.00s T chunk                from ~/github/annotation/mondriaan/tf/0.8.13
   |     0.07s T after                from ~/github/annotation/mondriaan/tf/0.8.13
   |      |     0.01s C __levels__           from otype, oslots, otext
   |      |     0.22s C __order__            from otype, oslots, __levels__
   |      |     0.01s C __rank__             from otype, __order__
   |      |     0.31s C __levUp__            from otype, 

Name,# of nodes,# slots / node,% coverage
folder,2,8873.5,100
bibliolist,1,3144.0,18
listBibl,2,1546.5,17
file,16,1109.19,100
letter,14,1004.07,79
body,16,666.94,60
text,16,666.94,60
artworklist,1,546.0,3
listObject,1,501.0,3
standOff,14,358.71,28
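
The overview tables before and after adding tokens can be compared with a little arithmetic. Since the `folder` nodes cover 100% of the slots, multiplying their node count by the average number of slots per node should give the total number of slots, assuming the folder nodes do not overlap. A minimal sketch with the figures copied from the two tables; the function name `totalSlots` is made up for illustration:

```python
def totalSlots(nNodes, slotsPerNode):
    """Total number of slots for a node type with 100% coverage.

    Assumes the nodes of this type do not overlap.
    """
    return round(nNodes * slotsPerNode)


# The folder rows of the two overview tables
slotsBefore = totalSlots(2, 40736.5)  # dataset loaded in Step 5
slotsAfter = totalSlots(2, 8873.5)  # dataset loaded in Step 6

print(slotsBefore, slotsAfter)
print(f"{slotsBefore / slotsAfter:.2f} old slots per new slot on average")
```

The second total, 17747, equals the number of `t` nodes reported during ingestion in Step 5, which suggests that the `t` nodes are the slots of the new dataset.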


# Step 7: Convert to WATM

We add some annotations on the fly: links from nodes that refer to artworks to the
pages that represent those artworks on the RKD website.

In [16]:
RKD_URL_BASE = "https://rkd.nl/explore/images/"


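def getArtWorksUrl():
    """Map each artwork-referring node to the URL of its page on the RKD website."""
    # rs nodes of type artwork-m whose ref value contains a digit 1-9
    query = """rs type=artwork-m ref~[1-9]"""
    artworksWithRef = A.search(query)
    # drop the first 13 characters of the ref value; the rest completes the URL
    return {r[0]: f"{RKD_URL_BASE}{F.ref.v(r[0])[13:]}" for r in artworksWithRef}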

artworks = getArtWorksUrl()
artworks

  0.00s 14 results


{19790: 'https://rkd.nl/explore/images/277201',
 19791: 'https://rkd.nl/explore/images/68554',
 19797: 'https://rkd.nl/explore/images/277201',
 19798: 'https://rkd.nl/explore/images/68554',
 19854: 'https://rkd.nl/explore/images/62319',
 19855: 'https://rkd.nl/explore/images/68733',
 19861: 'https://rkd.nl/explore/images/277201',
 19887: 'https://rkd.nl/explore/images/194515',
 19888: 'https://rkd.nl/explore/images/268821',
 19889: 'https://rkd.nl/explore/images/268881',
 19893: 'https://rkd.nl/explore/images/62324',
 19940: 'https://rkd.nl/explore/images/68728',
 19953: 'https://rkd.nl/explore/images/268864',
 19960: 'https://rkd.nl/explore/images/268864'}
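
Several nodes point to the same RKD page. To see which ones, the mapping can be inverted. The sketch below copies the dictionary from the output above so that it runs without the corpus; in the notebook itself the existing `artworks` variable could be used directly. The names `nodesByUrl` and `shared` are introduced here for illustration only.

```python
from collections import defaultdict

# Node-to-URL mapping, copied from the output above
artworks = {
    19790: "https://rkd.nl/explore/images/277201",
    19791: "https://rkd.nl/explore/images/68554",
    19797: "https://rkd.nl/explore/images/277201",
    19798: "https://rkd.nl/explore/images/68554",
    19854: "https://rkd.nl/explore/images/62319",
    19855: "https://rkd.nl/explore/images/68733",
    19861: "https://rkd.nl/explore/images/277201",
    19887: "https://rkd.nl/explore/images/194515",
    19888: "https://rkd.nl/explore/images/268821",
    19889: "https://rkd.nl/explore/images/268881",
    19893: "https://rkd.nl/explore/images/62324",
    19940: "https://rkd.nl/explore/images/68728",
    19953: "https://rkd.nl/explore/images/268864",
    19960: "https://rkd.nl/explore/images/268864",
}

# Invert the mapping: URL -> list of nodes that link to it
nodesByUrl = defaultdict(list)
for node, url in artworks.items():
    nodesByUrl[url].append(node)

# Keep only the URLs that more than one node refers to
shared = {url: nodes for url, nodes in nodesByUrl.items() if len(nodes) > 1}

print(len(nodesByUrl), "distinct URLs for", len(artworks), "nodes")
for url, nodes in sorted(shared.items()):
    print(url, nodes)
```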

N.B. For documentation, click the WATM link in the output cell.

In [19]:
WA = WATM(A, "tei", skipMeta=False, extra=dict(artworkref=artworks))
WA.makeText()
WA.makeAnno()
WA.writeAll()

[WATM exporter docs](https://annotation.github.io/text-fabric/tf/convert/watm.html)

Text file:   17747 segments to ~/github/annotation/mondriaan/watm/0.8.13/text.json
 85386 annotations written to ~/github/annotation/mondriaan/watm/0.8.13/anno-1.json
Anno files:   85386 annotations to 1 files


# Step 8: Test the WATM against the TF

In [20]:
WA.testAll()

Testing the text ...
	TF:  17747
	WA:  17747
OK - whether the amounts of tokens agree
	TF: Brief aan Aletta de  ... 0-1940. Baarn 1980. 
	WA: Brief aan Aletta de  ... 0-1940. Baarn 1980. 
OK - whether the text is the same
Testing the elements ...
	TF:  21684
	WA:  21684
OK - whether the amounts of elements and nodes agree
	TF:      0
	WA:      0
OK - whether the amounts of processing instructions agree
	39431 element annotations
	Element : 21684 x
	Pi      :     0 x
	Other   : 63702 x
	Good    : 21684 x
	Wrong   :     0 x
	Unmapped:     0 x
OK - whether all element annotations are ok
Testing the attributes ...
	7890 attribute values
	Good:      7890 x
	Wrong:        0 x
OK - whether annotations are consistent with features
	WA attributes: 7890
	TF attributes: 7890
OK - whether annotations are complete w.r.t. features
Testing the format attributes ...
	252 format values
	formatting attributes: 
		   106 x italics
		    59 x underline
		    36 x indent
		    18 x upsidedown
		     8 x su

True
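
As a small sanity check on the reported figures, the `Element`, `Pi` and `Other` counts in the test output appear to add up to the number of annotations written in Step 7. A sketch with the numbers copied by hand from the two outputs; the variable names are illustrative:

```python
# Counts copied from the testAll() output
counts = {"Element": 21684, "Pi": 0, "Other": 63702}

# Number of annotations reported when the WATM files were written in Step 7
annotationsWritten = 85386

total = sum(counts.values())
print(total, total == annotationsWritten)
```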