# Convert from TEI to TF

We convert the Mondriaan TEI files to Text-Fabric (TF).

This notebook is bare: no explanations, no illustrations, no checks.
For more documentation, try one of the other variants below:

* *convertExpress* (this notebook): as few commands, feedback, and interaction as possible
* [convertSteps](convertSteps.ipynb): broken down into a few command-line commands, with more feedback
* [convertDetails](convertDetails.ipynb): run from Python with full control

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from tf.app import use
from tf.convert.tei import TEI
from tf.convert.addnlp import NLPipeline
from tf.advanced.helpers import dm

In [3]:
ORG = "annotation"
REPO = "mondriaan"

### Step 1: Check

In [36]:
Tei = TEI(verbose=-1, tei=0, tf="0.8.12pre")

In [37]:
Tei.task(check=True, verbose=1, validate=True)

TEI to TF checking: ~/github/annotation/mondriaan/tei/2023-05-24 => ~/github/annotation/mondriaan/report/2023-05-24
Processing instructions are treated
XML validation will be performed
Section model I
Start folder proeftuin:
~/github/annotation/mondriaan/tei/2023-05-24/proeftuin/19090216y_IONG_1303.xml: adaptation triggerRe=re.compile('<\\?editem\\b[^>]*?adaptation=[\'"]([^\'"]+)[\'"]')
~/github/annotation/mondriaan/tei/2023-05-24/proeftuin/19090216y_IONG_1303.xml: template triggerRe=re.compile('<\\?editem\\b[^>]*?template=[\'"]([^\'"]+)[\'"]')
   1 MD           letter       19090216y_IONG_1303.xml                           
~/github/annotation/mondriaan/tei/2023-05-24/proeftuin/19090407y_IONG_1739.xml: adaptation triggerRe=re.compile('<\\?editem\\b[^>]*?adaptation=[\'"]([^\'"]+)[\'"]')
~/github/annotation/mondriaan/tei/2023-05-24/proeftuin/19090407y_IONG_1739.xml: template triggerRe=re.compile('<\\?editem\\b[^>]*?template=[\'"]([^\'"]+)[\'"]')
   2 MD           letter       19090407y_

False
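
The check log above prints two `triggerRe` patterns, one for `adaptation` and one for `template`, both targeting `<?editem ...?>` processing instructions. The following is a minimal sketch of how those two patterns behave, using only the standard library. The sample instruction and its attribute values (`letter`, `mondriaan`) are invented for illustration and are not taken from the corpus; the helper `trigger_value` is hypothetical as well.

```python
import re

# Patterns copied from the log output above.
ADAPTATION_RE = re.compile(r"""<\?editem\b[^>]*?adaptation=['"]([^'"]+)['"]""")
TEMPLATE_RE = re.compile(r"""<\?editem\b[^>]*?template=['"]([^'"]+)['"]""")

# Hypothetical processing instruction; the attribute values are invented.
sample = '<?editem template="letter" adaptation="mondriaan"?>'


def trigger_value(pattern, xml_text):
    """Return the captured attribute value, or None if the pattern does not match."""
    match = pattern.search(xml_text)
    return match.group(1) if match else None


template = trigger_value(TEMPLATE_RE, sample)
adaptation = trigger_value(ADAPTATION_RE, sample)
no_match = trigger_value(TEMPLATE_RE, "<?xml version='1.0'?>")
print(template, adaptation, no_match)
```

Because `[^>]*?` cannot cross a `>`, each pattern only looks for its attribute inside the `editem` instruction itself, and the attributes may appear in either order.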

### Step 2: Convert

In [32]:
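# Mark the converter state as good by hand,
# so that the conversion runs even if the check did not succeed.
Tei.good = True
Tei.task(convert=True)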

TEI to TF converting: ~/github/annotation/mondriaan/tei/2023-05-24 => ~/github/annotation/mondriaan/tf/0.8.12pre
Page model II with pb elements at the top of the page
Processing instructions are treated
  0.00s Not all of the warp features otype and oslots are present in
~/github/annotation/mondriaan/tf/0.8.12pre
  0.00s Only the Feature and Edge APIs will be enabled
  0.00s Warp feature "otext" not found. Working without Text-API

  0.00s Importing data from walking through the source ...
   |     0.00s Preparing metadata... 
   |     0.00s No structure nodes will be set up
   |   SECTION   TYPES:    folder, file, chunk
   |   SECTION   FEATURES: folder, file, chunk
   |   STRUCTURE TYPES:    
   |   STRUCTURE FEATURES: 
   |   TEXT      FEATURES:
   |      |   text-orig-full       ch
   |     0.00s OK
   |     0.00s Following director... 
Start folder proeftuin:
   1 MD           letter       19090216y_IONG_1303.xml                           
   2 MD           letter       19090407y_

True

### Step 3: Load the TF data

The final proof that the conversion has worked is that the data loads.
The first load performs several checks and precomputations;
subsequent loads are much quicker.

In [33]:
Tei.task(load=True)

True
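
Why a second load is quicker can be illustrated with a generic compute-once-then-cache pattern. This is a sketch of the general idea only, not Text-Fabric's implementation: the cache file name, the pickle format, and the functions `load_with_cache` and `expensive_precomputation` are all assumptions made for the example.

```python
import pickle
import tempfile
from pathlib import Path


def load_with_cache(cache_path, compute):
    """Return cached data if present; otherwise compute it and store it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as fh:
            return pickle.load(fh), "from cache"
    data = compute()
    with cache_path.open("wb") as fh:
        pickle.dump(data, fh)
    return data, "computed"


def expensive_precomputation():
    # Stand-in for the checks and precomputations done on a first load.
    return {n: n * n for n in range(1000)}


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "precomputed.pickle"  # hypothetical cache location
    first_data, first_status = load_with_cache(cache_file, expensive_precomputation)
    second_data, second_status = load_with_cache(cache_file, expensive_precomputation)

print(first_status, second_status)
```

The first call does the work and writes the result; the second call only reads it back, which is the behaviour the text above describes for repeated loads.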

### Step 4: Configure a TF app

The TF app consists of configuration settings, a bit of custom code, and documentation.

Most of it is generated in this step, but there are ways to keep custom additions intact.

In [34]:
Tei.task(app=True)

App updated


True

### Step 5: Add tokens and sentences

In [35]:
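# Load the freshly converted dataset from the local clone.
Apre = use(f"{ORG}/{REPO}:clone", checkout="clone", hoist=globals())
# Set up the NLP pipeline and attach it to the loaded dataset.
NLP = NLPipeline(lang="en", verbose=0, write=True)
NLP.loadApp(Apre)

# Generate plain text plus the positions that map it back to the data.
(text, positions) = NLP.task(plaintext=True)
# Run the NLP pipeline on the plain text to get tokens and sentences.
(tokens, sentences) = NLP.task(lingo=True, text=text)
# Ingest tokens and sentences into a new version of the TF dataset.
newVersion = NLP.task(
    ingest=True,
    positions=positions,
    tokens=tokens,
    sentences=sentences,
)
# Update the TF app for the tokenized dataset.
Tei.task(apptoken=True)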

**Locating corpus resources ...**

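| Name | # of nodes | # slots/node | % coverage |
|---|---|---|---|
| folder | 2 | 39134.0 | 100 |
| biblio | 1 | 13652.0 | 17 |
| listBibl | 2 | 6722.5 | 17 |
| file | 16 | 4891.75 | 100 |
| letter | 14 | 4455.07 | 80 |
| body | 16 | 2876.0 | 59 |
| text | 16 | 2876.0 | 59 |
| artworks | 1 | 2245.0 | 3 |
| listObject | 1 | 2035.0 | 3 |
| standOff | 14 | 1570.86 | 28 |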


Input data has version 0.8.12pre
Compute element boundaries
  1591 start postions
  1850 end postions
Input data has version 0.8.12pre
Compute element boundaries
  1591 start postions
  1850 end postions
  0.00s Generating a plain text with positions ...
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
   |   Found 277 empty slots
   |   recorded flow main       with 168540 items
   |   recorded flow del        with    153 items
   |   recorded flow note       with  68245 items
   |   recorded flow orig       with     95 items
  0.16s Done. Generated text and positions written to ~/github/annotation/mondriaan/_temp/txt/plain.txt
Input data has version 0.8.12pre
Compute element boundaries
  1591 start postions
  1850 end postions
  0.00s Using NLP pipeline Spacy (en) ...
  3.89s Atomic tokens written to ~/github/annotation/mondriaan/_temp/txt/tokens.tsv
  3.90s Sentences written to ~/github/annotation/mondriaan/_temp/txt/sentences.tsv
  3.90s NLP done
Input data has ve

True

### Step 6: Use the new dataset

In [36]:
A = use(f"{ORG}/{REPO}:clone", checkout="clone", silent="verbose", hoist=globals())

**Locating corpus resources ...**

This is Text-Fabric 11.4.16
70 features found and 0 ignored
   |     0.02s T otype                from ~/github/annotation/mondriaan/tf/0.8.12
   |     0.19s T oslots               from ~/github/annotation/mondriaan/tf/0.8.12
  0.21s Dataset without structure sections in otext:no structure functions in the T-API
   |     0.07s T after                from ~/github/annotation/mondriaan/tf/0.8.12
   |     0.08s T str                  from ~/github/annotation/mondriaan/tf/0.8.12
   |     0.00s T file                 from ~/github/annotation/mondriaan/tf/0.8.12
   |     0.00s T chunk                from ~/github/annotation/mondriaan/tf/0.8.12
   |     0.00s T folder               from ~/github/annotation/mondriaan/tf/0.8.12
   |      |     0.01s C __levels__           from otype, oslots, otext
   |      |     0.25s C __order__            from otype, oslots, __levels__
   |      |     0.01s C __rank__             from otype, __order__
   |      |     0.32s C __levUp__            from otype, 

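| Name | # of nodes | # slots/node | % coverage |
|---|---|---|---|
| folder | 2 | 8550.5 | 100 |
| biblio | 1 | 3146.0 | 18 |
| listBibl | 2 | 1547.0 | 18 |
| file | 16 | 1068.81 | 100 |
| letter | 14 | 957.79 | 78 |
| body | 16 | 664.75 | 62 |
| text | 16 | 664.75 | 62 |
| artworks | 1 | 546.0 | 3 |
| listObject | 1 | 501.0 | 3 |
| div | 31 | 326.03 | 59 |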
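
The two coverage tables in this notebook (one printed in Step 5, one in Step 6) can be compared with a little arithmetic: multiplying the number of nodes by the average number of slots per node gives the approximate total number of slots covered by a node type. The sketch below does this for a few rows copied from those tables; the function `slots_per_type` is a hypothetical helper, not part of Text-Fabric.

```python
import csv
import io

# Selected rows copied from the two coverage tables in this notebook.
BEFORE = """Name,# of nodes,# slots/node,% coverage
folder,2,39134.0,100
file,16,4891.75,100
letter,14,4455.07,80
"""
AFTER = """Name,# of nodes,# slots/node,% coverage
folder,2,8550.5,100
file,16,1068.81,100
letter,14,957.79,78
"""


def slots_per_type(table_text):
    """Map each node type to its approximate total slot count (nodes x slots/node)."""
    reader = csv.DictReader(io.StringIO(table_text))
    return {
        row["Name"]: round(int(row["# of nodes"]) * float(row["# slots/node"]))
        for row in reader
    }


before = slots_per_type(BEFORE)
after = slots_per_type(AFTER)
for name in before:
    print(f"{name:8} {before[name]:>7} -> {after[name]:>6}")
```

The `folder` and `file` types both have 100% coverage, so their totals equal the total number of slots: 78268 before Step 5 and 17101 after it. That drop is consistent with the slots being characters in version `0.8.12pre` and tokens in the new version, although the notebook does not state the slot types explicitly.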
