# Convert from TEI to TF

We convert the Mondriaan TEI files to Text-Fabric (TF).

This notebook is bare: no explanations, no illustrations, no checks.
For more documentation, try any of the following variants:

* *convertExpress*: as few commands, feedback, and interaction as possible
* [convertSteps](convertSteps.ipynb): broken down into a few command-line commands, with more feedback
* [convertDetails](convertDetails.ipynb): run from Python with full control

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from tf.app import use
from tf.convert.tei import TEI
from tf.convert.addnlp import NLPipeline
from tf.advanced.helpers import dm

In [3]:
ORG = "annotation"
REPO = "mondriaan"

### Step 1: Check

In [4]:
Tei = TEI(verbose=-1, tei=0, tf="0.8.13pre")

In [5]:
Tei.task(check=True, verbose=0, validate=False)

Processing instructions are treated
XML validation will be skipped
Start folder proeftuin:
   1 MD           letter       md           19090216y_IONG_1303.xml                           
   2 MD           letter       md           19090407y_IONG_1739.xml                           
   3 MD           letter       md           19090421y_IONG_1304.xml                           
   4 MD           letter       md           19090426y_IONG_1738.xml                           
   5 MD           letter       md           19090513y_IONG_1293.xml                           
   6 MD           letter       md           19090624_IONG_1294.xml                            
   7 MD           letter       md           19090807y_IONG_1296.xml                           
   8 MD           letter       md           19090824y_KNAP_1747.xml                           
   9 MD           letter       md           19090905y_IONG_1295.xml                           
  10 MD           letter       md           190909XX_Q

True

### Step 2: Convert

In [6]:
Tei.good = True
Tei.task(convert=True, verbose=0)

Processing instructions are treated
  0.00s Importing data from walking through the source ...
   |     0.00s Preparing metadata... 
   |     0.00s OK
   |     0.00s Following director... 
Start folder proeftuin:
   1 MD           md           letter       19090216y_IONG_1303.xml                           
   2 MD           md           letter       19090407y_IONG_1739.xml                           
   3 MD           md           letter       19090421y_IONG_1304.xml                           
   4 MD           md           letter       19090426y_IONG_1738.xml                           
   5 MD           md           letter       19090513y_IONG_1293.xml                           
   6 MD           md           letter       19090624_IONG_1294.xml                            
   7 MD           md           letter       19090807y_IONG_1296.xml                           
   8 MD           md           letter       19090824y_KNAP_1747.xml                           
   9 MD           md       

True

### Step 3: Load the TF data

The final proof that the conversion has worked is to load the data.
On the first load, TF performs several checks and precomputations.
Subsequent loads are much quicker.

In [7]:
Tei.task(load=True)

True
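
The difference between a slow first load and quick later loads follows a general compute-once, cache-afterwards pattern. The sketch below illustrates that pattern only; it is not Text-Fabric's actual implementation. The function `load_with_cache`, the file names, and the cache layout are all hypothetical.

```python
import gzip
import pickle
import tempfile
from pathlib import Path


def load_with_cache(source: Path, cache: Path, compute):
    """Return (result, cache_hit) for compute(source text).

    Hypothetical helper: the result is stored as a gzipped pickle and
    reused as long as the cache file is at least as new as the source.
    """
    if cache.exists() and cache.stat().st_mtime >= source.stat().st_mtime:
        with gzip.open(cache, "rb") as fh:
            return pickle.load(fh), True  # quick path: reuse earlier work

    # Slow path: do the computation and remember the outcome.
    result = compute(source.read_text(encoding="utf-8"))
    cache.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(cache, "wb") as fh:
        pickle.dump(result, fh)
    return result, False


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "feature.txt"  # hypothetical source file
    source.write_text("word\nword\nsentence\n", encoding="utf-8")
    cache = Path(tmp) / ".cache" / "feature.pkl.gz"  # hypothetical cache location

    def count_lines(text):
        return len(text.splitlines())

    first = load_with_cache(source, cache, count_lines)   # computes and stores
    second = load_with_cache(source, cache, count_lines)  # reads the cache
```

Here `first` reports a cache miss and `second` a cache hit, with the same result in both cases.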

### Step 4: Configure a TF app

The TF app has configuration settings, a bit of custom code, and documentation.

Most of it will be generated now, but there are ways to keep custom additions intact.

In [8]:
Tei.task(app=True)

App updated


True

### Step 5: Add tokens and sentences

In [9]:
Apre = use(f"{ORG}/{REPO}:clone", checkout="clone", hoist=globals())
NLP = NLPipeline(lang="en", verbose=0, write=True)
NLP.loadApp(Apre)

(text, positions) = NLP.task(plaintext=True)
(tokens, sentences) = NLP.task(lingo=True, text=text)
newVersion = NLP.task(
    ingest=True,
    positions=positions,
    tokens=tokens,
    sentences=sentences,
)
Tei.task(apptoken=True)

**Locating corpus resources ...**

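| Name | # of nodes | # slots/node | % coverage |
|---|---|---|---|
| folder | 2 | 40718.5 | 100 |
| bibliolist | 1 | 13650.0 | 17 |
| listBibl | 2 | 6722.0 | 17 |
| file | 16 | 5089.81 | 100 |
| letter | 14 | 4681.57 | 80 |
| body | 16 | 2875.94 | 57 |
| text | 16 | 2875.94 | 57 |
| artworklist | 1 | 2245.0 | 3 |
| listObject | 1 | 2035.0 | 2 |
| standOff | 14 | 1798.36 | 31 |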


Input data has version 0.8.13pre
Compute element boundaries
  1611 start postions
  1871 end postions
Input data has version 0.8.13pre
Compute element boundaries
  1611 start postions
  1871 end postions
  0.00s Generating a plain text with positions ...
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
   |   Found 262 empty slots
   |   recorded flow main       with 168529 items
   |   recorded flow del        with    153 items
   |   recorded flow note       with  77857 items
   |   recorded flow orig       with     95 items
  0.20s Done. Generated text and positions written to ~/github/annotation/mondriaan/_temp/txt/plain.txt
Input data has version 0.8.13pre
Compute element boundaries
  1611 start postions
  1871 end postions
  0.00s Using NLP pipeline Spacy (en) ...
  3.89s Atomic tokens written to ~/github/annotation/mondriaan/_temp/txt/tokens.tsv
  3.89s Sentences written to ~/github/annotation/mondriaan/_temp/txt/sentences.tsv
  3.89s NLP done
Input data has ve

True

### Step 6: Use the new dataset

In [10]:
A = use(f"{ORG}/{REPO}:clone", checkout="clone", silent="verbose", hoist=globals())

**Locating corpus resources ...**

This is Text-Fabric 11.4.16
70 features found and 0 ignored
   |     0.02s T otype                from ~/github/annotation/mondriaan/tf/0.8.13
   |     0.24s T oslots               from ~/github/annotation/mondriaan/tf/0.8.13
  0.26s Dataset without structure sections in otext:no structure functions in the T-API
   |     0.00s T folder               from ~/github/annotation/mondriaan/tf/0.8.13
   |     0.00s T file                 from ~/github/annotation/mondriaan/tf/0.8.13
   |     0.09s T str                  from ~/github/annotation/mondriaan/tf/0.8.13
   |     0.00s T chunk                from ~/github/annotation/mondriaan/tf/0.8.13
   |     0.07s T after                from ~/github/annotation/mondriaan/tf/0.8.13
   |      |     0.01s C __levels__           from otype, oslots, otext
   |      |     0.19s C __order__            from otype, oslots, __levels__
   |      |     0.01s C __rank__             from otype, __order__
   |      |     0.32s C __levUp__            from otype, 

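| Name | # of nodes | # slots/node | % coverage |
|---|---|---|---|
| folder | 2 | 8855.5 | 100 |
| bibliolist | 1 | 3144.0 | 18 |
| listBibl | 2 | 1546.5 | 17 |
| file | 16 | 1106.94 | 100 |
| letter | 14 | 1001.5 | 79 |
| body | 16 | 664.69 | 60 |
| text | 16 | 664.69 | 60 |
| artworklist | 1 | 546.0 | 3 |
| listObject | 1 | 501.0 | 3 |
| standOff | 14 | 358.71 | 28 |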
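
The columns of the two node-type tables in this notebook can be cross-checked with a little arithmetic. The sketch below assumes that nodes of the same type do not overlap, so that coverage is roughly `nodes × slots/node` divided by the total number of slots, and that a type with 100% coverage spans all slots. Both are assumptions about how the columns relate, not documented behaviour, and only the rows shown above are included.

```python
# Rows copied from the tables: (name, # of nodes, # slots/node, reported % coverage)
before = [  # table shown in Step 5
    ("folder", 2, 40718.5, 100), ("bibliolist", 1, 13650.0, 17),
    ("listBibl", 2, 6722.0, 17), ("file", 16, 5089.81, 100),
    ("letter", 14, 4681.57, 80), ("body", 16, 2875.94, 57),
    ("text", 16, 2875.94, 57), ("artworklist", 1, 2245.0, 3),
    ("listObject", 1, 2035.0, 2), ("standOff", 14, 1798.36, 31),
]
after = [  # table shown in Step 6
    ("folder", 2, 8855.5, 100), ("bibliolist", 1, 3144.0, 18),
    ("listBibl", 2, 1546.5, 17), ("file", 16, 1106.94, 100),
    ("letter", 14, 1001.5, 79), ("body", 16, 664.69, 60),
    ("text", 16, 664.69, 60), ("artworklist", 1, 546.0, 3),
    ("listObject", 1, 501.0, 3), ("standOff", 14, 358.71, 28),
]


def total_slots(rows):
    """Estimate the slot count from the first type that covers 100% (assumption)."""
    _, nodes, per_node, _ = next(row for row in rows if row[3] == 100)
    return round(nodes * per_node)


def coverage(rows):
    """Recompute the % coverage per type as nodes * slots/node / total slots."""
    total = total_slots(rows)
    return {name: round(100 * nodes * per_node / total) for name, nodes, per_node, _ in rows}


for label, rows in (("before", before), ("after", after)):
    computed = coverage(rows)
    matches = all(computed[name] == reported for name, _, _, reported in rows)
    print(label, total_slots(rows), "slots; reported coverage reproduced:", matches)
```

Under these assumptions the total drops from 81437 slots in the Step 5 table to 17711 in the Step 6 table. That would be consistent with Step 5 replacing the original slots with tokens, although the notebook itself does not state this.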
