# Convert from TEI to TF

We show how to convert a TEI data source into TF.

This has two stages:

1. make a preliminary TF dataset with the character as slot type;
2. feed the plain text to a tokenizer, and add tokens and sentences to the dataset,
   while removing its character and word nodes;
   the new slot type is token.
   
A dataset based on characters is precise, but rather inefficient, because every single character is a slot.
The second stage replaces the character slots by token slots, which makes the dataset much more efficient.
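
To give a rough sense of the gain, the slot counts of both stages can be compared. The figures below are copied from the logs further down in this document; the variable names are ours, and slot count is only one proxy for efficiency.

```python
# Figures copied from the conversion logs shown later in this document.
char_slots = 63491   # "slot" actions in the preliminary, character-based conversion
other_nodes = 12921  # "node" actions in the same run
token_slots = 13761  # tokens reported by the ingest step

# Sanity check: slots plus other nodes should give the reported node total.
total_nodes = char_slots + other_nodes

# Average number of character slots that one token slot replaces.
chars_per_token = char_slots / token_slots

print(f"total nodes in the preliminary dataset: {total_nodes}")
print(f"character slots per token slot: {chars_per_token:.1f}")
```

The first number matches the `Max node = 76412` line of the conversion log.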

**More ways to do it!**

* [convertExpress](convertExpress.ipynb): as few commands and as little feedback and interaction as possible
* *convertSteps*: broken down into a few command-line steps, with more feedback
* [convertDetails](convertDetails.ipynb): run from Python with full control

## Preliminary conversion

This is the same conversion as in [convertExpress](convertExpress.ipynb), but now step by step and with more feedback.

In [5]:
!python tfFromTei.py check +verbose

INFO: Needs af.xsd (exists)
INFO: Needs ns2.xsd (exists)
INFO: Needs ns1.xsd (exists)
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
Analysing ~/github/annotation/mondriaan/schema/MD.xsd
  6 changing override(s)
	address pure ==> mixed
	postmark complex mixed (added)
	rewrite complex mixed (added)
	sepLine complex pure (added)
	transpose pure ==> mixed
	wbh complex pure (added)
Start folder proeftuin:
  14 19100131_SAAL_ARNO_0018.xml                       
End   folder proeftuin

217 info line(s) written to ~/github/annotation/mondriaan/report/elements.txt
0 error(s) in 0 file(s) written to ~/github/annotation/mondriaan/report/errors.txt
59 tags of which 0 with multiple namespaces written to ~/github/annotation/mondriaan/report/namespaces.txt


In [6]:
!python tfFromTei.py convert +verbose

  0.00s Importing data from walking through the source ...
   |     0.00s Preparing metadata... 
   |     0.00s OK
   |     0.00s Following director... 
Start folder proeftuin:
  14 19100131_SAAL_ARNO_0018.xml                       
End   folder proeftuin

   |     0.19s "edge" actions: 0
   |     0.19s "feature" actions: 123262
   |     0.19s "node" actions: 12921
   |     0.19s "resume" actions: 0
   |     0.19s "slot" actions: 63491
   |     0.19s "terminate" actions: 12921
   |      76412 nodes of all types
   |     0.19s OK
   |     0.00s checking for nodes and edges ... 
   |     0.00s OK
   |     0.00s checking (section) features ... 
   |     0.00s OK
   |     0.00s reordering nodes ...
   |     0.04s Max node = 76412
   |     0.04s OK
   |     0.00s reassigning feature values ...
   |     0.01s OK


In [7]:
!python tfFromTei.py load +verbose

max node = 76412


In [8]:
!python tfFromTei.py app +verbose

App updating  ...
	~/github/annotation/mondriaan/docs/about.md (generated with custom info)
	~/github/annotation/mondriaan/docs/transcription.md (no custom info, older orginal exists)
	~/github/annotation/mondriaan/app/static/logo.png (already exists, not overwritten)
	~/github/annotation/mondriaan/app/static/display.css (generated with custom info)
	~/github/annotation/mondriaan/app/config.yaml (generated with custom info)
	~/github/annotation/mondriaan/app/app.py (generated with custom info)
Done


We can regulate the verbosity:

* `-verbose`: minimal feedback,
* `+verbose`: moderate amount of feedback,
* `++verbose`: maximal feedback.
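
The four preliminary tasks and the verbosity flags can also be assembled in a small driver script. The sketch below is an illustration only and not part of `tfFromTei.py`: the helpers `build_commands` and `run_all` and the level names `minimal`, `moderate` and `maximal` are hypothetical, and it assumes that the script signals failure with a non-zero exit status, which this document does not confirm.

```python
import subprocess

# Task names and flags as used in the cells above.
TASKS = ["check", "convert", "load", "app"]
VERBOSITY = {"minimal": "-verbose", "moderate": "+verbose", "maximal": "++verbose"}


def build_commands(level="moderate", script="tfFromTei.py"):
    """Return one command (as an argument list) per task, in order."""
    flag = VERBOSITY[level]
    return [["python", script, task, flag] for task in TASKS]


def run_all(commands, runner=subprocess.run):
    """Run the commands in order and stop at the first non-zero exit status.

    Returns the number of commands that completed successfully.
    The stop-on-failure rule is an assumption of this sketch.
    """
    done = 0
    for cmd in commands:
        result = runner(cmd)
        if result.returncode != 0:
            break
        done += 1
    return done


commands = build_commands("moderate")
for cmd in commands:
    # Only print here; call run_all(commands) to actually execute them.
    print(" ".join(cmd))
```

Passing the runner as a parameter keeps the sketch easy to try out without touching the real data: a stand-in function can replace `subprocess.run`.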

## Add tokens and sentences

We add tokens and sentences to the TF dataset.

We do this in the following steps:

1. Generate a plain text plus a mapping between character positions and nodes
2. Use spaCy to tokenize the text and to determine sentence boundaries
3. Translate the spaCy results back to extra nodes and features for the TF dataset
4. Adjust the app to the modified dataset
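
The idea behind steps 1 to 3 can be sketched in a few lines of plain Python. This is a simplified illustration under our own assumptions, not the code that `addnlp` runs: a regular expression stands in for the NLP tokenizer, every character is assumed to belong to exactly one slot node, and the function names are hypothetical.

```python
import re


def build_plain_text(chars):
    """Step 1: chars is a list of (node, character) pairs in text order.

    Returns the plain text and a list that gives, for each character
    position in that text, the node the character came from.
    """
    text = "".join(ch for _, ch in chars)
    positions = [node for node, _ in chars]
    return text, positions


def tokenize(text):
    """Step 2 (stand-in for the NLP tool): (start, end) offsets of tokens."""
    return [(m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def tokens_to_nodes(spans, positions):
    """Step 3: translate character offsets back to the original slot nodes."""
    return [positions[start:end] for start, end in spans]


# Toy corpus: character slots numbered from 1, as TF nodes are.
chars = list(enumerate("Dear Sir, hello.", start=1))
text, positions = build_plain_text(chars)
spans = tokenize(text)
token_slots = tokens_to_nodes(spans, positions)

for (start, end), slots in zip(spans, token_slots):
    print(repr(text[start:end]), slots)
```

The position list is what makes the round trip possible: the tokenizer only ever sees plain text, and its offsets are looked up afterwards.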

### Step 0: Revert to the dataset without tokens and sentences

In [9]:
!python tfFromTei.py app +verbose

App updating  ...
	~/github/annotation/mondriaan/docs/about.md (generated with custom info)
	~/github/annotation/mondriaan/docs/transcription.md (no custom info, older orginal exists)
	~/github/annotation/mondriaan/app/static/logo.png (already exists, not overwritten)
	~/github/annotation/mondriaan/app/static/display.css (generated with custom info)
	~/github/annotation/mondriaan/app/config.yaml (generated with custom info)
	~/github/annotation/mondriaan/app/app.py (generated with custom info)
Done


### Step 1: Generate a plain text of the whole corpus

We add `+write` to write out the plain text and the node positions,
so that they can be used in the next step.

In [10]:
!addnlp plaintext +write +verbose

  0.00s Generating a plain text with positions ...
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
   |   Found 202 empty slots
   |   recorded flow main       with 127818 items
   |   recorded flow del        with    159 items
   |   recorded flow note       with  64164 items
   |   recorded flow orig       with    227 items
  0.13s Done. Generated text and positions written to ~/github/annotation/mondriaan/_temp/txt/plain.txt


### Step 2: Run spaCy to get tokens and sentences

In [11]:
!addnlp lingo +write +verbose

  0.00s Using NLP pipeline Spacy (en) ...
  3.02s NLP done


### Step 3: Ingest the results into the dataset

We omit `+write` because the new dataset is written anyway.

In [12]:
!addnlp ingest +verbose

  0.09s Ingesting tokens and sentences into the dataset ...
   |     0.09s Mapping NLP data to nodes and features ...
   |      |     0.00s generating token-nodes with features str, after, empty
   |      |      |    -0.00s 13761 token nodes have values assigned for str, after
   |      |      |     0.00s 202 empty slots have split surrounding tokens
   |      |      |     0.00s 228 space slots have split into chars
   |      |      |     0.00s  6837x Items contained in extra generated text
   |      |     0.03s 13761 tokens
   |      |     0.03s generating sentence-nodes with features nsent
   |      |      |     0.02s 1063 sentence nodes have values assigned for nsent
   |      |      |     0.02s  1034x Items contained in extra generated text
   |      |      |     0.02s    77x Items with empty final text
   |      |     0.05s 1063 sentences
   |     0.15s Make a modified dataset ...
allTokenFeatures=['str', 'after']
allSentenceFeatures=['nsent']
  0.00s Feature overview: 45 for node
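
The log mentions the token features `str` and `after`. The sketch below shows how such a pair of features can represent a text without loss, assuming that `str` holds the text of a token and `after` holds whatever follows it up to the next token. That reading of the feature names is ours, and the regular-expression tokenizer and function names are hypothetical stand-ins.

```python
import re


def token_features(text):
    """Give each token a `str` value and an `after` value.

    `after` is the material between the end of this token and the start
    of the next one (or the end of the text for the last token).
    """
    spans = [(m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]
    feats = []
    for i, (start, end) in enumerate(spans):
        next_start = spans[i + 1][0] if i + 1 < len(spans) else len(text)
        feats.append({"str": text[start:end], "after": text[end:next_start]})
    return feats


def rebuild(feats):
    """Concatenate `str` and `after` of all tokens to get the text back."""
    return "".join(f["str"] + f["after"] for f in feats)


# The sample has no leading whitespace; this sketch would drop it.
sample = "Dear Sir, hello."
feats = token_features(sample)
for f in feats:
    print(f)
print(rebuild(feats) == sample)
```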

### Step 4: Adjust the app to the modified dataset

Several settings in the `config.yaml` and `app.py` of the TF app must be updated, as well
as the documentation file that describes the resulting features.

In [13]:
!python tfFromTei.py apptoken

App updated with tokens and sentences 


## Use the data

In [14]:
from tf.app import use
from tf.convert.addnlp import NLPipeline

In [15]:
ORG = "annotation"
REPO = "mondriaan"

In [16]:
A = use(f"{ORG}/{REPO}:clone", checkout="clone", hoist=globals())

**Locating corpus resources ...**

   |     0.01s T otype                from ~/github/annotation/mondriaan/tf/0.8.2
   |     0.09s T oslots               from ~/github/annotation/mondriaan/tf/0.8.2
   |     0.00s T letter               from ~/github/annotation/mondriaan/tf/0.8.2
   |     0.00s T chunk                from ~/github/annotation/mondriaan/tf/0.8.2
   |     0.03s T after                from ~/github/annotation/mondriaan/tf/0.8.2
   |     0.03s T str                  from ~/github/annotation/mondriaan/tf/0.8.2
   |     0.00s T folder               from ~/github/annotation/mondriaan/tf/0.8.2
   |      |     0.00s C __levels__           from otype, oslots, otext
   |      |     0.09s C __order__            from otype, oslots, __levels__
   |      |     0.00s C __rank__             from otype, __order__
   |      |     0.09s C __levUp__            from otype, oslots, __rank__
   |      |     0.02s C __levDown__          from otype, __levUp__, __rank__
   |      |     0.00s C __characters__       from otext
   | 

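| Name | # of nodes | # slots/node | % coverage |
|:---|---:|---:|---:|
| folder | 1 | 13761.0 | 100 |
| letter | 14 | 982.93 | 100 |
| body | 14 | 849.93 | 86 |
| text | 14 | 849.93 | 86 |
| chunk | 86 | 160.0 | 100 |
| div | 93 | 219.99 | 149 |
| teiHeader | 14 | 124.57 | 13 |
| p | 95 | 73.39 | 51 |
| postscript | 6 | 62.83 | 3 |
| revisionDesc | 14 | 61.0 | 6 |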
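
The three numeric columns of this overview are related. The sketch below checks that the coverage figures are consistent with number of nodes times slots per node, divided by the total number of slots. This is an observation about the numbers shown here, not a statement about how TF computes them; the function name is ours.

```python
# Rows copied from the overview: (type, nodes, slots per node, reported % coverage)
ROWS = [
    ("folder", 1, 13761.0, 100),
    ("letter", 14, 982.93, 100),
    ("body", 14, 849.93, 86),
    ("text", 14, 849.93, 86),
    ("chunk", 86, 160.0, 100),
    ("div", 93, 219.99, 149),
    ("teiHeader", 14, 124.57, 13),
    ("p", 95, 73.39, 51),
    ("postscript", 6, 62.83, 3),
    ("revisionDesc", 14, 61.0, 6),
]

# The single folder node spans every slot, so its size is the slot total.
total_slots = 13761


def estimated_coverage(n_nodes, slots_per_node, total):
    """Percentage of slots covered, counting overlapping nodes repeatedly."""
    return round(100 * n_nodes * slots_per_node / total)


for name, n_nodes, slots_per_node, reported in ROWS:
    estimate = estimated_coverage(n_nodes, slots_per_node, total_slots)
    print(f"{name:<13} estimated {estimate:>3}%  reported {reported:>3}%")
```

Under this reading, a coverage above 100, as for `div`, means that nodes of that type overlap or nest, so that some slots are counted more than once.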


Display a nice chunk:

In [17]:
chunk = A.nodeFromSectionStr("proeftuin@19090421y_IONG_1304:6")
A.plain(chunk)
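
The section string passed to `A.nodeFromSectionStr` appears to combine the three section levels of this corpus: folder, letter and chunk. The sketch below splits such a string into its parts. The separators `@` and `:` and the three-level layout are read off the single example above, and `parse_section` is a hypothetical helper, not a TF function.

```python
def parse_section(section_str, sep1="@", sep2=":"):
    """Split a string of the form folder@letter:chunk into its parts.

    The layout and separators are assumptions based on one example.
    """
    folder, rest = section_str.split(sep1, 1)
    # Split on the last sep2, in case a letter name contains the separator.
    letter, chunk = rest.rsplit(sep2, 1)
    return {"folder": folder, "letter": letter, "chunk": int(chunk)}


parts = parse_section("proeftuin@19090421y_IONG_1304:6")
print(parts)
```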