# Convert from TEI to TF

We show how to convert a TEI data source into TF.

This has two stages:

1. make a preliminary TF dataset with the character as slot type;
1. feed the plain text to a tokeniser and add tokens and sentences to the dataset,
   while removing its character and word nodes;
   the new slot type is token.
   
A dataset based on characters is precise, but rather inefficient.
The second step makes the dataset much more efficient.
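To make the efficiency difference concrete, here is a minimal sketch that counts the slots for one sentence under both slot types. The sample sentence is made up, and the regular expression is only a naive stand-in for the real tokeniser used in the second stage.

```python
import re

# Made-up sample sentence, not taken from the corpus
text = "Dear Sir, I received your letter."

# Character as slot type: every character, including spaces, is a slot
char_slots = list(text)

# Token as slot type: a naive regex tokeniser (an assumption for this sketch)
# that splits the text into words and separate punctuation marks
token_slots = re.findall(r"\w+|[^\w\s]", text)

print(f"{len(char_slots)} character slots versus {len(token_slots)} token slots")
```

Every node and feature in a TF dataset ultimately refers to slots, so fewer slots means less data to store and load.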

**More ways to do it!**

* `convertExpress`: as few commands, as little feedback and interaction as possible
* [convertSteps](convertSteps.ipynb): broken down into a few command-line commands, with more feedback
* [convertDetails](convertDetails.ipynb): run from Python, with full control

## Preliminary conversion

We start with a case where the input does not validate.

In [3]:
!tf-fromtei all

Start folder proeftuin:
   1 MD           letter       md           19090216y_IONG_1303.xml                           
   2 MD           letter       md           19090407y_IONG_1739.xml                           
   3 MD           letter       md           19090421y_IONG_1304.xml                           
   4 MD           letter       md           19090426y_IONG_1738.xml                           
   5 MD           letter       md           19090513y_IONG_1293.xml                           
   6 MD           letter       md           19090624_IONG_1294.xml                            
   7 MD           letter       md           19090807y_IONG_1296.xml                           
   8 MD           letter       md           19090824y_KNAP_1747.xml                           
   9 MD           letter       md           19090905y_IONG_1295.xml                           
  10 MD           letter       md           190909XX_QUER_1654.xml                            
  11 MD           letter  

However, the previous version of the TEI source is correct, so we revert to it by passing `tei=-1`.
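The idea of addressing a version by a relative offset can be sketched as follows. This is an illustration only, not the code of `tf-fromtei`: the function `pick_version`, the version names, and the convention that `0` means the latest version are all assumptions of this sketch.

```python
def pick_version(versions, offset=0):
    """Pick a version from a list sorted from oldest to newest.

    Assumed convention for this sketch: 0 is the latest version,
    -1 the one before it, -2 the one before that, and so on.
    """
    if offset > 0:
        raise ValueError("this sketch only handles 0 and negative offsets")
    index = len(versions) - 1 + offset
    if index < 0:
        raise ValueError(f"there is no version {-offset} step(s) before the latest")
    return versions[index]


# Hypothetical version names, oldest first
versions = ["2023-01-31", "2023-03-15", "2023-04-30"]

latest = pick_version(versions)
previous = pick_version(versions, -1)
print(latest, previous)
```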

In [2]:
!tf-fromtei all tei=-1

Start folder proeftuin:
  14 19100131_SAAL_ARNO_0018.xml                       
End   folder proeftuin

Validation OK
Namespaces OK
Start folder proeftuin:
  14 19100131_SAAL_ARNO_0018.xml                       
End   folder proeftuin

App updated


## Add tokens and sentences

Now we have a preliminary TF dataset to work with.
The next step no longer involves the source TEI.

In [3]:
!addnlp all
!tf-fromtei apptoken

  0.14s Using NLP pipeline Spacy (en) ...
  4.03s NLP done
  0.00s Feature overview: 45 for nodes; 1 for edges; 1 configs; 9 computed
App updated with tokens and sentences 


# Zip the data

This step produces a zip file that can be attached to the latest release, so that TF can download the data smoothly.
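For readers curious about what such a step amounts to, the following is a rough sketch of bundling several folders of a repository into a single zip file with Python's standard library. It is not the implementation of `tf-zipall`; the helper `zip_folders` and the file names in the demo are hypothetical.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_folders(base, folders, target):
    """Write the given subfolders of base into one zip file at target."""
    base = Path(base)
    with zipfile.ZipFile(target, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for folder in folders:
            for path in sorted((base / folder).rglob("*")):
                if path.is_file():
                    # Store paths relative to base, so that the archive
                    # unpacks into the same folder layout
                    zf.write(path, path.relative_to(base).as_posix())
    return target


# Demo on a throwaway directory with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    for rel in ("app/config.yaml", "tf/0.8.2/otype.tf"):
        (base / rel).parent.mkdir(parents=True, exist_ok=True)
        (base / rel).write_text("placeholder")
    archive = zip_folders(base, ["app", "tf"], base / "complete.zip")
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()

print(names)
```

Storing relative paths keeps the archive independent of where the repository happens to live on disk.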

In [4]:
!tf-zipall

loading tf app ...
Data to be zipped:
	OK       app                      (v0.8.2 5843b9)     : ~/github/annotation/mondriaan/app
	OK       main data                (v0.8.2 5843b9)     : ~/github/annotation/mondriaan/tf/0.8.2
	OK       graphics                 (v0.8.2 5843b9)     : ~/github/annotation/mondriaan/illustrations
Writing zip file ...
Result: ~/Downloads/github/annotation/mondriaan/complete.zip


# Inspect

We view the result in the TF browser.

To stop the browser, interrupt the kernel (press `i` twice).

In [None]:
!tf-fromtei browse

This is Text-Fabric 11.4.3
Starting new kernel listening on 10990
Loading data for annotation/mondriaan. Please wait ...
Setting up TF kernel for annotation/mondriaan  
**Locating corpus resources ...**
Using app in ~/github/annotation/mondriaan/app:
	repo clone offline under ~/github (local github)
Using data in ~/github/annotation/mondriaan/tf/0.8.2:
	repo clone offline under ~/github (local github)
Using data in ~/github/annotation/mondriaan/illustrations:
	repo clone offline under ~/github (local github)
<IPython.core.display.HTML object>
TF setup done.
Starting new webserver listening on 20990
 * Running on http://localhost:20990
Press CTRL+C to quit
Opening annotation/mondriaan in browser
Press <Ctrl+C> to stop the TF browser
Kernel listening at port 10990
127.0.0.1 - - [02/May/2023 09:19:04] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [02/May/2023 09:19:04] "GET /server/static/display.css HTTP/1.1" 200 -
127.0.0.1 - - [02/May/2023 09:19:04] "GET /server/static/highlight.css HT