In [1]:
%load_ext autoreload
%autoreload 2

# Convert from TEI to TF

We show how to convert a TEI data source into TF.

This has two stages:

1. make a preliminary TF dataset with the character as slot type
1. feed the plain text to a tokenizer, and add tokens and sentences to the dataset,
   while removing its character and word nodes;
   the new slot type is token.
   
A dataset based on characters is precise, but rather inefficient.
The second step makes the dataset much more efficient.
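
To get a feel for the gain, here is a back-of-the-envelope calculation. It uses the slot counts that this notebook reports further down for the single `folder` node: 63491 slots in the character-based dataset and 13761 slots in the token-based one. The variable names are ours, and nothing here calls Text-Fabric.

```python
# Slot counts reported in this notebook for the `folder` node,
# before tokenization (character slots) and after (token slots).
char_slots = 63491
token_slots = 13761

# Average number of character slots that collapse into one token slot.
ratio = char_slots / token_slots
print(f"about {ratio:.2f} character slots per token slot")
```

Every node that spans text has to list the slots it covers, so a reduction of this size is felt across the whole dataset.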

**More ways to do it!**

* [convertExpress](convertExpress.ipynb): as few commands/feedback/interaction as possible
* [convertSteps](convertSteps.ipynb): broken down into a few command line commands, more feedback
* *convertDetails*: run from Python with full control

## Preliminary conversion

Same as in [convertSteps](convertSteps.ipynb) but now with even more feedback.

### Step 1: Check

Check the input: validity of the TEI-XML.

Make a report of the elements and attributes used.

Use the declared schemas in the XML source to determine which elements have
pure content and which ones mixed content.
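
The difference between pure and mixed content can be illustrated on a small XML instance. This is only a sketch: the converter derives the distinction from the schemas, whereas the hypothetical helper `content_kind` below merely inspects what one concrete element happens to contain. The sample XML is made up.

```python
import xml.etree.ElementTree as ET


def content_kind(elem):
    """Classify one element instance as 'empty', 'text', 'pure' or 'mixed'."""
    children = list(elem)
    # Text directly inside the element: before the first child and after each child.
    text_bits = [elem.text or ""] + [child.tail or "" for child in children]
    has_text = any(bit.strip() for bit in text_bits)
    if not children:
        return "text" if has_text else "empty"
    # Child elements only: pure content; child elements interleaved with text: mixed.
    return "mixed" if has_text else "pure"


sample = ET.fromstring(
    "<div><head>Note</head><p>Dear <name>Zus</name>, see you tomorrow.</p></div>"
)
kinds = {elem.tag: content_kind(elem) for elem in sample.iter()}
print(kinds)
```

Here `div` contains only child elements (pure), while `p` interleaves text with a child element (mixed).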

In [2]:
from tf.convert.tei import TEI

In [3]:
T = TEI(verbose=1)

Working in repository annotation/mondriaan in backend github
TEI data version is 2023-04-25 (most recent)
TF data version is 0.8.2pre (latest)


In [4]:
T.task(check=True)

TEI to TF checking: ~/github/annotation/mondriaan/tei/2023-04-25 => ~/github/annotation/mondriaan/report/2023-04-25
INFO: Needs af.xsd (exists)
INFO: Needs ns2.xsd (exists)
INFO: Needs ns1.xsd (exists)
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/annotation/mondriaan/schema/MD.xsd
	round   1:  49 changes
180 identical override(s)
  6 changing override(s)
	address pure ==> mixed
	postmark complex mixed (added)
	rewrite complex mixed (added)
	sepLine complex pure (added)
	transpose pure ==> mixed
	wbh complex pure (added)
Section model I
Start folder proeftuin:
  14 19100131_SAAL_ARNO_0018.xml                       ERROR

End   folder proeftuin

220 info line(s) written to ~/github/annotation/mondriaan/report/2023-04-25/elements.txt
7 error(s) in 1 file(s) written to ~/github/annotation/mondriaan/report/2023-04-25/errors.txt
60 tags of which 0 with multiple namespaces written to ~/github/annotation/mondriaan/report/2023-04-

False

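The check of the most recent TEI version fails: the task returns `False` and reports 7 errors in 1 file.
We therefore revert to the previous version by passing `tei=-1`: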

In [5]:
T = TEI(verbose=1, tei=-1)
T.task(check=True)

Working in repository annotation/mondriaan in backend github
TEI data version is 2023-03-20 (previous)
TF data version is 0.8.2pre (latest)
TEI to TF checking: ~/github/annotation/mondriaan/tei/2023-03-20 => ~/github/annotation/mondriaan/report/2023-03-20
INFO: Needs af.xsd (exists)
INFO: Needs ns2.xsd (exists)
INFO: Needs ns1.xsd (exists)
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/annotation/mondriaan/schema/MD.xsd
	round   1:  49 changes
180 identical override(s)
  6 changing override(s)
	address pure ==> mixed
	postmark complex mixed (added)
	rewrite complex mixed (added)
	sepLine complex pure (added)
	transpose pure ==> mixed
	wbh complex pure (added)
Section model I
Start folder proeftuin:
  14 19100131_SAAL_ARNO_0018.xml                       
End   folder proeftuin

217 info line(s) written to ~/github/annotation/mondriaan/report/2023-03-20/elements.txt
0 error(s) in 0 file(s) written to ~/github/annotation/mondr

True

### Step 2: Convert

Run the actual conversion and produce TF output.

In [15]:
T.task(convert=True)

TEI to TF converting: ~/github/annotation/mondriaan/tei/2023-03-20 => ~/github/annotation/mondriaan/tf/0.8.2pre
  0.00s Not all of the warp features otype and oslots are present in
~/github/annotation/mondriaan/tf/0.8.2pre
  0.00s Only the Feature and Edge APIs will be enabled
  0.00s Warp feature "otext" not found. Working without Text-API

  0.00s Importing data from walking through the source ...
   |     0.00s Preparing metadata... 
   |     0.00s No structure nodes will be set up
   |   SECTION   TYPES:    folder, letter, chunk
   |   SECTION   FEATURES: folder, letter, chunk
   |   STRUCTURE TYPES:    
   |   STRUCTURE FEATURES: 
   |   TEXT      FEATURES:
   |      |   text-orig-full       ch
   |     0.00s OK
   |     0.00s Following director... 
Start folder proeftuin:
  14 19100131_SAAL_ARNO_0018.xml                       
End   folder proeftuin

source reading done
   |     0.23s "edge" actions: 0
   |     0.23s "feature" actions: 123262
   |     0.23s "node" actions: 12921


True

### Step 3: Load the TF data

The final proof that the conversion has worked is to load the data.
On first-time loading several checks and precomputations are performed.
Next time the loading will be much quicker.

In [16]:
T.task(load=True, verbose=1)

   |     0.02s T otype                from ~/github/annotation/mondriaan/tf/0.8.2pre
   |     0.22s T oslots               from ~/github/annotation/mondriaan/tf/0.8.2pre
   |     0.13s T ch                   from ~/github/annotation/mondriaan/tf/0.8.2pre
   |     0.00s T letter               from ~/github/annotation/mondriaan/tf/0.8.2pre
   |     0.00s T folder               from ~/github/annotation/mondriaan/tf/0.8.2pre
   |     0.00s T chunk                from ~/github/annotation/mondriaan/tf/0.8.2pre
   |      |     0.00s C __levels__           from otype, oslots, otext
   |      |     0.43s C __order__            from otype, oslots, __levels__
   |      |     0.01s C __rank__             from otype, __order__
   |      |     0.43s C __levUp__            from otype, oslots, __rank__
   |      |     0.10s C __levDown__          from otype, __levUp__, __rank__
   |      |     0.01s C __characters__       from otext
   |      |     0.15s C __boundary__         from otype, oslots, __ra

True

### Step 4: Configure a TF app

The TF app has configuration settings, a bit of custom code, and documentation.

Most of it will be generated now, but there are ways to keep custom additions intact.

In [17]:
T.task(app=True)

App updating  ...
	~/github/annotation/mondriaan/docs/about.md (generated with custom info)
	~/github/annotation/mondriaan/docs/transcription.md (no custom info, older orginal exists)
	~/github/annotation/mondriaan/app/static/logo.png (already exists, not overwritten)
	~/github/annotation/mondriaan/app/static/display.css (generated with custom info)
	~/github/annotation/mondriaan/app/config.yaml (generated with custom info)
	~/github/annotation/mondriaan/app/app.py (generated with custom info)
Done


True

## View the preliminary result

As a final proof, we load the app:

In [2]:
from tf.app import use

In [3]:
ORG = "annotation"
REPO = "mondriaan"

In [20]:
Apre = use(f"{ORG}/{REPO}:clone", checkout="clone")

**Locating corpus resources ...**

Name,# of nodes,# slots/node,% coverage
folder,1,63491.0,100
letter,14,4535.07,100
body,14,3880.0,86
text,14,3880.0,86
chunk,86,738.26,100
div,93,1013.75,148
teiHeader,14,646.64,14
revisionDesc,14,368.07,8
p,95,324.6,49
postscript,6,250.83,2


### Show a fragment

In [21]:
chunk = Apre.api.F.otype.s("chunk")[4]
Apre.plain(chunk)

## Add tokens and sentences

We add tokens and sentences to the TF dataset.

We do this in the following steps:

1. Generate a plain text plus a mapping between character positions and nodes
2. Use Spacy to tokenize the text and to determine sentence boundaries
3. Translate the Spacy results back to extra nodes and features for the TF set
4. Replace the character slots in the TF set by tokens

### Step by step from Python

We carry out the steps from within Python.

In that way we get access to all intermediate results, and we can play and explore between the steps.

We load the data we have so far, and pass it on to an `NLPipeline` object, defined by Text-Fabric.

In [3]:
from tf.app import use
from tf.convert.addnlp import NLPipeline

In [4]:
ORG = "annotation"
REPO = "mondriaan"

### Back to the previous state

When we have added the data to the dataset, we will tweak the TF app.

But if we want to redo the pipeline, we have to restore the app to the situation before
the tokens and sentences were added.

That's the reason we have the next cell.

In [24]:
T.task(app=True)

App updating  ...
	~/github/annotation/mondriaan/docs/about.md (generated with custom info)
	~/github/annotation/mondriaan/docs/transcription.md (no custom info, older orginal exists)
	~/github/annotation/mondriaan/app/static/logo.png (already exists, not overwritten)
	~/github/annotation/mondriaan/app/static/display.css (generated with custom info)
	~/github/annotation/mondriaan/app/config.yaml (generated with custom info)
	~/github/annotation/mondriaan/app/app.py (generated with custom info)
Done


True

In [25]:
Apre = use(f"{ORG}/{REPO}:clone", checkout="clone")
NLP = NLPipeline(lang="en", verbose=0, write=True)
NLP.loadApp(Apre)

**Locating corpus resources ...**

Name,# of nodes,# slots/node,% coverage
folder,1,63491.0,100
letter,14,4535.07,100
body,14,3880.0,86
text,14,3880.0,86
chunk,86,738.26,100
div,93,1013.75,148
teiHeader,14,646.64,14
revisionDesc,14,368.07,8
p,95,324.6,49
postscript,6,250.83,2


### Before the steps

We can set the verbosity as we like.

Generate plain text (add `verbose=-1` or `0` or `1` and/or `write=True` if you like).

* `verbose=-1` is the same as `-verbose`
* `verbose=0` is the same as `+verbose`
* `verbose=1` is the same as `++verbose`

### Step 1: Generate a plain text of the whole corpus

The function delivers the text in a variable, and it has recorded which character positions correspond
to which slots in the TF dataset.

We receive both items of data.
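
The idea of a text plus a position record can be sketched on a toy corpus. The data layout is an assumption made for illustration: the real `positions` object returned by `NLP.task` may be organized differently. The helper `make_plain_text` and the six-slot corpus are hypothetical; the `Begin chunk. ` marker mimics the generated text that shows up in the sentence output later in this notebook.

```python
# Hypothetical miniature corpus: slot number -> character (slots start at 1).
slot_chars = {1: "A", 2: "a", 3: " ", 4: "b", 5: "b", 6: "."}


def make_plain_text(slot_chars, prefix="Begin chunk. "):
    """Return (text, positions); positions[i] is the slot behind text[i], or None."""
    text_parts = [prefix]
    # Generated text has no counterpart in the corpus, so it maps to no slot.
    positions = [None] * len(prefix)
    for slot in sorted(slot_chars):
        text_parts.append(slot_chars[slot])
        positions.append(slot)
    return "".join(text_parts), positions


demo_text, demo_positions = make_plain_text(slot_chars)
print(demo_text)
print(demo_positions)
```

With such a record, any character offset that the NLP tool reports can be traced back to a slot, or recognized as lying in generated text.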

In [26]:
(text, positions) = NLP.task(plaintext=True)

  0.00s Generating a plain text with positions ...
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
   |   Found 202 empty slots
   |   recorded flow main       with 127818 items
   |   recorded flow del        with    159 items
   |   recorded flow note       with  64164 items
   |   recorded flow orig       with    227 items
  0.18s Done. Generated text and positions written to ~/github/annotation/mondriaan/_temp/txt/plain.txt


### Step 2: Run Spacy to get tokens and sentences

Now we feed the text from step 1 into the NLP pipeline, which is Spacy.

We get a list of tokens and a list of sentences back.

In [27]:
(tokens, sentences) = NLP.task(lingo=True, text=text)

  0.00s Using NLP pipeline Spacy (en) ...
  3.04s NLP done


Let's examine a few tokens and sentences:

In [28]:
for token in tokens[400:410]:
    print(token)

(1568, 1573, 'kwart', ' ')
(1574, 1578, 'voor', ' ')
(1579, 1583, 'acht', ' ')
(1584, 1590, 'ingang', ' ')
(1591, 1597, 'kleine', ' ')
(1598, 1602, 'zaal', ' ')
(1603, 1610, 'Concert', '')
(1610, 1611, '-', '')
(1611, 1617, 'gebouw', '')
(1617, 1618, ',', ' ')


Each token entry specifies the start and end position in the plain text file,
the string value of the token, and the whitespace after the token, if any.
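
The ten entries printed above can be checked against this description. The sketch below copies them verbatim and tests two properties that hold for this fragment: the end position is exclusive, so `end - start` equals the length of the string, and each token plus its trailing whitespace runs up to the start of the next token. Whether the second property holds for every token in the corpus is an assumption we do not verify here; `find_inconsistencies` is a hypothetical helper, not part of Text-Fabric.

```python
# Token entries copied from the output above: (start, end, string, whitespace after).
sample_tokens = [
    (1568, 1573, "kwart", " "),
    (1574, 1578, "voor", " "),
    (1579, 1583, "acht", " "),
    (1584, 1590, "ingang", " "),
    (1591, 1597, "kleine", " "),
    (1598, 1602, "zaal", " "),
    (1603, 1610, "Concert", ""),
    (1610, 1611, "-", ""),
    (1611, 1617, "gebouw", ""),
    (1617, 1618, ",", " "),
]


def find_inconsistencies(tokens):
    """Return a list of problems; an empty list means the entries are consistent."""
    problems = []
    for i, (start, end, string, after) in enumerate(tokens):
        if end - start != len(string):
            problems.append(f"token {i}: span length differs from string length")
        if i + 1 < len(tokens) and end + len(after) != tokens[i + 1][0]:
            problems.append(f"token {i}: does not connect to the next token")
    return problems


# Gluing strings and trailing whitespace together restores the text fragment.
fragment = "".join(string + after for (_, _, string, after) in sample_tokens)
print(find_inconsistencies(sample_tokens))
print(repr(fragment))
```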

In [29]:
for sentence in sentences[390:410]:
    print(sentence)

(11144, 11152, 'Aa bb. ')
(11152, 11172, 'End div.   xxx. ')
(11172, 11180, 'Aa bb. ')
(11180, 11201, 'End div.   xxx. ')
(11201, 11209, 'Aa bb. ')
(11209, 11231, 'End chunk.   xxx. ')
(11231, 11239, 'Aa bb. ')
(11239, 11255, 'End letter.  ')
(11255, 11263, 'Aa bb. ')
(11263, 11281, 'Begin letter.  ')
(11281, 11289, 'Aa bb. ')
(11289, 11305, 'Begin meta.  ')
(11305, 11313, 'Aa bb. ')
(11313, 11348, 'Begin chunk. fileDesc. titleStmt.')
(11349, 11355, 'title.')
(11356, 11382, 'Brief aan Aletta de Iongh.')
(11383, 11424, 'Amsterdam, donderdag 13 mei 1909.editor.')
(11425, 11447, 'Wietse Coppes.editor.')
(11448, 11468, 'Leo Jansen.sponsor.')
(11469, 11494, 'Mondriaan Editieproject.')


Sentence entries have the same fields as token entries, but without the final whitespace field.

Actually, the program will not use the texts of tokens and sentences for display, only for determining
where the boundaries are.

With those boundaries in hand, the texts of tokens and sentences are read off from the original corpus.

### Step 3: Ingest the results in the data set

A lot of critical things happen when we ingest the token and sentence streams into our dataset.

We calculate slot positions, retrieve text, split some tokens, and last but not least,
we replace the character-by-character basis of the preliminary dataset by a token-by-token basis.
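
The change of basis can be sketched on toy data. This is a simplified illustration under our own assumptions, not the algorithm of `NLPipeline`: it ignores the splitting of tokens at empty slots and other refinements that the log of the real run mentions. All names and data below are hypothetical.

```python
# Hypothetical miniature data; none of it comes from the real corpus.
demo_text = "Begin p. Aa bb."
# positions[i] is the character slot behind demo_text[i]; None marks generated text.
positions = [None] * 9 + [1, 2, 3, 4, 5, 6]
# Token boundaries as (start, end) offsets into demo_text, end exclusive.
token_spans = [(0, 5), (6, 7), (7, 8), (9, 11), (12, 14), (14, 15)]
# Nodes of the character-based dataset, given as sets of character slots.
char_nodes = {"p": {1, 2, 3, 4, 5, 6}, "hi": {4, 5}}


def rebase_to_tokens(token_spans, positions, char_nodes):
    """Renumber slots so that tokens, not characters, are the slots.

    Returns (token_slots, token_nodes): the character slots covered by each
    new token slot, and every node re-expressed as a set of token slots.
    """
    token_slots = {}  # new token slot -> character slots it covers
    char_to_token = {}  # old character slot -> new token slot
    for start, end in token_spans:
        covered = [slot for slot in positions[start:end] if slot is not None]
        if not covered:
            continue  # token lies wholly in generated text: it gets no slot
        new_slot = len(token_slots) + 1
        token_slots[new_slot] = covered
        for slot in covered:
            char_to_token[slot] = new_slot
    token_nodes = {
        name: {char_to_token[s] for s in slots if s in char_to_token}
        for name, slots in char_nodes.items()
    }
    return token_slots, token_nodes


token_slots, token_nodes = rebase_to_tokens(token_spans, positions, char_nodes)
print(token_slots)
print(token_nodes)
```

The whitespace at character slot 3 ends up in no token slot; in the real dataset such whitespace is kept in the `after` feature of the preceding token.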

In [30]:
newVersion = NLP.task(
    ingest=True,
    positions=positions,
    tokens=tokens,
    sentences=sentences,
)

  0.00s Ingesting tokens and sentences into the dataset ...
   |       11s Mapping NLP data to nodes and features ...
   |      |    -0.00s generating token-nodes with features str, after, empty
   |      |      |    -0.00s 13761 token nodes have values assigned for str, after
   |      |      |     0.00s 202 empty slots have split surrounding tokens
   |      |      |     0.00s 228 space slots have split into chars
   |      |      |     0.00s  6837x Items contained in extra generated text
   |      |     0.04s 13761 tokens
   |      |     0.04s generating sentence-nodes with features 
   |      |      |     0.02s 1063 sentence nodes have values assigned for 
   |      |      |     0.02s  1034x Items contained in extra generated text
   |      |      |     0.02s    77x Items with empty final text
   |      |     0.07s 1063 sentences
   |       11s Make a modified dataset ...
  0.00s Feature overview: 45 for nodes; 1 for edges; 1 configs; 9 computed
   |       11s Done
  0.32s Enriched

### Step 4: Adjust the app to the modified dataset

Various things in the `config.yaml` and `app.py` of the TF app should be updated, as well
as the documentation file that gives the ins and outs of the resulting features.

In [31]:
T.task(apptoken=True)

App updating  with tokens and sentences  ...
	~/github/annotation/mondriaan/docs/about.md (generated with custom info)
	~/github/annotation/mondriaan/docs/transcription.md (no custom info, older orginal exists)
	~/github/annotation/mondriaan/app/static/logo.png (already exists, not overwritten)
	~/github/annotation/mondriaan/app/static/display.css (generated with custom info)
	~/github/annotation/mondriaan/app/config.yaml (generated with custom info)
	~/github/annotation/mondriaan/app/app.py (generated with custom info)
Done


True

# Use the new dataset

We can now use the resulting dataset in the usual way.
Because we have adapted the TF app, the version without the `pre` will now be loaded.

In [4]:
A = use(f"{ORG}/{REPO}:clone", checkout="clone", hoist=globals())

**Locating corpus resources ...**

Name,# of nodes,# slots/node,% coverage
folder,1,13761.0,100
letter,14,982.93,100
body,14,849.93,86
text,14,849.93,86
chunk,86,160.0,100
div,93,219.99,149
teiHeader,14,124.57,13
p,95,73.39,51
postscript,6,62.83,3
revisionDesc,14,61.0,6


# Zip the data

This is for producing a zip file to attach to the latest release, so that TF can download the data smoothly.

In [33]:
A.zipAll()

Data to be zipped:
	OK       app                      (v0.8.2 29a586)     : ~/github/annotation/mondriaan/app
	OK       main data                (v0.8.2 29a586)     : ~/github/annotation/mondriaan/tf/0.8.2
	OK       graphics                 (v0.8.2 29a586)     : ~/github/annotation/mondriaan/illustrations
Writing zip file ...
Result: ~/Downloads/github/annotation/mondriaan/complete.zip


# Exploration

We walk around a bit more in the corpus.

## All titles

In [34]:
for t in F.otype.s("titleStmt"):
    print(t, T.text(t))

15545 Brief aan Aletta de Iongh. Amsterdam, dinsdag 16 februari, dinsdag 2 maart of dinsdag 9 maart 1909.
Wietse Coppes
Leo Jansen
Mondriaan Editieproject

15546 Brief aan Aletta de Iongh. Amsterdam, woensdag 7 april 1909.
Wietse Coppes
Leo Jansen
Mondriaan Editieproject

15547 Brief aan Aletta de Iongh. Amsterdam, tussen maandag 19 en vrijdag 23 april 1909.
Wietse Coppes
Leo Jansen
Mondriaan Editieproject

15548 Brief aan Aletta de Iongh. Amsterdam, maandag 26 april 1909.
Wietse Coppes
Leo Jansen
Mondriaan Editieproject

15549 Brief aan Aletta de Iongh. Amsterdam, donderdag 13 mei 1909.
Wietse Coppes
Leo Jansen
Mondriaan Editieproject

15550 Brief aan Aletta de Iongh. Amsterdam, donderdag 24 juni 1909.
Wietse Coppes
Leo Jansen
Mondriaan Editieproject

15551 Brief aan Aletta de Iongh. Amsterdam, eerste helft augustus 1909.
Wietse Coppes
Leo Jansen
Mondriaan Editieproject

15552  Briefkaart aan Gerrit Willem Knap. Zoutelande, c. dinsdag 24 augustus 1909.
Wietse Coppes
Leo Jansen
Mondria

## Sentences

In [35]:
for s in F.otype.s("sentence")[2:4]:
    print(T.text(s))

Wietse Coppes
Leo Jansen


In [36]:
for s in F.otype.s("sentence")[2:4]:
    A.pretty(s, withNodes=True)

In [37]:
for (i, s) in enumerate(F.otype.s("sentence")[0:100]):
    print(f"SENTENCE {i + 1}: {T.text(s)}")

SENTENCE 1: Brief aan Aletta de Iongh. 
SENTENCE 2: Amsterdam, dinsdag 16 februari, dinsdag 2 maart of dinsdag 9 maart 1909.
SENTENCE 3: Wietse Coppes
SENTENCE 4: Leo Jansen
SENTENCE 5: Mondriaan Editieproject
SENTENCE 6: Nederland
SENTENCE 7: Otterlo
SENTENCE 8: Kröller Müller Museum
SENTENCE 9: KM 123.397
SENTENCE 10: 19090216y_IONG_1303
SENTENCE 11: ​
SENTENCE 12: ​
SENTENCE 13: ​
SENTENCE 14: Piet Mondriaan
SENTENCE 15: dinsdag 16 februari, dinsdag 2 maart of dinsdag 9 maart 1909
SENTENCE 16: Amsterdam
SENTENCE 17: Aletta de Iongh
SENTENCE 18: transcriptie: voltooid 20.7.15
SENTENCE 19: collatie bron: 6.6.16
SENTENCE 20: tweede collatie aan het origineel: voltooid 26.11.19
SENTENCE 21: invoer tweede collatie: voltooid 5.8.16
SENTENCE 22: bespreking eindversie: gb
SENTENCE 23: markeren annotaties: in bewerking / voltooid
SENTENCE 24: gereed 17.4.2019
SENTENCE 25: titel gecontroleerd 21.09.2020
SENTENCE 26: personen getagd 12.10.2020
SENTENCE 27: vertaling ingevoerd 16.2.2021
SENTENC

# Illustrations

In [5]:
results = A.search("""
rs type=artwork-m key~[0-9]
""")

  0.00s 11 results


In [7]:
A.show(results, withNodes=True, end=1)

## The first letter

In [38]:
A.pretty(F.otype.s("letter")[0], full=True, withNodes=False)

## Overlapping divs

There are overlapping divs!
Let's find them all.

First the total number of divs:

In [39]:
len(F.otype.s("div"))

93

In [40]:
query = """
d1:div
&& d2:div

d1 < d2
"""

results = A.search(query)

  0.01s 69 results
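
The query above asks for pairs of divs that have at least one slot in common (`&&`), with the first div coming before the second (`<`) so that each pair is reported once. The same idea can be sketched in plain Python on made-up data. The node numbers and slot sets are hypothetical, and ordering by node number is a simplification standing in for the ordering that TF uses.

```python
from itertools import combinations

# Hypothetical divs, each given as the set of slots it occupies.
divs = {
    101: set(range(1, 11)),   # slots 1..10
    102: set(range(1, 4)),    # slots 1..3, nested inside 101
    103: set(range(8, 15)),   # slots 8..14, crosses the end of 101
    104: set(range(20, 25)),  # slots 20..24, shares nothing
}


def overlapping_pairs(nodes):
    """Return pairs (a, b) with a before b in node order that share a slot."""
    return [
        (a, b)
        for a, b in combinations(sorted(nodes), 2)
        if nodes[a] & nodes[b]
    ]


pairs = overlapping_pairs(divs)
print(pairs)
```

Note that a div nested inside another div also shares slots with it, so in this sketch nested pairs are counted along with crossing ones.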


In [41]:
A.table(results, end=2)

n,p,div,div.1
1,proeftuin@19090216y_IONG_1303:6,"Manuscript. De brief is geschreven op een dinsdag voorafgaand aan een van de drie woensdagen waarop Richard Buhlig on 1909 in de kleine zaal van het Concertgebouw een concert zou geven: 17 februari, 3 maart en 10 maart. Zie tevens noot 1. ​Aletta werd binnen haar familie aangesproken met de koosnaam 'Zus' (Heteren 2018, p. 25) Mondriaans gebruik van deze naam geeft aan dat hij op intieme voet stond met De Iongh. De Amerikaanse pianist Richard Moritz Buhlig (1880-1952) gaf begin 1909 drie concerten in de kleine zaal van het Concertgebouw, op woensdag 17 februari, woensdag 3 maart en woensdag 10 maart. Het is niet bekend voor welke van deze concerten Mondriaan kaarten had. Dear Zus,​ If you come to the entrance to the small auditorium in the Concertgebouw at a quarter to eight tomorrow (Wednesday) evening, I have a ticket for van Buhlig for you.[2] And then we can arrange a time other than Thursday afternoon because I can’t manage that. With my very best wishes, your Piet.","Manuscript. De brief is geschreven op een dinsdag voorafgaand aan een van de drie woensdagen waarop Richard Buhlig on 1909 in de kleine zaal van het Concertgebouw een concert zou geven: 17 februari, 3 maart en 10 maart. Zie tevens noot 1."
2,proeftuin@19090216y_IONG_1303:6,"Manuscript. De brief is geschreven op een dinsdag voorafgaand aan een van de drie woensdagen waarop Richard Buhlig on 1909 in de kleine zaal van het Concertgebouw een concert zou geven: 17 februari, 3 maart en 10 maart. Zie tevens noot 1. ​Aletta werd binnen haar familie aangesproken met de koosnaam 'Zus' (Heteren 2018, p. 25) Mondriaans gebruik van deze naam geeft aan dat hij op intieme voet stond met De Iongh. De Amerikaanse pianist Richard Moritz Buhlig (1880-1952) gaf begin 1909 drie concerten in de kleine zaal van het Concertgebouw, op woensdag 17 februari, woensdag 3 maart en woensdag 10 maart. Het is niet bekend voor welke van deze concerten Mondriaan kaarten had. Dear Zus,​ If you come to the entrance to the small auditorium in the Concertgebouw at a quarter to eight tomorrow (Wednesday) evening, I have a ticket for van Buhlig for you.[2] And then we can arrange a time other than Thursday afternoon because I can’t manage that. With my very best wishes, your Piet.",​


### Notes

In [42]:
for (i, nn) in enumerate(F.otype.s("note")[4:5]):
    A.dm(f"### Note {i + 1}\n\n")
    # look up the chunk that contains the first token of the note
    tokens = L.d(nn, otype="token")
    chunk = L.u(tokens[0], otype="chunk")[0]
    A.pretty(nn, withNodes=True, full=True)
    A.pretty(chunk, withNodes=True, full=True)

### Note 1

