In [1]:
%load_ext autoreload
%autoreload 2

# Convert from TEI to TF

We show how to convert a TEI data source into TF.

This has two stages:

1. make a preliminary TF dataset with the character as slot type
1. feed the plain text to a tokenizer, and add tokens and sentences to the dataset,
   while removing its character and word nodes;
   the new slot type is token.
   
A dataset based on characters is precise, but rather inefficient.
The second step makes the dataset much more efficient.
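
To get a feel for the difference, here is a minimal sketch that counts slots for one sentence under both slot types. It is an illustration only: a simple regular expression stands in for the real tokenizer, and the names `char_slots` and `token_slots` are made up for this example.

```python
import re

# Sample sentence, taken from the letter that is printed later in this notebook.
sample = "Met vele beste groeten je Piet."

# Character-based dataset: every character is a slot.
char_slots = list(sample)

# Token-based dataset: every token is a slot.
# Assumption: this regex is a crude stand-in for the tokenizer used later on.
token_slots = re.findall(r"\w+|[^\w\s]", sample)

print(len(char_slots), "character slots")
print(len(token_slots), "token slots")
print(token_slots)
```

Every node that is linked to this sentence has to refer to far fewer slots in the token-based variant.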

## Preliminary conversion

For this we have a program in this directory, `tfFromTei.py`, which invokes the TF machinery that does the hard work.

We do it step by step.

First we check the input and make an inventory of all elements and attributes in it.

In [2]:
!python tfFromTei.py check +verbose

INFO: Needs af.xsd (exists)
INFO: Needs ns2.xsd (exists)
INFO: Needs ns1.xsd (exists)
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
Analysing ~/git.diginfra.net/mondriaan/letters/schema/MD.xsd
  6 changing override(s)
	address pure ==> mixed
	postmark complex mixed (added)
	rewrite complex mixed (added)
	sepLine complex pure (added)
	transpose pure ==> mixed
	wbh complex pure (added)
Start folder proeftuin:
  14 19100131_SAAL_ARNO_0018.xml                       
End   folder proeftuin

217 info line(s) written to ~/git.diginfra.net/mondriaan/letters/report/elements.txt
0 error(s) in 0 file(s) written to ~/git.diginfra.net/mondriaan/letters/report/errors.txt
59 tags of which 0 with multiple namespaces written to ~/git.diginfra.net/mondriaan/letters/report/namespaces.txt


We can regulate the verbosity:

* `-verbose`: minimal feedback,
* `+verbose`: moderate amount of feedback,
* `++verbose`: maximal feedback.

The actual conversion:

In [3]:
!python tfFromTei.py convert +verbose

  0.00s Importing data from walking through the source ...
   |     0.00s Preparing metadata... 
   |     0.00s OK
   |     0.00s Following director... 
Start folder proeftuin:
  14 19100131_SAAL_ARNO_0018.xml                       
End   folder proeftuin

   |     0.20s "edge" actions: 0
   |     0.20s "feature" actions: 123262
   |     0.20s "node" actions: 12921
   |     0.20s "resume" actions: 0
   |     0.20s "slot" actions: 63491
   |     0.20s "terminate" actions: 12921
   |      76412 nodes of all types
   |     0.20s OK
   |     0.00s checking for nodes and edges ... 
   |     0.00s OK
   |     0.00s checking (section) features ... 
   |     0.00s OK
   |     0.00s reordering nodes ...
   |     0.04s Max node = 76412
   |     0.04s OK
   |     0.00s reassigning feature values ...
   |     0.01s OK


We load the generated TF for the first time.
This is:

1. a check that we have generated valid TF
1. a precomputation step that makes loading the dataset faster next time.

In [4]:
!python tfFromTei.py load +verbose

max node = 76412


We generate a TF app for the dataset:

In [4]:
!python tfFromTei.py app +force

App updated


We view the result in the TF browser right from here.

To stop the browser, interrupt the kernel (press `i` twice).

In [8]:
!python tfFromTei.py browse

This is Text-Fabric 11.3.1
Starting new kernel listening on 14430
Loading data for mondriaan/letters. Please wait ...
Setting up TF kernel for mondriaan/letters   version 0.8.1pre
**Locating corpus resources ...**
Using app in ~/git.diginfra.net/mondriaan/letters/app:
	repo clone offline under ~/git.diginfra.net (local github)
Using data in ~/git.diginfra.net/mondriaan/letters/tf/0.8.1pre:
	repo clone offline under ~/git.diginfra.net (local github)
Using data in ~/git.diginfra.net/mondriaan/letters/illustrations:
	repo clone offline under ~/git.diginfra.net (local github)
<IPython.core.display.HTML object>
TF setup done.
Starting new webserver listening on 24430
 * Running on http://localhost:24430
Press CTRL+C to quit
Opening mondriaan/letters in browser
Press <Ctrl+C> to stop the TF browser
Kernel listening at port 14430
127.0.0.1 - - [21/Apr/2023 12:29:17] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [21/Apr/2023 12:29:17] "GET /server/static/index.css HTTP/1.1" 304 -
127.

## View the preliminary result

As a final proof, we load the app:

In [6]:
from tf.app import use

In [7]:
ORG = "mondriaan"
REPO = "letters"
BACKEND = "git.diginfra.net"

In [8]:
Apre = use(f"{ORG}/{REPO}:clone", checkout="clone", backend=BACKEND, hoist=globals())

**Locating corpus resources ...**

Name,# of nodes,# slots/node,% coverage
folder,1,63491.0,100
letter,14,4535.07,100
body,14,3880.0,86
text,14,3880.0,86
chunk,86,738.26,100
div,93,1013.75,148
teiHeader,14,646.64,14
revisionDesc,14,368.07,8
p,95,324.6,49
postscript,6,250.83,2


In [16]:
c = F.otype.s("chunk")[4]

In [17]:
print(T.text(c))


Beste Zus,
​kom je morgenavond (Woensdag) om kwart voor acht ingang kleine zaal Concertgebouw Concert-gebouw,  dan heb ik een plaats voor v v.  Bulhlig voor je.​ En dan kunnen we een andere dan Donderdagmiddag afspreken want dan kan ik niet goed.
Met vele beste groeten je Piet.




In [18]:
Apre.plain(c)

## Add tokens and sentences

We add tokens and sentences to the TF dataset.

We do this in the following steps:

1. generate a plain text plus a mapping between character positions and nodes
2. use Spacy to tokenize the text and to determine sentence boundaries
3. translate the Spacy results back to extra nodes and features for the TF set
4. replace the character slots in the TF set by tokens
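
The idea behind these steps can be sketched in plain Python. This is not the code that `addnlp` runs: the miniature dataset, the regular expression that stands in for Spacy, and all variable names are assumptions made for the sake of illustration.

```python
import re

# Hypothetical miniature dataset: one slot node per character, numbered from 1.
text = "Beste Zus, kom je morgenavond"

# Step 1: plain text plus a mapping from text positions to slot nodes.
slot_of_pos = {pos: pos + 1 for pos in range(len(text))}

# Step 2: tokenize, keeping the character span of each token.
# Assumption: a regex stands in for the Spacy tokenizer.
token_spans = [
    (m.start(), m.end(), m.group()) for m in re.finditer(r"\w+|[^\w\s]", text)
]

# Step 3: translate the spans back to the character slots they cover.
token_slots = [
    (tok, [slot_of_pos[p] for p in range(start, end)])
    for (start, end, tok) in token_spans
]

# Step 4: renumber, so that the tokens become the new slots.
new_slot = {i + 1: tok for (i, (tok, _)) in enumerate(token_slots)}

print(token_slots[0])
print(new_slot)
```

The position mapping is what makes step 3 possible: without it, the tokenizer output could not be tied back to nodes in the dataset.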

### All at once on the commandline

The whole process goes more smoothly and quickly if you do all steps in a single run:

In [9]:
!python tfFromTei.py app +force
!addnlp all
!python tfFromTei.py apptoken +force
!python tfFromTei.py load

App updated
  0.12s Using NLP pipeline Spacy (may take a while)...
  2.69s NLP done
  0.00s Feature overview: 45 for nodes; 1 for edges; 1 configs; 9 computed
App updated adapted to tokens and sentences


The disadvantage is that every time this cell is executed, the costly
Spacy pipeline has to run.

It is possible to do it more efficiently, by directly running the tasks from
Python.

### Step-by-step on the commandline

In [15]:
!python tfFromTei.py app +force # needed when we repeat the NLP addition

App updated


First step: generate a plain text of the whole corpus, while remembering the node positions.

Take care to take out the material that is out-of-flow, for example the notes.

All notes are collected, separated by a string with a `.` in it, and put at the end of the plain text.

Then Spacy will not confuse sentences within notes with the sentences in the context of the notes.

Later on, once all sentences have been detected, the notes will move to their original places.
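
A minimal sketch of this separation of flows, under stated assumptions: the input pieces, the flow labels and the separator string are invented for this example, and the real implementation also records positions, which is left out here.

```python
# Hypothetical input: pieces of text, each tagged with the flow it belongs to.
pieces = [
    ("main", "I have a ticket for you"),
    ("note", "Buhlig gave three concerts"),
    ("main", " tomorrow evening."),
]

# Assumption: the actual separator is some string that contains a full stop.
SEP = " . "

# The main flow is kept together, without the interrupting note.
main_text = "".join(text for (flow, text) in pieces if flow == "main")
notes = [text for (flow, text) in pieces if flow == "note"]

# The notes follow at the end, each preceded by the separator.
plain = main_text + "".join(SEP + note for note in notes)
print(plain)
```

Because the note no longer sits in the middle of the sentence, a sentence splitter sees the main sentence as one uninterrupted unit.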

In [16]:
!addnlp plaintext +write +verbose

  0.00s Generating a plain text with positions ...
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
   |   Found 202 empty slots
   |   recorded flow MAIN       with 127616 items
   |   recorded flow del        with    159 items
   |   recorded flow note       with  64164 items
   |   recorded flow orig       with    227 items
  0.13s Done. Generated text and positions written to ~/git.diginfra.net/mondriaan/letters/_temp/txt/plain.txt


We added `+write` to write out the plain text and the node positions,
so that it can be used in the next step.

If you call these functions in a Python script, you can capture their results
and pass them to the next step, without writing material to disk.

**NB: you can add `+verbose` if you want more messages.**

Second step: run Spacy on the plain text to get tokens and sentences.
This is costly.

When you are still in iterative/exploring mode, you want to avoid running this step all the time.

That's why we take a look at the body of that function and run its steps one by one, here in this notebook.

In [17]:
!addnlp lingo +write +verbose

  0.00s Using NLP pipeline Spacy (may take a while)...
  2.53s NLP done


Final step: ingest the results in the data set and replace the character slots
by token slots.

We omit the `+write` because the new data set will be written anyway.

In [18]:
!addnlp ingest +verbose

  0.09s Ingesting tokens and sentences into the dataset ...
   |     0.09s Mapping NLP data to nodes and features ...
   |      |     0.00s generating token-nodes with features str, after, empty
   |      |      |    -0.00s 13761 token nodes have values assigned for str, after
   |      |      |     0.00s 202 empty slots have split surrounding tokens
   |      |      |     0.00s 336 space slots have split into chars
   |      |      |     0.00s  3383x Items contained in extra generated text
   |      |     0.03s 13761 tokens
   |      |     0.03s generating sentence-nodes with features nsent
   |      |      |     0.01s 756 sentence nodes have values assigned for nsent
   |      |      |     0.01s   455x Items contained in extra generated text
   |      |      |     0.01s   402x Items with empty final text
   |      |     0.05s 756 sentences
   |     0.13s Make a modified dataset ...
  0.00s Feature overview: 45 for nodes; 1 for edges; 1 configs; 9 computed
   |     0.39s Done
  0.39s 

The new dataset is different.
It is no longer character based, but the slot type has become `token`.
Various things in the `config.yaml` and `app.py` of the TF app should be updated, as well
as the documentation file that gives the ins and outs of the resulting features.

We have created this app by running

```
python tfFromTei.py app
```

We can now update this app by running

In [19]:
!python tfFromTei.py apptoken +force

App updated adapted to tokens and sentences


### Step by step from Python

In [19]:
from tf.app import use
from tf.convert.addnlp import NLPipeline

In [20]:
ORG = "mondriaan"
REPO = "letters"
BACKEND = "git.diginfra.net"

In [21]:
!python tfFromTei.py app +force

App updated


In [22]:
Apre = use(f"{ORG}/{REPO}:clone", checkout="clone", backend=BACKEND)
NLP = NLPipeline()
NLP.loadApp(Apre)

**Locating corpus resources ...**

Name,# of nodes,# slots/node,% coverage
folder,1,63491.0,100
letter,14,4535.07,100
body,14,3880.0,86
text,14,3880.0,86
chunk,86,738.26,100
div,93,1013.75,148
teiHeader,14,646.64,14
revisionDesc,14,368.07,8
p,95,324.6,49
postscript,6,250.83,2


Generate plain text (add `verbose=-1` or `0` or `1` and/or `write=True` if you like).

* `verbose=-1` is the same as `-verbose`
* `verbose=0` is the same as `+verbose`
* `verbose=1` is the same as `++verbose`
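
If you switch between the command line and Python often, a small lookup table keeps the two notations in sync. The helper below is hypothetical and not part of Text-Fabric; it only encodes the correspondence listed above.

```python
# Hypothetical helper, not part of Text-Fabric: translate a command-line
# verbosity flag into the value for the `verbose` keyword argument.
VERBOSE_FLAGS = {"-verbose": -1, "+verbose": 0, "++verbose": 1}


def verbose_from_flag(flag):
    """Return the numeric verbosity for a flag such as '+verbose'."""
    return VERBOSE_FLAGS[flag]


print(verbose_from_flag("+verbose"))
```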

In [23]:
verbose = 0
write=True

In [24]:
(text, positions) = NLP.task(plaintext=True, verbose=verbose, write=write)

  0.00s Generating a plain text with positions ...
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
   |   Found 202 empty slots
   |   recorded flow MAIN       with 127616 items
   |   recorded flow del        with    159 items
   |   recorded flow note       with  64164 items
   |   recorded flow orig       with    227 items
  0.19s Done. Generated text and positions written to ~/git.diginfra.net/mondriaan/letters/_temp/txt/plain.txt


Run the NLP pipeline (Spacy):

In [25]:
(tokens, sentences) = NLP.task(lingo=True, text=text, verbose=verbose, write=write)

  0.00s Using NLP pipeline Spacy (may take a while)...
  2.62s NLP done


This was the costly step.
Once we are here, we have our data and can experiment with it with little hassle.

Include the NLP results:

In [26]:
newVersion = NLP.task(
    ingest=True,
    positions=positions,
    tokens=tokens,
    sentences=sentences,
    verbose=0,
    write=write,
)

  0.00s Ingesting tokens and sentences into the dataset ...
   |     0.00s Mapping NLP data to nodes and features ...
   |      |     0.00s generating token-nodes with features str, after, empty
   |      |      |     0.00s 13761 token nodes have values assigned for str, after
   |      |      |     0.00s 202 empty slots have split surrounding tokens
   |      |      |     0.00s 336 space slots have split into chars
   |      |      |     0.00s  3383x Items contained in extra generated text
   |      |     0.04s 13761 tokens
   |      |     0.04s generating sentence-nodes with features nsent
   |      |      |     0.01s 756 sentence nodes have values assigned for nsent
   |      |      |     0.01s   455x Items contained in extra generated text
   |      |      |     0.02s   402x Items with empty final text
   |      |     0.05s 756 sentences
   |     0.06s Make a modified dataset ...
  0.00s Feature overview: 45 for nodes; 1 for edges; 1 configs; 9 computed
   |     0.32s Done
  0.31s 

Adapt the app to the tokens as slot type:

In [27]:
!python tfFromTei.py apptoken +force

App updated adapted to tokens and sentences


# Use the new dataset

We can now use the resulting dataset in the usual way.
Because we have adapted the TF app, the version without the `pre` will now be loaded.

In [28]:
A = use(f"{ORG}/{REPO}:clone", checkout="clone", backend=BACKEND, hoist=globals())

**Locating corpus resources ...**

   |     0.00s T otype                from ~/git.diginfra.net/mondriaan/letters/tf/0.8.1
   |     0.04s T oslots               from ~/git.diginfra.net/mondriaan/letters/tf/0.8.1
   |     0.00s T folder               from ~/git.diginfra.net/mondriaan/letters/tf/0.8.1
   |     0.03s T after                from ~/git.diginfra.net/mondriaan/letters/tf/0.8.1
   |     0.00s T letter               from ~/git.diginfra.net/mondriaan/letters/tf/0.8.1
   |     0.00s T chunk                from ~/git.diginfra.net/mondriaan/letters/tf/0.8.1
   |     0.04s T str                  from ~/git.diginfra.net/mondriaan/letters/tf/0.8.1
   |      |     0.00s C __levels__           from otype, oslots, otext
   |      |     0.09s C __order__            from otype, oslots, __levels__
   |      |     0.00s C __rank__             from otype, __order__
   |      |     0.09s C __levUp__            from otype, oslots, __rank__
   |      |     0.02s C __levDown__          from otype, __levUp__, __rank__
   |      | 

Name,# of nodes,# slots/node,% coverage
folder,1,13761.0,100
letter,14,982.93,100
body,14,849.93,86
text,14,849.93,86
chunk,86,160.0,100
div,93,219.99,149
teiHeader,14,124.57,13
p,95,73.39,51
postscript,6,62.83,3
revisionDesc,14,61.0,6


In [29]:
for t in F.otype.s("titleStmt"):
    print(t, T.text(t))

15545 Brief aan Aletta de Iongh. Amsterdam, dinsdag 16 februari, dinsdag 2 maart of dinsdag 9 maart 1909.
Wietse Coppes
Leo Jansen
Mondriaan Editieproject

15546 Brief aan Aletta de Iongh. Amsterdam, woensdag 7 april 1909.
Wietse Coppes
Leo Jansen
Mondriaan Editieproject

15547 Brief aan Aletta de Iongh. Amsterdam, tussen maandag 19 en vrijdag 23 april 1909.
Wietse Coppes
Leo Jansen
Mondriaan Editieproject

15548 Brief aan Aletta de Iongh. Amsterdam, maandag 26 april 1909.
Wietse Coppes
Leo Jansen
Mondriaan Editieproject

15549 Brief aan Aletta de Iongh. Amsterdam, donderdag 13 mei 1909.
Wietse Coppes
Leo Jansen
Mondriaan Editieproject

15550 Brief aan Aletta de Iongh. Amsterdam, donderdag 24 juni 1909.
Wietse Coppes
Leo Jansen
Mondriaan Editieproject

15551 Brief aan Aletta de Iongh. Amsterdam, eerste helft augustus 1909.
Wietse Coppes
Leo Jansen
Mondriaan Editieproject

15552  Briefkaart aan Gerrit Willem Knap. Zoutelande, c. dinsdag 24 augustus 1909.
Wietse Coppes
Leo Jansen
Mondria

In [30]:
T.text(F.otype.s("sentence")[2])

'Wietse Coppes'

In [31]:
A.pretty(F.otype.s("letter")[0], full=True, withNodes=False)

There are overlapping divs!
Let's find them all.

First the total number of divs:

In [32]:
len(F.otype.s("div"))

93

In [33]:
query = """
d1:div
&& d2:div

d1 < d2
"""

results = A.search(query)

  0.01s 69 results
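
The query above asks for pairs of divs that have slots in common (`&&`), and the ordering condition `d1 < d2` makes sure each pair is reported once. Note that a div that is embedded in another div also counts as overlapping. The same idea as a plain Python sketch, with made-up node numbers and slot sets, and with node numbers standing in for the canonical ordering that TF uses:

```python
from itertools import combinations

# Hypothetical nodes, each with the set of slots it occupies.
slots_of = {
    101: {1, 2, 3, 4, 5, 6},
    102: {1, 2, 3},
    103: {7, 8},
}

# Report every unordered pair once; a pair overlaps when the slot sets intersect.
overlapping = [
    (d1, d2)
    for (d1, d2) in combinations(sorted(slots_of), 2)
    if slots_of[d1] & slots_of[d2]
]
print(overlapping)
```

Here node 102 lies inside node 101, so the pair is reported, while node 103 shares no slots with the others.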


In [34]:
A.table(results, end=2)

n,p,div,div.1
1,proeftuin@19090216y_IONG_1303:6,"Manuscript. De brief is geschreven op een dinsdag voorafgaand aan een van de drie woensdagen waarop Richard Buhlig on 1909 in de kleine zaal van het Concertgebouw een concert zou geven: 17 februari, 3 maart en 10 maart. Zie tevens noot 1. ​Aletta werd binnen haar familie aangesproken met de koosnaam 'Zus' (Heteren 2018, p. 25) Mondriaans gebruik van deze naam geeft aan dat hij op intieme voet stond met De Iongh. De Amerikaanse pianist Richard Moritz Buhlig (1880-1952) gaf begin 1909 drie concerten in de kleine zaal van het Concertgebouw, op woensdag 17 februari, woensdag 3 maart en woensdag 10 maart. Het is niet bekend voor welke van deze concerten Mondriaan kaarten had. Dear Zus,​ If you come to the entrance to the small auditorium in the Concertgebouw at a quarter to eight tomorrow (Wednesday) evening, I have a ticket for van Buhlig for you.[2] And then we can arrange a time other than Thursday afternoon because I can’t manage that. With my very best wishes, your Piet.","Manuscript. De brief is geschreven op een dinsdag voorafgaand aan een van de drie woensdagen waarop Richard Buhlig on 1909 in de kleine zaal van het Concertgebouw een concert zou geven: 17 februari, 3 maart en 10 maart. Zie tevens noot 1."
2,proeftuin@19090216y_IONG_1303:6,"Manuscript. De brief is geschreven op een dinsdag voorafgaand aan een van de drie woensdagen waarop Richard Buhlig on 1909 in de kleine zaal van het Concertgebouw een concert zou geven: 17 februari, 3 maart en 10 maart. Zie tevens noot 1. ​Aletta werd binnen haar familie aangesproken met de koosnaam 'Zus' (Heteren 2018, p. 25) Mondriaans gebruik van deze naam geeft aan dat hij op intieme voet stond met De Iongh. De Amerikaanse pianist Richard Moritz Buhlig (1880-1952) gaf begin 1909 drie concerten in de kleine zaal van het Concertgebouw, op woensdag 17 februari, woensdag 3 maart en woensdag 10 maart. Het is niet bekend voor welke van deze concerten Mondriaan kaarten had. Dear Zus,​ If you come to the entrance to the small auditorium in the Concertgebouw at a quarter to eight tomorrow (Wednesday) evening, I have a ticket for van Buhlig for you.[2] And then we can arrange a time other than Thursday afternoon because I can’t manage that. With my very best wishes, your Piet.",​


In [35]:
T.text(F.otype.s("sentence")[2])

'Wietse Coppes'

In [37]:
T.text(F.otype.s("sentence")[3])

'Leo Jansen'

In [39]:
s = F.otype.s("sentence")[2]

A.pretty(s, withNodes=True)

In [40]:
for (i, s) in enumerate(F.otype.s("sentence")[0:100]):
    print(f"SENTENCE {i + 1}: {T.text(s)}")

SENTENCE 1: Brief aan Aletta de Iongh. 
SENTENCE 2: Amsterdam, dinsdag 16 februari, dinsdag 2 maart of dinsdag 9 maart 1909.
SENTENCE 3: Wietse Coppes
SENTENCE 4: Leo Jansen
SENTENCE 5: Mondriaan Editieproject
SENTENCE 6: Nederland
SENTENCE 7: Otterlo
SENTENCE 8: Kröller Müller Museum
SENTENCE 9: KM 123.397
SENTENCE 10: 19090216y_IONG_1303
SENTENCE 11: ​
SENTENCE 12: ​
SENTENCE 13: ​
SENTENCE 14: Piet Mondriaan
SENTENCE 15: dinsdag 16 februari, dinsdag 2 maart of dinsdag 9 maart 1909
SENTENCE 16: Amsterdam
SENTENCE 17: Aletta de Iongh
SENTENCE 18: transcriptie: voltooid 20.7.15
SENTENCE 19: collatie bron: 6.6.16
SENTENCE 20: tweede collatie aan het origineel: voltooid 26.11.19
SENTENCE 21: invoer tweede collatie: voltooid 5.8.16
SENTENCE 22: bespreking eindversie: gb
SENTENCE 23: markeren annotaties: in bewerking / voltooid
SENTENCE 24: gereed 17.4.2019
SENTENCE 25: titel gecontroleerd 21.09.2020
SENTENCE 26: personen getagd 12.10.2020
SENTENCE 27: vertaling ingevoerd 16.2.2021
SENTENC

In [41]:
sent = F.otype.s("sentence")[110]

In [42]:
A.pretty(sent, baseTypes=set(), full=True)

In [43]:
query = """
s1:sentence
&& s2:sentence

s1 # s2
"""

results = A.search(query)

  0.00s 0 results


In [44]:
query = """
sentence
  =: t1:token
  := t2:token
  
sentence
  =: t3:token
  
t1 < t3
t3 < t2
"""

results = A.search(query)

  0.11s 0 results


In [45]:
len(list(F.otype.s("note")))

86

In [46]:
for (i, nn) in enumerate(F.otype.s("note")[4:5]):
    Apre.dm(f"### Note {i + 1}\n\n")
    s1 = L.u(L.d(nn, otype="token")[0], otype="chunk")[0]
    s2 = s1 + 1
    A.pretty(nn, withNodes=True, full=True)
    A.pretty(s1, withNodes=True, full=True)
    A.pretty(s2, withNodes=True, full=True)

### Note 1



In [47]:
slots = F.otype.s("token")
len(slots)

13761

In [48]:
sortedSlots = N.sortNodes(slots)

In [49]:
t0 = slots[0]

In [50]:
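def checkTokens(data):
    """Check whether data is the consecutive sequence t0, t0 + 1, t0 + 2, ...

    Returns the index of the first mismatch,
    or the index of the last member if there is no mismatch.
    """
    equal = True
    i = -1  # remains -1 if data is empty
    for (i, t) in enumerate(data):
        if t0 + i != t:
            print(f"mismatch at {i}th member: {t=} {t0 + i=}")
            equal = False
            break
    if equal:
        print("continuous sequence")
    return i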

In [51]:
m = checkTokens(slots)
m

continuous sequence


13760

In [52]:
m = checkTokens(sortedSlots)

continuous sequence


In [53]:
A.pretty(slots[m])
A.pretty(sortedSlots[m])