In [1]:
%load_ext autoreload
%autoreload 2

# Convert from TEI to TF

We show how to convert a TEI data source into TF.

This has two stages:

1. make a preliminary TF dataset with the character as slot type
1. feed the plain text to a tokenizer, and add tokens and sentences to the dataset,
   while removing its character and word nodes;
   the new slot type is token.
   
A dataset based on characters is precise, but rather inefficient.
The second step makes the dataset much more efficient.
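
To get a feel for the gain, here is a small calculation using the slot counts that the node tables further down in this notebook report for the single folder: 63491 character slots before tokenization and 13761 token slots after it. It only restates those figures as a ratio.

```python
# Slot counts for the whole corpus, copied from the node tables in this notebook.
char_slots = 63491   # preliminary dataset: one slot per character
token_slots = 13761  # final dataset: one slot per token

ratio = char_slots / token_slots
print(f"The token-based dataset has {ratio:.1f} times fewer slots")
```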

**More ways to do it!**

* [convertExpress](convertExpress.ipynb): as few commands, feedback, and interaction as possible
* [convertSteps](convertSteps.ipynb): broken down into a few command-line commands, with more feedback
* *convertDetails*: run from Python with full control

## Preliminary conversion

Same as in [convertSteps](convertSteps.ipynb) but now with even more feedback.

### Step 1: Check

Check the input: validity of the TEI-XML.

Make a report of the elements and attributes used.

Use the declared schemas in the XML source to determine which elements have
pure content and which ones mixed content.
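
The difference between pure and mixed content can be shown with a minimal sketch. Note the assumptions: the checker described here reads the declared schemas, whereas this sketch looks at element instances in a made-up sample document, and the helper name `content_kinds` is hypothetical. An element has pure content when it contains only child elements, and mixed content when text and child elements occur side by side.

```python
import xml.etree.ElementTree as ET

# Hypothetical mini-document, not taken from the corpus.
SAMPLE = """
<letter>
  <head><date>1909</date><place>Amsterdam</place></head>
  <p>Dear <name>Zus</name>, see you tomorrow.</p>
</letter>
""".strip()


def content_kinds(xml_text):
    """Classify tags that have child elements as 'pure' or 'mixed'."""
    kinds = {}
    for elem in ET.fromstring(xml_text).iter():
        children = list(elem)
        if not children:
            continue  # leaf elements are left out of this classification
        # text directly inside the element: before the first child or after any child
        direct_text = [elem.text] + [child.tail for child in children]
        has_text = any(t and t.strip() for t in direct_text)
        # a tag that was seen with mixed content once stays mixed
        if has_text or kinds.get(elem.tag) == "mixed":
            kinds[elem.tag] = "mixed"
        else:
            kinds[elem.tag] = "pure"
    return kinds


kinds = content_kinds(SAMPLE)
print(kinds)
```

The distinction matters for conversion because whitespace between the children of a pure element is mere layout, while whitespace inside a mixed element is part of the text.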

In [2]:
!python tfFromTei.py check ++verbose

TEI to TF checking: ~/github/annotation/mondriaan/tei/2023-03-20 => ~/github/annotation/mondriaan/report
INFO: Needs af.xsd (exists)
INFO: Needs ns2.xsd (exists)
INFO: Needs ns1.xsd (exists)
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/annotation/mondriaan/schema/MD.xsd
	round   1:  49 changes
180 identical override(s)
  6 changing override(s)
	address pure ==> mixed
	postmark complex mixed (added)
	rewrite complex mixed (added)
	sepLine complex pure (added)
	transpose pure ==> mixed
	wbh complex pure (added)
Section model I
Start folder proeftuin:
  14 19100131_SAAL_ARNO_0018.xml                       
End   folder proeftuin

217 info line(s) written to ~/github/annotation/mondriaan/report/elements.txt
0 error(s) in 0 file(s) written to ~/github/annotation/mondriaan/report/errors.txt
59 tags of which 0 with multiple namespaces written to ~/github/annotation/mondriaan/report/namespaces.txt


### Step 2: Convert

Run the actual conversion and produce TF output.

In [3]:
!python tfFromTei.py convert ++verbose

TEI to TF converting: ~/github/annotation/mondriaan/tei/2023-03-20 => ~/github/annotation/mondriaan/tf/0.8.1pre
  0.00s Not all of the warp features otype and oslots are present in
~/github/annotation/mondriaan/tf/0.8.1pre
  0.00s Only the Feature and Edge APIs will be enabled
  0.00s Warp feature "otext" not found. Working without Text-API

  0.00s Importing data from walking through the source ...
   |     0.00s Preparing metadata... 
   |     0.00s No structure nodes will be set up
   |   SECTION   TYPES:    folder, letter, chunk
   |   SECTION   FEATURES: folder, letter, chunk
   |   STRUCTURE TYPES:    
   |   STRUCTURE FEATURES: 
   |   TEXT      FEATURES:
   |      |   text-orig-full       ch
   |     0.00s OK
   |     0.00s Following director... 
Start folder proeftuin:
  14 19100131_SAAL_ARNO_0018.xml                       
End   folder proeftuin

source reading done
   |     0.19s "edge" actions: 0
   |     0.19s "feature" actions: 123262
   |     0.19s "node" actions: 12921


### Step 3: Load the TF data

The final proof that the conversion has worked is to load the data.
On first-time loading several checks and precomputations are performed.
Next time the loading will be much quicker.

In [4]:
!python tfFromTei.py load ++verbose

   |     0.02s T otype                from ~/github/annotation/mondriaan/tf/0.8.1pre
   |     0.20s T oslots               from ~/github/annotation/mondriaan/tf/0.8.1pre
   |     0.00s T folder               from ~/github/annotation/mondriaan/tf/0.8.1pre
   |     0.13s T ch                   from ~/github/annotation/mondriaan/tf/0.8.1pre
   |     0.00s T chunk                from ~/github/annotation/mondriaan/tf/0.8.1pre
   |     0.00s T letter               from ~/github/annotation/mondriaan/tf/0.8.1pre
   |      |     0.00s C __levels__           from otype, oslots, otext
   |      |     0.40s C __order__            from otype, oslots, __levels__
   |      |     0.01s C __rank__             from otype, __order__
   |      |     0.43s C __levUp__            from otype, oslots, __rank__
   |      |     0.11s C __levDown__          from otype, __levUp__, __rank__
   |      |     0.01s C __characters__       from otext
   |      |     0.14s C __boundary__         from otype, oslots, __ra

### Step 4: Configure a TF app

The TF app has configuration settings, a bit of custom code, and documentation.

Most of it will be generated now, but there are ways to keep custom additions intact.

In [5]:
!python tfFromTei.py app +force ++verbose

App updating  ...
	about  : exists , generated ~/github/annotation/mondriaan/docs/about.md
	trans  : exists , generated ~/github/annotation/mondriaan/docs/transcription.md
	display: exists , generated ~/github/annotation/mondriaan/app/static/display.css
	config : exists , generated ~/github/annotation/mondriaan/app/config.yaml
	app    : exists , generated ~/github/annotation/mondriaan/app/app.py
Done


## View the preliminary result

To inspect the preliminary result, we load the app:

In [6]:
from tf.app import use

In [7]:
ORG = "annotation"
REPO = "mondriaan"

In [9]:
Apre = use(f"{ORG}/{REPO}:clone", checkout="clone")

**Locating corpus resources ...**

Name,# of nodes,# slots/node,% coverage
folder,1,63491.0,100
letter,14,4535.07,100
body,14,3880.0,86
text,14,3880.0,86
chunk,86,738.26,100
div,93,1013.75,148
teiHeader,14,646.64,14
revisionDesc,14,368.07,8
p,95,324.6,49
postscript,6,250.83,2


### Show a fragment

In [10]:
chunk = Apre.api.F.otype.s("chunk")[4]
Apre.plain(chunk)

## Add tokens and sentences

We add tokens and sentences to the TF dataset.

We do this in the following steps:

1. Generate a plain text plus a mapping between character positions and nodes
2. Use Spacy to tokenize the text and to determine sentence boundaries
3. Translate the Spacy results back to extra nodes and features for the TF set
4. Replace the character slots in the TF set by tokens

### Step by step from Python

We carry out the steps from within Python.

In that way we get access to all intermediate results, and we can play and explore between the steps.

We load the data we have so far, and pass it on to an `NLPipeline` object, defined by Text-Fabric.

In [12]:
from tf.app import use
from tf.convert.addnlp import NLPipeline

In [13]:
ORG = "annotation"
REPO = "mondriaan"

### Back to the previous state

When we have added the data to the dataset, we will tweak the TF app.

But if we want to redo the pipeline, we have to restore the app to the situation before
the tokens and sentences were added.

That's the reason we have the next cell.

In [14]:
!python tfFromTei.py app +force ++verbose

App updating  ...
	about  : exists , generated ~/github/annotation/mondriaan/docs/about.md
	trans  : exists , generated ~/github/annotation/mondriaan/docs/transcription.md
	display: exists , generated ~/github/annotation/mondriaan/app/static/display.css
	config : exists , generated ~/github/annotation/mondriaan/app/config.yaml
	app    : exists , generated ~/github/annotation/mondriaan/app/app.py
Done


In [15]:
Apre = use(f"{ORG}/{REPO}:clone", checkout="clone")
NLP = NLPipeline()
NLP.loadApp(Apre)

**Locating corpus resources ...**

Name,# of nodes,# slots/node,% coverage
folder,1,63491.0,100
letter,14,4535.07,100
body,14,3880.0,86
text,14,3880.0,86
chunk,86,738.26,100
div,93,1013.75,148
teiHeader,14,646.64,14
revisionDesc,14,368.07,8
p,95,324.6,49
postscript,6,250.83,2


### Before the steps

We can set the verbosity as we like.

We store the settings in two variables that we will pass to the following steps.

You can return to the next cell at any time, change the value of `verbose` and/or `write`, and run it again;
from then on all subsequent steps will use the new settings.

Set `verbose` to `-1`, `0`, or `1`, and set `write=True` if you want the intermediate results written to disk.

* `verbose=-1` is the same as `-verbose`
* `verbose=0` is the same as `+verbose`
* `verbose=1` is the same as `++verbose`
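
The correspondence in the list above can be captured in a small lookup table. This is only a convenience sketch: the names `VERBOSE_FLAGS` and `verbose_from_flag` are hypothetical and not part of Text-Fabric.

```python
# Mapping taken from the list above; the names are made up for this sketch.
VERBOSE_FLAGS = {"-verbose": -1, "+verbose": 0, "++verbose": 1}


def verbose_from_flag(flag):
    """Translate a command-line verbosity flag into the Python value."""
    return VERBOSE_FLAGS[flag]


verbose_example = verbose_from_flag("++verbose")
print(verbose_example)
```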

In [16]:
verbose = 0
write = True

### Step 1: Generate a plain text of the whole corpus

The function delivers the text in a variable, and it has recorded which character positions correspond
to which slots in the TF dataset.

We receive both items of data.
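
To make the idea of a text with recorded positions concrete, here is a minimal sketch on toy data. It assumes that the positions form a list with one entry per character of the plain text, holding the slot behind that character, or `None` for generated text that has no slot. The actual data structure returned by `NLP.task` may well differ, and the function name and separator text are made up.

```python
# Toy character slots per chunk: (slot number, character). Hypothetical data.
chunks = [
    [(1, "D"), (2, "a"), (3, "g"), (4, "!")],
    [(5, "B"), (6, "y"), (7, "e"), (8, "!")],
]


def plain_text_with_positions(chunks, separator=" END chunk. "):
    """Concatenate the characters and remember which slot each one came from."""
    text_parts = []
    positions = []  # positions[i] is the slot behind character i, or None
    for chunk in chunks:
        for slot, ch in chunk:
            text_parts.append(ch)
            positions.append(slot)
        # generated text guides the NLP tool but has no slot in the corpus
        text_parts.append(separator)
        positions.extend([None] * len(separator))
    return "".join(text_parts), positions


toy_text, toy_positions = plain_text_with_positions(chunks)
print(repr(toy_text))
print(toy_positions[toy_text.index("Bye")])
```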

In [17]:
(text, positions) = NLP.task(plaintext=True, verbose=verbose, write=write)

  0.00s Generating a plain text with positions ...
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
   |   Found 202 empty slots
   |   recorded flow MAIN       with 127616 items
   |   recorded flow del        with    159 items
   |   recorded flow note       with  64164 items
   |   recorded flow orig       with    227 items
  0.19s Done. Generated text and positions written to ~/github/annotation/mondriaan/_temp/txt/plain.txt


### Step 2: Run Spacy to get tokens and sentences

Now we feed the text from step 1 into the NLP pipeline, which is Spacy.

We get a list of tokens and a list of sentences back.

In [18]:
(tokens, sentences) = NLP.task(lingo=True, text=text, verbose=verbose, write=write)

  0.00s Using NLP pipeline Spacy (may take a while)...
  2.63s NLP done


Let's examine a few tokens and sentences:

In [19]:
for token in tokens[400:410]:
    print(token)

(1644, 1646, 'to', ' ')
(1647, 1650, 'the', ' ')
(1651, 1659, 'entrance', ' ')
(1660, 1662, 'to', ' ')
(1663, 1666, 'the', ' ')
(1667, 1672, 'small', ' ')
(1673, 1683, 'auditorium', ' ')
(1684, 1686, 'in', ' ')
(1687, 1690, 'the', ' ')
(1691, 1704, 'Concertgebouw', ' ')


Each token entry specifies the start and end position in the plain text file,
the string value of the token, and the whitespace after the token, if any.
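
The tuple layout can be reproduced with a stand-in tokenizer. This sketch does not use Spacy; it splits on whitespace with a regular expression, purely to show that the end position is exclusive, so that slicing the text with start and end gives back the token string. The function name `simple_tokens` is hypothetical.

```python
import re


def simple_tokens(text):
    """Whitespace tokenizer producing (start, end, string, after) tuples."""
    tokens = []
    for match in re.finditer(r"(\S+)(\s*)", text):
        start, end = match.span(1)
        tokens.append((start, end, match.group(1), match.group(2)))
    return tokens


sample = "to the small auditorium"
sample_tokens = simple_tokens(sample)

for start, end, string, after in sample_tokens:
    # the end position is exclusive, as in the Spacy output above
    assert sample[start:end] == string
    print((start, end, string, after))
```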

In [20]:
for sentence in sentences[390:410]:
    print(sentence)

(8493, 8583, 'Als je om kwart voor 8 niet bij me ben, denk ik, dat je verhinderd was, en ga de deur uit.')
(8583, 8588, 'Dag!')
(8589, 8620, 'hartelijke groeten van je Piet.')
(8620, 8627, ' xxx.')
(8627, 8638, 'END chunk.')
(8638, 8652, 'BEGIN chunk.')
(8652, 8759, 'Dear Zus,The polyp didn’t come yesterday morning, but I found his card in the letterbox in the evening.')
(8760, 8781, 'I shall write to him.')
(8782, 8821, 'It’s fine for you to come on Wednesday.')
(8822, 8952, 'The theosophical lectures are open to everyone on Tuesdays, apart from the first Tuesday of every month: then only to the members.')
(8952, 9023, '￮ So if you want to come and pick me up around seven thirty, we can go.')
(9024, 9110, 'If you aren’t with me by a quarter to eight I’ll assume you can’t come and I’ll leave.')
(9110, 9115, 'Bye!')
(9116, 9143, 'Warm wishes from your Piet.')
(9143, 9150, ' xxx.')
(9150, 9161, 'END chunk.')
(9161, 9174, 'BEGIN META.')
(9174, 9188, 'BEGIN chunk.')
(9188, 9199, 'fileDesc.

Sentence entries have the same fields, except that they lack the trailing whitespace field.

Actually, the program will not use the texts of tokens and sentences for display, only for determining
where the boundaries are.

With those boundaries in hand, the texts of tokens and sentences are read off from the original corpus.

### Step 3: Ingest the results in the data set

A lot of critical things happen when we ingest the token and sentence streams into our dataset.

We calculate slot positions, retrieve text, split some tokens, and last but not least,
we replace the character-by-character basis of the preliminary dataset by a token-by-token basis.
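
The core of this mapping can be sketched on toy data. The sketch assumes the same layout as before: a positions list with one slot (or `None` for generated text) per character of the plain text. It skips tokens that lie entirely in generated text and reads the token text from the corpus characters rather than from the NLP output. The feature names `str` and `after` match the log output below; everything else (function name, data, the `chars` key) is hypothetical and far simpler than what the real ingest step does.

```python
# Hypothetical toy data: the plain text is "# Dag Piet", where "# " is generated text.
chars = {1: "D", 2: "a", 3: "g", 4: " ", 5: "P", 6: "i", 7: "e", 8: "t"}
positions = [None, None, 1, 2, 3, 4, 5, 6, 7, 8]
tokens = [(0, 1, "#", " "), (2, 5, "Dag", " "), (6, 10, "Piet", "")]


def ingest(tokens, positions, chars):
    """Turn NLP tokens into new slots that know which characters they cover."""
    new_slots = []
    for start, end, _nlp_string, after in tokens:
        covered = [s for s in positions[start:end] if s is not None]
        if not covered:
            continue  # the token lies entirely in generated text
        # the text is read off from the corpus, not from the NLP output
        string = "".join(chars[s] for s in covered)
        new_slots.append({"str": string, "after": after, "chars": covered})
    return new_slots


token_slots = ingest(tokens, positions, chars)
for number, slot in enumerate(token_slots, start=1):
    print(number, slot)
```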

In [21]:
newVersion = NLP.task(
    ingest=True,
    positions=positions,
    tokens=tokens,
    sentences=sentences,
    verbose=verbose,
    write=write,
)

  0.00s Ingesting tokens and sentences into the dataset ...
   |     0.02s Mapping NLP data to nodes and features ...
   |      |     0.00s generating token-nodes with features str, after, empty
   |      |      |     0.00s 13761 token nodes have values assigned for str, after
   |      |      |     0.00s 202 empty slots have split surrounding tokens
   |      |      |     0.00s 336 space slots have split into chars
   |      |      |     0.00s  3383x Items contained in extra generated text
   |      |     0.04s 13761 tokens
   |      |     0.04s generating sentence-nodes with features nsent
   |      |      |     0.01s 756 sentence nodes have values assigned for nsent
   |      |      |     0.02s   455x Items contained in extra generated text
   |      |      |     0.02s   402x Items with empty final text
   |      |     0.05s 756 sentences
   |     0.07s Make a modified dataset ...
  0.00s Feature overview: 45 for nodes; 1 for edges; 1 configs; 9 computed
   |     0.32s Done
  0.31s 

### Step 4: Adjust the app to the modified dataset

Various things in the `config.yaml` and `app.py` of the TF app should be updated, as well
as the documentation file that gives the ins and outs of the resulting features.

In [22]:
!python tfFromTei.py apptoken +force ++verbose

App updating  adapted to tokens and sentences ...
	about  : exists , generated ~/github/annotation/mondriaan/docs/about.md
	trans  : exists , generated ~/github/annotation/mondriaan/docs/transcription.md
	display: exists , generated ~/github/annotation/mondriaan/app/static/display.css
	config : exists , generated ~/github/annotation/mondriaan/app/config.yaml
	app    : exists , generated ~/github/annotation/mondriaan/app/app.py
Done


# Use the new dataset

We can now use the resulting dataset in the usual way.
Because we have adapted the TF app, the version without the `pre` will now be loaded.

In [23]:
A = use(f"{ORG}/{REPO}:clone", checkout="clone", hoist=globals())

**Locating corpus resources ...**

   |     0.01s T otype                from ~/github/annotation/mondriaan/tf/0.8.1
   |     0.04s T oslots               from ~/github/annotation/mondriaan/tf/0.8.1
   |     0.00s T letter               from ~/github/annotation/mondriaan/tf/0.8.1
   |     0.03s T str                  from ~/github/annotation/mondriaan/tf/0.8.1
   |     0.00s T chunk                from ~/github/annotation/mondriaan/tf/0.8.1
   |     0.00s T folder               from ~/github/annotation/mondriaan/tf/0.8.1
   |     0.03s T after                from ~/github/annotation/mondriaan/tf/0.8.1
   |      |     0.00s C __levels__           from otype, oslots, otext
   |      |     0.09s C __order__            from otype, oslots, __levels__
   |      |     0.00s C __rank__             from otype, __order__
   |      |     0.09s C __levUp__            from otype, oslots, __rank__
   |      |     0.02s C __levDown__          from otype, __levUp__, __rank__
   |      |     0.00s C __characters__       from otext
   | 

Name,# of nodes,# slots/node,% coverage
folder,1,13761.0,100
letter,14,982.93,100
body,14,849.93,86
text,14,849.93,86
chunk,86,160.0,100
div,93,219.99,149
teiHeader,14,124.57,13
p,95,73.39,51
postscript,6,62.83,3
revisionDesc,14,61.0,6


# Exploration

We walk around a bit more in the corpus.

## All titles:

In [24]:
for t in F.otype.s("titleStmt"):
    print(t, T.text(t))

15545 Brief aan Aletta de Iongh. Amsterdam, dinsdag 16 februari, dinsdag 2 maart of dinsdag 9 maart 1909.
Wietse Coppes
Leo Jansen
Mondriaan Editieproject

15546 Brief aan Aletta de Iongh. Amsterdam, woensdag 7 april 1909.
Wietse Coppes
Leo Jansen
Mondriaan Editieproject

15547 Brief aan Aletta de Iongh. Amsterdam, tussen maandag 19 en vrijdag 23 april 1909.
Wietse Coppes
Leo Jansen
Mondriaan Editieproject

15548 Brief aan Aletta de Iongh. Amsterdam, maandag 26 april 1909.
Wietse Coppes
Leo Jansen
Mondriaan Editieproject

15549 Brief aan Aletta de Iongh. Amsterdam, donderdag 13 mei 1909.
Wietse Coppes
Leo Jansen
Mondriaan Editieproject

15550 Brief aan Aletta de Iongh. Amsterdam, donderdag 24 juni 1909.
Wietse Coppes
Leo Jansen
Mondriaan Editieproject

15551 Brief aan Aletta de Iongh. Amsterdam, eerste helft augustus 1909.
Wietse Coppes
Leo Jansen
Mondriaan Editieproject

15552  Briefkaart aan Gerrit Willem Knap. Zoutelande, c. dinsdag 24 augustus 1909.
Wietse Coppes
Leo Jansen
Mondria

## Sentences

In [25]:
for s in F.otype.s("sentence")[2:4]:
    print(T.text(s))

Wietse Coppes
Leo Jansen


In [26]:
for s in F.otype.s("sentence")[2:4]:
    A.pretty(s, withNodes=True)

In [27]:
for (i, s) in enumerate(F.otype.s("sentence")[0:100]):
    print(f"SENTENCE {i + 1}: {T.text(s)}")

SENTENCE 1: Brief aan Aletta de Iongh. 
SENTENCE 2: Amsterdam, dinsdag 16 februari, dinsdag 2 maart of dinsdag 9 maart 1909.
SENTENCE 3: Wietse Coppes
SENTENCE 4: Leo Jansen
SENTENCE 5: Mondriaan Editieproject
SENTENCE 6: Nederland
SENTENCE 7: Otterlo
SENTENCE 8: Kröller Müller Museum
SENTENCE 9: KM 123.397
SENTENCE 10: 19090216y_IONG_1303
SENTENCE 11: ​
SENTENCE 12: ​
SENTENCE 13: ​
SENTENCE 14: Piet Mondriaan
SENTENCE 15: dinsdag 16 februari, dinsdag 2 maart of dinsdag 9 maart 1909
SENTENCE 16: Amsterdam
SENTENCE 17: Aletta de Iongh
SENTENCE 18: transcriptie: voltooid 20.7.15
SENTENCE 19: collatie bron: 6.6.16
SENTENCE 20: tweede collatie aan het origineel: voltooid 26.11.19
SENTENCE 21: invoer tweede collatie: voltooid 5.8.16
SENTENCE 22: bespreking eindversie: gb
SENTENCE 23: markeren annotaties: in bewerking / voltooid
SENTENCE 24: gereed 17.4.2019
SENTENCE 25: titel gecontroleerd 21.09.2020
SENTENCE 26: personen getagd 12.10.2020
SENTENCE 27: vertaling ingevoerd 16.2.2021
SENTENC

## The first letter

In [28]:
A.pretty(F.otype.s("letter")[0], full=True, withNodes=False)

## Overlapping divs

There are overlapping divs!
Let's find them all.
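
A first hint is already in the node tables above, where `div` has a coverage of more than 100%. The check below assumes that coverage is the number of nodes times the average number of slots per node, divided by the total number of slots. That formula is an assumption, but it reproduces the percentage in the preliminary table, and a value above 100 can only arise when some slots belong to more than one div.

```python
# Figures for div, copied from the preliminary node table in this notebook.
n_divs = 93
slots_per_div = 1013.75
total_slots = 63491  # slots of the single folder, which spans the whole corpus

# Assumed formula: summed size of all divs relative to the corpus size.
coverage = 100 * n_divs * slots_per_div / total_slots
print(f"{coverage:.0f}%")
```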

First the total number of divs:

In [29]:
len(F.otype.s("div"))

93

In [30]:
query = """
d1:div
&& d2:div

d1 < d2
"""

results = A.search(query)

  0.01s 69 results


In [31]:
A.table(results, end=2)

n,p,div,div.1
1,proeftuin@19090216y_IONG_1303:6,"Manuscript. De brief is geschreven op een dinsdag voorafgaand aan een van de drie woensdagen waarop Richard Buhlig on 1909 in de kleine zaal van het Concertgebouw een concert zou geven: 17 februari, 3 maart en 10 maart. Zie tevens noot 1. ​Aletta werd binnen haar familie aangesproken met de koosnaam 'Zus' (Heteren 2018, p. 25) Mondriaans gebruik van deze naam geeft aan dat hij op intieme voet stond met De Iongh. De Amerikaanse pianist Richard Moritz Buhlig (1880-1952) gaf begin 1909 drie concerten in de kleine zaal van het Concertgebouw, op woensdag 17 februari, woensdag 3 maart en woensdag 10 maart. Het is niet bekend voor welke van deze concerten Mondriaan kaarten had. Dear Zus,​ If you come to the entrance to the small auditorium in the Concertgebouw at a quarter to eight tomorrow (Wednesday) evening, I have a ticket for van Buhlig for you.[2] And then we can arrange a time other than Thursday afternoon because I can’t manage that. With my very best wishes, your Piet.","Manuscript. De brief is geschreven op een dinsdag voorafgaand aan een van de drie woensdagen waarop Richard Buhlig on 1909 in de kleine zaal van het Concertgebouw een concert zou geven: 17 februari, 3 maart en 10 maart. Zie tevens noot 1."
2,proeftuin@19090216y_IONG_1303:6,"Manuscript. De brief is geschreven op een dinsdag voorafgaand aan een van de drie woensdagen waarop Richard Buhlig on 1909 in de kleine zaal van het Concertgebouw een concert zou geven: 17 februari, 3 maart en 10 maart. Zie tevens noot 1. ​Aletta werd binnen haar familie aangesproken met de koosnaam 'Zus' (Heteren 2018, p. 25) Mondriaans gebruik van deze naam geeft aan dat hij op intieme voet stond met De Iongh. De Amerikaanse pianist Richard Moritz Buhlig (1880-1952) gaf begin 1909 drie concerten in de kleine zaal van het Concertgebouw, op woensdag 17 februari, woensdag 3 maart en woensdag 10 maart. Het is niet bekend voor welke van deze concerten Mondriaan kaarten had. Dear Zus,​ If you come to the entrance to the small auditorium in the Concertgebouw at a quarter to eight tomorrow (Wednesday) evening, I have a ticket for van Buhlig for you.[2] And then we can arrange a time other than Thursday afternoon because I can’t manage that. With my very best wishes, your Piet.",​


### Notes

In [34]:
for (i, nn) in enumerate(F.otype.s("note")[4:5]):
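    A.dm(f"### Note {i + 1}\n\n")
    # s1: the chunk that contains the first token of the note
    s1 = L.u(L.d(nn, otype="token")[0], otype="chunk")[0]
    # s2: the node numbered right after it, normally the next chunk
    s2 = s1 + 1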
    A.pretty(nn, withNodes=True, full=True)
    A.pretty(s1, withNodes=True, full=True)
    A.pretty(s2, withNodes=True, full=True)

### Note 1

