In [1]:
%load_ext autoreload
%autoreload 2

# Convert from TEI to TF

We show how to convert a TEI data source into TF.

This has two stages:

1. make a preliminary TF dataset with the character as slot type;
1. feed the plain text to a tokenizer, and add tokens and sentences to the dataset,
   while removing its character and word nodes;
   the new slot type is token.
   
A dataset based on characters is precise, but rather inefficient.
The second step makes the dataset much more efficient.

**More ways to do it!**

* [convertExpress](convertExpress.ipynb): as few commands, feedback, and interaction as possible
* [convertSteps](convertSteps.ipynb): broken down into a few command-line commands, with more feedback
* *convertDetails*: run from Python with full control

## Preliminary conversion

Same as in [convertSteps](convertSteps.ipynb) but now with even more feedback.

### Step 1: Check

Check the input: validity of the TEI-XML.

Make a report of the elements and attributes used.

Use the declared schemas in the XML source to determine which elements have
pure content and which ones mixed content.

In [6]:
from tf.convert.tei import TEI

In [7]:
Tei = TEI(verbose=1, tei=0, tf="0.8.6pre")

Working in repository annotation/mondriaan in backend github
With custom behaviour hooked in at:
	transformCustom = ~/github/annotation/mondriaan/programs/tei.py.transform
TEI data version is 2023-05-11 (most recent)
TF data version is 0.8.6pre (explicit existing)
Processing instructions are treated


In [8]:
Tei.task(check=True)

TEI to TF checking: ~/github/annotation/mondriaan/tei/2023-05-11 => ~/github/annotation/mondriaan/report/2023-05-11
Processing instructions are treated
INFO: Needs af.xsd (exists)
INFO: Needs ns2.xsd (exists)
INFO: Needs ns1.xsd (exists)
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/annotation/mondriaan/schema/MD.xsd
	round   1:  49 changes
180 identical override(s)
  6 changing override(s)
	address pure ==> mixed
	postmark complex mixed (added)
	rewrite complex mixed (added)
	sepLine complex pure (added)
	transpose pure ==> mixed
	wbh complex pure (added)
Section model I
Start folder proeftuin:
  14 19100131_SAAL_ARNO_0018.xml                       
End   folder proeftuin

238 info line(s) written to ~/github/annotation/mondriaan/report/2023-05-11/elements.txt
0 error(s) in 0 file(s) written to ~/github/annotation/mondriaan/report/2023-05-11/errors.txt
62 tags of which 0 with multiple namespaces written to ~/github/annota

True
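
To get a feel for what an element and attribute report contains, here is a minimal sketch in plain Python.
It is not how `tf.convert.tei` works: it uses only the standard library, it checks well-formedness rather than
schema validity, and it detects mixed content by observation instead of reading the declared schemas.
The sample document and the names `SAMPLE`, `local`, and `inventory` are made up for illustration.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical miniature TEI document, used only for illustration.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Brief</title></titleStmt></fileDesc></teiHeader>
  <text><body>
    <p rend="italics">Beste <hi rend="underline">Zus</hi>,</p>
    <p>kom je morgenavond</p>
  </body></text>
</TEI>"""


def local(name):
    """Strip the '{namespace}' prefix that ElementTree puts before names."""
    return name.rsplit("}", 1)[-1]


def inventory(xml_text):
    """Count elements and attributes, and collect elements seen with mixed content."""
    root = ET.fromstring(xml_text)  # raises ParseError if the XML is not well-formed
    elements = Counter()
    attributes = Counter()
    mixed = set()
    for node in root.iter():
        name = local(node.tag)
        elements[name] += 1
        for att in node.attrib:
            attributes[(name, local(att))] += 1
        has_children = len(node) > 0
        # text directly inside the element, before or between its children
        has_text = bool((node.text or "").strip()) or any(
            (child.tail or "").strip() for child in node
        )
        if has_children and has_text:
            mixed.add(name)
    return elements, attributes, mixed


elements, attributes, mixed = inventory(SAMPLE)
for name, count in sorted(elements.items()):
    print(f"{name:<12} {count}")
print("mixed content observed in:", sorted(mixed))
```

An element that never shows mixed content in the sample may still be declared as mixed in the schema,
which is why the converter consults the schemas instead.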

### Step 2: Convert

Run the actual conversion and produce TF output.

In [23]:
Tei.task(convert=True)

Start folder proeftuin:
  14 19100131_SAAL_ARNO_0018.xml                       
End   folder proeftuin



True

### Step 3: Load the TF data

The final proof that the conversion has worked is to load the data.
On first-time loading several checks and precomputations are performed.
Next time the loading will be much quicker.

In [24]:
Tei.task(load=True, verbose=1)

   |     0.02s T otype                from ~/github/annotation/mondriaan/tf/0.8.6pre
   |     0.24s T oslots               from ~/github/annotation/mondriaan/tf/0.8.6pre
   |     0.00s T folder               from ~/github/annotation/mondriaan/tf/0.8.6pre
   |     0.13s T ch                   from ~/github/annotation/mondriaan/tf/0.8.6pre
   |     0.00s T letter               from ~/github/annotation/mondriaan/tf/0.8.6pre
   |     0.00s T chunk                from ~/github/annotation/mondriaan/tf/0.8.6pre
   |      |     0.00s C __levels__           from otype, oslots, otext
   |      |     0.45s C __order__            from otype, oslots, __levels__
   |      |     0.01s C __rank__             from otype, __order__
   |      |     0.49s C __levUp__            from otype, oslots, __rank__
   |      |     0.10s C __levDown__          from otype, __levUp__, __rank__
   |      |     0.01s C __characters__       from otext
   |      |     0.13s C __boundary__         from otype, oslots, __ra

True

### Step 4: Configure a TF app

The TF app has configuration settings, a bit of custom code, and documentation.

Most of it will be generated now, but there are ways to keep custom additions intact.

In [25]:
Tei.task(app=True)

App updating  ...
	~/github/annotation/mondriaan/docs/about.md (generated with custom info)
	~/github/annotation/mondriaan/docs/transcription.md (no custom info, older orginal exists)
	~/github/annotation/mondriaan/app/static/logo.png (already exists, not overwritten)
	~/github/annotation/mondriaan/app/static/display.css (generated with custom info)
	~/github/annotation/mondriaan/app/config.yaml (generated with custom info)
	~/github/annotation/mondriaan/app/app.py (generated with custom info)
Done


True

## View the preliminary result

In [26]:
from tf.app import use

In [27]:
ORG = "annotation"
REPO = "mondriaan"

In [28]:
Apre = use(f"{ORG}/{REPO}:clone", checkout="clone")

**Locating corpus resources ...**

Name,# of nodes,# slots/node,% coverage
folder,1,62349.0,100
letter,14,4453.5,100
body,14,3739.64,84
text,14,3739.64,84
chunk,86,724.99,100
div,92,982.87,145
teiHeader,14,702.64,16
page,51,544.35,45
revisionDesc,14,424.07,10
p,90,313.77,45


### Show a fragment

In [29]:
chunk = Apre.api.F.otype.s("chunk")[4]
Apre.plain(chunk)

## Memory resources

Check the memory usage per feature.

Keep an eye on the footprint of the `sibling` feature, because it might become too large
in a bigger corpus.

In [30]:
Apre.footprint()

                                                


# 59 features

feature | members | size in bytes
--- | --- | ---
__levUp__ | 75,023 | 5,356,216
ch | 62,349 | 4,372,832
oslots | 3 | 3,620,100
__boundary__ | 2 | 2,816,328
__order__ | 75,023 | 2,700,868
is_note | 25,393 | 2,021,854
__levDown__ | 12,674 | 1,569,600
is_meta | 11,356 | 907,930
str | 10,833 | 739,650
after | 10,833 | 604,737
sibling | 965 | 371,900
__rank__ | 75,023 | 318,936
parent | 1,647 | 286,332
otype | 4 | 105,225
extraspace | 780 | 58,842
type | 402 | 31,536
empty | 438 | 30,804
id | 267 | 20,803
rend_italics | 308 | 17,956
n | 231 | 16,478
rend_underline | 212 | 15,268
__characters__ | 1 | 14,559
target | 139 | 14,281
__levels__ | 66 | 14,241
key | 146 | 11,276
__sections__ | 2 | 10,365
who | 127 | 8,550
chunk | 86 | 7,292
url | 63 | 7,279
rend_upsidedown | 90 | 7,236
rend | 72 | 5,152
lang | 76 | 4,494
facs | 48 | 4,036
f | 46 | 3,654
rend_spaced | 38 | 2,260
rend_blockletter | 34 | 2,148
letter | 14 | 1,977
letterid | 14 | 1,977
manid | 14 | 1,972
when | 14 | 1,720
institution | 14 | 1,397
form | 14 | 1,206
country | 14 | 1,189
rend_super | 12 | 996
unit | 9 | 712
quantity | 9 | 704
dim | 9 | 661
rend_above | 8 | 604
rend_center | 6 | 548
rend_right | 6 | 548
rend_right_underline | 6 | 548
place | 3 | 471
rend_underline2 | 5 | 392
rend_norend | 4 | 364
reason | 2 | 339
folder | 1 | 310
rend_overwritten | 2 | 308
rend_super_underline2 | 2 | 308
rend_super_underline | 1 | 280
TOTAL | 364,983 | 26,120,549

## Add tokens and sentences

We add tokens and sentences to the TF dataset.

We do this in the following steps:

1. Generate a plain text plus a mapping between character positions and nodes
2. Use Spacy to tokenize the text and to determine sentence boundaries
3. Translate the Spacy results back to extra nodes and features for the TF set
4. Replace the character slots in the TF set by tokens
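
The first three steps can be mimicked on a toy example.
This is a sketch under assumptions, not the `NLPipeline` implementation: a regular expression stands in for Spacy,
and the dictionary `slots` and the list layout of `pos` are hypothetical.

```python
import re

# Hypothetical character-level dataset: slot number -> character.
# Slots start at 1, as node numbers in TF do.
chars = "Beste Zus, kom je"
slots = {i + 1: ch for i, ch in enumerate(chars)}

# Step 1: plain text plus a mapping from text position to slot.
plain = "".join(slots[s] for s in sorted(slots))
pos = sorted(slots)  # pos[i] is the slot that produced plain[i]

# Step 2: tokenize; a regular expression stands in for the real NLP tool.
toks = [(m.start(), m.end(), m.group()) for m in re.finditer(r"\w+|[^\w\s]", plain)]

# Step 3: translate character offsets back to slot ranges in the dataset.
tok_slots = [(pos[b], pos[e - 1], s) for (b, e, s) in toks]

for first, last, s in tok_slots:
    print(f"slots {first:>2}-{last:>2}: {s!r}")
```

Because every token is expressed as a range of existing slots, nothing in the dataset has to be guessed
when the tokens are added as nodes.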

### Step by step from Python

We carry out the steps from within Python.

That way we have access to all intermediate results, and we can experiment between the steps.

We load the data we have so far, and pass it on to an `NLPipeline` object, defined by Text-Fabric.

In [31]:
from tf.app import use
from tf.convert.addnlp import NLPipeline

In [32]:
ORG = "annotation"
REPO = "mondriaan"

### Back to the previous state

When we have added the data to the dataset, we will tweak the TF app.

But if we want to redo the pipeline, we have to restore the app to the situation before
the tokens and sentences were added.

That's the reason we have the next cell.

In [33]:
Tei.task(app=True)

App updating  ...
	~/github/annotation/mondriaan/docs/about.md (generated with custom info)
	~/github/annotation/mondriaan/docs/transcription.md (no custom info, older orginal exists)
	~/github/annotation/mondriaan/app/static/logo.png (already exists, not overwritten)
	~/github/annotation/mondriaan/app/static/display.css (generated with custom info)
	~/github/annotation/mondriaan/app/config.yaml (generated with custom info)
	~/github/annotation/mondriaan/app/app.py (generated with custom info)
Done


True

In [34]:
Apre = use(f"{ORG}/{REPO}:clone", checkout="clone")
NLP = NLPipeline(lang="en", verbose=0, write=True)
NLP.loadApp(Apre)

**Locating corpus resources ...**

Name,# of nodes,# slots/node,% coverage
folder,1,62349.0,100
letter,14,4453.5,100
body,14,3739.64,84
text,14,3739.64,84
chunk,86,724.99,100
div,92,982.87,145
teiHeader,14,702.64,16
page,51,544.35,45
revisionDesc,14,424.07,10
p,90,313.77,45


Input data has version 0.8.6pre


### Before the steps

We can set the verbosity as we like.

When generating the plain text, you can add `verbose=-1`, `0`, or `1` and/or `write=True` to the call.

* `verbose=-1` is the same as `-verbose`
* `verbose=0` is the same as `+verbose`
* `verbose=1` is the same as `++verbose`

### Step 1: Generate a plain text of the whole corpus

The function delivers the text in a variable, and it has recorded which character positions correspond
to which slots in the TF dataset.

We receive both items of data.

In [35]:
(text, positions) = NLP.task(plaintext=True)

Input data has version 0.8.6pre
  0.00s Generating a plain text with positions ...
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
   |   Found 219 empty slots
   |   recorded flow main       with 122392 items
   |   recorded flow del        with    159 items
   |   recorded flow note       with  66355 items
   |   recorded flow orig       with    104 items
  0.20s Done. Generated text and positions written to ~/github/annotation/mondriaan/_temp/txt/plain.txt


### Step 2: Run Spacy to get tokens and sentences

Now we feed the text from step 1 into the NLP pipeline, which is Spacy.

We get a list of tokens and a list of sentences back.

In [36]:
(tokens, sentences) = NLP.task(lingo=True, text=text)

Input data has version 0.8.6pre
  0.00s Using NLP pipeline Spacy (en) ...
  2.97s NLP done


Let's examine a few tokens and sentences:

In [37]:
for token in tokens[400:410]:
    print(token)

(1590, 1595, 'Beste', ' ')
(1596, 1599, 'Zus', '')
(1599, 1600, ',', '')
(1600, 1601, '\n', '')
(1601, 1602, '￮', '')
(1602, 1605, 'kom', ' ')
(1606, 1608, 'je', ' ')
(1609, 1620, 'morgenavond', ' ')
(1621, 1622, '(', '')
(1622, 1630, 'Woensdag', '')


Each token entry specifies the start and end position in the plain text file,
the string value of the token, and the whitespace after the token, if any.
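
The relation between the four fields can be verified on the ten entries printed above.
The helper `check_tokens` is hypothetical and not part of Text-Fabric; the checks hold for this sample,
and we only assume that they hold for the other entries as well.

```python
# The ten token entries printed above: (start, end, string, whitespace after).
sample = [
    (1590, 1595, "Beste", " "),
    (1596, 1599, "Zus", ""),
    (1599, 1600, ",", ""),
    (1600, 1601, "\n", ""),
    (1601, 1602, "￮", ""),
    (1602, 1605, "kom", " "),
    (1606, 1608, "je", " "),
    (1609, 1620, "morgenavond", " "),
    (1621, 1622, "(", ""),
    (1622, 1630, "Woensdag", ""),
]


def check_tokens(entries):
    """Return the text spanned by the entries, verifying the offsets on the way."""
    pieces = []
    for i, (start, end, string, after) in enumerate(entries):
        # the span length equals the length of the token string
        assert end - start == len(string), (start, end, string)
        if i + 1 < len(entries):
            # the next token starts right after this token's trailing whitespace
            assert entries[i + 1][0] == end + len(after), (start, end, string)
        pieces.append(string + after)
    return "".join(pieces)


fragment = check_tokens(sample)
print(repr(fragment))
```

Concatenating string and trailing whitespace of consecutive tokens gives back the stretch of plain text
between the first start position and the last end position.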

In [38]:
for sentence in sentences[390:410]:
    print(sentence)

(11368, 11405, 'Warm wishes from your Piet.  xxx. ')
(11405, 11413, 'Aa bb. ')
(11413, 11433, 'End div.   xxx. ')
(11433, 11441, 'Aa bb. ')
(11441, 11462, 'End div.   xxx. ')
(11462, 11470, 'Aa bb. ')
(11470, 11492, 'End chunk.   xxx. ')
(11492, 11500, 'Aa bb. ')
(11500, 11516, 'End letter.  ')
(11516, 11524, 'Aa bb. ')
(11524, 11542, 'Begin letter.  ')
(11542, 11550, 'Aa bb. ')
(11550, 11566, 'Begin meta.  ')
(11566, 11574, 'Aa bb. ')
(11574, 11609, 'Begin chunk. fileDesc. titleStmt.')
(11610, 11616, 'title.')
(11617, 11643, 'Brief aan Aletta de Iongh.')
(11644, 11685, 'Amsterdam, donderdag 13 mei 1909.editor.')
(11686, 11708, 'Wietse Coppes.editor.')
(11709, 11729, 'Leo Jansen.sponsor.')


Sentence entries have the same fields, except for the last whitespace field.

Actually, the program will not use the texts of tokens and sentences for display, only for determining
where the boundaries are.

With those boundaries in hand, the texts of tokens and sentences are read off from the original corpus.
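
A minimal sketch of what it means to use the boundaries only: the text field of a sentence entry is ignored,
and the sentence text is rebuilt from the tokens that start within its boundaries.
The data and the helpers `tokens_in` and `sentence_text` are hypothetical; the real pipeline reads the text from the TF dataset.

```python
from bisect import bisect_left

# Hypothetical entries in the formats shown above.
toy_tokens = [
    (0, 5, "Beste", " "),
    (6, 9, "Zus", ""),
    (9, 10, ",", " "),
    (11, 14, "kom", " "),
    (15, 17, "je", ""),
    (17, 18, ".", " "),
    (19, 22, "Dag", ""),
    (22, 23, ".", ""),
]
# The third field is deliberately wrong: it is never consulted.
toy_sentences = [(0, 19, "not used"), (19, 23, "not used")]

starts = [t[0] for t in toy_tokens]  # sorted, so we can search by bisection


def tokens_in(span_start, span_end):
    """Tokens whose start position lies in the half-open interval [span_start, span_end)."""
    lo = bisect_left(starts, span_start)
    hi = bisect_left(starts, span_end)
    return toy_tokens[lo:hi]


def sentence_text(span_start, span_end):
    """Rebuild the sentence text from its tokens and their trailing whitespace."""
    return "".join(s + after for (_, _, s, after) in tokens_in(span_start, span_end))


texts = [sentence_text(b, e) for (b, e, _) in toy_sentences]
for t in texts:
    print(repr(t))
```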

### Step 3: Ingest the results in the data set

A lot of critical things happen when we ingest the token and sentence streams into our dataset.

We calculate slot positions, retrieve text, split some tokens, and last but not least,
we replace the character-by-character basis of the preliminary dataset by a token-by-token basis.

In [39]:
newVersion = NLP.task(
    ingest=True,
    positions=positions,
    tokens=tokens,
    sentences=sentences,
)

Input data has version 0.8.6pre
  0.00s Ingesting tokens and sentences into the dataset ...
   |     9.87s Mapping NLP data to nodes and features ...
   |      |     0.00s generating token-nodes with features str, after, empty
   |      |      |     0.00s 13353 token nodes have values assigned for str, after
   |      |      |     0.00s 219 empty slots have split surrounding tokens
   |      |      |     0.00s 177 space slots have split into chars
   |      |      |     0.00s  6870x Items contained in extra generated text
   |      |     0.04s 13353 tokens
   |      |     0.04s generating sentence-nodes with features nsent
   |      |      |     0.02s 1052 sentence nodes have values assigned for nsent
   |      |      |     0.02s  1038x Items contained in extra generated text
   |      |      |     0.02s    50x Items with empty final text
   |      |     0.06s 1052 sentences
   |     9.94s Make a modified dataset ...
  0.00s Feature overview: 48 for nodes; 3 for edges; 1 configs; 9 com
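
What the replacement of character slots by token slots amounts to can be sketched on toy data.
This is an illustration under assumptions, not the code of `NLPipeline`: the `p` node, the token spans,
and the dictionaries `str_feature` and `after_feature` are hypothetical stand-ins for the real nodes and features.

```python
# Hypothetical character-level data: 17 character slots, numbered from 1.
char_text = "Beste Zus, kom je"
# A container node `p` that is linked to the character slots of "Zus, kom je".
p_char_slots = set(range(7, 18))

# Token boundaries as (first character slot, last character slot).
token_spans = [(1, 5), (7, 9), (10, 10), (12, 14), (16, 17)]

char_to_token = {}  # old character slot -> new token slot
str_feature = {}    # token slot -> token string
after_feature = {}  # token slot -> material between this token and the next

for token_slot, (first, last) in enumerate(token_spans, start=1):
    for c in range(first, last + 1):
        char_to_token[c] = token_slot
    str_feature[token_slot] = char_text[first - 1 : last]
    if token_slot < len(token_spans):
        next_first = token_spans[token_slot][0]  # first slot of the next token
    else:
        next_first = len(char_text) + 1
    after_feature[token_slot] = char_text[last : next_first - 1]

# Re-link the container node; characters outside tokens (the spaces) drop out,
# their content survives in the `after` values.
p_token_slots = sorted({char_to_token[c] for c in p_char_slots if c in char_to_token})

print(p_token_slots)
print(str_feature)
print(after_feature)
```

The container node keeps covering the same material, but it is now linked to 4 slots instead of 11.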

### Step 4: Adjust the app to the modified dataset

Various things in the `config.yaml` and `app.py` of the TF app should be updated, as well
as the documentation file that gives the ins and outs of the resulting features.

In [40]:
Tei.task(apptoken=True)

App updating  with tokens and sentences  ...
	~/github/annotation/mondriaan/docs/about.md (generated with custom info)
	~/github/annotation/mondriaan/docs/transcription.md (no custom info, older orginal exists)
	~/github/annotation/mondriaan/app/static/logo.png (already exists, not overwritten)
	~/github/annotation/mondriaan/app/static/display.css (generated with custom info)
	~/github/annotation/mondriaan/app/config.yaml (generated with custom info)
	~/github/annotation/mondriaan/app/app.py (generated with custom info)
Done


True

# Use the new dataset

We can now use the resulting dataset in the usual way.
Because we have adapted the TF app, the version without the `pre` will now be loaded.

In [41]:
A = use(f"{ORG}/{REPO}:clone", checkout="clone", hoist=globals())

**Locating corpus resources ...**

   |     0.00s T otype                from ~/github/annotation/mondriaan/tf/0.8.6
   |     0.05s T oslots               from ~/github/annotation/mondriaan/tf/0.8.6
   |     0.00s T letter               from ~/github/annotation/mondriaan/tf/0.8.6
   |     0.03s T str                  from ~/github/annotation/mondriaan/tf/0.8.6
   |     0.00s T folder               from ~/github/annotation/mondriaan/tf/0.8.6
   |     0.00s T chunk                from ~/github/annotation/mondriaan/tf/0.8.6
   |     0.03s T after                from ~/github/annotation/mondriaan/tf/0.8.6
   |      |     0.00s C __levels__           from otype, oslots, otext
   |      |     0.10s C __order__            from otype, oslots, __levels__
   |      |     0.00s C __rank__             from otype, __order__
   |      |     0.10s C __levUp__            from otype, oslots, __rank__
   |      |     0.03s C __levDown__          from otype, __levUp__, __rank__
   |      |     0.00s C __characters__       from otext
   | 

Name,# of nodes,# slots/node,% coverage
folder,1,13353.0,100
letter,14,953.79,100
body,14,811.79,85
text,14,811.79,85
chunk,86,155.27,100
div,92,211.23,146
teiHeader,14,132.57,14
page,51,125.18,48
p,90,70.99,48
revisionDesc,14,69.0,7


## Memory resources (revisited)

We now have a leaner dataset, because the granularity has become coarser: from character
to token. The number of slots has dropped from 62,349 to 13,353.

In [42]:
A.footprint()

                                                


# 58 features

feature | members | size in bytes
--- | --- | ---
str | 13,353 | 1,108,995
__levUp__ | 16,246 | 1,061,848
after | 13,033 | 954,935
oslots | 3 | 741,632
__order__ | 16,246 | 584,896
__boundary__ | 2 | 492,316
sibling | 965 | 371,900
__levDown__ | 2,893 | 339,852
parent | 1,647 | 286,332
is_note | 4,371 | 269,982
is_meta | 1,856 | 125,826
nsent | 1,052 | 95,864
__rank__ | 16,246 | 65,904
extraspace | 759 | 58,254
type | 402 | 31,536
otype | 4 | 26,982
id | 267 | 20,803
n | 231 | 16,478
empty | 219 | 15,464
__characters__ | 1 | 14,655
target | 139 | 14,281
__levels__ | 66 | 14,190
key | 146 | 11,276
__sections__ | 2 | 10,337
who | 127 | 8,550
chunk | 86 | 7,292
url | 63 | 7,279
rend | 72 | 5,152
lang | 76 | 4,494
facs | 48 | 4,036
rend_underline | 59 | 3,944
f | 46 | 3,654
rend_italics | 47 | 3,608
manid | 14 | 2,001
letter | 14 | 1,977
letterid | 14 | 1,977
when | 14 | 1,720
institution | 14 | 1,473
form | 14 | 1,206
country | 14 | 1,189
rend_upsidedown | 17 | 1,136
unit | 9 | 712
quantity | 9 | 704
dim | 9 | 661
rend_blockletter | 6 | 548
place | 3 | 471
rend_super | 5 | 392
rend_spaced | 4 | 364
reason | 2 | 339
folder | 1 | 310
rend_above | 2 | 308
rend_center | 2 | 308
rend_overwritten | 2 | 308
rend_right | 2 | 308
rend_right_underline | 2 | 308
rend_super_underline2 | 2 | 308
rend_super_underline | 1 | 280
rend_underline2 | 1 | 280
TOTAL | 90,950 | 6,802,135
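
To quantify the gain, we can compare the totals of the two footprint tables and the slot counts of the two datasets.
The numbers below are copied from the outputs in this notebook; the variable names are ours.

```python
# Totals copied from the two footprint tables in this notebook.
BYTES_CHAR, BYTES_TOKEN = 26_120_549, 6_802_135
# Slot counts copied from the `folder` rows of the two node type overviews.
SLOTS_CHAR, SLOTS_TOKEN = 62_349, 13_353
# Size of the `sibling` feature before and after.
SIBLING_CHAR, SIBLING_TOKEN = 371_900, 371_900

size_ratio = BYTES_CHAR / BYTES_TOKEN
slot_ratio = SLOTS_CHAR / SLOTS_TOKEN

print(f"memory shrinks by a factor of {size_ratio:.2f}")
print(f"slots shrink by a factor of {slot_ratio:.2f}")
print(f"sibling unchanged: {SIBLING_CHAR == SIBLING_TOKEN}")
```

The `sibling` feature keeps its size, so its relative share of the footprint grows once the slot-bound features have shrunk.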

# Zip the data

This is for producing a zip file to attach to the latest release, so that TF can download the data smoothly.

In [43]:
A.zipAll()

Data to be zipped:
	OK       app                      (v0.8.5 1ca5d1)     : ~/github/annotation/mondriaan/app
	OK       main data                (v0.8.5 1ca5d1)     : ~/github/annotation/mondriaan/tf/0.8.6
	OK       graphics                 (v0.8.5 1ca5d1)     : ~/github/annotation/mondriaan/illustrations
Writing zip file ...
Result: ~/Downloads/github/annotation/mondriaan/complete.zip


# Exploration

We walk around a bit more in the corpus.

## All titles

In [44]:
for t in F.otype.s("titleStmt"):
    print(t, T.text(t))

15163 Brief aan Aletta de Iongh. Amsterdam, dinsdag 16 februari, dinsdag 2 maart of dinsdag 9 maart 1909.
Wietse Coppes
Leo Jansen
Mondriaan Editieproject

15164 Brief aan Aletta de Iongh. Amsterdam, woensdag 7 april 1909.
Wietse Coppes
Leo Jansen
Mondriaan Editieproject

15165 Brief aan Aletta de Iongh. Amsterdam, tussen maandag 19 en vrijdag 23 april 1909.
Wietse Coppes
Leo Jansen
Mondriaan Editieproject

15166 Brief aan Aletta de Iongh. Amsterdam, maandag 26 april 1909.
Wietse Coppes
Leo Jansen
Mondriaan Editieproject

15167 Brief aan Aletta de Iongh. Amsterdam, donderdag 13 mei 1909.
Wietse Coppes
Leo Jansen
Mondriaan Editieproject

15168 Brief aan Aletta de Iongh. Amsterdam, donderdag 24 juni 1909.
Wietse Coppes
Leo Jansen
Mondriaan Editieproject

15169 Brief aan Aletta de Iongh. Amsterdam, eerste helft augustus 1909.
Wietse Coppes
Leo Jansen
Mondriaan Editieproject

15170  Briefkaart aan Gerrit Willem Knap. Zoutelande, c. dinsdag 24 augustus 1909.
Wietse Coppes
Leo Jansen
Mondria

## Sentences

In [45]:
for s in F.otype.s("sentence")[2:4]:
    print(T.text(s))

Wietse Coppes
Leo Jansen


In [46]:
for s in F.otype.s("sentence")[2:4]:
    A.pretty(s, withNodes=True)

In [47]:
for (i, s) in enumerate(F.otype.s("sentence")[0:100]):
    print(f"SENTENCE {i + 1}: {T.text(s)}")

SENTENCE 1: Brief aan Aletta de Iongh. 
SENTENCE 2: Amsterdam, dinsdag 16 februari, dinsdag 2 maart of dinsdag 9 maart 1909.
SENTENCE 3: Wietse Coppes
SENTENCE 4: Leo Jansen
SENTENCE 5: Mondriaan Editieproject
SENTENCE 6: Nederland
SENTENCE 7: Otterlo
SENTENCE 8: Kröller Müller Museum
SENTENCE 9: KM 123.397
SENTENCE 10: 19090216y_IONG_1303
SENTENCE 11: ​
SENTENCE 12: ​
SENTENCE 13: ​
SENTENCE 14: Piet Mondriaan
SENTENCE 15: dinsdag 16 februari, dinsdag 2 maart of dinsdag 9 maart 1909
SENTENCE 16: Amsterdam
SENTENCE 17: Aletta de Iongh
SENTENCE 18: transcriptie: voltooid 20.7.15
SENTENCE 19: collatie bron: 6.6.16
SENTENCE 20: tweede collatie aan het origineel: voltooid 26.11.19
SENTENCE 21: invoer tweede collatie: voltooid 5.8.16
SENTENCE 22: bespreking eindversie: gb
SENTENCE 23: markeren annotaties: in bewerking / voltooid
SENTENCE 24: gereed 17.4.2019
SENTENCE 25: titel gecontroleerd 21.09.2020
SENTENCE 26: personen getagd 12.10.2020
SENTENCE 27: vertaling ingevoerd 16.2.2021
SENTENC

# Illustrations

In [48]:
results = A.search("""
rs type=artwork-m key~[0-9]
""")

  0.00s 12 results


In [49]:
A.show(results, withNodes=True, end=1)

## The first letter

In [50]:
A.pretty(F.otype.s("letter")[0], full=True, withNodes=False)

## Pages

In [51]:
pages = A.search("""
page
""")
A.table(pages, end=2)

  0.00s 51 results


n,p,page
1,proeftuin@19090216y_IONG_1303:5,"Beste Zus, ​kom je morgenavond (Woensdag) om kwart voor acht ingang kleine zaal Concertgebouw​, dan heb ik een plaats voor v. ​ Buhlig lvoor je. ​En dan kunnen we een andere dan Donderdagmiddag afspreken want dan kan ik niet goed. Met vele beste groeten je Piet."
2,proeftuin@19090216y_IONG_1303:6,"Dear Zus,​ If you come to the entrance to the small auditorium in the Concertgebouw at a quarter to eight tomorrow (Wednesday) evening, I have a ticket for van Buhlig for you. ​And then we can arrange a time other than Thursday afternoon because I can’t manage that. With my very best wishes, your Piet."


## Overlapping divs

There are divs within divs.
Let's find them all.

First the total number of divs:

In [52]:
len(F.otype.s("div"))

92

In [53]:
query = """
d1:div
&& d2:div

d1 < d2
"""

resultsA = A.search(query)

  0.02s 68 results


We can also find the divs that are directly under another div by means of the `parent` edges:

In [54]:
query = """
div
<parent- div
"""

resultsD = A.search(query)

  0.00s 62 results


So some divs are nested, but not directly below each other.

Let's see which they are.

In [55]:
arbitrarily = set(resultsA)
directly = set(resultsD)

It is to be expected that the arbitrarily nested divs are a superset of the directly nested divs.

In [56]:
directly - arbitrarily

set()

Now the other way round:

In [57]:
results = arbitrarily - directly
results

{(13843, 13849),
 (13843, 13850),
 (13843, 13851),
 (13913, 13919),
 (13913, 13920),
 (13913, 13921)}
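
The same set arithmetic can be tried on a toy hierarchy.
The sketch below uses plain Python and a hypothetical `parent` dictionary instead of TF search:
following the parent chain all the way up gives the arbitrarily nested pairs, and subtracting the direct pairs
leaves the pairs that are more than one level apart.

```python
# Hypothetical parent edges among div nodes: child -> parent.
parent = {11: 10, 12: 10, 13: 12, 14: 13}

# Directly nested pairs (embedder, embedded), exactly one level apart.
directly_toy = {(p, c) for (c, p) in parent.items()}

# Arbitrarily nested pairs: follow the parent chain up to the top.
arbitrarily_toy = set()
for node in parent:
    ancestor = parent.get(node)
    while ancestor is not None:
        arbitrarily_toy.add((ancestor, node))
        ancestor = parent.get(ancestor)

# Pairs that are nested, but not directly.
indirect_toy = arbitrarily_toy - directly_toy

print(sorted(directly_toy - arbitrarily_toy))
print(sorted(indirect_toy))
```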

In [58]:
A.table(sorted(results), end=2)

n,p,div,div.1
1,proeftuin@19090421y_IONG_1304:6,"Manuscript. De brief dateert van vóór brief 19090513y_IONG_1293 van circa 13 mei 1909, waarin Mondriaan schrijft aan De Iongh dat hij nog geen portretten ontving van Waldenburg. De in deze brief voorgestelde fotosessie heeft dan dus reeds plaatsgevonden. We dateren de onderhavige brief daarom enkele weken eerder dan brief 19090513y_IONG_1293, in de week van maandag 19 tot en met vrijdag 23 april. Brief 19090426y_IONG_1738 zit nog tussen de onderhavige brief en brief 19090513y_IONG_1293 van circa 13 mei in. In die brief wordt meer concreet gestreefd naar een moment om de voorgestelde fotosessie te doen plaatsvinden. ​Met 'de schedelmeter' bedoelt Mondriaan Alfred Waldenburg. De fotosessie met Mondriaan en De Iongh heeft vermoedelijk kort na dit schrijven plaatsgevonden, hoewel de afdrukken pas in augustus gereed waren (zie brief 19090905y_IONG_1295). Tijdens de sessie maakte Waldenburg tevens een portretfoto van Mondriaan. Hoewel er enkele portretfoto's van De Iongh zijn overgeleverd, is geen daarvan met zekerheid toe te schrijven aan Waldenburg. Correspondentie tussen Mondriaan en Waldenburg is niet overgeleverd. De enige bewaard gebleven tekening van Mondriaan waarvan met zekerheid kan worden gesteld dat De Iongh er model voor stond is de krijttekening Piet Mondriaan, Leo Gestel, Meisjeskop, 1910 Alkmaar, Stedelijk Museum Alkmaar, inv./cat.nr. [..]. conté on papier. RKD 277201►Meisjeskop (1910, UA38). Mogelijk stond zij ook model voor de tekening RKD htpps://rkd.nl/images/68554►Female nude: bust portrait (c. 1909-1911, A645). ​ Dear Zus, ​The “skull measurer” came round this morning; he saw the sketches of you and thought your forehead and the expression in your eyes so beautiful that he wanted to take a photograph of you. ​He’s coming here again on Sunday. If you have the time and the inclination, come to me at 10 o’clock, otherwise I’ll just write and tell him you can’t come, in the event that you’d rather decline. 
It might well take a while before you get a print from him and perhaps he’ll put you in a book later: I’m just warning you, then you can do as you wish. Let’s agree that if you don’t have the inclination or the time, drop me a line and otherwise come to me on Sunday morning at 10. Bye, my dear Zus, your Piet. Manuscript. ​  ​​",Manuscript. ​
2,proeftuin@19090421y_IONG_1304:6,"Manuscript. De brief dateert van vóór brief 19090513y_IONG_1293 van circa 13 mei 1909, waarin Mondriaan schrijft aan De Iongh dat hij nog geen portretten ontving van Waldenburg. De in deze brief voorgestelde fotosessie heeft dan dus reeds plaatsgevonden. We dateren de onderhavige brief daarom enkele weken eerder dan brief 19090513y_IONG_1293, in de week van maandag 19 tot en met vrijdag 23 april. Brief 19090426y_IONG_1738 zit nog tussen de onderhavige brief en brief 19090513y_IONG_1293 van circa 13 mei in. In die brief wordt meer concreet gestreefd naar een moment om de voorgestelde fotosessie te doen plaatsvinden. ​Met 'de schedelmeter' bedoelt Mondriaan Alfred Waldenburg. De fotosessie met Mondriaan en De Iongh heeft vermoedelijk kort na dit schrijven plaatsgevonden, hoewel de afdrukken pas in augustus gereed waren (zie brief 19090905y_IONG_1295). Tijdens de sessie maakte Waldenburg tevens een portretfoto van Mondriaan. Hoewel er enkele portretfoto's van De Iongh zijn overgeleverd, is geen daarvan met zekerheid toe te schrijven aan Waldenburg. Correspondentie tussen Mondriaan en Waldenburg is niet overgeleverd. De enige bewaard gebleven tekening van Mondriaan waarvan met zekerheid kan worden gesteld dat De Iongh er model voor stond is de krijttekening Piet Mondriaan, Leo Gestel, Meisjeskop, 1910 Alkmaar, Stedelijk Museum Alkmaar, inv./cat.nr. [..]. conté on papier. RKD 277201►Meisjeskop (1910, UA38). Mogelijk stond zij ook model voor de tekening RKD htpps://rkd.nl/images/68554►Female nude: bust portrait (c. 1909-1911, A645). ​ Dear Zus, ​The “skull measurer” came round this morning; he saw the sketches of you and thought your forehead and the expression in your eyes so beautiful that he wanted to take a photograph of you. ​He’s coming here again on Sunday. If you have the time and the inclination, come to me at 10 o’clock, otherwise I’ll just write and tell him you can’t come, in the event that you’d rather decline. 
It might well take a while before you get a print from him and perhaps he’ll put you in a book later: I’m just warning you, then you can do as you wish. Let’s agree that if you don’t have the inclination or the time, drop me a line and otherwise come to me on Sunday morning at 10. Bye, my dear Zus, your Piet. Manuscript. ​  ​​",​


In [59]:
query = """
div
<parent- div
<parent- div
"""
results = A.search(query)

  0.00s 6 results


In [60]:
from textwrap import dedent

In [61]:
for i in range(1, 5):
    query = dedent(
        f"""
        div
        -sibling>{i}> div
        """
    )

    print(f"div siblings at distance {i}")
    results = A.search(query)

div siblings at distance 1
  0.00s 48 results
div siblings at distance 2
  0.00s 16 results
div siblings at distance 3
  0.00s 2 results
div siblings at distance 4
  0.00s 0 results
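
A group of n siblings contains n - k pairs at distance k, which would explain why the counts drop as the distance grows.
The sketch below counts such pairs for hypothetical sibling groups.
It assumes that the distance is the difference in position among the children of one parent;
we have not verified how the converter computes the values on the `sibling` edges.

```python
from collections import Counter

# Hypothetical sibling groups: each list holds the children of one parent,
# in document order.
groups = [[1, 2, 3, 4], [5, 6], [7, 8, 9]]

pairs_at = Counter()  # distance -> number of sibling pairs at that distance
for group in groups:
    for i in range(len(group)):
        for j in range(i + 1, len(group)):
            pairs_at[j - i] += 1

for distance in range(1, 5):
    print(f"siblings at distance {distance}: {pairs_at[distance]} pairs")
```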


### Notes

In [62]:
for (i, nn) in enumerate(F.otype.s("note")[4:5]):
    A.dm(f"### Note {i + 1}\n\n")
    # the chunk that contains the first token of the note
    noteTokens = L.d(nn, otype="token")
    s = L.u(noteTokens[0], otype="chunk")[0]
    A.pretty(nn, withNodes=True, full=True)
    A.pretty(s, withNodes=True, full=True)

### Note 1



## Extra features

In [66]:
features = ("letterid", "manid", "country", "institution")

for feat in features:
    meta = Fs(feat).meta
    print(f"{feat:<12}: {meta['conversionCode']}: {meta['conversionMethod']}")

letterid    : tt: derived
manid       : tt: derived
country     : tt: derived
institution : tt: derived


In [67]:
for letter in F.otype.s("letter"):
    A.pretty(letter, extraFeatures=features)