In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from tf.app import use
from tf.convert.tei import TEI
from tf.convert.addnlp import NLPipeline
from tf.advanced.helpers import dm

In [3]:
ORG = "annotation"
REPO = "mondriaan"

# Convert from TEI to TF

We show how to convert a TEI data source into TF.

This has two stages:

1. make a preliminary TF dataset with the character as slot type;
1. feed the plain text to a tokeniser, and add tokens and sentences to the dataset,
   while removing its character and word nodes;
   the new slot type is token.
   
A dataset based on characters is precise, but rather inefficient.
The second step makes the dataset much more efficient.

**More ways to do it!**

* [convertExpress](convertExpress.ipynb): as few commands, feedback, and interaction as possible
* [convertSteps](convertSteps.ipynb): broken down into a few command-line commands, with more feedback
* `convertDetails`: run from Python with full control

## Preliminary conversion

This is the same as in [convertSteps](convertSteps.ipynb), but now with even more feedback.

### Step 1: Check

Check the input: validity of the TEI-XML.

Make a report of the elements and attributes used.

Use the declared schemas in the XML source to determine which elements have
pure content and which ones mixed content.
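
To get a feel for what such a report contains, here is a minimal sketch that inventories elements and attributes with Python's standard `xml.etree.ElementTree`. It is not the code that `Tei.task(check=True)` runs: the sample fragment and the `inventory` helper are made up for illustration, and mixed content is detected here by observing the data, whereas the converter consults the declared schemas.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical TEI-like fragment, made up for this illustration.
SAMPLE = """<TEI><text><body>
<p rend="indent">Lieve <hi rend="italics">Zus</hi>, tot morgen.</p>
<div><p>Tweede alinea.</p></div>
</body></text></TEI>"""


def inventory(xmlText):
    """Count elements and attributes, and spot elements with mixed content."""
    root = ET.fromstring(xmlText)
    elements = Counter()
    attributes = Counter()
    mixed = set()

    for node in root.iter():
        elements[node.tag] += 1
        for att in node.attrib:
            attributes[(node.tag, att)] += 1

        # Mixed content observed: non-blank text next to child elements.
        children = list(node)
        if children:
            hasText = bool((node.text or "").strip()) or any(
                (child.tail or "").strip() for child in children
            )
            if hasText:
                mixed.add(node.tag)

    return elements, attributes, mixed


elements, attributes, mixed = inventory(SAMPLE)
print(dict(elements))
print(dict(attributes))
print(sorted(mixed))
```

An observation-based check like this can only say that an element was seen with mixed content in the data; a schema tells you whether it is allowed to have it.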

In [30]:
Tei = TEI(verbose=-1, tei=0, tf="0.8.12pre")

In [32]:
Tei.task(check=True, verbose=1, validate=True)

TEI to TF checking: ~/github/annotation/mondriaan/tei/2023-06-06 => ~/github/annotation/mondriaan/report/2023-06-06
Processing instructions are treated
XML validation will be performed
Section model I
Start folder proeftuin:
   1 MD           letter       md           19090216y_IONG_1303.xml                           
   2 MD           letter       md           19090407y_IONG_1739.xml                           
   3 MD           letter       md           19090421y_IONG_1304.xml                           
   4 MD           letter       md           19090426y_IONG_1738.xml                           
   5 MD           letter       md           19090513y_IONG_1293.xml                           
   6 MD           letter       md           19090624_IONG_1294.xml                            
   7 MD           letter       md           19090807y_IONG_1296.xml                           
   8 MD           letter       md           19090824y_KNAP_1747.xml                           
   9 MD        

True

### Step 2: Convert

Run the actual conversion and produce TF output.

In [33]:
Tei.task(convert=True)

Start folder proeftuin:
   1 MD           md           letter       19090216y_IONG_1303.xml                           
   2 MD           md           letter       19090407y_IONG_1739.xml                           
   3 MD           md           letter       19090421y_IONG_1304.xml                           
   4 MD           md           letter       19090426y_IONG_1738.xml                           
   5 MD           md           letter       19090513y_IONG_1293.xml                           
   6 MD           md           letter       19090624_IONG_1294.xml                            
   7 MD           md           letter       19090807y_IONG_1296.xml                           
   8 MD           md           letter       19090824y_KNAP_1747.xml                           
   9 MD           md           letter       19090905y_IONG_1295.xml                           
  10 MD           md           letter       190909XX_QUER_1654.xml                            
  11 MD           md      

True

### Step 3: Load the TF data

The final proof that the conversion has worked is to load the data.
On first-time loading several checks and pre-computations are performed.
Next time the loading will be much quicker.

In [34]:
Tei.task(load=True)

True

### Step 4: Configure a TF app

The TF app has configuration settings, a bit of custom code, and documentation.

Most of it will be generated now, but there are ways to keep custom additions intact.

In [35]:
Tei.task(app=True)

App updated


True

## View the preliminary result

In [36]:
Apre = use(f"{ORG}/{REPO}:clone", checkout="clone")

**Locating corpus resources ...**

Name,# of nodes,# slots/node,% coverage
folder,2,40715.0,100
bibliolist,1,13651.0,17
listBibl,2,6722.0,17
file,16,5089.38,100
letter,14,4681.57,80
body,16,2875.44,56
text,16,2875.44,56
artworklist,1,2237.0,3
listObject,1,2025.0,2
standOff,14,1798.36,31


We hoist the API handles of this dataset to the global scope.

In [37]:
Apre.hoist(globals())

### Show a fragment

In [38]:
chunk = F.otype.s("chunk")[4]
Apre.plain(chunk)

## Show the processing instructions

In [39]:
for nodeType in F.otype.all:
    if nodeType.startswith("?"):
        for n in F.otype.s(nodeType):
            Apre.pretty(n, multiFeatures=True)

## Check the extra features

In [40]:
features = tuple(feat for feat in Fall() if Fs(feat).meta.get("conversionCode", None) == "tt")

for feat in features:
    meta = Fs(feat).meta
    print(f"{feat:<15}: {meta['conversionCode']}: {meta['conversionMethod']}")

artmondriaanref: tt: derived
correspondent  : tt: derived
country        : tt: derived
exhibitionref  : tt: derived
institution    : tt: derived
letterid       : tt: derived
location       : tt: derived
msid           : tt: derived
period         : tt: derived
periodlong     : tt: derived
personref      : tt: derived
sender         : tt: derived


In [41]:
for letter in F.otype.s("letter"):
    Apre.pretty(letter, extraFeatures=features)

In [42]:
for feature in ("personref", "artmondriaanref", "artref", "exhibitionref"):
    fObj = Fs(feature)
    if fObj:
        items = list(fObj.items())
        nItems = len(items)
        dm(f"### {feature} with {nItems} items\n\n")
        for (node, pref) in items[0:5]:
            Apre.pretty(node, extraFeatures=f"ref key {feature}", baseTypes={"word"})
    else:
        dm(f"### {feature} with 0 items\n\n")
        

### personref with 121 items



### artmondriaanref with 20 items



    43s Node feature "artref" not loaded


### artref with 0 items



### exhibitionref with 16 items



## Memory resources

Check the memory usage per feature.

Keep an eye on the footprint of the `sibling` feature, because it might become too large
in a bigger corpus.

In [43]:
Apre.footprint()

                                                


# 76 features

feature | members | size in bytes
--- | --- | ---
__levUp__ | 97,492 | 6,967,744
ch | 81,430 | 4,907,354
oslots | 3 | 4,763,480
__boundary__ | 2 | 3,607,920
__order__ | 97,492 | 3,509,752
is_note | 29,774 | 2,144,522
__levDown__ | 16,062 | 2,008,868
str | 13,698 | 1,159,259
after | 13,698 | 981,401
is_meta | 11,630 | 915,602
sibling | 1,260 | 799,164
__rank__ | 97,492 | 414,432
parent | 2,114 | 341,376
otype | 4 | 133,351
extraspace | 1,010 | 65,282
rend_italics | 739 | 57,672
id | 451 | 43,439
empty | 524 | 33,212
type | 441 | 32,924
target | 200 | 22,054
n | 287 | 18,271
__levels__ | 84 | 18,143
rend_underline | 212 | 15,268
__characters__ | 1 | 15,069
__sections__ | 2 | 11,964
ref | 153 | 11,774
personref | 121 | 9,827
rend_indent | 170 | 9,476
level | 142 | 8,914
url | 73 | 8,849
who | 127 | 8,550
chunk | 105 | 7,852
lang | 90 | 7,310
rend_upsidedown | 90 | 7,236
rend | 85 | 5,624
facs | 48 | 4,036
scheme | 43 | 3,694
f | 46 | 3,654
file | 32 | 3,128
quantity | 29 | 3,111
rend_spaced | 38 | 2,260
unit | 29 | 2,190
rend_blockletter | 34 | 2,148
periodlong | 14 | 2,113
letterid | 14 | 1,977
msid | 14 | 1,972
period | 14 | 1,720
when | 14 | 1,720
key | 16 | 1,474
corresp | 10 | 1,427
source | 10 | 1,427
institution | 14 | 1,397
correspondent | 14 | 1,369
commodity | 20 | 1,306
location | 14 | 1,258
artmondriaanref | 20 | 1,241
form | 14 | 1,206
country | 14 | 1,189
exhibitionref | 16 | 1,129
sender | 14 | 1,087
rend_super | 12 | 996
name | 10 | 688
dim | 9 | 661
rend_above | 8 | 604
rend_center | 7 | 576
rend_right | 6 | 548
rend_right_underline | 6 | 548
place | 3 | 471
folder | 2 | 397
rend_underline2 | 5 | 392
rend_norend | 4 | 364
reason | 2 | 339
rend_bold | 2 | 308
rend_overwritten | 2 | 308
rend_super_underline2 | 2 | 308
rend_super_underline | 1 | 280
TOTAL | 467,884 | 33,129,956
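
Why might `sibling` grow too large? A rough sketch, under the assumption that every pair of siblings below the same parent gets one edge (the distance-labelled `sibling` queries later in this notebook suggest that pairs at any distance are linked). Under that assumption the number of edges grows quadratically with the number of children. The function name is hypothetical.

```python
def siblingEdgeCount(nChildren):
    """Edges if every pair of siblings under one parent is linked once (assumption)."""
    return nChildren * (nChildren - 1) // 2


for n in (2, 10, 100, 1000):
    print(f"{n:>5} children -> {siblingEdgeCount(n):>7} sibling edges")
```

So a single element with very many children can dominate the footprint of this feature.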

## Add tokens and sentences

We add tokens and sentences to the TF dataset.

We do this in the following steps:

1. generate a plain text plus a mapping between character positions and nodes;
2. use Spacy to tokenise the text and to determine sentence boundaries;
3. translate the Spacy results back to extra nodes and features for the TF set;
4. replace the character slots in the TF set by tokens.
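
The four steps can be sketched on a toy scale in plain Python. This is an illustration only, not what `NLPipeline` does internally: the corpus is a made-up string, a regular expression stands in for Spacy, and all variable names are hypothetical.

```python
import re

# Hypothetical mini corpus: slots 1..n, each holding one character.
slotChars = {i + 1: ch for (i, ch) in enumerate("Lieve Zus, tot morgen.")}

# Step 1: plain text, plus a list that maps each text position to its slot.
plainText = "".join(slotChars[s] for s in sorted(slotChars))
posToSlot = sorted(slotChars)

# Step 2: stand-in tokeniser (a regex, not Spacy): words and punctuation marks.
spans = [
    (m.start(), m.end(), m.group())
    for m in re.finditer(r"\w+|[^\w\s]", plainText)
]

# Step 3: translate character spans back to ranges of the original slots.
tokenSlots = [(posToSlot[b], posToSlot[e - 1], string) for (b, e, string) in spans]

# Step 4: the tokens become the new slots, numbered from 1.
newSlots = {i + 1: string for (i, (first, last, string)) in enumerate(tokenSlots)}

print(tokenSlots)
print(newSlots)
```

The point of the position mapping is that the tokeniser only ever sees plain text, while the results still end up attached to the right nodes.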

### Step by step from Python

We carry out the steps from within Python.

In that way we get access to all intermediate results, and we can play and explore between the steps.

We load the data we have so far, and pass it on to an `NLPipeline` object, defined by Text-Fabric.

### Back to the previous state

When we have added the data to the dataset, we will tweak the TF app.

But if we want to redo the pipeline, we have to restore the app to the situation before
the tokens and sentences were added.

That's the reason we have the next cell.

In [44]:
Tei.task(app=True)

App updated


True

In [45]:
Apre = use(f"{ORG}/{REPO}:clone", checkout="clone", hoist=globals())
NLP = NLPipeline(lang="en", verbose=0, write=True)
NLP.loadApp(Apre)

**Locating corpus resources ...**

Name,# of nodes,# slots/node,% coverage
folder,2,40715.0,100
bibliolist,1,13651.0,17
listBibl,2,6722.0,17
file,16,5089.38,100
letter,14,4681.57,80
body,16,2875.44,56
text,16,2875.44,56
artworklist,1,2237.0,3
listObject,1,2025.0,2
standOff,14,1798.36,31


Input data has version 0.8.12pre
Compute element boundaries
  1611 start postions
  1864 end postions


### Before the steps

We can set the verbosity as we like.

When running a task, add `verbose=-1`, `0`, or `1` and/or `write=True` if you like.
The `verbose` values correspond to these flags:

* `verbose=-1` is the same as `-verbose`
* `verbose=0` is the same as `+verbose`
* `verbose=1` is the same as `++verbose`

### Step 1: Generate a plain text of the whole corpus

The function delivers the text in a variable, and it has recorded which character positions correspond
to which slots in the TF dataset.

We receive both items of data.

In [47]:
(text, positions) = NLP.task(plaintext=True)

Input data has version 0.8.12pre
Compute element boundaries
  1611 start postions
  1864 end postions
  0.00s Generating a plain text with positions ...
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
   |   Found 262 empty slots
   |   recorded flow main       with 168276 items
   |   recorded flow del        with    153 items
   |   recorded flow note       with  77857 items
   |   recorded flow orig       with     95 items
  0.24s Done. Generated text and positions written to ~/github/annotation/mondriaan/_temp/txt/plain.txt


### Step 2: Run Spacy to get tokens and sentences

Now we feed the text from step 1 into the NLP pipeline, which is Spacy.

We get a list of tokens and a list of sentences back.

In [48]:
(tokens, sentences) = NLP.task(lingo=True, text=text)

Input data has version 0.8.12pre
Compute element boundaries
  1611 start postions
  1864 end postions
  0.00s Using NLP pipeline Spacy (en) ...
  3.59s Atomic tokens written to ~/github/annotation/mondriaan/_temp/txt/tokens.tsv
  3.59s Sentences written to ~/github/annotation/mondriaan/_temp/txt/sentences.tsv
  3.59s NLP done


Let's examine a few tokens and sentences:

In [49]:
for token in tokens[1022:1032]:
    print(token)

(3998, 4000, 'en', ' ')
(4001, 4003, 'ik', ' ')
(4004, 4009, 'moest', ' ')
(4010, 4012, 'je', ' ')
(4013, 4017, 'toch', ' ')
(4018, 4022, 'even', ' ')
(4023, 4029, 'zeggen', ' ')
(4030, 4033, 'dat', ' ')
(4034, 4036, 'ik', ' ')
(4037, 4043, 'morgen', ' ')


In [50]:
for token in tokens[1211:1221]:
    print(token)

(4698, 4711, 'unfortunately', ' ')
(4712, 4713, 'I', ' ')
(4714, 4718, 'have', ' ')
(4719, 4721, 'to', ' ')
(4722, 4724, 'be', ' ')
(4725, 4727, 'in', ' ')
(4728, 4731, 'the', ' ')
(4732, 4738, 'museum', ' ')
(4739, 4747, 'tomorrow', ' ')
(4748, 4757, 'afternoon', ' ')


Each token entry specifies the start and end position in the plain text file,
the string value of the token, and the white-space after the token, if any.
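
A quick consistency check on such entries, using a few tuples copied from the output above. In this sample the end position behaves as exclusive (the span is exactly as long as the string), and each token starts right after the white-space of the previous one. The helper `checkTokens` is hypothetical and only verifies those two observations.

```python
# Token entries copied from the output above: (start, end, string, after).
sample = [
    (3998, 4000, "en", " "),
    (4001, 4003, "ik", " "),
    (4004, 4009, "moest", " "),
    (4010, 4012, "je", " "),
    (4013, 4017, "toch", " "),
]


def checkTokens(entries):
    """Return the reconstructed text; raise if the spans are inconsistent."""
    pieces = []
    for (i, (start, end, string, after)) in enumerate(entries):
        # The span must be exactly as long as the token string.
        if end - start != len(string):
            raise ValueError(f"length mismatch in {entries[i]}")
        # The next token must start right after this token's white-space.
        if i + 1 < len(entries) and entries[i + 1][0] != end + len(after):
            raise ValueError(f"gap after {entries[i]}")
        pieces.append(string + after)
    return "".join(pieces)


print(repr(checkTokens(sample)))
```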

In [51]:
for sentence in sentences[136:155]:
    print(sentence)

(3519, 3557, 'titel gecontroleerd 15.9.2020.change.')
(3558, 3593, 'personen getagd 12.10.2020.change.')
(3594, 3632, 'vertaling ingevoerd 16.2.2021.change.')
(3633, 3679, 'codering personen aangepast 16.2.2022.change.')
(3680, 3747, 'controle/aanpassing afkortingen en emendaties 21.4.2023.  xxx. ')
(3747, 3755, 'Aa bb. ')
(3755, 3779, 'End chunk. .  xxx. ')
(3779, 3787, 'Aa bb. ')
(3787, 3801, 'End meta.  ')
(3801, 3809, 'Aa bb. ')
(3809, 3845, 'Begin chunk. ￮ ￮ ￮￮￮  xxx. ')
(3845, 3853, 'Aa bb. ')
(3853, 3868, 'End chunk.  ')
(3868, 3876, 'Aa bb. ')
(3876, 3892, 'Begin chunk.  ')
(3892, 4163, "Lieve Zus,neem me niet kwalijk dat ik je zoo'n gekreukte enveloppe zend￮, maar ik had geen meer in huis, en ik moest je toch even zeggen dat ik morgen middag helaas in 't museum bij 't ophangen van schilderijen moet zijn, omdat ik in de jury van St Lucas ben.￮  xxx. ")
(4163, 4171, 'Aa bb. ')
(4171, 4180, 'End p. ')
(4180, 4228, 'Ik ben er al een groot deel van de week geweest.')


Sentence entries have the same fields, minus the trailing white-space field.

The program does not use the texts of tokens and sentences for display, only for determining
where the boundaries are.

With those boundaries in hand, it reads off the texts of tokens and sentences from the original corpus.

### Step 3: Ingest the results in the data set

A lot of critical things happen when we ingest the token and sentence streams into our dataset.

We calculate slot positions, retrieve text, split some tokens, and last but not least,
we replace the character-by-character basis of the preliminary dataset by a token-by-token basis.
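
One of these operations, splitting a token where an element starts or ends inside it, can be sketched as follows. This is a simplified illustration with a hypothetical helper and assumed boundary positions; the actual ingest logic has to deal with more cases (empty slots, spaces, generated text).

```python
def splitAtBoundaries(start, end, boundaries):
    """Split the half-open span [start, end) at every boundary strictly inside it."""
    cuts = sorted(b for b in boundaries if start < b < end)
    edges = [start, *cuts, end]
    return list(zip(edges[:-1], edges[1:]))


# Hypothetical text in which "ber" is wrapped in its own element.
plainText = "Nov.ber 1909"
token = (0, 7)  # span of "Nov.ber" as a tokeniser might return it
boundaries = {4, 7}  # assumed positions where an element starts or ends

pieces = splitAtBoundaries(*token, boundaries)
print([plainText[b:e] for (b, e) in pieces])
```

Splitting this way guarantees that no slot straddles an element boundary, so every element can still be described as a set of whole slots.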

In [52]:
# Apre.pretty(L.u(2957, otype="word")[0], multiFeatures=True)

In [53]:
newVersion = NLP.task(
    ingest=True,
    positions=positions,
    tokens=tokens,
    sentences=sentences,
)

Input data has version 0.8.12pre
Compute element boundaries
  1611 start postions
  1864 end postions
  0.00s Ingesting tokens, and sentences into the dataset ...
   |    2m 29s Mapping NLP data to nodes and features ...
   |      |     0.00s generating t-nodes with features str, after, empty
   |      |      |     0.00s 17704 t nodes have values assigned for str, after
   |      |      |     0.00s 0 empty slots are properly contained in a token
   |      |      |     0.00s 77 space slots have split into chars
   |      |      |     0.00s 2118 slots have split around an element boundary
   |      |      |     0.00s  7053x Items contained in extra generated text
   |      |     0.06s 17523 tokens
   |      |     0.06s 17704 ts
   |      |     0.06s generating sentence-nodes with features nsent
   |      |      |     0.03s 1433 sentence nodes have values assigned for nsent
   |      |      |     0.03s  1092x Items contained in extra generated text
   |      |      |     0.03s    66x Item

### Step 4: Adjust the app to the modified dataset

Various things in the `config.yaml` and `app.py` of the TF app should be updated, as well
as the documentation file that gives the ins and outs of the resulting features.

In [54]:
Tei.task(apptoken=True)

App updated with tokens and sentences 


True

# Use the new dataset

We can now use the resulting dataset in the usual way.
Because we have adapted the TF app, the version without the `pre` will now be loaded.

In [68]:
A = use(f"{ORG}/{REPO}:clone", checkout="clone", silent="verbose")

**Locating corpus resources ...**

This is Text-Fabric 11.4.16
68 features found and 0 ignored
  0.01s Dataset without structure sections in otext:no structure functions in the T-API
  0.04s All features loaded/computed - for details use TF.isLoaded()
  0.01s All additional features loaded - for details use TF.isLoaded()


Name,# of nodes,# slots/node,% coverage
folder,2,8852.0,100
bibliolist,1,3145.0,18
listBibl,2,1546.5,17
file,16,1106.5,100
letter,14,1001.5,79
body,16,664.19,60
text,16,664.19,60
artworklist,1,538.0,3
listObject,1,491.0,3
standOff,14,358.71,28


We hoist the API handles of this dataset to the global scope.

In [69]:
A.hoist(globals())

## Memory resources (revisited)

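We now have a leaner dataset, because the granularity has become coarser: from character
to token. Compare the `TOTAL` row below (26,758,100 bytes) with that of the preliminary dataset (33,129,956 bytes).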

In [70]:
A.footprint()

                                                


# 75 features

feature | members | size in bytes
--- | --- | ---
__levUp__ | 39,024 | 9,255,964
__boundary__ | 2 | 3,227,400
oslots | 3 | 2,903,068
__levDown__ | 21,320 | 2,753,016
str | 35,227 | 2,487,545
after | 35,055 | 2,292,439
__order__ | 39,024 | 1,404,904
sibling | 1,260 | 799,164
parent | 2,114 | 341,376
is_note | 5,134 | 291,346
otype | 4 | 175,470
__rank__ | 39,024 | 165,944
nsent | 1,433 | 154,056
is_meta | 1,909 | 127,310
extraspace | 1,010 | 65,282
id | 451 | 43,439
type | 441 | 32,924
target | 200 | 22,054
__levels__ | 85 | 18,358
n | 287 | 18,271
empty | 262 | 16,668
__characters__ | 1 | 15,069
__sections__ | 2 | 11,964
ref | 153 | 11,774
personref | 121 | 9,827
level | 142 | 8,914
url | 73 | 8,849
who | 127 | 8,550
chunk | 105 | 7,852
rend_italics | 106 | 7,684
lang | 90 | 7,310
rend | 85 | 5,624
facs | 48 | 4,036
rend_underline | 59 | 3,944
scheme | 43 | 3,694
f | 46 | 3,654
file | 32 | 3,128
quantity | 29 | 3,111
rend_indent | 36 | 2,204
unit | 29 | 2,190
periodlong | 14 | 2,113
letterid | 14 | 1,977
msid | 14 | 1,972
period | 14 | 1,720
when | 14 | 1,720
key | 16 | 1,474
corresp | 10 | 1,427
source | 10 | 1,427
institution | 14 | 1,397
correspondent | 14 | 1,369
commodity | 20 | 1,306
location | 14 | 1,258
artmondriaanref | 20 | 1,241
form | 14 | 1,206
country | 14 | 1,189
rend_upsidedown | 17 | 1,136
exhibitionref | 16 | 1,129
sender | 14 | 1,087
name | 10 | 688
dim | 9 | 661
rend_blockletter | 6 | 548
place | 3 | 471
folder | 2 | 397
rend_super | 5 | 392
rend_spaced | 4 | 364
reason | 2 | 339
rend_center | 3 | 336
rend_above | 2 | 308
rend_overwritten | 2 | 308
rend_right | 2 | 308
rend_right_underline | 2 | 308
rend_super_underline2 | 2 | 308
rend_bold | 1 | 280
rend_super_underline | 1 | 280
rend_underline2 | 1 | 280
TOTAL | 224,916 | 26,758,100
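
To put numbers on the gain, the small calculation below uses only figures copied from the two footprint tables and the two node type overviews in this notebook. The slot counts are derived from the `folder` rows (2 nodes, 100% coverage); the variable names are ad hoc.

```python
# Figures copied from the footprint tables and node type overviews above.
bytesBefore = 33_129_956  # TOTAL, character-based dataset
bytesAfter = 26_758_100  # TOTAL, token-based dataset
slotsBefore = 2 * 40_715  # 2 folders x 40715.0 slots each, 100% coverage
slotsAfter = 2 * 8_852  # 2 folders x 8852.0 slots each, 100% coverage

saved = bytesBefore - bytesAfter
print(f"bytes saved: {saved:,} ({saved / bytesBefore:.1%})")
print(f"slots: {slotsBefore:,} -> {slotsAfter:,} (factor {slotsBefore / slotsAfter:.1f})")
```

The number of slots shrinks much more than the memory footprint does, because features such as `sibling` and `parent` do not depend on the slot type and keep their size.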

# Exploration

We walk around a bit more in the corpus.

## All titles:

In [71]:
for t in F.otype.s("titleStmt"):
    print(t, T.text(t))

20035 Brief aan Aletta de Iongh. Amsterdam, dinsdag 16 februari, dinsdag 2 maart of dinsdag 9 maart 1909.
Wietse Coppes
Leo Jansen
Mondriaan Editieproject

20036 Brief aan Aletta de Iongh. Amsterdam, woensdag 7 april 1909.
Wietse Coppes
Leo Jansen
Mondriaan Editieproject

20037 Brief aan Aletta de Iongh. Amsterdam, tussen maandag 19 en vrijdag 23 april 1909.
Wietse Coppes
Leo Jansen
Mondriaan Editieproject

20038 Brief aan Aletta de Iongh. Amsterdam, maandag 26 april 1909.
Wietse Coppes
Leo Jansen
Mondriaan Editieproject

20039 Brief aan Aletta de Iongh. Amsterdam, donderdag 13 mei 1909.
Wietse Coppes
Leo Jansen
Mondriaan Editieproject

20040 Brief aan Aletta de Iongh. Amsterdam, donderdag 24 juni 1909.
Wietse Coppes
Leo Jansen
Mondriaan Editieproject

20041 Brief aan Aletta de Iongh. Amsterdam, eerste helft augustus 1909.
Wietse Coppes
Leo Jansen
Mondriaan Editieproject

20042  Briefkaart aan Gerrit Willem Knap. Zoutelande, c. dinsdag 24 augustus 1909.
Wietse Coppes
Leo Jansen
Mondria

# Tokens

Show all tokens that are split into more than one atomic token (`t`).

In [72]:
query = """
token
  =: t
  <: t
"""

results = A.search(query)

  0.02s 12 results


In [73]:
A.table(results, condenseType="token")

n,p,token,t,t.1
1,proeftuin@19090407y_IONG_1739:5,S t,S,t
2,proeftuin@19090426y_IONG_1738:5,1 e,1,e
3,proeftuin@19090513y_IONG_1293:5,S t,S,t
4,proeftuin@19090624_IONG_1294:6,kaon,k,a
5,proeftuin@19090824y_KNAP_1747:5,42II,42,II
6,proeftuin@19090824y_KNAP_1747:6,Vr.nVrienden,Vr.,n
7,proeftuin@19090905y_IONG_1295:5,4.a,4,.a
8,proeftuin@19091024y_IONG_1297:5,Nov.ber,Nov.,ber
9,proeftuin@19100131_SAAL_ARNO_0018:5,onjuist,on,juist
10,proeftuin@19100131_SAAL_ARNO_0018:5,een,e,en


## Sentences

In [74]:
for s in F.otype.s("sentence")[2:4]:
    print(T.text(s))

Wietse Coppes
Leo Jansen


In [75]:
for s in F.otype.s("sentence")[2:4]:
    A.pretty(s, withNodes=True)

In [76]:
for (i, s) in enumerate(F.otype.s("sentence")[0:100]):
    print(f"SENTENCE {i + 1}: {T.text(s)}")

SENTENCE 1: Brief aan Aletta de Iongh. 
SENTENCE 2: Amsterdam, dinsdag 16 februari, dinsdag 2 maart of dinsdag 9 maart 1909.
SENTENCE 3: Wietse Coppes
SENTENCE 4: Leo Jansen
SENTENCE 5: Mondriaan Editieproject
SENTENCE 6: Nederland
SENTENCE 7: Otterlo
SENTENCE 8: Kröller Müller Museum
SENTENCE 9: KM 123.397
SENTENCE 10: 19090216y_IONG_1303
SENTENCE 11: ​
SENTENCE 12: ​
SENTENCE 13: ​
SENTENCE 14: Piet Mondriaan
SENTENCE 15: dinsdag 16 februari, dinsdag 2 maart of dinsdag 9 maart 1909
SENTENCE 16: Amsterdam
SENTENCE 17: Aletta de Iongh
SENTENCE 18: transcriptie: voltooid 20.7.15
SENTENCE 19: collatie bron: 6.6.16
SENTENCE 20: tweede collatie aan het origineel: voltooid 26.11.19
SENTENCE 21: invoer tweede collatie: voltooid 5.8.16
SENTENCE 22: bespreking eindversie: gb
SENTENCE 23: markeren annotaties: in bewerking / voltooid
SENTENCE 24: gereed 17.4.2019
SENTENCE 25: titel gecontroleerd 21.09.2020
SENTENCE 26: personen getagd 12.10.2020
SENTENCE 27: vertaling ingevoerd 16.2.2021
SENTENC

# Illustrations

In [77]:
results = A.search("""
rs type=artwork-m ref~artwork
""")

  0.00s 14 results


In [78]:
A.table(results, withNodes=True, end=1)

n,p,rs
1,proeftuin@19090421y_IONG_1304:7,"19466 Piet Mondriaan, Leo Gestel, Meisjeskop, 1910 Alkmaar, Stedelijk Museum Alkmaar, inv./cat.nr. [..]. conté on papier. RKD 277201►Meisjeskop"


## The first letter

In [79]:
A.pretty(F.otype.s("letter")[0], full=True, withNodes=False)

## Pages

In [80]:
pages = A.search("""
page
""")
A.show(pages, end=2, full=False)

  0.00s 51 results


## Overlapping div elements

There may be nested div elements.
Let's find them all.

First the total number of div elements:

In [81]:
len(F.otype.s("div"))

31

In [82]:
query = """
d1:div
&& d2:div

d1 < d2
"""

resultsA = A.search(query)

  0.02s 0 results


We can also find the div elements that are directly under another div by means of the `parent` edges:

In [83]:
query = """
div
<parent- div
"""

resultsD = A.search(query)

  0.00s 0 results


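With this data both queries return 0 results: no div elements overlap, and none sits directly below another.

On a corpus where divs are nested, the next cells show which ones are nested without being directly below each other.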

In [84]:
arbitrarily = set(resultsA)
directly = set(resultsD)

It is to be expected that the arbitrarily nested div elements are a superset of the directly nested ones.
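
With non-empty results the comparison works as in this sketch. The node numbers are hypothetical, and it assumes that both queries deliver their tuples in the same (outer, inner) order; otherwise the sets cannot be compared directly.

```python
# Hypothetical (outer, inner) node pairs, standing in for search results.
arbitrarilyNested = {(101, 102), (101, 103), (102, 103)}
directlyNested = {(101, 102), (102, 103)}

# Every direct nesting is also an arbitrary nesting: subset test.
print(directlyNested <= arbitrarilyNested)

# Pairs that are nested, but with at least one element in between.
print(sorted(arbitrarilyNested - directlyNested))
```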

In [85]:
directly - arbitrarily

set()

Now the other way round:

In [86]:
results = arbitrarily - directly
results

set()

In [87]:
A.table(sorted(results), end=2)

In [88]:
query = """
div
<parent- div
<parent- div
"""
results = A.search(query)

  0.00s 0 results


In [89]:
from textwrap import dedent

In [90]:
for i in range(1, 5):
    query = dedent(
        f"""
        div
        -sibling>{i}> div
        """
    )

    print(f"div siblings at distance {i}")
    results = A.search(query)

div siblings at distance 1
  0.00s 2 results
div siblings at distance 2
  0.00s 0 results
div siblings at distance 3
  0.00s 0 results
div siblings at distance 4
  0.00s 0 results


### Notes

In [91]:
for (i, nn) in enumerate(F.otype.s("note")[4:5]):
    A.dm(f"### Note {i + 1}\n\n")
    # the tokens of the note, and the chunk that contains its first token
    noteTokens = L.d(nn, otype="token")
    s = L.u(noteTokens[0], otype="chunk")[0]
    A.pretty(nn, withNodes=True, full=True)
    A.pretty(s, withNodes=True, full=True)

### Note 1

