<img align="right" src="images/tf.png" width="200"/>
<img align="right" src="images/huc.png" width="200"/>
<img align="right" src="images/logo.png" width="200"/>

---

To get started: consult [start](start.ipynb)

---

# Computing "by hand"

We descend to a more concrete level, and interact with the data by means of a bit of hand-coding.

Familiarity with the underlying
[data model](https://annotation.github.io/text-fabric/tf/about/datamodel.html)
is recommended.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import collections

In [3]:
from tf.app import use

In [5]:
A = use("annotation/mondriaan", hoist=globals())

**Locating corpus resources ...**

Name | # of nodes | # slots/node | % coverage
--- | --- | --- | ---
folder | 1 | 13761.0 | 100
letter | 14 | 982.93 | 100
body | 14 | 849.93 | 86
text | 14 | 849.93 | 86
chunk | 86 | 160.0 | 100
div | 93 | 219.99 | 149
teiHeader | 14 | 124.57 | 13
p | 95 | 73.39 | 51
postscript | 6 | 62.83 | 3
revisionDesc | 14 | 61.0 | 6


# What have we got?

Let's inspect the data.

The text is represented as nodes with properties. The first word is node 1, the second word is node 2, and so on.
After the last word node we get nodes for the elements, such as `p`, `rs`, etc. We also have nodes for folders and files.

All nodes can be dressed up with *features*.
A feature is a piece of data that specifies values for nodes.

For example, the feature `str` gives the text of each word node, and the feature `after` gives the text after a word but before the next word.

This gives a very crude insight into the data that Text-Fabric works with. Text-Fabric is a machine
that can weave the original text out of the threads given by the features.

Think of the nodes as the warp, through which the features are woven as wefts.
See also the [fabric metaphor](https://annotation.github.io/text-fabric/tf/about/datamodel.html#fabric-metaphor).

But it can also weave all kinds of other things out of the data.
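
The weaving idea can be sketched without Text-Fabric at all. The block below is a toy model, not the TF API: the sentence is invented, and `str_feature`, `after_feature` and `weave` are hypothetical names. Two plain dictionaries play the role of the features `str` and `after`.

```python
# Toy stand-in for a Text-Fabric dataset: plain dicts, not the TF API.
# Slot nodes are numbered from 1; each feature maps a node to a value.
# The sentence is invented for illustration.
str_feature = {
    1: "Beste", 2: "Aletta", 3: ",", 4: "dank",
    5: "voor", 6: "je", 7: "brief", 8: ".",
}
after_feature = {1: " ", 2: "", 3: " ", 4: " ", 5: " ", 6: " ", 7: "", 8: ""}


def weave(first, last):
    """Rebuild the text of slots first..last from the two features."""
    return "".join(
        str_feature[n] + after_feature.get(n, "") for n in range(first, last + 1)
    )


text = weave(1, 8)
print(text)  # Beste Aletta, dank voor je brief.
```

The point of the sketch is that the text itself is stored nowhere as a string: it is reassembled from per-node values.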

We can get a stock overview of the warehouse of nodes and features as follows:

* **nodes** if you click on the triangle before **Node types** above,
  you'll see an inventory of node types that make up the corpus.
  They correspond to the elements in the original TEI.

* **features** if you click on the little triangle before **mondriaan - letters** above,
  you'll see a list of features with their descriptions:
  * you can see which features have been loaded;
  * if you click on a feature name, you find its documentation;
  * if you hover over a name, you see where the feature is located on your system;
  * edge features are marked by **_bold italic_** formatting.
  [`C.levels.data`](https://annotation.github.io/text-fabric/tf/cheatsheet.html#c-computed-data-components)

# Counting
We count all nodes, of any type.

In [6]:
A.indent(reset=True)
A.info("Counting nodes ...")

i = 0
for n in N.walk():
    i += 1

A.info("{} nodes".format(i))

  0.00s Counting nodes ...
  0.00s 16318 nodes
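
The same counting idea can be refined to a count per node type. This is a minimal sketch on invented data, not a call into Text-Fabric: the dictionary `otype` is a hypothetical stand-in for the mapping from node to type.

```python
import collections

# Hypothetical node -> type mapping; in the corpus this role is played
# by the otype feature. The six nodes here are invented.
otype = {1: "token", 2: "token", 3: "token", 4: "p", 5: "letter", 6: "folder"}

# Same idea as the explicit loop with a counter variable.
total = sum(1 for _ in otype)

# One pass over the types gives the count per node type.
per_type = collections.Counter(otype.values())

print(total)              # 6
print(per_type["token"])  # 3
```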


# Node types

What is the basic textual unit in this corpus?

In [7]:
F.otype.slotType

'token'

A quick way to list all node types:

In [8]:
F.otype.all

('folder',
 'letter',
 'body',
 'text',
 'chunk',
 'div',
 'teiHeader',
 'p',
 'postscript',
 'revisionDesc',
 'note',
 'fileDesc',
 'titleStmt',
 'msDesc',
 'sourceDesc',
 'msIdentifier',
 'correspDesc',
 'profileDesc',
 'title',
 'closer',
 'sentence',
 'address',
 'correspAction',
 'facsimile',
 'date',
 'change',
 'salute',
 'opener',
 'dateline',
 'physDesc',
 'institution',
 'name',
 'choice',
 'addrLine',
 'decoDesc',
 'editor',
 'sponsor',
 'signed',
 'placeName',
 'rs',
 'idno',
 'hi',
 'ref',
 'decoNote',
 'settlement',
 'accMat',
 'altIdentifier',
 'country',
 'reg',
 'surface',
 'del',
 'orig',
 'add',
 'c',
 'graphic',
 'objectDesc',
 'pb',
 'postmark',
 'ptr',
 'publicationStmt',
 'space',
 'unclear',
 'token')

# Word matters

We can only work with the surface forms of words, because there is no concept of lexeme in the corpus (yet).

## Top 50 frequent words

There is a simple function to get a frequency list of feature values.
Here we call it for the feature `str`, which contains the text for every word in the text:

In [10]:
for (word, amount) in F.str.freqList()[0:50]:
    print(f"{amount:>6} {word}")

   782 

   587 ,
   554 .
   225 de
   204 in
   202 ​
   157 van
   148 I
   148 the
   144  
   133 '
   131 :
   124 en
   112 to
   108 (
   105 )
   105 is
   103 you
   100 Mondriaan
    95 of
    93 that
    89 ik
    80 and
    80 dat
    78 een
    73 a
    72 je
    70 1909
    69 voltooid
    67 t
    63 De
    63 op
    63 zijn
    57 aan
    57 voor
    56 te
    55 ;
    55 brief
    54 -
    54 het
    54 niet
    50 *
    49 Piet
    46 met
    44 Iongh
    42 collatie
    41 me
    39 it
    38 for
    37 have
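
What a frequency list does can be sketched in a few lines of plain Python. The function `freq_list` below is a hypothetical stand-in, not the Text-Fabric implementation, and it assumes that ties are broken by sorting on the value itself. The tokens are invented.

```python
import collections


def freq_list(values):
    """Return (value, count) pairs, most frequent first; ties sorted by value."""
    counts = collections.Counter(values)
    return tuple(sorted(counts.items(), key=lambda x: (-x[1], x[0])))


# Invented tokens; note that "de" and "De" are different surface forms.
tokens = ["de", ",", "in", "de", ".", ",", "de", "De"]

top = freq_list(tokens)[0:3]
for (word, amount) in top:
    print(f"{amount:>6} {word}")
```

Because the values are compared as they are, `de` and `De` get separate entries, just as in the list above.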


# Words that are unique to a letter

Are there words that are unique to a letter?
And if so, which letter has the most of them?
That letter is the most idiosyncratic letter.

Task: list the letters in a table sorted by degree of idiosyncrasy, and show the
idiosyncrasy of each letter.

## Method

For each word, the support base is the set of letters in which the word occurs.
We take only distinct words into account when we count words.
We take the word forms as they occur: the code below does not fold case, and it skips tokens that are not purely alphabetic.

We exclude words that occur in the TEI header and in notes.

Let's compute the support base of all words.

We also need to count how many distinct words each letter contains.

And we also want to find out how many hapaxes there are, so we also make an
index for the occurrences of each word form.

Note that each file corresponds to a letter.
Note also, after reading the feature docs, that we have features `is_meta` and `is_note` that tell us whether a word
occurrence is in the TEI header or in a note.

In [11]:
wordOccs = collections.defaultdict(list)
wordsByLetter = collections.defaultdict(set)
supportBase = collections.defaultdict(set)

for letter in F.otype.s("letter"):
    for w in L.d(letter, otype="token"):
        if F.is_meta.v(w) or F.is_note.v(w) or F.empty.v(w):
            continue
        word = F.str.v(w)
        if not word or not word.isalpha():
            continue
            
        wordOccs[word].append(w)
        wordsByLetter[letter].add(word)
        supportBase[word].add(letter)
        
print(f"There are {len(wordOccs)} distinct words")

There are 1536 distinct words
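
To see what the three indexes contain, here is the same bookkeeping on an invented mini-corpus. This is a sketch in plain Python, without Text-Fabric: the letter ids and tokens are made up, and positions are `(letter, index)` pairs instead of node numbers.

```python
import collections

# Invented mini-corpus: letter id -> list of tokens (not data from the corpus).
letters = {
    "letter_a": ["ik", "schrijf", "een", "brief", "ik"],
    "letter_b": ["een", "brief", "van", "Piet"],
    "letter_c": ["ik", "zie", "Piet", "1909"],
}

word_occs = collections.defaultdict(list)       # word -> occurrences
words_by_letter = collections.defaultdict(set)  # letter -> distinct words
support_base = collections.defaultdict(set)     # word -> letters containing it

for (letter, tokens) in letters.items():
    for (i, word) in enumerate(tokens):
        if not word.isalpha():  # skip numbers and punctuation
            continue
        word_occs[word].append((letter, i))
        words_by_letter[letter].add(word)
        support_base[word].add(letter)

print(len(word_occs))              # 7
print(sorted(support_base["ik"]))  # ['letter_a', 'letter_c']
print(len(word_occs["ik"]))        # 3
```

The word `ik` has three occurrences but a support base of only two letters, which is why occurrences and support are tracked in separate indexes.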


We can find the hapaxes as follows:

In [12]:
hapaxes = {word for (word, occs) in wordOccs.items() if len(occs) == 1}

print(f"There are {len(hapaxes)} hapaxes")

There are 796 hapaxes


In the same way we can find the idiosyncratic words:

In [13]:
idiosyncraticWords = {word for (word, letters) in supportBase.items() if len(letters) == 1}

print(f"There are {len(idiosyncraticWords)} idiosyncratic words")

There are 1039 idiosyncratic words
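
Every hapax is idiosyncratic, since a single occurrence lies in a single letter, but not the other way round: a word repeated within one letter is idiosyncratic without being a hapax. That accounts for the difference between the two counts above (1039 - 796 = 243 words). A small sketch on an invented index shows the relation; the words and positions are made up.

```python
# Invented index: word -> list of (letter, position) occurrences.
word_occs = {
    "ik": [("a", 0), ("a", 4), ("c", 0)],
    "schrijf": [("a", 1)],
    "brief": [("a", 3), ("b", 1)],
    "atelier": [("b", 2), ("b", 7)],
}

# Derive the support base: the set of letters in which each word occurs.
support_base = {
    word: {letter for (letter, _) in occs} for (word, occs) in word_occs.items()
}

hapaxes = {w for (w, occs) in word_occs.items() if len(occs) == 1}
idiosyncratic = {w for (w, ls) in support_base.items() if len(ls) == 1}

print(sorted(hapaxes))           # ['schrijf']
print(sorted(idiosyncratic))     # ['atelier', 'schrijf']
print(hapaxes <= idiosyncratic)  # True
```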


Now we can make a table of the letters where for each letter we list the total
number of distinct words, the number of idiosyncratic words,
and the percentage of idiosyncratic words relative to the number of distinct words in that letter.

In [14]:
table = []

for letter in F.otype.s("letter"):
    letterId = F.letter.v(letter)
    words = wordsByLetter[letter]
    idio = {word for word in words if word in idiosyncraticWords}
    
    nWords = len(words)
    nIdio = len(idio)
    perc = int(round(100 * nIdio / nWords))
    
    table.append((letterId, nWords, nIdio, perc))
    
table[0:10]

[('19090216y_IONG_1303', 80, 13, 16),
 ('19090407y_IONG_1739', 133, 22, 17),
 ('19090421y_IONG_1304', 157, 36, 23),
 ('19090426y_IONG_1738', 132, 33, 25),
 ('19090513y_IONG_1293', 166, 42, 25),
 ('19090624_IONG_1294', 110, 16, 15),
 ('19090807y_IONG_1296', 157, 30, 19),
 ('19090824y_KNAP_1747', 50, 10, 20),
 ('19090905y_IONG_1295', 254, 64, 25),
 ('190909XX_QUER_1654', 750, 439, 59)]

We can make that prettier by rendering it in Markdown.
And we have to sort it on the percentage column.
And we add a grand total.

We do not show the letters that have fewer than 20% idiosyncratic words.

In [15]:
md = """
letter | #words | #idio | %perc
--- | --- | --- | ---
"""

totalNw = 0

for (letter, nw, ni, per) in sorted(table, key=lambda x: (-x[-1], -x[-2], x[1], x[0])):
    if per >= 20:
        md += f"""{letter} | {nw} | {ni} | {per}\n"""
    totalNw += nw
    
    
overall = int(round(100 * len(idiosyncraticWords) / len(wordOccs)))
overall2 = int(round(100 * len(idiosyncraticWords) / totalNw))
md += f"""**{len(table)}** letters | **{len(wordOccs)}** | **{len(idiosyncraticWords)}** | **{overall}**\n"""
md += f"""**{len(table)}** letters | **{totalNw}** | **{len(idiosyncraticWords)}** | **{overall2}**\n"""

A.dm(md)


letter | #words | #idio | %perc
--- | --- | --- | ---
190909XX_QUER_1654 | 750 | 439 | 59
19100131_SAAL_ARNO_0018 | 447 | 174 | 39
19091024y_IONG_1297 | 341 | 121 | 35
19090905y_IONG_1295 | 254 | 64 | 25
19090513y_IONG_1293 | 166 | 42 | 25
19090426y_IONG_1738 | 132 | 33 | 25
19090421y_IONG_1304 | 157 | 36 | 23
19091024_SPOO_0016 | 83 | 18 | 22
19090824y_KNAP_1747 | 50 | 10 | 20
**14** letters | **1536** | **1039** | **68**
**14** letters | **2981** | **1039** | **35**


It might seem strange that the overall idiosyncrasy in the first totals row (68%) is much bigger than the
idiosyncrasy of the individual letters.

This follows from the fact that if we sum the numbers of distinct words per letter (2981),
we end up with a much bigger number than the number of distinct words in the whole corpus (1536),
because words that occur in multiple letters are counted multiple times.

If we use the sum of the per-letter distinct words, as in the second totals row, the total idiosyncrasy (35%)
is the weighted average of the letter idiosyncrasies.
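
The weighted-average claim can be checked numerically. The sketch below uses only three rows taken from the table above, so the resulting percentage is not the corpus figure; the variable names are illustrative. It relies on the fact that each idiosyncratic word belongs to exactly one letter, so the per-letter idiosyncratic counts add up without double counting.

```python
# Three (distinct words, idiosyncratic words) rows taken from the table above.
rows = [(80, 13), (133, 22), (750, 439)]

total_words = sum(nw for (nw, _) in rows)
total_idio = sum(ni for (_, ni) in rows)

# Ratio of the sums ...
overall = total_idio / total_words

# ... equals the average of the per-letter ratios, weighted by letter size.
weighted = sum((nw / total_words) * (ni / nw) for (nw, ni) in rows)

print(round(100 * overall))             # 49
print(abs(overall - weighted) < 1e-12)  # True
```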

# `<rs>` elements

Some of these elements refer to artworks, for example those with `type=artwork-m`.

But what types do we have in the corpus?

In [16]:
F.type.freqList(nodeTypes={"rs"})

(('person', 113),
 ('artwork-m', 18),
 ('exhibition', 9),
 ('artistsassoc', 7),
 ('org', 7),
 ('journal', 5),
 ('museum', 5),
 ('firm', 3),
 ('photograph', 3),
 ('article', 2))

That's a rather clear answer.

## Photographs

Now we want to see all features that photographs can have.

In [17]:
allFeatures = Fall()

data = {}

for rs in F.otype.s("rs"):
    data[rs] = {}
    for feat in allFeatures:
        if feat == "otype":
            continue
        val = Fs(feat).v(rs)
        if val:
            data[rs][feat] = val


In [18]:
photoData = {
    rs: feats for (rs, feats) in data.items() if feats.get("type", None) == "photograph"
}

In [19]:
for (rs, feats) in photoData.items():
    A.plain(rs)
    for (feat, val) in sorted(feats.items()):
        print(f"\t{feat} = '{val}'")

	key = '217669'
	type = 'photograph'


	type = 'photograph'


	type = 'photograph'
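
The collect-then-filter pattern of the cells above can be tried out on plain dictionaries. This is a sketch under assumptions, not the TF API: `features` is a hypothetical structure mapping feature names to node values, and the node numbers are invented.

```python
# Hypothetical feature data: feature name -> {node: value}; not the TF API.
features = {
    "otype": {101: "rs", 102: "rs", 103: "rs"},
    "type": {101: "photograph", 102: "person", 103: "photograph"},
    "key": {101: "217669", 102: "spoor_kees"},
}

# Per node, keep every feature that has a non-empty value, except otype.
data = {}
for node in features["otype"]:
    data[node] = {
        feat: values[node]
        for (feat, values) in features.items()
        if feat != "otype" and values.get(node)
    }

# Filter afterwards on the value of the type feature.
photo_data = {
    n: feats for (n, feats) in data.items() if feats.get("type") == "photograph"
}

print(photo_data)
# {101: {'type': 'photograph', 'key': '217669'}, 103: {'type': 'photograph'}}
```

Collecting once and filtering afterwards means the same `data` can be reused for other types without walking the nodes again.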


## Artworks

Now the same for artworks.

In [20]:
artworkData = {
    rs: feats for (rs, feats) in data.items() if feats.get("type", None) == "artwork-m"
}

In [21]:
for (rs, feats) in artworkData.items():
    A.plain(rs)
    for (feat, val) in sorted(feats.items()):
        print(f"\t{feat} = '{val}'")

	type = 'artwork-m'


	key = '277201'
	type = 'artwork-m'


	key = '68554'
	type = 'artwork-m'


	key = '62319'
	type = 'artwork-m'


	key = '68733'
	type = 'artwork-m'


	key = '277201'
	type = 'artwork-m'


	key = '000000'
	type = 'artwork-m'


	key = '000000'
	type = 'artwork-m'


	key = '000000'
	type = 'artwork-m'


	type = 'artwork-m'


	type = 'artwork-m'


	type = 'artwork-m'


	key = '68728'
	type = 'artwork-m'


	type = 'artwork-m'


	type = 'artwork-m'


	type = 'artwork-m'


	key = '268864'
	type = 'artwork-m'


	key = '268864'
	type = 'artwork-m'


## Persons

How many *different* persons are referenced by `<rs>` elements?

We make a complete inventory of all `<rs type="person">` elements and their attributes and values.

In [22]:
allFeatures = Fall()

personData = {}

for rs in F.otype.s("rs"):
    if F.type.v(rs) != "person":
        continue
    for feat in allFeatures:
        val = Fs(feat).v(rs)
        if not val:
            continue
        personData.setdefault(feat, {}).setdefault(val, []).append(rs)

Let's see what we've got:

In [23]:
for (feat, fInfo) in sorted(personData.items()):
    print(f"{feat}:")
    for (val, rsNodes) in sorted(fInfo.items()):
        print(f"\t{val:<10} {len(rsNodes):>4} x")

key:
	besant_annie    1 x
	briel_albert_van_den    1 x
	buhlig_richard    4 x
	calcar_reinder_van    3 x
	fernhout_henk    5 x
	iongh_aletta_de   30 x
	iongh_anna_de    7 x
	iongh_anna_maria_de    1 x
	iongh_daniel_de    1 x
	iongh_de_frederika    3 x
	knap_gerrit_willem    5 x
	mondriaan_pieter_senior    1 x
	philippona_reinier    2 x
	philippona_reinier philippona_mien    2 x
	querido_israel    6 x
	saalborn_arnold    5 x
	sluijters_jan    1 x
	smeenk_josina    1 x
	spoor_kees    9 x
	teirlinck_herman    4 x
	toorop_charley    1 x
	waldenburg_alfred   17 x
	wisse_adriana    1 x
	wisse_ko      1 x
otype:
	rs          113 x
type:
	person      113 x


So, the identifying information is in the `key` feature,
let's see how many distinct values it has.

In [24]:
len(personData["key"])

24
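
The inventory relies on chained `setdefault` calls to build a nested index of feature, value, and nodes. A minimal sketch on invented `(node, key)` pairs shows how the structure grows; the node numbers are made up.

```python
# Invented (node, key) pairs standing in for <rs type="person"> elements.
rs_nodes = [
    (201, "spoor_kees"),
    (202, "querido_israel"),
    (203, "spoor_kees"),
    (204, None),  # an element without a key
]

person_data = {}
for (node, key) in rs_nodes:
    if not key:
        continue
    # Create the inner dict and the list on first use, then append.
    person_data.setdefault("key", {}).setdefault(key, []).append(node)

print(person_data)
# {'key': {'spoor_kees': [201, 203], 'querido_israel': [202]}}
print(len(person_data["key"]))  # 2
```

The length of the inner dictionary is the number of distinct values, which is what the cell above reports for the real data.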

Some persons have a numeric `key` attribute.

At least one of them identifies the person on the RKD site:

`90099` goes to [Jan Schüller](https://rkd.nl/nl/explore/artists/record?query=90099&start=0).

## Everything with a numeric `key` feature

So the question arises: which elements have a key feature with numeric content?

We make an inventory.

In [25]:
keyNodes = {}

for n in N.walk():
    key = F.key.v(n)
    if key is None or not key.isdecimal():
        continue
    nType = F.otype.v(n)
    keyNodes.setdefault(nType, []).append(n)

In [26]:
for (nType, nodes) in keyNodes.items():
    print(f"{nType:<10} {len(nodes):>4} nodes")

rs           12 nodes


We refine the exploration: which `type`s have those `rs` nodes?

In [27]:
rsTypes = collections.Counter()

for n in keyNodes["rs"]:
    typ = F.type.v(n)
    rsTypes[typ] += 1

In [28]:
rsTypes

Counter({'photograph': 1, 'artwork-m': 11})
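
The numeric test rests on `str.isdecimal()`, which is true only for non-empty strings that consist entirely of decimal digits. The sketch below applies it to invented nodes, given as `(type, key)` pairs, and counts the types in one step.

```python
import collections

# Invented nodes with a type and an optional key, mimicking the cells above.
nodes = {
    1: ("artwork-m", "277201"),
    2: ("artwork-m", "000000"),
    3: ("person", "spoor_kees"),
    4: ("photograph", "217669"),
    5: ("artwork-m", None),
}

# Count the types of the nodes whose key is present and all digits.
types_with_numeric_key = collections.Counter(
    typ for (typ, key) in nodes.values() if key is not None and key.isdecimal()
)

print(types_with_numeric_key)
```

Note that a value such as `000000` also passes the test, so a numeric key is not necessarily a meaningful identifier.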

---

# Next steps

By now you have an impression of how to orient yourself in this corpus.
The next steps will show you how to become more powerful: searching and computing.

After that it is time to collect results, use them in new annotations, and share them.

* **[start](start.ipynb)** intro and highlights
* **explore** explore the corpus by coding away in it
* **[search](search.ipynb)** turbo charge your hand-coding with search templates
* **[exportExcel](exportExcel.ipynb)** make tailor-made spreadsheets out of your results

CC-BY Dirk Roorda