<img align="right" src="images/tf.png" width="128"/>
<img align="right" src="images/dans.png"/>
<img align="right" src="images/huygenslogo.png"/>
<img align="right" src="images/logo.png"/>

# Tutorial

This notebook gets you started with using
[Text-Fabric](https://annotation.github.io/text-fabric/) for coding in the
[General Missives Corpus](https://github.com/Dans-labs/clariah-gm).

Familiarity with the underlying
[data model](https://annotation.github.io/text-fabric/about/datamodel.html)
is recommended.

## Installing Text-Fabric

### Python

You need Python on your system. Many systems ship with it out of the box,
but that is often Python 2, and we need at least Python **3.6**.

Install it from [python.org](https://www.python.org) or from
[Anaconda](https://www.anaconda.com/download).
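
A quick way to confirm that the interpreter you are running meets this minimum is sketched below. The helper name `python_ok` is made up for this illustration; only `sys.version_info` comes from the standard library.

```python
import sys


def python_ok(version_info=sys.version_info, minimum=(3, 6)):
    """Return True if the version is at least `minimum` (major, minor).

    `python_ok` is a hypothetical helper, not part of Text-Fabric.
    """
    return tuple(version_info[:2]) >= minimum


# Report on the interpreter that runs this cell.
print(python_ok())
```

If this prints `False`, install a newer Python before installing Text-Fabric.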

### TF itself

```
pip3 install text-fabric
```

### Jupyter notebook

You need [Jupyter](http://jupyter.org).

If it is not already installed:

```
pip3 install jupyter
```

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os, collections

In [3]:
from tf.app import use

## Corpus data

Text-Fabric will fetch the General Missives corpus for you.

It will fetch the newest version by default, but you can get other versions as well.

The data will be stored in the `text-fabric-data` directory in your home directory.
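
If you want to see whether that directory is already present, a small check like the following may help. It is only a sketch: the helper `tf_data_dir` is hypothetical, and the layout of subdirectories inside `text-fabric-data` depends on your Text-Fabric version and is not assumed here.

```python
import os


def tf_data_dir(home=None):
    """Return the path of text-fabric-data under the given home directory.

    `tf_data_dir` is a hypothetical helper for this illustration.
    When `home` is None, the current user's home directory is used.
    """
    base = os.path.expanduser("~") if home is None else home
    return os.path.join(base, "text-fabric-data")


path = tf_data_dir()
print(path, "exists" if os.path.isdir(path) else "does not exist yet")
```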


# Features
The data of the corpus is organized in features.
They are *columns* of data.
Think of the text as a gigantic spreadsheet, where row 1 corresponds to the
first word, row 2 to the second word, and so on, for all of the more than three million words in this corpus.

Each piece of information about the words, including the text of the words themselves, constitutes a column in that spreadsheet.

Instead of putting that information in one big table, the data is organized in separate columns.
We call those columns **features**.
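
The spreadsheet picture can be made concrete with a toy model in plain Python. This is only an illustration of the idea of separate columns: the dictionaries and sample values below are invented, and this is not how Text-Fabric stores features internally.

```python
# Each feature is its own column: a mapping from node number to value.
# The values are invented sample data, not taken from the corpus.
trans = {1: "de", 2: "heer", 3: "generaal"}
otype = {1: "word", 2: "word", 3: "word", 4: "line"}

features = {"trans": trans, "otype": otype}


def row(node):
    """Collect the values of all features for one node (one spreadsheet row)."""
    return {name: column.get(node) for (name, column) in features.items()}


print(row(2))
print(row(4))  # node 4 has no value in the trans column
```

A node that lacks a value in some column simply yields `None` there, which is one reason why separate columns are convenient: a feature only needs to mention the nodes it says something about.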

# Incantation

The simplest way to get going is the *incantation* in the next cell, which loads the corpus and comes in several variants.

For the very latest version (the newest commit), use `hot`.

For the latest release, use `latest`.

If you have cloned the repos (TF app and data), use `clone`.

If you do not want/need to upgrade, leave out the checkout specifiers.

In [4]:
# A = use('missieven:clone', checkout="clone", hoist=globals())
# A = use('missieven:hot', checkout="hot", hoist=globals())
A = use('missieven:latest', checkout="latest", hoist=globals())
# A = use('missieven', hoist=globals())

rate limit is 5000 requests per hour, with 4103 left for this hour
	connecting to online GitHub repo annotation/app-missieven ... connected


rate limit is 5000 requests per hour, with 4100 left for this hour
	connecting to online GitHub repo Dans-labs/clariah-gm ... connected


You can see which features have been loaded, and if you click on a feature name, you find its documentation.
If you hover over a name, you see where the feature is located on your system.

Edge features are marked by **_bold italic_** formatting.

# Counting

In [5]:
A.indent(reset=True)
A.info("Counting nodes ...")

i = 0
for n in N.walk():
    i += 1

A.info("{} nodes".format(i))

  0.00s Counting nodes ...
  0.53s 3598532 nodes


# Node types

In [6]:
F.otype.slotType

'word'

In [7]:
F.otype.all

('volume',
 'letter',
 'page',
 'table',
 'para',
 'head',
 'line',
 'row',
 'cell',
 'word')

In [8]:
C.levels.data

(('volume', 250271.92307692306, 3598520, 3598532),
 ('letter', 5862.225225225226, 3271328, 3271882),
 ('page', 324.0572709163347, 3548787, 3558826),
 ('table', 104.81904761904762, 3598205, 3598519),
 ('para', 91.78756758303058, 3558827, 3593783),
 ('head', 27.83963963963964, 3270773, 3271327),
 ('line', 11.74968581168925, 3271883, 3548786),
 ('row', 7.468446052929202, 3593784, 3598204),
 ('cell', 1.9155305447583686, 3253536, 3270772),
 ('word', 1, 1, 3253535))

The second column is the average size (in words) of the node type mentioned in the first column.

The third and fourth columns are the node numbers of the first and the last node of that kind.

In [12]:
for (typ, av, start, end) in C.levels.data:
    print(
        f"{end - start + 1:>7} x {typ:<7} having an average size of {int(round(av)):>6} words"
    )

     13 x volume  having an average size of 250272 words
    555 x letter  having an average size of   5862 words
  10040 x page    having an average size of    324 words
    315 x table   having an average size of    105 words
  34957 x para    having an average size of     92 words
    555 x head    having an average size of     28 words
 276904 x line    having an average size of     12 words
   4421 x row     having an average size of      7 words
  17237 x cell    having an average size of      2 words
3253535 x word    having an average size of      1 words
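
As a sanity check on these figures, one can multiply the number of nodes of a type by their average size. The sketch below does this for three of the types, with the numbers copied from the `C.levels.data` output above; the helper `words_covered` is made up for this illustration. If the product equals the total number of words, and assuming that nodes of one type do not overlap, the nodes of that type together cover every word.

```python
# Figures copied from the C.levels.data output above:
# (type, average size in words, first node, last node)
levels = (
    ("volume", 250271.92307692306, 3598520, 3598532),
    ("letter", 5862.225225225226, 3271328, 3271882),
    ("page", 324.0572709163347, 3548787, 3558826),
)
total_words = 3253535  # last node of type word, taken from the same output


def words_covered(av, start, end):
    """Number of nodes times average size, rounded to whole words."""
    count = end - start + 1
    return round(count * av)


coverage = {typ: words_covered(av, start, end) for (typ, av, start, end) in levels}

for (typ, covered) in coverage.items():
    print(f"{typ:<7} accounts for {covered} of {total_words} words")
```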


# Feature statistics

There are no linguistic features (yet).

# Word matters

We can only work with the surface forms of words; there is no concept of lexeme in the corpus (yet).

## Top 20 frequent words

In [16]:
for (w, amount) in F.trans.freqList("word")[0:20]:
    print(f"{amount:>6} {w}")

120061 de
106193 van
 78001 en
 76343 te
 52565 den
 48925 in
 39900 dat
 38242 het
 31398 een
 30844 met
 29154 ende
 27798 op
 27463 die
 24819 tot
 23044 niet
 20490 ’t
 20165 om
 19219 als
 18916 door
 17420 voor


## Hapaxes

In [17]:
hapaxes1 = sorted(w for (w, amount) in F.trans.freqList('word') if amount == 1)
len(hapaxes1)

118732

In [18]:
for lx in hapaxes1[0:20]:
    print(lx)

!>
"van
"voor
$0
$2
$6,
$n
$•
%,
'))
');
')>
')off
'/4
'4
'8?
'Arabieren
'CD,
'Dit
'E


Here we suffer from the fact that words and punctuation have not yet been separated.
That will be done in a future version of the TF data.
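
Until then, a rough workaround is to strip punctuation from both ends of each token before counting. The sketch below is an assumption about what such a cleanup could look like, not the method the data will eventually use; `PUNCT`, `strip_punct` and `hapaxes` are names invented here, and the sample tokens are taken from the hapax list above. Note that this also damages forms such as `’t`, which would be reduced to `t`.

```python
import collections
import string

# ASCII punctuation plus a few typographic characters seen in the output above.
PUNCT = string.punctuation + "’‘“”•"


def strip_punct(token):
    """Remove punctuation characters from both ends of a token."""
    return token.strip(PUNCT)


def hapaxes(tokens):
    """Sorted list of cleaned tokens that occur exactly once.

    Tokens that consist of punctuation only are dropped.
    """
    cleaned = (strip_punct(t) for t in tokens)
    counts = collections.Counter(t for t in cleaned if t)
    return sorted(w for (w, n) in counts.items() if n == 1)


sample = ['"van', "van", "'Dit", "');", "voor"]
print(hapaxes(sample))
```

In the sample, `"van` is no longer a hapax once the quote is stripped, because it merges with the plain `van`.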

### Small occurrence base

The occurrence base of a word is the set of letters in which the word occurs.

In [27]:
occurrenceBase = collections.defaultdict(set)

A.indent(reset=True)
A.info("compiling occurrence base ...")
for s in F.otype.s("letter"):
    title = F.title.v(s)
    for w in L.d(s, otype="word"):
        occurrenceBase[F.trans.v(w)].add(title)
A.info("done")
A.info(f"{len(occurrenceBase)} entries")

  0.00s compiling occurrence base ...
  2.63s done
  2.63s 198742 entries


An overview of how many words have an occurrence base of each size:

In [28]:
occurrenceSize = collections.Counter()

for (w, letters) in occurrenceBase.items():
    occurrenceSize[len(letters)] += 1

occurrenceSize = sorted(
    occurrenceSize.items(),
    key=lambda x: (-x[1], x[0]),
)

for (size, amount) in occurrenceSize[0:10]:
    print(f"letters {size:>4} : {amount:>6} words")
print("...")
for (size, amount) in occurrenceSize[-10:]:
    print(f"letters {size:>4} : {amount:>6} words")

letters    1 : 123054 words
letters    2 :  24236 words
letters    3 :  11450 words
letters    4 :   7114 words
letters    5 :   4809 words
letters    6 :   3469 words
letters    7 :   2678 words
letters    8 :   2113 words
letters    9 :   1724 words
letters   10 :   1455 words
...
letters  447 :      1 words
letters  449 :      1 words
letters  450 :      1 words
letters  455 :      1 words
letters  460 :      1 words
letters  463 :      1 words
letters  471 :      1 words
letters  476 :      1 words
letters  479 :      1 words
letters  493 :      1 words


Let's give the predicate *private* to those words whose occurrence base is a single letter.

In [29]:
privates = {w for (w, base) in occurrenceBase.items() if len(base) == 1}
len(privates)

123054

### Peculiarity of letters

As a final exercise with letters, let's make a list of all letters, and show for each letter

* the number of distinct word forms
* the number of private word forms
* the percentage of private word forms: a measure of the peculiarity of the letter
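
To see what these three numbers mean before running the computation on the real corpus, here is a toy version in plain Python. The three "letters" and their words are invented for the illustration, and the function name `peculiarity` is hypothetical.

```python
import collections

# Invented miniature corpus: letter title -> words in that letter.
letters = {
    "letter A": ["de", "schepen", "van", "Batavia"],
    "letter B": ["de", "lading", "van", "peper"],
    "letter C": ["de", "van"],
}

# Occurrence base: for each word, the set of letters it occurs in.
occurrence_base = collections.defaultdict(set)
for (title, words) in letters.items():
    for w in words:
        occurrence_base[w].add(title)

# Private words occur in exactly one letter.
privates = {w for (w, base) in occurrence_base.items() if len(base) == 1}


def peculiarity(words):
    """Return (#distinct words, #private words, percentage private)."""
    distinct = set(words)
    own = distinct & privates
    return (len(distinct), len(own), 100 * len(own) / len(distinct))


stats = {title: peculiarity(words) for (title, words) in letters.items()}
for (title, (a, o, p)) in stats.items():
    print(f"{title:<10} {a:>4} {o:>4} {p:>5.1f}%")
```

Here `letter C` has no private words at all; the real computation sets such letters apart as "ordinary" instead of listing them with 0%.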

In [31]:
letterList = []

empty = set()
ordinary = set()

for d in F.otype.s("letter"):
    letter = F.title.v(d)
    if len(letter) > 50:
        letter = f"{letter[0:22]} .. {letter[-22:]}"
    words = {F.trans.v(w) for w in L.d(d, otype="word")}
    a = len(words)
    if not a:
        empty.add(letter)
        continue
    o = len({w for w in words if w in privates})
    if not o:
        ordinary.add(letter)
        continue
    p = 100 * o / a
    letterList.append((letter, a, o, p))

letterList = sorted(letterList, key=lambda e: (-e[3], -e[1], e[0]))

print(f"Found {len(empty):>4} empty letters")
print(f"Found {len(ordinary):>4} ordinary letters (i.e. without private words)")

Found    0 empty letters
Found   22 ordinary letters (i.e. without private words)


In [32]:
print(
    "{:<50}{:>5}{:>5}{:>5}\n{}".format(
        "letter",
        "#all",
        "#own",
        "%own",
        "-" * 35,
    )
)

for x in letterList[0:20]:
    print("{:<50} {:>4} {:>4} {:>4.1f}%".format(*x))
print("...")
for x in letterList[-20:]:
    print("{:<50} {:>4} {:>4} {:>4.1f}%".format(*x))

letter                                             #all #own %own
-----------------------------------
Both; zonder plaats, zonder datum                     7    3 42.9%
Coen, Sonck; Schip Nie .. nda-Neira , 6 mei 1621     24    9 37.5%
Bijlage ladinglijst va .. nuari en 25 april 1761    734  239 32.6%
Valckenier, Schaghen,  .. avia, 12 december 1739   2114  627 29.7%
Van Imhoff, Cluysenaar .. via, 29 september 1750    709  178 25.1%
Van Diemen; in het Sch .. an Afrika, 5 juni 1631     20    5 25.0%
Reynst; Bantam, 26 oktober 1615                     892  194 21.7%
Mossel, Van der Waeyen .. avia, 31 december 1760   5980 1284 21.5%
Durven, Hasselaar, Blo .. tavia, 17 oktober 1730    131   26 19.8%
Durven, Hasselaar, Blo .. tavia, 17 oktober 1730   2159  425 19.7%
Maetsuycker, Verburch, .. via, 25 september 1675     22    4 18.2%
Reniers, Maetsuycker,  .. avia, 24 december 1652   6209 1097 17.7%
Reael; Kasteel Mauriti .. kéan, 20 augustus 1618   1415  249 17.6%
Maetsuycker, Van Goens .. a

# Next steps

By now you have an impression of how to compute with the General Missives corpus.
While this is still the beginning, I hope you already sense the power of unlimited programmatic access
to all the bits and bytes in the data set.

Here are a few directions for unleashing that power.

**(in progress, not all of the tutorials below exist already!)**

* **[display](display.ipynb)** become an expert in creating pretty displays of your text structures
* **[search](search.ipynb)** turbo charge your hand-coding with search templates
* **[exportExcel](exportExcel.ipynb)** make tailor-made spreadsheets out of your results
* **[share](share.ipynb)** draw in other people's data and let them use yours

CC-BY Dirk Roorda