<img align="right" src="images/tf.png" width="128"/>
<img align="right" src="images/dans.png"/>
<img align="right" src="images/logo.png"/>

# Tutorial

This notebook gets you started with using
[Text-Fabric](https://annotation.github.io/text-fabric/) for coding in the Athenaeus corpus.

Reading a bit about the underlying
[data model](https://annotation.github.io/text-fabric/Model/Data-Model/)
will probably help you follow the exercises below, and vice versa.

## Installing Text-Fabric

### Python

You need to have Python on your system. Most systems have it out of the box,
but alas, that is often Python 2 and we need at least Python **3.6**.

Install it from [python.org](https://www.python.org) or from
[Anaconda](https://www.anaconda.com/download).

### TF itself

```
pip3 install text-fabric
```

### Jupyter notebook

You need [Jupyter](http://jupyter.org).

If it is not already installed:

```
pip3 install jupyter
```

## Tip
If you start computing with this tutorial, first copy its parent directory to somewhere else,
outside your `athenaeus` directory.
If you pull changes from the `athenaeus` repository later, your work will not be overwritten.
Where you put your tutorial directory is up to you.
It will work from any directory.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os, collections

In [3]:
from tf.app import use

## Corpus data

Text-Fabric will fetch the Athenaeus corpus for you.

It will fetch the newest version by default, but you can get other versions as well.

The data will be stored in the `text-fabric-data` directory in your home directory.
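To see what has been downloaded so far, you can list that directory. This is a minimal sketch using only the standard library; it assumes the default location `~/text-fabric-data`, which matches the path shown in the download log in this notebook. The variable name `data_dir` is just an illustration.

```python
import os

# Assumption: Text-Fabric uses its default download location, ~/text-fabric-data.
data_dir = os.path.expanduser("~/text-fabric-data")

if os.path.isdir(data_dir):
    # Each entry is typically an organisation or user name on GitHub.
    for entry in sorted(os.listdir(data_dir)):
        print(entry)
else:
    print(f"{data_dir} does not exist yet")
```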


# Features
The data of the corpus is organized in features.
They are *columns* of data.
Think of the text as a gigantic spreadsheet, where row 1 corresponds to the
first word, row 2 to the second word, and so on, for all 265,168 words.

Each piece of information about the words, including the text of the words, constitutes a column in that spreadsheet.

Instead of putting that information in one big table, the data is organized in separate columns.
We call those columns **features**.
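The spreadsheet picture can be mimicked in plain Python. The sketch below is not how Text-Fabric stores its data internally; it only illustrates the idea of one mapping per column. The feature name `surface`, the helper `value` and all values are made up for the example.

```python
# Toy data, invented for illustration: three word nodes, numbered from 1.
# Each feature is a separate column: a mapping from node number to value.
features = {
    "surface": {1: "καὶ", 2: "φησὶ", 3: "καὶ"},
    "lemma": {1: "καί", 2: "φημί"},  # node 3 has no value in this column
}


def value(feature_name, node):
    """Look up one cell: the value of a feature (column) for a node (row)."""
    return features[feature_name].get(node)


# Reading along a row means asking every column about the same node.
row_2 = {name: column.get(2) for (name, column) in features.items()}
print(row_2)
print(value("lemma", 3))  # a missing cell yields None
```

Keeping columns separate means a new feature can be added without touching the existing ones, and a column may be sparse.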

In [6]:
import sys, os, collections

# Incantation

The simplest way to get going is by this *incantation*:

In [7]:
from tf.app import use

In [11]:
A = use('athenaeus', hoist=globals())

	connecting to online GitHub repo annotation/app-athenaeus ... connected
	no releases
	no releases
	code/__init__.py...downloaded
	code/app.py...downloaded
	code/config.py...downloaded
	code/static...directory
		code/static/display.css...downloaded
		code/static/logo.png...downloaded
	OK
Using TF-app in /Users/dirk/text-fabric-data/annotation/app-athenaeus/code:
	#a3dc4f93c3fe3b330a1150756db5e434a81b0894 (latest commit)
	connecting to online GitHub repo pthu/athenaeus ... connected
	no releases
	no releases
	Athenaeus/Deipnosophistae/tf/1.0/_book.tf...downloaded
	Athenaeus/Deipnosophistae/tf/1.0/_sentence.tf...downloaded
	Athenaeus/Deipnosophistae/tf/1.0/add.tf...downloaded
	Athenaeus/Deipnosophistae/tf/1.0/beta_plain.tf...downloaded
	Athenaeus/Deipnosophistae/tf/1.0/bibl.tf...downloaded
	Athenaeus/Deipnosophistae/tf/1.0/book.tf...downloaded
	Athenaeus/Deipnosophistae/tf/1.0/chapter.tf...downloaded
	Athenaeus/Deipnosophistae/tf/1.0/cit.tf...downloaded
	Athenaeus/Deipnosophistae/tf/1.0/

You can see which features have been loaded, and if you click on a feature name, you find its documentation.
If you hover over a name, you see where the feature is located on your system.

Edge features are marked by **_bold italic_** formatting.

There are ways to tweak the set of features that is loaded. You can load more features or fewer.

See [share](share.ipynb) for examples.

In [13]:
silentOff()

# Counting

In [14]:
indent(reset=True)
info('Counting nodes ...')

i = 0
for n in N(): i += 1

info('{} nodes'.format(i))

  0.00s Counting nodes ...
  0.05s 303385 nodes


# Node types

In [15]:
F.otype.slotType

'word'

In [16]:
F.otype.all

('_book',
 'head',
 'book',
 'hi',
 'cit',
 'num',
 'add',
 'chapter',
 'pb',
 'p',
 'quote',
 'bibl',
 'l',
 '_sentence',
 'word')

In [17]:
C.levels.data

(('_book', 265168.0, 265169, 265169),
 ('head', 265168.0, 285519, 285519),
 ('book', 17677.866666666665, 284009, 284023),
 ('hi', 3115.2051282051284, 285520, 285597),
 ('cit', 1586.1497005988024, 285352, 285518),
 ('num', 949.8545454545455, 296865, 297139),
 ('add', 335.96704689480356, 277956, 278744),
 ('chapter', 199.67469879518072, 284024, 285351),
 ('pb', 171.19883795997418, 298711, 300259),
 ('p', 168.7982176957352, 297140, 298710),
 ('quote', 84.74344209852848, 300260, 303385),
 ('bibl', 51.090425531914896, 278745, 284008),
 ('l', 23.501641963255526, 285598, 296864),
 ('_sentence', 20.743547630220554, 265170, 277955),
 ('word', 1, 1, 265168))

In [18]:
for (typ, av, start, end) in C.levels.data:
  print(f'{end - start + 1:>7} {typ}s')

      1 _books
      1 heads
     15 books
     78 his
    167 cits
    275 nums
    789 adds
   1328 chapters
   1549 pbs
   1571 ps
   3126 quotes
   5264 bibls
  11267 ls
  12786 _sentences
 265168 words


# Feature statistics

There are no linguistic features, as far as I can see, but there is `lemma`.

# Word matters

## Top 20 frequent words

In [29]:
for (w, amount) in F.lemma.freqList('word')[0:20]:
  print(f'{amount:>5} {w}')

24736 ὁ,ὅς
12995 καί
11725 δέ
 5711 εἰς,εἰμί,ἐν
 3139 φημί
 2917 οὗτος
 2696 αὐτός
 2457 τίς,ὁ,τις,ὅς
 2408 εἰμί
 2056 οὐ
 2007 γάρ
 1934 τεός,σύ,τις,τε
 1932 τίς,ὁ,τῷ,ὅς,τις
 1856 τίς,ὁ,ὅς
 1828 μέν
 1807 ὡς,ὅς
 1742 περί
 1390 ἐπί
 1322 λέγω,λέγω1
 1312 εἰς,εἰμί,εἶμι


## Hapaxes

In [20]:
hapaxes1 = sorted(lx for (lx, amount) in F.lemma.freqList('word') if amount == 1)
len(hapaxes1)

11204

In [21]:
for lx in hapaxes1[0:20]:
  print(lx)

*isgreek
*p
*ʼαγκυλητους
*ʼαδεσθαι
*ʼαδυφωνον
*ʼακουσομεθα
*ʼαναξαρχον
*ʼανδρομαχον
*ʼανθος
*ʼαντιφωντος
*ʼαπο
*ʼαποδιδωσι
*ʼαρπασθηναι
*ʼαφι
*ʼβρενθιν
*ʼβυζαντιους
*ʼγενη
*ʼγλαυκου
*ʼγραφει
*ʼγραφων


### Small occurrence base

The occurrence base of a word is the set of books in which the word occurs.
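The definition can be tried out on a toy corpus first. The sketch below uses invented book names and lemma lists, and the names `toy_books`, `occurrence_base` and `single_book` are hypothetical; it shows the same set-per-lemma bookkeeping without needing Text-Fabric.

```python
import collections

# Toy corpus, invented for illustration: book name -> lemmas occurring in it.
toy_books = {
    "1": ["καί", "φημί", "δέ"],
    "2": ["καί", "οὗτος"],
    "3": ["καί", "δέ"],
}

# For every lemma, collect the set of books it occurs in.
occurrence_base = collections.defaultdict(set)
for (book, lemmas) in toy_books.items():
    for lemma in lemmas:
        occurrence_base[lemma].add(book)

# Lemmas confined to a single book have an occurrence base of size 1.
single_book = {lemma for (lemma, base) in occurrence_base.items() if len(base) == 1}

for (lemma, base) in occurrence_base.items():
    print(lemma, sorted(base))
```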

In [22]:
occurrenceBase = collections.defaultdict(set)

indent(reset=True)
info('compiling occurrence base ...')
for s in F.otype.s('book'):
  book = F.book.v(s)
  for w in L.d(s, otype='word'):
    occurrenceBase[F.lemma.v(w)].add(book)
info('done')
info(f'{len(occurrenceBase)} entries')

  0.00s compiling occurrence base ...
  0.21s done
  0.21s 23441 entries


An overview of how many words have an occurrence base of each size:

In [23]:
occurrenceSize = collections.Counter()

for (w, books) in occurrenceBase.items():
  occurrenceSize[len(books)] += 1
  
occurrenceSize = sorted(
  occurrenceSize.items(),
  key=lambda x: (-x[1], x[0]),
)

for (size, amount) in occurrenceSize[0:10]:
  print(f'books {size:>4} : {amount:>5} words')
print('...')
for (size, amount) in occurrenceSize[-10:]:
  print(f'books {size:>4} : {amount:>5} words')

books    1 : 12909 words
books    2 :  3621 words
books    3 :  1881 words
books    4 :  1179 words
books    5 :   779 words
books    6 :   572 words
books    7 :   437 words
books    8 :   373 words
books   15 :   345 words
books   10 :   296 words
...
books    6 :   572 words
books    7 :   437 words
books    8 :   373 words
books   15 :   345 words
books   10 :   296 words
books    9 :   284 words
books   11 :   216 words
books   12 :   206 words
books   13 :   172 words
books   14 :   171 words


Let's give the predicate *private* to those words whose occurrence base is a single book.

In [24]:
privates = {w for (w, base) in occurrenceBase.items() if len(base) == 1}
len(privates)

12909

### Peculiarity of books

As a final exercise with books, let's make a list of all books, and show their

* total number of distinct words (lemmas)
* number of private words
* the percentage of private words: a measure of the peculiarity of the book

In [25]:
bookList = []

empty = set()
ordinary = set()

for d in F.otype.s('book'):
  book = F.book.v(d)
  words = {F.lemma.v(w) for w in L.d(d, otype='word')}
  a = len(words)
  if not a:
    empty.add(book)
    continue
  o = len({w for w in words if w in privates})
  if not o:
    ordinary.add(book)
    continue
  p = 100 * o / a
  bookList.append((book, a, o, p))

bookList = sorted(bookList, key=lambda e: (-e[3], -e[1], e[0]))

print(f'Found {len(empty):>4} empty books')
print(f'Found {len(ordinary):>4} ordinary books (i.e. without private words)')

Found    0 empty books
Found    0 ordinary books (i.e. without private words)


In [26]:
print('{:<20}{:>5}{:>5}{:>5}\n{}'.format(
    'book', '#all', '#own', '%own',
    '-'*35,
))

for x in bookList[0:20]:
  print('{:<20} {:>4} {:>4} {:>4.1f}%'.format(*x))
print('...')
for x in bookList[-20:]:
  print('{:<20} {:>4} {:>4} {:>4.1f}%'.format(*x))

book                 #all #own %own
-----------------------------------
3                    4707 1099 23.3%
14                   4765 1092 22.9%
15                   3812  843 22.1%
13                   4935 1077 21.8%
11                   4580  998 21.8%
7                    4447  952 21.4%
5                    4063  831 20.5%
4                    4769  955 20.0%
2                    3777  741 19.6%
9                    4100  797 19.4%
1                    3704  693 18.7%
6                    4443  824 18.5%
10                   4319  751 17.4%
12                   4156  712 17.1%
8                    3472  544 15.7%
...
3                    4707 1099 23.3%
14                   4765 1092 22.9%
15                   3812  843 22.1%
13                   4935 1077 21.8%
11                   4580  998 21.8%
7                    4447  952 21.4%
5                    4063  831 20.5%
4                    4769  955 20.0%
2                    3777  741 19.6%
9                    4100  797 19.4%
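The percentage column can be checked by hand: it is the number of private words divided by the number of distinct words in the book, times 100. The sketch below recomputes it for the first and last rows of the table above; the function name `peculiarity` is only used here for illustration.

```python
def peculiarity(n_private, n_all):
    """Percentage of a book's distinct lemmas that occur in no other book."""
    return 100 * n_private / n_all


# (book, #all, #own) copied from the table above.
rows = [("3", 4707, 1099), ("8", 3472, 544)]

for (book, n_all, n_own) in rows:
    print(f"{book:<4} {peculiarity(n_own, n_all):>4.1f}%")
```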

# Next steps

By now you have an impression of how to compute with the Athenaeus corpus.
While this is still the beginning, I hope you already sense the power of unlimited programmatic access
to all the bits and bytes in the data set.

Here are a few directions for unleashing that power.

**(in progress, not all of the tutorials below exist yet!)**

* **[display](display.ipynb)** become an expert in creating pretty displays of your text structures
* **[search](search.ipynb)** turbo charge your hand-coding with search templates
* **[exportExcel](exportExcel.ipynb)** make tailor-made spreadsheets out of your results
* **[share](share.ipynb)** draw in other people's data and let them use yours