# Loading Athenaeus

Without a TF-app, we use the more
[primitive way](https://annotation.github.io/text-fabric/Api/Fabric/)
to load a TF data set.

Text-Fabric will not go online; it only looks on the local computer, in the locations we specify.

In [24]:
import os
import collections
from tf.fabric import Fabric

## Location

We specify where on our system the data can be found.

Most TF users have a directory `github` in their home directory.
That directory is organized as in GitHub itself: first by org/user, then by repo.

This way, we can borrow each other's code without modification.

In [2]:
BASE = os.path.expanduser('~/github')
ORG = 'pthu'
REPO = 'patristics'
LOCATION1 = 'tf/1.1/athenaeus/Athenaeus/'
LOCATION2 = 'Deipnosophistae'

The split of the relative path in `LOCATION1` and `LOCATION2` is arbitrary.
In some scenarios it is handy to split a data set into a base set and a collection of modules.
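As a small illustration of how these parts combine into one directory, the sketch below joins them and normalizes the result. The helper name `dataDir` is hypothetical and not part of Text-Fabric. Because `LOCATION1` ends in a slash, plain string concatenation yields a doubled separator (visible as `//` in the load log further down); that is harmless, but `os.path.normpath` collapses it.

```python
import os

# Values copied from the cell above.
BASE = os.path.expanduser('~/github')
ORG = 'pthu'
REPO = 'patristics'
LOCATION1 = 'tf/1.1/athenaeus/Athenaeus/'
LOCATION2 = 'Deipnosophistae'


def dataDir(base, org, repo, location, module):
    """Hypothetical helper: join the parts and collapse doubled separators."""
    return os.path.normpath(os.path.join(base, org, repo, location, module))


fullPath = dataDir(BASE, ORG, REPO, LOCATION1, LOCATION2)
print(fullPath)
# Check that the directory exists before handing it to TF.
print(os.path.isdir(fullPath))
```

Checking the directory first makes a wrong path show up as `False` here rather than as a load with no features found.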

Now we are going to call TF. We only ask it to search through the specified directories and
read the metadata section of all `.tf` files.

In [3]:
TF = Fabric(locations=f'{BASE}/{ORG}/{REPO}/{LOCATION1}', modules=LOCATION2)

This is Text-Fabric 7.8.8
Api reference : https://annotation.github.io/text-fabric/Api/Fabric/

25 features found and 0 ignored


Now we can load the data and receive an API for it.

We want all loadable features, so we ask which they are.

In [4]:
def tfInitiate():
    api = TF.load('', silent=False)
    allFeatures = TF.explore(silent=True, show=True)
    loadableFeatures = allFeatures['nodes'] + allFeatures['edges']
    TF.load(loadableFeatures, add=True, silent=True)
    return api

The first time, you will see a precomputation step (levels, order, rank, ...).

A lot of data is precomputed and stored in a hidden `.tf` directory, so that it loads more quickly next time.
The API relies on this precomputed data for efficiency.
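To see whether the precomputation has already happened, one could check for that hidden directory. This is a minimal sketch: the helper `hasPrecomputedData` is hypothetical, and it assumes that the `.tf` directory sits directly inside the directory that holds the feature files.

```python
import os


def hasPrecomputedData(dataDir):
    """Return True if `dataDir` contains a `.tf` subdirectory.

    Assumption: the hidden cache directory lives next to the feature files.
    """
    return os.path.isdir(os.path.join(dataDir, '.tf'))


# Example call on the current directory; substitute the real data directory.
print(hasPrecomputedData('.'))
```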

In [5]:
api = tfInitiate()

  0.00s loading features ...
   |     0.11s T otype                from /Users/dirk/github/pthu/patristics/tf/1.1/athenaeus/Athenaeus//Deipnosophistae
   |     1.80s T oslots               from /Users/dirk/github/pthu/patristics/tf/1.1/athenaeus/Athenaeus//Deipnosophistae
   |     0.68s T norm                 from /Users/dirk/github/pthu/patristics/tf/1.1/athenaeus/Athenaeus//Deipnosophistae
   |     0.67s T main                 from /Users/dirk/github/pthu/patristics/tf/1.1/athenaeus/Athenaeus//Deipnosophistae
   |     0.00s T book                 from /Users/dirk/github/pthu/patristics/tf/1.1/athenaeus/Athenaeus//Deipnosophistae
   |     0.66s T orig                 from /Users/dirk/github/pthu/patristics/tf/1.1/athenaeus/Athenaeus//Deipnosophistae
   |     0.65s T plain                from /Users/dirk/github/pthu/patristics/tf/1.1/athenaeus/Athenaeus//Deipnosophistae
   |     0.60s T beta_plain           from /Users/dirk/github/pthu/patristics/tf/1.1/athenaeus/Athenaeus//Deipnosophi

Finally, we make the functions of the TF API available at the top level of this notebook.

In [8]:
docs = api.makeAvailableIn(globals())

The variable `docs` lists, grouped by topic, the API members that have been inserted in the global namespace:

In [10]:
for member in docs:
  print(member)

('Computed', 'computed-data', ('C Computed', 'Call AllComputeds', 'Cs ComputedString'))
('Features', 'edge-features', ('E Edge', 'Eall AllEdges', 'Es EdgeString'))
('Fabric', 'loading', ('ensureLoaded', 'TF', 'ignored', 'loadLog'))
('Locality', 'locality', ('L Locality',))
('Nodes', 'navigating-nodes', ('N Nodes', 'sortKey', 'sortKeyTuple', 'otypeRank', 'sortNodes'))
('Features', 'node-features', ('F Feature', 'Fall AllFeatures', 'Fs FeatureString'))
('Search', 'search', ('S Search',))
('Text', 'text', ('T Text',))
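Judging from this output, each entry is a triple of a group name, a documentation anchor, and a tuple of member strings, where the first word of each member string looks like the short name that ends up in the global namespace. That reading is inferred from the printed output, not taken from the documentation. Under that assumption, the short names can be collected like this (the list literal is copied from the output above):

```python
# Copied from the printed output of `docs`.
docs = [
    ('Computed', 'computed-data', ('C Computed', 'Call AllComputeds', 'Cs ComputedString')),
    ('Features', 'edge-features', ('E Edge', 'Eall AllEdges', 'Es EdgeString')),
    ('Fabric', 'loading', ('ensureLoaded', 'TF', 'ignored', 'loadLog')),
    ('Locality', 'locality', ('L Locality',)),
    ('Nodes', 'navigating-nodes', ('N Nodes', 'sortKey', 'sortKeyTuple', 'otypeRank', 'sortNodes')),
    ('Features', 'node-features', ('F Feature', 'Fall AllFeatures', 'Fs FeatureString')),
    ('Search', 'search', ('S Search',)),
    ('Text', 'text', ('T Text',)),
]

# Assumption: the first word of each member string is the short name.
shortNames = [
    member.split()[0]
    for (group, anchor, members) in docs
    for member in members
]
print(' '.join(shortNames))
```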


Now you can explore the data in the same way as in the tutorials for DSS and Oldbabylonian.

We can list all loaded features:

In [18]:
Fall()

['_book',
 '_sentence',
 'add',
 'beta_plain',
 'bibl',
 'book',
 'chapter',
 'cit',
 'head',
 'hi',
 'l',
 'lemma',
 'main',
 'norm',
 'num',
 'orig',
 'otype',
 'p',
 'pb',
 'plain',
 'post',
 'pre',
 'quote']

# Counting

In [11]:
indent(reset=True)
info('Counting nodes ...')

i = 0
for n in N():
  i += 1

info('{} nodes'.format(i))

  0.00s Counting nodes ...
  0.05s 303854 nodes


# Node types

In [12]:
F.otype.slotType

'word'

In [13]:
F.otype.all

('_book',
 'head',
 'book',
 'hi',
 'cit',
 'num',
 'add',
 'chapter',
 'pb',
 'p',
 'quote',
 'bibl',
 'l',
 '_sentence',
 'word')

In [14]:
C.levels.data

(('_book', 265642.0, 265643, 265643),
 ('head', 265642.0, 285988, 285988),
 ('book', 17709.466666666667, 284478, 284492),
 ('hi', 3120.6153846153848, 285989, 286066),
 ('cit', 1588.9880239520958, 285821, 285987),
 ('num', 951.5309090909091, 297334, 297608),
 ('add', 336.5678073510773, 278425, 279213),
 ('chapter', 200.03162650602408, 284493, 285820),
 ('pb', 171.50484183344093, 299180, 300728),
 ('p', 169.09993634627625, 297609, 299179),
 ('quote', 84.89507357645553, 300729, 303854),
 ('bibl', 51.098974164133736, 279214, 284477),
 ('l', 23.543622969734624, 286067, 297333),
 ('_sentence', 20.788122995070808, 265644, 278424),
 ('word', 1, 1, 265642))

In [15]:
for (typ, av, start, end) in C.levels.data:
  print(f'{end - start + 1:>7} {typ}s')

      1 _books
      1 heads
     15 books
     78 his
    167 cits
    275 nums
    789 adds
   1328 chapters
   1549 pbs
   1571 ps
   3126 quotes
   5264 bibls
  11267 ls
  12781 _sentences
 265642 words
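As a cross-check, the per-type counts should add up to the total number of nodes reported in the counting step (303854). The sketch below repeats the computation on the node ranges copied from the `C.levels.data` output above, with the averages left out, so it runs without TF.

```python
# (type, first node, last node), copied from the C.levels.data output.
ranges = (
    ('_book', 265643, 265643),
    ('head', 285988, 285988),
    ('book', 284478, 284492),
    ('hi', 285989, 286066),
    ('cit', 285821, 285987),
    ('num', 297334, 297608),
    ('add', 278425, 279213),
    ('chapter', 284493, 285820),
    ('pb', 299180, 300728),
    ('p', 297609, 299179),
    ('quote', 300729, 303854),
    ('bibl', 279214, 284477),
    ('l', 286067, 297333),
    ('_sentence', 265644, 278424),
    ('word', 1, 265642),
)

# Each type occupies a contiguous range of node numbers.
counts = {typ: end - start + 1 for (typ, start, end) in ranges}
total = sum(counts.values())
print(total)  # 303854
```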


# Feature statistics

There are no linguistic features, as far as I can see, but there is `lemma`.

# Word matters

## Top 20 frequent words

In [20]:
for (w, amount) in F.lemma.freqList('word')[0:20]:
  print(f'{amount:>5} {w}')

24736 ὁ,ὅς
12995 καί
11725 δέ
 5668 ἒ
 3139 φημί
 2916 οὗτος
 2696 αὐτός
 2558 ἐγώ
 2457 τις,ὁ,τίς,ὅς
 2369 εἰμί
 2056 οὐ
 2007 γάρ
 1932 τίς,τῷ,ὅς,ὁ,τις
 1856 ὁ,τίς,ὅς
 1807 ὡς,ὅς
 1742 περί
 1597 τε,σύ,τεός,τις
 1389 ἐπί
 1322 λέγω1,λέγω
 1312 εἰς,εἶμι,εἰμί
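For readers who want to see the idea behind such a frequency list without loading TF, here is a rough equivalent built on `collections.Counter`. The function `freqListSketch` and the sample data are hypothetical; the tie-breaking rule (alphabetical within equal frequency) is a choice made for this sketch.

```python
import collections


def freqListSketch(values):
    """Return (value, count) pairs, most frequent first, ties by value."""
    counter = collections.Counter(values)
    return sorted(counter.items(), key=lambda x: (-x[1], x[0]))


# Made-up sample, not taken from the corpus.
sample = ['καί', 'δέ', 'καί', 'φημί', 'δέ', 'καί']

for (w, amount) in freqListSketch(sample)[0:2]:
    print(f'{amount:>5} {w}')
```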


## Hapaxes

In [21]:
hapaxes1 = sorted(lx for (lx, amount) in F.lemma.freqList('word') if amount == 1)
len(hapaxes1)

11194

In [22]:
for lx in hapaxes1[0:20]:
  print(lx)

*isgreek
*p
*ʼαγκυλητους
*ʼαδεσθαι
*ʼαδυφωνον
*ʼακουσομεθα
*ʼαναξαρχον
*ʼανδρομαχον
*ʼανθος
*ʼαντιφωντος
*ʼαπο
*ʼαποδιδωσι
*ʼαρπασθηναι
*ʼαφι
*ʼβρενθιν
*ʼβυζαντιους
*ʼγενη
*ʼγλαυκου
*ʼγραφει
*ʼγραφων
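The first 20 hapaxes all start with `*` simply because `*` sorts before the Greek letters in Python's default string ordering. What the `*` signifies is not stated in this notebook. If one wanted to look at the remaining hapaxes separately, a filter along these lines would do; the sample data is made up for illustration.

```python
import collections

# Made-up sample of lemma values, not taken from the corpus.
lemmas = ['καί', '*p', 'δέ', 'καί', 'φημί', '*isgreek', 'δέ']

freqs = collections.Counter(lemmas)

# Values that occur exactly once, in default string order.
hapaxes = sorted(lx for (lx, amount) in freqs.items() if amount == 1)

# Set aside the values that start with an asterisk.
unstarred = [lx for lx in hapaxes if not lx.startswith('*')]

print(hapaxes)
print(unstarred)
```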


### Small occurrence base

The occurrence base of a word is the set of books in which the word occurs.

In [25]:
occurrenceBase = collections.defaultdict(set)

indent(reset=True)
info('compiling occurrence base ...')
for s in F.otype.s('book'):
  book = F.book.v(s)
  for w in L.d(s, otype='word'):
    occurrenceBase[F.lemma.v(w)].add(book)
info('done')
info(f'{len(occurrenceBase)} entries')

  0.00s compiling occurrence base ...
  0.22s done
  0.22s 23448 entries


An overview of how many words have an occurrence base of each size:

In [27]:
occurrenceSize = collections.Counter()

for (w, books) in occurrenceBase.items():
  occurrenceSize[len(books)] += 1
  
occurrenceSize = sorted(
  occurrenceSize.items(),
  key=lambda x: (-x[1], x[0]),
)

for (size, amount) in occurrenceSize[0:10]:
  print(f'books {size:>4} : {amount:>5} words')
print('...')
for (size, amount) in occurrenceSize[-10:]:
  print(f'books {size:>4} : {amount:>5} words')

books    1 : 12899 words
books    2 :  3627 words
books    3 :  1879 words
books    4 :  1179 words
books    5 :   781 words
books    6 :   571 words
books    7 :   435 words
books    8 :   375 words
books   15 :   347 words
books   10 :   294 words
...
books    6 :   571 words
books    7 :   435 words
books    8 :   375 words
books   15 :   347 words
books   10 :   294 words
books    9 :   288 words
books   11 :   218 words
books   12 :   205 words
books   14 :   176 words
books   13 :   174 words
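There are only 15 distinct sizes, so the first-10 and last-10 slices overlap in the output above. The sketch below lists each size once, ordered by size, and checks that the amounts add up to the 23448 entries of the occurrence base. The figures are copied from the output; nothing is recomputed from the corpus.

```python
# size of occurrence base -> number of words, copied from the output above.
occurrenceSize = {
    1: 12899, 2: 3627, 3: 1879, 4: 1179, 5: 781,
    6: 571, 7: 435, 8: 375, 9: 288, 10: 294,
    11: 218, 12: 205, 13: 174, 14: 176, 15: 347,
}

for size in sorted(occurrenceSize):
    print(f'books {size:>4} : {occurrenceSize[size]:>5} words')

total = sum(occurrenceSize.values())
print(total)  # 23448
```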


Let's give the predicate *private* to those words whose occurrence base is a single book.

In [28]:
privates = {w for (w, base) in occurrenceBase.items() if len(base) == 1}
len(privates)

12899
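The two steps, building the occurrence base and selecting the private words, can be tried on a toy data set without TF. The books and lemmas below are made up for illustration.

```python
import collections

# Made-up mini corpus: book -> lemmas occurring in it.
bookLemmas = {
    '1': ['καί', 'δέ', 'φημί'],
    '2': ['καί', 'γάρ'],
    '3': ['καί', 'δέ'],
}

# Map each lemma to the set of books it occurs in.
occurrenceBase = collections.defaultdict(set)
for (book, lemmas) in bookLemmas.items():
    for lemma in lemmas:
        occurrenceBase[lemma].add(book)

# A lemma is private if it occurs in exactly one book.
privates = {w for (w, base) in occurrenceBase.items() if len(base) == 1}
print(sorted(privates))
```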

### Peculiarity of books

As a final exercise with books, let's make a list of all books, and show their

* total number of distinct words (lemma values)
* number of private words
* the percentage of private words: a measure of the peculiarity of the book
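The percentage is simply the number of private words divided by the number of distinct words, times 100. A minimal sketch, using the figures reported for book 3 in the table further down (4719 distinct words, 1099 of them private); the function name `peculiarity` is hypothetical.

```python
def peculiarity(nAll, nOwn):
    """Percentage of a book's distinct words that are private to it."""
    return 100 * nOwn / nAll


# Figures for book 3, taken from the table in this notebook.
print(f'{peculiarity(4719, 1099):.1f}%')
```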

In [33]:
bookList = []

empty = set()
ordinary = set()

for d in F.otype.s('book'):
  book = F.book.v(d)
  words = {F.lemma.v(w) for w in L.d(d, otype='word')}
  a = len(words)
  if not a:
    empty.add(book)
    continue
  o = len({w for w in words if w in privates})
  if not o:
    ordinary.add(book)
    continue
  p = 100 * o / a
  bookList.append((book, a, o, p))

bookList = sorted(bookList, key=lambda e: (-e[3], -e[1], e[0]))

print(f'Found {len(empty):>4} empty books')
print(f'Found {len(ordinary):>4} ordinary books (i.e. without private words)')

Found    0 empty books
Found    0 ordinary books (i.e. without private words)


In [34]:
print('{:<20}{:>5}{:>5}{:>5}\n{}'.format(
    'book', '#all', '#own', '%own',
    '-'*35,
))

for x in bookList[0:20]:
  print('{:<20} {:>4} {:>4} {:>4.1f}%'.format(*x))
print('...')
for x in bookList[-20:]:
  print('{:<20} {:>4} {:>4} {:>4.1f}%'.format(*x))

book                 #all #own %own
-----------------------------------
3                    4719 1099 23.3%
14                   4777 1090 22.8%
15                   3821  840 22.0%
13                   4945 1078 21.8%
11                   4589  997 21.7%
7                    4457  948 21.3%
5                    4068  831 20.4%
4                    4777  955 20.0%
2                    3787  741 19.6%
9                    4109  797 19.4%
1                    3715  692 18.6%
6                    4453  824 18.5%
10                   4333  751 17.3%
12                   4165  712 17.1%
8                    3488  544 15.6%
...
3                    4719 1099 23.3%
14                   4777 1090 22.8%
15                   3821  840 22.0%
13                   4945 1078 21.8%
11                   4589  997 21.7%
7                    4457  948 21.3%
5                    4068  831 20.4%
4                    4777  955 20.0%
2                    3787  741 19.6%
9                    4109  797 19.4%