<img align="right" src="images/tf.png" width="128"/>
<img align="right" src="images/ninologo.png" width="128"/>
<img align="right" src="images/dans.png" width="128"/>

To explore the Old Babylonian corpus in the Text-Fabric browser, run:

```
text-fabric oldbabylonian
```

In [1]:
import sys, os, collections
from tf.app import use

# Incantation

In [2]:
A = use('oldbabylonian', hoist=globals())
silentOff()

	connecting to online GitHub repo annotation/app-oldbabylonian ... connected
Using TF-app in /Users/dirk/text-fabric-data/annotation/app-oldbabylonian/code:
	rv0.2=#4bb2530bfb94dc93601f8b3df7722cb0e5df7a43 (latest release)
	connecting to online GitHub repo Nino-cunei/oldbabylonian ... connected
Using data in /Users/dirk/text-fabric-data/Nino-cunei/oldbabylonian/tf/1.0.4:
	rv1.4 (latest release)
   |     0.00s No structure info in otext, the structure part of the T-API cannot be used


# Counting

In [3]:
indent(reset=True)
info('Counting nodes ...')

i = 0
for n in N(): i += 1

info('{} nodes'.format(i))

  0.00s Counting nodes ...
  0.06s 334667 nodes


# Node types

In [4]:
F.otype.slotType

'sign'

In [5]:
F.otype.maxSlot

203219

In [6]:
F.otype.maxNode

334667

In [7]:
F.otype.all

('document', 'face', 'line', 'word', 'cluster', 'sign')

In [8]:
C.levels.data

(('document', 158.14708171206226, 226669, 227953),
 ('face', 71.70748059280169, 227954, 230787),
 ('line', 7.423525114155251, 230788, 258162),
 ('word', 2.6436180641788116, 258163, 334667),
 ('cluster', 1.782122905027933, 203220, 226668),
 ('sign', 1, 1, 203219))

In [9]:
for (typ, av, start, end) in C.levels.data:
  print(f'{end - start + 1:>7} {typ}s')

   1285 documents
   2834 faces
  27375 lines
  76505 words
  23449 clusters
 203219 signs
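
As a sanity check, the per-type counts should add up to the value of `F.otype.maxNode`. The following is a minimal sketch that copies the `C.levels.data` output shown above into a plain literal, so it runs without loading the corpus. The names `levels`, `counts`, `total` and `max_node` are illustrative only.

```python
# Level data copied from the C.levels.data output shown above:
# (node type, average number of slots, first node, last node)
levels = (
    ('document', 158.14708171206226, 226669, 227953),
    ('face', 71.70748059280169, 227954, 230787),
    ('line', 7.423525114155251, 230788, 258162),
    ('word', 2.6436180641788116, 258163, 334667),
    ('cluster', 1.782122905027933, 203220, 226668),
    ('sign', 1, 1, 203219),
)

# Each type occupies one contiguous node range, so its size is end - start + 1
counts = {typ: end - start + 1 for (typ, av, start, end) in levels}
total = sum(counts.values())

# The highest node number over all ranges corresponds to F.otype.maxNode
max_node = max(end for (typ, av, start, end) in levels)

print(f'{total} nodes in total, highest node number {max_node}')
```

If the two numbers agree, the node ranges of the types cover all nodes without gaps or overlaps.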


# Feature statistics

## repeats

In [10]:
F.repeat.freqList()

((1, 877),
 (2, 398),
 (5, 246),
 (3, 239),
 (4, 152),
 (6, 67),
 (8, 40),
 (7, 26),
 (9, 15),
 (-1, 3))

## type (cluster)

In [11]:
F.type.freqList('cluster')

(('langalt', 7600),
 ('missing', 7572),
 ('det', 6794),
 ('uncertain', 1183),
 ('supplied', 231),
 ('excised', 69))

## type (sign)

In [12]:
F.type.freqList('sign')

(('reading', 188292),
 ('unknown', 8761),
 ('numeral', 2184),
 ('ellipsis', 1617),
 ('grapheme', 1272),
 ('commentline', 969),
 ('complex', 122),
 ('comment', 2))

## flags

In [13]:
F.flags.freqList()

(('#', 9830),
 ('?', 421),
 ('#?', 131),
 ('!', 91),
 ('*', 9),
 ('?#', 7),
 ('#!', 5),
 ('!*', 2),
 ('#*', 1),
 ('?!*', 1))
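
Several flags can occur on one sign, so the list above counts flag combinations rather than individual flags. The sketch below tallies each flag on its own, using the frequency list copied from the output above as a literal. It assumes that every character in a combination is a separate flag; `flag_freqs` and `per_flag` are illustrative names.

```python
import collections

# Frequency list copied from the F.flags.freqList() output shown above
flag_freqs = (
    ('#', 9830), ('?', 421), ('#?', 131), ('!', 91), ('*', 9),
    ('?#', 7), ('#!', 5), ('!*', 2), ('#*', 1), ('?!*', 1),
)

# Assumption: each character of a combination is one individual flag
per_flag = collections.Counter()
for (combo, amount) in flag_freqs:
    for flag in combo:
        per_flag[flag] += amount

for (flag, amount) in per_flag.most_common():
    print(f'{amount:>5} {flag}')
```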

# Word matters

## Top 20 frequent words

In [None]:
for (w, amount) in F.sym.freqList('word')[0:20]:
  print(f'{amount:>5} {w}')

## Hapaxes

In [None]:
hapaxes1 = sorted(x for (x, amount) in F.sym.freqList('word') if amount == 1)
len(hapaxes1)

In [None]:
for w in [w for (w, amount) in F.sym.freqList('word') if amount == 1][0:20]:
  print(f'"{w}"')

### Small occurrence base

The occurrence base of a word is the set of documents in which it occurs.

In [None]:
occurrenceBase = collections.defaultdict(set)

for w in F.otype.s('word'):
  pNum = T.sectionFromNode(w)[0]
  occurrenceBase[F.sym.v(w)].add(pNum)

An overview of how many words have an occurrence base of each size:

In [None]:
occurrenceSize = collections.Counter()

for (w, pNums) in occurrenceBase.items():
  occurrenceSize[len(pNums)] += 1
  
occurrenceSize = sorted(
  occurrenceSize.items(),
  key=lambda x: (-x[1], x[0]),
)

for (size, amount) in occurrenceSize[0:10]:
  print(f'tablets {size:>4} : {amount:>5} words')
print('...')
for (size, amount) in occurrenceSize[-10:]:
  print(f'tablets {size:>4} : {amount:>5} words')

Let's call a word *private* if its occurrence base is a single document.

In [None]:
privates = {w for (w, base) in occurrenceBase.items() if len(base) == 1}
len(privates)
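
Private words are not the same as hapaxes: a word that occurs several times, but always in the same document, is private without being a hapax. Here is a minimal sketch on a hypothetical toy corpus, which runs without Text-Fabric. The document names and words are placeholders, not corpus data.

```python
import collections

# Hypothetical toy corpus: document name -> word occurrences (placeholders)
toy_docs = {
    'docA': ['alpha', 'beta', 'alpha', 'gamma'],
    'docB': ['alpha', 'delta'],
    'docC': ['beta', 'epsilon', 'epsilon'],
}

occurrence_base = collections.defaultdict(set)  # word -> documents it occurs in
frequency = collections.Counter()               # word -> number of occurrences

for (doc, words) in toy_docs.items():
    for w in words:
        occurrence_base[w].add(doc)
        frequency[w] += 1

# Hapaxes occur exactly once; private words occur in exactly one document
toy_hapaxes = {w for (w, amount) in frequency.items() if amount == 1}
toy_privates = {w for (w, base) in occurrence_base.items() if len(base) == 1}

print('hapaxes :', sorted(toy_hapaxes))
print('privates:', sorted(toy_privates))
```

In this toy data `epsilon` occurs twice in `docC` only, so it is private but not a hapax. Every hapax is private by definition.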

### Peculiarity of tablets

As a final exercise with words, let's make a list of all documents, and show their

* total number of distinct words
* number of private words
* the percentage of private words among the distinct words: a measure of the peculiarity of the document

In [None]:
docList = []

empty = set()
ordinary = set()

for d in F.otype.s('document'):
  pNum = T.documentName(d)
  words = {F.sym.v(w) for w in L.d(d, otype='word')}
  a = len(words)
  if not a:
    empty.add(pNum)
    continue
  o = len({w for w in words if w in privates})
  if not o:
    ordinary.add(pNum)
    continue
  p = 100 * o / a
  docList.append((pNum, a, o, p))

docList = sorted(docList, key=lambda e: (-e[3], -e[1], e[0]))

print(f'Found {len(empty):>4} empty documents')
print(f'Found {len(ordinary):>4} ordinary documents (i.e. without private words)')

In [None]:
print('{:<20}{:>5}{:>5}{:>5}\n{}'.format(
    'document', '#all', '#own', '%own',
    '-'*35,
))

for x in docList[0:20]:
  print('{:<20} {:>4} {:>4} {:>4.1f}%'.format(*x))
print('...')
for x in docList[-20:]:
  print('{:<20} {:>4} {:>4} {:>4.1f}%'.format(*x))
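
To make the peculiarity measure concrete, here is a small sketch on a hypothetical toy corpus that follows the same steps without Text-Fabric. All names and words are placeholders. As in the cell above, words are counted as distinct forms per document, and empty documents and documents without private words are skipped.

```python
import collections

# Hypothetical toy corpus: document name -> word occurrences (placeholders)
toy_docs = {
    'docA': ['alpha', 'beta', 'alpha', 'gamma'],
    'docB': ['alpha', 'delta'],
    'docC': ['beta', 'epsilon', 'epsilon'],
    'docD': ['alpha', 'beta'],
    'docE': [],
}

# Occurrence base: for each word, the set of documents in which it occurs
occurrence_base = collections.defaultdict(set)
for (doc, words) in toy_docs.items():
    for w in words:
        occurrence_base[w].add(doc)

private_words = {w for (w, base) in occurrence_base.items() if len(base) == 1}

rows = []
for (doc, words) in toy_docs.items():
    distinct = set(words)
    own = distinct & private_words
    if not distinct or not own:
        continue  # skip empty documents and documents without private words
    rows.append((doc, len(distinct), len(own), 100 * len(own) / len(distinct)))

# Most peculiar first; ties broken by size (descending), then by name
rows.sort(key=lambda e: (-e[3], -e[1], e[0]))

for row in rows:
    print('{:<8} {:>4} {:>4} {:>5.1f}%'.format(*row))
```

Here `docB` and `docC` each have one private word out of two distinct words, and `docA` has one out of three, so `docA` ends up last.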

---

next: [search](search.ipynb)

---

full posTag and pos notebooks on
[annotation/tutorials/oldbabylonian/cookbook](https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/oldbabylonian/cookbook)

full tutorial on
[annotation/tutorials/oldbabylonian](https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/oldbabylonian)