<img align="right" src="tf-small.png"/>

# Tutorial

This notebook gets you started with
[Text-Fabric](https://github.com/ETCBC/text-fabric), an API on annotated text.
Below we show some of its functions in action on a Sumerian data set, *etcsl*.

The tutorial is best understood after having familiarized yourself with the underlying
[data model](https://github.com/ETCBC/text-fabric/wiki/Data-model).

If you want to *get* this all, see the
[home page](https://github.com/ETCBC/text-fabric/wiki)
of the Text-Fabric wiki.

In [1]:
import sys, os, collections
from tf.fabric import Fabric

# Call Text-Fabric

Everything starts by setting up Text-Fabric.
It needs to know where to look for data.

In [2]:
ETCSL = 'sumerian/etcsl'
TF = Fabric( modules=[ETCSL], silent=False )

This is Text-Fabric 2.3.9
Api reference : https://github.com/ETCBC/text-fabric/wiki/Api
Tutorial      : https://github.com/ETCBC/text-fabric/blob/master/docs/tutorial.ipynb
Data sources  : https://github.com/ETCBC/text-fabric-data
Data docs     : https://etcbc.github.io/text-fabric-data
Shebanq docs  : https://shebanq.ancient-data.org/text
Slack team    : https://shebanq.slack.com/signup
Questions? Ask shebanq@ancient-data.org for an invite to Slack
27 features found and 0 ignored


Note that we have just one module: `etcsl`, the main data source. 

If you have additional data (features), you can just add them by pointing Text-Fabric to the right directory.

# Load Features
Specify the features to load, and receive the API to work with that data.

In [3]:
api = TF.load('''
    ascii trailer
    bound det emesal-prefix emesal form-type form label lemma npart pos type
    freq_lex freq_occ rank_lex rank_occ
    title
    compNum secNum lineNum
    corr damage supplied
''')
api.makeAvailableIn(globals())

  0.00s loading features ...
   |     0.00s B compNum              from /Users/dirk/Dropbox/text-fabric-data/sumerian/etcsl
   |     0.01s B secNum               from /Users/dirk/Dropbox/text-fabric-data/sumerian/etcsl
   |     0.03s B lineNum              from /Users/dirk/Dropbox/text-fabric-data/sumerian/etcsl
   |     0.14s B ascii                from /Users/dirk/Dropbox/text-fabric-data/sumerian/etcsl
   |     0.08s B form                 from /Users/dirk/Dropbox/text-fabric-data/sumerian/etcsl
   |     0.07s B lemma                from /Users/dirk/Dropbox/text-fabric-data/sumerian/etcsl
   |     0.09s B trailer              from /Users/dirk/Dropbox/text-fabric-data/sumerian/etcsl
   |     0.00s B bound                from /Users/dirk/Dropbox/text-fabric-data/sumerian/etcsl
   |     0.01s B det                  from /Users/dirk/Dropbox/text-fabric-data/sumerian/etcsl
   |     0.00s B emesal-prefix        from /Users/dirk/Dropbox/text-fabric-data/sumerian/etcsl
   |     0.01s B emes

We have made it so that the members of the API are directly accessible as global variables.

# More about the data
See the
[conversion notebook](https://github.com/ETCBC/text-fabric/blob/master/tfFromSumerianTEI/tfFromSumerianTEI.ipynb)
for documentation on the features of this dataset.

# Counting

In order to get acquainted with the data, we start with simple tasks: counting.

## Count all nodes
We use the 
[`N()` generator](https://github.com/ETCBC/text-fabric/wiki/Api#walking-through-nodes)
to walk us through the nodes.

In [4]:
indent(reset=True)
info('Counting nodes ...')
i = 0
for n in N(): i += 1
info('{} nodes'.format(i))

  0.00s Counting nodes ...
  0.17s 625393 nodes


## Sort some nodes

Get some nodes, 
[slot](https://github.com/ETCBC/text-fabric/wiki/Data-model#summary)
and non-slot, and sort them in the 
[canonical order](https://github.com/ETCBC/text-fabric/wiki/Api#sorting-nodes).

The [`otype` feature](https://github.com/ETCBC/text-fabric/wiki/Data-model#otype-node-feature)
is a
[GRID feature](https://github.com/ETCBC/text-fabric/wiki/Data-model#more-about-the-grid),
a special feature that provides defining characteristics for the
data set as a whole. 
It tells us where the slots end and the other nodes start.

In [5]:
sortNodes(list(range(F.otype.maxSlot+1, F.otype.maxSlot+10))+list(range(1,11)))

[412193,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 412194,
 412195,
 412196,
 412197,
 412198,
 412199,
 412200,
 412201]

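The slots correspond to glyphs. Compositions, sections, lines and words are higher-level objects.
In the list above you see first a composition node (`412193`), then the nodes of the first ten glyphs it contains, and then further non-slot nodes such as composition `412194`, which start at later slots.
The words in this data set follow the word boundaries found in the source texts.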
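
To make the canonical order more concrete, here is a minimal sketch in plain Python that does not use Text-Fabric. It assumes a simplified rule: a node comes earlier if its first slot comes earlier, and among nodes that start at the same slot the one that reaches further (the embedder) comes first. The names `toy_slots` and `canonical_key` are hypothetical and only serve this illustration; the real `sortNodes()` also has to deal with cases this sketch ignores, such as nodes that occupy exactly the same slots.

```python
# Toy data: node -> (first slot, last slot).
# Nodes 1-4 are slots; 101 and 102 stand for larger objects spanning two slots each.
toy_slots = {
    1: (1, 1),
    2: (2, 2),
    3: (3, 3),
    4: (4, 4),
    101: (1, 2),
    102: (3, 4),
}

def canonical_key(node):
    """Simplified ordering: earlier start first, then longer reach first."""
    first, last = toy_slots[node]
    return (first, -last)

ordered = sorted([101, 102, 1, 2, 3, 4], key=canonical_key)
print(ordered)  # [101, 1, 2, 102, 3, 4]
```

The embedding node precedes the slots it contains, which is the same pattern as in the sorted list above.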

## Numbers in the otype feature
Get more information that is readily available in the 
[GRID feature](https://github.com/ETCBC/text-fabric/wiki/Data-model#more-about-the-grid)
[`otype`](https://github.com/ETCBC/text-fabric/wiki/Data-model#otype-node-feature),
namely what types of objects there are in the dataset.

In [6]:
info('{:<10}= {}\n{:<10}= {}\n{:<10}= {}'.format(
    'slotType', F.otype.slotType,
    'maxSlot', F.otype.maxSlot,
    'maxNode', F.otype.maxNode,
), tm=False)
info('All otypes:\n\t', nl=False, tm=False)
info('\n\t'.join(F.otype.all), tm=False)

slotType  = glyph
maxSlot   = 412192
maxNode   = 625393
All otypes:
	composition
	section
	line
	word
	glyph


## Count individual object types

In [7]:
indent(reset=True)
info('counting objects ...')
for otype in F.otype.all:
    i = 0
    indent(level=1, reset=True)
    for n in F.otype.s(otype): i+=1
    info('{:>7} {}s'.format(i, otype))
indent(level=0)
info('Done')

  0.00s counting objects ...
   |     0.00s     394 compositions
   |     0.00s    6512 sections
   |     0.01s   36511 lines
   |     0.04s  169784 words
   |     0.07s  412192 glyphs
  0.14s Done
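
As a quick sanity check, the counts per object type should add up to the totals reported by `F.otype`. The sketch below just redoes that arithmetic in plain Python with the figures copied from the outputs above; the variable names are made up for this illustration.

```python
# Figures copied from the outputs of the cells above.
max_slot = 412192
max_node = 625393
counts = {
    'composition': 394,
    'section': 6512,
    'line': 36511,
    'word': 169784,
    'glyph': 412192,
}

# All nodes that are not slots (glyphs).
non_slot_total = sum(n for (otype, n) in counts.items() if otype != 'glyph')

print(non_slot_total)                    # 213201
print(max_node - max_slot)               # 213201
print(sum(counts.values()) == max_node)  # True
```

The number of glyphs equals `maxSlot`, and all object types together account for every node up to `maxNode`.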


# Feature statistics

The content data resides in the features.
The
[`F` function](https://github.com/ETCBC/text-fabric/wiki/Api#node-features)
gives access to that data.
Every feature has a method
[`freqList()`](https://github.com/ETCBC/text-fabric/wiki/Api#node-features)
to generate a frequency list of its values, ordered by highest frequency first.

In [8]:
F.ascii.freqList()[0:20]

(('', 34138),
 ('a', 12366),
 ('X', 10729),
 ('mu', 8841),
 ('{X}', 8634),
 ('e', 8471),
 ('{d}', 8015),
 ('na', 7974),
 ('ba', 7664),
 ('ra', 7551),
 ('en', 6615),
 ('an', 6505),
 ('ni', 6379),
 ('ga', 5902),
 ('da', 5766),
 ('bi', 5466),
 ('ma', 4890),
 ('ki', 4498),
 ('zu', 4281),
 ('me', 4071))
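
Conceptually, such a frequency list is a count of values sorted by descending frequency. The following is a rough stand-alone sketch with `collections.Counter`, not the Text-Fabric implementation. The helper `freq_list` and the toy values are hypothetical, and breaking ties alphabetically is an assumption made here.

```python
import collections

def freq_list(values):
    """Return (value, frequency) pairs, most frequent first.

    Ties are broken alphabetically by value (an assumption of this sketch).
    """
    counter = collections.Counter(values)
    return tuple(sorted(counter.items(), key=lambda item: (-item[1], item[0])))

toy_values = ['a', 'mu', 'a', 'e', 'mu', 'a']
result = freq_list(toy_values)
print(result)  # (('a', 3), ('mu', 2), ('e', 1))
```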

The number of distinct glyphs is given here:

In [9]:
len(F.ascii.freqList())

2514

Here are the parts of speech:

In [10]:
F.pos.freqList()

(('N', 87934),
 ('V', 48984),
 ('AJ', 3554),
 ('PD', 2833),
 ('NU', 1653),
 ('AV', 625),
 ('I', 258),
 ('C', 246),
 ('NEG', 35),
 ('X', 14))

# Word distribution

Let's do a bit more fancy word stuff.

## Hapaxes

A hapax is a word with frequency one.

We count the number of hapaxes and print 10 of them.

We use the `freq_lex` feature as our frequency measure.
It is only defined on words, not on glyphs.

In [11]:
hapaxesL = F.freq_lex.s(1)
print('{} hapaxes found:\n\t{}'.format(len(hapaxesL), '\n\t'.join(F.lemma.v(w) for w in hapaxesL[0:10])))

1271 hapaxes found:
	an-ji6
	dim2-ma-ni-us2-a-ni
	IR3-suen
	67
	su-mu-la-el
	e-lu-lam
	tu-uk-ri-ic
	pec5-pec5
	ugu-dilim2
	nam-NIR.PA


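There are relatively few hapaxes: 1271 out of 169784 words. Let's see what happens if we use the occurrences (`form`) instead of
the lexemes (`lemma`). How many words have a form that occurs only once?

We now use `freq_occ`, but this feature is also defined for *glyphs*,
so we keep only the word nodes.
The cell below prints the lemma of each such word, so the same lemma can show up more than once.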

In [12]:
hapaxesO = [n for n in F.freq_occ.s(1) if F.otype.v(n) == 'word']
print('{} hapaxes found:\n\t{}'.format(len(hapaxesO), '\n\t'.join(F.lemma.v(w) for w in hapaxesO[0:10])))

20662 hapaxes found:
	ed3
	an-ji6
	zu
	ed3
	li-li-a
	kur9
	gal
	li-li-a
	jiri3-jen-na
	pad3
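
Why are there so many more hapaxes by form than by lexeme? One lexeme can surface in several different forms, each of which may occur only once. The toy example below illustrates this in plain Python; the (form, lemma) pairs are invented for the illustration and are not taken from the `freq_lex` or `freq_occ` features.

```python
import collections

# Hypothetical (form, lemma) pairs, one per word occurrence.
words = [
    ('im-ed3', 'ed3'),
    ('im-ed3-kam', 'ed3'),
    ('gal', 'gal'),
    ('gal', 'gal'),
    ('an-ji6', 'an-ji6'),
]

form_freq = collections.Counter(form for (form, lemma) in words)
lemma_freq = collections.Counter(lemma for (form, lemma) in words)

# Lemmas of the words whose lemma occurs once, resp. whose form occurs once.
lemma_hapaxes = [lemma for (form, lemma) in words if lemma_freq[lemma] == 1]
form_hapaxes = [lemma for (form, lemma) in words if form_freq[form] == 1]

print(lemma_hapaxes)  # ['an-ji6']
print(form_hapaxes)   # ['ed3', 'ed3', 'an-ji6']
```

The lemma `ed3` is not a hapax, yet both of its forms are, which is why a lemma can appear repeatedly in a list of form hapaxes.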


### Small occurrence base

The occurrence base of a word is the set of lines, sections and compositions in which it occurs.
Let's look for words that occur in a single composition and nowhere else.

We compile a dictionary, keyed by word form, whose values are the sets of compositions in which that form occurs.

In [13]:
wordInComps = collections.defaultdict(set)

for w in F.otype.s('word'):
    word = F.form.v(w)
    c = L.u(w, otype='composition')[0]
    cNum = F.compNum.v(c)
    wordInComps[word].add(cNum)

singleCompWords = {word for word in wordInComps if len(wordInComps[word]) == 1}
print('{} words are confined to their book'.format(len(singleCompWords)))

23754 words are confined to their book


Let's get some more information.

In [14]:
wordAmountComps = collections.Counter()
for (word, compSet) in wordInComps.items():
    wordAmountComps[len(compSet)] += 1
for (nComps, nWords) in sorted(wordAmountComps.items()):
    print('Confined to {:>3} comps: {:>7} words'.format(nComps, nWords))

Confined to   1 comps:   23754 words
Confined to   2 comps:    3994 words
Confined to   3 comps:    1547 words
Confined to   4 comps:     899 words
Confined to   5 comps:     591 words
Confined to   6 comps:     395 words
Confined to   7 comps:     228 words
Confined to   8 comps:     254 words
Confined to   9 comps:     168 words
Confined to  10 comps:     128 words
Confined to  11 comps:     128 words
Confined to  12 comps:     108 words
Confined to  13 comps:     113 words
Confined to  14 comps:      65 words
Confined to  15 comps:      77 words
Confined to  16 comps:      56 words
Confined to  17 comps:      44 words
Confined to  18 comps:      44 words
Confined to  19 comps:      55 words
Confined to  20 comps:      40 words
Confined to  21 comps:      30 words
Confined to  22 comps:      42 words
Confined to  23 comps:      24 words
Confined to  24 comps:      24 words
Confined to  25 comps:      20 words
Confined to  26 comps:      17 words
Confined to  27 comps:      26 words
C

The next thing to know is: which compositions are the most particular,
in the sense that they have the highest fraction of word forms that
do not occur in other compositions?

In [15]:
compList = []

for c in F.otype.s('composition'):
    cNum = F.compNum.v(c)
    allWords = {F.form.v(w) for w in L.d(c, otype='word')}
    ownWords = allWords & singleCompWords
    percentage = 100 * len(ownWords) / len(allWords)
    compList.append((len(allWords), len(ownWords), percentage, cNum))

print('{:>4} {:>4} {:>5} {}'.format('#all', '#own', '%own', 'comp'))
for x in sorted(compList, key=lambda e: (-e[2], -e[0], e[3])):
    print('{:>4} {:>4} {:>4.1f}% {}'.format(*x))

#all #own  %own comp
 128   90 70.3% 3.1.13.2
 187  116 62.0% 3.1.08
 476  284 59.7% 2.1.1
 164   84 51.2% 5.6.5
  86   44 51.2% 4.13.14
 337  152 45.1% 5.6.3
  41   18 43.9% 3.3.04
  30   13 43.3% 4.13.d
  70   30 42.9% 2.1.3
2155  910 42.2% 2.1.7
 330  135 40.9% 6.1.13
  80   32 40.0% 3.3.05
 128   50 39.1% 3.1.16
  90   35 38.9% 4.13.13
 457  177 38.7% 1.1.2
  52   20 38.5% 3.1.10
 238   91 38.2% 2.4.2.14
  35   13 37.1% 5.7.3
  87   32 36.8% 5.5.a
1873  688 36.7% 1.6.2
 844  308 36.5% 1.8.1.3
 740  270 36.5% 5.6.1
 147   53 36.1% 3.1.15
 254   91 35.8% 1.4.1.1
  73   26 35.6% 3.3.08
 601  213 35.4% 6.1.05
 235   83 35.3% 6.1.04
  92   32 34.8% 4.08.23
 101   35 34.7% 2.6.9.4
 335  116 34.6% 6.1.08
 292  101 34.6% 2.1.2
1079  373 34.6% 2.4.2.02
 881  304 34.5% 1.8.1.4
 951  328 34.5% 2.2.4
  41   14 34.1% 5.7.a
1298  442 34.1% 1.8.2.1
 396  134 33.8% 2.4.2.18
 816  273 33.5% 5.3.6
 162   54 33.3% 3.1.19
 132   44 33.3% 3.1.02
  72   24 33.3% 2.4.4.1
  57   19 33.3% 3.1.11.1
   6    

# Layer API
We travel upwards and downwards, forwards and backwards through the nodes.
The Layer-API (`L`) provides functions: `u()` for going up, and `d()` for going down,
`n()` for going to next nodes and `p()` for going to previous nodes.

These directions are indirect notions: nodes are just numbers, but by means of the
`oslots` feature they are linked to slots. One node *contains* another node if the one is linked to a set of slots that contains the set of slots that the other is linked to.
And one node is next or previous to another if its slots follow or precede the slots of the other one.

`L.u(node)` **Up** is going to nodes that embed `node`.

`L.d(node)` **Down** is the opposite direction, to those that are contained in `node`.

`L.n(node)` **Next** are the next *adjacent* nodes, i.e. nodes whose first slot comes immediately after the last slot of `node`.

`L.p(node)` **Previous** are the previous *adjacent* nodes, i.e. nodes whose last slot comes immediately before the first slot of `node`.

All these functions yield nodes of all possible otypes.
By passing an optional parameter, you can restrict the results to nodes of that type.

The result is ordered in the canonical node ordering.
The functions always return a tuple, even if there is just one node in the result.
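
The four directions can be mimicked with plain sets of slots. The sketch below is a toy model and not how Text-Fabric implements `L`: the mapping `toy_oslots` and the helpers `up`, `down`, `nxt` and `prev` are hypothetical, containment is taken as a proper subset, and results are sorted by node number rather than in canonical order.

```python
# Toy model: node -> set of slots it is linked to.
toy_oslots = {
    1: {1}, 2: {2}, 3: {3}, 4: {4},   # slots
    10: {1, 2},                       # a "word" over slots 1-2
    11: {3, 4},                       # a "word" over slots 3-4
    20: {1, 2, 3, 4},                 # a "line" over all slots
}

def up(node):
    """Nodes whose slots properly contain the slots of `node`."""
    return tuple(sorted(m for m in toy_oslots if toy_oslots[node] < toy_oslots[m]))

def down(node):
    """Nodes whose slots are properly contained in the slots of `node`."""
    return tuple(sorted(m for m in toy_oslots if toy_oslots[m] < toy_oslots[node]))

def nxt(node):
    """Nodes whose first slot comes right after the last slot of `node`."""
    last = max(toy_oslots[node])
    return tuple(sorted(m for m in toy_oslots if min(toy_oslots[m]) == last + 1))

def prev(node):
    """Nodes whose last slot comes right before the first slot of `node`."""
    first = min(toy_oslots[node])
    return tuple(sorted(m for m in toy_oslots if max(toy_oslots[m]) == first - 1))

print(up(1))     # (10, 20)
print(down(20))  # (1, 2, 3, 4, 10, 11)
print(nxt(10))   # (3, 11)
print(prev(11))  # (2, 10)
```

As with the real functions, every helper returns a tuple, and a slot has nothing below it: `down(1)` is the empty tuple.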

## Going up
We go from the first glyph to the composition that contains it.

In [16]:
firstComp = L.u(1, otype='composition')[0]
print(firstComp)

412193


Which composition is this?

In [17]:
print('{} : {}'.format(F.compNum.v(firstComp), F.title.v(firstComp)))

0.1.1 : Ur III catalogue from Nibru (N1) -- a composite transliteration


And let's see all the containing objects of glyph 3:

In [18]:
g = 3
for otype in F.otype.all:
    if otype == F.otype.slotType: continue
    up = L.u(g, otype=otype)
    upNode = 'x' if len(up) == 0 else up[0]
    print('glyph {} is contained in {} {}'.format(F.ascii.v(g), otype, upNode))

glyph ta is contained in composition 412193
glyph ta is contained in section 449098
glyph ta is contained in line 412587
glyph ta is contained in word 455610


## Going next
Let's go to the next nodes of the first composition.

In [19]:
afterFirstComp = L.n(firstComp)
for n in afterFirstComp:
    print('{:>7}: {:<13} first slot={:<6}, last slot={:<6}'.format(
        n, F.otype.v(n),
        E.oslots.s(n)[0],
        E.oslots.s(n)[-1],
    ))
secondComp = L.n(firstComp, otype='composition')[0]

    127: glyph         first slot=127   , last slot=127   
 455679: word          first slot=127   , last slot=127   
 412608: line          first slot=127   , last slot=133   
 449099: section       first slot=127   , last slot=377   
 412194: composition   first slot=127   , last slot=377   


## Going previous

And let's see what is right before the second composition.

In [20]:
for n in L.p(secondComp):
    print('{:>7}: {:<13} first slot={:<6}, last slot={:<6}'.format(
        n, F.otype.v(n),
        E.oslots.s(n)[0],
        E.oslots.s(n)[-1],
    ))

 412193: composition   first slot=1     , last slot=126   
 449098: section       first slot=1     , last slot=126   
 412607: line          first slot=120   , last slot=126   
 455678: word          first slot=124   , last slot=126   
    126: glyph         first slot=126   , last slot=126   


## Going down

We go to the sections of composition 1.1.1 and just count them.

In [21]:
comp111 = F.compNum.s('1.1.1')[0]
sections = L.d(comp111, otype='section')
print('Composition 1.1.1 ({}) has {} sections'.format(
    F.title.v(comp111),
    len(sections),
))

Composition 1.1.1 (Enki and Nin{h}ursa{g}a -- a composite transliteration) has 43 sections


## The first line
We pick the first glyph and the first line and explore what is above and below them.

In [22]:
for n in [1, L.u(1, otype='line')[0]]:
    indent(level=0)
    info('Node {}'.format(n), tm=False)
    indent(level=1)
    info('UP', tm=False)
    indent(level=2)
    info('\n'.join(['{:<15} {}'.format(u, F.otype.v(u)) for u in L.u(n)]), tm=False)
    indent(level=1)
    info('DOWN', tm=False)
    indent(level=2)
    info('\n'.join(['{:<15} {}'.format(u, F.otype.v(u)) for u in L.d(n)]), tm=False)
indent(level=0)
info('Done', tm=False)

Node 1
   |   UP
   |      |   455610          word
   |      |   412587          line
   |      |   449098          section
   |      |   412193          composition
   |   DOWN
   |      |   
Node 412587
   |   UP
   |      |   449098          section
   |      |   412193          composition
   |   DOWN
   |      |   455610          word
   |      |   1               glyph
   |      |   2               glyph
   |      |   3               glyph
Done


# Text API

We examine the functions of the Text API: `T`.

## Formats
First the formats that we have available to represent the actual text.
These formats have been defined in the `otext` feature.
This is an optional GRID config feature: it has only metadata.

In [23]:
sorted(T.formats)

['form-orig-full', 'lex-orig-full', 'text-orig-full']

## Using the formats

The formats `form-orig-full` and `lex-orig-full` work for word nodes,
the format `text-orig-full` works for glyph nodes.
Now let's use those formats to print out the first 18 glyphs and the first 10 words of the corpus.

In [24]:
for fmt in sorted(T.formats):
    if fmt.startswith('text'):
        print('{}:\n\t{}'.format(fmt, T.text(range(1,19), fmt=fmt)))
    else:
        print('{}:\n\t{}'.format(fmt, T.text(F.otype.s('word')[0:10], fmt=fmt)))

form-orig-full:
	dub-saj-ta {d}en-ki unu2 gal im-ed3 an-zag-ce3 an-ji6 zu ama tu6 
lex-orig-full:
	dub-saj en-ki unu2 gal ed3 an-zag an-ji6 zu ama tu6 
text-orig-full:
	dub-saj-ta {d}en-ki unu2 gal im-ed3 an-zag-ce3 an-ji6 zu ama tu6 


If we do not specify a format, the **default** format is used (`text-orig-full`).

In [25]:
print(T.text(range(1,100)))

dub-saj-ta {d}en-ki unu2 gal im-ed3 an-zag-ce3 an-ji6 zu ama tu6 zu-ke4 jic-gi tuku4-e AN KAC4 AN KAC4 me3-ke4 mac-mac erim2 kur2-kur2 jiri3-jen-na {d}en -ki ki unu2 gal im-ed3-kam cag4 PU2 1(AC)-kam dub-saj-ta X X kaskal-la 7(IMIN) me-ce3 {d}li-li-a-ke4 igi X 7(IMIN)-na in-kur9-re-en cul a2 he2-la2 gu DI sig10- ga CA gal-kam jiri3-jen-na {d}li-li-a-kam cag4 PU2 1(AC)-kam 
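
One way to picture a text format is as a small template that is filled in for every slot and then concatenated. The sketch below assumes, hypothetically, that the format combines the `ascii` and `trailer` features; the actual definition lives in the `otext` feature and may differ. The toy feature values, the template string and the helper `toy_text` are made up for this illustration.

```python
# Hypothetical feature values for the first three glyphs.
toy_ascii = {1: 'dub', 2: 'saj', 3: 'ta'}
toy_trailer = {1: '-', 2: '-', 3: ' '}

# Assumed shape of a format template (not read from the real otext feature).
fmt_template = '{ascii}{trailer}'

def toy_text(slots):
    """Fill in the template for each slot and join the pieces."""
    return ''.join(
        fmt_template.format(ascii=toy_ascii[s], trailer=toy_trailer[s])
        for s in slots
    )

rendered = toy_text(range(1, 4))
print(repr(rendered))  # 'dub-saj-ta '
```

Under this assumption the trailing space after a word simply comes from the trailer of its last glyph.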


## Whole text
We are going to produce the complete text of the whole corpus with section numbers and composition titles.

In [26]:
text = []
indent(reset=True)
info('writing plain text of whole Sumerian corpus')
for c in F.otype.s('composition'):
    text.append('')
    text.append('Composition {}: {}'.format(F.compNum.v(c), F.title.v(c)))
    for s in L.d(c, otype='section'):
        text.append('')
        text.append('Section {}'.format(F.secNum.v(s)))
        for l in L.d(s, otype='line'):
            text.append('{}: {}'.format(F.lineNum.v(l), T.text(L.d(l, otype='glyph'))))
info('Done')
print('\n'.join(text[0:10]))

  0.00s writing plain text of whole Sumerian corpus
  1.56s Done

Composition 0.1.1: Ur III catalogue from Nibru (N1) -- a composite transliteration

Section 0
1: dub-saj-ta 
2: {d}en-ki unu2 gal im-ed3 
3: an-zag-ce3 
4: an-ji6 zu ama tu6 zu-ke4 
5: jic-gi tuku4-e 
6: AN KAC4 AN KAC4 me3-ke4 


Here is another fragment.

In [27]:
print('\n'.join(text[561:955]))

Composition 1.1.1: Enki and Nin{h}ursa{g}a -- a composite transliteration

Section 1
1: iri{ki kug-kug-ga-am3 e-ne ba-am3-me-en-ze2-en kur dilmun{ki kug-ga-am3 
2: ki-en-gi kug-ga e-ne ba-am3-me-en-ze2-en kur dilmun{ki kug-ga-am3 
3: kur dilmun{ki kug-ga-am3 kur dilmun sikil-am3 
4: kur dilmun{ki sikil-am3 kur dilmun{ki dadag-ga-am3 

Section 2
5: dili-ni-ne dilmun{ki-a u3-bi2-in-nu2 
6: ki {d}en-ki dam-a-ni-da ba-an-da-nu2-a-ba 
7: ki-bi sikil-am3 ki-bi dadag-ga-am3 
8: dili-ni-ne dilmun{ki-a u3-bi2-in-nu2 
9: ki {d}en-ki {d}nin-sikil-la ba-an-da-nu2-a-ba 
10: ki-bi sikil-am3 ki-bi dadag-ga-am3 

Section 3
11: dilmun{ki-a uga{mucen gu3-gu3 nu-mu-ni-be2 
12: dar{mucen-e gu3 dar{mucen-re nu-mu-ni-ib-be2 
13: ur-gu-la saj jic nu-ub-ra-ra 
14: ur-bar-ra-ke4 sila4 nu-ub-kar-re 
15: ur-gir15 mac2 gam-gam nu-ub-zu 
16: cah2 ce gu7-gu7-e nu-ub-zu 

Section 4
17: nu-mu-un-su2 munu4 ur3-ra barag2-ga-ba 
18: mucen-e an-na munu4-bi na-an-gu7-e 
19: tum12{mucen-e saj nu-mu-un-da-RU-e 

Section 5
2

Write the whole text to file.

In [28]:
with open(os.path.expanduser('~/Downloads/etcsl.txt'), 'w') as f:
    f.write('\n'.join(text))

## Word forms versus glyphs
We check where the `form` feature on words is different from the text representation based on its glyphs.
Note that the `text-orig-full` text format puts a spurious space at the end of each word.
That is fine for outputting running text, but for the comparison we will strip that space.

In [29]:
different = []
for w in F.otype.s('word'):
    form = F.form.v(w)
    # use a separate name so that the `text` list built earlier is not overwritten
    glyphText = T.text(L.d(w, otype='glyph')).rstrip(' ')
    if form != glyphText:
        different.append((w, form, glyphText))
print('There are {} differences'.format(len(different)))

There are 18267 differences


In [30]:
different[0:10]

[(455632, '{d}en-ki', '{d}en -ki'),
 (455639, '1-kam', '1(AC)-kam'),
 (455644, '7', '7(IMIN)'),
 (455649, '7-na', '7(IMIN)-na'),
 (455656, 'sig10-ga', 'sig10- ga'),
 (455663, '1-kam', '1(AC)-kam'),
 (455681, 'dar-a', 'dar -a'),
 (455683, 'dug4-ga-ni', 'dug4 -ga-ni'),
 (455714, 'an-cag4-ta', 'an - cag4 - ta'),
 (455715, 'hi-li', 'hi- li')]

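Some differences have to do with spacing inside a word (`'{d}en-ki'` versus `'{d}en -ki'`), which will require further inspection.
It might be caused by the elements that we have ignored so far.
Other differences concern numerals: the glyph text carries a parenthesised addition that the `form` feature lacks (`'1-kam'` versus `'1(AC)-kam'`).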
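
To get a feeling for how the differences are distributed, one could sort them into rough classes. The sketch below is a stand-alone illustration on a few pairs copied from the output above; the helper `classify` and its class names are hypothetical, and the regular expression simply assumes that the extra material is a parenthesised group.

```python
import re

# (form, glyph text) pairs copied from the listing above.
sample = [
    ('{d}en-ki', '{d}en -ki'),
    ('1-kam', '1(AC)-kam'),
    ('7', '7(IMIN)'),
    ('sig10-ga', 'sig10- ga'),
    ('an-cag4-ta', 'an - cag4 - ta'),
]

def classify(form, glyph_text):
    """Return a rough label for the kind of difference."""
    no_space = glyph_text.replace(' ', '')
    if no_space == form:
        return 'spacing'
    # Drop parenthesised groups such as (AC) or (IMIN).
    no_parens = re.sub(r'\([^)]*\)', '', no_space)
    if no_parens == form:
        return 'parenthesis'
    return 'other'

kinds = [classify(form, glyph_text) for (form, glyph_text) in sample]
print(kinds)  # ['spacing', 'parenthesis', 'parenthesis', 'spacing', 'spacing']
```

Applied to the full `different` list, a tally of these labels would show how much remains in the `other` class for closer inspection.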