<img align="right" src="tf-small.png"/>

# Tutorial

This notebook gets you started with
[Text-Fabric](https://github.com/ETCBC/text-fabric), an API for annotated text.
Below we show some of its functions in action on the DCS data set (Sanskrit).

The tutorial is best understood after having familiarized yourself with the underlying
[data model](https://github.com/ETCBC/text-fabric/wiki/Data-model).

If you want to *get* all of this, see the
[home page](https://github.com/ETCBC/text-fabric/wiki)
of the Text-Fabric wiki.

In [1]:
import sys, collections
from tf.fabric import Fabric

# Call Text-Fabric

Everything starts by setting up Text-Fabric.
It needs to know where to look for data.

In [2]:
DCS = 'sanskrit/dcs'
TF = Fabric( modules=[DCS], silent=False )

This is Text-Fabric 2.3.9
Api reference : https://github.com/ETCBC/text-fabric/wiki/Api
Tutorial      : https://github.com/ETCBC/text-fabric/blob/master/docs/tutorial.ipynb
Data sources  : https://github.com/ETCBC/text-fabric-data
Data docs     : https://etcbc.github.io/text-fabric-data
Shebanq docs  : https://shebanq.ancient-data.org/text
Slack team    : https://shebanq.slack.com/signup
Questions? Ask shebanq@ancient-data.org for an invite to Slack
12 features found and 0 ignored


Note that we have just one module: `dcs`, the main data source. 

If you have additional data (features), you can just add them by pointing Text-Fabric to the right directory.

# Load Features
Specify the features to load, and receive the API to work with that data.

In [3]:
api = TF.load('''
    char
    word
    freq
    rank
    book
''')
api.makeAvailableIn(globals())

  0.00s loading features ...
   |     0.79s T otype                from /Users/dirk/github/text-fabric-data/sanskrit/dcs
   |     5.09s T oslots               from /Users/dirk/github/text-fabric-data/sanskrit/dcs
   |     0.18s T book                 from /Users/dirk/github/text-fabric-data/sanskrit/dcs
   |     0.11s T chapter              from /Users/dirk/github/text-fabric-data/sanskrit/dcs
   |     0.06s T verse                from /Users/dirk/github/text-fabric-data/sanskrit/dcs
   |     3.85s T char                 from /Users/dirk/github/text-fabric-data/sanskrit/dcs
   |     2.14s T trailer              from /Users/dirk/github/text-fabric-data/sanskrit/dcs
   |      |     0.22s C __levels__           from otype, oslots
   |      |       10s C __order__            from otype, oslots, __levels__
   |      |     0.79s C __rank__             from otype, __order__
   |      |       12s C __levUp__            from otype, oslots, __rank__
   |      |     0.80s C __levDown__          f

We have made it so that the members of the API are directly accessible as global variables.

# Counting

In order to get acquainted with the data, we start with simple tasks: counting.

## Count all nodes
We use the 
[`N()` generator](https://github.com/ETCBC/text-fabric/wiki/Api#walking-through-nodes)
to walk us through the nodes.

In [4]:
indent(reset=True)
info('Counting nodes ...')
i = 0
for n in N(): i += 1
info('{} nodes'.format(i))

  0.00s Counting nodes ...
  0.28s 1336710 nodes


## Sort some nodes

Get some nodes, 
[slot](https://github.com/ETCBC/text-fabric/wiki/Data-model#summary)
and non-slot, and sort them in the 
[canonical order](https://github.com/ETCBC/text-fabric/wiki/Api#sorting-nodes).

The [`otype` feature](https://github.com/ETCBC/text-fabric/wiki/Data-model#otype-node-feature)
is a
[GRID feature](https://github.com/ETCBC/text-fabric/wiki/Data-model#more-about-the-grid),
a special feature that provides defining characteristics for the
data set as a whole. 
It tells us where the slots end and the other nodes start.

In [5]:
sortNodes(list(range(F.otype.maxSlot+1, F.otype.maxSlot+10))+list(range(1,11)))

[1161380,
 1,
 2,
 1161381,
 3,
 4,
 5,
 6,
 1161382,
 7,
 8,
 9,
 10,
 1161383,
 1161384,
 1161385,
 1161386,
 1161387,
 1161388]

The slots correspond to letters. Words are a higher concept.
In the list above you see first a word node, and then the nodes corresponding to its letters.
These words are the words according to the word boundaries found in the source texts.
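
The ordering seen above can be imitated with a toy model. The sketch below is not Text-Fabric's implementation; it assumes that every node occupies a contiguous range of slots, and it uses a hypothetical `toy_slots` table with made-up node numbers. A node sorts earlier if it starts earlier, and among nodes that start at the same slot the one that ends later (the embedder) comes first.

```python
# Toy data (hypothetical): node -> (first slot, last slot).
# Nodes 1..6 are slots (letters); 101 and 102 are words.
toy_slots = {n: (n, n) for n in range(1, 7)}
toy_slots[101] = (1, 2)
toy_slots[102] = (3, 6)


def canonical_key(node):
    """Earlier start first; for equal starts, the longer node first."""
    first, last = toy_slots[node]
    return (first, -last)


ordered = sorted(toy_slots, key=canonical_key)
print(ordered)
```

Under these assumptions each word node lands directly before its own letters, which is the same pattern as in the output of `sortNodes()`.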

It is possible to work with other word boundaries. The first step could be to create an additional feature
which tells for each letter whether a word starts there.
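
As a minimal sketch of that first step, assume a hypothetical feature `word_start` that maps each letter (slot) to `True` when a word begins there. The feature name, the helper `words_from_starts` and the toy values are assumptions for illustration; nothing like it is part of the data set shown here. Words then follow as runs of slots between consecutive starts.

```python
# Hypothetical feature: slot -> does a word start at this letter?
word_start = {1: True, 2: False, 3: True, 4: False, 5: False, 6: False, 7: True}


def words_from_starts(starts):
    """Group consecutive slots into words, opening a new word at every start."""
    words = []
    for slot in sorted(starts):
        if starts[slot] or not words:
            # A start marker (or the very first slot) opens a new word.
            words.append([slot])
        else:
            # Otherwise the slot continues the current word.
            words[-1].append(slot)
    return words


print(words_from_starts(word_start))
```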

## Numbers in the otype feature
Get more information that is readily available in the 
[GRID feature](https://github.com/ETCBC/text-fabric/wiki/Data-model#more-about-the-grid)
[`otype`](https://github.com/ETCBC/text-fabric/wiki/Data-model#otype-node-feature),
namely what types of objects there are in the dataset.

In [6]:
info('{:<9} = {}\n{:<9}={}\n{:<9}={}'.format(
    'slotType', F.otype.slotType,
    'maxSlot', F.otype.maxSlot,
    'maxNode', F.otype.maxNode,
), tm=False)
info('All otypes:\n\t', nl=False, tm=False)
info('\n\t'.join(F.otype.all), tm=False)

slotType  = letter
maxSlot  =1161379
maxNode  =1336710
All otypes:
	book
	chapter
	verse
	word
	letter


## Count individual object types

In [7]:
indent(reset=True)
info('counting objects ...')
for otype in F.otype.all:
    i = 0
    indent(level=1, reset=True)
    for n in F.otype.s(otype): i+=1
    info('{:>7} {}s'.format(i, otype))
indent(level=0)
info('Done')

  0.00s counting objects ...
   |     0.00s     183 books
   |     0.00s   13010 chapters
   |     0.00s   25729 verses
   |     0.02s  136409 words
   |     0.14s 1161379 letters
  0.17s Done


# Feature statistics

The content data resides in the features.
The
[`F` function](https://github.com/ETCBC/text-fabric/wiki/Api#node-features)
gives access to that data.
Every feature has a method
[`freqList()`](https://github.com/ETCBC/text-fabric/wiki/Api#node-features)
to generate a frequency list of its values, ordered by highest frequency first.

In [8]:
F.word.freqList()[0:10]

(('ca', 3880),
 ('tu', 1547),
 ('na', 1333),
 ('sa', 757),
 ('iti', 732),
 ('tathā', 672),
 ('vā', 529),
 ('eva', 516),
 ('hi', 420),
 ('caiva', 372))
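
What such a frequency list contains can be illustrated without Text-Fabric. The sketch below is an assumption about the shape of the result, not the library's code: it counts the values of a hypothetical `toy_word` feature (node to value) and returns `(value, frequency)` pairs, most frequent first. Breaking ties alphabetically is a choice made here to keep the result deterministic.

```python
import collections

# Hypothetical feature data: node -> word form.
toy_word = {101: 'ca', 102: 'tu', 103: 'ca', 104: 'na', 105: 'ca', 106: 'tu'}


def freq_list(feature):
    """Return (value, frequency) pairs, highest frequency first."""
    counts = collections.Counter(feature.values())
    return tuple(sorted(counts.items(), key=lambda x: (-x[1], x[0])))


print(freq_list(toy_word)[0:2])
```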

# Word distribution

Let's do a bit more fancy word stuff.

## Hapaxes

A hapax is a word with frequency one.

We count the number of hapaxes and print 10 of them.

In [9]:
hapaxes = F.freq.s(1)
print('{} hapaxes found:\n\t{}'.format(len(hapaxes), '\n\t'.join(F.word.v(w) for w in hapaxes[0:10])))

53749 hapaxes found:
	buddhāya
	sarvahatāndhakāraḥ
	saṃsārapaṅkājjagadujjahāra
	yathārthaśāstre
	pravakṣyāmyabhidharmakośam
	prajñāmalā
	sānucarābhidharmaḥ
	tatprāptaye
	yāpi
	tasyārthato'smin
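
The `freq` feature used above presumably stores, for every word node, how often its word form occurs in the corpus. As a hedged illustration of that idea, the sketch below derives such a mapping from a hypothetical `toy_word` feature and then selects the nodes with frequency one. The data and the names `toy_freq` and `toy_hapaxes` are made up for this example.

```python
import collections

# Hypothetical feature data: node -> word form.
toy_word = {101: 'om', 102: 'namo', 103: 'ca', 104: 'yaḥ', 105: 'ca'}

# Count every word form, then attach the count to each node.
counts = collections.Counter(toy_word.values())
toy_freq = {node: counts[form] for node, form in toy_word.items()}

# Hapaxes are the nodes whose word form occurs exactly once.
toy_hapaxes = [node for node in sorted(toy_freq) if toy_freq[node] == 1]
print([toy_word[n] for n in toy_hapaxes])
```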


### Small occurrence base

The occurrence base of a lexeme consists of the verses, chapters and books in which it occurs.
Let's look for lexemes that occur in a single book and nowhere else.

We have already found the hapaxes. They occur in a single book by definition, and the code below does not filter them out, so they are included in the count.

We compile a dictionary keyed by word, whose values are the sets of books in which that word occurs.

In [10]:
wordInBooks = collections.defaultdict(set)

for w in F.otype.s('word'):
    word = F.word.v(w)
    b = L.u(w, otype='book')[0]
    bookName = F.book.v(b)
    wordInBooks[word].add(bookName)

singleBookWords = {word for word in wordInBooks if len(wordInBooks[word]) == 1}
print('{} words are confined to their book'.format(len(singleBookWords)))

56788 words are confined to their book


That is not surprising. But let's get some more information.

In [11]:
wordAmountBooks = collections.Counter()
for (word, bookSet) in wordInBooks.items():
    wordAmountBooks[len(bookSet)] += 1
for (nBooks, nWords) in sorted(wordAmountBooks.items()):
    print('Confined to {:>3} books: {:>7} words'.format(nBooks, nWords))

Confined to   1 books:   56788 words
Confined to   2 books:    5453 words
Confined to   3 books:    1873 words
Confined to   4 books:     934 words
Confined to   5 books:     550 words
Confined to   6 books:     344 words
Confined to   7 books:     215 words
Confined to   8 books:     178 words
Confined to   9 books:     116 words
Confined to  10 books:      86 words
Confined to  11 books:      73 words
Confined to  12 books:      60 words
Confined to  13 books:      57 words
Confined to  14 books:      32 words
Confined to  15 books:      27 words
Confined to  16 books:      20 words
Confined to  17 books:      18 words
Confined to  18 books:      29 words
Confined to  19 books:      22 words
Confined to  20 books:      10 words
Confined to  21 books:      15 words
Confined to  22 books:       5 words
Confined to  23 books:      11 words
Confined to  24 books:       6 words
Confined to  25 books:      12 words
Confined to  26 books:       9 words
Confined to  27 books:       8 words
C

The next thing to know is: which books are the most particular,
in the sense that they have the highest fraction of lexemes that 
do not occur in other books?

In [12]:
bookList = []

for b in F.otype.s('book'):
    book = F.book.v(b)
    allWords = {F.word.v(w) for w in L.d(b, otype='word')}
    ownWords = allWords & singleBookWords
    percentage = 100 * len(ownWords) / len(allWords)
    bookList.append((len(allWords), len(ownWords), percentage, book))

for x in sorted(bookList, key=lambda e: (-e[2], -e[0], e[3])):
    print('{:>4} {:>4} {:>4.1f}% {}'.format(*x))

   2    2 100.0% Cakra (?) on Suśr
   1    1 100.0% Arthaśāstra
   1    1 100.0% Āyurvedadīpikā
  33   28 84.8% Vātūlanāthasūtras
 137  109 79.6% Kādambarīsvīkaraṇasūtramañjarī
 148  117 79.1% Aṣṭādhyāyī
   9    7 77.8% Indu (ad AHS)
 333  253 76.0% Trikāṇḍaśeṣa
1239  940 75.9% Daśakumāracarita
 282  213 75.5% Sūryaśataka
1258  936 74.4% Haṃsadūta
1569 1163 74.1% Nighaṇṭuśeṣa
1018  745 73.2% Laṅkāvatārasūtra
 269  194 72.1% Ṛtusaṃhāra
 728  525 72.1% Nāṭyaśāstravivṛti
 182  131 72.0% Rasikapriyā
 224  160 71.4% Kauśikasūtradārilabhāṣya
 363  255 70.2% Gītagovinda
  30   21 70.0% Kāśikāvṛtti
2336 1624 69.5% Amarakośa
 525  364 69.3% Bījanighaṇṭu
1610 1109 68.9% Amaruśataka
 215  148 68.8% Kāmasūtra
  41   28 68.3% Śivasūtra
 141   96 68.1% Yogasūtra
1045  700 67.0% Meghadūta
 257  172 66.9% Rasendracūḍāmaṇi
   9    6 66.7% Kāvyālaṃkāravṛtti
   6    4 66.7% Rasādhyāyaṭīkā
 650  433 66.6% Kumārasaṃbhava
 346  230 66.5% Śatapathabrāhmaṇa
5500 3648 66.3% Kātyāyanasmṛti
  80   53 66.2% Sūrya

# Layer API
We travel upwards and downwards, forwards and backwards through the nodes.
The Layer-API (`L`) provides functions: `u()` for going up, and `d()` for going down,
`n()` for going to next nodes and `p()` for going to previous nodes.

These directions are indirect notions: nodes are just numbers, but by means of the
`oslots` feature they are linked to slots. One node *contains* another node if the one is linked to a set of slots that contains the set of slots that the other is linked to.
And one node is next or previous to another if its slots follow or precede the slots of the other one.

`L.u(node)` **Up** is going to nodes that embed `node`.

`L.d(node)` **Down** is the opposite direction, to those that are contained in `node`.

`L.n(node)` **Next** are the next *adjacent* nodes, i.e. nodes whose first slot comes immediately after the last slot of `node`.

`L.p(node)` **Previous** are the previous *adjacent* nodes, i.e. nodes whose last slot comes immediately before the first slot of `node`.

All these functions yield nodes of all possible otypes.
By passing the optional parameter `otype`, you can restrict the results to nodes of one type.

The result is ordered in the canonical node ordering.
The functions always return a tuple, even if there is just one node in the result.
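
The four directions can be made concrete with a toy model. The sketch below is not Text-Fabric's implementation: it stores a hypothetical `toy_slots` table of (first slot, last slot) per node, assumes contiguous nodes, does not treat nodes with identical slot ranges specially, and returns results in plain node-number order rather than in canonical order. All function names are invented for this example.

```python
# Toy data (hypothetical): node -> (first slot, last slot).
toy_slots = {
    1: (1, 1), 2: (2, 2), 3: (3, 3), 4: (4, 4),  # letters (slots)
    11: (1, 2), 12: (3, 4),                      # words
    21: (1, 4),                                  # verse
}


def up(node):
    """Other nodes whose slot range contains the range of `node`."""
    first, last = toy_slots[node]
    return tuple(
        m for m, (a, b) in sorted(toy_slots.items())
        if m != node and a <= first and last <= b
    )


def down(node):
    """Other nodes whose slot range lies inside the range of `node`."""
    first, last = toy_slots[node]
    return tuple(
        m for m, (a, b) in sorted(toy_slots.items())
        if m != node and first <= a and b <= last
    )


def next_nodes(node):
    """Nodes that start right after the last slot of `node`."""
    last = toy_slots[node][1]
    return tuple(m for m, (a, _) in sorted(toy_slots.items()) if a == last + 1)


def prev_nodes(node):
    """Nodes that end right before the first slot of `node`."""
    first = toy_slots[node][0]
    return tuple(m for m, (_, b) in sorted(toy_slots.items()) if b == first - 1)


print(up(1), down(12), next_nodes(11), prev_nodes(12))
```

Like the real functions, these helpers always return a tuple, so an empty result is `()` rather than `None`.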

## Going up
We go from the first letter to the book that contains it.

In [13]:
firstBook = L.u(1, otype='book')[0]
print(firstBook)

1336528


And let's see all the containing objects of letter 3:

In [14]:
w = 3
for otype in F.otype.all:
    if otype == F.otype.slotType: continue
    up = L.u(w, otype=otype)
    upNode = 'x' if len(up) == 0 else up[0]
    print('letter {} is contained in {} {}'.format(w, otype, upNode))

letter 3 is contained in book 1336528
letter 3 is contained in chapter 1323518
letter 3 is contained in verse 1297789
letter 3 is contained in word 1161381


## Going next
Let's go to the next nodes of the first book.

In [15]:
afterFirstBook = L.n(firstBook)
for n in afterFirstBook:
    print('{:>7}: {:<13} first slot={:<6}, last slot={:<6}'.format(
        n, F.otype.v(n),
        E.oslots.s(n)[0],
        E.oslots.s(n)[-1],
    ))
secondBook = L.n(firstBook, otype='book')[0]

   4151: letter        first slot=4151  , last slot=4151  
1161789: word          first slot=4151  , last slot=4155  
1297887: verse         first slot=4151  , last slot=4199  
1323567: chapter       first slot=4151  , last slot=4199  
1336529: book          first slot=4151  , last slot=4726  


## Going previous

And let's see what is right before the second book.

In [16]:
for n in L.p(secondBook):
    print('{:>7}: {:<13} first slot={:<6}, last slot={:<6}'.format(
        n, F.otype.v(n),
        E.oslots.s(n)[0],
        E.oslots.s(n)[-1],
    ))

1336528: book          first slot=1     , last slot=4150  
1323566: chapter       first slot=4084  , last slot=4150  
1297886: verse         first slot=4084  , last slot=4150  
1161788: word          first slot=4140  , last slot=4150  
   4150: letter        first slot=4150  , last slot=4150  


## Going down

We go to the chapters of the second book, and just count them.

In [17]:
chapters = L.d(secondBook, otype='chapter')
print(len(chapters))

8


## The first verse
We pick the first letter and the first verse, and explore what is above and below them.

In [18]:
for n in [1, L.u(1, otype='verse')[0]]:
    indent(level=0)
    info('Node {}'.format(n), tm=False)
    indent(level=1)
    info('UP', tm=False)
    indent(level=2)
    info('\n'.join(['{:<15} {}'.format(u, F.otype.v(u)) for u in L.u(n)]), tm=False)
    indent(level=1)
    info('DOWN', tm=False)
    indent(level=2)
    info('\n'.join(['{:<15} {}'.format(u, F.otype.v(u)) for u in L.d(n)]), tm=False)
indent(level=0)
info('Done', tm=False)

Node 1
   |   UP
   |      |   1161380         word
   |      |   1297789         verse
   |      |   1323518         chapter
   |      |   1336528         book
   |   DOWN
   |      |   
Node 1297789
   |   UP
   |      |   1323518         chapter
   |      |   1336528         book
   |   DOWN
   |      |   1161380         word
   |      |   1               letter
   |      |   2               letter
   |      |   1161381         word
   |      |   3               letter
   |      |   4               letter
   |      |   5               letter
   |      |   6               letter
   |      |   1161382         word
   |      |   7               letter
   |      |   8               letter
   |      |   9               letter
   |      |   10              letter
   |      |   11              letter
   |      |   12              letter
   |      |   13              letter
   |      |   14              letter
Done


# Text API

We examine the functions of the Text API: `T`.

## Formats
First the formats that we have available to represent the actual text.
These formats have been defined in the `otext` feature.
This is an optional GRID config feature: it has only metadata.

In [19]:
sorted(T.formats)

['text-orig-full', 'text-orig-segmented']

## Using the formats
Now let's use those formats to print out the first 99 letters of the corpus (slots 1 up to and including 99).

In [20]:
for fmt in sorted(T.formats):
    print('{}:\n\t{}'.format(fmt, T.text(range(1,100), fmt=fmt)))

text-orig-full:
	omnamobuddhāyayaḥsarvathāsarvahatāndhakāraḥsaṃsārapaṅkājjagadujjahāratasmainamaskṛtyayathārthaśāstr
text-orig-segmented:
	om namo buddhāya yaḥ sarvathā sarvahatāndhakāraḥ saṃsārapaṅkājjagadujjahāra tasmai namaskṛtya yathārthaśāstr


If we do not specify a format, the **default** format is used (`text-orig-full`).

In [21]:
print(T.text(range(1,100)))

omnamobuddhāyayaḥsarvathāsarvahatāndhakāraḥsaṃsārapaṅkājjagadujjahāratasmainamaskṛtyayathārthaśāstr
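
The difference between the two formats can be pictured as follows. This is a sketch under assumptions, not the actual `otext` configuration: it supposes that the full format concatenates the `char` value of each letter, while the segmented format also appends a `trailer` value (here a space after the last letter of a word). The feature names are taken from the load log above; the toy values and the `render` helper are hypothetical.

```python
# Hypothetical per-letter feature values for "om namo".
toy_char = {1: 'o', 2: 'm', 3: 'n', 4: 'a', 5: 'm', 6: 'o'}
toy_trailer = {2: ' ', 6: ' '}  # space after the last letter of each word


def render(slots, segmented=False):
    """Concatenate letters, optionally followed by their trailers."""
    parts = []
    for s in slots:
        parts.append(toy_char[s])
        if segmented:
            parts.append(toy_trailer.get(s, ''))
    return ''.join(parts)


print(repr(render(range(1, 7))))
print(repr(render(range(1, 7), segmented=True)))
```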


## Whole text in all formats in just 10 seconds
We are going to produce the complete text of the whole corpus in all available formats.

In [22]:
text = collections.defaultdict(list)
indent(reset=True)
info('writing plain text of whole corpus in all formats')
for v in F.otype.s('verse'):
    letters = L.d(v, 'letter')
    for fmt in sorted(T.formats):
        text[fmt].append(T.text(letters, fmt=fmt))
info('done {} formats'.format(len(text)))
for fmt in sorted(text):
    print('{}\n{}\n'.format(fmt, '\n'.join(text[fmt][0:5])))

  0.00s writing plain text of whole corpus in all formats
  5.31s done 2 formats
text-orig-full
omnamobuddhāya
yaḥsarvathāsarvahatāndhakāraḥsaṃsārapaṅkājjagadujjahāra
tasmainamaskṛtyayathārthaśāstreśāstraṃpravakṣyāmyabhidharmakośam
prajñāmalāsānucarābhidharmaḥtatprāptayeyāpicayaccaśāstram
tasyārthato'sminsamanupraveśātsacāśrayo'syetyabhidharmakośam

text-orig-segmented
om namo buddhāya 
yaḥ sarvathā sarvahatāndhakāraḥ saṃsārapaṅkājjagadujjahāra 
tasmai namaskṛtya yathārthaśāstre śāstraṃ pravakṣyāmyabhidharmakośam 
prajñāmalā sānucarābhidharmaḥ tatprāptaye yāpi ca yacca śāstram 
tasyārthato'smin samanupraveśāt sa cāśrayo 'syetyabhidharmakośam 

