<img align="right" src="images/tf.png" width="128"/>
<img align="right" src="images/logo.png" width="128"/>
<img align="right" src="images/etcbc.png" width="128"/>
<img align="right" src="images/dans.png" width="128"/>

# Tutorial

This notebook gets you started with using
[Text-Fabric](https://annotation.github.io/text-fabric/) for coding in the Dead-Sea Scrolls.

Chances are that a bit of reading about the underlying
[data model](https://annotation.github.io/text-fabric/Model/Data-Model/)
helps you to follow the exercises below, and vice versa.

## Installing Text-Fabric

### Python

You need to have Python on your system. Most systems have it out of the box,
but alas, that is python2 and we need at least python **3.6**.

Install it from [python.org](https://www.python.org) or from
[Anaconda](https://www.anaconda.com/download).
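
If you are not sure which Python you are running, a quick check like the following may help. It is a minimal sketch: the helper name `python_ok` is made up for this tutorial and is not part of Text-Fabric.

```python
import sys

MINIMUM = (3, 6)  # the minimum version named above


def python_ok(version=None):
    """Return True when the given (or the running) Python is at least MINIMUM."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= MINIMUM


print("Python OK" if python_ok() else "Python too old for Text-Fabric")
```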

### TF itself

```
pip3 install text-fabric
```

### Jupyter notebook

You need [Jupyter](http://jupyter.org).

If it is not already installed:

```
pip3 install jupyter
```

## Tip
If you cloned the repository containing this tutorial,
first copy its parent directory to somewhere outside your clone of the repo,
before computing with it.

If you pull changes from the repository later, they will not conflict with
your computations.

Where you put your tutorial directory is up to you.
It will work from any directory.

## Data

Text-Fabric will fetch the data set for you from GitHub, and check for updates.

The data will be stored in the `text-fabric-data` directory in your home directory.

# Features
The data of the corpus is organized in features.
They are *columns* of data.
Think of the corpus as a gigantic spreadsheet, where row 1 corresponds to the
first sign, row 2 to the second sign, and so on, for all ~ 1.5 M signs,
followed by ~ 500 K word nodes and yet another 200 K nodes of other types.

The information about which reading each sign has constitutes a column in that spreadsheet.
The DSS corpus contains > 50 columns.

Instead of putting that information in one big table, the data is organized in separate columns.
We call those columns **features**.
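
To make the spreadsheet picture concrete, here is a toy model in plain Python. It is only a sketch: the node numbers, the values and the helper `value` are invented for illustration, and Text-Fabric stores its features differently under the hood.

```python
# Toy model: each feature is its own column, a mapping from node number to value.
# All node numbers and values below are invented for illustration.
glyph = {1: "ו", 2: "ע", 3: "ת"}              # a column that has values for sign nodes
sign_type = {1: "consonant", 2: "consonant", 3: "consonant"}
sp = {101: "ptcl", 102: "subs"}               # a column that has values for word nodes


def value(feature, node):
    """Look up one cell: the value of a feature (column) for a node (row)."""
    return feature.get(node)  # None when the feature says nothing about this node


print(value(glyph, 2))
print(value(sp, 2))
```

Because every column lives on its own, a feature can be loaded, replaced or added without touching the others.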

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys, os, collections

# Incantation

The simplest way to get going is by this *incantation*:

In [152]:
from tf.app import use

In [191]:
A = use('dss:clone', checkout='clone', hoist=globals())

Using TF-app in /Users/dirk/github/annotation/app-dss/code:
	repo clone offline under ~/github (local github)
Using data in /Users/dirk/github/etcbc/dss/tf/0.3:
	repo clone offline under ~/github (local github)


You can see which features have been loaded, and if you click on a feature name, you find its documentation.
If you hover over a name, you see where the feature is located on your system.

## API

The result of the incantation is that we have a bunch of special variables at our disposal
that give us access to the text and data of the corpus.

At this point it is helpful to throw a quick glance at the text-fabric API documentation
(see the links under **API Members** above).

The most essential thing for now is that we can use `F` to access the data in the features
we've loaded.
But there is more, such as `N`, which helps us to walk over the text, as we see in a minute.

The **API members** above show you exactly which new names have been inserted in your namespace.
If you click on these names, you go to the API documentation for them.

## Search
Text-Fabric contains a flexible search engine that works not only for the data
of this corpus, but also for other corpora and for data that you add to corpora.

**Search is the quickest way to come up to speed with your data, without too much programming.**

Jump to the dedicated [search](search.ipynb) tutorial first, to whet your appetite.

The real power of search lies in the fact that it is integrated in a programming environment.
You can use programming to:

* compose dynamic queries
* process query results

Therefore, the rest of this tutorial is still important when you want to tap that power.
If you continue here, you learn all the basics of data-navigation with Text-Fabric.
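
As a taste of composing dynamic queries, the sketch below builds search templates as plain strings. It assumes the template conventions covered in the search tutorial (a node type followed by `feature=value` constraints, with indentation for embedding); the helpers `word_template` and `embed` are hypothetical names, and the sketch only builds the strings, it does not run a search.

```python
def word_template(**constraints):
    """Build a one-line template for a word with the given feature constraints."""
    parts = [f"{name}={value}" for (name, value) in sorted(constraints.items())]
    return " ".join(["word"] + parts)


def embed(container, template):
    """Nest a template one level inside a container type, using indentation."""
    indented = "\n".join("  " + line for line in template.split("\n"))
    return f"{container}\n{indented}"


# One template per part of speech, generated in a loop.
queries = {sp: embed("line", word_template(sp=sp)) for sp in ("verb", "subs")}
print(queries["verb"])
```

Each resulting string can then be handed to the search function that the search tutorial introduces.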

# Counting

In order to get acquainted with the data, we start with the simple task of counting.

## Count all nodes
We use the 
[`N()` generator](https://annotation.github.io/text-fabric/Api/General/#navigating-nodes)
to walk through the nodes.

We compared the TF data to a gigantic spreadsheet, where the rows correspond to the signs.
In Text-Fabric, we call the rows `slots`, because they are the textual positions that can be filled with signs.

We also mentioned that there are other textual objects.
They are the words, clusters, lines, fragments, scrolls and lexemes.
They also correspond to rows in the big spreadsheet.

In Text-Fabric we call all these rows *nodes*, and the `N()` generator
carries us through those nodes in the textual order.

Just one extra thing: the `info` statements generate timed messages.
If you use them instead of `print` you'll get a sense of the amount of time that 
the various processing steps typically need.

In [192]:
indent(reset=True)
info('Counting nodes ...')

i = 0
for n in N(): i += 1

info('{} nodes'.format(i))

  0.00s Counting nodes ...
  0.26s 2107856 nodes


Here you see it: over 2M nodes.

## What are those nodes?
Every node has a type, like sign, line, or fragment.
But what exactly are they?

Text-Fabric has two special features, `otype` and `oslots`, that must occur in every Text-Fabric data set.
`otype` tells you for each node its type, and you can ask for the number of `slot`s in the text.

Here we go!

In [193]:
F.otype.slotType

'sign'

In [194]:
F.otype.maxSlot

1430238

In [195]:
F.otype.maxNode

2107856

In [196]:
F.otype.all

('scroll', 'lex', 'fragment', 'line', 'cluster', 'word', 'sign')

In [197]:
C.levels.data

(('scroll', 1428.8091908091908, 1605864, 1606864),
 ('lex', 129.15082783041439, 1542520, 1552968),
 ('fragment', 127.90538365229834, 1531338, 1542519),
 ('line', 27.039190849796768, 1552969, 1605863),
 ('cluster', 6.678552705763657, 1430239, 1531337),
 ('word', 2.814364301226367, 1606865, 2107856),
 ('sign', 1, 1, 1430238))

This is interesting: above you see all the textual objects, with the average size of their objects,
the node where they start, and the node where they end.
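
Since every type occupies one contiguous range of node numbers in the output above, the first and last node are enough to derive the number of nodes per type and to look up the type of any node. The sketch below copies the tuples from that output; the helper names are made up, and inside Text-Fabric you would simply ask `F.otype.v(n)`.

```python
# The tuples shown above: (type, average size in slots, first node, last node).
LEVELS = (
    ("scroll", 1428.8091908091908, 1605864, 1606864),
    ("lex", 129.15082783041439, 1542520, 1552968),
    ("fragment", 127.90538365229834, 1531338, 1542519),
    ("line", 27.039190849796768, 1552969, 1605863),
    ("cluster", 6.678552705763657, 1430239, 1531337),
    ("word", 2.814364301226367, 1606865, 2107856),
    ("sign", 1, 1, 1430238),
)


def count_per_type(levels):
    """Number of nodes per type, derived from the first and last node of each range."""
    return {otype: last - first + 1 for (otype, _avg, first, last) in levels}


def type_of(node, levels):
    """Find the type whose node range contains the given node number."""
    for (otype, _avg, first, last) in levels:
        if first <= node <= last:
            return otype
    return None


counts = count_per_type(LEVELS)
print(counts["scroll"], counts["word"], sum(counts.values()))
print(type_of(3, LEVELS), type_of(1606865, LEVELS))
```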

## Count individual object types
This is an intuitive way to count the number of nodes in each type.
Note in passing, how we use the `indent` in conjunction with `info` to produce neat timed 
and indented progress messages.

In [198]:
indent(reset=True)
info('counting objects ...')

for otype in F.otype.all:
    i = 0

    indent(level=1, reset=True)

    for n in F.otype.s(otype): i+=1

    info('{:>7} {}s'.format(i, otype))

indent(level=0)
info('Done')

  0.00s counting objects ...
   |     0.00s    1001 scrolls
   |     0.00s   10449 lexs
   |     0.00s   11182 fragments
   |     0.01s   52895 lines
   |     0.01s  101099 clusters
   |     0.05s  500992 words
   |     0.12s 1430238 signs
  0.20s Done


# Viewing textual objects

You can use the `A` API (the extra power) to display the text of the scrolls.

See the [display](display.ipynb) tutorial.

# Feature statistics

`F`
gives access to all features.
Every feature has a method
`freqList()`
to generate a frequency list of its values, higher frequencies first.
Here are the parts of speech:

In [199]:
F.sp.freqList()

(('ptcl', 154464),
 ('subs', 108562),
 ('unknown', 80252),
 ('verb', 58873),
 ('suff', 45747),
 ('adjv', 10633),
 ('numr', 6526),
 ('pron', 5784))
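
If you wonder what such a frequency list amounts to, here is a sketch in plain Python on invented values. The function `freq_list` is a hypothetical stand-in, and the tie-breaking rule (alphabetical among equal frequencies) is an assumption of this sketch, not a statement about `freqList()` itself.

```python
from collections import Counter


def freq_list(values):
    """Return (value, count) pairs, higher counts first, ties in alphabetical order."""
    counts = Counter(values)
    return tuple(sorted(counts.items(), key=lambda item: (-item[1], item[0])))


# Invented sample of part-of-speech values, one per word.
sample = ["subs", "verb", "ptcl", "subs", "ptcl", "ptcl"]
print(freq_list(sample))
```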

Signs, words and clusters have types. We can count them separately:

In [200]:
F.type.freqList('cluster')

(('reconstruction', 93733),
 ('vacat', 3522),
 ('correction3', 1582),
 ('uncertain2', 1001),
 ('removed2', 706),
 ('alternative', 333),
 ('correction2', 147),
 ('removed', 75))

In [201]:
F.type.freqList('word')

(('glyph', 470602), ('punct', 29927), ('numeral', 463))

In [202]:
F.type.freqList('sign')

(('consonant', 1156780),
 ('empty', 98404),
 ('missing', 53864),
 ('sep', 46453),
 ('punct', 29927),
 ('uncertain', 27168),
 ('term', 15532),
 ('numeral', 2029),
 ('add', 65),
 ('foreign', 16))

Finally, the flags:

# Word matters

## Top 20 frequent words

We represent words by their essential symbols, collected in the feature *glyph* (which also exists for signs).

In [203]:
for (w, amount) in F.glyph.freqList('word')[0:20]:
  print(f'{amount:>5} {w}')

45393 ו
20491 ה
19378 ל
18225 ב
 6389 את
 5863 מ
 4894 אשׁר
 4789 יהוה
 4355 א
 4236 כול
 4185 על
 4172 אל
 3262 כי
 3091 כ
 3005 לא
 2841 כל
 2424 לוא
 1938 ארץ
 1829 ישׁראל
 1653 יום


## Word distribution

Let's do a bit more fancy word stuff.

### Hapaxes

A hapax can be found by picking the words with frequency 1.
We do have lexeme information in this corpus, so let's use it for determining hapaxes.

We print 20 hapaxes.

In [204]:
hapaxes1 = sorted(lx for (lx, amount) in F.lex.freqList('word') if amount == 1)
len(hapaxes1)

3812

In [205]:
for lx in hapaxes1[0:20]:
  print(lx)

 #  #  #  #  # 
 #  #  #  #  #  #  #  #  # 
 #  #  #  #  # ות
 #  #  #  #  # ל #  #  # 
 #  #  #  #  # ם
 #  #  #  # ב
 #  #  #  # ה
 #  #  #  # ו # 
 #  #  #  # ך
 #  #  #  # ל #  # 
 #  #  #  # תא
 #  #  # ד
 #  #  # דב
 #  #  # דה
 #  #  # ה #  # 
 #  #  # הו
 #  #  # הם
 #  #  # ות
 #  #  # ט
 #  #  # כת


Another way to find lexemes with only one occurrence is to use the `occ` edge feature from lexeme nodes to the word nodes of
their occurrences.

In [206]:
hapaxes2 = sorted(F.lex.v(lx) for lx in F.otype.s('lex') if len(E.occ.f(lx)) == 1)
len(hapaxes2)

3812

In [207]:
for lx in hapaxes2[0:20]:
  print(lx)

 #  #  #  #  # 
 #  #  #  #  #  #  #  #  # 
 #  #  #  #  # ות
 #  #  #  #  # ל #  #  # 
 #  #  #  #  # ם
 #  #  #  # ב
 #  #  #  # ה
 #  #  #  # ו # 
 #  #  #  # ך
 #  #  #  # ל #  # 
 #  #  #  # תא
 #  #  # ד
 #  #  # דב
 #  #  # דה
 #  #  # ה #  # 
 #  #  # הו
 #  #  # הם
 #  #  # ות
 #  #  # ט
 #  #  # כת
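
The two routes, counting feature values and following occurrence edges, can be compared on a toy example. Everything below is invented data in plain Python; the dictionary `occurrences` merely plays the role that the `occ` edge feature plays in the corpus.

```python
from collections import Counter

# Toy data (invented): word node -> lexeme.
lex_of_word = {10: "a", 11: "b", 12: "a", 13: "c"}

# Route 1: count the lexeme values and keep those with frequency 1.
counts = Counter(lex_of_word.values())
hapaxes_by_count = sorted(lx for (lx, n) in counts.items() if n == 1)

# Route 2: go from each lexeme to its occurrences and keep those with one occurrence.
occurrences = {}
for (word, lx) in lex_of_word.items():
    occurrences.setdefault(lx, []).append(word)
hapaxes_by_edges = sorted(lx for (lx, words) in occurrences.items() if len(words) == 1)

print(hapaxes_by_count, hapaxes_by_edges)
```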


The feature `lex` contains lexemes that may have uncertain characters in them.

The feature `glex` has all those characters stripped. 
Let's use `glex` instead.

In [208]:
hapaxes1g = sorted(lx for (lx, amount) in F.glex.freqList('word') if amount == 1)
len(hapaxes1)

3812

In [209]:
for lx in hapaxes1g[0:20]:
  print(lx)

100
115
126
150
32
350
536
54
61
65
66
67
71
83
92
99
 ידה
 לוט
 נַחַל
 שֵׂעָר


### Small occurrence base

The occurrence base of a word is the set of scrolls in which it occurs.

We compute the occurrence base of each word, based on lexemes according to the `glex` feature.

In [210]:
occurrenceBase1 = collections.defaultdict(set)

indent(reset=True)
info('compiling occurrence base ...')
for w in F.otype.s('word'):
  scroll = T.sectionFromNode(w)[0]
  occurrenceBase1[F.glex.v(w)].add(scroll)
info(f'{len(occurrenceBase1)} entries')

  0.00s compiling occurrence base ...
  8.36s 8263 entries


Wow, that took long!

But there is another way:

In [211]:
occurrenceBase2 = collections.defaultdict(set)

indent(reset=True)
info('compiling occurrence base ...')
for s in F.otype.s('scroll'):
  scroll = F.scroll.v(s)
  for w in L.d(s, otype='word'):
    occurrenceBase2[F.glex.v(w)].add(scroll)
info('done')
info(f'{len(occurrenceBase2)} entries')

  0.00s compiling occurrence base ...
  0.57s done
  0.57s 8263 entries


Much better. Are the results equal?

In [212]:
occurrenceBase1 == occurrenceBase2

True

Yes.

In [213]:
occurrenceBase = occurrenceBase2

An overview of how many words have how big occurrence bases:

In [214]:
occurrenceSize = collections.Counter()

for (w, scrolls) in occurrenceBase.items():
  occurrenceSize[len(scrolls)] += 1
  
occurrenceSize = sorted(
  occurrenceSize.items(),
  key=lambda x: (-x[1], x[0]),
)

for (size, amount) in occurrenceSize[0:10]:
  print(f'base size {size:>4} : {amount:>5} words')
print('...')
for (size, amount) in occurrenceSize[-10:]:
  print(f'base size {size:>4} : {amount:>5} words')

base size    1 :  2762 words
base size    2 :  1123 words
base size    3 :   698 words
base size    4 :   460 words
base size    5 :   336 words
base size    6 :   252 words
base size    7 :   225 words
base size    8 :   182 words
base size    9 :   175 words
base size   10 :   124 words
...
base size  459 :     1 words
base size  480 :     1 words
base size  539 :     1 words
base size  600 :     1 words
base size  605 :     1 words
base size  633 :     1 words
base size  745 :     1 words
base size  761 :     1 words
base size  846 :     1 words
base size  987 :     1 words


Let's give the predicate *private* to those words whose occurrence base is a single scroll.

In [215]:
privates = {w for (w, base) in occurrenceBase.items() if len(base) == 1}
len(privates)

2762
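
Here is the same idea on a corpus small enough to check by hand. The scroll names and lexemes are invented; the variable names mirror the ones used above only for readability.

```python
from collections import defaultdict

# Toy corpus (invented): scroll name -> lexemes of its words, in text order.
corpus = {
    "S1": ["a", "b", "a", "c"],
    "S2": ["a", "d"],
    "S3": ["a", "b"],
}

# For every lexeme, collect the scrolls in which it occurs.
occurrence_base = defaultdict(set)
for (scroll, lexemes) in corpus.items():
    for lx in lexemes:
        occurrence_base[lx].add(scroll)

# A lexeme is private when its occurrence base is a single scroll.
privates = {lx for (lx, base) in occurrence_base.items() if len(base) == 1}
print(sorted(privates))
```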

### Peculiarity of scrolls

As a final exercise with scrolls, let's make a list of all scrolls, and show their

* total number of distinct words
* number of private words
* the percentage of private words: a measure of the peculiarity of the scroll

In [216]:
scrollList = []

empty = set()
ordinary = set()

for d in F.otype.s('scroll'):
  scroll = T.scrollName(d)
  words = {F.glex.v(w) for w in L.d(d, otype='word')}
  a = len(words)
  if not a:
    empty.add(scroll)
    continue
  o = len({w for w in words if w in privates})
  if not o:
    ordinary.add(scroll)
    continue
  p = 100 * o / a
  scrollList.append((scroll, a, o, p))

scrollList = sorted(scrollList, key=lambda e: (-e[3], -e[1], e[0]))

print(f'Found {len(empty):>4} empty scrolls')
print(f'Found {len(ordinary):>4} ordinary scrolls (i.e. without private words)')

Found    0 empty scrolls
Found  517 ordinary scrolls (i.e. without private words)


In [217]:
print('{:<20}{:>5}{:>5}{:>5}\n{}'.format(
    'scroll', '#all', '#own', '%own',
    '-'*35,
))

for x in scrollList[0:20]:
  print('{:<20} {:>4} {:>4} {:>4.1f}%'.format(*x))
print('...')
for x in scrollList[-20:]:
  print('{:<20} {:>4} {:>4} {:>4.1f}%'.format(*x))

scroll               #all #own %own
-----------------------------------
4Q341                  32   20 62.5%
4Q340                  15    5 33.3%
11Q26                   7    2 28.6%
4Q124                  86   24 27.9%
1Q70bis                12    3 25.0%
4Q282d                  8    2 25.0%
4Q313a                  4    1 25.0%
4Q346a                  4    1 25.0%
1Q70                   25    6 24.0%
3Q15                  268   57 21.3%
4Q561                  74   15 20.3%
4Q559                 130   26 20.0%
4Q250b                  5    1 20.0%
4Q360a                 21    4 19.0%
KhQ1                   32    6 18.8%
4Q347                  11    2 18.2%
4Q575a                 12    2 16.7%
1Q58                    6    1 16.7%
4Q468bb                 6    1 16.7%
11Q10                 635  102 16.1%
...
4Q367                 173    1  0.6%
4Q2                   174    1  0.6%
4Q366                 186    1  0.5%
4Q98                  192    1  0.5%
4Q56                  963    5  0.5%
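
The percentage is taken over the *distinct* words of a scroll, because the words are collected in a set first. A small sketch on invented data shows the arithmetic and the sort order (highest percentage first, then larger scrolls, then name); the helper `peculiarity` is a made-up name.

```python
def peculiarity(lexemes, privates):
    """Return (distinct lexemes, private lexemes, percentage private) for one scroll."""
    distinct = set(lexemes)
    own = distinct & set(privates)
    percentage = 100 * len(own) / len(distinct) if distinct else 0.0
    return (len(distinct), len(own), percentage)


# Invented scrolls and private lexemes.
private_lexemes = {"c", "d"}
rows = [
    ("S1", *peculiarity(["a", "b", "a", "c"], private_lexemes)),
    ("S2", *peculiarity(["a", "d"], private_lexemes)),
]
rows.sort(key=lambda e: (-e[3], -e[1], e[0]))

for row in rows:
    print("{:<5} {:>4} {:>4} {:>5.1f}%".format(*row))
```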

# Locality API
We travel upwards and downwards, forwards and backwards through the nodes.
The Locality-API (`L`) provides functions: `u()` for going up, and `d()` for going down,
`n()` for going to next nodes and `p()` for going to previous nodes.

These directions are indirect notions: nodes are just numbers, but by means of the
`oslots` feature they are linked to slots. One node *contains* another node if the one is linked to a set of slots that contains the set of slots that the other is linked to.
And one node is next or previous to another if its slots follow or precede the slots of the other one.

`L.u(node)` **Up** is going to nodes that embed `node`.

`L.d(node)` **Down** is the opposite direction, to those that are contained in `node`.

`L.n(node)` **Next** are the next *adjacent* nodes, i.e. nodes whose first slot comes immediately after the last slot of `node`.

`L.p(node)` **Previous** are the previous *adjacent* nodes, i.e. nodes whose last slot comes immediately before the first slot of `node`.

All these functions yield nodes of all possible otypes.
By passing an optional parameter, you can restrict the results to nodes of that type.

The results are ordered according to the order of things in the text.

The functions always return a tuple, even if there is just one node in the result.
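
The four directions can be modelled with nothing more than sets of slots. The sketch below is a toy restatement of the definitions above on invented nodes; the function names are hypothetical, and the result order is simply the order of the dictionary, not the canonical order that `L` uses.

```python
# Toy model (invented numbers): node -> set of slots it is linked to.
slots_of = {
    1: {1}, 2: {2}, 3: {3}, 4: {4},   # slot nodes
    10: {1, 2},                       # a word
    11: {3, 4},                       # the next word
    20: {1, 2, 3, 4},                 # a line
}


def up(node):
    """Nodes whose slots contain all slots of `node`."""
    mine = slots_of[node]
    return tuple(n for (n, s) in slots_of.items() if n != node and mine <= s)


def down(node):
    """Nodes whose slots are all contained in the slots of `node`."""
    mine = slots_of[node]
    return tuple(n for (n, s) in slots_of.items() if n != node and s <= mine)


def next_nodes(node):
    """Nodes whose first slot comes immediately after the last slot of `node`."""
    after = max(slots_of[node]) + 1
    return tuple(n for (n, s) in slots_of.items() if min(s) == after)


def prev_nodes(node):
    """Nodes whose last slot comes immediately before the first slot of `node`."""
    before = min(slots_of[node]) - 1
    return tuple(n for (n, s) in slots_of.items() if max(s) == before)


print(up(1), down(20), next_nodes(10), prev_nodes(11))
```

Note that every function returns a tuple, also when there is one result or none.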

## Going up
We go from the first sign to the scroll that contains it.
Note the `[0]` at the end. You expect one scroll, yet `L` returns a tuple. 
To get the only element of that tuple, you need to do that `[0]`.

If you are like me, you keep forgetting it, and that will lead to weird error messages later on.

In [218]:
firstScroll = L.u(1, otype='scroll')[0]
print(firstScroll)

1605864


And let's see all the containing objects of sign 3:

In [219]:
s = 3
for otype in F.otype.all:
  if otype == F.otype.slotType: continue
  up = L.u(s, otype=otype)
  upNode = 'x' if len(up) == 0 else up[0]
  print('sign {} is contained in {} {}'.format(s, otype, upNode))

sign 3 is contained in scroll 1605864
sign 3 is contained in lex 1542521
sign 3 is contained in fragment 1531338
sign 3 is contained in line 1552969
sign 3 is contained in cluster x
sign 3 is contained in word 1606866


## Going next
Let's go to the next nodes of the first scroll.

In [220]:
afterFirstScroll = L.n(firstScroll)
for n in afterFirstScroll:
  print('{:>7}: {:<13} first slot={:<6}, last slot={:<6}'.format(
      n, F.otype.v(n),
      E.oslots.s(n)[0],
      E.oslots.s(n)[-1],
  ))
secondScroll = L.n(firstScroll, otype='scroll')[0]

  17149: sign          first slot=17149 , last slot=17149 
1612978: word          first slot=17149 , last slot=17149 
1553383: line          first slot=17149 , last slot=17176 
1531356: fragment      first slot=17149 , last slot=18207 
1605865: scroll        first slot=17149 , last slot=33885 


## Going previous

And let's see what is right before the second scroll.

In [221]:
for n in L.p(secondScroll):
  print('{:>7}: {:<13} first slot={:<6}, last slot={:<6}'.format(
      n, F.otype.v(n),
      E.oslots.s(n)[0],
      E.oslots.s(n)[-1],
  ))

1605864: scroll        first slot=1     , last slot=17148 
1531355: fragment      first slot=15658 , last slot=17148 
1553382: line          first slot=17099 , last slot=17148 
1612977: word          first slot=17147 , last slot=17148 
  17148: sign          first slot=17148 , last slot=17148 


## Going down

We go to the fragments of the first scroll, and just count them.

In [222]:
fragments = L.d(firstScroll, otype='fragment')
print(len(fragments))

18


## The first line
We pick two nodes and explore what is above and below them:
the first line and the first word.

In [223]:
for n in [
    F.otype.s('word')[0],
    F.otype.s('line')[0],
]:
  indent(level=0)
  info('Node {}'.format(n), tm=False)
  indent(level=1)
  info('UP', tm=False)
  indent(level=2)
  info('\n'.join(['{:<15} {}'.format(u, F.otype.v(u)) for u in L.u(n)]), tm=False)
  indent(level=1)
  info('DOWN', tm=False)
  indent(level=2)
  info('\n'.join(['{:<15} {}'.format(u, F.otype.v(u)) for u in L.d(n)]), tm=False)
indent(level=0)
info('Done', tm=False)

Node 1606865
   |   UP
   |      |   1542520         lex
   |      |   1552969         line
   |      |   1531338         fragment
   |      |   1605864         scroll
   |   DOWN
   |      |   2               sign
Node 1552969
   |   UP
   |      |   1531338         fragment
   |      |   1605864         scroll
   |   DOWN
   |      |   1430239         cluster
   |      |   1               sign
   |      |   1606865         word
   |      |   2               sign
   |      |   1606866         word
   |      |   3               sign
   |      |   4               sign
   |      |   5               sign
   |      |   1606867         word
   |      |   6               sign
   |      |   7               sign
   |      |   8               sign
   |      |   9               sign
   |      |   1606868         word
   |      |   10              sign
   |      |   11              sign
   |      |   1606869         word
   |      |   12              sign
   |      |   13              sign
   |  

# Text API

So far, we have mainly seen nodes and their numbers, and the names of node types.
You would almost forget that we are dealing with text.
So let's try to see some text.

In the same way as `F` gives access to feature data,
`T` gives access to the text.
That is also feature data, but you can tell Text-Fabric which features are specifically
carrying the text, and in return Text-Fabric offers you
a Text API: `T`.

## Formats
DSS text can be represented in a number of ways:

* unicode
* ETCBC transcription
* original Abegg code

All three can be represented in two flavours:

* essential: only glyphs, no bracketings and flags
* extra: everything

If you wonder where the information about text formats is stored: 
not in the program text-fabric, but in the data set.
It has a feature `otext`, which specifies the formats and which features
must be used to produce them. `otext` is the third special feature in a TF data set,
next to `otype` and `oslots`. 
It is an optional feature. 
If it is absent, there will be no `T` API.
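
To get a feeling for what it means that a format specifies which features must be used, here is a toy renderer. It is a sketch under assumptions: the two templates, the feature names `translit` and `after`, and all values are invented, and the real definitions live in the `otext` feature of the data set.

```python
# Hypothetical format definitions: each format is a template over feature names.
FORMATS = {
    "text-orig-full": "{glyph}{after}",
    "text-trans-full": "{translit}{after}",
}

# Invented feature columns: node -> value (ASCII stand-ins for real glyphs).
FEATURES = {
    "glyph": {1: "ab", 2: "c"},
    "translit": {1: "AB", 2: "C"},
    "after": {1: " "},
}


def render(nodes, fmt):
    """Fill in the template of a format for each node and glue the pieces together."""
    template = FORMATS[fmt]
    pieces = []
    for n in nodes:
        # A feature without a value for this node contributes the empty string.
        values = {name: column.get(n, "") for (name, column) in FEATURES.items()}
        pieces.append(template.format(**values))
    return "".join(pieces)


print(render([1, 2], "text-orig-full"))
print(render([1, 2], "text-trans-full"))
```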

Here is a list of all available formats in this data set.

In [224]:
sorted(T.formats)

['layout-orig-extra',
 'layout-orig-full',
 'layout-trans-extra',
 'layout-trans-full',
 'lex-orig-full',
 'lex-source-full',
 'lex-trans-full',
 'morph-source-full',
 'text-orig-extra',
 'text-orig-full',
 'text-source-extra',
 'text-source-full',
 'text-trans-extra',
 'text-trans-full']

## Using the formats

The `T.text()` function is central to get text representations of nodes. Its most basic usage is

```python
T.text(nodes, fmt=fmt)
```
where `nodes` is a list or iterable of nodes, usually word nodes, and `fmt` is the name of a format.
If you leave out `fmt`, the default `text-orig-full` is chosen.

The result is the text in that format for all nodes specified:

In [225]:
firstWord = F.otype.s('word')[0]
T.text(range(firstWord, firstWord + 10), fmt='text-trans-full')

'W<TH CM<W KL JWD<J YDQ WBJNW BM<CJ '

There is also another usage of this function:

```python
T.text(node, fmt=fmt)
```

where `node` is a single node.
In this case, the default format is *ntype*`-orig-full` where *ntype* is the type of `node`.

If the format is defined in the corpus, it will be used. Otherwise, the word nodes contained in `node` will be looked up
and represented with the default format `text-orig-full`.

In this way we can sensibly represent a lot of different nodes, such as scrolls, fragments, lines, clusters, words and signs.

We compose a set of example nodes and run `T.text` on them:

In [226]:
exampleNodes = [
  F.otype.s('sign')[0],
  F.otype.s('word')[0],
  F.otype.s('cluster')[0],
  F.otype.s('line')[0],
  F.otype.s('fragment')[0],
  F.otype.s('scroll')[0],
]
exampleNodes

[1, 1606865, 1430239, 1552969, 1531338, 1605864]

In [227]:
for n in exampleNodes:
  print(f'This is {F.otype.v(n)} {n}:')
  print(T.text(n))
  print('')

This is sign 1:


This is word 1606865:
ו

This is cluster 1430239:


This is line 1552969:
ועתה שׁמעו כל יודעי צדק ובינו במעשׁי 

This is fragment 1531338:
ועתה שׁמעו כל יודעי צדק ובינו במעשׁי אל ׃ כי ריב ל׳ו עם כל בשׁר ומשׁפט יעשׁה בכל מנאצי׳ו ׃ כי במועל׳ם אשׁר עזבו׳הו הסתיר פני׳ו מישׁראל וממקדשׁ׳ו ויתנ׳ם לחרב ׃ ובזכר׳ו ברית ראשׁנים השׁאיר שׁאירית לישׁראל ולא נתנ׳ם לכלה ׃ ובקץ חרון שׁנים שׁלושׁ מאות ותשׁעים לתית׳ו אות׳ם ביד נבוכדנאצר מלך בבל פקד׳ם ׃ ויצמח מישׁראל ומאהרן שׁורשׁ מטעת לירושׁ את ארצ׳ו ולדשׁן בטוב אדמת׳ו ׃ ויבינו בעונ׳ם וידעו כי אנשׁים אשׁימים הם ׃ ויהיו כעורים וכימגשׁשׁים דרך שׁנים עשׁרים ׃ ויבן אל אל מעשׁי׳הם כי בלב שׁלם דרשׁו׳הו ויקם ל׳הם מורה צדק להדריכ׳ם בדרך לב׳ו ׃ ויודע לדורות אחרונים את אשׁר עשׁה בדור אחרון בעדת בוגדים הם סרי דרך ׃ היא העת אשׁר היה כתוב עלי׳ה כפרה סורירה כן סרר ישׁראל ׃ בעמוד אישׁ הלצון אשׁר הטיף לישׁראל מימי כזב ויתע׳ם בתוהו לא דרך להשׁח גבהות עולם ולסור מנתיבות צדק ולסיע גבול אשׁר גבלו ראשׁנים בנחלת׳ם למען הדבק ב׳הם את אלות ברית׳ו להסגיר׳ם לחרב 

## Using the formats
Now let's use those formats to print out the first line in this corpus.

Note that only the formats starting with `text-` are usable for this.

For the `layout-` formats, see [display](display.ipynb).

In [228]:
for fmt in sorted(T.formats):
  if fmt.startswith('text-'):
    print('{}:\n\t{}'.format(fmt, T.text(range(1,12), fmt=fmt)))

text-orig-extra:
	   
text-orig-full:
	ועתה שׁמעו כל 
text-source-extra:
	   
text-source-full:
	woth Cmow kl 
text-trans-extra:
	   
text-trans-full:
	W<TH CM<W KL 


If we do not specify a format, the **default** format is used (`text-orig-full`).

In [229]:
T.text(range(1,12))

'ועתה שׁמעו כל '

In [230]:
firstLine = F.otype.s('line')[0]
T.text(firstLine)

'ועתה שׁמעו כל יודעי צדק ובינו במעשׁי '

In [231]:
T.text(firstLine, fmt='text-orig-full')

''

The last one may be unexpected. Let's spell out the logic:

A single node and a format have been supplied, so `T.text()` applies the format to that node.
Only when the requested format does not exist does `T.text()` descend to word nodes.

But you can override that with the `descend` parameter:

In [232]:
T.text(firstLine, fmt='text-orig-full', descend=True)

'ועתה שׁמעו כל יודעי צדק ובינו במעשׁי '

The important things to remember are:

* you can supply a list of slot nodes and get them represented in all formats
* you can get non-slot nodes `n` in default format by `T.text(n)`
* you can get non-slot nodes `n` in other formats by `T.text(n, fmt=fmt, descend=True)`
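
The rules for a single node can be restated as a toy function. This is not how Text-Fabric implements `T.text()`; it is a sketch on invented data that only mimics the behaviour described above, in a toy world where just one format exists and only word nodes carry text.

```python
# Toy data (invented): only word nodes carry text; line 201 contains two words.
GLYPH = {101: "ab ", 102: "cd "}
WORDS_IN = {201: (101, 102)}
FORMATS = {"text-orig-full"}  # the only format defined in this toy model


def text(node, fmt=None, descend=None):
    """Toy restatement of the single-node rules described above."""
    if fmt is None:
        # No <type>-orig-full format exists in this toy, so fall back to the words.
        must_descend = True
    else:
        # An existing format is applied to the node itself; an unknown one descends.
        must_descend = fmt not in FORMATS
    if descend is not None:
        must_descend = descend  # an explicit choice overrides the rules above
    targets = WORDS_IN.get(node, (node,)) if must_descend else (node,)
    return "".join(GLYPH.get(n, "") for n in targets)


line = 201
default_text = text(line)
own_text = text(line, fmt="text-orig-full")
descended_text = text(line, fmt="text-orig-full", descend=True)
print(repr(default_text), repr(own_text), repr(descended_text))
```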

## Whole text in all formats in just seconds
Part of the pleasure of working with computers is that they can crunch massive amounts of data.
The text of the Dead Sea Scrolls is a piece of cake.

It takes just seconds to have that cake and eat it,
in every `text-` format.

In [38]:
indent(reset=True)
info('writing plain text of all letters in all text formats')

text = collections.defaultdict(list)

for l in F.otype.s('line'):
  for fmt in sorted(T.formats):
    if fmt.startswith('text-'):
      text[fmt].append(T.text(l, fmt=fmt, descend=True))

info('done {} formats'.format(len(text)))

for fmt in sorted(text):
    print('{}\n{}\n'.format(fmt, '\n'.join(text[fmt][0:5])))

  0.00s writing plain text of all letters in all text formats
  1.42s done 4 formats
text-orig-full
[a-na] _{d}suen_-i-[din-nam]
qi2-bi2-[ma]
um-ma _{d}en-lil2_-sza-du-u2-ni-ma
_{d}utu_ u3 _{d}[marduk]_ a-na da-ri-a-[tim]
li-ba-al-li-t,u2-u2-ka

text-orig-plain
a-na d⁼suen-i-din-nam
qi2-bi2-ma
um-ma d⁼en-lil2-sza-du-u2-ni-ma
d⁼utu u3 d⁼marduk a-na da-ri-a-tim
li-ba-al-li-t,u2-u2-ka

text-orig-rich
a-na d⁼suen-i-din-nam
qi₂-bi₂-ma
um-ma d⁼en-lil₂-ša-du-u₂-ni-ma
d⁼utu u₃ d⁼marduk a-na da-ri-a-tim
li-ba-al-li-ṭu₂-u₂-ka

text-orig-unicode
𒀀𒈾 𒀭𒂗𒍪𒄿𒁷𒉆
𒆠𒉈𒈠
𒌝𒈠 𒀭𒂗𒆤𒊭𒁺𒌑𒉌𒈠
𒀭𒌓 𒅇 𒀭𒀫𒌓 𒀀𒈾 𒁕𒊑𒀀𒁴
𒇷𒁀𒀠𒇷𒌅𒌑𒅗



### The full plain text
We write all formats to file, in your `Downloads` folder.

In [39]:
for fmt in T.formats:
  if fmt.startswith('text-'):
    with open(os.path.expanduser(f'~/Downloads/{fmt}.txt'), 'w') as f:
      f.write('\n'.join(text[fmt]))
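
Because the text contains Hebrew characters, it may be safer to state the encoding explicitly when writing, since the default encoding differs between systems. The sketch below is a variation on the cell above, not a replacement for it: the function `write_formats` is a made-up helper, the content is invented, and it writes to a temporary directory so that nothing ends up in your `Downloads` folder.

```python
import tempfile
from pathlib import Path


def write_formats(text, target_dir):
    """Write one UTF-8 file per format; return the paths that were written."""
    target = Path(target_dir).expanduser()
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for (fmt, lines) in sorted(text.items()):
        path = target / f"{fmt}.txt"
        path.write_text("\n".join(lines), encoding="utf-8")
        written.append(path)
    return written


# Usage with invented content, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    paths = write_formats({"text-orig-full": ["ועתה", "כל"]}, tmp)
    names = [p.name for p in paths]
    content = paths[0].read_text(encoding="utf-8")

print(names)
```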

## Sections

A section in the DSS corpus is a scroll, a fragment or a line.
Knowledge of sections is not baked into Text-Fabric. 
The config feature `otext.tf` may specify three section levels, and tell
what the corresponding node types and features are.

From that knowledge Text-Fabric can construct mappings from nodes to sections, e.g. from line
nodes to tuples of the form:

    (scroll, fragment, line number)
    
You can get the section of a node as a tuple of relevant scroll, fragment, and line nodes.
Or you can get it as a passage label, a string.

You can ask for the passage corresponding to the first slot of a node, or the one corresponding to the last slot.

If you are dealing with scroll and fragment nodes, you can ask to fill out the line and fragment parts as well.
   
Here are examples of getting the section that corresponds to a node and vice versa.

**NB:** `sectionFromNode` always delivers a line specification, either from the
first slot belonging to that node, or, if `lastSlot`, from the last slot
belonging to that node.

In [40]:
someNodes = (
  F.otype.s('sign')[100000],
  F.otype.s('word')[10000],
  F.otype.s('cluster')[5000],
  F.otype.s('line')[15000],
  F.otype.s('face')[1000],
  F.otype.s('document')[500],
)

In [41]:
for n in someNodes:
  nType = F.otype.v(n)
  d = f'{n:>7} {nType}'
  first = A.sectionStrFromNode(n)
  last = A.sectionStrFromNode(n, lastSlot=True, fillup=True)
  tup = (
      T.sectionTuple(n),
      T.sectionTuple(n, lastSlot=True, fillup=True),
  )
  print(f'{d:<16} - {first:<18} {last:<18} {tup}')

 100001 sign     - P313335 obverse:8  P313335 obverse:8  ((227310, 229370, 244327), (227310, 229370, 244327))
 268163 word     - P510665 obverse:9  P510665 obverse:9  ((226821, 228295, 234114), (226821, 228295, 234114))
 208220 cluster  - P510766 obverse:9  P510766 obverse:9  ((226925, 228516, 236231), (226925, 228516, 236231))
 245788 line     - P313410 obverse:12' P313410 obverse:12' ((227376, 229516, 245788), (227376, 229516, 245788))
 228954 face     - P292765 reverse    P292765 reverse:12 ((227126, 228954), (227126, 228954, 240157))
 227169 document - P382526            P382526 left:2     ((227169,), (227169, 229057, 241107))
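
A passage label is little more than a section tuple turned into a string. The sketch below follows the `first second:third` shape that is visible in the labels above; the function `section_label` and the example values are invented, so treat it as an illustration of the idea rather than as the code behind `A.sectionStrFromNode`.

```python
def section_label(section):
    """Format a section tuple of one, two or three levels as a passage label."""
    parts = [str(p) for p in section]
    if len(parts) <= 2:
        return " ".join(parts)
    return f"{parts[0]} {parts[1]}:{parts[2]}"


# Invented section values for the three levels.
print(section_label(("X1",)))
print(section_label(("X1", "f2")))
print(section_label(("X1", "f2", 3)))
```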


# Clean caches

Text-Fabric pre-computes data for you, so that it can be loaded faster.
If the original data is updated, Text-Fabric detects it, and will recompute that data.

But in some cases, for example when the algorithms of Text-Fabric have changed without any changes in the data,
you might want to clear the cache of precomputed results.

There are two ways to do that:

* Locate the `.tf` directory of your dataset, and remove all `.tfx` files in it.
  This might be a bit awkward to do, because the `.tf` directory is hidden on Unix-like systems.
* Call `TF.clearCache()`, which does exactly the same.

It is not handy to execute the following cell all the time, that's why I have commented it out.
So if you really want to clear the cache, remove the comment sign below.

In [42]:
# TF.clearCache()
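
The first, manual way can also be scripted. The sketch below is an assumption-laden helper, not part of Text-Fabric: `clear_tfx` is a made-up name, it searches the `.tf` directory recursively in case the cache files sit in subdirectories, and by default it only lists what it would remove. The demonstration runs on a temporary directory with an invented layout.

```python
import tempfile
from pathlib import Path


def clear_tfx(dataset_dir, dry_run=True):
    """List the `.tfx` files under the `.tf` directory of a dataset; delete them unless dry_run."""
    cache_dir = Path(dataset_dir) / ".tf"
    victims = sorted(cache_dir.rglob("*.tfx")) if cache_dir.is_dir() else []
    if not dry_run:
        for path in victims:
            path.unlink()
    return victims


# Demonstration on a temporary directory with an invented layout.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / ".tf" / "3"
    cache.mkdir(parents=True)
    (cache / "otype.tfx").write_bytes(b"")
    (Path(tmp) / "otype.tf").write_text("", encoding="utf-8")

    listed = [p.name for p in clear_tfx(tmp)]          # dry run: nothing is deleted
    still_there = (cache / "otype.tfx").exists()
    clear_tfx(tmp, dry_run=False)                      # now really delete
    removed = not (cache / "otype.tfx").exists()
    source_kept = (Path(tmp) / "otype.tf").exists()

print(listed, still_there, removed, source_kept)
```

The source `.tf` files are left alone; only the precomputed `.tfx` files go.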

# Next steps

By now you have an impression of how to compute around in the corpus.
While this is still the beginning, I hope you already sense the power of unlimited programmatic access
to all the bits and bytes in the data set.

Here are a few directions for unleashing that power.

* **[display](display.ipynb)** become an expert in creating pretty displays of your text structures
* **[search](search.ipynb)** turbo charge your hand-coding with search templates
* **[exportExcel](exportExcel.ipynb)** make tailor-made spreadsheets out of your results
* **[share](share.ipynb)** draw in other people's data and let them use yours
* **[similarLines](similarLines.ipynb)** spot the similarities between lines

---

See the [cookbook](cookbook) for recipes for small, concrete tasks.