<img align="right" src="tf-small.png"/>

# Tutorial

This notebook gets you started with Text-Fabric.
It exercises most functions of the API on the ETCBC4C data set (Hebrew Bible).

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys, collections
from tf.fabric import Fabric

# Load Text-Fabric
We use the default locations, so there is no need to pass `locations=...`.

In [3]:
TF = Fabric()

This is Text-Fabric 0.0.14
Api reference: https://github.com/dirkroorda/text-fabric/wiki/Api
Data sources: https://github.com/dirkroorda/text-fabric-data
Data feature docs: https://shebanq.ancient-data.org/static/docs/featuredoc/texts/welcome.html
Questions? Ask shebanq@ancient-data.org for an invite to Slack
95 features found and 0 ignored


# Load Features
Specify the features to load, and receive the API to work with that data.

In [4]:
api = TF.load('sp lex')

  0.00s loading features ...
  5.23s All features loaded/computed - for details use loadLog()


Let us see that `loadLog()`.

In [5]:
api.loadLog()

   |     0.07s B otype                from /Users/dirk/github/text-fabric-data/hebrew/etcbc4c
   |     0.55s B oslots               from /Users/dirk/github/text-fabric-data/hebrew/etcbc4c
   |     0.00s M otext                from /Users/dirk/github/text-fabric-data/hebrew/etcbc4c
   |     0.01s B book                 from /Users/dirk/github/text-fabric-data/hebrew/etcbc4c
   |     0.01s B chapter              from /Users/dirk/github/text-fabric-data/hebrew/etcbc4c
   |     0.01s B verse                from /Users/dirk/github/text-fabric-data/hebrew/etcbc4c
   |     0.14s B g_cons               from /Users/dirk/github/text-fabric-data/hebrew/etcbc4c
   |     0.24s B g_cons_utf8          from /Users/dirk/github/text-fabric-data/hebrew/etcbc4c
   |     0.42s B g_voc_lex            from /Users/dirk/github/text-fabric-data/hebrew/etcbc4c
   |     0.42s B g_voc_lex_utf8       from /Users/dirk/github/text-fabric-data/hebrew/etcbc4c
   |     0.17s B g_word               from /Users/dirk/githu

Make the members of the API directly accessible as global variables.

In [6]:
api.makeAvailableIn(globals())

Just check that `F` and `N` are defined and look like `api.F` and `api.N`.

In [7]:
print(F)
print(api.F)
print(N)
print(api.N)

<tf.api.NodeFeatures object at 0x169d0cf98>
<tf.api.NodeFeatures object at 0x169d0cf98>
<bound method Api.N of <tf.api.Api object at 0x10e6c7828>>
<bound method Api.N of <tf.api.Api object at 0x10e6c7828>>


# Getting acquainted

We start with simple tasks, to get a feel for the Text-Fabric API.

Count all nodes.

In [8]:
indent(reset=True)
info('Counting nodes ...')
i = 0
for n in N(): i += 1
info('{} nodes'.format(i))

  0.00s Counting nodes ...
  0.35s 1436894 nodes


Get some nodes, slot and non-slot, and sort them in the canonical order.

In [15]:
sortNodes(list(range(F.otype.maxSlot+1, F.otype.maxSlot+10))+list(range(10)))

[426581,
 0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 426582,
 426583,
 426584,
 426585,
 426586,
 426587,
 426588,
 426589]
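
In the result above, node 426581 precedes slot 0, which suggests that a node comes before the slots it embeds. Here is a minimal sketch of such an ordering on made-up data. It assumes every node covers a contiguous range of slots; it is not the Text-Fabric implementation, and `slot_range` and `canonical_key` are hypothetical names.

```python
# Toy model: each node is linked to a contiguous range of slots (first, last).
# The node numbers and ranges are made up for illustration.
slot_range = {
    0: (0, 0),      # slot nodes are linked to themselves
    1: (1, 1),
    2: (2, 2),
    100: (0, 2),    # a hypothetical node covering slots 0..2
    200: (0, 1),    # a hypothetical smaller node covering slots 0..1
}

def canonical_key(node):
    """Sort key: earlier start first; on equal start, the longer node first."""
    first, last = slot_range[node]
    return (first, -last)

ordered = sorted(slot_range, key=canonical_key)
print(ordered)
```

With this key, embedding nodes are listed before the nodes they embed, and nodes that start later always come later.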

Get information that is readily available in the GRID feature `otype`. 

In [13]:
info('{:<9} = {}\n{:<9} = {}\n{:<9} = {}'.format(
    'slotType', F.otype.slotType,
    'maxSlot', F.otype.maxSlot,
    'maxNode', F.otype.maxNode,
), tm=False)
info('All otypes:\n\t', nl=False, tm=False)
info('\n\t'.join(F.otype.all), tm=False)

slotType  = word
maxSlot   = 426580
maxNode   = 1436894
All otypes:
	book
	chapter
	verse
	half_verse
	sentence
	sentence_atom
	clause
	clause_atom
	phrase
	phrase_atom
	subphrase
	word


Count the individual object types.

In [16]:
indent(reset=True)
info('counting objects ...')
for otype in F.otype.all:
    i = 0
    indent(level=1, reset=True)
    for n in F.otype.s(otype): i+=1
    info('{:>7} {}s'.format(i, otype))
indent(level=0)
info('Done')

  0.00s counting objects ...
   |     0.00s      39 books
   |     0.00s     929 chapters
   |     0.00s   23213 verses
   |     0.01s   45180 half_verses
   |     0.01s   63570 sentences
   |     0.01s   64339 sentence_atoms
   |     0.03s   88000 clauses
   |     0.02s   90562 clause_atoms
   |     0.06s  253174 phrases
   |     0.05s  267515 phrase_atoms
   |     0.02s  113792 subphrases
   |     0.08s  426581 words
  0.33s Done


Collect the 10 most frequent verbs.

In [18]:
verbs = collections.Counter()
for w in F.otype.s('word'):
    if F.sp.v(w) != 'verb': continue
    verbs[F.lex.v(w)] +=1
print(''.join(
    '{}: {}\n'.format(verb, cnt) for (verb, cnt) in sorted(
        verbs.items(), key=lambda x: (-x[1], x[0]))[0:10]
))

>MR[: 5378
HJH[: 3561
<FH[: 2629
BW>[: 2570
NTN[: 2017
HLK[: 1554
R>H[: 1298
CM<[: 1168
DBR[: 1138
JCB[: 1082
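
The ranking step can be tried without loading any features. The sketch below uses a handful of made-up (lexeme, part of speech) pairs in place of word nodes and applies the same sort key as the cell above: descending count, then lexeme.

```python
import collections

# Made-up (lexeme, part of speech) pairs standing in for word nodes.
words = [
    ('>MR[', 'verb'), ('HJH[', 'verb'), ('>MR[', 'verb'),
    ('W', 'conj'), ('BW>[', 'verb'), ('HJH[', 'verb'), ('>MR[', 'verb'),
]

# Count only the verbs, keyed by lexeme.
verbs = collections.Counter(lex for (lex, sp) in words if sp == 'verb')

# Most frequent first; ties are broken alphabetically by lexeme.
ranked = sorted(verbs.items(), key=lambda x: (-x[1], x[0]))
for (lex, cnt) in ranked[0:2]:
    print('{}: {}'.format(lex, cnt))
```

`Counter.most_common(10)` would be shorter, but it returns equal counts in first-seen order rather than alphabetically, so the explicit sort key gives a more stable listing.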



We count the words of each part of speech, and we list the 10 most frequent lexemes.

**NB**: mind the pretty progress messages.

In [19]:
partOfSpeech = collections.Counter()
freqLex = collections.Counter()

indent(level=0, reset=True)
info('Starting tasks')
indent(level=1, reset=True)
info('Counting the words by part-of-speech ...')
for w in F.otype.s('word'):
    partOfSpeech[F.sp.v(w)] += 1
info('Done: {} categories'.format(len(partOfSpeech)))
indent(level=2)
info('\n'.join('{:<6}: {:>5}x'.format(*x) for x in sorted(
    partOfSpeech.items(),
    key=lambda x: (-x[1], x[0])
)), tm=False)
indent(level=1, reset=True)
info('Listing the top 10 frequent words ...')
for w in F.otype.s('word'):
    freqLex[F.lex.v(w)] += 1
info('Done: {} lexemes'.format(len(freqLex)))
indent(level=2)
info('\n'.join('{:<6}: {:>5}x'.format(*x) for x in sorted(
    freqLex.items(),
    key=lambda x: (-x[1], x[0])
)[0:10]), tm=False)
indent(level=0)
info('All tasks completed')

  0.00s Starting tasks
   |     0.00s Counting the words by part-of-speech ...
   |     0.44s Done: 14 categories
   |      |   subs  : 121481x
   |      |   verb  : 73710x
   |      |   prep  : 73273x
   |      |   conj  : 62722x
   |      |   nmpr  : 33082x
   |      |   art   : 30384x
   |      |   adjv  :  9464x
   |      |   nega  :  6053x
   |      |   prps  :  5011x
   |      |   advb  :  4550x
   |      |   prde  :  2660x
   |      |   intj  :  1885x
   |      |   inrg  :  1285x
   |      |   prin  :  1021x
   |     0.00s Listing the top 10 frequent words ...
   |     0.42s Done: 8776 lexemes
   |      |   W     : 51003x
   |      |   H     : 30390x
   |      |   L     : 20447x
   |      |   B     : 15768x
   |      |   >T    : 11002x
   |      |   MN    :  7681x
   |      |   JHWH/ :  6828x
   |      |   <L    :  5870x
   |      |   >L    :  5521x
   |      |   >CR   :  5500x
  0.92s All tasks completed


# Layer API
We travel upwards and downwards through the node hierarchy.
The Layer-API (`L`) provides two functions: `u()` for going up, and `d()` for going down.

Upwards is going to nodes that embed the node you started from.
Downwards is the opposite direction, to those that are contained in the start node.

Embedding and containment are indirect notions: nodes are just numbers, but the
`oslots` feature links them to slots. One node *contains* another node if the set of slots linked to the first includes the set of slots linked to the second.

Both functions yield nodes of all possible otypes. By passing the optional `otype` parameter, you can restrict the results to nodes of a single type.

The result is ordered in the canonical node ordering.
Both functions always return a tuple, even if there is just one node in the result.
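
To make the slot-set idea concrete, here is a toy model of containment and of going up and down. It is a sketch on made-up data, not the Text-Fabric implementation: `nodes`, `contains`, `up` and `down` are hypothetical names, and the results are ordered by node number rather than canonically.

```python
# Toy data: node -> (otype, set of slots). All numbers are made up,
# and no two nodes share exactly the same slot set.
nodes = {
    0: ('word', {0}),
    1: ('word', {1}),
    2: ('word', {2}),
    3: ('word', {3}),
    10: ('phrase', {0, 1}),
    11: ('phrase', {2, 3}),
    20: ('verse', {0, 1, 2, 3}),
}

def contains(a, b):
    """True if the slots of a include all slots of b (and a is not b)."""
    return a != b and nodes[a][1] >= nodes[b][1]

def up(n, otype=None):
    """Nodes that embed n, optionally of one otype; always a tuple."""
    return tuple(
        m for m in sorted(nodes)
        if contains(m, n) and (otype is None or nodes[m][0] == otype)
    )

def down(n, otype=None):
    """Nodes embedded in n, optionally of one otype; always a tuple."""
    return tuple(
        m for m in sorted(nodes)
        if contains(n, m) and (otype is None or nodes[m][0] == otype)
    )

print(up(0))                  # everything that embeds word 0
print(up(0, otype='verse'))   # a one-element tuple
print(down(20, otype='phrase'))
```

Even a single result comes back as a tuple, which is why the cells below index with `[0]` to get at the node itself.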

## Going up
We go from the first word to the book that contains it.

In [30]:
L.u(0, otype='book')

(1367533,)

## Going down

We go to the chapters of that book, and just count them.

In [32]:
chapters = L.d(L.u(0, otype='book')[0], otype='chapter')
print(len(chapters))

50


We pick the first verse and the first word, and explore what is above and below them.

In [34]:
for n in [0, L.u(0, otype='verse')[0]]:
    indent(level=0)
    info('Node {}'.format(n), tm=False)
    indent(level=1)
    info('UP', tm=False)
    indent(level=2)
    info('\n'.join(['{:<15} {}'.format(u, F.otype.v(u)) for u in L.u(n)]), tm=False)
    indent(level=1)
    info('DOWN', tm=False)
    indent(level=2)
    info('\n'.join(['{:<15} {}'.format(u, F.otype.v(u)) for u in L.d(n)]), tm=False)
indent(level=0)
info('Done')

Node 0
   |   UP
   |      |   605143          phrase
   |      |   858317          phrase_atom
   |      |   1368501         half_verse
   |      |   514581          clause_atom
   |      |   426581          clause
   |      |   1413681         verse
   |      |   1189402         sentence_atom
   |      |   1125832         sentence
   |      |   1367572         chapter
   |      |   1367533         book
   |   DOWN
   |      |   
Node 1413681
   |   UP
   |      |   1189402         sentence_atom
   |      |   1125832         sentence
   |      |   1367572         chapter
   |      |   1367533         book
   |   DOWN
   |      |   426581          clause
   |      |   514581          clause_atom
   |      |   1368501         half_verse
   |      |   858317          phrase_atom
   |      |   605143          phrase
   |      |   0               word
   |      |   1               word
   |      |   858318          phrase_atom
   |      |   2               word
   |      |   605144        

# Text API

We examine the functions of the Text API: `T`.

First, the formats that are available to represent the actual text.
These formats have been defined in the `otext` feature.
This is an optional GRID config feature: it has only metadata.

In [20]:
sorted(T.formats)

['lex-orig-full',
 'lex-orig-plain',
 'lex-trans-full',
 'lex-trans-plain',
 'text-orig-full',
 'text-orig-full-ketiv',
 'text-orig-plain',
 'text-trans-full',
 'text-trans-full-ketiv',
 'text-trans-plain']

Now let's use those formats to print out the first verse of the Hebrew Bible.

In [21]:
for fmt in sorted(T.formats):
    print('{}:\n\t{}'.format(fmt, T.text(range(11), fmt=fmt)))

lex-orig-full:
	בְּ רֵאשִׁית ברא אֱלֹהִים אֵת הַ שָׁמַיִם וְ אֵת הַ אֶרֶץ 
lex-orig-plain:
	ב ראשׁית֜ ברא אלהים֜ את ה שׁמים֜ ו את ה ארץ֜ 
lex-trans-full:
	B.: R;>CIJT BR> >:ELOHIJM >;T HA C@MAJIM W: >;T HA >EREY 
lex-trans-plain:
	B R>CJT/ BR>[ >LHJM/ >T H CMJM/ W >T H >RY/ 
text-orig-full:
	בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ 
text-orig-full-ketiv:
	בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ 
text-orig-plain:
	בראשׁית ברא אלהים את השׁמים ואת הארץ׃ 
text-trans-full:
	B.:-R;>CI73JT B.@R@74> >:ELOHI92JM >;71T HA-C.@MA73JIM W:->;71T H@->@75REY00 
text-trans-full-ketiv:
	B.:-R;>CI73JT B.@R@74> >:ELOHI92JM >;71T HA-C.@MA73JIM W:->;71T H@->@75REY00 
text-trans-plain:
	BR>CJT BR> >LHJM >T HCMJM W>T H>RY00 


If we do not specify a format, the default format is used (`text-orig-full`).

In [22]:
print(T.text(range(11)))

בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ 


## Whole text in all formats in just 10 seconds
We are going to produce the complete text of the Hebrew Bible in all available formats.

In [23]:
text = collections.defaultdict(list)
indent(reset=True)
info('writing plain text of whole Bible in all formats')
for v in F.otype.s('verse'):
    words = L.d(v, 'word')
    for fmt in sorted(T.formats):
        text[fmt].append(T.text(words, fmt=fmt))
info('done {} formats'.format(len(text)))
for fmt in sorted(text):
    print('{}\n{}\n'.format(fmt, '\n'.join(text[fmt][0:5])))

  0.00s writing plain text of whole Bible in all formats
  8.42s done 10 formats
lex-orig-full
בְּ רֵאשִׁית ברא אֱלֹהִים אֵת הַ שָׁמַיִם וְ אֵת הַ אֶרֶץ 
וְ הַ אֶרֶץ היה תֹּהוּ וְ בֹּהוּ וְ חֹשֶׁךְ עַל פָּנֶה תְּהֹום וְ רוּחַ אֱלֹהִים רחף עַל פָּנֶה הַ מַיִם 
וְ אמר אֱלֹהִים היה אֹור וְ היה אֹור 
וְ ראה אֱלֹהִים אֵת הַ אֹור כִּי טוב וְ בדל אֱלֹהִים בַּיִן הַ אֹור וְ בַּיִן הַ חֹשֶׁךְ 
וְ קרא אֱלֹהִים לְ הַ אֹור יֹום וְ לְ הַ חֹשֶׁךְ קרא לַיְלָה וְ היה עֶרֶב וְ היה בֹּקֶר יֹום אֶחָד 

lex-orig-plain
ב ראשׁית֜ ברא אלהים֜ את ה שׁמים֜ ו את ה ארץ֜ 
ו ה ארץ֜ היה תהו֜ ו בהו֜ ו חשׁך֜ על פנה֜ תהום֜ ו רוח֜ אלהים֜ רחף על פנה֜ ה מים֜ 
ו אמר אלהים֜ היה אור֜ ו היה אור֜ 
ו ראה אלהים֜ את ה אור֜ כי טוב ו בדל אלהים֜ בין֜ ה אור֜ ו בין֜ ה חשׁך֜ 
ו קרא אלהים֜ ל ה אור֜ יום֜ ו ל ה חשׁך֜ קרא לילה֜ ו היה ערב֜ ו היה בקר֜ יום֜ אחד֜ 

lex-trans-full
B.: R;>CIJT BR> >:ELOHIJM >;T HA C@MAJIM W: >;T HA >EREY 
W: HA >EREY HJH T.OHW. W: B.OHW. W: XOCEK: <AL P.@NEH T.:HOWM W: RW.XA >:ELOHIJM RXP <AL P.@NEH HA MAJIM 
W:

## Sections and book names
Here are the languages that we can use for book names.

In [24]:
T.languages

{'am': {'language': 'ኣማርኛ', 'languageEnglish': 'amharic'},
 'ar': {'language': 'العَرَبِية', 'languageEnglish': 'arabic'},
 'bn': {'language': 'বাংলা', 'languageEnglish': 'bengali'},
 'da': {'language': 'Dansk', 'languageEnglish': 'danish'},
 'de': {'language': 'Deutsch', 'languageEnglish': 'german'},
 'el': {'language': 'Ελληνικά', 'languageEnglish': 'greek'},
 'en': {'language': 'English', 'languageEnglish': 'english'},
 'es': {'language': 'Español', 'languageEnglish': 'spanish'},
 'fa': {'language': 'فارسی', 'languageEnglish': 'farsi'},
 'fr': {'language': 'Français', 'languageEnglish': 'french'},
 'he': {'language': 'עברית', 'languageEnglish': 'hebrew'},
 'hi': {'language': 'हिन्दी', 'languageEnglish': 'hindi'},
 'id': {'language': 'Bahasa Indonesia', 'languageEnglish': 'indonesian'},
 'ja': {'language': '日本語', 'languageEnglish': 'japanese'},
 'ko': {'language': '한국어', 'languageEnglish': 'korean'},
 'la': {'language': 'Latina', 'languageEnglish': 'latin'},
 'nl': {'language': 'Nede

Get the book names in Swahili.

In [25]:
nodeToSwahili = ''
for b in F.otype.s('book'):
    nodeToSwahili += '{} = {}\n'.format(b, T.bookName(b, lang='sw'))
print(nodeToSwahili)

1367533 = Mwanzo
1367534 = Kutoka
1367535 = Mambo_ya_Walawi
1367536 = Hesabu
1367537 = Kumbukumbu_la_Torati
1367538 = Yoshua
1367539 = Waamuzi
1367540 = 1_Samweli
1367541 = 2_Samweli
1367542 = 1_Wafalme
1367543 = 2_Wafalme
1367544 = Isaya
1367545 = Yeremia
1367546 = Ezekieli
1367547 = Hosea
1367548 = Yoeli
1367549 = Amosi
1367550 = Obadia
1367551 = Yona
1367552 = Mika
1367553 = Nahumu
1367554 = Habakuki
1367555 = Sefania
1367556 = Hagai
1367557 = Zekaria
1367558 = Malaki
1367559 = Zaburi
1367560 = Ayubu
1367561 = Mithali
1367562 = Ruthi
1367563 = Wimbo_Ulio_Bora
1367564 = Mhubiri
1367565 = Maombolezo
1367566 = Esta
1367567 = Danieli
1367568 = Ezra
1367569 = Nehemia
1367570 = 1_Mambo_ya_Nyakati
1367571 = 2_Mambo_ya_Nyakati



OK, there they are. We copy them into a string, and do the opposite: get the nodes back.

In [26]:
swahiliNames = '''
Mwanzo
Kutoka
Mambo_ya_Walawi
Hesabu
Kumbukumbu_la_Torati
Yoshua
Waamuzi
1_Samweli
2_Samweli
1_Wafalme
2_Wafalme
Isaya
Yeremia
Ezekieli
Hosea
Yoeli
Amosi
Obadia
Yona
Mika
Nahumu
Habakuki
Sefania
Hagai
Zekaria
Malaki
Zaburi
Ayubu
Mithali
Ruthi
Wimbo_Ulio_Bora
Mhubiri
Maombolezo
Esta
Danieli
Ezra
Nehemia
1_Mambo_ya_Nyakati
2_Mambo_ya_Nyakati
'''.strip().split()

swahiliToNode = ''
for nm in swahiliNames:
    swahiliToNode += '{} = {}\n'.format(T.bookNode(nm, lang='sw'), nm)
    
if swahiliToNode != nodeToSwahili:
    print('Something is not right with the book names')
else:
    print('Going from nodes to booknames and back yields the original nodes')

Going from nodes to booknames and back yields the original nodes
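
The round trip relies on the book names being unique within a language. The sketch below checks that property with plain dictionaries, using the first three node/name pairs from the Swahili listing above; `node_to_name` and `name_to_node` are illustrative names and not part of the Text-Fabric API.

```python
# The first three book nodes and their Swahili names, as listed earlier.
node_to_name = {
    1367533: 'Mwanzo',
    1367534: 'Kutoka',
    1367535: 'Mambo_ya_Walawi',
}

# Invert the mapping; this is only lossless when the names are unique.
name_to_node = {name: node for (node, name) in node_to_name.items()}

# Going node -> name -> node must give back the node we started from.
round_trip_ok = all(
    name_to_node[node_to_name[node]] == node for node in node_to_name
)
print(round_trip_ok)
```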


Some experiments that get the section that corresponds to a node and vice versa.

In [27]:
for x in (
    T.sectionFromNode(0, lastSlot=True),
    T.nodeFromSection(('Genesis', '1', '1')),
    T.nodeFromSection(('Mwanzo', '1', '1'), lang='sw'),
    T.nodeFromSection(('Genesis',)),
    T.nodeFromSection(('Genesis', '1')),
): print(x)

('Genesis', '1', '1')
1413681
1413681
1367533
1367572


## Sentences spanning multiple verses
If you go up from a sentence node, you expect to find a verse node.
But some sentences span multiple verses, and in that case, you will not find the enclosing
verse node, because it is not there.

Here is a piece of code to detect and list all cases where sentences span multiple verses.

The idea is to pick the first and the last word of a sentence, use `T.sectionFromNode` to
discover the verse in which each of them occurs, and if the two verses differ: bingo!

We show the first 10 of 915 cases.

In [28]:
indent(reset=True)
info('Get sentences that span multiple verses')
spanSentences = []
for s in F.otype.s('sentence'):
    f = T.sectionFromNode(s, lastSlot=False)
    l = T.sectionFromNode(s, lastSlot=True)
    if f != l:
        spanSentences.append('{} {}:{}-{}'.format(f[0], f[1], f[2], l[2]))
info('Found {} cases'.format(len(spanSentences)))
info('\n{}'.format('\n'.join(spanSentences[0:10])))

  0.00s Get sentences that span multiple verses
  4.34s Found 915 cases
  4.34s 
Genesis 1:17-18
Genesis 1:29-30
Genesis 2:4-7
Genesis 7:2-3
Genesis 7:8-9
Genesis 7:13-14
Genesis 9:9-10
Genesis 10:11-12
Genesis 10:13-14
Genesis 10:15-18
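
The same first-word/last-word comparison can be tried on toy data without loading the corpus. This is a sketch under stated assumptions: verses are given by their first slot, sentences by their first and last slot, all numbers are made up, and `verse_of` is a hypothetical helper that stands in for the section lookup.

```python
import bisect

# Toy data. Verse i covers the slots from verse_starts[i] up to
# (but not including) the next start.
verse_starts = [0, 5, 9, 14]
# Each sentence is given as (first slot, last slot).
sentences = [(0, 4), (5, 7), (8, 11), (12, 13), (14, 16)]

def verse_of(slot):
    """Index of the verse that contains the given slot."""
    return bisect.bisect_right(verse_starts, slot) - 1

# A sentence spans verses when its first and last slot fall in different verses.
spanning = [
    (first, last) for (first, last) in sentences
    if verse_of(first) != verse_of(last)
]
print(spanning)
```

Only the sentence that starts in one verse and ends in the next is reported.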
