<img align="right" src="images/tf.png" width="128"/>
<img align="right" src="images/etcbc.png"/>
<img align="right" src="images/logo.png"/>

# Tutorial

This notebook gets you started with using
[Text-Fabric](https://annotation.github.io/text-fabric/) for coding in the Fusus al Hikam by Ibn Arabi, Lakhnawi edition.

Familiarity with the underlying
[data model](https://annotation.github.io/text-fabric/tf/about/datamodel.html)
is recommended.

## Installing Text-Fabric

### Python

You need to have Python on your system. Many systems ship with it out of the box,
but that may be Python 2, and we need at least Python **3.6**.

Install it from [python.org](https://www.python.org) or from
[Anaconda](https://www.anaconda.com/download).

### TF itself

```
pip3 install text-fabric
```

### Jupyter notebook

You need [Jupyter](http://jupyter.org).

If it is not already installed:

```
pip3 install jupyter
```

## Tip
If you start computing with this tutorial, first copy its parent directory to somewhere else,
outside your `fususl` directory.
If you pull changes from the `fususl` repository later, your work will not be overwritten.
Where you put your tutorial directory is up to you.
It will work from any directory.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
import collections

In [3]:
from tf.app import use

## Fususl data

Text-Fabric will fetch a standard set of features for you from the newest GitHub release binaries.

The data will be stored in the `text-fabric-data` directory in your home directory.

# Load Features
The data of the corpus is organized in features.
They are *columns* of data.
Think of the text as a gigantic spreadsheet, where row 1 corresponds to the
first word, row 2 to the second word, and so on, for all 50,000+ words.

The letters of each word are a column `letters` in that spreadsheet.

The corpus contains ca. 20 columns, not only for the words, but also for
textual objects, such as *pieces*, *pages*, and *lines*.

Instead of putting that information in one big table, the data is organized in separate columns.
We call those columns **features**.
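To make the spreadsheet picture concrete, here is a minimal sketch in plain Python that does not use Text-Fabric itself. It models each feature as a mapping from node numbers to values. The sample words, the node numbers, and the helper `value` are made up for illustration.

```python
# Toy model of features as columns: each feature maps node numbers to values.
# The sample data is hypothetical; it is not read from the corpus.
letters = {1: "fiy", 2: "mina", 3: "huwa"}  # a column that is filled for words
otype = {1: "word", 2: "word", 3: "word", 4: "line"}  # a column for all nodes


def value(feature, node):
    """Look up the value of a feature for a node; None if the cell is empty."""
    return feature.get(node)


# A "row" of the spreadsheet is the collection of feature values of one node.
row2 = {"letters": value(letters, 2), "otype": value(otype, 2)}
row4 = {"letters": value(letters, 4), "otype": value(otype, 4)}

print(row2)
print(row4)
```

Node 4 has no value for `letters`: a column only needs to be filled for the rows where it makes sense.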

For the very latest version, use `hot`.

For the latest release, use `latest`.

If you have cloned the repos (TF app and data), use `clone`.

If you do not want/need to upgrade, leave out the checkout specifiers.

In [8]:
A = use("fususl:clone", checkout="clone", hoist=globals())
# A = use('fususl:hot', checkout="hot", hoist=globals())
# A = use('fususl:latest', checkout="latest", hoist=globals())
# A = use('fususl', hoist=globals())

This is Text-Fabric 9.1.1
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

18 features found and 0 ignored


## API

At this point it is helpful to throw a quick glance at the text-fabric API documentation
(see the links under **API Members** above).

The most essential thing for now is that we can use `F` to access the data in the features
we've loaded.
But there is more, such as `N`, which helps us to walk over the text, as we will see in a minute.

# Counting

In order to get acquainted with the data, we start with the simple task of counting.

## Count all nodes
We use the
[`N.walk()` generator](https://annotation.github.io/text-fabric/tf/core/nodes.html#tf.core.nodes.Nodes.walk)
to walk through the nodes.

We compared the corpus to a gigantic spreadsheet, where the rows correspond to the words.
In Text-Fabric, we call the rows `slots`, because they are the textual positions that can be filled with words.

We also mentioned that there are more textual objects, such as the pieces, pages, and lines.
They also correspond to rows in the big spreadsheet.

In Text-Fabric we call all these rows *nodes*, and the `N.walk()` generator
carries us through those nodes in the textual order.

Just one extra thing: the `info` statements generate timed messages.
If you use them instead of `print` you'll get a sense of the amount of time that
the various processing steps typically need.

In [9]:
A.indent(reset=True)
A.info("Counting nodes ...")

i = 0
for n in N.walk():
    i += 1

A.info("{} nodes".format(i))

  0.00s Counting nodes ...
  0.02s 75854 nodes


## What are those nodes?
Every node has a type, like word, line, or sentence.
We know that we have more than 50,000 words and a number of other nodes.
But what exactly are they?

Text-Fabric has two special features, `otype` and `oslots`, that must occur in every Text-Fabric data set.
`otype` tells you for each node its type, and you can ask for the number of `slot`s in the text.

Here we go!

In [10]:
F.otype.slotType

'word'

In [11]:
F.otype.maxSlot

57138

In [12]:
F.otype.maxNode

75854

In [13]:
F.otype.all

('piece', 'page', 'sentence', 'line', 'column', 'span', 'word')

In [14]:
C.levels.data

(('piece', 1731.4545454545455, 67862, 67894),
 ('page', 130.45205479452054, 67424, 67861),
 ('sentence', 20.92200659099231, 67895, 70625),
 ('line', 11.285403910724867, 62361, 67423),
 ('column', 10.941784756798162, 57139, 62360),
 ('span', 10.927137119908204, 70626, 75854),
 ('word', 1, 1, 57138))

This is interesting: above you see all the textual objects, with the average size of their objects,
the node where they start, and the node where they end.
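Because every node type occupies one contiguous range of node numbers, the start and end nodes are enough to derive the number of nodes per type, and the type of any given node. The sketch below works on a copy of the tuples printed above, in plain Python; the helper names `count_nodes` and `type_of` are made up for illustration and are not part of Text-Fabric.

```python
# Copied from the C.levels.data output:
# (node type, average number of slots, first node, last node)
levels = (
    ("piece", 1731.4545454545455, 67862, 67894),
    ("page", 130.45205479452054, 67424, 67861),
    ("sentence", 20.92200659099231, 67895, 70625),
    ("line", 11.285403910724867, 62361, 67423),
    ("column", 10.941784756798162, 57139, 62360),
    ("span", 10.927137119908204, 70626, 75854),
    ("word", 1, 1, 57138),
)


def count_nodes(levels):
    """Number of nodes per type, derived from the first and last node."""
    return {otype: last - first + 1 for (otype, _avg, first, last) in levels}


def type_of(node, levels):
    """Find the type whose node range contains the given node."""
    for (otype, _avg, first, last) in levels:
        if first <= node <= last:
            return otype
    return None


counts = count_nodes(levels)
print(counts["piece"], counts["word"], sum(counts.values()))
print(type_of(63298, levels))
```

The total over all types equals `F.otype.maxNode`, which is a quick sanity check that the ranges cover all nodes without gaps.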

## Count individual object types
This is an intuitive way to count the number of nodes in each type.
Note in passing, how we use the `indent` in conjunction with `info` to produce neat timed
and indented progress messages.

In [15]:
A.indent(reset=True)
A.info("counting objects ...")

for otype in F.otype.all:
    i = 0
    A.indent(level=1, reset=True)

    for n in F.otype.s(otype):
        i += 1

    A.info("{:>7} {}s".format(i, otype))

A.indent(level=0)
A.info("Done")

  0.00s counting objects ...
   |     0.00s      33 pieces
   |     0.00s     438 pages
   |     0.00s    2731 sentences
   |     0.00s    5063 lines
   |     0.00s    5222 columns
   |     0.00s    5229 spans
   |     0.01s   57138 words
  0.02s Done


# Viewing textual objects

We use the A API (the extra power) to peek into the corpus.

Let's inspect some words.

In [17]:
wordShow = (1000, 10000, 50000)
for word in wordShow:
    A.pretty(word, withNodes=True)

# Feature statistics

`F`
gives access to all features.
Every feature has a method
`freqList()`
to generate a frequency list of its values, higher frequencies first.
Here is a top 20 of the words in transliteration (feature `lettersn`):

In [19]:
F.lettersn.freqList()[0:20]

((' ', 3384),
 ('', 1770),
 ('fiy', 1227),
 ('mina', 862),
 ('swrŧ', 385),
 ('āʾilauā', 325),
 ('laā', 316),
 ('maā', 296),
 ('lahu', 285),
 ('ʿalaáã', 282),
 ('ḏãlika', 276),
 ('huwa', 256),
 ('bihi', 249),
 ('kaāna', 248),
 ('hãḏaā', 242),
 ('āllhi', 240),
 ('ʿalayhi', 238),
 ('āllhu', 235),
 ('ʿana', 190),
 ('fy', 174))
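As a rough idea of what such a frequency list amounts to, here is a sketch in plain Python based on `collections.Counter`. The function `freq_list` and the sample values are made up for illustration; breaking ties alphabetically is an assumption of this sketch.

```python
import collections


def freq_list(values):
    """Return (value, frequency) pairs, higher frequencies first.

    Ties are broken alphabetically (an assumption made for this sketch).
    """
    counter = collections.Counter(values)
    return tuple(sorted(counter.items(), key=lambda x: (-x[1], x[0])))


# Hypothetical sample of feature values, one per word
sample = ["fiy", "mina", "fiy", "huwa", "fiy", "mina"]
result = freq_list(sample)
print(result)
```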

## Word distribution

Let's do a bit more fancy word stuff.

### Hapaxes

A hapax can be found by inspecting words and seeing how many occurrences they have.
If that number is one, we have a hapax.

We work with the ASCII transliteration (feature `lettersp`).

We print 10 hapaxes.

In [20]:
A.indent(reset=True)

hapax = []
lexIndex = collections.defaultdict(list)

for n in F.otype.s("word"):
    lexIndex[F.lettersp.v(n)].append(n)

hapax = dict((lex, occs) for (lex, occs) in lexIndex.items() if len(occs) == 1)

A.info("{} hapaxes found".format(len(hapax)))

for h in sorted(hapax)[0:10]:
    print(f"\t{h}")

  0.07s 14243 hapaxes found
	 *ga_aya=ta
	 *galauba
	 *ganayu*nu
	 *gay
	 *gayaru
	 *gayru
	 *gu*sanu*n
	 *h*dr=tu
	 *h*ky*k=tu*n
	 *ha*ka_a'i*ku


If we want more info on a hapax, we get that by means of its *node*.
The `lexIndex` dictionary stores the occurrences of a word form as a list of nodes.

Let's get the Latin transliteration and the Arabic form of those 10 hapaxes.

In [22]:
for h in sorted(hapax)[0:10]:
    node = hapax[h][0]
    print(f"\t{F.lettersn.v(node):<12} {F.letters.v(node)}")

	 ġaāyaŧa      غَايَةَ
	 ġalauba      غَلَّبَ
	 ġanayuⁿu     غَنَيٌّ
	 ġay          غَي
	 ġayaru       غَيْـرُ
	 ġayru        غَيرُ
	 ġuṣanuⁿ      غُصْنٌ
	 ḥḍrŧu        حضرةُ
	 ḥḳyḳŧuⁿ      حقيقةٌ
	 ḥaḳaāʾiḳu    حَقَائِقُ


### Small occurrence base

The occurrence base of a word is the set of pieces in which it occurs.
Let's look for words that occur in a single piece.

Oh yes, we have already found the hapaxes, we will skip them here.

In [27]:
A.indent(reset=True)
A.info("Finding single piece words")

lexPieceIndex = {}

for (lex, occs) in lexIndex.items():
    lexPieceIndex[lex] = set(L.u(n, otype="piece")[0] for n in occs)

singlePieceLex = collections.defaultdict(set)
for (lex, pieces) in lexPieceIndex.items():
    if len(pieces) == 1:
        singlePieceLex[list(pieces)[0]].add(lex)

singlePiece = {piece: len(lexs) for (piece, lexs) in singlePieceLex.items()}

A.info("found {} single piece words".format(sum(singlePiece.values())))

  0.00s Finding single piece words
  0.31s found 15405 single piece words


In [29]:
print(
    "{:<20}{:>5}{:>5}{:>5}\n{}".format(
        "piece",
        "#all",
        "#own",
        "%own",
        "-" * 35,
    )
)
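# assumption: allPiece maps each piece node to the set of distinct word forms in it
allPiece = collections.defaultdict(set)
for (lex, pieces) in lexPieceIndex.items():
    for p in pieces:
        allPiece[p].add(lex)

pieceList = []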

for p in F.otype.s("piece"):
    piece = T.pieceName(p)
    a = len(allPiece[p])
    o = singlePiece.get(p, 0)
    pc = 100 * o / a
    pieceList.append((piece, a, o, pc))

for x in sorted(pieceList, key=lambda e: (-e[3], -e[1], e[0])):
    print("{:<20} {:>4} {:>4} {:>4.1f}%".format(*x))

piece                #all #own %own
-----------------------------------
38                   1005  698 69.5%
37                   1099  745 67.8%
35                   1125  714 63.5%
36                    471  276 58.6%
1                     232  135 58.2%
28                   2492 1289 51.7%
18                   1867  915 49.0%
30                   1924  938 48.8%
3                     265  129 48.7%
5                    1548  713 46.1%
6                    1227  561 45.7%
11                   1773  807 45.5%
27                    974  434 44.6%
21                    577  256 44.4%
14                    602  264 43.9%
20                   1150  504 43.8%
19                   1548  673 43.5%
13                   1410  612 43.4%
17                   1144  488 42.7%
4                    1362  580 42.6%
16                    894  380 42.5%
25                   1138  474 41.7%
9                     949  394 41.5%
24                    673  277 41.2%
22                    858  353 41.1%
2  

# Layer API
We travel upwards and downwards, forwards and backwards through the nodes.
The Layer-API (`L`) provides functions: `u()` for going up, and `d()` for going down,
`n()` for going to next nodes and `p()` for going to previous nodes.

These directions are indirect notions: nodes are just numbers, but by means of the
`oslots` feature they are linked to slots. One node *contains* another node if the one is linked to a set of slots that contains the set of slots that the other is linked to.
And one node is next or previous to another if its slots follow or precede the slots of the other one.

`L.u(node)` **Up** is going to nodes that embed `node`.

`L.d(node)` **Down** is the opposite direction, to those that are contained in `node`.

`L.n(node)` **Next** are the next *adjacent* nodes, i.e. nodes whose first slot comes immediately after the last slot of `node`.

`L.p(node)` **Previous** are the previous *adjacent* nodes, i.e. nodes whose last slot comes immediately before the first slot of `node`.

All these functions yield nodes of all possible otypes.
By passing an optional parameter, you can restrict the results to nodes of that type.

The results are ordered according to the order of things in the text.

The functions always return a tuple, even if there is just one node in the result.
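The slot-based definitions of these directions can be sketched in plain Python, without Text-Fabric. The miniature corpus below is hypothetical, and the functions `up`, `down` and `next_nodes` are simplified stand-ins for `L.u()`, `L.d()` and `L.n()`: they return results in node number order and do not support restricting by type.

```python
# Hypothetical miniature corpus: word slots 1-6, two lines, one page.
MAX_SLOT = 6
oslots = {
    7: {1, 2, 3},  # line 1
    8: {4, 5, 6},  # line 2
    9: {1, 2, 3, 4, 5, 6},  # page
}
ALL_NODES = list(range(1, MAX_SLOT + 1)) + sorted(oslots)


def slots_of(node):
    """A slot is linked to itself; other nodes are linked via oslots."""
    return {node} if node <= MAX_SLOT else oslots[node]


def up(node):
    """Nodes whose slots contain all slots of the given node."""
    mine = slots_of(node)
    return tuple(n for n in sorted(oslots) if n != node and mine <= oslots[n])


def down(node):
    """Nodes whose slots are all contained in the slots of the given node."""
    mine = slots_of(node)
    return tuple(n for n in ALL_NODES if n != node and slots_of(n) <= mine)


def next_nodes(node):
    """Nodes whose first slot comes right after the last slot of the node."""
    after = max(slots_of(node)) + 1
    return tuple(n for n in ALL_NODES if min(slots_of(n)) == after)


print(up(2))
print(down(7))
print(next_nodes(7))
```

Note that all three functions return tuples, also when there is exactly one result, which mirrors the behaviour described for `L`.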

## Going up
We go from the first word to the piece that contains it.
Note the `[0]` at the end. You expect one piece, yet `L` returns a tuple.
To get the only element of that tuple, you need to do that `[0]`.

If you are like me, you keep forgetting it, and that will lead to weird error messages later on.

In [24]:
firstPiece = L.u(1, otype="piece")[0]
print(firstPiece)

67862


# Text API

So far, we have mainly seen nodes and their numbers, and the names of node types.
You would almost forget that we are dealing with text.
So let's try to see some text.

In the same way as `F` gives access to feature data,
`T` gives access to the text.
That is also feature data, but you can tell Text-Fabric which features are specifically
carrying the text, and in return Text-Fabric offers you
a Text API: `T`.

## Formats
Arabic text can be represented in a number of ways:

* in Arabic characters,
* in several kinds of transliteration.

If you wonder where the information about text formats is stored:
not in the program text-fabric, but in the data set.
It has a feature `otext`, which specifies the formats and which features
must be used to produce them. `otext` is the third special feature in a TF data set,
next to `otype` and `oslots`.
It is an optional feature.
If it is absent, there will be no `T` API.
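To give an idea of how a format can be built from features, here is a simplified sketch in plain Python. It assumes that a format is a template with `{feature}` placeholders that is filled in for each word. The template, the feature `after`, and the function `render` are hypothetical; the templates that Text-Fabric accepts in `otext` may be richer than what is handled here.

```python
import re

# Hypothetical feature data for two words
features = {
    "letters": {1: "fiy", 2: "mina"},
    "after": {1: " ", 2: ""},  # made-up feature: material after the word
}


def render(template, node, features):
    """Fill each {feature} placeholder with the value of that feature for node.

    Missing values are rendered as the empty string.
    """

    def fill(match):
        name = match.group(1)
        return str(features[name].get(node, ""))

    return re.sub(r"\{(\w+)\}", fill, template)


template = "{letters}{after}"  # a made-up format definition
text = "".join(render(template, n, features) for n in (1, 2))
print(text)
```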

Here is a list of all available formats in this data set.

In [31]:
sorted(T.formats)

['text-orig-full', 'text-orig-nice', 'text-orig-plain', 'text-orig-trans']

## Using the formats

We can pretty display in other formats:

In [33]:
for word in wordShow:
    A.pretty(word, fmt="text-orig-trans")

Now let's use those formats to print out the first words of the corpus.

In [34]:
for fmt in sorted(T.formats):
    print("{}:\n\t{}".format(fmt, T.text(range(1, 12), fmt=fmt)))

text-orig-full:
	فصوص الحكم تأليف الشيخ الأكبروالكبريت الأحم محيي الدين محمدبن العربي قدسااللهسرّه
text-orig-nice:
	fṣwṣ ālḥkm tālyf ālšyḫ ālākbrwālkbryt ālāḥm mḥyy āldyn mḥmdbn ālʿrby ḳdsāāllhsruh
text-orig-plain:
	f*sw*s _al*hkm t_alyf _al^sy_h _al_akbrw_alkbryt _al_a*hm m*hyy _aldyn m*hmdbn _al`rby *kds_a_allhsruh
text-orig-trans:
	fṣwṣ ālḥkm tālyf ālshykh ālākbrwālkbryt ālāḥm mḥyy āldyn mḥmdbn āl`rby qdsāāllhsrūwh


If we do not specify a format, the **default** format is used (`text-orig-full`).

In [35]:
print(T.text(range(1, 12)))

فصوص الحكم تأليف الشيخ الأكبروالكبريت الأحم محيي الدين محمدبن العربي قدسااللهسرّه


## Whole text in all formats in less than a second
Part of the pleasure of working with computers is that they can crunch massive amounts of data.
The text of this corpus is a piece of cake.

It takes less than a second to have that cake and eat it.

In [36]:
A.indent(reset=True)
A.info("writing plain text of the Fusus in all formats")

text = collections.defaultdict(list)

for v in F.otype.s("page"):
    words = L.d(v, "word")
    for fmt in sorted(T.formats):
        text[fmt].append(T.text(words, fmt=fmt))

A.info("done {} formats".format(len(text)))

for fmt in sorted(text):
    print("{}\n{}\n".format(fmt, "\n".join(text[fmt][0:5])))

  0.00s writing plain text of the Fusus in all formats
  0.47s done 4 formats
text-orig-full
فصوص الحكم تأليف الشيخ الأكبروالكبريت الأحم محيي الدين محمدبن العربي قدسااللهسرّهالمتوفىسنة ٨٣٦هـ قاظامادن أای الویبيروت٤٣٤١هـ–٣١٠٢ميطلب من دار النشر كلاوس شفارتزبرلين 
جميع الحقوق محفوظة الطبعة الأولىٰ ٤١٠٢م
بِسمِ اللهِ الرَّحْمٰـنِ الرَّحِيمِ قال الشيخ الأكبر في ديوانه ترجمان الأشواق:  أَحْبَابُ قَـلْبِي أَيْنَ هُمْ بِٱللهِ قُـولُو أَيْنَ هُــمْ كَمَــا رَأَيْتُ طَـيْفَهُــمْ فَهَلْ تُرِينِي عَيْـنَهُـم فَكَــمْ وَكَــمْ أَطْـلُبُهُــم وَكَــمْ سَـأَلْتُ بَيْنَهُمْ حَـتَّـىٰ أَمِـنْتُ بَيْنَهُــمْ وَمَا أَمِـنْتُ بَيْنَهُـــمْ لَعَـلَّ سَعْـدِي حَـائـلٌ بَيْنَ النَّوَىٰ وَ بَيْنَهُـمْ لِتَـنْعَــمَ العَيْــنُ بِهِــمْ فَـلَا أَقُـولَ أَيْنَ هُــمْ إلىٰ أستاذي عثمان يحيىٰرحمه الله 
الفهرست ١‐نماذج بعض الصفحات من مخطوط قونية ٣٣٩١……………………أ٢‐عنوان كتاب فصوص الحكم وخصوص الكلم………………………٦٣‐خطبة كتاب فصوص الحكم وخصوص الكلم………………………٨٤‐[١]فصّ حكمة إلٰهيّة في كلمة آدميّة.………………………………٤١٥‐[٢]فصّ حكمة نَفْث

### The full plain text
We write a few formats to file, in your `Downloads` folder.

In [38]:
orig = "text-orig-full"
trans = "text-orig-trans"
for fmt in (orig, trans):
    with open(os.path.expanduser(f"~/Downloads/FususL-{fmt}.txt"), "w") as f:
        f.write("\n".join(text[fmt]))

In [39]:
!head -n 20 ~/Downloads/FususL-{orig}.txt

فصوص الحكم تأليف الشيخ الأكبروالكبريت الأحم محيي الدين محمدبن العربي قدسااللهسرّهالمتوفىسنة ٨٣٦هـ قاظامادن أای الویبيروت٤٣٤١هـ–٣١٠٢ميطلب من دار النشر كلاوس شفارتزبرلين 
جميع الحقوق محفوظة الطبعة الأولىٰ ٤١٠٢م
بِسمِ اللهِ الرَّحْمٰـنِ الرَّحِيمِ قال الشيخ الأكبر في ديوانه ترجمان الأشواق:  أَحْبَابُ قَـلْبِي أَيْنَ هُمْ بِٱللهِ قُـولُو أَيْنَ هُــمْ كَمَــا رَأَيْتُ طَـيْفَهُــمْ فَهَلْ تُرِينِي عَيْـنَهُـم فَكَــمْ وَكَــمْ أَطْـلُبُهُــم وَكَــمْ سَـأَلْتُ بَيْنَهُمْ حَـتَّـىٰ أَمِـنْتُ بَيْنَهُــمْ وَمَا أَمِـنْتُ بَيْنَهُـــمْ لَعَـلَّ سَعْـدِي حَـائـلٌ بَيْنَ النَّوَىٰ وَ بَيْنَهُـمْ لِتَـنْعَــمَ العَيْــنُ بِهِــمْ فَـلَا أَقُـولَ أَيْنَ هُــمْ إلىٰ أستاذي عثمان يحيىٰرحمه الله 
الفهرست ١‐نماذج بعض الصفحات من مخطوط قونية ٣٣٩١……………………أ٢‐عنوان كتاب فصوص الحكم وخصوص الكلم………………………٦٣‐خطبة كتاب فصوص الحكم وخصوص الكلم………………………٨٤‐[١]فصّ حكمة إلٰهيّة في كلمة آدميّة.………………………………٤١٥‐[٢]فصّ حكمة نَفْثِيَّة في كلمة شِـيثِيّة……………………………٢٣٦‐[٣]فصّ حكمة سُـبُّوحِيّة في كلمة نوحيّة…………………………٢٥٧‐[٤

## Sections

A section is a piece, a page or a line.
Knowledge of sections is not baked into Text-Fabric.
The config feature `otext.tf` may specify three section levels, and tell
what the corresponding node types and features are.

From that knowledge Text-Fabric can construct mappings from nodes to sections, e.g. from line
nodes to tuples of the form:

    (piece number, page number, line number)

Here are examples of getting the section that corresponds to a node and vice versa.

**NB:** `sectionFromNode` always delivers a line specification, either from the
first slot belonging to that node, or, if `lastSlot` is true, from the last slot
belonging to that node.

In [42]:
for x in (
    ("section of first word", T.sectionFromNode(1)),
    ("node of piece 8, page 88, line 3", T.nodeFromSection((8, 88, 3))),
):
    print("{:<30} {}".format(*x))

section of first word          (1, 1, 1)
node of piece 8, page 88, line 3 63298
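The two directions of this mapping, and the effect of `lastSlot`, can be sketched in plain Python on a hypothetical miniature corpus. The names `section_from_node` and `node_from_section` imitate the Text-Fabric functions, but this is a simplified stand-in, not the library's implementation.

```python
# Hypothetical miniature corpus: word slots 1-6, line nodes 7-9,
# and node 10, a sentence that runs over the first two lines.
spans = {7: (1, 2), 8: (3, 4), 9: (5, 6), 10: (1, 4)}  # first and last slot

# Section tuples (piece, page, line) of the line nodes
line_section = {7: (1, 1, 1), 8: (1, 1, 2), 9: (1, 2, 1)}

# The reverse mapping, from section tuples to line nodes
node_from_section = {section: node for (node, section) in line_section.items()}


def section_from_slot(slot):
    """Section of the line that contains the given slot."""
    for (line, section) in line_section.items():
        (first, last) = spans[line]
        if first <= slot <= last:
            return section
    return None


def section_from_node(node, last_slot=False):
    """Section of the first slot of a node, or of its last slot if asked."""
    (first, last) = spans.get(node, (node, node))  # a slot spans only itself
    return section_from_slot(last if last_slot else first)


print(section_from_node(1))
print(section_from_node(10))
print(section_from_node(10, last_slot=True))
print(node_from_section[(1, 2, 1)])
```

The sentence node 10 starts in line 1 and ends in line 2, so the answer depends on whether the first or the last slot is used.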


# Next steps

By now you have an impression of how to compute around in the text.
While this is still the beginning, I hope you already sense the power of unlimited programmatic access
to all the bits and bytes in the data set.

Here are a few directions for unleashing that power.

## Search
Text-Fabric contains a flexible search engine, that does not only work for this data,
but also for data that you add to it.
There is a tutorial dedicated to [search](search.ipynb).

## Add your own data
If you study the additional data, you can observe how that data is created and also
how it is turned into a text-fabric data module.
The last step is incredibly easy. You can write out every Python dictionary where the keys are numbers
and the values are strings or numbers as a Text-Fabric feature.
When you are creating data, you have already constructed those dictionaries, so writing
them out is just one method call.
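As a sketch of what such a dictionary looks like, the plain Python below derives a new feature from an existing one and checks that it has the required shape. The feature name `wordlen`, the sample data, the metadata, and both helper functions are made up for illustration; the actual call that writes the feature to disk is not shown here, see the Text-Fabric documentation for that.

```python
def make_length_feature(letters):
    """Derive a new feature: the number of characters of each word."""
    return {node: len(value) for (node, value) in letters.items()}


def check_feature(data):
    """True if all keys are node numbers and all values are strings or numbers."""
    return all(
        isinstance(node, int) and isinstance(value, (str, int))
        for (node, value) in data.items()
    )


# Hypothetical sample of an existing feature, keyed by word node
letters = {1: "fiy", 2: "mina", 3: "huwa"}

wordlen = make_length_feature(letters)

# Hypothetical metadata that would accompany the new feature
meta = {
    "wordlen": {
        "valueType": "int",
        "description": "number of characters in the transliterated word",
    }
}

print(wordlen)
print(check_feature(wordlen))
```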

You can then easily share your new features on GitHub, so that your colleagues everywhere
can try them out for themselves.

---

CC-BY Dirk Roorda