# Text-Fabric data model
Text with annotations is modeled as a sequence of text positions, starting with 0, without gaps.
These are the *monads*. Text objects:

* are arbitrary compositions of monads. Monads themselves also function as text objects;
* are identified by numbers, starting just after the last monad;
* carry a *type* (just a string label), and all monads carry the same type, the *monad type*;
* can be annotated by *features* (key-value pairs);
* can be linked (directionally, labeled) to other text objects.

This model suffices to represent intricate text with many complicated and structured annotations.

The data in Text-Fabric describes an annotated directed graph with a bit of additional structure.
The correspondence is

* text objects => nodes
* links between text objects => edges
* information associated with text objects  => node features
* labels on links between text objects => edge features
* types of text objects => a special node feature called `otype`
* extent of text objects in terms of textual positions => a special edge feature called `monads`
* together, the `otype` and `monads` features are called the **skeleton** of a Text-Fabric dataset

We represent the elements that make up such a graph as follows:

* nodes are integers, starting with 0, without gaps
* edges are ordered pairs of integers
* values (for nodes and for edges) are strings (Unicode, utf8)
* node features are mappings of integers to strings
* edge features are mappings of pairs of integers to strings
* the `otype` feature maps the integers `0..maxMonad` (inclusive) to the *monad type*,
  where `maxMonad` is the last *monad*, and the integers `maxMonad+1..maxNode` (inclusive)
  to the relevant text object types
* the `monads` feature is an unlabeled edge feature, mapping all non-monad nodes to the set of monads
  corresponding to them. So there is an edge between each non-monad node and each monad "contained" by that node
* a Text-Fabric dataset is a collection of node features and edge features containing at least the
  skeleton features `otype` and `monads`.
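
As a minimal sketch of the representation listed above, the following builds a tiny dataset with plain Python dictionaries. The node numbers, the `sentence` type and the `text` feature are made up for illustration, and storing the unlabeled `monads` edges as a set of pairs is an assumption of this sketch, not a statement about how Text-Fabric keeps data internally.

```python
# Hypothetical toy dataset: three monads (words) and one sentence node.
max_monad = 2   # monads are the nodes 0..2
max_node = 3    # node 3 is the only non-monad node

# Skeleton node feature: node -> object type.
otype = {0: "word", 1: "word", 2: "word", 3: "sentence"}

# Skeleton edge feature: one unlabeled edge from each non-monad node
# to each monad it contains (kept here as a set of ordered pairs).
monads = {(3, 0), (3, 1), (3, 2)}

# An ordinary node feature: node -> string value.
text = {0: "be", 1: "reshit", 2: "bara"}

# Nodes are integers starting with 0, without gaps.
nodes_without_gaps = sorted(otype) == list(range(max_node + 1))
# All monads carry the same type, the monad type.
monad_types = {otype[n] for n in range(max_monad + 1)}
print(nodes_without_gaps, monad_types)
```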

When Text-Fabric works with a dataset, it reads feature data files, and offers an API to process that feature data.
The main task of Text-Fabric is to make processing efficient, so that it can be done in interactive ways,
such as here in a Jupyter notebook. To that end, Text-Fabric

* optimizes feature data after reading it for the first time and stores it in binary form,
  so that it loads fast on subsequent invocations;
* precomputes additional data from the skeleton features in order to provide convenient API functions.

In Text-Fabric, we have various ways of encoding this model:

* as plain text in `.tf` feature files
* as Python data structures in memory
* as compressed serializations of the same data structures inside `.tfx` files

# Format of TF files
A `.tf` feature file starts with a *header*, and is followed by the actual data.
The whole file is plain text, encoded as Unicode UTF-8.

## Header
A `.tf` feature file always starts with one or more metadata lines of the form

    @key

or

    @key=value

The first line must be either

    @node

or 

    @edge

This tells Text-Fabric whether the data in the feature file is a *node* feature or an *edge* feature.
The rest of the metadata is optional for now, but it is recommended to put a date stamp in it like this

    @dateCreated=2016-11-20T13:26:59Z

The time format should be [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601).
    
## Data
After the metadata, there must be exactly one blank line, and everything thereafter is a data line.
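
The header rules can be sketched as a small reader. This is an illustration under the rules stated here, not Text-Fabric's own loader; the function name `parse_header` and its error messages are hypothetical.

```python
def parse_header(lines):
    """Split the lines of a .tf file into (kind, metadata, data_lines)."""
    kind = None
    metadata = {}
    for i, line in enumerate(lines):
        if line == "":
            if kind is None:
                raise ValueError("the file must start with @node or @edge")
            return kind, metadata, lines[i + 1:]  # the rest is data
        if not line.startswith("@"):
            raise ValueError(f"line {i + 1}: expected @key or @key=value")
        key, _, value = line[1:].partition("=")
        if i == 0:
            if key not in ("node", "edge") or value:
                raise ValueError("the first line must be @node or @edge")
            kind = key
        else:
            metadata[key] = value
    raise ValueError("no blank line after the header")


example = [
    "@node",
    "@dateCreated=2016-11-20T13:26:59Z",
    "",
    "be",
    "reshit",
]
kind, metadata, data = parse_header(example)
print(kind, metadata, data)
```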

We continue the description of the `.tf` format.

The form of a data line is

    node_spec value

for node features, and

    node_spec node_spec value

for edge features.

These fields are separated by single tabs.

**NB**: This is the default format. Under *Optimizations* below we shall describe the bits that can be left
out, which will lead to significant improvement in space demands and speed of processing.

### Node Spec
Every line contains a feature value that pertains to all nodes defined by its *node_spec*, or to
all edges defined by its pair of *node_spec*s.

A node spec denotes a *set* of nodes.

The simplest form of a node spec is just a single integer. Examples:

    3
    45
    425000

Ranges are also allowed. Examples

    1-10
    5-13
    28-57045

The nodes denoted by a range are all numbers between the endpoints of the range, endpoints included.
So

    2-4

denotes the nodes `2`, `3`, and `4`.

You can also combine numbers and ranges arbitrarily by separating them with commas. Examples

    1-3,5-10,15,23-37

Such a specification denotes the union of what is denoted by each comma-separated part.

**NB** As node specs denote *sets* of nodes, the following node specs are in fact equivalent

    1,1 and 1
    2-3 and 3,2
    1-5,2-7 and 1-7

Text-Fabric is also tolerant about the order of the end points of a range, so you may specify them either way round:

    1-3 is the same as 3-1
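
A node spec can be expanded into a Python set in a few lines. The sketch below follows the rules given in this section; the function name `parse_node_spec` is hypothetical and the code is not taken from Text-Fabric itself.

```python
def parse_node_spec(spec):
    """Return the set of nodes denoted by a spec such as '1-3,5-10,15'."""
    nodes = set()
    for part in spec.split(","):
        if "-" in part:
            a, b = (int(x) for x in part.split("-"))
            lo, hi = min(a, b), max(a, b)    # end points may come in any order
            nodes.update(range(lo, hi + 1))  # ranges include both end points
        else:
            nodes.add(int(part))
    return nodes


print(sorted(parse_node_spec("1-3,5-10,15")))
```

Because the result is a set, duplicates and overlapping ranges collapse automatically, which is why `1-5,2-7` and `1-7` come out equal.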
    
#### Edges
An edge is specified by an *ordered* pair of nodes. The edge is *from* the first node in the pair *to* the second one.
An edge spec consists of two node specs. It denotes all edges that are *from* a node denoted by the first node spec
and *to* a node denoted by the second node spec.
An edge may be labeled; in that case the label of the edge is specified by the *value* after the two node specs.
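
In other words, an edge spec denotes the Cartesian product of its two node sets. A minimal sketch, assuming the node specs have already been expanded into Python sets (the helper name `expand_edge_spec` is hypothetical):

```python
from itertools import product


def expand_edge_spec(from_nodes, to_nodes, value=""):
    """Map every (from, to) pair denoted by two node sets to the same value."""
    return {edge: value for edge in product(sorted(from_nodes), sorted(to_nodes))}


# The edge spec `0-1 1-2` with value `bar`.
edges = expand_edge_spec({0, 1}, {1, 2}, "bar")
print(edges)
```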

### Value

The value is arbitrary text. 
We do not distinguish types: all values are taken as unicode utf8 strings.
There are a few escapes:

* `\\` backslash
* `\t` tab
* `\n` newline

These characters MUST always be escaped in a value string, otherwise the line as a whole might be ambiguous.
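
The escapes can be applied and undone as sketched below. The helper names `escape` and `unescape` are hypothetical. Unescaping is done in a single left-to-right pass with a regular expression, so that an escaped backslash followed by a `t` is not mistaken for an escaped tab.

```python
import re

# Escape sequence as written in the file -> the character it stands for.
_UNESCAPE = {"\\\\": "\\", "\\t": "\t", "\\n": "\n"}


def unescape(raw):
    """Turn a value as written in a .tf file into the real string."""
    return re.sub(r"\\[\\tn]", lambda m: _UNESCAPE[m.group(0)], raw)


def escape(value):
    """Inverse of unescape; backslashes go first so they are not escaped twice."""
    return value.replace("\\", "\\\\").replace("\t", "\\t").replace("\n", "\\n")


stored = escape("2\t3")
print(stored, unescape(stored) == "2\t3")
```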

## Consistency requirements

There are a few additional requirements on feature data, having to do with the fact that feature data
annotates nodes or edges of a graph.

### Single values
It is assumed that a node feature assigns only one value to each node.
If the data contains multiple assignments to a node, only the last assignment will be honoured, the previous
ones will be discarded.

Likewise, it is assumed that an edge feature assigns only one value to each edge.
If the data contains multiple assignments to an edge, only the last assignment will be honoured.

Violations may be reported, but processing may also continue without warnings.

## Optimizations

It is important to avoid an explosion of redundant data in `.tf` files.
We want the `.tf` format to be suitable for archiving, transparent to the human eye, and easy (fast) to process.

### Using the implicit node
You may leave out the node spec for node features, and the first node spec for edge features.
When leaving out a node spec, you also must leave out the tab following the node spec.

A line with a left-out node spec denotes the singleton node set consisting of the *implicit node*. 
Here are the rules for implicit nodes.

* On a line where there is an explicit node spec, the implicit node is equal to the highest node
  denoted by the explicit node spec;
* On a line without an explicit node spec, the implicit node is determined from the previous line as follows:
  * if there is no previous line, take `0`
  * else take the implicit node of the previous line and increment by 1

For edges, this optimization only happens for the *first* node spec.
The second node spec must always be explicit.

This optimizes some feature files greatly, e.g. the feature that contains the actual text of each word.

Instead of

    0 be
    1 reshit
    2 bara
    3 elohim
    4 et
    5 ha
    6 shamajim
    7 we
    8 et
    9 ha
    10 arets

you can just say

    be
    reshit
    bara
    elohim
    et
    ha
    shamajim
    we
    et
    ha
    arets
    
This optimization is not obligatory. It is a device that may be used
if you want to optimize the size of data files that you want to distribute.
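
The implicit node rules for node features can be sketched as follows. This is an illustration under the rules above, not Text-Fabric's actual reader. The names `nodes_of` and `read_node_feature` are hypothetical, values are kept exactly as written (escapes are not resolved), and later assignments override earlier ones.

```python
def nodes_of(spec):
    """Expand a node spec such as '1-3,7' into a set of integers."""
    result = set()
    for part in spec.split(","):
        ends = [int(x) for x in part.split("-")]
        result.update(range(min(ends), max(ends) + 1))
    return result


def read_node_feature(data_lines):
    """Parse node feature data lines (tab-separated fields) into {node: value}."""
    values = {}
    implicit = -1  # so that a first line without node spec lands on node 0
    for line in data_lines:
        spec, tab, value = line.partition("\t")
        if tab:
            nodes = nodes_of(spec)
            implicit = max(nodes)  # highest node of the explicit spec
        else:
            value = spec           # a single field is the value
            implicit += 1
            nodes = {implicit}
        for n in nodes:
            values[n] = value      # the last assignment wins
    return values


feature = read_node_feature(["be", "reshit", "5\tha", "shamajim"])
print(feature)
```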

### Omitting empty values

If the value is the empty string, you may also leave out the preceding tab (if there is one).
This is especially good for edge features, because most edges just consist of a node pair without any value.

This optimization will cause a conceptual ambiguity if there is only one field present in a node feature,
or if there are only two fields in an edge feature. 
It could mean that the (first) node spec has been left out, or that the value has been left out.

In those cases we will assume that the node spec has been left out for node features, and that the value has been
left out for edge features.

So, in a node feature a line like this

    42

means that the implicit node gets value `42`, and not that node `42` gets the empty value.

Likewise, an edge feature line like this

    42 43
 
means that there is an edge from `42` to `43` with empty value, and not that there is an edge from the implicit node
to `42` with value `43`.

And an edge feature line like this

    42

means that there is an edge from the implicit node to `42` with the empty value, and not that there is an
edge from the implicit node to itself with the value `42`.

The reason for these conventions is practical: edge features usually have empty labels, and there are many edges.
In case of the ETCBC database, there are 1.5 million edges, so every extra character that is needed on a data line
means that the file size increases by 1.5 MB.

Nodes, on the other hand, usually do not have empty values, and they are often specified in a consecutive way,
especially word nodes. There are quite a few distinct word features, and it would be a waste to have a column of half a million incremental integers in those files.
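
For edge features, the conventions above amount to interpreting a data line by its number of tab-separated fields. A minimal sketch (the function name `split_edge_line` is hypothetical; `None` stands for a left-out first node spec, to be filled in with the implicit node):

```python
def split_edge_line(line):
    """Return (from_spec, to_spec, value) for one edge feature data line."""
    fields = line.split("\t", 2)
    if len(fields) == 1:
        return None, fields[0], ""       # implicit from-node, empty value
    if len(fields) == 2:
        return fields[0], fields[1], ""  # both node specs, value left out
    return tuple(fields)                 # both node specs and a value


for line in ["42", "42\t43", "1\t2\tfoo"]:
    print(split_edge_line(line))
```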

## Examples

Here are a few more or less contrived examples of legal feature data lines.

### Node features

1. `\t\n`
1. `1 2\t3`
1. `foo\nbar`
1. `1 Escape \\t as \\\\t`

meaning

1. node 0 has value tab-newline
1. node 1 has value 2 tab 3
1. node 2 has value foo newline bar
1. node 1 gets a new value: the string `Escape \t as \\t`

### Edge features

1. `1`
1. `1 2`
1. `1 2 foo`
1. `0-1 1-2 bar`

meaning

1. edge from 0 to 1 with no value
1. edge from 1 to 2 with no value
1. edge from 1 to 2 with value foo
1. four edges: 0->1, 0->2, 1->1, 1->2, all with value bar.
   Note that edges can go from a node to itself.
   Note also that two edges get new values here: 0->1 and 1->2.

# Skeleton Features

Remember that the node feature `otype` and the edge feature `monads` constitute the *skeleton* of a Text-Fabric dataset.

## skeleton: otype

A node feature, which maps each node to a label. The label typically is the kind of object that the node represents, 
with typical values

    book
    chapter
    verse
    sentence
    clause
    phrase
    word

There is a special kind of object type, the *monad type*, which is the atomic building block of the text objects.
It is assumed that the complete text is built from a sequence of *monads*, from monad 0 till the last monad, where 
the monads are numbered consecutively. There must be at least one monad.

All other objects are defined with respect to the *monads* they contain.

The monad type does not have to be called `monad` literally. 
If your basic entity is `word`, you may also call it `word`, or anything else.
If your basic entity is not the word, but the character, that is fine too.
The only requirement is that all monads correspond exactly with the first so many nodes.
It is also assumed that there is at least one monad in the dataset.

So the `otype` feature will map node `0` to an object type, and this object type is the type of the monads.

We do not have to hard-code the monad type in our program; we can find it in the skeleton data by looking at

    otype[0]

since there is always at least one monad.
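
As a sketch of this lookup, assume `otype` is available as a plain dictionary from node to type. The data below is made up, and `monad_info` is a hypothetical helper, not part of the Text-Fabric API. Since the monads are exactly the first so many nodes, the last monad is found by walking up from node `0` while the type stays the same.

```python
def monad_info(otype):
    """Return (monad_type, max_monad) for a node -> type mapping."""
    monad_type = otype[0]  # there is always at least one monad
    max_monad = 0
    while otype.get(max_monad + 1) == monad_type:
        max_monad += 1
    return monad_type, max_monad


# Hypothetical toy data: three words, a phrase and a sentence.
otype = {0: "word", 1: "word", 2: "word", 3: "phrase", 4: "sentence"}
print(monad_info(otype))
```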

## skeleton: monads

An edge feature, with an edge from each node to each monad it contains.
From this we can compute a nice node ordering, and node embedding relationships.
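
One way to derive embedding from the `monads` feature is to compare monad sets. The sketch below assumes that a node embeds another node when it contains all monads of that other node. This is an illustration only: the function name `embedders` and the toy data are hypothetical, and it is not the precomputation that Text-Fabric performs.

```python
def embedders(node, monads):
    """Nodes whose monad set contains all monads of `node`."""
    own = monads.get(node, {node})  # a monad consists of just itself
    return sorted(n for n, ms in monads.items() if n != node and own <= ms)


# Hypothetical toy data: non-monad node -> set of monads it contains.
monads = {10: {0, 1, 2, 3}, 11: {0, 1}, 12: {2, 3}}
print(embedders(0, monads), embedders(12, monads))
```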

# API

## Importing and calling Text-Fabric

    from tf.fabric import Fabric
    TF = Fabric(locations=directories)

Here `directories` is a single directory name as a string, or an iterable of directory names.
These directories will be searched for `.tf` files (non-recursively), and an index of features will be made.
If a `.tf` file name occurs in multiple directories, the last one encountered will be used.

The locations list is prepended with a few standard directories, namely

    ~/Downloads
    ~
    ~/text-fabric-data
    .
    
in that order. 
So if you have stored your main Text-Fabric dataset in `text-fabric-data` in your home directory,
you do not have to pass a location to Fabric.
If you want to add features from outside this dataset, you can add the directories that contain them
to the `locations` parameter.
In this way you can easily override certain features in the main dataset by your own features.

## Loading features

    T = TF.load(features)

where `features` is a string containing space separated feature names, or an iterable of feature names.
The feature names are just the file names without directory information and without extension.

## Sorting nodes

    T.sorted(nodeset)
    
delivers `nodeset` as a tuple sorted by the canonical ordering.
Briefly:

* embedders come before embeddees;
* earlier stuff comes before later stuff;
* if a verse coincides with a sentence, the verse comes before the sentence, because verses generally
  contain sentences and not the other way round;
* if two objects intersect, but neither embeds the other, the one with the smallest monad that does not occur
  in the other comes first.
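
The rules above can be sketched as a comparison function over monad sets. This is an illustration under stated assumptions, not the implementation behind `T.sorted`: every node (monads included) is assumed to have a monad set, and `rank` is a hypothetical mapping that gives container-like types a lower number.

```python
from functools import cmp_to_key


def canonical_key(monads, rank):
    """Build a sort key from node -> monad set and node -> type rank."""
    def compare(a, b):
        ma, mb = monads[a], monads[b]
        if ma == mb:
            return rank[a] - rank[b]  # same extent: lower rank first
        if mb < ma:
            return -1                 # a embeds b, so a comes first
        if ma < mb:
            return 1                  # b embeds a, so b comes first
        # Neither embeds the other: compare the smallest monads that
        # occur in one node but not in the other.
        return -1 if min(ma - mb) < min(mb - ma) else 1
    return cmp_to_key(compare)


# Hypothetical toy data: four words, two phrases, a sentence and a verse.
monads = {0: {0}, 1: {1}, 2: {2}, 3: {3},
          4: {0, 1}, 5: {2, 3}, 6: {0, 1, 2, 3}, 7: {0, 1, 2, 3}}
otype = {0: "word", 1: "word", 2: "word", 3: "word",
         4: "phrase", 5: "phrase", 6: "sentence", 7: "verse"}
type_rank = {"verse": 0, "sentence": 1, "phrase": 2, "word": 3}
rank = {n: type_rank[t] for n, t in otype.items()}

ordered = sorted(monads, key=canonical_key(monads, rank))
print(ordered)
```

With this toy data the verse comes first, then the coinciding sentence, and each phrase comes directly before its words.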
  
## Walking through nodes

    for n in T.N():
        action

A generator that walks through all nodes in the canonical order.

**NB**: Later, under *Features* there is another convenient way to walk through nodes.

## Node features

    T.Fall()

Returns a sorted list of all usable, loaded feature names.

    T.F.feature.v(node)
    
Get the value of a `feature` for `node`.
The feature name can be used unquoted if it is a valid python identifier.
If not, you should say:

    T.Fs('feature').v(node)

This works for all feature names, so you can call features programmatically.

    T.F.feature.s(value)
    
This is the other way to walk through nodes: it returns a generator of all nodes in the canonical order
that have the value `value` for the feature `feature`.

### Skeleton feature otype

`otype` is a special node feature and has additional capabilities.

* `T.F.otype.monadType` is the node type of the monads (usually: `word`)
* `T.F.otype.maxMonad` is the largest monad number
* `T.F.otype.maxNode` is the largest node number

## Edge features

    T.Eall()

Returns a sorted list of all usable, loaded edge feature names.

    T.E.feature.f(node)
    
Get the nodes reached by edges **from** `node`.
The result is an ordered tuple (again, in the canonical node ordering).
The members of the result are just nodes, if `feature` describes edges without labels.
Otherwise the members are pairs (tuples) of a node and a value.

    T.E.feature.t(node)

Same as `.f`, but now you get the nodes that have edges **to** `node`.

Again, `T.Es('feature')` is the same as `T.E.feature`, but works also if `feature` is not a valid python
identifier.

### Skeleton feature monads

`monads` is a special edge feature and is mainly used to construct other parts of the API.
It has fewer capabilities, and you will rarely need it.
It does not have `.f` and `.t` methods, but a `.m` method instead.

    T.E.monads.m(node)
    
Gives the sorted list of monad numbers linked to `node`.

## Layers

Here are the methods by which you can navigate easily from a node to its embedders and embeddees.

    T.L.u(node, otype=nodetype)
    
Produces an ordered tuple of nodes **upward** from `node`, i.e. embedder nodes of `node`.

    T.L.d(node, otype=nodetype)
    
Produces an ordered tuple of nodes **downward** from `node`, i.e. embedded nodes of `node`.

In both the `.u` and `.d` methods, if the `otype` parameter is present, the result is filtered
and only nodes with `otype=nodetype` are retained.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys
sys.path.append('..')
from tf.fabric import Fabric

In [3]:
TF = Fabric(locations='~/tf/text-fabric-data')

  0.00s Looking for available data features:
  0.00s   __otype__            from /Users/dirk/github/text-fabric/notebooks/__otype__.tf
  0.00s   monads               from /Users/dirk/tf/text-fabric-data/monads.tf
  0.01s   otype                from /Users/dirk/tf/text-fabric-data/otype.tf
  0.01s   sp                   from /Users/dirk/tf/text-fabric-data/sp.tf
  0.01s 4 features found


In [4]:
T = TF.load('sp')

   |     0.00s B __levels__           from otype, monads
   |     0.07s B __order__            from otype, monads, __levels__
   |     0.05s B __rank__             from otype, __order__
   |     1.01s B __levUp__            from otype, monads, __rank__
   |     0.81s B __levDown__          from otype, __levUp__, __rank__
   |     0.04s B otype                from /Users/dirk/tf/text-fabric-data/otype.tf
   |     0.54s B monads               from /Users/dirk/tf/text-fabric-data/monads.tf
   |     0.15s B sp                   from /Users/dirk/tf/text-fabric-data/sp.tf
  4.56s All features loaded/computed


In [5]:
print('monadType={}\nmaxMonad={}\nmaxNode={}'.format(
    T.F.otype.monadType,
    T.F.otype.maxMonad,
    T.F.otype.maxNode,
))
print('All otypes:\n\t{}'.format('\n\t'.join(T.F.otype.all)))

monadType=word
maxMonad=426580
maxNode=1436894
All otypes:
	book
	chapter
	verse
	half_verse
	sentence
	sentence_atom
	clause
	clause_atom
	phrase
	phrase_atom
	subphrase
	word


In [6]:
T.zero()
T.info('Counting nodes ...\n')
i = 0
for n in T.N(): i += 1
T.info('{} nodes'.format(i))

  0.00s Counting nodes ...
  0.32s 1436894 nodes

In [7]:
T.sorted(list(range(T.F.otype.maxMonad+1, T.F.otype.maxMonad+10))+list(range(10)))

[426581,
 0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 426582,
 426583,
 426584,
 426585,
 426586,
 426587,
 426588,
 426589]

In [8]:
T.zero()
T.info('counting objects ...\n')
for otype in T.F.otype.all:
    i = 0
    for n in T.F.otype.s(otype): i+=1
    T.info('{:>7} {}s\n'.format(i, otype))

  0.00s counting objects ...
  0.13s      39 books
  0.25s     929 chapters
  0.38s   23213 verses
  0.52s   45180 half_verses
  0.67s   63570 sentences
  0.82s   64339 sentence_atoms
  0.98s   88000 clauses
  1.14s   90562 clause_atoms
  1.38s  253174 phrases
  1.63s  267515 phrase_atoms
  1.81s  113792 subphrases
  1.87s  426581 words


In [9]:
T.L.u(0, otype='book')
T.L.d(1367533, otype='chapter')

(1367572,
 1367573,
 1367574,
 1367575,
 1367576,
 1367577,
 1367578,
 1367579,
 1367580,
 1367581,
 1367582,
 1367583,
 1367584,
 1367585,
 1367586,
 1367587,
 1367588,
 1367589,
 1367590,
 1367591,
 1367592,
 1367593,
 1367594,
 1367595,
 1367596,
 1367597,
 1367598,
 1367599,
 1367600,
 1367601,
 1367602,
 1367603,
 1367604,
 1367605,
 1367606,
 1367607,
 1367608,
 1367609,
 1367610,
 1367611,
 1367612,
 1367613,
 1367614,
 1367615,
 1367616,
 1367617,
 1367618,
 1367619,
 1367620,
 1367621)

In [10]:
for n in [0, 1413681]:
    print('From {} up'.format(n))
    print('\n'.join(['{:<15} {}'.format(u, T.F.otype.v(u)) for u in T.L.u(n)]))
    print('From {} down'.format(n))
    print('\n'.join(['{:<15} {}'.format(u, T.F.otype.v(u)) for u in T.L.d(n)]))

From 0 up
605143          phrase
858317          phrase_atom
1368501         half_verse
514581          clause_atom
426581          clause
1413681         verse
1189402         sentence_atom
1125832         sentence
1367572         chapter
1367533         book
From 0 down

From 1413681 up
1189402         sentence_atom
1125832         sentence
1367572         chapter
1367533         book
From 1413681 down
426581          clause
514581          clause_atom
1368501         half_verse
858317          phrase_atom
605143          phrase
0               word
1               word
858318          phrase_atom
2               word
605144          phrase
858319          phrase_atom
3               word
605145          phrase
858320          phrase_atom
1368502         half_verse
605146          phrase
1253741         subphrase
4               word
5               word
6               word
7               word
1253742         subphrase
8               word
9               word
10              word
