# Format of TF files
A `.tf` feature file starts with a *header*, and is followed by the actual data.
The whole file is plain text, encoded as UTF-8.

## Header
The header consists of two portions: *metadata* and *comment*.

### Metadata
A `.tf` feature file always starts with one or more lines of the form

    @key

or

    @key=value

The first line must be either

    @node

or 

    @edge

This tells Text-Fabric whether the data in the feature file is a *node* feature or an *edge* feature.
The rest of the metadata is optional for now, but it is recommended to include a date stamp like this:

    @dateCreated=2016-11-20T13:26:59Z

The time format should be [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601).
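
As a minimal sketch of how the metadata lines could be read, the function below consumes the leading `@key` and `@key=value` lines and checks that the first one is `@node` or `@edge`. The name `parse_metadata` and its return shape are hypothetical and not part of Text-Fabric.

```python
def parse_metadata(lines):
    """Read the leading '@key' / '@key=value' lines of a .tf file.

    Returns (kind, metadata, consumed), where kind is 'node' or 'edge',
    metadata maps the remaining keys to their values ('' if no value),
    and consumed is the number of metadata lines read.
    """
    kind = None
    metadata = {}
    consumed = 0
    for line in lines:
        if not line.startswith("@"):
            break  # first non-metadata line ends the metadata
        consumed += 1
        key, _, value = line[1:].partition("=")
        if kind is None:
            # The first metadata line must declare the feature kind.
            if key not in ("node", "edge") or value:
                raise ValueError("first line must be @node or @edge")
            kind = key
        else:
            metadata[key] = value
    if kind is None:
        raise ValueError("file must start with @node or @edge")
    return kind, metadata, consumed
```

For example, `parse_metadata(["@node", "@dateCreated=2016-11-20T13:26:59Z", ""])` yields the kind `node`, one metadata entry, and a count of two consumed lines.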
    
## Data
After the metadata, there must be exactly one blank line; everything after it is a data line.

The form of a data line is

    node_spec value

for node features, and

    node_spec node_spec value

for edge features.

These fields are separated by single tabs.

### Node Spec
Every line contains a feature value that pertains to all nodes defined by its *node_spec*, or to
all edges defined by its pair of *node_spec*s.

A node spec denotes a *set* of nodes.

The simplest form of a node spec is just a single integer. Examples:

    3
    45
    425000

Ranges are also allowed. Examples:

    1-10
    5-13
    28-57045

The nodes denoted by a range are all numbers between the endpoints of the range, with both endpoints included.
So

    2-4

denotes the nodes `2`, `3`, and `4`.

You can also combine numbers and ranges arbitrarily by separating them with commas. Example:

    1-3,5-10,15,23-37

Such a specification denotes the union of what is denoted by each comma-separated part.

**NB** As node specs denote *sets* of nodes, the following node specs are in fact equivalent

    1,1 and 1
    2-3 and 3,2
    1-5,2-7 and 1-7

We will also be tolerant in that you may specify the end points of ranges in arbitrary order:

    1-3 is the same as 3-1
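
The rules above fit in a few lines of code. This is only a sketch: `parse_node_spec` is a hypothetical helper name, and it assumes the spec is well formed (non-negative integers, commas, and at most one hyphen per part).

```python
def parse_node_spec(spec):
    """Expand a node spec such as '1-3,5-10,15' into a set of integers."""
    nodes = set()
    for part in spec.split(","):
        if "-" in part:
            start, end = (int(x) for x in part.split("-"))
            # Endpoints may come in either order; both are included.
            low, high = min(start, end), max(start, end)
            nodes.update(range(low, high + 1))
        else:
            nodes.add(int(part))
    return nodes
```

Because the result is a set, the equivalences listed above hold automatically: `parse_node_spec("1-5,2-7")` and `parse_node_spec("1-7")` return the same set.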
    
#### Edges
An edge is specified by an *ordered* pair of nodes. The edge is *from* the first node in the pair *to* the second one.
An edge spec consists of two node specs. It denotes all edges that are *from* a node denoted by the first node spec
and *to* a node denoted by the second node spec.
An edge may be labeled; in that case the label of the edge is the *value* after the two node specs.
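
In other words, an edge spec denotes the Cartesian product of its two node sets. A small sketch, assuming the two node specs have already been expanded into sets of integers (`expand_edges` is a hypothetical name):

```python
from itertools import product


def expand_edges(from_nodes, to_nodes, value=""):
    """Map every (from, to) pair of the two node sets to the same label."""
    return {edge: value for edge in product(from_nodes, to_nodes)}
```

With `from_nodes = {0, 1}` and `to_nodes = {1, 2}` this gives four edges, including the edge from `1` to itself.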

### Value

The value is arbitrary text.
We do not distinguish types: all values are taken as UTF-8 strings.
There are a few escapes:

* `\\` backslash
* `\t` tab
* `\n` newline

These characters MUST always be escaped in a value string, otherwise the line as a whole might be ambiguous.
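
A possible way to apply these escapes in code is sketched below. The helper names `escape` and `unescape` are hypothetical, and leaving a backslash followed by any other character untouched is an assumption of this sketch, not something the format prescribes.

```python
import re

# Escape sequence (as written in the file) -> character it stands for.
_ESCAPES = {"\\\\": "\\", "\\t": "\t", "\\n": "\n"}


def unescape(text):
    """Turn the escapes in a stored value back into real characters."""
    # A single left-to-right pass, so '\\\\t' becomes backslash + 't'
    # rather than a tab.
    return re.sub(r"\\[\\tn]", lambda m: _ESCAPES[m.group(0)], text)


def escape(text):
    """Escape backslash, tab and newline so a value fits on one data line."""
    # Backslashes must be doubled first, before new ones are introduced.
    return text.replace("\\", "\\\\").replace("\t", "\\t").replace("\n", "\\n")
```

The two functions are inverses of each other on any string, so `unescape(escape(value)) == value`.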

## Consistency requirements

There are a few additional requirements on feature data.

### Single values
It is assumed that a node feature assigns only one value to each node.
It is an error if the data contains multiple assignments to a node.

Likewise, it is assumed that an edge feature assigns only one value to each edge.
It is an error if the data contains multiple assignments to an edge.

Violations may be reported,
but processing may also continue without warnings,
in which case the last encountered value for each node or edge is chosen.
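
A sketch of this policy: the last assignment wins, and the keys that were assigned more than once are collected so that a caller can decide whether to report them. The function name and return shape are hypothetical.

```python
def assign_values(assignments):
    """Apply (key, value) assignments in order; the last value wins.

    A key is a node for node features, or a (from, to) pair for edge
    features. Returns the final mapping and the list of keys that were
    assigned again after already having a value.
    """
    values = {}
    duplicates = []
    for key, value in assignments:
        if key in values:
            duplicates.append(key)
        values[key] = value
    return values, duplicates
```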

## Optimizations

### Using the implicit node
You may leave out the node spec for node features, and the first node spec for edge features.
In that case, you must also leave out the tab following the node spec.
If you leave it out, the set of nodes denoted is the singleton set consisting of the *implicit node*.
Here are the rules for implicit nodes.

* On a line where there is an explicit node spec, the implicit node is equal to the highest node
  denoted by the explicit node spec.
* On the first line, if there is no explicit node spec, the implicit node is just `0`.
* On all other lines, if there is no explicit node spec, the implicit node is equal to the
  implicit node of the previous line plus 1.

For edges, this optimization only happens for the *first* node spec.
The second node spec must always be explicit.

This optimizes some feature files greatly, e.g. the feature that contains the actual text of each word.

Instead of

    0 be
    1 reshit
    2 bara
    3 elohim
    4 et
    5 ha
    6 shamajim
    7 we
    8 et
    9 ha
    10 arets

you can just say

    be
    reshit
    bara
    elohim
    et
    ha
    shamajim
    we
    et
    ha
    arets
    
This optimization is not obligatory. It is a device that may be used
if you want to optimize the size of data files that you want to distribute.
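
The implicit-node rules for node features can be sketched as follows. The function names are hypothetical, values are kept as raw (still escaped) strings, and a line is taken to have an explicit node spec exactly when it contains a tab.

```python
def parse_node_spec(spec):
    """Expand a node spec such as '1-3,7' into a set of integers."""
    nodes = set()
    for part in spec.split(","):
        ends = [int(x) for x in part.split("-")]
        nodes.update(range(min(ends), max(ends) + 1))
    return nodes


def read_node_feature(lines):
    """Map nodes to raw values, filling in implicit nodes."""
    values = {}
    implicit = -1  # so that a first line without node spec gets node 0
    for line in lines:
        if "\t" in line:
            # Explicit node spec: the implicit node becomes its highest node.
            spec, value = line.split("\t", 1)
            nodes = parse_node_spec(spec)
            implicit = max(nodes)
        else:
            # No node spec: previous implicit node plus 1.
            implicit += 1
            nodes = {implicit}
            value = line
        for node in nodes:
            values[node] = value
    return values
```

Feeding it the eleven bare words of the example above assigns them to nodes `0` to `10`, the same result as the version with explicit node numbers.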


### Omitting empty values

If the value is the empty string, you may also leave out the preceding tab (if there is one).
This is especially good for edge features, because most edges just consist of a node pair without any value.

This optimization causes a conceptual ambiguity if only one field is present, or if only two fields are present in an edge feature: either the (first) node spec or the value could have been left out.
In those cases we will assume that the node spec has been left out for node features, and that the value has been
left out for edge features.

So, in a node feature a line like this

    42

means that the implicit node gets value `42`, and not that node `42` gets the empty value.

Likewise, an edge feature line like this

    42 43
 
means that there is an edge from `42` to `43` with empty value, and not that there is an edge from the implicit node
to `42` with value `43`.

And an edge feature line like this

    42

means that there is an edge from the implicit node to `42` with the empty value, and not that there is an
edge from the implicit node to itself with the value `42`.

The reason for these conventions is practical: edge features usually have empty labels, and there are many edges.
In the case of the ETCBC database, there are 1.5 million edges, so every extra character that is needed on a data line
increases the file size by 1.5 MB.

Nodes on the other hand, usually do not have empty values, and they are often specified in a consecutive way,
especially word nodes. There are quite a few distinct word features, and it would be a waste to have a column of half a million incremental integers in each of those files.
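
The field-count conventions can be summarized in a short sketch. Both function names are hypothetical; `None` marks a node spec that was left out and therefore refers to the implicit node. The sketch only counts tab-separated fields and does not check whether a field is a well-formed node spec.

```python
def split_node_line(line):
    """Return (node_spec, value) for a data line of a node feature."""
    fields = line.split("\t")
    if len(fields) == 1:
        # One field: the node spec was left out, the field is the value.
        return None, fields[0]
    if len(fields) == 2:
        return fields[0], fields[1]
    raise ValueError("a node feature line has at most 2 fields")


def split_edge_line(line):
    """Return (from_spec, to_spec, value) for a data line of an edge feature."""
    fields = line.split("\t")
    if len(fields) == 1:
        # One field: implicit first node, empty value.
        return None, fields[0], ""
    if len(fields) == 2:
        # Two fields: both node specs are present, the value is empty.
        return fields[0], fields[1], ""
    if len(fields) == 3:
        return fields[0], fields[1], fields[2]
    raise ValueError("an edge feature line has at most 3 fields")
```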

## Examples

Here are a few more or less contrived examples of legal feature data lines.

### Node features

1. `\t\n`
1. `1 2\t3`
1. `foo\nbar`
1. `1 Escape \\t as \\\\t`

meaning

1. node 0 has value tab-newline
1. node 1 has value 2 tab 3
1. node 2 has value foo newline bar
1. node 1 gets a new value: the string `Escape \t as \\t`

### Edge features

1. `1 2\tfoo`
1. `1 2 foo`
1. `0-1 1-2 bar`

meaning

1. edge from 0 to 1 with value 2 tab foo
1. edge from 1 to 2 with value foo
1. four edges: 0->1, 0->2, 1->1, 1->2, all with value bar.
   Note that edges can go from a node to itself.
   Note also that two edges get new values here: 0->1 and 1->2.





# Skeleton Features

Certain features should always be present in connection with a TF data source: these are the *skeleton features*.

Here is a specification of the skeleton features.

## otype

A node feature, which maps each node to a label. The label typically is the kind of object that the node represents, 
with typical values

    book
    chapter
    verse
    sentence
    clause
    phrase
    word

There is a special kind of object type, the *monad type*, which is the atomic building block of the text objects.
It is assumed that the complete text is built from a sequence of *monads*, from monad 0 up to the last monad, where
the monads are numbered consecutively.

All other objects are defined with respect to the *monads* they contain.

The monad type does not have to be called `monad` literally. 
If your basic entity is `word`, you may also call it `word`, or anything else.
If your basic entity is not the word, but the character, that is fine too.
The only requirement is that the monads correspond exactly to the first nodes: if there are *n* monads, they are the nodes `0` to *n*-1.
It is also assumed that there is at least one monad in the dataset.

So the `otype` feature will map node `0` to an object type, and this object type is the type of the monads.

We do not have to hard code the monad type in our program, we can find it in the skeleton data by looking at

    otype[0]
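
As an illustration, assuming the `otype` feature has been loaded into a plain dictionary from node to type (a simplification, not the actual Text-Fabric data structure), the monad type and the number of monads could be found like this:

```python
def monad_info(otype):
    """Return (monad_type, number_of_monads) for a node -> type mapping.

    Relies on the requirement that the monads are exactly the first
    nodes, starting at node 0.
    """
    monad_type = otype[0]
    count = 0
    # Count the initial run of nodes that have the monad type.
    while otype.get(count) == monad_type:
        count += 1
    return monad_type, count
```

For a toy mapping with three `word` nodes followed by two `phrase` nodes and a `clause` node, this returns `('word', 3)`.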

## monads

An edge feature, with an edge from each node to each monad it contains.
From this we can compute a nice node ordering, and node embedding relationships.
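
One way to derive embedding relationships from this feature is sketched below. It assumes the edges have been collected into a dictionary from node to its set of monads, and it treats a node as embedded in every other node whose monad set includes all of its monads. Both the data layout and the function names are assumptions of this sketch, not the Text-Fabric implementation.

```python
from collections import defaultdict
from functools import reduce


def invert_monads(monads):
    """From node -> set of monads, build monad -> set of nodes containing it."""
    inverse = defaultdict(set)
    for node, members in monads.items():
        for monad in members:
            inverse[monad].add(node)
    return inverse


def embedders(node, monads, inverse):
    """Nodes (other than `node`) that contain every monad of `node`."""
    members = list(monads[node])
    # Intersect the node sets of all monads that `node` contains.
    common = reduce(lambda acc, m: acc & inverse[m], members[1:], inverse[members[0]])
    return common - {node}
```

Note that under this definition two distinct nodes with identical monad sets embed each other.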

    %load_ext autoreload
    %autoreload 2

    import sys
    sys.path.append('..')
    from tf.fabric import Fabric

    TF = Fabric(locations='../tests')

Output:

      0.00s Looking for available data features:
      0.00s   __otype__            from /Users/dirk/github/text-fabric/notebooks/__otype__.tf
      0.00s   monads               from /Users/dirk/github/text-fabric/tests/monads.tf
      0.00s   monads_meta          from /Users/dirk/github/text-fabric/tests/monads_meta.tf
      0.01s   myOtype              from /Users/dirk/github/text-fabric/tests/myOtype.tf
      0.01s   otype                from /Users/dirk/github/text-fabric/tests/otype.tf
      0.01s   psp                  from /Users/dirk/github/text-fabric/tests/psp.tf
      0.01s   sp                   from /Users/dirk/github/text-fabric/tests/sp.tf
      0.01s 7 features found

Loading a feature:

    API = TF.load('psp')

Output:

       |     0.06s B __order__            from otype, monads, __levels__
       |     0.04s B __rank__             from otype, __order__
       |     1.01s B __levUp__            from otype, monads, __rank__
       |     0.81s B __levDown__          from otype, __levUp__, __rank__
       |     0.03s B otype                from /Users/dirk/github/text-fabric/tests/otype.tf
       |     0.15s B psp                  from /Users/dirk/github/text-fabric/tests/psp.tf
      4.19s All features loaded/computed


    F = API['F']
    L = API['L']

    for n in [0, 1413681]:
        # 'book' nodes found going up from n
        print('From {} up'.format(n))
        print('\n'.join('{:<15} {}'.format(u, F['otype'].v(u)) for u in L.u('book', n)))
        # 'phrase' nodes found going down from n
        print('From {} down'.format(n))
        print('\n'.join('{:<15} {}'.format(u, F['otype'].v(u)) for u in L.d('phrase', n)))

Output:

    From 0 up
    1367533         book
    From 0 down

    From 1413681 up
    1367533         book
    From 1413681 down
    605143          phrase
    605144          phrase
    605145          phrase
    605146          phrase


    import functools

    monadsInv = {1: {1, 2, 3, 4}, 2: {2, 3, 4, 5}, 3: {3, 4, 5, 6}, 4: {4, 5, 6, 7}, 5: {5, 6, 7, 8}}
    mSet = {1, 2, 3}
    mList = list(mSet)
    # Intersect the sets stored under every key in mSet
    functools.reduce(lambda x, y: x & monadsInv[y], mList[1:], monadsInv[mList[0]])

Output:

    {3, 4}

    API.keys()

Output:

    AttributeError: 'NoneType' object has no attribute 'keys'

Another attempt:

    API['F'].keys()

Output:

    dict_keys([])

    API['P']['__levels__'].data

Output:

    [('book', 10937.97435897436),
     ('chapter', 459.1829924650161),
     ('verse', 18.376814715891957),
     ('half_verse', 9.441810535635236),
     ('sentence', 6.710413717162184),
     ('sentence_atom', 6.6302087380904275),
     ('clause', 4.847511363636364),
     ('clause_atom', 4.71037521256156),
     ('phrase', 1.6849321020325942),
     ('phrase_atom', 1.5946059099489749),
     ('subphrase', 1.4241071428571428),
     ('word', 1.0)]

The first entries of `__order__`:

    API['P']['__order__'].data[0:10]

Output:

    array('I', [0, 1367533, 1367572, 1413681, 1125832, 1189402, 426581, 514581, 1368501, 605143])

The first entries of `__rank__`:

    API['P']['__rank__'].data[0:10]

Output:

    array('I', [0, 11, 12, 15, 18, 23, 24, 25, 26, 28])